This command lists the reasons why a job is not dispatchable in principle. For
this purpose, a dry scheduling run is performed. All consumable resources, as well
as all slots, are considered to be fully available for this job. Similarly, all load
values are ignored because these values vary.
Job or Queue Reported in Error State E
Job or queue errors are indicated by an uppercase E in the qstat output.
A job enters the error state when the grid engine system tries to run a job but fails
for a reason that is specific to the job.
A queue enters the error state when the grid engine system tries to run a job but fails
for a reason that is specific to the queue.
The grid engine system offers a set of possibilities for users and administrators to
gather diagnosis information in case of job execution errors. Both the queue and the
job error states result from a failed job execution. Therefore the diagnosis possibilities
are applicable to both types of error states. User abort mail. If jobs are submitted
with the qsub -m a command, abort mail is sent to the address specified
with the -M user[@host] option. The abort mail contains diagnosis information about job errors.
Abort mail is the recommended source of information for users.
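For example, a submission that requests abort mail might look like the following; the mail address and the script name are placeholders:
% qsub -m a -M jdoe@example.com myjob.sh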
qacct accounting. If no abort mail is available, the user can run the qacct -j command. This command gets information about the job error from
the grid engine system's job accounting function.
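For example, to query the accounting record of a job with ID 490 (a placeholder ID):
% qacct -j 490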
Administrator abort mail. An administrator
can order administrator mails about job execution problems by specifying an appropriate
email address. See the administrator_mail parameter on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information
than user abort mail. Administrator mail is the recommended method in case of frequent
job execution errors.
Messages files. If no administrator
mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain
job by searching for the appropriate job ID. In the default installation, the sge_qmaster messages file is sge-root/cell/spool/qmaster/messages.
You can sometimes find additional information in the messages of the sge_execd daemon from which the job was started. Use qacct -j job-id to discover the host from which the job was started,
and search in sge-root/cell/spool/host/messages for the job ID.
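A sketch of such a search follows, assuming the default cell name default and a job ID of 490; exechost stands for the execution host name reported by qacct:
% qacct -j 490 | grep hostname
% grep 490 $SGE_ROOT/default/spool/qmaster/messages
% grep 490 $SGE_ROOT/default/spool/exechost/messages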
Troubleshooting Common Problems
This section provides information to help you diagnose and respond to the cause
of common problems. Problem -- The output file
for your job says, Warning: no access to tty; thus no job control
in this shell....
Possible cause -- One or more
of your login files contain an stty command. These commands are
useful only if a terminal is present.
Possible solution -- No terminal
is associated with batch jobs. You must remove all stty commands
from your login files, or you must bracket such commands with an if statement.
The if statement should check for a terminal before processing.
The following example shows an if statement:
/bin/csh:
stty -g                    # checks terminal status
if ($status == 0) then     # succeeds if a terminal is present
    <put all stty commands in here>
endif
Problem -- The job standard
error log file says `tty`: Ambiguous. However, no
reference to tty exists in the user's shell that is called in the
job script.
Possible cause -- shell_start_mode is, by default, posix_compliant. Therefore
all job scripts run with the shell that is specified in the queue definition. The
scripts do not run with the shell that is specified on the first line of the job script.
Possible solution -- Use the -S flag to the qsub command, or change shell_start_mode to unix_behavior.
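For example, to run the job script under csh regardless of the shell configured for the queue (the script name is a placeholder):
% qsub -S /bin/csh myjob.sh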
Problem -- You can run your
job script from the command line, but the job script fails when you run it using the qsub command.
Possible cause -- Process
limits might be set for your job. To test whether limits are being set, write
a test script that runs the limit and limit -h functions.
Run the script both interactively at the shell prompt and through the qsub command, and compare the results.
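A minimal test script of this kind might look like the following (csh syntax assumed):
#!/bin/csh
# print the soft and hard resource limits in effect
limit
limit -h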
Possible solution -- Remove
any commands in configuration files that set limits in your shell.
Problem -- Execution hosts
report a load of 99.99.
Possible cause -- The sge_execd daemon is not running on the host.
Possible solution -- As root, start up the sge_execd daemon
on the execution host by running the sge-root/cell/common/sgeexecd script.
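For example, assuming the default cell name default, where sge-root stands for your installation directory:
# sge-root/default/common/sgeexecd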
Possible cause -- A default
domain is incorrectly specified.
Possible solution -- As the grid engine system administrator, run the qconf -mconf command and change the default_domain variable to none.
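In the editor session opened by qconf -mconf, the relevant entry would then read as follows (surrounding configuration lines omitted):
% qconf -mconf
...
default_domain               none
...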
Possible cause -- The sge_qmaster host sees the name of the execution host as different from the
name that the execution host sees for itself.
Possible solution -- If you are using DNS to resolve the host names
of your compute cluster, configure /etc/hosts and NIS to return
the fully qualified domain name (FQDN) as the primary host name. Of course, you can
still define and use the short alias name, for example, 168.0.0.1 myhost.dom.com
myhost.
If you are not using DNS, make
sure that all of your /etc/hosts files and your NIS table are
consistent, for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost.
Problem -- Every 30 seconds
a warning that is similar to the following message is printed to cell/spool/host/messages:
Tue Jan 23 21:20:46 2001|execd|meta|W|local
configuration meta not defined - using global configuration
But cell/common/local_conf contains
a file for each host, with FQDN.
Possible cause -- Host name
resolution on your machine meta returns the short name, but on your
master machine, the FQDN of meta is returned.
Possible solution -- Make
sure that all of your /etc/hosts files and your NIS table are
consistent in this respect. In this example, a line such as the following text could
erroneously be included in the /etc/hosts file of the host meta:
168.0.0.1 meta meta.your.domain
The line should instead be:
168.0.0.1 meta.your.domain meta
Problem -- Occasionally you
see CHECKSUM ERROR, WRITE ERROR, or READ ERROR messages in the messages files of the daemons.
Problem -- Jobs finish on
a particular queue and return the following message in qmaster/messages:
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1
finished on host exechost
Then you see the following error messages in the execution host's exechost/messages file:
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory
"active_jobs/490.1" for reaping job 490.1
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory
"active_jobs/490.1": opendir(active_jobs/490.1) failed:
Input/output error
Possible cause -- The sge-root directory, which is automounted, is being unmounted, causing
the sge_execd daemon to lose its current working directory.
Possible solution -- Use a
local spool directory for your sge_execd host. Set the parameter execd_spool_dir, using QMON or the qconf command.
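For example, to set a host-local spool directory for the execution host exechost (the host name and path are placeholders):
% qconf -mconf exechost
...
execd_spool_dir              /var/spool/sge
...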
Problem -- When submitting
interactive jobs with the qrsh utility, you get the following error
message:
% qrsh -l mem_free=1G
error: error: no suitable queues
However, queues are available for submitting batch jobs with the qsub command. These queues can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.
Possible cause -- The message error: no suitable queues results from the -w e submit
option, which is active by default for interactive jobs such as qrsh.
Look for -w e on the qrsh(1) man page. This
option causes the submit command to fail if the sge_qmaster does
not know for sure that the job is dispatchable according to the current cluster configuration.
The intention of this mechanism is to decline job requests in advance, in case the
requests can't be granted.
Possible solution -- In this
case, mem_free is configured to be a consumable resource, but you
have not specified the amount of memory that is to be available at each host. The
memory load values are deliberately not considered for this check because memory load
values vary. Thus they can't be seen as part of the cluster configuration. You can
do one of the following:
Disable this check entirely by explicitly
overriding the qrsh default option -w e with
the -w n option. You can also put this option into sge-root/cell/common/sge_request.
If you intend to manage mem_free as a consumable resource, specify the mem_free capacity
for your hosts in the complex_values entry of host_conf by
using qconf -me hostname, as illustrated in the example after this list.
If you don't intend to manage mem_free as a consumable resource, make it a nonconsumable resource again
in the consumable column of complex(5) by using qconf -mc hostname.
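The following lines sketch the first two approaches; the host name host1 and the 4G capacity are placeholder values:
% qrsh -w n -l mem_free=1G hostname
% qconf -me host1        # in the editor, set for example: complex_values mem_free=4G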
Problem -- qrsh won't dispatch to the same node it is on. From a qsh shell
you get a message such as the following:
host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed:
host2 [50]% qrsh -inherit host4 hostname
host4
Possible cause -- gid_range is not sufficient. gid_range should be defined
as a range, not as a single number. The grid engine system assigns each job on a host a distinct gid.
Possible solution -- Adjust
the gid_range with the qconf -mconf command
or with QMON. The suggested range is as follows:
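A sketch of the corresponding change follows; the range shown is a placeholder for illustration, not the document's recommended value:
% qconf -mconf
...
gid_range                    20000-20100
...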
Problem -- qrsh -inherit -V does not work when used inside a parallel job. You get
the following message:
cannot get connection to "qlogin_starter"
Possible cause -- This problem
occurs with nested qrsh calls. The problem is caused by the -V option. The first qrsh -inherit call sets the environment
variable TASK_ID. TASK_ID is the ID of the tightly
integrated task within the parallel job. The second qrsh -inherit call
uses this environment variable for registering its task. The command fails as it tries
to start a task with the same ID as the already-running first task.
Possible solution -- You can
either unset TASK_ID before calling qrsh -inherit,
or use the -v option instead of -V. This option
exports only the environment variables that you really need.
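For example (csh syntax; mycommand is a placeholder for the task to be started):
% unsetenv TASK_ID
% qrsh -inherit host2 mycommand
Alternatively, export only the variables you need, for example qrsh -inherit -v MYVAR host2 mycommand, where MYVAR is a placeholder.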
Problem -- qrsh does not seem to work at all. Messages like the following are generated:
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session
to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect:
Interrupted system call
error: error reading return code of remote command
cleaning up after abnormal exit of
/share/gridware/utilbin/solaris64/rsh
host2$
Possible cause -- Permissions
for qrsh are not set properly.
Possible solution -- Check
the permissions of the following files, which are located in sge-root/utilbin/. Note that rlogin and rsh must be setuid and owned by root.
-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*
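If the permissions are wrong, they can be restored as root, for example as follows; the solaris64 architecture subdirectory is an example, so adjust it for your platform:
# cd sge-root/utilbin/solaris64
# chown root rlogin rsh
# chmod 4511 rlogin rsh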
Note - The sge-root directory also needs to be NFS-mounted
with the setuid option. If sge-root is
mounted with nosuid from your submit client, qrsh and
associated commands will not work.
Problem -- When
you try to start a distributed make, qmake exits with the following
error message:
qrsh_starter: executing child process
qmake failed: No such file or directory
Possible cause -- The grid engine system starts
an instance of qmake on the execution host. If the grid engine system environment,
especially the PATH variable, is not set up in the user's shell
resource file (.profile or .cshrc), this qmake call fails.
Possible solution -- Use the -v option to export the PATH environment variable to
the qmake job. A typical qmake call is as follows:
qmake -v PATH -cwd -pe make 2-10 --
Problem -- When
using the qmake utility, you get the following error message:
waiting for interactive job to be scheduled ...timeout (4 s)
expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.
Possible cause -- The ARCH environment variable might be set incorrectly in the shell from which qmake was called.
Possible solution -- Set the ARCH variable correctly to a supported value that matches an available host
in your cluster, or else specify the correct value at submit time, for example, qmake -v ARCH=solaris64 ...