Chapter 6 Error Messages and Troubleshooting
This chapter describes the error messaging procedures of the grid engine system and
offers tips on how to resolve various common problems.
How the Software Retrieves Error Reports
The grid engine software reports errors and warnings by logging messages into certain files, by sending email, or both.
The log files include message files and job STDERR output.
As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file.
Unless you specify otherwise, the default file name and location are used. You can specify a different file name and location
with certain options of the qsub command. See the grid engine system man
pages for detailed information.
Separate messages files exist for the sge_qmaster, the sge_schedd, and the sge_execds.
The files have the same file name: messages.
The sge_qmaster log file resides in the master spool directory.
The sge_schedd message file resides in the scheduler spool directory.
The execution daemons' log files reside in the spool directories of the execution daemons.
See "Spool Directories Under the Root Directory" in N1 Grid Engine 6 Installation Guide for more information
about the spool directories.
Each message takes up a single line in the files. Each message is subdivided
into five components separated by the vertical bar sign (|).
The components of a message are as follows:
The first component is a time stamp for the message.
The second component specifies the grid engine system daemon that generates the message.
The third component is the name of the host where the daemon runs.
The fourth component is a message type. The message type is one of the following:
N for notice - for informational purposes
I for info - for informational purposes
W for warning
E for error - an error condition has been detected
C for critical - can lead to a program abort
Use the loglevel parameter in the cluster configuration to
specify on a global basis or a local basis what message types you want to log.
The fifth component is the message text.
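The five-component layout described above can be sketched as a small parser. This is a minimal illustration, not part of the grid engine software: the names LogMessage, parse_message, and MESSAGE_TYPES are hypothetical, and the sample line, including its time stamp format, is made up for demonstration.

```python
from typing import NamedTuple

# Type letters and names as listed in this section.
MESSAGE_TYPES = {
    "N": "notice",
    "I": "info",
    "W": "warning",
    "E": "error",
    "C": "critical",
}


class LogMessage(NamedTuple):
    timestamp: str   # first component
    daemon: str      # second component
    host: str        # third component
    type_code: str   # fourth component: N, I, W, E, or C
    text: str        # fifth component


def parse_message(line: str) -> LogMessage:
    """Split one line of a messages file into its five components."""
    # maxsplit=4 keeps any '|' characters inside the message text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}")
    return LogMessage(*parts)


# Hypothetical sample line; real time stamps and daemon names may differ.
sample = "01/01/2005 12:00:00|qmaster|host1|E|example error text"
msg = parse_message(sample)
print(MESSAGE_TYPES.get(msg.type_code, "unknown"), "-", msg.text)
```

Splitting at most four times means that a vertical bar inside the message text does not produce extra components.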
Note - If an error log file is not accessible for some reason, the grid engine system tries
to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on
the corresponding host.
In some circumstances, the grid engine system notifies users, administrators,
or both, about error events by email. The email messages sent by the grid engine system do
not contain a message body. The message text is fully contained in the mail subject
field.
Consequences of Different Error or Exit Codes
The following table lists the consequences of different job-related error codes
or exit codes. These codes are valid for every type of job. In the tables in this section,
Rest stands for any code that is not listed explicitly.
Table 6-1 Job-Related Error or Exit Codes
Script/Method | Exit or Error Code | Consequence |
Job script | 0 | Success |
Job script | 99 | Requeue |
Job script | Rest | Success: exit code in accounting file |
prolog/epilog | 0 | Success |
prolog/epilog | 99 | Requeue |
prolog/epilog | Rest | Queue error state, job requeued |
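The rules in Table 6-1 can be restated as a small lookup function. This is only a sketch for reading the table: the function name job_exit_consequence and the method labels "job_script", "prolog", and "epilog" are hypothetical and are not grid engine identifiers. Any code other than 0 or 99 is treated as the Rest row.

```python
def job_exit_consequence(method: str, exit_code: int) -> str:
    """Return the Table 6-1 consequence for a method and its exit code."""
    if method not in ("job_script", "prolog", "epilog"):
        raise ValueError(f"unknown method: {method}")
    if exit_code == 0:
        return "Success"
    if exit_code == 99:
        return "Requeue"
    # Every other code falls under the "Rest" row of the table.
    if method == "job_script":
        return "Success: exit code in accounting file"
    return "Queue error state, job requeued"


for method, code in [("job_script", 1), ("prolog", 1), ("epilog", 99)]:
    print(method, code, "->", job_exit_consequence(method, code))
```

The function makes the asymmetry visible: a nonzero job script exit code is only recorded in the accounting file, whereas the same code from a prolog or epilog puts the queue into the error state.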
The following table lists the consequences of error codes or exit codes of jobs
related to parallel environment (PE) configuration.
Table 6-2 Parallel-Environment-Related Error or Exit Codes
Script/Method | Exit or Error Code | Consequence |
pe_start | 0 | Success |
pe_start | Rest | Queue set to error state, job requeued |
pe_stop | 0 | Success |
pe_stop | Rest | Queue set to error state, job not requeued |
The following table lists the consequences of error codes or exit codes of jobs
related to queue configuration. These codes are valid only if the corresponding methods
were overridden.
Table 6-3 Queue-Related Error or Exit Codes
Script/Method | Exit or Error Code | Consequence |
Job starter | 0 | Success |
Job starter | Rest | Success, no other special meaning |
Suspend | 0 | Success |
Suspend | Rest | Success, no other special meaning |
Resume | 0 | Success |
Resume | Rest | Success, no other special meaning |
Terminate | 0 | Success |
Terminate | Rest | Success, no other special meaning |
The following table lists the consequences of error or exit codes of jobs related
to checkpointing.
Table 6-4 Checkpointing-Related Error or Exit Codes
Script/Method | Exit or Error Code | Consequence |
Checkpoint | 0 | Success |
Checkpoint | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. |
Migrate | 0 | Success |
Migrate | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur. |
Restart | 0 | Success |
Restart | Rest | Success, no other special meaning |
Clean | 0 | Success |
Clean | Rest | Success, no other special meaning |
For jobs that run successfully, the qacct -j command output shows a value of 0 in the failed field,
and the output shows the exit status of the job in the exit_status field.
However, the shepherd might not be able to run a job successfully. For example, the
epilog script might fail, or the shepherd might not be able to start the job. In such
cases, the failed field displays one of the code values listed in the following table.
Table 6-5 qacct -j failed Field Codes
Code | Description | acctvalid | Meaning for Job |
0 | No failure | t | Job ran, exited normally |
1 | Presumably before job | f | Job could not be started |
3 | Before writing config | f | Job could not be started |
4 | Before writing PID | f | Job could not be started |
5 | On reading config file | f | Job could not be started |
6 | Setting processor set | f | Job could not be started |
7 | Before prolog | f | Job could not be started |
8 | In prolog | f | Job could not be started |
9 | Before pestart | f | Job could not be started |
10 | In pestart | f | Job could not be started |
11 | Before job | f | Job could not be started |
12 | Before pestop | t | Job ran, failed before calling PE stop procedure |
13 | In pestop | t | Job ran, PE stop procedure failed |
14 | Before epilog | t | Job ran, failed before calling epilog script |
15 | In epilog | t | Job ran, failed in epilog script |
16 | Releasing processor set | t | Job ran, processor set could not be released |
24 | Migrating (checkpointing jobs) | t | Job ran, job will be migrated |
25 | Rescheduling | t | Job ran, job will be rescheduled |
26 | Opening output file | f | Job could not be started, stderr/stdout file could not be opened |
27 | Searching requested shell | f | Job could not be started, shell not found |
28 | Changing to working directory | f | Job could not be started, error changing to start directory |
100 | Assumedly after job | t | Job ran, job killed by a signal |
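The codes in Table 6-5 can be turned into a lookup that explains the failed field of a qacct -j report. The following is a sketch under stated assumptions: the helper names read_fields and explain_failed are hypothetical, the sample text is made up, and the parser assumes that each output line consists of a field name followed by whitespace and a value. Check the layout of your own qacct output before relying on it.

```python
# (description, acctvalid, meaning for job) per failed code, from Table 6-5.
_NOT_STARTED = "Job could not be started"
FAILED_CODES = {
    0: ("No failure", True, "Job ran, exited normally"),
    1: ("Presumably before job", False, _NOT_STARTED),
    3: ("Before writing config", False, _NOT_STARTED),
    4: ("Before writing PID", False, _NOT_STARTED),
    5: ("On reading config file", False, _NOT_STARTED),
    6: ("Setting processor set", False, _NOT_STARTED),
    7: ("Before prolog", False, _NOT_STARTED),
    8: ("In prolog", False, _NOT_STARTED),
    9: ("Before pestart", False, _NOT_STARTED),
    10: ("In pestart", False, _NOT_STARTED),
    11: ("Before job", False, _NOT_STARTED),
    12: ("Before pestop", True, "Job ran, failed before calling PE stop procedure"),
    13: ("In pestop", True, "Job ran, PE stop procedure failed"),
    14: ("Before epilog", True, "Job ran, failed before calling epilog script"),
    15: ("In epilog", True, "Job ran, failed in epilog script"),
    16: ("Releasing processor set", True, "Job ran, processor set could not be released"),
    24: ("Migrating (checkpointing jobs)", True, "Job ran, job will be migrated"),
    25: ("Rescheduling", True, "Job ran, job will be rescheduled"),
    26: ("Opening output file", False,
         "Job could not be started, stderr/stdout file could not be opened"),
    27: ("Searching requested shell", False, "Job could not be started, shell not found"),
    28: ("Changing to working directory", False,
         "Job could not be started, error changing to start directory"),
    100: ("Assumedly after job", True, "Job ran, job killed by a signal"),
}


def read_fields(qacct_output: str) -> dict:
    """Collect 'name value' lines into a dict (assumed output layout)."""
    fields = {}
    for line in qacct_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def explain_failed(qacct_output: str) -> dict:
    """Look up the failed code of one job record in Table 6-5."""
    fields = read_fields(qacct_output)
    # Use only the leading integer in case text follows the code.
    code = int(fields["failed"].split()[0])
    description, acctvalid, meaning = FAILED_CODES.get(
        code, ("Unknown code", None, "Not listed in Table 6-5"))
    return {
        "code": code,
        "description": description,
        "acctvalid": acctvalid,
        "meaning": meaning,
        "exit_status": fields.get("exit_status"),
    }


# Hypothetical excerpt of a job record, for illustration only.
sample_output = """\
jobnumber    42
failed       26
exit_status  0
"""
result = explain_failed(sample_output)
print(result["code"], "-", result["meaning"])
```

In practice the text passed to explain_failed would be the captured output of qacct -j for a single job. Codes missing from the table are reported as unknown rather than raising an error.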