
Consequences of Different Error or Exit Codes

The following table lists the consequences of different job-related error codes or exit codes. These codes are valid for every type of job.

Table 8-1 Job-Related Error or Exit Codes

Script/Method     Exit or Error Code    Consequence
Job script        0                     Success
                  99                    Requeue
                  Rest                  Success: exit code in accounting file
prolog/epilog     0                     Success
                  99                    Requeue
                  Rest                  Queue error state, job requeued
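
For example, a job script can exit with code 99 to ask the grid engine system to requeue the job; the code has the same effect in a prolog or epilog script. The following is a minimal sh sketch in which the scratch directory that is tested is purely hypothetical:

#!/bin/sh
# Minimal sketch: request a requeue if the job's scratch directory is missing.
# The directory check is hypothetical; JOB_ID is set in the job's environment
# by the grid engine system.
if [ ! -d "/scratch/$JOB_ID" ]; then
    exit 99     # Requeue (see Table 8-1)
fi
exit 0          # Success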

The following table lists the consequences of error codes or exit codes of jobs related to parallel environment (PE) configuration.

Table 8-2 Parallel-Environment-Related Error or Exit Codes

Script/Method     Exit or Error Code    Consequence
pe_start          0                     Success
                  Rest                  Queue set to error state, job requeued
pe_stop           0                     Success
                  Rest                  Queue set to error state, job not requeued

The following table lists the consequences of error codes or exit codes of jobs related to queue configuration. These codes are valid only if the corresponding methods have been overridden in the queue configuration.

Table 8-3 Queue-Related Error or Exit Codes

Script/Method     Exit or Error Code    Consequence
Job starter       0                     Success
                  Rest                  Success, no other special meaning
Suspend           0                     Success
                  Rest                  Success, no other special meaning
Resume            0                     Success
                  Rest                  Success, no other special meaning
Terminate         0                     Success
                  Rest                  Success, no other special meaning

The following table lists the consequences of error or exit codes of jobs related to checkpointing.

Table 8-4 Checkpointing-Related Error or Exit Codes

Script/Method     Exit or Error Code    Consequence
Checkpoint        0                     Success
                  Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful.
Migrate           0                     Success
                  Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur.
Restart           0                     Success
                  Rest                  Success, no other special meaning
Clean             0                     Success
                  Rest                  Success, no other special meaning

Running Grid Engine System Programs in Debug Mode

For some severe error conditions, the error-logging mechanism might not yield sufficient information to identify the problems. Therefore, the grid engine system offers the ability to run almost all ancillary programs and the daemons in debug mode. Different debug levels vary in the extent and depth of information that is provided. The debug levels range from zero through 10, with 10 being the level delivering the most detailed information and zero turning off debugging.

To set a debug level, the distribution of the grid engine system provides an extension for your .cshrc or .profile resource file. For csh or tcsh users, the file is sge-root/util/dl.csh. For sh or ksh users, the corresponding file is sge-root/util/dl.sh. The file must be sourced into your standard resource file. As a csh or tcsh user, include the following line in your .cshrc file:

source sge-root/util/dl.csh

As an sh or ksh user, include the following line in your .profile file:

. sge-root/util/dl.sh

As soon as you log out and log in again, you can use the following command to set a debug level:

% dl level

If level is greater than 0, starting a grid engine system command forces the command to write trace output to STDOUT. The trace output can contain warning messages, status messages, and error messages, as well as the names of the program modules that are called internally. Depending on the debug level you specify, the messages also include line number information, which is helpful for error reporting.
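
For example, to trace a single qstat invocation and keep the output for later inspection, you might use commands like the following; the debug level 2 and the file name /tmp/qstat.trace are arbitrary choices:

% dl 2
% qstat -f > /tmp/qstat.trace
% dl 0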


Note - To watch a debug trace, you should use a window with a large scroll-line buffer. For example, you might use a scroll-line buffer of 1000 lines.



Note - If your window is an xterm, you might want to use the xterm logging mechanism to examine the trace output later on.


If you run one of the grid engine system daemons in debug mode, the daemons keep their terminal connection to write the trace output. You can abort the terminal connections by typing the interrupt character of the terminal emulation you use. For example, you might use Control-C.

To switch off debug mode, set the debug level back to 0.

Setting the dbwriter Debug Level

The sgedbwriter script starts the dbwriter program. The script is located in sge_root/dbwriter/bin/sgedbwriter. The sgedbwriter script reads the dbwriter configuration file, dbwriter.conf, which is located in sge_root/cell/common/dbwriter.conf. This configuration file sets the debug level of dbwriter. For example:

#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO

You can use the -debug option of the dbwriter command to change the number of messages that the dbwriter produces. In general, you should use the default debug level, which is info. If you use a more verbose debug level, you substantially increase the amount of data output by dbwriter.
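
For example, to have dbwriter log every SQL statement that it runs, you could change the entry shown above to the fine level and then restart dbwriter so that sgedbwriter reads the new setting:

DBWRITER_DEBUG=FINE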

You can specify the following debug levels:

warning

Displays only severe errors and warnings.

info

Adds a number of informational messages. info is the default debug level.

config

Gives additional information that is related to dbwriter configuration, for example, about the processing of rules.

fine

Produces more information. If you choose this debug level, all SQL statements run by dbwriter are output.

finer

For debugging.

finest

For debugging.

all

Displays information for all levels. For debugging.

Diagnosing Problems

The grid engine system offers several reporting methods to help you diagnose problems. The following sections outline their uses.

Pending Jobs Not Being Dispatched

Sometimes a pending job is obviously capable of being run, but the job does not get dispatched. To diagnose the reason, the grid engine system offers two diagnostic facilities: qstat -j job-id and qalter -w v job-id.

  • qstat -j job-id

    When enabled, qstat -j job-id provides a list of reasons why a certain job was not dispatched in the last scheduling run. This monitoring can be enabled or disabled. You might want to disable monitoring because it can cause undesired communication overhead between the sge_schedd daemon and sge_qmaster. See schedd_job_info in the sched_conf(5) man page. The following example shows output for a job with the ID 242059:

    % qstat -j 242059
    scheduling info: queue "fangorn.q" dropped because it is temporarily not available
    queue "lolek.q" dropped because it is temporarily not available
    queue "balrog.q" dropped because it is temporarily not available
    queue "saruman.q" dropped because it is full
    cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
    cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
    has no permission for host "ori"

    This information is generated directly by the sge_schedd daemon and takes the current usage of the cluster into account. Sometimes this information does not provide what you are looking for. For example, if all queue slots are already occupied by jobs of other users, no detailed message is generated for the job you are interested in.
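
    To turn this monitoring on, set the schedd_job_info parameter in the scheduler configuration, for example with qconf -msconf. The following is a minimal sketch; see the sched_conf(5) man page for the exact values that schedd_job_info accepts:

    % qconf -msconf
    (in the editor that opens, set schedd_job_info to true, then save and quit)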

  • qalter -w v job-id
