Sun Microsystems

Example 4-2 Example of qstat Output

job-ID   prior   name         user      state   submit/start at      queue      function
231      0       hydra        craig     r       07/13/96 20:27:15    durin.q    MASTER
232      0       compile      penny     r       07/13/96 20:30:40    durin.q    MASTER
230      0       blackhole    don       r       07/13/96 20:26:10    dwain.q    MASTER
233      0       mac          elaine    r       07/13/96 20:30:40    dwain.q    MASTER
234      0       golf         shannon   r       07/13/96 20:31:44    dwain.q    MASTER
236      5       word         elaine    qw      07/13/96 20:32:07
235      0       andrun       penny     qw      07/13/96 20:31:43
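
A listing like the one in Example 4-2 can also be processed by a script. The following Python sketch shows one possible way to turn output in this layout into records. It assumes the column order of Example 4-2 and treats a line that contains only a time as belonging to the preceding job. The function name parse_qstat and the field names are illustrative and are not part of the grid engine software. The qstat output of other releases might be laid out differently.

```python
import re

# Sample text in the layout of Example 4-2 (a subset of the rows).
SAMPLE = """\
job-ID   prior   name         user      state   submit/start at     queue      function
231      0       hydra        craig     r       07/13/96            durin.q    MASTER
                                                20:27:15
236      5       word         elaine    qw      07/13/96
                                                20:32:07
235      0       andrun       penny     qw      07/13/96 20:31:43
"""

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")


def parse_qstat(text):
    """Return one dict per job found in qstat-style text (assumed layout)."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # A line holding only a time continues the previous job's timestamp.
        if len(fields) == 1 and TIME_RE.match(fields[0]):
            if jobs and not jobs[-1]["time"]:
                jobs[-1]["time"] = fields[0]
            continue
        # Skip the header and anything else that does not start with a job ID.
        if not fields[0].isdigit() or len(fields) < 6:
            continue
        job = {
            "job_id": int(fields[0]),
            "prior": int(fields[1]),
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
            "date": fields[5],
            "time": "",
            "queue": None,      # pending jobs have no queue yet
            "function": None,
        }
        rest = fields[6:]
        if rest and TIME_RE.match(rest[0]):
            job["time"] = rest.pop(0)
        if len(rest) >= 2:
            job["queue"], job["function"] = rest[0], rest[1]
        jobs.append(job)
    return jobs


jobs = parse_qstat(SAMPLE)
pending = [job["job_id"] for job in jobs if job["state"] == "qw"]
```

In this sketch, pending collects the IDs of jobs in the qw state, which are the jobs that have no queue assigned in the listing.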

Controlling Jobs With qdel and qmod

To control jobs from the command line, type one of the following commands with the appropriate arguments.

% qdel arguments
% qmod arguments

Use the qdel command to cancel jobs, whether they are running or spooled. Use the qmod command to suspend and resume (unsuspend) jobs that are already running.

For both commands, you need to know the job identification number, which is displayed in response to a successful qsub command. If you forget the number, you can retrieve it with qstat. See Monitoring Jobs With qstat.

Here are several examples of the qdel and qmod commands:

% qdel job-id
% qdel -f job-id1, job-id2
% qmod -s job-id
% qmod -us -f job-id1, job-id2
% qmod -s job-id.task-id-range

In order to delete, suspend, or resume a job, you must be the owner of the job or a grid engine manager or operator. See Managers, Operators, and Owners.

You can use the -f (force) option with both commands to register a job status change at sge_qmaster without contacting sge_execd. You might want to use the force option in cases where sge_execd is unreachable, for example, due to network problems. The -f option is intended for use only by the administrator. In the case of qdel, however, users can force deletion of their own jobs if the flag ENABLE_FORCED_QDEL in the cluster configuration qmaster_params entry is set. See the sge_conf(5) man page for more information.
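
If you drive qdel and qmod from a script, it can help to build the argument list in one place. The following Python sketch assembles argument lists that use only the options described in this section (-f, -s, and -us). The function names are hypothetical. Joining several job IDs with commas into a single argument mirrors the comma-separated examples above and is an assumption of this sketch. Check the qdel(1) and qmod(1) man pages for the forms that your installation accepts.

```python
def _job_list(job_ids):
    """Join job IDs (or job-id.task-id-range strings) with commas."""
    ids = [str(job_id) for job_id in job_ids]
    if not ids:
        raise ValueError("at least one job ID is required")
    return ",".join(ids)


def qdel_argv(job_ids, force=False):
    """Build an argument list that cancels the given jobs."""
    argv = ["qdel"]
    if force:
        argv.append("-f")  # register the change at sge_qmaster only
    argv.append(_job_list(job_ids))
    return argv


def qmod_argv(action, job_ids, force=False):
    """Build an argument list that suspends or resumes the given jobs."""
    flags = {"suspend": "-s", "resume": "-us"}
    if action not in flags:
        raise ValueError("action must be 'suspend' or 'resume'")
    argv = ["qmod", flags[action]]
    if force:
        argv.append("-f")
    argv.append(_job_list(job_ids))
    return argv


cancel = qdel_argv([231, 232], force=True)
resume = qmod_argv("resume", [230])
```

The sketch only builds the lists and does not run anything. A list such as cancel could then be passed to subprocess.run on a host where the grid engine commands are installed.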

Monitoring Jobs by Email

To request email notification about a job from the command line, type the following command with the appropriate arguments.

% qsub arguments

The qsub -m option requests that email be sent when certain events occur, either to the user who submitted the job or to the email addresses specified with the -M flag. See the qsub(1) man page for a description of the flags. The argument to the -m option specifies the events. The following arguments are available:

  • b - Send email at the beginning of the job.

  • e - Send email at the end of the job.

  • a - Send email when the job is rescheduled or aborted (for example, by using the qdel command).

  • s - Send email when the job is suspended.

  • n - Do not send email. n is the default.

Use a string made up of one or more of the letter arguments to specify several of these options with a single -m option. For example, -m be sends email at the beginning and at the end of a job.
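
A script that submits jobs might validate the event letters before calling qsub. The following Python sketch checks a string of letters against the list above and returns the corresponding -m and -M arguments. The function name mail_args is hypothetical. Rejecting n in combination with other letters is an assumption of this sketch, made because the two requests contradict each other. The sketch also assumes that several addresses are passed to -M as a comma-separated list. See the qsub(1) man page to confirm both points.

```python
MAIL_EVENTS = {
    "b": "beginning of the job",
    "e": "end of the job",
    "a": "job rescheduled or aborted",
    "s": "job suspended",
    "n": "no email",
}


def mail_args(events, recipients=()):
    """Return the qsub arguments for the requested mail events."""
    # Drop repeated letters while keeping the order in which they were given.
    letters = "".join(dict.fromkeys(events))
    if not letters:
        raise ValueError("at least one event letter is required")
    unknown = sorted(set(letters) - set(MAIL_EVENTS))
    if unknown:
        raise ValueError("unknown mail event(s): " + ", ".join(unknown))
    # Assumption of this sketch: 'n' cannot be mixed with other letters.
    if "n" in letters and len(letters) > 1:
        raise ValueError("'n' cannot be combined with other events")
    args = ["-m", letters]
    if recipients:
        args += ["-M", ",".join(recipients)]
    return args


begin_and_end = mail_args("be")
to_two_users = mail_args("ea", ["craig@example.com", "penny@example.com"])
```

The returned list can be appended to the rest of a qsub argument list. The addresses shown are placeholders.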

You can also use the Submit Job dialog box to configure these mail events. See Submitting Advanced Jobs With QMON.

Monitoring and Controlling Queues

As described in Displaying Queues and Queue Properties, the owners of queues have permission to suspend and resume queues, and to disable and enable queues. Owners might want to suspend or disable queues if certain machines are needed for important work, and those machines are strongly affected by jobs running in the background.

You can control queues in two ways:

  • Using the QMON Queue Control dialog box

  • Using the qmod command

Monitoring and Controlling Queues With QMON

In the QMON Main Control window, click the Queue Control button. The Cluster Queues dialog box appears.

Figure: The Cluster Queues dialog box. The Cluster Queues tab lists the defined cluster queues and provides buttons that you can use to manipulate queues.

Monitoring and Controlling Cluster Queues

The Cluster Queues tab provides a quick overview of all cluster queues that are defined for the cluster. From this tab you can also suspend and resume cluster queues, disable and enable them, and configure them.

Information displayed in the Cluster Queues dialog box is updated periodically. Click Refresh to force an update.

To select a cluster queue, click it.

Click Delete, Suspend, Resume, Disable, or Enable to execute the corresponding operation on cluster queues that you select. The suspend/resume and disable/enable operations require notification of the corresponding sge_execd. If notification is not possible, you can force an sge_qmaster internal status change by clicking Force. For example, notification might not be possible because a host is down.

The suspend/resume and disable/enable operations require cluster queue owner permission, grid engine manager permission, or operator permission. See Managers, Operators, and Owners for details.

Suspended cluster queues are closed for further jobs. The jobs already running in suspended queues are also suspended, as described in Monitoring and Controlling Jobs With QMON. The cluster queue and its jobs are unsuspended as soon as the queue is resumed.


Note - If a job in a suspended cluster queue was suspended explicitly, the job is not resumed when the queue is resumed. The job must be resumed explicitly.


Disabled cluster queues are closed. However, the jobs that are running in those queues are allowed to continue. The disabling of a cluster queue is commonly used to "drain" a queue. After the cluster queue is enabled, it is eligible to run jobs again. No action on currently running jobs is performed.
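
The difference between suspending and disabling a cluster queue can be summarized in a small model. The following Python sketch is not grid engine code. The classes Job and ClusterQueue and their attributes are hypothetical and encode only the rules stated above: a suspended or disabled queue accepts no new jobs, jobs in a suspended queue do not run, jobs in a disabled queue keep running, and an explicitly suspended job stays suspended after its queue is resumed.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    explicitly_suspended: bool = False  # suspended directly, not via the queue


@dataclass
class ClusterQueue:
    name: str
    suspended: bool = False
    disabled: bool = False
    jobs: list = field(default_factory=list)

    def accepts_new_jobs(self):
        # Both suspended and disabled queues are closed for further jobs.
        return not (self.suspended or self.disabled)

    def is_running(self, job):
        # Disabling does not affect running jobs; suspending does.
        return not (self.suspended or job.explicitly_suspended)


queue = ClusterQueue("durin.q", jobs=[Job(231), Job(232, explicitly_suspended=True)])
job_231, job_232 = queue.jobs

queue.suspended = True
while_suspended = (queue.accepts_new_jobs(), queue.is_running(job_231))

queue.suspended = False   # resume the queue
after_resume = (queue.is_running(job_231), queue.is_running(job_232))

queue.disabled = True     # drain the queue
while_disabled = (queue.accepts_new_jobs(), queue.is_running(job_231))
```

After the queue is resumed, job 231 runs again, while job 232 remains suspended until it is resumed explicitly.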

Error states are displayed using a red font in the queue list. Click Clear Error to remove an error state from a queue.

Click Reschedule to reschedule all jobs currently running in the selected cluster queues.

To configure cluster queues and queue instances, click Add or Modify on the Cluster Queues dialog box. See "Configuring Queues With QMON" in the N1 Grid Engine 6 Administration Guide for details.

Click Done to close the dialog box.

Cluster Queue Status

Each row in the cluster queue table represents one cluster queue. For each cluster queue, the table lists the following information:

  • Cluster Queue - Name of the cluster queue.

  • Load - Average of the normalized load average of all cluster queue hosts. Only hosts with a load value are considered.

  • Used - Number of currently used job slots.

  • Avail - Number of currently available job slots.

  • Total - Total number of job slots.

  • aoACD - Number of queue instances that are in at least one of the following states:

    • a - Load threshold alarm

    • o - Orphaned

    • A - Suspend threshold alarm

    • C - Suspended by calendar

    • D - Disabled by calendar

  • cdsuE - Number of queue instances that are in at least one of the following states:

    • c - Configuration ambiguous

    • d - Disabled

    • s - Suspended

    • u - Unknown

    • E - Error

  • s - Number of queue instances that are in the suspended state.

  • A - Number of queue instances where one or more suspend thresholds are currently exceeded.

  • S - Number of queue instances that are suspended through subordination to another queue.

  • C - Number of queue instances that are automatically suspended by the grid engine system calendar.

  • u - Number of queue instances that are in an unknown state.

  • a - Number of queue instances where one or more load thresholds are currently exceeded.

  • d - Number of queue instances that are in the disabled state.

  • D - Number of queue instances that are automatically disabled by the grid engine system calendar.

  • c - Number of queue instances whose configuration is ambiguous.

  • o - Number of queue instances that are in the orphaned state.

  • E - Number of queue instances that are in the error state.
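
The way the table columns relate to the individual queue instances can be sketched in Python. The function summarize and the input format (one dict per queue instance with load, used, total, and state keys) are hypothetical and do not correspond to a grid engine interface. The sketch follows the definitions above: Load averages only the instances that report a load value, and the aoACD and cdsuE columns count each instance that is in at least one of the listed states. The Avail column is left out because the text does not say how it is derived.

```python
from collections import Counter

STATE_LETTERS = "sASCuadDcoE"


def summarize(instances):
    """Summarize queue instances into the columns of the cluster queue table."""
    loads = [qi["load"] for qi in instances if qi.get("load") is not None]
    states = [set(qi.get("state", "")) for qi in instances]
    summary = {
        # Only hosts with a load value are considered.
        "Load": sum(loads) / len(loads) if loads else None,
        "Used": sum(qi["used"] for qi in instances),
        "Total": sum(qi["total"] for qi in instances),
        # An instance counts once if it is in at least one listed state.
        "aoACD": sum(1 for s in states if s & set("aoACD")),
        "cdsuE": sum(1 for s in states if s & set("cdsuE")),
    }
    per_state = Counter(letter for s in states for letter in s)
    for letter in STATE_LETTERS:
        summary[letter] = per_state[letter]
    return summary


# Made-up queue instances, for illustration only.
instances = [
    {"load": 0.5, "used": 2, "total": 4, "state": ""},
    {"load": None, "used": 0, "total": 4, "state": "au"},
    {"load": 1.5, "used": 1, "total": 2, "state": "d"},
]
summary = summarize(instances)
```

In this example, the instance in state au is counted in both the aoACD and the cdsuE columns, because it is in at least one state from each group.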
