Sun Microsystems

See the qstat(1) man page for complete information about cluster queues and their states.

Monitoring and Controlling Queue Instances

The Queue Instances tab provides a quick overview of all queue instances that are associated with the selected cluster queue. The Queue Instances tab also provides the means to suspend, resume, disable, and enable queue instances.

Dialog box titled Cluster Queues. Shows the Queue Instances tab with a list of defined queue instances. Shows buttons you can use to manipulate queues.

To select a queue instance, click it.

Click Suspend, Resume, Disable, or Enable to execute the corresponding operation on queue instances that you select. The suspend/resume and disable/enable operations require notification of the corresponding sge_execd. If notification is not possible, for example, because the host is down, you can force an sge_qmaster internal status change by clicking Force.

The suspend/resume and disable/enable operations require queue owner permission, manager permission, or operator permission. See Managers, Operators, and Owners.

Suspended queue instances are closed for further jobs. The jobs already running in suspended queue instances are also suspended, as described in Monitoring and Controlling Jobs With QMON. The queue instance and its jobs are unsuspended as soon as the queue instance is resumed.


Note - If a job in a suspended queue instance was suspended explicitly, the job is not resumed when the queue instance is resumed. The job must be resumed explicitly.


Disabled queue instances are closed for new jobs. However, the jobs already executing in those queue instances are allowed to continue. Disabling a queue instance is commonly used to "drain" it. After the queue instance is enabled, it is eligible to run jobs again. Enabling a queue instance has no effect on currently running jobs.

Queue Instance Status

Each row in the queue instances table represents one queue instance. For each queue instance, the table lists the following information:

  • Queue - Name of the queue instance

  • qtype - Type of queue instance, which can be B (batch), I (interactive), or P (parallel)

  • used/total - Number of used job slots and the total number of job slots

  • load_avg - Load average of the queue instance host

  • arch - Architecture of the queue instance host

  • states - States of the queue instance

See Cluster Queue Status for a list of queue states. See the qstat(1) man page for complete information about queue instances and their states.
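The following is a minimal sketch, not part of the grid engine software, of how the used/total column can be turned into a count of free job slots. It assumes text lines whose whitespace-separated fields follow the order listed above, as in the queue instance lines that qstat -f prints in this release. The function name free_slots and the sample line are hypothetical, and the column layout can differ between releases, so check the header of your own output first.

```shell
# free_slots: read queue instance lines on stdin, print "<queue> <free slots>".
# Assumed field order: Queue, qtype, used/total, load_avg, arch, states.
free_slots() {
    awk '$3 ~ /^[0-9]+\/[0-9]+$/ {
        split($3, slots, "/")            # slots[1] = used, slots[2] = total
        print $1, slots[2] - slots[1]    # free = total - used
    }'
}

# Hypothetical sample line (made up for illustration, not real output):
#   printf 'all.q@host1 BIP 2/8 0.15 sol-sparc64\n' | free_slots
```

Lines whose third field is not of the form used/total, such as header or separator lines, are skipped.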

Displaying Queue Instance Attributes

To retrieve a queue instance's current attribute information, load information, and resource consumption information, select the queue instance, and then click Load. This information also implicitly includes information about the machine that is hosting the queue instance. The following window appears:

Dialog box titled Attributes for queue <name>. Shows lists titled Attributes, Slot-Limits/Fixed Attributes, Load(scaled)/Consumable. Shows Ok button.

The Attribute column lists all attributes attached to the queue instance, including those attributes that are inherited from the host or the global cluster.

The Slot-Limits/Fixed Attributes column shows values for those attributes that are defined as per queue instance slot limits or as fixed resource attributes.

The Load(scaled)/Consumable column shows information about the reported and scaled load parameters. The column also shows information about the available resource capacities based on the consumable resources facility. See "Load Parameters" in N1 Grid Engine 6 Administration Guide and "Consumable Resources" in N1 Grid Engine 6 Administration Guide.

Load reports and consumable capacities can override each other if a load attribute is configured as a consumable resource. In that case, the lower of the two values is displayed, because that is the value the job-dispatching algorithm uses.
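As a small illustration of that rule, the sketch below prints the smaller of a reported load value and a consumable capacity. The function name displayed_value and the numbers in the usage comment are made up for this example; the sketch only mirrors the rule stated above and is not how QMON itself is implemented.

```shell
# displayed_value LOAD_REPORT CONSUMABLE_CAPACITY
# Prints the smaller of the two numeric arguments.
displayed_value() {
    awk -v load="$1" -v cap="$2" \
        'BEGIN { print ((load + 0 < cap + 0) ? load : cap) }'
}

# Hypothetical usage: a reported value of 1.5 and a consumable capacity of 2
#   displayed_value 1.5 2
```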


Note - The displayed load and consumable values currently do not take into account load adjustment corrections, as described in Execution Hosts.


Filtering Cluster Queues and Queue Instances

The Customize button enables you to filter the cluster queues and queue instances you want to display.

The following figure shows a filtered selection of only those queue instances whose current configuration is ambiguous.

Dialog box titled Queue Customize. Shows Misc Filters tab with User, Pattern, and State filtering options. Shows Save, Cancel, and Ok buttons.

Click Save in the Queue Customize dialog box to store your settings in the file .qmon_preferences in your home directory. The saved settings are then applied by default each time you start QMON.

Controlling Queues With qmod

You can use the qmod command to suspend, resume, disable, and enable queues.

Type the following command with appropriate arguments.

% qmod arguments

The following commands are examples of how to use qmod:

% qmod -s q-name
% qmod -us -f q-name1, q-name2
% qmod -d q-name
% qmod -e q-name1, q-name2, q-name3

qmod -s suspends a queue. qmod -us -f resumes (unsuspends) two queues. qmod -d disables a queue. qmod -e enables three queues.

The -f option forces registration of the status change in sge_qmaster when the corresponding sge_execd is not reachable, for example, due to network problems.

Suspending and resuming queues as well as disabling and enabling queues requires queue owner permission, manager permission, or operator permission. See Managers, Operators, and Owners.


Note - You can use qmod commands with crontab or at jobs.
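As a rough sketch of the note above, a small wrapper script could be scheduled with crontab to disable a queue in the evening and enable it again in the morning. Only the -d and -e options come from the examples in this section. The queue name night.q, the script path, the helper names, and the DRY_RUN variable are hypothetical. cron starts jobs with a minimal environment, so a real script would first need to set up the grid engine environment (for example, SGE_ROOT and PATH) for your installation.

```shell
#!/bin/sh
# Sketch only: toggle a queue from cron or at.

# run: execute a command, or just print it when DRY_RUN is set (hypothetical
# helper so the script can be checked without touching the cluster).
run() {
    if [ -n "$DRY_RUN" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# set_queue_state ACTION QUEUE
#   ACTION is "disable" or "enable"; QUEUE is a queue name such as night.q.
set_queue_state() {
    case "$1" in
        disable) run qmod -d "$2" ;;
        enable)  run qmod -e "$2" ;;
        *) echo "usage: set_queue_state disable|enable queue" >&2; return 2 ;;
    esac
}

# When called as a script with two arguments, act on them.
if [ "$#" -eq 2 ]; then
    set_queue_state "$1" "$2"
fi

# Hypothetical crontab entries (script path is a placeholder):
#   0 20 * * *  /path/to/queue_state.sh disable night.q
#   0 6  * * *  /path/to/queue_state.sh enable  night.q
```

The account that runs the cron job still needs queue owner, manager, or operator permission, as described above.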


Using Job Checkpointing

This section explores two different kinds of job checkpointing:

  • User-level

  • Kernel-level

User-Level Checkpointing

Many application programs, especially programs that consume considerable CPU time, use checkpointing and restart mechanisms to increase fault tolerance. Status information and important parts of the processed data are repeatedly written to one or more restart files at certain stages of the algorithm. If the application is aborted, it can be restarted later from these restart files, resuming from a consistent state that is comparable to the situation when the last checkpoint was written. Because the user usually has to move the restart files to a proper location, this kind of checkpointing is called user-level checkpointing.
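The idea can be sketched with a loop that records its progress in a restart file and resumes from that file when it is started again. This is a simplified illustration, not a grid engine facility: the function name run_steps, the restart file, and the step counts are hypothetical, and a real application would save its working data as well as a step counter.

```shell
#!/bin/sh
# Sketch of user-level checkpointing with a single restart file.

# run_steps RESTART_FILE TOTAL_STEPS
run_steps() {
    restart_file=$1
    total=$2
    step=0
    # Resume from the last recorded step if a restart file exists.
    if [ -s "$restart_file" ]; then
        step=$(cat "$restart_file")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        # Write to a temporary file and rename it, so that the restart file
        # is not left half-written if the job is aborted during the write.
        echo "$step" > "$restart_file.tmp" && mv "$restart_file.tmp" "$restart_file"
    done
    echo "finished at step $step"
}
```

If the run is aborted after step 3 of 5, calling run_steps again with the same restart file repeats only steps 4 and 5.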

For application programs that do not have integrated user-level checkpointing, an alternative is to use a checkpointing library. Such libraries are available from some hardware vendors and in the public domain. The Condor project of the University of Wisconsin is an example. Relinking an application with such a library installs a checkpointing mechanism in the application without requiring source code changes.

Kernel-Level Checkpointing

Some operating systems provide checkpointing support inside the operating system kernel. No preparation of the application programs and no relinking are necessary in this case. Kernel-level checkpointing usually applies to single processes as well as to complete process hierarchies. That is, a hierarchy of interdependent processes can be checkpointed and restarted at any time. Usually both a user command and a C library interface are available to initiate a checkpoint.

The grid engine system supports operating system checkpointing if available. See the release notes for the N1 Grid Engine 6 software for information about the currently supported kernel-level checkpointing facilities.

Migrating Checkpointing Jobs

Checkpointing jobs are interruptible at any time because their restart capability ensures that only a small amount of completed work must be repeated. This ability is used to build migration and dynamic load balancing mechanisms in the grid engine system. If requested, checkpointing jobs are aborted on demand and migrated to other machines in the grid engine system, which balances the load in the cluster dynamically. Checkpointing jobs are aborted and migrated for the following reasons:

  • The executing queue or the job is suspended explicitly by a qmod or a QMON command.

  • The job or the queue where the job runs is suspended automatically because a suspend threshold for the queue is exceeded. The checkpoint occasion specification for the job includes the suspension case. For more information, see "Configuring Load and Suspend Thresholds" in N1 Grid Engine 6 Administration Guide and Submitting, Monitoring, or Deleting a Checkpointing Job From the Command Line.

A migrating job moves back to sge_qmaster. The job is subsequently dispatched to another suitable queue if such a queue is available. In such a case, the qstat output shows R as the status.
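A job script can check whether it is running for the first time or after a restart or migration. The sketch below assumes that the grid engine system sets the environment variable RESTARTED to 1 for a restarted job, as described in the qsub(1) man page; verify this for your release. The function name describe_start and the messages are hypothetical.

```shell
# describe_start: report whether this run is a fresh start or a restart.
# Assumption: RESTARTED=1 marks a restarted job; unset or 0 means a first run.
describe_start() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "restarted: resuming from restart files"
    else
        echo "fresh start"
    fi
}
```

A user-level checkpointing job would typically read its restart files only in the restarted case.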
