See the qstat(1) man page for complete information about cluster queues and their states.

Monitoring and Controlling Queue Instances

The Queue Instances tab provides a quick overview of all queue instances that are associated with the selected cluster queue. The Queue Instances tab also provides the means to suspend, resume, disable, and enable queue instances.

To select a queue instance, click it. Click Suspend, Resume, Disable, or Enable to execute the corresponding operation on the queue instances that you select.

The suspend/resume and disable/enable operations require notification of the corresponding sge_execd. If notification is not possible, for example, because the host is down, you can force an sge_qmaster internal status change by clicking Force.

The suspend/resume and disable/enable operations require queue owner permission, manager permission, or operator permission. See Managers, Operators, and Owners.

Suspended queue instances are closed for further jobs. The jobs already running in suspended queue instances are also suspended, as described in Monitoring and Controlling Jobs With QMON. The queue instance and its jobs are unsuspended as soon as the queue instance is resumed.

Note - If a job in a suspended queue instance was suspended explicitly, the job is not resumed when the queue instance is resumed. The job must be resumed explicitly.

Disabled queue instances are closed. However, the jobs executing in those queue instances are allowed to continue. Disabling a queue instance is commonly used to "drain" the queue instance. After the queue instance is enabled, it is eligible to run jobs again. No action on currently running jobs is performed.

Queue Instance Status

Each row in the queue instances table represents one queue instance. For each queue instance, the table lists the following information:
See Cluster Queue Status for a list of queue states. See the qstat(1) man page for complete information about queue instances and their states.

Displaying Queue Instance Attributes

To retrieve a queue instance's current attribute information, load information, and resource consumption information, select the queue instance, and then click Load. This information also implicitly includes information about the machine that is hosting the queue instance. The following window appears:

![]()

The Attribute column lists all attributes attached to the queue instance, including those attributes that are inherited from the host or the global cluster.

The Slot-Limits/Fixed Attributes column shows values for those attributes that are defined as per queue instance slot limits or as fixed resource attributes.

The Load(scaled)/Consumable column shows information about the reported and scaled load parameters. The column also shows information about the available resource capacities based on the consumable resources facility. See "Load Parameters" in N1 Grid Engine 6 Administration Guide and "Consumable Resources" in N1 Grid Engine 6 Administration Guide.

Load reports and consumable capacities can override each other if a load attribute is configured as a consumable resource. The minimum value of both, which is used in the job-dispatching algorithm, is displayed.

Note - The displayed load and consumable values currently do not take into account load adjustment corrections, as described in Execution Hosts.

Filtering Cluster Queues and Queue Instances

The Customize button enables you to filter the cluster queues and queue instances you want to display. The following figure shows a filtered selection of only those queue instances whose current configuration is ambiguous.

![]()

Click Save in the Queue Customize dialog box to store your settings in the file .qmon_preferences in your home directory for standard reactivation on later invocations of QMON.
Controlling Queues With qmod

You can use the qmod command to suspend and resume queues. You can also use qmod to disable and enable queues. Type the following command with appropriate arguments.
The following commands are examples of how to use qmod:
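As a minimal sketch of scripted use, the Python helper below assembles qmod command lines for the four operations named in this section. The queue names q1, q2, and q3 are hypothetical, the helper name qmod_argv is an illustration only, and joining several queues with commas is an assumption about the queue list syntax; check the qmod(1) man page before relying on it. The sketch only prints the command lines and does not run qmod.

```python
# Option letters taken from this section: -s, -us, -d, -e, and -f.
ACTIONS = {"suspend": "-s", "resume": "-us", "disable": "-d", "enable": "-e"}


def qmod_argv(action, queues, force=False):
    """Build an argument list for qmod. Nothing is executed here."""
    if action not in ACTIONS:
        raise ValueError("unknown action: " + action)
    if not queues:
        raise ValueError("at least one queue name is required")
    argv = ["qmod", ACTIONS[action]]
    if force:
        # -f forces registration of the change in sge_qmaster.
        argv.append("-f")
    # Assumption: several queues are passed as one comma-separated list.
    argv.append(",".join(queues))
    return argv


# Hypothetical queue names, mirroring the four examples in the text.
examples = [
    qmod_argv("suspend", ["q1"]),
    qmod_argv("resume", ["q1", "q2"], force=True),
    qmod_argv("disable", ["q1"]),
    qmod_argv("enable", ["q1", "q2", "q3"]),
]
for argv in examples:
    print(" ".join(argv))
```

A list built this way could be handed to subprocess.run on a host where qmod is installed and the caller has the required permission.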
qmod -s suspends a queue. qmod -us -f resumes (unsuspends) two queues. qmod -d disables a queue. qmod -e enables three queues.

The -f option forces registration of the status change in sge_qmaster when the corresponding sge_execd is not reachable, for example, due to network problems.

Suspending and resuming queues, as well as disabling and enabling queues, requires queue owner permission, manager permission, or operator permission. See Managers, Operators, and Owners.

Note - You can use qmod commands with crontab or at jobs.

Using Job Checkpointing

This section explores two different kinds of job checkpointing: user-level checkpointing and kernel-level checkpointing.
User-Level Checkpointing

Many application programs, especially programs that consume considerable CPU time, use checkpointing and restart mechanisms to increase fault tolerance. Status information and important parts of the processed data are repeatedly written to one or more files at certain stages of the algorithm. If the application is aborted, these restart files can be processed and the application restarted at a later time. The files reach a consistent state that is comparable to the situation just before the checkpoint. Because the user mostly has to move the restart files to a proper location, this kind of checkpointing is called user-level checkpointing.

For application programs that do not have integrated user-level checkpointing, an alternative is to use a checkpointing library. Such libraries are available from some hardware vendors or in the public domain. The Condor project of the University of Wisconsin is an example. By relinking an application with such a library, a checkpointing mechanism is installed in the application without requiring source code changes.

Kernel-Level Checkpointing

Some operating systems provide checkpointing support inside the operating system kernel. No preparations in the application programs and no relinking of the application are necessary in this case. Kernel-level checkpointing usually applies to single processes as well as to complete process hierarchies. That is, a hierarchy of interdependent processes can be checkpointed and restarted at any time. Usually both a user command and a C library interface are available to initiate a checkpoint.

The grid engine system supports operating system checkpointing if available. See the release notes for the N1 Grid Engine 6 software for information about the currently supported kernel-level checkpointing facilities.

Migrating Checkpointing Jobs

Checkpointing jobs are interruptible at any time, since their restart capability ensures that only a little of the work already done must be repeated.
This ability is used to build migration and dynamic load balancing mechanisms in the grid engine system. If requested, checkpointing jobs are aborted on demand. The jobs are migrated to other machines in the grid engine system, thus averaging the load in the cluster dynamically. Checkpointing jobs are aborted and migrated for the following reasons:
A migrating job moves back to sge_qmaster. The job is subsequently dispatched to another suitable queue if such a queue is available. In such a case, the qstat output shows R as the status.
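
Migration relies on the job being able to continue from its restart files after it is aborted and dispatched again. The following Python sketch illustrates user-level checkpointing in that spirit. It is not part of the grid engine system: the function names, the JSON restart file, and the checkpoint_every and stop_after parameters are all hypothetical, and stop_after only simulates an abort.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the restart file via a temporary file and an atomic rename,
    so an abort during the write cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..steps-1, checkpointing every few steps.

    stop_after (hypothetical) simulates an abort: the function returns
    None without saving, so work since the last checkpoint is lost.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated abort
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state
```

For example, calling run("restart.json", 100, stop_after=25) leaves a restart file that records step 20. A second call, run("restart.json", 100), resumes from step 20 and repeats only the five steps that were lost.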