Chapter 8

Fine Tuning, Error Messages, and Troubleshooting

This chapter describes some ways to fine-tune your grid engine system environment. The chapter also describes the error messaging procedures and offers tips on how to resolve various common problems.

This chapter includes the following sections:

Fine-Tuning Your Grid Environment

The grid engine system is a full-function, general-purpose distributed resource management tool. The scheduler component of the system supports a wide range of different compute farm scenarios. To get the maximum performance from your compute environment, you should review the features that are enabled. You should then determine which features you really need to solve your load management problem. Disabling some of these features can improve the throughput of your cluster.

Scheduler Monitoring

Scheduler monitoring can help you to find out why certain jobs are not dispatched. However, providing this information for all jobs at all times can consume resources. You usually do not need this much information.

To disable scheduler monitoring, set schedd_job_info to false in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

Finished Jobs

In the case of array jobs, the finished job list in qmaster can become quite large. By switching the finished job list off, you save memory and speed up the qstat process, because qstat also fetches the finished jobs list.

To turn off the finished job list function, set finished_jobs to zero in the cluster configuration. See Adding and Modifying Global and Host Configurations With QMON, and the sge_conf(5) man page.

Job Validation

Forced validation at job submission time can be a valuable way to keep nondispatchable jobs from remaining in a pending state forever. However, job validation can also be a time-consuming task.
Job validation can be especially time-consuming in heterogeneous environments with different execution nodes and consumable resources, and in which all users have their own job profiles. In homogeneous environments with only a few different jobs, a general job validation usually can be omitted.

To disable job verification, add the qsub option -w n to the cluster-wide default requests. See "Submitting Advanced Jobs With QMON" in N1 Grid Engine 6 User's Guide, and the sge_request(5) man page.

Load Thresholds and Suspend Thresholds

Load thresholds are needed if you deliberately oversubscribe your machines and you need to prevent excessive system load. Suspend thresholds are also used to prevent overloading the system.

Another case where you want to prevent the overloading of a node is when the execution node is still open for interactive load. Interactive load is not under the control of the grid engine system.

A compute farm might be more single-purpose. For example, each CPU at a compute node might be represented by only one queue slot, and no interactive load might be expected at these nodes. In such cases, you can omit load_thresholds.

To disable both thresholds, set load_thresholds to none and suspend_thresholds to none. See Configuring Load and Suspend Thresholds, and the queue_conf(5) man page.

Load Adjustments

Load adjustments are used to increase the measured load after a job is dispatched. This mechanism prevents oversubscription of machines that is caused by the delay between job dispatching and the corresponding load impact. You can switch off load adjustments if you do not need them. Load adjustments impose some additional work on the scheduler in connection with sorting hosts and verifying load thresholds.

To disable load adjustments, set job_load_adjustments to none and load_adjustment_decay_time to zero in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
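The scheduler-configuration changes named above (schedd_job_info set to false, job_load_adjustments set to none, load_adjustment_decay_time set to zero) can be illustrated with a small Python sketch that rewrites a configuration text. This is not part of the grid engine software. It assumes the configuration is available as plain text with one "name value" pair per line, and the function name apply_tuning and the sample values are hypothetical.

```python
# Sketch only: apply the scheduler tuning values from this chapter to a
# configuration text. Assumes one "name value" pair per line.

TUNING = {
    "schedd_job_info": "false",         # disable scheduler monitoring
    "job_load_adjustments": "none",     # disable load adjustments
    "load_adjustment_decay_time": "0",  # zero decay time
}


def apply_tuning(config_text, overrides=TUNING):
    """Return config_text with the overridden parameters replaced."""
    seen = set()
    out = []
    for line in config_text.splitlines():
        parts = line.split(None, 1)
        if parts and parts[0] in overrides:
            name = parts[0]
            out.append(f"{name} {overrides[name]}")
            seen.add(name)
        else:
            # Leave every other parameter (and blank lines) untouched.
            out.append(line)
    # Append any tuned parameter that was missing from the input.
    for name, value in overrides.items():
        if name not in seen:
            out.append(f"{name} {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative sample values, not taken from a real cluster.
    sample = (
        "schedule_interval 0:0:15\n"
        "schedd_job_info true\n"
        "job_load_adjustments np_load_avg=0.50\n"
        "load_adjustment_decay_time 0:7:30\n"
    )
    print(apply_tuning(sample), end="")
```

How the modified text is loaded back into the system is deliberately left out of the sketch. See the QMON procedure and the sched_conf(5) man page referenced above.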
Immediate Scheduling

The default for the grid engine system is to start scheduling runs at a fixed schedule interval. The advantage of fixed intervals is that they limit the CPU time consumption of the qmaster and the scheduler. The drawback is that fixed intervals choke the scheduler and artificially limit throughput. Many compute farms have machines specifically dedicated to qmaster and the scheduler, and such setups provide no reason to choke the scheduler. See schedule_interval in sched_conf(5).

You can configure immediate scheduling by using the flush_submit_sec and flush_finish_sec parameters of the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

If immediate scheduling is activated, the throughput of a compute farm is limited only by the power of the machine that is hosting sge_qmaster and the scheduler.

Urgency Policy and Resource Reservation

The urgency policy enables you to customize job priority schemes that are resource-dependent. Such job priority schemes include the following:
Implementing both objectives is especially valuable if you are using resource reservation.

How the Grid Engine Software Retrieves Error Reports

The grid engine software reports errors and warnings by logging messages into certain files or by sending email, or both. The log files include message files and job STDERR output.

As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file. By default, a standard file name and location are used. You can specify a different file name and location with certain options of the qsub command. See the grid engine system man pages for detailed information.

Separate messages files exist for the sge_qmaster, the sge_schedd, and the sge_execds. The files have the same file name: messages. The sge_qmaster log file resides in the master spool directory. The sge_schedd message file resides in the scheduler spool directory. The execution daemons' log files reside in the spool directories of the execution daemons. See "Spool Directories Under the Root Directory" in N1 Grid Engine 6 Installation Guide for more information about the spool directories.

Each message takes up a single line in the files. Each message is subdivided into five components separated by the vertical bar sign (|). The components of a message are as follows:
In some circumstances, the grid engine system notifies users, administrators, or both, about error events by email. The email messages sent by the grid engine system do not contain a message body. The message text is fully contained in the mail subject field.
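The single-line, five-component layout of the messages files described above lends itself to simple scripted parsing. The following Python sketch splits a line on the vertical bar sign. It is an illustration only: the function names are hypothetical, and it assumes that any additional bar characters belong to the last component.

```python
# Sketch only: split lines of a messages file into their five
# bar-separated components.

def split_message(line):
    """Split one messages-file line into its five components."""
    # Split at most four times so that a bar inside the last component
    # is kept as part of that component (an assumption of this sketch).
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}")
    return parts


def read_messages(lines):
    """Yield the components of each well-formed line, skipping others."""
    for line in lines:
        try:
            yield split_message(line)
        except ValueError:
            continue
```

To use the sketch, open one of the messages files in a spool directory and pass the file object to read_messages, which yields one five-element list per well-formed line.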