Tight Integration of Parallel Environments and Grid Engine Software

The section Configuring Parallel Environments With QMON mentions that using sge_execd and sge_shepherd to create parallel tasks offers benefits over parallel environments that create their own parallel tasks. The UNIX operating system allows reliable resource control only for the creator of a process hierarchy. Features such as correct accounting, resource limits, and process control for parallel applications can be enforced only by the creator of all parallel tasks.

Most parallel environments do not implement these features. Therefore, parallel environments do not provide a sufficient interface for integration with a resource management system like the grid engine system. To overcome this problem, the grid engine system provides an advanced parallel environment interface for tight integration with parallel environments. This parallel environment interface transfers the responsibility for creating tasks from the parallel environment to the grid engine software.

The distribution of the grid engine system contains two examples of such a tight integration, one for the PVM public domain version, and one for the MPICH MPI implementation from Argonne National Laboratory. The examples are contained in the directories sge-root/pvm and sge-root/mpi, respectively. The directories also contain README files that describe the usage and any current restrictions. Refer to those README files for more details.

For comparison, the sge-root/mpi/sunhpc/loose-integration directory contains a loose integration sample with Sun HPC ClusterTools software, and the sge-root/mpi directory contains a loosely integrated variant of the interfaces.

Note - Performing a tight integration with a parallel environment is an advanced task that can require expert knowledge of the parallel environment and the grid engine system parallel environment interface. You might want to contact your Sun support representative or distributor for assistance.

Configuring Checkpointing Environments

Checkpointing is a facility that does the following tasks:
If you move a checkpoint from one host to another host, checkpointing can migrate jobs or applications in a cluster without significant loss of resources. Hence, dynamic load balancing can be provided with the help of a checkpointing facility.

The grid engine system supports two levels of checkpointing: user-level checkpointing and kernel-level checkpointing.
Kernel-level checkpointing can be applied to complete jobs, that is, the process hierarchy created by a job. By contrast, user-level checkpointing is usually restricted to single programs. Therefore, the job in which such programs are embedded must properly handle cases where the entire job gets restarted.

Kernel-level checkpointing, as well as checkpointing based on checkpointing libraries, can consume many resources. The complete virtual address space that is in use by the job or application at the time of the checkpoint must be dumped to disk. By contrast, user-level checkpointing based on restart files can restrict the data that is written to the checkpoint to the important information only.

About Checkpointing Environments

The grid engine system provides a configurable attribute description for each checkpointing method used. Different attribute descriptions reflect the different checkpointing methods and the potential variety of derivatives from these methods on different operating system architectures.

This attribute description is called a checkpointing environment. Default checkpointing environments are provided with the distribution of the grid engine system and can be modified according to the site's needs.

New checkpointing methods can be integrated in principle. However, the integration of new methods can be a challenging task. This integration should be performed only by experienced personnel or by your grid engine system support team.

Configuring Checkpointing Environments With QMON

On the QMON Main Control window, click the Checkpoint Configuration button. The Checkpointing Configuration dialog box appears.

Viewing Configured Checkpointing Environments

To view previously configured checkpointing environments, select one of the checkpointing environment names listed under Checkpoint Objects. The corresponding configuration is displayed under Configuration.
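To illustrate the user-level checkpointing based on restart files that is described earlier in this section, the following Python sketch shows a program that saves only the state it needs in order to resume. The file name, the state layout, and the functions load_state, save_state, and run are hypothetical and are not part of the grid engine software. The sketch assumes that the job can write to a directory that survives a restart.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the restart file via a temporary file and an atomic rename,
    so that an interruption never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, steps, stop_after=None):
    """Sum the integers 0..steps-1, checkpointing after every step.

    stop_after simulates an interruption after that many steps in this
    invocation (hypothetical parameter, for demonstration only).
    Returns the total when finished, or None if interrupted.
    """
    state = load_state(path)
    done = 0
    while state["next_step"] < steps:
        if stop_after is not None and done >= stop_after:
            return None
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_state(path, state)
        done += 1
    return state["total"]
```

Only two integers are written per checkpoint, rather than the complete address space of the process. When the job is restarted, calling run with the same path continues from the last saved step.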
Adding a Checkpointing Environment

In the Checkpointing Configuration dialog box, click Add. The Add/Modify Checkpoint Object dialog box appears, along with a template configuration that you can edit.

Fill out the template with the requested information. Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.

Modifying Checkpointing Environments

In the Checkpoint Objects list, select the name of the configured checkpointing environment you want to modify, and then click Modify. The Add/Modify Checkpoint Object dialog box appears, along with the current configuration of the selected checkpointing environment.

The Add/Modify Checkpoint Object dialog box enables you to change the following information:
See the checkpoint(5) man page for details about these parameters.

In addition, you must define the Interface to use. The Interface is also called the checkpointing method. From the Interface list under Name, select an Interface. See the checkpoint(5) man page for details about the meaning of the different interfaces.

Note - For the checkpointing environments provided with the distribution of the grid engine system, change only the Name parameter and the Checkpointing Directory parameter.

Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.

Deleting Checkpointing Environments

To delete a configured checkpointing environment, select it, and then click Delete.

Configuring Checkpointing Environments From the Command Line

To configure the checkpointing environment from the command line, type the qconf command with the appropriate options. The following options are available:
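As an illustration of command-line configuration, the following Python sketch renders a checkpointing environment definition as text and assembles qconf argument lists without running them. It assumes the attribute names described in checkpoint(5) (ckpt_name, interface, ckpt_command, migr_command, restart_command, clean_command, ckpt_dir, signal, when) and the qconf options -Ackpt, -Mckpt, -dckpt, -sckpt, and -sckptl described in qconf(1). Verify both against the man pages of your installation, because attribute names can differ between releases. The helper names render_ckpt_config and qconf_command, the default values, and the sample names are hypothetical.

```python
# Attribute names are assumed from checkpoint(5); confirm them locally.
CKPT_DEFAULTS = {
    "interface": "userdefined",
    "ckpt_command": "none",
    "migr_command": "none",
    "restart_command": "none",
    "clean_command": "none",
    "ckpt_dir": "/tmp",
    "signal": "none",
    "when": "sx",
}

# qconf options assumed from qconf(1); confirm them locally.
QCONF_OPTIONS = {
    "add_from_file": "-Ackpt",
    "modify_from_file": "-Mckpt",
    "delete": "-dckpt",
    "show": "-sckpt",
    "list": "-sckptl",
}


def render_ckpt_config(settings):
    """Return the text of a checkpointing environment definition.

    settings must contain ckpt_name; other attributes fall back to the
    illustrative defaults above.
    """
    if "ckpt_name" not in settings:
        raise ValueError("ckpt_name is required")
    merged = {"ckpt_name": settings["ckpt_name"], **CKPT_DEFAULTS}
    merged.update(settings)
    return "".join(f"{key:<18} {value}\n" for key, value in merged.items())


def qconf_command(action, target=None):
    """Build the argument list for a qconf call without executing it."""
    argv = ["qconf", QCONF_OPTIONS[action]]
    if target is not None:
        argv.append(target)
    return argv
```

For example, writing the output of render_ckpt_config({"ckpt_name": "my_ckpt", "ckpt_dir": "/var/tmp"}) to a file named my_ckpt.conf and then running the argument list from qconf_command("add_from_file", "my_ckpt.conf") would, under these assumptions, register the environment with sge_qmaster.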