Configuring Shadow Master Hosts

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon sge_qmaster has failed abnormally, it starts up a new sge_qmaster on the host where the shadow master daemon is running.

Note - If the master daemon is shut down gracefully, the shadow master daemon does not take over. If you want the shadow master daemon to take over after you shut down the master daemon gracefully, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is sge-root/cell/spool/qmaster.

The automatic failover start of a sge_qmaster on a shadow master host takes approximately one minute. Until the new sge_qmaster is running, every grid engine system command returns an error message.

Note - The file sge-root/cell/common/act_qmaster contains the name of the host that is currently running the sge_qmaster daemon.

Shadow Master Host Requirements

To prepare a host as a shadow master, the following requirements must be met:
As soon as these requirements are met, the shadow-master-host facility is activated for this host. No restart of grid engine system daemons is necessary to activate the feature.

Shadow Master Hosts File

The shadow master host name file, sge-root/cell/common/shadow_masters, contains the following:
The format of the shadow master hostname file is as follows:
The order of the shadow master hosts is significant. The primary master host is the first line in the file. If the primary master host fails, the shadow master defined in the second line takes over. If this shadow master also fails, the shadow master defined in the third line takes over, and so forth.

Starting Shadow Master Hosts

To start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.

In very rare circumstances, it might be impossible to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 8, Fine Tuning, Error Messages, and Troubleshooting. In addition, every attempt to open a TCP connection to a sge_qmaster daemon fails permanently. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See Restarting Daemons From the Command Line.

Configuring Shadow Master Hosts Environment Variables

Three environment variables affect the takeover time for a shadow master:
These variables interact in the following way.
A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. With these values, the takeover occurs after about 2 minutes.

To check the operation of the shadow host after you have configured these environment variables, you have to disconnect the master host's network cable to simulate a failure.

Configuring Hosts

N1 Grid Engine 6 software (grid engine software) maintains object lists for all types of hosts except for the master host. The lists of administration host objects and submit host objects indicate whether a host has administrative or submit permission. The execution host objects include other parameters. Among these parameters are the load information that is reported by the sge_execd running on the host, and the load parameter scaling factors that are defined by the administrator.

You can configure host objects with QMON or from the command line. QMON provides a set of host configuration dialog boxes that are invoked by clicking the Host Configuration button on the QMON Main Control window. The Host Configuration dialog box has four tabs:
The qconf command provides the command-line interface for managing host objects.
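
As a rough illustration of the command-line route, the following Python sketch wraps the qconf options that list each kind of host object: -sh for administration hosts, -ss for submit hosts, and -sel for execution hosts. The helper names qconf_list_command and list_hosts are hypothetical and not part of the grid engine software, and the sketch assumes that qconf is on the PATH and that the grid engine environment is already set up for the calling user.

```python
import subprocess

# qconf options that list the hosts of each type.
# The master host has no object list, so it does not appear here.
LIST_FLAGS = {
    "admin": "-sh",       # show administration hosts
    "submit": "-ss",      # show submit hosts
    "execution": "-sel",  # show execution host list
}


def qconf_list_command(host_type):
    """Return the qconf argument vector that lists hosts of the given type.

    Hypothetical helper for illustration only.
    """
    try:
        return ["qconf", LIST_FLAGS[host_type]]
    except KeyError:
        raise ValueError(f"unknown host type: {host_type!r}") from None


def list_hosts(host_type):
    """Run qconf and return one host name per non-empty output line.

    Assumes qconf is on the PATH and the cluster is reachable.
    """
    result = subprocess.run(
        qconf_list_command(host_type),
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Keeping command construction separate from execution lets you inspect the exact qconf invocation without contacting a cluster. For example, qconf_list_command("submit") returns the argument list for qconf -ss, while list_hosts("submit") actually runs it.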