5    Managing Cluster Members

This chapter discusses the following topics:

For information on the following topics that are related to managing cluster members, see the TruCluster Server Cluster Installation manual:

For information about configuring and managing your Tru64 UNIX and TruCluster Server systems for availability and serviceability, see Managing Online Addition and Removal. This manual provides users with guidelines for configuring and managing any system for higher availability, with an emphasis on those capable of Online Addition and Replacement (OLAR) management of system components.

Note

As described in Managing Online Addition and Removal, the /etc/olar.config file is used to define system-specific policies and the /etc/olar.config.common file is used to define clusterwide policies. Any settings in a system's /etc/olar.config file override clusterwide policies in the /etc/olar.config.common file for that system only.

5.1    Managing Configuration Variables

The hierarchy of the /etc/rc.config* files lets you define configuration variables consistently over all systems within a local area network (LAN) and within a cluster. Table 5-1 presents the uses of the configuration files.

Table 5-1:  /etc/rc.config* Files

File: /etc/rc.config
Scope: Member-specific variables.

/etc/rc.config is a context-dependent symbolic link (CDSL). Each cluster member has a unique version of the file.

Configuration variables in /etc/rc.config override those in /etc/rc.config.common and /etc/rc.config.site.

File: /etc/rc.config.common
Scope: Clusterwide variables. These configuration variables apply to all members.

Configuration variables in /etc/rc.config.common override those in /etc/rc.config.site, but are overridden by those in /etc/rc.config.

File: /etc/rc.config.site
Scope: Sitewide variables, which are the same for all machines on the LAN.

Values in this file are overridden by any corresponding values in /etc/rc.config.common or /etc/rc.config.

By default, there is no /etc/rc.config.site. If you want to set sitewide variables, you have to create the file and copy it to /etc/rc.config.site on every participating system.

You must then edit /etc/rc.config on each participating system and add the following code just before the line that executes /etc/rc.config.common:

# Read in the cluster sitewide attributes
# before overriding them with the
# clusterwide and member-specific values.
#
. /etc/rc.config.site

For more information, see rcmgr(8).

The rcmgr command accesses these variables in a standard search order (first /etc/rc.config, then /etc/rc.config.common, and finally /etc/rc.config.site) until it finds or sets the specified configuration variable.

Use the -h option, followed by a member ID, to get or set the run-time configuration variables for a specific member. The command then acts on /etc/rc.config, the member-specific CDSL configuration file.

To make the command act clusterwide, use the -c option. The command then acts on /etc/rc.config.common, which is the clusterwide configuration file.

If you specify neither -h nor -c, then the member-specific values in /etc/rc.config are used.
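
For example, the following commands illustrate the difference in scope. The variable name MY_APP_ENABLED is hypothetical and is used here only for illustration:

# rcmgr -c set MY_APP_ENABLED YES
# rcmgr -h 2 set MY_APP_ENABLED NO
# rcmgr get MY_APP_ENABLED

The first command writes the clusterwide value to /etc/rc.config.common, the second overrides it for member 2 only in that member's /etc/rc.config, and the third returns the value in effect for the member on which it is run (the member-specific value if one is set, otherwise the clusterwide or sitewide value).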

For information about member-specific configuration variables, see Appendix B.

5.2    Managing Kernel Attributes

Each member of a cluster runs its own kernel and therefore has its own /etc/sysconfigtab file. This file contains static member-specific attribute settings. Although a clusterwide /etc/sysconfigtab.cluster file exists, its purpose is different from that of /etc/rc.config.common, and it is reserved for utilities that are shipped in the TruCluster Server product.

This section presents a partial list of those kernel attributes that are provided by each TruCluster Server subsystem.

Use the following command to display the current settings of these attributes for a given subsystem:

# sysconfig -q subsystem-name attribute-list

To get a list and the status of all the subsystems, use the following command:

# sysconfig -s
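
For example, the following command queries two typical clubase attributes; the attribute names and the output shown here are illustrative only, and the authoritative list of attributes is in sys_attrs_clubase(5):

# sysconfig -q clubase cluster_name cluster_expected_votes
clubase:
cluster_name = deli
cluster_expected_votes = 3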

In addition to the cluster-related kernel attributes presented here, two kernel attributes are set during cluster installation. Table 5-2 lists these kernel attributes. You can increase the values for these attributes, but do not decrease them.

Table 5-2:  Kernel Attributes Not to Decrease

Attribute               Value (Do Not Decrease)
vm_page_free_min        30
vm_page_free_reserved   20

Table 5-3 lists the subsystem names that are associated with each TruCluster Server component.

Table 5-3:  Configurable TruCluster Server Subsystems

Subsystem Name   Component                                                 For More Information
cfs              Cluster File System (CFS)                                 sys_attrs_cfs(5)
clua             Cluster alias                                             sys_attrs_clua(5)
clubase          Cluster base                                              sys_attrs_clubase(5)
cms              Cluster mount service                                     sys_attrs_cms(5)
cnx              Connection manager                                        sys_attrs_cnx(5)
dlm              Distributed lock manager                                  sys_attrs_dlm(5)
drd              Device request dispatcher                                 sys_attrs_drd(5)
hwcc             Hardware components cluster                               sys_attrs_hwcc(5)
icsnet           Internode communications service's network service       sys_attrs_icsnet(5)
ics_hl           Internode communications service (ICS) high level        sys_attrs_ics_hl(5)
mcs              Memory Channel application programming interface (API)   sys_attrs_mcs(5)
rm               Memory Channel                                            sys_attrs_rm(5)
token            CFS token subsystem                                       sys_attrs_token(5)

To tune the performance of a kernel subsystem, use one of the following methods to set one or more attributes in the /etc/sysconfigtab file:

You can also use the configuration manager framework, as described in the Tru64 UNIX System Administration manual, to change attributes and otherwise administer a cluster kernel subsystem on another host. To do this, set up the host names in the /etc/cfgmgr.auth file on the remote client system and then specify the -h option to the /sbin/sysconfig command, as in the following example:

# sysconfig -h fcbra13 -r drd drd-do-local-io=0
 
drd-do-local-io: reconfigured
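
As described in the Tru64 UNIX System Administration manual, the /etc/cfgmgr.auth file on the remote client system lists, one per line, the hosts that are allowed to manage it remotely. For example, for the preceding command to succeed, /etc/cfgmgr.auth on fcbra13 would typically contain the full host name of the cluster member issuing the command; the name below is hypothetical:

member1.example.com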

5.3    Managing Remote Access Within and From the Cluster

An rlogin, rsh, or rcp command from the cluster uses the default cluster alias as the source address. Therefore, if a noncluster host must allow remote access from any account in the cluster, the .rhosts file on that noncluster host must include the cluster alias name in one of the forms by which it is listed in the /etc/hosts file or one resolvable through Network Information Service (NIS) or Domain Name System (DNS).
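
For example, if the default cluster alias is deli (a hypothetical name), an account on the noncluster host could allow access from the matching account in the cluster with an .rhosts entry such as the following; use whichever form of the alias name the noncluster host can resolve:

deli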

The same requirement holds for rlogin, rsh, or rcp to work between cluster members. At cluster creation, the clu_create utility prompts for all required host names and puts them in the correct locations in the proper format. The clu_add_member command does the same when a new member is added to the cluster. You do not need to edit /.rhosts to enable /bin/rsh commands from a cluster member to the cluster alias or between individual members. Do not change the generated name entries in /etc/hosts and /.rhosts.

If the /etc/hosts and /.rhosts files are configured incorrectly, many applications will not function properly. For example, the Advanced File System (AdvFS) rmvol and addvol commands use rsh when the member where the commands are executed is not the server of the domain. These commands fail if /etc/hosts or /.rhosts is configured incorrectly.

The following error indicates that the /etc/hosts or /.rhosts file has been configured incorrectly:

# rsh cluster-alias date
Permission denied.
 

5.4    Shutting Down the Cluster

To halt all members of a cluster, use the -c option to the shutdown command. For example, to shut down the cluster in 5 minutes, enter the following command:

# shutdown -c +5 Cluster going down in 5 minutes
 

For information on shutting down a single cluster member, see Section 5.5.

During the shutdown grace period, which is the time between when the cluster shutdown command is entered and when actual shutdown occurs, the clu_add_member command is disabled and new members cannot be added to the cluster.

To cancel a cluster shutdown during the grace period, kill the processes that are associated with the shutdown command as follows:

  1. Get the process identifiers (PIDs) that are associated with the shutdown command. For example:

    # ps ax | grep -v grep | grep shutdown
     14680 ttyp5    I <    0:00.01 /usr/sbin/shutdown +20 going down
    

    Depending on how far along shutdown is in the grace period, ps might show either /usr/sbin/shutdown or /usr/sbin/clu_shutdown.

  2. Terminate all shutdown processes by specifying their PIDs in a kill command from any member. For example:

    # kill 14680
     
    

If you kill the shutdown processes during the grace period, the shutdown is canceled.

The shutdown -c command fails if a clu_quorum, clu_add_member, clu_delete_member, or clu_upgrade operation is in progress.

There is no clusterwide reboot. The shutdown -r command, the reboot command, and the halt command act only on the member on which they are executed. The halt, reboot, and init commands have been modified to leave file systems in a cluster mounted, so the cluster continues functioning when one of its members is halted or rebooted, as long as it retains quorum.

For more information, see shutdown(8).

5.5    Shutting Down and Starting One Cluster Member

When booting a member, you must boot from the boot disk that was created by the clu_add_member command. You cannot boot from a copy of the boot disk.

Shutting down a single cluster member is more complex than shutting down a standalone server. If you halt a cluster member whose vote is required for quorum (referred to as a critical voting member), the cluster will lose quorum and hang. As a result, you will be unable to enter commands from any cluster member until you reboot the halted member. Therefore, before you shut down a cluster member, you must first determine whether that member's vote is required for quorum.

5.5.1    Identifying a Critical Voting Member

A cluster that contains a critical voting member is either operating in a degraded mode (for example, one or more voting members or a quorum disk is down) or was not configured for availability to begin with (for example, it is a two-member configuration with each member assigned a vote). Removing a critical voting member from a cluster causes the cluster to hang and compromises availability. Before halting or deleting a cluster member, ensure that it is not supplying a critical vote.

To determine whether a member is a critical voting member, follow these steps:

  1. If possible, make sure that all voting cluster members are up.

  2. Enter the clu_quorum command and note the running values of current votes, quorum votes, and the node votes of the member in question.

  3. Subtract the member's node votes from the current votes. If the result is less than the quorum votes, the member is a critical voting member and you cannot shut it down without causing the cluster to lose quorum and hang.
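
For example, suppose that clu_quorum reports 3 current votes and 2 quorum votes for the cluster, and that the member you want to halt has 1 node vote (all hypothetical values). Then 3 - 1 = 2, which is not less than the 2 quorum votes, so the member is not critical and can be halted safely. If the cluster were instead running with only 2 current votes, the result would be 2 - 1 = 1, which is less than 2, and the member would be a critical voting member.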

5.5.2    Preparing to Halt or Delete a Critical Voting Member

Before halting or deleting a critical voting member, ensure that its votes are no longer critical to the cluster retaining quorum. The best way to do this involves restoring node votes or a quorum disk vote to the cluster without increasing expected votes. Some ways to accomplish this are:

If the cluster has an even number of votes, adding a new voting member or configuring a quorum disk can also make a critical voting member noncritical. In these cases, expected votes is incremented, but quorum votes remains the same.
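
For example, if the cluster is running in a degraded mode because another voting member is down, simply booting that member from its console restores its vote without changing expected votes, and can make the member that you intend to halt noncritical:

>>> boot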

5.5.3    Halting a Noncritical Member

A noncritical member, one with no vote or whose vote is not required to maintain quorum, can be shut down, halted, or rebooted like a standalone system.

Execute the shutdown command on the member to be shut down. To halt a member, enter the following command:

# shutdown -h time
 

To reboot a member, enter the following command:

# shutdown -r time
 

For information on identifying critical voting members, see Section 5.5.1.

5.5.4    Shutting Down a Hosting Member

The cluster application availability (CAA) profile for an application allows you to specify an ordered list of members, separated by white space, that can host the application resource. The hosting members list is used in conjunction with the application resource's failover policy (favored or restricted), as discussed in caa(4).

If the cluster member that you are shutting down is the only hosting member for one or more applications with a restricted placement policy, you must specify another hosting member; otherwise, those applications cannot run while the member is down. You can add an additional hosting member, or replace the existing hosting member with another.

To do this, perform the following steps; an end-to-end example using hypothetical names follows the list:

  1. Verify the current hosting members and placement policy.

    # caa_profile -print resource-name
    

  2. If the cluster member that you are shutting down is the only hosting member, you can add an additional hosting member to the hosting members list, or replace the existing member.

    # caa_profile -update resource-name -h hosting-member another-hosting-member
    # caa_profile -update resource-name -h hosting-member
     
    

  3. Update the CAA registry entry with the latest resource profile.

    # caa_register -u resource-name
    

  4. Relocate the application to the other member.

    # caa_relocate resource-name -c member-name
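
For example, assume a hypothetical application resource named webapp with a restricted placement policy whose only hosting member is member1, the member you intend to shut down. The following sequence, which follows the syntax shown in the steps above and uses illustrative names throughout, adds member2 to the hosting members list, updates the registry, and relocates the application:

# caa_profile -print webapp
# caa_profile -update webapp -h member1 member2
# caa_register -u webapp
# caa_relocate webapp -c member2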
    

5.6    Shutting Down a Cluster Member to Single-User Mode

If you need to shut down a cluster member to single-user mode, you must first halt the member and then boot it to single-user mode. Shutting down the member in this manner ensures that the member provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode. In particular, halting the member satisfies services that require the cluster member to have a status of DOWN before completing a service failover. If you do not first halt the cluster member, the services do not fail over as expected.

To take a cluster member to single-user mode, use the shutdown -h command to halt the member, and then boot the member to single-user mode. When the system reaches single-user mode, run the init s, bcheckrc, and lmf reset commands. For example:

Note

Before halting a cluster member, make sure that the cluster can maintain quorum without the member's vote.

# /sbin/shutdown -h now

>>> boot -fl s

# /sbin/init s
# /sbin/bcheckrc
# /usr/sbin/lmf reset
 

A cluster member that is shut down to single-user mode (that is, not shut down to a halt and then booted to single-user mode as recommended) continues to have a status of UP. Shutting down a cluster member to single-user mode in this manner does not affect the voting status of the member: a member contributing a vote before being shut down to single-user mode continues contributing the vote in single-user mode.

5.7    Deleting a Cluster Member

The clu_delete_member command permanently removes a member from the cluster.

Caution

If you are reinstalling TruCluster Server, see the TruCluster Server Cluster Installation manual. Do not delete a member from an existing cluster and then create a new single-member cluster from the member that you just deleted. If the new cluster has the same name as the old cluster, the newly installed system might join the old cluster. This can cause data corruption.

The clu_delete_member command has the following syntax:

/usr/sbin/clu_delete_member [-f] [-m memberid]

If you do not supply a member ID, the command prompts you for the member ID of the member to delete.

The clu_delete_member command does the following:

To delete a member from the cluster, follow these steps:

  1. Determine whether the member is a critical voting member of the cluster. If the member supplies a critical vote to the cluster, halting it will cause the cluster to lose quorum and suspend operations. Before halting the member, use the procedure in Section 5.5 to determine whether it is safe to do so.

  2. Halt the member to be deleted.

  3. If possible, make sure that all voting cluster members are up.

  4. Use the clu_delete_member command from another member to remove the member from the cluster. For example, to delete a halted member whose member ID is 3, enter the following command:

    # clu_delete_member -m 3
    

  5. If the boot disk for the member is inaccessible when you run clu_delete_member, the command displays a message to that effect.

    In that case, if the member being deleted is a voting member, you must manually lower the expected votes for the cluster by one after the member is deleted. Do this with the following command:

    # clu_quorum -e expected-votes
    

    Note

    This step applies only when the member boot disk cannot be accessed by clu_delete_member and the member that is being deleted is a voting member.
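
    For example, if the deleted member contributed 1 vote and the cluster's expected votes value was 3 before the deletion (hypothetical numbers), lower expected votes to 2:

    # clu_quorum -e 2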

For an example of the /cluster/admin/clu_delete_member.log that results when a member is deleted, see Appendix C.

5.8    Removing a Cluster Member and Restoring It as a Standalone System

To restore a cluster member as a standalone system, follow these steps:

  1. Halt and delete the member by following the procedures in Section 5.5 and Section 5.7.

  2. Physically disconnect the halted member from the cluster: disconnect its Memory Channel and storage connections.

  3. On the halted member, select a disk that is local to the member and install Tru64 UNIX. See the Tru64 UNIX Installation Guide for information on installing system software.

For information about moving clusterized Logical Storage Manager (LSM) volumes to a noncluster system, see Section 10.5.

5.9    Changing the Cluster Name or IP Address

Changing the name of a cluster requires a shutdown and reboot of the entire cluster. Changing the IP address of a cluster requires that you shut down and reboot each member individually.

To change the cluster name, follow these steps carefully. Any mistake can prevent the cluster from booting.

  1. Create a file with the new cluster_name attribute for the clubase subsystem stanza entry. For example, to change the cluster name to deli, add the following clubase subsystem stanza entry:

    clubase:
     cluster_name=deli
     
    

    Notes

    Ensure that you include a line-feed at the end of each line in the file that you create. If you do not, when the sysconfigtab file is modified, you will have two attributes on the same line. This may prevent your system from booting.

    If you create the file in the cluster root directory, you can use it on every system in the cluster without a need to copy the file.

  2. On each cluster member, use the sysconfigdb -m -f file clubase command to merge the new clubase subsystem attributes from the file that you created with the clubase subsystem attributes in that member's /etc/sysconfigtab file.

    For example, assume that the file cluster-name-change contains the information shown in the example in step 1. To use the file cluster-name-change to change the cluster name from poach to deli, use the following command:

    # sysconfigdb -m -f cluster-name-change clubase
    Warning: duplicate attribute in clubase: 
    was cluster_name = poach, now cluster_name = deli
     
    

    Caution

    Do not use the sysconfigdb -u command with a file that contains only the one or two attributes to be changed. The -u flag causes the subsystem entry in the input file to replace the existing subsystem entry (for instance, clubase) in /etc/sysconfigtab. If you specify only the cluster_name attribute for the clubase subsystem, the new clubase entry will contain only the cluster_name attribute and none of the other required attributes.
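
    After the merge, you can confirm the clubase entry on each member by listing it from that member's /etc/sysconfigtab; this assumes the -l (list) option of sysconfigdb described in sysconfigdb(8):

    # sysconfigdb -l clubase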

  3. Change the cluster name in each of the following files:

    There is only one copy of these files in a cluster.

  4. Add the new cluster name to the /.rhosts file (which is common to all cluster members).

    Leave the current cluster name in the file. The current name is needed for the shutdown -c command in the next step to function.

    Change any client .rhosts file as appropriate.

  5. Shut down the entire cluster with the shutdown -c command and reboot each system in the cluster.

  6. Remove the previous cluster name from the /.rhosts file.

  7. To verify that the cluster name has changed, run the /usr/sbin/clu_get_info command:

    # /usr/sbin/clu_get_info
    Cluster information for cluster deli    
    
    .
    .
    .

5.9.1    Changing the Cluster IP Address

To change the cluster IP address, follow these steps:

  1. Edit the /etc/hosts file, and change the IP address for the cluster.

  2. One at a time (to keep quorum), shut down and reboot each cluster member system.

To verify that the cluster IP address has changed, run the /usr/sbin/ping command from a system that is not in the cluster to ensure that the cluster provides the echo response when you use the cluster address:

# /usr/sbin/ping -c 3 16.160.160.160
PING 16.160.160.160 (16.160.160.160): 56 data bytes
64 bytes from 16.160.160.160: icmp_seq=0 ttl=64 time=26 ms
64 bytes from 16.160.160.160: icmp_seq=1 ttl=64 time=0 ms
64 bytes from 16.160.160.160: icmp_seq=2 ttl=64 time=0 ms
 
----16.160.160.160 PING Statistics----
3 packets transmitted, 3 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/9/26 ms
 

5.10    Changing the Member Name, IP Address, or Cluster Interconnect Address

To change the member name, member IP address, or member cluster interconnect address, you must remove the member from the cluster and then add it back in with the desired member name or address. Do this as follows:

  1. Halt the member. See Section 5.5 for information on shutting down a single cluster member.

  2. On an active member of the cluster, delete the member that you just shut down. Do this by running the clu_delete_member command:

    # clu_delete_member -m memberid
     
    

    To learn the member ID of the member to be deleted, use the clu_get_info command.

    See Section 5.7 for details on using clu_delete_member.

  3. Use the clu_add_member command to add the system back into the cluster, specifying the desired member name, member IP address, and cluster interconnect address.

    For details on adding a member to the cluster, see the TruCluster Server Cluster Installation manual.

5.11    Managing Software Licenses

When you add a new member to a cluster, you must register application licenses on that member for those applications that may run on that member.

For information about adding new cluster members and Tru64 UNIX licenses, see the chapter on adding members in the TruCluster Server Cluster Installation manual.

5.12    Installing and Deleting Layered Applications

The procedure to install or delete an application is usually the same for both a cluster and a standalone system. In general, an application needs to be installed only once for the entire cluster. However, some applications require additional steps.

5.13    Managing Accounting Services

The system accounting services are not cluster-aware. The services rely on files and databases that are member-specific. Because of this, to use accounting services in a cluster, you must set up and administer the services on a member-by-member basis.

The /usr/sbin/acct directory is a CDSL. The accounting services files in /usr/sbin/acct are specific to each cluster member.

To set up accounting services on a cluster, use the following modifications to the directions in the chapter on administering system accounting services in the Tru64 UNIX System Administration manual:

  1. You must enable accounting services on each cluster member where you want accounting to run. To enable accounting on all cluster members, enter the following command; because the -c option sets the clusterwide value in /etc/rc.config.common, you need to enter it only once, from any member:

    # rcmgr -c set ACCOUNTING YES
    

    If you want to enable accounting on only certain members, use the -h option to the rcmgr command. For example, to enable accounting on members 2, 3, and 6, enter the following commands:

    # rcmgr -h 2 set ACCOUNTING YES
    # rcmgr -h 3 set ACCOUNTING YES
    # rcmgr -h 6 set ACCOUNTING YES
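
    To confirm the setting for a particular member, read the variable back with rcmgr get; member 2 is used here as an example:

    # rcmgr -h 2 get ACCOUNTING
    YES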
     
    

  2. You must start accounting on each member. Log in to each member where you want to start accounting, and enter the following command:

    # /usr/sbin/acct/startup
    

    To stop accounting on a member, you must log in to that member and run the command /usr/sbin/acct/shutacct.

The directory /usr/spool/cron is a CDSL; the files in this directory are member-specific, and you can use them to tailor accounting on a per-member basis. To do so, log in to each member where accounting is to run. Use the crontab command to modify the crontab files as desired. For more information, see the chapter on administering the system accounting services in the Tru64 UNIX System Administration manual.

The file /usr/sbin/acct/holidays is a CDSL. Because of this, you set accounting service holidays on a per-member basis.

For more information on accounting services, see acct(8).