The following sections describe issues and known problems with TruCluster Server
Version 5.1A.
3.1 Installation
The information in this section applies to installation.
3.1.1 Update to Latest Firmware Before Installing Tru64 UNIX Version 5.1A
Before installing Tru64 UNIX and TruCluster Server Version 5.1A, update all systems that will become cluster members with the latest firmware. A cluster member running old firmware may not be able to use some hardware connected to the cluster. For example, with old firmware, a member with a boot disk behind an HSZ80 or HSG80 controller may fail to boot, indicating "Reservation Conflict" errors.
To update a system's firmware, do the following:
Insert the firmware CD-ROM in the drive and boot from it:
>>> boot cdrom_console_device_name
The firmware update utility automatically identifies your system type and model and determines the correct firmware revision required for your system.
Follow the instructions on the screen. The READ-ME-FIRST file, which describes the firmware changes included in the update, is displayed automatically.
When the firmware update is complete, power off the processor for at least 10 seconds to initialize the new firmware.
If you do not have access to a firmware CD-ROM, you can find the latest firmware at the following URL:
ftp.digital.com/pub/Digital/Alpha/firmware/readme.html
You can download the firmware and associated documentation
with the anonymous File Transfer Protocol (FTP).
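For example, a typical anonymous FTP session might look like the following hedged sketch (the directory path comes from the URL above; follow the links in the readme.html file to locate the actual firmware kits for your system):
# ftp ftp.digital.com
Name: anonymous
Password: your-email-address
ftp> cd /pub/Digital/Alpha/firmware
ftp> get readme.html
ftp> bye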
3.1.2 Do Not Use the Installation Branch of the Software Option of the SysMan Menu
The Installation branch of the Software menu of the SysMan Menu
application is not supported in a cluster.
Use the mechanisms for installing
and deinstalling the TruCluster Server product and layered product
software that are discussed in the TruCluster Server
Cluster Installation
and
Cluster Administration
manuals.
3.2 Cluster Creation and Member Addition
Information in this section applies
to creating a cluster and adding cluster members.
3.2.1 Conflict in Use of the Default Physical Cluster Interconnect IP Name
The default physical cluster interconnect IP name has the form
membermemberID-icstcp0
.
The
clu_create
and
clu_add_member
commands use
ping
to determine whether
the default name is already in use on the net.
If this check
finds a host already using the default IP name, the command fails.
Before failing, it displays the following prompt:
Enter the physical cluster interconnect interface device name []
Depending on which command was executing at the time of failure, you then get one of the following messages:
Error: clu_create: Bad configuration
Error: clu_add_member: Bad configuration
If you see either of these messages, look in
/cluster/admin/clu_create.log
or
/cluster/admin/clu_add_member.log
, as appropriate,
for the following error message:
Error: A system with the name 'membermemberID-icstcp0' is currently running on your network.
If you find this message, contact your network administrator
about changing the hostname of the system already
using the default IP name,
because the
clu_create
and
clu_add_member
commands do not allow you to change the
default physical cluster interconnect IP name.
3.2.2 Misleading LAN Interconnect Information Provided by clu_create
When you select
Help
from the cluster interconnect selection,
the information displayed by
clu_create
implies
that Ethernet hubs used in a LAN interconnect must operate in
full-duplex mode.
In fact, 100 Mb/sec Ethernet hubs in half-duplex mode
are supported, subject to certain restrictions specified in the
Cluster LAN Interconnect
manual.
More generally, a
cluster must have a dedicated cluster interconnect to which all
members are connected.
The cluster interconnect serves as the primary
communications channel between cluster members.
For hardware, the cluster
interconnect can use either Memory Channel or a private LAN.
See
the
Cluster Hardware Configuration
manual
and the
Cluster LAN Interconnect
manual for configuration details.
3.2.3 The clu_create Command Does Not Add the First Member's Fully Qualified Hostname to the /etc/cfgmgr.auth File
The
clu_create
command fails to add the first member's fully
qualified hostname to the
/etc/cfgmgr.auth
file.
The
clu_add_member
command, however, does add subsequent
members' hostnames to the file.
To avoid problems with remote kernel configuration management in a cluster,
manually add the first member's fully qualified hostname to the
/etc/cfgmgr.auth
file.
For example:
member1.zk3.dec.com
member2.zk3.dec.com
member3.zk3.dec.com
3.2.4 Re-adding a Member with clu_add_member -c Does Not Correctly Configure NetRAIN
If you have configured NetRAIN in a cluster with a LAN interconnect, and
you re-add a member via the
clu_add_member -c
command using the configuration file created when the member was
last added to the cluster, NetRAIN is not configured correctly for
the re-added member.
When you boot the re-added member, you may see a message like the following:
CNX MGR: cannot form: quorum disk is in use. Unable to establish contact with members using disk.
To resolve this problem, edit
/etc/sysconfigtab
on the re-added member, and change the lines in the
ics_ll_tcp
stanza.
The value for
ics_tcp_adapter0
incorrectly lists
the device names of the Ethernet interfaces.
Set
ics_tcp_adapter0
to
nr0
:
ics_tcp_adapter0=nr0
For each network adapter that is in the NetRAIN set, assign
the device name of the adapter to an
ics_tcp_nr0
array member.
For example, if the line in
/etc/sysconfigtab
looks like the following:
ics_tcp_adapter0=ee0,ee1,ee2,ee3
Then you would change it as follows:
ics_tcp_adapter0=nr0
ics_tcp_nr0[0]=ee0
ics_tcp_nr0[1]=ee1
ics_tcp_nr0[2]=ee2
ics_tcp_nr0[3]=ee3
3.2.5 Run /sbin/kreg After Building Kernel with No Kernel Layered Products
During cluster creation, if you build a clusterized kernel with no
kernel layered products,
you must then rebuild
/usr/sys/conf/.product.list
,
which is the registration file for kernel layered products.
To do this, run the following command:
# /sbin/kreg -l DEC TruCluster /usr/opt/TruCluster/sys
The situation where you build a clusterized kernel with no
kernel layered products can arise if
the initial build of the clusterized kernel fails.
If
that happens, you are offered as a default the option of building the
kernel with no kernel layered products.
3.3 Booting and Shutdown
This section discusses requirements and restrictions for booting members
into a cluster and for shutting down cluster members.
3.3.1 Do Not Reboot All Members Simultaneously
If you attempt to reboot all cluster members simultaneously, one or more members will hang during shutdown due to quorum loss, and the other rebooting members may fail to boot because they cannot rejoin the cluster held by the hung nodes.
The method you use to reboot the entire cluster depends on your intent:
To reboot all cluster nodes without halting the cluster, reboot one member at a time, allowing the rebooting member to rejoin the cluster before you reboot the next member.
To reboot the entire cluster (cluster state will be lost),
shut down the entire cluster with the
shutdown -c
command and then boot the members.
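For example, to shut down and reboot the entire cluster (a minimal sketch; the console boot command and boot device are site-specific):
# shutdown -c now
Then, at each member's console:
>>> boot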
3.3.2 AlphaServer GS80/160/320 May Hang on Boot
The following applies only to an AlphaServer GS80, GS160, or GS320 that is a member in a cluster that uses Memory Channel for the cluster interconnect.
If a cluster member that is an AlphaServer GS80, GS160, or GS320
hangs during boot and displays a message similar to one of the following,
you must halt and reset the member, and then boot it again:
>>> Registering CMS Services
or
>>> Registering CMS Services
>>> rm_meet_rail_requirements: No access to physical rail 0
For example, if you are managing the member from the partition's console head (connected to the SCM/SRM port of the hung partition), do the following:
Use the
[Esc]
sequence to get to the
prompt for the system control manager (SCM):
[Esc][Esc]scm
At the prompt for SCM,
halt
the partition and
quit
:
SLV_E1> halt
Slave request to master
SLV_E1> quit
At the system reference manual (SRM) console prompt, reset the system:
P00>>> reset
A
reset
from SRM is recommended because,
if the system is partitioned, an SRM
reset
acts only on the partition where the command executes.
Boot the system.
3.3.3 A Member May Hang on Boot in a Cluster Using a KZPBA-CB SCSI Bus Adapter
When you boot a member of a cluster with a KZPBA-CB SCSI bus adapter, the member may hang during the boot. The console log for that member will display messages similar to the following:
cam_logger: SCSI event packet
cam_logger: bus 10 target 15 lun 0
ss_perform_timeout
timeout on disconnected request
Active CCB at time of error
cam_logger: SCSI event packet
cam_logger: bus 10 target 15 lun 0
isp_process_abort_queue
IO abort failure (mailbox status 0x0), chip reinit scheduled
Active CCB at time of error
cam_logger: SCSI event packet
cam_logger: bus 10
isp_reinit
Beginning Adapter/Chip reinitialization (0x3)
cam_logger: SCSI event packet
cam_logger: bus 10
isp_reinit
Fatal reinit error 1: Unable to bring Qlogic chip back online
If you see messages like this, you must reset the system and then boot it again.
On an AlphaServer GS80, GS160, or GS320, you can perform a
system control manager (SCM)
halt
and
reset
.
(See
Section 3.3.2
for specifics.)
On other systems, you will have to do a hardware reset before booting
again.
If the reset fails to clear up the problem, you must cycle the power.
On an AlphaServer GS80, GS160, or GS320, you can do this with the
SCM
power off
and
power on
commands.
3.3.4 Booting Member Hangs While Setting Time and Date with ntpdate
A booting member may hang while attempting to use
ntpdate
to set the time and date.
The last message before the hang is as follows:
Setting the current time and date with ntpdate
Press [Ctrl/C], and the boot process will continue normally.
If the cluster uses the suggested configuration, it is
running xntpd, the Network Time Protocol (NTP) daemon.
In this situation, the member will get the time despite the hang.
3.3.5 CNX Panic During Boot
When you boot a member in a cluster with a large storage configuration, the member may panic and display the following message:
CNX MGR: Invalid configuration for cluster seq disk
If this occurs, reboot the member.
3.3.6 Member Hangs During Clusterwide Shutdown
This note applies only to clusters with a LAN interconnect.
During clusterwide shutdown, a member might hang for
ten minutes or more before finally shutting down.
If this delay is
inconvenient, use the halt button to shut down the member.
3.3.7 Booting a New Member Without a Cluster License Displays ATTENTION Message
When you boot a newly added member, the
clu_check_config
utility performs a series of configuration checks.
If you have not
yet installed the TruCluster Server license, the
TCS-UA
product authorization key (PAK), on the member, the boot procedure will
display the following messages:
Starting Cluster Configuration Check...
The boottime cluster check found a potential problem.
For details search for !!!!!ATTENTION!!!!! in /cluster/admin/clu_check_log_hostname
check_cdsl_config : Boot Mode : Running /usr/sbin/cdslinvchk in the background
check_cdsl_config : Results can be found in : /var/adm/cdsl_check_list
clu_check_config : no configuration errors or warnings were detected
When you inspect the
/cluster/admin/clu_check_log_hostname
file, the following message is displayed:
/usr/sbin/caad is NOT_RUNNING !!!!!ATTENTION!!!!!
When the TruCluster Server license is not configured on a member, the
cluster application availability (CAA)
daemon (caad
) is not
automatically started on that member.
This is normal and expected
behavior.
If you did not configure the license from within
clu_add_member
when you added the new member (as
discussed in the TruCluster Server
Cluster Installation
manual), you can
configure it later using the
lmf register
command.
After
the license has been installed, you can start the CAA daemon on that member
using the
/usr/sbin/caad
command.
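For example, a minimal hedged sketch of the sequence on the new member (the lmf reset step is an assumption, included by analogy with Section 3.4.6; enter the TCS-UA PAK data when lmf register prompts for it):
# lmf register
# lmf reset
# /usr/sbin/caad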
3.3.8 Boot All Cluster Members Before Starting Applications That Use the Memory Channel API Library
If there is insufficient Memory Channel address space in a cluster, a booting
node may have problems joining the cluster.
When this is the case, one or
more members may panic with an assertion failure (ICS MCT Assertion
Failed
), or the booting member may hang early in its boot.
Memory Channel resources are dynamically allocated as new members join the
cluster.
Running applications that call the Memory Channel application
programming interface (API) library functions can consume required
Memory Channel resources, and prevent a member from getting the resources it
needs to join the cluster.
To avoid this problem, boot all cluster members
before starting any applications that call the Memory Channel API library functions.
3.3.9 "Cannot make specified change" Message During Boot
If your cluster is not configured to use Network Information Service (NIS), the following error messages are displayed when you first boot the cluster:
/cluster/admin/run/C30niscluster: Cannot make specified change.
/cluster/admin/run/C30niscluster : Bad exit code : 1
You can safely ignore these messages.
To prevent these messages from being displayed,
remove the
C30niscluster
symbolic link and then recreate it, as follows:
# rm /cluster/admin/run/C30niscluster
# ln -sf /usr/ucb/true /cluster/admin/run/C30niscluster
3.3.10 Booting a Member During Disaster Recovery
When, as part of a disaster recovery process, you boot the first cluster member, the cluster may fail to form due to lack of quorum. This can happen because the cluster expected votes value is greater than the number of voting members.
If this situation occurs, you must boot one member
interactively and set the value of the
cluster_expected_votes
variable of the
clubase
kernel subsystem.
For example:
Reboot one cluster member as follows:
>>> boot -fl "ia"
(boot dkb200.2.0.7.0 -flags ia)
block 0 of dkb200.2.0.7.0 is a valid boot block
reading 18 blocks from dkb200.2.0.7.0
bootstrap code read in
base = 200000, image_start = 0, image_bytes = 2400
initializing HWRPB at 2000
initializing page table at fff0000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code

Tru64 UNIX boot - Mon Jan 4 14:08:41 EDT 2000

Enter kernel_name [option_1 ... option_n]
Press Return to boot default kernel 'vmunix':vmunix clubase:cluster_expected_votes=1[Return]
Reboot the other surviving cluster members in a similar fashion,
incrementing the interactive boot value of
cluster_expected_votes
with each
member you boot into the cluster.
For example, when booting the second
cluster member, specify the following when the interactive boot procedure
prompts for the kernel name and options:
Enter kernel_name [option_1 ... option_n]
Press Return to boot default kernel 'vmunix':vmunix clubase:cluster_expected_votes=2[Return]
Adjust cluster votes or membership based on your expectations of when
the lost members can be restored.
If they will be restored soon, use the
clu_quorum
command on a current member to
temporarily remove their votes.
If they will remain down for a long
time, use the
clu_delete_member
command to
remove them from the cluster.
You may need to configure a quorum disk in the
surviving cluster to improve its availability.
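For example, a hedged sketch of checking the current settings and then lowering expected votes on a surviving member (the value 2 is illustrative; see clu_quorum(8) for the exact options):
# clu_quorum
# clu_quorum -e 2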
For more information, see the section on
troubleshooting unfortunate expected vote settings
in the chapter on managing cluster membership
in the
Cluster Administration
manual.
3.4 Cluster Configuration
This section discusses problems with adding and deleting
members, and configuring a quorum disk.
3.4.1 Configurations That Support Ethernet Hubs
The Cluster LAN Interconnect manual describes the Ethernet hardware that can be configured as a LAN interconnect. However, it is ambiguous about which configurations support Ethernet hubs.
All Ethernet hubs (also known as shared hubs to distinguish them from Ethernet switches) run in half-duplex mode. As a result, when a hub is used in a LAN interconnect, the Ethernet adapters connected to it must be set to (or must autonegotiate) 100 Mb/sec, half-duplex mode. (See Section 4.7.1 of the Cluster LAN Interconnect manual for additional information on how to accomplish this for the DE50x and DE60x families of adapters.)
Use of an Ethernet hub in a LAN interconnect is supported as follows:
A single Ethernet adapter (or multiple adapters configured as a NetRAIN virtual interface) on each member, connected to a single Ethernet hub. Note that the use of NetRAIN in this configuration guards against the failure of a single adapter in a member's NetRAIN set. The hub remains a single point of failure.
Multiple Ethernet adapters configured as a NetRAIN virtual interface on each member connected as depicted in Figure 2-5 of the Cluster LAN Interconnect manual to a pair of Ethernet hubs connected by a single crossover cable. This configuration guards against the failure of a single member adapter or a single hub failure. However, because the failure of the crossover cable link between the hubs can cause a cluster network partition (as described in Section 2.2.3 of the Cluster LAN Interconnect manual), this configuration is not recommended.
Unlike Ethernet switches, Ethernet hubs cannot be configured with multiple parallel crossover cables to guard against potential network partitions. Hubs do not provide features to detect and respond to routing loops.
Because of the performance characteristics of Ethernet hubs, use
them only in small clusters (two or three members).
3.4.2 Set Ethernet Switch Address Aging to 15 Seconds
Ethernet switches maintain tables that associate MAC addresses (and virtual LAN (VLAN) identifiers) with ports, thus allowing the switches to efficiently forward packets. These forwarding data bases (also known as unicast address tables) provide a mechanism for setting the time interval when dynamically learned forwarding information grows stale and is invalidated. This mechanism is sometimes referred to as the aging time.
For any Ethernet switch participating in a LAN interconnect, set its aging time to 15 seconds. Failure to do so may cause the switch to erroneously continue to route packets for a given MAC address to a port listed in the forwarding table after the MAC address has moved to another port (for example, due to NetRAIN failover).
Ultimately, this can cause cluster nodes to lose communication, resulting in one or more nodes being removed from the cluster. As a consequence, one or more nodes may hang due to loss of quorum, or may panic with one of several messages. For example:
CNX MGR: this node removed from cluster
CNX QDISK: Yielding to foreign owner
3.4.3 On LAN-based Clusters, Add Physical Network Addresses to ifaccess.conf
On each member of a cluster, an interface access filter
configuration file (ifaccess.conf
) is used
to deny access from untrusted subnets to the cluster interconnect.
Without this filtering, a system outside of the cluster could
masquerade as a cluster member.
For clusters that use a LAN interconnect, each member's
/etc/ifaccess.conf
file must contain one
entry for the virtual network address (for example, 10.0.0.0) and
one entry for the physical network address
(for example, 10.0.1.0) for each network interface.
For clusters that use a Memory Channel interconnect, each
/etc/ifaccess.conf
file must contain only
the virtual network interface address.
In either case,
the
/etc/ifaccess.conf
file is correctly
configured on the initial cluster member, memberid = 1.
For additional cluster members, the configuration process automatically
adds virtual network address entries in the
/etc/ifaccess.conf
file for
each network interface.
This is sufficient for
a cluster that uses a Memory Channel interconnect.
However, in a
LAN interconnect cluster, in addition to an entry for the virtual network
address, an entry is needed for the
physical network address for each network interface.
The member addition process does not
add these required entries to
/etc/ifaccess.conf
.
On each cluster member other than member 1,
you must manually edit
/etc/ifaccess.conf
and
add a line for the physical network addresses for each network interface.
The line should have the following format:
interface-name interconnect_net_address 255.255.255.0 deny
For example, if the physical cluster interconnect network address
of a member is 10.0.1.0 (the default)
and the virtual address is 10.0.0.0 (the default)
and the interface name is
tu0
,
the following lines are required in
/etc/ifaccess.conf
:
tu0 10.0.1.0 255.255.255.0 deny
tu0 10.0.0.0 255.255.255.0 deny
3.4.4 Whenever Changing Network Interfaces, Update ifaccess.conf
Whenever you
add a new network interface, or change or replace an
existing one, you must update the
/etc/ifaccess.conf
file on the cluster member
where the change occurred.
Do this to
deny access from untrusted subnets to the cluster interconnect.
In a cluster with a Memory Channel interconnect, you need to add only a line for the virtual network address of the new or changed network interface.
In a cluster with a LAN interconnect, you must add a line for the virtual network address and another line for the physical network address for the new or changed network interface.
Make the change as follows:
Log in to the member where the network interface changed.
Use the
ifconfig
command to learn the
names of the network interfaces.
For example:
# ifconfig -l
ics0 lo0 sl0 ee0 ee1 ee2 tu0 tun0
Add a line to
/etc/ifaccess.conf
for each
network interface.
Edit the file appropriately for each changed interface.
Note
On a cluster that uses the LAN interconnect, you must add a line for the physical network address and another line for the virtual network address.
Each line should have the following format:
interface-name interconnect_net_address 255.255.255.0 deny
For example, if the virtual cluster interconnect network address
of a member is 10.0.0.0 (the default)
and a new interface card has been added to a cluster member
and the interface name is
tu0
,
add the following line to
/etc/ifaccess.conf
:
tu0 10.0.0.0 255.255.255.0 deny
If the cluster has a LAN interconnect, you must also add a line for the physical network address. Suppose the physical cluster interconnect network address of the member is 10.0.1.0 (the default). In addition to the line for the virtual network address, you must add the following line:
tu0 10.0.1.0 255.255.255.0 deny
The cluster interconnect network address is common to all members.
To see the address used by your cluster, look
in
/etc/ifaccess.conf
.
On clusters using the LAN interconnect, the network
interface used by a member for the interconnect must not appear in
the member's
/etc/ifaccess.conf
file.
To learn the name of a member's network interface for the LAN interconnect,
log in to that member and use the
sysconfig
command to query the
ics_ll_tcp
subsystem and the
ics_tcp_adapter0
attribute.
In the following example, the member where the
sysconfig
command executes is using
nr0
as the network interface for the LAN interconnect:
# sysconfig -q ics_ll_tcp ics_tcp_adapter0
ics_ll_tcp:
ics_tcp_adapter0 = nr0
For more information, see
ifaccess.conf(4).
3.4.5 Configuring a Cluster Member as a DHCP Server
The description of configuring the Dynamic Host Configuration Protocol (DHCP) in the Cluster Administration manual is incorrect. Do not perform step 4, which reads "Under Server/Security Parameters, set the Canonical Name entry to the default cluster alias."
Perform all other steps as documented:
Familiarize yourself with the DHCP server configuration process that is described in the chapter on DHCP in the Tru64 UNIX Network Administration: Connections manual.
On the cluster member that you want to act as the initial
DHCP server, run
/usr/bin/X11/xjoin
and configure
DHCP.
Select Server/Security.
From the pulldown menu that currently shows Server/Security Parameters, select IP Ranges.
Set the DHCP Server entry to the IP address of the default cluster alias.
There can be multiple entries for the DHCP Server IP address
in the DHCP database.
You might find it more convenient to
use the
jdbdump
command to generate a text file
representation of the DHCP database.
Then use a text editor
to change all the occurrences of the original DHCP server IP
address to the cluster alias IP address.
Finally, use
jdbmod
to
repopulate the DHCP database from the file you edited.
For
example:
# jdbdump > dhcp_db.txt
# vi dhcp_db.txt
Edit
dhcp_db.txt
and change
the owner IP address to the IP
address of the default cluster alias.
Update the database with your changes by entering the following command:
# jdbmod -e dhcp_db.txt
When you finish with
xjoin
, make DHCP a highly available
application.
DHCP already has an action script and a resource
profile, and it is already registered with the CAA daemon.
To
start DHCP with CAA, enter the following command:
# caa_start dhcp
3.4.6 After Entering TruCluster Server License Information, Run lmf reset
When you add a cluster member, the
clu_add_member
command prompts you
to indicate whether you want to register the TruCluster Server
license for the new member at this time.
If you answer "yes," you are
prompted to enter either the license information or the name of the
file with the license information.
If you enter the license information,
after
clu_add_member
completes, you must run
the following command on the member where
clu_add_member
executed:
# lmf reset
3.4.7 Deleting a Member with an Inaccessible or Bad Member Boot Disk
The
clu_delete_member -f
command can delete a member
with an inaccessible or bad boot disk.
However, be aware that,
when you delete a member with an inaccessible boot disk,
clu_delete_member -f
does not adjust expected votes in the
running cluster (as a normal
clu_delete_member
command
does).
If the deleted member was a voting member, use the
clu_quorum -e
command after the member has been deleted to
adjust expected votes appropriately.
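For example, a hedged sketch (the member ID and vote count are illustrative, and the -m option is an assumption; check clu_delete_member(8) and clu_quorum(8) for the exact syntax):
# clu_delete_member -f -m 3
# clu_quorum -e 2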
3.5 File System
This section discusses issues with the CFS, AdvFS, and NFS file systems
in a cluster.
3.5.1 Do Not Create UFS File Systems on a Quorum Disk
Do not use the quorum disk for user data.
The
mkfdmn
command prevents you from creating
an AdvFS domain on a quorum disk.
However,
the
newfs
command incorrectly allows
you to create a file system on a quorum disk.
Do not use
newfs
to create a file system
on a quorum disk.
3.5.2 CFS Relocation Failures Involving Applications That Wire Memory
Applications that use the
plock()
or
mlock()
system call to lock pages of
physical memory can cause the
cfsmgr
command
to fail when performing a manual relocation.
If the application uses
plock()
, the domain
or file system that contains the application executable cannot relocate.
In the case of
mlock()
, if the locked pages are
associated with files, then the file systems where those files reside cannot
relocate.
In the event of failure, the
cfsmgr
command returns
the following message:
Server Relocation Failed
Failure Reason: Invalid Relocation
To allow the relocation to complete for the domain or file system on
which the executables reside, kill the processes that
are running the executables using the
plock()
and
mlock()
system calls.
Find out whether
collect
is running.
If it is,
kill
collect
and restart it with the
-l
(do not lock pages into memory) option.
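For example, a hedged sketch (the process ID and path are illustrative; restart collect with whatever options you normally use, plus -l):
# ps -ef | grep collect
# kill 1234
# /usr/sbin/collect -l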
3.6 LSM
This section discusses known problems with using the Logical
Storage Manager (LSM) in a TruCluster Server cluster.
3.6.1 LSM Commands Fail When Disk Left in Failing State
When a node boots into an existing cluster and has connectivity to a
failed device, it automatically brings the device online and reestablishes
the associations with appropriate disk media records.
After this
process, the disk is occasionally left in the failing state, which
prevents the disk from being used when space is requested
by commands such as
volassist
.
If this situation occurs, you must manually turn off the disk's failing state, as follows:
# voledit set failing=off device_name
3.6.2 Problem Encapsulating Swap in Clusters with Long Host Names
LSM has a problem encapsulating swap in a cluster on members with
base host names greater than 24 characters, for example,
reallyreallyreallyverylonghostname.foo.bar.com
.
To work around this problem, reduce the base
hostname to fewer than 25 characters.
3.6.3 Problem Encapsulating Swap When Remote Member Swap Areas Are Already Queued for Encapsulation
LSM has a problem encapsulating swap in a cluster when you run
the
volreconfig
command on one member and another
member already has swap devices queued for encapsulation.
To encapsulate the swap devices in a cluster, run
volencap
followed by
volreconfig
on each cluster member in turn.
Only after
volreconfig
completes and the member has rebooted
can you begin the swap encapsulation process for the next member.
Do not run
volencap
or
volreconfig
on the next member until
after the current member has rebooted.
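For example, a hedged sketch for one member, assuming dsk6b is that member's swap partition:
# volencap dsk6b
# volreconfig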
For more information, see the chapter on
using LSM in a
cluster in the
Cluster Administration
manual.
3.6.4 The voldisk list Command Does Not Display All Non-LSM Disks in a Cluster
The
voldisk list
command is supposed to show both disks that are configured under LSM and
disks that are not, giving disks that are not configured under LSM a status of
unknown
.
In a cluster, however, the
voldisk list
command displays
only those non-LSM disks that are directly connected to the member on which
the command was executed.
3.7 CAA
This section discusses known problems in the cluster application
availability (CAA) subsystem.
3.7.1 Failure Threshold Value Should Not Be Greater Than 10
The failure threshold value for any resource should not be
greater than 10.
Setting the failure threshold value greater
than 10 may cause the
caad
daemon to crash if there are
a large number of failures for this resource and the
failure threshold is never exceeded within the failure interval.
An application's failure threshold is set in its CAA application
resource profile.
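For example, the relevant profile entries might look like the following hedged sketch (the values are illustrative; FAILURE_INTERVAL is assumed to be the attribute corresponding to the failure interval mentioned above):
FAILURE_THRESHOLD=5
FAILURE_INTERVAL=60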
For more information, see
caa_profile(8).
3.7.2 Problems with caa_relocate When Multiple Interdependent Applications Are Specified
A forced relocate that is not directed to a specific cluster member
and has multiple interdependent applications specified
will relocate all of these applications multiple times.
For example, if
app1
depends on
app2
and
app3
, then
caa_relocate -f app1 app2 app3
causes all of these
applications to relocate three times.
A forced relocate directed to a specific cluster member
with interdependent applications specified
relocates the applications to the specified cluster member
correctly, but responds with an error.
For example,
caa_relocate -c member2 -f app1 app2 app3
successfully relocates all applications to
member2
,
but error messages are displayed and the return code is non-zero.
If you are relocating interdependent applications,
specify only
one
application, for example,
caa_relocate -f app1
or
caa_relocate -c member2 -f app1
.
All dependent applications relocate with the application specified.
Also,
caa_relocate -s
fails when
it is used with resources
with dependencies.
To avoid this, do the following:
Identify applications with dependencies by
using the
caa_stat -p
command.
Those applications
with entries for
REQUIRED_RESOURCES
have dependencies.
Use the
caa_relocate -f
command to
relocate the applications with dependencies.
Use
caa_relocate -s member1 -c member2
to
relocate the other applications.
3.7.3 Action Scripts PATH Considerations
Newly generated CAA scripts do not set the
PATH
environment variable.
When they are executed, the
PATH
is set to a default value
/sbin:/usr/sbin:/usr/bin
.
Therefore, you must
explicitly specify most path names that are used in scripts, or you
must modify the resulting scripts to explicitly set the
PATH
.
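For example, you can add lines such as the following near the top of an action script (extend the directory list to cover whatever commands the script calls):
PATH=/sbin:/usr/sbin:/usr/bin
export PATH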
Action scripts generated in previous
releases may have a
PATH
that includes the current
directory (.
).
Because this is a potential security issue, modify these scripts to remove the
current directory from the path.
3.7.4 The caa_register -u Command Does Not Correctly Update a Nonapplication Resource's State
When you change the
SUBNET
value in a network
resource's profile or the
DEVICE_NAME
value in a tape
or changer resource's profile and run the
caa_register -u
command to update the CAA registry, CAA does not update the
STATE
value for the resource.
To correctly update a
nonapplication resource, follow these steps:
Unregister the resource using the
caa_unregister
command.
Change the
SUBNET
or
DEVICE_NAME
attribute in the resource profile.
Reregister the resource using the
caa_register
command.
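For example, for a hypothetical network resource named net1:
# caa_unregister net1
(Edit the resource profile and change the SUBNET value.)
# caa_register net1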
3.7.5 Changes to HOSTING_MEMBERS List Can Leave Applications Running on Disallowed Members
A change to the
HOSTING_MEMBERS
list only affects future
relocations and starts.
If you update the
HOSTING_MEMBERS
list in the profile of an
ONLINE
application resource
with a restricted placement policy, make sure that the
application is running on one of the cluster members in that list.
If the application is not running on one of the allowed members, run the
caa_relocate
command on the application after running the
caa_register -u
command.
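For example, for a hypothetical application resource named app1 that is left running on a member no longer in its HOSTING_MEMBERS list:
# caa_register -u app1
# caa_relocate app1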
3.7.6 CAA Events Are Malformed When Viewed from the Event Manager (EVM) Viewer
The Event Manager (EVM) viewer may display malformed CAA event messages, or messages with missing information. For example, the message:
CAA named is transitioning from state ONLINE to state OFFLINE on skiing
is displayed as:
CAA named is transitioning from state to state skiing
To work around this problem, examine the messages in the
daemon.log
file for more complete information.
The
messages in the
daemon
log file are in a slightly
different format from those that the EVM viewer displays.
3.7.7 CAA Lets You Update a Running Application's Required Resources
The
caa_register -u
command and the CAA
graphical user interface (GUI) allow you to
successfully update the
REQUIRED_RESOURCES
field in the
profile of an
ONLINE
resource with the name of a resource
that is
OFFLINE
.
(It should not allow you to update this
field unless the required resource is
ONLINE
or
unless the
resource whose profile you are modifying is
OFFLINE
.)
If you accidentally update the
REQUIRED_RESOURCES
field
incorrectly, you must manually start the required resource or stop the
updated resource to correct its states.
3.7.8 SysMan Station Shows CAA Application Resources in UNKNOWN State as Having an Error
The SysMan Station shows any application resources that are in the
UNKNOWN
state as having an error and does not show what
member the application is
UNKNOWN
on.
For example, an
application resource named
foo
in the
UNKNOWN
state is shown underneath the cluster icon as
foo (error)
.
3.8 Rolling Upgrade
This section discusses issues with rolling upgrade.
3.8.1 Rolling Upgrade: Do Not Shut Down a Cluster Member To Single-User Mode
This note applies only when you perform a rolling upgrade from TruCluster Server Version 5.1 to Version 5.1A.
Note
Use the procedure given here to get to single-user mode. Do not follow the procedure given in the rolling upgrade section of the Version 5.1 TruCluster Server Software Installation manual.
Before halting a cluster member, make sure that the cluster can maintain quorum without the member's vote. For more information, see the section on shutting down and starting a cluster member in the Cluster Administration manual.
To take a cluster member to single-user mode, use the
shutdown -h
command to halt the member, and then
boot the member to single-user mode.
When the system reaches
single-user mode,
run the
init s
,
bcheckrc
,
kloadsrv
, and
lmf reset
commands.
For example:
# /sbin/shutdown -h now
>>> boot -fl s
# /sbin/init s
# /sbin/bcheckrc
# /sbin/kloadsrv
# /usr/sbin/lmf reset
We recommend this halt and boot method because it ensures that the member will provide a minimal set of services to the cluster, and conversely, that the running cluster will have minimal reliance on the member in single-user mode.
Shutting a member down to
single-user mode can take a long time, and in some cases, services
such as
automount
,
autofs
, or
NIS might appear to be working from the point of view of other cluster
members, when in fact the services are either down or partially up.
3.8.2 Stop prpasswdd Prior to Rolling Upgrade on Clusters Running Enhanced Security Environment
Before starting the rolling upgrade process in a TruCluster Enhanced Security environment, set a checkpoint in your authentication databases. Do this as follows:
# /usr/tcb/bin/db_checkpoint -h /var/tcb/files -1
After the checkpoint has been set, you must then
shut down the
prpasswdd
daemon on each cluster member.
Do this by issuing the following command on each member:
# /sbin/init.d/prpasswdd stop
Users can still log in to the cluster; however, login performance may suffer.
During the period that the cluster is rolling,
disable the starting of the
prpasswdd
daemon.
Do this by setting false
startup arguments in your
rc.config.common
database by
using the following command on one cluster member:
# rcmgr -c set PRPASSWDD_ARGS "WAIT FOR COMPLETED ROLL"
As nodes roll to the new revision, a message
indicating the
prpasswdd
daemon did not start will
be issued.
This is expected.
After all the cluster members have rolled to the
new release, restart the
prpasswdd
daemon.
Do this by issuing the following command on one member:
# rcmgr -c delete PRPASSWDD_ARGS
Then restart the
prpasswdd
daemon on each node
as follows:
# /sbin/init.d/prpasswdd start
3.8.3 Updating Worldwide Language Support from CD-ROM During a Rolling Upgrade
Do not remove the Worldwide Language Support (WLS) product from a cluster while a rolling upgrade is in progress; otherwise the cluster is put into a state that prevents the rolling upgrade from completing successfully.
If you remove the WLS product prior to starting the rolling
upgrade, use the
/usr/sbin/clu_upgrade
command to
determine whether a rolling upgrade is in progress.
See
clu_upgrade(8).
3.8.4 Run Updates to dop Database Only on Upgraded Members
When performing a cluster rolling upgrade, do not issue any
dop
commands (for example,
dop -a
,
dop -d
, or
dop -W
)
on cluster members running the older version of the operating
system.
Any
dop
additions or deletions made on
the members running the older operating system might
be lost after
dop
commands are issued from
the upgraded members.
3.8.5 Recovering from a Failed Install Stage During a Rolling Upgrade from V5.1 to V5.1A
If the install stage of a
clu_upgrade
fails, do the following:
Halt the lead member.
Execute the following command from any member still
UP
:
# zcat < /var/adm/update/TruClusterKit/TCRBASE520 | tar -xf - \
./usr/sbin/cluster/clu_common \
./usr/sbin/clu_upgrade \
./usr/lib/nls/msg/en_US.ISO8859-1/clu_upgrade.cat
# clu_upgrade -undo install
The following is displayed:
This is the cluster upgrade program.
You have indicated that you want to undo the 'install' stage of the upgrade.
Do you want to continue to undo this stage of the upgrade? [yes]:[Return]
Restoring tagged files.
.....................................
clu_rollprop: "/usr/lib/nls/msg/en_US.ISO59-1/.Old..clu_upgrade.cat" \
does not exist
..........
/usr/lib/nls/msg/en_US.ISO8859-1/.Old..clu_upgrade.cat does not exist
.......
clu_rollprop: "/usr/sbin/.Old..clu_upgrade" does not exist
clu_rollprop: "/usr/sbin/cluster/.Old..clu_common" does not exist
...................
/usr/sbin/.Old..clu_upgrade does not exist
/usr/sbin/cluster/.Old..clu_common does not exist
...........................................................
The undo of the 'install' stage completed successfully.
Boot the lead member as follows:
>>> boot -fl s
When the system reaches single-user mode, run the following commands:
# init s
# bcheckrc
# kloadsrv
# lmf reset
Re-run
installupdate
as documented in
Cluster Installation.
3.9 Miscellaneous Administration
This section discusses issues with various administration tools that
are used in a cluster.
3.9.1 RIS Boot Failures When Cluster is RIS Server
If the system that became the initial cluster member
was configured as a RIS server before the
clu_create
command was run, then
the cluster creation process does not update the
sa
entry in
/etc/bootptab
.
The
sa
remains the IP address of the standalone system.
Because of this, attempts at RIS boots after clusterization fail to
mount the root file system.
You must manually edit
/etc/bootptab
and
update the
sa
entry to be the IP address for
the default cluster alias.
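For example, if the standalone system's address was 16.140.64.10 and the default cluster alias address is 16.140.64.100 (both addresses are illustrative), change the sa field in the affected /etc/bootptab entries from:
:sa=16.140.64.10:\
to:
:sa=16.140.64.100:\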
3.9.2 The cfsstat Command Can Return Incorrect Statistics
The
cfsstat
command can return incorrect statistics
for the following:
tokens
(statistics on tokens)
The values for the total number of token requests and the total number of token cache hits are always 0 (zero).
tokstats
(statistics on tokens traffic)
Some statistics might be returned as negative numbers.
cfsmem
(statistics on memory usage by CFS)
Incorrect values are returned for
tokens_t
structures
(cli_tokens.tokens_t
) and
token
structures (cli_tokens.token
).
In addition, when displaying very large
numbers, the
cfsstat
command might truncate the values or display them as negative numbers.
To avoid this, reset CFS statistics
with the command
cfsstat -z
before gathering
statistics.
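For example:
# cfsstat -z
# cfsstat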
3.9.3 All Tape Writes at Same Density and Compression
All writes to a tape, regardless of the device file that is used,
will have the same density and compression.
This density and
compression may not correspond to the
density and compression documented for the device file
in
tape(7).
3.9.4 Tape Opens Take Long Time to Fail When No Tape Is Present
When an attempt is made to open a tape device with no tape present,
more than 5 minutes can elapse before an error is returned.
3.9.5 Avoid Using Block Sizes Greater Than 512K Bytes for Tape Transfers
If the block size on a tape operation is greater than 512K bytes,
the actual read or write operation is divided into 512K-byte transfers.
If an error occurs during a transfer greater than 512K, no
EIO
or other error is returned.
Instead, the number
of bytes transferred is reported.
3.9.6 Disabling Internet Services on a Per-Member Basis
The
Cluster Administration
manual describes a way to disable the Internet
services daemon
(inetd
) on a particular member by using
the
disable
keyword in the
/etc/inetd.conf.local
file.
For example:
finger stream tcp nowait root disable fingerd
If you add this entry, the following error appears in
/var/adm/syslog.dated/current/daemon.log
:
inetd: disable: file does not exist
The Internet services daemon continues to run normally, and the
service is not disabled.
The workaround is to remove the entry for
the service from the global
/etc/inetd.conf
file and add it to the member-specific
/etc/inetd.conf.local
file
on the members that you want to offer the service.
For more information
on configuring
inetd
, see
inetd.conf(4).
3.9.7 The hwmgr -show comp Command May Report an Inconsistency Error When Creating a Clusterwide Name for a SCSI Device
After you have used the
hwmgr -edit scsi
command to create a
clusterwide unique name for a SCSI device, a subsequent
hwmgr
-show comp
command may report an inconsistency
on the SCSI device.
The inconsistency appears when the
hwmgr -edit scsi
command is invoked on the second and
subsequent members for the same device.
You can ignore the
inconsistency error in this situation.
For example:
root> hwmgr -show comp -id 373 -full
HWID: HOSTNAME FLAGS SERVICE COMPONENT NAME
--------------------------------------------------------------------
373: rovel-qa1 rcd-i iomap SCSI-WWID:ff10000b:"media_chngr"
DSF GROUP INSTANCE GRPFLAGS GROUPID SUBSYSTEM BASENAME L1 L2
--------------------------------------------------------------------
0 40 81 cam_changer mc2 media_changer generic
DEVICE NODE ID LBdevT LCdevT CBdevT CCdevT BFlags CFlags Class Suffix L3B L3
-----------------------------------------------------------------------------
0 0 56008c0 0 13003b3 0x0 0x861 0x0 (null) (null) (null)
COMPONENT INCONSISTENCY
-----------------------
Component should not have an entry in the cluster database but it does.
3.9.8 The hwmgr -scan scsi Command Does Not Work Clusterwide
The
-cluster
option of the
hwmgr -scan scsi
command does not work clusterwide.
When
entered on a cluster
member, the
hwmgr -scan scsi -cluster
command updates
the device databases for only that member, not for all cluster members.
To perform a clusterwide scan, enter the following command on each cluster member when you need to update the member device databases clusterwide:
# hwmgr -scan comp -cat scsi_bus
You usually use this command when you add a new disk to a cluster (see
Section 3.9.9).
3.9.9 Adding a Disk to a Running Cluster
When you add a new disk to a running cluster (for example, when you replace a failed disk), the cluster may not properly identify or configure the disk. To ensure that all cluster members properly recognize a new disk, follow these steps:
For all disk models, enter the following command on each member to scan SCSI buses clusterwide and configure any new devices.
# hwmgr -scan comp -cat scsi_bus
Allow a minute or two for the scans to complete.
If the disk that you are adding is an RZ26, RZ28, RZ29, or RZ1CB-CA model, enter the following command on each cluster member after you install the disk:
# /usr/sbin/clu_disk_install
Note
This command may take several minutes to complete if the cluster has a large number of storage devices.
3.9.10 Running Process Accounting on Large Clusters Can Exhaust Member Process Quotas
If process accounting is enabled on large clusters (six to eight members),
cluster members may start swapping heavily and eventually exhaust
their process quotas.
A
ps
command on such a member will show tens of thousands
of
icssvr_daemon_from_pool
processes.
If you see this situation developing in a cluster that is running process
accounting, use the
accton
command with no
parameters to disable accounting.
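For example (the path is the usual location of the accounting commands; accton with no file argument turns accounting off):
# /usr/sbin/accton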
3.10 SysMan Menu
This section discusses known problems that you may encounter
when you use SysMan Menu in a cluster.
3.10.1 Nonroot Users with No Home Directory Cannot Run System Management Applications
Most system management applications require
root
privileges to make configuration changes.
Nonroot users are permitted to
run system management applications only to view the current configuration.
They are prevented from changing the configuration.
In a cluster, the system management applications use the remote shell
command (rsh
) to execute commands at a remote host.
Part of the
rsh
command processing includes
verifying access in the
$HOME/.rhosts
file in the remote user's home directory.
For this reason, a nonroot user who does not have a
home directory might encounter a core dump when running a
system management application.
Users can avoid these problems by ensuring that they have home
directories set up before attempting to use the system management
applications.
3.11 SysMan Station
This section discusses known problems that you may encounter when
using SysMan Station in a cluster.
3.11.1 SysMan Station Does Not Display Applications with Long Resource Names
CAA application resources with names of 64 or more characters are not
displayed by SysMan Station.
If you plan on managing your system with
SysMan Station, limit the length of resource names to 63 characters.
Resources with names beyond this limit can still be successfully managed
with either SysMan Menu or the command line CAA commands.
3.11.2 Unable to Expand Host Object in a Cluster
The SysMan Station client may occasionally encounter a Java class exception error when you attempt to expand a Host object.
If you encounter this error, click on a different view and then reselect the
Hardware view and retry the expand operation.
If the problem persists,
restart the SysMan Station client and retry the expand operation.
3.11.3 SysMan Station Might Display Cluster Status Incorrectly
The SysMan Station relies on events generated by the Event Manager (EVM) subsystem in order to monitor and display cluster status. In the following situations, the SysMan Station may reflect the state of the system incorrectly:
The Filesystems light in the Monitor window may indicate a warning
state (yellow) after all file system objects have returned to a normal state.
This situation may occur after a new member has been added to the cluster.
To clear this warning, restart the SysMan Station daemon
(smsd
) on the affected cluster members by following
these steps:
Close all open SysMan Station sessions.
Enter the following command:
# /sbin/init.d/smsd restart
After a cluster member has booted, the Network light in the Monitor window may indicate a warning state (yellow) when no network errors exist. This condition is caused by network events that are generated during the boot sequence. To clear this warning, follow these steps:
Click on the Network light in the Monitor window to display the Network Event window.
Click on the Clear Events button.
If the cluster application availability daemon (caad
)
fails to start on a cluster member, the SysMan Station will not
correctly display the state of CAA objects.
For example, this situation can
happen when the TruCluster Server license is not loaded on
all the cluster members.
To obtain
accurate information on CAA applications from the SysMan Station, follow
these steps:
Start the
caad
daemon on the affected cluster members
using the following command:
# /usr/sbin/caad
Restart the SysMan Station daemon (smsd
) using the
following command:
# /sbin/init.d/smsd restart
Additionally, the Filesystem Attention group in the SysMan Station Monitor window does not properly update on all cluster members for suboptimal file system states. If a file system becomes suboptimally configured, the Filesystem Attention group on only one cluster member will reflect the new state properly. SysMan Station clients that are connected to cluster members will not reflect this change in the Monitor window. However, the Physical Filesystem View on all cluster members will properly display this state information.
To correct this problem, stop and restart the SysMan Station
daemon (smsd
) on each cluster member where the
Filesystems Attention Group is not reflecting the suboptimal state, as
follows:
Close all open SysMan Station sessions.
Restart the SysMan Station daemon using the following command:
# /sbin/init.d/smsd restart
3.11.4 SysMan Station Might Display New Hardware Objects Incorrectly
If a new disk device is added or an existing disk device is replaced in a running cluster, the SysMan Station's Hardware View may display the new or modified disk object incorrectly. The disk object may be positioned incorrectly in the hardware hierarchy; for example, the disk may be drawn as a child of the host object instead of as a child of a SCSI bus.
To correct the view, restart the SysMan Station daemon
(smsd
) on each cluster member by performing the
following steps on all affected members:
Close all open SysMan Station sessions.
Enter the following command:
# /sbin/init.d/smsd restart
3.11.5 Properties Might Not Be Displayed for Selected Objects
Properties may not be displayed for selected objects. The Properties dialog box may appear briefly on the screen or may not be displayed at all.
To work around this problem, continue to try to display properties in the
current SysMan Station client, or exit the SysMan Station client
and start a new SysMan Station session.
3.11.6 Some SysMan Station Applications Display Wrong Target Member Name
When the following applications are launched from the SysMan Station, their title bars incorrectly display the name of the cluster member on which the SysMan Station client is running instead of the cluster member that is the target of the application's actions:
Security Auditing Configuration
Network Configuration Applications
NFS Configuration Applications
NTP Configuration Applications
PPP Configuration Applications
The application is directed to the correct cluster member; only the name in
the title bar is incorrect.
3.12 Documentation
This section discusses TruCluster Server Version 5.1A
documentation issues.
3.12.1 Recovering Cluster Root File System
Take the following into consideration when using the procedures for recovering the cluster root file system described in the chapter on troubleshooting clusters in the Cluster Administration manual:
When booting the initial cluster member, you may need to adjust expected quorum votes.
In the Cluster Administration manual, see the section on forming a cluster when members do not have enough votes.
The recovery process is not complete until the
h
partition of each member's boot disk is
updated with the correct information about the devices used for
the cluster root file system.
You can do this by booting each member.
You can also update the
h
partition of
a member's boot disk with the
clu_bdmgr
command.
For more information, see
clu_bdmgr(8).
3.12.2 Correct Size of Member Boot Disk a (Boot) Partition
In the
Cluster Administration
manual, the section on backing up and
repairing a member's boot disk gives the wrong size for the
a
(boot) partition.
The size of the
a
partition is 256 MB.
3.12.3 Error in Example of Hexadecimal Format of IP Address
In the Cluster Administration manual, section 3.11, Enabling Cluster Alias vMAC Support, has a typographical error in the example of the hexadecimal format of an IP address. The following line is incorrect:
IP address in hex format: 10.8C.70:D1
The colon (:) in the IP address should be a period (.):
IP address in hex format: 10.8C.70.D1
3.12.4 Media Changer Utility, mcutil, Works on Remote Members
The
Cluster Administration
manual incorrectly states that the
mcutil
command works only on a device
that is directly connected to the member where the command
is executed.
The
mcutil
command works on devices clusterwide,
regardless of the member on which the command is executed.
3.12.5 Using LSM to Mirror the Cluster Root File System
The section of the Cluster Administration manual on using Logical Storage Manager in a cluster is missing an example of how to use LSM to mirror the cluster root file system.
Use the
volmigrate
command to mirror
cluster_root
, the cluster root file system.
Doing so moves
cluster_root
from
its current volume to the LSM volume you specify in the
volmigrate
command.
To mirror
cluster_root
requires
that the LSM
rootdg
disk group
contains at least two disks of the
appropriate size and type to create a volume for the
cluster root domain.
These disks must be accessible to
all cluster members.
You can display the size of the cluster root domain as follows:
# showfdmn cluster_root
To learn the connectivity of disks,
use the
hwmgr
command:
# hwmgr -view devices -cluster
If a disk is suitable for mirroring
cluster_root
,
then the output of the
hwmgr
command will list
the device special file name of the disk multiple times, once for
each member of the cluster.
If necessary, you can add disks to
rootdg
with
the
voldg
command:
# voldg adddisk disk_name
To create the mirrored volume, use the
volmigrate
command.
For example, if you want to use
dsk5
and
dsk7
as LSM mirrors for
the cluster root file system, use the following command:
# volmigrate -m 2 cluster_root dsk5 dsk7
For more information about mirroring the cluster root file system,
see
volmigrate(8).
3.12.6 Migrating from Automount to AutoFS
This section describes how to migrate from Automount to AutoFS. The information in this section was not included in the Cluster Administration manual.
The
autofsd
daemon automatically and
transparently mounts and unmounts NFS file systems on an as-needed
basis.
Like the
automount
daemon, it provides
another alternative to using the
/etc/fstab
file for
mounting NFS file systems on client machines.
However, AutoFS is
more efficient than the
automount
daemon because it
requires less communication between the kernel and the user space daemon.
The
autofsd
daemon also provides higher availability
than the
automount
daemon.
Three possible migration scenarios are presented: migrating without rebooting any cluster member, migrating when you reboot the active Automount server node, and migrating when rebooting the entire cluster.
Regardless of which approach you select, first use the CAA
caa_stat
command to verify that the
autofs
CAA resource is registered.
For example:
# /usr/bin/caa_stat autofs
Could not find resource autofs.
If
autofs
is not registered as a CAA resource, then
register it as follows:
# /usr/sbin/caa_register autofs
3.12.6.1 Migrating Without a Reboot
Migrating without rebooting any cluster member requires the largest number of procedural steps, but provides the highest availability. The additional steps are required to ensure cleanup of the Automount intercept points and to automatically start AutoFS.
Do the following:
Change the rc.config.common file.
Determine any arguments to pass to the autofsmount command. These arguments are typically a subset of those already specified by the AUTOMOUNT_ARGS environment variable. To view the value of that variable, use the rcmgr get command, as shown in the following example:
# /usr/sbin/rcmgr -c get AUTOMOUNT_ARGS
-m /net -hosts -D MACH=alpha -D NET=f
The -m option ignores directory-mapname pairs listed in the auto.master file. This option might be useful for debugging purposes if you suspect there is a syntax error in the auto.master file.
Environment variables set by using the -D option resolve placeholders in the definition of auto-mount map file entries. For example, the associated NET entry might appear in the map file as follows:
vsx ${NET}system:/usr/projects2/vsx
and would resolve to
vsx fsystem:/usr/projects2/vsx
Set the arguments to pass to the autofsmount command, as determined in the previous step. To do this, use the rcmgr set command. For example:
# /usr/sbin/rcmgr -c set AUTOFSMOUNT_ARGS -m /net -hosts -D MACH=alpha -D NET=f
Set the arguments to pass to the autofsd daemon. For example:
# /usr/sbin/rcmgr -c set AUTOFSD_ARGS -D MACH=alpha -D NET=f
These arguments must match the environment variables, specified with the -D option, as set for AUTOMOUNT_ARGS.
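To confirm the values you just set, you can read them back with rcmgr (an optional check added here, not part of the documented procedure):
# /usr/sbin/rcmgr -c get AUTOFSMOUNT_ARGS
# /usr/sbin/rcmgr -c get AUTOFSD_ARGS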
Use the mount -e command to identify an automounted file system:
# mount -e | grep nfs
deli.zk3.dec.com:(pid524825) on /net type nfs (v2, ro, nogrpid, udp, hard, intr, noac, timeo=350, retrans=5)
The automounted file system is indicated by hostname:(PID).
Determine which cluster member is the Automount server node for the NFS file system you identified in the previous step, as shown in the following example:
# cfsmgr -p /net
Domain or filesystem name = /net
Server Name = swiss
Server Status : OK
Stop the Automount service on all cluster members other than the Automount server you identified in the previous step. To do this, use the ps -ef command to display process identifiers, search the output for instances of automount, and then use the kill command to kill each process. This causes the automount daemon to unmount all file systems that it has mounted, and to exit.
# ps -ef | grep automount
root 1049132 1048577 0.0 May 10 ?? 0:00.00 /usr/sbin/automount -m /net -hosts
# kill 1049132
Note that, as of Tru64 UNIX Version 5.1A, the kill command is clusterwide and you can kill a process from any cluster member.
Disable Automount and enable AutoFS in the rc.config.common file, as follows:
# /usr/sbin/rcmgr -c set AUTOMOUNT 0
# /usr/sbin/rcmgr -c set AUTOFS 1
Allow all auto-mounted file systems to become quiescent.
Stop the Automount service on the cluster member operating as the server. To do this, use the ps -ef command to display process identifiers, search the output for instances of automount, and then use the kill command to kill each process.
# ps -ef | grep automount
root 524825 524289 0.0 May 10 ?? 0:00.01 /usr/sbin/automount -m /net -hosts
# kill 524825
Use the mount -e command and search the output for tmp_mnt, or the directory specified with the automount -M command, to verify that auto-mounted file systems are no longer mounted. No file systems should be mounted on tmp_mnt.
# mount -e | grep tmp_mnt
If some mount points still exist, they will no longer be usable via the expected pathnames. However, they are still usable under the full /tmp_mnt/... pathnames. Because AutoFS does not use the /tmp_mnt mount point, there is no conflict and the full auto-mount name space is available for AutoFS.
If these tmp_mnt mount points later become idle, you can unmount them by using the -f option of the umount command, which unmounts remote file systems without notifying the server.
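For example, to unmount an idle mount point left under /tmp_mnt (the pathname form below is an assumption based on the default -hosts layout; substitute the actual remote host name for hostname):
# umount -f /tmp_mnt/net/hostname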
Start AutoFS.
AutoFS provides automatic failover of the automounting service by means of CAA: one cluster member acts as the CFS server for auto-mounted file systems, and runs the one active copy of the AutoFS daemon. If this cluster member fails, CAA starts the autofs resource on another member.
If you do not care which node serves AutoFS, use the /usr/sbin/caa_start autofs command without specifying a cluster member; otherwise, use the /usr/sbin/caa_start autofs -c member-name command to specify the cluster member that you want to serve AutoFS.
# /usr/sbin/caa_start autofs
The -c option starts the autofs resource on the specified member if that member is allowed by the placement policy and resource dependencies. If the specified member is not allowed by the placement policy and resource dependencies, or if it is not available, the caa_start command fails.
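For example, to start the resource on a specific member (the member name swiss is taken from the earlier cfsmgr output and is illustrative):
# /usr/sbin/caa_start autofs -c swiss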
See the discussion of the resource file options in caa_profile(8).
Use the caa_stat autofs command to make sure that the autofs resource started as expected.
# /usr/bin/caa_stat autofs
NAME=autofs
TYPE=application
TARGET=ONLINE
STATE=ONLINE on swiss
3.12.6.2 Migrating When Rebooting a Cluster Member
Migrating when rebooting a cluster member requires fewer procedural steps than migrating without a reboot, at the expense of availability.
Note
Before you shut down a cluster member, you need to determine whether the cluster member you are shutting down is a critical voting member, and whether it is the only hosting member for one or more applications with a restricted placement policy. Both of these issues are described in Chapter 5 of the Cluster Administration manual.
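As a quick check before shutting down a member, you can display each member's votes and the current quorum values with the clu_quorum command (a general TruCluster command; see the Cluster Administration manual for how to interpret the output):
# clu_quorum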
Follow these steps to migrate from Automount to AutoFS when rebooting a cluster member:
Change the rc.config.common file.
Determine any arguments to pass to the autofsmount command. These arguments are typically a subset of those already specified by the AUTOMOUNT_ARGS environment variable. To view the value of that variable, use the rcmgr get command, as shown in the following example:
# /usr/sbin/rcmgr -c get AUTOMOUNT_ARGS
-m /net -hosts -D MACH=alpha -D NET=f
Environment variables set by using the -D option resolve placeholders in the definition of auto-mount map file entries. For example, the associated NET entry might appear in the map file as follows:
vsx ${NET}system:/usr/projects2/vsx
and would resolve to
vsx fsystem:/usr/projects2/vsx
Set the arguments to pass to the autofsmount command, as determined in the previous step. To do this, use the rcmgr set command. For example:
For example:
# /usr/sbin/rcmgr -c set AUTOFSMOUNT_ARGS -m /net -hosts -D MACH=alpha -D NET=f
Set the arguments to pass to the autofsd daemon. For example:
# /usr/sbin/rcmgr -c set AUTOFSD_ARGS -D MACH=alpha -D NET=f
These arguments must match the environment variables, specified with the -D option, as set for AUTOMOUNT_ARGS.
Use the mount -e command to identify a file system served by Automount:
# mount -e | grep nfs
deli.zk3.dec.com:(pid524825) on /net type nfs (v2, ro, nogrpid, udp, hard, intr, noac, timeo=350, retrans=5)
The automounted file system is indicated by hostname:(PID).
Determine which cluster member is the Automount server node for the NFS file system you identified in the previous step.
# cfsmgr -p /net
Domain or filesystem name = /net
Server Name = swiss
Server Status : OK
Stop the Automount service on all cluster members other than the Automount server you identified in the previous step. To do this, use the ps -ef command to display process identifiers, search the output for instances of automount, and then use the kill command to kill each process. This causes the automount daemon to unmount all file systems that it has mounted, and to exit.
# ps -ef | grep automount
root 1049132 1048577 0.0 May 10 ?? 0:00.00 /usr/sbin/automount -m /net -hosts
# kill 1049132
Note that, as of Tru64 UNIX Version 5.1A, the kill command is clusterwide and you can kill a process from any cluster member.
Disable Automount and enable AutoFS in the rc.config.common file, as follows:
# /usr/sbin/rcmgr -c set AUTOMOUNT 0
# /usr/sbin/rcmgr -c set AUTOFS 1
(Optional) Specify the AutoFS server.
AutoFS provides automatic failover of the automounting service by means of CAA: one cluster member acts as the CFS server for auto-mounted file systems, and runs the one active copy of the AutoFS daemon. If this cluster member fails, CAA starts the autofs resource on another member.
You can use the caa_profile autofs -print command to view the CAA hosting and placement policy, if any. The hosting policy specifies an ordered list of members, separated by white space, that can host the application resource. The placement policy specifies the policy according to which CAA selects the member on which to start or restart the application resource.
# /usr/sbin/caa_profile autofs -print
NAME=autofs
TYPE=application
ACTION_SCRIPT=autofs.scr
ACTIVE_PLACEMENT=0
AUTO_START=0
CHECK_INTERVAL=0
DESCRIPTION=Autofs Services
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
OPTIONAL_RESOURCES=
PLACEMENT=balanced
REQUIRED_RESOURCES=
RESTART_ATTEMPTS=3
SCRIPT_TIMEOUT=3600
The default, and recommended, behavior is to run on any cluster member, with a placement policy of balanced. If this is not suitable for your environment, use the /usr/sbin/caa_profile -update command to change the autofs resource profile. See the discussion of the resource file options in caa_profile(8).
If you make a change, use the following command to have the update take effect:
# /usr/sbin/caa_register -u autofs
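After the update is registered, you can confirm that the profile now shows the hosting members and placement policy you intended (an optional verification step added here, not part of the original procedure):
# /usr/sbin/caa_profile autofs -print
# /usr/bin/caa_stat autofs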
Reboot the cluster member. Before you shut down the cluster member, make sure that it is not a critical voting member or the only hosting member for one or more applications with a restricted placement policy.
When it reboots, Automount will no longer be running in the cluster, and AutoFS will start.
# /sbin/shutdown -r now
3.12.6.3 Migrating When Rebooting the Cluster
Migrating when rebooting the entire cluster requires fewer procedural steps than migrating without a reboot or migrating when rebooting a single member. The trade-off is a loss of cluster availability.
Rebooting the cluster is a drastic measure and is not the preferred migration method.
Follow these steps to migrate from Automount to AutoFS when rebooting the cluster:
Change the rc.config.common file.
Determine any arguments to pass to the autofsmount command. These arguments are typically a subset of those already specified by the AUTOMOUNT_ARGS environment variable. To view the value of that variable, use the rcmgr get command, as shown in the following example:
# /usr/sbin/rcmgr -c get AUTOMOUNT_ARGS
-m /net -hosts -D MACH=alpha -D NET=f
Environment variables set by using the -D option resolve placeholders in the definition of auto-mount map file entries. For example, the associated NET entry might appear in the map file as follows:
vsx ${NET}system:/usr/projects2/vsx
and would resolve to
vsx fsystem:/usr/projects2/vsx
Set the arguments to pass to the autofsmount command, as determined in the previous step. To do this, use the rcmgr set command. For example:
For example:
# /usr/sbin/rcmgr -c set AUTOFSMOUNT_ARGS -m /net -hosts -D MACH=alpha -D NET=f
Set the arguments to pass to the autofsd daemon. For example:
# /usr/sbin/rcmgr -c set AUTOFSD_ARGS -D MACH=alpha -D NET=f
These arguments must match the environment variables, specified with the -D option, as set for AUTOMOUNT_ARGS.
Use the mount -e command to identify a file system served by Automount:
# mount -e | grep nfs
deli.zk3.dec.com:(pid524825) on /net type nfs (v2, ro, nogrpid, udp, hard, intr, noac, timeo=350, retrans=5)
The automounted file system is indicated by hostname:(PID).
Determine which cluster member is the Automount server node for the NFS file system you identified in the previous step.
# cfsmgr -p /net
Domain or filesystem name = /net
Server Name = swiss
Server Status : OK
Disable Automount and enable AutoFS in the rc.config.common file, as follows:
# /usr/sbin/rcmgr -c set AUTOMOUNT 0
# /usr/sbin/rcmgr -c set AUTOFS 1
(Optional) Specify the AutoFS server.
AutoFS provides automatic failover of the automounting service by means of CAA: one cluster member acts as the CFS server for auto-mounted file systems, and runs the one active copy of the AutoFS daemon. If this cluster member fails, CAA starts the autofs resource on another member.
You can use the /usr/bin/caa_profile autofs -print command to view the CAA hosting and placement policy, if any. The hosting policy specifies an ordered list of members, separated by white space, that can host the application resource. The placement policy specifies the policy according to which CAA selects the member on which to start or restart the application resource.
# /usr/bin/caa_profile autofs -print
NAME=autofs
TYPE=application
ACTION_SCRIPT=autofs.scr
ACTIVE_PLACEMENT=0
AUTO_START=0
CHECK_INTERVAL=0
DESCRIPTION=Autofs Services
FAILOVER_DELAY=0
FAILURE_INTERVAL=0
FAILURE_THRESHOLD=0
HOSTING_MEMBERS=
OPTIONAL_RESOURCES=
PLACEMENT=balanced
REQUIRED_RESOURCES=
RESTART_ATTEMPTS=3
SCRIPT_TIMEOUT=3600
The default, and recommended, behavior is to run on any cluster member, with a placement policy of balanced. If this is not suitable for your environment, use the /usr/bin/caa_profile -update command to change the autofs resource profile. See the discussion of the resource file options in caa_profile(8).
If you make a change, use the following command to have the update take effect:
# /usr/sbin/caa_register -u autofs
Reboot the cluster. When it reboots, Automount will no longer be running in the cluster, and AutoFS will start.
# /sbin/shutdown -c now
3.12.7 Clusterwide IPC--Supported Mechanisms
This section lists the interprocess communication (IPC) mechanisms that are, and are not, supported clusterwide. The information in this section was not included in the Cluster Administration manual.
The following mechanisms for clusterwide IPC are supported:
TCP/IP with sockets
Memory Channel API:
memory windows
low level locks
signals
Files:
Buffered I/O or memory mapped
UNIX API file locks
Distributed Lock Manager Locks
Clusterwide kill signals
The following mechanisms are not supported for clusterwide IPC:
UNIX domain sockets
Named pipes (FIFO special files)
Signals
System V IPC (messages, shared memory, semaphores)
3.12.8 Software Product Description (SPD) Replaced by QuickSpec
For TruCluster Server Version 5.1A, the SPD has been replaced by the TruCluster Server QuickSpec. For a description of TruCluster Server Version 5.1A and information about its capabilities and the hardware it supports, see the QuickSpec.
You can find it online at http://www.tru64unix.compaq.com/docs/pub_page/spds.html, and in the TruCluster/DOCS directory on the Associated Products Volume 2 CD-ROM.