2    TruCluster Patches

This chapter provides information about the patches included in Patch Kit 1 for the TruCluster Server software.

This chapter is organized as follows:

  • Section 2.1, Release Notes

  • Section 2.2, Summary of TruCluster Software Patches

2.1    Release Notes

This section provides release notes that are specific to the TruCluster Server software patches in this kit.

2.1.1    Required Storage Space

Additional storage space is required to install the base and TruCluster Server components of this patch kit. See Section 1.1.1 for information on the space needed for the operating system patches.

2.1.2    AlphaServer ES47 or AlphaServer GS1280 Hangs When Added to Cluster

If, after running clu_add_member to add an AlphaServer ES47 or AlphaServer GS1280 as a member of a TruCluster Server cluster, the AlphaServer hangs during its first boot, try rebooting it with the original V5.1B generic cluster kernel, clu_genvmunix.

Use the following instructions to extract and copy the V5.1B cluster genvmunix from your original Tru64 UNIX kit to your AlphaServer ES47 or AlphaServer GS1280 system. In these instructions, the AlphaServer ES47 or AlphaServer GS1280 is designated as member 5. Substitute the appropriate member number for your cluster.

  1. Insert the Tru64 UNIX Associated Products Disk 2 into the CD-ROM drive of an active member.

  2. Mount the CD-ROM to /mnt. For example:

    # mount -r /dev/disk/cdrom0c /mnt
    

  3. Mount the bootdisk of the AlphaServer ES47 or AlphaServer GS1280 on its specific mount point; for example:

    # mount root5_domain#root /cluster/members/member5/boot_partition

  4. Extract the original clu_genvmunix from the CD-ROM and copy it to the bootdisk of the AlphaServer ES47 or AlphaServer GS1280 member.

    # zcat < TCRBASE540 | ( cd /cluster/admin/tmp; tar -xf - ./usr/opt/TruCluster/clu_genvmunix)
    # cp /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix \
    /cluster/members/member5/boot_partition/genvmunix
    # rm /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix

  5. Unmount the CD-ROM and the bootdisk:

    # umount /mnt
    # umount /cluster/members/member5/boot_partition

  6. Reboot the AlphaServer ES47 or AlphaServer GS1280.

2.1.3    Updates for Rolling Upgrade Procedures

The following sections provide information on rolling upgrade procedures.

2.1.3.1    Unrecoverable Failure Procedure

The procedure to follow if you encounter unrecoverable failures while running dupatch during a rolling upgrade has changed. The new procedure calls for you to run the clu_upgrade -undo install command and then set the system baseline. The procedure is explained in the Patch Kit Installation Instructions as notes in Section 5.3 and Section 5.6.

2.1.3.2    During Rolling Patch, Do Not Add or Delete OSF, TCR, IOS, or OSH Subsets

During a rolling upgrade, do not use the /usr/sbin/setld command to add or delete any subset whose name begins with OSF, TCR, IOS, or OSH.

Adding or deleting these subsets during a roll creates inconsistencies in the tagged files.
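
Before starting a roll, you can list which such subsets are currently installed. The sketch below filters hypothetical sample lines in the shape of `setld -i` output; on a real cluster member you would pipe the output of /usr/sbin/setld -i itself:

```shell
# Hypothetical sample of `setld -i` output; on a live system, replace the
# printf pipeline source with:  /usr/sbin/setld -i
sample='OSFBASE540 installed Base System
TCRBASE540 installed TruCluster Base Components
OSFCLINET540 x Basic Networking Services'
# Print installed subsets whose names begin with OSF, TCR, IOS, or OSH.
printf '%s\n' "$sample" | awk '$1 ~ /^(OSF|TCR|IOS|OSH)/ && $2 == "installed" {print $1}'
```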

2.1.3.3    Undoing a Rolling Patch

When you undo the stages of a rolling upgrade, the stages must be undone in the correct order. However, the clu_upgrade command incorrectly allows a user undoing the stages of a rolling patch to run the clu_upgrade undo preinstall command before running the clu_upgrade undo install command.

The problem is that in the install stage, clu_upgrade cannot tell from the dupatch flag files whether the roll is going forward or backward. This ambiguity allows a user who is undoing a rolling patch to run the clu_upgrade undo preinstall command without first having run the clu_upgrade undo install command.

To avoid this problem when undoing the stages of a rolling patch, make sure to follow the documented procedure and undo the stages in order.
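
The ordering constraint can be pictured as a small guard around the two undo commands. This is purely an illustrative sketch: clu_upgrade itself performs no such check (that is the problem described above), and the marker file name used here is invented for the sketch.

```shell
# Illustrative wrapper only; clu_upgrade has no such guard, and the marker
# file /tmp/.undo_install_done is a made-up name for this sketch.
rm -f /tmp/.undo_install_done
undo_stage() {
    stage=$1
    if [ "$stage" = preinstall ] && [ ! -f /tmp/.undo_install_done ]; then
        echo "error: run 'clu_upgrade undo install' first" >&2
        return 1
    fi
    echo "clu_upgrade undo $stage"      # the real command would run here
    if [ "$stage" = install ]; then
        : > /tmp/.undo_install_done     # record that install was undone
    fi
}
undo_stage install       # must come first
undo_stage preinstall    # now permitted
```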

2.1.3.4    Ignore Message About Missing ladebug.cat File During Rolling Upgrade

When installing the patch kit during a rolling upgrade, you may see the following error and warning messages. You can ignore these messages and continue with the rolling upgrade.

Creating tagged files.
 
...............................................................................
.....
 
*** Error ***
The tar commands used to create tagged files in the '/usr' file system have
reported the following errors and warnings:
     tar: lib/nls/msg/en_US.88591/ladebug.cat : No such file or directory
.........................................................
 
*** Warning ***
The above errors were detected during the cluster upgrade. If you believe that
the errors are not critical to system operation, you can choose to continue.
If you are unsure, you should check the cluster upgrade log and refer
to clu_upgrade(8) before continuing with the upgrade.

2.1.3.5    clu_upgrade undo of Install Stage Can Result in Incorrect File Permissions

This note applies only when both of the following are true:

In this situation, incorrect file permissions can be set for files on the lead member. This can result in the failure of rsh, rlogin, and other commands that assume user IDs or identities by means of setuid.

The clu_upgrade undo install command must be run from a nonlead member that has access to the lead member's boot disk. After the command completes, follow these steps:

  1. Boot the lead member to single-user mode.

  2. Run the following script:

    #!/usr/bin/ksh -p
    #
    #    Script for restoring installed permissions
    #
    cd /
    for i in /usr/.smdb./@(OSF|TCR|IOS|OSH)*.sts
    do
      grep -q "_INSTALLED" $i 2>/dev/null && /usr/lbin/fverify -y <"${i%.sts}.inv"
    done
    

  3. Rerun installupdate, dupatch, or nhd_install, whichever is appropriate, and complete the rolling upgrade.

For information about rolling upgrades, see Chapter 7 of the Cluster Installation manual, installupdate(8), and clu_upgrade(8).
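
The pattern that selects the subset status files in the script is a ksh extended glob: @(OSF|TCR|IOS|OSH)*.sts matches .sts files whose names begin with one of those subset prefixes (note that the ksh alternation glob is @(...), not $(...)). The following self-contained demonstration uses temporary files rather than /usr/.smdb./; ksh enables extended globs by default, while bash needs `shopt -s extglob`:

```shell
# Demonstrate the @(prefix|prefix)* extended glob on throwaway files.
shopt -s extglob 2>/dev/null || true   # bash only; ksh has it by default
tmp=$(mktemp -d)
touch "$tmp/OSFBASE540.sts" "$tmp/TCRBASE540.sts" "$tmp/OTHER.sts"
# Only the OSF* and TCR* files match; OTHER.sts is skipped.
for i in "$tmp"/@(OSF|TCR|IOS|OSH)*.sts; do
    basename "$i"
done
rm -rf "$tmp"
```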

2.1.3.6    Missing Entry Messages Can Be Ignored During Rolling Patch

During the setup stage of a rolling patch, you might see a message like the following:

Creating tagged files.
............................................................................
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568

An Entry not found message will appear once for each member in the cluster. The number in the message corresponds to a PID.

You can safely ignore this Entry not found message.
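
For reference, the trailing number can be extracted from such a message with standard shell parameter expansion (the message text below is copied from the sample output above):

```shell
# The digits after the last dot are the PID of the process that created
# the temporary stanza file.
msg='clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530'
echo "${msg##*.}"
```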

2.1.3.7    Relocating AutoFS During a Rolling Upgrade on a Cluster

This note applies only to performing rolling upgrades on cluster systems that use AutoFS.

During a cluster rolling upgrade, each cluster member is singly halted and rebooted several times. The Patch Kit Installation Instructions direct you to manually relocate applications under the control of Cluster Application Availability (CAA) prior to halting a member on which CAA applications run.

Depending on the amount of NFS traffic, the manual relocation of AutoFS may sometimes fail. Failure is most likely to occur when NFS traffic is heavy. The following procedure avoids that problem.

At the start of the rolling upgrade procedure, use the caa_stat command to learn which member is running AutoFS. For example:

# caa_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss
dhcp           application    ONLINE    ONLINE    swiss
named          application    ONLINE    ONLINE    rye
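
To pick out just the member hosting the autofs resource, you can filter that output. The sketch below reuses sample lines from the caa_stat -t output above; on a live cluster you would pipe the output of caa_stat -t itself:

```shell
# Sample lines copied from the caa_stat -t output above; on a live
# cluster, replace the printf pipeline source with:  caa_stat -t
sample='autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss'
# Print the member currently hosting the autofs resource.
printf '%s\n' "$sample" | awk '$1 == "autofs" {print $NF}'
```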

To minimize your effort in the following procedure, it is best to perform the roll stage last on the member where AutoFS runs.

When it comes time to perform a manual relocation on a member where AutoFS is running, follow these steps:

  1. Stop AutoFS by entering the following command on the member where AutoFS runs:

    # /usr/sbin/caa_stop -f autofs
    

  2. Perform the manual relocation of other applications running on that member:

    # /usr/sbin/caa_relocate -s current_member -c target_member
    

After the member that had been running AutoFS has been halted as part of the rolling upgrade procedure, restart AutoFS on a member that is still up. (If this is the roll stage and the halted member is not the last member to be rolled, you can minimize your effort by restarting AutoFS on the member you plan to roll last.)

  1. On a member that is up, enter the following command to restart AutoFS. (The member where AutoFS is to run, target_member, must be up and running in multi-user mode.)

    # /usr/sbin/caa_start autofs -c target_member
    

  2. Continue with the rolling upgrade procedure.

2.1.4    Additional Steps Required When Installing Patches Before Cluster Creation

This note applies only if you install a patch kit before creating a cluster; that is, if you do the following:

  1. Install the Tru64 UNIX base kit.

  2. Install the TruCluster Server kit.

  3. Install the Version 5.1B patch kit before running the clu_create command.

In this situation, you must then perform three additional steps:

  1. Run versw, the version switch command, to set the new version identifier:

    # /usr/sbin/versw -setnew
    

  2. Run versw to switch to the new version:

    # /usr/sbin/versw -switch
    

  3. Run the clu_create command to create your cluster:

    # /usr/sbin/clu_create
    

2.1.5    When Taking a Cluster Member to Single-User Mode, First Halt the Member

To take a cluster member from multiuser mode to single-user mode, first halt the member and then boot it to single-user mode. For example:

# shutdown -h now
>>> boot -fl s

Halting and booting the system ensures that it provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode.

When the system reaches single-user mode, run the following commands:

# /sbin/init s
# /sbin/bcheckrc
# /usr/sbin/lmf reset

2.1.6    Problems with clu_upgrade switch Stage

If the clu_upgrade switch stage does not complete successfully, you may see a message like the following:

versw: No switch due to inconsistent versions

The problem can be due to one or more members running genvmunix, a generic kernel.

Use the clu_get_info -full command and note each member's version number, as reported in the line that begins

Member base O/S version

If a member has a version number different from that of the other members, shut down that member and reboot it from vmunix, the custom kernel. If multiple members have different version numbers, reboot them one at a time from vmunix.
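
A quick way to spot an inconsistency is to reduce the version lines to their distinct values. The sample lines below are hypothetical; on a live cluster, you would pipe the output of clu_get_info -full itself:

```shell
# Hypothetical "Member base O/S version" lines; on a live cluster,
# replace the printf pipeline source with:  clu_get_info -full
sample='Member base O/S version = 5.1B
Member base O/S version = 5.1B
Member base O/S version = 5.1A'
# More than one distinct value means at least one member must be
# rebooted from its custom kernel, vmunix.
printf '%s\n' "$sample" | awk -F'= *' '{print $2}' | sort -u
```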

2.2    Summary of TruCluster Software Patches

This section provides brief descriptions of the patches in Patch Kit 1 for the TruCluster Server software products.

Number: Patch 2.00

Abstract: Fix for aliasd daemon

State: New

Modifies the aliasd daemon so that, when selecting the proxy ARP master, it includes interface aliases in determining whether an interface is appropriate for use as the ARP address for a cluster alias.

Number: Patch 5.00

Abstract: Fix for initialization of Memory Channel driver

State: New

This patch:

  • Fixes a regression for single physical rail Memory Channel configurations, and cleans up stale data left on an offline physical rail by the Memory Channel driver.

  • Fixes issues associated with the initialization of the Memory Channel driver.

Number: Patch 7.00

Abstract: Fixes an issue with ICS on NUMA-based systems

State: New

This patch fixes an issue with ICS (Internode Communication Services) on a NUMA-based system in a cluster.

Number: Patch 14.00

Abstract: Cluster specific fix for mounting cluster root domain

State: New

This patch enables a cluster to boot even if the cluster root domain devices are private to different cluster members. Although this is not a recommended configuration, it should not result in an unbootable cluster. Currently, this fix applies only to cluster root domains that are not under LSM control.

Number: Patch 17.00

Abstract: Fixes memory leak in cluster alias subsystem

State: Supersedes Patch 15.00

This patch:

  • Fixes a problem in which cluster alias connections are not distributed among cluster members according to the defined selection weight.

  • Fixes a memory leak in the cluster alias subsystem.

Number: Patch 19.00

Abstract: Fix for Oracle startup failure

State: New

This patch fixes a problem in one of the shipped rc scripts whereby Oracle fails during startup on a clustered system.

Number: Patch 22.00

Abstract: Fixes panic seen on LAN cluster running under load

State: Supersedes Patch 20.00

This patch:

  • Corrects a problem involving discarded UDP datagrams that do not come from the correct port.

  • Corrects a problem in which a panic displaying the message "error CNX MGR: cnx_comm_error: invalid node state" occurs on a LAN cluster running under load when other members are rebooting.

Number: Patch 26.00

Abstract: Problems with LSM disks and cluster quorum tool

State: New

This patch corrects problems with LSM disks and the cluster quorum tools. When a member having LSM disks local to it is down, the quorum tools fail to update quorum. This causes other cluster commands to fail.

Number: Patch 33.00

Abstract: Fix for CAA daemon

State: New

This patch:

  • Addresses an error that caa_register -u produces when no balance data exists.

  • Corrects a problem with resource inaccessibility if the hosting member crashes during a remote caa_stop operation.

Number: Patch 35.00

Abstract: Fix for cluster alias manager SUITlet

State: New

This patch fixes a problem that causes the cluster alias manager SUITlet to interpret any cluster alias that has virtual={t|f} configured as a virtual alias, regardless of the value actually set.

Number: Patch 37.00

Abstract: Security (SSRT2265)

State: New

This patch corrects a potential security vulnerability which, under certain circumstances, could compromise system integrity.

Number: Patch 39.00

Abstract: Reliable DataGram kernel thread problem

State: New

This patch fixes a problem in which an RDG (Reliable DataGram) kernel thread can starve other timeshare threads on a uniprocessor cluster member. In particular, system services such as networking threads can be affected.

Number: Patch 43.00

Abstract: Fixes a cluster member hang

State: Supersedes Patch 24.00

This patch:

  • Addresses an assertion caused by a bad user pointer passed to the kernel via sys_call.

  • Corrects a condition that causes a node to hang during testing of Memory Channel cable pulls. A cluster member sometimes hangs when a Memory Channel cable is pulled, the node is taken down, the cable is plugged back in, and the node is rebooted.

Number: Patch 46.00

Abstract: Fixes a cluster deadlock

State: Supersedes Patches 8.00, 9.00, 10.00, 12.00, 41.00, 44.00

This patch:

  • Fixes a problem that causes a hang to occur when multiple nodes are shutting down simultaneously.

  • Fixes a problem that causes a Cluster File System panic when using raw Asynchronous I/O.

  • Adds code to assist in problem diagnosis.

  • Relieves pressure on the CMS global DLM lock by allowing AutoFS auto-unmounts to back off.

  • Updates the attributes on a directory when files are removed by a cluster node that is not the file system server.

  • Fixes a problem of excessive FIDS_LOCK contention that occurs when a large number of files use system-based file locking.

  • Fixes a cluster deadlock that may occur during failover and recovery when direct I/O is in use.

  • Corrects diagnostic code that might result in a panic during kernel boot.

  • Prevents a panic when an AutoFS file system is auto-unmounted.

Number: Patch 48.00

Abstract: Fixes a cfsd core dumping problem

State: New

This patch fixes a problem in which cfsd can dump core shortly after startup if cfsd is enabled at that time, or shortly after cfsd is enabled later. Fixing the problem also requires applying a dsfmgr patch.

Number: Patch 50.00

Abstract: Fixes a regression associated with non-SCSI storage

State: Supersedes Patches 27.00, 28.00, 29.00, 31.00

This patch:

  • Fixes a regression associated with non-SCSI storage.

  • Improves the responsiveness of EINPROGRESS handling during the issuing of I/O barriers by removing a possible infinite loop scenario that could occur due to the deletion of a storage device.

  • Fixes a problem that causes a panic with the message "CNX MGR: Invalid configuration for cluster seq disk" during simultaneous booting of cluster nodes.

  • Fixes a possible race condition between a SCSI reservation conflict and an I/O drain, which could result in a hang.

  • Alleviates a condition in which a cluster member takes an extremely long time to boot when using LSM.

  • Fixes a problem in the cluster kernel where a cluster member panics while doing remote I/O over the interconnect.

  • Corrects an issue to allow the Device Request Dispatcher (DRD) to retry getting disk attributes when EINPROGRESS is returned from the disk driver.

  • Fixes a problem in which access to the quorum disk can be lost if the quorum disk is on a parallel SCSI bus and multiple bus resets are encountered.