2    TruCluster Server Patches

This chapter provides information about the patches included in Patch Kit 2 for the TruCluster Server software.

This chapter is organized as follows:

2.1    Release Notes

This section provides release notes that are specific to the TruCluster Server software patches in this kit.

2.1.1    Required Storage Space

The following storage space is required to install the base and TruCluster Server components of this patch kit:

See Section 1.1.1 for information on space needed for the operating system patches.

2.1.2    AlphaServer ES47 or AlphaServer GS1280 Hangs When Added to Cluster

If, after you run clu_add_member to add an AlphaServer ES47 or AlphaServer GS1280 as a member of a TruCluster, the AlphaServer hangs during its first boot, try rebooting it with the original Version 5.1B generic cluster kernel, clu_genvmunix.

Use the following instructions to extract and copy the V5.1B cluster genvmunix from your original Tru64 UNIX kit to your AlphaServer ES47 or AlphaServer GS1280 system. In these instructions, the AlphaServer ES47 or AlphaServer GS1280 is designated as member 5. Substitute the appropriate member number for your cluster.

  1. Insert the Tru64 UNIX Associated Products Disk 2 into the CD-ROM drive of an active member.

  2. Mount the CD-ROM to /mnt. For example:

    # mount -r /dev/disk/cdrom0c /mnt
    

  3. Mount the bootdisk of the AlphaServer ES47 or AlphaServer GS1280 on its specific mount point; for example:

    # mount root5_domain#root /cluster/members/member5/boot_partition
    

  4. Extract the original clu_genvmunix from the CD-ROM and copy it to the bootdisk of the AlphaServer ES47 or AlphaServer GS1280 member.

    # zcat < TCRBASE540 | ( cd /cluster/admin/tmp; tar -xf - ./usr/opt/TruCluster/clu_genvmunix)
    # cp /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix \
    /cluster/members/member5/boot_partition/genvmunix
    # rm /cluster/admin/tmp/usr/opt/TruCluster/clu_genvmunix
    

  5. Unmount the CD-ROM and the bootdisk:

    # umount /mnt
    # umount /cluster/members/member5/boot_partition
    

  6. Reboot the AlphaServer ES47 or AlphaServer GS1280.
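The member-specific paths in steps 3 through 5 all derive from the member number. As an illustrative sketch (these helper names are hypothetical, not part of the kit), the paths can be built from the member number to reduce the chance of a typo when substituting your own:

```shell
#!/bin/sh
# Hypothetical helpers: build the member-specific paths used in the steps
# above from a member number (5 in the example; substitute your own).
boot_partition() { echo "/cluster/members/member$1/boot_partition"; }
root_domain()    { echo "root$1_domain#root"; }

# For member 5, these reproduce the names used in steps 3 and 4:
echo "mount $(root_domain 5) $(boot_partition 5)"
echo "cp clu_genvmunix $(boot_partition 5)/genvmunix"
```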

2.1.3    No-Roll Procedure Cannot Be Used to Remove Patch Kit

To remove Patch Kit 2, you must run the /etc/dn_fix_dat.sh script after rebuilding the kernel and before rebooting the system. If the script is not executed before rebooting, the system will fail to boot.

Because the no-roll procedure automatically reboots the system after deleting the patches, you would not be able to run the script as required. Therefore, the no-roll procedure cannot be used to remove the patch kit.

The workaround is to use the rolling upgrade procedure to remove the patch kit. See Section 1.1.3 for more information.

2.1.4    Updates for Rolling Upgrade Procedures

The following sections provide information on rolling upgrade procedures.

2.1.4.1    Noncritical Errors

During a rolling upgrade to install Patch Kit 2, you may encounter the following noncritical situations:

2.1.4.2    Procedure for Simultaneous Upgrades

When performing a simultaneous rolling upgrade of NHD6 and the Version 5.1B Patch Kit 2, you must install the NHD6 kit first. If you do not, you may see a number of installation errors; these errors can be safely ignored.

2.1.4.3    Unrecoverable Failure Procedure

The procedure to follow if you encounter unrecoverable failures while running dupatch during a rolling upgrade has changed. The new procedure calls for you to run the clu_upgrade -undo install command and then set the system baseline. The procedure is explained in the Patch Kit Installation Instructions as notes in Section 5.3 and Section 5.6.

2.1.4.4    Do Not Add or Delete OSF, TCR, IOS, or OSH Subsets During Roll

During a rolling upgrade, do not use the /usr/sbin/setld command to add or delete any of the following subsets:

Adding or deleting these subsets during a roll creates inconsistencies in the tagged files.

2.1.4.5    Undo Stages in Correct Order

If you need to undo the install stage because the lead member is in an unrecoverable state, be sure to undo the stages in the correct order.

During the install stage, clu_upgrade cannot tell whether the roll is going forward or backward. This ambiguity incorrectly allows the clu_upgrade undo preinstall stage to be run before clu_upgrade undo install. Refer to the Patch Kit Installation Instructions for additional information on undoing a rolling patch.

2.1.4.6    Ignore Message About Missing ladebug.cat File

When installing the patch kit during a rolling upgrade, you may see the following error and warning messages. You can ignore these messages and continue with the rolling upgrade.

Creating tagged files.
 
...............................................................................
.....
 
*** Error ***
The tar commands used to create tagged files in the '/usr' file system have
reported the following errors and warnings:
     tar: lib/nls/msg/en_US.88591/ladebug.cat : No such file or directory
.........................................................
 
*** Warning ***
The above errors were detected during the cluster upgrade. If you believe that
the errors are not critical to system operation, you can choose to continue.
If you are unsure, you should check the cluster upgrade log and refer
to clu_upgrade(8) before continuing with the upgrade.

2.1.4.7    clu_upgrade undo of Install Stage Can Result in Incorrect File Permissions

This note applies only when both of the following are true:

In this situation, incorrect file permissions can be set for files on the lead member. This can result in the failure of rsh, rlogin, and other commands that assume user IDs or identities by means of setuid.

The clu_upgrade undo install command must be run from a nonlead member that has access to the lead member's boot disk. After the command completes, follow these steps:

  1. Boot the lead member to single-user mode.

  2. Run the following script:

    #!/usr/bin/ksh -p
    #
    #    Script for restoring installed permissions
    #
    cd /
    for i in /usr/.smdb./@(OSF|TCR|IOS|OSH)*.sts
    do
      grep -q "_INSTALLED" $i 2>/dev/null && /usr/lbin/fverify -y <"${i%.sts}.inv"
    done
    

  3. Rerun installupdate, dupatch, or nhd_install, whichever is appropriate, and complete the rolling upgrade.

For information about rolling upgrades, see Chapter 7 of the Cluster Installation manual, installupdate(8), and clu_upgrade(8).
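The subset pattern in the restore script is intended as a ksh extended glob, @(OSF|TCR|IOS|OSH), which matches exactly one of the four product prefixes. For illustration, the same selection can be spelled portably in any POSIX shell (the file names below are sample data):

```shell
#!/bin/sh
# Portable-sh illustration of the subset selection done by the restore
# script: pick out .sts status files for the OSF, TCR, IOS, and OSH kits.
is_patched_subset() {
  case "$1" in
    OSF*.sts|TCR*.sts|IOS*.sts|OSH*.sts) return 0 ;;
    *) return 1 ;;
  esac
}
is_patched_subset "OSFBASE540.sts" && echo "OSFBASE540.sts: selected"
is_patched_subset "XYZ540.sts"     || echo "XYZ540.sts: skipped"
```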

2.1.4.8    Missing Entry Messages Can Be Ignored During Rolling Patch

During the setup stage of a rolling patch, you might see a message like the following:

Creating tagged files.
............................................................................
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568

An Entry not found message will appear once for each member in the cluster. The number in the message corresponds to a PID.

You can safely ignore this Entry not found message.
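Because one such message appears per member, a quick sanity check is to count them in saved upgrade output. This sketch uses sample log text mirroring the two-member example above; on a real cluster you would grep the cluster upgrade log instead:

```shell
#!/bin/sh
# Count the ignorable clubase messages in saved upgrade output (sketch;
# the log contents here are sample data from a two-member cluster).
log='clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568'
count=$(printf '%s\n' "$log" | grep -c 'clubase: Entry not found')
echo "$count ignorable clubase messages"
```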

2.1.4.9    Relocating AutoFS During a Rolling Upgrade on a Cluster

This note applies only to performing rolling upgrades on cluster systems that use AutoFS.

During a cluster rolling upgrade, each cluster member is individually halted and rebooted several times. The Patch Kit Installation Instructions direct you to manually relocate applications under the control of Cluster Application Availability (CAA) before halting a member on which CAA applications run.

Depending on the amount of NFS traffic, the manual relocation of AutoFS may sometimes fail. Failure is most likely to occur when NFS traffic is heavy. The following procedure avoids that problem.

At the start of the rolling upgrade procedure, use the caa_stat command to learn which member is running AutoFS. For example:

# caa_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss
dhcp           application    ONLINE    ONLINE    swiss
named          application    ONLINE    ONLINE    rye
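The Host column can be extracted from that output with a short filter. The sketch below assumes the five-column layout shown above (resource name first, host last); the sample text stands in for live caa_stat -t output:

```shell
#!/bin/sh
# Sketch: find which member hosts a CAA resource by filtering caa_stat -t
# style output (sample data below; on a cluster, pipe caa_stat -t instead).
find_host() {
  # $1 = resource name; reads caa_stat -t style rows on stdin
  awk -v res="$1" '$1 == res { print $NF }'
}
sample='autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss'
host=$(printf '%s\n' "$sample" | find_host autofs)
echo "autofs runs on: $host"
```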

To minimize your effort in the following procedure, perform the roll stage last on the member where AutoFS runs.

When it is time to perform a manual relocation on a member where AutoFS is running, follow these steps:

  1. Stop AutoFS by entering the following command on the member where AutoFS runs:

    # /usr/sbin/caa_stop -f autofs
    

  2. Perform the manual relocation of other applications running on that member:

    # /usr/sbin/caa_relocate -s current_member -c target_member
    

After the member that had been running AutoFS has been halted as part of the rolling upgrade procedure, restart AutoFS on a member that is still up. (If this is the roll stage and the halted member is not the last member to be rolled, you can minimize your effort by restarting AutoFS on the member you plan to roll last.)

  1. On a member that is up, enter the following command to restart AutoFS. (The member where AutoFS is to run, target_member, must be up and running in multi-user mode.)

    # /usr/sbin/caa_startautofs -c target_member
    

  2. Continue with the rolling upgrade procedure.

2.1.5    Additional Steps Required When Installing Patches Before Cluster Creation

This note applies only if you install a patch kit before creating a cluster; that is, if you do the following:

  1. Install the Tru64 UNIX base kit.

  2. Install the TruCluster Server kit.

  3. Install the Version 5.1B patch kit before running the clu_create command.

In this situation, you must then perform three additional steps:

  1. Run versw, the version switch command, to set the new version identifier:

    # /usr/sbin/versw -setnew
    

  2. Run versw to switch to the new version:

    # /usr/sbin/versw -switch
    

  3. Run the clu_create command to create your cluster:

    # /usr/sbin/clu_create
    

2.1.6    When Taking a Cluster Member to Single-User Mode, First Halt the Member

To take a cluster member from multiuser mode to single-user mode, first halt the member and then boot it to single-user mode. For example:

# shutdown -h now
>>> boot -fl s

Halting and booting the system ensures that it provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode.

When the system reaches single-user mode, run the following commands:

# /sbin/init s
# /sbin/bcheckrc
# /usr/sbin/lmf reset

2.1.7    Problems with clu_upgrade switch Stage

If the clu_upgrade switch stage does not complete successfully, you may see a message like the following:

versw: No switch due to inconsistent versions

The problem can be due to one or more members running genvmunix, a generic kernel.

Use the command clu_get_info -full and note each member's version number, as reported in the line beginning

Member base O/S version

If a member has a version number different from that of the other members, shut down that member and reboot it from vmunix, the custom kernel. If multiple members have different version numbers, reboot them from vmunix one at a time.
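As a sketch, the version lines can also be compared mechanically: if more than one distinct value appears, at least one member is running a mismatched (typically generic) kernel. The text below is sample data, not real clu_get_info output:

```shell
#!/bin/sh
# Sketch: count distinct "Member base O/S version" values in clu_get_info
# -full style output (sample data; on a cluster, pipe the command instead).
sample='Member base O/S version = 5.1B
Member base O/S version = 5.1B
Member base O/S version = 5.1A'
distinct=$(printf '%s\n' "$sample" \
  | awk -F' = ' '/Member base O\/S version/ { print $2 }' \
  | sort -u | wc -l | tr -d ' ')
echo "distinct versions: $distinct"   # more than 1 indicates a mismatch
```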

2.2    Summary of TruCluster Software Patches

This section provides brief descriptions of the patches in Patch Kit 2 for the TruCluster Server software products. Because Tru64 UNIX patch kits are cumulative, each patch lists its state according to the following criteria:

Number: Patch 2.00

Abstract: Fix for aliasd daemon

State: Existing (Kit 1)

Modifies the aliasd daemon to include interface aliases when determining whether an interface is appropriate for use as the ARP address for a cluster alias when selecting the proxy ARP master.

Number: Patch 7.00

Abstract: Fixes an issue with ICS on NUMA-based systems

State: Existing (Kit 1)

  • Fixes an issue with ICS (Internode Communication Services) on a NUMA-based system in a cluster.

Number: Patch 14.00

Abstract: Cluster specific fix for mounting cluster root domain

State: Existing (Kit 1)

This patch enables a cluster to boot even if the cluster root domain devices are private to different cluster members. Although this is not a recommended configuration, it should not result in an unbootable cluster. Currently, this fix applies only to cluster root domains that are not under LSM control.

Number: Patch 19.00

Abstract: Fix for Oracle startup failure

State: Existing (Kit 1)

  • Fixes a problem in one of the shipped rc scripts whereby Oracle fails during startup on a clustered system.

Number: Patch 26.00

Abstract: Problems with LSM disks and cluster quorum tool

State: Existing (Kit 1)

  • Corrects problems with LSM disks and the cluster quorum tools. When a member having LSM disks local to it is down, the quorum tools fail to update quorum. This causes other cluster commands to fail.

Number: Patch 35.00

Abstract: Fix for cluster alias manager SUITlet

State: Existing (Kit 1)

  • Fixes a problem that causes the cluster alias manager SUITlet to interpret any cluster alias as a virtual alias, regardless of its actual virtual={t|f} setting.

Number: Patch 39.00

Abstract: Reliable DataGram kernel thread problem

State: Existing (Kit 1)

  • Fixes a problem in which an RDG (Reliable DataGram) kernel thread can starve other timeshare threads on a uniprocessor cluster member. In particular, system services such as networking threads can be affected.

Number: Patch 52.00

Abstract: Fix for RM_AUDIT_ACK_BLOCK

State: Supersedes Patches 3.00, 5.00

  • Fixes a regression for single physical rail Memory Channel configurations, and cleans up stale data left on an offline physical rail by the Memory Channel driver.

  • Fixes issues associated with the initialization of the Memory Channel driver.

  • Corrects a problem in a Memory Channel cluster where rebooting a node without performing a hardware reset can crash other members with a RM_AUDIT_ACK_BLOCK panic.

Number: Patch 63.00

Abstract: Cluster member panics with Kernel Memory Fault

State: Supersedes Patches 15.00, 17.00

  • Fixes a problem in which cluster alias connections are not distributed among cluster members according to the defined selection weight.

  • Fixes a memory leak in the cluster alias subsystem.

  • Corrects a problem that occurs when running nmap or nessus targeted at the cluster alias, where the cluster member panics with a Kernel Memory Fault.

Number: Patch 65.00

Abstract: Resolves problem with caa_register command

State: New

  • Resolves a problem in which the caa_register command allowed a CAA resource to be registered even when its profile contained an unknown attribute. This fix prevents the caa_register command from registering a resource with an unknown attribute and causes it to return an error message that identifies the unknown attribute.

Number: Patch 67.00

Abstract: Fixes a cfsd core dumping problem

State: Supersedes Patch 48.00

  • Fixes a problem with cfsd core dumping shortly after startup if it is enabled or shortly after enabling it. The problem fixed by this patch is only seen after applying a recent dsfmgr patch.

  • Corrects a problem in which cfsd will terminate prematurely and core dump when a node leaves the cluster very shortly after joining the cluster.

Number: Patch 69.00

Abstract: Fixes cluster interconnect

State: Supersedes Patches 20.00, 22.00

  • Corrects a problem involving discarded UDP datagrams that do not come from the correct port.

  • Corrects a problem in which a panic displaying the message "error CNX MGR: cnx_comm_error: invalid node state" occurs on a LAN cluster running under load when other members are rebooting.

  • Fixes a coding error, a memory leak, and a deinitialization problem in the cluster interconnect networking layer.

Number: Patch 72.00

Abstract: Fixes race condition in Device Request Dispatcher

State: Supersedes Patches 27.00, 28.00, 29.00, 31.00, 50.00, 70.00

  • Fixes a regression associated with non-SCSI storage.

  • Improves the responsiveness of EINPROGRESS handling during the issuing of I/O barriers by removing a possible infinite loop scenario that could occur due to the deletion of a storage device.

  • Fixes a problem that causes a panic with the message "CNX MGR: Invalid configuration for cluster seq disk" during simultaneous booting of cluster nodes.

  • Fixes a possible race condition between a SCSI reservation conflict and an I/O drain, which could result in a hang.

  • Alleviates a condition in which a cluster member takes an extremely long time to boot when using LSM.

  • Fixes a problem in the cluster kernel where a cluster member panics while doing remote I/O over the interconnect.

  • Corrects an issue so that the Device Request Dispatcher (DRD) retries getting disk attributes when EINPROGRESS is returned from the disk driver.

  • Fixes a problem where access to the quorum disk can be lost if the quorum disk is on a parallel SCSI bus and multiple bus resets are encountered.

  • Fixes several problems in the Device Request Dispatcher, including a race condition.

Number: Patch 74.00

Abstract: Fix for caa_report

State: New

  • Fixes a condition in which uptimes greater than 100 percent are reported for resources by caa_report.

  • Fixes a problem in which resources that never started have an ending timestamp.

Number: Patch 81.00

Abstract: Security (SSRT2265)

State: Supersedes Patch 37.00

  • Fixes a security vulnerability in the cluster interconnect security configuration that may result in a denial of service on systems running TruCluster Server software.

  • Provides enhancements to the clu_upgrade command.

Number: Patch 85.00

Abstract: Fixes a panic that may occur during an unmount

State: Supersedes Patches 8.00, 9.00, 10.00, 12.00, 41.00, 44.00, 46.00, 53.00, 54.00, 55.00, 56.00, 57.00, 58.00, 59.00, 61.00, 83.00

  • Fixes a problem that causes a hang to occur when multiple nodes are shutting down simultaneously.

  • Fixes a problem that causes a Cluster File System panic when using raw Asynchronous I/O.

  • Adds code to assist in problem diagnosis.

  • Relieves pressure on the CMS global DLM lock by allowing AutoFS auto-unmounts to back off.

  • Updates the attributes on a directory when files are removed by a cluster node that is not the file system server.

  • Fixes a problem of excessive FIDS_LOCK contention that occurs when a large number of files are using system-based file locking.

  • Fixes a cluster deadlock that may occur during failover and recovery when direct I/O is in use.

  • Corrects diagnostic code that might result in a panic during kernel boot.

  • Prevents a panic when an AutoFS file system is auto-unmounted.

  • Enhances cluster file system performance when using file locks to coordinate file access.

  • Corrects several problems with various installation commands and utilities.

  • Fixes a memory leak in the clu_get_info interface.

  • Displays the correct error message for freezefs -q on a non-AdvFS file system.

  • Eliminates a performance problem when a node, acting as CFS server of an NFS client file system, is write-appending to an external NFS server.

  • Fixes a timing window during asynchronous reads on a CFS client.

  • Fixes cfsmgr to properly return a failure status when a relocation request has failed.

  • Fixes a race condition where stale name cache entries allow file access after file unlink.

  • Fixes a panic that may occur during an unmount.

  • Fixes an internal problem in the kernel's AdvFS, UFS, and NFS file systems in which extended attributes with extremely long names (greater than 247 characters) could not be set on files. The new limit is 254 characters plus a null string terminator.

  • Corrects a problem where a CFS lookup for a mount could leave stale state behind that could adversely affect subsequent NFS operations.

Number: Patch 87.00

Abstract: Fixes a panic that occurs on a booting node

State: Supersedes Patches 24.00, 43.00, 75.00, 77.00

  • Addresses an assertion caused by a bad user pointer passed to the kernel via sys_call.

  • Corrects a condition that causes a node to hang during testing of Memory Channel cable pulls. A cluster member sometimes hangs when a Memory Channel cable is pulled, the node is taken down, the cable is plugged back in, and the node is rebooted.

  • Increases performance by reducing the lock miss rate in the ics_mct_llnode_info_lock.

  • Addresses a panic that occurs on a booting node.

  • Addresses a panic that may occur when a node is joining the cluster. A node recognizing the joining node panics while it is trying to establish a preboot channel connection with the peer node, causing the following message to be displayed on the console or in /var/adm/messages:

    panic (cpu x): ics_mct: rx conn 3

Number: Patch 89.00

Abstract: Fix for CAA core dumping problem

State: Supersedes Patches 33.00, 79.00

  • Addresses an error that caa_register -u produces when no balance data is present.

  • Corrects a problem with resource inaccessibility if the hosting member crashes during a remote caa_stop operation.

  • Fixes a problem in which CAA dumps core when trying to deal with cluster member ID 63.

  • Fixes a problem in which CAAD might dump core due to a race condition when multiple events to which it subscribes arrive simultaneously.