3 Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0005.

Table 3-1 lists patches that have been updated.

Table 3-2 provides a summary of patches in Patch Kit-0005.

Table 3-1: Updated TruCluster Software Patches

Patch IDs	Change Summary
Patches 119.00, 122.00, 127.00, 129.00	New
Patches 5.00, 7.00, 32.00, 70.00, 72.00, 17.00, 42.00, 43.00, 45.00, 83.00	Superseded by Patch 85.00
Patches 28.00, 41.00, 46.00, 47.00, 49.00, 86.00, 87.00, 88.00	Superseded by Patch 90.00
Patch 120.00	Superseded by Patch 122.00
Patches 11.00, 30.00, 67.00, 69.00	Superseded by Patch 131.00
Patches 78.00, 80.00, 141.00	Superseded by Patch 143.00
Patches 66.00, 123.00, 125.00	Superseded by Patch 148.00
Patches 15.00, 33.00, 34.00, 35.00, 36.00, 37.00, 39.00, 73.00, 74.00, 75.00, 77.00, 132.00, 133.00, 134.00, 135.00, 136.00, 137.00, 138.00, 140.00	Superseded by Patch 150.00
Patches 2.00, 13.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 24.00, 26.00, 50.00, 51.00, 52.00, 53.00, 54.00, 55.00, 56.00, 57.00, 58.00, 59.00, 60.00, 61.00, 62.00, 64.00, 82.00, 91.00, 92.00, 93.00, 94.00, 95.00, 96.00, 97.00, 98.00, 99.00, 100.00, 101.00, 102.00, 103.00, 104.00, 105.00, 106.00, 107.00, 108.00, 109.00, 110.00, 111.00, 112.00, 113.00, 114.00, 115.00, 117.00, 146.00	Superseded by Patch 152.01
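
Table 3-1 can be read as a lookup from a superseded patch ID to the patch that replaces it. The following is a minimal Python sketch of that lookup, using only a few rows of the table. The names SUPERSEDED_BY and current_patch are hypothetical and are not part of the patch kit or of any Tru64 UNIX tool.

```python
# Illustrative subset of Table 3-1: superseded patch ID -> superseding patch ID.
# The dictionary and function names are hypothetical, not part of any patch tool.
SUPERSEDED_BY = {
    "120.00": "122.00",
    "78.00": "143.00",
    "80.00": "143.00",
    "141.00": "143.00",
    "66.00": "148.00",
    "123.00": "148.00",
    "125.00": "148.00",
}


def current_patch(patch_id: str) -> str:
    """Follow supersession links until reaching a patch that is not superseded."""
    seen = set()
    while patch_id in SUPERSEDED_BY:
        if patch_id in seen:  # guard against an accidental cycle in the data
            raise ValueError(f"supersession cycle at patch {patch_id}")
        seen.add(patch_id)
        patch_id = SUPERSEDED_BY[patch_id]
    return patch_id


if __name__ == "__main__":
    for pid in ("120.00", "141.00", "129.00"):
        print(pid, "->", current_patch(pid))
```

A patch ID that does not appear as a key, such as a new patch, is returned unchanged.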

Table 3-2: Summary of TruCluster Patches

Patch IDs

Abstract

Patch 4.00

TCR510DX-001

Patch: Fix for Cluster Alias Manager system management tool

State: Existing

This patch prevents the Cluster Alias Manager system management tool from crashing and displaying errors.

Patch 9.00

TCR510-001

Patch: Initializing the MC-API results in system crash

State: Existing

This patch fixes a problem in which initializing the MC-API on AlphaServer GS160 systems causes the system to crash with a "kernel memory fault" message.

Patch 85.00

TCR510-107

Patch: Fixes memory hang

State: Supersedes patches TCR510-002 (5.00), TCR510-003 (7.00), TCR510-023 (32.00), TCR510-042 (70.00), TCR510-039 (72.00), TCR510-018 (17.00), TCR510-028 (42.00), TCR510-052 (43.00), TCR510-043 (45.00), TCR510-095 (83.00)

This patch corrects the following:

Fixes an occasional cluster hang which can occur after a Memory Channel error.

Fixes a kernel memory fault which occurs in the ics_mct_ring_recv() routine. The kernel memory fault is seen when a node is booting into the cluster, and can occur on the booting node or on another node.

Fixes a problem in ICS where ring_recv() does not properly handle a change in channel numbers. The fix will, in turn, improve validation of the connection structure on node joins.

Fixes the handling of communication errors on clusters so that a down node will not declare all other nodes dead.

Fixes a problem that causes a panic with the error message "CNX QDISK: Yielding to foreign owner with quorum". The panic is caused by a long-running thread, the ICS/MCT receive thread, which prevents other kernel threads from accessing the CPU.

Eliminates unnecessary rail failovers in vhub configurations and removes rmerror_int diagnostic messages.

Fixes an issue which causes all cluster nodes to hang or panic if a Wildfire is halted via the halt button.

Fixes a panic in a clustered environment that displays the following error message:
rm_request_on_bad_prail

Prevents an "ics_mct: Error from establish_RM_notification_channel" panic on clusters.

Fixes four problem situations:
- When a physical MC rail goes offline.
- When the master failover node goes offline during a failover.
- How ICS handles the resend situation when MC errors take place.
- Failing over due to parity errors increasing beyond the limit.

Fixes hangs and increases performance of memory channel ICS operation.

Patch 90.00

TCR510-087

Patch: Fixes a panic in clua_cnx_unregister

State: Supersedes patches TCR510-019 (28.00), TCR510-029 (41.00), TCR510-041 (46.00), TCR510-048 (47.00), TCR510-037 (49.00), TCR510-091 (86.00), TCR510-082 (87.00), TCR510-066 (88.00)

This patch corrects the following:

Fixes the cluamgr command where it will display the alias status even if no cluster member has joined the alias.

Fixes a problem in which RPC requests to the cluster alias may fail with "RPC timeout" message.

Fixes a cluster node hang from in_pcbnotify.

Fixes a problem in which a rebooted node is not able to send messages to the cluster alias.

Fixes multiple networking issues within a cluster environment:
- Cluster member loses connectivity with clients on remote subnets.
- aliasd not handling multiple virtual aliases in a subnet and/or IP aliases.
- Allows cluster members to route for an alias without joining it.
- aliasd writing illegal configurations into gated.conf.memberX.
- Default route not being restored after network connectivity issues.
- Fixes a race condition between aliasd and gated.
- Fixes a problem with a hang caused by an incorrect /etc/hosts entry.

Fixes a problem in which the cluster alias subsystem does not send a reply to a client that pings a cluster alias address with a packet size of less than 28 bytes.

Fixes a memory corruption panic which could occur after a member joins the cluster or after adding a new cluster alias to one or more of the members.

Fixes a problem with cluster alias selection priority when adding a member to an alias.

Fixes a panic in clua_cnx_unregister where a TP structure could not be allocated for a new TCP connection.

Patch 119.00

TCR510DX-002

Patch: Security (SSRT1-40U, SSRT1-41U, SSRT1-42U, SSRT1-45U)

State: New

A potential security vulnerability has been discovered where, under certain circumstances, system integrity may be compromised. This may be in the form of improper file access. Compaq has corrected this potential vulnerability.

Patch 122.00

TCR510-063

Patch: cfsmgr works correctly with upper case member names

State: New. Supersedes patch TCR510-070 (120.00)

This patch corrects the following:

Corrects a cfsmgr error "Not enough space" when attempting to relocate a file system with a large number of disks.

Allows cfsmgr to work correctly with upper and mixed case member names.

Patch 127.00

TCR510-092

Patch: Using a cluster as a RIS server causes a panic

State: New

This patch addresses two problems:

A panic caused by a known problem, using a cluster as a RIS server.

A fix to RIS/DMS serving in a TruCluster.

Patch 129.00

TCR510-071

Patch: EVM cluster-wide event may cause a panic

State: New

This patch fixes a problem that, under very heavy loads in a cluster, could cause the system to panic when duplicating a cluster EVM event.

Patch 131.00

TCR510-104

Patch: Fix for Oracle 9i hang

State: Supersedes patches TCR510-007 (11.00), TCR510-024 (30.00), TCR510-036 (67.00), TCR510-049 (69.00)

This patch corrects the following:

Corrects a problem in which the RDG subsystem will stop sending messages even though there are messages which are deliverable.

Fixes an incorrect display of the following warning message at boot time:
rdg: failed to start context rcvq scan thread

Fixes a kernel memory fault with the RDG autowiring mechanism, also seen as a "pte not valid" crash.

Adds a multichannel wait flag to pid_unblock.

Contains performance enhancements.

Fixes a problem with RDG whereby broadcast packets can interact with the context receive queue.

Closes a timing window that can cause Oracle 9i to hang when a remote node in the cluster goes down.

Patch 143.00

TCR510-085

Patch: Panic in distributed lock mgr deadlock detection code

State: Supersedes patches TCR510-033 (78.00), TCR510-047 (80.00), TCR510-061 (141.00)

This patch corrects the following:

Fixes an Oracle process hang if a node fails after receiving a "rsbinfo" message.

Fixes a DLM problem where two processes could take out the same lock.

Fixes a panic in dlm when another node in the cluster is halted.

Fixes a panic in the distributed lock manager deadlock detection code.

Patch 148.00

TCR510-121

Patch: CAA applications not failing over

State: Supersedes patches TCR510-027 (66.00), TCR510-067 (123.00), TCR510-110 (125.00)

This patch corrects the following:

For systems running TruCluster Server V5.1 with the following configurations:
- Tapes and/or media changer devices used as CAA resources.
- A combination of tapes, media changers, and network interfaces used as CAA resources.

Fixes a problem that prevents CAA from updating the state of any of the above resources when connectivity to the corresponding device (tape, media changer, or network) is lost or restored.

Fixes a situation in which the CAA daemon on a clustered system crashes and dumps core.

Fixes the major problems of CAA applications not failing over during a node shutdown and a caad hang condition at startup.

Corrects the inability to start and stop CAA resources. When started they will go to the unknown state and never start. The problem is nondeterministic. Several CAA resources may be started before the problem is seen.

Patch 150.00

TCR510-115

Patch: Failover does not occur properly

State: Supersedes patches TCR510-005 (15.00), TCR510-021 (33.00), TCR510-009 (34.00), TCR510-016 (35.00), TCR510-011 (36.00), TCR510-022 (37.00), TCR510-012 (39.00), TCR510-035 (73.00), TCR510-038 (74.00), TCR510-030 (75.00), TCR510-034 (77.00), TCR510-109 (132.00), TCR510-108 (133.00), TCR510-094 (134.00), TCR510-065 (135.00), TCR510-084 (136.00), TCR510-105 (137.00), TCR510-106 (138.00), TCR510-090 (140.00)

This patch corrects the following:

Fixes two TruCluster problems:
- If a Quorum disk is manually added by the command clu_quorum -d add, the disk becomes inaccessible because the PR flag is not being cleaned up. The same command will work in the next reboot.
- A cluster member cannot boot under a specific hardware setup. The CFS mount fails because the PR flag is not cleaned up.

Addresses the need for IOCTL for remote DRD, adds clean up for failed remote closes for non-disks, fixes error returns on failed tape/changer closes, and fixes tape deadlock experienced in netbackups.

Fixes an issue with a tape/changer failing to correctly report a close failure of a device in a cluster environment.

Fixes a problem which results in a system panic while doing tape failovers.

Fixes a node panic during fiber port disables.

Fixes an issue with a tape/changer giving back "busy on open" if a close from a remote node failed.

Provides the TCR portion of the functionality to support EMC storage boxes that support Persistent Reserves (SCSI command set) as defined by the final SCSI specification.

Fixes an issue with requests being stuck on a failed disk in a cluster.

Allows high density tape drives to use the high density compression setting in a cluster environment.

Fixes a kernel memory fault panic that can occur within a cluster member during failover while using shared served devices.

Fixes an issue with the hwmgr -delete command that causes a panic in a cluster.

Fixes a KZPCC controller problem in which deleting a Virtual Drive using SWCC and adding the same drive back can result in the disk being inaccessible.

Fixes several problems with the device request dispatcher (drd) kernel subsystem, including cluster hangs, kernel memory faults, reboot problems, node recovery problems, and device failover problems.

Fixes cluster hangs and panics due to I/O problems.

Fixes a problem where the tape changer is only accessible from the member that is the drd server for the changer.

Fixes a race condition problem when multiple unbarrierable disks failed at the same time.

Fixes a problem where CAA applications using tape/changers as required resources will not come ONLINE (as seen by caa_stat).

Fixes a kernel memory fault in drd_open.

Fixes the following problems:
- Prevents HSG80 controller crashes.
- Fixes cam_logger error message problems during cluster boot.
- Fixes DRD problems and persistent reservation problems.
- Fixes problems with drdmgr not responding to a failover disk.
- Fixes a domain panic in a cluster when a file system is mounted on a disk accessed remotely over the cluster interconnect.

Patch 152.01

TCR510-123

Patch: Security (SSRT0691U)

State: Supersedes patches TCR510-004 (2.00), TCR510-006 (13.00), TCR510-026 (18.00), TCR510-020 (19.00), TCR510-013 (20.00), TCR510-015 (21.00), TCR510-017 (22.00), TCR510-014 (23.00), TCR510-025 (24.00), TCR510-008 (26.00), TCR510-056 (50.00), TCR510-050 (51.00), TCR510-054 (52.00), TCR510-057 (53.00), TCR510-046 (54.00), TCR510-040 (55.00), TCR510-031 (56.00), TCR510-032 (57.00), TCR510-051 (58.00), TCR510-060 (59.00), TCR510-044 (60.00), TCR510-053 (61.00), TCR510-045 (62.00), TCR510-058 (64.00), TCR510-064 (82.00), TCR510-077 (91.00), TCR510-100 (92.00), TCR510-098 (93.00), TCR510-081 (94.00), TCR510-072 (95.00), TCR510-073 (96.00), TCR510-075 (97.00), TCR510-083 (98.00), TCR510-093 (99.00), TCR510-096 (100.00), TCR510-069 (101.00), TCR510-088 (102.00), TCR510-076 (103.00), TCR510-079 (104.00), TCR510-086 (105.00), TCR510-089 (106.00), TCR510-078 (107.00), TCR510-099 (108.00), TCR510-097 (109.00), TCR510-102 (110.00), TCR510-101 (111.00), TCR510-103 (112.00), TCR510-074 (113.00), TCR510-080 (114.00), TCR510-062 (115.00), TCR510-068 (117.00), TCR510-127 (144.00), TCR510-123 (146.00)

This patch corrects the following:

A potential security vulnerability has been discovered where, under certain circumstances, system integrity may be compromised. This may be in the form of improper file or privilege management. Compaq has corrected this potential vulnerability.

Provides a small TPC-C performance optimization to cfsspec_read for reporting TPC-C single node cluster numbers.

When attempting to roll a patch kit on a single member cluster without this patch, the following error messages will be seen when running the postinstall stage:
*** Error***
Members '2' is NOT at the new base software version.

*** Error***
Members '2' is NOT at the new TruCluster software version.

During the backup stage of clu_upgrade setup 1, clu_upgrade is unable to determine the name of the kernel configuration file.

clu_upgrade does not check the availability of space in /, /usr, and /usr/i18n.

During the preinstall phase, clu_upgrade will ignore a no answer when the user is prompted, during an error condition, whether they wish to continue.

clu_upgrade incorrectly assumes that if the directory /usr/i18n exists, then it is in its own file system.
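
The two conditions mentioned above (free space in /, /usr, and /usr/i18n, and whether /usr/i18n is its own file system) can be illustrated with a short Python sketch. This only illustrates the idea and does not show how clu_upgrade is implemented. The function name and the 100 MB threshold are hypothetical.

```python
import os
import shutil

# Hypothetical threshold for illustration only; the patch notes give no figure.
MIN_FREE_BYTES = 100 * 1024 * 1024


def check_paths(paths, min_free=MIN_FREE_BYTES):
    """Report, per existing path, whether it is a mount point and has enough free space."""
    report = {}
    for path in paths:
        if not os.path.isdir(path):
            continue  # e.g. /usr/i18n may not exist on every system
        free = shutil.disk_usage(path).free
        report[path] = {
            "own_file_system": os.path.ismount(path),
            "free_bytes": free,
            "enough_space": free >= min_free,
        }
    return report


if __name__ == "__main__":
    for path, info in check_paths(["/", "/usr", "/usr/i18n"]).items():
        print(path, info)
```

Testing the mount point explicitly, rather than inferring it from the existence of the directory, avoids the incorrect assumption described above.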

After the clu_upgrade clean phase, the final step of clu_upgrade, no message is displayed that leads the user to believe they have completed the upgrade. Only the prompt is returned and the clu_upgrade -completed clean command reports that the clean had not completed.

clu_upgrade can display "Could not get property..." and "...does not exist" type of error messages during the undo install phase.

The clu_upgrade undo switch command, after completing a clu_upgrade switch command, should display an error message instead of claiming it has succeeded.

Fixes a problem with disaster recovery whereby the node being restored will hang on boot.

Corrects a problem in which a cluster may panic with a "cfsdb_assert" message when restoring files from backup while simultaneously relocating the CFS server for that file system.

Corrects a problem in which a cluster member can panic with the panic string "cfsdb_assert" when a NFS V3 TCP client attempts to create a socket using mknod(2).

Corrects a problem in which a cluster member will panic with the panic string "lock_terminate: lock held" from cinactive().

Fixes a hang seen while running collect and the vdump utility. This patch prevents the hang in tok_wait from occurring. This also prevents a cfsdb_assert panic that contains the following message:
Assert Failed: (tcbp->tcb_flags & TOK_GIVEBACK) == 0

Prevents a cfsdb_assert panic from occurring in the cfs block reserve code. A system that encounters this panic is most likely running process accounting.

Provides performance enhancements for copying large files (files smaller than the total size of client's physical memory) between a CFS client and server within the cluster.

Corrects a token hang situation by comparing against the correct revision mode.

Fixes a bug in the cluster filesystem that can cause a kernel memory fault.

Eliminates superfluous AutoFS auto-mount attempts during rolling upgrade. These attempted auto-mounts slow down certain operations and leave the AutoFS namespace polluted with directories prefixed with ".Old..".

Fixes a memory leak in cfscall_ioctl().

Fixes a panic with the following error message:
panic: cfsdb_assert

Contains corrections required for proper operation of Oracle 9i with Tru64 UNIX/TruCluster 5.1. The problems corrected include:
- Processes hanging when using Cluster File System/Direct I/O feature.
- Improper handling of direct I/O to an AdvFS fileset if a clone fileset was already in use, potentially resulting in an inconsistent backup.
- Using ls -l, the Cluster File System file attribute could be seen inconsistently from the server and client members. For example, a file's mode could be seen differently from the server and the client.
- A file opened for Direct I/O on the Cluster File System server may inappropriately be opened in non-direct I/O mode by a client.
- Oracle processes hanging due to shutting down one cluster member.
- A problem with the Cluster File System which could cause a cluster system to panic with the panic string "kernel memory fault" in the routine mc_bcopy().
- A problem with the Cluster File System which could cause a cluster member to panic with the panic string "uiomove: mode." This problem could cause Oracle multi-instance databases to crash with a message similar to the following:
  ORA-27063: skgfospo: number of bytes read/written is incorrect

Fixes data inconsistency problems that can be seen on clusters that are NFS clients.

Prevents a cfsdb_assert panic from occurring in cfs_reclaim. This panic has been seen while running ensight7.

Prevents a potential hang due to external NFS servers.

Provides a warning to users installing a patch kit that includes a patch which requires a version switch. The warning informs the user that the installed patches include a version switch which cannot be removed using the normal patch removal procedure. The warning allows the user to continue with the switch stage or exit clu_upgrade.

Prevents a potential hang that can occur on a CFS failover.

Allows POSIX semaphores/msg queues to operate properly on a CFS client.

Allows the command cfsstat -i to execute properly.

Corrects a problem which can cause cluster members to hang, waiting for the update daemon to flush /var/adm/pacct.

Fixes a potential CFS hang on defragment.

Fixes a possible "Kernel Memory Fault" panic on racing mount update/unmount/remount operations for the same mount point.

Fixes a possible "Kernel Memory Fault" in function ckidtokgs.

Fixes possible "cfs_add_mount() - database entry present" panic and possible multinode reboot hang which shows the following message:
WARNING: RETRYING TO LOCK THE BOOT PARTITION DEVICE

Fixes two race conditions in Cluster Mount support:
- One results in a transient mount failure.
- The second might result in a kernel memory fault panic during mount.

Fixes two AutoFS problems:
- AutoFS is unable to establish an intercept point when mounton directory is busy.
- Fixes an "Unaligned Kernel Access" panic in cfs_vget_fhp().

Fixes a panic that would occur during the mount of a cluster file system on top of a non-cluster file system.

Prevents a "Kernel Memory Fault" panic during unmount in a Cluster or during a planned relocation.

Corrects a "cfsdb_assert" panic which can occur following the failure of a cluster node.

Addresses three CFS problems:
- A kernel memory fault in the CFS read-ahead code.
- A deadlock in the CFS read-ahead code.
- A potential data inconsistency problem which could occur when a filesystem becomes 100% full.

Enforces the rule that mounting on a server-only file system makes the new mount server-only.

Fixes two race conditions:
- Between cluster root failover and mount which results in a kernel memory fault.
- Between failover-related cleanup and bootup-time mount processing, which results in deadlock and hangs the new node.

Eliminates a Kernel Memory Fault panic during node shutdown.

Addresses a problem in CFS where, under certain conditions, CFS would temporarily change the value of p_pid of the current running process. This could break certain pid-based hashing algorithms in the kernel, as well as adversely affect certain kernel debugging tools.

Fixes a race condition during cluster mount which results in a transient ENODEV seen by a name space lookup.

Addresses a problem where a file's attributes (owner, group, mode, etc) could become inconsistent cluster-wide.

Fixes a PANIC: CFS_ADD_MOUNT() - DATABASE ENTRY PRESENT panic when a node re-joins the cluster.

Addresses a problem where CFS may not properly invalidate cached access rights when a change is made to a file's property list.

Fixes a race condition between node shutdown and unmount, and ensures that all file sets from an AdvFS domain mounted as server_only get unmounted when the server node is shut down.

This patch addresses two cluster problems:
- Hung unmounts, possibly seen as hung node shutdowns.
- A cfsdb_assert panic in cfs_tokmsg().

Fixes the assertion failure ERROR != ECFS_TRYAGAIN.

Corrects a CFS problem that could cause a panic with the panic string of "CFS_INFS full".

Fixes several potential CFS panics.

Fixes functional problems dealing with CFS direct I/O and CFS block reservation.

Fixes a possible panic on boot if a mount request is received from another node too early in the boot process.

Prevents a panic:
Assert failed: vp->v_numoutput > 0

or a system hang when a filesystem becomes full and direct async I/O via CFS is used. A vnode will exist that has v_numoutput with a value greater than 0 and the thread is hung in vflushbuf_aged().

This patch prevents the following panic:
cms_kgs_callback_thr: in use already set on non-initiator

Fixes a potential CFS deadlock.

Addresses a problem seen during the setup stage of a rolling upgrade during tag file creation. The fix is to change a variable to only look at 500 files at a time while making tag files, instead of the current 700.
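
Processing files in fixed-size batches, as this fix describes, can be sketched in Python as follows. The function name and file names are hypothetical, and the sketch does not reproduce the clu_upgrade code. Only the batch size of 500 comes from the text above.

```python
from itertools import islice

BATCH_SIZE = 500  # value named in the fix; the previous value was 700


def batches(items, size=BATCH_SIZE):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical file names standing in for the files that need tag files.
    files = [f"file{n}" for n in range(1200)]
    for group in batches(files):
        print(len(group))
```

A smaller batch size limits how many files are handled in a single step, at the cost of more steps overall.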

Fixes a hang during cluster unmount which results in the blocking of all further mounts and unmounts.

Addresses a cluster problem that can arise in the case where a cluster is serving as an NFS server. The problem can result in stale data being cached at the nodes which are servicing NFS requests.