3    Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0003.

Table 3-1 lists patches that have been updated.

Table 3-2 provides a summary of patches.

Table 3-1:  Updated TruCluster Software Patches

Patch IDs Change Summary
Patches 150.00, 195.00 New
Patches 11.00, 62.00, 97.00, 145.00, 146.00 Superseded by Patch 148.00
Patches 41.00, 80.00, 173.00 Superseded by Patch 175.00
Patches 39.00, 131.00, 178.00, 179.00 Superseded by Patch 181.00
Patches 37.00, 82.00, 132.00, 134.00, 182.00, 183.00 Superseded by Patch 185.00
Patches 70.00, 186.00 Superseded by Patch 188.00
Patches 44.00, 46.00, 189.00, 190.00, 191.00 Superseded by Patch 193.00
Patch 50.00 Superseded by Patch 200.00
Patches 12.00, 13.00, 14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 25.00, 76.00, 92.00, 98.00, 99.00, 100.00, 101.00, 102.00, 103.00, 104.00, 105.00, 106.00, 107.00, 108.00, 109.00, 110.00, 111.00, 112.00, 113.00, 114.00, 116.00, 140.00, 142.00, 64.00, 86.00, 117.00, 119.00, 43.00, 151.00, 152.00, 153.00, 154.00, 155.00, 156.00, 157.00, 158.00, 159.00, 160.00, 161.00, 162.00, 163.00, 164.00, 165.00, 166.00, 167.00, 168.00, 169.00, 170.00, 172.00, 30.00, 31.00, 32.00, 33.00, 35.00, 78.00, 90.00, 122.00, 123.00, 124.00, 125.00, 126.00, 127.00, 129.00, 144.00, 196.00, 198.00 Superseded by Patch 202.00
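
The supersession rows in Table 3-1 can be treated as a lookup from an older patch ID to the patch that now carries its fixes. The following is a minimal sketch of that lookup, not part of the patch kit tooling. The names SUPERSEDED_BY, build_reverse_index, and current_patch are hypothetical, and the long Patch 202.00 row is left out for brevity.

```python
# Rows copied from Table 3-1; the Patch 202.00 row is omitted for brevity.
# Keys are superseding patches, values are the patches they supersede.
SUPERSEDED_BY = {
    "148.00": ["11.00", "62.00", "97.00", "145.00", "146.00"],
    "175.00": ["41.00", "80.00", "173.00"],
    "181.00": ["39.00", "131.00", "178.00", "179.00"],
    "185.00": ["37.00", "82.00", "132.00", "134.00", "182.00", "183.00"],
    "188.00": ["70.00", "186.00"],
    "193.00": ["44.00", "46.00", "189.00", "190.00", "191.00"],
    "200.00": ["50.00"],
}


def build_reverse_index(table):
    """Map each superseded patch ID to the patch that supersedes it."""
    index = {}
    for new_id, old_ids in table.items():
        for old_id in old_ids:
            index[old_id] = new_id
    return index


def current_patch(patch_id, index):
    """Return the superseding patch, or patch_id itself if it is not listed."""
    return index.get(patch_id, patch_id)


INDEX = build_reverse_index(SUPERSEDED_BY)
```

For example, current_patch("62.00", INDEX) returns "148.00", while a patch that is not superseded in this subset, such as "150.00", is returned unchanged.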

Table 3-2:  Summary of TruCluster Patches

Patch IDs Abstract

Patch 9.00

TCR520-019

Patch: Fixes networking issues within cluster environment

State: Supersedes patches TCR520-008 (6.00), TCR520-037 (7.00)

This patch fixes the following problems:

  • Multiple networking issues within a cluster environment:

    • Cluster member loses connectivity with clients on remote subnets.

    • aliasd not handling multiple virtual aliases in a subnet and/or IP aliases.

    • Allows cluster members to route for an alias without joining it.

    • aliasd writing illegal configurations into gated.conf.memberX.

    • Default route not being restored after network connectivity issues.

    • Fixes a race condition between aliasd and gated.

    • Fixes a problem with a hang caused by an incorrect /etc/hosts entry.

  • Fixes aliasd_niff to allow EVM restart.

  • Provides enablers for the Compaq Database Utility.

Patch 27.00

TCR520-028

Patch: Fix for clusterwide wall messages not being received

State: Existing

This patch allows the cluster wall daemon to restart following an EVM daemon failure.

Patch 52.00

TCR520DX-001

Patch: Fixes smsd/caad performance problems

State: Existing

This patch provides enablers for the Compaq Database Utility.

Patch 68.00

TCR520-045

Patch: Fix for confusing panics on SMP systems

State: Existing

This patch fixes a problem where node reboots during a clusterwide shutdown would result in system panics that are difficult to diagnose.

Patch 88.00

TCR520-076

Patch: Fix for cluster hang during boot

State: Supersedes patch TCR520-027 (29.00)

This patch addresses a situation where the second node in a cluster hangs upon boot while setting the current time and date with ntpdate.

Patch 95.00

TCR520-071

Patch: Fix for CAA problems

State: Supersedes patches TCR520-029 (1.00), TCR520-035 (2.00), TCR520-022 (3.00), TCR520-032 (5.00), TCR520-054 (53.00), TCR520-047 (54.00), TCR520-048 (55.00), TCR520-051 (56.00), TCR520-056 (57.00), TCR520-046 (58.00), TCR520-052 (60.00), TCR520-049 (66.00), TCR520-065 (71.00), TCR520-060 (72.00), TCR520-063 (74.00), TCR520-072 (84.00), TCR520-102 (93.00)

This patch corrects the following:

  • Increases parallelism in CAA event handling.

  • CAA cannot start or stop resources. The resource moves to the unknown state. Also, a core file is left behind by the action of starting and stopping resources. The problem will occur after the first resource is started.

  • Enables the Compaq Database Utility.

  • The datastore may get corrupted due to improper datastore locking. This may occur when multiple CAA CLI commands are run in the background.

  • The caa_profile command may complain of failure to create and log EVM events.

  • The caa_profile -create command inserts extra attributes such as REBALANCE into the profile when a user uses it to create an application profile. This causes the CAA GUI to fail to validate the profile.

  • The caa_stat command can crash, leaving a core file, when it receives a SIGPIPE signal. The problem has been known to occur when caa_stat output is piped to a command such as head.

  • When long resource or attribute names are used, the space is not reclaimed correctly when the resource is unregistered.

  • Fixed a caad memory leak caused by caa_stat -f.

  • CAA fails to close a TDF after processing a corresponding resource profile. Over time this will lead to reaching the process limit for open file descriptors and will prevent CAA from functioning properly.

  • The clu_mibs agent has been changed to retry the connection with the Event Manager daemon (evmd) indefinitely until it succeeds.

  • The clu_mibs agent's start and stop control has been moved from the /sbin/init.d/clu_max script to the /sbin/init.d/snmpd script.

  • Resolves erroneous behavior of resources with dependencies upon other resources (required resources). This solves several problems with starting, stopping, and relocating a resource with dependencies when the resource's start or stop scripts fail, or when relocating during a shutdown.

  • Migrates the old datastore to the new datastore during the rolling upgrade and corrects the problem where no resource information was preserved.

  • Resolves the issue with the default CAA system services (dhcp, named, cluster_lockd, autofs) not running after the installation of the patch kit. In addition to the default CAA system services, any previously registered resource would be lost.

  • Prevents member hangs during boot in unusual circumstances that cause the CAA daemon to crash or exit during initialization.

  • Fixes three CAA problems triggered by heavy CAA activity conditions.
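
The caa_stat item above describes a common failure mode: a program keeps writing after the reader of its pipe (such as head) has exited. The following is a minimal sketch of a writer that stops quietly in that case. It illustrates the general technique only and is not the CAA implementation; the function name write_lines is hypothetical.

```python
def write_lines(lines, stream):
    """Write lines to stream until done or until the reader goes away.

    Returns the number of lines written before the pipe was closed.
    """
    written = 0
    try:
        for line in lines:
            stream.write(line + "\n")
            written += 1
        stream.flush()
    except BrokenPipeError:
        # The reader (for example, head) closed its end of the pipe.
        # Stop producing output instead of crashing.
        pass
    return written
```

In Python, a write to a closed pipe surfaces as BrokenPipeError because the interpreter ignores SIGPIPE by default; a program in another language would typically handle or ignore the signal itself.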

Patch 121.00

TCR520-114

Patch: Using a cluster as a RIS server causes panic

State: New

This patch corrects the following:

  • A panic that occurs when a cluster is used as a RIS server.

  • A problem with RIS/DMS serving in a TruCluster.

Patch 136.00

TCR520-085

Patch: Enhancement for clu_autofs shutdown script

State: Existing

This patch makes the /sbin/init.d/clu_autofs script more robust.

Patch 138.00

TCR520-121

Patch: Provides enhanced clu_upgrade switch

State: Supersedes patch TCR520-009 (48.00)

This patch corrects the following:

  • Provides a warning to users who have installed a patch kit that includes a patch which requires a version switch. The warning informs the user that the installed patches include a version switch which cannot be removed using the normal patch removal procedure. The warning allows the user to continue with the switch stage or exit clu_upgrade.

  • Provides additional user information after the user has decided to perform a patch rolling upgrade and has entered the pathname to a patch kit which contains one or more patches requiring a version switch. The additional user information identifies the patches containing the version switch and provides references to the appropriate user documentation.

  • Addresses a problem seen during the setup stage of a rolling upgrade during tag file creation. The fix is to change a variable to only look at 500 files at a time while making tag files, instead of the current 700.
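
The last item above amounts to processing the file list in smaller batches. The following is a minimal sketch of that idea, assuming the file names are already collected in a list. The function name batches and the sample file names are hypothetical and do not reflect how clu_upgrade is written.

```python
def batches(items, size=500):
    """Yield successive slices of items, each at most size entries long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical file list; a real run would use the files needing tag files.
files = [f"file{i}" for i in range(1200)]
batch_sizes = [len(batch) for batch in batches(files)]
```

With 1200 entries and the default size of 500, the list is handled in three passes of 500, 500, and 200 files.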

Patch 148.00

TCR520-134

Patch: Fixes cluster hang during Memory Channel initialization

State: Supersedes patches TCR520-013 (11.00), TCR520-055 (62.00), TCR520-106 (97.00), TCR520-132 (145.00), TCR520-152 (146.00)

This patch corrects the following:

  • Fixes a situation in which one or several cluster members would panic if a Memory Channel cable was removed or faulty.

  • Fixes the following problems with Memory Channel in a cluster environment:

    • A problem in which powering off Memory Channel in a LAN interconnect cluster causes a clusterwide panic.

    • A user is now allowed to kill a LAN interconnect cluster via Memory Channel.

    • Supports Memory Channel usage in a LAN cluster.

  • Corrects a problem when the master failover node goes off line during a failover and fails over due to parity errors increasing beyond the limit.

  • Addresses a problem in which a bad Memory Channel cable causes a cluster member to panic with a panic string of "rm_eh_init" or "rm_eh_init_prail".

  • Contains changes that will make Memory Channel failovers work better, and will also handle bad optical cables.

  • Fixes a problem in which a node booting into a cluster hangs during Memory Channel initialization.

Patch 150.00

TCR520-142

Patch: Eliminates spurious duplicate error message

State: New

This patch eliminates a spurious duplicate error message.

Patch 175.00

TCR520-128

Patch: Resolves issues with version switched patches

State: Supersedes patches TCR520-024 (41.00), TCR520-057 (80.00), TCR520-154 (173.00)

This patch corrects the following:

  • Fixes a cluster installation problem of having an LSM disk and a disk media with the same name. Normally, the install script would not let you install because it was looking at the disk name, not the disk media name. This has been fixed.

  • Disks over 10 GB are unable to be used as member or quorum disks. This fix allows the user to use them as such.

  • Resolves issues with version-switched patches and cluster installation. Previously, the user could run with old functionality if they had not run versw; now dupatch automatically runs it for them.

  • Automatically enables IP filtering for the cluster interconnect on cluster installation and member addition; allows installation on unlabeled disks; and allows the cluster installation to detect layered product kits in /var as well as /usr/var.

Patch 181.00

TCR520-115

Patch: Fixes problems in the DLM subsystem

State: Supersedes patches TCR520-034 (39.00), TCR520-074 (131.00), TCR520-123 (178.00), TCR520-122 (179.00)

This patch corrects the following:

  • Fixes a panic in DLM when another node in the cluster is halted.

  • Fixes a panic in the DLM deadlock detection code.

  • Fixes a problem where a process using the Distributed Lock Manager can take up to ten minutes to exit.

  • Fixes several DLM related crashes and performance issues.

  • Corrects a cluster member panic.

  • DLM was not always returning the resource block information for the sublock even if the sublock was held.

Patch 185.00

TCR520-125

Patch: Resolves an RDG panic in the RdgShutdown routine

State: Supersedes patches TCR520-015 (37.00), TCR520-058 (82.00), TCR520-087 (132.00), TCR520-105 (134.00), TCR520-141 (182.00), TCR520-150 (183.00)

This patch corrects the following:

  • Enables the Compaq Database Utility.

  • Changes RDG wiring behavior to match the VM fix to wiring GH chunks.

  • The RDG fix closes a timing window that can cause Oracle 9i to hang when a remote node in the cluster goes down.

  • Fixes a possible panic on process termination and a panic involving multiple Memory Channel adapters.

  • Makes the RDGinit daemon program safe to execute multiple times on all cluster interconnect types.

  • Resolves a problem resulting in an incorrect error status being returned from RDGinit.

  • Fixes a Reliable DataGram (RDG) problem that can result in user processes hanging in an uninterruptible state.

  • Resolves an RDG panic in the RdgShutdown routine.

Patch 188.00

TCR520-138

Patch: Fixes cluster kernel problem that causes a hang

State: Supersedes patches TCR520-042 (70.00), TCR520-133 (186.00)

This patch corrects the following:

  • Fixes a panic in the kernel group services when another node is booted into the cluster.

  • Fixes a problem in the cluster kernel that causes the cluster to hang when a member is rebooted into the cluster.

  • Fixes a problem in the cluster kernel that causes one or more members to panic during a cluster shutdown.

Patch 193.00

TCR520-146

Patch: Fix for ICS_UNABLE_TO_MAKE_PROGRESS panic

State: Supersedes patches TCR520-021 (44.00), TCR520-023 (46.00), TCR520-139 (189.00), TCR520-145 (190.00), TCR520-127 (191.00)

This patch corrects the following:

  • Fixes a situation where ICS is unable to make progress because heartbeat checking is blocked or the input thread is stalled. The symptom is a panic of a cluster member with the panic string ICS_UNABLE_TO_MAKE_PROGRESS: HEARTBEAT CHECKING BLOCKED/INPUT THREAD STALLED.

  • Fixes the problem of a cluster member failing to rejoin the cluster after Memory Channel failover.

  • Addresses a panic that occurs when higher priority threads running on a cluster member block the internode communication service Memory Channel transport (ics_ll_mct) subsystem's input thread from execution.

  • Fixes numerous panics and hangs with the way a cluster communicates with its nodes. It also fixes hangs and panics during boot.

  • Fixes a panic with the string "rcnx_status: different node."

  • Fixes a boot hang on "ics_mct: Node arrival waiting for out of line node down cleanup to complete".

Patch 195.00

TCR520-143

Patch: Memory Channel API problem causes system hang

State: New

This patch fixes a problem in the Memory Channel API that can cause a system to hang.

Patch 200.00

TCR520-137

Patch: Fix for ICS_BROADCAST_SETUP panic

State: Supersedes patch TCR520-025 (50.00)

This patch corrects the following:

  • Fixes a situation where a cluster shutdown under load on a cluster using a LAN interconnect takes a very long time.

  • On boot, "duplicate incoming connections" will not cause a panic. Provides a more complete error message in the event of a misconfigured ICS/TCP adapter.

Patch 202.00

TCR520-167

Patch: Fixes several Device Request Dispatcher problems

State: Supersedes patches (12.00), TCR520-011 (13.00), TCR520-005 (14.00), TCR520-002 (15.00), TCR520-004 (16.00), TCR520-039 (17.00), TCR520-014 (18.00), TCR520-016 (19.00), TCR520-018 (20.00), TCR520-010 (21.00), TCR520-012 (22.00), TCR520-026 (23.00), TCR520-001 (25.00), TCR520-068 (76.00), TCR520-100 (92.00), TCR520-090 (98.00), TCR520-091 (99.00), TCR520-104 (100.00), TCR520-080 (101.00), TCR520-083 (102.00), TCR520-089 (103.00), TCR520-095 (104.00), TCR520-099 (105.00), TCR520-078 (106.00), TCR520-101 (107.00), TCR520-081 (108.00), TCR520-082 (109.00), TCR520-070 (110.00), TCR520-092 (111.00), TCR520-059 (112.00), TCR520-062 (113.00), TCR520-093 (114.00), TCR520-084 (116.00), TCR520-136 (140.00), TCR520-116 (142.00), TCR520-053 (64.00), TCR520-067 (86.00), TCR520-044 (117.00), TCR520-077 (119.00), TCR520-003 (43.00), TCR520-159 (151.00), TCR520-124 (152.00), TCR520-126 (153.00), TCR520-109 (154.00), TCR520-119 (155.00), TCR520-147 (156.00), TCR520-118 (157.00), TCR520-110 (158.00), TCR520-111 (159.00), TCR520-120 (160.00), TCR520-130 (161.00), TCR520-112 (162.00), TCR520-113 (163.00), TCR520-107 (164.00), TCR520-131 (165.00), TCR520-151 (166.00), TCR520-108 (167.00), TCR520-144 (168.00), TCR520-157 (169.00), TCR520-148 (170.00), TCR520-129 (172.00), TCR520-033 (30.00), TCR520-017 (31.00), TCR520-006 (32.00), TCR520-007 (33.00), TCR520-020 (35.00), TCR520-064 (78.00), TCR520-075 (90.00), TCR520-079 (122.00), TCR520-094 (123.00), TCR520-096 (124.00), TCR520-097 (125.00), TCR520-088 (126.00), TCR520-098 (127.00), TCR520-103 (129.00), TCR520-155 (144.00), TCR520-149 (196.00), TCR520-162 (198.00)

This patch corrects the following:

  • Provides the I/O barrier code that prevents HSG80 controller crashes (firmware issue).

  • Fixes a situation in which a rebooting cluster member would panic shortly after rejoining the cluster if another cluster member was doing remote disk I/O to the rebooting member when it was rebooted.

  • Allows high density tape drives to use the high-density compression setting in a cluster environment.

  • Fixes a kernel memory fault panic that can occur within a cluster member during failover while using shared served devices.

  • Fixes a clusterwide hang that occurs when DRD node failover is stuck and unable to bid a new server for a served device.

  • Adds DRD barrier retries to the fixes for HSx firmware problems.

  • Fixes a problem where CAA applications using tape/changers as required resources will not come on line (as seen by caa_stat).

  • Fixes a problem where the tape changer is only accessible from the member that is the DRD server for the changer.

  • Fixes a problem where an open request to a disk in a cluster fails with an illegal errno (>=1024).

  • Fixes a problem where an open to a tape drive in a cluster would take 6 minutes (instead of 2) to fail if there were no tape in the drive.

  • Solves a problem in which a cluster would hang the next time a node was rebooted after a tape device was deleted from the cluster.

  • Fixes a domain panic in a cluster when a file system is mounted on a disk accessed remotely over the cluster interconnect.

  • Fixes a race condition that occurs when multiple unbarrierable disks fail at the same time.

  • Fixes a kernel memory fault in drd_open.

  • Prevents an infinite loop in drd_open().

  • Fixes several Device Request Dispatcher problems.

  • Removes a rolling upgrade issue with CDROM and FLOPPY device handling.

  • Addresses a problem in which a cluster or a device can get blocked I/O, or a cluster node may panic after a device has been deleted.

  • Makes AdvFS fileset quota enforcement work properly on a cluster.

  • Corrects a "cfsdb_assert" panic which can occur following the failure of a cluster node.

  • Corrects a problem which can cause cluster members to hang waiting for the update daemon to flush /var/adm/pacct.

  • Prevents a potential hang that can occur on a CFS failover.

  • Allows POSIX semaphores/msg queues to operate properly on a CFS client.

  • Addresses a potential file inconsistency problem which could cause erroneous data to be returned when reading a file at a CFS client node. There is also a small possibility that this problem could result in a CFS panic ("AssertFailed: bp->b_dev").

  • Addresses two potential CFS panics that might occur for a DMAPI/HSM managed filesystem. The first panic problem string is:

    Assert Failed: ( t)->cntk_mode <= 2"

    The second panic problem string is:

    Assert Failed: get_recursion_count(
    current_threa&CFS_CMI_TO_REC_LOCK(mi)) == 1

  • Addresses a possible panic which could occur if multiple CFS client nodes leave the cluster while a CFS relocate or unmount is occurring.

  • Addresses a possible KMF panic when executing the command cfsmgr -a DEVICES on a filesystem with LSM volumes.

  • Corrects a CFS problem that could cause a panic with the panic string of "CFS_INFS full".

  • Addresses a potential CFS panic that might occur when a file is being opened in direct I/O mode while, at the same time, the file is being truncated by a separate process.

  • Provides enabler support for the Enterprise Volume Manager product.

  • Fixes a memory leak in cfscall_ioctl().

  • Provides freezefs support.

  • Addresses a data inconsistency that can occur when a CFS client reads a file that was recently written to and whose underlying AdvFS extent map contains more than 100 extents.

  • Fixes a panic that would occur during the mount of a clustered file system on top of a nonclustered file system.

  • Prevents a Kernel Memory Fault panic during unmount in a cluster or during a planned relocation.

  • Fixes support for mounting other filesets from the cluster_root domain in a cluster.

  • Fixes the assertion failure ERROR != ECFS_TRYAGAIN.

  • Fixes a race condition during cluster mount which results in a transient ENODEV seen by a name space lookup.

  • Fixes a possible panic on boot if a mount request is received from another node too early in the boot process.

  • Fixes a PANIC: CFS_ADD_MOUNT() - DATABASE ENTRY PRESENT panic when a node re-joins the cluster.

  • Fixes two race conditions in Cluster Mount support:

    • One results in a transient mount failure.

    • The second might result in a kernel memory fault panic during mount.

  • Fixes a cluster problem with hung unmounts (possibly seen as hung node shutdowns).

  • Addresses a potential UBC panic which could occur when accessing CFS filesystems.

  • Fixes a possible Kernel Memory Fault panic on racing mount update/unmount/remount operations for the same mount point.

  • Fixes a possible race between node shutdown and unmount.

  • Fixes a possible Kernel Memory Fault panic on the mount update on a Memory File System (MFS) and other possible panics when bad arguments are passed to the mount library interface.

  • Prevents a panic "Assert failed: vp->v_numoutput > 0" or a system hang when a filesystem becomes full and direct async I/O via CFS is used. In this situation, a vnode exists whose v_numoutput value is greater than 0, and the thread is hung in vflushbuf_aged().

  • Fixes a possible Kernel Memory Fault in function ckidtokgs.

  • Fixes a potential CFS deadlock.

  • Corrects a cfsmgr error "Not enough space" when attempting to relocate a file system with a large number of disks.

  • Addresses possible file read failures on a CFS client node, which could occur if the domain storage devices were closed after a previous failure to perform a failover mount on that node.

  • Fixes support for mounting other filesets from a cluster node's boot partition domain.

  • Addresses a cluster problem that can arise in the case where a cluster is serving as an NFS server. The problem can result in stale data being cached at the nodes which are servicing NFS requests.

  • Addresses a CFS panic that might occur for a DMAPI/HSM managed filesystem:

    (panic): cfstok_hold_tok(): held token table overflow