3 Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0010.

Table 3-1 lists patches that have been updated.

Table 3-1: Updated TruCluster Software Patches

Patch IDs	Change Summary
Patch 53.00	New
Patches 16.00, 39.00, 37.00, 45.00, 43.00, 49.00	Superseded by Patch 56.00
Patch 31.00	Superseded by Patch 60.00
Patches 5.00, 29.00, 34.00, 36.00, 40.00, 41.00, 42.01, 51.00, 54.00, 55.00, 57.00	Superseded by Patch 61.00
Patches 14.00, 22.00, 26.00, 38.00, 44.00, 58.00, 59.00	Superseded by Patch 62.00
Patches 5.00, 29.00, 34.00, 36.00, 40.00, 41.00, 42.00, 46.00, 63.00	Superseded by Patch 64.00
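
The supersession data in Table 3-1 can be treated as a simple lookup. The following is a minimal sketch, not part of the patch kit tooling: the dictionary is transcribed from the table, and the helper name replacements_for is hypothetical. Note that some patches (for example, Patch 5.00) appear in more than one row.

```python
# Supersession data transcribed from Table 3-1 (updated patches only).
SUPERSEDED_BY = {
    "56.00": ["16.00", "39.00", "37.00", "45.00", "43.00", "49.00"],
    "60.00": ["31.00"],
    "61.00": ["5.00", "29.00", "34.00", "36.00", "40.00", "41.00",
              "42.01", "51.00", "54.00", "55.00", "57.00"],
    "62.00": ["14.00", "22.00", "26.00", "38.00", "44.00", "58.00", "59.00"],
    "64.00": ["5.00", "29.00", "34.00", "36.00", "40.00", "41.00",
              "42.00", "46.00", "63.00"],
}


def replacements_for(patch_id):
    """Return the patches that supersede patch_id, in sorted order.

    Hypothetical helper for illustration; an empty list means Table 3-1
    does not list the patch as superseded.
    """
    return sorted(new for new, old in SUPERSEDED_BY.items() if patch_id in old)
```

For example, replacements_for("31.00") yields the single entry for Patch 60.00, while a patch that is not superseded, such as Patch 53.00, yields an empty list.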

Table 3-2 provides a summary of patches in Patch Kit-0010.

Table 3-2: Summary of TruCluster Patches

Patch IDs

Abstract

Patch 3.00

TCR141-003

Patch: Correction For DRD I/O Hangs When No CPU In Slot 0

State: Existing

This patch fixes a problem that occurs on all AlphaServer 8200 systems and on AlphaServer 8400 systems that have certain nonstandard configurations. When there is no CPU in slot 0, remote DRD I/O operations hang.

Patch 4.00

TCR141-004

Patch: Correction For Distributed Lock Manager Hang

State: Existing

This patch fixes a problem that occurs when MEMORY CHANNEL errors are encountered at the same time that a particular code path is executed. When these events occur simultaneously, the distributed lock manager (DLM) hangs. The likelihood of this problem occurring is low.

Patch 6.00

TCR141-006

Patch: tractd Corrections

State: Existing

This patch corrects the following:

Fixes a problem where the Cluster Monitor (cmon) in some cases may display incomplete or incorrect ASE service status and node UP/DOWN status.

Fixes a problem with complete depletion of system socket resources, the result of tractd daemons doing repeated connect retries. This problem is most commonly seen when all nodes in a three- or four-node cluster are booted simultaneously.

Dramatically reduces the tractd daemon interconnect delays seen when multiple cluster nodes are booted simultaneously. In four-node clusters, these delays drop from more than 5 minutes to just a few seconds. In addition, the interconnects complete more reliably in these circumstances.
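
The socket depletion described above is typical of retry loops that open a new socket for every attempt without releasing the failed one. The following is a minimal sketch of a retry loop that avoids this pattern. It is not the tractd implementation: the function name, parameters, and backoff values are assumptions chosen for illustration.

```python
import time


def connect_with_retry(make_socket, address, attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Try to connect, closing the socket after every failed attempt.

    make_socket is any zero-argument callable that returns an object with
    connect() and close() methods, for example
    lambda: socket.socket(socket.AF_INET, socket.SOCK_STREAM).
    All names and defaults here are illustrative assumptions.
    """
    delay = base_delay
    for attempt in range(attempts):
        sock = make_socket()
        try:
            sock.connect(address)
            return sock                  # caller owns the connected socket
        except OSError:
            sock.close()                 # release the descriptor before retrying
            if attempt == attempts - 1:
                raise                    # out of attempts: report the failure
            sleep(delay)                 # back off so peers are not flooded
            delay = min(delay * 2, max_delay)
```

Closing the socket on every failure keeps the number of open descriptors constant no matter how many retries occur, and the capped exponential delay limits the retry rate when several nodes boot at the same time.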

Patch 7.00

TCR141-007

Patch: Memory Channel Memory Allocation Corrections

State: Existing

This patch fixes a problem that caused the "map_RM_receive" panic to occur in some cases. This problem may also be seen as distributed raw disk (DRD) print warnings on the console if the drd-mc-drd-print-warn parameter is set in the /etc/sysconfigtab file.

Patch 21.00

TCR141-021

Patch: lsm_dg_action Correction

State: Existing

This patch fixes two problems that were causing certain LSM actions to not be retried upon failure, even though the conditions that caused the failures were only temporary.

Patch 24.00

TCR141-009

Patch: Network interface and Routing Corrections

State: Existing

This patch fixes the following problems:

During the failover of an ASE service, the removal of the -alias parameter from the /var/ase/sbin/nfs_ifconfig file caused the routing file to become corrupted.

When removing and adding services in an available server environment (ASE) using multiple network interfaces, the gated daemon would be started even when the value of the ASEROUTING variable in the /etc/rc.config file is "no".
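
To check how the ASEROUTING variable is set, the file contents can be scanned for the assignment. The following is a minimal sketch that assumes /etc/rc.config holds shell-style NAME="value" assignments. The helper names rc_value and aserouting_disabled are hypothetical and are not part of the ASE software.

```python
import re

# Matches shell-style assignments such as ASEROUTING="no" (assumed format).
_ASSIGN = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def rc_value(text, name, default=""):
    """Return the last value assigned to name in shell-style text."""
    value = default
    for line in text.splitlines():
        match = _ASSIGN.match(line)
        if match and match.group(1) == name:
            # Drop surrounding whitespace and one layer of quotes.
            value = match.group(2).strip().strip('"\'')
    return value


def aserouting_disabled(text):
    """True only when ASEROUTING is explicitly set to "no"."""
    return rc_value(text, "ASEROUTING").lower() == "no"
```

The text argument would typically be the result of reading the file, for example open("/etc/rc.config").read(). The sketch ignores trailing comments and other shell syntax.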

Patch 25.00

TCR141-025

Patch: Distributed Lock Manager Corrections

State: Existing

This patch fixes a problem in TruCluster Production Server Software that can cause a cluster member to panic during a shutdown.

Patch 27.00

TCR141-027

Patch: Correction for KZPBA controllers

State: Existing

Without this patch, the ase_fix_config utility does not recognize KZPBA controllers.

Patch 28.00

TCR141-028

Patch: Correction for KZPBA SCSI controllers

State: Existing

This patch replaces the /usr/sbin/clu_ivp script with a new script that recognizes the "isp" KZPBA SCSI controllers. Without this patch, the clu_ivp program ignores these controllers.

Patch 30.00

TCR141DX-002

Patch: Cluster Monitor Hang Correction

State: Existing

Fixes a problem in which any running Cluster Monitor (cmon) locks up and hangs when an ASE service is renamed. This occurs whether the rename is done from within cmon or independently of cmon.

Patch 32.00

TCR141-033

Patch: Booting Node Hang Correction

State: Existing

Fixes a problem where a booting node hangs in the imc_init command. Rebooting the node again would also hang in imc_init, requiring a reboot of all members.

Patch 33.01

TCR141-034-1

Patch: Kern Mem Fault And simple_lock Panic Correction

State: Supersedes patches TCR141-011 (11.00), TCR141-019 (23.00)

This patch corrects the following:

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a kernel memory fault in am_select() in the Availability Manager.

Fixes a problem where the aseagent process goes into a U state when another ASE member leaves the cluster, due to the aseagent process waiting on a SCSI ping request that never completes.

Patch 35.00

TCR141-036

Patch: rm_spur Driver Correction

State: Supersedes patch TCR141-002 (2.00)

This patch corrects the following problems:

Eliminates the loss of a cluster node when "sysconfig -q rm" is run after the cluster has formed.

Allows more time to remove a node from an 8-node cluster before causing the system to panic.

Corrects some instances on busy clusters in which the software does not detect that a node has gone down.

Corrects the sense of the long/short heartbeat timeout delay in virtual hub systems, and enables code that allows the system to see a hub power up after it has been powered down.

Patch 47.00

TCR141-013B

Patch: Memory Channel API Shared Library Correction

State: Supersedes patch TCR141-013 (13.00)

This patch fixes various problems in the MEMORY CHANNEL API. In particular, changes were made to ensure that the API is thread safe, that locks are properly acquired and released, and to increase performance and reliability.

Patch 48.00

TCR141-013-1

Patch: Memory Channel API Static Library Correction

State: Supersedes patch TCR141-013 (13.00)

Patch 50.00

TCR141-045B

Patch: LSM and AdvFS Corrections

State: Supersedes patches TCR141-041 (39.00), TCR141-048 (45.00)

This patch fixes the following problems:

Increases the timeout values for the LSM action scripts that are part of the TruCluster Production Server, Available Server and DECsafe Available Server products. The timeouts were too small for large LSM configurations and, under certain conditions, would cause the start of the services to fail, leaving them unassigned.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

Patch 52.00

TCR141-044C

Patch: Message Service Routine Fixes

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044-1 (42.01)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Causes the host status monitor (asehsm) to actively go out and learn current member states before responding to the director with member state information.

Fixes a problem in which pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.
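
The message queue overflow described in the first item for this patch involves a bounded queue that stops accepting messages once it fills. The following is a minimal sketch of a bounded queue that records a lost message on overflow but accepts new messages again as soon as a consumer drains it. It is not the msgSvc code: the class and attribute names are hypothetical.

```python
from collections import deque


class BoundedQueue:
    """Fixed-capacity FIFO that counts dropped messages (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.lost = 0                # number of messages dropped on overflow

    def put(self, msg):
        """Queue msg; return False and count the loss if the queue is full."""
        if len(self.items) >= self.capacity:
            self.lost += 1
            return False
        self.items.append(msg)
        return True

    def get(self):
        """Remove and return the oldest message, or None if empty."""
        return self.items.popleft() if self.items else None
```

Because get() removes entries, an overflow is a temporary condition in this sketch: once any message is consumed, put() succeeds again.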

Patch 53.00

TCR141-049

Patch: ASE Check Service Script May Be Corrupt

State: New

This patch corrects a problem in which an ASE check service script could become corrupted in the ASE configuration database.

Patch 56.00

TCR141-052

Patch: LSM Disk Info Not Properly Updated In ASE DB

State: Supersedes patches TCR141-016 (16.00), TCR141-041 (39.00), TCR141-039 (37.00), TCR141-048 (45.00), TCR141-045 (43.00), TCR141-045-1 (49.00)

This patch fixes the following problems:

Provides support in asemgr for the new AdvFS mount option "-o noatimes".

Fixes a problem where changes in the LSM configuration were not being properly handled during the delete of an LSM volume from a service.

Increases the timeout values for the LSM action scripts that are part of the TruCluster Production Server, Available Server and DECsafe Available Server products. The timeouts were too small for large LSM configurations and, under certain conditions, would cause the start of the services to fail, leaving them unassigned.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

Patch 60.00

TCR141-056

Patch: Fix For AdvFS Panic

State: Supersedes patch TCR141-032 (31.00)

This patch corrects the following:

Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

Fixes a problem that could cause an AdvFS panic when a service that has quotas enabled is relocated. The problem occurs if a command is running that has a large number of arguments (>99).

Patch 61.00

TCR141-058A

Patch: asemgr May Core Dump

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044-1 (42.01), TCR141-044-2 (51.00), TCR141-050 (54.00), TCR141-051 (55.00), TCR141-053A (57.00)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Fixes a problem in which pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes a problem where the ASE management utility, asemgr, consumes increasing amounts of memory when invoked to add several services to the database at one time. Under certain circumstances it could consume all the available memory, causing allocation failures.

Fixes two related problems:
- Initializes hostname field properly because lower-layer code may de-reference it.
- Handles an error from IPToHost() properly. Failure to handle this error properly could result in the aseagent core dumping.

Fixes the following problems:
- The "asemgr -dv" command core dumps if no services are defined.
- When deleting a service that has LSM and/or AdvFS volumes, the asemgr utility prompts for a member on which to leave the LSM/AdvFS information so that it can be re-used. If ASE cannot resolve the IP address for the member, asemgr or aseagent will core dump.

Fixes a problem that can cause the asemgr utility to core dump when modifying services that contain a large number of disks.

Patch 62.00

TCR141-059

Patch: Node Panics With String dlm_panic

State: Supersedes patches TCR141-014 (14.00), TCR141-022 (22.00), TCR141-026 (26.00), TCR141-040 (38.00), TCR141-046 (44.00), TCR141-054 (58.00), TCR141-055 (59.00)

This patch corrects the following:

Fixes a problem in the TruCluster Production Server Software in which a system can panic with:
rcv_invvalb_req: value block out of sequence

Fixes two problems in the TruCluster Distributed Lock Manager (DLM): one resulting from a process's effective group ID not being checked when a process attempts to join a namespace, and another in which repeated calls to the dlm_quecvt function would erroneously return DLM_LKBUSY status.

Fixes an assertion panic that occurs after a large number of transactions are made using the same lock. The assertion panic is triggered by integer wrapping of the lock transaction ID field. The system may panic with "dlm_panic". The actual assertion message is "<lkbp->lk_txid == 0>".

Fixes an erroneous assertion involving deadlock search. The system may panic with "dlm_panic". The actual assertion message is "<otxid != (dlm_trans_id_t)-1>".

Fixes a problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Fixes a system panic with the following message:
snd_grantlk_msg: no memory for message

Fixes a dlm_panic if a process is exiting and a rebuild for the Distributed Lock Manager (DLM) takes place.

Fixes a problem that caused the "sysconfig -q dlm" command to hang if DLM is suspended.

Fixes a problem in TruCluster in which a node panics with the string "dlm_panic".
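
The integer wrapping of the lock transaction ID field mentioned above can be illustrated with a fixed-width counter. The following is a minimal sketch, not DLM code. It assumes a 32-bit unsigned field and assumes, based only on the assertion messages quoted above, that 0 and the bit pattern of -1 are reserved values; the constant and function names are hypothetical.

```python
TXID_BITS = 32                        # assumed field width
TXID_MASK = (1 << TXID_BITS) - 1
TXID_NONE = 0                         # assumed "no transaction" value
TXID_INVALID = TXID_MASK              # bit pattern of -1 in an unsigned field


def naive_next(txid):
    """Increment with wraparound; eventually returns a reserved value."""
    return (txid + 1) & TXID_MASK


def safe_next(txid):
    """Increment with wraparound, skipping the reserved values."""
    nxt = (txid + 1) & TXID_MASK
    while nxt in (TXID_NONE, TXID_INVALID):
        nxt = (nxt + 1) & TXID_MASK
    return nxt
```

With naive_next, a counter at the maximum value wraps to 0, which code elsewhere may interpret as "no transaction"; safe_next steps past both reserved values so that an issued ID never collides with them.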

Patch 64.00

TCR141-058B

Patch: Kernel Memory Fault Panic

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044 (42.00), TCR141-044B (46.00), TCR141-053B (63.00)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Causes the host status monitor (asehsm) to actively go out and learn current member states before responding to the director with member state information.

Fixes a problem in which pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes the following problems:
- The "asemgr -dv" command core dumps if no services are defined.
- When deleting a service that has LSM and/or AdvFS volumes, the asemgr utility prompts for a member on which to leave the LSM/AdvFS information so that it can be re-used. If ASE cannot resolve the IP address for the member, asemgr or aseagent will core dump.

Fixes a problem that can cause the asemgr utility to core dump when modifying services that contain a large number of disks.