3 Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0008.

Table 3-1 lists patches that have been updated.

Table 3-1: Updated TruCluster Software Patches

Patch IDs	Change Summary
Patches 5.00, 29.00, 34.00, 36.00, 40.00, 41.00, 42.01	Superseded by Patch 51.00
Patches 5.00, 29.00, 34.00, 36.00, 40.00, 41.00, 42.01	Superseded by Patch 52.00
Patches 5.00, 29.00, 34.00, 36.00, 40.00, 41.00, 42.00	Superseded by Patch 46.00
Patches 11.00, 23.00	Superseded by Patch 33.01
Patch 13.00	Superseded by Patch 48.00
Patch 13.00	Superseded by Patch 47.00
Patches 14.00, 22.00, 26.00, 38.00	Superseded by Patch 44.00
Patches 16.00, 39.00, 37.00, 45.00, 43.00	Superseded by Patch 49.00
Patches 16.00, 39.00, 37.00, 45.00, 43.00	Superseded by Patch 50.00
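
The supersession relationships in Table 3-1 can be expressed as a simple lookup. The following Python sketch is illustrative only and is not part of the patch kit; the names SUPERSEDES and replacements_for are hypothetical. It transcribes the table rows and returns the patches that replace a given earlier patch.

```python
# Data transcribed from Table 3-1: new patch -> patches it supersedes.
# The structure and names here are illustrative, not part of the kit.
_MSG_BASE = ["5.00", "29.00", "34.00", "36.00", "40.00", "41.00"]
_LSM_BASE = ["16.00", "39.00", "37.00", "45.00", "43.00"]

SUPERSEDES = {
    "51.00": _MSG_BASE + ["42.01"],
    "52.00": _MSG_BASE + ["42.01"],
    "46.00": _MSG_BASE + ["42.00"],
    "33.01": ["11.00", "23.00"],
    "48.00": ["13.00"],
    "47.00": ["13.00"],
    "44.00": ["14.00", "22.00", "26.00", "38.00"],
    "49.00": list(_LSM_BASE),
    "50.00": list(_LSM_BASE),
}


def replacements_for(patch_id):
    """Return the patches that supersede patch_id, in numeric order."""
    found = [new for new, old in SUPERSEDES.items() if patch_id in old]
    return sorted(found, key=float)
```

For example, replacements_for("13.00") returns ["47.00", "48.00"], matching the two rows for Patch 13.00 in the table.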

Table 3-2 provides a summary of patches in Patch Kit-0008.

Table 3-2: Summary of TruCluster Patches

Patch IDs

Abstract

Patch 3.00

TCR141-003

Patch: Correction For DRD I/O Hangs When No CPU In Slot 0

State: Existing

This fixes a problem that occurs on all AlphaServer 8200 systems and on AlphaServer 8400 systems having certain nonstandard configurations. When there is no CPU in slot 0, remote DRD I/O operations hang.

Patch 4.00

TCR141-004

Patch: Correction For Distributed Lock Manager Hang

State: Existing

This patch fixes a problem that occurs when MEMORY CHANNEL errors are encountered at the same time that a particular code path is executed. When these events occur simultaneously, the distributed lock manager (DLM) hangs. The likelihood of this problem occurring is low.

Patch 6.00

TCR141-006

Patch: tractd Corrections

State: Existing

This patch corrects the following:

Fixes a problem where the Cluster Monitor (cmon) in some cases may display incomplete or incorrect ASE service status and node UP/DOWN status.

Fixes a problem with complete depletion of system socket resources, the result of tractd daemons doing repeated connect retries. This problem is most commonly seen when all nodes in a three- or four-node cluster are booted simultaneously.

Dramatically reduces tractd daemon interconnect delays seen when multiple cluster nodes are booted simultaneously. In four-node clusters, these delays are reduced from 5 minutes or more to just a few seconds. In addition, the interconnects in these circumstances complete more reliably.

Patch 7.00

TCR141-007

Patch: Memory Channel Memory Allocation Corrections

State: Existing

This patch fixes a problem that caused the "map_RM_receive" panic to occur in some cases. This problem may also be seen as distributed raw disk (DRD) print warnings on the console if the drd-mc-drd-print-warn parameter is set in the /etc/sysconfigtab file.

Patch 21.00

TCR141-021

Patch: lsm_dg_action Correction

State: Existing

This patch fixes two problems that were causing certain LSM actions to not be retried upon failure, even though the conditions that caused the failures were only temporary.

Patch 24.00

TCR141-009

Patch: Network Interface and Routing Corrections

State: Existing

This patch fixes the following problems:

During the failover of an ASE service, the removal of the -alias parameter from the /var/ase/sbin/nfs_ifconfig file caused the routing file to become corrupted.

When removing and adding services in an available server environment (ASE) that uses multiple network interfaces, the gated daemon would be started even when the value of the ASEROUTING variable in the /etc/rc.config file was "no".
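
The ASEROUTING setting mentioned above can be checked before services are modified. The following Python sketch is illustrative only: it assumes the /etc/rc.config file uses shell-style NAME="value" assignments, and the function names read_rc_variable and routing_disabled are hypothetical. It does not reproduce how the ASE scripts themselves read the variable.

```python
import re

# Matches shell-style assignments such as: ASEROUTING="no"
_ASSIGN = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def read_rc_variable(text, name, default=None):
    """Return the last value assigned to name in rc.config-style text."""
    value = default
    for line in text.splitlines():
        match = _ASSIGN.match(line)
        if match and match.group(1) == name:
            # Drop surrounding whitespace and optional quotes.
            value = match.group(2).strip().strip('"').strip("'")
    return value


def routing_disabled(text):
    """True when ASEROUTING is explicitly set to "no" (case-insensitive).

    Assumption: an unset variable is not treated as "no" here.
    """
    value = read_rc_variable(text, "ASEROUTING")
    return value is not None and value.lower() == "no"
```

Given the hypothetical file contents 'ASEROUTING="no"' followed by 'export ASEROUTING', routing_disabled returns True, which is the case in which the gated daemon should not be started.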

Patch 25.00

TCR141-025

Patch: Distributed Lock Manager Corrections

State: Existing

This patch fixes a problem in TruCluster Production Server Software that can cause a cluster member to panic during a shutdown.

Patch 27.00

TCR141-027

Patch: Correction for KZPBA controllers

State: Existing

Without this patch the ase_fix_config utility will not recognize KZPBA controllers.

Patch 28.00

TCR141-028

Patch: Correction for KZPBA SCSI controllers

State: Existing

This patch replaces the /usr/sbin/clu_ivp script with a new script that will recognize the "isp" KZPBA SCSI controllers. Without this patch the clu_ivp program will ignore these controllers.

Patch 30.00

TCR141DX-002

Patch: Cluster Monitor Hang Correction

State: Existing

If an ASE service is renamed, any running Cluster Monitor (cmon) will lock up and hang. This occurs whether the rename was done from within cmon or independently of cmon.

Patch 31.00

TCR141-032

Patch: ase_mount_action Correction

State: Existing

Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

Patch 32.00

TCR141-033

Patch: Booting Node Hang Correction

State: Existing

Fixes a problem where a booting node hangs in the imc_init command. A subsequent reboot would also hang in imc_init, requiring a reboot of all members.

Patch 33.01

TCR141-034-1

Patch: Kern Mem Fault And simple_lock Panic Correction

State: Supersedes patches TCR141-011 (11.00), TCR141-019 (23.00)

This patch corrects the following:

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a kernel memory fault in am_select() in the Availability Manager.

Fixes a problem where the aseagent process goes into a U state when another ASE member leaves the cluster, due to the aseagent process waiting on a SCSI ping request that never completes.

Patch 35.00

TCR141-036

Patch: rm_spur Driver Correction

State: Supersedes patch TCR141-002 (2.00)

This patch corrects the following problems:

Eliminates the loss of a cluster node when "sysconfig -q rm" is run after the cluster has formed.

Allows more time to remove a node from an 8-node cluster before causing the system to panic.

Corrects some instances on busy clusters in which the software does not detect that a node has gone down.

Corrects the sense of the long/short heartbeat timeout delay in virtual hub systems, and enables code that allows the system to see a hub power up after it has been powered down.

Patch 44.00

TCR141-046

Patch: Lock Manager Corrections

State: Supersedes patches TCR141-014 (14.00), TCR141-022 (22.00), TCR141-026 (26.00), TCR141-040 (38.00)

This patch corrects the following:

Fixes a problem in the TruCluster Production Server Software in which a system can panic with:
rcv_invvalb_req: value block out of sequence

Two problems in the TruCluster Distributed Lock Manager (DLM): one resulting from a process's effective group ID not being checked when a process attempts to join a namespace, another in which repeated calls to the dlm_quecvt function would erroneously return DLM_LKBUSY status.

An assertion panic that occurs after a large number of transactions are made using the same lock. The assertion panic is triggered by integer wrapping of the lock transaction ID field. The system may panic with "dlm_panic". The actual assertion message is "<lk_txid == 0>".

An erroneous assertion involving deadlock search. The system may panic with "dlm_panic". The actual assertion message is "<otxid != (dlm_trans_id_t)-1>".

Fixes a problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Fixes a system panic with the following message:
"snd_grantlk_msg: no memory for message"

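The integer wrapping of the lock transaction ID field described for Patch 44.00 can be illustrated with a fixed-width counter. The following Python sketch is illustrative only: the 32-bit unsigned width and the names TXID_BITS, TXID_MASK, and next_txid are assumptions, since the width and layout of the actual DLM field are not stated here.

```python
TXID_BITS = 32                      # assumed width; not stated in the source
TXID_MASK = (1 << TXID_BITS) - 1    # largest value the counter can hold


def next_txid(txid):
    """Increment a fixed-width transaction ID, wrapping to 0 on overflow."""
    return (txid + 1) & TXID_MASK


# After enough transactions on the same lock, the counter returns to 0,
# so the "next" ID is smaller than the one before it.
last = TXID_MASK
wrapped = next_txid(last)
print(last, "->", wrapped)
```

Any check that assumes transaction IDs only increase, or that a particular value never recurs, can fail at the point where the counter wraps.
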
Patch 46.00

TCR141-044B

Patch: Kernel Memory Fault Panic

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044 (42.00)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Causes the host status monitor (asehsm) to actively go out and learn current member states before responding to the director with member state information.

Pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Patch 47.00

TCR141-013B

Patch: Memory Channel API Shared Library Correction

State: Supersedes patch TCR141-013 (13.00)

This patch fixes various problems in the MEMORY CHANNEL API. In particular, changes were made to ensure that the API is thread safe, that locks are properly acquired and released, and to increase performance and reliability.

Patch 48.00

TCR141-013-1

Patch: Memory Channel API Static Library Correction

State: Supersedes patch TCR141-013 (13.00)

Patch 49.00

TCR141-045-1

Patch: Support For New AdvFS Mount Option "-o noatimes"

State: Supersedes patches TCR141-016 (16.00), TCR141-041 (39.00), TCR141-039 (37.00), TCR141-048 (45.00), TCR141-045 (43.00)

This patch fixes the following problems:

Provides support in asemgr for the new AdvFS mount option "-o noatimes".

Fixes a problem where changes in the LSM configuration were not being properly handled during the delete of an LSM volume from a service.

Increases the timeout values for the LSM action scripts that are part of the TruCluster Production Server, Available Server and DECsafe Available Server products. The timeouts were too small for large LSM configurations and, under certain conditions, would cause the start of the services to fail, leaving them unassigned.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

Patch 50.00

TCR141-045B

Patch: LSM and AdvFS Corrections

State: Supersedes patches TCR141-016 (16.00), TCR141-041 (39.00), TCR141-039 (37.00), TCR141-048 (45.00), TCR141-045 (43.00)

This patch fixes the following problems:

Provides support in asemgr for the new AdvFS mount option "-o noatimes".

Fixes a problem where changes in the LSM configuration were not being properly handled during the delete of an LSM volume from a service.

Increases the timeout values for the LSM action scripts that are part of the TruCluster Production Server, Available Server and DECsafe Available Server products. The timeouts were too small for large LSM configurations and, under certain conditions, would cause the start of the services to fail, leaving them unassigned.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

Patch 51.00

TCR141-044-2

Patch: Not Properly Handling Error Condition Correction

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044-1 (42.01)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Causes the host status monitor (asehsm) to actively go out and learn current member states before responding to the director with member state information.

Pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Patch 52.00

TCR141-044C

Patch: Message Service Routine Fixes

State: Supersedes patches TCR141-005 (5.00), TCR141-029 (29.00), TCR141-035 (34.00), TCR141-038 (36.00), TCR141-042 (40.00), TCR141-043 (41.00), TCR141-044-1 (42.01)

This patch corrects the following:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes the following problems in the ASE Availability Manager (AM):
- A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.
- A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem where, during an orderly shutdown (init 0), the ASE agent shuts down the director before shutting down the services.

Causes the host status monitor (asehsm) to actively go out and learn current member states before responding to the director with member state information.

Pulling all monitored network interface cables on the machine running the asedirector and a service can result in another machine starting a new director and starting the same service before it has been fully stopped on the first machine. This is especially noticeable when a service takes a long time to stop.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.