3 Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0004.

Table 3-1 lists patches that have been updated.

Table 3-2 provides a summary of patches.

Table 3-1: Updated TruCluster Software Patches

Patch IDs	Change Summary
Patches 1.00, 31.00, 19.00, 24.00, 26.00, 64.00, 76.00, 90.00	Superseded by Patch 83.00
Patches 14.00, 20.00, 23.00, 37.00, 50.00, 5.00, 66.00, 71.00	Superseded by Patch 86.00
Patches 2.00, 8.00, 10.00, 15.00, 16.00, 18.00, 21.00, 22.01, 38.00, 30.00, 44.00, 53.00, 56.00, 4.00, 45.00, 62.00, 51.00, 69.00, 67.00, 73.00, 72.00, 74.00, 75.00, 81.00, 82.00, 84.00, 85.00, 87.00, 88.00, 89.00	Superseded by Patch 91.00
Patches 2.00, 8.00, 10.00, 15.00, 16.00, 18.00, 21.00, 22.01, 33.00, 40.00, 57.00, 63.00, 68.00, 77.00, 78.00, 80.00, 92.00, 93.00, 94.00	Superseded by Patch 95.00
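
Some patches in Table 3-1 are superseded by more than one newer patch (for example, Patch 2.00 appears under both Patch 91.00 and Patch 95.00). The following is a minimal sketch of how such a table could be queried. The names `SUPERSEDED_BY`, `build_reverse_index`, and `superseding_patches` are hypothetical and are not part of the patch kit or any TruCluster tool; the data is an abbreviated copy of Table 3-1.

```python
# Abbreviated, illustrative copy of Table 3-1:
# superseding patch -> patches it supersedes.
# Only the first two entries of the 91.00 and 95.00 rows are shown.
SUPERSEDED_BY = {
    "83.00": ["1.00", "31.00", "19.00", "24.00", "26.00", "64.00", "76.00", "90.00"],
    "86.00": ["14.00", "20.00", "23.00", "37.00", "50.00", "5.00", "66.00", "71.00"],
    "91.00": ["2.00", "8.00"],
    "95.00": ["2.00", "8.00"],
}


def build_reverse_index(table):
    """Map each superseded patch ID to the list of patches that supersede it."""
    index = {}
    for newer, older_list in table.items():
        for older in older_list:
            index.setdefault(older, []).append(newer)
    return index


def superseding_patches(patch_id, table=SUPERSEDED_BY):
    """Return the sorted patch IDs that supersede patch_id (empty if none)."""
    return sorted(build_reverse_index(table).get(patch_id, []))


if __name__ == "__main__":
    for pid in ("19.00", "2.00", "11.00"):
        print(pid, "->", superseding_patches(pid) or "not superseded")
```

A patch that does not appear in Table 3-1, such as Patch 11.00, returns an empty list.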

Table 3-2: Summary of TruCluster Patches

Patch IDs

Abstract

Patch 11.00

TCR150-012

Patch: Cluster Map Not Being Loaded At Boot Time Correction

State: Existing

Fixes a problem in TruCluster Available Server V1.5. The cluster map (/etc/CCM) was not being loaded at boot time, which prevented the Cluster Monitor utility (cmon) and its associated daemons (tractd and submon) from running.

Patch 13.00

TCR150DX-003

Patch: Cluster Monitor Hang Correction

State: Existing

Fixes a problem in which, if the name of an ASE service is changed using asemgr, any Cluster Monitor (cmon) running on the cluster hangs.

Patch 28.00

TCR150-031

Patch: ASE Check Service Script Could Be Corrupt

State: Existing

This patch corrects a problem in which an ASE check service script could become corrupted in the ASE configuration data base.

Patch 36.00

TCR150-025-1

Patch: dlm_panic Fix

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00)

This patch fixes the following problems:

Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Provides performance enhancements that are required by Oracle V8.0.5.

Fixes a system panic with the following message:
snd_grantlk_msg: no memory for message

Patch 46.00

TCR150-042

Patch: LSM Disk Not Updated in ASE Database

State: Supersedes patches TCR150-014 (12.00), TCR150-027 (25.00), TCR150-027A-1 (34.01), TCR150-035 (43.00)

This patch fixes the following problems:

Provides support in asemgr for the new AdvFS mount option -o noatimes.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes a problem in which a service fails to start when the ASE service name and the AdvFS domain name are identical.

Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

Patch 47.00

TCR150-044

Patch: Kernel Memory Fault Panic

State: Existing

This patch fixes two panics:

A kernel memory fault with bss_rm_biodone() in the stack.

A "bsc_rm_strategy: can't send notification" panic.

Patch 48.00

TCR150-045

Patch: Fix for AdvFS Panic

State: Supersedes patch TCR150-008 (7.00)

This patch corrects the following:

Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

Fixes a problem that could cause an AdvFS panic when a service that has quotas enabled is relocated. The problem occurs if a command is running that has a large number of arguments (>99).

Patch 49.00

TCR150-046

Patch: drdadmin Incorrectly Builds drdtab File

State: Supersedes patch TCR150-007 (6.00)

This patch fixes the following problems:

If a cluster member issued a drdadmin command to create a new DRD map entry while another member was rebooting or had explicitly issued a SCSI bus reset, the command could fail with the following message:
drdadmin: Error: Can not add map entry for drdadmin:
Error: Can not add map entry for <drd device name>

During system startup, as each DRD map entry is being added, the following informational message may be seen on the console:
No cluster has been setup, there are 0 nodes.

Fixes a problem where drdadmin does not properly build the drdtab file during bootup.

Patch 52.00

TCR150-050

Patch: Adding second cnxmond Causes Cluster Partition

State: Existing

This patch fixes a problem where starting a second cnxmond could cause a cluster partition. Attempting to start a second one will now log an error message, and the new process will exit.

Patch 60.00

TCR150-040A

Patch: Fix for Memory Channel API

State: Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019-1 (41.00), TCR150-039A (58.00)

This patch fixes the following problems:

Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

Problem that caused mcm_init to core dump when resolver fails on system boot.

Problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

Problem with the Memory Channel API whereby the function imc_ckerrcnt signaled that an error had occurred when in fact no error had occurred. The following is the error code seen when running an MPI code:
[5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)
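
The "negative key" case in the first item above means a key value whose most significant bit is set, so that it is negative when read as a signed integer. The following sketch illustrates only that bit-level relationship. It assumes a 32-bit key, which this document does not state, and the helper names `as_signed_32` and `has_msb_set` are hypothetical; it does not call the Memory Channel API.

```python
import struct


def as_signed_32(key):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", key & 0xFFFFFFFF))[0]


def has_msb_set(key):
    """Return True if bit 31 (the most significant bit of a 32-bit value) is set."""
    return (key & 0x80000000) != 0


if __name__ == "__main__":
    for key in (0x00000001, 0x7FFFFFFF, 0x80000001):
        print(hex(key), has_msb_set(key), as_signed_32(key))
```

Under the 32-bit assumption, any key at or above 0x80000000 is read as a negative number.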

Patch 61.00

TCR150-040B

Patch: Fix For ump_wait failure Error

State: Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019B (42.00), TCR150-039B (59.00)

This patch fixes the following problems:

Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

Problem that caused mcm_init to core dump when resolver fails on system boot.

Fixes a problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

Fixes a problem with the Memory Channel API whereby the function imc_ckerrcnt signaled that an error had occurred when in fact no error had occurred. The following is the error code seen when running an MPI code:
[5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)

Patch 65.00

TCR150-006B

Patch: System Panic dlm getch: illegal csid Correction

State: Existing

Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:

dlm getch: illegal csid

Patch 70.00

TCR150-057

Patch: Fix For tmv2_notify_cbf Error Message

State: Supersedes patches TCR150-004 (3.00), TCR150-030 (27.00), TCR150-036 (32.00)

This patch fixes the following problems in the ASE Availability Manager (AM):

A "simple_lock: time limit exceeded" panic on multiprocessor systems and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.

A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem in which tape services may not failover as expected.

Fixes two problems:
- A problem in which the following messages may appear in the binary error log:
  SCSI STATUS RESERVATION CONFLICT Target xx Lun xx
  
  or:
  Max SEND SCSI BUSY retries exhausted
- A problem in which a system may panic if the system has an IDE interface and ASE is then installed.

Fixes a problem in clustered systems. It reduces the occurrences of tmv2_notify_cbf error messages in the errlog.

Patch 79.00

TCR150-062C

Patch: Message Service Routine Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-014 (12.00), TCR150-027 (25.00), TCR150-024B-1 (39.00), TCR150-027B-1 (35.01)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Because the special device files were not created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Patch 83.00

TCR150-069

Patch: Fixes simple_lock timeout panic

State: Supersedes patches TCR150-002 (1.00), TCR150-015 (31.00), TCR150-021 (19.00), TCR150-026 (24.00), TCR150-029 (26.00), TCR150-052 (64.00), TCR150-065 (76.00), TCR150-078 (90.00)

This patch fixes the following problems:

Problem booting a second member into a cluster.

In a virtual hub cluster, shutting down one node can cause the other to crash. Typical panic strings on the node that crashes are:
rm_failover_self

and:
rm_failover_all: target rail offline

Various repairs in Memory Channel error handling. Fixes for virtual hub booting with cable unplugged.

Various problems with MC error handling discovered in cable pull under load tests.

Hubless MC2 systems hang during boot and/or experience error interrupts.

Reliable datagram (RDG) messaging support.

RDG: bug fix to the completion queue synchronization protocol.

Fixes a kernel memory fault in rm_lock_update_retry().

Fixes a problem where both nodes in a cluster will panic at the same time with a simple_lock timeout panic.

Patch 86.00

TCR150-074

Patch: Various dlm Corrections

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00), TCR150-025B (37.00), TCR150-047 (50.00), TCR150-006A (5.00), TCR150-041 (66.00), TCR150-059 (71.00)

This patch fixes the following problems:

Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Provides performance enhancements that are required by Oracle V8.0.5.

Fixes a system panic with the following message:
snd_grantlk_msg: no memory for message

Fixes a problem in TruCluster in which a node panics with the following string:
dlm_panic

Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:
dlm getch: illegal csid

Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

Fixes a problem in which a cluster node can panic with the following panic string:
convert_lock: bad lock state

Corrects a problem in which a failure in the session layer can cause DLM messages to become corrupt resulting in random DLM panic on the receiving member.

Patch 91.00

TCR150-080A

Patch: TCR Available Server and Production Server Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024-2 (38.00), TCR150-033 (30.00), TCR150-037 (44.00), TCR150-051 (53.00), TCR150-032A (56.00), TCR150-005 (4.00), TCR150-038 (45.00), TCR150-043A (62.00), TCR150-048 (51.00), TCR150-056 (69.00), TCR150-049A (67.00), TCR150-061 (73.00), TCR150-060A (72.00), TCR150-062A (74.00), TCR150-063A (75.00), TCR150-064A (81.00), TCR150-068A (82.00), TCR150-071 (84.00), TCR150-073A (85.00), TCR150-075 (87.00), TCR150-076 (88.00), TCR150-077 (89.00)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Because the special device files were not created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

Corrects a problem in which a member add will fail in a large ASE environment.

Corrects a problem with Networker displaying garbage characters following service names. It occurs when the service name is 8 characters or greater.

Corrects a problem that causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

Corrects a problem with TruCluster Available Server or Production Server cluster in which services have been started with elevated priority and scheduling algorithm. Under significant load this could lead to intermittent network and cluster problems.

Fixes a problem which caused a service not to start when there was a short network failure. This was seen only with long running stop scripts and special network configurations.

Fixes a bug where ASE picks up an extra socket after failing over.

Corrects a problem which causes an aseagent to hang when restarting the ASE member.

Patch 95.00

TCR150-080B

Patch: aseagent and asemgr Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024B (33.00), TCR150-024C (40.00), TCR150-032B (57.00), TCR150-043B (63.00), TCR150-049B (68.00), TCR150-060B (77.00), TCR150-062B (78.00), TCR150-063B (80.00), TCR150-064B (92.00), TCR150-068B (93.00), TCR150-073B (94.00)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Corrects problems with temporary files not being removed, and eliminates the need for one temporary file.

Fixes a problem that can cause the asemgr utility to core dump when modifying services that contain a large number of disks.

Fixes a number of ASE behavior problems resulting from network cable failure.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

Corrects a problem in which a member add will fail in a large ASE environment.

Corrects a problem which causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

Corrects a problem which causes an aseagent to hang when restarting the ASE member.