3 Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0007.

Table 3-1 lists patches that have been updated.

Table 3-2 provides a summary of patches.

Table 3-1: Updated TruCluster Software Patches

Patch IDs	Change Summary
Patches 3.00, 27.00, 32.00, 70.00	Superseded by Patch 97.00
Patches 14.00, 20.00, 23.00, 37.00, 50.00, 5.00, 66.00, 71.00, 86.00, 101.00	Superseded by Patch 102.00
Patches 1.00, 31.00, 19.00, 24.00, 26.00, 64.00, 76.00, 90.00, 83.00, 98.00	Superseded by Patch 104.00
Patches 12.00, 25.00, 34.01, 43.00, 46.00, 96.00, 99.00	Superseded by Patch 105.00
Patches 9.00, 17.00, 42.00, 59.00, 61.00	Superseded by Patch 106.00
Patches 2.00, 8.00, 10.00, 15.00, 16.00, 18.00, 21.00, 22.01, 38.00, 30.00, 44.00, 53.00, 56.00, 4.00, 45.00, 62.00, 51.00, 69.00, 67.00, 73.00, 72.00, 74.00, 75.00, 81.00, 82.00, 84.00, 85.00, 87.00, 88.00, 89.00, 91.00, 109.00, 100.00, 103.00, 107.00	Superseded by Patch 108.00
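
The relationships in Table 3-1 amount to a simple lookup from an old patch ID to the patch that replaces it. The following Python sketch is only an illustration and is not part of the patch kit. The names TABLE_3_1, SUPERSEDED_BY, and current_patch are hypothetical, and the data is transcribed from the table above.

```python
# Hypothetical helper, not shipped with the patch kit.
# Keys are the superseding patches; values are the patches they replace,
# transcribed from Table 3-1.
TABLE_3_1 = {
    "97.00": ["3.00", "27.00", "32.00", "70.00"],
    "102.00": ["14.00", "20.00", "23.00", "37.00", "50.00", "5.00",
               "66.00", "71.00", "86.00", "101.00"],
    "104.00": ["1.00", "31.00", "19.00", "24.00", "26.00", "64.00",
               "76.00", "90.00", "83.00", "98.00"],
    "105.00": ["12.00", "25.00", "34.01", "43.00", "46.00", "96.00", "99.00"],
    "106.00": ["9.00", "17.00", "42.00", "59.00", "61.00"],
    "108.00": ["2.00", "8.00", "10.00", "15.00", "16.00", "18.00", "21.00",
               "22.01", "38.00", "30.00", "44.00", "53.00", "56.00", "4.00",
               "45.00", "62.00", "51.00", "69.00", "67.00", "73.00", "72.00",
               "74.00", "75.00", "81.00", "82.00", "84.00", "85.00", "87.00",
               "88.00", "89.00", "91.00", "109.00", "100.00", "103.00",
               "107.00"],
}

# Invert the table so each superseded patch points at its replacement.
SUPERSEDED_BY = {old: new for new, olds in TABLE_3_1.items() for old in olds}


def current_patch(patch_id):
    """Return the patch that supersedes patch_id according to Table 3-1,
    or patch_id itself if the table lists no replacement for it."""
    return SUPERSEDED_BY.get(patch_id, patch_id)
```

For example, current_patch("27.00") returns "97.00", while a patch that Table 3-1 does not list, such as "11.00", is returned unchanged.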

Table 3-2: Summary of TruCluster Patches

Patch IDs

Abstract

Patch 11.00

TCR150-012

Patch: Cluster Map Not Being Loaded At Boot Time Correction

State: Existing

This patch fixes a problem in TruCluster Available Server V1.5. The cluster map (/etc/CCM) was not being loaded at boot time, which prevented the Cluster Monitor utility (cmon) and its associated daemons (tractd and submon) from running.

Patch 13.00

TCR150DX-003

Patch: Cluster Monitor Hang Correction

State: Existing

This patch fixes a problem in which changing the name of an ASE service with asemgr causes any Cluster Monitor (cmon) running on the cluster to hang.

Patch 28.00

TCR150-031

Patch: ASE Check Service Script Could Be Corrupt

State: Existing

This patch corrects a problem in which an ASE check service script could become corrupted in the ASE configuration data base.

Patch 36.00

TCR150-025-1

Patch: dlm_panic Fix

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00)

This patch fixes the following problems:

Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Provides performance enhancements that are required by Oracle V8.0.5.

Fixes a system panic with the following message:
snd_grantlk_msg: no memory for message

Patch 47.00

TCR150-044

Patch: Kernel Memory Fault Panic

State: Existing

This patch fixes two panics:

A kernel memory fault with bss_rm_biodone() in the stack.

A "bsc_rm_strategy: can't send notification" panic.

Patch 48.00

TCR150-045

Patch: Fix for AdvFS Panic

State: Supersedes patch TCR150-008 (7.00)

This patch corrects the following:

Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

Fixes a problem that could cause an AdvFS panic when a service that has quotas enabled is relocated. The problem occurs if a command is running that has a large number of arguments (>99).

Patch 49.00

TCR150-046

Patch: drdadmin Incorrectly Builds drdtab File

State: Supersedes patch TCR150-007 (6.00)

This patch fixes the following problems:

If a cluster member issues a drdadmin command to create a new DRD map entry while another member is rebooting or has explicitly issued a SCSI bus reset, the command may fail with the following message:
drdadmin: Error: Can not add map entry for drdadmin:
Error: Can not add map entry for <drd device name>

During system startup, as each DRD map entry is being added, the following informational message may be seen on the console:
No cluster has been setup, there are 0 nodes.

Fixes a problem where drdadmin does not properly build the drdtab file during bootup.

Patch 52.00

TCR150-050

Patch: Adding second cnxmond Causes Cluster Partition

State: Existing

This patch fixes a problem where starting a second cnxmond could cause a cluster partition. Attempting to start a second one will now log an error message, and the new process will exit.
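
The patched behavior (refuse to start a second instance, log an error, and exit) is a common single-instance pattern. The Python sketch below shows one way to get that behavior with an advisory file lock. It is not how cnxmond is implemented, which the source does not describe. The function names and the lock path are hypothetical.

```python
import fcntl
import sys


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking advisory lock on lock_path.

    Returns the open file object on success (the caller must keep it open
    for as long as the lock should be held), or None if another holder
    already has the lock.
    """
    lock_file = open(lock_path, "a+")
    try:
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Another instance holds the lock.
        lock_file.close()
        return None
    return lock_file


def main(lock_path="/tmp/example_daemon.lock"):  # hypothetical path
    lock = acquire_single_instance_lock(lock_path)
    if lock is None:
        # Same outcome as the patch describes: report the error and exit.
        print("error: another instance is already running", file=sys.stderr)
        return 1
    # ... the daemon's real work would go here, with the lock held ...
    return 0
```

The lock is released automatically when the holding process exits, so a crashed instance does not block a later restart.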

Patch 60.00

TCR150-040A

Patch: Fix for Memory Channel API

State: Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019-1 (41.00), TCR150-039A (58.00)

This patch fixes the following problems:

Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

Problem that caused mcm_init to core dump when the resolver fails on system boot.

Problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

Problem with the Memory Channel API whereby the function imc_ckerrcnt reported an error when in fact no error had occurred. The following is the error code seen when running an MPI code:
[5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)
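
The first item above describes a key as "negative" when its most significant bit is set. The short Python sketch below shows why, assuming a 32-bit signed key. The width is an assumption, the helper names are hypothetical, and nothing here reproduces the actual check inside imc_asalloc.

```python
import ctypes


def as_signed_32(key):
    """Interpret the low 32 bits of key as a two's-complement signed value.
    The 32-bit width is an assumption made for illustration."""
    return ctypes.c_int32(key & 0xFFFFFFFF).value


def msb_set(key):
    """Return True if bit 31 (the most significant bit) of key is set."""
    return bool(key & 0x80000000)
```

Under this assumption, any key for which msb_set returns True is read as a negative number by as_signed_32, which is the class of key the patch now allows.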

Patch 65.00

TCR150-006B

Patch: System Panic dlm getch: illegal csid Correction

State: Existing

Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:

dlm getch: illegal csid

Patch 79.00

TCR150-062C

Patch: Message Service Routine Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-014 (12.00), TCR150-027 (25.00), TCR150-024B-1 (39.00), TCR150-027B-1 (35.01)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Fixes a problem where, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Patch 95.00

TCR150-080B

Patch: aseagent and asemgr Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024B (33.00), TCR150-024C (40.00), TCR150-032B (57.00), TCR150-043B (63.00), TCR150-049B (68.00), TCR150-060B (77.00), TCR150-062B (78.00), TCR150-063B (80.00), TCR150-064B (92.00), TCR150-068B (93.00), TCR150-073B (94.00)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Corrects problems with temporary files not being removed and eliminates the need for one temporary file.

Fixes a problem that can cause the asemgr utility to core dump when modifying services that contain a large number of disks.

Fixes a number of ASE behavior problems resulting from network cable failure.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

Corrects a problem in which a member add will fail in a large ASE environment.

Corrects a problem which causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

Corrects a problem which causes an aseagent to hang when restarting the ASE member.

Patch 97.00

TCR150-081A

Patch: Fix SCSI device reservations lost

State: Supersedes patches TCR150-004 (3.00), TCR150-030 (27.00), TCR150-036 (32.00), TCR150-057 (70.00)

This patch fixes the following problems in the ASE Availability Manager (AM):

A "simple_lock: time limit exceeded" panic on multiprocessor systems and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.

A kernel memory fault panic caused by a race condition when the AM de-initializes.

Fixes a problem in which tape services may not failover as expected.

Fixes two problems:
- A problem in which the following messages may appear in the binary error log:
  SCSI STATUS RESERVATION CONFLICT Target xx Lun xx
  
  or:
  Max SEND SCSI BUSY retries exhausted
- A problem in which a system may panic if the system has an IDE interface and ASE is then installed.

Fixes a problem in clustered systems. It reduces the occurrences of tmv2_notify_cbf error messages in the errlog.

Fixes the following TCR problems:
- After error events are processed, a timing hole exists whereby important events can be lost.
- After an HSZ controller failure, SCSI device reservations could get lost because the error events are not being ordered properly.

Patch 102.00

TCR150-086

Patch: Various dlm Corrections

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00), TCR150-025B (37.00), TCR150-047 (50.00), TCR150-006A (5.00), TCR150-041 (66.00), TCR150-059 (71.00), TCR150-074 (86.00), TCR150-085 (101.00)

This patch fixes the following problems:

Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:
dlm_panic

Provides performance enhancements that are required by Oracle V8.0.5.

Fixes a system panic with the following message:
snd_grantlk_msg: no memory for message

Fixes a problem in TruCluster in which a node panics with the string dlm_panic.

Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:
dlm getch: illegal csid

Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

Fixes a problem in which a cluster node can panic with the panic string "convert_lock: bad lock state".

Corrects a problem in which a failure in the session layer can corrupt DLM messages, resulting in random DLM panics on the receiving member.

Fixes a problem that can cause a TruCluster member to panic during shutdown.

Fixes a bug where sometimes a certain shared sequence number will not be freed after use. It also fixes a problem where certain processes could get referenced several times.

Patch 104.00

TCR150-088

Patch: clumember produces error msg during system startup

State: Supersedes patches TCR150-002 (1.00), TCR150-015 (31.00), TCR150-021 (19.00), TCR150-026 (24.00), TCR150-029 (26.00), TCR150-052 (64.00), TCR150-065 (76.00), TCR150-078 (90.00), TCR150-069 (83.00), TCR150-082 (98.00)

This patch fixes the following problems:

Problem booting a second member into a cluster.

In a virtual hub cluster, shutting down one node can cause the other to crash. Typical panic strings on the node that crashes are as follows:
rm_failover_self

and
rm_failover_all: target rail offline

Various repairs in Memory Channel error handling. Fixes for virtual hub booting with cable unplugged.

Various problems with MC error handling discovered in cable-pull-under-load tests.

Hubless MC2 systems hang during boot and/or experience error interrupts.

Reliable datagram (RDG) messaging support.

RDG: bug fix to the completion queue synchronization protocol.

Fixes a kernel memory fault in rm_lock_update_retry().

Fixes a problem where both nodes in a cluster will panic at the same time with a simple_lock timeout panic.

Fixes a problem which can cause the following panic:
panic (cpu 0): rm_update_single_lock_miss: time limit exceeded

Fixes a problem where /sbin/init.d/clumember produces an error message during system startup if DRD_AUTO_FAILOVER is not defined in /etc/rc.config.
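
The last item concerns a startup script that read a variable which was not defined in /etc/rc.config. The general remedy for this class of problem is to fall back to a default when a setting is absent. The Python sketch below illustrates that idea on rc.config-style NAME="value" text. It does not show how clumember was changed, which the source does not describe. The function names and the default value "0" are hypothetical.

```python
import re

# Matches simple shell-style assignments such as NAME="value" or NAME=value.
_ASSIGNMENT = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)=(.*)$')


def parse_rc_config(text):
    """Collect NAME=value assignments from rc.config-style text.
    Comment lines and lines that are not plain assignments are skipped."""
    values = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line)
        if not match:
            continue
        name, value = match.group(1), match.group(2).strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[name] = value
    return values


def get_setting(values, name, default=""):
    """Return the setting, or default when the variable is not defined."""
    return values.get(name, default)
```

With this approach, get_setting(values, "DRD_AUTO_FAILOVER", "0") yields the placeholder default rather than an error when the variable is missing from the file.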

Patch 105.00

TCR150-089

Patch: Shell errors occur if invalid mount option specified

State: Supersedes patches TCR150-014 (12.00), TCR150-027 (25.00), TCR150-027A-1 (34.01), TCR150-035 (43.00), TCR150-042 (46.00), TCR150-079 (96.00), TCR150-083 (99.00)

This patch fixes the following problems:

Provides support in asemgr for the new AdvFS mount option -o noatimes.

Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes a problem in which a service fails to start when the ASE service name and the AdvFS domain name are identical.

Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

Fixes a problem that would cause an error from awk(1) when modifying an ASE service that contained a large number of LSM volumes. The error would prevent the service from being properly modified.

Fixes a problem that caused shell errors if an invalid mount option was specified via the asemgr menu.

Patch 106.00

TCR150-090

Patch: Fixes panic with the Memory Channel API

State: Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019B (42.00), TCR150-039B (59.00), TCR150-040B (61.00)

This patch fixes the following problems:

Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

Problem that caused mcm_init to core dump when the resolver fails on system boot.

Problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

Problem with the Memory Channel API whereby the function imc_ckerrcnt reported an error when in fact no error had occurred. The following is the error code seen when running an MPI code:
[5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)

Fixes a problem that can cause a panic in mcs_wait_cluster_event() when using the Memory Channel API.

Patch 108.00

TCR150-092

Patch: TCR Available Server and Production Server Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024-2 (38.00), TCR150-033 (30.00), TCR150-037 (44.00), TCR150-051 (53.00), TCR150-032A (56.00), TCR150-005 (4.00), TCR150-038 (45.00), TCR150-043A (62.00), TCR150-048 (51.00), TCR150-056 (69.00), TCR150-049A (67.00), TCR150-061 (73.00), TCR150-060A (72.00), TCR150-062A (74.00), TCR150-063A (75.00), TCR150-064A (81.00), TCR150-068A (82.00), TCR150-071 (84.00), TCR150-073A (85.00), TCR150-075 (87.00), TCR150-076 (88.00), TCR150-077 (89.00), TCR150-080A (91.00), TCR150-081B (109.00), TCR150-084 (100.00), TCR150-087 (103.00), TCR150-091 (107.00)

This patch fixes the following problems:

Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:
msgSvc: message queue overflow, LOST MESSAGE!

From this point on, no further messages will be received.

Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

Fixes a segmentation fault that can cause ASE daemons to exit or hang.

Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

Fixes several problems related to ASE service relocation and reporting in the event of network failures.

Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

Fixes a problem where, under certain circumstances, an ASE service modification could result in a corrupted configuration data base.

Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

Corrects a problem in which a member add will fail in a large ASE environment.

Corrects a problem with Networker displaying garbage characters following service names. The problem occurs when the service name is eight or more characters long.

Corrects a problem which causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

Corrects a problem in TruCluster Available Server and Production Server clusters in which services were started with an elevated priority and scheduling algorithm. Under significant load, this could lead to intermittent network and cluster problems.

Fixes a problem that caused a service not to start when there was a short network failure. This was seen only with long running stop scripts and special network configurations.

Fixes a bug where ASE picks up an extra socket after failing over.

Corrects a problem which causes an aseagent to hang when restarting the ASE member.

Fixes the following TCR problems:
- After error events are processed, a timing hole exists whereby important events can be lost.
- After an HSZ controller failure, SCSI device reservations could get lost because the error events are not being ordered properly.

Corrects a problem where modifying a service with a large number of DRDs will fail and a "could not malloc" message is seen in the daemon.log.

Fixes a problem that caused the asemgr utility to not run when called from a program that is owned by root and has the setuid bit turned on.

Corrects a problem in which a network cable failure that is corrected within seven seconds can leave the services in a bad state.

Fixes a problem that caused the asemgr to have a memory fault when adding multiple services one after the other.