3    Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0008.

Table 3-1 lists patches that have been updated.

Table 3-2 provides a summary of patches.

Table 3-1:  Updated TruCluster Software Patches

Patch IDs    Change Summary
Patches 121.00, 116.00, 118.00, 120.00    New
Patches 1.00, 31.00, 19.00, 24.00, 26.00, 64.00, 76.00, 90.00, 83.00, 98.00, 104.00, 111.00    Superseded by Patch 121.00
Patches 2.00, 8.00, 10.00, 15.00, 16.00, 18.00, 21.00, 22.01, 38.00, 30.00, 44.00, 53.00, 56.00, 4.00, 45.00, 62.00, 51.00, 69.00, 67.00, 73.00, 72.00, 74.00, 75.00, 81.00, 82.00, 84.00, 85.00, 87.00, 88.00, 89.00, 91.00, 109.00, 100.00, 103.00, 107.00, 108.00, 112.00, 113.00, 114.00    Superseded by Patch 116.00
Patches 9.00, 17.00, 42.00, 59.00, 61.00, 106.00    Superseded by Patch 120.00
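
The supersession relationships in Table 3-1 can be looked up mechanically. The following is a minimal Python sketch and is not part of the patch kit. The names SUPERSEDED_BY and superseding_patch are hypothetical; the patch IDs are copied from Table 3-1.

```python
# Patch IDs copied from Table 3-1. All names here are illustrative only.
SUPERSEDED_BY = {
    "121.00": ["1.00", "31.00", "19.00", "24.00", "26.00", "64.00",
               "76.00", "90.00", "83.00", "98.00", "104.00", "111.00"],
    "116.00": ["2.00", "8.00", "10.00", "15.00", "16.00", "18.00", "21.00",
               "22.01", "38.00", "30.00", "44.00", "53.00", "56.00", "4.00",
               "45.00", "62.00", "51.00", "69.00", "67.00", "73.00", "72.00",
               "74.00", "75.00", "81.00", "82.00", "84.00", "85.00", "87.00",
               "88.00", "89.00", "91.00", "109.00", "100.00", "103.00",
               "107.00", "108.00", "112.00", "113.00", "114.00"],
    "120.00": ["9.00", "17.00", "42.00", "59.00", "61.00", "106.00"],
}


def superseding_patch(patch_id):
    """Return the new patch that supersedes patch_id, or None if Table 3-1 lists none."""
    for new_patch, old_patches in SUPERSEDED_BY.items():
        if patch_id in old_patches:
            return new_patch
    return None


print(superseding_patch("22.01"))
print(superseding_patch("11.00"))
```

A patch that does not appear in the table, such as 11.00, returns None because Table 3-1 covers only patches whose status changed in this kit.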

Table 3-2:  Summary of TruCluster Patches

Patch IDs Abstract

Patch 11.00

TCR150-012

Patch: Cluster Map Not Being Loaded At Boot Time Correction

State: Existing

This patch fixes a problem in TruCluster Available Server V1.5. The cluster map (/etc/CCM) was not being loaded at boot time, which prevented the Cluster Monitor utility (cmon) and its associated daemons (tractd and submon) from running.

Patch 13.00

TCR150DX-003

Patch: Cluster Monitor Hang Correction

State: Existing

This patch fixes a problem in which changing the name of an ASE service using asemgr causes any Cluster Monitor (cmon) running on the cluster to hang.

Patch 28.00

TCR150-031

Patch: ASE Check Service Script Could Be Corrupt

State: Existing

This patch corrects a problem in which an ASE check service script could become corrupted in the ASE configuration database.

Patch 36.00

TCR150-025-1

Patch: dlm_panic Fix

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00)

This patch fixes the following problems:

  • Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:

    dlm_panic

  • Provides performance enhancements that are required by Oracle V8.0.5.

  • Fixes a system panic with the following message:

    snd_grantlk_msg: no memory for message

Patch 47.00

TCR150-044

Patch: Kernel Memory Fault Panic

State: Existing

This patch fixes two panics:

  • A kernel memory fault with bss_rm_biodone() in the stack.

  • A "bsc_rm_strategy: can't send notification" panic.

Patch 48.00

TCR150-045

Patch: Fix for AdvFS Panic

State: Supersedes patch TCR150-008 (7.00)

This patch corrects the following:

  • Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

  • Fixes a problem that could cause an AdvFS panic when a service that has quotas enabled is relocated. The problem occurs if a command is running that has a large number of arguments (>99).

Patch 49.00

TCR150-046

Patch: drdadmin Incorrectly Builds drdtab File

State: Supersedes patch TCR150-007 (6.00)

This patch fixes the following problems:

  • If a cluster member issued a drdadmin command to create a new DRD map entry while another member was rebooting or had explicitly issued a SCSI bus reset, the command could fail with the following message:

    drdadmin: Error: Can not add map entry for drdadmin:
     Error: Can not add map entry for <drd device name>

  • During system startup, as each DRD map entry is being added, the following informational message may be seen on the console:

    No cluster has been setup, there are 0 nodes.

  • Fixes a problem where drdadmin does not properly build the drdtab file during bootup.

Patch 52.00

TCR150-050

Patch: Adding second cnxmond Causes Cluster Partition

State: Existing

This patch fixes a problem where starting a second cnxmond could cause a cluster partition. Attempting to start a second one will now log an error message, and the new process will exit.
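
The behavior described above (a second instance logs an error and exits) is a single-instance guard. The release notes do not say how cnxmond implements it; the following Python sketch shows one common technique, an exclusive non-blocking lock on a lock file. The function name and the lock file path are hypothetical.

```python
import fcntl
import os
import tempfile


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if the lock is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Demonstration with a throwaway lock file (the path is hypothetical).
demo_path = os.path.join(tempfile.mkdtemp(), "example_daemon.lock")
first = acquire_single_instance_lock(demo_path)
second = acquire_single_instance_lock(demo_path)
if second is None:
    print("error: another instance is already running; exiting")
```

The lock is released when the first holder closes its file or exits, so a stale lock file left behind by a crashed process does not block a later start.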

Patch 60.00

TCR150-040A

Patch: Fix for Memory Channel API

State: Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019-1 (41.00), TCR150-039A (58.00)

This patch fixes the following problems:

  • Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

  • Problem that caused mcm_init to core dump when the resolver fails on system boot.

  • Problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

  • Problem with the Memory Channel API whereby the function imc_ckerrcnt was signifying an error had occurred when in fact no error had occurred. The following is the error code seen when running an MPI code:

    [5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)
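
The "negative key" in the imc_asalloc item above refers to a key whose most significant bit is set, which reads as a negative number when the same bits are interpreted as a signed integer. The following Python sketch illustrates only that arithmetic. It assumes a 32-bit key width, which the release notes do not state, and the helper name as_signed_32 is hypothetical; it does not call the Memory Channel API.

```python
import struct


def as_signed_32(key):
    """Reinterpret the low 32 bits of key as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", key & 0xFFFFFFFF))[0]


key_with_msb_set = 0x80000001    # most significant bit set (assumed 32-bit key)
key_without_msb = 0x7FFFFFFF     # most significant bit clear

print(as_signed_32(key_with_msb_set))   # negative when read as signed
print(as_signed_32(key_without_msb))    # positive either way
```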

Patch 65.00

TCR150-006B

Patch: System Panic dlm getch: illegal csid Correction

State: Existing

Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:

dlm getch: illegal csid

Patch 79.00

TCR150-062C

Patch: Message Service Routine Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-014 (12.00), TCR150-027 (25.00), TCR150-024B-1 (39.00), TCR150-027B-1 (35.01)

This patch fixes the following problems:

  • Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:

    msgSvc: message queue overflow, LOST MESSAGE!

    From this point on, no further messages will be received.

  • Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

  • Fixes a segmentation fault that can cause ASE daemons to exit or hang.

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

  • Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

  • Fixes several problems related to ASE service relocation and reporting in the event of network failures.

  • Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

  • Fixes a problem where, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

  • Fixes several TCR problems involving large sites with services containing large numbers of DRDs.
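
The Host Status Monitor item above concerns interface counters that get zeroed. The release notes do not describe how asehsm samples the counters, so the following Python sketch is an assumption-laden illustration only: it shows how a monitor that compares successive packet counts can treat a counter that goes backwards as a reset rather than as a failed network. The function name and the status strings are hypothetical.

```python
def interface_status(previous, current):
    """Classify an interface from two successive packet-counter samples.

    A counter that goes backwards is treated as a reset, not as a failure.
    """
    if current < previous:
        return "COUNTER_RESET"   # re-baseline and sample again
    if current == previous:
        return "NO_TRAFFIC"      # candidate for a network-down report
    return "UP"


print(interface_status(1000, 1200))
print(interface_status(1000, 0))
```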

Patch 95.00

TCR150-080B

Patch: aseagent and asemgr Fixes

State: Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024B (33.00), TCR150-024C (40.00), TCR150-032B (57.00), TCR150-043B (63.00), TCR150-049B (68.00), TCR150-060B (77.00), TCR150-062B (78.00), TCR150-063B (80.00), TCR150-064B (92.00), TCR150-068B (93.00), TCR150-073B (94.00)

This patch fixes the following problems:

  • Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:

    msgSvc: message queue overflow, LOST MESSAGE!

    From this point on, no further messages will be received.

  • Fixes a problem in Version 1.5 of the TruCluster Production Server and Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

  • Fixes a segmentation fault that can cause ASE daemons to exit or hang.

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

  • Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

  • Fixes several problems related to ASE service relocation and reporting in the event of network failures.

  • Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

  • Corrects problems with temporary files not being removed and eliminates the need for one temporary file.

  • Fixes a problem that can cause the asemgr utility to core dump when modifying services that contain a large number of disks.

  • Fixes a number of ASE behavior problems resulting from network cable failure.

  • Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

  • Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

  • Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

  • Corrects a problem in which a member add will fail in a large ASE environment.

  • Corrects a problem which causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

  • Corrects a problem which causes an aseagent to hang when restarting the ASE member.

Patch 97.00

TCR150-081A

Patch: Fix SCSI device reservations lost

State: Supersedes patches TCR150-004 (3.00), TCR150-030 (27.00), TCR150-036 (32.00), TCR150-057 (70.00)

This patch fixes the following problems in the ASE Availability Manager (AM):

  • A "simple_lock: time limit exceeded" panic on multiprocessor systems and system hangs on single-processor systems. This can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.

  • A kernel memory fault panic caused by a race condition when the AM de-initializes.

  • Fixes a problem in which tape services may not fail over as expected.

  • Fixes two problems:

    • A problem in which the following messages may appear in the binary error log:

      SCSI STATUS RESERVATION CONFLICT Target xx Lun xx

      or:

      Max SEND SCSI BUSY retries exhausted

    • A problem in which a system may panic if the system has an IDE interface and ASE is then installed.

  • Reduces the occurrence of tmv2_notify_cbf error messages in the errlog on clustered systems.

  • Fixes the following TCR problems:

    • After error events are processed, a timing hole exists whereby important events can be lost.

    • After an HSZ controller failure, SCSI device reservations could get lost because the error events are not being ordered properly.

Patch 102.00

TCR150-086

Patch: Various dlm Corrections

State: Supersedes patches TCR150-016 (14.00), TCR150-022 (20.00), TCR150-025 (23.00), TCR150-025B (37.00), TCR150-047 (50.00), TCR150-006A (5.00), TCR150-041 (66.00), TCR150-059 (71.00), TCR150-074 (86.00), TCR150-085 (101.00)

This patch fixes the following problems:

  • Problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:

    dlm_panic

  • Provides performance enhancements that are required by Oracle V8.0.5.

  • Fixes a system panic with the following message:

    snd_grantlk_msg: no memory for message

  • Fixes a problem in TruCluster in which a node panics with the string dlm_panic.

  • Fixes a problem in the TruCluster Production Server Software in which a system can panic with the following message:

    dlm getch: illegal csid

  • Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

  • Fixes a problem in which a cluster node can panic with the panic string "convert_lock: bad lock state".

  • Corrects a problem in which a failure in the session layer can cause DLM messages to become corrupted, resulting in random DLM panics on the receiving member.

  • Fixes a problem that can cause a TruCluster member to panic during shutdown.

  • Fixes a bug where sometimes a certain shared sequence number will not be freed after use. It also fixes a problem where certain processes could get referenced several times.

Patch 105.00

TCR150-089

Patch: Shell errors occur if invalid mount option specified

State: Supersedes patches TCR150-014 (12.00), TCR150-027 (25.00), TCR150-027A-1 (34.01), TCR150-035 (43.00), TCR150-042 (46.00), TCR150-079 (96.00), TCR150-083 (99.00)

This patch fixes the following problems:

  • Provides support in asemgr for the new AdvFS mount option -o noatimes.

  • Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

  • Fixes a problem in which a service fails to start when the ASE service name and the AdvFS domain name are identical.

  • Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

  • Fixes a deadlock condition between the DLM rebuild thread and the Connection Manager ping daemon (cnxpingd). The deadlock can cause users of DLM (e.g., Oracle) to hang.

  • Fixes a problem that would cause an error from awk(1) when modifying an ASE service that contained a large number of LSM volumes. The error would prevent the service from being properly modified.

  • Fixes a problem that caused shell errors if an invalid mount option was specified via the asemgr menu.

Patch 116.00

TCR150-095

Patch: TCR Available Server and Production Server Fixes

State: New. Supersedes patches TCR150-003 (2.00), TCR150-009 (8.00), TCR150-011 (10.00), TCR150-017 (15.00), TCR150-018 (16.00), TCR150-020 (18.00), TCR150-023 (21.00), TCR150-024-1 (22.01), TCR150-024-2 (38.00), TCR150-033 (30.00), TCR150-037 (44.00), TCR150-051 (53.00), TCR150-032A (56.00), TCR150-005 (4.00), TCR150-038 (45.00), TCR150-043A (62.00), TCR150-048 (51.00), TCR150-056 (69.00), TCR150-049A (67.00), TCR150-061 (73.00), TCR150-060A (72.00), TCR150-062A (74.00), TCR150-063A (75.00), TCR150-064A (81.00), TCR150-068A (82.00), TCR150-071 (84.00), TCR150-073A (85.00), TCR150-075 (87.00), TCR150-076 (88.00), TCR150-077 (89.00), TCR150-080A (91.00), TCR150-081B (109.00), TCR150-084 (100.00), TCR150-087 (103.00), TCR150-091 (107.00), TCR150-092 (108.00), TCR150-100 (112.00), TCR150-099 (113.00), TCR150-096 (114.00)

This patch fixes the following problems:

  • Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and TruCluster Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:

    msgSvc: message queue overflow, LOST MESSAGE!

    From this point on, no further messages will be received.

  • Fixes a problem in Version 1.5 of the TruCluster Production Server and TruCluster Available Server products where, during the start of a service, missing special device files were not being created for HSZ disks. Since the special device files did not get created, the service start would fail.

  • Fixes a segmentation fault that can cause ASE daemons to exit or hang.

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.

  • Fixes scalability problems in the DECsafe Available Server, TruCluster Available Server, and TruCluster Production Server products. The problems caused the asemgr to core dump when adding or modifying services with a large number of disks.

  • Fixes several problems related to ASE service relocation and reporting in the event of network failures.

  • Fixes a problem that could cause the ASE daemons or asemgr utility to core dump with a segmentation violation.

  • Fixes a problem where, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

  • Fixes several TCR problems involving large sites with services containing large numbers of DRDs.

  • Fixes a problem that caused the ASE daemons and asemgr to core dump when the lookup for an IP address failed.

  • Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

  • Corrects a problem in which a member add will fail in a large ASE environment.

  • Corrects a problem which causes asemgr to core dump when modifying a DRD service to add more than 200 devices in a single service.

  • Corrects a problem which causes an aseagent to hang when restarting the ASE member.

  • Corrects a problem in TruCluster Available Server and Production Server clusters in which services were started with an elevated priority and scheduling algorithm. Under significant load, this could lead to intermittent network and cluster problems.

  • Fixes a problem that caused a service not to start when there was a short network failure. This was seen only with long-running stop scripts and special network configurations.

  • Fixes a bug where ASE picks up an extra socket after failing over.

  • Fixes the following TCR problems:

    • After error events are processed, a timing hole exists whereby important events can be lost.

    • After an HSZ controller failure, SCSI device reservations could get lost because the error events are not being ordered properly.

  • Corrects a problem where modifying a service with a large number of DRDs will fail and a "could not malloc" message is seen in the daemon.log.

  • Fixes a problem that caused the asemgr utility to not run when called from a program that is owned by root and has the setuid bit turned on.

  • Corrects a problem in which a network cable failure that is corrected within seven seconds of the failure can leave the services in a bad state.

  • Fixes a problem that caused the asemgr to have a memory fault when adding multiple services one after the other.

  • Fixes a problem where timeout values of greater than 30 seconds in /etc/hsm.conf would cause the ASE agent to fail at startup.

  • Fixes two issues with clusters:

    • When the cluster is brought up with ASE off, other members report it as UP and RUNNING instead of UP and UNKNOWN.

    • When a restricted service is running on a member, and asemember stop or aseam stop is executed, the service status is still reported as the member name instead of Unassigned.

  • Fixes a problem that caused the asemgr to report that a disk, or mount point, was in multiple services when modifying a service name.

  • Fixes a bug where the aseagent will occasionally core dump on a SCSI bus hang.

Patch 118.00

TCR150-093

Patch: mountd exits without error during boot

State: New

This patch fixes a problem that could cause mountd to exit without error during boot.

Patch 120.00

TCR150-098

Patch: Fix for Memory Channel API node crash

State: New. Supersedes patches TCR150-010 (9.00), TCR150-019 (17.00), TCR150-019B (42.00), TCR150-039B (59.00), TCR150-040B (61.00), TCR150-090 (106.00)

This patch fixes the following problems:

  • Problem with the Memory Channel API whereby the function imc_asalloc did not allow a negative key (most significant bit of key being set).

  • Problem that caused mcm_init to core dump when the resolver fails on system boot.

  • Problem in which a resolver failure produces an unhelpful error message from mcm_init on boot.

  • Problem with the Memory Channel API whereby the function imc_ckerrcnt was signifying an error had occurred when in fact no error had occurred. The following is the error code seen when running an MPI code:

    [5]MPI Die-ump2chck.c 91 "ump_wait failure" (-16)

  • Fixes a problem that can cause a panic in mcs_wait_cluster_event() when using the Memory Channel API.

  • Fixes a problem with the Memory Channel API whereby, under certain circumstances, an MC-API lock held by a node that crashes is not released after the crash.

Patch 121.00

TCR150-097

Patch: clumember produces error msg during system startup

State: New. Supersedes patches TCR150-002 (1.00), TCR150-015 (31.00), TCR150-021 (19.00), TCR150-026 (24.00), TCR150-029 (26.00), TCR150-052 (64.00), TCR150-065 (76.00), TCR150-078 (90.00), TCR150-069 (83.00), TCR150-082 (98.00), TCR150-088 (104.00), TCR150-094 (111.00)

This patch fixes the following problems:

  • Problem booting a second member into a cluster.

  • In a virtual hub cluster, shutting down one node can cause the other to crash. Typical panic strings on the node that crashes are as follows:

    rm_failover_self

    and

    rm_failover_all: target rail offline

  • Various repairs in Memory Channel error handling. Fixes for virtual hub booting with cable unplugged.

  • Various problems with MC error handling discovered in cable-pull-under-load tests.

  • Hubless MC2 systems hang during boot and/or experience error interrupts.

  • Reliable datagram (RDG) messaging support.

  • RDG: bug fix to the completion queue synchronization protocol.

  • Fixes a kernel memory fault in rm_lock_update_retry().

  • Fixes a problem where both nodes in a cluster will panic at the same time with a simple_lock timeout panic.

  • Fixes a problem which can cause the following panic:

    panic (cpu 0): rm_update_single_lock_miss: time limit exceeded

  • Fixes a problem where /sbin/init.d/clumember produces an error message during system startup if DRD_AUTO_FAILOVER is not defined in /etc/rc.config.

  • Fixes a problem that could cause a TruCluster Production Server member to hang during boot, and can cause a "simple lock time limit exceeded" panic.

  • Fixes a problem that could cause an error to be returned when the cluster software should instead wait until a global lock is freed.
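
The clumember item above involves a variable, DRD_AUTO_FAILOVER, that may be absent from /etc/rc.config. The following Python sketch illustrates the general idea of falling back to a caller-supplied default when a variable is undefined. It is not the clumember script. It assumes rc.config holds NAME="value" assignments followed by export lines; the function name, the sample contents, and the default value used here are hypothetical.

```python
def read_rc_variable(text, name, default=None):
    """Return the value assigned to name in rc.config-style text, or default."""
    value = default
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and "export NAME" lines.
        if not line or line.startswith("#") or line.startswith("export "):
            continue
        key, sep, raw = line.partition("=")
        if sep and key.strip() == name:
            value = raw.strip().strip('"')   # a later assignment wins
    return value


# Hypothetical file contents in which DRD_AUTO_FAILOVER is not defined.
sample = 'HOSTNAME="member1"\nexport HOSTNAME\n'

print(read_rc_variable(sample, "HOSTNAME"))
print(repr(read_rc_variable(sample, "DRD_AUTO_FAILOVER", default="")))
```

Passing an explicit default lets the caller proceed without an error when the variable has never been set.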