4    Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0005.

Table 4-1 lists patches that have been updated.

Table 4-1:  Updated TruCluster Software Patches

Patch IDs                                        Change Summary
Patches 22.00, 15.00                             New
Patch 17.00                                      Superseded by Patch 22.00
Patches 5.00, 16.00, 23.00                       Superseded by Patch 26.00
Patches 6.00, 9.00                               Superseded by Patch 21.00
Patches 12.00, 18.00, 19.00, 24.00, 25.00        Superseded by Patch 27.00
Patches 12.00, 18.00, 19.00, 24.00, 25.00        Superseded by Patch 28.00
Patch 14.00                                      Superseded by Patch 20.01

Table 4-2 provides a summary of patches in Patch Kit-0005.

Table 4-2:  Summary of TruCluster Patches

Patch IDs Abstract

Patch 1.00

TCR100-001

Patch: Function Naming And Kernel Build Failure Correction

State: Existing

Backports changes to function names and data symbols. This fixes a problem where some function names in TCR V1.0 collided with function names in X.25 V2.0, causing the kernel build to fail.

Patch 2.00

TCR100-002

Patch: Support The DEVGETGEOM ioctl And SAP R3

State: Existing

This patch allows the DRD subsystem to support the DEVGETGEOM ioctl. This change is necessary for support of SAP R3 with TruCluster software.

Patch 7.00

TCR100-008

Patch: Disk Label Re-Init, Retry Command Correction

State: Existing

This patch fixes the following problems with Logical Storage Manager (LSM) volumes in DECsafe Available Server (ASE) and TruCluster environments:

  • After a patch to the LSM voldisk command is installed, the disk labels of LSM disks are inadvertently reinitialized during service modification. This causes attempts to start the service to fail and leaves the service unassigned.

  • Certain LSM operations that should have been retried were failing on the first attempt.

  • Retry messages were not being printed to the log file.

Patch 8.00

TCR100-009

Patch: Correction For Service Aliases

State: Existing

This patch fixes a problem in /var/opt/TCR100/ase/sbin/nfs_ifconfig that corrupts the memory resident routing table and subsequent netstat output (netstat -r) during ASE service failover.

Patch 10.00

TCR100-011

Patch: Cluster Transition Problem Corrections

State: Supersedes patches TCR100-003 (3.00), TCR100-005 (4.00)

This patch fixes the following problems:

  • Provides a software workaround for a hardware problem with CCMAA-AA MEMORY CHANNEL adapters that may cause a cluster to hang or panic during node transitions. (Clusters using only CCMAA-BA MEMORY CHANNEL adapters do not exhibit this problem.)

  • Fixes a problem with process space remote sync page deallocation.

  • Backports a fix that checks for a valid state (FAILOVER_DONE) during failover. Not checking for this condition caused a hang.

Patch 11.00

TCR100-012

Patch: Panic During A Shutdown Correction

State: Existing

This patch corrects a problem whereby the ASE agent daemon (aseagent), ASE director daemon (asedirector), the trigger-action server daemon (tractd), or the submon process fails and exits without a core file if a SIGPIPE or other stray signal occurs.

Patch 13.00

TCR100-014

Patch: Recognize KZPBA Correction

State: Existing

This patch adds KZPBA controller support for the ase_fix_config utility.

Patch 15.00

TCR100-016

Patch: Workaround To vquotacheck Command Panic Correction

State: New

Fixes a problem in which running the vquotacheck command on a filesystem participating in an ASE service will cause a system to panic if the service fails over or relocates while the command is in progress.

Patch 20.01

TCR100-021-1

Patch: System Panic, SCSI Error Condition Correction

State: New. Supersedes patch TCR100-015 (14.00)

This patch fixes the following problems:

  • Fixes the following problems in the ASE Availability Manager (AM):

    • A "simple_lock: time limit exceeded" panic on multiprocessor systems, and system hangs on single-processor systems. These can occur when multiple host target mode requests are issued due to SCSI aborts and resets on a shared bus.

    • A kernel memory fault panic caused by a race condition when the AM de-initializes.

  • This patch is part of the set of DIGITAL UNIX patches required to support the HSZ70 UltraSCSI RAID Array controller on the KZPSA adapter under TCR 1.0.

Patch 21.00

TCR100-022

Patch: Lock Trans ID, Group ID And Lock Processing Corr

State: Supersedes patches TCR100-007 (6.00), TCR100-010 (9.00)

This patch fixes the following problems:

  • Fixes two problems in the TruCluster Distributed Lock Manager (DLM):

    • A process's effective group ID not being checked when a process attempts to join a namespace.

    • Repeated calls to the dlm_quecvt function would erroneously return DLM_LKBUSY status.

  • Corrects an assertion panic that occurs after a large number of transactions are made using the same lock. The assertion panic is triggered by integer wrapping of the lock transaction ID field. The system may panic with "dlm_panic". The actual assertion message is "<lkbp-->txid == 0>".

  • Fixes a problem that can cause a cluster member to panic in rcv_deqlk_msg() with the panic string set to:

    dlm_panic
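
The integer wrapping mentioned for the lock transaction ID can be illustrated with a minimal sketch. The 32-bit field width, the function names, and the skip-zero guard are assumptions made for illustration only; the source does not state the real field size or how the patch handles the wrap.

```python
TXID_BITS = 32                      # assumed width, not taken from the source
TXID_MASK = (1 << TXID_BITS) - 1    # largest value the field can hold


def next_txid(txid: int) -> int:
    """Advance a fixed-width ID the way a C unsigned integer would."""
    return (txid + 1) & TXID_MASK


def next_txid_skip_zero(txid: int) -> int:
    """Same increment, but never hand out 0 after a wrap (one possible guard)."""
    nxt = (txid + 1) & TXID_MASK
    return nxt if nxt != 0 else 1


if __name__ == "__main__":
    last = TXID_MASK
    print(next_txid(last))            # wraps to 0
    print(next_txid_skip_zero(last))  # wraps to 1
```

Any consistency check that assumes the ID only ever grows, or that treats one value as reserved, can fail once enough transactions have been made on the same lock for the counter to wrap.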

Patch 22.00

TCR100-023

Patch: Panic During Transition Correction

State: New. Supersedes patch TCR100-018 (17.00)

This patch fixes the following problems:

  • Allows more time to remove a node from an 8-node cluster before causing the system to panic.

  • Fixes a problem that can cause a "sysconfig -q rm" command to crash a cluster member.

Patch 26.00

TCR100-027

Patch: ASE Data Base For LSM Correction

State: Supersedes patches TCR100-006 (5.00), TCR100-017 (16.00), TCR100-024 (23.00)

This patch fixes the following problems:

  • Fixes a problem where changes in the LSM configuration were not being properly handled during the delete of an LSM volume from a service.

  • Increases the timeout values for the LSM action scripts that are part of the TruCluster Production Server, Available Server and DECsafe Available Server products. The timeouts were too small for large LSM configurations and, under certain conditions, would cause the start of the services to fail, leaving them unassigned.

  • Fixes a problem in ASE where removing a volume from an AdvFS domain mounted by an ASE service causes the service to fail to restart. The daemon.log file reports "I/O error".

  • Fixes a problem in which, under certain circumstances, an ASE service modification could result in a corrupted configuration database.

Patch 28.00

TCR100-026B

Patch: Not Properly Handling Error Condition Correction

State: Supersedes patches TCR100-013 (12.00), TCR100-019 (18.00), TCR100-020 (19.00), TCR100-025 (24.00), TCR100-026 (25.00)

This patch fixes the following problems:

  • Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:

    msgSvc: message queue overflow, LOST MESSAGE!

    From this point on, no further messages will be received.

  • Fixes a problem that may occur in an ASE (either DECsafe ASE Version 1.3, TruCluster Available Server, or TruCluster Production Server) when the ASE encounters connection attempts from hosts whose IP addresses cannot be resolved to hostnames. Instead of printing a warning about a possible security breach, the ASE daemons will core dump with a segmentation violation. One cause of this problem may be unknown hosts on the network using public domain internet security software which scans all TCP ports on remote hosts.

  • Fixes a problem in TruCluster Production Server Software that can cause a cluster member to panic during a shutdown. One of the following panics will be issued by the distributed lock manager (DLM) if it attempts to rebuild the member's lock database and the connection manager daemons were already killed before they were able to stop all DLM activity:

    rcv_credir_req: illegal state
    rcv_crelk_req: illegal state
    rcv_newlk_req: illegal state

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.
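
The unresolvable-address problem in the list above comes down to handling a failed reverse lookup instead of assuming it succeeded. The following is a minimal sketch of that pattern, not the ASE daemons' code: peer_label, describe_connection, and the injectable resolver parameter are hypothetical names introduced here, and the warning text is illustrative.

```python
import socket


def peer_label(ip_addr, resolver=socket.gethostbyaddr):
    """Return (label, resolved) for a connecting peer.

    Falls back to the raw address when the reverse lookup fails, so the
    caller can log a warning rather than use a missing result.
    """
    try:
        hostname, _aliases, _addrs = resolver(ip_addr)
        return hostname, True
    except (socket.herror, socket.gaierror):
        return ip_addr, False


def describe_connection(ip_addr, resolver=socket.gethostbyaddr):
    """Build a log line; unresolved peers produce a warning, not a crash."""
    label, resolved = peer_label(ip_addr, resolver)
    if resolved:
        return "connection from " + label
    return "WARNING: connection from unresolvable address " + label
```

Passing the resolver as a parameter keeps the sketch testable without a network; a port scan from an unknown host would simply take the warning path.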

Patch 29.00

TCR100-031

Patch: asemgr Core Dumps

State: Supersedes patches TCR100-013 (12.00), TCR100-019 (18.00), TCR100-020 (19.00), TCR100-025 (24.00), TCR100-026 (25.00), TCR100-026A (27.00)

This patch fixes the following problems:

  • Corrects a problem in which asemgr can core dump when a member is added back into an ASE.

  • Fixes a problem in the message service routines used by the daemons in TruCluster Available Server and Production Server software. When the message queue fills, the following message is entered in the daemon.log file, but the queue is not emptied:

    msgSvc: message queue overflow, LOST MESSAGE!

    From this point on, no further messages will be received.

  • Fixes a problem that may occur in an ASE (either DECsafe ASE Version 1.3, TruCluster Available Server, or TruCluster Production Server) when the ASE encounters connection attempts from hosts whose IP addresses cannot be resolved to hostnames. Instead of printing a warning about a possible security breach, the ASE daemons will core dump with a segmentation violation. One cause of this problem may be unknown hosts on the network using public domain internet security software which scans all TCP ports on remote hosts.

  • Fixes a problem in TruCluster Production Server Software that can cause a cluster member to panic during a shutdown. One of the following panics will be issued by the distributed lock manager (DLM) if it attempts to rebuild the member's lock database and the connection manager daemons were already killed before they were able to stop all DLM activity:

    rcv_credir_req: illegal state
    rcv_crelk_req: illegal state
    rcv_newlk_req: illegal state

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes a problem that caused the asedirector to core dump if asemgr processes were modifying services from more than one node in the cluster at the same time.
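
The message queue overflow in the list above describes a queue that reports a lost message but is never emptied, so nothing further is received. A minimal sketch of the intended behavior follows, assuming a simple fixed-capacity FIFO; the BoundedMessageQueue class and its methods are hypothetical and do not reflect the actual msgSvc implementation.

```python
from collections import deque


class BoundedMessageQueue:
    """Fixed-capacity FIFO that counts dropped messages instead of stalling."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.lost = 0

    def put(self, msg):
        """Queue msg; on overflow, drop it, record the loss, and return False."""
        if len(self.items) >= self.capacity:
            self.lost += 1
            return False
        self.items.append(msg)
        return True

    def drain(self):
        """Remove and return all queued messages so new ones can be accepted."""
        drained = list(self.items)
        self.items.clear()
        return drained
```

The point of the sketch is the drain step: an overflow costs the message that did not fit, but once the queue is emptied, delivery resumes.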