3    Summary of TruCluster Software Patches

This chapter summarizes the TruCluster software patches included in Patch Kit-0008.

Table 3-1 provides a summary of patches.

Table 3-1:  Summary of TruCluster Patches

Patch IDs Abstract

Patch 4.00

TCR160-004

Patch: Fix for Kernel Memory Fault On DRD Client Nodes

State: Existing

This patch fixes a kernel memory fault on the DRD client nodes just as or after the DRD server node has initiated MC2 hub failover.

Patch 7.00

TCR160-010

Patch: Fix for Reliable Datagram API

State: Supersedes patch TCR160-001 (1.00)

This patch corrects the following:

  • Reliable Datagram (RDG) messaging support.

  • RDG: bug fix to the completion queue synchronization protocol.

Patch 8.00

TCR160-011

Patch: doconfig may hang when running in TruCluster environment

State: Existing

This patch fixes two problems that could cause doconfig to appear to hang when running in a TruCluster environment.

Patch 33.00

TCR160-037

Patch: Fix for drdadmin problems

State: Existing

This patch fixes various problems with drdadmin to make it more user friendly.

Patch 34.00

TCR160-038

Patch: Fixes a limitation in ase_reconfig_bus

State: Existing

This patch fixes a limitation in ase_reconfig_bus. Now up to 99 buses can be reconfigured with this command.

Patch 36.00

TCR160-040

Patch: Fix for asedirector hang

State: Existing

This patch fixes a problem that could cause an NFS or Disk Service that has a hyphen (-) in the service name to end up unassigned after a disk failure. A side effect of the problem was that the asedirector would hang after the disk failure was corrected.

Patch 61.00

TCR160-054B

Patch: Fixes problems with the clu_ivp script

State: Supersedes patches TCR160-009B (22.00), TCR160-021B (23.00), TCR160-022B (24.00), TCR160-031B (25.00), TCR160-036B (50.00), TCR160-047B (51.00)

This patch corrects the following:

  • Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

  • Corrects a problem with member add in a large environment.

  • Corrects a problem which causes asemgr to core dump when modifying a single drd service to add more than 200 devices.

  • Fixes a problem that caused aseagent or asehsm to core dump when starting NFS and Disk Services that contain several LSM volumes.

  • Fixes a problem with extraneous compiler warnings about strdup() function calls from ASE.

  • Fixes a problem that caused the asemgr utility to not run when called from a program that is owned by root and has the setuid bit turned on.

  • Fixes three problems with the clu_ivp script. The script now checks to be sure that the cluster members are listed in the /etc/hosts file, and it no longer copies /var/adm/messages to /tmp. Copying the messages file to /tmp could result in the file system becoming full, and clu_ivp exiting with an error. The clu_ivp script now also checks the /var/adm/messages file for shared buses if none are listed in the configuration file.
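
The /etc/hosts check described for clu_ivp can be illustrated with a short sketch. This is not the clu_ivp implementation, which is not shown in this summary; the function names, member names, and sample addresses below are hypothetical and exist only to show the idea of confirming that each cluster member appears as a host name or alias in hosts-file text.

```python
def parse_hosts(text):
    """Return the set of host names and aliases found in hosts-file text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the address
    return names


def missing_members(members, hosts_text):
    """Return the members (in order) that are not listed in hosts_text."""
    known = parse_hosts(hosts_text)
    return [m for m in members if m not in known]


# Hypothetical sample data; a real check would read /etc/hosts instead.
sample = """\
127.0.0.1   localhost
# cluster members
10.0.0.1    memberA.example.com memberA
10.0.0.2    memberB
"""

print(missing_members(["memberA", "memberB", "memberC"], sample))
```

A verification script built along these lines would report any names returned by missing_members() before continuing with its remaining checks.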

Patch 65.00

TCR160-063

Patch: Unable to remove LSM volumes from DRD service

State: Supersedes patch TCR160-003 (3.00)

This patch corrects the following:

  • Fixes a problem where DRD permissions could be lost if a service is modified more than once.

  • Fixes a problem that prevented the removal of LSM volumes from a DRD service. The problem occurred when there were multiple LSM diskgroups in the service and all of the volumes from one diskgroup were removed.

Patch 70.00

TCR160-056

Patch: TruCluster Production server hangs during boot

State: Supersedes patches TCR160-017 (11.00), TCR160-027 (19.00), TCR160-032 (26.00), TCR160-062 (68.00)

This patch corrects the following:

  • Fixes a problem where both nodes in a cluster will panic at the same time with a simple_lock timeout panic.

  • Fixes a kernel memory fault in rm_lock_update_retry().

  • Fixes a problem which can cause the following panic:

    panic (cpu 0): rm_update_single_lock_miss: time limit exceeded

  • Fixes a problem that could cause an error to be returned when the TruCluster software should wait until a global lock is freed.

  • Fixes a problem that could cause a TruCluster Production server member to hang during boot, and could cause a "simple lock time limit exceeded" panic.

Patch 72.00

TCR160-067

Patch: Error msg if system contained unsupported controllers

State: Existing

This patch fixes a problem that caused an error message to be printed if the system contained unsupported controllers. The error message will now only be printed when running the command in verbose mode.

Patch 74.00

TCR160-061

Patch: Access mode for a directory not set to default

State: Supersedes patches TCR160-045 (41.00), TCR160-048 (44.00), TCR160-049 (45.00)

This patch corrects the following:

  • Fixes a problem that caused the setting of the force unmount option to be incorrectly displayed by the asemgr utility.

  • Fixes a problem that caused shell errors if an invalid mount option was specified via the asemgr menu.

  • Fixes a problem that caused the device name for a UNIX File System (UFS) to not be displayed when modifying the force unmount option via the asemgr utility.

  • Fixes a problem that caused the access mode for a directory to not get set to the default after modifying it via asemgr.

Patch 76.00

TCR160-055

Patch: Problem causes mountd to exit without error

State: Existing

This patch fixes a problem that could cause mountd to exit without error during boot.

Patch 80.00

TCR160-070

Patch: Fixes problem with ASE_SNMPD_IGNORE_DISKS

State: Existing

This patch fixes a problem with the ASE_SNMPD_IGNORE_DISKS feature. After specifying a disk to ignore, the ASE service stop and add commands result in conflicting data. While the daemon.log reports apparent success ("hrm_dsk.c will ignore /dev/rzb10") the error log reports a failure that indicates that the device is NOT being ignored (CAM "unit reserved error").

Patch 88.00

TCR160-077

Patch: Fixes a problem that causes asedirector to core dump

State: Supersedes patches TCR160-018 (12.00), TCR160-002 (2.00), TCR160-009A (9.00), TCR160-016 (10.00), TCR160-007 (5.00), TCR160-021A (13.00), TCR160-024 (16.00), TCR160-025 (17.00), TCR160-022A (14.00), TCR160-033 (29.00), TCR160-035 (31.00), TCR160-042 (38.00), TCR160-043 (39.00), TCR160-051 (47.00), TCR160-031A (21.00), TCR160-053 (49.00), TCR160-036A (32.00), TCR160-047A (43.00), TCR160-028 (27.00), TCR160-052 (48.00), TCR160-065 (52.00), TCR160-066 (53.00), TCR160-058 (54.00), TCR160-060 (55.00), TCR160-054A (56.00), TCR160-057 (57.00), TCR160-059 (59.00), TCR160-071 (78.00), TCR160-078 (83.00), TCR160-079 (84.00), TCR160-075 (85.00), TCR160-076 (86.00)

This patch corrects the following:

  • Corrects a problem with Networker displaying garbage characters following service names. The problem occurs when the service name is 8 characters or longer.

  • Fixes two problems in the asedirector:

    • An ASE command timeout problem encountered by large ASE services.

    • An incorrect decision made by the asedirector as a result of a failed inquire services command.

  • Improves the startup performance of start scripts by reducing the number of system calls needed to start them.

  • Fixes a problem where the Host Status Monitor (asehsm) incorrectly reports a network down (HSM_NI_STATUS DOWN) if the counters for the network interface get zeroed.

  • Fixes an ASE problem where, under certain circumstances, the service scripts could cause the ASE agent to loop during a start or stop service.

  • Corrects a problem with member add in a large environment.

  • Corrects a problem in TruCluster Available Server or Production Server clusters in which services were started with an elevated priority and scheduling algorithm. Under significant load, this could lead to intermittent network and cluster problems.

  • Fixes a problem which caused a service not to start when there was a short network failure. This was seen only with long running stop scripts and special network configurations.

  • Corrects a problem which causes asemgr to core dump when modifying a single drd service to add more than 200 devices.

  • Fixes a problem that caused aseagent or asehsm to core dump when starting NFS and Disk Services that contain several LSM volumes.

  • Fixes a problem where the asemgr will hang as it continuously creates and kills multiple directors.

  • Corrects a problem that causes the ASE director to core dump during initialization.

  • Corrects a problem where modifying a service with a large number of DRDs will fail and a "could not malloc" message is seen in the daemon.log file.

  • Fixes a problem where the MEMBER_STATE variable is always shown as BOOTING instead of RUNNING. After TCR is first installed, scripts have no way to determine the MEMBER_STATE. This problem is cleared on a reboot.

  • Corrects a problem in which a network cable failure that is corrected within 7 seconds of the failure can leave the services in a bad state.

  • Fixes a problem that caused the asemgr to get a memory fault when adding multiple services in a row.

  • Fixes a problem with extraneous compiler warnings about strdup() function calls from ASE.

  • Fixes a problem that caused the asemgr utility to not run when called from a program that is owned by root and has the setuid bit turned on.

  • Fixes a problem that can cause the TruCluster MIB daemon (cnxmibd) to core dump in Available Server environments.

  • Fixes a problem which caused an error message to be logged for the cnxmibd even though no error had occurred.

  • Fixes two issues with clusters:

    • When a cluster is brought up with ASE off, other members report it as UP and RUNNING instead of UP and UNKNOWN.

    • When a restricted service is running on a member, and asemember stop or aseam stop is executed, the service status is still reported as the member name, instead of Unassigned.

  • Fixes a problem where timeout values of greater than 30 seconds in /etc/hsm.conf would cause the ASE agent to fail at start up.

  • Fixes a bug where the aseagent will occasionally core dump on a SCSI bus hang.

  • Fixes the following problems with the clu_ivp script:

    • The script now checks to be sure that the cluster members are listed in the /etc/hosts file.

    • The script no longer copies /var/adm/messages to /tmp. Copying the messages file to /tmp could result in the file system becoming full, and clu_ivp exiting with an error.

    • The script now also checks the /var/adm/messages file for shared buses if none are listed in the configuration file.

  • Fixes a problem that could cause the asedirector to core dump.

  • Fixes a problem that caused the asemgr to report that a disk, or mount point, was in multiple services when modifying a service name.

  • Fixes a problem that caused the ASE application to report an incorrect status while booting, after installation, or while re-initializing the database.

  • Fixes a TruCluster 1.6 problem in which the asedirector, aseagent, and aselogger would core dump when a member of the cluster was being port scanned.

  • Corrects a problem in which ASE may attempt to start a service twice on the same member. This may cause service interruption.

  • Fixes a problem that caused the asedirector to hang and consume 100% of the CPU time if asemgr processes were modifying services from more than one node in the cluster at the same time.

  • Fixes a problem that caused the Host Status Monitor (asehsm) to hang.

  • Fixes a problem that caused error messages to be logged by the Host Status Monitor. The messages should have been informational, rather than errors.

Patch 90.00

TCR160-080

Patch: Node crashes when holding an mc-api lock

State: Supersedes patches TCR160-029 (20.00), TCR160-050 (46.00), TCR160-064 (63.00)

This patch corrects the following:

  • Fixes a hang problem in a cluster when two nodes communicate using the mc-api and a third node, not involved in the calculation, is rebooted.

  • Fixes a problem that can cause a panic in mcs_wait_cluster_event() when using the Memory Channel API.

  • Fixes a problem with the Memory Channel API where, when a node crashes holding an mc-api lock, under certain circumstances the lock will not be released after the node crashes.

  • Fixes a problem in the Memory Channel API that can cause a system to hang.

Patch 92.00

TCR160-082

Patch: Routing info for ASE service not properly updated

State: New

This patch fixes a problem that could cause the routing information for an ASE service to not get properly updated when ASEROUTING is enabled, and a service relocates.

Patch 94.00

TCR160-072

Patch: LSM disk information not updated in ASE database

State: Supersedes patches TCR160-030 (28.00), TCR160-039 (35.00)

This patch corrects the following:

  • Fixes a problem that would cause an error from awk(1) when modifying an ASE service that contained a large number of LSM volumes. The error would prevent the service from being properly modified.

  • Fixes a problem where LSM disk information was not properly updated in the ASE database when volumes were removed from a disk service.

  • Fixes a problem with updating ASE services which involves deleting and adding AdvFS domains on LSM volumes.

Patch 97.00

TCR160-074

Patch: Processes may get referenced several times

State: Supersedes patches TCR160-008 (6.00), TCR160-023 (15.00), TCR160-044 (40.00), TCR160-046 (42.00), TCR160-073A (95.00)

This patch corrects the following:

  • Fixes a problem in which a cluster node can panic with the panic string "convert_lock: bad lock state".

  • Corrects a problem in which a failure in the session layer can cause DLM messages to become inconsistent, resulting in random DLM panics on the receiving member.

  • Fixes a problem that can cause a TruCluster member to panic during shutdown.

  • Fixes a bug where sometimes a certain shared sequence number will not be freed after use.

  • Fixes a problem where certain processes could get referenced several times.

  • Fixes an Oracle process hang if a node fails after receiving a rsbinfo message.

  • Fixes a DLM problem where two processes could take out the same lock.

Patch 99.00

TCR160-073B

Patch: clu_ivp script enhancements

State: Supersedes patch TCR160-054C (67.00)

This patch corrects the following:

  • Fixes three problems with the clu_ivp script:

    • The script now checks to be sure that the cluster members are listed in the /etc/hosts file.

    • The script no longer copies /var/adm/messages to /tmp. Copying the messages file to /tmp could result in the file system becoming full, and clu_ivp exiting with an error.

    • The script now checks the /var/adm/messages file for shared buses if none are listed in the configuration file.

  • Fixes an Oracle process hang if a node fails after receiving a rsbinfo message.

Patch 101.00

TCR160-081

Patch: clu_ivp does not recognize Emulex adapter

State: Supersedes patch TCR160-041 (37.00)

This patch corrects the following:

  • Fixes a problem where the Emulex Fibre Channel adapter was not recognized by clu_ivp.

  • Fixes a problem that could cause the clu_ivp script to loop forever if the network interface was not configured.

Patch 103.00

TCR160-083

Patch: Fix for boot failure on a cluster

State: Supersedes patches TCR160-034 (30.00), TCR160-068 (82.00)

This patch corrects the following:

  • Fixes a problem which caused a boot failure on a cluster with a large number of shared SCSI buses.

  • Reduces the occurrence of tmv2_notify_cbf error messages in the errlog on clustered systems.

  • Fixes a possible system hang during shutdown due to a process having an active lightweight wiring.

Patch 105.00

TCR160-084

Patch: Corrects a problem in memory channel

State: New

This patch corrects a problem in the memory channel that can cause communication to stop and can result in erroneous network partitions.