2    TruCluster Server Patches

This chapter provides information about the patches included in Patch Kit 4 for the TruCluster Server software.

This chapter is organized as follows:

  • Section 2.1 provides release notes for the TruCluster Server patches in this kit.

  • Section 2.2 summarizes the TruCluster Server patches.

Tru64 UNIX patch kits are cumulative. For this kit, this means that the patches and related documentation from patch kits 1 through 3 are included, along with patches that are new to this kit. To aid you in using this document, release notes that are new with this release are listed as (New) in the section head. The beginning of Section 2.2 provides a key for understanding the history of individual patches.

2.1    Release Notes

This section provides release notes that are specific to the TruCluster Server software patches in this kit.

2.1.1    Required Storage Space

For the storage space required to successfully install this patch kit, see Section 1.1.1, which describes the space needed for the operating system patches.

2.1.2    Updates for Rolling Upgrade Procedures

The following sections provide information on rolling upgrade procedures.

2.1.2.1    Unrecoverable Failure Procedure

The procedure to follow if you encounter unrecoverable failures while running dupatch during a rolling upgrade has changed. The new procedure calls for you to run the clu_upgrade -undo install command and then set the system baseline. The procedure is explained in the Patch Kit Installation Instructions as notes in Section 5.3 and Section 5.6.
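
For example, if dupatch fails during the install stage, recovery begins by undoing the install stage; the system baseline is then set through dupatch, as described in the installation instructions:

# clu_upgrade -undo install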

2.1.2.2    During Rolling Patch, Do Not Add or Delete OSF, TCR, IOS, or OSH Subsets

During a rolling upgrade, do not use the /usr/sbin/setld command to add or delete any of the following subsets:

  • Base Operating System subsets (names that begin with OSF)

  • TruCluster Server subsets (names that begin with TCR)

  • Worldwide Language Support subsets (names that begin with IOS)

  • New Hardware Delivery subsets (names that begin with OSH)

Adding or deleting these subsets during a roll creates inconsistencies in the tagged files.
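
If you are unsure which subsets these are, one way to check is to list the installed subsets by name prefix before starting the roll; for example:

# /usr/sbin/setld -i | egrep '^(OSF|TCR|IOS|OSH)'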

2.1.2.3    Undoing a Rolling Patch

When you undo the stages of a rolling upgrade, the stages must be undone in the correct order. However, the clu_upgrade command incorrectly allows a user undoing the stages of a rolling patch to run the clu_upgrade undo preinstall command before running the clu_upgrade undo install command.

The problem is that in the install stage, clu_upgrade cannot tell from the dupatch flag files whether the roll is going forward or backward. This ambiguity allows a user who is undoing a rolling patch to run the clu_upgrade undo preinstall command without first having run the clu_upgrade undo install command.

To avoid this problem when undoing the stages of a rolling patch, make sure to follow the documented procedure and undo the stages in order.
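
For example, to back out of a roll that has completed the install stage, run the undo commands in this order:

# clu_upgrade undo install
# clu_upgrade undo preinstall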

2.1.2.4    Ignore Message About Missing ladebug.cat File During Rolling Upgrade

When installing the patch kit during a rolling upgrade, you may see the following error and warning messages. You can ignore these messages and continue with the rolling upgrade.

Creating tagged files.
 
...............................................................................
.....
 
*** Error ***
The tar commands used to create tagged files in the '/usr' file system have
reported the following errors and warnings:
     tar: lib/nls/msg/en_US.88591/ladebug.cat : No such file or directory
.........................................................
 
*** Warning ***
The above errors were detected during the cluster upgrade. If you believe that
the errors are not critical to system operation, you can choose to continue.
If you are unsure, you should check the cluster upgrade log and refer
to clu_upgrade(8) before continuing with the upgrade.

2.1.2.5    clu_upgrade undo of Install Stage Can Result in Incorrect File Permissions

This note applies only when both of the following are true:

  • You are performing a rolling upgrade using installupdate, dupatch, or nhd_install.

  • You run the clu_upgrade undo install command to undo the install stage.

In this situation, incorrect file permissions can be set for files on the lead member. This can result in the failure of rsh, rlogin, and other commands that assume user IDs or identities by means of setuid.

The clu_upgrade undo install command must be run from a nonlead member that has access to the lead member's boot disk. After the command completes, follow these steps:

  1. Boot the lead member to single-user mode.

  2. Run the following script:

    #!/usr/bin/ksh -p
    #
    #    Script for restoring installed permissions
    #
    cd /
    # For each subset status file (OSF, TCR, IOS, or OSH), check whether the
    # subset is installed and, if so, have fverify restore the permissions
    # recorded in the matching inventory (.inv) file.
    for i in /usr/.smdb./@(OSF|TCR|IOS|OSH)*.sts
    do
      grep -q "_INSTALLED" $i 2>/dev/null && /usr/lbin/fverify -y <"${i%.sts}.inv"
    done
    
    

  3. Rerun installupdate, dupatch, or nhd_install, whichever is appropriate, and complete the rolling upgrade.

For information about rolling upgrades, see Chapter 7 of the Cluster Installation manual, installupdate(8), and clu_upgrade(8).

2.1.2.6    Missing Entry Messages Can Be Ignored During Rolling Patch

During the setup stage of a rolling patch, you might see a message like the following:

Creating tagged files.
............................................................................
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568

An Entry not found message will appear once for each member in the cluster. The number in the message corresponds to a PID.

You can safely ignore this Entry not found message.

2.1.2.7    Relocating AutoFS During a Rolling Upgrade on a Cluster

This note applies only to performing rolling upgrades on cluster systems that use AutoFS.

During a cluster rolling upgrade, each cluster member is singly halted and rebooted several times. The Patch Kit Installation Instructions direct you to manually relocate applications under the control of Cluster Application Availability (CAA) prior to halting a member on which CAA applications run.

Depending on the amount of NFS traffic, the manual relocation of AutoFS may sometimes fail. Failure is most likely to occur when NFS traffic is heavy. The following procedure avoids that problem.

At the start of the rolling upgrade procedure, use the caa_stat command to learn which member is running AutoFS. For example:

# caa_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss
dhcp           application    ONLINE    ONLINE    swiss
named          application    ONLINE    ONLINE    rye

To minimize your effort in the following procedure, it is desirable to perform the roll stage last on the member where AutoFS runs.

When it comes time to perform a manual relocation on a member where AutoFS is running, follow these steps:

  1. Stop AutoFS by entering the following command on the member where AutoFS runs:

    # /usr/sbin/caa_stop -f autofs
    

  2. Perform the manual relocation of other applications running on that member:

    # /usr/sbin/caa_relocate -s current_member -c target_member
    

After the member that had been running AutoFS has been halted as part of the rolling upgrade procedure, restart AutoFS on a member that is still up. (If this is the roll stage and the halted member is not the last member to be rolled, you can minimize your effort by restarting AutoFS on the member you plan to roll last.)

  1. On a member that is up, enter the following command to restart AutoFS. (The member where AutoFS is to run, target_member, must be up and running in multi-user mode.)

    # /usr/sbin/caa_start autofs -c target_member
    

  2. Continue with the rolling upgrade procedure.

2.1.3    When Taking a Cluster Member to Single-User Mode, First Halt the Member

To take a cluster member from multiuser mode to single-user mode, first halt the member and then boot it to single-user mode. For example:

# shutdown -h now
>>> boot -fl s

Halting and booting the system ensures that it provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode.

When the system reaches single-user mode, run the following commands:

# init s
# bcheckrc
# lmf reset

2.1.4    Additional Steps Required When Installing Patches Before Cluster Creation

This note applies only if you install a patch kit before creating a cluster; that is, if you do the following:

  1. Install the Tru64 UNIX base kit.

  2. Install the TruCluster Server kit.

  3. Install the Version 5.1A Patch Kit-0004 before running the clu_create command.

In this situation, you must then perform three additional steps:

  1. Run versw, the version switch command, to set the new version identifier:

    # /usr/sbin/versw -setnew
    

  2. Run versw to switch to the new version:

    # /usr/sbin/versw -switch
    

  3. Run the clu_create command to create your cluster:

    # /usr/sbin/clu_create
    

2.1.5    Problems with clu_upgrade switch Stage

If the clu_upgrade switch stage does not complete successfully, you may see a message like the following:

versw: No switch due to inconsistent versions

The problem can be due to one or more members running genvmunix, a generic kernel.

Use the clu_get_info -full command and note each member's version number, as reported in the line beginning:

Member base O/S version

If a member has a version number different from that of the other members, shut down the member and reboot it from vmunix, the custom kernel. If multiple members have different version numbers, reboot them one at a time from vmunix.
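
For example, the following command displays the version line for every member so that you can compare the version numbers:

# clu_get_info -full | grep 'Member base O/S version'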

2.1.6    Cluster Information for Tru64 UNIX Patch 1367.00

See Section 1.1.12.2 for version switch information related to Tru64 UNIX Patch 1367.00.

2.1.7    Change to gated Restriction — TruCluster Patch 210.00

The following information explains the relaxed Cluster Alias: gated restriction, delivered in TruCluster Patch 210.00.

Prior to this patch, we required that you use gated as the routing daemon for the correct operation of cluster alias routing because the cluster alias subsystem did not coexist gracefully with either the routed daemon or static routes. This patch provides an aliasd daemon that does not depend on having gated running in order to function correctly.

The following is a list of features supported by this patch:

By default, the cluster alias subsystem uses gated, customized configuration files (/etc/gated.conf.member<n>), and RIP to advertise host routes for alias addresses. You can disable this behavior by specifying the nogated option to cluamgr, either by running the cluamgr -r nogated command on a member or by setting CLUAMGR_ROUTE_ARGS="nogated" in that member's /etc/rc.config file. For example, the network configuration for a member could use routed, or gated with a site-customized /etc/gated.conf file, or static routing.
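
For example, to disable this behavior on a member, run the following command; to make the setting persist across reboots, you can use rcmgr to record it in that member's /etc/rc.config file:

# cluamgr -r nogated
# rcmgr set CLUAMGR_ROUTE_ARGS "nogated"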

For a cluster, there are three general routing configuration scenarios:

2.1.8    Information for TruCluster Patch 272.00

This section provides information for TruCluster Patch 272.00.

2.1.8.1    Enablers for EVM

This patch provides enablers for the Compaq SANworks™ Enterprise Volume Manager (EVM) Version 2.0.

2.1.8.2    Rolling Upgrade Version Switch

This patch uses the rolling upgrade version switch to ensure that all members of the cluster have installed the patch before it is enabled.

Prior to throwing the version switch, you can remove this patch by returning to the rolling upgrade install stage, rerunning dupatch, and selecting the Patch Deletion item in the Main Menu.

You can remove this patch after the version switch is thrown, but this requires a shutdown of the entire cluster.

To remove this patch after the version switch is thrown, use the following procedure:

Note

Use this procedure only under the following conditions:

  1. Run the /usr/sbin/evm_versw_undo command.

    When this command completes, it asks whether it should shut down the entire cluster now. The patch removal process is not complete until after the cluster has been shut down and restarted.

    If you do not shut down the cluster at this time, you will not be able to shut down and reboot an individual member until the entire cluster has been shut down.

  2. After cluster shutdown, boot the cluster to multiuser mode.

  3. Rerun the rolling upgrade procedure from the beginning (starting with the setup stage). When you rerun dupatch, select the Patch Deletion item in the Main Menu.

For more information about rolling upgrades and removing patches, see the Patch Kit Installation Instructions.

2.1.8.3    Restrictions Removed

The restriction against mounting multiple filesets from the cluster_root domain has been removed. It is now fully supported to have multiple filesets from the cluster_root domain mounted in a cluster; however, this can slow the failover of this domain in certain cases, so mount additional filesets only when necessary.

The restriction against mounting multiple filesets from a boot partition domain has been removed. It is now fully supported to have multiple filesets from a node's boot partition mounted in a cluster; however, when the CFS server node leaves the cluster, all filesets mounted from that node's boot partition domain are force-unmounted.

2.1.9    CAA and Datastore — TruCluster Patch 242.00

This section provides information about TruCluster Patch 242.00.

During a rolling upgrade, when the last member is rolled and immediately after the version switch is thrown, a script is run to put CAA on hold and copy the old datastore to the new datastore. CAA will connect to the new datastore when it is available.

The time required to do this depends on the amount of information in the datastore and the speed of each member machine. For 50 resources, we have found that the datastore conversion itself takes only a few seconds.

To undo this patch, the following command must be run:

/usr/sbin/cluster/caa_rollDatastore backward

You are prompted to guide the backward conversion process.

One step of this command will prompt you to kill the caad daemons on all members. A caad daemon may still appear to be running as an uninterruptible sleeping process (state U in the ps command) after issuing a kill -9 command. You can safely ignore this and continue with the conversion process as prompted, because caad will be killed when the process wakes up.
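
For example, you can list any remaining caad processes and their states (using BSD-style ps options) before continuing:

# ps aux | grep caad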

2.2    Summary of TruCluster Software Patches

This section provides capsule summaries of the patches in Patch Kit 4 for the TruCluster Server software products. Because Tru64 UNIX patch kits are cumulative, each patch lists its state according to the following criteria:

  • New: The patch is new in this patch kit.

  • Existing: The patch was delivered in an earlier patch kit. A kit designation in parentheses indicates the kit in which the patch first appeared.

  • Supersedes: The patch includes and replaces the listed patches from earlier kits.

Number: Patch 27.00

Abstract: Fix for clusterwide wall messages not being received

State: Existing

This patch allows the cluster wall daemon to restart following an EVM daemon failure.

Number: Patch 88.00

Abstract: Fix for cluster hang during boot

State: Supersedes Patch 29.00

This patch addresses a situation where the second node in a cluster hangs upon boot while setting the current time and date with ntpdate.

Number: Patch 121.00

Abstract: Using a cluster as a RIS server causes panic

State: Supersedes Patch 29.00

This patch:

  • Fixes a problem that causes a panic when using a cluster as a RIS server.

  • Provides a fix to RIS/DMS serving in a cluster.

Number: Patch 136.00

Abstract: Enhancement for clu_autofs shutdown script

State: Existing

This patch makes the /sbin/init.d/clu_autofs script more robust.

Number: Patch 181.00

Abstract: Fixes problems in the DLM subsystem

State: Supersedes Patches 39.00, 131.00, 178.00, 179.00

This patch:

  • Fixes a panic in DLM when another node in the cluster is halted.

  • Fixes a panic in the DLM deadlock detection code.

  • Fixes a problem where a process using the Distributed Lock Manager can take up to ten minutes to exit.

  • Fixes several DLM related crashes and performance issues.

  • Corrects a problem causing a cluster member panic.

  • Fixes a problem in which DLM did not always return the resource block information for a sublock even when the sublock was held.

Number: Patch 188.00

Abstract: Fixes cluster kernel problem that causes a hang

State: Supersedes Patches 70.00 and 186.00

This patch:

  • Fixes a panic in the kernel group services when another node is booted into the cluster.

  • Fixes a problem in the cluster kernel that causes the cluster to hang when a member is rebooted into the cluster.

  • Fixes a problem in the cluster kernel that causes one or more members to panic during a cluster shutdown.

Number: Patch 195.00

Abstract: Memory Channel API problem causes system hang

State: Existing (Patch Kit 3)

This patch fixes a problem in the Memory Channel API that can cause a system to hang.

Number: Patch 206.00

Abstract: Fixes kernel memory fault in rm_get_lock_master

State: Supersedes Patches 11.00, 62.00, 97.00, 145.00, 146.00, 148.00, 203.00, 204.00

This patch:

  • Fixes a situation in which one or several cluster members would panic if a Memory Channel cable was removed or faulty.

  • Fixes a problem that causes a clusterwide panic with the Memory Channel power off in a LAN interconnect cluster.

  • Allows a user to kill a LAN interconnect cluster via Memory Channel.

  • Supports Memory Channel usage in a LAN cluster.

  • Fixes a problem in which the master failover node goes offline during a failover, and a problem with failing over due to parity errors that increase beyond the limit.

  • Fixes a problem in which a bad Memory Channel cable causes a cluster member to panic with a panic string of "rm_eh_init" or "rm_eh_init_prail."

  • Provides changes that should make Memory Channel failovers work better and handle bad optical cables.

  • Fixes a problem in which a node booting into a cluster hangs during Memory Channel initialization.

  • Fixes a kernel memory fault in rm_get_lock_master.

  • Fixes a regression for single physical rail Memory Channel configurations.

  • Provides a fix to clean up stale data left on an offline physical rail by the Memory Channel driver.

Number: Patch 210.00

Abstract: aliasd now interprets NIFF parameters correctly

State: Supersedes Patches 6.00, 7.00, 9.00, 207.00, 208.00

This patch:

  • Fixes a problem in which a cluster member loses connectivity with clients on remote subnets.

  • Fixes a problem with aliasd not handling multiple virtual aliases in a subnet, as well as IP aliases.

  • Allows cluster members to route for an alias without joining it.

  • Fixes a problem with aliasd writing illegal configurations into gated.conf.memberX.

  • Fixes a problem with a default route not being restored after network connectivity issues.

  • Fixes a race condition between aliasd and gated.

  • Fixes a problem with a hang caused by an incorrect /etc/hosts entry.

  • Fixes aliasd_niff to allow EVM restart.

  • Provides enablers for Compaq Database Utility.

  • Allows the aliasd daemon to include interface aliases when determining whether an interface is appropriate for use as the ARP address for a cluster alias when selecting the proxy ARP master.

  • Fixes a problem in which, when multiple members booted simultaneously, aliasd could become deadlocked while trying to select the proxy ARP master for cluster aliases. As a result, some aliases could become unreachable because there would be no proxy ARP master.

  • Fixes a problem in which the aliasd daemon message "NIFF parameters for interface are too lax" was erroneously output due to the conversion of internal NIFF parameters from seconds to milliseconds. The aliasd daemon now interprets NIFF parameters correctly.

Number: Patch 212.00

Abstract: Corrects performance issues on starting cluster LSM

State: Supersedes Patch 150.00

This patch:

  • Eliminates spurious duplicate error messages when cluster root is under LSM control.

  • Corrects performance issues on starting the Cluster Logical Storage Manager with large configurations.

Number: Patch 242.00

Abstract: Fix for Oracle failure during start-up

State: Supersedes Patches 1.00, 2.00, 3.00, 5.00, 53.00, 54.00, 55.00, 56.00, 57.00, 58.00, 60.00, 66.00, 71.00, 72.00, 74.00, 84.00, 93.00, 95.00

This patch:

  • Increases parallelism in CAA event handling.

  • Fixes a problem with CAA in which, after the first resource is started, CAA cannot start or stop resources, the resource moves to the unknown state, and a core file is left behind by the action of starting and stopping resources.

  • Provides enablers for Compaq Database Utility.

  • Corrects a problem in which datastore may get corrupted due to improper datastore locking. This may occur when multiple CAA CLI commands are run in the background.

  • Corrects a problem in which the caa_profile command may complain of failure to create and log EVM events.

  • Corrects a problem in which the caa_profile -create command inserts extra attributes, such as REBALANCE, into the profile when used to create an application profile, causing the CAA GUI to fail to validate the profile.

  • Corrects a problem where the caa_stat command can crash, leaving a core file when it receives a SIGPIPE signal. The problem has been known to occur when caa_stat output is piped to a command such as head.

  • Fixes a problem that occurs when long resource or attribute names are used and the space is not reclaimed correctly when the resource is unregistered.

  • Fixes a caad memory leak caused by caa_stat -f.

  • Corrects a problem in which CAA fails to close a TDF after processing a corresponding resource profile. Over time this will lead to reaching the process limit for open file descriptors and will prevent CAA from functioning properly.

  • Changes the clu_mibs agent to cause it to retry the connection with the Event Manager daemon (evmd) indefinitely until it succeeds. The clu_mibs agent's start and stop control has been moved from /sbin/init.d/clu_max script to /sbin/init.d/snmpd script.

  • Resolves erroneous behavior of resources with dependencies upon other resources (required resources). This solves several problems with starting, stopping, and relocating a resource with dependencies when the resource's start or stop scripts fail, or when relocating during a shutdown.

  • Causes the old datastore to correctly migrate to the new datastore during the rolling upgrade and corrects the problem where no resource information was preserved.

  • Resolves the issue with the default CAA system services (dhcp, named, cluster_lockd, autofs) not running after the installation of the patch kit. In addition to the default CAA system services, any previously registered resources would be lost.

  • Prevents member hangs during boot in unusual circumstances that cause the CAA daemon to crash or exit during initialization.

  • Fixes three CAA problems triggered by heavy CAA activity conditions.

Number: Patch 244.00

Abstract: Security (SSRT2265)

State: Supersedes Patches 48.00, 138.00

This patch:

  • Provides a warning when an installed patch includes a version switch, which means the patch cannot be removed using the normal patch removal procedure. The warning allows the user to continue with the switch stage or exit clu_upgrade.

    In addition to the warning prior to the switch stage, this patch also provides additional user information after the user has decided to perform a patch rolling upgrade and has entered the pathname to a patch kit which contains one or more patches requiring a version switch. The additional user information identifies the patches containing the version switch and provides references to the appropriate user documentation.

  • Changes a variable to look at 500 files at a time (instead of the current 700) while making tag files, thereby fixing a problem that can occur during the setup stage of a rolling upgrade during tag file creation.

  • Fixes a potential security vulnerability that may result in a denial of service (DoS) on systems running TruCluster Server software.

Number: Patch 246.00

Abstract: Fixes lsm disks and cluster quorum tools problems

State: Supersedes Patches 41.00, 80.00, 173.00, 175.00

This patch:

  • Fixes a cluster installation problem of having an LSM disk and a disk media with the same name. Previously, the installation script would not let you install because it checked the disk name, not the disk media name.

  • Allows disks over 10 GB to be used as member or quorum disks.

  • Automates the running of versw, the version switch command, to resolve issues with version-switched patches and cluster installation.

  • Automatically enables IP filtering for the cluster interconnect on cluster installation and member addition.

  • Allows installation on unlabeled disks.

  • Allows the cluster installation to detect layered product kits in /var as well as /usr/var.

  • Corrects problems with LSM disks and the cluster quorum tools; specifically, when a member that has LSM disks local to it is down, the quorum tools fail to update quorum, causing other cluster commands to fail.

Number: Patch 252.00

Abstract: Fix for ICS panics

State: Supersedes Patches 37.00, 82.00, 132.00, 134.00, 182.00, 183.00, 185.00, 249.00, 250.00

This patch:

  • Closes a timing window that can cause Oracle 9i to hang when a remote node in the cluster goes down.

  • Fixes a problem in which panics could occur on process termination and in situations involving multiple Memory Channel adapters.

  • Makes the rdginit daemon program safe to execute multiple times on all cluster interconnect types.

  • Resolves a problem resulting in an incorrect error status being returned from RdgInit.

  • Makes the following changes to Reliable DataGram (RDG):

    • Changes RDG wiring behavior to match VM's fix to wiring GH chunks.

    • Fixes an RDG problem that can result in user processes hanging in an uninterruptible state.

    • Resolves an RDG panic in the RdgShutdown routine.

    • Fixes a problem in which an RDG kernel thread can starve other timeshare threads on a uniprocessor cluster member. In particular, system services such as networking threads can be affected.

  • Resolves a potential kernel memory fault when another node is powered off.

  • Resolves a potential user process hang under extreme stress conditions.

  • Fixes a kernel thread pre-emption problem that can result in panics due to the starvation of other kernel threads.

  • Fixes some misleading send/receive byte count statistics.

Number: Patch 254.00

Abstract: Security

State: Supersedes Patch 52.00

This patch:

  • Provides enablers for the Compaq Database Utility.

  • Fixes a potential security vulnerability where, under certain circumstances, system integrity may be compromised.

Number: Patch 256.00

Abstract: Fix for cluster hang

State: New

This patch enables a cluster to boot even if the cluster root domain devices are private to different cluster members. This is not a recommended configuration; however, it should not result in an unbootable cluster. Currently, this fix applies only to cluster root domains that are not under LSM control.

Number: Patch 259.00

Abstract: Fixes timing problem in the Connection Manager

State: Supersedes Patches 68.00, 257.00

This patch:

  • Fixes a problem where node reboots during a clusterwide shutdown would result in difficult-to-diagnose system panics.

  • Fixes connection manager problems that could result in panics.

  • Fixes a timing problem in the Connection Manager that could cause the panics "CNX MGR: COMMIT_TX: INVALID NODE STATE" or "CNX unaligned access".

Number: Patch 263.00

Abstract: Fix for cluster panic

State: Supersedes Patches 44.00, 46.00, 189.00, 190.00, 191.00, 193.00, 260.00, 261.00

This patch:

  • Fixes a situation where ICS is unable to make progress because heartbeat checking is blocked or the input thread is stalled. The symptom is a panic of a cluster member with the panic string ICS_UNABLE_TO_MAKE_PROGRESS: HEARTBEAT CHECKING BLOCKED/INPUT THREAD STALLED.

  • Fixes the problem of a cluster member failing to rejoin the cluster after Memory Channel failover.

  • Addresses a panic that occurs when higher priority threads running on a cluster member block the internode communication service Memory Channel transport (ics_ll_mct) subsystem's input thread from execution.

  • Fixes numerous panics and hangs in the way a cluster communicates with its nodes.

  • Fixes a problem with hangs and panics during boot.

  • Fixes a problem that causes a panic with the string "rcnx_status: different node".

  • Fixes a boot hang that displays the following string:

    ics_mct: Node arrival waiting for out of line node down cleanup to complete

  • Fixes a clusterwide hang during extensive Memory Channel traffic.

  • Addresses an assertion caused by a bad user pointer passed to the kernel via sys_call.

  • Addresses a panic that occurs while another member is going down.

Number: Patch 265.00

Abstract: Fix for cluster alias manager SUITlet

State: New

This patch fixes the problem in which the cluster alias manager SUITlet falsely interprets any cluster alias that has virtual={t|f} configured as a virtual alias, regardless of the actual setting.

Number: Patch 269.00

Abstract: A node may panic while under load

State: Supersedes Patches 50.00, 200.00, 267.00

This patch:

  • Fixes a situation where a cluster shutdown under load on a cluster using a LAN interconnect takes a very long time.

  • Prevents a panic with duplicate incoming connections on boot.

  • Provides a complete and better error message in the event of a misconfigured ICS/TCP adapter.

  • Fixes a condition where a node is not allowed to join the cluster after a panic.

  • Addresses a condition where a node may panic while under load.

  • Addresses a situation involving discarded UDP datagrams that do not come from the correct port.

Number: Patch 272.00

Abstract: Improves responsiveness of EINPROGRESS handling

State: Supersedes Patches 12.00, 13.00, 14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 25.00, 76.00, 92.00, 98.00, 99.00, 100.00, 101.00, 102.00, 103.00, 104.00, 105.00, 106.00, 107.00, 108.00, 109.00, 110.00, 111.00, 112.00, 113.00, 114.00, 116.00, 140.00, 142.00, 64.00, 86.00, 117.00, 119.00, 43.00, 151.00, 152.00, 153.00, 154.00, 155.00, 156.00, 157.00, 158.00, 159.00, 160.00, 161.00, 162.00, 163.00, 164.00, 165.00, 166.00, 167.00, 168.00, 169.00, 170.00, 172.00, 30.00, 31.00, 32.00, 33.00, 35.00, 78.00, 90.00, 122.00, 123.00, 124.00, 125.00, 126.00, 127.00, 129.00, 144.00, 196.00, 198.00, 202.00, 213.00, 214.00, 215.00, 216.00, 217.00, 218.00, 219.00, 220.00, 221.00, 222.00, 223.00, 224.00, 225.00, 226.00, 227.00, 228.00, 229.00, 230.00, 231.00, 232.00, 233.00, 234.00, 235.00, 236.00, 237.00, 238.00, 240.00, 270.00

This patch:

  • Makes AdvFS fileset quota enforcement work properly on a cluster.

  • Corrects a "cfsdb_assert" panic condition which can occur following the failure of a cluster node.

  • Corrects a problem that can cause cluster members to hang while waiting for the update daemon to flush /var/adm/pacct.

  • Prevents a potential hang that can occur on a CFS failover.

  • Allows POSIX semaphores/msg queues to operate properly on a CFS client.

  • Addresses a potential file corruption problem, which could cause erroneous data to be returned when reading a file at a CFS client node. There is also a small possibility that this problem could result in the CFS panic "AssertFailed: bp->b_dev."

  • Addresses two potential CFS panic conditions that might occur for a DMAPI/HSM managed file system. The panic strings are:

    • Assert Failed: (t)->cntk_mode <= 2

    • Assert Failed: get_recursion_count(current_thread(), &CMI_TO_REC_LOCK(mi)) == 1

  • Corrects a problem in which a panic could occur if multiple CFS client nodes leave the cluster while a CFS relocate or unmount is occurring.

  • Fixes a possible kernel memory fault panic that can occur when executing the command cfsmgr -a DEVICES on a file system with LSM volumes.

  • Corrects a CFS problem that could cause a panic with the panic string of "CFS_INFS full".

  • Fixes a possible CFS panic that might occur when a file is opened in Direct I/O mode at the same time it is being truncated by a separate process.

  • Provides enabler support for the Enterprise Volume Manager product.

  • Fixes a memory leak in cfscall_ioctl().

  • Provides support for the freezefs utility.

  • Fixes a data inconsistency that can occur when a CFS client reads a file that was recently written to and whose underlying AdvFS extent map contains more than 100 extents.

  • Fixes a panic that would occur during the mount of a clusterized file system on top of a non-clusterized file system.

  • Prevents a kernel memory fault panic during unmount in a cluster or during a planned relocation.

  • Fixes support for mounting other filesets from the cluster_root domain in a cluster.

  • Fixes the assertion failure ERROR != ECFS_TRYAGAIN.

  • Fixes a race condition during a cluster mount that results in a transient ENODEV seen by a name space lookup.

  • Fixes a problem in which a panic on boot could occur if a mount request is received from another node too early in the boot process.

  • Fixes a problem in which a PANIC: CFS_ADD_MOUNT() - DATABASE ENTRY PRESENT panic could occur when a node rejoins the cluster.

  • Fixes a race condition in cluster mount support that results in a transient mount failure and a second race that might result in a kernel memory fault panic during mount.

  • Fixes a cluster problem with hung unmounts (possibly seen as hung node shutdowns).

  • Fixes a problem in which a UBC panic could occur when accessing CFS file systems.

  • Prevents a possible Kernel Memory Fault panic on racing mount update/unmount/remount operations for the same mount point.

  • Fixes a possible race between node shutdown and unmount.

  • Prevents a possible Kernel Memory Fault panic on the mount update on a Memory File System (MFS) and other possible panics when bad arguments are passed to the mount library interface.

  • Prevents the panic "Assert failed: vp->v_numoutput > 0" or a system hang when a file system becomes full and direct asynchronous I/O via CFS is used. A vnode will exist that has a v_numoutput value greater than 0, and the thread hangs in vflushbuf_aged().

  • Prevents a possible Kernel Memory Fault in function ckidtokgs.

  • Fixes a potential CFS deadlock condition.

  • Corrects the problem of the cfsmgr error "Not enough space" when attempting to relocate a file system with a large number of disks.

  • Fixes a problem in which CFS client node file read failures could occur if the domain storage devices were closed during a previous failure to perform a failover mount on the client node.

  • Fixes support for mounting other filesets from a cluster node's boot partition domain.

  • Addresses a cluster problem that can arise in the case where a cluster is serving as an NFS server. The problem can result in stale data being cached at the nodes which are servicing NFS requests.

  • Fixes a CFS panic that might occur for a DMAPI/HSM managed file system: (panic): cfstok_hold_tok(): held token table overflow

  • Fixes a "cmn_err: CE_PANIC: ics_unable_to_make_progress: netisrs stalled" panic in clua.mod caused by waiting for malloc when memory is exhausted.

  • Fixes a panic in clua_cnx_unregister where a TP structure could not be allocated for a new TCP connection.

  • Fixes problems with cluster alias selection priority when adding a member to an alias.

  • Fixes a problem in which the cluster alias subsystem does not send a reply to a client that pings a cluster alias address with a packet size of less than 28 bytes.

  • Allows the cfsstat -i command to execute properly.

  • Fixes a potential Cluster File System deadlock that can occur during CFS failover processing following the failure of a CFS server node.

  • Prevents process hangs on clusters mounting NFS file systems and accessing plock-ed files on the NFS server.

  • Fixes a possible timing window whereby a booting node may panic due to memory corruption if another node dies.

  • Fixes a small window that can cause a clusterwide panic on node reboot in a quorum loss situation.

  • Fixes a problem in which a cluster member may panic with the panic string "kernel memory fault".

  • Fixes a possible boot hang that could occur if the cluster_root domain consists of LSM volumes whereby the underlying physical storage is nonshared.

  • Prevents a memory leak from occurring when using small, unaligned Direct I/O access (that is, access that is not aligned on a 512-byte boundary and does not cross a 512-byte boundary).

  • Prevents the cfsmgr command from displaying an erroneous server name when a request is made for statistics for an unmounted file system.

  • Fixes support for Synchronized I/O in clusters.

  • Eliminates erroneous EIO errors that could occur if a client node becomes a server during a rename/unlink/rmdir system call.

  • Corrects a CFS problem that could result in degraded performance when reading at file offsets past 2GB.

  • Corrects a cluster file locking problem that can arise when file systems are exported from the cluster to NFS client nodes.

  • Fixes a CFS problem where file access rights may not appear consistent clusterwide.

  • Fixes a race between cluster mounts and file system lookups.

  • Fixes a problem in which file system failover can deadlock.

  • Corrects a Cluster File System (CFS) performance issue seen when multiple threads or processes simultaneously access the same file on an SMP (more than one CPU) system.

  • Addresses a potential clusterwide hang which can occur in the Cluster File System.

  • Fixes a problem in which file permissions inherited from the default ACL may be different than expected under the following conditions:

    • ACLs are enabled on the system

    • There is a default ACL on a directory

    • A request is issued from a CFS client to create a file within that directory

  • Fixes a problem where cluster file system I/O and AdvFS domain access causes processes to hang.

  • Prevents an infinite loop during node shutdown when using server_only file systems.

  • Fixes a memory fault panic from clua_cnx_thread.

  • Fixes a problem in which an application that uses file locking may experience degraded performance.

  • Provides the I/O barrier code that prevents HSG80 controller crashes (firmware issue).

  • Fixes a situation in which a rebooting cluster member would panic shortly after rejoining the cluster if another cluster member was doing remote disk I/O to the rebooting member when it was rebooted.

  • Allows high density tape drives to use the high density compression setting in a cluster environment.

  • Fixes a kernel memory fault panic that can occur within a cluster member during failover while using shared served devices.

  • Fixes a clusterwide hang that occurs when a DRD node failover is stuck and unable to bid a new server for a served device.

  • Adds DRD barrier retries to work around HSx firmware problems.

  • Fixes a problem in which CAA applications using tape/changers as required resources will not come ONLINE (as seen by caa_stat).

  • Fixes a problem in which the tape changer is accessible only from the member that is the DRD server for the changer.

  • Fixes a problem where an open request to a disk in a cluster fails with an illegal errno (>=1024).

  • Fixes a problem where an open to a tape drive in a cluster would take six minutes (instead of two) to fail if there were no tape in the drive.

  • Corrects a problem in which a cluster would hang the next time a node was rebooted after a tape device was deleted from the cluster.

  • Fixes a domain panic in a cluster when a file system is mounted on a disk accessed remotely over the cluster interconnect.

  • Fixes a race condition that occurs when multiple unbarrierable disks fail at the same time.

  • Fixes a kernel memory fault in drd_open.

  • Prevents an infinite loop in drd_open().

  • Fixes several Device Request Dispatcher problems.

  • Provides the required mechanism to remove a rolling upgrade issue with CD-ROM and floppy disk device handling.

  • Fixes a problem in which I/O to a device can get stuck, or a cluster node may panic, after a device has been deleted.

  • Fixes a problem of excessive FIDS_LOCK contention that occurs when a large number of files are using system-based file locking.

  • Causes the immediate updating of the attributes on a directory when files are removed by a cluster node that is not the file system server.

  • Fixes a hang condition in Device Request Dispatcher (DRD) when accessing a failed disk.

  • Prevents a "simple_lock: time limit exceeded" panic or an "Assert Failed: brp->br_fs_svr_out" panic that can be seen while executing chfsets on a cluster.

  • Fixes problems in the cluster kernel where a cluster member hangs during cluster shutdown or while booting.

  • Fixes a problem in the cluster kernel where a cluster member panics when a tape device is accessed.

  • Fixes a token problem that could cause an unmount to hang.

  • Fixes a condition that causes the panic "CNX MGR: Invalid configuration for cluster seq disk" during simultaneous booting of cluster nodes.

  • Fixes a problem in which two nodes leaving the cluster within a short time period would cause I/O on some devices to get stuck.

  • Fixes a problem in which a new device would not be properly configured in a cluster if the device was discovered during a boot.

  • Causes the Device Request Dispatcher (DRD) to retry to get disk attributes when EINPROGRESS is returned from the disk driver.

  • Fixes an issue with ICS (Internode Communication Services) on a NUMA-based system in a cluster.

  • Fixes a possible race condition between a SCSI reservation conflict and an I/O drain that could result in a hang.

  • Adds support for multiple opens to tape libraries/media changers.

  • Alleviates a condition in which a cluster member takes an extremely long time to boot when using LSM.

  • Corrects reference-counting errors that may lead to a panic during cluster mount.

  • Relieves pressure on the CMS global DLM lock by allowing AutoFS auto-mounts to back off.

  • Addresses a potential panic in the Cluster File System that can occur when using raw Asynchronous I/O.

  • Addresses a potential panic in the Cluster File System that can occur when using file system quotas.

  • Fixes kernel memory faults associated with passing in invalid parameters to the mount system call.

  • Fixes the problem of a potential hang when multiple nodes are shutting down simultaneously and server-only file systems are mounted.

  • Fixes the problem of a potential system crash when adding a cluster alias.

  • Improves the responsiveness of EINPROGRESS handling during the issuing of I/O barriers. The fix removes a possible infinite loop scenario which could occur due to the deletion of a storage device.

  • Allows AutoFS auto-unmounts (as distinct from auto-mounts) to back off, thereby relieving pressure on the CMS global DLM lock.

  • Adds data validation checking pertaining to cluster messages involving tokens, to assist in problem isolation and diagnosis.

  • Corrects diagnostic code that might result in a panic during kernel boot.

  • Corrects a problem in which a bus reset causes the loss of quorum, resulting in a cluster hang.