2    TruCluster Server Patches

This chapter provides information about the patches included in Patch Kit 5 for the TruCluster Server software.

This chapter is organized as follows:

Tru64 UNIX patch kits are cumulative. For this kit, this means that the patches and related documentation from patch kits 1 through 4 are included, along with patches that are new to this kit. To aid you in using this document, release notes that are new with this release are listed as (new) in the section head. The beginning of Section 2.2 provides a key for understanding the history of individual patches.

2.1    Release Notes

This section provides release notes that are specific to the TruCluster Server software patches in this kit. References to patch numbers are for TruCluster Server patches unless otherwise indicated.

2.1.1    Required Storage Space

The following storage space is required to successfully install this patch kit:

See Section 1.1.1 for information on space needed for the operating system patches.

2.1.2    Removing Some Patches Can Cause Problems (new)

Removing the following patches from your cluster may cause problems:

Because some patches cannot be safely removed in a cluster without causing member-specific problems, we recommend that you do not remove patches under these conditions. Section 2.1.3 describes a problem of this type.

2.1.3    Patch Removal Causes Login Error (new)

If you removed Version 5.1A patches that were installed before a cluster was created or before new members were added, you may see an error similar to the following when you attempt to log in to a member:

Login Error: Compaq Tru64 UNIX V5.1A (Rev. 1885) (system.xyzcorp.net) console
login: 
INIT: Command is respawning too rapidly. Check for possible errors.
id:  esmd "/usr/sbin/esmd </dev/null >/dev/null 2>&1"

If you see this error, remove the following lines from that cluster member's /etc/inittab file:

  esm_init:23:wait:/sbin/init.d/esm init </dev/null >/dev/null 2>&1
  esmd:23:respawn:/usr/sbin/esmd </dev/null >/dev/null 2>&1
 
 
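If you prefer to script the cleanup, the following is a minimal sketch: it backs up the file, deletes the two entries, and signals init to reread the file. Verify the backup before relying on it.

# cp /etc/inittab /etc/inittab.save
# sed -e '/^esm_init:/d' -e '/^esmd:/d' /etc/inittab.save > /etc/inittab
# init q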

2.1.4    Updates for Rolling Upgrade Procedures

The following sections provide information on rolling upgrade procedures.

2.1.4.1    Problem When Undoing Roll with Worldwide Languages Installed (new)

If, on a system with Worldwide Languages installed, you complete a rolling upgrade of Patch Kit 5 and then run the clu_upgrade -undo install command, the tar program may report that it cannot find files it expects to find. This condition is caused by a file left in the /cluster/admin/tmp directory from the previous setup stage.

To correct this problem, take the following steps:

  1. Undo the setup stage again.

  2. Issue the following command:

    # rm -f /cluster/admin/tmp/*
    

  3. Redo the setup stage.
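
The following is a consolidated sketch of this recovery sequence; it assumes that member 1 is the lead member (see clu_upgrade(8) for the exact setup syntax):

# clu_upgrade undo setup
# rm -f /cluster/admin/tmp/*
# clu_upgrade setup 1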

2.1.4.2    Order for Rolling NHD6 and Patch Kit 5 (new)

Because Patch Kit 5 contains everything in the NHD6 kit, you do not need to roll the NHD6 kit in addition to Patch Kit 5. If, however, you plan to roll both kits, you must install the NHD kit first, followed by the patch kit. If you reverse the installation order, you will get a kernel build failure after installing the NHD6 kit.

2.1.4.3    Unrecoverable Failure Procedure

The procedure to follow if you encounter unrecoverable failures while running dupatch during a rolling upgrade has changed. The new procedure calls for you to run the clu_upgrade -undo install command and then set the system baseline. The procedure is explained in the Patch Kit Installation Instructions as notes in Section 5.3 and Section 5.6.

2.1.4.4    During Rolling Patch, Do Not Add or Delete OSF, TCR, IOS, or OSH Subsets

During a rolling upgrade, do not use the /usr/sbin/setld command to add or delete any subsets whose names begin with OSF, TCR, IOS, or OSH.

Adding or deleting these subsets during a roll creates inconsistencies in the tagged files.

2.1.4.5    Undoing a Rolling Patch

When you undo the stages of a rolling upgrade, the stages must be undone in the correct order. However, the clu_upgrade command incorrectly allows a user undoing the stages of a rolling patch to run the clu_upgrade undo preinstall command before running the clu_upgrade undo install command.

The problem is that in the install stage, clu_upgrade cannot tell from the dupatch flag files whether the roll is going forward or backward. This ambiguity allows a user who is undoing a rolling patch to run the clu_upgrade undo preinstall command without first having run the clu_upgrade undo install command.

To avoid this problem when undoing the stages of a rolling patch, make sure to follow the documented procedure and undo the stages in order.
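
A sketch of the correct undo order for a roll that has completed the install stage (see clu_upgrade(8)):

# clu_upgrade undo install
# clu_upgrade undo preinstall
# clu_upgrade undo setup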

2.1.4.6    Ignore Message About Missing ladebug.cat File During Rolling Upgrade

When installing the patch kit during a rolling upgrade, you may see the following error and warning messages. You can ignore these messages and continue with the rolling upgrade.

Creating tagged files.
 
...............................................................................
.....
 
*** Error ***
The tar commands used to create tagged files in the '/usr' file system have
reported the following errors and warnings:
     tar: lib/nls/msg/en_US.88591/ladebug.cat : No such file or directory
.........................................................
 
*** Warning ***
The above errors were detected during the cluster upgrade. If you believe that
the errors are not critical to system operation, you can choose to continue.
If you are unsure, you should check the cluster upgrade log and refer
to clu_upgrade(8) before continuing with the upgrade.

2.1.4.7    clu_upgrade undo of Install Stage Can Result in Incorrect File Permissions

This note applies only when both of the following are true:

In this situation, incorrect file permissions can be set for files on the lead member. This can result in the failure of rsh, rlogin, and other commands that assume user IDs or identities by means of setuid.

The clu_upgrade undo install command must be run from a nonlead member that has access to the lead member's boot disk. After the command completes, follow these steps:

  1. Boot the lead member to single-user mode.

  2. Run the following script:

    #!/usr/bin/ksh -p
    #
    #    Script for restoring installed permissions
    #
    cd /
    # For each installed OSF, TCR, IOS, or OSH subset, replay its inventory
    # with fverify to restore the recorded file permissions.
    for i in /usr/.smdb./@(OSF|TCR|IOS|OSH)*.sts
    do
      grep -q "_INSTALLED" $i 2>/dev/null && /usr/lbin/fverify -y <"${i%.sts}.inv"
    done
    

  3. Rerun installupdate, dupatch, or nhd_install, whichever is appropriate, and complete the rolling upgrade.

For information about rolling upgrades, see Chapter 7 of the Cluster Installation manual, installupdate(8), and clu_upgrade(8).

2.1.4.8    Missing Entry Messages Can Be Ignored During Rolling Patch

During the setup stage of a rolling patch, you might see a message like the following:

Creating tagged files.
............................................................................
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597530
 
clubase: Entry not found in /cluster/admin/tmp/stanza.stdin.597568

An Entry not found message will appear once for each member in the cluster. The number in the message corresponds to a PID.

You can safely ignore this Entry not found message.

2.1.4.9    Relocating AutoFS During a Rolling Upgrade on a Cluster

This note applies only to performing rolling upgrades on cluster systems that use AutoFS.

During a cluster rolling upgrade, each cluster member is singly halted and rebooted several times. The Patch Kit Installation Instructions direct you to manually relocate applications under the control of Cluster Application Availability (CAA) prior to halting a member on which CAA applications run.

Depending on the amount of NFS traffic, the manual relocation of AutoFS may sometimes fail. Failure is most likely to occur when NFS traffic is heavy. The following procedure avoids that problem.

At the start of the rolling upgrade procedure, use the caa_stat command to learn which member is running AutoFS. For example:

# caa_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
autofs         application    ONLINE    ONLINE    rye
cluster_lockd  application    ONLINE    ONLINE    rye
clustercron    application    ONLINE    ONLINE    swiss
dhcp           application    ONLINE    ONLINE    swiss
named          application    ONLINE    ONLINE    rye

To minimize your effort in the procedure that follows, it is desirable to perform the roll stage last on the member where AutoFS runs.

When it comes time to perform a manual relocation on a member where AutoFS is running, follow these steps:

  1. Stop AutoFS by entering the following command on the member where AutoFS runs:

    # /usr/sbin/caa_stop -f autofs
    

  2. Perform the manual relocation of other applications running on that member:

    # /usr/sbin/caa_relocate -s current_member -c target_member
    

After the member that had been running AutoFS has been halted as part of the rolling upgrade procedure, restart AutoFS on a member that is still up. (If this is the roll stage and the halted member is not the last member to be rolled, you can minimize your effort by restarting AutoFS on the member you plan to roll last.)

  1. On a member that is up, enter the following command to restart AutoFS. (The member where AutoFS is to run, target_member, must be up and running in multi-user mode.)

    # /usr/sbin/caa_start autofs -c target_member
    

  2. Continue with the rolling upgrade procedure.

2.1.5    When Taking a Cluster Member to Single-User Mode, First Halt the Member

To take a cluster member from multiuser mode to single-user mode, first halt the member and then boot it to single-user mode. For example:

# shutdown -h now
>>> boot -fl s

Halting and booting the system ensures that it provides the minimal set of services to the cluster and that the running cluster has a minimal reliance on the member running in single-user mode.

When the system reaches single-user mode, run the following commands:

# init s
# bcheckrc
# lmf reset

2.1.6    Additional Steps Required When Installing Patches Before Cluster Creation

This note applies only if you install a patch kit before creating a cluster; that is, if you do the following:

  1. Install the Tru64 UNIX base kit.

  2. Install the TruCluster Server kit.

  3. Install the Version 5.1A Patch Kit-0005 before running the clu_create command.

In this situation, you must then perform three additional steps:

  1. Run versw, the version switch command, to set the new version identifier:

    # /usr/sbin/versw -setnew
    

  2. Run versw to switch to the new version:

    # /usr/sbin/versw -switch
    

  3. Run the clu_create command to create your cluster:

    # /usr/sbin/clu_create
    

2.1.7    Problems with clu_upgrade switch Stage

If the clu_upgrade switch stage does not complete successfully, you may see a message like the following:

versw: No switch due to inconsistent versions

The problem can be due to one or more members running genvmunix, a generic kernel.

Use the command clu_get_info -full and note each member's version number, as reported in the line beginning

Member base O/S version
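
For example, the following command filters the output so that you can compare the members' version numbers at a glance:

# clu_get_info -full | grep 'Member base O/S version'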

If a member has a version number different from that of the other members, shut down the member and reboot it from vmunix, the custom kernel. If multiple members have different version numbers, reboot them one at a time from vmunix.

2.1.8    Cluster Information for Tru64 UNIX Patch 1830.00

See Section 1.1.15.7 for version switch information related to Tru64 UNIX Patch 1830.00.

2.1.9    Change to gated Restriction — Patch 210.00

The following information explains the relaxed Cluster Alias: gated restriction, delivered in TruCluster Patch 210.00.

Prior to this patch, we required that you use gated as a routing daemon for the correct operation of cluster alias routing because the cluster alias subsystem did not coexist gracefully with either routed or static routes. This patch provides an aliasd daemon that does not depend on having gated running in order to function correctly.

The following is a list of features supported by this patch:

By default, the cluster alias subsystem uses gated, customized configuration files (/etc/gated.conf.member<n>), and RIP to advertise host routes for alias addresses. You can disable this behavior by specifying the nogated option to cluamgr, either by running the cluamgr -r nogated command on a member or by setting CLUAMGR_ROUTE_ARGS="nogated" in that member's /etc/rc.config file. For example, the network configuration for a member could use routed, or gated with a site-customized /etc/gated.conf file, or static routing.
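
For example, the following commands disable the default gated behavior on a member, first for the running system and then persistently. This is a sketch; rcmgr(8) records the setting in that member's /etc/rc.config file.

# cluamgr -r nogated
# rcmgr set CLUAMGR_ROUTE_ARGS "nogated"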

For a cluster, there are three general routing configuration scenarios:

2.1.10    Version Switch Warning Added — Patch 306.00

TruCluster Server Patch 306.00 provides a warning that informs you that installed patches include a version switch, and that those patches cannot be removed using the normal patch removal procedure. The warning allows you to continue with the switch stage or exit clu_upgrade.

In addition to the warning prior to the switch stage, this patch also provides additional user information after the user has decided to perform a patch rolling upgrade and has entered the pathname to a patch kit which contains one or more patches requiring a version switch.

The additional user information identifies the patches containing the version switch and provides references to the appropriate user documentation.

2.1.11    Information for Patch 328.00

This section provides information for TruCluster Server Patch 328.00.

2.1.11.1    Enablers for EVM

This patch provides enablers for the Compaq SANworks™ Enterprise Volume Manager (EVM) Version 2.0.

2.1.11.2    Rolling Upgrade Version Switch

This patch uses the rolling upgrade version switch to ensure that all members of the cluster have installed the patch before it is enabled.

Prior to throwing the version switch, you can remove this patch by returning to the rolling upgrade install stage, rerunning dupatch, and selecting the Patch Deletion item in the Main Menu.

You can remove this patch after the version switch is thrown, but this requires a shutdown of the entire cluster.

To remove this patch after the version switch is thrown, use the following procedure:

Note

Use this procedure only under the following conditions:

  1. Run the /usr/sbin/evm_versw_undo command.

    When this command completes, it asks whether it should shut down the entire cluster now. The patch removal process is not complete until after the cluster has been shut down and restarted.

    If you do not shut down the cluster at this time, you will not be able to shut down and reboot an individual member until the entire cluster has been shut down.

  2. After cluster shutdown, boot the cluster to multiuser mode.

  3. Rerun the rolling upgrade procedure from the beginning (starting with the setup stage). When you rerun dupatch, select the Patch Deletion item in the Main Menu.

For more information about rolling upgrades and removing patches, see the Patch Kit Installation Instructions.

2.1.11.3    Restrictions Removed

The restriction of not supporting multiple filesets from the cluster_root domain has been removed. Mounting multiple filesets from the cluster_root domain in a cluster is now fully supported; however, doing so can slow down failover of the domain in certain cases and should be done only when necessary.

The restriction of not supporting multiple filesets from a boot partition domain has been removed. Mounting multiple filesets from a node's boot partition domain in a cluster is now fully supported; however, when the CFS server node leaves the cluster, all filesets mounted from that node's boot partition domain are force-unmounted.

2.1.12    CAA and Datastore — Patch 304.00

This section provides information about TruCluster Server Patch 304.00.

During a rolling upgrade, when the last member is rolled and immediately after the version switch is thrown, a script is run to put CAA on hold and copy the old datastore to the new datastore. CAA will connect to the new datastore when it is available.

The time required to do this depends on the amount of information in the datastore and the speed of each member machine. For 50 resources, we have found that the datastore conversion itself takes only a few seconds.

To undo this patch, run the following command:

# /usr/sbin/cluster/caa_rollDatastore backward

You are prompted to guide the backward conversion process.

One step of this command will prompt you to kill the caad daemons on all members. A caad daemon may still appear to be running as an uninterruptible sleeping process (state U in the ps command) after issuing a kill -9 command. You can safely ignore this and continue with the conversion process as prompted, because caad will be killed when the process wakes up.
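
For example, when the conversion prompts you to kill the caad daemons, you can locate them on each member as follows. This is a sketch; caad_pid stands for whatever process ID ps reports.

# ps ax | grep caad
# kill -9 caad_pid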

2.2    Summary of TruCluster Software Patches

This section provides brief descriptions of the patches in Patch Kit 5 for the TruCluster Server software products. Because Tru64 UNIX patch kits are cumulative, each patch lists its state according to the following criteria:


Number: Patch 27.00

Abstract: Fix for clusterwide wall messages not being received

State: Existing

  • Allows the cluster wall daemon to restart following an EVM daemon failure.

Number: Patch 88.00

Abstract: Fix for cluster hang during boot

State: Supersedes Patch 29.00

  • Addresses a situation where the second node in a cluster hangs upon boot while setting the current time and date with ntpdate.

Number: Patch 121.00

Abstract: Using a cluster as a RIS server causes panic

State: Supersedes Patch 29.00

  • Fixes a problem that causes a panic when using a cluster as a RIS server.

  • Provides a fix to RIS/DMS serving in a cluster.

Number: Patch 136.00

Abstract: Enhancement for clu_autofs shutdown script

State: Existing

  • Makes the /sbin/init.d/clu_autofs script more robust.

Number: Patch 181.00

Abstract: Fixes problems in the DLM subsystem

State: Supersedes Patches 39.00, 131.00, 178.00, 179.00

  • Fixes a panic in DLM when another node in the cluster is halted.

  • Fixes a panic in the DLM deadlock detection code.

  • Fixes a problem where a process using the Distributed Lock Manager can take up to ten minutes to exit.

  • Fixes several DLM related crashes and performance issues.

  • Corrects a problem causing a cluster member panic.

  • DLM was not always returning the resource block information for the sublock even if the sublock was held.

Number: Patch 188.00

Abstract: Fixes cluster kernel problem that causes a hang

State: Supersedes patches 70.00, 186.00

  • Fixes a panic in the kernel group services when another node is booted into the cluster.

  • Fixes a problem in the cluster kernel that causes the cluster to hang when a member is rebooted into the cluster.

  • Fixes a problem in the cluster kernel that causes one or more members to panic during a cluster shutdown.

Number: Patch 195.00

Abstract: Memory Channel API problem causes system hang

State: Existing (Patch Kit 3)

  • Fixes a problem in the Memory Channel API that can cause a system to hang.

Number: Patch 210.00

Abstract: aliasd now interprets NIFF parameters correctly

State: Supersedes Patches 6.00, 7.00, 9.00, 207.00, 208.00

  • Fixes a problem in which a cluster member loses connectivity with clients on remote subnets.

  • Fixes a problem with aliasd not handling multiple virtual aliases in a subnet and IP aliases.

  • Allows cluster members to route for an alias without joining.

  • Fixes a problem with aliasd writing illegal configurations into gated.conf.memberX.

  • Fixes a problem with a default route not being restored after network connectivity issues.

  • Fixes a race condition between aliasd and gated.

  • Fixes a problem with a hang caused by an incorrect /etc/hosts entry.

  • Fixes aliasd_niff to allow EVM restart.

  • Provides enablers for Compaq Database Utility.

  • Allows aliasd daemon to include interface aliases when determining whether or not an interface is appropriate for use as the ARP address for a cluster alias when selecting the proxy ARP master.

  • Fixes a problem in which, when multiple members booted simultaneously, aliasd could become deadlocked while trying to select the proxy ARP master for cluster aliases. As a result, some aliases could become unreachable because there would be no proxy ARP master.

  • Fixes a problem in which the aliasd daemon message "NIFF parameters for interface are too lax" was erroneously output due to the conversion of internal NIFF parameters from seconds to milliseconds. The aliasd daemon now interprets NIFF parameters correctly.

Number: Patch 212.00

Abstract: Corrects performance issues on starting cluster LSM

State: Supersedes Patch 150.00

  • Eliminates spurious duplicate error messages when cluster root is under LSM control.

  • Corrects performance issues on starting of Cluster Logical Storage Manager with large configurations.

Number: Patch 246.00

Abstract: Fixes lsm disks and cluster quorum tools problems

State: Supersedes Patches 41.00, 80.00, 173.00, 175.00

  • Fixes a cluster installation problem of having an LSM disk and a disk media with the same name. Normally, the installation script would not let you install because it was looking at the disk name, not the disk media name.

  • Allows disks over 10 GB to be used as member or quorum disks.

  • Automates the running of versw, the version switch command, to resolve issues with version-switched patches and cluster installation.

  • Automatically enables IP filtering for the cluster interconnect on cluster installation and member addition.

  • Allows installation on unlabeled disks.

  • Allows the cluster installation to detect layered product kits in /var as well as /usr/var.

  • Corrects problems with LSM disks and the cluster quorum tools; specifically, when a member with LSM disks local to it is down, the quorum tools fail to update quorum, causing other cluster commands to fail.

Number: Patch 252.00

Abstract: Fix for ICS panics

State: Supersedes Patches 37.00, 82.00, 132.00, 134.00, 182.00, 183.00, 185.00, 249.00, 250.00

  • Closes a timing window that can cause Oracle 9i to hang when a remote node in the cluster goes down.

  • Fixes a problem in which panics could occur on process termination and in situations involving multiple memory channel adapters.

  • Makes the rdginit daemon program safe to execute multiple times on all cluster interconnect types.

  • Resolves a problem resulting in an incorrect error status being returned from RdgInit.

  • Makes the following changes to Reliable DataGram (RDG):

    • Changes RDG wiring behavior to match VM's fix to wiring GH chunks.

    • Fixes an RDG problem that can result in user processes hanging in an uninterruptible state.

    • Resolves an RDG panic in the RdgShutdown routine.

    • Fixes a problem in which an RDG kernel thread can starve other timeshare threads on a uniprocessor cluster member. In particular, system services such as networking threads can be affected.

  • Resolves a potential kernel memory fault when another node is powered off.

  • Resolves a potential user process hang under extreme stress conditions.

  • Fixes a kernel thread pre-emption problem that can result in panics due to the starvation of other kernel threads.

  • Fixes some misleading send/receive byte count statistics.

Number: Patch 254.00

Abstract: Security

State: Supersedes Patch 52.00

  • Provides enablers for the Compaq Database Utility.

  • Corrects a potential security vulnerability where, under certain circumstances, system integrity may be compromised.

Number: Patch 256.00

Abstract: Fix for cluster hang

State: New (Kit 4)

  • Enables a cluster to boot even if the cluster root domain devices are private to different cluster members. This is not a recommended configuration; however, it should not result in an unbootable cluster. Currently, this applies only to cluster root domains that are not under LSM control.

Number: Patch 259.00

Abstract: Fixes timing problem in the Connection Manager

State: Supersedes Patches 68.00, 257.00

  • Fixes a problem where node reboots during a clusterwide shutdown would result in difficult-to-diagnose system panics.

  • Fixes Connection Manager problems that could result in panics.

  • Fixes a timing problem in the Connection Manager that could cause the panics "CNX MGR: COMMIT_TX: INVALID NODE STATE" or "CNX unaligned access."

Number: Patch 265.00

Abstract: Fix for cluster alias manager SUITlet

State: New (Kit 4)

  • Fixes the problem in which the cluster alias manager SUITlet falsely interprets any cluster alias with virtual={t|f} configured as a virtual alias regardless of its actual setting.

Number: Patch 277.00

Abstract: Fixes kernel memory fault in rm_get_lock_master

State: Supersedes Patches 11.00, 62.00, 97.00, 145.00, 146.00, 148.00, 203.00, 204.00, 206.00, 273.00, 274.00, 275.00

  • Fixes a situation in which one or several cluster members would panic if a Memory Channel cable was removed or faulty.

  • Fixes a problem that causes a clusterwide panic with the Memory Channel power off in a LAN interconnect cluster.

  • Allows a user to kill a LAN interconnect cluster via Memory Channel.

  • Supports Memory Channel usage in a LAN cluster.

  • Fixes a problem where the master failover node goes offline during a failover, and a problem with failovers caused by parity errors increasing beyond the limit.

  • Fixes a problem in which a bad Memory Channel cable causes a cluster member to panic with a panic string of "rm_eh_init" or "rm_eh_init_prail."

  • Provides changes that should make Memory Channel failovers work better and handle bad optical cables.

  • Fixes a problem in which a node booting into a cluster hangs during Memory Channel initialization.

  • Fixes a kernel memory fault in rm_get_lock_master.

  • Fixes a regression for single physical rail Memory Channel configurations.

  • Provides a fix to clean up stale data left on an offline physical rail by the Memory Channel driver.

  • Facilitates kernel debugging.

  • Corrects a condition that can cause superfluous "rm_event, index too big" messages to appear on a system console.

  • Corrects a problem in a Memory Channel cluster in which rebooting a node without performing a hardware reset can crash other members with an RM_AUDIT_ACK_BLOCK panic.

  • Fixes issues associated with the initialization of the RM driver.

Number: Patch 304.00

Abstract: Fix for Oracle failure during start-up

State: Supersedes Patches 1.00, 2.00, 3.00, 5.00, 53.00, 54.00, 55.00, 56.00, 57.00, 58.00, 60.00, 66.00, 71.00, 72.00, 74.00, 84.00, 93.00, 95.00, 242.00, 301.00, 302.00

  • Increases parallelism in CAA event handling.

  • Fixes a problem with CAA in which, after the first resource is started, CAA cannot start or stop resources, the resource moves to the unknown state, and a core file is left behind by the action of starting and stopping resources.

  • Provides enablers for Compaq Database Utility.

  • Corrects a problem in which datastore may get corrupted due to improper datastore locking. This may occur when multiple CAA CLI commands are run in the background.

  • Corrects a problem in which the caa_profile command may complain of failure to create and log EVM events.

  • Corrects a problem in which the caa_profile -create command inserts extra attributes, such as REBALANCE, into the profile when used to create an application profile. This causes the CAA GUI to fail to validate the profile.

  • Corrects a problem where the caa_stat command can crash, leaving a core file when it receives a SIGPIPE signal. The problem has been known to occur when caa_stat output is piped to a command such as head.

  • Fixes a problem that occurs when long resource or attribute names are used and the space is not reclaimed correctly when the resource is unregistered.

  • Fixes a caad memory leak caused by caa_stat -f.

  • Corrects a problem in which CAA fails to close a TDF after processing a corresponding resource profile. Over time this will lead to reaching the process limit for open file descriptors and will prevent CAA from functioning properly.

  • Changes the clu_mibs agent to cause it to retry the connection with the Event Manager daemon (evmd) indefinitely until it succeeds. The clu_mibs agent's start and stop control has been moved from /sbin/init.d/clu_max script to /sbin/init.d/snmpd script.

  • Resolves erroneous behavior of resources with dependencies upon other resources (required resources). This solves several problems with starting, stopping, and relocating a resource with dependencies when the resource's start or stop scripts fail, or when relocating during a shutdown.

  • Causes the old datastore to correctly migrate to the new datastore during the rolling upgrade and corrects the problem where no resource information was preserved.

  • Resolves the issue with the default CAA system services (dhcp, named, cluster_lockd, autofs) not running after the installation of the patch kit. In addition to the default CAA system services, any previously registered resources would be lost.

  • Prevents member hangs during boot in unusual circumstances that cause the CAA daemon to crash or exit during initialization.

  • Fixes three CAA problems triggered by heavy CAA activity conditions.

  • Fixes a problem in one of the shipped rc scripts whereby Oracle fails during start-up on a clustered system.

  • Fixes the problem that causes an application resource to not go offline when the last dependent network resource goes offline.

  • Fixes a problem where caad might dump core due to a race condition when multiple events to which it subscribes arrive simultaneously.

  • Fixes a problem that could cause the target member to crash during service startup.

Number: Patch 306.00

Abstract: Security (SSRT2265)

State: Supersedes Patches 48.00, 138.00, 244.00

  • Provides a warning to users who have installed a patch kit that includes a patch that requires a version switch. See Section 2.1.10.

  • Addresses a problem seen during the setup stage of a rolling upgrade, during tag file creation. This patch changes a variable so that tag file creation examines only 500 files at a time instead of the current 700.

  • Corrects a potential security vulnerability in the cluster interconnect security configuration that may result in a denial of service.

  • Provides clu_upgrade enhancements.

Number: Patch 308.00

Abstract: Corrects various problems with CAA commands

State: New

  • Fixes a problem in which some CAA commands, especially caa_profile, in rare scenarios might not function correctly.

Number: Patch 310.00

Abstract: Fixes kernel EVM threads not properly preempting

State: New

  • Fixes the potential of multiple assert_wait and timeout panics due to kernel EVM threads not properly preempting.

Number: Patch 312.00

Abstract: Fix for cluster panic

State: Supersedes Patches 44.00, 46.00, 189.00, 190.00, 191.00, 193.00, 260.00, 261.00, 263.00

  • Fixes a situation where ICS is unable to make progress because heartbeat checking is blocked or the input thread is stalled. The symptom is a panic of a cluster member with the panic string ICS_UNABLE_TO_MAKE_PROGRESS: HEARTBEAT CHECKING BLOCKED/INPUT THREAD STALLED.

  • Fixes the problem of a cluster member failing to rejoin the cluster after Memory Channel failover.

  • Addresses a panic that occurs when higher priority threads running on a cluster member block the internode communication service Memory Channel transport (ics_ll_mct) subsystem's input thread from execution.

  • Fixes numerous panics and hangs with the way a cluster communicates with its nodes.

  • Fixes a problem with hangs and panics during boot.

  • Fixes a problem that causes a panic with the string "rcnx_status: different node."

  • Fixes a boot hang that displays the string:

    ics_mct: Node arrival waiting for out of line node down cleanup to complete

  • Fixes a clusterwide hang during extensive Memory Channel traffic.

  • Addresses an assertion caused by a bad user pointer passed to the kernel via sys_call.

  • Addresses a panic that occurs while another member was going down.

  • Corrects an ICS (cluster interconnect) handle memory leak.

Number: Patch 314.00

Abstract: Corrects LSM partition types in CNX partition

State: New

  • Corrects the LSM partition types in the CNX partition of boot disk for the clu_partmgr utility.

Number: Patch 316.00

Abstract: Security (SSRT2394)

State: Supersedes Patches 50.00, 200.00, 267.00, 269.00

  • Fixes a situation where a cluster shutdown under load on a cluster using a LAN interconnect takes a very long time.

  • Prevents a panic with duplicate incoming connections on boot.

  • Provides a complete and better error message in the event of a misconfigured ICS/TCP adapter.

  • Fixes a condition where a node is not allowed to join the cluster after a panic.

  • Addresses a condition where a node may panic while under load.

  • Corrects a potential security vulnerability that may result in denial of service. This potential security vulnerability may be in the form of local and remote security domain risks.

  • Corrects a problem in which setting the sysconfig inet subsystem value for the tcp_keepcnt attribute to a value less than 2 causes the member to panic on boot with the following panic string:

    NetRAIN configured. 
    panic (cpu 0): trap: illegal instruction 
    DUMP: Warning: no disk available for dump.
    

Number: Patch 320.00

Abstract: File names with dollar sign cause upgrade undo problems

State: New

  • Fixes rolling upgrade undo problems with file names that contain a dollar sign ($).

Number: Patch 328.00

Abstract: Improves responsiveness of EINPROGRESS handling

State: Supersedes Patches 12.00, 13.00, 14.00, 15.00, 16.00, 17.00, 18.00, 19.00, 20.00, 21.00, 22.00, 23.00, 25.00, 76.00, 92.00, 98.00, 99.00, 100.00, 101.00, 102.00, 103.00, 104.00, 105.00, 106.00, 107.00, 108.00, 109.00, 110.00, 111.00, 112.00, 113.00, 114.00, 116.00, 140.00, 142.00, 64.00, 86.00, 117.00, 119.00, 43.00, 151.00, 152.00, 153.00, 154.00, 155.00, 156.00, 157.00, 158.00, 159.00, 160.00, 161.00, 162.00, 163.00, 164.00, 165.00, 166.00, 167.00, 168.00, 169.00, 170.00, 172.00, 30.00, 31.00, 32.00, 33.00, 35.00, 78.00, 90.00, 122.00, 123.00, 124.00, 125.00, 126.00, 127.00, 129.00, 144.00, 196.00, 198.00, 202.00, 213.00, 214.00, 215.00, 216.00, 217.00, 218.00, 219.00, 220.00, 221.00, 222.00, 223.00, 224.00, 225.00, 226.00, 227.00, 228.00, 229.00, 230.00, 231.00, 232.00, 233.00, 234.00, 235.00, 236.00, 237.00, 238.00, 240.00, 270.00, 272.00, 278.00, 279.00, 280.00, 281.00, 282.00, 283.00, 284.00, 285.00, 286.00, 287.00, 288.00, 289.00, 290.00, 291.00, 292.00, 293.00, 294.00, 295.00, 296.00, 297.00, 298.00, 300.00, 318.00, 321.00, 322.00, 324.00, 325.00, 326.00

  • Makes AdvFS fileset quota enforcement work properly on a cluster.

  • Corrects a "cfsdb_assert" panic condition which can occur following the failure of a cluster node.

  • Corrects a problem that can cause cluster members to hang while waiting for the update daemon to flush /var/adm/pacct.

  • Prevents a potential hang that can occur on a CFS failover.

  • Allows POSIX semaphores/msg queues to operate properly on a CFS client.

  • Addresses a potential file corruption problem, which could cause erroneous data to be returned when reading a file at a CFS client node. There is also a small possibility that this problem could result in the CFS panic "AssertFailed: bp->b_dev."


  • Addresses two potential CFS panic conditions that might occur for a DMAPI/HSM managed file system. The panic strings are:

    • Assert Failed: (t)->cntk_mode <= 2

    • Assert Failed: get_recursion_count( current_thread(), &CMI_TO_REC_LOCK(mi)) == 1

  • Corrects a problem in which a panic could occur if multiple CFS client nodes leave the cluster while a CFS relocate or unmount is occurring.

  • Fixes a problem where a possible KMF panic occurs when executing the command cfsmgr -a DEVICES on a file system with LSM volumes.

  • Corrects a CFS problem that could cause a panic with the panic string of "CFS_INFS full".

  • Fixes a problem where a possible CFS panic might occur when a file is opened in Direct I/O mode at the same time it is being truncated by a separate process.

  • Provides enablers for the Enterprise Volume Manager product.

  • Fixes a memory leak in cfscall_ioctl().

  • Provides support for the freezefs utility.

  • Fixes a data inconsistency that can occur when a CFS client reads a file that was recently written to and whose underlying AdvFS extent map contains more than 100 extents.

  • Fixes a panic that can occur during the mount of a clusterized file system on top of a non-clusterized file system.

  • Prevents a kernel memory fault panic during unmount in a cluster or during a planned relocation.

  • Fixes support for mounting other filesets from the cluster_root domain in a cluster.

  • Fixes the assertion failure ERROR != ECFS_TRYAGAIN.

  • Fixes a race condition during a cluster mount that results in a transient ENODEV seen by a name space lookup.

  • Fixes a problem in which a panic on boot could occur if a mount request is received from another node too early in the boot process.

  • Fixes a problem in which a PANIC: CFS_ADD_MOUNT() - DATABASE ENTRY PRESENT panic could occur when a node rejoins the cluster.

  • Fixes a race condition in cluster mount support that results in a transient mount failure and a second race that might result in a kernel memory fault panic during mount.

  • Fixes a cluster problem with hung unmounts (possibly seen as hung node shutdowns).

  • Fixes a problem in which a UBC panic could occur when accessing CFS file systems.


  • Prevents a possible Kernel Memory Fault panic on racing mount update/unmount/remount operations for the same mount point.

  • Fixes a possible race between node shutdown and unmount.

  • Prevents a possible Kernel Memory Fault panic on the mount update on a Memory File System (MFS) and other possible panics when bad arguments are passed to the mount library interface.

  • Prevents the panic "Assert failed: vp->v_numoutput > 0" or a system hang when a file system becomes full and direct asynchronous I/O via CFS is used. A vnode will exist with a v_numoutput value greater than 0, and the thread hangs in vflushbuf_aged().

  • Prevents a possible Kernel Memory Fault in function ckidtokgs.

  • Fixes a potential CFS deadlock condition.

  • Corrects the problem of the cfsmgr error "Not enough space" when attempting to relocate a file system with a large number of disks.

  • Fixes a problem in which CFS client node file read failures could occur if the domain storage devices were closed during a previous failure to perform a failover mount on the client node.

  • Fixes support for mounting other filesets from a cluster node's boot partition domain.

  • Addresses a cluster problem that can arise in the case where a cluster is serving as an NFS server. The problem can result in stale data being cached at the nodes which are servicing NFS requests.

  • Fixes a CFS panic that might occur for a DMAPI/HSM managed file system:

    (panic): cfstok_hold_tok(): held token table overflow

  • Fixes a "cmn_err: CE_PANIC: ics_unable_to_make_progress: netisrs stalled" panic in clua.mod caused by a wait for malloc when memory is exhausted.

  • Fixes a panic in clua_cnx_unregister where a TP structure could not be allocated for a new TCP connection.

  • Fixes problems with cluster alias selection priority when adding a member to an alias.

  • Fixes a problem when the cluster alias subsystem does not send a reply to a client that pings a cluster alias address with a packet size of less than 28 bytes.

  • Allows the cfsstat -i command to execute properly.

  • Fixes a potential Cluster File System deadlock that can occur during CFS failover processing following the failure of a CFS server node.

  • Prevents process hangs on clusters mounting NFS file systems and accessing files locked by the plock() function on the NFS server.

  • Fixes a possible timing window whereby a booting node may panic due to memory corruption if another node dies.

  • Fixes a small window that can cause a clusterwide panic on node reboot in a quorum loss situation.


  • Fixes a problem in which a cluster member may panic with the panic string "kernel memory fault".

  • Fixes a possible boot hang that could occur if the cluster_root domain consists of LSM volumes whose underlying physical storage is nonshared.

  • Prevents a memory leak from occurring when using small, unaligned Direct I/O access (that is, access that is not aligned on a 512-byte boundary and does not cross a 512-byte boundary).

  • Prevents the cfsmgr command from displaying an erroneous server name when a request is made for statistics for an unmounted file system.

  • Fixes support for Synchronized I/O in clusters.

  • Eliminates erroneous EIO errors that could occur if a client node becomes a server during a rename/unlink/rmdir system call.

  • Corrects a CFS problem that could result in degraded performance when reading at file offsets past 2 GB.

  • Corrects a cluster file locking problem that can arise when file systems are exported from the cluster to NFS client nodes.

  • Fixes a CFS problem where file access rights may not appear consistent clusterwide.

  • Fixes a race between cluster mounts and file system lookups.

  • Fixes a problem in which file system failover deadlocks.

  • Corrects a Cluster File System (CFS) performance issue seen when multiple threads/processes simultaneously access the same file on an SMP (more than one CPU) system.

  • Addresses a potential clusterwide hang which can occur in the Cluster File System.

  • Fixes a problem in which file permissions inherited from the default ACL may be different than expected under the following conditions:

    • ACLs are enabled on the system.

    • There is a default ACL on a directory.

    • A request is issued from a CFS client to create a file within that directory.

  • Fixes a problem where cluster file system I/O and AdvFS domain access causes processes to hang.

  • Prevents an infinite loop during node shutdown when using server_only file systems.


  • Fixes a memory fault panic from clua_cnx_thread.

  • Fixes a problem in which an application that uses file locking may experience degraded performance.

  • Provides the I/O barrier code that prevents HSG80 controller crashes (firmware issue).

  • Fixes a situation in which a rebooting cluster member would panic shortly after rejoining the cluster if another cluster member was doing remote disk I/O to the rebooting member when it was rebooted.

  • Allows high density tape drives to use the high density compression setting in a cluster environment.

  • Fixes a kernel memory fault panic that can occur within a cluster member during failover while using shared served devices.

  • Fixes the problem of a clusterwide hang that occurs when a DRD node failover is stuck and unable to bid a new server for a served device.

  • Adds DRD barrier retries to work around HSx firmware problems.

  • Fixes a problem in which CAA applications using tape/changers as required resources will not come ONLINE (as seen by caa_stat).

  • Fixes a problem in which the tape changer is accessible only from the member that is the DRD server for the changer.

  • Fixes a problem where an open request to a disk in a cluster fails with an illegal errno (>=1024).

  • Fixes a problem where an open to a tape drive in a cluster would take six minutes (instead of two) to fail if there were no tape in the drive.

  • Corrects a problem in which a cluster would hang the next time a node was rebooted after a tape device was deleted from the cluster.

  • Fixes a domain panic in a cluster when a file system is mounted on a disk accessed remotely over the cluster interconnect.

  • Fixes a race condition that occurs when multiple unbarrierable disks fail at the same time.

  • Fixes a kernel memory fault in drd_open.

  • Prevents an infinite loop in drd_open().

  • Fixes several Device Request Dispatcher problems.

  • Provides the required mechanism to remove a rolling upgrade issue with CD-ROM and floppy disk device handling.

  • Fixes a problem in which I/O to a device can get stuck, or a cluster node may panic, after a device has been deleted.

  • Fixes a problem of excessive FIDS_LOCK contention that occurs when a large number of files are using system-based file locking.

  • Causes the immediate updating of the attributes on a directory when files are removed by a cluster node that is not the file system server.

  • Fixes a hang condition in Device Request Dispatcher (DRD) when accessing a failed disk.


  • Prevents a "simple_lock: time limit exceeded" panic or an "Assert Failed: brp->br_fs_svr_out" panic that can be seen while executing chfsets on a cluster.

  • Fixes problems in the cluster kernel where a cluster member hangs during cluster shutdown or while booting.

  • Fixes a problem in the cluster kernel where a cluster member panics when a tape device is accessed.

  • Fixes a token problem that could cause an unmount to hang.

  • Fixes a condition that causes the panic "CNX MGR: Invalid configuration for cluster seq disk" during simultaneous booting of cluster nodes.

  • Fixes a problem in which two nodes leaving the cluster within a short time period would cause I/O on some devices to get stuck.

  • Fixes a problem in which a new device would not be properly configured in a cluster if the device was discovered during a boot.

  • Causes the Device Request Dispatcher (DRD) to retry to get disk attributes when EINPROGRESS is returned from the disk driver.

  • Fixes an issue with ICS (Internode Communication Services) on a NUMA-based system in a cluster.

  • Fixes a possible race condition between a SCSI reservation conflict and an I/O drain that could result in a hang.

  • Adds support for multiple opens to tape libraries/media changers.

  • Alleviates a condition in which a cluster member takes an extremely long time to boot when using LSM.

  • Corrects reference-counting errors that may lead to a panic during cluster mount.

  • Relieves pressure on the CMS global DLM lock by allowing AutoFS automounts to back off.

  • Addresses a potential panic in the Cluster File System that can occur when using raw Asynchronous I/O.

  • Addresses a potential panic in the Cluster File System that can occur when using file system quotas.

  • Fixes kernel memory faults associated with passing in invalid parameters to the mount system call.

  • Fixes the problem of a potential hang when multiple nodes are shutting down simultaneously and server-only file systems are mounted.

  • Fixes the problem of a potential system crash when adding a cluster alias.

  • Improves the responsiveness of EINPROGRESS handling during the issuing of I/O barriers. The fix removes a possible infinite loop scenario which could occur due to the deletion of a storage device.

  • Allows AutoFS auto-unmounts to back off, thereby relieving pressure on the CMS global DLM lock.

  • Adds data validation checking pertaining to cluster messages involving tokens, to assist in problem isolation and diagnosis.

  • Corrects diagnostic code that might result in a panic during kernel boot.

  • Corrects a problem in which a bus reset causes the loss of quorum, resulting in a cluster hang.


  • Fixes a problem in the cluster kernel where a cluster member panics while doing remote I/O over the interconnect.

  • Fixes a performance problem where threads could spend a long time in the check_busy() AdvFS routine.

  • Fixes a panic that may occur during an unmount.

  • Fixes a cross-node cluster deadlock that can occur when AdvFS threads on two cluster nodes simultaneously call code that requires the taking of already held AdvFS locks on the other node.

  • Corrects a problem in which mounting on a directory in a clone fileset fails with "Device Busy".

  • Enhances cluster file system performance when using file locks to coordinate file access.

  • Prevents a kernel memory fault panic in some cases where AdvFS administration commands are performed on a mounted fileset of an inaccessible AdvFS domain.

  • Fixes a problem in the Device Request Dispatcher.

  • Fixes a race condition in the Device Request Dispatcher.

  • Fixes a problem that occurs when multiple rsh sessions target the cluster alias address and clua.mod gives out a single port to be used for multiple sessions, causing chaos.

  • Fixes an internal problem in the kernel's AdvFS, UFS, and NFS file systems where extended attributes with names greater than 247 characters could not be set on files. The new limit is 254 plus a null string terminator.

  • Fixes a condition that could cause a panic when a node is halting.

  • Fixes a race condition in CFS readahead logic and a race condition in CFS token logic.

  • Improves the fragment gathering mechanism to boost performance.

  • Fixes a condition that can cause a panic problem when clua.mod is unloaded.

  • Fixes a condition that can cause a boot-time panic when ipport_userreserved is 1000 or less.

  • Fixes a problem where access to the quorum disk can be lost if the quorum disk is on a parallel SCSI bus and multiple bus resets are encountered.

  • Fixes a regression associated with non-SCSI storage.


  • Fixes a timing window during asynchronous reads on a CFS client.

  • Fixes a cfsmgr core dump when passing the incorrect number of arguments upon force unmounting a served file system.

  • Fixes a problem in which a CFS client for a file with a hole preceding a frag might drop the frag.

  • Eliminates a performance problem when a node acting as CFS server of an NFS client file system is write-appending to an external NFS server.

  • Fixes a panic that may occur due to a race condition during the mounting of a booting node's boot partition.

  • Fixes a race between nodes performing failover processing which might lead to an incorrect change in the state of file locks.

  • Corrects a problem where, under some circumstances, a system may panic with a kernel memory fault when a device that is being opened by one program is being deleted via the hwmgr utility.

  • Fixes a kernel memory fault (KMF) from mc_bcopy or _OtsMove.

  • Addresses a potential hang in the NFS server when file systems are being relocated in a cluster.

  • Helps to close a race where synchronous writes may obtain disk allocations that were promised to cached client writes.

  • Preserves the error code from an asynchronous write error on a CFS client and returns the error from the close system call.

  • Fixes a potential data inconsistency caused by a problem in the CFS block reservation code which incorrectly calculates the amount of space requested and used by direct I/O writes.

  • Fixes a potential data inconsistency that may occur when a domain is nearly full. The fix ensures that client write requests shipped synchronously to the server no longer have subsets of pages written asynchronously due to a race with VM.