PROBLEM: (88862) (PATCH ID: TCR520-031) ********

This patch makes AdvFS fileset quota enforcement work properly on a cluster.

PROBLEM: (89251, GB_G01729) (PATCH ID: TCR520-011) ********

This patch fixes a "deltokp->tok_hldfifo[TOK_RDWR].fifo_req_begin == NULL" assertion failure following filesystem failover recovery, which results in a "cfsdb_assert" panic originating from the cfsdb_assert() routine called by the deletetoken() routine. A message similar to the following may appear on the console or in the message buffer just prior to the panic:

Assert Failed: deltokp->tok_hldfifo[TOK_RDWR].fifo_req_begin == NULL
file: ../../../../src/kernel/tnc_common/tnc_cfe/clitok.c
line: 1118
caller: 0xfffffc0000953548

This problem produces stack traces similar to the following:

10 panic
11 cfsdb_assert
12 deletetoken
13 send_to_server
14 revoke_internal
15 tok_revoke_range
16 cfstok_revoke
17 cfs_tokmsg
18 rcfstok_revoke
19 svr_rcfstok_revoke
20 icssvr_daemon_from_pool

PROBLEM: (TKT220054, BCGM80DRR, SE_G01558, 89342) (PATCH ID: TCR520-005) ********

This patch corrects a problem which can cause cluster members to hang, waiting for the update daemon to flush /var/adm/pacct.

PROBLEM: (VNO88299B) (PATCH ID: TCR520-002) ********

This patch prevents a potential hang that can occur on a CFS failover.
The cfs_fo_thread's stack trace will look like:

1: lock_wait+228: thread_block()
2: lock_read+1004: lock_wait(0x70, 0xfffffc00008eb330, 0xfffffc008839bc00, 0xfffffc0000838544)
3: cfs_reclaim+308: lock_read(0xfffffc00fa5ed440)
4: vclean+388: cfs_reclaim(0xfffffc00a78a9440, 0x7)
5: vgone+196: vclean(0xfffffc00a78a9440, 0x7, 0xfffffc000092e578)
6: cfs_inactive+376: vgone(0xfffffc00a78a9440, 0x3, 0xfffffc000092e578)
7: vrele+276: cfs_inactive(0xfffffc00a78a9440)
8: freefid+220: vrele(0xfffffc00a78a9440)
9: cfs_remove_client_locks+932: freefid(0xfffffc008b607da0)
10: cfs_rec_remove_server_state+80: cfs_remove_client_locks(0xfffffc00f9de8300)
11: cfs_rec_start_server+616: cfs_rec_remove_server_state(0xfffffc00f9de8300, 0x1)
12: cfs_fo_handle_bid_accept+292: cfs_rec_start_server(0xfffffc00fa459dc0, 0xfffffc00faa2a900, 0xfffffc0083bddf80, 0x4)
13: cfs_fo_thread+1204: cfs_fo_handle_bid_accept(0xfffffc00fa5ed400)

PROBLEM: (FR_G01276, 86827) (PATCH ID: TCR520-004) ********

This patch allows POSIX semaphores and message queues to operate properly on a CFS client. These mechanisms are not "clusterized" and cannot be used across nodes, but any application using semaphores or message queues that works on a base system should also work when run on a single node in a cluster (client or server).

PROBLEM: (90288, HGO104051) (PATCH ID: TCR520-039) ********

This problem can manifest itself when reading files which contain "holes". Under certain conditions (outlined below), the CFS Cached Direct Access Read code could access incorrect disk blocks while servicing a read request. Specifically, the problem can manifest itself under the following conditions:

- The file is being read at a CFS client node.
- The underlying physical filesystem type is AdvFS.
- The file is larger than 64k in size (i.e., the read will be handled via the CFS Cached Direct Access Read method).
- The file contains a "hole" at the end of the file.

The net effect of the problem is that when the file is read at the CFS server, the expected file contents are seen, but when read from a CFS client, "random" data is returned. It's also possible that the CFS client node could panic with a panic message of "Assert Failed: bp->b_dev".

PROBLEM: (89109, 89142) (PATCH ID: TCR520-014) ********

PROBLEM: (89142) (PATCH ID: )

This patch corrects a CFS problem which could be seen on a DMAPI/HSM managed filesystem whereby retries are exhausted for an internal DMAPI event which is not cleared for a region after event generation completes successfully. Once the vgoning of the affected file vnode occurs, the following panic results: "Assert Failed: ( t)->cntk_mode <= 2".

PROBLEM: (89109) (PATCH ID: )

This patch corrects a CFS problem which could cause a panic when an internal message is sent to the CFS server of a DMAPI/HSM managed filesystem from a CFS client node and the CFS server node dies while processing the message. The resulting panic string is: "Assert Failed: get_recursion_count(current_threa& CFS_CMI_TO_REC_LOCK(mi)) == 1".

PROBLEM: (89797) (PATCH ID: TCR520-016) ********

This patch corrects a CFS timing window whereby a panic can result if, during a CFS relocation or unmount, multiple client nodes fail simultaneously.

PROBLEM: (89843) (PATCH ID: TCR520-018) ********

This patch corrects a CFS KMF panic which can occur on executing the "cfsmgr -a DEVICES" command on a domain or filesystem which contains LSM volumes, whereby internally LSM reports that the number of disks in an LSM volume is 0.

PROBLEM: (89728) (PATCH ID: TCR520-010) ********

This patch corrects a CFS problem that could cause a panic with the panic string of "CFS_INFS full".
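The trailing-hole condition behind problem 90288 above is easy to set up with ordinary POSIX calls. The following is a minimal, portable Python sketch (not CFS- or AdvFS-specific): writing past 64k and then extending the file with truncate() leaves a hole at the end, and a correct read must return zeros for that range. Whether the extension is stored as a true on-disk hole depends on the filesystem, but the read semantics are the same.

```python
import os
import tempfile

# Create a >64 KiB file whose tail is a "hole": real data followed by a
# truncate() extension that writes no blocks.
path = os.path.join(tempfile.mkdtemp(), "sparse")
with open(path, "wb") as f:
    f.write(b"x" * 70000)   # real data; size passes the 64k threshold
    f.truncate(200000)      # bytes 70000..199999 are a zero-filled hole

with open(path, "rb") as f:
    data = f.read()

assert len(data) == 200000
# A correct read returns zeros for the hole; the 90288 bug returned
# "random" data here when reading from a CFS client.
assert data[70000:] == b"\0" * 130000
```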
PROBLEM: (86949) (PATCH ID: TCR520-012) ********

This patch corrects a CFS problem which could cause a panic when a file is being opened in Direct I/O mode while, at the same time, a separate process is attempting to extend the file via a truncate() syscall.

PROBLEM: (89942) (PATCH ID: TCR520-026) ********

Enabler support for the Enterprise Volume Manager product.

PROBLEM: (GB_G01710) (PATCH ID: TCR520-001) ********

This patch fixes a memory leak in cfscall_ioctl().

PROBLEM: (90891) (PATCH ID: TCR520-068) ********

This patch is required for freezefs support.

PROBLEM: (91142) (PATCH ID: TCR520-100) ********

This fix addresses a data inconsistency that can occur when a CFS client reads a file that was recently written to and whose underlying AdvFS extent map contains more than 100 extents. To find out how many extents a file has, use the showfile -x command.

PROBLEM: (87918) (PATCH ID: TCR520-090) ********

If the mount of a clusterized file system type is attempted onto a non-clusterized file system type, a "TRAP: INVALID MEMORY READ ACCESS FROM KERNEL MODE" panic occurs.

PROBLEM: (87406) (PATCH ID: TCR520-091) ********

This patch prevents panics during unmount processing and during planned relocation. In the former case a representative stack trace would indicate a panic during the cfs_unmount() routine, and in the latter case a panic during the do_pfs_unmount() routine.

PROBLEM: (90070) (PATCH ID: TCR520-104) ********

This patch corrects support for multiple filesets being mounted from the cluster_root domain. When the server node leaves the cluster while other filesets from cluster_root are mounted, all other cluster nodes could panic with the following:

panic : "cfs_do_pfs_mount: pfs and cfs fsids differ on failover"

PROBLEM: (BCGMB2CCS) (PATCH ID: TCR520-080) ********

This patch fixes the assertion failure ERROR != ECFS_TRYAGAIN.
The stack trace looks like:

0 stop_secondary_cpu: 1205
1 panic: 1252
2 event_timeout: 1971
3 printf: 940
4 panic: 1309
5 cfsdb_assert: 452
6 cfs_create: 5186
7 vn_open: 707
8 copen: 3300
9 syscall: 727
10 _Xsyscall: 1785

PROBLEM: (86882) (PATCH ID: TCR520-083) ********

There is a race between cluster mount and name space lookup logic which may result in a transient ENODEV error returned to the lookup. This problem was first noticed in the context of auto-mounting a home directory during remote login, under AutoFS. Infrequently, the initial attempt to log in would fail but subsequent attempts would succeed. This problem may occur, however, with arbitrary applications and depends only upon timing considerations.

PROBLEM: (89503, (for) (PATCH ID: TCR520-089) ********

When a node is booting and a mount request is executed on another node whereby the booting node is selected as the CFS server of the filesystem, the booting node could panic if it is not ready for the mount request.

PROBLEM: (88766) (PATCH ID: TCR520-095) ********

This patch fixes a panic of a node already in the cluster when a node re-joins the cluster. This problem is most likely to occur when quorum has been lost and has just been regained due to the joining node. The panic string will be:

PANIC: CFS_ADD_MOUNT() - DATABASE ENTRY PRESENT

PROBLEM: (80986) (PATCH ID: TCR520-099) ********

One race condition involves a transient failure to reserve a kernel resource associated with the file system to be mounted, and results in an ENODEV errno returned to the mount system call. The second race condition involves the use of a stale memory pointer within the kernel, and will likely result in a panic during the cms_mount_initial() routine.

PROBLEM: (84254) (PATCH ID: TCR520-078) ********

This patch fixes a cluster problem with hung unmounts (possibly seen as hung node shutdowns).
Messages similar to the following will appear on the console of the node serving the filesystem being unmounted:

WARNING: svrcfstok_waitfortokens: svrcfstok structures not cleaned up (retries = 25)
WARNING: svrcfstok_waitfortokens: svrcfstok structures not cleaned up (retries = 25)

PROBLEM: (90221, 91235) (PATCH ID: TCR520-101) ********

This patch addresses a problem where, under certain very rare conditions, a panic with a stack trace similar to the following could result:

PANIC: "pgl_remove: remove from empty (vop)->vu_cleanpl"

4 panic src/kernel/bsd/subr_prf.c
5 ubc_page_release src/kernel/vfs/vfs_ubc.c
6 cfs_putpage src/kernel/tnc_common/tnc_cfe/alpha/cfs_vm_alpha.c
7 ubc_invalidate src/kernel/vfs/vfs_ubc.c
8 vclean src/kernel/vfs/vfs_subr.c
9 vgone src/kernel/vfs/vfs_subr.c

PROBLEM: (90792, 89527) (PATCH ID: TCR520-081) ********

PROBLEM: (90792) (PATCH ID: )

There is a small window where, if a mount update races with an unmount and remount of the same mount point, it is possible for one node to experience a Kernel Memory Fault panic in the cfs_mount_update_accept() function.

PROBLEM: (89527) (PATCH ID: )

The race condition fixed by this patch eliminates a kernel memory fault panic during the cms_shutdown() routine.

PROBLEM: (87952, 87231) (PATCH ID: TCR520-082) ********

PROBLEM: (87952)

A mount update request to a Memory File System (MFS) will cause a Kernel Memory Fault panic.

PROBLEM: (87231)

A bad argument to the mount syscall could cause a panic, and there are some error cases for mounts which will leave the resulting failed mount point busy.

PROBLEM: (DE_G02611, 90821, LU_G02822) (PATCH ID: TCR520-070) ********

This patch prevents a panic ("Assert failed: vp->v_numoutput > 0") or a system hang when a filesystem becomes full and direct async I/O via CFS is used. A vnode will exist that has a v_numoutput value greater than 0, and the thread is hung in vflushbuf_aged().
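For reference, the file-extension half of the race in problem 86949 above can be shown with standard calls. This Python sketch demonstrates only the truncate()-as-extension semantics, not the Direct I/O open racing against it:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data")
with open(path, "wb") as f:
    f.write(b"abc")   # 3 bytes of real data

# truncate() with a length larger than the current size EXTENDS the file,
# zero-filling the new range; this is the extension operation that raced
# with a Direct I/O open() in problem 86949.
os.truncate(path, 1024)

assert os.path.getsize(path) == 1024
with open(path, "rb") as f:
    head = f.read(4)
assert head == b"abc\0"   # original data intact, extension reads as zeros
```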
PROBLEM: (91051) (PATCH ID: TCR520-092) ********

This patch fixes a possible Kernel Memory Fault panic in the function ckidtokgs() with the following stack trace:

Thread 0xfffffc00233aa380: Pid 632125: icssvr_daemon_fr
0 stop_secondary_cpu() [1202, 0xfffffc00005f5a3c]
1 panic() [1252, 0xfffffc0000294a04]
2 event_timeout() [1971, 0xfffffc00005f6c74]
3 printf() [940, 0xfffffc0000293db8]
4 panic() [1309, 0xfffffc0000294b38]
5 trap() [2262, 0xfffffc00005ea680]
6 _XentMM() [2116, 0xfffffc00005e4458]
7 ckidtokgs() [8849, 0xfffffc0000946618]
8 cfs_sentinel_force() [1424, 0xfffffc00008ff064]
9 crfs_fsync_0() [7302, 0xfffffc00008fd068]
10 icstnc_rpc_dispatch() [951, 0xfffffc0000802ef0]
11 icstnc_svr_rcall() [714, 0xfffffc00008029d0]
12 icssvr_daemon_from_pool() [778, 0xfffffc00008be7bc]

PROBLEM: (90178, BCGM918KQ) (PATCH ID: TCR520-059) ********

Fix potential CFS deadlock.

PROBLEM: (IT_G02601, IT_G02586, 90532) (PATCH ID: TCR520-062) ********

This patch corrects a cfsmgr "Not enough space" error when attempting to relocate a file system with a large number of disks. An example of the cfsmgr command and the error the patch corrects is:

# cfsmgr -r -a server=hostname -d AreaBuff_dmn
cfsmgr: subsystem error: Not Enough Space

PROBLEM: (91311) (PATCH ID: TCR520-093) ********

This patch corrects possible CFS file read failures when the storage used for an AdvFS domain is LSM volumes comprised of local-only storage and the node attached to the storage leaves the cluster. The other remaining nodes may get file read failures once the attached node reboots into the cluster and reserves the AdvFS domain. There is also a very small window for a non-LSM AdvFS domain whereby the same problem could occur.

PROBLEM: (90039S) (PATCH ID: TCR520-084) ********

This patch corrects support for multiple filesets being mounted from a cluster node's boot partition domain.
When the server node leaves the cluster while other filesets from its boot partition domain are mounted, other nodes could panic with the following:

Assert failed: CFS_CMI_TO_SERVER (vftocmi(mp))==this_node

PROBLEM: (92941) (PATCH ID: TCR520-136) ********

This patch addresses a cluster problem that can arise when a cluster is serving as an NFS server. The problem can result in "stale" file data being cached at cluster nodes which are servicing NFS requests (i.e., the cached data will not be invalidated if the file is subsequently written to). The problem manifests itself on a per-file basis, and the net result is that reading a file from different cluster nodes could yield different results.

PROBLEM: (92135) (PATCH ID: TCR520-116) ********

This patch corrects a CFS problem which could be seen on a DMAPI/HSM managed filesystem whereby the node executing the HSM and serving the filesystem panics with the following:

(panic): cfstok_hold_tok(): held token table overflow

PROBLEM: (90512) (PATCH ID: TCR520-053) ********

The panic "cmn_err: CE_PANIC: ics_unable_to_make_progress: netisrs stalled" would happen when clua.mod attempted to malloc with wait. Since memory was exhausted, the wait would cause a timeout and panic.

PROBLEM: (90886) (PATCH ID: TCR520-067) ********

A kernel memory fault panic would occur in clua_cnx_unregister if a protocol-specific pcb (tp) could not be allocated for a new TCP connection.

PROBLEM: (90232) (PATCH ID: TCR520-044) ********

When a new member is added to a cluster alias, the selection priority of that member would not be recognized, resulting in connections potentially going to the wrong cluster member.

PROBLEM: (DEK043348, 90164) (PATCH ID: TCR520-077) ********

This patch fixes a problem where the cluster alias subsystem does not send a reply to a client that pings a cluster alias address with a packet size of less than 28 bytes.
PROBLEM: (EVT07519664) (PATCH ID: TCR520-003) ********

This patch allows the command "cfsstat -i" to execute properly. Before the patch you would receive the error:

get_val: read: No such device or address

This patch also corrects a memory leak in the command.

PROBLEM: (93384) (PATCH ID: TCR520-159) ********

This patch addresses a potential Cluster File System deadlock which can occur during CFS failover processing following the failure of a CFS server. Under certain conditions, it's possible for the CFS failover processing to deadlock with outstanding I/Os; but those I/Os are in turn blocked due to the server failure, so the failover never occurs, and any processes attempting to access the file system involved block indefinitely.

PROBLEM: (91910, BCGM10TH0) (PATCH ID: TCR520-124) ********

Prevent process hangs on clusters mounting NFS file systems and accessing plock-ed files on the NFS server. The most obvious symptom is "ps" commands blocked for long periods of time with the following stack trace:

lock_wait()
lock_write()
u_map_copyout()
table()
syscall()
_Xsyscall()

PROBLEM: (92511, 92734) (PATCH ID: TCR520-126) ********

This patch fixes a possible timing window whereby a booting node may panic due to memory corruption if another node which is the server of NFS filesystems or server-only filesystems dies while the booting node is performing remote mounting. The remote mounting occurs, before going to multi-user mode, once the following line is output to the console during boot:

CMS: Joining deferred filesystem sets

PROBLEM: (91620) (PATCH ID: TCR520-109) ********

This patch fixes a clusterwide panic with:

RM_CRASH_NODE_MASK PANIC IN RM_POLL

This can occur when a node leaves the cluster causing quorum loss and then rejoins the cluster, when it was previously the server of a filesystem and a filesystem request gets sent to the rebooted node before it is able to handle it and reject it appropriately.
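Several entries above (e.g. problems 91910 and 90686) involve POSIX advisory file locks of the kind that NFS clients and local cluster processes contend for. A minimal Python sketch of taking and releasing such a lock (standard fcntl locking on any POSIX system; this is generic illustration, not the cluster lock daemon's API):

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shared")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
try:
    # Exclusive whole-file advisory lock; a conflicting lock held by
    # another process (or, via the lock daemon, by an NFS client) would
    # block here until released.
    fcntl.lockf(fd, fcntl.LOCK_EX)
    os.write(fd, b"updated under lock\n")
    fcntl.lockf(fd, fcntl.LOCK_UN)
finally:
    os.close(fd)

with open(path, "rb") as f:
    assert f.read() == b"updated under lock\n"
```

The correctness issue fixed by TCR520-112 is precisely that locks granted this way inside the cluster could conflict with NFS-client locks that had not yet been reclaimed after a failover.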
PROBLEM: (TKT291292) (PATCH ID: TCR520-119) ********

This patch fixes a problem in which a cluster member may panic with the panic string "kernel memory fault".

PROBLEM: (93007) (PATCH ID: TCR520-147) ********

If the cluster_root domain consists of an LSM volume and the underlying physical storage is not connected to a booting node, the booting node may hang and display the following message on the console:

Waiting for cluster mount to complete
<5>lsm:volio: Cannot open disk dsk88: kernel error 6

PROBLEM: (92325, BCGM303KH) (PATCH ID: TCR520-118) ********

This patch prevents a memory leak from occurring when using small, unaligned Direct I/O access (i.e., access that is not aligned on a 512-byte boundary and doesn't cross a 512-byte boundary). Analysis of a forced crash or dumpsys will show a large amount of memory consumed in the malloc bucket CFS GENERAL. Example:

124 CFS GENERAL 38351112032

PROBLEM: (87683) (PATCH ID: TCR520-110) ********

When "cfsmgr -a statistics" is invoked and the file system named is not mounted, it is possible that erroneous information will be displayed for the server name.

PROBLEM: (91582) (PATCH ID: TCR520-111) ********

This patch corrects support for Synchronized I/O in CFS. Files opened from remote clients using the file status flags O_DSYNC or O_RSYNC were not conforming to the behaviors documented in the open(2) manpage.

PROBLEM: (89966) (PATCH ID: TCR520-120) ********

This patch eliminates erroneous EIO errors which could occur if a client node becomes a server during a rename/unlink/rmdir system call, between the initial lookup done by the vfs layer and the subsequent call to the pfs operation.

PROBLEM: (92816) (PATCH ID: TCR520-130) ********

This patch addresses a CFS problem that could result in degraded performance when sequentially reading, from a CFS client, at file offsets past 2GB.

PROBLEM: (90686, 91982) (PATCH ID: TCR520-112) ********

This patch addresses a file locking problem which can arise when using a cluster as an NFS file server.
In the case of a failure of one of the cluster members, specifically the CFS server for any exported NFS filesystems, the NFS file locking service in the cluster may not be re-started quickly enough. The result could be that processes running within the cluster may be granted file locks that conflict with locks that were granted to NFS clients but weren't reclaimed during the NFS lock daemon's "grace period".

PROBLEM: (91622, 91977) (PATCH ID: TCR520-113) ********

This patch addresses a CFS problem where file access rights may not appear consistent cluster-wide. The effect could be that, for a given file, a particular user may be erroneously denied file access from certain cluster nodes while being granted the expected access from others. This problem is far more likely to occur in the case of NFS filesystems (i.e., cluster as NFS client) than for local filesystem types.

PROBLEM: (86883) (PATCH ID: TCR520-107) ********

This patch fixes two problems. First, a race between file name lookup and cluster mount may result in the lookup erroneously failing. This is more likely in the presence of AutoFS. The second problem is a file system recovery during failover that deadlocks. This occurs only with an AdvFS fileset mounted beneath the subdirectory of /etc/fdmns that corresponds to its domain.

PROBLEM: (88878, 92739) (PATCH ID: TCR520-131) ********

This patch corrects a Cluster File System (CFS) performance issue seen when multiple threads/processes simultaneously access the same file on an SMP (>1 CPU) system. The specific performance issue addressed by this patch is a dramatic drop in filesystem performance when more than one or two threads/processes on the same cluster node are simultaneously accessing the same file and the node has more than one CPU.

PROBLEM: (90221, 91235) (PATCH ID: TCR520-151) ********

This patch addresses an obscure CFS problem that could result in a cluster-wide hang.
When recycling dirty pages, a cross-node deadlock could occur between CFS and AdvFS, the result of which would typically render the cluster completely unresponsive.

PROBLEM: (91428) (PATCH ID: TCR520-108) ********

When ACLs are enabled on the system and there is a default ACL on a directory, files and directories created in that directory should inherit the default ACL and permissions based on the rules that are discussed in detail in the Security manual. In particular, the file permissions should be based on the intersection of the requested mode (unmodified by the umask) and the permissions from the default ACL. Files and directories created from CFS clients are given the permissions directly from the default ACL without the required intersection with the requested mode. This patch ensures that files created from a CFS client node will be given the same permissions that they would get if the create request were issued at the CFS server or on a non-cluster system.

PROBLEM: (DE_G03037) (PATCH ID: TCR520-144) ********

This patch fixes a problem where cluster filesystem (CFS) I/O and AdvFS domain access cause processes to hang. The following sequence of commands, run across two nodes (Node1 and Node2), reproduces the hang:

cfsmgr -r -a SERVER=Node2 /fs
cd /fs/Dir
dd if=/dev/zero of=/bigfile
wait a little bit...
ls -l &
cd /fs/Dir
ls -l &
before ls completes, halt node
ps & - hangs
ls -l & - hangs
df -k & - hangs

PROBLEM: (93354) (PATCH ID: TCR520-157) ********

This patch fixes a hang during node shutdown that occurs when some other node in the cluster serves a server_only file system.

PROBLEM: (HPAQA2CB0, DE_G03705) (PATCH ID: TCR520-148) ********

This patch fixes a kernel memory fault from clua_cnx_thread.
0 stop_secondary_cpu : 1205
1 panic : 1252
2 event_timeout : 1971
3 printf : 940
4 panic : 1309
5 trap : 2262
6 _XentMM : 2115
7 malloc_internal : 1720
8 kch_join_attach_internal : 381
9 kch_join_set : 335
10 clua_join_set : 1666
11 clua_aliasset_common : 508
12 cluaioc_alias : 279
13 clua_cfgmgr_dispatch : 592
14 clua_configure : 496
15 kmodcall : 696
16 syscall : 713
17 _Xsyscall : 1785

PROBLEM: (92701, STL401547) (PATCH ID: TCR520-129) ********

This patch addresses a cluster problem where an application which uses file locking may experience degraded performance. In the referenced CLD, the customer experienced severely degraded performance of their Cobol application. The application didn't explicitly do file locking, but it's likely that something in the Cobol run-time environment did.

PROBLEM: (89245, 90061) (PATCH ID: TCR520-033) ********

There is a firmware issue in the HSG80 controllers that, during a cluster transition, can cause the HSG80 controllers to crash. This controller crash can then cause loss of data access to the logical volumes on that pair of HSG80 controllers. If cluster root is on that HSG80, a cluster domain panic can result. The symptoms of this problem are DRD barrier errors logged to the /usr/adm/messages files and to the console. This can also be verified by examining the HSG80 fmu logs and the HSG80 console. The key text in determining this problem is as follows:

During processing to maintain consistency of the data for Persistent Reserve SCSI commands, an internal inconsistency was detected.
> Last Failure Parameter[0] contains a code defining the precise nature of the inconsistency.

Example output (run fmu):

FMU> show last most
Last Failure Entry: 6. Flags: 006FF901
Template: 1.(01) Description: Last Failure Event
Occurred on 26-SEP-2001 at 10:10:29
Power On Time: 1. Years, 30. Days, 9. Hours, 1. Minutes, 11.
Seconds
Controller Model: HSG80
Serial Number: ZG02900845
Hardware Version: E05(2D)
Software Version: XCF4P-0(FF)
Instance Code: 0102030A
Description: An unrecoverable software inconsistency was detected or an intentional restart or shutdown of controller operation was requested.
Reporting Component: 1.(01) Description: Executive Services
Reporting component's event number: 2.(02) Event Threshold: 10.(0A)
Classification: SOFT. An unexpected condition detected by a controller software component (e.g., protocol violations, host buffer access errors, internal inconsistencies, uninterpreted device errors, etc.) or an intentional restart or shutdown of controller operation is indicated.
Last Failure Code: 43230101 Last Failure Parameter[0.] 00000013
Last Failure Code: 43230101 Description: During processing to maintain consistency of the data for Persistent Reserve SCSI commands, an internal inconsistency was detected.
> Last Failure Parameter[0] contains a code defining the precise nature of the inconsistency.
Reporting Component: 67.(43) Description: Host Port Protocol Layer
Reporting component's event number: 35.(23)
Restart Type: 0.(00) Description: Full software restart
Active Thread: FOC I960 Priority: 0.(00)
Interrupt Stack Guard is intact
NULL Thread Stack Guard is intact
Thread Stack Guard State Flags (ID# Bit; 0=intact,1=not intact): 00000000

PROBLEM: (89054) (PATCH ID: TCR520-017) ********

This patch fixes a situation in which a rebooting cluster member would panic shortly after rejoining the cluster if another cluster member was doing remote disk I/O to the rebooting member when it was rebooted.

PROBLEM: (GB_G01153) (PATCH ID: TCR520-006) ********

This patch allows high density tape drives to use the high density compression setting in a cluster environment. While opening a tape density file tape_d*, the DRD driver would issue a special ioctl by using the dev_t of tape0 (the default density file).
This caused all of the device drivers to register tape_d* as standard density; consequently, no matter what density was specified, they all ended up with the standard mode. The fix makes the DRD driver pick the correct dev_t.

PROBLEM: (HPAQ50WCZ) (PATCH ID: TCR520-007) ********

During cluster failover, if the cluster has any shared served disks, such as a shared CD-ROM, the cluster members that directly connect to the device can crash with a message similar to:

trap: invalid memory ifetch access from kernel mode
faulting virtual address: 0x0000000000000000
pc of faulting instruction: 0x0000000000000000
ra contents at time of fault: 0xffffffff0052afd0
sp contents at time of fault: 0xfffffe068993f820
panic (cpu 1): kernel memory fault

PROBLEM: (89405) (PATCH ID: TCR520-020) ********

This patch fixes a cluster-wide hang that occurs when DRD node failover is stuck and unable to bid a new server for a served device.

PROBLEM: (90685, 90503) (PATCH ID: TCR520-064) ********

There is a firmware issue in the HSG80 controllers that, during a cluster transition, can cause the HSG80 controllers to crash. This controller crash can then cause loss of data access to the logical volumes on that pair of HSG80 controllers. If cluster root is on that HSG80, a cluster domain panic can result. The symptoms of this problem are DRD barrier errors logged to the /usr/adm/messages files and to the console. This can also be verified by examining the HSG80 fmu logs and the HSG80 console. The key text in determining this problem is as follows:

During processing to maintain consistency of the data for Persistent Reserve SCSI commands, an internal inconsistency was detected.
> Last Failure Parameter[0] contains a code defining the precise nature of the inconsistency.

Example output (run fmu):

FMU> show last most
Last Failure Entry: 6. Flags: 006FF901
Template: 1.(01) Description: Last Failure Event
Occurred on 26-SEP-2001 at 10:10:29
Power On Time: 1. Years, 30. Days, 9. Hours, 1. Minutes, 11.
Seconds
Controller Model: HSG80
Serial Number: ZG02900845
Hardware Version: E05(2D)
Software Version: XCF4P-0(FF)
Instance Code: 0102030A
Description: An unrecoverable software inconsistency was detected or an intentional restart or shutdown of controller operation was requested.
Reporting Component: 1.(01) Description: Executive Services
Reporting component's event number: 2.(02) Event Threshold: 10.(0A)
Classification: SOFT. An unexpected condition detected by a controller software component (e.g., protocol violations, host buffer access errors, internal inconsistencies, uninterpreted device errors, etc.) or an intentional restart or shutdown of controller operation is indicated.
Last Failure Code: 43230101 Last Failure Parameter[0.] 00000013
Last Failure Code: 43230101 Description: During processing to maintain consistency of the data for Persistent Reserve SCSI commands, an internal inconsistency was detected.
> Last Failure Parameter[0] contains a code defining the precise nature of the inconsistency.
Reporting Component: 67.(43) Description: Host Port Protocol Layer
Reporting component's event number: 35.(23)
Restart Type: 0.(00) Description: Full software restart
Active Thread: FOC I960 Priority: 0.(00)
Interrupt Stack Guard is intact
NULL Thread Stack Guard is intact
Thread Stack Guard State Flags (ID# Bit; 0=intact,1=not intact): 00000000

PROBLEM: (90961) (PATCH ID: TCR520-075) ********

Resources like tapes/changers handled by CAA do not come online (according to CAA). The caa_stat command will return something like this even though there is no problem accessing the device:

NAME=tapeone
TYPE=tape
TARGET=ONLINE on hamm
TARGET=ONLINE on woody
STATE=OFFLINE on hamm
STATE=OFFLINE on woody

PROBLEM: (90599) (PATCH ID: TCR520-079) ********

This patch fixes a problem where a tape changer is only accessible from the member that's the DRD server for the changer.
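Returning to problem 91428 above (default ACL inheritance on CFS clients): the rule is that the effective permissions are the bitwise intersection of the requested creation mode (not filtered by the umask) and the permissions granted by the directory's default ACL. A small illustrative sketch of that arithmetic (the function name is ours, not the kernel's):

```python
def effective_mode(requested: int, default_acl: int) -> int:
    """Intersection rule: the requested mode AND'ed with the permissions
    from the directory's default ACL (umask deliberately not applied)."""
    return requested & default_acl

# open(..., O_CREAT, 0o644) under a default ACL granting rwxrwxr-x (0o775):
assert effective_mode(0o644, 0o775) == 0o644
# open(..., O_CREAT, 0o666) under a default ACL granting rwxr-x--- (0o750):
assert effective_mode(0o666, 0o750) == 0o640
```

Before the fix, a create issued from a CFS client skipped the intersection and handed the file the default ACL's permissions directly (0o750 in the second example).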
PROBLEM: (91283) (PATCH ID: TCR520-094) ********

This patch fixes a problem where an open request to a disk in a cluster fails with an illegal errno (>=1024).

PROBLEM: (87387) (PATCH ID: TCR520-096) ********

This patch fixes a problem where an open to a tape drive in a cluster would take 6 minutes (instead of 2) to fail if there were no tape in the drive.

PROBLEM: (91286) (PATCH ID: TCR520-097) ********

This patch solves a problem in which a cluster would hang the next time a node was rebooted after a tape device was deleted from the cluster.

PROBLEM: (90755) (PATCH ID: TCR520-088) ********

This patch fixes a domain panic in a cluster when a file system is mounted on a disk accessed remotely over the cluster interconnect.

PROBLEM: (BCSMA0T58/90257) (PATCH ID: TCR520-098) ********

This patch fixes a race condition that occurs when multiple unbarrierable disks fail at the same time.

PROBLEM: (TPOQ57031) (PATCH ID: TCR520-103) ********

This patch fixes a kernel memory fault in drd_open.

PROBLEM: (93056) (PATCH ID: TCR520-155) ********

This is a regression in tcr51asupportos.bl2 and wcalphaos and as such cannot be seen by a customer, since customers have never received that code.

PROBLEM: (92113, 92114, 92235, 92689) (PATCH ID: TCR520-149) ********

This patch addresses the following issues:

- Performance fix associated with locking hierarchy.
- Added support for the new cluster-safe IOCTLs added in the base submit.
- Cleans up compiler warnings.
- Fixes problems associated with KCH collisions.
- DRD tape/changer access path locking issues.
- Drdmgr message errors.
- DRD KCH proposal to reject the server transfer issues.
- Better DRD console messages.
- DRD CNX recovery issues.
- CNX drain issues.
- CNX device recovery issues.
- Reservation conflicts and MUNSA_REJECTS.
- Eliminates a race condition in the open/reopen code.

PROBLEM: (93630) (PATCH ID: TCR520-162) ********

This patch ensures that this kit works properly during a rolling upgrade process.
The problem being resolved is within this kit and is not present in previous OS versions or patch kits.

PROBLEM: (93022) (PATCH ID: TCR520-167) ********

This fix addresses a problem in which a cluster or a device can get I/Os stuck, or a cluster node may panic, after a device has been deleted.

PROBLEM: (93126, 93724) (PATCH ID: TCR520-166) ********

Excessive FIDS lock contention is observed when a large number of files use system-based file locking. The result from "lockinfo -sort=misses -d 20 -f 200 -p 25 -l 20" will show the FIDS lock at the top of the list with a high miss rate.

PROBLEM: (94199) (PATCH ID: TCR520-197) ********

This patch addresses a problem when a file is removed on a node that is not the CFS server for the filesystem. The attributes for the directory were not updated on the CFS server, and hence the attributes returned by the NFS server would not be updated. This behavior can cause NFS clients to erroneously continue to apply cached lookup data, since the directory had not changed in their view, leading to stale file handle errors, when a similar situation on a single-system server would not.

PROBLEM: (94106) (PATCH ID: TCR520-198) ********

This patch fixes a hang condition in the Device Request Dispatcher (DRD) when accessing a failed disk.

PROBLEM: (94082) (PATCH ID: TCR520-184) ********

This patch fixes a problem in the cluster kernel where a cluster member panics when another member is rebooted.

PROBLEM: (BCGMM1774, GB_G04904, 94740) (PATCH ID: TCR520-203) ********

This patch prevents a "simple_lock: time limit exceeded" panic or an "Assert Failed: brp->br_fs_svr_out" panic that can be seen while executing chfsets on a cluster.
Example stack traces are:

0 boot
1 panic
2 simple_lock_fault
3 simple_lock_time_violation
4 cfs_blkrsrv_flush
5 msfs_syscall_op_set_bfset_params_activate
6 msfs_real_syscall
7 msfs_syscall
8 syscall
9 _Xsyscall

crash> tf
0 stop_secondary_cpu
1 panic
2 event_timeout
3 printf
4 panic
5 cfsdb_assert
6 cfs_blkrsrv_svrupdate
7 cfs_pfscachewrite
8 cfs_write
9 vn_write
10 rwuio
11 write
12 syscall
13 _Xsyscall

PROBLEM: (94157, 94221, 94440) (PATCH ID: TCR520-188) ******** This patch fixes a problem in the cluster kernel where a cluster member hangs during cluster shutdown or while booting. PROBLEM: (94279) (PATCH ID: TCR520-179) ******** This patch fixes a problem in the cluster kernel where a cluster member panics when a tape device is accessed. PROBLEM: (93203, TKT299628, BCGM40HZM) (PATCH ID: TCR520-163) ******** This patch fixes a token problem which could cause an unmount to hang. During this problem, messages similar to the following will also be seen on the console: WARNING: svrcfstok_waitfortokens: svrcfstok structures not cleaned up (retries = 1100) PROBLEM: (92409) (PATCH ID: TCR520-212) ******** This patch fixes a CNX manager panic encountered while multiple cluster nodes are booted simultaneously. The panic string seen is: CNX MGR: Invalid configuration for cluster seq disk PROBLEM: (94120) (PATCH ID: TCR520-174) ******** This patch fixes a regression introduced in tcr51asupportos.bl2 and wcalphaos; because that code was never shipped, customers cannot encounter the problem. PROBLEM: (93635) (PATCH ID: TCR520-164) ******** This fix addresses a problem in which two nodes leaving the cluster within a short (but not too short) time period would cause I/Os on some devices to get stuck. PROBLEM: (93870) (PATCH ID: TCR520-165) ******** This fix addresses a problem in which a new device would not be properly configured in a cluster if the device was discovered during a boot. On some of the booting nodes the device would not be considered locally connected although it is.
This can create availability problems later. PROBLEM: (93996) (PATCH ID: TCR520-185) ******** The Device Request Dispatcher (DRD) should retry getting disk attributes when EINPROGRESS is returned from the disk driver. This problem can be seen by deleting a device in a cluster and then adding it back. The console message is: drd_get_disk_attributes (1234) - ksm_get_attributes failed 36 PROBLEM: (DE_G04593) (PATCH ID: TCR520-210) ******** This patch addresses an issue with ICS overloading RAD 0 on a NUMA-based system. PROBLEM: (94911, 95063) (PATCH ID: TCR520-215) ******** This patch fixes a possible race condition between a SCSI reservation conflict and an I/O drain, which could result in a hang. The race condition occurs when a SCSI event, such as a path failover, causes a reservation conflict while a cluster member is in the process of issuing an I/O barrier due to an event such as a member transition. This results in a hang on the cluster member attempting to barrier. Examination of the system in this state, or of a forced crash dump, will reveal one or more drd_event_threads sleeping in ccmn_send_ccb_wait3(). The hang is ultimately caused by in-flight I/Os that are pending due to the above thread. Here is a typical stack trace:

THREAD: fffffc0003816e00
0 thread_block
1 sleep_prim
2 mpsleep
3 ccmn_send_ccb_wait3
4 ccmn_path_ping3
5 ccmn_resolve_paths3
6 cdisk_ioctl
7 drd_issue_local_ioctl
8 drd_check_path
9 drd_handle_event_io_drained
10 drd_handle_one_event
11 drd_handle_events
12 drd_event_thread

PROBLEM: (94385) (PATCH ID: TCR520-186) ******** This fix adds support for multiple opens of tape libraries/media changers. Prior to this fix, the Device Request Dispatcher would fail multiple opens on tape libraries/media changers, returning EBUSY (errno 16). PROBLEM: (92799) (PATCH ID: TCR520-216) ******** This patch alleviates a condition in which a cluster member takes an extremely long time to boot when using LSM.
The problem occurs when a Fibre Channel disk that belongs to an LSM set goes bad. The condition is seen while booting a system into a cluster where the other members are far enough along to recognize their LSM sets. Immediately after the "starting LSM" boot message, the booting system will appear to hang and will periodically output the following message to the user console: "DRD failed register against returned 5" PROBLEM: (90608) (PATCH ID: TCR520-191) ******** This patch corrects a reference counting problem on objects related to mountpoints. The problem had resulted in the unexpected persistence of an object seen during cluster mount, and the subsequent failure of an assertion within the code. The problem can be recognized by the panic string "CFS_JOIN_COMMIT: NO DB ENTRY OR INFO STRUCT". PROBLEM: (92789) (PATCH ID: TCR520-173) ******** This patch relieves pressure on the CMS global DLM lock by allowing AutoFS auto-mounts to back off when their lock requests are not granted within a reasonable amount of time. This can help avoid turning a transient slowdown into one which is more persistent. PROBLEM: (93923) (PATCH ID: TCR520-171) ******** This patch addresses a potential panic in the Cluster File System which can occur when using raw asynchronous I/O.
When the problem occurs, the symptom will be a locking violation panic with the following string: "mcs_unlock: current lock not found" and a stack trace ending in either cfs_condio_iodone() or cfs_condio_issue_io(), such as:

4 panic src/kernel/bsd/subr_prf.c : 1309
5 simple_lock_fault src/kernel/kern/lock.c : 2805
6 mcs_unlock_found_violation src/kernel/kern/lock.c : 3142
7 cfs_condio_iodone src/kernel/tnc_common/tnc_cfe/cfs_directio.c : 870
8 biodone src/kernel/vfs/vfs_bio.c : 1682
9 volkiodone src/kernel/lsm/dec/kiosubr.c : 235
10 volsiodone src/kernel/lsm/common/siosubr.c : 358
11 vol_mv_write_done src/kernel/lsm/common/mvio.c : 3596
12 voliod_iohandle src/kernel/lsm/common/iod.c : 569
13 voliod_loop src/kernel/lsm/common/iod.c : 372

PROBLEM: (94580) (PATCH ID: TCR520-190) ******** This patch addresses an assertion failure which can occur in the Cluster File System when file system quotas are in use. The problem can only happen if a user has opened a very large number of files (at least 32768) since the cluster was booted. The assertion failure is: Assert Failed: dq->cfs_dq_cnt > 0 The stack traceback will be similar to the following:

1 panic src/kernel/bsd/subr_prf.c
2 cfsdb_assert src/kernel/tnc_common/tnc_cfe/alpha/cfs_debug.c
3 cfs_dqget src/kernel/tnc_common/tnc_cfe/alpha/cfs_quota.c
4 cfs_getinoquota src/kernel/tnc_common/tnc_cfe/alpha/cfs_quota.c
5 cfs_rwvp_cache src/kernel/tnc_common/tnc_cfe/alpha/cfs_vm_alpha.c
6 cfs_cachewrite src/kernel/tnc_common/tnc_cfe/alpha/cfs_vm_alpha.c
7 cfs_write src/kernel/tnc_common/tnc_cfe/cfs_vm_osi.c
8 vn_write src/kernel/vfs/vfs_vnops.c
9 rwuio src/kernel/bsd/sys_generic.c
10 write src/kernel/bsd/sys_generic.c
11 syscall src/kernel/arch/alpha/syscall_trap.c
12 _Xsyscall src/kernel/arch/alpha/locore.s

PROBLEM: (94645, 94795) (PATCH ID: TCR520-209) ******** This patch fixes kernel memory faults that can happen if invalid arguments are supplied to the mount system call on a cluster.
A typical stack traceback might be either of the following:

Example 1
---------
0 panic src/kernel/bsd/subr_prf.c
1 trap src/kernel/arch/alpha/trap.c
2 _XentMM src/kernel/arch/alpha/locore.s
3 strlen src/kernel/arch/alpha/fastcopy.s
4 cms_mfs_mount_args src/kernel/tnc_common/tnc_cfe/cms_utils.c
5 cms_copy_mount_args src/kernel/tnc_common/tnc_cfe/cms_utils.c
6 cms_mount_preprocess src/kernel/tnc_common/tnc_cfe/cms_kgs.c
7 cluster_mount src/kernel/tnc_common/tnc_cfe/cfs_mnthooks.c
8 mount1 src/kernel/vfs/vfs_syscalls.c
9 syscall src/kernel/arch/alpha/syscall_trap.c
10 _Xsyscall src/kernel/arch/alpha/locore.s

Example 2
---------
0 panic src/kernel/bsd/subr_prf.c
1 trap src/kernel/arch/alpha/trap.c
2 _XentMM src/kernel/arch/alpha/locore.s
3 copystr src/kernel/arch/alpha/copy.c
4 namei src/kernel/vfs/vfs_lookup.c
5 cms_getmdev src/kernel/tnc_common/tnc_cfe/cms_utils.c
6 cms_ufs_device_list src/kernel/tnc_common/tnc_cfe/cms_utils.c
7 cms_select_cfs_server src/kernel/tnc_common/tnc_cfe/cms_kgs.c
8 cms_ufs_mount_initial src/kernel/tnc_common/tnc_cfe/cms_utils.c
9 cms_mount_preprocess src/kernel/tnc_common/tnc_cfe/cms_kgs.c
10 cluster_mount src/kernel/tnc_common/tnc_cfe/cfs_mnthooks.c
11 mount1 src/kernel/vfs/vfs_syscalls.c
12 syscall src/kernel/arch/alpha/syscall_trap.c
13 _Xsyscall src/kernel/arch/alpha/locore.s

PROBLEM: (94643) (PATCH ID: TCR520-194) ******** When multiple nodes are shutting down together and there are server-only filesystems mounted, it is possible that some nodes will enter retry logic which will never end. This occurs far enough into the system shutdown processing that the node will generally be unusable, but before the "syncing disks..." message is printed to the console. PROBLEM: (89024/94321) (PATCH ID: TCR520-183) ******** This patch fixes a potential system crash which could occur when adding a cluster alias. This could be seen as a kernel memory fault in cluaioc_alias or other routines while accessing the inifaddr hash.
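The mount-argument faults shown in the tracebacks above arise from the kernel dereferencing an unvalidated user-supplied string (strlen or copystr on a bad pointer). The following is a minimal sketch of the defensive bounded-copy pattern such a fix relies on; the function name copy_mount_arg, the buffer sizes, and the check order are illustrative assumptions, not the actual Tru64 routines:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative sketch only: validate a mount argument string before use.
 * The real kernel would use copyinstr() across the user/kernel boundary;
 * here we model the two checks that prevent the fault: reject NULL before
 * any dereference, and never scan past the destination buffer looking
 * for a terminator. */
#define ARG_MAX_LEN 256  /* hypothetical cap on a mount argument string */

static int copy_mount_arg(const char *uarg, char *kbuf, size_t kbuflen)
{
    if (uarg == NULL || kbuf == NULL || kbuflen == 0)
        return EINVAL;               /* reject NULL before dereferencing */

    for (size_t i = 0; i < kbuflen; i++) {  /* bounded copy */
        kbuf[i] = uarg[i];
        if (uarg[i] == '\0')
            return 0;                /* NUL found within bounds: success */
    }
    return ENAMETOOLONG;             /* no terminator within the buffer */
}
```

Either check failing returns an errno to the caller instead of letting an unbounded strlen() walk off into unmapped memory and panic.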
PROBLEM: (93677) (PATCH ID: TCR520-176) ******** This patch improves the responsiveness of EINPROGRESS handling during the issuing of I/O barriers. The fix removes a possible infinite loop scenario which could occur due to the deletion of a storage device. The issue with EINPROGRESS responsiveness was continued looping while waiting for a disk structure to become available: no attempts were being made to force the availability of the disk structure, no retry limit was being enforced, and no checks were being made for deleted devices. This combination presented the possibility of infinite retry attempts. PROBLEM: (94069, 93505) (PATCH ID: TCR520-178) ******** This patch relieves pressure on the CMS global DLM lock by allowing AutoFS auto-unmounts to back off when their lock requests are not granted within a reasonable amount of time. This can help avoid turning a transient slowdown into one which is more persistent. PROBLEM: (94166, 95439) (PATCH ID: TCR520-187) ******** This patch adds data validation to the code which encodes/decodes token messages in the cluster, in order to assist in problem isolation and diagnosis. PROBLEM: (95288) (PATCH ID: TCR520-226) ******** This patch prevents a "simple_lock: uninitialized lock" panic during bootup. Code that was previously added to help diagnose an infrequent problem with filesystem messages passed between cluster nodes can cause this panic between the point at which the node joins the cluster at the CNX level and the completion of the code that establishes the node's filesystem state as part of the cluster (while the global variable cfs_set_join_completed is still 0).
A typical stack trace is:

0 boot
1 panic
2 simple_lock_fault
3 simple_lock_valid_violation
4 ckidtokgs
5 check_cfs_infs
6 xdr_cfs_infs
7 xdr_cfswriteargs
8 xdr_reference
9 xdr_pointer
10 xdr_cfswriteargs_p
11 icsxdr_decode
12 icssvr_decode_xdr
13 svr_rcfs_write
14 icssvr_daemon_from_pool

PROBLEM: (95541, 95368) (PATCH ID: TCR520-243) ******** If a quorum disk is on a parallel SCSI bus, bus resets will cause the quorum disk to be placed into the MUNSA reject state. This prevents all I/O to the disk. If the quorum disk's votes are needed to maintain quorum, those votes will be lost, resulting in the cluster hanging.
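The bounded-retry approach described for EINPROGRESS handling above (PROBLEMs 93677 and 93996) can be sketched as follows. Every name here (struct drd_disk, attrs_fetch, DRD_ATTR_MAX_RETRIES, the field names) is a hypothetical stand-in for the actual DRD code; the point is only the shape of the loop: a retry cap, a deleted-device check on every pass, and distinct errors for each exit path instead of spinning forever:

```c
#include <assert.h>
#include <errno.h>

#define DRD_ATTR_MAX_RETRIES 10  /* illustrative retry cap */

struct drd_disk {
    int deleted;                 /* set when the device is removed */
    int attempts_until_ready;    /* simulates driver readiness delay */
};

/* Simulated driver call: returns EINPROGRESS until the device is ready.
 * Stands in for something like ksm_get_attributes() in the real code. */
static int attrs_fetch(struct drd_disk *dp)
{
    if (dp->attempts_until_ready > 0) {
        dp->attempts_until_ready--;
        return EINPROGRESS;
    }
    return 0;
}

static int drd_get_attrs_bounded(struct drd_disk *dp)
{
    for (int retry = 0; retry < DRD_ATTR_MAX_RETRIES; retry++) {
        if (dp->deleted)
            return ENODEV;       /* device vanished: stop retrying */
        int err = attrs_fetch(dp);
        if (err != EINPROGRESS)
            return err;          /* success, or a hard error */
        /* the real code would sleep or yield here before retrying */
    }
    return ETIMEDOUT;            /* retry limit reached */
}
```

Without the deleted check and the retry cap, a device deleted mid-loop leaves the caller retrying EINPROGRESS indefinitely, which is exactly the infinite-loop scenario the patch removes.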