TITLE: HP Tru64 UNIX - Corrects device related hangs, panics and boot issues.

Copyright (c) Hewlett-Packard Company 2006. All rights reserved.

PRODUCT: HP TruCluster Server [R] V5.1B-3

SOURCE: Hewlett-Packard Company

ECO INFORMATION:

    ECO Name: TCRKIT1001020-V51BB26-E-20061205
    ECO Kit Approximate Size: 0.00MB
    Kit Applies To: HP TruCluster Server V5.1B-3 PK5 (BL26)

ECO KIT CHECKSUMS:

    /usr/bin/sum results:
    /usr/bin/cksum results:
    MD5 results:
    SHA1 results:

ECO KIT SUMMARY:

A dupatch-based, Early Release Patch kit exists for HP TruCluster Server V5.1B-3 that contains solutions to the following problem(s):

This patch fixes a configuration issue found in non-CAM devices and CD_ROM devices.

This patch improves the reliability of the Tru64 Cluster DRD subsystem when faced with tape devices and tape device failures.

1) There was a timing hole where two opens could be sent down to the tape driver at the same time. Before the tape driver checked whether it was already open, the paths could be changed, which resulted in a kernel memory fault panic. A typical stack trace for the panic would be:

    THREAD 1
    drd_open()
    drd_set_tape_changer_server()
    drd_check_path()
    drd_issue_local_ioctl()
    ctape_ioctl()
    ccmn_path_setup3
    ccmn_alloc_path3()
    cmn_reg_hier_path3

    THREAD 2
    drd_open()
    drd_local_open()
    drd_local_device_open()
    drd_issue_local_ioctl()
    ctape_ioctl()
    ctape_verify_path()
    ccmn_path_setup3
    ccmn_del_stale_paths3()
    ccmn_destroy_invalid_paths()
    ccmn_reg_hier_path3

2) When a device is deleted via hwmgr while an open is in progress, the open can hang. This patch removes the timing hole that allows the open to progress to the point where it hangs.

3) When a device fails, all current IOs are returned with an appropriate error status code. If the upper layers continue to send IOs after the device has been marked as failed, IOs can hang in drd.

4) This patch also fixes barrier issues when devices fail and a barrier is in progress.

Symptoms for 2, 3 and 4.
Status of a drd disk with stalled IOs:

    drd_disk            d_hwid  d_state  d_flags     d_type  errno   eei     d_bp_cnt
    0xfffffc00f4fe0e00  0x0086  0x0003   0x0a800081  0x0000  0x0013  0x0000  1

    DRD_FAILED DRD_DISK_BLOCKED DK_DAIO_DISK DRD_DISK_NOT_USABLE
    bp 0xfffffc00291b3500 00:02:24.180
    DRD_DRAINED_FLAGS DRD_DISK_FAILED DRD_STOP_SERVER DRD_DO_NOT_DELETE
    DRD_IS_BARRIERABLE DRD_CAM_REGISTERED

Typical thread trace for vold threads at the time of hung IOs:

    >0 thread_block
     1 volsiowait
     4 volsioctl_rea
     5 spec_ioctl
     6 vn_ioctl
     7 ioctl_base
     8 syscall
     9 _Xsyscall

This patch fixes an error in the DRD subsystem wherein uninitialized disk attributes can cause a system panic.

a) This patch fixes an error in the DRD subsystem wherein a few uninitialized disk attributes could result in a system panic with the following or similar stack trace:

     4 panic
     5 trap
     6 _XentMM
     7 free
     8 drd_release_bp_resources
     9 drd_ics_io
    10 drd_ics_read
    11 svr_drd_ics_read
    12 icssvr_daemon_from_pool

This problem appears when an open or read is attempted on deleted XCR disks.

This patch also fixes an error during failback of a tape device wherein the character devt is not restored properly.

Corrects a problem where the DRD event thread may run indefinitely while responding to a bid server transaction.

This patch fixes a problem wherein the DRD subsystem may cause a system panic because strategy routines may be called from a light weight context (LWC).

Corrects a problem with the DRD subsystem where strategy routines can be called from a light weight context (LWC). This could result in a system panic with the following or similar stack trace:

     0 boot
     1 panic
     2 thread_block
     3 lock_wait
     4 lock_write
     5 (source file cannot be determined)
     6 (source file cannot be determined)
     7 (source file cannot be determined)
     8 drd_restart_io
     9 drd_io_barrier_complete_timeout
    10 softclock_scan
    11 lwc_schedule
    12 exception_exit

Fixes a hang with disklabel(8) that occurred if a local open failed for the same disk simultaneously.
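
In the light weight context (LWC) trace above, a timeout routine (drd_io_barrier_complete_timeout) reaches drd_restart_io, which then waits on a lock. This notice does not say how the patch avoids that. As an illustration only, the sketch below shows one general way to keep blocking work out of a callback that must not block: the callback only queues a request, and an ordinary thread does the work that may wait. It is written in Python rather than kernel C, and every name in it (barrier_timeout_callback, restart_worker, the "dsk3" and "dsk7" labels) is hypothetical and not a Tru64 interface.

```python
import queue
import threading

# Hypothetical names throughout; none of these are Tru64 kernel interfaces.
restart_queue = queue.Queue()     # unbounded, so put_nowait() never blocks
device_lock = threading.Lock()    # stands in for a lock that may have to wait
restarted = []                    # records the order in which work was done


def barrier_timeout_callback(disk_id):
    """Runs in a context that must not block: only queue the request."""
    restart_queue.put_nowait(disk_id)


def restart_worker():
    """Runs in an ordinary thread, where waiting for the lock is safe."""
    while True:
        disk_id = restart_queue.get()
        if disk_id is None:       # sentinel value: stop the worker
            break
        with device_lock:         # a blocking acquire is acceptable here
            restarted.append(disk_id)


worker = threading.Thread(target=restart_worker)
worker.start()

# Simulate two timeout callbacks firing, then shut the worker down.
barrier_timeout_callback("dsk3")
barrier_timeout_callback("dsk7")
restart_queue.put(None)
worker.join()

print(restarted)
```

The callback never touches device_lock, so it cannot end up waiting. With a single worker and a first-in, first-out queue, requests are handled in the order they were queued.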
Corrects reference counting issues within the DRD subsystem that can prevent the deletion of hwids.

Fixes a disk I/O hang in DRD. This patch fixes a problem in DRD that could cause commands such as disklabel and showfdmn, or any file system I/O, to hang. A typical stack trace is as follows:

     0 thread_block
     1 sleep_prim
     2 mpsleep
     3 drd_reopen_partitions
     4 drd_change_server_node
     5 drd_complete_failback
     6 drd_handle_event_io_drained
     7 drd_handle_one_event
     8 drd_handle_events
     9 drd_event_thread

DRD now plays an active role in the device deletion callback and voting. In the past, drd was notified via an evm event only after the device deletion had occurred. This caused numerous panics and hung devices, because drd could attempt to access a deleted device. With this fix, drd no longer accesses a device that has a deletion pending or in progress.

This patch fixes an issue of DRD returning incorrect device information when the hwid is not found.

Corrects an existing timing hole.

Provides a fix for a Kernel Memory Fault in the drd disk code. A typical stack trace of the problem is as follows:

     0 boot
     1 panic
     2 trap
     3 _XentMM
     4 simple_lock_D
     5 drd_add_server
     6 drd_find_local_disks
     7 drd_config_thread

Fix for DRD_IOCTL_ERROR handling for tape devices.

Fixes a Kernel Memory Fault in the IO path for served disks and for stalled IOs. A typical stack trace of the problem is as follows:

     0 stop_secondary_cpu
     1 panic
     2 event_timeout
     3 printf
     4 panic
     5 trap
     6 _XentMM
     7 drd_ics_get_disk
     8 drd_ics_io
     9 drd_ics_read
    10 svr_drd_ics_read
    11 icssvr_daemon_from_pool

Fixes disk access issues that show up early in the boot process. This problem could result in a system panic with the following or similar stack trace:

    PANIC: "CNX MGR: Invalid configuration for cluster seq disk"

     0 boot
     1 panic
     2 init_globals
     3 init_cnx
     4 cnx_subsys_configure
     5 cnx_callback
     6 dispatch_callback
     7 main
     8 main

Fixes a hang during cluster bootup caused by early reservation conflicts.
During cluster bootup, the following warning message appears and the node hangs until another node comes up:

    "WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"

Fixes a cluster hang during cluster bootup that occurs when local disk open operations fail while disklabel is in progress.

This patch corrects an erroneous error message that drdmgr can display when relocating a device. For example:

    drdmgr: Error, Uknown error -1431655766 for device 'tape0' attribute DRD_SERVER

Handles reservation conflict errors to address a cluster node hang during boot. During cluster booting, the following warning message appears and the node may hang until the second node comes up. A typical message that appears on the console when the node hangs is:

    "WARNING: cfs_perform_glroot_mount: cfs_mountroot_local failed to mount"

This is due to the path being configured later in the boot process, resulting in a reservation conflict.

Allows retries of a disk open at boot time if the device is in the MUNSA reject state. A disk open can fail if the device is currently in the MUNSA reject state. This can result in boot hang conditions while the system is being booted.

The Patch Kit Installation Instructions and the Patch Summary and Release Notes documents provide patch kit installation and removal instructions and a summary of each patch. Please read these documents prior to installing patches on your system.

The patches in this ERP kit will also be available in the next mainstream patch kit - HP TruCluster Server V5.1B-6.

INSTALLATION NOTES:

1) Install this kit with the dupatch utility that is included in the patch kit. You may need to baseline your system if you have manually changed system files on your system. The dupatch utility provides the baselining capability.

2) The patch in this ERP kit does not have any file intersections with any other ERP available at this time for this product version.
3) This ERP kit will NOT install over any Customer Specific Patches (CSPs) which have file intersections with this ERP kit. Contact your normal Service Provider for assistance if the installation of this ERP kit is blocked by any of your installed CSPs.

INSTALLATION PREREQUISITES:

You must have installed HP TruCluster Server V5.1B-3 PK5 (BL26) prior to installing this Early Release Patch Kit.

SUPERSEDE INFORMATION:

    TCRKIT1000547-V51BB26-E-20060420

KNOWN PROBLEMS WITH THE PATCH KIT:

None.

RELEASE NOTES FOR TCRKIT1001020-V51BB26-E-20061205:

[R] UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.

Copyright Hewlett-Packard Company 2006. All Rights reserved.

This software is proprietary to and embodies the confidential technology of Hewlett-Packard Company. Possession, use, or copying of this software and media is authorized only pursuant to a valid written license from Hewlett-Packard or an authorized sublicensor.

This ECO has not been through an exhaustive field test process. Due to the experimental stage of this ECO/workaround, Hewlett-Packard makes no representations regarding its use or performance. The customer shall have the sole responsibility for adequate protection and back-up data used in conjunction with this ECO/workaround.