ECO NUMBER:      ALPPORTS01_062
PRODUCT:         OpenVMS Alpha OPERATING SYSTEM 6.2
UPDATE PRODUCT:  OpenVMS Alpha OPERATING SYSTEM 6.2

COVER LETTER
22 October 1998

1  KIT NAME:

   ALPPORTS01_062

2  KITS SUPERSEDED BY THIS KIT:

   ALPDRIV17_H3062

3  KIT DEPENDENCIES:

   3.1  The following remedial kit(s) must be installed BEFORE
        installation of this, or any required kit:

        None

   3.2  In order to receive all the corrections listed in this kit,
        the following remedial kits should also be installed:

        None

4  KIT DESCRIPTION:

   4.1  Version(s) of OpenVMS to which this kit may be applied:

        OpenVMS Alpha V6.2, V6.2-1H1, V6.2-1H2, V6.2-1H3

   4.2  Files patched or replaced:

        o  [SYS$LDR]SYS$PCADRIVER.EXE (new image)
        o  [SYS$LDR]SYS$PNDRIVER.EXE (new image)
        o  [SYS$LDR]SYS$SCS.EXE (new image)

5  PROBLEMS ADDRESSED IN ALPPORTS01_062 KIT

   o  An AlphaServer node booting into a CI cluster, with CIPCA or
      CIXCD, may fail to join the cluster and hang.

      On node boot-up, or after virtual-circuit failure recovery, with
      CIPCA or CIXCD adapters, an SCS "CONNECTION-REQUEST" (CON_REQ)
      SCS-control message may be lost. This will suspend all SCS-sysap
      connection formation activity on a given CI-virtual-circuit
      (SCS path-block SCSMSG lost).

      Under V6.2/V6.2-1Hx, this problem was responsible for the
      "Virtual-Circuit-Timeout" errors frequently seen on booting
      Alpha/CIPCA/CIXCD nodes. Changes made in OpenVMS V7.1 to
      VC-timeout detection, intended to reduce "nuisance errors",
      caused the SCS-lost-message hang described here.

      Using SDA> SHOW CONNECTION on a crash-dump from a system with the
      "lost SCS control message" will show 1 sysap in "CON_SENT", and
      0 or more sysaps in an xxx_pend state:

      SDA> SHOW CONNECTION

                        --- CDT Summary Page ---

      CDT Address  Local Process     Connection ID  State      Remote
      -----------  -------------     -------------  -----      --------
      8105D720     SCS$DIRECTORY     DB1F0000       listen
        .
      8105E530     SCS$DIR_LOOKUP    DB1F0009       con_sent   ADEBUG
        .
      8xxxxxxx     SCS$DIRECTORY     DB1F000A       accp_pend  ADEBUG
        .
      8xxxxxxx     VMS$DISK_CL_DRVR  6DB70006       con_pend   PTMANB
        .
        .
      ------------------------------------------------

      Use SDA> SHOW CLUSTER/SCS to find the path-block for the problem
      virtual-circuit, and note that "SCS MSGBUF" is empty, confirming
      loss of the single SCS-control-message allocated for each
      virtual-circuit (path-block):

      VMScluster data structures
      --------------------------
                   --- Path Block (PB) 80DAFF40 ---

      Status: 0020 credit

      Remote sta. addr.    000000000005   Remote port type      CIPCA
      Remote state         ENAB           Number of data paths  2
      Remote hardware rev. 00000015       Cables state          A-OK B-OK
      Remote func. mask    ABFF0D00       Local state           OPEN
      Reseting port        00             Port dev. name        PNA0
      Handshake retry cnt. 1              SCS MSGBUF address    00000000
                                                                ========
      Msg. buf. wait queue 80DAFF78       PDT address           80DA6B00
      --------------------------------------------

      Confirming symptoms on remote "victim" nodes is not as reliable
      or foolproof. The typical SCS-connection state, from SDA> SHOW
      CONNECTIONS, would show SCS-sysap-process connections hung in
      CON_ACK state, since the remote/culprit node has lost the
      SCS-control-message for returning an "ACCP_REQ" (accept request):

                        --- CDT Summary Page ---

      CDT Address  Local Process     Connection ID  State      Remote
      -----------  -------------     -------------  -----      ------
      8105D720     SCS$DIRECTORY     DB1F0000       listen
        .
      8105E530     SCS$DIR_LOOKUP    DB1F0009       con_ack    ANDA1A

   o  CIXCD (CIMNA) and CIPCA adapters will generate MFQE
      (Message-Free-Queue-Empty) interrupts, causing a CI-adapter
      "RESET" and temporary loss of all virtual-circuits
      (mount-verification, etc.) when using the StorageWorks Control
      Console (SWCC V2.0) or HSJ-console monitoring scripts using
      FYDRIVER/MSCP$DUP. The following console messages and
      DECevent-formatted error-log messages will be seen:

      CONSOLE:

      %PNA0 - Software Shutting Down Port

      ERROR-LOG:

      **************************** ENTRY 1 ***********************
      Logging OS                      1. OpenVMS
      System Architecture             2. Alpha
      OS version                      V7.1
      Event sequence number           1.
      Timestamp of occurrence         01-JAN-1996 00:00:11
      Time since reboot               0 Day(s) 0:00:11
      Host name                       CSG84
      System Model                    AlphaServer 8400 Model 5/300
      Entry type                      100. Logged Message

      ---- Device Profile ----
      Unit                            CSG84$PNB0
      Product Name                    CIXCD (XMI to CI Adapter);
                                      ** OR ** CIPCA (PCI to CI Adapter)

      ---- MSCP Logged Msg ----
      Logged Message Type Code        3. Port Message
      Error Type/SubType              xC002 Signaled via Packet, Software
                                      SHUTTING DOWN Port.
                                      Port will be RE-STARTED.
      Count - Remaining Retries       50.
        .
        .
        .
      ***************************************************************

   o  System fails to boot OpenVMS V7.1 when boot path is KFMSB
      (XMI-to-DSSI)/HSD10. The following console and error-log events
      are generated:

      CONSOLE:

      %PNB0 - Software Shutting Down Port

      DECEVENT-FORMATTED ERROR-LOG:

      ************************** ENTRY 1 ***********************
      Logging OS                      1. OpenVMS
      System Architecture             2. Alpha
      OS version                      V7.1
      Event sequence number           1.
      Timestamp of occurrence         01-JAN-1996 00:00:11
      Time since reboot               0 Day(s) 0:00:11
      Host name                       CSG84
      System Model                    AlphaServer 8400 Model 5/300
      Entry type                      100. Logged Message

      ---- Device Profile ----
      Unit                            CSG84$PNB0
      Product Name                    KFMSB (XMI to DSSI Adapter)

      ---- MSCP Logged Msg ----
      Logged Message Type Code        3. Port Message
      Error Type/SubType              xC002 Signaled via Packet, Software
                                      SHUTTING DOWN Port.
                                      Port will be RE-STARTED.
      Count - Remaining Retries       50.
      Error Count                     1.
      Local Station Address           x0000000000000007
        .
        .
        .
      *************************************************************

   o  On booting VMS, both the CIXCD and CIPCA will generate "Path
      LOOPBACK" error-messages on the console and in the error-log.
      This error has occurred since initial release of Alpha OpenVMS
      V1.0 and since the CIPCA was introduced with V6.2-1H2. The
      following console and error-log entries will appear:

      CONSOLE:

      %PNA0, Path #0. Loopback has gone from GOOD to BAD
      %PNB0, Path #0.
      Loopback has gone from GOOD to BAD

      DECEVENT-FORMATTED ERROR-LOG:

      ************************ ENTRY 6 *************************
      Logging OS                      1. OpenVMS
      System Architecture             2. Alpha
      OS version                      V6.2-1H3
      Event sequence number           1.
      Timestamp of occurrence         03-NOV-1997 13:53:48
      Time since reboot               0 Day(s) 0:00:20
      Host name                       CSG84
      System Model                    AlphaServer 8400 Model 5/300
      Entry type                      100. Logged Message

      ---- Device Profile ----
      Unit                            CSG84$PNB0
      Product Name                    CIXCD (XMI to CI Adapter)

      ---- MSCP Logged Msg ----
      Logged Message Type Code        3. Port Message
      Error Type/SubType              x4106 Cable Status Change, Path #0.
                                      Loopback went from GOOD to BAD.
      Count - Remaining Retries       50.
      Error Count                     1.
      Local Station Address           x000000000000000D
      Local Station ID                x0000000000004DE8
      Remote Station Address          x0000FFFFFFFFFFFF  <- *** Unavailable
      Remote Station ID               x0000000000000000  <- *** Unavailable

      *** NOTE that no remote CI-station address is available
      *************************************************************

   o  CIPCA device-registers are not properly read and collected into
      the port-descriptor-table (PDT$) by the CIPCA.MAR/READ_REG:
      routine. This prevents accurate diagnosis of CIPCA adapter or
      port-driver errors by the CSCs or VMS Engineering.

   o  Two problems are corrected:

      -  CIPCA CORRUPTED CRCTX & BADDALRQSIZ BUGCHECK

         A BADDALRQSIZ bugcheck will result, following port-crash/reset
         on Alpha systems with more than 1 Gb. of memory. Also, this
         improper CRCTX "free-queue" reset causes an NPAGEDYN pool-leak
         of 64 CRCTX buffers (96. bytes x 64 = 6144 bytes) for each
         CIPCA device reset.

      -  DEVICE INIT-FAILURE BAP NPORT-CARRIER LEAKAGE

         BAP pool leakage will be seen after CIMNA, CIPCA, or KFMSB
         device-initialization failure. For CIPCA/CIMNA, all 14 NPORT
         stopper-CRRRs will be lost on each port reinit attempt,
         accumulating to 700 after the allowed 50 retries.
         (CRRR size = 192. bytes x 700 = 134,400. bytes).
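
      The leak sizes quoted in the two problems above can be rechecked
      with a short calculation. The following is only an illustrative
      sketch: the buffer sizes and counts are the figures given in this
      cover letter, while the function and constant names are
      hypothetical and do not correspond to any OpenVMS routine or data
      cell.

```python
# Pool-leak arithmetic for the two problems described above.
# All sizes and counts come from the cover letter text; every name
# below is hypothetical and exists only for this illustration.

CRCTX_SIZE_BYTES = 96        # size of one CRCTX buffer
CRCTX_LOST_PER_RESET = 64    # CRCTX buffers leaked per CIPCA device reset

CRRR_SIZE_BYTES = 192        # size of one NPORT stopper-CRRR
CRRR_LOST_PER_REINIT = 14    # stopper-CRRRs lost per port reinit attempt
MAX_REINIT_RETRIES = 50      # allowed port reinit retries


def npagedyn_leak(resets: int) -> int:
    """Bytes of NPAGEDYN leaked after the given number of CIPCA resets."""
    return resets * CRCTX_LOST_PER_RESET * CRCTX_SIZE_BYTES


def bap_leak(reinit_attempts: int = MAX_REINIT_RETRIES) -> int:
    """Bytes of BAP leaked after the given number of failed reinit attempts."""
    return reinit_attempts * CRRR_LOST_PER_REINIT * CRRR_SIZE_BYTES


if __name__ == "__main__":
    print(npagedyn_leak(1))   # one CIPCA reset
    print(bap_leak())         # all 50 retries exhausted
```

      With one reset and the full 50 retries, the two functions
      reproduce the 6144-byte and 134,400-byte totals stated above.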
   o  Following a CIPCA, CIXCD, or KFMSB port-reset (UCB$L_ERTCNT
      retry-count decrements), the VC_CHK_TIME's
      (virtual-circuit-timeout) deallocated TQE may be unintentionally
      returned to and re-queued by the EXE$SWTIMER_FORK::/SYSUB:
      system-routine from SYS$PN/PCAdriver. The 64-byte (0x40 byte)
      non-paged-pool lookaside list will be corrupted, and incorrectly
      linked into EXE$GL_TQFL.

      The TQE-requeue will *ONLY* occur if the TQE$V_REPEAT bit
      (byte-offset 0x0B, bit<2>) is set when VC_CHK_TIME: deallocates
      the TQE. This happens either when POOLCHECK is in use with a
      "deallocate poison pattern bit<2>=1", or when the TQE is
      immediately reallocated before VC_CHK_TIME returns to
      EXE$SWTIMER_FORK::/SYSUB:.

6  PROBLEMS ADDRESSED IN ALPDRIV17_H3062 KIT

   o  Messages such as:

      %PNA0, Inappropriate SCA Control Message - FLAGS/OPC/STATUS/PORT
      00/00/00/00

      may appear on the console, with associated errorlog messages, on
      systems with HSX disk controllers.

   o  Following a CI-port MFQE (message-free-queue-empty) interrupt,
      with no SCS-credit deficit (not in "optimistic SCS-credit mgmt.
      mode": MFQ entry-count = SCS Rcv-credits), a subsequent
      legitimate MFQE interrupt (with SCS-credit deficit) will result
      in a series of secondary errors causing port-resets, never
      expanding the MFQ queue, and posting of a series of these
      error-log entries (key ID: error-type/sub-type = 0x8102):

      Logging OS                      1. OpenVMS
      System Architecture             2. Alpha
      OS version                      V7.1
      Event sequence number           3653.
      Timestamp of occurrence         20-OCT-1997 00:01:27
      Time since reboot               2 Day(s) 12:14:28
      Host name                       GDC140
      System Model                    AlphaServer 8400 Model 5/300
      Entry type                      98. Asynchronous Device Attention

      ---- Device Profile ----
      Unit                            GDC140$PNA0
      Product Name                    CIXCD (XMI to CI Adapter)

      ------ Adapter Data -----
      Error Type/SubType              x8102 Hardware Error, Unspecified
                                      Port Hardware Error.
                                      Port will be RE-STARTED.
      Count - Remaining Retries       36.
      CASR                            x00000001
                                      Bit 0: Message Free Que EXHAUSTED
                                             (AMFQE)
      AMCSR                           x00000004
                                      Bit 2: Interrupt ENABLE (IE)
      PESR                            xFFFFFFFF
      XDEV                            x05110C2F
                                      Device Type is: 0x0C2F = CIMNA
                                      Device Revision is: 0x11 = A1
                                      Firmware Revision is: 0x05 = V-5
      ASNR                            x00000001
      XBER                            x00000040
                                      XMI Node ID is: 1.
                                      Commander ID is: 2 = Microcode CMDR
      XFADR                           xFFFFFE00
      XFAER                           x73FF0FFF
      PDCSR                           x00000001
      PFAR                            x0000055C
      Extra Longword 1                x00000000
      Extra Longword 2                x00000000
      Extra Longword 3                x00000000

      ----- Software Info -----
      UCB$x_ERTCNT                    128. Retries Remaining
      UCB$x_ERTMAX                    10. Retries Allowable
      UCB$x_STS                       x00000000
      UCB$x_ERRCNT                    30. Errors This Unit
      UCB$L_DEVCHAR1                  x0C450000
                                      Sharable
                                      Available
                                      Error Logging
                                      Capable of Input
                                      Capable of Output
      -------------------------------------------------------

   o  When using the ALPDRIV12_H3062 Cluster Ports TIMA Kit with
      non-NPORT (non-CIPCA, CIMNA, or KFMSB) SCS-port drivers, NPAGEDYN
      pool-corruption will occur in pool following the end of each
      non-NPORT PDT (port-descriptor-table: 1-per-SCS-port). Symptoms
      will vary according to how this NPAGEDYN is currently used but
      could consist of INVEXCPTN, SSRVEXCPTN, etc. ACCVIOs.

7  PROBLEMS ADDRESSED IN ALPDRIV12_H3062 KIT

   o  On Alpha systems, with many Virtual Circuit failures, the system
      will finally BUGCHECK with a CLUEXIT - or may simply hang. Within
      the subsequent dump, many CDTs (Connection Descriptor Table) in
      DISC_MATCH will be seen - and there will be no free CDTs.

   o  These problems only affect Turbolaser AS8200/8400 capable of
      exceeding 4 gigabyte memory sizes.

      CIMNA (NPORT CIXCD XMI-to-CI adapter for Laser/Turbolaser) and
      KFMSB (XMI-to-DSSI) adapters will fail to initialize or start
      under OpenVMS if non-paged-dynamic (NPAGEDYN) pool contains PFNs
      (physical pages) over 4 gigabytes (PA > 32-bits), and, BAP
      (bus-addressable-pool) is merged with NPAGEDYN due to the absence
      of a PCI bus on the system.
      If any of the NPORT structures (ABLK, AMPB, QBUFs, CRRRs, BDL,
      BDLT) contain physical addresses (PA) > 32-bits, these devices
      fail to start, producing the following errors. CDTs appear in
      various states when examined with the "SDA>

      1. CIMNA ERRORS
      ===============

      The CIMNA will exhibit "port-timeouts" or XMI
      transaction-timeout (TTO) memory-system errors on boot, such as:

      TURBOLASER CONSOLE LOG:
      -----------------------
      %PNA0, Port Error Bit(s) Set - CNF/PMC/PSR
             08110C2F/00000004/00000208
      %PNA0, Port is Reinitializing (48 Retries Left). Check the Error
             Log.
      ----------------------------------------------
      %PNA0, CI port timeout.
      %PNA0, Port is Reinitializing (49 Retries Left). Check the Error
             Log.
      ----------------------------------------------

      CIMNA ERROR LOG ENTRY:
      ----------------------
      ********************** ENTRY 2 *****************************
      Logging OS                      1. OpenVMS
      System Architecture             2. Alpha
      OS version                      V7.1
      Event sequence number           1.
      Timestamp of occurrence         01-JAN-1996 00:00:04
      Time since reboot               0 Day(s) 0:00:04
      Host name                       ANDA1A
      System Model                    AlphaServer 8400 5/300
      Entry type                      98. Asynchronous Device Attention

      ---- Device Profile ----
      Unit                            ANDA1A$PNA0
      Product Name                    CIXCD (XMI to CI Adapter)

      ------ Adapter Data -----
      Error Type/SubType              x8102 Hardware Error, Unspecified
                                      Port Hardware Error.
                                      Port will be RE-STARTED.
      Count - Remaining Retries       50.
      CASR                            00000208
                                      Bit 3: Memory System ERROR (MSE)
                                      Bit 9: Uninitialize State (UNIN)
      AMCSR                           x00000004
                                      Bit 2: Interrupt ENABLE (IE)
      PESR                            xFFFFFFFF
      XDEV                            x08110C2F
                                      Device Type is: 0x0C2F = CIMNA
                                      Device Revision is: 0x11 = A1
                                      Firmware Revision is: 0x08 = V-8
      ASNR                            x00000208
      XBER                            x8000A060
                                      Bit 13: Transaction Timeout (TTO)
                                      Bit 15: Command NoAck (CNAK)
                                      Bit 31: Error Summary (ES)
                                      XMI Node ID is: 1.
                                      Commander ID is: 3 = INTR
      XFADR                           xFFFFFFFF
                                      XMI Failing Addr[00:28]: x1FFFFFFF
                                      XMI Failing Addr[39]: x00000001
                                      Failing Length: x00000003
      XFAER                           x13FF0000
                                      Mask[00:15]: x00000000
                                      XMI Failing Addr[29:38]: x000003FF
                                      XMI Failing Command: 1, READ
      PDCSR                           x00000208
      PFAR                            x0000055C
      Extra Longword 1                x00000000
      Extra Longword 2                x00000000
      Extra Longword 3                x00000000

      ----- Software Info -----
      UCB$x_ERTCNT                    0. Retries Remaining
      UCB$x_ERTMAX                    0. Retries Allowable
      UCB$x_STS                       x10000000
      UCB$x_ERRCNT                    1. Errors This Unit
      UCB$L_DEVCHAR1                  x0C450000
                                      Sharable
                                      Available
                                      Error Logging
                                      Capable of Input
                                      Capable of Output
      ************************************************************

      2. KFMSB ERRORS
      ===============

      TURBOLASER CONSOLE LOG
      ----------------------
      "Port Error Bit(s) Set - CNF/PMC/PSR
       xxxxxxxx/xxxxxxxx/05008010"

      NOTE: The PSR (taken from the ASR: Adapter Status Register)
            translates to:

            - <04>         Adapter Abnormal Condition
            - <15>         Channel 1 flag
            - <30:24> (=5) Illegal Carrier Address

   o  SCS sysap data-transfer mapping requests will generate incorrect
      (mis-calculated) physical-address pointers, causing disk/tape
      data-transfer corruption, if the page_offset requested extends
      beyond the first page (>8Kb-1: Alpha page-size) of the requested
      transfer (page defined by SVAPTE in SCS$MAP request).

      SYS$SCS sources the page_offset from CDRP$L_BOFF, and sources
      SVAPTE from CDRP$L_SVAPTE, both of which are supplied by the SCS
      client sysap (DUDRIVER, CNXMGR, etc.).

      OpenVMS SCS sysaps are not believed to use CDRP$L_BOFF values
      > 8k-1 (Alpha page), but user-written SCS sysap applications
      might use a value > 13-bits since CDRP$L_BOFF is 32-bits
      (formerly CDRP$W_BOFF/16-bits).

   o  Performance is degraded on CIXCD and CIPCA based systems, when
      communicating with other NADP (non-alternating-dual-path) nodes
      during single-CI-path operation. This results in CI-cable
      failure/removal or CI single-path failures (NO_RESPONSE, NAK
      errors).
      NADP-supporting nodes currently are HSJ40, HSJ50, CIPCA, and
      CIMNA/CIXCD.

   o  The CIPCA will not properly re-initialize after a PCI-DMA-Engine
      "bus error" (PCI bus master abort or target abort). The port
      driver will repeatedly fail to re-initialize the port until the
      count of 50 retries is exhausted. The console OPA0 output is
      typically as follows:

      %PNB0, Port Error Bit(s) Set - NODESTS/CASR(H)/(L)
             02800001/001C0000/000001D0
      %PNB0, Port is Reinitializing ( 48 Retries Left). Check the Error
             Log.

   o  When booting Alpha machines the console may display messages
      such as:

      %PNA0, Inappropriate SCA Control Message - FLAGS/OPC/STATUS/PORT
      00/00/00/00

      with associated errorlog messages.

   o  The CIPCA Direct-DMA (DDMA) pool will not correctly initialize on
      AS4100 systems running OpenVMS-V6.2-1H3 or V62R with greater than
      1 Gb. of memory.

      One of the following 3 symptoms will occur following an NPAGEDYN
      expansion event, without a DDMA-pool when > 1 Gb. of memory is
      present:

      o  SPINWAIT system-crash

      o  System-hang which will respond to a ^P HALT request, and will
         generate a forced crash if the system-disk/dump-file is NOT on
         a CIPCA;

      o  System-hang which will not respond to a ^P HALT request. A
         system-reset (front-panel reset switch) is required to clear.
         NO DUMP is created.

      NOTE: All crashes/hangs also resulted in CIPCA LED
            error-code=PCI-DMA-ENGINE-RING-ERROR-1/0 (code=0x01C or
            0x01B/once).

      System data cells SCS$GQ_DDMA_BASE & SCS$GQ_DDMA_LEN will both
      contain a "00000000" value on AS4100 systems with CIPCA and 1 Gb.
      of memory (use SDA> to examine).

8  KIT INSTALLATION RATING:

   The following kit installation rating, based upon current CLD
   information, is provided to serve as a guide as to which customers
   should apply this remedial kit.
   (Reference attached Disclaimer of Warranty and Limitation of
   Liability Statement)

   INSTALLATION RATING:

   INSTALL_2 : To be installed by all customers using the following
               feature(s):

               Clusters

9  INSTALLATION INSTRUCTIONS:

   Install this kit with the VMSINSTAL utility by logging into the
   SYSTEM account, and typing the following at the DCL prompt:

   @SYS$UPDATE:VMSINSTAL ALPPORTS01_062 [location of the saveset]

   The saveset location may be a tape drive, or a disk directory that
   contains the kit saveset.

   The images in this kit will not take effect until the system is
   rebooted. If you have other nodes in your VMS cluster, they must
   also be rebooted in order to make use of the new image(s).

   If it is not possible or convenient to reboot the entire cluster at
   this time, a rolling re-boot may be performed.

   9.1  Installation Notes:

   o  Multiprocessor Systems with CIPCAs: SMP_SPINWAIT Restriction

      If your system uses a CIPCA adapter and you operate with
      MULTIPROCESSING set to a non-zero value, you must reset the value
      of the SMP_SPINWAIT parameter to 300000 (3 seconds) instead of
      the default 100000 (1 second).

      If you do not change the value of SMP_SPINWAIT, a CIPCA adapter
      error could generate a CPUSPINWAIT system bugcheck similar to the
      following:

      **** OpenVMS (TM) Alpha Operating System V7.1 - BUGCHECK ****
      ** Code=0000078C: CPUSPINWAIT, CPU spinwait timer expired

      This restriction will be removed in a future OpenVMS release.

      Note: This release note supersedes a similar release note, note
      4.15.2.4.5, in the OpenVMS Version 7.1 Release Notes manual as
      well as 6.2-1H3 sec:1.13.1, which also included a SYSTEM_CHECK
      parameter restriction. The SYSTEM_CHECK parameter restriction is
      incorrect. Furthermore, the earlier release note stated that the
      change to the SMP_SPINWAIT parameter was required for a
      MULTIPROCESSING parameter setting of 1 or 2. This requirement
      applies to all non-zero MULTIPROCESSING parameter settings.
   o  This ALPPORTS01_062 remedial kit removes the KFMSB/HSD10 booting
      restriction that was listed in the ALPDRIV17_H3062 remedial kit.
      The ALPPORTS01_062 kit can be used in KFMSB/HSD10 boot
      configurations.

Copyright (c) Compaq Computer Corporation, 1998 All Rights Reserved.
Unpublished rights reserved under the copyright laws of the United
States.

The software contained on this media is proprietary to and embodies
the confidential technology of Compaq Computer Corporation.
Possession, use, or dissemination of the software and media is
authorized only pursuant to a valid written license from Compaq
Computer Corporation.

DISCLAIMER OF WARRANTY AND LIMITATION OF LIABILITY

THIS PATCH IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND. ALL
EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES,
INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR
PARTICULAR PURPOSE, OR NON-INFRINGEMENT, ARE HEREBY EXCLUDED TO THE
EXTENT PERMITTED BY APPLICABLE LAW. IN NO EVENT WILL COMPAQ BE LIABLE
FOR ANY LOST REVENUE OR PROFIT, OR FOR SPECIAL, INDIRECT,
CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, WITH RESPECT TO ANY PATCH MADE
AVAILABLE HERE OR TO THE USE OF SUCH PATCH.