OpenVMS ALPPORTS01_071 Alpha V7.1 PCA/PNDRIVER _ SCS ECO Summary
TITLE: OpenVMS ALPPORTS01_071 Alpha V7.1 PCA/PNDRIVER _ SCS ECO Summary
Modification Date: 03-NOV-98
Modification Type: Updated Kit Supersedes ALPDRIV08_071
NOTE: An OpenVMS saveset or PCSI installation file is stored
on the Internet in a self-expanding compressed file.
The name of the compressed file will be kit_name-dcx_vaxexe
for OpenVMS VAX or kit_name-dcx_axpexe for OpenVMS Alpha.
Once the file is copied to your system, it can be expanded
by typing RUN compressed_file. The resultant file will
be the OpenVMS saveset or PCSI installation file which
can be used to install the ECO.
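As a small illustration of the naming rule in the note above, the following Python sketch derives the compressed file name from a kit name. The function name and the "vax"/"alpha" platform labels are assumptions made for this example; they are not part of any OpenVMS tool.

```python
def compressed_file_name(kit_name: str, platform: str) -> str:
    """Return the self-expanding file name for a kit, per the note above.

    `platform` is a hypothetical label chosen for this sketch:
    "vax" for OpenVMS VAX, "alpha" for OpenVMS Alpha.
    """
    suffixes = {"vax": "dcx_vaxexe", "alpha": "dcx_axpexe"}
    key = platform.lower()
    if key not in suffixes:
        raise ValueError(f"unknown platform: {platform!r}")
    return f"{kit_name}-{suffixes[key]}"


if __name__ == "__main__":
    # The Alpha file listed for this kit follows the same pattern.
    print(compressed_file_name("alpports01_071.a", "alpha"))
```

Once copied to the target system, the resulting file is expanded with RUN, as the note describes.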
Copyright (c) Compaq Computer Corporation 1998. All rights reserved.
PRODUCT: DIGITAL OpenVMS Alpha
COMPONENT: SYS$PNDRIVER.EXE
SYS$PCADRIVER.EXE
SYS$SCS.EXE
SOURCE: Compaq Computer Corporation
ECO INFORMATION:
ECO Kit Name: ALPPORTS01_071
ECO Kits Superseded by This ECO Kit: ALPDRIV08_071
ALPDRIV04_071
ECO Kit Approximate Size: 1134 Blocks
Kit Applies To: OpenVMS Alpha V7.1, V7.1-1H1, V7.1-1H2
System/Cluster Reboot Necessary: Yes
Rolling Re-boot Supported: Yes
Installation Rating: INSTALL_2
2 - To be installed on all systems running
the listed version(s) of OpenVMS and
using the following feature(s):
OpenVMS Clusters
Kit Dependencies:
The following remedial kit(s) must be installed BEFORE
installation of this kit:
None
In order to receive all the corrections listed in this
kit, the following remedial kits should also be installed:
None
ECO KIT SUMMARY:
An ECO kit exists for PCA/PNDRIVER.EXE and SCS.EXE on OpenVMS Alpha
V7.1 through V7.1-1H2. This kit addresses the following problems:
Problems addressed in ALPPORTS01_071:
o Use of CIXCD/CIPCA (NPORT) FAST_PATH I/O (performance enhancement
feature) mixed with non-FAST_PATH I/O, under heavy I/O loads,
can cause CIPCA-related "invalid scatter-gather map register"
machine check system-crashes and NPAGEDYN pool-corruption.
NOTE: FAST_PATH is only available under Alpha OpenVMS V7.0
and later. FAST_PATH is enabled by SYSGEN parameter
"FAST_PATH"=1, which defaults to "0" under OpenVMS
V7.0/V7.1.
o An AlphaServer node booting into a CI cluster, with CIPCA or
CIXCD, may fail to join the cluster and hang. On node boot-up,
or after virtual-circuit failure recovery, with CIPCA or CIXCD
adapters, an SCS "CONNECTION-REQUEST" (CON_REQ) SCS-control
message may be lost. This will suspend all SCS-sysap connection
formation activity on a given CI-virtual-circuit (SCS path-block
SCSMSG lost).
Under V6.2/V6.2-1Hx, this problem was responsible for the
"Virtual-Circuit-Timeout" errors frequently seen on booting
Alpha/CIPCA/CIXCD nodes. Changes made to VC-timeout detection
in OpenVMS V7.1, intended to reduce "nuisance errors", caused
the SCS-lost-message hang described here.
Using SDA> SHOW CONNECTION on a crash-dump from a system with
the "lost SCS control message" will show 1 sysap in "CON_SENT"
state, and 0 or more sysaps in an xxx_pend state:
SDA> SHOW CONNECTION
--- CDT Summary Page ---
CDT Address Local Process Connection ID State Remote
----------- ------------- ------------- ----- --------
8105D720 SCS$DIRECTORY DB1F0000 listen
.
8105E530 SCS$DIR_LOOKUP DB1F0009 con_sent ADEBUG
.
8xxxxxxx SCS$DIRECTORY DB1F000A accp_pend ADEBUG
.
8xxxxxxx VMS$DISK_CL_DRVR 6DB70006 con_pend PTMANB
.
.
------------------------------------------------
Using SDA> SHOW CLUSTER/SCS to find the path-block for the
problem virtual-circuit, note that "SCS MSGBUF" is empty,
confirming loss of the single SCS-control-message allocated
for each virtual-circuit (path-block):
VMScluster data structures
--------------------------
--- Path Block (PB) 80DAFF40 ---
Status: 0020 credit
Remote sta. addr. 000000000005 Remote port type CIPCA
Remote state ENAB Number of data paths 2
Remote hardware rev. 00000015 Cables state A-OK B-OK
Remote func. mask ABFF0D00 Local state OPEN
Reseting port 00 Port dev. name PNA0
Handshake retry cnt. 1 SCS MSGBUF address 00000000
========
Msg. buf. wait queue 80DAFF78 PDT address 80DA6B00
--------------------------------------------
Confirming symptoms on remote "victim" nodes is not as reliable
or foolproof. The typical SCS-connection state, from SDA> SHOW
CONNECTIONS would show SCS-sysap-process connections hung in
CON_ACK state, since the remote/culprit node has lost the
SCS-control-message for returning an "ACCP_REQ" (accept request):
--- CDT Summary Page ---
CDT Address Local Process Connection ID State Remote
----------- ------------- ------------- ----- ------
8105D720 SCS$DIRECTORY DB1F0000 listen
.
8105E530 SCS$DIR_LOOKUP DB1F0009 con_ack ANDA1A
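The SDA checks described for this problem can be sketched as a small Python script that scans saved SDA output. This is only an illustration: the function names, the row pattern (which assumes process names contain no spaces), and the "culprit"/"victim" labels are assumptions made here, and these states can also appear briefly during normal connection formation, so a match is a hint rather than a diagnosis.

```python
import re

# One CDT summary row: address, local process, connection ID, state,
# and an optional remote node name (layout assumed from the displays above).
ROW = re.compile(
    r"^\s*(?P<cdt>[0-9A-Fa-fx]{8})\s+(?P<process>\S+)\s+"
    r"(?P<conid>[0-9A-Fa-f]{8})\s+(?P<state>\w+)(?:\s+(?P<remote>\S+))?\s*$"
)
MSGBUF = re.compile(r"SCS MSGBUF address\s+([0-9A-Fa-f]{8})")


def connection_rows(sda_text):
    """Return (process, state, remote) for each CDT row found in the text."""
    rows = []
    for line in sda_text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((m["process"], m["state"].lower(), m["remote"]))
    return rows


def classify(rows):
    """Label a node by the connection states named in the text above."""
    states = {state for _, state, _ in rows}
    if "con_sent" in states:
        return "culprit"   # a sysap stuck in CON_SENT
    if "con_ack" in states:
        return "victim"    # connections hung in CON_ACK
    return "none"


def msgbuf_missing(path_block_text):
    """True if the path block shows an all-zero SCS MSGBUF address."""
    m = MSGBUF.search(path_block_text)
    return m is not None and int(m.group(1), 16) == 0


if __name__ == "__main__":
    sample = """\
8105D720 SCS$DIRECTORY DB1F0000 listen
8105E530 SCS$DIR_LOOKUP DB1F0009 con_sent ADEBUG
8xxxxxxx VMS$DISK_CL_DRVR 6DB70006 con_pend PTMANB
"""
    print(classify(connection_rows(sample)))
    print(msgbuf_missing("Handshake retry cnt. 1 SCS MSGBUF address 00000000"))
```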
o CIXCD (CIMNA) and CIPCA adapters will generate MFQE (Message-Free-
Queue-Empty) interrupts, causing a CI-adapter "RESET" and temporary
loss of all virtual-circuits (mount-verification, etc.) when using
the StorageWorks Control Console (SWCC V2.0) or HSJ-console monitoring
scripts using FYDRIVER/MSCP$DUP. The following console messages
and DECevent-formatted error-log messages will be seen:
CONSOLE:
%PNA0 - Software Shutting Down Port
ERROR-LOG:
**************************** ENTRY 1 ***********************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1
Event sequence number 1.
Timestamp of occurrence 01-JAN-1996 00:00:11
Time since reboot 0 Day(s) 0:00:11
Host name CSG84
System Model AlphaServer 8400 Model 5/300
Entry type 100. Logged Message
---- Device Profile ----
Unit CSG84$PNB0
Product Name CIXCD (XMI to CI Adapter);
** OR ** CIPCA (PCI to CI Adapter)
---- MSCP Logged Msg ----
Logged Message Type Code 3. Port Message
Error Type/SubType xC002 Signaled via Packet, Software
SHUTTING DOWN Port.
Port will be RE-STARTED.
Count - Remaining Retries 50.
.
.
.
***************************************************************
o System fails to boot OpenVMS V7.1 when boot path is KFMSB
(XMI-to-DSSI)/HSD10. The following console and error-log
events are generated:
CONSOLE:
%PNB0 - Software Shutting Down Port
DECEVENT-FORMATTED ERROR-LOG:
************************** ENTRY 1 ***********************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1
Event sequence number 1.
Timestamp of occurrence 01-JAN-1996 00:00:11
Time since reboot 0 Day(s) 0:00:11
Host name CSG84
System Model AlphaServer 8400 Model 5/300
Entry type 100. Logged Message
---- Device Profile ----
Unit CSG84$PNB0
Product Name KFMSB (XMI to DSSI Adapter)
---- MSCP Logged Msg ----
Logged Message Type Code 3. Port Message
Error Type/SubType xC002 Signaled via Packet, Software
SHUTTING DOWN Port.
Port will be RE-STARTED.
Count - Remaining Retries 50.
Error Count 1.
Local Station Address x0000000000000007
.
.
.
*************************************************************
o On booting VMS, both the CIXCD and CIPCA will generate "Path
LOOPBACK" error-messages on the console and in the error-log.
This error has occurred since initial release of Alpha OpenVMS
V1.0 and since the CIPCA was introduced with V6.2-1H2. The
following console and error-log entries will appear:
CONSOLE:
%PNA0, Path #0. Loopback has gone from GOOD to BAD
%PNB0, Path #0. Loopback has gone from GOOD to BAD
DECEVENT-FORMATTED ERROR-LOG:
************************ ENTRY 6 *************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 1.
Timestamp of occurrence 03-NOV-1997 13:53:48
Time since reboot 0 Day(s) 0:00:20
Host name CSG84
System Model AlphaServer 8400 Model 5/300
Entry type 100. Logged Message
---- Device Profile ----
Unit CSG84$PNB0
Product Name CIXCD (XMI to CI Adapter)
---- MSCP Logged Msg ----
Logged Message Type Code 3. Port Message
Error Type/SubType x4106 Cable Status Change, Path #0.
Loopback went from GOOD to BAD.
Count - Remaining Retries 50.
Error Count 1.
Local Station Address x000000000000000D
Local Station ID x0000000000004DE8
Remote Station Address x0000FFFFFFFFFFFF <- ***
Unavailable
Remote Station ID x0000000000000000 <- ***
Unavailable
*** NOTE that no remote CI-station address is available
*************************************************************
o CIPCA device-registers are not properly read and collected
into the port-descriptor-table (PDT$) by the CIPCA.MAR/READ_REG:
routine. This prevents accurate diagnosis of CIPCA adapter or
port-driver errors by the CSCs or VMS Engineering.
o Two problems are corrected:
- CIPCA CORRUPTED CRCTX & BADDALRQSIZ BUGCHECK
A BADDALRQSIZ bugcheck will result following a port-crash/reset
on Alpha systems with more than 1 GB of memory. This improper
CRCTX "free-queue" reset causes an NPAGEDYN pool-leak of 64
CRCTX buffers (96 bytes x 64 = 6144 bytes) for each CIPCA
device reset.
- DEVICE INIT-FAILURE BAP NPORT-CARRIER LEAKAGE
BAP pool leakage will be seen after CIMNA, CIPCA, or KFMSB
device-initialization failure. For CIPCA/CIMNA, all 14
NPORT stopper-CRRRs will be lost on each port reinit attempt,
accumulating to 700 after the allowed 50 retries.
(CRRR size = 192 bytes x 700 = 134,400 bytes).
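The leak figures quoted in the two items above can be reproduced with a short Python sketch. The constants are taken from the text; the function and constant names are illustrative only.

```python
# Figures taken from the problem descriptions above.
CRCTX_BYTES = 96               # size of one CRCTX buffer
CRCTX_PER_RESET = 64           # CRCTX buffers leaked per CIPCA device reset
CRRR_BYTES = 192               # size of one CRRR
STOPPER_CRRRS_PER_REINIT = 14  # NPORT stopper-CRRRs lost per reinit attempt
MAX_RETRIES = 50               # allowed port reinit retries


def crctx_leak_bytes(resets: int) -> int:
    """NPAGEDYN bytes leaked after the given number of CIPCA resets."""
    return resets * CRCTX_PER_RESET * CRCTX_BYTES


def crrr_leak_bytes(reinit_attempts: int) -> int:
    """BAP bytes leaked after the given number of failed reinit attempts."""
    return reinit_attempts * STOPPER_CRRRS_PER_REINIT * CRRR_BYTES


if __name__ == "__main__":
    print(crctx_leak_bytes(1))           # 6144
    print(crrr_leak_bytes(MAX_RETRIES))  # 134400
```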
o NPORT message and carrier BAP allocation failures are mis-reported
as "Insufficient pool" on console and in errorlog, when a BAP (BUS
ADDRESSABLE POOL) shortage should be identified:
CONSOLE:
"%PNA0, Insufficient Non-paged Pool for Initialization "
Note that BAP is controlled by the SYSGEN parameters NPAG_BAP_MIN,
NPAG_BAP_MAX, and NPAG_BAP_MAX_PA. These parameters are
properly set by running OpenVMS AUTOGEN with "FEEDBACK" after a
VMS installation or upgrade.
o Following a CIPCA, CIXCD, or KFMSB port-reset (UCB$L_ERTCNT
retry-count decrements), the VC_CHK_TIME's (virtual-circuit-timeout)
deallocated TQE may be unintentionally returned to and re-queued by
EXE$SWTIMER_FORK::/SYSUB: system-routine from SYS$PN/PCAdriver. The
64-byte (0x40 byte) non-paged-pool lookaside list will be corrupted,
and incorrectly linked into EXE$GL_TQFL.
The TQE-requeue will *ONLY* occur if the TQE$V_REPEAT bit (byte-offset
0x0B, bit<2>) is set when VC_CHK_TIME: deallocates the TQE. This can
happen either when POOLCHECK is enabled with a "deallocate poison
pattern" that has bit<2>=1, or when the TQE is immediately reallocated
before VC_CHK_TIME returns to EXE$SWTIMER_FORK::/SYSUB:.
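The bit test that decides whether the requeue happens can be illustrated in Python against a raw 64-byte buffer. This is a minimal sketch: only the byte offset (0x0B), the bit number (2), and the 64-byte packet size come from the text above. The buffer contents are made up for the example and do not represent a real TQE or a real poison pattern.

```python
TQE_REPEAT_BYTE_OFFSET = 0x0B  # byte offset given in the text
TQE_REPEAT_BIT = 2             # bit<2> within that byte
POOL_PACKET_BYTES = 0x40       # 64-byte lookaside packet


def repeat_bit_set(packet: bytes) -> bool:
    """True if bit<2> of the byte at offset 0x0B is set in the packet."""
    if len(packet) < POOL_PACKET_BYTES:
        raise ValueError("packet shorter than 64 bytes")
    return bool(packet[TQE_REPEAT_BYTE_OFFSET] & (1 << TQE_REPEAT_BIT))


if __name__ == "__main__":
    zeroed = bytes(POOL_PACKET_BYTES)          # hypothetical: bit clear
    overwritten = bytearray(POOL_PACKET_BYTES)
    overwritten[TQE_REPEAT_BYTE_OFFSET] = 0x04  # hypothetical: bit<2> set
    print(repeat_bit_set(zeroed), repeat_bit_set(bytes(overwritten)))
```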
Problems addressed in ALPDRIV08_071:
o Messages such as:
%PNA0, Inappropriate SCA Control Message - FLAGS/OPC/STATUS/PORT
00/00/00/00
may appear on the console, with associated errorlog messages,
on systems with HSX disk controllers.
o Following a CI-port MFQE (message-free-queue-empty) interrupt
with no SCS-credit deficit (not in "optimistic SCS-credit
mgmt. mode": MFQ entry-count = SCS Rcv-credits), a subsequent
legitimate MFQE interrupt (with SCS-credit deficit) will result
in a series of secondary errors. These cause port-resets, never
expand the MFQ queue, and post a series of these
error-log entries (key ID: error-type/sub-type = 0x8102):
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1
Event sequence number 3653.
Timestamp of occurrence 20-OCT-1997 00:01:27
Time since reboot 2 Day(s) 12:14:28
Host name GDC140
System Model AlphaServer 8400 Model
5/300
Entry type 98. Asynchronous Device
Attention
---- Device Profile ----
Unit GDC140$PNA0
Product Name CIXCD (XMI to CI Adapter)
------ Adapter Data -----
Error Type/SubType x8102 Hardware Error, Unspeci-
fied Port
Hardware Error.
Port will be RE-STARTED.
Count - Remaining Retries 36.
CASR x00000001 Bit 0: Message Free Que
EXHAUSTED
(AMFQE)
AMCSR x00000004 Bit 2: Interrupt ENABLE
(IE)
PESR xFFFFFFFF
XDEV x05110C2F Device Type is: 0x0C2F
= CIMNA
Device Revision is: 0x11
= A1
Firmware Revision is:0x05
= V-5
ASNR x00000001
XBER x00000040 XMI Node ID is: 1.
Commander ID is: 2 =
Microcode CMDR
XFADR xFFFFFE00
XFAER x73FF0FFF
PDCSR x00000001
PFAR x0000055C
Extra Longword 1 x00000000
Extra Longword 2 x00000000
Extra Longword 3 x00000000
----- Software Info -----
UCB$x_ERTCNT 128. Retries Remaining
UCB$x_ERTMAX 10. Retries Allowable
UCB$x_STS x00000000
UCB$x_ERRCNT 30. Errors This Unit
UCB$L_DEVCHAR1 x0C450000 Sharable
Available
Error Logging
Capable of Input
Capable of Output
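The error-log entries quoted in this summary can be told apart by their "Error Type/SubType" field. The Python sketch below tallies those codes in saved DECevent text. The descriptions in the table are copied from the entries shown in this document; the function name and the regular expression are assumptions made for this example and are not part of DECevent.

```python
import re
from collections import Counter

# Codes and descriptions as they appear in the entries quoted in this summary.
SUBTYPES = {
    0xC002: "Signaled via Packet, Software SHUTTING DOWN Port",
    0x4106: "Cable Status Change, Path #0. Loopback went from GOOD to BAD",
    0x8102: "Hardware Error, Unspecified Port Hardware Error",
}

FIELD = re.compile(r"Error Type/SubType\s+x([0-9A-Fa-f]{4})\b")


def tally_subtypes(errlog_text: str) -> Counter:
    """Count each Error Type/SubType code found in the text."""
    return Counter(int(code, 16) for code in FIELD.findall(errlog_text))


if __name__ == "__main__":
    sample = (
        "Error Type/SubType x8102 Hardware Error, Unspecified Port\n"
        "Error Type/SubType x8102 Hardware Error, Unspecified Port\n"
        "Error Type/SubType xC002 Signaled via Packet, Software\n"
    )
    for code, count in sorted(tally_subtypes(sample).items()):
        print(f"x{code:04X} {count:3d}  {SUBTYPES.get(code, 'not listed here')}")
```

A run of x8102 entries with a decreasing "Remaining Retries" count matches the MFQE symptom described above.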
Problems addressed in ALPDRIV04_071:
o On Alpha systems, with many Virtual Circuit failures, the
system will finally BUGCHECK with a CLUEXIT - or may simply
hang. Within the subsequent dump, many CDTs (Connection
Descriptor Table) in DISC_MATCH will be seen - and there will
be no free CDTs.
o These problems only affect Turbolaser AS8200/8400 systems capable
of exceeding 4 gigabytes of memory.
CIMNA (NPORT CIXCD XMI-to-CI adapter for Laser/Turbolaser) and
KFMSB (XMI-to-DSSI) adapters will fail to initialize or start
under OpenVMS if non-paged-dynamic (NPAGEDYN) pool contains
PFNs (physical pages) over 4 gigabytes (PA > 32-bits), and, BAP
(bus-addressable-pool) is merged with NPAGEDYN due to absence
of PCI bus on the system. If any of the NPORT structures
(ABLK, AMPB, QBUFs, CRRRs, BDL, BDLT) contain physical
addresses (PA) > 32-bits, these devices fail to start,
producing the following errors.
CDTs appear in various states when examined with the "SDA>
1. CIMNA ERRORS
===============
The CIMNA will exhibit "port-timeouts" or XMI transaction-
timeout (TTO) memory-system errors on boot, such as:
TURBOLASER CONSOLE LOG:
-----------------------
%PNA0, Port Error Bit(s) Set -
CNF/PMC/PSR 08110C2F/00000004/00000208
%PNA0, Port is Reinitializing (48 Retries Left). Check the
Error Log.
----------------------------------------------
%PNA0, CI port timeout.
%PNA0, Port is Reinitializing (49 Retries Left). Check the
Error Log.
----------------------------------------------
CIMNA ERROR LOG ENTRY:
----------------------
********************** ENTRY 2 *****************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.1
Event sequence number 1.
Timestamp of occurrence 01-JAN-1996 00:00:04
Time since reboot 0 Day(s) 0:00:04
Host name ANDA1A
System Model AlphaServer 8400 5/300
Entry type 98. Asynchronous Device Attention
---- Device Profile ----
Unit ANDA1A$PNA0
Product Name CIXCD (XMI to CI Adapter)
------ Adapter Data -----
Error Type/SubType x8102 Hardware Error,Unspecified Port
Hardware Error. Port will be RE-STARTED.
Count - Remaining Retries 50.
CASR 00000208 Bit 3: Memory System ERROR (MSE)
Bit 9: Unintialize State (UNIN)
AMCSR x00000004 Bit 2: Interrupt ENABLE (IE)
PESR xFFFFFFFF
XDEV x08110C2F Device Type is: 0x0C2F = CIMNA
Device Revision is: 0x11 = A1
Firmware Revision is: 0x08 = V-8
ASNR x00000208
XBER x8000A060 Bit 13: Transaction Timeout (TTO)
Bit 15: Command NoAck (CNAK)
Bit 31: Error Summary (ES)
XMI Node ID is: 1.
Commander ID is: 3 = INTR
XFADR xFFFFFFFF XMI Failing Addr[00:28]: x1FFFFFFF
XMI Failing Addr[39]: x00000001
Failing Length: x00000003
XFAER x13FF0000 Mask[00:15]: x00000000
XMI Failing Addr[29:38]: x000003FF
XMI Failing Command: 1, READ
PDCSR x00000208
PFAR x0000055C
Extra Longword 1 x00000000
Extra Longword 2 x00000000
Extra Longword 3 x00000000
----- Software Info -----
UCB$x_ERTCNT 0. Retries Remaining
UCB$x_ERTMAX 0. Retries Allowable
UCB$x_STS x10000000
UCB$x_ERRCNT 1. Errors This Unit
UCB$L_DEVCHAR1 x0C450000 Sharable
Available
Error Logging
Capable of Input
Capable of Output
************************************************************
2. KFMSB ERRORS
===============
TURBOLASER CONSOLE LOG
----------------------
"Port Error Bit(s) Set - CNF/PMC/PSR
xxxxxxxx/xxxxxxxx/05008010"
NOTE: The PSR (taken from the ASR: Adapter Status Register)
translates to:
- <04> Adapter Abnormal Condition
- <15> Channel 1 flag
- <30:24> (=5) Illegal Carrier Address
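The register translations shown above can be reproduced with a small bit-field helper in Python. This is a sketch only: the PSR bit positions come from the note above, while the XDEV field layout (device type in bits <15:0>, device revision in <23:16>, firmware revision in <31:24>) is inferred from the decoded values in the quoted error-log entries, not from hardware documentation. All function and key names are illustrative.

```python
def bits(value: int, high: int, low: int) -> int:
    """Extract the bit field <high:low> from an integer register value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def decode_psr(psr: int) -> dict:
    """Decode the PSR fields listed in the KFMSB note above."""
    return {
        "adapter_abnormal_condition": bool(bits(psr, 4, 4)),
        "channel_1_flag": bool(bits(psr, 15, 15)),
        "code_30_24": bits(psr, 30, 24),  # 5 = Illegal Carrier Address
    }


def decode_xdev(xdev: int) -> dict:
    """Decode XDEV using the layout inferred from the quoted log entries."""
    return {
        "device_type": bits(xdev, 15, 0),
        "device_revision": bits(xdev, 23, 16),
        "firmware_revision": bits(xdev, 31, 24),
    }


if __name__ == "__main__":
    print(decode_psr(0x05008010))
    print({k: hex(v) for k, v in decode_xdev(0x08110C2F).items()})
```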
o SCS sysap data-transfer mapping requests will generate
incorrect (mis-calculated) physical-address pointers, causing
disk/tape data-transfer corruption, if the page_offset
requested extends beyond the first page (>8Kb-1: Alpha
page-size) of the requested transfer (page defined by SVAPTE in
SCS$MAP request). SYS$SCS sources the page_offset from
CDRP$L_BOFF, and sources SVAPTE from CDRP$L_SVAPTE, both of
which are supplied by the SCS client sysap (DUDRIVER, CNXMGR,
etc.). OpenVMS SCS sysaps are not believed to use CDRP$L_BOFF
values > 8k-1 (Alpha page), but user-written SCS sysap
applications might use a value > 13-bits since CDRP$L_BOFF is
32-bits (formerly CDRP$W_BOFF/16-bits).
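To make the page-offset condition concrete, the Python sketch below shows one way an offset larger than a page can be folded into a page index plus an in-page offset before a physical address is formed. It is an illustration under stated assumptions, not the SYS$SCS code: the list of page frame numbers stands in for whatever the SVAPTE mapping yields, and all names are hypothetical. Only the 8 KB Alpha page size comes from the text.

```python
PAGE_SIZE = 8192  # Alpha page size in bytes, per the text above


def normalize(byte_offset: int) -> tuple:
    """Split a byte offset into (pages to skip, offset within that page)."""
    return divmod(byte_offset, PAGE_SIZE)


def physical_address(pfns: list, byte_offset: int) -> int:
    """Physical address of the byte at byte_offset into a buffer.

    `pfns` is a hypothetical list of page frame numbers, one per
    virtual page of the buffer, in order.
    """
    page_index, offset_in_page = normalize(byte_offset)
    return pfns[page_index] * PAGE_SIZE + offset_in_page


if __name__ == "__main__":
    pfns = [100, 7]  # two non-contiguous physical pages (made-up values)
    # An offset of one page plus 16 bytes falls in the second page.
    print(normalize(PAGE_SIZE + 16))
    print(physical_address(pfns, PAGE_SIZE + 16))
```

When the physical pages are not contiguous, as in the made-up example, adding a large offset directly to the first page's address would point into unrelated memory, which is consistent with the corruption described above.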
o Performance is degraded on CIXCD and CIPCA based systems, when
communicating with other NADP (non-alternating-dual-path) nodes
during single-CI-path operation. Single-path operation follows CI-cable
failure/removal or CI single-path failures (NO_RESPONSE, NAK
errors). NADP-supporting nodes currently are HSJ40, HSJ50,
CIPCA, and CIMNA/CIXCD.
o The CIPCA will not properly re-init after a PCI-DMA-Engine "bus
error" (PCI bus master abort or target abort). The port-driver
will repeatedly fail the re-init until the count of 50 retries
is exhausted. The console OPA0 output is typically as
follows:
%PNB0, Port Error Bit(s) Set - NODESTS/CASR(H)/(L)
02800001/001C0000/000001D0
%PNB0, Port is Reinitializing ( 48 Retries Left).
Check the Error Log.
o When booting Alpha machines the console may display messages
such as:
%PNA0, Inappropriate SCA Control Message - FLAGS/OPC/STATUS/PORT
00/00/00/00
with associated errorlog messages.
INSTALLATION NOTES:
The images in this kit will not take effect until the system is
rebooted. If there are other nodes in the VMScluster, they must
also be rebooted in order to make use of the new image(s).
If it is not possible or convenient to reboot the entire cluster at
this time, a rolling re-boot may be performed.
o Multiprocessor Systems with CIPCAs: SMP_SPINWAIT Restriction
If your system uses a CIPCA adapter and you operate with
MULTIPROCESSING set to a non-zero value, you must reset the
value of the SMP_SPINWAIT parameter to 300000 (3 seconds)
instead of the default 100000 (1 second).
If you do not change the value of SMP_SPINWAIT, a CIPCA adapter
error could generate a CPUSPINWAIT system bugcheck similar to
the following:
**** OpenVMS (TM) Alpha Operating System V7.1 - BUGCHECK ****
** Code=0000078C: CPUSPINWAIT, CPU spinwait timer expired
This restriction will be removed in a future OpenVMS release.
NOTE: This release note supersedes a similar release note,
note 4.15.2.4.5, in the OpenVMS Version 7.1 Release
Notes manual as well as 6.2-1H3 sec:1.13.1, which also
included a SYSTEM_CHECK parameter restriction. The
SYSTEM_CHECK parameter restriction is incorrect.
Furthermore, the earlier release note stated that the
change to the SMP_SPINWAIT parameter was required for
a MULTIPROCESSING parameter setting of 1 or 2. This
requirement applies to all non-zero MULTIPROCESSING
parameter settings.
o This ALPPORTS01_071 remedial kit removes the KFMSB/HSD10
booting restriction that was listed in the ALPDRIV08_071
remedial kit. The ALPPORTS01_071 kit can be used in
KFMSB/HSD10 boot configurations.
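For the SMP_SPINWAIT restriction above, the two values quoted (100000 for 1 second, 300000 for 3 seconds) imply a unit of 10 microseconds. The Python sketch below converts a setting to seconds and flags configurations that fall under the restriction. The function names are illustrative, and treating values above 300000 as acceptable is an assumption of this sketch, not a statement from the release note.

```python
SPINWAIT_UNIT_US = 10       # implied by 100000 -> 1 second in the text
REQUIRED_SPINWAIT = 300000  # value required with a CIPCA (3 seconds)


def spinwait_seconds(smp_spinwait: int) -> float:
    """Convert an SMP_SPINWAIT setting to seconds."""
    return smp_spinwait * SPINWAIT_UNIT_US / 1_000_000


def needs_change(uses_cipca: bool, multiprocessing: int, smp_spinwait: int) -> bool:
    """True if the restriction applies and the setting is too low.

    Assumption: any value of at least 300000 is treated as acceptable.
    """
    return uses_cipca and multiprocessing != 0 and smp_spinwait < REQUIRED_SPINWAIT


if __name__ == "__main__":
    print(spinwait_seconds(100000), spinwait_seconds(300000))
    print(needs_change(uses_cipca=True, multiprocessing=3, smp_spinwait=100000))
```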
Files on this server are as follows:
alpports01_071.README
alpports01_071.CHKSUM
alpports01_071.CVRLET_TXT
alpports01_071.a-dcx_axpexe