ECO NUMBER:       ALPSHAD09_061
-----------
PRODUCT:          OpenVMS Alpha Operating System
--------
UPDATED PRODUCT:  OpenVMS Alpha Operating System 6.1
----------------
APPRX BLCK SIZE:  701
----------------

COVER LETTER

1  KIT NAME:

   ALPSHAD09_061

2  KITS SUPERSEDED BY THIS KIT:

   AXPSHAD07_061, For OpenVMS Alpha V6.1 only
   ALPSHAD08_061
   ALPSHAD09FT_061

3  KIT DESCRIPTION:

3.1  Version(s) of OpenVMS to which this kit may be applied:

     OpenVMS Alpha V6.1, V6.1-1H1, V6.1-1H2

3.2  This kit also requires installation of the following remedial kits:

     ALPSHAD10_061

     When you install the ALPSHAD09_061 remedial kit you must also
     install the ALPSHAD10_061 or later remedial kit before rebooting
     your system. A system which is running the ALPSHAD09_061 kit
     without the ALPSHAD10_061, or a later SHADOW kit, may experience
     the MERGE problem or the SHADZEROMBR bugcheck problem. These
     problems are outlined and resolved in the ALPSHAD10_061 remedial
     documentation and kit.

3.3  Files patched or replaced for V6.1, V6.1-1H1, V6.1-1H2

     o  SYS$COMMON:[SYS$LDR]IO_ROUTINES.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]LOCKING.EXE (new image)
     o  SYS$COMMON:[SYSLIB]MOUNTSHR.EXE (new image)
     o  SYS$COMMON:[SYSEXE]MTAAACP.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SYS$CLUSTER.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SYS$DUDRIVER.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SYS$SHDRIVER.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SYS$TUDRIVER.EXE (new image)

-- COVER LETTER --          Page 2          8 April 1996

     o  SYS$COMMON:[SYS$LDR]SYS$VCC.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SYS$VCC_MON.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]MSCP.EXE (new image)
     o  SYS$COMMON:[SYSEXE]SHADOW_SERVER.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]TMSCP.EXE (new image)
     o  SYS$COMMON:[SYSEXE]MONITOR_TV.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]SECURITY.EXE (new image)
     o  SYS$COMMON:[SYSLIB]SPISHR.EXE (new image)
     o  SYS$COMMON:[SYSUPD]VMS$REMEDIAL_ID.EXE (new image)
     o  SYS$COMMON:[SYS$LDR]VPM.EXE (new image)

4  PROBLEMS ADDRESSED IN ALPSHAD09_061 KIT

   Descriptions for problems that were corrected in previous
   Alpha Shadow kits are included in the ALPSHAD09_061 Release Notes.
   ALPSHAD09_061 Release notes can be found in the save set
   ALPSHAD09_061.A. If you have not installed a previous shadow kit,
   it is recommended that you read these release notes before
   installing the ALPSHAD09_061 Shadow kit. To access the release
   notes, restore them from the saveset by issuing a command with the
   following format:

   $ BACKUP/SEL=ALPSHAD09_061.RELEASE_NOTES DEVICE:[DIR]ALPSHAD09_061.A/SA-
     DEVICE:[DIR]ALPSHAD09_061.RELEASE_NOTES

  o  Shadowing crash immediately upon booting system with shadowed
     system disk, in SHSB$READ_SCB.

  o  A two member shadowset, with member index 0 a copy target and
     index 1 the only source member, experiences a node failure on a
     node serving the disks. The source member goes "available". The
     source index is never packacked and the system remains with the
     set hung in mount verification forever.

  o  If Shadowing tries to mark a block bad on all disks because it is
     bad on the source(s), and encounters an error, it may return an
     incorrect status to the user. The status will be SS$_NORMAL for
     MSCP devices and may be SS$_UNSUPPORTED for non-MSCP devices (as
     determined by routine SHSB$CHECK_MSCP). An SS$_NORMAL status is
     misleading because it indicates that all blocks were correctly
     marked bad; SS$_UNSUPPORTED does not appear to be a valid return
     status for shadowing I/Os.

  o  Removing a DCD (Disk Copy Data) copy target and adding it back
     again causes the source of the DCD copy to change. This can cause
     the copy to be non-assisted if the alternate source isn't on the
     same controller.

  o  If a DCD copy is interrupted by a mini-merge, the copy will
     restart at 0% copied (LBN 0) rather than continuing from where it
     left off. DCD copies should restart at the last copied LBN after
     being interrupted by a mini-merge.

  o  Failures to start copies or restart copies, usually after a node
     halt, shutdown or reboot.
     Additional symptoms observed include inconsistent values for
     HBS_CIP when compared to SHADOW_MAX_COPY, negative values for
     HBS_CIP, and copies that should continue instead starting over
     from the beginning.

  o  Demote CMPL to CMPW for #SS$_* to prevent incorrect status
     handling.

  o  TPU would output SPR text if a user pressed CTRL/C during the
     compile of TPU code that contained errors. Users often do this
     when they accidentally try to compile non-TPU code or their
     procedure has many coding errors in it.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  If a three member Shadowset has its index zero member as a copy
     target and all three members also require a MERGE, then when the
     COPY completes the MERGE does not take place. The LBN for the
     just completed COPY (the last LBN on the disk) is passed as the
     MERGE starting LBN, so the MERGE completes without doing any I/O.

  o  When MONITOR is run on a terminal with more than 24 lines,
     MONITOR still uses only 24 lines. For several classes (PROCESS,
     DISK, and CLUSTER), it would be nice if MONITOR could use the
     additional lines. This ECO provides support for the PROCESS
     class, the one that could use it most.

     This feature was provided in OpenVMS Alpha V6.2.

  o  Specifying MONITOR RMS with the /PERCENT qualifier will cause
     MONITOR to terminate unexpectedly with an ACCVIO.
     The following command will demonstrate the problem:

     $ MONITOR RMS/File=a.dat/Percent

                     VAX/VMS Monitor Utility
                     RMS FILE OPERATIONS (%)
                        on node NODENAME

                                   CUR%    AVE%    MIN%    MAX%

         $GET Call Rate  (Seq)
                         (Key)
                         (RFA)
         $FIND Call Rate (Seq)
                         (Key)
                         (RFA)
         $PUT Call Rate  (Seq)
                         (Key)
         $READ Call Rate
         $WRITE Call Rate
         $UPDATE Call Rate
         $DELETE Call Rate
         $TRUNCATE Call Rate
         $EXTEND Call Rate
         $FLUSH Call Rate

     %MONITOR-E-UNEXPERR, unexpected error
     -SYSTEM-F-ACCVIO, access violation, reason mask=01,
      virtual address=E100001C, PC=00018C7C, PSL=03C00008

     This problem is corrected in OpenVMS Alpha V6.2.

  o  Specifying the DISK class to MONITOR can result in unexpected
     side effects in the display. When the MONITOR DISK command is
     issued on a system with DFS devices mounted, only the first three
     characters of the DFS name are displayed correctly. Instead of
     the fourth character, the low byte of the unit number is output.
     It is often displayed as a non-printable character or as an
     escape sequence (in which case it may cause terminal lock-ups,
     reset terminal characteristics, etc.).
     The following command illustrates this problem when executed on a
     system with DFS disks mounted:

     $ MONITOR DISK

                       DISK I/O STATISTICS
                        on node NODENAME
                      7-APR-1994 16:25:17

     I/O Operation Rate

     DSA2241:           FOLKLORE        6.27    6.27    6.27    6.27
     DSA2249:           AUDIT           0.00    0.00    0.00    0.00
     DSA2263:           VMS19NOVC3L     0.00    0.00    0.00    0.00
     DSA2264:           LAV19NOVC3L     0.00    0.00    0.00    0.00
     DSA2265:           MDF19NOVC3L    15.84   15.84   15.84   15.84
     DSA2266:           VMS28APRB3E     0.00    0.00    0.00    0.00
     DSA2267:           LAV28APRB3E     0.00    0.00    0.00    0.00
     DSA2268:           MDF28APRB3E     0.00    0.00    0.00    0.00
     DSA2269:           VMS18JANC3L     0.00    0.00    0.00    0.00
     DSA2270:           MDF18JANC3L     0.00    0.00    0.00    0.00
     DSA2271:           LAV18JANC3L     0.00    0.00    0.00    0.00
     DSA2280:           VMS12OCTM3C     0.00    0.00    0.00    0.00
     $254$DFSé1001()    DEC:..._STAR    0.00    0.00    0.00    0.00
     $254$DFSH8008()    V501_RESD       0.00    0.00    0.00    0.00
     $254$DFSI8009()    V51_RESD        0.00    0.00    0.00    0.00

  o  Due to an inadequate synchronization mechanism, the MONITOR DISK
     command can go into an infinite loop on multi-processor machines.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  A DCD copy is not always performed when it would be valid. This
     results in a non-assisted FULL copy operation, which takes much
     longer.

  o  Event Flag not set when completion AST also specified on $ENQ.

  o  A problem would occur if a satellite were to crash and then
     attempt to boot back into the cluster (in a SCSI CLUSTER). The
     physical device would be unavailable to the satellite so that it
     would never be allowed to boot back into the cluster.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  On multi-interconnect clusters, there is a window which will
     allow a lock remaster operation to complete without all
     interested nodes pointing to the new master. This usually results
     in a number of nodes crashing with LOCKMGRERR bugchecks. The
     situation is only possible after a node CLUEXITs. Other required
     conditions are that the node which CLUEXITs must have a LOCKDIRWT
     of zero, such that a partial lock rebuild occurs after the
     CLUEXIT.
     If an SS$_NODELEAVE error is returned for a node which is to
     participate in the remaster, the remaster must be stopped from
     completing so that the lock rebuild can clean things up.

  o  A SET SECURITY or SET ACL on volumes on the cluster places a high
     I/O load on the server process. This exhausts paged pool and
     AUDIT_SERVER goes into a RWPAG state.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  A field in the IRP that is used during Volume Processing was not
     initialized in clones of USER IOs. If an error occurs, the code
     that determines the severity of the error can be misled by data
     in these fields. It can fail to locate the error and return the
     IO as successful. Because a zero byte count is also returned, the
     user would see an Incomplete Segmented Transfer error. The fix is
     to initialize the field when the clone is allocated.

  o  Listings are sometimes difficult to follow because there are
     varied format conventions used and some comments are misleading
     or missing.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  Certain applications calling $AUDIT_EVENT with ASTs turned off
     will be interrupted when $AUDIT_EVENT returns to the caller.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  Code relies on a page being present when trying to release a
     spinlock and, if the system is paging heavily, this might not be
     the case.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  Repeating wakeups from $SCHDWK show an accumulating drift over
     time.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  COPY and/or BACKUP of a DISK to a TMSCP-Served TAPE will fail
     when the tape device is placed in a MV state. The failure does
     not occur if the same task is performed locally.
     COPY will fail with:

       "SYSTEM-F-TAPEPOSLOST, magnetic tape position lost"

     BACKUP will fail with:

       "-SYSTEM-F-DATALOST, data lost"

     This problem is corrected in OpenVMS Alpha V6.2.

  o  To transition an OpenVMS process from the virtual balance set to
     the real balance set, the SPTEs (system page table entries) which
     describe its process PTE pages (process page table pages) need to
     be copied from saved memory back into the real balance slot from
     which they originally came. This makes the process' P0 and P1
     space accessible again. SPTEs for the process page table pages
     describing the undefined area between P0 and P1 must be
     represented by pre-initialized null values (actually, ERKW
     DZERO-type values). When this undefined void area is exactly zero
     pages (i.e., P0 and P1 are tangent), the VBSS$READ_OPT2_VBSM
     routine takes the wrong branch, causing a VBSSERR bugcheck. This
     fix adds a test for this case and takes the correct branch.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  When a process is switched from a real balance slot to a virtual
     balance slot, the allocation fails, causing a VBSSERR bugcheck.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  When returning process quota (BYTLM) to a process for a created
     system global section, compute the returned quota value
     correctly.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  System crashes due to corrupted PTE entries. The corruption
     appears to be Global Section Table Entries pointing to Global
     Section Descriptors. The problem occurs only if 4095 GBLSECTIONS
     is exceeded. To check the number of Global Sections currently in
     use, add the following values:

       0 SDA> VALIDATE QUEUE EXE$GL_GSDSYSFL !global sections
       0 SDA> VALIDATE QUEUE EXE$GL_GSDDELFL !delete pending global sections
       0 SDA> VALIDATE QUEUE EXE$GL_GSDGRPFL !group global sections

  o  Devices can remain allocated to processes that no longer exist.
     The device remains unusable until the system is rebooted.
  o  If a previously shadowed disk is mounted with a MOUNT/OVER=SHADOW
     command and a new shadow set is created using this disk, OpenVMS
     Alpha will attempt to create the old shadow set using the old
     physical device names.

  o  The system crashes with a NOBVPVCB bugcheck. The crash occurs on
     the kernel stack with MTAAACP.EXE as the current image.

  o  The system crashes with an XQPERR while dismounting a MAD drive.

  o  SUBTRACED errors not correctly determined for images installed
     /HEADER_RESIDENT.

     This problem is corrected in OpenVMS Alpha V6.2.

  o  When returning process quota (BYTLM) to a process for a created
     system global section, compute the returned quota value
     correctly.

  o  Users of RDB V6.1 may get ILLIOFUNC errors when doing IO to a
     Host Based Shadowset whose members are served.

  o  The user will see a large number of the shadow copies being done
     by OpenVMS rather than the controller, even when both disks are
     on the same controller and the controller has DCD capabilities.

  o  If a three member Shadowset has its index zero member as a copy
     target and all three members also require a MERGE, then when the
     COPY completes the MERGE does not take place. The LBN for the
     just completed COPY (the last LBN on the disk) is passed as the
     MERGE starting LBN, so the MERGE completes without doing any I/O.

  o  System hang when I/Os pending to a shadow set do not complete.

  o  In previous shadow kits two new fields were added to the IRP data
     structure for shadow write logging information. This new IRP
     definition size conflicted with the IRP sizes of other images on
     the system that were not part of the SHADOW kits. This conflict
     could cause a variety of errors including fatal bugchecks. This
     fix changes the IRP definitions back to the SSB versions and also
     adds some special definitions to the SHDRIVER for the new IRP
     fields.

  o  Fatal bugcheck from data structure corruption due to the value 10
     HEX being added to the corrupted field.
     Crashes are of various types including node and cluster crashes,
     crashes due to invalid UCB addresses, invalid VCB addresses,
     invalid member IDs, invalid number of devices etc.

5  PROBLEMS ADDRESSED IN ALPSHAD07_061 KIT FOR V6.1, V6.1-1H1, V6.1-1H2

   Although this kit has previous fixes that can be applied to V1.5,
   starting with kit AXPSHAD06_061, OpenVMS Alpha SHADOW kits no
   longer provide additional fixes for OpenVMS Alpha V1.5. If you are
   running OpenVMS Alpha V1.5 and have experienced any of the problems
   addressed in kit AXPSHAD06_061 or subsequent kits, it is
   recommended that you upgrade to OpenVMS Alpha V6.1 as soon as
   possible.

  o  Fatal bugcheck from data structure corruption due to the value 10
     HEX being added to the corrupted field. Crashes are of various
     types including node and cluster crashes, crashes due to invalid
     UCB addresses, invalid VCB addresses, invalid member IDs, invalid
     number of devices etc.

  o  There is a race condition possible when a CFCB (Cache File
     Control Block) is being deleted due to XQP action and cache space
     is being reclaimed from a LIMBO file.

  o  Under certain conditions, a fork lock used by the virtual I/O
     cache may be created with an incorrect length. This results in
     unsynchronized data access which can cause corruption.

  o  When a satellite node in a SCSI cluster crashes, the MSCP server
     marks the physical device as offline, which prevents the
     satellite node from being able to boot back into the cluster.

6  PROBLEMS ADDRESSED IN AXPSHAD06_061 KIT FOR OPENVMS ALPHA V6.1

  o  REGCORDET (register corruption is detected after a fork) fatal
     bugcheck. The corruption occurs when fork dispatch does some
     checks.

  o  The error message for the following scenario is inappropriate and
     not informative. If a customer forgets to set the value of the
     ALLOCLASS SYSGEN parameter and then tries to use shadowing, a
     shadow volume can be created but members cannot be added to the
     shadow set.
     No error messages are received up to the point where the customer
     tries to add the second member. On the MOUNT command, the
     customer will receive the error messages:

     $ mount /system dsa500 /shadow=dkb400 alphavms015
     %MOUNT-I-SHDWMEMFAIL, DKB400 failed as a member of the shadow set
     -SYSTEM-F-INCSHAMEM, incompatible shadow set member

     "Incompatible" is not a true statement of the problem. The
     problem is actually due to "missing allocation class," or
     "incorrect allocation class."

  o  I/O to a shadow set may become stalled if a shadow set member is
     dismounted at the same time from multiple nodes within a cluster.

  o  Mount will not add shadow set members unless they are either MSCP
     or SCSI.

  o  Shadowset member expulsion is currently based on the time it
     takes a fork & wait and a PACKACK to complete rather than the
     actual elapsed time. On some devices, particularly SCSI, where a
     PACKACK can take approximately one minute, the timeout was much
     too long. Using the default value of 20 (seconds) for
     SHADOW_MBR_TMO would actually mean that it would take 20 minutes
     to expel a member that is experiencing errors from a SCSI
     shadowset.

  o  Crashes have been seen where SHADDETINCON was triggered by the
     check at the end of MATCH_MASTER_SCB. In this consistency check
     the SHAD$W_DEVSTS_PASSIVE_MV_CNTR is incorrectly verified to be
     zero. Another symptom is that the virtual unit UCB$W_RWAITCNT is
     zero. Crashes where this problem was investigated also had shadow
     set member counts of zero.

  o  Frequent crashes in EXPEL_PACKACK_ANY, particularly crashes with
     connections broken to all members and IRP$L_SHD_LOCK_FR5 = 1
     (packack retries exhausted).

  o  All members of a shadowset become inaccessible at the same time
     and remain inaccessible for a period of time greater than "shadow
     member timeout" (SHADOW_MBR_TMO or SHADOW_SYS_TMO) seconds, but
     less than MVTIMEOUT seconds.
     All members subsequently become accessible within seconds of each
     other, but not at exactly the same time. This results in all but
     one member being expelled from the shadowset. This often occurs
     when changing HSJ microcode when all members are connected to the
     same HSJ. When brought back online, polling will cause the
     devices to be found seconds apart which will result in all but
     one member being expelled.

  o  All members of the shadow set must be checked to see if they meet
     the criteria of being MSCP. The original design does not allow
     for having no index zero member.

  o  Using $PROCESS_SCAN explicitly, or implicitly, with the DCL SHOW
     USER command, sometimes causes a system crash due to an ACCVIO in
     kernel mode or an IVSSRVRQST bugcheck.

  o  When a node with a SCSI bus boots, it resets the SCSI bus. In a
     multi-host SCSI cluster this can cause the other node to
     experience I/O failures. Normally, this results in a brief mount
     verification, the I/O is retried, succeeds, and there is no
     serious consequence. However, if the other node is in the process
     of booting and the system disk is a shadow set, then the system
     will crash.

  o  PGFIPLHI bugcheck in the SHADOW_SERVER process at the REMQUE in
     K_GET_COPYSHAD_IRP. On OpenVMS Alpha, PC = A0E, VA = 274.

  o  A double-deallocation crash may occur as the result of MOUNT not
     properly initializing the MTL pointer. This pointer had a stale
     value as a result of 2 calls to SYS$VMOUNT from a single program.
     The problem will not happen as a result of DCL commands, as the
     cells are initialized at image activation. The stale pointer will
     only cause a problem if the system is unable to allocate space
     for defining the logical name.

  o  If a user attempts to mount a disk that is 100% full and the disk
     was originally initialized with a version of OpenVMS Alpha prior
     to the one now in use, paged pool can be corrupted leading to
     system crashes.
     If the disk is filled AFTER it has been mounted, there will not
     be any problem.

  o  Tape devices with stacker/loaders, such as the TF857, may take up
     to 6 minutes to Rewind/unload/load the next tape. A change was
     made to the behavior of MOUNT to take this delay into account.
     However, a side effect of that change is that non-stacker drives
     may also wait 6 minutes before failing.

  o  Processes hang in RWNPG state waiting for a request for NPP
     (non-paged pool) so large that it cannot be satisfied.

  o  The system crashes with the current process executing a $CHKPRO
     system service call.

  o  If a multi-programming application uses a non-homogeneous access
     pattern to a file which is resident in Virtual I/O cache, there
     is a possibility that the size returned in the I/O status block
     from a READ operation will be truncated.

     If a clustered application consists of a large number of
     concurrent processes that repeatedly perform an OPEN, WRITE,
     CLOSE sequence on the same data file, a possibility of data
     corruption exists.

     In a multi-programming environment, where a significant amount of
     NEW data from a file is being loaded into the cache concurrently
     by multiple processes, the possibility of a HANG exists.

  o  When a value block or value status block cannot be returned,
     SYS$GETLKI returns the error SS$_ILLRSDM. A correction has been
     made to SYS$GETLKI to now return all other requested information
     and update the wildcard search index.

  o  The Audit server EXCLUDE process list becomes corrupted after
     issuing a SET AUDIT/EXCLUDE=pid command.

  o  Data corruption on the file container when using PATHWORKS. The
     corruption can be shown by running CHKDSK on the PC container
     disk. Also, using PCDISK to IMPORT and EXPORT files to and from
     the container will show corrupted files when EXPORTed back to
     VMS.
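
The member-expulsion item in this section (a timeout consumed per fork & wait/PACKACK cycle rather than by elapsed time) comes down to simple arithmetic. The sketch below only illustrates that arithmetic; it is not SHDRIVER code. The function names are hypothetical, the 60-second PACKACK duration is the approximate figure quoted in the item, and the assumption that the timeout is checked only when a cycle completes is ours.

```python
# Illustrative arithmetic only; not SHDRIVER code.
# Assumptions (not from the kit itself): function names are hypothetical,
# and the timeout is evaluated only when a retry cycle completes.


def expel_delay_counting_cycles(tmo: int, cycle_seconds: float) -> float:
    """Behavior described as the problem: SHADOW_MBR_TMO is used up one
    unit per fork & wait + PACKACK cycle, however long a cycle takes."""
    return tmo * cycle_seconds


def expel_delay_elapsed_time(tmo: int, cycle_seconds: float) -> float:
    """Elapsed-time behavior: expel once tmo seconds have really passed.

    The delay is the first multiple of the cycle length that reaches tmo.
    """
    cycles = 1
    while cycles * cycle_seconds < tmo:
        cycles += 1
    return cycles * cycle_seconds


if __name__ == "__main__":
    tmo = 20         # default SHADOW_MBR_TMO, in seconds
    packack = 60.0   # approximate SCSI PACKACK time quoted in the text
    print("cycle counting:", expel_delay_counting_cycles(tmo, packack) / 60, "minutes")
    print("elapsed time:  ", expel_delay_elapsed_time(tmo, packack), "seconds")
```

With a one-minute PACKACK, counting cycles turns a 20-second setting into 20 minutes, which matches the figure given in the item above.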

7  PROBLEMS ADDRESSED IN AXPSHAD04_061 KIT FOR OPENVMS ALPHA V6.1,
   V6.1-1H1, V6.1-1H2

  o  When booting two or more systems simultaneously from shadowed
     system disks the systems may appear to hang. Crashing the systems
     and examining the crash dumps indicates that shadowing driver
     blocking AST routines have not run.

  o  When a node runs out of SHADOW_MAX_COPY threads while mounting
     new copy target units, other nodes in the cluster that have
     available SHADOW_MAX_COPY threads will not pick up the copy work.
     This results in the copy not being started for copy members that
     are added to shadow sets.

8  PROBLEMS ADDRESSED IN AXPSHAD02_061 KIT FOR OPENVMS ALPHA V6.1,
   V6.1-1H1, V6.1-1H2

  o  While running a UETP tape test, fatal controller errors occurred.
     This problem was caused by TMSCP (the tape server) incorrectly
     interpreting a TUDRIVER status subcode. This was converted to a
     fatal controller error status and returned to the user.

  o  Shadow sets have separate mount verification done by SHDRIVER,
     instead of the usual system mount verification. This SHDRIVER
     mount verification had an error updating the volume label if it
     was changed. This correction enables the behavior of virtual
     units to be consistent with the behavior of physical units. The
     previous workaround had been to issue the SET VOLUME/LABEL
     command on all nodes in the VMScluster. This command updates only
     the volume control block on the node that issued the command and
     it updates the physical device's home block with the new label.

  o  The symptoms for this problem varied between VAX and Alpha. VAX
     nodes had unnecessary calls to mount verification or host-based
     volume shadowing processing. On OpenVMS Alpha nodes, these mount
     verification or host-based volume shadowing processing calls
     could fail, resulting in I/O hangs and, eventually, volume
     invalid errors.

  o  The code stream for AVAILABLE or OFFLINE status returned from a
     transfer command did not implement the MSCP specification.
     This correction implements the MSCP specification.

  o  A served disk may have appeared to be ONLINE when it was OFFLINE.
     This was caused by the MSCP server's CHECK_SERVICE routine
     searching the device database and incorrectly returning an ONLINE
     status.

  o  This problem was characterized by system crashes in SHDRIVER's
     RESTORE_WLE routine because there was no write-log table. Also,
     shadow-set members were spuriously removed from the set.

  o  The symptoms of this problem were:

       o  Undiagnosable hangs in individual copy operations or on the
          entire server
       o  Spurious copy aborts
       o  Possible poor copy performance
       o  Possible shadow set inconsistency

     This correction stops these problems and also adds an optional
     new system logical name that you can use to control the buffer
     size of shadow copies. SHAD$COPY_BUFFER_SIZE has a maximum size
     of 127 blocks (default) and a minimum size of 31 blocks. You can
     change this size by using the DEFINE/SYSTEM command.

  o  This problem was characterized by very high interrupt stack
     activity on a node performing a merged copy operation. This was
     most obvious in configurations that could require I/O at a high
     rate and when an update to a shadowing internal entry was
     required. This could adversely affect configurations using HSJ40
     controllers with many shadow sets. The interrupt stack activity
     has approached 100% on the nodes performing the update operation
     for several multiple member shadow sets. This correction was made
     by modifying the algorithm used to determine when an update is
     performed, so that only affected nodes issue the required I/O.

  o  This problem resulted in data inconsistency between members of a
     Phase II shadow set. It occurred under very heavy I/O operations
     to a shadow set, where the members of that shadow set were
     undergoing failover from one controller to another.

  o  Artificial errors were reported in the system error log for Write
     History Management commands that had no actual error.
  o  A second shadow server could accidentally be created using the
     startup command procedure. This results in de-synchronization of
     shadow sets. The startup procedure was modified so that it does
     not allow multiple servers.

  o  Incorrect MSCP online command handling returned errors rather
     than queuing I/O to a busy device.

  o  The number of blocks to rewrite after a system failure (using an
     assisted merge copy operation) was not being computed correctly.
     This could lead to inconsistent data between shadow set members.

  o  A process issuing I/O to a TMSCP tape device may have appeared to
     be hung after a controller failover attempt. This was caused by
     an incorrect check of the cached data's lost error status,
     resulting in an endless loop trying to recover a nonexistent
     error.

  o  Alpha systems are unable to reboot an MSCP controller, such as an
     HSC. This could result in stalled pending I/O.

  o  The path selection logic for DUDRIVER had a timing problem that
     caused devices to be mounted by an MSCP server, even though a
     local controller could be used. Although this symptom could still
     appear under extreme circumstances, the majority of devices
     should now find the local controller.

  o  Incorrect MSCP-served disk synchronization would cause I/O to an
     MSCP-served disk to get stalled on an internal queue and later
     restarted.

  o  I/O hangs to a shadow set occurred because the timer queue
     element was not set up correctly.

  o  Incorrect register usage caused an Invalid Exception bugcheck
     from DUDRIVER.

  o  In the past, it was possible for MSCP to serve only 256 disks.
     This fix modifies MSCP internal constants so that 512 disks can
     be served.

  o  During disk and tape error recovery, MSCP was unable to perform a
     TMSCP controller reset, which resulted in a system crash.

  o  During the processing of a write-log entry in SHDRIVER, a
     register value could be improperly maintained if the system was
     low on nonpaged pool.
     This would cause a crash when the entry was resumed.

  o  This correction no longer checks the device ID. Instead, the
     software checks the geometry of the disks---cylinders, tracks,
     and sectors---as well as the maximum number of LBNs to ensure
     that the disks are the same before allowing shadow set members to
     be mounted.

     Note that if you install this remedial kit across your VMScluster
     system, SCSI shadow sets that are configured across different
     controller types are not supported and will no longer work. This
     is because the new shadowing software is comparing disk
     geometries and the maximum number of LBNs. The new shadowing
     software will allow the RZ28/RZ28B combination in a shadow set
     because they have like geometries and maximum LBNs, but it will
     not allow disks with different geometries and/or maximum LBNs to
     work in a shadow set. The following figure shows how the kit
     affects working and non-working configurations.

              Shadowed CI VMScluster
              -------- -- ----------

          +-Shadow Set-+
          |            |
          v            v
 [RZ28B] [RZ28]    [RZ28] [RZ28B]   <-----WORKED BEFORE KIT;
      \    /          \    /              WILL NOT WORK AFTER KIT
        \/              \/
      +-----+         +-----+
      | HSC |         | HSJ |
      +--+--+         +--+--+   +-Shadow Set-+
          \             /       |            |
            \         /         v            v
              \     /         [RZ28]     [RZ28B]  <---DID NOT WORK BEFORE
                 * CI            \         /          KIT; WILL WORK AFTER
                / \               \      /            KIT
              /     \              \   /
         +-----+  +-----+    +-------+------+
         |NODE1|  |NODE2|    |NODE3 (native)|
         +--+--+  +--+--+    +--+-----------+
            |        |          |
<-----------+--------+----------+-------------------->

     VMSclusters with shadowed SCSI disks and mixed-architecture
     VMSclusters running OpenVMS Alpha Version 6.1 must apply the kit
     and reboot the entire cluster simultaneously, so that the entire
     VMScluster is running the same version of Volume Shadowing
     software. The kit is required for both VAX and Alpha nodes. Do
     not mount shadow sets containing RZ28 and RZ28B devices without
     first applying this kit.

  o  SHDRIVER was accessing a field that contains how long a system
     was running as a word instead of as a longword.
     After the first 18 hours of operation, some OPCOM messages that
     should have been logged were skipped.

  o  If two members of a three-member shadow set were simultaneously
     removed --- either intentionally or in a failover situation ---
     this caused system hangs or failures.

  o  System crashes could be caused by two processes having access to
     the same file during virtual I/O cache (VIOC) expansion.

  o  When subjected to a high I/O load and multiple failures, the
     write logging (minimerge) and shadowing synchronization subsystem
     became unreliable. This correction improves the operational
     characteristics of write logging and of the shadowing driver as a
     whole.

  o  Unreliable shadow subsystem behavior and shadow-set hangs
     resulted from VMScluster nodes failing to relinquish shadow-set
     resources.

  o  The TMSCP server bugchecked in TMSCP$FIND_UQB when a command
     packet was being processed that referred to a specific unit and
     that unit did not have the Server Local Unit Number (SLUN) bit
     set. This change will return an end packet that will cause the
     offending class driver (TUDRIVER) to bugcheck.

  o  An internal routine, MOVE_SERVER, had a sequencing problem and
     could cause stalled I/O to a served shadow-set member.

  o  IRPs returning from stale I/O did not reflect changes in a
     shadow-set configuration, notably removal of members and changes
     in write logging state. System crashes occurred as IRPs filtered
     back through DUDRIVER and SHDRIVER.

  o  Shadow set members could be inconsistent after the failure of a
     node accessing a shadow set served by an Alpha node. The amount
     of corrupted data depends on previous I/O operations to the
     shadow set.

9  PROBLEMS ADDRESSED IN AXPSHAD01_061 KIT FOR OPENVMS ALPHA V6.1

  o  In Volume Shadowing for OpenVMS Alpha Version 6.1, several
     changes were made to the assisted merge (minimerge)
     functionality. These changes disabled minimerge functionality
     across mixed architecture VMSclusters.
     With minimerge disabled, shadowing continued to function
     normally, except that a full merge was always done when a merge
     operation occurred. Full merges take considerably longer than
     minimerges.

     If you want minimerge functionality, Digital recommends that you
     install this kit across any VMSclusters that contain an Alpha
     node running OpenVMS Alpha Version 6.1. Mixed-architecture
     VMSclusters that are running OpenVMS Alpha Version 6.1 must apply
     this kit and reboot the entire cluster simultaneously. In these
     cases, rolling upgrades are not supported.

  o  Prior to this remedial kit, if you attempted to mount an RZ28B
     disk device with an RZ28 in the same shadow set, Volume Shadowing
     detected different device IDs and may not have allowed the
     devices to be mounted. This behavior applied only to an
     RZ28/RZ28B shadow-set combination when connected with a local
     SCSI controller. Since RZ28 and RZ28B are different device types
     but can be shadowed, the checking for shadow-set membership in
     the host-based shadowing software needed to be modified.

     With this remedial kit, customers will be able to combine RZ28
     and RZ28B devices in a shadow set, as long as they are connected
     to like controllers. With the use of SCSI devices, like
     controllers are required because geometry can vary from
     controller to controller. Digital recommends that you configure
     SCSI shadow sets across like controller types. Existing SDI and
     DSSI configurations are unaffected; if they are not using SCSI
     drives and are shadowing SDI devices across different
     controllers, these configurations will continue to work without
     this remedial kit.

     Note that if you install this remedial kit across your VMScluster
     system, SCSI shadow sets that are configured across different
     controller types are not supported and will no longer work. This
     is because the new shadowing software is comparing disk
     geometries and the maximum number of LBNs.
  The new shadowing software will allow the RZ28/RZ28B combination in a shadow set because they have like geometries and maximum LBNs, but it will not allow disks with different geometries and/or maximum LBNs to work in a shadow set.

  The following figure shows how the kit affects working and non-working configurations.

     Shadowed CI VMScluster
     -------- -- ----------

         +-Shadow Set-+
         |            |
         v            v
[RZ28B] [RZ28]    [RZ28] [RZ28B] <-----WORKED BEFORE KIT;
   \      /          \      /          WILL NOT WORK AFTER KIT
    \    /            \    /
     \  /              \  /
      \/                \/
   +-----+           +-----+
   | HSC |           |HSJ  |
   +--+--+           +--+--+ +-Shadow Set-+
        \             /      |            |
          \         /        v            v
            \     /        [RZ28]      [RZ28B] <---DID NOT WORK BEFORE
               *  CI             \     /           KIT; WILL WORK AFTER
              / \                 \   /            KIT
             /   \                 \ /
        +-----+  +-----+    +-------+------+
        |NODE1|  |NODE2|    |NODE3 (native)|
        +--+--+  +--+--+    +--+-----------+
           |        |          |
<----------+--------+----------+-------------------->

  VMSclusters with shadowed SCSI disks and mixed-architecture VMSclusters running OpenVMS Alpha Version 6.1 must apply the kit and reboot the entire cluster simultaneously, so that the entire VMScluster is running the same version of Volume Shadowing software. The kit is required for both VAX and Alpha nodes.

  Do not mount shadow sets containing RZ28 and RZ28B devices without first applying this kit.

  The following memo contains the announcement from Storage product management about the capability of using RZ28/RZ28B devices in the same shadow set:

***********************************************************
From: Digital Storage
For questions call the StorageWorks Hotline: 1-800-786-7967
***********************************************************

ANNOUNCING CAPABILITY TO USE RZ28/RZ28B IN THE SAME SHADOW/STRIPE SET

O IMMEDIATELY SUPPORTED BEHIND THE FOLLOWING CONTROLLERS
  - HSD05 - X36 AND X36A
  - HSJ40 - V 1.3 AND V 1.4
  - HSD30 - V 1.4
  - SHADOW/STRIPE CAPABILITY ON HSCXX WILL BE AVAILABLE WITH V 8.4 AT A FUTURE DATE.
O RZ28/RZ28B SHADOW/STRIPE CAPABILITY UNDER LOCAL/NATIVE SCSI CONNECTION WILL BE AVAILABLE ON JUNE 6th.
  - OPEN VMS VAX/Alpha WILL RELEASE A PATCH KIT WITH POINTER ON HOW TO ACCESS.
O SHADOW/STRIPE SETS MUST BE ON THE SAME CONTROLLER TYPE IN BOTH CASES MENTIONED ABOVE.
  - CURRENT SHADOW SETS ACROSS UNLIKE CONTROLLERS WILL BE RENDERED INOPERABLE.

10 INSTALLATION INSTRUCTIONS:

If you are using the Shadowing option, it is highly recommended that this kit be installed.

When you install the ALPSHAD09_061 remedial kit you must also install the ALPSHAD10_061 or later remedial kit before rebooting your system. A system which is running the ALPSHAD09_061 kit without the ALPSHAD10_061, or a later SHADOW kit, may experience the MERGE problem or the SHADZEROMBR bugcheck problem. These problems are outlined and resolved in the ALPSHAD10_061 remedial documentation and kit.

Install this kit with the VMSINSTAL utility by logging into the SYSTEM account and typing the following at the DCL prompt:

    @SYS$UPDATE:VMSINSTAL ALPSHAD09_061 [location of the saveset]

The saveset location may be a tape drive, or a disk directory that contains the kit saveset.

****************************
*                          *
*           NOTE           *
*                          *
*  INSTALLATION WARNINGS   *
*                          *
****************************

* Future OpenVMS Alpha V6.1 kits that are issued for facilities included in the ALPSHAD09_061 kit will not install unless the ALPSHAD09_061 kit is installed on your system first.

  It is highly recommended that the complete ALPSHAD09_061 remedial kit be installed as soon as possible. Installation of individual images from the ALPSHAD09_061 remedial kit is not supported and could result in unpredictable system behavior.

* If you have a mixed-architecture cluster, and have not previously installed a shadowing kit, you must install the VAX version of this kit on the VAX nodes as well as the Alpha version of this kit on the Alpha nodes of the cluster BEFORE you bring up both types of systems in a cluster again.
  If both kits are not installed, you may not be able to create shadow sets. If you have previously installed a shadowing kit then you do not need to install the VAX version of this kit at this time as long as the shadowing kit installed on the VAX nodes of the cluster is VAXSHAD04_061 or later.

* Working configurations that contain SCSI shadow sets on dissimilar controllers may no longer work. For more information, please see the Problem Description section of the Cover Letter/Release Notes supplied with this kit.

Copyright Digital Equipment Corporation 1996. All rights reserved.

This software is proprietary to and embodies the confidential technology of Digital Equipment Corporation. Possession, use, or copying of this software and media is authorized only pursuant to a valid written license from Digital or an authorized sublicensor.

This ECO has not been through an exhaustive field test process. Due to the experimental stage of this ECO/workaround, Digital makes no representations regarding its use or performance. The customer shall have the sole responsibility for adequate protection and back-up data used in conjunction with this ECO/workaround.