OpenVMS__SHADOW VAXSHAD04_060 VAX V6.0 VOLUME SHADOWING ECO Summary

NOTE: An OpenVMS saveset or PCSI installation file is stored on the Internet in a self-expanding compressed file. The name of the compressed file will be kit_name-dcx_vaxexe for OpenVMS VAX or kit_name-dcx_axpexe for OpenVMS Alpha. Once the file is copied to your system, it can be expanded by typing RUN compressed_file. The resultant file will be the OpenVMS saveset or PCSI installation file which can be used to install the ECO.

Copyright (c) Digital Equipment Corporation 1995, 1996. All rights reserved.

*************************  CAUTION!!  *************************
*                                                             *
* Please *READ* the installation instructions in the release  *
* notes/cover letter for this ECO kit *BEFORE* you install    *
* it on your system.  System crashes may occur if this ECO    *
* kit is not installed on *every* OpenVMS VAX node and the    *
* OpenVMS Alpha (ALPSHAD) version of this kit is not          *
* installed on *every* OpenVMS Alpha in a mixed architecture  *
* cluster before the cluster is rebooted.                     *
*                                                             *
* Installation of this kit may also change the configuration  *
* performance for existing SCSI shadow sets.                  *
*                                                             *
***************************************************************

PRODUCT:    Volume Shadowing for OpenVMS (Phase II)

NOTE: The problems fixed in this ECO Kit also affect the following products:

    VAXcluster Software for OpenVMS VAX
    DEC TCP/IP Services for VMS (UCX)

OP/SYS:     OpenVMS VAX

COMPONENTS: System, Bugcheck, Backup, Mount, Dismount, MSCP, TMSCP, MTAAACP, I/O Routines, Audit Server, Security, System Primitives, Adaptive Pool Management (APM), Operator Communication Manager (OPCOM), User Environmental Test Package (UETP)

SOURCE:     Digital Equipment Corporation

ECO INFORMATION:

    ECO Kit Name:  VAXSHAD04_060

    ECO Kits Superseded by and Included in this ECO Kit:

        VAXSHAD03_060
        VAXSHAD07_061 (For OpenVMS VAX V6.0 ONLY)
        VAXSHAD06_061
        VAXSHAD05_061
        VAXSHAD04_061
        VAXSHAD03_061
        VAXSHAD01_061 (CSCPAT_1160)
        VAXSHAD02_060 (CSCPAT_1116)
        VAXSHAD01_060 (CSCPAT_1116)
        VAXDRIV02_060 (CSCPAT_1136)
        VAXSYS14_061 (For OpenVMS VAX V6.0 ONLY)
        VAXSYS12_061
        VAXSYS07_061
        VAXSYS01_061 (CSCPAT_1113)
        VAXSYS04_060 (CSCPAT_1113)
        VAXSYS03_060 (CSCPAT_1113, CSCPAT_1124)
        VAXSYS01_060 (CSCPAT_1113)

    ECO Kit Approximate Size:  2790 Blocks
    Kit Applies To:            OpenVMS VAX V6.0
    System Reboot Necessary:   Yes

CAUTION: Before Installing this Kit, Read the Following Cautions:

After installation of this kit, the following issues may occur:

1) ISSUE: When a node reboots into the cluster there may not be an OPCOM message that reports the node is joining the cluster. Absent messages occur on a random basis.

   WORKAROUND: In order to verify the node has entered the cluster, after the node has fully rebooted, the user should enter the command:

       $ SHOW CLUSTER

   to verify the node is a valid member of the VAXcluster.

2) ISSUE FROM THE CSC: An INVEXCEPTN in SNDRIVER may be seen if DECnet/SNA V2.1 is used in conjunction with the IO_ROUTINES from the VAXSHAD ECO kit. SNAVMS_E04021 (CSCPAT_5041) will fix this problem by replacing the incompatible SNDRIVER in DECnet/SNA V2.1.

   NOTE: SNAVMS_E04021 applies to DECnet/SNA V2.1 only.

These issues are being addressed and will be corrected in a future version of OpenVMS VAX.

ECO KIT SUMMARY:

An ECO kit exists for Volume Shadowing on OpenVMS VAX V6.0. This kit addresses the following problems:

Problems Addressed in the VAXSHAD04_060 Kit for OpenVMS VAX V6.0:

  o  The VAXSHAD03_060 remedial kit for OpenVMS VAX V6.0 should have superseded the VAXSYS14_061 remedial kit for OpenVMS VAX V6.0. This kit supersedes and includes fixes from VAXSYS14_061.

Problems Addressed in the VAXSHAD03_060 Kit for OpenVMS VAX V6.0:

  o  After applying the VAXSHAD07_061 kit to the system disk, systems booting from that disk would no longer boot and would crash in SYSINIT with a DELCONPFN bugcheck.

Problems Addressed in the VAXSHAD07_061 Kit for OpenVMS VAX V6.0:

  o  In the VAXSHAD05 and VAXSHAD06 kits two new fields were added to the IRP data structure for shadow write logging information. This new IRP definition size conflicts with the IRP sizes of other images on the system that are not part of the SHADOW kits. This conflict may cause a variety of errors, including fatal bugchecks. This fix changes the IRP definitions back to the SBB versions and adds some special definitions to the SHDRIVER for the new IRP fields.

  o  Fatal bugchecks from data structure corruption may occur due to the addition of the value 10 HEX to the corrupted field. Crashes are of various types and include node and cluster crashes, crashes due to invalid UCB addresses, invalid VCB addresses, invalid member IDs, and invalid number of devices.

Problems Addressed in the VAXSHAD06_061 Kit for OpenVMS VAX V6.0:

  o  When trying to access a DFS disk, the following error may be seen:

       -SYSTEM-F-FILALRACC, file already accessed on channel

     The disk can be accessed immediately after reboot; however, after a period of time of not accessing the disk, a simple directory command will return this error.

Problems Addressed in the VAXSHAD05_061 Kit for OpenVMS VAX V6.0:

  o  After a node crashes, on reboot it cannot mount a Host Based Volume Shadowing virtual unit. The error message usually returned is "volume not software enabled"; however, "Medium Offline" may also be seen. A SHOW DEVICE will show that the shadow set is in 0% merge but SNA will show that a minimerge is pending.

  o  A double deallocation crash may occur as the result of MOUNT not properly initializing the Mounted Volume List (MTL) pointer. This pointer had a stale value as a result of two calls to SYS$VMOUNT from a single program. The stale pointer will only cause a problem if the system is unable to allocate space for defining the logical name.

     NOTE: Since cells are initialized at image activation, this problem should not occur as a result of DCL commands.

  o  Tape devices with stacker/loaders, such as the TF857, may take up to 6 minutes to rewind/unload/load the next tape. In VAXSHAD01_061, a change was made to the behavior of MOUNT to take this delay into account. However, a side effect of that change was that non-stacker drives may also wait 6 minutes before failing. This problem has been addressed by this VAXSHAD kit.

  o  System crashes with an INVEXCEPTN during a SHDRIVER COPY_DATA_REPAIR copy operation.

  o  If the value of the ALLOCLASS SYSGEN parameter is not set and the user tries to use shadowing, a shadow volume can be created but members cannot be added to the shadow set. No error messages are received until a second member is added. On the MOUNT command, the customer will receive the error messages:

       $ mount /system dsa500 /shadow=dkb400 alphavms015
       %MOUNT-I-SHDWMEMFAIL, DKB400 failed as a member of the shadow set
       -SYSTEM-F-INCSHAMEM, incompatible shadow set member

     "Incompatible" is an inappropriate statement of the problem. A more accurate message would be "missing allocation class," or "incorrect allocation class."

  o  If a shadow set member is dismounted at the same time from multiple nodes within a cluster, I/O to a shadow set may become stalled.

  o  Mount will not add shadow set members unless they are either MSCP or SCSI.

  o  Shadow set member expulsion is currently based on the time it takes a fork & wait and a PACKACK to complete rather than the actual time transpired. On some devices, particularly SCSI, where a PACKACK can take approximately one minute, the timeout was much too long. Using the default value of 20 (seconds) for SHADOW_MBR_TMO would actually mean that it would take 20 minutes to expel a member experiencing errors from a SCSI shadow set.

  o  SHDRIVER loss of synchronization may result in a crash where SHADDETINCON is triggered by the check at the end of MATCH_MASTER_SCB. In this consistency check, the SHAD$W_DEVSTS_PASSIVE_MV_CNTR is verified to be zero and is not. Another symptom is that the virtual unit UCB$W_RWAITCNT is zero. Also shadow set member counts of zero may be seen.

  o  Crashes may occur in EXPEL_PACKACK_ANY with connections broken to all members and IRP$L_SHD_LOCK_FR5 = 1 (packack retries exhausted).

  o  All members of a shadow set become inaccessible at the same time and remain inaccessible for a period of time greater than "shadow member timeout" (SHADOW_MBR_TMO or SHADOW_SYS_TMO) seconds but less than MVTIMEOUT seconds. All members subsequently become accessible within seconds of each other but not at exactly the same time. This results in all but one member being expelled from the shadow set. This often occurs when changing HSJ microcode and all members are connected to the same HSJ. When brought back online, polling will cause the devices to be found seconds apart which will result in all but one member being expelled.

  o  All members of the set must be checked to see if they meet the criteria of being MSCP. The original design did not allow for having no index zero member.

  o  When the mounting of full copy targets exceeds the SHADOW_MAX_COPY threads for a given node, other nodes with the shadow set mounted do not pick up the copy work.

  o  In a cluster, using $PROCESS_SCAN explicitly or implicitly with the DCL SHOW USER command sometimes causes a system crash due to an ACCVIO in kernel mode or an IVSSRVRQST bugcheck.

  o  When a node with a SCSI bus boots, it resets the SCSI bus. In a multi-host SCSI cluster, this can cause the other node to experience I/O failures. Normally, this results in a brief mount verification. The I/O is retried, succeeds, and there is no serious consequence. However, if the other node is in the process of booting and the system disk is a shadow set, the system will crash.

  o  PGFIPLHI bugcheck in the SHADOW_SERVER process at the REMQUE in K_GET_COPYSHAD_IRP. On OpenVMS VAX, the PC is A0E and the VA is 274.

  o  A page setup module which draws a frame and company logo on each page of output is used on a queue pointing to an LN03. This page setup module works on OpenVMS VAX Version 5.5-2 and prior versions. However, with VAXQMAN8_U2055 (CSCPAT_1165) or OpenVMS VAX Version 6.1 installed, this page setup module causes the printer to continually spew out paper with only the output from the page setup module. This continues until the entry is deleted from the queue.

  o  Due to an inadequate synchronization mechanism, the MONITOR DISK command can go into an infinite loop on multi-processing machines.

  o  If a multi-programming application uses a non-homogeneous access pattern to a file which is resident in Virtual I/O cache, there is a possibility that the size returned in the I/O status block from a READ operation will be truncated.

  o  If a clustered application uses a large number of concurrent processes to perform file operations consisting of an OPEN, WRITE, and CLOSE sequence repetitively on the same data file, data corruption may occur.

  o  In a multi-programming environment where a significant amount of NEW data from a file is being loaded into the cache concurrently by multiple processes, the system may HANG.

  o  If a user attempts to mount a disk that is 100% full on OpenVMS VAX V6.* and the disk was originally initialized with a version of OpenVMS VAX prior to V6.0, paged pool can be corrupted leading to system crashes. If the disk is filled AFTER it has been mounted under V6.*, there will not be any problem.

  o  The class driver will sometimes attempt to send an MSCP command packet on the wrong connection. This fix detects this mismatch and corrects it.

  o  Due to invalid allocation counts, processes hang in RWNPG state waiting for a request for non-paged pool (NPP) so large that it cannot be satisfied.

  o  The system crashes with the current process executing a $CHKPRO system service call.

  o  A $AUDIT_EVENT system crash may occur in SECURITY.EXE due to corrupt scan structure storage.

  o  When a rights list is passed into $CHKPRO (CHP$_RIGHTS), it is copied into the ARB within the NSA$A_SCRATCH area. This area will hold a maximum of eight rights. The code that handles this copy operation will split any larger rights list into the first eight, which are copied into the local rights area, and the remainder, for which a descriptor is created whose address is added as extended process rights. The code involved in copying the first eight rights was looping incorrectly and copying rights to random locations within the NSA$A_SCRATCH area, usually resulting in an SSRVEXCPT crash.

Problems Addressed in the VAXSHAD04_061 Kit for OpenVMS VAX V6.0:

  o  When booting two or more systems simultaneously from shadowed system disks, the systems may appear to hang. Crashing the systems and examining the crash dumps indicates that shadowing driver blocking AST routines have not run.

  o  In a two-node VAXcluster configuration containing a DSSI system shadow set and a quorum disk, if one node exits the cluster and reboots, the node will hang on boot while attempting to form the system disk shadow set virtual unit.

  o  When multiple virtual unit mount commands are issued that will result in copy operations, only the node from which the commands are issued will attempt to perform the copy operations. Only the SHADOW_MAX_COPY number of copies will run simultaneously. This means that copy operations might take longer than expected and copies will not be started for copy members that are added to shadow sets.

  o  On OpenVMS VAX V6.0 systems, disks could not be mounted after installation of VAXSHAD03_061. This problem is fixed in OpenVMS VAX V6.1.

Problems Addressed in the VAXSHAD03_061 Kit for OpenVMS VAX V6.0:

  o  A double-deallocation crash may occur as the result of MOUNT not properly initializing the MTL pointer. This pointer had a stale value as a result of 2 calls to SYS$VMOUNT from a single program. The problem will not happen as a result of DCL commands, as the cells are initialized at image activation. The stale pointer will only cause a problem if the system is unable to allocate space for defining the logical name.

  o  An OPCOM message was being output even though /NOASSIST was specified in the MOUNT command. This caused problems for UETP.

  o  System crash in SECURITY.EXE.

  o  A process is in RWPAG while auditing an event.

  o  When the current process executes a $CHKPRO system service call, the system will crash.

  o  Processes hang in RWNPG state (Call to $CRMPSC) waiting for a request for NPP so large that it cannot be satisfied.

  o  DISMOUNT/OVERRIDE=CHECKS against the SYSTEM disk is allowed. Once this command is issued nothing else can be done. Installation of this kit will only allow this command to be issued on non-system disks.

Problems Addressed in the VAXSHAD01_061 Kit for OpenVMS VAX V6.0:

  o  In Volume Shadowing for OpenVMS Alpha V6.1, minimerge functionality across mixed architecture VMSclusters was disabled. In order to reestablish the minimerge functionality, install this kit across any VMScluster that contains an OpenVMS Alpha V6.1 node.

  o  Mounting an RZ28B disk device with an RZ28 in the same shadow set is not allowed and will display the following error:

       %MOUNT-I-SHDWMEMFAIL, $1$DUA0 failed as a member of the shadow set
       -SYSTEM-F-INCSHAMEM, incompatible shadow set member

     This behavior is seen when RZ28/RZ28B shadow set members are connected with a local SCSI (Small Computer System Interface) controller. With this kit, RZ28 and RZ28B devices can be combined in a shadow set if they are connected to like controllers.

     NOTE: If this kit is installed across a VMScluster, SCSI shadow sets configured across different controller types are not supported and will no longer work.

Problems Addressed in the VAXSHAD02_060 Kit for OpenVMS VAX V6.0:

  o  After installation of CSCPAT_1116 V1.0 (VAXSHAD01_060), the system may crash with a SHADDETINCON bugcheck at SHDRIVER+F0B4. The bugcheck occurs when a disk is removed from a mounted shadow set.

Problems Addressed in the VAXSHAD01_060 Kit for OpenVMS VAX V6.0:

  o  In a situation in which more than one member of a three-member shadow set goes into error recovery at the same time and cannot be brought back into the shadow set (due to loss of connectivity, media offline, write-locked device, etc.), SHDRIVER expels one of the members and crashes with a SHADDETINCON bugcheck because it cannot update the Storage Control Block (SCB) on the remaining members. This can cause many cluster nodes to crash at the same time.

  o  When all three members of a three-member shadow set are write-locked, a bugcheck will occur due to the destruction of Register 4 upon execution of a jump to sub-routine command that overwrites the value in the register.

  o  The SHADOW_MAX_COPY SYSGEN parameter is used to set how many merge/copy threads may be started at the same time on a node. Systems are allowing more than SHADOW_MAX_COPY number of threads to run concurrently.

  o  Various SHDRIVER system disk member timer issues and Register 2/Register 5 Corruption:

     - The SHSB$MATCH_MASTER_SCB routine uses SHSB$PAUSE incorrectly. This improper usage causes the value in Register 2 to be destroyed when the time delay is invoked, so the resulting value in Register 2 is indeterminate.

     - The SHSB$MATCH_MASTER_SCB routine uses SH$TIME_DELAY incorrectly. This improper usage causes an incorrect value to be placed in Register 5, which requires a UCB value.

     - The SH$ABORT_VP routine uses SH$TIME_DELAY incorrectly. This improper usage causes the value in Register 2 to be destroyed when the time delay is invoked, so the resulting value in Register 2 is indeterminate.

     - In some customer configurations, the benefit of re-assembling a multiple-member system disk shadow set is lost. This occurs because the fixed amount of time expires and not all of the former members are available.

     - Member time out for system disks and other disks is not differentiated.

     - The hardcoded wait of FF seconds to connect to all members of an existing system disk is not a controllable variable.

  o  SHDRIVER MVTIMEOUT and R5 Corruption errors:

     - When one member of a multiple-member shadow set is spontaneously removed from the shadow set due to a fatal error condition, some VAXcluster nodes will hang the virtual unit until the MVTIMEOUT time expires.

     - After a call to SHSB$PAUSE, the wait loop at 103$ in SHSB$VALIDATE_SHADOW_SET does not correctly restore the contents of R5 to be the virtual unit.

  o  Post-processing is not performed correctly on all clones, which causes allocation of new, unnecessary Write Log Entries. The Write Log INUSE bit is never cleared and the write log table has to be expanded. Once the table expands to MAX, Write Logging is disabled. When Write Logging is turned back on, the cycle begins again. Eventually, all the entries in the controller are exhausted, which forces Write Log Exhaustion handling and, in some cases, the controller is reset.

  o  If the READ of Logical Block #1 fails during INVALIDATE_ALL_ENTRIES or if WLG has been turned off, the shadow set will hang with a SEQCMD lock and an incorrectly incremented RWAITCNT.

Problems Addressed in the VAXDRIV02_060 Kit for OpenVMS VAX V6.0:

  o  A tape drive will sometimes fail over to another HSx controller after the tape is dismounted.

  o  Numbers greater than 9999 which are randomly generated by HSx devices may cause the system to crash.

  o  RE-INITIALIZATION errors are reported to users of SCSI tape drives attached to an HSx controller. This occurs if multiple SCSI tapes are attached to the HSx and all the tapes are at or near PEOT and the connection to the HSx is broken.

Problems Addressed in the VAXSYS14_061 Kit for OpenVMS VAX V6.0:

  o  There is a race condition that may occur when a CFCB (Cache File Control Block) is being deleted due to XQP action and cache space is being reclaimed from a LIMBO file.

  o  Disk corruption can occur when heavy open/read/write/close/delete operations are occurring.

  o  At some point after a node CLUEXITs, 2 or more cluster nodes crash with LOCKMGRERR bugchecks.

  o  When two or more VAX or Alpha nodes boot at the same time, one or more of them may crash.

Problems Addressed in the VAXSYS07_061 Kit for OpenVMS VAX V6.0:

  o  If a multi-programming application uses a non-homogeneous access pattern to a file which is resident in Virtual I/O cache, there is a possibility that the size returned in the I/O status block from a READ operation will be truncated.

  o  If a clustered application uses a large number of concurrent processes to perform file operations consisting of an OPEN, WRITE, and CLOSE sequence repetitively on the same data file, data corruption may occur.

  o  In a multi-programming environment where a significant amount of NEW data from a file is being loaded into the cache concurrently by multiple processes, the system may HANG.

  o  Documentation states that -1 as well as 0 is accepted as a wildcard in SYS$GETLKI. However, that is no longer the case beginning with V5.5.

Problems Addressed in the VAXSYS01_061 Kit for OpenVMS VAX V6.0:

  o  SYS$CHKPRO had several problems that did not manifest themselves in a readily visible effect to the end user. The problems include:

     - Accepting up to 11 rights lists even though no more than two would actually be processed.

     - CHKPRO would accept a CHP$_UIC and write it over a location which was to contain a rightslist pointer.

     - In most cases the wrong UIC was used in access checking.

     The only time the customer would notice a problem is if they specifically tested access to an object known to be protected from current rights and UIC settings.

  o  Nonpaged dynamic memory (NPAGEDYN) expansion occurs even when there is a large amount of free space available. This can lead to performance problems as pool expansion causes free memory to be diverted away from that available to processes and dedicated to nonpaged pool usage.

     For example, with a SHOW MEMORY/POOL command you can observe that the "Total" amount of "Nonpaged Dynamic Memory" increases when the amount of "Free" bytes is quite large:

       Dynamic Mem Usage (bytes):      Total       Free     In Use    Largest
         Nonpaged Dynamic Mem       38555136   17372224   21182912      38720
         Paged Dynamic Mem          17282048    8295888    8986160    8265232

     Starting with the introduction of the Adaptive Pool Management (APM) feature in OpenVMS VAX V6.0, these figures include the contributions of both the lookaside lists and the variable pool. So, a large "Free" figure is indicative of large (and possibly growing) lookaside lists. If the "Total" figure is increasing, it indicates that pool expansion is occurring, and that the lookaside list space is not being used effectively.

     The above symptom can result from either of the two following separate problems:

     - A routine in the software which supports security features such as "rightslists" was obtaining a nonpaged pool block and then freeing it in two smaller pieces.

     - An internal loop counter governing the number of times a lookaside list allocation was attempted was set too low. This problem will most likely be seen on the VAX 6000-500 and 600.

     A third software change associated with APM will also be available in a future OpenVMS VAX version, but is not available as a remedial change. The third change provides a potential performance benefit under very specialized conditions, such as during VMScluster state transitions.

Problems Addressed in the VAXSYS03_060 & VAXSYS04_060 Kits for OpenVMS VAX V6.0:

  o  When tapes are served in a cluster, tape profiles cannot be changed. The problem has occurred in the following two ways:

     1) If discretionary access does not allow the audit server process access to the device, the profile cannot be changed.

     2) If the object server is available (though it had been started at least once), the ORB$V_TRANSITION flag is set and not cleared. In this case, only BYPASS privilege allows access to the device. This prevents a profile change as in (1). The profile change, once it is allowed, must clear the TRANSITION flag.

  o  Cluster object profile resolution can fail for tape devices when ASSIGN fails with SS$_NOPRIV. This has shown up in matrix testing with failures of STABACKIT and UETP.

Problems Addressed in the VAXSYS01_060 Kit for OpenVMS VAX V6.0:

  o  Attempting to access the VPROT item with GETDVI on UCX TYMNET terminals may result in an access violation. The VPROT item is implicitly accessed using $GETDEV and $GETCHN services, which are used by a number of utilities.

INSTALLATION NOTES:

This kit *MUST* be installed on every VAX in a mixed-architecture VMScluster, and the Alpha (AXPSHAD) version of this kit *MUST* be installed on every Alpha system in the cluster BEFORE any systems are re-booted into the VMScluster. If the correct kit is not installed on each system, shadow sets cannot be created. System crashes may also occur if the kits are not installed on all appropriate cluster nodes.

The following restrictions will apply upon completion of the installation:

  o  VMSclusters with shadowed SCSI disks and mixed-architecture VMSclusters running OpenVMS Alpha V6.1 must apply the kit and reboot the entire cluster simultaneously. In these cases, rolling upgrades are not supported.

  o  Working configurations that contain SCSI shadow sets on dissimilar controllers may no longer work.
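
As a rough sketch only, the expansion step described in the NOTE at the top of this summary, followed by a standard VMSINSTAL invocation, might look like the commands below. The directory DKA100:[KITS] and the expanded saveset name VAXSHAD04_060.A are assumptions made for illustration; the release notes/cover letter supplied with the kit remain the authoritative procedure.

```dcl
$ ! Expand the self-expanding compressed file, as described in the
$ ! NOTE at the top of this summary.
$ ! Assumption: the expanded saveset is named VAXSHAD04_060.A.
$ RUN VAXSHAD04_060.A-DCX_VAXEXE
$ !
$ ! Install the saveset with VMSINSTAL from a suitably privileged account.
$ ! DKA100:[KITS] is a hypothetical location for the expanded saveset.
$ @SYS$UPDATE:VMSINSTAL VAXSHAD04_060 DKA100:[KITS]
$ !
$ ! Once the kit is on every node and the systems have been rebooted,
$ ! confirm cluster membership by hand, because the OPCOM message
$ ! reporting that a node has joined may be absent.
$ SHOW CLUSTER
```

Repeat the installation on each system disk in the cluster before any node is rebooted, in line with the restrictions listed above.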

Files on this server are as follows:

vaxshad04_060.README
vaxshad04_060.CHKSUM
vaxshad04_060.CVRLET_TXT
vaxshad04_060.a-dcx_vaxexe