OpenVMS__SHADOW VAXSHAD09_061 VAX V6.1 ECO Summary
NOTE: An OpenVMS saveset or PCSI installation file is stored
on the Internet in a self-expanding compressed file.
The name of the compressed file will be kit_name-dcx_vaxexe
for OpenVMS VAX or kit_name-dcx_axpexe for OpenVMS Alpha.
Once the file is copied to your system, it can be expanded
by typing RUN compressed_file. The resultant file will
be the OpenVMS saveset or PCSI installation file which
can be used to install the ECO.
Copyright (c) Digital Equipment Corporation 1994, 1995. All rights reserved.
***** WARNING!!! *****
Future OpenVMS VAX V6.1 kits that are issued for facilities
included in the VAXSHAD09_061 kit will not install unless the
VAXSHAD09_061 kit is installed on your system first. It is
highly recommended that the complete VAXSHAD09_061 remedial kit
be installed as soon as possible. Installation of individual
images from the VAXSHAD09_061 remedial kit is not supported and
could result in unpredictable system behavior.
Descriptions for problems that were corrected in previous VAX
Shadow kits are included in the VAXSHAD09_061 Release Notes.
The release notes can be found in the save set VAXSHAD09_061.A.
If you have not installed a previous shadow kit, it is
recommended that you read these release notes before installing
the VAXSHAD09_061 Shadow kit. To access the release notes,
restore them from the saveset by issuing a command with the
following format:
$ BACKUP/SEL=VAXSHAD09_061.RELEASE_NOTES DEVICE:[DIR]VAXSHAD09_061.A/SAVE_SET -
DEVICE:[DIR]VAXSHAD09_061.RELEASE_NOTES
PRODUCT: Volume Shadowing for OpenVMS (Phase II)
NOTE: The problems fixed in this ECO Kit also affect the
following products:
VAXcluster Software for OpenVMS VAX
VAXcluster Console System (VCS)
OP/SYS: OpenVMS VAX
COMPONENTS: System, Bugcheck, Backup,
Mount, Dismount, MSCP, TMSCP, MTAAACP,
I/O Routines, Audit Server,
Security, System Primitives,
Adaptive Pool Management (APM),
Operator Communication Manager (OPCOM),
User Environmental Test Package (UETP),
Media Management Extensions (MME)
SOURCE: Digital Equipment Corporation
ECO INFORMATION:
ECO Kit Name: VAXSHAD09_061
ECO Kits Superseded by and Included in this ECO Kit:
VAXSHADFT09_061 (Never Officially Released)
VAXSHAD08_061 (Never Officially Released)
VAXSHAD07_061 (For OpenVMS VAX V6.1 systems only)
VAXSHAD06_061
VAXSHAD05_061
VAXSHAD04_061
VAXSHAD03_061
VAXSHAD02_061 (CSCPAT_1160)
VAXSHAD01_061 (CSCPAT_1160)
VAXMTAA01_062 (For OpenVMS VAX V6.1 systems only)
VAXMTAA02_061
VAXMTAA01_061 (CSCPAT_1154)
VAXMONT01_061 (For OpenVMS VAX V6.1 systems only)
VAXSYS14_061 (For OpenVMS VAX V6.1 systems only)
VAXSYS12_061
VAXSYS07_061
VAXSYS01_061 (CSCPAT_1113)
VAXMME01_061 (CSCPAT_1174)
VAXOPCO01_061 (CSCPAT_1144)
VAXAUDI02_061
ECO Kit Size: 3960
Kit Applies To: OpenVMS VAX V6.1
System/Cluster Reboot Necessary: Yes
CAUTION:
Before Installing this Kit, Read the Following Cautions:
After installation of this kit, the following issues may occur:
1) ISSUE: When a node reboots into the cluster there may not
be an OPCOM message that reports the node is joining
the cluster. Absent messages occur on a random
basis.
WORKAROUND: In order to verify the node has entered the
cluster, after the node has fully rebooted, the
user should enter the command:
$ SHOW CLUSTER
to verify the node is a valid member of the
VAXcluster.
2) ISSUE FROM THE CSC: An INVEXCEPTN in SNDRIVER may be seen if
DECnet/SNA V2.1 is used in conjunction
with the IO_ROUTINES from the VAXSHAD
ECO kit. SNAVMS_E04021 (CSCPAT_5041) will
fix this problem by replacing the
incompatible SNDRIVER in DECnet/SNA V2.1.
NOTE: SNAVMS_E04021 applies to
DECnet/SNA V2.1 only.
These issues are being addressed and will be corrected in a
future version of OpenVMS VAX.
ECO KIT SUMMARY:
An ECO kit exists for Volume Shadowing on OpenVMS VAX V6.1. This
kit contains the fixes described below.
Problems Addressed in the VAXSHAD09_061 Kit:
o A 'SET SECURITY' or 'SET ACL' on a volume in an OpenVMS
cluster places high I/O on the server process. This
exhausts paged pool and the AUDIT_SERVER goes into an
RWPAG state.
This problem is corrected in OpenVMS VAX V6.2.
o A field in the IRP that is used during Volume Processing is
not initialized in clones of USER IOs. If an error occurs,
the code that determines the severity of the error can be
misled by data in these fields. It can fail to locate the
error and return the IO as successful. Since a zero-byte count
is returned, an Incomplete Segmented Transfer error will occur.
The fix is to initialize the field when the clone is allocated.
o While creating a page, a user process might be swapped out and
then return using a different balance set slot.
This problem is corrected in OpenVMS VAX V6.2.
o Certain applications calling $AUDIT_EVENT with ASTs disabled
will be interrupted when $AUDIT_EVENT returns to the caller.
This problem is corrected in OpenVMS VAX V6.2.
o The code relies on a page being present when it attempts
to release a spinlock. If the system is paging heavily,
the page may not be available. This may result in pagefaults
in EXE$BRKTHRU at IPL greater than 2.
This problem is corrected in OpenVMS VAX V6.2.
o Repeating wakeups from $SCHDWK show an accumulating drift over
time.
This problem is corrected in OpenVMS VAX V6.2.
o Magnetic tape position may be lost in differing circumstances:
- COPY and/or BACKUP of a DISK to a TMSCP-Served TAPE will fail
when the tape device is placed in an MV state. The failure
does not occur if the same task is performed locally.
- COPY will fail with: "SYSTEM-F-TAPEPOSLOST, magnetic tape
position lost"
- BACKUP will fail with: "-SYSTEM-F-DATALOST, data lost"
This problem is corrected in OpenVMS VAX V6.2.
o To transition an OpenVMS process from the virtual balance set
to the real balance set, the SPTEs (system page table entries)
which describe its process PTE pages (process page table pages)
need to be copied from saved memory back into the real balance
slot from where they originally came. This makes the process'
P0 and P1 space accessible again. SPTEs for the process page
table pages describing the undefined area between P0 and P1
must be represented by pre-initialized null values (actually,
ERKW DZERO-type values). When this undefined void area is
exactly zero pages (i.e., P0 and P1 are tangent), the
VBSS$READ_OPT2_VBSM routine takes the wrong branch, causing a
VBSSERR bugcheck. This fix adds a test for this case and
takes the correct branch.
This problem is corrected in OpenVMS VAX V6.2.
o When a process is switched from a real balance slot to a
virtual balance slot, the allocation may fail, causing a
VBSSERR bugcheck.
This problem is corrected in OpenVMS VAX V6.2.
o Incorrect quota value is returned when process quota (BYTLM) is
returned to a process for a created system global section.
This problem is corrected in OpenVMS VAX V6.2.
o System crashes may occur due to corrupted PTE entries. The
corruption appears to be Global Section Table Entries pointing
to Global Section Descriptors.
The problem occurs only if 4095 GBLSECTIONS are exceeded. To
check the number of Global Sections currently in use, add the
following values:
- SDA> VALIDATE QUEUE EXE$GL_GSDSYSFL !global sections
- SDA> VALIDATE QUEUE EXE$GL_GSDDELFL !delete pending global
!sections
- SDA> VALIDATE QUEUE EXE$GL_GSDGRPFL !group global sections
o Devices can remain allocated to processes that no longer exist.
The device remains unusable until the system is rebooted.
o If a previously shadowed disk is mounted with a MOUNT/OVER=SHADOW
command and a new shadow set is created using this disk,
OpenVMS VAX will attempt to create the old shadow set using the
old physical device names.
o The system crashes with a NOBVPVCB bugcheck. The crash occurs
on the kernel stack with MTAAACP.EXE as the current image.
o The system crashes with an XQPERR while dismounting a MAD
drive.
o SUBTRACED errors are not correctly determined for images installed
with /HEADER_RESIDENT.
This problem is corrected in OpenVMS VAX V6.2.
o Users of ORACLE[R] Rdb V6.1 may get ILLIOFUNC errors when doing IO
to a Host Based Shadowset whose members are served.
o The user will see a large number of shadow copies being done by
OpenVMS rather than the controller, even when both disks are on
the same controller and the controller has DCD (Disk Copy Data)
capabilities.
o If a three-member Shadowset has its index zero member as a copy
target and all three members require a MERGE, when the COPY
completes the MERGE does not take place. The LBN for the just
completed COPY (the last LBN on the disk) is passed as the
MERGE starting LBN, so it completes without doing any IO.
o Failures occur during attempts to start copies or restart
copies, usually after a node halt, shutdown or reboot.
Additional symptoms observed include inconsistent values for
HBS_CIP when compared to SHADOW_MAX_COPY, negative values for
HBS_CIP, and copies that should continue but instead start over
from the beginning.
o System hangs may occur when I/Os pending to a shadow set do
not complete.
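For the global section item above, the check amounts to adding the three
queue counts reported by SDA and comparing the total with 4095. The
following is a minimal sketch in Python, not part of the kit; the function
names and the sample counts are hypothetical and serve only to show the
arithmetic.

```python
# Hypothetical helper: adds the three VALIDATE QUEUE counts from SDA and
# compares the total with the 4095 figure quoted in the item above.
GBLSECTIONS_THRESHOLD = 4095  # figure taken from the release note


def global_sections_in_use(system_count, delete_pending_count, group_count):
    """Return the sum of the system, delete-pending, and group queue counts."""
    return system_count + delete_pending_count + group_count


def exceeds_threshold(total):
    """Return True when the total is above 4095 global sections."""
    return total > GBLSECTIONS_THRESHOLD


if __name__ == "__main__":
    # Made-up sample counts, one per SDA queue.
    total = global_sections_in_use(3900, 12, 200)
    print(total, exceeds_threshold(total))
```

A system whose total stays at or below 4095 would not be expected to hit
the problem as described.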
Problems Addressed in the VAXSHAD07_061 Kit:
o In the VAXSHAD05 and VAXSHAD06 kits, two new fields were added
to the IRP data structure for shadow write logging information.
This new IRP definition size conflicts with the IRP sizes of
other images on the system that are not part of the SHADOW kits.
This conflict may cause a variety of errors, including fatal
bugchecks. This fix changes the IRP definitions back to the SBB
versions and adds some special definitions to the SHDRIVER for
the new IRP fields.
o Fatal bugchecks from data structure corruption may occur due
to the addition of the value 10 HEX to the corrupted field.
Crashes are of various types and include node and cluster
crashes, crashes due to invalid UCB addresses, invalid VCB
addresses, invalid member IDs, and invalid number of devices.
o When trying to access a DFS disk, the following error may be
seen:
-SYSTEM-F-FILALRACC, file already accessed on channel
The disk can be accessed immediately after reboot; however,
after a period of time of not accessing the disk, a simple
directory command will return this error.
o If a tape is initialized with a non-blank accessibility field
and then mounted using /OVERRIDE=(ACCESSIBILITY), the tape
mounts but cannot be read or written to. The command format
to initialize the tape would be similar to:
INIT/LABEL=VOLUME_ACCESSIBILITY="+" tape: LABEL
In addition, the following OPCOM messages are generated and
the tape volume is automatically unloaded after an attempt to
WRITE or READ the tape volume:
%%%%%%%%%%% OPCOM 12-DEC-1994 12:57:23.53 %%%%%%%%%%%
Message from user USERXX on NODEXX
non-blank accessibility field in volume labels on SYS$DEVICE:
%%%%%%%%%%% OPCOM 12-DEC-1994 12:57:23.54 %%%%%%%%%%%
o MTAAACP posts attention ASTs to its mailbox. If the AST
QUOTA reaches zero and an attempt is made to kill the MTAAACP
process or the process that emitted the QIO, MTAAACP will go
into the RWAST state and hang.
Problems Addressed in the VAXSHAD06_061 Kit:
o When using PATHWORKS, data corruption may occur on the file
container. The corruption can be seen by running CHKDSK on the PC
container disk. Also, using PCDISK to IMPORT and EXPORT files to
and from the container will show a corrupted file when it is
EXPORTed back to VMS.
o System crashes with INVEXCEPTN bugcheck at SCH$POSTEF+21.
To correct this problem, a change was made in the IOC$SIMREQCOM
routine to cause the destination of the IFNOWET test to
initialize R4 before calling the IOC$SCHEDEF routine.
IOC$SCHEDEF expects R4 to have the address of the user's PCB.
Problems Addressed in the VAXSHAD05_061 Kit for OpenVMS VAX V6.1:
o After a node crashes, on reboot it cannot mount a Host Based
Volume Shadowing virtual unit. The error message usually
returned is "volume not software enabled"; however, "Medium
Offline" may also be seen. A SHOW DEVICE will show that the
Shadowset is in 0% merge but SNA will show that a minimerge
is pending.
o A double deallocation crash may occur as the result of MOUNT not
properly initializing the Mounted Volume List (MTL) pointer. This
pointer had a stale value as a result of two calls to SYS$VMOUNT
from a single program. The stale pointer will only cause a problem
if the system is unable to allocate space for defining the logical
name.
NOTE: Since cells are initialized at image activation, this
problem should not occur as a result of DCL commands.
o Tape devices with stacker/loaders, such as the TF857, may take
up to 6 minutes to rewind/unload/load the next tape. In
VAXSHAD01_061, a change was made to the behavior of MOUNT to take
this delay into account. However, a side effect of that change
was that non-stacker drives may also wait 6 minutes before failing.
o System crashes with an INVEXCEPTN during a SHDRIVER COPY_DATA_REPAIR
copy operation.
o If the value of the ALLOCLASS SYSGEN parameter is not set and the
user tries to use shadowing, a shadow volume can be created but
members cannot be added to the shadow set. No error messages are
received up until a second member is added. On the MOUNT command,
the customer will receive the error messages:
$ mount /system dsa500 /shadow=dkb400 alphavms015
%MOUNT-I-SHDWMEMFAIL, DKB400 failed as a member of the shadow set
-SYSTEM-F-INCSHAMEM, incompatible shadow set member
"Incompatible" is an inappropriate statement of the problem. A
more accurate message would be "missing allocation class," or
"incorrect allocation class."
o If a shadow set member is dismounted at the same time from multiple
nodes within a cluster, I/O to the shadow set may become stalled.
o Mount will not add shadow set members unless they are either
MSCP or SCSI.
o Shadow set member expulsion is currently based on the time it takes
a fork & wait and a PACKACK to complete rather than the actual time
transpired. On some devices, particularly SCSI, where a PACKACK
can take approximately one minute, the timeout was much too long.
Using the default value of 20 (seconds) for SHADOW_MBR_TMO would
actually mean that it would take 20 minutes to expel from a SCSI
shadow set a member experiencing errors.
o SHDRIVER loss of synchronization may result in a crash where
SHADDETINCON is triggered by the check at the end of
MATCH_MASTER_SCB. In this consistency check, the
SHAD$W_DEVSTS_PASSIVE_MV_CNTR is verified to be zero and is not.
Another symptom is that the virtual unit UCB$W_RWAITCNT is
zero. Shadow set member counts of zero may also be seen.
o Crashes may occur in EXPEL_PACKACK_ANY with connections broken to
all members and IRP$L_SHD_LOCK_FR5 = 1 (packack retries exhausted).
o All members of a shadow set become inaccessible at the same time and
remain inaccessible for a period of time greater than "shadow
member timeout" (SHADOW_MBR_TMO or SHADOW_SYS_TMO) seconds but
less than MVTIMEOUT seconds. All members subsequently become
accessible within seconds of each other but not at exactly the same
time. This results in all but one member being expelled from the
shadow set.
This often occurs when changing HSJ microcode and all members are
connected to the same HSJ. When brought back online, polling will
cause the devices to be found seconds apart which will result in
all but one member being expelled.
o All members of the set must be checked to see if they meet the
criteria of being MSCP. The original design did not allow
for having no index zero member.
o When the mounting of full copy targets exceeds the SHADOW_MAX_COPY
threads for a given node, other nodes with the shadow set mounted
do not pick up the copy work.
o In a cluster, using $PROCESS_SCAN explicitly or implicitly with the
DCL 'SHOW USER' command sometimes causes a system crash due to an
ACCVIO in kernel mode or an IVSSRVRQST bugcheck.
o When a node with a SCSI bus boots, it resets the SCSI bus. In a
multi-host SCSI cluster, this can cause the other node to
experience I/O failures. Normally, this results in a brief mount
verification. The I/O is retried, succeeds, and there is
no serious consequence. However, if the other node is in the
process of booting and the system disk is a shadow set, the
system will crash.
o A PGFIPLHI bugcheck may occur in the SHADOW_SERVER process at
the REMQUE in K_GET_COPYSHAD_IRP. On OpenVMS VAX, the PC is
A0E and the VA is 274.
o A page setup module which draws a frame and company logo on each
page of output is used on a queue pointing to an LN03. This page
setup module works on OpenVMS VAX Version 5.5-2 and prior versions.
However, with VAXQMAN8_U2055 (CSCPAT_1165) or OpenVMS VAX Version
6.1 installed, this page setup module causes the printer to
continually spew out paper with only the output from the page setup
module. This continues until the entry is deleted from the queue.
o If a multi-programming application uses a non-homogeneous access
pattern to a file which is resident in Virtual I/O cache, there is
a possibility that the size returned in the I/O status block from a
READ operation will be truncated.
o If a clustered application uses a large number of concurrent
processes to perform file operations consisting of an OPEN, WRITE,
and CLOSE sequence repetitively on the same data file, data
corruption may occur.
o In a multi-programming environment where a significant amount of
NEW data from a file is being loaded into the cache concurrently by
multiple processes, the system may HANG.
o If a user attempts to mount a disk that is 100% full on OpenVMS VAX
V6.* and the disk was originally initialized with a version of
OpenVMS VAX prior to V6.0, paged pool can be corrupted leading to
system crashes. If the disk is filled AFTER it has been mounted
under V6.*, there will not be any problem.
o The class driver will sometimes attempt to send an MSCP command
packet on the wrong connection. This fix detects this mismatch and
corrects it.
o Due to invalid allocation counts, processes hang in RWNPG state
waiting for a request for non-paged pool (NPP) so large that it
cannot be satisfied.
o The system crashes with the current process executing a $CHKPRO
system service call.
o A $AUDIT_EVENT system crash may occur in SECURITY.EXE due to corrupt
scan structure storage.
o When a rights list is passed into $CHKPRO (CHP$_RIGHTS), it is
copied into the ARB within the NSA$A_SCRATCH area. This area
will hold a maximum of eight rights. The code that handles this
copy operation will split any larger rights list into the first
eight, which are copied into the local rights area, and the
remainder, for which a descriptor is created and its address is added
as extended process rights.
The code involved in copying the first eight rights is looping
incorrectly and copying rights to random locations within the
NSA$A_SCRATCH area, usually resulting in an SSRVEXCPT crash.
o When a value block or value status block cannot be returned,
SYS$GETLKI returns the error SS$_ILLRSDM. A correction has been
made to SYS$GETLKI to now return all other requested information
and update the wildcard search index.
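The intended handling of an oversized rights list in the $CHKPRO item
above (the first eight rights go to the local area, the remainder become
extended rights) can be sketched as follows. This is an illustrative
Python model only, not the OpenVMS code; split_rights and the sample
identifier values are hypothetical.

```python
# Illustrative model: the local rights area holds at most eight rights,
# as stated in the item above. Anything beyond that is kept separately.
LOCAL_RIGHTS_MAX = 8


def split_rights(rights):
    """Split a rights list into (local, extended) parts.

    local    -- the first eight entries, destined for the local rights area
    extended -- the remaining entries, to be described by a separate descriptor
    """
    local = list(rights[:LOCAL_RIGHTS_MAX])
    extended = list(rights[LOCAL_RIGHTS_MAX:])
    return local, extended


if __name__ == "__main__":
    # Eleven made-up identifier values.
    local, extended = split_rights(list(range(100, 111)))
    print(len(local), len(extended))
```

Bounding the copy by a fixed slice keeps every write inside the
eight-entry area, which is the property the faulty loop failed to
preserve.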
Problems Addressed in the VAXSHAD04_061 Kit:
o When booting two or more systems simultaneously from shadowed
system disks, the systems may appear to hang. Crashing the
systems and examining the crash dumps indicates that shadowing
driver blocking AST routines have not run.
o When a node runs out of SHADOW_MAX_COPY threads while mounting
new copy target units, other nodes in the cluster that have
available SHADOW_MAX_COPY threads will not pick up the copy
work. This results in the copy not being started for copy
members that are added to shadow sets.
Problems Addressed in the VAXSHAD03_061 Kit for OpenVMS VAX V6.1:
o A double-deallocation crash may occur as the result of MOUNT not
properly initializing the MTL pointer. This pointer had a stale
value as a result of 2 calls to SYS$VMOUNT from a single program.
The problem will not happen as a result of DCL commands, as the
cells are initialized at image activation. The stale pointer
will only cause a problem if the system is unable to allocate
space for defining the logical name.
o An OPCOM message was being output even though /NOASSIST was
specified in the MOUNT command. This caused problems for UETP.
o A system crash may occur in SECURITY.EXE.
o A process is in RWPAG while auditing an event.
o When the current process executes a $CHKPRO system service call,
the system will crash.
o Processes hang in RWNPG state (Call to $CRMPSC) waiting for a
request for NPP so large that it cannot be satisfied.
o DISMOUNT/OVERRIDE=CHECKS against the SYSTEM disk is allowed.
Once this command is issued nothing else can be done.
Installation of this kit will allow this command to
only be issued on non-system disks.
o When booting from a Controller-Based Shadowed System disk
for the first time as a Host-Based Shadowed System disk, boot
fails with a SHADBOOTFAIL bugcheck. A SHADBOOTFAIL may also
occur if SHADOW_SYS_UNIT is changed at boot time.
o During a copy operation the system may crash with an ACCVIO.
o When a user program allocates a read buffer from a TMSCP-served
tape creator, the record on tape will get server node system data
returned along with the data on tape. Printing the buffer will
show that the data from tape is in the correct location of the
buffer but it will also show that the area of the buffer that was
not supposed to be changed contains server node system data.
Problems Addressed in the VAXSHAD02_061 Kit:
o The local MSCP server issues a fatal MSCPSERV bug check when it
should not. The server should instruct the remote DISKCLASS
driver to BUGCHECK.
o When a serving node becomes so busy that it occasionally
exhausts resource limits, the RWAITCNT for heavily used disks
gets incremented. If a client node requests an ONLINE and
RWAITCNT is bumped, it is rejected by MSCP. This makes
MOUNTing devices very difficult.
o On OPCOM restart, the old privilege mask's upper 32-bits may
not be restored to their original value. This mask is
declared as a longword, but used as a quadword.
o When OPCOM receives a message that it does not recognize, the
message is included in the log file with the following text:
%%%%%% OPCOM 19-APR-1994 11:20:40.06 %%%%%% DUMP_LOG_FILE
OPCOM has noticed a condition which might be due to an internal
error. might also be explained by normal events, especially if
nodes have just crashed or rebooted in a VAXcluster. Please
bring this message to Digital's attention only if you are having
problems with operator communications.
Buffer is 8 (%X0008) bytes -- "- Unknown message received"
00000000 00000000 00000000 00000000 00000000 00000000 -
41534403 0015007B
o When an assisted merge is performed, an inaccurate number of
LBNs (Logical Block Numbers) and bytes transferred may be
computed. Therefore, all LBNs may not be merged in assisted
merge operations.
o Access path attention (ACPTH) messages are used by MSCP to
determine secondary paths for disks that are attached to dual
controllers. DUDRIVER might incorrectly assign this
information to the wrong device if two units with the same
unit number and allocation class exist. These messages may
also trigger unnecessary failover attempts.
o Servers in VAXclusters with more than 127 nodes may crash
when the 128th node attempts to access a given disk. This
usually occurs after a serving node crashes for other reasons,
and it then causes the rest of the servers to crash.
o In a small working set, it is possible for the EXE$PSCAN_NEXT_PID
routine (called by $GETJPI) to take a page fault at IPL 8. This
causes a PGFIPLHI bugcheck. The page referenced is in the
PROCESS_SCAN context block (PSCANCTX$ data structure) in process
virtual address space.
o While running a UETP tape test, fatal controller errors may
occur. This problem is caused by TMSCP (the tape server)
incorrectly interpreting a TUDRIVER status subcode. This
misinterpretation is converted to a fatal controller error
status and returned to the user.
o Shadow sets have separate mount verification done by SHDRIVER,
instead of the usual system mount verification. When the volume
label of a shadow set is changed, the SHDRIVER mount verification
fails to update the label on every node except the one that issued
the label change. Once the devices are in this state, they cannot
be recovered until MVTIMEOUT is reached or a reboot of all affected
nodes is performed.
This correction enables the behavior of virtual units to be
consistent with the behavior of physical units.
o Unnecessary calls to MOUNT verification or host-based
volume shadowing processing may occur. On Alpha nodes,
these mount verification or Host-Based Volume Shadowing
processing calls will fail, resulting in I/O hangs and,
eventually, volume invalid errors.
o AVAILABLE or OFFLINE status returned from a transfer command
does not implement the MSCP specification correctly.
o OpenVMS VAX MSCP Parity with OpenVMS Alpha. A served disk may
appear to be ONLINE when it is really OFFLINE. This occurs
because the MSCP server's CHECK_SERVICE routine searches the
device database and incorrectly returns an ONLINE status.
o There is no synchronization between SHADOW_PROCESSING and
INVALIDATE_ALL_ENTRIES, which allows these two code threads to
run simultaneously. This can cause a system crash due to the
fact that the SHADOW_PROCESSING thread may remove a member from
a multimember shadow set and the INVALIDATE_ALL_ENTRIES thread
is not aware that the member has been removed. The system
crash occurs in RESTORE_WLE because no Write Log table
exists.
o A problem exists with the SHADOW_SERVER. The symptoms of this
problem are:
+ Undiagnosable hangs in individual copy operations or on
the entire server
+ Unexpected copy aborts
+ Poor copy performance
+ Shadow set inconsistency
o High interrupt stack activity occurs on a node performing a merged
copy operation. This could adversely affect configurations using
HSJ40 controllers with many shadow sets.
o Data inconsistency may exist between members of a Phase II shadow
set. This occurs under very heavy I/O operations to a shadow
set while the members of that shadow set are undergoing failover
from one controller to another.
o Invalid Command status processing of Write History Management
commands unconditionally puts an entry into the error log.
This occurs even when there is no actual error.
o A second shadow server may accidentally be created using the
startup command procedure. This results in desynchronization
of shadow sets. The startup procedure has been modified so
that it does not allow multiple servers.
o After a system failure, the number of blocks to be rewritten
is not computed correctly. This may cause inconsistent data
between shadow set members. This occurs during an assisted
merge when the information regarding which LBNs to include
is only requested from one shadow set member.
o A process issuing I/O to a TMSCP tape device may appear to
hang after a controller failover attempt. This is caused by
an incorrect check of the cached data's lost error status,
which results in an endless loop trying to recover a
nonexistent error.
o In the past, Volume Shadowing checked device IDs and the
maximum logical block numbers (LBNs). Volume Shadowing
now checks for geometries and maximum LBNs. This
enables devices like the RZ28 and RZ28B to operate in
the same shadow set. Even though their device IDs differ,
their geometries and maximum LBNs will match when configured
on like controllers.
NOTE: If this remedial kit is installed across a VMScluster
system, SCSI shadow sets that are configured across
different controller types are not supported and will
no longer work.
o A device may be mounted by an MSCP server, even though a local
controller could be used. This situation may still occur after
the installation of this ECO kit under extreme timing circumstances.
o When new MSCP server I/O is sent to a device that is RWAITCNT
stalled and the connection from the driver to the device fails,
server I/O is posted to the restart queue if it is active. If
not, it is incorrectly left on the UCB (Unit Control Block)
pending queue. This causes shadow sets to appear to be stalled.
If the connection from the client to the server then fails,
I/O from the client that has been passed to the driver is
then allowed to complete. If this I/O is stalled on the
pending queue, it completes much later, possibly after
the client has reissued the stalled I/O.
o Incorrect MSCP-served disk synchronization might cause I/O to
an MSCP-served disk to become stalled on an internal queue
which would be restarted later.
o I/O hangs to a shadow set might occur because the shadowing
driver has no way to disable write logging if the write log
entries are mismanaged or depleted to a point that the
shadow set is unusable.
o An Invalid Exception bugcheck might occur in DUDRIVER during
I/O request complete processing.
o In the past, MSCP could only serve 256 disks. It can now
serve 512.
o During the processing of a write-log entry in SHDRIVER, a
register value may be improperly maintained if the system
is low on nonpaged pool. This will cause a system crash
with an INVEXCEPTN Bugcheck within SHSB$GET_WLE_TABLE in
module SHDSUBS when the entry is resumed.
o After approximately 18 hours of operation, some OPCOM
messages that should be logged are skipped.
o If two members of a three-member shadow set are
simultaneously removed, either intentionally or in
a failover situation, the system may hang or fail.
o System crashes might occur during virtual I/O cache (VIOC)
expansion under the following circumstances:
+ Multiple processes (or processors) are accessing the same
file concurrently;
+ The cache space for that file was being expanded;
+ That expansion caused the need for a new hash table
structure.
o When subjected to a high I/O load and multiple failures,
the write logging (minimerge) and shadowing synchronization
subsystems become unreliable.
o Unreliable shadow subsystem behavior and shadow-set hangs
occur when VMScluster nodes fail to relinquish shadow-set
resources.
o The TMSCP server bugchecks in TMSCP$FIND_UQB when a command
that refers to a specific unit is processed and that unit
does not have the Server Local Unit Number (SLUN) bit set.
The fix contained in this ECO kit will cause the bugcheck
to occur in TUDRIVER instead of the TMSCP server.
o I/O may stall to a served shadow-set member. Load balancing
makes this condition more likely.
o System crashes may occur during processing of stale I/O in
Host-Based Volume Shadow Sets. This I/O does not properly
reflect changes in shadow set configuration, notably removal of
members and changes in the write-logging state.
o Shadow set members may be inconsistent after the failure
of a node accessing a shadow set served by an Alpha node.
The amount of corrupted data depends on previous I/O
operations to the shadow set.
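The OPCOM privilege-mask item earlier in this list (a mask declared as a
longword but used as a quadword) comes down to 32 bits of storage holding
a 64-bit value. The following is a rough illustration in Python, with
hypothetical function names and a made-up mask value. It models only the
loss of the upper 32 bits, not how OPCOM actually saves or restores
privileges.

```python
# Illustrative model of saving a 64-bit privilege mask in 32-bit storage.
LONGWORD_BITS = 0xFFFFFFFF           # 32 bits
QUADWORD_BITS = 0xFFFFFFFFFFFFFFFF   # 64 bits


def save_as_longword(priv_mask):
    """Keep only the low 32 bits, as longword storage would."""
    return priv_mask & LONGWORD_BITS


def save_as_quadword(priv_mask):
    """Keep all 64 bits, as quadword storage would."""
    return priv_mask & QUADWORD_BITS


if __name__ == "__main__":
    original = 0x0000000300000010  # made-up mask with bits set in both halves
    print(hex(save_as_longword(original)))   # upper half is lost
    print(hex(save_as_quadword(original)))   # full mask survives
```

With longword storage, any privilege bit above bit 31 cannot be restored
because it was never saved.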
Problems Addressed in the VAXSHAD01_061 Kit:
o In Volume Shadowing for OpenVMS Alpha Version 6.1, several
changes were made to the assisted merge (minimerge)
functionality. These changes disabled minimerge functionality
across mixed architecture VMSclusters. With minimerge
disabled, shadowing continued to function normally, except that
a full merge was always done when a merge operation occurred.
Full merges take considerably longer than minimerges. If you
want minimerge functionality, Digital recommends that you
install this kit across any VMSclusters that contain an Alpha
node running OpenVMS Alpha Version 6.1.
Mixed-architecture VMSclusters that are running OpenVMS Alpha
Version 6.1 must apply this kit and reboot the entire cluster
simultaneously. In these cases, rolling upgrades are not
supported.
o Prior to this remedial kit, if attempts were made to mount an
RZ28B disk device with an RZ28 in the same shadow set, Volume
Shadowing detected different device IDs and may not have
allowed the devices to be mounted. This behavior applied only
to an RZ28/RZ28B shadow-set combination when connected with a
local SCSI controller. Since RZ28 and RZ28B are different
device types but can be shadowed, the checking for shadow-set
membership in the host-based shadowing software needed to be
modified.
This remedial kit enables the combination of RZ28 and RZ28B
devices in a shadow set, as long as they are connected to like
controllers. With the use of SCSI devices, like controllers
are required because geometry can vary from controller to
controller. Digital recommends that SCSI shadow sets be
configured across like controller types. Existing SDI and DSSI
configurations are unaffected; if they are not using SCSI
drives and are shadowing SDI devices across different
controllers, these configurations will continue to work
without this remedial kit.
VMSclusters with shadowed SCSI disks and mixed-architecture
VMSclusters running OpenVMS Alpha Version 6.1 must apply the
kit and reboot the entire cluster simultaneously, so that the
entire VMScluster is running the same version of Volume
Shadowing software. The kit is required for both VAX and Alpha
nodes. Do not mount shadow sets containing RZ28 and RZ28B
devices without first applying this kit.
o The MME$$MNTREQ function, which requests that a volume should
be selected for mount, allowed the use of logical names for the
device name. However, since these are process logical names,
as part of the caller's process, these logical names are not
available to the media manager.
o A device-not-ready error for magtapes is not reported until a
delay of up to 6 minutes has expired.
o If a user creates a shadow set, dismounts the set, then mounts
just one of the members, the other members of the set will be
marked "ONLINE" when viewed from the HSC. As a result, no HSC
operations are allowed until the disk is MOUNTed then
DISMOUNTed from the shadow set.
o If MOUNT fails to create a logical name, no error information
is displayed. In this case, the logical name may point to
an incorrect device.
o If a device is MOUNTED/SYSTEM and then it is MOUNTED/CLUSTER
with conflicting /OWNER_UIC or /PROTECTION qualifiers,
incorrect error messages may be displayed. The following two
types of errors may occur.
+ The error message may generate garbage which would
force terminal characteristics to be reset to
ASCII.
+ The following error messages may be displayed:
inconsistent /PROTECTION option. Cluster mounted (garbage)
inconsistent /OWNER_UIC option. Cluster mounted (garbage)
o When a disk with a large EXTENT value is mounted under
V6.* for the first time, or if the SECURITY.SYS file is
missing from the system, the SECURITY.SYS file is created
at the EXTENT size, rounded up to the disk cluster size.
This may waste disk space.
o The message for %MOUNT-F-BADUNDFAT has a typographical error.
o If the VOLUME_ACCESSIBILITY option is used in conjunction
with the INITIALIZE/LABEL= command upon tape initialization,
a user with all privileges enabled is unable to access the
tape unless he/she is the owner.
o In an OPCOM message, there is no separator between the
device name and the comment text.
o After a BACKUP operation, the header of the INDEXF.SYS file
of the backup save set is corrupted. This can be seen by
issuing the following DCL command:
$ ANALYZE/DISK DJA0:
o Previously, MOUNT waited only 10 seconds for magtape devices
to become ready before determining that the device was
off line. Tx8x7 tape devices may take up to 6 minutes to
become ready during a volume switch. This fix causes the wait
to be done in user mode so that the user can abort the wait
with CTRL/C.
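The SECURITY.SYS sizing problem listed above can be illustrated
with a small calculation. The sketch below is in Python and assumes
that "rounded up" means rounding to the next whole multiple of the
disk cluster size. The function name and the sample values are
hypothetical and are not taken from the MOUNT sources.

```python
def round_up_to_cluster(blocks, cluster_size):
    """Round a block count up to the next multiple of cluster_size.

    Hypothetical helper: it models the sizing rule described in the
    text, not the actual MOUNT implementation.
    """
    if blocks < 0 or cluster_size <= 0:
        raise ValueError("blocks must be >= 0 and cluster_size > 0")
    # Ceiling division using integers only, then scale back up.
    return -(-blocks // cluster_size) * cluster_size


# Illustrative values only: an EXTENT of 1000 blocks on a disk with
# a cluster size of 3 blocks.
allocated = round_up_to_cluster(1000, 3)
```

With a large default EXTENT, the allocation is dominated by the
EXTENT value itself rather than by the rounding, which is why the
file can occupy far more space than its contents require.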
Problems Addressed in the VAXMTAA01_062 Kit:
o The system crashes with a NOBVPVCB bugcheck. The crash occurs
on the kernel stack with MTAAACP.EXE as the current image.
o The system crashes with an XQPERR while dismounting a MAD
drive.
Problems Addressed in the VAXMTAA02_061 Kit:
o If a tape is initialized with a non-blank accessibility field
and then mounted using /OVERRIDE=(ACCESSIBILITY), the tape
mounts but cannot be read or written to. The command format to
initialize the tape would be similar to:
INIT/LABEL=VOLUME_ACCESSIBILITY="+" tape: LABEL
In addition, the following OPCOM messages are generated and the
tape volume is automatically unloaded after an attempt to WRITE
or READ the tape volume:
%%%%%%%%%%% OPCOM 12-DEC-1994 12:57:23.53 %%%%%%%%%%%
Message from user USERXX on NODEXX
non-blank accessibility field in volume labels on SYS$DEVICE:
%%%%%%%%%%% OPCOM 12-DEC-1994 12:57:23.54 %%%%%%%%%%%
o If a user attempts to stop the MTAAACP process or a process that
issued a QIO, MTAAACP will go into RWAST state and hang.
Problems Addressed in the VAXMTAA01_061 Kit:
o If the wrong magnetic tape volume is inserted as the next volume,
MTAAACP cancels the request and then hangs.
Problems Addressed in the VAXMONT01_061 Kit:
o Specifying the DISK Class to Monitor can result in unexpected
side effects to the display. When the MONITOR DISK command is
issued on a system with DFS devices mounted, only the first
three characters of the DFS name are displayed correctly.
Instead of the fourth character, the low byte of the unit
number is output. It is often displayed as a non-printable
character or as an escape sequence (in which case, it may cause
terminal lock-ups, resetting characteristics, etc).
The following command illustrates this problem when executed
on a system with DFS disks mounted:
$ MONITOR DISK
DISK I/O STATISTICS
on node NODENAME
7-APR-1994 16:25:17
I/O Operation Rate
DSA2241: FOLKLORE 6.27 6.27 6.27 6.27
DSA2249: AUDIT 0.00 0.00 0.00 0.00
DSA2263: VMS19NOVC3L 0.00 0.00 0.00 0.00
DSA2264: LAV19NOVC3L 0.00 0.00 0.00 0.00
DSA2265: MDF19NOVC3L 15.84 15.84 15.84 15.84
DSA2266: VMS28APRB3E 0.00 0.00 0.00 0.00
DSA2267: LAV28APRB3E 0.00 0.00 0.00 0.00
DSA2268: MDF28APRB3E 0.00 0.00 0.00 0.00
DSA2269: VMS18JANC3L 0.00 0.00 0.00 0.00
DSA2270: MDF18JANC3L 0.00 0.00 0.00 0.00
DSA2271: LAV18JANC3L 0.00 0.00 0.00 0.00
DSA2280: VMS12OCTM3C 0.00 0.00 0.00 0.00
$254$DFSé1001() DEC:..._STAR 0.00 0.00 0.00 0.00
$254$DFSH8008() V501_RESD 0.00 0.00 0.00 0.00
$254$DFSI8009() V51_RESD 0.00 0.00 0.00 0.00
o The 'MONITOR DISK' command hangs when monitoring a system with
more than 800 disks. MONITOR contains an arbitrary upper limit
of 800 on the number of disks it can monitor. When a system
contains more than 800, MONITOR generates an error status, but
the status is not properly signaled, and the display appears to
hang. This can also be seen with a 'MONITOR CLUSTER' command
(which collects DISK data implicitly).
o Due to an inadequate synchronization mechanism, the MONITOR
DISK command can go into an infinite loop on multi-processor
machines.
o MONITOR PROCESS in a local environment will fail if the
SYSGEN parameter MAXPROCESSCNT is set to allow more than 1040
processes. When Virtual Balance Slots were added in OpenVMS
V6.0, this number dropped to 978.
Problems Addressed in the VAXSYS14_061 Kit:
o There is a race condition possible when a CFCB (Cache File
Control Block) is being deleted due to XQP action and cache
space is being reclaimed from a LIMBO file.
o Disk corruption can occur when heavy open/read/write/close/delete
operations are occurring.
o At some point after a node CLUEXITs, 2 or more cluster nodes
crash with LOCKMGRERR Bugchecks.
o When two or more VAX or Alpha nodes are booting at the same
time, one or both of them will crash.
Problems Addressed in the VAXSYS12_061 Kit:
o When a value block or value status block cannot be returned,
SYS$GETLKI returns the error SS$_ILLRSDM. A correction has
been made to SYS$GETLKI so that it now returns all other
requested information and updates the wildcard search index.
Problems Addressed in the VAXSYS07_061 Kit:
o If a multi-programming application uses a non-homogenous
access pattern to a file which is resident in Virtual I/O
cache, there is a possibility that the size returned in the I/O
status block from a READ operation will be truncated.
If a clustered application consists of a large number of
concurrent processes that repeatedly perform an OPEN, WRITE,
CLOSE sequence on the same data file, a possibility of data
corruption exists.
In a multi-programming environment, where a significant amount
of NEW data from a file is being loaded into the cache
concurrently by multiple processes, the possibility of a HANG
exists.
Problems Addressed in the VAXSYS01_061 Kit:
o SYS$CHKPRO had several problems that did not manifest themselves
in a readily visible effect to the end user. The problems
include:
- accepting up to 11 rights lists even though no more than two
would actually be processed.
- CHKPRO would accept a CHP$_UIC and write it over a location
which was to contain a rightslist pointer.
- In most cases the wrong UIC was used in access checking.
The only time the customer would notice a problem is if they
specifically tested access to an object known to be protected
from current rights and UIC settings.
o Nonpaged dynamic memory (NPAGEDYN) expansion occurs even when
there is a large amount of free space available. This can lead
to performance problems as pool expansion causes free memory to
be diverted away from that available to processes and dedicated
to nonpaged pool usage. For example, with a SHOW MEMORY/POOL
command you can observe that the "Total" amount of "Nonpaged
Dynamic Memory" increases when the amount of "Free" bytes is
quite large:
Dynamic Mem Usage (bytes):     Total       Free     In Use    Largest
Nonpaged Dynamic Mem        38555136   17372224   21182912      38720
Paged Dynamic Mem           17282048    8295888    8986160    8265232
Starting with the introduction of the Adaptive Pool Management
(APM) feature, in OpenVMS VAX V6.0, these figures include the
contributions of both the lookaside lists and the variable pool.
So, a large "Free" figure is indicative of large (and possibly,
growing) lookaside lists. If the "Total" figure is increasing,
it indicates that pool expansion is occurring, and that the
lookaside list space is not being used effectively.
The above symptom can result from either of the two following
separate problems:
- A routine in the software which supports security features
such as "rightslists" was obtaining a nonpaged pool block
and then freeing it in two smaller pieces.
- An internal loop counter governing the number of times a
lookaside list allocation was attempted, was set too low.
This problem will most likely be seen on the VAX 6000 - 500
and 600.
A third software change associated with APM will also be
available in a future OpenVMS VAX version, but is not available
as a remedial change. The third change provides a potential
performance benefit under very specialized conditions, such as
during VMScluster state transitions.
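The SHOW MEMORY/POOL figures quoted above can be checked with a
short calculation. The sketch below is in Python and uses only the
numbers from the sample display. The function name and the idea of
expressing "Free" as a fraction of "Total" are illustrative
assumptions, not part of any OpenVMS utility.

```python
def pool_summary(total, free, in_use):
    """Return (consistent, free_fraction) for one pool line.

    consistent is True when Free + In Use equals Total, as in the
    sample display. free_fraction is Free divided by Total.
    Hypothetical helper for illustration only.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (free + in_use == total), free / total


# Figures copied from the sample SHOW MEMORY/POOL display (bytes).
NONPAGED = dict(total=38555136, free=17372224, in_use=21182912)
PAGED = dict(total=17282048, free=8295888, in_use=8986160)

nonpaged_ok, nonpaged_free = pool_summary(**NONPAGED)
paged_ok, paged_free = pool_summary(**PAGED)
```

In the sample, roughly 45% of nonpaged pool is free. A "Total" that
keeps growing while the free fraction stays this high matches the
symptom described: pool is expanding even though the lookaside list
space is not being used effectively.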
Problems Addressed in the VAXMME01_061 Kit:
o The MME$$MNTREQ function which requests that a volume should
be selected for MOUNT, allows the use of logical names for
the device name. However, since these are process logical
names, as part of the caller's process, these logical names are
not available to the media manager.
o MME applications are no longer able to set mount and device
context.
Problems Addressed in the VAXOPCO01_061 Kit for OpenVMS VAX V6.1:
o When a node leaves a VAXcluster, OPCOM goes into a tight
loop on one of the remaining nodes in the cluster. OPCOM
can be seen using 90-95% of the CPU.
Problems Addressed in the VAXAUDI02_061 Kit for OpenVMS VAX V6.1:
o The Audit Server EXCLUDE process list may become corrupt after
the DCL 'SET AUDIT/EXCLUDE=pid' command is issued.
INSTALLATION NOTES:
This kit *MUST* be installed on every VAX in a mixed-architecture
VMScluster, and the Alpha (ALPSHAD) version of this kit *MUST* be
installed on every Alpha system in the cluster BEFORE any systems
are re-booted into the VMScluster. If the correct kit is not
installed on each system, shadow sets cannot be created. System
crashes may also occur if the kits are not installed on all
appropriate cluster nodes.
The following restrictions will apply upon completion of the
installation:
o VMSclusters with shadowed SCSI disks and mixed-architecture
VMSclusters running OpenVMS Alpha V6.1 must apply the kit and
reboot the entire cluster simultaneously. In these cases,
rolling upgrades are not supported.
o Working configurations that contain SCSI shadow sets on
dissimilar controllers may no longer work.
References:
ORACLE is a registered trademark of Oracle Corporation.
WordPerfect is a trademark of WordPerfect Corporation.
This patch can be found at any of these sites:
Colorado Site
Georgia Site
Files on this server are as follows:
vaxshad09_061.README
vaxshad09_061.CHKSUM
vaxshad09_061.CVRLET_TXT
vaxshad09_061.a-dcx_vaxexe