TITLE: OpenVMS__SHADOW VAXSHAD09_U2055 VAX V5.5-2__V5.5-2H4 ECO Summary
Modification Date: 02-FEB-99
Modification Type: DOCUMENTATION: Technical Modification
Added *WARNING* regarding installation
of OpenVMS VAX V5.5-2HF.
NOTE: An OpenVMS saveset or PCSI installation file is stored
on the Internet in a self-expanding compressed file.
The name of the compressed file will be kit_name-dcx_vaxexe
for OpenVMS VAX or kit_name-dcx_axpexe for OpenVMS Alpha.
Once the file is copied to your system, it can be expanded
by typing RUN compressed_file. The resultant file will
be the OpenVMS saveset or PCSI installation file which
can be used to install the ECO.
Copyright (c) Compaq Computer Corporation 1995, 1999. All rights reserved.
****************** WARNING *********************
* *
* **DO NOT** install this ECO kit on OpenVMS *
* VAX V5.5-2HF. The system will become *
* unbootable. Sustaining Engineering is *
* currently researching this problem. *
* *
**************************************************
PRODUCT: Volume Shadowing for OpenVMS (Phase II)
NOTE: The problems fixed in this ECO Kit also affect the
following products:
VAXcluster Software for OpenVMS VAX
VAXcluster Console System (VCS)
OP/SYS: OpenVMS VAX
COMPONENTS: System, Bugcheck, Backup,
Mount, Dismount, MSCP, TMSCP, MTAAACP,
I/O Routines, Audit Server,
Security, System Primitives,
Adaptive Pool Management (APM),
Operator Communication Manager (OPCOM),
User Environmental Test Package (UETP)
SOURCE: Compaq Computer Corporation
ECO INFORMATION:
ECO Kit Name: VAXSHAD09_U2055
ECO Kits Superseded by and Included in this ECO Kit:
VAXSHADFT9_U2055 (Never Released)
VAXSHADFT8_U2055 (Never Released)
VAXSHAD07_061 (For OpenVMS VAX V5.5-2, V5.5-2H4, V5.5-2HF only)
VAXSHAD06_061
VAXSHAD05_061
VAXSHAD04_061
VAXSHAD03_061
VAXSHAD01_061 (CSCPAT_1160)
VAXSHAD02_060 (CSCPAT_1116)
VAXSHAD01_060 (CSCPAT_1116)
VAXSHAD08_U2055 (CSCPAT_0269, CSCPAT_1160)
VAXSHAD07_U2055 (CSCPAT_0269, CSCPAT_1160)
VAXSHAD05_U2055 (CSCPAT_0269, CSCPAT_1160)
VAXSHAD04_U2055 (CSCPAT_0269, CSCPAT_1160)
VAXSYS14_061 (For OpenVMS VAX V5.5-2, V5.5-2H4, V5.5-2HF only)
VAXSYS16_U2055
VAXSYS15_U2055
VAXSYS14_U2055
VAXSYS13_U2055
VAXSYS12_U2055
VAXSYS11_U2055
VAXSYS10_U2055
VAXSYS09_U2055
PRCMGT$01_U2055
VAXSYS01_2H4055 (CSCPAT_1094)
VAXSYSL04_U2055
VAXMONT01_061 (For OpenVMS VAX V5.5-2, V5.5-2H4, V5.5-2HF only)
VAXMONT03_U2055
VAXMONT02_U2055
VAXMOUN05_U2055 (CSCPAT_1152)
VAXMOUN04_U2055 (CSCPAT_0240)
VAXMOUN03_U2055 (CSCPAT_0240)
VAXMSCP08_U2055 (CSCPAT_1120)
VAXMSCP07_U2055
VAXMSCP05_U2055 (CSCPAT_1068)
ECO Kit Approximate Size: 2952 Blocks
Kit Applies To: OpenVMS VAX V5.5-2, V5.5-2H4
NOTE: OpenVMS VAX V5.5-2H4 is a limited hardware release,
shipped only with the new systems (or system upgrades)
listed below. It is not separately orderable and will not
be distributed via Consolidated Distribution.
o VAX 4000 Model 100A
o VAX 4000 Model 500A
o VAX 4000 Model 600A
o VAX 4000 Model 700A
System/Cluster Reboot Necessary: Yes
*** WARNING!!! ***
Future OpenVMS VAX V5.5-2 kits that are issued for facilities
included in the VAXSHAD09_U2055 kit will not install unless the
VAXSHAD09_U2055 kit is installed on your system first. It is
highly recommended that the complete VAXSHAD09_U2055 remedial kit
be installed as soon as possible. Installation of individual
images from the VAXSHAD09_U2055 remedial kit is not supported and
could result in unpredictable system behavior.
Descriptions for problems that were corrected in previous VAX
Shadow kits are included in the VAXSHAD09_U2055 Release Notes.
The release notes can be found in the VAXSHAD09_U2055.A
saveset. If you have not installed a previous shadow kit,
it is recommended that you read these release notes before
installing the VAXSHAD09_U2055 Shadow kit. To access the release
notes, restore them from the saveset by issuing a command with the
following format:
$ BACKUP/SEL=VAXSHAD09_U2055.RELEASE_NOTES -
_$ DEVICE:[DIR]VAXSHAD09_U2055.A/SA -
_$ DEVICE:[DIR]VAXSHAD09_U2055.RELEASE_NOTES
If you have a mixed-architecture cluster, and have not
previously installed a shadowing kit, you must install this kit
on the VAX nodes as well as the applicable Alpha version of
this kit on the Alpha nodes of the cluster BEFORE you bring up both
types of systems in a cluster again. If both kits are not
installed, you may not be able to create shadow sets.
If you have previously installed a shadowing kit then you do
not need to install the Alpha version of this kit at this time
as long as the shadowing kit installed on the Alpha nodes of
the cluster is ALPSHAD04_061 or later.
Working configurations that contain SCSI shadow sets on
dissimilar controllers may no longer work.
CAUTION:
Before Installing this Kit, Read the Following Cautions:
After installation of this kit, the following issues may occur:
1) ISSUE: When a node reboots into the cluster there may not
be an OPCOM message that reports the node is joining
the cluster. Absent messages occur on a random
basis.
WORKAROUND: In order to verify the node has entered the
cluster, after the node has fully rebooted, the
user should enter the command:
$ SHOW CLUSTER
to verify the node is a valid member of the
VAXcluster.
2) ISSUE FROM THE CSC: An INVEXCEPTN in SNDRIVER may be seen if
DECnet/SNA V2.1 is used in conjunction
with the IO_ROUTINES from the VAXSHAD
ECO kit. SNAVMS_E04021 (CSCPAT_5041) will
fix this problem by replacing the
incompatible SNDRIVER in DECnet/SNA V2.1.
NOTE: SNAVMS_E04021 applies to
DECnet/SNA V2.1 only.
3) After installation of this ECO kit on an OpenVMS VAX
V5.5-2x system, MAIL, REPLY, or any process that uses
the $BRKTHRU system service may hang in MUTEX waiting for
more BYTLM than it needs or has available. Looking at the
process from SDA will show it is waiting for significantly
more BYTLM (R1) than the process has left (BUFIO byte
count/limit), and significantly more than the message it
is trying to output.
The workaround for this is to make BYTLM larger.
4) This ECO kit should *NOT* be installed on an FT810
system running OpenVMS VAX V5.5-2HF. If it is installed,
the system will not reboot.
These issues are being addressed and will be corrected in a
future version of OpenVMS VAX.
ECO KIT SUMMARY:
An ECO kit exists for Volume Shadowing on OpenVMS VAX V5.5-2 and
V5.5-2H4.
Problems Addressed in the VAXSHAD09_U2055 Kit:
o A 'SET SECURITY' or 'SET ACL' command issued on volumes in a
cluster places high I/O on the server process. This exhausts
paged pool and the AUDIT_SERVER goes into an RWPAG state.
This problem is corrected in OpenVMS VAX V6.2.
o A field in the IRP that is used during Volume Processing is not
initialized in clones of USER IOs. If an error occurs, the
code that determines the severity of the error can be misled by
data in these fields. The code can fail to locate the error
and may return the IO as successful. Since a zero byte count
is also returned, a user would see an Incomplete Segmented
Transfer error. The fix is to initialize the field when the
clone is allocated.
o While creating a page, a user process may be swapped out and
returned with a different balance set slot.
This problem is corrected in OpenVMS VAX V6.2.
o Listings may be difficult to read due to varied formats and
misleading or missing comments.
This problem is corrected in OpenVMS VAX V6.2.
o Certain applications that call $AUDIT_EVENT with ASTs turned
off will be interrupted when $AUDIT_EVENT returns to the
caller.
This problem is corrected in OpenVMS VAX V6.2.
o The code relies on a page being present when it attempts
to release a spinlock. If the system is paging heavily,
the page may not be available.
This problem is corrected in OpenVMS VAX V6.2.
o Repeating wakeups from $SCHDWK show an accumulating drift over
time.
This problem is corrected in OpenVMS VAX V6.2.
o COPY and/or BACKUP of a DISK to a TMSCP-Served tape will fail
when the tape device is placed in an MV state. The failure
does not occur if the same task is performed locally.
COPY will fail with: "SYSTEM-F-TAPEPOSLOST, magnetic tape
position lost"
BACKUP will fail with: "-SYSTEM-F-DATALOST, data lost"
This problem is corrected in OpenVMS VAX V6.2.
o To transition an OpenVMS process from the virtual balance set
to the real balance set, the SPTEs (system page table entries)
which describe its process PTE pages (process page table pages)
need to be copied from saved memory back into the real balance
slot from where they originally came. This makes the process'
P0 and P1 space accessible again. SPTEs for the process page
table pages describing the undefined area between P0 and P1
must be represented by pre-initialized null values (actually,
ERKW DZERO-type values). When this undefined void area is
exactly zero pages (i.e., P0 and P1 are tangent), the
VBSS$READ_OPT2_VBSM routine takes the wrong branch, causing a
VBSSERR bugcheck. This fix adds a test for this case, and
takes the image's correct branch.
This problem is corrected in OpenVMS VAX V6.2.
o When a process is switched from a real balance slot to a
virtual balance slot, the allocation may fail. This causes
a VBSSERR bugcheck.
This problem is corrected in OpenVMS VAX V6.2.
o The quota value may be incorrect when process quota
(bytlm) is returned to a process for a system global
section.
This problem is corrected in OpenVMS VAX V6.2.
o System crashes may occur due to corrupted PTE entries. The
corruption appears to be Global Section Table Entries pointing
to Global Section Descriptors.
The problem occurs only if the number of global sections exceeds 4095. To
check the number of Global Sections currently in use, add the
following values:
- SDA> VALIDATE QUEUE EXE$GL_GSDSYSFL !global sections
- SDA> VALIDATE QUEUE EXE$GL_GSDDELFL !delete pending global
sections
- SDA> VALIDATE QUEUE EXE$GL_GSDGRPFL !group global sections
o Devices can remain allocated to processes that no longer
exist. The device remains unusable until the system is
rebooted.
o If a previously shadowed disk is mounted with a MOUNT/OVER=SHADOW
command and a new shadow set is created using this disk, OpenVMS
VAX will attempt to create the old shadow set using the old
physical device names.
o The system may crash with a NOBVPVCB bugcheck. The crash occurs
on the kernel stack with MTAAACP.EXE as the current image.
o The system may crash with an XQPERR bugcheck while dismounting
a MAD drive.
o SUBTRACED errors are not correctly determined for images
installed /HEADER_RESIDENT.
This problem is corrected in OpenVMS VAX V6.2.
o Users of ORACLE [R] Rdb V6.1 may get ILLIOFUNC errors when
performing IO to a Host-Based Shadowset whose members
are served.
o The user will see a large number of the shadow copies being
done by OpenVMS rather than the controller, even when both
disks are on the same controller and the controller has DCD
(Disk Copy Data) capabilities.
o If a three-member Shadowset has its index zero member as a copy
target and all three members also require a merge, then when
the copy completes, the merge does not take place. The LBN for
the just completed copy (the last LBN on the disk) is passed as
the MERGE starting LBN, so it completes without doing any IO.
o Failures to start copies or restart copies may occur, usually
after a node halt, shutdown or reboot. Additional symptoms
observed include inconsistent values for HBS_CIP when compared
to SHADOW_MAX_COPY, negative values for HBS_CIP, and copies
that should continue but instead start over from the beginning.
o System hangs occur when IOs that are pending to a shadow set
do not complete.
o UCB$L_MAXBCNT appears to be invalid for a shadowed disk.
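For the global-section item above (crashes when more than 4095 global
sections are in use), the check amounts to adding the entry counts
reported by the three SDA VALIDATE QUEUE commands and comparing the
total with 4095. A minimal Python sketch of that arithmetic follows.
The function names and the sample counts are hypothetical; the real
counts must be read manually from the SDA output.

```python
# Illustrative only: the limit comes from the problem description above;
# the queue counts used below are hypothetical sample values.
GBLSECTIONS_LIMIT = 4095


def global_sections_in_use(system_count, delete_pending_count, group_count):
    """Add the entry counts reported for the three GSD queues."""
    return system_count + delete_pending_count + group_count


def exceeds_limit(total, limit=GBLSECTIONS_LIMIT):
    """Return True when the total is above the limit named in the text."""
    return total > limit


if __name__ == "__main__":
    total = global_sections_in_use(3900, 12, 200)
    print(total, exceeds_limit(total))
```

A total at or below 4095 would indicate that this particular problem
is not in play on the system being examined.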
Problems Addressed in the VAXSHAD07_061 Kit:
o In the VAXSHAD05 and VAXSHAD06 kits two new fields were added
to the IRP data structure for shadow write logging information.
This new IRP definition size conflicts with the IRP sizes of
other images on the system that are not part of the SHADOW kits.
This conflict may cause a variety of errors, including fatal
bugchecks. This fix changes the IRP definitions back to the SBB
versions and adds some special definitions to the SHDRIVER for
the new IRP fields.
o Fatal bugchecks from data structure corruption may occur due
to the addition of the value 10 HEX to the corrupted field.
Crashes are of various types and include node and cluster
crashes, crashes due to invalid UCB addresses, invalid VCB
addresses, invalid member IDs, and invalid number of devices.
Problems Addressed in the VAXSHAD06_061 Kit:
o When using PATHWORKS, data corruption may occur on the file
container. The corruption can be seen by running CHKDSK on the PC
container disk. Also using PCDISK to IMPORT and EXPORT files to
and from the container will show a corrupted file when EXPORTed
back to VMS.
o System crashes occur with INVEXCEPTN bugcheck at
SCH$POSTEF+21.
To correct this problem, a change was made in the IOC$SIMREQCOM
routine to cause the destination of the IFNOWET test to
initialize R4 before calling the IOC$SCHEDEF routine.
IOC$SCHEDEF expects R4 to have the address of the user's PCB.
Problems Addressed in the VAXSHAD05_061 Kit:
o After a node crashes, it cannot mount a Host-Based Volume
Shadowing virtual unit on reboot. The error message usually
returned is "volume not software enabled"; however, "Medium
Offline" may also be seen. A SHOW DEVICE will show that the
Shadowset is in 0% merge but SDA will show that a minimerge
is pending.
o A double deallocation crash may occur as the result of MOUNT
not properly initializing the Mounted Volume List pointer.
This pointer may have a stale value as a result of two calls to
SYS$VMOUNT from a single program. The stale pointer will only
cause a problem if the system is unable to allocate space for
defining the logical name.
NOTE: Since cells are initialized at image activation, this
problem should not occur as a result of DCL commands.
o Tape devices with stacker/loaders, such as the TF857, may take
up to 6 minutes to rewind/unload/load the next tape. In
VAXSHAD01_061, a change was made to the behavior of MOUNT to
take this delay into account. However, a side effect of that
change was that non-stacker drives may also wait 6 minutes
before failing.
o A system may crash with an INVEXCEPTN during an SHDRIVER
COPY_DATA_REPAIR copy operation.
o If the value of the ALLOCLASS SYSGEN parameter is not set and
the user tries to use shadowing, a shadow volume can be created
but members cannot be added to the shadow set. No error
messages are received up until a second member is added. On
the MOUNT command, the customer will receive the error
messages:
$ mount /system dsa500 /shadow=dkb400 alphavms015
%MOUNT-I-SHDWMEMFAIL, DKB400 failed as a member of the shadow set
-SYSTEM-F-INCSHAMEM, incompatible shadow set member
"Incompatible" is an inappropriate statement of the problem. A
more accurate message would be "missing allocation class," or
"incorrect allocation class."
o If a shadow set member is dismounted at the same time from
multiple nodes within a cluster, I/O to that shadow set may
become stalled.
o Mount will not add shadow set members unless they are either
MSCP or SCSI.
o Shadow set member expulsion was based on the time it took a
fork & wait and a PACKACK to complete rather than the actual
time transpired. On some devices, particularly SCSI, where a
PACKACK can take approximately one minute, the timeout was much
too long. Using the default value of 20 (seconds) for
SHADOW_MBR_TMO would actually mean that it would take 20
minutes to expel a member experiencing errors from a SCSI
shadow set.
o SHDRIVER loss of synchronization may result in a crash where
SHADDETINCON is triggered by the check at the end of
MATCH_MASTER_SCB. In this consistency check, the
SHAD$W_DEVSTS_PASSIVE_MV_CNTR is verified to be zero and is not.
Another symptom is that the virtual unit UCB$W_RWAITCNT is
zero. Shadow set member counts of zero may also be seen.
o Crashes may occur in EXPEL_PACKACK_ANY with connections broken to
all members and IRP$L_SHD_LOCK_FR5 = 1 (packack retries exhausted).
o All members of a shadow set become inaccessible at the same time and
remain inaccessible for a period of time greater than "shadow
member timeout" (SHADOW_MBR_TMO or SHADOW_SYS_TMO) seconds but
less than MVTIMEOUT seconds. All members subsequently become
accessible within seconds of each other but not at exactly the same
time. This results in all but one member being expelled from the
shadow set.
This often occurs when changing HSJ microcode and all members are
connected to the same HSJ. When brought back online, polling will
cause the devices to be found seconds apart which will result in
all but one member being expelled.
o All members of a shadow set must be checked to see if they meet
the criteria of being MSCP. The original design did not allow
for having no index zero member.
o When the mounting of full copy targets exceeds the SHADOW_MAX_COPY
threads for a given node, other nodes with the shadow set mounted
do not pick up the copy work.
o In a cluster, using $PROCESS_SCAN explicitly or implicitly with the
DCL 'SHOW USER' command sometimes causes a system crash due to an
ACCVIO in kernel mode or an IVSSRVRQST bugcheck.
o When a node with a SCSI bus boots, it resets the SCSI bus. In a
multi-host SCSI cluster, this can cause the other node to
experience I/O failures. Normally, this results in a brief mount
verification. The I/O is retried, succeeds, and there is
no serious consequence. However, if the other node is in the
process of booting and the system disk is a shadow set, the
system will crash.
o A PGFIPLHI bugcheck may occur in the SHADOW_SERVER process at
the REMQUE in K_GET_COPYSHAD_IRP. On OpenVMS VAX, the PC is
A0E and the VA is 274.
o A page setup module which draws a frame and company logo on each
page of output is used on a queue pointing to an LN03. This page
setup module works on OpenVMS VAX Version 5.5-2 and prior versions.
However, with VAXQMAN8_U2055 (CSCPAT_1165) or OpenVMS VAX Version
6.1 installed, this page setup module causes the printer to
continually spew out paper with only the output from the page setup
module. This continues until the entry is deleted from the queue.
o Due to an inadequate synchronization mechanism, the MONITOR DISK
command can go into an infinite loop on multi-processing machines.
o A race condition may occur in a VMScluster. This happens
most frequently on clusters where the 'SET AUDIT/SERVER=NEW'
command is issued repeatedly. The race condition presents
itself as one or more of the audit servers within the
cluster continuing to use the old audit journal rather
than using a newly created journal.
o A system may crash with a PGFIPLHI bugcheck with a "PAGE FAULT
at IPL too high" error message.
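Two items in the VAXSHAD05_061 list above involve SHADOW_MBR_TMO
timing. The Python sketch below works through the arithmetic: it
contrasts a timeout counted in recovery passes with one measured in
elapsed seconds, and tests whether an outage falls in the window
(longer than the member timeout, shorter than MVTIMEOUT) described
for the all-but-one expulsion problem. The function names, the
60-second PACKACK figure (taken from "approximately one minute"), and
the MVTIMEOUT value of 3600 are assumptions for illustration; this is
not SHDRIVER code.

```python
def expel_delay_counted_in_passes(mbr_tmo, seconds_per_pass):
    """Delay when each timeout unit costs one recovery pass (old behavior)."""
    return mbr_tmo * seconds_per_pass


def expel_delay_elapsed(mbr_tmo):
    """Delay when the timeout is measured in elapsed seconds."""
    return mbr_tmo


def outage_in_expulsion_window(outage_seconds, mbr_tmo, mvtimeout):
    """True when the outage exceeds the member timeout but not MVTIMEOUT."""
    return mbr_tmo < outage_seconds < mvtimeout


if __name__ == "__main__":
    tmo = 20        # default SHADOW_MBR_TMO cited in the text
    packack = 60    # assumed: "approximately one minute" per PACKACK
    mvtimeout = 3600  # assumed example value for MVTIMEOUT
    print(expel_delay_counted_in_passes(tmo, packack) / 60, "minutes")
    print(expel_delay_elapsed(tmo), "seconds")
    print(outage_in_expulsion_window(300, tmo, mvtimeout))
```

With these inputs the pass-counted delay comes to 1200 seconds, which
matches the 20 minutes quoted in the expulsion item.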
Problems Addressed in the VAXSHAD04_061 Kit:
o When booting two or more systems simultaneously from shadowed
system disks, the systems may appear to hang. Crashing the
systems and examining the crash dumps indicates that shadowing
driver blocking AST routines have not run.
o When a node runs out of SHADOW_MAX_COPY threads while mounting
new copy target units, other nodes in the cluster that have
available SHADOW_MAX_COPY threads will not pick up the copy
work. This results in the copy not being started for copy
members that are added to shadow sets.
Problems Addressed in the VAXSHAD03_061 Kit:
o A double-deallocation crash may occur as the result of MOUNT not
properly initializing the MTL pointer. This pointer had a stale
value as a result of 2 calls to SYS$VMOUNT from a single program.
The problem will not happen as a result of DCL commands, as the
cells are initialized at image activation. The stale pointer
will only cause a problem if the system is unable to allocate
space for defining the logical name.
o An OPCOM message was output even though /NOASSIST was
specified in the MOUNT command. This caused problems for UETP.
o When booting from a Controller-Based System disk for the first
time as a Host-Based System disk, boot fails and a SHADBOOTFAIL
Bugcheck occurs. A SHADBOOTFAIL will also occur if the
SHADOW_SYS_UNIT is changed at boot time.
o During a copy operation the system may crash with an ACCVIO.
o Reduce the volume of messages printed during SHDRIVER volume
processing to make the messages that are printed more
meaningful to the user. This involves minor modifications to
SHDRIVER to suppress messages that do not indicate actual
problems. No messages have been modified, deleted, or changed.
Only the frequency with which they are printed has changed.
o The path selection logic for DUDRIVER had a timing problem that
caused devices to be mounted by an MSCP server, even though a
local controller could be used. Although this symptom could
still appear under extreme circumstances, the majority of devices
should now find the local controller.
o In a large LAVC (Local Area VAXcluster) after one or more nodes
leave the cluster, state transition times can be excessive, and
the following messages may be repeatedly sent to the consoles of
the various nodes:
%CNXMAN, proposing reconfiguration of the VAXcluster
%CNXMAN, aborting VAXcluster state transition
The state transition, which normally should complete within 1-3
seconds, instead may take 15-55 seconds or more.
o Incorrect MSCP-served disk synchronization would cause I/O to an
MSCP-served disk to get stalled on an internal queue and later
restarted.
o An internal routine, MOVE_SERVER, had a sequencing problem and
could cause stalled I/O to a served shadow-set member.
o MSCP server crashes may occur in large clusters.
Problems Addressed in the VAXSHAD01_061 Kit:
o A delay of up to six minutes can occur before a
device-not-ready condition is reported during cartridge volume
switching on non-SCSI (Small Computer System Interface)
TX867-type devices.
o Some of the OpenVMS VAX console executive messages have changed
to mixed upper and lower case letters for OpenVMS VAX V6.0
message text. The result is that current VCS scan files will not
match the console text, and VCS alarms will fail to trigger.
(Please see the ECO kit release notes for more information and
instructions regarding this fix.)
o There is no synchronization between SHADOW_PROCESSING and
INVALIDATE_ALL_ENTRIES, which allows these two code threads to
run simultaneously. This can cause a system crash due to the
fact that the SHADOW_PROCESSING thread may remove a member from
a multimember shadow set and the INVALIDATE_ALL_ENTRIES thread
is not aware that the member has been removed. The system
crash occurs in RESTORE_WLE because no Write Log table
exists.
o When shadow set members are not available for SHADOW_MBR_TMO
seconds, they should be expelled from the shadow set.
Sometimes when two members of a three-member set enter this
condition, only one member will be successfully removed. The
other member will not be removed, and this will cause the
virtual unit to hang until the errant member returns or it is
manually removed from the set via push button.
o In Volume Shadowing for OpenVMS Alpha Version 6.1, several
changes were made to the assisted merge (minimerge)
functionality. These changes disabled minimerge functionality
across mixed architecture VMSclusters. With minimerge
disabled, shadowing continued to function normally, except that
a full merge was always done when a merge operation occurred.
Full merges take considerably longer than minimerges. If
minimerge functionality is desired, Digital recommends that
this kit be installed across any VMSclusters that contain an Alpha
node running OpenVMS Alpha Version 6.1.
Mixed-architecture VMSclusters that are running OpenVMS Alpha
Version 6.1 must apply this kit and reboot the entire cluster
simultaneously. In these cases, rolling upgrades are not
supported.
o Prior to this remedial kit, if attempts were made to mount an
RZ28B disk device with an RZ28 in the same shadow set, Volume
Shadowing detected different device IDs and may not have
allowed the devices to be mounted. This behavior applied only
to an RZ28/RZ28B shadow-set combination when connected with a
local SCSI controller. Since RZ28 and RZ28B are different
device types but can be shadowed, the checking for shadow-set
membership in the host-based shadowing software needed to be
modified.
This remedial kit enables the combination of RZ28 and RZ28B
devices in a shadow set, as long as they are connected to like
controllers. With the use of SCSI devices, like controllers
are required because geometry can vary from controller to
controller. Digital recommends that SCSI shadow sets be
configured across like controller types. Existing SDI and DSSI
configurations are unaffected; if they are not using SCSI
drives and are shadowing SDI devices across different
controllers, these configurations will continue to work
without this remedial kit.
VMSclusters with shadowed SCSI disks and mixed-architecture
VMSclusters running OpenVMS Alpha Version 6.1 must apply the
kit and reboot the entire cluster simultaneously, so that the
entire VMScluster is running the same version of Volume
Shadowing software. The kit is required for both VAX and Alpha
nodes. Do not mount shadow sets containing RZ28 and RZ28B
devices without first applying this kit.
o If a Shadowset Virtual Unit is dismounted during a full copy,
the full copy target's SCB is incorrectly written. This allows
a subsequent mount of that shadow set member to succeed as if
the copy had completed.
o System crashes may occur in RESTORE_WLE because there is no
Write Log table. In fact, a member has been removed from the
set. This problem is similar to, but different from, the DU/SH
Synch problem that causes the same symptom.
o When members are not available for SHADOW_MBR_TMO seconds, and
other members are available, the unavailable members should be
ejected from the shadow set. In certain configurations, with
the current version of the driver, should two members of a
three member set enter this condition, only one member will be
successfully removed. The other member will not be removed
and the virtual unit will hang until the errant member returns,
or it is manually removed from the set via push button.
This behavior has been fixed in this kit. Any members that
remain unavailable for greater than SHADOW_MBR_TMO seconds will
be fully expelled from the set.
o Device not ready for magtapes was not reported until a delay of
up to 6 minutes expired.
Problems Addressed in the VAXSHAD02_060 Kit:
o A SHADDETINCON was caused by the X-64A1 check-in because the
wrong GPR was used when an unlikely system address was stored
into the IRP.
Problems Addressed in the VAXSHAD01_060 Kit:
o If SHDRIVER encounters a situation where more than one member
of a three-member shadow set goes into error recovery at the same
time, and they cannot be brought back into the shadow set
(i.e., loss of connectivity, media offline, write locked
device, etc.) SHDRIVER will expel one of the members and crash
with a SHADDETINCON when it cannot update the SCB on the
remaining members.
o When all shadow set members are write locked, a bugcheck will
occur due to R4 being destroyed across a JSB to
SHSB$GET_CLEAN_IRP. This fix preserves that register.
o The SHADOW_MAX_COPY SYSGEN parameter is used to set how many
merge/copy threads may be started at the same time on a node.
This was not working. Systems would start more than
SHADOW_MAX_COPY number of threads.
o SHDRIVER system disk member timer issues and R2/R5 corruption
problems:
1. The SHSB$MATCH_MASTER_SCB routine makes improper use of
SHSB$PAUSE. The use of SHSB$PAUSE causes the SHAD (in R2)
not to be preserved when the time delay is invoked
(since it forks), so the resulting value in R2 is
indeterminate.
2. The SHSB$MATCH_MASTER_SCB routine makes improper use of
SH$TIME_DELAY. An input requirement of SH$TIME_DELAY is
to have a UCB in R5.
3. The SH$ABORT_VP routine makes improper use of
SH$TIME_DELAY. The use of SH$TIME_DELAY causes the SHAD
(in R2) not to be preserved when the time delay is invoked
(since it forks), therefore the resulting value in R2 is
indeterminate.
4. The benefit of reassembling a multiple member system disk
shadow set is lost to some configurations if the current
fixed amount of time expires and all of the former members
of the shadow set are not available. This has caused
escalations to be raised to address this specific behavior.
Second, enable the differentiation of the member time
out time for system disk versus other disks. Last,
make the currently hardcoded wait of FF seconds to
connect to all members of an existing system disk a
user-controlled variable.
o SHDRIVER MVTIMEOUT after member error and R5 corruption
problems:
1. The spontaneous removal of one shadow set member of a multiple
member set due to a fatal error causes some cluster nodes to
hang the virtual unit until the MVTIMEOUT time expires.
2. In SHSB$VALIDATE_SHADOW_SET, the wait loop at 130$ does not
correctly restore the contents of R5 to be the VU VCB
after a call to SHSB$PAUSE.
o WLE_POST_PROC is not done on all the clones. This causes
allocation of new unnecessary Write Log Entries. The Write Log
INUSE bit is never cleared so the table has to be expanded.
Once the table expands to MAX, Write Logging is disabled. When
Write Logging gets turned back on it starts all over. All the
entries in the controller will be exhausted forcing Write Log
Exhaustion handling and in some cases the controller will be
reset.
o While doing INVALIDATE_ALL_ENTRIES, if the READ of LBN #1 fails
or WLG has been turned off, a branch goes to the wrong
location. This results in issuing an IO with no READY clones
and the system will wait forever with SEQCMD lock held and
RWAITCNT bumped.
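The SHADOW_MAX_COPY item above describes a per-node cap on concurrent
merge/copy threads that was not being enforced. As a rough
illustration of the intended behavior only, the following Python
sketch refuses to start a thread once the cap is reached. The class
and method names are hypothetical and do not reflect how SHDRIVER
implements the limit.

```python
class CopyThreadLimiter:
    """Toy model of a per-node cap on concurrent copy/merge threads."""

    def __init__(self, shadow_max_copy):
        if shadow_max_copy < 0:
            raise ValueError("shadow_max_copy must be non-negative")
        self.shadow_max_copy = shadow_max_copy
        self.active = 0

    def try_start(self):
        """Start a thread only if the cap allows it; report success."""
        if self.active >= self.shadow_max_copy:
            return False
        self.active += 1
        return True

    def finish(self):
        """Record a completed thread, never dropping below zero."""
        if self.active > 0:
            self.active -= 1


if __name__ == "__main__":
    limiter = CopyThreadLimiter(shadow_max_copy=2)
    results = [limiter.try_start() for _ in range(3)]
    print(results, limiter.active)
```

In this model a third start request on a node with a cap of 2 is
declined until one of the active threads finishes.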
Problems Addressed in the VAXSHAD08_U2055 Kit:
o After installation of CSCPAT_0269 V2.7 (VAXSHAD07_U2055), the
system may crash with a SHADDETINCON bugcheck. The bugcheck
occurs when a disk is removed from a mounted shadow set.
Problems Addressed in the VAXSHAD07_U2055 Kit:
o Write Log Usage fixes:
1. The first problem symptom is that user I/O to a virtual
unit may intermittently hang on any node (usually only
one) that has a multiple member virtual unit mounted. The
hang can occur with no other overt error symptoms evident
in either the error log or as seen by analyzing the live
system.
2. The second problem symptom is less apparent, in that the
resources used for the write history management function
are managed in a more efficient manner.
o When one member of a multiple-member shadow set encounters a
fatal device error, the node that discovers the initial problem
will successfully expel that device from the set. However,
other nodes that are under heavy I/O loads when the device is
expelled may occasionally fail to recover the full membership.
This will cause the virtual unit to hang until the MVTIMEOUT
time limit is reached.
Problems Addressed in the VAXSHAD05_U2055 Kit for OpenVMS VAX V5.5-2:
o A documentation change was made to the VAXSHAD04_U2055
kit to remove an incorrect reference.
Problems Addressed in the VAXSHAD04_U2055 Kit for OpenVMS VAX V5.5-2:
o When a host receives a controller error, Volume Shadowing Phase
II processing removes whatever device is at SHAD index 0 even
if this member was not the one that experienced the controller
error. Once the index 0 member is gone, all other controller
errors are ignored.
o The ability to switch to the current master member of a system
disk shadow set has a limited configuration of
controller/adapter types. Crash dumps that were correctly
written (according to console output) cannot be found for
analysis when using an HBVS multiple-member system disk shadow
set.
o Applications can hang or experience I/O transfer errors when
using multiple-member shadow sets that are connected in such a
way that segmented I/O transfers are needed. This has been
reported on systems running WordPerfect[TM].
o An INVEXCEPTN crash can occur if the allocation of a clone
chain fails. If a
FANOUT_ALLOCATION_XXX request fails, the MIRP is still linked
to the active queue which causes the next REMQUE to fail.
o If a system that currently holds the WATCHER lock crashes while
it is validating the status of a Host-Based Volume Shadow Set
that is mounted cluster-wide and another node assumes the WATCHER
lock, an IPL 8 system hang can occur.
o A SYSDUMP.DMP file that appears to be written correctly can be
invalid when the boot device and the master member of the system
disk shadow set diverge. The device that the system dump is
written to has always been the boot device. The SHDRIVER.EXE in
this kit allows the system dump to be written to a member of the
system disk shadow set other than the boot device. Upon successful
write completion, the unit number will be displayed on the console.
o The VAX 7000 had been restricted to using the boot device as the
only valid dump device in a prior remedial image. Additionally,
proper operation was not allowed at shutdown or when a crash dump
needed to be written because an incorrect message was sent
concerning the path to the system disk.
o Occasionally, not all of the former members of a system disk
shadow set will return upon a system reboot. This problem will
occur only if the virtual unit is not otherwise mounted in the
cluster at boot time.
o Under certain conditions, once a virtual unit exceeds the mount
verify time-out time, the correct behavior does not occur.
Indeterminate behavior results from the use of a corrupted SHAD
pointer because fork context requirements are not observed.
o If a node is booting into a cluster and the boot device being
used is already mounted in the cluster as a member of a virtual
unit with a different virtual unit number, the node is
incorrectly allowed to continue to boot into this cluster.
o Under certain circumstances, the SHADOW_MAX_COPY SYSGEN
parameter does not regulate the number of copies a particular
node will control. This effectively nullifies the significance
of setting any value in SHADOW_MAX_COPY.
o Configurations that consume a great number of event flags and
create a large number of multiple-member shadow sets (i.e.,
greater than 50) may experience a system crash.
o Inadvertent placement of the SCB (System Control Block) can
adversely affect the best time calculation needed for a full
merge operation. This will, in turn, adversely affect the
total time it takes to perform the full merge operation.
o If a multiple-member shadow set that has write logging enabled
receives enough write I/O operations to push the write log
table beyond its expansion limit of 4K, the set may hang. This
condition can occur with no evident error symptoms.
o If a member of a three-member shadow set loses its connection
for SHADOW_MBR_TMO time, a decision to remove that member is
initiated. Should either of the remaining members not be able
to complete an SCB (System Control Block) update, the removal
operation may occasionally result in a SHADDETINCON crash.
o When a multiple-member system disk shadow set is in use and a
number of nodes are rebooted at the same time, sometimes the
path to one of the non-boot device members is used before it
has been properly initialized. This causes a race condition
which may result in an MSCPCLASS bugcheck.
Problems Addressed in the VAXSYS14_061 Kit:
o There is a race condition that may occur when a CFCB (Cache File
Control Block) is being deleted due to XQP action and cache
space is being reclaimed from a LIMBO file.
o Disk corruption can occur when heavy open/read/write/close/delete
operations are occurring.
o At some point after a node CLUEXITs, two or more cluster nodes
crash with LOCKMGRERR bugchecks.
o When two or more VAX or Alpha nodes boot at the same time, one
or more of them may crash.
Problems Addressed in the VAXSYS16_U2055, VAXSYS15_U2055,
and VAXSYS14_U2055 Kits:
o Two new fields were added to the IRP data structure for shadow
write logging information. This new IRP definition size
conflicts with the IRP sizes of other images on the system.
This conflict may cause a variety of errors, including fatal
bugchecks. This fix changes the IRP definitions back to the
SBB versions.
This problem is corrected in OpenVMS VAX V6.2.
Problems Addressed in the VAXSYS13_U2055 Kit:
NOTE: According to OpenVMS Engineering, the fixes contained
in VAXSYS13_U2055 have been included in OpenVMS VAX V7.0.
o System crashes may occur due to corrupted PTEs (Page Table
Entries). The corruption appears to be related to Global Section
Table Entries pointing to Global Section Descriptors.
The problem occurs only if the limit of 4095 GBLSECTIONS is
exceeded. To check the number of global sections currently in
use, add the following values:
o SDA> VALIDATE QUEUE EXE$GL_GSDSYSFL !global sections
o SDA> VALIDATE QUEUE EXE$GL_GSDDELFL !delete pending global
sections
o SDA> VALIDATE QUEUE EXE$GL_GSDGRPFL !group global sections
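The addition described above can be sketched as follows. This is a minimal illustration, assuming that each VALIDATE QUEUE command reports the number of entries in its queue and that those three numbers have been noted by hand. The function names and the sample counts are hypothetical and are not part of the kit.

```python
# Threshold quoted in the text above.
GBLSECTIONS_LIMIT = 4095


def global_sections_in_use(system_count, delete_pending_count, group_count):
    """Add the entry counts noted from the three SDA queue validations."""
    return system_count + delete_pending_count + group_count


def exceeds_limit(system_count, delete_pending_count, group_count):
    """Return True when the combined count is above the 4095 threshold."""
    total = global_sections_in_use(system_count, delete_pending_count, group_count)
    return total > GBLSECTIONS_LIMIT


if __name__ == "__main__":
    # Made-up counts for illustration only, not taken from a real system.
    print(global_sections_in_use(3900, 12, 200))
    print(exceeds_limit(3900, 12, 200))
```

A combined count above 4095 would indicate that the system falls within the condition described for this problem.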
Problem Addressed in the VAXSYS12_U2055 Kit:
o Due to an inadequate synchronization mechanism, the MONITOR DISK
command can go into an infinite loop on multi-processor
machines.
Problem Addressed in the VAXSYS11_U2055 Kit:
o The system crashes with a PGFIPLHI bugcheck and the message
"Pagefault at IPL too high". The VA is pointing to a
CCB (Channel Control Block) and the PC is located within
the MBDRIVER module.
Problem Addressed in the VAXSYS10_U2055 Kit:
o Performance may be degraded due to excessive kernel mode time
being spent in MMG$FREWSLE attempting to find a working set
page to replace.
Problems Addressed in the VAXSYS09_U2055 Kit:
o In a small working set, it is possible for the EXE$PSCAN_NEXT_PID
routine (which is called by $GETJPI) to take a page fault at IPL 8.
This causes a PGFIPLHI bugcheck. The page referenced is in the
PROCESS_SCAN context block (PSCANCTX$ data structure) in process
virtual address space.
o The $SETIMR and $SCHDWK system services, which request timer
interrupts, may cause a system to hang. This occurs when the
time specified for the wake has already passed.
Problems Addressed in the PRCMGT$01_U2055 Kit:
o A system crash may occur at POSIX$KERNEL+3B371 with POSIX$DCL as
the current image. The crash is provoked when a user logs in
with /CLI=POSIX$CLI. The DCL command may cause the system to
crash, or the process to evaporate. Occasionally, the crash will
occur following a few carriage returns.
This problem is corrected in OpenVMS VAX V6.0.
o Fixes for various problems in $GETJPI (ECO 15):
The following problems have been reported in the $GETJPI and
$GETJPIW system services (executive routine EXE$GETJPI):
· Process hangs while waiting for $GETJPI to complete
A process might wait forever in LEF state while attempting to
retrieve an item which required access to another process's P1
space. While this kit includes changes which fix some
instances of this problem, there is the possibility it may
still occur. Should the problem persist after installing this
kit, one may work around the hang by revising the application
and adding a timer request AST and recovery routine. For more
information, refer to an article in the OPENVMS database using
a search string of:
Application and $GETJPIW and $GETJPI and Hang
· SSRVEXCEPT bugchecks
There were several instances where EXE$GETJPI would try to
access data structures formerly assigned to a now deleted
process. Most frequently, the problem showed up as an access
violation at EXE$GETJPI+712 while trying to retrieve the
external PID.
· PGFIPLHI bugchecks
This involved another instance of access to the former data
structures of a deleted process. In this case though,
EXE$GETJPI attempted recovery. The recovery was incorrect and
would lead to unreleased spinlocks, high IPL access to paged
code and other problems.
· KRPEMPTY bugchecks
This was yet another instance of access to a deleted process.
If the process was selected by a "wildcard" PID, EXE$GETJPI
would attempt to allocate an entry from the KRP lookaside list
without having released a previous entry.
· Stack corruption
The kernel stack could be corrupted if the target process of a
$GETJPI request was out of AST quota.
· Incorrect AST quota
AST quota could be gained or lost on an SMP system because of
access via non-interlocked instructions.
· Final status of 0 in R0
A user could get a final status of 0 in R0 if the PHD of a
target process was swapped out.
These problems are corrected in OpenVMS VAX V6.0.
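The workaround suggested above for the $GETJPI hang (a timer request plus a recovery routine) follows a general pattern: bound the wait and run a recovery path when the bound is exceeded. The sketch below shows only that pattern, in Python rather than with OpenVMS system services. It is not the code from the referenced article, and every name in it is hypothetical.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, on_timeout):
    """Run func in a worker thread; call on_timeout if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The worker is not cancelled here; it is simply no longer waited on.
        return on_timeout()
    finally:
        # Do not block on the worker while leaving this function.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    def slow_query():
        # Stand-in for a request that does not complete in time.
        time.sleep(0.3)
        return "late answer"

    print(call_with_timeout(slow_query, 0.05, lambda: "recovered"))
```

The design point is that the caller, not the blocked request, decides when to give up, which is also what a timer AST provides in the workaround described above.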
Problem Addressed in the VAXSYS01_2H4055 Kit:
o VAX 4000 Model 100A, 500A, 600A and 700A will no longer be able
to boot via the Q-bus after installation of DECnet/OSI V5.5 or
V5.6. These versions of DECnet/OSI eliminate code for support of
new hardware in OpenVMS VAX V5.5-2H4.
Problems Addressed in the VAXSYSL04_U2055 Kit:
o The PE1 parameter which was previously used to control the size
of trees to be remastered has been changed. If a negative
value is placed in the parameter, the RRSCAN routine will exit
without doing any scans or remastering.
o When the system is scanning for trees to remaster due to a
change in cluster membership, RM_QUOTA may be exhausted. When
this occurs, possible RSB queue corruption may result.
o The system crashes with a LKBREFNEG bugcheck when a parent sub
lock count exceeds 32K on a $DEQ.
o The system crashes with a RSBREFNEG bugcheck when a parent sub
resource count exceeds 32K on a $DEQ.
o During dynamic remastering, performance is degraded when large
lock trees are moved.
o The LKID_MSK routine which is used to mask off the LKID (Lock
ID) from the SEQN is incorrectly generated in DSTRLOCK. This
can cause the LKID Validation Routines to incorrectly indicate
that a LKID is invalid.
o Locks are sometimes granted out of order during remastering of
a resource.
o The "Recover" privilege is not being correctly checked. This
prevents recovery processing from recovering databases after
node failures.
o When the resource for a two phase conversion in progress is
canceled, a fatal bugcheck will occur if the resource's BLOCKAST
count is invalid.
o The activity scan rate of the Lock Manager has been changed
from 1 second to 8 seconds to reduce Lock Manager overhead and
make the tree moving algorithms more conservative.
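The LKBREFNEG and RSBREFNEG items above both involve a count that exceeds 32K. One plausible reading, which this summary does not confirm, is that the count was held in a signed 16-bit field, so that a value past 32767 is seen as negative. The sketch below illustrates only that arithmetic. The function name is hypothetical.

```python
def as_signed_16(value):
    """Interpret the low 16 bits of value as a two's-complement word."""
    value &= 0xFFFF
    # If the top bit (0x8000) is set, the word represents a negative number.
    return value - 0x10000 if value & 0x8000 else value


if __name__ == "__main__":
    print(as_signed_16(32767))      # largest positive 16-bit signed value
    print(as_signed_16(32767 + 1))  # one more wraps to a negative value
```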
Problems Addressed in the VAXMONT01_061 Kit:
o When the 'MONITOR DISK' command is issued on a system with DFS
devices mounted, only the first three characters of the DFS
disk name are displayed correctly. The last character is
often displayed as a non-printable character or as an escape
sequence. This may cause terminal lock-ups, resetting of
terminal characteristics or other unexpected terminal side effects.
o The 'MONITOR DISK' command may appear to hang when monitoring
a system with more than 800 disks. An error occurs, but the
error status is not displayed. The hang may also occur when a
MONITOR CLUSTER command is issued.
o Due to an inadequate synchronization mechanism, the 'MONITOR
DISK' command can go into an infinite loop on multi-processor
machines.
o Use of the 'MONITOR PROCESS' command in a local environment will
fail if the SYSGEN parameter MAXPROCESSCNT is set to allow more
than 1040 processes. When Virtual Balance Slots were added in
OpenVMS V6.0, this number dropped to 978.
o In a mixed-version OpenVMScluster, the following MONITOR
command will crash the target V6.0 node if it is issued
from a V5.5-2 node:
$ MONITOR STATES,POOL,DECNET,LOCK /NODE=V6.0_node
Problem Addressed in the VAXMONT03_U2055 Kit:
o The image that corrects the MAXPROCESSCNT problem should have
been included in the VAXMONT02_U2055 kit but was not.
Problems Addressed in the VAXMONT02_U2055 Kit:
o An error occurs after the following MONITOR command is
used:
$ MONITOR [CLASS] /NODE={nodelist}
The error indicates that the connection to a remote node
has been lost and the collection activity terminates for
that node.
o The MONITOR process class will not function if the SYSGEN
parameter MAXPROCESSCNT is larger than 1040. The following
errors will be returned:
%MONITOR-E-COLLERR, error during data collection
-SYSTEM-F-BADPARAM, bad parameter value
Problem Addressed in the VAXMOUN05_U2055 Kit:
o A delay of up to six minutes can occur before a
device-not-ready condition is reported during cartridge volume
switching on non-SCSI (Small Computer System Interface)
TX867-type devices.
Problems Addressed in the VAXMOUN04_U2055 Kit:
o RE-INITIALIZATION errors are reported to users of SCSI
tape drives attached to an HSx controller. This occurs
if multiple SCSI tapes are attached to the HSx and all the
tapes are at or near PEOT and the connection to the HSx is
broken.
o A tape drive will sometimes fail over to another HSx
controller after the tape is dismounted.
o Numbers greater than 9999 which are randomly generated
by HSx devices may cause the system to crash.
o Packet Acknowledgements (PACKACK) issued on client nodes
that are using a specified preferred path will fail if
the specified path is not the current primary path and
the path cannot be changed because the disk is online
through another path.
o In Controller Based Shadowing, mounting a disk named
DUx or a tape named MUx causes the following error
message to appear:
%MOUNT-W-CBSNOTSUPTD, Attention - Phase I Shadowing is not supported
as of OpenVMS VAX V6.1
%MOUNT-I-MOUNTED, SCRTCH mounted on _$5$MUA0: (MOOSHEAD)
This error message should only appear when an attempt is
made to mount a DUS device.
o A user is unable to read the second volume of backup tapes
written under OpenVMS V5.3. However, the tapes can be
read successfully on OpenVMS VAX V5.5-1.
o If a logical is specified on a MOUNT shadow set command line
and this logical has the same name as one of the shadow set
members, then the following command sequence will fail with
an INCONSDEV mount error which will cause a system crash:
$ MOUNT/SYSTEM DSA0/SHADOW=$1$DIA0: TWI_TEST $1$DIA0
%MOUNT-I-MOUNTED, TWI_TEST mounted on _DSA0:
%MOUNT-I-SHDWMEMSUCC, _$1$DIA0: (SPRING) is now a valid
member of the shadow set
$ MOUNT/SYSTEM DSA0/SHADOW=$1$DIA1: TWI_TEST
%MOUNT-F-INCONSDEV, inconsistent device types
o If no operator is present to respond, MOUNT within a subprocess
will fail with the following message:
%MOUNT-F-BATCHNOOPR, No operator available to service batch
request
o When a child process mounts a device, MOUNT implicitly allocates
the device to that process (i.e., a channel is opened to the
device). On a dismount, ownership of the device changes to the
parent process. A subsequent mount of the device by the child
process will fail because the device is now allocated to the parent.
o The new message "Another Volume Set of the Same Label is
Already Mounted" has been added.
o If a tape device does not support compaction, then the
MOUNT/FOREIGN/NOCACHE command mounts the device with
CACHE ENABLED.
o MOUNT waits only 10 seconds to allow SCSI magtape
devices to become ready before determining that the
device is offline. Tx8x7 tape devices may take
up to 6 minutes to become ready during a volume
switch.
o MOUNT is unable to skip a number of records greater
than 8000 hexadecimal when it tries to reposition
tapes after a label verification in mount verify.
o A tape initialized with the following command will not be
mounted if the user is not the owner, even if all privileges
are enabled (i.e., user is SYSTEM):
$ INITIALIZE/LABEL=(VOLUME_ACCESSIBILITY:"%")/OWNER=[100,100] -
/PROTECTION=(S:RWED,O:RWED,G,W)