TITLE: HSZ_NAM_ISSUE Dual Redundant HSZ Naming Issue in Tru64 UNIX V5.0
Copyright 2000 Compaq Computer Corporation. All rights reserved
DATE: 3 February 2000
PRODUCT: Tru64 UNIX V5.0
SOURCE: Compaq Computer Corporation
TITLE:
Dual Redundant HSZ naming issue; Tru64 UNIX V5.0 or later;
Impacts fail-over function.
=================================================================
PRODUCT NAME(S) IMPACTED:
PRODUCT FAMILY(IES):        PRODUCT NUMBERS:
Storage         _X_         HSZ70
Systems         _X_         Alpha
Networks        ___
PC              ___         ________________
Software        _X_         Tru64 UNIX 5.0
Other (specify) ___         ________________
PROBLEM STATEMENT:
==================
The Tru64 UNIX V5.0 software uses the product identifier (PID) field
provided by HSZ controllers to assist in uniquely identifying logical
disk units. If dual controllers with different PID fields are present,
then the logical units will not be properly identified as the same units,
and controller failover will not occur. The hardware management software
[hwmgr(8a)] will not properly recognize the unit after the failover, and
all access attempts to the unit will fail. In addition, because hwmgr
sees the unit as a different logical unit after the failover attempt, a
new device name will exist for that logical unit. In effect, there will
be 2 device names for each logical device (1 name for each of the 2
controllers in the redundant pair).
The worst impact of this problem is that it can remain undetected until
an error occurs that would generate a controller failover. At the
crucial time when a failover is needed and expected, it will not work.
In HSZ controller firmware HSOF V7.7, this problem will be corrected by
ensuring that the PID fields are synchronized. For more details, refer
to the "FIRMWARE NOTE" below.
Background Info:
All SCSI devices contain what is known as a vendor identifier (VID) and a
product identifier (PID). These identifiers are installed in the device
during manufacturing. For dual redundant HSZ controllers to operate
correctly with Tru64 UNIX 5.0, the VID and PID contained in each of the
2 controllers in the set must match. It is possible, however, that some
controllers may have different PID fields. For example:
"HSZ70" versus "HSZ70 (C) DEC"
CONFIGURATIONS AFFECTED:
========================
Tru64 UNIX Version 5.0 or later systems with Dual Redundant HSZ70
controllers running firmware earlier than HSOF V7.7. (See the Firmware
Note at the end of this article.)
PROBLEM SYMPTOM:
================
Controller failover will not complete, and after the failover attempt all
access attempts to the unit will fail.
PROBLEM SOLUTION:
=================
Due to the severe consequences of this problem (fail-over inoperative)
and the "invisible" nature (you won't know about it until fail-over is
attempted), we recommend a pro-active approach to this problem. Field
personnel should check existing Tru64 UNIX V5 installations and take
necessary corrective action as described in this article. Tru64/Digital
UNIX installations that are to be upgraded to V5 should be examined for
this issue as a part of upgrade planning.
How to determine the PID field value for HSZ units:
---------------------------------------------------
The PID can be determined in several ways. A local terminal can be
attached to the HSZ maintenance terminal port, or the CLI window of SWCC
can be used. The "show this" and "show other" commands will display the
PID field on the first line of the output. The following example
illustrates a case of mis-matched PID values:
HSZ> show this
Controller:
HSZ70 (C) DEC ZG41400123 Firmware V25Z-0, Hardware A02
.
.
.
HSZ> show other
Controller:
HSZ70 ZG41800340 Firmware V25Z-0, Hardware 0000
.
.
.
(Note "HSZ70 (C) DEC" vs. "HSZ70")
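If the CLI output is captured to a file, the comparison can be done mechanically. The following is only a sketch: it assumes the controller line always has the form "PID, serial number, Firmware ..." as in the example above, and the function name controller_pid is made up for illustration.

```python
import re

# Assumed layout of the controller line: <PID> <serial number> Firmware ...
CONTROLLER_LINE = re.compile(r"^\s*(.+?)\s+(\S+)\s+Firmware\s")

def controller_pid(line):
    """Return the PID portion of a 'show this' / 'show other' controller line."""
    match = CONTROLLER_LINE.match(line)
    if match is None:
        raise ValueError("unrecognized controller line: %r" % line)
    return match.group(1)

# Controller lines copied from the example above.
this_line = "HSZ70 (C) DEC ZG41400123 Firmware V25Z-0, Hardware A02"
other_line = "HSZ70 ZG41800340 Firmware V25Z-0, Hardware 0000"

if controller_pid(this_line) != controller_pid(other_line):
    print("PID mismatch:", repr(controller_pid(this_line)),
          "vs.", repr(controller_pid(other_line)))
```

The last token before "Firmware" is treated as the serial number, so the sketch does not depend on any particular serial number prefix.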
An alternate method of examining the PID field is to use the scu command
from the host:
> scu
scu> set nexus bus a target b lun c
scu> show inq
<....>
Product Identification: HSZ70
(vs.)
Product Identification: HSZ70 (C) DEC
scu>
When using scu, a preferred target from each HSZ controller must be
examined when determining if they use the same PID field.
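The same check can be sketched for captured scu output. This assumes the "show inq" text from one preferred target on each controller has been saved as a string; only the "Product Identification:" line shown above is relied on, and the helper names are hypothetical.

```python
import re

def product_id(inquiry_text):
    """Return the Product Identification value from captured 'show inq' text."""
    match = re.search(r"^[ \t]*Product Identification:[ \t]*(.*?)[ \t]*$",
                      inquiry_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Product Identification:' line found")
    return match.group(1)

def pids_match(inquiry_a, inquiry_b):
    """True only when both targets report exactly the same PID string."""
    return product_id(inquiry_a) == product_id(inquiry_b)

# One line from each controller, taken from the example above.
first_target = "Product Identification: HSZ70\n"
second_target = "Product Identification: HSZ70 (C) DEC\n"
print(pids_match(first_target, second_target))
```

An exact string comparison is used on purpose, because the article requires the two PID fields to match exactly.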
What to do:
-----------
If the PID fields for each redundant pair are the same, then no action is
required. If the PID fields for a pair are found to be different, then
replace one of the dual controllers so that a match can be obtained.
Which name is matched does not matter, only that they match exactly.
When to perform the check:
--------------------------
The check should be performed:
- prior to the installation of the Tru64 UNIX 5.0 software;
- prior to an upgrade from a version previous to 5.0;
- if an HSZ70 controller in a dual-redundant set is replaced on a
Tru64 UNIX 5.0 system.
Any necessary corrective action should be taken prior to the install
of/upgrade to Tru64 UNIX V5.
What if the system is already running V5 with mis-matched controllers?
----------------------------------------------------------------------
The first step is to correct the mis-matched names. This means calling
field service and getting one of the mis-matched units replaced, or when
available (see below), upgrading the HSZ firmware so the names can be
matched.
In the following example, the list of known disk devices is displayed
using the "hwmgr -view devices -category disk" command. The problem
devices on which to focus are the ones with the mis-matched PID (and
corresponding "Model") fields:
# hwmgr -view devices -category disk
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
   27: /dev/disk/floppy0c          3.5in floppy   fdi0-unit-0
   34: /dev/disk/dsk0c     DEC     RZ28 (C) DEC   bus-0-targ-1-lun-0
   35: /dev/disk/dsk1c     DEC     RZ26L (C) DEC  bus-0-targ-2-lun-0
   36: /dev/disk/cdrom0c   DEC     RRD44 (C) DEC  bus-0-targ-5-lun-0
   37: /dev/disk/dsk2c     DEC     HSZ70          bus-1-targ-3-lun-0
   38: /dev/disk/dsk3c     DEC     HSZ70          bus-1-targ-3-lun-1
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
   40: /dev/disk/dsk5c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-1
Note that the HSZ "Model" field of dsk2c and dsk3c does not match the HSZ
Model of dsk4c and dsk5c.
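On a system with many units, the mismatch can be spotted by grouping the HSZ rows by their Model string. The sketch below works on saved "hwmgr -view devices" text. It assumes that every HSZ unit listed belongs to the same redundant pair and that SCSI rows end in a "bus-N-targ-N-lun-N" location; the function name hsz_models is hypothetical.

```python
import re
from collections import defaultdict

# HWID, device name, Mfg/Model text, SCSI location.
ROW = re.compile(r"^\s*(\d+):\s+(\S+)\s+(.*\S)\s+(bus-\d+-targ-\d+-lun-\d+)\s*$")

def hsz_models(view_output):
    """Map each HSZ Model string to the device names that report it."""
    models = defaultdict(list)
    for line in view_output.splitlines():
        row = ROW.match(line)
        if row is None:
            continue  # header, separator, or a non-SCSI row such as the floppy
        device, middle = row.group(2), row.group(3)
        start = middle.find("HSZ")
        if start == -1:
            continue  # not an HSZ unit
        # Collapse column padding so only the text of the Model is compared.
        models[" ".join(middle[start:].split())].append(device)
    return dict(models)

# Rows copied from the listing above.
sample = """\
36: /dev/disk/cdrom0c DEC RRD44 (C) DEC bus-0-targ-5-lun-0
37: /dev/disk/dsk2c DEC HSZ70 bus-1-targ-3-lun-0
38: /dev/disk/dsk3c DEC HSZ70 bus-1-targ-3-lun-1
39: /dev/disk/dsk4c DEC HSZ70 (C) DEC bus-1-targ-6-lun-0
40: /dev/disk/dsk5c DEC HSZ70 (C) DEC bus-1-targ-6-lun-1
"""

found = hsz_models(sample)
if len(found) > 1:
    print("Mis-matched HSZ Model fields:", found)
```

More than one key in the result means the pair reports different PID values and needs the corrective action described in this article.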
To correct this situation you have 2 choices.
Choice 1:
Reinstall V5 using this exact procedure:
Upgrade HSZ firmware to achieve an exact PID match between
the two controllers.
At the console level, force devices to be named from scratch:
P00>>> set bootdef_dev ""
Boot Tru64 UNIX V5 installation media and install V5.
Choice 2:
Perform this manual corrective procedure:
- Upgrade the HSZ firmware to achieve PID match
- Manually redirect the problematic scsi disks to new disk devices
that will be created during a hardware scan executed after the
PID match.
The remainder of this article demonstrates an example of this
manual procedure.
# hwmgr -view devices -category disk
(OR...)
# hwmgr -view dev -cat disk
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
   27: /dev/disk/floppy0c          3.5in floppy   fdi0-unit-0
   34: /dev/disk/dsk0c     DEC     RZ28 (C) DEC   bus-0-targ-1-lun-0
   35: /dev/disk/dsk1c     DEC     RZ26L (C) DEC  bus-0-targ-2-lun-0
   36: /dev/disk/cdrom0c   DEC     RRD44 (C) DEC  bus-0-targ-5-lun-0
   37: /dev/disk/dsk2c     DEC     HSZ70          bus-1-targ-3-lun-0
   38: /dev/disk/dsk3c     DEC     HSZ70          bus-1-targ-3-lun-1
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
   40: /dev/disk/dsk5c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-1
# df
Filesystem        512-blocks     Used  Available Capacity  Mounted on
/dev/disk/dsk4a       338542   276436      28250    91%    /
/dev/disk/dsk4g      3389096   520056    2530130    18%    /usr
/proc                      0        0          0   100%    /proc
We will match the firmware of the HSZ controllers so that the PIDs of
both controllers match the PID used by the controller of the system disk:
"HSZ70 (C) DEC".
Shut the system down to single-user mode. This will prevent background
processes from "touching" the disks while they are being changed:
# shutdown now
Determine which HSZ controller is the "master" controller. If the root
file system is on this HSZ pair, attach a terminal/CLI connection and
execute the show command. The output shows the controller to which the
root unit is "ONLINE":
HSZ> show unit
LUN Uses
--------------------------------------------------------------
D300 DISK300
D301 DISK320
D600 DISK600
D601 DISK630
HSZ> show d600
LUN Uses
--------------------------------------------------------------
D600 DISK600
...
State:
ONLINE to this controller
Not reserved
PREFERRED_PATH = THIS_CONTROLLER
...
HSZ> show d300
LUN Uses
--------------------------------------------------------------
D300 DISK300
...
State:
ONLINE to the other controller
PREFERRED_PATH = OTHER_CONTROLLER
...
D600 should be on the master. Therefore, stop the controller that is
serving the other units (unit D300 in our example) via the "set
nofailover" command.
In our example, from "HSZ>" issue "set nofailover". Then insert a card
having the new version of firmware into the slot of the controller that
was shut down, and reboot the controller. (When it restarts, you can
disregard the "Controllers misconfigured." error message.) For more
information about updating your HSZ70 firmware, refer to the HSZ70
Configuration Manual and the Release Notes.
Now switch the CLI line to the other HSZ controller and issue the "set
failover copy=other" command. After the controller reboots, switch back
to the original controller, issue the "shutdown" command, upgrade its
firmware, and allow the original HSZ controller to reboot.
Finally, from the host, issue the "hwmgr -scan scsi" command. This will
update system information to find units with the "correct" PID field.
Summary:
Host commands         Master HSZ commands     "Other" HSZ commands
--------------------------------------------------------------------------
hwmgr -view dev
[find mis-matched HSZ units]
shutdown now
                      [determine which HSZ is the master]
                      set nofailover
                                              [upgrade to new F/W]
                                              set failover copy="other"
                      shutdown
                      [upgrade to new F/W]
hwmgr -scan scsi [ -bus 1 ]
A word about "hwmgr -scan scsi"...
Depending on the size of your configuration, the scan may take several
minutes to complete. The presence of tape devices will further
increase the delay to complete the scan. For this reason, you may wish to
use the -bus qualifier to specify the bus you want to scan. The correct bus
number can be determined by examining the "location" field of the hwmgr
-view devices output:
# hwmgr -view devices
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
...
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
# hwmgr -scan scsi -bus 1
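The bus number is simply the first number in the Location value. As a small illustration (the helper name scan_command is made up, and the command it builds is the one shown above):

```python
import re

def scan_command(location):
    """Build a bus-limited scan command from an hwmgr Location value."""
    match = re.fullmatch(r"bus-(\d+)-targ-(\d+)-lun-(\d+)", location.strip())
    if match is None:
        raise ValueError("not a SCSI location: %r" % location)
    return "hwmgr -scan scsi -bus " + match.group(1)

# Location of dsk4c in the listing above.
print(scan_command("bus-1-targ-6-lun-0"))
```

The sketch only builds the command string; it does not run hwmgr.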
In order to understand which devices need to be redirected to the
newly-created devices, examine the current list of devices prior to
rebooting the system:
# hwmgr -view devices
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
    4: /dev/kevm
   27: /dev/disk/floppy0c          3.5in floppy   fdi0-unit-0
   34: /dev/disk/dsk0c     DEC     RZ28 (C) DEC   bus-0-targ-1-lun-0
   35: /dev/disk/dsk1c     DEC     RZ26L (C) DEC  bus-0-targ-2-lun-0
   36: /dev/disk/cdrom0c   DEC     RRD44 (C) DEC  bus-0-targ-5-lun-0
   37: /dev/disk/dsk2c     DEC     HSZ70          bus-1-targ-3-lun-0
   38: /dev/disk/dsk3c     DEC     HSZ70          bus-1-targ-3-lun-1
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
   40: /dev/disk/dsk5c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-1
   41: /dev/cport/scp0             HSZ70          bus-1-targ-3-lun-0
   42: /dev/cport/scp1             HSZ70 (C) DEC  bus-1-targ-6-lun-0
   44: /dev/disk/dsk6c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-0
   45: /dev/disk/dsk7c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-1
Note that you can ignore the control port "scp0" device. We will need to
redirect the following device names (with invalid PIDs)
   37: /dev/disk/dsk2c     DEC     HSZ70          bus-1-targ-3-lun-0
   38: /dev/disk/dsk3c     DEC     HSZ70          bus-1-targ-3-lun-1
...to the following new device names (containing valid PIDs)
   44: /dev/disk/dsk6c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-0
   45: /dev/disk/dsk7c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-1
Remember the following:
HWID 37 will be redirected to 44
HWID 38 will be redirected to 45
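The pairing above follows from the Location column: the old and new entries for a unit share the same bus, target, and LUN. A rough sketch of that reasoning on saved "hwmgr -view devices" text follows. The function name redirect_pairs and the valid_pid parameter are hypothetical, and the sketch assumes that only /dev/disk entries are of interest.

```python
import re

# Only /dev/disk entries are considered; control ports (scp) are ignored.
ROW = re.compile(
    r"^\s*(\d+):\s+(/dev/disk/\S+)\s+(.*\S)\s+(bus-\d+-targ-\d+-lun-\d+)\s*$")

def redirect_pairs(view_output, valid_pid):
    """Return (old HWID, new HWID) pairs for units listed twice at one location."""
    by_location = {}
    for line in view_output.splitlines():
        row = ROW.match(line)
        if row is None:
            continue
        hwid, location = int(row.group(1)), row.group(4)
        model_text = " ".join(row.group(3).split())  # drop column padding
        by_location.setdefault(location, []).append((hwid, model_text))
    pairs = []
    for entries in by_location.values():
        stale = [h for h, text in entries if not text.endswith(valid_pid)]
        fresh = [h for h, text in entries if text.endswith(valid_pid)]
        if len(stale) == 1 and len(fresh) == 1:
            pairs.append((stale[0], fresh[0]))
    return sorted(pairs)

# Rows copied from the listing above.
listing = """\
34: /dev/disk/dsk0c DEC RZ28 (C) DEC bus-0-targ-1-lun-0
37: /dev/disk/dsk2c DEC HSZ70 bus-1-targ-3-lun-0
38: /dev/disk/dsk3c DEC HSZ70 bus-1-targ-3-lun-1
39: /dev/disk/dsk4c DEC HSZ70 (C) DEC bus-1-targ-6-lun-0
41: /dev/cport/scp0 HSZ70 bus-1-targ-3-lun-0
44: /dev/disk/dsk6c DEC HSZ70 (C) DEC bus-1-targ-3-lun-0
45: /dev/disk/dsk7c DEC HSZ70 (C) DEC bus-1-targ-3-lun-1
"""

for old, new in redirect_pairs(listing, "HSZ70 (C) DEC"):
    print("HWID %d will be redirected to %d" % (old, new))
```

Locations that appear only once, such as the system disk dsk4c, produce no pair and are left alone.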
At this point, you should reboot the system (ONLY TO SINGLE-USER MODE).
Then mount the root file system to enable writing to the disk:
# shutdown -h now
.
.
.
P00>>> boot -flag s dkb600
.
.
.
Starting secondary cpu 1
INIT: SINGLE-USER MODE
# mountroot
.
.
.
Examine the output of "hwmgr -view devices" and "hwmgr -show scsi".
(The scsi DID output will be necessary to execute the hwmgr -redirect
commands.)
# hwmgr -view dev -cat disk
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
   27: /dev/disk/floppy0c          3.5in floppy   fdi0-unit-0
   34: /dev/disk/dsk0c     DEC     RZ28 (C) DEC   bus-0-targ-1-lun-0
   35: /dev/disk/dsk1c     DEC     RZ26L (C) DEC  bus-0-targ-2-lun-0
   36: /dev/disk/cdrom0c   DEC     RRD44 (C) DEC  bus-0-targ-5-lun-0
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
   40: /dev/disk/dsk5c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-1
   44: /dev/disk/dsk6c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-0
   45: /dev/disk/dsk7c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-1
# hwmgr -show scsi
        SCSI                DEVICE  DEVICE  DRIVER NUM  DEVICE FIRST
 HWID:  DEVICEID HOSTNAME   TYPE    SUBTYPE OWNER  PATH FILE   VALID PATH
-------------------------------------------------------------------------
   34:  0        ajkitt     disk    none    0      1    dsk0   [0/1/0]
   35:  1        ajkitt     disk    none    0      1    dsk1   [0/2/0]
   36:  2        ajkitt     cdrom   none    0      1    cdrom0 [0/5/0]
   37:  3        ajkitt     disk    none    0      1    (null)
   38:  4        ajkitt     disk    none    0      1    (null)
   39:  5        ajkitt     disk    none    2      1    dsk4   [1/6/0]
   40:  6        ajkitt     disk    none    0      1    dsk5   [1/6/1]
   44:  7        ajkitt     disk    none    0      1    dsk6   [1/3/0]
   45:  8        ajkitt     disk    none    0      1    dsk7   [1/3/1]
Note the following correspondence, and recall our intentions:
  HWID  =  SCSI DID
  ----     --------
   37   =     3
   38   =     4
   44   =     7
   45   =     8
            REDIRECT
    HWID            SCSI DID
  --------          --------
  37 to 44           3 to 7
  38 to 45           4 to 8
The redirection is accomplished by the following hwmgr commands:
# hwmgr -redirect scsi -src 3 -dest 7
hwmgr: Redirect operation was successful
# hwmgr -redirect scsi -src 4 -dest 8
hwmgr: Redirect operation was successful
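Because the redirect commands take SCSI DIDs rather than HWIDs, it is easy to transpose a number when many units are involved. The following sketch derives the command lines from saved "hwmgr -show scsi" text and a list of HWID pairs. The helper names are hypothetical, and the commands are only printed so that they can be reviewed before being typed.

```python
import re

def scsi_dids(show_scsi_output):
    """Map HWID -> SCSI device ID using the first two columns of each row."""
    dids = {}
    for line in show_scsi_output.splitlines():
        row = re.match(r"^\s*(\d+):\s+(\d+)\s", line)
        if row:  # header and separator lines do not match
            dids[int(row.group(1))] = int(row.group(2))
    return dids

def redirect_commands(dids, hwid_pairs):
    """Build one redirect command per (old HWID, new HWID) pair."""
    return ["hwmgr -redirect scsi -src %d -dest %d" % (dids[old], dids[new])
            for old, new in hwid_pairs]

# Rows copied from the "hwmgr -show scsi" output above.
show_scsi = """\
37: 3 ajkitt disk none 0 1 (null)
38: 4 ajkitt disk none 0 1 (null)
44: 7 ajkitt disk none 0 1 dsk6 [1/3/0]
45: 8 ajkitt disk none 0 1 dsk7 [1/3/1]
"""

for command in redirect_commands(scsi_dids(show_scsi), [(37, 44), (38, 45)]):
    print(command)
```

A HWID that is missing from the captured output raises a KeyError, which is preferable to silently building a command with the wrong DID.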
Final result, and proof that all devices are reachable:
# hwmgr -view devices
 HWID: Device Name         Mfg     Model          Location
------------------------------------------------------------------------------
    4: /dev/kevm
   27: /dev/disk/floppy0c          3.5in floppy   fdi0-unit-0
   34: /dev/disk/dsk0c     DEC     RZ28 (C) DEC   bus-0-targ-1-lun-0
   35: /dev/disk/dsk1c     DEC     RZ26L (C) DEC  bus-0-targ-2-lun-0
   36: /dev/disk/cdrom0c   DEC     RRD44 (C) DEC  bus-0-targ-5-lun-0
   37: /dev/disk/dsk2c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-0
   38: /dev/disk/dsk3c     DEC     HSZ70 (C) DEC  bus-1-targ-3-lun-1
   39: /dev/disk/dsk4c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-0
   40: /dev/disk/dsk5c     DEC     HSZ70 (C) DEC  bus-1-targ-6-lun-1
   42: /dev/cport/scp1             HSZ70 (C) DEC  bus-1-targ-3-lun-0
# hwmgr -show scsi
        SCSI                DEVICE  DEVICE  DRIVER NUM  DEVICE FIRST
 HWID:  DEVICEID HOSTNAME   TYPE    SUBTYPE OWNER  PATH FILE   VALID PATH
-------------------------------------------------------------------------
   34:  0        ajkitt     disk    none    0      1    dsk0   [0/1/0]
   35:  1        ajkitt     disk    none    0      1    dsk1   [0/2/0]
   36:  2        ajkitt     cdrom   none    0      1    cdrom0 [0/5/0]
   37:  3        ajkitt     disk    none    0      1    dsk2   [1/3/0]
   38:  4        ajkitt     disk    none    0      1    dsk3   [1/3/1]
   39:  5        ajkitt     disk    none    2      1    dsk4   [1/6/0]
   40:  6        ajkitt     disk    none    0      1    dsk5   [1/6/1]
# mount /usr
# file /dev/rdisk/dsk*c
/dev/rdisk/dsk0c: character special (19/22) SCSI #0 "RZ28" disk #0
(SCSI ID #1) (SCSI LUN #0)
/dev/rdisk/dsk1c: character special (19/38) SCSI #0 "RZ26L" disk #1
(SCSI ID #2) (SCSI LUN #0)
/dev/rdisk/dsk2c: character special (19/70) SCSI #1 "HSZ70" disk #3
(SCSI ID #3) (SCSI LUN #0)
/dev/rdisk/dsk3c: character special (19/86) SCSI #1 "HSZ70" disk #4
(SCSI ID #3) (SCSI LUN #1)
/dev/rdisk/dsk4c: character special (19/102) SCSI #1 "HSZ70" disk #5
(SCSI ID #6) (SCSI LUN #0)
/dev/rdisk/dsk5c: character special (19/118) SCSI #1 "HSZ70" disk #6
(SCSI ID #6) (SCSI LUN #1)
The system is now ready for multi-user mode:
# ^D
INIT: New run level: 3
.
.
.
---------------------------------------------------------------
FIRMWARE NOTE
-------------
HSZ70 controller firmware HSOF V7.7 will include checks at boot
time that will prevent the booting controller from entering a
dual configuration (either transparent failover or multibus
failover) if the product id fields of both the controllers are
not the same. The controller will issue the following warning on
the CLI prompt:
Controllers misconfigured. - Type SHOW THIS_CONTROLLER
The output of the "SHOW THIS_CONTROLLER" command will contain:
Controller:
HSZ70 ZG81110847 Firmware V77Z-0, Hardware H01
Configured for dual-redundancy with ZG71600468
Controllers misconfigured -- product id mismatch, a
SET FAILOVER COPY= is required to re-synchronize
controllers
When this command is followed by "SET NOFAILOVER" (or "SET
NOMULTIBUS_FAILOVER" in the case of a multibus failover) and
"SET FAILOVER COPY=THIS/OTHER" (or "SET MULTIBUS_FAILOVER"), the
pair will synchronize the product id fields based on the source
of the "COPY=THIS" or "COPY=OTHER" command. Note that the SET
FAILOVER (or SET MULTIBUS_FAILOVER) command has to be issued from
a controller that is running HSZ70 controller firmware HSOF V7.7.
HSZ70 controller firmware HSOF V7.7 will be available in March,
2000. Customers who signed the standard MDDS contract (Media and
Documentation Distribution Service) will automatically receive
the HSOF V7.7 firmware via "automatic update". Non-contract
customers may order the following kit, which is Tru64 UNIX
specific and contains HSOF V7.7:
QB-5SBAB-MA.7.7
*****************************< NOTE>********************************
* *
* INFORMATION IN THIS DOCUMENT REPRESENTS OPERATIONAL EXPERIENCES *
* AND SUGGESTIONS BY COMPAQ OR PARTNER EMPLOYEES. COMPAQ SHALL *
* NOT BE RESPONSIBLE FOR ANY ERRORS OR OMISSIONS CONTAINED IN  *
* THIS DOCUMENT, AND RESERVES THE RIGHT TO MAKE CHANGES TO IT *
* WITHOUT NOTICE. *
* *
********************************************************************
Files on this server are as follows:
  hsz_nam_issue.README
  hsz_nam_issue.CHKSUM
  hsz_nam_issue.tar