This chapter provides information that helps you troubleshoot the LSM
software.
8.1 Recovering from a Disk Failure
LSM's hot-sparing feature automatically detects disk failures and notifies
you of the failures by electronic mail.
If hot-sparing is disabled or
you miss the electronic mail, you can detect disk failures by using the
volprint
command to display the status of the disks.
You may also see driver error messages on the console or in the system messages
file.
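For example, assuming the affected volumes are in the default disk group, the
following commands (whose output formats are shown later in this chapter) give
a quick overview of disk and plex health:
# voldisk list
# volprint -ht
Disks listed with a status of error or failed, and plexes in the NODEVICE or
IOFAIL state, usually indicate a disk failure.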
See
Section 7.1
for more information on the LSM
hot-sparing feature.
If one plex of a volume encounters a disk I/O failure (for example, because the disk has an uncorrectable format error), LSM disables the disk and does not retry the same I/O to that disk. However, before disabling the disk, LSM attempts to correct errors on the plex.
For read errors, LSM logs the error, and then attempts to read from other plexes. If the data is read successfully, LSM tries to correct the read error by writing the data back to the original plex.
If the write is successful, the data is returned, and a message similar to the following is displayed:
Dec 4 17:27:32 xebec vmunix: io/vol.c(volerror): Correctable
read error on vol-dsk10g...
To display more information on a volume called
vol-dsk10g
,
enter:
#
volstat -f cf vol-dsk10g
Output similar to the following is displayed:
                          CORRECTED           FAILED
TYP NAME                READS  WRITES      READS  WRITES
vol vol-dsk10g              1       0          0       0
If the write fails, the bad plex is detached. I/O stops on that plex but continues on the remaining plexes of the volume, and a message similar to the following is written to the LSM kernel change log to record that the plex is detached and no longer used:
Dec 4 18:42:31 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec 4 18:42:31 xebec vmunix: voliod_error: plex detach - volume vol-dsk10g, plex vol-dsk10g-02...
To display more information, enter:
#
volprint -ht vol-dsk10g
Output similar to the following is displayed:
v  vol-dsk10g      gen            ENABLED   ACTIVE    819200  PREFER
pl vol-dsk10g-01   vol-dsk10g     ENABLED   ACTIVE    819200  CONCAT
sd dsk10g-01       vol-dsk10g-01  0         0         819200  dsk10g
pl vol-dsk10g-02   vol-dsk10g     DISABLED  NODEVICE  819200  CONCAT
sd sd-dsk8g-1      vol-dsk10g-02  1         1         819199  dsk8
A disk failure on a mirrored volume results in a console error message similar to the following:
Dec 5 10:44:37 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec 5 10:44:37 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec 5 10:44:37 xebec vmunix: voliod_error: plex detach - volume vol-dsk10g, plex vol-dsk10g-02...
To display more information, enter:
#
volstat -f cf vol-dsk10g
Output similar to the following is displayed:
                          CORRECTED           FAILED
TYP NAME                READS  WRITES      READS  WRITES
vol vol-dsk10g              1       0          1       0
Or, enter:
#
volprint -ht vol-dsk10g
Output similar to the following is displayed:
v  vol-dsk10g      gen            ENABLED   ACTIVE    819200  PREFER
pl vol-dsk10g-01   vol-dsk10g     ENABLED   ACTIVE    819200  CONCAT
sd dsk10-01        vol-dsk10g-03  0         0         819200  dsk10
pl vol-dsk10g-02   vol-dsk10g     DETACHED  IOFAIL    819200  CONCAT
sd sd-dsk8g-1      vol-dsk10g-02  1         1         819199  dsk8
For write errors, if the disk is still mirrored, the bad plex is detached and a message is written to the LSM kernel change log to record that the plex is detached and is no longer used.
If the write succeeded on at least one plex in the volume, the write is considered successful.
If the write failed to all plexes, LSM returns a failure error and detaches
the disk from its disk group.
8.1.1 Replacing a Disk that is Beginning to Fail
Often a disk has recoverable (soft) errors before it fails completely. If a disk is getting an unusual number of soft errors, use the following procedure to replace it.
Detach the disk from its disk group by running
voldiskadm
and choosing
Remove a disk for replacement
from the main menu.
If there are initialized disks available as replacements, you can specify
the disk as part of this operation.
Otherwise, you must specify the replacement
disk later by choosing
Replace a failed or removed disk
from the main menu.
When you select a disk to remove for replacement, all volumes that will be affected by the operation are displayed. For example, the following output might be displayed:
The following volumes will lose mirrors as a result of this operation:

        home src

No data on these volumes will be lost.

The following volumes are in use, and will be disabled as a result
of this operation:

        mkting

Any applications using these volumes will fail future accesses.
These volumes will require restoration from backup.

Are you sure you want do this? [y,n,q,?] (default: n)
If any volumes would be disabled, quit from
voldiskadm
and save the volume.
Either back up the volume or move the volume
off the disk.
For example, to move the volume
mkting
to a disk
other than
disk02
, enter the following command:
#
volassist move mkting
disk02
After the volume is backed up or moved, enter the
voldiskadm
command again and continue to remove the disk for replacement.
After the disk is removed, specify a replacement disk by choosing
Replace a failed or removed disk
from the main menu of the
voldiskadm
menu interface.
8.1.2 Replacing a Failed Disk
If a disk that was in use by LSM fails completely and is detached, you
can replace the disk with a new disk.
To replace a disk, enter the
voldiskadm
command and choose
Replace a failed or removed
disk
from the main menu.
If you have any disks that are initialized for LSM but have never been added to a disk group, you can select one of those disks as a replacement. Do not choose the old disk drive as a replacement even though it may appear in the selection list. If there are no suitable initialized disks, you can choose to initialize a new disk.
If a disk failure caused a volume to be disabled, the volume must be
restored from backup after the disk is replaced.
To identify volumes that
wholly reside on disks that were disabled by a disk failure, use the
volinfo
command.
Any volumes that are listed as
Unstartable
must be restored from backup.
To display disk status, enter:
#
volinfo
Output similar to the following is displayed:
home      fsgen    Started
mkting    fsgen    Unstartable
src       fsgen    Started
To restart the
Unstartable
volume called
mkting
, enter:
#
volume -o bg -f start
mkting
The
-o bg
option recovers plexes as a background
task.
8.1.3 Reattaching Disks
A disk reattach operation may be necessary if a disk has experienced
a full failure and hot-sparing is not possible, or if LSM is started with
some disk drivers unloaded and unloadable (causing disks to enter the failed
state).
If the problem is fixed, it may be possible to use the
volreattach
command to reattach the disks without plexes being flagged as stale,
as long as the reattach happens before any volumes on the disk are started.
The
volreattach
command is called as part of disk
recovery from the
voldiskadm
menus.
If possible,
volreattach
will reattach the failed disk media record to the disk
with the same device name in the disk group in which it was located before
and will retain its original disk media name.
After a reattach takes place,
recovery may or may not be necessary.
The reattach may fail if the original
(or another) cause for the disk failure still exists.
To check whether a reattach is possible, enter:
#
volreattach
-c
This displays the disk group and disk media name where the disk can be reattached, without performing the operation.
For more information, see the
volreattach
(8)
reference page.
8.2 Recovering from a Boot Disk Failure
When the boot disk is mirrored, failures occurring on the original boot disk are transparent to all users. However, during a failure, the system might:
Write a message to the console indicating there was an error reading or writing to the plex on the boot disk.
Experience slow performance (depending on the problem encountered
with the disk containing one of the plexes in the
root
or
swap
volumes).
To reboot the system before the original boot disk is repaired, you
can boot from any disk that contains a valid
root
volume.
If all copies of
rootvol
are corrupted, and you cannot
boot the system, you must reinstall the system.
Replacing a boot disk is a more complex process than replacing other disks because boot-critical data must be placed in specific areas on specific disks in order for the boot process to find it. How you replace a failed boot disk depends on:
Whether you have mirrored the root disk and enabled hot-sparing support.
Whether the errors are correctable and the same disk can be re-used. This is known as re-adding a disk. If you reuse the boot disk, you should monitor it and replace it during your next maintenance cycle.
Whether the disk has completely failed and must be replaced.
The sections that follow give instructions for re-adding or replacing
the boot disk, as well as other information related to boot disk recovery.
8.2.1 Hot-Sparing and Boot Disk Failures
If the boot disk fails on a system that has the boot (root) disk mirrored and the hot-sparing feature enabled, LSM automatically attempts to replace the failed root disk mirror with a new mirror. To do this, a surviving mirror of the root disk is used to create a new mirror on either a spare disk or a disk with sufficient free space. This ensures that there are always at least two mirrors of the root disk that can be used for booting.
For hot-sparing to succeed, the
rootdg
disk group
must have enough spare or free space to accommodate the volumes from the failed
root disk.
Also, the
rootvol
and
swapvol
volumes require contiguous disk space.
If there is not enough contiguous space
on a single new disk, each of these volumes can be relocated to a different
new disk.
See
Chapter 4
for more information on mirroring
the boot disk.
8.2.2 Re-adding and Replacing Boot Disks
Normally, replacing a failed disk is as simple as putting a new disk somewhere on the controller and running LSM replace disk commands. It's even possible to move the data areas from that disk to available space on other disks, or to use a hot-spare disk already on the controller to replace the failure. For data that is not critical for booting the system, it doesn't matter where the data is located. All data that is not boot critical is only accessed by LSM after the system is fully operational. LSM can find this data for you. On the other hand, boot-critical data must be placed in specific areas on specific disks in order for the boot process to find it.
When a disk fails, there are two possible ways to
correct the problem.
If the errors are transient or correctable, then the same
disk can be re-used.
This is known as
re-adding
a disk.
On the other hand, if the disk has truly failed, then it should be completely
replaced.
8.2.2.1 Re-adding a Failed Boot Disk
Re-adding a disk is the same procedure as replacing a disk, except that the same physical disk is used. Usually, a disk that needs to be re-added has been detached, meaning that the LSM software has noticed that the disk has failed and has ceased to access it.
If the boot disk has a transient failure, its plexes can be recovered
using the following steps.
The
rootvol
and
swapvol
volumes can have two or three LSM disks per physical disk, depending
on the layout of the original root disk.
To list the LSM disks that are associated with the failed physical disk, enter:
#
voldisk list
Output similar to the following is displayed:
DEVICE    TYPE      DISK      GROUP     STATUS
dsk10     sliced    -         -         error
dsk10b    nopriv    -         -         error
dsk10f    nopriv    -         -         error
dsk21     sliced    dsk21     rootdg    online
dsk21b    nopriv    dsk21b    rootdg    online
-         -         dsk10     rootdg    removed was:dsk10
-         -         dsk10b    rootdg    removed was:dsk10b
-         -         dsk10f    rootdg    removed was:dsk10f
In
this example, if
dsk10
was the failed boot disk, then you
can assume that
dsk10
,
dsk10b
, and
dsk10f
are the LSM disks associated with the physical disk
dsk10
.
Enter the following commands to add the LSM disks back to
the
rootdg
disk group:
#
voldisk online dsk10 dsk10b dsk10f
#
voldg -k adddisk dsk10=dsk10
#
voldg -k adddisk dsk10b=dsk10b
#
voldg -k adddisk dsk10f=dsk10f
Resynchronize the plexes in the
rootvol
and
swapvol
volumes:
#
volrecover -sb rootvol
swapvol
8.2.2.2 Replacing a Failed Boot Disk
Follow these steps to replace a failed boot disk under LSM control with a new disk:
Disassociate the plexes on the failed disk from
rootvol
and
swapvol
.
Also, if
/usr
or
/var
were encapsulated on the boot disk, disassociate
their plexes on the failed disk:
#
volplex
-o
rm dis rootvol-02 swapvol-02 vol-dsk1g
Remove all LSM disks configured on the boot disk:
#
voldg rmdisk dsk1a dsk1b dsk1g dsk1f
#
voldisk rm dsk1a dsk1b dsk1g dsk1f
Mirror the LSM volumes on the boot disk onto the new disk, as described in Chapter 4. The replacement disk must have at least as much storage capacity as was in use on the old disk.
8.2.3 Stale or Unusable Plexes on the Boot Disk
If a disk is unavailable when the system is running, any plexes of volumes that reside on that disk will become stale, meaning the data on that disk is out of date relative to the other plexes of the volume.
During the boot process, the system accesses only one copy of the root volume (the copy on the boot disk) until a complete configuration for this volume can be obtained. If the plex of the root volume that was used for booting is stale, you must reboot the system from another boot disk that contains nonstale plexes. This problem can occur if the boot disk was replaced and restarted without adding the disk back into the LSM configuration. The system will boot normally, but the plexes that reside on the newly booted disk will be stale.
Another possible problem can occur if errors in the LSM headers on the boot disk prevent LSM from properly identifying the disk. In this case, LSM cannot determine the name of that disk. This is a problem because plexes are associated with disk names, and therefore any plexes on that disk are unusable.
If either of these situations occurs, the LSM daemon
vold
will notice it when it is initializing the system as part of the
init
processing of the boot sequence.
It will output a message describing
the error, describe what can be done about it, and halt the system.
For example,
if the plex
rootvol-01
of the root volume
rootvol
on disk
disk01
of the system was stale,
vold
output is similar to the following:
lsm:vold: Warning Plex rootvol-01 for root volume is stale or unusable.
lsm:vold: Error: System boot disk does not have a valid root plex
Please boot from one of the following disks:

        Disk: disk02    Device: dsk2

lsm:vold: Error: System startup failed
This informs you that the disk
disk02
contains usable
copies of the root and swap plexes and should be used for booting, and that
dsk2
is the name of the system backup disk.
When this message appears,
you should reboot the system and boot from the device that corresponds to
dsk2
.
Once the system is booted, you must determine the problem. If the plexes on the boot disk were stale, they are caught up automatically as the system starts. If there is a problem with the private area on the disk, you must re-add or replace the disk.
If the plexes on the boot disk were unavailable, you will receive mail
from the
volwatch
command describing the problem.
To list the status of the disks, enter:
#
voldisk list
Output similar to the following is displayed:
DEVICE    TYPE      DISK      GROUP     STATUS
-         -         disk02    rootdg    failed was: dsk1
dsk2      sliced    disk02    rootdg    online
8.2.4 Failure to Obtain Crash Dumps
During a system crash or panic, a crash dump is temporarily saved to swap space. If the swap device is configured to use one or more LSM volumes, all of the LSM swap volumes' underlying disk partitions are used separately to maximize crash dump space, even when a swap volume is mirrored on different disk partitions. This does not cause any problems, provided that the LSM mirrored swap volumes are configured not to resynchronize the mirrors upon reboot, because doing so could destroy the crash dump before it is saved to the file system.
If a mirrored swap volume is performing resynchronization upon reboot,
you need to verify its configuration.
If the secondary volumes (for example,
LSM volumes other than
swapvol
) are performing resynchronization,
this is probably because the volume was not configured with the
start_opts=norecov
option.
To check the
start_opts
option for a swap volume
called
v1
, enter:
#
volprint -m v1 | grep
start_opts
Output similar to the following is displayed:
start_opts=""
To change the
start_opts
option for a swap volume
called
v1
, enter:
#
volume set start_opts=norecov
v1
To display the change, enter:
#
volprint -m v1 | grep
start_opts
Output similar to the following is displayed:
start_opts="norecov"
If the LSM volume
swapvol
is performing resynchronization,
this is typically because this volume does not have its device minor number
set to 1.
See
Chapter 4
for information on how to set up
root and swap volumes.
To check the
swapvol
volume's minor number, enter:
#
ls -l /dev/*vol/swapvol
Output similar to the following is displayed:
crw------- 1 root system 40, 1 Mar 16 16:00 /dev/rvol/swapvol
brw------- 1 root system 40, 1 Mar 16 16:00 /dev/vol/swapvol
8.3 Resynchronizing Volumes
When storing data redundantly, using mirrored or RAID5 volumes, LSM takes necessary measures to ensure that all copies of the data match exactly. However, under certain conditions (usually due to complete system failures), small amounts of the redundant data on a volume can become inconsistent or unsynchronized. Aside from normal configuration changes (such as detaching and reattaching a plex), this can only occur when a system crashes while data is being written to a volume. Data is written to the mirrors of a volume in parallel, as is the data and parity in a RAID5 volume. If a system crash occurs before all the individual writes complete, it is possible for some writes to complete while others do not, resulting in the data becoming unsynchronized. For mirrored volumes, it can cause two reads from the same region of the volume to return different results if different mirrors are used to satisfy the read request. In the case of RAID5 volumes, it can lead to parity corruption and incorrect data reconstruction.
When LSM recognizes this situation, it needs to make sure that all mirrors
contain exactly the same data and that the data and parity in RAID5 volumes
match.
This process is called volume resynchronization.
Volumes that are part
of disk groups that are automatically imported at boot time (such as
rootdg
) are resynchronized when the system boots.
Not all volumes require resynchronization after a system failure. Volumes that were never written or that were inactive when the system failure occurred and did not have any outstanding writes do not require resynchronization. LSM notices when a volume is first written and marks it as dirty. When a volume is closed by all processes or stopped cleanly, all writes will have completed and LSM removes the dirty flag for the volume. Only volumes that are marked dirty when the system reboots require resynchronization.
Resynchronization can be computationally expensive and can have a significant impact on system performance. The recovery process attempts to alleviate some of this impact by attempting to "spread out" recoveries to avoid stressing a specific disk or controller. Additionally, for very large volumes or for a very large number of volumes, the resynchronization process can take a long time. These effects can be addressed by using dirty-region logging for mirrored volumes, or by making sure that RAID5 volumes have valid RAID5 logs.
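For example, a minimal sketch of adding logs with the volassist command (the volume names datavol and r5vol are hypothetical; see the volassist(8) reference page for the exact options your version supports):
# volassist addlog datavol
# volassist addlog r5vol
In this sketch, volassist adds a dirty-region log plex to the mirrored volume datavol and a RAID5 log plex to the RAID5 volume r5vol.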
The exact process of resynchronization depends on the type of volume. RAID5 volumes that contain RAID5 logs can simply replay those logs. If no logs are available, the volume is placed in reconstruct-recovery mode and all parity is regenerated.
LSM automatically recovers mirrored and RAID5 volumes when the system is booted and the volumes are first started.
See the
volume
(8)
reference page for more information on resynchronizing volumes.
8.4 Recovering Volumes
The following sections describe recovery procedures for problems relating
to LSM volumes.
8.4.1 Listing Unstartable Volumes
An unstartable volume is likely to be incorrectly configured or to have
other errors or conditions that prevent it from being started.
To display
unstartable volumes, use the
volinfo
command, which displays
information on the accessibility and usability of one or more volumes:
#
volinfo -g
disk_group [volume_name]
8.4.2 Recovering a Disabled Volume
If a system crash or an I/O error corrupts one or more plexes of a volume and no plex is CLEAN or ACTIVE, mark one of the plexes CLEAN and instruct the system to use that plex as the source for reviving the others. To place a plex in a CLEAN state, enter:
#
volmend fix clean
plex_name
For example, to place one plex called
vol01-02
in
the CLEAN state, enter:
#
volmend fix clean vol01-02
See the
volmend
(8)
reference page for more information.
8.5 Recovering RAID5 Volumes
RAID5 volumes are designed to remain available when a disk fails with
a minimum of disk space overhead.
However, many implementations of RAID5 can
become vulnerable to data loss after a system failure, and some types of disk
failures can also affect RAID5 volumes adversely.
The following sections describe
how system and disk failures affect RAID5 volumes, and the types of recovery
needed.
8.5.1 System Failures and RAID5 Volumes
A system failure causes the data and parity in the RAID5 volume to become unsynchronized because the disposition of writes that were outstanding at the time of the failure cannot be determined. If this occurs while a RAID5 volume is being accessed, the volume is described as having stale parity. When this occurs, the parity must be reconstructed by reading all the non-parity columns within each stripe, recalculating the parity, and writing out the parity stripe unit in the stripe. This must be done for every stripe in the volume, so it can take a long time to complete.
Caution
While this resynchronization is going on, any failure of a disk within the array will cause the data in the volume to be lost. This only applies to RAID5 volumes without log plexes. Compaq recommends configuring all RAID5 volumes with a log.
Having the array vulnerable in this way is undesirable. Besides the vulnerability to failure, the resynchronization process can tax the system resources and slow down system operation.
RAID5 logs reduce the possible damage that can be caused by system failures.
Because they maintain a copy of the data being written at the time of the
failure, the process of resynchronization consists of simply reading that
data and parity from the logs and writing it to the appropriate areas of the
RAID5 volume.
This greatly reduces the amount of time needed for a resynchronization
of data and parity.
It also means that the volume never becomes truly stale
because the data and parity for all stripes in the volume is known at all
times, so the failure of a single disk cannot result in the loss of the data
within the volume.
8.5.2 Disk Failures and RAID5 Volumes
A RAID5 disk failure can occur due to an uncorrectable I/O error during a write to the disk (which causes the subdisk to be detached from the array) or due to a disk being unavailable when the system is booted (such as from a cabling problem or having a drive powered down). When this occurs, the subdisk cannot be used to hold data and is considered stale and detached. If the underlying disk becomes available or is replaced, the subdisk is considered stale and is not used.
If an attempt is made to read data contained on a stale subdisk, the data is reconstructed from data from all other stripe units in the stripe; this operation is called a reconstruct-read. This is a significantly more expensive operation than simply reading the data, resulting in degraded read performance; thus, when a RAID5 volume has stale subdisks, it is considered to be in degraded mode.
To determine whether a RAID5 volume is in degraded mode, enter:
#
volprint -ht
Output similar to the following is displayed:
V  NAME         USETYPE   KSTATE    STATE     LENGTH  READPOL    PREFPLEX
PL NAME         VOLUME    KSTATE    STATE     LENGTH  LAYOUT     NCOL/WID  MODE
SD NAME         PLEX      DISK      DISKOFFS  LENGTH  [COL/]OFF  DEVICE    MODE

v  r5vol        RAID5     ENABLED   DEGRADED  20480   RAID       -
pl r5vol-01     r5vol     ENABLED   ACTIVE    20480   RAID       3/16      RW
sd disk00-00    r5vol-01  disk00    0         10240   0/0        dsk4d1
sd disk01-00    r5vol-01  disk01    0         10240   1/0        dsk2d1    dS
sd disk02-00    r5vol-01  disk02    0         10240   2/0        dsk3d1    -
pl r5vol-l1     r5vol     ENABLED   LOG       1024    CONCAT     -         RW
sd disk03-01    r5vol-l1  disk00    10240     1024    0          dsk3d0    -
pl r5vol-l2     r5vol     ENABLED   LOG       1024    CONCAT     -         RW
sd disk04-01    r5vol-l2  disk02    10240     1024    0          dsk1d1    -
The output shows that volume
r5vol
is in degraded
mode, as shown by the STATE, which is listed as DEGRADED.
The failed subdisk
is
disk01-00
, as shown by the flags in the last column.
The
d
indicates that the subdisk is detached and the
S
indicates that the subdisk contents are stale.
It is also possible that a disk containing a RAID5 log could experience a failure. This has no direct effect on the operation of the volume; however, the loss of all RAID5 logs on a volume makes the volume vulnerable to a complete failure.
The following
volprint
output shows a failure within
a RAID5 log plex as indicated by the plex state being BADLOG, where the RAID5
log plex
r5vol-l1
has failed.
V  NAME         USETYPE   KSTATE    STATE     LENGTH  READPOL    PREFPLEX
PL NAME         VOLUME    KSTATE    STATE     LENGTH  LAYOUT     NCOL/WID  MODE
SD NAME         PLEX      DISK      DISKOFFS  LENGTH  [COL/]OFF  DEVICE    MODE

v  r5vol        RAID5     ENABLED   ACTIVE    20480   RAID       -
pl r5vol-01     r5vol     ENABLED   ACTIVE    20480   RAID       3/16      RW
sd disk00-00    r5vol-01  disk00    0         10240   0/0        dsk4d1    ENA
sd disk01-00    r5vol-01  disk01    0         10240   1/0        dsk2d1    dS
sd disk02-00    r5vol-01  disk02    0         10240   2/0        dsk3d1    ENA
pl r5vol-l1     r5vol     DISABLED  BADLOG    1024    CONCAT     -         RW
sd disk03-01    r5vol-l1  disk00    10240     1024    0          dsk3d0    ENA
pl r5vol-l2     r5vol     ENABLED   LOG       1024    CONCAT     -         RW
sd disk04-01    r5vol-l2  disk02    10240     1024    0          dsk1d1    ENA
8.5.3 RAID5 Recovery
The following are the types of recovery typically needed for RAID5 volumes:
Parity resynchronization
Stale subdisk recovery
Log plex recovery
These types of recoveries are discussed in the sections that follow.
Parity resynchronization and stale subdisk recovery are typically performed
when the RAID5 volume is started, shortly after the system boots, or by calling
the
volrecover
command.
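For example, to start recovery of a RAID5 volume manually, you might enter a command similar to the following (r5vol is the hypothetical volume name used throughout this section; the -sb options start the volume and run the recovery in the background):
# volrecover -sb r5vol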
If hot-sparing is enabled at the time of a disk failure, system administrator
intervention is not required (unless there is no suitable disk space available
for relocation).
Hot-sparing will be triggered by the failure and the system
administrator will be notified of the failure by electronic mail.
Hot-sparing
will automatically attempt to relocate the subdisks of a failing RAID5 plex.
After any relocation takes place, the hot-sparing daemon (volspared
) will also initiate a parity resynchronization.
In the case of
a failing RAID5 log plex, relocation will only occur if the log plex is mirrored;
volspared
will then initiate a mirror resynchronization to recreate
the RAID5 log plex.
If hot-sparing is disabled at the time of a failure, the
system administrator may need to initiate a resynchronization or recovery.
8.5.3.1 Parity Resynchronization
In most circumstances, a RAID5 array will not have stale parity.
Stale
parity should only occur after all RAID5 log plexes for the RAID5 volume
have failed, and then only if there is a system failure.
Furthermore, even
if a RAID5 volume has stale parity, it is usually taken care of as part of
the
volume start
process.
However, if a volume without valid RAID5 logs starts and the process is killed before the volume is resynchronized, the result is an active volume with stale parity.
To display volume state, enter:
#
volprint -ht
Output similar to the following is displayed:
V  NAME         USETYPE   KSTATE    STATE     LENGTH  READPOL    PREFPLEX
PL NAME         VOLUME    KSTATE    STATE     LENGTH  LAYOUT     NCOL/WID  MODE
SD NAME         PLEX      DISK      DISKOFFS  LENGTH  [COL/]OFF  DEVICE    MODE

v  r5vol        RAID5     ENABLED   NEEDSYNC  20480   RAID       -
pl r5vol-01     r5vol     ENABLED   ACTIVE    20480   RAID       3/16      RW
sd disk00-00    r5vol-01  disk00    0         10240   0/0        dsk4d1    ENA
sd disk01-00    r5vol-01  disk01    0         10240   1/0        dsk2d1    ENA
sd disk02-00    r5vol-01  disk02    0         10240   2/0        dsk3d1    ENA
This output displays the volume state as NEEDSYNC, indicating that the parity needs to be resynchronized. The state could also have been SYNC, indicating that a synchronization was attempted at start time and that a synchronization process should be doing the synchronization. If no such process exists or if the volume is in the NEEDSYNC state, a synchronization can be manually started using the volume resync command.
To resynchronize a RAID5 volume called
r5vol
,
enter:
#
volume resync r5vol
Parity is regenerated by issuing VOL_R5_RESYNC ioctls to the RAID5 volume.
The resynchronization process starts at the beginning of the RAID5 volume
and resynchronizes a region equal to the number of sectors specified by the
-o iosize
option to the volume command or, if
-o iosize
is not specified, the default maximum I/O size.
The resync
command then moves on to the next region until the entire length of the RAID5
volume is resynchronized.
For larger volumes, parity regeneration can take a significant amount of time and it is possible that the system can shut down or crash before the operation is completed. Unless the progress of parity regeneration is kept across reboots, the process starts over again.
To avoid this situation, parity regeneration is checkpointed, meaning
that the offset up to which the parity is regenerated is saved in the configuration
database.
The
-o checkpt=size
option to the volume command
controls how often the checkpoint is saved; if not specified, it defaults
to the default checkpoint size.
Because saving the checkpoint offset requires
a transaction, making the checkpoint size too small can significantly extend
the time required to regenerate parity.
After a system reboot, a RAID5 volume
that has a checkpoint offset smaller than the volume length will start a parity
resynchronization at the checkpoint offset.
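For example, a hedged sketch of starting a resynchronization with an explicit checkpoint interval (assuming the size is given in sectors and using the hypothetical volume name r5vol; check the volume(8) reference page for the exact units and defaults):
# volume -o checkpt=4096 resync r5vol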
8.5.3.2 Stale Subdisk Recovery
Like parity resynchronization, stale subdisk recovery is usually done at volume start time. However, it is possible that the process doing the recovery may get killed, or that the volume was started with an option to prevent subdisk recovery. It's also possible that the disk on which the subdisk resides was replaced without any recovery operations being performed.
To recover a stale subdisk in a RAID5 volume, enter:
#
volume recover r5vol disk01-00
To recover all stale subdisks in a RAID5 volume at once, specify only the name of the volume:
#
volume recover r5vol
8.5.3.3 Log Plex Recovery
RAID5 log plexes may become detached due to disk failures. To reattach a failed RAID5 log plex, enter:
#
volplex att r5vol r5vol-l1
8.6 Startup Problems
The following sections describe LSM command and startup problems and
suggest corrective actions.
8.6.1 I/O and System Delays Caused by Disk Failure
When a mirrored LSM disk fails, the system may hang for several minutes before resuming activity.
If you observe long delays in LSM recovery from disk failure, this is usually due to the underlying device driver, not LSM. When an initial I/O operation fails, there may be a delay as the device driver waits or retries the I/O. The length of the delay depends on the particular tolerances for that drive (for example, time for drive spin-up, ECC calculation time, retries and recalibration by the drive, other I/O being handled due to command-tag queuing, bus/device initialization time after reset, and so on).
LSM does not perform additional retries or generate additional delays when an I/O fails on a device. Once the underlying device driver returns an I/O failure error to LSM, LSM processes the error immediately (for example, issues another read to the other plex to recover and mask the error).
To reduce such delays, see the driver documentation for instructions
on changing the retry parameter settings.
8.6.2 An LSM Command Fails to Execute
When an LSM command fails to execute, LSM may display the following message:
Volume daemon is not accessible
This message often means that the volume daemon
vold
is not running.
Verify that the
vold
daemon is enabled by entering
the following command:
#
voldctl mode
Output similar to the following is displayed:
mode: enabled
Verify that two or more
voliod
daemons are running
by entering the following command:
#
voliod
Output similar to the following is displayed:
2 volume I/O daemons are running
8.6.3 LSM Volume I/O or Mirroring Fails to Complete
Follow these steps if I/O to an LSM volume or mirroring of an LSM volume does not complete:
Check whether or not the LSM I/O daemon,
voliod
,
is running:
#
voliod
Output similar to the following should display:
2 volume I/O daemons are running
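If fewer daemons are running than expected, you can start more. For example, a hedged sketch that sets the number of volume I/O daemons to 2 (see the voliod(8) reference page for the recommended count on your system):
# voliod set 2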
If the volume is in the ENABLED/SYNC state, the mirror resynchronization may have terminated abnormally. Restart the volume synchronization:
#
volrecover
If resynchronization still does not start, make sure the
volume
command is not running in the background and that the volume's
rwback
offset is not progressing by entering the following commands:
#
ps aux | grep volume
Output similar to the following is displayed:
root 4322 0.0 0.0 1.62M 160K console S + 19:36:15 0:00.00 grep volume
#
volprint -vl my_vol | grep flags
Output similar to the following is displayed:
flags: open rwback (offset=121488) writeback
#
sleep 120 ; volprint -vl my_vol | grep flags
Output similar to the following is displayed:
flags: open rwback (offset=121488) writeback
If the
ps
command output shows no
volume
commands running and the volume's
rwback
offset
remains the same, use the
volume
-o
force
command to restart resynchronization by entering the following command:
# volume
-o
force resync volume_name
8.6.4 Failures While Creating Volumes or Adding Disks
When creating a new volume or adding a disk, the operation may fail with the following message:
No more space in disk group configuration
This message could mean that you are out of space in the disk group's configuration database. Check whether any disks are configured with 2 or more configuration databases. If all disks with active configuration databases use only 1 configuration database within their private region, check whether a disk with a smaller private region can be reconfigured to deactivate the configuration database within that smaller private region.
If all the disks have
nconfig
set to 1 and the same
size private regions, you can reconfigure and/or add disks with larger private
regions.
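To check an individual disk's private region size and its number of configuration copies, you can display the detailed record for that disk; a hedged example (dsk7 is one of the device names used in the examples that follow, and the exact field names in the listing may vary by version):
# voldisk list dsk7
Look for the lines in the detailed listing that report the private region length and the number of configuration copies.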
To determine whether the
rootdg
disk group is using a disk
with more than 1 configuration database, enter:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1091 permlen=1496 free=1490 templen=3 loglen=226
config disk dsk7 copy 1 len=1496 state=clean online
config disk dsk7 copy 2 len=1496 disabled
config disk dsk8 copy 1 len=2993 state=clean online
config disk dsk9 copy 1 len=2993 state=clean online
config disk dsk10 copy 1 len=2993 state=clean online
log disk dsk7 copy 1 len=226
log disk dsk7 copy 2 len=226 disabled
log disk dsk8 copy 1 len=453
log disk dsk9 copy 1 len=453
log disk dsk10 copy 1 len=453
To increase the
rootdg
disk group free space from
1490 to 2987 by changing
dsk7
to have 1 configuration database
copy instead of 2 within its private region, enter:
#
voldisk moddb dsk7 nconfig=1
To display the results, enter:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1091 permlen=2993 free=2987 templen=3 loglen=453
config disk dsk7 copy 1 len=2993 state=clean online
config disk dsk8 copy 1 len=2993 state=clean online
config disk dsk9 copy 1 len=2993 state=clean online
config disk dsk10 copy 1 len=2993 state=clean online
log disk dsk7 copy 1 len=453
log disk dsk8 copy 1 len=453
log disk dsk9 copy 1 len=453
log disk dsk10 copy 1 len=453
You can check the active configuration database sizes on each disk within a disk group to see if you can reconfigure a disk with a smaller private region to deactivate the configuration database within its smaller private region.
Follow these steps to disable a configuration database:
Display the current configuration:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1081 permlen=347 free=341 templen=3 loglen=52
config disk dsk7 copy 1 len=347 state=clean online
config disk dsk8 copy 1 len=2993 state=clean online
config disk dsk9 copy 1 len=2993 state=clean online
config disk dsk10 copy 1 len=2993 state=clean online
log disk dsk7 copy 1 len=52
log disk dsk8 copy 1 len=453
log disk dsk9 copy 1 len=453
log disk dsk10 copy 1 len=453
To disable the configuration databases on
dsk7
,
so the
rootdg
configuration database free size will increase
from 341 to 2987, enter:
#
voldisk moddb dsk7 nconfig=0
Display the new configuration:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1081 permlen=2993 free=2987 templen=3 loglen=453
config disk dsk8 copy 1 len=2993 state=clean online
config disk dsk9 copy 1 len=2993 state=clean online
config disk dsk10 copy 1 len=2993 state=clean online
log disk dsk8 copy 1 len=453
log disk dsk9 copy 1 len=453
log disk dsk10 copy 1 len=453
If all disks have 1 configuration copy and you cannot disable disks with smaller private regions, then you can add and use disks with a larger private region. Follow these steps to specify a private region larger than the default of 4096:
Enter the
voldisksetup
command with the
privlen
option to specify a new private region size.
Use the
voldisk moddb
command as described
earlier in this section to deactivate the smaller disks.
Note
For a disk group with 4 or more disks, you should enable and configure at least 4 of the disks to be large enough to contain the disk group's configuration database.
Follow these steps to add 4 disks with larger configuration databases and disable the configuration database on the smaller disks, so only the new disks with the larger private region are used:
Display the current disk group configuration by entering the following command:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1091 permlen=2993 free=2987 templen=3 loglen=453
config disk dsk7 copy 1 len=2993 state=clean online
config disk dsk8 copy 1 len=2993 state=clean online
config disk dsk9 copy 1 len=2993 state=clean online
config disk dsk10 copy 1 len=2993 state=clean online
log disk dsk7 copy 1 len=453
log disk dsk8 copy 1 len=453
log disk dsk9 copy 1 len=453
log disk dsk10 copy 1 len=453
Increase the private region size by entering the following commands:
#
voldisksetup -i dsk3 privlen=8192
#
voldisksetup -i dsk4 privlen=8192
#
voldisksetup -i dsk12 privlen=8192
#
voldisksetup -i dsk13 privlen=8192
Add the disks to the disk group by entering the following command:
#
voldg adddisk dsk3 dsk4
dsk12 dsk13
Deactivate the smaller disks by entering the following commands:
#
voldisk moddb dsk7 nconfig=0
#
voldisk moddb dsk8 nconfig=0
#
voldisk moddb dsk9 nconfig=0
#
voldisk moddb dsk10 nconfig=0
Display the new configuration by entering the following command:
#
voldg list rootdg
Output similar to the following is displayed:
Group:     rootdg
dgid:      921610896.1026.rio.dec.com
import-id: 0.1
flags:
copies:    nconfig=default nlog=default
config:    seqno=0.1116 permlen=6017 free=6007 templen=3 loglen=911
config disk dsk3 copy 1 len=6017 state=clean online
config disk dsk4 copy 1 len=6017 state=clean online
config disk dsk12 copy 1 len=6017 state=clean online
config disk dsk13 copy 1 len=6017 state=clean online
log disk dsk3 copy 1 len=911
log disk dsk4 copy 1 len=911
log disk dsk12 copy 1 len=911
log disk dsk13 copy 1 len=911
8.6.5 Mounting a File System or Opening an LSM Volume Fails
If a file system cannot be mounted or an open function on an LSM volume
fails, check if
errno
is set to EBADF.
This could mean
that the LSM volume is not started.
To determine whether or not the volume is started, enter:
#
volinfo -g rootdg
Output similar to the following is displayed:
vol1         fsgen    Startable
vol-dsk3h    fsgen    Started
vol2         fsgen    Started
swapvol1     gen      Started
rootvol      root     Started
swapvol      swap     Started
To start volume
vol1
, enter:
#
volume -g rootdg start
vol1
8.7 Restoring an LSM Configuration
You use the
volrestore
command to restore an LSM
configuration that you saved when using the
volsave
command.
If you enter the
volrestore
command with no options,
volrestore
attempts to restore all disk groups.
If you use the
-i
(interactive) option,
volrestore
prompts you
before restoring each disk group.
Before the
volrestore
command restores the LSM configuration,
it validates the checksum that is part of the description set.
By default, the
volrestore
command restores the whole
configuration, using the description set in the directory under
/usr/var/lsm/db
that has the latest timestamp.
You can specify options
to the command to use a different directory and to restore a specific volume
or disk group.
For example, this command restores only the volume called
myvol01
in the
staffdg
disk group:
#
volrestore -g staffdg
-v myvol01
When you restore a specific disk group, the
volrestore
command attempts to reimport the disk group based on configuration information
on disks that belong to that disk group.
If the import fails,
volrestore
recreates the disk group by reinitializing all disks within that
disk group and recreating all volumes, unassociated plexes, and unassociated
subdisks, based on information in the
volmake
description
file,
allvol.DF.
Notes
The
volrestore
command does not restore volumes associated with the root, /usr, and /var
file systems and the primary swap area. These partitions must be reencapsulated to use LSM volumes.
See the Tru64 UNIX Clusters documentation before using
volrestore
in a Tru64 UNIX cluster environment.
When you restore a complete LSM configuration, the
volrestore
command attempts to reenable the
vold
based
on the configuration databases found on the
rootdg
disks.
If the complete LSM configuration does not need to be restored, you can use
the
-i
(interactive) option with
volrestore
.
The
volrestore
command prompts you before restoring each
file, enabling you to skip specific disk groups.
If
vold
cannot be enabled, you are given the option
of recreating the
rootdg
disk group and any other disk
groups using the other files in the saved LSM description set.
The
rootdg
disk group is recreated first, and
vold
is put in the enabled mode.
Then, the other disk groups are enabled.
The disk
groups are recreated by first attempting to import them based on available
disks in that disk group.
If the import fails, the disk group is reinitialized
and all volumes in that disk group are also recreated based on the
volmake
description files.
When volumes are restored using the
volmake
description
file, the plexes are created in the DISABLED EMPTY state.
The
volrestore
command does not attempt to start or enable such volumes.
You must
use
volmend
or
volume
to set the plex
states appropriately before starting the volume.
The
volrestore
command warns you to check the state of each disk associated with a volume
before using
volmend
to set plex states; to carefully determine
which disks in the LSM configuration could have failed since the LSM
configuration was saved; and to use
volmend
to mark plexes
on those disks as STALE.
In addition, any plex that was detached or disabled
at any point during or after the LSM configuration was saved should be marked
"STALE" using
volmend
.
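For example, a minimal sketch for a restored two-plex volume whose second plex resides on a disk that may have failed since the configuration was saved (the volume and plex names vol01, vol01-01, and vol01-02 are hypothetical):
# volmend fix clean vol01-01
# volmend fix stale vol01-02
# volume start vol01
When the volume starts, the plex marked clean is used as the source for recovering the stale plex.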
To restore a disk group called
dg1
, enter the following
command, and the system will display output similar to this example:
#
volrestore -g dg1
Using LSM configuration from /usr/var/lsm/db/LSM.19991226203620.skylark
Created at Tue Dec 26 20:36:30 EST 1999 on HOST skylark

Would you like to continue ? [y,n,q,?] (default: n) y

  Working .

  Restoring dg1
  vol1 in diskgroup dg1 already exists. (Skipping ..)
  vol2 in diskgroup dg1 already exists. (Skipping ..)
  vol3 in diskgroup dg1 already exists. (Skipping ..)
8.7.1 Conflicts While Restoring the Configuration
When
volrestore
executes, it can encounter conflicts
in the LSM configuration, for example, if another volume uses the same plex
name or subdisk name, or the same location on a disk.
When
volrestore
finds a conflict, it displays error messages and the configuration
of the volume, as found in the saved LSM description set.
In addition, it
removes all volumes created in that disk group during the restoration.
The
disk group that had the conflict remains imported, and
volrestore
continues to restore other disk groups.
If
volrestore
fails because of a conflict, you can
use the
volrestore
-b
option to do the
best possible restoration in a disk group.
You will then have to resolve the
conflicts and restore the volumes in the affected disk group.
8.7.2 Failures in Restoring the Configuration
The restoration of volumes fails if one or more disks associated with the volumes are unavailable, for example due to disk failure. This, in turn, causes the restoration of a disk group to fail. To restore the LSM configuration of a disk group, enter:
# volrestore
-b
-g
diskgroup
The volumes associated with the failed disks can then be restored by
editing the
volmake
description file to remove the plexes
that use the failed disks.
Note that editing the description file affects
the checksum of the files in the backup directory, so you must override the
checksum validation by using the
-f
option.
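For example, after editing the description file you might restore the affected disk group with a command similar to the following (the disk group name dg1 is hypothetical):
# volrestore -b -f -g dg1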
8.8 Reinstalling the Operating System
If you reinstall the operating system, LSM-related information, such as data in the LSM private areas on reinstalled disks (containing the disk identifier and copies of the LSM configuration), is removed, which makes the disk unusable to LSM. The only volumes saved are those that reside on, or have copies on, disks that are not directly involved with reinstallation. Volumes on disks involved with the reinstallation are lost during reinstallation. If backup copies of these volumes are available, you can restore them after reinstallation. The system root disk is always involved in reinstallation.
To reinstall the operating system and recover the LSM configuration, you need to:
Prepare the system for the installation. This includes replacing any failed disks or other hardware, and detaching any disks not involved in the reinstallation.
Install the operating system.
Recover the LSM configuration.
Complete the configuration by restoring information in volumes
affected by the reinstallation and recreate system volumes (such as
rootvol
,
swapvol
, and
usr
).
8.8.1 Preparing the System for the Operating System Reinstallation
To prevent the loss of data on disks not involved in the reinstallation, you should only involve the root disk in the reinstallation procedure. It is recommended that any other disks (that contain volumes) be disconnected from the system before you start the reinstallation procedure.
Disconnecting the other disks ensures that they are unaffected by the
reinstallation.
For example, if the operating system was originally installed
with a file system on the second drive, the file system may still be recoverable.
Removing the second drive ensures that the file system remains intact.
8.8.2 Reinstalling the Operating System
After failed or failing disks are replaced and disks uninvolved with the reinstallation are detached, reinstall the operating system and LSM as described in the Installation Guide.
While the operating system installation progresses, make sure no disks
other than the root disk are accessed in any way.
If anything is written
on a disk other than the root disk, the LSM configuration on that disk could
be destroyed.
8.8.3 Recovering the LSM Configuration
Use the
volrestore
procedure to recover the LSM configuration
information that was previously saved with
volsave
.
If
the LSM configuration information cannot be restored using
volrestore
, use the following procedure to reinitialize LSM.
Warning
Executing the
volsetup
command with the -o force
option destroys any existing LSM configuration information on a system.
Once the LSM subsets have been loaded, recover the LSM configuration by doing the following:
Shut down the system.
Physically reattach the disks that were removed from the system.
Reboot the system. When the system comes up, make sure that all disks are configured in the kernel and that special device files have been created for the disks.
Run the
volsetup
script.
This script checks
for an existing LSM configuration and starts LSM if one exists.
If an existing
configuration is found, the script displays the following message:
LSM has detected the presence of an existing configuration.
Check the current configuration and use '-o force' option to
destroy the existing configuration if necessary.
Recreate the LSM configuration.
If the LSM configuration
was previously saved using the
volsave
command, use the
volrestore
command.
Otherwise, you must recreate the volumes, plexes,
subdisks, disks, and disk groups using the procedures described in Chapter
5.
Restore the volume's data using the appropriate backup and
restore command.
For example, to restore an AdvFS or UFS file system that
was backed up with the
vdump
command, you would use the
vrestore
command.
If the
root
file system, swap partition,
and/or
usr
file system were previously under LSM control,
you can reconfigure the system disk under LSM control and mirror the disk
using the procedures described in Chapter 4.
The configuration preserved on the disks not involved with the reinstallation has now been recovered. However, because the root disk has been reinstalled, it appears to LSM as a non-LSM disk. Therefore, the configuration of the preserved disks does not include the root disk as part of the LSM configuration.
Note
If the root disk of your system and any other disk involved in the reinstallation were not under LSM control at the time of failure and reinstallation, then the reconfiguration is complete at this point. If other disks containing volumes or volume plexes are to be replaced, follow the replacement procedures in Section 8.2.2.2 .
8.8.4 Completing the Configuration
If the boot disk (or another disk) was involved with the reinstallation,
any volume or volume plexes on that disk (or other disks no longer attached
to the system) are now inaccessible.
If a volume had only one plex (contained
on a disk that was reinstalled, removed, or replaced), then the data on
that volume is lost and must be restored from backup.
In addition, the system's
root file system and swap area are not located on volumes any longer.
8.8.4.1 Removing the Root and Swap Volumes
Remove volumes associated with root and swap areas, and their associated disks. This must be done if the root disk was under LSM control prior to reinstallation. The volumes to remove are:
rootvol
, which contains the root file system
swapvol
, which contains the swap area
Follow these steps to remove the
rootvol
and
swapvol
volumes:
Stop the root and swap volumes and remove them by entering the following commands:
#
volume stop rootvol
#
voledit -r rm rootvol
#
volume stop swapvol
#
voledit -r rm swapvol
Remove the LSM disks used by
rootvol
and
swapvol
.
For example, if disk
dsk3
was associated
with
rootvol
and disk
dsk3b
was associated
with
swapvol
:
#
voldg rmdisk dsk3 dsk3b
#
voldisk rm dsk3 dsk3b
8.8.4.2 LSM Volumes for /usr and /var Partitions
If
/usr
and
/var
partitions were
on LSM volumes prior to the reinstallation, then remove the LSM disks associated
with them using the
voledit
command in the previous example
shown for
rootvol
and
swapvol
.
8.8.4.3 Restoring Volumes from Backup
After configuring the volumes, you must determine which volumes need to be restored from backup. The volumes to be restored include any volumes that had all plexes residing on disks that were removed or reinstalled. These volumes are invalid and must be removed, recreated, and restored from backup. If only some plexes of a volume exist on reinitialized or removed disks, these plexes must be removed. The plexes can be re-added later.
Follow these steps to restore the volumes:
Establish which LSM disks have been removed or reinstalled:
#
voldisk list
Output similar to the following is displayed:
DEVICE    TYPE      DISK      GROUP     STATUS
dsk0      sliced    -         -         error
dsk1      sliced    disk02    rootdg    online
dsk2      sliced    disk03    rootdg    online
-         -         disk01    rootdg    failed was: dsk0
This output shows that the reinstalled root device,
dsk0
is not recognized as an LSM disk and is marked with a status of
error
.
disk02
and
disk03
were
not involved in the reinstallation and are recognized by LSM and associated
with their devices (dsk1
and
dsk2
).
The former
disk01
, the LSM disk that had been associated
with the replaced disk device, is no longer associated with the device (dsk0
).
If there had been other disks (with volumes or volume plexes
on them) removed or replaced during reinstallation, these disks would also
have a disk device in
error
state and an LSM disk listed
as not associated with a device.
Once you know which disks are removed or replaced, display
the plexes on disks with a status of
failed
:
#
volprint -sF "%vname"
-e 'sd_disk = "<disk>"'
In this command, the variable
<disk>
is the name
of a disk with a
failed
status.
Note
Be sure to enclose the disk name in quotes in the command. Otherwise, the command displays an error message.
The
volprint
command displays a list of volumes
that have plexes on the failed disk.
Repeat this command for each disk with
a
failed
status.
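For example, for the failed LSM disk disk01 shown in the previous step, the command would be:
# volprint -sF "%vname" -e 'sd_disk = "disk01"'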
Check the status of each volume by entering the following command:
volprint -th
volume_name
For example, to display information about a volume called
fnah
,
enter:
#
volprint -th fnah
Output similar to the following is displayed:
V  NAME        USETYPE   KSTATE    STATE     LENGTH  READPOL     PREFPLEX
PL NAME        VOLUME    KSTATE    STATE     LENGTH  LAYOUT      ST-WIDTH  MODE
SD NAME        PLEX      PLOFFS    DISKOFFS  LENGTH  DISK-MEDIA  ACCESS

v  fnah        fsgen     DISABLED  ACTIVE    24000   SELECT      -
pl fnah-01     fnah      DISABLED  NODEVICE  24000   CONCAT      -
sd disk01-06   fnah-01   0         519940    24000   disk01      -
In this output, the only plex of the volume is shown in the
line beginning with
pl
.
The
STATE
field
for the plex called
fnah-01
is
NODEVICE
.
The plex has space on a disk that was replaced, removed, or reinstalled.
Therefore, the plex is no longer valid and you must remove it.
Because the
fnah-01
plex was the only plex of the
volume, the volume contents are irrecoverable except by restoring the volume
from a backup.
You must also remove the volume.
If a backup copy of the
volume exists, you can restore the volume later.
Keep a record of the volume
name and its length; you will need them when you restore the volume from backup.
Remove the volume by entering the following command:
voledit -r rm
volume_name
For example, to remove a volume called
fnah
, enter:
#
voledit -r rm fnah
It is possible that only part of a plex is located on the failed disk.
If the volume has a striped plex associated with it, the volume is divided
between several disks.
For example, the volume called
vol01
has one striped plex, striped across three disks, one of which is the reinstalled
disk
disk01
.
The output of the
volprint
-th
command for
vol01
displays output similar
to the following:
V  NAME        USETYPE   KSTATE    STATE     LENGTH  READPOL     PREFPLEX
PL NAME        VOLUME    KSTATE    STATE     LENGTH  LAYOUT      ST-WIDTH  MODE
SD NAME        PLEX      PLOFFS    DISKOFFS  LENGTH  DISK-MEDIA  ACCESS

v  vol01       fsgen     DISABLED  ACTIVE    4224    SELECT      -
pl vol01-01    vol01     DISABLED  NODEVICE  4224    STRIPE      128       RW
sd disk02-02   vol01-01  0         14336     1408    disk02      dsk1
sd disk01-05   vol01-01  1408      517632    1408    disk01      -
sd disk03-01   vol01-01  2816      14336     1408    disk03      dsk2
This output shows three disks, across which the plex
vol01-01
is striped (the lines starting with
sd
represent the stripes).
The second stripe area is located on the LSM disk called
disk01
.
This disk is no longer valid, so the plex called
vol01-01
has a state of
NODEVICE
.
Because this is the
only plex of the volume, the volume is invalid and must be removed.
If a
copy of
vol01
exists on the backup media, it can be restored
later.
Note
Keep a record of the volume name and length of any volumes you intend to restore from backup.
Use the
voledit
command to remove the volume, as
described earlier.
A volume that has one plex on a failed disk may also have other plexes
on disks that are still valid.
In this case, the volume does not need to
be restored from backup, because the data is still valid on the valid disks.
The output of the
volprint
-th
command
for a volume with one plex on a failed disk (disk01
) and
another plex on a valid disk (disk02
) displays output similar
to the following:
V  NAME          USETYPE      KSTATE    STATE     LENGTH   READPOL    PREFPLEX
PL NAME          VOLUME       KSTATE    STATE     LENGTH   LAYOUT     ST-WIDTH MODE
SD NAME          PLEX         PLOFFS    DISKOFFS  LENGTH   DISK-MEDIA ACCESS

v  foo           fsgen        DISABLED  ACTIVE    10240    SELECT     -
pl foo-01        foo          DISABLED  ACTIVE    10240    CONCAT     -        RW
sd disk02-01     foo-01       0         0         10240    disk02     dsk1
pl foo-02        foo          DISABLED  NODEVICE  10240    CONCAT     -        RW
sd disk01-04     foo-02       0         507394    10240    disk01     -
This volume has two plexes,
foo-01
and
foo-02
.
The first plex,
foo-01
, does not use any space
on the invalid disk, so it can still be used.
The second plex,
foo-02
, uses space on the invalid disk,
disk01
,
and has a state of
NODEVICE
.
The
foo-02
plex must be removed.
However, the volume still has one valid plex containing
valid data.
If the volume needs to be mirrored, another plex can be added
later.
Note the name of the volume if you want to create another plex later.
To remove an invalid plex, you must dissociate the plex from the volume
and then remove it.
To remove the plex called
foo-02
,
enter:
#
volplex -o rm dis foo-02
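To confirm that only the valid plex remains, you can display the volume again (foo is the volume from this example):
#
volprint -ht foo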
Once all invalid volumes and volume plexes are removed, you must clean
up the disk configuration.
Each disk that was removed, reinstalled, or replaced (as
determined from the output of the
voldisk
list
command) must be removed from the configuration.
To remove the disk, use the
voldg
command.
To remove
the failed
disk01
, enter:
#
voldg rmdisk disk01
If the
voldg
command returns an error message, some
invalid volume plexes exist.
Repeat the processes described in "Volume
Cleanup" until all invalid volumes and volume plexes are removed.
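After the voldg rmdisk command succeeds for each failed disk, you can list the disks again to confirm that they are no longer associated with the disk group (this is the same command used earlier to identify the removed or replaced disks):
#
voldisk list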
8.8.4.5 Reconfiguring the root Volume
Once all the invalid disks are removed, you can replace or reinstall disks to add them to LSM control. If the root disk was originally under LSM control (the root file system and the swap area were on volumes), or you now want to put the root disk under LSM control, add this disk first, for example:
#
/usr/sbin/volencap
devname
See Chapter 4 for more information.
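For example, if the boot disk is the device dsk0, you would enter:
#
/usr/sbin/volencap dsk0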
When the encapsulation is complete, reboot the system to multiuser mode.
8.8.4.6 Reconfiguring Volumes
After the boot disk is encapsulated, you can replace other disks. If the disks were reinstalled during the operating system reinstallation, they should be encapsulated; otherwise, add them.
Once the disks are added to the system, you can recreate the volumes that were removed and restore their contents from backup.
To recreate the volumes
fnah
and
vol01
,
enter:
#
volassist make fnah 24000
#
volassist make vol01 4224 layout=stripe nstripe=3
To replace the plex removed from the volume
foo
using
volassist
, enter:
#
volassist mirror foo
Once you restore the volumes and plexes, the recovery is complete and
your system should be configured as it was prior to reinstalling the Tru64 UNIX
operating system.
8.9 Deconfiguring Additional Swap
Follow these steps to deconfigure and remove additional swap volumes that were previously configured for use with the LSM software:
Deconfigure the swap space so that it no longer uses the LSM volumes.
To do this, update the
vm:swapdevice
entry in the
sysconfigtab
file so that it no longer references the LSM volumes (see the example after this step).
If the swap space was configured using the
/etc/fstab
file, update this file accordingly.
See the
System Administration
guide and the
swapon
(8)
reference page for more information.
Reboot the system for the change to take effect.
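For example, if swap was configured through the vm:swapdevice attribute, the edited stanza in the /etc/sysconfigtab file might look similar to the following after the LSM volume entry is deleted (the partition name dsk0b is only illustrative):
vm:
        swapdevice = /dev/disk/dsk0b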
Stop and remove the volumes.
For example, to stop and remove
a volume called
swapvol1
, enter:
#
volume stop swapvol1
#
voledit -rf rm swapvol1
8.10 Removing the LSM Software
Follow these steps to deconfigure and remove LSM from a system.
Warning
Deconfiguring LSM causes any data currently under LSM to be lost and no longer accessible. Unencapsulate or back up any needed data before proceeding.
Reconfigure any system file systems and swap space so that they are no
longer on an LSM volume.
If root and swap are configured under LSM, enter
the
volunroot
command and reboot the system.
Also, unencapsulate
the /usr and /var file systems if they are configured under LSM.
See
Chapter 4
if
/usr
and
/var
are encapsulated under
LSM with the root and swap.
If additional swap space was configured using
LSM volumes, deconfigure them as described in
Section 8.9.
Unmount any other file systems that use LSM volumes
so all LSM volumes can be closed.
Update the
/etc/fstab
file if necessary to no longer mount any file systems on an LSM volume.
Stop
applications that are using raw LSM volumes and reconfigure them to no longer
use LSM volumes.
Make a note of which disks are currently configured under LSM by entering the following command:
#
voldisk list
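You might also save this list to a file (the file name is arbitrary) so that it remains available after LSM is stopped:
#
voldisk list > /var/tmp/lsm-disks.txt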
Once all the LSM volumes are no longer in use, restart LSM in disabled mode by entering the following command:
#
vold -k -r reset -d
This command fails if any volumes are open.
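If the reset fails, a quick (though not exhaustive) check for file systems still mounted on LSM volumes is to search the mount output for LSM device paths:
#
mount | grep /dev/vol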
Stop LSM's volume and I/O daemons by entering the following commands:
#
voliod -f set 0
#
voldctl stop
Update the disk labels using the list of disks under LSM from step 3 above. For each disk that was previously configured under LSM as sliced (that is, the entire disk was under LSM), repartition and update the disk label using the -rw option, as in the following example:
#
disklabel -rw dsk4
#
disklabel -rw dsk5
For each disk partition that was configured under LSM as a simple disk,
update the partition's
fstype
to
unused
using the
-s
option with the
disklabel
command.
For example:
#
disklabel -s dsk6c unused
Also, update the disk partition
fstype
field for
any
nopriv
disks that were previously under LSM to either
unused
or the appropriate value depending on whether the partition
still contains valid data.
For example, if
dsk2g
was an LSM
nopriv
disk that still contains a valid UNIX file
system and
dsk2h
was an LSM
nopriv
disk
that no longer contains valid data, enter:
#
disklabel -s dsk2g 4.2BSD
#
disklabel -s dsk2h unused
Remove the LSM directories by entering the following command:
#
rm -r /etc/vol /dev/vol /dev/rvol /etc/vol/volboot
Delete the following LSM entries in the
/etc/inittab
file:
lsmr:s:sysinit:/sbin/lsmbstartup -b </dev/console >/dev/console 2>&1 ##LSM
lsm:23:wait:/sbin/lsmbstartup </dev/console >/dev/console 2>&1 ##LSM
vol:23:wait:/sbin/vol-reconfig -n </dev/console >/dev/console 2>&1 ##LSM
Display the installed LSM subsets by entering the following command:
#
setld -i | grep LSM
The output shows the installed LSM subsets.
Delete the installed LSM subsets by entering the following command:
#
setld -d OSFLSMBASE500 OSFLSMBIN500 OSFLSMCLSMTOOLS500
Deconfigure LSM from the kernel.
For example, for a system named
rio
, change the
pseudo-device lsm 1
entry in the
/sys/conf/RIO
file to the following:
pseudo-device lsm 0
You can make this change either prior to running the
doconfig
command or while running the
doconfig
command.
For example:
#
doconfig -c RIO
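To verify the entry before or after rebuilding the kernel, a standard grep of the configuration file (RIO follows the example above) displays it:
#
grep 'pseudo-device lsm' /sys/conf/RIO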
Copy the new kernel to the root directory and reboot the system by entering the following commands:
#
cp /sys/RIO/vmunix /
#
shutdown -r now