8    Troubleshooting the LSM Software

This chapter provides information that helps you troubleshoot the LSM software.

8.1    Recovering from a Disk Failure

LSM's hot-sparing feature automatically detects disk failures and notifies you of them by electronic mail. If hot-sparing is disabled or you miss the electronic mail, you may notice disk failures in the output of the volprint command when you check the status of the disks. You may also see driver error messages on the console or in the system messages file. See Section 7.1 for more information on the LSM hot-sparing feature.

If one plex of a volume encounters a disk I/O failure (for example, because the disk has an uncorrectable format error), LSM disables the disk and does not retry the same I/O to that disk. However, before disabling the disk, LSM attempts to correct errors on the plex.

For read errors, LSM logs the error, and then attempts to read from other plexes. If the data is read successfully, LSM tries to correct the read error by writing the data back to the original plex.

If the write is successful, the data is returned, and a message similar to the following is displayed:

Dec 4 17:27:32 xebec vmunix: io/vol.c(volerror): Correctable read error on vol-dsk10g...

To display more information on a volume called vol-dsk10g, enter:

# volstat -f cf vol-dsk10g

Output similar to the following is displayed:

                           CORRECTED               FAILED
TYP NAME                READS    WRITES        READS    WRITES
 vol vol-dsk10g           1         0            0         0
 
 

If the write fails, the bad plex is detached; I/O stops on that plex but continues on the remaining plexes of the volume, and a message similar to the following is written to the LSM kernel change log to record that the plex is detached and no longer used:

Dec  4 18:42:31 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec  4 18:42:31 xebec vmunix: voliod_error: plex detach - volume vol-dsk10g, 
                                            plex vol-dsk10g-02...

To display more information, enter:

# volprint -ht vol-dsk10g

Output similar to the following is displayed:

v  vol-dsk10g      gen             ENABLED  ACTIVE   819200   PREFER
pl vol-dsk10g-01   vol-dsk10g      ENABLED  ACTIVE   819200   CONCAT
sd dsk10g-01       vol-dsk10g-01   0        0        819200   dsk10g
pl vol-dsk10g-02   vol-dsk10g      DISABLED NODEVICE 819200   CONCAT
sd sd-dsk8g-1      vol-dsk10g-02   1        1        819199   dsk8
 
 

A disk failure on a mirrored volume results in a console error similar to the following:

Dec  5 10:44:37 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec  5 10:44:37 xebec vmunix: io/vol.c(volerror): Uncorrectable read error on vol...
Dec  5 10:44:37 xebec vmunix: voliod_error: plex detach - volume vol-dsk10g, 
                                            plex vol-dsk10g-02...
 
 

To display more information, enter:

# volstat -f cf vol-dsk10g

Output similar to the following is displayed:

                           CORRECTED               FAILED 
TYP NAME                READS    WRITES        READS    WRITES 
vol vol-dsk10g            1         0           1         0

Or, enter:

# volprint -ht vol-dsk10g

Output similar to the following is displayed:

v  vol-dsk10g    gen           ENABLED  ACTIVE   819200   PREFER
pl vol-dsk10g-01 vol-dsk10g    ENABLED  ACTIVE   819200   CONCAT
sd dsk10-01      vol-dsk10g-03 0        0        819200   dsk10
pl vol-dsk10g-02 vol-dsk10g    DETACHED IOFAIL   819200   CONCAT
sd sd-dsk8g-1    vol-dsk10g-02 1        1        819199   dsk8

For write errors, if the volume is mirrored, the bad plex is detached and a message is written to the LSM kernel change log to record that the plex is detached and is no longer used.

If the write succeeded on at least one plex in the volume, the write is considered successful.

If the write failed to all plexes, LSM returns a failure error and detaches the disk from its disk group.

8.1.1    Replacing a Disk that is Beginning to Fail

Often a disk has recoverable (soft) errors before it fails completely. If a disk is getting an unusual number of soft errors, use the following procedure to replace it.

  1. Detach the disk from its disk group by running voldiskadm and choosing Remove a disk for replacement from the main menu.

    If there are initialized disks available as replacements, you can specify the disk as part of this operation. Otherwise, you must specify the replacement disk later by choosing Replace a failed or removed disk from the main menu.

    When you select a disk to remove for replacement, all volumes that will be affected by the operation are displayed. For example, the following output might be displayed:

    The following volumes will lose mirrors as a result of this 
    operation: 
     
         home src 
     
    No data on these volumes will be lost. 
     
    The following volumes are in use, and will be disabled 
    as a result of this operation: 
     
         mkting 
     
    Any applications using these volumes will fail future 
    accesses. These volumes will require restoration from 
    backup. 
     
    Are you sure you want do this? [y,n,q,?] (default: n)
    

  2. If any volumes would be disabled, quit from voldiskadm and save the volume data, either by backing up the volume or by moving the volume off of the disk.

    For example, to move the volume mkting to a disk other than disk02, enter the following command:

    # volassist move mkting !disk02

    After the volume is backed up or moved, enter the voldiskadm command again and continue to remove the disk for replacement.

  3. After the disk is removed, specify a replacement disk by choosing Replace a failed or removed disk from the main menu of the voldiskadm menu interface.

8.1.2    Replacing a Failed Disk

If a disk that was in use by LSM fails completely and is detached, you can replace the disk with a new disk. To replace a disk, enter the voldiskadm command and choose Replace a failed or removed disk from the main menu.

If you have any disks that are initialized for LSM but have never been added to a disk group, you can select one of those disks as a replacement. Do not choose the old disk drive as a replacement even though it may appear in the selection list. If there are no suitable initialized disks, you can choose to initialize a new disk.

If a disk failure caused a volume to be disabled, the volume must be restored from backup after the disk is replaced. To identify volumes that wholly reside on disks that were disabled by a disk failure, use the volinfo command. Any volumes that are listed as Unstartable must be restored from backup.

To display volume status, enter:

# volinfo

Output similar to the following is displayed:

home           fsgen    Started 
mkting         fsgen    Unstartable 
src            fsgen    Started
 
 

To restart the Unstartable volume called mkting, enter:

# volume -o bg -f start mkting

The -o bg option recovers plexes as a background task.

8.1.3    Reattaching Disks

A disk reattach operation may be necessary if a disk has experienced a full failure and hot-sparing is not possible, or if LSM is started with some disk drivers unloaded and unloadable (causing disks to enter the failed state). If the problem is fixed, it may be possible to use the volreattach command to reattach the disks without plexes being flagged as stale, as long as the reattach happens before any volumes on the disk are started.

The volreattach command is called as part of disk recovery from the voldiskadm menus. If possible, volreattach will reattach the failed disk media record to the disk with the same device name in the disk group in which it was located before and will retain its original disk media name. After a reattach takes place, recovery may or may not be necessary. The reattach may fail if the original (or another) cause for the disk failure still exists.

To check whether a reattach is possible, enter:

# volreattach -c

This displays the disk group and disk media name where the disk can be reattached, without performing the operation.
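
If the check indicates that a reattach is possible, you can attempt it and then recover any affected volumes. The following is a sketch only; the bare volreattach invocation and the volrecover options mirror commands shown elsewhere in this chapter, and volume_name is a placeholder for a volume on the reattached disk:

# volreattach
# volrecover -sb volume_name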

For more information, see the volreattach(8) reference page.

8.2    Recovering from a Boot Disk Failure

When the boot disk is mirrored, failures occurring on the original boot disk are transparent to all users. However, during a failure, the system might:

To reboot the system before the original boot disk is repaired, you can boot from any disk that contains a valid root volume.

If all copies of rootvol are corrupted, and you cannot boot the system, you must reinstall the system.

Replacing a boot disk is a more complex process than replacing other disks because boot-critical data must be placed in specific areas on specific disks in order for the boot process to find it. How you replace a failed boot disk depends on:

The sections that follow give instructions for re-adding or replacing the boot disk, as well as other information related to boot disk recovery.

8.2.1    Hot-Sparing and Boot Disk Failures

If the boot disk fails on a system that has the boot (root) disk mirrored and the hot-sparing feature enabled, LSM automatically attempts to replace the failed root disk mirror with a new mirror. To do this, a surviving mirror of the root disk is used to create a new mirror on either a spare disk or a disk with sufficient free space. This ensures that there are always at least two mirrors of the root disk that can be used for booting.

For hot-sparing to succeed, the rootdg disk group must have enough spare or free space to accommodate the volumes from the failed root disk. Also, the rootvol and swapvol volumes require contiguous disk space. If there is not enough contiguous space on a single new disk, each of these volumes can be relocated to a different new disk.

See Chapter 4 for more information on mirroring the boot disk.

8.2.2    Re-adding and Replacing Boot Disks

Normally, replacing a failed disk is as simple as putting a new disk somewhere on the controller and running LSM replace disk commands. It's even possible to move the data areas from that disk to available space on other disks, or to use a hot-spare disk already on the controller to replace the failure. For data that is not critical for booting the system, it doesn't matter where the data is located. All data that is not boot critical is only accessed by LSM after the system is fully operational. LSM can find this data for you. On the other hand, boot-critical data must be placed in specific areas on specific disks in order for the boot process to find it.

When a disk fails, there are two possible ways to correct the problem. If the errors are transient or correctable, the same disk can be reused; this is known as re-adding a disk. If the disk has truly failed, it should be completely replaced.

8.2.2.1    Re-adding a Failed Boot Disk

Re-adding a disk is the same procedure as replacing a disk, except that the same physical disk is used. Usually, a disk that needs to be re-added has been detached, meaning that the LSM software has noticed that the disk has failed and has ceased to access it.

If the boot disk has a transient failure, its plexes can be recovered using the following steps. The rootvol and swapvol volumes can have two or three LSM disks per physical disk, depending on the layout of the original root disk.

  1. To list the LSM disks that are associated with the failed physical disk, enter:

    # voldisk list

    Output similar to the following is displayed:

    DEVICE        TYPE      DISK     GROUP      STATUS
    dsk10         sliced     -         -        error
    dsk10b        nopriv     -         -        error
    dsk10f        nopriv     -         -       error
    dsk21         sliced    dsk21     rootdg    online
    dsk21b        nopriv    dsk21b    rootdg    online
    -              -        dsk10     rootdg   removed was:dsk10
    -              -        dsk10b    rootdg   removed was:dsk10b
    -              -        dsk10f    rootdg   removed was:dsk10f
    

    In this example, if dsk10 was the failed boot disk, then you can assume that dsk10, dsk10b, and dsk10f are the LSM disks associated with the physical disk dsk10.

  2. Enter the following commands to add the LSM disks back to the rootdg disk group:

    
    # voldisk online dsk10 dsk10b dsk10f
    # voldg -k adddisk dsk10=dsk10
    # voldg -k adddisk dsk10b=dsk10b
    # voldg -k adddisk dsk10f=dsk10f
    

  3. Resynchronize the plexes in the rootvol and swapvol volumes:

    # volrecover -sb rootvol swapvol

8.2.2.2    Replacing a Failed Boot Disk

Follow these steps to replace a failed boot disk under LSM control with a new disk:

  1. Disassociate the plexes on the failed disk from rootvol and swapvol. Also, if /usr or /var were encapsulated on the boot disk, disassociate their plexes on the failed disk:

    # volplex -o rm dis rootvol-02 swapvol-02 vol-dsk1g

  2. Remove all LSM disks configured on the boot disk:

    # voldg rmdisk dsk1a dsk1b dsk1g dsk1f 
    # voldisk rm dsk1a dsk1b dsk1g dsk1f
    

  3. Mirror the LSM volumes on the boot disk onto the new disk, as described in Chapter 4. The replacement disk must have at least as much storage capacity as was in use on the old disk.

8.2.3    Stale or Unusable Plexes on the Boot Disk

If a disk is unavailable when the system is running, any plexes of volumes that reside on that disk will become stale, meaning the data on that disk is out of date relative to the other plexes of the volume.

During the boot process, the system accesses only one copy of the root volume (the copy on the boot disk) until a complete configuration for this volume can be obtained. If the plex of the root volume that was used for booting is stale, you must reboot the system from another boot disk that contains nonstale plexes. This problem can occur if the boot disk was replaced and restarted without adding the disk back into the LSM configuration. The system will boot normally, but the plexes that reside on the newly booted disk will be stale.

Another possible problem can occur if errors in the LSM headers on the boot disk prevent LSM from properly identifying the disk. In this case, LSM cannot determine the name of that disk. This is a problem because plexes are associated with disk names, so any plexes on that disk are unusable.

If either of these situations occurs, the LSM daemon vold notices it while initializing the system as part of the init processing of the boot sequence. It displays a message describing the error, describes what can be done about it, and halts the system. For example, if the plex rootvol-01 of the root volume rootvol on disk disk01 was stale, the vold output is similar to the following:

lsm:vold: Warning Plex rootvol-01 for root 
volume is stale or unusable.
 
lsm:vold: Error: System boot disk does not have a valid root plex
Please boot from one of the following disks:
 
        Disk: disk02                     Device: dsk2
 
lsm:vold: Error: System startup failed
 
 

This message indicates that the disk disk02 contains usable copies of the root and swap plexes and should be used for booting, and that dsk2 is the name of the system backup disk. When this message appears, reboot the system from the device that corresponds to dsk2.

Once the system is booted, you must determine the problem. If the plexes on the boot disk were stale, they are caught up automatically as the system starts. If there is a problem with the private area on the disk, you must re-add or replace the disk.

If the plexes on the boot disk were unavailable, you will receive mail from the volwatch command describing the problem.

To list the status of the disks, enter:

# voldisk list

Output similar to the following is displayed:

DEVICE       TYPE      DISK     GROUP     STATUS
-            -         disk02   rootdg    failed was:  dsk1
dsk2         sliced    disk02   rootdg    online

8.2.4    Failure To Obtain Crash Dumps

During a system crash or panic, a crash dump is temporarily saved to swap space. If the swap device is configured to use one or more LSM volumes, all of the LSM swap volume's underlying disk partitions are used separately to maximize crash dump space, even when the swap volume is mirrored across different disk partitions. This does not cause any problems, provided that LSM mirrored swap volumes are properly configured not to resynchronize their mirrors upon reboot; a resynchronization could destroy the crash dump before it is saved to the file system.

If a mirrored swap volume is performing resynchronization upon reboot, you need to verify its configuration. If secondary swap volumes (that is, LSM swap volumes other than swapvol) are performing resynchronization, this is probably because the volume is not configured with the start_opts=norecov option.

To check the start_opts option for a swap volume called v1, enter:

# volprint -m v1 | grep start_opts

Output similar to the following is displayed:

 start_opts="

To change the start_opts option for a swap volume called v1, enter:

# volume set start_opts=norecov v1

To display the change, enter:

# volprint -m v1 | grep start_opts

Output similar to the following is displayed:

start_opts="norecov

If the LSM volume swapvol is performing resynchronization, this is typically because the volume does not have its device minor number set to 1. See Chapter 4 for information on how to set up root and swap volumes.

To check the swapvol volume's minor number, enter:

# ls -l /dev/*vol/swapvol

Output similar to the following is displayed:

crw-------   1 root     system    40,  1 Mar 16 16:00 /dev/rvol/swapvol  
brw-------   1 root     system    40,  1 Mar 16 16:00 /dev/vol/swapvol

8.3    Resynchronizing Volumes

When storing data redundantly, using mirrored or RAID5 volumes, LSM takes necessary measures to ensure that all copies of the data match exactly. However, under certain conditions (usually due to complete system failures), small amounts of the redundant data on a volume can become inconsistent or unsynchronized. Aside from normal configuration changes (such as detaching and reattaching a plex), this can only occur when a system crashes while data is being written to a volume. Data is written to the mirrors of a volume in parallel, as is the data and parity in a RAID5 volume. If a system crash occurs before all the individual writes complete, it is possible for some writes to complete while others do not, resulting in the data becoming unsynchronized. For mirrored volumes, it can cause two reads from the same region of the volume to return different results if different mirrors are used to satisfy the read request. In the case of RAID5 volumes, it can lead to parity corruption and incorrect data reconstruction.

When LSM recognizes this situation, it needs to make sure that all mirrors contain exactly the same data and that the data and parity in RAID5 volumes match. This process is called volume resynchronization. Volumes that are part of disk groups that are automatically imported at boot time (such as rootdg) are resynchronized when the system boots.

Not all volumes require resynchronization after a system failure. Volumes that were never written or that were inactive when the system failure occurred and did not have any outstanding writes do not require resynchronization. LSM notices when a volume is first written and marks it as dirty. When a volume is closed by all processes or stopped cleanly, all writes will have completed and LSM removes the dirty flag for the volume. Only volumes that are marked dirty when the system reboots require resynchronization.

Resynchronization can be computationally expensive and can have a significant impact on system performance. The recovery process attempts to alleviate some of this impact by attempting to "spread out" recoveries to avoid stressing a specific disk or controller. Additionally, for very large volumes or for a very large number of volumes, the resynchronization process can take a long time. These effects can be addressed by using dirty-region logging for mirrored volumes, or by making sure that RAID5 volumes have valid RAID5 logs.
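
For example, if your version of LSM supports adding a dirty-region log with the volassist command (an assumption; check the volassist(8) reference page for the exact syntax), a log can be added to an existing mirrored volume as follows, where volume_name is a placeholder:

# volassist addlog volume_name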

The exact process of resynchronization depends on the type of volume. RAID5 volumes that contain RAID5 logs can simply replay those logs. If no logs are available, the volume is placed in reconstruct-recovery mode and all parity is regenerated.

LSM automatically recovers mirrored and RAID5 volumes when the system is booted and the volumes are first started.

See the volume(8) reference page for more information on resynchronizing volumes.

8.4    Recovering Volumes

The following sections describe recovery procedures for problems relating to LSM volumes.

8.4.1    Listing Unstartable Volumes

An unstartable volume is likely to be incorrectly configured or to have other errors or conditions that prevent it from being started. To display unstartable volumes, use the volinfo command, which displays information on the accessibility and usability of one or more volumes:

# volinfo -g disk_group [volume_name]
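
For example, to list the volumes in the rootdg disk group and check for any that are reported as Unstartable, enter:

# volinfo -g rootdg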

8.4.2    Recovering a Disabled Volume

If a system crash or an I/O error corrupts one or more plexes of a volume and no plex is CLEAN or ACTIVE, mark one of the plexes CLEAN and instruct the system to use that plex as the source for reviving the others. To place a plex in a CLEAN state, enter:

# volmend fix clean plex_name

For example, to place one plex called vol01-02 in the CLEAN state, enter:

# volmend fix clean vol01-02
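
After marking a plex CLEAN, you typically restart the volume so that LSM revives the remaining plexes from that plex; the volume start command shown elsewhere in this chapter can be used. For example, assuming the plex belongs to a volume called vol01:

# volume start vol01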

See the volmend(8) reference pages for more information.

8.5    Recovering RAID5 Volumes

RAID5 volumes are designed to remain available when a disk fails, with a minimum of disk space overhead. However, many implementations of RAID5 can become vulnerable to data loss after a system failure, and some types of disk failures can also affect RAID5 volumes adversely. The following sections describe how system and disk failures affect RAID5 volumes, and the types of recovery needed.

8.5.1    System Failures and RAID5 Volumes

A system failure causes the data and parity in the RAID5 volume to become unsynchronized because the disposition of writes that were outstanding at the time of the failure cannot be determined. If this occurs while a RAID5 volume is being accessed, the volume is described as having stale parity. When this occurs, the parity must be reconstructed by reading all the non-parity columns within each stripe, recalculating the parity, and writing out the parity stripe unit in the stripe. This must be done for every stripe in the volume, so it can take a long time to complete.

Caution

While this resynchronization is going on, any failure of a disk within the array will cause the data in the volume to be lost. This only applies to RAID5 volumes without log plexes. Compaq recommends configuring all RAID5 volumes with a log.

Having the array vulnerable in this way is undesirable. Besides the vulnerability to failure, the resynchronization process can tax the system resources and slow down system operation.

RAID5 logs reduce the possible damage that can be caused by system failures. Because they maintain a copy of the data being written at the time of the failure, the process of resynchronization consists of simply reading that data and parity from the logs and writing it to the appropriate areas of the RAID5 volume. This greatly reduces the amount of time needed for a resynchronization of data and parity. It also means that the volume never becomes truly stale because the data and parity for all stripes in the volume is known at all times, so the failure of a single disk cannot result in the loss of the data within the volume.
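
For example, if your version of LSM supports adding a RAID5 log plex with the volassist command (an assumption; check the volassist(8) reference page), a log can be added to an existing RAID5 volume called r5vol as follows:

# volassist addlog r5vol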

8.5.2    Disk Failures and RAID5 Volumes

A RAID5 disk failure can occur because of an uncorrectable I/O error during a write to the disk (which causes the subdisk to be detached from the array) or because a disk is unavailable when the system is booted (for example, due to a cabling problem or a drive that is powered down). When this occurs, the subdisk cannot be used to hold data and is considered stale and detached. If the underlying disk later becomes available or is replaced, the subdisk is still considered stale and is not used.

If an attempt is made to read data contained on a stale subdisk, the data is reconstructed from the data in all other stripe units in the stripe; this operation is called a reconstruct-read. This is a significantly more expensive operation than simply reading the data, and it results in degraded read performance. When a RAID5 volume has stale subdisks, it is considered to be in degraded mode.

To determine whether a RAID5 volume is in degraded mode, enter:

# volprint -ht

Output similar to the following is displayed:

V       NAME    USETYPE KSTATE  STATE   LENGTH  READPOL PREFPLEX
PL      NAME    VOLUME  KSTATE  STATE   LENGTH  LAYOUT  NCOL/WID        MODE
SD      NAME    PLEX    DISK    DISKOFFS        LENGTH  [COL/]OFF       DEVICE  MODE
v       r5vol   RAID5   ENABLED DEGRADED        20480   RAID    -
pl      r5vol-01        r5vol   ENABLED ACTIVE  20480   RAID    3/16    RW
sd      disk00-00       r5vol-01        disk00  0       10240   0/0     dsk4d1
sd      disk01-00       r5vol-01        disk01  0       10240   1/0     dsk2d1  dS
sd      disk02-00       r5vol-01        disk02  0       10240   2/0     dsk3d1  -
pl      r5vol-l1        r5vol   ENABLED LOG     1024    CONCAT  -       RW
sd      disk03-01       r5vol-l1        disk00  10240   1024    0       dsk3d0  -
pl      r5vol-l2        r5vol   ENABLED LOG     1024    CONCAT  -       RW
sd      disk04-01       r5vol-l2        disk02  10240   1024    0       dsk1d1  -         
 
 

The output shows that volume r5vol is in degraded mode, as shown by the STATE, which is listed as DEGRADED. The failed subdisk is disk01-00, as shown by the flags in the last column. The d indicates that the subdisk is detached and the S indicates that the subdisk contents are stale.

It is also possible that a disk containing a RAID5 log could experience a failure. This has no direct effect on the operation of the volume; however, the loss of all RAID5 logs on a volume makes the volume vulnerable to a complete failure.

A failure within a RAID5 log plex is indicated by a plex state of BADLOG. In the following volprint output, the RAID5 log plex r5vol-l1 has failed:

V       NAME    USETYPE KSTATE  STATE   LENGTH  READPOL PREFPLEX
PL      NAME    VOLUME  KSTATE  STATE   LENGTH  LAYOUT  NCOL/WID        MODE
SD      NAME    PLEX    DISK    DISKOFFS        LENGTH  [COL/]OFF       DEVICE  MODE
v       r5vol   RAID5   ENABLED ACTIVE  20480   RAID    -
pl      r5vol-01        r5vol   ENABLED ACTIVE  20480   RAID    3/16    RW 
sd      disk00-00       r5vol-01        disk00  0       10240   0/0     dsk4d1  ENA
sd      disk01-00       r5vol-01        disk01  0       10240   1/0     dsk2d1  dS
sd      disk02-00       r5vol-01        disk02  0       10240   2/0     dsk3d1  ENA
pl      r5vol-l1        r5vol   DISABLED        BADLOG  1024    CONCAT  -       RW
sd      disk03-01       r5vol-l1        disk00  10240   1024    0       dsk3d0  ENA
pl      r5vol-l2        r5vol   ENABLED LOG     1024    CONCAT  -       RW
sd      disk04-01       r5vol-l2        disk02  10240   1024    0       dsk1d1  ENA
 
 

8.5.3    RAID5 Recovery

The following types of recovery are typically needed for RAID5 volumes: parity resynchronization, stale subdisk recovery, and log plex recovery.

These types of recoveries are discussed in the sections that follow. Parity resynchronization and stale subdisk recovery are typically performed when the RAID5 volume is started, shortly after the system boots, or by calling the volrecover command.

If hot-sparing is enabled at the time of a disk failure, system administrator intervention is not required (unless there is no suitable disk space available for relocation). Hot-sparing will be triggered by the failure and the system administrator will be notified of the failure by electronic mail. Hot-sparing will automatically attempt to relocate the subdisks of a failing RAID5 plex. After any relocation takes place, the hot-sparing daemon (volspared) will also initiate a parity resynchronization. In the case of a failing RAID5 log plex, relocation will only occur if the log plex is mirrored; volspared will then initiate a mirror resynchronization to recreate the RAID5 log plex. If hot-sparing is disabled at the time of a failure, the system administrator may need to initiate a resynchronization or recovery.

8.5.3.1    Parity Resynchronization

In most circumstances, a RAID5 array will not have stale parity. Stale parity should only occur after all RAID5 log plexes for the RAID5 volume have failed, and then only if there is a system failure. Furthermore, even if a RAID5 volume has stale parity, it is usually taken care of as part of the volume start process.

However, if a volume without valid RAID5 logs starts and the process is killed before the volume is resynchronized, the result is an active volume with stale parity.

To display volume state, enter:

# volprint -ht

Output similar to the following is displayed:

V       NAME    USETYPE KSTATE  STATE   LENGTH  READPOL PREFPLEX
PL      NAME    VOLUME  KSTATE  STATE   LENGTH  LAYOUT  NCOL/WID        MODE
SD      NAME    PLEX    DISK    DISKOFFS        LENGTH  [COL/]OFF       DEVICE  MODE
v       r5vol   RAID5   ENABLED NEEDSYNC        20480   RAID    -
pl      r5vol-01        r5vol   ENABLED ACTIVE  20480   RAID    3/16    RW
sd      disk00-00       r5vol-01        disk00  0       10240   0/0     dsk4d1  ENA
sd      disk01-00       r5vol-01        disk01  0       10240   1/0     dsk2d1  ENA
sd      disk02-00       r5vol-01        disk02  0       10240   2/0     dsk3d1  ENA
 
 

This output displays the volume state as NEEDSYNC, indicating that the parity needs to be resynchronized. The state could also have been SYNC, indicating that a synchronization was attempted at start time and that a synchronization process should be doing the synchronization. If no such process exists or if the volume is in the NEEDSYNC state, a synchronization can be manually started using the volume resync command.

To resynchronize a RAID5 volume called r5vol, enter:

# volume resync r5vol

Parity is regenerated by issuing VOL_R5_RESYNC ioctls to the RAID5 volume. The resynchronization process starts at the beginning of the RAID5 volume and resynchronizes a region whose size equals the number of sectors specified by the -o iosize option to the volume command or, if -o iosize is not specified, the default maximum I/O size. The resynchronization then moves on to the next region until the entire length of the RAID5 volume is resynchronized.

For larger volumes, parity regeneration can take a significant amount of time and it is possible that the system can shut down or crash before the operation is completed. Unless the progress of parity regeneration is kept across reboots, the process starts over again.

To avoid this situation, parity regeneration is checkpointed, meaning that the offset up to which the parity is regenerated is saved in the configuration database. The -o checkpt=size option to the volume command controls how often the checkpoint is saved; if not specified, it defaults to the default checkpoint size. Because saving the checkpoint offset requires a transaction, making the checkpoint size too small can significantly extend the time required to regenerate parity. After a system reboot, a RAID5 volume that has a checkpoint offset smaller than the volume length will start a parity resynchronization at the checkpoint offset.
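
For example, a resynchronization can be started with an explicit region size and checkpoint size. The following is a sketch only; size is a placeholder, and you should confirm the exact option syntax in the volume(8) reference page:

# volume -o iosize=size -o checkpt=size resync r5vol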

8.5.3.2    Stale Subdisk Recovery

Like parity resynchronization, stale subdisk recovery is usually done at volume start time. However, it is possible that the process doing the recovery may get killed, or that the volume was started with an option to prevent subdisk recovery. It's also possible that the disk on which the subdisk resides was replaced without any recovery operations being performed.

To recover a stale subdisk in a RAID5 volume, enter:

# volume recover r5vol disk01-00

To recover all stale subdisks in a RAID5 volume at once, specify only the name of the volume:

# volume recover r5vol

8.5.3.3    Log Plex Recovery

RAID5 log plexes may become detached due to disk failures. To reattach a failed RAID5 log plex, enter:

# volplex att r5vol r5vol-l1

8.6    Startup Problems

The following sections describe LSM command and startup problems and suggest corrective actions.

8.6.1    I/O and System Delays Caused by Disk Failure

When a mirrored LSM disk fails, the system may hang for several minutes before resuming activity.

If you observe long delays in LSM recovery from disk failure, this is usually due to the underlying device driver, not LSM. When an initial I/O operation fails, there may be a delay as the device driver waits or retries the I/O. The length of the delay depends on the particular tolerances for that drive (for example, time for drive spin-up, ECC calculation time, retries and recalibration by the drive, other I/O being handled due to command-tag queuing, bus/device initialization time after reset, and so on).

LSM does not perform additional retries or generate additional delays when an I/O fails on a device. Once the underlying device driver returns an I/O failure error to LSM, LSM processes the error immediately (for example, issues another read to the other plex to recover and mask the error).

To reduce such delays, see the driver documentation for instructions on changing the retry parameter settings.

8.6.2    An LSM Command Fails to Execute

When an LSM command fails to execute, LSM may display the following message:

Volume daemon is not accessible

This message often means that the volume daemon vold is not running.

Verify that the vold daemon is enabled by entering the following command:

# voldctl mode

Output similar to the following is displayed:

mode: enabled

Verify that two or more voliod daemons are running by entering the following command:

# voliod

Output similar to the following is displayed:

2 volume I/O daemons are running
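
If fewer than two voliod daemons are running, you can usually start additional daemons. The following is a sketch, assuming the voliod set operation is supported on your version (see the voliod(8) reference page):

# voliod set 2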

8.6.3    LSM Volume I/O or Mirroring Fails to Complete

Follow these steps if I/O to an LSM volume or mirroring of an LSM volume does not complete:

8.6.4    Failures While Creating Volumes or Adding Disks

When creating a new volume or adding a disk, the operation may fail with the following message:

No more space in disk group configuration

This message could mean that you are out of space in the disk group's configuration database. Check whether any disks are configured with two or more configuration database copies. If all disks with active configuration databases use one configuration database copy within their private region, check whether a disk with a smaller private region can be reconfigured to deactivate the configuration database within that smaller private region.

If all the disks have nconfig set to 1 and private regions of the same size, you can reconfigure existing disks or add new disks with larger private regions.

To determine whether the rootdg disk group is using a disk with more than one configuration database copy, enter:

# voldg list rootdg

Output similar to the following is displayed:

Group:     rootdg  
dgid:      921610896.1026.rio.dec.com  
import-id: 0.1  flags:      
copies:    nconfig=default nlog=default  
config:    seqno=0.1091 permlen=1496 free=1490 templen=3 loglen=226  
config disk dsk7 copy 1 len=1496 state=clean online  
config disk dsk7 copy 2 len=1496 disabled  
config disk dsk8 copy 1 len=2993 state=clean online  
config disk dsk9 copy 1 len=2993 state=clean online  
config disk dsk10 copy 1 len=2993 state=clean online  
log disk dsk7 copy 1 len=226  
log disk dsk7 copy 2 len=226 disabled  
log disk dsk8 copy 1 len=453  
log disk dsk9 copy 1 len=453  
log disk dsk10 copy 1 len=453

To increase the rootdg disk group free space from 1490 to 2987 by changing dsk7 to have 1 configuration database copy instead of 2 within its private region, enter:

# voldisk moddb dsk7 nconfig=1

To display the results, enter:

# voldg list rootdg

Output similar to the following is displayed:

Group:     rootdg  
dgid:      921610896.1026.rio.dec.com  
import-id: 0.1  flags:      
copies:    nconfig=default nlog=default  
config:    seqno=0.1091 permlen=2993 free=2987 templen=3 loglen=453  
config disk dsk7 copy 1 len=2993 state=clean online  
config disk dsk8 copy 1 len=2993 state=clean online  
config disk dsk9 copy 1 len=2993 state=clean online  
config disk dsk10 copy 1 len=2993 state=clean online  
log disk dsk7 copy 1 len=453  
log disk dsk8 copy 1 len=453  
log disk dsk9 copy 1 len=453  
log disk dsk10 copy 1 len=453

You can check the active configuration database sizes on each disk within a disk group to see if you can reconfigure a disk with a smaller private region to deactivate the configuration database within its smaller private region.

Follow these steps to disable a configuration database:

  1. Display the current configuration:

    # voldg list rootdg

    Output similar to the following is displayed:

    Group:     rootdg  
    dgid:      921610896.1026.rio.dec.com  
    import-id: 0.1  flags:      
    copies:    nconfig=default nlog=default  
    config:    seqno=0.1081 permlen=347 free=341 templen=3 loglen=52  
    config disk dsk7 copy 1 len=347 state=clean online  
    config disk dsk8 copy 1 len=2993 state=clean online  
    config disk dsk9 copy 1 len=2993 state=clean online  
    config disk dsk10 copy 1 len=2993 state=clean online  
    log disk dsk7 copy 1 len=52  
    log disk dsk8 copy 1 len=453  
    log disk dsk9 copy 1 len=453  
    log disk dsk10 copy 1 len=453
    

  2. To disable the configuration databases on dsk7, so the rootdg configuration database free size will increase from 341 to 2987, enter:

    # voldisk moddb dsk7 nconfig=0

  3. Display the new configuration:

    # voldg list rootdg

    Output similar to the following is displayed:

    Group:     rootdg  
    dgid:      921610896.1026.rio.dec.com  
    import-id: 0.1  flags:      
    copies:    nconfig=default nlog=default  
    config:    seqno=0.1081 permlen=2993 free=2987 templen=3 loglen=453  
    config disk dsk8 copy 1 len=2993 state=clean online  
    config disk dsk9 copy 1 len=2993 state=clean online  
    config disk dsk10 copy 1 len=2993 state=clean online  
    log disk dsk8 copy 1 len=453  
    log disk dsk9 copy 1 len=453  
    log disk dsk10 copy 1 len=453
    

If all disks have one configuration copy and you cannot deactivate the copies on disks with smaller private regions, you can add and use disks with a larger private region. Follow these steps to specify a private region larger than the default of 4096:

  1. Enter the voldisksetup command with the privlen option to specify a new private region size.

  2. Use the voldisk moddb command as described earlier in this section to deactivate the smaller disks.

Note

For a disk group with 4 or more disks, you should enable and configure at least 4 of the disks to be large enough to contain the disk group's configuration database.

Follow these steps to add 4 disks with larger configuration databases and disable the configuration database on the smaller disks, so only the new disks with the larger private region are used:

  1. Display the current disk group configuration by entering the following command:

    # voldg list rootdg

    Output similar to the following is displayed:

    Group:     rootdg  
    dgid:      921610896.1026.rio.dec.com  
    import-id: 0.1  flags:      
    copies:    nconfig=default nlog=default  
    config:    seqno=0.1091 permlen=2993 free=2987 templen=3 loglen=453  
    config disk dsk7 copy 1 len=2993 state=clean online  
    config disk dsk8 copy 1 len=2993 state=clean online  
    config disk dsk9 copy 1 len=2993 state=clean online  
    config disk dsk10 copy 1 len=2993 state=clean online  
    log disk dsk7 copy 1 len=453  
    log disk dsk8 copy 1 len=453  
    log disk dsk9 copy 1 len=453  
    log disk dsk10 copy 1 len=453
    

  2. Increase the private region size by entering the following commands:

    # voldisksetup -i dsk3 privlen=8192 
    # voldisksetup -i dsk4 privlen=8192 
    # voldisksetup -i dsk12 privlen=8192 
    # voldisksetup -i dsk13 privlen=8192
    

  3. Add the disks to the disk group by entering the following command:

    # voldg adddisk dsk3 dsk4 dsk12 dsk13

  4. Deactivate the smaller disks by entering the following commands:

    # voldisk moddb dsk7 nconfig=0 
    # voldisk moddb dsk8 nconfig=0  
    # voldisk moddb dsk9 nconfig=0  
    # voldisk moddb dsk10 nconfig=0  
    

  5. Display the new configuration by entering the following command:

    # voldg list rootdg

    Output similar to the following is displayed:

    Group:     rootdg  
    dgid:      921610896.1026.rio.dec.com  
    import-id: 0.1  flags:      
    copies:    nconfig=default nlog=default  
    config:    seqno=0.1116 permlen=6017 free=6007 templen=3 loglen=911  
    config disk dsk3 copy 1 len=6017 state=clean online  
    config disk dsk4 copy 1 len=6017 state=clean online  
    config disk dsk12 copy 1 len=6017 state=clean online  
    config disk dsk13 copy 1 len=6017 state=clean online  
    log disk dsk3 copy 1 len=911  
    log disk dsk4 copy 1 len=911  
    log disk dsk12 copy 1 len=911  
    log disk dsk13 copy 1 len=911
    

8.6.5    Mounting a File System or Opening an LSM Volume Fails

If a file system cannot be mounted or an open call on an LSM volume fails, check whether errno is set to EBADF. This could mean that the LSM volume is not started.

To determine whether or not the volume is started, enter:

# volinfo -g rootdg

Output similar to the following is displayed:

vol1         fsgen  Startable
vol-dsk3h    fsgen  Started
vol2         fsgen  Started
swapvol1     gen    Started
rootvol      root   Started
swapvol      swap   Started
 
 

To start volume vol1, enter:

# volume -g rootdg start vol1

8.7    Restoring an LSM Configuration

You use the volrestore command to restore an LSM configuration that you saved when using the volsave command. If you enter the volrestore command with no options, volrestore attempts to restore all disk groups. If you use the -i (interactive) option, volrestore prompts you before restoring each disk group.

Before the volrestore command restores the LSM configuration, it validates the checksum that is part of the description set.

By default, the volrestore command restores the whole configuration, using the description set in the directory under /usr/var/lsm/db that has the latest timestamp. You can specify options to use a different directory or to restore a specific volume or disk group. For example, the following command restores only the volume called myvol01 in the staffdg disk group:

# volrestore -g staffdg -v myvol01

When you restore a specific disk group, the volrestore command attempts to reimport the disk group based on configuration information on disks that belong to that disk group. If the import fails, volrestore recreates the disk group by reinitializing all disks within that disk group and recreating all volumes, unassociated plexes, and unassociated subdisks, based on information in the volmake description file, allvol.DF.

Notes

The volrestore command does not restore volumes associated with the root, /usr, and /var file systems and the primary swap area. These partitions must be reencapsulated to use LSM volumes.

See the Tru64 UNIX Clusters documentation before using volrestore in a Tru64 UNIX cluster environment.

When you restore a complete LSM configuration, the volrestore command attempts to reenable vold based on the configuration databases found on the rootdg disks. If the complete LSM configuration does not need to be restored, you can use the -i (interactive) option with volrestore. The volrestore command then prompts you before restoring each file, enabling you to skip specific disk groups.
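
For example, to be prompted before each disk group is restored, enter:

# volrestore -i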

If vold cannot be enabled, you are given the option of recreating the rootdg disk group and any other disk groups using the other files in the saved LSM description set. The rootdg disk group is recreated first, and vold is put in the enabled mode. Then, the other disk groups are enabled. The disk groups are recreated by first attempting to import them based on available disks in that disk group. If the import fails, the disk group is reinitialized and all volumes in that disk group are also recreated based on the volmake description files.

When volumes are restored using the volmake description file, the plexes are created in the DISABLED EMPTY state. The volrestore command does not attempt to start or enable such volumes. You must use volmend or volume to set the plex states appropriately before starting the volume. The volrestore command warns you to check the state of each disk associated with a volume before using volmend to set plex states, to carefully determine which disks in the LSM configuration could have had failures since the LSM configuration was saved, and to use volmend to mark the plexes on those disks as STALE. In addition, any plex that was detached or disabled at any point during or after the saving of the LSM configuration should be marked STALE using volmend.
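
For example, assuming the volmend fix operation also accepts a stale state on your version (check the volmend(8) reference page), you might mark a suspect plex STALE, mark a known-good plex CLEAN, and then start the volume; plex_name and volume_name are placeholders:

# volmend fix stale plex_name
# volmend fix clean plex_name
# volume start volume_name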

To restore a disk group called dg1, enter the following command, and the system will display output similar to this example:

# volrestore -g dg1
Using LSM configuration from /usr/var/lsm/db/LSM.19991226203620.skylark 
Created at Tue Dec 26 20:36:30 EST 1999 on HOST skylark  
 
Would you like to continue ? [y,n,q,?] (default: n) y
Working .    
Restoring dg1    
vol1 in diskgroup dg1 already exists. (Skipping ..)    
vol2 in diskgroup dg1 already exists. (Skipping ..)    
vol3 in diskgroup dg1 already exists. (Skipping ..)

8.7.1    Conflicts While Restoring the Configuration

When volrestore executes, it can encounter conflicts in the LSM configuration, for example, if another volume uses the same plex name or subdisk name, or the same location on a disk. When volrestore finds a conflict, it displays error messages and the configuration of the volume, as found in the saved LSM description set. In addition, it removes all volumes created in that disk group during the restoration. The disk group that had the conflict remains imported, and volrestore continues to restore other disk groups.

If volrestore fails because of a conflict, you can use the volrestore -b option to do the best possible restoration in a disk group. You will then have to resolve the conflicts and restore the volumes in the affected disk group.

8.7.2    Failures in Restoring the Configuration

The restoration of volumes fails if one or more disks associated with the volumes are unavailable, for example due to disk failure. This, in turn, causes the restoration of a disk group to fail. To restore the LSM configuration of a disk group, enter:

# volrestore -b -g diskgroup

The volumes associated with the failed disks can then be restored by editing the volmake description file to remove the plexes that use the failed disks. Note that editing the description file affects the checksum of the files in the backup directory, so you must override the checksum validation by using the -f option.
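
For example, after editing the description file for the dg1 disk group shown earlier, you might combine the -b and -f options to do a best-effort restoration that bypasses the checksum validation:

# volrestore -b -f -g dg1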

8.8    Reinstalling the Operating System

If you reinstall the operating system, LSM-related information, such as the data in the LSM private areas on the reinstalled disks (containing the disk identifier and copies of the LSM configuration), is removed, which makes those disks unusable to LSM. The only volumes saved are those that reside on, or have copies on, disks that are not directly involved with the reinstallation. Volumes on disks involved with the reinstallation are lost during reinstallation. If backup copies of these volumes are available, you can restore them after reinstallation. The system root disk is always involved in reinstallation.

To reinstall the operating system and recover the LSM configuration, you need to:

  1. Prepare the system for the installation. This includes replacing any failed disks or other hardware, and detaching any disks not involved in the reinstallation.

  2. Install the operating system.

  3. Recover the LSM configuration.

  4. Complete the configuration by restoring information in volumes affected by the reinstallation and by recreating system volumes (such as rootvol, swapvol, and usr).

8.8.1    Preparing the System for the Operating System Reinstallation

To prevent the loss of data on disks not involved in the reinstallation, you should only involve the root disk in the reinstallation procedure. It is recommended that any other disks (that contain volumes) be disconnected from the system before you start the reinstallation procedure.

Disconnecting the other disks ensures that they are unaffected by the reinstallation. For example, if the operating system was originally installed with a file system on the second drive, the file system may still be recoverable. Removing the second drive ensures that the file system remains intact.

8.8.2    Reinstalling the Operating System

After failed or failing disks are replaced and disks uninvolved with the reinstallation are detached, reinstall the operating system and LSM as described in the Installation Guide.

While the operating system installation progresses, make sure no disks other than the root disk are accessed in any way. If anything is written on a disk other than the root disk, the LSM configuration on that disk could be destroyed.

8.8.3    Recovering the LSM Configuration

Use the volrestore procedure to recover the LSM configuration information that was previously saved with volsave. If the LSM configuration information cannot be restored using volrestore, use the following procedure to reinitialize LSM.

Warning

Executing the volsetup command with the -o force option destroys any existing LSM configuration information on a system.

Once the LSM subsets have been loaded, recover the LSM configuration by doing the following:

  1. Shut down the system.

  2. Physically reattach the disks that were removed from the system.

  3. Reboot the system. When the system comes up, make sure that all disks are configured in the kernel and that special device files have been created for the disks.

  4. Run the volsetup script. This script checks for an existing LSM configuration and starts LSM if one exists. If an existing configuration is found, the script displays the following message:

    
    LSM has detected the presence of an existing configuration.
    Check the current configuration and use '-o force' option to
    destroy the existing configuration if necessary.
    

  5. Recreate the LSM configuration. If the LSM configuration was previously saved using the volsave command, use the volrestore command. Otherwise, you must recreate the volumes, plexes, subdisks, disks, and disk groups using the procedures described in Chapter 5.

  6. Restore the volume's data using the appropriate backup and restore command. For example, to restore an AdvFS or UFS file system that was backed up with the vdump command, you would use the vrestore command.

  7. If the root file system, swap partition, and/or usr file system were previously under LSM control, you can reconfigure the system disk under LSM control and mirror the disk using the procedures described in Chapter 4.

The configuration preserved on the disks not involved with the reinstallation has now been recovered. However, because the root disk has been reinstalled, it appears to LSM as a non-LSM disk. Therefore, the configuration of the preserved disks does not include the root disk as part of the LSM configuration.

Note

If the root disk of your system and any other disk involved in the reinstallation were not under LSM control at the time of failure and reinstallation, then the reconfiguration is complete at this point. If other disks containing volumes or volume plexes are to be replaced, follow the replacement procedures in Section 8.2.2.2.

8.8.4    Completing the Configuration

If the boot disk (or another disk) was involved with the reinstallation, any volumes or volume plexes on that disk (or other disks no longer attached to the system) are now inaccessible. If a volume had only one plex (contained on a disk that was reinstalled, removed, or replaced), then the data in that volume is lost and must be restored from backup. In addition, the system's root file system and swap area are no longer located on volumes.

8.8.4.1    Removing the Root and Swap Volumes

Remove the volumes associated with the root and swap areas, and their associated disks. This must be done if the root disk was under LSM control prior to reinstallation. The volumes to remove are rootvol (the root volume) and swapvol (the swap volume).

Follow these steps to remove the rootvol and swapvol volumes:

  1. Stop the root and swap volumes and remove them by entering the following commands:

    # volume stop rootvol 
    # voledit -r rm rootvol 
    # volume stop swapvol 
    # voledit -r rm swapvol
    

  2. Remove the LSM disks used by rootvol and swapvol. For example, if disk dsk3 was associated with rootvol and disk dsk3b was associated with swapvol:

    # voldg rmdisk dsk3 dsk3b 
    # voldisk rm dsk3 dsk3b
    

8.8.4.2    LSM Volumes for /usr and /var Partitions

If the /usr and /var partitions were on LSM volumes prior to the reinstallation, remove the LSM disks associated with them, as shown in the previous example for rootvol and swapvol.
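
For example, assuming /usr had been on an LSM volume called usrvol using the LSM disk dsk3g (hypothetical names; substitute the names in your configuration), the sequence mirrors the rootvol and swapvol example:

# volume stop usrvol
# voledit -r rm usrvol
# voldg rmdisk dsk3g
# voldisk rm dsk3g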

8.8.4.3    Restoring Volumes from Backup

After configuring the volumes, you must determine which volumes need to be restored from backup. The volumes to be restored include any volumes that had all of their plexes residing on disks that were removed or reinstalled. These volumes are invalid and must be removed, recreated, and restored from backup. If only some plexes of a volume exist on reinitialized or removed disks, these plexes must be removed. The plexes can be re-added later.

Follow these steps to restore the volumes:

  1. Establish which LSM disks have been removed or reinstalled:

    # voldisk list

    Output similar to the following is displayed:

    DEVICE  TYPE       DISK        GROUP       STATUS
    dsk0     sliced     -           -           error
    dsk1     sliced     disk02      rootdg      online
    dsk2     sliced     disk03      rootdg      online
    -       -           disk01      rootdg      failed was:  dsk0
    

    This output shows that the reinstalled root device, dsk0, is not recognized as an LSM disk and is marked with a status of error. disk02 and disk03 were not involved in the reinstallation and are recognized by LSM and associated with their devices (dsk1 and dsk2). The former disk01, the LSM disk that had been associated with the replaced disk device, is no longer associated with the device (dsk0). If there had been other disks (with volumes or volume plexes on them) removed or replaced during reinstallation, those disks would also have a disk device in the error state and an LSM disk listed as not associated with a device.

  2. Once you know which disks are removed or replaced, display the plexes on disks with a status of failed:

    # volprint -sF "%vname" -e 'sd_disk = "<disk>"'

    In this command, replace <disk> with the name of a disk that has a failed status.

    Note

    Be sure to enclose the disk name in quotes in the command. Otherwise, the command displays an error message.

    The volprint command displays a list of volumes that have plexes on the failed disk. Repeat this command for each disk with a failed status.
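
    For example, for the failed LSM disk disk01 shown in the previous step, you might enter:

    # volprint -sF "%vname" -e 'sd_disk = "disk01"'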

  3. Check the status of each volume by entering the following command:

    volprint -th volume_name

    For example, to display information about a volume called fnah, enter:

    # volprint -th fnah

    Output similar to the following is displayed:

    V  NAME     USETYPE  KSTATE   STATE    LENGTH READPOL  PREFPLEX
    PL NAME     VOLUME   KSTATE   STATE    LENGTH LAYOUT ST-WIDTH MODE
    SD NAME     PLEX     PLOFFS   DISKOFFS LENGTH DISK-MEDIA   ACCESS
     
    v  fnah      fsgen    DISABLED ACTIVE   24000  SELECT   -
    pl fnah-01   fnah     DISABLED NODEVICE 24000  CONCAT   -
    sd disk01-06 fnah-01  0        519940   24000  disk01   -
    

  4. In this output, the only plex of the volume is shown in the line beginning with pl. The STATE field for the plex called fnah-01 is NODEVICE. The plex has space on a disk that was replaced, removed, or reinstalled. Therefore, the plex is no longer valid and you must remove it.

    Because the fnah-01 plex was the only plex of the volume, the volume contents are irrecoverable except by restoring the volume from a backup. You must also remove the volume. If a backup copy of the volume exists, you can restore the volume later. Keep a record of the volume name and its length; you will need them for the backup procedure.

    Remove the volume by entering the following command:

    voledit -r rm volume_name

    For example, to remove a volume called fnah, enter:

    # voledit -r rm fnah

    It is possible that only part of a plex is located on the failed disk. If the volume has a striped plex associated with it, the volume is divided among several disks. For example, the volume called vol01 has one plex striped across three disks, one of which is the reinstalled disk disk01. The volprint -th command for vol01 displays output similar to the following:

    V  NAME       USETYPE  KSTATE   STATE    LENGTH  READPOL  PREFPLEX
    PL NAME       VOLUME   KSTATE   STATE    LENGTH  LAYOUT   ST-WIDTH MODE
    SD NAME       PLEX     PLOFFS   DISKOFFS LENGTH  DISK-MEDIA        ACCESS
     
    v  vol01      fsgen    DISABLED ACTIVE   4224    SELECT   -
    pl vol01-01   vol01    DISABLED NODEVICE 4224    STRIPE   128    RW
    sd disk02-02  vol01-01 0        14336    1408    disk02          dsk1
    sd disk01-05  vol01-01 1408     517632   1408    disk01   -
    sd disk03-01  vol01-01 2816     14336    1408    disk03          dsk2
    

    This output shows three disks, across which the plex vol01-01 is striped (the lines starting with sd represent the stripes). The second stripe area is located on the LSM disk called disk01. This disk is no longer valid, so the plex called vol01-01 has a state of NODEVICE. Because this is the only plex of the volume, the volume is invalid and must be removed. If a copy of vol01 exists on the backup media, it can be restored later.

    Note

    Keep a record of the volume name and length of any volumes you intend to restore from backup.

    Use the voledit command to remove the volume, as described earlier.
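
    For example, to remove the vol01 volume shown in the previous output, enter:

    # voledit -r rm vol01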

    A volume that has one plex on a failed disk may also have other plexes on disks that are still valid. In this case, the volume does not need to be restored from backup, because the data is still available on the valid plexes. For a volume with one plex on a failed disk (disk01) and another plex on a valid disk (disk02), the volprint -th command displays output similar to the following:

    V  NAME   USETYPE  KSTATE   STATE    LENGTH  READPOL  PREFPLEX
    PL NAME   VOLUME   KSTATE   STATE    LENGTH  LAYOUT   ST-WIDTH  MODE
    SD NAME   PLEX     PLOFFS   DISKOFFS LENGTH  DISK-MEDIA         ACCESS
     
    v  foo       fsgen    DISABLED ACTIVE   10240   SELECT   -
    pl foo-01    foo      DISABLED ACTIVE   10240   CONCAT   -      RW
    sd disk02-01 foo-01   0        0        10240   disk02   dsk1
    pl foo-02    foo      DISABLED NODEVICE 10240   CONCAT          RW
    sd disk01-04 foo-02   0        507394   10240   disk01   -
    

    This volume has two plexes, foo-01 and foo-02. The first plex, foo-01, does not use any space on the invalid disk, so it can still be used. The second plex, foo-02, uses space on the invalid disk, disk01, and has a state of NODEVICE; it must be removed. However, the volume still has one valid plex containing valid data. If the volume needs to be mirrored, you can add another plex later. Note the name of the volume if you intend to do so.

    To remove an invalid plex, you must dissociate the plex from the volume and then remove it. To remove the plex called foo-02, enter:

    # volplex -o rm dis foo-02

  5. Once all the volumes are cleaned up, you must clean up the disk configuration as described in the following section.

8.8.4.4    Disk Cleanup

Once all invalid volumes and volume plexes are removed, the disk configuration can be cleaned up. Each disk that was removed, reinstalled, or replaced (as determined from the output of the voldisk list command) must be removed from the configuration.

To remove a disk from the LSM configuration, use the voldg command. For example, to remove the failed disk01, enter:

# voldg rmdisk disk01

If the voldg command returns an error message, invalid volume plexes still exist. Repeat the process described in Section 8.8.4.3 until all invalid volumes and volume plexes are removed.
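
For example, after removing the failed disk01, you can rerun voldisk list to verify that it no longer appears. If the replaced device (dsk0 in the earlier example) is still listed with a status of error, you can also remove its disk access record (a sketch; voldisk rm removes only the LSM record for the device, not any data):

# voldisk list
# voldisk rm dsk0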

8.8.4.5    Reconfiguring the root Volume

Once all the invalid disks are removed, you can replace or reinstall disks and add them to LSM control. If the root disk was originally under LSM control (that is, the root file system and the swap area were on volumes), or if you now want to put the root disk under LSM control, add this disk first. For example:

# /usr/sbin/volencap devname

See Chapter 4 for more information.

When the encapsulation is complete, reboot the system to multiuser mode.
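
For example (any standard method of rebooting to multiuser mode can be used):

# shutdown -r now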

8.8.4.6    Reconfiguring Volumes

After the boot disk is encapsulated, you can replace the other disks. If the disks were reinstalled during the operating system reinstallation, encapsulate them; otherwise, add them.
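
For example, to add a disk back under LSM control without encapsulating it, you might use the voldiskadd command (a sketch; dsk3 is a hypothetical device, and voldiskadd prompts for the disk group and disk media name):

# voldiskadd dsk3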

Once the disks are added to the system, you can recreate the volumes that were removed and restore their contents from backup.

To recreate the volumes fnah and vol01, enter:

# volassist make fnah 24000  
# volassist make vol01 4224 layout=stripe nstripe=3

To replace the plex removed from the volume foo using volassist, enter:

# volassist mirror foo

Once you restore the volumes and plexes, the recovery is complete and your system should be configured as it was prior to reinstalling the Tru64 UNIX operating system.

8.9    Deconfiguring Additional Swap

Follow these steps to deconfigure and remove additional swap volumes that were previously configured for use with the LSM software:

  1. Deconfigure the swap space so that it no longer uses the LSM volumes. You can do this by updating the vm:swapdevice entry in the /etc/sysconfigtab file so that it no longer references the LSM volumes. If the swap space was configured in the /etc/fstab file, update that file accordingly.
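
    For example, after the edit, the vm stanza in /etc/sysconfigtab might look similar to the following (a sketch; the remaining swap partition name is illustrative, and the LSM volume entries have been removed):

    vm:
        swapdevice=/dev/disk/dsk0b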

    See the System Administration guide and the swapon(8) reference page for more information.

  2. Reboot the system for the change to take effect.

  3. Stop and remove the volumes. For example, to stop and remove a volume called swapvol1, enter:

    # volume stop swapvol1 
    # voledit -rf rm swapvol1
    

8.10    Removing the LSM Software

Follow these steps to deconfigure and remove LSM from a system:

Warning

Deconfiguring LSM causes any data currently under LSM control to become inaccessible and effectively lost. Unencapsulate or back up any needed data before proceeding.

  1. Reconfigure any system file systems and swap space so that they no longer reside on LSM volumes. If root and swap are configured under LSM, enter the volunroot command and reboot the system. Also unencapsulate /usr and /var if they are configured under LSM; see Chapter 4 if /usr and /var are encapsulated along with root and swap. If additional swap space was configured on LSM volumes, deconfigure it as described in Section 8.9.

  2. Unmount any other file systems that use LSM volumes so that all LSM volumes can be closed. If necessary, update the /etc/fstab file so that no file systems are mounted on LSM volumes. Stop any applications that use raw LSM volumes and reconfigure them so that they no longer use LSM volumes.
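
    For example, to unmount a file system that is mounted on an LSM volume (the mount point /data is hypothetical):

    # umount /data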

  3. Make a note of which disks are currently configured under LSM by entering the following command:

    # voldisk list

  4. Once the LSM volumes are no longer in use, restart LSM in disabled mode by entering the following command:

    # vold -k -r reset -d

    This command fails if any volumes are open.
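
    If the command fails because some volumes are still open, you can review the remaining volumes and their states, for example:

    # volprint -ht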

  5. Stop the LSM I/O daemons and the volume configuration daemon by entering the following commands:

    # voliod -f set 0  
    # voldctl stop
    

  6. Update the disk labels using the list of LSM disks from step 3. For each disk that was previously configured under LSM as a sliced disk (that is, the entire disk was under LSM), rewrite the disk label using the disklabel command with the -rw option. For example:

    # disklabel -rw dsk4  
    # disklabel -rw dsk5
    

    For each disk partition that was configured under LSM as a simple disk, update the partition's fstype to unused using the -s option with the disklabel command. For example:

    # disklabel -s dsk6c unused

    Also update the fstype field for any disk partitions that were previously under LSM as nopriv disks, setting it either to unused or to the appropriate value, depending on whether the partition still contains valid data. For example, if dsk2g was an LSM nopriv disk that still contains a valid UNIX file system and dsk2h was an LSM nopriv disk that no longer contains valid data, enter:

    # disklabel -s dsk2g 4.2BSD 
    # disklabel -s dsk2h unused  
    

  7. Remove the LSM device directories and configuration files, including the /etc/vol/volboot file, by entering the following command:

    # rm -r /dev/vol /dev/rvol /etc/vol

  8. Delete the following LSM entries in the /etc/inittab file:

    lsmr:s:sysinit:/sbin/lsmbstartup -b </dev/console >/dev/console 2>&1 ##LSM     
    lsm:23:wait:/sbin/lsmbstartup </dev/console >/dev/console 2>&1 ##LSM     
    vol:23:wait:/sbin/vol-reconfig -n </dev/console >/dev/console 2>&1 ##LSM 
     
    

  9. Display the installed LSM subsets by entering the following command:

    # setld -i | grep LSM

    The output shows the installed LSM subsets.

  10. Delete the installed LSM subsets shown in the previous step by entering the setld -d command. For example:

    # setld -d OSFLSMBASE500 OSFLSMBIN500 OSFLSMCLSMTOOLS500

  11. Deconfigure LSM from the kernel. For example, for a system named rio, change the pseudo-device lsm 1 entry in the /sys/conf/RIO file to pseudo-device lsm 0.
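
    For example, the edited entry in the /sys/conf/RIO file would read as follows (the spacing between fields is not significant):

    pseudo-device   lsm     0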

    You can make this change either before running the doconfig command or while running it. For example:

    # doconfig -c RIO

  12. Copy the new kernel to the root directory and reboot the system by entering the following commands:

    # cp /sys/RIO/vmunix / 
    # shutdown -r now