14 Recovering from Errors

This chapter explains some of the procedures you can follow to recover from errors.

14.1 Protecting Your System

There are a several steps that you can take to prevent loss of data and to make it easier to recover your system, in case of failure:

Do regular backups.
Backups are necessary, in case all copies of a volume are lost or corrupted in some way. For example, a power surge could damage several (or all) disks on your system. See Section 7.3.5 for information on how you can use the volassist command to reduce backup downtime.
Mirror your root disk on a second disk, and, at the console prompt, set both a primary and an alternate boot device. See Chapter 5 for information on how to do this.
Create and use volumes that have at least two mirrors (plexes).
Put the mirrored plexes on different disks, and preferably on different controllers. By mirroring drives critical to booting, you ensure that no single disk failure will leave your system unusable.
The volassist utility locates the plexes such that the loss of one disk will not result in a loss of data. Note that you can edit the file /etc/default/volassist to set the default number of plexes for newly created volumes to two.
Use the volsave command to save copies of your LSM configuration files, in case you need to recreate the configuration.

14.2 Monitoring LSM Events

LSM provides the volwatch, volnotify, and voltrace commands to monitor LSM events and configuration changes.

The volwatch shell script is started automatically when you install LSM. This script sends mail to the root login when certain LSM configuration events occur, such as a plex detach caused by a disk failure.

The volwatch script sends mail to root by default. You can specify another login as the mail recipient.

If you need to restart volwatch, use the following command:

# volwatch root

The volnotify command is useful for monitoring disk and configuration changes and for creating customized scripts similar to /usr/sbin/volwatch.

The voltrace command provides a trace of physical or logical I/O events or error events.

For further information, refer to the volnotify(8), volwatch(8), and voltrace(8) reference pages.

14.3 Handling Common Problems

The following sections describe some of the more common problems that LSM users might encounter and suggests corrective actions.

14.3.1 An LSM Command Fails to Execute

When an LSM command fails to execute, LSM may display the following message:

Volume daemon is not accessible

This message often means that the volume daemon vold is not running.

To correct the problem, try to restart vold. Refer to Section 14.4 for detailed instructions.

14.3.2 The vold Daemon Fails to Restart

If the vold daemon fails to restart (either during system reboot or from the command line), the following message may be displayed:

lsm:vold: Error: enable failed: Error in disk group configuration copies No valid disk found containing disk group; transactions are disabled.

This message could imply that the /etc/vol/volboot file has no valid disks that are in the rootdg diskgroup.

To correct the problem, update the /etc/vol/volboot file by adding disks that belong to the rootdg disk group and have a configuration copy. Then, restart vold. For example:

# voldctl add disk rz8h # voldctl add disk rz9 # vold -k

14.3.3 LSM Volume I/O or Mirroring Fails to Complete

If I/O to a LSM volume or mirroring of a LSM volume does not complete, check whether or not the LSM error daemon, voliod, is running on the system. Refer to Section 14.5 for details.

14.3.4 Creating a Volume or Adding a Disk Fails

When creating a new volume or adding a disk, the operation may fail with the following message:

No more space in disk group configuration

This often means that you are out of room in the disk group's configuration database. Refer to Section 6.3.8 and Section 6.3.9 for more information.

14.3.5 Mounting a File System or Opening an LSM Volume Fails

If a file system cannot be mounted or an open function on an LSM volume fails, check if errno is set to EBADF. This could mean that the LSM volume is not started.

Use the volinfo command to determine whether or not the volume is started. For example:

# volinfo -g rootdg vol1 fsgen Startable vol-rz3h fsgen Started vol2 fsgen Started swapvol1 gen Started rootvol root Started swapvol swap Started

To start volume vol1 you would enter the following command:

# volume -g rootdg start vol1

Refer to Section 7.6.4 and Section 14.9 for further information

14.4 Ensuring the Volume Configuration Daemon (vold) is Running

Before any LSM operations can be performed, the vold daemon must be running. Typically, the vold daemon is configured to start automatically during the reboot procedure. Perform the following steps to determine the state of the volume daemon:

Determine if the volume daemon is running and enabled by entering the voldctl mode command as follows:

# voldctl mode

If... Then...

The vold daemon is both running and enabled The following message displays:
mode:enabled

The vold daemon is running, but is not enabled The following message displays:
mode:disabled

The vold daemon is not running The following message displays:
mode:not-running

If necessary, enable the volume daemon by entering the voldctl enable command:
# voldctl enable
If necessary, start the volume daemon by entering the vold command:
# vold

For additional information about the vold daemon, refer to the vold(8) reference page.

14.5 Ensuring the Volume Extended I/O Daemon (voliod) is Running

Volume log I/O voliod kernel threads are started by the vold daemon (if block-change logging is enabled) and are killed by the kernel when these threads are no longer needed. Volume error kernel threads are started automatically by LSM startup procedures. Rebooting after your initial installation should start the voliod error daemon automatically.

Note
Digital recommends that there be at least as many voliod error daemons as the number of processors on the system.

You can perform these steps to determine the state of the error daemon:

Verify that the error daemon is running and enabled by entering the following command:

# voliod

If... Then...

Any voliod processes are running The following message displays:
n "volume I/O daemons running"
The n symbol in the previous example indicates the number of voliod daemons running.

There are no voliod daemons currently running Start some daemons by entering the following command: # voliod set 2

If necessary, enable the volume error daemon by entering the following command:

# voliod set 2

For more detailed information about the voliod daemon, refer to the voliod(8) reference page.

14.6 Problems Encapsulating the Root and Swap Partitions

The following sections describe two recovery procedures you can try if problems occur during the encapsulation procedure described in Section 5.2.

If something goes wrong during conversion from the root partition to the LSM root volume, you use a process called zapping to undo kernel changes that were made as a result of encapsulating the root disk. Zapping is described in Section 14.6.1.
If booting to multiuser mode is impossible after root encapsulation has succeeded, you can allow booting from the physical disk partition. See Section 14.6.2 for information about how to boot from the physical disk partition so that you can perform root maintenance.

14.6.1 Unencapsulating the Root Disk

If something goes wrong during the conversion from the root partition to the root LSM volume, the encapsulation procedure tries to back out all changes made, and restores the use of partitions for the root file system. Under some circumstances, you might need to manually undo the changes made as a result of encapsulating the root disk.

The following steps describe how to manually reset the changes made during root encapsulation:

Boot the system to single-user mode.
Enter the following command:
# voldctl -z
Mount the root partition as follows:
- If the root file system is UFS, mount the root partition with the following command:
  # mount -u /dev/rzXa /
- If the root file system is AdvFS, mount the root partition with the following command:
  # mount -u /
Edit the /etc/fstab file as follows:
- If the root file system is UFS, change the device-special file from /dev/vol/rootdg/rootvol to the a partition of the boot disk. Change the primary swap device from /dev/vol/rootdg/swapvol to the block-device file of the swap partition.
- If the root file system is AdvFS, enter the following commands:
  # cd /etc/fdmns/root_domain
  # rm rootvol
  # ln -s /dev/rzxa rzxa
  
  Change the primary swap device from /dev/vol/rootdg/swapvol to the block-device file of the swap partition.
Edit the /etc/sysconfigtab file and change the LSM entry from
lsm_rootdev_is_volume = 1
lsm_swapdev_is_volume = 1
to
lsm_rootdev_is_volume = 0
lsm_swapdev_is_volume = 0
Change the /sbin/swapdefault file (if it exists) to be a link to the swap partition's device-special file. For example, if the disk rz8b is the swap partition, enter the following commands:
# mv /sbin/swapdefault /sbin/swapdefault.swapvol

# ln -s /dev/rz8b /sbin/swapdefault
Remove files that were related to the conversion:
# rm -rf /etc/vol/reconfig.d/disk.d/*

# rm -rf /etc/vol/reconfig.d/ disks-cap-part
Reboot the system on the same boot disk. The system will reboot using disk partitions for root and swap.

14.6.2 Performing Root Maintenance

If you encounter problems in which booting to multiuser mode is impossible, you can use the following steps to allow booting from the physical disk partition, so that you can perform maintenance to fix the problem:

Use the step-by-step instructions for zapping in Section 14.6.1.
After the system has rebooted, use the volmend utility to set the good plex in your rootvol volume to ACTIVE. Refer to the volmend(8) reference page for information about fixing the volume.
After fixing the problem, undo the changes that you made in steps 4 through 6 in Section 14.6.1.
Reboot the system.

14.7 Recovering from Boot Disk Failure

When the boot disk is mirrored, failures occurring on the original boot disk are transparent to all users. However, during a failure, the system might do one or both of the following:

Write a message to the console indicating there was an error reading or writing to the plex on the boot disk.
Suffer from slow performance (depending on the problem encountered with the disk containing one of the plexes in the root or swap volumes).

To reboot the system before the original boot disk is repaired, you can boot from any disk that contains a valid root and swap volume plex. Chapter 5 shows how to set an alternate boot device from your system console.

If all copies of rootvol are corrupted, and you cannot boot the system, you must reinstall the system. Refer to Section 14.11 for details. docroff: ignoring superfluous symbol replace_failed_disks

14.7.1 Re-adding and Replacing Boot Disks

Normally, replacing a failed disk is as simple as putting a new disk somewhere on the controller and running LSM replace disk commands. It's even possible to move the data areas from that disk to available space on other disks, or to use a "hot spare" disk already on the controller to replace the failure. For data that is not critical for booting the system, it doesn't matter where the data is located. All data that is not boot critical is only accessed by LSM after the system is fully operational. LSM can find this data for you. On the other hand, boot-critical data must be placed in specific areas on specific disks in order for the boot process to find it.

When a disk fails, there are two possible routes that can be taken to correct the action. If the errors are transient or correctable, then the same disk can be re-used. This is known as re-adding a disk. On the other hand, if the disk has truly failed, then it should be completely replaced.

14.7.1.1 Re-adding A Failed Boot Disk

Re-adding a disk is the same procedure as replacing a disk, except that the same physical disk is used. Usually, a disk that needs to be re-added has been detached, meaning that LSM has noticed that the disk has failed and has ceased to access it.

If the boot disk has a transient failure, its plexes can be recovered using the following steps. The rootvol and swapvol volumes can have two or three LSM disks per physical disk, depending on the layout of the original root disk.

Enter the voldisk command to list the LSM disks that are associated with the failed physical disk. For example:

# voldisk list

DEVICE       TYPE      DISK     GROUP     STATUS
rz10         sliced     -         -       error
rz10b        nopriv     -         -       error
rz10f        nopriv     -         -       error
rz21         sliced    rz21     rootdg    online
rz21b        nopriv    rz21b    rootdg    online
-              -       rz10     rootdg   removed was:rz10
-              -       rz10b    rootdg   removed was:rz10b
-              -       rz10f    rootdg   removed was:rz10f

In this example, if rz10 was the failed boot disk, then you can assume that rz10, rz10b, and rz10f are the LSM disks associated with the physical disk rz10.

Enter the following commands to add the LSM disks back to the rootdg disk group:
# voldisk online rz10 rz10b rz10f

# voldg -k adddisk rz10=rz10

# voldg -k adddisk rz10b=rz10b

# voldg -k adddisk rootrz10=rz10f
After the disks have been added to the rootdg disk group, enter the volrecover command to resynchronize the plexes in the rootvol and swapvol volumes. For example:
# volrecover -sb rootvol swapvol

14.7.1.2 Replacing a Failed Boot Disk

If a boot disk that is under LSM control fails and you are replacing it with a new disk, perform the following steps:

Disassociate the plexes on the failed disk from rootvol and swapvol.
Remove the failed LSM disks from the disk group. Refer to volplex(8), voldg(8), and voldisk(8) for more information about how to accomplish this.
Mirror the rootvol and swapvol volumes onto the new disk, as described in Section 5.3.1. The replacement disk should have at least as much storage capacity as was in use on the old disk.

14.7.2 Stale or Unusable Plexes on Boot Disk

If a disk is unavailable when the system is running, any plexes of volumes that reside on that disk will become stale, meaning the data on that disk is out of date relative to the other plexes of the volume.

During the boot process, the system accesses only one copy of the root and swap volumes (the copies on the boot disk) until a complete configuration for those volumes can be obtained. If it turns out that the plex of one of these volumes that was used for booting is stale, the system must be rebooted from a backup boot disk that contains nonstale plexes. This problem can occur, for example, if the boot disk was replaced and restarted without adding the disk back into the LSM configuration. The system will boot normally, but the plexes that reside on the newly powered disk will be stale.

Another possible problem can occur if errors in the LSM headers on the boot disk prevents LSM from properly identifying the disk. In this case, LSM will be unable to know the name of that disk. This is a problem because plexes are associated with disk names, and therefore any plexes on that disk are unusable.

If either of these situations occurs, the LSM daemon vold will notice it when it is configuring system as part of the init processing of the boot sequence. It will output a message describing the error, describe what can be done about it, and halt the system. For example, if the plex rootvol-01 of the root volume rootvol on disk disk01 of the system was stale, vold would print the following message:

lsm:vold: Warning Plex rootvol-01 for root volume is stale or unusable.
lsm:vold: Error: System boot disk does not have a valid root plex
Please boot from one of the following disks:

 

        Disk: disk02                     Device: rz2

 

lsm:vold: Error: System startup failed

This informs the administrator that the disk disk02 contains usable copies of the root and swap plexes and should be used for booting. This is the name of the system backup disk. When this message appears, the administrator should reboot the system from a backup boot disk.

Once the system has booted, the exact problem needs to be determined. If the plexes on the boot disk were simply stale, they will be caught up automatically as the system comes up. If, on the other hand, there was a problem with the private area on the disk, the administrator will need to re-add or replace the disk.

If the plexes on the boot disk were unavailable, the administrator should get mail from the LSM volwatch utility describing the problem. Another way to discover the problem is by listing the disks with the voldisk utility. In the previous example, if the problem is a failure in the private area of disk01 (such as due to media failures or accidentally overwriting the LSM private region on the disk), enter the following command:

# voldisk list

This command produces the following output:

DEVICE      TYPE      DISK     GROUP     STATUS
-           -         disk02   rootdg    failed was:  rz1
rz2         sliced    disk02   rootdg    online

14.7.3 Crash Dumps

If a system failure occurs, the system console writes a crash dump to the boot disk. However, if the original boot disk has had a problem such that the corresponding plex in the root or swap volumes has been disabled, then the crash dump is written to the first available plex in the swap volume. The system reports the name of the disk that has the crash dump by printing a message on the system console.

For example, the following messages are printed to the console along with other dump information:

WARNING: LSM: Original dump device not found LSM attempting to dump to SCSI device unit number rz1

To obtain the crash dump when the system reboots, you must boot the system from the disk that contains the crash dump.

14.8 Recovering from Disk Problems

The following sections describe recovery procedures for problems related to LSM disks.

14.8.1 Detecting Failed Disks

If one plex of a volume encounters a disk I/O failure (for example, because the disk has an uncorrectable format error), one of the the following may happen:

LSM may detach the plex.
If a plex is detached, I/O stops on that plex but continues on the remaining plexes of the volume.
If a disk fails completely, LSM may detach the disk from its disk group.
If a disk is detached, all plexes on the disk are disabled. If there are any unmirrored volumes on a disk when it is detached, those volumes are disabled as well.

If a volume, a plex, or a disk is detached by failures, the volwatch(8) utility sends mail to root indicating the failed objects. For example, if a disk containing two mirrored volumes fails you might receive a mail message similar to the following:

To: root
Subject: Logical Storage Manager failures on mobius.lsm.com

 

Failures have been detected by LSM on host
mobius.lsm.com:

 

failed plexes:
 home-02
  src-02

 

No data appears to have been lost.  However, you should replace
the drives that have failed.

To determine which disks are causing the failures in this message, enter the following command:

# volstat -sff home-02 src-02

This produces output such as the following:

FAILED
TYP NAME                READS    WRITES
sd  disk01-04               0         0
sd  disk01-06               0         0
sd  disk02-03               1         0
sd  disk02-04               1         0

This display indicates that the failures are on disk02 (the basename for the displayed subdisks).

Sometimes these errors are caused by cabling failures. You should look at the cables connecting your disks to your system. If there are any obvious problems, correct them and recover the plexes with the following command:

# volrecover -b home src

This command starts a recovery of the failed plexes in the background (the command returns before the operation is done). If an error message appears later, or if the plexes become detached again, replace the disk.

If you do not see any obvious cabling failures, then the disk probably needs to be replaced.

If a disk fails completely, the mail message will list the disks that have failed, all plexes that use the disk, and all volumes defined on the disk that was disabled because the volumes were not mirrored. For example:

To: root
Subject: Logical Storage Manager failures on mobius.lsm.com

Failures have been detected by LSM on host
mobius.lsm.com:

failed disks:
 disk02

failed plexes:
 home-02
  src-02
   mkting-01

failed volumes:
 mkting

The contents of failed volumes may be corrupted, and should be
restored from any available backups.  To restart one of
these volumes so that you can restore it from backup, replace
disks as appropriate then use the command:

        volume -f start <volume-name>

You can then restore or recreate the volume.

This message indicates that disk02 was detached by a failure; that plexes home-02, src-02, and mkting-01 were also detached (probably because of the failure of the disk); and that the volume mkting was disabled.

Again, the problem may be a cabling error. If the problem is not a cabling error, then you must replace the disk.

14.8.2 Replacing a Failed Disk

Disks that have failed completely, and that have been detached by failure, can be replaced by running the voldiskadm menu utility and selecting item 5, Replace a failed or removed disk, from the main menu. If you have any disks that are initialized for LSM but have never been added to a disk group, you can select one of those disks as a replacement. Do not choose the old disk drive as a replacement even though it may appear in the selection list. If there are no suitable initialized disks, you can choose to initialize a new disk.

If a disk failure caused a volume to be disabled, then the volume must be restored from backup after the disk is replaced. To identify volumes that wholly reside on disks that were disabled by a disk failure, use the volinfo command.

Any volumes that are listed as Unstartable must be restored from backup. For example, the volinfo command might display:

home           fsgen    Started
mkting         fsgen    Unstartable
src            fsgen    Started

To restart volume mkting so that it can be restored from backup, use the following command:

# volume -obg -f start mkting

The -obg option causes any plexes to be recovered in a background task.

14.8.3 Replacing a Disk that is Beginning to Fail

Often a disk has recoverable (soft) errors before it fails completely. If a disk is getting an unusual number of soft errors, replace it. This involves two steps:

Detaching the disk from its disk group
Replacing the disk with a new one

To detach the disk, run voldiskadm and select item 4, Remove a disk for replacement, from the main menu. If there are initialized disks available as replacements, you can specify the disk as part of this operation. Otherwise, you must specify the replacement disk later by selecting item 5, Replace a failed or removed disk, from the main menu.

When you select a disk to remove for replacement, all volumes that will be affected by the operation are displayed. For example, the following output might be displayed:

  The following volumes will lose mirrors as a result of this
    operation:

 

        lhome src

 

  No data on these volumes will be lost.

 

The following volumes are in use, and will be disabled as a
  result of this operation:

 

        mkting

 

  Any applications using these volumes will fail future accesses.
    These volumes will require restoration from backup.

 

Are you sure you want do do this? [y,n,q,?] (default: n)

If any volumes would be disabled, quit from voldiskadm and save the volume. Either back up the volume or move the volume off of the disk. To move the volume mkting to a disk other than disk02, use the command:

# volassist move mkting disk02

After the volume is backed up or moved, run voldiskadm again and continue to remove the disk for replacement.

After the disk has been removed for replacement, specify a replacement disk by selecting item 5, Replace a failed or removed disk, from the main menu in voldiskadm.

Refer to Section C.10 for examples of how to replace disks.

14.8.4 Modifying the Disk Label to Start at Block 1 Instead of Block 16

In LSM Version 1.0, disks added to LSM skip physical block 0 and start at block 1 because block 0 contains the disk label and is write-protected.

Starting with LSM Version 1.1, disks added to LSM start at physical block 16 for performance reasons with certain disks. To start a disk at physical block 1 instead of block 16, use the disklabel command to modify the partition start offset and length accordingly before adding the disk to LSM.

For example:

# disklabel -e /dev/rrz16c # voldisk init rz16 type=sliced

Refer to the disklabel(8) reference page for details.

14.9 Recovering Volumes

The following sections describe recovery procedures for problems relating to LSM volumes.

14.9.1 Listing Unstartable Volumes

An unstartable volume is likely to be incorrectly configured or has other errors or conditions that prevent it from being started. To display unstartable volumes, use the volinfo command, which displays information on the accessibility and usability of one or more volumes:

# volinfo -g diskgroup [ volname ]

14.9.2 Recovering a Disabled Volume

If a system crash or an I/O error corrupts one or more plexes of a volume and no plex is CLEAN or ACTIVE, mark one of the plexes CLEAN and instruct the system to use that plex as the source for reviving the others. To place a plex in a CLEAN state, use the following command:

# volmend fix clean plex_name

For example, the command line to place one plex labeled vol01-02 in the CLEAN state looks like this:

# volmend fix clean vol01-02

Refer to the volmend(8) reference pages for more information.

14.10 Problems with volrestore

If you used the volsave command to save a copy of your configuration, you can use the volrestore command to restore the configuration. This section describes problems that may arise in restoring a configuration.

See Section 7.4 and Section 7.5 for information on volsave and volrestore. See Appendix C for examples of handling restore failures.

14.10.1 Conflicts While Restoring the Configuration

When volrestore executes, it can encounter conflicts in the LSM configuration, for example, if another volume uses the same plex name or subdisk name, or the same location on a disk. When volrestore finds a conflict, it displays error messages and the configuration of the volume, as found in the saved LSM description set. In addition, it removes all volumes created in that disk group during the restoration. The disk group that had the conflict remains imported, and volrestore continues to restore other disk groups.

If volrestore fails because of a conflict, you can use the -b option to do the "best possible" restoration in a disk group. You will then have to resolve the conflicts and restore the volumes in the affected disk group.

See Section C.26 for further information and examples.

14.10.2 Failures in Restoring the Configuration

The restoration of volumes fails if one or more disks associated with the volumes are unavailable, for example due to disk failure. This, in turn, can cause the restoration of a disk group to fail. You can use a command like the following to restore the LSM configuration of a disk group:

# volrestore -b -g diskgroup

The volumes associated with the failed disks can then be restored by editing the volmake description file to remove the plexes that use the failed disks. Note that editing the description file will affect the checksum of the files in the backup directory, so you will have to override the checksum validation by using the -f option.

See Section C.26 for further information and examples.

14.11 Reinstallation Recovery

Occasionally, your system may need to be reinstalled after some types of failures. Reinstallation is necessary if all copies of your root (boot) disk are damaged, or if certain critical files are lost due to file system damage. When a failure of either of these types occurs, you must reinstall the entire system.

If these types of failures occur, attempt to preserve as much of the original LSM configuration as possible. Any volumes not directly involved in the failure may be saved. You do not have to reconfigure any volumes that are preserved.

The following sections describe the procedures used to reinstall LSM and preserve as much of the original configuration as possible after a failure.

14.11.1 General Recovery Information

A system reinstallation completely destroys the contents of any disks that are reinstalled. Any LSM related information, such as data in the LSM private areas on reinstalled disks (containing the disk identifier and copies of the LSM configuration), is removed during reinstallation. The removal of this information makes the disk unusable as an LSM disk.

If a disk was placed under LSM control (either during the LSM installation or by later encapsulation), that disk and any volumes on it are lost during reinstallation. If a disk was not under LSM control before the failure, no volumes are lost at reinstallation. You can replace any other disks by following the procedures in Section 9.2.6,

When reinstallation is necessary, the only volumes saved are those that reside on, or have copies on, disks that are not directly involved with the failure, the reinstallation, or both; volumes on disks involved with the failure or reinstallation are lost during reinstallation. If backup copies of these volumes are available, you can restore them after reinstallation. The system root disk is always involved in reinstallation. Other disks may also be involved.

If the root disk was placed under LSM control by encapsulation, that disk and any volumes or volume plexes on it are lost during reinstallation. In addition, any other disks that are involved in the reinstallation (or that are removed and replaced), also lose any LSM data (including volumes and plexes).

If a disk (including the root disk) is not under LSM control prior to the failure, no volumes are lost at reinstallation. Although having the root disk under LSM control simplifies the recovery process after reinstallation, not having the root disk under LSM control increases the likelihood of a reinstallation being necessary. Having the root disk under LSM control, and creating plexes of the root disk contents, eliminates many of the problems that require system reinstallation.

14.11.2 Overview of Reinstallation and Reconfiguration Procedures

To reinstall the system and recover the LSM configuration you need to perform the following procedures:

Prepare the system for installation. This includes replacing any failed disks or other hardware, and detaching any disks not involved in the reinstallation.
Save the current copy of /etc/vol/volboot.
Install the operating system.
Recover the LSM configuration. Restore the saved copy of /etc/vol/volboot.
Cleanup the configuration. This includes restoring any information in volumes affected by the failure or reinstallation.

Each of these procedures is described in detail in the sections that follow.

14.11.3 Preparing the System for Reinstallation

To prevent the loss of data on disks not involved in the reinstallation, you should only involve the root disk in the reinstallation procedure. It is recommended that any other disks (that contain volumes) be disconnected from the system before you start the reinstallation procedure. Disconnecting the other disks ensures that they are unaffected by the reinstallation. For example, if the operating system was originally installed with a file system on the second drive, the file system may still be recoverable. Removing the second drive ensures that the file system remains intact.

14.11.4 Reinstalling the Operating System

Once any failed or failing disks have been replaced and disks uninvolved with the reinstallation have been detached, reinstall the operating system as described in the Installation Guide.

While the operating system installation progresses, make sure no disks other than the root disk are accessed in any way. If anything is written on a disk other than the root disk, the LSM configuration on that disk could be destroyed.

14.11.5 Recovering the LSM Configuration

Once the LSM subsets have been loaded, recover the LSM configuration by doing the following:

Shut down the system.
Physically reattach the disks that were removed from the system.
Reboot the system. When the system comes up, make sure that all disks are configured in the kernel and that special device files have been created for the disks.
Run the volinstall script to create LSM special device files and to add LSM entries to the /etc/inittab file:
# volinstall
Bring the system to single-user mode by entering the following command:
# shutdown now
You need to remove some files involved with installation that were created when you loaded LSM but are no longer needed. To do this, enter the following command:
# rm -rf /etc/vol/reconfig.d/state.d/install-db
Once these files are removed, you must start some LSM daemons. Start the daemons by entering the command:
# /sbin/voliod set 2
Start the LSM configuration daemon, vold, by entering the command:
# /sbin/vold -m disable
If a copy of /etc/vol/volboot exists on backup media, restore it. Go to the next step.
If a saved copy of /etc/vol/volboot does not exist, initialize /etc/vol/volboot by entering:
# voldctl init

Add one or more disks that have configuration databases to the /etc/vol/volboot file. You must do this otherwise LSM cannot restart after a reboot.
To reenable the previous LSM configuration, you need to determine the name of one of the disks that was in the rootdg disk group. If you do not know the name of one of the disks, you can scan the disk label on the disks available on the system for LSM disk label tags such as LSMpubl or LSMsimp. If you find the LSMpubl disk label tag on a disk, add the disk as an LSM sliced disk. If you find the LSMsimp disk label tag, add the partition as an LSM simple disk.
# voldctl add disk rz3
Enable vold by entering:
# voldctl enable
Start LSM volumes by entering:
# volrecover -sb

The configuration preserved on the disks not involved with the reinstallation has now been recovered. However, because the root disk has been reinstalled, it appears to LSM as a non-LSM disk. Therefore, the configuration of the preserved disks does not include the root disk as part of the LSM configuration.

Note
If the root disk of your system and any other disk involved in the reinstallation were not under LSM control at the time of failure and reinstallation, then the reconfiguration is complete at this point. If any other disks containing volumes or volume plexes are to be replaced, follow the replacement procedures in Chapter 6. There are several methods available to replace a disk. Choose the method that you prefer.

If the root disk (or another disk) was involved with the reinstallation, any volume or volume plexes on that disk (or other disks no longer attached to the system) are now inaccessible. If a volume had only one plex (contained on a disk that was reinstalled, removed, or replaced), then the data on that the volume is lost and must be restored from backup. In addition, the system's root file system and swap area are not located on volumes any longer. To correct these problems, follow the instructions in Section 14.11.6.

14.11.6 Configuration Cleanup

The following sections describe how to clean up the configuration of your system after reinstallation of LSM.

14.11.6.1 Rootability Cleanup

To clean up the LSM configuration, remove any volumes associated with rootability, and their associated disks. This must be done if the root disk was under LSM control prior to installation. The volumes to remove are:

rootvol, which contains the root file system
swapvol, which contains the swap area

Follow these steps:

To begin the cleanup, remove the root volume, stop the volume, and then use the voledit command, as follows:
# volume stop rootvol
# voledit -r rm rootvol
Repeat the command, using swapvol in place of rootvol, to remove the swap volume.
Remove the LSM disks used by rootvol and swapvol.
For example, if disk rz3 was associated with rootvol and disk rz3b was associated with swapvol, you would enter the following commands:
# voldg rmdisk rz3 rz3b # voldisk rm rz3 rz3b

14.11.6.2 LSM Volumes for /usr and /var

If /usr and /var were on LSM volumes prior to the reinstallation, clean up the volumes using the voledit command similar to the previous example shown for rootvol. Remove the LSM disks associated with the volumes used for /usr and /var.

14.11.6.3 Volume Cleanup

After completing the rootability cleanup, you must determine which volumes need to be restored from backup. The volumes to be restored include any volumes that had all plexes residing on disks that were removed or reinstalled. These volumes are invalid and must be removed, recreated, and restored from backup. If only some plexes or a volume exist on reinitialized or removed disks, these plexes must be removed. The plexes can be readded later.

To restore the volumes, do the following:

Establish which LSM disks have been removed or reinstalled, by entering the command:
# voldisk list

LSM displays a list of system disk devices and the status of these devices. For example, for a reinstalled system with three disks and a reinstalled root disk, the output of the voldisk list command produces an output similar to this:
```
DEVICE  TYPE       DISK        GROUP       STATUS
rz0     sliced     -           -           error
rz1     sliced     disk02      rootdg      online
rz2     sliced     disk03      rootdg      online
-       -          disk01      rootdg      failed was:  rz0
```
The previous display shows that the reinstalled root device, rz0 is not recognized as an LSM disk and is marked with a status of error. disk02 and disk03 were not involved in the reinstallation and are recognized by LSM and associated with their devices (rz1 and rz2). The former disk01, the LSM disk that had been associated with the replaced disk device, is no longer associated with the device (rz0).
If there had been other disks (with volumes or volume plexes on them) removed or replaced during reinstallation, these disks would also have a disk device in error state and an LSM disk listed as not associated with a device.
Once you know which disks have been removed or replaced, all the plexes on disks with a status of failed must be located. Enter the command:
# volprint -sF "%vname" -e 'sd_disk = "< disk >'

In this command, the variable <disk> is the name of a disk with a failed status.
Note
Be sure to enclose the disk name in quotes in the command. Otherwise, the command will return an error message.
The volprint command returns a list of volumes that have plexes on the failed disk. Repeat this command for every disk with a failed status.
Check the status of each volume. To print volume information, enter:
# volprint -th < volume_name >

In this command, volume_name is the name of the volume to be examined.
The volprint command displays the status of the volume, its plexes, and the portions of disks that make up those plexes. For example, a volume named fnah with only one plex resides on the reinstalled disk named disk01. The volprint -th command, applied to the volume fnah, produces the following display:
```
V  NAME     USETYPE  KSTATE   STATE    LENGTH READPOL  PREFPLEX
PL NAME     VOLUME   KSTATE   STATE    LENGTH LAYOUT ST-WIDTH MODE
SD NAME     PLEX     PLOFFS   DISKOFFS LENGTH DISK-MEDIA   ACCESS

 

v  fnah      fsgen    DISABLED ACTIVE   24000  SELECT   -
pl fnah-01   fnah     DISABLED NODEVICE 24000  CONCAT   -
sd disk01-06 fnah-01  0        519940   24000  disk01   -
```
The only plex of the volume is shown in the line beginning with pl. The STATE field for the plex named fnah-01 is NODEVICE. The plex has space on a disk that has been replaced, removed, or reinstalled. Therefore, the plex is no longer valid and must be removed. Since fnah-01 was the only plex of the volume, the volume contents are irrecoverable except by restoring the volume from a backup. The volume must also be removed. If a backup copy of the volume exists, you can restore the volume later. Keep a record of the volume name and its length, you will need it for the backup procedure.
To remove the volume, use the voledit command. To remove fnah, enter the command:
# voledit -r rm fnah

It is possible that only part of a plex is located on the failed disk. If the volume has a striped plex associated with it, the volume is divided between several disks. For example, the volume named woof has one striped plex, striped across three disks, one of which is the reinstalled disk disk01. The output of the volprint -th command for woof returns:
```
V  NAME    USETYPE KSTATE   STATE    LENGTH  READPOL  PREFPLEX
PL NAME    VOLUME  KSTATE   STATE    LENGTH  LAYOUT   ST-WIDTH MODE
SD NAME    PLEX    PLOFFS   DISKOFFS LENGTH  DISK-MEDIA   ACCESS

 

v  woof      fsgen   DISABLED ACTIVE   4224    SELECT   -
pl woof-01   woof    DISABLED NODEVICE 4224    STRIPE   128      RW
sd disk02-02 woof-01 0        14336    1408    disk02         rz1
sd disk01-05 woof-01 1408     517632   1408    disk01       -
sd disk03-01 woof-01 2816     14336    1408    disk03         rz2
```
The display shows three disks, across which the plex woof-01 is striped (the lines starting with sd represent the stripes). The second stripe area is located on LSM disk01. This disk is no longer valid, so the plex named woof-01 has a state of NODEVICE. Since this is the only plex of the volume, the volume is invalid and must be removed. If a copy of woof exists on the backup media, it can be restored later.
Note
Keep a record of the volume name and length of any volumes you intend to restore from backup.

Use the voledit command to remove the volume, as described earlier.
A volume that has one plex on a failed disk may also have other plexes on disks that are still valid. In this case, the volume does not need to be restored from backup, since the data is still valid on the valid disks. The output of the volprint -th command for a volume with one plex on a failed disk (disk01) and another plex on a valid disk (disk02) would look like this:
```
V  NAME   USETYPE  KSTATE   STATE    LENGTH  READPOL  PREFPLEX
PL NAME   VOLUME   KSTATE   STATE    LENGTH  LAYOUT ST-WIDTH MODE
SD NAME   PLEX     PLOFFS   DISKOFFS LENGTH  DISK-MEDIA   ACCESS

 

v  foo       fsgen    DISABLED ACTIVE   10240   SELECT   -
pl foo-01    foo      DISABLED ACTIVE   10240   CONCAT   -   RW
sd disk02-01 foo-01   0        0        10240   disk02   rz1
pl foo-02    foo      DISABLED NODEVICE 10240   CONCAT       RW
sd disk01-04 foo-02   0        507394   10240   disk01   -
```
This volume has two plexes, foo-01 and foo-02. The first plex, foo-01, does not use any space on the invalid disk, so it can still be used. The second plex, foo-02, uses space on the invalid disk, disk01, and has a state of NODEVICE. Mirror foo-02 must be removed. However, the volume still has one valid plex containing valid data. If the volume needs to be mirrored, another plex can be added later. Note the name of the volume if you want to create another plex later.
To remove an invalid plex, the plex must be dissociated from the volume and then removed. This is done with the volplex command. To remove the plex foo-02, enter the following command:
# volplex -o rm dis foo-02
Once all the volumes have been cleaned up, you must clean up the disk configuration as described in the following section.

14.11.6.4 Disk Cleanup

Once all invalid volumes and volume plexes have been removed, the disk configuration can be cleaned up. Each disk that was removed, reinstalled, or replaced (as determined from the output of the voldisk list command) must be removed from the configuration.

To remove the disk, use the voldg command. To remove the failed disk01, enter:

# voldg rmdisk disk01

If the voldg command returns an error message, some invalid volume plexes exist. Repeat the processes described in "Volume Cleanup" until all invalid volumes and volume plexes are removed.

14.11.6.5 Rootability Reconfiguration

Once all the invalid disks have been removed, the replacement or reinstalled disks can be added to LSM control. If the root disk was originally under LSM control (the root file system and the swap area were on volumes), or you now want to put the root disk under LSM control, add this disk first.

To add the root disk to LSM control, enter the following command:

# /usr/sbin/volencap <boot_disk>

For more information see Chapter 5.

When the encapsulation is complete, reboot the system to multi-user mode.

14.11.6.6 Final Reconfiguration

Once the root disk is encapsulated, any other disks that were replaced should be added using voldiskadm. If the disks were reinstalled during the operating system reinstallation, they should be encapsulated; otherwise, simply add them. See Chapter 6.

Once all the disks have been added to the system, any volumes that were completely removed as part of the configuration cleanup can be recreated on their contents restored from backup. The volume recreation can be done using either volassist or the Logical Storage Visual Administrator (dxlsm) interface.

To recreate the volumes fnah and woof using the volassist command, enter:

# volassist make fnah 24000
# volassist make woof 4224 layout=stripe nstripe=3

Once the volumes are created, they can be restored from backup using normal backup/restore procedures.

Any volumes that had plexes removed as part of the volume cleanup can have these plexes recreated following the instructions for mirroring a volume for the interface (volassist, voldiskadm, or dxlsm) you choose.

To replace the plex removed from the volume foo using volassist, enter:

# volassist mirror foo

Once you have restored the volumes and plexes lost during reinstallation, the recovery is complete and your system should be configured as it was prior to the failure.

If...	Then...
The `vold` daemon is both running and enabled	The following message displays: `mode:enabled`
The `vold` daemon is running, but is not enabled	The following message displays: `mode:disabled`
The `vold` daemon is not running	The following message displays: `mode:not-running`

If...	Then...
Any `voliod` processes are running	The following message displays: `n` "volume I/O daemons running" The `n` symbol in the previous example indicates the number of `voliod` daemons running.
There are no `voliod` daemons currently running	Start some daemons by entering the following command: `# voliod set 2`