LSM helps you protect the availability and reliability of data but does not prevent I/O failure. LSM is simply another layer added to the I/O subsystem. LSM depends on the underlying disk device drivers and system files to decide on the availability of individual disks and to manage and report any failures.
This chapter describes how to troubleshoot common LSM problems, describes tools that you can use to learn about problems, and offers possible solutions.
The hot-spare feature provides the best protection for volumes that
use mirror plexes or a RAID 5 plex.
When enabled, the hot-spare feature allows
LSM to automatically relocate data from a failed disk in a volume that uses
either a RAID 5 plex or mirrored plexes.
LSM writes the data to a designated
spare disk, or to free disk space, and sends you mail about the relocation.
See
Section 3.4.4.1
for more information about enabling
the hot-spare feature.
6.1 Monitoring LSM
You can use LSM commands to monitor the status of LSM objects.
By doing so, you can understand how LSM works under normal conditions and watch for indications that an LSM object might need adjustment before a problem arises.
6.1.1 Monitoring LSM Events
By default, LSM uses Event Manager (EVM) software to log events.
The
events that LSM logs are defined in the EVM template called
/usr/share/evm/templates/sys/lsm.volnotify.evt
.
You can select, filter, sort, format, and display LSM events using EVM commands or the graphical event viewer, which is integrated with the SysMan Menu and SysMan Station.
To display a list of logged LSM events, enter:
#
evmget -f "[name *.volnotify]" | evmshow -t "@timestamp @@"
To display LSM events in real time, enter:
#
evmwatch -f "[name *.volnotify]" | evmshow -t "@timestamp @@"
See the
EVM
(5)
reference page for more information about EVM.
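For example, to save a time-stamped copy of the logged LSM events for later review (the output file name here is arbitrary):
#
evmget -f "[name *.volnotify]" | evmshow -t "@timestamp @@" > /var/adm/lsm_events.log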
6.1.2 Monitoring Read and Write Statistics
You can use the
volstat
command to view and reset:
The number of successful or failed read and write operations
The number of blocks read and written
The average time spent on read and write operations. This time reflects the total time it took to complete a read or write operation, including the time spent waiting in a queue on a busy device.
Table 6-1
describes some of the options that
you can use with the
volstat
command.
Table 6-1: Common volstat Command Options
Option | Displays |
-v | Volume statistics |
-p | Plex statistics |
-s | Subdisk statistics |
-d | LSM disk statistics |
-i seconds | The specified statistics continuously in the interval specified (in seconds). |
For information on all the
volstat
options, see the
volstat
(8)
reference page.
Note
In a cluster environment, the
volstat
command displays statistics for the system on which the command is entered and does not provide statistics for all the systems within a cluster.
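For example, to watch volume-level statistics refreshed every five seconds in a disk group named dg1 (a hypothetical name), combine the options from Table 6-1:
#
volstat -g dg1 -v -i 5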
6.1.2.1 Displaying Read and Write Statistics
To display read and write statistics for LSM objects, enter:
#
volstat [-g disk_group] -vpsd [-i number_of_seconds]
Information similar to the following is displayed:
                     OPERATIONS           BLOCKS        AVG TIME(ms)
TYP NAME         READ     WRITE      READ     WRITE    READ    WRITE
dm  dsk6            3        82        40     62561     8.9     51.2
dm  dsk7            0       725         0    176464     0.0     16.3
dm  dsk9          688        37    175872       592     3.9      9.2
dm  dsk10       29962         0   7670016         0     4.0      0.0
dm  dsk12           0     29962         0   7670016     0.0     17.8
vol v1              3        72        40     62541     8.9     56.5
pl  v1-01           3        72        40     62541     8.9     56.5
sd  dsk6-01         3        72        40     62541     8.9     56.5
vol v2              0        37         0       592     0.0     10.5
pl  v2-01           0        37         0       592     0.0      8.0
sd  dsk7-01         0        37         0       592     0.0      8.0
sd  dsk12-01        0         0         0         0     0.0      0.0
pl  v2-02           0        37         0       592     0.0      9.2
sd  dsk9-01         0        37         0       592     0.0      9.2
sd  dsk10-01        0         0         0         0     0.0      0.0
pl  v2-03           0         6         0        12     0.0     13.3
sd  dsk6-02         0         6         0        12     0.0     13.3
The LSM objects are identified as follows:
dm
- Disk media name (LSM name for
the disk)
vol
- Volume name
pl
- Plex name
sd
- Subdisk name
6.1.2.2 Displaying Failed Read and Write Statistics
To display failed I/O statistics, enter:
#
volstat [-g disk_group] -f cf LSM_object
Information similar to the following is displayed:
                   CORRECTED            FAILED
TYP NAME        READS    WRITES     READS    WRITES
vol testvol         1         0         0         0
LSM corrects read failures for mirror plexes or a RAID 5 plex, because
these plexes provide data redundancy.
6.1.3 Monitoring LSM Object States
The kernel and LSM monitor the state of LSM objects.
To display the state of LSM objects, enter:
#
volprint [-g disk_group]
Information similar to the following is displayed:
. . .
Disk group: dg1

TY NAME          ASSOC         KSTATE    LENGTH    PLOFFS   STATE    TUTIL0   PUTIL0
dg dg1           dg1           -         -         -        -        -        -
dm dsk1          dsk1          -         2046748   -        -        -        -
dm dsk2          dsk2          -         2046748   -        -        -        -
dm dsk4          dsk4          -         2046748   -        -        -        -
dm dsk5          dsk5          -         2046748   -        -        -        -
v  vol-test      fsgen         ENABLED   2048      -        ACTIVE   -        -
pl vol-test-01   vol-test      ENABLED   2048      -        ACTIVE   -        -
sd dsk1-01       vol-test-01   ENABLED   1024      0        -        -        -
sd dsk2-01       vol-test-01   ENABLED   1024      0        -        -        -
pl vol-test-02   vol-test      ENABLED   2048      -        ACTIVE   -        -
sd dsk4-01       vol-test-02   ENABLED   1024      0        -        -        -
sd dsk5-01       vol-test-02   ENABLED   1024      0        -        -        -
The KSTATE column shows the kernel state of the LSM object. The STATE column shows the LSM state of the LSM object. The LSM objects are identified as follows:
dg
- Disk group name
dm
- Disk media name (LSM name for
the disk)
v
- Volume name
pl
- Plex name
sd
- Subdisk name
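To limit the display to a single volume and its plexes and subdisks, pass the volume name on the command line; vol-test is the volume from the sample output above:
#
volprint -g dg1 -ht vol-test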
The LSM kernel state indicates the accessibility of the LSM object as
viewed by the kernel.
Table 6-2
describes kernel
states for LSM objects.
Table 6-2: LSM Volume Kernel States (KSTATE)
Kernel State | Means |
ENABLED | The LSM object is accessible and read and write operations can be performed. |
DISABLED | The LSM object is not accessible. |
DETACHED | Read and write operations cannot be performed, but device operations are accepted. |
LSM monitors the states of volumes, plexes, and subdisks.
A volume has an LSM state (Table 6-3). The meaning of some volume states differs depending on the kernel state (KSTATE).
A plex has an LSM state (Table 6-4).
A subdisk has an LSM state (Table 6-5).
Table 6-3: LSM Volume States (STATE)
State | Means | Kernel State |
EMPTY | The volume contents are not initialized. | DISABLED |
CLEAN | The volume is not started. | DISABLED |
ACTIVE | The volume was started or was in use when the system was restarted. | ENABLED. DISABLED if RAID 5 parity synchronization is not guaranteed or if mirror plexes are not guaranteed to be consistent. |
SYNC | The system is resynchronizing mirror plexes or RAID 5 parity. | ENABLED if the mirror plexes or RAID 5 parity are being resynchronized. DISABLED if the mirror plexes or RAID 5 parity were being resynchronized when the system restarted and therefore still need to be synchronized. |
NEEDSYNC | The volume requires a resynchronization operation the next time it starts. | |
REPLAY | A RAID 5 volume is in a transient state as part of a log replay. A log replay occurs when it is necessary to reconstruct data using parity and data. | |
Table 6-4: LSM Plex States (STATE)
State | Means |
EMPTY | The plex is not initialized. This state is also set when the volume state is EMPTY. |
CLEAN | The plex was running normally when the volume was stopped. The plex was enabled without requiring recovery when the volume was started. |
ACTIVE | The plex is running normally on a started volume. |
LOG | The plex is a DRL or RAID 5 log plex for the volume. |
STALE | The plex was detached, either by the volplex det command or by an I/O failure. STALE plexes are reattached automatically by volplex att when a volume starts. |
IOFAIL | The vold daemon places an ACTIVE plex in the IOFAIL state when it detects an error. The plex is disqualified from the recovery selection process at volume start time, ensuring that LSM uses only valid plexes for recovery. A plex marked IOFAIL is recovered if possible during a resynchronization. |
OFFLINE | The plex was disabled by the volmend off command. |
SNAPATT | This is a snapshot plex that is attached by the volassist snapstart command. When the attach is complete, the state for the plex is changed to SNAPDONE. If the system fails before the attach completes, the plex and all of its subdisks are removed. |
SNAPDONE | This is a snapshot plex created by the volassist snapstart command that is fully attached. You can turn a plex in this state into a snapshot volume with the volassist snapshot command. If the system fails before the attach completes, the plex and all of its subdisks are removed. |
SNAPTMP | This is a snapshot plex that is attached by the volplex snapstart command. When the attach is complete, the state for the plex changes to SNAPDIS. If the system fails before the attach completes, the plex is dissociated from the volume. |
SNAPDIS | This is a snapshot plex created by the volplex snapstart command that is fully attached. You can turn a plex in this state into a snapshot volume with the volplex snapshot command. If the system fails before the attach completes, the plex is dissociated from the volume. |
TEMP | This is a plex that is associated and attached to a volume with the volplex att command. If the system fails before the attach completes, the plex is dissociated from the volume. |
TEMPRM | This is a plex that is being associated and attached to a volume with the volplex att command. If the system fails before the attach completes, the plex is dissociated from the volume and removed. Any subdisks in the plex are kept. |
TEMPRMSD | This is a plex that is being associated and attached to a volume with the volplex att command. If the system fails before the attach completes, the plex and its subdisks are dissociated from the volume and removed. |
Table 6-5: LSM Subdisk States (STATE)
State | Means |
REMOVED | The subdisk (which might encompass the entire LSM disk) was removed from the volume, disk group, or from LSM control. |
RECOVER | The subdisk must be recovered. Use the volrecover command. |
6.2 Missing or Altered sysconfigtab File
During the boot disk encapsulation procedure, LSM adds the following
entries to the
/etc/sysconfigtab
file to enable the system
to boot off the LSM root volume:
lsm:
    lsm_rootdev_is_volume=1
If this file is deleted or the LSM-specific entries are deleted, the system will not boot. If this happens, do the following:
Boot the system interactively:
>>>
boot -fl i
.........
.........
Enter kernel_name option_1 ... option_n: vmunix
Restore the LSM entries in the /etc/sysconfigtab file:
lsm:
    lsm_rootdev_is_volume=1
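To confirm that the stanza is present in /etc/sysconfigtab once the system is up, you can list it with the sysconfigdb command; this is only a sketch, and the exact listing format may differ on your system:
#
sysconfigdb -l lsm
lsm:
    lsm_rootdev_is_volume=1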
6.3 LSM Startup and Command Problems
LSM requires that the
vold
and
voliod
daemons be running.
These daemons are normally started automatically when
the system boots.
If these daemons are not running, the most obvious symptom is that LSM commands fail to complete or do not respond as expected, which indicates that LSM did not start correctly.
The following sections describe how to check if the daemons are running
and how to correct problems.
6.3.1 Checking the vold Daemon
To determine the state of the
vold
daemon, enter:
#
voldctl mode
Table 6-6
shows messages that might display, what
the message means, and the commands you should enter if
vold
is disabled or not running.
Table 6-6: vold Messages and Solutions
Message | Status | Enter |
Mode: enabled | Running and enabled | -- |
Mode: disabled | Running but disabled | voldctl enable |
Mode: not-running | Not running | vold |
See the
vold
(8)
reference page for more information on the
vold
daemon.
6.3.2 Checking the voliod Daemon
The correct number of
voliod
daemons automatically
start when LSM starts.
Typically several
voliod
daemons
are running at all times.
You should run at least one
voliod
daemon for each processor on the system.
To display the number of the
voliod
daemons running,
enter:
#
voliod
Information similar to the following is displayed:
2 volume I/O daemons running
This is the only method for checking
voliod
daemons,
because the
voliod
processes are kernel threads and do
not display in the output of the
ps
command.
If no
voliod
daemons are running, or if you want
to change the number of daemons, enter the following command where
n
is the number of I/O daemons to start:
#
voliod set n
Set the number of LSM I/O daemons to two or the number of central processing units (CPUs) on the system, whichever is greater.
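For example, on a four-processor system you might raise the count to four and then confirm it. The psrinfo command is used here only as one way to check the processor count; substitute whatever method you prefer:
#
psrinfo | wc -l
       4
#
voliod set 4
#
voliod
4 volume I/O daemons running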
See the
voliod
(8)
reference page for more information on the
voliod
daemon.
6.4 Solving Problems with LSM Volumes
The following sections describe how to solve common LSM volume problems.
6.4.1 Insufficient Space to Create a Volume
When you use the
volassist
command to create a volume
with a striped plex, you might receive an error message indicating insufficient
space for the volume even though you know there is enough space available.
The volassist command rounds up the length you specify on the command line to a multiple of the data unit size (64 KB by default, or the stripe width you specified) and then divides the total by the number of disks available to make the columns. The free space on the smallest disk in the disk group limits the size of each column.
For example, you have two disks with differing free space in the disk group called dg1:
#
voldg -g dg1 free
GROUP    DISK     DEVICE   TAG      OFFSET   LENGTH    FLAGS
dg1      dsk1     dsk1     dsk1     0        2049820   -
dg1      dsk2     dsk2     dsk2     0        2047772   -
The total free space on these two disks is 4097592. You tried to create a volume with a striped plex, with a length of 4095544, or about 2 GB, which is less than the total space available:
#
volassist -g dg1 make NewVol 4095544 layout=stripe
volassist: adjusting length 4095544 to conform to a layout of 2 stripes 128 blocks wide
volassist: adjusting length up to 4095744 blks
volassist: insufficient space for a 4095744 block long volume in stripe, contiguous layout
The command returned an error message indicating insufficient space, because volassist rounded up the length you specified to an even multiple of the data unit size of 64 KB (128 blocks) and divided that number by the number of disks (2). The result, 4095744 / 2 = 2047872, is larger than the 2047772 blocks of free space on the smaller disk.
If your volume does not need to be precisely the size you specified, you can retry the command with a length that works with the data unit size and the number of disks. For example, round the free space on the smallest disk down to a multiple of the stripe width (2047772 rounded down to a multiple of 128 blocks is 2047744) and multiply that by the number of disks: 2047744 * 2 = 4095488. Use this value in the command line:
#
volassist -g dg1 make NewVol 4095488 layout=stripe
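You can check this arithmetic from the shell. The following sketch rounds the smaller disk's free space down to a 128-block stripe multiple (integer division discards the remainder) and multiplies by the two columns:
#
expr 2047772 / 128 \* 128 \* 2
4095488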
If the volume you require is larger than the total free space in the
disk group, or if the volume must be exactly the size you specify, you must
add more (or larger) disks to that disk group.
See
Section 5.2.3
for more information on adding disks to a disk group.
6.4.2 Starting a Disabled Volume
If you cannot mount a file system or open an LSM volume, the LSM volume might not be started.
To determine whether or not the LSM volume is started, enter:
#
volinfo [-g disk_group] volume
The following output shows the condition of several volumes:
vol  bigvol      fsgen        Startable
vol  vol2        fsgen        Started
vol  brokenvol   gen          Unstartable
LSM volumes can have the following conditions:
Started
- The volume is enabled and
running normally.
Startable
- The volume is not enabled,
and at least one plex has a state of ACTIVE or CLEAN, indicating that the
volume can be restarted.
[Footnote 3]
To start a startable volume, enter:
#
volume [-g disk_group] start volume
Unstartable
- The volume is not enabled
and has a problem that you must resolve before you can start the volume.
For
example, a disk might have failed.
If the volume is redundant (that is, it uses mirror plexes or a RAID 5 plex), see Section 6.5.5 for information on replacing failed disks and recovering the volumes.
If the volume is not redundant, see Section 6.4.3.
6.4.3 Recovering Unstartable Nonredundant Volumes
Nonredundant volumes are those that use a single plex that is either concatenated or striped. If a disk in the plex fails, the volume will be unstartable.
You can display the volume's condition by entering:
#
volinfo -p
Information similar to the following is displayed:
vol  tst           fsgen        Unstartable
plex tst-01        NODEVICE
To recover the volume:
If the disk is usable, continue with step 2. If the disk has failed, replace the disk:
Identify the disk media name of the failed disk using one of the following commands:
To display all disk, disk group, and volume information and the status of any volumes that are affected by the failed disk, enter:
#
volprint -Aht
To display only the disk information, enter:
#
volprint -Adt
Remove the failed disk and retain the disk media records:
#
voldg [-g disk_group] -k rmdisk disk_media_name
Remove the disk from LSM control, using the disk access name:
#
voldisk rm disk_access_name
Physically remove the failed disk and replace it with the new disk.
Please note that the device must be completely removed from LSM before
running any non-LSM commands to remove and replace the failed disk, such as
hwmgr
-redirect
.
Scan for the new disk:
#
hwmgr -scan scsi
The
hwmgr
command returns the prompt before it completes
the scan.
You need to know that the system has discovered the new disk before
continuing.
See the
hwmgr
(8)
reference page for more information on how to trap the end
of a scan.
Label and initialize the new disk:
If you have a backup of the previous disk's disk label information (Section 4.1.3):
Apply the backup disk label to the new disk:
#
disklabel -R disk_access_name auto file
Initialize the disk for LSM, using the disk access name:
#
voldisk -f init disk_access_name
If no disk label file is available:
Apply a default disk label to the new disk:
#
disklabel -rwn disk_access_name
Initialize the new disk for LSM:
#
voldisksetup -i disk
Optionally (but recommended), create a backup copy of the new disk's disk label information:
#
disklabel disk_access_name > file
Add the new disk to the applicable disk group, assigning a disk media name to the disk access name. You can reuse the disk media name of the failed disk as the disk media name for the new disk:
#
voldg [-g disk_group] -k adddisk disk_media_name=disk_access_name
Verify that the volume's plex state has changed to RECOVER:
#
volinfo -p
vol tst fsgen Unstartable
plex tst-01 RECOVER
Set the plex state to STALE:
#
volmend fix stale plex
LSM has internal state restrictions that require a plex to change states in a specific order. A plex must be STALE before it can be marked CLEAN.
Set the plex state to CLEAN:
#
volmend fix clean plex
Start the volume:
#
volume start volume
The volume is now running and usable but contains invalid data.
Depending on what was using the volume, do one of the following:
If the volume was used by a file system, recreate the file system on the volume, and mount the file system. See Section 4.3 for more information on configuring a volume for a file system.
If you have a backup of the data, restore the volume using the backup. See Section 5.4.3 for more information on restoring a volume from backup.
If you have no backup and the volume was used by an application such as a database, refer to that application's documentation for information on restoring or recreating the data.
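For the sample volume tst shown earlier, the state changes and restart from the steps above might look like the following; the file system or application data must still be recreated or restored afterward:
#
volmend fix stale tst-01
#
volmend fix clean tst-01
#
volume start tst
#
volinfo
tst            fsgen        Started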
6.4.4 Recovering Volumes with Mirror Plexes
Volumes with mirror plexes are less vulnerable than volumes with a single (nonmirrored) plex, but if disks in all the plexes fail, the volume's state will be Unstartable.
There are three possible scenarios for the failure and recovery of volumes with mirror plexes:
Data in all plexes is known to be bad or is unknown. See Section 6.4.4.1.
One plex is known to be valid, and you want to use that plex to restore the others. See Section 6.4.4.2.
Data in all the plexes is known to be valid, but you have lost all copies of the configuration database. (All disks containing copies failed.) See Section 6.4.4.3.
6.4.4.1 Recovering a Volume with No Valid Plexes
If disks in multiple plexes of a volume failed, all the volume's data might be corrupt or suspect. Recovering a volume from a multiple disk failure requires that you restore the data from backup.
To recover a volume with no valid plexes:
Set all the plexes in the volume to CLEAN:
#
volmend fix clean plex1 plex2 ...
Start the volume:
#
volume start volume
Depending on what was using the volume, do one of the following:
If the volume was used by a file system, recreate the file system on the volume, and mount the file system. See Section 4.3 for more information on configuring a volume for a file system.
If you have a backup of the data, restore the volume using the backup. See Section 5.4.3 for more information on restoring a volume from backup.
If you have no backup and the volume was used by an application such as a database, refer to that application's documentation for information on restoring or recreating the data.
6.4.4.2 Recovering a Volume with One Valid Plex
If you know that one plex in a volume contains valid data, you can use that plex to restore the others.
To recover a volume with one valid plex:
Set the valid plex's state to CLEAN:
#
volmend fix clean valid_plex
Set the state of all the other plexes to STALE:
#
volmend fix stale stale_plex1 stale_plex2 ...
Start the volume and initiate the resynchronization process in the background:
#
volrecover -sb volume
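As a concrete sketch, suppose a volume datavol has three plexes, datavol-01 through datavol-03, and only datavol-01 is known to be good (all names here are hypothetical):
#
volmend fix clean datavol-01
#
volmend fix stale datavol-02 datavol-03
#
volrecover -sb datavol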
6.4.4.3 Recovering a Volume After Loss of the Configuration Database
The following procedure requires a backup copy of the configuration
database (created by the
volsave
command, as described
in
Section 5.3.1) restored by the
volrestore
command.
See
Section 5.3.2
for more information
on restoring the configuration database.
You should have a high degree of
confidence that the volume data is still valid.
To recover a volume after restoring the configuration database:
Set all plexes in the volume to CLEAN:
#
volmend fix clean plex1 plex2 ...
Start the volume:
#
volume start volume
6.4.5 Recovering Volumes with a Failed RAID 5 Plex
Volumes that use a RAID 5 plex are designed to remain available when one disk fails. However, if two disks in the data plex fail, the entire volume is compromised.
If hot-sparing is enabled at the time of a disk failure, system
administrator intervention is not required (unless there is no suitable disk
space available for relocation).
Hot-sparing is triggered by the disk failure,
and you are notified of the failure by electronic mail.
Hot-sparing automatically attempts to relocate the subdisks of a failing RAID 5 plex. After relocation takes place, the hot-sparing daemon (volspared) also initiates a parity resynchronization.
In the case of a failing RAID 5 log plex, relocation occurs only if
the log plex is mirrored;
volspared
then initiates a mirror
resynchronization to recreate the RAID 5 log plex.
If hot-sparing is disabled at the time of a failure, you might need to initiate a resynchronization or recovery.
There are three possible scenarios for the failure and recovery of volumes with a RAID 5 plex:
Within the data plex, a disk in one column fails. See Section 6.5.5 for more information on replacing a failed disk and recovering the volume.
Within the data plex, disks in two or more columns fail. See Section 6.4.5.1.
Within the log plex, a disk fails, or the log plex becomes detached. See Section 6.4.5.2.
6.4.5.1 Recovering a RAID 5 Plex from Multiple Disk Failures
If disks in two or more columns of a RAID 5 data plex fail, LSM cannot use the remaining data (if any) and parity to reconstruct the missing data. You must restore the data from backup.
To restore the volume:
If the disk is usable, continue with step 2. If the disk has failed, replace the disk:
Identify the disk media name of the failed disk using one of the following commands:
To display all disk, disk group, and volume information and the status of any volumes that are affected by the failed disk, enter:
#
volprint -Aht
To display only the disk information, enter:
#
volprint -Adt
Remove the failed disk and retain the disk media records:
#
voldg [-g disk_group] -k rmdisk disk_media_name
Remove the disk access records, using the disk access name:
#
voldisk rm disk_access_name
Physically remove the failed disk and replace it with the new disk.
Please note that the device must be completely removed from LSM before
running any non-LSM commands to remove and replace the failed disk, such as
hwmgr
-redirect
.
Scan for the new disk:
#
hwmgr -scan scsi
The
hwmgr
command returns the prompt before it completes
the scan.
You need to know that the system has discovered the new disk before
continuing.
See the
hwmgr
(8)
reference page for more information on how to trap the end
of a scan.
Label and initialize the new disk:
If you have a backup of the previous disk's disk label information (Section 4.1.3):
Apply the backup disk label to the new disk:
#
disklabel -R disk_access_name auto file
Initialize the disk for LSM, using the disk access name:
#
voldisk -f init disk_access_name
If no disk label file is available:
Apply a default disk label to the new disk:
#
disklabel -rwn disk_access_name
Initialize the disk for LSM:
#
voldisksetup -i disk
Optionally (but recommended), create a backup copy of the new disk's disk label information:
#
disklabel disk_access_name > file
Add the new disk to the applicable disk group, assigning a disk media name to the disk access name. You can reuse the disk media name of the failed disk as the disk media name for the new disk:
#
voldg [-g disk_group] -k adddisk disk_media_name=disk_access_name
Verify that the volume's plex state has changed to RECOVER:
#
volinfo -p
vol tst fsgen Unstartable
plex tst-01 RECOVER
Stop the volume:
#
volume stop volume
Set the RAID 5 data plex state to EMPTY:
#
volmend -f fix empty volume
Setting the plex state to EMPTY causes LSM to recalculate the parity when you restart the volume in the next step.
Start the volume. The process of recalculating the parity can take a long time; you can run this operation in the background to return the system prompt immediately:
#
volume [-o bg] start volume
The volume becomes usable even while the parity regeneration is underway. If users access a region of the volume that has not yet had its parity recalculated, LSM recalculates the parity for the entire stripe that contains the accessed data before honoring the read or write request.
Depending on what was using the volume, do one of the following:
If the volume was used by a file system, recreate the file system on the volume, and mount the file system. See Section 4.3 for more information on configuring a volume for a file system.
If you have a backup of the data, restore the volume using the backup. See Section 5.4.3 for more information on restoring a volume from backup.
If you have no backup, and the volume was used by an application such as a database, refer to that application's documentation for information on restoring or recreating the data.
6.4.5.2 Recovering a RAID 5 Log Plex
A disk containing a RAID 5 log could experience a failure. This has no direct effect on the operation of the volume; however, the loss of all RAID 5 logs on a volume makes the volume vulnerable to a complete failure.
The following output from the
volprint
command shows
a failure within a RAID 5 log plex.
The plex state is BADLOG, and the RAID
5 log plex vol5-02 has failed.
Disk group: rootdg

V  NAME         USETYPE      KSTATE    STATE     LENGTH   READPOL    PREFPLEX
PL NAME         VOLUME       KSTATE    STATE     LENGTH   LAYOUT     NCOL/WID  MODE
SD NAME         PLEX         DISK      DISKOFFS  LENGTH   [COL/]OFF  DEVICE    MODE

v  vol5         raid5        ENABLED   ACTIVE    409696   RAID       -
pl vol5-01      vol5         ENABLED   ACTIVE    409696   RAID       8/32      RW
sd dsk3-01      vol5-01      dsk3      0         58528    0/0        dsk3      ENA
sd dsk4-01      vol5-01      dsk4      0         58528    1/0        dsk4      ENA
sd dsk5-01      vol5-01      dsk5      0         58528    2/0        dsk5      ENA
sd dsk6-01      vol5-01      dsk6      0         58528    3/0        dsk6      ENA
sd dsk7-01      vol5-01      dsk7      0         58528    4/0        dsk7      ENA
sd dsk8-01      vol5-01      dsk8      0         58528    5/0        dsk8      ENA
sd dsk9-01      vol5-01      dsk9      0         58528    6/0        dsk9      ENA
sd dsk10-01     vol5-01      dsk10     0         58528    7/0        dsk10     ENA
pl vol5-02      vol5         DISABLED  BADLOG    2560     CONCAT     -         RW
sd dsk11-01     vol5-02      dsk11     0         2560     0          -         RMOV
RAID 5 log plexes might have a state of DETACHED due to disk failures.
To recover a RAID 5 log plex:
If the disk is usable but the log plex is detached, continue with step 2. If the disk has failed, replace the disk:
Identify the disk media name of the failed disk using one of the following commands:
To display all disk, disk group, and volume information and the status of any volumes that are affected by the failed disk, enter:
#
volprint -Aht
To display only the disk information, enter:
#
volprint -Adt
Remove the failed disk and retain the disk media records:
#
voldg [-g disk_group] -k rmdisk disk_media_name
Remove the disk access records, using the disk access name:
#
voldisk rm disk_access_name
Physically remove the failed disk and replace it with the new disk.
Please note that the device must be completely removed from LSM before
running any non-LSM commands to remove and replace the failed disk, such as
hwmgr
-redirect
.
Scan for the new disk:
#
hwmgr -scan scsi
The
hwmgr
command returns the prompt before it completes
the scan.
You need to know that the system has discovered the new disk before
continuing.
See the
hwmgr
(8)
reference page for more information on how to trap the end
of a scan.
Label and initialize the new disk:
If you have a backup of the previous disk's disk label information (Section 4.1.3):
Apply the backup disk label to the new disk:
#
disklabel -R disk_access_name auto file
Initialize the disk for LSM, using the disk access name:
#
voldisk -f init disk_access_name
If no disk label file is available:
Apply a default disk label to the new disk:
#
disklabel -rwn disk_access_name
Initialize the disk for LSM:
#
voldisksetup -i disk
Optionally (but recommended), create a backup copy of the new disk's disk label information:
#
disklabel disk_access_name > file
Add the new disk to the applicable disk group, assigning a disk media name to the disk access name. You can reuse the disk media name of the failed disk as the disk media name for the new disk:
#
voldg [-g disk_group] -k adddisk disk_media_name=disk_access_name
Verify that the volume's plex state has changed to RECOVER:
#
volinfo -p
vol tst fsgen Unstartable
plex tst-01 RECOVER
Reattach the log plex to the volume:
#
volplex att volume log_plex
6.4.6 Checking the Status of Volume Resynchronization
If the system fails and restarts, LSM automatically recovers all volumes that were running normally at the time of the failure.
For volumes that use mirror plexes and have a DRL plex, this involves resynchronizing all the dirty regions.
For volumes that use a RAID 5 plex and have a RAID 5 log plex, this involves replaying the log plex to complete any outstanding writes.
Configuring redundant volumes with log plexes is the recommended method to speed the recovery of volumes after a system failure. Under normal circumstances, the recovery happens so quickly that there is no noticeable effect (such as performance lag) once the system is running again. However, if the volume had no log, the resynchronization can take a long time (minutes to hours, or longer) depending on the size of the volume.
You can display the status of the volume resynchronization in progress
to determine how long it will take.
(You cannot check the status of
plex resynchronization, which occurs when you replace a failed
disk or add a new plex to a volume; the
volprint
command
does not have access to that information.
However, in these cases, the volume
is usable while the resynchronization occurs.)
To determine the time remaining for a volume resynchronization in progress:
Check the read/write flags for the volume to see the current recovery offset value:
#
volprint -vl volume | grep flags
Information similar to the following is displayed:
flags: open rwback (offset=121488) writeback
Check again after some time has passed (120 seconds is ample) to see how far the recovery has progressed:
#
sleep 120 ; volprint -vl volume | grep flags
Information similar to the following is displayed:
flags: open rwback (offset=2579088) writeback
Calculate the rate of progress by dividing the difference between the offsets by the time that passed between the two checks. For example, in 120 seconds the resynchronization had completed 2457600 sectors. Each second, approximately 20480 sectors (10 MB) were resynchronized.
Divide the size of the volume, in sectors, by the resynchronization rate. This indicates the approximate amount of time a complete resynchronization will take. For example, at a rate of 20480 sectors per second, a volume that is 200 GB (about 419430400 sectors) will take about five and a half hours to resynchronize.
The actual time required can vary, depending on other I/O loads on the system and whether the volume or the system experiences additional problems or failures.
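Using the sample offsets shown above, the arithmetic can be sketched with expr; the 200 GB volume size is expressed here as 419430400 sectors of 512 bytes:
#
expr 2579088 - 121488
2457600
#
expr 2457600 / 120
20480
#
expr 419430400 / 20480
20480
The final result, 20480 seconds, is roughly five and a half hours.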
6.4.6.1 Changing the Rate of Future Volume Resynchronizations
Although you cannot change the rate of (or stop) a volume resynchronization once it has begun, you can change the setting for the rate of future resynchronizations, if your volumes are large enough that the resynchronization has a noticeable impact on system performance during recovery.
Caution
Use this procedure only if you are a knowledgeable system administrator and you have evaluated the effect of volume resynchronization on system performance and determined it to be unacceptable. You must be familiar with editing system files and scripts.
To change the rate of volume resynchronization for future recoveries,
use your preferred editor to modify the indicated line in the
/sbin/lsm-startup
script.
The script contains information similar to the following,
which has been edited for brevity and formatting:
#!/sbin/sh
.
.
.
vold_opts=-k
volrecover_iosize=64k
s_flag=$1
.
.
.
if [ "X`/sbin/voldctl mode 2> /dev/null`" = "Xmode: enabled" ]; then /sbin/volrecover -b -o iosize=$volrecover_iosize -s [1] if [ $is_cluster -eq 1 -a $vold_locked -eq 1 ] then voldctl unlock fi if [ "$s_flag" != "-c" ]; then [ ! -f $STATEDIR/init_lsm ] && swapon -a > /dev/null 2>&1 fi if [ "$s_flag" = "-c" ]; then Pid=`/bin/ps -e | grep "volwatch" | awk '$6 != "grep"'` if [ "X$Pid" = "X" ]; then option=`rcmgr -c get LSMSTART 2> /dev/null` if [ "$option" = "mailplus" ]; then /usr/sbin/volwatch -s & egettxt "LSM volwatch Service started - \ hot spare support" lsmshm.cat:5148 else rcmgr -c set LSMSTART mailonly 2> /dev/null /usr/sbin/volwatch -m & egettxt "LSM volwatch Service started - \ mail only" lsmshm.cat:5116 fi fi Pid=`/bin/ps -e | grep "volnotify -e" | awk '$6 != "grep"'` if [ "X$Pid" = "X" ]; then volnotify_opts=`rcmgr -c get LSM_EVM_OPTS 2> /dev/null` if [ "$volnotify_opts" != "disable" ]; then if [ ! -z "$volnotify_opts" ]; then /usr/sbin/volnotify -e $volnotify_opts > \ /dev/null & else /usr/sbin/volnotify -eicfd >/dev/null & fi fi fi fi else egettxt "LSM: Vold is not enabled for transactions" lsmshm.cat:981 egettxt " No volumes started\n" lsmshm.cat:982 exit fi
Change the indicated line to one of the following:
To slow the rate of recovery, add
-o slow
as follows:
/sbin/volrecover -b -o iosize=$volrecover_iosize -o slow -s
This option inserts a delay of 250ms between each recovery operation. This can considerably reduce the performance impact on the system, depending on the size of the volume and the number of plexes.
To disable resynchronization, add -o delayrecover as follows:
/sbin/volrecover -b -o iosize=$volrecover_iosize -o delayrecover -s
This option requires that you manually begin a resynchronization at your discretion, such as when the system is not under peak demand. Until then, the volume remains in read-writeback mode, which means that every time a region of the volume is read, the data is written to all plexes in the volume. When you eventually initiate the resynchronization, all regions marked dirty are resynchronized, perhaps unnecessarily.
This option incurs performance overhead by writing all reads back to all plexes, which might be less than the impact of permitting the resynchronization to complete during periods of high system demand.
You can change the
/sbin/lsm-startup
script back
to its original state at any time.
6.4.7 Clearing Locks on LSM Volumes
When LSM makes changes to an object's configuration, LSM locks the object until the change is written. If a configuration change terminated abnormally, there might still be a lock on the object.
To determine if an object is locked, enter:
#
volprint [-g disk_group] -vh
In the information displayed, the lock appears in the TUTIL0 column.
To clear the lock, enter:
#
volmend [-g disk_group] clear tutil0 object ...
You might need to restart the volume.
See
Section 5.4.4.
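For example, if a volume datavol in disk group dg1 (hypothetical names) shows a value in the TUTIL0 column, you might clear the lock and restart the volume as follows:
#
volprint -g dg1 -vh datavol
#
volmend -g dg1 clear tutil0 datavol
#
volume -g dg1 start datavol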
6.5 Solving Disk Problems
The following sections describe troubleshooting procedures for failing
and failed disks, including the boot disk.
6.5.1 Checking Disk Status
Disks can experience transient errors for a variety of reasons, such
as when a power supply suffers a surge or a cable is accidentally unplugged.
You can check the status of disks through the output of the
volprint
and
voldisk
commands.
To see the LSM status of a disk, enter:
#
voldisk list
To check the usability of a disk, enter:
#
voldisk check disk
Information similar to the following is displayed:
dsk5: Okay
The
voldisk
command validates the usability of the
given disks by testing whether LSM can read and write the disk header information.
A disk is considered usable if LSM can write and read back at least one of
the disk headers that are stored on the disk.
If a disk in a disk group is
found to be unusable, it is detached from its disk group and all subdisks
stored on the disk become invalid until you replace the physical disk or reassign
the disk media records to a different physical disk.
Note
Because an LSM nopriv disk does not contain a disk header, a failed nopriv disk might continue to be considered okay and usable.
6.5.2 Recovering a Stale Subdisk
LSM usually recovers stale subdisks when the volume starts. However, it is possible that:
The recovery process might get killed.
The volume might be started with an option to prevent subdisk recovery.
The disk on which the subdisk resides might have been replaced without any recovery operations being performed.
To recover a stale subdisk in a volume, enter:
#
volume recover volume subdisk
To recover all stale subdisks in a volume, enter the same command without specifying a subdisk:
#
volume recover volume
6.5.3 Recovering Volumes After a Temporary Disk Failure
If a disk had a temporary failure but is not damaged (for example, the disk was removed by accident, a power cable was disconnected, or some other recoverable problem occurred) and the system was not restarted, you can recover the volumes on that disk. (LSM automatically recovers volumes when the system is restarted.)
To recover from a temporary disk failure:
Make sure the disk is back on line and accessible; for example:
Check that the disk is firmly snapped into the bay.
Reconnect any loose cables.
Perform any other checks appropriate to your system.
Scan for all known disks to ensure the disk is available:
#
voldctl enable
Recover the volumes on the disk:
#
volrecover -sb
6.5.4 Moving a Volume Off a Failing Disk
Often a disk has recoverable (soft) errors before it fails completely. If a disk is experiencing an unusual number of soft errors, move the volume off the disk and replace it.
Note
To replace a failed boot disk, see Section 6.5.6.1.
To move a volume off a failing disk:
Identify the size of the volume on the failing disk:
#
volprint [-g disk_group] -ht [volume]
Ensure there is an equal amount of free space in the disk group:
#
voldg [-g disk_group] free
If there is not enough space, add a new disk. See Section 4.1.2.
Move the volume to a disk other than the failing disk, as
specified by the
!
operand.
Use the appropriate shell quoting
convention to correctly interpret the
!
.
You do not need
to specify a target disk.
#
volassist [-g disk_group] move volume !disk
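For example, to move a volume named datavol off a failing disk dsk5 in disk group dg1 (hypothetical names), escape the ! so that shells which perform history substitution, such as csh, do not interpret it:
#
volassist -g dg1 move datavol \!dsk5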
See Section 6.5.5 for information on replacing a failed disk.
6.5.5 Replacing a Failed Disk and Recovering Its Volumes
When an LSM disk fails completely and its state becomes DETACHED, you must:
Replace the disk with a new disk. For best results, replace a failed disk with the same or similar type of disk.
Recover the volumes that used the failed disk.
Optionally, if the failure caused data to be moved onto a spare disk, you can move the data onto the new disk.
Note
To replace a failed boot disk, see Section 6.5.6.1.
6.5.5.1 Replacing a Failed Disk
To replace a failed disk:
Identify the disk media name of the failed disk using one of the following commands:
To display all disk, disk group, and volume information, and the status of any volumes that are affected by the failed disk, enter:
#
volprint -Aht
To display only the disk information, enter:
#
volprint -Adt
Remove the failed disk and retain the disk media records:
#
voldg [-g disk_group] -k rmdisk disk_media_name
Remove the disk access records, using the disk access name:
#
voldisk rm disk_access_name
Physically remove the failed disk and replace it with the new disk.
Please note that the device must be completely removed from LSM before
running any non-LSM commands to remove and replace the failed disk, such as
hwmgr
-redirect
.
Scan for the new disk:
#
hwmgr -scan scsi
The
hwmgr
command returns the prompt before it completes
the scan.
You need to know that the system has discovered the new disk before
continuing.
See the
hwmgr
(8)
reference page for more information on how to trap the end
of a scan.
Label and initialize the new disk:
If you have a backup of the previous disk's disk label information (Section 4.1.3):
Apply the backup disk label to the new disk:
#
disklabel -R disk_access_name auto file
Initialize the disk for LSM, using the disk access name:
#
voldisk -f init disk_access_name
If no disk label file is available:
Apply a default disk label to the new disk:
#
disklabel -rwn disk_access_name
Initialize the disk for LSM:
#
voldisksetup -i disk
Optionally (but recommended), create a backup copy of the new disk's disk label information:
#
disklabel disk_access_name > file
Add the new disk to the applicable disk group, assigning a disk media name to the disk access name.
You can reuse the disk media name of the failed disk as the disk media name for the new disk. Use the -k option if you want to apply the existing LSM information for the failed disk to the new one:
#
voldg [-g disk_group] [-k] adddisk disk_media_name=disk_access_name
After you replace the disk, the steps you must do next, if any, depend on your setup:
If hot-sparing occurred, the volume is running and requires no recovery.
You can optionally move the data from the spare disk onto the disk you just replaced (Section 3.4.4.4), or you can configure the new disk as the hot-spare disk (Section 3.4.4.2).
If hot-sparing did not occur, you must recover the volume or restore it from backup. See Section 6.5.5.2 for more information.
6.5.5.2 Recovering the Volumes
Use one of the following methods to recover the volume data:
If the volume uses mirror plexes or a RAID 5 plex, start plex resynchronization. If the volume is large, you can run the resynchronization as a background task.
#
volrecover -sb volume
If the volume is not redundant (not mirrored or RAID 5, or has no valid plexes from which to recover), restore the volume data from backup.
Optionally, verify the volume is started:
#
volinfo
Information similar to the following is displayed:
home         fsgen        Started
finance      fsgen        Started
mkting       fsgen        Started
src          fsgen        Started
6.5.6 Recovering from a Boot Disk Failure
When the boot disk on a standalone system is encapsulated into an LSM volume with mirror plexes, failures occurring on the original boot disk are transparent to all users. However, during a failure, the system might:
Write a message to the console indicating there was an error reading or writing to the plex on the boot disk
Experience slow performance (depending on the problem encountered with the disk containing one of the plexes in the root or swap volumes)
To restart the system before you replace the original boot disk, you
can boot from any disk that contains a valid
rootvol
volume.
If all copies of
rootvol
are corrupted and you cannot
boot the system, you must reinstall the operating system.
Replacing a boot disk is a more complex process than replacing other disks because boot-critical data must be placed in specific areas on specific disks for the boot process to find it. How you replace a failed boot disk depends on:
Whether you have mirrored the root disk and enabled hot-sparing support.
Whether the errors are correctable and the same disk can be reused. This is known as readding a disk. If you reuse the boot disk, you should monitor it and replace it during your next maintenance cycle.
Whether the disk has failed completely and must be replaced.
Section 6.5.6.1
gives instructions for replacing
the boot disk, as well as other information related to boot disk recovery.
6.5.6.1 Replacing a Failed Boot Disk
The following procedure assumes that you originally encapsulated the boot disk on a standalone system and created mirror plexes for the boot disk volumes. The last step in this procedure creates a new (replacement) mirror on the new disk.
To replace a failed boot disk under LSM control with a new disk:
Restart the system from the disk that has not failed.
Display the status of all LSM disks and volumes to ensure you use the name of the failed disk and failed plex in the remaining steps:
#
voldisk list
#
volprint -ht
Dissociate the plexes on the failed disk
from the root, swap, and user volumes, if
/usr
or
/var
were encapsulated on the boot disk.
#
volplex -o rm dis rootvol-02 swapvol-02 vol-dsk0g-02
The
/usr
and
/var
volumes have
names derived from the partition letter of the boot disk (for example, vol-dsk0g).
Remove the failed LSM disks for the boot disk:
Remove the disks from the rootdg disk group:
#
voldg rmdisk dskna dsknb dskng ...
Remove the LSM disks configured on the boot disk from LSM control:
#
voldisk rm dskna dsknb dskng ...
Physically remove and replace the failed disk.
Please note that the device must be completely removed from LSM before
running any non-LSM commands to remove and replace the failed disk, such as
hwmgr
-redirect
.
Scan for the new disk:
#
hwmgr -scan scsi
The
hwmgr
command returns the prompt before it completes
the scan.
You need to know that the system has discovered the new disk before
continuing.
See the
hwmgr
(8)
reference page for more information on how to trap the end
of a scan.
Modify the device special files, reassigning the old disk name to the new disk. Make sure you list the new disk first.
#
dsfmgr -e new_name old_name
Label the new disk, setting all partitions to unused:
#
disklabel -rw new_disk
Mirror the existing root volumes onto the new disk:
#
volrootmir new_disk
The boot disk volumes are restored and ready for use.
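Optionally, confirm that the new mirror plexes are attached and synchronizing; rootvol and swapvol are the standard volume names that boot disk encapsulation creates, so adjust the list if your configuration differs:
#
volprint -ht rootvol swapvol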
6.6 Problems Importing a Disk Group
If you receive an error message when trying to import a disk group or the command fails, possible causes are:
One or more of the disks contains the host ID of another system.
To verify this, enter:
#
voldisk list disk_access_name
If the host ID of the disk does not match that of the system where you are trying to import the disk group, enter:
#
voldisk clearimport disk_access_name
You can now import the disk group. (A worked example appears at the end of this section.)
One or more of the disks might be inaccessible.
Some disks might have failed. You can forcibly import the disk group and resolve the problem later; for example, replace the failed disk.
To forcibly import a disk group, enter:
#
voldg -f import disk_group
Once the disk group is imported, you can identify and solve the problem.
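Returning to the first cause above, a worked example of clearing another system's host ID might look like the following; the disk dsk8 and disk group datadg are hypothetical, and the exact fields that voldisk list reports can vary:
#
voldisk list dsk8 | grep hostid
hostid:   hostA
#
voldisk clearimport dsk8
#
voldg import datadg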