13 Advanced Volume Management Concepts

The two basic functions of the LSM software are to satisfy read and write requests, and to ensure that data is available. LSM automatically ensures that when you mirror disks, the corresponding plexes contain the same data. Under certain circumstances, LSM must perform a copy operation to ensure that corresponding plexes are mirrors of each other.

The state of LSM plexes and volumes can vary during the life of the LSM configuration. Changes in the composition of the volume are inevitable because:

Disk drives occasionally need corrective maintenance
New disks are added to replace other disks
System failures occur, requiring copy operations to take place within the LSM volume

This chapter describes how LSM manages plexes and volumes, and provides information for administrators who want to understand LSM plex and volume states, usages, and policies.

13.1 Plex States

System administrators look at plex states to see whether or not plexes are complete and consistent copies of the volume contents. Although the LSM utilities automatically maintain a plex's state, it is possible for you to modify the state of a plex if necessary. For example, if a disk with a particular plex located on it begins to fail, you can temporarily disable the plex. See the volume init command on the volume(8) reference page and Section 7.6 for information about modifying plex states.

LSM utilities use plex states to:

Indicate whether volume contents have been initialized to a known state
Determine if a plex contains a valid copy of the volume contents
Track whether a plex was in active use at the time of a system failure
Monitor operations on plexes

Plex states are an important aspect of high data availability. The following subsections describe all of the different plex states, how they change to indicate abnormalities, and what LSM does to normalize the plex state again.

Plexes that are associated with a volume will be in one of the states shown in Table 13-1.

Table 13-1: LSM Plex States

Plex State	Description
ACTIVE	A plex can be in the ACTIVE state in two situations: 1) When the volume is started and the plex fully participates in normal volume I/O (meaning that the plex contents change as the contents of the volume change) and, 2) When the volume was stopped as a result of a system crash and the plex was ACTIVE at the moment of the crash.
	Because of the impossibility of making atomic changes to more than one plex, a system failure may leave plex contents in an inconsistent state. When a volume is started, LSM performs a recovery action to guarantee that the contents of the plexes that are marked as ACTIVE are made identical. On a system that is performing well, you should see most volume plexes in the ACTIVE plex state.
CLEAN	A plex is in a CLEAN state when it contains a consistent copy of the volume contents and a `volume` `stop` operation has disabled the volume. As a result, when all plexes of a volume are CLEAN, no action is required to guarantee that the plexes are identical when that volume is started.
EMPTY	Volume creation sets all plexes associated with the volume to the EMPTY state to indicate to the usage type utilities (volume usage types are discussed later in this chapter) that the volume contents have not yet been initialized.
IOFAIL	On the detection of a failure of an ACTIVE plex, `vold` places that plex in the IOFAIL state so that it is disqualified from the recovery selection process at volume start time.
OFFLINE	The `volmend` `off` operation indefinitely detaches a plex from a volume by setting the plex state to OFFLINE. Although the detached plex maintains its association with the volume, changes to the volume do not update the OFFLINE plex until the plex is reattached with the `volplex` `att` operation. When this occurs, the plex is placed in the STALE state, which causes its contents to be recovered.
STALE	If there is a possibility that a plex does not have the complete and current volume contents, that plex is placed in the STALE state. Also, if an I/O error occurs on a plex, the kernel stops using and updating the contents of that plex, and a `volume` `stop` operation sets the state of the plex to STALE.
	A `volume` `start` operation revives STALE plexes from an ACTIVE plex. Atomic copy operations copy the contents of the volume to the STALE plexes. The system administrator can force a plex to the STALE state with a `volplex` `det` operation.
TEMP	Setting a plex to the TEMP state facilitates some plex operations that cannot occur in a truly atomic fashion. For example, attaching a plex to an enabled volume requires copying volume contents to the plex before it can be considered fully attached.
	A utility will set the plex state to TEMP at the start of such an operation and to a final state at the end of the operation. If the system goes down for any reason, a TEMP plex state indicates that the operation is incomplete; a `volume` `start` will disassociate plexes in the TEMP state.
TEMPRM	A TEMPRM plex state resembles a TEMP state except that, at the completion of the operation, the TEMPRM plex is removed. Some subdisk operations require a temporary plex. Associating a subdisk with a plex, for example, requires updating the subdisk with the volume contents before actually associating the subdisk. This update requires associating the subdisk with a temporary plex, marked TEMPRM, until the operation completes and removes the TEMPRM plex.
	If the system goes down for any reason, the TEMPRM state indicates that the operation did not complete successfully. A subsequent `volume` `start` operation will disassociate and remove TEMPRM plexes.

13.1.1 Plex State Cycle

Plex states change as a normal part of disk operations. However, deviations in plex state indicate abnormalities that LSM must normalize. Table 13-2 describes possible failure scenarios and the actions LSM takes to fix deviations.

Table 13-2: How LSM Handles Changes in Plex States

Cycle	During Normal Operations	When a Crash Occurs
Startup	The `volume` `start` operation makes all CLEAN plexes ACTIVE. The plexes remain marked as ACTIVE unless a crash occurs.	If a crash occurs between startup and shutdown, the volume-starting operation does not find any CLEAN plexes, only ACTIVE plexes. The operation then establishes one plex as an up to date and suitable source for reviving the other plexes. LSM marks that source plex ACTIVE and marks all others as STALE. The volume-usage type determines which plex is selected as the source plex.[Table Note 1] [Table Note 2]
Shutdown	If all goes well until shutdown, the volume-stopping operation marks all ACTIVE plexes CLEAN and the cycle continues. Having all plexes CLEAN at startup (before `volume` `start` makes them ACTIVE) indicates a normal shutdown and optimizes startup.	If an I/O error occurred and caused a plex to become disabled, the volume-stopping operation marks the plex in which the error occurred as STALE. Any STALE plexes require recovery. When the system restarts, a utility copies data from an ACTIVE to a STALE plex and makes the STALE plex ACTIVE.

Table Notes:

Any plex can serve as the source for generic (gen) usage type volumes. The most up-to-date plex is selected for file system generic (fsgen) volumes. If the startup operation finds neither a CLEAN nor an ACTIVE plex, the system administrator must use volmend to select a plex to be set to CLEAN.
If the volume has the writeback-on-read flag set, all ACTIVE plexes are attached to the volume as ACTIVE plexes. A process is forked which reads the entire volume and the read data is written to the remaining plexes. Refer to Section 13.1.6 for more details.
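
The cycle in Table 13-2 can be sketched as a small state model. This is only an illustration, not LSM code: the function names and the dictionary representation are hypothetical, and choosing the first ACTIVE plex as the source stands in for the usage-type selection described in the table notes.

```python
def start_volume(plexes):
    """Simplified `volume start`. plexes maps plex name -> plex state.

    Returns (new_states, source), where source is the plex chosen to
    revive the others, or None after a normal shutdown.
    """
    # Plexes left in TEMP or TEMPRM belong to interrupted operations
    # and are dissociated when the volume is started.
    states = {n: s for n, s in plexes.items() if s not in ("TEMP", "TEMPRM")}
    clean = [n for n, s in states.items() if s == "CLEAN"]
    active = [n for n, s in states.items() if s == "ACTIVE"]
    if clean:
        # Normal shutdown: CLEAN plexes simply become ACTIVE.
        for n in clean:
            states[n] = "ACTIVE"
        return states, None
    if not active:
        # Neither CLEAN nor ACTIVE: the administrator must pick a plex.
        raise RuntimeError("no CLEAN or ACTIVE plex; select one with volmend")
    # Crash case: keep one source ACTIVE and mark the rest STALE.
    source = active[0]  # assumption: stand-in for usage-type selection
    for n in active[1:]:
        states[n] = "STALE"
    return states, source


def revive(states):
    """Model the copy from an ACTIVE plex that makes STALE plexes ACTIVE."""
    if "ACTIVE" not in states.values():
        raise RuntimeError("no ACTIVE plex to copy from")
    return {n: ("ACTIVE" if s == "STALE" else s) for n, s in states.items()}


def stop_volume(states, failed=()):
    """Simplified `volume stop`: ACTIVE -> CLEAN, failed plexes -> STALE."""
    out = {}
    for n, s in states.items():
        if n in failed:
            out[n] = "STALE"
        elif s == "ACTIVE":
            out[n] = "CLEAN"
        else:
            out[n] = s
    return out
```

The sketch ignores the writeback-on-read variant, in which all ACTIVE plexes stay ACTIVE during recovery.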

13.1.2 Plex Kernel State

The plex kernel state indicates the accessibility of the plex. The plex kernel state is monitored in the volume driver and allows a plex to have an offline (DISABLED), maintenance (DETACHED), and online (ENABLED) mode of operation. These modes are described in the following table.

Mode	Description
DISABLED	The plex may not be accessed.
DETACHED	A write to the volume is not reflected to the plex. A read request from the volume will never be satisfied from the plex device. Plex operations and `ioctl` functions are accepted.
ENABLED	A write request to the volume will be reflected to the plex, if the plex is set to ENABLED for write mode. A read request from the volume is satisfied from the plex if the plex is set to ENABLED.

13.1.3 Plex Layout Policy

Plexes are logical groupings of subdisks that create an area of disk space that is independent of any physical disk size. The subdisks that make up a plex can be filled with data in two ways, concatenation or striping:

Concatenation places sequentially written data on subdisks in the order that the data was created. The first subdisk is filled, then the second, and so on.
Striping alternates sections of plex data among multiple disks. To accomplish this, the subdisks forming a plex must all be of equal size, and each is divided into stripe blocks of equal size (the stripe width). Data is placed on the subdisks by stripe; sequentially written data fills stripe block 0 on Subdisk 1 first, then stripe block 1 on Subdisk 2, and so on.
For striping to be effective, subdisks must be spread across multiple disks, with one stripe per disk. Striping provides a performance advantage over concatenation because striping allows parallel I/O activity. You can use striping to distribute hot spots or areas of high I/O traffic across multiple devices.

See also Chapter 2 for more information about concatenation and striping.
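
The difference between the two layouts can be illustrated by mapping a plex block number to a subdisk and an offset within it. The following is a minimal sketch, not LSM code: the function names are hypothetical, sizes are in blocks, and subdisks are numbered from 0 here rather than from 1.

```python
def concat_map(block, subdisk_lens):
    """Concatenation: fill the first subdisk, then the second, and so on."""
    for index, length in enumerate(subdisk_lens):
        if block < length:
            return index, block
        block -= length
    raise ValueError("block is beyond the end of the plex")


def stripe_map(block, n_subdisks, stripe_width):
    """Striping: consecutive stripe blocks rotate across the subdisks."""
    stripe_block, within = divmod(block, stripe_width)
    row, subdisk = divmod(stripe_block, n_subdisks)
    return subdisk, row * stripe_width + within
```

With two subdisks and a stripe width of 4 blocks, plex blocks 0 through 3 land on the first subdisk, blocks 4 through 7 on the second, and block 8 returns to the first subdisk at offset 4. Sequential I/O is therefore spread across both disks, whereas under concatenation it stays on one disk until that subdisk is full.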

13.1.4 Block-Change Logging

Block-change logging is a method used to dramatically reduce synchronization overhead of a mirrored volume during recovery in case of a system failure. Block-change logging keeps a log of the blocks that have changed due to I/O writes to a plex. If block-change logging is not enabled and a system failure occurs, LSM must restore all plexes to a consistent state by copying the full contents of an ACTIVE plex to the STALE plexes. This process can be lengthy and I/O intensive.

Block-change logging tracks writes by identifying and logging the block number that has changed, and stores this number in a logging subdisk. The block-change log maintains a record of all pending I/O records. Because the block-change log is only one block long, and because the log stores the I/O information until a process completes, the log may not be able to store all I/O changes as they occur. Once a process completes, it is flushed from the log and pending I/O information is then placed in the log.

Log records are written before the data is written. Thus, if the system experiences a crash, on system restart LSM searches the log and uses the log IDs to determine which plex contains the latest data written before the crash. In this way, plexes remain consistent, and except for possibly the last write before the crash, data is intact and up to date.

Block-change logging is enabled if two or more plexes in a mirrored volume have a logging subdisk associated with them. In addition, only the blocks recorded in the log of the ACTIVE plex need to be copied to restore the STALE plexes and maintain data integrity. For example, the following commands create a one-block subdisk and associate it with a plex as a logging subdisk:

# volmake sd rz1-01 len=1 rz1
# volsd aslog vol01-01 rz1-01

Block-change logging can add some overhead to your system, because LSM must perform an extra I/O for every write operation. If you want to disable block-change logging, use a command like the following:

# voledit set logtype=none volume_name
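
The idea behind block-change logging (log the block number first, write the data second, and copy only logged blocks during recovery) can be sketched with a toy model. This is an illustration under simplifying assumptions, not the LSM implementation: the class and its `crash_after` parameter are hypothetical, the log is an unbounded in-memory set rather than a one-block logging subdisk, and plexes are plain lists.

```python
class MirroredVolume:
    """Toy mirrored volume with a block-change log."""

    def __init__(self, n_blocks, n_plexes=2):
        self.plexes = [[0] * n_blocks for _ in range(n_plexes)]
        self.log = set()  # block numbers with writes still in flight

    def write(self, block, value, crash_after=None):
        # The log record is written before any data is written.
        self.log.add(block)
        for index, plex in enumerate(self.plexes):
            if crash_after is not None and index >= crash_after:
                return  # simulated crash: the log entry survives
            plex[block] = value
        # Every plex has the data, so the entry can be flushed.
        self.log.discard(block)

    def recover(self, source=0):
        """Copy only the logged blocks from the source plex to all plexes."""
        copied = sorted(self.log)
        for block in copied:
            for plex in self.plexes:
                plex[block] = self.plexes[source][block]
        self.log.clear()
        return copied
```

For example, if a write to block 5 crashes after reaching only the first plex, `recover()` copies that single block instead of the whole volume, which is where the reduction in recovery time comes from.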

13.1.5 Persistent State Logging

Persistent state logging ensures that only active plexes are used for recovery purposes and prevents failed plexes from being selected for recovery. Table 13-3 describes how persistent state logging solves problems due to plex failures and recovery.

Table 13-3: Recovering from Plex Failures

Problem	Resolution
LSM error policies sometimes detach failing plexes. The state information about the failed plex is maintained on disk and is kept by the kernel as a dynamic record of the state of the configuration. However, in the event of a system failure, the kernel state of the plex at the time of the failure is unknown. Thus, during recovery, LSM might select a plex that was detached from the volume (due to an error) some time before the system failed, and the data contained in that plex could be significantly out of date.	When LSM detaches a failed plex, it immediately writes a record to the persistent state log. This way, even if a system failure occurs between the time that the plex is detached and the time that the state change is logged, the detached plex is disqualified from the recovery selection process. Persistent state logging can therefore guarantee that only active plexes are selected for recovery purposes.[Table Note 1]
Without persistent state logging, a system crash causes all plexes to go through a recovery process, regardless of whether or not the plexes had been accessed.	Persistent state logging prevents unnecessary plex recoveries associated with started volumes that have never been accessed. Persistent state logging maintains a record of the first write to a volume and also of the last close of the volume so that no plex recovery is attempted following a crash.[Table Note 2]

Table Notes:

A special plex state, IOFAIL, exists for failed plexes. As soon as the failure of an active plex is detected, vold places that plex in the IOFAIL state to ensure that it is disqualified from the selection process for any subsequent volume start operation.
With persistent state logging, a transaction completion record is logged whenever a transaction completes. In this way, a later recovery will be able to determine the state of pending transactions and perform the appropriate recovery action.
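
The selection rule described in Table 13-3 can be sketched as a filter over the plexes of a volume. This is an illustration only: the function name and the representation of the persistent state log as a set of detached plex names are assumptions, not LSM data structures.

```python
def recovery_candidates(plex_states, detach_log):
    """Return the plexes that may serve as a recovery source.

    plex_states maps plex name -> plex state; detach_log is the set of
    plex names recorded as detached in the (modeled) persistent state log.
    """
    return [
        name
        for name, state in plex_states.items()
        if state == "ACTIVE" and name not in detach_log
    ]
```

A plex that still appears ACTIVE in the configuration but has a detach record in the log is skipped, as is any plex already placed in the IOFAIL state.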

13.1.6 Plex Resynchronizing Policy

During a volume start operation for an LSM volume, it might be necessary to resynchronize plexes that have become out of date using the following steps:

A set of the volume's plexes are chosen as being up to date or most up to date and the volume is made available with the up-to-date plexes in the ACTIVE state.
The remaining plexes are then set to the STALE state and are made unavailable for reading. These plexes are brought up to date with the ACTIVE plexes through a VOL_COPY process that copies the contents of the ACTIVE plexes to the out-of-date plexes and changes their state to ACTIVE.
The volume is then available with all plexes active.

During the resynchronization process, however, it is possible that only one plex is active for a volume. For the duration of the resynchronization, the volume is therefore vulnerable to I/O failures because there is no reliable redundancy of the contents of the volume. This might leave the volume in an irrecoverable state if an error is encountered.

The writeback-on-read mode avoids such problems. When a read is received for a volume, the writeback-on-read flag causes the data read to be written back to all other plexes in the volume. The volume start command uses the writeback-on-read mode to start volumes with only ACTIVE plexes from which to recover.

The writeback-on-read model works as follows:

All plexes that are marked as ACTIVE are attached to the volume as ACTIVE upon volume startup, and the writeback-on-read flag is set for the volume.
The start operation then forks off a process that generates read I/O for the entire volume. The reads are serviced normally (that is, a plex from among the ACTIVE plexes is chosen via the volume read policy) and the read data is then written back to the remaining plexes.
When the read loop has finished, the writeback-on-read flag is unset and all the plexes of the volume are now consistent.

The success of writeback-on-read depends on the existence of Persistent State Logging, which guarantees that plexes marked as ACTIVE and not marked as DETACHED in the persistent state log area were all active at the time of the system crash. If this is the case, then the only areas of the ACTIVE plexes where data may disagree are those places that had active write I/O at the time of the crash.

Data in these areas is not guaranteed to be correct in terms of representing the data either before or after the write, but LSM guarantees that the data areas are consistent across all plexes. Writeback-on-read supports this consistency.

In addition, the failure of any plex during the synchronization process results in the normal error processing being performed without entering an irrecoverable state, because other plexes are available for use.
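
The read loop of the writeback-on-read model can be sketched as follows. This is a simplified illustration, not the LSM implementation: the function name is hypothetical, plexes are modeled as equal-length lists of block values, and a round-robin chooser stands in for the volume read policy.

```python
from itertools import cycle


def writeback_on_read_resync(plexes):
    """Read every block once and write the data read to all other plexes."""
    chooser = cycle(range(len(plexes)))  # assumption: round-robin reads
    for block in range(len(plexes[0])):
        source = next(chooser)
        data = plexes[source][block]
        for index, plex in enumerate(plexes):
            if index != source:
                plex[block] = data  # write back the data that was read
    return plexes
```

Given plexes `[1, 2, 3]` and `[1, 9, 3]` that disagree on the second block, the loop leaves both plexes as `[1, 9, 3]`. Which of the two values survives depends on the plex that happened to serve the read, which matches the guarantee above: the plexes end up consistent, but a disputed area is not guaranteed to hold either the pre-write or the post-write data.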

13.2 Volume States

Volume states indicate whether the volume is initialized, whether it has been written to, and whether it is accessible.

The interpretation of these volume states during volume startup is modified by the persistent state log for the volume (for example, the dirty/clean state). If the flag for the clean state is set, this means that an ACTIVE volume was not written to by any processes or was not open at the time of the reboot; therefore, it can be considered CLEAN. The flag for the clean state will always be set in any case where the volume is marked CLEAN.

Table 13-4 describes the volume states, some of which are similar to plex states.

Table 13-4: LSM Volume States

State	Description
ACTIVE	The volume has been started (kstate is currently ENABLED) or was in use (kstate was ENABLED) when the machine was rebooted. If the volume is currently ENABLED, the state of its plexes at any moment is not certain (since the volume is in use). If the volume is currently DISABLED, this means that the plexes cannot be guaranteed to be consistent.
CLEAN	The volume is not started (kstate is DISABLED) and its plexes are synchronized.
NEEDSYNC	The volume is not started (kstate is DISABLED) and its plexes are not synchronized. This can occur after a power failure or system failure.
EMPTY	The volume contents are not initialized. The kstate is always DISABLED when the volume is EMPTY.
SYNC	The volume is either in read-writeback mode (kstate is currently ENABLED) or was in read-writeback mode when the machine was rebooted (kstate is DISABLED). If the volume is ENABLED, this means that the plexes are being resynchronized via the read-writeback recovery. If the volume is DISABLED, it means that the plexes were being resynchronized via read-writeback when the machine rebooted and therefore still need to be synchronized.

The following subsections describe the different volume states, how they change to indicate abnormalities, and what LSM does to normalize the volume state.

13.2.1 Volume Kernel State

The volume kernel state indicates the accessibility of the volume. The volume kernel state allows a volume to have an offline (DISABLED), maintenance (DETACHED), and online (ENABLED) mode of operation. These modes are described in the following table:

Mode	Description
DISABLED	The volume cannot be accessed.
DETACHED	The volume cannot be read or written, but plex device operations and `ioctl` functions are accepted.
ENABLED	The volume can be read and written.

13.2.2 Volume Usage Types

A volume usage type is a type label given to each volume under LSM control. Just as a file system type establishes and enforces policies for file operations, a volume usage type establishes and enforces policies for volume operations. The rules and capabilities differ for different usage types. The volume usage types affect such things as plex synchronization and error handling.

LSM provides the options described in the following table for volume usage types.

Option	Description
`fsgen` (file system generic)	The `fsgen` usage type assumes the volume is being used by a file system. This usage type assumes there is a way to synchronize file system data to a volume during `volplex` or snapshot procedures. It uses the file system time stamp to see which plex is most up to date. It determines the file system type and calls an appropriate procedure to do a synchronization just prior to the plex split off.
`gen` (generic)	The `gen` usage type makes no assumptions regarding the data content of the volume. This usage type does not handle synchronization. The `gen` option is useful for databases that reside directly on volumes.

Operations that are dependent on usage-type must determine the usage type of a volume before switching control to a utility customized for that usage type. For example, following a failure, the gen and fsgen usage types -- using different algorithms -- guarantee that all plexes of a volume are identical.

Note
Use the fsgen usage type when creating a volume if the volume is to be used by a file system. Otherwise, use the gen usage type.

13.2.3 Volume Read Policy

Starting a volume changes its kernel state from DISABLED or DETACHED to ENABLED; stopping a volume changes its kernel state from ENABLED or DETACHED to DISABLED.

Table 13-5 describes the three volume-read policies that can be selected.

Table 13-5: LSM Volume Read Policies

Policy	Description
Round	Prescribes round-robin reads of enabled plexes (this is the default read policy). If the read policy for a volume has been set to `round`, the read policy for that volume evenly distributes I/O read requests between all plexes of that volume. Reads are distributed by alternating read requests to each plex in a volume. This policy is preferred in cases where read access performance for all plexes is the same.
Prefer	Prescribes preferential reads from a specified plex. Setting the read policy to `prefer` designates a specific plex of a particular volume to be used for I/O read requests. This policy is preferred if the read access performance of one plex is better than that of the other mirrors, for example, when one of the plexes is striped and the other plexes are concatenated.
Select	Selects a default read policy, based on the plex associations to the volume. If the volume contains a single, enabled, striped plex, the default is to prefer that plex. For any other set of plex associations, the default is to use a round-robin policy.
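
The three policies in Table 13-5 can be sketched as a function that returns a plex chooser. This is an illustration, not LSM code: the function name and the dictionary fields are hypothetical, and the `select` rule is interpreted here as "exactly one enabled striped plex".

```python
from itertools import cycle


def make_reader(policy, plexes, preferred=None):
    """Return a function that names the plex to use for the next read.

    plexes is a list of dicts with "name", "layout", and "enabled" keys.
    """
    enabled = [p for p in plexes if p["enabled"]]
    if policy == "select":
        # Prefer a single striped plex; otherwise fall back to round-robin.
        striped = [p for p in enabled if p["layout"] == "striped"]
        if len(striped) == 1:
            policy, preferred = "prefer", striped[0]["name"]
        else:
            policy = "round"
    if policy == "prefer":
        return lambda: preferred
    if policy == "round":
        names = cycle(p["name"] for p in enabled)
        return lambda: next(names)
    raise ValueError("unknown read policy: " + policy)
```

With one concatenated plex `p1` and one striped plex `p2`, a `round` reader alternates `p1`, `p2`, `p1`, while a `select` reader always returns `p2`.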

13.2.4 Managing Available Disk Space

Utilities such as the volassist utility obtain information on available disk space and use the information to calculate acceptable layouts for LSM objects. The concept of free space management is based on the idea that free space mapping can be derived by mapping out the existing allocations from the total space on the disk. The methods and specifications for dealing with the allocations and layouts may be provided at the command line; otherwise, they are obtained from defaults specified in a default file or internally.
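
Deriving free space by mapping the existing allocations out of the total disk space can be sketched as follows. This is a minimal illustration, not the `volassist` algorithm: the function name is hypothetical, and allocations are modeled as (offset, length) pairs in blocks.

```python
def free_extents(disk_len, allocations):
    """Return the free (offset, length) extents left on a disk."""
    free = []
    cursor = 0  # first block not yet known to be allocated
    for offset, length in sorted(allocations):
        if offset > cursor:
            free.append((cursor, offset - cursor))  # gap before this extent
        cursor = max(cursor, offset + length)
    if cursor < disk_len:
        free.append((cursor, disk_len - cursor))  # space after the last extent
    return free
```

For a 100-block disk with allocations at (10, 20) and (50, 10), the free extents are (0, 10), (30, 20), and (60, 40). A layout utility could then search these extents for one large enough to hold a new subdisk.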

13.3 Implementing LSM Configuration Changes

When a system administrator makes changes to a set of LSM objects, LSM groups the changes into a transaction. For any transaction, LSM ensures that either all related changes occur successfully or none of the changes are made. LSM makes all configuration changes appear to occur simultaneously and any intermediate stages of change are invisible. If a problem is encountered during the transaction, LSM does not allow any changes to occur and returns the configuration to its original state.

To achieve these atomic, all-or-nothing configuration changes, the LSM volume daemon (vold) envelops all configuration changes into a transaction by performing the following steps:

Locks all affected objects
Gets information about the locked records
Records prospective changes
Makes all the changes
Unlocks the changed objects

The result is that atomic transactions:

Permit several system administrators to make concurrent changes to the configuration.
Prevent inconsistent LSM configurations from occurring when there is a system failure.

If a system failure occurs during a transaction, restarting the system causes the vold utility to back out the partial changes. This prevents the disks maintained by LSM from becoming inconsistently configured.
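
The all-or-nothing behavior can be sketched with a small in-memory model that follows the five steps listed above. This is an illustration only, not how `vold` is implemented: the class name, the dictionary-based configuration, and the use of callables to describe changes are all assumptions.

```python
import copy


class ConfigTransaction:
    """Toy model of an atomic configuration change."""

    def __init__(self, config):
        self.config = config  # maps object name -> record (a dict)
        self.locked = set()

    def apply(self, changes):
        """changes maps object name -> function(record) -> new record."""
        names = list(changes)
        self.locked.update(names)  # 1. lock all affected objects
        try:
            # 2. get information about the locked records
            snapshot = {n: copy.deepcopy(self.config[n]) for n in names}
            # 3. record prospective changes without touching the config
            pending = {n: changes[n](snapshot[n]) for n in names}
            # 4. make all the changes; reached only if every step succeeded
            self.config.update(pending)
        finally:
            self.locked.difference_update(names)  # 5. unlock the objects
```

Because the prospective changes are computed on copies, an error in any one of them leaves the configuration exactly as it was, and the locks are released either way.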

Refer to the vold(8) reference page for additional information about the volume daemon.
