The two basic functions of the LSM software are to satisfy read and write requests, and to ensure that data is available. LSM automatically ensures that when you mirror disks, the corresponding plexes contain the same data. Under certain circumstances, LSM must perform a copy operation to ensure that corresponding plexes are mirrors of each other.
The state of LSM plexes and volumes can vary during the life of the LSM configuration. Changes in the composition of the volume are inevitable because:
This chapter describes how LSM manages plexes and volumes, and provides information for administrators who want to understand LSM plex and volume states, usages, and policies.
System administrators look at plex states to see whether or not plexes are complete and consistent copies of the volume contents. Although the LSM utilities automatically maintain a plex's state, it is possible for you to modify the state of a plex if necessary. For example, if a disk with a particular plex located on it begins to fail, you can temporarily disable the plex. See the volume init command on the volume(8) reference page and Section 7.6 for information about modifying plex states.
LSM utilities use plex states to:
Plex states are an important aspect of high data availability. The following subsections describe all of the different plex states, how they change to indicate abnormalities, and what LSM does to normalize the plex state again.
Plexes that are associated with a volume will be in one of the states shown in Table 13-1.
Plex State | Description |
ACTIVE | A plex can be in the ACTIVE state in two situations: 1) when the volume is started and the plex fully participates in normal volume I/O (meaning that the plex contents change as the contents of the volume change), and 2) when the volume was stopped as a result of a system crash and the plex was ACTIVE at the moment of the crash. Because changes cannot be made atomically to more than one plex, a system failure may leave plex contents in an inconsistent state. When a volume is started, LSM performs a recovery action to guarantee that the contents of the plexes that are marked as ACTIVE are made identical. On a system that is performing well, you should see most volume plexes in the ACTIVE plex state. |
CLEAN | A plex is in a CLEAN state when it contains a consistent copy of the volume contents and a volume stop operation has disabled the volume. As a result, when all plexes of a volume are CLEAN, no action is required to guarantee that the plexes are identical when that volume is started. |
EMPTY | Volume creation sets all plexes associated with the volume to the EMPTY state to indicate to the usage type utilities (volume usage types are discussed later in this chapter) that the volume contents have not yet been initialized. |
IOFAIL | On the detection of a failure of an ACTIVE plex, vold places that plex in the IOFAIL state so that it is disqualified from the recovery selection process at volume start time. |
OFFLINE | The volmend off operation indefinitely detaches a plex from a volume by setting the plex state to OFFLINE. Although the detached plex maintains its association with the volume, changes to the volume do not update the OFFLINE plex until the plex is reattached with the volplex att operation. When this occurs, the plex is placed in the STALE state, which causes its contents to be recovered. |
STALE | If there is a possibility that a plex does not have the complete and current volume contents, that plex is placed in the STALE state. Also, if an I/O error occurs on a plex, the kernel stops using and updating the contents of that plex, and a volume stop operation sets the state of the plex to STALE. A volume start operation revives STALE plexes from an ACTIVE plex. Atomic copy operations copy the contents of the volume to the STALE plexes. The system administrator can force a plex to the STALE state with a volplex det operation. |
TEMP | Setting a plex to the TEMP state facilitates some plex operations that cannot occur in a truly atomic fashion. For example, attaching a plex to an enabled volume requires copying volume contents to the plex before it can be considered fully attached. A utility will set the plex state to TEMP at the start of such an operation and to a final state at the end of the operation. If the system goes down for any reason, a TEMP plex state indicates that the operation is incomplete; a volume start will disassociate plexes in the TEMP state. |
TEMPRM | A TEMPRM plex state resembles a TEMP state except that at the completion of the operation, the TEMPRM plex is removed. Some subdisk operations require a temporary plex. Associating a subdisk with a plex, for example, requires updating the subdisk with the volume contents before actually associating the subdisk. This update requires associating the subdisk with a temporary plex, marked TEMPRM, until the operation completes and removes the TEMPRM plex. If the system goes down for any reason, the TEMPRM state indicates that the operation did not complete successfully. A subsequent volume start operation will disassociate and remove TEMPRM plexes. |
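The start-time behavior described in Table 13-1 can be summarized in a small model. The following Python sketch is illustrative only and is not LSM code: the function name start_volume and the action strings are hypothetical, and the handling of EMPTY, IOFAIL, and OFFLINE plexes (left unchanged) is a simplifying assumption.

```python
def start_volume(plexes):
    """Simplified model of a volume start (hypothetical, not LSM code).

    plexes maps a plex name to its state before the start.
    Returns (states_after_start, actions_taken).
    """
    states, actions = {}, []
    # Plexes still marked ACTIVE at start time imply a crash: they must be
    # made identical by a recovery action.
    if any(s == "ACTIVE" for s in plexes.values()):
        actions.append("recover ACTIVE plexes")
    have_source = any(s in ("ACTIVE", "CLEAN") for s in plexes.values())
    for name, state in plexes.items():
        if state in ("ACTIVE", "CLEAN"):
            states[name] = "ACTIVE"
        elif state == "STALE" and have_source:
            actions.append(f"revive {name}")      # copied from an ACTIVE plex
            states[name] = "ACTIVE"
        elif state == "TEMP":
            actions.append(f"disassociate {name}")
        elif state == "TEMPRM":
            actions.append(f"disassociate and remove {name}")
        else:
            # Assumption: IOFAIL, OFFLINE, EMPTY, and unrevivable STALE
            # plexes are left as they are in this sketch.
            states[name] = state
    return states, actions
```

For example, starting a volume whose plexes are ACTIVE, STALE, TEMP, and IOFAIL leaves two ACTIVE plexes and one IOFAIL plex in this model, and the TEMP plex is no longer associated with the volume.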
Plex states change as a normal part of disk operations. However, deviations in plex state indicate abnormalities that LSM must normalize. Table 13-2 describes possible failure scenarios and the actions LSM takes to fix deviations.
Cycle | During Normal Operations | When a Crash Occurs |
Startup | The volume start operation makes all CLEAN plexes ACTIVE. The plexes remain marked as ACTIVE unless a crash occurs. | If a crash occurs between startup and shutdown, the volume-starting operation does not find any CLEAN plexes, only ACTIVE plexes. The operation then establishes one plex as an up-to-date and suitable source for reviving the other plexes. LSM marks that source plex ACTIVE and marks all others as STALE. The volume usage type determines which plex is selected as the source plex. |
Shutdown | If all goes well until shutdown, the volume-stopping operation marks all ACTIVE plexes CLEAN and the cycle continues. Having all plexes CLEAN at startup (before volume start makes them ACTIVE) indicates a normal shutdown and optimizes startup. | If an I/O error occurred and caused a plex to become disabled, the volume-stopping operation marks the plex in which the error occurred as STALE. Any STALE plexes require recovery. When the system restarts, a utility copies data from an ACTIVE to a STALE plex and makes the STALE plex ACTIVE. |
Table Note:
The plex kernel state indicates the accessibility of the plex. The plex kernel state is monitored in the volume driver and allows a plex to have an offline (DISABLED), maintenance (DETACHED), and online (ENABLED) mode of operation. These modes are described in the following table.
Mode | Description |
DISABLED | The plex may not be accessed. |
DETACHED | A write to the volume is not reflected to the plex. A read request from the volume will never be satisfied from the plex device. Plex operations and ioctl functions are accepted. |
ENABLED | A write request to the volume will be reflected to the plex, if the plex is set to ENABLED for write mode. A read request from the volume is satisfied from the plex if the plex is set to ENABLED. |
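The effect of the plex kernel state on I/O can be expressed as a simple lookup. The following Python sketch is a hypothetical model of the table above, not an LSM interface; the function names are invented for illustration, and the sketch ignores the separate write-mode setting mentioned for ENABLED plexes.

```python
def plexes_for_io(kstates):
    """Return the plexes that take part in volume reads and writes.

    kstates maps a plex name to DISABLED, DETACHED, or ENABLED.
    Only ENABLED plexes see volume I/O in this simplified model.
    """
    return sorted(name for name, k in kstates.items() if k == "ENABLED")


def accepts_plex_operations(kstate):
    """DETACHED and ENABLED plexes accept plex operations and ioctls."""
    if kstate not in ("DISABLED", "DETACHED", "ENABLED"):
        raise ValueError(f"unknown kernel state: {kstate}")
    return kstate != "DISABLED"
```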
Plexes are logical groupings of subdisks that create an area of disk space that is independent of any physical disk size. The subdisks that make up a plex can be filled with data in two ways, concatenation or striping:
For striping to be effective, subdisks must be spread across multiple disks, with one stripe per disk. Striping provides a performance advantage over concatenation because striping allows parallel I/O activity. You can use striping to distribute hot spots or areas of high I/O traffic across multiple devices.
See also Chapter 2 for more information about concatenation and striping.
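The difference between the two layouts is easiest to see in how a logical block of the plex maps to a subdisk. The following Python sketch shows one conventional mapping for each layout. It is an illustration under stated assumptions, not the LSM implementation: the function names are hypothetical, the striped case assumes equal-sized subdisks, and stripe_width (in blocks) is an assumed parameter.

```python
def concat_map(block, subdisk_lens):
    """Concatenation: fill each subdisk completely before using the next.

    Returns (subdisk_index, offset_within_subdisk).
    """
    for index, length in enumerate(subdisk_lens):
        if block < length:
            return index, block
        block -= length
    raise IndexError("block is beyond the end of the plex")


def stripe_map(block, n_subdisks, stripe_width):
    """Striping: spread consecutive stripe units across the subdisks in turn.

    Returns (subdisk_index, offset_within_subdisk).
    """
    stripe_no, within = divmod(block, stripe_width)
    row, column = divmod(stripe_no, n_subdisks)
    return column, row * stripe_width + within
```

With two subdisks and a stripe width of 2 blocks, logical blocks 0 and 1 land on the first subdisk, blocks 2 and 3 on the second, and blocks 4 and 5 on the first again. Consecutive stripe units therefore reach different disks, which is what allows the parallel I/O activity described above.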
Block-change logging is a method used to dramatically reduce synchronization overhead of a mirrored volume during recovery in case of a system failure. Block-change logging keeps a log of the blocks that have changed due to I/O writes to a plex. If block-change logging is not enabled and a system failure occurs, LSM must restore all plexes to a consistent state by copying the full contents of an ACTIVE plex to the STALE plexes. This process can be lengthy and I/O intensive.
Block-change logging tracks writes by identifying the number of each block that has changed and storing that number in a logging subdisk. The block-change log maintains a record of all pending I/O. Because the block-change log is only one block long, and because the log stores the I/O information until a process completes, the log may not be able to store all I/O changes as they occur. Once a process completes, its record is flushed from the log, and pending I/O information is then placed in the log.
Log records are written before the data is written. Thus, if the system experiences a crash, on system restart LSM searches the log and uses the log IDs to determine which plex contains the latest data written before the crash. In this way, plexes remain consistent, and except for possibly the last write before the crash, data is intact and up to date.
Block-change logging is enabled if two or more plexes in a mirrored volume have a logging subdisk associated with them. When block-change logging is enabled, only the blocks recorded in the log of the ACTIVE plex need to be copied to restore the STALE plexes and maintain data integrity. For example, the following commands create a one-block subdisk and associate it with a plex as a logging subdisk:
# volmake sd rz1-01 len=1 rz1
# volsd aslog vol01-01 rz1-01
Block-change logging can add some overhead to your system, because LSM must perform an extra I/O for every write operation. If you want to disable block-change logging, use a command like the following:
voledit set logtype=none volume_name
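The log-before-write ordering described in this section can be modeled in a few lines. The following Python sketch is a conceptual illustration only and is not how LSM stores its log: the class and method names are hypothetical, the log is an unbounded in-memory set rather than a one-block logging subdisk, and the crash_after parameter exists only to simulate a failure partway through a mirrored write.

```python
class MirroredVolume:
    """Toy mirrored volume with a block-change log (hypothetical model)."""

    def __init__(self, n_plexes, n_blocks):
        self.plexes = [[0] * n_blocks for _ in range(n_plexes)]
        self.log = set()  # block numbers with writes still in flight

    def write(self, block, value, crash_after=None):
        self.log.add(block)  # the log record is written before the data
        for index, plex in enumerate(self.plexes):
            if crash_after is not None and index >= crash_after:
                return  # simulated crash: the log entry stays behind
            plex[block] = value
        self.log.discard(block)  # write reached every plex: flush the record

    def recover(self, source=0):
        """Copy only the logged blocks from the source plex to the others."""
        copied = sorted(self.log)
        for block in copied:
            for plex in self.plexes:
                plex[block] = self.plexes[source][block]
        self.log.clear()
        return copied
```

In this model, a crash during a write to block 5 leaves only block 5 in the log, so recovery copies one block instead of the full contents of the plex.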
Persistent state logging ensures that only active plexes are used for recovery purposes and prevents failed plexes from being selected for recovery. Table 13-3 describes how persistent state logging solves problems due to plex failures and recovery.
Problem | Resolution |
LSM error policies sometimes detach failing plexes. When this happens, the state information about the failed plex is maintained on disk and is kept by the kernel as a dynamic record of the state of the configuration. However, in the event of a system failure the kernel state of the plex at the time of the failure is unknown. Thus, during recovery, LSM might select a plex that has been detached from the volume (due to an error) some time before the system failed, and the data contained in the selected failed plex could be significantly out of date. | When LSM detaches a failed plex, it immediately writes a record to the persistent state log. This way, even if a system failure occurs between the time that the plex is detached and the state change is logged, the detached plex is then disqualified from the recovery selection process. Persistent state logging can therefore guarantee that only active plexes are selected for recovery purposes. |
Without persistent state logging, a system crash causes all plexes to go through a recovery process, regardless of whether or not the plexes had been accessed. | Persistent state logging prevents unnecessary plex recoveries associated with started volumes that have never been accessed. Persistent state logging maintains a record of the first write to a volume and also of the last close of the volume so that no plex recovery is attempted following a crash. |
Table Notes:
During a volume start operation for an LSM volume, it might be necessary to resynchronize plexes that have become out of date using the following steps:
During the resynchronization process, however, it is possible that only one plex is active for a volume. For the duration of the resynchronization, the volume is therefore vulnerable to I/O failures because there is no reliable redundancy of the contents of the volume. This might leave the volume in an irrecoverable state if an error is encountered.
The writeback-on-read mode avoids such problems. When a read is received for a volume, the writeback-on-read flag causes the data read to be written back to all other plexes in the volume. The volume start command uses the writeback-on-read mode to start volumes with only ACTIVE plexes from which to recover.
The writeback-on-read model works as follows:
The success of writeback-on-read depends on the existence of Persistent State Logging, which guarantees that plexes marked as ACTIVE and not marked as DETACHED in the persistent state log area were all active at the time of the system crash. If this is the case, then the only areas of the ACTIVE plexes where data may disagree are those places that had active write I/O at the time of the crash.
Data in these areas is not guaranteed to be correct in terms of representing the data either before or after the write, but LSM guarantees that the data areas are consistent across all plexes. Writeback-on-read supports this consistency.
In addition, the failure of any plex during the synchronization process results in the normal error processing being performed without entering an irrecoverable state, because other plexes are available for use.
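The core idea of writeback-on-read, as described above, is that whatever data a read returns is also written to every other plex, so the plexes converge on the same contents. The following Python sketch illustrates that idea only. It is not the LSM recovery code: the function names are hypothetical, plexes are modeled as plain lists of block values, and reading every block in sequence to complete the resynchronization is an assumption of this sketch.

```python
def read_with_writeback(plexes, block, source=0):
    """Read a block from one plex and write the data back to the others."""
    data = plexes[source][block]
    for index, plex in enumerate(plexes):
        if index != source:
            plex[block] = data  # make every plex agree with what was read
    return data


def resynchronize(plexes, source=0):
    """Assumed full pass: reading each block once makes all plexes identical."""
    for block in range(len(plexes[source])):
        read_with_writeback(plexes, block, source)
```

Because all plexes stay in use during the pass, the volume keeps its redundancy while the plexes are being brought into agreement.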
Volume states indicate whether the volume is initialized, whether it has been written to, and whether it is accessible.
The interpretation of these volume states during volume startup is modified by the persistent state log for the volume (for example, the dirty/clean state). If the flag for the clean state is set, this means that an ACTIVE volume was not written to by any processes or was not open at the time of the reboot; therefore, it can be considered CLEAN. The flag for the clean state will always be set in any case where the volume is marked CLEAN.
Table 13-4 describes the volume states, some of which are similar to plex states.
State | Description |
ACTIVE | The volume has been started (kstate is currently ENABLED) or was in use (kstate was ENABLED) when the machine was rebooted. If the volume is currently ENABLED, the state of its plexes at any moment is not certain (since the volume is in use). If the volume is currently DISABLED, this means that the plexes cannot be guaranteed to be consistent. |
CLEAN | The volume is not started (kstate is DISABLED) and its plexes are synchronized. |
NEEDSYNC | The volume is not started (kstate is DISABLED) and its plexes are not synchronized. This can occur after a power failure or system failure. |
EMPTY | The volume contents are not initialized. The kstate is always DISABLED when the volume is EMPTY. |
SYNC | The volume is either in read-writeback mode (kstate is currently ENABLED) or was in read-writeback mode when the machine was rebooted (kstate is DISABLED). If the volume is ENABLED, the plexes are being resynchronized via read-writeback recovery. If the volume is DISABLED, the plexes were being resynchronized via read-writeback when the machine rebooted and therefore still need to be synchronized. |
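The way the clean flag modifies the interpretation of a stopped volume's state at startup can be sketched as a decision function. The following Python example is a hypothetical reading of Table 13-4 and the preceding paragraph, not an LSM interface: the function name and the returned action strings are invented for illustration.

```python
def startup_action(state, clean_flag=False):
    """Decide what a stopped volume needs before it can be used.

    state is the volume state from Table 13-4; clean_flag models the
    clean indication kept in the persistent state log.
    """
    if state == "EMPTY":
        return "initialize"          # contents were never initialized
    if state == "CLEAN":
        return "start"               # plexes are already synchronized
    if state == "ACTIVE":
        # An ACTIVE volume that was never written to, or was not open at
        # reboot, can be treated as CLEAN.
        return "start" if clean_flag else "resynchronize"
    if state in ("NEEDSYNC", "SYNC"):
        return "resynchronize"
    raise ValueError(f"unknown volume state: {state}")
```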
The following subsections describe the different volume states, how they change to indicate abnormalities, and what LSM does to normalize the volume state.
The volume kernel state indicates the accessibility of the volume. The volume kernel state allows a volume to have an offline (DISABLED), maintenance (DETACHED), and online (ENABLED) mode of operation. These modes are described in the following table:
Mode | Description |
DISABLED | The volume cannot be accessed. |
DETACHED | The volume cannot be read or written, but plex device operations and ioctl functions are accepted. |
ENABLED | The volume can be read and written. |
A volume usage type is a type label given to each volume under LSM control. Just as a file system type establishes and enforces policies for file operations, a volume usage type establishes and enforces policies for volume operations. The rules and capabilities differ for different usage types. The volume usage types affect such things as plex synchronization and error handling.
LSM provides the options described in the following table for volume usage types.
Option | Description |
fsgen (file system generic) | The fsgen usage type assumes the volume is being used by a file system. This usage type assumes there is a way to synchronize file system data to a volume during volplex or snapshot procedures. It uses the file system time stamp to see which plex is most up to date. It determines the file system type and calls an appropriate procedure to do a synchronization just prior to the plex split off. |
gen (generic) | The gen usage type makes no assumptions regarding the data content of the volume. This usage type does not handle synchronization. The gen option is useful for databases that reside directly on volumes. |
Operations that are dependent on usage-type must determine the usage type of a volume before switching control to a utility customized for that usage type. For example, following a failure, the gen and fsgen usage types -- using different algorithms -- guarantee that all plexes of a volume are identical.
Note
Use the fsgen usage type when creating a volume if the volume is to be used by a file system. Otherwise, use the gen usage type.
Starting a volume changes its state from DISABLED or DETACHED to ENABLED; stopping a volume changes its state from ENABLED or DETACHED to DISABLED.
Table 13-5 describes the three volume-read policies that can be selected.
Policy | Description |
Round | Prescribes round-robin reads of enabled plexes (this is the default read policy). If the read policy for a volume has been set to round, the read policy for that volume evenly distributes I/O read requests between all plexes of that volume. Reads are distributed by alternating read requests to each plex in a volume. This policy is preferred in cases where read access performance for all plexes is the same. |
Prefer | Prescribes preferential reads from a specified plex. Setting the read policy to prefer designates a specific plex of a particular volume to be used for I/O read requests. This policy is preferred if read access performance of one plex is better than that of the other mirrors, for example, if one of the plexes is striped and the other plexes are concatenated. |
Select | Selects a default read policy, based on the plex associations to the volume. If the volume contains a single, enabled, striped plex, the default is to prefer that plex. For any other set of plex associations, the default is to use a round-robin policy. |
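The three read policies in Table 13-5 can be compared with a small model. The following Python sketch is illustrative and is not the LSM volume driver: the function name make_reader and the layout labels are hypothetical, only enabled plexes are passed in, and the select case is interpreted as "exactly one enabled striped plex", which is an assumption about the wording above.

```python
from itertools import cycle


def make_reader(policy, plexes, preferred=None):
    """Return a function that names the plex to use for the next read.

    plexes maps each enabled plex name to its layout ("striped" or "concat").
    """
    if policy == "select":
        striped = [name for name, layout in plexes.items() if layout == "striped"]
        if len(striped) == 1:
            policy, preferred = "prefer", striped[0]
        else:
            policy = "round"
    if policy == "prefer":
        if preferred not in plexes:
            raise ValueError("preferred plex is not an enabled plex")
        return lambda: preferred
    if policy == "round":
        rotation = cycle(sorted(plexes))  # alternate reads between plexes
        return lambda: next(rotation)
    raise ValueError(f"unknown read policy: {policy}")
```

For example, with the round policy and two plexes, successive reads alternate between the two plexes. With the select policy and one striped plex among them, every read goes to the striped plex.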
Utilities such as the volassist utility obtain information on available disk space and use the information to calculate acceptable layouts for LSM objects. The concept of free space management is based on the idea that free space mapping can be derived by mapping out the existing allocations from the total space on the disk. The methods and specifications for dealing with the allocations and layouts may be provided at the command line; otherwise, they are obtained from defaults specified in a default file or internally.
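Deriving free space by mapping out existing allocations can be sketched as follows. This Python example is a generic illustration of the idea, not the algorithm used by volassist: the function name is hypothetical, and allocations are assumed to be (offset, length) pairs in blocks on a single disk.

```python
def free_extents(disk_len, allocations):
    """Return the free (offset, length) extents left on a disk.

    disk_len is the total size of the disk; allocations is a list of
    (offset, length) pairs already in use, in any order.
    """
    free = []
    cursor = 0  # first block not yet known to be allocated
    for offset, length in sorted(allocations):
        if offset > cursor:
            free.append((cursor, offset - cursor))  # gap before this allocation
        cursor = max(cursor, offset + length)
    if cursor < disk_len:
        free.append((cursor, disk_len - cursor))    # space after the last one
    return free
```

A layout utility could then pick an extent from this list that is large enough for the requested object.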
When a system administrator makes changes to a set of LSM objects, LSM groups the changes into a transaction. For any transaction, LSM ensures that either all related changes occur successfully or none of the changes are made. LSM makes all configuration changes appear to occur simultaneously and any intermediate stages of change are invisible. If a problem is encountered during the transaction, LSM does not allow any changes to occur and returns the configuration to its original state.
To achieve these atomic, all-or-nothing configuration changes, the LSM volume daemon (vold) envelops all configuration changes into a transaction by performing the following steps:
The result is that atomic transactions:
If a system failure occurs during a transaction, restarting the system causes the vold daemon to back out the partial changes. This prevents the disks maintained by LSM from becoming inconsistently configured.
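The all-or-nothing behavior of a transaction can be illustrated with a copy-then-commit pattern. The following Python sketch is a generic example of that pattern and does not describe how vold is implemented: the function name is hypothetical, the configuration is modeled as a dictionary, and each change is a callable that modifies it.

```python
import copy


def apply_transaction(config, changes):
    """Apply every change or none of them.

    Changes run against a private copy, so intermediate stages are never
    visible. Returns (resulting_config, committed).
    """
    working = copy.deepcopy(config)
    try:
        for change in changes:
            change(working)
    except Exception:
        return config, False   # back out: the original is untouched
    return working, True       # commit: all changes appear at once
```

If any change in the list fails, the caller gets the original configuration back unchanged, which mirrors the way partial changes are backed out after a failure.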
Refer to the vold(8) reference page for additional information about the volume daemon.