There are various ways that you can manage your disk storage. Depending on your performance and availability needs, you can use static disk partitions, the Logical Storage Manager (LSM), hardware RAID, or a combination of these solutions.
The disk storage configuration can have a significant impact on system performance, because disk I/O is used for file system operations and also by the virtual memory subsystem for paging and swapping.
You may be able to improve disk I/O performance by following the configuration and tuning guidelines described in this chapter, which describes the following:
Improving overall disk I/O performance by distributing the I/O load (Section 8.1)
Managing LSM performance (Section 8.4)
Managing hardware RAID subsystem performance (Section 8.5)
Managing Common Access Method (CAM) performance (Section 8.6)
Not all guidelines are appropriate for all disk storage configurations.
Before applying any guideline, be sure that you understand your workload resource model, as described in Section 2.1, and the guideline's benefits and tradeoffs.
8.1 Guidelines for Distributing the Disk I/O Load
Distributing the disk I/O load across devices helps to prevent a single disk, controller, or bus from becoming a bottleneck. It also enables simultaneous I/O operations.
For example, if you have 16 GB of disk storage, you may get better performance from sixteen 1-GB disks rather than four 4-GB disks, because using more spindles (disks) may allow more simultaneous operations. For random I/O operations, 16 disks may be simultaneously seeking instead of four disks. For large sequential data transfers, 16 data streams can be simultaneously working instead of four data streams.
Use the following guidelines to distribute the disk I/O load:
Stripe data or disks.
RAID 0 (data or disk striping) enables you to efficiently distribute data across the disks. See Section 2.5.2 for detailed information about the benefits of striping. Note that availability decreases as you increase the number of disks in a striped array.
To stripe data, use LSM (Section 8.4.5). To stripe disks, use a hardware RAID subsystem (Section 8.5).
As an alternative to data or disk striping, you can use the Advanced File System (AdvFS) to stripe individual files across disks in a file domain. However, do not stripe a file and also the disk on which it resides. See Section 9.3 for more information.
Use RAID 5.
RAID 5 distributes disk data and parity data across disks in an array to provide high data availability and to improve read performance. However, RAID 5 decreases write performance in a nonfailure state, and decreases read and write performance in a failure state. RAID 5 can be used for configurations that are mainly read-intensive. As a cost-efficient alternative to mirroring, you can use RAID 5 to improve the availability of rarely-accessed data.
To create a RAID 5 configuration, use LSM (Section 8.4.6) or a hardware RAID subsystem (Section 8.5).
Distribute frequently used file systems across disks and, if possible, different buses and controllers.
Place frequently used file systems on different disks and, if possible, different buses and controllers. Directories containing executable files or temporary files, such as /var, /usr, and /tmp, are often frequently accessed. If possible, place /usr and /tmp on different disks.
You can use the AdvFS balance command to balance the percentage of used space among the disks in an AdvFS file domain. See Section 9.3.7.4 for information.
Distribute swap I/O across devices.
To make paging and swapping more efficient and help prevent any single adapter, bus, or disk from becoming a bottleneck, distribute swap space across multiple disks. Do not put multiple swap partitions on the same disk.
You can also use the Logical Storage Manager (LSM) to mirror your swap space. See Section 8.4.2.7 for more information.
See Section 6.2 for more information about configuring swap devices for high performance.
Section 8.2 describes how to monitor the distribution of disk I/O.
8.2 Monitoring the Distribution of Disk I/O
Table 8-1 describes some commands that you can use to determine if your disk I/O is being distributed.
Table 8-1: Disk I/O Distribution Monitoring Tools
Name | Use | Description |
showfdmn | Displays information about AdvFS file domains | Determines if files are evenly distributed across AdvFS volumes. See Section 9.3.5.3 for information. |
advfsstat | Displays information about AdvFS file domain and fileset usage | Provides performance statistics for AdvFS file domains and filesets that you can use to determine if the file system I/O is evenly distributed. See Section 9.3.5.1 for information. |
swapon | Displays the swap space configuration | Provides information about swap space usage for each swap partition. |
volstat | Displays performance statistics for LSM objects | Provides information about LSM volume and disk usage that you can use to characterize and understand your I/O workload, including the read/write ratio, the average transfer size, and whether disk I/O is evenly distributed. See Section 8.4.7.2 for information. |
iostat | Displays disk I/O statistics | Provides information about which disks are being used the most. See Section 8.3 for information. |
8.3 Displaying Disk Usage by Using the iostat Command
For the best performance, disk I/O should be evenly distributed across disks. Use the iostat command to determine which disks are being used the most. The command displays disk I/O statistics for disks, in addition to terminal and CPU statistics.
An example of the iostat command is as follows; output is provided in one-second intervals:
# /usr/ucb/iostat 1
      tty      floppy0      dsk0        dsk1      cdrom0          cpu
 tin tout   bps  tps    bps  tps    bps  tps    bps  tps    us ni sy id
   1   73     0    0     23    2     37    3      0    0     5  0 17 79
   0   58     0    0     47    5    204   25      0    0     8  0 14 77
   0   58     0    0      8    1     62    1      0    0    27  0 27 46
The iostat command output displays the following information:
The first line of the iostat command output is the average since boot time, and each subsequent report is for the last interval.
For each disk (dskn), the number of KB transferred per second (bps) and the number of transfers per second (tps).
For the system (cpu), the percentage of time the CPU has spent in user state running processes either at their default priority or preferred priority (us), in user mode running processes at a less favored priority (ni), in system mode (sy), and in idle mode (id). This information enables you to determine how disk I/O is affecting the CPU. User mode includes the time the CPU spent executing library routines. System mode includes the time the CPU spent executing system calls.
The iostat command can help you to do the following:
Determine which disk is being used the most and which is being used the least. This information will help you determine how to distribute your file systems and swap space. Use the swapon -s command to determine which disks are used for swap space (see the example following this list).
Determine if the system is disk bound. If the iostat command output shows a lot of disk activity and a high system idle time, the system may be disk bound. You may need to balance the disk I/O load, defragment disks, or upgrade your hardware.
Determine if an application is written efficiently. If a disk is doing a large number of transfers (the tps field) but reading and writing only small amounts of data (the bps field), examine how your applications are doing disk I/O. The application may be performing a large number of I/O operations to handle only a small amount of data. You may want to rewrite the application if this behavior is not necessary.
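For example, the following minimal sketch combines the two commands to check whether a heavily used disk also holds swap space. The five-second interval and report count are ordinary iostat arguments, but this particular invocation is illustrative rather than taken from this manual:
# /usr/ucb/iostat 5 10
# swapon -s
The first command reports disk activity every 5 seconds for 10 reports; the second lists the configured swap partitions so that you can compare them against the busiest disks.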
8.4 Managing LSM Performance
The Logical Storage Manager (LSM) provides flexible storage management, improved disk I/O performance, and high data availability, with little additional overhead. Although any type of system can benefit from LSM, it is especially suited for configurations with large numbers of disks or configurations that regularly add storage.
LSM allows you to set up unique pools of storage that consist of multiple disks. From these disk groups, you can create virtual disks (LSM volumes), which are used in the same way as disk partitions. You can create UFS or AdvFS file systems on a volume, use a volume as a raw device, or create volumes on top of RAID storage sets.
Because there is no direct correlation between an LSM volume and a physical disk, file system or raw I/O can span disks. You can easily add disks to and remove disks from a disk group, balance the I/O load, and perform other storage management tasks.
In addition, LSM provides high performance and high availability by using RAID technology. LSM is often referred to as software RAID. LSM configurations can be more cost-effective and less complex than a hardware RAID subsystem. Note that LSM RAID features require a license.
To obtain the best LSM performance, you must follow the configuration and tuning guidelines described in this manual. The following sections contain:
Information about LSM features and license requirements (Section 8.4.1)
Guidelines for disks, disk groups, and databases (Section 8.4.2)
Guidelines for mirroring volumes (Section 8.4.3)
Guidelines for using dirty-region logging (DRL) with mirrored volumes (Section 8.4.4)
Guidelines for striping volumes (Section 8.4.5)
Guidelines for RAID 5 volumes (Section 8.4.6)
Information about monitoring the LSM configuration and performance (Section 8.4.7)
See the Logical Storage Manager manual for detailed information about using LSM.
8.4.1 LSM Features
LSM provides the following basic disk management features that do not require a license:
Disk concatenation enables you to create a large volume from multiple disks.
Load balancing transparently distributes data across disks.
Configuration database load-balancing automatically maintains an optimal number of LSM configuration databases in appropriate locations without manual intervention.
The volstat command provides detailed LSM performance information.
The following LSM features require a license:
RAID 0 (striping) distributes data across disks in an array. Striping is useful if you need to quickly transfer large amounts of data, and it also enables you to balance the I/O load from multi-user applications across multiple disks. LSM striping provides significant I/O performance benefits with little impact on the CPU.
RAID 1 (mirroring) maintains copies of data on different disks and reduces the chance that a single disk failure will cause the data to be unavailable.
RAID 5 (parity RAID) provides data availability through the use of parity data and distributes disk data and parity data across disks in an array.
Mirroring the root file system and swap space improves availability.
Hot spare support provides an automatic reaction to I/O failures on mirrored or RAID 5 objects by relocating the affected objects to spare disks or other free space.
Dirty-region logging (DRL) can be used to improve the recovery time of mirrored volumes after a system failure.
A graphical user interface (GUI) enables easy disk management and provides detailed performance information.
8.4.2 Basic LSM Disk, Disk Group, and Volume Guidelines
LSM enables you to group disks into storage pools called disk groups. Each disk group maintains a configuration database that contains records describing the LSM objects (volumes, plexes, subdisks, disk media names, and disk access names) that are being used in the disk group.
How you configure your LSM disks, disk groups, and volumes determines the flexibility and performance of your configuration. Table 8-2 describes the LSM disk, disk group, and volume configuration guidelines and lists performance benefits as well as tradeoffs.
Table 8-2: LSM Disk, Disk Group, and Volume Configuration Guidelines
Guideline | Benefit | Tradeoff |
Initialize your LSM disks as sliced disks (Section 8.4.2.1) | Uses disk space efficiently | None |
Make the rootdg disk group a sufficient size (Section 8.4.2.2) | Ensures sufficient space for disk group information | None |
Use a sufficient private region size for each disk in a disk group (Section 8.4.2.3) | Ensures sufficient space for database copies | Large private regions require more disk space |
Make the private regions in a disk group the same size (Section 8.4.2.4) | Efficiently utilizes the configuration space | None |
Organize disk groups according to function (Section 8.4.2.5) | Allows you to move disk groups between systems | Reduces flexibility when configuring volumes |
Mirror the root file system (Section 8.4.2.6) | Provides availability and improves read performance | Cost of additional disks and small decrease in write performance |
Mirror swap devices (Section 8.4.2.7) | Provides availability and improves read performance | Cost of additional disks and small decrease in write performance |
Use hot-sparing (Section 8.4.6.3 and Section 8.4.3.5) | Improves recovery time after a disk failure in a mirrored or RAID 5 volume | Requires an additional disk |
Save the LSM configuration (Section 8.4.2.8) | Improves availability | None |
Use mirrored volumes (Section 8.4.3) | Improves availability and read performance | Cost of additional disks and small decrease in write performance |
Use dirty region logging (Section 8.4.4) | Improves resynchronization time after a mirrored volume failure | Slightly increases I/O overhead |
Use striped volumes (Section 8.4.5) | Improves performance | Decreases availability |
Use RAID 5 volumes (Section 8.4.6) | Provides data availability and improves read performance | Consumes CPU resources, decreases write performance in a nonfailure state, and decreases read and write performance in a failure state |
The following sections describe the previous guidelines in detail.
8.4.2.1 Initializing LSM Disks as Sliced Disks
Initialize your LSM disks as sliced disks, instead of configuring individual partitions as simple disks. The disk label for a sliced disk contains information that identifies the partitions containing the private and the public regions. In contrast, simple disks have both public and private regions in the same partition.
A sliced disk places the entire disk under LSM control, uses disk storage efficiently, and avoids using space for multiple private regions on the same disk.
When a disk is initialized as an LSM sliced disk, by default, the disk is repartitioned so that partition g contains the LSM public region and partition h contains the private region. LSM volume data resides in the public region, which uses the majority of the disk starting at block 0. LSM configuration data and metadata reside in the private region, which uses the last 4096 blocks of the disk, by default.
Usually, you do not have to change the size of the LSM private region. See Section 8.4.2.3 for more information.
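For example, the following minimal sketch initializes a disk as a sliced LSM disk and then verifies the result. The disk name dsk5 is hypothetical, and the assumption that voldisksetup -i performs the initialization is based on the voldisksetup command described in Section 8.4.2.3, not on an example in this manual:
# voldisksetup -i dsk5
# voldisk list dsk5
As described in Section 8.4.2.4, the voldisk list output includes the size of the disk's private region, so you can confirm how the public and private regions were laid out.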
8.4.2.2 Sizing the rootdg Disk Group
The default disk group, rootdg, is automatically created when you initialize LSM. Unlike other disk groups, the rootdg configuration database contains disk-access records that define all disks under LSM control, in addition to its own disk-group configuration information.
You must make sure that the rootdg disk group is large enough to accommodate all the disk-access records. The default size of a configuration database is 4096 blocks. Usually, you do not have to change this value.
8.4.2.3 Sizing Private Regions
LSM keeps the disk media label and configuration database copies in each disk's private region. You must make sure that the private region for each disk is big enough to accommodate the database copies. In addition, the maximum number of LSM objects (disks, subdisks, volumes, and plexes) in a disk group depends on an adequate private region size.
The default private region size is 4096 blocks. Usually, you do not have to modify the default size.
To check the amount of free space in a disk group, use the voldg list command and specify the disk group.
You may want to increase the default private region size if you have a very large LSM configuration and need more space for the database copies. Note that a large private region consumes more disk space.
You may want to decrease the default private region size if your LSM configuration is small, and you do not need 4096 blocks for the configuration database. This may improve the LSM startup and disk import times.
Use the voldisksetup command with the privlen option to set the private region size. See voldisksetup(8) for more information.
If you change the size of a disk's private region, all disks that contain a copy of the configuration database (that is, disks for which nconfig is not 0) should use the same private region size. See Section 8.4.2.4 for more information.
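The following sketch checks the free configuration space in a disk group and initializes a new disk with a larger private region. The disk group name datadg, the disk name dsk6, the 6144-block value, and the privlen=value form of the option are illustrative assumptions rather than examples from this manual; see voldisksetup(8) for the exact syntax:
# voldg list datadg
# voldisksetup -i dsk6 privlen=6144
Keep the private region size consistent across the disks that hold configuration database copies, as described in Section 8.4.2.4.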
8.4.2.4 Making Private Regions in a Disk Group the Same Size
The private region of each disk in a disk group should be the same size. This enables LSM to efficiently utilize the configuration database space.
To determine the size of a disk's private region, use the voldisk list command and specify the name of the disk. Use the voldisksetup command with the privlen option to set the private region size. See voldisksetup(8) for more information.
8.4.2.5 Organizing Disk Groups
You may want to organize disk groups according to their function. This enables disk groups to be moved between systems.
Note that using many disk groups decreases the size of the LSM configuration database for each disk group, but it increases management complexity and reduces flexibility when configuring volumes.
8.4.2.6 Mirroring the Root File System
Mirroring the root file system improves overall system availability and also improves read performance for the file system. If a disk containing a copy of the root file system fails, the system can continue running. In addition, if the system is shut down, multiple boot disks can be used to load the operating system and mount the root file system.
Note that mirroring requires additional disks and slightly decreases write performance.
You can configure the root file system under LSM by selecting the option during the full installation, or by encapsulating it into LSM at a later time. The root disk will appear as an LSM volume that you can mirror.
If you mirror the root file system with LSM, you should also mirror the swap devices with LSM. See Section 8.4.2.7 for information about mirroring swap devices.
Note
In a TruCluster configuration, you cannot use LSM to configure the root file system, swap devices, boot partition, quorum disks, or any partition on a quorum disk. See the TruCluster documentation for more information.
See the Logical Storage Manager manual for restrictions and instructions for mirroring the root disk and booting from a mirrored root volume.
8.4.2.7 Mirroring Swap Devices
Mirroring swap devices improves system availability by preventing a system failure caused by a failed swap disk, and also improves read performance. In addition, mirroring both the root file system and swap devices ensures that you can boot the system even if errors occur when you start the swap volume. See Section 8.4.2.6 for information about mirroring the root file system.
Note that mirroring requires additional disks and slightly decreases write performance.
You can configure swap devices under LSM by selecting the option during the full installation or by encapsulating them into LSM at a later time. The swap devices will appear as LSM volumes that you can mirror.
You can also mirror secondary swap devices. Compaq recommends that you use multiple disks for secondary swap devices and add the devices as several individual volumes, instead of striping or concatenating them into a single large volume. This makes the swapping algorithm more efficient.
See the Logical Storage Manager manual for restrictions and instructions for mirroring swap space.
8.4.2.8 Saving the LSM Configuration
Use the volsave command to periodically create a copy of the LSM configuration. You can use the volrestore command to re-create the LSM configuration if you lose a disk group configuration.
See the Logical Storage Manager manual for information about saving and restoring the LSM configuration.
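As a minimal sketch, running volsave with no arguments saves a copy of the current configuration; the default save location and the volrestore options are not covered in this section, so see volsave(8) and volrestore(8) before relying on them:
# volsave
Schedule this command to run after configuration changes so that a recent copy is always available for volrestore.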
8.4.3 LSM Mirrored Volume Configuration Guidelines
Use LSM mirroring (RAID 1) to reduce the chance that a single disk failure will make disk data unavailable. Mirroring maintains multiple copies of volume data on different plexes. If a physical disk that is part of a mirrored plex fails, its plex becomes unavailable, but the system continues to operate using an unaffected plex.
At least two plexes are required to provide data redundancy, and each plex must contain different disks. You can use hot sparing to replace a failed mirrored disk. See Section 8.4.3.5 for information.
Because a mirrored volume has copies of the data on multiple plexes, multiple read operations can be simultaneously performed on the plexes, which dramatically improves read performance. For example, read performance may improve by 100 percent on a mirrored volume with two plexes because twice as many reads can be performed simultaneously. LSM mirroring provides significant I/O read performance benefits with little impact on the CPU.
Writes to a mirrored volume result in simultaneous write requests to each copy of the data, so mirroring may slightly decrease write performance. For example, an individual write request to a mirrored volume may require an additional 5 percent of write time, because the volume write must wait for the completion of the write to each plex.
However, mirroring can improve overall system performance because the read performance that is gained may compensate for the slight decrease in write performance. To determine whether your system performance may benefit from mirroring, use the volstat command to compare the number of read operations on a volume to the number of write operations.
Table 8-3 describes LSM mirrored volume configuration guidelines and lists performance benefits as well as tradeoffs.
Table 8-3: LSM Mirrored Volume Guidelines
Guideline | Benefit | Tradeoff |
Place mirrored plexes on different disks and buses (Section 8.4.3.1) | Improves performance and increases availability | Cost of additional hardware |
Attach multiple plexes to a mirrored volume (Section 8.4.3.2) | Improves performance for read-intensive workloads and increases availability | Cost of additional disks |
Use the appropriate read policy (Section 8.4.3.3) | Efficiently distributes reads | None |
Use a symmetrical configuration (Section 8.4.3.4) | Provides more predictable performance | None |
Configure hot sparing (Section 8.4.3.5) | Increases data availability (highly recommended) | Requires an additional disk device |
Use dirty-region logging (DRL) (Section 8.4.4) | Improves mirrored volume recovery rate | May cause an additional decrease in write performance |
The following sections describe the previous LSM mirrored volume guidelines in detail.
8.4.3.1 Placing Mirrored Plexes on Different Disks and Buses
Each plex in a mirrored volume must use different disks to provide effective data redundancy. If you are mirroring a striped plex, each striped plex must also use different disks; this enables effective striping and mirroring.
By default, the volassist command locates plexes so that the loss of a disk will not result in loss of data.
In addition, placing each mirrored plex on a different bus or I/O controller improves performance by distributing the I/O workload and preventing a bottleneck at any one device. Mirroring across different buses also increases availability by protecting against bus and adapter failure.
8.4.3.2 Using Multiple Plexes in a Mirrored Volume
To improve performance for read-intensive workloads, use more than two plexes in a mirrored volume.
Although a maximum of 32 plexes can be attached to the same mirrored volume, using that many plexes uses disk space inefficiently.
8.4.3.3 Choosing a Read Policy for a Mirrored Volume
To provide optimal performance for different types of mirrored volumes, LSM supports the following read policies:
round
Reads, in a round-robin manner, from all plexes in the volume.
prefer
Reads preferentially from the plex that is designated the preferred plex (usually the plex with the highest performance). If one plex exhibits superior performance, either because the plex is striped across multiple disks or because it is located on a much faster device, designate that plex as the preferred plex.
select
Uses a read policy based on the volume's plex associations.
For example, if a mirrored volume contains a single striped plex, that plex is designated the preferred plex. For any other set of plex associations, the round policy is used. The select read policy is the default policy.
Use the volprint -t command to display the read policy for a volume; see Section 8.4.7.1 for information. Use the volume rdpol command to set the read policy. See volprint(8) and volume(8) for information.
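For example, the following sketch sets and then verifies the read policy on a volume. The volume name voldev1 is taken from the volprint example in Section 8.4.7.1, but the argument order shown for the volume rdpol command is an assumption, so check volume(8) before using it:
# volume rdpol round voldev1
# volprint -t voldev1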
8.4.3.4 Using a Symmetrical Plex Configuration
Configure symmetrical plexes for predictable performance and easy management. Use the same number of disks in each mirrored plex. For mirrored striped volumes, you can stripe across half of the available disks to form one plex and across the other half to form the other plex.
In addition, use disks with the same performance characteristics, if possible. You may not gain the performance benefit of a fast disk if it is being used with a slow disk in the same mirrored volume. This is because the overall write performance for a mirrored volume is determined and limited by the slowest disk. If you have disks with different performance characteristics, group the fast disks into one volume, and group the slow disks into another volume.
8.4.3.5 Using Hot Sparing for Mirrored Volumes
If more than one disk in a mirrored volume fails, you may lose all the data in the volume, unless you configure hot sparing. Compaq recommends that you use LSM hot sparing.
Hot sparing enables you to set up a spare disk that can be automatically used to replace a failed disk in a mirrored set. The automatic replacement capability of hot sparing improves the reliability of mirrored data when a single disk failure occurs.
Note that hot sparing requires an additional disk for the spare disk.
Use the volwatch -s command to enable hot sparing. See the Logical Storage Manager manual for more information about hot-sparing restrictions and guidelines.
8.4.4 Dirty-Region Logging Configuration Guidelines
For fast resynchronization of a mirrored volume after a system failure, LSM uses dirty-region logging (DRL). However, DRL adds a small I/O overhead for most write access patterns. Typically, the DRL performance degradation is more significant on systems with few writes than on systems with heavy write loads.
DRL logically divides a volume into a set of consecutive regions. Each region is represented by a status bit in the dirty-region log. A write operation to a volume marks the region's status bit as dirty before the data is written to the volume. When a system restarts after a failure, LSM recovers only those regions of the volume that are marked as dirty in the dirty-region log.
If you disable DRL and the system fails, LSM must copy the full contents of a volume between its mirrors to restore and resynchronize all plexes to a consistent state. Although this process occurs in the background and the volume remains available, it can be a lengthy, I/O-intensive procedure.
Log subdisks are used to store a mirrored volume's dirty-region log. To enable DRL, you must associate at least one log subdisk to a mirrored plex. You can use multiple log subdisks to mirror the log. However, only one log subdisk can exist per plex.
A plex that contains only a log subdisk and no data subdisks is referred to as a log plex. By default, LSM creates a log plex for a mirrored volume. Although you can associate a log subdisk with a regular plex that contains data subdisks, the log subdisk will become unavailable if you detach the plex because one of its data subdisks has failed. Therefore, Compaq recommends that you configure DRL as a log plex.
Table 8-4 describes LSM DRL configuration guidelines and lists performance benefits as well as tradeoffs.
Table 8-4: Dirty-Region Logging Guidelines
Guideline | Benefit | Tradeoff |
Configure one log plex for each mirrored volume (Section 8.4.4.1) | Greatly reduces mirror resynchronization time after a system failure. | Slight decrease in write performance |
Configure two or more log plexes for each mirrored volume (Section 8.4.4.1) | Greatly reduces mirror resynchronization time after a system failure and provides DRL availability | Slight decrease in write performance |
Configure log plexes on disks that are different from the volume's data plexes (Section 8.4.4.1) | Minimizes the logging overhead for writes by ensuring the same disk does not have to seek between the log area and data area for the same volume write | None |
Use the default log size (Section 8.4.4.2) | Improves performance | None |
Place logging subdisks on infrequently used disks (Section 8.4.4.3) | Helps to prevent disk bottlenecks | None |
Use solid-state disks for logging subdisks (Section 8.4.4.4) | Minimizes DRL's write degradation | Cost of solid-state disks |
Use a write-back cache for logging subdisks (Section 8.4.4.5) | Minimizes DRL write degradation | Cost of hardware RAID subsystem |
The following sections describe the previous DRL guidelines in detail.
8.4.4.1 Configuring Log Plexes
For each mirrored volume, configure one log plex, which is a plex that contains a single log subdisk and no data subdisks. After a system failure, a write to a mirrored volume may have completed on one of its plexes and not on the other plex. LSM must resynchronize each mirrored volume's plex to ensure that all plexes are identical.
A log plex significantly reduces the time it takes to resynchronize a mirrored volume when rebooting after a failure, because only the regions within the volume that were marked as dirty are resynchronized, instead of the entire volume.
By default, LSM creates a log plex for a mirrored volume.
For high availability, you can configure more than one log plex (but only one per plex) for a volume. This ensures that logging can continue even if a disk failure causes one log plex to become unavailable.
In addition, configure multiple log plexes on disks that are different from the volume's data plexes. This will minimize the logging overhead for writes by ensuring that a disk does not have to seek between the log area and the data area for the same volume write.
8.4.4.2 Using the Correct Log Size
The size of a dirty-region log is proportional to the volume size and depends on whether you are using LSM in a TruCluster configuration.
For systems not configured as part of a cluster, log subdisks must be configured with two or more sectors. Use an even number, because the last sector in a log subdisk with an odd number of sectors is not used.
The log subdisk size is usually proportional to the volume size. If a volume is less than 2 GB, a log subdisk of two sectors is sufficient. Increase the log subdisk size by two sectors for every additional 2 GB of volume size.
Log subdisks for TruCluster member systems must be configured with 65 or more sectors. Use the same sizing guidelines for non-cluster configurations and multiply that result by 33 to determine the optimal log size for a cluster configuration.
By default, the volassist addlog command calculates the optimal log size based on the volume size, so usually you do not have to use the loglen attribute to specify a log size. However, the log size that is calculated by default is for a cluster configuration. If a volume will never be used in a cluster, use the volassist -c addlog command to calculate the optimal log size for a noncluster environment. Compaq recommends that you use the default log size.
See the Logical Storage Manager manual for more information about log sizes.
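For example, the following sketch adds a DRL log plex to an existing mirrored volume. The volume name datavol is hypothetical, and the loglen=66 attribute form is an assumption based on the loglen attribute described above (66 sectors corresponds to a volume smaller than 2 GB in a cluster, per the sizing rules in Section 8.4.4.2):
# volassist addlog datavol
# volassist addlog datavol loglen=66
The first form uses the default log size; the second form shows how the log length would be specified explicitly, which is normally unnecessary.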
8.4.4.3 Placing Logging Subdisks on Infrequently Used Disks
Place logging subdisks on infrequently used disks. Because these subdisks are frequently written, do not put them on busy disks. In addition, do not configure DRL subdisks on the same disks as the volume data, because this will cause head seeking or thrashing.
8.4.4.4 Using Solid-State Disks for DRL Subdisks
If persistent (nonvolatile) solid-state disks are available, use them for logging subdisks.
8.4.4.5 Using a Nonvolatile Write-Back Cache for DRL
To minimize DRL's impact on write performance, use LSM in conjunction with a RAID subsystem that has a nonvolatile (battery-backed) write-back cache. Typically, the DRL performance degradation is more significant on systems with few writes than on systems with heavy write loads.
8.4.5 LSM Striped Volume Configuration Guidelines
Striping data (RAID 0) is useful if you need to write large amounts of data, to quickly read data, or to balance the I/O load from multi-user applications across multiple disks. Striping is especially effective for applications that perform large sequential data transfers or multiple, simultaneous I/O operations. LSM striping provides significant I/O performance benefits with little impact on the CPU.
Striping distributes data in fixed-size units (stripes) across the disks in a volume. Each stripe is a set of contiguous blocks on a disk. The default stripe size (width) is 64 KB. The stripes are interleaved across the striped plex's subdisks, which must be located on different disks to evenly distribute the disk I/O.
The performance benefit of striping depends on the number of disks in the stripe set, the location of the disks, how your users and applications perform I/O, and the width of the stripe. I/O performance improves and scales linearly as you increase the number of disks in a stripe set. For example, striping volume data across two disks can double both read and write performance for that volume (read and write performance improves by 100 percent). Striping data across four disks can improve performance by a factor of four (read and write performance improves by 300 percent).
However, a single disk failure in a volume will make the volume inaccessible, so striping a volume increases the chance that a disk failure will result in a loss of data availability. You can combine mirroring (RAID 1) with striping to obtain high availability. See Section 8.4.3 for mirroring guidelines.
Table 8-5 describes the LSM striped volume configuration guidelines and lists performance benefits as well as tradeoffs.
Table 8-5: LSM Striped Volume Guidelines
Guideline | Benefit | Tradeoff |
Use multiple disks in a striped volume (Section 8.4.5.1) | Improves performance by preventing a single disk from being an I/O bottleneck | Decreases volume reliability |
Use disks on different buses for the stripe set (Section 8.4.5.2) | Improves performance by preventing a single bus or controller from being an I/O bottleneck | Decreases volume reliability |
Use the appropriate stripe width (Section 8.4.5.3) | Ensures that an individual volume I/O is handled efficiently | None |
Avoid splitting small data transfers (Section 8.4.5.3) | Improves overall throughput and I/O performance by handling small requests efficiently | None |
Avoid splitting large data transfers (Section 8.4.5.3) | Improves overall throughput and I/O performance by handling multiple requests efficiently. | Optimizes a volume's overall throughput and performance for multiple I/O requests, instead of for individual I/O requests. |
The following sections describe the previous LSM striped volume configuration guidelines in detail.
8.4.5.1 Increasing the Number of Disks in a Striped Volume
Increasing the number of disks in a striped volume can increase the throughput, depending on the applications and file systems you are using and the number of simultaneous users. This helps to prevent a single disk from becoming an I/O bottleneck.
However, a single disk failure in a volume will make the volume inaccessible, so increasing the number of disks in a striped volume reduces the effective mean-time-between-failures (MTBF) of the volume. To provide high availability for a striped volume, you can mirror the striped volume. See Section 8.4.3 for mirroring guidelines.
8.4.5.2 Distributing Striped Volume Disks Across Different Buses
Distribute the disks of a striped volume across different buses or controllers. This helps to prevent a single bus or controller from becoming an I/O bottleneck, but decreases volume reliability.
LSM can obtain I/O throughput and bandwidth that is significantly higher than a hardware RAID subsystem by enabling you to spread the I/O workload for a striped volume across different buses. To prevent a single bus from becoming an I/O bottleneck, configure striped plexes using disks on different buses and controllers, if possible.
You can obtain the best performance benefit by configuring a striped plex so that the stripe columns alternate or rotate across different buses. For example, you could configure a four-way stripe that uses four disks on two buses so that stripe columns 0 and 2 are on disks located on one bus and stripe columns 1 and 3 are on disks located on the other bus.
However, if you are mirroring a striped volume and you have a limited number of buses, mirroring across buses should take precedence over striping across buses. For example, if you want to configure a volume with a pair of two-way stripes (that is, you want to mirror a two-way stripe) by using four disks on two buses, place one of the plexes of the two-way stripe on disks located on one bus, and configure the other two-way striped plex on the other bus.
For the best possible performance, use a select or round-robin read policy, so that all of the volume's reads and writes will be evenly distributed across both buses. Mirroring data across buses also provides high data availability in case one of the controllers or buses fails.
8.4.5.3 Choosing the Correct LSM Stripe Width
A striped volume consists of a number of equal-sized subdisks, each located on different disks. To obtain the performance benefit of striping, you must select a stripe width that is appropriate for the I/O workload and configuration.
The number of blocks in a stripe unit determines the stripe width. LSM uses a default stripe width of 64 KB (or 128 sectors), which works well in most configurations, such as file system servers or database servers, that perform multiple simultaneous I/Os to a volume. The default stripe width is appropriate for these configurations, regardless of whether the I/O transfer size is small or large.
For highly specialized configurations in which large, raw I/Os are performed one at a time (that is, two or more I/Os are never issued simultaneously to the same volume), you may not want to use the default stripe width. Instead, use a stripe width that enables a large data transfer to be split up and performed in parallel.
The best stripe width for configurations that perform large, individual I/O transfers depends on whether the I/O size varies, the number of disks in the stripe-set, the hardware configuration (for example, the number of available I/O buses), and the disk performance characteristics (for example, average disk seek and transfer times). Therefore, try different stripe widths to determine the width that will provide the best performance for your configuration. Use the LSM online support to obtain help with configuring and deconfiguring plexes with different stripe widths and comparing actual I/O workloads.
If you are striping mirrored volumes, ensure that the stripe width is the same for each plex. Also, avoid striping the same data by both LSM and a hardware RAID subsystem. If a striped plex is properly configured with LSM, striping the data with hardware RAID may degrade performance.
8.4.6 LSM RAID 5 Configuration Guidelines
RAID 5 provides high availability and improves read performance. A RAID 5 volume contains a single plex, consisting of multiple subdisks from multiple physical disks. Data is distributed across the subdisks, along with parity information that provides data redundancy.
RAID 5 provides data availability through the use of parity, which calculates a value that is used to reconstruct data after a failure. While data is written to a RAID 5 volume, parity is also calculated by performing an exclusive OR (XOR) procedure on the data. The resulting parity information is written to the volume. If a portion of a RAID 5 volume fails, the data that was on that portion of the failed volume is re-created from the remaining data and the parity information.
RAID 5 can be used for configurations that are mainly read-intensive. As a cost-efficient alternative to mirroring, you can use RAID 5 to improve the availability of rarely accessed data.
Notes
LSM mirroring and striping (RAID 0+1) provide significant I/O performance benefits with little impact on the CPU. However, LSM RAID 5 decreases write performance and has a negative impact on CPU performance, because a write to a RAID 5 volume requires CPU resources to calculate the parity information and may involve multiple reads and writes.
In addition, if a disk fails in a RAID 5 volume, write performance will significantly degrade. In this situation, read performance may also degrade, because all disks must be read in order to obtain parity data for the failed disk.
Therefore, Compaq recommends that you use LSM mirroring and striping or hardware (controller-based) RAID, instead of LSM (software-based) RAID 5.
Mirroring RAID 5 volumes and using LSM RAID 5 volumes on TruCluster shared storage are not currently supported.
Table 8-6 describes LSM RAID 5 volume configuration guidelines and lists performance benefits as well as tradeoffs. Many of the guidelines for creating striped and mirrored volumes also apply to RAID 5 volumes.
Table 8-6: LSM RAID 5 Volume Guidelines
Guideline | Benefit | Tradeoff |
Configure at least one log plex (Section 8.4.6.1) | Increases data availability (highly recommended) | Requires an additional disk |
Use the appropriate stripe width (Section 8.4.6.2) | Significantly improves write performance | May slightly reduce read performance |
Configure hot sparing (Section 8.4.6.3) | Increases data availability (highly recommended) | Requires an additional disk device |
The following sections describe these guidelines in detail.
8.4.6.1 Using RAID 5 Logging
Compaq recommends that you use logging to protect RAID 5 volume data if a disk or system failure occurs. Without logging, it is possible for data not involved in any active writes to be lost or corrupted if a disk and the system fail. If this double failure occurs, there is no way of knowing whether the data being written to the data portions of the disks or the parity being written to the parity portions were actually written. Therefore, the data recovered for the failed disk may itself be corrupted.
Make sure that each RAID 5 volume has at least one log plex. Do not use a disk that is part of the RAID 5 plex for a log plex.
You can associate a log with a RAID 5 volume by attaching it as an additional, non-RAID 5 layout plex. More than one log plex can exist for each RAID 5 volume, in which case the log areas are mirrored. If you use the volassist command to create a RAID 5 volume, a log is created by default.
8.4.6.2 Using the Appropriate Stripe Width
Using the appropriate stripe width can significantly improve write performance. However, it may slightly reduce read performance.
The default RAID 5 stripe width is 16 KB, which is appropriate for most environments. To decrease the performance impact of RAID 5 writes, the stripe size used for RAID 5 is usually smaller than the size used for striping (RAID 0).
Unlike striping, splitting a write across all the disks in a RAID 5 set improves write performance, because the system does not have to read existing data to determine the new parity information when it is writing a full striped row of data. For example, writing 64 KB of data to a five-column RAID 5 stripe with a 64-KB stripe width may require two parallel reads, followed by two parallel writes (that is, reads from the existing data and parity information, then writes to the new data and new parity information).
However, writing the same 64 KB of data to a five-column RAID 5 stripe with a 16-KB stripe width may enable the data to be written immediately to disk (that is, five parallel writes to the four data disks and to the parity disk). This is possible because the new parity information for the RAID 5 stripe row can be determined from the 64 KB of data, and reading old data is not necessary.
8.4.6.3 Using Hot Sparing for RAID 5 Volumes
Compaq recommends that you use LSM hot sparing. If more than one disk in a RAID 5 volume fails, you may lose all the data in the volume, unless you configure hot sparing.
Hot sparing enables you to set up a spare disk that can be automatically used to replace a failed RAID 5 disk. The automatic replacement capability of hot sparing improves the reliability of RAID 5 data when a single disk failure occurs. In addition, hot sparing reduces the RAID 5 volume's I/O performance degradation caused by the overhead associated with reconstructing the failed disk's data.
Note that RAID 5 hot sparing requires an additional disk for the spare disk.
Use the volwatch -s command to enable hot sparing. See the Logical Storage Manager manual for more information about hot-sparing restrictions and guidelines.
8.4.7 Gathering LSM Information
Table 8-7 describes the tools you can use to obtain information about LSM.
Table 8-7: LSM Monitoring Tools
Name | Use | Description |
volprint | Displays LSM configuration information (Section 8.4.7.1) | Displays information about LSM disk groups, disk media, volumes, plexes, and subdisk records. It does not display disk access records. |
volstat | Monitors LSM performance statistics (Section 8.4.7.2) | For LSM volumes, plexes, subdisks, or disks, displays either the total performance statistics since the statistics were last reset (or the system was booted), or the current performance statistics within a specified time interval. These statistics include information about read and write operations, including the total number of operations, the number of failed operations, the number of blocks read or written, and the average time spent on the operation. |
voltrace | Tracks LSM operations (Section 8.4.7.3) | Sets I/O tracing masks against one or all volumes in the LSM configuration and logs the results to the LSM default event log. |
volwatch | Monitors LSM events (Section 8.4.7.4) | Monitors LSM for failures in disks, volumes, and plexes, and sends mail if a failure occurs. |
volnotify | Monitors LSM events (Section 8.4.7.5) | Displays events related to disk and configuration changes, as managed by the LSM configuration daemon, vold. |
Note
In a TruCluster configuration, the volstat, voltrace, and volnotify tools provide information only for the member system on which you invoke the command. Use Event Manager (EVM), instead of the volnotify utility, to obtain information about LSM events from any cluster member system. See EVM(5) for more information.
The following sections describe some of these commands in detail.
8.4.7.1 Displaying Configuration Information by Using the volprint Utility
The volprint utility displays information about LSM objects (disks, subdisks, disk groups, plexes, and volumes). You can select the objects (records) to be displayed by name or by using special search expressions. In addition, you can display record association hierarchies, so that the structure of records is more apparent. For example, you can obtain information about failed disks in a RAID 5 configuration, I/O failures, and stale data.
Invoke the voldisk list command to check disk status and display disk access records or physical disk information.
The following example uses the volprint utility to show the status of the voldev1 volume:
# /usr/sbin/volprint -ht voldev1
Disk group: rootdg

V  NAME        USETYPE    KSTATE   STATE    LENGTH   READPOL    PREFPLEX
PL NAME        VOLUME     KSTATE   STATE    LENGTH   LAYOUT     NCOL/WID  MODE
SD NAME        PLEX       DISK     DISKOFFS LENGTH   [COL/]OFF  DEVICE    MODE

v  voldev1     fsgen      ENABLED  ACTIVE   209712   SELECT     -
pl voldev1-01  voldev1    ENABLED  ACTIVE   209712   CONCAT     -         RW
sd dsk2-01     voldev1-01 dsk2     65       209712   0          dsk2      ENA
pl voldev1-02  voldev1    ENABLED  ACTIVE   209712   CONCAT     -         RW
sd dsk3-01     voldev1-02 dsk3     0        209712   0          dsk3      ENA
pl voldev1-03  voldev1    ENABLED  ACTIVE   LOGONLY  CONCAT     -         RW
sd dsk2-02     voldev1-03 dsk2     0        65       LOG        dsk2      ENA
The following volprint command output shows that the RAID 5 volume r5vol is in degraded mode:
# volprint -ht
V  NAME        USETYPE    KSTATE   STATE     LENGTH   READPOL    PREFPLEX
PL NAME        VOLUME     KSTATE   STATE     LENGTH   LAYOUT     NCOL/WID  MODE
SD NAME        PLEX       DISK     DISKOFFS  LENGTH   [COL/]OFF  DEVICE    MODE

v  r5vol       RAID5      ENABLED  DEGRADED  20480    RAID       -
pl r5vol-01    r5vol      ENABLED  ACTIVE    20480    RAID       3/16      RW
sd disk00-00   r5vol-01   disk00   0         10240    0/0        dsk4d1
sd disk01-00   r5vol-01   disk01   0         10240    1/0        dsk2d1    dS
sd disk02-00   r5vol-01   disk02   0         10240    2/0        dsk3d1    -
pl r5vol-l1    r5vol      ENABLED  LOG       1024     CONCAT     -         RW
sd disk03-01   r5vol-l1   disk00   10240     1024     0          dsk3d0    -
pl r5vol-l2    r5vol      ENABLED  LOG       1024     CONCAT     -         RW
sd disk04-01   r5vol-l2   disk02   10240     1024     0          dsk1d1    -
The output shows that volume r5vol is in degraded mode, as shown by the STATE field, which is listed as DEGRADED. The failed subdisk is disk01-00, as shown by the last column, where the d indicates that the subdisk is detached, and the S indicates that the subdisk contents are stale.
It is also possible that a disk containing a RAID 5 log could experience a failure. This has no direct effect on the operation of the volume; however, the loss of all RAID 5 logs on a volume makes the volume vulnerable to a complete failure.
The following volprint command output shows a failure within a RAID 5 log plex:
# volprint -ht
V  NAME        USETYPE    KSTATE    STATE    LENGTH   READPOL    PREFPLEX
PL NAME        VOLUME     KSTATE    STATE    LENGTH   LAYOUT     NCOL/WID  MODE
SD NAME        PLEX       DISK      DISKOFFS LENGTH   [COL/]OFF  DEVICE    MODE

v  r5vol       RAID5      ENABLED   ACTIVE   20480    RAID       -
pl r5vol-01    r5vol      ENABLED   ACTIVE   20480    RAID       3/16      RW
sd disk00-00   r5vol-01   disk00    0        10240    0/0        dsk4d1    ENA
sd disk01-00   r5vol-01   disk01    0        10240    1/0        dsk2d1    dS
sd disk02-00   r5vol-01   disk02    0        10240    2/0        dsk3d1    ENA
pl r5vol-l1    r5vol      DISABLED  BADLOG   1024     CONCAT     -         RW
sd disk03-01   r5vol-l1   disk00    10240    1024     0          dsk3d0    ENA
pl r5vol-l2    r5vol      ENABLED   LOG      1024     CONCAT     -         RW
sd disk04-01   r5vol-l2   disk02    10240    1024     0          dsk1d1    ENA
The previous command output shows that the RAID 5 log plex r5vol-l1 has failed, as indicated by the BADLOG plex state. See volprint(8) for more information.
8.4.7.2 Monitoring Performance Statistics by Using the volstat Utility
The volstat utility provides information about activity on volumes, plexes, subdisks, and disks under LSM control. It reports statistics that reflect the activity levels of LSM objects since boot time.
In a TruCluster configuration, the volstat utility provides information only for the member system on which you invoke the command.
The amount of information displayed depends on which options you specify with the volstat utility. For example, you can display statistics for a specific LSM object, or you can display statistics for all objects at one time. If you specify a disk group, only statistics for objects in that disk group are displayed. If you do not specify a particular disk group, the volstat utility displays statistics for the default disk group (rootdg).
You can also use the volstat utility to reset the base statistics to zero. This can be done for all objects or only for specified objects. Resetting the statistics to zero before a particular operation makes it possible to measure the subsequent impact of that operation.
LSM records the following three I/O statistics:
A count of read and write operations.
The number of read and write blocks.
The average operation time. This time reflects the total time it took to complete an I/O operation, including the time spent waiting in a disk queue on a busy device.
LSM records these statistics for logical I/Os for each volume. The statistics are recorded for the following types of operations: reads, writes, atomic copies, verified reads, verified writes, plex reads, and plex writes. For example, one write to a two-plex volume requires updating statistics for the volume, both plexes, one or more subdisks for each plex, and one disk for each subdisk. Likewise, one read that spans two subdisks requires updating statistics for the volume, both subdisks, and both disks that contain the subdisks.
Because LSM maintains various statistics for each disk I/O, you can use LSM to understand your application's I/O workload and to identify bottlenecks. LSM often uses a single disk for multiple purposes to distribute the overall I/O workload and optimize I/O performance. If you use traditional disk partitions, monitoring tools combine statistics for an entire disk. If you use LSM, you can obtain statistics for an entire disk and also for its subdisks, which enables you to determine how the disk is being used (for example, by file system operations, raw I/O, swapping, or a database application).
LSM volume statistics enable you to characterize the I/O usage pattern for an application or file system, and LSM plex statistics can determine the effectiveness of a striped plex's stripe width (size). You can also combine LSM performance statistics with the LSM online configuration support tool to identify and eliminate I/O bottlenecks without shutting down the system or interrupting access to disk storage.
After measuring actual data-access patterns, you can adjust the placement of file systems. You can reassign data to specific disks to balance the I/O load among the available storage devices. You can reconfigure volumes on line after performance patterns have been established without adversely affecting volume availability.
LSM also maintains other statistical data. For example, read and write failures that appear for each mirror, and corrected read and write failures for each volume, accompany the read and write failures that are recorded.
The following example displays statistical data for volumes, plexes, subdisks, and disks:
# volstat -vpsd
                       OPERATIONS          BLOCKS         AVG TIME(ms)
TYP NAME            READ    WRITE      READ     WRITE     READ   WRITE
dm  dsk6               3       82        40     62561      8.9    51.2
dm  dsk7               0      725         0    176464      0.0    16.3
dm  dsk9             688       37    175872       592      3.9     9.2
dm  dsk10          29962        0   7670016         0      4.0     0.0
dm  dsk12              0    29962         0   7670016      0.0    17.8
vol v1                 3       72        40     62541      8.9    56.5
pl  v1-01              3       72        40     62541      8.9    56.5
sd  dsk6-01            3       72        40     62541      8.9    56.5
vol v2                 0       37         0       592      0.0    10.5
pl  v2-01              0       37         0       592      0.0     8.0
sd  dsk7-01            0       37         0       592      0.0     8.0
sd  dsk12-01           0        0         0         0      0.0     0.0
pl  v2-02              0       37         0       592      0.0     9.2
sd  dsk9-01            0       37         0       592      0.0     9.2
sd  dsk10-01           0        0         0         0      0.0     0.0
pl  v2-03              0        6         0        12      0.0    13.3
sd  dsk6-02            0        6         0        12      0.0    13.3
See volstat(8) for more information.
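To watch current activity over a fixed interval rather than the cumulative totals shown above, you can run volstat repeatedly at an interval. The -i option shown here is an assumption based on the interval capability described in Table 8-7; check volstat(8) for the exact option name:
# volstat -vpsd -i 5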
8.4.7.3 Tracking Operations by Using the voltrace Utility
Use the voltrace utility to trace operations on volumes. You can set I/O tracing masks against a group of volumes or the entire system. You can then use the voltrace utility to display ongoing I/O operations relative to the masks.
In a TruCluster configuration, the voltrace utility provides information only for the member system on which you invoke the command.
The trace records for each physical I/O show a volume and buffer-pointer combination that enables you to track each operation, even though the traces may be interspersed with other operations. Similar to the I/O statistics for a volume, the I/O trace statistics include records for each physical I/O done, and a logical record that summarizes all physical records.
Note
Because the voltrace utility requires significant overhead and produces a large amount of output, run the command only occasionally.
The following example uses the voltrace utility to trace volumes:
# /usr/sbin/voltrace -l
96 598519 START read vdev v2 dg rootdg dev 40,6 block 89 len 1 concurrency 1 pid 43
96 598519 END read vdev v2 dg rootdg op 926159 block 89 len 1 time 1
96 598519 START read vdev v2 dg rootdg dev 40,6 block 90 len 1 concurrency 1 pid 43
96 598519 END read vdev v2 dg rootdg op 926160 block 90 len 1 time 1
See voltrace(8) for more information.
8.4.7.4 Monitoring Events by Using the volwatch Script
The volwatch script is automatically started when you install LSM. This script sends mail if certain LSM configuration events occur, such as a plex detach caused by a disk failure. The script also enables hot sparing.
The volwatch script sends mail to root by default. To specify another mail recipient or multiple mail recipients, use the rcmgr command to set the rc.config.common variable VOLWATCH_USERS.
See volwatch(8) for more information.
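For example, the following sketch adds a second mail recipient. The recipient names are hypothetical, and the rcmgr option needed to write to rc.config.common (rather than rc.config) is not covered here, so verify the required syntax in rcmgr(8) before use:
# rcmgr set VOLWATCH_USERS "root admin"
You may need to restart the volwatch script for the new recipient list to take effect.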
8.4.7.5 Monitoring Events by Using the volnotify Utility
The volnotify utility monitors events related to disk and configuration changes, as managed by the vold configuration daemon. The volnotify utility displays requested event types until it is killed by a signal, until a given number of events have been received, or until a given number of seconds have passed.
The volnotify utility can display the following events:
Disk group import, deport, and disable events
Plex, volume, and disk detach events
Disk change events
Disk group change events
In a TruCluster configuration, the volnotify utility reports only events that occur locally on the member system. Therefore, use EVM to obtain LSM events that occur anywhere within the cluster.
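As a rough sketch, you might use the EVM command-line utilities to monitor LSM-related events across the cluster; the event-name pattern shown here is an assumption and may not match the event templates installed on your system, so check evmwatch(8) and the available templates before relying on it.
# evmwatch -A -f '[name *.lsm.*]'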
8.5 Managing Hardware RAID Subsystem Performance
Hardware RAID subsystems provide RAID functionality for high performance and high availability, relieve the CPU of disk I/O overhead, and enable you to connect many disks to a single I/O bus. There are various types of hardware RAID subsystems with different performance and availability features, but they all include a RAID controller, disks in enclosures, cabling, and disk management software.
RAID storage solutions range from low-cost backplane RAID array controllers to cluster-capable RAID array controllers that provide extensive performance and availability features, such as write-back caches and complete component redundancy.
Hardware RAID subsystems use disk management software, such as the RAID Configuration Utility (RCU) and the StorageWorks Command Console (SWCC) utility, to manage the RAID devices. Menu-driven interfaces allow you to select RAID levels.
Use hardware RAID to combine multiple disks into a single storage set that the system sees as a single unit. A storage set can consist of a simple set of disks, a striped set, a mirrored set, or a RAID set. You can create LSM volumes, AdvFS file domains, or UFS file systems on a storage set, or you can use the storage set as a raw device.
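For example, the following sketch creates an AdvFS file domain and fileset on a hardware RAID storage set that the operating system sees as a single disk; the device name /dev/disk/dsk10c, the domain name raid_dmn, and the fileset name data1 are hypothetical.
# mkfdmn /dev/disk/dsk10c raid_dmn
# mkfset raid_dmn data1
# mkdir /data1
# mount -t advfs raid_dmn#data1 /data1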
The following sections discuss these hardware RAID topics:
Hardware RAID features (Section 8.5.1)
Hardware RAID products (Section 8.5.2)
Guidelines for hardware RAID configurations (Section 8.5.3)
See the hardware RAID product documentation for detailed configuration
information.
8.5.1 Hardware RAID Features
Hardware RAID storage solutions range from low-cost backplane RAID array controllers to cluster-capable RAID array controllers that provide extensive performance and availability features. All hardware RAID subsystems provide you with the following features:
A RAID controller that relieves the CPU of the disk I/O overhead
Increased disk storage capacity
Hardware RAID subsystems allow you to connect a large number of disks to a single I/O bus. In a typical storage configuration, you attach a disk storage shelf to a system by using a SCSI bus connected to a host bus adapter installed in an I/O bus slot. However, you can connect only a limited number of disks to a SCSI bus, and systems have a limited number of I/O bus slots.
In contrast, hardware RAID subsystems contain multiple internal SCSI buses that can be connected to a system by using a single I/O bus slot.
Read cache
A read cache improves I/O read performance by holding data that it anticipates the host will request. If a system requests data that is already in the read cache (a cache hit), the data is immediately supplied without having to read the data from disk. Subsequent data modifications are written both to disk and to the read cache (write-through caching).
Write-back cache
Hardware RAID subsystems support write-back caches (as a standard or an optional feature), which can improve I/O write performance while maintaining data integrity. A write-back cache decreases the latency of many small writes, and can improve Internet server performance because writes appear to be written immediately. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache, consolidated, and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
A write-back cache must be battery-backed to protect against data loss and corruption.
RAID support
All hardware RAID subsystems support RAID 0 (disk striping), RAID 1 (disk mirroring), and RAID 5. High-performance RAID array subsystems also support RAID 3 and dynamic parity RAID. See Section 1.2.3.1 for information about RAID levels.
Non-RAID disk array capability or "just a bunch of disks" (JBOD)
Component hot swapping and hot sparing
Hot swap support allows you to replace a failed component while the system continues to operate. Hot spare support allows you to automatically use previously installed components if a failure occurs.
Graphical user interface (GUI) for easy management and monitoring
8.5.2 Hardware RAID Products
There are different types of hardware RAID subsystems, which provide various degrees of performance and availability at various costs. Compaq supports the following hardware RAID subsystems:
Backplane RAID array storage subsystems
These entry-level subsystems, such as those utilizing the RAID Array 230/Plus storage controller, provide a low-cost hardware RAID solution and are designed for small and midsize departments and workgroups.
A backplane RAID array storage controller is installed in an I/O bus slot, either a PCI bus slot or an EISA bus slot, and acts as both a host bus adapter and a RAID controller.
Backplane RAID array subsystems provide RAID functionality (0, 1, 0+1, and 5), an optional write-back cache, and hot swap functionality.
High-performance RAID array subsystems
These subsystems, such as the RAID Array 450 subsystem, provide extensive performance and availability features and are designed for client/server, data center, and medium to large departmental environments.
A high-performance RAID array controller, such as an HSZ50 controller, is connected to a system through a FWD SCSI bus and a high-performance host bus adapter installed in an I/O bus slot.
High-performance RAID array subsystems provide RAID functionality (0, 1, 0+1, 3, 5, and dynamic parity RAID), dual-redundant controller support, scalability, storage set partitioning, a standard battery-backed write-back cache, and components that can be hot swapped.
Enterprise Storage Arrays (ESA)
These preconfigured high-performance hardware RAID subsystems, such as the RAID Array 10000, provide the highest performance, availability, and disk capacity of any RAID subsystem. They are used for transaction-intensive applications and high-bandwidth decision-support applications.
ESAs support all major RAID levels, including dynamic parity RAID; fully redundant components that can be hot swapped; a standard battery-backed write-back cache; and centralized storage management.
See the Compaq Systems & Options Catalog for detailed information about hardware RAID subsystem features.
8.5.3 Hardware RAID Configuration Guidelines
Table 8-8 describes the hardware RAID subsystem configuration guidelines and lists performance benefits as well as tradeoffs.
Table 8-8: Hardware RAID Subsystem Configuration Guidelines
Guideline | Performance Benefit | Tradeoff |
Evenly distribute disks in a storage set across different buses (Section 8.5.3.1) | Improves performance and helps to prevent bottlenecks | None |
Use disks with the same data capacity in each storage set (Section 8.5.3.2) | Simplifies storage management | None |
Use an appropriate stripe size (Section 8.5.3.3) | Improves performance | None |
Mirror striped sets (Section 8.5.3.4) | Provides availability and distributes the disk I/O load | Increases configuration complexity and may decrease write performance |
Use a write-back cache (Section 8.5.3.5) | Improves write performance, especially for RAID 5 storage sets | Cost of hardware |
Use dual-redundant RAID controllers (Section 8.5.3.6) | Improves performance, increases availability, and prevents I/O bus bottlenecks | Cost of hardware |
Install spare disks (Section 8.5.3.7) | Improves availability | Cost of disks |
Replace failed disks promptly (Section 8.5.3.7) | Improves performance | None |
The following sections describe some of these guidelines.
See your
RAID subsystem documentation for detailed configuration information.
8.5.3.1 Distributing Storage Set Disks Across Buses
You can improve performance and help to prevent bottlenecks by distributing storage set disks evenly across different buses.
In addition, make sure that the first member of each mirrored set is
on a different bus.
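As a sketch only: on an HSZ-series controller, disks named DISK100, DISK200, and DISK300 reside on device ports 1, 2, and 3, so building a stripeset from them spreads its members across buses. The container name S1 and the unit number D101 are hypothetical, and the exact CLI syntax depends on your controller and firmware version; see its documentation.
HSZ50> ADD STRIPESET S1 DISK100 DISK200 DISK300
HSZ50> INITIALIZE S1
HSZ50> ADD UNIT D101 S1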
8.5.3.2 Using Disks with the Same Data Capacity
Use disks with the same capacity in a storage set.
This simplifies storage management.
8.5.3.3 Choosing the Correct Hardware RAID Stripe Size
You must understand how your applications perform disk I/O before you can choose the stripe (chunk) size that will provide the best performance benefit. See Section 2.1 for information about identifying a resource model for your system.
Here are some guidelines for stripe sizes:
If the stripe size is large compared to the average I/O size, each disk in a stripe set can respond to a separate data transfer. I/O operations can then be handled in parallel, which increases sequential write performance and throughput. This can improve performance for environments that perform large numbers of I/O operations, including transaction processing, office automation, and file services environments, and for environments that perform multiple random read and write operations.
If the stripe size is smaller than the average I/O size, multiple disks can simultaneously handle a single I/O operation, which can increase bandwidth and improve sequential file processing. This is beneficial for image-processing and data-collection environments. However, do not make the stripe size so small that it degrades performance for large sequential data transfers.
For example, if you use an 8-KB stripe size, small data transfers will be distributed evenly across the member disks, but a 64-KB data transfer will be divided into at least eight data transfers.
In addition, the following guidelines can help you to choose the correct stripe size:
Raw disk I/O operations
If your applications are doing I/O to a raw device and not a file system, use a stripe size that distributes a single data transfer evenly across the member disks. For example, if the typical I/O size is 1 MB and you have a four-disk array, you could use a 256-KB stripe size. This would distribute the data evenly among the four member disks, with each doing a single 256-KB data transfer in parallel.
Small file system I/O operations
For small file system I/O operations, use a stripe size that is a multiple of the typical I/O size (for example, four to five times the I/O size). This will help to ensure that the I/O is not split across disks.
I/O to a specific range of blocks
Choose a stripe size that will prevent any particular range of blocks from becoming a bottleneck. For example, if an application often uses a particular 8-KB block, you may want to use a stripe size that is slightly larger or smaller than 8 KB or is a multiple of 8 KB to force the data onto a different disk.
8.5.3.4 Mirroring Striped Sets
Striped disks improve I/O performance by distributing
the disk I/O load.
However, striping decreases availability because a single
disk failure will cause the entire stripe set to be unavailable.
To make a
stripe set highly available, you can mirror the stripe set.
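HSZ-series controllers typically provide mirrored striping (RAID 0+1) by striping across mirrorsets rather than by mirroring an existing stripeset. The following sketch shows that approach; the disk, mirrorset, stripeset, and unit names are hypothetical, and the exact commands depend on your controller firmware.
HSZ50> ADD MIRRORSET M1 DISK100 DISK200
HSZ50> ADD MIRRORSET M2 DISK300 DISK400
HSZ50> ADD STRIPESET S1 M1 M2
HSZ50> INITIALIZE S1
HSZ50> ADD UNIT D102 S1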
8.5.3.5 Using a Write-Back Cache
RAID subsystems support, either as a standard or an optional feature, a nonvolatile (battery-backed) write-back cache that can improve disk I/O performance while maintaining data integrity. A write-back cache improves performance for systems that perform large numbers of writes and for RAID 5 storage sets. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
A write-back cache improves performance, especially for Internet servers, because writes appear to be written immediately. If a failure occurs, upon recovery, the RAID controller detects any unwritten data that still exists in the write-back cache and writes the data to disk before enabling normal controller operations.
A write-back cache must be battery-backed to protect against data loss and corruption.
If you are using an HSZ40 or HSZ50 RAID controller with a write-back cache, the following guidelines may improve performance; an example of the corresponding controller commands follows the list:
Set CACHE_POLICY to B.
Set CACHE_FLUSH_TIMER to a minimum of 45 (seconds).
Enable the write-back cache (WRITEBACK_CACHE) for each unit, and set the value of MAXIMUM_CACHED_TRANSFER_SIZE to a minimum of 256.
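The following sketch shows controller CLI commands that correspond to these guidelines, assuming a unit named D101; the unit name is hypothetical, and the exact syntax may vary with controller firmware version, so verify it in the controller documentation.
HSZ50> SET THIS_CONTROLLER CACHE_POLICY=B
HSZ50> SET THIS_CONTROLLER CACHE_FLUSH_TIMER=45
HSZ50> SET D101 WRITEBACK_CACHE
HSZ50> SET D101 MAXIMUM_CACHED_TRANSFER_SIZE=256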
See the RAID subsystem documentation for more information about using
the write-back cache.
8.5.3.6 Using Dual-Redundant Controllers
If supported by your RAID subsystem, you
can use a dual-redundant controller configuration and balance the number of
disks across the two controllers.
This can improve performance, increase availability,
and prevent I/O bus bottlenecks.
8.5.3.7 Using Spare Disks to Replace Failed Disks
Install predesignated spare disks on separate controller ports
and storage shelves.
This will help you to maintain data availability and
recover quickly if a disk failure occurs.
8.6 Managing CAM Performance
The Common Access Method (CAM) is the operating system interface to the hardware. CAM maintains pools of buffers that are used to perform I/O. Each buffer takes approximately 1 KB of physical memory. Monitor these pools and tune them if necessary.
You may be able to modify the following io subsystem attributes to improve CAM performance:
cam_ccb_pool_size -- The initial size of the buffer pool free list at boot time. The default is 200.
cam_ccb_low_water -- The number of buffers in the pool free list at which more buffers are allocated from the kernel. CAM reserves this number of buffers to ensure that the kernel always has enough memory to shut down runaway processes. The default is 100.
cam_ccb_increment -- The number of buffers added to or removed from the buffer pool free list. Buffers are allocated on an as-needed basis to handle immediate demands, but are released in a more measured manner to guard against spikes. The default is 50.
If the I/O pattern associated with your system tends to have intermittent bursts of I/O operations (I/O spikes), increasing the values of the cam_ccb_pool_size and cam_ccb_increment attributes may improve performance.
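For example, you might query the current values with sysconfig and, if your system allows these attributes to be reconfigured at run time, raise them; the values 400 and 100 are illustrative assumptions, not recommendations, and attributes that cannot be changed at run time must instead be set in /etc/sysconfigtab and take effect after a reboot.
# sysconfig -q io cam_ccb_pool_size cam_ccb_increment
# sysconfig -r io cam_ccb_pool_size=400 cam_ccb_increment=100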
You may be able to diagnose CAM performance problems by using dbx to examine the ccmn_bp_head data structure, which provides statistics on the buffer structure pool that is used for raw disk I/O. The information provided is the current size of the buffer structure pool (num_bp) and the wait count for buffers (bp_wait_cnt).
For example:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print ccmn_bp_head
struct {
    num_bp = 50
    bp_list = 0xffffffff81f1be00
    bp_wait_cnt = 0
}
(dbx)
If the value of the bp_wait_cnt field is not zero, CAM has run out of buffer pool space. If this situation persists, you may be able to eliminate the problem by changing one or more of the CAM subsystem attributes described in this section.
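To make such a change persistent across reboots, you could add an io stanza similar to the following sketch to /etc/sysconfigtab (for example, by using the sysconfigdb utility); the values shown are illustrative assumptions.
io:
    cam_ccb_pool_size = 400
    cam_ccb_increment = 100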