A storage subsystem consists of software (operating system or layered product) and hardware (including host bus adapters, cables, and disks). Your storage configuration can have a significant impact on system performance, because disk I/O is used for file system operations and also by the virtual memory subsystem for paging and swapping.
To configure a storage subsystem that will meet your performance and availability needs, you must first understand the I/O requirements of the users and applications and how they perform disk I/O, as described in Chapter 1. After you configure your storage subsystem, you may be able to tune the subsystem to improve performance.
This chapter describes the features of different storage subsystems and provides guidelines for configuring and tuning the subsystems.
Many of the tuning tasks described in this chapter require you to modify system attributes. See Section 2.11 for more information about attributes.
Disk I/O operations are significantly slower than data transfers involving the CPU or memory caches. Because disks are used for data storage and for virtual memory swap space, an incorrectly configured or tuned storage subsystem can degrade overall system performance.
Disk I/O performance can be affected by the following variables:
Workload characteristics
Performance depends on how your users and applications perform disk I/O. For example, a workload can involve primarily read or write I/O operations. In addition, some workloads require low latency and high throughput, while others require a fast data transfer rate (high bandwidth).
Low latency is important for multiple small data transfers and also for workstation, timesharing, and server environments. High bandwidth is important for systems that perform large sequential data transfers, such as database servers. See Chapter 1 for more information about characterizing your disk I/O.
Performance capacity of the hardware
DIGITAL recommends that you use hardware with the best performance features. For example, disks with a high rate of revolutions per minute (RPM) provide the best overall performance. Wide disks, which support 16-bit transfers, have twice the bandwidth of narrow (8-bit) disks and can improve performance for large data transfers. High-performance host bus adapters, such as fast wide differential (FWD) adapters, provide low CPU overhead and high bandwidth. In addition, a write-back cache decreases the latency of small writes and can improve throughput.
Memory allocation to the UBC
The Unified Buffer Cache (UBC) is allocated a portion of physical memory and caches actual file system data for reads and writes, Advanced File System (AdvFS) metadata, and Memory File System (MFS) data. The UBC decreases the number of disk operations for file systems by serving as a layer between the disk and the operating system. The metadata buffer cache and the AdvFS buffer cache are also allocated a percentage of physical memory.
Kernel variable values
Disk I/O performance depends on kernel variable values that are appropriate for your workload and configuration. You may need to modify the default values to obtain optimal system performance, as described in this manual.
Mirrored disk configuration
Mirroring data across different disks improves the performance of read operations and provides high data availability. However, because data must be written to two separate locations, mirroring degrades disk write performance.
Striped disk configuration
Striping data across multiple disks distributes the I/O load and enables parallel I/O streams to operate concurrently on different devices, which improves disk I/O performance for some workloads.
Hardware RAID subsystem configuration
Hardware RAID subsystems relieve the CPU of disk management overhead and support write-back caches, which can improve disk I/O performance for some workloads.
File system configuration
File systems, including UFS and AdvFS, are used to organize and manage files. AdvFS provides you with fast file system recovery, improved performance for sequential and large I/O operations, and disk defragmentation features.
Raw I/O
For some workloads, raw I/O (I/O to a disk that does not contain a file system) may have better performance than file system I/O because it bypasses buffers and caches.
To choose a storage subsystem that will meet the needs of your users and applications, you must understand the benefits and tradeoffs of the various disk and file management options, as described in Section 5.2.
DIGITAL UNIX supports a number of methods that you can use to manage the physical disks and files in your environment. The traditional method of managing disks and files is to divide each disk into logical areas called disk partitions, and to then create a file system on a partition or use a partition for raw I/O. A disk can consist of one to eight partitions that have a fixed size; these partitions cannot overlap.
Each disk type has a default partition scheme.
The disktab database file lists the default disk partition sizes. The partition size determines the amount of data it can hold. To modify the size of a partition, you must back up any data in the partition, change the size by using the disklabel command, and then restore the data to the resized partition. You must be sure that the data will fit into the new partition.
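For example, the following sketch shows one way to resize a UFS partition. It assumes that partition g on disk rz3 holds a UFS file system mounted on /data, that a tape drive is available at /dev/nrmt0h, and that the new partition is large enough for the data; all device and directory names are illustrative.
     # dump -0u -f /dev/nrmt0h /data     (back up the file system to tape)
     # umount /data
     # disklabel -e rz3                  (edit the partition sizes)
     # newfs /dev/rrz3g                  (re-create the file system)
     # mount /dev/rz3g /data
     # cd /data
     # restore -rf /dev/nrmt0h           (restore the data)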
An alternative method to managing disks with static disk partitions is to use the Logical Storage Manager (LSM) to set up a shared storage pool that consists of multiple disks. You can then create virtual disks from this pool of storage, according to your performance and capacity needs. LSM provides you with flexible and easy management for large storage configurations. Because there is no direct correlation between a virtual disk and a physical disk, file system or raw I/O can span disks, as needed. In addition, you can easily add disks to and remove disks from the pool, balance the load, and perform other storage management tasks. LSM also provides you with high-performance and high-availability RAID functionality.
Hardware RAID subsystems provide another method of handling storage. These subsystems use intelligent controllers to provide high-performance and high-availability RAID functionality, allow you to increase your storage capacity, and support write-back caches. RAID controllers allow you to combine several disks into a single storage set that the system sees as a single unit.
You can choose to manage your file systems by using AdvFS. AdvFS provides file system features beyond those of a traditional UFS file system. Unlike the rigid UFS model in which the file system directory hierarchy (tree) is bound tightly to the physical storage, AdvFS consists of two distinct layers: the directory hierarchy layer and the physical storage layer. This decoupled file system structure enables you to manage the physical storage layer apart from the directory hierarchy layer. This means that you can move files between a defined group of disk volumes without changing file pathnames. Because the pathnames remain the same, the action is completely transparent to end users.
You can use different configurations in your environment. For example, you can create static partitions on some disks and use the remaining disks in LSM volumes. You can also combine products in the same configuration. For example, you can configure AdvFS file domains on top of LSM volumes, or configure LSM volumes on top of RAID storage sets.
The following sections describe the features of the different disk and file system management options.
RAID (redundant array of independent disks) technology can provide both high disk I/O performance and high data availability. The DIGITAL UNIX operating system provides RAID functionality by using the Logical Storage Manager (LSM) product. DIGITAL UNIX also supports hardware RAID subsystems, which provide RAID functionality by using intelligent controllers, caches, and software.
There are four primary RAID levels:
RAID 0--Also known as disk striping, RAID 0 divides data into blocks (sometimes called chunks or stripes) and distributes the blocks across multiple disks in an array. Striping enables parallel I/O streams to operate concurrently on different devices. I/O operations can be handled simultaneously by multiple devices, which balances the I/O load and improves performance.
The performance benefit of striping depends on the size of the stripe and how your users and applications perform disk I/O. For example, if an application performs multiple simultaneous I/O operations, you can specify a stripe size that will enable each disk in the array to handle a separate I/O operation. If an application performs large sequential data transfers, you can specify a stripe size that will distribute a large I/O evenly across the disks.
For volumes that receive only one I/O at a time, you may not want to use striping if access time is the most important factor. In addition, striping may degrade the performance of small data transfers, because of the latencies of the disks and the overhead associated with dividing a small amount of data.
Striping decreases data availability because one disk failure makes the entire disk array unavailable. To make striped disks highly available, you can mirror the disks.
RAID 1--Also known as disk mirroring, RAID 1 provides high data availability by maintaining identical copies of data on different disks in an array. RAID 1 also improves the disk read performance, because data can be read from two different locations. However, RAID 1 can decrease disk write performance, because data must be written to two different locations.
RAID 3--A type of parity RAID, RAID 3 divides data blocks and distributes the data across a disk array, providing parallel access to data. RAID 3 provides a high data transfer rate and increases bandwidth, but it provides no improvement in throughput (the I/O transaction rate).
RAID 3 can improve the I/O performance for applications that transfer large amounts of sequential data, but it provides no improvement for applications that perform multiple I/O operations involving small amounts of data.
RAID 3 provides high data availability by storing redundant parity information on a separate disk. The parity information is used to regenerate data if a disk in the array fails. However, performance degrades as multiple disks fail, and data reconstruction is slower than if you had used mirroring.
RAID 5--A type of parity RAID, RAID 5 distributes data blocks across disks in an array. RAID 5 allows independent access to data and can handle simultaneous I/O operations.
RAID 5 can improve throughput, especially for large file I/O operations, multiple small data transfers, and I/O read operations. However, it is not suited to write-intensive applications.
RAID 5 provides data availability by distributing redundant parity information across disks. Each array member contains enough parity information to regenerate data if a disk fails. However, performance may degrade and data may be lost if multiple disks fail. In addition, data reconstruction is slower than if you had used mirroring.
To address your performance and availability needs, you can combine some RAID levels. For example, you can combine RAID 0 with RAID 1 to mirror striped disks for high availability and high performance.
In addition, some DIGITAL hardware RAID subsystems support adaptive RAID 3/5 (also called dynamic parity RAID), which improves disk I/O performance for a wide variety of applications by dynamically adjusting, according to workload needs, between data transfer-intensive algorithms and I/O operation-intensive algorithms.
Table 5-1 compares the performance and availability features for the different RAID levels.
RAID Level | Performance Impact | Availability Impact |
RAID 0 | Balances I/O load and improves reads and writes | Lower than single disk |
RAID 1 | Improves reads, may degrade writes | Highest |
RAID 0+1 | Balances I/O load, improves reads, may degrade writes | Highest |
RAID 3 | Improves bandwidth, performance may degrade if multiple disks fail | Higher than single disk |
RAID 5 | Improves throughput, performance may degrade if multiple disks fail | Higher than single disk |
Adaptive RAID 3/5 | Improves bandwidth and throughput, performance may degrade if multiple disks fail | Higher than single disk |
It is important to understand that RAID performance depends on the state of the devices in the RAID subsystem. There are three possible states: steady state (no failures), failure (one or more disks have failed), and recovery (subsystem is recovering from failure).
There are many variables to consider when choosing a RAID configuration:
Not all RAID products support all RAID levels.
For example, LSM currently supports only RAID 0 (striping) and RAID 1 (mirroring), and only high-performance RAID controllers support adaptive RAID 3/5.
RAID products provide different performance benefits.
For example, hardware RAID subsystems support write-back caches and other performance-enhancing features and also relieve the CPU of the I/O overhead.
Some RAID configurations are more cost-effective than others.
In general, LSM provides more cost-effective RAID functionality than hardware RAID subsystems. In addition, parity RAID provides data availability at a cost that is lower than RAID 1 (mirroring), because mirroring n disks requires 2n disks.
Data recovery rates depend on the RAID configuration.
For example, if a disk fails, it is faster to regenerate data by using a mirrored copy than by using parity information. In addition, if you are using parity RAID, I/O performance declines as additional disks fail.
There are advantages to each RAID product, and which one you choose depends on your workload requirements and other factors. The following sections describe the features of the different RAID subsystems and LSM.
Hardware RAID subsystems use a combination of hardware (RAID controllers, caches, and host bus adapters) and software to provide high disk I/O performance and high data availability. A hardware RAID subsystem is sometimes called hardware RAID.
All hardware RAID subsystems provide you with the following features:
A RAID controller that relieves the CPU of the disk I/O overhead
Increased disk storage capacity
Hardware RAID subsystems allow you to connect a large number of disks to your system. In a typical storage configuration, you use a SCSI bus connected to an I/O bus slot to attach disks to a system. However, you can connect only a limited number of disks on a SCSI bus, and systems have limited I/O bus slots. Hardware RAID subsystems contain internal SCSI buses and host bus adapters, which enable you to connect multiple SCSI buses and multiple disks to a system by using only one I/O bus slot.
Read cache
A read cache can improve I/O read performance by holding data that it anticipates the host will request. If a system requests data that is already in the read cache (a cache hit), the data is immediately supplied without having to read the data from disk. Subsequent data modifications are written both to disk and to the read cache (write-through caching).
Write-back cache
Hardware RAID subsystems support (as a standard or an optional feature) a nonvolatile write-back cache, which can improve I/O write performance while maintaining data integrity. A write-back cache decreases the latency of many small writes, and can improve Web server performance because writes appear to be executed immediately. A write-back cache must be battery-backed to protect against data loss and corruption.
With write-back caching, data intended to be written to disk is temporarily stored in the cache, consolidated, and then periodically written (flushed) to disk for maximum efficiency. If a failure occurs, upon recovery, the RAID controller detects any unwritten data that still exists in the write-back cache and writes the data to disk before enabling normal controller operations.
Parity RAID support
Hardware RAID subsystems provide various levels of parity RAID support (RAID 3, RAID 5, or adaptive RAID 3/5) for high performance and high availability.
Hot component swapping and sparing
Hot swap support allows you to replace a failed component while the system continues to operate. Hot spare support allows you to automatically use previously installed components if a failure occurs.
Non-RAID disk array capability or "just a bunch of disks" (JBOD)
Graphical user interface (GUI) for easy management and monitoring
The volstat command, which provides detailed LSM performance information
There are various hardware RAID subsystems, including backplane RAID array subsystems and high-performance standalone RAID array subsystems, which provide different degrees of performance and availability at various costs. The features of these two subsystems are as follows:
Backplane RAID array subsystems
These entry-level subsystems, such as the RAID Array 230 subsystem, provide a low-cost hardware RAID solution. A backplane RAID array controller is installed in an I/O bus slot, either a PCI bus slot or an EISA bus slot, and acts as both a host bus adapter and a RAID controller.
Backplane RAID array subsystems are designed for small and midsize departments and workgroups, and provide RAID functionality (0, 1, 0+1, and 5) and an optional write-back cache.
Standalone RAID array subsystems
These subsystems, such as the RAID Array 450 subsystem, provide high availability and the highest performance of any RAID subsystem. A standalone RAID array subsystem uses a high-performance controller, such as the HSZ controller. The controller connects to the system through a FWD SCSI bus and a high-performance host bus adapter, such as a KZPSA adapter, installed in an I/O bus slot.
Standalone RAID array subsystems are designed for client/server, data center, and medium to large departmental environments. They provide RAID functionality (0, 1, 0+1, and adaptive RAID 3/5), dual-redundant controller support, scalability, storage set partitioning, and a standard write-back cache.
See Section 5.5 for information on configuring hardware RAID subsystems.
Logical Storage Manager (LSM) can improve disk I/O performance, provide high data availability, and help you to manage your storage more efficiently. All DIGITAL UNIX systems can use the basic LSM functions, but advanced disk management functions require a separate LSM license. When LSM is used to stripe or mirror disks, it is sometimes referred to as software RAID.
LSM allows you to organize a shared storage pool into volumes, which are used in the same way as disk partitions, except that I/O directed to a volume can span disks. You can create a UFS file system or an AdvFS file domain on a volume, or you can use a volume as a raw device. You can also create LSM volumes on top of RAID storage sets.
LSM supports the following disk management features:
Pool of storage
Load balancing by transparently moving data across disks
RAID 1 (disk mirroring) support (license necessary)
Disk concatenation (creating a large volume from multiple disks)
Graphical user interface (GUI) for easy disk management and detailed performance information (license necessary)
LSM provides more cost-effective RAID functionality than a hardware RAID subsystem. In addition, LSM configurations are less complex than hardware RAID configurations. To obtain the performance benefits of both LSM and hardware RAID, you can create LSM volumes on top of RAID storage sets.
LSM is especially suited for systems with large numbers of disks. For these systems, you may want to use LSM to manage your disks and AdvFS to manage your files. That is, you can organize your disks into LSM volumes and then use those volumes to create AdvFS file domains.
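For example, the following sketch creates an AdvFS file domain and fileset on an LSM volume. It assumes a volume named vol01 in the rootdg disk group and a mount point of /data; the domain, fileset, and device names are illustrative.
     # mkfdmn /dev/vol/rootdg/vol01 data_dmn
     # mkfset data_dmn data_fs
     # mount -t advfs data_dmn#data_fs /data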
Advanced File System (AdvFS) is a DIGITAL UNIX file system option that provides many file management and performance features. You can use AdvFS instead of UFS to organize and manage your files.
The AdvFS Utilities product, which is licensed separately from the DIGITAL UNIX operating system, extends the capabilities of the AdvFS file system. An AdvFS file domain can consist of multiple volumes, which can be UNIX block devices (entire disks), disk partitions, LSM logical volumes, or RAID storage sets. AdvFS filesets can span all the volumes in the file domain.
AdvFS provides the following file management features:
Fast file system recovery
Rebooting after a system interruption is extremely fast. AdvFS uses write-ahead logging, instead of the fsck utility, as a way to check for and repair file system inconsistencies. The recovery speed depends on the number of uncommitted records in the log, not the amount of data in the fileset; therefore, reboots are quick and predictable.
High-performance file system
AdvFS uses an extent-based file allocation scheme that consolidates data transfers, which increases sequential bandwidth and improves performance for large data transfers. AdvFS performs large reads from disk when it anticipates a need for sequential data. AdvFS also performs large writes by combining adjacent data into a single data transfer.
Online file system management
File domain defragmentation
Support for large files and file systems
User quotas
AdvFS utilities provide the following features:
Pool of storage that allows you to add, remove, and back up disks without disrupting users or applications.
Disk spanning filesets
Ability to recover deleted files
Users can retrieve their own unintentionally deleted files from predefined trashcan directories or from clone filesets, without assistance from system administrators.
I/O load balancing across disks
Online fileset resizing
Online file migration across disks
File-level striping
File-level striping may improve I/O bandwidth (transfer rates) by distributing file data across multiple disk volumes.
Graphical user interface (GUI) that simplifies disk and file system administration, provides status, and alerts you to potential problems
See Section 5.6 for information about AdvFS configuration and tuning guidelines.
There are some general guidelines for configuring and tuning storage subsystems. These guidelines are applicable to most configurations and will help you to get the best disk I/O performance, regardless of whether you are using static partitions, raw devices, LSM, hardware RAID subsystems, AdvFS, or UFS.
These guidelines fall into three categories:
Using high-performance hardware (see Table 5-2)
Distributing the disk I/O load (see Table 5-3)
General file system tuning (see Table 5-4)
The following sections describe these guidelines in detail.
Using high-performance hardware will provide the best disk I/O performance, regardless of your storage configuration. Table 5-2 describes the guidelines for hardware configurations and lists the performance benefits as well as the tradeoffs.
Hardware | Performance Benefit | Tradeoff |
Fast (high RPM) disks (Section 5.3.1.1) | Improve disk access time and sequential data transfer performance | Cost |
Disks with small platter sizes (Section 5.3.1.2) | Improve seek times for applications that perform many small I/O operations | No benefit for large sequential data transfers |
Wide disks (Section 5.3.1.3) | Provide high bandwidth and improve performance for large data transfers | Cost |
Solid-state disks (Section 5.3.1.4) | Provide very low disk access time | Cost |
High-performance host bus adapters (Section 5.3.1.5) | Increase bandwidth and throughput | Cost |
DMA host bus adapters (Section 5.3.1.6) | Relieve CPU of data transfer overhead | None |
Prestoserve (Section 5.3.1.7) | Improves synchronous write performance | Cost, not supported in a cluster or for nonfile system I/O operations |
Hardware RAID subsystem (Section 5.5) | Increases disk capacity and supports write-back cache | Cost of hardware RAID subsystem |
Write-back cache (Section 5.3.1.8) | Reduces the latency of many small writes | Cost of hardware RAID subsystem |
See the DIGITAL Systems & Options Catalog for information about disk, adapter, and controller performance features.
The following sections describe these guidelines in detail.
Disks that spin with a high rate of revolutions per minute (RPM) have a low disk access time (latency). High-RPM disks are especially beneficial to the performance of sequential data transfers.
High-performance 5400 RPM disks can improve performance for many transaction processing applications (TPAs). Extra high-performance 7200 RPM disks are ideal for applications that require both high bandwidth and high throughput.
Disks with small platter sizes provide better seek times than disks with large platter sizes, because the disk head has less distance to travel between tracks. There are three sizes for disk platters: 2.5, 3.5, and 5.25 inches in diameter.
A small platter size may improve disk I/O performance (seek time) for applications that perform many small I/O operations, but it provides no performance benefit for large sequential data transfers.
Disks with wide (16-bit) data paths provide twice the bandwidth of disks with narrow (8-bit) data paths. Wide disks can improve I/O performance for large data transfers.
Solid-state disks provide outstanding performance in comparison to regular disks but at a higher cost. Solid-state disks have a disk access time that is less than 100 microseconds, which is equivalent to memory access speed and more than 100 times faster than the disk access time for magnetic disks.
Solid-state disks are ideal for a wide range of response-time critical applications, such as online transaction processing (OLTP), and applications that require high bandwidth, such as video applications. Solid-state disks also provide data reliability through a data-retention system. For the best performance, use solid-state disks for your most frequently accessed data, place the disks on a dedicated bus, and use a high-performance host bus adapter.
Host bus adapters provide different performance features at various costs. For example, FWD adapters, such as the KZPSA adapter, provide high bandwidth and high throughput connections to disk devices.
SCSI adapters let you set the SCSI bus speed, which is the rate of data transfers. There are three possible bus speeds:
Slow (up to 5 million bytes per second or 5 MHz)
Fast (up to 10 million bytes per second or 10 MHz)
Fast bus speed uses the fast synchronous transfer option, enabling I/O devices to attain high peak-rate transfers in synchronous mode.
Ultra (up to 20 million bytes per second or 20 MHz)
Not all SCSI bus adapters support all speeds.
Some host bus adapters support direct memory access (DMA), which enables an adapter to bypass the CPU and go directly to memory to access and transfer data. For example, the KZPAA is a DMA adapter that provides a low-cost connection to SCSI disk devices.
Prestoserve utilizes a nonvolatile, battery-backed memory cache to improve synchronous write performance. Prestoserve temporarily caches file system writes that otherwise would have to be written to disk. This capability improves performance for systems that perform large numbers of synchronous writes.
To optimize Prestoserve cache use, you may want to enable Prestoserve only on the most frequently used file systems. You cannot use Prestoserve in a cluster or for nonfile system I/O.
Hardware RAID subsystems support (as a standard or an optional feature) write-back caches, which can improve I/O write performance while maintaining data integrity. A write-back cache must be battery-backed to protect against data loss and corruption.
A write-back cache decreases the latency of many small writes and can improve write-intensive application performance and Internet server performance. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
Because writes appear to be executed immediately, a write-back cache improves performance. If a failure occurs and the cache is battery-backed, upon recovery, the RAID controller will detect any unwritten data that still exists in the write-back cache and write the data to disk before enabling normal controller operations.
In addition to using hardware that will provide you with the best performance, you must distribute the disk I/O load across devices to obtain the maximum efficiency. Table 5-3 describes guidelines on how to distribute disk I/O and lists the performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Distribute swap space across different disks and buses (Section 5.3.2.1) | Improves paging and swapping performance and helps to prevent bottlenecks | Requires additional disks, cabling, and adapters |
Distribute disk I/O across different disks and buses (Section 5.3.2.2) | Allows parallel I/O operations and helps to prevent bottlenecks | Requires additional disks, cables, and adapters |
Place the most frequently used file systems on different disks (Section 5.3.2.3) | Helps to prevent disk bottlenecks | Requires additional disks |
Place data at the beginning of a ZBR disk (Section 5.3.2.4) | Improves bandwidth for sequential data transfers | None |
The following sections describe these guidelines in detail.
Distributing swap space across different disks and buses makes paging and swapping more efficient and helps to prevent any single adapter, disk, or bus from becoming a bottleneck. See the System Administration manual or swapon(8) for information about configuring swap space.
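For example, the following sketch adds swap space on two disks that reside on different SCSI buses; the device names are illustrative. See swapon(8) for the options your version supports and for how to make the devices permanent swap devices.
     # swapon /dev/rz1b /dev/rz17b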
You can also use LSM to stripe your swap disks, which distributes the disk I/O. See Section 5.4 for more information.
Distributing disk I/O across different disks and buses helps to prevent a single adapter, disk, or bus from becoming an I/O bottleneck and also allows simultaneous operations.
For example, if you have 16 GB of disk storage, you may get better performance from sixteen 1-GB disks than four 4-GB disks. More spindles (disks) may allow more simultaneous operations. For random I/O operations, 16 disks may be simultaneously seeking instead of 4 disks. For large sequential data transfers, 16 data streams can be simultaneously working instead of 4 data streams.
You can also use LSM to stripe your disks, which distributes the disk I/O load. See Section 5.4 for more information.
Place the most frequently used file systems on different disks. Distributing file systems will help to prevent a single disk from becoming a bottleneck.
Directories containing executable files or temporary files are often frequently accessed (for example, /var, /usr, and /tmp). If possible, place /usr and /tmp on different disks.
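For example, /etc/fstab entries similar to the following place /usr and /tmp on different disks; the device names are illustrative.
     /dev/rz1g   /usr   ufs   rw  1  2
     /dev/rz2c   /tmp   ufs   rw  1  2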
Data is most quickly transferred when it is located at the beginning of zone-based recording (ZBR) disks. Placing data at the beginning of these disks improves the bandwidth for sequential data transfers.
You may be able to improve I/O performance by modifying some kernel attributes that affect overall file system performance. The guidelines apply to all file system configurations, including UFS and AdvFS.
General file system tuning often involves tuning the Virtual File System (VFS). VFS provides a uniform interface that allows common access to files, regardless of the file system on which the files reside.
The file system tuning guidelines fall into these categories:
Changing how the system allocates and deallocates vnodes
The kernel data structure for an open file is called a vnode. These are used by all file systems. The allocation and deallocation of vnodes is handled dynamically by the system.
Increasing the size of the namei cache to make lookup operations faster
The namei cache is used by all file systems to map file pathnames to inodes.
Increasing the size of the hash chain table for the namei cache to make lookup operations faster
Hash tables are used for lookup operations.
Allocating more memory to the Unified Buffer Cache (UBC)
The UBC shares physical memory with the virtual memory subsystem and is used to cache the most recently accessed file system data.
Using Prestoserve to cache only UFS or AdvFS file system metadata
There are also specific guidelines for AdvFS and UFS file systems. See Section 5.6 and Section 5.7 for information.
Table 5-4 describes the guidelines for general file system tuning and lists the performance benefits as well as the tradeoffs.
Action | Performance Benefit | Tradeoff |
Increase the maximum number of open files (Section 5.3.3.1) | Allocates more resources to applications | Consumes memory |
Increase the size of the namei cache (Section 5.3.3.2) | Improves cache lookup operations | Consumes memory |
Increase the size of the hash chain table for the namei cache (Section 5.3.3.3) | Improves cache lookup operations | Consumes memory |
Allocate more memory to the UBC (Section 5.3.3.4) | Improves disk I/O performance | May cause excessive paging and swapping |
Use Prestoserve to cache only file system metadata (Section 5.3.3.5) | Improves performance for applications that access large amounts of file system metadata | Cost, not supported in a cluster or for nonfile system I/O operations |
Cache more vnodes on the free list (Section 5.3.3.6) | Improves cache lookup operations | Consumes memory |
Increase the amount of time for which vnodes are kept on the free list (Section 5.3.3.7) | Improves cache lookup operations | None |
Delay vnode deallocation (Section 5.3.3.8) | Improves namei cache lookup operations | Consumes memory |
Accelerate vnode deallocation (Section 5.3.3.9) | Reduces memory demands | Reduces the efficiency of the namei cache |
Disable vnode deallocation (Section 5.3.3.10) | Optimizes processing time | Consumes memory |
Increase the open file descriptor limit (Section 5.3.3.11) | Provides more file descriptors to a process | Increases the possibility of runaway allocations |
Decrease the open file descriptor limit (Section 5.3.3.11) | Prevents a process from consuming all the file descriptors | May adversely affect the performance of processes that require many file descriptors |
Disable clearing of the DMA scatter/gather map registers (Section 5.3.3.12) | Improves performance of VLM/VLDB systems | None |
The following sections describe these guidelines in detail.
Increasing the value of the max-vnodes or maxusers attribute increases the maximum number of vnodes, which increases the number of open files. If your applications require many open files, you may want to raise the values of these attributes. Raising the attribute values will increase the demand on your memory resources, and should only be done if you get a message stating that you are out of vnodes. If the number of users on the system exceeds the value of maxusers and you increase the value of maxusers, increase the value of max-vnodes proportionally.
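For example, the following /etc/sysconfigtab stanzas raise both attributes. The values are illustrative, and the stanzas assume that maxusers resides in the proc subsystem and max-vnodes in the vfs subsystem; use the sysconfig -q command to confirm the current values and locations on your system.
     proc:
         maxusers = 512
     vfs:
         max-vnodes = 16384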
The namei cache is used by all file systems to map file pathnames to inodes. Use dbx to monitor the cache by examining the nchstats structure. The miss rate (misses / (good + negative + misses)) should be less than 20 percent.
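For example, the following sketch examines the nchstats structure on a running system, assuming the kernel image is /vmunix:
     # dbx -k /vmunix /dev/mem
     (dbx) print nchstats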
To make lookup operations faster, increase the size of the namei cache by increasing the value of the maxusers attribute (the recommended way) or by increasing the value of the name-cache-size attribute. Increasing the value of maxusers or name-cache-size allocates more system resources for use by the kernel. However, it also increases the amount of physical memory consumed by the kernel. Note that many benchmarks may perform better with a large namei cache.
Increasing the size of the hash chain table for the namei cache spreads the namei cache elements and may reduce linear searches, which improves lookup speeds. The name-cache-hash-size attribute specifies the size of the hash chain table for the namei cache. The default size is 256 slots.
You can change the value of the name-cache-hash-size attribute so that each hash chain has three or four name cache entries. To determine an appropriate value for the name-cache-hash-size attribute, divide the value of the name-cache-size attribute by 3 or 4 and then round the result to a power of 2. For example, if the value of name-cache-size is 1029, dividing 1029 by 4 produces a value of 257. Based on this calculation, you could specify 256 (2 to the power of 8) for the value of the name-cache-hash-size attribute.
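For example, assuming that the attribute resides in the vfs subsystem, an /etc/sysconfigtab stanza similar to the following sets the value calculated above:
     vfs:
         name-cache-hash-size = 256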
The Unified Buffer Cache (UBC) uses a portion of physical memory to cache actual file system data for reads and writes, AdvFS metadata, and Memory File System (MFS) data. The UBC prevents the system from having to copy data from a disk, which improves performance. If there is an insufficient amount of memory allocated to the UBC, disk I/O performance may be degraded.
Increasing the size of the UBC improves the chance that data will be found in the cache. However, because the UBC and the virtual memory subsystem share the same physical memory pages, increasing the size of the UBC may cause excessive paging and swapping.
See Section 4.8 for information about tuning the UBC.
Prestoserve can improve the overall run-time performance for systems that perform large numbers of synchronous writes. The prmetaonly attribute controls whether Prestoserve caches only UFS and AdvFS file system metadata, instead of both metadata and synchronous write data (the default). If the attribute is set to 1 (enabled), Prestoserve caches only file system metadata.
Caching only metadata may improve the performance of applications that access many small files or applications that access a large amount of file-system metadata but do not reread recently written data.
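For example, an /etc/sysconfigtab stanza similar to the following enables metadata-only caching. The presto subsystem name is an assumption; verify the correct subsystem name with the sysconfig -s command before applying the change.
     presto:
         prmetaonly = 1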
You can raise the value of the min-free-vnodes attribute, which determines the minimum number of vnodes on the free list. Increasing the value causes the system to cache more free vnodes and improves the performance of cache lookup operations. However, increasing the value will increase the demand on your memory resources.
On 24-MB systems, the default value of the min-free-vnodes attribute is 150. On 32-MB or larger systems, the default value depends on the value of the maxusers attribute. For these systems, if the value of min-free-vnodes is close to the value of the max-vnodes attribute, vnode deallocation will not be effective. If the value of min-free-vnodes is larger than the value of max-vnodes, vnode deallocations will not occur.
If the value of min-free-vnodes must be close to the value of max-vnodes, you may want to disable vnode deallocation (see Section 5.3.3.10). However, disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. On systems that need to reclaim the memory used by vnodes, make sure that the value of min-free-vnodes is significantly lower than the value of max-vnodes.
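For example, assuming that these attributes reside in the vfs subsystem, you can compare the current values before changing them:
     # sysconfig -q vfs min-free-vnodes max-vnodes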
You can increase the value of the vnode-age attribute to increase the amount of time for which vnodes are kept on the free list. This increases the possibility that a vnode will be successfully looked up. The default value for vnode-age is 120 seconds on 32-MB or larger systems and 2 seconds on 24-MB systems.
Increase the value of the namei-cache-valid-time attribute to delay the deallocation of vnodes. This can improve namei cache lookup operations, but it consumes memory resources.
Decrease the value of the namei-cache-valid-time attribute to accelerate the deallocation of vnodes. This causes vnodes to be deallocated from the namei cache at a faster rate, but reduces the efficiency of the cache.
To optimize processing time, disable vnode deallocation by setting the value of the vnode-deallocation-enable attribute to 0. Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. You may want to disable vnode deallocation if the value of min-free-vnodes is close to the value of max-vnodes.
The open-max-soft and open-max-hard attributes control the maximum number of open file descriptors for each process. When the open-max-soft limit is reached, a warning message is issued, and when the open-max-hard limit is reached, the process is stopped. These attributes prevent runaway allocations (for example, allocations within a loop that cannot be exited because of an error condition) from consuming all the available file descriptors.
The open-max-soft and open-max-hard attributes both have default values of 4096 file descriptors (open files) per process. The maximum number of open files per process is 65,536. If your applications require many open files, you may want to increase the maximum open file descriptor limit. Increasing the limit provides more file descriptors to a process, but it increases the possibility of runaway allocations. In addition, if you increase the number of open files per process, make sure that the max-vnodes attribute is set to an adequate value.
See the Release Notes for information about increasing the open file descriptor limit.
Decreasing the open file descriptor limit decreases the number of file descriptors available to each process and prevents a process from consuming all the file descriptors. However, decreasing the limit may adversely affect the performance of processes that require many file descriptors.
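For example, the following /etc/sysconfigtab stanza raises both limits to 8192 descriptors per process. The value is illustrative, and the stanza assumes that the attributes reside in the proc subsystem.
     proc:
         open-max-soft = 8192
         open-max-hard = 8192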
If you have an AlphaServer 8200 or 8400, the dma-sg-map-unload-zero attribute controls whether the direct memory access (DMA) scatter/gather map registers clear after an I/O operation completes. If your system utilizes large amounts of memory or storage, you may be able to gain some I/O performance benefit by setting the attribute to zero.
The Logical Storage Manager (LSM) can improve system performance and provide high data availability. LSM also provides you with online storage management features and enhanced performance information and statistics, with little additional overhead. Although any type of system can benefit from LSM, it is especially suited for large systems with large numbers of disks.
LSM volumes are used in the same way as disk partitions. You can create UFS file systems and AdvFS file domains and filesets on an LSM volume, or you can use a volume as a raw device.
To set up a high-performance LSM configuration, you must be careful how you configure the following:
Disks, disk groups, and databases (see Section 5.4.1)
Mirrored disks (see Section 5.4.2)
Striped disks (see Section 5.4.3)
The Logical Storage Manager manual provides detailed information about using LSM. The following sections describe configuration and tuning guidelines for LSM.
The following sections provide general guidelines to configure LSM disks, disk groups, and databases. How you configure your LSM disks and disk groups determines the flexibility of your LSM configuration.
In addition, each LSM disk group maintains a configuration database, which includes detailed information about mirrored and striped disks and volume, plex, and subdisk records.
Table 5-5 lists LSM disk, disk group, and database configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Initialize your LSM disks as sliced disks (Section 5.4.1.1) | Provides greater storage configuration flexibility | None |
Increase the maximum number of LSM volumes (Section 5.4.1.2) | Improves performance on VLM/VLDB systems | None |
Make the rootdg disk group a sufficient size (Section 5.4.1.3) | Ensures sufficient space for disk group information | None
Use a sufficient private region size for each disk (Section 5.4.1.4) | Ensures sufficient space for database copies | Large private regions require more disk space |
Make the private regions in a disk group the same size (Section 5.4.1.5) | Efficiently utilizes the configuration space | None |
Group disks into different disk groups (Section 5.4.1.6) | Allows you to move disk groups between systems | Reduces flexibility when configuring volumes |
Use an appropriate size and number of database and log copies (Section 5.4.1.7) | Ensures database availability and improves performance | None |
Place disks containing database and log copies on different buses (Section 5.4.1.8) | Improves availability | Cost of additional hardware |
The following sections describe these guidelines in detail.
Initialize your LSM disks as sliced disks, instead of as simple disks. A sliced disk provides greater storage configuration flexibility because the entire disk is under LSM control. The disk label for a sliced disk contains information that identifies the partitions containing the private and the public regions. In contrast, simple disks have both public and private regions in the same partition.
For large systems, increase the value of the max-vol attribute, which specifies the maximum number of volumes per system. The default is 1024; you can increase it to 4096.
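For example, an /etc/sysconfigtab stanza similar to the following raises the limit. The lsm subsystem name is an assumption and the value is illustrative; verify the correct subsystem name with the sysconfig -s command before applying the change.
     lsm:
         max-vol = 4096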
You must make sure that the rootdg disk group has an adequate size, because the disk group's configuration database contains records for disks outside of the rootdg disk group, in addition to the ordinary disk-group configuration information. For example, the rootdg configuration database includes disk-access records that define all disks under LSM control. The rootdg disk group must be large enough to contain records for the disks in all the disk groups. See Table 5-6 for more information.
You must make sure that the private region for each disk has an adequate size. LSM keeps disk media label and configuration database copies in each disk's private region.
A private region must be large enough to accommodate the size of the LSM database copies. In addition, the maximum number of LSM objects (disks, subdisks, volumes, and plexes) in a disk group depends on an adequate private region size. However, a large private region requires more disk space. The default private region size is 1024 blocks, which is usually adequate for configurations using up to 128 disks per disk group.
The private region of each disk in a disk group should be the same size, in order to efficiently utilize the configuration space. One or two LSM configuration database copies can be stored in a disk's private region.
When you add a new disk to an existing LSM disk group, the size of the private region on the new disk is determined by the private region size of the other disks in the disk group. As you add more disks to a disk group, the voldiskadd utility reduces the number of configuration copies and log copies that are initialized for the new disks. See voldiskadd(8) for more information.
You may want to group disks in disk groups according to their function. This enables disk groups to be moved between systems, and decreases the size of the LSM configuration database for each disk group. However, using multiple disk groups reduces flexibility when configuring volumes.
Each disk group maintains a configuration database, which includes detailed information about mirrored and striped disks and volume, plex, and subdisk records. The LSM subsystem's overhead primarily involves managing the kernel change logs and copies of the configuration databases.
LSM performance is affected by the size and the number of copies of the configuration database and the kernel change log. They determine the amount of time it takes for LSM to start up, for changes to the configuration to occur, and for the LSM disks to fail over in a cluster.
Usually, each disk in a disk group contains one or two copies of both the kernel change log and the configuration database. Disk groups consisting of more than eight disks should not have copies on all disks. Always use four to eight copies.
The number of kernel change log copies must be the same as the number of configuration database copies. For the best performance, the number of copies must be the same on each disk that contains copies.
Table 5-6 describes the guidelines for configuration database and kernel change log copies.
Disks Per Disk Group | Size of Private Region (in Blocks) | Configuration and Kernel Change Log Copies Per Disk |
1 to 3 | 512 | Two copies in each private region |
4 to 8 | 512 | One copy in each private region |
9 to 32 | 512 | One copy on four to eight disks, zero copies on remaining disks |
33 to 128 | 1024 | One copy on four to eight disks, zero copies on remaining disks |
129 to 256 | 1536 | One copy on four to eight disks, zero copies on remaining disks |
257 or more | 2048 | One copy on four to eight disks, zero copies on remaining disks |
For disk groups with large numbers of disks, place the disks that contain configuration database and kernel change log copies on different buses. This provides you with better performance and higher availability.
Use LSM mirrored volumes for high data availability. If a physical disk fails, the mirrored plex (copy) containing the failed disk becomes temporarily unavailable, but the remaining plexes are still available. A mirrored volume has at least two plexes for data redundancy.
Mirroring can also improve read performance. However, a write to a volume results in parallel writes to each plex, so write performance may be degraded. Environments whose disk I/O operations are predominantly reads obtain the best performance results from mirroring. See Table 5-7 for guidelines.
In addition, use block-change logging (BCL) to reduce the synchronization time and thereby improve the mirrored volume recovery rate after a system failure. If BCL is enabled and a write is made to a mirrored plex, BCL identifies the block numbers that have changed and then stores the numbers on a logging subdisk. BCL is not used for reads.
BCL is enabled if two or more plexes in a mirrored volume have a logging subdisk associated with them. Only one logging subdisk can be associated with a plex. BCL can add some overhead to your system and degrade the mirrored volume's write performance. However, the impact is less for systems under a heavy I/O load, because multiple writes to the log are batched into a single write. See Table 5-8 for guidelines.
Note that BCL will be replaced by dirty region logging (DRL) in a future release.
Table 5-7 lists LSM mirrored volume configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Map mirrored plexes across different buses (Section 5.4.2.1) | Improves performance and increases availability | None |
Use the appropriate read policy (Section 5.4.2.2) | Efficiently distributes reads | None |
Attach up to eight plexes to the same volume (Section 5.4.2.3) | Improves performance for read-intensive workloads and increases availability | Uses disk space inefficiently |
Use a symmetrical configuration (Section 5.4.2.4) | Provides more predictable performance | None |
Use block-change logging (Table 5-8) | Improves mirrored volume recovery rate | May decrease write performance |
Stripe the mirrored volumes (Table 5-9) | Improves disk I/O performance and balances I/O load | Increases management complexity |
Table 5-8 lists LSM block-change logging (BCL) configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Configure multiple logging subdisks (Section 5.4.2.5) | Improves recovery time | Requires additional disks |
Use a write-back cache for logging subdisks (Section 5.4.2.6) | Minimizes BCL's write degradation | Cost of hardware RAID subsystem
Use the appropriate BCL subdisk size (Section 5.4.2.7) | Enables migration to dirty region logging | None |
Place logging subdisks on infrequently used disks (Section 5.4.2.8) | Helps to prevent disk bottlenecks | None |
Use solid-state disks for logging subdisks (Section 5.4.2.9) | Minimizes BCL's write degradation | Cost of disks |
The following sections describe these guidelines in detail.
Putting each mirrored plex on a different bus improves performance and availability by helping to prevent bus bottlenecks, and allowing simultaneous I/O operations. Mirroring across different buses also increases availability by protecting against bus and adapter failure.
To provide optimal performance for different types of mirrored volumes, LSM supports the following read policies:
Round-robin read
Satisfies read operations to the volume in a round-robin manner from all plexes in the volume.
Preferred read
Satisfies read operations from one specific plex (usually the plex with the highest performance).
Select
Selects a default read policy, based on the plex associations to the volume. If the mirrored volume contains a single, enabled, striped plex, the default is to prefer that plex. For any other set of plex associations, the default is to use a round-robin policy.
If one plex exhibits superior performance, either because the plex is striped across multiple disks or because it is located on a much faster device, then set the read policy to preferred read for that plex. By default, a mirrored volume with one striped plex should have the striped plex configured as the preferred read plex. Otherwise, you should almost always use the round-robin read policy.
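For example, the following sketch sets the preferred read policy on a mirrored volume. It assumes that the volume utility accepts the rdpol keyword and that vol01 contains a striped plex named vol01-01; both names are illustrative.
     # volume rdpol prefer vol01 vol01-01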
To improve performance for read-intensive workloads, up to eight plexes can be attached to the same mirrored volume. However, this configuration does not use disk space efficiently.
A symmetrical mirrored disk configuration provides predictable performance and easy management. Use the same number of disks in each mirrored plex. For mirrored striped volumes, you can stripe across half of the available disks to form one plex and across the other half to form the other plex.
Using multiple block-change logging (BCL) subdisks will improve recovery time after a failure.
To minimize BCL's impact on write performance, use LSM in conjunction with a RAID subsystem that has a write-back cache. Typically, the BCL performance degradation is more significant on systems with few writes than on systems with heavy write loads.
To support migration from BCL to dirty region logging (DRL), which will be supported in a future release, use the appropriate BCL subdisk size.
If you have less than 64 GB of disk space under LSM control, calculate the subdisk size by allowing 1 block for each gigabyte of storage. If the result is an odd number, add 1 block; if the result is an even number, add 2 blocks. For example, if you have 1 GB (or less) of space, use a 2-block subdisk. If you have 2 GB (or 3 GB) of space, use a 4-block subdisk.
If you have more than 64 GB of disk space under LSM control, use a 64-block subdisk.
Place a logging subdisk on an infrequently used disk. Because this subdisk is frequently written, do not put it on a busy disk. Do not configure BCL subdisks on the same disks as the volume data, because this will cause head seeking or thrashing.
If persistent (nonvolatile) solid-state disks are available, use them for logging subdisks.
Striping volumes can increase performance because parallel I/O streams can operate concurrently on separate devices. Striping can improve the performance of applications that perform large sequential data transfers or multiple, simultaneous I/O operations.
Striping distributes data across the disks in a volume in stripes with a fixed size. The stripes are interleaved across the striped plex's subdisks, which are located on different disks, to evenly distribute disk I/O.
The performance benefit of striping depends on the stripe width, which is the number of blocks in a stripe, and how your users and applications perform I/O. Bandwidth increases with the number of disks across which a plex is striped. See Table 5-9 for guidelines.
Table 5-9 lists LSM striped volume configuration guidelines and performance benefits as well as tradeoffs.
You may want to combine mirroring with striping to obtain both high availability and high performance. See Table 5-7 and Table 5-9 for guidelines.
Action | Benefit | Tradeoff |
Use multiple disks in a striped volume (Section 5.4.3.1) | Improves performance | Decreases volume reliability |
Distribute subdisks across different disks and buses (Section 5.4.3.2) | Improves performance and increases availability | None |
Use the appropriate stripe width (Section 5.4.3.3) | Improves performance | None |
Avoid splitting small data transfers (Section 5.4.3.3) | Improves the performance of volumes that quickly receive multiple data transfers | May use disk space inefficiently |
Split large individual data transfers (Section 5.4.3.3) | Improves the performance of volumes that receive large data transfers | Decreases throughput |
The following sections discuss these guidelines in detail.
Increasing the number of disks in a striped volume can increase the bandwidth, depending on the applications and file systems you are using and on the number of simultaneous users. However, this reduces the effective mean-time-between-failures (MTBF) of the volume. If this reduction is a problem, use both striping and mirroring.
Distribute the subdisks of a striped volume across different buses. This improves performance and helps to prevent a single bus from becoming a bottleneck.
The performance benefit of striping depends on the size of the stripe width and the characteristics of the I/O load. Stripes of data are allocated alternately and evenly to the subdisks of a striped plex. A striped plex consists of a number of equal-sized subdisks located on different disks.
The number of blocks in a stripe determines the stripe width. LSM uses a default stripe width of 64 KB (or 128 sectors), which works well in most environments.
Use the volstat command to determine the number of data transfer splits. For volumes that receive only small I/O transfers, you may not want to use striping because disk access time is important.
Striping is beneficial for large data transfers.
To improve performance of large sequential data transfers, use a stripe width that will divide each individual data transfer and distribute the blocks equally across the disks.
To improve the performance of multiple simultaneous small data transfers, make the stripe width the same size as the data transfer. However, an excessively small stripe width can result in poor system performance.
If you are striping mirrored volumes, ensure that the stripe width is the same for each plex.
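As an illustration only, a striped volume might be created with a volassist command similar to the following. The volume name, size, and the layout and nstripe attribute names are assumptions used here for the sketch; verify the exact syntax and attribute names in volassist(8) before using them.
    # volassist make datavol 2g layout=stripe nstripe=4
If you create mirrored striped volumes this way, create each plex with the same stripe width, as noted above.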
After you set up the LSM configuration, you may be able to improve performance. For example, you can perform the following tasks:
Balance the I/O load
LSM allows you to achieve a fine level of granularity in data placement, because LSM provides a way for volumes to be distributed across multiple disks. After measuring actual data-access patterns, you can adjust the placement of file systems.
You can reassign data to specific disks to balance the I/O load among the available storage devices. You can reconfigure volumes on line after performance patterns have been established without adversely impacting volume availability.
Use striping to increase bandwidth for frequently accessed data
LSM provides a significant improvement in performance when there are multiple I/O streams. After you identify the most frequently accessed file systems and databases, you can realize significant performance benefits by striping the high traffic data across portions of multiple disks, which increases bandwidth to this data.
Set the preferred read policy to the fastest mirrored plex
If one plex of a mirrored volume exhibits superior performance, either because the disk is being striped or concatenated across multiple disks, or because it is located on a much faster device, then set the read policy to the preferred read policy for that plex. By default, a mirrored volume with one striped plex should be configured with the striped plex as the preferred read.
Increase the value of the volinfo.max_io parameter. This can improve the performance of systems that use large amounts of memory or storage.
Hardware RAID subsystems increase your storage capacity and provide different degrees of performance and availability at various costs. For example, some hardware RAID subsystems support dual-redundant RAID controllers and a nonvolatile write-back cache, which greatly improve performance and availability. Entry-level hardware RAID subsystems provide cost-efficient RAID functionality.
Table 5-10 lists hardware RAID subsystem configuration guidelines and performance benefits as well as tradeoffs.
Hardware | Performance Benefit | Tradeoff |
Evenly distribute disks in a storage set across different buses (Section 5.5.1) | Improves performance and helps to prevent bottlenecks | None |
Ensure that the first member of each mirrored set is on a different bus | Improves performance | None |
Use disks with the same data capacity in each storage set (Section 5.5.2) | Improves performance | None |
Use the appropriate chunk size (Section 5.5.3) | Improves performance | None |
Stripe mirrored sets (Section 5.5.4) | Increases availability and read performance | May degrade write performance |
Use a write-back cache (Section 5.5.5) | Improves write performance | Cost of hardware |
Use dual-redundant RAID controllers (Section 5.5.6) | Improves performance, increases availability, and prevents I/O bus bottlenecks | Cost of hardware |
Install spare disks (Section 5.5.7) | Improves availability | Cost of disks |
Replace failed disks promptly (Section 5.5.7) | Improves performance | None |
The following sections describe some of these guidelines. See your RAID subsystem documentation for detailed configuration information.
You can improve performance and help to prevent bottlenecks by distributing storage set disks evenly across different buses.
Make sure that the first member of each mirrored set is on a different bus.
Use disks with the same capacity in the same storage set.
The performance benefit of stripe sets depends on how your users and applications perform I/O and the chunk (stripe) size. For example, if you choose a stripe size of 8 KB, small data transfers will be distributed evenly across the member disks. However, a 64-KB data transfer will be divided into at least eight data transfers.
You may want to use a stripe size that will prevent any particular range of blocks from becoming a bottleneck. For example, if an application often uses a particular 8-KB block, you may want to use a stripe size that is slightly larger or smaller than 8 KB or is a multiple of 8 KB, in order to force the data onto a different disk.
If the stripe size is large compared to the average I/O size, each disk in a stripe set can respond to a separate data transfer. I/O operations can then be handled in parallel, which increases sequential write performance and throughput. This can improve performance for environments that perform large numbers of I/O operations, including transaction processing, office automation, and file services environments, and for environments that perform multiple random read and write operations.
If the stripe size is smaller than the average I/O operation, multiple disks can simultaneously handle a single I/O operation, which can increase bandwidth and improve sequential file processing. This is beneficial for image processing and data collection environments. However, do not make the stripe size so small that it will degrade performance for large sequential data transfers.
If your applications are doing I/O to a raw device and not a file system, use a stripe size that distributes a single data transfer evenly across the member disks. For example, if the typical I/O size is 1 MB and you have a four-disk array, you could use a 256-KB stripe size. This would distribute the data evenly among the four member disks, with each doing a single 256-KB data transfer in parallel.
For small file system I/O operations, use a stripe size that is a multiple of the typical I/O size (for example, four to five times the I/O size). This will help to ensure that the I/O is not split across disks.
You can stripe mirrored sets to improve performance.
RAID subsystems support, either as a standard or an optional feature, a nonvolatile write-back cache that can improve disk I/O performance while maintaining data integrity. A write-back cache improves performance for systems that perform large numbers of writes. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
A write-back cache improves performance because writes appear to be executed immediately. If a failure occurs, upon recovery, the RAID controller detects any unwritten data that still exists in the write-back cache and writes the data to disk before enabling normal controller operations.
A write-back cache must be battery-backed to protect against data loss and corruption.
If you are using an HSZ40 or HSZ50 RAID controller with a write-back cache, the following guidelines may improve performance:
Set CACHE_POLICY to B.
Set CACHE_FLUSH_TIMER to a minimum of 45 seconds.
Enable the write-back cache (WRITEBACK_CACHE) for each unit, and set the value of MAXIMUM_CACHED_TRANSFER_SIZE to a minimum of 256.
See the HSZ documentation for more information.
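For example, at the controller's command-line interface you might enter commands similar to the following. The unit name D101 is illustrative, and the placement of each parameter on the controller or on the unit may vary by HSOF version, so verify the exact syntax in your HSZ documentation.
    SET THIS_CONTROLLER CACHE_POLICY=B
    SET THIS_CONTROLLER CACHE_FLUSH_TIMER=45
    SET D101 WRITEBACK_CACHE
    SET D101 MAXIMUM_CACHED_TRANSFER_SIZE=256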
If supported, use a dual-redundant controller configuration and balance the number of disks across the two controllers. This can improve performance, increase availability, and prevent I/O bus bottlenecks.
Install predesignated spare disks on separate controller ports and storage shelves. This will help you to maintain data availability if a disk failure occurs.
The Advanced File System (AdvFS) allows you to put multiple volumes (disks, LSM volumes, or RAID storage sets) in a file domain and distribute the filesets and files across the volumes. A file's blocks usually reside together on the same volume, unless the file is striped or the volume is full. Each new file is placed on the successive volume by using round robin scheduling. See the AdvFS Guide to File System Administration for more information on using AdvFS.
The following sections describe how to configure and tune AdvFS for high performance.
You will obtain the best performance if you carefully plan your AdvFS configuration. Table 5-11 lists AdvFS configuration guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-2 and Table 5-3 apply to AdvFS configurations.
Action | Performance Benefit | Tradeoff |
Use multiple-volume file domains (Section 5.6.1.1) | Improves throughput and simplifies management | Increases chance of domain failure and may cause log bottleneck |
Use several file domains instead of one large domain (Section 5.6.1.1) | Prevents log from becoming a bottleneck | Increases maintenance complexity |
Place transaction log on fast or uncongested volume (Section 5.6.1.2) | Prevents log from becoming a bottleneck | None |
Preallocate space for the BMT (Section 5.6.1.3) | Prevents prematurely running out of domain space | Reduces available disk space |
Increase the number of pages by which the BMT extent size grows (Section 5.6.1.3) | Prevents prematurely running out of domain space | Reduces available disk space |
Stripe files (Section 5.6.1.4) | Improves sequential read and write performance | Increases chance of domain failure |
Use quotas (Section 5.6.1.5) | Controls file system space utilization | None |
The following sections describe these AdvFS configuration guidelines in more detail.
Using multiple-volume file domains allows greater control over your physical resources, and may improve a fileset's total throughput. However, be sure that the log does not become a bottleneck. Multiple-volume file domains improve performance because AdvFS generates parallel streams of output using multiple device consolidation queues.
In addition, using only a few file domains instead of many file domains reduces the overall management effort, because fewer file domains require less administration. However, a single volume failure within a file domain renders the entire file domain inaccessible. Therefore, the more volumes you have in a file domain, the greater the risk that the domain will fail.
DIGITAL recommends that you use a maximum of 12 volumes in each file domain. However, to reduce the risk of file domain failure, limit the number of volumes per file domain to three or use mirrored volumes created with LSM.
For multiple-volume domains, make sure that busy files are not located on the same volume. Use the migrate command to move files across volumes.
Each file domain has a transaction log that keeps track of fileset activity for all filesets in the file domain. The AdvFS file domain transaction log may become a bottleneck. This can occur if the log resides on a congested disk or bus, or if the file domain contains many filesets.
To prevent the log from becoming a bottleneck, put the log on a fast, uncongested volume. You may want to put the log on a disk that contains only the log. See Section 5.6.2.9 for information on moving an existing transaction log.
To make the transaction log highly available, use LSM to mirror the log.
The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.
The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time AdvFS needs additional metadata, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, the space added by a single BMT expansion may be described by several extents.
If a file domain has a large number of small files, you may prematurely run out of disk space for the BMT. Handling many small files makes the system request metadata extents more frequently, which causes the metadata to become fragmented. Because the number of BMT extents is limited, the file domain will appear to be out of disk space if the BMT cannot be extended to map new files.
To monitor the BMT, use the vbmtpg command and examine the number of free mcells (freeMcellCnt). The value of freeMcellCnt can range from 0 to 22. A volume with 1 free mcell has very little space in which to grow the BMT. See vbmtpg(8) for more information.
You can also invoke the showfile command and specify mount_point/.tags/M-6 to examine the BMT extents on the first domain volume that contains the fileset mounted on the specified mount point. To examine the extents of the other volumes in the domain, specify M-12, M-18, and so on.
If the extents at the end of the BMT are smaller than the extents at the beginning of the file, the BMT is becoming fragmented. See showfile(8) for more information.
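For example, assuming a fileset mounted on /accounts (an illustrative mount point), a command similar to the following displays the BMT for the first volume in the domain. The -x option, which shows extent information, is an assumption to verify in showfile(8).
    # showfile -x /accounts/.tags/M-6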
If you are prematurely out of BMT disk space, you may be able to eliminate the problem by defragmenting the file domain that contains the volume. See defragment(8) for more information.
Table 5-12 provides some BMT sizing guidelines for the number of pages to preallocate for the BMT, and the number of pages by which the BMT extent size grows. The BMT sizing depends on the maximum number of files you expect to create on a volume.
Estimated Maximum Number of Files on a Volume | Number of Pages to Preallocate | Number of Pages to Grow Extent |
< 50,000 | 3600 | 128 |
100,000 | 7200 | 256 |
200,000 | 14,400 | 512 |
300,000 | 21,600 | 768 |
400,000 | 28,800 | 1024 |
800,000 | 57,600 | 2048 |
You can preallocate space for the BMT when the file domain is created, and when a volume is added to the domain, by using the mkfdmn and addvol commands with the -p flag.
You can also modify the number of extent pages by which the BMT grows when a file domain is created and when a volume is added to the domain by using the mkfdmn and addvol commands with the -x flag.
If you use the mkfdmn -x or the addvol -x command when there is a large amount of free space on a disk, the BMT will expand by the specified number of pages as files are created, and those pages will be in one extent. As the disk becomes more fragmented, the BMT will still expand, but the pages will not be contiguous and will require more extents. Eventually, the BMT will run out of its limited number of extents even though the growth size is large.
Using the mkfdmn -p or the addvol -p command to preallocate a large BMT before the disk is fragmented may prevent this problem, because the entire preallocated BMT is described in one extent. All subsequent growth will be able to use nearly all of the limited number of BMT extents. Do not overallocate BMT space, because the disk space cannot be used for other purposes. However, too little BMT space will eventually cause the BMT to grow by a fixed amount; at that time, the disk may be fragmented and the growth will require multiple extents. See mkfdmn(8) and addvol(8) for more information.
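For example, to size a domain for roughly 100,000 files per volume (see Table 5-12), you might preallocate 7200 BMT pages and set the growth increment to 256 pages. The device special files and the domain name below are illustrative only.
    # mkfdmn -p 7200 -x 256 /dev/rz3c accounts_domain
    # addvol -p 7200 -x 256 /dev/rz4c accounts_domain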
The AdvFS stripe utility lets you improve the read and write performance of an individual file. This is useful if an application continually accesses a few specific files. See stripe(8) for more information.
The utility directs a zero-length file (a file with no data written to it yet) to be distributed evenly across several volumes in a file domain. As data is appended to the file, the data is spread across the volumes. AdvFS determines the number of pages per stripe segment and alternates the segments among the disks in a sequential pattern. Bandwidth can be improved by distributing file data across multiple volumes.
Do not stripe both a file and the disk on which it resides.
To determine if you should stripe files, use the iostat utility. Cross-check the blocks per second and the I/O operations per second against the disk's bandwidth capacity. If the disk access time is slow in comparison to the stated capacity, file striping may improve performance.
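For example, to distribute a newly created (zero-length) file across three volumes in its domain, you might enter a command similar to the following. The file name is illustrative, and the exact option letter should be confirmed in stripe(8).
    # stripe -n 3 /accounts/db/history.dat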
AdvFS quotas allow you to track and control the amount of physical storage that a user, group, or fileset consumes. AdvFS eliminates the slow reboot activities associated with UFS quotas. In addition, AdvFS quota information is always maintained, but quota enforcement can be activated and deactivated.
For information about UFS quotas, see Section 5.7.1.6.
After you configure AdvFS, you may be able to tune it to improve performance. Table 5-13 lists AdvFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-4 apply to AdvFS configurations.
Action | Performance Benefit | Tradeoff |
Increase the percentage of memory allocated for the AdvFS buffer cache (Section 5.6.2.1) | Improves AdvFS performance if data reuse is high | Consumes memory |
Defragment file domains (Section 5.6.2.2) | Improves read and write performance | None |
Increase the dirty data caching threshold (Section 5.6.2.3) | Improves random write performance | May cause I/O spikes or increase the number of lost buffers if a crash occurs |
Decrease the I/O transfer read-ahead size (Section 5.6.2.4) | Improves performance for mmap page faulting | None |
Disable the flushing of dirty pages mapped with the mmap function during a sync call (Section 5.6.2.5) | May improve performance for applications that manage their own flushing | None |
Modify the AdvFS device queue limit (Section 5.6.2.6) | Influences the time to complete synchronous (blocking) I/O | May cause I/O spikes |
Consolidate I/O transfers (Section 5.6.2.7) | Improves AdvFS performance | None |
Force all AdvFS file writes to be synchronous (Section 5.6.2.8) | Ensures that data is successfully written to disk | May degrade file system performance |
Move the transaction log to a fast or uncongested volume (Section 5.6.2.9) | Prevents log from becoming a bottleneck | None |
Balance files across volumes in a file domain (Section 5.6.2.10) | Improves performance and evens the future distribution of files | None |
Migrate frequently used or large files to different file domains (Section 5.6.2.11) | Improves I/O performance | None |
Decrease the size of the metadata buffer cache to 1 percent (Section 4.7.21) | Improves performance for systems that use only AdvFS | None |
The following sections describe how to tune AdvFS in detail.
The AdvfsCacheMaxPercent attribute specifies the amount of physical memory that AdvFS uses for its buffer cache. You may improve AdvFS performance by increasing the percentage of memory allocated to the AdvFS buffer cache. To do this, increase the value of the AdvfsCacheMaxPercent attribute. The default is 7 percent of memory, the minimum is 1 percent, and the maximum is 30 percent.
Increasing the value of the AdvfsCacheMaxPercent attribute will decrease the amount of memory available to the virtual memory subsystem, so you must make sure that you do not cause excessive paging and swapping. Use the vmstat command to check virtual memory statistics.
You may want to increase the AdvFS buffer cache size if data reuse is high. If you increase the value of the AdvfsCacheMaxPercent attribute and experience no performance benefit, return to the original value. If data reuse is insignificant or if you have more than 2 GB of memory, you may want to decrease the cache size.
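For example, to raise the cache to 10 percent of memory, you might add a stanza similar to the following to /etc/sysconfigtab so that the value persists across reboots. The stanza assumes that the attribute belongs to the advfs subsystem; confirm the subsystem and attribute names for your operating system version (for example, with the sysconfig -q command).
    advfs:
        AdvfsCacheMaxPercent = 10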
AdvFS attempts to store file data in a collection of contiguous blocks (a file extent) on a disk. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.
File fragmentation degrades read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space, due to fragmentation.
Use the defragment utility with the -v and -n options to show the amount of file fragmentation. The defragment utility reduces the amount of file fragmentation in a file domain by attempting to make the files more contiguous, which reduces the number of file extents. The utility does not affect data availability and is transparent to users and applications. Striped files are not defragmented.
You can improve the efficiency of the defragment process by deleting any unneeded files in the file domain before running the defragment utility. See defragment(8) for more information.
Dirty or modified data is data that has been written by an application and cached but has not yet been written to disk. You can increase the amount of dirty data that AdvFS will cache for each volume in a file domain. This can improve write performance for systems that perform many random writes by increasing the number of cache hits.
You can increase the amount of cached dirty data for all new volumes or for a specific, existing volume. The default value is 16 KB. The minimum value is 0, which disables dirty data caching. The maximum value is 32 KB.
If you have high data reuse (data is repeatedly read and written), you may want to increase the dirty data threshold. If you have low data reuse, you may want to decrease the threshold or use the default value.
Use the chvol -t command to modify the dirty data threshold for an individual existing volume. You must specify the number of dirty, 512-byte blocks to cache. See chvol(8) for more information.
To modify the dirty data threshold for all new volumes, modify the value of the AdvfsReadyQLim attribute, which specifies the number of 512-byte blocks that can be on the readylazy queue before the requests are moved to the device queue.
If you change the dirty data threshold and performance does not improve, return to the original value.
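For example, because the threshold is specified in 512-byte blocks, raising it from the 16-KB default to the 32-KB maximum means specifying 64 blocks. The device and domain names below are illustrative only.
    # chvol -t 64 /dev/rz3c accounts_domain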
AdvFS reads and writes data by a fixed number of 512-byte blocks. The default is 128 blocks. Use the chvol command with the -w option to change the write-consolidation size. Use the chvol command with the -r option to change the read-ahead size. See chvol(8) for more information.
You may be able to improve performance for mmap page faulting and reduce read-ahead paging and cache dilution by decreasing the read-ahead size.
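For example, to halve the default 128-block read-ahead size on a volume that is accessed mostly through mmap, you might enter a command similar to the following (the device and domain names are illustrative).
    # chvol -r 64 /dev/rz3c accounts_domain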
If the disk is fragmented so that the pages of a file are not sequentially allocated, reduce fragmentation by using the defragment utility. See defragment(8) for more information.
A file can have dirty data in memory due to a write system call or a memory write reference after an mmap system call. The update daemon runs every 30 seconds and issues a sync call for every fileset mounted with read and write access.
The AdvfsSyncMmapPages attribute controls whether modified (dirty) mmapped pages are flushed to disk during a sync system call. If the AdvfsSyncMmapPages attribute is set to 1, the dirty mmapped pages are asynchronously written to disk. If the AdvfsSyncMmapPages attribute is set to 0, dirty mmapped pages are not written to disk during a sync system call.
If your applications manage their own mmap page flushing, set the value of the AdvfsSyncMmapPages attribute to 0. See mmap(2) and msync(2) for more information.
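For example, a stanza similar to the following in /etc/sysconfigtab disables the flushing of dirty mmapped pages during a sync call. As with the other AdvFS attributes, the subsystem name shown here is an assumption to confirm for your version.
    advfs:
        AdvfsSyncMmapPages = 0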
Synchronous and asynchronous AdvFS I/O requests are placed on separate consolidation queues, where small, logically contiguous block requests are consolidated into larger I/O requests. The consolidated synchronous and asynchronous I/O requests are moved to the AdvFS device queue and then sent to the device driver.
The AdvfsMaxDevQLen attribute limits the AdvFS device queue length. When the number of requests on the device queue exceeds the value of the AdvfsMaxDevQLen attribute, only synchronous requests are accepted onto the device queue. The default value of the AdvfsMaxDevQLen attribute is 80.
Limiting the size of the device queue affects the amount of time it takes to complete a synchronous (blocking) I/O operation. AdvFS issues several types of blocking I/O operations, including AdvFS metadata and log data operations.
The default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations. However, you may need to modify this value if you are using fast or slow adapters, striping, or mirroring. A higher value may improve throughput, but will also increase synchronous read/write time. To calculate response time, multiply the value of the AdvfsMaxDevQLen attribute by 9 milliseconds (the average I/O latency).
A guideline is to specify a value for the AdvfsMaxDevQLen attribute that is less than or equal to the average number of I/O operations that can be performed in 0.5 seconds.
If you do not want to limit the number of requests on the device queue, set the value of the AdvfsMaxDevQLen attribute to 0 (zero).
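For example, if a volume sustains about 110 I/O operations per second (an illustrative figure), it completes roughly 55 operations in 0.5 seconds, so a value of 55 or less for the AdvfsMaxDevQLen attribute would be reasonable; the corresponding worst-case synchronous response time is approximately 55 multiplied by 9 milliseconds, or about 0.5 seconds.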
Consolidating a number of I/O transfers into a single, large I/O transfer can improve AdvFS performance. To do this, use the chvol command with the -c on flag. This is the default. DIGITAL recommends that you do not disable the consolidation of I/O transfers. See chvol(8) for more information.
Use the chfile -l on command to force all write requests to an AdvFS file to be synchronous. When forced synchronous write requests to a file are enabled, the write system call returns a success value only after the data has been successfully written to disk. This may degrade file system performance.
When forced synchronous write requests to a file are disabled, the write system call returns a success value when the requests are cached. The data is then written to disk at a later time (asynchronously).
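For example, to force synchronous writes to a critical data file and later restore the default asynchronous behavior, you might enter commands similar to the following (the file name is illustrative).
    # chfile -l on /accounts/db/journal.dat
    # chfile -l off /accounts/db/journal.dat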
Make sure that the AdvFS transaction log resides on an uncongested disk and bus; otherwise, system performance may be degraded.
If the transaction log becomes a bottleneck, use the switchlog command to relocate the transaction log of the specified file domain to a faster or less congested volume in the same domain. Use the showfdmn command to determine the current location of the transaction log. In the showfdmn command display, the letter L is displayed next to the volume that contains the log. See switchlog(8) and showfdmn(8) for more information.
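For example, the following commands display the volumes of an illustrative domain (the L flag in the output marks the log volume) and then move the log to volume 2. The form of the switchlog volume argument is an assumption; verify it in switchlog(8).
    # showfdmn accounts_domain
    # switchlog accounts_domain 2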
In addition, you can divide the file domain into several smaller file domains. This will cause each domain's transaction log to handle transactions for fewer filesets.
If the files in a multivolume domain are not evenly distributed, performance may be degraded. The balance utility distributes the percentage of used space evenly between volumes in a multivolume file domain. This improves performance and evens the distribution of future file allocations. Files are moved from one volume to another until the percentage of used space on each volume in the domain is as equal as possible. The balance utility does not affect data availability and is transparent to users and applications.
If possible, use the defragment utility before you balance files. The balance utility does not generally split files; therefore, file domains with very large files may not balance as evenly as file domains with smaller files. See balance(8) for more information.
To determine if you need to balance your files across volumes, use the showfdmn command to display information about the volumes in a domain. The % used field shows the percentage of volume space that is currently allocated to files or metadata (fileset data structure). See showfdmn(8) for more information.
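For example, if showfdmn reports widely differing % used values for the volumes of an illustrative domain, a command similar to the following redistributes the used space.
    # showfdmn accounts_domain
    # balance accounts_domain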
Performance may degrade if too many frequently accessed or large files reside on the same volume in a multivolume file domain. You can improve I/O performance by altering the way files are mapped on the disk.
Use the migrate utility to move frequently accessed or large files to different volumes in the file domain. You can specify the volume where a file is to be moved, or allow the system to pick the best space in the file domain. You can migrate either an entire file or specific pages to a different volume. However, using the balance utility after migrating files may cause the files to move to a different volume. See balance(8) for more information.
In addition, a file that is migrated is defragmented at the same time, if possible. Defragmentation makes the file more contiguous, which improves performance. Therefore, you can use the migrate command to defragment selected files. See migrate(8) for more information.
The following sections will help you to configure and tune UNIX File Systems (UFS).
There are a number of parameters that can improve UFS performance. You can set all of the parameters when you use the newfs command to create a file system. For existing file systems, you can tune some parameters by using the tunefs command.
Table 5-14 describes UFS configuration guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-2 and Table 5-3 apply to UFS configurations.
Action | Performance Benefit | Tradeoff |
Increase the file system fragment size to 8 KB (Section 5.7.1.1) | Improves performance for large files | Wastes disk space for small files |
Use the default file system fragment size of 1 KB (Section 5.7.1.1) | Uses disk space efficiently | None |
Reduce the density of inodes (Section 5.7.1.2) | Improves performance of large files | None |
Allocate blocks contiguously (Section 5.7.1.3) | Aids UFS block clustering | None |
Increase the number of blocks combined for a read (Section 5.7.1.4) | Improves performance | None |
Use a Memory File System (MFS) (Section 5.7.1.5) | Improves I/O performance | Does not ensure data integrity because of cache volatility |
Use disk quotas (Section 5.7.1.6) | Controls disk space utilization | UFS quotas may slow reboot time |
The following sections describe the UFS configuration guidelines in detail.
If the average file in a file system is larger than 16 KB but less than 96 KB, you may be able to improve disk access time and decrease system overhead by making the file system fragment size equal to the block size, which is 8 KB. Use the newfs command to do this. However, to use disk space efficiently, use the default fragment size, which is 1 KB. See newfs(8) for more information.
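For example, to create a file system whose fragment size equals the 8-KB block size, you might enter a command similar to the following; the character device name is illustrative only.
    # newfs -b 8192 -f 8192 /dev/rrz3c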
The number of files in a file system is determined by the number of inodes and the size of the file system. The default is to create an inode for each 4096 bytes of data space.
If a file system will contain many large files, you may want to increase the amount of data space allocated to an inode and reduce the density of inodes. To do this, use the newfs -i command to specify the amount of data space allocated to an inode. See newfs(8) for more information.
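For example, to allocate one inode for every 16 KB of data space instead of the default 4096 bytes, you might use a command similar to the following (illustrative device name).
    # newfs -i 16384 /dev/rrz3c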
The UFS rotdelay parameter specifies the time, in milliseconds, to service a transfer completion interrupt and initiate a new transfer on the same disk. You can set the rotdelay parameter to 0 (the default) to allocate blocks sequentially and aid UFS block clustering. You can do this by using either the tunefs command or the newfs command. See newfs(8) and tunefs(8) for more information.
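For example, to set rotdelay to 0 on an existing file system (ideally while it is unmounted), you might enter a command similar to the following; the device name is illustrative.
    # tunefs -d 0 /dev/rrz3c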
The value of the UFS maxcontig parameter specifies the number of blocks that can be combined into a single cluster. The default value of maxcontig is 8. The file system attempts read operations in a size that is defined by the value of maxcontig multiplied by the block size (8 KB). Device drivers that can chain several buffers together in a single transfer should use a maxcontig value that is equal to the maximum chain length.
Use the tunefs command or the newfs command to change the value of maxcontig. See newfs(8) and tunefs(8) for more information.
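For example, to allow 16 blocks to be combined into a single cluster on an existing file system, you might enter a command similar to the following (illustrative device name).
    # tunefs -a 16 /dev/rrz3c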
Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS file system can improve read/write performance, but it is a volatile cache. The contents of an MFS file system are lost after a reboot, unmount operation, or power failure.
Because no data is written to disk, an MFS file system is a very fast file system that can be used to store temporary files or read-only files that are loaded into it after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS file system to cache the temporary files that are created during the build and reduce the build time.
You can specify UFS file system limits for user accounts and for groups by setting up file system quotas, also known as disk quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.
You may want to set quotas on file systems that contain home directories, because the sizes of these file systems can increase more significantly than other file systems. Do not set quotas on the /tmp file system.
Note that, unlike AdvFS quotas, UFS quotas may slow reboot time. For information about AdvFS quotas, see Section 5.6.1.5.
After you configure your UFS file systems, you can modify some parameters and attributes to improve performance. Table 5-15 describes UFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-4 apply to UFS configurations.
Action | Performance Benefit | Tradeoff |
Increase size of metadata buffer cache to more than 3 percent of main memory (Section 4.9.1) | Increases cache hit rate and improves UFS performance | Requires additional memory resources |
Defragment the file system (Section 5.7.2.1) | Improves read and write performance | Requires down time |
Delay flushing full write buffers to disk (Section 5.7.2.2) | Frees CPU cycles | May degrade real-time workload performance |
Increase number of blocks combined for read ahead (Section 5.7.2.3) | Improves performance | None |
Increase number of blocks combined for a write (Section 5.7.2.4) | Improves performance | None |
Increase the maximum number of UFS or MFS mounts (Section 5.7.2.5) | Allows more mounted file systems | None |
The following sections describe how to tune UFS in detail.
When a file consists of many discontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance because it requires more I/O operations to access the file.
You can determine whether the files in a file system are fragmented by determining how effectively the system is clustering. You can do this by using dbx to examine the ufs_clusterstats, ufs_clusterstats_read, and ufs_clusterstats_write structures. See dbx(1) for more information.
UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not being particularly effective, the files in the file system may be very fragmented.
To defragment a UFS file system, follow these steps:
Back up the file system onto tape or another partition.
Create a new file system either on the same partition or a different partition.
Restore the file system.
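For example, a minimal sketch of this procedure for an illustrative file system mounted on /usr/projects, built on /dev/rrz3c and backed up to a tape on /dev/nrmt0h (adjust device names, dump levels, and restore options for your site):
    # dump -0uf /dev/nrmt0h /usr/projects
    # umount /usr/projects
    # newfs /dev/rrz3c
    # mount /dev/rz3c /usr/projects
    # cd /usr/projects
    # restore -rf /dev/nrmt0h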
AdvFS provides you with the ability to defragment a file domain by using the defragment command. See defragment(8) for more information.
You can free CPU cycles by using the dbx debugger to set the value of the delay_wbuffers kernel variable to 1, which delays flushing full write buffers to disk until the next sync call. However, this may adversely affect real-time workload performance. The default value of delay_wbuffers is 0. See dbx(1) for more information.
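For example, the following dbx session sets the variable on the running kernel. The kernel image and memory file paths shown are the usual defaults and may differ on your system.
    # dbx -k /vmunix /dev/mem
    (dbx) assign delay_wbuffers = 1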
You can increase the number of blocks that are combined for a read-ahead operation.
To do this, use the dbx debugger to make the value of the cluster_consec_init kernel variable equal to the value of the cluster_max_read_ahead variable (the default is 8), which specifies the maximum number of read-ahead clusters that the kernel can schedule. See dbx(1) for more information.
In addition, you must make sure that cluster read operations are enabled on nonread-ahead and read-ahead blocks. To do this, the value of the cluster_read_all kernel variable must be set to 1 (the default).
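For example, a dbx session similar to the following checks that cluster_read_all is enabled and raises cluster_consec_init to the cluster_max_read_ahead default of 8 (paths as in the previous example).
    # dbx -k /vmunix /dev/mem
    (dbx) print cluster_read_all
    (dbx) assign cluster_consec_init = 8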
The cluster_maxcontig parameter specifies the number of blocks that are combined into a single write operation. The default value is 8. Contiguous writes are done in a unit size that is determined by the file system block size (the default is 8 KB) multiplied by the value of the cluster_maxcontig parameter.
Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made. The max-ufs-mounts attribute specifies the maximum number of UFS and MFS mounts on the system. You can increase the value of the max-ufs-mounts attribute if your system will have more than the default limit of 1000 mounts.
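For example, to raise the limit to 2000 mounts, you might add a stanza similar to the following to /etc/sysconfigtab. The attribute is assumed here to belong to the vfs subsystem; confirm the subsystem name for your operating system version.
    vfs:
        max-ufs-mounts = 2000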
DIGITAL UNIX uses the Common Access Method (CAM) as the operating system interface to the hardware. CAM maintains pools of buffers that are used to perform I/O. Each buffer takes approximately 1 KB of physical memory. Monitor these pools and tune them if necessary.
You can modify the following attributes:
cam_ccb_pool_size -- The initial size of the buffer pool free list at boot time. The default is 200.
cam_ccb_low_water -- The number of buffers in the pool free list at which more buffers are allocated from the kernel. CAM reserves this number of buffers to ensure that the kernel always has enough memory to shut down runaway processes. The default is 100.
cam_ccb_increment -- The number of buffers either added or removed from the buffer pool free list. Buffers are allocated on an as-needed basis to handle immediate demands, but are released in a more measured manner to guard against spikes. The default is 50.
If the I/O pattern associated with your system tends to have intermittent bursts of I/O operations (I/O spikes), increasing the values of the cam_ccb_pool_size and cam_ccb_increment attributes may improve performance.
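For example, a stanza similar to the following doubles the initial pool size and the increment. The subsystem name shown (io) is an assumption; confirm the correct subsystem and attribute names in the attribute documentation for your operating system version before editing /etc/sysconfigtab.
    io:
        cam_ccb_pool_size = 400
        cam_ccb_increment = 100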