A storage subsystem consists of software (operating system or layered product) and hardware (including host bus adapters, cables, and disks). Your storage configuration can have a significant impact on system performance, because disk I/O is used for file system operations and also by the virtual memory subsystem for paging and swapping.
To configure a storage subsystem that will meet your performance and availability needs, you must first understand the I/O requirements of the users and applications and how they perform disk I/O, as described in Chapter 1. After you configure your storage subsystem, you may be able to tune the subsystem to improve performance.
This chapter describes the features of different storage subsystems and provides guidelines for configuring and tuning the subsystems.
Many of the tuning tasks described in this chapter require you to modify system attributes. See Section 2.11 for more information about attributes.
Disk I/O operations are significantly slower than data transfers involving the CPU or memory caches. Because disks are used for data storage and for virtual memory swap space, an incorrectly configured or tuned storage subsystem can degrade overall system performance.
Disk I/O performance can be affected by the following variables:
Workload characteristics
Performance depends on how your users and applications perform disk I/O. For example, a workload can involve primarily read or write I/O operations. In addition, some workloads require low latency and high throughput, while others require a fast data transfer rate (high bandwidth).
Low latency is important for multiple small data transfers and also for workstation, timesharing, and server environments. High bandwidth is important for systems that perform large sequential data transfers, such as database servers. See Chapter 1 for more information about characterizing your disk I/O.
Performance capacity of the hardware
DIGITAL recommends that you use hardware with the best performance features. For example, disks with a high rate of revolutions per minute (RPM) provide the best overall performance. Wide disks, which support 16-bit transfers, have twice the bandwidth of narrow (8-bit) disks and can improve performance for large data transfers. High-performance host bus adapters, such as fast wide differential (FWD) adapters, provide low CPU overhead and high bandwidth. In addition, a write-back cache decreases the latency of small writes and can improve throughput.
Memory allocation to the UBC
The Unified Buffer Cache (UBC) is allocated a portion of physical memory and caches actual file system data for reads and writes, Advanced File System (AdvFS) metadata, and Memory File System (MFS) data. The UBC decreases the number of disk operations for file systems by serving as a layer between the disk and the operating system. The metadata buffer cache and the AdvFS buffer cache are also allocated a percentage of physical memory.
Kernel variable values
Disk I/O performance depends on kernel variable values that are appropriate for your workload and configuration. You may need to modify the default values to obtain optimal system performance, as described in this manual.
Mirrored disk configuration
Mirroring data across different disks improves the performance of read operations and provides high data availability. However, because data must be written to two separate locations, mirroring degrades disk write performance.
Striped disk configuration
Striping data across multiple disks distributes the I/O load and enables parallel I/O streams to operate concurrently on different devices, which improves disk I/O performance for some workloads.
Hardware RAID subsystem configuration
Hardware RAID subsystems relieve the CPU of disk management overhead and support write-back caches, which can improve disk I/O performance for some workloads.
File system configuration
File systems, including UFS and AdvFS, are used to organize and manage files. AdvFS provides you with fast file system recovery, improved performance for sequential and large I/O operations, and disk defragmentation features.
Raw I/O
For some workloads, raw I/O (I/O to a disk that does not contain a file system) may have better performance than file system I/O because it bypasses buffers and caches.
To choose a storage subsystem that will meet the needs of your users and applications, you must understand the benefits and tradeoffs of the various disk and file management options, as described in Section 5.2.
DIGITAL UNIX supports a number of methods that you can use to manage the physical disks and files in your environment. The traditional method of managing disks and files is to divide each disk into logical areas called disk partitions, and to then create a file system on a partition or use a partition for raw I/O. A disk can consist of one to eight partitions that have a fixed size; these partitions cannot overlap.
Each disk type has a default partition scheme.
The disktab database file lists the default disk partition sizes. The partition size determines the amount of data it can hold. To modify the size of a partition, you must back up any data in the partition, change the size by using the disklabel command, and then restore the data to the resized partition. You must be sure that the data will fit into the new partition.
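For example, the following sketch shows one way to resize a UFS partition. It assumes that partition g on disk rz3 holds a UFS file system mounted on /data, that a tape drive is available at /dev/nrmt0h, and that the new partition is large enough for the data; all device and directory names are illustrative.
     # dump -0u -f /dev/nrmt0h /data     (back up the file system to tape)
     # umount /data
     # disklabel -e rz3                  (edit the partition sizes)
     # newfs /dev/rrz3g                  (re-create the file system)
     # mount /dev/rz3g /data
     # cd /data
     # restore -rf /dev/nrmt0h           (restore the data)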
An alternative method to managing disks with static disk partitions is to use the Logical Storage Manager (LSM) to set up a shared storage pool that consists of multiple disks. You can then create virtual disks from this pool of storage, according to your performance and capacity needs. LSM provides you with flexible and easy management for large storage configurations. Because there is no direct correlation between a virtual disk and a physical disk, file system or raw I/O can span disks, as needed. In addition, you can easily add disks to and remove disks from the pool, balance the load, and perform other storage management tasks. LSM also provides you with high-performance and high-availability RAID functionality.
Hardware RAID subsystems provide another method of handling storage. These subsystems use intelligent controllers to provide high-performance and high-availability RAID functionality, allow you to increase your storage capacity, and support write-back caches. RAID controllers allow you to combine several disks into a single storage set that the system sees as a single unit.
You can choose to manage your file systems by using AdvFS. AdvFS provides file system features beyond those of a traditional UFS file system. Unlike the rigid UFS model in which the file system directory hierarchy (tree) is bound tightly to the physical storage, AdvFS consists of two distinct layers: the directory hierarchy layer and the physical storage layer. This decoupled file system structure enables you to manage the physical storage layer apart from the directory hierarchy layer. This means that you can move files between a defined group of disk volumes without changing file pathnames. Because the pathnames remain the same, the action is completely transparent to end users.
You can use different configurations in your environment. For example, you can create static partitions on some disks and use the remaining disks in LSM volumes. You can also combine products in the same configuration. For example, you can configure AdvFS file domains on top of LSM volumes, or configure LSM volumes on top of RAID storage sets.
The following sections describe the features of the different disk and file system management options.
RAID (redundant array of independent disks) technology can provide both high disk I/O performance and high data availability. The DIGITAL UNIX operating system provides RAID functionality by using the Logical Storage Manager (LSM) product. DIGITAL UNIX also supports hardware RAID subsystems, which provide RAID functionality by using intelligent controllers, caches, and software.
There are four primary RAID levels:
RAID 0--Also known as disk striping, RAID 0 divides data into blocks (sometimes called chunks or stripes) and distributes the blocks across multiple disks in an array. Striping enables parallel I/O streams to operate concurrently on different devices. I/O operations can be handled simultaneously by multiple devices, which balances the I/O load and improves performance.
The performance benefit of striping depends on the size of the stripe and how your users and applications perform disk I/O. For example, if an application performs multiple simultaneous I/O operations, you can specify a stripe size that will enable each disk in the array to handle a separate I/O operation. If an application performs large sequential data transfers, you can specify a stripe size that will distribute a large I/O evenly across the disks.
For volumes that receive only one I/O at a time, you may not want to use striping if access time is the most important factor. In addition, striping may degrade the performance of small data transfers, because of the latencies of the disks and the overhead associated with dividing a small amount of data.
Striping decreases data availability because one disk failure makes the entire disk array unavailable. To make striped disks highly available, you can mirror the disks.
RAID 1--Also known as disk mirroring, RAID 1 provides high data availability by maintaining identical copies of data on different disks in an array. RAID 1 also improves the disk read performance, because data can be read from two different locations. However, RAID 1 can decrease disk write performance, because data must be written to two different locations.
RAID 3--A type of parity RAID, RAID 3 divides data blocks and distributes the data across a disk array, providing parallel access to data. RAID 3 provides a high data transfer rate and increases bandwidth, but it provides no improvement in throughput (the I/O transaction rate).
RAID 3 can improve the I/O performance for applications that transfer large amounts of sequential data, but it provides no improvement for applications that perform multiple I/O operations involving small amounts of data.
RAID 3 provides high data availability by storing redundant parity information on a separate disk. The parity information is used to regenerate data if a disk in the array fails. However, performance degrades as multiple disks fail, and data reconstruction is slower than if you had used mirroring.
RAID 5--A type of parity RAID, RAID 5 distributes data blocks across disks in an array. RAID 5 allows independent access to data and can handle simultaneous I/O operations.
RAID 5 can improve throughput, especially for large file I/O operations, multiple small data transfers, and I/O read operations. However, it is not suited to write-intensive applications.
RAID 5 provides data availability by distributing redundant parity information across disks. Each array member contains enough parity information to regenerate data if a disk fails. However, performance may degrade and data may be lost if multiple disks fail. In addition, data reconstruction is slower than if you had used mirroring.
To address your performance and availability needs, you can combine some RAID levels. For example, you can combine RAID 0 with RAID 1 to mirror striped disks for high availability and high performance.
In addition, some DIGITAL hardware RAID subsystems support adaptive RAID 3/5 (also called dynamic parity RAID), which improves disk I/O performance for a wide variety of applications by dynamically adjusting, according to workload needs, between data transfer-intensive algorithms and I/O operation-intensive algorithms.
Table 5-1 compares the performance and availability features for the different RAID levels.
RAID Level | Performance Impact | Availability Impact |
RAID 0 | Balances I/O load and improves reads and writes | Lower than single disk |
RAID 1 | Improves reads, may degrade writes | Highest |
RAID 0+1 | Balances I/O load, improves reads, may degrade writes | Highest |
RAID 3 | Improves bandwidth, performance may degrade if multiple disks fail | Higher than single disk |
RAID 5 | Improves throughput, performance may degrade if multiple disks fail | Higher than single disk |
Adaptive RAID 3/5 | Improves bandwidth and throughput, performance may degrade if multiple disks fail | Higher than single disk |
It is important to understand that RAID performance depends on the state of the devices in the RAID subsystem. There are three possible states: steady state (no failures), failure (one or more disks have failed), and recovery (subsystem is recovering from failure).
There are many variables to consider when choosing a RAID configuration:
Not all RAID products support all RAID levels.
For example, LSM currently supports only RAID 0 (striping) and RAID 1 (mirroring), and only high-performance RAID controllers support adaptive RAID 3/5.
RAID products provide different performance benefits.
For example, hardware RAID subsystems support write-back caches and other performance-enhancing features and also relieve the CPU of the I/O overhead.
Some RAID configurations are more cost-effective than others.
In general, LSM provides more cost-effective RAID functionality than hardware RAID subsystems. In addition, parity RAID provides data availability at a cost that is lower than RAID 1 (mirroring), because mirroring n disks requires 2n disks.
Data recovery rates depend on the RAID configuration.
For example, if a disk fails, it is faster to regenerate data by using a mirrored copy than by using parity information. In addition, if you are using parity RAID, I/O performance declines as additional disks fail.
There are advantages to each RAID product, and which one you choose depends on your workload requirements and other factors. The following sections describe the features of the different RAID subsystems and LSM.
Hardware RAID subsystems use a combination of hardware (RAID controllers, caches, and host bus adapters) and software to provide high disk I/O performance and high data availability. A hardware RAID subsystem is sometimes called hardware RAID.
All hardware RAID subsystems provide you with the following features:
A RAID controller that relieves the CPU of the disk I/O overhead
Increased disk storage capacity
Hardware RAID subsystems allow you to connect a large number of disks to your system. In a typical storage configuration, you use a SCSI bus connected to an I/O bus slot to attach disks to a system. However, you can connect only a limited number of disks on a SCSI bus, and systems have limited I/O bus slots. Hardware RAID subsystems contain internal SCSI buses and host bus adapters, which enable you to connect multiple SCSI buses and multiple disks to a system by using only one I/O bus slot.
Read cache
A read cache can improve I/O read performance by holding data that it anticipates the host will request. If a system requests data that is already in the read cache (a cache hit), the data is immediately supplied without having to read the data from disk. Subsequent data modifications are written both to disk and to the read cache (write-through caching).
Write-back cache
Hardware RAID subsystems support (as a standard or an optional feature) a nonvolatile write-back cache, which can improve I/O write performance while maintaining data integrity. A write-back cache decreases the latency of many small writes, and can improve Web server performance because writes appear to be executed immediately. A write-back cache must be battery-backed to protect against data loss and corruption.
With write-back caching, data intended to be written to disk is temporarily stored in the cache, consolidated, and then periodically written (flushed) to disk for maximum efficiency. If a failure occurs, upon recovery, the RAID controller detects any unwritten data that still exists in the write-back cache and writes the data to disk before enabling normal controller operations.
Parity RAID support
Hardware RAID subsystems provide various levels of parity RAID support (RAID 3, RAID 5, or adaptive RAID 3/5) for high performance and high availability.
Hot component swapping and sparing
Hot swap support allows you to replace a failed component while the system continues to operate. Hot spare support allows you to automatically use previously installed components if a failure occurs.
Non-RAID disk array capability or "just a bunch of disks" (JBOD)
Graphical user interface (GUI) for easy management and monitoring
The volstat command, which provides detailed LSM performance information
There are various hardware RAID subsystems, including backplane RAID array subsystems and high-performance standalone RAID array subsystems, which provide different degrees of performance and availability at various costs. The features of these two subsystems are as follows:
Backplane RAID array subsystems
These entry-level subsystems, such as the RAID Array 230 subsystem, provide a low-cost hardware RAID solution. A backplane RAID array controller is installed in an I/O bus slot, either a PCI bus slot or an EISA bus slot, and acts as both a host bus adapter and a RAID controller.
Backplane RAID array subsystems are designed for small and midsize departments and workgroups, and provide RAID functionality (0, 1, 0+1, and 5) and an optional write-back cache.
Standalone RAID array subsystems
These subsystems, such as the RAID Array 450 subsystem, provide high availability and the highest performance of any RAID subsystem. A standalone RAID array subsystem uses a high-performance controller, such as the HSZ controller. The controller connects to the system through a FWD SCSI bus and a high-performance host bus adapter, such as a KZPSA adapter, installed in an I/O bus slot.
Standalone RAID array subsystems are designed for client/server, data center, and medium to large departmental environments. They provide RAID functionality (0, 1, 0+1, and adaptive RAID 3/5), dual-redundant controller support, scalability, storage set partitioning, and a standard write-back cache.
See Section 5.5 for information on configuring hardware RAID subsystems.
Logical Storage Manager (LSM) can improve disk I/O performance, provide high data availability, and help you to manage your storage more efficiently. All DIGITAL UNIX systems can use the basic LSM functions, but advanced disk management functions require a separate LSM license. When LSM is used to stripe or mirror disks, it is sometimes referred to as software RAID.
LSM allows you to organize a shared storage pool into volumes, which are used in the same way as disk partitions, except that I/O directed to a volume can span disks. You can create a UFS file system or an AdvFS file domain on a volume, or you can use a volume as a raw device. You can also create LSM volumes on top of RAID storage sets.
LSM supports the following disk management features:
Pool of storage
Load balancing by transparently moving data across disks
RAID 1 (disk mirroring) support (license necessary)
Disk concatenation (creating a large volume from multiple disks)
Graphical user interface (GUI) for easy disk management and detailed performance information (license necessary)
LSM provides more cost-effective RAID functionality than a hardware RAID subsystem. In addition, LSM configurations are less complex than hardware RAID configurations. To obtain the performance benefits of both LSM and hardware RAID, you can create LSM volumes on top of RAID storage sets.
LSM is especially suited for systems with large numbers of disks. For these systems, you may want to use LSM to manage your disks and AdvFS to manage your files. That is, you can organize your disks into LSM volumes and then use those volumes to create AdvFS file domains.
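For example, the following sketch creates an AdvFS file domain and fileset on an LSM volume. It assumes a volume named vol01 in the rootdg disk group and a mount point of /data; the domain, fileset, and device names are illustrative.
     # mkfdmn /dev/vol/rootdg/vol01 data_dmn
     # mkfset data_dmn data_fs
     # mount -t advfs data_dmn#data_fs /data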
Advanced File System (AdvFS) is a DIGITAL UNIX file system option that provides many file management and performance features. You can use AdvFS instead of UFS to organize and manage your files.
The AdvFS Utilities product, which is licensed separately from the DIGITAL UNIX operating system, extends the capabilities of the AdvFS file system. An AdvFS file domain can consist of multiple volumes, which can be UNIX block devices (entire disks), disk partitions, LSM logical volumes, or RAID storage sets. AdvFS filesets can span all the volumes in the file domain.
AdvFS provides the following file management features:
Fast file system recovery
Rebooting after a system interruption is extremely fast. AdvFS uses write-ahead logging, instead of the fsck utility, as a way to check for and repair file system inconsistencies. The recovery speed depends on the number of uncommitted records in the log, not the amount of data in the fileset; therefore, reboots are quick and predictable.
High-performance file system
AdvFS uses an extent-based file allocation scheme that consolidates data transfers, which increases sequential bandwidth and improves performance for large data transfers. AdvFS performs large reads from disk when it anticipates a need for sequential data. AdvFS also performs large writes by combining adjacent data into a single data transfer.
Online file system management
File domain defragmentation
Support for large files and file systems
User quotas
AdvFS utilities provide the following features:
Pool of storage that allows you to add, remove, and back up disks without disrupting users or applications.
Disk spanning filesets
Ability to recover deleted files
Users can retrieve their own unintentionally deleted files from predefined trashcan directories or from clone filesets, without assistance from system administrators.
I/O load balancing across disks
Online fileset resizing
Online file migration across disks
File-level striping
File-level striping may improve I/O bandwidth (transfer rates) by distributing file data across multiple disk volumes.
Graphical user interface (GUI) that simplifies disk and file system administration, provides status, and alerts you to potential problems
See Section 5.6 for information about AdvFS configuration and tuning guidelines.
There are some general guidelines for configuring and tuning storage subsystems. These guidelines are applicable to most configurations and will help you to get the best disk I/O performance, regardless of whether you are using static partitions, raw devices, LSM, hardware RAID subsystems, AdvFS, or UFS.
These guidelines fall into three categories:
Using high-performance hardware (see Table 5-2)
Distributing the disk I/O load (see Table 5-3)
General file system tuning (see Table 5-4)
The following sections describe these guidelines in detail.
Using high-performance hardware will provide the best disk I/O performance, regardless of your storage configuration. Table 5-2 describes the guidelines for hardware configurations and lists the performance benefits as well as the tradeoffs.
Hardware | Performance Benefit | Tradeoff |
Fast (high RPM) disks (Section 5.3.1.1) | Improve disk access time and sequential data transfer performance | Cost |
Disks with small platter sizes (Section 5.3.1.2) | Improve seek times for applications that perform many small I/O operations | No benefit for large sequential data transfers |
Wide disks (Section 5.3.1.3) | Provide high bandwidth and improve performance for large data transfers | Cost |
Solid-state disks (Section 5.3.1.4) | Provide very low disk access time | Cost |
High-performance host bus adapters (Section 5.3.1.5) | Increase bandwidth and throughput | Cost |
DMA host bus adapters (Section 5.3.1.6) | Relieve CPU of data transfer overhead | None |
Prestoserve (Section 5.3.1.7) | Improves synchronous write performance | Cost, not supported in a cluster or for nonfile system I/O operations |
Hardware RAID subsystem (Section 5.5) | Increases disk capacity and supports write-back cache | Cost of hardware RAID subsystem |
Write-back cache (Section 5.3.1.8) | Reduces the latency of many small writes | Cost of hardware RAID subsystem |
See the DIGITAL Systems & Options Catalog for information about disk, adapter, and controller performance features.
The following sections describe these guidelines in detail.
Disks that spin with a high rate of revolutions per minute (RPM) have a low disk access time (latency). High-RPM disks are especially beneficial to the performance of sequential data transfers.
High-performance 5400 RPM disks can improve performance for many transaction processing applications (TPAs). Extra high-performance 7200 RPM disks are ideal for applications that require both high bandwidth and high throughput.
Disks with small platter sizes provide better seek times than disks with large platter sizes, because the disk head has less distance to travel between tracks. There are three sizes for disk platters: 2.5, 3.5, and 5.25 inches in diameter.
A small platter size may improve disk I/O performance (seek time) for applications that perform many small I/O operations, but it provides no performance benefit for large sequential data transfers.
Disks with wide (16-bit) data paths provide twice the bandwidth of disks with narrow (8-bit) data paths. Wide disks can improve I/O performance for large data transfers.
Solid-state disks provide outstanding performance in comparison to regular disks but at a higher cost. Solid-state disks have a disk access time that is less than 100 microseconds, which is equivalent to memory access speed and more than 100 times faster than the disk access time for magnetic disks.
Solid-state disks are ideal for a wide range of response-time critical applications, such as online transaction processing (OLTP), and applications that require high bandwidth, such as video applications. Solid-state disks also provide data reliability through a data-retention system. For the best performance, use solid-state disks for your most frequently accessed data, place the disks on a dedicated bus, and use a high-performance host bus adapter.
Host bus adapters provide different performance features at various costs. For example, FWD adapters, such as the KZPSA adapter, provide high bandwidth and high throughput connections to disk devices.
SCSI adapters let you set the SCSI bus speed, which is the rate of data transfers. There are three possible bus speeds:
Slow (up to 5 million bytes per second or 5 MHz)
Fast (up to 10 million bytes per second or 10 MHz)
Fast bus speed uses the fast synchronous transfer option, enabling I/O devices to attain high peak-rate transfers in synchronous mode.
Ultra (up to 20 million bytes per second or 20 MHz)
Not all SCSI bus adapters support all speeds.
Some host bus adapters support direct memory access (DMA), which enables an adapter to bypass the CPU and go directly to memory to access and transfer data. For example, the KZPAA is a DMA adapter that provides a low-cost connection to SCSI disk devices.
Prestoserve utilizes a nonvolatile, battery-backed memory cache to improve synchronous write performance. Prestoserve temporarily caches file system writes that otherwise would have to be written to disk. This capability improves performance for systems that perform large numbers of synchronous writes.
To optimize Prestoserve cache use, you may want to enable Prestoserve only on the most frequently used file systems. You cannot use Prestoserve in a cluster or for nonfile system I/O.
Hardware RAID subsystems support (as a standard or an optional feature) write-back caches, which can improve I/O write performance while maintaining data integrity. A write-back cache must be battery-backed to protect against data loss and corruption.
A write-back cache decreases the latency of many small writes and can improve write-intensive application performance and Internet server performance. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
Because writes appear to be executed immediately, a write-back cache improves performance. If a failure occurs and the cache is battery-backed, upon recovery, the RAID controller will detect any unwritten data that still exists in the write-back cache and write the data to disk before enabling normal controller operations.
In addition to using hardware that will provide you with the best performance, you must distribute the disk I/O load across devices to obtain the maximum efficiency. Table 5-3 describes guidelines on how to distribute disk I/O and lists the performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Distribute swap space across different disks and buses (Section 5.3.2.1) | Improves paging and swapping performance and helps to prevent bottlenecks | Requires additional disks, cabling, and adapters |
Distribute disk I/O across different disks and buses (Section 5.3.2.2) | Allows parallel I/O operations and helps to prevent bottlenecks | Requires additional disks, cables, and adapters |
Place the most frequently used file systems on different disks (Section 5.3.2.3) | Helps to prevent disk bottlenecks | Requires additional disks |
Place data at the beginning of a ZBR disk (Section 5.3.2.4) | Improves bandwidth for sequential data transfers | None |
The following sections describe these guidelines in detail.
Distributing swap space across different disks and buses makes paging and swapping more efficient and helps to prevent any single adapter, disk, or bus from becoming a bottleneck. See the System Administration manual or swapon(8) for information about configuring swap space.
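For example, the following sketch adds swap space on two disks that reside on different SCSI buses; the device names are illustrative. See swapon(8) for the options your version supports and for how to make the devices permanent swap devices.
     # swapon /dev/rz1b /dev/rz17b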
You can also use LSM to stripe your swap disks, which distributes the disk I/O. See Section 5.4 for more information.
Distributing disk I/O across different disks and buses helps to prevent a single adapter, disk, or bus from becoming an I/O bottleneck and also allows simultaneous operations.
For example, if you have 16 GB of disk storage, you may get better performance from sixteen 1-GB disks than four 4-GB disks. More spindles (disks) may allow more simultaneous operations. For random I/O operations, 16 disks may be simultaneously seeking instead of 4 disks. For large sequential data transfers, 16 data streams can be simultaneously working instead of 4 data streams.
You can also use LSM to stripe your disks, which distributes the disk I/O load. See Section 5.4 for more information.
Place the most frequently used file systems on different disks. Distributing file systems will help to prevent a single disk from becoming a bottleneck.
Directories containing executable files or temporary files are often frequently accessed (for example, /var, /usr, and /tmp). If possible, place /usr and /tmp on different disks.
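For example, /etc/fstab entries similar to the following place /usr and /tmp on different disks; the device names are illustrative.
     /dev/rz1g   /usr   ufs   rw  1  2
     /dev/rz2c   /tmp   ufs   rw  1  2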
Data is most quickly transferred when it is located at the beginning of zone-based recording (ZBR) disks. Placing data at the beginning of these disks improves the bandwidth for sequential data transfers.
You may be able to improve I/O performance by modifying some kernel attributes that affect overall file system performance. The guidelines apply to all file system configurations, including UFS and AdvFS.
General file system tuning often involves tuning the Virtual File System (VFS). VFS provides a uniform interface that allows common access to files, regardless of the file system on which the files reside.
The file system tuning guidelines fall into these categories:
Changing how the system allocates and deallocates vnodes
The kernel data structure for an open file is called a vnode. These are used by all file systems. The allocation and deallocation of vnodes is handled dynamically by the system.
Increasing the size of the namei cache to make lookup operations faster
The namei cache is used by all file systems to map file pathnames to inodes.
Increasing the size of the hash chain table for the namei cache to make lookup operations faster
Hash tables are used for lookup operations.
Allocating more memory to the Unified Buffer Cache (UBC)
The UBC shares physical memory with the virtual memory subsystem and is used to cache the most recently accessed file system data.
Using Prestoserve to cache only UFS or AdvFS file system metadata
There are also specific guidelines for AdvFS and UFS file systems. See Section 5.6 and Section 5.7 for information.
Table 5-4 describes the guidelines for general file system tuning and lists the performance benefits as well as the tradeoffs.
Action | Performance Benefit | Tradeoff |
Increase the maximum number of open files (Section 5.3.3.1) | Allocates more resources to applications | Consumes memory |
Increase the size of the namei cache (Section 5.3.3.2) | Improves cache lookup operations | Consumes memory |
Increase the size of the hash chain table for the namei cache (Section 5.3.3.3) | Improves cache lookup operations | Consumes memory |
Allocate more memory to the UBC (Section 5.3.3.4) | Improves disk I/O performance | May cause excessive paging and swapping |
Use Prestoserve to cache only file system metadata (Section 5.3.3.5) | Improves performance for applications that access large amounts of file system metadata | Cost, not supported in a cluster or for nonfile system I/O operations |
Cache more vnodes on the free list (Section 5.3.3.6) | Improves cache lookup operations | Consumes memory |
Increase the amount of time for which vnodes are kept on the free list (Section 5.3.3.7) | Improves cache lookup operations | None |
Delay vnode deallocation (Section 5.3.3.8) | Improves namei cache lookup operations | Consumes memory |
Accelerate vnode deallocation (Section 5.3.3.9) | Reduces memory demands | Reduces the efficiency of the namei cache |
Disable vnode deallocation (Section 5.3.3.10) | Optimizes processing time | Consumes memory |
Increase the open file descriptor limit (Section 5.3.3.11) | Provides more file descriptors to a process | Increases the possibility of runaway allocations |
Decrease the open file descriptor limit (Section 5.3.3.11) | Prevents a process from consuming all the file descriptors | May adversely affect the performance of processes that require many file descriptors |
Disable clearing of the DMA scatter/gather map registers (Section 5.3.3.12) | Improves performance of VLM/VLDB systems | None |
The following sections describe these guidelines in detail.
Increasing the value of the max-vnodes or maxusers attribute increases the maximum number of vnodes, which increases the number of open files. If your applications require many open files, you may want to raise the values of these attributes. Raising the attribute values will increase the demand on your memory resources, and should only be done if you get a message stating that you are out of vnodes. If the number of users on the system exceeds the value of maxusers and you increase the value of maxusers, increase the value of max-vnodes proportionally.
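For example, the following /etc/sysconfigtab stanzas raise both attributes. The values are illustrative, and the stanzas assume that maxusers resides in the proc subsystem and max-vnodes in the vfs subsystem; use the sysconfig -q command to confirm the current values and locations on your system.
     proc:
         maxusers = 512
     vfs:
         max-vnodes = 16384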
The namei cache is used by all file systems to map file pathnames to inodes. Use dbx to monitor the cache by examining the nchstats structure. The miss rate (misses / (good + negative + misses)) should be less than 20 percent.
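For example, the following sketch examines the nchstats structure on a running system, assuming the kernel image is /vmunix:
     # dbx -k /vmunix /dev/mem
     (dbx) print nchstats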
To make lookup operations faster, increase the size of the namei cache by increasing the value of the maxusers attribute (the recommended way) or by increasing the value of the name-cache-size attribute. Increasing the value of maxusers or name-cache-size allocates more system resources for use by the kernel. However, it also increases the amount of physical memory consumed by the kernel. Note that many benchmarks may perform better with a large namei cache.
Increasing the size of the hash chain table for the namei cache spreads the namei cache elements and may reduce linear searches, which improves lookup speeds. The name-cache-hash-size attribute specifies the size of the hash chain table for the namei cache. The default size is 256 slots.
You can change the value of the name-cache-hash-size attribute so that each hash chain has three or four name cache entries. To determine an appropriate value for the name-cache-hash-size attribute, divide the value of the name-cache-size attribute by 3 or 4 and then round the result to a power of 2. For example, if the value of name-cache-size is 1029, dividing 1029 by 4 produces a value of 257. Based on this calculation, you could specify 256 (2 to the power of 8) for the value of the name-cache-hash-size attribute.
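For example, assuming that the attribute resides in the vfs subsystem, an /etc/sysconfigtab stanza similar to the following sets the value calculated above:
     vfs:
         name-cache-hash-size = 256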
The Unified Buffer Cache (UBC) uses a portion of physical memory to cache actual file system data for reads and writes, AdvFS metadata, and Memory File System (MFS) data. The UBC prevents the system from having to copy data from a disk, which improves performance. If there is an insufficient amount of memory allocated to the UBC, disk I/O performance may be degraded.
Increasing the size of the UBC improves the chance that data will be found in the cache. However, because the UBC and the virtual memory subsystem share the same physical memory pages, increasing the size of the UBC may cause excessive paging and swapping.
See Section 4.8 for information about tuning the UBC.
Prestoserve can improve the overall run-time performance for systems that perform large numbers of synchronous writes. The prmetaonly attribute controls whether Prestoserve caches only UFS and AdvFS file system metadata, instead of both metadata and synchronous write data (the default). If the attribute is set to 1 (enabled), Prestoserve caches only file system metadata.
Caching only metadata may improve the performance of applications that access many small files or applications that access a large amount of file-system metadata but do not reread recently written data.
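For example, an /etc/sysconfigtab stanza similar to the following enables metadata-only caching. The presto subsystem name is an assumption; verify the correct subsystem name with the sysconfig -s command before applying the change.
     presto:
         prmetaonly = 1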
You can raise the value of the min-free-vnodes attribute, which determines the minimum number of vnodes on the free list. Increasing the value causes the system to cache more free vnodes and improves the performance of cache lookup operations. However, increasing the value will increase the demand on your memory resources.
On 24-MB systems, the default value of the min-free-vnodes attribute is 150. On 32-MB or larger systems, the default value depends on the value of the maxusers attribute. For these systems, if the value of min-free-vnodes is close to the value of the max-vnodes attribute, vnode deallocation will not be effective. If the value of min-free-vnodes is larger than the value of max-vnodes, vnode deallocations will not occur.
If the value of min-free-vnodes must be close to the value of max-vnodes, you may want to disable vnode deallocation (see Section 5.3.3.10). However, disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. On systems that need to reclaim the memory used by vnodes, make sure that the value of min-free-vnodes is significantly lower than the value of max-vnodes.
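For example, assuming that these attributes reside in the vfs subsystem, you can compare the current values before changing them:
     # sysconfig -q vfs min-free-vnodes max-vnodes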
You can increase the value of the vnode-age attribute to increase the amount of time for which vnodes are kept on the free list. This increases the possibility that a vnode will be successfully looked up. The default value for vnode-age is 120 seconds on 32-MB or larger systems and 2 seconds on 24-MB systems.
Increase the value of the namei-cache-valid-time attribute to delay the deallocation of vnodes. This can improve namei cache lookup operations, but it consumes memory resources.
Decrease the value of the namei-cache-valid-time attribute to accelerate the deallocation of vnodes. This causes vnodes to be deallocated from the namei cache at a faster rate, but reduces the efficiency of the cache.
To optimize processing time, disable vnode deallocation by setting the value of the vnode-deallocation-enable attribute to 0. Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. You may want to disable vnode deallocation if the value of min-free-vnodes is close to the value of max-vnodes.
The open-max-soft and open-max-hard attributes control the maximum number of open file descriptors for each process. When the open-max-soft limit is reached, a warning message is issued, and when the open-max-hard limit is reached, the process is stopped. These attributes prevent runaway allocations (for example, allocations within a loop that cannot be exited because of an error condition) from consuming all the available file descriptors.
The open-max-soft and open-max-hard attributes both have default values of 4096 file descriptors (open files) per process. The maximum number of open files per process is 65,536. If your applications require many open files, you may want to increase the maximum open file descriptor limit. Increasing the limit provides more file descriptors to a process, but it increases the possibility of runaway allocations. In addition, if you increase the number of open files per process, make sure that the max-vnodes attribute is set to an adequate value.
See the Release Notes for information about increasing the open file descriptor limit.
Decreasing the open file descriptor limit decreases the number of file descriptors available to each process and prevents a process from consuming all the file descriptors. However, decreasing the limit may adversely affect the performance of processes that require many file descriptors.
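For example, the following /etc/sysconfigtab stanza raises both limits to 8192 descriptors per process. The value is illustrative, and the stanza assumes that the attributes reside in the proc subsystem.
     proc:
         open-max-soft = 8192
         open-max-hard = 8192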
If you have an AlphaServer 8200 or 8400, the dma-sg-map-unload-zero attribute controls whether the direct memory access (DMA) scatter/gather map registers clear after an I/O operation completes. If your system utilizes large amounts of memory or storage, you may be able to gain some I/O performance benefit by setting the attribute to zero.
The Logical Storage Manager (LSM) can improve system performance and provide high data availability. LSM also provides you with online storage management features and enhanced performance information and statistics, with little additional overhead. Although any type of system can benefit from LSM, it is especially suited for large systems with large numbers of disks.
LSM volumes are used in the same way as disk partitions. You can create UFS file systems and AdvFS file domains and filesets on an LSM volume, or you can use a volume as a raw device.
To set up a high-performance LSM configuration, you must be careful how you configure the following:
Disks, disk groups, and databases (see Section 5.4.1)
Mirrored disks (see Section 5.4.2)
Striped disks (see Section 5.4.3)
The Logical Storage Manager manual provides detailed information about using LSM. The following sections describe configuration and tuning guidelines for LSM.
The following sections provide general guidelines to configure LSM disks, disk groups, and databases. How you configure your LSM disks and disk groups determines the flexibility of your LSM configuration.
In addition, each LSM disk group maintains a configuration database, which includes detailed information about mirrored and striped disks and volume, plex, and subdisk records.
Table 5-5 lists LSM disk, disk group, and database configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Initialize your LSM disks as sliced disks (Section 5.4.1.1) | Provides greater storage configuration flexibility | None |
Increase the maximum number of LSM volumes (Section 5.4.1.2) | Improves performance on VLM/VLDB systems | None |
Make the rootdg disk group a sufficient size (Section 5.4.1.3) | Ensures sufficient space for disk group information | None
Use a sufficient private region size for each disk (Section 5.4.1.4) | Ensures sufficient space for database copies | Large private regions require more disk space |
Make the private regions in a disk group the same size (Section 5.4.1.5) | Efficiently utilizes the configuration space | None |
Group disks into different disk groups (Section 5.4.1.6) | Allows you to move disk groups between systems | Reduces flexibility when configuring volumes |
Use an appropriate size and number of database and log copies (Section 5.4.1.7) | Ensures database availability and improves performance | None |
Place disks containing database and log copies on different buses (Section 5.4.1.8) | Improves availability | Cost of additional hardware |
The following sections describe these guidelines in detail.
Initialize your LSM disks as sliced disks, instead of as simple disks. A sliced disk provides greater storage configuration flexibility because the entire disk is under LSM control. The disk label for a sliced disk contains information that identifies the partitions containing the private and the public regions. In contrast, simple disks have both public and private regions in the same partition.
For large systems, increase the value of the max-vol attribute, which specifies the maximum number of volumes per system. The default is 1024; you can increase it to 4096.
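For example, an /etc/sysconfigtab stanza similar to the following raises the limit. The lsm subsystem name is an assumption and the value is illustrative; verify the correct subsystem name with the sysconfig -s command before applying the change.
     lsm:
         max-vol = 4096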
You must make sure that the rootdg disk group has an adequate size, because the disk group's configuration database contains records for disks outside of the rootdg disk group, in addition to the ordinary disk-group configuration information. For example, the rootdg configuration database includes disk-access records that define all disks under LSM control. The rootdg disk group must be large enough to contain records for the disks in all the disk groups. See Table 5-6 for more information.
You must make sure that the private region for each disk has an adequate size. LSM keeps disk media label and configuration database copies in each disk's private region.
A private region must be large enough to accommodate the size of the LSM database copies. In addition, the maximum number of LSM objects (disks, subdisks, volumes, and plexes) in a disk group depends on an adequate private region size. However, a large private region requires more disk space. The default private region size is 1024 blocks, which is usually adequate for configurations using up to 128 disks per disk group.
The private region of each disk in a disk group should be the same size, in order to efficiently utilize the configuration space. One or two LSM configuration database copies can be stored in a disk's private region.
When you add a new disk to an existing LSM disk group, the size of the private region on the new disk is determined by the private region size of the other disks in the disk group. As you add more disks to a disk group, the voldiskadd utility reduces the number of configuration copies and log copies that are initialized for the new disks. See voldiskadd(8) for more information.
You may want to group disks in disk groups according to their function. This enables disk groups to be moved between systems, and decreases the size of the LSM configuration database for each disk group. However, using multiple disk groups reduces flexibility when configuring volumes.
Each disk group maintains a configuration database, which includes detailed information about mirrored and striped disks and volume, plex, and subdisk records. The LSM subsystem's overhead primarily involves managing the kernel change logs and copies of the configuration databases.
LSM performance is affected by the size and the number of copies of the configuration database and the kernel change log. They determine the amount of time it takes for LSM to start up, for changes to the configuration to occur, and for the LSM disks to fail over in a cluster.
Usually, each disk in a disk group contains one or two copies of both the kernel change log and the configuration database. Disk groups consisting of more than eight disks should not have copies on all disks. Always use four to eight copies.
The number of kernel change log copies must be the same as the number of configuration database copies. For the best performance, the number of copies must be the same on each disk that contains copies.
Table 5-6 describes the guidelines for configuration database and kernel change log copies.
Disks Per Disk Group | Size of Private Region (in Blocks) | Configuration and Kernel Change Log Copies Per Disk |
1 to 3 | 512 | Two copies in each private region |
4 to 8 | 512 | One copy in each private region |
9 to 32 | 512 | One copy on four to eight disks, zero copies on remaining disks |
33 to 128 | 1024 | One copy on four to eight disks, zero copies on remaining disks |
129 to 256 | 1536 | One copy on four to eight disks, zero copies on remaining disks |
257 or more | 2048 | One copy on four to eight disks, zero copies on remaining disks |
For disk groups with large numbers of disks, place the disks that contain configuration database and kernel change log copies on different buses. This provides you with better performance and higher availability.
Use LSM mirrored volumes for high data availability. If a physical disk fails, the mirrored plex (copy) containing the failed disk becomes temporarily unavailable, but the remaining plexes are still available. A mirrored volume has at least two plexes for data redundancy.
Mirroring can also improve read performance. However, a write to a volume results in parallel writes to each plex, so write performance may be degraded. Environments whose disk I/O operations are predominantly reads obtain the best performance results from mirroring. See Table 5-7 for guidelines.
In addition, use block-change logging (BCL) to reduce the synchronization time and thereby improve the mirrored volume recovery rate after a system failure. If BCL is enabled and a write is made to a mirrored plex, BCL identifies the block numbers that have changed and then stores the numbers on a logging subdisk. BCL is not used for reads.
BCL is enabled if two or more plexes in a mirrored volume have a logging subdisk associated with them. Only one logging subdisk can be associated with a plex. BCL can add some overhead to your system and degrade the mirrored volume's write performance. However, the impact is less for systems under a heavy I/O load, because multiple writes to the log are batched into a single write. See Table 5-8 for guidelines.
Note that BCL will be replaced by dirty region logging (DRL) in a future release.
Table 5-7 lists LSM mirrored volume configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Map mirrored plexes across different buses (Section 5.4.2.1) | Improves performance and increases availability | None |
Use the appropriate read policy (Section 5.4.2.2) | Efficiently distributes reads | None |
Attach up to eight plexes to the same volume (Section 5.4.2.3) | Improves performance for read-intensive workloads and increases availability | Uses disk space inefficiently |
Use a symmetrical configuration (Section 5.4.2.4) | Provides more predictable performance | None |
Use block-change logging (Table 5-8) | Improves mirrored volume recovery rate | May decrease write performance |
Stripe the mirrored volumes (Table 5-9) | Improves disk I/O performance and balances I/O load | Increases management complexity |
Table 5-8 lists LSM block-change logging (BCL) configuration guidelines and performance benefits as well as tradeoffs.
Action | Benefit | Tradeoff |
Configure multiple logging subdisks (Section 5.4.2.5) | Improves recovery time | Requires additional disks |
Use a write-back cache for logging subdisks (Section 5.4.2.6) | Minimizes BCL's write degradation | Cost of hardware RAID subsystem
Use the appropriate BCL subdisk size (Section 5.4.2.7) | Enables migration to dirty region logging | None |
Place logging subdisks on infrequently used disks (Section 5.4.2.8) | Helps to prevent disk bottlenecks | None |
Use solid-state disks for logging subdisks (Section 5.4.2.9) | Minimizes BCL's write degradation | Cost of disks |
The following sections describe these guidelines in detail.
Putting each mirrored plex on a different bus improves performance and availability by helping to prevent bus bottlenecks, and allowing simultaneous I/O operations. Mirroring across different buses also increases availability by protecting against bus and adapter failure.
To provide optimal performance for different types of mirrored volumes, LSM supports the following read policies:
Round-robin read
Satisfies read operations to the volume in a round-robin manner from all plexes in the volume.
Preferred read
Satisfies read operations from one specific plex (usually the plex with the highest performance).
Select
Selects a default read policy, based on the plex associations to the volume. If the mirrored volume contains a single, enabled, striped plex, the default is to prefer that plex. For any other set of plex associations, the default is to use a round-robin policy.
If one plex exhibits superior performance, either because the plex is striped across multiple disks or because it is located on a much faster device, then set the read policy to preferred read for that plex. By default, a mirrored volume with one striped plex should have the striped plex configured as the preferred read plex. Otherwise, you should almost always use the round-robin read policy.
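For example, the following sketch sets the preferred read policy on a mirrored volume. It assumes that the volume utility accepts the rdpol keyword and that vol01 contains a striped plex named vol01-01; both names are illustrative.
     # volume rdpol prefer vol01 vol01-01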
To improve performance for read-intensive workloads, up to eight plexes can be attached to the same mirrored volume. However, this configuration does not use disk space efficiently.
A symmetrical mirrored disk configuration provides predictable performance and easy management. Use the same number of disks in each mirrored plex. For mirrored striped volumes, you can stripe across half of the available disks to form one plex and across the other half to form the other plex.
Using multiple block-change logging (BCL) subdisks will improve recovery time after a failure.
To minimize BCL's impact on write performance, use LSM in conjunction with a RAID subsystem that has a write-back cache. Typically, the BCL performance degradation is more significant on systems with few writes than on systems with heavy write loads.
To support migration from BCL to dirty region logging (DRL), which will be supported in a future release, use the appropriate BCL subdisk size.
If you have less than 64 GB of disk space under LSM control, calculate the subdisk size by allowing 1 block for each gigabyte of storage. If the result is an odd number, add 1 block; if the result is an even number, add 2 blocks. For example, if you have 1 GB (or less) of space, use a 2-block subdisk. If you have 2 GB (or 3 GB) of space, use a 4-block subdisk.
If you have more than 64 GB of disk space under LSM control, use a 64-block subdisk.
Place a logging subdisk on an infrequently used disk. Because this subdisk is frequently written, do not put it on a busy disk. Do not configure BCL subdisks on the same disks as the volume data, because this will cause head seeking or thrashing.
If persistent (nonvolatile) solid-state disks are available, use them for logging subdisks.
Striping volumes can increase performance because parallel I/O streams can operate concurrently on separate devices. Striping can improve the performance of applications that perform large sequential data transfers or multiple, simultaneous I/O operations.
Striping distributes data across the disks in a volume in stripes with a fixed size. The stripes are interleaved across the striped plex's subdisks, which are located on different disks, to evenly distribute disk I/O.
The performance benefit of striping depends on the stripe width, which is the number of blocks in a stripe, and how your users and applications perform I/O. Bandwidth increases with the number of disks across which a plex is striped. See Table 5-9 for guidelines.
Table 5-9 lists LSM striped volume configuration guidelines and performance benefits as well as tradeoffs.
You may want to combine mirroring with striping to obtain both high availability and high performance. See Table 5-7 and Table 5-9 for guidelines.
Action | Benefit | Tradeoff |
Use multiple disks in a striped volume (Section 5.4.3.1) | Improves performance | Decreases volume reliability |
Distribute subdisks across different disks and buses (Section 5.4.3.2) | Improves performance and increases availability | None |
Use the appropriate stripe width (Section 5.4.3.3) | Improves performance | None |
Avoid splitting small data transfers (Section 5.4.3.3) | Improves the performance of volumes that quickly receive multiple data transfers | May use disk space inefficiently |
Split large individual data transfers (Section 5.4.3.3) | Improves the performance of volumes that receive large data transfers | Decreases throughput |
The following sections discuss these guidelines in detail.
Increasing the number of disks in a striped volume can increase the bandwidth, depending on the applications and file systems you are using and on the number of simultaneous users. However, this reduces the effective mean-time-between-failures (MTBF) of the volume. If this reduction is a problem, use both striping and mirroring.
Distribute the subdisks of a striped volume across different buses. This improves performance and helps to prevent a single bus from becoming a bottleneck.
The performance benefit of striping depends on the size of the stripe width and the characteristics of the I/O load. Stripes of data are allocated alternately and evenly to the subdisks of a striped plex. A striped plex consists of a number of equal-sized subdisks located on different disks.
The number of blocks in a stripe determines the stripe width. LSM uses a default stripe width of 64 KB (or 128 sectors), which works well in most environments.
Use the volstat command to determine the number of data transfer splits. For volumes that receive only small I/O transfers, you may not want to use striping because disk access time is important.
Striping is beneficial for large data transfers.
To improve performance of large sequential data transfers, use a stripe width that will divide each individual data transfer and distribute the blocks equally across the disks.
To improve the performance of multiple simultaneous small data transfers, make the stripe width the same size as the data transfer. However, an excessively small stripe width can result in poor system performance.
If you are striping mirrored volumes, ensure that the stripe width is the same for each plex.
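As an illustration only, a striped volume might be created with a volassist command similar to the following. The volume name, size, and the layout and nstripe attribute names are assumptions used here for the sketch; verify the exact syntax and attribute names in volassist(8) before using them.
    # volassist make datavol 2g layout=stripe nstripe=4
If you create mirrored striped volumes this way, create each plex with the same stripe width, as noted above.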
After you set up the LSM configuration, you may be able to improve performance. For example, you can perform the following tasks:
Balance the I/O load
LSM allows you to achieve a fine level of granularity in data placement, because LSM provides a way for volumes to be distributed across multiple disks. After measuring actual data-access patterns, you can adjust the placement of file systems.
You can reassign data to specific disks to balance the I/O load among the available storage devices. You can reconfigure volumes on line after performance patterns have been established without adversely impacting volume availability.
Use striping to increase bandwidth for frequently accessed data
LSM provides a significant improvement in performance when there are multiple I/O streams. After you identify the most frequently accessed file systems and databases, you can realize significant performance benefits by striping the high traffic data across portions of multiple disks, which increases bandwidth to this data.
Set the preferred read policy to the fastest mirrored plex
If one plex of a mirrored volume exhibits superior performance, either because the disk is being striped or concatenated across multiple disks, or because it is located on a much faster device, then set the read policy to the preferred read policy for that plex. By default, a mirrored volume with one striped plex should be configured with the striped plex as the preferred read.
Increase the value of the volinfo.max_io parameter. This can improve the performance of systems that use large amounts of memory or storage.
Hardware RAID subsystems increase your storage capacity and provide different degrees of performance and availability at various costs. For example, some hardware RAID subsystems support dual-redundant RAID controllers and a nonvolatile write-back cache, which greatly improve performance and availability. Entry-level hardware RAID subsystems provide cost-efficient RAID functionality.
Table 5-10 lists hardware RAID subsystem configuration guidelines and performance benefits as well as tradeoffs.
Hardware | Performance Benefit | Tradeoff |
Evenly distribute disks in a storage set across different buses (Section 5.5.1) | Improves performance and helps to prevent bottlenecks | None |
Ensure that the first member of each mirrored set is on a different bus | Improves performance | None |
Use disks with the same data capacity in each storage set (Section 5.5.2) | Improves performance | None |
Use the appropriate chunk size (Section 5.5.3) | Improves performance | None |
Stripe mirrored sets (Section 5.5.4) | Increases availability and read performance | May degrade write performance |
Use a write-back cache (Section 5.5.5) | Improves write performance | Cost of hardware |
Use dual-redundant RAID controllers (Section 5.5.6) | Improves performance, increases availability, and prevents I/O bus bottlenecks | Cost of hardware |
Install spare disks (Section 5.5.7) | Improves availability | Cost of disks |
Replace failed disks promptly (Section 5.5.7) | Improves performance | None |
The following sections describe some of these guidelines. See your RAID subsystem documentation for detailed configuration information.
You can improve performance and help to prevent bottlenecks by distributing storage set disks evenly across different buses.
Make sure that the first member of each mirrored set is on a different bus.
Use disks with the same capacity in the same storage set.
The performance benefit of stripe sets depends on how your users and applications perform I/O and the chunk (stripe) size. For example, if you choose a stripe size of 8 KB, small data transfers will be distributed evenly across the member disks. However, a 64-KB data transfer will be divided into at least eight data transfers.
You may want to use a stripe size that will prevent any particular range of blocks from becoming a bottleneck. For example, if an application often uses a particular 8-KB block, you may want to use a stripe size that is slightly larger or smaller than 8 KB or is a multiple of 8 KB, in order to force the data onto a different disk.
If the stripe size is large compared to the average I/O size, each disk in a stripe set can respond to a separate data transfer. I/O operations can then be handled in parallel, which increases sequential write performance and throughput. This can improve performance for environments that perform large numbers of I/O operations, including transaction processing, office automation, and file services environments, and for environments that perform multiple random read and write operations.
If the stripe size is smaller than the average I/O operation, multiple disks can simultaneously handle a single I/O operation, which can increase bandwidth and improve sequential file processing. This is beneficial for image processing and data collection environments. However, do not make the stripe size so small that it will degrade performance for large sequential data transfers.
If your applications are doing I/O to a raw device and not a file system, use a stripe size that distributes a single data transfer evenly across the member disks. For example, if the typical I/O size is 1 MB and you have a four-disk array, you could use a 256-KB stripe size. This would distribute the data evenly among the four member disks, with each doing a single 256-KB data transfer in parallel.
For small file system I/O operations, use a stripe size that is a multiple of the typical I/O size (for example, four to five times the I/O size). This will help to ensure that the I/O is not split across disks.
You can stripe mirrored sets to improve performance.
RAID subsystems support, either as a standard or an optional feature, a nonvolatile write-back cache that can improve disk I/O performance while maintaining data integrity. A write-back cache improves performance for systems that perform large numbers of writes. Applications that perform few writes will not benefit from a write-back cache.
With write-back caching, data intended to be written to disk is temporarily stored in the cache and then periodically written (flushed) to disk for maximum efficiency. I/O latency is reduced by consolidating contiguous data blocks from multiple host writes into a single unit.
A write-back cache improves performance because writes appear to be executed immediately. If a failure occurs, upon recovery, the RAID controller detects any unwritten data that still exists in the write-back cache and writes the data to disk before enabling normal controller operations.
A write-back cache must be battery-backed to protect against data loss and corruption.
If you are using an HSZ40 or HSZ50 RAID controller with a write-back cache, the following guidelines may improve performance:
Set CACHE_POLICY to B.
Set CACHE_FLUSH_TIMER to a minimum of 45 seconds.
Enable the write-back cache (WRITEBACK_CACHE) for each unit, and set the value of MAXIMUM_CACHED_TRANSFER_SIZE to a minimum of 256.
See the HSZ documentation for more information.
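For example, at the controller's command-line interface you might enter commands similar to the following. The unit name D101 is illustrative, and the placement of each parameter on the controller or on the unit may vary by HSOF version, so verify the exact syntax in your HSZ documentation.
    SET THIS_CONTROLLER CACHE_POLICY=B
    SET THIS_CONTROLLER CACHE_FLUSH_TIMER=45
    SET D101 WRITEBACK_CACHE
    SET D101 MAXIMUM_CACHED_TRANSFER_SIZE=256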
If supported, use a dual-redundant controller configuration and balance the number of disks across the two controllers. This can improve performance, increase availability, and prevent I/O bus bottlenecks.
Install predesignated spare disks on separate controller ports and storage shelves. This will help you to maintain data availability if a disk failure occurs.
The Advanced File System (AdvFS) allows you to put multiple volumes (disks, LSM volumes, or RAID storage sets) in a file domain and distribute the filesets and files across the volumes. A file's blocks usually reside together on the same volume, unless the file is striped or the volume is full. Each new file is placed on the successive volume by using round robin scheduling. See the AdvFS Guide to File System Administration for more information on using AdvFS.
The following sections describe how to configure and tune AdvFS for high performance.
You will obtain the best performance if you carefully plan your AdvFS configuration. Table 5-11 lists AdvFS configuration guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-2 and Table 5-3 apply to AdvFS configurations.
Action | Performance Benefit | Tradeoff |
Use multiple-volume file domains (Section 5.6.1.1) | Improves throughput and simplifies management | Increases chance of domain failure and may cause log bottleneck |
Use several file domains instead of one large domain (Section 5.6.1.1) | Prevents log from becoming a bottleneck | Increases maintenance complexity |
Place transaction log on fast or uncongested volume (Section 5.6.1.2) | Prevents log from becoming a bottleneck | None |
Preallocate space for the BMT (Section 5.6.1.3) | Prevents prematurely running out of domain space | Reduces available disk space |
Increase the number of pages by which the BMT extent size grows (Section 5.6.1.3) | Prevents prematurely running out of domain space | Reduces available disk space |
Stripe files (Section 5.6.1.4) | Improves sequential read and write performance | Increases chance of domain failure |
Use quotas (Section 5.6.1.5) | Controls file system space utilization | None |
The following sections describe these AdvFS configuration guidelines in more detail.
Using multiple-volume file domains allows greater control over your physical resources, and may improve a fileset's total throughput. However, be sure that the log does not become a bottleneck. Multiple-volume file domains improve performance because AdvFS generates parallel streams of output using multiple device consolidation queues.
In addition, using only a few file domains instead of many file domains reduces the overall management effort, because fewer file domains require less administration. However, a single volume failure within a file domain renders the entire file domain inaccessible. Therefore, the more volumes you have in a file domain, the greater the risk that the domain will fail.
DIGITAL recommends that you use a maximum of 12 volumes in each file domain. However, to reduce the risk of file domain failure, limit the number of volumes per file domain to three or use mirrored volumes created with LSM.
For multiple-volume domains, make sure that busy files are not located on the same volume. Use the migrate command to move files across volumes.
Each file domain has a transaction log that keeps track of fileset activity for all filesets in the file domain. The AdvFS file domain transaction log may become a bottleneck. This can occur if the log resides on a congested disk or bus, or if the file domain contains many filesets.
To prevent the log from becoming a bottleneck, put the log on a fast, uncongested volume. You may want to put the log on a disk that contains only the log. See Section 5.6.2.9 for information on moving an existing transaction log.
To make the transaction log highly available, use LSM to mirror the log.
The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.
The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time AdvFS needs additional metadata, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, the space added by a single BMT expansion may be described by several extents.
If a file domain has a large number of small files, you may prematurely run out of disk space for the BMT. Handling many small files makes the system request metadata extents more frequently, which causes the metadata to become fragmented. Because the number of BMT extents is limited, the file domain will appear to be out of disk space if the BMT cannot be extended to map new files.
To monitor the BMT, use the vbmtpg command and examine the number of free mcells (freeMcellCnt). The value of freeMcellCnt can range from 0 to 22. A volume with 1 free mcell has very little space in which to grow the BMT. See vbmtpg(8) for more information.
You can also invoke the showfile command and specify mount_point/.tags/M-6 to examine the BMT extents on the first domain volume that contains the fileset mounted on the specified mount point. To examine the extents of the other volumes in the domain, specify M-12, M-18, and so on.
If the extents at the end of the BMT are smaller than the extents at the beginning of the file, the BMT is becoming fragmented. See showfile(8) for more information.
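For example, assuming a fileset mounted on /accounts (an illustrative mount point), a command similar to the following displays the BMT for the first volume in the domain. The -x option, which shows extent information, is an assumption to verify in showfile(8).
    # showfile -x /accounts/.tags/M-6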
If you are prematurely out of BMT disk space, you may be able to eliminate the problem by defragmenting the file domain that contains the volume. See defragment(8) for more information.
Table 5-12 provides some BMT sizing guidelines for the number of pages to preallocate for the BMT, and the number of pages by which the BMT extent size grows. The BMT sizing depends on the maximum number of files you expect to create on a volume.
Estimated Maximum Number of Files on a Volume | Number of Pages to Preallocate | Number of Pages to Grow Extent |
< 50,000 | 3600 | 128 |
100,000 | 7200 | 256 |
200,000 | 14,400 | 512 |
300,000 | 21,600 | 768 |
400,000 | 28,800 | 1024 |
800,000 | 57,600 | 2048 |
You can preallocate space for the BMT when the file domain is created, and when a volume is added to the domain, by using the mkfdmn and addvol commands with the -p flag.
You can also modify the number of extent pages by which the BMT grows when a file domain is created and when a volume is added to the domain by using the mkfdmn and addvol commands with the -x flag.
If you use the mkfdmn -x or the addvol -x command when there is a large amount of free space on a disk, the BMT will expand by the specified number of pages as files are created, and those pages will be in one extent. As the disk becomes more fragmented, the BMT will still expand, but the pages will not be contiguous and will require more extents. Eventually, the BMT will run out of its limited number of extents even though the growth size is large.
Using the mkfdmn -p or the addvol -p command to preallocate a large BMT before the disk is fragmented may prevent this problem, because the entire preallocated BMT is described in one extent. All subsequent growth will be able to use nearly all of the limited number of BMT extents. Do not overallocate BMT space, because the disk space cannot be used for other purposes. However, too little BMT space will eventually cause the BMT to grow by a fixed amount; at that time, the disk may be fragmented and the growth will require multiple extents. See mkfdmn(8) and addvol(8) for more information.
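For example, to size a domain for roughly 100,000 files per volume (see Table 5-12), you might preallocate 7200 BMT pages and set the growth increment to 256 pages. The device special files and the domain name below are illustrative only.
    # mkfdmn -p 7200 -x 256 /dev/rz3c accounts_domain
    # addvol -p 7200 -x 256 /dev/rz4c accounts_domain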
The AdvFS stripe utility lets you improve the read and write performance of an individual file. This is useful if an application continually accesses a few specific files. See stripe(8) for more information.
The utility directs a zero-length file (a file with no data written to it yet) to be distributed evenly across several volumes in a file domain. As data is appended to the file, the data is spread across the volumes. AdvFS determines the number of pages per stripe segment and alternates the segments among the disks in a sequential pattern. Bandwidth can be improved by distributing file data across multiple volumes.
Do not stripe both a file and the disk on which it resides.
To determine if you should stripe files, use the iostat utility. Cross-check the blocks per second and the I/O operations per second against the disk's bandwidth capacity. If the disk access time is slow in comparison to the stated capacity, file striping may improve performance.
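For example, to distribute a newly created (zero-length) file across three volumes in its domain, you might enter a command similar to the following. The file name is illustrative, and the exact option letter should be confirmed in stripe(8).
    # stripe -n 3 /accounts/db/history.dat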
AdvFS quotas allow you to track and control the amount of physical storage that a user, group, or fileset consumes. AdvFS eliminates the slow reboot activities associated with UFS quotas. In addition, AdvFS quota information is always maintained, but quota enforcement can be activated and deactivated.
For information about UFS quotas, see Section 5.7.1.6.
After you configure AdvFS, you may be able to tune it to improve performance. Table 5-13 lists AdvFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-4 apply to AdvFS configurations.
Action | Performance Benefit | Tradeoff |
Increase the percentage of memory allocated for the AdvFS buffer cache (Section 5.6.2.1) | Improves AdvFS performance if data reuse is high | Consumes memory |
Defragment file domains (Section 5.6.2.2) | Improves read and write performance | None |
Increase the dirty data caching threshold (Section 5.6.2.3) | Improves random write performance | May cause I/O spikes or increase the number of lost buffers if a crash occurs |
Decrease the I/O transfer read-ahead size (Section 5.6.2.4) | Improves performance for mmap page faulting | None |
Disable the flushing of dirty pages mapped with the mmap function during a sync call (Section 5.6.2.5) | May improve performance for applications that manage their own flushing | None |
Modify the AdvFS device queue limit (Section 5.6.2.6) | Influences the time to complete synchronous (blocking) I/O | May cause I/O spikes |
Consolidate I/O transfers (Section 5.6.2.7) | Improves AdvFS performance | None |
Force all AdvFS file writes to be synchronous (Section 5.6.2.8) | Ensures that data is successfully written to disk | May degrade file system performance |
Move the transaction log to a fast or uncongested volume (Section 5.6.2.9) | Prevents log from becoming a bottleneck | None |
Balance files across volumes in a file domain (Section 5.6.2.10) | Improves performance and evens the future distribution of files | None |
Migrate frequently used or large files to different file domains (Section 5.6.2.11) | Improves I/O performance | None |
Decrease the size of the metadata buffer cache to 1 percent (Section 4.7.21) | Improves performance for systems that use only AdvFS | None |
The following sections describe how to tune AdvFS in detail.
The AdvfsCacheMaxPercent attribute specifies the amount of physical memory that AdvFS uses for its buffer cache. You may improve AdvFS performance by increasing the percentage of memory allocated to the AdvFS buffer cache. To do this, increase the value of the AdvfsCacheMaxPercent attribute. The default is 7 percent of memory, the minimum is 1 percent, and the maximum is 30 percent.
Increasing the value of the AdvfsCacheMaxPercent attribute will decrease the amount of memory available to the virtual memory subsystem, so you must make sure that you do not cause excessive paging and swapping. Use the vmstat command to check virtual memory statistics.
You may want to increase the AdvFS buffer cache size if data reuse is high. If you increase the value of the AdvfsCacheMaxPercent attribute and experience no performance benefit, return to the original value. If data reuse is insignificant or if you have more than 2 GB of memory, you may want to decrease the cache size.
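For example, to raise the cache to 10 percent of memory, you might add a stanza similar to the following to /etc/sysconfigtab so that the value persists across reboots. The stanza assumes that the attribute belongs to the advfs subsystem; confirm the subsystem and attribute names for your operating system version (for example, with the sysconfig -q command).
    advfs:
        AdvfsCacheMaxPercent = 10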
AdvFS attempts to store file data in a collection of contiguous blocks (a file extent) on a disk. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.
File fragmentation degrades read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space, due to fragmentation.
Use the defragment utility with the -v and -n options to show the amount of file fragmentation. The defragment utility reduces the amount of file fragmentation in a file domain by attempting to make the files more contiguous, which reduces the number of file extents. The utility does not affect data availability and is transparent to users and applications. Striped files are not defragmented.
You can improve the efficiency of the defragment process by deleting any unneeded files in the file domain before running the defragment utility. See defragment(8) for more information.
Dirty or modified data is data that has been written by an application and cached but has not yet been written to disk. You can increase the amount of dirty data that AdvFS will cache for each volume in a file domain. This can improve write performance for systems that perform many random writes by increasing the number of cache hits.
You can increase the amount of cached dirty data for all new volumes or for a specific, existing volume. The default value is 16 KB. The minimum value is 0, which disables dirty data caching. The maximum value is 32 KB.
If you have high data reuse (data is repeatedly read and written), you may want to increase the dirty data threshold. If you have low data reuse, you may want to decrease the threshold or use the default value.
Use the chvol -t command to modify the dirty data threshold for an individual existing volume. You must specify the number of dirty, 512-byte blocks to cache. See chvol(8) for more information.
To modify the dirty data threshold for all new volumes, modify the value of the AdvfsReadyQLim attribute, which specifies the number of 512-byte blocks that can be on the readylazy queue before the requests are moved to the device queue.
If you change the dirty data threshold and performance does not improve, return to the original value.
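For example, because the threshold is specified in 512-byte blocks, raising it from the 16-KB default to the 32-KB maximum means specifying 64 blocks. The device and domain names below are illustrative only.
    # chvol -t 64 /dev/rz3c accounts_domain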
AdvFS reads and writes data by a fixed number of 512-byte blocks. The default is 128 blocks. Use the chvol command with the -w option to change the write-consolidation size. Use the chvol command with the -r option to change the read-ahead size. See chvol(8) for more information.
You may be able to improve performance for mmap page faulting and reduce read-ahead paging and cache dilution by decreasing the read-ahead size.
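For example, to halve the default 128-block read-ahead size on a volume that is accessed mostly through mmap, you might enter a command similar to the following (the device and domain names are illustrative).
    # chvol -r 64 /dev/rz3c accounts_domain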
If the disk is fragmented so that the pages of a file are not sequentially allocated, reduce fragmentation by using the defragment utility. See defragment(8) for more information.
A file can have dirty data in memory due to a write system call or a memory write reference after an mmap system call. The update daemon runs every 30 seconds and issues a sync call for every fileset mounted with read and write access.
The AdvfsSyncMmapPages attribute controls whether modified (dirty) mmapped pages are flushed to disk during a sync system call. If the AdvfsSyncMmapPages attribute is set to 1, the dirty mmapped pages are asynchronously written to disk. If the AdvfsSyncMmapPages attribute is set to 0, dirty mmapped pages are not written to disk during a sync system call.
If your applications manage their own mmap page flushing, set the value of the AdvfsSyncMmapPages attribute to 0. See mmap(2) and msync(2) for more information.
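For example, a stanza similar to the following in /etc/sysconfigtab disables the flushing of dirty mmapped pages during a sync call. As with the other AdvFS attributes, the subsystem name shown here is an assumption to confirm for your version.
    advfs:
        AdvfsSyncMmapPages = 0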
Synchronous and asynchronous AdvFS I/O requests are placed on separate consolidation queues, where small, logically contiguous block requests are consolidated into larger I/O requests. The consolidated synchronous and asynchronous I/O requests are moved to the AdvFS device queue and then sent to the device driver.
The AdvfsMaxDevQLen attribute limits the AdvFS device queue length. When the number of requests on the device queue exceeds the value of the AdvfsMaxDevQLen attribute, only synchronous requests are accepted onto the device queue. The default value of the AdvfsMaxDevQLen attribute is 80.
Limiting the size of the device queue affects the amount of time it takes to complete a synchronous (blocking) I/O operation. AdvFS issues several types of blocking I/O operations, including AdvFS metadata and log data operations.
The default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations. However, you may need to modify this value if you are using fast or slow adapters, striping, or mirroring. A higher value may improve throughput, but will also increase synchronous read/write time. To calculate response time, multiply the value of the AdvfsMaxDevQLen attribute by 9 milliseconds (the average I/O latency).
A guideline is to specify a value for the AdvfsMaxDevQLen attribute that is less than or equal to the average number of I/O operations that can be performed in 0.5 seconds.
If you do not want to limit the number of requests on the device queue, set the value of the AdvfsMaxDevQLen attribute to 0 (zero).
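For example, if a volume sustains about 110 I/O operations per second (an illustrative figure), it completes roughly 55 operations in 0.5 seconds, so a value of 55 or less for the AdvfsMaxDevQLen attribute would be reasonable; the corresponding worst-case synchronous response time is approximately 55 multiplied by 9 milliseconds, or about 0.5 seconds.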
Consolidating a number of I/O transfers into a single, large I/O transfer can improve AdvFS performance. To do this, use the chvol command with the -c on flag. This is the default. DIGITAL recommends that you do not disable the consolidation of I/O transfers. See chvol(8) for more information.
Use the chfile -l on command to force all write requests to an AdvFS file to be synchronous. When forced synchronous write requests to a file are enabled, the write system call returns a success value only after the data has been successfully written to disk. This may degrade file system performance.
When forced synchronous write requests to a file are disabled, the write system call returns a success value when the requests are cached. The data is then written to disk at a later time (asynchronously).
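For example, to force synchronous writes to a critical data file and later restore the default asynchronous behavior, you might enter commands similar to the following (the file name is illustrative).
    # chfile -l on /accounts/db/journal.dat
    # chfile -l off /accounts/db/journal.dat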
Make sure that the AdvFS transaction log resides on an uncongested disk and bus; otherwise, system performance may be degraded.
If the transaction log becomes a bottleneck, use the switchlog command to relocate the transaction log of the specified file domain to a faster or less congested volume in the same domain. Use the showfdmn command to determine the current location of the transaction log. In the showfdmn command display, the letter L is displayed next to the volume that contains the log. See switchlog(8) and showfdmn(8) for more information.
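For example, the following commands display the volumes of an illustrative domain (the L flag in the output marks the log volume) and then move the log to volume 2. The form of the switchlog volume argument is an assumption; verify it in switchlog(8).
    # showfdmn accounts_domain
    # switchlog accounts_domain 2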
In addition, you can divide the file domain into several smaller file domains. This will cause each domain's transaction log to handle transactions for fewer filesets.
If the files in a multivolume domain are not evenly distributed, performance may be degraded. The balance utility distributes the percentage of used space evenly between volumes in a multivolume file domain. This improves performance and evens the distribution of future file allocations. Files are moved from one volume to another until the percentage of used space on each volume in the domain is as equal as possible. The balance utility does not affect data availability and is transparent to users and applications.
If possible, use the defragment utility before you balance files. The balance utility does not generally split files; therefore, file domains with very large files may not balance as evenly as file domains with smaller files. See balance(8) for more information.
To determine if you need to balance your files across volumes, use the showfdmn command to display information about the volumes in a domain. The % used field shows the percentage of volume space that is currently allocated to files or metadata (fileset data structure). See showfdmn(8) for more information.
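For example, if showfdmn reports widely differing % used values for the volumes of an illustrative domain, a command similar to the following redistributes the used space.
    # showfdmn accounts_domain
    # balance accounts_domain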
Performance may degrade if too many frequently accessed or large files reside on the same volume in a multivolume file domain. You can improve I/O performance by altering the way files are mapped on the disk.
Use the migrate utility to move frequently accessed or large files to different volumes in the file domain. You can specify the volume where a file is to be moved, or allow the system to pick the best space in the file domain. You can migrate either an entire file or specific pages to a different volume. However, using the balance utility after migrating files may cause the files to move to a different volume. See balance(8) for more information.
In addition, a file that is migrated is defragmented at the same time, if possible. Defragmentation makes the file more contiguous, which improves performance. Therefore, you can use the migrate command to defragment selected files. See migrate(8) for more information.
The following sections will help you to configure and tune UNIX File Systems (UFS).
There are a number of parameters that can improve UFS performance. You can set all of the parameters when you use the newfs command to create a file system. For existing file systems, you can tune some parameters by using the tunefs command.
Table 5-14 describes UFS configuration guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-2 and Table 5-3 apply to UFS configurations.
Action | Performance Benefit | Tradeoff |
Increase the file system fragment size to 8 KB (Section 5.7.1.1) | Improves performance for large files | Wastes disk space for small files |
Use the default file system fragment size of 1 KB (Section 5.7.1.1) | Uses disk space efficiently | None |
Reduce the density of inodes (Section 5.7.1.2) | Improves performance of large files | None |
Allocate blocks contiguously (Section 5.7.1.3) | Aids UFS block clustering | None |
Increase the number of blocks combined for a read (Section 5.7.1.4) | Improves performance | None |
Use a Memory File System (MFS) (Section 5.7.1.5) | Improves I/O performance | Does not ensure data integrity because of cache volatility |
Use disk quotas (Section 5.7.1.6) | Controls disk space utilization | UFS quotas may slow reboot time |
The following sections describe the UFS configuration guidelines in detail.
If the average file in a file system is larger than 16 KB but less than 96 KB, you may be able to improve disk access time and decrease system overhead by making the file system fragment size equal to the block size, which is 8 KB. Use the newfs command to do this. However, to use disk space efficiently, use the default fragment size, which is 1 KB. See newfs(8) for more information.
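For example, to create a file system whose fragment size equals the 8-KB block size, you might enter a command similar to the following; the character device name is illustrative only.
    # newfs -b 8192 -f 8192 /dev/rrz3c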
The number of files in a file system is determined by the number of inodes and the size of the file system. The default is to create an inode for each 4096 bytes of data space.
If a file system will contain many large files, you may want to increase the amount of data space allocated to an inode and reduce the density of inodes. To do this, use the newfs -i command to specify the amount of data space allocated to an inode. See newfs(8) for more information.
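For example, to allocate one inode for every 16 KB of data space instead of the default 4096 bytes, you might use a command similar to the following (illustrative device name).
    # newfs -i 16384 /dev/rrz3c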
The UFS rotdelay parameter specifies the time, in milliseconds, to service a transfer completion interrupt and initiate a new transfer on the same disk. You can set the rotdelay parameter to 0 (the default) to allocate blocks sequentially and aid UFS block clustering. You can do this by using either the tunefs command or the newfs command. See newfs(8) and tunefs(8) for more information.
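For example, to set rotdelay to 0 on an existing file system (ideally while it is unmounted), you might enter a command similar to the following; the device name is illustrative.
    # tunefs -d 0 /dev/rrz3c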
The value of the UFS maxcontig parameter specifies the number of blocks that can be combined into a single cluster. The default value of maxcontig is 8. The file system attempts read operations in a size that is defined by the value of maxcontig multiplied by the block size (8 KB). Device drivers that can chain several buffers together in a single transfer should use a maxcontig value that is equal to the maximum chain length.
Use the tunefs command or the newfs command to change the value of maxcontig. See newfs(8) and tunefs(8) for more information.
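For example, to allow 16 blocks to be combined into a single cluster on an existing file system, you might enter a command similar to the following (illustrative device name).
    # tunefs -a 16 /dev/rrz3c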
Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS file system can improve read/write performance, but it is a volatile cache. The contents of an MFS file system are lost after a reboot, unmount operation, or power failure.
Because no data is written to disk, an MFS file system is a very fast file system that can be used to store temporary files or read-only files that are loaded into it after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS file system to cache the temporary files that are created during the build and reduce the build time.
You can specify UFS file system limits for user accounts and for groups by setting up file system quotas, also known as disk quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.
You may want to set quotas on file systems that contain home directories, because the sizes of these file systems can increase more significantly than other file systems. Do not set quotas on the /tmp file system.
Note that, unlike AdvFS quotas, UFS quotas may slow reboot time. For information about AdvFS quotas, see Section 5.6.1.5.
After you configure your UFS file systems, you can modify some parameters and attributes to improve performance. Table 5-15 describes UFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 5-4 apply to UFS configurations.
Action | Performance Benefit | Tradeoff |
Increase size of metadata buffer cache to more than 3 percent of main memory (Section 4.9.1) | Increases cache hit rate and improves UFS performance | Requires additional memory resources |
Defragment the file system (Section 5.7.2.1) | Improves read and write performance | Requires down time |
Delay flushing full write buffers to disk (Section 5.7.2.2) | Frees CPU cycles | May degrade real-time workload performance |
Increase number of blocks combined for read ahead (Section 5.7.2.3) | Improves performance | None |
Increase number of blocks combined for a write (Section 5.7.2.4) | Improves performance | None |
Increase the maximum number of UFS or MFS mounts (Section 5.7.2.5) | Allows more mounted file systems | None |
The following sections describe how to tune UFS in detail.
When a file consists of many discontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance because it requires more I/O operations to access the file.
You can determine whether the files in a file system are fragmented by determining how effectively the system is clustering. You can do this by using dbx to examine the ufs_clusterstats, ufs_clusterstats_read, and ufs_clusterstats_write structures. See dbx(1) for more information.
UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not being particularly effective, the files in the file system may be very fragmented.
To defragment a UFS file system, follow these steps:
Back up the file system onto tape or another partition.
Create a new file system either on the same partition or a different partition.
Restore the file system.
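For example, a minimal sketch of this procedure for an illustrative file system mounted on /usr/projects, built on /dev/rrz3c and backed up to a tape on /dev/nrmt0h (adjust device names, dump levels, and restore options for your site):
    # dump -0uf /dev/nrmt0h /usr/projects
    # umount /usr/projects
    # newfs /dev/rrz3c
    # mount /dev/rz3c /usr/projects
    # cd /usr/projects
    # restore -rf /dev/nrmt0h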
AdvFS provides you with the ability to defragment a file domain by using the defragment command. See defragment(8) for more information.
You can free CPU cycles by using the dbx debugger to set the value of the delay_wbuffers kernel variable to 1, which delays flushing full write buffers to disk until the next sync call. However, this may adversely affect real-time workload performance. The default value of delay_wbuffers is 0. See dbx(1) for more information.
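For example, the following dbx session sets the variable on the running kernel. The kernel image and memory file paths shown are the usual defaults and may differ on your system.
    # dbx -k /vmunix /dev/mem
    (dbx) assign delay_wbuffers = 1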
You can increase the number of blocks that are combined for a read-ahead operation.
To do this, use the dbx debugger to make the value of the cluster_consec_init kernel variable equal to the value of the cluster_max_read_ahead variable (the default is 8), which specifies the maximum number of read-ahead clusters that the kernel can schedule. See dbx(1) for more information.
In addition, you must make sure that cluster read operations are enabled on nonread-ahead and read-ahead blocks. To do this, the value of the cluster_read_all kernel variable must be set to 1 (the default).
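For example, a dbx session similar to the following checks that cluster_read_all is enabled and raises cluster_consec_init to the cluster_max_read_ahead default of 8 (paths as in the previous example).
    # dbx -k /vmunix /dev/mem
    (dbx) print cluster_read_all
    (dbx) assign cluster_consec_init = 8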
The cluster_maxcontig parameter specifies the number of blocks that are combined into a single write operation. The default value is 8. Contiguous writes are done in a unit size that is determined by the file system block size (the default is 8 KB) multiplied by the value of the cluster_maxcontig parameter.
Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made. The max-ufs-mounts attribute specifies the maximum number of UFS and MFS mounts on the system. You can increase the value of the max-ufs-mounts attribute if your system will have more than the default limit of 1000 mounts.
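For example, to raise the limit to 2000 mounts, you might add a stanza similar to the following to /etc/sysconfigtab. The attribute is assumed here to belong to the vfs subsystem; confirm the subsystem name for your operating system version.
    vfs:
        max-ufs-mounts = 2000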
DIGITAL UNIX uses the Common Access Method (CAM) as the operating system interface to the hardware. CAM maintains pools of buffers that are used to perform I/O. Each buffer takes approximately 1 KB of physical memory. Monitor these pools and tune them if necessary.
You can modify the following attributes:
cam_ccb_pool_size -- The initial size of the buffer pool free list at boot time. The default is 200.
cam_ccb_low_water -- The number of buffers in the pool free list at which more buffers are allocated from the kernel. CAM reserves this number of buffers to ensure that the kernel always has enough memory to shut down runaway processes. The default is 100.
cam_ccb_increment -- The number of buffers either added or removed from the buffer pool free list. Buffers are allocated on an as-needed basis to handle immediate demands, but are released in a more measured manner to guard against spikes. The default is 50.
If the I/O pattern associated with your system tends to have intermittent bursts of I/O operations (I/O spikes), increasing the values of the cam_ccb_pool_size and cam_ccb_increment attributes may improve performance.
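For example, a stanza similar to the following doubles the initial pool size and the increment. The subsystem name shown (io) is an assumption; confirm the correct subsystem and attribute names in the attribute documentation for your operating system version before editing /etc/sysconfigtab.
    io:
        cam_ccb_pool_size = 400
        cam_ccb_increment = 100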