9    Managing File System Performance

The Tru64 UNIX operating system supports various file system options that have different performance features and functionality.

This chapter describes how to perform the following tasks:

9.1    Gathering File System Information

The following sections describe how to use tools to monitor general file system activity and describe some general file system tuning guidelines.

9.1.1    Monitoring the Unified Buffer Cache

The Unified Buffer Cache (UBC) uses a portion of physical memory to cache most-recently accessed UFS file system data for reads and writes and for page faults from mapped file regions, in addition to AdvFS metadata and user data. The UBC competes with processes for this portion of physical memory, so the amount of memory allocated to the UBC can affect overall system performance.

See Section 6.3.5 for information about using dbx to check the UBC. See Section 9.2 for information on how to tune the UBC.

9.1.2    Checking the namei Cache with the dbx Debugger

The namei cache is used by UFS, AdvFS, CD-ROM File System (CDFS), and NFS to store recently used file system pathname/inode number pairs. It also stores inode information for files that were referenced but not found. Having this information in the cache substantially reduces the amount of searching that is needed to perform pathname translations.

To check the namei cache, use the dbx print command to examine the nchstats data structure. Consider the following example:

# /usr/ucb/dbx -k /vmunix /dev/mem 
(dbx) print nchstats
struct {
    ncs_goodhits = 9748603   
    ncs_neghits = 888729     
    ncs_badhits = 23470
    ncs_falsehits = 69371
    ncs_miss = 1055430       
    ncs_long = 4067
    ncs_pass2 = 127950
    ncs_2passes = 195763
    ncs_dirscan = 47
}
(dbx)
 

Examine the ncs_goodhits (found a pair), ncs_neghits (found a pair that did not exist), and ncs_miss (did not find a pair) fields to determine the hit rate. The hit rate should be above 80 percent: divide the sum of ncs_goodhits and ncs_neghits by the sum of ncs_goodhits, ncs_neghits, ncs_miss, and ncs_falsehits.
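The hit-rate arithmetic can be checked directly against the counters in the sample output above; this sketch uses standard awk rather than any Tru64-specific tool:

```shell
# Compute the namei cache hit rate from the sample nchstats counters.
# Hit rate = (ncs_goodhits + ncs_neghits) /
#            (ncs_goodhits + ncs_neghits + ncs_miss + ncs_falsehits)
awk 'BEGIN {
    goodhits  = 9748603
    neghits   = 888729
    miss      = 1055430
    falsehits = 69371
    hits  = goodhits + neghits
    total = hits + miss + falsehits
    printf "namei cache hit rate: %.1f%%\n", 100 * hits / total
}'
```

For this sample the rate works out to roughly 90 percent, comfortably above the 80 percent guideline, so no namei cache tuning would be indicated.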

See Section 9.2.1 for information on how to improve the namei cache hit rate and lookup speeds.

9.2    Tuning File Systems

You may be able to improve I/O performance by modifying some kernel attributes that affect all file system performance.

General file system tuning often involves tuning the Virtual File System (VFS), which provides a uniform interface that allows common access to files, regardless of the file system on which the files reside.

To successfully improve file system performance, you must understand how your applications and users perform I/O, as described in Section 2.1. Because file systems share memory with processes, you should also understand virtual memory operation, as described in Chapter 6.

Table 9-1 describes the guidelines for general file system tuning and lists the performance benefits as well as the tradeoffs. There are also specific guidelines for AdvFS and UFS file systems. See Section 9.3 and Section 9.4 for information.

Table 9-1:  General File System Tuning Guidelines

Action | Performance Benefit | Tradeoff
Increase the size of the namei cache (Section 9.2.1) | Improves cache lookup operations | Consumes memory
Increase the size of the hash chain table for the namei cache (Section 9.2.2) | Improves cache lookup operations | Consumes memory
Increase the memory allocated to the UBC (Section 9.2.3) | Improves file system performance | May cause excessive paging and swapping
Decrease the amount of memory borrowed by the UBC (Section 9.2.4) | Improves file system performance | Decreases the memory available for processes and may decrease system response time
Increase the minimum size of the UBC (Section 9.2.5) | Improves file system performance | Decreases the memory available for processes
Increase the UBC write device queue depth (Section 9.2.6) | Increases overall file system throughput and frees memory | Decreases interactive response performance
Decrease the UBC write device queue depth (Section 9.2.6) | Improves interactive response time | Consumes memory
Increase the amount of UBC memory used to cache a large file (Section 9.2.7) | Improves large file performance | May allow a large file to consume all the pages on the free list
Decrease the amount of UBC memory used to cache a large file (Section 9.2.7) | Prevents a large file from consuming all the pages on the free list | May degrade large file performance
Disable flushing of file read access times to disk (Section 9.2.8) | Improves file system performance for proxy servers | Jeopardizes the integrity of read access time updates and violates POSIX standards
Use Prestoserve to cache only file system metadata (Section 9.2.9) | Improves performance for applications that access large amounts of file system metadata | Prestoserve is not supported in a cluster or for nonfile system I/O operations
Increase the size of the Prestoserve buffer hash table (Section 9.2.10) | Decreases Prestoserve lock contention | Prestoserve is not supported in a cluster or for nonfile system I/O operations
Cache more vnodes on the free list (Section 9.2.11) | Improves cache lookup operations | Consumes memory
Increase the amount of time for which vnodes are kept on the free list (Section 9.2.12) | Improves cache lookup operations | None
Delay vnode deallocation (Section 9.2.13) | Improves namei cache lookup operations | Consumes memory
Accelerate vnode deallocation (Section 9.2.14) | Speeds the freeing of memory | Reduces the efficiency of the namei cache
Disable vnode deallocation (Section 9.2.15) | Optimizes processing time | Consumes memory

The following sections describe these guidelines in detail.

9.2.1    Increasing the Size of the namei Cache

The namei cache is used by all file systems to map file pathnames to inodes. Monitor the cache by using the dbx print command to examine the nchstats data structure. The miss rate (misses / (good + negative + misses)) should be less than 20 percent.

To make lookup operations faster, increase the size of the namei cache by increasing the value of the maxusers attribute (the recommended way), as described in Section 5.1, or by increasing the value of the vfs subsystem attribute name-cache-size (the default value is 1029).

Increasing the value of the maxusers or name-cache-size attribute allocates more system resources for use by the kernel, but also increases the amount of wired memory consumed by the kernel.

Note that many benchmarks perform better with a large namei cache.
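Attribute changes of this kind are typically made persistent in the /etc/sysconfigtab database. As an illustration only (the value 2048 is arbitrary, and the stanza layout should be verified against sysconfigtab(4) before use), an entry raising the cache size might look like the following:

```
vfs:
    name-cache-size = 2048
```

A reboot may be required before a changed value takes effect.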

9.2.2    Increasing the Size of the Hash Chain Table for the namei Cache

Increasing the size of the hash chain table for the namei cache distributes the namei cache elements and reduces the time needed for linear searches, which can improve lookup speeds. The vfs subsystem attribute name-cache-hash-size specifies the size of the hash chain table, in table elements, for the namei cache.

The default value of the name-cache-hash-size attribute is the value of the name-cache-size attribute divided by 8 and rounded up to the next power of 2, or 8192, whichever is greater.

You can change the value of the name-cache-hash-size attribute so that each hash chain has three or four name cache entries. To determine an appropriate value for the name-cache-hash-size attribute, divide the value of the vfs subsystem attribute name-cache-size by 3 or 4 and then round the result to a power of 2.

For example, if the value of name-cache-size is 1029, dividing 1029 by 4 produces a value of 257. Based on this calculation, you could specify 256 (2 to the power of 8) for the value of the name-cache-hash-size attribute.
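The rounding rule can be expressed as a one-line calculation; this is a sketch using standard awk, with 1029 standing in for the current name-cache-size value:

```shell
# Divide name-cache-size by 4, then round the result to the
# nearest power of 2 to get a candidate name-cache-hash-size.
awk -v size=1029 'BEGIN {
    target = size / 4                     # about 257 for the default size
    e = int(log(target) / log(2) + 0.5)   # nearest power-of-2 exponent
    printf "name-cache-hash-size = %d\n", 2 ^ e
}'
```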

9.2.3    Increasing Memory for the UBC

The Unified Buffer Cache (UBC) shares with processes the memory that is not wired. The UBC caches UFS file system data for reads and writes, AdvFS metadata and file data, and Memory File System (MFS) data. Performance is improved if the cached data is later reused and a disk operation is avoided.

If you reuse data, be sure to allocate enough memory to the UBC to improve the chance that data will be found in the cache. An insufficient amount of memory allocated to the UBC can impair file system performance. However, the performance of an application that generates a lot of random I/O will not be improved by a large UBC, because the next access location for random I/O cannot be predetermined.

To increase the maximum amount of memory allocated to the UBC, you can increase the value of the vm subsystem attribute ubc-maxpercent. The default value is 100 percent, which should be appropriate for most configurations, including Internet servers.

Be sure that allocating more memory to the UBC does not cause excessive paging and swapping.

See Section 6.1.2.2 for information about UBC memory allocation.

9.2.4    Decreasing the Amount of Borrowed Memory

The UBC borrows all physical memory above the value of the vm subsystem attribute ubc-borrowpercent and up to the value of the ubc-maxpercent attribute.

Increasing the value of the ubc-borrowpercent attribute allows more memory to remain in the UBC when page reclamation begins. This can increase UBC cache effectiveness, but may degrade system response time when a low-memory condition occurs. If vmstat output shows excessive paging but few or no page-outs, you may want to increase the borrowing threshold.

The value of the ubc-borrowpercent attribute can range from 0 to 100. The default value is 20 percent.

See Section 6.1.2.2 for information about UBC memory allocation.

9.2.5    Increasing the Minimum Size of the UBC

Increasing the minimum size of the UBC will prevent large programs from completely filling the UBC. For I/O servers, you may want to raise the value of the vm subsystem attribute ubc-minpercent to ensure that enough memory is available for the UBC. The default value is 10 percent.

Because the UBC and processes share virtual memory, increasing the minimum size of the UBC may cause the system to page excessively. In addition, if the values of the vm subsystem attributes ubc-maxpercent and ubc-minpercent are close together, you may degrade I/O performance.

To ensure that the value of the ubc-minpercent is appropriate, use the vmstat command to examine the page-out rate. See Section 6.3.2 for information.

See Section 6.1.2.2 for information about UBC memory allocation.

9.2.6    Modifying the UBC Write Device Queue Depth

The UBC uses a buffer to facilitate the movement of data between memory and disk. The vm subsystem attribute vm-ubcbuffers specifies the maximum file system device I/O queue depth for writes. The default value is 256.

Increasing the UBC write device queue depth frees memory and increases the overall file system throughput.

Decreasing the UBC write device queue depth increases memory demands, but it improves the interactive response time.

9.2.7    Controlling Large File Caching

If a large file completely fills the UBC, it may take all of the pages on the free page list, which may cause the system to page excessively. The vm subsystem attribute vm-ubcseqpercent specifies the maximum amount of memory allocated to the UBC that can be used to cache a file. The default value is 10 percent of memory allocated to the UBC.

The vm subsystem attribute vm-ubcseqstartpercent specifies the size of the UBC as a percentage of physical memory, at which time the virtual memory subsystem starts stealing the UBC LRU pages for a file to satisfy the demand for pages. The default is 50 percent of physical memory.

Increasing the value of the vm-ubcseqpercent attribute will improve the performance of a large single file, but will decrease the remaining amount of memory.

Decreasing the value of the vm-ubcseqpercent attribute will increase the available memory, but will degrade the performance of a large single file.

To force the system to reuse the pages in the UBC instead of taking pages from the free list, both of the following conditions must be met: the size of the UBC must exceed the percentage of physical memory specified by the vm-ubcseqstartpercent attribute, and the size of the file must exceed the percentage of the UBC specified by the vm-ubcseqpercent attribute.

For example, using the default values, the UBC would have to be larger than 50 percent of all memory and a file would have to be larger than 10 percent of the UBC (that is, the file size would have to be at least 5 percent of all memory) in order for the system to reuse the pages in the UBC.
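That threshold is simply the product of the two attribute values; a quick sketch with standard awk:

```shell
# With the defaults, page recycling for a single file begins only when
# the UBC exceeds vm-ubcseqstartpercent (50%) of physical memory and the
# file exceeds vm-ubcseqpercent (10%) of the UBC, so the file must hold
# at least 50% * 10% = 5% of all physical memory.
awk -v startpct=50 -v seqpct=10 'BEGIN {
    printf "file threshold: %.0f%% of physical memory\n", startpct * seqpct / 100
}'
```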

On large-memory systems that are doing a lot of file system operations, you may want to lower the vm-ubcseqstartpercent value to 30 percent. Do not specify a lower value unless you decrease the size of the UBC. In this case, do not change the value of the vm-ubcseqpercent attribute.

9.2.8    Disabling File Read Access Time Flushing

When a read system call is made to a file system's files, the default behavior is for the file system to update both the in-memory file access time and the on-disk stat structure, which contains most of the file information that is returned by the stat(2) system call.

You can improve file system performance for proxy servers by specifying, at mount time, that the file system update only the in-memory file access time when a read system call is made to a file. The file system will update the on-disk stat structure only if the file is modified.

To enable this functionality, use the mount command with the noatimes option. See read(2) and mount(8) for more information.

Updating only the in-memory file access time for reads can improve proxy server response time by decreasing the number of disk I/O operations. However, this behavior jeopardizes the integrity of read access time updates and violates POSIX standards. Do not use this functionality if it will affect utilities that use read access times to perform tasks, such as migrating files to different devices.
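As a sketch of persistent use, the option could be listed in the mount options field of /etc/fstab; the device name and mount point below are placeholders, and the exact field layout should be confirmed against fstab(4) and mount(8):

```
/dev/rz3c  /proxy_cache  ufs  rw,noatimes  0  2
```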

9.2.9    Using Prestoserve to Cache Only File System Metadata

Prestoserve can improve the overall run-time performance for systems that perform large numbers of synchronous writes. The prmetaonly attribute controls whether Prestoserve caches only UFS and AdvFS file system metadata, instead of both metadata and synchronous write data (the default). If the attribute is set to 1 (enabled), Prestoserve caches only file system metadata.

Caching only metadata may improve the performance of applications that access many small files or applications that access a large amount of file-system metadata but do not reread recently written data.

9.2.10    Increasing the Size of the Prestoserve Buffer Hash Table

If the contention on the Prestoserve lock (presto_lock) is high (for example, the miss rate is a few percentage points), you may be able to improve throughput by increasing the value of the presto subsystem attribute presto-buffer-hash-size. This will decrease Prestoserve lock contention.

The default value of the presto-buffer-hash-size attribute is 256 bytes. The minimum value is 0; the maximum value is 64 KB.

9.2.11    Caching More Free vnodes

You can increase the minimum number of vnodes on the free list to cache more free vnodes and improve the performance of cache lookup operations. However, increasing the minimum number of vnodes will consume memory resources.

The vfs subsystem attribute min-free-vnodes specifies the minimum number of vnodes. The default value of the min-free-vnodes attribute is either 150 or the value of the nvnode kernel variable, whichever is greater.

If the value of min-free-vnodes is larger than the value of max-vnodes, vnode deallocations will not occur.

If the value of min-free-vnodes is close to the value of the max-vnodes attribute, vnode deallocation will not be effective. If the value of min-free-vnodes must be close to the value of max-vnodes, you may want to disable vnode deallocation (see Section 9.2.15).

Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. On systems that need to reclaim the memory used by vnodes, make sure that the value of min-free-vnodes is significantly lower than the value of max-vnodes.

See Section 5.5.1 for information about modifying max-vnodes.

9.2.12    Increasing the Time vnodes Remain on the Free List

You can increase the value of the vfs subsystem attribute vnode-age to increase the amount of time for which vnodes are kept on the free list. This increases the possibility that the vnode will be successfully looked up. The default value for vnode-age is 120 seconds on 32-MB or larger systems and 2 seconds on 24-MB systems.

9.2.13    Delaying the Deallocation of vnodes

To delay the deallocation of vnodes, increase the value of the vfs subsystem attribute namei-cache-valid-time. The default value is 1200. This can improve namei cache lookup operations, but it consumes memory resources.

9.2.14    Accelerating the Deallocation of vnodes

To accelerate the deallocation of vnodes, decrease the value of the vfs subsystem attribute namei-cache-valid-time. The default value is 1200. This causes vnodes to be deallocated from the namei cache at a faster rate and returns memory to the operating system, but it also reduces the efficiency of the cache.

9.2.15    Disabling vnode Deallocation

To optimize processing time, disable vnode deallocation by setting the value of the vfs subsystem attribute vnode-deallocation-enable to zero. Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system.

You may want to disable vnode deallocation if the value of the vfs subsystem attribute min-free-vnodes is close to the value of the max-vnodes attribute. See Section 5.5.1 for information about modifying max-vnodes.

9.3    Managing Advanced File System Performance

The Advanced File System (AdvFS) provides file system features beyond those of a traditional UFS file system. Unlike the rigid UFS model in which the file system directory hierarchy (tree) is bound tightly to the physical storage, AdvFS consists of two distinct layers: the directory hierarchy layer and the physical storage layer.

The AdvFS decoupled file system structure enables you to manage the physical storage layer apart from the directory hierarchy layer. This means that you can move files between a defined group of disk volumes without changing file pathnames. Because the pathnames remain the same, the action is completely transparent to end users.

AdvFS allows you to put multiple volumes (disks, LSM volumes, or RAID storage sets) in a file domain and distribute the filesets and files across the volumes. A file's blocks usually reside together on the same volume, unless the file is striped or the volume is full. Each new file is placed on the successive volume by using round-robin scheduling.

AdvFS provides the following features:

The optional AdvFS utilities, which are licensed separately, provide the following features:

The following sections describe how to perform these tasks:

See the AdvFS Guide to File System Administration for detailed information about setting up and managing AdvFS.

9.3.1    Understanding AdvFS Operation

AdvFS is a file system option that provides many file management and performance features. You can use AdvFS instead of UFS to organize and manage your files. An AdvFS file domain can consist of multiple volumes, which can be UNIX block devices (entire disks), disk partitions, LSM logical volumes, or RAID storage sets. AdvFS filesets can span all the volumes in the file domain.

The AdvFS Utilities product, which is licensed separately from the operating system, extends the capabilities of the AdvFS file system.

The following sections describe AdvFS I/O queues and access structures.

9.3.1.1    AdvFS I/O Queues

At boot time, the system reserves a percentage of static wired physical memory for the AdvFS buffer cache, which is the part of the UBC that holds the most recently accessed pages of AdvFS file data and metadata. A disk operation is avoided if the data is later reused and the page is still in the cache (a buffer cache hit). This can improve AdvFS performance.

The amount of memory that can be allocated to the AdvFS buffer cache is specified by the advfs subsystem attribute AdvfsCacheMaxPercent. The default value is 7 percent of physical memory. See Section 6.1.2.3 for information about how the system allocates memory to the AdvFS buffer cache.

For each AdvFS volume, I/O requests are sent either to the blocking queue, which caches synchronous I/O requests, or to the lazy queue, which caches asynchronous I/O requests. Both the blocking queue and the lazy queue feed I/O requests to the device queue.

A synchronous I/O request is one that must be written to disk before the write is considered successful and the application can continue. This ensures data reliability because the write is not stored in memory to be later written to disk. Therefore, I/O requests on the blocking queue cannot be asynchronously removed, because the I/O must complete.

Asynchronous I/O requests are cached in the lazy queue and periodically flushed to disk in portions that are large enough to allow the disk drivers to optimize the order of the write.

Figure 9-1 shows the movement of synchronous and asynchronous I/O requests through the AdvFS I/O queues.

Figure 9-1:  AdvFS I/O Queues

When an asynchronous I/O request enters the lazy queue, it is assigned a time stamp. The lazy queue is a pipeline that contains a sequence of queues through which an I/O request passes: the wait queue (if applicable), the ready queue, and the consol queue. An AdvFS buffer cache hit can occur while an I/O request is in any part of the lazy queue.

Detailed descriptions of the wait, ready, and consol queues are as follows:

Both the consol queue and the blocking queue feed the device queue, where logically contiguous I/O requests are consolidated into larger I/Os before they are sent to the device driver. The size of the device queue affects the amount of time it takes to complete a synchronous (blocking) I/O operation. AdvFS issues several types of blocking I/O operations, including AdvFS metadata and log data operations.

The AdvfsMaxDevQLen attribute limits the total number of I/O requests on the AdvFS device queue. The default value is 24 requests. When the number of requests exceeds this value, only synchronous requests from the blocking queue are accepted onto the device queue.

Although the default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations, you may need to modify this value. However, only increase the default value if devices are not being kept busy. Make sure that increasing the size of the device queue does not cause a decrease in response time. See Section 9.3.4.6 for more information about tuning the AdvFS device queue.
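If an increase does prove necessary, the setting could be recorded in /etc/sysconfigtab with a stanza along the following lines (32 is an arbitrary illustrative value, not a recommendation; verify the stanza syntax against sysconfigtab(4) before use):

```
advfs:
    AdvfsMaxDevQLen = 32
```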

Use the advfsstat command to show the AdvFS queue statistics.

9.3.1.2    AdvFS Access Structures

If your users or applications open and then reuse many files, you may be able to improve AdvFS performance by modifying how the system allocates AdvFS access structures. AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open and files that were opened but are now closed. Caching open file information can enhance AdvFS performance if the open files are later reused.

At boot time, the system reserves for AdvFS access structures a percentage of the physical memory that is not wired by the kernel or applications. Out of this pool of reserved memory, the system allocates a number of access structures and places them on the access structure free list. When a file is opened, an access structure is taken from the access structure free list. Access structures are allocated and deallocated according to the kernel configuration and workload demands.

There are two attributes that control the allocation of AdvFS access structures:

You may be able to improve AdvFS performance by modifying the previous attributes and allocating more memory for AdvFS access structures. However, this will reduce the amount of memory available to processes and may cause excessive paging and swapping.

If you do not use AdvFS or if your workload does not reuse AdvFS files, do not allocate a large amount of memory for access structures. If you have a large-memory system, you may want to decrease the amount of memory reserved for AdvFS access structures.

See Section 9.3.4.3 for information about tuning access structures.

9.3.2    AdvFS Configuration Guidelines

You will obtain the best performance if you carefully plan your AdvFS configuration. Table 9-2 lists AdvFS configuration guidelines and performance benefits as well as tradeoffs.

Table 9-2:  AdvFS Configuration Guidelines

Action | Performance Benefit | Tradeoff
Use multiple-volume file domains (Section 9.3.2.1) | Improves throughput and simplifies management | Increases chance of domain failure and may cause a log bottleneck
Use several file domains instead of one large domain (Section 9.3.2.1) | Prevents log from becoming a bottleneck | Increases maintenance complexity
Place transaction log on fast or uncongested volume (Section 9.3.2.2) | Prevents log from becoming a bottleneck | None
Stripe files across different disks and, if possible, different buses (Section 9.3.2.4) | Improves sequential read and write performance | Increases chance of domain failure
Use quotas (Section 9.3.2.5) | Controls file system space utilization | None

The following sections describe these AdvFS configuration guidelines in detail.

9.3.2.1    Using Multiple-Volume File Domains

Using multiple-volume file domains allows greater control over your physical resources, and may improve a fileset's total throughput. However, be sure that the log does not become a bottleneck. Multiple-volume file domains improve performance because AdvFS generates parallel streams of output using multiple device consolidation queues.

In addition, using a few file domains instead of many reduces the overall administration effort. However, a single volume failure within a file domain renders the entire file domain inaccessible, so the more volumes in a file domain, the greater the risk that the domain will fail.

It is recommended that you use a maximum of 12 volumes in each file domain. However, to reduce the risk of file domain failure, limit the number of volumes per file domain to three or mirror data with LSM or hardware RAID.

For multiple-volume domains, make sure that busy files are not located on the same volume. Use the migrate command to move files across volumes.

9.3.2.2    Improving the Transaction Log Performance

Each file domain has a transaction log that tracks fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs. The AdvFS file domain transaction log may become a bottleneck if the log resides on a congested disk or bus, or if the file domain contains many filesets.

To prevent the log from becoming a bottleneck, put the log on a fast, uncongested volume. You may want to put the log on a disk that contains only the log. See Section 9.3.4.12 for information on moving an existing transaction log.

To make the transaction log highly available, use LSM or hardware RAID to mirror the log.

9.3.2.3    Improving Bitmap Metadata Table Performance

The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.

The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time AdvFS needs additional metadata space, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, each increment of BMT growth may be spread across several extents.

To monitor the BMT, use the vbmtpg command and examine the number of mcells (freeMcellCnt). The value of freeMcellCnt can range from 0 to 22. A volume with 1 free mcell has very little space in which to grow the BMT. See vbmtpg(8) for more information.

You can also invoke the showfile command and specify mount_point/.tags/M-10 to examine the BMT extents on the first domain volume that contains the fileset mounted on the specified mount point. To examine the extents of the other volumes in the domain, specify M-16, M-24, and so on. If the extents at the end of the BMT are smaller than the extents at the beginning of the file, the BMT is becoming fragmented. See showfile(8) for more information.

If you are prematurely out of BMT disk space, you may be able to eliminate the problem by defragmenting the file domain that contains the volume. See defragment(8) for more information.

Table 9-3 provides some BMT sizing guidelines for the number of pages to preallocate for the BMT, and the number of pages by which the BMT extent size grows. The BMT sizing depends on the maximum number of files you expect to create on a volume.

Table 9-3:  BMT Sizing Guidelines

Estimated Maximum Number of Files on a Volume | Number of Pages to Preallocate | Number of Pages to Grow Extent
< 50,000 | 3600 | 128
100,000 | 7200 | 256
200,000 | 14,400 | 512
300,000 | 21,600 | 768
400,000 | 28,800 | 1024
800,000 | 57,600 | 2048
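The preallocation column scales at roughly 7200 pages per 100,000 files (a ratio inferred from the table, not stated explicitly in the manual), so intermediate estimates can be interpolated; a sketch in standard awk:

```shell
# Interpolate a BMT preallocation size from an estimated file count,
# using the ratio implied by Table 9-3 (7200 pages per 100,000 files).
awk -v files=200000 'BEGIN {
    printf "preallocate about %d BMT pages\n", files * 7200 / 100000
}'
```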

You can modify the number of extent pages by which the BMT grows when a file domain is created or when a volume is added to the domain. If you use the mkfdmn -x or the addvol -x command when there is a large amount of free space on a disk, as files are created the BMT will expand by the specified number of pages and those pages will be in one extent. As the disk becomes more fragmented, the BMT will still expand, but the pages will not be contiguous and will require more extents. Eventually, the BMT will run out of its limited number of extents even though the growth size is large.

To prevent this problem, you can preallocate space for the BMT when the file domain is created, or when a volume is added to the domain. If you use the mkfdmn -p or the addvol -p command, the preallocated BMT is described in one extent. All subsequent growth will be able to utilize nearly all of the limited number of BMT extents. See mkfdmn(8) and addvol(8).

Do not overallocate BMT space because the disk space cannot be used for other purposes. However, too little BMT space will eventually cause the BMT to grow by a fixed amount. The disk may be fragmented and the growth will require multiple extents.

9.3.2.4    Striping Files

You may be able to use the AdvFS stripe utility to improve the read and write performance of an individual file by spreading file data evenly across different disks in a file domain. For the maximum performance benefit, stripe files across disks on different I/O buses.

Striping files, instead of striping entire disks, is useful if an application continually accesses only a few specific files. Do not stripe both a file and the disk on which it resides.

The stripe utility directs a zero-length file (a file with no data written to it yet) to be distributed evenly across a specified number of volumes. As data is appended to the file, the data is spread across the volumes. The size of each data segment (also called the stripe or chunk size) is 64 KB (65,536 bytes). AdvFS alternates the placement of the segments on the disks in a sequential pattern. For example, the first 64 KB of the file is written to the first volume, the second 64 KB is written to the next volume, and so on.
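Because placement is round-robin, the volume holding any byte offset is fully determined by the 64 KB segment size; a sketch of the arithmetic in standard awk (segments and volumes numbered from 0):

```shell
# For a striped file, segment n (n = offset / 65536) is written to
# volume (n mod nvols) under AdvFS round-robin placement.
awk -v offset=196608 -v nvols=3 'BEGIN {
    seg = int(offset / 65536)
    printf "offset %d is in segment %d on volume %d\n", offset, seg, seg % nvols
}'
```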

See stripe(8) for more information.

Note

Distributing data across multiple volumes decreases data availability, because one volume failure makes the entire file domain unavailable. To make striped files highly available, you can mirror the disks on which the file is striped.

To determine whether you should stripe files, use the iostat utility, as described in Section 8.2.1. Cross-check the blocks per second and I/O operations per second against the disk's bandwidth capacity. If disk access time is slow in comparison to the stated capacity, file striping may improve performance.

The performance benefit of striping also depends on the size of the average I/O transfer in relation to the data segment (stripe) size, in addition to how your users and applications perform disk I/O.

9.3.2.5    Using AdvFS Quotas

AdvFS quotas allow you to track and control the amount of physical storage that a user, group, or fileset consumes. AdvFS eliminates the slow reboot activities associated with UFS quotas. In addition, AdvFS quota information is always maintained, but quota enforcement can be activated and deactivated.

For information about AdvFS quotas, see the AdvFS Administration manual.

9.3.3    Gathering AdvFS Information

Table 9-4 describes the tools you can use to obtain information about AdvFS.

Table 9-4:  AdvFS Monitoring Tools

Name Use Description

advfsstat

Displays AdvFS performance statistics (Section 9.3.3.1)

Allows you to obtain extensive AdvFS performance information, including buffer cache, fileset, volume, and bitfile metadata table (BMT) statistics, for a specific interval of time.

advscan

Identifies disks in a file domain (Section 9.3.3.2)

Locates pieces of AdvFS file domains on disk partitions and in LSM disk groups.

showfdmn

Displays detailed information about AdvFS file domains and volumes (Section 9.3.3.3)

Allows you to determine if files are evenly distributed across AdvFS volumes. The showfdmn utility displays information about a file domain, including the date created and the size and location of the transaction log, and information about each volume in the domain, including the size, the number of free blocks, the maximum number of blocks read and written at one time, and the device special file.

For multivolume domains, the utility also displays the total volume size, the total number of free blocks, and the total percentage of volume space currently allocated.

showfile

Displays information about files in an AdvFS fileset (Section 9.3.3.4)

Displays detailed information about files (and directories) in an AdvFS fileset. The showfile command allows you to check a file's fragmentation. A low performance percentage (less than 80 percent) indicates that the file is fragmented on the disk.

The showfile command also displays the extent map of each file. An extent is a contiguous area of disk space that AdvFS allocates to a file. Simple files have one extent map; striped files have an extent map for every stripe segment. The extent map shows whether the entire file or only a portion of the file is fragmented.

showfsets

Displays AdvFS fileset information for a file domain (Section 9.3.3.5)

Displays information about the filesets in a file domain, including the fileset names, the total number of files, the number of free blocks, the quota status, and the clone status. The showfsets command also displays block and file quota limits for a file domain or for a specific fileset in the domain.

verify

Checks the AdvFS on-disk metadata structures

Checks AdvFS on-disk structures such as the BMT, the storage bitmaps, the tag directory, and the frag file for each fileset. It verifies that the directory structure is correct, that all directory entries reference a valid file (tag), and that all files (tags) have a directory entry.

fsx

Exercises file systems

Exercises AdvFS and UFS file systems by creating, opening, writing, reading, validating, closing, and unlinking a test file. Errors are written to a log file. See fsx(8) for more information.

The following sections describe some of these commands in detail.

9.3.3.1    Monitoring AdvFS Performance Statistics by Using the advfsstat Command

The advfsstat command displays various AdvFS performance statistics and monitors the performance of AdvFS domains and filesets. Use this command to obtain detailed information, especially if the iostat command output indicates a disk bottleneck (see Section 8.2.1).

The advfsstat command displays detailed information about a file domain, including information about the AdvFS buffer cache, fileset vnode operations, locks, the namei cache, and volume I/O performance. The command reports information in units of one disk block (512 bytes) for each interval of time (the default is one second). You can use the -i option to output information at specific time intervals.

The following example of the advfsstat -v 2 command shows the I/O queue statistics for the specified volume:

# /usr/sbin/advfsstat -v 2 test_domain
vol1
  rd  wr  rg  arg  wg  awg  blk  wlz  rlz  con  dev
  54   0  48  128   0    0    0    1    0    0   65

The previous example shows the following fields:

You can monitor the type of requests that applications are issuing by using the advfsstat command's -f option to display fileset vnode operations. You can display the number of file creates, reads, and writes and other operations for a specified domain or fileset. For example:

# /usr/sbin/advfsstat -i 3 -f 2 scratch_domain fset1
  lkup  crt geta read writ fsnc dsnc   rm   mv rdir  mkd  rmd link
     0    0    0    0    0    0    0    0    0    0    0    0    0
     4    0   10    0    0    0    0    2    0    2    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0
    24    8   51    0    9    0    0    3    0    0    4    0    0
  1201  324 2985    0  601    0    0  300    0    0    0    0    0
  1275  296 3225    0  655    0    0  281    0    0    0    0    0
  1217  305 3014    0  596    0    0  317    0    0    0    0    0
  1249  304 3166    0  643    0    0  292    0    0    0    0    0
  1175  289 2985    0  601    0    0  299    0    0    0    0    0
   779  148 1743    0  260    0    0  182    0   47    0    4    0
     0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0

See advfsstat(8) for more information.

Note that some statistics, such as buffer cache statistics, are difficult to link to specific performance problems. In addition, the lock behavior reported by the lock statistics cannot be tuned.

9.3.3.2    Identifying Disks in an AdvFS File Domain by Using the advscan Command

The advscan command locates pieces of AdvFS domains on disk partitions and in LSM disk groups. Use the advscan command when you have moved disks to a new system, have moved disks around in a way that has changed device numbers, or have lost track of where the domains are.

You can specify a list of volumes or disk groups with the advscan command to search all partitions and volumes. The command determines which partitions on a disk are part of an AdvFS file domain.

You can also use the advscan command for repair purposes if you deleted the /etc/fdmns directory, deleted a directory domain under /etc/fdmns, or deleted some links from a domain directory under /etc/fdmns.

You can run the advscan command to rebuild all or part of your /etc/fdmns directory, or you can manually rebuild it by supplying the names of the partitions in a domain.

The following example scans two disks for AdvFS partitions:

# /usr/advfs/advscan rz0 rz5
 
Scanning disks  rz0 rz5
Found domains:
usr_domain
                Domain Id       2e09be37.0002eb40
                Created         Thu Jun 26 09:54:15 1998
                Domain volumes          2
                /etc/fdmns links        2
                Actual partitions found:
                                        rz0c
                                        rz5c
 

For the following example, the rz6 file domains were removed from /etc/fdmns. The advscan command scans device rz6 and re-creates the missing domains.

# /usr/advfs/advscan -r rz6
 
Scanning disks  rz6
Found domains:
*unknown*
                Domain Id       2f2421ba.0008c1c0
                Created         Mon Jan 20 13:38:02 1998
 
                Domain volumes          1
                /etc/fdmns links        0
 
                Actual partitions found:
                                        rz6a*
 
 
*unknown*
                Domain Id       2f535f8c.000b6860
                Created         Tue Feb 25 09:38:20 1998
 
                Domain volumes          1
                /etc/fdmns links       0
 
                Actual partitions found:
                                        rz6b*
 
 
Creating /etc/fdmns/domain_rz6a/
        linking rz6a
 
Creating /etc/fdmns/domain_rz6b/
        linking rz6b
 

See advscan(8) for more information.

9.3.3.3    Checking AdvFS File Domains by Using the showfdmn Command

The showfdmn command displays the attributes of an AdvFS file domain and detailed information about each volume in the file domain.

The following example of the showfdmn command displays domain information for the usr file domain:

% /sbin/showfdmn usr
 
               Id              Date Created  LogPgs  Domain Name
2b5361ba.000791be  Tue Jan 12 16:26:34 1998     256  usr
 
Vol 512-Blks    Free % Used  Cmode  Rblks  Wblks  Vol Name
 1L   820164  351580    57%     on    256    256  /dev/disk/rz0d

See showfdmn(8) for more information about the output of the command.

9.3.3.4    Displaying AdvFS File Information by Using the showfile Command

The showfile command displays the full storage allocation map (extent map) for one or more files in an AdvFS fileset. An extent is a contiguous area of disk space that AdvFS allocates to a file.

The following example of the showfile command displays the AdvFS characteristics for all of the files in the current working directory:


# /usr/sbin/showfile *
 
       Id  Vol  PgSz  Pages  XtntType  Segs  SegSz    I/O  Perf File
  22a.001    1    16      1    simple    **     **  async  50%  Mail
    7.001    1    16      1    simple    **     **  async  20%  bin
  1d8.001    1    16      1    simple    **     **  async  33%  c
 1bff.001    1    16      1    simple    **     **  async  82%  dxMail
  218.001    1    16      1    simple    **     **  async  26%  emacs
  1ed.001    1    16      0    simple    **     **  async 100%  foo
  1ee.001    1    16      1    simple    **     **  async  77%  lib
  1c8.001    1    16      1    simple    **     **  async  94%  obj
  23f.003    1    16      1    simple    **     **  async 100%  sb
 170a.008    1    16      2    simple    **     **  async  35%  t
    6.001    1    16     12    simple    **     **  async  16%  tmp
 

The I/O column specifies whether write operations are forced to be synchronous. See Section 9.3.4.10 for information.

The following example of the showfile command shows the characteristics and extent information for the tutorial file, which is a simple file:

# /usr/sbin/showfile -x tutorial
 
        Id  Vol  PgSz  Pages  XtntType  Segs  SegSz    I/O  Perf    File
 4198.800d    2    16     27    simple    **     **  async   66% tutorial
 
     extentMap: 1
          pageOff    pageCnt    vol    volBlock    blockCnt
                0          5      2      781552          80
                5         12      2      785776         192
               17         10      2      786800         160
       extentCnt: 3
 

The Perf entry shows the efficiency of the file-extent allocation, expressed as a percentage of the optimal extent layout. A high value, such as 100 percent, indicates that the file's extents are allocated optimally. A low value indicates that the file may be fragmented.

See showfile(8) for more information about the command output.

9.3.3.5    Displaying the AdvFS Filesets in a File Domain by Using the showfsets Command

The showfsets command displays the AdvFS filesets (or clone filesets) and their characteristics in a specified domain.

The following is an example of the showfsets command:

# /sbin/showfsets dmn
mnt
          Id           : 2c73e2f9.000f143a.1.8001
          Clone is     : mnt_clone
          Files        :       79,  limit =     1000
          Blocks  (1k) :      331,  limit =    25000
          Quota Status : user=on  group=on
 
mnt_clone
          Id           : 2c73e2f9.000f143a.2.8001
          Clone of     : mnt
          Revision     : 1

See showfsets(8) for information about the options and output of the command.

9.3.4    Tuning AdvFS

After you configure AdvFS, you may be able to tune it to improve performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.

Table 9-5 lists AdvFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 9-1 apply to AdvFS configurations.

Table 9-5:  AdvFS Tuning Guidelines

Action Performance Benefit Tradeoff
Decrease the size of the metadata buffer cache to 1 percent (Section 6.4.6) Improves performance for systems that use only AdvFS None
Increase the percentage of memory allocated for the AdvFS buffer cache (Section 9.3.4.1) Improves AdvFS performance if data reuse is high Consumes memory
Increase the number of AdvFS buffer hash chains (Section 9.3.4.2) Speeds lookup operations and decreases CPU usage Consumes memory
Increase the memory reserved for AdvFS access structures (Section 9.3.4.3) Improves AdvFS performance for systems that open and reuse files Decreases the memory available to the virtual memory subsystem and the UBC
Defragment file domains (Section 9.3.4.4) Improves read and write performance None
Increase the amount of data cached in the ready queue (Section 9.3.4.5) Improves asynchronous write performance May cause I/O spikes or increase the number of lost buffers if a crash occurs
Decrease the maximum number of I/O requests on the device queue (Section 9.3.4.6) Decreases the time to complete synchronous I/O requests and improves response time May cause I/O spikes
Decrease the I/O transfer read-ahead size (Section 9.3.4.7) Improves performance for mmap page faulting None
Disable the flushing of dirty pages mapped with the mmap function during a sync call (Section 9.3.4.8) May improve performance for applications that manage their own flushing None
Consolidate I/O transfers (Section 9.3.4.9) Improves AdvFS performance None
Force all AdvFS file writes to be synchronous (Section 9.3.4.10) Ensures that data is successfully written to disk May degrade file system performance
Prevent partial writes (Section 9.3.4.11) Ensures that system crashes do not cause partial disk writes May degrade asynchronous write performance
Move the transaction log to a fast or uncongested volume (Section 9.3.4.12) Prevents log from becoming a bottleneck None
Balance files across volumes in a file domain (Section 9.3.4.13) Improves performance and evens the future distribution of files None
Migrate frequently used or large files to different file domains (Section 9.3.4.14) Improves I/O performance None

The following sections describe the AdvFS tuning recommendations in detail.

9.3.4.1    Modifying the Size of the AdvFS Buffer Cache

The advfs subsystem attribute AdvfsCacheMaxPercent specifies the maximum percentage of physical memory that can be used to cache AdvFS file data. Caching AdvFS data can improve I/O performance only if the cached data is reused.

If data reuse is high, you may be able to improve AdvFS performance by increasing the percentage of memory allocated to the AdvFS buffer cache. To do this, increase the value of the AdvfsCacheMaxPercent attribute. The default is 7 percent of memory, and the maximum is 30 percent. If you increase the value of the AdvfsCacheMaxPercent attribute and experience no performance benefit, return to the original value.

Note that the AdvFS buffer cache cannot be more than 50 percent of the UBC.

Increasing the memory allocated to the AdvFS buffer cache will decrease the amount of memory available for processes; make sure that you do not cause excessive paging and swapping. Use the vmstat command to check virtual memory statistics, as described in Section 6.3.2.

If your workload does not reuse AdvFS data or if you have more than 2 GB of memory, you may want to decrease the size of the AdvFS buffer cache. The minimum value is 1 percent of physical memory. This can improve performance, because it decreases the overhead associated with managing the cache and also frees memory.
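As a sketch, the attribute can be examined with the sysconfig command and changed persistently through an /etc/sysconfigtab stanza. The value 10 below is illustrative, not a recommendation:

```shell
# Query the current setting of the advfs subsystem attribute:
sysconfig -q advfs AdvfsCacheMaxPercent

# Persistent change: add a stanza such as the following to /etc/sysconfigtab
# and reboot (see Section 4.4). The value 10 is illustrative only.
#
#   advfs:
#       AdvfsCacheMaxPercent = 10
```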

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.2    Increasing the Number of AdvFS Buffer Hash Chains

The hash chain table for the AdvFS buffer cache is used to locate pages of AdvFS file data in memory. The table contains a number of hash chains, which contain elements that point to pages of file system data that have already been read into memory. When a read or write system call is done for a particular offset within an AdvFS file, the system sequentially searches the appropriate hash chain to determine if the file data is already in memory.

The value of the advfs subsystem attribute AdvfsCacheHashSize specifies the number of hash chains in the table. The default value is 8192 chains or 10 percent of the size of the AdvFS buffer cache (rounded up to the next power of 2), whichever is smaller. The minimum value is 1024 chains. The maximum value is 65536 chains or the size of the AdvFS buffer cache, whichever is smaller. The AdvfsCacheMaxPercent attribute specifies the size of the AdvFS buffer cache (see Section 9.3.4.1).
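The "rounded up to the next power of 2" rule can be illustrated with a short shell sketch. The candidate value 1536 and the 8192 ceiling used below are illustrative figures, not a sizing recommendation:

```shell
# Round a candidate chain count up to the next power of 2, then take the
# smaller of that result and 8192. The candidate 1536 is a made-up example.
round_up_pow2() {
    n=1
    while [ "$n" -lt "$1" ]; do n=$((n * 2)); done
    echo "$n"
}
candidate=$(round_up_pow2 1536)                 # 1536 rounds up to 2048
default=$(( candidate < 8192 ? candidate : 8192 ))
echo "$default"
```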

If you have more than 4 GB of memory, you may want to increase the value of the AdvfsCacheHashSize attribute, which increases the number of hash chains in the table. The more hash chains in the table, the shorter each chain. Short hash chains contain fewer elements to search, which speeds searches and decreases CPU usage.

For example, you can double the default value of the AdvfsCacheHashSize attribute if the system is experiencing high CPU system time, or if a kernel profile shows a high percentage of CPU usage in the find_page routine.

Increasing the size of the AdvFS buffer cache hash table will increase the amount of kernel wired memory in the system.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.3    Increasing the Memory Reserved for Access Structures

At boot time, the system reserves a percentage of pageable memory (memory that is not wired by the kernel or applications) for AdvFS access structures. If your system opens and then reuses many files (for example, if you have a proxy server), you may be able to improve AdvFS performance by increasing the number of AdvFS access structures that the system places on the access structure free list at startup time.

AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open, and files that were opened but are now closed. Increasing the number of access structures on the free list allows more open file information (metadata) to remain in the cache, which can improve AdvFS performance if the files are reused. See Section 9.3.1.2 for more information about access structures.

Use the advfs subsystem attribute AdvfsPreallocAccess to modify the number of AdvFS access structures that the system allocates at startup time. The default and minimum values are 128 if you have a mounted AdvFS fileset. The maximum value is either 65536 or the value of the advfs subsystem attribute AdvfsAccessMaxPercent, whichever is the smallest value.

The AdvfsAccessMaxPercent attribute specifies the maximum percentage of pageable memory (malloc pool) that can be reserved for AdvFS access structures. The minimum value is 5 percent of pageable memory, and the maximum value is 95 percent. The default value is 80 percent.

Increasing the value of the AdvfsAccessMaxPercent attribute allows you to allocate more memory resources for access structures, which may improve AdvFS performance on systems that open and reuse many files. However, increasing the memory available for access structures will decrease the memory that is available to processes, which may cause excessive paging and swapping.

Decreasing the value of the AdvfsAccessMaxPercent attribute frees pageable memory but reduces the memory that can be allocated for AdvFS access structures, which may degrade AdvFS performance on systems that open and reuse many files.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.4    Defragmenting a File Domain

AdvFS attempts to store file data in a collection of contiguous blocks (a file extent) on a disk. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.

File fragmentation degrades read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space due to fragmentation.

Use the defragment utility to reduce the amount of file fragmentation in a file domain by attempting to make the files more contiguous, which reduces the number of file extents. The utility does not affect data availability and is transparent to users and applications. Striped files are not defragmented.

Use the defragment utility with the -v and -n options to show the amount of file fragmentation.

You can improve the efficiency of the defragmenting process by deleting any unneeded files in the file domain before running the defragment utility. See defragment(8) for more information.

9.3.4.5    Increasing the Amount of Data Cached in the Ready Queue

AdvFS caches asynchronous I/O requests in the AdvFS buffer cache. If the cached data is later reused, pages can be retrieved from memory, and a disk operation is avoided.

Asynchronous I/O requests are sorted in the ready queue and remain there until the size of the queue reaches the value specified by the AdvfsReadyQLim attribute or until the update daemon flushes the data. The default value of the AdvfsReadyQLim attribute is 16,384 512-byte blocks (8 MB). See Section 9.3.1.1 for more information about AdvFS queues.

You can modify the size of the ready queue for all AdvFS volumes by changing the value of the AdvfsReadyQLim attribute. You can modify the size for a specific AdvFS volume by using the chvol -t command. See chvol(8) for more information.
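As a hedged example of the per-volume form, a ready-queue limit might be set as follows. The volume device name, domain name, and block count are all hypothetical:

```shell
# Set the ready-queue limit for one volume of a domain to 32,768
# 512-byte blocks (16 MB); hypothetical device and domain names.
chvol -t 32768 /dev/disk/dsk2c my_domain
```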

If you have high data reuse (data is repeatedly read and written), you may want to increase the size of the ready queue. This can increase the number of AdvFS buffer cache hits. If you have low data reuse, you can decrease the threshold, but it is recommended that you use the default value.

If you change the size of the ready queue and performance does not improve, return to the original value.

Although you can specify a value of 0 for the AdvfsReadyQLim attribute to disable data caching in the ready queue and allow I/O requests to bypass the ready queue, this is not recommended.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.6    Decreasing the Maximum Number of I/O Requests on the Device Queue

Small, logically contiguous synchronous and asynchronous AdvFS I/O requests are consolidated into larger I/O requests on the device queue, before they are sent to the device driver. See Section 9.3.1.1 for more information about AdvFS queues.

The AdvfsMaxDevQLen attribute controls the maximum number of I/O requests on the device queue. When the number of requests on the queue exceeds this value, only synchronous requests are accepted onto the device queue. The default value of the AdvfsMaxDevQLen attribute is 24 requests.

Although the default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations, you may need to modify this value. Increase the default value of the AdvfsMaxDevQLen attribute only if devices are not being kept busy. A guideline is to specify a value for the AdvfsMaxDevQLen attribute that is less than or equal to the average number of I/O operations that can be performed in 0.5 seconds.

Make sure that increasing the size of the device queue does not cause a decrease in response time. To calculate response time, multiply the value of the AdvfsMaxDevQLen attribute by the average I/O latency time for your disks.
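The response-time check described above is simple arithmetic. The 10 ms average latency figure below is a made-up example; substitute the measured latency of your disks:

```shell
# Worst-case synchronous response time is roughly the device queue length
# multiplied by the average per-request latency. Both figures are illustrative.
queue_len=24       # AdvfsMaxDevQLen default
latency_ms=10      # assumed average disk I/O latency
response_ms=$((queue_len * latency_ms))
echo "${response_ms} ms"
```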

Decreasing the size of the device queue decreases the amount of time it takes to complete a synchronous (blocking) I/O operation and can improve response time.

If you do not want to limit the number of requests on the device queue, set the value of the AdvfsMaxDevQLen attribute to 0 (zero), although this behavior is not recommended.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.7    Decreasing the I/O Transfer Size

AdvFS reads and writes data in units of a fixed number of 512-byte blocks. The default transfer size depends on the disk driver's reported preferred transfer size. For example, a common default value is either 128 blocks or 256 blocks.

Use the chvol command with the -r option to change the read-ahead size. You may be able to improve performance for mmap page faulting and reduce read-ahead paging and cache dilution by decreasing the read-ahead size.

Use the chvol command with the -w option to change the write-consolidation size. See chvol(8) for more information.

If the disk is fragmented so that the pages of a file are not sequentially allocated, reduce fragmentation by using the defragment utility. See defragment(8) for more information.

9.3.4.8    Disabling the Flushing of Modified mmapped Pages

The AdvFS buffer cache can contain modified data due to a write system call or a memory write reference after an mmap system call. The update daemon runs every 30 seconds and issues a sync call for every fileset mounted with read and write access.

The AdvfsSyncMmapPages attribute controls whether modified (dirty) mmapped pages are flushed to disk during a sync system call. If the AdvfsSyncMmapPages attribute is set to 1 (the default), the modified mmapped pages are asynchronously written to disk. If the AdvfsSyncMmapPages attribute is set to 0, modified mmapped pages are not written to disk during a sync system call.

If your applications manage their own mmap page flushing, set the value of the AdvfsSyncMmapPages attribute to zero.

See mmap(2) and msync(2) for more information.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.3.4.9    Consolidating I/O Transfers

By default, AdvFS consolidates a number of I/O transfers into a single, large I/O transfer, which can improve AdvFS performance. If consolidation has been disabled on a volume, use the chvol command with the -c on option to re-enable it.

It is recommended that you not disable the consolidation of I/O transfers. See chvol(8) for more information.

9.3.4.10    Forcing Synchronous Writes

By default, asynchronous write requests are cached in the AdvFS buffer cache, and the write system call then returns a success value. The data is written to disk at a later time (asynchronously).

You can use the chfile -l on command to force all write requests to a specified AdvFS file to be synchronous. If you enable forced synchronous writes on a file, data must be successfully written to disk before the write system call will return a success value. This behavior is similar to the behavior associated with a file that has been opened with the O_SYNC option; however, forcing synchronous writes persists across open() calls.

Forcing all writes to a file to be synchronous ensures that the write has completed when the write system call returns a success value. However, it may degrade performance.

A file cannot have both forced synchronous writes enabled and atomic write data logging enabled. See Section 9.3.4.11 for more information.

Use the chfile command to determine whether forced synchronous writes or atomic write data logging is enabled. Use the chfile -l off command to disable forced synchronous writes (the default behavior).
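A hedged example of the commands above; the file path is hypothetical:

```shell
# Enable forced synchronous writes on a file, display its current
# settings, then restore the default behavior; hypothetical path throughout.
chfile -l on /mnt/fset1/datafile
chfile /mnt/fset1/datafile        # shows whether -l or -L is enabled
chfile -l off /mnt/fset1/datafile
```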

9.3.4.11    Preventing Partial Data Writes

AdvFS writes data to disk in 8-KB chunks. By default, and in accordance with POSIX standards, AdvFS does not guarantee that all or part of the data will actually be written to disk if a crash occurs during or immediately after the write. For example, if the system crashes during a write that consists of two 8-KB chunks of data, only a portion (anywhere from 0 to 16 KB) of the total write may have succeeded. This can result in partial data writes and inconsistent data.

To prevent partial writes if a system crash occurs, use the chfile -L on command to enable atomic write data logging for a specified file.

By default, each file domain has a transaction log file that tracks fileset activity and ensures that AdvFS can maintain a consistent view of the file system metadata if a crash occurs. If you enable atomic write data logging on a file, data from a write call will be written to the transaction log file before it is written to disk. If a system crash occurs during or immediately after the write call, upon recovery, the data in the log file can be used to reconstruct the write. This guarantees that each 8-KB chunk of a write either is completely written to disk or is not written to disk.

For example, if atomic write data logging is enabled and a crash occurs during a write that consists of two 8-KB chunks of data, the write can have three possible states: none of the data is written, 8 KB of the data is written, or 16 KB of data is written.

Atomic write data logging may degrade AdvFS write performance because of the extra write to the transaction log file. In addition, a file that has atomic write data logging enabled cannot be memory mapped by using the mmap system call.

A file cannot have both forced synchronous writes enabled (see Section 9.3.4.10) and atomic write data logging enabled. However, you can enable atomic write data logging on a file and also open the file with an O_SYNC option. This ensures that the write is synchronous, but also prevents partial writes if a crash occurs.

Use the chfile command to determine if forced synchronous writes or atomic write data logging is enabled. Use the chfile -L off command to disable atomic write data logging (the default).

To enable atomic write data logging on AdvFS files that are NFS mounted, the NFS property list daemon, proplistd, must be running on the NFS client and the fileset must be mounted on the client by using the mount command's proplist option.

If atomic write data logging is enabled and you are writing to a file that has been NFS mounted, the offset into the file must be on an 8-KB page boundary, because NFS performs I/O on 8-KB page boundaries.

You can also activate and deactivate atomic write data logging by using the fcntl system call. In addition, both the chfile command and the fcntl system call can be used on an NFS client to activate or deactivate this feature on a file that resides on the NFS server.

9.3.4.12    Moving the Transaction Log

Make sure that the AdvFS transaction log resides on an uncongested disk and bus; otherwise, performance may be degraded.

Use the showfdmn command to determine the current location of the transaction log. In the showfdmn command display, the letter L displays next to the volume that contains the log.

If the transaction log becomes a bottleneck, use the switchlog command to relocate the transaction log of the specified file domain to a faster or less congested volume in the same domain. See switchlog(8) and showfdmn(8) for more information.
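As a sketch of the procedure, with a hypothetical domain name (the exact argument form for selecting the target volume is an assumption here; see switchlog(8) for the authoritative syntax):

```shell
# Locate the transaction log: the volume flagged with "L" in the Vol
# column of the showfdmn display holds the log.
showfdmn my_domain

# Move the log to another volume in the same domain (hypothetical
# domain name and volume index).
switchlog my_domain 2
```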

In addition, you can divide the file domain into several smaller file domains. This will cause each domain's transaction log to handle transactions for fewer filesets.

9.3.4.13    Balancing a Multivolume File Domain

If the files in a multivolume domain are not evenly distributed, performance may be degraded. Use the balance utility to distribute the percentage of used space evenly between volumes in a multivolume file domain. This improves performance and evens the distribution of future file allocations. Files are moved from one volume to another until the percentage of used space on each volume in the domain is as equal as possible.

The balance utility does not affect data availability and is transparent to users and applications. If possible, use the defragment utility before you balance files.

The balance utility does not generally split files. Therefore, file domains with very large files may not balance as evenly as file domains with smaller files. See balance(8) for more information.

To determine if you need to balance your files across volumes, use the showfdmn command to display information about the volumes in a domain. The % used field shows the percentage of volume space that is currently allocated to files or metadata (fileset data structure). See showfdmn(8) for more information.
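For example, you can gauge the imbalance by comparing the % used figures across the volumes in the showfdmn display. The following shell fragment (the percentages are hypothetical sample values, not from a real system) computes the spread between the most-full and least-full volumes; a large spread suggests running the balance utility:

```shell
# Hypothetical "% used" figures for three volumes in one file domain.
used_percentages="85 15 40"

# Compute the spread between the most-full and least-full volumes.
spread=$(echo "$used_percentages" | awk '{
    min = max = $1
    for (i = 2; i <= NF; i++) {
        if ($i < min) min = $i
        if ($i > max) max = $i
    }
    print max - min
}')
echo "spread=${spread}%"
```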

9.3.4.14    Migrating Files Within a File Domain

Performance may degrade if too many frequently accessed or large files reside on the same volume in a multivolume file domain. You can improve I/O performance by altering the way files are mapped on the disk.

Use the migrate utility to move frequently accessed or large files to different volumes in the file domain. You can specify the volume where a file is to be moved, or allow the system to pick the best space in the file domain. You can migrate either an entire file or specific pages to a different volume.

Using the balance utility after migrating files may cause the files to move to a different volume. See balance(8) for more information.

In addition, a file that is migrated is defragmented at the same time, if possible. Defragmentation makes the file more contiguous, which improves performance. Therefore, you can use the migrate command to defragment selected files. See migrate(8) for more information. Use the iostat command to identify which disks are being heavily used. See Section 8.2.1 for information.

9.4    Managing UNIX File System Performance

The UNIX file system (UFS) can provide high-performance file system operations, especially for critical applications. For example, UFS file reads from striped disks can be 50 percent faster than the equivalent reads with AdvFS, while consuming only 20 percent of the CPU power that AdvFS requires.

However, unlike AdvFS, the UFS file system directory hierarchy is bound tightly to a single disk partition.

The following sections describe how to perform these tasks:

9.4.1    UFS Configuration Guidelines

A number of parameters can improve UFS performance. You can set all of these parameters when you use the newfs command to create a file system. For existing file systems, you can modify some of the parameters by using the tunefs command. See newfs(8) and tunefs(8) for more information.

Table 9-6 describes UFS configuration guidelines and performance benefits as well as tradeoffs.

Table 9-6:  UFS Configuration Guidelines

Action Performance Benefit Tradeoff
Make the file system fragment size equal to the block size (Section 9.4.1.1) Improves performance for large files Wastes disk space for small files
Use the default file system fragment size of 1 KB (Section 9.4.1.1) Uses disk space efficiently Increases the overhead for large files
Reduce the density of inodes on a file system (Section 9.4.1.2) Frees disk space for file data and improves large file performance Reduces the number of files that can be created on the file system
Allocate blocks sequentially (Section 9.4.1.3) Improves performance for disks that do not have a read ahead cache Reduces the total available disk space
Increase the number of blocks combined for a cluster (Section 9.4.1.4) May decrease number of disk I/O operations May require more memory to buffer data
Use a Memory File System (MFS) (Section 9.4.1.5) Improves I/O performance Does not ensure data integrity because of cache volatility
Use disk quotas (Section 9.4.1.6) Controls disk space utilization UFS quotas may result in a slight increase in reboot time

The following sections describe the UFS configuration guidelines in detail.

9.4.1.1    Modifying the UFS Fragment Size

The UFS file system block size is 8 KB. The default fragment size is 1 KB. You can use the newfs command to modify the fragment size to 1, 2, 4, or 8 KB; the fragment size must be a power of 2 that is no smaller than one-eighth of the block size.

Although the default fragment size uses disk space efficiently, it increases the overhead for large files. If the average file in a file system is larger than 16 KB but less than 96 KB, you may be able to improve disk access time and decrease system overhead by making the file system fragment size equal to the default block size (8 KB).
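The tradeoff can be illustrated with simple arithmetic: a file is allocated in whole fragments, so a small file wastes most of its last fragment when the fragment size is large. The following sketch uses a hypothetical 3-KB file:

```shell
# Illustrative arithmetic only: space allocated to a small file under
# two fragment sizes.
file_kb=3          # hypothetical 3-KB file

alloc() {          # round a file size up to a whole number of fragments
    file=$1 frag=$2
    echo $(( (file + frag - 1) / frag * frag ))
}

echo "1-KB fragments allocate $(alloc $file_kb 1) KB"   # no waste
echo "8-KB fragments allocate $(alloc $file_kb 8) KB"   # 5 KB wasted
```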

See newfs(8) for more information.

9.4.1.2    Reducing the Density of inodes

An inode describes an individual file in the file system. The maximum number of files in a file system depends on the number of inodes and the size of the file system. The system creates an inode for each 4 KB (4096 bytes) of data space in a file system.

If a file system will contain many large files and you are sure that you will not create a file for each 4 KB of space, you can reduce the density of inodes on the file system. This will free disk space for file data, but will reduce the number of files that can be created.

To do this, use the newfs -i command to specify the amount of data space allocated for each inode. See newfs(8) for more information.
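For example, the approximate inode count can be computed by dividing the file system size by the bytes-per-inode value. The figures below are samples (the 400-MB size matches the dumpfs example later in this chapter); verify the option syntax in newfs(8):

```shell
# Rough inode-count arithmetic for newfs -i (sample figures).
fs_size_kb=409600          # a 400-MB file system
default_bpi=4096           # default: one inode per 4 KB of data space
sparse_bpi=16384           # hypothetical newfs -i 16384 setting

default_inodes=$(( fs_size_kb * 1024 / default_bpi ))
sparse_inodes=$(( fs_size_kb * 1024 / sparse_bpi ))

echo "default (-i 4096): $default_inodes inodes"
echo "newfs -i 16384:    $sparse_inodes inodes"
```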

9.4.1.3    Allocating Blocks Sequentially

The UFS rotdelay parameter specifies the time, in milliseconds, to service a transfer completion interrupt and initiate a new transfer on the same disk. You can set the rotdelay parameter to 0 (the default) to allocate blocks sequentially. This is useful for disks that do not have a read-ahead cache. However, it will reduce the total amount of available disk space.

Use either the tunefs command or the newfs command to modify the rotdelay value. See newfs(8) and tunefs(8) for more information.

9.4.1.4    Increasing the Number of Blocks Combined for a Cluster

The value of the UFS maxcontig parameter specifies the number of blocks that can be combined into a single cluster (or file-block group). The default value of maxcontig is 8. The file system attempts I/O operations in a size that is determined by the value of maxcontig multiplied by the block size (8 KB).

Device drivers that can chain several buffers together in a single transfer should use a maxcontig value that is equal to the maximum chain length. This may reduce the number of disk I/O operations. However, more memory will be needed to buffer data.
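The resulting I/O unit is simple arithmetic, as the following sketch shows (the maxcontig value of 16 for a chaining driver is hypothetical):

```shell
# I/O unit = maxcontig * block size (8 KB).
bsize_kb=8                           # UFS block size in KB
default_unit=$(( 8 * bsize_kb ))     # maxcontig=8, the default
chained_unit=$(( 16 * bsize_kb ))    # hypothetical driver chaining 16 buffers

echo "maxcontig=8:  ${default_unit} KB per I/O"
echo "maxcontig=16: ${chained_unit} KB per I/O"
```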

Use the tunefs command or the newfs command to change the value of maxcontig. See newfs(8) and tunefs(8) for more information.

9.4.1.5    Using a Memory File System

Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS file system can improve read/write performance, but it is a volatile cache. The contents of an MFS file system are lost after a reboot, unmount operation, or power failure.

Because no data is written to disk, an MFS file system is a very fast file system and can be used to store temporary files or read-only files that are loaded into it after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS file system to cache the temporary files that are created during the build and reduce the build time.

See mfs(8) for information.

9.4.1.6    Using UFS Disk Quotas

You can specify UFS file system limits for user accounts and for groups by setting up file system quotas, also known as disk quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.

You may want to set quotas on file systems that contain home directories, because the sizes of these file systems can increase more significantly than other file systems. Do not set quotas on the /tmp file system.

Note that, unlike AdvFS quotas, UFS quotas may cause a slight increase in reboot time. For information about AdvFS quotas, see Section 9.3.2.5.

For information about UFS quotas, see the System Administration manual.

9.4.2    Gathering UFS Information

Table 9-7 describes the tools you can use to obtain information about UFS.

Table 9-7:  UFS Monitoring Tools

Name Use Description

dumpfs

Displays UFS information (Section 9.4.2.1)

Displays detailed information about a UFS file system or a special device, including information about the file system fragment size, the percentage of free space, super blocks, and the cylinder groups.

(dbx) print ufs_clusterstats

Reports UFS clustering statistics (Section 9.4.2.2)

Reports statistics on how the system is performing cluster read and write transfers.

(dbx) print bio_stats

Reports UFS metadata buffer cache statistics (Section 9.4.2.3)

Reports statistics on the metadata buffer cache, including superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries.

fsx

Exercises file systems

Exercises UFS and AdvFS file systems by creating, opening, writing, reading, validating, closing, and unlinking a test file. Errors are written to a log file. See fsx(8) for more information.

The following sections describe some of these commands in detail.

9.4.2.1    Displaying UFS Information by Using the dumpfs Command

The dumpfs command displays UFS information, including super block and cylinder group information, for a specified file system. Use this command to obtain information about the file system fragment size and the minimum free space percentage.

The following example shows part of the output of the dumpfs command:

# /usr/sbin/dumpfs /devices/disk/rr3zg | more
magic   11954   format  dynamic time    Tue Sep 14 15:46:52 1998
nbfree  21490   ndir    9       nifree  99541   nffree  60
ncg     65      ncyl    1027    size    409600  blocks  396062
bsize   8192    shift   13      mask    0xffffe000
fsize   1024    shift   10      mask    0xfffffc00
frag    8       shift   3       fsbtodb 1
cpg     16      bpg     798     fpg     6384    ipg     1536
minfree 10%     optim   time    maxcontig 8     maxbpg  2048
rotdelay 0ms    headswitch 0us  trackseek 0us   rps     60

The information contained in the first few lines is relevant for tuning. Of specific interest are the fragment size (fsize), the minimum free space percentage (minfree), the cluster size (maxcontig), and the rotational delay (rotdelay).

See Section 9.4.3 for information about tuning UFS.

9.4.2.2    Monitoring UFS Clustering by Using the dbx Debugger

To determine how efficiently the system is performing cluster read and write transfers, use the dbx print command to examine the ufs_clusterstats data structure.

The following example shows a system that is not clustering efficiently:

# /usr/ucb/dbx -k /vmunix /dev/mem 
(dbx) print ufs_clusterstats
struct {
    full_cluster_transfers = 3130
    part_cluster_transfers = 9786
    non_cluster_transfers = 16833
    sum_cluster_transfers = {
        [0] 0
        [1] 24644
        [2] 1128
        [3] 463
        [4] 202
        [5] 55
        [6] 117
        [7] 36
        [8] 123
        [9] 0
    }
}
(dbx)

The preceding example shows 24644 single-block transfers and no 9-block transfers. A single block is 8 KB. The trend of the data shown in the example is the reverse of what you want to see: a large number of single-block transfers and a declining number of multiblock (2 through 9 block) transfers. However, if the files are all small, this may be the best blocking that you can achieve.
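You can quantify the clustering behavior from the structure shown above. The following fragment computes the percentage of transfers that used clustering at all, using the figures from the example:

```shell
# Figures from the ufs_clusterstats example above.
full=3130; part=9786; non=16833

total=$(( full + part + non ))
clustered_pct=$(awk -v c=$(( full + part )) -v t=$total \
    'BEGIN { printf "%.0f", 100 * c / t }')
echo "clustered transfers: ${clustered_pct}%"
```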

You can examine the cluster reads and writes separately with the ufs_clusterstats_read and ufs_clusterstats_write data structures.

See Section 9.4.3 for information on tuning UFS.

9.4.2.3    Checking the Metadata Buffer Cache by Using the dbx Debugger

The metadata buffer cache contains UFS file metadata--superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries. To check the metadata buffer cache, use the dbx print command to examine the bio_stats data structure.

Consider the following example:


# /usr/ucb/dbx -k /vmunix /dev/mem 
(dbx) print bio_stats
struct {
    getblk_hits = 4590388
    getblk_misses = 17569
    getblk_research = 0
    getblk_dupbuf = 0
    getnewbuf_calls = 17590
    getnewbuf_buflocked = 0
    vflushbuf_lockskips = 0
    mntflushbuf_misses = 0
    mntinvalbuf_misses = 0
    vinvalbuf_misses = 0
    allocbuf_buflocked = 0
    ufssync_misses = 0
}
(dbx)

If the miss rate is high, you may want to raise the value of the bufcache attribute. The number of block misses (getblk_misses) divided by the sum of block misses and block hits (getblk_hits) should not be more than 3 percent.
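For example, applying the miss-rate formula to the bio_stats figures shown above:

```shell
# Figures from the bio_stats example above.
getblk_hits=4590388
getblk_misses=17569

# Miss rate = misses / (misses + hits); it should not exceed 3 percent.
miss_pct=$(awk -v m=$getblk_misses -v h=$getblk_hits \
    'BEGIN { printf "%.2f", 100 * m / (m + h) }')
echo "metadata cache miss rate: ${miss_pct}%"
```

Here the miss rate is well under 3 percent, so increasing the bufcache attribute would not be warranted.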

See Section 9.4.3.1 for information on how to tune the metadata buffer cache.

9.4.3    Tuning UFS

After you configure your UFS file systems, you may be able to improve UFS performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.

Table 9-8 describes UFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 9-1 apply to UFS configurations.

Table 9-8:  UFS Tuning Guidelines

Action Performance Benefit Tradeoff
Increase size of metadata buffer cache to more than 3 percent of main memory (Section 9.4.3.1) Increases cache hit rate and improves UFS performance Requires additional memory resources
Increase the size of the metadata hash chain table (Section 9.4.3.2) Improves UFS lookup speed Increases wired memory
Defragment the file system (Section 9.4.3.3) Improves read and write performance Requires down time
Delay flushing full write buffers to disk (Section 9.4.3.4) Frees CPU cycles May degrade real-time workload performance when buffers are flushed
Increase number of blocks combined for read ahead (Section 9.4.3.5) May reduce disk I/O operations May require more memory to buffer data
Increase number of blocks combined for a cluster (Section 9.4.3.6) May decrease disk I/O operations Reduces available disk space
Increase the smooth sync caching threshold for asynchronous UFS I/O requests (Section 9.4.3.7) Improves performance of UFS asynchronous I/O None
Increase the maximum number of UFS and MFS mounts (Section 9.4.3.8) Allows more mounted file systems Requires additional memory resources

The following sections describe how to tune UFS in detail.

9.4.3.1    Modifying the Size of the Metadata Buffer Cache

At boot time, the kernel wires a percentage of physical memory for the metadata buffer cache, which temporarily holds recently accessed UFS and CD-ROM File System (CDFS) metadata. The vfs subsystem attribute bufcache specifies the size of the metadata buffer cache as a percentage of physical memory. The default is 3 percent.

Usually, you do not have to increase the cache size. However, you may want to increase the size of the metadata buffer cache if you reuse data and have a high cache miss rate (low hit rate).

To determine whether to increase the size of the metadata buffer cache, use the dbx print command to examine the bio_stats data structure. The miss rate (block misses divided by the sum of the block misses and block hits) should not be more than 3 percent.

If you have a general-purpose timesharing system, do not increase the value of the bufcache attribute to more than 10 percent. If you have an NFS server that does not perform timesharing, do not increase the value of the bufcache attribute to more than 35 percent.

Allocating additional memory to the metadata buffer cache reduces the amount of memory available to processes and the UBC. See Section 6.1.2.1 for information about how memory is allocated to the metadata buffer cache.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.4.3.2    Increasing the Size of the Metadata Hash Chain Table

The hash chain table for the metadata buffer cache stores the heads of the hashed buffer queues. Increasing the size of the hash chain table distributes the buffers, which makes average chain lengths short. This can improve lookup speeds. However, increasing the size of the hash chain table increases wired memory.

The vfs subsystem attribute buffer-hash-size specifies the size of the hash chain table, in table entries, for the metadata buffer cache. The minimum size is 16; the maximum size is 524287. The default value is 512.

You can modify the value of the buffer-hash-size attribute so that each hash chain has 3 or 4 buffers. To determine a value for the buffer-hash-size attribute, use the dbx print command to examine the value of the nbuf kernel variable, then divide the value by 3 or 4, and finally round the result to a power of 2. For example, if nbuf has a value of 360, dividing 360 by 3 gives you a value of 120. Based on this calculation, specify 128 (2 to the power of 7) as the value of the buffer-hash-size attribute.
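This calculation can be sketched in shell. Rounding here means moving up to the next power of 2 at or above the quotient, which matches the 360-buffer example in the text:

```shell
# Sketch of the buffer-hash-size calculation: nbuf / 3, rounded up
# to a power of 2 (minimum table size is 16).
nbuf=360

target=$(( nbuf / 3 ))           # 120 for the example value
size=16
while [ $size -lt $target ]; do
    size=$(( size * 2 ))
done
echo "buffer-hash-size = $size"
```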

See Section 4.4 for information about modifying kernel subsystem attributes.

9.4.3.3    Defragmenting a File System

When a file consists of noncontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance, because it requires more I/O operations to access the file.

You can determine whether the files in a file system are fragmented by determining how effectively the system is clustering. You can do this by using the dbx print command to examine the ufs_clusterstats, ufs_clusterstats_read, and ufs_clusterstats_write data structures. See Section 9.4.2.2 for information.

UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not being particularly effective, the files in the file system may be very fragmented.

To defragment a UFS file system, follow these steps:

  1. Back up the file system onto tape or another partition.

  2. Create a new file system either on the same partition or a different partition.

  3. Restore the file system.

See the System Administration manual for information about backing up and restoring data and creating UFS file systems.

9.4.3.4    Delaying Full Write Buffer Flushing

You can free CPU cycles by delaying flushing full write buffers to disk until the next sync call (or until the percentage of UBC dirty pages reaches the value of the delay_wbuffers_percent kernel variable). However, delaying write buffer flushing may adversely affect real-time workload performance, because the system will experience a heavy I/O load at sync time.

To delay full write buffer flushing, use the dbx patch command to set the value of the delay_wbuffers kernel variable to 1 (enabled). The default value of delay_wbuffers is 0 (disabled).

See Section 4.4.6 for information on using dbx.

9.4.3.5    Increasing the Number of Blocks Combined for Read Ahead

You can increase the number of blocks that are combined for a read-ahead operation.

To do this, use the dbx patch command to set the value of the cluster_consec_init kernel variable equal to the value of the cluster_max_read_ahead kernel variable (the default is 8), which specifies the maximum number of read-ahead clusters that the kernel can schedule.

In addition, you must make sure that cluster read operations are enabled on both read-ahead and non-read-ahead blocks. To do this, use dbx to set the value of the cluster_read_all kernel variable to 1, which is the default value.

See Section 4.4.6 for information on using dbx.

9.4.3.6    Increasing the Number of Blocks Combined for a Cluster

The cluster_maxcontig kernel variable specifies the number of blocks that are combined into a single I/O operation. The default value is 8. Contiguous writes are done in a unit size that is determined by the file system block size (8 KB) multiplied by the value of the cluster_maxcontig parameter.

See Section 4.4.6 for information about using dbx.

9.4.3.7    Modifying UFS Smooth Sync Caching

Smooth sync functionality improves UFS asynchronous I/O performance by preventing I/O spikes caused by the update daemon and by increasing the UBC hit rate, which decreases the total number of disk operations. Smooth sync also helps to efficiently distribute I/O requests over the sync interval, which decreases the length of the disk queue and reduces the latency that results from waiting for a busy page to be freed. By default, smooth sync is enabled on your system.

UFS caches asynchronous I/O requests in the dirty block queue and in the UBC object dirty page list queue before they are handed to the device driver. With smooth sync enabled (the default), the update daemon will not flush the dirty page list and dirty page wired list buffers. Instead, asynchronous I/O requests remain in the queue for the amount of time specified by the value of the vfs attribute smoothsync_age (the default is 30 seconds). When a buffer ages sufficiently, it is moved to the device queue.

If smooth sync is disabled, every 30 seconds the update daemon flushes data from memory to disk, regardless of how long a buffer has been cached.

Smooth sync functionality is controlled by the smoothsync_age attribute. However, you do not specify a value for smoothsync_age in the /etc/sysconfigtab file. Instead, the /etc/inittab file is used to enable smooth sync when the system boots to multiuser mode and to disable smooth sync when the system goes from multiuser mode to single-user mode. This procedure is necessary to reflect the behavior of the update daemon, which operates only in multiuser mode.

To enable smooth sync, the following lines must be included in the /etc/inittab file and the time limit for caching buffers in the smooth sync queue must be specified:

smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1
smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1

Thirty seconds is the default smooth sync queue threshold. If you increase this value, you may improve the chance of a buffer cache hit by retaining buffers on the smooth sync queue for a longer period of time. Conversely, decreasing the value of the smoothsync_age attribute speeds the flushing of buffers.

To disable smooth sync, specify a value of 0 (zero) for the smoothsync_age attribute.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.4.3.8    Increasing the Number of UFS and MFS Mounts

Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made. The vfs subsystem attribute max-ufs-mounts specifies the maximum number of UFS and MFS mounts on the system.

You can increase the value of the max-ufs-mounts attribute if your system will have more than the default limit of 1000 mounts. However, increasing the maximum number of UFS and MFS mounts requires memory resources for the additional mounts.

See Section 4.4 for information about modifying kernel subsystem attributes.

9.5    Managing NFS Performance

The Network File System (NFS) shares the UBC with the virtual memory subsystem and local file systems. NFS can put an extreme load on the network. Poor NFS performance is almost always a problem with the network infrastructure. Look for high counts of retransmitted messages on the NFS clients, network I/O errors, and routers that cannot maintain the load.

Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.

You can monitor NFS by using the nfsstat and other commands. When evaluating NFS performance, remember that NFS does not perform well if any file-locking mechanisms are in use on an NFS file. The locks prevent the file from being cached on the client. See nfsstat(8) for more information.

The following sections describe how to perform the following tasks:

9.5.1    Gathering NFS Information

Table 9-9 describes the commands you can use to obtain information about NFS operations.

Table 9-9:  NFS Monitoring Tools

Name Use Description

nfsstat

Displays network and NFS statistics (Section 9.5.1.1)

Displays NFS and RPC statistics for clients and servers, including the number of packets that had to be retransmitted (retrans) and the number of times a reply transaction ID did not match the request transaction ID (badxid).

nfswatch

Monitors an NFS server

Monitors all incoming network traffic to an NFS server and divides it into several categories, including NFS reads and writes, NIS requests, and RPC authorizations. Your kernel must be configured with the packetfilter option to use the command. See nfswatch(8) and packetfilter(7) for more information.

ps axlmp

Displays information about idle threads (Section 9.5.1.2)

Displays information about idle threads on a client system.

(dbx) print nfs_sv_active_hist

Displays active NFS server threads (Section 4.4.6)

Displays a histogram of the number of active NFS server threads.

(dbx) print nchstats

Displays the hit rate (Section 9.1.2)

Displays the namei cache hit rate.

(dbx) print bio_stats

Displays metadata buffer cache information (Section 9.4.2.3)

Reports statistics on the metadata buffer cache hit rate.

(dbx) print vm_perfsum

Reports UBC statistics (Section 6.3.5)

Reports the UBC hit rate.

The following sections describe how to use some of these tools in detail.

9.5.1.1    Displaying NFS Information by Using the nfsstat Command

The nfsstat command displays statistical information about NFS and Remote Procedure Call (RPC) interfaces in the kernel. You can also use this command to reinitialize the statistics.

An example of the nfsstat command is as follows:

# /usr/ucb/nfsstat
 
Server rpc:
calls     badcalls  nullrecv   badlen   xdrcall
38903     0         0          0        0
 
Server nfs:
calls     badcalls
38903     0
 
Server nfs V2:
null      getattr   setattr    root     lookup     readlink   read
5  0%     3345  8%  61  0%     0  0%    5902 15%   250  0%    1497  3%
wrcache   write     create     remove   rename     link       symlink
0  0%     1400  3%  549  1%    1049  2% 352  0%    250  0%    250  0%
mkdir     rmdir     readdir    statfs
171  0%   172  0%   689  1%    1751  4%
 
Server nfs V3:
null      getattr   setattr    lookup    access    readlink   read
0  0%     1333  3%  1019  2%   5196 13%  238  0%   400  1%    2816  7%
write     create    mkdir      symlink   mknod     remove     rmdir
2560  6%  752  1%   140  0%    400  1%   0  0%     1352  3%   140  0%
rename    link      readdir    readdir+  fsstat    fsinfo     pathconf
200  0%   200  0%   936  2%    0  0%     3504  9%  3  0%      0  0%
commit
21  0%
 
Client rpc:
calls     badcalls  retrans    badxid    timeout   wait       newcred
27989     1         0          0         1         0          0
badverfs  timers
0         4
 
Client nfs:
calls     badcalls  nclget     nclsleep
27988     0         27988      0
 
Client nfs V2:
null      getattr   setattr    root      lookup    readlink   read
0  0%     3414 12%  61  0%     0  0%     5973 21%  257  0%    1503  5%
wrcache   write     create     remove    rename    link       symlink
0  0%     1400  5%  549  1%    1049  3%  352  1%   250  0%    250  0%
mkdir     rmdir     readdir    statfs
171  0%   171  0%   713  2%    1756  6%
 
Client nfs V3:
null      getattr   setattr    lookup    access    readlink   read
0  0%     666  2%   9  0%      2598  9%  137  0%   200  0%    1408  5%
write     create    mkdir      symlink   mknod     remove     rmdir
1280  4%  376  1%   70  0%     200  0%   0  0%     676  2%    70  0%
rename    link      readdir    readdir+  fsstat    fsinfo     pathconf
100  0%   100  0%   468  1%    0  0%     1750  6%  1  0%      0  0%
commit
10  0%
# 
 

The ratio of timeouts to calls (which should not exceed 1 percent) is the most important thing to look for in the NFS statistics. A timeout-to-call ratio greater than 1 percent can have a significant negative impact on performance. See Chapter 10 for information on how to tune your system to avoid timeouts.
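For example, computing the ratio from the client rpc figures in the preceding nfsstat output:

```shell
# Client rpc figures from the nfsstat example above.
calls=27989
timeouts=1

# Timeout-to-call ratio; values over 1 percent deserve attention.
ratio=$(awk -v t=$timeouts -v c=$calls \
    'BEGIN { printf "%.3f", 100 * t / c }')
echo "timeout ratio: ${ratio}%"
```

A ratio this small is negligible; a value approaching or exceeding 1 percent would point to network or server problems.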

Use the nfsstat -s -i 10 command to display NFS and RPC information at ten-second intervals.

If you are attempting to monitor an experimental situation with nfsstat, reset the NFS counters to 0 before you begin the experiment. Use the nfsstat -z command to clear the counters.

See nfsstat(8) for more information about command options and output.

9.5.1.2    Displaying Idle Thread Information by Using the ps Command

On a client system, the nfsiod daemons spawn several I/O threads to service asynchronous I/O requests to the server. The I/O threads improve the performance of both NFS reads and writes. The optimum number of I/O threads depends on many variables, such as how quickly the client will be writing, how many files will be accessed simultaneously, and the characteristics of the NFS server. For most clients, seven threads are sufficient.

The following example uses the ps axlmp command to display idle I/O threads on a client system:

# /usr/ucb/ps axlmp 0 | grep nfs
 
 0  42   0            nfsiod_  S                 0:00.52                 
 0  42   0            nfsiod_  S                 0:01.18                 
 0  42   0            nfsiod_  S                 0:00.36                 
 0  44   0            nfsiod_  S                 0:00.87                 
 0  42   0            nfsiod_  S                 0:00.52                 
 0  42   0            nfsiod_  S                 0:00.45                 
 0  42   0            nfsiod_  S                 0:00.74                 
 
# 

The preceding output shows a sufficient number of sleeping threads. Server threads that are started by the nfsd daemon appear in similar output, with nfsiod_ replaced by nfs_tcp or nfs_udp.

If your output shows that few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See Section 9.5.2.2, Section 9.5.2.3, nfsiod(8), and nfsd(8) for more information.

9.5.2    Improving NFS Performance

Improving performance on a system that is used only for serving NFS differs from tuning a system that is used for general timesharing, because an NFS server runs only a few small user-level programs, which consume few system resources. There is minimal paging and swapping activity, so memory resources should be focused on caching file system data.

File system tuning is important for NFS because processing NFS requests consumes the majority of CPU and wall clock time. Ideally, the UBC hit rate should be high. Increasing the UBC hit rate can require additional memory or a reduction in the size of other file system caches. In general, file system tuning will improve the performance of I/O-intensive user applications.

In addition, a vnode must exist to keep file data in the UBC. If you are using AdvFS, an access structure is also required to keep file data in the UBC.

If you are running NFS over TCP, tuning TCP may improve performance if there are many active clients. However, if you are running NFS over UDP, no network tuning is needed. See Section 10.2 for more information.

Table 9-10 lists NFS tuning and performance-improvement guidelines and the benefits as well as tradeoffs.

Table 9-10:  NFS Performance Guidelines

Action: Set the value of the maxusers attribute to the number of server NFS operations that are expected to occur each second (Section 5.1)
Benefit: Provides the appropriate level of system resources
Tradeoff: Consumes memory

Action: Increase the size of the namei cache (Section 9.2.1)
Benefit: Improves file system performance
Tradeoff: Consumes memory

Action: Increase the number of AdvFS access structures, if you are using AdvFS (Section 9.3.4.3)
Benefit: Improves AdvFS performance
Tradeoff: Consumes memory

Action: Increase the size of the metadata buffer cache, if you are using UFS (Section 9.4.3.1)
Benefit: Improves UFS performance
Tradeoff: Consumes wired memory

Action: Use Prestoserve (Section 9.5.2.1)
Benefit: Improves synchronous write performance for NFS servers
Tradeoff: Cost

Action: Configure the appropriate number of threads on an NFS server (Section 9.5.2.2)
Benefit: Enables efficient I/O blocking operations
Tradeoff: None

Action: Configure the appropriate number of threads on the client system (Section 9.5.2.3)
Benefit: Enables efficient I/O blocking operations
Tradeoff: None

Action: Modify cache timeout limits on the client system (Section 9.5.2.4)
Benefit: May improve network performance for read-only file systems and enable clients to quickly detect changes
Tradeoff: Increases network traffic to the server

Action: Decrease network timeouts on the client system (Section 9.5.2.5)
Benefit: May improve performance for slow or congested networks
Tradeoff: Reduces theoretical performance

Action: Use NFS protocol Version 3 on the client system (Section 9.5.2.6)
Benefit: Improves network performance
Tradeoff: Decreases the performance benefit of Prestoserve

The following sections describe these guidelines in detail.

9.5.2.1    Using Prestoserve to Improve NFS Server Performance

You can improve NFS performance by installing Prestoserve on the server. Prestoserve greatly improves synchronous write performance for servers that are using NFS Version 2. Prestoserve enables an NFS Version 2 server to write client data to a nonvolatile (battery-backed) cache, instead of writing the data to disk.

Prestoserve may improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2, because Version 3 servers can safely acknowledge writes held in volatile storage: NFS Version 3 clients can detect server failures and resend any write data that the server may have lost from volatile storage.

See the Guide to Prestoserve for more information.

9.5.2.2    Configuring Server Threads

The nfsd daemon runs on NFS servers to service NFS requests from client machines. The daemon spawns a number of server threads to process those requests; at least one server thread must be running for a machine to operate as a server. The number of threads determines the number of parallel operations and must be a multiple of 8.

For good performance on frequently used NFS servers, configure either 16 or 32 threads, which provides the most efficient blocking for I/O operations. See nfsd(8) for more information.
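As a hedged sketch, the following composes the command that would make a 32-thread setting persist across reboots through rc.config. The variable name NUM_NFSD is an assumption (it is the rc.config variable typically written by the NFS setup utility); verify it on your system before running the echoed command.

```shell
# Sketch: persist the NFS server thread count via rc.config.
# NUM_NFSD is an assumed variable name - confirm it against your
# system's rc.config before use. The command is composed and echoed
# for review rather than executed.
nthreads=32                          # 16 or 32 gives the most efficient I/O blocking
cmd="rcmgr set NUM_NFSD $nthreads"
echo "$cmd"
```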

9.5.2.3    Configuring Client Threads

Client systems use the nfsiod daemon to service asynchronous I/O operations such as buffer cache read-ahead and delayed write operations. The nfsiod daemon spawns several I/O threads to service asynchronous I/O requests to its server. The I/O threads improve the performance of both NFS reads and writes.

The optimal number of I/O threads to run depends on many variables, such as how quickly the client is writing data, how many files will be accessed simultaneously, and the behavior of the NFS server. The number of threads must be one less than a multiple of 8 (for example, 7 or 15).

NFS servers attempt to gather writes into complete UFS clusters before initiating I/O, and the number of threads (plus 1) is the number of writes that a client can have outstanding at any one time. Having exactly 7 or 15 threads produces the most efficient blocking for I/O operations. If write gathering is enabled, and the client does not have any threads, you may experience a performance degradation. To disable write gathering, use the dbx patch command to set the nfs_write_gather kernel variable to zero. See Section 4.4.6 for information.
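The dbx patch step mentioned above can be sketched as follows. Because patching a running kernel requires root privileges, the session is printed for review here rather than executed; the /vmunix and /dev/mem paths follow this chapter's other dbx examples.

```shell
# Sketch: disable NFS write gathering by patching the running kernel.
# Requires root on a live system, so the dbx session is printed for
# review rather than run directly.
dbx_session='/usr/ucb/dbx -k /vmunix /dev/mem
(dbx) patch nfs_write_gather = 0
(dbx) quit'
printf '%s\n' "$dbx_session"
```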

Use the ps axlmp 0 | grep nfs command to display idle I/O threads on the client. If few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See nfsiod(8) for more information.
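A minimal sketch of stepping the client thread count from 7 up to 15, the next value of the form 8n - 1. The bare numeric argument to nfsiod is an assumption to verify against nfsiod(8); the command is echoed for review rather than executed.

```shell
# Sketch: compute the next client I/O thread count of the form 8n - 1
# (7 -> 15) and compose the nfsiod invocation that would start it.
# The argument syntax is an assumption - see nfsiod(8).
current=7
next=$(( (current + 1) * 2 - 1 ))    # doubling 7+1 and subtracting 1 gives 15
cmd="nfsiod $next"
echo "$cmd"
```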

9.5.2.4    Modifying Cache Timeout Limits

For read-only file systems and slow network links, performance may be improved by changing the cache timeout limits on NFS client systems. These timeouts affect how quickly you see updates to a file or directory that has been modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will give you slightly better performance and will reduce the amount of network traffic that you generate.

See mount(8) and the descriptions of the acregmin, acregmax, acdirmin, acdirmax, and actimeo options for more information.
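As a hedged illustration of these options, the following composes a mount command that lengthens the attribute-cache timeouts for a read-only mount. The server name, export path, mount point, and 120-second value are placeholders, and the option syntax should be checked against mount(8).

```shell
# Sketch: mount a read-only NFS file system with longer attribute-cache
# timeouts. actimeo (in seconds) sets all four ac* limits at once.
# Server, paths, and timeout value are placeholders; the command is
# echoed for review rather than executed.
cmd="mount -o ro,actimeo=120 server:/usr/share/man /nfs/man"
echo "$cmd"
```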

9.5.2.5    Decreasing Network Timeouts

NFS does not perform well if it is used over slow network links, congested networks, or wide area networks (WANs). In particular, network timeouts on client systems can severely degrade NFS performance. This condition can be identified by using the nfsstat command and determining the ratio of timeouts to calls. If timeouts are more than 1 percent of total calls, NFS performance may be severely degraded. See Section 9.5.1.1 for sample nfsstat output of timeout and call statistics and nfsstat(8) for more information.
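The 1 percent rule of thumb can be sketched as a small calculation. The calls and timeouts figures below are hypothetical stand-ins for the counters that nfsstat reports in its client RPC statistics.

```shell
# Sketch: flag a timeout problem from NFS client RPC counters.
# The calls/timeouts values are hypothetical; on a real client, read
# them from the client RPC section of nfsstat output.
calls=475344
timeouts=5372
verdict=$(awk -v c="$calls" -v t="$timeouts" 'BEGIN {
    pct = 100 * t / c
    msg = (pct > 1) ? "investigate timeouts" : "OK"
    printf "%.2f%% %s", pct, msg
}')
echo "$verdict"
```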

You can also use the netstat -s command to verify the existence of a timeout problem. A nonzero value in the fragments dropped after timeout field in the ip section of the netstat output may indicate that the problem exists. See Section 10.1.1 for sample netstat command output.

If fragment drops are a problem, use the mount command on the client system with the rsize=1024 and wsize=1024 options to set the size of the NFS read and write buffers to 1 KB.
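A sketch of the resulting mount command; the server name, export path, and mount point are placeholders, and the -o option syntax should be checked against mount(8).

```shell
# Sketch: mount with 1 KB NFS read and write buffers to avoid IP
# fragmentation on a lossy or congested link. Server and mount point
# are placeholders; the command is echoed for review rather than
# executed.
cmd="mount -o rsize=1024,wsize=1024 server:/export /mnt"
echo "$cmd"
```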

9.5.2.6    Using NFS Protocol Version 3

NFS protocol Version 3 provides NFS client-side asynchronous write support, an improved cache consistency protocol, and lower network load than Version 2. These improvements slightly decrease the performance benefit that Prestoserve provides for NFS Version 2. However, with protocol Version 3, Prestoserve still speeds file creation and deletion.