The Tru64 UNIX operating system supports different file system options that have various performance features and functionality.
This chapter describes the following:
Gathering information about all types of file systems (Section 9.1)
Applying tuning guidelines that are applicable to all types of file systems (Section 9.2)
Managing Advanced File System (AdvFS) performance (Section 9.3)
Managing UNIX File System (UFS) performance (Section 9.4)
Managing Network File System (NFS) performance (Section 9.5)
9.1 Gathering File System Information
The following sections describe how to use tools to monitor general file system activity and provide some general file system tuning guidelines. See Section 6.3.4 for information about using dbx to check the Unified Buffer Cache (UBC).
9.1.1 Displaying File System Disk Space
The df command displays the disk space used by a UFS file system or AdvFS fileset. Because an AdvFS fileset can use multiple volumes, the df command reflects disk space usage somewhat differently than UFS. For example:

# df /usr/var/spool/mqueue
Filesystem     512-blocks      Used   Available   Capacity   Mounted on
/dev/rz13e        2368726       882     2130970         1%   /usr/var/spool/mqueue

# df /usr/sde
Filesystem     512-blocks      Used   Available   Capacity   Mounted on
flume_sde#sde     1048576    319642      709904        32%   /usr/sde

See df(1) for more information.
9.1.2 Checking the namei Cache with the dbx Debugger
The namei cache is used by UNIX File System (UFS), Advanced File System (AdvFS), CD-ROM File System (CDFS), Memory File System (MFS), and Network File System (NFS) to store information about recently used file names, parent directory vnodes, and file vnodes. The number of vnodes determines the number of open files. The namei cache also stores vnode information for files that were referenced but not found. Having this information in the cache substantially reduces the amount of searching that is needed to perform pathname translations.
To check namei cache statistics, use the dbx print command and specify a processor number to examine the nchstats data structure. Consider the following example:

# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print processor_ptr[0].nchstats
struct {
    ncs_goodhits = 47967479
    ncs_neghits = 3611935
    ncs_badhits = 1828974
    ncs_falsehits = 58393
    ncs_miss = 4194525
    ncs_long = 60
    ncs_badtimehits = 406034
    ncs_collisions = 149
    ncs_unequaldups = 0
    ncs_pad = {
        [0] 0
        [1] 0
        [2] 0
    }
}
(dbx)
Examine the ncs_goodhits (found a match), ncs_neghits (found a match that did not exist), and ncs_miss (did not find a match) fields to determine the hit rate. The hit rate, which is the sum of the ncs_goodhits and ncs_neghits fields divided by the sum of the ncs_goodhits, ncs_neghits, ncs_miss, and ncs_falsehits fields, should be above 80 percent.
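For example, using the sample dbx output above, the hit rate is (47967479 + 3611935) / (47967479 + 3611935 + 4194525 + 58393), or approximately 92 percent, which is above the 80 percent guideline.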
See Section 9.2.1 for information on how to improve the namei cache hit rate and lookup speeds.
If the value in the ncs_badtimehits field is more than 0.1 percent of the ncs_goodhits field, then you may want to delay vnode deallocation. See Section 9.2.2 for more information.
9.2 Tuning File Systems
You may be able to improve I/O performance by modifying some kernel subsystem attributes that affect file system performance. General file system tuning often involves tuning the Virtual File System (VFS), which provides a uniform interface that allows common access to files, regardless of the file system on which the files reside.
To successfully improve file system performance, you must understand how your applications and users perform disk I/O, as described in Section 2.1. Because file systems share memory with processes, you should also understand virtual memory operation, as described in Chapter 6.
Table 9-1 describes the guidelines for general file system tuning and lists the performance benefits as well as the tradeoffs. There are also specific guidelines for AdvFS and UFS file systems. See Section 9.3 and Section 9.4 for information.
Table 9-1: General File System Tuning Guidelines
Guideline | Performance Benefit | Tradeoff |
Increase the size of the namei cache (Section 9.2.1) | Improves namei cache lookup operations | Consumes memory |
Delay vnode deallocation (Section 9.2.2) | Improves namei cache lookup operations | Consumes memory |
Delay vnode recycling (Section 9.2.3) | Improves cache lookup operations | None |
Increase the memory allocated to the UBC (Section 9.2.4) | Improves file system I/O performance | May cause excessive paging and swapping |
Decrease the amount of memory borrowed by the UBC (Section 9.2.5) | Improves file system I/O performance | Decreases the memory available for processes, and may decrease system response time |
Increase the minimum size of the UBC (Section 9.2.6) | Improves file system I/O performance | Decreases the memory available for processes |
Increase the amount of UBC memory used to cache a large file (Section 9.2.7) | Improves large file performance | May allow a large file to consume all the pages on the free list |
Disable flushing file read access times (Section 9.2.8) | Improves file system performance for systems that perform mainly read operations | Jeopardizes the integrity of read access time updates and violates POSIX standards |
Use Prestoserve to cache only file system metadata (Section 9.2.9) | Improves performance for applications that access large amounts of file system metadata | Prestoserve is not supported in a cluster or for nonfile system I/O operations |
The following sections describe these guidelines in detail.
9.2.1 Increasing the Size of the namei Cache
The namei cache is used by UFS, AdvFS, CDFS, and NFS to store information about recently used file names, parent directory vnodes, and file vnodes. The number of vnodes determines the number of open files. The namei cache also stores vnode information for files that were referenced but not found. Having this information in the cache substantially reduces the amount of searching that is needed to perform pathname translations.
The vfs subsystem attribute name_cache_size specifies the maximum number of elements in the cache. You can also control the size of the namei cache with the maxusers attribute, as described in Section 5.1.
Performance Benefit and Tradeoff
You may be able to make lookup operations faster by increasing the size of the namei cache. However, this increases the amount of wired memory.
Note that many benchmarks perform better with a large namei cache.
You cannot modify the name_cache_size attribute without rebooting the system.
When to Tune
Monitor the namei cache by using the dbx print command and specifying a processor number to examine the nchstats data structure. If the miss rate (misses / (good + negative + misses)) is more than 20 percent, you may want to increase the cache size. See Section 9.1.2 for more information.
Recommended Values
The default value of the vfs subsystem attribute name_cache_size is 2 * (148 + 10 * maxusers) * 11 / 10.
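For example, because the name_cache_size attribute cannot be modified at run time, you might add a stanza similar to the following to /etc/sysconfigtab and then reboot. The value shown is only an illustration; choose a value larger than the default that the formula above yields for your maxusers setting.

vfs:
    name_cache_size = 16384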
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.2 Delaying vnode Deallocation
File systems use a kernel data structure called a vnode for each open file. The number of vnodes determines the number of open files. By default, Tru64 UNIX uses dynamic vnode allocation, which enables the supply of kernel vnodes to increase and decrease, according to the system demand.
You enable and disable dynamic vnode allocation by using the vfs subsystem attribute vnode_deallocation_enable, which is set to 1 (enabled) by default. If you disable dynamic vnode allocation, the operating system will use a static vnode pool. For the best performance, Compaq recommends that you use dynamic vnode allocation.
If you are using dynamic vnode allocation, a vnode is deallocated (removed from the free list and its memory returned to the system) when it has not been accessed through the namei cache for more than the amount of time specified by the vfs subsystem attribute namei_cache_valid_time. The default value is 1200 seconds.
Performance Benefit and Tradeoff
Increasing the default value of the namei_cache_valid_time attribute delays vnode deallocation, which may improve the cache hit rate. However, this will increase the amount of memory consumed by the vnode pool.
You cannot modify the namei_cache_valid_time attribute without rebooting the system.
When to Tune
The default value of the namei_cache_valid_time attribute (1200 seconds) is appropriate for most workloads. However, for workloads with heavy vnode pool activity, you may be able to optimize performance by modifying the default value.
You can obtain namei cache statistics for the number of cache lookup failures due to vnode deallocation by examining the ncs_badtimehits field in the dbx nchstats data structure. If the value in the ncs_badtimehits field is more than 0.1 percent of the successful cache hits, as specified in the ncs_goodhits field, then you may want to increase the default value of the namei_cache_valid_time attribute.
See Section 9.1.2 for more information about monitoring the namei cache.
Recommended Values
To delay the deallocation of vnodes, increase the value of the vfs subsystem attribute namei_cache_valid_time. The default value is 1200 seconds.
Note
Decreasing the value of the namei_cache_valid_time attribute accelerates the deallocation of vnodes from the namei cache and reduces the efficiency of the cache.
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.3 Delaying vnode Recycling
File systems use a kernel data structure called a vnode for each open file. The number of vnodes determines the number of open files. By default, Tru64 UNIX uses dynamic vnode allocation, which enables the supply of kernel vnodes to increase and decrease, according to the system demand.
You enable and disable dynamic vnode allocation by using the vfs subsystem attribute vnode_deallocation_enable, which is set to 1 (enabled) by default. If you disable dynamic vnode allocation, the operating system will use a static vnode pool. For the best performance, Compaq recommends that you use dynamic vnode allocation.
Using dynamic vnode allocation, a vnode can be recycled and used to represent a different file object when it has been on the vnode free list for more than the amount of time specified by the vfs subsystem attribute vnode_age. The default value is 120 seconds.
Performance Benefit and Tradeoff
Increasing the value of the vnode_age attribute delays vnode recycling and increases the chance of a cache hit. However, delaying vnode recycling increases the length of the free list and the amount of memory consumed by the vnode pool.
You can modify the vnode_age attribute without rebooting the system.
When to Tune
The default value of the vnode_age attribute is appropriate for most workloads. However, for workloads with heavy vnode pool activity, you may be able to optimize performance by modifying the default value.
Recommended Values
To delay the recycling of vnodes, increase the default value of the vnode_age attribute. The default value is 120 seconds.
Decreasing the value of the vnode_age attribute accelerates vnode recycling, but decreases the chance of a cache hit.
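For example, because the vnode_age attribute can be modified at run time, you might display its current value and then raise it with the sysconfig command. The value of 240 seconds is only an illustration.

# sysconfig -q vfs vnode_age
# sysconfig -r vfs vnode_age=240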
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.4 Increasing Memory for the UBC
The Unified Buffer Cache (UBC) shares with processes the memory that is not wired. The UBC caches UFS and CDFS file system data for reads and writes, AdvFS metadata and file data, and MFS data. Performance is improved if the cached data is later reused and a disk operation is avoided.
The vm subsystem attribute ubc_maxpercent specifies the maximum amount of nonwired memory that can be allocated to the UBC. See Section 6.1.2.2 for information about UBC memory allocation.
Performance Benefit and Tradeoff
If you reuse data, increasing the size of the UBC will improve the chance that data will be found in the cache. An insufficient amount of memory allocated to the UBC can impair file system performance. However, the performance of an application that generates a lot of random I/O will not be improved by a large UBC, because the next access location for random I/O cannot be predetermined.
Be sure that allocating more memory to the UBC does not cause excessive paging and swapping.
You can modify the ubc_maxpercent attribute without rebooting the system.
When to Tune
For most configurations, use the default value of the ubc_maxpercent attribute (100 percent).
Recommended Values
To increase the maximum amount of memory allocated to the UBC, you can increase the value of the vm subsystem attribute ubc_maxpercent. The default value is 100 percent, which should be appropriate for most configurations, including Internet servers.
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.5 Increasing the Borrowed Memory Threshold
The UBC borrows all physical memory between the value of the vm subsystem attribute ubc_borrowpercent and the value of the ubc_maxpercent attribute. See Section 6.1.2.2 for more information about allocating memory to the UBC.
Performance Benefit and Tradeoff
Increasing the value of the ubc_borrowpercent attribute will reduce the amount of memory that the UBC borrows from processes and allow more memory to remain in the UBC when page reclamation begins. This can increase UBC cache effectiveness, but it may degrade system response time when a low-memory condition occurs (for example, a large process working set).
You can modify the ubc_borrowpercent attribute without rebooting the system.
When to Tune
If vmstat output shows excessive paging but few or no page outs, you may want to increase the borrowing threshold.
Recommended Values
The value of the ubc_borrowpercent attribute can range from 0 to 100. The default value is 20 percent.
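For example, to display the current borrowing threshold and then raise it at run time, you might enter commands similar to the following. The value of 30 percent is only an illustration.

# sysconfig -q vm ubc_borrowpercent
# sysconfig -r vm ubc_borrowpercent=30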
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.6 Increasing the Minimum Size of the UBC
The minimum amount of memory that can be allocated to the UBC is specified by the vm subsystem attribute ubc_minpercent. See Section 6.1.2.2 for information about allocating memory to the UBC.
Performance Benefit and Tradeoff
Increasing the minimum size of the UBC will prevent large programs from completely consuming the memory that can be used by the UBC.
Because the UBC and processes share virtual memory, increasing the minimum size of the UBC may cause the system to page.
You can modify the ubc_minpercent attribute without rebooting the system.
When to Tune
For I/O servers, you may want to raise the value of the vm subsystem attribute ubc_minpercent to ensure that enough memory is available for the UBC.
To ensure that the value of the ubc_minpercent attribute is appropriate, use the vmstat command to examine the page-out rate. See Section 6.3.1 for information.
Recommended Values
The default value of the ubc_minpercent attribute is 10 percent.
If the values of the vm subsystem attributes ubc_maxpercent and ubc_minpercent are close together, you may degrade I/O performance.
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.7 Improving Large File Caching Performance
If a large file completely fills the UBC, it may take all of the pages on the free page list, which may cause the system to page excessively. The vm subsystem attribute vm_ubcseqpercent specifies the maximum amount of memory allocated to the UBC that can be used to cache a single file. The vm subsystem attribute vm_ubcseqstartpercent specifies the size of the UBC, as a percentage of physical memory, at which the virtual memory subsystem starts stealing the UBC LRU pages for a file to satisfy the demand for pages.
Performance Benefit and Tradeoff
Increasing the value of the vm_ubcseqpercent attribute will improve the I/O performance of a large single file, but will decrease the memory available for small files.
You can modify the vm_ubcseqpercent and vm_ubcseqstartpercent attributes without rebooting the system.
When to Tune
You may want to increase the value of the vm_ubcseqpercent attribute if you reuse large files.
Recommended Values
The default value of the vm_ubcseqpercent attribute is 10 percent of the memory allocated to the UBC.
To force the system to reuse the pages in the UBC instead of taking pages from the free list, perform the following tasks:
Make the maximum size of the UBC greater than the size of the UBC, as a percentage of memory, at which page stealing begins. That is, the value of the vm subsystem attribute ubc_maxpercent (the default is 100 percent) must be greater than the value of the vm_ubcseqstartpercent attribute (the default is 50 percent).
Make the value of the vm_ubcseqpercent attribute, which specifies the size of a file as a percentage of the UBC, greater than the size of the referenced file. The default value of the vm_ubcseqpercent attribute is 10 percent.
For example, using the default values, the UBC would have to be larger than 50 percent of all memory and a file would have to be larger than 10 percent of the UBC (that is, the file size would have to be at least 5 percent of all memory) in order for the system to reuse the pages in the UBC.
On large-memory systems that are doing a lot of file system operations, you may want to decrease the value of the vm_ubcseqstartpercent attribute to 30 percent. Do not specify a lower value unless you decrease the size of the UBC. In this case, do not change the value of the vm_ubcseqpercent attribute.
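For example, on such a large-memory system you might display the current value and then lower it to 30 percent at run time:

# sysconfig -q vm vm_ubcseqstartpercent
# sysconfig -r vm vm_ubcseqstartpercent=30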
See Section 3.6 for information about modifying kernel subsystem attributes.
9.2.8 Disabling File Read Access Time Flushing
When a read system call is made to a file system's files, the default behavior is for the file system to update both the in-memory file access time and the on-disk stat structure, which contains most of the file information that is returned by the stat system call.
Performance Benefit and Tradeoff
You can improve file system performance for systems that perform mainly read operations (such as proxy servers) by specifying, at mount time, that the file system update only the in-memory file access time when a read system call is made to a file. The file system will update the on-disk stat structure only if the file is modified.
Updating only the in-memory file access time for reads can improve proxy server response time by decreasing the number of disk I/O operations. However, this behavior jeopardizes the integrity of read access time updates and violates POSIX standards. Do not use this functionality if it will affect utilities that use read access times to perform tasks, such as migrating files to different devices.
When to Perform this Task
You may want to disable file read access time flushing if your system performs mainly read operations.
Recommended Procedure
To disable file read access time flushing, use the mount command with the noatimes option.
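For example, to mount a hypothetical AdvFS fileset used as a proxy cache with read access time flushing disabled, you might enter a command similar to the following. The domain, fileset, and mount point names are only examples.

# mount -t advfs -o noatimes proxy_domain#cache /proxy/cache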
See read(2) and mount(8) for more information.
9.2.9 Caching Only File System Metadata with Prestoserve
Prestoserve can improve the overall run-time performance for systems that perform large numbers of synchronous writes. The prmetaonly attribute controls whether Prestoserve caches only UFS and AdvFS file system metadata, instead of both metadata and synchronous write data (the default).
Performance Benefit and Tradeoff
Caching only metadata may improve the performance of applications that access many small files or applications that access a large amount of file-system metadata but do not reread recently written data.
When to Tune
Cache only file system metadata if your applications access many small files or access a large amount of file-system metadata but do not reread recently written data.
Recommended Values
Set the value of the prmetaonly attribute to 1 (enabled) to cache only file system metadata.
See Section 3.6 for information about modifying kernel subsystem attributes.
9.3 Managing Advanced File System Performance
The Advanced File System (AdvFS) provides file system features beyond those of a traditional UFS file system. Unlike the rigid UFS model in which the file system directory hierarchy (tree) is bound tightly to the physical storage, AdvFS consists of two distinct layers: the directory hierarchy layer and the physical storage layer.
The following sections describe:
AdvFS features (Section 9.3.1)
AdvFS I/O queues (Section 9.3.2)
AdvFS access structures (Section 9.3.3)
AdvFS guidelines for high-performance configurations (Section 9.3.4)
Monitoring AdvFS performance (Section 9.3.5)
Tuning AdvFS (Section 9.3.6)
Improving AdvFS performance (Section 9.3.7)
See the AdvFS Administration manual for detailed information about setting up and managing AdvFS.
9.3.1 AdvFS Features
The AdvFS decoupled file system structure enables you to manage the physical storage layer apart from the directory hierarchy layer. You can put multiple volumes (disks, LSM volumes, or RAID storage sets) in a file domain and distribute the filesets and files across the volumes. A file's blocks usually reside together on the same volume, unless the file is striped or the volume is full. Each new file is placed on the successive volume by using round-robin scheduling.
AdvFS enables you to move files between a defined group of disk volumes without changing file pathnames. Because the pathnames remain the same, the action is completely transparent to users.
The AdvFS Utilities product, which is licensed separately from the operating system, extends the capabilities of the AdvFS file system.
AdvFS provides the following basic features that do not require a license:
High-performance file system
AdvFS uses an extent-based file allocation scheme that consolidates data transfers, which increases sequential bandwidth and improves performance for large data transfers. AdvFS performs large reads from disk when it anticipates a need for sequential data. AdvFS also performs large writes by combining adjacent data into a single data transfer.
Fast file system recovery
Rebooting after a system interruption is extremely fast, because AdvFS uses write-ahead logging, instead of the fsck utility, as a way to check for and repair file system inconsistencies. The recovery speed depends on the number of uncommitted records in the log, not the amount of data in the fileset; therefore, reboots are quick and predictable.
Direct I/O support
AdvFS allows you to enable direct I/O functionality on the files in a fileset or on a specific file. If direct I/O is enabled, file data is synchronously read or written without copying the data into the AdvFS buffer cache. Direct I/O can significantly improve disk I/O throughput for applications that read or write data only once or do not frequently write to previously written pages. See Section 9.3.4.7 for more information.
Smooth sync
Smooth sync functionality improves AdvFS asynchronous I/O performance by preventing I/O spikes caused by the update daemon, increasing the chance of a buffer cache hit, and improving the consolidation of I/O requests. See Section 9.3.6.5 for more information.
Online file domain defragmentation capability
Defragmenting disk data can improve performance by making data more contiguous. AdvFS enables you to perform this task without interrupting data availability.
Disk quotas
AdvFS enables you to track and control the amount of disk storage that each user, group, and fileset consumes.
The optional AdvFS utilities product, which requires a license, provides the following features:
Disk spanning
A file or fileset can span disks within a multi-volume file domain.
Online file system resizing
You can dynamically change the size of a file system by adding or removing disks. AdvFS enables you to perform this task without disrupting users or applications.
Ability to recover deleted files
Users can retrieve their own unintentionally deleted files from predefined trashcan directories, without assistance from system administrators.
I/O load balancing across disks
You can distribute the percentage of used space evenly between volumes in a multi-volume domain.
Online file migration across disks
You can move specific files to different volumes to eliminate bottlenecks caused by heavily used files.
Online backup
You can back up file system contents with limited interruption to users.
Clone filesets
AdvFS enables you to clone a fileset, which produces a read-only snapshot of fileset data structures. Cloning can increase the availability of data by preserving the state of the AdvFS data at a particular time and protecting against accidental file deletion or corruption.
File-level striping
File-level striping may improve I/O bandwidth (transfer rates) by distributing file data across multiple disk volumes.
Graphical user interface
The AdvFS GUI simplifies disk and file system administration, provides status, and alerts you to potential problems.
See the AdvFS Administration manual for detailed information about AdvFS features.
9.3.2 AdvFS I/O Queues
The AdvFS buffer cache is part of the UBC, and acts as a layer between the operating system and disk by storing recently accessed AdvFS file system data. Performance is improved if the cached data is later reused (a buffer cache hit) and a disk operation is avoided.
At boot time, the kernel determines the amount of physical memory that is available for AdvFS buffer cache headers, and allocates a buffer cache header for each possible page. The size of an AdvFS page is 8 KB.
The number of AdvFS buffer cache headers depends on the number of 8-KB pages that can be obtained from the amount of memory specified by the advfs subsystem attribute AdvfsCacheMaxPercent. The default value is 7 percent of physical memory. See Section 6.1.2.3 for more information about how the system allocates memory to the AdvFS buffer cache.
For each AdvFS volume, I/O requests are sent to one of the following queues, which feed I/O requests to the device queue:
Blocking queue
The blocking queue caches synchronous I/O requests. A synchronous I/O request is a read operation or a write that must be flushed to disk before it is considered complete and the application can continue. This ensures data reliability because the data has been written to disk and is not stored only in memory. Therefore, I/O requests on the blocking queue cannot be asynchronously removed, because the I/O must complete.
Lazy queue
The lazy queue caches asynchronous I/O requests. Asynchronous I/O requests are cached in the lazy queue and periodically flushed to disk in portions that are large enough to allow the disk drivers to optimize the order of the write.
Figure 9-1 shows the movement of synchronous and asynchronous I/O requests through the AdvFS I/O queues.
Figure 9-1: AdvFS I/O Queues
When an asynchronous I/O request enters the lazy queue, it is assigned a time stamp. The lazy queue is a pipeline that contains a sequence of queues through which an I/O request passes: the wait queue (if applicable), the smooth sync queue, the ready queue, and the consol (consolidation) queue. An AdvFS buffer cache hit can occur while an I/O request is in any part of the lazy queue.
Detailed descriptions of the AdvFS queues are as follows:
Wait queue--Asynchronous I/O requests that are waiting for an AdvFS transaction log write to complete first enter the wait queue. Each file domain has a transaction log that tracks fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs.
AdvFS uses write-ahead logging, which requires that when metadata is modified, the transaction log write must complete before the actual metadata is written. This ensures that AdvFS can always use the transaction log to create a consistent view of the file system metadata. After the transaction log is written, I/O requests can be moved from the wait queue to the smooth sync queue.
Smooth sync queue--The smooth sync queue improves AdvFS asynchronous I/O performance by preventing I/O spikes caused by the update daemon, increasing the chance of an AdvFS buffer cache hit, and improving the consolidation of I/O requests.
When smooth sync is not enabled, the update daemon flushes data from memory to disk every 30 seconds, regardless of how long a buffer has been cached. However, with smooth sync enabled (the default behavior), asynchronous I/O requests remain in the smooth sync queue for the amount of time specified by the value of the vfs attribute smoothsync_age (the default is 30 seconds). After this time, the buffer moves to the ready queue. The movement of buffers from the smooth sync queue to the ready queue occurs continuously, based on the age of the buffer, and reduces the need to flush large numbers of requests every 30 seconds. See Section 9.3.6.5 for information about tuning the smooth sync queue.
Ready queue--Asynchronous I/O requests that are not waiting for an AdvFS transaction log write to complete enter the ready queue, where they are sorted and held until the size of the ready queue reaches the value specified by the AdvfsReadyQLim attribute, or until the update daemon flushes the data. The default value of the AdvfsReadyQLim attribute is 16,384 512-byte blocks (8 MB).
You can modify the size of the ready queue for all AdvFS volumes by changing the value of the AdvfsReadyQLim attribute. Alternatively, you can modify the ready queue limit for a specific AdvFS volume by using the chvol -t command. See Section 9.3.6.4 for information about tuning the ready queue.
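For example, to raise the ready queue limit on one volume of a hypothetical domain, you might use a command similar to the following sketch. The device name, domain name, and block count are only illustrations.

# chvol -t 32768 /dev/disk/dsk3c data_domain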
Consol queue--I/O requests are moved from the ready queue to the consol queue, which feeds the device queue. The consol queue serves as a holding area that enables the interleaving of I/O requests as they move from the blocking and the consol queues to the device queue, and also prevents flooding the device queue with requests.
Both the consol queue and the blocking queue feed the device queue, where logically contiguous I/O requests are consolidated into larger I/Os before they are sent to the device driver. The size of the device queue affects the amount of time it takes to complete a synchronous (blocking) I/O operation. AdvFS issues several types of blocking I/O operations, including AdvFS metadata and log data operations.
The AdvfsMaxDevQLen attribute limits the total number of I/O requests on the AdvFS device queue. The default value is 24 requests. When the number of requests exceeds this value, only synchronous requests from the blocking queue are accepted onto the device queue.
Although the default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations, you may need to modify this value. However, increase the default value only if devices are not being kept busy. Make sure that increasing the size of the device queue does not cause a decrease in response time. See Section 9.3.6.6 for more information about tuning the AdvFS device queue.
Use the advfsstat command to show the AdvFS queue statistics. See Section 9.3.5.1 for information.
9.3.3 AdvFS Access Structures
AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open and files that were opened but are now closed. Caching open file information can enhance AdvFS performance if the open files are later reused. If your users or applications open and then reuse many files, you may be able to improve AdvFS performance by modifying how the system allocates AdvFS access structures.
There are three attributes that control the allocation of AdvFS access structures:
The AdvfsAccessMaxPercent attribute controls the maximum percentage of pageable memory that can be allocated for AdvFS access structures. At boot time, the system reserves for AdvFS access structures a portion of the physical memory that is not wired. The memory reserved is either twice the value of the AdvfsMinFreeAccess attribute or the value of the AdvfsAccessMaxPercent attribute, whichever is smaller. These access structures are then placed on the access structure free list.
As AdvFS files are opened, access structures are taken from the free list. If the number of access structures on the free list falls below the value of the AdvfsMinFreeAccess attribute, AdvFS allocates additional access structures and places them on the free list, until the number of access structures on the free list is twice the value of the AdvfsMinFreeAccess attribute or the value of the AdvfsAccessMaxPercent attribute, whichever is smaller.
At any one time, the access structure free list contains only a portion of the access structures that the system has allocated. The AdvfsMaxFreeAccessPercent attribute specifies the maximum percentage of the total allocated access structures that can be on the free list at one time. Access structures are deallocated from the free list, and memory is returned to the pool that is reserved for access structures, when either of the following occurs:
The number of access structures on the free list exceeds the value of the AdvfsMaxFreeAccessPercent attribute (as a percentage of the total allocated access structures). For example, this condition is satisfied if the value of the AdvfsMaxFreeAccessPercent attribute is 80 percent, there are 100 allocated access structures, and the number of access structures on the free list is more than 80.
The number of access structures on the free list is more than twice the value of the AdvfsMinFreeAccess attribute.
You may be able to improve AdvFS performance by modifying the previous attributes and allocating more memory for AdvFS access structures. However, this will reduce the amount of memory available to processes and may cause excessive paging and swapping. See Section 9.3.6.3 for information.
If you do not use AdvFS or if your workload does not frequently write to previously written pages, do not allocate a large amount of memory for access structures. If you have a large-memory system, you may want to decrease the amount of memory reserved for AdvFS access structures. See Section 6.4.5 for information.
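For example, before changing any of these settings you might display their current values. This sketch assumes the three attributes belong to the advfs kernel subsystem, as the related AdvfsCacheMaxPercent attribute does.

# sysconfig -q advfs AdvfsAccessMaxPercent AdvfsMinFreeAccess AdvfsMaxFreeAccessPercent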
9.3.4 AdvFS Configuration Guidelines
You will obtain the best performance if you carefully plan your AdvFS configuration. Table 9-2 lists AdvFS configuration guidelines and performance benefits as well as tradeoffs. See the AdvFS Administration manual for detailed information about AdvFS configuration.
Table 9-2: AdvFS Configuration Guidelines
Guideline | Performance Benefit | Tradeoff |
Use a few file domains instead of a single large domain (Section 9.3.4.1) | Facilitates administration | None |
Use multi-volume file domains, instead of single-volume domains (Section 9.3.4.1) | Improves throughput | Multiple volumes increase the chance of domain failure |
Configure one fileset for each domain (Section 9.3.4.2) | Facilitates administration | None |
Keep filesets less than 50 GB in size (Section 9.3.4.2) | Facilitates administration | None |
Distribute the I/O load over multiple disks (Section 9.3.4.3) | Improves throughput | Requires multiple disks |
Place the transaction log on a fast or uncongested volume (Section 9.3.4.4) | Prevents the log from becoming a bottleneck | None |
Log only file structures (Section 9.3.4.4) | Maintains high performance | Increases the possibility of inconsistent data after a crash |
Force all AdvFS file writes to be synchronous (Section 9.3.4.5) | Ensures that data is successfully written to disk | May degrade file system performance |
Prevent partial writes (Section 9.3.4.6) | Ensures that system crashes do not cause partial disk writes | May degrade asynchronous write performance |
Enable direct I/O (Section 9.3.4.7) | Improves disk I/O throughput for database applications that read or write data only once | Degrades I/O performance for applications that repeatedly access the same data |
Use AdvFS for the root file system (Section 9.3.4.8) | Provides fast startup after a crash | None |
Stripe files across different disks and, if possible, different buses (Section 9.3.4.9) | Improves sequential read and write performance | Increases chance of domain failure |
Use quotas (Section 9.3.4.10) | Tracks and controls the amount of disk storage that each user, group, or fileset consumes | None |
Consolidate I/O transfers (Section 9.3.4.11) | Improves AdvFS performance | None |
Allocate sufficient swap space (Section 2.3.2.3) | Facilitates the use of the verify command | Requires additional disk space |
The following sections describe these AdvFS configuration guidelines in detail.
9.3.4.1 Configuring File Domains
To facilitate AdvFS administration and improve performance, configure a few file domains with multiple volumes instead of many file domains or a single large file domain. Using a few file domains with multiple volumes provides better control over physical resources, improves a fileset's total throughput, and decreases the administration time.
Each file domain uses a transaction log on one of the volumes. If you configure only a single large multi-volume file domain, the log may become a bottleneck. In contrast, if you configure many file domains, you spread the overhead associated with managing the logs for the file domains.
Multi-volume file domains improve performance because AdvFS generates parallel streams of output using multiple device consolidation queues. A file domain with three volumes on different disks is more efficient than a file domain consisting of a single disk because the latter has only one I/O path.
However, a single volume failure within a file domain will render the entire domain inaccessible, so the more volumes that you have in a file domain, the greater the risk that the domain will fail. To reduce the risk of file domain failure, limit the number of volumes in a file domain to eight or mirror the file domain with LSM or hardware RAID.
In addition, follow these guidelines for configuring file domains:
For the best efficiency, spread a file domain across several of the same type of disks with the same speed.
Use an entire disk in a file domain. For example, do not use partition a in one file domain and partition b in another file domain.
Use a single disk partition to add a disk to a file domain (for example, partition c), instead of using multiple partitions.
Make sure that busy files are not located on the same volume. Use the migrate command to move files across volumes.
If you are using LSM, use multiple, small LSM volumes in a file domain, instead of a single, large concatenated or striped volume. This enables AdvFS to balance I/O across volumes.
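For example, the following sketch creates a hypothetical three-volume file domain and a single fileset. The disk, domain, fileset, and mount point names are only examples, and adding volumes with the addvol command requires the AdvFS Utilities license.

# mkfdmn /dev/disk/dsk3c data_domain
# addvol /dev/disk/dsk4c data_domain
# addvol /dev/disk/dsk5c data_domain
# mkfset data_domain data1
# mount -t advfs data_domain#data1 /data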
9.3.4.2 Configuring Filesets for High Performance
Configuring many filesets in a file domain can adversely affect performance and AdvFS administration. If possible, configure only one fileset for each file domain.
In addition, the recommended maximum size of a fileset is 50 GB. Once a fileset reaches 30 GB, consider creating another file domain and fileset. You may want to establish a monitoring routine that alerts you to a large fileset size.
Use the showfsets command to display the number of filesets in a domain and the size of a fileset.
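For example, to list the filesets in a hypothetical domain named data_domain, you might enter:

# showfsets data_domain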
See showfsets(8) for more information.
9.3.4.3 Distribute the AdvFS I/O Load
Distribute the AdvFS I/O load over multiple disks to improve throughput. Use multiple file domains and spread filesets across the domains.
The number of filesets depends on your storage needs. Each fileset can be managed and backed up independently, and can be assigned quotas. Be sure that heavily used filesets are located on different file domains, so that a single transaction log does not become a bottleneck.
See Section 8.1 for more information about distributing the disk I/O load.
9.3.4.4 Improving the Transaction Log Performance
Each file domain has a transaction log that tracks fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs. The AdvFS file domain transaction log may become a bottleneck if the log resides on a congested disk or bus, or if the file domain contains many filesets.
To prevent the log from becoming a bottleneck, put the log on a fast, uncongested volume. You may want to put the log on a disk that contains only the log. See Section 9.3.7.3 for information on moving an existing transaction log.
To make the transaction log highly available, use LSM or hardware RAID to mirror the log.
You can also divide a large multi-volume file domain into smaller file domains to distribute transaction log I/O.
By default, AdvFS logs only file structures. However, you can also log file data to ensure that a file is internally consistent if a crash occurs. Note that data logging can degrade performance. See Section 9.3.4.6 for information about atomic write data logging.
9.3.4.5 Forcing Synchronous Writes
By default, asynchronous write requests are cached in the AdvFS buffer cache, and the write system call then returns a success value. The data is written to disk at a later time (asynchronously).
Use the chfile -l on command to force all write requests to a specified AdvFS file to be synchronous. If you enable forced synchronous writes on a file, data must be successfully written to disk before the write system call will return a success value. This behavior is similar to the behavior associated with a file that has been opened with the O_SYNC option; however, forcing synchronous writes persists across open calls.
Forcing all writes to a file to be synchronous ensures that the write has completed when the write system call returns a success value. However, it may degrade write performance.
A file cannot have both forced synchronous writes enabled and atomic write data logging enabled. See Section 9.3.4.6 for more information.
Use the chfile command to determine whether forced synchronous writes or atomic write data logging is enabled. Use the chfile -l off command to disable forced synchronous writes (the default).
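For example, to force synchronous writes to a hypothetical database file and then confirm the setting, you might enter (the file name is only an example):

# chfile -l on /data/orders.db
# chfile /data/orders.db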
9.3.4.6 Preventing Partial Data Writes
AdvFS writes data to disk in 8-KB chunks. By default, and in accordance with POSIX standards, AdvFS does not guarantee that all or part of the data will actually be written to disk if a crash occurs during or immediately after the write. For example, if the system crashes during a write that consists of two 8-KB chunks of data, only a portion (anywhere from 0 to 16 KB) of the total write may have succeeded. This can result in partial data writes and inconsistent data.
To prevent partial writes if a system crash occurs, use the chfile -L on command to enable atomic write data logging for a specified file.
By default, each file domain has a transaction log file that tracks fileset activity and ensures that AdvFS can maintain a consistent view of the file system metadata if a crash occurs. If you enable atomic write data logging on a file, data from a write call will be written to the transaction log file before it is written to disk. If a system crash occurs during or immediately after the write call, upon recovery, the data in the log file can be used to reconstruct the write. This guarantees that each 8-KB chunk of a write either is completely written to disk or is not written to disk.
For example, if atomic write data logging is enabled and a crash occurs during a write that consists of two 8-KB chunks of data, the write can have three possible states: none of the data is written, 8 KB of the data is written, or 16 KB of data is written.
Atomic write data logging may degrade AdvFS write performance because of the extra write to the transaction log file. In addition, a file that has atomic write data logging enabled cannot be memory mapped by using the mmap system call, and it cannot have direct I/O enabled (see Section 9.3.4.7).
A file cannot have both forced synchronous writes enabled (see Section 9.3.4.5) and atomic write data logging enabled. However, you can enable atomic write data logging on a file and also open the file with the O_SYNC option. This ensures that the write is synchronous, but also prevents partial writes if a crash occurs before the write system call returns.
Use the chfile command to determine if forced synchronous writes or atomic write data logging is enabled. Use the chfile -L off command to disable atomic write data logging (the default).
To enable atomic write data logging on AdvFS files that are NFS mounted, the NFS property list daemon, proplistd, must be running on the NFS client, and the fileset must be mounted on the client by using the mount command's proplist option.
If atomic write data logging is enabled and you are writing to a file that has been NFS mounted, the offset into the file must be on an 8-KB page boundary, because NFS performs I/O on 8-KB page boundaries.
You can also activate and deactivate atomic write data logging by using the fcntl system call. In addition, both the chfile command and the fcntl system call can be used on an NFS client to activate or deactivate this feature on a file that resides on the NFS server.
9.3.4.7 Enabling Direct I/O
You can use direct I/O to read and write data from a file without copying the data into the AdvFS buffer cache. If you enable direct I/O, read and write requests are executed to and from disk through direct memory access, bypassing the AdvFS buffer cache.
Direct I/O can significantly improve disk I/O throughput for database applications that read or write data only once (or for applications that do not frequently write to previously written pages). However, direct I/O can degrade disk I/O performance for applications that access data multiple times, because data is not cached. As soon as you specify direct I/O, any data already in the buffer cache is automatically flushed to disk.
If you enable direct I/O, by default, reads and writes to a file will be done synchronously. However, you can use the asynchronous I/O (AIO) functions (aio_read and aio_write) to enable an application to achieve asynchronous-like behavior by issuing one or more synchronous direct I/O requests without waiting for their completion. See the Programmer's Guide for more information.
Although direct I/O will handle I/O requests of any byte size, the best performance will occur when the requested byte size is aligned on file page boundaries and is evenly divisible into 8-KB pages. Direct transfer from the user buffer to the disk is optimized in this case.
To enable direct I/O for a specific file, use the open system call and set the O_DIRECTIO file access flag. Once a file is opened for direct I/O, this mode is in effect until all users close the file.
Note that you cannot enable direct I/O for a file if it is already opened for data logging or if it is memory mapped.
Use the fcntl system call with the F_GETCACHEPOLICY argument to determine if an open file has direct I/O enabled.
See fcntl(2), open(2), AdvFS Administration, and the Programmer's Guide for more information.
9.3.4.8 Configuring an AdvFS root File system
There are several advantages to configuring an AdvFS root file system:
Quick restart after a crash, because you do not run the fsck utility after a crash.
One set of tools to manage all local file systems.
All features of AdvFS except addvol and rmvol are available to manage the root file system.
Use AdvFS with LSM to mirror the root file system. This allows your root file system to remain viable even if there is a disk failure.
You can configure an AdvFS root file system during the initial base-system installation, or you can convert your existing root file system after installation. See the AdvFS Administration manual for more information.
9.3.4.9 Striping Files
You may be able to use the AdvFS stripe utility to improve the sequential read and write performance of an individual file by spreading file data evenly across different disks in a file domain. For the maximum performance benefit, stripe files across disks on different I/O buses.
Striping files, instead of striping entire disks with RAID 0, is useful if an application continually accesses only a few specific files. Do not stripe both a file and the disk on which it resides. For information about striping entire disks, see Chapter 8.
The stripe utility distributes a zero-length file (a file with no data written to it yet) evenly across a specified number of volumes. As data is appended to the file, the data is spread across the volumes. The size of each data segment (also called the stripe or chunk size) is fixed at 64 KB (65,536 bytes). AdvFS alternates the placement of the segments on the disks in a sequential pattern. For example, the first 64 KB of the file is written to the first volume, the second 64 KB is written to the next volume, and so on.
If an application's I/O transfer read or write size is more than 64 KB, striping files may improve application performance by enabling parallel I/O operations on multiple controllers or volumes, because AdvFS file striping uses a fixed 64 KB stripe width.
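For example, to stripe a newly created, zero-length file across two volumes in its file domain, you might enter a command similar to the following (the file name is only an example):

# stripe -n 2 /data/video.stream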
Note
Distributing data across multiple volumes decreases data availability, because one volume failure makes the entire file domain unavailable. To make striped files highly available, you can use RAID 1 to mirror the disks across which the file is striped. For information about mirroring, see Chapter 8.
See stripe(8) for more information.
9.3.4.10 Using AdvFS Quotas
AdvFS quotas allow you to track and control the amount of physical storage that a user, group, or fileset consumes. In addition, AdvFS quota information is always maintained, but quota enforcement can be activated and deactivated.
You can set quota values on the amount of disk storage and on the number of files. Quotas that apply to users and groups are similar to UFS quotas. You can set a separate quota for each user or each group of users for each fileset.
In addition, you can restrict the space that a fileset itself can use. Fileset quotas are useful when a file domain contains multiple filesets. Without fileset quotas, any fileset can consume all of the disk space in the file domain.
All quotas can have two types of limits: hard and soft. A hard limit cannot be exceeded; space cannot be allocated and files cannot be created. A soft limit permits a period of time during which the limit can be exceeded as long as the hard limit has not been exceeded.
For information about AdvFS quotas, see the AdvFS Administration manual.
9.3.4.11 Consolidating I/O Transfers
By default, AdvFS consolidates a number of I/O transfers into a single, large I/O transfer, which can improve AdvFS performance. To enable the consolidation of I/O transfers, use the chvol command with the -c on option.
It is recommended that you not disable the consolidation of I/O transfers.
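For example, to re-enable I/O consolidation on a volume if it has been turned off, you might enter a command similar to the following (the device and domain names are only examples):

# chvol -c on /dev/disk/dsk3c data_domain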
See chvol(8) for more information.
9.3.5 Gathering AdvFS Information
Table 9-3 describes the tools you can use to obtain information about AdvFS.
Table 9-3: AdvFS Monitoring Tools
Name | Use | Description |
advfsstat | Displays AdvFS performance statistics (Section 9.3.5.1) | Allows you to obtain extensive AdvFS performance information, including buffer cache, fileset, volume, and bitfile metadata table (BMT) statistics, for a specific interval of time. |
advscan | Identifies disks in a file domain (Section 9.3.5.2) | Locates pieces of AdvFS file domains on disk partitions and in LSM disk groups. |
showfdmn | Displays detailed information about AdvFS file domains and volumes (Section 9.3.5.3) | Allows you to determine if file data is evenly distributed across AdvFS volumes. For multivolume domains, the utility also displays the total volume size, the total number of free blocks, and the total percentage of volume space currently allocated. |
showfile | Displays information about files in an AdvFS fileset (Section 9.3.5.4) | Displays detailed information about files (and directories) in an AdvFS fileset. |
showfsets | Displays AdvFS fileset information for a file domain (Section 9.3.5.5) | Displays information about the filesets in a file domain, including the fileset names, the total number of files, the number of used blocks, the quota status, and the clone status. |
quota | Displays disk usage and quota limits | Displays the block usage, number of files, and quotas for a user or group. You can choose to display quota information for users or groups, for all filesets with usage over quota, or for all mounted filesets regardless of whether quotas are activated. |
vdf | Clarifies the relationship between file domain and fileset disk usage | Reformats output from other AdvFS display commands to clarify the relationship between file domain and fileset disk usage. |
vbmtpg | Displays a formatted page of the BMT (Section 9.3.5.6) | Displays a formatted page of the AdvFS bitfile metadata table (BMT). |
The following sections describe some of these commands in detail.
9.3.5.1 Monitoring AdvFS Performance Statistics by Using the advfsstat Command
The advfsstat command displays various AdvFS performance statistics and monitors the performance of AdvFS domains and filesets. Use this command to obtain detailed information, especially if the iostat command output indicates a disk bottleneck (see Section 8.2).
The advfsstat command displays detailed information about a file domain, including information about the AdvFS buffer cache, fileset vnode operations, locks, the namei cache, and volume I/O performance. The command reports information in units of one disk block (512 bytes). By default, the command displays one sample. You can use the -i option to output information at specific time intervals.
The following example of the advfsstat -v 2 command shows the current I/O queue statistics for the specified file domain:

# /usr/sbin/advfsstat -v 2 test_domain

vol1
  rd  wr  rg arg  wg awg blk wlz sms rlz con dev
  54   0  48 128   0   0   0   1   0   0   0  65
The previous example shows the following fields:
Read and write requests--Compare the number of read requests (rd) to the number of write requests (wr). Read requests are blocked until the read completes, but write requests will not block the calling thread, which increases the throughput of multiple threads.
Consolidated reads and writes--You may be able to improve performance by consolidating reads and writes. The consolidated read values (rg and arg) and write values (wg and awg) indicate the number of disparate reads and writes that were consolidated into a single I/O to the device driver. If the number of consolidated reads and writes decreases compared to the number of reads and writes, AdvFS may not be consolidating I/O.
I/O queue values--The blk, wlz, sms, rlz, con, and dev fields can indicate potential performance issues. The sms value specifies the number of requests on the smooth sync queue. The con value specifies the number of entries on the consolidate queue. These entries are ready to be consolidated and moved to the device queue. The device queue value (dev) shows the number of I/O requests that have been issued to the device controller. The system must wait for these requests to complete.
If the number of I/O requests on the device queue increases continually and you experience poor performance, applications may be I/O bound on this device. You may be able to eliminate the problem by adding more disks to the domain or by striping with LSM or hardware RAID.
You can monitor the type of requests that applications are issuing by using the advfsstat command's -f option to display fileset vnode operations. You can display the number of file creates, reads, and writes and other operations for a specified domain or fileset. For example:

# /usr/sbin/advfsstat -i 3 -f 2 scratch_domain fset1

 lkup  crt geta read writ fsnc dsnc   rm   mv rdir  mkd  rmd link
    0    0    0    0    0    0    0    0    0    0    0    0    0
    4    0   10    0    0    0    0    2    0    2    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0
   24    8   51    0    9    0    0    3    0    0    4    0    0
 1201  324 2985    0  601    0    0  300    0    0    0    0    0
 1275  296 3225    0  655    0    0  281    0    0    0    0    0
 1217  305 3014    0  596    0    0  317    0    0    0    0    0
 1249  304 3166    0  643    0    0  292    0    0    0    0    0
 1175  289 2985    0  601    0    0  299    0    0    0    0    0
  779  148 1743    0  260    0    0  182    0   47    0    4    0
    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0
See advfsstat(8) for more information.
Note that it is difficult to link performance problems to some statistics, such as buffer cache statistics. In addition, lock performance that is related to lock statistics cannot be tuned.
9.3.5.2 Identifying Disks in an AdvFS File Domain by Using the advscan Command
The advscan command locates pieces of AdvFS domains on disk partitions and in LSM disk groups. Use the advscan command when you have moved disks to a new system, have moved disks in a way that has changed device numbers, or have lost track of where the domains are.
You can specify a list of volumes or disk groups with the advscan command to search all partitions and volumes. The command determines which partitions on a disk are part of an AdvFS file domain.
You can also use the advscan command for repair purposes if you deleted the /etc/fdmns directory, deleted a domain directory under /etc/fdmns, or deleted some links from a domain directory under /etc/fdmns.
Use the advscan command to rebuild all or part of your /etc/fdmns directory, or you can manually rebuild it by supplying the names of the partitions in a domain.
The following example scans two disks for AdvFS partitions:
# /usr/advfs/advscan dsk0 dsk5

Scanning disks  dsk0 dsk5
Found domains:

usr_domain
        Domain Id        2e09be37.0002eb40
        Created          Thu Jun 26 09:54:15 1998
        Domain volumes   2
        /etc/fdmns links 2
        Actual partitions found:
                         dsk0c
                         dsk5c
For the following example, the
dsk6
file domains
were removed from
/etc/fdmns
.
The
advscan
command scans device
dsk6
and re-creates the missing domains.
#
/usr/advfs/advscan -r dsk6
Scanning disks dsk6 Found domains: *unknown* Domain Id 2f2421ba.0008c1c0 Created Mon Jan 20 13:38:02 1998 Domain volumes 1 /etc/fdmns links 0 Actual partitions found: dsk6a* *unknown* Domain Id 2f535f8c.000b6860 Created Tue Feb 25 09:38:20 1998 Domain volumes 1 /etc/fdmns links 0 Actual partitions found: dsk6b* Creating /etc/fdmns/domain_dsk6a/ linking dsk6a Creating /etc/fdmns/domain_dsk6b/ linking dsk6b
See
advscan
(8)
for more information.
9.3.5.3 Checking AdvFS File Domains by Using the showfdmn Command
The
showfdmn
command displays the attributes of an AdvFS file domain
and detailed information about each volume in the file domain.
The following example of the
showfdmn
command displays
domain information for the
root_domain
file domain:
%
/sbin/showfdmn root_domain
Id Date Created LogPgs Version Domain Name 34f0ce64.0004f2e0 Wed Mar 17 15:19:48 1999 512 4 root_domain Vol 512-Blks Free % Used Cmode Rblks Wblks Vol Name 1L 262144 94896 64% on 256 256 /dev/disk/dsk0a
See
showfdmn
(8)
for more information about the output of the
command.
9.3.5.4 Displaying AdvFS File Information by Using the showfile Command
The
showfile
command displays the full storage allocation map (extent
map) for one or more files in an AdvFS fileset.
An extent is a contiguous
area of disk space that AdvFS allocates to a file.
The following example of the
showfile
command displays
the AdvFS characteristics for all of the files in the current working directory:
#
/usr/sbin/showfile *
Id Vol PgSz Pages XtntType Segs SegSz I/O Perf File 23c1.8001 1 16 1 simple ** ** ftx 100% OV 58ba.8004 1 16 1 simple ** ** ftx 100% TT_DB ** ** ** ** symlink ** ** ** ** adm 239f.8001 1 16 1 simple ** ** ftx 100% advfs ** ** ** ** symlink ** ** ** ** archive 9.8001 1 16 2 simple ** ** ftx 100% bin (index) ** ** ** ** symlink ** ** ** ** bsd ** ** ** ** symlink ** ** ** ** dict 288.8001 1 16 1 simple ** ** ftx 100% doc 28a.8001 1 16 1 simple ** ** ftx 100% dt ** ** ** ** symlink ** ** ** ** man 5ad4.8001 1 16 1 simple ** ** ftx 100% net ** ** ** ** symlink ** ** ** ** news 3e1.8001 1 16 1 simple ** ** ftx 100% opt ** ** ** ** symlink ** ** ** ** preserve ** ** ** ** advfs ** ** ** ** quota.group ** ** ** ** advfs ** ** ** ** quota.user b.8001 1 16 2 simple ** ** ftx 100% sbin (index) ** ** ** ** symlink ** ** ** ** sde 61d.8001 1 16 1 simple ** ** ftx 100% tcb ** ** ** ** symlink ** ** ** ** tmp ** ** ** ** symlink ** ** ** ** ucb 6df8.8001 1 16 1 simple ** ** ftx 100% users
The
I/O
column specifies whether write operations
are forced to be synchronous.
See
Section 9.3.4.5
for information.
The following example of the
showfile
command shows
the characteristics and extent information for the
tutorial
file, which is a simple file:
#
/usr/sbin/showfile -x tutorial
Id Vol PgSz Pages XtntType Segs SegSz I/O Perf File 4198.800d 2 16 27 simple ** ** async 66% tutorial extentMap: 1 pageOff pageCnt vol volBlock blockCnt 0 5 2 781552 80 5 12 2 785776 192 17 10 2 786800 160 extentCnt: 3
The
Perf
entry shows the efficiency of the file-extent
allocation, expressed as a percentage of the optimal extent layout.
A high
value, such as 100 percent, indicates that the AdvFS I/O subsystem is highly
efficient.
A low value indicates that files may be fragmented.
See
showfile
(8)
for more information about the command output.
9.3.5.5 Displaying the AdvFS Filesets in a File Domain by Using the showfsets Command
The
showfsets
command displays the AdvFS filesets (or clone filesets) and their characteristics
in a specified domain.
The following example of the showfsets command shows that the dmn1 file domain has one fileset and one clone fileset:
#
/sbin/showfsets dmn1
mnt Id : 2c73e2f9.000f143a.1.8001 Clone is : mnt_clone Files : 7456, SLim= 60000, HLim=80000 Blocks (1k) : 388698, SLim= 6000, HLim=8000 Quota Status : user=on group=on mnt_clone Id : 2c73e2f9.000f143a.2.8001 Clone of : mnt Revision : 2
See
showfsets
(8)
for information about the options and output
of the command.
9.3.5.6 Monitoring the Bitfile Metadata Table
The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.
The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time AdvFS needs additional metadata, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, each BMT expansion may itself be spread over several extents.
To monitor the BMT, use the
vbmtpg
command and examine
the number of mcells (freeMcellCnt
).
The value of
freeMcellCnt
can range from 0 to 22.
A volume with 1 free mcell
has very little space in which to grow the BMT.
See
vbmtpg
(8)
for more information.
You can also invoke the
showfile
command and specify
mount_point/.tags/M-10
to examine the BMT
extents on the first domain volume that contains the fileset mounted on the
specified mount point.
To examine the extents of the other volumes in the
domain, specify
M-16
,
M-24
, and so on.
If the extents at the end of the BMT are smaller than the extents at the beginning
of the file, the BMT is becoming fragmented.
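For example, assuming that a fileset in the domain is mounted on /mnt (a hypothetical mount point), a command similar to the following displays the BMT extent map for the first volume:
#
/usr/sbin/showfile -x /mnt/.tags/M-10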
See
showfile
(8)
for more information.
9.3.6 Tuning AdvFS
After you configure AdvFS, as described in Section 9.3.4, you may be able to tune it to improve performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.
Table 9-4
lists AdvFS tuning guidelines and performance
benefits as well as tradeoffs.
The guidelines described in
Table 9-1
also apply to AdvFS configurations.
Table 9-4: AdvFS Tuning Guidelines
Guideline | Performance Benefit | Tradeoff |
Decrease the size of the metadata buffer cache to 1 percent (Section 6.4.6) | Improves performance for systems that use only AdvFS | None |
Increase the percentage of memory allocated for the AdvFS buffer cache (Section 9.3.6.1) | Improves AdvFS performance if data reuse is high | Consumes memory |
Increase the size of the AdvFS buffer cache hash table (Section 9.3.6.2) | Speeds lookup operations and decreases CPU usage | Consumes memory |
Increase the memory reserved for AdvFS access structures (Section 9.3.6.3) | Improves AdvFS performance for systems that open and reuse files | Consumes memory |
Increase the amount of data cached in the ready queue (Section 9.3.6.4) | Improves AdvFS performance for systems that open and reuse files | May cause I/O spikes or increase the number of lost buffers if a crash occurs |
Increase the smooth sync caching threshold for asynchronous I/O requests (Section 9.3.6.5) | Improves performance of AdvFS asynchronous I/O | Increases the chance that data may be lost if a system crash occurs |
Increase the maximum number of I/O requests on the device queue (Section 9.3.6.6) | Keeps devices busy | May degrade response time |
Disable the flushing of dirty pages mapped with the mmap function during a sync call (Section 9.3.6.7) | May improve performance for applications that manage their own flushing | None |
The following sections describe the AdvFS tuning guidelines in detail.
9.3.6.1 Increasing the Size of the AdvFS Buffer Cache
The
advfs
subsystem attribute
AdvfsCacheMaxPercent
specifies the
maximum percentage of physical memory that can be used to cache AdvFS file
data.
Caching AdvFS data improves I/O performance only if the cached data
is reused.
Performance Benefit and Tradeoff
If data reuse is high, you may be able to improve AdvFS performance by increasing the percentage of memory allocated to the AdvFS buffer cache. However, this will decrease the amount of memory available for processes.
You also may need to increase the number of AdvFS buffer cache hash chains to increase the size of the AdvFS buffer cache. See Section 9.3.6.2 for information.
You cannot modify the
AdvfsCacheMaxPercent
attribute
without rebooting the system.
When to Tune
You may need to increase the size of the AdvFS buffer cache if data reuse is high and if pages are being rapidly recycled. Increasing the size of the buffer cache will enable pages to remain in the cache for a longer period of time. This increases the chance that a cache hit will occur.
Use the
advfsstat -b
command to determine if pages
are being recycled too quickly.
If the command output shows that the ratio
of total hits (hit
) to total counts (cnt
),
for both
pin
and
ref
, is less than 85
percent, pages are being rapidly recycled.
Recommended Values
The default value of the
AdvfsCacheMaxPercent
attribute
is 7 percent of memory.
The minimum value is 1 percent; the maximum value
is 30 percent.
Increase the value of the
AdvfsCacheMaxPercent
attribute
only by small increments to optimize file system performance without wasting
memory.
If you increase the value of the
AdvfsCacheMaxPercent
attribute and experience no performance benefit, return to the original value.
Use the
vmstat
command to check virtual memory statistics,
as described in
Section 6.3.1.
Make sure that increasing the
size of the AdvFS buffer cache does not cause excessive paging and swapping.
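Because the attribute cannot be changed on a running system, set the new value in the /etc/sysconfigtab file and reboot. The following stanza is a minimal sketch; the value of 10 percent is illustrative only:
advfs:
    AdvfsCacheMaxPercent = 10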
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.3.6.2 Increasing the Number of AdvFS Buffer Hash Chains
The buffer cache
hash table for the AdvFS buffer cache is used to locate pages of AdvFS file
data in memory.
The table contains a number of hash chains, which contain
elements that point to pages of file system data that have already been read
into memory.
When a
read
or
write
system
call is done for a particular offset within an AdvFS file, the system sequentially
searches the appropriate hash chain to determine if the file data is already
in memory.
The value of the
advfs
subsystem attribute
AdvfsCacheHashSize
specifies the number of hash chains (entries)
on the AdvFS buffer cache hash table.
Performance Benefit and Tradeoff
Increasing the number of hash chains on the buffer cache hash table will result in shorter hash chains. Short hash chains contain fewer elements to search, which increases search speed and decreases CPU usage.
Increasing the size of the AdvFS buffer cache hash table will increase the amount of wired memory.
You cannot modify the
AdvfsCacheHashSize
attribute
without rebooting the system.
When to Tune
If you have more than 4 GB of memory, you may want to increase the value
of the
AdvfsCacheHashSize
attribute, which will increase
the number of hash chains on the table.
To determine if your system performance may benefit from increasing
the size of the buffer hash table, divide the number of AdvFS buffers by the
current value of the
AdvfsCacheHashSize
attribute.
Use the sysconfig -q advfs AdvfsCacheHashSize command to determine the current value of the attribute.
To obtain the number of AdvFS buffers,
examine the AdvFS system initialization message that reports this value and
the total amount of memory being used.
The result of this calculation is the average number of buffers on each buffer hash table chain (assuming that buffers are evenly distributed across the AdvFS buffer cache hash table). A smaller number means fewer buffers that AdvFS must search. If the average number of buffers for each chain is greater than 100, you may want to increase the size of the hash chain table.
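For example, if the AdvFS initialization message reports approximately 80,000 buffers (an illustrative figure) and the AdvfsCacheHashSize attribute has its default value of 8192, the average chain length is 80000 divided by 8192, or about 10 buffers for each chain, and the table does not need to be enlarged. If the same system used a value of 512, the average would be about 156 buffers for each chain, and a larger table would be worthwhile.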
Recommended Values
The default value of the
AdvfsCacheHashSize
attribute
is either 8192 or 10 percent of the size of the AdvFS buffer cache (rounded up to the next power of 2), whichever is smaller.
The minimum value is 1024.
The maximum value is either 65536 or the size of the AdvFS buffer cache, whichever is smaller.
The
AdvfsCacheMaxPercent
attribute specifies the size of the AdvFS buffer cache (see
Section 9.3.6.1).
You may want to double the default value of the
AdvfsCacheHashSize
attribute if the system is experiencing high CPU system time (see
Section 6.3.1), or if a kernel profile shows high percentage of CPU
usage in the
find_page
routine.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.3.6.3 Increasing the Memory for Access Structures
AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open and files that were opened but are now closed. Caching open file information can enhance AdvFS performance if the open files are later reused.
At boot time, the system reserves for AdvFS access structures a portion of physical memory that is not wired. Access structures are placed on the access structure free list, and are allocated and deallocated according to the kernel configuration and workload demands.
There are three attributes that control the allocation of AdvFS access structures:
The
AdvfsAccessMaxPercent
attribute controls
the maximum percentage of pageable memory that can be allocated for AdvFS
access structures.
At boot time, and when the number of access structures on
the free list is less than the value of the
AdvfsMinFreeAccess
attribute, AdvFS allocates additional access structures, until the number
of access structures on the free list is twice the value of the
AdvfsMinFreeAccess
attribute or the value of the
AdvfsAccessMaxPercent
attribute, whichever is smaller.
The
AdvfsMaxFreeAccessPercent
attribute
controls when access structures are deallocated from the free list.
When the
percentage of access structures on the free list is more than the value of
the
AdvfsMaxFreeAccessPercent
attribute, and the number
of access structures on the free list is more than twice the value of the
AdvfsMinFreeAccess
attribute, AdvFS deallocates access structures.
See Section 9.3.3 for information about access structures and attributes.
Performance Benefit and Tradeoff
Increasing the value of the
AdvfsAccessMaxPercent
attribute allows you to allocate more memory resources for access structures,
which may improve AdvFS performance on systems that open and reuse many files.
However, this increases memory consumption.
If you increase the value of the
AdvfsMinFreeAccess
attribute, you will retain more access structures on the free list and delay
access structure deallocation, which may improve AdvFS performance for systems
that open and reuse many files.
However, this increases memory consumption.
If you increase the value of the
AdvfsMaxFreeAccessPercent
attribute, the system will retain access structures on the free
list for a longer time, which may improve AdvFS performance for systems that
open and reuse many files.
You can modify the
AdvfsAccessMaxPercent
,
AdvfsMinFreeAccess
, and
AdvfsMaxFreeAccessPercent
attributes without rebooting the system.
When to Tune
If your users or applications open and then reuse many AdvFS files (for example, if you have a proxy server), you may be able to improve AdvFS performance by increasing memory resources for access structures.
If you do not use AdvFS, if your workload does not frequently write to previously written pages, or if you have a large-memory system, you may want to decrease the memory allocated for access structures. See Section 6.4.5 for information.
Recommended Values
The default value of the
AdvfsAccessMaxPercent
attribute
is 25 percent of pageable memory.
The minimum value is 5 percent; the maximum
value is 95 percent.
The default value of the
AdvfsMinFreeAccess
attribute
is 128.
The minimum value is 1; the maximum value is 100,000.
The default value of the
AdvfsMaxFreeAccessPercent
attribute is 80 percent.
The minimum value is 5 percent; the maximum value
is 95 percent.
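Because these attributes can be modified on a running system, you can experiment by using the sysconfig -r command; the values shown here are illustrative, not recommendations. Add the same settings to the /etc/sysconfigtab file to preserve them across reboots.
#
/sbin/sysconfig -r advfs AdvfsAccessMaxPercent=35
#
/sbin/sysconfig -r advfs AdvfsMinFreeAccess=256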
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.3.6.4 Increasing Data Cached in the Ready Queue
AdvFS caches asynchronous I/O requests in the AdvFS buffer cache. If the cached data is later reused, pages can be retrieved from memory and a disk operation is avoided.
Asynchronous I/O requests are sorted in the ready queue and remain there
until the size of the queue reaches the value specified by the
AdvfsReadyQLim
attribute or, if smooth sync is not enabled, until the
update
daemon flushes the data.
See
Section 9.3.2
for more information about AdvFS queues.
See
Section 9.3.6.5
for information about using smooth sync to control asynchronous I/O request
caching.
Performance Benefit and Tradeoff
Increasing the size of the ready queue increases the time that a buffer stays on the I/O queue before it is flushed to disk, which can improve AdvFS performance if data is reused.
You can modify the
AdvfsReadyQLim
attribute without
rebooting the system.
When to Tune
If you have high data reuse (data is repeatedly read and written), you may want to increase the size of the ready queue. This can increase the number of AdvFS buffer cache hits. If you have low data reuse, it is recommended that you use the default value.
Recommended Values
You can modify the size of the ready queue for all AdvFS volumes by
changing the value of the
AdvfsReadyQLim
attribute.
The
default value of the
AdvfsReadyQLim
attribute is 16,384
512-byte blocks (8 MB).
You can modify the size for a specific AdvFS volume by using the
chvol -t
command.
See
chvol
(8)
for more information.
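For example, the following commands double the system-wide ready queue limit on a running system and make an equivalent change for a single volume; this is a sketch, and the value, device name, and domain name are hypothetical:
#
/sbin/sysconfig -r advfs AdvfsReadyQLim=32768
#
chvol -t 32768 dsk3c staff_dmn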
If you change the size of the ready queue and performance does not improve, return to the original value.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.3.6.5 Increasing the AdvFS Smooth Sync Cache Timeout Value
Smooth
sync functionality improves AdvFS asynchronous I/O performance by preventing
I/O spikes caused by the
update
daemon, increasing the
chance of an AdvFS buffer cache hit, and improving the consolidation of I/O
requests.
By default, smooth sync is enabled on your system.
AdvFS uses I/O request queues to cache asynchronous I/O requests before
they are handed to the device driver.
Without smooth sync enabled, every 30
seconds, the
update
daemon flushes data from memory to
disk, regardless of how long a buffer has been cached.
However, with smooth
sync enabled (the default), the
update
daemon will not
automatically flush the AdvFS ready queue buffers.
Instead, asynchronous I/O
requests remain in the smooth sync queue for the amount of time specified
by the value of the
vfs
attribute
smoothsync_age
(the default is 30 seconds).
After this time, the buffer moves
to the ready queue.
You enable smooth sync functionality (the default) by using the
smoothsync_age
attribute.
However, you do not specify a value for
smoothsync_age
in the
/etc/sysconfigtab
file.
Instead, the
/etc/inittab
file is used to enable smooth
sync when the system boots to multiuser mode, and to disable smooth sync when
the system goes from multiuser mode to single-user mode.
This procedure is
necessary to reflect the behavior of the
update
daemon,
which operates only in multiuser mode.
To enable smooth sync, the following lines must be included in the
/etc/inittab
file and the time limit for caching buffers in the
smooth sync queue must be specified (the default is 30 seconds):
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1 smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1
Performance Benefit and Tradeoff
Increasing the amount of time an asynchronous I/O request remains in the smooth sync queue increases the chance that a buffer cache hit will occur, which improves AdvFS performance if data is reused. However, this also increases the chance that data may be lost if a system crash occurs.
Decreasing the value of the
smoothsync_age
attribute
will speed the flushing of buffers.
When to Tune
You may want to increase the amount of time an asynchronous I/O request remains in the smooth sync queue if you reuse AdvFS data.
Recommended Values
Thirty seconds is the default smooth sync queue timeout limit.
If you
increase the value of the
smoothsync_age
attribute in the
/etc/inittab
file, you may improve the chance of a buffer cache
hit by retaining buffers on the smooth sync queue for a longer period of time.
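For example, to raise the smooth sync timeout to 60 seconds (an illustrative value), change the multiuser entry in the /etc/inittab file as follows and, to apply the change to the running system without rebooting, run the same sysconfig command by hand:
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=60 > /dev/null 2>&1
#
/sbin/sysconfig -r vfs smoothsync_age=60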
Use the
advfsstat -S
command to show the AdvFS smooth sync
queue statistics.
To disable smooth sync, specify a value of 0 (zero) for the
smoothsync_age
attribute.
9.3.6.6 Specifying the Maximum Number of I/O Requests on the Device Queue
Small, logically contiguous AdvFS I/O requests are consolidated into larger I/O requests and put on the device queue, before they are sent to the device driver. See Section 9.3.2 for more information about AdvFS queues.
The
AdvfsMaxDevQLen
attribute controls the maximum
number of I/O requests on the device queue.
When the number of requests on
the queue exceeds this value, only synchronous requests are accepted onto
the device queue.
Performance Benefit and Tradeoff
Increasing the size of the device queue can keep devices busy, but may degrade response time.
Decreasing the size of the device queue decreases the amount of time it takes to complete a synchronous (blocking) I/O operation and can improve response time.
You can modify the
AdvfsMaxDevQLen
attribute without
rebooting the system.
When to Tune
Although the default value of the
AdvfsMaxDevQLen
attribute is appropriate for many configurations, you may need to modify this
value.
Increase the default value of the
AdvfsMaxDevQLen
attribute only if devices are not being kept busy.
Recommended Values
The default value of the
AdvfsMaxDevQLen
attribute
is 24 requests.
The minimum value is 0; the maximum value is 65536.
A guideline
is to specify a value for the
AdvfsMaxDevQLen
attribute
that is less than or equal to the average number of I/O operations that can
be performed in 0.5 seconds.
Make sure that increasing the size of the device queue does not cause
a decrease in response time.
To calculate response time, multiply the value
of the
AdvfsMaxDevQLen
attribute by the average I/O latency
time for your disks.
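For example, if your disks average about 10 milliseconds for each I/O operation (an illustrative figure), the default queue length of 24 represents roughly 240 milliseconds of queued work, and the 0.5-second guideline would allow a value of about 50 (0.5 seconds divided by 10 milliseconds for each operation).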
If you do not want to limit the number of requests on the device queue,
set the value of the
AdvfsMaxDevQLen
attribute to 0 (zero),
although this is not recommended.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.3.6.7 Disabling the Flushing of Modified mmapped Pages
The AdvFS buffer cache can contain modified data due to a
write
system call or a memory write reference after an
mmap
system call.
The
update
daemon runs every
30 seconds and issues a
sync
call for every fileset mounted
with read and write access.
However, if smooth sync is enabled (the default),
the
update
daemon will not flush the ready queue.
Instead,
asynchronous I/O requests remain in the smooth sync queue for the amount of
time specified by the value of the
vfs
attribute
smoothsync_age
(the default is 30 seconds).
See
Section 9.3.6.5
for information about the smooth sync queue.
The
AdvfsSyncMmapPages
attribute controls whether
modified (dirty) mmapped pages are flushed to disk during a
sync
system call.
If the
AdvfsSyncMmapPages
attribute
is set to 1 (the default), the modified mmapped pages are asynchronously written
to disk.
If the
AdvfsSyncMmapPages
attribute is set to
0, modified mmapped pages are not written to disk during a
sync
system call.
Performance Benefit
Disabling the flushing of modified mmapped pages may improve performance
of applications that manage their own
mmap
page flushing.
You can modify the
AdvfsSyncMmapPages
attribute without
rebooting the system.
When to Tune
Disable flushing mmapped pages only if your applications manage their
own
mmap
page flushing.
Recommended Values
If your applications manage their own
mmap
page flushing,
set the value of the
AdvfsSyncMmapPages
attribute to zero.
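For example, the following command disables the flushing of modified mmapped pages on a running system; add the same setting to the /etc/sysconfigtab file to preserve it across reboots:
#
/sbin/sysconfig -r advfs AdvfsSyncMmapPages=0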
See
mmap
(2)
and
msync
(2)
for more information.
See
Section 3.6
for information about modifying kernel subsystem attributes.
9.3.7 Improving AdvFS Performance
After you configure AdvFS, as described in Section 9.3.4, you may be able to improve performance by performing some administrative tasks.
Table 9-5 lists AdvFS performance improvement guidelines and performance benefits as well as tradeoffs.
Table 9-5: AdvFS Performance Improvement Guidelines
Guideline | Performance Benefit | Tradeoff |
Defragment file domains (Section 9.3.7.1) | Improves read and write performance | Procedure is time-consuming |
Decrease the I/O transfer read-ahead size (Section 9.3.7.2) | Improves performance for mmap page faulting | None |
Move the transaction log to a fast or uncongested volume (Section 9.3.7.3) | Prevents log from becoming a bottleneck | None |
Balance files across volumes in a file domain (Section 9.3.7.4) | Improves performance and evens the future distribution of files | None |
Migrate frequently used or large files to different file domains (Section 9.3.7.5) | Improves I/O performance | None |
The following sections describe the AdvFS performance improvement
guidelines in detail.
9.3.7.1 Defragmenting a File Domain
An extent is a contiguous area of disk space that AdvFS allocates to a file. Extents consist of one or more 8-KB pages. When storage is added to a file, it is grouped in extents. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.
File I/O is most efficient when there are few extents. If a file consists of many small extents, AdvFS requires more I/O processing to read or write the file. Disk fragmentation can result in many extents and may degrade read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space due to fragmentation.
Use the
defragment
utility to reduce the amount of
file fragmentation in a file domain by attempting to make the files more contiguous,
which reduces the number of file extents.
The utility does not affect data
availability and is transparent to users and applications.
Striped files are
not defragmented.
Performance Benefit and Tradeoff
Defragmenting improves AdvFS performance by making AdvFS disk I/O more efficient. However, the defragment process can be time-consuming and requires disk space in order to run.
When to Perform this Task
Compaq recommends that you run
defragment
only if you experience problems because of excessive fragmentation and only
when there is low file system activity.
In addition, there is little performance
benefit from defragmenting in the following circumstances:
A file domain contains primarily files that are smaller than 8 KB.
A file domain is used in a mail server.
A file domain is read-only.
To determine if a file domain is fragmented, use the
defragment
utility with the
-v
and
-n
options to show the amount of file fragmentation.
Ideally, you
want few extents for each file.
For example:
#
defragment -vn staff_dmn
defragment: Gathering data for 'staff_dmn' Current domain data: Extents: 263675 Files w/ extents: 152693 Avg exts per file w/exts: 1.73 Aggregate I/O perf: 70% Free space fragments: 85574 <100K <1M <10M >10M Free space: 34% 45% 19% 2% Fragments: 76197 8930 440 7
You can also use the
showfile
command to check a
file's fragmentation.
See
Section 9.3.5.4
for information.
Recommended Procedure
You can improve the efficiency of the defragmenting process by deleting
any unneeded files in the file domain before running the
defragment
utility.
See
defragment
(8)
for more information.
9.3.7.2 Decreasing the I/O Transfer Size
AdvFS reads and writes data by a fixed number of 512-byte blocks. The default value depends on the disk driver's reported preferred transfer size. For example, a common default value is either 128 blocks or 256 blocks.
If you use the
addvol
or
mkfdmn
command on a Logical Storage Manager (LSM) volume, the preferred transfer
size may be larger than if LSM was not used.
The value depends on how you
configured the LSM volume.
Performance Benefit
You may be able to improve performance for
mmap
page
faulting and reduce read-ahead paging and cache dilution by decreasing the
read-ahead size.
When to Perform this Task
You may want to decrease the I/O transfer size if you experience performance problems with AdvFS I/O throughput.
Recommended Procedure
To display the range of I/O transfer sizes, use the
chvol -l
command.
Use the
chvol -r
command to modify
the read I/O transfer size (the amount of data read for each I/O request).
Use the
chvol -w
command to modify the write I/O transfer
size (the amount of data written for each I/O request).
You can decrease the read-ahead size by using the
chvol -r
command.
You can decrease the amount of data written for each I/O request by
using the
chvol -w
command.
In general, you want to maximize
the amount of data written for each I/O by using the default write I/O transfer
size or a larger value.
However, in some cases (for example, if you are using LSM volumes), you may need to reduce the AdvFS write-consolidation size. If your AdvFS domains are using LSM, the default preferred transfer size is high, and I/O throughput is not optimal, reduce the write I/O transfer size.
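For example, the following commands display the current transfer sizes for a volume and then reduce the write I/O transfer size; this is a sketch, and the device name, domain name, and block count are hypothetical:
#
chvol -l dsk3c staff_dmn
#
chvol -w 128 dsk3c staff_dmn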
See
chvol
(8)
for more information.
9.3.7.3 Moving the Transaction Log
The AdvFS transaction log should be located on a fast or uncongested disk and bus; otherwise, performance may be degraded.
Performance Benefit
Locating the transaction log on a fast or uncongested bus improves performance.
When to Tune
Use the
showfdmn
command to determine the current
location of the transaction log.
In the
showfdmn
command
display, the letter L appears next to the volume that contains the log.
Move the transaction log if the volume on which it resides
is busy and the transaction log is a bottleneck.
See
showfdmn
(8)
for more information.
Recommended Procedure
Use the
switchlog
command to relocate the transaction
log of the specified file domain to a faster or less congested volume in the
same domain.
See
switchlog
(8)
for more information.
In addition, you can divide a large multi-volume file domain into several
smaller file domains.
This will distribute the transaction log I/O across
multiple logs.
9.3.7.4 Balancing a Multivolume File Domain
If the files in a multivolume
domain are not evenly distributed, performance may be degraded.
Use the
balance
utility to distribute the percentage of used space evenly
across volumes in a multivolume file domain.
This improves performance and
the distribution of future file allocations.
Files are moved from one volume
to another until the percentage of used space on each volume in the domain
is as equal as possible.
The
balance
utility does not affect data availability
and is transparent to users and applications.
If possible, use the
defragment
utility before you balance files.
The
balance
utility does not generally split files.
Therefore, file domains with very large files may not balance as evenly as
file domains with smaller files.
Performance Benefit
Balancing files across the volumes in a file domain improves the distribution of disk I/O.
When to Perform this Task
You may want to balance a file domain if the files are not evenly distributed across the domain.
To determine if you need to balance your files across volumes, use the
showfdmn
command to display information about the volumes in a domain.
The
% Used
field shows the percentage of volume space that
is currently allocated to files or metadata (fileset data structure).
In the
following example, the
usr_domain
file domain is not balanced.
Volume 1 has 63% used space while volume 2 has 0% used space (it has just
been added).
# showfdmn usr_domain Id Date Created LogPgs Version Domain Name 3437d34d.000ca710 Sun Oct 5 10:50:05 1997 512 3 usr_domain Vol 512-Blks Free % Used Cmode Rblks Wblks Vol Name 1L 1488716 549232 63% on 128 128 /dev/disk/dsk0g 2 262144 262000 0% on 128 128 /dev/disk/dsk4a --------- ------- ------ 1750860 811232 54%
See
showfdmn
(8)
for more information.
Recommended Procedure
Use the
balance
utility to distribute the percentage
of used space evenly across volumes in a multivolume file domain.
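For example, the following command balances the usr_domain file domain shown in the previous showfdmn output:
#
balance usr_domain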
See
balance
(8)
for more information.
9.3.7.5 Migrating Files Within a File Domain
Performance may degrade if too many frequently accessed or large files reside on the same volume in a multivolume file domain. You can improve I/O performance by altering the way files are mapped on the disk.
Use the
migrate
utility to move frequently accessed
or large files to different volumes in the file domain.
You can specify the
volume where a file is to be moved, or allow the system to pick the best space
in the file domain.
You can migrate either an entire file or specific pages
to a different volume.
In addition, the
migrate
command enables you to defragment
a specific file and make the file more contiguous, which improves performance.
Performance Benefit
Distributing the I/O load across the volumes in a file domain improves AdvFS performance.
When to Perform this Task
To determine which files to move, use the
showfile -x
command to look at the extent map and the performance percentage of a file.
A low performance percentage (less than 80%) indicates that the file is fragmented
on the disk.
The extent map shows whether the entire file or a portion of
the file is fragmented.
The following example displays the extent map of a file called
src
.
The file, which resides in a two-volume file domain, shows
an 18% performance efficiency in the
Perf
field.
# showfile -x src Id Vol PgSz Pages XtntType Segs SegSz I/O Perf File 8.8002 1 16 11 simple ** ** async 18% src extentMap: 1 pageOff pageCnt vol volBlock blockCnt 0 1 1 187296 16 1 1 1 187328 16 2 1 1 187264 16 3 1 1 187184 16 4 1 1 187216 16 5 1 1 187312 16 6 1 1 187280 16 7 1 1 187248 16 8 1 1 187344 16 9 1 1 187200 16 10 1 1 187232 16 extentCnt: 11
The file
src
consists of 11 file extents.
This file
would be a good candidate to move to another volume to reduce the number of
file extents.
See Section 8.2 for information about using commands to determine if file system I/O is evenly distributed.
Recommended Procedure
Use the
migrate
utility to move frequently accessed
or large files to different volumes in the file domain.
Note that using the
balance
utility after migrating files may cause the files to move
to a different volume.
See
migrate
(8)
and
balance
(8)
for more information.
9.4 Managing UFS Performance
The UNIX File System (UFS) can provide you with high-performance file system operations, especially for critical applications. For example, UFS file reads from striped disks can be 50 percent faster than if you are using AdvFS, and will consume only 20 percent of the CPU power that AdvFS requires.
However, unlike AdvFS, the UFS file system directory hierarchy is bound tightly to a single disk partition.
The following sections describe:
Using the UFS guidelines to set up a high-performance configuration (Section 9.4.1)
Obtaining information about UFS performance (Section 9.4.2)
Tuning UFS in order to improve performance (Section 9.4.3)
9.4.1 UFS Configuration Guidelines
There are a number of parameters
that can improve the UFS performance.
You can set all of the parameters when
you use the
newfs
command to create a file system.
For
existing file systems, you can modify some parameters by using the
tunefs
command.
See
newfs
(8)
and
tunefs
(8)
for more information.
Table 9-6
describes UFS configuration guidelines
and performance benefits as well as tradeoffs.
Table 9-6: UFS Configuration Guidelines
Guideline | Performance Benefit | Tradeoff |
Make the file system fragment size equal to the block size (Section 9.4.1.1) | Improves performance for large files | Wastes disk space for small files |
Use the default file system fragment size of 1 KB (Section 9.4.1.1) | Uses disk space efficiently | Increases the overhead for large files |
Reduce the density of inodes on a file system (Section 9.4.1.2) | Frees disk space for file data and improves large file performance | Reduces the number of files that can be created on the file system |
Allocate blocks sequentially (Section 9.4.1.3) | Improves performance for disks that do not have a read-ahead cache | Reduces the total available disk space |
Increase the number of blocks combined for a cluster (Section 9.4.1.4) | May decrease number of disk I/O operations | May require more memory to buffer data |
Use a Memory File System (MFS) (Section 9.4.1.5) | Improves I/O performance | Does not ensure data integrity because of cache volatility |
Use disk quotas (Section 9.4.1.6) | Controls disk space utilization | UFS quotas may result in a slight increase in reboot time |
Increase the maximum number of UFS and MFS mounts (Section 9.4.1.7) | Allows more mounted file systems | Requires additional memory resources |
The following sections describe the UFS configuration guidelines in
detail.
9.4.1.1 Modifying the File System Fragment and Block Sizes
The UFS file system block size can be 8 KB (the default), 16 KB, 32 KB, or 64 KB.
The default fragment size is 1 KB.
Use the newfs command to modify the block size, or to modify the fragment size so that it is 25, 50, 75, or 100 percent of the block size.
Although the default fragment size uses disk space efficiently, it increases the overhead for large files. If the average file in a file system is larger than 16 KB but less than 96 KB, you may be able to improve disk access time and decrease system overhead by making the file system fragment size equal to the default block size (8 KB).
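For example, the following command creates a UFS file system whose fragment size is equal to the 8 KB block size; the disk device name is hypothetical:
#
newfs -b 8192 -f 8192 /dev/rdisk/dsk2c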
See
newfs
(8)
for more information.
9.4.1.2 Reducing the Density of inodes
An inode describes an individual file in the file system. The maximum number of files in a file system depends on the number of inodes and the size of the file system. The system creates an inode for each 4 KB (4096 bytes) of data space in a file system.
If a file system will contain many large files and you are sure that you will not create a file for each 4 KB of space, you can reduce the density of inodes on the file system. This will free disk space for file data, but will reduce the number of files that can be created.
To do this, use the
newfs -i
command to specify
the amount of data space allocated for each inode.
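For example, the following command allocates one inode for each 8 KB of data space instead of the default 4 KB; the disk device name is hypothetical:
#
newfs -i 8192 /dev/rdisk/dsk2c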
See
newfs
(8)
for more information.
9.4.1.3 Allocating Blocks Sequentially
The UFS
rotdelay
parameter specifies
the time, in milliseconds, to service a transfer completion interrupt and
initiate a new transfer on the same disk.
You can set the
rotdelay
parameter to 0 (the default) to allocate blocks sequentially.
This
is useful for disks that do not have a read-ahead cache.
However, it will
reduce the total amount of available disk space.
Use either the
tunefs
command or the
newfs
command to modify the
rotdelay
value.
See
newfs
(8)
and
tunefs
(8)
for more information.
9.4.1.4 Increasing the Number of Blocks Combined for a Cluster
The value of the UFS
maxcontig
parameter specifies the number of blocks that can be combined into a single
cluster (or file-block group).
The default value of
maxcontig
is 8.
The file system attempts I/O operations in a size that is determined
by the value of
maxcontig
multiplied by the block size
(8 KB).
Device drivers that can chain several buffers together in a single transfer
should use a
maxcontig
value that is equal to the maximum
chain length.
This may reduce the number of disk I/O operations.
However,
more memory will be needed to cache data.
Use the
tunefs
command or the
newfs
command to change the value of
maxcontig
.
See
newfs
(8)
and
tunefs
(8)
for more information.
9.4.1.5 Using MFS
The Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS can improve read/write performance, but it is a volatile cache. The contents of an MFS are lost after a reboot, unmount operation, or power failure.
Because no data is written to disk, an MFS is a very fast file system and can be used to store temporary files or read-only files that are loaded into the file system after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS to cache the temporary files that are created during the build and reduce the build time.
See
mfs
(8)
for information.
9.4.1.6 Using UFS Disk Quotas
You can specify UFS file system limits for user accounts and for groups by setting up UFS disk quotas, also known as UFS file system quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.
You may want to set quotas on file systems that contain home directories,
because the sizes of these file systems can increase more significantly than
other file systems.
Do not set quotas on the
/tmp
file
system.
Note that, unlike AdvFS quotas, UFS quotas may cause a slight increase
in reboot time.
For information about AdvFS quotas, see
Section 9.3.4.10.
For information about UFS quotas, see the
System Administration
manual.
9.4.1.7 Increasing the Number of UFS and MFS Mounts
Mount structures
are dynamically allocated when a mount request is made and subsequently deallocated
when an unmount request is made.
The
vfs
subsystem attribute
max_ufs_mounts
specifies the maximum number of UFS and MFS mounts
on the system.
Performance Benefit and Tradeoff
Increasing the maximum number of UFS and MFS mounts enables you to mount more file systems. However, increasing the maximum number of mounts requires additional memory resources.
You can modify the
max_ufs_mounts
attribute without
rebooting the system.
When to Tune
Increase the maximum number of UFS and MFS mounts if your system will have more than the default limit of 1000 mounts.
Recommended Values
The default value of the
max_ufs_mounts
attribute
is 1000.
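For example, the following command raises the limit to 2000 mounts (an illustrative value) on a running system; add the setting to the /etc/sysconfigtab file to make it permanent:
#
/sbin/sysconfig -r vfs max_ufs_mounts=2000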
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.4.2 Gathering UFS Information
Table 9-7
describes the tools you can use to obtain information about UFS.
Table 9-7: UFS Monitoring Tools
Name | Use | Description |
dumpfs | Displays UFS information (Section 9.4.2.1) | Displays detailed information about a UFS file system or a special device, including information about the file system fragment size, the percentage of free space, super blocks, and the cylinder groups. |
dbx | Reports UFS clustering statistics (Section 9.4.2.2) | Reports statistics on how the system is performing cluster read and write transfers. |
dbx | Reports UFS metadata buffer cache statistics (Section 9.4.2.3) | Reports statistics on the metadata buffer cache, including superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries. |
The following sections describe these commands in detail.
9.4.2.1 Displaying UFS Information by Using the dumpfs Command
The
dumpfs
command displays
UFS information, including super block and cylinder group information, for
a specified file system.
Use this command to obtain information about the
file system fragment size and the minimum free space percentage.
The following
example shows part of the output of the
dumpfs
command:
#
/usr/sbin/dumpfs /devices/disk/dsk0g | more
magic 11954 format dynamic time Tue Sep 14 15:46:52 1998 nbfree 21490 ndir 9 nifree 99541 nffree 60 ncg 65 ncyl 1027 size 409600 blocks 396062 bsize 8192 shift 13 mask 0xffffe000 fsize 1024 shift 10 mask 0xfffffc00 frag 8 shift 3 fsbtodb 1 cpg 16 bpg 798 fpg 6384 ipg 1536 minfree 10% optim time maxcontig 8 maxbpg 2048 rotdelay 0ms headswitch 0us trackseek 0us rps 60
The information contained in the first lines is relevant for tuning. Of specific interest are the following fields:
bsize -- The block size of the file system, in bytes (8 KB).
fsize -- The fragment size of the file system, in bytes. For optimum I/O performance, you can modify the fragment size.
minfree -- The percentage of space that cannot be used by normal users (the minimum free space threshold).
maxcontig -- The maximum number of contiguous blocks that will be laid out before forcing a rotational delay; that is, the number of blocks that are combined into a single read request.
maxbpg -- The maximum number of blocks any single file can allocate out of a cylinder group before it is forced to begin allocating blocks from another cylinder group. A large value for maxbpg can improve performance for large files.
rotdelay -- The expected time (in milliseconds) to service a transfer completion interrupt and initiate a new transfer on the same disk. It is used to decide how much rotational spacing to place between successive blocks in a file. If rotdelay is zero, then blocks are allocated contiguously.
9.4.2.2 Monitoring UFS Clustering by Using the dbx Debugger
To determine how efficiently the system is performing cluster
read and write transfers, use the
dbx print
command to
examine the
ufs_clusterstats
data structure.
The following example shows a system that is not clustering efficiently:
#
/usr/ucb/dbx -k /vmunix /dev/mem
(dbx)
print ufs_clusterstats
struct { full_cluster_transfers = 3130 part_cluster_transfers = 9786 non_cluster_transfers = 16833 sum_cluster_transfers = { [0] 0 [1] 24644 [2] 1128 [3] 463 [4] 202 [5] 55 [6] 117 [7] 36 [8] 123 [9] 0 } } (dbx)
The preceding example shows 24644 single-block transfers and no 9-block transfers. A single block is 8 KB. The trend of the data shown in the example is the reverse of what you want to see. It shows a large number of single-block transfers and a declining number of multiblock (1-9) transfers. However, if the files are all small, this may be the best blocking that you can achieve.
You can examine the cluster reads and writes separately with the
ufs_clusterstats_read
and
ufs_clusterstats_write
data structures.
See
Section 9.4.3
for information on tuning UFS.
9.4.2.3 Checking the Metadata Buffer Cache by Using the dbx Debugger
The
metadata buffer cache contains UFS file metadata--superblocks, inodes,
indirect blocks, directory blocks, and cylinder group summaries.
To check
the metadata buffer cache, use the
dbx print
command to
examine the
bio_stats
data structure.
Consider the following example:
#
/usr/ucb/dbx -k /vmunix /dev/mem
(dbx)
print bio_stats
struct { getblk_hits = 4590388 getblk_misses = 17569 getblk_research = 0 getblk_dupbuf = 0 getnewbuf_calls = 17590 getnewbuf_buflocked = 0 vflushbuf_lockskips = 0 mntflushbuf_misses = 0 mntinvalbuf_misses = 0 vinvalbuf_misses = 0 allocbuf_buflocked = 0 ufssync_misses = 0 } (dbx)
If the miss rate is high,
you may want to raise the value of the
bufcache
attribute.
The number of block misses (getblk_misses
) divided by the
sum of block misses and block hits (getblk_hits
) should
not be more than 3 percent.
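For example, in the output shown above, the miss rate is 17569 divided by the sum of 17569 and 4590388, or approximately 0.4 percent, so this cache does not need to be enlarged.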
See
Section 9.4.3.1
for information on how to tune the
metadata buffer cache.
9.4.3 Tuning UFS
After you configure your UFS file systems, you may be able to improve UFS performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.
Table 9-8
describes UFS tuning guidelines and performance
benefits as well as tradeoffs.
The guidelines described in
Table 9-1
also apply to UFS configurations.
Table 9-8: UFS Tuning Guidelines
Guideline | Performance Benefit | Tradeoff |
Increase the size of metadata buffer cache to more than 3 percent of main memory (Section 9.4.3.1) | Increases cache hit rate and improves UFS performance | Requires additional memory resources |
Increase the size of the metadata hash chain table (Section 9.4.3.2) | Improves UFS lookup speed | Increases wired memory |
Increase the smooth sync caching threshold for asynchronous UFS I/O requests (Section 9.4.3.3) | Improves performance of UFS asynchronous I/O | Increases the chance that data may be lost if a system crash occurs |
Delay flushing UFS clusters to disk (Section 9.4.3.4) | Frees CPU cycles and reduces number of I/O operations | May degrade real-time workload performance when buffers are flushed |
Increase number of blocks combined for read ahead (Section 9.4.3.5) | May reduce disk I/O operations | May require more memory to buffer data |
Increase number of blocks combined for a cluster (Section 9.4.3.6) | May decrease disk I/O operations | Reduces available disk space |
Defragment the file system (Section 9.4.3.7) | Improves read and write performance | Requires down time |
The following sections describe how to tune UFS in detail.
9.4.3.1 Increasing the Size of the Metadata Buffer Cache
At boot time,
the kernel wires a percentage of physical memory for the metadata buffer cache,
which temporarily holds recently accessed UFS and CD-ROM File System (CDFS)
metadata.
The
vfs
subsystem attribute
bufcache
specifies the size of the metadata buffer cache as a percentage
of physical memory.
See
Section 6.1.2.1
for information about
how memory is allocated to the metadata buffer cache.
Performance Benefit and Tradeoff
Allocating additional memory to the metadata buffer cache may improve UFS performance if you reuse files, but it will reduce the amount of memory available to processes and the UBC.
You cannot modify the
bufcache
attribute without
rebooting the system.
When to Tune
Usually, you do not have to increase the size of the metadata buffer cache.
However, you may want to increase the size of the cache if you reuse
data and have a high cache miss rate (low hit rate).
To determine whether to increase the size of the metadata buffer
cache, use the
dbx print
command to examine the
bio_stats
data structure.
If the miss rate (block misses divided
by the sum of the block misses and block hits) is more than 3 percent, you
may want to increase the cache size.
See
Section 9.4.2.3
for more information.
Recommended Values
The default value of the
bufcache
attribute is 3
percent.
If you have a general-purpose timesharing system, do not increase the
value of the
bufcache
attribute to more than 10 percent.
If you have an NFS server that does not perform timesharing, do not increase
the value of the
bufcache
attribute to more than 35 percent.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.4.3.2 Increasing the Size of the Metadata Hash Chain Table
The
hash chain table for the metadata buffer cache stores the heads of the hashed
buffer queues.
The
vfs
subsystem attribute
buffer_hash_size
specifies the size of the hash chain table, in table entries, for
the metadata buffer cache.
Performance Benefit and Tradeoff
Increasing the size of the hash chain table distributes the buffers, which makes the average chain lengths short. This can improve lookup speeds. However, increasing the size of the hash chain table increases wired memory.
You cannot modify the
buffer_hash_size
attribute
without rebooting the system.
When to Tune
Usually, you do not have to modify the size of the hash chain table.
Recommended Values
The minimum size of the
buffer_hash_size
attribute
is 16; the maximum size is 524287.
The default value is 512.
You can modify the value of the
buffer_hash_size
attribute so that each hash chain has 3 or 4 buffers.
To determine a value
for the
buffer_hash_size
attribute, use the
dbx
print
command to examine the value of the
nbuf
kernel variable, then divide the value by 3 or 4, and finally round the result
to a power of 2.
For example, if
nbuf
has a value of 360,
dividing 360 by 3 gives you a value of 120.
Based on this calculation, specify
128 (2 to the power of 7) as the value of the
buffer_hash_size
attribute.
See
Section 3.6
for information about modifying
kernel attributes.
9.4.3.3 Increasing the UFS Smooth Sync Cache Timeout Value
Smooth sync functionality improves
UFS I/O performance by preventing I/O spikes caused by the
update
daemon, and by increasing the UBC hit rate, which decreases the
total number of disk operations.
Smooth sync also helps to efficiently distribute
I/O requests over the sync interval, which decreases the length of the disk
queue and reduces the latency that results from waiting for a busy page to
be freed.
By default, smooth sync is enabled on your system.
UFS caches asynchronous I/O requests in the dirty-block queue and in
the UBC object dirty-page list queue before they are handed to the device
driver.
With smooth sync enabled (the default), the
update
daemon
will not flush buffers from the dirty page lists and dirty wired page lists.
Instead, a buffer moves to the device queue only after the amount of time specified by the value of the vfs attribute smoothsync_age (the default is 30 seconds).
If smooth sync is disabled, every 30 seconds the
update
daemon flushes data from memory to disk, regardless of how long a buffer has
been cached.
Smooth sync functionality is controlled by the
smoothsync_age
attribute.
However, you do not specify a value for
smoothsync_age
in the
/etc/sysconfigtab
file.
Instead, the
/etc/inittab
file is used to enable smooth sync when the system
boots to multiuser mode and to disable smooth sync when the system goes from
multiuser mode to single-user mode.
This procedure is necessary to reflect
the behavior of the
update
daemon, which operates only
in multiuser mode.
To enable smooth sync, the following lines must be included in the
/etc/inittab
file and the time limit for caching buffers in the
smooth sync queue must be specified (default is 30 seconds):
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1 smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1
Performance Benefit and Tradeoff
Increasing the amount of time that an asynchronous I/O request ages
before being placed on the device queue (increasing the value of the
smoothsync_age
attribute) will increase the chance that a buffer
cache hit will occur, which improves UFS performance if the data is reused.
However, this increases the chance that data may be lost if a system crash
occurs.
Decreasing the value of the
smoothsync_age
attribute
will speed the flushing of buffers.
When to Tune
Usually, you do not have to modify the smooth sync queue timeout limit.
Recommended Values
Thirty seconds is the default smooth sync queue timeout limit.
If you
increase the value of the
smoothsync_age
attribute in the
/etc/inittab
file, you will increase the chance that a buffer cache
hit will occur.
To disable smooth sync, specify a value of 0 (zero) for the
smoothsync_age
attribute.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
9.4.3.4 Delaying UFS Cluster Flushing
By default, clusters of UFS pages are written asynchronously as soon as they are complete.
Enabling the delay_wbuffers kernel variable causes these clusters to be written at sync time (delayed), when other dirty data and metadata pages are written.
However, if the percentage of UBC dirty pages reaches the value of the
delay_wbuffers_percent
kernel variable, the clusters will be written
asynchronously, regardless of the setting of the
delay_wbuffers
kernel variable.
Performance Benefit and Tradeoff
Delaying full write buffer flushing can free CPU cycles. However, it may adversely affect real-time workload performance, because the system will experience a heavy I/O load at sync time.
You can modify the
delay_wbuffers
kernel variable
without rebooting the system.
When to Tune
Delay cluster flushing if your applications frequently write to previously written pages. This can result in a net decrease in the total number of I/O requests.
Recommended Values
To delay cluster flushing, use the
dbx patch
command
to set the value of the
delay_wbuffers
kernel variable
to 1 (enabled).
The default value of
delay_wbuffers
is
0 (disabled).
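For example, the following dbx session enables delayed cluster flushing on the running kernel; this is a sketch, and the change does not persist across reboots:
#
/usr/ucb/dbx -k /vmunix /dev/mem
(dbx)
patch delay_wbuffers = 1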
See
Section 3.6.7
for information on using
dbx
.
9.4.3.5 Increasing the Number of Blocks Combined for Read-Ahead
You can increase the number of blocks that are combined for a read-ahead operation.
Performance Benefit and Tradeoff
Increase the number of blocks combined for read-ahead if your applications can use a large read-ahead size.
When to Tune
Usually, you do not have to increase the number of blocks combined for read-ahead.
Recommended Values
To increase the number of blocks combined for read-ahead, use the
dbx patch
command to set the value of the
cluster_consec_init
kernel variable equal to the value of the
cluster_max_read_ahead
kernel variable (the default is 8), which specifies the maximum
number of read-ahead clusters that the kernel can schedule.
In
addition, you must make sure that cluster read operations are enabled on nonread-ahead
and read-ahead blocks.
To do this, use
dbx
to set the value
of the
cluster_read_all
kernel variable to 1, which is
the default value.
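For example, the following
dbx
sketch applies both settings; the printed values assume the defaults described above, and if
cluster_read_all
is not 1 on your system, patch it to 1 as well:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print cluster_max_read_ahead
8
(dbx) patch cluster_consec_init = 8
(dbx) print cluster_read_all
1
(dbx) quit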
See
Section 3.6.7
for information on using
dbx
.
9.4.3.6 Increasing the Number of Blocks Combined for a Cluster
You
can increase the number of blocks combined for a cluster.
The
cluster_maxcontig
kernel variable specifies the number of blocks that are combined
into a single I/O operation.
Contiguous writes are done in a unit size that
is determined by the file system block size (8 KB) multiplied by the value
of the
cluster_maxcontig
parameter.
Performance Benefit and Tradeoff
Increase the number of blocks combined for a cluster if your applications can use a large cluster size.
When to Tune
Usually, you do not have to increase the number of blocks combined for a cluster.
Recommended Values
The default value of
cluster_maxcontig
kernel variable
is 8.
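At the default setting, contiguous writes are therefore issued in 64 KB units (the 8 KB file system block size multiplied by 8).
The following
dbx
sketch doubles the cluster size; the value 16 is only an illustrative choice:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print cluster_maxcontig
8
(dbx) patch cluster_maxcontig = 16
(dbx) quit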
See
Section 3.6.7
for information about using
dbx
.
9.4.3.7 Defragmenting a File System
When a file consists of noncontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance, because it requires more I/O operations to access the file.
Performance Benefit and Tradeoff
Defragmenting a UFS file system improves file system performance. However, it is a time-consuming process.
When to Perform This Task
You can determine whether the files in a file system are fragmented
by determining how effectively the system is clustering.
You can do this by
using the
dbx print
command to examine the
ufs_clusterstats
data structure.
See
Section 9.4.2.2
for information.
UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not effective, the files in the file system may be very fragmented.
Recommended Procedure
To defragment a UFS file system, follow these steps (a command sketch follows the list):
Back up the file system onto tape or another partition.
Create a new file system either on the same partition or a different partition.
Restore the file system.
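The following sketch shows one way to perform these steps with the
dump
,
newfs
, and
restore
commands.
The disk and tape device names and the mount point are placeholders, and the exact option syntax may vary, so substitute the names that apply to your configuration and see
dump
(8) and
restore
(8):
# umount /projects
# dump 0uf /dev/nrmt0h /dev/rrz3c
# newfs /dev/rrz3c
# mount /dev/rz3c /projects
# cd /projects
# restore rf /dev/nrmt0h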
See the
System Administration
manual for information about backing up and
restoring data and creating UFS file systems.
9.5 Managing NFS Performance
The Network File System (NFS) shares the Unified Buffer Cache (UBC) with the virtual memory subsystem and local file systems. NFS can put an extreme load on the network. Poor NFS performance is almost always a problem with the network infrastructure. Look for high counts of retransmitted messages on the NFS clients, network I/O errors, and routers that cannot maintain the load.
Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.
You can monitor NFS by using the
nfsstat
and other
commands.
When evaluating NFS performance, remember that NFS does not perform
well if any file-locking mechanisms are in use on an NFS file.
The locks prevent
the file from being cached on the client.
See
nfsstat
(8)
for more information.
The following sections describe how to perform the following tasks:
Gather NFS performance information (Section 9.5.1)
Improve NFS performance (Section 9.5.2)
9.5.1 Gathering NFS Information
Table 9-9
describes
the commands you can use to obtain information about NFS operations.
Table 9-9: NFS Monitoring Tools
Name | Use | Description |
nfsstat | Displays network and NFS statistics (Section 9.5.1.1) | Displays NFS and RPC statistics for clients and servers, including the number of packets that had to be retransmitted (retrans). |
nfswatch | Monitors incoming NFS server traffic | Monitors all incoming network traffic to an NFS server and divides it into several categories, including NFS reads and writes, NIS requests, and RPC authorizations. Your kernel must be configured with the packetfilter option. |
ps | Displays information about idle threads (Section 9.5.1.2) | Displays information about idle threads on a client system. |
dbx | Displays active NFS server threads (Section 3.6.7) | Displays a histogram of the number of active NFS server threads. |
dbx | Displays the namei cache hit rate (Section 9.1.2) | Displays the namei cache hit rate. |
dbx | Displays metadata buffer cache information (Section 9.4.2.3) | Reports statistics on the metadata buffer cache hit rate. |
dbx | Reports UBC statistics (Section 6.3.4) | Reports the UBC hit rate. |
The following sections describe how to use some of these tools.
9.5.1.1 Displaying NFS Information by Using the nfsstat Command
The
nfsstat
command displays statistical information about NFS and Remote
Procedure Call (RPC) interfaces in the kernel.
You can also use this command
to reinitialize the statistics.
An example of the
nfsstat
command is as follows:
# /usr/ucb/nfsstat
Server rpc:
calls      badcalls   nullrecv   badlen     xdrcall
38903      0          0          0          0

Server nfs:
calls      badcalls
38903      0

Server nfs V2:
null       getattr    setattr    root       lookup     readlink   read
5 0%       3345 8%    61 0%      0 0%       5902 15%   250 0%     1497 3%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 3%    549 1%     1049 2%    352 0%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     172 0%     689 1%     1751 4%

Server nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       1333 3%    1019 2%    5196 13%   238 0%     400 1%     2816 7%
write      create     mkdir      symlink    mknod      remove     rmdir
2560 6%    752 1%     140 0%     400 1%     0 0%       1352 3%    140 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
200 0%     200 0%     936 2%     0 0%       3504 9%    3 0%       0 0%
commit
21 0%

Client rpc:
calls      badcalls   retrans    badxid     timeout    wait       newcred
27989      1          0          0          1          0          0
badverfs   timers
0          4

Client nfs:
calls      badcalls   nclget     nclsleep
27988      0          27988      0

Client nfs V2:
null       getattr    setattr    root       lookup     readlink   read
0 0%       3414 12%   61 0%      0 0%       5973 21%   257 0%     1503 5%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 5%    549 1%     1049 3%    352 1%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     171 0%     713 2%     1756 6%

Client nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       666 2%     9 0%       2598 9%    137 0%     200 0%     1408 5%
write      create     mkdir      symlink    mknod      remove     rmdir
1280 4%    376 1%     70 0%      200 0%     0 0%       676 2%     70 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
100 0%     100 0%     468 1%     0 0%       1750 6%    1 0%       0 0%
commit
10 0%
#
The ratio of timeouts to calls (which should not exceed 1 percent) is the most important thing to look for in the NFS statistics. A timeout-to-call ratio greater than 1 percent can have a significant negative impact on performance. See Chapter 10 for information on how to tune your system to avoid timeouts.
Use the
nfsstat -s -i 10
command to display NFS and
RPC information at ten-second intervals.
If you are attempting to monitor an experimental situation with
nfsstat
, reset the NFS counters to 0 before you begin the experiment.
Use the
nfsstat -z
command to clear the counters.
See
nfsstat
(8)
for more information about command options and
output.
9.5.1.2 Displaying Idle Thread Information by Using the ps Command
On a client system, the
nfsiod
daemon spawns
several I/O threads to service asynchronous I/O requests to the server.
The
I/O threads improve the performance of both NFS reads and writes.
The optimum
number of I/O threads depends on many variables, such as how quickly the client
will be writing, how many files will be accessed simultaneously, and the characteristics
of the NFS server.
For most clients, seven threads are sufficient.
The following example uses the
ps axlmp
command to
display idle I/O threads on a client system:
# /usr/ucb/ps axlmp 0 | grep nfs
 0   42  0  nfsiod_  S     0:00.52
 0   42  0  nfsiod_  S     0:01.18
 0   42  0  nfsiod_  S     0:00.36
 0   44  0  nfsiod_  S     0:00.87
 0   42  0  nfsiod_  S     0:00.52
 0   42  0  nfsiod_  S     0:00.45
 0   42  0  nfsiod_  S     0:00.74
#
The previous output shows a sufficient number of sleeping threads.
Output for the server threads that were started by
nfsd
is similar, except that
nfsiod_
is replaced by
nfs_tcp
or
nfs_udp
.
If your output shows that few threads are sleeping, you may be able
to improve NFS performance by increasing the number of threads.
See
Section 9.5.2.2,
Section 9.5.2.3,
nfsiod
(8), and
nfsd
(8)
for more information.
9.5.2 Improving NFS Performance
Improving performance on a system that is used only for serving NFS differs from tuning a system that is used for general timesharing, because an NFS server runs only a few small user-level programs, which consume few system resources. There is minimal paging and swapping activity, so memory resources should be focused on caching file system data.
File system tuning is important for NFS because processing NFS requests consumes the majority of CPU and wall clock time. Ideally, the UBC hit rate should be high. Increasing the UBC hit rate can require additional memory or a reduction in the size of other file system caches. In general, file system tuning will improve the performance of I/O-intensive user applications.
In addition, a vnode must exist to keep file data in the UBC. If you are using AdvFS, an access structure is also required to keep file data in the UBC.
If you are running NFS over TCP, tuning TCP may improve performance if there are many active clients. See Section 10.2 for more information. However, if you are running NFS over UDP, no network tuning is needed.
Table 9-10
lists NFS tuning and performance-improvement
guidelines and the benefits as well as tradeoffs.
Table 9-10: NFS Performance Guidelines
Guideline | Performance Benefit | Tradeoff |
Set the value of the
maxusers
attribute to the number of server NFS operations that are expected to occur
each second (Section 5.1) |
Provides the appropriate level of system resources | Consumes memory |
Increase the size of the namei cache (Section 9.2.1) | Improves file system performance | Consumes memory |
Increase the number of AdvFS access structures, if you are using AdvFS (Section 9.3.6.3) | Improves AdvFS performance | Consumes memory |
Increase the size of the metadata buffer cache, if you are using UFS (Section 9.4.3.1) | Improves UFS performance | Consumes wired memory |
Use Prestoserve (Section 9.5.2.1) | Improves synchronous write performance for NFS servers | Cost |
Configure the appropriate number of threads on an NFS server (Section 9.5.2.2) | Enables efficient I/O blocking operations | None |
Configure the appropriate number of threads on the client system (Section 9.5.2.3) | Enables efficient I/O blocking operations | None |
Modify cache timeout limits on the client system (Section 9.5.2.4) | May improve network performance for read-only file systems and enable clients to quickly detect changes | Increases network traffic to server |
Decrease network timeouts on the client system (Section 9.5.2.5) | May improve performance for slow or congested networks | Reduces the theoretical maximum performance |
Use NFS Protocol Version 3 on the client system (Section 9.5.2.6) | Improves network performance | Decreases the performance benefit of Prestoserve |
The following sections describe some of these guidelines.
9.5.2.1 Using Prestoserve to Improve NFS Server Performance
You can improve NFS performance by installing Prestoserve on the server. Prestoserve greatly improves synchronous write performance for servers that are using NFS Version 2. Prestoserve enables an NFS Version 2 server to write client data to a nonvolatile (battery-backed) cache, instead of writing the data to disk.
Prestoserve may improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2, because NFS Version 3 servers can reliably write data to volatile storage without risking loss of data in the event of failure. NFS Version 3 clients can detect server failures and resend any write data that the server may have lost in volatile storage.
See the
Guide to Prestoserve
for more information.
9.5.2.2 Configuring Server Threads
The
nfsd
daemon runs on NFS servers to service NFS requests from client
machines.
The daemon spawns a number of server threads that process NFS requests
from client machines.
At least one server thread must be running for a machine
to operate as a server.
The number of threads determines the number of parallel
operations and must be a multiple of 8.
To improve performance on frequently used NFS servers, configure either
16 or 32 threads, which provides the most efficient blocking for I/O operations.
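As a hedged sketch, one way to do this on many systems is to change the thread count that the NFS startup script passes to
nfsd
and then restart NFS.
The
NUM_NFSD
rc.config variable name used here is an assumption, and the displayed value is illustrative, so confirm the correct variable and daemon options in
nfsd
(8) and your
/etc/rc.config
file:
# /usr/sbin/rcmgr get NUM_NFSD
8
# /usr/sbin/rcmgr set NUM_NFSD 16
# /sbin/init.d/nfs stop
# /sbin/init.d/nfs start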
See
nfsd
(8)
for more information.
9.5.2.3 Configuring Client Threads
Client systems
use the
nfsiod
daemon to service asynchronous I/O operations,
such as buffer cache read-ahead and delayed write operations.
The
nfsiod
daemon spawns several I/O threads to service asynchronous
I/O requests to its server.
The I/O threads improve performance of both NFS
reads and writes.
The optimal number of I/O threads to run depends on many variables, such as how quickly the client is writing data, how many files will be accessed simultaneously, and the behavior of the NFS server. The number of threads must be a multiple of 8 minus 1 (for example, 7 or 15 is optimal).
NFS servers attempt to gather writes into complete UFS clusters
before initiating I/O, and the number of threads (plus 1) is the number of
writes that a client can have outstanding at any one time.
Having exactly
7 or 15 threads produces the most efficient blocking for I/O operations.
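For example, on many clients you can start seven I/O threads directly, or record the count so that it is used at boot time.
The
NUM_NFSIOD
variable name is an assumption, so verify it and the daemon syntax in
nfsiod
(8) and your
/etc/rc.config
file:
# /usr/sbin/nfsiod 7
# /usr/sbin/rcmgr set NUM_NFSIOD 7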
If
write gathering is enabled, and the client does not have any threads, you
may experience a performance degradation.
To disable write gathering, use
the
dbx patch
command to set the
nfs_write_gather
kernel variable to zero.
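A minimal
dbx
sketch of disabling write gathering follows:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) patch nfs_write_gather = 0
(dbx) quit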
See
Section 3.6.7
for information.
Use the
ps axlmp 0 | grep nfs
command to display
idle I/O threads on the client.
If few threads are sleeping, you may be able
to improve NFS performance by increasing the number of threads.
See
nfsiod
(8)
for more information.
9.5.2.4 Modifying Cache Timeout Limits
For read-only file systems and slow network links, performance may be improved by changing the cache timeout limits on NFS client systems. These timeouts affect how quickly you see updates to a file or directory that has been modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will slightly improve performance and will reduce the amount of network traffic that you generate.
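For example, the following hedged sketch mounts a read-only file system with longer attribute cache timeouts.
The server name, paths, and timeout values (in seconds) are placeholders:
# mount -t nfs -o ro,acregmin=60,acregmax=600,acdirmin=60,acdirmax=600 \
  docserver:/usr/share/doclib /doclib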
See
mount
(8)
and the descriptions of the
acregmin
,
acregmax
,
acdirmin
,
acdirmax
, and
actimeo
options for more information.
9.5.2.5 Decreasing Network Timeouts
NFS does not perform well if it is used over slow network links,
congested networks, or wide area networks (WANs).
In particular, network timeouts
on client systems can severely degrade NFS performance.
This condition can
be identified by using the
nfsstat
command and determining
the ratio of timeouts to calls.
If timeouts are more than 1 percent of the
total calls, NFS performance may be severely degraded.
See
Section 9.5.1.1
for sample
nfsstat
output of timeout and call statistics.
You can also use the
netstat -s
command to verify
the existence of a timeout problem.
A nonzero value in the
fragments
dropped after timeout
field in the
ip
section
of the
netstat
output may indicate that the problem exists.
See
Section 10.1.1
for sample
netstat
command output.
If fragment drops are a problem on a client system, use the
mount
command with the
-rsize=1024
and
-wsize=1024
options to set the size of the NFS read and write buffers
to 1 KB.
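For example, the following hedged sketch mounts a file system with 1 KB read and write buffers; the server name and paths are placeholders.
These are the same read and write buffer size options described above, passed through the
-o
option list; see
mount
(8)
for the exact syntax on your system:
# mount -t nfs -o rsize=1024,wsize=1024 nfsserver:/export/data /data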
9.5.2.6 Using NFS Protocol Version 3
NFS Protocol Version 3 provides NFS client-side asynchronous write support, which improves the cache consistency protocol and requires less network load than Version 2. These performance improvements slightly decrease the performance benefit that Prestoserve provided for NFS Version 2. However, with Protocol Version 3, Prestoserve still speeds file creation and deletion.