To tune for better file system performance, you must understand how your applications and users perform disk I/O, as described in Section 2.1, and how the file system you are using shares memory with processes, as described in Chapter 6. Using this information, you might improve file system performance by changing the values of the kernel subsystem attributes described in this chapter.
This chapter describes how to tune:
Caches used by file systems (Section 9.1)
The Advanced File System (AdvFS) (Section 9.2)
The UNIX File System (UFS) (Section 9.3)
The Network File System (NFS) (Section 9.4)
9.1 Tuning Caches Used by File Systems
The kernel caches (temporarily stores) recently accessed data in memory. Caching data is effective because data is frequently reused and it is much faster to retrieve data from memory than from disk. When the kernel requires data, it checks whether the data is cached. If it is, the data is returned immediately; if not, the data is retrieved from disk and then cached. File system performance is improved if data is cached and later reused.
Data found in a cache is called a cache hit, and the effectiveness of cached data is measured by a cache hit rate. Data that was not found in a cache is called a cache miss.
Cached data can be information about a file, user or application data, or metadata, which is data that describes an object (for example, a file). The following list identifies the types of data that are cached:
A file name and its corresponding vnode is cached in the namei cache (Section 9.1.1).
UFS user and application data and AdvFS user and application data and metadata are cached in the Unified Buffer Cache (UBC) (Section 9.1.2).
UFS file metadata is cached in the metadata buffer cache (Section 9.1.3).
AdvFS open file information is cached in access structures (Section 9.1.4).
9.1.1 Tuning the namei Cache
The Virtual File System (VFS) presents to applications a uniform kernel interface that is abstracted from the subordinate file system layer. As a result, file access across different types of file systems is transparent to the user.
The VFS uses a structure called a
vnode
to store
information about each open file in a mounted file system.
If an application
makes a read or write request on a file, VFS uses the vnode information to
convert the request and direct it to the appropriate file system.
For example,
if an application makes a
read()
system call request on
a file, VFS uses the vnode information to convert the system call to the appropriate
type for the file system containing the file:
ufs_read()
for UFS,
advfs_read()
for AdvFS, or
nfs_read()
if the file is in a file system mounted through NFS -- and then
directs the request to the appropriate file system.
The VFS caches a recently accessed file name and its corresponding vnode in the namei cache. File system performance is improved if a file is reused and its name and corresponding vnode are in the namei cache.
Related Attributes
The following list describes the
vfs
subsystem attributes
that relate to the namei cache:
The
vnode_deallocation_enable
attribute --
Specifies whether or not to dynamically allocate vnodes according to system
demands.
Value: 0 or 1
Default value: 1 (enabled)
Disabling causes the operating system to use a static vnode pool. For the best performance, do not disable dynamic vnode allocation.
The
name_cache_hash_size
attribute --
Specifies the size, in slots, of the hash chain table for the namei cache.
Default value: 2 * (148 + 10 * maxusers) * 11 / 10 / 15
The
vnode_age
attribute -- Specifies
the amount of time, in seconds, before a free vnode can be recycled.
Value: 0 to 2,147,483,647
Default value: 120 seconds
The
namei_cache_valid_time
attribute --
Specifies the amount of time, in seconds, that a namei cache entry can remain
in the cache before it is discarded.
Value: 0 to 2,147,483,647
Default value: 1200 (seconds) for 32-MB or larger systems; 30 (seconds) for 24-MB systems
Increasing keeps vnodes in the namei cache longer, but increases the amount of memory that the namei cache uses.
Decreasing accelerates the deallocation of vnodes from the namei cache, which reduces its efficiency.
Note
If you increase the values of namei cache related attributes, consider also increasing the file system attributes that cache file and directory information. If you use AdvFS, see Section 9.1.4 for more information. If you use UFS, see Section 9.1.3 for more information.
When to Tune
You can check namei cache statistics to see if you should change the
values of namei cache related attributes.
To check namei cache statistics,
enter the
dbx print
command and specify a processor number
to examine the
nchstats
data structure, for example:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print processor_ptr[0].nchstats
Information similar to the following is displayed:
struct {
    ncs_goodhits = 18984
    ncs_neghits = 358
    ncs_badhits = 113
    ncs_falsehits = 23
    ncs_miss = 699
    ncs_long = 21
    ncs_badtimehits = 33
    ncs_collisions = 2
    ncs_unequaldups = 0
    ncs_newentry = 697
    ncs_newnegentry = 419
    ncs_gnn_hit = 1653
    ncs_gnn_miss = 12
    ncs_gnn_badhits = 12
    ncs_gnn_collision = 4
    ncs_pad = {
        [0] 0
    }
}
The following table describes when you might change the values of namei
cache related attributes based on the
dbx print
output:
| If | Increase |
| The value of ... | The value of either the maxusers attribute or the name_cache_hash_size attribute |
| The value of ncs_badtimehits is more than 0.1 percent of the value of ncs_goodhits | The value of the namei_cache_valid_time attribute and the vnode_age attribute |
You cannot modify the values of the
name_cache_hash_size
attribute, the
namei_cache_valid_time
attribute, or the
vnode_deallocation_enable
attribute without rebooting the system.
You can modify the value of the
vnode_age
attribute without
rebooting the system.
See
Section 3.6
for information
about modifying subsystem attributes.
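For example, a minimal sketch of checking the current namei cache settings and then raising the runtime-tunable vnode_age attribute (the value 240 is illustrative, not a recommendation):
# sysconfig -q vfs vnode_age namei_cache_valid_time name_cache_hash_size
# sysconfig -r vfs vnode_age=240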
9.1.2 Tuning the UBC
The Unified Buffer Cache (UBC) shares with processes the memory that is not wired by the kernel, and uses that memory to cache UFS user and application data and AdvFS user and application data and metadata. File system performance is improved if the data and metadata are reused and found in the UBC.
Related Attributes
The following list describes the
vm
subsystem attributes
that relate to the UBC:
The
vm_ubcdirtypercent
attribute --
Specifies the percentage of pages that must be dirty (modified) before the
UBC starts writing them to disk.
Value: 0 to 100
Default value: 10 percent
The
ubc_maxdirtywrites
attribute --
Specifies the number of I/O operations (per second) that the
vm
subsystem performs when the number of dirty (modified) pages in
the UBC exceeds the value of the
vm_ubcdirtypercent
attribute.
Value: 0 to 2,147,483,647
Default value: 5 (operations per second)
The
ubc_maxpercent
attribute -- Specifies
the maximum percentage of physical memory that the UBC can use at one time.
Value: 0 to 100
Default value: 100 percent
The
ubc_borrowpercent
attribute --
Specifies the percentage of memory above which the UBC is only borrowing
memory from the
vm
subsystem.
Paging does not occur until
the UBC has returned all its borrowed pages.
Value: 0 to 100
Default value: 20 percent
Increasing might degrade system response time when a low-memory condition occurs (for example, a large process working set).
The
ubc_minpercent
attribute -- Specifies
the minimum percentage of memory that the UBC can use.
The remaining memory
is shared with processes.
Value: 0 to 100
Default value: 10 percent
Increasing prevents large programs from completely consuming the memory that the UBC can use.
For I/O servers, consider increasing the value to ensure that enough memory is available for the UBC.
The
vm_ubcpagesteal
attribute --
Specifies the minimum number of pages to be available for file expansion.
When the number of available pages falls below this number, the UBC steals
additional pages to anticipate the file's expansion demands.
Value: 0 to 2,147,483,647
Default value: 24 (file pages)
The
vm_ubcseqpercent
attribute --
Specifies the maximum amount of memory allocated to the UBC that can be used
to cache a single file.
Value: 0 to 100
Default value: 10 percent of memory allocated to the UBC
Consider increasing the value if applications write large files.
The
vm_ubcseqstartpercent
attribute --
Specifies a threshold value that determines when the UBC starts to recognize
sequential file access and steal the UBC LRU pages for a file to satisfy its
demand for pages.
This value is the size of the UBC in terms of its percentage
of physical memory.
Value: 0 to 100
Default value: 50 percent
Consider increasing the value if applications write large files.
Note
If the values of the ubc_maxpercent and ubc_minpercent attributes are close, you may degrade file system performance.
When to Tune
An insufficient amount of memory allocated to the UBC can impair file
system performance.
Because the UBC and processes share memory, changing the
values of UBC related attributes might cause the system to page.
You can use
the
vmstat
command to display virtual memory statistics
that will help you to determine if you need to change values of UBC related
attributes.
The following table describes when you might change the values
of UBC related attributes based on the
vmstat
output:
| If vmstat Output Displays Excessive: | Action: |
| Paging but few or no page outs | Increase the value of the ... attribute. |
| Paging and swapping | Decrease the value of the ubc_maxpercent attribute. |
| Paging | Force the system to reuse pages in the UBC instead of taking pages from the free list by ensuring that the value of the ubc_maxpercent attribute is greater than the value of the vm_ubcseqstartpercent attribute (which it is by default) and that the value of the vm_ubcseqpercent attribute allows more UBC memory than the size of a referenced file. |
| Page outs | Increase the value of the ubc_minpercent attribute. |
See
Section 6.3.1
for information on the
vmstat
command.
See
Section 6.1.2.2
for information about
UBC memory allocation.
You can modify the value of any of the UBC parameters described in this section without rebooting the system. See Section 3.6 for information about modifying subsystem attributes.
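For example, a hedged sketch of checking paging activity and the current UBC settings, and then lowering ubc_maxpercent at run time (the value 80 is illustrative):
# vmstat 5
# sysconfig -q vm ubc_minpercent ubc_maxpercent ubc_borrowpercent
# sysconfig -r vm ubc_maxpercent=80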
Note
The performance of an application that generates a lot of random I/O is not improved by a large UBC, because the next access location for random I/O cannot be predetermined.
9.1.3 Tuning the Metadata Buffer Cache
At boot time, the kernel wires a percentage of memory for the metadata buffer cache. UFS file metadata, such as superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries, is cached in the metadata buffer cache. File system performance is improved if the metadata is reused and found in the metadata buffer cache.
Related Attributes
The following list describes the
vfs
subsystem attributes
that relate to the metadata buffer cache:
The
bufcache
attribute -- Specifies
the size, as a percentage of memory, that the kernel wires for the metadata
buffer cache.
Value: 0 to 50
Default value: 3 percent for 32-MB or larger systems and 2 percent for 24-MB systems
The
buffer_hash_size
attribute --
Specifies the size, in slots, of the hash chain table for the metadata buffer
cache.
Value: 0 to 524,287
Default value: 2048 (slots)
Increasing distributes the buffers to make the average chain lengths shorter, which improves UFS performance, but will reduce the amount of memory available to processes and the UBC.
You cannot modify the values of the
buffer_hash_size
attribute or the
bufcache
attribute without rebooting the
system.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
When to Tune
Consider increasing the size of the
bufcache
attribute
if you have a high cache miss rate (low hit rate).
To determine if you have a high cache miss rate, use the
dbx
print
command to display the
bio_stats
data structure.
If the miss rate (block misses divided by the sum of the block misses and
block hits) is more than 3 percent, consider increasing the value of the
bufcache
attribute.
See
Section 9.3.2.3
for more
information on displaying the
bio_stats
data structure.
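For example, using the sample bio_stats output shown in Section 9.3.2.3, the miss rate is 17569 / (17569 + 4590388), or roughly 0.4 percent, which is well below the 3 percent threshold and does not call for an increase.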
Note that increasing the value of the
bufcache
attribute
will reduce the amount of memory available to processes and the UBC.
9.1.4 Tuning AdvFS Access Structures
At boot time, the system reserves a portion of the physical memory that is not wired by the kernel for AdvFS access structures. AdvFS caches information about open files and information about files that were opened but are now closed in AdvFS access structures. File system performance is improved if the file information is reused and in an access structure.
AdvFS access structures are dynamically allocated and deallocated according to the kernel configuration and system demands.
Related Attribute
The
AdvfsAccessMaxPercent
attribute specifies, as
a percentage, the maximum amount of pageable memory that can be allocated
for AdvFS access structures.
Value: 5 to 95
Default value: 25 percent
You can modify the value of the
AdvfsAccessMaxPercent
attribute without rebooting the system.
See
Section 3.6
for information about modifying kernel subsystem attributes.
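For example, a sketch of checking and raising the limit at run time, assuming the attribute belongs to the advfs subsystem on your system (verify with sysconfig -q; the value 35 is illustrative):
# sysconfig -q advfs AdvfsAccessMaxPercent
# sysconfig -r advfs AdvfsAccessMaxPercent=35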
When to Tune
If users or applications reuse AdvFS files (for example, a proxy server),
consider increasing the value of the
AdvfsAccessMaxPercent
attribute to allocate more memory for AdvFS access structures.
Note that increasing
the value of the
AdvfsAccessMaxPercent
attribute reduces
the amount of memory available to processes and might cause excessive paging
and swapping.
You can use the
vmstat
command to display
virtual memory statistics that will help you to determine excessive paging
and swapping.
See
Section 6.3.1
for information on the
vmstat
command.
Consider decreasing the amount of memory reserved for AdvFS access structures if:
You do not use AdvFS.
Your workload does not frequently open, close, and reopen the same files.
You have a large-memory system (because the number of open files does not scale with the size of system memory as efficiently as UBC memory usage and process memory usage).
9.2 Tuning AdvFS
This section describes how to tune Advanced File System (AdvFS) queues, provides AdvFS configuration guidelines, and describes commands that you can use to display AdvFS information.
See the
AdvFS Administration
manual for information about AdvFS features
and setting up and managing AdvFS.
9.2.1 Tuning AdvFS Queues
For each AdvFS volume, I/O requests are sent to one of the following queues:
Blocking and flush queue
The blocking and flush queues are queues in which reads and synchronous write requests are cached. A synchronous write request must be written to disk before it is considered complete and the application can continue.
The blocking queue is used primarily for reads and for kernel synchronous
write requests.
The flush queue is used primarily for buffer write requests,
either through
fsync()
,
sync()
, or synchronous
writes.
Because the buffers on the blocking queue are given slightly higher
priority than those on the flush queue, kernel requests are handled more expeditiously
and are not blocked if many buffers are waiting to be written to disk.
Processes that need to read or modify data in a buffer in the blocking or flush queue must wait for the data to be written to disk. This is in direct contrast with buffers on the lazy queues that can be modified at any time until they are finally moved down to the device queue.
Lazy queue
The lazy queue is a logical series of queues in which asynchronous write requests are cached. When an asynchronous I/O request enters the lazy queue, it is assigned a time stamp. This time stamp is used to periodically flush the buffers down toward the disk in numbers large enough to allow them to be consolidated into larger I/Os. Processes can modify data in buffers at any time while they are on the lazy queue, potentially avoiding additional I/Os. Descriptions of the queues in the lazy queue are provided after Figure 9-1.
All three queues (blocking, flush, and lazy) move buffers to the device queue. As buffers are moved onto the device queue, logically contiguous I/Os are consolidated into larger I/O requests. This reduces the actual number of I/Os that must be completed. Buffers on the device queue cannot be modified until their I/O has completed.
The algorithms that move the buffers onto the device queue favor taking buffers from the blocking queue over the flush queue, and both are favored over the lazy queue. The size of the device queue is limited by device and driver resources. The algorithms that load the device queue use feedback from the drivers to know when the device queue is full. At that point the device is saturated and continued movement of buffers to the device queue would only degrade throughput to the device. The potential size of the device queue and how full it is, ultimately determines how long it may take to complete a synchronous I/O operation.
Figure 9-1
shows the movement of synchronous
and asynchronous I/O requests through the AdvFS I/O queues.
Figure 9-1: AdvFS I/O Queues
Detailed descriptions of the AdvFS lazy queues are as follows:
Wait queue -- Asynchronous I/O requests that are waiting for an AdvFS transaction log write to complete first enter the wait queue. Each file domain has a transaction log that tracks fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs.
AdvFS uses write-ahead logging, which requires that when metadata is modified, the transaction log write must complete before the actual metadata is written. This ensures that AdvFS can always use the transaction log to create a consistent view of the file system metadata. After the transaction log is written, I/O requests can move from the wait queue to the smooth sync queue.
Smooth sync queue -- Asynchronous I/O requests remain in the smooth sync queue for at least 30 seconds, by default. Allowing requests to remain in the smooth sync queue for a specified amount of time prevents I/O spikes, increases cache hit rates, and improves the consolidation of requests. After requests have aged in the smooth sync queue, they move to the ready queue.
Ready queue -- Asynchronous I/O requests are sorted in the ready queue. After the queue reaches a specified size, the requests are moved to the consol queue.
Consol queue -- Asynchronous I/O requests are interleaved in the consol queue and moved to the device queue.
Related Attributes
The following list describes the
vfs
subsystem attributes
that relate to AdvFS queues:
The
smoothsync_age
attribute -- Specifies
the amount of time, in seconds, that a modified page ages before becoming
eligible for the smoothsync mechanism to flush it to disk.
Value: 0 to 60
Default value: 30 seconds
Setting to 0 sends data to the ready queue every 30 seconds, regardless of how long the data is cached.
Increasing the value increases the chance of lost data if the system crashes, but can decrease net I/O load (improve performance) by allowing the dirty pages to remain cached longer.
The
smoothsync_age
attribute is enabled when the
system boots to multiuser mode and disabled when the system changes from multiuser
mode to single-user mode.
To change the value of the
smoothsync_age
attribute, edit the following lines in the
/etc/inittab
file:
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1
smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1
You can use the
smsync2
mount option to specify an
alternate smoothsync policy that can further decrease the net I/O load.
The
default policy is to flush modified pages after they have been dirty for the
smoothsync_age
time period, regardless of continued modifications
to the page.
When you mount a UFS using the
smsync2
mount
option, modified pages are not written to disk until they have been dirty
and idle for the
smoothsync_age
time period.
Note that
mmap'ed pages always use this default policy, regardless of the
smsync2
setting.
The
AdvfsSyncMmapPages
attribute --
Specifies whether or not to disable smooth sync for applications that manage
their own
mmap
page flushing.
Value: 0 or 1
Default value: 1 (enabled)
The
AdvfsReadyQLim
attribute -- Specifies
the size of the ready queue.
Value: 0 to 32 K (blocks)
Default value: 16 K (blocks)
You can modify the value of the
AdvfsSyncMmapPages
attribute and the
AdvfsReadyQLim
attribute without rebooting
the system.
See
Section 3.6
for information about modifying
kernel subsystem attributes.
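For example, a hedged sketch of mounting a UFS file system with the smsync2 option described above (the device and mount point are hypothetical):
# mount -o smsync2 /dev/disk/dsk3g /data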
When to Tune
If you reuse data, consider increasing:
The amount of time I/O requests remain in the smooth sync queue, to increase the possibility of a cache hit. However, doing so increases the chance that data might be lost if the system crashes. Use the advfsstat -S command to show cache statistics in the AdvFS smooth sync queue (see the example after this list).
The size of the ready queue to increase the possibility that I/O requests will be consolidated into a single, larger I/O and improve the possibility of a cache hit. However, doing so is not likely to have much influence if smooth sync is enabled and can increase the overhead in sorting the incoming requests onto the ready queue.
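For example, a sketch of watching the smooth sync queue statistics mentioned in the first item (the domain name data_domain is hypothetical, and the -i interval option is assumed to combine with -S):
# advfsstat -S -i 5 data_domain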
9.2.2 AdvFS Configuration Guidelines
The amount of I/O contention on the volumes in a file domain is the most critical factor in fileset performance. Contention is most likely on large, very busy file domains. To help you determine how to set up filesets, first identify:
Frequently accessed data
Infrequently accessed data
Specific types of data (for example, temporary data or database data)
Data with specific access patterns (for example, create, remove, read, or write)
Then, use the previous information and the following guidelines to configure filesets and file domains:
Configure filesets that contain similar types of files in
the same file domain to reduce disk fragmentation and improve performance.
For example, do not place small temporary files, such as the output from
cron
and from news, mail, and Web cache servers, in the same file
domain as a large database file.
For applications that perform many file create or remove operations, configure multiple filesets and distribute files across the filesets. This reduces contention on individual directories, the root tag directory, quota files, and the frag file.
Configure filesets used by applications with different I/O access patterns (for example, create, remove, read, or write patterns) in the same file domain. This might help to balance the I/O load.
To reduce I/O contention in a multi-volume file domain with more than one fileset, configure multiple domains and distribute the filesets across the domains. This enables each volume and domain transaction log to be used by fewer filesets.
A fileset with a very large number of small files can slow the vdump and vrestore commands.
Using multiple
filesets enables the
vdump
command to be run simultaneously
on each fileset, and decreases the amount of time needed to recover filesets
with the
vrestore
command.
Table 9-1
lists additional AdvFS configuration
guidelines and performance benefits and tradeoffs.
See the
AdvFS Administration
manual for more information about AdvFS.
Table 9-1: AdvFS Configuration Guidelines
| Benefit | Guideline | Tradeoff |
| Data loss protection | Use LSM or RAID to store data using RAID 1 (mirror data) or RAID 5 (Section 9.2.2.1) | Requires LSM or RAID |
| Data loss protection | Force synchronous writes or enable atomic write data logging on a file (Section 9.2.2.2) | Might degrade file system performance |
| Improve performance for applications that read or write data only once | Enable direct I/O (Section 9.2.2.3) | Degrades performance of applications that repeatedly access the same data |
| Improve performance | Use AdvFS to distribute files in a file domain (Section 9.2.2.4) | None |
| Improve performance | Stripe data (Section 9.2.2.5) | None if using AdvFS; otherwise requires LSM or RAID |
| Improve performance | Defragment file domains (Section 9.2.2.6) | None |
| Improve performance | Decrease the I/O transfer size (Section 9.2.2.7) | None |
| Improve performance | Move the transaction log to a fast or uncongested disk (Section 9.2.2.8) | Might require an additional disk |
9.2.2.1 Storing Data Using RAID 1 or RAID 5
You can use LSM or hardware RAID to implement a RAID 1 or RAID 5 data storage configuration.
In a RAID 1 configuration, LSM or hardware RAID stores and maintains mirrors (copies) of file domain or transaction log data on different disks. If a disk fails, LSM or hardware RAID uses a mirror to make the data available.
In a RAID 5 configuration, LSM or hardware RAID stores parity information and data. If a disk fails, LSM or hardware RAID uses the parity information and the data on the remaining disks to reconstruct the missing data.
See the
Logical Storage Manager
manual for more information about LSM.
See
your storage hardware documentation for more information about hardware RAID.
9.2.2.2 Forcing a Synchronous Write Request or Enabling Atomic Write Data Logging
AdvFS writes data to disk in 8-KB units.
By default,
AdvFS asynchronous write requests are cached in the UBC, and the
write
system call returns a success value.
The data is written to
disk at a later time (asynchronously).
AdvFS does not guarantee that all or
part of the data will actually be written to disk if a crash occurs during
or immediately after the write.
For example, if the system crashes during
a write that consists of two 8-KB units of data, only a portion (less than
16 KB) of the total write might have succeeded.
This can result in partial
data writes and inconsistent data.
You can configure AdvFS to force the write request for a specified file
to be synchronous to ensure that data is successfully written to disk before
the
write
system call returns a success value.
Enabling atomic write data logging for a specified file writes the data
to the transaction log file before it is written to disk.
If a system crash
occurs during or immediately after the
write
system call,
the data in the log file is used to reconstruct the
write
system call upon recovery.
You cannot enable both forced synchronous writes and atomic write data
logging on a file.
However, you can enable atomic write data logging on a
file and also open the file with an
O_SYNC
option.
This
ensures that the write is synchronous, but also prevents partial writes if
a crash occurs before the
write
system call returns.
To force synchronous write requests, enter:
# chfile -l on filename
A file that has atomic write data logging enabled cannot be memory mapped
by using the
mmap
system call, and it cannot have direct
I/O enabled (see
Section 9.2.2.3).
To enable atomic write
data logging, enter:
# chfile -L on filename
To enable atomic write data logging on AdvFS files that are NFS mounted, ensure that:
The NFS property list daemon,
proplistd
,
is running on the NFS client and that the fileset is mounted on the client
by using the
mount
command and the
proplist
option (see the example after this list).
The offset into the file is on an 8-KB page boundary, because NFS performs I/O on 8-KB page boundaries.
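For example, a hedged sketch of the client-side mount described in the first item (server, fileset, and mount point names are hypothetical):
# mount -t nfs -o proplist server:/data_fs /mnt/data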
9.2.2.3 Enabling Direct I/O
You can enable direct I/O to significantly improve disk I/O throughput for applications that do not frequently reuse previously accessed data. The following list describes considerations that apply if you enable direct I/O:
Data is not cached in the UBC and reads and writes are synchronous.
You can use the asynchronous I/O (AIO) functions (aio_read
and
aio_write
) to enable an application to achieve an asynchronous-like
behavior by issuing one or more synchronous direct I/O requests without waiting
for their completion.
Although direct I/O supports I/O requests of any byte size, the best performance occurs when the requested byte transfer is aligned on a disk sector boundary and is an even multiple of the underlying disk sector size.
You cannot enable direct I/O for a file if it is already opened for
data-logging or if it is memory mapped.
Use the
fcntl
system call with the
F_GETCACHEPOLICY
argument to determine
if an open file has direct I/O enabled.
To enable direct I/O for a specific file, use the
open
system call and set the
O_DIRECTIO
file access flag.
A
file is opened for direct I/O until all users close the file.
See
fcntl
(2),
open
(2),
AdvFS Administration,
and the
Programmer's Guide
for more information.
9.2.2.4 Using AdvFS to Distribute Files
If the files in a multivolume domain are not evenly distributed, performance might be degraded. You can distribute space evenly across volumes in a multivolume file domain to balance the percentage of used space among volumes in a domain. Files are moved from one volume to another until the percentage of used space on each volume in the domain is as equal as possible.
To display volume information and determine whether you need to balance files, enter:
#
showfdmn
file_domain_name
Information similar to the following is displayed:
Id               Date Created              LogPgs  Version  Domain Name
3437d34d.000ca710 Sun Oct  5 10:50:05 1999     512        3  usr_domain

Vol   512-Blks    Free   % Used  Cmode  Rblks  Wblks  Vol Name
 1L    1488716   549232     63%     on    128    128  /dev/disk/dsk0g
 2      262144   262000      0%     on    128    128  /dev/disk/dsk4a
      ---------  -------   ------
       1750860   811232     54%
The
% Used
field shows the percentage of volume space
that is currently allocated to files or metadata (the fileset data structure).
In the previous example, the
usr_domain
file domain is not
balanced.
Volume 1 has 63 percent used space while volume 2 has 0 percent
used space (it was just added).
To distribute the percentage of used space evenly across volumes in a multivolume file domain, enter:
#
balance
file_domain_name
The
balance
command is transparent to users and applications
and does not affect data availability or split files.
Therefore, file domains
with very large files may not balance as evenly as file domains with smaller
files and you might need to move large files on the same volume in a multivolume
file domain.
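For example, to balance the usr_domain domain shown in the previous output:
# balance usr_domain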
To determine if you should move a file, enter:
#
showfile
-x
file_name
Information similar to the following is displayed:
Id      Vol  PgSz  Pages  XtntType  Segs  SegSz  I/O    Perf  File
8.8002    1    16     11    simple    **     **  async   18%  src

extentMap: 1
    pageOff    pageCnt    vol    volBlock    blockCnt
          0          1      1      187296          16
          1          1      1      187328          16
          2          1      1      187264          16
          3          1      1      187184          16
          4          1      1      187216          16
          5          1      1      187312          16
          6          1      1      187280          16
          7          1      1      187248          16
          8          1      1      187344          16
          9          1      1      187200          16
         10          1      1      187232          16
    extentCnt: 11
The file in the previous example is a good candidate to move to another
volume because it has 11 extents and an 18 percent performance efficiency
as shown in the
Perf
field.
A high percentage indicates
optimal efficiency.
To move a file to a different volume in the file domain, enter:
# migrate [-p pageoffset] [-n pagecount] [-s volumeindex_from] \ [-d volumeindex_to] file_name
You can specify the volume from which and to which a file is to be moved, or allow the system to pick the best space in the file domain. You can move either an entire file or specific pages to a different volume.
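For example, a sketch that moves the src file shown above from volume 1 to volume 2 (the volume indexes are taken from the sample output):
# migrate -s 1 -d 2 src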
Note that using the
balance
utility after moving
files might move files to a different volume.
See
showfdmn
(8),
migrate
(8), and
balance
(8)
for more information.
9.2.2.5 Striping Data
You can use AdvFS, LSM, or hardware RAID to stripe (distribute) data. Striped data is data that is separated into units of equal size, then written to two or more disks, creating a stripe of data. The data can be simultaneously written if there are two or more units and the disks are on different SCSI buses.
Figure 9-2
shows how a write request of 384 KB of data is separated into six 64-KB data units and written to three disks as two complete stripes.
Figure 9-2: Striping Data
In general, you should use only one method to stripe data. In some specific cases using multiple striping methods can improve performance but only if:
Most of the I/O requests are large (>= 1MB)
The data is striped over multiple RAID sets on different controllers
The LSM or AdvFS stripe size is a multiple of the full hardware RAID stripe size
See
stripe
(8)
for more information about using AdvFS to stripe
data.
See the
Logical Storage Manager
manual for more information about using LSM
to stripe data.
See your storage hardware documentation for more information
about using hardware RAID to stripe data.
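For example, a hedged sketch of striping a new (empty) AdvFS file across three volumes in its domain before any data is written to it (the file name is hypothetical; see stripe(8) for the exact requirements):
# stripe -n 3 bigfile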
9.2.2.6 Defragmenting a File Domain
An extent is a contiguous area of disk space that AdvFS allocates to a file. Extents consist of one or more 8-KB pages. When storage is added to a file, it is grouped in extents. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.
File I/O is most efficient when there are few extents. If a file consists of many small extents, AdvFS requires more I/O processing to read or write the file. Disk fragmentation can result in many extents and may degrade read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space due to fragmentation.
To display fragmentation information for a file domain, enter:
#
defragment
-vn
file_domain_name
Information similar to the following is displayed:
defragment: Gathering data for 'staff_dmn'
  Current domain data:
    Extents:                   263675
    Files w/ extents:          152693
    Avg exts per file w/exts:    1.73
    Aggregate I/O perf:           70%
    Free space fragments:       85574
                  <100K     <1M    <10M    >10M
      Free space:   34%     45%     19%      2%
      Fragments:  76197    8930     440       7
Ideally, you want few extents for each file.
Although the
defragment
command does not affect data
availability and is transparent to users and applications, it can be a time-consuming
process and requires disk space.
You should run the
defragment
command during low file system activity as part of regular file system maintenance
or if you experience problems because of excessive fragmentation.
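For example, a sketch that defragments the staff_dmn domain shown above, assuming that omitting the -n (report-only) option used earlier causes defragment to actually rearrange the extents:
# defragment -v staff_dmn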
There is little performance benefit from defragmenting a file domain that contains files less than 8 KB, is used in a mail server, or is read-only.
You can also use the
showfile
command to check a
file's fragmentation.
See
Section 9.2.3.4
for information.
See
defragment
(8)
for more information.
9.2.2.7 Decreasing the I/O Transfer Size
AdvFS attempts to transfer data to and from the disk in sizes that are the most efficient for the device driver. This value is provided by the device driver and is called the preferred transfer size. AdvFS uses the preferred transfer size to:
Consolidate contiguous, small I/O transfers into a larger, single I/O of the preferred transfer size. This results in a fewer number of I/O requests, which increases throughput.
Prefetch (read ahead) subsequent pages of files that are being read sequentially, up to the preferred transfer size, in anticipation that those pages will eventually be read by the application.
Generally, the I/O transfer size provided by the device driver is the most efficient. However, in some cases you may want to reduce the AdvFS I/O transfer size. For example, if your AdvFS fileset is using LSM volumes, the preferred transfer size might be very high. This could cause the cache to be unduly diluted by the buffers for the files being read. If this is suspected, reducing the read transfer size may alleviate the problem.
For systems with impaired
mmap
page faulting or with
limited memory, you should limit the read transfer size to limit the amount
of data that is prefetched; however, this will limit I/O consolidation for
all reads from this disk.
To display the I/O transfer sizes for a disk, enter:
# chvol -l block_special_device_name domain
To modify the read I/O transfer size, enter:
# chvol -r blocks block_special_device_name domain
To modify the write I/O transfer size, enter:
# chvol -w blocks block_special_device_name domain
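For example, a sketch using the usr_domain volume from the earlier output (the value of 128 blocks is illustrative):
# chvol -l /dev/disk/dsk0g usr_domain
# chvol -r 128 /dev/disk/dsk0g usr_domain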
See
chvol
(8)
for more information.
Each device driver has a minimum and maximum value for the I/O transfer
size.
If you use an unsupported value, the device driver automatically limits
the value to either the largest or smallest I/O transfer size it supports.
See your device driver documentation for more information on supported I/O
transfer sizes.
9.2.2.8 Moving the Transaction Log
The AdvFS transaction log should be located on a fast or uncongested disk and bus; otherwise, performance might be degraded.
To display volume information, enter:
#
showfdmn
file_domain_name
Information similar to the following is displayed:
Id               Date Created              LogPgs  Domain Name
35ab99b6.000e65d2 Tue Jul 14 13:47:34 1998     512  staff_dmn

Vol   512-Blks    Free   % Used  Cmode  Rblks  Wblks  Vol Name
 3L     262144   154512     41%     on    256    256  /dev/rz13a
 4      786432   452656     42%     on    256    256  /dev/rz13b
     ----------  --------  ------
       1048576   607168     42%
In the
showfdmn
command display, the letter
L
displays next to the volume that contains the transaction log.
If the transaction log is located on a slow or busy disk, you can:
Move the transaction log to a different disk.
Use the
switchlog
command to move the transaction
log (see the example after this list).
Divide a large multivolume file domain into several smaller file domains. This will distribute the transaction log I/O across multiple logs.
To divide a multivolume domain into several smaller domains, create the smaller domains and then copy portions of the large domain into the smaller domains. You can use the AdvFS vdump and vrestore commands to copy the data, so that the disks used by the large domain can be reused to construct the smaller domains.
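For example, a hedged sketch that moves the staff_dmn transaction log from volume 3 to volume 4 (as shown in the showfdmn output above), assuming switchlog accepts the domain name and the target volume index as arguments (see switchlog(8)):
# switchlog staff_dmn 4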
See
showfdmn
(8),
switchlog
(8),
vdump
(8), and
vrestore
(8)
for more information.
9.2.3 Displaying AdvFS Information
Table 9-2
describes the commands you can use to display AdvFS information.
Table 9-2: Commands to Display AdvFS Information
| To Display | Command |
| AdvFS performance statistics (Section 9.2.3.1) | advfsstat |
| Disks in a file domain (Section 9.2.3.2) | advscan |
| Information about AdvFS file domains and volumes (Section 9.2.3.3) | showfdmn |
| AdvFS fileset information for a file domain (Section 9.2.3.5) | showfsets |
| Information about files in an AdvFS fileset (Section 9.2.3.4) | showfile |
| A formatted page of the BMT (Section 9.2.3.6) | vbmtpg |
9.2.3.1 Displaying AdvFS Performance Statistics
To display detailed information about a file domain, including use of the UBC and namei cache, fileset vnode operations, locks, bitfile metadata table (BMT) statistics, and volume I/O performance, enter:
#
advfsstat
-v
[-i
number_of_seconds]
file_domain
Information, in units of one disk block (512 bytes), similar to the following is displayed:
vol1
  rd   wr   rg  arg   wg  awg  blk  flsh  wlz  sms  rlz  con  dev
  54    0   48  128    0    0    0     0    1    0    0    0   65
You can use the
-i
option to display information
at specific time intervals, in seconds.
The previous example displays:
rd
(read) and
wr
(write)
requests
Compare the number of read requests to the number of write requests. Read requests are blocked until the read completes, but write requests will not block the calling thread, which increases the throughput of multiple threads.
rg
and
arg
(consolidated
reads) and
wg
and
awg
(consolidated
writes)
The consolidated read and write values indicate the number of disparate reads and writes that were consolidated into a single I/O to the device driver. If the number of consolidated reads and writes decreases compared to the number of reads and writes, AdvFS may not be consolidating I/O.
blk
(blocking queue),
flsh
(flush queue),
wlz
(wait queue),
sms
(smooth sync queue),
rlz
(ready queue),
con
(consol queue), and
dev
(device queue).
See
Section 9.2.1
for information on AdvFS I/O queues.
If you are experiencing poor performance, and the number of I/O requests
on the
flsh
or
blk
queues increases
continually while the number on the
dev
queue remains fairly
constant, the application may be I/O bound to this device.
You might eliminate
the problem by adding more disks to the domain or by striping with LSM or
hardware RAID.
To display the number of file creates, reads, writes, and other operations for a specified domain or fileset, enter:
# advfsstat [-i number_of_seconds] -f 2 number file_domain file_set
Information similar to the following is displayed:
lkup crt geta read writ fsnc dsnc rm mv rdir mkd rmd link 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 10 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 8 51 0 9 0 0 3 0 0 4 0 0 1201 324 2985 0 601 0 0 300 0 0 0 0 0 1275 296 3225 0 655 0 0 281 0 0 0 0 0 1217 305 3014 0 596 0 0 317 0 0 0 0 0 1249 304 3166 0 643 0 0 292 0 0 0 0 0 1175 289 2985 0 601 0 0 299 0 0 0 0 0 779 148 1743 0 260 0 0 182 0 47 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The following table describes the headings in the previous example:
| Heading | Displays Number Of |
| lkup | file lookups |
| crt | file creates |
| geta | get attributes |
| read | file reads |
| writ | file writes |
| fsnc | file syncs |
| dsnc | data syncs |
| rm | file removes |
| mv | files renamed |
| rdir | directory reads |
| mkd | make directories |
| rmd | remove directories |
| link | links created |
See
advfsstat
(8)
for more information.
9.2.3.2 Displaying Disks in an AdvFS File Domain
Use the advscan command:
To search all devices and LSM disk groups for AdvFS domains.
To rebuild all or part of your
/etc/fdmns
directory if you deleted the
/etc/fdmns
directory, a directory
domain under
/etc/fdmns
, or links from a domain directory
under
/etc/fdmns
.
If you moved devices in a way that has changed device numbers.
To display AdvFS volumes on devices or in an LSM disk group, enter:
#
advscan
device
|
LSM_disk_group
Information similar to the following is displayed:
Scanning disks dsk0 dsk5 Found domains: usr_domain Domain Id 2e09be37.0002eb40 Created Thu Jun 26 09:54:15 1998 Domain volumes 2 /etc/fdmns links 2 Actual partitions found: dsk0c dsk5c
To recreate missing domains on a device, enter:
#
advscan
-r
device
Information similar to the following is displayed:
Scanning disks dsk6 Found domains: *unknown* Domain Id 2f2421ba.0008c1c0 Created Mon Jan 20 13:38:02 1998 Domain volumes 1 /etc/fdmns links 0 Actual partitions found: dsk6a* *unknown* Domain Id 2f535f8c.000b6860 Created Tue Feb 25 09:38:20 1998 Domain volumes 1 /etc/fdmns links 0 Actual partitions found: dsk6b* Creating /etc/fdmns/domain_dsk6a/ linking dsk6a Creating /etc/fdmns/domain_dsk6b/ linking dsk6b
See
advscan
(8)
for more information.
9.2.3.3 Displaying AdvFS File Domains
To display information about a file domain, including the date created and the size and location of the transaction log, and information about each volume in the domain, including the size, the number of free blocks, the maximum number of blocks read and written at one time, and the device special file, enter:
#
showfdmn
file_domain
Information similar to the following is displayed:
Id               Date Created              LogPgs  Version  Domain Name
34f0ce64.0004f2e0 Wed Mar 17 15:19:48 1999     512        4  root_domain

Vol   512-Blks    Free   % Used  Cmode  Rblks  Wblks  Vol Name
 1L     262144    94896     64%     on    256    256  /dev/disk/dsk0a
For multivolume domains, the
showfdmn
command also
displays the total volume size, the total number of free blocks, and the total
percentage of volume space currently allocated.
See
showfdmn
(8)
for more information about the output of the
command.
9.2.3.4 Displaying AdvFS File Information
To display detailed information about files (and directories) in an AdvFS fileset, enter:
#
showfile * |
file name
The * displays the AdvFS characteristics for all of the files in the current working directory.
Information similar to the following is displayed:
Id Vol PgSz Pages XtntType Segs SegSz I/O Perf File 23c1.8001 1 16 1 simple ** ** ftx 100% OV 58ba.8004 1 16 1 simple ** ** ftx 100% TT_DB ** ** ** ** symlink ** ** ** ** adm 239f.8001 1 16 1 simple ** ** ftx 100% advfs ** ** ** ** symlink ** ** ** ** archive 9.8001 1 16 2 simple ** ** ftx 100% bin (index) ** ** ** ** symlink ** ** ** ** bsd ** ** ** ** symlink ** ** ** ** dict 288.8001 1 16 1 simple ** ** ftx 100% doc 28a.8001 1 16 1 simple ** ** ftx 100% dt ** ** ** ** symlink ** ** ** ** man 5ad4.8001 1 16 1 simple ** ** ftx 100% net ** ** ** ** symlink ** ** ** ** news 3e1.8001 1 16 1 simple ** ** ftx 100% opt ** ** ** ** symlink ** ** ** ** preserve ** ** ** ** advfs ** ** ** ** quota.group ** ** ** ** advfs ** ** ** ** quota.user b.8001 1 16 2 simple ** ** ftx 100% sbin (index) ** ** ** ** symlink ** ** ** ** sde 61d.8001 1 16 1 simple ** ** ftx 100% tcb ** ** ** ** symlink ** ** ** ** tmp ** ** ** ** symlink ** ** ** ** ucb 6df8.8001 1 16 1 simple ** ** ftx 100% users
The following table describes the headings in the previous example:
| Heading | Description |
| Id | The unique number (in hexadecimal format) that identifies the file. Digits to the left of the dot (.) character are equivalent to a UFS inode. |
| Vol | The location of the primary metadata for the file, expressed as a volume number. The data extents of the file can reside on another volume. |
| PgSz | The page size in 512-byte blocks. |
| Pages | The number of pages allocated to the file. |
| XtntType | The extent type (for example, simple, stripe, or symlink). |
| Segs | The number of stripe segments per striped file, which is the number of volumes a striped file crosses. (Applies only to the stripe type.) |
| SegSz | The number of pages per stripe segment. (Applies only to the stripe type.) |
| I/O | The type of write requests to this file (for example, async or ftx). |
| Perf | The efficiency of file-extent allocation, expressed as a percentage of the optimal extent layout. A high percentage indicates that the AdvFS I/O system has achieved optimal efficiency. A low percentage indicates the need for file defragmentation. |
See
showfile
(8)
for more information about the command output.
9.2.3.5 Displaying the AdvFS Filesets in a File Domain
To display information about the filesets in a file domain, including the fileset names, the total number of files, the number of used blocks, the quota status, and the clone status, enter:
#
showfsets
file_domain
Information similar to the following is displayed:
mnt Id : 2c73e2f9.000f143a.1.8001 Clone is : mnt_clone Files : 7456, SLim= 60000, HLim=80000 Blocks (1k) : 388698, SLim= 6000, HLim=8000 Quota Status : user=on group=on mnt_clone Id : 2c73e2f9.000f143a.2.8001 Clone of : mnt Revision : 2
The previous example shows that a file domain called dmn1 has one fileset (mnt) and one clone fileset (mnt_clone).
See
showfsets
(8)
for information.
9.2.3.6 Displaying the Bitmap Metadata Table
The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.
The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time AdvFS needs additional metadata, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, the size by which the BMT grows might be described by several extents.
To display a formatted page of the BMT, enter:
#
vbmtpg
volume
Information similar to the following is displayed:
PAGE LBN 32 megaVersion 0 nextFreePg 0 freeMcellCnt 0 pageId 0 nextfreeMCId page 0 cell 0 ========================================================================== CELL 0 nextVdIndex 0 linkSegment 0 tag,bfSetTag: 0, 0 nextMCId page 0 cell 0 CELL 1 nextVdIndex 0 linkSegment 0 tag,bfSetTag: 0, 0 nextMCId page 0 cell 0 CELL 2 nextVdIndex 0 linkSegment 0 tag,bfSetTag: 0, 0 nextMCId page 0 cell 0 CELL 3 nextVdIndex 0 linkSegment 0 tag,bfSetTag: 0, 0 nextMCId page 0 cell 0 . . . CELL 21 nextVdIndex 267 linkSegment 779 tag,bfSetTag: 10, 0 nextMCId page16787458 cell 16 CELL 22 nextVdIndex 1023 linkSegment 0 tag,bfSetTag: 42096,46480 nextMCId page67126700 cell 16 CELL 23 nextVdIndex 4 linkSegment 0 tag,bfSetTag:-2147483648, 1 nextMCId page 0 cell 1 CELL 24 nextVdIndex 0 linkSegment 0 tag,bfSetTag:332144, 0 nextMCId page 585 cell 16 CELL 25 nextVdIndex 29487 linkSegment 26978 tag,bfSetTag:1684090734,1953325 686 nextMCId page 0 cell 0 ========================================================================== RECORD 0 bcnt26739 version105 type 108 *** unknown *** CELL 26 nextVdIndex 0 linkSegment 0 tag,bfSetTag:1879048193, 2 nextMCId page 0 cell 0 CELL 27 nextVdIndex 0 linkSegment 0 tag,bfSetTag: 0, 1023 nextMCId page 31 cell 31
See
vbmtpg
(8)
for more information.
You can also invoke the
showfile
command and specify
mount_point/.tags/M-10
to examine the BMT
extents on the first domain volume that contains the fileset mounted on the
specified mount point.
To examine the extents of the other volumes in the
domain, specify
M-16
,
M-24
, and so on.
If the extents at the end of the BMT are smaller than the extents at the beginning
of the file, the BMT is becoming fragmented.
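For example, to examine the BMT extents on the first volume of the domain that contains a fileset mounted on /mnt (the mount point is hypothetical):
# showfile -x /mnt/.tags/M-10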
See
showfile
(8)
for more information.
9.3 Tuning UFS
This section describes UFS
configuration and tuning guidelines and commands that you can use to display
UFS information.
9.3.1 UFS Configuration Guidelines
Table 9-3
lists UFS configuration guidelines and
performance benefits and tradeoffs.
Table 9-3: UFS Configuration Guidelines
| Benefit | Guideline | Tradeoff |
| Improve performance for small files | Make the file system fragment size equal to the block size (Section 9.3.1.1) | Wastes disk space for small files |
| Improve performance for large files | Use the default file system fragment size of 1 KB (Section 9.3.1.1) | Increases the overhead for large files |
| Free disk space and improve performance for large files | Reduce the density of inodes on a file system (Section 9.3.1.2) | Reduces the number of files that can be created |
| Improve performance for disks that do not have a read-ahead cache | Set rotational delay (Section 9.3.1.3) | None |
| Decrease the number of disk I/O operations | Increase the number of blocks combined for a cluster (Section 9.3.1.4) | None |
| Improve performance | Use a Memory File System (MFS) (Section 9.3.1.5) | Does not ensure data integrity because of cache volatility |
| Control disk space usage | Use disk quotas (Section 9.3.1.6) | Might result in a slight increase in reboot time |
| Allow more mounted file systems | Increase the maximum number of UFS and MFS mounts (Section 9.3.1.7) | Requires additional memory resources |
9.3.1.1 Modifying the File System Fragment and Block Sizes
The UFS file system block size is 8 KB. The default fragment size is 1 KB. You can use the newfs command to set the fragment size to 1024, 2048, 4096, or 8192 bytes when you create the file system.
Although the default fragment size uses disk space efficiently, it increases the overhead for files less than 96 KB. If the average file in a file system is less than 96 KB, you might improve disk access time and decrease system overhead by making the file system fragment size equal to the default block size (8 KB).
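For example, a sketch of creating a UFS file system whose fragment size equals the 8-KB block size (the raw device name is hypothetical):
# newfs -b 8192 -f 8192 /dev/rdisk/dsk3c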
See
newfs
(8)
for more information.
9.3.1.2 Reducing the Density of inodes
An inode describes an individual file in the file system. The maximum number of files in a file system depends on the number of inodes and the size of the file system. The system creates an inode for each 4 KB (4096 bytes) of data space in a file system.
If a file system will contain many large files and you are sure that you will not create a file for each 4 KB of space, you can reduce the density of inodes on the file system. This will free disk space for file data, but reduces the number of files that can be created.
To do this, use the
newfs -i
command to specify the
amount of data space allocated for each inode when you create the file system.
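For example, a sketch that allocates one inode for each 8 KB of data space instead of the 4-KB default (the device name is hypothetical):
# newfs -i 8192 /dev/rdisk/dsk3c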
See
newfs
(8)
for more information.
9.3.1.3 Setting Rotational Delay
The UFS
rotdelay
parameter specifies
the time, in milliseconds, to service a transfer completion interrupt and
initiate a new transfer on the same disk.
It is used to decide how much rotational
spacing to place between successive blocks in a file.
By default, the
rotdelay
parameter is set to 0 to allocate blocks continuously.
It is useful to set
rotdelay
on disks that do not have
a read-ahead cache.
For disks with cache, set the
rotdelay
to 0.
Use either the
tunefs
command or the
newfs
command to modify the
rotdelay
value.
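For example, a hedged sketch of setting a 4-millisecond rotational delay on an existing, unmounted file system (the device name and value are illustrative):
# tunefs -d 4 /dev/rdisk/dsk3c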
See
newfs
(8)
and
tunefs
(8)
for more information.
9.3.1.4 Increasing the Number of Blocks Combined for a Cluster
The value of the UFS
maxcontig
parameter specifies the number of blocks that can be combined into a single
cluster (or file-block group).
The default value of
maxcontig
is 8.
The file system attempts I/O operations in a size that is determined
by the value of
maxcontig
multiplied by the block size
(8 KB).
Device drivers that can chain several buffers together in a single transfer
should use a
maxcontig
value that is equal to the maximum
chain length.
This may reduce the number of disk I/O operations.
Use the
tunefs
command or the
newfs
command to change the value of
maxcontig
.
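For example, a sketch of raising maxcontig to 16 on an existing, unmounted file system (the device name and value are illustrative):
# tunefs -a 16 /dev/rdisk/dsk3c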
See
newfs
(8)
and
tunefs
(8)
for more information.
9.3.1.5 Using MFS
The Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS can improve read/write performance, but it is a volatile cache. The contents of an MFS are lost after a reboot, unmount operation, or power failure.
Because no data is written to disk, an MFS is a very fast file system and can be used to store temporary files or read-only files that are loaded into the file system after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS to cache the temporary files that are created during the build and reduce the build time.
See
mfs
(8)
for information.
9.3.1.6 Using UFS Disk Quotas
You can specify UFS file system limits for user accounts and for groups by setting up UFS disk quotas, also known as UFS file system quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.
You may want to set quotas on file systems that contain home directories,
because the sizes of these file systems can increase more significantly than
other file systems.
Do not set quotas on the
/tmp
file
system.
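For example, a hedged sketch of setting a quota for one user and turning quotas on, assuming the file system is already mounted with the userquota option (the user name and mount point are hypothetical):
# edquota user1
# quotaon /usr/users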
Note that, unlike AdvFS quotas, UFS quotas may cause a slight increase
in reboot time.
See the
AdvFS Administration
manual for information about AdvFS
quotas.
See the
System Administration
manual for information about
UFS quotas.
9.3.1.7 Increasing the Number of UFS and MFS Mounts
Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made.
The
max_ufs_mounts
attribute specifies the maximum
number of UFS and MFS mounts on the system.
Value: 0 to 2,147,483,647
Default value: 1000 (file system mounts)
You can modify the
max_ufs_mounts
attribute without
rebooting the system.
See
Section 3.6
for information
about modifying kernel subsystem attributes.
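For example, a sketch of raising the limit at run time, assuming the attribute belongs to the vfs subsystem (verify with sysconfig -q; the value 2000 is illustrative):
# sysconfig -q vfs max_ufs_mounts
# sysconfig -r vfs max_ufs_mounts=2000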
Increase the maximum number of UFS and MFS mounts if your system will have more than the default limit of 1000 mounts.
Increasing the maximum number of UFS and MFS mounts enables you to mount
more file systems.
However, increasing the maximum number of mounts requires
memory resources for the additional mounts.
9.3.2 Displaying UFS Information
Table 9-4
describes the commands you can use to display UFS information.
Table 9-4: Commands to Display UFS Information
| To Display | Command |
| UFS information (Section 9.3.2.1) | dumpfs |
| UFS clustering statistics (Section 9.3.2.2) | dbx print ufs_clusterstats |
| Metadata buffer cache statistics (Section 9.3.2.3) | dbx print bio_stats |
9.3.2.1 Displaying UFS Information
To display UFS information for a specified file system, including super block and cylinder group information, enter:
#
dumpfs
filesystem
| /devices/disk/device_name
Information similar to the following is displayed:
magic   11954   format  dynamic   time    Tue Sep 14 15:46:52 1999
nbfree  21490   ndir    9         nifree  99541   nffree  60
ncg     65      ncyl    1027      size    409600  blocks  396062
bsize   8192    shift   13        mask    0xffffe000
fsize   1024    shift   10        mask    0xfffffc00
frag    8       shift   3         fsbtodb 1
cpg     16      bpg     798       fpg     6384    ipg     1536
minfree 10%     optim   time      maxcontig 8     maxbpg  2048
rotdelay 0ms    headswitch 0us    trackseek 0us   rps     60
The information contained in the first lines is relevant for tuning. Of specific interest are the following fields:
bsize
-- The block size of the file
system, in bytes (8 KB).
fsize
-- The fragment size of the
file system, in bytes.
For the optimum I/O performance, you can modify the
fragment size.
minfree
-- The percentage of space
that cannot be used by normal users (the minimum free space threshold).
maxcontig
-- The maximum number of
contiguous blocks that will be laid out before forcing a rotational delay;
that is, the number of blocks that are combined into a single read request.
maxbpg
-- The maximum number of blocks
any single file can allocate out of a cylinder group before it is forced to
begin allocating blocks from another cylinder group.
A large value for
maxbpg
can improve performance for large files.
rotdelay
-- The expected time, in
milliseconds, to service a transfer completion interrupt and initiate a new
transfer on the same disk.
It is used to decide how much rotational spacing
to place between successive blocks in a file.
If
rotdelay
is zero, then blocks are allocated contiguously.
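If you decide to change any of these values on an existing file system, the tunefs command can usually adjust them in place. The following is only a sketch: the option letters (-a for maxcontig, -e for maxbpg, -m for minfree) follow the traditional BSD tunefs interface, the values are illustrative, and the device name is a placeholder, so verify the syntax in tunefs(8) before using it:
# tunefs -a 16 -e 4096 -m 5 /dev/rdisk/dsk2g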
9.3.2.2 Monitoring UFS Clustering
To display how the system is performing cluster read and write
transfers, use the
dbx print
command to examine the
ufs_clusterstats
data structure.
For example:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print ufs_clusterstats
Information similar to the following is displayed:
struct {
    full_cluster_transfers = 3130
    part_cluster_transfers = 9786
    non_cluster_transfers = 16833
    sum_cluster_transfers = {
        [0] 0
        [1] 24644
        [2] 1128
        [3] 463
        [4] 202
        [5] 55
        [6] 117
        [7] 36
        [8] 123
        [9] 0
        .
        .
        .
        [33]
    }
}
(dbx)
The previous example shows 24644 single-block transfers, 1128 double-block transfers, 463 triple-block transfers, and so on.
You can use the
dbx print
command to examine cluster
reads and writes by specifying the
ufs_clusterstats_read
and
ufs_clusterstats_write
data structures respectively.
9.3.2.3 Displaying the Metadata Buffer Cache
To
display statistics on the metadata buffer cache, including superblocks, inodes,
indirect blocks, directory blocks, and cylinder group summaries, use the
dbx print
command to examine the
bio_stats
data
structure.
For example:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print bio_stats
Information similar to the following is displayed:
struct {
    getblk_hits = 4590388
    getblk_misses = 17569
    getblk_research = 0
    getblk_dupbuf = 0
    getnewbuf_calls = 17590
    getnewbuf_buflocked = 0
    vflushbuf_lockskips = 0
    mntflushbuf_misses = 0
    mntinvalbuf_misses = 0
    vinvalbuf_misses = 0
    allocbuf_buflocked = 0
    ufssync_misses = 0
}
The number of block misses (getblk_misses) divided by the sum of block misses
and block hits (getblk_hits) should not be more than 3 percent.
If the number of block misses is high,
you might want to increase the value of the
bufcache
attribute.
See
Section 9.1.3
for information on increasing the value
of the
bufcache
attribute.
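To check the sample bio_stats output above against the 3 percent guideline, you can do a quick calculation:
# echo "17569 / (17569 + 4590388) * 100" | bc -l
The result is approximately 0.38 percent, well within the guideline.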
9.3.3 Tuning UFS for Performance
Table 9-5
lists UFS tuning guidelines and performance
benefits and tradeoffs.
Table 9-5: UFS Tuning Guidelines
Benefit | Guideline | Tradeoff |
Improve performance | Adjust UFS smoothsync and I/O throttling for asynchronous UFS I/O requests (Section 9.3.3.1) | None |
Free CPU cycles and reduce the number of I/O operations | Delay UFS cluster writing (Section 9.3.3.2) | If I/O throttling is not used, might degrade real-time workload performance when buffers are flushed |
Reduce the number of disk I/O operations | Increase the number of blocks combined into a cluster (Section 9.3.3.3) | Might require more memory to buffer data |
Improve read and write performance | Defragment the file system (Section 9.3.3.4) | Requires down time |
9.3.3.1 Adjusting UFS Smooth Sync and I/O Throttling
UFS uses smoothsync and I/O throttling to improve UFS performance and to minimize system stalls resulting from a heavy system I/O load.
Smoothsync allows each dirty page to age for a specified time period
before going to disk.
This allows more opportunity for frequently modified
pages to be found in the cache, thus decreasing the I/O load.
Also, spikes
in which large numbers of dirty pages are locked on the device queue are minimized
because pages are enqueued to a device after having aged sufficiently, as
opposed to getting flushed by the
update
daemon.
I/O throttling further addresses the concern of locking dirty pages on the device queue. It enforces a limit on the number of delayed I/O requests allowed to be on the device queue at any point in time. This allows the system to be more responsive to any synchronous requests added to the device queue, such as a read or the loading of a new program into memory. This can also decrease the amount and duration of process stalls for specific dirty buffers, as pages remain available until placed on the device queue.
Related Attributes
The
vfs
subsystem attributes that affect smoothsync
and throttling are:
The
smoothsync_age
attribute -- Specifies
the amount of time, in seconds, that a modified page ages before becoming
eligible for the smoothsync mechanism to flush it to disk.
Value: 0 to 60
Default value: 30 seconds
If set to 0, smoothsync is disabled and dirty page flushing is controlled
by the
update
daemon at 30 second intervals.
Increasing the value increases the chance of lost data if the system crashes, but can decrease net I/O load (improve performance) by allowing the dirty pages to remain cached longer.
The
smoothsync_age
attribute is enabled when the
system boots to multiuser mode and disabled when the system changes from multiuser
mode to single-user mode.
To change the value of the
smoothsync_age
attribute, edit the following lines in the
/etc/inittab
file:
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1
smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1
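For example, to let dirty pages age for the maximum of 60 seconds in multiuser mode, you would change the first entry as follows:
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=60 > /dev/null 2>&1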
You can use the
smsync2
mount option to specify an
alternate smoothsync policy that can further decrease the net I/O load.
The
default policy is to flush modified pages after they have been dirty for the
smoothsync_age
time period, regardless of continued modifications
to the page.
When you mount a UFS using the
smsync2
mount
option, modified pages are not written to disk until they have been dirty
and idle for the
smoothsync_age
time period.
Note that
mmap'ed pages always use this default policy, regardless of the
smsync2
setting.
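For example, a UFS file system might be mounted with the alternate policy as follows; the device and mount-point names are placeholders, and the option is assumed to be passed with -o:
# mount -o smsync2 /dev/disk/dsk3c /usr/users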
The
io_throttle_shift
attribute --
Specifies a value that limits the maximum number of concurrent delayed UFS
I/O requests on an I/O device queue.
The more requests there are on an I/O device queue, the longer it takes
to process those requests and to make those pages and the device available.
The number of concurrent delayed I/O requests
on an I/O device queue can be throttled (controlled) by setting the
io_throttle_shift
attribute.
The calculated throttle value is based
on the value of the
io_throttle_shift
attribute and the
device's calculated I/O completion rate.
The time required to process the
I/O device queue is proportional to the throttle value.
The correspondences
between the value of the
io_throttle_shift
attribute and
the time to process the device queue are:
Value of the io_throttle_shift attribute | Time (in seconds) to process device queue |
-4 | 0.0625 |
-3 | 0.125 |
-2 | 0.25 |
-1 | 0.5 |
0 | 1 |
1 | 2 |
2 | 4 |
3 | 8 |
4 | 16 |
Default value: 1 (2 seconds). As the table shows, the time to process the device queue is 2 raised to the power of the io_throttle_shift value, in seconds.
However, the
io_throttle_shift
attribute applies only to file systems that you mount using the
throttle
mount option.
You might consider reducing the value of the
io_throttle_shift
attribute if your environment is particularly sensitive to delays
in accessing the I/O device.
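For example, to enable throttling on a file system and halve the default queue-processing time (the device and mount-point names are placeholders, and the throttle option is assumed to be passed with -o):
# mount -o throttle /dev/disk/dsk2g /data
# /sbin/sysconfig -r vfs io_throttle_shift=-1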
The
io_maxmzthruput
attribute -- Specifies
whether to maximize I/O throughput or to maximize the availability
of dirty pages.
Maximizing I/O throughput works more aggressively to keep
the device busy, but within the constraints of the
io_throttle_shift
attribute.
Maximizing the availability of dirty pages favors decreasing
the stall time experienced when waiting for dirty pages.
Value: 0 (disabled) or 1 (enabled)
Default value: 1 (enabled).
However, the
io_maxmzthruput
attribute applies only to file systems that you mount using the
throttle
mount option.
You might consider disabling the
io_maxmzthruput
attribute if your environment is particularly sensitive to delays in accessing
sets of frequently used dirty pages, or if I/O is confined to a small number
of I/O-intensive applications, such that access to a specific set of pages
is more important for overall performance than keeping the I/O device busy.
You can modify the
smoothsync_age
,
io_throttle_shift
, and
io_maxmzthruput
attributes without rebooting the system.
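For example, the following commands use the same sysconfig -r syntax shown in the /etc/inittab entries to change the attributes at run time; the values are illustrative only:
# /sbin/sysconfig -r vfs smoothsync_age=45
# /sbin/sysconfig -r vfs io_throttle_shift=0
# /sbin/sysconfig -r vfs io_maxmzthruput=0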
9.3.3.2 Delaying UFS Cluster Writing
By default, clusters of UFS pages are written asynchronously. You can configure clusters of UFS pages to be written delayed as other modified data and metadata pages are written.
Related Attribute
The
delay_wbuffers
attribute specifies whether clusters of UFS pages are written asynchronously or delayed.
Value: 0 or 1
Default value: 0 (asynchronously)
If the percentage of UBC dirty pages reaches the value of the
delay_wbuffers_percent
attribute, the clusters will be written asynchronously,
regardless of the value of the
delay_wbuffers
attribute.
When to Tune
Delay writing clusters of UFS pages if your applications frequently write to previously written pages. This can result in a decrease in the total number of I/O requests. However, if you are not using I/O throttling, it might adversely affect real-time workload performance because the system will experience a heavy I/O load at sync time.
To delay writing clusters of UFS pages, use the
dbx patch
command to set the value of the
delay_wbuffers
kernel variable
to 1 (enabled).
See
Section 3.6.7
for information about using
dbx
.
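For example, assuming the usual dbx patch assignment syntax (see Section 3.6.7):
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) patch delay_wbuffers = 1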
9.3.3.3 Increasing the Number of Blocks in a Cluster
UFS combines contiguous blocks into clusters to decrease I/O operations. You can specify the number of blocks in a cluster.
Related Attribute
The
cluster_maxcontig
attribute specifies the number
of blocks that are combined into a single I/O operation.
Default value: 32 blocks
If the specific filesystem's rotational delay value is 0 (default),
then UFS attempts to create clusters with up to
n
blocks, where
n
is either the value of the
cluster_maxcontig
attribute or the value from device geometry, whichever
is smaller.
If the specific filesystem's rotational delay value is non-zero, then
n
is the value of the
cluster_maxcontig
attribute,
the value from device geometry, or the value of the
maxcontig
file system attribute, whichever is smaller.
When to Tune
Increase the number of blocks combined for a cluster if your applications can use a large cluster size.
You can use the
newfs
command to set the filesystem
rotational delay value and the value of the
maxcontig
attribute.
You can use the
dbx
command to set the value of the
cluster_maxcontig
attribute.
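For example, the following sketch creates a file system with a rotational delay of 0 and a maxcontig of 16, and then raises cluster_maxcontig with dbx. The newfs option letters (-d for rotational delay, -a for maxcontig), the values, and the device name are assumptions; verify them in newfs(8):
# newfs -d 0 -a 16 /dev/rdisk/dsk3c
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) patch cluster_maxcontig = 64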
9.3.3.4 Defragmenting a File System
When a file consists of noncontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance, because it requires more I/O operations to access the file.
When to Perform
Defragmenting a UFS file system improves file system performance. However, it is a time-consuming process.
You can determine whether the files in a file system are fragmented
by determining how effectively the system is clustering.
You can do this by
using the
dbx print
command to examine the
ufs_clusterstats
data structure.
See
Section 9.3.2.2
for information.
UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not effective, the files in the file system may be very fragmented.
Recommended Procedure
To defragment a UFS file system, follow these steps:
Back up the file system onto tape or another partition.
Create a new file system either on the same partition or a different partition.
Restore the file system.
See the
System Administration
manual for information about backing up and
restoring data and creating UFS file systems.
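The following is a minimal sketch of that cycle for a UFS file system, assuming the dump and restore utilities and using placeholder tape, device, and mount-point names; see the System Administration manual for the complete procedure:
# dump 0f /dev/tape/tape0_d1 /dev/rdisk/dsk2g
# newfs /dev/rdisk/dsk2g
# mount /dev/disk/dsk2g /mnt
# cd /mnt; restore rf /dev/tape/tape0_d1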
9.4 Tuning NFS
The Network File System (NFS) shares the Unified Buffer Cache (UBC) with the virtual memory subsystem and local file systems. NFS can put an extreme load on the network. Poor NFS performance is almost always caused by problems in the network infrastructure. Look for high counts of retransmitted messages on the NFS clients, network I/O errors, and routers that cannot maintain the load.
Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.
You can monitor NFS by using the
nfsstat
and other
commands.
When evaluating NFS performance, remember that NFS does not perform
well if any file-locking mechanisms are in use on an NFS file.
The locks prevent
the file from being cached on the client.
See
nfsstat
(8)
for more information.
The following sections describe how to display NFS information and attributes
that you might be able to tune to improve NFS performance.
9.4.1 Displaying NFS Information
Table 9-6
describes
the commands you can use to display NFS information.
Table 9-6: Commands to Display NFS Information
To Display | Command |
Network and NFS statistics (Section 9.4.1.1) | nfsstat |
Information about idle threads (Section 9.4.1.2) | ps axlmp 0 | grep nfs |
All incoming network traffic to an NFS server | |
Active NFS server threads (Section 3.6.7) | dbx |
Metadata buffer cache statistics (Section 9.3.2.3) | dbx print bio_stats |
9.4.1.1 Displaying Network and NFS Statistics
To display or reinitialize NFS and Remote Procedure Call (RPC) statistics
for clients and servers, including the number of packets that had to be retransmitted
(retrans
) and the number of times a reply transaction ID
did not match the request transaction ID (badxid
), enter:
# /usr/ucb/nfsstat
Information similar to the following is displayed:
Server rpc:
calls      badcalls   nullrecv   badlen     xdrcall
38903      0          0          0          0

Server nfs:
calls      badcalls
38903      0

Server nfs V2:
null       getattr    setattr    root       lookup     readlink   read
5 0%       3345 8%    61 0%      0 0%       5902 15%   250 0%     1497 3%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 3%    549 1%     1049 2%    352 0%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     172 0%     689 1%     1751 4%

Server nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       1333 3%    1019 2%    5196 13%   238 0%     400 1%     2816 7%
write      create     mkdir      symlink    mknod      remove     rmdir
2560 6%    752 1%     140 0%     400 1%     0 0%       1352 3%    140 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
200 0%     200 0%     936 2%     0 0%       3504 9%    3 0%       0 0%
commit
21 0%

Client rpc:
calls      badcalls   retrans    badxid     timeout    wait       newcred
27989      1          0          0          1          0          0
badverfs   timers
0          4

Client nfs:
calls      badcalls   nclget     nclsleep
27988      0          27988      0

Client nfs V2:
null       getattr    setattr    root       lookup     readlink   read
0 0%       3414 12%   61 0%      0 0%       5973 21%   257 0%     1503 5%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 5%    549 1%     1049 3%    352 1%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     171 0%     713 2%     1756 6%

Client nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       666 2%     9 0%       2598 9%    137 0%     200 0%     1408 5%
write      create     mkdir      symlink    mknod      remove     rmdir
1280 4%    376 1%     70 0%      200 0%     0 0%       676 2%     70 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
100 0%     100 0%     468 1%     0 0%       1750 6%    1 0%       0 0%
commit
10 0%
The ratio of timeouts to calls (which should not exceed 1 percent) is the most important thing to look for in the NFS statistics. A timeout-to-call ratio greater than 1 percent can have a significant negative impact on performance. See Chapter 10 for information on how to tune your system to avoid timeouts.
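The sample client RPC statistics above show 1 timeout in 27989 calls; a quick check with bc confirms that the ratio is far below 1 percent:
# echo "1 / 27989 * 100" | bc -l
The result is approximately 0.004 percent.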
To display NFS and RPC information in intervals (seconds), enter:
# /usr/ucb/nfsstat -s -i number
The following example displays NFS and RPC information in 10-second intervals:
# /usr/ucb/nfsstat -s -i 10
If you are monitoring an experimental situation with
nfsstat
, reset the NFS counters to 0 before you begin the experiment.
To
reset counters to 0, enter:
# /usr/ucb/nfsstat -z
See
nfsstat
(8)
for more information about command options and
output.
9.4.1.2 Displaying Idle Thread Information
On a client system, the
nfsiod
daemon spawns
several I/O threads to service asynchronous I/O requests to the server.
The
I/O threads improve the performance of both NFS reads and writes.
The optimum
number of I/O threads depends on many variables, such as how quickly the client
will be writing, how many files will be accessed simultaneously, and the characteristics
of the NFS server.
For most clients, seven threads are sufficient.
To display idle I/O threads on a client system, enter:
# /usr/ucb/ps axlmp 0 | grep nfs
Information similar to the following is displayed:
0    42  0  nfsiod_  S       0:00.52
0    42  0  nfsiod_  S       0:01.18
0    42  0  nfsiod_  S       0:00.36
0    44  0  nfsiod_  S       0:00.87
0    42  0  nfsiod_  S       0:00.52
0    42  0  nfsiod_  S       0:00.45
0    42  0  nfsiod_  S       0:00.74
#
The previous example shows a sufficient number of sleeping threads. Server
threads started by
nfsd
are displayed in the same way, with
nfs_tcp
or
nfs_udp
in place of
nfsiod_
.
If your output shows that few threads are sleeping, you might improve
NFS performance by increasing the number of threads.
See
Section 9.4.2.1,
Section 9.4.2.2,
nfsiod
(8), and
nfsd
(8)
for more information.
9.4.2 Improving NFS Performance
Improving performance on a system that is used only for serving NFS differs from tuning a system that is used for general timesharing, because an NFS server runs only a few small user-level programs, which consume few system resources. There is minimal paging and swapping activity, so memory resources should be focused on caching file system data.
File system tuning is important for NFS because processing NFS requests consumes the majority of CPU and wall clock time. Ideally, the UBC hit rate should be high. Increasing the UBC hit rate can require additional memory or a reduction in the size of other file system caches. In general, file system tuning will improve the performance of I/O-intensive user applications.
In addition, a vnode must exist for a file's data to remain cached; if you are using AdvFS, an access structure is also required.
If you are running NFS over TCP, tuning TCP may improve performance if there are many active clients. See Section 10.2 for more information. However, if you are running NFS over UDP, no network tuning is needed.
Table 9-7
lists NFS configuration guidelines
and performance benefits and tradeoffs.
Table 9-7: NFS Tuning Guidelines
Benefit | Guideline | Tradeoff |
Enable efficient I/O blocking operations | Configure the appropriate number of threads on an NFS server (Section 9.4.2.1) | None |
Enable efficient I/O blocking operations | Configure the appropriate number of threads on the client system (Section 9.4.2.2) | None |
Improve performance on slow or congested networks | Decrease network timeouts on the client system (Section 9.4.2.4) | Reduces the theoretical performance |
Improve network performance for read-only file systems and enable clients to quickly detect changes | Modify cache timeout limits on the client system (Section 9.4.2.3) | Increases network traffic to the server |
9.4.2.1 Configuring Server Threads
The
nfsd
daemon runs on NFS servers to service NFS requests from client
systems.
The daemon spawns a number of server threads that process NFS requests
from client systems.
At least one server thread must be running for a machine
to operate as a server.
The number of threads determines the number of parallel
operations and must be a multiple of 8.
To improve performance on frequently used NFS servers, configure either
16 or 32 threads, which provides the most efficient blocking for I/O operations.
See
nfsd
(8)
for more information.
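The following is a sketch only; it assumes that the TCP and UDP server thread counts are given as nfsd command-line options, which varies between systems, so confirm the exact syntax in nfsd(8) before using it:
# nfsd -t 16 -u 16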
9.4.2.2 Configuring Client Threads
Client systems
use the
nfsiod
daemon to service asynchronous I/O operations,
such as buffer cache read-ahead and delayed write operations.
The
nfsiod
daemon spawns several I/O threads to service asynchronous
I/O requests to its server.
The I/O threads improve performance of both NFS
reads and writes.
The optimal number of I/O threads to run depends on many variables, such as how quickly the client is writing data, how many files will be accessed simultaneously, and the behavior of the NFS server. The number of threads must be a multiple of 8 minus 1 (for example, 7 or 15 is optimal).
NFS servers attempt to gather writes into complete UFS clusters
before initiating I/O, and the number of threads (plus 1) is the number of
writes that a client can have outstanding at any one time.
Having exactly
7 or 15 threads produces the most efficient blocking for I/O operations.
If
write gathering is enabled, and the client does not have any threads, you
may experience a performance degradation.
To disable write gathering, use
the
dbx patch
command to set the
nfs_write_gather
kernel variable to zero.
See
Section 3.6.7
for information.
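For example, assuming the usual dbx patch assignment syntax:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) patch nfs_write_gather = 0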
Use the
ps axlmp 0 | grep nfs
command to display
idle I/O threads on the client.
If few threads are sleeping, you might improve
NFS performance by increasing the number of threads.
See
nfsiod
(8)
for more information.
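For example, assuming nfsiod accepts the desired number of I/O threads as its argument (confirm this in nfsiod(8)), you could start 15 threads with:
# nfsiod 15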
9.4.2.3 Modifying Cache Timeout Limits
For read-only file systems and slow network links, performance might improve by changing the cache timeout limits on NFS client systems. These timeouts affect how quickly you see updates to a file or directory that was modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will slightly improve performance and will reduce the amount of network traffic that you generate.
See
mount
(8)
and the descriptions of the
acregmin
,
acregmax
,
acdirmin
,
acdirmax
, and
actimeo
options for more information.
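For example, a read-only NFS file system might be mounted with longer cache timeouts as follows; the server, export path, and mount point are placeholders, the timeout values are illustrative, and the options are assumed to be passed with -o:
# mount -o ro,acregmin=30,acregmax=120,acdirmin=60,acdirmax=300 server:/export /mnt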
9.4.2.4 Decreasing Network Timeouts
NFS does not perform well if it is used over slow network links,
congested networks, or wide area networks (WANs).
In particular, network timeouts
on client systems can severely degrade NFS performance.
This condition can
be identified by using the
nfsstat
command and determining
the ratio of timeouts to calls.
If timeouts are more than 1 percent of the
total calls, NFS performance may be severely degraded.
See
Section 9.4.1.1
for sample
nfsstat
output of timeout and call statistics.
You can also use the
netstat -s
command to verify
the existence of a timeout problem.
A nonzero value in the
fragments
dropped after timeout
field in the
ip
section
of the
netstat
output may indicate that the problem exists.
See
Section 10.1.1
for sample
netstat
command output.
If fragment drops are a problem on a client system, use the
mount
command with the
-rsize=1024
and
-wsize=1024
options to set the size of the NFS read and write buffers
to 1 KB.
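For example, with placeholder server, export, and mount-point names, and assuming the options are passed with -o:
# mount -o rsize=1024,wsize=1024 server:/export /mnt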