The Tru64 UNIX operating system supports various file system options that have different performance features and functionality.
This chapter describes how to perform the following tasks:
Gather information about all types of file systems (Section 9.1)
Apply tuning recommendations that are applicable to all types of file systems (Section 9.2)
Manage Advanced File System (AdvFS) performance (Section 9.3)
Manage UNIX File System (UFS) performance (Section 9.4)
Manage Network File System (NFS) performance (Section 9.5)
The following sections describe how to use tools to monitor general file system activity and describe some general file system tuning guidelines.
The Unified Buffer Cache (UBC) uses a portion of physical memory to cache most-recently accessed UFS file system data for reads and writes and for page faults from mapped file regions, in addition to AdvFS metadata and user data. The UBC competes with processes for this portion of physical memory, so the amount of memory allocated to the UBC can affect overall system performance.
See Section 6.3.5 for information about using dbx to check the UBC. See Section 9.2 for information on how to tune the UBC.
The namei cache is used by UFS, AdvFS, CD-ROM File System (CDFS), and NFS to store recently used file system pathname/inode number pairs. It also stores inode information for files that were referenced but not found. Having this information in the cache substantially reduces the amount of searching that is needed to perform pathname translations.
To check the namei cache, use the dbx print command to examine the nchstats data structure. Consider the following example:
    # /usr/ucb/dbx -k /vmunix /dev/mem
    (dbx) print nchstats
    struct {
        ncs_goodhits = 9748603
        ncs_neghits = 888729
        ncs_badhits = 23470
        ncs_falsehits = 69371
        ncs_miss = 1055430
        ncs_long = 4067
        ncs_pass2 = 127950
        ncs_2passes = 195763
        ncs_dirscan = 47
    }
    (dbx)
Examine the ncs_goodhits (found a pair), ncs_neghits (found a pair that did not exist), and ncs_miss (did not find a pair) fields to determine the hit rate. The hit rate should be above 80 percent; that is, ncs_goodhits plus ncs_neghits, divided by the sum of ncs_goodhits, ncs_neghits, ncs_miss, and ncs_falsehits, should be greater than 0.80.
See Section 9.2.1 for information on how to improve the namei cache hit rate and lookup speeds.
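As a quick sketch, the hit-rate arithmetic can be scripted with standard tools; the values below are taken from the dbx example above, and the formula is the one just described (this helper is an illustration, not part of the operating system):

```shell
# Compute the namei cache hit rate from nchstats values
# (sample values from the dbx output shown earlier).
goodhits=9748603
neghits=888729
miss=1055430
falsehits=69371

# hit rate = (goodhits + neghits) / (goodhits + neghits + miss + falsehits)
awk -v g="$goodhits" -v n="$neghits" -v m="$miss" -v f="$falsehits" \
    'BEGIN { printf "namei hit rate: %.1f%%\n", 100 * (g + n) / (g + n + m + f) }'
```

A result above 80 percent, as with these sample values, indicates the cache is performing well.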
You may be able to improve I/O performance by modifying some kernel attributes that affect all file system performance.
General file system tuning often involves tuning the Virtual File System (VFS), which provides a uniform interface that allows common access to files, regardless of the file system on which the files reside.
To successfully improve file system performance, you must understand how your applications and users perform I/O, as described in Section 2.1. Because file systems share memory with processes, you should also understand virtual memory operation, as described in Chapter 6.
Table 9-1 describes the guidelines for general file system tuning and lists the performance benefits as well as the tradeoffs. There are also specific guidelines for AdvFS and UFS file systems. See Section 9.3 and Section 9.4 for information.
| Action | Performance Benefit | Tradeoff |
| Increase the size of the namei cache (Section 9.2.1) | Improves cache lookup operations | Consumes memory |
| Increase the size of the hash chain table for the namei cache (Section 9.2.2) | Improves cache lookup operations | Consumes memory |
| Increase the memory allocated to the UBC (Section 9.2.3) | Improves file system performance | May cause excessive paging and swapping |
| Decrease the amount of memory borrowed by the UBC (Section 9.2.4) | Improves file system performance | Decreases the memory available for processes and may decrease system response time |
| Increase the minimum size of the UBC (Section 9.2.5) | Improves file system performance | Decreases the memory available for processes |
| Increase the UBC write device queue depth (Section 9.2.6) | Increases overall file system throughput and frees memory | Decreases interactive response performance |
| Decrease the UBC write device queue depth (Section 9.2.6) | Improves interactive response time | Consumes memory |
| Increase the amount of UBC memory used to cache a large file (Section 9.2.7) | Improves large file performance | May allow a large file to consume all the pages on the free list |
| Decrease the amount of UBC memory used to cache a large file (Section 9.2.7) | Prevents a large file from consuming all the pages on the free list | May degrade large file performance |
| Disable flushing to disk file read access times (Section 9.2.8) | Improves file system performance for proxy servers | Jeopardizes the integrity of read access time updates and violates POSIX standards |
| Use Prestoserve to cache only file system metadata (Section 9.2.9) | Improves performance for applications that access large amounts of file system metadata | Prestoserve is not supported in a cluster or for nonfile system I/O operations |
| Increase the size of the Prestoserve buffer hash table (Section 9.2.10) | Decreases Prestoserve lock contention | Prestoserve is not supported in a cluster or for nonfile system I/O operations |
| Cache more vnodes on the free list (Section 9.2.11) | Improves cache lookup operations | Consumes memory |
| Increase the amount of time for which vnodes are kept on the free list (Section 9.2.12) | Improves cache lookup operations | None |
| Delay vnode deallocation (Section 9.2.13) | Improves namei cache lookup operations | Consumes memory |
| Accelerate vnode deallocation (Section 9.2.14) | Speeds the freeing of memory | Reduces the efficiency of the namei cache |
| Disable vnode deallocation (Section 9.2.15) | Optimizes processing time | Consumes memory |
The following sections describe these guidelines in detail.
The namei cache is used by all file systems to map file pathnames to inodes. Monitor the cache by using the dbx print command to examine the nchstats data structure. The miss rate (misses divided by the sum of good hits, negative hits, and misses) should be less than 20 percent.
To make lookup operations faster, increase the size of the namei cache by increasing the value of the maxusers attribute (the recommended way), as described in Section 5.1, or by increasing the value of the vfs subsystem attribute name-cache-size (the default value is 1029). Increasing the value of the maxusers or name-cache-size attribute allocates more system resources for use by the kernel, but also increases the amount of wired memory consumed by the kernel.
Note that many benchmarks perform better with a large namei cache.
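On Tru64 UNIX, vfs subsystem attributes are queried with the sysconfig command and made permanent in /etc/sysconfigtab. A hedged sketch follows; the value 2048 is purely illustrative, not a recommendation from this chapter:

```shell
# Query the current namei cache size (vfs subsystem attribute):
#   sysconfig -q vfs name-cache-size
#
# To make a larger cache permanent, add a stanza like this to
# /etc/sysconfigtab and reboot (2048 is an illustrative value only):
#   vfs:
#       name-cache-size = 2048
```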
Increasing the size of the hash chain table for the namei cache distributes the namei cache elements and reduces the time needed for linear searches, which can improve lookup speeds. The vfs subsystem attribute name-cache-hash-size specifies the size of the hash chain table, in table elements, for the namei cache.
The default value of the name-cache-hash-size attribute is the value of the name-cache-size attribute divided by 8 and rounded up to the next power of 2, or 8192, whichever is greater.
You can change the value of the name-cache-hash-size attribute so that each hash chain has three or four name cache entries. To determine an appropriate value for the name-cache-hash-size attribute, divide the value of the vfs subsystem attribute name-cache-size by 3 or 4 and then round the result to a power of 2. For example, if the value of name-cache-size is 1029, dividing 1029 by 4 produces a value of 257. Based on this calculation, you could specify 256 (2 to the power of 8) for the value of the name-cache-hash-size attribute.
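The divide-by-4-and-round calculation can be sketched as a small script; the rounding-to-a-power-of-2 helper below illustrates the arithmetic in the text and is not a system utility:

```shell
# Candidate name-cache-hash-size: name-cache-size / 4, rounded to
# the nearest power of 2 (aiming for 3 to 4 entries per hash chain).
name_cache_size=1029

awk -v n="$name_cache_size" 'BEGIN {
    target = n / 4
    p = 1
    while (p * 2 <= target) p *= 2            # largest power of 2 <= target
    if (target - p > 2 * p - target) p *= 2   # round to the nearer power
    print p
}'
```

With the default name-cache-size of 1029, this prints 256, matching the worked example above.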
The Unified Buffer Cache (UBC) shares with processes the memory that is not wired. The UBC caches UFS file system data for reads and writes, AdvFS metadata and file data, and Memory File System (MFS) data. Performance is improved if the cached data is later reused and a disk operation is avoided.
If you reuse data, be sure to allocate enough memory to the UBC to improve the chance that data will be found in the cache. An insufficient amount of memory allocated to the UBC can impair file system performance. However, the performance of an application that generates a lot of random I/O will not be improved by a large UBC, because the next access location for random I/O cannot be predetermined.
To increase the maximum amount of memory allocated to the UBC, you can increase the value of the vm subsystem attribute ubc-maxpercent. The default value is 100 percent, which should be appropriate for most configurations, including Internet servers.
Be sure that allocating more memory to the UBC does not cause excessive paging and swapping.
See Section 6.1.2.2 for information about UBC memory allocation.
The UBC borrows all physical memory above the value of the vm subsystem attribute ubc-borrowpercent and up to the value of the ubc-maxpercent attribute. Increasing the value of the ubc-borrowpercent attribute allows more memory to remain in the UBC when page reclamation begins. This can increase the UBC cache effectiveness, but may degrade system response time when a low-memory condition occurs. If vmstat output shows excessive paging but few or no pageouts, you may want to increase the borrowing threshold. The value of the ubc-borrowpercent attribute can range from 0 to 100. The default value is 20 percent.
See Section 6.1.2.2 for information about UBC memory allocation.
Increasing the minimum size of the UBC will prevent large programs from completely filling the UBC. For I/O servers, you may want to raise the value of the vm subsystem attribute ubc-minpercent to ensure that enough memory is available for the UBC. The default value is 10 percent. Because the UBC and processes share virtual memory, increasing the minimum size of the UBC may cause the system to page excessively. In addition, if the values of the vm subsystem attributes ubc-maxpercent and ubc-minpercent are close together, you may degrade I/O performance.
To ensure that the value of the ubc-minpercent attribute is appropriate, use the vmstat command to examine the page-out rate. See Section 6.3.2 for information.
See Section 6.1.2.2 for information about UBC memory allocation.
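The three UBC sizing attributes discussed above are all vm subsystem attributes and are set the same way. A hedged /etc/sysconfigtab sketch follows; the values shown are simply the defaults quoted in the text, listed only as a template:

```shell
# UBC sizing attributes (vm subsystem); the values are the defaults
# described in this chapter, shown here only as a stanza template:
#   vm:
#       ubc-maxpercent = 100
#       ubc-borrowpercent = 20
#       ubc-minpercent = 10
#
# Query the current values with:
#   sysconfig -q vm ubc-maxpercent ubc-borrowpercent ubc-minpercent
```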
The UBC uses a buffer to facilitate the movement of data between memory and disk. The vm subsystem attribute vm-ubcbuffers specifies the maximum file system device I/O queue depth for writes. The default value is 256.
Increasing the UBC write device queue depth frees memory and increases the overall file system throughput.
Decreasing the UBC write device queue depth increases memory demands, but it improves the interactive response time.
If a large file completely fills the UBC, it may take all of the pages on the free page list, which may cause the system to page excessively. The vm subsystem attribute vm-ubcseqpercent specifies the maximum amount of memory allocated to the UBC that can be used to cache a single file. The default value is 10 percent of the memory allocated to the UBC.
The vm subsystem attribute vm-ubcseqstartpercent specifies the size of the UBC, as a percentage of physical memory, at which the virtual memory subsystem starts stealing the UBC LRU pages for a file to satisfy the demand for pages. The default is 50 percent of physical memory.
Increasing the value of the vm-ubcseqpercent attribute will improve the performance of a single large file, but will decrease the remaining amount of memory. Decreasing the value of the vm-ubcseqpercent attribute will increase the available memory, but will degrade the performance of a single large file.
To force the system to reuse the pages in the UBC instead of taking pages from the free list, perform the following tasks:
Make the maximum size of the UBC greater than the size of the UBC as a percentage of memory. That is, the value of the vm subsystem attribute ubc-maxpercent (the default is 100 percent) must be greater than the value of the vm-ubcseqstartpercent attribute (the default is 50 percent).
Make the value of the vm-ubcseqpercent attribute, which specifies the size of a file as a percentage of the UBC, greater than the size of a referenced file. The default value of the vm-ubcseqpercent attribute is 10 percent.
For example, using the default values, the UBC would have to be larger than 50 percent of all memory and a file would have to be larger than 10 percent of the UBC (that is, the file size would have to be at least 5 percent of all memory) in order for the system to reuse the pages in the UBC.
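The 5 percent figure in the example above is just the product of the two defaults; the arithmetic can be checked with a one-line script (an illustration only):

```shell
# File-size threshold for UBC page reuse, as a percentage of physical
# memory: the product of the two defaults described in the text.
ubcseqstartpercent=50   # UBC must exceed this percentage of memory
ubcseqpercent=10        # file must exceed this percentage of the UBC

awk -v s="$ubcseqstartpercent" -v p="$ubcseqpercent" \
    'BEGIN { printf "file threshold: %g%% of physical memory\n", s * p / 100 }'
```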
On large-memory systems that are doing many file system operations, you may want to lower the vm-ubcseqstartpercent value to 30 percent. Do not specify a lower value unless you also decrease the size of the UBC. In this case, do not change the value of the vm-ubcseqpercent attribute.
When a read system call is made to a file system's files, the default behavior is for the file system to update both the in-memory file access time and the on-disk stat structure, which contains most of the file information that is returned by the stat(2) system call.
You can improve file system performance for proxy servers by specifying, at mount time, that the file system update only the in-memory file access time when a read system call is made to a file. The file system will update the on-disk stat structure only if the file is modified.
To enable this functionality, use the mount command with the noatimes option. See read(2) and mount(8) for more information.
Updating only the in-memory file access time for reads can improve proxy server response time by decreasing the number of disk I/O operations. However, this behavior jeopardizes the integrity of read access time updates and violates POSIX standards. Do not use this functionality if it will affect utilities that use read access times to perform tasks, such as migrating files to different devices.
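As a sketch, mounting a proxy server's cache file system with the option might look like the following; the device name and mount point are placeholders, not values from this chapter:

```shell
# Update only the in-memory access time on reads (the device and
# mount point below are placeholders for your configuration):
#   mount -o noatimes /dev/disk/dsk2c /proxy_cache
```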
Prestoserve can improve the overall run-time performance for systems that perform large numbers of synchronous writes. The prmetaonly attribute controls whether Prestoserve caches only UFS and AdvFS file system metadata, instead of both metadata and synchronous write data (the default). If the attribute is set to 1 (enabled), Prestoserve caches only file system metadata.
Caching only metadata may improve the performance of applications that access many small files or applications that access a large amount of file-system metadata but do not reread recently written data.
If the contention on the Prestoserve lock (presto_lock) is high (for example, the miss rate is a few percentage points), you may be able to improve throughput by increasing the value of the presto subsystem attribute presto-buffer-hash-size. This will decrease Prestoserve lock contention. The default value of the presto-buffer-hash-size attribute is 256 bytes. The minimum value is 0; the maximum value is 64 KB.
You can increase the minimum number of vnodes on the free list to cache more free vnodes and improve the performance of cache lookup operations. However, increasing the minimum number of vnodes will consume memory resources.
The vfs subsystem attribute min-free-vnodes specifies the minimum number of vnodes. The default value of the min-free-vnodes attribute is either 150 or the value of the nvnode kernel variable, whichever is greater.
If the value of min-free-vnodes is larger than the value of max-vnodes, vnode deallocations will not occur. If the value of min-free-vnodes is close to the value of the max-vnodes attribute, vnode deallocation will not be effective. If the value of min-free-vnodes must be close to the value of max-vnodes, you may want to disable vnode deallocation (see Section 9.2.15).
Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. On systems that need to reclaim the memory used by vnodes, make sure that the value of min-free-vnodes is significantly lower than the value of max-vnodes. See Section 5.5.1 for information about modifying max-vnodes.
You can increase the value of the vfs subsystem attribute vnode-age to increase the amount of time for which vnodes are kept on the free list. This increases the possibility that a vnode will be successfully looked up. The default value for vnode-age is 120 seconds on 32-MB or larger systems and 2 seconds on 24-MB systems.
To delay the deallocation of vnodes, increase the value of the vfs subsystem attribute namei-cache-valid-time. The default value is 1200. This can improve namei cache lookup operations, but it consumes memory resources.
To accelerate the deallocation of vnodes, decrease the value of the vfs subsystem attribute namei-cache-valid-time. The default value is 1200. This causes vnodes to be deallocated from the namei cache at a faster rate and returns memory to the operating system, but it also reduces the efficiency of the cache.
To optimize processing time, disable vnode deallocation by setting the value of the vfs subsystem attribute vnode-deallocation-enable to zero. Disabling vnode deallocation does not free memory, because memory used by the vnodes is not returned to the system. You may want to disable vnode deallocation if the value of the vfs subsystem attribute min-free-vnodes is close to the value of the max-vnodes attribute. See Section 5.5.1 for information about modifying max-vnodes.
The Advanced File System (AdvFS) provides file system features beyond those of a traditional UFS file system. Unlike the rigid UFS model in which the file system directory hierarchy (tree) is bound tightly to the physical storage, AdvFS consists of two distinct layers: the directory hierarchy layer and the physical storage layer.
The AdvFS decoupled file system structure enables you to manage the physical storage layer apart from the directory hierarchy layer. This means that you can move files between a defined group of disk volumes without changing file pathnames. Because the pathnames remain the same, the action is completely transparent to end users.
AdvFS allows you to put multiple volumes (disks, LSM volumes, or RAID storage sets) in a file domain and distribute the filesets and files across the volumes. A file's blocks usually reside together on the same volume, unless the file is striped or the volume is full. Each new file is placed on the successive volume by using round-robin scheduling.
AdvFS provides the following features:
High-performance file system
AdvFS uses an extent-based file allocation scheme that consolidates data transfers, which increases sequential bandwidth and improves performance for large data transfers. AdvFS performs large reads from disk when it anticipates a need for sequential data. AdvFS also performs large writes by combining adjacent data into a single data transfer.
Fast file system recovery
Rebooting after a system interruption is extremely fast, because AdvFS uses write-ahead logging, instead of the fsck utility, to check for and repair file system inconsistencies. The recovery speed depends on the number of uncommitted records in the log, not on the amount of data in the fileset; therefore, reboots are quick and predictable.
Online file system management
File domain defragmentation capability
Support for large files and file systems
User quotas
Support for the salvage command, which allows you to recover file data from damaged AdvFS file domains
The optional AdvFS utilities, which are licensed separately, provide the following features:
Pool of storage that allows you to add, remove, and back up disks without disrupting users or applications.
Disk spanning filesets
Ability to recover deleted files
Users can retrieve their own unintentionally deleted files from predefined trashcan directories, without assistance from system administrators.
I/O load balancing across disks
Online fileset resizing
Online file migration across disks
File-level striping
File-level striping may improve I/O bandwidth (transfer rates) by distributing file data across multiple disk volumes.
Graphical user interface (GUI) that simplifies disk and file system administration, provides status, and alerts you to potential problems
The following sections describe how to perform these tasks:
Understand AdvFS I/O queues and access structures (Section 9.3.1)
Use the AdvFS guidelines in order to set up a high-performance configuration (Section 9.3.2)
Obtain information about the AdvFS performance (Section 9.3.3)
Improve AdvFS performance by tuning the subsystem (Section 9.3.4)
See the AdvFS Guide to File System Administration for detailed information about setting up and managing AdvFS.
AdvFS is a file system option that provides many file management and performance features. You can use AdvFS instead of UFS to organize and manage your files. An AdvFS file domain can consist of multiple volumes, which can be UNIX block devices (entire disks), disk partitions, LSM logical volumes, or RAID storage sets. AdvFS filesets can span all the volumes in the file domain.
The AdvFS Utilities product, which is licensed separately from the operating system, extends the capabilities of the AdvFS file system.
The following sections describe AdvFS I/O queues and access structures.
At boot time, the system reserves a percentage of static wired physical memory for the AdvFS buffer cache, which is the part of the UBC that holds the most recently accessed pages of AdvFS file data and metadata. A disk operation is avoided if the data is later reused and the page is still in the cache (a buffer cache hit). This can improve AdvFS performance.
The amount of memory that can be allocated to the AdvFS buffer cache is specified by the advfs subsystem attribute AdvfsCacheMaxPercent. The default value is 7 percent of physical memory. See Section 6.1.2.3 for information about how the system allocates memory to the AdvFS buffer cache.
For each AdvFS volume, I/O requests are sent either to the blocking queue, which caches synchronous I/O requests, or to the lazy queue, which caches asynchronous I/O requests. Both the blocking queue and the lazy queue feed I/O requests to the device queue.
A synchronous I/O request is one that must be written to disk before the write is considered successful and the application can continue. This ensures data reliability because the write is not stored in memory to be later written to disk. Therefore, I/O requests on the blocking queue cannot be asynchronously removed, because the I/O must complete.
Asynchronous I/O requests are cached in the lazy queue and periodically flushed to disk in portions that are large enough to allow the disk drivers to optimize the order of the write.
Figure 9-1 shows the movement of synchronous and asynchronous I/O requests through the AdvFS I/O queues.
When an asynchronous I/O request enters the lazy queue, it is assigned a time stamp. The lazy queue is a pipeline that contains a sequence of queues through which an I/O request passes: the wait queue (if applicable), the ready queue, and the consol queue. An AdvFS buffer cache hit can occur while an I/O request is in any part of the lazy queue.
Detailed descriptions of the wait, ready, and consol queues are as follows:
wait queue--Asynchronous I/O requests that are waiting for an AdvFS transaction log write to complete first enter the wait queue. Each file domain has a transaction log that keeps track of fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs.
AdvFS uses write-ahead logging, which requires that when metadata is modified, the transaction log write must complete before the actual metadata is written. This ensures that AdvFS can always use the transaction log to create a consistent view of the file system metadata. After the transaction log is written, I/O requests are moved from the wait queue to the ready queue.
ready queue--Asynchronous I/O requests that are not waiting for an AdvFS transaction log write to complete enter the ready queue, where they are sorted and held until the size of the ready queue reaches the value specified by the AdvfsReadyQLim attribute, or until the update daemon flushes the data. The default value of the AdvfsReadyQLim attribute is 16,384 512-byte blocks (8 MB). You can modify the size of the ready queue for all AdvFS volumes by changing the value of the AdvfsReadyQLim attribute. You can modify the size for a specific AdvFS volume by using the chvol -t command. You can disable data caching in the ready queue and allow I/O requests to bypass the ready queue by specifying a value of 0 for the AdvfsReadyQLim attribute; however, this is not recommended. See Section 9.3.4.5 for information about tuning the ready queue.
consol queue--I/O requests are moved from the ready queue to the consol queue, which feeds the device queue.
Both the consol queue and the blocking queue feed the device queue, where logically contiguous I/O requests are consolidated into larger I/Os before they are sent to the device driver. The size of the device queue affects the amount of time it takes to complete a synchronous (blocking) I/O operation. AdvFS issues several types of blocking I/O operations, including AdvFS metadata and log data operations.
The AdvfsMaxDevQLen attribute limits the total number of I/O requests on the AdvFS device queue. The default value is 24 requests. When the number of requests exceeds this value, only synchronous requests from the blocking queue are accepted onto the device queue. Although the default value of the AdvfsMaxDevQLen attribute is appropriate for most configurations, you may need to modify this value. However, increase the default value only if devices are not being kept busy, and make sure that increasing the size of the device queue does not cause a decrease in response time. See Section 9.3.4.6 for more information about tuning the AdvFS device queue.
Use the advfsstat command to display the AdvFS queue statistics.
If your users or applications open and then reuse many files, you may be able to improve AdvFS performance by modifying how the system allocates AdvFS access structures. AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open and files that were opened but are now closed. Caching open file information can enhance AdvFS performance if the open files are later reused.
At boot time, the system reserves for AdvFS access structures a percentage of the physical memory that is not wired by the kernel or applications. Out of this pool of reserved memory, the system allocates a number of access structures and places them on the access structure free list. When a file is opened, an access structure is taken from the access structure free list. Access structures are allocated and deallocated according to the kernel configuration and workload demands.
There are two attributes that control the allocation of AdvFS access structures:
The AdvfsAccessMaxPercent attribute controls the amount of pageable memory (malloc) that is reserved for AdvFS access structures. The default value is 80 percent of pageable memory.
The AdvfsPreallocAccess attribute specifies the number of AdvFS access structures that the system allocates at startup time. The default and minimum values are 128. The maximum value is either 65536 or the value of the AdvfsAccessMaxPercent attribute, whichever is smaller.
You may be able to improve AdvFS performance by modifying the previous attributes and allocating more memory for AdvFS access structures. However, this will reduce the amount of memory available to processes and may cause excessive paging and swapping.
If you do not use AdvFS or if your workload does not reuse AdvFS files, do not allocate a large amount of memory for access structures. If you have a large-memory system, you may want to decrease the amount of memory reserved for AdvFS access structures.
See Section 9.3.4.3 for information about tuning access structures.
You will obtain the best performance if you carefully plan your AdvFS configuration. Table 9-2 lists AdvFS configuration guidelines and performance benefits as well as tradeoffs.
| Action | Performance Benefit | Tradeoff |
| Use multiple-volume file domains (Section 9.3.2.1) | Improves throughput and simplifies management | Increases chance of domain failure and may cause a log bottleneck |
| Use several file domains instead of one large domain (Section 9.3.2.1) | Prevents log from becoming a bottleneck | Increases maintenance complexity |
| Place transaction log on fast or uncongested volume (Section 9.3.2.2) | Prevents log from becoming a bottleneck | None |
| Stripe files across different disks and, if possible, different buses (Section 9.3.2.4) | Improves sequential read and write performance | Increases chance of domain failure |
| Use quotas (Section 9.3.2.5) | Controls file system space utilization | None |
The following sections describe these AdvFS configuration guidelines in detail.
Using multiple-volume file domains allows greater control over your physical resources, and may improve a fileset's total throughput. However, be sure that the log does not become a bottleneck. Multiple-volume file domains improve performance because AdvFS generates parallel streams of output using multiple device consolidation queues.
In addition, using only a few file domains instead of using many file domains reduces the overall management effort, because fewer file domains require less administration. However, a single volume failure within a file domain renders the entire file domain inaccessible. Therefore, the more volumes that you have in your file domain the greater the risk that a file domain will fail.
It is recommended that you use a maximum of 12 volumes in each file domain. However, to reduce the risk of file domain failure, limit the number of volumes per file domain to three, or mirror data with LSM or hardware RAID.
For multiple-volume domains, make sure that busy files are not located on the same volume. Use the migrate command to move files across volumes.
Each file domain has a transaction log that tracks fileset activity for all filesets in the file domain, and ensures AdvFS metadata consistency if a crash occurs. The AdvFS file domain transaction log may become a bottleneck if the log resides on a congested disk or bus, or if the file domain contains many filesets.
To prevent the log from becoming a bottleneck, put the log on a fast, uncongested volume. You may want to put the log on a disk that contains only the log. See Section 9.3.4.12 for information on moving an existing transaction log.
To make the transaction log highly available, use LSM or hardware RAID to mirror the log.
The AdvFS fileset data structure (metadata) is stored in a file called the bitfile metadata table (BMT). Each volume in a domain has a BMT that describes the file extents on the volume. If a domain has multiple volumes of the same size, files will be distributed evenly among the volumes.
The BMT is the equivalent of the UFS inode table. However, the UFS inode table is statically allocated, while the BMT expands as more files are added to the domain. Each time that AdvFS needs additional metadata, the BMT grows by a fixed size (the default is 128 pages). As a volume becomes increasingly fragmented, the space by which the BMT grows may be spread across several extents.
To monitor the BMT, use the vbmtpg command and examine the number of free mcells (freeMcellCnt). The value of freeMcellCnt can range from 0 to 22. A volume with 1 free mcell has very little space in which to grow the BMT. See vbmtpg(8) for more information.
You can also invoke the showfile command and specify mount_point/.tags/M-10 to examine the BMT extents on the first domain volume that contains the fileset mounted on the specified mount point. To examine the extents of the other volumes in the domain, specify M-16, M-24, and so on. If the extents at the end of the BMT are smaller than the extents at the beginning of the file, the BMT is becoming fragmented. See showfile(8) for more information.
If you run out of BMT disk space prematurely, you may be able to eliminate
the problem by defragmenting the file domain that contains the volume.
See
defragment(8)
for more information.
Table 9-3 provides some BMT sizing guidelines for the number of pages to preallocate for the BMT, and the number of pages by which the BMT extent size grows. The BMT sizing depends on the maximum number of files you expect to create on a volume.
| Estimated Maximum Number of Files on a Volume | Number of Pages to Preallocate | Number of Pages by Which the BMT Grows |
| < 50,000 | 3600 | 128 |
| 100,000 | 7200 | 256 |
| 200,000 | 14,400 | 512 |
| 300,000 | 21,600 | 768 |
| 400,000 | 28,800 | 1024 |
| 800,000 | 57,600 | 2048 |
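The preallocation column in the table scales linearly at roughly 72 pages per 1,000 expected files. The helper below is a sketch derived from that ratio (the function name and formula are inferred from the table, not part of the mkfdmn documentation); it turns an expected file count into a page count suitable for mkfdmn -p or addvol -p:

```shell
# Estimate the number of BMT pages to preallocate from the expected
# maximum number of files on a volume, using the ~72 pages per 1,000
# files ratio implied by the sizing table.
bmt_prealloc_pages() {
    awk -v files="$1" 'BEGIN { printf "%d\n", files * 72 / 1000 }'
}

bmt_prealloc_pages 200000   # 14400, matching the 200,000-file table row
```

For file counts that fall between the table rows, round up; remember that overallocated BMT space is permanently reserved.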
You can modify the number of extent pages by which the BMT grows
when a file domain is created or when a volume is added to the domain.
If you use the
mkfdmn -x
or the
addvol
-x
command when there is a large amount of free space on a disk,
as files are created the BMT will expand by the specified number of pages
and those pages will be in one extent.
As the disk becomes more fragmented,
the BMT will still expand, but the pages will not be contiguous and will
require more extents.
Eventually, the BMT will run out of its limited number
of extents even though the growth size is large.
To prevent this problem, you can
preallocate space for the BMT when the file domain is created, or when a
volume is added to the domain.
If you use the
mkfdmn -p
or the
addvol -p
command, the preallocated BMT is
described in one extent.
All subsequent growth will be able to utilize
nearly all of the limited number of BMT extents.
See
mkfdmn(8)
and
addvol(8).
Do not overallocate BMT space, because the disk space cannot be used for other purposes. However, if you preallocate too little BMT space, the BMT will eventually have to grow by the fixed amount; if the disk is fragmented by then, each expansion will require multiple extents.
You may be able to use the AdvFS
stripe
utility to improve the read and write
performance of an individual file by
spreading file data evenly across different disks in a file domain.
For the maximum performance benefit, stripe files across disks on
different I/O buses.
Striping files, instead of striping entire disks, is useful if an application continually accesses only a few specific files. Do not stripe both a file and the disk on which it resides.
The
stripe
utility directs a zero-length file (a file with no data written
to it yet) to be distributed evenly across a specified number of volumes.
As data is
appended to the file, the data is spread across the volumes.
The size of each data segment (also called the stripe or chunk size)
is 64 KB (65,536 bytes).
AdvFS alternates the placement of the
segments on the disks in a sequential pattern.
For example, the
first 64 KB of the file is written to the first volume, the second 64 KB
is written to the next volume, and so on.
See
stripe(8)
for more information.
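The round-robin placement of 64-KB segments can be sketched as a simple calculation (a hypothetical helper, not an AdvFS interface): the segment index is the byte offset divided by 65,536, and the volume is that index modulo the number of volumes in the stripe:

```shell
# Which volume (1-based) holds the 64-KB stripe segment containing a
# given byte offset, for a file striped across n volumes.
stripe_volume() {
    awk -v off="$1" -v n="$2" 'BEGIN { print (int(off / 65536) % n) + 1 }'
}

stripe_volume 0      3   # first 64 KB goes to volume 1
stripe_volume 65536  3   # second 64 KB goes to volume 2
stripe_volume 196608 3   # fourth 64 KB wraps around to volume 1
```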
Note
Distributing data across multiple volumes decreases data availability, because one volume failure makes the entire file domain unavailable. To make striped files highly available, you can mirror the disks on which the file is striped.
To determine if you should stripe files, use the
iostat
utility, as described in
Section 8.2.1.
The blocks per second and I/O operations
per second should be cross-checked with the disk's bandwidth capacity.
If the disk access time is slow, in comparison
to the stated capacity, then file striping may improve performance.
The performance benefit of striping also depends on the size of the average I/O transfer in relation to the data segment (stripe) size, in addition to how your users and applications perform disk I/O.
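As a rough way to cross-check iostat figures against a disk's bandwidth capacity, you can convert the blocks-per-second column (512-byte blocks) into a utilization percentage. The helper and the 10-MB/s rating below are illustrative assumptions, not values from iostat itself:

```shell
# Percentage of a disk's rated bandwidth consumed, given iostat's
# 512-byte blocks-per-second figure and the rated bandwidth in MB/s.
disk_util_pct() {
    awk -v bps="$1" -v mbs="$2" \
        'BEGIN { printf "%d\n", bps * 512 * 100 / (mbs * 1024 * 1024) }'
}

disk_util_pct 8192 10   # 8192 blk/s is 4 MB/s, or 40% of a 10-MB/s disk
```

A disk running well below its rated capacity while response times are poor is a candidate for striping.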
AdvFS quotas allow you to track and control the amount of physical storage that a user, group, or fileset consumes. AdvFS eliminates the slow reboot activities associated with UFS quotas. In addition, AdvFS quota information is always maintained, but quota enforcement can be activated and deactivated.
For information about AdvFS quotas, see the AdvFS Administration manual.
Table 9-4 describes the tools you can use to obtain information about AdvFS.
| Name | Use | Description |
| advfsstat | Displays AdvFS performance statistics (Section 9.3.3.1) | Allows you to obtain extensive AdvFS performance information, including buffer cache, fileset, volume, and bitfile metadata table (BMT) statistics, for a specific interval of time. |
| advscan | Identifies disks in a file domain (Section 9.3.3.2) | Locates pieces of AdvFS file domains on disk partitions and in LSM disk groups. |
| showfdmn | Displays detailed information about AdvFS file domains and volumes (Section 9.3.3.3) | Allows you to determine if files are evenly distributed across AdvFS volumes. For multivolume domains, the utility also displays the total volume size, the total number of free blocks, and the total percentage of volume space currently allocated. |
| showfile | Displays information about files in an AdvFS fileset (Section 9.3.3.4) | Displays detailed information about files (and directories) in an AdvFS fileset. |
| showfsets | Displays AdvFS fileset information for a file domain (Section 9.3.3.5) | Displays information about the filesets in a file domain, including the fileset names, the total number of files, the number of free blocks, the quota status, and the clone status. |
| verify | Checks the AdvFS on-disk metadata structures | Checks AdvFS on-disk structures such as the BMT, the storage bitmaps, the tag directory, and the frag file for each fileset. Verifies that the directory structure is correct, that all directory entries reference a valid file (tag), and that all files (tags) have a directory entry. |
| fsx | Exercises file systems | Exercises AdvFS and UFS file systems by creating, opening, writing, reading, validating, closing, and unlinking a test file. Errors are written to a log file. |
The following sections describe some of these commands in detail.
The
advfsstat
command displays various AdvFS performance statistics and monitors
the performance of AdvFS domains and filesets.
Use this command to obtain
detailed information, especially if the
iostat
command
output indicates a disk bottleneck (see
Section 8.2.1).
The
advfsstat
command displays detailed information
about a file domain, including information about the AdvFS buffer cache,
fileset vnode operations, locks, the namei cache, and volume I/O performance.
The command reports information in units of one disk block (512 bytes) for
each interval of time (the default is one second).
You can use the
-i
option to output information at specific time intervals.
The following example of the
advfsstat -v 2
command
shows the I/O queue statistics for the specified volume:
# /usr/sbin/advfsstat -v 2 test_domain

vol1
  rd   wr   rg  arg   wg  awg  blk  wlz  rlz  con  dev
  54    0   48  128    0    0    0    1    0    0   65
The previous example shows the following fields:
Read and write requests--Compare
the number of read requests (rd) to the number
of write requests (wr).
Read requests are blocked until
the read completes, but write requests will not block the calling thread,
which increases the throughput of multiple threads.
Consolidated reads and writes--You may be able to improve
performance by consolidating reads and writes.
The consolidated
read values (rg
and
arg) and write
values (wg
and
awg) indicate the number
of disparate reads and writes that were consolidated into a single I/O to
the device driver.
If the number of consolidated reads and writes decreases
compared to the number of reads and writes, AdvFS may not be consolidating
I/O.
I/O queue values--The
blk,
wlz,
rlz,
con, and
dev
fields can indicate potential performance issues.
The
con
value
specifies the number of entries on the consolidate queue.
These entries are ready to be consolidated and moved to the device queue.
The device queue value (dev) shows the number of I/O
requests that have been issued to the device controller.
The system must
wait for these requests to complete.
If the number of I/O requests on the device queue increases continually and you experience poor performance, applications may be I/O bound on this device. You may be able to eliminate the problem by adding more disks to the domain or by striping disks.
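One way to read the consolidation fields in the advfsstat -v output above is as a ratio of individual requests to consolidated groups; a ratio near 1 suggests little consolidation is occurring. The calculation below is an interpretation aid, not an advfsstat feature:

```shell
# Average number of individual I/O requests folded into each
# consolidated group (e.g., rd versus rg, or wr versus wg).
consolidation_ratio() {
    awk -v reqs="$1" -v groups="$2" 'BEGIN { printf "%.1f\n", reqs / groups }'
}

consolidation_ratio 54 48   # the sample rd/rg values: little consolidation
```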
You can monitor the type of requests that applications are issuing
by using the
advfsstat
command's
-f
option to display fileset vnode operations.
You can display the number of
file
creates, reads, and writes and other operations for a specified domain or
fileset.
For example:
# /usr/sbin/advfsstat -i 3 -f 2 scratch_domain fset1

lkup  crt geta read writ fsnc dsnc  rm  mv rdir mkd rmd link
   0    0    0    0    0    0    0   0   0    0   0   0    0
   4    0   10    0    0    0    0   2   0    2   0   0    0
   0    0    0    0    0    0    0   0   0    0   0   0    0
   0    0    0    0    0    0    0   0   0    0   0   0    0
  24    8   51    0    9    0    0   3   0    0   4   0    0
1201  324 2985    0  601    0    0 300   0    0   0   0    0
1275  296 3225    0  655    0    0 281   0    0   0   0    0
1217  305 3014    0  596    0    0 317   0    0   0   0    0
1249  304 3166    0  643    0    0 292   0    0   0   0    0
1175  289 2985    0  601    0    0 299   0    0   0   0    0
 779  148 1743    0  260    0    0 182   0   47   0   4    0
   0    0    0    0    0    0    0   0   0    0   0   0    0
   0    0    0    0    0    0    0   0   0    0   0   0    0
See
advfsstat(8)
for more information.
Note that it is difficult to link performance problems to some statistics, such as buffer cache statistics. In addition, lock statistics cannot be used to tune lock performance.
The
advscan
command locates pieces of AdvFS domains on disk partitions and in LSM disk
groups.
Use the
advscan
command when you have moved disks
to a new system, have moved disks around in a way that has changed device
numbers, or have lost track of where the domains are.
You can specify a list of volumes or disk groups with the
advscan
command to search all partitions and volumes.
The command determines
which partitions on a disk are part of an AdvFS file domain.
You can also use the
advscan
command for repair purposes
if you deleted the
/etc/fdmns
directory,
deleted a directory domain under
/etc/fdmns, or deleted
some links from a domain directory under
/etc/fdmns.
You can run the
advscan
command to rebuild all or part of your
/etc/fdmns
directory, or you can manually rebuild it by supplying
the names of the partitions in a domain.
The following example scans two disks for AdvFS partitions:
# /usr/advfs/advscan rz0 rz5

Scanning disks   rz0 rz5
Found domains:

usr_domain
        Domain Id        2e09be37.0002eb40
        Created          Thu Jun 26 09:54:15 1998
        Domain volumes   2
        /etc/fdmns links 2
        Actual partitions found:  rz0c  rz5c
For the following example, the
rz6
file domains were removed from
/etc/fdmns.
The
advscan
command scans device
rz6
and re-creates the missing domains.
# /usr/advfs/advscan -r rz6

Scanning disks   rz6
Found domains:

*unknown*
        Domain Id        2f2421ba.0008c1c0
        Created          Mon Jan 20 13:38:02 1998
        Domain volumes   1
        /etc/fdmns links 0
        Actual partitions found:  rz6a*

*unknown*
        Domain Id        2f535f8c.000b6860
        Created          Tue Feb 25 09:38:20 1998
        Domain volumes   1
        /etc/fdmns links 0
        Actual partitions found:  rz6b*

Creating /etc/fdmns/domain_rz6a/
        linking rz6a
Creating /etc/fdmns/domain_rz6b/
        linking rz6b
See
advscan(8)
for more information.
The
showfdmn
command displays the attributes of an AdvFS
file domain and detailed information about each volume in the file domain.
The following example of the
showfdmn
command displays
domain information for the
usr
file domain:
% /sbin/showfdmn usr

               Id               Date Created  LogPgs  Domain Name
2b5361ba.000791be  Tue Jan 12 16:26:34 1998      256  usr

  Vol  512-Blks    Free  % Used  Cmode  Rblks  Wblks  Vol Name
   1L    820164  351580     57%     on    256    256  /dev/disk/rz0d
See
showfdmn(8)
for more information about the output of the
command.
The
showfile
command displays the full storage allocation map (extent
map) for one or more files in an AdvFS fileset.
An extent is a contiguous
area of disk space that AdvFS allocates to a file.
The following example
of the
showfile
command displays the AdvFS characteristics
for all of the files in the current working directory:
# /usr/sbin/showfile *

      Id  Vol  PgSz  Pages  XtntType  Segs  SegSz  I/O    Perf  File
 22a.001    1    16      1    simple    **     **  async   50%  Mail
   7.001    1    16      1    simple    **     **  async   20%  bin
 1d8.001    1    16      1    simple    **     **  async   33%  c
1bff.001    1    16      1    simple    **     **  async   82%  dxMail
 218.001    1    16      1    simple    **     **  async   26%  emacs
 1ed.001    1    16      0    simple    **     **  async  100%  foo
 1ee.001    1    16      1    simple    **     **  async   77%  lib
 1c8.001    1    16      1    simple    **     **  async   94%  obj
 23f.003    1    16      1    simple    **     **  async  100%  sb
170a.008    1    16      2    simple    **     **  async   35%  t
   6.001    1    16     12    simple    **     **  async   16%  tmp
The
I/O
column specifies whether write operations
are forced to be synchronous.
See
Section 9.3.4.10
for information.
The following example of the
showfile
command shows
the characteristics and extent information for the
tutorial
file, which is a simple file:
# /usr/sbin/showfile -x tutorial

       Id  Vol  PgSz  Pages  XtntType  Segs  SegSz  I/O    Perf  File
4198.800d    2    16     27    simple    **     **  async   66%  tutorial

extentMap: 1
    pageOff  pageCnt  vol  volBlock  blockCnt
          0        5    2    781552        80
          5       12    2    785776       192
         17       10    2    786800       160
    extentCnt: 3
The
Perf
entry shows the efficiency of the file-extent
allocation, expressed as a percentage of the optimal extent layout.
A high
value, such as 100 percent, indicates that the AdvFS I/O subsystem is highly
efficient.
A low value indicates that files may be fragmented.
See
showfile(8)
for more information about the command output.
The
showfsets
command displays the AdvFS filesets (or clone filesets) and their characteristics
in a specified domain.
The following is an example of the
showfsets
command:
# /sbin/showfsets dmn

mnt
        Id           : 2c73e2f9.000f143a.1.8001
        Clone is     : mnt_clone
        Files        :       79,  limit =     1000
        Blocks (1k)  :      331,  limit =    25000
        Quota Status : user=on  group=on

mnt_clone
        Id           : 2c73e2f9.000f143a.2.8001
        Clone of     : mnt
        Revision     : 1
See
showfsets(8)
for information about the options and output
of the command.
After you configure AdvFS, you may be able to tune it to improve performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.
Table 9-5 lists AdvFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 9-1 apply to AdvFS configurations.
| Action | Performance Benefit | Tradeoff |
| Decrease the size of the metadata buffer cache to 1 percent (Section 6.4.6) | Improves performance for systems that use only AdvFS | None |
| Increase the percentage of memory allocated for the AdvFS buffer cache (Section 9.3.4.1) | Improves AdvFS performance if data reuse is high | Consumes memory |
| Increase the number of AdvFS buffer hash chains (Section 9.3.4.2) | Speeds lookup operations and decreases CPU usage | Consumes memory |
| Increase the memory reserved for AdvFS access structures (Section 9.3.4.3) | Improves AdvFS performance for systems that open and reuse files | Decreases the memory available to the virtual memory subsystem and the UBC |
| Defragment file domains (Section 9.3.4.4) | Improves read and write performance | None |
| Increase the amount of data cached in the ready queue (Section 9.3.4.5) | Improves asynchronous write performance | May cause I/O spikes or increase the number of lost buffers if a crash occurs |
| Decrease the maximum number of I/O requests on the device queue (Section 9.3.4.6) | Decreases the time to complete synchronous I/O requests and improves response time | May cause I/O spikes |
| Decrease the I/O transfer read-ahead size (Section 9.3.4.7) | Improves performance for mmap page faulting | None |
| Disable the flushing of dirty pages mapped with the mmap function during a sync call (Section 9.3.4.8) | May improve performance for applications that manage their own flushing | None |
| Consolidate I/O transfers (Section 9.3.4.9) | Improves AdvFS performance | None |
| Force all AdvFS file writes to be synchronous (Section 9.3.4.10) | Ensures that data is successfully written to disk | May degrade file system performance |
| Prevent partial writes (Section 9.3.4.11) | Ensures that system crashes do not cause partial disk writes | May degrade asynchronous write performance |
| Move the transaction log to a fast or uncongested volume (Section 9.3.4.12) | Prevents log from becoming a bottleneck | None |
| Balance files across volumes in a file domain (Section 9.3.4.13) | Improves performance and evens the future distribution of files | None |
| Migrate frequently used or large files to different file domains (Section 9.3.4.14) | Improves I/O performance | None |
The following sections describe the AdvFS tuning recommendations in detail.
The
advfs
subsystem
attribute
AdvfsCacheMaxPercent
specifies the maximum percentage of physical memory that can be
used to cache AdvFS file data.
Caching AdvFS data can improve I/O performance
only
if the cached data is reused.
If data reuse is high, you may be able to improve AdvFS performance
by
increasing the percentage of memory
allocated to the AdvFS buffer cache.
To do this, increase the value of the
AdvfsCacheMaxPercent
attribute.
The default is 7 percent
of memory, and the maximum is 30 percent.
If you increase the value of the
AdvfsCacheMaxPercent
attribute and experience no performance benefit, return to the original value.
Note that the AdvFS buffer cache cannot be more than 50 percent of the UBC.
Increasing the memory allocated to the AdvFS buffer cache will
decrease the amount of memory available for processes;
make sure that you do not cause excessive
paging and swapping.
Use the
vmstat
command to check
virtual memory statistics, as described in
Section 6.3.2.
If your workload does not reuse AdvFS data or if you have more than 2 GB of memory, you may want to decrease the size of the AdvFS buffer cache. The minimum value is 1 percent of physical memory. This can improve performance, because it decreases the overhead associated with managing the cache and also frees memory.
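To see how much memory a given AdvfsCacheMaxPercent setting commits, multiply physical memory by the percentage. The helper below is a sketch with assumed machine sizes:

```shell
# Memory (in MB) that the AdvFS buffer cache may claim at a given
# AdvfsCacheMaxPercent setting.
advfs_cache_mb() {
    awk -v mem_mb="$1" -v pct="$2" 'BEGIN { printf "%d\n", mem_mb * pct / 100 }'
}

advfs_cache_mb 2048 7    # default 7 percent of a 2-GB system: 143 MB
advfs_cache_mb 2048 30   # the 30 percent maximum on the same system: 614 MB
```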
See Section 4.4 for information about modifying kernel subsystem attributes.
The
hash chain table for the AdvFS buffer cache is used to locate
pages of AdvFS file data in memory.
The table contains a number of
hash chains, which contain
elements that point to pages of file system data that have already been read
into memory.
When a
read
or
write
system
call is done for a particular offset within an AdvFS file, the system sequentially
searches the appropriate hash chain to determine if the file data is already
in memory.
The value of the
advfs
subsystem
attribute
AdvfsCacheHashSize
specifies the number of
hash chains on the table.
The
default value is either 8192 or 10 percent of the size of the AdvFS buffer
cache in pages (rounded up to the next power of 2), whichever is smaller.
The minimum value is 1024 chains.
The maximum value is either 65536 or the
size of the AdvFS buffer cache in pages, whichever is smaller.
The
AdvfsCacheMaxPercent
attribute specifies the
size of the AdvFS buffer cache (see
Section 9.3.4.1).
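The default computation described above can be sketched as follows (a model of the documented rule, not kernel code): take 10 percent of the buffer cache size, round up to a power of 2, and cap the result at 8192:

```shell
# Default number of AdvfsCacheHashSize hash chains for a given AdvFS
# buffer cache size (in pages): min(8192, next power of 2 >= cache/10).
hash_chain_default() {
    awk -v cache="$1" 'BEGIN {
        p = 1
        while (p < cache / 10) p *= 2
        if (p > 8192) p = 8192
        print p
    }'
}

hash_chain_default 20000    # 10% is 2000, rounded up to 2048
hash_chain_default 200000   # 10% is 20000, capped at the 8192 default
```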
If you have more than 4 GB of memory, you may want to increase the value
of the
AdvfsCacheHashSize
attribute, which will increase
the number of hash chains on the table.
The more hash chains on the table,
the shorter each chain.
Shorter hash chains contain fewer elements to
search, which results in faster searches and decreased CPU usage.
For example, you can double the default value of the
AdvfsCacheHashSize
attribute if the system is experiencing high
CPU system time, or if a kernel profile shows a high percentage of CPU usage
in the
find_page
routine.
Increasing the size of the AdvFS buffer cache hash table will increase the amount of kernel wired memory in the system.
See Section 4.4 for information about modifying kernel subsystem attributes.
At boot time, the system reserves a percentage of pageable memory (memory that is not wired by the kernel or applications) for AdvFS access structures. If your system opens and then reuses many files (for example, if you have a proxy server), you may be able to improve AdvFS performance by increasing the number of AdvFS access structures that the system places on the access structure free list at startup time.
AdvFS access structures are in-memory data structures that AdvFS uses to cache low-level information about files that are currently open, and files that were opened but are now closed. Increasing the number of access structures on the free list allows more open file information (metadata) to remain in the cache, which can improve AdvFS performance if the files are reused. See Section 9.3.1.2 for more information about access structures.
Use the
advfs
subsystem attribute
AdvfsPreallocAccess
to modify the number of AdvFS access
structures that
the system allocates at startup time.
The default and minimum values are 128
if you have a mounted AdvFS fileset.
The maximum value is either 65536 or the value of the
advfs
subsystem attribute
AdvfsAccessMaxPercent,
whichever is the smallest value.
The
AdvfsAccessMaxPercent
attribute specifies the maximum
percentage of pageable memory (malloc
pool) that can
be reserved for AdvFS
access structures.
The minimum value is 5 percent of pageable memory, and
the maximum value is 95 percent.
The default value is 80 percent.
Increasing the value of the
AdvfsAccessMaxPercent
attribute allows you to allocate more memory resources for access structures,
which may improve AdvFS performance on systems that open and reuse many
files.
However, increasing the memory available for access structures will
decrease the memory that is available to processes, which may cause excessive
paging and swapping.
Decreasing the value of the
AdvfsAccessMaxPercent
attribute frees pageable memory but you will be able to allocate less memory
for AdvFS access structures, which may degrade AdvFS performance on systems
that open and reuse many files.
See Section 4.4 for information about modifying kernel subsystem attributes.
AdvFS attempts to store file data in a collection of contiguous blocks (a file extent) on a disk. If all data in a file is stored in contiguous blocks, the file has one file extent. However, as files grow, contiguous blocks on the disk may not be available to accommodate the new data, so the file must be spread over discontiguous blocks and multiple file extents.
File fragmentation degrades read and write performance because many disk addresses must be examined to access a file. In addition, if a domain has a large number of small files, you may prematurely run out of disk space due to fragmentation.
Use the
defragment
utility to reduce the amount of
file
fragmentation in a file domain by attempting to make the files more contiguous,
which reduces the number of file extents.
The utility does not affect data
availability and is transparent to users and applications.
Striped files
are not defragmented.
Use the
defragment
utility with the
-v
and
-n
options to show the amount of file
fragmentation.
You can improve the efficiency of the defragmenting process by deleting
any unneeded files in the file domain before running the
defragment
utility.
See
defragment(8)
for more information.
AdvFS caches asynchronous I/O requests in the AdvFS buffer cache. If the cached data is later reused, pages can be retrieved from memory, and a disk operation is avoided.
Asynchronous I/O requests are sorted in the ready queue and remain there
until the size of the queue reaches the value specified by the
AdvfsReadyQLim
attribute or until
the
update
daemon flushes the data.
The default value of the
AdvfsReadyQLim
attribute is 16,384 512-byte blocks (8 MB).
See
Section 9.3.1.1
for more information about AdvFS queues.
You can modify the size of the ready queue for all AdvFS volumes by
changing the value of the
AdvfsReadyQLim
attribute.
You
can modify the size for a specific AdvFS volume by using the
chvol
-t
command.
See
chvol(8)
for more information.
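Because AdvfsReadyQLim and chvol -t express the ready-queue size in 512-byte blocks, converting a target size in megabytes is a one-line calculation (the helper name is illustrative):

```shell
# Convert a desired ready-queue size in MB to 512-byte blocks, the
# unit used by AdvfsReadyQLim and chvol -t.
mb_to_blocks() {
    awk -v mb="$1" 'BEGIN { print mb * 1024 * 1024 / 512 }'
}

mb_to_blocks 8    # 16384 blocks, the AdvfsReadyQLim default (8 MB)
```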
If you have high data reuse (data is repeatedly read and written), you may want to increase the size of the ready queue. This can increase the number of AdvFS buffer cache hits. If you have low data reuse, you can decrease the threshold, but it is recommended that you use the default value.
If you change the size of the ready queue and performance does not improve, return to the original value.
Although you can specify
a value of 0 for the
AdvfsReadyQLim
attribute to disable
data caching in the ready queue and allow I/O requests to bypass the ready
queue, this is not recommended.
See Section 4.4 for information about modifying kernel subsystem attributes.
Small, logically contiguous synchronous and asynchronous AdvFS I/O requests are consolidated into larger I/O requests on the device queue, before they are sent to the device driver. See Section 9.3.1.1 for more information about AdvFS queues.
The
AdvfsMaxDevQLen
attribute controls the maximum
number of I/O requests on the device queue.
When the number of requests on
the queue exceeds this value, only synchronous requests are accepted onto
the device queue.
The default value of the
AdvfsMaxDevQLen
attribute is 24 requests.
Although the default value of the
AdvfsMaxDevQLen
attribute is appropriate for most configurations, you may need to modify this
value.
Increase the default value of the
AdvfsMaxDevQLen
attribute only if devices are not being kept busy.
A guideline is to specify a value for the
AdvfsMaxDevQLen
attribute that is less than or equal to the average number of I/O operations
that can be performed in 0.5 seconds.
Make sure that increasing
the size of the device queue does not cause a decrease in response time.
To calculate response time, multiply the value of the
AdvfsMaxDevQLen
attribute by the average I/O latency time for your disks.
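The sizing rule above can be worked through numerically (a sketch that assumes a 10-ms average per-I/O latency): the suggested ceiling is the number of I/Os the disk completes in 0.5 seconds, and the worst-case response time is the queue length times the latency:

```shell
# Device-queue guideline: given the average per-I/O latency (ms) and a
# proposed AdvfsMaxDevQLen value, print the suggested ceiling and the
# resulting worst-case response time.
devq_guideline() {
    awk -v lat_ms="$1" -v qlen="$2" 'BEGIN {
        printf "suggested max queue length: %d\n", 500 / lat_ms
        printf "worst-case response time: %.2f s\n", qlen * lat_ms / 1000
    }'
}

devq_guideline 10 24   # 10-ms disks, default queue of 24 requests
```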
Decreasing the size of the device queue decreases the amount of time it takes to complete a synchronous (blocking) I/O operation and can improve response time.
If you do not want to limit the number of requests on the device queue,
set the value of the
AdvfsMaxDevQLen
attribute to 0 (zero),
although this behavior is not recommended.
See Section 4.4 for information about modifying kernel subsystem attributes.
AdvFS reads and writes data in transfers of a fixed number of 512-byte blocks. The default transfer size depends on the disk driver's reported preferred transfer size; for example, a common default value is either 128 blocks or 256 blocks.
Use the
chvol
command with the
-r
option to change the read-ahead size.
You may be able to improve performance for
mmap
page
faulting and reduce read-ahead paging and cache dilution by decreasing the
read-ahead size.
Use the
chvol
command with the
-w
option to change the write-consolidation size.
See
chvol(8)
for more information.
If the disk is fragmented so that the pages of a file are not sequentially
allocated, reduce fragmentation by using the
defragment
utility.
See
defragment(8)
for more information.
The AdvFS buffer cache can contain modified data due to a
write
system call or a memory write reference after an
mmap
system call.
The
update
daemon runs every
30 seconds and issues a
sync
call for every fileset mounted
with read and write access.
The
AdvfsSyncMmapPages
attribute controls whether
modified (dirty) mmapped pages are flushed to disk during a
sync
system call.
If the
AdvfsSyncMmapPages
attribute
is set to 1 (the default), the modified mmapped pages are asynchronously
written to disk.
If the
AdvfsSyncMmapPages
attribute
is set to 0, modified mmapped pages are not written to disk during a
sync
system call.
If your applications manage their own
mmap
page flushing,
set the value of the
AdvfsSyncMmapPages
attribute to zero.
See
mmap(2)
and
msync(2)
for more information.
See Section 4.4 for information about modifying kernel subsystem attributes.
By default,
AdvFS consolidates a number of I/O transfers into a single, large I/O
transfer, which can improve AdvFS performance.
To enable the consolidation
of
I/O transfers, use the
chvol
command with the
-c on
option.
It is recommended that you not
disable the consolidation of I/O transfers.
See
chvol(8)
for more information.
By default, asynchronous
write requests are cached in the AdvFS buffer cache, and the
write
system call then returns a success value.
The data is written to
disk at a later time (asynchronously).
You can use the
chfile -l on
command to force all
write requests to a specified AdvFS file to be synchronous.
If you enable
forced synchronous writes on a file, data must be successfully written to
disk before the
write
system call will return a success
value.
This behavior is similar to the behavior associated with a file that
has been opened with the
O_SYNC
option; however, forcing
synchronous writes persists across
open()
calls.
Forcing all writes to a file to be synchronous ensures that the write
has completed when the
write
system call returns a success
value.
However, it may degrade performance.
A file cannot have both forced synchronous writes enabled and atomic write data logging enabled. See Section 9.3.4.11 for more information.
Use the
chfile
command
to determine whether forced synchronous writes or atomic write data logging
is enabled.
Use the
chfile -l off
command to disable
forced synchronous writes (the default behavior).
AdvFS writes data to disk in 8-KB chunks. By default, and in accordance with POSIX standards, AdvFS does not guarantee that all or part of the data will actually be written to disk if a crash occurs during or immediately after the write. For example, if the system crashes during a write that consists of two 8-KB chunks of data, only a portion (anywhere from 0 to 16 KB) of the total write may have succeeded. This can result in partial data writes and inconsistent data.
To prevent partial writes if a system crash occurs, use the
chfile -L on
command to enable atomic write data logging for a
specified file.
By default, each file domain has a transaction log file that tracks fileset activity and ensures that AdvFS can maintain a consistent view of the file system metadata if a crash occurs. If you enable atomic write data logging on a file, data from a write call will be written to the transaction log file before it is written to disk. If a system crash occurs during or immediately after the write call, upon recovery, the data in the log file can be used to reconstruct the write. This guarantees that each 8-KB chunk of a write either is completely written to disk or is not written to disk.
For example, if atomic write data logging is enabled and a crash occurs during a write that consists of two 8-KB chunks of data, the write can have three possible states: none of the data is written, 8 KB of the data is written, or 16 KB of data is written.
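Because atomicity is guaranteed per 8-KB chunk, the number of possible on-disk states after a crash follows from how many chunks a write spans (one more state than the chunk count, counting the case where nothing lands). A small arithmetic sketch:

```shell
# Number of 8-KB chunks that an AdvFS write of a given byte count
# spans; with atomic write data logging enabled, each chunk either
# lands completely or not at all.
write_chunks() {
    awk -v bytes="$1" 'BEGIN { print int((bytes + 8191) / 8192) }'
}

write_chunks 16384   # a 16-KB write spans 2 chunks: 0, 8, or 16 KB lands
```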
Atomic write data logging may degrade AdvFS write performance because
of the extra write to the transaction log file.
In addition, a file that
has atomic write data logging enabled cannot be memory mapped by using the
mmap
system call.
A file cannot have both forced synchronous writes enabled (see
Section 9.3.4.10) and atomic write data logging enabled.
However,
you can enable atomic write data logging on a file and also open the file
with an
O_SYNC
option.
This ensures that the write is synchronous,
but also prevents partial writes if a crash occurs.
Use the
chfile
command to
determine if forced synchronous writes or atomic write data logging is enabled.
Use the
chfile -L off
command to disable atomic write
data logging (the default).
To enable atomic write data logging on AdvFS files that are NFS mounted,
the NFS property list daemon,
proplistd, must be running
on the NFS client and the fileset must be mounted on the client by using
the
mount
command's
proplist
option.
If atomic write data logging is enabled and you are writing to a file that has been NFS mounted, the offset into the file must be on an 8-KB page boundary, because NFS performs I/O on 8-KB page boundaries.
You can also activate and deactivate atomic data logging by
using the
fcntl
system call.
In addition, both the chfile command and the fcntl system call can be used on an NFS client to activate or deactivate this feature on a file that resides on the NFS server.
Make sure that the AdvFS transaction log resides on an uncongested disk and bus or performance may be degraded.
Use the showfdmn command to determine the current location of the transaction log. In the showfdmn command display, the letter L appears next to the volume that contains the log. If the transaction log becomes a bottleneck, use the switchlog command to relocate the transaction log of the specified file domain to a faster or less congested volume in the same domain. See switchlog(8) and showfdmn(8) for more information.
In addition, you can divide the file domain into several smaller file domains. This will cause each domain's transaction log to handle transactions for fewer filesets.
If the files in a multivolume domain are not evenly distributed, performance may be degraded. Use the balance utility to distribute the percentage of used space evenly between the volumes in a multivolume file domain. This improves performance and evens the distribution of future file allocations. Files are moved from one volume to another until the percentage of used space on each volume in the domain is as equal as possible. The balance utility does not affect data availability and is transparent to users and applications.
If possible, run the defragment utility before you balance files. The balance utility does not generally split files. Therefore, file domains with very large files may not balance as evenly as file domains with smaller files. See balance(8) for more information.
To determine if you need to balance your files across volumes, use the showfdmn command to display information about the volumes in a domain. The % used field shows the percentage of volume space that is currently allocated to files or metadata (fileset data structure). See showfdmn(8) for more information.
Performance may degrade if too many frequently accessed or large files reside on the same volume in a multivolume file domain. You can improve I/O performance by altering the way files are mapped on the disk.
Use the migrate utility to move frequently accessed or large files to different volumes in the file domain. You can specify the volume where a file is to be moved, or allow the system to pick the best space in the file domain. You can migrate either an entire file or specific pages to a different volume. Using the balance utility after migrating files may cause the files to move to a different volume. See balance(8) for more information.
In addition, a file that is migrated is defragmented at the same time, if possible. Defragmentation makes the file more contiguous, which improves performance. Therefore, you can use the migrate command to defragment selected files. See migrate(8) for more information.
Use the iostat command to identify which disks are being heavily used. See Section 8.2.1 for information.
The UNIX file system (UFS) can provide you with high-performance file system operations, especially for critical applications. For example, UFS file reads from striped disks can be 50 percent faster than if you are using AdvFS, and will consume only 20 percent of the CPU power that AdvFS requires.
However, unlike AdvFS, the UFS file system directory hierarchy is bound tightly to a single disk partition.
The following sections describe how to perform these tasks:
Use the UFS guidelines to set up a high-performance configuration (Section 9.4.1)
Obtain information about UFS performance (Section 9.4.2)
Tune UFS in order to improve performance (Section 9.4.3)
A number of parameters can improve UFS performance. You can set all of these parameters when you use the newfs command to create a file system. For existing file systems, you can modify some parameters by using the tunefs command. See newfs(8) and tunefs(8) for more information.
Table 9-6 describes UFS configuration guidelines and performance benefits as well as tradeoffs.
| Action | Performance Benefit | Tradeoff |
| Make the file system fragment size equal to the block size (Section 9.4.1.1) | Improves performance for large files | Wastes disk space for small files |
| Use the default file system fragment size of 1 KB (Section 9.4.1.1) | Uses disk space efficiently | Increases the overhead for large files |
| Reduce the density of inodes on a file system (Section 9.4.1.2) | Frees disk space for file data and improves large file performance | Reduces the number of files that can be created on the file system |
| Allocate blocks sequentially (Section 9.4.1.3) | Improves performance for disks that do not have a read ahead cache | Reduces the total available disk space |
| Increase the number of blocks combined for a cluster (Section 9.4.1.4) | May decrease number of disk I/O operations | May require more memory to buffer data |
| Use a Memory File System (MFS) (Section 9.4.1.5) | Improves I/O performance | Does not ensure data integrity because of cache volatility |
| Use disk quotas (Section 9.4.1.6) | Controls disk space utilization | UFS quotas may result in a slight increase in reboot time |
The following sections describe the UFS configuration guidelines in detail.
The UFS file system block size is 8 KB. The default fragment size is 1 KB. You can use the newfs command to modify the fragment size so that it is 12.5, 25, 50, or 100 percent of the block size.
Although the default fragment size uses disk space efficiently, it increases the overhead for large files. If the average file in a file system is larger than 16 KB but less than 96 KB, you may be able to improve disk access time and decrease system overhead by making the file system fragment size equal to the default block size (8 KB).
See newfs(8) for more information.
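The space tradeoff described above can be estimated with a simple model: a file's on-disk space is its size rounded up to whole fragments. The sketch below is illustrative only and ignores inode and indirect-block overhead:

```python
def space_used(file_size, frag_size):
    """Disk space consumed by one file: size rounded up to whole fragments.
    Simplified model -- ignores indirect blocks and inode overhead."""
    frags = -(-file_size // frag_size)  # ceiling division
    return frags * frag_size

# A 1-KB file fits exactly in one 1-KB fragment, but occupies a
# full 8 KB when the fragment size equals the 8-KB block size.
print(space_used(1024, 1024))   # 1024
print(space_used(1024, 8192))   # 8192
```

This shows why an 8-KB fragment size wastes up to 7 KB per small file while eliminating fragment bookkeeping overhead for large files.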
An inode describes an individual file in the file system. The maximum number of files in a file system depends on the number of inodes and the size of the file system. The system creates an inode for each 4 KB (4096 bytes) of data space in a file system.
If a file system will contain many large files and you are sure that you will not create a file for each 4 KB of space, you can reduce the density of inodes on the file system. This will free disk space for file data, but will reduce the number of files that can be created.
To do this, use the newfs -i command to specify the amount of data space allocated for each inode. See newfs(8) for more information.
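The effect of the inode density on the maximum file count can be computed directly; a sketch, using the one-inode-per-4096-bytes default stated above:

```python
def inode_count(fs_bytes, bytes_per_inode=4096):
    """Approximate number of inodes created for a file system:
    one inode per bytes_per_inode of data space (newfs -i changes it)."""
    return fs_bytes // bytes_per_inode

one_gb = 1024 ** 3
print(inode_count(one_gb))          # 262144 files possible at the default
print(inode_count(one_gb, 16384))   # 65536 -- fewer inodes, more data space
```

Quadrupling the bytes-per-inode value cuts the number of possible files to a quarter, freeing the space those inodes would have occupied for file data.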
The UFS rotdelay parameter specifies the time, in milliseconds, to service a transfer completion interrupt and initiate a new transfer on the same disk. You can set the rotdelay parameter to 0 (the default) to allocate blocks sequentially. This is useful for disks that do not have a read-ahead cache. However, it will reduce the total amount of available disk space. Use either the tunefs command or the newfs command to modify the rotdelay value. See newfs(8) and tunefs(8) for more information.
The value of the UFS maxcontig parameter specifies the number of blocks that can be combined into a single cluster (or file-block group). The default value of maxcontig is 8. The file system attempts I/O operations in a size that is determined by the value of maxcontig multiplied by the block size (8 KB). Device drivers that can chain several buffers together in a single transfer should use a maxcontig value that is equal to the maximum chain length. This may reduce the number of disk I/O operations. However, more memory will be needed to buffer data. Use the tunefs command or the newfs command to change the value of maxcontig. See newfs(8) and tunefs(8) for more information.
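The transfer size implied by a given maxcontig value follows directly from the rule above (maxcontig times the 8-KB block size); a small worked calculation:

```python
BLOCK_SIZE = 8 * 1024  # UFS block size (8 KB)

def cluster_io_size(maxcontig):
    """Size of the I/O the file system attempts: maxcontig * block size."""
    return maxcontig * BLOCK_SIZE

print(cluster_io_size(8) // 1024)   # 64  -- the default 64-KB transfer
print(cluster_io_size(16) // 1024)  # 128 -- for a driver that chains 16 buffers
```

A driver that can chain 16 buffers per transfer would thus attempt 128-KB I/O operations, at the cost of more memory to buffer data.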
Memory File System (MFS) is a UFS file system that resides only in memory. No permanent data or file structures are written to disk. An MFS file system can improve read/write performance, but it is a volatile cache. The contents of an MFS file system are lost after a reboot, unmount operation, or power failure.
Because no data is written to disk, an MFS file system is a very fast file system and can be used to store temporary files or read-only files that are loaded into it after it is created. For example, if you are performing a software build that would have to be restarted if it failed, use an MFS file system to cache the temporary files that are created during the build and reduce the build time.
You can specify UFS file system limits for user accounts and for groups by setting up file system quotas, also known as disk quotas. You can apply quotas to file systems to establish a limit on the number of blocks and inodes (or files) that a user account or a group of users can allocate. You can set a separate quota for each user or group of users on each file system.
You may want to set quotas on file systems that contain home directories, because the sizes of these file systems can increase more significantly than other file systems. Do not set quotas on the /tmp file system.
Note that, unlike AdvFS quotas, UFS quotas may cause a slight increase in reboot time. For information about AdvFS quotas, see Section 9.3.2.5.
For information about UFS quotas, see the System Administration manual.
Table 9-7 describes the tools you can use to obtain information about UFS.
| Name | Use | Description |
| dumpfs | Displays UFS information (Section 9.4.2.1) | Displays detailed information about a UFS file system or a special device, including information about the file system fragment size, the percentage of free space, super blocks, and the cylinder groups. |
| dbx | Reports UFS clustering statistics (Section 9.4.2.2) | Reports statistics on how the system is performing cluster read and write transfers. |
| dbx | Reports UFS metadata buffer cache statistics (Section 9.4.2.3) | Reports statistics on the metadata buffer cache, including superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries. |
| fsx | Exercises file systems | Exercises UFS and AdvFS file systems by creating, opening, writing, reading, validating, closing, and unlinking a test file. Errors are written to a log file. |
The following sections describe some of these commands in detail.
The dumpfs command displays UFS information, including super block and cylinder group information, for a specified file system. Use this command to obtain information about the file system fragment size and the minimum free space percentage. The following example shows part of the output of the dumpfs command:
# /usr/sbin/dumpfs /devices/disk/rr3zg | more
magic    11954   format  dynamic  time  Tue Sep 14 15:46:52 1998
nbfree   21490   ndir    9        nifree  99541   nffree  60
ncg      65      ncyl    1027     size    409600  blocks  396062
bsize    8192    shift   13       mask    0xffffe000
fsize    1024    shift   10       mask    0xfffffc00
frag     8       shift   3        fsbtodb 1
cpg      16      bpg     798      fpg     6384    ipg     1536
minfree  10%     optim   time     maxcontig 8     maxbpg  2048
rotdelay 0ms     headswitch 0us   trackseek 0us   rps     60
The information contained in the first lines is relevant for tuning. Of specific interest are the following fields:
bsize--The block size of the file system, in bytes (8 KB).
fsize--The fragment size of the file system, in bytes. For optimum I/O performance, you can modify the fragment size.
minfree--The percentage of space held back from normal users; the minimum free space threshold.
maxcontig--The maximum number of contiguous blocks that will be laid out before forcing a rotational delay; that is, the number of blocks that are combined into a single read request.
maxbpg--The maximum number of blocks any single file can allocate out of a cylinder group before it is forced to begin allocating blocks from another cylinder group. A large value for maxbpg can improve performance for large files.
rotdelay--The expected time (in milliseconds) to service a transfer completion interrupt and initiate a new transfer on the same disk. It is used to decide how much rotational spacing to place between successive blocks in a file. If rotdelay is zero, blocks are allocated contiguously.
See Section 9.4.3 for information about tuning UFS.
To determine how efficiently the system is performing cluster read and write transfers, use the dbx print command to examine the ufs_clusterstats data structure.
The following example shows a system that is not clustering efficiently:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print ufs_clusterstats
struct {
    full_cluster_transfers = 3130
    part_cluster_transfers = 9786
    non_cluster_transfers = 16833
    sum_cluster_transfers = {
        [0] 0
        [1] 24644
        [2] 1128
        [3] 463
        [4] 202
        [5] 55
        [6] 117
        [7] 36
        [8] 123
        [9] 0
    }
}
(dbx)
The preceding example shows 24644 single-block transfers and no 9-block transfers. A single block is 8 KB. The trend of the data shown in the example is the reverse of what you want to see: a large number of single-block transfers and a declining number of multiblock (2-block to 9-block) transfers. However, if the files are all small, this may be the best blocking that you can achieve.
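One way to quantify how well the system is clustering is the share of transfers that used a full cluster, computed here from the counters in the example output:

```python
# Counter values from the example ufs_clusterstats output
full, part, non = 3130, 9786, 16833

total = full + part + non
pct_full = full / total * 100
print(round(pct_full, 1))  # 10.5
```

Only about 10.5 percent of transfers in the example are fully clustered, which is consistent with the text's reading that the files are fragmented (or genuinely small).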
You can examine cluster reads and writes separately with the ufs_clusterstats_read and ufs_clusterstats_write data structures. See Section 9.4.3 for information on tuning UFS.
The metadata buffer cache contains UFS file metadata--superblocks, inodes, indirect blocks, directory blocks, and cylinder group summaries. To check the metadata buffer cache, use the dbx print command to examine the bio_stats data structure. Consider the following example:
# /usr/ucb/dbx -k /vmunix /dev/mem
(dbx) print bio_stats
struct {
    getblk_hits = 4590388
    getblk_misses = 17569
    getblk_research = 0
    getblk_dupbuf = 0
    getnewbuf_calls = 17590
    getnewbuf_buflocked = 0
    vflushbuf_lockskips = 0
    mntflushbuf_misses = 0
    mntinvalbuf_misses = 0
    vinvalbuf_misses = 0
    allocbuf_buflocked = 0
    ufssync_misses = 0
}
(dbx)
If the miss rate is high, you may want to raise the value of the bufcache attribute. The number of block misses (getblk_misses) divided by the sum of block misses and block hits (getblk_hits) should not be more than 3 percent.
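Applying that formula to the figures in the example output:

```python
# Values from the example bio_stats output
getblk_hits, getblk_misses = 4590388, 17569

miss_rate = getblk_misses / (getblk_misses + getblk_hits)
print(round(miss_rate * 100, 2))  # 0.38
```

At roughly 0.38 percent, the example system is well under the 3 percent threshold, so no increase in bufcache would be warranted.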
See Section 9.4.3.1 for information on how to tune the metadata buffer cache.
After you configure your UFS file systems, you may be able to improve UFS performance. To successfully improve performance, you must understand how your applications and users perform file system I/O, as described in Section 2.1.
Table 9-8 describes UFS tuning guidelines and performance benefits as well as tradeoffs. In addition, the recommendations described in Table 9-1 apply to UFS configurations.
| Action | Performance Benefit | Tradeoff |
| Increase size of metadata buffer cache to more than 3 percent of main memory (Section 9.4.3.1) | Increases cache hit rate and improves UFS performance | Requires additional memory resources |
| Increase the size of the metadata hash chain table (Section 9.4.3.2) | Improves UFS lookup speed | Increases wired memory |
| Defragment the file system (Section 9.4.3.3) | Improves read and write performance | Requires down time |
| Delay flushing full write buffers to disk (Section 9.4.3.4) | Frees CPU cycles | May degrade real-time workload performance when buffers are flushed |
| Increase number of blocks combined for read ahead (Section 9.4.3.5) | May reduce disk I/O operations | May require more memory to buffer data |
| Increase number of blocks combined for a cluster (Section 9.4.3.6) | May decrease disk I/O operations | Reduces available disk space |
| Increase the smooth sync caching threshold for asynchronous UFS I/O requests (Section 9.4.3.7) | Improves performance of UFS asynchronous I/O | None |
| Increase the maximum number of UFS and MFS mounts (Section 9.4.3.8) | Allows more mounted file systems | Requires additional memory resources |
The following sections describe how to tune UFS in detail.
At boot time, the kernel wires a percentage of physical memory for the metadata buffer cache, which temporarily holds recently accessed UFS and CD-ROM File System (CDFS) metadata. The vfs subsystem attribute bufcache specifies the size of the metadata buffer cache as a percentage of physical memory. The default is 3 percent.
Usually, you do not have to increase the cache size. However, you may want to increase the size of the metadata buffer cache if you reuse data and have a high cache miss rate (low hit rate).
To determine whether to increase the size of the metadata buffer cache, use the dbx print command to examine the bio_stats data structure. The miss rate (block misses divided by the sum of the block misses and block hits) should not be more than 3 percent. If you have a general-purpose timesharing system, do not increase the value of the bufcache attribute to more than 10 percent. If you have an NFS server that does not perform timesharing, do not increase the value of the bufcache attribute to more than 35 percent.
Allocating additional memory to the metadata buffer cache reduces the amount of memory available to processes and the UBC. See Section 6.1.2.1 for information about how memory is allocated to the metadata buffer cache.
See Section 4.4 for information about modifying kernel subsystem attributes.
The hash chain table for the metadata buffer cache stores the heads of the hashed buffer queues. Increasing the size of the hash chain table distributes the buffers, which makes average chain lengths short. This can improve lookup speeds. However, increasing the size of the hash chain table increases wired memory.
The vfs subsystem attribute buffer-hash-size specifies the size of the hash chain table, in table entries, for the metadata buffer cache. The minimum size is 16; the maximum size is 524287. The default value is 512.
You can modify the value of the buffer-hash-size attribute so that each hash chain has 3 or 4 buffers. To determine a value for the buffer-hash-size attribute, use the dbx print command to examine the value of the nbuf kernel variable, then divide the value by 3 or 4, and finally round the result to a power of 2. For example, if nbuf has a value of 360, dividing 360 by 3 gives you a value of 120. Based on this calculation, specify 128 (2 to the power of 7) as the value of the buffer-hash-size attribute.
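The sizing rule above can be sketched as a small helper. This is an illustration only; the rounding-up behavior and the clamp to the documented 16-524287 range are assumptions layered on the manual's divide-by-3-and-round recipe:

```python
def buffer_hash_size(nbuf, buffers_per_chain=3):
    """Suggested buffer-hash-size: nbuf divided by 3 (or 4), rounded up
    to a power of 2, clamped to the documented 16..524287 range."""
    target = max(1, nbuf // buffers_per_chain)
    size = 1
    while size < target:       # find the next power of 2 >= target
        size *= 2
    return min(max(size, 16), 524287)

print(buffer_hash_size(360))  # 128, matching the worked example
```

For the manual's example of nbuf = 360, the helper reproduces the recommended value of 128.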
See Section 4.4 for information about modifying kernel subsystem attributes.
When a file consists of noncontiguous file extents, the file is considered fragmented. A very fragmented file decreases UFS read and write performance, because it requires more I/O operations to access the file.
You can determine whether the files in a file system are fragmented by determining how effectively the system is clustering. You can do this by using the dbx print command to examine the ufs_clusterstats, ufs_clusterstats_read, and ufs_clusterstats_write data structures. See Section 9.4.2.2 for information.
UFS block clustering is usually efficient. If the numbers from the UFS clustering kernel structures show that clustering is not being particularly effective, the files in the file system may be very fragmented.
To defragment a UFS file system, follow these steps:
Back up the file system onto tape or another partition.
Create a new file system either on the same partition or a different partition.
Restore the file system.
See the System Administration manual for information about backing up and restoring data and creating UFS file systems.
You can free CPU cycles by delaying the flushing of full write buffers to disk until the next sync call (or until the percentage of UBC dirty pages reaches the value of the delay_wbuffers_percent kernel variable). However, delaying write buffer flushing may adversely affect real-time workload performance, because the system will experience a heavy I/O load at sync time.
To delay full write buffer flushing, use the dbx patch command to set the value of the delay_wbuffers kernel variable to 1 (enabled). The default value of delay_wbuffers is 0 (disabled). See Section 4.4.6 for information on using dbx.
You can increase the number of blocks that are combined for a read-ahead operation. To do this, use the dbx patch command to set the value of the cluster_consec_init kernel variable equal to the value of the cluster_max_read_ahead kernel variable (the default is 8), which specifies the maximum number of read-ahead clusters that the kernel can schedule.
In addition, you must make sure that cluster read operations are enabled on nonread-ahead and read-ahead blocks. To do this, use dbx to set the value of the cluster_read_all kernel variable to 1, which is the default value. See Section 4.4.6 for information on using dbx.
The cluster_maxcontig kernel variable specifies the number of blocks that are combined into a single I/O operation. The default value is 8. Contiguous writes are done in a unit size that is determined by the file system block size (8 KB) multiplied by the value of the cluster_maxcontig parameter. See Section 4.4.6 for information about using dbx.
Smooth sync functionality improves UFS asynchronous I/O performance by preventing I/O spikes caused by the update daemon and by increasing the UBC hit rate, which decreases the total number of disk operations. Smooth sync also helps to efficiently distribute I/O requests over the sync interval, which decreases the length of the disk queue and reduces the latency that results from waiting for a busy page to be freed. By default, smooth sync is enabled on your system.
UFS caches asynchronous I/O requests in the dirty block queue and in the UBC object dirty page list queue before they are handed to the device driver. With smooth sync enabled (the default), the update daemon will not flush the dirty page list and dirty page wired list buffers. Instead, asynchronous I/O requests remain in the queue for the amount of time specified by the value of the vfs attribute smoothsync_age (the default is 30 seconds). When a buffer ages sufficiently, it is moved to the device queue. If smooth sync is disabled, every 30 seconds the update daemon flushes data from memory to disk, regardless of how long a buffer has been cached.
Smooth sync functionality is controlled by the smoothsync_age attribute. However, you do not specify a value for smoothsync_age in the /etc/sysconfigtab file. Instead, the /etc/inittab file is used to enable smooth sync when the system boots to multiuser mode and to disable smooth sync when the system goes from multiuser mode to single-user mode. This procedure is necessary to reflect the behavior of the update daemon, which operates only in multiuser mode.
To enable smooth sync, the following lines must be included in the /etc/inittab file and the time limit for caching buffers in the smooth sync queue must be specified:
smsync:23:wait:/sbin/sysconfig -r vfs smoothsync_age=30 > /dev/null 2>&1
smsyncS:Ss:wait:/sbin/sysconfig -r vfs smoothsync_age=0 > /dev/null 2>&1
Thirty seconds is the default smooth sync queue threshold. If you increase this value, you may improve the chance of a buffer cache hit by retaining buffers on the smooth sync queue for a longer period of time. Conversely, decreasing the value of the smoothsync_age attribute will speed the flushing of buffers. To disable smooth sync, specify a value of 0 (zero) for the smoothsync_age attribute.
See Section 4.4 for information about modifying kernel subsystem attributes.
Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made. The vfs subsystem attribute max-ufs-mounts specifies the maximum number of UFS and MFS mounts on the system. You can increase the value of the max-ufs-mounts attribute if your system will have more than the default limit of 1000 mounts. However, increasing the maximum number of UFS and MFS mounts requires memory resources for the additional mounts.
See Section 4.4 for information about modifying kernel subsystem attributes.
The Network File System (NFS) shares the UBC with the virtual memory subsystem and local file systems. NFS can put an extreme load on the network. Poor NFS performance is almost always a problem with the network infrastructure. Look for high counts of retransmitted messages on the NFS clients, network I/O errors, and routers that cannot maintain the load.
Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.
You can monitor NFS by using the nfsstat and other commands. When evaluating NFS performance, remember that NFS does not perform well if any file-locking mechanisms are in use on an NFS file, because the locks prevent the file from being cached on the client. See nfsstat(8) for more information.
The following sections describe how to perform the following tasks:
Gather NFS performance information (Section 9.5.1)
Improve NFS performance (Section 9.5.2)
Table 9-9 describes the commands you can use to obtain information about NFS operations.
| Name | Use | Description |
| nfsstat | Displays network and NFS statistics (Section 9.5.1.1) | Displays NFS and RPC statistics for clients and servers, including the number of packets that had to be retransmitted (retrans). |
| nfswatch | Monitors NFS traffic | Monitors all incoming network traffic to an NFS server and divides it into several categories, including NFS reads and writes, NIS requests, and RPC authorizations. Your kernel must be configured with the packetfilter option. |
| ps axlmp | Displays information about idle threads (Section 9.5.1.2) | Displays information about idle threads on a client system. |
| dbx | Displays active NFS server threads (Section 4.4.6) | Displays a histogram of the number of active NFS server threads. |
| dbx | Displays the hit rate (Section 9.1.2) | Displays the namei cache hit rate. |
| dbx | Displays metadata buffer cache information (Section 9.4.2.3) | Reports statistics on the metadata buffer cache hit rate. |
| dbx | Reports UBC statistics (Section 6.3.5) | Reports the UBC hit rate. |
The following sections describe how to use some of these tools in detail.
The nfsstat command displays statistical information about NFS and Remote Procedure Call (RPC) interfaces in the kernel. You can also use this command to reinitialize the statistics. An example of the nfsstat command is as follows:
# /usr/ucb/nfsstat
Server rpc:
calls    badcalls nullrecv badlen   xdrcall
38903    0        0        0        0

Server nfs:
calls    badcalls
38903    0

Server nfs V2:
null       getattr    setattr    root       lookup     readlink   read
5 0%       3345 8%    61 0%      0 0%       5902 15%   250 0%     1497 3%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 3%    549 1%     1049 2%    352 0%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     172 0%     689 1%     1751 4%

Server nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       1333 3%    1019 2%    5196 13%   238 0%     400 1%     2816 7%
write      create     mkdir      symlink    mknod      remove     rmdir
2560 6%    752 1%     140 0%     400 1%     0 0%       1352 3%    140 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
200 0%     200 0%     936 2%     0 0%       3504 9%    3 0%       0 0%
commit
21 0%

Client rpc:
calls    badcalls retrans  badxid   timeout  wait     newcred
27989    1        0        0        1        0        0
badverfs timers
0        4

Client nfs:
calls    badcalls nclget   nclsleep
27988    0        27988    0

Client nfs V2:
null       getattr    setattr    root       lookup     readlink   read
0 0%       3414 12%   61 0%      0 0%       5973 21%   257 0%     1503 5%
wrcache    write      create     remove     rename     link       symlink
0 0%       1400 5%    549 1%     1049 3%    352 1%     250 0%     250 0%
mkdir      rmdir      readdir    statfs
171 0%     171 0%     713 2%     1756 6%

Client nfs V3:
null       getattr    setattr    lookup     access     readlink   read
0 0%       666 2%     9 0%       2598 9%    137 0%     200 0%     1408 5%
write      create     mkdir      symlink    mknod      remove     rmdir
1280 4%    376 1%     70 0%      200 0%     0 0%       676 2%     70 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
100 0%     100 0%     468 1%     0 0%       1750 6%    1 0%       0 0%
commit
10 0%
#
The ratio of timeouts to calls (which should not exceed 1 percent) is the most important thing to look for in the NFS statistics. A timeout-to-call ratio greater than 1 percent can have a significant negative impact on performance. See Chapter 10 for information on how to tune your system to avoid timeouts.
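Using the Client rpc figures from the example output, the check is a one-line calculation:

```python
# Client rpc figures from the example nfsstat output
calls, timeouts = 27989, 1

ratio = timeouts / calls * 100   # timeout-to-call ratio, as a percentage
print(ratio < 1.0)               # True
```

The example client is far below the 1 percent threshold, so its timeouts are not a performance concern.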
Use the nfsstat -s -i 10 command to display NFS and RPC information at ten-second intervals. If you are attempting to monitor an experimental situation with nfsstat, reset the NFS counters to 0 before you begin the experiment. Use the nfsstat -z command to clear the counters. See nfsstat(8) for more information about command options and output.
On a client system, the nfsiod daemons spawn several I/O threads to service asynchronous I/O requests to the server. The I/O threads improve the performance of both NFS reads and writes. The optimum number of I/O threads depends on many variables, such as how quickly the client will be writing, how many files will be accessed simultaneously, and the characteristics of the NFS server. For most clients, seven threads are sufficient.
The following example uses the ps axlmp command to display idle I/O threads on a client system:
# /usr/ucb/ps axlmp 0 | grep nfs
0  42  0  nfsiod_  S  0:00.52
0  42  0  nfsiod_  S  0:01.18
0  42  0  nfsiod_  S  0:00.36
0  44  0  nfsiod_  S  0:00.87
0  42  0  nfsiod_  S  0:00.52
0  42  0  nfsiod_  S  0:00.45
0  42  0  nfsiod_  S  0:00.74
#
The previous output shows a sufficient number of sleeping threads and 42 server threads that were started by nfsd, where nfsiod_ has been replaced by nfs_tcp or nfs_udp. If your output shows that few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See Section 9.5.2.2, Section 9.5.2.3, nfsiod(8), and nfsd(8) for more information.
Improving performance on a system that is used only for serving NFS differs from tuning a system that is used for general timesharing, because an NFS server runs only a few small user-level programs, which consume few system resources. There is minimal paging and swapping activity, so memory resources should be focused on caching file system data.
File system tuning is important for NFS because processing NFS requests consumes the majority of CPU and wall clock time. Ideally, the UBC hit rate should be high. Increasing the UBC hit rate can require additional memory or a reduction in the size of other file system caches. In general, file system tuning will improve the performance of I/O-intensive user applications.
In addition, a vnode must exist to keep file data in the UBC. If you are using AdvFS, an access structure is also required to keep file data in the UBC.
If you are running NFS over TCP, tuning TCP may improve performance if there are many active clients. However, if you are running NFS over UDP, no network tuning is needed. See Section 10.2 for more information.
Table 9-10 lists NFS tuning and performance-improvement guidelines and the benefits as well as tradeoffs.
| Action | Performance Benefit | Tradeoff |
| Set the value of the maxusers attribute to the number of server NFS operations that are expected to occur each second (Section 5.1) | Provides the appropriate level of system resources | Consumes memory |
| Increase the size of the namei cache (Section 9.2.1) | Improves file system performance | Consumes memory |
| Increase the number of AdvFS access structures, if you are using AdvFS (Section 9.3.4.3) | Improves AdvFS performance | Consumes memory |
| Increase the size of the metadata buffer cache, if you are using UFS (Section 9.4.3.1) | Improves UFS performance | Consumes wired memory |
| Use Prestoserve (Section 9.5.2.1) | Improves synchronous write performance for NFS servers | Cost |
| Configure the appropriate number of threads on an NFS server (Section 9.5.2.2) | Enables efficient I/O blocking operations | None |
| Configure the appropriate number of threads on the client system (Section 9.5.2.3) | Enables efficient I/O blocking operations | None |
| Modify cache timeout limits on the client system (Section 9.5.2.4) | May improve network performance for read-only file systems and enable clients to quickly detect changes | Increases network traffic to server |
| Decrease network timeouts on the client system (Section 9.5.2.5) | May improve performance for slow or congested networks | Reduces theoretical performance |
| Use NFS protocol Version 3 on the client system (Section 9.5.2.6) | Improves network performance | Decreases the performance benefit of Prestoserve |
The following sections describe these guidelines in detail.
You can improve NFS performance by installing Prestoserve on the server. Prestoserve greatly improves synchronous write performance for servers that are using NFS Version 2. Prestoserve enables an NFS Version 2 server to write client data to a nonvolatile (battery-backed) cache, instead of writing the data to disk.
Prestoserve may improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2, because NFS Version 3 servers can reliably write data to volatile storage without risking loss of data in the event of failure. NFS Version 3 clients can detect server failures and resend any write data that the server may have lost in volatile storage.
See the Guide to Prestoserve for more information.
The nfsd daemon runs on NFS servers and spawns a number of server threads that process NFS requests from client machines. At least one server thread must be running for a machine to operate as a server. The number of threads determines the number of parallel operations and must be a multiple of 8. For good performance on frequently used NFS servers, configure either 16 or 32 threads, which provides the most efficient blocking for I/O operations. See nfsd(8) for more information.
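On Tru64 UNIX, daemon settings of this kind are typically made persistent through rcmgr and /etc/rc.config. A minimal sketch, assuming the variable controlling the server thread count is named NUM_NFSD (verify the exact variable name in nfsd(8) and your rc.config before using it):

```shell
# Configure 16 NFS server threads at boot time.
# NUM_NFSD is an assumed variable name; check rc.config(8) and
# nfsd(8) on your system for the actual name.
rcmgr set NUM_NFSD 16
# Restart the nfsd daemon (or reboot) for the new value to take effect.
```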
Client systems use the nfsiod daemon to service asynchronous I/O operations, such as buffer cache read-ahead and delayed write operations. The nfsiod daemon spawns several I/O threads to service asynchronous I/O requests to its server. The I/O threads improve the performance of both NFS reads and writes. The optimal number of I/O threads depends on many variables, such as how quickly the client is writing data, how many files will be accessed simultaneously, and the behavior of the NFS server. The number of threads must be a multiple of 8 minus 1 (for example, 7 or 15 is optimal).
NFS servers attempt to gather writes into complete UFS clusters before initiating I/O, and the number of threads (plus 1) is the number of writes that a client can have outstanding at any one time. Having exactly 7 or 15 threads produces the most efficient blocking for I/O operations.
If write gathering is enabled and the client does not have any I/O threads, you may experience performance degradation. To disable write gathering, use the dbx patch command to set the nfs_write_gather kernel variable to zero. See Section 4.4.6 for more information.
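The change can be made with an interactive dbx session, following the pattern shown earlier in this chapter. A sketch (patching /dev/mem takes effect immediately but does not persist across reboots; see Section 4.4.6 for making the change permanent):

```shell
# Disable NFS write gathering on the running kernel.
# The patch against /dev/mem is lost at the next reboot.
/usr/ucb/dbx -k /vmunix /dev/mem <<'EOF'
patch nfs_write_gather = 0
quit
EOF
```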
Use the ps axlmp 0 | grep nfs command to display idle I/O threads on the client. If few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See nfsiod(8) for more information.
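For example, you can count the sleeping threads directly by matching the process-state column in the ps output. The sample lines below are hypothetical stand-ins for real ps axlmp 0 output (on a live client you would pipe the ps command itself into the grep):

```shell
# Count idle nfsiod I/O threads: a thread in state S is sleeping
# and therefore available for new requests.
# The sample lines are hypothetical `ps axlmp 0` output.
sample='S     0   521  nfsiod_
S     0   522  nfsiod_
R     0   523  nfsiod_'
idle=$(printf '%s\n' "$sample" | grep -c '^S.*nfs')
echo "$idle"   # prints 2: two of the three threads are sleeping
```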
For read-only file systems and slow network links, performance may be improved by changing the cache timeout limits on NFS client systems. These timeouts affect how quickly you see updates to a file or directory that has been modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will give you slightly better performance and will reduce the amount of network traffic that you generate.
See mount(8) and the descriptions of the acregmin, acregmax, acdirmin, acdirmax, and actimeo options for more information.
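For example, on a client that mounts a read-only file system that rarely changes, you might raise all four attribute-cache timeouts at once with the actimeo option. The server and mount-point names below are hypothetical:

```shell
# Cache file and directory attributes for 10 minutes (600 seconds)
# on a rarely changing read-only mount.
# Server name and paths are hypothetical; see mount(8) for the
# exact option syntax on your system.
mount -o ro,actimeo=600 server:/usr/share/man /mnt/man
```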
NFS does not perform well if it is used over slow network links, congested networks, or wide area networks (WANs). In particular, network timeouts on client systems can severely degrade NFS performance. You can identify this condition by using the nfsstat command and determining the ratio of timeouts to calls. If timeouts amount to more than 1 percent of total calls, NFS performance may be severely degraded. See Section 9.5.1.1 for sample nfsstat output of timeout and call statistics, and nfsstat(8) for more information.
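For example, given the calls and timeout counters from the client RPC section of nfsstat output (the figures below are hypothetical), the timeout percentage can be computed as follows:

```shell
# Compute timeouts as a percentage of total RPC calls.
# The counter values are hypothetical examples of nfsstat output.
calls=32969
timeouts=461
awk -v t="$timeouts" -v c="$calls" 'BEGIN { printf "%.2f\n", t * 100 / c }'
# prints 1.40 -- above the 1 percent threshold, so NFS performance
# on this client is likely degraded
```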
You can also use the netstat -s command to verify the existence of a timeout problem. A nonzero value in the fragments dropped after timeout field in the ip section of the netstat output may indicate that the problem exists. See Section 10.1.1 for sample netstat command output.
If fragment drops are a problem, use the mount command on the client system with the -rsize=1024 and -wsize=1024 options to set the size of the NFS read and write buffers to 1 KB.
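A 1-KB buffer fits in a single network packet, so no IP fragmentation (and therefore no fragment loss) can occur for NFS traffic on that mount. A sketch, with hypothetical server and mount-point names:

```shell
# Reduce the NFS read and write buffers to 1 KB to avoid IP
# fragmentation on a slow or lossy link.
# Server name and paths are hypothetical; see mount(8) for the
# exact option syntax on your system.
mount -o rsize=1024,wsize=1024 server:/export/data /mnt/data
```

The tradeoff is more RPC round trips per transferred megabyte, which is why Table 9-10 lists reduced peak performance as the cost of this change.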
NFS protocol Version 3 provides NFS client-side asynchronous write support, which improves the cache consistency protocol and requires less network load than Version 2. These performance improvements slightly decrease the performance benefit that Prestoserve provided for NFS Version 2. However, with Protocol Version 3, Prestoserve still speeds file creation and deletion.