This chapter describes how you can tune your system to use resources most efficiently under a variety of system load conditions. Tuning your system can include changing system configuration file parameters or sysconfigtab attributes, increasing resources such as CPU or cache memory, and changing the system configuration, such as adding disks, spreading out file systems, or adding swap space.
Note
If you have a performance problem on your system, never attempt to tune your system until you have confirmed that the problem is not caused by an application that is either broken or in need of further optimization. For information on application optimization, see the Programmer's Guide.
As a general rule, performance tuning consists of performing several of the following tasks:
Section 3.3 explains the mechanisms that you can use to tune your system.
Prior to tuning your system, you need to understand how your system is being used. For example, is your system used primarily as a file server? Are users running many small applications or are they running mostly large applications? Without this understanding, your attempts at tuning may cause more harm than good. If you understand the system's intended use and you perceive a performance problem, keep the following tuning rules in mind:
The following text provides an example of an analytical path you could follow to determine where your system needs tuning - after first confirming that your applications are not causing the performance problem:
Check the number of pages on the free list. If the number is less than 128, you may have a virtual memory problem. Possible solutions to a virtual memory problem include the following:
After you determine where your performance problem exists, you can then begin your effort to remedy it. The following sections describe the various tuning possibilities.
When applications are operating correctly but are experiencing CPU saturation (that is, a lack of available CPU cycles), your options for correcting the situation are limited. If the CPU overload condition is expected to continue indefinitely, the best long-term solution is to add more memory, add an additional processor (on multiprocessor systems), or replace your system with a larger one. In the short term, before you are able to increase CPU capacity, you can make a number of adjustments to temporarily increase performance on a system that is compute bound.
The following sections describe the effects of expanding CPU capacity on a Symmetrical Multiprocessing system (SMP) (Section 3.2.1) and temporary changes that you can make to reduce CPU load (Section 3.2.2).
SMP systems allow you to expand the computing power of a system by adding additional processors. In most cases, increasing computing power in this way improves the performance of a system. Adding additional processors can be an effective solution for performance problems on SMP systems that are compute bound (only nominal idle time) and have multiple processes with the ability to run concurrently. Note that your system's ability to take advantage of the increase in computing capacity provided by additional processors may be limited if you do not also increase your system's I/O capacity.
Workloads that lend themselves well to SMP include DBMS servers, mail servers, and compute servers, to name a few. Basically, most workloads that have multiple processes or multiple threads of execution that can run concurrently can benefit from SMP. It is important to note that the gating factor for these workloads in some cases is not computing power, and these types of workloads may require additional tuning. For example, workloads that are metadata intensive and that involve a limited number of directories may benefit more if NVRAM is also added to your system whenever an additional processor is added. (Metadata intensive applications open large numbers of small files and access them repetitively.)
The operating system is designed to ensure that the user load is balanced across the available processors. Factored into the load-balancing algorithm are the user load, system load, and the interrupt rate. The algorithm is tuned to attempt to allow threads that have recently run on a given processor to continue to run on that processor (to take advantage of data retained in caches). Users can optionally choose to bind a particular process to a particular processor. This can be done using either the runon command or the bind_to_cpu system call. See runon(1) or bind_to_cpu(3) for details.
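For example, the following is a minimal sketch of binding a process to processor 2 with the runon command (the program name is hypothetical):

% runon 2 ./my_application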
The utilities iostat and vmstat allow you to monitor the memory, CPU, and I/O consumption on your system. The cpustat extension to kdbx also allows application developers to monitor the time spent in user mode, system mode, and kernel mode on each of the processors. This information can help application developers determine how effectively they are achieving parallelism across the system. To enhance parallelism, application developers working in Fortran or C should consider using the Kuck & Associates Preprocessor (KAP), which can have a significant impact on SMP performance. See the Programmer's Guide for details on KAP.
In general, the following adjustments can be made to improve CPU processing on a temporary basis:
See the manual System Administration for details on how you can adjust these limits. (Note that job scheduling can also be a very important consideration in determining how you are going to handle large programs. Also note that the limit and unlimit commands can affect several of these size limits.)
The system vnode table limits the number of active files. If your system is heavily loaded but does not have a shortage of memory, increasing the size of the system vnode table may improve performance. You can increase its size by giving a higher value to the maxusers attribute.
System parameters are global variables that you can tune in a variety of ways:
The latter method is preferred because it requires only a system reboot to put new values permanently into effect. The other methods are either only temporary solutions (the dbx method) or require a kernel rebuild and a system reboot (the system configuration file or param.c method). Kernel rebuilds can be difficult and time consuming and should be avoided whenever possible.
Using the attribute method entails establishing attribute values by interacting with the Kernel Tuner (dxkerneltuner) provided by the Common Desktop Environment (CDE) graphical user interface or by issuing the sysconfigdb or sysconfig -r commands from a terminal window:
As indicated earlier in this section, the value of a global variable can be reset by a variety of mechanisms. As a result, global variables in the system configuration file and the param.c file can have values that differ from those of their corresponding attributes in the sysconfigtab file. To understand how one value of a global variable is overridden by another, it is necessary to understand the levels at which global variables are controlled in a system. From lowest to highest, the control levels are as follows:
Each global variable can have a different value at each level of control. As a result, the following precedence rules, from highest to lowest, apply in a running system:
See Section 2.2.9 for information on how to monitor the values of configuration attributes. For descriptions of the tunable configuration attributes in sysconfigtab, see Appendix B. (Note that not all subsystems displayed by a sysconfig -r command are covered in Appendix B. Only those subsystems that have tunable attributes affecting performance are covered.)
The memory subsystem is one of the first places where a performance problem can occur. Performance can be degraded when the virtual memory subsystem cannot keep up with the demand for pages.
Memory tuning can be divided into the following two areas of concern:
You can limit the amount of memory that the UBC uses for the file system buffer cache. This increases the amount of memory available to the virtual memory subsystem, but decreases I/O performance. (See Section 3.4.1 for details on UBC tuning.)
You can tune several sysconfigtab attributes to improve the performance of the virtual memory subsystem. Another method of improving its performance is to configure additional swap space or spread out your disk I/O. Adding more physical memory is an option that will always improve performance. (See Section 3.4.2 for details on virtual memory tuning.)
Table 3-1 lists some sysconfigtab attributes that can have a significant impact on virtual memory (including paging and swapping) and on the UBC. Reboot the system if you change any of these attributes.
Attribute/Parameter | Default | Description |
Parameters: | ||
dfldsiz | 134217728 | Default data segment size limit. |
dflssiz | 1048576 | Default stack size limit. |
maxdsiz | 1073741824 | Maximum data segment size limit. |
maxusers | 32 | See maxusers attribute. |
Attributes: | ||
bufcache | 3 | Percentage of memory dedicated to the file system buffer cache. |
buffer-hash-size | 512 | The size of the buffer cache hash chain table used to store the heads of the hashed buffer queues. |
max-proc-per-user | 64 | Maximum number of processes one user can run simultaneously. |
maxusers | 32 | Number of users the system can support simultaneously. |
msg-mnb | 16384 | Maximum number of bytes on a System V message queue. |
msg-mni | 50 | Number of System V message queue identifiers. |
msg-tql | 40 | Number of System V message headers. |
name-cache-hash-size | 256 | The size of the hash chain table for the namei cache. |
ubc-borrowpercent | 20 | The percentage of physical memory above which the UBC is borrowing memory from the system. |
ubc-maxpercent | 100 | Maximum percentage of memory that the UBC can consume. |
ubc-minpercent | 10 | Percentage of memory at which page stealing from the UBC is prohibited. |
vm-aggressive-swap | 0 (off) | Controls whether the task swapper should be more aggressive in swapping out idle tasks to prevent the system from reaching a low-memory condition. |
vm-asyncswapbuffers | 4 | The total number of asynchronous I/O requests by the page stealing daemon that can be outstanding, per swap partition, at any one time. |
vm-clustermap | 1024*1024*1 | Cluster duplication map size. |
vm-clustersize | 1024*64 | Maximum cluster duplication for each bp. |
vm-cowfaults | 4 | Copy point. |
vm-csubmapsize | 1024*1024 | Size of kernel copy map. (The kernel copy map is the address space for copying data into and out of the kernel.) |
vm-heappercent | 7 | Percent of kernel virtual address space to allocate for use by the heap. |
vm-inswappedmin | 1 | Minimum amount of time, in seconds, that a task must remain in memory (inswapped) before it can be swapped out. |
vm-mapentries | 200 | Maximum number of virtual memory map entries that a user map can have. Map entries are allocated when the user maps an object into address space that is not adjacent to another object that has the same protection and that can grow. |
vm-maxvas | 1L<<30 | Maximum virtual address space for user maps (see vm-mapentries). |
vm-maxwire | 1L<<24 | Maximum amount of memory that can be wired by a user process. |
vm-page-free-min | 20 | The free list's low watermark (below which physical memory reclamation begins). |
vm-page-free-optimal | 74 | The number of pages on the free list below which the system may begin swapping out entire tasks to reduce memory demand. |
vm-page-free-reserved | 10 | The number of pages on the free list that are reserved for the kernel. |
vm-page-free-target | 128 | The free list's high watermark (above which physical memory reclamation stops). |
vm-page-prewrite-target | 256 | The number of pages that the virtual memory subsystem attempts to keep clean. |
vm-segmentation | 1 (on) | Enables shared page tables. |
vm-syncswapbuffers | 128 | Number of synchronous swap buffers. |
vm-syswiredpercent | 80 | Maximum percentage of system-wide wired memory. |
vm-ubcbuffers | 256 | Minimum number of buffers that the UBC can contain. |
vm-ubcdirtypercent | 10 | Maximum percentage of UBC pages that can be modified ("dirtied"). |
vm-ubcpagesteal | 24 | Number of pages that the UBC can have for a file before the UBC will begin to take pages from the file to satisfy the file's own demands. |
vm-ubcseqpercent | 10 | The size of a file, as a percentage of the UBC, above which the UBC reuses its own pages rather than taking pages from the free list. |
vm-ubcseqstartpercent | 50 | The size of the UBC, as a percentage of total memory, above which the UBC reuses its own pages rather than taking pages from the free list when caching a large file. |
vm-vpagemax | 16384 | The maximum number of individually protected pages in a user address space. |
vm-zone_size | 67108864 | Amount of kernel virtual address space that is available for many of the system's dynamic data structures. |
Detailed descriptions of the attributes are provided in Section B.21. For information about the parameters listed in the table, see the manual System Administration.
In some cases, an I/O-intensive process may degrade the performance of other processes by using a major portion of the UBC. If you need more memory for the virtual memory subsystem, you can reduce the amount of memory that is available to the UBC. Note that reducing the memory available to the UBC may adversely affect file system I/O because less file system data will be cached in the UBC and increasing amounts of data will have to be accessed from disk.
The UBC is flushed by the update daemon. UBC statistics can be viewed by using dbx and checking the vm_perfsum structure. You can also monitor the UBC by using the dbx -k command and examining the ufs_getapage_stats kernel data structure.
The size of the UBC is influenced by the values of the following configuration attributes in the sysconfigtab file:
By default, the UBC will use at least 10 percent of all memory and can use up to 100 percent of all memory. If you want to reduce the amount of memory that can be allocated to the UBC, you could set ubc-maxpercent to 50 percent of all memory. This ensures that the UBC will not adversely affect the virtual memory subsystem. Note that the performance of an application that generates a lot of random I/O will not be improved by enlarging the UBC because the next access location for random I/O cannot be predetermined.
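For example, the following hypothetical sysconfigtab stanza limits the UBC to 50 percent of memory. The vm subsystem name shown here is an assumption; confirm where the UBC attributes live on your system (for example, with sysconfig -s or the Kernel Tuner) before merging the stanza with sysconfigdb, and reboot afterward as Table 3-1 indicates:

vm:
    ubc-maxpercent = 50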
If vmstat output shows excessive paging but few or no page outs, the value of ubc-borrowpercent is probably set too low. It is particularly important to watch for this on low-memory systems (24-MB systems) because they tend to reclaim UBC pages much more aggressively than systems with more memory, and this condition can have an adverse effect on system performance.
Typically, the UBC borrows all physical memory above ubc-borrowpercent (up to the ubc-maxpercent limit). Increasing this value allows more memory to remain in the UBC before global page reclamation begins (that is, before the number of free pages in the system equals the value of the vm-page-free-min attribute). This typically increases the UBC cache effectiveness, but decreases the system response time when a low memory condition occurs. The range of values for this parameter is 0 to 100.
If the page-out rate is high and you are not using the file system heavily, you could decrease the value of ubc-maxpercent to reduce the rate of paging. Use the vmstat command to determine whether the system is paging excessively. Using dbx, periodically examine the vpf_pgiowrites and vpf_ubcalloc fields of the vm_perfsum kernel structure; if page outs greatly exceed UBC allocations, decreasing ubc-maxpercent may shrink the page-out rate.
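A sketch of the dbx check described above, using the standard kernel and memory image paths:

# dbx -k /vmunix /dev/mem
(dbx) print vm_perfsum.vpf_pgiowrites
(dbx) print vm_perfsum.vpf_ubcalloc
(dbx) quit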
For I/O servers, you may want to raise the value of the ubc-minpercent attribute in the sysconfigtab file to ensure that more memory is available for the UBC. If you do this, large programs that run occasionally will not completely deplete the UBC. To check that you did not raise ubc-minpercent too high, use the vmstat command to examine the page-out rate.
If you change the values of the ubc-maxpercent and ubc-minpercent attributes in the sysconfigtab file, do not make the values so close together that you degrade I/O performance or cause the system to page excessively.
The Digital UNIX operating system uses some configuration attributes in the sysconfigtab file to prevent a large file from completely filling the UBC, thus limiting the amount of memory available to the virtual memory subsystem. The system will reuse the pages in the UBC instead of taking pages from the free page list when both of the following conditions are met:
The vm-ubcseqstartpercent and vm-ubcseqpercent attributes in the sysconfigtab file are used to ensure that a large file does not take all of the pages on the free page list and cause the system to page excessively.
For example, using the default values, the UBC would have to be larger than 50 percent of all memory and a file would have to be larger than 10 percent of the UBC (that is, the file size would have to be at least 5 percent of all memory) in order for the system to reuse the pages in the UBC.
To determine the values of the vm-ubcseqstartpercent and vm-ubcseqpercent attributes, use the sysconfig -q command.
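For example (the vm subsystem name is an assumption; if you omit the attribute names, sysconfig -q displays all of the subsystem's attributes):

# sysconfig -q vm vm-ubcseqstartpercent vm-ubcseqpercent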
On large-memory systems that are doing a lot of file system operations, you may want to lower the vm-ubcseqstartpercent value to 30 percent. Do not specify a lower value unless you decrease the size of the UBC. You probably do not want to change the value of vm-ubcseqpercent.
Although all memory is shared between the virtual memory subsystem and the UBC, the file system code that deals with the UNIX file system's metadata - including directories, indirect blocks, and inodes - still uses the traditional BSD buffer cache. The following configuration attributes in the sysconfigtab file affect file system buffer cache usage:
The value of the buffer-hash-size attribute can be changed so that each hash chain has 3 or 4 buffers. To determine a value to assign to the buffer-hash-size attribute, use dbx to examine the value of nbuf, then divide the value by 3 or 4, and finally round the result to a power of two. For example, if dbx shows that nbuf has a value of 360, dividing 360 by 3 gives you a value of 120. Based on that value, 128 (that is, 2**7) would be a good value to use for the buffer-hash-size attribute. (See Section 2.2.10 for information on how to use dbx to examine system variables such as nbuf.)
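A sketch of the dbx portion of this procedure, using the standard kernel and memory image paths and the example value from the text:

# dbx -k /vmunix /dev/mem
(dbx) print nbuf
360
(dbx) quit

Dividing 360 by 3 and rounding to a power of two gives 128, the value to assign to the buffer-hash-size attribute in the sysconfigtab file.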
You can change the value of the name-cache-hash-size attribute so that each hash chain has three or four name cache entries. To determine a value to assign to the name-cache-hash-size attribute, divide the value of name-cache-size attribute by three or four and then round the result to a power of two. For example, if the value of name-cache-size is 1029, dividing 1029 by four produces a value of 257. Based on this calculation, 256 (that is, 2**8) would be a good value for the name-cache-hash-size attribute.
If your system has adequate physical memory, you can also improve performance by increasing the value of the name-cache-size attribute instead of adjusting the hash size. Select which method to use depending on whether physical memory is the resource that is constraining the performance of your system.
To determine whether you should change the value of bufcache, use dbx to examine the bio_stats structure (see Section 2.2.10.5). The miss rate (block misses divided by the sum of the block misses and block hits) should not be more than 3 percent. If you have a high miss rate (low hit rate), you may want to raise the value of bufcache. Note that any additional memory that you allocate to the metadata buffer cache is taken away from the rest of the system. This may cause system performance to decline because it reduces the amount of memory that is available to the UBC and the virtual memory subsystem. In most cases, it is not advisable to modify the bufcache value. If you need to raise the value, never raise it to more than 10 percent.
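As a worked sketch with hypothetical counts: if bio_stats showed 300 block misses and 12000 block hits, the miss rate would be 300 / (300 + 12000), or about 2.4 percent. Because that is below the 3 percent threshold, you would leave bufcache unchanged.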
You can decrease the value of bufcache on large memory systems if the hit rate is high and you want to increase the amount of memory that is available to the virtual memory subsystem.
Excessive paging, which is sometimes called thrashing, decreases performance and indicates that the natural working set size has exceeded the available memory. Because the virtual memory subsystem runs at a higher priority, it blocks out other processes and spends all system resources on servicing page faults for the currently running processes.
You can determine whether a system has memory problems by examining the output of the vmstat command. The pout column lists the number of page outs. The free column lists the number of pages on the free page list. Fewer than 128 pages on the free list combined with a consistently high number of page outs may indicate that excessive paging and swapping is occurring.
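For example, to sample virtual memory statistics every five seconds:

# vmstat 5

Watch the free column for values that remain below 128 and the pout column for consistently high counts.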
Some general solutions for reducing excessive paging and swapping are as follows:
See Table 3-1 for a list of parameters and attributes that can be used to tune virtual memory. Detailed descriptions of the attributes are provided in Section B.21. For information about the parameters listed in the table, see the manual System Administration.
To optimize the use of your swap space, use your fastest disks for swap devices and spread out your swap space across multiple devices. Use the swapon -s command to display your swap space configuration. Use the iostat command to determine which disks are being used the most.
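For example:

# swapon -s
# iostat 5

The first command summarizes the configuration and use of each swap partition; the second, sampling every five seconds, shows which disks are handling the most I/O.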
To ensure the best performance, place each swap partition on its own disk (instead of placing multiple swap partitions on the same disk). The page reclamation code uses a form of disk striping (known as swap space interleaving) so that pages can be written to the multiple disks. In addition, configure all of your swap devices at boot time to optimize swap space. See the manual System Administration for details on how to perform these operations.
To increase performance, you can change your swap mode from immediate mode (the default) to deferred mode (overcommitment mode) by removing or moving the /sbin/swapdefault file. Deferred mode requires less swap space than immediate mode and causes the system to run faster because less swap bookkeeping is required. However, because deferred mode does not reserve swap space in advance, the swap space may not be available when it is needed by a task and the process may be killed asynchronously. (See Appendix A for information about special considerations associated with low-memory systems (24-MB systems) operating in overcommitment mode.)
Application messages such as the following usually indicate that not enough swap space is configured into the system or that a process limit has been reached:
See the manual System Administration for information on how to fix these problems.
You may be able to improve IPC performance by tuning the following configuration attributes in the sysconfigtab file:
The process will be unable to send a message to a queue if doing so would make the total number of bytes in that queue greater than the limit specified by msg-mnb. When the limit is reached, the process sleeps, waiting for this condition to be false.
The process will be unable to send a message if doing so would make the total number of message headers currently in the system greater than the limit specified by msg-tql. If the limit is reached, the process sleeps, waiting for a message header to be freed.
You can track the use of IPC facilities with the ipcs -a command (see ipcs(1)). By looking at the current number of bytes and message headers in the queues, you can then determine whether you need to increase the values of the msg-mnb and msg-tql attributes in the sysconfigtab file to diminish waiting.
You may also want to consider tuning several other IPC attributes in the sysconfigtab file. How you tune the following attributes depends on what you are trying to do in your application:
(Note: As a design consideration, consider whether you would be better off using threads instead of shared memory.)
I/O tuning can be divided into the following three areas of concern:
You can improve disk I/O performance by changing file system fragment sizes and other parameters that control the layout of the file systems.
You can improve network performance by reducing the number of network applications, redesigning the network, or adding more memory.
In addition to improving NFS performance by using techniques that improve the performance of other file systems, you can also improve NFS performance by using Prestoserve and adjusting a few of its parameters. (See the Guide to Prestoserve for details on Prestoserve.)
The operating system includes several configuration parameters and attributes that can affect the I/O subsystem. As specified in Table 3-2, they are set in either the system configuration file or the sysconfigtab file or by using dbx. Reboot the system if you change any configuration parameters or attributes to place the new values in effect.
Attribute/Parameter | Default | Description |
Read parameters: | ||
cluster_consec_incr | 1 | The increment for determining the number of blocks that should be combined on the next read-ahead request after the first read-ahead request. (Set with dbx.) |
cluster_consec_init | 2 | The number of blocks that should be combined for the first read-ahead request. See Section 3.6.1.2 for more details on this parameter. (Set with dbx.) |
cluster_lastr_init | -1 | The number of contiguous reads that need to be detected before read-ahead is requested. The default value will start read-ahead on the very first contiguous read request. (Set with dbx.) |
cluster_max_read_ahead | 8 | The maximum number of clusters that can be used in read-ahead operations. See Section 3.6.1.2 for more details on this parameter. (Set with dbx.) |
cluster_read_all | 1 | This variable is either on (!= 0) or off (==0). By default (on), perform cluster read operations on nonread-ahead blocks and read-ahead blocks. If off, perform cluster read operations only on read-ahead blocks. See Section 3.6.1.2 for more details on this parameter. (Set with dbx.) |
Write parameters: | ||
cluster_maxcontig | 8 | The number of blocks that will be combined into a single write request. The default tries to combine eight 8K-byte blocks into a 64K-byte cluster. This variable controls all mounted UNIX file systems (UFS). See Section 3.6.1.2 for more details on this parameter. (Set with dbx.) |
cluster_write_one | 1 | This variable is either on (!=0) or off (==0). By default (on), when a cluster needs to be written (that is, 64KB of data has been dirtied) but the cluster is made up of blocks that are not logically contiguous, only the contiguous data is written; the remaining data may be combined into future cluster requests. If off, 64KB of data will be written regardless of the number of write requests required to do so. (Set with dbx.) |
Other parameters that influence I/O: | ||
delay_wbuffers | 0 | This variable applies only to UFS. It is either on (!=0) or off (==0). By default (off), write-behind is turned on. If on, flushing full buffers to disk is delayed until a sync call is issued. See Section 3.6.1.2 for more details on this parameter. (Set with dbx.) |
maxcontig parameter | 8 | The maximum number of contiguous blocks that will be laid out before forcing a rotational delay. (Set with tunefs or newfs.) |
maxusers | 32 | See maxusers attribute. |
rotdelay | 4 | The expected time (in milliseconds) to service a transfer completion interrupt and initiate a new transfer on the same disk. It is used to decide how much rotational spacing to place between successive blocks in a file. If zero, blocks are allocated contiguously. (Set with tunefs or newfs.) |
Attributes that influence I/O: | ||
maxusers | 32 | The number of users that your system can support simultaneously without straining system resources. See Section 3.6.1 for more details on this attribute. |
max-vnodes | Varies | The maximum number of vnodes that can be allocated on a system. On a 32-MB or larger system, the maximum is the number of vnodes that will fit into 5 percent of available memory; on a 24-MB system, the default is 1000. (Set in the sysconfigtab file.) |
min-free-vnodes | Varies | The minimum number of free vnodes that will be kept on the free list. On a 32-MB or larger system, the default is the value of nvnode; on a 24-MB system, the default is 150. (Set in the sysconfigtab file.) |
namei-cache-valid-time | 1200 | The amount of time, in seconds, that governs when vnodes are deallocated. (Set in the sysconfigtab file.) |
open-max-hard | 4096 | Hard limit for the number of files that a process can have open at one time. (Set in the sysconfigtab file.) |
open-max-soft | 4096 | Soft limit for the number of file descriptors that a process may have open. This value is the default for all processes, and it must be less than or equal to the value of open-max-hard. See Section 3.6.1.2 for more details on this attribute. (Set in the sysconfigtab file.) |
vnode-age | 120 | The amount of time, in seconds, that a vnode is guaranteed to be kept on the free list before it is recycled. (Set in the sysconfigtab file.) |
For information about the parameters listed in the table, see the manual System Administration. Detailed descriptions of the attributes are provided in Section B.8.
Disk throughput is the gating performance factor for most applications. Data transfers to and from disk are much slower than data transfers involving the CPU or main memory. As a result, configuring and tuning disk subsystems to maximize the efficiency of I/O operations can have a critical impact on an application's performance.
The size of the disk operation is also important. In doing I/O to a disk, most of the time is usually taken up with the seek followed by the rotational delay. This is called the access time. For small I/O requests, access time is more important than the transfer rate. For large I/O requests, the transfer rate is more critical than the access time. Access time is also important for workstation, timeshare, and server environments.
Most performance problems occur because of disk saturation, that is, because demands for disk I/O exceed the capacity of your system. Before you attempt to fix a disk saturation problem by tuning the UFS and the AdvFS file systems and the Common Access Method (CAM) subsystem, try to improve performance by making some of the following adjustments:
The following sections describe how to tune the Virtual File System (VFS), UFS, and AdvFS file systems and the CAM subsystem.
Depending on system requirements, you can change Virtual File System (VFS) limits by tuning the following configuration attributes in the sysconfigtab file:
Allocation and deallocation of vnodes is handled dynamically by the system using the values set for these four attributes. With the exception of namei-cache-valid-time, the values of all of these attributes can be changed with the sysconfig -r command while the kernel is executing. Tuning considerations associated with these configuration attributes are as follows:
The maximum number of vnodes that can be allocated on a system can be set by the max-vnodes attribute. If max-vnodes is not set in the sysconfigtab file, the default value for the maximum number of vnodes for 24-MB systems is 1000; for 32-MB or larger systems, the default value is calculated from the following values:
The default value for the percentage of memory that can be used for vnodes is defined by a global element named vn_conf.percent_mem_for_vnodes (5 percent by default).
The system allocates vnodes based on demand, up to the maximum number of vnodes, and later deallocates them when their demand goes down. On a very busy system, if the number of active vnodes tends to exceed the maximum number of vnodes, you can adjust the maximum number upward by modifying the value associated with max-vnodes. For example:
# sysconfig -r vfs max-vnodes=15000
# sysconfig -q vfs max-vnodes
max-vnodes: 15000
Increasing the maximum number of vnodes puts an extra demand on the available memory on the system and should only be done if the system reports that it is out of vnodes. If the number of users on the system exceeds the value of maxusers, increase the value of max-vnodes proportionally.
You can change the maximum number of vnodes in either of the following ways:
When vnode deallocation is in progress, the value of min-free-vnodes determines the minimum number of free vnodes that will be kept on the free list. Vnode deallocation stops when the number of free vnodes reaches this value. A larger value for min-free-vnodes caches more free vnodes in the system and improves performance when free vnodes are reactivated as a result of vnode cache lookup operations. However, a larger value also proportionally increases the demand on memory because more vnodes are retained.
On 24-MB systems, the default value of the min-free-vnodes attribute is 150. On 32-MB or larger systems, the default value depends on the value of the maxusers attribute. It is possible to set maxusers so high that the value of min-free-vnodes is close to or larger than the value of the max-vnodes attribute. These conditions can have the following effects:
On systems that need to reclaim the memory used by vnodes, you should ensure that the value of min-free-vnodes is significantly lower than max-vnodes.
If the value of min-free-vnodes needs to be close to the value of max-vnodes, it is recommended that you turn off vnode deallocation by using the sysconfig -r command to set the value of min-free-vnodes to be larger than max-vnodes. (You can also turn off vnode deallocation by setting the vnode-deallocation-enable attribute to zero (0). However, this method is not recommended because it causes the system to be very conservative in allocating vnodes.)
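A sketch of the recommended method, using illustrative values (first check the current max-vnodes value, then set min-free-vnodes above it):

# sysconfig -q vfs max-vnodes
max-vnodes: 15000
# sysconfig -r vfs min-free-vnodes=16000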
The value of min-free-vnodes can be changed on a running kernel. For example:
# sysconfig -q vfs min-free-vnodes
min-free-vnodes: 10388
# sysconfig -r vfs min-free-vnodes=468
min-free-vnodes: 468
A decision to recycle a vnode on the free list is based on its age and the value of the vnode-age attribute in the sysconfigtab file. The value of the vnode-age attribute represents the time in seconds that a vnode is guaranteed to be kept on the free list before it is recycled. If a vnode selected for recycling from the LRU vnode free list is not older than vnode-age, a new vnode is allocated.
The default value for vnode-age is 120 seconds on 32-MB or larger systems (two seconds on 24-MB systems). A larger vnode-age value retains free vnodes on the free list for a longer time and improves the chances that a vnode will be successfully looked up and reused before it is recycled. The value of vnode-age can be changed with the sysconfigdb command (which takes effect when the system is rebooted) or with the sysconfig -r command (which takes effect immediately on a running system).
Vnodes are deallocated only if they have not been looked up within the amount of time set by the namei-cache-valid-time attribute in the sysconfigtab file. Its default value is 1200 seconds on 32-MB or larger systems and 30 seconds on 24-MB systems.
Use the sysconfigdb command to change the value of this configuration attribute in the sysconfigtab file.
Increasing the value of namei-cache-valid-time delays deallocation of vnodes. Decreasing the value causes faster deallocations but reduces the efficiency of the vnode cache.
If you need to optimize processing time, you can disable vnode deallocation by setting the value of the attribute vnode-deallocation-enable to zero (0). Disabling vnode deallocation increases memory usage because memory used by the vnodes is not returned back to the system.
Use the dumpfs command to display file system information.
To tune the UNIX File System (UFS), you can use one or more of the following options and techniques:
LSM mirroring can improve read performance, but it slows down write performance.
LSM striping improves performance by evenly distributing the I/O load across a number of disk drives.
See the manual Logical Storage Manager for details.
Prestoserve can dramatically improve synchronous write performance. (See the Guide to Prestoserve for details.)
You can determine whether a disk is fragmented by determining how effectively the system is clustering. You can do this by using dbx to examine the ufs_clusterstats, ufs_clusterstats_read, and ufs_clusterstats_write structures. UFS block clustering is usually reasonably efficient. If the numbers from the UFS clustering kernel structures show that clustering is not being particularly effective, the disk may be heavily fragmented.
Currently, the operating system does not have an online disk defragmenter for UFS. However, you can perform a defragmentation procedure to take care of a heavily fragmented disk as follows:
You can do this using the newfs command. The fragment size is 1KB by default. The UFS file system block size is fixed at 8KB. A block size/fragment size of 8KB/1KB is usually sufficient.
You can use a larger fragment size if the file system is used for executable files. Note that a large fragment size can waste disk space.
You can use the default fragment size (1KB) if the file system is used for small files or code development. A small fragment size uses disk space more efficiently than a large fragment size.
If you want to increase disk speed and most of the files are greater than two blocks (16KB), make the file system fragment size equal to the block size (8KB/8KB). This results in less overhead for the system, but it requires more space on the disk.
If the file system has many large files, reduce the density of inodes by using the newfs -i command.
Set the rotdelay parameter to 0 (zero) to allocate blocks contiguously. This can be done using either the tunefs command or the newfs command. A rotational delay of zero will allocate logically contiguous blocks, which aids UFS block clustering.
Use the tunefs command or the newfs command to change the value of maxcontig, which specifies the maximum number of contiguous blocks that will be laid out before forcing a rotational delay (that is, the number of blocks that can be combined into a cluster). The default is 8. This causes the file system to attempt I/O read requests in a size that is defined by the value of maxcontig multiplied by the block size (64KB by default).
Use the tunefs or the newfs command to change the value of maxbpg, which is the maximum number of file blocks that any single file can allocate per cylinder group. Usually, this value is set to about one-quarter of the total blocks in a cylinder group.
The maxbpg parameter is used to prevent a single file from using all of the blocks in a single cylinder group, which could degrade access times for all files subsequently allocated in that cylinder group. By limiting the number of file blocks, large files must perform long seeks more frequently than if they were allowed to allocate all the blocks in a cylinder group before seeking elsewhere.
If your file system contains only large files, you can set the maxbpg parameter higher than the default value. To get the performance benefit on an existing file system, you must lay out the files on the disk again.
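For example, a hedged sketch that applies the rotdelay, maxcontig, and maxbpg adjustments described above to a file system holding mostly large files (the partition name and the maxcontig and maxbpg values are illustrative; see tunefs(8) for the exact option syntax on your system):

# tunefs -d 0 -a 16 -e 4096 /dev/rrz3c

Here -d sets rotdelay to 0, -a sets maxcontig, and -e sets maxbpg.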
When a block of data in the UBC is scheduled to be written, it is sent asynchronously to disk. The default operating system behavior prevents the block from being read while the write is in progress. An application that reads a block immediately after it is written could improve its performance if the write was delayed so that the block could be read from memory instead of disk.
You can control when full UBC buffers are flushed to disk by using dbx to modify the value of the delay_wbuffers kernel parameter: a value of 0 (the default) leaves write-behind turned on, and a nonzero value delays flushing of full buffers to disk until a sync call is issued.
Enabling delay_wbuffers is useful when many small files are created or when files are written and immediately reread. Delaying the operation of writing out the data increases the chances of having the data immediately available in memory. Applications that generate a lot of intermediate (temporary) files can often benefit from enabling delay_wbuffers.
Note
Enabling delay_wbuffers causes an increase in the number of dirty (modified) pages in the buffer cache and makes it more likely that data will be lost if the system is shut down abnormally. In addition, enabling delay_wbuffers induces an I/O pattern with more spikes in it because of the inactivity between sync calls. This I/O pattern could negatively affect real workloads (that is, nonbenchmark workloads).
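A minimal dbx sketch for toggling this behavior on a running kernel, using the standard kernel and memory image paths:

# dbx -k /vmunix /dev/mem
(dbx) assign delay_wbuffers = 1
(dbx) quit

A nonzero value delays flushing of full buffers until a sync call is issued; assigning 0 restores the default write-behind behavior.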
The kernel parameter cluster_max_read_ahead defines the maximum number of read-ahead clusters that the kernel can schedule. The default for cluster_max_read_ahead is 8. You can make the open algorithm faster by setting cluster_read_all to 1 and cluster_consec_init to the value of cluster_max_read_ahead. You can change these global variables with dbx. (See Section 1.5.3.1 for a general description of read and write clustering.)
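A hedged dbx sketch of the adjustment just described, assuming cluster_max_read_ahead is at its default of 8:

# dbx -k /vmunix /dev/mem
(dbx) assign cluster_read_all = 1
(dbx) assign cluster_consec_init = 8
(dbx) quit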
The cluster_maxcontig parameter is the number of blocks that will be combined into a single write. This variable controls all UFS file systems. The default value for cluster_maxcontig is 8 (which is optimal when using the default block size). You can change the value with dbx. (See Section 1.5.3.1 for a general description of read and write clustering.)
The open-max-soft and open-max-hard attributes in the sysconfigtab file control the maximum number of open file descriptors for each process. When the open-max-soft limit is reached, a warning message is issued, and when the open-max-hard limit is reached, the process is stopped. These attributes prevent runaway allocations, for example, allocations within a loop that cannot be exited because of an error condition.
The open-max-soft and open-max-hard attributes both have the value 4096 as a default. You can modify the values of the attributes by means of the sysconfigdb command.
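For example, the following hypothetical stanza doubles both limits. The proc subsystem name is an assumption; confirm which subsystem owns these attributes on your system before merging the stanza with sysconfigdb:

proc:
    open-max-soft = 8192
    open-max-hard = 8192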
Mount structures are dynamically allocated when a mount request is made and subsequently deallocated when an unmount request is made. However, the max-ufs-mounts configuration attribute in the sysconfigtab file has a default value of 1000 as an upper limit on the maximum number of mounts. If there is a need to mount more than 1000 UFS or MFS file systems on a system, this value has to be increased. You can use either of the following methods to change the value:
The POLYCENTER Advanced File System (AdvFS) is a file system option available on the Digital UNIX operating system. It provides rapid crash recovery, high performance, and a flexible structure that enables you to manage your file system while the system is on line. Optional AdvFS utilities further enhance file system management capabilities. In particular, the defragment, stripe, and migrate utilities provide online performance tuning. The AdvFS utilities are available as a separately licensed layered product.
Methods for improving AdvFS performance include the following:
LSM mirroring can improve read performance, but it slows down write performance. See the manual Logical Storage Manager for details.
Enhance AdvFS performance by dedicating an entire disk (usually partition C) to one file domain. This avoids I/O scheduling contention.
If you do not have the optional AdvFS utilities, you can defragment your disks by backing up and restoring the filesets:
Fileset quotas apply to the fileset, not to individual users or groups. By establishing quotas you can limit the amount of disk storage and number of files consumed by a fileset. This is useful when a file domain contains several filesets. Without fileset quotas, all filesets have access to all disk space in a file domain, allowing one fileset to use all of the disk space in a file domain.
You can use the defragment utility to defragment your file system frequently without reducing system availability.
File fragmentation can reduce the read/write performance of the file because it results in more I/O operations to access the file. The defragment utility reduces the amount of file fragmentation in a file domain by attempting to move files and parts of files together so that the number of file extents is reduced.
You do not need to dismount the filesets in a file domain or otherwise take the domain offline in order to run the defragment utility. You can perform all normal I/O operations while the defragment utility is running.
You can use the migrate utility in conjunction with the showfile command to improve file performance by monitoring and altering the way that large files are mapped on the disk. This method of defragmenting files is useful for defragmenting specific files. (Use the defragment utility to defragment all files in a domain.) Use the following procedure as a guideline for this method of improving file performance:
If several files in the file system are fragmented, you can add a new volume to the file domain and remove the volume containing the fragmented files. This action prompts AdvFS to automatically migrate all of the files to the new volume and defragment each file during the process.
You can use the stripe utility to distribute segments of a file across specific disks (or volumes) within a file domain. File striping provides load balancing and a higher transfer rate.
File striping increases contiguous read/write performance by allocating storage in segments across more than one disk or volume without preconfiguring the disks. AdvFS determines the number of pages per stripe segment, and the segments alternate among the disks in a sequential pattern. For instance, the file system allocates the first segment of a three-disk striped file on the first disk; the next segment on the second disk; and the next segment on the third disk. This completes one sequence, or stripe. The next stripe starts on the first disk, and so on.
The operating system uses the Common Access Method (CAM) as the operating system interface to the hardware. CAM maintains pools of buffers that are used to perform I/O. Each buffer takes approximately 1KB of physical memory. You should monitor these pools and tune them if necessary.
The following attributes can be checked with the dbx debugger and modified in the param.c file or the sysconfigtab file:
If the I/O pattern associated with your system tends to have intermittent bursts of I/O operations (I/O spikes), increasing the values of the cam_ccb_pool_size and cam_ccb_increment attributes may result in improved performance. See Section 3.3 for information on how to modify param.c parameters or sysconfigtab attributes.
Most resources used by the network subsystems are allocated and adjusted dynamically, so tuning is typically not an issue with the network itself. NFS tuning, however, can be critically important because NFS is the heaviest user of the network (see Section 3.6.3).
The one network subsystem resource that may require tuning is the number of network threads configured in your system. If the netstat -m command shows that the number of network threads configured in your system exceeds the peak number of currently active threads, your system may be configured with too many threads, thereby consuming system memory unnecessarily. To adjust the number of threads configured in your system, modify the netisrthreads attribute in the sysconfigtab file. Adjust the number downward to free up system memory.
Network performance is affected only when the supply of resources is unable to keep up with the demand for resources. Two types of conditions can cause this congestion to occur:
Neither of these problems is a network tuning issue. In the case of a problem on the network, you must isolate the problem and fix it (which may involve tuning some other components of the system). In the case of an overloaded network (for example, when the kernel regularly issues a can't get mbuf message), you must either redesign the network, reduce the number of network applications, or increase the physical memory (RAM). See the Network Programmer's Guide or the manual Network Administration for information on how to resolve network problems.
The Network File System (NFS) shares the unified buffer cache with the virtual memory subsystem and local file systems. Much of what is described in Section 3.6.1.2 also applies to NFS. For example, adding more disks on a server and spreading the I/O across spindles can greatly enhance NFS performance.
Most performance problems with NFS can be attributed to bottlenecks in the file system, network, or disk subsystems. For example, NFS performance is severely degraded by lost packets on the network. Packets can be lost as a result of a variety of network problems. Such problems include congestion in the server, corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, babbling Ethernet interfaces, and other problems), and routers that abandon forwarding attempts too readily.
Apart from adjustments to the file system, network, and disk subsystems, NFS performance can be directly enhanced in the following ways:
Prestoserve greatly improves write performance for servers that are using NFS Version 2. An NFS Version 2 server must write a client's write data to stable storage before responding to the client's write request. With Prestoserve, this write data can be stored in the NVRAM. Storing the data in this way is much faster than writing it to disk.
Prestoserve can also help improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2 because NFS Version 3 servers can reliably write data to volatile storage without risking loss of data in the event of failure. NFS Version 3 clients detect server failures and resend write data that the server may have lost in volatile storage.
See the Guide to Prestoserve for details.
To determine whether performance is being degraded by an insufficient number of nfsiod and nfsd daemons, issue the following command:
% ps alxww | grep nfs
This command displays the nfsiod and nfsd daemons that have been established to service client and server requests. If only one or two nfsiod or nfsd daemons are idle, increasing their numbers may improve NFS performance. See the nfsiod(8) and nfsd(8) reference pages for details.
You can also use the netstat -s command to verify the existence of a timeout problem; a nonzero count for "fragments dropped after timeout" in the "ip" section of the netstat output is a reliable indicator that the problem exists. See Section 2.2.11 for sample netstat output.
If fragment drops are a problem, use the mount command to set the size of the NFS read and write buffers to 1KB. For example:
mount -o rsize=1024,wsize=1024 server:/dir /mnt
Also, when evaluating NFS performance, be aware that NFS does not perform well if any file-locking mechanisms are in use on an NFS file. The locks prevent the file from being cached on the client.