This chapter describes the guidelines to tune networks and the Network File System (NFS). Many of the tuning tasks described in this chapter require you to modify system attributes. See Section 2.11 for more information about attributes.
Most resources used by the network subsystem are allocated and adjusted dynamically; however, there are some tuning recommendations that you can use to improve performance, particularly with systems that are Internet servers.
Network performance is affected when the supply of resources is unable to keep up with the demand for resources. The following two conditions can cause this congestion to occur:
A problem with one or more components of the network (hardware or software)
A workload (network traffic) that consistently exceeds the capacity of the available resources even though everything is operating correctly
Neither of these problems is a network tuning issue. In the case of a problem on the network, you must isolate and eliminate the problem. In the case of high network traffic (for example, the hit rate on a Web server has reached its maximum value while the system is 100 percent busy), you must redesign the network and redistribute the load, reduce the number of network clients, or increase the number of systems handling the network load. See the Network Programmer's Guide and the Network Administration manual for information on how to resolve network problems.
To obtain the best network performance, you must understand your workload and the performance characteristics of your network hardware, as described in Chapter 1 and the DIGITAL Systems & Options Catalog. Different network interfaces have different performance characteristics, including raw performance and system overhead. For example, a Fiber Distributed Data Interface (FDDI) interface provides better performance than an Ethernet interface.
Before you can tune your network, you must determine whether the source of the performance problem is an application, network interconnect, network controller, or the communication partner. Table 6-1 lists network subsystem tuning guidelines and performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Increase the size of the hash table that the kernel uses to look up TCP control blocks (Section 6.1.1) | Improves the TCP control block lookup rate and increases the raw connection rate | Slightly increases the amount of wired memory |
Increase the limits for partial TCP connections on the socket listen queue (Section 6.1.2) | Improves throughput and response time on systems that handle a large number of connections | Consumes memory when pending connections are retained in the queue |
Increase the maximum number of concurrent nonreserved, dynamically allocated ports (Section 6.1.3) | Allows more simultaneous outgoing connections | Negligible increase in memory usage |
Enable TCP keepalive functionality (Section 6.1.4) | Enables inactive socket connections to time out | None |
Increase the size of the kernel interface alias table (Section 6.1.5) | Improves the IP address lookup rate for systems that serve many domain names | Slightly increases the amount of wired memory |
Make partial TCP connections time out more quickly (Section 6.1.6) | Prevents clients from overfilling the socket listen queue | A short time limit may cause viable connections to break prematurely |
Make the TCP connection context time out more quickly at the end of the connection (Section 6.1.7) | Frees connection resources sooner | Reducing the timeout limit increases the potential for data corruption, so this guideline should be applied with caution |
Reduce the TCP retransmission rate (Section 6.1.8) | Prevents premature retransmissions and decreases congestion | A long retransmit time is not appropriate for all configurations |
Enable the immediate acknowledgement of TCP data (Section 6.1.9) | Can improve network performance for some connections | May adversely affect network bandwidth |
Increase the TCP maximum segment size (Section 6.1.10) | Allows sending more data per packet | May result in fragmentation at router boundary |
Increase the size of the transmit and receive buffers for a TCP socket (Section 6.1.11) | Buffers more TCP packets per socket | May decrease available memory when the buffer space is being used |
Increase the size of the transmit and receive buffers for a UDP socket (Section 6.1.12) | Helps to prevent dropping UDP packets | May decrease available memory when the buffer space is being used |
Allocate sufficient memory to the UBC (Section 6.1.13) | Improves disk I/O performance | May decrease the physical memory available to the virtual memory subsystem |
Disable the use of a PMTU (Section 6.1.14) | Improves the efficiency of Web servers that handle remote traffic from many clients | May reduce server efficiency for LAN traffic |
The following sections describe these tuning guidelines in detail.
You can modify the size of the hash table that the kernel uses to look up Transmission Control Protocol (TCP) control blocks. The tcbhashsize attribute specifies the number of hash buckets in the kernel TCP connection table (the number of buckets in the inpcb hash table). The kernel must look up the connection block for every TCP packet it receives, so increasing the size of the table can speed the search and improve performance. The default value is 32. For Web servers and proxy servers, set the tcbhashsize attribute to 16384.
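For example, to make this change persist across reboots, you could add a stanza like the following to /etc/sysconfigtab. This is a sketch; it assumes that the tcbhashsize attribute belongs to the inet subsystem, which you can verify with the sysconfig -q inet command.

    inet:
        tcbhashsize = 16384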
You may be able to improve performance by increasing the limits for the socket listen queue (only for TCP). The somaxconn attribute specifies the maximum number of pending TCP connections (the socket listen queue limit) for each server socket. If the listen queue connection limit is too small, incoming connect requests may be dropped. Note that pending TCP connections can be caused by lost packets in the Internet or by denial-of-service attacks. The default value of the somaxconn attribute is 1024; the maximum value is 65535.
To improve throughput and response time with fewer drops, you can increase the value of the somaxconn attribute. A busy system running applications that generate a large number of connections (for example, a Web server) may have many pending connections. For these systems, set the value of the somaxconn attribute to the maximum value of 65535.

The sominconn attribute specifies the minimum number of pending TCP connections (backlog) for each server socket. The attribute controls how many SYN packets can be handled simultaneously before additional requests are discarded. The default value is 0. The value of the sominconn attribute overrides the application-specific backlog value, which may be set too low for some server software. To improve performance without recompiling an application, you can set the value of the sominconn attribute to the maximum value of 65535. The value of the sominconn attribute should be the same as the value of the somaxconn attribute.
Network performance can degrade if a client saturates a socket listen queue with erroneous TCP SYN packets, effectively blocking other users from the queue. To eliminate this problem, increase the value of the sominconn attribute to 65535. If the system continues to drop incoming SYN packets, you can decrease the value of the tcp_keepinit attribute to 30 units (15 seconds).
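To set both listen queue attributes permanently, you could add a stanza like the following to /etc/sysconfigtab. This sketch assumes that both attributes are in the socket subsystem; verify the subsystem name on your system before editing the file.

    socket:
        somaxconn = 65535
        sominconn = 65535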
Three socket subsystem attributes monitor socket listen queue events:

The sobacklog_hiwat attribute counts the maximum number of pending requests to any server socket.

The sobacklog_drops attribute counts the number of connect requests dropped because a socket's configured backlog limit was exceeded.

The somaxconn_drops attribute counts the number of drops that exceeded the value of the somaxconn attribute.

Use the sysconfig -q socket command to display the values of these attributes. If the values show that the queues are overflowing, you may need to increase the socket listen queue limit. See Section 2.9.3 for information about monitoring the sobacklog_hiwat, sobacklog_drops, and somaxconn_drops attributes.
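For example, the following command displays the three counters. Passing attribute names after the subsystem name restricts the output to those attributes; if your version of sysconfig does not accept the list, omit it and search the full output.

    # sysconfig -q socket sobacklog_hiwat sobacklog_drops somaxconn_drops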
The ipport_userreserved attribute controls the number of outgoing connections that you can make simultaneously to other systems. The number of outgoing ports is the value of the ipport_userreserved attribute minus 1024. The default value of the attribute is 5000; therefore, the default number of outgoing ports is 3976. The maximum value of the ipport_userreserved attribute is 65535.

When the kernel dynamically allocates a nonreserved port number for use by a TCP or UDP application that creates an outgoing connection, it selects the port number from a range of values between 1024 and the value of the ipport_userreserved attribute. Because each TCP client must use one of these ports, the range limits the number of simultaneous outgoing connections to the value of the attribute minus 1024. If your system requires many outgoing ports, you may need to increase the value of the ipport_userreserved attribute. If your system is a proxy server with a load of more than 4000 connections, increase the value of the ipport_userreserved attribute to 65535. DIGITAL does not recommend reducing the value of the ipport_userreserved attribute to less than 5000.
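For example, the following command raises the limit on the running system. This sketch assumes the attribute is in the inet subsystem; add an equivalent entry to /etc/sysconfigtab to preserve the change across reboots.

    # sysconfig -r inet ipport_userreserved=65535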
Keepalive functionality enables the periodic transmission of messages on a connected socket in order to keep connections active. If you enable keepalive, sockets that do not exit cleanly are cleaned up when the keepalive interval expires. If keepalive is not enabled, those sockets will continue to exist until you reboot the system.
Applications enable keepalive for sockets by setting the setsockopt function's SO_KEEPALIVE option. To override programs that do not set keepalive on their own, or if you do not have access to the application sources, set the tcp_keepalive_default attribute to 1 to enable keepalive for all sockets.
If you enable keepalive, you can also configure the following TCP options for each socket (a sample configuration follows this list):

The tcp_keepidle attribute specifies the amount of idle time, in 0.5-second units, before the first keepalive probe is sent. The default interval is 2 hours.

The tcp_keepintvl attribute specifies the amount of time, in 0.5-second units, between retransmissions of keepalive probes. The default interval is 75 seconds.

The tcp_keepcnt attribute specifies the maximum number of keepalive probes that are sent before the connection is dropped. The default is 8 probes.

The tcp_keepinit attribute specifies the maximum amount of time, in 0.5-second units, before an initial connection attempt times out. The default is 75 seconds.
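The following /etc/sysconfigtab stanza is a sketch only: the values are illustrative, the inet subsystem name is an assumption to verify on your system, and the comment lines assume the stanza file format accepts # comments. Remember that the idle and interval attributes are expressed in 0.5-second units.

    inet:
        # Enable keepalive for all sockets, even if the
        # application does not set SO_KEEPALIVE itself.
        tcp_keepalive_default = 1
        # Probe after 1 hour of idle time (7200 units x 0.5 seconds).
        tcp_keepidle = 7200
        # Retransmit probes every 75 seconds (150 units x 0.5 seconds).
        tcp_keepintvl = 150
        # Drop the connection after 8 unanswered probes.
        tcp_keepcnt = 8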
The inifaddr_hsize attribute specifies the number of hash buckets in the kernel interface alias table (in_ifaddr). The default value of the inifaddr_hsize attribute is 32; the maximum value is 512.

If a system is used as a server for many different server domain names, each of which is bound to a unique IP address, the code that matches arriving packets to the correct server address uses the hash table to speed lookup operations for the IP addresses. Increasing the number of hash buckets in the table can improve performance on systems that use a large number of aliases.

The value of the inifaddr_hsize attribute is always rounded down to the nearest power of 2, so specify a power of 2 for the best performance. If you are using more than 500 interface IP aliases, specify the maximum value of 512. If you are using fewer than 250 aliases, use the default value of 32.
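For example, a server that hosts several hundred IP aliases might use the following /etc/sysconfigtab stanza (again assuming the attribute is in the inet subsystem):

    inet:
        inifaddr_hsize = 512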
The tcp_keepinit attribute is the amount of time that a partially established TCP connection remains on the socket listen queue before it times out. The value of the attribute is in units of 0.5 seconds. The default value is 150 units (75 seconds). Partial connections consume listen queue slots and fill the queue with connections in the SYN_RCVD state.

You can make partial connections time out sooner by decreasing the value of the tcp_keepinit attribute. However, do not set the value too low, because you may prematurely break connections associated with clients on network paths that are slow or that lose many packets. Do not set the value to less than 20 units (10 seconds). If you have a socket listen queue limit of 32000, the default (75 seconds) is usually adequate.

Network performance can degrade if a client overfills a socket listen queue with TCP SYN packets, effectively blocking other users from the queue. To eliminate this problem, increase the value of the sominconn attribute to the maximum value of 65535. If the system continues to drop SYN packets, decrease the value of the tcp_keepinit attribute to 30 units (15 seconds).
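For example, to shorten the partial-connection timeout to 15 seconds on the running system (assuming the attribute is in the inet subsystem):

    # sysconfig -r inet tcp_keepinit=30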
You can make the TCP connection context time out more quickly at the end of a connection. However, this will increase the chance of data corruption.
The TCP protocol includes a concept known as the Maximum Segment Lifetime (MSL). When a TCP connection enters the TIME_WAIT state, it must remain in this state for twice the value of the MSL, or else undetected data errors on future connections can occur. The tcp_msl attribute determines the maximum lifetime of a TCP segment and the timeout value for the TIME_WAIT state. The value of the attribute is set in units of 0.5 seconds. The default value is 60 units (30 seconds), which means that the TCP connection remains in the TIME_WAIT state for 60 seconds (twice the value of the MSL).

In some situations, the default timeout value for the TIME_WAIT state (60 seconds) is too large, so reducing the value of the tcp_msl attribute frees connection resources sooner than the default behavior. Do not reduce the value of the tcp_msl attribute unless you fully understand the design and behavior of your network and the TCP protocol. DIGITAL strongly recommends using the default value; otherwise, there is the potential for data corruption.
The tcp_rexmit_interval_min attribute specifies the minimum amount of time before the first TCP retransmission. For some wide area networks (WANs), the default value may be too small, causing premature retransmission timeouts. This may lead to duplicate transmission of packets and the erroneous invocation of the TCP congestion-control algorithms.

The tcp_rexmit_interval_min attribute is specified in units of 0.5 seconds. The default value is 1 unit (0.5 seconds). You can increase the value of the tcp_rexmit_interval_min attribute to slow the rate of TCP retransmissions, which decreases congestion and improves performance. However, not every connection needs a long retransmission time; usually, the default value is adequate. Do not specify a value that is less than 1 unit. Do not change the attribute unless you fully understand TCP algorithms.
The value of the tcpnodelack attribute determines whether the system delays acknowledging TCP data. The default value is 0, which delays the acknowledgment of TCP data. Usually, the default is adequate; however, for some connections (for example, loopback), the delay can degrade performance. You may be able to improve network performance by setting the value of the tcpnodelack attribute to 1, which disables the acknowledgment delay. However, this may adversely affect network bandwidth. Use the tcpdump command to check for excessive delays.
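For example, the following hypothetical commands disable the acknowledgment delay on the running system and then watch a client's TCP traffic for delayed acknowledgments. The inet subsystem name, the tu0 interface, and the client1 host name are placeholders to adjust for your configuration.

    # sysconfig -r inet tcpnodelack=1
    # tcpdump -i tu0 host client1 and tcp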
The tcp_mssdflt attribute specifies the TCP maximum segment size; the default value is 536. You can increase the value to 1460. This allows sending more data per packet, but may cause fragmentation at the router boundary.
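For example, on the running system (assuming the attribute is in the inet subsystem):

    # sysconfig -r inet tcp_mssdflt=1460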
The tcp_sendspace attribute specifies the default transmit buffer size for a TCP socket, and the tcp_recvspace attribute specifies the default receive buffer size for a TCP socket. The default value of both attributes is 32 KB. You can increase the value of these attributes to 60 KB, which allows you to buffer more TCP packets per socket. However, increasing the values uses more memory when the buffers are being used by an application (sending or receiving data).
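A sample /etc/sysconfigtab stanza, expressing 60 KB in bytes and assuming the attributes are in the inet subsystem:

    inet:
        tcp_sendspace = 61440
        tcp_recvspace = 61440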
The udp_sendspace attribute specifies the default transmit buffer size for an Internet User Datagram Protocol (UDP) socket; the default value is 9 KB. The udp_recvspace attribute specifies the default receive buffer size for a UDP socket; the default value is 40 KB. You can increase the values of these attributes to 64 KB. However, increasing the values uses more memory when the buffers are being used by an application (sending or receiving data).
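Similarly, a sketch of the UDP stanza with 64 KB expressed in bytes (the inet subsystem name is again an assumption):

    inet:
        udp_sendspace = 65536
        udp_recvspace = 65536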
You must ensure that sufficient memory is allocated to the Unified Buffer Cache (UBC). Servers that perform large amounts of file I/O (for example, Web and proxy servers) make extensive use of both the UBC and the virtual memory subsystem. In most cases, use the default value of 100 percent for the ubc-maxpercent attribute, which specifies the maximum percentage of physical memory that can be allocated to the UBC. If necessary, you can decrease the value of the attribute in increments of 10 percent. See Section 4.8 for more information about tuning the UBC.
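If you do decide to reduce the UBC, the following sketch shows a first 10-percent step. It assumes the ubc-maxpercent attribute belongs to the vm subsystem, so verify the subsystem name and read Section 4.8 before making the change.

    vm:
        ubc-maxpercent = 90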
Packets transmitted between servers are fragmented into units of a specific size to ease transmission of the data over routers and small-packet networks, such as Ethernet networks. When the pmtu_enabled attribute is enabled (the default behavior), the system determines the largest common path maximum transmission unit (PMTU) value between servers and uses it as the unit size. The system also creates a routing table entry for each client network that attempts to connect to the server.

On a Web server that handles local traffic and some remote traffic, enabling the use of a PMTU can improve bandwidth. However, if a Web server handles traffic among many remote clients, enabling the use of a PMTU can cause an excessive increase in the size of the kernel routing table, which can reduce server efficiency. If a Web server has poor performance and the routing table increases to more than 1000 entries, set the value of the pmtu_enabled attribute to 0 to disable the use of the PMTU protocol.
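For example, you might first estimate the size of the routing table and then disable PMTU discovery on the running system. The inet subsystem name is an assumption, and the wc -l count includes a few header lines, so treat it as approximate.

    # netstat -rn | wc -l
    # sysconfig -r inet pmtu_enabled=0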
The Network File System (NFS) shares the unified buffer cache with the virtual memory subsystem and local file systems. Most performance problems with NFS can be attributed to bottlenecks in the virtual memory, network, or disk subsystem.
Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.
You can monitor NFS by using the nfsstat command. When evaluating NFS performance, remember that NFS does not perform well if any file-locking mechanisms are in use on an NFS file, because the locks prevent the file from being cached on the client. See nfsstat(8) for more information.
Table 6-2 lists NFS tuning guidelines and performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Use Prestoserve (Section 6.2.1) | Improves synchronous write performance | Cost |
Use the appropriate number of nfsd daemons on the server (Section 6.2.2) | Enables efficient I/O blocking operations | None |
Use the appropriate number of nfsiod daemons on the client (Section 6.2.3) | Enables efficient I/O blocking operations | None |
Increase the number of I/O threads (Section 6.2.4) | May improve NFS read and write performance | None |
Modify cache timeout limits (Section 6.2.5) | May improve network performance for read-only file systems and slow network links | None |
Decrease network timeouts (Section 6.2.6) | May improve performance for slow or congested networks | None |
Use NFS protocol Version 3.0 (Section 6.2.7) | Improves network performance | Decreases the performance benefit of Prestoserve |
The following sections describe these guidelines in detail.
You can improve NFS performance by installing Prestoserve on the server. Prestoserve greatly improves synchronous write performance for servers that are using NFS Version 2. Prestoserve enables an NFS Version 2 server to write client data to a stable (nonvolatile) cache, instead of writing the data to disk.
Prestoserve may improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2, because NFS Version 3 servers can reliably write data to volatile storage without risking loss of data in the event of failure. NFS Version 3 clients can detect server failures and resend any write data that the server may have lost in volatile storage.
See the Guide to Prestoserve for more information.
Servers use nfsd daemons to handle NFS requests from client machines. The number of nfsd daemons determines the number of parallel operations and must be a multiple of 8. For good performance on frequently used NFS servers, configure a network with either 16 or 32 nfsd daemons. Having exactly 16 or 32 nfsd daemons produces the most efficient blocking for I/O operations.
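For example, on systems that record the daemon count in the NUM_NFSD variable in /etc/rc.config (an assumption; check your NFS startup script for the variable it actually reads), you might set 32 daemons and restart NFS:

    # rcmgr set NUM_NFSD 32
    # /sbin/init.d/nfs stop
    # /sbin/init.d/nfs start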
Clients use nfsiod daemons to service asynchronous I/O operations, such as buffer cache readahead and delayed write operations. The number of nfsiod daemons determines the number of outstanding I/O operations and must be a multiple of 8 minus 1 (for example, 7 or 15). NFS servers attempt to gather writes into complete UFS clusters before initiating I/O, and the number of nfsiod daemons (plus 1) is the number of writes that a client can have outstanding at any one time. Having exactly 7 or 15 nfsiod daemons produces the most efficient blocking for I/O operations.

If write gathering is enabled and the client is not running any nfsiod daemons, you may experience a performance degradation. To disable write gathering, use dbx to set the nfs_write_gather kernel variable to 0, as in the following example.
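A sketch of the dbx session: the assign command changes only the running kernel, and the patch command (which writes the value to the on-disk kernel image so that it persists across reboots) is shown on the assumption that your version of dbx supports it.

    # dbx -k /vmunix /dev/mem
    (dbx) assign nfs_write_gather = 0
    (dbx) patch nfs_write_gather = 0
    (dbx) quit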
On a client system, the nfsiod daemons spawn several I/O threads to service asynchronous I/O operations to the server. The I/O threads improve the performance of both NFS reads and writes. The optimum number of I/O threads depends on many variables, such as how quickly the client will be writing, how many files will be accessed simultaneously, and the characteristics of the NFS server. For most clients, seven threads are sufficient.

Use the ps axlmp 0 | grep nfs command to display idle I/O threads on the client. If few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See Chapter 2, nfsiod(8), and nfsd(8) for more information.
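For example, to inspect the idle threads and, if few are sleeping, start a larger set of daemons (this assumes that nfsiod accepts the daemon count as its argument, as described in nfsiod(8)):

    # ps axlmp 0 | grep nfs
    # nfsiod 15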
For read-only file systems and slow network links, performance may be improved by changing the cache timeout limits. These timeouts affect how quickly you see updates to a file or directory that has been modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will give you slightly better performance and will reduce the amount of network traffic that you generate.
See mount(8) and the descriptions of the acregmin, acregmax, acdirmin, acdirmax, and actimeo options for more information.
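For example, a hypothetical mount of a read-only file system that lengthens the attribute cache timeout to 120 seconds; the server, export, and mount point names are placeholders:

    # mount -o ro,actimeo=120 server:/export /mnt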
NFS does not perform well if it is used over slow network links, congested networks, or wide area networks (WANs). In particular, network timeouts can severely degrade NFS performance. You can identify this condition by using the nfsstat command and determining the ratio of timeouts to calls. If timeouts are more than 1 percent of total calls, NFS performance may be severely degraded. See Chapter 2 for sample nfsstat output of timeout and call statistics, and nfsstat(8) for more information.
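For example, the following command displays client RPC statistics; compare the timeout count with the calls count, and investigate if the ratio exceeds 1 percent:

    # nfsstat -c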
You can also use the netstat -s command to verify the existence of a timeout problem. A nonzero count for fragments dropped after timeout in the ip section of the netstat output may indicate that the problem exists. See Chapter 2 for sample netstat command output.
If fragment drops are a problem, use the mount command with the -rsize=1024 and -wsize=1024 options to set the size of the NFS read and write buffers to 1 KB.
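A hypothetical example, shown in -o option-list form (the server, export, and mount point names are placeholders):

    # mount -o rsize=1024,wsize=1024 server:/export /mnt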
NFS protocol Version 3.0 provides client-side asynchronous write support, which improves client perception of performance, improves the cache consistency protocol, and requires less network load than Version 2. Protocol Version 3 decreases the performance benefit of Prestoserve.