This chapter describes the guidelines to tune networks and the Network File System (NFS). Many of the tuning tasks described in this chapter require you to modify system attributes. See Section 2.11 for more information about attributes.
Most resources used by the network subsystem are allocated and adjusted dynamically; however, there are some tuning recommendations that you can use to improve performance, particularly with systems that are Internet servers.
Network performance is affected when the supply of resources is unable to keep up with the demand for resources. The following two conditions can cause this congestion to occur:
A problem with one or more components of the network (hardware or software)
A workload (network traffic) that consistently exceeds the capacity of the available resources even though everything is operating correctly
Neither of these problems is a network tuning issue. In the case of a problem on the network, you must isolate and eliminate the problem. In the case of high network traffic (for example, the hit rate on a Web server has reached its maximum value while the system is 100 percent busy), you must redesign the network and redistribute the load, reduce the number of network clients, or increase the number of systems handling the network load. See the Network Programmer's Guide and the Network Administration manual for information on how to resolve network problems.
To obtain the best network performance, you must understand your workload and the performance characteristics of your network hardware, as described in Chapter 1 and the DIGITAL Systems & Options Catalog. Different network interfaces have different performance characteristics, including raw performance and system overhead. For example, a Fiber Distributed Data Interface (FDDI) interface provides better performance than an Ethernet interface.
Before you can tune your network, you must determine whether the source of the performance problem is an application, network interconnect, network controller, or the communication partner. Table 6-1 lists network subsystem tuning guidelines and performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Increase the size of the hash table that the kernel uses to look up TCP control blocks (Section 6.1.1) | Improves the TCP control block lookup rate and increases the raw connection rate | Slightly increases the amount of wired memory |
Increase the limits for partial TCP connections on the socket listen queue (Section 6.1.2) | Improves throughput and response time on systems that handle a large number of connections | Consumes memory when pending connections are retained in the queue |
Increase the maximum number of concurrent nonreserved, dynamically allocated ports (Section 6.1.3) | Allows more simultaneous outgoing connections | Negligible increase in memory usage |
Enable TCP keepalive functionality (Section 6.1.4) | Enables inactive socket connections to time out | None |
Increase the size of the kernel interface alias table (Section 6.1.5) | Improves the IP address lookup rate for systems that serve many domain names | Slightly increases the amount of wired memory |
Make partial TCP connections time out more quickly (Section 6.1.6) | Prevents clients from overfilling the socket listen queue | A short time limit may cause viable connections to break prematurely |
Make the TCP connection context time out more quickly at the end of the connection (Section 6.1.7) | Frees connection resources sooner | Reducing the timeout limit increases the potential for data corruption, so this guideline should be applied with caution |
Reduce the TCP retransmission rate (Section 6.1.8) | Prevents premature retransmissions and decreases congestion | A long retransmit time is not appropriate for all configurations |
Enable the immediate acknowledgement of TCP data (Section 6.1.9) | Can improve network performance for some connections | May adversely affect network bandwidth |
Increase the TCP maximum segment size (Section 6.1.10) | Allows sending more data per packet | May result in fragmentation at router boundary |
Increase the size of the transmit and receive buffers for a TCP socket (Section 6.1.11) | Buffers more TCP packets per socket | May decrease available memory when the buffer space is being used |
Increase the size of the transmit and receive buffers for a UDP socket (Section 6.1.12) | Helps to prevent dropping UDP packets | May decrease available memory when the buffer space is being used |
Allocate sufficient memory to the UBC (Section 6.1.13) | Improves disk I/O performance | May decrease the physical memory available to the virtual memory subsystem |
Disable the use of a PMTU (Section 6.1.14) | Improves the efficiency of Web servers that handle remote traffic from many clients | May reduce server efficiency for LAN traffic |
The following sections describe these tuning guidelines in detail.
You can modify the size of the hash table that the kernel uses to look up Transmission Control Protocol (TCP) control blocks. The tcbhashsize attribute specifies the number of hash buckets in the kernel TCP connection table (the number of buckets in the inpcb hash table). The kernel must look up the connection block for every TCP packet it receives, so increasing the size of the table can speed the search and improve performance. The default value is 32. For Web servers and proxy servers, set the tcbhashsize attribute to 16384.
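For example, to make this change persist across reboots, you could add a stanza like the following to /etc/sysconfigtab. This is a sketch; it assumes that the tcbhashsize attribute belongs to the inet subsystem, which you can verify with the sysconfig -q inet command.

    inet:
        tcbhashsize = 16384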
You may be able to improve performance by increasing the limits for the socket listen queue (only for TCP). The somaxconn attribute specifies the maximum number of pending TCP connections (the socket listen queue limit) for each server socket. If the listen queue connection limit is too small, incoming connect requests may be dropped. Note that pending TCP connections can be caused by lost packets in the Internet or by denial-of-service attacks. The default value of the somaxconn attribute is 1024; the maximum value is 65535.
To improve throughput and response time with fewer drops, you can increase the value of the somaxconn attribute. A busy system running applications that generate a large number of connections (for example, a Web server) may have many pending connections. For these systems, set the value of the somaxconn attribute to the maximum value of 65535.

The sominconn attribute specifies the minimum number of pending TCP connections (backlog) for each server socket. The attribute controls how many SYN packets can be handled simultaneously before additional requests are discarded. The default value is 0. The value of the sominconn attribute overrides the application-specific backlog value, which may be set too low for some server software. To improve performance without recompiling an application, you can set the value of the sominconn attribute to the maximum value of 65535. The value of the sominconn attribute should be the same as the value of the somaxconn attribute.
Network performance can degrade if a client saturates a socket listen queue with erroneous TCP SYN packets, effectively blocking other users from the queue. To eliminate this problem, increase the value of the sominconn attribute to 65535. If the system continues to drop incoming SYN packets, you can decrease the value of the tcp_keepinit attribute to 30 units (15 seconds).
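To set both listen queue attributes permanently, you could add a stanza like the following to /etc/sysconfigtab. This sketch assumes that both attributes are in the socket subsystem; verify the subsystem name on your system before editing the file.

    socket:
        somaxconn = 65535
        sominconn = 65535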
Three socket subsystem attributes monitor socket listen queue events:

The sobacklog_hiwat attribute counts the maximum number of pending requests to any server socket.

The sobacklog_drops attribute counts the number of connect requests dropped because a socket's configured backlog limit was exceeded.

The somaxconn_drops attribute counts the number of drops that exceeded the value of the somaxconn attribute.

Use the sysconfig -q socket command to display the values of these attributes. If the values show that the queues are overflowing, you may need to increase the socket listen queue limit. See Section 2.9.3 for information about monitoring the sobacklog_hiwat, sobacklog_drops, and somaxconn_drops attributes.
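For example, the following command displays the three counters. Passing attribute names after the subsystem name restricts the output to those attributes; if your version of sysconfig does not accept the list, omit it and search the full output.

    # sysconfig -q socket sobacklog_hiwat sobacklog_drops somaxconn_drops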
The ipport_userreserved attribute controls the number of outgoing connections that you can make simultaneously to other systems. The number of outgoing ports is the value of the ipport_userreserved attribute minus 1024. The default value of the attribute is 5000; therefore, the default number of outgoing ports is 3976. The maximum value of the ipport_userreserved attribute is 65535.

When the kernel dynamically allocates a nonreserved port number for use by a TCP or UDP application that creates an outgoing connection, it selects the port number from a range of values between 1024 and the value of the ipport_userreserved attribute. Because each TCP client must use one of these ports, the range limits the number of simultaneous outgoing connections to the value of the attribute minus 1024. If your system requires many outgoing ports, you may need to increase the value of the ipport_userreserved attribute. If your system is a proxy server with a load of more than 4000 connections, increase the value of the ipport_userreserved attribute to 65535. DIGITAL does not recommend reducing the value of the ipport_userreserved attribute to less than 5000.
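For example, the following command raises the limit on the running system. This sketch assumes the attribute is in the inet subsystem; add an equivalent entry to /etc/sysconfigtab to preserve the change across reboots.

    # sysconfig -r inet ipport_userreserved=65535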
Keepalive functionality enables the periodic transmission of messages on a connected socket in order to keep connections active. If you enable keepalive, sockets that do not exit cleanly are cleaned up when the keepalive interval expires. If keepalive is not enabled, those sockets will continue to exist until you reboot the system.
Applications enable keepalive for sockets by setting the setsockopt function's SO_KEEPALIVE option. To override programs that do not set keepalive on their own, or if you do not have access to the application sources, set the tcp_keepalive_default attribute to 1 to enable keepalive for all sockets.
If you enable keepalive, you can also configure the following TCP options for each socket (a sample configuration follows this list):

The tcp_keepidle attribute specifies the amount of idle time, in 0.5-second units, before the first keepalive probe is sent. The default interval is 2 hours.

The tcp_keepintvl attribute specifies the amount of time, in 0.5-second units, between retransmissions of keepalive probes. The default interval is 75 seconds.

The tcp_keepcnt attribute specifies the maximum number of keepalive probes that are sent before the connection is dropped. The default is 8 probes.

The tcp_keepinit attribute specifies the maximum amount of time, in 0.5-second units, before an initial connection attempt times out. The default is 75 seconds.
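The following /etc/sysconfigtab stanza is a sketch only: the values are illustrative, the inet subsystem name is an assumption to verify on your system, and the comment lines assume the stanza file format accepts # comments. Remember that the idle and interval attributes are expressed in 0.5-second units.

    inet:
        # Enable keepalive for all sockets, even if the
        # application does not set SO_KEEPALIVE itself.
        tcp_keepalive_default = 1
        # Probe after 1 hour of idle time (7200 units x 0.5 seconds).
        tcp_keepidle = 7200
        # Retransmit probes every 75 seconds (150 units x 0.5 seconds).
        tcp_keepintvl = 150
        # Drop the connection after 8 unanswered probes.
        tcp_keepcnt = 8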
The inifaddr_hsize attribute specifies the number of hash buckets in the kernel interface alias table (in_ifaddr). The default value of the inifaddr_hsize attribute is 32; the maximum value is 512.

If a system is used as a server for many different server domain names, each of which is bound to a unique IP address, the code that matches arriving packets to the correct server address uses the hash table to speed lookup operations for the IP addresses. Increasing the number of hash buckets in the table can improve performance on systems that use a large number of aliases.

The value of the inifaddr_hsize attribute is always rounded down to the nearest power of 2, so specify a power of 2 for the best performance. If you are using more than 500 interface IP aliases, specify the maximum value of 512. If you are using fewer than 250 aliases, use the default value of 32.
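For example, a server that hosts several hundred IP aliases might use the following /etc/sysconfigtab stanza (again assuming the attribute is in the inet subsystem):

    inet:
        inifaddr_hsize = 512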
The tcp_keepinit attribute is the amount of time that a partially established TCP connection remains on the socket listen queue before it times out. The value of the attribute is in units of 0.5 seconds. The default value is 150 units (75 seconds). Partial connections consume listen queue slots and fill the queue with connections in the SYN_RCVD state.

You can make partial connections time out sooner by decreasing the value of the tcp_keepinit attribute. However, do not set the value too low, because you may prematurely break connections associated with clients on network paths that are slow or that lose many packets. Do not set the value to less than 20 units (10 seconds). If you have a socket listen queue limit of 32000, the default (75 seconds) is usually adequate.

Network performance can degrade if a client overfills a socket listen queue with TCP SYN packets, effectively blocking other users from the queue. To eliminate this problem, increase the value of the sominconn attribute to the maximum value of 65535. If the system continues to drop SYN packets, decrease the value of the tcp_keepinit attribute to 30 units (15 seconds).
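For example, to shorten the partial-connection timeout to 15 seconds on the running system (assuming the attribute is in the inet subsystem):

    # sysconfig -r inet tcp_keepinit=30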
You can make the TCP connection context time out more quickly at the end of a connection. However, this will increase the chance of data corruption.
The TCP protocol includes a concept known as the Maximum Segment Lifetime (MSL). When a TCP connection enters the TIME_WAIT state, it must remain in this state for twice the value of the MSL, or else undetected data errors on future connections can occur. The tcp_msl attribute determines the maximum lifetime of a TCP segment and the timeout value for the TIME_WAIT state. The value of the attribute is set in units of 0.5 seconds. The default value is 60 units (30 seconds), which means that the TCP connection remains in the TIME_WAIT state for 60 seconds (twice the value of the MSL).

In some situations, the default timeout value for the TIME_WAIT state (60 seconds) is too large, so reducing the value of the tcp_msl attribute frees connection resources sooner than the default behavior. Do not reduce the value of the tcp_msl attribute unless you fully understand the design and behavior of your network and the TCP protocol. DIGITAL strongly recommends using the default value; otherwise, there is the potential for data corruption.
The tcp_rexmit_interval_min attribute specifies the minimum amount of time before the first TCP retransmission. For some wide area networks (WANs), the default value may be too small, causing premature retransmission timeouts. This may lead to duplicate transmission of packets and the erroneous invocation of the TCP congestion-control algorithms.

The tcp_rexmit_interval_min attribute is specified in units of 0.5 seconds. The default value is 1 unit (0.5 seconds). You can increase the value of the tcp_rexmit_interval_min attribute to slow the rate of TCP retransmissions, which decreases congestion and improves performance. However, not every connection needs a long retransmission time; usually, the default value is adequate. Do not specify a value that is less than 1 unit. Do not change the attribute unless you fully understand TCP algorithms.
The value of the tcpnodelack attribute determines whether the system delays acknowledging TCP data. The default value is 0, which delays the acknowledgment of TCP data. Usually, the default is adequate; however, for some connections (for example, loopback), the delay can degrade performance. You may be able to improve network performance by setting the value of the tcpnodelack attribute to 1, which disables the acknowledgment delay. However, this may adversely affect network bandwidth. Use the tcpdump command to check for excessive delays.
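For example, the following hypothetical commands disable the acknowledgment delay on the running system and then watch a client's TCP traffic for delayed acknowledgments. The inet subsystem name, the tu0 interface, and the client1 host name are placeholders to adjust for your configuration.

    # sysconfig -r inet tcpnodelack=1
    # tcpdump -i tu0 host client1 and tcp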
The tcp_mssdflt attribute specifies the TCP maximum segment size; the default value is 536. You can increase the value to 1460. This allows sending more data per packet, but may cause fragmentation at the router boundary.
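For example, on the running system (assuming the attribute is in the inet subsystem):

    # sysconfig -r inet tcp_mssdflt=1460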
The tcp_sendspace attribute specifies the default transmit buffer size for a TCP socket, and the tcp_recvspace attribute specifies the default receive buffer size for a TCP socket. The default value of both attributes is 32 KB. You can increase the value of these attributes to 60 KB, which allows you to buffer more TCP packets per socket. However, increasing the values uses more memory when the buffers are being used by an application (sending or receiving data).
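A sample /etc/sysconfigtab stanza, expressing 60 KB in bytes and assuming the attributes are in the inet subsystem:

    inet:
        tcp_sendspace = 61440
        tcp_recvspace = 61440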
The udp_sendspace attribute specifies the default transmit buffer size for an Internet User Datagram Protocol (UDP) socket; the default value is 9 KB. The udp_recvspace attribute specifies the default receive buffer size for a UDP socket; the default value is 40 KB. You can increase the values of these attributes to 64 KB. However, increasing the values uses more memory when the buffers are being used by an application (sending or receiving data).
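Similarly, a sketch of the UDP stanza with 64 KB expressed in bytes (the inet subsystem name is again an assumption):

    inet:
        udp_sendspace = 65536
        udp_recvspace = 65536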
You must ensure that sufficient memory is allocated to the Unified Buffer Cache (UBC). Servers that perform large amounts of file I/O (for example, Web and proxy servers) make extensive use of both the UBC and the virtual memory subsystem. In most cases, use the default value of 100 percent for the ubc-maxpercent attribute, which specifies the maximum percentage of physical memory that can be allocated to the UBC. If necessary, you can decrease the value of the attribute in increments of 10 percent. See Section 4.8 for more information about tuning the UBC.
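If you do decide to reduce the UBC, the following sketch shows a first 10-percent step. It assumes the ubc-maxpercent attribute belongs to the vm subsystem, so verify the subsystem name and read Section 4.8 before making the change.

    vm:
        ubc-maxpercent = 90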
Packets transmitted between servers are fragmented into units of a specific size to ease transmission of the data over routers and small-packet networks, such as Ethernet networks. When the pmtu_enabled attribute is enabled (the default behavior), the system determines the largest common path maximum transmission unit (PMTU) value between servers and uses it as the unit size. The system also creates a routing table entry for each client network that attempts to connect to the server.

On a Web server that handles local traffic and some remote traffic, enabling the use of a PMTU can improve bandwidth. However, if a Web server handles traffic among many remote clients, enabling the use of a PMTU can cause an excessive increase in the size of the kernel routing table, which can reduce server efficiency. If a Web server has poor performance and the routing table increases to more than 1000 entries, set the value of the pmtu_enabled attribute to 0 to disable the use of the PMTU protocol.
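For example, you might first estimate the size of the routing table and then disable PMTU discovery on the running system. The inet subsystem name is an assumption, and the wc -l count includes a few header lines, so treat it as approximate.

    # netstat -rn | wc -l
    # sysconfig -r inet pmtu_enabled=0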
The Network File System (NFS) shares the unified buffer cache with the virtual memory subsystem and local file systems. Most performance problems with NFS can be attributed to bottlenecks in the virtual memory, network, or disk subsystem.
Lost packets on the network can severely degrade NFS performance. Lost packets can be caused by a congested server, the corruption of packets during transmission (which can be caused by bad electrical connections, noisy environments, or noisy Ethernet interfaces), and routers that abandon forwarding attempts too quickly.
You can monitor NFS by using the nfsstat command. When evaluating NFS performance, remember that NFS does not perform well if any file-locking mechanisms are in use on an NFS file, because the locks prevent the file from being cached on the client. See nfsstat(8) for more information.
Table 6-2 lists NFS tuning guidelines and performance benefits as well as tradeoffs.
Action | Performance Benefit | Tradeoff |
Use Prestoserve (Section 6.2.1) | Improves synchronous write performance | Cost |
Use the appropriate number of nfsd daemons on the server (Section 6.2.2) | Enables efficient I/O blocking operations | None |
Use the appropriate number of nfsiod daemons on the client (Section 6.2.3) | Enables efficient I/O blocking operations | None |
Increase the number of I/O threads (Section 6.2.4) | May improve NFS read and write performance | None |
Modify cache timeout limits (Section 6.2.5) | May improve network performance for read-only file systems and slow network links | None |
Decrease network timeouts (Section 6.2.6) | May improve performance for slow or congested networks | None |
Use NFS protocol Version 3.0 (Section 6.2.7) | Improves network performance | Decreases the performance benefit of Prestoserve |
The following sections describe these guidelines in detail.
You can improve NFS performance by installing Prestoserve on the server. Prestoserve greatly improves synchronous write performance for servers that are using NFS Version 2. Prestoserve enables an NFS Version 2 server to write client data to a stable (nonvolatile) cache, instead of writing the data to disk.
Prestoserve may improve write performance for NFS Version 3 servers, but not as much as with NFS Version 2, because NFS Version 3 servers can reliably write data to volatile storage without risking loss of data in the event of failure. NFS Version 3 clients can detect server failures and resend any write data that the server may have lost in volatile storage.
See the Guide to Prestoserve for more information.
Servers use nfsd daemons to handle NFS requests from client machines. The number of nfsd daemons determines the number of parallel operations and must be a multiple of 8. For good performance on frequently used NFS servers, configure a network with either 16 or 32 nfsd daemons. Having exactly 16 or 32 nfsd daemons produces the most efficient blocking for I/O operations.
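For example, on systems that record the daemon count in the NUM_NFSD variable in /etc/rc.config (an assumption; check your NFS startup script for the variable it actually reads), you might set 32 daemons and restart NFS:

    # rcmgr set NUM_NFSD 32
    # /sbin/init.d/nfs stop
    # /sbin/init.d/nfs start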
Clients use nfsiod daemons to service asynchronous I/O operations, such as buffer cache readahead and delayed write operations. The number of nfsiod daemons determines the number of outstanding I/O operations and must be a multiple of 8 minus 1 (for example, 7 or 15). NFS servers attempt to gather writes into complete UFS clusters before initiating I/O, and the number of nfsiod daemons (plus 1) is the number of writes that a client can have outstanding at any one time. Having exactly 7 or 15 nfsiod daemons produces the most efficient blocking for I/O operations.

If write gathering is enabled and the client is not running any nfsiod daemons, you may experience a performance degradation. To disable write gathering, use dbx to set the nfs_write_gather kernel variable to 0, as in the following example.
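A sketch of the dbx session: the assign command changes only the running kernel, and the patch command (which writes the value to the on-disk kernel image so that it persists across reboots) is shown on the assumption that your version of dbx supports it.

    # dbx -k /vmunix /dev/mem
    (dbx) assign nfs_write_gather = 0
    (dbx) patch nfs_write_gather = 0
    (dbx) quit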
On a client system, the nfsiod daemons spawn several I/O threads to service asynchronous I/O operations to the server. The I/O threads improve the performance of both NFS reads and writes. The optimum number of I/O threads depends on many variables, such as how quickly the client will be writing, how many files will be accessed simultaneously, and the characteristics of the NFS server. For most clients, seven threads are sufficient.

Use the ps axlmp 0 | grep nfs command to display idle I/O threads on the client. If few threads are sleeping, you may be able to improve NFS performance by increasing the number of threads. See Chapter 2, nfsiod(8), and nfsd(8) for more information.
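For example, to inspect the idle threads and, if few are sleeping, start a larger set of daemons (this assumes that nfsiod accepts the daemon count as its argument, as described in nfsiod(8)):

    # ps axlmp 0 | grep nfs
    # nfsiod 15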
For read-only file systems and slow network links, performance may be improved by changing the cache timeout limits. These timeouts affect how quickly you see updates to a file or directory that has been modified by another host. If you are not sharing files with users on other hosts, including the server system, increasing these values will give you slightly better performance and will reduce the amount of network traffic that you generate.
See mount(8) and the descriptions of the acregmin, acregmax, acdirmin, acdirmax, and actimeo options for more information.
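For example, a hypothetical mount of a read-only file system that lengthens the attribute cache timeout to 120 seconds; the server, export, and mount point names are placeholders:

    # mount -o ro,actimeo=120 server:/export /mnt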
NFS does not perform well if it is used over slow network links, congested networks, or wide area networks (WANs). In particular, network timeouts can severely degrade NFS performance. You can identify this condition by using the nfsstat command and determining the ratio of timeouts to calls. If timeouts are more than 1 percent of total calls, NFS performance may be severely degraded. See Chapter 2 for sample nfsstat output of timeout and call statistics, and nfsstat(8) for more information.
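For example, the following command displays client RPC statistics; compare the timeout count with the calls count, and investigate if the ratio exceeds 1 percent:

    # nfsstat -c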
You can also use the netstat -s command to verify the existence of a timeout problem. A nonzero count for fragments dropped after timeout in the ip section of the netstat output may indicate that the problem exists. See Chapter 2 for sample netstat command output.
If fragment drops are a problem, use the mount command with the -rsize=1024 and -wsize=1024 options to set the size of the NFS read and write buffers to 1 KB.
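A hypothetical example, shown in -o option-list form (the server, export, and mount point names are placeholders):

    # mount -o rsize=1024,wsize=1024 server:/export /mnt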
NFS protocol Version 3.0 provides client-side asynchronous write support, which improves client perception of performance, improves the cache consistency protocol, and requires less network load than Version 2. Protocol Version 3 decreases the performance benefit of Prestoserve.