[Contents] [Prev. Chapter] [Next Section] [Next Chapter] [Index] [Help]

1    Introduction to High-Performance and High-Availability Systems

Businesses want a computing environment that is dependable and able to handle the workload placed on that environment. Users and applications place different demands on a system, and both require consistent performance with minimal down time. A system also must be able to absorb an increase in workload without a decline in performance. By following the guidelines in this manual, you can configure and tune a dependable, high-performance system that will meet your current and future computing needs.

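This chapter introduces the process of configuring a system and covers the following topics: terminology and concepts (Section 1.1), high availability (Section 1.2), high performance (Section 1.3), planning your configuration (Section 1.4), the primary configuration and tuning recommendations (Section 1.5), and the steps to configure and tune systems (Section 1.6).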

Later chapters provide detailed information about monitoring systems, identifying performance problems, optimizing applications and the central processing unit (CPU), and configuring and tuning the virtual memory, storage, and network subsystems.



1.1    Terminology and Concepts

This section introduces the terms and concepts that are used to describe performance and availability.



1.1.1    System Configuration

Your system configuration consists of a combination of hardware and software for a single system or a cluster of systems. For example, CPUs, memory boards, the operating system, and mirrored disks are parts of a configuration. To configure a system, you set up a new hardware or software configuration or modify an existing one. For example, configuring the I/O subsystem can include setting up mirrored disks.

Systems can be single-CPU systems or multiprocessor systems, which allow two or more processors to share common physical memory. An example of a multiprocessing system is a symmetrical multiprocessing (SMP) system, in which the CPUs execute the same version of the operating system, access common memory, and execute instructions simultaneously.

Certain types of environments, such as large database environments, require multiprocessing systems and large storage configurations to handle the workload. Very-large memory (VLM) systems utilize 64-bit architecture, multiprocessing, and at least 2 GB of memory. Very-large database (VLDB) systems are VLM systems that also use a large and complex storage configuration. The following list describes the components of a typical VLM/VLDB configuration:

The virtual memory subsystem controls the allocation of memory to processes by using a portion of physical memory, disk swap space, and various daemons and algorithms. A page is the smallest portion of physical memory that the system can allocate (8 KB of memory). Virtual memory operation involves two activities: paging, which is reclaiming pages so that they can be reused, and swapping, which is writing a suspended process's modified (dirty) pages to swap space. Swapping frees large amounts of memory.
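
As a rough illustration of the page arithmetic, the following Python sketch counts how many 8 KB pages a region of memory occupies. The function name and the sample sizes are hypothetical; they are not part of the operating system.

```python
PAGE_SIZE = 8 * 1024  # bytes; the 8 KB page size stated above


def pages_needed(num_bytes: int) -> int:
    """Return the number of whole pages required to hold num_bytes."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    # Round up: a partially used page still occupies a full page.
    return (num_bytes + PAGE_SIZE - 1) // PAGE_SIZE


if __name__ == "__main__":
    two_gb = 2 * 1024**3
    print(pages_needed(two_gb))   # pages in 2 GB of physical memory
    print(pages_needed(10_000))   # a hypothetical 10,000-byte request
```

Because allocation is page-based, even a small request consumes at least one full page.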

After a system is configured, you may want to tune the system to improve performance. You can tune a system by changing the values of kernel variables in order to modify the kernel. Kernel variables affect the behavior and performance of the kernel, the virtual memory subsystem, the I/O subsystems, and applications. You can temporarily modify the kernel by changing the kernel variables while the system is running, or you can permanently modify the kernel by changing the values of attributes.

Use attributes to modify the kernel without rebuilding the kernel. In some cases, you can modify the kernel by changing parameter values in the system configuration file; however, you must rebuild the kernel to use the new parameter values. See Section 2.11 for information about viewing and modifying kernel variables, attributes, and parameters.

If tuning a system does not sufficiently improve performance, you may have to reconfigure your system, which can involve adding CPUs or memory, changing the storage configuration, or modifying the software application.



1.1.2    System Performance

System performance depends on an efficient utilization of system resources, which are the hardware and software components (CPUs, memory, networks, and disk storage) that are available to users or applications. A system must perform well under the normal workload exerted on the system by the applications and the users.

The system workload changes over time. You may add users or run additional applications. You may need to reconfigure your system to handle an increasing workload. Scalability refers to a system's ability to utilize additional resources with a predictable increase in performance, or the ability to absorb an increase in workload without a significant performance degradation.

A performance problem in a specific area of the configuration is called a bottleneck. Potential bottlenecks include the virtual memory subsystem and I/O buses. A bottleneck can occur if the workload demands more from a resource than its capacity, which is the maximum theoretical throughput of a system resource.

Performance is often described in terms of two rates. Bandwidth is the rate at which an I/O subsystem or component can transfer bytes of data. Bandwidth is often called the transfer rate. Bandwidth is especially important for applications that perform large sequential data transfers. Throughput is the rate at which an I/O subsystem or component can perform I/O operations. Throughput is especially important for applications that perform many small I/O operations.
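
The difference between the two rates can be sketched in a few lines of Python. The function names and the workload figures below are hypothetical and chosen only to show that a workload can have high bandwidth with low throughput, or the reverse.

```python
def bandwidth_bytes_per_sec(bytes_transferred: int, seconds: float) -> float:
    """Bandwidth: bytes of data transferred per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_transferred / seconds


def throughput_ops_per_sec(operations: int, seconds: float) -> float:
    """Throughput: I/O operations completed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations / seconds


if __name__ == "__main__":
    interval = 10.0  # seconds observed (hypothetical)

    # Hypothetical sequential workload: 100 transfers of 1 MB each.
    seq_ops, seq_size = 100, 1024 * 1024
    # Hypothetical small-transfer workload: 5000 transfers of 8 KB each.
    small_ops, small_size = 5000, 8 * 1024

    for label, ops, size in (("sequential", seq_ops, seq_size),
                             ("small I/O", small_ops, small_size)):
        bw = bandwidth_bytes_per_sec(ops * size, interval)
        tp = throughput_ops_per_sec(ops, interval)
        print(f"{label}: {bw:.0f} bytes/s, {tp:.0f} ops/s")
```

In this made-up example, the sequential workload moves more bytes per second, while the small-transfer workload completes far more operations per second.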

Performance is also measured in terms of latency, which is the amount of time to complete a specific operation. Latency is often called delay. High system performance requires a low latency time. I/O latency is measured in milliseconds; memory latency is measured in nanoseconds. Memory latency depends on the memory bank configuration and the system's memory requirements.



1.1.3    Disk Performance

Disk performance is often described in terms of disk access time, which combines two components: the seek time, which is the amount of time for a disk head to move to a specific disk track, and the rotational latency, which is the amount of time for a disk to rotate to a specific disk sector.
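
The following Python sketch estimates disk access time under two assumptions that this manual does not state: that access time is the simple sum of seek time and rotational latency, and that average rotational latency equals the time for half a revolution (a common approximation). The function names and the sample disk values are hypothetical.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency, taken as the time for half a revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def disk_access_time_ms(seek_ms: float, rpm: float) -> float:
    """Access time modeled as seek time plus average rotational latency."""
    if seek_ms < 0:
        raise ValueError("seek_ms must be non-negative")
    return seek_ms + average_rotational_latency_ms(rpm)


if __name__ == "__main__":
    # Hypothetical disk: 8 ms average seek time, 7200 revolutions per minute.
    print(f"{disk_access_time_ms(8.0, 7200):.2f} ms")
```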

The Unified Buffer Cache (UBC) affects disk I/O performance. The UBC is allocated a portion of physical memory to cache most-recently accessed file system data. By functioning as a layer between the operating system and the storage subsystem, the UBC is able to decrease the number of disk operations.

Disk I/O performance also depends on the characteristics of the workload's I/O operations. Data transfers can be large or small and can involve reading data from a disk or writing data to a disk.

Data transfers also have different access patterns. A sequential access pattern is an access pattern in which data is read from or written to contiguous (adjacent) blocks on a disk. A random access pattern is an access pattern in which data is read from or written to blocks in different (usually nonadjacent) locations on a disk.
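
One way to characterize an access pattern is to measure how often consecutive requests touch adjacent blocks. The Python sketch below does this for a list of block numbers. The function name and the sample traces are hypothetical, and the sketch treats only "next block" accesses as sequential.

```python
from typing import Sequence


def sequential_fraction(blocks: Sequence[int]) -> float:
    """Fraction of accesses that target the block immediately after the previous one."""
    if len(blocks) < 2:
        return 0.0
    sequential = sum(
        1 for prev, cur in zip(blocks, blocks[1:]) if cur == prev + 1
    )
    return sequential / (len(blocks) - 1)


if __name__ == "__main__":
    print(sequential_fraction([10, 11, 12, 13]))   # contiguous blocks
    print(sequential_fraction([5, 90, 17, 300]))   # scattered blocks
```

A value near 1 suggests a sequential pattern; a value near 0 suggests a random pattern.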

In addition, data transfers can consist of file-system data or raw I/O, which is I/O to a disk or disk partition that does not contain a file system. Raw I/O bypasses buffers and caches, and it may provide better performance than file system I/O. Raw I/O is often used by the operating system and by database application software.

Disk I/O performance also is affected by the use of redundant array of independent disks (RAID) technology, which can provide both high disk I/O performance and high data availability. The DIGITAL UNIX operating system provides RAID functionality by using the Logical Storage Manager (LSM) software. DIGITAL UNIX also supports hardware-based RAID products, which provide RAID functionality by using intelligent controllers, caches, and software.

There are four primary RAID levels:

To address your performance and availability needs, you can combine some RAID levels (for example, you can combine RAID 1 with RAID 0 to mirror striped disks). Some hardware-based RAID products support adaptive RAID 3/5 (also called dynamic parity RAID), which improves disk I/O performance for a wide variety of applications by dynamically adjusting, according to workload needs, between data transfer-intensive algorithms and I/O operation-intensive algorithms.

See Section 5.2.1 for more information about RAID and RAID products.



1.1.4    High Availability

High availability is the ability of a resource to withstand a hardware or software failure. Resources (for example, systems or disk data) can be made highly available by using some form of resource duplication or redundancy.

For example, you can make the data on a disk highly available by mirroring that disk; that is, replicating the data on a different disk. If the original disk fails, the copy is still available to users and applications. If you use parity RAID, the redundant data is stored in the parity information, which is used to regenerate data if a disk failure occurs.
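
To show how parity information can regenerate lost data, the following Python sketch uses exclusive-OR (XOR) parity across equal-sized blocks. XOR parity is a common scheme, but this manual does not specify the parity algorithm that a given RAID product uses; the function name and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    data = [b"ABCD", b"EFGH", b"IJKL"]   # blocks on three data disks
    parity = xor_blocks(data)            # block stored as parity

    # Simulate losing the second disk and regenerating its block
    # from the surviving data blocks plus the parity block.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

The regeneration works because XOR-ing the parity with all surviving blocks cancels them out and leaves the missing block.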

In addition, you can make the network highly available by using redundant network connections. If one connection becomes unavailable, you can still use the other connection for network access. Network availability depends on the application, the network configuration, and the network protocol.

To make a system highly available, you must set up a cluster, which is a loosely coupled group of servers configured as cluster member systems. In a cluster, software applications are capable of running on any member system. Some applications can run on only one member system at a time; others can run on multiple systems simultaneously. Cluster member systems usually share highly available disk data, and some clusters support a high-performance interconnect that enables fast and reliable communications between members.

A cluster utilizes failover to ensure application and system availability. If a member system fails, all cluster-configured applications running on that system fail over to a different member system, which restarts the applications and makes them available to users.
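
The failover behavior described above can be modeled with a small Python sketch. This is a toy model, not cluster software: the function name, the member and application names, and the policy of moving each application to the surviving member that currently runs the fewest applications are all assumptions made for illustration.

```python
def fail_over(placement, members, failed):
    """Return a new application-to-member mapping after one member fails.

    placement maps application name -> member system name.
    """
    survivors = [m for m in members if m != failed]
    if not survivors:
        raise RuntimeError("no surviving member systems")

    new_placement = dict(placement)
    for app, member in placement.items():
        if member != failed:
            continue
        # Count applications currently assigned to each survivor.
        load = {m: 0 for m in survivors}
        for host in new_placement.values():
            if host in load:
                load[host] += 1
        # Hypothetical policy: restart on the least-loaded survivor.
        new_placement[app] = min(survivors, key=lambda m: load[m])
    return new_placement


if __name__ == "__main__":
    before = {"db": "a", "web": "a", "mail": "b"}
    after = fail_over(before, members=["a", "b", "c"], failed="a")
    print(after)
```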

To completely protect a configuration from failure, you must eliminate each point of failure. An example of a configuration that has no single point of failure is as follows:

For increased availability, you can use multiple layers of redundancy to protect against multiple failures. See Section 1.2 for more information about availability.

Availability is also measured by a resource's reliability, which is the average amount of time that a component will perform before a failure that causes a loss of data. It is often expressed as the mean time to data loss (MTDL), the mean time to first failure (MTTF), and the mean time between failures (MTBF).



1.2    Understanding High Availability

A resource that is highly available is resistant to specific hardware and software failures. This is accomplished by duplicating resources (for example, systems, network interfaces, or data), and may also include an automatic failover mechanism that makes the resource failure virtually imperceptible to users.

There are various degrees of high availability, and you must determine how much you need for your environment. A configuration that has no single point of failure is one in which you have duplicated each vital resource. Environments that are not prone to failure or are able to accommodate down time may only require data to be highly available.

Figure 1-1 shows a configuration that is vulnerable to multiple failures, including system, network, disk, and bus failures.

Figure 1-1:  Configuration With Potential Points of Failure

The more levels of resource redundancy, the greater the resource availability. Mission-critical operations and production environments often require that resources be resistant to multiple failures. For example, if you have only two cluster member systems and one fails, you now have a potential point of failure (the remaining system), and your configuration is vulnerable to down time. Therefore, a cluster with three or more member systems has more levels of redundancy and higher availability than a two-system cluster, because it can survive multiple system failures.
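
The effect of additional redundancy can be estimated with a simple probability model. The Python sketch below assumes that member systems fail independently with the same probability and that the service stays up as long as one member survives; real clusters may impose stricter requirements that this model ignores. The function name and the failure probability are hypothetical.

```python
def probability_all_members_down(member_failure_probability: float,
                                 members: int) -> float:
    """Probability that every member is down, assuming independent failures."""
    if not 0.0 <= member_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return member_failure_probability ** members


if __name__ == "__main__":
    p = 0.01  # hypothetical chance that a given member is down
    for n in (1, 2, 3):
        print(n, probability_all_members_down(p, n))
```

Under these assumptions, each added member reduces the chance of a total outage by a factor equal to the per-member failure probability.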

However, it is not always possible or practical to protect against every possible failure scenario or to provide multiple levels of redundancy. When planning your configuration, you must determine how much availability you need and the best way to achieve it.

Software-based RAID (LSM) and hardware-based RAID products provide you with various degrees of data availability. In addition, specific configurations can improve data availability. For example, mirroring data across buses protects against disk, bus, and adapter failures.

DIGITAL UNIX TruCluster™ products provide high system and application availability. Brief descriptions of some cluster products are as follows:

The following sections describe how to eliminate points of failure, and how to increase resource availability.



1.2.1    Eliminating Points of Failure

When configuring a system for high availability, you must protect the system's resources from failure. The following list describes each potential point of failure and how to eliminate it:

Figure 1-2 shows a fully redundant cluster configuration with no single point of failure for the server systems.

Figure 1-2:  Fully Redundant Cluster Configuration

Because you can never eliminate the possibility that multiple failures will make a resource or component unavailable, you must repair or replace a failed component as soon as possible to maintain some form of redundancy. This will help to ensure that you do not experience down time.



1.2.2    Increasing System Availability

You must decide how much system availability you need and where a system is most vulnerable to failure. Table 1-1 describes how to increase the system availability by eliminating single points of failure, as well as the tradeoffs.

Table 1-1:  Increasing System Availability

To protect against: | You can: | Tradeoff:
Single system failure | Set up a cluster with at least two members | Cost of additional hardware and software, increased management complexity
  | Use the latest versions of hardware, firmware, and operating system | Possible down time during upgrade
Multiple system failures | Set up a cluster with more than two members | Cost of additional hardware and software, increased management complexity
Network connection failure | Configure multiple network connections | Cost of additional hardware, requires I/O slots
Cluster interconnect failure | Set up a second cluster interconnect | Cost of additional hardware, uses a PCI slot
Total power failure | Use a battery-backed uninterruptible power system (UPS) | Cost of UPS hardware
Cabinet power supply failure | Use redundant power supplies or mirror disks across cabinets with independent power supplies | Cost of additional hardware and decrease in write performance on mirrored disks



1.2.3    Increasing Data Availability

Users and applications must be able to access data easily and quickly, and the data must also remain available. Table 1-2 describes how to increase the availability of data by addressing points of failure, as well as the tradeoffs.

Table 1-2:  Increasing Data Availability

To protect against: | You can: | Tradeoff:
Disk failure | Mirror disks | Cost of additional disks and decrease in write performance
  | Use parity RAID | Cost of additional hardware and software, increase in management complexity, and performance impact under write loads
Host bus adapter or bus failure | Mirror data across disks on different buses | Cost of additional hardware and requires additional I/O bus slots
System failure | Set up a cluster | Cost of additional hardware and software, increase in management complexity
Total power failure | Use a battery-backed uninterruptible power system (UPS) | Cost of UPS hardware
Cabinet power supply failure | Use redundant power supplies or mirror disks across cabinets with independent power supplies | Cost of additional hardware and decrease in write performance on mirrored disks



1.2.4    Achieving High Availability and High Performance

Configuring a system for high availability can affect performance, depending on your configuration and the characteristics of your workload. Table 1-3 shows how high-availability solutions affect system performance.

Table 1-3:  Impact of High Availability on System Performance

Availability Solution | Performance Impact
Mirroring disks | Can improve disk read performance, but may cause a degradation in write performance (you can mirror striped disks to combine the performance benefits of striping with high availability)
Mirroring disks across different buses | Prevents a single bus from becoming an I/O bottleneck
Parity RAID | Improves disk I/O performance only if all member disks are available; performance degrades as disks fail
Redundant network connections | Improves network performance and increases client access
Cluster | Improves overall performance by spreading workload across member systems, which provides applications and users with more CPU and memory resources



1.3    Understanding High Performance

A system must have a dependable level of performance to meet the needs of users and applications. You must configure your system so that it can rapidly respond to the demands of a normal workload and maintain an adequate level of performance if the workload increases.

Some environments require that a system be scalable. A scalable system allows you to add hardware (for example, CPUs) to improve performance or to absorb an increase in the workload.

You must understand the characteristics of your workload to determine the level of performance you require, and which configuration will meet your performance needs. Although some environments require the highest possible performance, this level of performance may not be necessary or cost effective.

System performance depends on the interaction between the hardware and software configuration and the workload. A system that performs well must use CPU, memory, and I/O resources efficiently. If a resource reaches its capacity, it becomes a bottleneck and can degrade performance. Bottlenecks are often interrelated; for example, insufficient memory can cause excessive paging and swapping, which may result in a bottleneck in the disk I/O subsystem.
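
A simple way to reason about bottlenecks is to compare the demand on each resource with its capacity. The Python sketch below flags resources whose demand meets or exceeds capacity. The function name, the resource names, and all of the numbers are hypothetical; in practice, the measurements would come from monitoring tools such as those described in Chapter 2.

```python
def find_bottlenecks(demand, capacity, threshold=1.0):
    """Return the names of resources whose demand/capacity ratio meets the threshold."""
    saturated = []
    for name, required in demand.items():
        utilization = required / capacity[name]
        if utilization >= threshold:
            saturated.append(name)
    return sorted(saturated)


if __name__ == "__main__":
    # Hypothetical measurements in arbitrary but matching units.
    demand = {"cpu": 0.6, "memory_gb": 3.0, "disk_iops": 900}
    capacity = {"cpu": 1.0, "memory_gb": 2.0, "disk_iops": 800}
    print(find_bottlenecks(demand, capacity))
```

In this made-up example, both memory and disk I/O are flagged, which mirrors the interrelated case described above: insufficient memory can lead to paging and swapping that in turn load the disk I/O subsystem.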

To plan a configuration that will meet your performance needs, you must identify which resources will have the biggest impact on performance. For example, if your applications are CPU-intensive, you may want to consider a system with multiple CPUs and sufficient memory bandwidth. If the applications require a lot of memory, you must configure sufficient memory for the system. An inadequate amount of memory will degrade the overall system performance.

If your applications perform a large number of disk I/O operations, configure your storage subsystem to prevent disk and bus bottlenecks. If your system is an Internet server, you must be sure it can handle many network requests. In addition, if you require both high availability and high performance, you must determine how a high-availability configuration impacts system performance.

After you plan and set up your configuration, you may be able to improve performance by tuning the system. However, tuning may provide only marginal performance improvements, so make sure that your configuration is appropriate for your workload.

Performance problems can have various sources, including the following:

The commands described in Chapter 2 can help you identify the source of a performance problem.



1.4    Planning Your Configuration

To plan your DIGITAL UNIX configuration, follow these steps:

  1. Understand your workload and the characteristics of the users and applications.

  2. Determine your performance and availability requirements.

  3. Choose which hardware and software configuration will satisfy your performance and availability needs.

The following sections describe these steps in detail.



1.4.1    Understanding Your Workload

Before choosing a configuration to meet your needs, you must determine the impact of your workload on the system. To do this, you must understand the characteristics of your applications and users and how they utilize the software and hardware (for example, how they perform disk I/O).

Use Table 1-4 to help you understand application behavior. You may want to duplicate and fill out this table for each application.

Table 1-4:  Application Characteristics

Application Name:  
Describe the application objectives.  
Describe the performance requirements.  
Is the application CPU-intensive?  
What are the application's memory needs?  
How much disk storage does the application require?  
Does the application require high bandwidth or throughput?  
Does the application perform large sequential data transfers?  
Does the application perform many small data transfers?  
What is the size of the average data transfer?  
What percentage of the data transfers are reads?  
What percentage of the data transfers are writes?  
Does the application perform many network operations?  
What are your system availability requirements?  
What are your data availability requirements?  
What are your network availability requirements?  

Use Table 1-5 to help you understand user behavior. Different users may place different demands on the system. For example, some users may be performing data processing, while others may be compiling code. You may want to duplicate and fill out this table for each type of user.

Table 1-5:  User Characteristics

User Type:  
Describe the type of user.  
Specify the number of users.  
Describe the objectives of the users.  
Describe the tasks that the users perform.  
List the applications run by the users.  
What are the data storage requirements for the users?  

After you understand how your applications and users use the hardware and software, you can determine the performance and availability goals for your environment.



1.4.2    Determining Performance and Availability Goals

Before you configure a system, you must determine the goals for the environment in terms of the following criteria:

After you determine the goals for your environment, you can choose the configuration that will meet the needs of the applications and users and address your environment goals.



1.4.3    Choosing an Appropriate Configuration

After you understand the needs of your applications and users and determine your performance and availability goals, choose the hardware and software configuration that meets your needs.

You must choose a system that will provide the necessary CPU and memory resources, and that will support your network and storage configuration. Because systems have different characteristics and features, the type of system you choose determines whether you can install additional CPU or memory boards, connect multiple I/O buses, or use the system in a cluster. Systems also vary in their scalability, which will determine whether you can improve system performance by adding resources, such as CPUs.

A primary consideration for choosing a system is its CPU and memory capabilities. Some systems support multiple CPUs. Another consideration is the number of I/O bus slots in the system.

For detailed information about features for systems, network adapters, host bus adapters, RAID controllers, and disks, see the DIGITAL Systems & Options Catalog. For information about operating system hardware support, see the DIGITAL UNIX Software Product Description.

When choosing a system that will meet your needs, you must determine your requirements for the following hardware and functionality:

Table 1-6 can help you identify the characteristics of a system that will meet your needs.

Table 1-6:  System Characteristics

If you require: | You need a system that:
Multiprocessing support | Supports multiprocessing and the number of CPUs that you want.
Fast processing time | Supports CPUs with fast speeds and fast memory.
Additional memory boards | Has backplane slots available for memory boards.
Cluster support | Supports the cluster product that you want to use.
Network adapters | Supports the network adapters that you want to use, and has an I/O slot available for each adapter.
Host bus adapters | Supports the host bus adapters that you want to use, and has an I/O slot available for each adapter.
RAID controllers | Supports the RAID controllers that you want to use, and has an I/O bus slot available for each controller.
Cluster interconnects | Has a PCI slot available for each interconnect.
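
The slot requirements in Table 1-6 lend themselves to a quick check. The Python sketch below adds up one slot per adapter, controller, and interconnect and compares the total with the slots a system offers. It assumes, for simplicity, that all of these devices draw from a single pool of I/O slots, which may not hold for a particular system; the function names and the counts are hypothetical.

```python
def io_slots_required(network_adapters: int,
                      host_bus_adapters: int,
                      raid_controllers: int,
                      cluster_interconnects: int) -> int:
    """Total slots needed, counting one slot per device."""
    counts = (network_adapters, host_bus_adapters,
              raid_controllers, cluster_interconnects)
    if any(count < 0 for count in counts):
        raise ValueError("device counts must be non-negative")
    return sum(counts)


def system_has_enough_slots(available_slots: int, required_slots: int) -> bool:
    """True if the system offers at least as many slots as are required."""
    return available_slots >= required_slots


if __name__ == "__main__":
    needed = io_slots_required(network_adapters=2, host_bus_adapters=2,
                               raid_controllers=1, cluster_interconnects=1)
    print(needed, system_has_enough_slots(available_slots=5,
                                          required_slots=needed))
```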

Fill in the requirements listed in Table 1-7 to get a profile of the system that will meet your needs.

Table 1-7:  System Requirements

Feature: Requirement:
Number of CPU boards:  
CPU processing speed:  
Total amount of memory:  
Number of memory boards:  
Cluster support:  
Type and number of network adapters:  
Type and number of host bus adapters:  
Type and number of backplane RAID controllers:  
Number of cluster interconnects:  



1.5    Primary Configuration and Tuning Recommendations

This manual describes many configuration and tuning tasks that you can use to improve system performance. Some of the recommendations can greatly improve performance. However, many of the recommendations provide only marginal improvement and should be used with caution.

To help you configure and tune your system, this manual identifies a set of primary recommendations that provide the best performance improvement for most configurations. Many of these recommendations are used by the sys_check utility, which gathers performance information and outputs this information in an easy-to-read format. The sys_check utility uses some of the tools described in Chapter 2 to check your configuration and kernel variable settings, and it provides warnings and tuning recommendations if necessary.

To obtain the sys_check utility, access the following location or call your customer service representative:

ftp://ftp.digital.com/pub/DEC/IAS/sys_check

The following list describes the primary tuning recommendations. If these recommendations do not solve your performance problem, use the other recommendations described in this manual.



1.6    Steps to Configure and Tune Systems

Setting up and maintaining a high-performance or high-availability system requires a number of steps. The process is as follows:

  1. Configure the system.

    To configure (or reconfigure) a system, you must determine the requirements of your environment and choose a configuration to meet your needs. Then, you can set up the hardware, operating system, layered products, and applications.

  2. Perform any recommended initial tuning tasks.

    For some configurations, you may have to perform some tuning tasks immediately after you configure your system. For example, if your system is used as an Internet server, follow the recommendations to modify the default values of system parameters and attributes.

  3. Monitor system performance.

    You must carefully monitor the performance of your system, as described in Chapter 2.

    If system performance is acceptable, you must continue to monitor the system on a consistent basis, because performance may degrade if resources reach their capacity or if there is a significant change in the environment (for example, you increase the workload or you reconfigure the system).

    If system performance is not acceptable, you must determine the source of the problem.

  4. Identify the source of the performance problem.

    Use the tools described in Chapter 2 to locate the source of the problem. The DIGITAL Systems & Options Catalog contains information about the capacity of hardware resources.

  5. Determine if there is a tuning solution that will eliminate the performance problem.

    If there is no tuning solution or if you have exhausted all possible tuning solutions, you may have to reconfigure the system to eliminate the performance problem.

  6. Eliminate the performance problem.

    To eliminate a performance problem, first try simple, no-cost solutions, such as running applications at off-peak hours or restricting disk access. Then, you can try more complex and expensive solutions, such as tuning the system or adding more hardware. Section 1.5 includes a list of the primary tuning tasks that may help you to improve performance.

    If you are sure your CPU and applications are optimized, tuning the virtual memory subsystem provides the best performance benefit and should be the primary area of focus. If tuning memory does not eliminate the problem, tune the I/O subsystem. Tuning usually requires modifying kernel attributes. However, you may be able to improve system performance by performing some administrative tasks, such as defragmenting file systems or modifying stripe widths.

  7. Monitor system performance.

    After you tune the system, you must carefully monitor the system to ensure that the performance problem has been eliminated. If a tuning recommendation does not eliminate the problem, try another recommendation. If you cannot reduce or eliminate a performance problem by tuning the system, you must reconfigure the system.

The flowchart shown in Figure 1-3 describes the configuration and tuning process. Later chapters provide detailed information about diagnosing performance problems and about configuring and tuning the CPU, virtual memory, storage, and networks.

Figure 1-3:  Configuration and Tuning Process
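
The monitor, tune, and reconfigure steps in this section can be sketched as a small loop. The Python code below is only a schematic of the process: the function name, its callable parameters, and the simulated latency values are hypothetical, and real monitoring and tuning use the tools described in later chapters.

```python
def tune_until_acceptable(measure, is_acceptable, tuning_actions, reconfigure):
    """Apply tuning actions one at a time; reconfigure if none of them helps.

    measure        -- callable returning a current performance measurement
    is_acceptable  -- callable deciding whether a measurement is acceptable
    tuning_actions -- ordered iterable of callables, simplest and cheapest first
    reconfigure    -- callable invoked when tuning does not solve the problem
    """
    if is_acceptable(measure()):
        return "acceptable"
    for action in tuning_actions:
        action()
        # Monitor again after every change before trying the next one.
        if is_acceptable(measure()):
            return "tuned"
    reconfigure()
    return "reconfigured"


if __name__ == "__main__":
    state = {"latency_ms": 30}  # simulated measurement

    result = tune_until_acceptable(
        measure=lambda: state["latency_ms"],
        is_acceptable=lambda value: value <= 10,
        tuning_actions=[
            lambda: state.update(latency_ms=20),  # first change: not enough
            lambda: state.update(latency_ms=8),   # second change: sufficient
        ],
        reconfigure=lambda: state.update(latency_ms=5),
    )
    print(result)
```

Ordering the actions from simplest to most expensive reflects the advice in step 6 to try no-cost solutions first.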

