2    Availability Considerations

This chapter describes some high-level considerations for configuring your system and its components for increased availability. These considerations apply both to standalone systems and to members of a cluster. This chapter is not intended to be a comprehensive discussion of the considerations.

This chapter discusses the following topics:

2.1    Overview

Proper setup and configuration of your hardware is essential to maintaining its availability. You need to be aware of environmental factors such as room temperature, altitude, and humidity. For the specific environmental specifications, see the User's Guide that came with your system. You also must ensure that the available power meets the requirements of your system.

The following sections provide descriptions and guidelines to help you configure your system for maximum availability.

2.2    Redundancy of Systems and Components for Availability

One common method of ensuring the availability of systems is redundancy. Redundancy of components, such as multiple CPUs or multipath I/O connections, is important to keeping operation as continuous as possible.

As an example, if only one component supplies a critical function, a failure in this component would stop a system from providing its services. This is known as a single point of failure. With redundant components, a particular component can fail without causing a critical failure of the system or the services it provides.
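
The benefit of removing a single point of failure can be estimated with simple arithmetic. The following Python sketch assumes that component failures are independent and that each component has a known availability (the fraction of time it is working). The function names and the 99% figure are illustrative assumptions, not measurements of any real hardware.

```python
from math import prod

def series_availability(availabilities):
    """Every component is required, so all must be up at the same time."""
    return prod(availabilities)

def redundant_availability(availability, copies):
    """Any one of `copies` identical components is enough to provide service."""
    return 1.0 - (1.0 - availability) ** copies

# Hypothetical figure: a single adapter that is up 99% of the time.
single = 0.99
pair = redundant_availability(single, 2)  # two adapters, either one suffices

print(f"one adapter:  {single:.4f}")
print(f"two adapters: {pair:.4f}")
```

Under these assumptions, the redundant pair is unavailable only when both adapters are down at once. Real failures are often correlated (for example, a shared power supply), so treat the result as an upper bound.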

Redundancy is handled by one of two methods: failover or multipathing.

When hardware components are configured for failover, a backup component on standby takes over for the failed component. This incurs the added cost of the second component, with no added performance, but is a simple and effective way to avoid the effects of a component failure. An example of this is NetRAIN network arrays, discussed in Section 2.3.1.

With multipathing, multiple components provide multiple paths for data to flow. In many cases, this has the added benefit of increasing the performance of the system. If one of the components fails, the system continues functioning in a degraded state until repairs are made. An example of this is multipathing of SCSI or Fibre Channel, discussed in Section 2.3.3.

With either method, it is important to replace the failed component quickly. Until it is replaced, the system has lost its redundancy, and a second failure could interrupt service.

Components that can be replaced while the system remains on line allow you to service your system without a loss of availability. Components that are capable of Online Addition and Removal (OLAR) are discussed in Chapter 4.

Environments that require absolutely minimal down time for a system and its services may need the system-level redundancy that clustering provides. Clustering also makes it easy to provide redundancy and failover of software by using the Cluster Alias and Cluster Application Availability subsystems. For more information on clustering technology, see the TruCluster Server Cluster Technical Overview.

2.3    Configuring I/O

The following sections discuss configuration techniques for increasing the availability of your I/O devices, including references to documentation providing detailed configuration steps.

2.3.1    Using Redundant Array of Independent Network Adapters (NetRAIN) for Available I/O

NetRAIN (Redundant Array of Independent Network adapters) detects the physical loss of network connectivity and automatically switches traffic to a working network interface.

One network interface in the array of adapters is always active while the others remain idle. If the active interface fails, one of the idle members of the set comes on line with the same IP address.
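
The active/standby behavior described above can be modeled in a few lines of Python. This is a toy model for illustration only, not the NetRAIN implementation: the class name, the interface names (tu0, tu1, tu2), and the address are hypothetical, and the choice of the first healthy idle member as the replacement is an assumption.

```python
class FailoverSet:
    """Toy model of an active/standby interface set sharing one IP address."""

    def __init__(self, ip_address, interfaces):
        if not interfaces:
            raise ValueError("at least one interface is required")
        self.ip_address = ip_address
        self.healthy = {name: True for name in interfaces}
        self.active = interfaces[0]  # the remaining members stay idle

    def mark_failed(self, name):
        """Record a link failure and fail over if the active member was lost."""
        self.healthy[name] = False
        if name == self.active:
            standby = [n for n, ok in self.healthy.items() if ok]
            # None means every member has failed and the address is unreachable.
            self.active = standby[0] if standby else None
        return self.active


# Hypothetical three-member set; the address is from a documentation range.
nr = FailoverSet("192.0.2.10", ["tu0", "tu1", "tu2"])
nr.mark_failed("tu0")
print(nr.active, nr.ip_address)
```

The IP address never changes in the model; only the interface that carries it does, which is why clients do not need to be reconfigured after a failover.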

For more information on NetRAIN configuration, see the Network Administration: Connections manual.

2.3.2    Using Link Aggregation for Available I/O

Link aggregation, or trunking, enables administrators to combine two or more physical Ethernet Network Interface Cards (NICs) and create a single logical link. (Upper-layer software sees this link aggregation group as a single logical interface.) The single logical link can carry traffic at higher data rates than a single interface because the traffic is distributed across all of the physical ports that make up the link aggregation group.

If one network interface in the aggregation group fails, the remaining interfaces continue to provide connectivity with degraded bandwidth.
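
As a rough illustration of how traffic is spread across a link aggregation group and what "degraded bandwidth" means, consider the Python sketch below. The distribution rule shown (a CRC32 hash of a flow identifier) is an assumption chosen for simplicity; it is not taken from lag(7). The port names and the 100 Mb/s speed are likewise hypothetical.

```python
import zlib

def pick_port(flow_id, working_ports):
    """Map a flow to one of the working ports using a stable hash."""
    if not working_ports:
        raise RuntimeError("no working ports left in the group")
    return working_ports[zlib.crc32(flow_id.encode()) % len(working_ports)]

def group_bandwidth(port_speed_mbps, working_ports):
    """Upper bound on the aggregate throughput of the group."""
    return port_speed_mbps * len(working_ports)


ports = ["ee0", "ee1", "ee2", "ee3"]          # hypothetical group members
print(group_bandwidth(100, ports))            # all four ports working

working = [p for p in ports if p != "ee2"]    # ee2 fails
print(group_bandwidth(100, working))          # connectivity remains, at lower capacity
print(pick_port("hostA->hostB", working))     # flows only land on working ports
```

Because the failed port is simply removed from the candidate list, every flow continues to find a path; only the total capacity shrinks.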

For more information on link aggregation, see lag(7) or the Network Administration: Connections manual.

2.3.3    Configuring SCSI and Fibre Channel for Multipath Redundancy

Multipath redundancy is the ability to connect more than one adapter to the same storage. The system automatically (in almost all cases) determines that the same storage sets are connected through multiple adapters and coordinates the access appropriately.

Multipath redundancy can be used to increase both availability and performance, although some configurations may experience a decrease in performance. Some configurations eliminate the SCSI bus as a single point of failure, while others retain it because a single SCSI bus still connects all of the adapters to the storage. Multipath configurations can contain paths to storage that use either single busses or multiple busses.

Multibus is similar to multipath, and often confused with multipath. Multipath is the generic term used to refer to multiple adapters connected to the same storage. Multibus is a more specific term that refers to the capability of those devices to connect to multiple independent busses (or multiple ports).

Multibus configurations do not have the bus as a single point of failure for I/O, while multipath configurations that use a single bus do.
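
The difference between the two configurations can be checked mechanically: a component is a single point of failure if it appears in every path to the storage. The following Python sketch models each path as a tuple of component names. The names (adapter0, bus0, and so on) are hypothetical, and the model ignores components outside the listed paths.

```python
def single_points_of_failure(paths):
    """Return the components that appear in every path to the storage."""
    if not paths:
        return set()
    shared = set(paths[0])
    for path in paths[1:]:
        shared &= set(path)
    return shared


# Two adapters on one shared bus: multipath, but the bus is common to both.
single_bus = [("adapter0", "bus0"), ("adapter1", "bus0")]
# Two adapters on independent busses: multibus.
multibus = [("adapter0", "bus0"), ("adapter1", "bus1")]

print(single_points_of_failure(single_bus))  # {'bus0'}
print(single_points_of_failure(multibus))    # set()
```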

For further discussion of hardware configurations for clusters and single systems, see Cluster Hardware Configuration or http://www.tru64unix.compaq.com/docs/updates/TCR51_FC/TITLE.HTM.

2.3.4    Configuring PCI Drawers

Multiprocessor systems such as the GS80, GS160, and GS320 have multiple PCI drawers, each with its own power supply and connection to the system. To avoid losing access to a service provided by a PCI card when one of the drawers fails, place redundant cards in separate PCI drawers.

For example, if access to a network is supplied by two network cards in a NetRAIN set, placing one of the cards in one PCI drawer and the other in another PCI drawer will guard against failure of one of the drawers. If you were to place both cards in one PCI drawer, even if on separate busses in that drawer, you then would have a single point of failure that could remove the whole NetRAIN set and therefore remove access to the corresponding network.
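
A placement rule like this is easy to audit from an inventory of cards. The Python sketch below reports every redundant set whose cards all sit in one drawer. The inventory format, set names, card names, and drawer names are hypothetical; on a real system you would build the list from your own hardware records.

```python
from collections import defaultdict

def sets_confined_to_one_drawer(cards):
    """cards: iterable of (redundant_set, card, drawer) tuples.

    Returns the names of redundant sets whose cards all share a single drawer.
    """
    drawers = defaultdict(set)
    for redundant_set, _card, drawer in cards:
        drawers[redundant_set].add(drawer)
    return sorted(name for name, used in drawers.items() if len(used) < 2)


layout = [
    ("nr0", "tu0", "drawer1"),
    ("nr0", "tu1", "drawer2"),  # nr0 spans two drawers
    ("nr1", "tu2", "drawer1"),
    ("nr1", "tu3", "drawer1"),  # both nr1 cards are in drawer1
]
print(sets_confined_to_one_drawer(layout))  # ['nr1']
```

Any set the function reports would lose all of its members if that one drawer failed.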

2.3.5    Using AdvFS and LSM for I/O Availability

With AdvFS you can modify your storage configuration at any time without taking down the system. As your system requirements change, AdvFS allows you to easily adjust your storage size up or down to meet your requirements.

AdvFS also minimizes down time at reboots. It can boot faster than UFS file systems because the file system does not need to be analyzed by the fsck command before boot.

AdvFS can incorporate Logical Storage Manager (LSM) volumes into the file system structure. AdvFS configured with LSM improves file system reliability and availability because AdvFS can take advantage of LSM features.

The Logical Storage Manager (LSM) software is an optional integrated, host-based disk storage management application. LSM uses Redundant Arrays of Independent Disks (RAID) technology to enable you to configure storage devices into a virtual pool of storage to protect against data loss, maximize disk use, improve performance, provide high data availability, and manage storage without disrupting users or applications accessing data on those disks.

LSM allows you to manage all of your storage devices, such as disks, partitions, or RAID sets, as a flexible pool of storage from which you create LSM volumes. You configure new file systems, databases, and applications, or encapsulate existing ones to use an LSM volume instead of a disk partition.

For more information, see Logical Storage Manager.

2.3.6    Choosing Hardware or Software RAID for I/O Availability

RAID can be used to increase the availability of storage. RAID also can improve the performance of storage access. Whether to use hardware or software RAID, and which RAID level to implement, depends on your cost, performance, and availability requirements.

Levels of RAID I/O

The following are the most common levels of RAID and a summary of their different capabilities:

RAID Level 0 supplies striping of data across multiple disks. This does not provide an increase in availability, but increases performance. RAID Level 0 often is combined with RAID Level 1.

RAID Level 1 supplies mirroring of data across disks. If one disk fails, all data is available because of the mirroring. RAID Level 1 increases availability. Increased costs arise due to duplication of storage. Write performance is also somewhat lowered when using RAID Level 1.

RAID Level 5 supplies striping across disks with stored parity data. If one of the disks fails, data still can be accessed, at somewhat degraded performance, until the failed disk is replaced and the RAID set is rebuilt. RAID Level 5 can be implemented in software, for example with LSM, but the overhead of computing parity usually calls for a hardware controller to achieve reasonable performance.
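
The parity mechanism behind RAID Level 5 can be shown with a short Python sketch. Parity is the byte-wise XOR of the data blocks in a stripe, so any one missing block equals the XOR of the surviving blocks and the parity block. The block contents and the helper name are illustrative, and the sketch handles a single stripe only; a real RAID 5 set also distributes the parity blocks across all of the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]  # data blocks of one stripe
parity = xor_blocks(data)           # parity block stored on another disk

# The disk holding data[1] fails: rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

Every read of a missing block requires reading all surviving blocks in the stripe, which is why access is slower until the failed disk is replaced and rebuilt.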

For more information, see Logical Storage Manager.