1    Technical Overview

This chapter describes the purpose of a cluster interconnect, its uses, methods of controlling traffic over it, and how to decide what kind of interconnect to use. The chapter discusses the following topics:

•  The role of a cluster interconnect (Section 1.1)
•  Controlling storage traffic across the cluster interconnect (Section 1.2)
•  Controlling application traffic across the cluster interconnect (Section 1.3)
•  Controlling cluster alias traffic across the cluster interconnect (Section 1.4)
•  The effect of cluster size on cluster interconnect traffic (Section 1.5)
•  Selecting a cluster interconnect (Section 1.6)

See the Cluster Technical Overview for a general discussion of the features of the TruCluster Server and its operational components.

1.1    The Role of a Cluster Interconnect

A cluster must have a dedicated cluster interconnect to which all cluster members are connected. This interconnect serves as a private communication channel between cluster members. The cluster interconnect hardware can be either Memory Channel or a private local area network (LAN), but not both.

In general, the cluster interconnect is used for the following high-level functions:

•  File system and storage I/O that the Cluster File System (CFS) and the device request dispatcher must direct to a serving member (Section 1.2)
•  Communication among the components of distributed applications (Section 1.3)
•  Cluster alias traffic that must be forwarded to the member running the serving process (Section 1.4)
•  Cluster membership and coordination messages (Section 1.5)

Given these high-level uses, the communications load on the cluster interconnect is heavily influenced both by the cluster's storage configuration and by the set of applications the cluster runs.

Table 1-1 compares a LAN interconnect and a Memory Channel interconnect with respect to cost, performance, size, distance between members, support of the Memory Channel application programming interface (API) library, and redundancy. Subsequent sections discuss how to manage cluster interconnect bandwidth and make an appropriate choice of interconnect based on several factors.

Table 1-1:  Comparison of Memory Channel and LAN Interconnect Characteristics

Cost
   Memory Channel: Higher cost.
   LAN: Generally lower cost.

Performance
   Memory Channel: High bandwidth, low latency.
   LAN: Medium to high bandwidth and latency.

Cluster size
   Memory Channel: Up to eight members, limited by the capacity of the Memory Channel hub.
   LAN: Up to eight members initially; will support more in the future.

Distance between members
   Memory Channel: Up to 20 meters (65.6 feet) between members with copper cable; up to 2000 meters (1.2 miles) with fiber-optic cable in virtual hub mode; up to 6000 meters (3.7 miles) with fiber-optic cable using a physical hub.
   LAN: Up to 200 meters (656.2 feet) between members with copper (100BASE-TX) cable, either directly connected or using a single Class I or Class II repeater (switch or hub); up to 412 meters (1,351.7 feet) with direct-connect fiber-optic (100BASE-FX) cable. If a single Class II repeater is used to link fiber segments, the maximum distance between members is 320 meters (1,049.9 feet); if a single Class I repeater is used, the maximum distance is 272 meters (892.4 feet). Use of an additional Ethernet switch or hub between members lessens the overall distance.

Memory Channel API library
   Memory Channel: Supports the use of the Memory Channel application programming interface (API) library.
   LAN: Does not support the Memory Channel API library. Some applications may find the general mechanism, introduced in TruCluster Server Version 5.1A, for sending signals from one cluster member to another (clusterwide kill) sufficient for communicating between members.

Redundancy
   Memory Channel: Multirail (failover pair) redundant Memory Channel configuration.
   LAN: Configure multiple network adapters on each member as a redundant array of independent network adapters (NetRAIN) virtual interface, distributing their connections across multiple switches.

1.2    Controlling Storage Traffic Across the Cluster Interconnect

The Cluster File System (CFS) coordinates accesses to file systems across the cluster. It does so by designating a cluster member as the CFS server for a given file system. The CFS server performs all accesses, reads or writes, to that file system on behalf of all cluster members.

Starting in TruCluster Server Version 5.1A, read accesses to a given file system can bypass the CFS server and go directly to disk, and therefore do not pass over the cluster interconnect. If all storage in the cluster is equally accessible from all cluster members, this feature minimizes the interconnect bandwidth that read operations require.

Although some read accesses can bypass the interconnect, all non-direct-I/O write accesses to a file system served by another member must pass through the interconnect; the file system I/O that must traverse the interconnect is therefore essentially limited to remote writes. To mitigate this traffic, we recommend that, where possible, applications that write large quantities of data to a file system run on the member that is the CFS server for that file system. Understanding the application mix, the placement of CFS servers, and the volume of data that will be written remotely can help you determine the most appropriate interconnect for the cluster.

An application, such as Oracle Parallel Server (OPS), can avoid traversing the cluster interconnect to the CFS server by sending its disk writes directly to disk. This direct-I/O method, which the application enables by specifying the O_DIRECTIO flag when it opens a file, asserts to CFS that the application coordinates its own writes to that file across the entire cluster. Applications that use this feature can both increase their clusterwide write throughput to the specified files and eliminate their remote write traffic from the cluster interconnect.

This method is useful only to those applications, such as OPS, that would not otherwise obtain the performance benefit of data caching, read-ahead, or asynchronous writes. Application developers considering this flag must be very careful, however. Setting the flag means that the operating system does not apply its normal write synchronization to the file for as long as the application holds it open. If the application does not perform its own cache management, locking, and asynchronous I/O, severe performance degradation and data corruption can result.
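The following minimal sketch shows how an application might open a file for direct I/O, assuming the Tru64 UNIX O_DIRECTIO flag described above. The file path and block size are hypothetical, and a real application would also provide the clusterwide locking and cache management discussed in the preceding paragraph.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical database file; the application is assumed to perform
         * its own clusterwide locking and cache management for this file. */
        const char *path = "/shared/data/datafile01.dbf";
        char block[8192];
        int fd;

        memset(block, 0, sizeof(block));

        /* On Tru64 UNIX, O_DIRECTIO asserts to CFS that the application
         * coordinates its own writes; I/O to this file goes directly to
         * disk instead of through a remote CFS server. */
        fd = open(path, O_RDWR | O_DIRECTIO);
        if (fd < 0) {
            perror("open with O_DIRECTIO");
            return 1;
        }

        /* This write does not traverse the cluster interconnect. */
        if (write(fd, block, sizeof(block)) < 0)
            perror("write");

        close(fd);
        return 0;
    }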

See the Cluster Technical Overview and the Cluster Administration manuals for additional information on the use of the cluster interconnect by CFS and the device request dispatcher and on the optimizations provided by the direct-I/O feature.

1.3    Controlling Application Traffic Across the Cluster Interconnect

Applications use a cluster's compute resources in different ways. In some clusters, members can be considered separate islands of computing that share a common storage and management environment (for example, a timesharing cluster in which users run their own programs on one system). Other applications, such as OPS, use distributed processing to focus the compute power of all cluster members on a single clusterwide application. In this case, it is important to understand how the distributed application's components communicate:

•  How much data do the components exchange, and how frequently?
•  How sensitive is the application to the latency of this communication?

With the answers to these questions, it becomes straightforward to map the application's requirements to the characteristics of the interconnect options. For example, an application that requires only 10,000 bytes per second of coordination messaging can fully utilize the compute resources of even a large cluster without stressing a LAN interconnect. On the other hand, distributed applications with high data rate and low latency requirements, such as OPS, benefit from having a Memory Channel as the interconnect, even in smaller clusters.
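As a rough illustration of the arithmetic, the following sketch compares that 10,000-byte-per-second coordination load with the nominal capacity of a 100 Mb/s (100BASE-TX) LAN interconnect. The figures are assumptions for the example, not measurements.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative figures: the coordination-messaging rate from the
         * example above and the nominal capacity of a 100 Mb/s LAN link. */
        const double messaging_bytes_per_sec = 10000.0;      /* 10,000 bytes/s */
        const double lan_bytes_per_sec       = 100e6 / 8.0;  /* ~12.5 MB/s     */

        /* Prints approximately 0.080% */
        printf("Share of LAN interconnect consumed: %.3f%%\n",
               100.0 * messaging_bytes_per_sec / lan_bytes_per_sec);
        return 0;
    }

Even with a generous allowance for protocol overhead, such a load consumes a small fraction of one percent of the link.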

1.4    Controlling Cluster Alias Traffic Across the Cluster Interconnect

The mix of applications that will use a cluster alias, the amount of data being sent to the cluster via the cluster aliases, and the cluster network topology (for example, are members symmetrically or asymmetrically connected to the external networks?) are important factors to consider when deciding which type of cluster interconnect is appropriate.

Some common uses for the cluster alias (such as telnet, ftp, and Web hosting) typically add only small communication requirements to the interconnect. In these applications, the amount of data sent to the cluster alias is generally far outweighed by the amount of data the cluster returns to clients. Only the incoming data packets might need to traverse the interconnect to reach the process serving the request. All outgoing packets go directly to the external network and therefore are not conveyed over the interconnect. (This presumes that all members have connectivity to the external network.) Applications such as these, in most cases, place low bandwidth requirements on the interconnect.

The Network File System (NFS), on the other hand, is a commonly used application that can place a significant bandwidth requirement on the cluster interconnect. Reads from the served disks do not cause much interconnect traffic, because only the read request itself potentially traverses the interconnect. Disk writes through NFS, however, can create interconnect traffic, because the incoming data that might need to be delivered over the interconnect consists of the disk blocks being written. If the cluster will serve NFS volumes, compare the average rate at which disk writes are likely to occur with the bandwidth offered by the various interconnect options.

TruCluster Server Version 5.1A introduces a feature that can lessen the impact of NFS writes. For the purposes of NFS serving, you can assign alternate cluster aliases to subsets of cluster members. This allows a selected set of cluster members to be identified as the NFS servers, thus lowering the average proportion of inbound packets that must be sent over the interconnect to reach the serving process. (In a randomly distributed four-member cluster, an average of 75 percent of the disk writes will traverse the interconnect. If two of those members are assigned an alternate cluster alias for their NFS serving, the average percentage of writes traversing the interconnect drops to 50 percent.)
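The parenthetical figures follow from a simple expected-value argument. The sketch below reproduces them under the assumption, stated above, that connections are distributed at random and that the serving process is equally likely to reside on any member spanned by the alias.

    #include <stdio.h>

    /* Expected fraction of inbound NFS write data that must cross the
     * interconnect when alias connections are distributed at random
     * across alias_members members and the serving process is equally
     * likely to be on any one of them. */
    static double remote_write_fraction(int alias_members)
    {
        return (double)(alias_members - 1) / (double)alias_members;
    }

    int main(void)
    {
        /* Default alias spanning all four members versus an alternate
         * alias spanning only the two members designated as NFS servers. */
        printf("Alias on 4 members: %.0f%%\n", 100.0 * remote_write_fraction(4));
        printf("Alias on 2 members: %.0f%%\n", 100.0 * remote_write_fraction(2));
        return 0;
    }

The same (n - 1)/n relationship underlies the discussion of cluster size in Section 1.5.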

See the Cluster Technical Overview and the Cluster Administration manuals for information on how to use and tune a cluster alias.

1.5    Effect of Cluster Size on Cluster Interconnect Traffic

You cannot consider solely the number or size of the members in a cluster when determining the most appropriate interconnect; you must also look at how the cluster's use will affect the load placed on the interconnect. Although larger clusters tend to have higher data transfer requirements for a given application mix, how the cluster's storage is configured and the characteristics of its applications are better guides to determining the proper interconnect.

However, one aspect of cluster size can affect the interconnect bandwidth requirements. Presuming a perfectly random (and unmanaged) distribution of work across the cluster and an equally random distribution of CFS servers, the percentage of disk writes that must traverse the cluster interconnect increases as the cluster size increases. In a two-member cluster, for example, an average of 50 percent of writes might go over the interconnect; in a four-member cluster, this increases to 75 percent. In Section 1.2 we recommend that the member performing most of the writes to a file system be the CFS server for that file system. This recommendation minimizes the number of writes that must be sent over the interconnect and is appropriate regardless of which type of interconnect is used. The more closely you can follow it, the less interconnect bandwidth disk writes will require.

However, there is one situation in which the size of the cluster (measured in terms of both the number of members and the number of disks in use) has a direct impact on interconnect traffic: cluster membership transitions. In particular, when a member leaves the cluster, the remaining members must exchange coordination messages. Because of the lower latency of the Memory Channel interconnect, these transitions complete faster on a Memory Channel-based cluster. When deciding which interconnect to use, consider how often you expect membership transitions to occur (for example, whether cluster members will routinely be rebooted).

1.6    Selecting a Cluster Interconnect

In addition to the recommendations provided in the previous sections, the following rules and restrictions apply to the selection of a cluster interconnect: