1    Technical Overview

This chapter describes the purpose of a cluster interconnect, its uses, methods of controlling traffic over it, and how to decide what kind of interconnect to use. The chapter discusses the following topics:

•  The role of a cluster interconnect (Section 1.1)
•  Controlling storage traffic across the cluster interconnect (Section 1.2)
•  Controlling application traffic across the cluster interconnect (Section 1.3)
•  Controlling cluster alias traffic across the cluster interconnect (Section 1.4)
•  The effect of cluster size on cluster interconnect traffic (Section 1.5)
•  Selecting a cluster interconnect (Section 1.6)

See the Cluster Technical Overview for a general discussion of the features of the TruCluster Server and its operational components.

1.1    The Role of a Cluster Interconnect

A cluster must have a dedicated cluster interconnect to which all cluster members are connected. This interconnect serves as a private communication channel between cluster members. The cluster interconnect hardware can be either Memory Channel or a private local area network (LAN), but not both.

In general, the cluster interconnect is used for the following high-level functions:

•  File system and storage I/O that the Cluster File System (CFS) and the device request dispatcher must direct to a serving member (Section 1.2)
•  Communication among the components of distributed applications (Section 1.3)
•  Cluster alias traffic that must be forwarded to the member running the serving process (Section 1.4)
•  Cluster membership and coordination messages (Section 1.5)

Given these high-level uses, the communications load on the cluster interconnect is heavily influenced both by the cluster's storage configuration and by the set of applications the cluster runs.

Table 1-1 compares a LAN interconnect and a Memory Channel interconnect with respect to cost, performance, size, distance between members, support of the Memory Channel application programming interface (API) library, and redundancy. Subsequent sections discuss how to manage cluster interconnect bandwidth and make an appropriate choice of interconnect based on several factors.

Table 1-1:  Comparison of Memory Channel and LAN Interconnect Characteristics

Cost
   Memory Channel: Higher cost.
   LAN: Generally lower cost.

Performance
   Memory Channel: High bandwidth, low latency.
   LAN: Medium to high bandwidth and latency.

Cluster size
   Memory Channel: Up to eight members, limited by the capacity of the Memory Channel hub.
   LAN: Up to eight members initially; will support more in the future.

Distance between members
   Memory Channel: Up to 20 meters (65.6 feet) between members with copper cable; up to 2000 meters (1.2 miles) with fiber-optic cable in virtual hub mode; up to 6000 meters (3.7 miles) with fiber-optic cable using a physical hub.
   LAN: Up to 200 meters (656.2 feet) between members with copper (100BASE-TX) cable, either directly connected or using a single Class I or Class II repeater (switch or hub); up to 412 meters (1,351.7 feet) with direct-connect fiber-optic (100BASE-FX) cable. If a single Class II repeater is used to link fiber segments, the maximum distance between members is 320 meters (1,049.9 feet); if a single Class I repeater is used, the maximum distance is 272 meters (892.4 feet). Use of an additional Ethernet switch or hub between members lessens the overall distance.

Memory Channel API library
   Memory Channel: Supports the use of the Memory Channel application programming interface (API) library.
   LAN: Does not support the Memory Channel API library. Some applications may find the general mechanism, introduced in TruCluster Server Version 5.1A, for sending signals from one cluster member to another (clusterwide kill) sufficient for communicating between members.

Redundancy
   Memory Channel: Multirail (failover pair) redundant Memory Channel configuration.
   LAN: Configure multiple network adapters on each member as a redundant array of independent network adapters (NetRAIN) virtual interface, distributing their connections across multiple switches.

1.2    Controlling Storage Traffic Across the Cluster Interconnect

The Cluster File System (CFS) coordinates accesses to file systems across the cluster. It does so by designating a cluster member as the CFS server for a given file system. The CFS server performs all accesses, reads or writes, to that file system on behalf of all cluster members.

Starting in TruCluster Server Version 5.1A, read accesses to a given file system can bypass the CFS server and go directly to disk, and therefore do not pass over the cluster interconnect. If all storage in the cluster is equally accessible from all cluster members, this feature minimizes the interconnect bandwidth that read operations require.

Although some read accesses can bypass the interconnect, all non-direct-I/O write accesses to a file system served by another member must pass through the interconnect; the file system I/O that must traverse the interconnect is therefore essentially limited to remote writes. To mitigate this traffic, we recommend that, where possible, applications that write large quantities of data to a file system run on the member that is the CFS server for that file system. Understanding the application mix, the placement of CFS servers, and the volume of data that will be written remotely can help you determine the most appropriate interconnect for the cluster.

An application, such as Oracle Parallel Server (OPS), can avoid traversing the cluster interconnect to the CFS server by sending its disk writes directly to disk. This direct-I/O method, which the application enables by specifying the O_DIRECTIO flag when it opens a file, asserts to CFS that the application coordinates its own writes to that file across the entire cluster. Applications that use this feature can both increase their clusterwide write throughput to the specified files and eliminate their remote write traffic from the cluster interconnect.

This method is useful only to those applications, such as OPS, that would not otherwise obtain the performance benefit of data caching, read-ahead, or asynchronous writes. Application developers considering this flag must be very careful, however. Setting the flag means that the operating system does not apply its normal write synchronization to the file for as long as the application holds it open. If the application does not perform its own cache management, locking, and asynchronous I/O, severe performance degradation and data corruption can result.
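The following minimal sketch shows how an application might open a file for direct I/O, assuming the Tru64 UNIX O_DIRECTIO flag described above. The file path and block size are hypothetical, and a real application would also provide the clusterwide locking and cache management discussed in the preceding paragraph.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical database file; the application is assumed to perform
         * its own clusterwide locking and cache management for this file. */
        const char *path = "/shared/data/datafile01.dbf";
        char block[8192];
        int fd;

        memset(block, 0, sizeof(block));

        /* On Tru64 UNIX, O_DIRECTIO asserts to CFS that the application
         * coordinates its own writes; I/O to this file goes directly to
         * disk instead of through a remote CFS server. */
        fd = open(path, O_RDWR | O_DIRECTIO);
        if (fd < 0) {
            perror("open with O_DIRECTIO");
            return 1;
        }

        /* This write does not traverse the cluster interconnect. */
        if (write(fd, block, sizeof(block)) < 0)
            perror("write");

        close(fd);
        return 0;
    }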

See the Cluster Technical Overview and the Cluster Administration manuals for additional information on the use of the cluster interconnect by CFS and the device request dispatcher and on the optimizations provided by the direct-I/O feature.

1.3    Controlling Application Traffic Across the Cluster Interconnect

Applications use a cluster's compute resources in different ways. In some clusters, members can be considered separate islands of computing that share a common storage and management environment (for example, a timesharing cluster in which users run their own programs on one system). Other applications, such as OPS, use distributed processing to focus the compute power of all cluster members on a single clusterwide application. In this case, it is important to understand how the distributed application's components communicate:

•  How much data do the components exchange, and how frequently?
•  How sensitive is the application to the latency of this communication?

With the answers to these questions, it becomes straightforward to map the application's requirements to the characteristics of the interconnect options. For example, an application that requires only 10,000 bytes per second of coordination messaging can fully utilize the compute resources of even a large cluster without stressing a LAN interconnect. On the other hand, distributed applications with high data rate and low latency requirements, such as OPS, benefit from having a Memory Channel as the interconnect, even in smaller clusters.
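As a rough illustration of the arithmetic, the following sketch compares that 10,000-byte-per-second coordination load with the nominal capacity of a 100 Mb/s (100BASE-TX) LAN interconnect. The figures are assumptions for the example, not measurements.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative figures: the coordination-messaging rate from the
         * example above and the nominal capacity of a 100 Mb/s LAN link. */
        const double messaging_bytes_per_sec = 10000.0;      /* 10,000 bytes/s */
        const double lan_bytes_per_sec       = 100e6 / 8.0;  /* ~12.5 MB/s     */

        /* Prints approximately 0.080% */
        printf("Share of LAN interconnect consumed: %.3f%%\n",
               100.0 * messaging_bytes_per_sec / lan_bytes_per_sec);
        return 0;
    }

Even with a generous allowance for protocol overhead, such a load consumes a small fraction of one percent of the link.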

1.4    Controlling Cluster Alias Traffic Across the Cluster Interconnect

The mix of applications that will use a cluster alias, the amount of data being sent to the cluster via the cluster aliases, and the cluster network topology (for example, are members symmetrically or asymmetrically connected to the external networks?) are important factors to consider when deciding which type of cluster interconnect is appropriate.

Some common uses for the cluster alias (such as telnet, ftp, and Web hosting) typically add only small communication requirements to the interconnect. In these applications, the amount of data sent to the cluster alias is generally far outweighed by the amount of data the cluster returns to clients. Only the incoming data packets might need to traverse the interconnect to reach the process serving the request. All outgoing packets go directly to the external network and therefore are not conveyed over the interconnect. (This presumes that all members have connectivity to the external network.) Applications such as these, in most cases, place low bandwidth requirements on the interconnect.

The Network File System (NFS), on the other hand, is a commonly used application that can place a significant bandwidth requirement on the cluster interconnect. Reads from the served disks do not cause much interconnect traffic, because only the read request itself potentially traverses the interconnect. Disk writes through NFS, however, can create interconnect traffic, because the incoming data that might need to be delivered over the interconnect consists of the disk blocks being written. If the cluster will serve NFS volumes, compare the average rate at which disk writes are likely to occur with the bandwidth offered by the various interconnect options.

TruCluster Server Version 5.1A introduces a feature that can lessen the impact of NFS writes. For the purposes of NFS serving, you can assign alternate cluster aliases to subsets of cluster members. This allows a selected set of cluster members to be identified as the NFS servers, thus lowering the average proportion of inbound packets that must be sent over the interconnect to reach the serving process. (In a randomly distributed four-member cluster, an average of 75 percent of the disk writes will traverse the interconnect. If two of those members are assigned an alternate cluster alias for their NFS serving, the average percentage of writes traversing the interconnect drops to 50 percent.)
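The parenthetical figures follow from a simple expected-value argument. The sketch below reproduces them under the assumption, stated above, that connections are distributed at random and that the serving process is equally likely to reside on any member spanned by the alias.

    #include <stdio.h>

    /* Expected fraction of inbound NFS write data that must cross the
     * interconnect when alias connections are distributed at random
     * across alias_members members and the serving process is equally
     * likely to be on any one of them. */
    static double remote_write_fraction(int alias_members)
    {
        return (double)(alias_members - 1) / (double)alias_members;
    }

    int main(void)
    {
        /* Default alias spanning all four members versus an alternate
         * alias spanning only the two members designated as NFS servers. */
        printf("Alias on 4 members: %.0f%%\n", 100.0 * remote_write_fraction(4));
        printf("Alias on 2 members: %.0f%%\n", 100.0 * remote_write_fraction(2));
        return 0;
    }

The same (n - 1)/n relationship underlies the discussion of cluster size in Section 1.5.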

See the Cluster Technical Overview and the Cluster Administration manuals for information on how to use and tune a cluster alias.

1.5    Effect of Cluster Size on Cluster Interconnect Traffic

You cannot consider solely the number or size of the members in a cluster when determining the most appropriate interconnect; you must also look at how the cluster's use will affect the load placed on the interconnect. Although larger clusters tend to have higher data transfer requirements for a given application mix, how the cluster's storage is configured and the characteristics of its applications are better guides to determining the proper interconnect.

However, one aspect of cluster size can affect the interconnect bandwidth requirements. Presuming a perfectly random (and unmanaged) distribution of work across the cluster and an equally random distribution of CFS servers, the percentage of disk writes that must traverse the cluster interconnect increases as the cluster size increases. In a two-member cluster, for example, an average of 50 percent of writes might go over the interconnect; in a four-member cluster, this increases to 75 percent. In Section 1.2 we recommend that the member performing most of the writes to a file system be the CFS server for that file system. This recommendation minimizes the number of writes that must be sent over the interconnect and is appropriate regardless of which type of interconnect is used. The more closely you can follow it, the less interconnect bandwidth disk writes will require.

However, there is one situation in which the size of the cluster (measured in terms of both the number of members and the number of disks in use) has a direct impact on interconnect traffic: cluster membership transitions. In particular, when a member leaves the cluster, the remaining members must exchange coordination messages. Because of the lower latency of the Memory Channel interconnect, these transitions complete faster on a Memory Channel-based cluster. When deciding which interconnect to use, consider how often you expect membership transitions to occur (for example, whether cluster members will routinely be rebooted).

1.6    Selecting a Cluster Interconnect

In addition to the recommendations provided in the previous sections, the following rules and restrictions apply to the selection of a cluster interconnect: