A cluster must have a dedicated cluster interconnect to which all cluster members are connected. The cluster interconnect serves as a private communications channel between cluster members. It is used by the connection manager to maintain cluster membership, by the Cluster File System (CFS) to perform I/O to and from remotely served storage, by the distributed lock manager (DLM) to maintain resource lock information, and by most other cluster subcomponents. For hardware, the cluster interconnect can use either Memory Channel or a private LAN.
In general, the following rules and restrictions apply to the selection of a cluster interconnect:
All cluster members must be configured to use a LAN interconnect (Section 7.1) or to use Memory Channel (Section 7.2). You cannot mix interconnect types within a cluster.
Replacing a Memory Channel interconnect with a LAN interconnect (or vice versa) requires some cluster downtime. (That is, you cannot perform a rolling upgrade from one interconnect type to the other.) The Cluster LAN Interconnect manual describes how to migrate from Memory Channel to a LAN interconnect.
Applications using the Memory Channel application programming interface (API) library require Memory Channel. The Memory Channel API library is not supported in a cluster using a LAN interconnect.
Because of the relatively low cost of Ethernet hardware, a LAN interconnect is a good choice as the private communications channel for an entry-level cluster. In general, any Ethernet adapter, switch, or hub that works in a standard LAN at 100 Mb/s should work within a LAN interconnect. (Fiber Distributed Data Interface (FDDI), ATM LAN Emulation (LANE), 10 Mb/s Ethernet, and Gigabit Ethernet are not supported.)
See the Cluster LAN Interconnect manual for guidelines and examples of cluster LAN interconnect configurations.
7.2 Memory Channel Interconnect
The Memory Channel interconnect is a specialized interconnect designed specifically for the needs of clusters. This interconnect provides both broadcast and point-to-point connections between cluster members. The Memory Channel interconnect:
Allows a cluster member to set up a high-performance, memory-mapped connection to other cluster members. These other cluster members can, in turn, map transfers from the Memory Channel interconnect directly into their memory. A cluster member can thus obtain a write-only window into the memory of other cluster systems. Normal memory transfers across this connection can be accomplished at extremely low latency (3 to 5 microseconds).
Has built-in error checking, virtually guaranteeing no undetected errors and allowing software error detection mechanisms, such as checksums, to be eliminated. The detected error rate is very low (on the order of one error per year per connection).
Supports high-performance mutual exclusion locking (by means of spinlocks) for synchronized resource control among cooperating applications.
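To make the mapping and locking ideas above concrete, the following single-node sketch in C shares a memory-mapped window between two threads and protects it with a spinlock built from C11 atomics. It is only an analogy: it uses an anonymous mmap(2) region and POSIX threads in place of the Memory Channel adapter mapping and the Memory Channel API spinlocks, and none of the names below come from the Memory Channel software.

#define _DEFAULT_SOURCE      /* for MAP_ANONYMOUS under strict compiler modes */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * Single-node analogy of a shared, memory-mapped window protected by a
 * spinlock.  Two threads in one process share an anonymous mmap(2)
 * region; on Memory Channel, the mapping would instead be backed by the
 * adapter and visible to other cluster members.  Build with: cc -pthread demo.c
 */
struct window {
    atomic_flag lock;        /* spinlock protecting the window          */
    long        counter;     /* data that cooperating writers update    */
};

static void spin_lock(atomic_flag *l)
{
    /* Busy-wait until the flag is clear, then claim it atomically. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

static void *writer(void *arg)
{
    struct window *w = arg;

    for (int i = 0; i < 100000; i++) {
        spin_lock(&w->lock);
        w->counter++;                 /* exclusive update of shared data */
        spin_unlock(&w->lock);
    }
    return NULL;
}

int main(void)
{
    struct window *w = mmap(NULL, sizeof *w, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (w == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    atomic_flag_clear(&w->lock);
    w->counter = 0;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, w);
    pthread_create(&t2, NULL, writer, w);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* With the spinlock in place, the result is always 200000. */
    printf("counter = %ld\n", w->counter);
    munmap(w, sizeof *w);
    return 0;
}

On Memory Channel hardware, the writer's stores would instead go to a write-only transmit window mapped through the adapter and appear in the memory of the other cluster members.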
Figure 7-1 shows the general flow of a Memory Channel transfer.
Figure 7-1: Memory Channel Logical Diagram
A Memory Channel adapter must be installed in a PCI slot on each member system. A link cable connects the adapters. If the cluster contains more than two members, a Memory Channel hub is also required.
A redundant, multirail Memory Channel configuration can further improve reliability and availability. It requires a second Memory Channel adapter in each cluster member, and link cables to connect the adapters. A second Memory Channel hub is required for clusters containing more than two members.
The Memory Channel multirail model operates on the concept of physical rails and logical rails. A physical rail consists of a Memory Channel hub, its link cables, the Memory Channel adapters, and the Memory Channel driver for those adapters on each node. A logical rail is made up of one or two physical rails.
A cluster can have one or more logical rails, up to a maximum of four. Logical rails can be configured in the following styles:
If a cluster is configured in the single-rail style, there is a one-to-one relationship between physical rails and logical rails. This configuration has no failover properties; if the physical rail fails, the logical rail fails. Its primary use is for high-performance computing applications using the Memory Channel application programming interface (API) library and not for highly available applications.
If a cluster is configured in the failover pair style, a logical rail consists of two physical rails, with one physical rail active and the other inactive. If the active physical rail fails, a failover takes place and the inactive physical rail is used, allowing the logical rail to remain active after the failover. This failover is transparent to the user. The failover pair style is the default for all multirail configurations.
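The relationships among physical rails, logical rails, and the two configuration styles can be pictured with a short C sketch. The structures and the failure-handling function below are illustrative assumptions, not the actual Memory Channel driver data structures.

/*
 * Illustrative model of the multirail concepts: a logical rail is backed
 * by one physical rail (single-rail style) or by an active/standby pair
 * (failover pair style).  Names are hypothetical, not driver structures.
 */
#include <stddef.h>
#include <stdio.h>

#define MAX_LOGICAL_RAILS 4   /* a cluster may have up to four logical rails */

struct physical_rail {
    int hub_id;               /* Memory Channel hub                        */
    int adapter_slot;         /* PCI slot of the adapter on this node      */
    int failed;               /* nonzero once the rail has failed          */
};

enum rail_style { SINGLE_RAIL, FAILOVER_PAIR };

struct logical_rail {
    enum rail_style       style;
    struct physical_rail *active;    /* carries traffic                    */
    struct physical_rail *standby;   /* NULL in the single-rail style      */
};

/*
 * When the active physical rail of a failover pair fails, traffic moves
 * to the standby rail and the logical rail stays up; a single-rail
 * logical rail simply fails with its physical rail.
 */
static int logical_rail_handle_failure(struct logical_rail *lr)
{
    lr->active->failed = 1;
    if (lr->style == FAILOVER_PAIR && lr->standby != NULL &&
        !lr->standby->failed) {
        lr->active  = lr->standby;   /* transparent failover               */
        lr->standby = NULL;          /* no further failover until the
                                        failed rail is repaired            */
        return 0;                    /* logical rail still available       */
    }
    return -1;                       /* logical rail is down               */
}

int main(void)
{
    struct physical_rail a  = { .hub_id = 0, .adapter_slot = 1 };
    struct physical_rail b  = { .hub_id = 1, .adapter_slot = 2 };
    struct logical_rail  lr = { FAILOVER_PAIR, &a, &b };

    /* Simulate failure of the active physical rail. */
    if (logical_rail_handle_failure(&lr) == 0)
        printf("logical rail survived; now using hub %d\n", lr.active->hub_id);
    return 0;
}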
A cluster fails over from one Memory Channel interconnect to another if a configured and available secondary Memory Channel interconnect exists on all member systems, and if one of the following situations occurs in the primary interconnect:
More than 10 errors are logged within 1 minute (see the sketch following this list).
A link cable is disconnected.
The hub is turned off.
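One way to picture the first trigger, more than 10 errors within 1 minute, is as a sliding-window counter. The sketch below uses the thresholds from the text; the function itself is an illustration, not the Memory Channel error recovery code.

/*
 * Illustrative sliding-window error counter: report when more than 10
 * errors have been logged within any 60-second interval.
 */
#include <stdio.h>
#include <time.h>

#define ERROR_LIMIT 10       /* more than 10 errors ...                 */
#define WINDOW_SECS 60       /* ... within 1 minute                     */

static time_t error_times[ERROR_LIMIT + 1]; /* timestamps of recent errors */
static int    error_count;

/* Record one error; return 1 if the failover threshold is exceeded. */
static int mc_error_logged(time_t now)
{
    int kept = 0;

    /* Discard errors that have fallen out of the 60-second window. */
    for (int i = 0; i < error_count; i++)
        if (now - error_times[i] < WINDOW_SECS)
            error_times[kept++] = error_times[i];
    error_count = kept;

    if (error_count < ERROR_LIMIT + 1)
        error_times[error_count++] = now;

    return error_count > ERROR_LIMIT;
}

int main(void)
{
    time_t base = time(NULL);

    /* Eleven errors in the same minute trip the threshold. */
    for (int i = 0; i < 11; i++)
        if (mc_error_logged(base + i))
            printf("error limit exceeded after %d errors\n", i + 1);
    return 0;
}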
After the failover completes, the secondary Memory Channel interconnect becomes the primary interconnect. Another interconnect failover cannot occur until you fix the problem with the interconnect that was originally the primary.
If more than 10 Memory Channel errors occur on any member system within a 1-minute interval, the Memory Channel error recovery code determines whether a secondary Memory Channel interconnect is configured and available, as follows (see the sketch following the list):
If a secondary Memory Channel interconnect exists on all member systems, the member system that encountered the error marks the primary Memory Channel interconnect as bad and instructs all member systems (including itself) to fail over to their secondary Memory Channel interconnect.
If any member system does not have a secondary Memory Channel interconnect configured and available, the member system that encountered the error displays a message indicating that it has exceeded the Memory Channel hardware error limit and panics.
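The decision that the error recovery code makes can be outlined in C as follows. The helper functions (member_has_secondary_rail, request_failover, and panic) are hypothetical placeholders for cluster-internal interfaces and are stubbed here only so that the sketch is self-contained.

/*
 * Illustrative outline of the recovery decision described above.  The
 * helpers are hypothetical stand-ins, not cluster kernel interfaces.
 */
#include <stdio.h>
#include <stdlib.h>

struct member { int id; int has_secondary; };

/* Hypothetical: is a secondary rail configured and available on m? */
static int member_has_secondary_rail(const struct member *m)
{
    return m->has_secondary;
}

/* Hypothetical: tell member m to switch to its secondary interconnect. */
static void request_failover(const struct member *m)
{
    printf("member %d: failing over to secondary Memory Channel\n", m->id);
}

/* Hypothetical stand-in for a system panic. */
static void panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    exit(1);
}

/* Called on the member that exceeded the error limit. */
static void mc_error_limit_exceeded(struct member *members, int nmembers)
{
    /* Fail over only if every member has a secondary rail available;
     * otherwise the encountering member reports the error and panics. */
    for (int i = 0; i < nmembers; i++)
        if (!member_has_secondary_rail(&members[i]))
            panic("Memory Channel hardware error limit exceeded");

    /* Mark the primary interconnect bad and instruct all members
     * (including this one) to fail over to their secondary. */
    for (int i = 0; i < nmembers; i++)
        request_failover(&members[i]);
}

int main(void)
{
    struct member cluster[] = { { 1, 1 }, { 2, 1 }, { 3, 1 } };
    mc_error_limit_exceeded(cluster, 3);
    return 0;
}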
See the Cluster Hardware Configuration manual for information on how to configure the Memory Channel interconnect in a cluster.
The Memory Channel API library implements highly efficient memory sharing between Memory Channel API applications running on different cluster members, with automatic error handling, locking, and UNIX-style protections. See the Cluster Highly Available Applications manual for a discussion of the Memory Channel API library.