
Fault Management Models

This section describes some of the faults that can occur in a cluster, and how those faults are managed.

Fault Types

The failure of a single critical component is called a single fault. A single fault can be the failure of one master-eligible node, the failure of a service, or the failure of one of the redundant networks. After a single fault, the cluster continues to operate correctly, but it is not highly available until the fault is repaired.

When two critical faults affect both parts of a redundant system, the result is called a double fault. A double fault can be the simultaneous failure of both master-eligible nodes or the simultaneous failure of both redundant network links. Many double faults can be detected, but it might not be possible to recover from all of them. Although rare, double faults can result in cluster failure.

Some faults can result in the election of two master nodes. This error scenario is called split brain. Split brain is usually caused by communication failure between master-eligible nodes. When the communication between the master-eligible nodes is restored, the last elected master node remains the master node. The other master-eligible node is elected as the vice-master node. The synchronized state must then be restored using Reliable NFS.
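The resolution rule can be illustrated with a short sketch in Python. This is not the Foundation Services implementation; the node records and the election_id counter are assumptions used only to show how the most recently elected master keeps its role.

    # Minimal sketch of split-brain resolution: the node elected master most
    # recently keeps the role, and the other node is demoted to vice-master.
    # The election_id counter is an illustrative assumption.
    def resolve_split_brain(node_a, node_b):
        """Return (master, vice_master) once communication is restored."""
        if node_a["election_id"] >= node_b["election_id"]:
            master, vice = node_a, node_b
        else:
            master, vice = node_b, node_a
        master["role"] = "master"
        vice["role"] = "vice-master"
        return master, vice

    # Example: node B was elected after node A, so node B remains the master.
    a = {"name": "nodeA", "role": "master", "election_id": 7}
    b = {"name": "nodeB", "role": "master", "election_id": 8}
    master, vice = resolve_split_brain(a, b)
    print(master["name"], "remains master;", vice["name"], "becomes vice-master")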

When a peer node does not receive information from the master node for more than 10 seconds, a stale cluster error occurs. A cluster becomes stale if the master node does not send information to the peer node, or if information does not reach the peer node.
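The staleness check amounts to a simple timeout rule, sketched below in Python. The 10-second threshold comes from the description above; the function and argument names are illustrative assumptions.

    import time

    STALE_THRESHOLD = 10.0  # seconds of silence from the master node

    def is_cluster_stale(last_master_update, now=None):
        """Return True if the master has been silent for longer than the threshold.

        last_master_update is the time (seconds since the epoch) at which the
        peer node last received information from the master node.
        """
        if now is None:
            now = time.time()
        return (now - last_master_update) > STALE_THRESHOLD

    # Example: the last update arrived 12 seconds ago, so the cluster is stale.
    print(is_cluster_stale(time.time() - 12))   # True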

When a cluster restarts with stale cluster configuration data, the fault is called amnesia. Amnesia is caused by restarting the cluster from a node that was not previously part of the most recent cluster membership list.
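A guard against amnesia can be sketched as a membership check, as below. The node names and the shape of the membership list are assumptions for illustration; the point is that only a node recorded in the most recent membership list should restart the cluster with its copy of the configuration data.

    # Minimal sketch of an amnesia guard: refuse to restart the cluster from a
    # node that is not in the most recent cluster membership list.
    def may_restart_cluster(node_id, latest_membership):
        """Return True if node_id belongs to the most recent membership list."""
        return node_id in latest_membership

    latest_membership = {"nodeA", "nodeB", "nodeC"}

    print(may_restart_cluster("nodeB", latest_membership))   # True: safe to restart
    print(may_restart_cluster("nodeX", latest_membership))   # False: risk of amnesia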

Fault Detection

Fault detection is critical for a cluster running highly available applications. The Foundation Services have the following fault detection mechanisms:

  • The Cluster Membership Manager detects the failure of peer nodes. It notifies the other peer nodes of the failure. For information about the Cluster Membership Manager, see Chapter 8, Cluster Membership Manager.

  • The Daemon Monitor supervises Foundation Services daemons, many Solaris operating system daemons, and some companion product daemons. When a critical service or a descendant of a critical service fails, the Daemon Monitor detects the failure and triggers a recovery response (a supervision sketch follows this list). For information about the Daemon Monitor, see Chapter 10, Daemon Monitor.

  • The Watchdog Timer monitors hardware watchdogs at the lights-off management level. For information about the Watchdog Timer, see Chapter 12, Watchdog Timer.
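The supervision pattern used by the Daemon Monitor can be sketched as a polling loop, shown below in Python. The daemon names, the use of pgrep, and the restart-as-recovery response are assumptions made for this illustration; they are not the Daemon Monitor implementation.

    # Minimal sketch of daemon supervision: poll a list of critical daemons and
    # trigger a recovery response (here, a simple restart) when one disappears.
    import subprocess
    import time

    CRITICAL_DAEMONS = ["exampled", "monitord"]   # hypothetical daemon names

    def is_running(name):
        """Return True if a process with exactly this name is found."""
        return subprocess.run(["pgrep", "-x", name],
                              stdout=subprocess.DEVNULL).returncode == 0

    def recover(name):
        """Recovery response: restart the failed daemon."""
        print("daemon", name, "failed; restarting")
        subprocess.Popen([name])

    def supervise(poll_interval=5):
        """Monitoring loop; in practice this would run as a daemon itself."""
        while True:
            for daemon in CRITICAL_DAEMONS:
                if not is_running(daemon):
                    recover(daemon)
            time.sleep(poll_interval)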

Fault Reporting

Errors that indicate potential failure are reported so that you can understand the sequence of events that have led to the problem. The Foundation Services have the following fault reporting mechanisms:

  • All error messages are sent to system log files. For information about how to configure log files, see the Netra High Availability Suite Foundation Services 2.1 6/03 Cluster Administration Guide.

  • The Cluster Membership Manager on the master node notifies clients when a node fails, a failover occurs, or the cluster membership changes. Clients can be subscribed system services or applications (a subscription sketch follows this list).

  • The Node Management Agent can be used to develop applications that retrieve statistics on the Cluster Membership Manager, CGTP, Reliable NFS, and the Daemon Monitor. These applications can be used to detect faults or diminished levels of service on your system. For further information on how to collect and manage node and cluster statistics, see the Netra High Availability Suite Foundation Services 2.1 6/03 NMA Programming Guide.
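The client notification performed by the Cluster Membership Manager can be sketched as a simple publish-and-subscribe pattern. The class, method, and event names below are assumptions for illustration; they are not the Cluster Membership Manager interface.

    # Minimal sketch of membership notification: clients subscribe a callback
    # and are told when a node fails, a failover occurs, or membership changes.
    class MembershipNotifier:
        def __init__(self):
            self._subscribers = []

        def subscribe(self, callback):
            """Register a system service or application callback."""
            self._subscribers.append(callback)

        def publish(self, event, node):
            """Notify every subscribed client of a membership event."""
            for callback in self._subscribers:
                callback(event, node)

    notifier = MembershipNotifier()
    notifier.subscribe(lambda event, node: print("client saw:", event, node))
    notifier.publish("NODE_FAILED", "nodeC")
    notifier.publish("FAILOVER", "nodeB")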

Fault Isolation

Fault isolation has two aspects: isolating the failed node and relying on redundancy. When a fault occurs in the cluster, the node on which the fault occurred is isolated, and the Cluster Membership Manager ensures that the failed node cannot communicate with the other peer nodes.

If the master node fails, a failover occurs. If a diskless node or dataless node fails, there is no failover.
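The difference in recovery behavior can be summarized in a small decision function. The role names come from the text; the function itself is an illustrative assumption, not part of the Foundation Services.

    # Minimal sketch of the isolation rule: only failure of the master node
    # triggers a failover; other node failures lead to isolation only.
    def recovery_action(failed_role):
        """Return the cluster's response to the failure of a node in this role."""
        if failed_role == "master":
            return "failover: the vice-master node takes over as master"
        if failed_role == "vice-master":
            return "isolate node: cluster continues, but is not highly available"
        if failed_role in ("diskless", "dataless"):
            return "isolate node: no failover"
        return "unknown role"

    for role in ("master", "vice-master", "diskless", "dataless"):
        print(role, "->", recovery_action(role))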

Fault Recovery

The first recovery response to a critical failure is the failover to a backup node or service. Failover ensures the continuation of a service until the failure is repaired.

Failed nodes are often repaired by reboot. Overload errors are often repaired by waiting for an acceptable delay and then rebooting or restarting the failed service. The Foundation Services are designed so that individual nodes can be shut down and restarted independently, reducing the impact of errors. After failover, the master node and vice-master node are synchronized so that the repaired vice-master node can rejoin the cluster in its current state.
