# Fault Management Models

This section describes some of the faults that can occur in a cluster and how those faults are managed.

## Fault Types

When one critical fault occurs, it is called a single fault. A single fault can be the failure of one master-eligible node, the failure of a service, or the failure of one of the redundant networks. After a single fault, the cluster continues to operate correctly, but it is not highly available until the fault is repaired.

When two critical faults affect both parts of a redundant system, it is called a double fault. A double fault can be the simultaneous failure of both master-eligible nodes, or the simultaneous failure of both redundant network links. Although many double faults can be detected, it might not be possible to recover from all of them. Although rare, double faults can result in cluster failure.

Some faults can result in the election of two master nodes. This error scenario is called split brain. Split brain is usually caused by a communication failure between the master-eligible nodes. When communication between the master-eligible nodes is restored, the last elected master node remains the master node, and the other master-eligible node is elected as the vice-master node. The synchronized state must then be restored using Reliable NFS.

When a peer node does not receive information from the master node for more than 10 seconds, a stale cluster error occurs. A cluster becomes stale if the master node does not send information to the peer node, or if the information does not reach the peer node.

When a cluster restarts with stale cluster configuration data, the fault is called amnesia. Amnesia is caused by restarting the cluster from a node that was not part of the most recent cluster membership list.

## Fault Detection

Fault detection is critical for a cluster running highly available applications. The Foundation Services provide several fault detection mechanisms.
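The stale-cluster rule described above (a peer declares the cluster stale when it has heard nothing from the master for more than 10 seconds) can be sketched as a heartbeat watchdog. The class below is purely illustrative and is not the Foundation Services API; only the 10-second threshold comes from the text.

```python
import time

# Threshold from the text: a peer node declares the cluster stale when it has
# not received information from the master for more than 10 seconds.
STALE_TIMEOUT_S = 10.0

class HeartbeatMonitor:
    """Hypothetical watchdog tracking the last message seen from the master."""

    def __init__(self, timeout=STALE_TIMEOUT_S, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_seen = clock()  # assume the master was just heard from

    def heartbeat(self):
        """Record an incoming message from the master node."""
        self._last_seen = self._clock()

    def is_stale(self):
        """True when the master has been silent for longer than the timeout."""
        return self._clock() - self._last_seen > self._timeout
```

Injecting the clock keeps the watchdog testable: a fake clock can simulate 10 seconds of silence without actually waiting.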
## Fault Reporting

Errors that indicate a potential failure are reported so that you can understand the sequence of events that led to the problem. The Foundation Services provide several fault reporting mechanisms.
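The point of fault reporting is reconstructing the sequence of events after the fact. A minimal sketch of that idea, assuming a simple in-memory event log (not any actual Foundation Services interface), is:

```python
import time

class FaultReporter:
    """Hypothetical fault log: records timestamped events so that the
    sequence leading up to a failure can be reconstructed afterwards."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._events = []  # (timestamp, node, severity, message) tuples

    def report(self, node, severity, message):
        """Record one error or warning event."""
        self._events.append((self._clock(), node, severity, message))

    def history(self, node=None):
        """Events in order of occurrence, optionally filtered by node."""
        return [e for e in self._events if node is None or e[1] == node]
```

A real implementation would persist events off the failing node; an in-memory log is lost with the node it describes.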
## Fault Isolation

Fault isolation has two aspects: isolation and redundancy. When a fault occurs in the cluster, the node on which the fault occurred is isolated. The Cluster Membership Manager ensures that the failed node cannot communicate with the other peer nodes. If the master node fails, a failover occurs. If a diskless node or a dataless node fails, there is no failover.

## Fault Recovery

The first recovery response to a critical failure is failover to a backup node or service. Failover ensures the continuation of a service until the failure is repaired. Failed nodes are often repaired by a reboot. Overload errors are often repaired by waiting for an acceptable delay and then rebooting the node or restarting the failed service. The Foundation Services are designed so that individual nodes can be shut down and restarted independently, reducing the impact of errors. After a failover, the master node and vice-master node are synchronized so that the repaired vice-master node can rejoin the cluster in its current state.
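The master/vice-master failover and rejoin cycle described above can be sketched as a small state machine. The class and node names below are hypothetical; the resynchronization step (Reliable NFS in the text) is reduced here to the act of rejoining.

```python
class FailoverController:
    """Hypothetical sketch of master/vice-master failover.

    On master failure the vice-master is promoted so the service continues;
    the repaired node rejoins as the new vice-master after resynchronization.
    """

    def __init__(self, master, vice_master):
        self.master = master
        self.vice_master = vice_master

    def on_master_failure(self):
        """Promote the vice-master; return the failed node for repair."""
        failed = self.master
        self.master = self.vice_master
        self.vice_master = None  # single fault: running, but not highly available
        return failed

    def rejoin(self, node):
        """A repaired, resynchronized node rejoins as the vice-master."""
        if self.vice_master is None:
            self.vice_master = node

# Illustrative sequence: node-a fails, node-b takes over, node-a rejoins.
cluster = FailoverController(master="node-a", vice_master="node-b")
failed = cluster.on_master_failure()   # node-b is now master
cluster.rejoin(failed)                 # node-a returns as vice-master
```

Note that between the failure and the rejoin the cluster is in the single-fault state described under Fault Types: it operates correctly but is not highly available.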