Data May Not Be Saved to a Node When a Cluster Loses Its Ability to Communicate (289863)



The information in this article applies to:

  • Microsoft Windows NT Server, Enterprise Edition 4.0
  • Microsoft Windows NT Server, Enterprise Edition 4.0 SP4
  • Microsoft Windows NT Server, Enterprise Edition 4.0 SP5
  • Microsoft Windows NT Server, Enterprise Edition 4.0 SP6
  • Microsoft Windows NT Server, Enterprise Edition 4.0 SP6a

This article was previously published under Q289863

SYMPTOMS

When a cluster loses all its abilities to communicate with its nodes, the nodes may arbitrate as to which node is to become the owner of the quorum that owns all of the resources. Any physical disk resources that are on the losing (or halted) node may still seem to be attached to the cluster, which can lead you to assume that both read and write operations are enabled. However, these operations may not be enabled and any changes that you may make using these operations may not be saved to disk.

CAUSE

This problem can occur when a cluster uses a single network for all of its communications. For example, in Windows NT 4.0, node 2 owns the quorum and arbitrates immediately for the quorum disk. Node 1 owns data disk D. Node 2 is the quorum owner so the arbitration process is fast. Node 1 does not own the quorum so the arbitration process is much slower. Node 1 breaks the reservation that node 2 has on the quorum, and then gives node 2 the opportunity to re-establish the reservation. This process is how node 1 is able to determine whether node 2 is active when all of the communication links are inactive.

When node 2 has performed its arbitration process, node 2 starts to bring online all of the groups and the resources, including disk D. Node 2 protects disk D with a reservation. However, until node 1 has performed its arbitration process, node 1 still operates as though it retains ownership of disk D and continues to enable input/outputs (I/Os) to the disk. The reservation that node 2 placed on the disk can cause the I/Os from node 1 to be unsuccessful. Node 1 operates as though it owns the disk and regains ownership of the disk by breaking the reservation, and then by retrying the I/Os. This time the I/Os work, but the disk is not in a state that is consistent with the in-memory state of node 1 so the disk becomes corrupted.

RESOLUTION

To resolve this problem, upgrade to Microsoft Windows 2000 Service Pack 1 (SP1).

This problem is fixed in Windows 2000 SP1 by ensuring that all of the nodes have a minimum arbitration time.

WORKAROUND

To work around this problem, configure the cluster with multiple network adapters.

When this problem occurs, all of the communication links between the nodes are broken and the quorum arbitration mechanism is the final mechanism for determining the nodes that are in the cluster as well as the nodes that are not going to be part of a split-brain cluster. By adding an additional network connection between the nodes, the potential single point of communication failure is removed and the problem cannot occur with the loss of a single connection or a single network adapter failure. However, a cluster with one network adapter for each node is not a highly available solution.

STATUS

Microsoft has confirmed that this is a problem in the Microsoft products that are listed at the beginning of this article.

MORE INFORMATION

Data corruption problems can also occur in a Windows NT 4.0 cluster because of a known problem in Windows NT 4.0. This problem is related to the cluster regroup and disk arbitration mechanisms. To illustrate this problem, consider a two-node cluster, node 1 and node 2. Node 1 owns data disk D that is being used by client computers and node 2 owns quorum Q. The method whereby node 1 uses the disk is not relevant. Whether node 1 has either local access or remote access from a Windows-based client is also irrelevant.

You can assume in this example that the cluster is set up to have a single network between the nodes (not a recommended deployment). If the network cable is removed, both the nodes lose all network communication to each other. In Windows NT 4.0, node 2 owns the quorum and arbitrates immediately for the quorum disk. Because node 2 is the quorum owner already, the arbitration process is fast. Node 1 does not own the quorum and the arbitration process is much slower. Node 1 breaks the reservation that node 2 has on the quorum, and then gives node 2 the opportunity to re-establish the reservation. This process is how node 1 determines whether node 2 is active when all of the communication links are inactive.

When node 2 has performed its arbitration process, node 2 starts to bring online all of the groups and the resources, including disk D. Node 2 protects disk D with a reservation. However, until node 1 has completed its arbitration process, node 1 still operates as though it owns disk D and continues to enable I/Os to the disk. The reservation that node 2 put on the disk can causes I/Os from node 1 to be unsuccessful. Because Node 1 operates as though it owns the disk, node 1 takes the disk back by breaking the reservation, and then retrying the I/Os. This time the I/Os work, but the disk is not in a state that is consistent with the in-memory state of node 1 so the disk becomes corrupted.

Modification Type:MajorLast Reviewed:10/27/2003
Keywords:kbbug kbfix kbQFE KB289863