INTRODUCTION
Cluster nodes periodically communicate with each other on a network that is designated for cluster use. Each node has access to the following communication types:
- Internal cluster communications only
This type is also known as a private network.
- All communications
This type is also known as a mixed network. A mixed network is a combination of a private network and a public network.
Note A public network is designated for client network access. A public network is not used by the Cluster service for intra-node communications. However, the cluster network driver monitors the public network to make sure that it is available to all cluster nodes.
For intra-node communications, cluster nodes communicate over User Datagram Protocol (UDP) port 3343. Each node in the cluster periodically exchanges sequenced, unicast UDP datagrams with every other node in the cluster. The purpose of this exchange is to determine whether all nodes are running correctly and to monitor the health of network links.
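For illustration only, the following minimal sketch shows how one node might exchange sequenced unicast UDP datagrams with a peer on port 3343. It is not the Clusnet.sys implementation; the peer host name and the payload format are hypothetical, and the 1.2-second interval is the default heartbeat interval that is described later in this article.

import socket
import struct
import time

CLUSTER_UDP_PORT = 3343      # UDP port used for intra-node cluster communication
PEER = "node2-private"       # hypothetical host name of the peer node on the private network
HEARTBEAT_INTERVAL = 1.2     # default interval, in seconds, between heartbeats

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", CLUSTER_UDP_PORT))
sock.settimeout(0.2)

sequence = 0
while True:
    # Send a sequenced heartbeat datagram to the peer (the payload format is illustrative only).
    sock.sendto(struct.pack("!I", sequence), (PEER, CLUSTER_UDP_PORT))
    sequence += 1

    # Collect any heartbeats that arrive from the peer during this interval.
    deadline = time.monotonic() + HEARTBEAT_INTERVAL
    while time.monotonic() < deadline:
        try:
            data, sender = sock.recvfrom(64)
        except socket.timeout:
            continue
        peer_sequence = struct.unpack("!I", data[:4])[0]
        print("heartbeat", peer_sequence, "from", sender[0])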
The cluster network driver (Clusnet.sys) manages cluster communications. The cluster network driver performs the following functions:
- Provides a uniform interface for cluster node communications that is independent of the network infrastructure.
- Monitors the status of all communication paths in the cluster.
- Routes intra-cluster messages over the optimal paths.
- Detects node failures by using periodic messages that are known as "heartbeats".
- Detects failures in network and TCP/IP communications.
Event ID 1123 and event ID 1122 may be logged consecutively to the System log in your cluster. Frequently, these events indicate temporary interruptions in intra-cluster communication and can be ignored.
Event messages that are similar to the following are logged:
Message 1
Event ID: 1123
Source: ClusSvc
Description:
The node lost communication with cluster node ComputerName on network 'Public Network'.
Message 2
Event ID: 1123
Source: ClusSvc
Description:
The node lost communication with cluster node ComputerName on network 'Private Network'.
Message 3
Event ID: 1122
Source: ClusSvc
Description:
The node (re)established communication with cluster node ComputerName on network 'Public Network'.
Message 4
Event ID: 1122
Source: ClusSvc
Description:
The node (re)established communication with cluster node ComputerName on network 'Private Network'.
MORE INFORMATION
The heartbeat process
The exchange of UDP datagrams between nodes in a cluster is known as the "heartbeat process".
By default, heartbeats are sent every 1.2 seconds from each network interface on each node to each network interface on every other node in the cluster. In Windows Server 2003, multicast datagrams can be used to reduce the amount of heartbeat traffic that occurs between cluster nodes. By default, Windows Server 2003 uses multicast datagrams when three or more nodes are configured in a cluster.
Event IDs 1123 and 1122
Event ID 1123 indicates that node A in the cluster did not receive a heartbeat from node B in the cluster for two heartbeat intervals over a specified network interface. That means that node A did not receive a heartbeat from node B for 2.4 seconds.
Event ID 1122 indicates that node A received a heartbeat from node B again. This communication update arrives after 2.4 seconds but before 4.8 seconds have elapsed.
Event ID 1122 is logged if communications are re-established over a network interface that was previously shut down. For example, event ID 1122 occurs when a node that was shut down rejoins the cluster.
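To make these thresholds concrete, the following sketch shows the kind of bookkeeping that produces the events: one timer per remote network interface, an event ID 1123 analogue after two missed 1.2-second heartbeat intervals (2.4 seconds), and an event ID 1122 analogue when a heartbeat arrives again. This is an illustration of the timing that is described above, not the actual Cluster service code.

import time

HEARTBEAT_INTERVAL = 1.2     # default heartbeat interval, in seconds
MISSED_FOR_1123 = 2          # two missed intervals (2.4 seconds) log event ID 1123

class InterfaceMonitor:
    """Tracks heartbeats from one remote interface and mimics the 1123/1122 thresholds."""

    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()
        self.lost = False

    def heartbeat_received(self):
        # A heartbeat arrived; if communication was previously lost, log the 1122 analogue.
        if self.lost:
            print("1122: (re)established communication on", self.name)
            self.lost = False
        self.last_heartbeat = time.monotonic()

    def check(self):
        # Called periodically; log the 1123 analogue after two missed heartbeat intervals.
        silence = time.monotonic() - self.last_heartbeat
        if not self.lost and silence >= MISSED_FOR_1123 * HEARTBEAT_INTERVAL:
            print("1123: lost communication on", self.name, "after", round(silence, 1), "seconds")
            self.lost = True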
The regroup process
Assume that node A does not receive a heartbeat from node B for six consecutive heartbeat intervals over all network interfaces that are enabled for internal cluster communications. In this case, node B is assumed to be inactive, and the cluster may perform a "regroup" process. During a regroup process, the cluster network driver on node A notifies the Membership Manager and the Node Manager that a failure has occurred. The Membership Manager and the Node Manager initiate a regroup operation that takes node B offline and removes it from active membership in the cluster. When this regroup process occurs, event ID 1126 is logged in the System log. Event ID 1135 may be subsequently logged in the System log. Event ID 1135 indicates that a node has been removed from active cluster membership. Messages that are similar to the following are logged:
Message 1
Event ID: 1126
Source: ClusSvc
Description:
The interface for cluster node ClusterNode on network 'Public Network' is unreachable by at least one other cluster node attached to the network. The cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node ClusterNode. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Message 2
Event ID: 1135
Source: ClusSvc
Description:
Cluster node ClusterNode was removed from the active cluster membership. The Clustering Service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active cluster nodes.
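For reference, with the default 1.2-second heartbeat interval, the thresholds work out as follows: 2 x 1.2 seconds = 2.4 seconds of silence on one network interface logs event ID 1123, and 6 x 1.2 seconds = 7.2 seconds of silence on every network that is enabled for internal cluster communications marks the node as inactive and can start a regroup process.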
Troubleshooting event IDs 1123 and 1122
When event ID 1123 is followed by event ID 1122, you can generally ignore the events if the following conditions are true:
- There are no coincident failures of cluster IP address resources, and there are no concurrent resource group failovers.
- The nodes that were removed from cluster membership were removed only because of a loss of network communication. For example, a node was removed when the node was shut down or restarted.
Note You can also generally ignore event IDs 1124, 1126, 1127, and 1130 if they occur during a node restart.
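To confirm whether each event ID 1123 was followed by event ID 1122, and how quickly, you can pull the relevant ClusSvc entries from the System log and review the timestamps. The following sketch assumes the third-party pywin32 package (the win32evtlog module); it is an illustration, not a Microsoft tool, and it must be run on the node that logged the events.

import win32evtlog   # from the pywin32 package

CLUSTER_EVENT_IDS = {1122, 1123, 1124, 1126, 1127, 1130, 1135}

log = win32evtlog.OpenEventLog(None, "System")   # None means the local computer
flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ

try:
    while True:
        records = win32evtlog.ReadEventLog(log, flags, 0)
        if not records:
            break
        for record in records:
            event_id = record.EventID & 0xFFFF   # strip the severity bits from the event ID
            if record.SourceName == "ClusSvc" and event_id in CLUSTER_EVENT_IDS:
                inserts = ", ".join(record.StringInserts or [])
                print(record.TimeGenerated, event_id, inserts)
finally:
    win32evtlog.CloseEventLog(log)

Reviewing the output lets you verify the conditions in the previous list: whether event ID 1122 follows event ID 1123 within a few seconds, and whether the events coincide with node restarts.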
Important When event IDs 1126 and 1127 follow event ID 1123, a problem may exist. Event IDs 1126 and 1127 indicate that all cluster nodes agree that a network interface is not functioning correctly. In this case, messages that are similar to the following are logged:
Message 1
Event ID: 1126
Source: ClusSvc
Description:
The interface for cluster node ClusterNode on network 'Public Network' is unreachable by at least one other cluster node attached to the network. The cluster was not able to determine the location of the failure. Look for additional entries in the system event log indicating which other nodes have lost communication with node ClusterNode. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Message 2
Event ID: 1127
Source: ClusSvc
Description:
The interface for cluster node ClusterNode on network 'Public Network' failed. If the condition persists, check the cable connecting the node to the network. Next, check for hardware or software errors in the node's network adapter. Finally, check for failures in any network components to which the node is connected such as hubs, switches, or bridges.
This section describes possible reasons why you may receive event ID 1123 followed by event ID 1122. Use this information to evaluate and to troubleshoot these events before you contact Microsoft support.
Network adapter teaming
Network adapter teaming can involve a multi-port card or separate single-port PCI network adapters.
Note Network adapter teaming is not supported on the cluster heartbeat network adapter.
The following articles discuss network adapter teaming with Windows Clustering:
254101 Network adapter teaming and server clustering
276457 Event success messages 4201 and 1122 using Windows Clustering
Network adapter driver issues
Network adapter drivers may be outdated or incorrect. Additionally, some drivers may not match the drivers on other nodes in the cluster.
Network device failures
Network devices, such as switch ports or network adapters, may not be working correctly. However, if all cluster networks log the same error message, a network device is unlikely to be the cause.
If only one of the cluster networks logs event IDs 1123 and 1122, you may have one of the following problems:
- Device configuration mismatches
This problem occurs when the settings for the network adapter and for the port that the node is attached to do not match. For example, this problem occurs when a network adapter is set to Auto Negotiate and the switch port is set to 100 megabits per second (Mbps) full duplex. Additionally, some network adapters take over some of the functionality of the TCP/IP stack. For example, some network adapters perform flow control and hardware checksumming. As part of the troubleshooting process, you may have to configure the network adapter to return this functionality to the TCP/IP stack.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
174812 The effects of using Autodetect setting on cluster network interface card
- Switch port issues
This problem is identified by connecting the cluster node to another port. If you connect the node to another port and if event IDs 1123 and 1122 are not repeated, the problem is with the switch port. To identify this problem, you can also plug the cluster nodes into a network hub and then uplink the hub to the switch port. Use this method when the following conditions are true:
- The public network is supported by a switch.
- The private, or heartbeat, network is supported by either a hub or a crossover cable.
- Switch configuration issues
This problem occurs when the Spanning Tree Protocol (STP) is enabled on the port and the port is no longer in the forwarding state. Disable this configuration, or enable the Rapid Spanning Tree Protocol (RSTP) if the switch supports it. RSTP reduces the time that the switch port takes to transition from a blocking state to a forwarding state.
- Virtual local area network (VLAN) issues
This problem occurs when the cluster nodes are part of a VLAN where the ports reside on different physical switches and when a trunk link configuration is set up between the switches. To resolve this issue, move the node connection to a port that is on the same physical switch.
Node resource issues
A node resource problem occurs because the Server service cannot keep up with incoming or outgoing network connections. The Server service cannot meet the demand for the network items that are queued by the network layer of the I/O stream. In this case, a Server service event, such as event ID 2022, may be logged in the System log. A message that is similar to the following is logged:
Event ID: 2022
Source: Srv
Description:
Server was unable to find a free connection n times in the last NumberofSeconds seconds.
In this situation, deferred procedure call (DPC) requests are queued ahead of the network requests that are registered by the Interrupt Service Routine (ISR) for the network device.
To troubleshoot this issue, investigate all components of the I/O path. This includes the network I/O and hard disk I/O. Use System Monitor to collect this data.
For more information about how to troubleshoot this issue, click the following article number to view the article in the Microsoft Knowledge Base:
317249 How to troubleshoot event ID 2021 and event ID 2022
DPC requests that occur in a cluster are typically caused by the following sources:
- SCSI host bus adapter (HBA) and network adapter drivers.
- Multi-path software drivers, such as PowerPath or SecurePath.
- Redundant disk array controllers (RDAC).
- Third-party programs, such as backup software or disk quota software.
You must make sure that all third-party hardware device drivers are current and that they are supported in a cluster configuration by the hardware vendor. Additionally, we recommend that you contact your third-party program vendor for more information about how the third-party software functions in a clustered environment.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
814607 Microsoft support for server clusters with 3rd party system components
Incorrect software on Windows 2000-based cluster nodes
In a Windows 2000-based cluster, all cluster nodes must be running Windows 2000 Service Pack 3 or a later version. If your Windows 2000-based computer logged event ID 2022, view the following articles in the Microsoft Knowledge Base to resolve this issue:
317249 How to troubleshoot event ID 2021 and event ID 2022
245080 Receiving multiple instances of event ID 2022
If you still experience this issue, apply the hotfix that is described in the following article in the Microsoft Knowledge Base:
830901 Event ID 2022 is logged and your Windows 2000-based computer may stop responding
Incorrect registry settings
Warning Serious problems might occur if you modify the registry incorrectly by using Registry Editor or by using another method. These problems might require that you reinstall your operating system. Microsoft cannot guarantee that these problems can be solved. Modify the registry at your own risk.
To resolve event messages in Windows 2000-based and Windows Server 2003-based clusters, you may have to make changes to the following registry subkey on each node:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters
Important In Windows 2000-based clusters, you must install hotfix 872790 before you make these changes.
Add the following DWORD values to the registry subkey:
Value Name: MaxRawWorkItems
Data Type: REG_DWORD
Value data: 512 (decimal)
Value Name: MaxFreeConnections
Data Type: REG_DWORD
Value data: 4096 (decimal)
Value Name: MinFreeConnections
Data Type: REG_DWORD
Value data: 100 (decimal)
Value Name: MaxWorkItems
Data Type: REG_DWORD
Value data: 6000 (decimal)
To create the MaxWorkItems DWORD value, follow these steps:
- Click Start, click Run, type regedit, and then click OK.
- Locate and then click the following registry subkey:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters
- Right-click parameters, point to New, and then click DWORD Value.
- Type MaxWorkItems, and then press ENTER.
- Right-click MaxWorkItems, click Modify, type 6000, click to select the Decimal option, and then click OK.
Repeat these steps for each new DWORD value, and then restart your computer.
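As an alternative to the manual steps, the following sketch uses the winreg module from the Python standard library to create all four DWORD values on the local node. It is an illustration only; run it from an elevated prompt, apply any prerequisite hotfixes first, and restart the node afterward.

import winreg

SUBKEY = r"SYSTEM\CurrentControlSet\Services\lanmanserver\parameters"

# DWORD values and decimal data from the list earlier in this section.
VALUES = {
    "MaxRawWorkItems": 512,
    "MaxFreeConnections": 4096,
    "MinFreeConnections": 100,
    "MaxWorkItems": 6000,
}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SUBKEY, 0, winreg.KEY_SET_VALUE) as key:
    for name, data in VALUES.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, data)
        print("set", name, "=", data)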
High kernel-mode CPU usage
To troubleshoot high kernel-mode CPU usage, use System Monitor to identify the problem. High kernel-mode CPU usage may be caused by the following sources:
- Hardware drivers that use DPC and that compete with the DPC routines of the cluster heartbeat process.
- Frequent multiple hardware interrupt requests that occur at the same time.
- Excessive I/O, such as kernel debug sessions over a serial connection.
High CPU usage that is caused by SNMP agents
Third-party Simple Network Management Protocol (SNMP) agents that run in a cluster may periodically contact the NTFS file system on a shared cluster disk resource. The agents use the CreateFile function to contact NTFS. This behavior can cause significant CPU usage when the SNMP agent caches data on a specific volume.
Multicast issues
Multicast issues may occur in a Windows Server 2003 cluster. To troubleshoot multicast issues, disable multicast support in the cluster.
For more information about how to disable multicast, click the following article number to view the article in the Microsoft Knowledge Base:
307962 Multicast support enabled for the cluster heartbeat