Explanation of why server clusters do not verify that resources will work properly on all nodes (303431)



The information in this article applies to:

  • Microsoft Windows Server 2003, Datacenter Edition
  • Microsoft Windows Server 2003, Enterprise Edition
  • Microsoft Windows 2000 Advanced Server
  • Microsoft Windows 2000 Datacenter Server

This article was previously published under Q303431

SUMMARY

The Microsoft Cluster service (MSCS) in Microsoft Windows Server 2003 and in Microsoft Windows 2000 Server cannot verify that clustered resources are able to run on all nodes in the cluster. For example, only one node at a time can hold a physical handle to a disk, so it is not possible to verify that another node can bring the disk online without disrupting the node that currently owns it. It is the responsibility of the administrator to verify that the cluster is configured properly and that all groups can fail over to the appropriate nodes.

See the "Test node Failure" topic in the Server Cluster section of online Help for ways to test whether your server cluster will fail over properly. Do this before the server cluster is put into production to verify that it will provide a highly available cluster solution. For a variety of reasons, some resources will come online on one node in the cluster but not on the others. This article describes a few of those reasons.

MORE INFORMATION

To find out which resource is keeping a group from failing over, first take the entire group offline, and then use the Move Group command to move the group to the problematic node. Then bring each resource online individually, starting with the lowest resource on the dependency tree. Normally the first resource to bring online is the Physical Disk, followed by the IP Address and the Network Name, and finally the application resources, such as File Shares, Exchange/SQL, Generic Services, and so on. The following sections list common resources and some typical reasons why they might not come online on all nodes of the cluster.
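The bottom-up order described above can be worked out mechanically from the dependency tree. The following Python sketch is illustrative only: the resource names and the dependency map are assumptions made for the example, not values read from a real cluster. It uses a topological sort so that every resource appears after the resources it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical group: each resource maps to the set of resources it depends on.
dependencies = {
    "Physical Disk": set(),
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Physical Disk", "Network Name"},
}

def online_order(deps):
    """Return the resources so that every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())

order = online_order(dependencies)
for step, resource in enumerate(order, start=1):
    print(f"{step}. bring online: {resource}")
```

The first resource in this order that fails to come online on the problematic node is the one to investigate.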

Network Name and IP Address resources

If a Network Name or IP Address resource will not come online on one node but will on another node, there is probably something wrong with the network subsystem on the problematic node. First, check that the network cables of the problematic node are attached properly. Verify that the public network interface has connectivity with the other nodes in the cluster and can establish a connection to the WINS and DNS servers. From the problematic node, ping the network name and IP address of the working node, and then ping the WINS and DNS servers. The Network Name and IP Address resources rely on the local node's TCP/IP configuration. Verify that the local node has the appropriate default gateway, subnet mask, and so on for the network adapter that is attached to the public network. Verify that the public and the private networks are on different logical networks.
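Two of the TCP/IP checks above can be expressed as simple address arithmetic: the public and private networks should not overlap, and the default gateway should fall inside the public subnet. The following Python sketch shows one way to test this. The function name and all addresses are made up for illustration and must be replaced with the values from the node being examined.

```python
import ipaddress

def check_node_network(public_if, private_if, gateway):
    """Return a list of problems found in a node's TCP/IP settings.

    public_if and private_if are "address/prefix" strings for the two adapters.
    """
    public = ipaddress.ip_interface(public_if)
    private = ipaddress.ip_interface(private_if)
    problems = []
    # The public and private adapters must be on different logical networks.
    if public.network.overlaps(private.network):
        problems.append("public and private networks overlap")
    # The default gateway must be reachable on the public subnet.
    if ipaddress.ip_address(gateway) not in public.network:
        problems.append("default gateway is outside the public subnet")
    return problems

# Example values only (assumptions, not taken from a real cluster).
good = check_node_network("192.168.10.21/24", "10.0.0.1/8", "192.168.10.1")
bad = check_node_network("10.1.0.21/16", "10.1.0.1/24", "10.2.0.1")
print(good)
print(bad)
```

An empty list means that neither problem was detected. It does not replace the connectivity tests with ping.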

Physical Disk resource

If a Physical Disk resource will not come online on a particular cluster node, there is probably a problem with that node's disk subsystem. Some troubleshooting steps can be performed while the server cluster is online, but others may require that the server cluster be taken down for maintenance and troubleshooting.

While the cluster is online, log on as administrator on the problematic node. In Device Manager, view the SCSI and RAID Controllers branch and verify that the Host Bus Adapter (HBA) for the shared disk has a driver loaded, is enabled, and has a device status of "working properly". Verify that the HBA has the same driver and firmware version as the HBA in the working node. Under the Disk Drives branch, verify that the disk is listed and is not disabled.

Note The disks will have been scanned and should be listed even though they are not accessible on the problematic node.

If the problematic node still cannot bring the Disk resource online, further troubleshooting requires that the entire cluster be brought down. To start, create a complete backup from the nodes that are online and working. After the cluster is backed up, power down all nodes except for the problematic node. On the problematic node, open Computer Management in Administrative Tools. Open the Services and Applications branch and select Services. Double-click the Cluster Service, set the startup type to Manual, and then click OK.

In Device Manager, view System Devices. On the View menu, click Show Hidden Devices. A new Non-Plug and Play Drivers branch is displayed in Device Manager. Open this branch, double-click Cluster Disk, and then click the Driver tab. Under the Startup section, change the "Type:" to Disabled, and then click OK. Quit Computer Management, and then restart the server. After the server comes back online, all disks on the shared bus should be accessible. Because the cluster components have been disabled, all drives should behave as they would in Windows without clustering.

Important Only have one node powered on if the cluster disk driver is disabled.

Once the server comes up after the clustering components have been disabled, use the Disk Management Console to verify that the problematic node can access the disk and that the drive letters are assigned correctly. Verify that the disk can be written to by copying several large files to the disk (typical of the size that the program uses).
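The write test described above can also be scripted. The following Python sketch writes a test file of a chosen size, reads it back, and compares SHA-256 checksums. The function name, the test file name, and the 5 MB size are assumptions for illustration. In practice, point it at a folder on the shared disk and use a size typical of the program's files. Note that the read-back may be served from the operating system's cache, so this is a basic sanity check rather than proof that the data reached the disk.

```python
import hashlib
import os
import tempfile

def write_and_verify(directory, size_bytes, chunk=1024 * 1024):
    """Write size_bytes of random data to a test file, read it back, compare hashes."""
    path = os.path.join(directory, "cluster_write_test.bin")  # hypothetical file name
    written = hashlib.sha256()
    remaining = size_bytes
    try:
        with open(path, "wb") as f:
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                f.write(block)
                written.update(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.digest() == read_back.digest()
    finally:
        # Remove the test file whether or not the check succeeded.
        if os.path.exists(path):
            os.remove(path)

# Demonstration against a temporary folder; replace with a path on the shared disk.
with tempfile.TemporaryDirectory() as demo_dir:
    ok = write_and_verify(demo_dir, 5 * 1024 * 1024)
    print("write test passed" if ok else "write test FAILED")
```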

Generic Service resource

Bring all of the resources online except for the Generic Service on the problematic node. Once all required resources are online for the Generic Service (Disk, Network Name, IP Address, and so on), click Services under the Services and Applications branch in Computer Management. Locate the service that the Generic Service resource starts, and then double-click it. On the Log On tab, verify that the service is using an account with the proper permissions. On the General tab, click the Start button to see if the service starts. If it does not, look in the Application log for details about the error that is displayed, or check with the manufacturer of the service for information about why the service does not start.

Generic Application resource

Bring all resources online except for the Generic Application on the problematic node. Once all required resources are online for the Generic Application (Disk, Network Name, IP Address, and so on), go to the location of the executable (.exe) program file that the resource points to, and then double-click it. Note any error messages that you receive. Verify that the problematic node's Cluster Service account has the appropriate permissions and that the location of the files is accessible by the node. Create a test Generic Application resource that starts Notepad.exe to verify that the Generic Application resource type (Clusres.dll) is working properly.
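As an alternative to double-clicking the program file, the program can be started from a script so that its exit code and error text are captured. The following Python sketch is one way to do this. The function name and the timeout are assumptions, and the demonstration runs the Python interpreter itself rather than a real clustered program. The script runs under the account of the user who starts it, not under the Cluster Service account, so a success here does not rule out a permissions problem.

```python
import subprocess
import sys

def try_start(command, timeout_s=10):
    """Run a command and report whether it started, its exit code, and any error text."""
    try:
        done = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except OSError as exc:
        # The program file is missing or cannot be executed.
        return {"started": False, "exit_code": None, "error": str(exc)}
    except subprocess.TimeoutExpired:
        # subprocess.run stops the program when the timeout elapses.
        return {"started": True, "exit_code": None, "error": "still running at timeout"}
    return {"started": True, "exit_code": done.returncode, "error": done.stderr.strip()}

# Demonstration: a stand-in program that exits with code 3.
demo = try_start([sys.executable, "-c", "import sys; sys.exit(3)"])
print(demo)
```

A program that normally keeps running, such as a windowed application, is reported as still running at the timeout, which in this context indicates that it started.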

File Share resource

Bring all resources online except for the File Share on the problematic node. Once all required resources are online for the File Share (Disk, Network Name, IP Address, and so on), use Windows Explorer to verify that the path to the file share is accessible. Verify that the Cluster Service account has appropriate permissions to the file share path. Verify that the Server service is functioning properly on the problematic node. Log on, connect to the problematic node's administrative share (\\%computername%\C$), and then copy files that are typical of the size that the share normally contains.
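The path check above can be scripted as a quick first test. The following Python sketch reports whether a path exists, is a folder, and can be listed. The function name is an assumption, and the demonstration uses a temporary folder; in practice, pass the file share path or the administrative share path. The check runs under the account of the user who starts the script, so permissions for the Cluster Service account still need to be verified separately.

```python
import os
import tempfile

def check_share_path(path):
    """Report whether a path exists, is a folder, and can be listed."""
    result = {
        "exists": os.path.exists(path),
        "is_dir": os.path.isdir(path),
        "listable": False,
    }
    if result["is_dir"]:
        try:
            os.listdir(path)
            result["listable"] = True
        except OSError:
            # The folder exists but the current account cannot list it.
            pass
    return result

# Demonstration against a temporary folder and a path that does not exist.
with tempfile.TemporaryDirectory() as demo_dir:
    present = check_share_path(demo_dir)
    absent = check_share_path(os.path.join(demo_dir, "missing_share"))
print(present)
print(absent)
```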

Modification Type: Minor
Last Reviewed: 3/9/2006
Keywords:kbenv kbinfo kbnetwork KB303431