Windows Server 2003 Server Cluster with a Generic Script Resource Stops Responding for Long Periods (811685)



The information in this article applies to:

  • Microsoft Windows Server 2003, Standard Edition
  • Microsoft Windows Server 2003, Enterprise Edition
  • Microsoft Windows Server 2003, Datacenter Edition
  • Microsoft Windows Server 2003, Web Edition
  • Microsoft Windows Server 2003, 64-Bit Enterprise Edition
  • Microsoft Windows Server 2003, 64-Bit Datacenter Edition

SYMPTOMS

In a cluster where there is an active Generic Script resource, the cluster may become unresponsive. Cluster Administrator and Cluster.exe appear to stop responding (hang). The cluster log shows blocked threads inside a Generic Script resource. For example:

000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources for resource f37f58fb-03ff-44b3-a4d7-086b0838d73d
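When reviewing a large cluster log, it can help to pull out the resource identifier from entries like the one above. The following is a minimal Python sketch, not a supported tool. It assumes that each entry has the layout "process ID.thread ID::date-time LEVEL [component] message", with the two leading fields in hexadecimal; that layout is inferred from the sample line, and the function and field names are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout (inferred from the sample entry, not from a specification):
#   <hex process id>.<hex thread id>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> [<component>] <message>
LOG_LINE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{1,8})\.(?P<tid>[0-9a-fA-F]{1,8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)

# Resource identifiers appear in the message as GUIDs (8-4-4-4-12 hex digits).
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def parse_cluster_log_line(line):
    """Return a dict of fields for one log entry, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    guid = GUID.search(match.group("message"))
    return {
        "process_id": int(match.group("pid"), 16),
        "thread_id": int(match.group("tid"), 16),
        "timestamp": datetime.strptime(match.group("stamp"), "%Y/%m/%d-%H:%M:%S.%f"),
        "level": match.group("level"),
        "component": match.group("component"),
        "resource_id": guid.group(0) if guid else None,
        "message": match.group("message"),
    }
```

Filtering the parsed entries on the "FmpRmOnlineResource" message text and grouping them by resource_id would show which resource the blocked calls refer to.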

The event log contains a message similar to either of the following:

Event ID: 1232
Event Type: Error
Event Source: ClusSvc
Cluster generic script resource MyScript timed out. Online script entry point did not complete execution in a timely manner. This could be due to an infinite loop or a hang in this entry point, or the pending timeout may be too short for this resource. Please review the Online script entry point to make sure there's no infinite loop or a hang in the script code, and then consider increasing the pending timeout value if necessary. In a command shell, run "cluster res "MyScript" /prop PersistentState=0" to disable this resource, and then run "net stop clussvc" to stop the cluster service. Ensure that any problem in the script code is fixed. Then run "net start clussvc" to start the cluster service. If necessary, ensure that the pending time out is increased before bringing the resource online again.

or

Event ID: 1233
Event Type: Error
Event Source: ClusSvc
Cluster generic script resource MyScript: Request to perform the Online operation will not be processed. This is because of a previous failed attempt to execute the Online entry point in a timely fashion. Please review the script code for this entry point to make sure there is no infinite loop or a hang in it, and then consider increasing the resource pending timeout value if necessary. In a command shell, run "cluster res "MyScript" /pro PersistentState=0" to disable this resource, and then run "net stop clussvc" to stop the cluster service. Ensure that any problem in the script code is fixed. Then run "net start clussvc" to start the cluster service. If necessary, ensure that the pending time out is increased before bringing the resource online again.

CAUSE

A Generic Script resource script can cause the whole cluster to stop responding if any of the following conditions exists:
  • The script contains an infinite loop (and therefore never exits).
  • The script calls certain cluster application programming interfaces (APIs). Some cluster APIs must not be called from within a resource DLL or resource script because they can cause a cluster-wide deadlock. The script may call these APIs directly, or it may start Cluster.exe as one of its steps (which may in turn call cluster APIs that must be avoided). For information about APIs that should not be called from a resource DLL or script, see "Function Calls to Avoid in Resource DLLs" in the Microsoft Platform SDK (PSDK).
  • An action that the script performs takes longer than the pending timeout value.

To avoid hanging indefinitely, the Cluster Resource Monitor refuses to perform any operations (such as Online, Offline, IsAlive, and LooksAlive) on the script after any operation has exceeded the pending timeout value. Any additional attempt to perform a Generic Script resource operation on that resource results in the second event log message (Event ID 1233) that is shown in the "Symptoms" section of this article.
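The refusal behavior can be pictured with a small Python model. This is only an illustrative sketch of the behavior described in this article, not the Resource Monitor's actual implementation; the class name, method names, and return strings are hypothetical.

```python
import threading
import time


class ScriptMonitorModel:
    """Toy model: one overrun of the pending timeout blocks all later operations."""

    def __init__(self, pending_timeout_s):
        self.pending_timeout_s = pending_timeout_s
        self.blocked = False  # set once any entry point exceeds the timeout

    def call(self, name, entry_point):
        if self.blocked:
            # Later requests are rejected without running the script again.
            return f"{name}: refused (compare Event ID 1233)"
        worker = threading.Thread(target=entry_point, daemon=True)
        worker.start()
        worker.join(self.pending_timeout_s)
        if worker.is_alive():
            # The overrunning thread is not terminated; it keeps running.
            self.blocked = True
            return f"{name}: timed out (compare Event ID 1232)"
        return f"{name}: completed"


def demo():
    """Run a slow Online call followed by a fast LooksAlive call."""
    monitor = ScriptMonitorModel(pending_timeout_s=0.05)
    first = monitor.call("Online", lambda: time.sleep(0.5))
    second = monitor.call("LooksAlive", lambda: None)
    return [first, second]
```

In this model the second call is refused even though it would have finished immediately, which mirrors why the resource stays unusable until the Cluster service is stopped and restarted.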

RESOLUTION

The Cluster Resource Monitor will not perform any additional operations on a Generic Script resource after any entry point has exceeded the pending timeout value, but the problematic thread will continue to run. To resolve the problem, disable the resource (that is, prevent it from coming online), stop the Cluster service (this terminates the problematic thread), fix the script problem, and then restart the Cluster service. Depending on the cause of this problem, you may want to increase the online or offline pending timeout value for this resource. For step-by-step instructions, see the "Recover and Restart the Cluster Service" section later in this article.

Changing Pending Timeout Values

Any cluster resource operation should complete well within the pending timeout. For this reason, do not change the timeout value without a thorough understanding of why your script entry point exceeds this period of time. Also, consider all the implications of increasing this value: if the script stops responding, the cluster will be unresponsive until the timeout value is exceeded.
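One way to keep an entry point well within the pending timeout is to give every wait loop an explicit time budget instead of looping until a condition becomes true. The following Python sketch shows the pattern only; a real Generic Script resource would need the same logic ported to its own scripting language, and the function and parameter names here are hypothetical.

```python
import time


def wait_until(condition, budget_s, poll_interval_s=0.5):
    """Poll condition() until it returns True or budget_s seconds have elapsed.

    Returns True if the condition was met, False if the budget ran out.
    The loop can never run longer than roughly budget_s, so it cannot
    become the infinite loop that blocks the resource.
    """
    deadline = time.monotonic() + budget_s
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_s, remaining))
```

The budget would be chosen comfortably below the resource's pending timeout, so that the entry point can report a failure itself rather than being cut off.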

Recover and Restart the Cluster Service

  1. Disable the resource (in this example, named MyScript) by typing the following command:

    cluster resource "MyScript" /properties PersistentState=0

  2. Stop the Cluster service on the node that currently owns this resource's group by typing the following command in a console window:

    net stop clussvc

  3. Fix any problem that you identify in the script that causes it to stop responding, loop, or exceed the pending timeout value. You may decide that increasing the pending timeout value is appropriate, but carefully consider the implications before you do so.
  4. Restart the Cluster service by typing the following command:

    net start clussvc

  5. Bring the resource back online manually by using Cluster Administrator or Cluster.exe. To use Cluster.exe, type the following command:

    cluster resource "MyScript" /online

    Note that bringing the resource back online automatically sets PersistentState to 1, so there is no need for an additional command to change the value from 0.
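For administrators who keep runbooks, the command sequence above can be generated for any resource name. The following Python sketch only builds the command strings shown in the steps; it does not run them, and the helper name is hypothetical. The commands are split into two groups because fixing the script (step 3) is a manual task that falls between them.

```python
def recovery_commands(resource_name):
    """Return (before_fix, after_fix) command lists for the steps in this article."""
    quoted = f'"{resource_name}"'
    before_fix = [
        # Step 1: prevent the resource from coming online.
        f"cluster resource {quoted} /properties PersistentState=0",
        # Step 2: run on the node that currently owns the resource's group.
        "net stop clussvc",
    ]
    after_fix = [
        # Step 4: restart the Cluster service once the script is fixed.
        "net start clussvc",
        # Step 5: bring the resource online.
        f"cluster resource {quoted} /online",
    ]
    return before_fix, after_fix


if __name__ == "__main__":
    before, after = recovery_commands("MyScript")
    for command in before + ["(fix the script here)"] + after:
        print(command)
```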

STATUS

Microsoft has confirmed that this is a bug in the Microsoft products that are listed at the beginning of this article.

Modification Type: Major
Last Reviewed: 12/19/2003
Keywords:kbBug KB811685