INF: Troubleshooting SQL Cluster Wizard Failures (254593)



The information in this article applies to:

  • Microsoft SQL Server, Enterprise Edition 6.5
  • Microsoft SQL Server, Enterprise Edition 7.0

This article was previously published under Q254593

SUMMARY

This article describes the actions that the SQL Server Cluster Wizard performs and the order in which it performs them. For each step, it describes the problems that can cause the wizard to fail and lists possible resolutions. Specific problem scenarios and their resolutions follow the steps.

Note All SQL Server 6.5 and 7.0 Cluster customers should upgrade to SQL Server 2000 as soon as it is available. The following tools, features, and components are supported with failover clustering in SQL Server 2000 Enterprise Edition:
  • Microsoft Search service (Full Text)
  • Multiple instances
  • SQL Server Enterprise Manager
  • Service Control Manager
  • Replication
  • SQL Profiler
  • SQL Query Analyzer

MORE INFORMATION

The following steps describe using the SQL Server Cluster Wizard:
  1. First, the SQL Cluster Wizard connects to the server and verifies that all the databases and binaries are on shared disks.

    Possible Problem

    The most likely failure at this point is that the service cannot start. This is normally because the shared disk is owned by the wrong node, or because the program files and data files were not both installed to the cluster disk.

    Resolution

    To correct this problem, confirm that the shared disk is owned by the correct node before you run the cluster wizard. Also, check and make sure that both the program files and data files are installed to the cluster disk.

  2. After you enter the IP address and network name, the wizard creates a test resource with those properties and brings it online to see if any conflicts on the network occur.

    NOTE: An error message only occurs if you enter an IP address that is in use; invalid IP addresses or a bad subnet mask are not detected.

    Possible Problem

    If you have just unclustered and are re-clustering the server, you may get an error message indicating that your network name is in use. This may occur because Windows NT occasionally fails to remove the network name from the NetBIOS registration properly.

    First Resolution

    Open a command prompt window and enter the following command:

    nbtstat -RR
    Press Return. Upon completion, try using the IP address again. If the IP address still fails, move to the second resolution.

    Second Resolution

    Reboot the system.

  3. After you enter all the information, the wizard copies all the COM files that are registered in BINN to the SQL Server subdirectory of the location pointed to by the following registry key:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\SharedFilesDir

    By default, this key points to the following location:
    C:\Program Files\Common Files\Microsoft Shared\

    Possible Problem

    The SQL Server Cluster Wizard is unable to find these files or the location to which they should be copied. This problem usually occurs when something is wrong with the following registry key used by the wizard:
    HKey_Local_Machine\Software\Microsoft\SharedTools\SharedFilesD

    Note that the setup uses a different registry key (listed below), but the two should normally point to the same path:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\CommonFilesDir

    NOTE: By design, when this problem occurs, no error is displayed. This is done so that you can install without replication and still run the wizard without having it fail. The only way to identify whether this problem is occurring is to look at the debug output window, which is enabled by setting _PRINT_CONSOLE_=1 in the system environment before you run the SQL Cluster Wizard. If this step is executing correctly, you see references to the replication files, such as Replres.dll and Distrib.exe, as they are copied. If you do not see references to these files, you are encountering this problem.

    First Resolution

    Refer to Scenario 8 in the "Specific Scenarios" section.

    Second Resolution

    Refer to Scenario 3 in the "Specific Scenarios" section.
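
The check described in the note for step 3 (looking for the replication files in the debug output) can be sketched in Python as follows. This is only an illustration: it assumes the debug output has already been saved as text by some means, and because the article does not document the output format, it does nothing more than a case-insensitive substring search. The function name missing_replication_files is hypothetical.

```python
# File names taken from the note in step 3; seeing them in the debug
# output suggests that the COM file copy ran.
EXPECTED_FILES = ("Replres.dll", "Distrib.exe")


def missing_replication_files(debug_output, expected=EXPECTED_FILES):
    """Return the expected file names that never appear in the debug output.

    Assumption: debug_output is the captured text of the debug window.
    The match is a plain, case-insensitive substring search.
    """
    haystack = debug_output.lower()
    return [name for name in expected if name.lower() not in haystack]
```

An empty result means both files were mentioned. A non-empty result lists the files that were not seen, which matches the symptom described in the note.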

  4. Next, the SQL Cluster Wizard copies the same files to the same place on the other node, finding the correct path by reading the remote registry. The SQL Cluster Wizard creates a share on the remote node named cluster_tools_share, copies the files to that share, and then deletes it.

    Possible Problem

    Problems at this point occur only if registry key problems exist on the other node or if the wizard cannot create the share.

    First Resolution

    Refer to Scenario 8 in the "Specific Scenarios" section.

    Second Resolution

    Refer to Scenario 3 in the "Specific Scenarios" section.

  5. The wizard then copies the cluster specific files to the \System32 directory of both nodes.

    Possible Problem

    Normally, this step completes successfully. The SQL Cluster Wizard copies the files from the CD or network share, so it is possible that it may lose the connection to the share or be unable to create the cluster_tools_share because it already exists.

    Resolution

    Refer to Scenario 3 in the "Specific Scenarios" section.

  6. The wizard runs the "secnode" setup, which installs the necessary system files to the remote node and registers all the COM files that have been copied to the C:\Program files\Common files\Microsoft shared directory.

    Possible Problems

    One of the most common problems at this point occurs if the setup is run from a share point on the first node (connected through net use where you specify a user and a password). When this happens, by default, node2 does not have access to the share so that when secnode runs it fails to connect back to the install location to copy the files. When this occurs, you receive a message indicating that setup could not be run on the remote computer.

    Another problem may occur if you install from a network share when the path has a space in the name. This causes the secnode setup to fail because it is unable to handle paths with spaces unless they are quoted. There is no way around this problem apart from renaming the share.

    First Resolution

    If you experience either of these problems, check the Sqlclstr.log file in the <%SYSROOT%> directory, or the Remsetup.log file in the second node's TEMP directory, for clues or descriptions of the problem. Correct all problems and then run the wizard again.

    Second Resolution

    Permissions problems can also prevent the SQL Cluster Wizard from working correctly when performing operations on the second node. The account under which setup runs MUST have the appropriate permissions:

    • Be a local administrator for both nodes.
    • Have the user right to "log on as a service".
    • Have the user right to "act as part of the operating system".

    These permissions MUST exist on BOTH nodes; otherwise, the remote setup fails.

    You set these permissions from the primary domain controller (PDC). After the correct permissions are set, you need to log off and then log on again for the changes to take effect. For further details, refer to Scenario 5 in the "Specific Scenarios" section.

    Possible Problem

    Secnode may also fail if it runs but has errors internally, such as not successfully registering all the COM files.

    Resolution

    Correct all problems reported in the Sqlstp.log on the second node.

  7. Next, the SQL Server Cluster Wizard rebinds all the files located in the following places:

    • The SQL BINN directory.
    • C:\Program Files\Common Files\Microsoft Shared\SQL Server
    • C:\Program Files\Common Files\Microsoft Shared\Database Replication
    This occurs on both nodes.

    The SQL Server Cluster Wizard then rebinds the following system files on both nodes:
    • Dbnmpntw.dll
    • Sqlstr.dll
    • Sqlwoa.dll
    • Sqlsrv32.dll
    • Cliconfg.dll
    • Cliconfg.exe
    The SQL Server Cluster Wizard rebinds %Sysroot%\System32\Sqlctr70.dll on the local node only.

    Possible Problem

    The rebinding process can fail only when something is using one of the files it is trying to bind. If any SQL applications, including the SQL Service Manager, are open, the following message is displayed:

    ..could not update binaries...
    For additional details, refer to:

    248380 PRB: SQL 7.0 Failover Wizard Error when Updating Binaries

    The most common problem is that some of the system files are in use.

    You can usually tell on which node the problem is occurring by the amount of time it takes for the message to display again after a retry. If the message displays instantaneously, this usually indicates that a file on the local computer is in use, but if it takes a few seconds, then the problem is probably occurring on the other node.

    Resolution

    You can usually work around this problem by stopping all offending services and making sure that you do not have any applications open. To verify which services you should have running, refer to the following articles in the Microsoft Knowledge Base:

    192708 INF: Installation Order: Cluster Server Support for SQL or MSMQ

    for SQL 6.5 Enterprise Edition

    219264 INF: Order of Installation for SQL Server 7.0 Clustering Setup

    for SQL 7.0 Enterprise Edition.


    Possible Problem

    If you are unclustering and one of the resource DLLs is in use, the resource DLL may stop responding in one of its connections to the server. This causes the resource monitor process (Resrcmon.exe) to have the dbnmpntw.dll file open even when the resource is offline.

    First Resolution

    Reboot and re-run the wizard to uninstall.

    Second Resolution

    Rename the offending DLL to Dbnmpntw.dll.copy, and then copy it back to the original name. Now the .copy file is in use but the dbnmpntw.dll file is not, so the wizard may complete without any problems.

  8. The SQL Cluster Wizard now creates the net name, IP, sqlserver, agent and vsrvsvc resources in the cluster, brings the SQL Server resource online, and changes the local server in the sysservers system table to the virtual server name.

    Possible Problem

    Creation of the resources is rarely a problem. You should see the resources being created in the group in which the disk resides. All this step does is create the resources and make dependencies between them so that they can start in the correct order.

    Bringing the resources online is the last phase of the setup. The first step of this phase is to start the MSSQLSERVER$VIRTNAME service, connect to it, and set the values in sysservers correctly. If this step fails, the whole setup fails and rolls back all the work it has done so far. This can happen when the rebinding of the Sqlsrv32.dll file (an ODBC file) does not work correctly. When this occurs, you will see error 123 or 126 in the cluster setup log (Sqlclstr.log) just after the fixsysservers call.

    If this happens:

    1. The cluster is completely broken.
    2. It is caused by the wizard only changing one of the two references to the Kernel32.dll file to reference the Vernel32.dll file instead.
    3. If you previously installed a different version of Microsoft Data Access Components (MDAC) on the computer before installing SQL, the version of the Sqlsrv32.dll file on the system is different.

    First Resolution

    Reboot both servers and, before retrying, make sure that only the minimum services are running as outlined in the following Microsoft Knowledge Base articles:

    192708 INF: Installation Order: Cluster Server Support for SQL or MSMQ

    for SQL 6.5 Enterprise Edition.

    219264 INF: Order of Installation for SQL Server 7.0 Clustering Setup

    for SQL 7.0 Enterprise Edition.

    Second Resolution

    Rename the Sqlsrv32.dll file, and then reboot the computer. Before retrying, make sure that only the minimum services are running as outlined in the following Microsoft Knowledge Base articles:

    192708 INF: Installation Order: Cluster Server Support for SQL or MSMQ

    for SQL 6.5 Enterprise Edition.

    219264 INF: Order of Installation for SQL Server 7.0 Clustering Setup

    for SQL 7.0 Enterprise Edition.

    Third Resolution

    Contact SQL Product Support Services.

  9. The SQL Cluster Wizard finishes.


Specific Scenarios

Scenario 1

Problem

SQL Cluster Wizard fails with this log entry:

@ CopyFileIfNeeded: [D:\EnterpriseEdition\x86\CLUSTER\SQAGTRES.DLL] => [C:\WINNT\System32\SQAGTRES.DLL]
@@@ CopyFileIfNeeded: [D:\EnterpriseEdition\x86\CLUSTER\SQAGTRES.DLL] => [\\LNXDAYCC02\admin$\system32\SQAGTRES.DLL]
~~~ XXX InstallRemote failed
[reghelp.cpp:34] : 2 (0x2): The system cannot find the file specified.
					
Resolution

Verify that you can make a \\server_name\admin$ connection from both nodes in the cluster.

Make sure you check this if any network interface card (NIC) settings have been changed or if network cards have been replaced.

WARNING: If you disable File and Print Sharing for Microsoft Networks under the Network Connection properties on Windows 2000 computers, you cannot connect to the Administrative shares. Attempts to access the Administrative shares cause an Error 53 message.

Scenario 2

Problem

The SQL Cluster Wizard fails with the following generic message and there is not a reference to a specific file:

File already exists.

Resolution

Verify that the SQL group name is in all capital letters. If it is not, the wizard tries to create a new group but is unable to do so. If the name is not all uppercase, rename the group to a temporary name (such as x) and then rename it to the correct name in all uppercase.

NOTE: This applies to renamed groups only. The default names like "Disk Group 1" have their resources moved to the new group if required by SQL.

Scenario 3

Problem

The Sqlclstr.log file shows the following:

~~~ ClusterResourceStart... tick=2, state=2
[validate.cpp:147] DeleteTestGroup:OpenClusterResource: 5007 (0x138f): The cluster resource could not be found. 
~~~ XXX Copy Files failed
[reghelp.cpp:34] : 2 (0x2): The system cannot find the file specified.
					
Resolution

Check the net shares on each node and look for the following:

  • \\cluster_tools_share
  • \\cluster_setup_share

If either is found, delete them.
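
The check for leftover shares can be sketched in Python as shown below. This is a minimal sketch under stated assumptions: it expects the text printed by the net share command to be passed in as a string, and it assumes that the share name is the first whitespace-separated token on each line. The function name find_leftover_shares is hypothetical.

```python
# Share names listed in Scenario 3.
LEFTOVER_SHARES = {"cluster_tools_share", "cluster_setup_share"}


def find_leftover_shares(net_share_output):
    """Return the leftover wizard shares named in `net share` output text.

    Assumption: the share name is the first token on its line.
    Names are compared case-insensitively and reported in lowercase.
    """
    found = []
    for line in net_share_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        name = tokens[0].lower()
        if name in LEFTOVER_SHARES and name not in found:
            found.append(name)
    return found
```

Run the check against the output from each node. Any share that is reported can then be removed, for example with net share followed by the share name and /delete.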

Scenario 4

Problem

When you try to re-cluster SQL after installing SQL service pack 1, the install fails with the following error in the Sqlcluster.log file:

Looking at disk P:
Disk P is fixed in group SQL_Disk
Looking at disk Q:
Disk Q is used by SQL but is moveable
Looking at disk R:
Error: Resource groups SQL_Disk and Disk_R both contain SQL disks
[chkconf.cpp:1416] : 160 (0xa0): The argument string passed to DosExecPgm is not correct. 
[chkconf.cpp:1482] ClusterFindVirtualSQLSrvGroup: 160 (0xa0): The argument string passed to DosExecPgm is not correct.
					

The "P" drive is the drive to which SQL was installed and the installer thought it was the only drive in use. Actually, the P, Q and R drives are being used.

Resolution

Check the SQL error logs and sysdevices system table and make sure that all drives being used by SQL are in the SQL group SQL is using.

NOTE: If additional cluster disk resources are added to the cluster for use by SQL, or if other disks currently used in the cluster are designated for use by the clustered SQL server, they should be added as dependencies of the SQL Server.
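
To see at a glance which disks and groups the wizard found, the disk check lines in the log can be summarized with a short Python sketch such as the one below. It is written only against the lines quoted in this scenario, so the patterns are assumptions about the log format rather than a documented grammar, and group names that contain spaces would not be matched. The function name summarize_disk_check is hypothetical.

```python
import re

# Patterns modeled on the log excerpt quoted in Scenario 4.
DISK_LINE = re.compile(
    r"^Disk (\w) is (fixed in group (\S+)|used by SQL but is moveable)"
)
CONFLICT_LINE = re.compile(
    r"^Error: Resource groups (\S+) and (\S+) both contain SQL disks"
)


def summarize_disk_check(log_text):
    """Return (disks, conflicts) parsed from the disk check log lines.

    disks maps a drive letter to its group name, or to "moveable".
    conflicts is a list of (group_a, group_b) pairs from error lines.
    """
    disks = {}
    conflicts = []
    for raw in log_text.splitlines():
        line = raw.strip()
        match = DISK_LINE.match(line)
        if match:
            disks[match.group(1)] = match.group(3) or "moveable"
            continue
        match = CONFLICT_LINE.match(line)
        if match:
            conflicts.append((match.group(1), match.group(2)))
    return disks, conflicts
```

A non-empty conflicts list names the two groups that both hold SQL disks, which are the groups to consolidate before you run the wizard again.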

Scenario 5

Problem

Setup is unable to update the remote node or errors occur when connecting to all the default databases during the initial setup.

For example:

#### SQL Server Remote Setup - Start Time 10/28/99 13:14:22 ####
Script file copied to '\\server8\ADMIN$\secnode.iss' successfully.
Installing remote service...
Running '\\node1\F$\ENGLISH\X86\setup\setupsql.exe SecNode=1 -s -f1 \\node2\ADMIN$\secnode.iss'...
Remote process exit code was '-1'.
\\node2\Admin$\sqlsp.log
Disconnecting from remote machine...
Service removed successfully.
Remote files removed successfully.
#### SQL Server Remote Setup - Stop Time 10/28/99 13:15:08 ####

Resolution

Be sure the service account is set up with all the correct permissions. By copying the existing Administrator account, you can make sure that the group memberships and many other properties are copied to the new account. When a user account is copied, the description, group memberships, logon hours, logon workstations, and account information are copied exactly. The user name, full name, and password boxes of the new account are blank and must be entered. The User Cannot Change Password and Password Never Expires check boxes are copied.

NOTE: When copying an account that is a member of the Administrators local group, the User Cannot Change Password setting is not copied. Usually, the User Must Change Password At Next Logon check box is selected, regardless of its setting in the original account; however, this check box should be clear. Also, the Password Never Expires check box should be selected. After all the entries are complete, click Add.

Now, from the User Manager menu, select Policies\User Rights, select to show Advanced User Rights, and then grant the following rights to the new user:

  • Act as part of the operating system.
  • Log on as a service.
  • Log on locally.

Next, log on to both nodes with the newly created account and perform basic connectivity and rights testing:

  • To verify remote procedure call (RPC) connectivity, try to log on remotely from each node to the other with either Perfmon, Regedt32 or Srvmgr.

  • To verify NetBIOS, try issuing a net view \\machine_name and net use \\machine_name\admin$

  • To verify RDR and SRV without NBT, and to verify IP connectivity, try issuing a net view \\IP_address

  • Try using a telnet or FTP session to test for transport functionality.
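
The net view and net use checks in the list above can be collected into a small Python helper so that the same commands are issued on each node. This is only a sketch: it builds the command strings and does not run them, and the function name connectivity_commands and its parameters are hypothetical.

```python
def connectivity_commands(machine_name, ip_address):
    """Build the net view / net use checks from Scenario 5 as strings.

    The commands are returned, not executed, so they can be reviewed
    and then run from a command prompt on each node.
    """
    return [
        rf"net view \\{machine_name}",         # NetBIOS name resolution
        rf"net use \\{machine_name}\admin$",   # administrative share access
        rf"net view \\{ip_address}",           # IP connectivity
    ]
```

For example, calling connectivity_commands("node2", "10.0.0.2") on the first node returns the three commands to try against the second node. The machine name and address here are placeholders.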

Scenario 6

Problem

The SQL 6.5 Cluster Wizard fails and the last line of cluster wizard log states:

Start SQL Server cConnectString="ODBC;DSN='';DRIVER={SQL Server};SERVER=CLIO;DATABASE=master;UID=sa;PWD="
					

Resolution

First, verify that running SELECT @@servername does not return NULL. If it does, the sysservers system table does not have an entry for the local server name. Correct this and continue.

If you were able to verify @@servername, you should reload the ODBC drivers and then run the SQL Cluster Wizard again. To reload the ODBC drivers, run the setup program from the SQL Server 6.5 Extended Edition compact disk in either the \I386\Odbc directory for Intel based computers or the \Alpha\Odbc directory for Alpha based computers.

Scenario 7

Problem

Every time the Clustwiz.exe file runs, a Dr. Watson message appears pointing to the Cpqmgmt.dbg file.

Resolution

All the following Microsoft Knowledge Base references indicate that this problem is related to the Compaq Insight Manager. Apply the latest Compaq SoftPak (in most cases SSD 2.12a) and stop all possible conflicting services as outlined in the following Microsoft Knowledge Base articles:

192708 INF: Installation Order: Cluster Server Support for SQL or MSMQ

219264 INF: Order of Installation for SQL Server 7.0 Clustering Setup

Scenario 8

Problem

The following registry entry is incorrect:
Hkey_Local_Machine\Software\Microsoft\Windows\CurrentVer\CommonFilesDir

Resolution

Correct the path if it is wrong.

Scenario 9

Problem

You are unable to uncluster SQL using the SQL Cluster Failover Wizard.

Resolution

When the SQL Cluster Failover Wizard is run, the SQL cluster resources are created. By default, these resources have the following naming structure:

   <Virtual_SQL_Server_Name> IP Address
   <Virtual_SQL_Server_Name> Network Name
   <Virtual_SQL_Server_Name> SQL Server 7.0
   <Virtual_SQL_Server_Name> VServer
   <Virtual_SQL_Server_Name> SQL Server Agent 7.0
					
For example, if the Virtual_SQL_Server_Name is xyz, the SQL resources are, by default, named as:
   xyz IP Address
   xyz Network Name
   xyz SQL Server 7.0
   xyz VServer
   xyz SQL Server Agent 7.0
					
If all or some of these resources are then modified to:
   IP Address
   Network Name
   SQL Server
   Virtual Server
   SQL Agent
					
this can cause the SQL Cluster Failover Wizard to fail or hang when used. To resolve this, rename the resources back to the default names.
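
The default naming structure above can be expressed as a short Python sketch that lists the expected names for a virtual server and reports which ones are missing from the current resource names. It is illustrative only: the list of current names has to be gathered by hand or by some other means, and the function names are hypothetical.

```python
# Suffixes from the default naming structure listed in Scenario 9.
DEFAULT_SUFFIXES = (
    "IP Address",
    "Network Name",
    "SQL Server 7.0",
    "VServer",
    "SQL Server Agent 7.0",
)


def default_resource_names(virtual_server_name):
    """Return the default resource names for the given virtual server."""
    return [f"{virtual_server_name} {suffix}" for suffix in DEFAULT_SUFFIXES]


def renamed_resources(virtual_server_name, current_names):
    """Return the default names that are absent from current_names.

    The comparison ignores case and surrounding whitespace.
    """
    current = {name.strip().lower() for name in current_names}
    return [
        name
        for name in default_resource_names(virtual_server_name)
        if name.lower() not in current
    ]
```

Any name that renamed_resources returns is a default name that no longer exists in the group, so the corresponding resource is a candidate for renaming back.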

Scenario 10

Problem

The SQLCLUST.LOG shows the following:

~~~ OnEnableCluster: UpdateSku
~~~ OnEnableCluster: TransferSQLServices
+++ TransferSQLServices: enter
+++ TransferSQLServices: calling AddVSNameLanManServer
[reghelp.h:132] type not REG_MULTI_SZ: 160 (0xa0): The argument string passed to DosExecPgm is not correct.

[reghelp.h:133] : 160 (0xa0): The argument string passed to DosExecPgm is not correct.

[reghelp.h:290] : 160 (0xa0): The argument string passed to DosExecPgm is not correct.

[clenable.cpp:1803] : 160 (0xa0): The argument string passed to DosExecPgm is not correct.

[clenable.cpp:1836] : 160 (0xa0): The argument string passed to DosExecPgm is not correct.

[clenable.cpp:2379] : 160 (0xa0): The argument string passed to DosExecPgm is not correct.

~~~ XXX TransferSQLServices failed
					
Resolution

Verify that the type value of the following registry key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\NullSessionPipes
					
is REG_MULTI_SZ.

The actual failure is in RegQueryValue_MULTI_SZ(). It fails because the type of the key is not REG_MULTI_SZ.

If the type of the key is not REG_MULTI_SZ, you will need to copy the contents from the key, delete and re-create the key with the same name and correct type value, and then replace the contents.
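
On a system where Python is available, the type check can be sketched with the standard winreg module as shown below. This is an assumption-laden illustration and not part of the wizard: the function names are hypothetical, and winreg exists only on Windows, so the registry read is kept separate from the pure comparison.

```python
# Standard registry value type codes (REG_MULTI_SZ is 7).
REG_TYPE_NAMES = {
    0: "REG_NONE",
    1: "REG_SZ",
    2: "REG_EXPAND_SZ",
    3: "REG_BINARY",
    4: "REG_DWORD",
    7: "REG_MULTI_SZ",
}
REG_MULTI_SZ = 7


def check_value_type(type_code):
    """Return (is_multi_sz, type_name) for a registry value type code."""
    name = REG_TYPE_NAMES.get(type_code, f"type {type_code}")
    return type_code == REG_MULTI_SZ, name


def read_null_session_pipes_type():
    """Read the type code of NullSessionPipes (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere

    path = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        _value, type_code = winreg.QueryValueEx(key, "NullSessionPipes")
    return type_code
```

Calling check_value_type(read_null_session_pipes_type()) on each node reports whether the value has the expected type and, if not, which type it currently has.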


Modification Type: Major
Last Reviewed: 8/31/2006
Keywords:kbinfo KB254593 kbAudDeveloper