SUMMARY
Infrequently Windows NT may halt with a STOP screen, commonly called a
"blue screen", or it may hard hang, where the console is completely frozen
and non-responsive. This may sometimes happen on a computer where SQL
Server is running, or may coincide with a particular SQL Server operation
such as a DUMP or LOAD, BCP, a long-running query, and so on.
The vast majority of time, this indicates an operating system, device
driver, or hardware problem and should be pursued as such. Windows NT user
or kernel mode process isolation ensures that a user mode application
problem will not cause the operating system to stop responding. This
article discusses exceptions to this and ways to determine whether to
troubleshoot the problem at the system or application layer.
Sometimes the cause of a machine hard hang or blue screen may be an
NMI (non-maskable interrupt) error. This is sometimes visible as an error
code stating NMI, parity check or I/O parity check. NMI errors are almost
always hardware. Usually they are caused by a memory failure but can
originate in other hardware subsystems such as video boards. Even if the
NMI error only happens during certain SQL Server operations, and if the
system passes initial hardware diagnostics, it should still be considered
a hardware problem and pursued as such. It may be necessary to use a
dedicated memory SIMM testing device which can often find a transient
memory error that eludes software-based diagnostics. For more information
see the Windows NT Resource Kit under the heading "Memory Problems", and
the following article in the Microsoft Knowledge Base:
101272 Memory Parity Errors: Causes and Suggestions
MORE INFORMATION
Processes exist on Windows NT in either user mode or kernel mode (sometimes
called supervisor or privileged mode). In the Intel i386 architecture, user
mode maps to ring 3 and kernel mode to ring 0 of the 4-ring protection
system. The i386 architecture has been carried forward with little change
in all Intel and compatible processors to date, including the Pentium Pro
and Pentium II. RISC processors such as the Alpha AXP likewise typically
have unprivileged and privileged modes.
Kernel mode is a privileged processor mode in which a thread has access to
system-wide memory (including that of all user-mode processes) and to
hardware. By contrast, user mode is a nonprivileged processor mode in which
a thread can only access system resources by calling system services.
A user mode process cannot access kernel mode memory, nor can it access
memory of another user mode process. This is enforced by processor
hardware, in conjunction with kernel mode data structures such as Page
Tables. For information on this see the 80386 Programmer's Reference
Manual, the 80386 System Software Writer's Guide, or equivalent Alpha AXP
documentation.
As a result of this protection system, a user mode application generally
cannot stop responding, cause a blue screen, or otherwise cause a failure
in the Windows NT operating system. Such problems should be primarily
pursued at the system layer as an operating system, device driver, or
hardware issue.
While an application error cannot cause a failure in the operating system,
an operating system error can cause an application to stop responding. This
is because of the general rule: applications must call inward (to kernel
mode), but the operating system can reference outward to user mode freely
at any time. A microkernel-influenced architecture like Windows NT may in
turn dispatch certain work to a user-mode system process rather than
perform the work in kernel mode. However, the overall principal remains the
same: processor hardware enforces process context isolation, which prevents
one process from causing a failure in another, whether one or both are in
user mode.
If a user mode application passes an invalid parameter in a Win32 API call,
it is the operating system's responsibility to validate this parameter. In
very rare cases, passing an invalid parameter may cause a Windows NT blue
screen error. However, this is an operating system issue, and should be
debugged and pursued as such.
There are a few narrow exceptions to the above guidelines. These exceptions
can be easily and quickly eliminated:
A Winlogon Problem Caused by SQL Extensible Performance Counters
Current Windows NT architecture stipulates that any extensible performance
counters added by a service will run in the process context of the Windows
NT Winlogon process. Because Winlogon is a vital component of the operating
system, a bug or resource leak in any performance counter DLL may disrupt
Winlogon, and hence the operating system. The SQL Server extensible
performance counter DLL is called Sqlctr60.dll, and it exports several SQL
Server-specific objects from SQL Server to the operating system. You can
use Performance Monitor to monitor these objects. For more details on
extensible performance counters, see volume 4 of the Windows NT Resource
Kit, titled "Optimizing Windows NT," by Russ Blake.
Although it is very rare for Sqlctr60.dll to cause a Winlogon problem, you
can expedite problem identification when pursuing a Windows NT failure or
blue screen problem on a computer running SQL Server by renaming this DLL.
Doing this eliminates use of SQL Server performance counters, but you can
still use Performance Monitor to monitor SQL Server by using regular
Windows NT performance counters (such as threads, process, memory, and so
on).
If renaming Sqlctr60.dll fixes the problem, and if this is confirmed by
reinstating and removing the DLL several times, the problem should be
pursued as a SQL Server issue. Otherwise, it should be pursued as a system-
layer issue.
A Resource Leak
If a resource leak continues for a period of time, the operating system
should return the appropriate return code to the application, which should
log this. For example, if you receive operating system error 8 "not enough
storage," the operating system should handle the situation gracefully by
not granting further resource requests. However a continued application
resource leak may not be handled gracefully by the operating system under
all conditions, resulting in a blue screen or operating system or
application failure.
Almost all resource leaks will manifest themselves as a gradual increase in
consumption of some resource, such as handles, virtual memory, private
bytes, and so on. Therefore, the easiest way to rule in or out a resource
leak is to run Performance Monitor and log all objects to a file. When the
problem occurs, examine the logged performance data for signs of a leak.
Some good counters to examine are: handle count, page file bytes, pool
paged bytes, pool nonpaged bytes, private bytes, thread count, virtual
bytes, and working set for each process running on the computer.
It is not necessary to classify certain values as normal or abnormal. Focus
on identifying leaks by the continuous nature of the increase, not by the
absolute value at a given time. Remember it's normal for the perfmon
"private bytes" counter for SQL Server to start well below the configured
sp_conigure "memory" value, then increase with activity until it roughly
approaches, but doesn't significantly exceed, that value.
If one of the logged Performance Monitor counters continuously increases
for the Sqlservr.exe process, and if reaching a certain value repeatedly
coincides with a Windows NT blue screen or operating system failure, it
should be pursued temporarily as a SQL Server issue until the cause of the
continuous SQL Server resource leak is understood. Otherwise, it should be
pursued as a system layer problem.
CPU Monopolization
If a process spawns high priority threads that are continuously in a
runnable state, this process can dominate the computer and prevent the
operating system from running. A properly configured SQL Server will not
cause this problem. However if SQL Server "SMP concurrency" is set to -1,
and if a number of long-duration CPU intensive queries is run equal to or
greater than the number of processors, under some conditions the operating
system may appear to stop responding. In general, you should leave SQL
Server the "SMP concurrency" option at the default setting of 0. Likewise,
leave the "priority boost" setting at the default. For more information,
see the following articles in the Microsoft Knowledge Base:
111405 SQL Server and Windows NT Thread Scheduling
166967 Proper SQL Server Configuration Settings
When pursuing an operating system hang problem, first ensure that the SQL
Server configuration settings mentioned above are at their default values.
After making this check, if the operating system or application failure
recurs, it should be pursued as a system layer problem.