INF: Understanding and Resolving SQL Server 6.x Blocking Problems (162361)

MORE INFORMATION

OVERVIEW

To achieve consistent results during concurrent update activity, any database must impose isolation between transactions. Without isolation, inconsistent results may occur under concurrent use. Most database products use locks to impose isolation. There are isolation techniques other than locks, such as versioning (also called time domain addressing), but each technique has its own cost and overhead. For details, see "Transaction Processing: Concepts and Techniques," by Jim Gray and Andreas Reuter, ISBN 1-55860-190-2. There is no "free lunch" in implementing isolation. Locking is well understood and highly refined in current products, so this article focuses on locking. All parts of the transaction must be protected with locks, or else a ROLLBACK would not be reliable. Some of the actions needing locking protection may not be obvious. These include locks on system catalog tables, indexes, and allocation structures such as extents.

Locks are automatically acquired by the database server in response to certain types of queries. Locks are not typically acquired under manual programmatic control (although the current product allows this through optimizer hints).

Locks are not generally intended for use under manual control to implement pessimistic concurrency. For example, one connection should not manually acquire a lock in order to indicate to other connections that the page or row is "in use." Rather, the application should use an "in use" column as a flag to other connections. Alternatively, the application could use a cursor under optimistic concurrency control, which would then signal the application if another connection changed the data during the interval the first connection was browsing the data. For more details on implementing cursors, see the SQL Server 6.5 documentation.
For best scalability, performance, and concurrency, application and query design should emphasize keeping the transaction path length short and holding locks as briefly as possible. The foundation of most concurrency problems is laid when the application and database are designed. For this reason, it is critical that these issues be well understood at design time. Otherwise, a hidden performance limitation may be unintentionally engineered into the application, and this may not appear until full-scale stress testing. Stress testing should always be done at the projected full user load, to ensure that concurrency at this load factor meets your requirements. Failure to do stress testing may result in concurrency, blocking, or performance problems appearing late in the design cycle (or worse, after application deployment). Problems discovered at this stage may be very costly to correct.

Different RDBMSs may handle locking and concurrency differently. If the application is being ported from one RDBMS to another, you cannot expect the two databases to behave identically. The application may require adjustments, especially if the application takes advantage of a feature or characteristic unique to one RDBMS. This is so even if the application uses only standard ANSI SQL, because locking and concurrency control issues are implementation-specific.

The RDBMS designers must balance a tradeoff between locking granularity and overhead. Fine-grain locks at the row or column level can allow more concurrency, but they also entail greater overhead. This relationship between locking granularity and overhead has been well understood for many years in the academic community. See "Effects of Locking Granularity in a Database Management System", by Daniel Ries and Michael Stonebraker from ACM Transactions on Database Systems, September 1977, and "Locking Granularity Revisited", same authors and publication, June 1979. Microsoft SQL Server locks at the table, page, or (with version 6.5) at the row level for INSERTs. Future versions of SQL Server will have more extensive row level locking, but this will not prevent blocking problems.

No matter how fine-grained the locks taken, if the application does not exercise good discipline regarding transaction path length and query management, blocking problems may occur. This is similar to a slow moving automobile causing a backup by driving in the left lane of a freeway. The driver may think that sufficient lanes should be made available for him or her to drive any way he or she wants. However, no feasible number of highway lanes will allow drivers to disregard good lane discipline without causing a traffic slowdown. Likewise, no RDBMS isolation technique can allow applications to disregard the impact they have on concurrency and scalability.

INVESTIGATING A BLOCKING PROBLEM

Identify the SPID at the head of the blocking chain.

Most blocking problems happen because a single process holds locks for an extended period of time. This usually causes a chain of blocked processes, similar to a slow-moving automobile that causes a backup on the freeway. Identify the head of the blocking chain by using the SQL Enterprise Manager command Server/CurrentActivity and observing the "Object Locks" tab.

Alternatively, you can use the following query, which should return one row for each SPID at the head of a blocking chain, plus the query the blocking spid is running, and the types of blocking locks it holds. We suggest you do not run multiple concurrent instances of this query unless you are on SQL Server versions later than version 6.5 Service Pack 3. This query is only an example; you may want to modify it or use your own query.

/* Query to find spids at head of a blocking chain, their input buffers, */ 
/* and the type of blocking locks they hold */ 

declare @blocker_spid smallint
declare @i_buff_string char(30)
set nocount on

/* Get all blocked spids */ 

select spid, blocked, hostname=substring (hostname, 1, 10),
progname=substring(program_name, 1, 10), cmd=substring(cmd, 1, 10),
 status, physical_io, waittype
into #blk from master..sysprocesses (nolock) where blocked != 0

/* delete all blocking spids except the 1st in each blocking chain */ 

delete from #blk
where blocked in (select spid from #blk)

/* get each spid from sysprocesses which is referenced in */ 
/* the "blocked" column of #blk. This should be the head */ 
/* of each blocking chain */ 

select "Blocking spid" = spid, loginame=substring(suser_name(suid),1,10),
   hostname=substring (hostname, 1, 10), progname=substring(program_name, 1,10),
   cmd=substring(cmd, 1, 10), status, physical_io, waittype
from master..sysprocesses (nolock)
where spid in
  (select blocked from #blk)

/* For each spid at the head of a blocking chain */ 
/* print its input buffer to show what query it's running */ 

declare blk_cursor CURSOR FOR SELECT blocked from #blk
open blk_cursor
fetch next from blk_cursor into @blocker_spid
while (@@fetch_status <> -1)
  begin
  select @i_buff_string = ("dbcc inputbuffer (" +
convert(char(6),@blocker_spid) +")")
  select "Below is input buffer for this blocking spid: ", @blocker_spid
  select ""
  exec (@i_buff_string)
  fetch next from blk_cursor into @blocker_spid
  end
deallocate blk_cursor

/* For each spid at the head of a blocking chain */ 
/* print the type of blocking locks it holds */ 

select spid, syslocks.type, locktype=name, table_id=id, page, dbid
from syslocks, master.dbo.spt_values v
where syslocks.type=v.number
and v.type='L'
and (syslocks.type & 256)=256
and spid in (select blocked from #blk)
order by spid

drop table #blk

Find the query the blocking SPID is running.

You can do this by running the above query, or by doing DBCC INPUTBUFFER (spid), where spid is the blocking SPID. or by using the Server/CurrentActivity function in SQL Enterprise Manager and double-clicking the SPID to show the input buffer. Save this information for future reference.
Find the type of locks the blocking SPID is holding.

You can do this by running the above query, or by either running sp_lock or querying master..syslocks. Save this information for future reference.

Alternatively, you can use the Server/CurrentActivity function in SQL Enterprise Manager. However, it is sometimes necessary to use queries instead of Enterprise Manager, because some types of tempdb blocking problems may prevent you from running queries that use temp table operations. Using direct queries gives you the control necessary to avoid this problem. An example of a query that results in temp table operations (and therefore may not be usable in certain blocking situations) is sp_lock, which does an ORDER BY.
Find the transaction nesting level and process status of the blocking SPID.

This is essentially the same number as @@TRANCOUNT, but it can be determined from outside the SPID by using the DBCC PSS command. Save this information for future reference. The following shows an example of the syntax:
```
      dbcc traceon(3604) /* return subsequent DBCC output to client rather
         than errorlog */ 
      go
      SELECT SUID FROM SYSPROCESSES WHERE SPID=<blocking SPID number>
      go
      DBCC PSS (suid, spid, 0) /* where suid is from above, and spid is the
      blocking SPID number */ 
      go

						
```
In the returned information, note pxcb->xcb_xactcnt=n, where n is the @@TRANCOUNT value for the SPID. This shows the transaction nesting level for the blocking SPID, which in turn can explain why it is holding locks. For example, if the value is greater than zero, the SPID is in the midst of a transaction (in which case it is normal that it retain any exclusive locks it acquired). Note also pstat=n, where n indicates the internal status of the SPID. This can indicate why the blocking SPID may itself be waiting for certain events.

You can also check to see if any long-term open transaction exists in the database by using DBCC OPENTRAN(database_name).

By examining these four pieces of information, you can usually determine the cause of most blocking problems. The next section of this article is a discussion of how to use this information to identify and resolve some common blocking scenarios.

IDENTIFYING AND RESOLVING COMMON BLOCKING SCENARIOS

I. Blocking Caused by a Normally Running Query with a Long Execution Time

A long-running query can block other queries. For example, a bulk DELETE or UPDATE can acquire many locks that (whether they escalate to a table lock or not) block other queries. For this reason, you generally do not want to intermix long-running decision support queries and OLTP queries on the same database. You can identify this situation by observing the blocking SPID. The INPUTBUFFER may point to a query known to have a long execution time. This will often cause a steady I/O consumption, visible by running the following query several times in succession:

   SELECT SPID, PHYSICAL_IO FROM MASTER..SYSPROCESSES

The PHYSICAL_IO value does not always reflect all the work done by a SPID, because work done on its behalf by other processes (like the read ahead manager) is not charged back to the SPID. However, it is often an approximate indication of I/O activity. If the PHYSICAL_IO value continuously increases, inspect the query running, and decide whether it should be running at the time. Run it in isolation on a quiescent computer, and monitor the I/O activity with SET STATISTICS IO ON. If the query consumes a large amount of I/O resources, it may cause blocking when run on a busy system. The solution is to look for ways to optimize the query, by changing indexes, breaking a large, complex query into simpler queries, or running the query during off hours or on a separate computer.

This class of blocking problem may just be a performance problem, and may require you to pursue it as such. For more information see the following article in the Microsoft Knowledge Base:

110352 : Optimizing Microsoft SQL Server Performance

One reason queries can be long-running and hence cause blocking is if they inappropriately use cursors. Cursors can be a convenient method for navigating through a result set, but using them may be slower than set- oriented queries. For more details, see "Microsoft SQL Server 6.5 Unleashed", by David Solomon, Ray Rankins, et al, ISBN 0-672-30956-4.

II. Blocking Caused by a Sleeping SPID That Has Lost Track of the Transaction Nesting Level

This type of blocking can often be identified by a SPID that is sleeping or awaiting a command, yet whose @@TRANCOUNT is greater than zero. This can happen if the application issues the DB-Library call dbcancel() or the ODBC call sqlcancel() without also issuing the required number of ROLLBACK and COMMIT statements. Issuing these calls cancels the query and the batch, but does not automatically rollback or commit the transaction. This can be seen by issuing a simple query from ISQL/w, such as BEGIN TRAN SELECT * FROM MASTER..SYSMESSAGES and clicking the red Cancel button. After the query is canceled, SELECT @@TRANCOUNT indicates that the transaction nesting level is one. Had this been a DELETE or an UPDATE query, or had HOLDLOCK been used on the SELECT, all the locks acquired would still be held. Applications must properly manage transaction nesting levels, or they may cause a blocking problem following the cancellation of the query. See the following article in the Microsoft Knowledge Base for more information:

117143 INFO: When and How to Use dbcancel() or sqlcancel()

NOTE: The transaction nesting level of the SPID can be observed by using DBCC PSS.

III. Blocking Caused by a SPID Whose Corresponding Client Application

Did Not Fetch All Result Rows to Completion

This problem is caused by poor application design. After sending a query to the server, all applications must immediately fetch all result rows to completion. If an application does not fetch all result rows, locks may can be left on the tables, blocking other users. If you are using an application that transparently submits SQL statements to the server, the application must fetch all result rows. If it does not (and if it cannot be configured to do so), you may be unable to resolve the blocking problem. To avoid the problem, you can restrict poorly-behaved applications to a reporting or a decision-support database.

You can often identify this problem by the following behavior:

The blocking SPID continuously has a waittype of 0x800.
The status may indicate runnable.
The cmd may be SELECT.
Transaction nesting level is zero. (If you cannot examine @@TRANCOUNT within the application, you can examine it externally by using the DBCC PSS command.

IV. Blocking Caused by a Distributed Client/Server Deadlock

Unlike a conventional deadlock, a distributed deadlock is not detectable using the RDBMS lock manager. The blocking SPID will often appear sleeping, with a waittype of 0x800 (waiting on a network I/0). A SPID in this state cannot be KILLed, as it is waiting on a return from a Windows NT Server API call. A distributed client/server deadlock may occur if the application opens more than one connection to the RDBMS and submits a query asynchronously. The following are two examples of how this can happen, and possible ways the application can avoid it.

Example A: Client/Server Distributed Deadlock with a Single Client Thread

If the client has multiple open connections (dbprocs in DB-Library terms), and a single thread of execution, the following distributed deadlock may occur. For brevity, the term dbproc refers to the client connection structure. In ODBC API terms, the closest analogy is an hdbc.

NOTE: Used a fixed font for the information below to display correctly.

 SPID1------blocked on lock------->SPID2
  /\                         (waiting on net write)        Server side
  |                          (sysprocesses.waittype==0x800)
  |                                 |
  |                                 |
  | ================================|==================================
  |     <-- single thread -->       |                      Client side
  |                                 \/ 
 dbproc1   <-------------------   dbproc2
 (waiting on dbnextrow            (effectively blocked on dbproc1, awaiting
  or SQLFetch)                     single thread of execution to run)

In the case shown above, a single client application thread has two open connections. It asynchronously submits a SQL operation on dbproc1. This means it does not wait on the call to return before proceeding. The DB- Library asynchronous call is dbsqlsend(). ODBC applications select asynchronous mode with SQLSetStmtOption() and use the SQL_ASYNC_ENABLE parameter. The application then submits another SQL operation on dbproc2, and awaits the results to start processing the returned data. When data starts coming back, (whichever dbproc first responds), it processes to completion all the data returned on that dbproc. Assume this is dbproc1. It fetches results from dbproc1 until SPID1 gets blocked on a lock held by SPID2 (because the two queries are running asynchronously on the server). At this point, dbproc1 will wait indefinitely for more data. SPID2 is not blocked on a lock, but tries to send data to its client, dbproc2. However, dbproc2 is effectively blocked on dbproc1, awaiting the single thread of execution to run.

Example B: Client/Server Distributed Deadlock with a Thread per Connection

Even if a separate thread exists for each connection on the client, a variation of this distributed deadlock may still occur, as shown by the following:

NOTE: Use a fixed font for the information below to display correctly.

SPID1------blocked on lock-------->SPID2
  /\                         (waiting on net write)        Server side
  |                          (sysprocesses.waittype==0x800)
  |                                 |
  | INSERT                          |SELECT
  | ================================|==================================
  |     <-- thread per dbproc -->   |                      Client side
  |                                 \/ 
 dbproc1   <-----data row-------   dbproc2
 (waiting on dbnextrow           (blocked on dbproc1, waiting for it
  or SQLFetch)                    to read the row from its buffer)

This case is similar to Example A, except dbproc2 and SPID2 are running a SELECT with the intention of performing row-at-a-time processing and handing each row through a buffer to dbproc1 for an INSERT in the same table. Eventually, SPID1 becomes blocked on a lock held by SPID2. SPID2 then writes a result row to the client dbproc2. Dbproc2 then tries to pass the row in a buffer to dbproc1, but finds dbproc1 has not yet fetched the last row from the buffer (because it is blocked waiting on SPID1, which is blocked on SPID2).

Both examples A and B are fundamental issues that application developers must be aware of. They must code applications to handle these cases appropriately. Two reliable solutions are to use either a query timeout or bound connections.

Query Timeout: When a query timeout has been provided, if the distributed deadlock occurs, it will be broken when then timeout happens. See the DB- Library or ODBC documentation for more information on using a query timeout.

Bound Connections: This is a feature new to SQL Server 6.5. It allows a client having multiple connections to bind them into a single transaction space, so the connections don't block each other. See the SQL Server 6.5 documentation under "bound connections" for more information.

V. Blocking Caused by a SPID That Is in a "Golden," or Rollback, State

A data modification query that is KILLed, or canceled outside of a user- defined transaction will be rolled back. This can also occur as a side effect of the client computer restarting and its network session disconnecting. Likewise, a query selected as the deadlock victim will be rolled back. A data modification query often cannot be rolled back any faster than the changes were initially applied. In the case of a DELETE, INSERT, or UPDATE that had been running for an hour, it could take at least an hour to roll back. If the server is shut down in the midst of this operation, the database will be in recovery mode upon restarting, and it will be inaccessible until all open transactions are processed. Startup recovery takes essentially the same amount of time per transaction as run- time recovery, and the database is inaccessible during this period. Thus, forcing the server down to fix a SPID in a rollback state will often be counterproductive. This is expected behavior, because the changes made must be completely rolled back, or transactional and physical integrity in the database would be compromised. Because this must happen, SQL Server marks the SPID in a "golden" or rollback state (which means it cannot be KILLed or selected as a deadlock victim). This can often be identified by observing the sp_who output, which may indicate the ROLLBACK command. On version 6.5 Service Pack 2 or later, a ROLLBACK status has been added to sysprocesses.status, which will also appear in sp_who output or the SQL Enterprise Manager "current activity" screen. However, the most reliable way to get this information is to inspect the DBCC PSS of the blocking SPID in question, and observing the pstat value. For example, it may be something like the following:

   pstat=0x4000, 0x800, 0x100, 0x1

Meaning of PSTAT bits:

0x4000 -- Delay KILL and ATTENTION signals if inside a critical section
0x2000 -- Process is being killed
0x800  -- Process is in backout, thus cannot be chosen as deadlock victim
0x400  -- Process has received an ATTENTION signal, and has responded by
          raising an internal exception
0x100  -- Process in the middle of a single statement xact
0x80   -- Process is involved in multi-db transaction
0x8    -- Process is currently executing a trigger
0x2    -- Process has received KILL command
0x1    -- Process has received an ATTENTION signal

The pstat value above would be a typical situation if a long-running data modification was canceled (for example, by clicking the Cancel Query button on a GUI application), and then the SPID was found to block users and yet be unkillable. This situation is normal; the transaction must be backed out. It can be identified by the bits, as noted above.

In addition to the pstat field of the PSS, the sysprocesses.waittype field can also give information about why the SPID may be waiting. The following are some common values:

0x800  -- Waiting on network I/O completion
0x8011 -- Waiting on buffer resource lock (shared) request
0x81   -- Waiting on writelog
0x0020 -- Waiting on buffer in I/O
0x0005 -- Waiting on exclusive page lock
0x13   -- Waiting on buffer resource lock (exclusive) request
0x8001 -- Waiting on exclusive table lock
0x8007 -- Waiting on update page lock
0x8005 -- Waiting on exclusive page lock
0x8003 -- Waiting on exclusive intent lock
0x6    -- Waiting on shared page lock
0x8006 -- Waiting on shared page lock
0x23   -- Waiting on buffer being dumped
0x5    -- Waiting on exclusive page lock
0x0013 -- Waiting on buffer resource lock (exclusive) request
0x0022 -- Waiting on buffer being dirtied

VI. Blocking Caused by SQL Server 6.5 Atomic SELECT INTO Behavior

By definition, SQL Server treats each statement as a separate transaction. Beginning with SQL Server version 6.5, SELECT INTO was made consistent with this standard by including the table creation and data insert phases in a single atomic operation. A side effect of this is that locks on system catalog tables are maintained for the duration of a SELECT INTO statement. This is more frequently seen in tempdb, because applications often do SELECT INTO temporary tables. Blocking caused by this action can be identified by examining the locks held by the blocking SPID. The atomic SELECT INTO behavior can be disabled with trace flag 5302. For more information, see the following article in the Microsoft Knowledge Base:

153441 FIX: SELECT INTO Locking Behavior

VII. Blocking Caused by an Orphaned Connection

If the client application traps or the client workstation is restarted, the network session to the server may not be immediately canceled under some conditions. From the server's perspective, the client still appears to be present, and any locks acquired may still be retained. For more information, see the following article in the Microsoft Knowledge Base:

137983 : How to Troubleshoot Orphaned Connections in SQL Server

APPLICATION INVOLVEMENT IN BLOCKING PROBLEMS

SQL Server is essentially a puppet of the client application. The client application has almost total control over (and responsibility for) the locks acquired on the server. While the SQL Server lock manager automatically uses locks to protect transactions, this is directly instigated by the query type sent from the client application, and the way the results are processed. Therefore, resolution of most blocking problems necessitates inspecting the client application.

Often, turnkey client applications are used against SQL Server, in addition to higher-level application development tools. These may encapsulate the DB- Library or ODBC API calls to the database in a higher abstraction level. However, from the perspective of SQL Server, there is essentially no difference between one of these higher level applications, a call-level DB- Library application, and a call-level ODBC application. SQL Server only perceives a stream of Transact-SQL queries and certain control tokens sent by each client API call. The same basic issues will cause blocking problems whether the client application is a call-level application written in C or a higher level application that encapsulates the database calls. Likewise, the solution to these problems is generally the same.

This means that no matter what level of visibility the application exposes regarding the database calls being made, a blocking problem nonetheless frequently requires both the inspection of the exact SQL statements submitted by the application and the application's exact behavior regarding query cancellation, connection management, fetching all result rows, and so on. If the development tool does not allow explicit control over connection management, query cancellation, query timeout, result fetching, and so on, blocking problems may not be resolvable. This potential should be closely examined before selecting an application development tool for SQL Server, especially for business-critical OLTP environments.

There may be a tendency to focus on server-side tuning and platform issues when facing a blocking problem. However, this does not usually lead to a resolution, and can absorb time and energy better directed at examining the client application and the queries it submits.

Several tracing utilities can be used to examine exactly what the client application is sending to the server. Each client application may have its own tracing feature, so examine the documentation for the application. In addition, server-side tracing can be done using the -T4032 trace flag and the SQL Trace utility. You can usually resolve a blocking problem without using these utilities, but they are available if needed.

APPLICATION DESIGN TECHNIQUES TO AVOID A BLOCKING PROBLEM

DO NOT:

Fail to immediately fetch all result rows to completion.
Design a large-scale OLTP system using an application development tool that does not allow explicit control over connections, transactions, and the SQL syntax sent to the server.
Use or design a client application that allows users to fill in edit boxes that generate a long-running query. For example, do not use or design an application that prompts the user for inputs, but allows leaving certain fields blank or entering a wildcard. This may cause the application to submit a query with an excessive running time, thereby causing a blocking problem.
Use or design an application that allows user input within a transaction.

ALWAYS:

Allow for query cancellation by means of dbcancel(), sqlcancel(), or an equivalent command.
Use a query timeout by means of dbsetltime() or SQLSetStmtOption(), to prevent a runaway query and avoid distributed deadlocks.
Immediately fetch all result rows to completion.
Keep transactions as short as possible.
Explicitly control connection management.
Stress test the application at the full projected concurrent user load.

It is vital that great care be exercised during the design and construction phase of the database and application. In particular, the resource consumption, isolation level, and transaction path length should be evaluated for each query. Each query and transaction should be as lightweight as possible. Good connection management discipline must be exercised. If this is not done, it is possible that the application may appear to have acceptable performance at low numbers of users, but the performance may degrade significantly as the number of users scales upward.

With proper application and query design, Microsoft SQL Server is capable of supporting many thousands of simultaneous users on a single server, with little blocking. The successful sites that reach this level typically use the techniques described in this article.