MORE INFORMATION
OVERVIEW
To achieve consistent results during concurrent update activity,
any database must impose isolation between transactions. Without isolation,
inconsistent results may occur under concurrent use. Most database products use
locks to impose isolation. There are isolation techniques other than locks,
such as versioning (also called time domain addressing), but each technique has
its own cost and overhead. For details, see "Transaction Processing: Concepts
and Techniques," by Jim Gray and Andreas Reuter, ISBN 1-55860-190-2. There is
no "free lunch" in implementing isolation. Locking is well understood and
highly refined in current products, so this article focuses on locking. All
parts of the transaction must be protected with locks, or else a ROLLBACK would
not be reliable. Some of the actions needing locking protection may not be
obvious. These include locks on system catalog tables, indexes, and allocation
structures such as extents.
Locks are automatically acquired by the
database server in response to certain types of queries. Locks are not
typically acquired under manual programmatic control (although the current
product allows this through optimizer hints).
Locks are not
generally intended for use under manual control to implement pessimistic
concurrency. For example, one connection should not manually acquire a lock in
order to indicate to other connections that the page or row is "in use."
Rather, the application should use an "in use" column as a flag to other
connections. Alternatively, the application could use a cursor under optimistic
concurrency control, which would then signal the application if another
connection changed the data during the interval the first connection was
browsing the data. For more details on implementing cursors, see the SQL Server
6.5 documentation.
For best scalability, performance, and concurrency,
application and query design should emphasize keeping the transaction path
length short and holding locks as briefly as possible. The foundation of most
concurrency problems is laid when the application and database are designed.
For this reason, it is critical that these issues be well understood at design
time. Otherwise, a hidden performance limitation may be unintentionally
engineered into the application, and this may not appear until full-scale
stress testing. Stress testing should always be done at the projected full user
load, to ensure that concurrency at this load factor meets your requirements.
Failure to do stress testing may result in concurrency, blocking, or
performance problems appearing late in the design cycle (or worse, after
application deployment). Problems discovered at this stage may be very costly
to correct.
Different RDBMSs may handle locking and concurrency
differently. If the application is being ported from one RDBMS to another, you
cannot expect the two databases to behave identically. The application may
require adjustments, especially if the application takes advantage of a feature
or characteristic unique to one RDBMS. This is so even if the application uses
only standard ANSI SQL, because locking and concurrency control issues are
implementation-specific.
The RDBMS designers must balance a tradeoff
between locking granularity and overhead. Fine-grain locks at the row or column
level can allow more concurrency, but they also entail greater overhead. This
relationship between locking granularity and overhead has been well understood
for many years in the academic community. See "Effects of Locking Granularity
in a Database Management System", by Daniel Ries and Michael Stonebraker from
ACM Transactions on Database Systems, September 1977, and "Locking Granularity
Revisited", same authors and publication, June 1979. Microsoft SQL Server locks
at the table, page, or (with version 6.5) at the row level for INSERTs. Future
versions of SQL Server will have more extensive row level locking, but this
will not prevent blocking problems.
No matter how fine-grained the
locks taken, if the application does not exercise good discipline regarding
transaction path length and query management, blocking problems may occur. This
is similar to a slow moving automobile causing a backup by driving in the left
lane of a freeway. The driver may think that sufficient lanes should be made
available for him or her to drive any way he or she wants. However, no feasible
number of highway lanes will allow drivers to disregard good lane discipline
without causing a traffic slowdown. Likewise, no RDBMS isolation technique can
allow applications to disregard the impact they have on concurrency and
scalability.
INVESTIGATING A BLOCKING PROBLEM
- Identify the SPID at the head of the blocking chain.
Most blocking problems happen because a single process holds locks
for an extended period of time. This usually causes a chain of blocked
processes, similar to a slow-moving automobile that causes a backup on the
freeway. Identify the head of the blocking chain by using the SQL Enterprise
Manager command Server/CurrentActivity and observing the "Object Locks" tab.
Alternatively, you can use the following query, which should return
one row for each SPID at the head of a blocking chain, plus the query the
blocking spid is running, and the types of blocking locks it holds. We suggest
you do not run multiple concurrent instances of this query unless you are on
SQL Server versions later than version 6.5 Service Pack 3. This query is only
an example; you may want to modify it or use your own query.
/* Query to find spids at head of a blocking chain, their input buffers, */
/* and the type of blocking locks they hold */
declare @blocker_spid smallint
declare @i_buff_string char(30)
set nocount on
/* Get all blocked spids */
select spid, blocked, hostname=substring (hostname, 1, 10),
progname=substring(program_name, 1, 10), cmd=substring(cmd, 1, 10),
status, physical_io, waittype
into #blk from master..sysprocesses (nolock) where blocked != 0
/* delete all blocking spids except the 1st in each blocking chain */
delete from #blk
where blocked in (select spid from #blk)
/* get each spid from sysprocesses which is referenced in */
/* the "blocked" column of #blk. This should be the head */
/* of each blocking chain */
select "Blocking spid" = spid, loginame=substring(suser_name(suid),1,10),
hostname=substring (hostname, 1, 10), progname=substring(program_name, 1,10),
cmd=substring(cmd, 1, 10), status, physical_io, waittype
from master..sysprocesses (nolock)
where spid in
(select blocked from #blk)
/* For each spid at the head of a blocking chain */
/* print its input buffer to show what query it's running */
declare blk_cursor CURSOR FOR SELECT blocked from #blk
open blk_cursor
fetch next from blk_cursor into @blocker_spid
while (@@fetch_status <> -1)
begin
select @i_buff_string = ("dbcc inputbuffer (" +
convert(char(6),@blocker_spid) +")")
select "Below is input buffer for this blocking spid: ", @blocker_spid
select ""
exec (@i_buff_string)
fetch next from blk_cursor into @blocker_spid
end
deallocate blk_cursor
/* For each spid at the head of a blocking chain */
/* print the type of blocking locks it holds */
select spid, syslocks.type, locktype=name, table_id=id, page, dbid
from syslocks, master.dbo.spt_values v
where syslocks.type=v.number
and v.type='L'
and (syslocks.type & 256)=256
and spid in (select blocked from #blk)
order by spid
drop table #blk
- Find the query the blocking SPID is running.
You
can do this by running the above query, or by doing DBCC INPUTBUFFER (spid),
where spid is the blocking SPID. or by using the Server/CurrentActivity
function in SQL Enterprise Manager and double-clicking the SPID to show the
input buffer. Save this information for future reference. - Find the type of locks the blocking SPID is holding.
You can do this by running the above query, or by either running
sp_lock or querying master..syslocks. Save this information for future
reference.
Alternatively, you can use the Server/CurrentActivity
function in SQL Enterprise Manager. However, it is sometimes necessary to use
queries instead of Enterprise Manager, because some types of tempdb blocking
problems may prevent you from running queries that use temp table operations.
Using direct queries gives you the control necessary to avoid this problem. An
example of a query that results in temp table operations (and therefore may not
be usable in certain blocking situations) is sp_lock, which does an ORDER BY.
- Find the transaction nesting level and process status of
the blocking SPID.
This is essentially the same number as
@@TRANCOUNT, but it can be determined from outside the SPID by using the DBCC
PSS command. Save this information for future reference. The following shows an
example of the syntax:
dbcc traceon(3604) /* return subsequent DBCC output to client rather
than errorlog */
go
SELECT SUID FROM SYSPROCESSES WHERE SPID=<blocking SPID number>
go
DBCC PSS (suid, spid, 0) /* where suid is from above, and spid is the
blocking SPID number */
go
In the returned information, note pxcb->xcb_xactcnt=n, where n is
the @@TRANCOUNT value for the SPID. This shows the transaction nesting level
for the blocking SPID, which in turn can explain why it is holding locks. For
example, if the value is greater than zero, the SPID is in the midst of a
transaction (in which case it is normal that it retain any exclusive locks it
acquired). Note also pstat=n, where n indicates the internal status of the
SPID. This can indicate why the blocking SPID may itself be waiting for certain
events.
You can also check to see if any long-term open transaction
exists in the database by using DBCC OPENTRAN(database_name).
By examining these four pieces of information, you can usually
determine the cause of most blocking problems. The next section of this article
is a discussion of how to use this information to identify and resolve some
common blocking scenarios.
IDENTIFYING AND RESOLVING COMMON BLOCKING SCENARIOS
I. Blocking Caused by a Normally Running Query with a Long
Execution Time
A long-running query can block other queries. For
example, a bulk DELETE or UPDATE can acquire many locks that (whether they
escalate to a table lock or not) block other queries. For this reason, you
generally do not want to intermix long-running decision support queries and
OLTP queries on the same database. You can identify this situation by observing
the blocking SPID. The INPUTBUFFER may point to a query known to have a long
execution time. This will often cause a steady I/O consumption, visible by
running the following query several times in succession:
SELECT SPID, PHYSICAL_IO FROM MASTER..SYSPROCESSES
The PHYSICAL_IO value does not always reflect all the work done by
a SPID, because work done on its behalf by other processes (like the read ahead
manager) is not charged back to the SPID. However, it is often an approximate
indication of I/O activity. If the PHYSICAL_IO value continuously increases,
inspect the query running, and decide whether it should be running at the time.
Run it in isolation on a quiescent computer, and monitor the I/O activity with
SET STATISTICS IO ON. If the query consumes a large amount of I/O resources, it
may cause blocking when run on a busy system. The solution is to look for ways
to optimize the query, by changing indexes, breaking a large, complex query
into simpler queries, or running the query during off hours or on a separate
computer.
This class of blocking problem may just be a performance
problem, and may require you to pursue it as such. For more information see the
following article in the Microsoft Knowledge Base:
110352 : Optimizing Microsoft SQL Server Performance
One reason queries can be long-running and hence cause blocking is if they
inappropriately use cursors. Cursors can be a convenient method for navigating
through a result set, but using them may be slower than set- oriented queries.
For more details, see "Microsoft SQL Server 6.5 Unleashed", by David Solomon,
Ray Rankins, et al, ISBN 0-672-30956-4.
II. Blocking Caused by a
Sleeping SPID That Has Lost Track of the Transaction Nesting Level
This type of blocking can often be identified by a SPID that is sleeping or
awaiting a command, yet whose @@TRANCOUNT is greater than zero. This can happen
if the application issues the DB-Library call dbcancel() or the ODBC call
sqlcancel() without also issuing the required number of ROLLBACK and COMMIT
statements. Issuing these calls cancels the query and the batch, but does not
automatically rollback or commit the transaction. This can be seen by issuing a
simple query from ISQL/w, such as BEGIN TRAN SELECT * FROM MASTER..SYSMESSAGES
and clicking the red Cancel button. After the query is canceled, SELECT
@@TRANCOUNT indicates that the transaction nesting level is one. Had this been
a DELETE or an UPDATE query, or had HOLDLOCK been used on the SELECT, all the
locks acquired would still be held. Applications must properly manage
transaction nesting levels, or they may cause a blocking problem following the
cancellation of the query. See the following article in the Microsoft Knowledge
Base for more information:
117143 INFO: When and How to Use dbcancel() or sqlcancel()
NOTE: The transaction nesting level of the SPID can
be observed by using DBCC PSS.
III. Blocking Caused by a SPID Whose
Corresponding Client Application
Did Not Fetch All Result Rows to Completion
This problem is caused by poor application design. After
sending a query to the server, all applications must immediately fetch all
result rows to completion. If an application does not fetch all result rows,
locks may can be left on the tables, blocking other users. If you are using an
application that transparently submits SQL statements to the server, the
application must fetch all result rows. If it does not (and if it cannot be
configured to do so), you may be unable to resolve the blocking problem. To
avoid the problem, you can restrict poorly-behaved applications to a reporting
or a decision-support database.
You can often identify this problem
by the following behavior:
- The blocking SPID continuously has a waittype of 0x800.
- The status may indicate runnable.
- The cmd may be SELECT.
- Transaction nesting level is zero. (If you cannot examine
@@TRANCOUNT within the application, you can examine it externally by using the
DBCC PSS command.
IV. Blocking Caused by a Distributed Client/Server Deadlock
Unlike a conventional deadlock, a distributed deadlock is not
detectable using the RDBMS lock manager. The blocking SPID will often appear
sleeping, with a waittype of 0x800 (waiting on a network I/0). A SPID in this
state cannot be KILLed, as it is waiting on a return from a Windows NT Server
API call. A distributed client/server deadlock may occur if the application
opens more than one connection to the RDBMS and submits a query asynchronously.
The following are two examples of how this can happen, and possible ways the
application can avoid it.
Example A: Client/Server Distributed Deadlock with a Single Client Thread
If the client has multiple open connections (dbprocs in
DB-Library terms), and a single thread of execution, the following distributed
deadlock may occur. For brevity, the term dbproc refers to the client
connection structure. In ODBC API terms, the closest analogy is an hdbc.
NOTE: Used a fixed font for the information below to display
correctly.
SPID1------blocked on lock------->SPID2
/\ (waiting on net write) Server side
| (sysprocesses.waittype==0x800)
| |
| |
| ================================|==================================
| <-- single thread --> | Client side
| \/
dbproc1 <------------------- dbproc2
(waiting on dbnextrow (effectively blocked on dbproc1, awaiting
or SQLFetch) single thread of execution to run)
In the case shown above, a single client application thread
has two open connections. It asynchronously submits a SQL operation on dbproc1.
This means it does not wait on the call to return before proceeding. The DB-
Library asynchronous call is dbsqlsend(). ODBC applications select asynchronous
mode with SQLSetStmtOption() and use the SQL_ASYNC_ENABLE parameter. The
application then submits another SQL operation on dbproc2, and awaits the
results to start processing the returned data. When data starts coming back,
(whichever dbproc first responds), it processes to completion all the data
returned on that dbproc. Assume this is dbproc1. It fetches results from
dbproc1 until SPID1 gets blocked on a lock held by SPID2 (because the two
queries are running asynchronously on the server). At this point, dbproc1 will
wait indefinitely for more data. SPID2 is not blocked on a lock, but tries to
send data to its client, dbproc2. However, dbproc2 is effectively blocked on
dbproc1, awaiting the single thread of execution to run.
Example B: Client/Server Distributed Deadlock with a Thread per Connection
Even if a separate thread exists for each connection on the
client, a variation of this distributed deadlock may still occur, as shown by
the following:
NOTE: Use a fixed font for the information below to
display correctly.
SPID1------blocked on lock-------->SPID2
/\ (waiting on net write) Server side
| (sysprocesses.waittype==0x800)
| |
| INSERT |SELECT
| ================================|==================================
| <-- thread per dbproc --> | Client side
| \/
dbproc1 <-----data row------- dbproc2
(waiting on dbnextrow (blocked on dbproc1, waiting for it
or SQLFetch) to read the row from its buffer)
This case is similar to Example A, except dbproc2 and SPID2
are running a SELECT with the intention of performing row-at-a-time processing
and handing each row through a buffer to dbproc1 for an INSERT in the same
table. Eventually, SPID1 becomes blocked on a lock held by SPID2. SPID2 then
writes a result row to the client dbproc2. Dbproc2 then tries to pass the row
in a buffer to dbproc1, but finds dbproc1 has not yet fetched the last row from
the buffer (because it is blocked waiting on SPID1, which is blocked on SPID2).
Both examples A and B are fundamental issues that application
developers must be aware of. They must code applications to handle these cases
appropriately. Two reliable solutions are to use either a query timeout or
bound connections.
Query Timeout: When a query timeout has been
provided, if the distributed deadlock occurs, it will be broken when then
timeout happens. See the DB- Library or ODBC documentation for more information
on using a query timeout.
Bound Connections: This is a feature new
to SQL Server 6.5. It allows a client having multiple connections to bind them
into a single transaction space, so the connections don't block each other. See
the SQL Server 6.5 documentation under "bound connections" for more
information.
V. Blocking Caused by a SPID That Is in a "Golden," or
Rollback, State
A data modification query that is KILLed, or
canceled outside of a user- defined transaction will be rolled back. This can
also occur as a side effect of the client computer restarting and its network
session disconnecting. Likewise, a query selected as the deadlock victim will
be rolled back. A data modification query often cannot be rolled back any
faster than the changes were initially applied. In the case of a DELETE,
INSERT, or UPDATE that had been running for an hour, it could take at least an
hour to roll back. If the server is shut down in the midst of this operation,
the database will be in recovery mode upon restarting, and it will be
inaccessible until all open transactions are processed. Startup recovery takes
essentially the same amount of time per transaction as run- time recovery, and
the database is inaccessible during this period. Thus, forcing the server down
to fix a SPID in a rollback state will often be counterproductive. This is
expected behavior, because the changes made must be completely rolled back, or
transactional and physical integrity in the database would be compromised.
Because this must happen, SQL Server marks the SPID in a "golden" or rollback
state (which means it cannot be KILLed or selected as a deadlock victim). This
can often be identified by observing the sp_who output, which may indicate the
ROLLBACK command. On version 6.5 Service Pack 2 or later, a ROLLBACK status has
been added to sysprocesses.status, which will also appear in sp_who output or
the SQL Enterprise Manager "current activity" screen. However, the most
reliable way to get this information is to inspect the DBCC PSS of the blocking
SPID in question, and observing the pstat value. For example, it may be
something like the following:
pstat=0x4000, 0x800, 0x100, 0x1
Meaning of PSTAT bits:
0x4000 -- Delay KILL and ATTENTION signals if inside a critical section
0x2000 -- Process is being killed
0x800 -- Process is in backout, thus cannot be chosen as deadlock victim
0x400 -- Process has received an ATTENTION signal, and has responded by
raising an internal exception
0x100 -- Process in the middle of a single statement xact
0x80 -- Process is involved in multi-db transaction
0x8 -- Process is currently executing a trigger
0x2 -- Process has received KILL command
0x1 -- Process has received an ATTENTION signal
The pstat value above would be a typical situation if a
long-running data modification was canceled (for example, by clicking the
Cancel Query button on a GUI application), and then the SPID was found to block
users and yet be unkillable. This situation is normal; the transaction must be
backed out. It can be identified by the bits, as noted above.
In
addition to the pstat field of the PSS, the sysprocesses.waittype field can
also give information about why the SPID may be waiting. The following are some
common values:
0x800 -- Waiting on network I/O completion
0x8011 -- Waiting on buffer resource lock (shared) request
0x81 -- Waiting on writelog
0x0020 -- Waiting on buffer in I/O
0x0005 -- Waiting on exclusive page lock
0x13 -- Waiting on buffer resource lock (exclusive) request
0x8001 -- Waiting on exclusive table lock
0x8007 -- Waiting on update page lock
0x8005 -- Waiting on exclusive page lock
0x8003 -- Waiting on exclusive intent lock
0x6 -- Waiting on shared page lock
0x8006 -- Waiting on shared page lock
0x23 -- Waiting on buffer being dumped
0x5 -- Waiting on exclusive page lock
0x0013 -- Waiting on buffer resource lock (exclusive) request
0x0022 -- Waiting on buffer being dirtied
VI. Blocking Caused by SQL Server 6.5 Atomic SELECT
INTO Behavior
By definition, SQL Server treats each statement as a
separate transaction. Beginning with SQL Server version 6.5, SELECT INTO was
made consistent with this standard by including the table creation and data
insert phases in a single atomic operation. A side effect of this is that locks
on system catalog tables are maintained for the duration of a SELECT INTO
statement. This is more frequently seen in tempdb, because applications often
do SELECT INTO temporary tables. Blocking caused by this action can be
identified by examining the locks held by the blocking SPID. The atomic SELECT
INTO behavior can be disabled with trace flag 5302. For more information, see
the following article in the Microsoft Knowledge Base:
153441 FIX: SELECT INTO Locking Behavior
VII. Blocking Caused by an Orphaned Connection
If the client application traps or the client workstation is
restarted, the network session to the server may not be immediately canceled
under some conditions. From the server's perspective, the client still appears
to be present, and any locks acquired may still be retained. For more
information, see the following article in the Microsoft Knowledge Base:
137983 : How to Troubleshoot Orphaned Connections in SQL Server
APPLICATION INVOLVEMENT IN BLOCKING PROBLEMS
SQL Server is essentially a puppet of the client application. The
client application has almost total control over (and responsibility for) the
locks acquired on the server. While the SQL Server lock manager automatically
uses locks to protect transactions, this is directly instigated by the query
type sent from the client application, and the way the results are processed.
Therefore, resolution of most blocking problems necessitates inspecting the
client application.
Often, turnkey client applications are used
against SQL Server, in addition to higher-level application development tools.
These may encapsulate the DB- Library or ODBC API calls to the database in a
higher abstraction level. However, from the perspective of SQL Server, there is
essentially no difference between one of these higher level applications, a
call-level DB- Library application, and a call-level ODBC application. SQL
Server only perceives a stream of Transact-SQL queries and certain control
tokens sent by each client API call. The same basic issues will cause blocking
problems whether the client application is a call-level application written in
C or a higher level application that encapsulates the database calls. Likewise,
the solution to these problems is generally the same.
This means
that no matter what level of visibility the application exposes regarding the
database calls being made, a blocking problem nonetheless frequently requires
both the inspection of the exact SQL statements submitted by the application
and the application's exact behavior regarding query cancellation, connection
management, fetching all result rows, and so on. If the development tool does
not allow explicit control over connection management, query cancellation,
query timeout, result fetching, and so on, blocking problems may not be
resolvable. This potential should be closely examined before selecting an
application development tool for SQL Server, especially for business-critical
OLTP environments.
There may be a tendency to focus on server-side
tuning and platform issues when facing a blocking problem. However, this does
not usually lead to a resolution, and can absorb time and energy better
directed at examining the client application and the queries it submits.
Several tracing utilities can be used to examine exactly what the
client application is sending to the server. Each client application may have
its own tracing feature, so examine the documentation for the application. In
addition, server-side tracing can be done using the -T4032 trace flag and the
SQL Trace utility. You can usually resolve a blocking problem without using
these utilities, but they are available if needed.
APPLICATION DESIGN TECHNIQUES TO AVOID A BLOCKING PROBLEM
DO NOT:
- Fail to immediately fetch all result rows to completion.
- Design a large-scale OLTP system using an application
development tool that does not allow explicit control over connections,
transactions, and the SQL syntax sent to the server.
- Use or design a client application that allows users to
fill in edit boxes that generate a long-running query. For example, do not use
or design an application that prompts the user for inputs, but allows leaving
certain fields blank or entering a wildcard. This may cause the application to
submit a query with an excessive running time, thereby causing a blocking
problem.
- Use or design an application that allows user input within
a transaction.
ALWAYS:
- Allow for query cancellation by means of dbcancel(),
sqlcancel(), or an equivalent command.
- Use a query timeout by means of dbsetltime() or
SQLSetStmtOption(), to prevent a runaway query and avoid distributed deadlocks.
- Immediately fetch all result rows to completion.
- Keep transactions as short as possible.
- Explicitly control connection management.
- Stress test the application at the full projected
concurrent user load.
It is vital that great care be exercised during the design
and construction phase of the database and application. In particular, the
resource consumption, isolation level, and transaction path length should be
evaluated for each query. Each query and transaction should be as lightweight
as possible. Good connection management discipline must be exercised. If this
is not done, it is possible that the application may appear to have acceptable
performance at low numbers of users, but the performance may degrade
significantly as the number of users scales upward.
With proper
application and query design, Microsoft SQL Server is capable of supporting
many thousands of simultaneous users on a single server, with little blocking.
The successful sites that reach this level typically use the techniques
described in this article.