SYMPTOMS
If operating system, driver, or hardware problems cause lost
write conditions or stale read conditions, you may see data integrity-related
error messages such as Errors 605, 823, 3448. You may receive error messages
that are similar to the following examples:
2003-07-24
16:43:04.57 spid63 Getpage: bstat=0x9, sstat=0x800, cache
2003-07-24
16:43:04.57 spid63 pageno is/should be: objid is/should be:
2003-07-24
16:43:04.57 spid63 (1:7040966)/(1:7040966) 2093354622/2039782424
2003-07-24 16:43:04.57 spid63 ... IAM indicates that page is allocated to
this object
2003-07-24 16:52:37.67 spid63 Error: 605, Severity: 21, State:
1
2003-07-24 16:52:37.67 spid63 Attempt to fetch logical page (1:7040966)
in database 'pubs' belongs to object 'authors', not to object 'titles'..
2003-07-24 16:52:40.99 spid63 Error: 3448,
Severity: 21, State: 1
2003-07-24 16:52:40.99 spid63 Could not undo log
record (63361:16876:181), for transaction ID (0:159696956), on page
(1:7040977), database 'pubs' (database ID 12). Page information: LSN =
(63192:958360:10), type = 2. Log information: OpCode = 2, context
1..
2003-07-09 14:31:35.92 spid66 Error: 823,
Severity: 24, State: 2
2003-07-09 14:31:35.92 spid66 I/O error (bad page
ID) detected during read at offset 0x00000016774000 in file
'h:\sql\MSSQL\data\tempdb.mdf'..
RESOLUTION
Service pack information
To resolve this problem, obtain the latest service pack for Microsoft SQL Server 2000. For additional information, click the following article number to view the article in the Microsoft Knowledge Base:
290211 How to obtain the latest SQL Server 2000 service pack
Build information
Microsoft has released a build of SQL Server that introduces
extended tracking capabilities. These capabilities are designed to help you
troubleshoot the error messages described in the "Symptoms" section. If you do
not suspect an operating system or a hardware releated problem, you may not
have to apply this build.
The English version of this build has the
file attributes (or later) that are listed in the following table. The dates
and times for these files are listed in coordinated universal time (UTC). When
you view the file information, it is converted to local time. To find the
difference between UTC and local time, use the
Time Zone tab
in the Date and Time tool in Control Panel.
Date Time Version Size File name
-------------------------------------------------------------------------
31-May-2003 18:45 2000.80.818.0 78,400 bytes Console.exe
25-Jun-2003 01:01 2000.80.818.0 33,340 bytes Dbmslpcn.dll
25-Apr-2003 02:12 786,432 bytes Distmdl.ldf
25-Apr-2003 02:12 2,359,296 bytes Distmdl.mdf
30-Jan-2003 01:55 180 bytes Drop_repl_hotfix.sql
23-Jun-2003 22:40 2000.80.837.0 1,557,052 bytes Dtsui.dll
23-Jun-2003 22:40 2000.80.837.0 639,552 bytes Dtswiz.dll
24-Apr-2003 02:51 747,927 bytes Instdist.sql
03-May-2003 01:56 1,581 bytes Inst_repl_hotfix.sql
08-Feb-2003 06:40 2000.80.765.0 90,692 bytes Msgprox.dll
01-Apr-2003 02:07 1,873 bytes Odsole.sql
05-Apr-2003 01:46 2000.80.800.0 62,024 bytes Odsole70.dll
07-May-2003 20:41 2000.80.819.0 25,144 bytes Opends60.dll
02-Apr-2003 21:48 2000.80.796.0 57,904 bytes Osql.exe
02-Apr-2003 23:15 2000.80.797.0 279,104 bytes Pfutil80.dll
22-May-2003 22:57 19,195 bytes Qfe469571.sql
11-Jul-2003 17:04 1,084,147 bytes Replmerg.sql
04-Apr-2003 21:53 2000.80.798.0 221,768 bytes Replprov.dll
08-Feb-2003 06:40 2000.80.765.0 307,784 bytes Replrec.dll
11-Jul-2003 16:56 1,085,925 bytes Replsys.sql
01-Jun-2003 01:01 2000.80.818.0 492,096 bytes Semobj.dll
31-May-2003 18:27 2000.80.818.0 172,032 bytes Semobj.rll
29-May-2003 00:29 115,944 bytes Sp3_serv_uni.sql
01-Jun-2003 01:01 2000.80.818.0 4,215,360 bytes Sqldmo.dll
07-Apr-2003 17:44 25,172 bytes Sqldumper.exe
19-Mar-2003 18:20 2000.80.789.0 28,672 bytes Sqlevn70.rll
02-Jul-2003 00:18 2000.80.834.0 180,736 bytes Sqlmap70.dll
08-Feb-2003 06:40 2000.80.765.0 57,920 bytes Sqlrepss.dll
01-Aug-2003 00:50 2000.80.847.0 7,594,065 bytes Sqlservr.exe
25-Jul-2003 21:44 2000.80.845.0 590,396 bytes Sqlsort.dll
08-Feb-2003 06:40 2000.80.765.0 45,644 bytes Sqlvdi.dll
25-Jun-2003 01:01 2000.80.818.0 33,340 bytes Ssmslpcn.dll
01-Jun-2003 01:01 2000.80.818.0 82,492 bytes Ssnetlib.dll
01-Jun-2003 01:01 2000.80.818.0 25,148 bytes Ssnmpn70.dll
01-Jun-2003 01:01 2000.80.818.0 158,240 bytes Svrnetcn.dll
31-May-2003 18:59 2000.80.818.0 76,416 bytes Svrnetcn.exe
30-Apr-2003 23:52 2000.80.816.0 45,132 bytes Ums.dll
02-Jul-2003 00:19 2000.80.834.0 98,816 bytes Xpweb70.dll
Note Because of file dependencies, the most recent hotfix or feature
that contains these files may also contain additional files.
Microsoft has confirmed that under rare and heavy I/O loads, some hardware platforms can return a stale read. If the extended diagnostics indicate a possible stale read/lost write condition, contact your hardware vendor for immediate follow up and test with the SQLIOStress utility.
MORE INFORMATION
If you receive any of the error messages that are mentioned
in the "Symptoms" section and they cannot be explained by an event such as a
physical drive failure, then review any known problems with SQL Server, the
operating system, the drivers, and the hardware. Additional SQL Server
diagnostics have been added in this build to help detect I/O related external
problems. The additional diagnostics try to provide information about the two
following conditions:
- Lost Write: A successful call to the WriteFile API, but the operating system, a driver, or the caching
controller does not correctly flush the data to the physical media even though
SQL Server is informed that the write was successful.
- Stale Read: A successful call to the ReadFile API, but the operating system, a driver, or the caching
controller incorrectly returns an older version of the data.
For example, Microsoft has confirmed scenarios where a
WriteFile API call returns as successful, but an immediate, successful read
of the same data block returns older data, including data that is likely stored
in a hardware read cache. Sometimes, this problem occurs because of a read
cache problem. In other cases, the write data is never actually written to the
physical disk.
To enable additional diagnostics for these types of
problems, SQL Server has added trace flag 818. You can specify trace flag 818
as a startup parameter, -T818, for the computer that is running SQL Server, or
you can run the following statement:
DBCC TRACEON(818, -1)
Trace flag 818 enables an in-memory ring buffer that is used for
tracking the last 2,048 successful write operations that are performed by the
computer running SQL Server, not including sort and workfile I/Os. When errors
such as Error 605, 823, or 3448 occur, the incoming buffer's log sequence
number (LSN) value is compared to the recent write list. If the LSN that is
retrieved during the read operation is older than the one specified during the
write operation, a new error message is logged in the SQL Server error log.
Most SQL Server write operations occur as checkpoints or as lazy writes. A lazy
write is a background task that uses asynchronous I/O. The implementation of
the ring buffer is lightweight, thereby making the performance affect on the
system negligible.
The following message indicates that SQL Server did
not receive an error from the
WriteFile API call or the
ReadFile API call. However, when the LSN was reviewed, the value was not
correct:
SQL Server has detected an unreported
OS/hardware level read or write problem on Page (1:75007) of database 12
LSN returned (63361:16876:181), LSN expected (63361:16876:500)
Contact the hardware vendor and consider disabling caching mechanisms to
correct the problem
At this point, either the read cache contains an
older version of the page, or the data was not correctly written to the
physical disk. In either case (a Lost Write or a Stale Read), SQL Server
reports an external problem with the operating system, the driver, or the
hardware layers.
If Error 3448 occurs when you try to rollback a
transaction that has Error 605 or Error 823, the computer running SQL Server
automatically closes the database and tries to open and recover the database.
The first page that experiences Error 605 or Error 823 is considered a bad
page, and the page id is kept by the computer running SQL Server. During
recovery (before the redo phase) when the bad page id is read, the primary
details about the page header are logged in the SQL Server error log. This
action is important because it helps to distinguish between Lost Write and
Stale Read scenarios.
You may see the following two common behaviors
in Stale Read scenarios:
- If the database files are closed and then opened, the
correct and most recently written data is returned during recovery.
- When you issue a checkpoint and run the DBCC DROPCLEANBUFFERS statement (to remove all database pages from the memory), and
then run the DBCC CHECKDB statement on the database, the most recently written data is
returned.
The behaviors mentioned in the previous paragraph indicate a
read caching problem and they are frequently solved by disabling the read
cache. The actions that are outlined in the previous paragraph typically force
a cache invalidation and the successful reads that occur show that the physical
media is correctly updated. The Lost Write behavior occurs when the page that
is read back is still the older version of the data, even after a forced flush
of the caching mechanisms.
Sometimes, the problem may not be specific
to a hardware cache. It may be a problem with a filter driver. In such cases,
review your software, including backup utilities and antivirus software, and
then see if there are problems with the filter driver.
Microsoft has
also noted conditions that do not meet the criteria for Error 605 or Error 823
but are caused by the same Stale Read or Lost Write activity. In some
instances, a page appears to be updated two times but with the same LSN value.
This behavior may occur if the
Object ID and the
Page ID are correct (page already allocated to the object), and a change
is made to the page and flushed to the disk. The next page retrieval returns an
older image, and then a second change is made. The SQL Server transaction log
shows that the page was updated two times with the same LSN value. This action
becomes a problem when you try to restore a transaction log sequence or with
data consistency problems, such as foreign key failures or missing data
entries. The following error message illustrates one example of this
condition:
Error: 3456, Severity: 21, State: 1 Could not
redo log record (276666:1664:19), for transaction ID (0:825853240), on page
(1:1787100), database 'authors' (7). Page: LSN = (276658:4501:9), type = 1.
Log: OpCode = 4, context 2, PrevPageLSN: (275565:3959:31)..
Some
scenarios are outlined in more detail in the following lists:
LSN Sequence Action
1 Checkpoint
2 Begin Transaction
3 Table created or truncated
4 Inserts (Pages allocated)
5 Newly allocated page written to disk by Lazy Writer
6 Select from table - Scans IAM chain, newly allocated page read back from disk (LRU | HASHED = 0x9 in getpage message), encounters Error 605 - Invalid Object ID
7 Rollback of transaction initiated
LSN Sequence Action
1 Checkpoint
2 Begin Transaction
3 Page Modification
4 Page written to disk by Lazy Writer
5 Page read in for another modification (stale image returned)
6 Page Modified for a second time but because of stale image does not see first modification
7 Rollback - Fails - Transaction Log shows two different log records with the same PREV LSN for the page
SQL Server 'sort' operators perform I/O activities, primarily to
and from the
tempdb database. These I/O operations are similar to the buffer I/O
operations; however, they have already been designed to use read retry logic to
try to resolve similar issues. The additional diagnostics explained in this
article do not apply to these I/O operations.
Microsoft has noted
that the root cause for the following sort read failures is generally a Stale
Read or a Lost Write:
2003-04-01 20:13:31.38 spid122 SQL
Server Assertion: File: <p:\sql\ntdbms\storeng\drs\include\record.inl>,
line=1447 Failed Assertion = 'm_SizeRec > 0 && m_SizeRec <=
MAXDATAROW'.
2003-03-29 09:51:41.12 spid57 Sort read failure (bad
page ID). pageid = (0x1:0x13e9), dbid = 2, file = e:\program files\Microsoft
SQL Server\mssql\data\tempdb.mdf. Retrying.
2003-03-29 09:51:41.13
spid57 Error: 823, Severity: 24, State: 7
2003-03-29 09:51:41.13 spid57
I/O error (bad page ID) detected during read at offset 0x000000027d2000 in file
'e:\program files\Microsoft SQL Server\mssql\data\tempdb.mdf'..
*
00931097 Module(sqlservr+00531097) (utassert_fail+000002E3)
* 005B1DA8
Module(sqlservr+001B1DA8) (RecBase::Resize+00000091)
* 00407EE7
Module(sqlservr+00007EE7) (RecBase::LocateColumn+00000012)
* 00852520
Module(sqlservr+00452520) (mergerow+000000A4)
* 008522B3
Module(sqlservr+004522B3) (merge_getnext+00000285)
* 0085207D
Module(sqlservr+0045207D) (mergenext+0000000D)
* 004FC5FB
Module(sqlservr+000FC5FB) (getsorted+00000021)
Customers who
have experienced these sort errors have frequently resolved the problems by
moving
tempdb to a non-caching local drive, or by disabling the read caching
mechanisms.
Because a stale read or a lost write results in data
storage that is not expected, a wide variety of behaviors may occur. It may
appear as missing data, but some of the more common effects of missing data
appear as index corruptions, such as Error 644 or Error 625:
Error 644 Severity Level 21 Message Text Could not find the index
entry for RID '%.*hs' in index page %S_PGID, index ID %d, database '%.*ls'.
Error 625 Severity Level 21 Message Text Cannot retrieve row from page
%S_PGID by RID because the slotid (%d) is not valid.
Some
customers have reported missing rows after they perform row count activities.
This problem occurs because of a lost write. Perhaps the page was supposed to
be linked to the clustered index page chain. If the write was physically lost,
the data is also lost.
Important If you experience any of the behaviors, or if you are suspicious
of similar problems together with disabling caching mechanisms, Microsoft
strongly recommends that you obtain the latest update for SQL Server and the
latest SQL Server I/O Stress Simulator. Microsoft also strongly encourages that
you perform a strict review of your operating system and its associated
configurations.
For additional information, click the following article
number to view the article in the Microsoft Knowledge Base:
231619
HOW TO: Use the SQLIOStress Utility to Stress a Disk Subsystem Such as SQL Server