Cost of using non-clustered indexes may be incorrect if data correlation exists (285996)

The information in this article applies to:

Microsoft SQL Server 7.0
Microsoft SQL Server 2000 (all editions)

This article was previously published under Q285996

SYMPTOMS

In scenarios where there is a data correlation between the order of the non-clustered index entries and the rows in the base table, the optimizer may choose a plan that performs a table scan or clustered index scan rather than using a qualifying non-clustered index to filter the data. Data Correlation means that there is a relationship between the key values of the indexes such that they provide close to the same ordering of the data instead of a random distribution of data.

CAUSE

The SQL Server optimizer is built on the assumption that there is no data correlation between indexes and hence assumes that the cost in physical IO of using a non-clustered index to retrieve the data row to be 1 physical IO operation for every qualifying result row.

For a table scan, the SQL Server optimizer computes the physical IO operations to be equal to the number of pages in the table regardless of the number of estimated rows that meet the filter conditions; however, this is adjusted by the capability of Readahead, which allows a single physical IO to retrieve many pages (up to 8) as well as using sequential IOs instead of random IOs. Based on this it is easy to see that if there are more rows in the table that meet the filter conditions matching a non-clustered index than compared to (pages read/readahead read) * (pages of the table), it is cheaper and ultimately faster to use a table scan.

WORKAROUND

You can use the following suggestions to help the SQL Server optimizer choose the best plan when a data correlation exists.

Note The items are listed from least impact to client applications to the most impact.

Reorder index key columns so that the key column with the highest cardinality is the first column of the index.
Create indexes to cover the queries. Note that on SQL Server 7.0 and SQL Server 2000 that multiple single or multi-column indexes can be combined to make a covering index.
Add index hints to the queries.

MORE INFORMATION

There are two types of data correlation problems that have been discovered, which cause the SQL Server optimizer to over estimate the physical IO used by a non-clustered index:

A single table where the non-clustered index key is correlated to the clustered index key.
A single table that is a heap and the data entered in the non-clustered index key columns is ever increasing or decreasing values and there are very few updates or delete operations on the table.

The following examples help to illustrate this situation and are all based on the following table script:

use pubs
go
create table x (
xid int identity(1,1),
xrand int,
xdate datetime,
xdata char(100))
go
--insert 100k rows
set nocount on
go
begin transaction
declare @x int
select @x = 0
while @x < 100000
begin
  insert x (xrand,xdate,xdata) values (rand() * 100000,getdate(),'a')
  if @x % 10000 = 9999
  begin
    commit transaction
    begin transaction
  end
  select @x = @x + 1
end
commit transaction
go
set nocount off
go

It is important to note this information about the examples that follow:

The database needs to be set offline after the indexes are created in each test to make sure that the pages of the table are not in the SQL Server cache.
On each query there is an index hint that is commented. Run the scenario once with the comment to see the optimizer's choice and then uncomment the index hint to view the other behavior.
The SET STATISTICS PROFILE ON statement causes the query to generate a second result set that contains the query plan and costing information. Because the data used is random in some columns, there is slightly different results each time new data is generated. For the best results, use the SQL Server Query Analyzer tool and set the results to Grid format instead of Text.
To get a more accurate gauge of the actual performance uncomment the following statements that wrap around the query as follows:
```
--declare @x datetime
--select @x = getdate()
< query to check >
--select datediff(ms,@x,getdate())
```
This causes the query execution time including compilation in milliseconds to display.
All query cost values in the examples that follow are approximate and are rounded to 2 decimal places.

The first example shows that the optimizer is correct in choosing the table scan over using the non-clustered index to locate the row when no data correlation exists. This is the basis for comparison with future examples. This operation is referred to as a Bookmark Lookup operator in SQL Server 7.0 and SQL Server 2000. In this example, on SQL Server 2000, the cost of the table scan was 1.22 and executed in 2.5 seconds and the cost of using the non-clustered index was 5 and executed in 8.3 seconds.

drop index x.xrand
drop index x.xid
drop index x.xdate
go
create clustered index xid on x(xid)
go
create index xrand on x(xrand) 
go
use master
go
exec sp_dboption 'pubs', offline, true
exec sp_dboption 'pubs', offline, false
go
use pubs
go
set statistics profile on
go
--declare @x datetime
--select @x = getdate()
select * from x --(index=2)
 where xrand  between 1000 and 2000
--select datediff(ms,@x,getdate())
go
set statistics profile off
go

The second example shows the optimizer choosing a table scan instead of using the non-clustered index to locate the row when a data correlation exists between the clustered index and the non-clustered index. In this example on SQL Server 2000, the cost of the table scan was 1.33 and the cost of using the non-clustered index was 5.25. Based on this the optimizer did choose the cheapest estimated plan. However, the table scan took 2.3 seconds to execute and the non-clustered index took 0.2 seconds to execute. So, with the data correlation, using the non-clustered index is faster than performing a table scan.

Despite the higher cost of the non-clustered index seek, the non-clustered index seek is faster because the bookmark lookup benefits from the data being arranged sequentially. Typically, a bookmark lookup is an expensive operator because each read key from the non-clustered index causes a random I/O against the base table. If the base table is large and there are many bookmark lookups, you may experience situations where the same page incurs multiple physical I/Os because the page is aged from cache between accesses. When the non-clustered index keys are correlated with the base table, the bookmark lookup behaves more like sequential I/O because each page from the base table never requires more than one physical I/O. And any secondary accesses for additional lookups will occur immediately after the first lookup.

drop index x.xrand
drop index x.xid
drop index x.xdate
go
create clustered index xdate on x(xdate)
go
create unique index xid on x(xid) 
go
use master
go
exec sp_dboption 'pubs', offline, true
exec sp_dboption 'pubs', offline, false
go
use pubs
go
set statistics profile on
go
--declare @x datetime
--select @x = getdate()
select * from x --(index=2)
 where xid  between 1000 and 2000
--select datediff(ms,@x,getdate())
go
set statistics profile off
go

The third example shows the optimizer choosing a table scan instead of using the non-clustered index to locate the row when a data correlation exists between data entry order or row order left over from a dropped clustered index and the non-clustered index. In this example, on SQL Server 2000, the cost of the table scan was 1.38 and the cost of using the non-clustered index was 5.25. Based on this the optimizer did choose the cheapest estimated plan. However, the table scan took 1 second to execute and the non-clustered index took 0.14 seconds to execute. So, once again with the data correlation, using the non-clustered index is faster than performing a table scan.

drop index x.xrand
drop index x.xid
drop index x.xdate
go
create index xid on x(xid)
go
use master
go
exec sp_dboption 'pubs', offline, true
exec sp_dboption 'pubs', offline, false
go
use pubs
go
set statistics profile on
go
--declare @x datetime
--select @x = getdate()
select * from x --(index=2)
 where xid  between 1000 and 2000
--select datediff(ms,@x,getdate())
go
set statistics profile off
go

In order to detect data correlations you can check the following things:

The index is on values that are always increasing or decreasing in the table itself (either the clustered index or the heap). You can also detect this by running a query with an order by clause that matches one of the index key columns and then visually examine the other index key columns in the results to see if they are also in order.
If hinting the non-clustered indexes with no table or index pages in cache is faster than the same query running with a table scan.
For composite indexes, the first few index columns are the same in both indexes.
If there is some known relationship where the key of one index affects the domain of another index.

REFERENCES

For more information about the topics discussed in this article, refer to:

Delaney, Kalen. Inside SQL Server 2000, pages 98, 120, 833-836, 889-890. Microsoft Press, December 2000. ISBN: 0735609985

Modification Type:	Major	Last Reviewed:	11/16/2005
Keywords:	kbprb KB285996