Comparing SQL collations to Windows collations (322112)

MORE INFORMATION

Collation semantics

For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data. Both Unicode and non-Unicode sorting are compatible with string comparison rules in a particular version of Windows. This provides consistency across data types in SQL Server. It also enables developers who use the CompareString Win32 API function to sort strings in their application by using the same rules that SQL Server uses.

In a SQL collation, SQL Server defines different comparison semantics for non-Unicode data. SQL Server bases these comparison semantics on a SQL "sort order." For a mapping of sort orders to SQL collations, see the "SQL Collation Name" topic in SQL Server Books Online.

A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type. For example, if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen.

String compare performance

Unicode sorting rules are much more complex than the rules for a non-Unicode SQL sort order. When SQL Server compares Unicode data, the characters are assigned a weight that is dynamically modified based on the collation's locale. The data is also modified by comparison style settings such as width, accent, or Kana-sensitivity. The Unicode sort routines support more intelligent sort behaviors like word sorting.

Additionally, because the routines must handle Unicode data, they are flexible enough to handle the sorting and the comparison of several thousand distinct characters, instead of the maximum 255 characters that most SQL Server sort orders can handle. For these reasons, raw string comparison work that uses the Unicode sorting rules is typically more expensive in terms of both time and CPU cycles than a similar string comparison that uses a non-Unicode SQL sort order.

What this means for the possible combinations of data types and collation types in SQL Server:

If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and you are using a SQL collation, string comparisons will be performed with a non-Unicode SQL sort order.
If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and you are using a Windows collation, string comparisons will be performed with the Unicode sorting rules. This may cause certain operations that are unusually dependent on string sorting performance to take longer and to use more CPU than a similar operation that is performed with a SQL collation.
If you are using Unicode data types (nchar, nvarchar, ntext), there is no difference in the sorting behavior for the SQL and the Windows collations. Both will use Unicode sorting rules.

Generally, the degree of performance difference between the Windows and the SQL collations will not be significant. The difference only appears if a workload is CPU-bound, rather than being constrained by I/O or by network speed, and most of this CPU burden is caused by the overhead of string manipulation or comparisons performed in SQL Server. An example of an application where the performance difference might be pronounced is a system where an application passes a long string value to a SQL Server stored procedure. The stored procedure then parses the string through extensive use of Transact-SQL string manipulation functions like CHARINDEX or PATINDEX. If the workload is fairly one-dimensional and it is dominated by executions of this string parsing stored procedure, the difference in performance between a SQL collation and a Windows collation might be noticeable. However, the design of most applications does not lead to a situation where the performance difference is significant.

Recommendations

SQL collations are provided for backward compatibility with earlier versions of SQL Server. Windows collations provide consistent string comparisons for both Unicode and for non-Unicode text in SQL Server that are also consistent with string comparisons in the Windows operating system. For all these reasons, Windows collations are preferred unless there are backward compatibility issues or specific performance issues that require a SQL collation.
If you are considering a SQL collation based only on the performance characteristics of a SQL collation, realize that the performance of most applications does not benefit significantly from a change in collation. Make sure that you have isolated queries that show a benefit from a SQL collation. As soon as you identify the affected queries, consider the following alternatives to a change in collation. Both of these alternatives may provide a performance benefit that is greater than what you will see if you change the instance collation to a SQL collation:
1. If the overhead for the Windows collations is traced to Transact-SQL routines that perform explicit string manipulation or parsing, and if you are using non-Unicode data types, you may want to specify a SQL collation or a binary Windows collation for the operation that is frequently executed and that is most expensive. Suppose you use the PATINDEX function to determine whether a text column in a table contains the "x" character. If you force a SQL collation for that particular comparison operation, and you continue to use a Windows collation for the rest of the database and the application, you do not have to change the collation for the whole system:
```
SELECT PATINDEX ('%x%', MemoFld COLLATE SQL_Latin1_General_Cp1_CI_AS) FROM ...
```
2. If the overhead for the Windows collations is traced to more mundane queries that do not use complex string manipulation functions, improved index or query designs might provide improvements that dwarf those you would see by changing to a SQL collation. A query that can be satisfied by highly selective seeks on appropriate indexes will not be sensitive to minor changes in string comparison cost. In contrast, a small amount of overhead per string comparison can add up quickly in a query that must perform a table scan and compare a particular value to each of millions of rows. If you prevent large table or index scans from the query plan by changing the indexing or the query itself, your query will perform much faster than it would if you change to a SQL collation.

Note There is a third type of collation that is a variation of a SQL collation. This third collation is known as a "compatibility collation" or an "obsolescent collation." A compatibility collation is a set of sorting and comparison rules that do not have a predefined collation name in SQL Server 2000. For example, if you set up SQL Server 7.0 with an inconsistent case-sensitivity setting for Unicode and for non-Unicode data, you will have a compatibility collation when you upgrade this instance of SQL Server 7.0 to SQL Server 2000. In the discussion earlier in this article, the information about SQL collations also applies to compatibility collations.

For more information about compatibility collations, click the following article number to view the article in the Microsoft Knowledge Base: