November | 2008 | Arvind Shyamsundar's technical blog

This one is even more common and therefore has a high impact: the use of UPPER() and LOWER() in WHERE clause predicates. I’ve examined the pattern in some detail below.

Here’s a typical example I see, based off the AdventureWorks database.

Scenario

To illustrate the effect of this pattern, I have added an extra index on the FirstName column in the Person.Contact table:

CREATE NONCLUSTERED INDEX NC_Person_Contact_FirstName
ON Person.Contact(FirstName)

Here’s a typical query pattern coming in from the client:

DECLARE @FirstName nvarchar(50)
SELECT @FirstName = 'pilar'

SELECT *
FROM Person.Contact
WHERE UPPER(FirstName) = UPPER(@FirstName)

Problem

Let’s take a look at the execution plan for this batch:

As expected, there is a scan happening, because wrapping the FirstName column in UPPER() prevents the optimizer from seeking on the index. Interestingly, the optimizer chose to scan the non-clustered index rather than the clustered one. It did that because scanning the narrower index on FirstName is cheaper than scanning the clustered index. Either way, it is still far from optimal.

Workarounds

If your server / database / column is set to use a case-insensitive collation (if you don’t know what a collation is, I suggest you start with the Books Online entry on collations first), then you really do not need any of the UPPER() or LOWER() function calls, because in that case the execution engine will perform a case-insensitive comparison anyway. Most customers I work with use the US English installation and the default collation of SQL_Latin1_General_CP1_CI_AS. For those customers, this workaround would be safe.

Is it a good one by design? The answer, as with many things in SQL Server, is “it depends”. If your installation procedures are standardized and repeatable, and if you are using the default collation (or any other case-insensitive one), then this workaround may very well turn into a by-design solution. In other cases, such as a case-sensitive collation, this workaround will not solve the problem.
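To decide whether the workaround applies, it helps to check which collation is actually in effect. The batch below is a minimal sketch against the AdventureWorks table used in this post. It looks up the server, database and column collations, then runs the query with the UPPER() calls removed. It assumes the collation name it finds contains _CI_ (case-insensitive); if it contains _CS_, the simplified predicate would not match differently cased values.

```sql
-- Server-level default collation
SELECT SERVERPROPERTY('Collation') AS ServerCollation;

-- Default collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;

-- Collation of the FirstName column itself (this is the one that
-- governs the comparison in the WHERE clause below)
SELECT c.name AS ColumnName, c.collation_name
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'Person.Contact')
  AND c.name = N'FirstName';

-- Assuming a case-insensitive (_CI_) column collation:
-- no UPPER() needed, so the index on FirstName can be used for a seek
DECLARE @FirstName nvarchar(50);
SET @FirstName = N'pilar';

SELECT *
FROM Person.Contact
WHERE FirstName = @FirstName;
```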

A more robust solution would be to transform (in this case, to uppercase) the values being stored in these tables (in our case, FirstName) at INSERT time. That way, the predicate in the SELECT query can change to the following:

FirstName = UPPER(@FirstName)
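As a rough sketch of that approach, the batch below uses a hypothetical table (dbo.ContactNames, its columns and its index are made up for illustration and are not part of AdventureWorks) with an explicitly case-sensitive collation. The value is uppercased once when it is written, so the function at query time is applied only to the variable and never to the column.

```sql
-- Hypothetical table for illustration only; the names are assumptions
CREATE TABLE dbo.ContactNames
(
    ContactID int IDENTITY(1,1) PRIMARY KEY,
    FirstName nvarchar(50) COLLATE Latin1_General_CS_AS NOT NULL
);

CREATE NONCLUSTERED INDEX NC_ContactNames_FirstName
ON dbo.ContactNames(FirstName);

DECLARE @FirstName nvarchar(50);
SET @FirstName = N'pilar';

-- Normalize the case once, at INSERT time
INSERT INTO dbo.ContactNames (FirstName)
VALUES (UPPER(@FirstName));

-- The column is left untouched in the predicate, so the index
-- on FirstName remains usable for a seek
SELECT ContactID, FirstName
FROM dbo.ContactNames
WHERE FirstName = UPPER(@FirstName);
```

The trade-off is that the original casing of the name is lost unless it is also stored in a separate column.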

Conclusion

Careful thought needs to go into the design and development of databases where strings are being compared. Using a case-insensitive collation is sometimes the best way to avoid the problems described above. However, if you have a case-sensitive collation, normalizing the case up front in client code at INSERT time is perhaps the best way to negate the performance side effects of using the UPPER() / LOWER() functions in WHERE clause predicates.

In the course of my work as a SQL Server Premier Field Engineer, I tend to see a lot of patterns in code. Some of these patterns do not lend themselves to optimal performance. This post is the first in a series of such ‘how not to code’ tips, which will hopefully help some T-SQL developers out there.

Scenario

OK. So imagine a situation where there’s a table like this:

CREATE TABLE Orders
(
    OrderNumber nvarchar(50) PRIMARY KEY NOT NULL,
    ShipAddress nvarchar(255),
    IsDelivered    bit
)

Now in itself the table design looks innocuous except that the OrderNumber is encoded as per a scheme which looks like this:

First 3 characters: BranchId which generated this order
Next 12 characters: Date and time encoded as YYYYMMDDHHmm
Next 35 characters would accommodate a running number reset each day
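To make the scheme concrete, here is a small sketch that builds one sample OrderNumber and splits it back into its three parts by position. The sample values, and the assumption that the running number is left-padded with zeros to 35 characters, are mine and purely illustrative.

```sql
DECLARE @OrderNumber nvarchar(50);

-- Branch '004', 15 Nov 2008 14:30, running number 17
-- (zero padding to 35 characters is an assumption)
SET @OrderNumber = N'004'
                 + N'200811151430'
                 + RIGHT(REPLICATE(N'0', 35) + N'17', 35);

SELECT
    SUBSTRING(@OrderNumber, 1, 3)   AS BranchId,        -- characters 1 to 3
    SUBSTRING(@OrderNumber, 4, 12)  AS OrderTimestamp,  -- characters 4 to 15
    SUBSTRING(@OrderNumber, 16, 35) AS RunningNumber;   -- characters 16 to 50
```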

Because the PRIMARY KEY constraint is implemented as a clustered index by default, we get the following index on OrderNumber:

PK__Orders__0AD2A005 clustered, unique, primary key located on PRIMARY OrderNumber

Problem

SELECT queries that list the orders by branch often run on this table. They typically look like this:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE SUBSTRING(OrderNumber, 1, 3) = N'004'

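Not surprisingly, we get a scan of the entire table (a clustered index scan, since the primary key is clustered), even though there is an index on OrderNumber: wrapping the column in SUBSTRING() prevents the optimizer from seeking on it. Note the parallelism (Gather Streams) operator as well. That’s because the estimated cost of the overall query plan exceeded the default ‘cost threshold for parallelism’ (5).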

This is a very common pattern I observe, especially on transactional tables which have a large number of enquiries (SELECTs) on them. The logical reads caused by the index scans result in poor response times. In many cases the cost of the scan exceeds the cost threshold for parallelism, so the query runs in parallel and consumes additional CPU, which further compounds the problem.

Root Cause

So, what is the problem here? Evidently it is a problem of database design. Storing such encoded values in a field is a violation of the First Normal Form in database design (http://support.microsoft.com/kb/283878). So the right solution would involve taking this encoded BranchId and placing it in another field.
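A minimal sketch of what that could look like is below. The table name, the BranchId column type and the index name are all assumptions for illustration. OrderNumber is left unchanged here to keep the sketch short, whereas a full redesign would also remove the branch prefix from it.

```sql
-- Hypothetical normalized variant of the Orders table;
-- all new names are assumptions, not from the original system
CREATE TABLE OrdersNormalized
(
    OrderNumber nvarchar(50) PRIMARY KEY NOT NULL,
    BranchId    nchar(3) NOT NULL,   -- the branch code in its own field
    ShipAddress nvarchar(255),
    IsDelivered bit
);

CREATE NONCLUSTERED INDEX NC_OrdersNormalized_BranchId
ON OrdersNormalized(BranchId);

-- The predicate now compares a plain column, with no function around it
SELECT OrderNumber, ShipAddress
FROM OrdersNormalized
WHERE BranchId = N'004';
```

Whether the optimizer actually seeks on the new index depends on how selective a branch is, because ShipAddress still has to be fetched from the clustered index for each matching row.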

Workaround

However, it is not easy to implement this overnight in most systems which are in production. For the specific query above, there is a way to improve performance, though. The SUBSTRING is used to extract the first 3 characters, which are then compared with an input parameter. This can be replaced with an equivalent query such as:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE OrderNumber LIKE N'004%'

The resultant execution plan looks like this, with a dramatically reduced cost now that the index is being used to do a seek:

Of course there are other possibilities involving indexed views / computed columns but those I will examine in another post.
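Since the branch usually arrives as an input parameter rather than a literal, the LIKE workaround can be sketched with a variable as well. The variable name @BranchId is an assumption. The pattern is built by appending the wildcard to the parameter, so OrderNumber itself is still not wrapped in any function.

```sql
DECLARE @BranchId nvarchar(3);
SET @BranchId = N'004';

-- Prefix match: the wildcard is only at the end of the pattern
SELECT OrderNumber, ShipAddress
FROM Orders
WHERE OrderNumber LIKE @BranchId + N'%';
```

This assumes branch codes never contain LIKE wildcard characters (%, _ or [); if they could, those characters would need to be escaped before the pattern is built.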

Conclusion

Database design conforming to the normal forms is generally a good idea to drive usage of indexes and have acceptable response times. In some cases where queries use a predicate comparing the prefix of values stored in a field, it may be worthwhile examining the use of LIKE as compared to a SUBSTRING.


Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.


T-SQL Anti-pattern of the day: UPPER() and LOWER()

T-SQL Anti-pattern of the day: Prefix codes stored in (n)varchar fields