T-SQL Anti-pattern of the day: comparing DATETIME field with date-only literal

Scenario

It is a pretty common situation to have a transaction date-time stored in a DATETIME field. The problem starts with the fact that most applications use GETDATE() or some equivalent on the client side to record the order timestamp, so a typical OrderDate entry actually ends up with a time component as well. In versions prior to SQL Server 2008 this was the only option, since there was no dedicated DATE data type.

Problem

While INSERT code worked pretty adequately in this case, the problem starts with running reports of some kind. Most reports take parameters with the date (or date range) for which the reports are required. The (anti-)pattern around the WHERE clause predicates in a typical report query (based on the AdventureWorks database) would look like this:

SELECT *
FROM Sales.SalesOrderHeader
WHERE CONVERT (varchar(10), OrderDate, 112) = '20040731'

OR

SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate LIKE '31 Jul 2004%'

Both these predicates are non-optimal for usage of any index (and hopefully it would be a clustered index, to support range scans efficiently) on OrderDate:

  • The first query fails to use an index (and hence scans) because it explicitly converts the OrderDate field to an ISO-format date string (varchar) in order to perform the comparison:

[execution plan screenshot]

  • The second query has an implicit conversion to a varchar data type and hence again fails to use any indexes on OrderDate:

[execution plan screenshot]

Workaround

From a design perspective there are potentially 2 problems rolled into 1 in this scenario:

  • The fact that we are storing date and time values in the same datetime field, which is required for an OLTP system
  • The fact that the report is running on the same database as the OLTP system

In many cases the 2nd aspect (reporting and OLTP on the same database) is unavoidable due to hardware constraints. In such cases, the possible workarounds could be:

Option 1: Use explicit timestamps in the literal portion of the predicate in the WHERE clause

This one is perhaps the most practical workaround for most workloads. Here the query is re-written as follows:

SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate BETWEEN '31 Jul 2004 00:00:00' AND '31 Jul 2004 23:59:59:997'

The ensuing query plan looks much more efficient. Of course, the cost of the key lookup is due to the ‘SELECT *’ usage, and that can be overcome by either:

  • Specifying only the relevant fields in the output list OR
  • Considering a clustered index on OrderDate (this probably makes sense in a reporting-only database)

Another thing to note is the use of 997 milliseconds in the above query. Why not 999 milliseconds? Because DATETIME is only accurate to increments of .000, .003, or .007 seconds, SQL Server would round 999 up to ‘1 Aug 2004 00:00:00’. Check the linked KB article for more details.

[execution plan screenshot]
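As an aside, here is a sketch of an equivalent half-open range predicate (>= and <) that sidesteps the millisecond rounding question entirely and remains just as index-friendly:

SELECT *
FROM Sales.SalesOrderHeader
WHERE OrderDate >= '20040731' AND OrderDate < '20040801'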

Option 2: Store only the date portion

If the time portion of the order timestamp is not important, have the application INSERT the CONVERTed value (or better still, perform that conversion in the application) and store only the date portion. If you are in a position to upgrade to SQL Server 2008, the DATE data type may be of interest here.
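A quick sketch of the conversion itself: both CONVERTs use ISO style 112 (yyyymmdd), which strips the time portion before the value is stored:

-- Strips the time portion: returns the current date at midnight, e.g. 2008-01-15 00:00:00.000
SELECT CONVERT(datetime, CONVERT(varchar(8), GETDATE(), 112)) AS DateOnly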

Option 3: Have separate fields for datetime and date portion

Modify the database schema to have separate fields for OrderDateTime and OrderDate. Modify the INSERT code to insert the raw timestamp into OrderDateTime, and a CONVERTed date-only value into the OrderDate field. Use OrderDate for reports and OrderDateTime for other queries that need the time.
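A rough sketch of what that INSERT might look like; the dbo.Orders table and the OrderNumber value are made up for illustration, with OrderDateTime and OrderDate as described above:

-- Hypothetical schema: OrderDateTime keeps the raw timestamp, OrderDate keeps the date-only copy
DECLARE @now datetime
SET @now = GETDATE()

INSERT INTO dbo.Orders (OrderNumber, OrderDateTime, OrderDate)
VALUES (N'SO12345', @now, CONVERT(datetime, CONVERT(varchar(8), @now, 112)))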

Conclusion

Separating reporting from OLTP processing is the key design aspect to always be kept in mind. In situations where that is not possible, some refactoring of the query can help. Table denormalization, coupled with minor changes in the application can also help in some situations to optimize query execution.

There may be other innovative solutions to the problem, and I am eager to hear from any readers about those comments!

T-SQL Anti-pattern of the day: UPPER() and LOWER()

This one is even more common and therefore has a high impact: the usage of the above functions in WHERE clause predicates. I’ve examined each in some detail below.

Here’s a typical example I see, based off the AdventureWorks database.

Scenario

To illustrate the effect of this example, I have added an extra index on the FirstName column in Person.Contact table:

CREATE NONCLUSTERED INDEX NC_Person_Contact_FirstName
ON Person.Contact(FirstName)

Here’s a typical query pattern coming in from the client:

DECLARE @FirstName nvarchar(50)
SELECT @FirstName = ‘pilar’

SELECT *
FROM Person.Contact
WHERE UPPER(FirstName) = UPPER(@FirstName)

Problem

Let’s take a look at the execution plan for this batch:

[execution plan screenshot]

As expected there is a scan happening, due to the presence of UPPER() on the FirstName column. Interestingly, the optimizer chose to scan the non-clustered index rather than the clustered one, because scanning the narrower index on FirstName is cheaper than scanning the clustered index. Either way, it is still far from optimal.

Workarounds

  • If your server / database / column is set to use a case-insensitive collation (if you don’t know what a collation is, I suggest you start with this Books Online entry first), then you really do not need any of the UPPER() or LOWER() function calls, because the execution engine will perform a case-insensitive comparison anyway. Most customers I work with use the US English installation and the default collation of SQL_Latin1_General_CP1_CI_AS; for those customers, this workaround would be safe. (A quick way to check which collation actually applies to a column is shown after this list.)

Is it a good one by design? The answer, as with many things in SQL Server, is: it depends. If your installation procedures are standardized and repeatable, and if you are using the default collation (or any other case-insensitive one), then this workaround may very well turn into a by-design solution. In other cases, this workaround may not solve the problem.

  • A more robust solution would be to transform (in this case, to uppercase) the values being stored in these tables (in our case, FirstName) at INSERT time. That way, the predicate in the SELECT query can change to the following:

FirstName = UPPER(@FirstName)
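For reference, here is a minimal sketch (assuming the AdventureWorks Person.Contact table used above) of how to check which collation actually applies to the column, along with the server and database defaults:

-- Collation in effect for the FirstName column
SELECT name, collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID(N'Person.Contact')
  AND name = N'FirstName'

-- Server and database defaults, for comparison
SELECT SERVERPROPERTY('Collation') AS server_collation,
       DATABASEPROPERTYEX(N'AdventureWorks', 'Collation') AS database_collation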

Conclusion

Careful thought needs to go into design and development of databases where strings are being compared. Using a case-insensitive collation is sometimes the best way to avoid some of the problems described above. However if you have a case-sensitive collation, performing these operations up front in client code at INSERT time is perhaps the best way to negate the performance side effects of using the UPPER() / LOWER() predicates in WHERE clauses.

T-SQL Anti-pattern of the day: Prefix codes stored in (n)varchar fields

In the course of my work as a SQL Server Premier Field Engineer, we tend to see a lot of patterns in code. Some of these patterns do not lend themselves to optimal performance. This post is the first in a series of such ‘how not to code’ tips which would hopefully help some T-SQL developers out there.

Scenario

OK. So imagine a situation where there’s a table like this:

CREATE TABLE Orders
(
    OrderNumber nvarchar(50) PRIMARY KEY NOT NULL,
    ShipAddress nvarchar(255),
    IsDelivered    bit
)

Now in itself the table design looks innocuous except that the OrderNumber is encoded as per a scheme which looks like this:

  • First 3 characters: BranchId which generated this order
  • Next 12 characters: Date and time encoded as YYYYMMDDHHmm
  • Next 35 characters would accommodate a running number reset each day

Because of the fact that the PRIMARY KEY constraint is by default implemented as a clustered index, we get the following index on OrderNumber:

PK__Orders__0AD2A005    clustered, unique, primary key located on PRIMARY    OrderNumber

Problem

SELECT queries often run on this table which list the orders by branch. Those typically look like this:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE SUBSTRING(OrderNumber, 1, 3) = N'004'

Not surprisingly, we get a scan even though there is an index on OrderNumber. Note the Parallelism (Gather Streams) operator as well; it appears because the cost of the overall query plan exceeded the default ‘cost threshold for parallelism’ (5).

[execution plan screenshot]

This is a very common pattern I observe, especially on transactional tables which have a large number of enquiries (SELECTs) on them. The resultant logical reads due to the index scans cause poor response times. In many cases the cost of the scan exceeds the cost threshold for parallelism and further compounds the problem.

Root Cause

So, what is the problem here? Evidently it is a problem of database design. Storing such encoded values in a field is a violation of the First Normal Form in database design (http://support.microsoft.com/kb/283878). So the right solution would involve taking this encoded BranchId and placing it in another field.

Workaround

However, it is not easy to implement this overnight in most systems which are in production. For the above specific query, there is a way to improve performance, though. The SUBSTRING is used to strip the first 3 characters and then compare with an input parameter. This can be replaced with an equivalent query such as:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE OrderNumber LIKE N'004%'

The resultant execution plan looks like this, with a dramatically reduced cost now that the index is being used to do a seek:

[execution plan screenshot]

Of course there are other possibilities involving indexed views / computed columns but those I will examine in another post.
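As a teaser, here is one possible sketch of the computed column idea, using the Orders table above (the column and index names are my own, made up for illustration):

-- Surface the encoded BranchId as a persisted computed column and index it
ALTER TABLE Orders
ADD BranchId AS CAST(LEFT(OrderNumber, 3) AS nchar(3)) PERSISTED
GO

CREATE NONCLUSTERED INDEX NC_Orders_BranchId ON Orders (BranchId)
GO

-- Reports can then seek on BranchId directly
SELECT OrderNumber, ShipAddress
FROM Orders
WHERE BranchId = N'004'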

Conclusion

Database design conforming to the normal forms is generally a good idea to drive usage of indexes and have acceptable response times. In some cases where queries use a predicate comparing the prefix of values stored in a field, it may be worthwhile examining the use of LIKE as compared to a SUBSTRING.

SQL Worker Thread Stack Sizes

Recently, during some internal discussions, we had a lot of talk about the stack sizes for various SQL Server builds. Here is a summary of that discussion:

 

SQL Server Build                    OS Build                  Stack Size
SQL Server x86 (32-bit)             x86 (32-bit)              512 KB
SQL Server x86 (32-bit)             x64 (64-bit)              768 KB (see notes 1 and 2)
SQL Server x64 (64-bit)             x64 (64-bit)              2048 KB
SQL Server IA64 (64-bit Itanium)    IA64 (64-bit Itanium)     4096 KB (see note 3)

 

Notes:

  1. This (wow64 stack size) is confirmed from http://www.microsoft.com/technet/prodtechnol/sql/2005/diagandcorrecterrs.mspx: SQL Server 2000 running on X64 in WOW mode also uses more memory per worker (768 KB = 512 KB for standard stack + 256 KB used by WOW to support the stack).
  2. Also check http://support.microsoft.com/kb/916950 for an issue related to SQL 2000 x86 (really this should be the only case where this configuration might be used) running in WOW64 mode on x64 OS. http://msdn.microsoft.com/en-us/library/aa384219(VS.85).aspx mentions the overhead which is required for a 32-bit process running inside WOW64.
  3. Slava Oaks blogged specifically about this http://blogs.msdn.com/slavao/archive/2005/03/19/399117.aspx
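As a rough, back-of-the-envelope sketch (my own addition, not part of the original discussion), you can combine the worker count exposed by sys.dm_os_sys_info with the stack size for your build from the table above to estimate the address space reserved for worker thread stacks:

-- @stack_kb is an assumption: pick the value for your build/OS combination from the table above
DECLARE @stack_kb int
SET @stack_kb = 2048   -- example: x64 SQL Server on x64 Windows

SELECT max_workers_count,
       (max_workers_count * @stack_kb) / 1024 AS approx_stack_reservation_mb
FROM sys.dm_os_sys_info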

I hope this summary helps someone out there!

Using PowerShell to cleanup older versions of files

I only recently started using PowerShell and was amazed with the flexibility and ease of use. Here’s a sample script I put together after a few hours of playing around. I have also described the scenario around the requirement for this script.

Task at hand: A folder contains many files such as:

  • File1v1
  • File2v1
  • File2v2
  • ABCv3
  • ABCv4
  • ABCv5
  • MyFile9
  • MyFile15
  • MyFile21

The requirement was to keep only the latest version of each file (and therefore identify the older files to move). In my case, I have over 400 such files at any given time (and the number will grow.) The only way to distinguish (in my specific case) which file was the ‘latest’ was to use the numeric suffix at the end of each file (timestamps etc. are not reliable for my requirement.)

So to automate this task, I figured I’d try to learn PowerShell and put it to use. Here is the script I used:

################# IdentifyOlderVersions.ps1 ######################

# Enumerate the files, sorted by name so that versions of the same file are adjacent
$files = dir X:\myFolder\*.* | Sort-Object

$prevfile = ""
$filenamewithoutsuffix = ""
$previousversions = @()

foreach ($file in $files)
{
    # Strip the trailing numeric suffix to get the file 'lineage' name
    $filenamewithoutsuffix = ($file.BaseName -split "\d+$")[0]
    if ($filenamewithoutsuffix -eq $prevfile)
    {
        # same file 'lineage', add the current file to the array
        $previousversions = $previousversions + $file
    }
    else
    {
        # new lineage: list the older versions of the previous lineage, then reset $previousversions
        $listoffilestobemoved = ""
        $tmpindex = 0
        foreach ($filetobemoved in $previousversions)
        {
            # Skip the last entry (the latest version), so only the older files are listed for moving
            if ($tmpindex -eq ($previousversions.length - 1))
            {
                break
            }

            $listoffilestobemoved = $listoffilestobemoved + "," + $filetobemoved
            $tmpindex++
        }
        $previousversions = @()
        $previousversions = $previousversions + $file
        if ($listoffilestobemoved -ne "")
        {
            Write-Host $listoffilestobemoved
        }
    }

    # in both cases, remember the current lineage name for the next iteration
    $prevfile = $filenamewithoutsuffix
}

################# IdentifyOlderVersions.ps1 ######################

This worked like a charm and, as you can see, some fairly involved scripts can be put together quickly. You can also use .NET classes from within a script to achieve more than what the built-in cmdlets normally allow.

 

I encourage you to try using PowerShell for all such tedious tasks. There are some great hints and information (including a link to a free course book) at http://www.microsoft.com/technet/scriptcenter/topics/winpsh/toolbox.mspx. The starting point for most of us should be http://www.microsoft.com/technet/scriptcenter/topics/winpsh/pshell2.mspx. Get the latest release (CTP2) and start playing around!

‘Interesting Events’ from ReadTrace

I’m sure many of you are aware of the wonderful RML Tools (description here and x86 download here) and specifically the ReadTrace tool. Maybe you are also aware of SQLNexus which relies on ReadTrace for the trace analysis.

At some point when you use these tools, you will come across the ‘Interesting Events’ table (ReadTrace.tblInterestingEvents in the SQLNexus database), which is populated with various errors and warnings. While ReadTrace comes with a way to view this information at a summary level, one thing it does not easily provide is the ability to tie sample batch text to the ‘interesting event’. Strictly speaking, the ReadTrace schema does not provide a ‘by-design’ way to query this information; however, a best-effort match can be made by taking the time at which the ‘interesting event’ was generated and using it to query tblBatches. Here is one such query I use to dig for clues as to which batches led to such events:

 

SELECT TEC.name, TUB.OrigText, TB1.Cpu
FROM ReadTrace.tblInterestingEvents TIE
JOIN ReadTrace.tblBatches TB1
    ON TIE.SPID = TB1.SPID and TIE.StartTime >= TB1.StartTime AND TB1.EndTime >= TIE.StartTime
JOIN ReadTrace.tblBatches TB2
    ON TB1.BatchSeq = TB2.BatchSeq AND TB1.BatchSeq >= TB2.BatchSeq
JOIN ReadTrace.tblUniqueBatches TUB
    ON TB1.HashID = TUB.HashID
JOIN sys.trace_events TEC
    ON TEC.trace_event_id = TIE.EventID
WHERE name = 'Hash Warning'

 

Of course, the ‘Hash Warning’ (more information about it available here) event name mentioned above could be replaced with others, such as ‘Sort Warning’, ‘Exception’ etc.

NOT NULL and SWITCH TO PARTITION

Today I had to execute an ALTER TABLE… SWITCH TO PARTITION… operation to switch a non-partitioned table into a partition of a partitioned table. I had taken care of the normal documented prerequisites:

  • Same schema for both tables

  • A CHECK constraint on the non-partitioned table to ensure its values match the range of the destination partition function

Despite these, I ran into the following error:

Msg 4982, Level 16, State 1, Line 1
ALTER TABLE SWITCH statement failed. Check constraints of source table 'foodb.dbo.unparttable' allow values that are not allowed by range defined by partition 5 on target table 'foodb.dbo.parttable'.

After some troubleshooting, it was apparent that the issue was due to the following note in the BOL topic ‘Transferring Data Efficiently by Using Partition Switching’:

We recommend that you specify NOT NULL on the partitioning column of partitioned tables, and also nonpartitioned tables that are sources or targets for ALTER TABLE…SWITCH operations. By doing this, you make sure that any CHECK constraints on partitioning columns do not have to check for null values. Null values are ordinarily placed in the leftmost partition of a partitioned table. When you are switching any partition other than the leftmost partition and when the ANSI_NULLS database option is set to ON, the absence of a NOT NULL constraint on the source and target tables might interfere with any CHECK constraints that are also defined on the partitioning column.

Little wonder – in my case, I did not have a NOT NULL constraint on the partitioning column. After correcting that, the ALTER TABLE… SWITCH TO PARTITION worked just fine!
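For the record, here is a hedged sketch of the kind of fix involved; the partitioning column name, constraint name and boundary values are made up for illustration, and the exact range to check depends on how the partition function is defined (RANGE LEFT vs. RANGE RIGHT):

-- Partitioning column must disallow NULLs on the source table (PartitionCol is hypothetical)
ALTER TABLE dbo.unparttable
ALTER COLUMN PartitionCol datetime NOT NULL

-- CHECK constraint matching the range of the destination partition (boundaries are illustrative)
ALTER TABLE dbo.unparttable WITH CHECK
ADD CONSTRAINT CK_unparttable_range
CHECK (PartitionCol >= '20080101' AND PartitionCol < '20080201')

-- Switch the non-partitioned table into partition 5 of the partitioned table
ALTER TABLE dbo.unparttable
SWITCH TO dbo.parttable PARTITION 5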

Exploring LINQ

I just got some time to look into newer things, and one of the most intriguing is LINQ (Language INtegrated Query). I’m just sharing some useful links I came across over a night of browsing for information:



And some interesting posts on performance, which is always dear to my heart:



I’ll post more once I play around more. In case you are keen on exploring, you should download the 90-day evaluation version of Visual Studio.NET 2008 (http://www.microsoft.com/downloads/details.aspx?FamilyID=83C3A1EC-ED72-4A79-8961-25635DB0192B&displaylang=en) and play around!


One last thing which is quite neat is that you can also get the MSDN Library for VS.NET 2008 from http://www.microsoft.com/downloads/details.aspx?FamilyID=6ff3bc60-32c8-4c22-8591-a20bf8dff1a2&DisplayLang=en. Be aware though that both these downloads are very hefty.


Happy New Year 2008!

Connectivity related questions

Some time back, a few connectivity-related questions were thrown at me:


Q1. We know about the Dedicated Admin Connection (DAC) and how it is useful for troubleshooting service hang issues. We also know that DAC should only be used when necessary. Can we find out if someone is misusing the DAC and staying connected all the time?


A. You can use the following query from a normal query window to check:

select * from sys.dm_exec_sessions
where endpoint_id IN
(
    select endpoint_id from sys.tcp_endpoints where is_admin_endpoint = 1
)


Do note that this assumes that the service is working normally and that you can connect normally 🙂


Q2. “I have disabled the named pipes protocol on server but I still find the server ‘listening’ on named pipes. See errorlog snippet below for what I mean”:



2007-05-23 11:36:56.69 Server      Server local connection provider is ready to accept connection on [ \\.\pipe\SQLLocal\INSTANCE1 ].



2007-05-23 11:36:56.69 Server      Server local connection provider is ready to accept connection on [ \\.\pipe\MSSQL$INSTANCE1\sql\query ].


The “local connection provider” refers to the shared memory protocol. If named pipes is genuinely enabled, you will receive an entry similar to: 


2006-05-09 07:37:12.57 Server      Server named pipe provider is ready to accept connection on [ \\.\pipe\MSSQL$INSTANCE1\sql\query ].


So, do not worry about it – named pipes is not enabled 🙂


Q3. “I have configured a fixed TCP port for SQL to listen to but I still find entries in the errorlog which seem to indicate dynamic ports being used:”



2007-05-23 12:27:42.84 Server      Server is listening on [ 127.0.0.1 <ipv4> 1115].


To disable dynamic ports, use a blank value (remove the 0) for the TCP Dynamic Ports entry. However, you will still see such an entry for the DAC, like this:


2007-06-15 17:03:16.63 Server      Server is listening on [ 127.0.0.1 <ipv4> 1509].


2007-06-15 17:03:16.63 Server      Dedicated admin connection support was established for listening locally on port 1509.


Hope these answers are useful!

The role of STOPAT in RESTORE DATABASE

In SQL Server 2000, the STOPAT clause was provided for RESTORE LOG, wherein you can ‘halt’ the recovery process at a particular point in time. This feature is very useful for recovering from accidental user errors and the like.


Now in SQL Server 2005, there is a STOPAT clause provided in the RESTORE DATABASE command as well. The first time I saw this option I was confused, and wondered how we could provide a point-in-time recovery with a full database backup, because the full database backup is essentially an image of the database and does not contain the entire transaction log (only the portion generated while the backup itself was running). But I never thought about it much after that.


So when a customer asked me how this feature works, I dug in a little bit and here’s what I found:



  • Without log backups, there is simply no way to achieve a true point-in-time restore

  • What we do when we specify STOPAT in RESTORE DATABASE is to test if the full database backup is already ahead of the point in time you want to stop at. If it is, then the command errors out.

  • The reason it errors out is to give you a clear signal: even if you restored this full backup and then restored subsequent transaction log backups on top (specifying STOPAT), those log restores would fail, because the database would already be rolled forward to a point in time ahead of the STOPAT

  • In that case, you would have wasted a lot of time and effort restoring the database and then the transaction log backups, only to find out that you should have started with an earlier full backup. (A sketch of the typical restore sequence is shown after this list.)
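For completeness, here is a hedged sketch of a point-in-time restore sequence; the database name, file paths and target time are made up for illustration:

-- STOPAT on RESTORE DATABASE acts as a safety check: it fails early if the full backup
-- already covers a point past the requested time
RESTORE DATABASE MyDB
FROM DISK = N'X:\Backups\MyDB_full.bak'
WITH NORECOVERY, STOPAT = '2008-01-15T10:30:00'

RESTORE LOG MyDB
FROM DISK = N'X:\Backups\MyDB_log1.trn'
WITH RECOVERY, STOPAT = '2008-01-15T10:30:00'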

So, STOPAT in RESTORE DATABASE can be a little disappointing if you are not aware of the real purpose of this clause. Please review http://msdn2.microsoft.com/en-us/library/ms191468(SQL.90).aspx which explains this in detail.