T-SQL ranking functions to the rescue once again!

This week I happened to assist with an internal application where there was a performance issue with a specific query. One of my observations with that query was that it was doing self-joins in order to determine the latest iteration of a specific record. That’s a common anti-pattern which can be fixed by using T-SQL’s windowing functions, such as ROW_NUMBER() or RANK().

The Inefficient Way

To give more context, here is an example from the Contoso Retail DW database. The requirement (very similar to the actual customer case) here is to obtain a list of all the customers who have placed an order with us, and for each of those customers, obtain the date of the latest order. The implementation that I first noticed was using a self-join as given below; notice the special case handling where the customer has 2 orders on the same day and then the order number (the OnlineSalesKey column here) becomes the tie-breaker:

SELECT l.CustomerKey
	,l.DateKey AS LatestOrder
FROM dbo.FactOnlineSales l
LEFT JOIN dbo.FactOnlineSales r ON l.CustomerKey = r.CustomerKey
	AND (
		l.DateKey < r.DateKey
		OR (
			l.DateKey = r.DateKey
			AND l.OnlineSalesKey > r.OnlineSalesKey
			)
		)
WHERE r.DateKey IS NULL

This query is *extremely* inefficient, burning 100% CPU on my i7 laptop with 8 logical CPUs! On my laptop it will run for well over 10 minutes before I get impatient and cancel. For reference, here is the estimated execution plan for the above query (scroll to the right and note the query cost of 167692).

badplan

Just for kicks, I ran the above query with MAXDOP 144 (NOT a good idea, but this was for fun) on our lab machine which has 144 logical CPUs and here is the picture I see there 🙂 Obviously NOT something you want in production!

144procs

Rewriting the query

Now, the efficient way to re-write this is to use Ranking Functions in T-SQL. These have been around a while (SQL 2005 actually!) but I feel they are under-utilized. Here is the query re-written using the ROW_NUMBER() function. This solution also elegantly takes care of the above tie-breaking logic which required the disjunction (OR predicate) in the previous query.

WITH NumberedOrders
AS (
	SELECT CustomerKey
		,Datekey
		,ROW_NUMBER() OVER (
			PARTITION BY CustomerKey ORDER BY Datekey DESC
				,OnlineSalesKey DESC
			) AS RowNumber
	FROM FactOnlineSales
	)
SELECT CustomerKey
	,Datekey
FROM NumberedOrders
WHERE RowNumber = 1

Here is the new execution plan. Note that the cost is also much lesser: 447.

goodplan

Here are the execution statistics of this query, it completes in 43 seconds compared the self-join approach taking forever!

  • Table ‘FactOnlineSales’. Scan count 9, logical reads 92516, physical reads 0, read-ahead reads 72663, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 31263, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • SQL Server Execution Times:
  • CPU time = 19231 ms, elapsed time = 43834 ms.

Moral of the story: next time you see query patterns involving self joins to achieve this kind of ‘latest record of a particular type’ take a pause and see if T-SQL Ranking Functions can help!

Common sub-expression elimination in SQL Server: what you need to know

In working with our customers, I come across some very interesting patterns in T-SQL. Not all of these patterns are complex on the face of it, but their impact can sometimes be substantial. Here is a simple example to demonstrate what I am referring to:

SELECT CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col1,
       CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col2
FROM   Sales.SalesOrderDetail AS SOD;

This query is clearly fictional, but it will suffice for our walkthrough here 🙂 Now here’s the execution plan for the above statement (you can also test this by using AdventureWorks2012 and above):

httpsmsdnshared.blob.core.windows.netmedia201606image922

Observe that the Person.Person table is accessed twice in the above plan. In the real customer scenario that I was looking at, the table involved was accessed using a nested loop join, for a total of 13 million seeks. And these 13 million were repeated again for the second usage of the subquery.

My attempt to rewrite the query was to use a LEFT OUTER JOIN syntax, as reflected in this simple example:

select CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END as Col1,

CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END as Col2

from Sales.SalesOrderDetail SOD

LEFT OUTER JOIN Person.Person P ON P.BusinessEntityID = SOD.ProductID

Here is the execution plan for the optimized version:

httpsmsdnshared.blob.core.windows.netmedia201606image923These two queries are functionally equivalent (because in this case there is a unique key on the BusinessEntityId column) and return the same set of results. As you can see the Person table is just accessed once in the revised query. The cost of the re-written query is also ~ 6 times lesser than the original query! This kind of optimization is referred to as Constant Subexpression Elimination. There are some strategies used by SQL Server (see this paper and this one as well for some ideas) but not every possible scenario (such as the above one) can be optimized. Also, this paper from Microsoft Research covers future ideas on the topic.

For now, it is best to review your T-SQL code for multiple instances of the same subquery and replace those with an explicit JOIN instead.