Normalizing T-SQL text, part 1: using the RML Utilities and the DMVs

A common problem when dealing with workloads which issue ad-hoc SQL commands (i.e. without parameterization) is to find out the ‘normalized’ version of the pattern. For example, these three statements are essentially the same ‘template’:

SELECT BusinessEntityId FROM Person.Person WHERE LastName = ‘Smith’
SELECT BusinessEntityId FROM Person.Person WHERE LastName = ‘Singh’
SELECT BusinessEntityId FROM Person.Person WHERE LastName = ‘Lee’

And so are these three, except that they are a distinctly new pattern:

SELECT * FROM Person.person WHERE LastName IN (‘Smith’)
SELECT * FROM Person.person WHERE LastName IN (‘Smith’, ‘Singh’)
SELECT * FROM Person.person WHERE LastName IN (‘Smith’, ‘Singh’, ‘Lee’)

The requirement therefore is to detect only the common ‘template’ / ‘normalized’ text – in most cases a representative example of the ad-hoc statement. In this post, I will show you the classic ways to do that: using the RML Utilities (specifically using ReadTrace) or using the query_hash value in the DMVs.

Option 1: Using the RML Utilities / ReadTrace

The classic way to do with profiler traces of course, was to capture a trace file (using server side tracing preferably) and then run that trace through the ReadTrace utility. The neat thing about this utility is that it would ‘aggregate’ similar patterns into hash buckets and then display statistics for each hash ID. To do that successfully, it requires a SQL instance and database.

Here is a sample command line for using this utility:

readtrace -Ic:temptest.trc -S. -dTest -f

The above imports the trace (from c:temptest.trc) into a database called Test on my local default instance of SQL 2012. Once that is done, you can work your magic on the various tables which are created by this utility and extract the unique batches:

SELECT OrigText, HashID
FROM ReadTrace.tblUniqueBatches

Here is the sample output from my test, where I had issued those 6 statements (each delimited by a GO) which were shown at the beginning of this post:

SELECT BusinessEntityId FROM Person.Person WHERE LastName = ‘Smith’ => hash value of -8759399170292131263
select * from Person.person where LastName IN (‘SMith’) => hash value of 1644287078950160061

So the good news is that this utility correctly found the underlying patterns very accurately, but it also requires extra efforts (to provision a SQL instance and gather a trace file.) One other advantage with the ReadTrace utility is that it also supports importing Extended Event Logs (XEL) files as well (though that feature is still in Beta.)

For more details on the normalization capabilities within ReadTrace, Download RML Utilities for SQL Server (x64) and then read the PDF help file. The various patterns are described in the section ‘Normalization Patterns’.

Option 2: Using the query_hash value

Starting SQL Server 2008 and above you can use the query_hash functionality to isolate these common patterns. Here is an example of a query which can do that and list one sample query for each ‘pattern’:

select * from
(select ST.text, ROW_NUMBER() OVER (PARTITION BY query_hash ORDER BY query_hash) Ranking from sys.dm_exec_query_stats QS
cross apply sys.dm_exec_sql_text (QS.sql_handle) ST
where objectid IS NULL
and ST.encrypted = 0
) AS Q
where Q.Ranking = 1

While this is easy to use and available out of the box, it does not work for SQL 2005 or obviously not for SQL 2000 (why do I even bring up SQL 2000, given that it is out of support? When upgrading those few workloads from SQL 2000 to higher versions, it is appropriate to tune some of this stuff by replacing them with parameterized queries.)

Also, there is another problem with IN predicates. When there is a varying set of literals passed to the IN predicate, it turns out that the query_hash is different. Here is an example:

select * from Person.person where LastName IN (‘SMith’, ‘Singh’, ‘Lee’) => hash value 0x6FAFADDF37AA7683
select * from Person.person where LastName IN (‘SMith’) => hash value 0x77A1391010F61AE7
select * from Person.person where LastName IN (‘SMith’, ‘Singh’) => hash value 0x9BEA60D8D6595AA5

Summary

I’d like to quickly compare these two approaches and then setup what’s coming next.

ReadTrace	query_hash
Offline method Can consume TRC or XEL files Supports SQL Server versions 2000, 2005 and 2008, 2008 R2, 2012 and SQL 2014 CTP2 (version 9.04.0004 as of time of writing this post) Correctly handles many patterns including IN lists	Online method Requires SQL 2008 or above Dependent on the query plan cache Some limitations such as not aggregating IN predicate lists correctly

In the next part I will show you how you can cook your own ‘aggregator’ / ‘normalizer’ using the Transact-SQL ScriptDom parser and then compare it with these two approaches. Stay tuned!

Arvind Shyamsundar's technical blog

Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.

Normalizing T-SQL text, part 1: using the RML Utilities and the DMVs

Option 1: Using the RML Utilities / ReadTrace

Option 2: Using the query_hash value

Summary

Leave a comment Cancel reply

Option 1: Using the RML Utilities / ReadTrace

Option 2: Using the query_hash value

Summary

Share this:

Related

Leave a comment Cancel reply