Top Support Solutions for SQL Server

One of the teams within the Customer Support organization at Microsoft has been busy aggregating information about the most common and useful solutions for many products, including SQL Server. You can view their work at this blog post.

The categories they cover in that post include:

  • Performance
  • Failover Clustering
  • Crashes and Exceptions
  • Backup & Restore
  • Connectivity

It is a good collection of key articles in one place. I encourage you to bookmark it!

Normalizing T-SQL text, part 2: using the TransactSql.ScriptDom classes

Happy New Year 2014 to all of you! With SQL Server 2014 set to release this year, I’m sure you are all excited about the months to come.

In my previous post I reviewed the classic ways of obtaining ‘normalized’ text for ad-hoc SQL queries. Do take a minute to glance at that post in case you have not already done so. Both of the methods described there are dynamic – they need an active workload to operate upon. So if you have a static set of queries captured somewhere – such as a .SQL file – you need an alternate method.

Algorithm

If you think about it, the core of normalizing these ad-hoc query text patterns is to identify literals and replace them with a generic / common value. Once the specific literal values are replaced with their generic ‘placeholders’, it becomes a relatively easy task to identify commonalities.

To identify commonalities we propose to use a hashing algorithm, conceptually similar to the one used in the previous approaches. However, when computing hashes, there is another problem to consider: minor differences in whitespace or in the case of the query text will cause different hash values to be computed for essentially identical text.

ScriptDom Implementation

The first thing to consider is what kind of literals we would replace. In the ScriptDom class hierarchy, we have the following derived classes for the parent Literal class:

  • IntegerLiteral: whole numbers
  • NumericLiteral: decimal numbers such as 0.03
  • RealLiteral: numbers written with scientific notation such as 1E-02
  • MoneyLiteral: values prefixed with currency symbol such as $12
  • BinaryLiteral: such as 0xABCD1234
  • StringLiteral: such as ‘Hello World’
  • NullLiteral: the NULL value
  • DefaultLiteral: the DEFAULT keyword
  • MaxLiteral: the MAX specifier
  • OdbcLiteral: ODBC formatted literal such as { T ‘blah’ }
  • IdentifierLiteral: a ‘special’ case when an identifier is used as a literal. I’ve never seen a real-world example of this.

We need to keep this in mind when we write the visitor to traverse through the AST.

Visitor definition

Next, we need to set up our visitor. We will use the visitor pattern to do this, implementing overridden methods to handle the various types of literals described above. For each type of literal, we replace the value of the literal with a fixed, generic value. Here is an example for the RealLiteral visitor method:

public override void ExplicitVisit(RealLiteral node)
{
    node.Value = "0.5E-2";
    base.ExplicitVisit(node);
}
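The same pattern extends to the other literal types. Here is a sketch of how the rest of the class might look – it derives from TSqlFragmentVisitor, and the placeholder values are arbitrary choices of mine:

class myvisitor : TSqlFragmentVisitor
{
    // Each override replaces the parsed value with a fixed placeholder, so that
    // batches differing only in their literals normalize to identical text.
    public override void ExplicitVisit(IntegerLiteral node)
    {
        node.Value = "1";
        base.ExplicitVisit(node);
    }

    public override void ExplicitVisit(StringLiteral node)
    {
        node.Value = "str";
        base.ExplicitVisit(node);
    }

    // ... similar overrides for the remaining Literal subclasses listed earlier ...
}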

Visitor invocation

For performance reasons we call the visitor from a Parallel.ForEach loop, which makes efficient use of multi-core CPUs:

Parallel.ForEach(
    (frag as TSqlScript).Batches,
    batch =>
    {
        myvisitor visit = new myvisitor();

        batch.Accept(visit);

        // ... script regeneration and hashing (shown below) also happen here,
        // inside the per-batch lambda ...
    });

This way, each T-SQL batch in the original script is visited on a separate thread.
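In case you are wondering where frag comes from: it is the result of parsing the input file. Here is a sketch, assuming the script is parsed at compatibility level 110 to match the sample command line (inputFile is a placeholder variable):

using (var reader = new StreamReader(inputFile))
{
    IList<ParseError> errors;
    var parser = new TSql110Parser(true);    // true => QUOTED_IDENTIFIER is ON
    TSqlFragment frag = parser.Parse(reader, out errors);

    if (errors.Count > 0)
    {
        // report the parse errors and skip normalization for this input
    }
}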

Regenerate the Script

Once the visitor does its job to ‘normalize’ the various literals encountered, the next step is to regenerate the script from the modified parse tree. That takes care of one of the two problems we spoke about – namely whitespace. We can do that using one of the many SqlScriptGenerator classes available (there is one for each compatibility level). In the code snippet below, scrgen is one of the SqlScriptGenerator classes and script holds the output:

scrgen.GenerateScript(batch, out script);
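For reference, here is a sketch of how scrgen might be constructed. The exact options are a matter of taste – any fixed set of options produces consistent whitespace and keyword casing, which is what we need:

var options = new SqlScriptGeneratorOptions
{
    KeywordCasing = KeywordCasing.Uppercase,
    IncludeSemicolons = false
};
SqlScriptGenerator scrgen = new Sql110ScriptGenerator(options);

string script;
scrgen.GenerateScript(batch, out script);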

Calculate the hash

Once the normalized script is regenerated by the SqlScriptGenerator class, it can then be run through a proper hash algorithm (in this sample we use SHA-1) to calculate the hash value of the given script. This is also where we handle the case-sensitive / case-insensitive nature of the script:

  • For case-insensitive handling, we simply convert the generated script to lower case before we compute the hash.
  • For case-sensitive handling, we calculate the hash on the generated script as-is.

using (var hashProvider = new SHA1CryptoServiceProvider())
{
    if (caseSensitive)
    {
        hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script)));
    }
    else
    {
        hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script.ToLowerInvariant())));
    }
}

Track unique hashes

We can use a simple .NET Dictionary to track these, along with a usage count for each bucket. Each bucket also tracks an example of the batch (the original text itself).
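Here is a sketch of that bookkeeping (BucketInfo and originalBatchText are hypothetical names of mine, not part of the sample). Note that because the batches are processed by Parallel.ForEach, the real code would need a ConcurrentDictionary or a lock around this:

class BucketInfo
{
    public int Count;
    public string ExampleBatch;   // the original, un-normalized batch text
}

var buckets = new Dictionary<string, BucketInfo>();

// ... after hashValue has been computed for a batch:
BucketInfo info;
if (buckets.TryGetValue(hashValue, out info))
{
    info.Count++;
}
else
{
    buckets[hashValue] = new BucketInfo { Count = 1, ExampleBatch = originalBatchText };
}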

Sample output

The sample project, when compiled and executed as shown below, gives us the following output.

Command line

TSQLTextNormalizer.exe c:\temp\input.sql c:\temp\output.sql 110 false

Input

select * from ABC
where
i = 1
GO

select * from abC where i = 3
GO

Output

-- 2 times:
select * from ABC
where
i = 1
GO

That’s it! You can use this in many cases, limited only by your imagination. More importantly, I hope it showed you the power and flexibility of the ScriptDom classes.

[EDIT 2022-07-27] The sample code for this post is now on GitHub. Please upvote the repo on GitHub if you liked this post and found the sample code useful. The URL is https://github.com/arvindshmicrosoft/SQLScriptDomSamples/tree/main/8_TSQLTextNormalizer

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code.

This posting is provided “AS IS” with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.

Normalizing T-SQL text, part 1: using the RML Utilities and the DMVs

A common problem when dealing with workloads which issue ad-hoc SQL commands (i.e. without parameterization) is to find out the ‘normalized’ version of the pattern. For example, these three statements are essentially the same ‘template’:

  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Smith'
  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Singh'
  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Lee'

And so are these three, though together they form a distinct new pattern:

  • SELECT * FROM Person.person WHERE LastName IN ('Smith')
  • SELECT * FROM Person.person WHERE LastName IN ('Smith', 'Singh')
  • SELECT * FROM Person.person WHERE LastName IN ('Smith', 'Singh', 'Lee')

The requirement therefore is to detect only the common ‘template’ / ‘normalized’ text – in most cases a representative example of the ad-hoc statement. In this post, I will show you the classic ways to do that: using the RML Utilities (specifically using ReadTrace) or using the query_hash value in the DMVs.

Option 1: Using the RML Utilities / ReadTrace

The classic way to do this with profiler traces, of course, was to capture a trace file (preferably using server-side tracing) and then run that trace through the ReadTrace utility. The neat thing about this utility is that it ‘aggregates’ similar patterns into hash buckets and then displays statistics for each hash ID. To do that successfully, it requires a SQL instance and database.

Here is a sample command line for using this utility:

readtrace -Ic:\temp\test.trc -S. -dTest -f

The above imports the trace (from c:\temp\test.trc) into a database called Test on my local default instance of SQL 2012. Once that is done, you can work your magic on the various tables created by this utility and extract the unique batches:

SELECT OrigText, HashID
FROM ReadTrace.tblUniqueBatches

Here is the sample output from my test, where I had issued those 6 statements (each delimited by a GO) which were shown at the beginning of this post:

  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Smith' => hash value of -8759399170292131263
  • select * from Person.person where LastName IN ('SMith') => hash value of 1644287078950160061

So the good news is that this utility identified the underlying patterns accurately, but it requires extra effort (provisioning a SQL instance and gathering a trace file). One other advantage of ReadTrace is that it also supports importing Extended Events (XEL) files (though that feature is still in beta).

For more details on the normalization capabilities within ReadTrace, Download RML Utilities for SQL Server (x64) and then read the PDF help file. The various patterns are described in the section ‘Normalization Patterns’.

Option 2: Using the query_hash value

Starting with SQL Server 2008, you can use the query_hash functionality to isolate these common patterns. Here is an example of a query which does that and lists one sample query for each ‘pattern’:

select * from
(
    select ST.text, ROW_NUMBER() OVER (PARTITION BY query_hash ORDER BY query_hash) Ranking
    from sys.dm_exec_query_stats QS
    cross apply sys.dm_exec_sql_text (QS.sql_handle) ST
    where objectid IS NULL
    and ST.encrypted = 0
) AS Q
where Q.Ranking = 1

While this is easy to use and available out of the box, it does not work for SQL 2005, and obviously not for SQL 2000. (Why do I even bring up SQL 2000, given that it is out of support? Because when upgrading those few remaining workloads from SQL 2000 to newer versions, it is an appropriate time to tune some of this stuff by replacing ad-hoc statements with parameterized queries.)
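As an aside, here is a sketch of the parameterization being suggested (connectionString is a placeholder). The literal moves out of the statement text, so every lookup shares a single ‘template’ – and a single cached plan:

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT BusinessEntityId FROM Person.Person WHERE LastName = @LastName", conn))
{
    // The literal becomes a typed parameter instead of inline text.
    cmd.Parameters.Add("@LastName", SqlDbType.NVarChar, 50).Value = "Smith";
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // consume the results
    }
}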

Also, there is another problem with IN predicates. When there is a varying set of literals passed to the IN predicate, it turns out that the query_hash is different. Here is an example:

  • select * from Person.person where LastName IN ('SMith', 'Singh', 'Lee') => hash value 0x6FAFADDF37AA7683
  • select * from Person.person where LastName IN ('SMith') => hash value 0x77A1391010F61AE7
  • select * from Person.person where LastName IN ('SMith', 'Singh') => hash value 0x9BEA60D8D6595AA5

Summary

I’d like to quickly compare these two approaches and then set up what’s coming next.

ReadTrace
  • Offline method
  • Can consume TRC or XEL files
  • Supports SQL Server versions 2000, 2005, 2008, 2008 R2, 2012 and SQL 2014 CTP2 (as of version 9.04.0004, current at the time of writing this post)
  • Correctly handles many patterns, including IN lists

query_hash
  • Online method
  • Requires SQL 2008 or above
  • Dependent on the query plan cache
  • Some limitations, such as not aggregating IN predicate lists correctly

In the next part I will show you how you can cook your own ‘aggregator’ / ‘normalizer’ using the Transact-SQL ScriptDom parser and then compare it with these two approaches. Stay tuned!

Slides and samples for my SQL PASS ScriptDom talk

Thank you very much, PASS AppDev for giving me the chance to share the capabilities and usage of the ScriptDom parser! The session was packed to capacity even before we got underway, something we did not quite expect given the specialized nature of the talk.

Request: if you want to see an ‘encore’ or want more, please leave a comment on this blog, or email the AppDev DL.

Edit 12 Nov 2013: The recording of this session is now available.

Download Samples

Download Slide deck

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code.

This posting is provided "AS IS" with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.

SQL PASS AppDev Virtual Chapter meeting

I’ll be speaking about the SQL Server ScriptDom parser at the PASS AppDev Virtual Chapter meeting this Friday, 1st of November 2013. Here is the session abstract as published on the AppDev VC website:

Session title: DIY: T-SQL Swiss Knife Using the ScriptDOM T-SQL Parser

Abstract: Want to find out exactly how many DELETE statements are in your database? Even better, how many of those are missing a WHERE clause? Or maybe you want to check your code for usage of the COMPUTE clause, but Upgrade Advisor does not help you do so? Or maybe you want to ‘pretty print’ your SQL scripts? Many such requirements can easily be taken care of by using the TransactSql.ScriptDom parser which ships with SQL Server 2012. This session is for DBAs and developers who want to dig DEEP into T-SQL code with simple C# and PowerShell scripts – and thereby end up making their jobs (and lives) easier!
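To give a taste of what the session covers, here is a minimal sketch (my illustration, not the actual session material) of the ‘DELETE missing a WHERE clause’ check mentioned in the abstract:

class DeleteWithoutWhereVisitor : TSqlFragmentVisitor
{
    public int Count;

    public override void ExplicitVisit(DeleteStatement node)
    {
        // The WHERE clause hangs off the DeleteSpecification; null means it is absent.
        if (node.DeleteSpecification.WhereClause == null)
        {
            Count++;
        }
        base.ExplicitVisit(node);
    }
}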

If you are interested in T-SQL and programming, do join us at Fri, Nov 01 2013 07:00 (GMT-07:00) Mountain Time (US & Canada).

XML ‘Visualizer’ for the TransactSql.ScriptDom AST

EDIT Jan 14, 2020: Several readers have asked me about the download links mentioned below. Due to this blog having migrated from MSDN to WordPress, those links are expected to be broken. Specifically for this set of articles, I redirect you to my GitHub SQLScriptDomSamples repo, which now has the sample code.

I’ve described the ScriptDom parser previously on this blog. The visitor pattern is great when you know exactly what you are looking for in the parse tree, but as you would have seen, due to the polymorphic nature of the ScriptDom classes, it is quite difficult to do a truly generic tree walk when you don’t know what to expect.

The subtypes problem

For example, let’s look at the SelectStatement class. Here are the members of that class:

SelectStatement
    QueryExpression QueryExpression
    SchemaObjectName Into
    List<> ComputeClauses
    WithCtesAndXmlNamespaces WithCtesAndXmlNamespaces
    List<> ScriptTokenStream

The QueryExpression in turn looks like the below:

QueryExpression
    OrderByClause OrderByClause
    OffsetClause OffsetClause
    ForClause ForClause
    IList<> ScriptTokenStream

QueryExpression is actually an abstract class. In the real world, what you end up getting is one of the subtypes of QueryExpression:

  • QueryParenthesisExpression
  • BinaryQueryExpression
  • QuerySpecification

This means you will end up hard-coding some IF…ELSE structures if you want to handle each of the above 3 cases, along the lines of the sketch below.
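Here is what that kind of hand-written type dispatch might look like (the handling inside each branch is left as comments):

void HandleQueryExpression(QueryExpression expr)
{
    if (expr is QuerySpecification)
    {
        // the common case: a plain SELECT
    }
    else if (expr is BinaryQueryExpression binary)
    {
        // UNION / EXCEPT / INTERSECT: recurse into both sides
        HandleQueryExpression(binary.FirstQueryExpression);
        HandleQueryExpression(binary.SecondQueryExpression);
    }
    else if (expr is QueryParenthesisExpression paren)
    {
        HandleQueryExpression(paren.QueryExpression);
    }
}

This quickly becomes very complex when you consider that the pattern repeats in many classes. For example, within QuerySpecification you have a FromClause: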

QuerySpecification
    UniqueRowFilter UniqueRowFilter
    TopRowFilter TopRowFilter
    IList<> SelectElements
    FromClause FromClause
    WhereClause WhereClause
    GroupByClause GroupByClause
    HavingClause HavingClause
    OrderByClause OrderByClause
    OffsetClause OffsetClause
    ForClause ForClause
    IList<> ScriptTokenStream

The FromClause is in turn a List<> of TableReference. That class in turn has a hierarchy of subtypes, which you have to navigate (painfully, sometimes) to find the types of interest.

This goes on and on. So it can be very difficult for a normal developer to figure out what classes to expect for a given T-SQL snippet. What we need is a way to walk the tree for a given T-SQL fragment and document exactly what classes are present for that T-SQL.

Creating a XML ‘Visualizer’

XML is a very convenient way to document parse trees. This approach has previously been applied to another T-SQL parser from the Microsoft SQL Server product team. (That parser was never publicly supported; it is now unavailable and has effectively been replaced by the ScriptDom parser.) You can find a discussion of the older parser and the XML it produced in this thread.

Unfortunately, the ScriptDom parser does not give us an Xml property which would neatly provide the parse tree in XML format. Instead, we have to do the work ourselves. Actually, it is not very hard – with a little knowledge of the Reflection API, one can easily do this. The advantage of the reflection method is that it lets us see the actual derived class which has been produced by the parser.
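To illustrate the idea, here is a minimal sketch of such a reflection-based walk. The attached sample application is more complete; the class and method names below are my own:

using System.Collections;
using System.Reflection;
using System.Xml.Linq;
using Microsoft.SqlServer.TransactSql.ScriptDom;

static class AstXmlDumper
{
    public static XElement Dump(TSqlFragment fragment)
    {
        // The element name is the actual derived class name - exactly the
        // information that is hard to get at with hand-written visitors.
        var element = new XElement(fragment.GetType().Name);

        foreach (PropertyInfo prop in fragment.GetType().GetProperties())
        {
            if (prop.Name == "ScriptTokenStream")
                continue;   // the raw token stream is too noisy to include

            object value = prop.GetValue(fragment, null);
            if (value is TSqlFragment child)
            {
                element.Add(new XElement(prop.Name, Dump(child)));
            }
            else if (value is IEnumerable list && !(value is string))
            {
                // collection properties such as Batches, SelectElements, etc.
                foreach (object item in list)
                {
                    if (item is TSqlFragment listChild)
                        element.Add(new XElement(prop.Name, Dump(listChild)));
                }
            }
        }

        return element;
    }
}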

Sample application

I’ve attached a sample C# console application which does this and writes the output XML to a file on disk. I personally find this very useful for visualizing the parse tree offline. You can use the free XML Notepad or XML Explorer tools to examine the resultant XML, and you can then run XPath queries to look for patterns.
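For example, assuming the XML uses the ScriptDom class names as element names (as the sketch above does), an XPath query to find every qualified join anywhere in a script could look like this (ast.xml is a placeholder path):

using System;
using System.Xml.Linq;
using System.Xml.XPath;

XDocument doc = XDocument.Load("ast.xml");
foreach (XElement join in doc.XPathSelectElements("//QualifiedJoin"))
{
    Console.WriteLine(join.ToString());
}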

Here are some sample fragments of the XML produced by the application for the AdventureWorks procedure called uspGetManagerEmployees.

Here is the FromClause depicted in XML:

[Figure: XML rendering of the FromClause]

Here is the JOIN within the FROM clause:

[Figure: XML rendering of the JOIN within the FROM clause]

That’s it for now. Hope you enjoyed this post, do leave your comments on the blog or get in touch with me via Twitter.

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code.

This posting is provided “AS IS” with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.

Data Quality Services (DQS) and Failover Cluster instances

One of my customers made the observation that it is not possible to uncheck the Data Quality Services component in the Feature Selection screen in SQL Server 2012 setup. The interesting thing was that she clearly recalled it was previously possible to leave it unselected (the default was unchecked), and we wondered what had changed such that the DQS installation was now mandatory:

[Figure: the Feature Selection screen in SQL Server 2012 setup, with Data Quality Services selected and unable to be unchecked]

FYI, she was running SQL Server 2012 setup with Product Update enabled, using the command-line switches /Action=InstallFailoverCluster /UpdateSource=<path to SP1 EXE>. This means the setup support files were those from SP1.

Analysis

It then came to our attention that in CU1 there was a fix released to ensure DQS compatibility with failover cluster. The issue is documented in KB article 2674817.

Now, as per the KB, there is no way to apply the fix (other than a re-install of the FCI) to an already-installed FCI which had the DQS components previously selected (but which were otherwise not working, due to the issue described in the KB). So, to prevent users from running into the same problem if they were ever to use DQS after the FCI installation, the product team decided that post-CU1, the selection of DQS is mandatory.

The other angle on this is: if you are never planning to use DQS, does this selection add services, databases or files which go unused? The good news is that the impact is minimal; selecting DQS in the above feature list just drops a DQSInstaller.exe into the SQL Binn folder, which then has to be (optionally) executed to actually complete the DQS server installation. So if you never run DQSInstaller.exe, it will not create the DQS databases on the FCI.

Conclusion

To summarize, though the feature cannot be unchecked, in reality it does not add too much to the footprint of the installed FCI. Hope this helps!

Error message “Invalid column name ‘uses_native_compilation’” when scripting objects in SQL 2014 CTP1

Due to the In-Memory OLTP enhancements in SQL 2014, there is new metadata support in the form of a new column, uses_native_compilation, in the sys.all_sql_objects catalog view. When you use SMO to retrieve a database object (typically when you script it using the SSMS Generate Scripts wizard, for example), it retrieves properties from such catalog views to populate the corresponding SMO properties, such as IsSystemObject and IsNativelyCompiled.
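For context, here is a sketch of the kind of SMO scripting code that exercises this path (the server and database names are placeholders; the Generate Scripts wizard does something conceptually similar):

using System;
using System.Collections.Specialized;
using Microsoft.SqlServer.Management.Sdk.Sfc;
using Microsoft.SqlServer.Management.Smo;

var server = new Server(".");    // local default instance
var db = server.Databases["AdventureWorks2012"];
var scripter = new Scripter(server);

foreach (StoredProcedure sp in db.StoredProcedures)
{
    // IsSystemObject is one of the SMO properties populated from the catalog views.
    if (!sp.IsSystemObject)
    {
        StringCollection script = scripter.Script(new Urn[] { sp.Urn });
        foreach (string line in script)
        {
            Console.WriteLine(line);
        }
    }
}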

Unfortunately, the code in CTP1 (after all, it carries non-production-ready status for a good reason) has an issue: it does not consider that the column backing the IsNativelyCompiled property is absent from the catalog views in older releases of SQL Server. You may see this today as an error when scripting objects from Management Studio 2014, or if you use SMO directly to script objects. Note that this issue only occurs when connected to SQL 2012 or SQL 2008; scripting objects on SQL 2014 instances works just fine.

FYI the relevant portion of the call stack might look like this:

Microsoft.SqlServer.Management.Common.ExecutionFailureException: An exception occurred while executing a Transact-SQL statement or batch.
System.Data.SqlClient.SqlException: Invalid column name 'uses_native_compilation'.

… at Microsoft.SqlServer.Management.Smo.SmoDependencyOrderer.ResolveSprocDependencies(List`1 schemaboundList)

So if you are running into this, please vote on the Connect bug I filed, so that the concerned team in the product group can prioritize and fix it. Here is the link again in plain-text format:

https://connect.microsoft.com/SQLServer/feedback/details/792637/system-data-sqlclient-sqlexception-invalid-column-name-uses-native-compilation-when-scripting-older

SQL 2014 In-Memory OLTP ‘Hekaton’: training videos and white papers

Personally, the feature I am most excited about in SQL Server 2014 is the In-Memory OLTP (codenamed ‘Hekaton’) feature. Since SQL Server CTP1 was released, a lot of my interactions with colleagues and customers have revolved around this new feature and the one question everyone has is, how do I get started? Well, here’s a readiness roadmap I have put together to help you!

Videos

Sunil Agarwal, Principal Program Manager from the SQL Server product team, has been one of the key people steering the work towards its eventual public unveiling in CTP1. I’ve been fortunate enough to interact with Sunil on a few occasions and I marvel at his passion and energy. He recently presented three sessions at TechEd 2013 North America, which are now available online, complete with slides and video!

A related set of deep sessions were delivered by Jos de Bruijn (also a key member of the Hekaton PM team) at TechEd Europe 2013. Jos is a PhD and his technical depth and passion to share information is clearly evident – take a look for yourself!

 

In-Memory OLTP Code Samples

Here are some useful samples from MSDN.

 

General Reading

 

Research Papers

All the inside scoops from the team which made it happen!

If you find more resources which you would like to share, please post a comment or contact me through this blog to let me know, I will gladly link them up in this roadmap!