The ‘aggregate concatenation’ T-SQL (anti-)pattern

[Update 20 March 2019: this functionality is now natively provided in SQL Server 2017+ / Azure SQL DB through the STRING_AGG intrinsic]
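For readers on those newer versions, here is a minimal sketch of the same aggregate concatenation using STRING_AGG, against the same table used in the examples below (the CONVERT to NVARCHAR(MAX) avoids the 8000-byte cap on the aggregated result):

SELECT STRING_AGG(CONVERT(NVARCHAR(MAX), FirstName + LastName), ',') AS RawResult
FROM AdventureWorks2016CTP3.Person.Person;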

Some time back, Dimitri Furman, who works as a Senior Consultant in the Microsoft Services team, shared a very useful tip. It pertains to a usage pattern wherein strings from multiple rows are concatenated into a single large (typically comma-delimited) string. The usual way we see this being accomplished is with code like the one below:

DECLARE @res NVARCHAR(max)

SELECT @res = COALESCE(@res + ',', '') + FirstName + LastName
FROM AdventureWorks2016CTP3.Person.Person

SELECT @res

While the above is quite simple to read, it is quite slow (as you will see soon) and, worse, it does not 'inline', meaning you cannot use it inside an inline table-valued function. This last property was very important for us because, in the case we were looking at, the above COALESCE-based concatenation code was in a scalar UDF, which is clearly a big 'no-no' from a performance perspective. Also, we did not want to utilize SQLCLR, so the GROUP_CONCAT CLR UDA was ruled out.

Now, here is a much better way: refactor the above concatenation 'loop' into the query below:

SELECT STUFF((
	SELECT ',' + FirstName + LastName
	FROM AdventureWorks2016CTP3.Person.Person
	FOR XML PATH(''), TYPE
	).value('.', 'nvarchar(max)'), 1, 1, '') AS RawResult

On first glance, it may seem counter-intuitive that the second version could be faster, but that is indeed the case. For the above queries, here is the comparison:

COALESCE loop: CPU time = 7969 ms, elapsed time = 7984 ms
FOR XML PATH:  CPU time = 15 ms, elapsed time = 16 ms

The usage of FOR XML PATH in this way is not exactly 'breaking news', but the performance difference is dramatic. More importantly for the case we were discussing at work, the XML PATH based code can be inlined, making it suitable for use in TVFs. And in our real case, that is exactly what we wanted to do: refactor the scalar UDF into an inline TVF, as sketched below.
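Here is a minimal sketch of such an inline TVF wrapping the FOR XML PATH pattern (the function name dbo.tvfConcatNames is hypothetical, not from the actual case):

CREATE FUNCTION dbo.tvfConcatNames ()
RETURNS TABLE
AS
RETURN (
	SELECT STUFF((
		SELECT ',' + FirstName + LastName
		FROM AdventureWorks2016CTP3.Person.Person
		FOR XML PATH(''), TYPE
		).value('.', 'nvarchar(max)'), 1, 1, '') AS RawResult
	);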

Have a great week ahead!

T-SQL ranking functions to the rescue once again!

This week I happened to assist with an internal application where there was a performance issue with a specific query. One of my observations with that query was that it was doing self-joins in order to determine the latest iteration of a specific record. That’s a common anti-pattern which can be fixed by using T-SQL’s windowing functions, such as ROW_NUMBER() or RANK().

The Inefficient Way

To give more context, here is an example from the Contoso Retail DW database. The requirement (very similar to the actual customer case) is to obtain a list of all the customers who have placed an order with us and, for each of those customers, the date of their latest order. The implementation I first noticed used a self-join, as given below; notice the special-case handling where a customer has two orders on the same day, in which case the order number (the OnlineSalesKey column here) becomes the tie-breaker:

SELECT l.CustomerKey
	,l.DateKey AS LatestOrder
FROM dbo.FactOnlineSales l
LEFT JOIN dbo.FactOnlineSales r ON l.CustomerKey = r.CustomerKey
	AND (
		l.DateKey < r.DateKey
		OR (
			l.DateKey = r.DateKey
			AND l.OnlineSalesKey > r.OnlineSalesKey
			)
		)
WHERE r.DateKey IS NULL

This query is *extremely* inefficient, burning 100% CPU on my i7 laptop with 8 logical CPUs! It runs for well over 10 minutes before I get impatient and cancel it. For reference, here is the estimated execution plan for the above query (note the estimated query cost of 167692):

[Image: estimated execution plan for the self-join query]

Just for kicks, I ran the above query with MAXDOP 144 (NOT a good idea, but this was for fun) on our lab machine, which has 144 logical CPUs, and here is the picture I saw 🙂 Obviously NOT something you want in production!

[Image: CPU usage pegged across all 144 logical processors]

Rewriting the query

Now, the efficient way to rewrite this is to use ranking functions in T-SQL. These have been around a while (since SQL Server 2005, actually!) but I feel they are under-utilized. Here is the query rewritten using the ROW_NUMBER() function. This solution also elegantly takes care of the tie-breaking logic which required the disjunction (OR predicate) in the previous query.

WITH NumberedOrders
AS (
	SELECT CustomerKey
		,DateKey
		,ROW_NUMBER() OVER (
			PARTITION BY CustomerKey ORDER BY DateKey DESC
				,OnlineSalesKey DESC
			) AS RowNumber
	FROM FactOnlineSales
	)
SELECT CustomerKey
	,DateKey
FROM NumberedOrders
WHERE RowNumber = 1

Here is the new execution plan. Note that the cost is also much lower: 447.

[Image: execution plan for the ROW_NUMBER() query]

Here are the execution statistics for this query; it completes in 43 seconds, compared to the self-join approach which would have run forever:

  • Table ‘FactOnlineSales’. Scan count 9, logical reads 92516, physical reads 0, read-ahead reads 72663, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 31263, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • SQL Server Execution Times:
  • CPU time = 19231 ms, elapsed time = 43834 ms.
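As an aside, the same ranking logic can be written even more compactly with TOP (1) WITH TIES, ordering by the ROW_NUMBER() expression directly. This is a sketch for comparison only, with the same tie-breaking behavior as the CTE version:

SELECT TOP (1) WITH TIES CustomerKey
	,DateKey
FROM FactOnlineSales
ORDER BY ROW_NUMBER() OVER (
		PARTITION BY CustomerKey ORDER BY DateKey DESC
			,OnlineSalesKey DESC
		);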

Moral of the story: the next time you see query patterns involving self-joins to obtain this kind of 'latest record of a particular type' result, take a pause and see if T-SQL ranking functions can help!

Common sub-expression elimination in SQL Server: what you need to know

In working with our customers, I come across some very interesting patterns in T-SQL. Not all of these patterns are complex on the face of it, but their impact can sometimes be substantial. Here is a simple example to demonstrate what I am referring to:

SELECT CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col1,
       CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col2
FROM   Sales.SalesOrderDetail AS SOD;

This query is clearly fictional, but it will suffice for our walkthrough here 🙂 Here's the execution plan for the above statement (you can also test this using AdventureWorks2012 or later):

[Image: execution plan showing Person.Person accessed twice]

Observe that the Person.Person table is accessed twice in the above plan. In the real customer scenario I was looking at, the table involved was accessed using a nested loop join, for a total of 13 million seeks. And those 13 million seeks were repeated for the second usage of the subquery.

My rewrite of the query used LEFT OUTER JOIN syntax, as reflected in this simple example:

SELECT CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END AS Col1,
       CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END AS Col2
FROM   Sales.SalesOrderDetail AS SOD
LEFT OUTER JOIN Person.Person AS P ON P.BusinessEntityID = SOD.ProductID;

Here is the execution plan for the optimized version:

[Image: execution plan for the LEFT JOIN rewrite, with Person.Person accessed once]

These two queries are functionally equivalent (because in this case there is a unique key on the BusinessEntityID column) and return the same set of results. As you can see, the Person.Person table is accessed just once in the revised query. The cost of the rewritten query is also roughly 6 times lower than that of the original! This kind of optimization is referred to as common subexpression elimination. There are some such strategies used by SQL Server (see this paper and this one as well for some ideas) but not every possible scenario (such as the one above) can be optimized. Also, this paper from Microsoft Research covers future ideas on the topic.
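One caveat before applying such a rewrite: the LEFT JOIN form is only equivalent to the EXISTS form when the join key is unique on the joined side; otherwise the join can multiply rows. Here is a quick sanity check, as a sketch (it should return no rows in this case):

SELECT P.BusinessEntityID
FROM Person.Person AS P
GROUP BY P.BusinessEntityID
HAVING COUNT(*) > 1;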

For now, it is best to review your T-SQL code for multiple instances of the same subquery and replace those with an explicit JOIN instead.

Tracking TEMPDB internal object space usage in SQL 2012

It is a documented fact that using the MAX specifier can result in TEMPDB usage in specific cases. From the TEMPDB whitepaper, some of these can use ‘internal object’ space within TEMPDB:

“all of the large object types: text, image, ntext, varchar(max), varbinary(max), and all others.”

Now, a frequent question from our customers is how to drill down to session-level usage of TEMPDB when they suspect abnormal growth. Normally, the sys.dm_db_session_space_usage DMV gives us the details we need. But are there any other options to look deeper? We wanted to test this and observe the effects. To do this, we created a simple stored procedure and a corresponding C# console application which passes in a large-valued parameter.

Stored procedure code

CREATE PROCEDURE testvarcharmax (@s VARCHAR(MAX))
AS
SELECT @@VERSION;

C# application code

Here is the simple front-end application code which invokes the ‘testvarcharmax’ procedure:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Text;

// connect to the local default instance
SqlConnection cn = new SqlConnection("Data Source=.;initial catalog=tempdb;integrated security=true");
cn.Open();

SqlCommand cmd = new SqlCommand();
cmd.Connection = cn;
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.CommandText = "testvarcharmax";

// a size of -1 denotes VARCHAR(MAX); the value is a 1-billion-character string
SqlParameter parameter = new SqlParameter("@s", System.Data.SqlDbType.VarChar, -1);
parameter.Value = new StringBuilder().Insert(0, "a", 1000000000).ToString();
cmd.Parameters.Add(parameter);
cmd.CommandTimeout = 360;

cmd.ExecuteNonQuery();

Monitoring TEMPDB

After executing the above C# code, my TEMPDB quickly grew by leaps and bounds. As any DBA would understand, the essential ask in such cases is to trace the session ID which is generating this TEMPDB space usage. One option is to use the sys.dm_db_file_space_usage DMV. That one is useful in this case, as it reports the aggregate space for the internal objects:

SELECT SUM(internal_object_reserved_page_count) AS [internal object pages used],
	SUM(internal_object_reserved_page_count) * 1.0 / 128 AS [internal object space in MB]
FROM tempdb.sys.dm_db_file_space_usage;
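As a companion to the above, here is a sketch which breaks TEMPDB space down into its major consumers (user objects, internal objects, version store and free space), using the other documented page-count columns of the same DMV:

SELECT SUM(user_object_reserved_page_count) * 1.0 / 128 AS [user objects MB],
	SUM(internal_object_reserved_page_count) * 1.0 / 128 AS [internal objects MB],
	SUM(version_store_reserved_page_count) * 1.0 / 128 AS [version store MB],
	SUM(unallocated_extent_page_count) * 1.0 / 128 AS [free space MB]
FROM tempdb.sys.dm_db_file_space_usage;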

However, the above does not provide session-level detail. To obtain that, you can use queries on sys.dm_db_session_space_usage and sys.dm_db_task_space_usage, as shown below:

select * from tempdb.sys.dm_db_session_space_usage
order by internal_objects_alloc_page_count desc

select * from tempdb.sys.dm_db_task_space_usage
order by internal_objects_alloc_page_count desc
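To map the heaviest session back to an actual user, here is a sketch which joins the same session-level DMV to sys.dm_exec_sessions (all columns shown are documented):

SELECT su.session_id
	,es.login_name
	,es.host_name
	,es.program_name
	,su.internal_objects_alloc_page_count
	,su.internal_objects_alloc_page_count * 1.0 / 128 AS [internal objects allocated MB]
FROM tempdb.sys.dm_db_session_space_usage AS su
INNER JOIN sys.dm_exec_sessions AS es ON su.session_id = es.session_id
ORDER BY su.internal_objects_alloc_page_count DESC;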

Update: Adam Machanic pointed out that it is much easier and more lightweight to use the sys.dm_db_task_space_usage and sys.dm_db_session_space_usage DMVs to track internal object usage for a given task / session. Please use the detailed approach described below only if you have a specific need to investigate lower-level details, such as a potential product issue.

Digging deeper

Suppose you suspect a 'leak' in TEMPDB objects of some sort, and you need to peer deep into TEMPDB page allocation / deallocation. For that, KB article 947204 provides an interesting approach using the allocation ring buffer, which is populated when trace flag 1106 is enabled. This can be useful in rare cases, but first do note the important disclaimer about this flag:

Note We recommend that you do not use trace flag 1106 on production servers because trace flag 1106 affects the performance of SQL Server.

Note that the query specified therein does not return results on SQL 2012, even after enabling trace flag 1106. It turns out that the query requires a minor change, as the XML schema and the name of the ring buffer have changed in SQL 2012. The new name for the ring buffer is RING_BUFFER_SPACEMGR_TRACE. Here is the modified query, reflecting the changes in the XML schema as well:

SELECT TOP 500
	record.value('(Record/@id)[1]', 'int') AS record_id,
	CONVERT (varchar, DATEADD (ms, -1 * (inf.ms_ticks - [timestamp]), GETDATE()), 126) AS EventTime,
	[timestamp],
	record.value('(Record/@id)[1]', 'int') AS RingBuffer_Record_Id,
	record.value('(Record/SpaceMgr/Event)[1]', 'int') AS AllocationEventType,
	record.value('(Record/SpaceMgr/SpId)[1]', 'int') AS SpId,
	record.value('(Record/SpaceMgr/EcId)[1]', 'int') AS EcId,
	record.value('(Record/SpaceMgr/PageId)[1]', 'nvarchar(50)') AS AllocatedPageId,
	record.value('(Record/SpaceMgr/AuId)[1]', 'nvarchar(50)') AS AllocationUnitId,
	record.value('(Record/SpaceMgr/LsId)[1]', 'nvarchar(50)') AS LsId
FROM sys.dm_os_sys_info inf
CROSS JOIN (
	SELECT [timestamp], CONVERT (xml, record) AS record
	FROM sys.dm_os_ring_buffers
	WHERE ring_buffer_type = 'RING_BUFFER_SPACEMGR_TRACE'
		AND (record LIKE '%<Event>23</Event>%' -- uniform extent allocation
			OR record LIKE '%<Event>22</Event>%' -- uniform extent deallocation
			OR record LIKE '%<Event>24</Event>%' -- mixed extent allocation
			OR record LIKE '%<Event>25</Event>%' -- mixed extent deallocation
			OR record LIKE '%<Event>10</Event>%' -- page allocation
			OR record LIKE '%<Event>11</Event>%' -- page deallocation
		)) AS t
ORDER BY record.value('(Record/@id)[1]', 'int') ASC;

The results of the ring buffer query are too detailed for most real-time scenarios and the overheads associated with the trace flag make it impractical, as compared to the sys.dm_db_session_space_usage DMV. But if you happen to use the approach in KB 947204 on SQL 2012 for some really low level debugging, please take note of the above changes.

Reference

Queries useful for troubleshooting TEMPDB space usage

Normalizing T-SQL text, part 2: using the TransactSql.ScriptDom classes

Happy New Year 2014 to all of you! With SQL Server 2014 set to release this year, I’m sure you are all excited about the months to come.

In my previous post I reviewed the classic ways of obtaining 'normalized' text for ad-hoc SQL queries. Do take a minute to glance at that post in case you have not already done so. Both of the methods described previously are dynamic: they need an active workload to operate upon. So if you have a static set of queries captured somewhere, such as a .SQL file, then we need an alternate method.

Algorithm

If you think about it, the core of normalizing these ad-hoc query text patterns is to identify literals and replace them with a generic / common value. Once the specific literal values are replaced with their generic 'placeholders', it becomes a relatively easy task to identify commonalities.
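To make this concrete, here is a sketch of the effect of literal replacement on one of the ad-hoc statements from the previous post (the placeholder value is arbitrary; any valid literal of the same type works):

-- original ad-hoc statement
SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Smith'

-- after literal replacement: all three variants collapse to this single form
SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'String'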

To identify commonalities, we propose to use a hashing algorithm, conceptually similar to the one used in the previous approaches. However, when computing hashes, there is another problem to consider: minor differences in the whitespace or letter case of the query text will cause different hash values to be computed for essentially identical text.

ScriptDom Implementation

The first thing to consider is what kinds of literals we would replace. In the ScriptDom class hierarchy, we have the following derived classes for the parent Literal class:

  • IntegerLiteral: whole numbers
  • NumericLiteral: decimal numbers such as 0.03
  • RealLiteral: numbers written with scientific notation such as 1E-02
  • MoneyLiteral: values prefixed with currency symbol such as $12
  • BinaryLiteral: such as 0xABCD1234
  • StringLiteral: such as ‘Hello World’
  • NullLiteral: the NULL value
  • DefaultLiteral: the DEFAULT keyword
  • MaxLiteral: the MAX specifier
  • OdbcLiteral: ODBC formatted literal such as { T ‘blah’ }
  • IdentifierLiteral: 'special' case when an identifier is used as a literal. I've never seen a real-world example of this 🙂

We need to keep this in mind when we write the visitor to traverse through the AST.

Visitor definition

Next, we need to setup our visitor. We will use the Visitor pattern to do this, and implement overridden methods to handle the various types of literals described above. And for each type of literal we will replace the value of the literal with a fixed, generic value. Here is an example for the real literal visitor method:

public override void ExplicitVisit(RealLiteral node)
{
    node.Value = "0.5E-2";  // replace the literal with a fixed, generic value
    base.ExplicitVisit(node);
}

Visitor invocation

For performance reasons, we will invoke the visitor from a Parallel.ForEach loop, which makes efficient use of multi-core CPUs:

Parallel.ForEach(
    (frag as TSqlScript).Batches,  // frag is the TSqlFragment produced by the parser
    batch =>
    {
        // each batch gets its own visitor instance, on its own thread
        myvisitor visit = new myvisitor();
        batch.Accept(visit);
    });

This way, each T-SQL batch in the original script is visited on a separate thread.

Regenerate the Script

Once the visitor does its job to 'normalize' the various literals encountered, the next step is to regenerate the script based on the tokens already obtained. That takes care of one of the two problems we spoke about, namely whitespace. We can do this using one of the many SqlScriptGenerator classes available (there is one for each compatibility level). In the code snippet below, scrgen is one of the SqlScriptGenerator classes and script holds the output:

scrgen.GenerateScript(batch, out script);

Calculate the hash

Once the normalized script is re-generated from the SqlScriptGenerator class, it can then be run through a proper hash algorithm (in this sample we use SHA1) to calculate the hash value of the given script. Here is where we also handle the case sensitive / insensitive nature of the script:

  • For case insensitive cases, we simply convert the generated script to lower case before we compute the hash.
  • For case sensitive, we calculate the hash as-is on the generated script.

using (var hashProvider = new SHA1CryptoServiceProvider())
{
if (caseSensitive)
{
hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script)));
}
else
{
hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script.ToLowerInvariant())));
}
}

Track unique hashes

We can use a simple Dictionary class in .NET to track these hashes, along with a usage count for each bucket. Each bucket also tracks an example of the batch (the original text itself).

Sample output

The sample project, when compiled and executed as below, gives us the following output.

Command line

TSQLTextNormalizer.exe c:\temp\input.sql c:\temp\output.sql 110 false

Input

select * from ABC
where
i = 1
GO

select * from abC where i = 3
GO

Output

-- 2 times:
select * from ABC
where
i = 1
GO

That's it! You can use this in many cases, limited only by your imagination 🙂 And more importantly, I hope it showed you the power and flexibility of the ScriptDom classes.

[EDIT 2022-07-27] The sample code for this post is now on GitHub. Please upvote the repo on GitHub if you liked this post and found the sample code useful. The URL is https://github.com/arvindshmicrosoft/SQLScriptDomSamples/tree/main/8_TSQLTextNormalizer

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code.

This posting is provided “AS IS” with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.

Normalizing T-SQL text, part 1: using the RML Utilities and the DMVs

A common problem when dealing with workloads which issue ad-hoc SQL commands (i.e. without parameterization) is to find out the ‘normalized’ version of the pattern. For example, these three statements are essentially the same ‘template’:

  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Smith'
  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Singh'
  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Lee'

And so are these three, except that they are a distinctly new pattern:

  • SELECT * FROM Person.person WHERE LastName IN ('Smith')
  • SELECT * FROM Person.person WHERE LastName IN ('Smith', 'Singh')
  • SELECT * FROM Person.person WHERE LastName IN ('Smith', 'Singh', 'Lee')

The requirement, therefore, is to detect just the common 'template' / 'normalized' text, which in most cases is a representative example of the ad-hoc statement. In this post, I will show you the classic ways to do that: using the RML Utilities (specifically ReadTrace), or using the query_hash value in the DMVs.

Option 1: Using the RML Utilities / ReadTrace

The classic way to do this, with profiler traces of course, was to capture a trace file (preferably using server-side tracing) and then run that trace through the ReadTrace utility. The neat thing about this utility is that it 'aggregates' similar patterns into hash buckets and then displays statistics for each hash ID. To do that successfully, it requires a SQL instance and database.

Here is a sample command line for using this utility:

readtrace -Ic:\temp\test.trc -S. -dTest -f

The above imports the trace (from c:\temp\test.trc) into a database called Test on my local default instance of SQL 2012. Once that is done, you can work your magic on the various tables which are created by this utility and extract the unique batches:

SELECT OrigText, HashID
FROM ReadTrace.tblUniqueBatches
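Going a step further, ReadTrace also imports per-execution statistics, so you can aggregate resource usage per normalized pattern. Here is a sketch, assuming the ReadTrace.tblBatches table and its Duration / CPU / Reads columns as described in the RML help file (verify the column names against your ReadTrace version):

SELECT B.HashID
	,COUNT(*) AS Executions
	,SUM(B.Duration) AS TotalDuration
	,SUM(B.CPU) AS TotalCPU
	,SUM(B.Reads) AS TotalReads
FROM ReadTrace.tblBatches AS B
GROUP BY B.HashID
ORDER BY TotalCPU DESC;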

Here is the sample output from my test, where I had issued those 6 statements (each delimited by a GO) which were shown at the beginning of this post:

  • SELECT BusinessEntityId FROM Person.Person WHERE LastName = 'Smith' => hash value of -8759399170292131263
  • select * from Person.person where LastName IN ('SMith') => hash value of 1644287078950160061

So the good news is that this utility found the underlying patterns very accurately, but it also requires extra effort (provisioning a SQL instance and gathering a trace file). One other advantage of the ReadTrace utility is that it also supports importing Extended Events log (XEL) files (though that feature is still in beta).

For more details on the normalization capabilities within ReadTrace, Download RML Utilities for SQL Server (x64) and then read the PDF help file. The various patterns are described in the section ‘Normalization Patterns’.

Option 2: Using the query_hash value

Starting with SQL Server 2008, you can use the query_hash functionality to isolate these common patterns. Here is an example of a query which can do that and list one sample query for each 'pattern':

SELECT * FROM
	(SELECT ST.text, ROW_NUMBER() OVER (PARTITION BY query_hash ORDER BY query_hash) AS Ranking
	FROM sys.dm_exec_query_stats QS
	CROSS APPLY sys.dm_exec_sql_text (QS.sql_handle) ST
	WHERE objectid IS NULL
	AND ST.encrypted = 0
	) AS Q
WHERE Q.Ranking = 1

While this is easy to use and available out of the box, it does not work on SQL 2005, and obviously not on SQL 2000. (Why do I even bring up SQL 2000, given that it is out of support? Because when upgrading those few remaining workloads from SQL 2000 to higher versions, it is an appropriate time to tune some of these statements by replacing them with parameterized queries.)

Also, there is another problem with IN predicates. When a varying set of literals is passed to the IN predicate, it turns out that the query_hash is different for each variant. Here is an example (a query to inspect these hash values yourself follows the list):

  • select * from Person.person where LastName IN ('SMith', 'Singh', 'Lee') => hash value 0x6FAFADDF37AA7683
  • select * from Person.person where LastName IN ('SMith') => hash value 0x77A1391010F61AE7
  • select * from Person.person where LastName IN ('SMith', 'Singh') => hash value 0x9BEA60D8D6595AA5
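To reproduce these values yourself, here is a sketch which lists the query_hash alongside the cached text; run it after executing the three IN variants above:

SELECT QS.query_hash
	,ST.text
FROM sys.dm_exec_query_stats AS QS
CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) AS ST
WHERE ST.text LIKE '%LastName IN%'
	AND ST.text NOT LIKE '%dm_exec_query_stats%'; -- exclude this diagnostic query itself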

Summary

I'd like to quickly compare these two approaches and then set up what's coming next.

ReadTrace:
  • Offline method
  • Can consume TRC or XEL files
  • Supports SQL Server versions 2000, 2005, 2008, 2008 R2, 2012 and SQL 2014 CTP2 (version 9.04.0004 as of the time of writing this post)
  • Correctly handles many patterns, including IN lists

query_hash:
  • Online method
  • Requires SQL 2008 or above
  • Dependent on the query plan cache
  • Some limitations, such as not aggregating IN predicate lists correctly

In the next part I will show you how you can cook your own ‘aggregator’ / ‘normalizer’ using the Transact-SQL ScriptDom parser and then compare it with these two approaches. Stay tuned!

XML ‘Visualizer’ for the TransactSql.ScriptDom AST

EDIT Jan 14, 2020: Several readers have asked me about the download links mentioned below. Due to this blog having migrated from MSDN to WordPress, those links are expected to be broken. Specifically for this set of articles, I redirect you to my GitHub SQLScriptDomSamples repo which have the sample code now.

I've described the ScriptDom parser previously on this blog. The visitor pattern is great when you know exactly what you are looking for in the parse tree but, as you would have seen, due to the polymorphic nature of the ScriptDom classes it is quite difficult to do a truly generic tree walk when you don't know what to expect.

The subtypes problem

For example, let’s look at the SelectStatement class. Here are the members of that class:

SelectStatement
QueryExpression QueryExpression
SchemaObjectName Into
List<> ComputeClauses
WithCtesAndXmlNamespaces WithCtesAndXmlNamespaces

List<> ScriptTokenStream

The QueryExpression in turn looks like the below:

QueryExpression
OrderByClause OrderByClause
OffsetClause OffsetClause
ForClause ForClause

IList<> ScriptTokenStream

QueryExpression is actually an abstract class. In the real world, what you end up getting is one of the subtypes of QueryExpression:

  • QueryParenthesisExpression
  • BinaryQueryExpression
  • QuerySpecification

This means you will end up hard-coding some IF…ELSE structures if you want to handle each of the above 3 cases. This quickly becomes very complex when you consider the pattern repeats in many classes. For example, within QuerySpecification you have a FromClause:

QuerySpecification
UniqueRowFilter UniqueRowFilter
TopRowFilter TopRowFilter
IList<> SelectElements
FromClause FromClause
WhereClause WhereClause
GroupByClause GroupByClause
HavingClause HavingClause
OrderByClause OrderByClause
OffsetClause OffsetClause
ForClause ForClause

IList<> ScriptTokenStream

The FromClause is in turn a List<> of TableReference. That class in turn has a hierarchy of subtypes, which if you navigate (painfully, sometimes) you will typically find two types of interest: NamedTableReference and QualifiedJoin.

This goes on and on 🙂 So it can be very difficult for a normal developer to figure out what classes to expect for a given T-SQL snippet. What we need is a way to walk the tree for a given T-SQL fragment and document exactly what classes are present for that T-SQL.

Creating an XML 'Visualizer'

XML is a very convenient way to document parse trees. This approach has previously been applied to another T-SQL parser from the Microsoft SQL product team. (That parser was never publicly supported; it is now unavailable and has effectively been replaced by the ScriptDom parser.) You can find a discussion of the older parser and the XML it produced in this thread.

Unfortunately, our ScriptDom parser does not have an Xml property which would neatly provide the parse tree in XML format; we have to do the hard work ourselves. Actually, it is not very hard: with a little bit of knowledge of the Reflection API, one can easily do this. The advantage of the reflection method is that it allows us to see the actual derived class which has been produced by the parser.

Sample application

I’ve attached a sample C# console application which does this and writes the output XML to a file on disk. I personally find this very useful to visualize the parse tree offline. You can use the free XML Notepad or XML Explorer tools to examine the resultant XML. You can then run XPath queries as well to look for patterns.

Here are some sample fragments of the XML produced by the application for the AdventureWorks procedure called uspGetManagerEmployees.

Here is the FromClause depicted in XML:

[Image: the FromClause depicted in XML]

Here is the JOIN within the FROM clause:

[Image: the JOIN within the FROM clause, depicted in XML]

That’s it for now. Hope you enjoyed this post, do leave your comments on the blog or get in touch with me via Twitter.


PowerShell script to extract T-SQL task code from SSIS packages

Requirement

Some time back I presented a PowerShell script which extracts T-SQL code from RDL files. Remember that I created this script to assist me in doing code reviews of ad-hoc SQL snippets which were embedded in the report definition files.

Another common usage of ad-hoc T-SQL code is within SSIS packages, specifically the Execute SQL task or the (somewhat less commonly used) Execute T-SQL Statement Task. The sample script I provide in this blog post relates to the Execute SQL task.

Methodology

To do this efficiently we have a couple of choices:

  • We could parse the DTSX file (as it is ultimately an XML file) directly
  • We could use the managed API to load the package and iterate through the tasks

Given that our requirement is quite specific, and that we may want to reduce any dependency on managed objects, I chose the XML parsing approach. It helps very much that the DTSX package schema is fully documented (and I have kept in mind the changes made to the DTSX schema in SQL 2012).

Scope: this script will focus on the Execute SQL task for the moment. Interested readers can modify the script to operate on the (rarely used) Execute T-SQL Statement Task. It will also skip any StoredProcedure invocation calls.

Usage

The TSQLfromDTSX.ps1 script takes 2 parameters: the path to the files and an option to recursively look at sub-folders.

powershell.exe .\TSQLfromDTSX.ps1 -RootFolder <folder path> -Recurse <0 or 1>

Important: if the folder path has embedded spaces, you need to surround the path with single quotes (') and NOT the usual double quotes as you might expect.

Output

It outputs the extracted T-SQL code to the standard output, so you can simply redirect that to another file if required. An interesting feature I’ve plugged into the script is to prefix the T-SQL code with the name of the package and the specific task where that code was found. These ‘identity’ values are surrounded by the ~~ markers, which can make it easy for you to ‘tag’ these batches with some names in other activities.



PowerShell script to extract T-SQL code from RDL files

One of the things I do in my day job at Microsoft Services is to review database design and T-SQL code. When it comes to code, there is a challenge in just finding ALL the possible code which hits our database engine, because a non-trivial amount of T-SQL code is not in the form of stored procedures, but is embedded inside other places, such as Reporting Services (SSRS) reports.

Now, if you have hundreds of RDL files, extracting the T-SQL from those files is a very tedious job: opening each file in Report Designer, you would have to comb through each dataset, an error-prone procedure. This is where knowledge of the RDL format is helpful. Luckily for us, the specification for RDL is documented and available publicly. Using this knowledge, I constructed a quick (and perhaps dirty 🙂) PowerShell script to scan through multiple RDL files and extract the T-SQL command text.

To use the script, just invoke it from a PowerShell command prompt and pass in the root folder path (-RootFolder) where your RDL files are stored. The -Recurse parameter is also useful in case the RDL files are spread across many subfolders.

Do leave comments on whether this script was useful for you!
