Posts by Arvind Shyamsundar

I work to enable customers, partners and colleagues to excel at what they do by harnessing the power of data, advanced analytics and cloud services. I work as a Principal Program Manager within the Azure Data division at Microsoft Redmond, focusing on Azure SQL. I work with customers who are adopting our services and products in the most demanding workloads, and translate learnings from those engagements into product feedback and planning input so that we can make Azure SQL even better. I am proud to be one of the first Microsoft Certified Masters (MCM) for SQL Server 2008.

Too many files causing RAM pressure? (a.k.a. DynCache to the rescue!)

Readers of this blog might recall a previous post which described one impact of having too many similarly named files in one folder in NTFS. It turns out that the 8.3 naming convention is not the only thing you need to worry about when you have very large numbers of (smaller) files on the same volume.

Today I was called in to assist with a performance issue on a server. The only visible symptom of the problem was excessive RAM utilization on the server. The interesting aspect was that no specific user-mode process was consuming that RAM, so we were wondering where it came from. If this server was running SQL Server, for example, the ‘ghost’ utilization could be due to locked pages (which do not show up in Task Manager) but that was not the case here.

Analysis

When we ran the RAMMap utility, we found that MetaFile accounted for a substantial percentage of the total RAM usage. This AskPerf blog post explains what MetaFile is all about:

“Metafile is part of the system cache and consists of NTFS metadata. NTFS metadata includes the MFT as well as the other various NTFS metadata files (see How NTFS Works for more details, and of course Windows Internals is a great reference). In the MFT each file attribute record takes 1k and each file has at least one attribute record. Add to this the other NTFS metadata files and you can see why the Metafile category can grow quite large on servers with lots of files.”

The next step therefore was to cross-check how large the MFT was in reality. The easy way to do this is to utilize the command given below:

fsutil fsinfo ntfsinfo <drive letter>

A sample output is given below (from my own laptop):

C:\>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number :       0x8a40c9ee40c9e0d5
NTFS Version   :                  3.1
LFS Version    :                  2.0
Number Sectors :                  0x000000003a2007ff
Total Clusters :                  0x00000000074400ff
Free Clusters  :                  0x0000000000ab1f84
Total Reserved :                  0x000000000002ef10
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000030240000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x0000000004d58da0
Mft Zone End   :                  0x0000000004d655c0
Resource Manager Identifier :     96CC88FE-5621-11E3-AF31-3C970EA47926

In this output, the “Mft Valid Data Length” gives us an indicator of how many bytes are used by the MFT. In the above case for example it equates to around 770MB:

0x0000000030240000 / (1024*1024) = 770.25 MB
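
If you prefer to let SQL Server do the hex-to-MB arithmetic for you, here is a quick illustrative T-SQL one-liner; the binary literal is simply the 'Mft Valid Data Length' value from the fsutil output above:

-- Convert the 'Mft Valid Data Length' (bytes, as a hex literal) into MB
SELECT CONVERT(bigint, 0x0000000030240000) / (1024.0 * 1024.0) AS [MFT size in MB];
-- Returns 770.25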

Just as a curiosity, here is what RAMMap shows on my laptop (Windows 8.1); you can see a rough alignment with the number computed above.

[Screenshot: RAMMap on my laptop, showing MetaFile usage]

FYI, on the real-world server that I was looking at, the size of the MFT was actually 1.5 times the amount of RAM on the box!

Mitigation

Since the server in question was running Windows Server 2008 R2 SP1, we recommended the DynCache sample service, which automatically controls the size of the system cache based on system memory notifications.

SQL Server FileStream

In practice, this issue also applies when you use SQL Server FILESTREAM to store a very large number of blobs in the NTFS file system. In such cases, here are my recommendations:

  1. Consider setting the SQL Server Database Engine to use 'locked pages', with 'max server memory' set appropriately (see the sketch below).
  2. If the OS is Windows 2008 R2 or below, additionally consider using the DynCache service to mitigate the effect of the large MFT.
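
As an illustration of the 'max server memory' part of the first recommendation, here is how the cap can be set with sp_configure; the 16384 MB value is purely a placeholder, so size it for your own workload:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- Cap the buffer pool so the OS retains headroom for the system cache (and the MFT)
EXEC sp_configure 'max server memory (MB)', 16384;
RECONFIGURE;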

Other notes

If you anticipate huge numbers of files being stored on the file system, keep in mind that each file record takes up around 1KB. That means 100 million files will take up close to 100GB worth of MFT storage!

On Windows 2008 R2 and below be aware of issues like the one described in KB article 967351 and install the updated version of NTFS.sys accordingly.

In extreme cases, if you want to achieve optimal performance without compromising system cache memory utilization, remember that system RAM must be sized keeping in mind the very large MFT which might result.

Database size limitation in SQL Express

I would generally not blog about this kind of topic, except that a customer recently brought it up with me. According to his tests, he seemed to find that the total size limitation across all databases was 10GB. That did not sound correct to me, because it is quite clearly documented that the 10GB limit is per database.

I quickly tested and confirmed that the documentation is correct. It is only when you try to create (or grow) an individual database with data files beyond the 10GB size limit that you get the error:

Msg 1827, Level 16, State 4, Server SOMECOMPUTER\SQLEXPRESS, Line 1
CREATE DATABASE or ALTER DATABASE failed because the resulting cumulative database size would exceed your licensed limit of 10240 MB per database.

To summarize:

  1. 10GB is the per-database, data-file size limit
  2. If you have more than one data file in your SQL Express database, the size limit is applied to the total size across all such data files
  3. Log files are not subject to this limit
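
To see the limit in action, here is a minimal T-SQL sketch (the paths and sizes are illustrative). The first statement succeeds on SQL Express, while the second fails with Msg 1827 because the data files of that one database total more than 10GB:

-- Within the limit: a single 9GB data file
CREATE DATABASE WithinLimit
ON (NAME = 'WithinLimit_data', FILENAME = 'C:\Data\WithinLimit.mdf', SIZE = 9GB);
GO

-- Over the limit: two data files totalling 11GB in one database
CREATE DATABASE OverLimit
ON (NAME = 'OverLimit_data1', FILENAME = 'C:\Data\OverLimit1.mdf', SIZE = 6GB),
   (NAME = 'OverLimit_data2', FILENAME = 'C:\Data\OverLimit2.ndf', SIZE = 5GB);
GO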

Tracking TEMPDB internal object space usage in SQL 2012

It is a documented fact that using the MAX specifier can result in TEMPDB usage in specific cases. From the TEMPDB whitepaper, some of these can use ‘internal object’ space within TEMPDB:

“all of the large object types: text, image, ntext, varchar(max), varbinary(max), and all others.”

Now, a FAQ from our customers is how to drill down to session-level usage of TEMPDB when they suspect abnormal growth. Normally, the sys.dm_db_session_space_usage DMV gives us the details we need. But are there any other options to see deeper? We wanted to test this and observe the effects. To do this, we created a simple stored procedure and a corresponding C# console application which passes in a large-valued parameter.

Stored procedure code

create procedure testvarcharmax(@s varchar(max))
as
select @@version

C# application code

Here is the simple front-end application code which invokes the ‘testvarcharmax’ procedure:

using System.Data.SqlClient;
using System.Text;

// Connect to the local default instance; the connection's context database is tempdb
SqlConnection cn = new SqlConnection("Data Source=.;initial catalog=tempdb;integrated security=true");
cn.Open();

SqlCommand cmd = new SqlCommand();
cmd.Connection = cn;
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.CommandText = "testvarcharmax";

// varchar(max) parameter (size -1), filled with a very large string of 'a' characters
SqlParameter parameter = new SqlParameter("s", System.Data.SqlDbType.VarChar, -1);
parameter.Value = new StringBuilder().Insert(0, "a", 1000000000).ToString();
cmd.Parameters.Add(parameter);
cmd.CommandTimeout = 360;

cmd.ExecuteNonQuery();

Monitoring TEMPDB

After executing the above C# code, my TEMPDB quickly grew by leaps and bounds. As any DBA would understand, the essential ask in such cases is to trace the session ID which is generating this TEMPDB space usage. One option is the sys.dm_db_file_space_usage DMV, which is useful here because it reports the aggregate space used by internal objects:

SELECT SUM(internal_object_reserved_page_count) AS [internal object pages used], (SUM(internal_object_reserved_page_count)*1.0/128) AS [internal object space in MB] FROM tempdb.sys.dm_db_file_space_usage;

However, the above does not provide session-level detail. To obtain that, you can use queries on sys.dm_db_session_space_usage and sys.dm_db_task_space_usage, as shown below:

select * from tempdb.sys.dm_db_session_space_usage
order by internal_objects_alloc_page_count desc

select * from tempdb.sys.dm_db_task_space_usage
order by internal_objects_alloc_page_count desc
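
If you also want to see which query a heavy consumer is currently running, the task-level DMV can be joined to sys.dm_exec_requests and the statement text retrieved; here is a sketch of that pattern:

-- Top internal-object consumers, with the currently executing statement (if any)
SELECT tsu.session_id,
       tsu.internal_objects_alloc_page_count,
       tsu.internal_objects_dealloc_page_count,
       st.text AS [query text]
FROM tempdb.sys.dm_db_task_space_usage AS tsu
LEFT JOIN sys.dm_exec_requests AS req
    ON tsu.session_id = req.session_id AND tsu.request_id = req.request_id
OUTER APPLY sys.dm_exec_sql_text(req.sql_handle) AS st
ORDER BY tsu.internal_objects_alloc_page_count DESC;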

Update: Adam Machanic pointed out that it is much easier and more lightweight to use the sys.dm_db_task_space_usage and sys.dm_db_session_space_usage DMVs to track internal object usage for a given task / session. Please use the detailed approach described below only if you have a specific need to investigate lower-level details, such as a potential product issue.

Digging deeper

Suppose you suspect a 'leak' in TEMPDB objects of some sort, and you need to peer deep into TEMPDB page allocation / deallocation. KB article 947204 provides an interesting approach using the allocation ring buffer, enabled with trace flag 1106. This can be useful in rare cases, but first note the important disclaimer about this flag:

Note We recommend that you do not use trace flag 1106 on production servers because trace flag 1106 affects the performance of SQL Server.

Note that the query specified therein does not return results on SQL 2012, even after enabling trace flag 1106. It turns out that the query requires a minor change, because the XML schema and the name of the ring buffer have changed in SQL 2012. The new name for the ring buffer is RING_BUFFER_SPACEMGR_TRACE. Here is the modified query, reflecting the changes in the XML schema as well:

SELECT TOP 500
  record.value('(Record/@id)[1]', 'int') AS record_id,
  CONVERT (varchar, DATEADD (ms, -1 * ((inf.cpu_ticks / (inf.cpu_ticks / inf.ms_ticks)) - [timestamp]), GETDATE()), 126) AS EventTime,
  [timestamp],
  record.value('(Record/@id)[1]', 'int') AS RingBuffer_Record_Id,
  record.value('(Record/SpaceMgr/Event)[1]', 'int') AS AllocationEventType,
  record.value('(Record/SpaceMgr/SpId)[1]', 'int') AS SpId,
  record.value('(Record/SpaceMgr/EcId)[1]', 'int') AS EcId,
  record.value('(Record/SpaceMgr/PageId)[1]', 'nvarchar(50)') AS AllocatedPageId,
  record.value('(Record/SpaceMgr/AuId)[1]', 'nvarchar(50)') AS AllocationUnitId,
  record.value('(Record/SpaceMgr/LsId)[1]', 'nvarchar(50)') AS LsId
FROM sys.dm_os_sys_info inf CROSS JOIN (
  SELECT timestamp, CONVERT (xml, record) AS record
  FROM sys.dm_os_ring_buffers
  WHERE ring_buffer_type = 'RING_BUFFER_SPACEMGR_TRACE'
  AND ( record LIKE '%<Event>23</Event>%' -- uniform extent allocation
     OR record LIKE '%<Event>22</Event>%' -- uniform extent deallocation
     OR record LIKE '%<Event>24</Event>%' -- mixed extent allocation
     OR record LIKE '%<Event>25</Event>%' -- mixed extent deallocation
     OR record LIKE '%<Event>10</Event>%' -- page allocation
     OR record LIKE '%<Event>11</Event>%' -- page deallocation
  )) AS t
ORDER BY record.value('(Record/@id)[1]', 'int') ASC

The results of the ring buffer query are too detailed for most real-time scenarios, and the overhead associated with the trace flag makes it impractical compared to the sys.dm_db_session_space_usage DMV. But if you do use the approach from KB 947204 on SQL 2012 for some really low-level debugging, please take note of the above changes.

Reference

Queries useful for troubleshooting TEMPDB space usage

Managed Service Accounts (MSA) and SQL 2012: Practical Tips

One of the most common dilemmas for SQL Server administrators is whether they should use AD-based domain users as service accounts, or whether they can leverage built-in accounts like Network Service. If your SQL Server instance is never going to participate in any cross-server contexts (such as Availability Groups, Linked Servers, Log Shipping and so on) then you may very well be happy with one of the built-in accounts or, even better in SQL 2012, the special 'Virtual Account' feature.

While these options relieve you of the overhead of periodically changing and syncing the service account password, they do impose a machine boundary and cause problems when you want to 'jump' across instances. So the classic solution has been to grit one's teeth and ask for a domain user, which is then configured as the service account. However, this brings the hassle of periodic password maintenance, and more importantly, each password change causes downtime.

Or, you can configure a SQL 2012 standalone instance to use the new Managed Service Accounts feature in Windows 2008 R2 and above. To do this, follow the steps below.

Set up the MSA in Active Directory

First, create a new MSA in AD using the PowerShell cmdlet. There are some simple prerequisites documented here; the most important ones are to be a domain administrator, to be on Windows 2008 R2 or above, and to have the right PowerShell modules installed (Remote Server Administration Tools includes an AD PowerShell module which you must install.)

Import-Module ActiveDirectory

New-ADServiceAccount -Name TestSQLMSA -Enabled $true

Next, associate the above MSA with the computer you wish to use it on. Note that an MSA can only be used on one destination computer at a time. In this case, I will specify my lab computer named W2K8R2CN3.

Add-ADComputerServiceAccount -Identity W2K8R2CN3 -ServiceAccount TestSQLMSA

Install the MSA on the target server

Once this is done, switch to the W2K8R2CN3 computer (which happens to be a Windows 2008 R2 Core installation) and 'install' the MSA on that computer. On a Core installation, to use the ActiveDirectory cmdlets you must still install the RSAT-AD-PowerShell feature, but do it directly using DISM:

DISM /online /enable-feature /featurename=ActiveDirectory-PowerShell

Once that is done, you can easily associate the MSA on the target computer:

Install-ADServiceAccount TestSQLMSA

Important: To do this correctly, you must be a domain administrator. If you run the Install-ADServiceAccount cmdlet and you are not a domain admin, it exits silently, but later, when you change the service account and try to start SQL, you will receive an error:

“Error 1069: The service did not start due to a logon failure” and HRESULT 0x8007042d.

If you get those errors, have a domain admin logon to the target server and re-run Install-ADServiceAccount for you.

Change the SQL Service account

Last but not least: use SQL Server Configuration Manager to execute the service account change. An important note: when specifying the 'user name' for the MSA, you must fully qualify it with the domain name and add a trailing $ sign as well. If you do not specify the $, you will receive the error message:

“Invalid parameter [0x80041008]”

As an example, in my test setup, here is how Configuration Manager looks:

[Screenshot: SQL Server Configuration Manager showing the MSA, with a trailing $, as the service account]

Supportability questions

MSA and older SQL versions: Though you may be able to configure and / or use an MSA with SQL 2008 R2 or older versions, it is officially untested. Notably, the documentation for SQL 2008 R2 service accounts is silent on this topic.

Group Managed Service Accounts (GMSA): As of Feb 2014, Group Managed Service Accounts are NOT officially supported with SQL 2012 Failover Clustered Instances. This is under review by the Product Group for future releases.

Please leave your questions and comments below! I’d be very happy to hear from you. Till next time, bye!

What happens next? SQL with Windows login / multiple groups…

This is a very common case: take a Windows user who belongs to multiple groups. Next, grant more than one of those groups as logins to SQL Server. The question then is: which permissions will that session inherit?

The answer, it turns out, is that the permissions are both additive and most restrictive. When a Windows user authenticates to SQL Server, the login token is populated with ALL the groups that the user is associated with. You can view this information through the sys.login_token view, and join it with sys.server_principals to get more information. At the database level, you can view the user token using the sys.user_token view.
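
As a quick illustration of that join (a sketch; pick whichever sys.server_principals columns interest you):

-- Groups and roles present in the current session's login token
SELECT lt.principal_id, lt.name, lt.type, lt.usage, sp.create_date
FROM sys.login_token AS lt
LEFT JOIN sys.server_principals AS sp ON lt.sid = sp.sid;

-- The database-level equivalent
SELECT * FROM sys.user_token;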

Additive: Example

Here is a simple example of what I am talking about. Windows user CONTOSO\multigroup belongs to two groups, as can be seen here in the output of whoami /groups:

CONTOSO\group2                   Group            S-1-5-21-1304351064-1365540280-1243399584-1119 Mandatory group, Enabled by default, Enabled group
CONTOSO\group1                   Group            S-1-5-21-1304351064-1365540280-1243399584-1118 Mandatory group, Enabled by default, Enabled group

We then grant these groups logins and map those logins to users in a test database:

USE [master]
GO

CREATE LOGIN [CONTOSO\group1] FROM WINDOWS WITH DEFAULT_DATABASE=[master], DEFAULT_LANGUAGE=[us_english]
GO

CREATE LOGIN [CONTOSO\group2] FROM WINDOWS WITH DEFAULT_DATABASE=[master], DEFAULT_LANGUAGE=[us_english]
GO

USE [testdb]
GO

CREATE USER [CONTOSO\group1] FOR LOGIN [CONTOSO\group1]
GO

CREATE USER [CONTOSO\group2] FOR LOGIN [CONTOSO\group2]
GO

Here is the information for the login token:

principal_id  sid                                                          name            type           usage
2             0x02                                                         public          SERVER ROLE    GRANT OR DENY
268           0x01050000000000051500000058D1BE4DB87D6451A0C51C4A5E040000   CONTOSO\group1  WINDOWS GROUP  GRANT OR DENY
269           0x01050000000000051500000058D1BE4DB87D6451A0C51C4A5F040000   CONTOSO\group2  WINDOWS GROUP  GRANT OR DENY

As you can see, the login token includes both groups. Next, let us view the user token within the testdb database:

principal_id  sid                                                          name            type           usage
0             0x01050000000000090400000083741B006749C04BA943C02702F2A762   public          ROLE           GRANT OR DENY
5             0x01050000000000051500000058D1BE4DB87D6451A0C51C4A5E040000   CONTOSO\group1  WINDOWS GROUP  GRANT OR DENY
6             0x01050000000000051500000058D1BE4DB87D6451A0C51C4A5F040000   CONTOSO\group2  WINDOWS GROUP  GRANT OR DENY

This confirms what we discussed earlier: the membership is additive.

Most Restrictive: Example

If we now proceed to deny login permissions to GROUP1:

DENY CONNECT SQL TO [CONTOSO\group1]
GO

Then our login for CONTOSO\multigroup fails:

Login failed for user 'CONTOSO\multigroup'. (Microsoft SQL Server, Error: 18456)

Conclusion

The SQL Server engine security model is very powerful and flexible, but at the same time it can be confusing in some scenarios. Hopefully this post has cleared up an FAQ around multiple group memberships for a Windows user, and shown how the effective permissions are both additive and most restrictive.

That’s it for now! See you soon!

Indexed views with remote tables–possible?

Once upon a time, this question came up during an onsite visit: someone wanted to create a 'materialized' copy of remote data locally, as an indexed view. I had never heard that requirement before, but since the motivation was genuine, I decided to validate it.

There is a long list of what is permitted (and what is not) in indexed views. One fundamental requirement is that the view on which the index is created must be schema-bound. That fundamental requirement nails the answer straight away: any attempt to create a schema-bound view which references a remote table fails with SQL error 2014:

Remote access is not allowed from within a schema-bound object
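
Here is a minimal repro sketch; the linked server REMOTESRV and the table names are hypothetical:

-- The four-part name below is what trips the schema-binding check
CREATE VIEW dbo.RemoteOrders
WITH SCHEMABINDING
AS
SELECT OrderID, OrderDate
FROM REMOTESRV.SalesDb.dbo.Orders;
GO
-- Fails with Msg 2014: Remote access is not allowed from within a schema-bound object.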

Now that it is clear that such a 'projection' is not possible, we have to step back and understand the real problem behind the requirement. In most cases, such ideas come up to improve the performance of distributed queries (DQ). For all such cases, I suggest you first review the excellent talk from Conor on DQ.

But in many cases, especially where multiple remote data sources are involved, probably the best way to optimize the queries is to consolidate the data inside a data mart or warehouse, and then consider the columnstore index (or the new clustered columnstore in SQL 2014) feature to optimize the aggregation-type queries which are typical in those scenarios.
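
For example, once the remote data has been consolidated into a local fact table (the table name here is illustrative), a SQL 2014 clustered columnstore index is a single line of DDL:

-- SQL 2014: compress the consolidated fact table and speed up aggregation queries
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;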

“Could not add cluster access…” during SQL 2012 Failover Cluster Instance setup

Today, a lab setup of SQL 2012 SP1 + CU7 (I was using the /UpdateSource flag to get a patched setup and roll in the updated bits in one go) failed sometime before creating the SQL resource in the cluster. The error message from the detail.txt file showed:

Could not add cluster access for SID 'S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738'. Error: There was a failure to call cluster code from a provider. Exception message: The specified path is invalid.

Preliminary Analysis

The SID (S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738) in this case happened to be the SQL service SID. How did I know that? You can find out too, using the SC SHOWSID command:

C:\Windows\system32>sc showsid mssql$inst1

NAME: mssql$inst1
SERVICE SID: S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738

This issue occurs when the setup routine fails to add the service account’s SID to the list of accounts allowed to access the Windows cluster itself. For example, in my test cluster (in a healthy state, of course) you can see the Service SID is added to the list of authorized accounts in the security descriptor of the Windows cluster itself:

image

But of course in my failed setup attempt, the SID was not added correctly. The big question is why?

(Another side question some of you may have is why the service account should have this permission on the cluster itself. Well, that’s another topic and another blog post – stay tuned for that!)

Digging deep: Cluster Log

Carrying on from the previous section, one helpful hint is to know that when such changes are made to the cluster’s ACL configuration, the changes are persisted to the cluster database, a copy of which is also persisted to the quorum disk resource (if one was configured.)

Now in my case, deeper troubleshooting was required to find out why the security descriptor could not be written. To find out, I dumped the cluster log (using the PowerShell cmdlet Get-ClusterLog) and found the following messages at the same time that the SQL setup failed:

00001004.00000fe0::2014/01/23-07:20:05.486 INFO  [RES] Physical Disk <RealWitness>: Path W:\Cluster is not on the disk
00001004.00000fe0::2014/01/23-07:20:05.486 ERR   [RHS] Error 161 from ResourceControl for resource RealWitness.
00000fb8.000011fc::2014/01/23-07:20:05.486 WARN  [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to RealWitness returned 161.
00000fb8.000011fc::2014/01/23-07:20:05.585 ERR   [RCM] rcm::RcmApi::SetQuorumResource: 'ERROR_BAD_PATHNAME(161)' because of 'ValidateQuorumPath( pRes, quorumPath )'

W:\Cluster did not make sense initially, because that drive was supposed to be unused by this instance of SQL. I had to think back over the sequence of operations I had performed during this setup…

Root cause found

Looking back, I remembered that I had swapped the quorum disk resource somewhere halfway through the setup. It was earlier on the W: drive, but was now on the Q: drive, which has the resource name RealWitness. So in short, the resource name for the quorum disk was correct, but the drive letter and path on that disk were stale. That stale information was used when the setup program tried to update the cluster configuration (which in turn writes to the quorum disk).

Important: Please note that this is the specific root cause for my specific situation. There may be other kinds of problems when the setup program tries to set cluster access for the SID. One has to look at the inner exception message (which in my case was 'specified path is invalid') to be sure. The true root cause in other cases can typically be found by correlating with the cluster log file.

Solution

The solution in this case was to remove the SQL bits (from Control Panel) and then simply re-run the setup program without any changes. That was the cleanest way out, and the moral of the story is: never tweak the cluster quorum in the middle of a SQL setup procedure!

Exotic spinlocks: XTS_MGR

My friend and colleague Fabricio Catae has blogged about spinlocks previously. In one of the comments, a visitor asked him about the XTS_MGR spinlock. There is no public documentation about this spinlock, because it is largely an implementation-specific detail.

However, a general explanation is that this spinlock is taken out in many functions within SQL Server, most of which are responsible for version store operations (when you enable Read Committed Snapshot, for example). So if you ever face contention on this spinlock, one thing to consider is the effect of highly concurrent and extensive DML workloads, which generally work the version store much harder. Beyond that, it is best to contact Microsoft CSS if you suspect that this contention is causing problems on your instance of SQL.
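
If you want to check whether this spinlock is even hot on your instance, the (undocumented, but widely used) sys.dm_os_spinlock_stats DMV, together with the version store page counts, is a reasonable first look; treat the following as a hedged sketch:

-- Spins and collisions recorded for the XTS_MGR spinlock
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name = 'XTS_MGR';

-- Current size of the version store in TEMPDB
SELECT SUM(version_store_reserved_page_count) / 128.0 AS [version store MB]
FROM tempdb.sys.dm_db_file_space_usage;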

If you have more questions, or have seen other spinlocks you are curious about, do leave a comment. I’ll see what we can explain regarding those!

Top Support Solutions for SQL Server

One of the teams within the Customer Support organization at Microsoft has been busy aggregating information about the most common and useful solutions for many products, including SQL Server. You can view their work at this blog post.

The categories they cover in that post include:

  • Performance
  • Failover Clustering
  • Crashes and Exceptions
  • Backup & Restore
  • Connectivity

It is a good collection of key articles in one place. I encourage you to bookmark it!

Normalizing T-SQL text, part 2: using the TransactSql.ScriptDom classes

Happy New Year 2014 to all of you! With SQL Server 2014 set to release this year, I’m sure you are all excited about the months to come.

In my previous post I reviewed the classic ways of obtaining 'normalized' text for ad-hoc SQL queries. Do take a minute to glance at that post in case you have not already done so. Both of the methods described there are dynamic: they need an active workload to operate upon. So if you have a static set of queries captured somewhere, such as a .SQL file, we need an alternate method.

Algorithm

If you think about it, the core of normalizing these ad-hoc query texts is to identify literals and replace them with a generic / common value. Once the specific literal values are replaced with generic 'placeholders', it becomes a relatively easy task to identify commonalities.

To identify commonalities we propose to use a hashing algorithm, conceptually similar to the one used in the previous approaches. However, when computing hashes, there is another problem to consider: minor differences in whitespace or in the case of the query text will produce different hash values for essentially identical text.
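
Here is a quick T-SQL illustration of that problem (the query strings are arbitrary): the two statements below are logically identical, yet a naive hash over the raw text differs.

-- Same query, different whitespace and case: the SHA1 hashes do not match
SELECT HASHBYTES('SHA1', 'SELECT * FROM ABC WHERE i = 1') AS hash1,
       HASHBYTES('SHA1', 'select  *  from abc where i = 1') AS hash2;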

ScriptDom Implementation

The first thing to consider is what kinds of literals we would replace. In the ScriptDom class hierarchy, the following classes derive from the parent Literal class:

  • IntegerLiteral: whole numbers
  • NumericLiteral: decimal numbers such as 0.03
  • RealLiteral: numbers written with scientific notation such as 1E-02
  • MoneyLiteral: values prefixed with a currency symbol such as $12
  • BinaryLiteral: such as 0xABCD1234
  • StringLiteral: such as 'Hello World'
  • NullLiteral: the NULL value
  • DefaultLiteral: the DEFAULT keyword
  • MaxLiteral: the MAX specifier
  • OdbcLiteral: an ODBC-formatted literal such as { T 'blah' }
  • IdentifierLiteral: a 'special' case when an identifier is used as a literal. I have never seen a real-world example of this!

We need to keep this in mind when we write the visitor to traverse through the AST.

Visitor definition

Next, we need to set up our visitor. We will use the Visitor pattern, implementing overridden methods to handle the various types of literals described above. For each type of literal, we replace the value of the literal with a fixed, generic value. Here is an example for the RealLiteral visitor method:

public override void ExplicitVisit(RealLiteral node)
{
    // Replace the real literal's value with a fixed, generic placeholder
    node.Value = "0.5E-2";
    base.ExplicitVisit(node);
}

Visitor invocation

For performance reasons we call the visitor inside a Parallel.ForEach loop, which makes efficient use of multi-core CPUs:

Parallel.ForEach(
    (frag as TSqlScript).Batches,
    batch =>
    {
        // Each batch gets its own visitor instance, so no state is shared across threads
        myvisitor visit = new myvisitor();
        batch.Accept(visit);
    });

This way, each T-SQL batch in the original script is visited on a separate thread.

Regenerate the Script

Once the visitor has done its job of 'normalizing' the various literals encountered, the next step is to regenerate the script from the tokens already obtained. That takes care of one of the two problems we spoke about, namely whitespace. We can do that using one of the many SqlScriptGenerator classes available (there is one for each compatibility level.) In the code snippet below, scrgen is an instance of one of those classes and script holds the output:

scrgen.GenerateScript(batch, out script);

Calculate the hash

Once the normalized script is regenerated by the SqlScriptGenerator class, it can be run through a proper hash algorithm (in this sample we use SHA1) to calculate the hash value of the given script. This is also where we handle the case-sensitive / case-insensitive nature of the script:

  • For case-insensitive cases, we simply convert the generated script to lower case before computing the hash.
  • For case-sensitive cases, we calculate the hash as-is on the generated script.

using (var hashProvider = new SHA1CryptoServiceProvider())
{
    if (caseSensitive)
    {
        hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script)));
    }
    else
    {
        hashValue = Convert.ToBase64String(hashProvider.ComputeHash(Encoding.Unicode.GetBytes(script.ToLowerInvariant())));
    }
}

Track unique hashes

We can use a simple Dictionary class in .NET to track the unique hashes, along with a usage count for each bucket. Each bucket also tracks an example of the batch (the original text itself.)

Sample output

The sample project, when compiled and executed as shown below, gives the following output.

Command line

TSQLTextNormalizer.exe c:\temp\input.sql c:\temp\output.sql 110 false

Input

select * from ABC
where
i = 1
GO

select * from abC where i = 3
GO

Output

-- 2 times:
select * from ABC
where
i = 1
GO

That’s it! You can use this in many cases, limited only by your imagination. More importantly, I hope it showed you the power and flexibility of the ScriptDom classes.

[EDIT 2022-07-27] The sample code for this post is now on GitHub. Please upvote the repo on GitHub if you liked this post and found the sample code useful. The URL is https://github.com/arvindshmicrosoft/SQLScriptDomSamples/tree/main/8_TSQLTextNormalizer

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code.

This posting is provided “AS IS” with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.