The ‘aggregate concatenation’ T-SQL (anti-)pattern

[Update 20 March 2019: this functionality is now natively provided in SQL Server 2017+ / Azure SQL DB through the STRING_AGG intrinsic]
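For readers on SQL Server 2017 or Azure SQL DB, here is a minimal sketch of the STRING_AGG equivalent of the concatenation discussed below (it uses the same AdventureWorks table as the examples that follow; the CAST avoids the 8000-byte limit on the aggregated result):

-- Sketch only: STRING_AGG requires SQL Server 2017+ / Azure SQL DB
SELECT STRING_AGG(CAST(FirstName + LastName AS NVARCHAR(MAX)), ',') AS RawResult
FROM AdventureWorks2016CTP3.Person.Person;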

Some time back, Dimitri Furman, who works as a Senior Consultant in the Microsoft Services team, shared a very useful tip. It pertains to a pattern of usage wherein strings from multiple rows are concatenated into a single large (typically comma-delimited) string. The usual way we see this being accomplished is with code like the below:

DECLARE @res NVARCHAR(max)

SELECT @res = COALESCE(@res + ',', '') + FirstName + LastName
FROM AdventureWorks2016CTP3.Person.Person

SELECT @res

While the above is quite simple to read, it is quite slow (as you will see soon) and, worse, it does not ‘inline’, meaning you cannot use it inside an inline table-valued function. This last property was very important for us because, in the case we were looking at, the above COALESCE-based concatenation code was in a scalar UDF, which is clearly a big ‘no-no’ from a performance perspective. Also, we did not want to utilize SQLCLR, so the usage of the GROUP_CONCAT CLR UDA was ruled out.

Now, here is a much better way to refactor the above concatenation ‘loop’:

SELECT STUFF((
SELECT ',' + FirstName + LastName
FROM AdventureWorks2016CTP3.Person.Person
FOR XML PATH('')
,TYPE
).value('.', 'nvarchar(max)'), 1, 1, '') AS RawResult

On first glance, it may seem counter-intuitive that the second version could be faster, but that is indeed the case. For the above queries, here is the comparison:

COALESCE loop: CPU time = 7969 ms, elapsed time = 7984 ms
FOR XML PATH: CPU time = 15 ms, elapsed time = 16 ms

The usage of XML PATH in this way is not exactly ‘breaking news’ but the performance difference is dramatic. And more importantly for the case we were discussing at work, the XML PATH based code can be inlined, making it suitable for use in TVFs. And in our real case, that’s exactly what we wanted to do (refactor the UDF into a TVF.)
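To illustrate that last point, here is a minimal sketch of how the FOR XML PATH approach can be wrapped in an inline table-valued function; the function name and parameter are purely illustrative, not from the actual customer case:

-- Hypothetical inline TVF wrapping the FOR XML PATH concatenation
CREATE FUNCTION dbo.GetConcatenatedNames (@LastNameStartsWith NVARCHAR(50))
RETURNS TABLE
AS
RETURN
(
    SELECT STUFF((
        SELECT ',' + FirstName + LastName
        FROM AdventureWorks2016CTP3.Person.Person
        WHERE LastName LIKE @LastNameStartsWith + '%'
        FOR XML PATH('')
        ,TYPE
        ).value('.', 'nvarchar(max)'), 1, 1, '') AS RawResult
);
GO

-- Example usage
SELECT RawResult FROM dbo.GetConcatenatedNames(N'A');

Because the function is an inline TVF, the optimizer can expand it into the calling query, which was not possible with the scalar UDF + COALESCE approach.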

Have a great week ahead!

T-SQL ranking functions to the rescue once again!

This week I happened to assist with an internal application where there was a performance issue with a specific query. One of my observations with that query was that it was doing self-joins in order to determine the latest iteration of a specific record. That’s a common anti-pattern which can be fixed by using T-SQL’s windowing functions, such as ROW_NUMBER() or RANK().

The Inefficient Way

To give more context, here is an example from the Contoso Retail DW database. The requirement (very similar to the actual customer case) here is to obtain a list of all the customers who have placed an order with us, and for each of those customers, obtain the date of the latest order. The implementation that I first noticed was using a self-join as given below; notice the special case handling where the customer has 2 orders on the same day and then the order number (the OnlineSalesKey column here) becomes the tie-breaker:

SELECT l.CustomerKey
	,l.DateKey AS LatestOrder
FROM dbo.FactOnlineSales l
LEFT JOIN dbo.FactOnlineSales r ON l.CustomerKey = r.CustomerKey
	AND (
		l.DateKey < r.DateKey
		OR (
			l.DateKey = r.DateKey
			AND l.OnlineSalesKey > r.OnlineSalesKey
			)
		)
WHERE r.DateKey IS NULL

This query is *extremely* inefficient, burning 100% CPU on my i7 laptop with 8 logical CPUs! It runs for well over 10 minutes before I get impatient and cancel it. For reference, here is the estimated execution plan for the above query (scroll to the right and note the query cost of 167692).

[Image: estimated execution plan for the self-join query]

Just for kicks, I ran the above query with MAXDOP 144 (NOT a good idea, but this was for fun) on our lab machine which has 144 logical CPUs and here is the picture I see there 🙂 Obviously NOT something you want in production!

[Image: CPU usage across all 144 logical processors]

Rewriting the query

Now, the efficient way to re-write this is to use Ranking Functions in T-SQL. These have been around a while (SQL 2005 actually!) but I feel they are under-utilized. Here is the query re-written using the ROW_NUMBER() function. This solution also elegantly takes care of the above tie-breaking logic which required the disjunction (OR predicate) in the previous query.

WITH NumberedOrders
AS (
	SELECT CustomerKey
		,Datekey
		,ROW_NUMBER() OVER (
			PARTITION BY CustomerKey ORDER BY Datekey DESC
				,OnlineSalesKey DESC
			) AS RowNumber
	FROM FactOnlineSales
	)
SELECT CustomerKey
	,Datekey
FROM NumberedOrders
WHERE RowNumber = 1

Here is the new execution plan. Note that the cost is also much lower: 447.

[Image: execution plan for the ROW_NUMBER() rewrite]

Here are the execution statistics of this query; it completes in 43 seconds, compared to the self-join approach which runs seemingly forever!

  • Table ‘FactOnlineSales’. Scan count 9, logical reads 92516, physical reads 0, read-ahead reads 72663, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 31263, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
  • SQL Server Execution Times:
  • CPU time = 19231 ms, elapsed time = 43834 ms.

Moral of the story: the next time you see query patterns involving self-joins to obtain the ‘latest record of a particular type’, take a pause and see if T-SQL ranking functions can help!

Common sub-expression elimination in SQL Server: what you need to know

In working with our customers, I come across some very interesting patterns in T-SQL. Not all of these patterns are complex on the face of it, but their impact can sometimes be substantial. Here is a simple example to demonstrate what I am referring to:

SELECT CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col1,
       CASE WHEN EXISTS (SELECT *
                         FROM   Person.Person AS P
                         WHERE  P.BusinessEntityID = SOD.ProductID) THEN 1 ELSE 0 END AS Col2
FROM   Sales.SalesOrderDetail AS SOD;

This query is clearly fictional, but it will suffice for our walkthrough here 🙂 Now here’s the execution plan for the above statement (you can also test this by using AdventureWorks2012 and above):

[Image: execution plan showing Person.Person accessed twice]

Observe that the Person.Person table is accessed twice in the above plan. In the real customer scenario I was looking at, the table involved was accessed using a nested loop join, for a total of 13 million seeks. And those 13 million seeks were repeated for the second usage of the subquery.

My attempt to rewrite the query was to use a LEFT OUTER JOIN syntax, as reflected in this simple example:

SELECT CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END AS Col1,
       CASE WHEN P.BusinessEntityID IS NOT NULL THEN 1 ELSE 0 END AS Col2
FROM   Sales.SalesOrderDetail AS SOD
LEFT OUTER JOIN Person.Person AS P ON P.BusinessEntityID = SOD.ProductID;

Here is the execution plan for the optimized version:

[Image: execution plan for the rewritten query, with Person.Person accessed once]

These two queries are functionally equivalent (because in this case there is a unique key on the BusinessEntityID column) and return the same set of results. As you can see, the Person table is accessed just once in the revised query. The cost of the rewritten query is also roughly six times lower than that of the original! This kind of optimization is referred to as common subexpression elimination. There are some strategies used by SQL Server (see this paper and this one as well for some ideas), but not every possible scenario (such as the above one) can be optimized. Also, this paper from Microsoft Research covers future ideas on the topic.

For now, it is best to review your T-SQL code for multiple instances of the same subquery and replace those with an explicit JOIN instead.

Unable to launch runtime for ‘R’ script: Check this first

This post is a quickie, hoping that the tip will come in handy to others who are facing the same issue.

Note: the steps below were written for SQL Server 2016 pre-release versions. They are NOT valid for RTM and subsequent releases.

Today I was trying to enable the R Services (a.k.a. Advanced Analytics) feature in SQL Server 2016, which brings the world of R into SQL Server. I thought I had diligently followed the instructions on installing and configuring the integration. Next, I tried to execute the basic script below (sort of like a ‘loopback’ test for this feature) to check whether the configuration was okay:

EXEC sp_execute_external_script @language = N'R',
    @script = N'OutputDataSet <- InputDataSet',
    @input_data_1 = N'SELECT 1 as Col'
WITH RESULT SETS ((col int not null));

Unfortunately it kept returning the error messages below:

Msg 39021, Level 16, State 1, Line 1
Unable to launch runtime for ‘R’ script. Please check the configuration of the ‘R’ runtime.
Msg 39019, Level 16, State 1, Line 1
An external script error occurred:
Unable to launch the runtime. ErrorCode 0x80070490: 1168(Element not found.).
Msg 11536, Level 16, State 1, Line 1
EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time.

I went back to the instructions and realized I had forgotten one step, which was to download an updated version of the post-installation configuration script from the Microsoft Download Center. Clicking on the previous link actually gives you an updated registerRext.exe. Running registerRext.exe /uninstall followed by registerRext.exe /install with this updated version fixed my problem!

Do note that the errors mentioned above could have been caused by other issues, but at a minimum please ensure that you are using the updated registerRext.exe.

What’s new in the SQL Server 2016 CTP 3.3 ScriptDom Parser

We just made SQL Server 2016 CTP 3.3 publicly available a few days ago, and it has some new features which are also reflected in the managed T-SQL parser (Microsoft.SqlServer.TransactSql.ScriptDom). As readers of this blog may know, I am a big fan of this parser and to see it keeping pace with the latest preview release of the SQL Server engine feels great!

Here are the main changes you can leverage if you are using the SQL Server 2016 CTP 3.3 ScriptDom parser:

  • The ScriptDom parser is up to date with the new support for delayed compression of delta rowgroups in Columnstore indexes: the new COMPRESSION_DELAY option in ALTER INDEX helps if you are using non-clustered indexes in a real-time analytics scenario. See this article for more details.
  • The RSA_3072 and RSA_4096 algorithms are now recognized by the ScriptDom parser. These are typically used in the CREATE ASYMMETRIC KEY syntax. (Both additions are illustrated in the sketch below.)
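To make these concrete, here is a hedged sketch of T-SQL that exercises both additions; the table, index and key names (and the password) are purely illustrative:

-- COMPRESSION_DELAY on a columnstore index (illustrative names)
ALTER INDEX NCCI_Sales ON dbo.Sales
SET (COMPRESSION_DELAY = 10 MINUTES);

-- Asymmetric key using one of the newly recognized algorithms
CREATE ASYMMETRIC KEY MyAsymKey
WITH ALGORITHM = RSA_4096
ENCRYPTION BY PASSWORD = N'Str0ng!Passw0rd';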

I also wanted to point anyone who is planning to upgrade a previous installation of the SQL Server 2016 preview (typically CTP 3.2) to the release notes, which contain a very important step (if you are using FILESTREAM) that you must take BEFORE upgrading to CTP 3.3. So please take a minute and check the release notes prior to upgrading!

Hope this helps!

Error 0xC1900101 – 0x20017 upgrading to Windows 10 build 10162

We are all excited about the upcoming release of Windows 10 on the 29th of July. In the meantime, you can install a preview version from the Windows Insider web page, or if you’d like to wait, you can reserve your free upgrade by following the steps here.

At the beginning…

At Microsoft, we are at the forefront of the ‘self-hosting’ ballgame. Some time back, I had installed Windows 10 preview build 10074. This week, with the newer builds being released in rapid succession, I decided to switch the Preview Build frequency to ‘Windows Insider Fast’ so that I could get hold of the latest (at the time) 10162 build:

image

Everything seemed to work alright with the update to 10162…

image

… till the point where it rebooted the computer:

image

 

Ahem…

Instead of seeing the usual ‘Upgrading Windows, sit tight..’ screen, this is what I saw:

clip_image002

Well, the keyboard layout screen was certainly not expected! But when I selected the US keyboard, it asked me whether I wanted to Troubleshoot or shut down, which again is quite strange:

clip_image004

So I played along and tried to troubleshoot, at which stage I was basically left with options such as Reset the PC:

clip_image006

And even Reset PC failed with a helpful error message that the ‘drive where Windows is installed is locked’:

clip_image010

Left with no option – post reboot, I was greeted with the infamous error 0xC1900101 – 0x20017, which has plagued some users in the Windows 8.1 upgrade as well:

image

 

Some clues

At this time, I also noticed a very strange issue: any USB mass storage device I would plug in to the laptop was not assigned a drive letter. It would show up in Disk Management, but when I would try to give it a drive letter, Disk Management would popup a cryptic message: ‘The operation failed to complete because the Disk Management console view is not up-to-date. Refresh the view by using the refresh task. If the problem persists close the Disk Management console, then restart Disk Management or restart the computer’.

Well, this was getting stranger and stranger! But somehow, it appeared that the two issues were inter-related. Internally, I was told that the error code 0xC1900101 – 0x20017 was also related to USB drivers. And the above Disk Management error had in the past been attributed to ‘ghost’ drives.

Hide and seek

With that clue in mind, I took a look at my Device Manager screen and enabled ‘Show Hidden Devices’ under the View menu. In my case, I found some devices under Storage Volumes which only showed up when Hidden Devices were enabled. And those were clearly not attributable to any devices I had installed on my laptop (i.e. I had nothing related to Roxio Saib in reality on my laptop):

image

So I decided to delete these hidden Roxio devices (in another case where I had seen an identical problem the devices showed up as Unknown Device under Storage Volumes, in which case I proceeded to delete those Unknown Devices under Storage Volumes).

WARNING: Using Device Manager incorrectly can cause serious problems that may require you to reinstall Windows. Microsoft cannot guarantee that problems resulting from deleting incorrect devices can be solved. Use these steps at your own risk. Please note again that in this walkthrough, we have only deleted Hidden, Unknown devices which were listed under the Storage Volumes category.

clip_image002[5]

There were two of these Roxio Saib devices (which clearly did not apply to my laptop) and I proceeded to clean both instances up, followed by a reboot:

image

 

We’re there!

After the reboot, I re-tried the Update:

image

At the end of this, and post restart, lo and behold! The expected ‘Upgrading Windows’ screen appeared:

clip_image002[7]

And after around 20 minutes, the upgrade to 10162 was done successfully!

clip_image004[5]

For those of you who had the misfortune to run into the error code 0xc1900101 – 0x20017 when upgrading to Windows 10, I hope this walkthrough helps you get unblocked! And in case you are still unsure or not confident of what to do, please contact Microsoft Customer Services and Support to get more direct help on your issue.

The strange case of the large LiveKernelReports folder

Some time back, I ran into a bit of a space crunch on the C: drive of my laptop which runs Windows 8.1. On digging a bit, I found a 2GB+ file at C:\Windows\LiveKernelReports\WinsockAFD-20150114-1722.dmp. Now, this was the first time I had seen a folder called LiveKernelReports and definitely the first time that I had seen a dump file with the name WinsockAFD*.dmp.

Important note: if you are not a developer and came here trying to figure out what to do with the files in this folder, please proceed directly to the ‘So What?’ section below.

Inside the Dump File

The first thing I did (of course 🙂) was to open up the dump file under the WinDbg debugger. In kernel-mode dumps, the !analyze -v command generally gives good analysis results, so I decided to start from there (the full output of !analyze is at the end of this post as an Appendix).

Firstly, the bugcheck code was 0x156. If you are a developer and you have the Windows 8.1 SDK, you will find the bugcheck names in C:\Program Files (x86)\Windows Kits\8.1\Include\shared\bugcodes.h. 0x156 is WINSOCK_DETECTED_HUNG_CLOSESOCKET_LIVEDUMP.

Second, this bugcheck, unlike most of the ones we know, did not ‘crash’ or ‘blue screen’ the system.

Live Kernel Dumps

All of this is great, but what’s really happening here? How come I got kernel dumps without the system ‘crashing’? Well, the answer is that in Windows 8.1 the Windows development team added some great reliability diagnostics in the form of ‘Live Kernel Dump Reporting’. With this feature, certain Windows components can request a ‘live dump’ to be gathered. In my above case, both a minidump (~ 278KB) and a ‘larger’ dump (~ 2GB) were gathered when the AFD (Ancillary Function Driver for WinSock) runtime detected that a socket did not close ‘in time’ (see bold sections in the Appendix for more information.)

The Windows Error Reporting feature will then use the minidump to help the Windows development team figure out if this is a ‘trending’ issue, prioritize it and then hopefully fix it if it is due to an issue with Windows. The ‘larger’ dump which I mentioned above is not normally uploaded unless the development team ‘asks’ for it again via the Windows Error Reporting and Action Center mechanisms (to ultimately give the end user control on what gets submitted.)

So What?

That is the million dollar question 🙂 As an end user, you may be wondering what to do with these types of dump files. The advice I can give is: if the dump files are causing you to run very low on disk space, you can probably move them off to cheaper storage, such as an external HDD. BUT if you are repeatedly getting these dump files, it may be advisable to check for any third-party drivers, especially anti-virus products or any other network-related software. Sometimes older versions of such software may not ‘play well’ with Windows 8.1 and may cause a stalled network operation, in turn leading to these dump files.

If you are an IT Pro and seeing these dump files on server class machines and / or on multiple PCs, you would do well to contact our CSS (Customer Service and Support) staff who can guide you further on why these dump files are occurring and what should be the course of action.

In Closing

I hope this helps you understand this system folder and why it plays an important role in improving the reliability of Windows. If you are interested in this topic, I highly recommend this talk from Andrew Richards and Graham McIntyre, who are both on the Windows Reliability team. They explain how the OCA / WER mechanism works. Amazing stuff, check it out!

Appendix: !analyze -v output

0: kd> !analyze -v
…

WINSOCK_DETECTED_HUNG_CLOSESOCKET_LIVEDUMP (156)
Winsock detected a hung transport endpoint close request.
Arguments:
…

DEFAULT_BUCKET_ID:  WINBLUE_LIVE_KERNEL_DUMP

BUGCHECK_STR:  0x156

…

STACK_TEXT:
…

ffffd001`28e46660 fffff803`bdddd64d : ffffffff`800026bc 00000000`00000000 ffffc001`1f52ec00 00000000`00000000 : nt!DbgkpWerCaptureLiveFullDump+0x11f
ffffd001`28e466c0 fffff801`21b7e3b4 : 00000000`00000001 ffffd001`28e46889 00000000`00000048 ffffe000`3e9afda0 : nt!DbgkWerCaptureLiveKernelDump+0x1cd
ffffd001`28e46710 fffff801`21b7b4ff : ffffe000`3e9afda0 00000000`0000afd2 ffffe000`3e9afd00 00000000`00000002 : afd!AfdCaptureLiveKernelDumpForHungCloseRequest+0xa8
ffffd001`28e46770 fffff801`21b89cad : ffffe000`3e9afda0 ffffd001`28e46889 00000000`0000afd2 ffffd001`28e46808 : afd!AfdCloseTransportEndpoint+0x64ef
ffffd001`28e467d0 fffff801`21b89674 : 00000000`00000001 ffffe000`42d71010 00000000`00000000 ffffe000`3e9afda0 : afd!AfdCleanupCore+0x14d
ffffd001`28e468f0 fffff803`bdc47349 : ffffe000`42d71010 ffffe000`3d3fd080 00000000`00000000

…

Inline Index Definition in SQL Server 2016 CTP2

Firstly, I am sure you are as excited as I am about the next version of SQL Server; you can try the public preview (a.k.a. Community Technology Preview 2 – CTP2) by visiting the evaluation page and installing it NOW! This morning, at the start of a new week, I found something new in SQL 2016 CTP2 which has not been widely discussed as yet: inline index definition enhancements.

As a regulatory disclaimer though I must mention that the information here may be subject to potential change in future releases beyond CTP2.

 

Current State

SQL Server 2014 introduced T-SQL grammar support allowing us to declare index definitions inline with the CREATE TABLE itself (a quick sketch of that 2014 syntax follows the list below). While this was mainly driven by the semantics of the in-memory OLTP tables, the elegance of defining indexes along with the table is compelling. However, there were some drawbacks:

  • Inline definition of columnstore indexes was not allowed.
  • Inline indexes could not have a filter predicate associated.
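For reference, here is a minimal sketch of the inline index syntax as it already worked in SQL Server 2014 (table and index names are my own):

-- SQL 2014 style: a plain nonclustered rowstore index declared inline
create table t0
(
    c1 int not null,
    c2 int,
    index IX0 nonclustered (c2)
)
go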

I’m glad to see that both of these scenarios have been addressed in SQL Server 2016 CTP2. Here are some examples of the new syntax which work in SQL Server 2016 CTP2:

Inline Filtered Index

create table t1
(
    c1 int,
    index IX1 (c1) where c1 > 0
)
go

Inline Columnstore Index

-- inline clustered columnstore index definition
create table t2
(
    c1 int,
    c2 int,
    c3 nvarchar(500),
    index CCSI1 clustered columnstore
)

-- inline nonclustered columnstore index definition
create table t3
(
    c1 int,
    c2 int,
    c3 nvarchar(500),
    index NCCSI1 nonclustered columnstore (c1, c2)
)

The CREATE TABLE help in SQL 2016 CTP2 Books Online has been updated to reflect these enhancements.

I’ll be posting other such nuggets on an ongoing basis, so stay tuned!

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 3

Welcome back! As promised last time around, here’s part 3 of the ‘Instant File Initialization’ (a.k.a. ‘IFI’) optimization for SQL Server series. If you missed the first two parts, you should definitely take some time to read them first before resuming this one, because the previous posts cover a lot of things which are assumed in this post.

With that background, this post will show you how IFI works (or rather, did not work, in a specific release!) in conjunction with the Buffer Pool Extension feature in SQL 2014 and above.

Buffer Pool Extension Overview

A few weeks ago, one of my colleagues asked this question internally: ‘does IFI (Instant File Initialization) have impact on creating the BPE?’. That question was the inspiration for this entire series of posts, so I thank him for that. In order to answer the question we first need to understand conceptually how the Buffer Pool Extension (BPE) feature works. The Books Online topic for BPE is a good starting point, but here is my summary:

  • Think of the BPE as offering a ‘Level 2’ cache over the primary ‘Level 1’ i.e. classic Buffer Pool. If the DB page cannot be found in either L1 cache or L2 cache only then will it spill over to regular physical read from the data file.
  • The BPE cache mechanism is actually based on a file which is preferably hosted on very fast storage, such as an SSD. For example, in an Azure D-Series VM context, the D: drive is an excellent place to host the BPE and / or TEMPDB – see this article from the SQL Server Product Team for some details.
  • The BPE file is created when you run the ALTER SERVER CONFIGURATION command to enable BPE, or on SQL startup (if BPE was already configured); see the example after this list.
  • The size of the BPE file can be up to a multiple of the ‘max server memory’ setting (the limit varies by SQL Edition), but we do not generally recommend more than 4x the max server memory setting. The reason I mention this here is that the BPE file may be a very large file and, depending on which buffer page we are saving into it, the offset of that file write operation may be quite large.
  • Finally, the BPE file is deleted on SQL Server shutdown (and hence re-created on startup.)
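As referenced in the list above, here is a minimal sketch of enabling and disabling BPE; the file path and size are assumptions for illustration only, not recommendations:

-- Enable BPE (path and size are illustrative)
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'D:\BPE\SqlBufferPoolExtension.BPE', SIZE = 64 GB);

-- Disable BPE
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION OFF;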

BPE Internals

As with other operations in SQL, the writes into the BPE are optimized using the WriteFileGather() API. And based on which buffer page is being written to the file, the offset into the BPE file itself can be quite large. If we run a Process Monitor trace during the BPE file operations, we will notice that in SQL 2014 RTM there are a number of Synchronous Paging I/O operations (the second highlighted line in the screenshot below) following a regular write operation to the BPE (which, in the screenshot below, is the first call to WriteFile at an offset of 196771840):

[Image: Process Monitor trace showing a WriteFile to the BPE followed by Synchronous Paging I/O]

But as you learnt in the first two parts, writing into ‘random’ locations inside a file will cause the OS to silently ‘zero out’ the allocations from the previous valid data length to the new location, and indeed in the case of BPE writes as well, you will see the tell-tale signs of this:

[Image: Process Monitor stack showing calls to CcZeroDataOnDisk]

Notice the calls to CcZeroDataOnDisk above, which represent the zero-stamping at the Windows level. Now, this is synchronous and will cause the top-level WriteFileGather() to block till the allocations are zeroed up to the current data length. What this means is that the SQL task which caused the buffer fetch in the first place will be blocked a bit longer than you would like.

Salvation!

From the above, it certainly looks like there was a potential gap / improvement possibility in SQL 2014 RTM, because the calls to write into the BPE were effectively synchronous and slowed down operations. Thankfully, our development team acted on the feedback from customers and introduced a call to SetFileValidData() in SQL 2014 Service Pack 1. Specifically, this was the issue fixed by using the SetFileValidData() API in the BPE initialization code!

So if you now capture a Process Monitor trace during BPE initialization in SQL 2014 SP1, here is what you will see:

[Image: Process Monitor trace of BPE initialization in SQL 2014 SP1]

Since the valid data length is being set proactively to the entire file size itself, initial writes to the BPE file at any random (high) offsets are no longer blocking due to underlying zeroing. This leads to a significant improvement for some customers.

Edit 28 May 2015: Now that SQL 2016 CTP2 is officially available, I’m glad to report that the above improvement is also present in SQL 2016 code base!

So with that, you now know one more way in which the IFI optimization is being used within SQL Server engine. There is one more place which we can talk about, but I’d like to challenge our readers to share their guesses on what that might be – please post your guesses as comments, and I will come back to you shortly with that information as well!

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 2

This is part 2 of my series on ‘Instant File Initialization’ and how that ‘brand name’ actually works under the covers. This post will take a look at what really happens when a database file is created and how the ‘Instant File Initialization’ optimization really helps from a SQL Server perspective. Before you proceed, it is highly recommended that you read Part 1 of this series; so if you have missed the first part, I highly recommend you start there!

Before we begin, a big ‘thank you’ to Bob Dorr, who offered some valuable insight on this topic and also authored an excellent white paper on the overall SQL I/O topic. Also, a shout out to Bob Ward for his excellent ‘Inside SQL I/O’ talk at the SQL PASS Conference 2014. Links to both of their works are at the end of this blog post.

In the Beginning…

Let’s start simple: anyone who has worked with SQL Server knows that if you specify a very large file size for the data file, it takes a while (at least with the default setup) to finish this. You also probably know that this is because of ‘zeroing out’ of the underlying allocations.

Now, the million dollar question: when a database is created, ‘conceptually’ there is nothing inside it – right? 🙂 So why would we need to do the zeroing at this time? Recall from Part 1 of the series that the first WriteFile() call triggered the underlying zeroing at the OS level. So, though the data file is basically ‘empty’, maybe SQL is writing into some random file locations and causing this?

Now, why would SQL Server write into ‘random’ places at DB creation? The answer is that SQL still needs to perform some ‘metadata’ setup on the file, or on the newly grown section of the file. This ‘metadata’ consists of the internal allocation-related pages, namely the GAM, SGAM and PFS pages, which are scattered at predictable intervals throughout the length of the file.
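If you want to look at these allocation pages yourself, here is a small sketch using the well-known (but undocumented) DBCC PAGE command against a hypothetical test database; in file 1, page 1 is the PFS page, page 2 the GAM and page 3 the SGAM:

DBCC TRACEON(3604);              -- send DBCC PAGE output to the client
DBCC PAGE('MyTestDb', 1, 1, 0);  -- PFS page (file 1, page 1)
DBCC PAGE('MyTestDb', 1, 2, 0);  -- GAM page (file 1, page 2)
DBCC PAGE('MyTestDb', 1, 3, 0);  -- SGAM page (file 1, page 3)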

GAM / PFS Initialization

Now, if you are like me, you would want to verify or see this in the debugger, and indeed some quick poking around with WinDbg will reveal the intricacies of why we are doing this random I/O immediately after resizing or creating the file (and therefore why the zeroing of clusters will normally happen unless you enabled the conditions to use ‘instant file initialization’.)

Firstly, you can poke around in the debugger (note that I used only public symbols for the below walkthrough – you can get started with WinDbg and SQL Server here) and if you get a bit savvy with the debugger you can uncover things like the below:

0:111> x sqlmin!Init*Pages
00007ff8`da328f90 sqlmin!InitGAMIntervalPages (<no parameter info>)
00007ff8`da329190 sqlmin!InitDBAllocPages (<no parameter info>)
00007ff8`da3286a0 sqlmin!InitPFSPages (<no parameter info>)

If you set a few breakpoints you will see the action around PFS and GAM initialization (you will see a lot more PFS than GAM pages because the interval tracked by GAM pages are much larger than PFS). Here is a sample for PFS pages initialization:

sqlmin!InitPFSPages
sqlmin!InitDBAllocPages
sqlmin!FileMgr::CreateNewFile
sqlmin!AsynchronousDiskAction::ExecuteDeferredAction
sqlmin!AsynchronousDiskWorker::ThreadRoutine
sqlmin!SubprocEntrypoint
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask

Please keep this aspect in mind because we will revisit this later.

Case 1: Without ‘Instant File Initialization’

Now, imagine this: if SQL were to directly start writing to the ‘random’ locations corresponding to the above GAM and PFS pages, then (if you read Part 1 carefully) we would expect the corresponding WriteFile() operations to cause the OS to issue underlying CcZeroDataOnDisk calls to zero out the intervening clusters. This would be inefficient, so what SQL does instead is proactively issue 8MB chunked I/O writes to zero out the file. You can easily verify this with a filtered Process Monitor trace, which I did; the result is summarized below:

image

If you dig a bit deeper, specifically use the Stack view inside of Process Monitor for one of the WriteFile() calls shown above, you can see all the details down to the WriteFileGather() routine which does the I/O in chunks of 8MB to zero out the file proactively:

image

Notice that there are no calls by the kernel to CcZeroDataOnDisk. So we are in a way doing what the OS did in the earlier case, perhaps a bit more aggressively due to the larger I/O sizes (8MB.)

Now you can imagine why it takes a long time to zero out a large file. If you attended Bob Ward’s excellent ‘Inside SQL I/O’ session at SQL PASS 2014 he actually does some calculations to show you how long it would take to zero out a large data file. For example, if you have a 10GB data file and you have 150MB/sec serial I/O throughput on the drive, you can estimate roughly 70 seconds to do the zero initialization. That can be a really long time, especially if you get an autogrow of that size!
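To make the arithmetic explicit: 10 GB is roughly 10,240 MB, and 10,240 MB ÷ 150 MB/sec ≈ 68 seconds, which is the ‘roughly 70 seconds’ quoted above, spent purely on zero-writing before the file is usable.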

Seed question: if you scroll through the ProcMon trace to the last of the 8MB WriteFile operations (which are the zeroing ones) then you will notice that there are some 8KB writes which follow. Why? The answer follows at the end of ‘Case 2’ walkthrough below!

image

Case 2: With ‘Instant File Initialization’

Now, assume that the SQL Server service account has been granted SeManageVolumePrivilege (which allows the successful use of the SetFileValidData API I mentioned in the previous post); SQL will then attempt to use this ‘optimization’ to avoid the zeroing overhead. We captured a sample trace using Process Monitor while SQL was creating a 5GB data file. Here is a screenshot of what the Process Monitor log looks like with the Instant File Initialization optimization enabled successfully:

image

You can see the reference to SetValidDataLengthInformationFile (highlighted) followed by a series of 8KB writes. In the debugger, you will see the following call stack which proves that we do indeed call the SetFileValidData() API from the FCB::InitializeSpace() call:

KERNELBASE!SetFileValidData
sqlmin!FCB::InitializeSpace
sqlmin!FileMgr::CreateNewFile
sqlmin!AsynchronousDiskAction::ExecuteDeferredAction
sqlmin!AsynchronousDiskWorker::ThreadRoutine

Now we answer the previous question we seeded at the end of the Case 1 section: why do we still get the 8KB writes? If you recall from the ‘GAM / PFS Initialization’ section previously then this should be crystal clear! Here is a call stack of one of the 8KB writes:

image

As you can see above, this is for a PFS page initialization. So this explains the 8KB writes after the file was created.

Case 3: Sparse File Creation (Database Snapshot)

Next, let’s look at one of the special cases: Database snapshots in SQL Server are implemented using NTFS ‘sparse file’ functionality. Now, in the case of a sparse file, we do not use either of the two mechanisms mentioned above, and instead use a special mechanism to do the ‘zero initialization’. Why? Read on!

If you read the ‘Instant File Initialization’ (IFI) section in the SQL I/O Basics Chapter 2 white paper, you will see this sentence:

The algorithm used by SQL Server is more aggressive than the NTFS zero initialization (DeviceIoControl, FSCTL_SET_ZERO_DATA)

From MSDN it is clear that there is an optimization to set a range in a sparse file as all zeros without physically extending the file size:

If you use the WriteFile function to write zeros (0) to a sparse file, the file system allocates disk space for the data that you are writing. If you use the FSCTL_SET_ZERO_DATA control code to write zeros (0) to a sparse file and the zero (0) region is large enough, the file system may not allocate disk space.

AHA! So I hope that explains why we cannot use the conventional ‘zero stamping’ or the SetFileValidData mechanism for sparse files. But let’s see this for ourselves! Let’s start by creating a DB snapshot, but before I executed the below I also put a breakpoint in WinDbg on kernelbase!DeviceIoControl().

-- Create the database snapshot
CREATE DATABASE ZN_test ON
( NAME = ZN, FILENAME =
'l:\temp\ZN_test.ss' )
AS SNAPSHOT OF ZN;
GO

Here is the corresponding Process Monitor trace:

image

From WinDbg we can get the call stack. You can see that FCB::ZeroFile() calls the DeviceIoControl in this case:

KERNELBASE!DeviceIoControl
KERNEL32!DeviceIoControlImplementation
sqlmin!FCB::ZeroFile
sqlmin!FCB::InitializeSpace
sqlmin!FileMgr::CreateNewFile

Wow! So I hope you get a feel for how many optimizations we have in place for SQL Server from an I/O perspective.

Case 4: Log File Initialization

Last but not least, let us study the case of the transaction log file. Interestingly (and as is known and documented in many places), the log file is always zero-initialized. Here is a ProcMon trace (which was taken when IFI was already being leveraged for the data file creation):

image

The above operations are largely related to zeroing out the entire file and then formatting the Virtual Log Files within the initial chunk. The log file (2MB in size) was zero-initialized in one shot in the above case. It took 30 milliseconds to do that on my system. Obviously more real world sizes would take proportionately more time to finish.

FYI – you can see the progress of the log fixups by using undocumented trace flag 3004.
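If you want to watch the zeroing happen yourself, here is a small sketch; trace flags 3004 and 3605 are undocumented (3605 routes the messages to the error log), so use this on a test instance only, and the database name is hypothetical:

DBCC TRACEON(3004, 3605, -1);                       -- log zeroing details to the error log
CREATE DATABASE ZeroingDemo;                        -- any file creation / growth will do
EXEC master.dbo.xp_readerrorlog 0, 1, N'Zeroing';   -- look for 'Zeroing ... completed' messages
DROP DATABASE ZeroingDemo;
DBCC TRACEOFF(3004, 3605, -1);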

What Next?

So that’s it! I hope you enjoyed this spelunking into the internals of the OS and SQL. Next up, we will see how this optimization applies (or does not apply) to other key components within SQL. For further reading, Bob Dorr’s SQL I/O Basics white papers and Bob Ward’s ‘Inside SQL I/O’ talk (both referenced earlier) are excellent resources on the topic of SQL I/O internals.