May 26, 2015 (updated March 14, 2019) by Arvind Shyamsundar

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 3

Welcome back! As promised last time around, here’s part 3 of the ‘Instant File Initialization’ (a.k.a. ‘IFI’) optimization for SQL Server series. If you missed the first two parts you should definitely take some time to read them first before resuming this one, because the previous posts cover a lot of things which would be assumed in this post:

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 1 (Windows fundamentals to understand IFI)
SQL Server and ‘Instant File Initialization’ Under the Hood – Part 2 (How SQL uses IFI to speed up DB file operations)

With that background, this post will show you how IFI works (or rather, did not work in one specific release!) in conjunction with the Buffer Pool Extension feature in SQL 2014 and above.

Buffer Pool Extension Overview

A few weeks ago, one of my colleagues asked this question internally: ‘does IFI (Instant File Initialization) have impact on creating the BPE?’. That question was the inspiration for this entire series of posts, so I thank him for that. In order to answer the question we first need to understand conceptually how the Buffer Pool Extension (BPE) feature works. The Books Online topic for BPE is a good starting point, but here is my summary:

Think of the BPE as offering a ‘Level 2’ cache over the primary ‘Level 1’ cache, i.e. the classic Buffer Pool. Only if the DB page cannot be found in either the L1 cache or the L2 cache will the request fall through to a regular physical read from the data file.
The BPE cache mechanism is actually based on a file which is preferably hosted on very fast storage, such as an SSD. For example, in an Azure D-Series VM context, the D: drive is an excellent location for the BPE and / or TEMPDB – see this article from the SQL Server Product Team for some details.
The BPE file is created when you run the ALTER SERVER CONFIGURATION command to enable BPE, or on SQL startup (if BPE was already configured.)
The size of the BPE file can be up to a multiple of the ‘max server memory’ configured (the limit varies by SQL Edition) but we do not generally recommend more than 4x the max server memory setting. The reason I mention this here is that the BPE file may be very large and, depending on which buffer page we are saving into the BPE file, the offset of that file write operation may be quite large.
Finally, the BPE file is deleted on SQL Server shutdown (and hence re-created on startup.)
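To make the ‘Level 1 / Level 2’ idea concrete, here is a minimal conceptual sketch of the lookup order described above. It is not SQL Server code: the type and member names (TwoLevelCache, locate, PageSource) are hypothetical, and the two sets simply stand in for the in-memory Buffer Pool and the BPE file.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical names throughout; a conceptual model only, not SQL Server internals.
enum class PageSource { BufferPool, BufferPoolExtension, DataFile };

struct TwoLevelCache {
    std::unordered_set<std::uint64_t> bufferPool; // "L1": pages held in RAM
    std::unordered_set<std::uint64_t> bpeFile;    // "L2": pages held in the BPE file

    // Report where a read for pageId would be satisfied from.
    PageSource locate(std::uint64_t pageId) const {
        if (bufferPool.count(pageId) != 0) return PageSource::BufferPool;
        if (bpeFile.count(pageId) != 0) return PageSource::BufferPoolExtension;
        return PageSource::DataFile; // missed both tiers: physical read from the data file
    }
};
```

The only point of the sketch is the ordering: the data file is touched only after both tiers miss.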

BPE Internals

As with other operations in SQL, the writes into the BPE are optimized using the WriteFileGather() API. And based on which buffer page is being written to the file, the offset into the BPE file itself can be quite large. If we run a Process Monitor trace during the BPE file operations, we will notice that in SQL 2014 RTM there are a number of Synchronous Paging I/O operations (the second highlighted line in the snippet below) following a regular write operation to the BPE (which in the screenshot below is the first call to WriteFile, at an offset of 196771840):

But as you learnt in the first two parts, writing into ‘random’ locations inside a file will cause the OS to silently ‘zero out’ the allocations from the previous valid data length to the new location, and indeed in the case of BPE writes as well, you will see the tell-tale signs of this:

Notice the calls to CcZeroDataOnDisk above, which represent the zero-stamping at a Windows level. This zeroing is synchronous and will cause the top-level WriteFileGather() to block until the allocations are zeroed up to the offset being written. What this means is that the SQL task which caused the buffer fetch in the first place will be blocked a bit more than you would like.
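As a rough back-of-the-envelope sketch, the arithmetic behind that stall can be written down as follows. The helper names are hypothetical, and two things are assumptions on my part: that pages sit in the BPE file in back-to-back 8KB slots, and that the valid data length of a freshly created BPE file starts at 0.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 8192; // SQL Server page size in bytes

// Byte offset of an 8KB slot, assuming slots are packed back to back (an assumption).
constexpr std::uint64_t slotOffset(std::uint64_t slot) {
    return slot * kPageSize;
}

// Bytes the OS has to zero before a write at writeOffset can complete,
// given the file's current valid data length.
constexpr std::uint64_t implicitZeroBytes(std::uint64_t validDataLength,
                                          std::uint64_t writeOffset) {
    return writeOffset > validDataLength ? writeOffset - validDataLength : 0;
}
```

For instance, slotOffset(24020) is exactly the 196771840 seen in the trace, and implicitZeroBytes(0, 196771840) says the OS would have to zero roughly 188MB before that single 8KB write could finish.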

Salvation!

From the above, it certainly looks like there is a potential gap / improvement possibility in SQL 2014 RTM because the calls to write into the BPE would be effectively synchronous and slow down operations. Thankfully, our development team has acted on the feedback from customers and has introduced a call to SetFileValidData() in SQL 2014 Service Pack 1. Specifically, this issue was the one fixed by using the SetFileValidData() API in the BPE initialization code!

So if you now capture a Process Monitor trace during BPE initialization in SQL 2014 SP1, here is what you will see:

Since the valid data length is being set proactively to the entire file size itself, initial writes to the BPE file at any random (high) offsets are no longer blocking due to underlying zeroing. This leads to a significant improvement for some customers.

Edit 28 May 2015: Now that SQL 2016 CTP2 is officially available, I’m glad to report that the above improvement is also present in SQL 2016 code base!

So with that, you now know one more way in which the IFI optimization is being used within SQL Server engine. There is one more place which we can talk about, but I’d like to challenge our readers to share their guesses on what that might be – please post your guesses as comments, and I will come back to you shortly with that information as well!

May 19, 2015 by Arvind Shyamsundar

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 2

This is part 2 of my series on ‘Instant File Initialization’ and how that ‘brand name’ actually works under the covers. This post will take a look at what really happens when a database file is created and how the ‘Instant File Initialization’ optimization really helps from a SQL Server perspective. Before you proceed, I highly recommend that you read Part 1 of this series; if you missed it, start there!

Before we begin, a big ‘thank you’ to Bob Dorr, who offered some valuable insight on this topic and also authored an excellent white paper on the overall SQL I/O topic. A shout out as well to Bob Ward for his excellent ‘Inside SQL I/O’ talk at SQL PASS Conference 2014. Links to both of their works are at the end of this blog post.

In the Beginning…

Let’s start simple: anyone who has worked with SQL Server knows that if you specify a very large file size for the data file, it takes a while (at least with the default setup) to finish this. You also probably know that this is because of ‘zeroing out’ of the underlying allocations.

Now, the million dollar question: when a database is created, ‘conceptually’ there is nothing inside it – right? So why would we need to do the zeroing at this time? Recall from Part 1 of the series that the first WriteFile() call triggered the underlying zeroing at an OS level. So, though the data file is basically ‘empty’, maybe SQL is writing into some random file locations and causing this?

Now, why would SQL Server write into ‘random’ places at DB creation? The answer is that SQL still needs to perform some ‘metadata’ setup on the file or on the newly grown section of the file. This ‘metadata’ is basically the internal allocation-related pages, namely the GAM, SGAM and PFS pages, which are scattered at predictable intervals throughout the length of the file.
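To get a feel for those ‘predictable intervals’, here is a small sketch that lists where the PFS and GAM pages would fall in a file of a given size. The interval values are not from this post: they are the commonly documented ones (a PFS page at page 1 and then every 8,088 pages; a GAM page at page 2 and then every 511,232 pages, with the SGAM page right after each GAM page), so treat them as assumptions. The function name is hypothetical.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 8192;      // bytes per page
constexpr std::uint64_t kPfsInterval = 8088;   // assumed: pages covered by one PFS page
constexpr std::uint64_t kGamInterval = 511232; // assumed: pages covered by one GAM page

// Page numbers of one kind of allocation page in a file of totalPages pages:
// the first one sits at firstPage, later ones at each multiple of interval.
std::vector<std::uint64_t> allocationPages(std::uint64_t totalPages,
                                           std::uint64_t firstPage,
                                           std::uint64_t interval) {
    std::vector<std::uint64_t> pages;
    if (firstPage < totalPages) pages.push_back(firstPage);
    for (std::uint64_t p = interval; p < totalPages; p += interval) {
        pages.push_back(p);
    }
    return pages;
}
```

Under these assumptions a 5GB data file (655,360 pages) gets 82 PFS pages but only 2 GAM pages, and the second PFS page lands at byte offset 8088 * 8192 = 66,256,896, well away from the start of the file.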

GAM / PFS Initialization

Now, if you are like me, you would want to verify or see this in the debugger, and indeed some quick poking around with WinDbg will reveal the intricacies of why we are doing this random I/O immediately after resizing or creating the file (and therefore why the zeroing of clusters will normally happen unless you enabled the conditions to use ‘instant file initialization’.)

Firstly, you can poke around in the debugger (note that I used only public symbols for the below walkthrough – you can get started with WinDbg and SQL Server here) and if you get a bit savvy with the debugger you can uncover things like the below:

0:111> x sqlmin!Init*Pages
00007ff8`da328f90 sqlmin!InitGAMIntervalPages (<no parameter info>)
00007ff8`da329190 sqlmin!InitDBAllocPages (<no parameter info>)
00007ff8`da3286a0 sqlmin!InitPFSPages (<no parameter info>)

If you set a few breakpoints you will see the action around PFS and GAM initialization (you will see a lot more PFS than GAM pages because the interval tracked by a GAM page is much larger than that tracked by a PFS page). Here is a sample for PFS pages initialization:

sqlmin!InitPFSPages
sqlmin!InitDBAllocPages
sqlmin!FileMgr::CreateNewFile
sqlmin!AsynchronousDiskAction::ExecuteDeferredAction
sqlmin!AsynchronousDiskWorker::ThreadRoutine
sqlmin!SubprocEntrypoint
sqldk!SOS_Task::Param::Execute
sqldk!SOS_Scheduler::RunTask

Please keep this aspect in mind because we will revisit this later.

Case 1: Without ‘Instant File Initialization’

Now, imagine this: if SQL were to directly start writing to the ‘random’ locations corresponding to the above GAM and PFS pages, then (as you will recall from Part 1) we would expect the corresponding WriteFile() operations to cause the OS to issue underlying CcZeroDataOnDisk calls to zero out the file. This would be inefficient, so what SQL does instead is proactively issue 8MB chunked I/O writes to zero out the file. You can easily verify this by running a filtered Process Monitor trace, which I did; the result is summarized below:

If you dig a bit deeper, specifically use the Stack view inside of Process Monitor for one of the WriteFile() calls shown above, you can see all the details down to the WriteFileGather() routine which does the I/O in chunks of 8MB to zero out the file proactively:

Notice that there are no calls by the kernel to CcZeroDataOnDisk. So we are in a way doing what the OS did in the earlier case, perhaps a bit more aggressively due to the larger I/O sizes (8MB.)

Now you can imagine why it takes a long time to zero out a large file. If you attended Bob Ward’s excellent ‘Inside SQL I/O’ session at SQL PASS 2014 he actually does some calculations to show you how long it would take to zero out a large data file. For example, if you have a 10GB data file and you have 150MB/sec serial I/O throughput on the drive, you can estimate roughly 70 seconds to do the zero initialization. That can be a really long time, especially if you get an autogrow of that size!
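The estimate above is simple division, and a tiny sketch makes it easy to plug in your own numbers. The function names are hypothetical; the 8MB chunk size and the 150MB/sec figure are the ones quoted in this post, and I am treating MB and GB as binary units.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Seconds needed to zero fileBytes at a sustained sequential throughput (MiB per second).
constexpr double zeroingSeconds(std::uint64_t fileBytes, double mibPerSecond) {
    return static_cast<double>(fileBytes) / static_cast<double>(kMiB) / mibPerSecond;
}

// Number of zeroing writes of chunkBytes each, rounding the last partial chunk up.
constexpr std::uint64_t zeroingWrites(std::uint64_t fileBytes, std::uint64_t chunkBytes) {
    return (fileBytes + chunkBytes - 1) / chunkBytes;
}
```

For the 10GB example, zeroingSeconds(10 * 1024 * kMiB, 150.0) comes to about 68 seconds, which is the ‘roughly 70 seconds’ quoted above, and zeroingWrites(10 * 1024 * kMiB, 8 * kMiB) gives 1280 separate 8MB writes.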

Seed question: if you scroll through the ProcMon trace to the last of the 8MB WriteFile operations (which are the zeroing ones) then you will notice that there are some 8KB writes which follow. Why? The answer follows at the end of ‘Case 2’ walkthrough below!

Case 2: With ‘Instant File Initialization’

Now, if the SQL service account has been granted the SeManageVolumePrivilege (which allows the successful use of the SetFileValidData API I mentioned in the previous post), SQL will attempt to use this ‘optimization’ to avoid the zeroing overhead. We captured a sample trace using Process Monitor while SQL was creating a 5GB data file. Here is a screenshot of what the Process Monitor logs look like with the Instant File Initialization optimization enabled successfully:

You can see the reference to SetValidDataLengthInformationFile (highlighted) followed by a series of 8KB writes. In the debugger, you will see the following call stack which proves that we do indeed call the SetFileValidData() API from the FCB::InitializeSpace() call:

KERNELBASE!SetFileValidData
sqlmin!FCB::InitializeSpace
sqlmin!FileMgr::CreateNewFile
sqlmin!AsynchronousDiskAction::ExecuteDeferredAction
sqlmin!AsynchronousDiskWorker::ThreadRoutine
…

Now we answer the previous question we seeded at the end of the Case 1 section: why do we still get the 8KB writes? If you recall from the ‘GAM / PFS Initialization’ section previously then this should be crystal clear! Here is a call stack of one of the 8KB writes:

As you can see above, this is for a PFS page initialization. So this explains the 8KB writes after the file was created.

Case 3: Sparse File Creation (Database Snapshot)

Next, let’s look at one of the special cases: Database snapshots in SQL Server are implemented using NTFS ‘sparse file’ functionality. Now, in the case of a sparse file, we do not use either of the two mechanisms mentioned above, and instead use a special mechanism to do the ‘zero initialization’. Why? Read on!

If you read the ‘Instant File Initialization’ (IFI) section in the SQL I/O Basics Chapter 2 white paper, you will see this sentence:

The algorithm used by SQL Server is more aggressive than the NTFS zero initialization (DeviceIoControl, FSCTL_SET_ZERO_DATA)

From MSDN it is clear that there is an optimization to set a range in a sparse file as all zeros without physically extending the file size:

If you use the WriteFile function to write zeros (0) to a sparse file, the file system allocates disk space for the data that you are writing. If you use the FSCTL_SET_ZERO_DATA control code to write zeros (0) to a sparse file and the zero (0) region is large enough, the file system may not allocate disk space.

AHA! So I hope that explains why we cannot use the conventional ‘zero stamping’ or the SetFileValidData mechanism for sparse files. But let’s see this for ourselves! Let’s start by creating a DB snapshot; before executing the statement below, I also put a breakpoint in WinDbg on kernelbase!DeviceIoControl().

-- Create the database snapshot
CREATE DATABASE ZN_test ON
( NAME = ZN, FILENAME =
'l:\temp\ZN_test.ss' )
AS SNAPSHOT OF ZN;
GO

Here is the corresponding Process Monitor trace:

From WinDbg we can get the call stack. You can see that FCB::ZeroFile() calls the DeviceIoControl in this case:

KERNELBASE!DeviceIoControl
KERNEL32!DeviceIoControlImplementation
sqlmin!FCB::ZeroFile
sqlmin!FCB::InitializeSpace
sqlmin!FileMgr::CreateNewFile
…

Wow! So I hope you get a feel for how many optimizations we have in place for SQL Server from an I/O perspective.

Case 4: Log File Initialization

Last but not least, let us study the case of the transaction log file. Interestingly (and as is known and documented in many places) the log file is always zero-initialized. Here is a ProcMon trace (which was taken when IFI was already leveraged for the data file creation):

The above operations are largely related to zeroing out the entire file and then formatting the Virtual Log Files within the initial chunk. The log file (2MB in size) was zero-initialized in one shot in the above case. It took 30 milliseconds to do that on my system. Obviously more real world sizes would take proportionately more time to finish.

FYI – you can see the progress of the log fixups by using undocumented trace flag 3004.

What Next?

So that’s it, I hope you enjoyed this spelunking into the internals of the OS and SQL. Next up, we will see how this optimization applies (or does not apply) to other key components within SQL. For further reading, the following are excellent resources on the topic of SQL I/O internals:

Bob Ward’s SQL PASS 2014 presentation (requires PASS 2014 event registration) http://www.sqlpass.org/summit/2014/Sessions/Details.aspx?sid=7057
Bob Dorr on SQL I/O http://blogs.msdn.com/b/psssql/archive/2010/03/24/how-it-works-bob-dorr-s-sql-server-i-o-presentation.aspx
SQL I/O Basics Part 2: https://technet.microsoft.com/en-us/library/cc917726.aspx
SQL I/O Basics Part 1: https://technet.microsoft.com/library/Cc966500

May 4, 2015 by Arvind Shyamsundar

SQL Server and ‘Instant File Initialization’ Under the Hood – Part 1

Recently a colleague of mine popped up a very interesting question around whether the SQL Server ‘Buffer Pool Extension’ feature in SQL 2014 uses the ‘instant file initialization’ optimization (or not). While answering that question I found some useful information which I believe will help many of us. So here we go… firstly, we need to understand what ‘instant file initialization’ is really all about, from the Windows perspective.

Background

At the OS level every file has three important attributes which are recorded in the metadata of the NTFS file system:

Physical file size
Allocation file size
Valid data size

In this post, we are mostly concerned with the Physical and Valid Data sizes. More details are available at this MSDN page but for simplicity, let me put it this way:

When you create a file with the CreateFile API, it starts with a 0 byte length
One way to ‘grow’ the file is of course to sequentially write bytes to it.
But if you want to ‘pre-size’ the file to a specific size, then you may not want to explicitly write data upfront.
In those cases the OS provides a SetEndOfFile() API to ‘resize’ the file, but as you will see below, there are still some things which will hold up the thread when the first write operation is done to the pre-sized file

Let’s work through this step-by-step. A bit of programming knowledge will help, though it should be fairly easy to figure out what’s going on by reading the comments inline in the code!

Growing a file: C++ example

Here is a simple program which will demonstrate how you can grow a file to 3GB without having to write individual bytes till the 3GB mark:

#include <Windows.h>
#include <tchar.h>   // for _tmain and _TCHAR

int _tmain(int argc, _TCHAR* argv[])
{
    // create a file first. it will start as an empty file of course
    HANDLE myFile = ::CreateFile(L"l:\\temp\\ifi.dat",
        GENERIC_WRITE,
        0,
        NULL,
        CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (INVALID_HANDLE_VALUE == myFile)
    {
        return -1;
    }

    // let’s now make the file 3GB in size
    LARGE_INTEGER newpos;
    newpos.QuadPart = (LONGLONG) 3 * 1024 * 1024 * 1024;

    LARGE_INTEGER newfp;

    // navigate to the new ‘end of the file’
    ::SetFilePointerEx(myFile,
        newpos,
        &newfp,
        FILE_BEGIN);

    // ‘seal’ the new EOF location
    ::SetEndOfFile(myFile);

    // now navigate to the EOF – 1024 bytes.
    newpos.QuadPart = (LONGLONG)3 * 1024 * 1024 * 1024 - 1024;
    ::SetFilePointerEx(myFile, newpos, &newfp, FILE_BEGIN);

    DWORD dwwritten = 0;

    // try to write 5 bytes to the 3GB-1024th location
    ::WriteFile(myFile,
        "hello",
        5,
        &dwwritten,
        NULL);

    // release the file handle before exiting
    ::CloseHandle(myFile);

    return 0;
}

When we execute the above code, you will see that though we used the SetEndOfFile() API to relocate the EOF marker without explicitly writing anything ourselves, there is some work being done by the OS underneath our code to ‘zero’ out the contents of the clusters allocated to us. This is done for data privacy reasons and since it is physical I/O, it does take a while. You may want to refer to the documentation for the SetFilePointerEx function:

Note that it is not an error to set the file pointer to a position beyond the end of the file. The size of the file does not increase until you call the SetEndOfFile, WriteFile, or WriteFileEx function. A write operation increases the size of the file to the file pointer position plus the size of the buffer written, leaving the intervening bytes uninitialized.

Snooping in with Process Monitor

You can actually look at the proof of what is happening underneath the hood by using Process Monitor from the Sysinternals suite. Here is a complete call stack of the application. Notice the call in the kernel to zero out data (CcZeroDataOnDisk). Notice that these are not our API calls. We simply called WriteFile() and that triggered off these underlying ‘zeroing’ writes.

In the same ProcMon trace you will also notice a bunch of I/O operations (corresponding to the above stack) just after I triggered my 5 bytes I/O:

The key takeaway from this walkthrough is that when we called SetEndOfFile(), we do not affect the ‘valid data length’ of that file stream. In that case, the OS will play it safe by zeroing out the allocations from the previous valid file length (which in our case above was actually 0) leading up to the location of the write (which in our case is 1024 bytes prior to the physical end of the file.) This operation is what causes the thread to block.
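If it helps, the bookkeeping described above can be captured in a toy model. This is only a sketch of the behaviour as explained in this post, not of how NTFS is implemented, and the names (FileModel, setEndOfFile, write) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model of the two sizes discussed above; names are hypothetical.
struct FileModel {
    std::uint64_t fileSize = 0;        // physical end-of-file position
    std::uint64_t validDataLength = 0; // bytes known to hold real data

    // Like SetEndOfFile(): moves EOF. Growing the file leaves the valid data
    // length untouched; shrinking the file clamps it to the new size.
    void setEndOfFile(std::uint64_t newSize) {
        fileSize = newSize;
        validDataLength = std::min(validDataLength, newSize);
    }

    // Like a WriteFile() at an explicit offset: returns how many bytes had to be
    // zeroed first (the gap between the valid data length and the write offset).
    std::uint64_t write(std::uint64_t offset, std::uint64_t length) {
        const std::uint64_t zeroed =
            offset > validDataLength ? offset - validDataLength : 0;
        fileSize = std::max(fileSize, offset + length);
        validDataLength = std::max(validDataLength, offset + length);
        return zeroed;
    }
};
```

Replaying the sample program against this model (setEndOfFile to 3GB, then a 5 byte write at 3GB - 1024) reports 3GB - 1024 bytes of zeroing for that one tiny write, while any later write below that point reports none.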

Growing a file – the ‘fast’ way

Instant File Initialization as we know it in SQL Server really reduces to an API call in Windows. To see that, we tweak the above sample and add in the ‘secret sauce’ which is the call to SetFileValidData() API:

    // ‘seal’ the new EOF location
    ::SetEndOfFile(myFile);

    // now ‘cleverly’ set the valid data length to 3GB
    // (printf needs #include <stdio.h>; the call fails unless the
    // account holds the SeManageVolumePrivilege)
    if (0 == ::SetFileValidData(myFile, newpos.QuadPart))
    {
        printf("Unable to use IFI, error %lu\n", ::GetLastError());
    }
    else
    {
        printf("IFI was used!!!\n");
    }

    // now navigate to the EOF - 1024 bytes.
    newpos.QuadPart = (LONGLONG)3 * 1024 * 1024 * 1024 - 1024;

You will then see that the same code executes almost instantly. This is because the OS no longer needs to zero any bytes underneath the hood: the valid data length (as set by the above API call) now equals the file size. This can be seen in Process Monitor as well:

Dangers of SetFileValidData()

The important thing to note is that SetFileValidData() is a dangerous API in a way, because it can potentially expose underlying fragments of data. Much has been said about this, and you can check out Raymond’s blog post on this topic. The MSDN page for this API is also very clear on the caveats:

You can use the SetFileValidData function to create large files in very specific circumstances so that the performance of subsequent file I/O can be better than other methods. Specifically, if the extended portion of the file is large and will be written to randomly, such as in a database type of application, the time it takes to extend and write to the file will be faster than using SetEndOfFile and writing randomly. In most other situations, there is usually no performance gain to using SetFileValidData, and sometimes there can be a performance penalty.

What next?

Of course, if you are like me, you are probably wondering what this all equates to. Remember, we are trying to explore some of the basis and background on the ‘instant file initialization’ optimization that SQL Server can leverage to quickly size new and grown chunks for data files. As the documentation and our team’s blog post explain in detail, this setting can be very useful in certain cases and is in fact recommended for deployments on Microsoft Azure IaaS VMs.

Next time, I will correlate this information we learnt above to how SQL Server leverages it in the process of creating new data files or growing existing ones. Till then, goodbye!

March 12, 2015 by Arvind Shyamsundar

The mysterious ‘MD’ lock type, and why you should stop using sp_lock

Today during some discussions with customers, there was a question about some locks being held by a session. Here is an example reproduced below:

begin tran
select * from Person.Person
where LastName = 'Singh'

exec sp_lock @@spid

Here is the output:

spid  dbid   ObjId       IndId  Type  Resource       Mode   Status
52    11     0           0      DB                   S      GRANT
52    11     0           0      MD    14(10000:0:0)  Sch-S  GRANT
52    11     0           0      MD    14(10001:0:0)  Sch-S  GRANT
52    1      1467152272  0      TAB                  IS     GRANT
52    32767  -571204656  0      TAB                  Sch-S  GRANT

The two rows of type MD in the output were the point of discussion. It was not very apparent what those locks were attributed to. So, here is where the power of the newer DMV, sys.dm_tran_locks, becomes apparent:

select resource_type, resource_subtype, resource_description, request_mode from sys.dm_tran_locks
where request_session_id = @@spid

Here is the output:

resource_type  resource_subtype  resource_description       request_mode
METADATA       XML_COLLECTION    xml_collection_id = 65536  Sch-S
METADATA       XML_COLLECTION    xml_collection_id = 65537  Sch-S

Aha! So this made much more sense. These are metadata locks on XML schema collections. When you look at the Person.Person table, indeed there are two XML columns to which XML schema collections are bound:

[AdditionalContactInfo] [xml](CONTENT [Person].[AdditionalContactInfoSchemaCollection]) NULL,
[Demographics] [xml](CONTENT [Person].[IndividualSurveySchemaCollection]) NULL,

When you further reconcile the xml_collection_id values from the sys.dm_tran_locks DMV against sys.xml_schema_collections, the case is sealed:

select xml_collection_id, name from sys.xml_schema_collections
where xml_collection_id in (65536, 65537)

Here is the output:

xml_collection_id    name
65536    AdditionalContactInfoSchemaCollection
65537    IndividualSurveySchemaCollection

So, what other types of resources can we expect in the sys.dm_tran_locks DMV? If you do some poking around in my favorite DMV, sys.dm_xe_map_values, you will find the answer:

select map_value from sys.dm_xe_map_values
where name = 'lock_resource_type'
order by map_key

Here is the output:

UNKNOWN_LOCK_RESOURCE
NULL_RESOURCE
DATABASE
FILE
UNUSED1
OBJECT
PAGE
KEY
EXTENT
RID
APPLICATION
METADATA
HOBT
ALLOCATION_UNIT
OIB
ROWGROUP
LAST_RESOURCE

Note: the above output was produced from a SQL 2014 instance, so you may not find all the values in older versions of SQL. Most of the above are easy to understand (for example, Object, Page or Key.)

NOW – there are some others in the above list which are not that easily understood. If you want to hear more, please leave a comment and indicate what exactly you want to know more about! I’ll do my best to explain within the boundaries of what we can share publicly.

Arvind Shyamsundar's technical blog

Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.
