The strange case of the large LiveKernelReports folder

Some time back, I ran into a space crunch on the C: drive of my laptop, which runs Windows 8.1. On digging a bit, I found a 2GB+ file at C:\Windows\LiveKernelReports\WinsockAFD-20150114-1722.dmp. This was the first time I had seen a folder called LiveKernelReports, and definitely the first time I had seen a dump file named WinsockAFD*.dmp.
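If you want to check your own machine for such files, a quick PowerShell sketch is below; it simply lists any dumps in that folder, largest first (the folder path is the one from my machine):

Get-ChildItem C:\Windows\LiveKernelReports -Recurse -Filter *.dmp -ErrorAction SilentlyContinue |
    Sort-Object Length -Descending |
    Select-Object FullName, @{ Name = 'SizeMB'; Expression = { [math]::Round($_.Length / 1MB, 1) } }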

Important note: if you are not a developer and came here trying to figure out what to do with the files in this folder, please proceed directly to the ‘So What?’ section below.

Inside the Dump File

The first thing I did, of course, was to open the dump file under the WinDbg debugger. For kernel-mode dumps, the !analyze -v command generally gives good analysis results, so I decided to start there (the full output of !analyze is at the end of this post as an appendix).

Firstly, the bugcheck code was 0x156. If you are a developer and have the Windows 8.1 SDK, you will find the bugcheck names in C:\Program Files (x86)\Windows Kits\8.1\Include\shared\bugcodes.h. 0x156 is WINSOCK_DETECTED_HUNG_CLOSESOCKET_LIVEDUMP.
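If you want to look the code up yourself, something along these lines works; the SDK path is from my machine, and the search pattern assumes the header spells the code out as 0x00000156:

# Find the symbolic name for bugcheck 0x156 in the SDK header
Select-String -Path 'C:\Program Files (x86)\Windows Kits\8.1\Include\shared\bugcodes.h' `
              -Pattern '0x00000156' -Context 3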

Secondly, this bugcheck, unlike most of the ones we know, did not ‘crash’ or ‘blue screen’ the system.

Live Kernel Dumps

All of this is great, but what is really happening here? How did I get kernel dumps without the system ‘crashing’? The answer is that in Windows 8.1 the Windows development team added some great reliability diagnostics in the form of ‘Live Kernel Dump Reporting’. With this feature, certain Windows components can request a ‘live dump’ to be gathered. In my case, both a minidump (~278KB) and a ‘larger’ dump (~2GB) were gathered when the AFD (Ancillary Function Driver for WinSock) runtime detected that a socket did not close ‘in time’ (see the bold sections in the appendix for more information).

The Windows Error Reporting feature will then use the minidump to help the Windows development team figure out whether this is a ‘trending’ issue, prioritize it, and hopefully fix it if it is due to an issue in Windows. The ‘larger’ dump mentioned above is not normally uploaded unless the development team ‘asks’ for it via the Windows Error Reporting and Action Center mechanisms (ultimately giving the end user control over what gets submitted).

So What?

That is the million dollar question. As an end user, you may be wondering what to do with these types of dump files. The advice I can give you is this: if the dump files are causing you to run very low on disk space, you can probably move them off to cheaper storage, such as an external HDD. BUT if you are repeatedly getting these dump files, it may be advisable to check for third-party drivers, especially anti-virus products or other network-related software. Older versions of such software may not ‘play well’ with Windows 8.1 and may be causing a stalled network operation, in turn leading to these dump files.
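As an example, here is a minimal sketch of archiving the dumps to an external drive (the paths are illustrative; run this from an elevated PowerShell prompt):

# Archive live kernel dumps off the system drive
New-Item -ItemType Directory -Path 'E:\DumpArchive' -Force | Out-Null
Move-Item -Path 'C:\Windows\LiveKernelReports\*.dmp' -Destination 'E:\DumpArchive\'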

If you are an IT Pro seeing these dump files on server-class machines and / or on multiple PCs, you would do well to contact our CSS (Customer Service and Support) staff, who can guide you further on why these dump files are occurring and what the course of action should be.

In Closing

I hope this helps you understand this system folder and why it plays an important role in improving the reliability of Windows. If you are interested in this topic, I highly recommend this talk from Andrew Richards and Graham McIntyre, who are both on the Windows Reliability team; they explain how the OCA / WER mechanism works. Amazing stuff, check it out!

Appendix: !analyze -v output

0: kd> !analyze -v
…

WINSOCK_DETECTED_HUNG_CLOSESOCKET_LIVEDUMP (156)
Winsock detected a hung transport endpoint close request.
Arguments:
…

DEFAULT_BUCKET_ID:  WINBLUE_LIVE_KERNEL_DUMP

BUGCHECK_STR:  0x156

…

STACK_TEXT:
…

ffffd001`28e46660 fffff803`bdddd64d : ffffffff`800026bc 00000000`00000000 ffffc001`1f52ec00 00000000`00000000 : nt!DbgkpWerCaptureLiveFullDump+0x11f
ffffd001`28e466c0 fffff801`21b7e3b4 : 00000000`00000001 ffffd001`28e46889 00000000`00000048 ffffe000`3e9afda0 : nt!DbgkWerCaptureLiveKernelDump+0x1cd
ffffd001`28e46710 fffff801`21b7b4ff : ffffe000`3e9afda0 00000000`0000afd2 ffffe000`3e9afd00 00000000`00000002 : afd!AfdCaptureLiveKernelDumpForHungCloseRequest+0xa8
ffffd001`28e46770 fffff801`21b89cad : ffffe000`3e9afda0 ffffd001`28e46889 00000000`0000afd2 ffffd001`28e46808 : afd!AfdCloseTransportEndpoint+0x64ef
ffffd001`28e467d0 fffff801`21b89674 : 00000000`00000001 ffffe000`42d71010 00000000`00000000 ffffe000`3e9afda0 : afd!AfdCleanupCore+0x14d
ffffd001`28e468f0 fffff803`bdc47349 : ffffe000`42d71010 ffffe000`3d3fd080 00000000`00000000

…

The strange case of the bloated C:\Windows\CSC folder

A few days ago, when I was working at the office on the corporate network, I had to access a network share containing some installation files. When I started, I distinctly remember my free disk space being around 25GB or thereabouts. A few hours later, it was down to almost 1GB! This despite the fact that I never actually installed anything; I had just visited the network share to inspect file details.

A scan of the system with TreeSize Free (there are many others, like SequoiaView, which I have used as well) revealed around 25GB in the C:\Windows\CSC folder itself. At first I wondered whether this was some kind of C# compiler (csc.exe) related folder, but that was quickly ruled out.
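If you want to measure such a folder without a third-party tool, a rough PowerShell sketch is below; note that NTFS permissions on C:\Windows\CSC are quite restrictive, so this may come up empty unless run with sufficient privileges:

# Rough size of the Offline Files cache folder
$bytes = (Get-ChildItem C:\Windows\CSC -Recurse -Force -ErrorAction SilentlyContinue |
          Measure-Object -Property Length -Sum).Sum
'{0:N1} GB' -f ($bytes / 1GB)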

The answer is that this folder is used by the Offline Files feature in Windows. Checking the status and configuration of Offline Files in Windows 8.1 is quite simple: open the Control Panel and search for the keyword ‘Offline’. You will see the entry under ‘Sync Center’.


Click on ‘Manage offline files’, and therein you can check whether Offline Files is enabled.
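If you prefer the command line, the Offline Files WMI provider exposes the cache state; a quick check (a sketch, assuming the Win32_OfflineFilesCache class is available on your SKU):

# Is the Offline Files cache enabled and active on this machine?
Get-WmiObject -Class Win32_OfflineFilesCache | Select-Object Enabled, Active, Location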


Then you can click on ‘Disk Usage’ to check how much space is actually used by Offline Files. Some of the files are temporary in nature, and that was the case in my scenario. My colleague Sujay Desai gave me this link to help understand what the ‘Temporary files’ really are.


A good way to mitigate and control the disk space used by Offline Files is to adjust the disk space limit for the feature. But I first chose to delete the ‘Temporary files’. That typically takes a few minutes, and it reports what it has deleted.


Now, if you travel a lot, you should also be aware of a proactive behavior wherein, on a slow connection, Windows will automatically start caching files using Offline Files. This is not configurable from the UI and can only be controlled through Group Policy.
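If you want to verify whether such a policy is in effect on your machine, one place to look is the Offline Files policy key in the registry; the key name below is my assumption of where these policies land, so treat this as a sketch:

# Check for configured Offline Files policies (the key may not exist if nothing is configured)
Get-ItemProperty 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\NetCache' -ErrorAction SilentlyContinue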


There are many valid and useful cases where Offline Files can be helpful, but I had no use for it, so I decided to explicitly turn the Offline Files feature off. You will be prompted to reboot the system after this is done.


Disclaimer: please only disable Offline Files if you know what you are doing and if it is causing space issues on your boot volume. If you have made offline changes to your files, make sure you sync them back before making any changes; otherwise you may lose those changes. So please be VERY CAREFUL when you do this.
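If you prefer to script the change rather than use the UI, the same WMI class mentioned earlier has an Enable method; this is a sketch based on my reading of its documentation, so test it carefully first:

# Disable the Offline Files cache (a reboot is still required afterwards)
$cache = Get-WmiObject -Class Win32_OfflineFilesCache
$cache.Enable($false)    # pass $true to re-enable; returns 0 on success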

The end state on my system is as follows: Offline Files is disabled and the cache has been emptied.


And of course, my free disk space is back to normal!

Hope you liked this tip; if you did, please leave a comment and / or rate the blog post. I would appreciate that very much!

WOW64 madness: debugging through the confusion

Last week I was teaching unmanaged code debugging to one of my customers. They were running 32-bit applications on 64-bit servers, which does create some unique problems. For example, capturing a ‘hang’ dump by right-clicking the process in Task Manager’s Processes list will create a ‘64-bit dump’ (if that makes sense), and the stacks visible by default will be those of the WOW64 emulation layer, which in turn causes problems for regular commands, for debugging extensions like PSSCOR / SOS, and even for some of the built-in extensions.

So here is a comprehensive listing of what happens when you use a specific method of capturing a dump / ETW log / using other tools. A similarly themed post covering all the tools we have is here.

 

| Scenario | 32-bit OS + 32-bit application | 64-bit OS + 32-bit application (i.e. WOW64) | 64-bit OS + 64-bit application |
| --- | --- | --- | --- |
| Production live debugging | Attach 32-bit WinDbg / CDB | Attach 32-bit WinDbg / CDB | Attach 64-bit WinDbg / CDB |
| Gathering a ‘hang’ dump * | 1. ProcDump, preferably with -r; 2. Processes tab in DebugDiag; 3. 32-bit WinDbg / CDB in non-invasive mode, then .dump | 1. ProcDump, preferably with -r; 2. Processes tab in DebugDiag; 3. 32-bit WinDbg / CDB in non-invasive mode, then .dump. Avoid using Task Manager. | 1. ProcDump, preferably with -r; 2. Processes tab in DebugDiag; 3. 64-bit WinDbg / CDB in non-invasive mode, then .dump |
| Gathering a ‘crash’ dump * | 1. Crash rule using DebugDiag x86 – ‘just works’; 2. ProcDump -e – ‘just works’; 3. the appropriate debugger per the hang dump guidance above | 1. Crash rule using DebugDiag x64 – ‘just works’ #; 2. ProcDump -e – ‘just works’; 3. the appropriate debugger per the hang dump guidance above | 1. Crash rule using DebugDiag x64 – ‘just works’; 2. ProcDump -e – ‘just works’; 3. the appropriate debugger per the hang dump guidance above |
| Troubleshooting high CPU usage using XPerf | Use the 32-bit Windows Performance Toolkit | Use the 64-bit Windows Performance Toolkit | Use the 64-bit Windows Performance Toolkit |
| Troubleshooting a performance issue using PerfView | PerfView works correctly regardless of the ‘bitness mix’ | – same – | – same – |
| Dump analysis of a dump containing only native code ^ | Any debugger should work, but matching ‘bitness’ is recommended as far as possible | – same – | – same – |
| Dump analysis of a dump containing any managed code ^ | Debug using a 32-bit debugger and 32-bit PSSCOR / SOS | Debug using a 32-bit debugger and 32-bit PSSCOR / SOS | Debug using a 64-bit debugger and 64-bit PSSCOR / SOS |
| ‘Live’ kernel debugging using LiveKD | LiveKD ‘just works’ provided the x86 Debugging Tools for Windows are installed | LiveKD ‘just works’ provided the x64 Debugging Tools for Windows are installed | LiveKD ‘just works’ provided the x64 Debugging Tools for Windows are installed |
| GFlags utility | Use the x86 version of GFlags | Use the x86 version of GFlags | Use the x64 version of GFlags |
| Application Verifier | Install the x86 version of Application Verifier | Install the x64 version of Application Verifier (automatically includes the x86 binaries) and launch the ‘Application Verifier (WOW)’ program $ | Install the x64 version of Application Verifier |

# Note that you will not be able to (and should not) install the 32-bit version of DebugDiag on a 64-bit OS.

* Note: with ProcDump, you must also use the -ma switch for detailed debugging, especially if you are debugging memory issues; for managed code dumps it is a must as well.

^ Note: the dump must be gathered ‘correctly’ with one of the supported mechanisms above.

$ Empirically, running either the native or WOW version of the GUI seems to put entries under both sets of keys. Please check this post from Chris Jackson for some related details.
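As a concrete example of the hang dump guidance above, this is how one might capture a full reflected dump of a 32-bit process running on a 64-bit OS (the process name and output path are hypothetical):

# Full-memory hang dump; -r dumps from a process clone to minimize impact on the target
.\procdump.exe -ma -r MyWow64App.exe C:\dumps\MyWow64App.dmp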

The other thing you need to understand before proceeding further is the role of the WOW64 layer itself; there are several good blog posts that cover it in depth.

Hope this cheat sheet is useful for you. If it is, I would really appreciate you taking a minute to rate this post and leave a comment!

Too many files causing RAM pressure? (a.k.a. DynCache to the rescue!)

Readers of this blog might recall a previous post which described one impact of having too many similarly named files in one folder on NTFS. It turns out that the 8.3 naming convention is not the only thing you need to worry about when you have a very large number of (smaller) files on the same volume.

Today I was called in to assist with a performance issue on a server. The only visible symptom was excessive RAM utilization. The interesting aspect was that no specific user-mode process was consuming that RAM, so we wondered where it had gone. If this server had been running SQL Server, for example, the ‘ghost’ utilization could have been due to locked pages (which do not show up in Task Manager), but that was not the case here.

Analysis

So we ran the RAMMap utility and found that the usage for MetaFile was a substantial percentage of the total RAM usage. This AskPerf blog post explains what MetaFile is all about:

“Metafile is part of the system cache and consists of NTFS metadata. NTFS metadata includes the MFT as well as the other various NTFS metadata files (see How NTFS Works for more details, and of course Windows Internals is a great reference). In the MFT each file attribute record takes 1k and each file has at least one attribute record. Add to this the other NTFS metadata files and you can see why the Metafile category can grow quite large on servers with lots of files.”

The next step, therefore, was to cross-check how large the MFT really was. The easy way to do this is the command below:

fsutil fsinfo ntfsinfo <drive letter>

A sample output is given below (from my own laptop).

C:\>fsutil fsinfo ntfsinfo c:
NTFS Volume Serial Number :       0x8a40c9ee40c9e0d5
NTFS Version   :                  3.1
LFS Version    :                  2.0
Number Sectors :                  0x000000003a2007ff
Total Clusters :                  0x00000000074400ff
Free Clusters  :                  0x0000000000ab1f84
Total Reserved :                  0x000000000002ef10
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               4096
Bytes Per FileRecord Segment    : 1024
Clusters Per FileRecord Segment : 0
Mft Valid Data Length :           0x0000000030240000
Mft Start Lcn  :                  0x00000000000c0000
Mft2 Start Lcn :                  0x0000000000000002
Mft Zone Start :                  0x0000000004d58da0
Mft Zone End   :                  0x0000000004d655c0
Resource Manager Identifier :     96CC88FE-5621-11E3-AF31-3C970EA47926

In this output, the “Mft Valid Data Length” value indicates how many bytes the MFT occupies. In the above case, it equates to around 770MB:

0x0000000030240000 / (1024*1024) = 770.25 MB
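If you would rather not do the hex arithmetic by hand, a small PowerShell sketch does it for you; the parsing assumes the output format shown above:

# Extract 'Mft Valid Data Length' from fsutil output and convert to MB
$line = (fsutil fsinfo ntfsinfo C: | Select-String 'Mft Valid Data Length').Line
$hex  = ($line -split ':')[1].Trim() -replace '^0x', ''
'{0:N2} MB' -f ([Convert]::ToInt64($hex, 16) / 1MB)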

Just as a curiosity, I ran RAMMap on my own laptop (Windows 8.1); its MetaFile figure aligns roughly with the number computed above.


FYI, on the real-world server I was looking at, the size of the MFT was actually 1.5 times the amount of RAM on the box!

Mitigation

Since the server in question was running Windows Server 2008 R2 SP1, we recommended the usage of the DynCache service sample, which would automatically control the size of the system cache based on system memory notifications.

SQL Server FileStream

Practically, this issue would also apply when you use SQL FileStream to store a very large number of blobs in the NTFS file system. In such cases, here are my recommendations:

  1. Consider setting the SQL Server Database Engine to use ‘locked pages’ with ‘max server memory’ set appropriately.
  2. If the OS is Windows 2008 R2 or below, you may additionally consider using the DynCache service to mitigate the effect that the large MFT will have.

Other notes

If you anticipate huge numbers of files on the file system, keep in mind that each file record takes up around 1KB. That means 100 million files will consume close to 100GB worth of MFT storage!
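The back-of-the-envelope arithmetic, in PowerShell form:

# ~1KB of MFT record per file: 100 million files is roughly 95GB, i.e. close to 100GB
'{0:N1} GB' -f (100000000 * 1KB / 1GB)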

On Windows 2008 R2 and below, be aware of issues like the one described in KB article 967351 and install the updated version of NTFS.sys accordingly.

In extreme cases, if you want optimal performance without compromising system cache memory utilization, size the system RAM accordingly, keeping in mind the very large MFT that might result.


Debugging story: Slowness due to NTFS short file (8.3) name generation

When I teach production debugging to my customers, I always tell them that to be successful you need to know not only the right tools and command syntax, but also the right methodology. And perhaps even more importantly, when debugging certain types of issues, knowledge of Windows Internals and the ability to ‘connect the dots’ is a savior.

Recently, I had one such experience: an issue with an application turned out to be related to an NTFS behavior (one I had previously read about in my SQL Server work), and had I not ‘connected the dots’, I would not have been able to remediate the issue quickly. Wondering what this is all about? Read on!

Scenario

So this was a strange problem: a server had hundreds of processes which were all seemingly ‘hung’ and not doing anything. The server in question was hosting a vendor application which had spawned those processes, but normally those processes execute quickly and do not accumulate the way they had.

It appeared to me that the processes were hung in some way. Since the application was deployed as a service, the child processes were on a non-visible desktop, so maybe they had popped up a UI element; that was my first theory. But as with any hanging user-mode process, I decided to collect hang-mode dumps to be sure about the root cause. I took sample dumps from two processes using DebugDiag 1.2.

Dump analysis

Once the dumps were available, I opened them in WinDbg (x86, as these were 32-bit application processes). Here is the call stack from the first process:

0:000> kL
ChildEBP RetAddr 
0018d368 74cd9a04 ntdll!NtSetInformationFile+0x12
0018d43c 74cecd44 kernel32!MoveFileWithProgressTransactedW+0x334
0018d474 74ceccec kernel32!MoveFileWithProgressTransactedA+0x5b
0018d494 74d3d93f kernel32!MoveFileWithProgressA+0x1b
0018d4b0 00401302 kernel32!MoveFileA+0x16
WARNING: Stack unwind information not available. Following frames may be wrong.
0018d4cc 00401aa2 fooapp+0x7777
0018ff94 76fb9ef2 kernel32!BaseThreadInitThunk+0xe
0018ffd4 76fb9ec5 ntdll!__RtlUserThreadStart+0x70
0018ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

The above call stack is due to a file being moved. In this case, by dumping the parameters to MoveFileA, we found that the file was being renamed within the same folder; the extension was being changed from .FOO to .BAR (file extensions changed to protect the identity). The call stack for the thread in the second process is given below:

0:000> kL
ChildEBP RetAddr 
0018d38c 74c7c5dc ntdll!NtCreateFile+0x12
0018d430 74cc3f86 KERNELBASE!CreateFileW+0x35e
0018d45c 74cc53e4 kernel32!CreateFileWImplementation+0x69
0018d48c 004012b3 kernel32!CreateFileA+0x37
WARNING: Stack unwind information not available. Following frames may be wrong.
0018d4c0 004019df fooapp+0x8888
0018ff94 76fb9ef2 kernel32!BaseThreadInitThunk+0xe
0018ffd4 76fb9ec5 ntdll!__RtlUserThreadStart+0x70
0018ffec 00000000 ntdll!_RtlUserThreadStart+0x1b

The above is a call stack of the application creating a new file. Looking at both of these, it appeared to me that the hard disk was performing slowly. So we extracted the file name (the first parameter to the CreateFileA routine) and found that it was a new file within a folder on the D: drive. But when we gathered performance counter logs for the D: drive, it appeared to be perfectly normal.
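For reference, extracting such an argument on x86 is straightforward: kb displays the first three stack parameters for each frame, and da dumps the ANSI string at a given address. The placeholder below stands in for the first parameter shown against the CreateFileA frame in the kb output:

0:000> kb
0:000> da <first parameter of kernel32!CreateFileA from the kb output>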

So it was quite puzzling to see the calls creating and renaming files take so long despite the physical drive performing quite well.

Root cause

We then looked a bit further by browsing to the folder on the D: drive which was being accessed by the application. We could not even list its contents (Explorer would appear unresponsive). It then came to mind that when there are a large number of similarly named files in the same folder (which was exactly the case with this application), NTFS has to work extra hard to generate unique names under the 8.3 naming convention (so-called ‘short file names’). This TechNet article mentions a threshold of 300,000 files above which short name generation can become a bottleneck. The algorithm for short file name generation is also documented here, and there is another very interesting but really old KB article which shows how to perform the name conversion in FoxPro.

Now, these names are mostly a legacy remnant of DOS and the 16-bit world, and on most servers there should be no reason to use them. Thankfully, we have a way to disable the generation of these short file names, as shown below. I then evaluated the effect that disabling this has; my test results are given in the Testing section.
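If you want to try this yourself, the relevant commands are below (run from an elevated prompt; the per-volume 8dot3name form assumes Windows 7 / Windows Server 2008 R2 or later):

fsutil 8dot3name query D:
fsutil 8dot3name set D: 1
fsutil behavior set disable8dot3 1

The first command shows the current setting, the second disables short name generation on the D: volume only, and the third is the older global form of the setting.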

Testing

I executed a series of tests with a simple PowerShell script which creates and renames files in a tight loop. In each case, the folder already contained over a million files. My tests were done on a Windows 8 laptop with a single-spindle SATA hard disk. A minimal sketch of the kind of script used follows, with the results after it.
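This is not the original script, but it captures the essence of the test loop (the file count, naming pattern, and paths are illustrative):

# Create and rename similarly named files in a tight loop, timing the run
$dir = 'D:\8dot3test'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$sw = [System.Diagnostics.Stopwatch]::StartNew()
1..25000 | ForEach-Object {
    $name = 'VeryLongSimilarlyNamedFile{0:D7}.foo' -f $_
    $path = Join-Path $dir $name
    Set-Content -Path $path -Value 'x'
    # Rename in place: same folder, different extension
    Rename-Item -Path $path -NewName ($name -replace '\.foo$', '.bar')
}
$sw.Stop()
'Elapsed ms: {0}' -f $sw.ElapsedMilliseconds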

| Test description | File count per process | Avg. time (8dot3 ON) | Avg. time (8dot3 OFF) | % improvement |
| --- | --- | --- | --- | --- |
| Single process creating files | 25000 | 126028 | 102617 | 19% |
| 5 processes creating + renaming files | 25000 | 567611 | 375946 | 34% |
| 20 processes creating files | 25000 | 2420367 | 1422348 | 41% |

 

Conclusion

The benefits of disabling 8dot3 name generation are obvious from the test results above, and they validate the guidance given in the KB articles mentioned previously. Keep in mind that these results were obtained on Windows 8, where, as in Windows 2008 R2, the algorithm for 8dot3 name generation has already been further tuned; the gains might be even more dramatic on Windows 2003.

So in summary, we can confirm the benefits of disabling 8dot3 name generation on servers where there are large numbers of similarly named files in the same folder. For SQL Servers using FILESTREAM storage, this setting is recommended as well.

 

Further Reading

For more information on NTFS, I encourage everyone to take a look at the presentation that Neal Christiansen (Principal SDE, Microsoft) delivered at a community meeting: NTFS – The workhorse file system for the Windows Platform. Neal has also presented two excellent talks on Channel9, which can be accessed here and here. Amazing information, take a look!

Jose Barreto, Principal PM at Microsoft, has blogged about the performance impact of 8.3 names. His TechEd 2011 presentation is also worth a look.

On a side note, Adi Oltean discusses potential security considerations when the 8.3 filename is used; do take a look!