SQL collation and performance

Recently a colleague asked me if SQL collations have any impact on performance. We also hit upon another question: do local variables (DECLARE @somevar <datatype>) have a collation associated with them, and if so, how is it controlled?

Let us take some simple examples to demonstrate these points and derive conclusions.

Create databases

First, we set up two databases (inspired by SQL Server Books Online, these are aptly named GreekDB and LatinDB).

-- Set up the databases first
create database greekdb
collate greek_ci_as
go

create database latindb
collate latin1_general_cs_as
go

Create tables with dummy data

Next, we are creating some tables in them, inserting data and then building indexes. First, we do the GreekDB ones:

use greekdb
create table greektable
(
    greekstr varchar(50)
)

insert greektable values ('hello world')

create index nc_greek
on greektable (greekstr)

Then the LatinDB ones:

use latindb
create table latintable
(
    latinstr varchar(50)
)

insert latintable values ('hello world')

create index nc_latin
on latintable (latinstr)
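With the setup done, it is worth double-checking the collations now in play. A quick sanity check (a sketch, assuming the objects were created exactly as above):

```sql
-- database-level collations
select name, collation_name
from sys.databases
where name in ('greekdb', 'latindb')

-- column-level collations (columns inherit the database collation unless overridden)
select c.name as column_name, c.collation_name
from greekdb.sys.columns c
where c.object_id = object_id('greekdb.dbo.greektable')
```

You should see Greek_CI_AS for the GreekDB objects and Latin1_General_CS_AS for the LatinDB ones.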

Query the tables

We start by querying these tables and observing the execution plans (I am using text showplan for ease of copying into this blog post.)

Query 1: Best case scenario

use greekdb
select *
from greektable
where greekstr = 'Hello world'

Here is the execution plan:

  |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@1]) ORDERED FORWARD)

Observations: Fairly obvious results, no collation mismatch, everything is fine, index gets used to seek.

Query 2: Cross database query with collation mismatch

use latindb
select *
from greekdb.dbo.greektable
where greekstr = 'Hello world'

Here is the execution plan:

  |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=CONVERT_IMPLICIT(varchar(8000),[@1],0)) ORDERED FORWARD)

Observations: Before you read further, keep in mind that USE statement referencing LatinDB; the rest of the query operates on GreekDB. Looking at the execution plan, the implicit conversion of the string literal is interesting. If a COLLATE clause is not specified, string literals in a batch inherit the collation of the current database (click on the previous link and check the Remarks section). That is what we see here: the string literal 'Hello world' defaults to the collation of the current database, in this case LatinDB, and is therefore converted to the collation of the column. The index seek can still be used because it is the literal, not the column, that is converted.

Query 3: Usage of COLLATE to specify string literal collation

use latindb
select *
from greekdb.dbo.greektable
  where greekstr = 'Hello world' COLLATE greek_ci_as

Here is the execution plan:

  |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=CONVERT(varchar(8000),[@1],0)) ORDERED FORWARD)

Observations: You see that in this case specifying COLLATE has no effect on the query plan, except that the CONVERT_IMPLICIT changes to an explicit CONVERT. The string literal is still coerced into the new collation (Greek_CI_AS). Please refer to the collation precedence rules (click on the previous link and check the 'Collation rules' section) for details on this.

Query 4: Implicit-implicit collation conflict is not allowed

select * from greekdb.dbo.greektable G
join latindb.dbo.latintable L
on G.greekstr = L.latinstr

Result: Msg 468, Level 16, State 9, Line 4
Cannot resolve the collation conflict between "Latin1_General_CS_AS" and "Greek_CI_AS" in the equal to operation

Observations: This should be obvious from the collation precedence rules mentioned earlier: comparing two columns with different implicit collations yields no collation, and the comparison errors out.

Query 5: Usage of COLLATE to specify column collation

use latindb
select * from greekdb.dbo.greektable G
join latindb.dbo.latintable L
on G.greekstr collate latin1_general_cs_as = L.latinstr

Here is the execution plan:

|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006]))
     |--Compute Scalar(DEFINE:([Expr1006]=CONVERT(varchar(50),[greekdb].[dbo].[greektable].[greekstr] as [G].[greekstr],0)))
     |    |--Index Scan(OBJECT:([greekdb].[dbo].[greektable].[nc_greek] AS [G]))
     |--Index Seek(OBJECT:([latindb].[dbo].[latintable].[nc_latin] AS [L]), SEEK:([L].[latinstr]=[Expr1006]) ORDERED FORWARD)

Observations: This query, which is almost the same as the previous one, works, while the previous one errors out. The reason should be obvious: the explicit conversion of the Greek column to the Latin collation is what enables it to work. While this sounds good, there is potential for data loss. From this link, you can see the dreaded words: 'Data loss during code page translations is not reported.'

Query 6: Explicit collation conversion on column

use greekdb
select *
from greekdb.dbo.greektable
where greekstr collate latin1_general_cs_ai = 'Hello world'

Here is the execution plan:

  |--Index Scan(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]),  WHERE:(CONVERT(varchar(50),[greekdb].[dbo].[greektable].[greekstr],0)=CONVERT_IMPLICIT(varchar(8000),[@1],0)))

Observations: In this query, which is similar to the previous one, we convert the Greek column to a third collation (an accent-insensitive variant of the LatinDB database collation). The result is that the underlying index can no longer be used for a seek, causing a scan. The other interesting thing to observe is the collation conversion of the string literal as well, which is what allows the comparison to happen.

Query 7: COLLATE specified with same collation

use greekdb
select *
from greekdb.dbo.greektable
where greekstr collate greek_ci_as = 'Hello world'

Here is the execution plan:

  |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@1]) ORDERED FORWARD)

Observations: This is the 'best case' all over again. Because the specified collation matches the column's own collation, the explicit COLLATE clause is effectively ignored by the engine and the index is used for a seek.

Query 8: Data type mismatch in a case-insensitive collation database

use greekdb
select *
from greekdb.dbo.greektable
where greekstr = N'Hello world'

Here is the execution plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006], [Expr1007], [Expr1005]))
       |--Compute Scalar(DEFINE:(([Expr1006],[Expr1007],[Expr1005])=GetRangeThroughConvert([@1],[@1],(62))))
       |    |--Constant Scan
       |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr] > [Expr1006] AND [greekdb].[dbo].[greektable].[greekstr] < [Expr1007]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),[greekdb].[dbo].[greektable].[greekstr],0)=[@1]) ORDERED FORWARD)

Observations: This case is more complex. It involves a string literal of type NVARCHAR, which is higher in data type precedence than our column's VARCHAR. The most important consequence is that the column greekstr has to be implicitly converted to NVARCHAR(50). Also of note is that one row is returned by the SELECT query. The other item of great interest is the usage of the GetRangeThroughConvert() internal function. Some details about this internal function are here and here; essentially, think of it as an optimization applied when SQL Server has to deal with an implicit conversion in the predicate, which enables the index to still be 'seeked' (though a lot of work is done before that seek can happen).
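The cleanest way to avoid all this extra work is to keep the literal (or parameter) the same type as the column. A sketch of two options; note that casting an NVARCHAR value down to VARCHAR can silently lose characters that the target code page cannot represent:

```sql
-- option 1: use a varchar literal so no conversion is needed at all
select *
from greekdb.dbo.greektable
where greekstr = 'Hello world'

-- option 2: explicitly cast the nvarchar value to the column's type
select *
from greekdb.dbo.greektable
where greekstr = cast(N'Hello world' as varchar(50))
```

Either form lets the optimizer produce the plain index seek we saw in Query 1.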

Query 9: Data type mismatch in case-sensitive collation database

use latindb
select *
from latindb.dbo.latintable
where latinstr = N'Hello world'

Here is the execution plan:

|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1006], [Expr1007], [Expr1005]))
     |--Compute Scalar(DEFINE:(([Expr1006],[Expr1007],[Expr1005])=GetRangeThroughConvert([@1],[@1],(62))))
     |    |--Constant Scan
     |--Index Seek(OBJECT:([latindb].[dbo].[latintable].[nc_latin]), SEEK:([latindb].[dbo].[latintable].[latinstr] > [Expr1006] AND [latindb].[dbo].[latintable].[latinstr] < [Expr1007]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),[latindb].[dbo].[latintable].[latinstr],0)=[@1]) ORDERED FORWARD)

Observations: On the face of it, this query is very similar to the previous one. The main difference is that we are operating on a case-sensitive database, so no rows are returned: the stored value 'hello world' does not match 'Hello world' under a case-sensitive collation. Similar to the previous case, you see the GetRangeThroughConvert() internal function enabling the seek despite the data type mismatch.

Query 10: Case-insensitive collation specified

use latindb
select *
from latindb.dbo.latintable
where latinstr collate latin1_general_ci_as = N'Hello world'

Here is the execution plan:

  |--Index Scan(OBJECT:([latindb].[dbo].[latintable].[nc_latin]),  WHERE:(CONVERT_IMPLICIT(nvarchar(50),CONVERT(varchar(50),[latindb].[dbo].[latintable].[latinstr],0),0)=CONVERT_IMPLICIT(nvarchar(4000),[@1],0)))

Observations: This query is very similar to the previous ones, except that it has an explicit COLLATE applied to the column. The query does return one row this time. You can see that the latinstr column is not only explicitly CONVERTed to the new collation, but is then implicitly converted to NVARCHAR. The other interesting aspect is that the string literal is also coerced to the same collation, which is necessary for a correct result.

Query 11: Effect of database collation on variable declaration

use greekdb -- now this becomes the database whose collation will be used for all variables
declare @mygreekvar varchar(50)

use latindb
declare @mylatinvar varchar(50)  -- you would think this is latin1_general_cs_as. Let's see later.

set @mygreekvar = 'hello world'

set @mylatinvar = 'hello world'

if (@mygreekvar = @mylatinvar)
    select 'they are the same'
else
    select 'different'

use greekdb
select *
from greektable
where greekstr  = @mygreekvar

use latindb
select *
from latindb.dbo.latintable
where latinstr = @mylatinvar

Execution plan for the first SELECT:

  |--Index Seek(OBJECT:([greekdb].[dbo].[greektable].[nc_greek]), SEEK:([greekdb].[dbo].[greektable].[greekstr]=[@mygreekvar]) ORDERED FORWARD)

Execution plan for the second SELECT:

  |--Table Scan(OBJECT:([latindb].[dbo].[latintable]), WHERE:([latindb].[dbo].[latintable].[latinstr]=CONVERT_IMPLICIT(varchar(50),[@mylatinvar],0)))

Observations: Hopefully by now you have figured this one out. The collation applied to local variables is actually that of the last USE database prior to the first DECLARE in the batch. Read more about that behavior here (the link opens a Microsoft Connect item). So the @mylatinvar variable has to be converted to suit the latintable column's collation.
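If you ever need to check what collation a variable actually picked up, SQL_VARIANT_PROPERTY can report it. A small sketch; run it under different USE contexts to see the effect:

```sql
use greekdb
declare @myvar varchar(50)
set @myvar = 'hello world'

-- should report the greekdb collation (Greek_CI_AS) for this batch
select sql_variant_property(@myvar, 'Collation') as variable_collation
```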

The key point I wish to call out here is that the COLLATE clause can cause a conversion of the column. The effect depends on two things:

a. Whether the target collation is the same as the column's collation. If it is the same, no scalar conversion is required at all (as we saw in Query 7).

b. Where the COLLATE clause appears. If it appears in a predicate (such as a WHERE clause or JOIN condition) and the target collation differs from the column's, it can turn an index seek into a scan (as in Query 6). If instead it appears in the output list (not in a predicate) and the target collation differs, it introduces a Compute Scalar operator into the plan; depending on how many such conversions occur (rows * columns), this can add noticeable CPU overhead.
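To see the output-list case for yourself, apply COLLATE only in the SELECT list. This is a sketch; the predicate-free query scans the index anyway, but the point to look for in the plan is the extra Compute Scalar that converts every returned row:

```sql
use greekdb
select greekstr collate latin1_general_cs_as as greekstr_latinized
from greektable
```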

To summarize

PHEW! This is a rather long post, but it exposes some important points:

  1. Collation mismatches can be dangerous: at worst they can cause data loss or errors (if coercion is disallowed), and in many other cases they can at the very least cause performance problems.
  2. The usage of the COLLATE clause can introduce a performance penalty. If this clause is applied on top of a table column inside a predicate, indexes may no longer be used for a seek (or extra work has to be done via the GetRangeThroughConvert() function).
  3. Lastly, if you are using cross-database references in a batch, beware of the collation assigned to each local variable: they inherit the collation of the last USE database prior to the first DECLARE.

Thank you for reading. Questions, comments are most welcome!

Hosting ReportServer catalog DBs on older version of SQL database engine

A FAQ we get is whether it is supported to have the ReportServer and ReportServerTempDB databases (the catalog databases used by SSRS) on an older version of SQL Server than the SSRS version itself. Here is the summary answer along with the reference links:

For SSRS 2012, it is supported to install the catalog databases on versions all the way down to SQL 2005. Reference: http://msdn.microsoft.com/en-us/library/ms157285.aspx

For SSRS 2008 and 2008 R2, the same story: the catalog DB can be hosted all the way down on SQL 2005. Reference: http://msdn.microsoft.com/en-us/library/ms157285(v=sql.105).aspx. There is an 'Other Versions' link on that page; use it to navigate to the reference for SSRS 2008 and 2005 as well.

One topic related to this is when you have SSRS 2008 R2 in the SharePoint 2010 ‘Connected Mode’. Please refer to the blog post http://blogs.msdn.com/b/prash/archive/2010/04/26/integrating-ssrs-r2-rtm-with-sharepoint-2010-rtm.aspx for more details on this scenario.

Thank you for reading!

Arvind.

AlwaysOn Availability Groups Connectivity Cheat Sheet

Today we were discussing the AlwaysOn connectivity options, and the various combinations of the Readable Secondary setting and connection string attributes became very complex. So I have summarized all the possible outcomes in the tables below.

For convenience, we have two tables. The first table covers the cases when the Primary Replica is set to allow all types of connections (i.e. those with and without the ApplicationIntent=ReadOnly attribute.)
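For reference, a read-intent connection string looks roughly like this (the listener name, port and database below are placeholders from my test setup):

```
Server=tcp:mylisten,1433;Database=salesdb;Integrated Security=SSPI;ApplicationIntent=ReadOnly
```

Connections without the ApplicationIntent attribute default to read-write intent.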

Case 1: Primary Replica allows ALL connections

Each row shows which replica ends up serving the connection; the last three columns correspond to the Readable Secondary setting of the secondary replica (No / Yes / Read-intent only):

Connection path       ApplicationIntent   Initial Database     No                Yes          Read-intent only
Through Listener      None                Non-AG DB            Primary           Primary      Primary
Through Listener      ReadOnly            Non-AG DB            Primary           Primary      Primary
Through Listener      None                AG DB                Primary           Primary      Primary
Through Listener      ReadOnly            AG DB                Not allowed (5)   Secondary    Secondary
Direct to Primary     None                n/a (don't care)     Primary           Primary      Primary
Direct to Primary     ReadOnly            n/a (don't care)     Primary           Primary      Primary
Direct to Secondary   None                n/a (don't care)     Not allowed (4)   Secondary    Not allowed (1)
Direct to Secondary   ReadOnly            n/a (don't care)     Not allowed (4)   Secondary    Secondary

The second table covers the cases when the Primary Replica is set to allow only read/write connections (i.e. only those without the ApplicationIntent=ReadOnly attribute).

Case 2: Primary Replica allows only Read/Write connections

(Column layout as in Case 1; the last three columns are the Readable Secondary setting: No / Yes / Read-intent only.)

Connection path       ApplicationIntent   Initial Database     No                Yes               Read-intent only
Through Listener      None                Non-AG DB            Primary           Primary           Primary
Through Listener      ReadOnly            Non-AG DB            Not allowed (2)   Not allowed (2)   Not allowed (2)
Through Listener      None                AG DB                Primary           Primary           Primary
Through Listener      ReadOnly            AG DB                Not allowed (3)   Secondary         Secondary
Direct to Primary     None                n/a (don't care)     Primary           Primary           Primary
Direct to Primary     ReadOnly            n/a (don't care)     Not allowed (2)   Not allowed (2)   Not allowed (2)
Direct to Secondary   None                n/a (don't care)     Not allowed (4)   Secondary         Not allowed (1)
Direct to Secondary   ReadOnly            n/a (don't care)     Not allowed (4)   Secondary         Secondary

For some cases, access to the database is not allowed (either at connection time, or at run time when the database is actually accessed), with the following error messages, keyed to the numbers in the tables above:

1. Error message: The target database ('salesdb') is in an availability group and is currently accessible for connections when the application intent is set to read only. For more information about application intent, see SQL Server Books Online.

2. Error message: The target database ('SalesDb') is in an availability group and currently does not allow read only connections. For more information about application intent, see SQL Server Books Online.

3. Error message: Unable to access the 'SalesDB' database because no online secondary replicas are enabled for read-only access. Check the availability group configuration to verify that at least one secondary replica is configured for read-only access. Wait for an enabled replica to come online, and retry your read-only operation.

4. Error message: The target database, 'salesdb', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online.

5. Error message: Unable to access the 'SalesDB' database because no online secondary replicas are enabled for read-only access. Check the availability group configuration to verify that at least one secondary replica is configured for read-only access. Wait for an enabled replica to come online, and retry your read-only operation.

These tables should help you predict how connectivity will behave in each of these combinations, and therefore set up your Availability Group replica properties correctly.
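The 'Readable Secondary Setting' columns above correspond to the ALLOW_CONNECTIONS option on each replica. For example, to restrict a secondary to read-intent connections only (a sketch; the AG and instance names are made up):

```sql
alter availability group SalesAG
modify replica on N'SQLNODE2'
with (secondary_role (allow_connections = read_only))
```

The other values of the setting map to ALLOW_CONNECTIONS = NO and ALLOW_CONNECTIONS = ALL respectively.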


I hope you enjoyed this post as well. If you would like to see more, please leave a comment!

AlwaysOn Availability Groups Listener: connectivity details

Today in some discussions with my colleagues we were looking at the AlwaysOn Listener, which allows SQL Server client applications to be abstracted from the details of the AlwaysOn Availability Group implementation. On the surface, the Listener is essentially a cluster resource pair: a virtual IP and a virtual network name. But on deeper examination there are some details worth knowing, and this blog post will help you discover them!

Firstly, here is an example of what the Listener looks like in Cluster Management. You can see the network name resource (mylisten) and the virtual IP as well (192.168.1.78).

[screenshot]

And in Management Studio, you can see the same details as well:

[screenshot]

In many diagrams the Listener is shown as a separate entity, distinct from the SQL database engine. In some ways this is correct, as you might infer from the virtual IP and network name above. However, we have to consider the endpoint as well. The question which might come up in your mind is: fine, this is a network name and a virtual IP, and the cluster service brings these online on the active node (it 'brings online' the virtual IP by registering it against the network interface of the active node); but where is the listener port actually bound and listened on? The cluster service cannot do that for us.

So here is how you can investigate the actual mechanism. Firstly, run netstat -ano on the primary replica (which is where the listener is active) and find out which process is listening on the listener's IP and port. In our case these dummy address and port values are 192.168.1.78 and 1433 respectively. On checking, we see:

C:\>netstat -ano

Active Connections

  Proto  Local Address          Foreign Address        State           PID

  TCP    192.168.1.78:1433      0.0.0.0:0              LISTENING       1232

The PID 1232 is actually the PID of sqlservr.exe! That means the listener functions are actually implemented in the database engine itself. FYI, here is how we can check which process has PID 1232:

C:\>tasklist /fi "pid eq 1232"

Image Name                     PID Session Name        Session#    Mem Usage
========================= ======== ================ =========== ============
sqlservr.exe                  1232 Services                   0    227,208 K

So this means that the 'listener' endpoint is actually bound inside the SQL database engine! We can confirm this easily by looking at the SQL Server errorlog:

2012-09-12 03:40:14.820 spid17s      The availability group database "salesdb" is changing roles ....
2012-09-12 03:40:14.830 Server       Server is listening on [ 192.168.1.78 <ipv4> 1433].

2012-09-12 03:40:14.830 Server       Started listening on virtual network name 'mylisten'. No user action is required.

So this is good information: the listener actually 'lives' inside the sqlservr.exe of the primary replica. Any connection from the client will therefore first reach this SQL instance, and may then be redirected to the secondary read-only replica (if configured, which it is in my case).

To test this, I ran a NetMon capture with a filter on the TCP ports for the listener (1433) and the replicas (1450). From this trace, it is obvious that there is an initial connection to the primary replica (which actually hosts the listener endpoint), after which the client is 'redirected' and reconnects to the actual secondary read-only replica:

[screenshot]

In the above NetMon trace, 192.168.1.78 is the listener IP (which is actually bound on the primary replica's node), and you will initially see a set of login and TDS packets flowing between the workstation and this listener. At the end of that TDS exchange, you will see that the client on SQLWORKSTN has picked up 192.168.1.101 (the secondary replica instance in my test setup) and is now talking to the secondary replica.
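You can reproduce this redirection from the command line as well: the sqlcmd shipped with SQL 2012 accepts a -K switch to declare application intent. With the listener and database names from my setup, a sketch of the test would be:

```
sqlcmd -S mylisten,1433 -d salesdb -K ReadOnly -Q "select @@servername"
```

If routing is configured, @@SERVERNAME should come back as the secondary replica's instance name rather than the primary's.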

So to summarize:

  1. The listener IP is physically registered on the public network interface of the active node which hosts the AG listener cluster resource
  2. The cluster resource DLL for AlwaysOn (hadrres.dll) communicates with the primary replica SQL instance
  3. The primary replica in turn binds to the listener IP and port
  4. A read-write client which connects to the listener is actually directly connected, without any additional overhead
  5. Read-only clients may be redirected (depending on the configuration, which is indeed the case in my testing above)
  6. This redirection needs support from the client library, which is implemented in .NET 4.5 and the SQL Server 2012 Native Client respectively (links below)
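For completeness, the redirection in points 4-6 only happens if read-only routing is configured on the replicas. A minimal sketch (the AG, instance and URL values are made up):

```sql
-- where read-intent connections land when this replica is a secondary
alter availability group SalesAG
modify replica on N'SQLNODE2'
with (secondary_role (read_only_routing_url = N'tcp://sqlnode2.contoso.com:1450'))

-- the routing order the primary hands out to read-intent clients
alter availability group SalesAG
modify replica on N'SQLNODE1'
with (primary_role (read_only_routing_list = (N'SQLNODE2')))
```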

Some additional reading on this subject:

I hope this gives you a deeper insight how the AG Listener connectivity works. If you enjoyed reading this as much as I enjoyed creating this walkthrough, please take a minute and leave a comment and rate this post!

Thank you!

Arvind.

How to assign a static port to a SQL Server named instance – and avoid a common pitfall

While Books Online clearly mentions the steps to Configure a Server to Listen on a Specific TCP Port, we still see people missing one small but important detail in those steps: they forget to delete the entry (0 or some random port) for the dynamic port. This firstly causes confusion, and occasionally it can cause connectivity problems as well. Let me explain using an example from one of our lab setups.

As a first step, let's see what the 'administrator' (in this case, yours truly) had done:

[screenshot]

As you can see, they added the static port for 'IPAll' with a value of 1450. That part is fine. The problem is that they forgot to remove the entries for the dynamic ports (0 or some random port). That means that when SQL was restarted, the dynamic port setting was still in effect. In fact, if we query sys.tcp_endpoints, the engine still reports that it is listening on a dynamic port:

select name, protocol_desc, type_desc, state_desc, is_admin_endpoint, port, is_dynamic_port, ip_address
from sys.tcp_endpoints

gives us:

name                        protocol_desc  type_desc  state_desc  is_admin_endpoint  port  is_dynamic_port
Dedicated Admin Connection  TCP            TSQL       STARTED     1                  0     1
TSQL Default TCP            TCP            TSQL       STARTED     0                  0     1

The important observation is that the engine reports it is still using a dynamic port; it does not report the static port number 1450 which we set in Configuration Manager. Let's double-check in the errorlog to see whether the static port is being picked up at all. And lo and behold:

Server is listening on [ 'any' <ipv6> 1450]. 
Server is listening on [ 'any' <ipv4> 1450]. 
Server is listening on [ 'any' <ipv6> 49626]. 
Server is listening on [ 'any' <ipv4> 49626].
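Rather than eyeballing the whole errorlog, you can search it for just the listening entries (xp_readerrorlog is undocumented but widely used; the parameters below are the log number, the log type and a search string):

```sql
-- 0 = current errorlog, 1 = SQL Server log (as opposed to SQL Agent)
exec xp_readerrorlog 0, 1, N'Server is listening on'
```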

In our case, sqlservr.exe has a PID of 1240. Using the command netstat -ano, we can see what it is listening on.

 Proto  Local Address          Foreign Address        State           PID 
  TCP    0.0.0.0:1450           0.0.0.0:0              LISTENING       1240 
  TCP    0.0.0.0:49626          0.0.0.0:0              LISTENING       1240 
  TCP    127.0.0.1:49627        0.0.0.0:0              LISTENING       1240 
  TCP    192.168.1.101:1450     192.168.1.200:49386    ESTABLISHED     1240 
  TCP    192.168.1.101:1450     192.168.1.200:49396    ESTABLISHED     1240 
  TCP    [::]:1450              [::]:0                 LISTENING       1240 
  TCP    [::]:49626             [::]:0                 LISTENING       1240 
  TCP    [::1]:49627            [::]:0                 LISTENING       1240

So it is not only listening on the static port, but also on the dynamic port 49626. The DAC is listening on TCP port 49627. The entries with a local address of [::] are the IPv6 'any' address.
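Another quick check is to ask the engine which local address and port your own connection actually came in on, via a documented DMV:

```sql
select local_net_address, local_tcp_port
from sys.dm_exec_connections
where session_id = @@spid
```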

So depending on what got cached earlier in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\SNI11.0\LastConnect on the client (you can find some detail about the LastConnect registry key and the caching mechanism in this KB article), the client might attempt to connect to the previous dynamic port (which is still valid based on our observation above.)

FYI, if we run NetMon, we can see that the SSRP response from the SQL Browser correctly returns 1450 as the port for this named instance:

[screenshot]

For clarity I’ve reproduced the response from SQL Browser (using the SSRP protocol) back to my client (SQLCMD.exe):

.^.ServerName;SOMESERVER;InstanceName;SOMEINSTANCE;IsClustered;No;Version;11.0.2100.60;tcp;1450;;

From the above it is clear that SQL Browser is correctly sending the static port assignment. But if you are like me, you will feel uneasy until the root cause is fixed, which means deleting the dynamic port assignment!

To summarize here is what we saw in this walkthrough:

  1. The official steps (captured in Books Online) to assign a static port to a named instance also involve deleting the value (0 or some random port) for the dynamic port.
  2. Failure to delete the dynamic port value in SQL Configuration Manager will cause SQL to listen on both the static and the dynamic port.
  3. This means that clients will still succeed in connecting to the erstwhile dynamic port if they have it cached in the LastConnect client-side registry key.
  4. For clients which do not have cached connection details, SQL Browser picks up the static port and sends that back to the client.
  5. So follow the steps in the BOL article to the letter: delete the dynamic port value right after you type in the static port value, and in any case before you restart the instance.

FYI the steps to fix a static port for the Dedicated Admin Connection (DAC) are in the KB article How to configure SQL Server to listen on a specific port under the section ‘Configuring an instance of SQL Server to use a static port’.

I hope you enjoyed reading this post as much as I enjoyed bringing it to you! Please take a minute to rate this post and leave some comments!

Windows Performance Toolkit: Finding CPU consumer

A colleague of mine recently asked the question ‘I see CPU N is constantly pegged at 100%; is there any good way to determine which specific process is using that specific CPU?’

Well, he had already figured out that Task Manager and Process Explorer do not provide that information. And by definition, purely in user mode one cannot find this out, because the OS dispatcher schedules threads, not processes, and the dispatcher runs at a lower level than user-mode tools do. So classically, a live kernel debug would be required to be sure: the debugger extension command !pcr or !prcb will give you the current thread for that processor, which you can then use with !thread to get the owning process.

BUT no customer will let us do a live kernel debug in production unless it is really necessary. There must be an easier way out!

Today, with XPerf (part of Windows Performance Toolkit) you can get this information fairly easily and accurately. Here’s how you can get started:

  1. Download and install the Platform SDK (http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx) and choose to install the Windows Performance Toolkit
  2. Redistributable versions are installed as well, at C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\Redistributables (notice that an ARM target is now available as well!)
  3. Use the Windows Performance Recorder utility to record a trace with just CPU in it, and save it to an .ETL file
  4. Collect data for no more than 30-45 seconds; a LOT of information is collected, so limit the duration
  5. Transfer the .ETL file to your workstation and double-click it; it should open in the Windows Performance Analyzer (WPA) utility
  6. From the graph explorer on the left side, select the CPU timeline report
  7. Select the CPU of interest, and ZOOM IN to a particular section of the timeline. The zooming is important: without it, WPA does not tell you which process is utilizing that CPU
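If you prefer the command line to the Windows Performance Recorder UI, the same capture can be scripted with wpr.exe using its built-in CPU profile (the output path below is just an example):

```
wpr -start CPU -filemode
rem ... reproduce the high-CPU condition for 30-45 seconds ...
wpr -stop C:\traces\cpu.etl
```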

Here is an example screenshot of the type of information that WPA displays. Click on the thumbnail to get a larger view.

[screenshot]

As you can see, it is really detailed and extremely easy to use. If you are on Windows 7, Windows Server 2008 R2, Windows 8 or Windows Server 2012, you should be able to leverage the WPT to its full extent.

For more information on WPT you can click the help links I have mentioned above, and for a more interactive discussion I recommend you view the BUILD 2011 videos around WPT / WPA:

There are some additional videos on Channel9 which cover specific scenarios where WPT / WPA can be very useful:

Please leave a comment and rate the post if you found it useful. I appreciate your time very much!

Happy performance investigations!

Error message: The target database (‘xyz’) is in an availability group and currently does not allow read only connections

Today, while poking around the AlwaysOn Availability Groups feature, I came across the above error when testing the read-only access routing configuration. That configuration allows connections with declared ReadOnly intent to be routed to one of the read-only secondary replicas.

The configuration looked like this:

  • 3-node Windows 2008 R2 Core cluster
  • 3 instances of SQL Server 2012, one on each node
  • 2 databases in the Availability Group
  • Each replica was set to allow only read/write connections when in the primary role
  • Each replica was set to allow only read-intent connections when in the secondary role
  • The read-only URL for each replica was set correctly
  • The read-only routing list was also set correctly

I was testing connectivity from Management Studio to open a new Database Engine Query window (not the Object Explorer) and was using the following options:

  • Connect to database: <default>
  • Additional Connection Properties tab has 'ApplicationIntent=ReadOnly'

However, when I tried to USE any of the databases in the AG, it was not allowed. When I checked @@SERVERNAME, it actually reported the instance name of the PRIMARY replica! So clearly the routing was not working.

Then it dawned on me that the default database for this login was master, and that was causing an issue with the routing. On changing 'Connect to database' to XYZ (the name of the AG database), I was routed correctly to the read-only secondary.

Of course, this is only ONE reason for the above error. Other possibilities include an incorrect AG configuration which does not allow read-only access on the secondary at all.
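When debugging this error, it also helps to confirm the routing metadata from the primary. A sketch against the SQL 2012 catalog views:

```sql
-- per-replica read-only settings and routing URL
select replica_server_name, secondary_role_allow_connections_desc, read_only_routing_url
from sys.availability_replicas

-- the routing list each primary hands out, in priority order
select pr.replica_server_name as when_primary,
       sr.replica_server_name as route_to,
       rl.routing_priority
from sys.availability_read_only_routing_lists rl
join sys.availability_replicas pr on rl.replica_id = pr.replica_id
join sys.availability_replicas sr on rl.read_only_replica_id = sr.replica_id
order by pr.replica_server_name, rl.routing_priority
```

If the second query returns no rows for your primary, read-only routing is not configured and routing cannot work.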

Hope this helps you!

“Operating system error 1398” while backing up database to network share

I ran into this on a VM setup, where the service account of my SQL Server service is a domain account. The DC is also a VM. Here were the symptoms:

While backing up the database to the UNC share, I was getting an error:

"Operating system error 1398" (There is a time and/or date difference between the client and server)

First things first, I did not have any timesync issue. That would have affected Kerberos and everything else around it.

Secondly, running xp_cmdshell 'dir \\uncserver\path' resulted in the same error, so this was not unique to the backup code (obvious, but better to check).

Interestingly, if I opened a new CMD prompt 'run as' the SQL service account and executed the same dir command, the network share contents were listed with no error message.

So then it dawned on me that at the time the SQL service was starting up, there was a race condition wherein the DC was not yet fully started. This caused the use of stale Kerberos tickets, which in turn caused the above error.

The only way to fix this was to restart my SQL instance, because then the expired Kerberos ticket would be replaced with a freshly authenticated and timestamped one. Sad, but true.

Hope this helps someone out there!

“Failed to join domain” trying to add Windows 2008 R2 Core to domain

Today I was using SCONFIG (if you don't know what SCONFIG is, check out these posts) to add a Windows 2008 R2 Core server to an existing domain. I had set the IP address and ensured that the domain name and credentials were correct, but I kept getting 'failed to join domain'.

I tried pre-creating the computer objects in AD, but to no avail. Finally, after some troubleshooting, it turned out to be a most basic issue: the primary DNS IP was not set correctly. After setting it (again with SCONFIG), I was able to add the Windows Core box to the domain.

‘No Disks suitable for cluster disks’: one more possible cause

Today in the lab I was working on a test 3-node cluster, using Hyper-V. We normally use the iSCSI Software Target for testing, so I had set it up on a central VM and then published the target to all the VMs.

The disks were visible in Disk Management on Node1, and I brought them online, partitioned and formatted them. However, when I used Cluster Management to add them to the Storage group, it complained:

'No Disks suitable for cluster disks were found. For diagnostic information about disks available, run the Validate config wizard.'

I did run the cluster validation (storage only) and got no further clues. After some troubleshooting I realized I had forgotten to configure the volumes in the iSCSI initiator on each node. The easiest way to do this is to click the 'Auto Configure' button:

[screenshot]

Due disclaimer: this is just ONE, and not the ONLY, reason for the above error message! It is somewhat specific to the iSCSI initiator, so only check this if you are using iSCSI volumes in your cluster and you get the above message.