Error message “Copying cluster-resource …fssres.dll to C:\Windows\system32\fssres.dll failed…” during SQL Server 2014 setup

First and foremost, let me reproduce the complete error message I refer to above: “Copying cluster-resource C:\Program Files\Microsoft SQL Server\MSSQL12.SQL14\MSSQL\Binn\fssres.dll to C:\Windows\system32\fssres.dll failed because the destination file exists and is read-only”

Investigation

I received the above error message while setting up a SQL Server 2014 instance side by side with an existing copy of SQL Server 2012. The failure left the “Database Engine” component in a failed state, and it later had to be removed before re-attempting setup.

The root cause of such setup failures is often quite easy to locate when you look at the setup logs, which are conveniently placed under the “C:\Program Files\Microsoft SQL Server\120\Setup Bootstrap\Log” folder on your machine. When I looked at the Detail.txt file from the problematic server, I could spot the following section:

(01) 2015-01-04 08:09:34 Slp: UpdateClusterResourceAction: Resource type = ‘SQL Server FILESTREAM Share’ ResourceDllName = fssres.dll Source Location = ‘C:\Program Files\Microsoft SQL Server\MSSQL12.SQL14\MSSQL\Binn’ Target Location = ‘C:\Windows\system32’.
(01) 2015-01-04 08:09:40 Slp: Type ‘SQL Server FILESTREAM Share’ not found. Performing copy directly …
…
(01) 2015-01-04 08:09:40 Slp: Failed to version-copy file ‘C:\Program Files\Microsoft SQL Server\MSSQL12.SQL14\MSSQL\Binn\fssres.dll’ to ‘C:\Windows\system32\fssres.dll’. Exception data is: System.IO.IOException: The process cannot access the file ‘C:\Windows\system32\fssres.dll’ because it is being used by another process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite)
   at Microsoft.SqlServer.Configuration.Cluster.UpdateClusterResourceAction.VersionCopy(String source, String target).
…
Microsoft.SqlServer.Configuration.Cluster.UpdateClusterResourceException: Copying cluster-resource C:\Program Files\Microsoft SQL Server\MSSQL12.SQL14\MSSQL\Binn\fssres.dll to C:\Windows\system32\fssres.dll failed because the destination file exists and is read-only. ---> System.IO.IOException: The process cannot access the file ‘C:\Windows\system32\fssres.dll’ because it is being used by another process.
…
Microsoft.SqlServer.Configuration.Cluster.UpdateClusterResourceAction.VersionCopy(String source, String target)

From the above, it is quite clear that fssres.dll already existed on the machine. When I checked the file version, it was clear that the existing copy was from SQL Server 2012:

image

On executing a TASKLIST /M fssres.dll command, it quickly became clear that the process which had this DLL loaded was RHS.exe for the existing SQL 2012 instance. That instance already had an Availability Group configured, hence RHS.exe was loading the DLL.

Given that the DLL was already loaded by the RHS.exe, there was no way for SQL setup to update it. That is why the SQL 2014 instance setup failed.

Workaround

The workaround I had to use in this case was to first remove the SQL 2014 Database Engine instance, shut down the cluster service on the failed node, and then re-attempt the SQL 2014 setup. Less than ideal, but until this issue is handled in setup, we have few options. I have also logged a Connect item requesting the product team to investigate. If you think this issue should be fixed, please vote it up on Connect!

Potential data latency with AlwaysOn Availability Groups and Memory Optimized Tables

Today I was testing a scenario with a readable secondary in an Availability Group (AG). The database in the AG contained a memory-optimized table and I was testing read-only query support on the secondary.

The mystery

While changes for on-disk tables are replicated near-instantaneously (of course in my test setup there was no other overhead) and were almost immediately visible on the secondary replica, I was surprised to see that it was not the case with the memory-optimized table.

For example, I would insert a row into the table at the primary replica. That row would not appear in the readable secondary, not even after waiting a few seconds. Note that there was no other workload related to memory optimized tables at the time.

Interestingly, if I inserted a new row into the table on the primary, it would cause the previously inserted row to show up at the secondary! I was quite puzzled to see this behavior. On a hunch, I issued a manual CHECKPOINT on the primary. This ‘flushed’ the queue and I was able to see both rows on the secondary.

Truth revealed!

Since this was a bit strange, I dug around a bit and found a gem in the documentation which explains this behavior. There is something called a ‘safe timestamp’ which is used on the readable secondary to only return rows older than a ‘safe point’ which is normally updated by the garbage collection thread on the primary.

Normally, on an active system, this safe timestamp is updated periodically and sent across to the secondary. But if there is not much activity on memory-optimized tables in the database (as in my case), there can be an inordinate delay in sending this safe timestamp across.

Conclusion

If such data latency for in-memory tables is not tolerable on the readable secondary, the product team recommends the use of ‘dummy’ transactions to propagate this safe timestamp more frequently:

Changes made by transactions on the primary replica since the last safe-timestamp update are not visible on the secondary replica till the next transmission and update of the safe-timestamp. If transactional activity on the primary replica stops before the internal threshold for safe-timestamp update is crossed, the changes made since the last update to safe-timestamp will not be visible on the secondary replica. To alleviate this issue, you may need to run a few DML transactions on a dummy durable memory-optimized table on the primary replica. Alternatively, though not recommended, you can force shipping of safe-timestamp by running a manual checkpoint.
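As an illustrative sketch of the documented workaround (the table name dbo.DummyDurable and its definition are my assumptions for illustration, not from the documentation), such a dummy transaction could look like this:

```sql
-- One-time setup: a small durable memory-optimized table used only
-- to generate transactional activity (hypothetical definition):
-- CREATE TABLE dbo.DummyDurable
-- ( id INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 8) )
-- WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Periodically run a trivial DML transaction on the primary to nudge
-- the safe-timestamp across to the readable secondary:
BEGIN TRANSACTION;
INSERT INTO dbo.DummyDurable (id) VALUES (1);
DELETE FROM dbo.DummyDurable WHERE id = 1;
COMMIT TRANSACTION;
```

Running this on a schedule (say, via a SQL Agent job) keeps the safe timestamp moving even when the real workload is idle.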

Be aware of this behavior if you plan to use readable secondary and have some phases where there is minimal activity in the system but you still need the changes to in-memory tables to show up on the secondary as soon as possible.

NUL or NULL?

Ever since the old MS-DOS days, the Windows OS family has supported a special ‘null device’ – NUL. Notice the single ‘L’. This is a ‘reserved’ device identifier which you can use as a path. The special thing about this null device (NUL) is that it essentially swallows whatever is sent to it.

NUL and SQL

So what does this have to do with let’s say SQL? Many of us use the null device to test raw backup throughput. The correct way to do this is to specify DISK = ‘NUL’ in the backup command. In that case, you will see an entry similar to the below in the error log, and no physical destination file will be created:

2014-09-20 21:47:43.890 Backup       Database backed up. Database: master, creation date(time): 2014/09/18(16:33:26), pages dumped: 459, first LSN: 234:304:73, last LSN: 234:352:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘nul’}). This is an informational message only. No user action is required.
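For reference, a minimal throughput-test command of this form (shown against master purely as an illustration) is simply:

```sql
-- Backup sent to the NUL device: no physical file is created,
-- so this measures raw backup read throughput only.
-- Never use this for real backups - the output is discarded!
BACKUP DATABASE master TO DISK = 'NUL';
```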

Now, in some cases, you might accidentally or unknowingly type in NULL (with the double L) instead of NUL. What that does is actually create a physical backup file called NULL. You can see proof of that here:

2014-09-20 21:47:03.480 Backup       Database backed up. Database: master, creation date(time): 2014/09/18(16:33:26), pages dumped: 460, first LSN: 234:208:128, last LSN: 234:280:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {‘C:\Program Files\Microsoft SQL Server\MSSQL12.SQL14\MSSQL\Backup\null’}). This is an informational message only. No user action is required.

Conclusion

So if you are using the NUL device in your tests, make sure you spell it as NUL (without the double L)!

Have a good weekend!

Strange issue when enabling AlwaysOn AG feature in SQL 2014

Today I ran into an issue, which eventually turned out to be a really silly thing, but I think it is important that I share the details.

I was trying to install SQL 2014 side-by-side with an already existing SQL 2012 AG setup. Everything went fine till the customary step when I tried to ‘tick the box’ and enable the AG feature for the newly installed SQL 2014 instance. But that was not to be, with the following error dialog (“The AlwaysOn Availability Groups feature requires the x86(non-WOW) or x64 Enterprise Edition of SQL Server 2012 (or later version) running on Windows Server 2008 (or later version) with WSFC hotfix KB 2494036 installed.”):

image

The hotfix in question was already installed and most importantly, my SQL 2012 AG was already functional on this cluster. So it was definitely not something with the OS or cluster.

Not to be deterred, I used the PowerShell option, but that failed as well:

PS SQLSERVER:\SQL\SQL108W2K8R22\SQL14> Enable-SqlAlwaysOn
Enable-SqlAlwaysOn : Could not obtain information about SQL Server Service ‘MSSQL$SQL14’.
At line:1 char:19
+ Enable-SqlAlwaysOn <<<<
    + CategoryInfo          : ResourceUnavailable: (MSSQL$SQL14:String) [Enable-SqlAlwaysOn], SqlPowerShellServiceException
    + FullyQualifiedErrorId : ServiceError,Microsoft.SqlServer.Management.PowerShell.Hadr.EnableSqlHADRServiceCommand

Providence!

Suddenly it dawned on me that maybe in my haste I had installed an edition of SQL 2014 which did not permit the AG feature. And indeed, it was the Business Intelligence edition that I had installed, which does not permit the usage of AGs.

The cool thing though about SQL 2008 and above is the “Edition Upgrade” feature of the setup, which allows you to easily change editions (provided the upgrade path is allowed – for example you cannot ‘upgrade’ from BI edition to Evaluation) just by entering the correct product key.

The whole edition upgrade took just under a minute, and after that I was able to successfully enable the SQL 2014 installation for AlwaysOn AGs.

Hope this tip helps someone!

“Could not add cluster access…” during SQL 2012 Failover Cluster Instance setup

Today a lab setup of SQL 2012 SP1 + CU7 (I was using the /UpdateSource flag to have a patched setup + roll in the updated bits in one go) failed sometime before creating the SQL resource in cluster. The error message from the detail.txt file showed:

Could not add cluster access for SID ‘S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738’. Error: There was a failure to call cluster code from a provider. Exception message: The specified path is invalid.

Preliminary Analysis

The SID (S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738) in this case happened to be the SQL service SID. How did I know that? You can check it yourself using the SC SHOWSID command:

C:\Windows\system32>sc showsid mssql$inst1

NAME: mssql$inst1
SERVICE SID: S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738

This issue occurs when the setup routine fails to add the service account’s SID to the list of accounts allowed to access the Windows cluster itself. For example, in my test cluster (in a healthy state, of course) you can see the Service SID is added to the list of authorized accounts in the security descriptor of the Windows cluster itself:

image

But of course in my failed setup attempt, the SID was not added correctly. The big question is why?

(Another side question some of you may have is why should the service account have this permission on the cluster itself. Well, that’s another topic and another blog post – stay tuned for that!)

Digging deep: Cluster Log

Carrying on from the previous section, one helpful hint is to know that when such changes are made to the cluster’s ACL configuration, the changes are persisted to the cluster database, a copy of which is also persisted to the quorum disk resource (if one is configured).

Now in my case, deeper troubleshooting was required to find out why the security descriptor could not be written. To find out, I dumped the cluster log (using the PowerShell cmdlet Get-ClusterLog) and found the following messages at the same time that the SQL setup failed:

00001004.00000fe0::2014/01/23-07:20:05.486 INFO  [RES] Physical Disk <RealWitness>: Path W:\Cluster is not on the disk
00001004.00000fe0::2014/01/23-07:20:05.486 ERR   [RHS] Error 161 from ResourceControl for resource RealWitness.
00000fb8.000011fc::2014/01/23-07:20:05.486 WARN  [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to RealWitness returned 161.
00000fb8.000011fc::2014/01/23-07:20:05.585 ERR   [RCM] rcm::RcmApi::SetQuorumResource: ‘ERROR_BAD_PATHNAME(161)’ because of ‘ValidateQuorumPath( pRes, quorumPath )’

W:\Cluster did not make sense initially, because this drive was supposed to be unused by this instance of SQL. I had to think back through the sequence of operations I had performed during this setup…

Root cause found

Looking back, I remembered that I had swapped the disk quorum resource somewhere halfway in between the setup. It was earlier on the W: drive, but now on the Q: drive, which has the resource name RealWitness. So in short, the resource name for the quorum disk was correct, but the drive letter and path on that disk was incorrect. Stale information was used when the setup program tried to update the cluster configuration (which in turn would write to the quorum disk).

Important: Please note that this is the specific root cause for my specific situation. There may be other types of problems when the setup program tries to set the cluster access for the SID. One has to look at the inner exception message (which in my case was ‘specified path is invalid’) to be sure. The true root cause for other cases can typically be found by correlating to the cluster log file.

Solution

The solution in this case was simply to remove the SQL bits (from the Control Panel) and re-run the setup program without any other changes. It was the cleaner way, and the ‘moral of the story’ is: never tweak the cluster quorum in the middle of a SQL setup procedure!

SQL Server and Storage / SAN replication (mirroring) tips

Some customers have adopted block-level replication of storage disks (LUNs) as their DR solution for SQL Server databases. Remote mirroring of storage disks sounds great: it is transparent to the application and replicates every block as-is, so very little reconfiguration is required on the other side, and it scales fairly linearly regardless of the number of databases. Sounds better than baked bread, huh? However, there are some finer points we would like to share from our experience in the field.

Support Policy

Our KB articles sometimes read like legal documentation, and rightly so: in many cases they are the ‘rulings’ and ‘bye-laws’ by which we and our customers reach conclusions in tricky situations. So there are some KB articles which are a must-read in the case of storage.

KB 913945 states that Microsoft does not officially certify that specific third-party products will work with SQL Server. The most important part of that article is here:

Microsoft support documents may use the phrase "not supported by Microsoft" with regard to third-party products. This phrase means that the first line of support for such a product should be the third-party product vendor rather than Microsoft.

Next we come to KB 910716 which specifically talks about remote mirroring a.k.a. SAN replication. Some excerpts from this document:

  • This article does not endorse any specific vendor’s solution.
  • Work with your hardware vendor and your storage vendor to make sure that the remote mirroring solution that you want to use supports the I/O requirements of SQL Server.
  • If your solution meets the requirements that are discussed in this article, Microsoft Customer Support Services will provide support for issues that are not related to storage devices.
  • To obtain support from Microsoft Customer Support Services for issues that involve transactional throughput, you must be able to reproduce the issue in a mirror-free environment.
  • To obtain support for issues that involve a storage device, see your storage vendor.
  • To obtain support for issues that involve third-party replication technology, see your technology vendor.

Lastly, we have KB 943984, which talks about using storage mirroring to support a so-called geo-cluster. In such cases, remote disk mirroring is used to provide a consistent copy of the clustered storage disks on a remote site. This article has one very important line:

Please contact the data replication vendor for any issues related to accessing data on failover.

By now, you must be feeling very jittery about using such remote mirroring. But fear not: there are some basic precautions you can take to be sure.

SQL I/O Requirements

The fundamental issue is that the database and log files are now in two places rather than one, which means that the basic I/O requirements SQL Server assumes must be met in both places. These requirements are described in the ages-old (but still relevant) SQL I/O Basics Part 1 document (FYI, the Part 2 document is also a very good read) and are given below:

  • Stable Media: Any system running SQL Server must provide stable media capabilities for the database log and data files. If the system has something like a non-battery backed cache or perhaps caching enabled on disk, it is not safe for SQL Server installations. Ensure that your system has proper caching attributes to secure the data to stable media.
  • Write Ordering: SQL Server is designed to maintain the WAL protocol as described earlier in this paper. The underlying environment plays a pivotal role in the maintenance of the protocol. Write ordering must be maintained. With any system, but specifically for remote mirroring implementations, write ordering is critical to maintain a point-in-time view of the data. Many remote mirroring vendors implement generation logic to maintain physical write order even when the write operations are transmitted between remote caches with alternate protocols.
  • Torn I/O Prevention (Split I/Os): For SQL Server, the 8-KB blocks need to be handled as a single block of data. Systems that split I/Os must be configured to avoid splitting I/O requests into smaller blocks. Some dynamic disk and volume managers can be configured with block and cluster sizes that are smaller than 8 KB or not on 8-KB boundaries. These systems may split the SQL Server I/O request across physical system components. In doing so, they can introduce the possibility of torn pages and they can destroy true write ordering precedence.

 

How do I test for these?

The easiest way to test whether a disk subsystem (standalone or mirrored) meets the above requirements is to simulate SQL Server I/O patterns using the SQLIOSim utility. This utility replaces the SQLIOStress utility described in Bob Dorr’s whitepaper mentioned previously.

Write ordering implementations

There are some reference steps you can check for remote mirroring and SQL Server. I’ll keep adding to these as I come across more.

EMC SRDF

For example, this EMC paper talks about using Consistency Groups with their SRDF technology to remote-mirror a group of LUNs. EMC uses Consistency Groups to ensure write ordering, which, as you might have read above, is a very important criterion, especially when you have different sets of LUNs for log and data files.

Note: other manufacturers may have their own equivalent to ensure write ordering. I mention EMC’s solution purely as a sample. It is not to be taken in any way as a recommendation or guarantee of any sort.

Hyper-V Replicas

Hyper-V in Windows Server 2012 offers something called Replicas which are an easy way to provide for DR of a VM. SQL Server is supported in such scenarios – see KB 956893 for details. The scenario is supported as long as the EnableWriteOrderPreservationAcrossDisks flag is set.

TEMPDB and remote mirroring / storage replication

One thing to definitely keep in mind when using remote mirroring is where TEMPDB is placed. If you are replicating the LUNs where the TEMPDB data / log files are placed, that is asking for big trouble! A lot of bandwidth on your storage replication channel will be wasted moving these bits across, when in reality TEMPDB will be re-created on startup of the DR instance anyway.

One good thing here is that SQL Server 2012 allows TEMPDB data / log files to be placed on local disks even for clustered instances; this is a big help, as it avoids including TEMPDB in the remote mirroring setup.
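As a sketch, relocating TEMPDB to a local, non-replicated disk could look like this. The drive letter and paths are assumptions for illustration; the logical file names shown are the defaults, and the move takes effect at the next restart of the instance:

```sql
-- Move TEMPDB files to a local disk (assumed drive T:) so the
-- storage replication channel does not carry TEMPDB churn.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = 'T:\tempdb\tempdb.mdf');
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = 'T:\tempdb\templog.ldf');
-- Restart the SQL Server instance for the new locations to take effect.
```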

Wrap-up cheat sheet

Here is a summary of the key points to keep in mind when implementing storage replication / remote disk mirroring:

  1. Check with the manufacturer if they have addressed the SQL Server storage requirements, specifically the write-ordering implementation
  2. Test with SQLIOSim well before deployment of the actual SQL instances
  3. Always check that your HBA drivers and SAN firmware / software are up to date
  4. Ensure that TEMPDB is not being replicated using disk mirroring across the storage replication channel

That’s it for now. Hope you enjoyed this post! Please leave your feedback / questions in the comments, or use the ‘Email Author’ link on the top right section of the blog to reach me.

AlwaysOn Availability Group Forced Failover: Under the Hood

In this walkthrough I will take you through one aspect of the FCI + AG configuration: forced failover. We will simulate failure and force failover, and in the process we will have some … data loss. Does that not sound horrible? Relax! You should not worry too much, because in this case what we are going to discuss is actually a ‘manufactured’ data-loss situation. The true intent of this post is to show you how the Availability Group architecture deals with the ‘multiple master’ scenario and how it avoids the ‘split-brain’ condition.

Lab setup

For our testing the lab setup looks like this:

image

INST1 is a clustered instance of SQL with 2 nodes as possible owners, running SQL 2012 CU3. W2K8R2N3 is a server hosting a standalone instance of SQL 2012 CU3. All 3 servers are nodes in the same Windows Server Failover Cluster (WSFC), SQLAGCLUST.

Design decision #1: Asynchronous versus Synchronous secondary

We are allowed to configure either, but note the latency effects a synchronous secondary can have. If W2K8R2N3 were on a WAN link, I would prefer to use asynchronous in that case.

Design decision #2: Why not automatic failover for the secondary replica?

While configuring this setup, please note that while we are allowed to have a synchronous replica, it is not allowed to make it an automatic failover partner. Suppose you try to configure W2K8R2N3 as a synchronous replica with automatic failover:

USE [master]
GO
ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'INST1' WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_LIST = (N'W2K8R2N3')))
GO
ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'INST1' WITH (FAILOVER_MODE = AUTOMATIC)
GO


When you do this, you will receive the following error message:

Msg 35215, Level 16, State 17, Line 1
The Alter operation is not allowed on availability replica ‘INST1’, because automatic failover mode is an invalid configuration on a SQL Server Failover Cluster Instance.  Retry the operation by specifying manual failover mode.

This is expected and by design, the logic being that the FCI already has automatic failover configured between its nodes. The current implementation of AGs does not allow an automatic failover outside of the FCI nodes.

 

Test Case

We will simulate a ‘perfect storm’: failure of the link to the DR site, followed by failure of the primary site, then forced failover to the secondary site, and finally a failback to the primary site. In this rarest of rare scenarios we will show how the scope for data loss can arise.

Database and AG details

Here is the ‘full disclosure report’ for the test:

  • In our sample AG, there is just 1 database: ‘mydb’.
  • The primary replica, INST1.mydb, is synchronous and readable, with manual failover.
  • The database replica W2K8R2N3.mydb is a synchronous, readable secondary with automatic failover.
  • For our testing we will use a simple table ‘newtab’ which has one integer column, ‘i’.
  • Prior to these tests, the table ‘newtab’ was truncated and the ‘mydb’ transaction log has been backed up.
  • We will test for missing rows to determine the data loss.
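For reference, the preparation described above can be sketched as follows (the log backup destination here is purely illustrative):

```sql
USE mydb;
GO
-- Test table with a single integer column (a heap, no indexes)
CREATE TABLE dbo.newtab (i INT);
GO
-- Reset state before the tests
TRUNCATE TABLE dbo.newtab;
BACKUP LOG mydb TO DISK = 'NUL';  -- illustrative destination only
GO
```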

Cluster node weight assignment

As part of good practices, since W2K8R2N3 cannot participate in automatic failover, it is recommended to set its node weight to 0. By default it is 1:

PS C:\Windows\system32> (get-clusternode “w2k8r2n3”).NodeWeight
1

We will set it to 0:

PS C:\Windows\system32> (get-clusternode “w2k8r2n3”).NodeWeight = 0

Test 0: Ensure synchronization works normally

We insert our first row, and check the secondary replica to see if the log records made it across.

insert newtab values (1)

Almost immediately we can see the record is synchronized on the secondary replica. A casual inspection of the transaction log on both sides using fn_dblog() shows that the log records are in sync. Here are the last 3 entries displayed by fn_dblog:

0000001f:0000007b:0018    LOP_INSERT_ROWS    LCX_HEAP    0000:000002e9
0000001f:0000007b:0019    LOP_SET_FREE_SPACE    LCX_PFS    0000:00000000
0000001f:0000007b:001a    LOP_COMMIT_XACT    LCX_NULL    0000:000002e9

fn_dblog is undocumented and unsupported, but it can be useful for learning and testing purposes. You can read more about it in various community resources.
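A rough sketch of the kind of fn_dblog query used for the inspections above (again: this function is undocumented, so use it only for learning and testing):

```sql
-- Show the last few log records of the current database
SELECT TOP (3) [Current LSN], Operation, Context, [Transaction ID]
FROM sys.fn_dblog(NULL, NULL)   -- NULL, NULL = no LSN range filter
ORDER BY [Current LSN] DESC;
```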

 

Test 1: Link failure / secondary down

This is simulated by stopping the service on W2K8R2N3. In this case, the primary replica (which is the FCI) runs ‘exposed’. ‘Exposed’ means that the RPO of zero data loss (for the AG) is potentially not met, because there is no link to the secondary replica.

At this time (when the secondary replica is stopped or disconnected) we see the following entry in the errorlog of primary replica INST1:

2012-09-20 11:52:42.880 spid44s      AlwaysOn Availability Groups connection with secondary database terminated for primary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.

Now, let’s continue ‘modifying’ data on INST1.mydb:

insert newtab values (2)

Let’s review the last few LSNs:

0000001f:00000081:0002    LOP_INSERT_ROWS    LCX_HEAP    0000:000002ec
0000001f:00000081:0003    LOP_COMMIT_XACT    LCX_NULL    0000:000002ec

So to summarize, at this time, we have 1 row (with a value of 2) in mydb.newtab which has not been transferred to W2K8R2N3. This is why we say we are ‘exposed’, because we have the potential of losing this change were something bad to happen to INST1 now.

Test 2: INST1 also fails; W2K8R2N3 comes up later

Imagine that, due to a steady degradation (maybe a power issue or such), the INST1 SQL FCI instance also fails. We will simulate this by powering the N1 and N2 nodes off, so that INST1 is unavailable. When these nodes are offline, the cluster itself fails.

Also imagine after a while, W2K8R2N3 (the erstwhile secondary) comes up. In such a case, let’s see what happens. Firstly, the failover cluster manager reports that it is unable to connect to our WSFC. So let’s see what we get when we generate the cluster log file:

C:\Windows\system32>cluster log /g
Generating the cluster log(s) …
The cluster log has been successfully generated on node ‘w2k8r2n3’…
The cluster log could not be created on node ‘W2K8R2N1’…
System error 1722 has occurred (0x000006ba).
The cluster log could not be created on node ‘W2K8R2N2’…
System error 1722 has occurred (0x000006ba).

The multiple ‘errors’ appear because generation of the cluster log is attempted on all nodes; since nodes 1 and 2 are down, these errors are expected. Let’s see what is in the cluster log on W2K8R2N3:

00000578.00000764::2012/09/20-19:10:39.769 INFO  [QUORUM] Node 3: setting next best epoch to 2012/09/13-06`03`57.603_11
00000578.00000764::2012/09/20-19:10:39.769 DBG   [QC] Calculating quorum: Configured nodes count = 3, Node weights count = 2, Quorum witness configured = false, Intersection count = 0
00000578.00000764::2012/09/20-19:10:39.769 INFO  [QUORUM] Node 3: Coordinator: no quorum yet. do nothing

Aha! So due to lack of quorum, our N3 cluster service is unable to form a cluster. At this stage, I will go ahead and stop the cluster service on N3 (if it has not stopped already). We will start it and ‘force quorum’ later.

Test 3: Forcing quorum on W2K8R2N3

This part is quite simple in operation, but underneath there are many complex details going on. I will not attempt to reproduce those details here, but you can search for ‘Paxos Tag’ to understand the mechanism.

So to force the cluster service to start and force a quorum, here are the commands. Firstly, we will use PowerShell to set the node weight of this W2K8R2N3 to 1:

PS C:\Windows\system32> (get-clusternode “w2k8r2n3”).NodeWeight = 1

Next, we use the ForceQuorum switch to start the cluster service.

C:\Windows\system32>net start clussvc /fq

At this stage, if you look at SQL Server Management Studio, you will see this database in a ‘Not Synchronizing’ state. The AG resource will also show ‘Resolving’:

image

Next, we will force the failover of the AG to this node. To do this, we right click on the AG and select the Failover option, which in turn brings up this set of wizard screens. The first one warns us of the scope of data loss:

image

The second one ensures that we understand the implications of forcing failover:

image

Once the necessary steps are done, you will see the database status as Synchronized:

image

Forced failover: Under the hood

And most importantly, from an internals perspective, looking at the transaction log using fn_dblog() shows the same record as the original modification (the row with value 1); the LSN is 1f:7b:18:

0000001f:0000007b:0018    LOP_INSERT_ROWS    LCX_HEAP    0000:000002e9

At this time, we can also check the rows in the table on N3:

select * from mydb.dbo.newtab

The above shows just 1 row, with value 1 as expected. Let’s take a look at the error log on W2K8R2N3:

2012-09-20 12:32:38.110 spid53       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

Below you will see the state transition which is in response to our request to force failover:

2012-09-20 12:32:48.050 spid55       AlwaysOn: The local replica of availability group ‘FCI_AG’ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required
2012-09-20 12:32:48.060 spid55       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
2012-09-20 12:32:48.080 spid55       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

2012-09-20 12:32:48.200 Server       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

At this stage, the database can be brought online on W2K8R2N3:

2012-09-20 12:32:48.210 spid21s      Starting up database ‘mydb’.
2012-09-20 12:32:48.300 spid21s      3 transactions rolled forward in database ‘mydb’ (5:0). This is an informational message only. No user action is required.
2012-09-20 12:32:48.340 spid10s      0 transactions rolled back in database ‘mydb’ (5:0). This is an informational message only. No user action is required.  

Test 4: Bring INST1 back online

By doing this, we potentially risk what is known as a ‘split-brain’ situation. Normally the implementation of the AG resource DLL and the Lease Timeout avoids this issue IF all nodes in the cluster can talk to each other. In a real DR failback scenario, one would normally need to be very careful when bringing INST1 back online. The recommended way to prepare for this is to first start the cluster service on those nodes with the PreventQuorum switch.
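For reference, both quorum-related switches are passed to the Cluster service at startup. A sketch of the commands, run from an elevated command prompt on the relevant nodes:

```cmd
:: On a node of the surviving (DR) site: start the cluster with forced quorum.
net start clussvc /forcequorum

:: On the returning primary-site nodes: start the cluster service without
:: letting those nodes form quorum on their own, avoiding a split-brain.
net start clussvc /preventquorum
```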

When we do bring the nodes for INST1 back online, it puts the mydb database on INST1 into a Not Synchronizing state.

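The Not Synchronizing state can also be confirmed from the DMVs rather than the SSMS dashboard; a minimal sketch:

```sql
-- Synchronization state of each database replica in the AG.
SELECT ag.name AS ag_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_groups AS ag
  ON ag.group_id = drs.group_id;
```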

If we try to use the mydb database on INST1 it gives us the following error:

The target database, ‘mydb’, is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group.  For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online. (Microsoft SQL Server, Error: 976)

Let’s see if we can create a database snapshot to peer into what is inside INST1.mydb.

-- Create the database snapshot
CREATE DATABASE mydb_dr_snap ON
( NAME = mydb, FILENAME =
'f:\sql2012\data\mydb_dr_snap.ss' )
AS SNAPSHOT OF mydb;
GO
  

At this time, we can query the snapshot using the following command:

select * from mydb_dr_snap.dbo.newtab

This gives us 2 rows back. That means the transactions on the erstwhile primary instance INST1 are still ‘there’, but because of the forced quorum, the common LSN could not be negotiated. This is why INST1 is in the Not Synchronizing state; you can see the proof for this in the error log. To compare the state of the old primary (INST1) with the new primary, you can use the tablediff utility.
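A sketch of such a tablediff comparison, run against the snapshot on INST1 and the current primary on W2K8R2N3 (the output path here is hypothetical; tablediff.exe ships with SQL Server under the COM folder):

```cmd
:: Compare the snapshot of the old primary against the table on the new
:: primary, and emit a T-SQL script that would reconcile the differences.
tablediff.exe -sourceserver INST1 -sourcedatabase mydb_dr_snap -sourcetable newtab ^
  -destinationserver W2K8R2N3 -destinationdatabase mydb -destinationtable newtab ^
  -f c:\temp\newtab_diff.sql
```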

First, you will see the attempt made by the new secondary (INST1) to connect to the new primary (W2K8R2N3):

2012-09-20 12:40:34.990 spid23s      A connection for availability group ‘FCI_AG’ from availability replica ‘INST1’ with id  [280B1AED-49EA-4EF7-9BAF-C1EC13F50E51] to ‘W2K8R2N3’ with id [AC95B2E1-D5FA-48C2-8995-2F7BE4140ED3] has been successfully established.  This is an informational message
2012-09-20 12:40:37.630 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:40:37.630 spid29s      Error: 35285, Severity: 16, State: 1.

This next entry is SO important for us to note. It is actually telling us that the ‘recovery LSN’, which is the ‘common handshake’, is 31:129:1. In hexadecimal notation, this is 1f:81:1. From the first few steps in this walkthrough, you might remember that when we inserted row 2, the LSN was 1f:81:2.

2012-09-20 12:40:37.630 spid29s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:40:37.630 spid29s      AlwaysOn Availability Groups data movement for database ‘mydb’ has been suspended for the following reason: “failover from partner” (Source ID 1; Source string: ‘SUSPEND_FROM_PARTNER’). To resume data movement on the database, you will need to resume the d  

Here is the notification that the snapshot was created:

2012-09-20 12:44:36.460 spid57       Starting up database ‘mydb_dr_snap’.
2012-09-20 12:44:36.470 spid57       4 transactions rolled forward in database ‘mydb_dr_snap’ (6:0). This is an informational message only. No user action is required.
2012-09-20 12:44:36.520 spid57       0 transactions rolled back in database ‘mydb_dr_snap’ (6:0). This is an informational message only. No user action is required.
2012-09-20 12:44:36.520 spid57       Recovery is writing a checkpoint in database ‘mydb_dr_snap’ (6). This is an informational message only. No user action is required.

Resume synchronization for INST1.mydb

Later, we resumed the synchronization for this database on INST1. That effectively ‘sets the clock back’ on INST1.mydb by rolling back the transactions which are subsequent to the recovery LSN. We do this by right-clicking on the database under the Availability Group and selecting Resume Data Synchronization.
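The equivalent T-SQL for the Resume Data Synchronization menu item is:

```sql
-- Run on INST1 to resume data movement for mydb in its new secondary role.
ALTER DATABASE mydb SET HADR RESUME;
```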


At this time, select * from mydb.dbo.newtab gives 1 row back! That confirms the second of our ‘manufactured data loss’ scenarios.

As always let us look at what happened under the hood, by noting the error log entries on INST1. First, you will see the effect of the RESUME:

2012-09-20 12:48:26.410 spid55       ALTER DB param option: RESUME
2012-09-20 12:48:26.410 spid55       AlwaysOn Availability Groups data movement for database ‘mydb’ has been resumed. This is an informational message only. No user action is required.
2012-09-20 12:48:26.530 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:48:26.530 spid29s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:26.530 spid29s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:26.640 spid29s      Error: 35278, Severity: 17, State: 1.
2012-09-20 12:48:26.640 spid29s      Availability database ‘mydb’, which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.  

What follows is even more critical to note. It says we identified a higher LSN (31:130:1) but are ‘going back’ to the recovery LSN (31:129:1):

2012-09-20 12:48:26.790 spid40s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:26.790 spid40s      The recovery LSN (31:130:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:26.790 spid40s      Error: 35286, Severity: 16, State: 1.
2012-09-20 12:48:26.790 spid40s      Using the recovery LSN (31:129:1) stored in the metadata for the database with ID 5.

Finally, the database can be recovered and synchronization can resume:

2012-09-20 12:48:26.960 spid40s      Starting up database ‘mydb’.
2012-09-20 12:48:27.000 spid40s      Starting up database ‘mydb_dr_snap’.
2012-09-20 12:48:27.420 spid40s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:48:27.430 spid40s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:27.430 spid40s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:27.430 spid40s      3 transactions rolled forward in database ‘mydb’ (5:0). This is an informational message only. No user action is required.
2012-09-20 12:48:27.460 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.  

Here is the final state of the transaction log on INST1. You will notice the absence of the LSN associated with the row with value 2 (1f:81:2):

0000001f:0000007b:0019    LOP_SET_FREE_SPACE    LCX_PFS    0000:00000000
0000001f:0000007b:001a    LOP_COMMIT_XACT    LCX_NULL    0000:000002e9
0000001f:00000080:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000081:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000082:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000083:0001    LOP_BEGIN_XACT    LCX_NULL    0000:000002ec
0000001f:00000083:0002    LOP_HOBT_DELTA    LCX_NULL    0000:000002ec

So this proves that we did lose the changes made on INST1 because of the forced failover to W2K8R2N3. However, with the database snapshot, we can attempt to reconcile those ‘lost’ changes with the base table.

Wrapping up

We covered quite a lot of deep material here. Here is a quick summary:

  1. Automatic failover is not possible when the primary replica is an FCI
  2. Because of this fact, you will normally assign 0 votes to the secondary replica nodes
  3. If you lose the primary site, you must bring up the cluster service with force quorum on the secondary replica
  4. Next, you will force failover the AG to the secondary replica. Data loss might occur in this scenario.
  5. If you get back the primary site in the meantime, use the Prevent Quorum switch on the erstwhile primary nodes to prevent a ‘split-brain’ condition
  6. If you take the right steps, the earlier primary will be in the Not Synchronizing state
  7. The earlier primary replica will still contain the changes which were made to it but not transmitted to the secondary site; you can use a database snapshot on the earlier primary replica to identify these changes, typically with a utility like tablediff
  8. On resuming the earlier primary replica, it will don the role of a secondary. Again, in this case, data loss is expected if changes were made on the new primary which ‘override’ those on the earlier primary.

Wow! That was a long post but I hope it brings to light a lot of the detail around this very specific scenario of forced failover. I hope you enjoyed reading this. Please leave your comments, questions and feedback below. I appreciate those very much!