My colleague reported that while testing forced failover of a clustered SQL database engine instance, the instance just ‘failed’ and refused to fail over to the other node in a 2-node cluster. The failure in this case was initiated by shutting down the local service for the clustered instance – which is tantamount to failing the clustered instance itself.
This behavior was slightly unexpected. After some research, we traced it to the ‘Maximum failures in the specified period’ setting at the cluster group (service application) level:
It turned out that the earlier testing had already caused several failures, so the above limit had already been reached. When the SQL service-level failure was then initiated, the cluster service did not fail the group over to the other node and left it in the failed state.
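To make the reasoning above concrete, here is a minimal Python sketch of a "maximum failures in a period" check. It is only a simplified model for illustration, not the cluster service's actual algorithm; the class name `GroupFailoverPolicy`, its parameters, and the sliding-window interpretation are all assumptions of mine.

```python
from collections import deque


class GroupFailoverPolicy:
    """Toy model of 'Maximum failures in the specified period'.

    Hypothetical illustration only; the real cluster service logic
    is described in KB 947712 and may differ in detail.
    """

    def __init__(self, max_failures, period_hours):
        self.max_failures = max_failures
        self.period_seconds = period_hours * 3600
        self.failure_times = deque()  # timestamps of recent failures

    def on_failure(self, now_seconds):
        """Record a failure and return the action this model would take."""
        # Forget failures that have aged out of the period.
        while (self.failure_times
               and now_seconds - self.failure_times[0] >= self.period_seconds):
            self.failure_times.popleft()
        self.failure_times.append(now_seconds)
        # Once the limit is exceeded, the group is left in the failed state.
        if len(self.failure_times) > self.max_failures:
            return "leave failed"
        return "fail over"


policy = GroupFailoverPolicy(max_failures=2, period_hours=6)
for t in (0, 600, 1200):
    print(t, policy.on_failure(t))
```

In this model the first two failures result in a failover, and the third one within the period leaves the group failed, which is the situation we ran into after repeated test failures.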
During this investigation I found some articles which are very useful on this topic:
- http://support.microsoft.com/kb/947712 – this is the primary article which describes the failure limit and the associated failover logic
- http://support.microsoft.com/kb/950804 – this article explains a UI issue in Win2008 clusters where the above maximum failure limit is displayed incorrectly (n-1)
- http://support.microsoft.com/kb/228923 – this advanced article explains the RetryPeriodOnFailure property, which changed from ‘disabled’ in Windows 2003 to 60 minutes in Windows 2008+
Moral of the story: when performing failover drills or testing, it may be appropriate to increase this value to a higher number, such as 5 or 10, for the duration of the testing. Afterwards, the value can be set back to 1 or 2.
FYI, when you look at the cluster log, this is the message which is recorded when the limit has been reached (and the cluster is therefore disallowing restarts):
0000088c.00001038::2012/03/07-06:09:09.834 WARN [RCM] Not failing over group <groupname>, failoverCount 8, failover threshold 4294967295, nodeAvailCount 1.
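If you need to hunt for this message in a large cluster log, a small script can help. The following Python sketch pulls the fields out of a line shaped like the one above; the regular expression is based solely on that one sample line, and the function name `parse_blocked_failover` is a hypothetical one, so treat it as a starting point rather than a complete log parser.

```python
import re

# Pattern derived only from the single sample line shown above (an assumption;
# other Windows versions may format this message differently).
BLOCKED_FAILOVER = re.compile(
    r"\[RCM\] Not failing over group (?P<group>.+?), "
    r"failoverCount (?P<count>\d+), "
    r"failover threshold (?P<threshold>\d+), "
    r"nodeAvailCount (?P<nodes>\d+)"
)


def parse_blocked_failover(line):
    """Return the fields of a 'Not failing over group' line, or None."""
    match = BLOCKED_FAILOVER.search(line)
    if match is None:
        return None
    return {
        "group": match.group("group"),
        "failover_count": int(match.group("count")),
        "failover_threshold": int(match.group("threshold")),
        "node_avail_count": int(match.group("nodes")),
    }


sample = (
    "0000088c.00001038::2012/03/07-06:09:09.834 WARN [RCM] Not failing over "
    "group <groupname>, failoverCount 8, failover threshold 4294967295, "
    "nodeAvailCount 1."
)
print(parse_blocked_failover(sample))
```

Running it over each line of the log and keeping the results that are not `None` gives a quick list of the groups for which failover was refused.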
Hope this helps!