Failover cluster (group) maximum failures limit

A colleague reported that while testing a forced failover of a SQL Server database engine instance, the instance simply went into a ‘failed’ state and refused to fail over to the other node in a 2-node cluster. The failure in this case was initiated by stopping the service for the clustered instance on the local node, which is tantamount to failing the clustered instance itself.

This behavior was slightly unexpected. After some research, we traced it to the ‘Maximum failures in the specified period’ setting at the cluster group (service application) level:

[Image: the ‘Maximum failures in the specified period’ setting in the cluster group properties]

It turns out that the earlier testing activity had already caused several failures, so the above limit had already been reached. When the SQL service failure was subsequently initiated, the cluster service did not fail the group over to the other node and left it in the failed state.
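To make the behavior easier to reason about, here is a minimal Python sketch of how such a limit could work. It is a simplified model written for illustration, not the actual algorithm of the cluster service. The function name `may_fail_over` and its parameters are hypothetical, and it assumes the limit is applied to failures counted inside a sliding window of `period_hours`.

```python
from datetime import datetime, timedelta


def may_fail_over(failure_times, now, max_failures, period_hours):
    """Simplified model (an assumption, not the real cluster logic).

    failure_times: datetimes of all failures so far, including the current one.
    Returns True if the number of failures inside the period is still within
    the limit, i.e. the group would be restarted or failed over; False if the
    group would be left in the failed state.
    """
    window_start = now - timedelta(hours=period_hours)
    recent = [t for t in failure_times if t > window_start]
    return len(recent) <= max_failures


now = datetime(2012, 3, 7, 6, 0)
# Two failures within the last 6 hours: an earlier test failure and the current one.
failures = [now - timedelta(hours=1), now]

print(may_fail_over(failures, now, max_failures=1, period_hours=6))  # limit reached
print(may_fail_over(failures, now, max_failures=5, period_hours=6))  # limit raised for testing
```

In this model, the earlier test failures use up the allowance, so the next failure within the same period leaves the group failed, which matches what we saw.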

During this investigation I found some articles which are very useful on this topic:

Moral of the story: when performing failover drills or testing, it may be appropriate to raise this value to a higher number, such as 5 or 10, for the duration of the testing. Afterwards, the value can be reset to 1 or 2.

FYI, this is the message recorded in the cluster log when the limit has been reached (and restarts are therefore disallowed):

0000088c.00001038::2012/03/07-06:09:09.834 WARN  [RCM] Not failing over group <groupname>, failoverCount 8, failover threshold 4294967295, nodeAvailCount 1.
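If you need to find these entries in a large cluster log, a small script can help. The following Python sketch pulls the group name and counters out of a line in the format shown above. The pattern is based only on that single sample line, so treat it as an assumption; the function name `parse_not_failing_over` is hypothetical, and other versions may word the message differently.

```python
import re

# Pattern derived from the one sample line in this post (an assumption).
NOT_FAILING_OVER = re.compile(
    r"Not failing over group (?P<group>.+?), "
    r"failoverCount (?P<count>\d+), "
    r"failover threshold (?P<threshold>\d+), "
    r"nodeAvailCount (?P<nodes>\d+)"
)


def parse_not_failing_over(line):
    """Return the fields of a 'Not failing over group' entry, or None."""
    match = NOT_FAILING_OVER.search(line)
    if match is None:
        return None
    return {
        "group": match.group("group"),
        "failover_count": int(match.group("count")),
        "failover_threshold": int(match.group("threshold")),
        "node_avail_count": int(match.group("nodes")),
    }


sample = (
    "0000088c.00001038::2012/03/07-06:09:09.834 WARN  [RCM] Not failing over "
    "group <groupname>, failoverCount 8, failover threshold 4294967295, "
    "nodeAvailCount 1."
)
print(parse_not_failing_over(sample))
```

Running this over each line of the log and keeping the non-None results gives a quick list of the groups that were left in the failed state.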

Hope this helps!
