Discussion:
[Pacemaker] Node lost early in HA startup --> no STONITH
Chris Walker
2015-08-03 03:02:28 UTC
Hello,

We recently had an unfortunate sequence on our two-node cluster (nodes n02
and n03) that can be summarized as:
1. n03 became pathologically busy and was STONITHed by n02
2. The heavy load migrated to n02, which also became pathologically busy
3. n03 was rebooted
4. During the startup of HA on n03, n02 was initially seen by n03:

Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is now online

5. But later during the startup sequence (after DC election and CIB sync) we see n02 die (n02 is really wrapped around the axle: many stuck threads, etc.):

Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
...
Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02 is now lost (was member)

Our deadtime is 240 seconds, so n02 must have become unresponsive almost immediately after n03 reported it up at 15:23:43.
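
For reference, the deadtime above comes from heartbeat's ha.cf; a minimal sketch of the relevant timing block (only deadtime reflects our real value, the other numbers are illustrative):

# /etc/ha.d/ha.cf (sketch)
keepalive 2      # seconds between heartbeats (illustrative)
warntime 60      # warn about late heartbeats (illustrative)
deadtime 240     # declare a peer dead after 240s of silence
initdead 480     # extra allowance while a node boots (illustrative)
node n02 n03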

6. The troubling aspect of this incident is that even though there are multiple STONITH resources configured for n03, none of them was engaged, and n03 then mounted filesystems that were also active on n02.

I'm wondering whether the fact that no STONITH resources had been started by this time explains why n02 was not STONITHed. Shortly after n02 is declared dead, we see STONITH resources begin starting, e.g.:

Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start n03-3-ipmi-stonith (n03)
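
For context, these are IPMI-based stonith resources; a rough crmsh sketch of how such a device is defined (the target, IP, and credentials below are placeholders, not our actual config):

primitive n03-3-ipmi-stonith stonith:external/ipmi \
    params hostname=n03 ipaddr=192.0.2.3 userid=admin passwd=secret \
    op monitor interval=60s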

Is it the case that, because there were no active STONITH resources when n02 was declared dead, no STONITH action was taken against it? Is there a fix/workaround for this scenario (we're using heartbeat 3.0.5 and pacemaker 1.1.6 (RHEL 6.2))?
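
In case it's relevant: the cluster properties I'd expect to bear on this are stonith-enabled and startup-fencing. A quick sketch of querying them with stock crm_attribute:

# query the properties that govern fencing behaviour (sketch)
crm_attribute --type crm_config --name stonith-enabled --query
crm_attribute --type crm_config --name startup-fencing --query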

Thanks very much!
Chris
Thomas Meagher
2015-08-03 15:14:47 UTC
Sounds similar to the issue I described here last week. We also had two nodes, and we lost the network connection between them while one was starting up after a fence. Although we had stonith resources configured, those resources were never called, and the cluster was considered active on both nodes throughout the network split. We were able to reproduce this issue in our lab: there appears to be a window during corosync startup where, if a node joins the cluster and then leaves before Pacemaker's stonith resources have started, it will not be fenced. This issue may be isolated to two-node systems, since normally a single node that is separated from the cluster will have lost quorum, which is not the case with two_node.
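
Roughly what our lab reproduction looked like, as a sketch (the port and exact commands here are assumptions, with 5405 being corosync's default mcastport; the timing was done by hand):

# on the node being restarted:
service corosync start
# once the peer appears in the membership, and before Pacemaker's
# stonith resources have started, cut off cluster traffic:
iptables -A INPUT -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP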

Are you running with "two_node" in corosync.conf?
Are you running with "wait_for_all"? (It's on by default with "two_node")
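
For reference, the stanza I'm asking about looks like this in corosync.conf (wait_for_all written out even though two_node: 1 implies it):

quorum {
    provider: corosync_votequorum
    two_node: 1        # keep quorum with only one of two nodes up
    wait_for_all: 1    # implied by two_node; both nodes must be seen at startup
}

corosync-quorumtool -s will show which flags are actually active.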

emmanuel segura
2015-08-03 15:27:33 UTC
From what I can see, he is using heartbeat rather than corosync.
Chris Walker
2015-08-03 15:38:04 UTC
I saw Thomas's post from last week, and it sounded very similar to what we saw, but I wasn't sure whether the heartbeat/corosync difference made this a different issue. I'm trying to reproduce the problem and assemble the log/config info.

Thanks again,
Chris