Chris Walker
2015-08-03 03:02:28 UTC
Hello,
We recently had an unfortunate sequence on our two-node cluster (nodes n02
and n03) that can be summarized as:
1. n03 became pathologically busy and was STONITHed by n02
2. The heavy load migrated to n02, which also became pathologically busy
3. n03 was rebooted
4. During the startup of HA on n03, n02 was initially seen by n03:
Jul 26 15:23:43 n03 crmd: [143569]: info: crm_update_peer_proc: n02.ais is
now online
5. But later during the startup sequence (after DC election and CIB sync)
we see n02 die (n02 is really wrapped around the axle, many stuck threads,
etc.):
Jul 26 15:27:44 n03 heartbeat: [143544]: WARN: node n02: is dead
...
Jul 26 15:27:45 n03 crmd: [143569]: info: ais_status_callback: status: n02
is now lost (was member)
Our deadtime is 240 seconds, so n02 must have become unresponsive almost
immediately after n03 reported it up at 15:23:43 (our heartbeat timing
settings are sketched just after this list).
6. The troubling aspect of this incident is that even though there are
multiple STONITH resources configured for n03, none of them was engaged and
n03 then mounted filesystems that were also active on n02.
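For context, the relevant timing settings in our ha.cf look roughly like
this (deadtime is our actual value; the other lines are only illustrative):

# /etc/ha.d/ha.cf (excerpt)
keepalive 2        # illustrative
warntime 60        # illustrative
deadtime 240       # a node is declared dead after 240s without heartbeats
initdead 480       # illustrative; usually set well above deadtime for startup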
I'm wondering whether the fact that no STONITH resources had been started by
this time explains why n02 was not STONITHed. Shortly after n02 was
declared dead, we see STONITH resources begin starting, e.g.,
Jul 26 15:27:47 n03 pengine: [152499]: notice: LogActions: Start
n03-3-ipmi-stonith (n03)
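For reference, our STONITH resources are defined along these lines (crm
shell syntax; the hostname/IPMI parameters below are placeholders, not our
real values):

primitive n03-3-ipmi-stonith stonith:external/ipmi \
    params hostname=<node-to-fence> ipaddr=<bmc-ip> userid=<ipmi-user> passwd=<ipmi-password> interface=lan \
    op monitor interval=60s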
Is it the case that, because there were no active STONITH resources when n02
was declared dead, no STONITH action was taken against it? Is there
a fix/workaround for this scenario (we're using heartbeat 3.0.5 and
pacemaker 1.1.6 (RHEL 6.2))?
Thanks very much!
Chris