Discussion:
[Pacemaker] IPMI stonith resource gets stuck
Jérôme Charaoui
2015-01-28 18:53:17 UTC
Hi,

I'm testing a 2-node Corosync (1.4.6) and Pacemaker (1.1.10+git20130802)
cluster on Debian 8.0 and having some problems with the stonith resources.

I've set up two external/ipmi resources on each node and wanted to test
how they would react by physically unplugging the IPMI device network
interfaces.
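
For reference, the shape of the setup, with placeholder node names,
addresses, and credentials (the real config is in the pastebin link
below), is roughly:

    primitive st-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.0.2.11 userid=admin \
            passwd=secret interface=lanplus \
        op monitor interval=60s
    primitive st-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.0.2.12 userid=admin \
            passwd=secret interface=lanplus \
        op monitor interval=60s
    location l-st-node1 st-node1 -inf: node1
    location l-st-node2 st-node2 -inf: node2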

On the DC, no problem: the resource monitor fails, the stop op succeeds
and, due to location constraints, the resource enters the stopped state
and stays there, as expected. After replugging the network cable and
cleaning up the resource, it is restored to its normal state.
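
With the placeholder names from the sketch above, that cleanup step is
something like:

    crm resource cleanup st-node1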

On the slave node, it's a different scenario: after the monitor op fails,
the stop op also fails for an unknown reason. The cluster then retries the
stop operation unsuccessfully until I make the node enter and exit
standby mode. Replugging the network cable on the IPMI device has no effect.

At least, that's what I figure is happening from these logs:

DC: http://pastebin.com/raw.php?i=QpwG6nea
Slave: http://pastebin.com/raw.php?i=3nesX8yJ
Config: http://pastebin.com/raw.php?i=3FrJuwWz

Any help tracking down the issue would be much appreciated.

Thanks!
--
Jérôme Charaoui
IT Technician
Collège de Maisonneuve


Dejan Muhamedagic
2015-01-30 12:49:48 UTC
Hi,
Post by Jérôme Charaoui
Hi,
I'm testing a 2-node Corosync (1.4.6) and Pacemaker
(1.1.10+git20130802) cluster on Debian 8.0 and having some problems
with the stonith resources.
I've set up two external/ipmi resources on each node and wanted to
test how they would react by physically unplugging the IPMI device
network interfaces.
On the DC, no problem: the resource monitor fails, the stop op succeeds
and, due to location constraints, the resource enters the stopped state
and stays there, as expected. After replugging the network cable and
cleaning up the resource, it is restored to its normal state.
On the slave node, it's a different scenario: after the monitor op fails,
the stop op also fails for an unknown reason. The cluster then retries the
The stop operation for stonith devices does not involve the device at
all; it's just a stonithd operation, something like "disable resource".
From the "slave" logs, after an abort:

Jan 28 12:04:22 [31422] scatlas01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it

stonithd exits:

Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: st_child_term: Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: stonith_shutdown: Terminating with 2 clients

Apparently, a number of stop operations were started for the same
resource, all of which exited (or got cancelled) around 12:29:09. There
was probably some confusion in lrmd after stonithd left. In short, you
ran into a bug, but I suspect that bug has been fixed in the meantime.

Beekhof and David Vossel should know.

Thanks,

Dejan
Post by Jérôme Charaoui
stop operation unsuccessfully until I make the node enter and exit
standby mode. Replugging the network cable on the IPMI device has no
effect.
DC: http://pastebin.com/raw.php?i=QpwG6nea
Slave: http://pastebin.com/raw.php?i=3nesX8yJ
Config: http://pastebin.com/raw.php?i=3FrJuwWz
Any help tracking down the issue would be much appreciated.
Thanks!
Jérôme Charaoui
2015-01-30 16:03:18 UTC
Post by Dejan Muhamedagic
Hi,
Post by Jérôme Charaoui
Hi,
I'm testing a 2-node Corosync (1.4.6) and Pacemaker
(1.1.10+git20130802) cluster on Debian 8.0 and having some problems
with the stonith resources.
I've set up two external/ipmi resources on each node and wanted to
test how they would react by physically unplugging the IPMI device
network interfaces.
On the DC, no problem: the resource monitor fails, the stop op succeeds
and, due to location constraints, the resource enters the stopped state
and stays there, as expected. After replugging the network cable and
cleaning up the resource, it is restored to its normal state.
On the slave node, it's a different scenario: after the monitor op fails,
the stop op also fails for an unknown reason. The cluster then retries the
The stop operation for stonith devices does not involve the
device at all, it's just stonithd operation, something like
"disable resource". From the "slave" logs, after some abort,
Jan 28 12:04:22 [31422] scatlas01 stonith-ng: error: crm_abort: crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: st_child_term: Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng: info: stonith_shutdown: Terminating with 2 clients
Apparently, a number of stop operations were started for the same
resource, all of which exited (or got cancelled) around 12:29:09. There
was probably some confusion in lrmd after stonithd left.
Thank you for looking at this, much appreciated.

The timeout issue intrigued me because I had noticed ipmitool sometimes
taking over 10 seconds to attempt an action on a non-responding IPMI
device over the lanplus interface.
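
For example, against an unreachable BMC (hypothetical address and
credentials), something like this blocks for the full retry sequence
before giving up:

    time ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret \
        chassis power status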

So I had a look at the ipmi stonith plugin's code and at the ipmitool
manpage, and noticed this little gem in the latter:

-R <count> Set the number of retries for lan/lanplus interface
(default=4).

I then went ahead and added "-R 1" to the plugin's ipmitool_opts
variable, and my problem went away!
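
For anyone hitting the same thing, the change amounts to something like
this in the plugin script (a sketch; the plugin passes the contents of
its ipmitool_opts variable to its ipmitool invocations):

    # fail fast on a dead BMC: one attempt instead of the default four
    ipmitool_opts="-R 1 $ipmitool_opts"

The difference is easy to check by hand (same hypothetical BMC address
as above):

    time ipmitool -R 1 -I lanplus -H 192.0.2.10 -U admin -P secret \
        chassis power status
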
Post by Dejan Muhamedagic
In short, you ran into a bug, but I suspect that bug has been
fixed in the meantime.
This pull request seems like a match:
https://github.com/ClusterLabs/pacemaker/pull/334

If I'm reading the changelog correctly, this fix was released in
1.1.12, correct?
Post by Dejan Muhamedagic
Beekhof and David Vossel should know.
Thanks,
Dejan
Post by Jérôme Charaoui
stop operation unsuccessfully until I make the node enter and exit
standby mode. Replugging the network cable on the IPMI device has no
effect.
DC: http://pastebin.com/raw.php?i=QpwG6nea
Slave: http://pastebin.com/raw.php?i=3nesX8yJ
Config: http://pastebin.com/raw.php?i=3FrJuwWz
Any help tracking down the issue would be much appreciated.
Thanks!
Marek "marx" Grac
2015-02-02 15:16:23 UTC
Post by Jérôme Charaoui
Thank you for looking at this, much appreciated.
The timeout issue intrigued me because I had noticed ipmitool sometimes
taking over 10 seconds to attempt an action on a non-responding IPMI
device over the lanplus interface.
So I had a look at the ipmi stonith plugin's code and at the ipmitool
manpage, and noticed this little gem in the latter:
-R <count> Set the number of retries for lan/lanplus interface
(default=4).
I then went ahead and added "-R 1" to the plugin's ipmitool_opts
variable, and my problem went away!
If you use the fence agent fence_ipmilan, you can set this with the
retry_on parameter (or --retry-on X when invoking it from the command
line).
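
For example (hypothetical BMC address and credentials; option names as
printed by fence_ipmilan's own help output, so check your fence-agents
version):

    fence_ipmilan --ip=192.0.2.10 --username=admin --password=secret \
        --lanplus --retry-on=1 --action=status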

m,

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
