Discussion:
[Pacemaker] [ClusterLabs] Unexpected behaviour of PEngine Recheck Timer while in maintenance mode
Andrew Beekhof
2015-04-26 21:27:25 UTC
Hi all!
I encountered some strange behaviour of the PEngine Recheck Timer while in
maintenance mode, which ended in a reboot of both nodes.
Setup is a two-node cluster with one resource group consisting of several DRBD
devices and filesystems that are exported via NFS and iSCSI through a roaming IP
(see the attached crm_report output for details).
crm configure property maintenance-mode=true
Cluster ended up in unmanaged mode, as expected.
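For reference, a quick way to double-check that the property took effect and that the resources really show up as unmanaged (crmsh / pacemaker CLI of that era; exact output differs between versions):

crm configure show | grep maintenance-mode   # should show maintenance-mode=true
crm_mon -1                                   # one-shot status; resources should be flagged "(unmanaged)"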
109 Apr 19 23:01:50 astorage1 crmd: [4150]: info: do_te_invoke: Processing graph 473 (ref=pe_calc-dc-1429477310-1155) derived from /var/lib/pengine/pe-input-688.bz2
...
112 Apr 19 23:01:50 astorage1 lrmd: [4147]: info: cancel_op: operation monitor[43] on astorage2-fencing for client 4150, its parameters: hostname=[astorage2] userid=[...] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] passwd=[...] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[60000] cancelled
...
168 Apr 19 23:01:51 astorage1 crmd: [4150]: notice: run_graph: ==== Transition 473 (Complete=31, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-688.bz2): Complete
At that point I started to do some work on the resources.
What exactly were you doing at this point?
I ask because:

Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.

and

Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected

suggest that the lrmd processes on both machines crashed or failed.

I would definitely recommend an upgrade from 1.1.7
191 Apr 19 23:16:51 astorage1 crmd: [4150]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
192 Apr 19 23:16:51 astorage1 crmd: [4150]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
193 Apr 19 23:16:51 astorage1 crmd: [4150]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
The PEngine Recheck Timer triggered and tried to cancel the monitors again,
207 Apr 19 23:16:51 astorage1 crmd: [4150]: info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
208 Apr 19 23:16:51 astorage1 crmd: [4150]: info: cancel_op: No pending op found for astorage2-fencing:43
209 Apr 19 23:16:51 astorage1 lrmd: [4147]: info: on_msg_cancel_op: no operation with id 43
...
227 Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: do_lrm_invoke: Invalid resource definition
108 Apr 19 23:16:52 astorage2 lrmd: [4015]: info: on_msg_cancel_op: no operation with id 43
109 Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
110 Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected
111 Apr 19 23:16:52 astorage2 crmd: [4018]: ERROR: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
crmd was not too happy about that, cleaned up and exited with rc=2.
As I have 'crm on' in ha.cf, heartbeat rebooted the system (both nodes).
crm_attribute -n cluster-recheck-interval -v 0
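Setting cluster-recheck-interval to 0 disables the periodic recheck entirely; the 900000ms in the log above is just the 15-minute default. A rough sketch for querying and later removing such an override with the same tool:

crm_attribute -n cluster-recheck-interval -G -q   # query the current value, if any is set
crm_attribute -n cluster-recheck-interval -D      # delete the override, back to the default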
IMHO the PEngine Recheck Timer should not take any action while in
unmanaged mode, the same way regular action timers' timeouts are ignored.
Additionally, I don't quite understand why crmd got so confused by the
missing op id 43 that it decided to quit (lrmd appears not to have crashed).
Was my assumption about that timer wrong or is this indeed unexpected
behaviour?
Please CC replies to my email-address, I am not subscribed to the list.
Greets,
Rolf Weber
<reboot-while-in-maintenance-mode.tbz2>
Andrew Beekhof
2015-05-18 02:34:17 UTC
Hi!
Post by Andrew Beekhof
What exactly were you doing at this point?
resizing a filesystem.
fs was unexported and unmounted.
as I understand maintenance mode this should have worked (and there was no
problem until the recheck timer triggered).
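roughly the kind of offline sequence meant here (device, mountpoint and size are made up for illustration, the real filesystems sit on drbd):

exportfs -u '*:/srv/export1'      # stop exporting the filesystem via nfs
umount /srv/export1               # unmount it
e2fsck -f /dev/drbd1              # resize2fs insists on a check before an offline resize
resize2fs /dev/drbd1 500G         # resize the filesystem
mount /dev/drbd1 /srv/export1     # mount it again
exportfs -a                       # and re-export it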
Post by Andrew Beekhof
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Apr 19 23:16:51 astorage1 crmd: [4150]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
and
Apr 19 23:16:52 astorage2 crmd: [4018]: CRIT: lrm_connection_destroy: LRM Connection failed
Apr 19 23:16:52 astorage2 crmd: [4018]: info: lrm_connection_destroy: LRM Connection disconnected
suggest that the lrmd processes on both machines crashed or failed.
I know. I didn't find any core files or other indications that it indeed did
crash.
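For reference, the sort of checks I mean by "other indications" (the cores path below is heartbeat's usual default and only an assumption for this setup):

pgrep -fl lrmd                                  # is lrmd still running at all?
ls -l /var/lib/heartbeat/cores/*/ 2>/dev/null   # heartbeat's usual per-user core directories
cat /proc/sys/kernel/core_pattern               # where the kernel would have written a core instead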
if you were resizing the filesystem that pacemaker was running on, then i’m not surprised (either that the processes aborted or that there was no core file).
errors appeared after the failure message for op 43 (which is ipmi fencing).
Post by Andrew Beekhof
I would definitely recommend an upgrade from 1.1.7
will be done, now that jessie is stable. don't quite know when yet.
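for planning it, something like this shows what is installed right now on debian (package names are an assumption, the crm shell may be packaged differently on wheezy):

dpkg -l pacemaker heartbeat | grep '^ii'   # installed package versions
crm_mon --version                          # version the running cluster reports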
cheers,
rolf weber