Discussion:
[Pacemaker] large cluster - failure recovery
Radoslaw Garbacz
2015-11-04 18:41:53 UTC
Hi,

I have a 32-node cluster and, after some tuning, was able to get it started
and running, but it does not recover from a node disconnect/reconnect failure.
It regains quorum, but the CIB does not return to a synchronized state and
"cibadmin -Q" times out.

Is there anything I can do with corosync or pacemaker parameters to make it
recover from such a situation? (Everything works for smaller clusters.)
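
For reference, this is roughly the kind of query that hangs; the explicit
120-second timeout below is only illustrative (if I remember the cibadmin
option correctly):

  # one-shot cluster status, then a full CIB dump with an explicit timeout
  crm_mon -1
  cibadmin --query --timeout 120 > /tmp/cib.xml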

In my case it is OK for a node to disconnect (all the major resources are
shut down) and later reconnect to the cluster (the running monitoring agent
will clean up and restart the major resources if needed), so I do not have
STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'


Corosync configuration:
token: 10000
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure
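
For context, these settings live in the "totem" section of corosync.conf.
Roughly (corosync 2.x syntax; the token_coefficient line is only my guess at
something that might help scale the token timeout with the node count, and as
far as I understand it only takes effect when a nodelist is configured):

  totem {
      version: 2
      token: 10000
      consensus: 15000
      join: 1000
      send_join: 80
      merge: 1000
      downcheck: 2000
      max_network_delay: 150
      # guess: effective token = token + (nodes - 2) * token_coefficient
      token_coefficient: 650
  }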


Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error:
get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
apply_xml_diff: Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice:
cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.3: current "num_updates" is greater than required
[...]


P.S. Sorry if this should have been posted to the corosync list; since it is
the CIB synchronization that fails, this group seemed to me the right place.
--
Best Regards,

Radoslaw Garbacz
Trevor Hemsley
2015-11-04 18:50:01 UTC
Post by Radoslaw Garbacz
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'
yum update

Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
major improvements in speed with later versions of pacemaker.
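
Something like this should show what the stock repos would give you (package
names from memory; adjust for your repo setup):

  rpm -q pacemaker corosync
  yum list updates pacemaker corosync
  yum update pacemaker corosync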

Trevor

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Radoslaw Garbacz
2015-11-04 22:26:14 UTC
Thank you, will give it a try.
Post by Trevor Hemsley
Post by Radoslaw Garbacz
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'
yum update
Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
major improvements in speed with later versions of pacemaker.
Trevor
--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
Cédric Dufour - Idiap Research Institute
2015-11-19 08:28:40 UTC
Hello,

We've also set up a fairly large cluster - 24 nodes / 348 resources (pacemaker 1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely the minimum version you'll want, thanks to changes in how the CIB is handled.

If you're going to handle a large number (several hundred) of resources, you may also need to concern yourself with the CIB size.
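
A quick way to get a feel for the size (illustrative; "-o resources" limits
the query to the resources section of the CIB):

  cibadmin -Q | wc -c
  cibadmin -Q -o resources | wc -c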

You may want to have a look at pp. 17-18 of the document I wrote to describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf

Currently, I would consider that at 24 nodes / 348 resources we are close to the limit of what our cluster can handle, the bottleneck being CPU (core) power for CIB/CRM handling. Our "worst performing nodes" (out of the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
The main issue we currently face is when the DC is taken out and a new one must be elected: CPU usage goes to 100% for several tens of seconds (even minutes), during which the cluster is totally unresponsive. Fortunately, the resources themselves just sit tight and remain available (I can't say about those that would need to be migrated because they are colocated with the DC; we manually avoid that situation when performing maintenance that may affect the DC).
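
For what it's worth, one way to drain resources off the DC in a controlled
way before maintenance (just a sketch; the node name is an example):

  crm_mon -1 | grep "Current DC"
  crm_attribute --node node01 --name standby --update on
  # ... perform maintenance, then ...
  crm_attribute --node node01 --name standby --update off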

I'm looking forward to migrating to corosync 2+ (there are some backports available for Debian/Jessie) to see if this would allow us to push the limit further. Unfortunately, I can't say for sure, as I have only a limited understanding of how Pacemaker/Corosync work and where the CPU is bound to become a bottleneck.

Hope it can help,

Cédric
Post by Radoslaw Garbacz
Thank you, will give it a try.
Post by Radoslaw Garbacz
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'
yum update
Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
major improvements in speed with later versions of pacemaker.
Trevor
--
Best Regards,
Radoslaw Garbacz
XtremeData Incorporation