Discussion:
[Pacemaker] CentOS 7.0 -> 7.1 update fails with "Application of an update diff failed (rc=-206)"
Patrick Zwahlen
2015-04-24 15:09:53 UTC
Permalink
Hi,

I'm running a CentOS 7.0 2-node cluster providing iSCSI/SAN features. In order to upgrade to CentOS 7.1, I'm testing the whole process in VMs, and it fails. I've now stripped my config down to a pair of DRBD master/slave resources with IPaddr2 (cluster.cfg attached).

From a running cluster, here are the steps (I'm upgrading node san2; a rough command sketch follows the list):
- put node san2 in standby
- stop/disable pacemaker on san2
- stop/disable corosync on san2
- update san2 to CentOS 7.1 (pacemaker 1.1.10-32.el7_0.1 -> 1.1.12-22.el7_1.1)
- reboot san2
- enable/start corosync on san2 (it looks good, rings are fine in "corosync-cfgtool -s")
- enable/start pacemaker on san2
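
As a rough command sketch (assuming pcs and systemd, and that the node is known to the cluster as san2.local - adjust names to your setup), the steps above look like:

  pcs cluster standby san2.local
  systemctl stop pacemaker corosync
  systemctl disable pacemaker corosync
  yum update                     # CentOS 7.0 -> 7.1
  reboot
  # after the reboot:
  systemctl enable corosync
  systemctl start corosync
  corosync-cfgtool -s            # check the rings
  systemctl enable pacemaker
  systemctl start pacemaker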

I can see the following in the logs:

/var/log/messages (attached, line #57)
=================
Apr 24 16:18:26 san2 crmd[11759]: notice: erase_xpath_callback: Deletion of "//node_state[@uname='san2.local']/transient_attributes": Application of an update diff failed (rc=-206)

/var/log/pacemaker.log (attached, starting from line #292)
======================
Apr 24 16:18:26 [11754] san2.local cib: info: xml_apply_patchset: v1 digest mis-match: expected 428c0eb4cd80a4c1ee19b627f6876abd, calculated ffb5456991bd4ed9e5a7774f49e8259d
Apr 24 16:18:26 [11754] san2.local cib: info: __xml_diff_object: Moved ***@id (0 -> 6)
Apr 24 16:18:26 [11754] san2.local cib: info: __xml_diff_object: Moved ***@uname (1 -> 0)
Apr 24 16:18:26 [11759] san2.local crmd: notice: erase_xpath_callback: Deletion of "//node_state[@uname='san2.local']/transient_attributes": Application of an update diff failed (rc=-206)
Apr 24 16:18:26 [11754] san2.local cib: info: send_sync_request: Requesting re-sync from peer
Apr 24 16:18:26 [11754] san2.local cib: notice: cib_server_process_diff: Not applying diff 0.0.14 -> 0.46.15 (sync in progress)
Apr 24 16:18:26 [11754] san2.local cib: notice: cib_server_process_diff: Not applying diff 0.0.15 -> 0.46.16 (sync in progress)
Apr 24 16:18:26 [11754] san2.local cib: notice: cib_server_process_diff: Not applying diff 0.0.16 -> 0.46.17 (sync in progress)

Google doesn't help me in figuring out what might be wrong.

Config was generated with crmsh-2.1-1.4, in case that has an impact.

Any hint would be highly appreciated.

Cheers, Patrick

NOTE: I have kernel modules (scst/zfs) that require reboots when upgrading, so I cannot upgrade both nodes while in unmanaged state. I really need to upgrade one node after the other.


emmanuel segura
2015-04-24 15:40:38 UTC
Permalink
Are you sure your cluster hostnames are ok?

get_node_name: Could not obtain a node name for corosync nodeid 2

Patrick Zwahlen
2015-04-25 10:19:41 UTC
Permalink
Post by emmanuel segura
Are you sure your cluster hostnames are ok?
get_node_name: Could not obtain a node name for corosync nodeid 2
(I confused the pacemaker and clusterlabs mailing lists. Sorry for the double post.)

The cluster works perfectly on CentOS 7.0, even though I see these logs there as well. It might be because corosync.conf (attached, generated by pcs) contains only IPs.

Regards, Patrick
emmanuel segura
2015-04-26 00:03:41 UTC
Permalink
Map your cluster IPs to hostnames using /etc/hosts and try to follow an example like this:
http://clusterlabs.org/doc/fr/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_sample_corosync_configuration.html
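
For example (a sketch only - these names and addresses are placeholders, not taken from your config), on both nodes:

  # /etc/hosts - map the ring addresses to the cluster node names
  192.168.10.1   san1.local san1
  192.168.10.2   san2.local san2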

Patrick Zwahlen
2015-04-26 09:27:11 UTC
Permalink
Post by emmanuel segura
Map your cluster IPs to hostnames using /etc/hosts and try to follow an example like this:
http://clusterlabs.org/doc/fr/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_sample_corosync_configuration.html
I've added "name: fqdn" in my corosync.conf and I don't have those hostname
logs anymore.
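
For reference, the relevant nodelist entries now look roughly like this (the addresses below are placeholders, not my real ones):

  nodelist {
      node {
          ring0_addr: 192.168.10.1
          name: san1.local
          nodeid: 1
      }
      node {
          ring0_addr: 192.168.10.2
          name: san2.local
          nodeid: 2
      }
  }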

This being said, I think it's unrelated to my original problem (the 1.1.10 ->
1.1.12 upgrade). I have tried my upgrade once more and it keeps showing
that "diff failed" log.

But thanks for helping. Patrick
Andrew Beekhof
2015-04-26 20:59:18 UTC
Permalink
Post by Patrick Zwahlen
I've added "name: fqdn" in my corosync.conf and I don't have those hostname
logs anymore.
Excellent.
Post by Patrick Zwahlen
This being said, I think it's unrelated to my original problem (the 1.1.10 ->
1.1.12 upgrade). I have tried my upgrade once more and it keeps showing
that "diff failed" log.
Apart from those scary logs, does anything actually break?
What you're seeing is probably just ignorable noise from the older version - I would expect the underlying CIB to resolve things correctly.
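
If you want to double-check that the nodes did converge, something like this on each node (just a sketch) should show matching version fields on the root element:

  cibadmin --query | head -n 1   # compare admin_epoch/epoch/num_updates between nodes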
Patrick Zwahlen
2015-04-27 08:35:10 UTC
Permalink
Post by Andrew Beekhof
Apart from those scary logs, does anything actually break?
What you're seeing is probably just ignorable noise from the older version
- I would expect the underlying CIB to resolve things correctly.
Thanks Andrew for the response.

After starting the new 1.1.12 and trying to migrate my resources, I ended up
with groups stuck "halfway" with some resources stopped on the old node and
no migration (apparently without errors from my RA).

This weekend I tried another route, as I finally found how to upgrade *just*
corosync/pacemaker (without the whole OS); the commands are sketched after the list.

- enter maintenance
- "pcs cluser stop --all"
- "yum update corosync pacemaker libqb resource-agents pcs"
- "pcs cluser start --all"
- exit maintenance
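
The commands, roughly (assuming "enter/exit maintenance" here means toggling the cluster-wide maintenance-mode property; adapt if you manage it differently):

  pcs property set maintenance-mode=true
  pcs cluster stop --all
  yum update corosync pacemaker libqb resource-agents pcs
  pcs cluster start --all
  pcs property set maintenance-mode=false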

I initially just did a "yum update corosync pacemaker" and then pacemaker
didn't start. I was missing libqb, but I also think there's a dependency
missing somewhere in the RPMs, as libqb should get updated as well.

Anyway, I have been able to migrate from CentOS 7.0 to 7.1 in my lab without
losing anything.

Cheers, Patrick
Andrew Beekhof
2015-04-28 03:43:38 UTC
Permalink
Post by Patrick Zwahlen
Thanks Andrew for the response.
After starting the new 1.1.12 and trying to migrate my resources, I ended up
with groups stuck "halfway" with some resources stopped on the old node and
no migration (apparently without errors from my RA).
If you’d like to send a crm_report I’d be interested to have a look.
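Something along these lines should capture the relevant window (the timestamps and file name are just an example):

  crm_report --from "2015-04-24 16:00:00" --to "2015-04-24 17:00:00" /tmp/san2-upgrade-diff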
Post by Patrick Zwahlen
This weekend I tried another route, as I finally found how to upgrade *just*
corosync/pacemaker (without the whole OS).
- enter maintenance
- "pcs cluser stop --all"
- "yum update corosync pacemaker libqb resource-agents pcs"
- "pcs cluser start --all"
- exit maintenance
I initially just did a "yum update corosync pacemaker" and then pacemaker
didn't start. I was missing libqb, but I also think there's a dependency
missing somewhere in the RPMs, as libqb should get updated as well.
Nod. We’re adding that in.
Both sides keep maintaining backwards compatibility - pacemaker just wants to use the version it was built against but rpm isn’t smart enough to do that automagically :-(
Post by Patrick Zwahlen
Anyway, I have been able to migrate from CentOS 7.0 to 7.1 in my lab without
losing anything.
Excellent. Sounds like it might have been something to do with the resources themselves then :-/
