Discussion:
[Pacemaker] Pacemaker won't start after node was fenced
Jake Smith
2015-01-27 06:23:55 UTC
Had a failover of my active/passive cluster and now the passive node will
not rejoin the cluster.



2 nodes running Ubuntu 12.04

corosync 1.4.2-2, openais 1.1.4-4, pacemaker 1.1.6-2ubuntu3



Corosync ring membership is fine on both rings.



Tried stopping corosync/pacemaker, clearing /var/lib/heartbeat/crm/, and then
restarting on the passive node, without success.

Tried rebooting passive node (again - it was successfully fenced)

Tried updating pacemaker to latest in distro (1.1.6-2ubuntu3.3) then went
back on passive node

Tried putting the active node in maintenance mode and stopping pacemaker and
corosync on both nodes, then restarting on both nodes. Corosync came
back fine as before, but now I have the same problem on both nodes:
pacemaker does not start successfully. Both now show exactly the same error:
attrd: [24883]: ERROR: main: HA Signon failed.



Log:

Jan 27 01:09:59 Condor crmd: [24885]: info: crmd_init: Starting crmd
Jan 27 01:09:59 Condor cib: [24881]: info: validate_with_relaxng: Creating RNG parser context
Jan 27 01:09:59 Condor lrmd: [24882]: info: enabling coredumps
Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process attrd exited (pid=24883, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_classic: AIS connection established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: get_ais_nodeid: Server details: id=167837962 uname=Condor cname=pcmk
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node Condor now has id: 167837962
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node 167837962 is now known as Condor
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: main: Starting stonith-ng mainloop
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
Jan 27 01:09:59 Condor cib: [24881]: info: startCib: CIB Initialization completed successfully
Jan 27 01:09:59 Condor cib: [24881]: info: get_cluster_type: Cluster type is: 'openais'
Jan 27 01:09:59 Condor cib: [24881]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: unknown (100)
Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process cib exited (pid=24881, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child process cib no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110212 (was 00000000000000000000000000110312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110212 (new)
Jan 27 01:10:00 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:00 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Jan 27 01:10:00 Condor crmd: [24885]: info: crmd_init: Starting crmd's mainloop
Jan 27 01:10:01 Condor CRON[24888]: (root) CMD (/etc/init.d/watchdog -e >/dev/null 2>&1)
Jan 27 01:10:02 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:03 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:03 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry
Jan 27 01:10:05 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:06 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:06 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 3 times... pause and retry
Jan 27 01:10:08 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:09 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:09 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 4 times... pause and retry
Jan 27 01:10:11 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
Jan 27 01:10:12 Condor crmd: [24885]: info: do_cib_control: Could not connect to the CIB service: connection failed
Jan 27 01:10:12 Condor crmd: [24885]: WARN: do_cib_control: Couldn't complete CIB registration 5 times... pause and retry
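
The two update_node_processes entries in the log record how the node's process list changed as each child exited. As a minimal sketch, assuming the process list is a hexadecimal bitmask with one bit per child daemon (the log does not state the base), comparing the "was" and "now" values isolates the bit each exit cleared. The function name cleared_bits is illustrative and not part of Pacemaker; the pairing of values with attrd and cib follows the order of the pcmk_child_exit messages in the log.

```python
# Process-list values copied from the update_node_processes log entries.
# Assumption: the strings are hexadecimal bitmasks, one bit per child daemon.
transitions = [
    ("attrd", "00000000000000000000000000111312", "00000000000000000000000000110312"),
    ("cib", "00000000000000000000000000110312", "00000000000000000000000000110212"),
]


def cleared_bits(before: str, after: str) -> int:
    """Return the bits that are set in `before` but no longer set in `after`."""
    return int(before, 16) & ~int(after, 16)


for child, before, after in transitions:
    print(f"{child}: cleared 0x{cleared_bits(before, after):06x}")
```

Under that assumption, each exit clears exactly one bit, which is consistent with only attrd and cib dropping out while the remaining children stay registered.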



Jacob A. Smith
IT Manager
Argotec, LLC
Andrew Beekhof
2015-02-24 00:42:55 UTC
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
This seems to be the root of the errors.
Pacemaker looks a little old, could you consider updating?
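
One way to check that the corosync message precedes each child failure is to scan the log for every "Invalid IPC credentials" entry and report which daemon logs the next ERROR or CRIT message. The following is a minimal Python sketch under stated assumptions: the regular expression and the function name failed_after_ipc_rejection are illustrative, not part of Pacemaker or corosync, and the embedded log is an abridged excerpt of the one posted in this thread.

```python
import re

# Abridged excerpt of the log posted in this thread (unwrapped, order preserved).
LOG = """\
Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: unknown (100)
Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the cluster... terminating
"""

# Matches both the "crmd: [24885]:" and the "corosync[24778]:" prefix styles.
ENTRY = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s\S+\s(?P<daemon>[\w-]+):?\s?\[\d+\]:\s+(?P<msg>.*)$"
)


def failed_after_ipc_rejection(log: str) -> list[str]:
    """For each corosync 'Invalid IPC credentials' entry, name the daemon
    that logs the next ERROR or CRIT message."""
    entries = [m.groupdict() for m in map(ENTRY.match, log.splitlines()) if m]
    failed = []
    for i, entry in enumerate(entries):
        if entry["daemon"] == "corosync" and "Invalid IPC credentials" in entry["msg"]:
            # Scan forward for the first ERROR/CRIT entry after the rejection.
            for later in entries[i + 1:]:
                if later["msg"].startswith(("ERROR:", "CRIT:")):
                    failed.append(later["daemon"])
                    break
    return failed


print(failed_after_ipc_rejection(LOG))
```

On this excerpt the sketch pairs the first rejection with attrd and the second with cib, which matches the order of the pcmk_child_exit messages. It only shows ordering in the log, not why corosync rejected the credentials.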
_______________________________________________
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Jake Smith
2015-03-03 22:27:48 UTC
That will be tough but I'll see if I can give it a try sometime soon.

Have had no luck tracking down that error so running out of other options :/

Jake

-----Original Message-----
From: Andrew Beekhof [mailto:***@beekhof.net]
Sent: Monday, February 23, 2015 7:43 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Pacemaker won't start after node was fenced
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC credentials.
This seems to be the root of the errors.
Pacemaker looks a little old, could you consider updating?