Jake Smith
2015-01-27 06:23:55 UTC
Had a failover of my active/passive cluster and now the passive node will
not rejoin the cluster.
2 nodes running Ubuntu 12.04
corosync 1.4.2-2, openais 1.1.4-4, pacemaker 1.1.6-2ubuntu3
Corosync ring membership is fine on both rings.
Tried stopping corosync/pacemaker, clearing /var/lib/heartbeat/crm/ and then
restarting on the passive node, without success.
Tried rebooting the passive node (again - it was successfully fenced)
Tried updating pacemaker to latest in distro (1.1.6-2ubuntu3.3) then went
back on passive node
Tried putting the active node in maintenance mode and stopping pacemaker and
corosync on both nodes, then restarting on both nodes. Corosync came
back fine as before, but now I have the same problem on both nodes:
pacemaker does not start successfully. Both now show exactly the same error:
attrd: [24883]: ERROR: main: HA Signon failed
Log:
Jan 27 01:09:59 Condor crmd: [24885]: info: crmd_init: Starting crmd
Jan 27 01:09:59 Condor cib: [24881]: info: validate_with_relaxng: Creating
RNG parser context
Jan 27 01:09:59 Condor lrmd: [24882]: info: enabling coredumps
Jan 27 01:09:59 Condor lrmd: [24882]: info: Started.
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC
credentials.
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: HA Signon failed
Jan 27 01:09:59 Condor attrd: [24883]: ERROR: main: Aborting startup
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child
process attrd exited (pid=24883, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child
process attrd no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes:
Node Condor now has process list: 00000000000000000000000000110312 (was
00000000000000000000000000111312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info:
init_ais_connection_classic: AIS connection established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: get_ais_nodeid: Server
details: id=167837962 uname=Condor cname=pcmk
Jan 27 01:09:59 Condor stonith-ng: [24880]: info:
init_ais_connection_once: Connection to 'classic openais (with plugin)':
established
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node
Condor now has id: 167837962
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_new_peer: Node
167837962 is now known as Condor
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: main: Starting
stonith-ng mainloop
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node
Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000110312 (new)
Jan 27 01:09:59 Condor cib: [24881]: info: startCib: CIB Initialization
completed successfully
Jan 27 01:09:59 Condor cib: [24881]: info: get_cluster_type: Cluster type
is: 'openais'
Jan 27 01:09:59 Condor cib: [24881]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Jan 27 01:09:59 Condor corosync[24778]: [IPC ] Invalid IPC
credentials.
Jan 27 01:09:59 Condor cib: [24881]: info: init_ais_connection_classic:
Connection to our AIS plugin (9) failed: unknown (100)
Jan 27 01:09:59 Condor cib: [24881]: CRIT: cib_init: Cannot sign in to the
cluster... terminating
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child
process cib exited (pid=24881, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: notice: pcmk_child_exit: Child
process cib no longer wishes to be respawned
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes:
Node Condor now has process list: 00000000000000000000000000110212 (was
00000000000000000000000000110312)
Jan 27 01:09:59 Condor stonith-ng: [24880]: info: crm_update_peer: Node
Condor: id=167837962 state=unknown addr=(null) votes=0 born=0 seen=0
proc=00000000000000000000000000110212 (new)
Jan 27 01:10:00 Condor crmd: [24885]: info: do_cib_control: Could not
connect to the CIB service: connection failed
Jan 27 01:10:00 Condor crmd: [24885]: WARN: do_cib_control: Couldn't
complete CIB registration 1 times... pause and retry
Jan 27 01:10:00 Condor crmd: [24885]: info: crmd_init: Starting crmd's
mainloop
Jan 27 01:10:01 Condor CRON[24888]: (root) CMD (/etc/init.d/watchdog -e
/dev/null 2>&1)
Jan 27 01:10:02 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer
(I_NULL) just popped (2000ms)
Jan 27 01:10:03 Condor crmd: [24885]: info: do_cib_control: Could not
connect to the CIB service: connection failed
Jan 27 01:10:03 Condor crmd: [24885]: WARN: do_cib_control: Couldn't
complete CIB registration 2 times... pause and retry
Jan 27 01:10:05 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer
(I_NULL) just popped (2000ms)
Jan 27 01:10:06 Condor crmd: [24885]: info: do_cib_control: Could not
connect to the CIB service: connection failed
Jan 27 01:10:06 Condor crmd: [24885]: WARN: do_cib_control: Couldn't
complete CIB registration 3 times... pause and retry
Jan 27 01:10:08 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer
(I_NULL) just popped (2000ms)
Jan 27 01:10:09 Condor crmd: [24885]: info: do_cib_control: Could not
connect to the CIB service: connection failed
Jan 27 01:10:09 Condor crmd: [24885]: WARN: do_cib_control: Couldn't
complete CIB registration 4 times... pause and retry
Jan 27 01:10:11 Condor crmd: [24885]: info: crm_timer_popped: Wait Timer
(I_NULL) just popped (2000ms)
Jan 27 01:10:12 Condor crmd: [24885]: info: do_cib_control: Could not
connect to the CIB service: connection failed
Jan 27 01:10:12 Condor crmd: [24885]: WARN: do_cib_control: Couldn't
complete CIB registration 5 times... pause and retry
Jacob A. Smith
IT Manager
Argotec, LLC
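
For anyone reading the log above, a minimal Python sketch that pairs each pcmk_child_exit line with the process-list change that follows it. It assumes the "process list" value is a hexadecimal bitmask and simply reports which bits were cleared; it does not try to map bits to daemon names. The names LOG and summarize are made up for this illustration, and the sample lines are the wrapped log lines from the post, rejoined.

```python
import re

# Sample lines taken from the log in the post (wrapped lines rejoined).
LOG = """\
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process attrd exited (pid=24883, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
Jan 27 01:09:59 Condor pacemakerd: [24877]: ERROR: pcmk_child_exit: Child process cib exited (pid=24881, rc=100)
Jan 27 01:09:59 Condor pacemakerd: [24877]: info: update_node_processes: Node Condor now has process list: 00000000000000000000000000110212 (was 00000000000000000000000000110312)
"""

EXIT_RE = re.compile(r"Child process ([\w-]+) exited \(pid=(\d+), rc=(\d+)\)")
PROC_RE = re.compile(r"now has process list: ([0-9a-fA-F]+) \(was ([0-9a-fA-F]+)\)")


def summarize(log):
    """Return (exits, cleared): child exits and the bits each list change cleared."""
    exits = []
    cleared = []
    for line in log.splitlines():
        m = EXIT_RE.search(line)
        if m:
            exits.append((m.group(1), int(m.group(2)), int(m.group(3))))
            continue
        m = PROC_RE.search(line)
        if m:
            # Assumption: the process list is a hex bitmask.
            now, was = int(m.group(1), 16), int(m.group(2), 16)
            cleared.append(was & ~now)  # bits set before, not set now
    return exits, cleared


exits, cleared = summarize(LOG)
for (name, pid, rc), bits in zip(exits, cleared):
    print(f"{name} (pid {pid}) exited rc={rc}, cleared bits {bits:#x}")
```

On the sample lines this reports bit 0x1000 cleared after the attrd exit and bit 0x100 cleared after the cib exit, both with rc=100, so those two daemons are the ones that dropped out in this startup attempt.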