Discussion:
[Pacemaker] One node thinks everyone is online, but the other node doesn't think so
Dmitry Koterov
2015-01-25 23:53:52 UTC
Hello.

I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first couple of days.
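
For context, the quorum-related part of corosync.conf for this kind of layout would typically look something like the sketch below (this is an assumption about the setup, not the actual file from these nodes; each node carries one vote and pacemaker is simply not started on node3):

nodelist {
    node {
        ring0_addr: node1
    }
    node {
        ring0_addr: node2
    }
    node {
        ring0_addr: node3
    }
}
quorum {
    # standard corosync 2.x vote-based quorum: 3 votes expected, 2 needed
    provider: corosync_votequorum
    expected_votes: 3
}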

At some point I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please explain how this can happen? There are no connectivity problems between the nodes at the moment (maybe there were earlier, but why hasn't the system recovered?). The "crm status" output is below. What other logs should I attach for diagnostics?

Also, "service corosync stop" on node1 hangs forever with no additional
lines in logs, so I cannot even stop the service. (But after "service
corosync stop" on node1 the node node2 starts thinking that node1 is
offline, although the command still hangs.)
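
As an aside, one way such diagnostics are usually collected in one go is crm_report; a minimal sketch, assuming crm_report (shipped with pacemaker) is installed on these nodes:

# run on one cluster node; gathers logs, CIB and corosync config from all nodes
crm_report -f "2015-01-22 10:00:00" -t "2015-01-22 11:00:00" /tmp/cluster-report
# quick live checks, per node
corosync-quorumtool -s    # quorum state and votes as corosync sees them
crm_mon -1                # pacemaker's view of nodes and resources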


***@node2:~# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 node2 ]
Master/Slave Set: ms_drbd [drbd]
Masters: [ node2 ]
Slaves: [ node1 ]
Resource Group: server
fs (ocf::heartbeat:Filesystem): Started node2
postgresql (lsb:postgresql): Started node2
bind9 (lsb:bind9): Started node2
nginx (lsb:nginx): Started node2


***@node1:/var/log# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 ]
OFFLINE: [ node2 ]
Master/Slave Set: ms_drbd [drbd]
Masters: [ node1 ]
Stopped: [ node2 ]
Resource Group: server
fs (ocf::heartbeat:Filesystem): Started node1
postgresql (lsb:postgresql): Started node1
bind9 (lsb:bind9): Started node1
nginx (lsb:nginx): Started node1
Failed actions:
drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error


Part of the "crm configure show" output:

property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
last-lrm-refresh="1421250983"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"


Also I see the following in the logs on node1 (maybe related to the issue, maybe not):

Jan 22 10:14:02 node1 pengine[2772]: warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down
Jan 22 10:14:02 node1 pengine[2772]: warning: determine_online_status: Node node2 is unclean
Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: Node node2 is unclean!
Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
Jan 22 10:14:02 node1 pengine[2772]: error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE

On node2 the logs are:

Jan 22 10:13:57 node2 corosync[32761]: [TOTEM ] A new membership (188.166.54.190:6276) was formed. Members left: 1760315215 13071578
Jan 22 10:13:57 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
Jan 22 10:13:57 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] Members[1]: 1017525950
Jan 22 10:13:57 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
Jan 22 10:13:57 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 22 10:14:01 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
Jan 22 10:14:02 node2 crmd[311]: error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead
Jan 22 10:14:02 node2 crmd[311]: notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] This node is within the primary component and will provide service.
Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
Jan 22 10:14:03 node2 crmd[311]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)
Jan 22 10:14:03 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 22 10:14:03 node2 crmd[311]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
Jan 22 10:14:03 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
Jan 22 10:14:03 node2 crmd[311]: warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
Jan 22 10:14:04 node2 crmd[311]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jan 22 10:14:05 node2 crmd[311]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jan 22 10:15:11 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578
Jan 22 10:15:14 node2 corosync[32761]: [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
Jan 22 10:15:17 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215
Jan 22 10:15:19 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
Jan 22 10:15:19 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )
Jan 22 10:15:20 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
Jan 22 10:15:20 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
Andrew Beekhof
2015-02-23 23:07:02 UTC
Post by Dmitry Koterov
Hello.
I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first couple of days.
At some point I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please explain how this can happen? There are no connectivity problems between the nodes at the moment (maybe there were earlier, but why hasn't the system recovered?).
The logs show connectivity problems occurring, so there is no doubt about that.
As to why it hasn't recovered, first check corosync - if it does not have a consistent view of the world, pacemaker has no hope.
Alternatively, I recall there was a bug in your version that could be preventing recovery. So if corosync looks fine, perhaps try an upgrade.
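
A sketch of how to act on that check, using only standard corosync 2.x tools (nothing here is specific to this cluster): run the same commands on node1, node2 and node3 and compare the output; the membership list and ring state should agree everywhere before pacemaker's view is worth debugging.

corosync-quorumtool -s           # quorum state, vote counts and member list
corosync-cfgtool -s              # ring status; every ring should report "no faults"
corosync-cmapctl | grep members  # runtime membership as recorded in the cmap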
Dmitry Koterov
2015-03-10 15:21:08 UTC
Post by Andrew Beekhof
Post by Dmitry Koterov
I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first couple of days.
At some point I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please explain how this can happen? There are no connectivity problems between the nodes at the moment (maybe there were earlier, but why hasn't the system recovered?).
The logs show connectivity problems occurring, so there is no doubt about that.
As to why it hasn't recovered, first check corosync - if it does not have a consistent view of the world, pacemaker has no hope.
Alternatively, I recall there was a bug in your version that could be preventing recovery. So if corosync looks fine, perhaps try an upgrade.
Thanks.
Are you talking about this bug:
https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?

I believe I reproduced the problem one more time (it is hard to trigger); the symptoms were the following:

1. At some point node2 went down.
2. The last message from corosync on node1 was "Quorum lost" (I suspect there was a temporary loss of connectivity with node3).
3. Then, a couple of days later, "service corosync stop" hung on node3 (only killall -9 helped). I ran strace while the service was stopping; it shows:

[pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183, si_uid=0} ---
[pid 19448] write(6, "\3\0\0\0", 4) = 4
[pid 19448] rt_sigreturn() = 360
... <and repeats for 19448 again and again>

where pstree shows:

init,1
├─corosync,19448
│ └─{corosync},19449

4. The same on node1: "service corosync stop" hangs there too with the same symptoms; only killall -9 helps.
5. Restarting corosync & pacemaker at node1 and node2 solved the problem.

Could you please say whether this is related to the above libqb bug?
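
In case it helps next time this happens, a small sketch of how the hang could be captured in more detail before killing the process (plain gdb, assuming debug symbols for corosync and libqb are installed):

# attach to the hung corosync, dump backtraces of all threads, then detach
gdb -p $(pidof corosync) -batch -ex 'thread apply all bt'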
Andrew Beekhof
2015-03-30 00:33:47 UTC
Post by Dmitry Koterov
Are you talking about this bug: https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?
Could you please say whether this is related to the above libqb bug?
I'm no expert in libqb, but it certainly looks likely.
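
One way to check whether a fixed libqb is actually in place, sketched for Debian/Ubuntu packaging since the linked report is a Launchpad bug (adjust for whatever distribution is actually in use; the package name libqb0 is an assumption):

apt-cache policy libqb0              # installed vs. candidate version of the libqb package
ldd $(which corosync) | grep libqb   # confirm which libqb corosync is actually linked against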
_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org