[Pacemaker] pacemaker/corosync: a resource is started on 2 nodes
Sergey Arlashin
2015-01-28 10:20:51 UTC
Hi!

I have a small corosync/pacemaker-based cluster which consists of 4 nodes. 2 nodes are in standby mode, and the other 2 actually handle all the resources.

corosync ver. 1.4.7-1
pacemaker ver. 1.1.11
OS: Ubuntu 12.04

Inside our production environment, which has plenty of free RAM, CPU, etc., everything is working well. When I switch one node off, all the resources move to the other without any problems. And vice versa. That's what I need :)

Our staging environment has rather weak hardware (that's OK - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens some of the cluster resources fail (which I consider to be normal), but I can also see the following crm output:

Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]

Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]

As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes (lb-node2 and lb-node1), but I can't figure out how that could happen.

This is the output of my 'crm configure show':

node db-node1 \
attributes standby=on
node db-node2 \
attributes standby=on
node lb-node1
node lb-node2
primitive Cachier ocf:site:cachier \
op monitor interval=10s timeout=30s depth=10 \
meta target-role=Started
primitive FailoverIP1 IPaddr2 \
params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
op monitor interval=30s
primitive Mailer ocf:site:mailer \
meta target-role=Started \
op monitor interval=10s timeout=30s depth=10
primitive Memcached memcached \
op monitor interval=10s timeout=30s depth=10 \
meta target-role=Started
primitive Nginx nginx \
params status10url="/nginx_status" testclient=curl port=8091 \
op monitor interval=10s timeout=30s depth=10 \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s \
meta target-role=Started
primitive Pgpool2 pgpool \
params checkmethod=pid \
op monitor interval=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=60s
group IPGroup FailoverIP1 \
meta target-role=Started
colocation ip-with-cachier inf: Cachier IPGroup
colocation ip-with-mailer inf: Mailer IPGroup
colocation ip-with-memcached inf: Memcached IPGroup
colocation ip-with-nginx inf: Nginx IPGroup
colocation ip-with-pgpool inf: Pgpool2 IPGroup
order cachier-after-ip inf: IPGroup Cachier
order mailer-after-ip inf: IPGroup Mailer
order memcached-after-ip inf: IPGroup Memcached
order nginx-after-ip inf: IPGroup Nginx
order pgpool-after-ip inf: IPGroup Pgpool2
property cib-bootstrap-options: \
expected-quorum-votes=4 \
stonith-enabled=false \
default-resource-stickiness=100 \
maintenance-mode=false \
dc-version=1.1.10-9d39a6b \
cluster-infrastructure="classic openais (with plugin)" \
last-lrm-refresh=1422438144


So the question is: does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is that something that can normally happen? Or is it happening because of the shortage of computing power which I described earlier? :)
How can I prevent something like this from happening? Is this the kind of case that is normally supposed to be solved by STONITH?

Thanks in advance.

--
Best regards,
Sergey Arlashin

Michael Schwartzkopff
2015-01-28 12:49:01 UTC
Post by Sergey Arlashin
Hi!
[...]
Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]
Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes
(lb-node2 and lb-node1), but I can't figure out how that could happen.
Your config does not allow this, but since your HW is slow, pacemaker runs into
timeouts and corosync connection problems. You could debug the problem by
tracing the event in the logs. With the command crm_mon -1rtf you can find the
time of the failure. Search around that time in the logs.

If the communication in the cluster does not work, pacemaker sometimes behaves
very oddly.
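
For example, a rough way to do that digging (a sketch only, assuming the stock
Ubuntu 12.04 log locations and that corosync is configured to log to a file;
the paths may differ on your machines):

  # one-shot status with fail counts and the timestamps of the last operations
  crm_mon -1rtf

  # then search the logs around the timestamp of the failed operation
  grep -iE 'error|timed out|unclean' /var/log/corosync/corosync.log
  grep -iE 'crmd|pengine|lrmd' /var/log/syslog | less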

Kind regards,

Michael Schwartzkopff
--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: München, commercial register: Amtsgericht München, HRB 199263
Management board: Patrick Ben Koetter, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein

Andrew Beekhof
2015-02-23 23:16:39 UTC
Post by Sergey Arlashin
Hi!
[...]
Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]
Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes (lb-node2 and lb-node1), but I can't figure out how that could happen.
stonith-enabled=false is one especially good way, particularly in an unstable environment.
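
As a rough sketch only, turning fencing on for the two active nodes could look
something like this in the crm shell - the external/ipmi agent, the addresses
and the credentials below are placeholders for whatever fencing hardware you
actually have:

  primitive fence-lb-node1 stonith:external/ipmi \
      params hostname=lb-node1 ipaddr=10.0.0.101 userid=admin passwd=secret interface=lan \
      op monitor interval=60s
  primitive fence-lb-node2 stonith:external/ipmi \
      params hostname=lb-node2 ipaddr=10.0.0.102 userid=admin passwd=secret interface=lan \
      op monitor interval=60s
  # keep each node from running its own fencing device
  location l-fence-lb-node1 fence-lb-node1 -inf: lb-node1
  location l-fence-lb-node2 fence-lb-node2 -inf: lb-node2
  property stonith-enabled=true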

It could even be that it is showing up as running due to failed monitor operations and is not actually running there (but for safety we have to assume it is).
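
A quick way to check whether the address really is active on both machines,
and to clear the stale failure state once you have verified things by hand
(a sketch, using the FAILOVER iflabel from the config above):

  # run on both lb-node1 and lb-node2: is the failover IP actually configured?
  ip -o addr show | grep FAILOVER

  # once only one node holds it, clear the failed operations so pacemaker re-probes
  crm resource cleanup Pgpool2
  crm resource cleanup FailoverIP1
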
_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
