Sergey Arlashin
2015-01-28 10:20:51 UTC
Hi!
I have a small corosync/pacemaker-based cluster which consists of 4 nodes: 2 nodes are in standby mode, and the other 2 actually handle all the resources.
Corosync ver. 1.4.7-1.
Pacemaker ver. 1.1.11.
OS: Ubuntu 12.04.
In our production environment, which has plenty of free RAM, CPU, etc., everything is working well. When I switch one node off, all the resources move to the other without any problems, and vice versa. That's what I need :)
Our staging environment has rather weak hardware (that's OK - it's just staging :) ) and is rather busy. Sometimes it doesn't even have enough CPU or disk speed to be stable. When that happens, some of the cluster resources fail (which I consider normal), but I also see the following crm output:
Node db-node1: standby
Node db-node2: standby
Online: [ lb-node1 lb-node2 ]
Pgpool2 (ocf::heartbeat:pgpool): FAILED (unmanaged) [ lb-node2 lb-node1 ]
Resource Group: IPGroup
     FailoverIP1 (ocf::heartbeat:IPaddr2): Started [ lb-node2 lb-node1 ]
As you can see, the resource ocf::heartbeat:IPaddr2 is started on both nodes (lb-node2 and lb-node1), but I can't figure out how that could happen.
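To check whether the address is really configured on both machines (and not just reported that way), something like the following can be run on each of lb-node1 and lb-node2; it only uses the VIP from the config below, so this is just a quick sanity check, not part of my setup:

# does the kernel actually have the VIP plumbed on this node?
ip addr show | grep -F 111.22.33.44
# and what the cluster itself believes:
crm_mon -1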
This is the output of my crm configure show:
node db-node1 \
        attributes standby=on
node db-node2 \
        attributes standby=on
node lb-node1
node lb-node2
primitive Cachier ocf:site:cachier \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive FailoverIP1 IPaddr2 \
        params ip=111.22.33.44 cidr_netmask=32 iflabel=FAILOVER \
        op monitor interval=30s
primitive Mailer ocf:site:mailer \
        meta target-role=Started \
        op monitor interval=10s timeout=30s depth=10
primitive Memcached memcached \
        op monitor interval=10s timeout=30s depth=10 \
        meta target-role=Started
primitive Nginx nginx \
        params status10url="/nginx_status" testclient=curl port=8091 \
        op monitor interval=10s timeout=30s depth=10 \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s \
        meta target-role=Started
primitive Pgpool2 pgpool \
        params checkmethod=pid \
        op monitor interval=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=60s
group IPGroup FailoverIP1 \
        meta target-role=Started
colocation ip-with-cachier inf: Cachier IPGroup
colocation ip-with-mailer inf: Mailer IPGroup
colocation ip-with-memcached inf: Memcached IPGroup
colocation ip-with-nginx inf: Nginx IPGroup
colocation ip-with-pgpool inf: Pgpool2 IPGroup
order cachier-after-ip inf: IPGroup Cachier
order mailer-after-ip inf: IPGroup Mailer
order memcached-after-ip inf: IPGroup Memcached
order nginx-after-ip inf: IPGroup Nginx
order pgpool-after-ip inf: IPGroup Pgpool2
property cib-bootstrap-options: \
        expected-quorum-votes=4 \
        stonith-enabled=false \
        default-resource-stickiness=100 \
        maintenance-mode=false \
        dc-version=1.1.10-9d39a6b \
        cluster-infrastructure="classic openais (with plugin)" \
        last-lrm-refresh=1422438144
So the question is: does my config allow a resource like ocf::heartbeat:IPaddr2 to be started on multiple nodes simultaneously? Is this something that can normally happen? Or is it happening because of the shortage of computing power which I described earlier? :)
How can I prevent something like this from happening? Is this the kind of case that is normally supposed to be solved by STONITH?
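Just so it's clear what I mean by STONITH here, a rough sketch of what enabling fencing might look like in crm syntax, assuming IPMI-capable hardware and the external/ipmi agent; the BMC addresses and credentials below are placeholders, not anything we actually have configured:

# one fencing device per load-balancer node, never running on the node it fences
primitive st-lb-node1 stonith:external/ipmi \
        params hostname=lb-node1 ipaddr=192.0.2.11 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
primitive st-lb-node2 stonith:external/ipmi \
        params hostname=lb-node2 ipaddr=192.0.2.12 userid=admin passwd=secret interface=lan \
        op monitor interval=60s
location st-lb-node1-placement st-lb-node1 -inf: lb-node1
location st-lb-node2-placement st-lb-node2 -inf: lb-node2
property stonith-enabled=true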
Thanks in advance.
--
Best regards,
Sergey Arlashin
_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org