[Pacemaker] Colocating with unmanaged resource

Discussion:

Покотиленко Костик

2014-12-19 19:21:49 UTC

Hi,

Simple scenario: several floating IPs should live on "front" nodes
only where there is a working Nginx. There are several reasons against
Nginx being controlled by Pacemaker.

So I decided to colocate the FIPs with an unmanaged Nginx. This worked
fine in 1.1.6, with some exceptions.

Later, on another cluster, I decided to switch to 1.1.10 and corosync 2
because of the performance improvements. I'm now also testing 1.1.12.

It seems I can't reliably colocate FIPs with unmanaged Nginx on 1.1.10
and 1.1.12.

Here are behaviors of different versions of pacemaker:

1.1.6, 1.1.10, 1.1.12:

- if Nginx has started on a node after initial probe for Nginx clone
then pacemaker will never see it running until cleanup or other probe
trigger

1.1.6:

- stopping nginx on a node makes the clone instance FAIL for that node,
FIP moves away from that node. This is as expected
- starting nginx removes FAIL state and FIP moves back. This is as
expected

1.1.10:

- stopping nginx on a node:
  - usually makes the clone instance FAIL for that node, but
    FIP stays running on that node regardless of INF colocation
  - sometimes makes the clone instance FAIL for that node and
    immediately after that the clone instance returns to STARTED state,
    FIP stays running on that node
  - sometimes makes the clone instance STOPPED for that node,
    FIP moves away from that node. This is as expected
- starting nginx:
  - if was FAIL: removes FAIL state, FIP remains running
  - if was STARTED:
    - usually nothing happens: FIP remains running
    - sometimes makes the clone instance FAIL for that node, but
      FIP stays running on that node regardless of INF colocation
  - if was STOPPED: moves FIP back. This is as expected

1.1.12:

- stopping nginx on a node always makes the clone instance FAIL for
that node, but FIP stays running on that node regardless of INF
colocation
- starting nginx removes FAIL state, FIP remains running

Please comment on this. And some questions:

- are unmanaged resources designed to be used under normal conditions,
with other resources colocated with them? How should they be set up?
- is there some kind of "recurring probe" to "see" unmanaged resources
that have started after the initial probe?

Let me know if more logs are needed. Right now I can't collect logs for
all cases; some are attached.

Config for 1.1.10 (similar configs for 1.1.6 and 1.1.12):

node $id="..." pcmk10-1 \
    attributes onhv="1" front="true"
node $id="..." pcmk10-2 \
    attributes onhv="2" front="true"
node $id="..." pcmk10-3 \
    attributes onhv="3" front="true"

primitive FIP_1 ocf:heartbeat:IPaddr2 \
    op monitor interval="2s" \
    params ip="10.1.1.1" cidr_netmask="16" \
    meta migration-threshold="2" failure-timeout="60s"
primitive FIP_2 ocf:heartbeat:IPaddr2 \
    op monitor interval="2s" \
    params ip="10.1.2.1" cidr_netmask="16" \
    meta migration-threshold="2" failure-timeout="60s"
primitive FIP_3 ocf:heartbeat:IPaddr2 \
    op monitor interval="2s" \
    params ip="10.1.3.1" cidr_netmask="16" \
    meta migration-threshold="2" failure-timeout="60s"

primitive Nginx lsb:nginx \
    op start interval="0" enabled="false" \
    op stop interval="0" enabled="false" \
    op monitor interval="2s"

clone cl_Nginx Nginx \
    meta globally-unique="false" notify="false" is-managed="false"

location loc-cl_Nginx cl_Nginx \
    rule $id="loc-cl_Nginx-r1" 500: front eq true

location loc-FIP_1 FIP_1 \
    rule $id="loc-FIP_1-r1" 500: onhv eq 1 and front eq true \
    rule $id="loc-FIP_1-r2" 200: defined onhv and onhv ne 1 and front eq true
location loc-FIP_2 FIP_2 \
    rule $id="loc-FIP_2-r1" 500: onhv eq 2 and front eq true \
    rule $id="loc-FIP_2-r2" 200: defined onhv and onhv ne 2 and front eq true
location loc-FIP_3 FIP_3 \
    rule $id="loc-FIP_3-r1" 500: onhv eq 3 and front eq true \
    rule $id="loc-FIP_3-r2" 200: defined onhv and onhv ne 3 and front eq true

colocation coloc-FIP_1-cl_Nginx inf: FIP_1 cl_Nginx
colocation coloc-FIP_2-cl_Nginx inf: FIP_2 cl_Nginx
colocation coloc-FIP_3-cl_Nginx inf: FIP_3 cl_Nginx

property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    symmetric-cluster="false" \
    stonith-enabled="false" \
    no-quorum-policy="stop" \
    cluster-recheck-interval="10s" \
    maintenance-mode="false" \
    last-lrm-refresh="1418998945"
rsc_defaults $id="rsc-options" \
    resource-stickiness="30"
op_defaults $id="op_defaults-options" \
    record-pending="false"
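
For readers following along, the placement this configuration is meant to
produce can be sketched as a toy model. This is not Pacemaker's allocator.
It only assumes that (a) with symmetric-cluster="false" a node is eligible
for a FIP only if one of the location rules matches, (b) the INF colocation
restricts a FIP to nodes where a cl_Nginx instance is active, and (c)
resource-stickiness is added for the node currently hosting the FIP. The
function name expected_node and the alphabetical tie-break are illustrative
assumptions, not anything taken from Pacemaker.

```python
# Toy model of where FIP_<n> is *expected* to run under the config above.
# Not Pacemaker's real placement algorithm; all names here are illustrative.

NODES = {
    "pcmk10-1": {"onhv": "1", "front": "true"},
    "pcmk10-2": {"onhv": "2", "front": "true"},
    "pcmk10-3": {"onhv": "3", "front": "true"},
}


def expected_node(fip_index, nginx_active, current=None, nodes=NODES, stickiness=30):
    """Return the node expected to host FIP_<fip_index>, or None if none is eligible."""
    best = None
    for name in sorted(nodes):
        attrs = nodes[name]
        # Both location rules require front eq true and a defined onhv.
        # With symmetric-cluster="false", no matching rule means "not allowed".
        if attrs.get("front") != "true" or "onhv" not in attrs:
            continue
        # colocation inf: the FIP may only run where a cl_Nginx instance is active.
        if name not in nginx_active:
            continue
        score = 500 if attrs["onhv"] == str(fip_index) else 200
        if name == current:
            score += stickiness  # resource-stickiness="30"
        if best is None or score > best[0]:
            best = (score, name)  # ties: first name alphabetically (assumption)
    return best[1] if best else None


all_up = {"pcmk10-1", "pcmk10-2", "pcmk10-3"}
print(expected_node(1, all_up))                                      # pcmk10-1
print(expected_node(1, all_up - {"pcmk10-1"}, current="pcmk10-1"))   # pcmk10-2
```

In this model, stopping nginx on pcmk10-1 makes that node ineligible for
FIP_1, and starting it again brings FIP_1 back because 500 beats 200 + 30.
That is the 1.1.6 behavior described above as expected.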

Andrew Beekhof

2015-01-06 05:27:45 UTC

Post by Покотиленко Костик
- if Nginx has started on a node after initial probe for Nginx clone
then pacemaker will never see it running until cleanup or other probe
trigger

you'll want a recurring monitor with role=Stopped

Post by Покотиленко Костик
- stopping nginx on a node always makes the clone instance to FAIL for
that node, but FIP stays running on that node regardless of INF
colocation

can you attach a crm_report of the above test please?

<1.1.10_fail-started.log> <1.1.10_stopped-started.log>

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Andrew Beekhof

2015-01-22 03:59:02 UTC

Post by Andrew Beekhof

Post by Покотиленко Костик
- stopping nginx on a node always makes the clone instance to FAIL for
that node, but FIP stays running on that node regardless of INF
colocation

can you attach a crm_report of the above test please?

crm_report of this test attached as
pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2

is there a reason nginx is not managed?
if it wasn't, then we'd have stopped it and FIP_2 would have been moved

Post by Andrew Beekhof

Post by Покотиленко Костик
- if Nginx has started on a node after initial probe for Nginx clone
then pacemaker will never see it running until cleanup or other probe
trigger

you'll want a recurring monitor with role=Stopped

How is it done?

I don't know the crmsh syntax. Sorry

primitive Nginx lsb:nginx \
op monitor interval=2s \
op monitor interval=3s role=Stopped
This produces warning that monitor_stopped may be unsupported by RA.

I'm not familiar with that warning.
Where did you see it?

Should it?
And it's not recognizing start of nginx.

It seems role=Stopped only works for primitives (not clones)
I've made a note to get this fixed

- stop nginx on 2nd node
- cleanup cl_Nginx so that pacemaker forgets nginx was running on the
2nd node
- clear logs
- start nginx
- nothing happens
- make crm_report
crm_report of this test attached as
pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2
<pcmk-monitor-stopped-Wed-14-Jan-2015.tar.bz2><pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2>
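
The test sequence above could be scripted roughly as follows. This is a
dry-run sketch only: the init-script path /etc/init.d/nginx (inferred from
the lsb:nginx resource), the report name, and the helper names test_plan
and run are assumptions for illustration. The "clear logs" step is left as
a comment because it depends on the local logging setup.

```python
# Dry-run sketch of the test sequence; meant to be run on the 2nd node.
# The nginx init-script path and the report name are assumptions.
import shlex
import subprocess
from datetime import datetime


def test_plan(resource="cl_Nginx", report="pcmk-monitor-stopped", start=None):
    """Build the list of commands for the test, in order."""
    start = start or datetime.now()
    return [
        ["/etc/init.d/nginx", "stop"],
        # Make Pacemaker forget that nginx was running on this node.
        ["crm", "resource", "cleanup", resource],
        # (clear logs here; how to do that depends on the logging setup)
        ["/etc/init.d/nginx", "start"],
        # Collect a report covering the time since the test started.
        ["crm_report", "-f", start.strftime("%Y-%m-%d %H:%M:%S"), report],
    ]


def run(plan, execute=False):
    """Print each command; only execute it when explicitly asked to."""
    for cmd in plan:
        print(" ".join(shlex.quote(part) for part in cmd))
        if execute:
            subprocess.run(cmd, check=True)


run(test_plan())  # prints the commands without running them
```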

Kristoffer Grönlund

2015-01-22 09:13:50 UTC

Post by Andrew Beekhof

Post by Andrew Beekhof
you'll want a recurring monitor with role=Stopped

How is it done?

I don't know the crmsh syntax. Sorry

primitive Nginx lsb:nginx \
op monitor interval=2s \
op monitor interval=3s role=Stopped
This produces warning that monitor_stopped may be unsupported by RA.

To clarify, the above is indeed the correct crmsh syntax for a monitor
op with role=Stopped.
--
// Kristoffer Grönlund
// ***@suse.com


Покотиленко Костик

2015-02-27 19:00:08 UTC

Post by Andrew Beekhof

can you attach a crm_report of the above test please?

crm_report of this test attached as
pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2

is there a reason nginx is not managed?
if it wasn't, then we'd have stopped it and FIP_2 would have been moved

I'm not sure I got this right.

Nginx is unmanaged intentionally (is-managed="false"), hence the subject.
The whole point is that stopping the unmanaged nginx doesn't move away
the FIP that is INF colocated with it (this is regarding 1.1.12; 1.1.6
works fine).

Post by Andrew Beekhof

Post by Покотиленко Костик
- if Nginx has started on a node after initial probe for Nginx clone
then pacemaker will never see it running until cleanup or other probe
trigger

you'll want a recurring monitor with role=Stopped

How is it done?

I don't know the crmsh syntax. Sorry

primitive Nginx lsb:nginx \
op monitor interval=2s \
op monitor interval=3s role=Stopped
This produces warning that monitor_stopped may be unsupported by RA.

I'm not familiar with that warning.
Where did you see it?

The exact text is:
WARNING: Nginx: action monitor_Stopped not advertised in meta-data, it may not be supported by the RA

This is produced by crm configure edit.

Post by Andrew Beekhof

Should it?
And it's not recognizing start of nginx.

It seems role=Stopped only works for primitives (not clones)
I've made a note to get this fixed

This will add usability for unmanaged resources, thanks.


Andrew Beekhof

2015-03-30 01:11:25 UTC

Post by Покотиленко Костик

Post by Andrew Beekhof

can you attach a crm_report of the above test please?

crm_report of this test attached as
pcmk-nginx-fail-Wed-14-Jan-2015.tar.bz2

is there a reason nginx is not managed?
if it wasn't, then we'd have stopped it and FIP_2 would have been moved

I'm not sure I got this right.
Nginx is not managed by intention (is-managed="false") that's why subj.
And the whole subject is in fact that stopping unmanaged nginx doesn't
move away FIP which is INF colocated with it (this is regarding 1.1.12,
1.1.6 works fine).

Ahhhh.
We changed the way monitors that return OCF_NOT_RUNNING were handled to still require a stop under most conditions.
I've added "not managed" to the list of exceptions:

diff --git a/lib/pengine/unpack.c b/lib/pengine/unpack.c
index 308258d..6dc44fd 100644
--- a/lib/pengine/unpack.c
+++ b/lib/pengine/unpack.c
@@ -2689,7 +2689,7 @@ determine_op_status(
             break;

         case PCMK_OCF_NOT_RUNNING:
-            if (is_probe || target_rc == rc) {
+            if (is_probe || target_rc == rc || is_not_set(rsc->flags, pe_rsc_managed)) {
                 result = PCMK_LRM_OP_DONE;
                 rsc->role = RSC_ROLE_STOPPED;

Look for this in 1.1.13-rc2
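
The effect of that one-line change can be sketched outside the policy
engine. The Python model below only illustrates the condition in the diff.
The function name, the string results, and the `patched` switch are made up
for this example. The "failed" branch follows the description above (a stop
is still required) rather than code shown in the diff.

```python
# Illustrative model of the PCMK_OCF_NOT_RUNNING branch in determine_op_status().
# Only the condition comes from the diff; everything else is an assumption.

OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def classify_not_running(is_probe, target_rc, managed, patched):
    """Say how a monitor result of OCF_NOT_RUNNING is interpreted."""
    rc = OCF_NOT_RUNNING
    cleanly_stopped = is_probe or target_rc == rc
    if patched:
        # New exception added by the patch: resources that are not managed.
        cleanly_stopped = cleanly_stopped or not managed
    # "stopped": the role becomes Stopped, so colocated resources can move away.
    # "failed": treated as a failure that still requires a stop (assumed branch).
    return "stopped" if cleanly_stopped else "failed"


# Recurring monitor of an unmanaged nginx that was expected to be running (rc 0):
print(classify_not_running(False, OCF_SUCCESS, managed=False, patched=False))  # failed
print(classify_not_running(False, OCF_SUCCESS, managed=False, patched=True))   # stopped
```

Under this reading, an unmanaged instance stays failed before the patch
because the stop is never carried out. That would be consistent with the FIP
staying put on 1.1.12.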

Post by Покотиленко Костик

Post by Andrew Beekhof

Post by Покотиленко Костик
- if Nginx has started on a node after initial probe for Nginx clone
then pacemaker will never see it running until cleanup or other probe
trigger

you'll want a recurring monitor with role=Stopped

How is it done?

I don't know the crmsh syntax. Sorry

primitive Nginx lsb:nginx \
op monitor interval=2s \
op monitor interval=3s role=Stopped
This produces warning that monitor_stopped may be unsupported by RA.

I'm not familiar with that warning.
Where did you see it?

WARNING: Nginx: action monitor_Stopped not advertised in meta-data, it may not be supported by the RA
This is produced by crm configure edit,

Hmmm, you'd have to take that up with the crmsh maintainers.
