[Pacemaker] 'stop' operation passes outdated set of instance attributes to RA

Discussion:

Vladislav Bogdanov

2015-02-13 14:10:41 UTC

Hi,

I believe it is a bug that the 'stop' operation uses the set of instance
attributes from the original 'start' op, not the set the successful 'reload' had.
The corresponding pe-input has the correct set of attributes, and the pre-stop
'notify' op uses the updated set of attributes too.
This is easily reproducible with resource agents 3.9.6 and trace_ra.

pacemaker is c529898.

Should I provide more information?

Best,
Vladislav

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Andrew Beekhof

2015-02-23 02:20:24 UTC

Yes please.
I suspect the lrmd needs to update its parameter cache for the reload operation.

David?

David Vossel

2015-02-23 18:50:08 UTC

Post by Andrew Beekhof

Post by Vladislav Bogdanov
Hi,
I believe that is a bug that 'stop' operation uses set of instance
attributes from the original 'start' op, not what successful 'reload' had.
Corresponding pe-input has correct set of attributes, and pre-stop 'notify'
op uses updated set of attributes too.
This is easily reproducible with 3.9.6 resource agents and trace_ra.
pacemaker is c529898.
Should I provide more information?

Yes please.
I suspect the lrmd needs to update its parameter cache for the reload operation.
David?

This falls on the crmd I believe. I haven't tested it, but something like
this should fix it I bet.

diff --git a/crmd/lrm.c b/crmd/lrm.c
index ead2e05..45641d2 100644
--- a/crmd/lrm.c
+++ b/crmd/lrm.c
@@ -186,6 +186,7 @@ update_history_cache(lrm_state_t * lrm_state, lrmd_rsc_info_t * rsc, lrmd_event_

     if (op->params &&
         (safe_str_eq(CRMD_ACTION_START, op->op_type) ||
+         safe_str_eq("reload", op->op_type) ||
          safe_str_eq(CRMD_ACTION_STATUS, op->op_type))) {
 
         if (entry->stop_params) {
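
The effect of the one-line change above can be sketched in isolation. This is a simplified model, not the actual crmd code: `struct history_entry`, `refreshes_stop_params()` and `update_history()` are hypothetical names, the cached parameters are reduced to a single string, and the literals "start" and "monitor" are assumed to be the values behind CRMD_ACTION_START and CRMD_ACTION_STATUS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the per-resource history entry.
 * The real crmd caches a table of parameters, not a single string. */
struct history_entry {
    char stop_params[128];
};

/* Returns 1 if a completed operation of this type should refresh the
 * parameters that a later 'stop' will be run with. */
static int refreshes_stop_params(const char *op_type)
{
    return strcmp(op_type, "start") == 0 ||
           strcmp(op_type, "reload") == 0 ||   /* the case added by the patch */
           strcmp(op_type, "monitor") == 0;
}

/* Record a finished operation. The cache is only touched when the
 * operation carries parameters and is one of the refreshing types. */
static void update_history(struct history_entry *entry,
                           const char *op_type, const char *params)
{
    assert(entry != NULL && op_type != NULL);
    if (params != NULL && refreshes_stop_params(op_type)) {
        snprintf(entry->stop_params, sizeof(entry->stop_params), "%s", params);
    }
}
```

In this model, calling `update_history()` for a "start" and then for a "reload" with changed parameters leaves the reload's parameters in `stop_params`. Without the "reload" comparison the cache would still hold the values from "start", which matches the behaviour reported at the top of the thread.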

Andrew Beekhof

2015-02-23 19:21:25 UTC

Post by David Vossel

Post by Andrew Beekhof

Yes please.
I suspect the lrmd needs to update its parameter cache for the reload operation.
David?

This falls on the crmd I believe.

Ah, yes. That rings a bell now. Thanks!

Post by David Vossel
I haven't tested it, but something like
this should fix it I bet.
diff --git a/crmd/lrm.c b/crmd/lrm.c
index ead2e05..45641d2 100644
--- a/crmd/lrm.c
+++ b/crmd/lrm.c
@@ -186,6 +186,7 @@ update_history_cache(lrm_state_t * lrm_state, lrmd_rsc_info_t * rsc, lrmd_event_
     if (op->params &&
         (safe_str_eq(CRMD_ACTION_START, op->op_type) ||
+         safe_str_eq("reload", op->op_type) ||
          safe_str_eq(CRMD_ACTION_STATUS, op->op_type))) {
         if (entry->stop_params) {

Vladislav Bogdanov

2015-02-25 06:59:22 UTC

Post by David Vossel

Post by Andrew Beekhof

Yes please.
I suspect the lrmd needs to update its parameter cache for the reload operation.
David?

This definitely fixes the issue,
thank you!

Vladislav Bogdanov

2015-02-23 18:53:19 UTC

Post by Andrew Beekhof

Yes please.

I am not sure what would be needed to reproduce and fix that.
On the one hand, everything from crm_report (maybe except the digest
hashes) will look OK. On the other, the variables are set to outdated
values, and that is visible in the RA traces. Maybe it is enough to just
try to reproduce with my latest patch to resource agents (included in 3.9.6)?
Steps are:
* create a clone resource (it is enough to set clone-max=1) with an RA
which supports both reload and notify (maybe it is simpler to
unconditionally set OCF_RESKEY_trace_ra=1 at the very beginning of the
resource agent, before the OCF framework is imported, to get traces of
all RA executions)
* enable notifications (and trace_ra) for that resource
* start the resource
* change parameters for the resource - that should cause a reload
* stop the resource
* compare the printenv output at the very beginning of the start, reload,
notify pre-stop and stop action traces.

Everything should be clear once that is done, I think.
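
The last step, comparing the environment seen by two actions, could be sketched roughly as below. This is only an illustration under assumptions: the function names are hypothetical, and each trace's printenv output is assumed to have already been read into an array of "KEY=VALUE" strings. It counts the OCF_RESKEY_* variables from one action whose value differs in, or is missing from, the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PREFIX "OCF_RESKEY_"

/* Find the entry in env[] with the same key as `entry` ("KEY=VALUE").
 * Returns NULL if `entry` has no '=' or the key is absent from env[]. */
static const char *find_same_key(const char *entry,
                                 const char *const env[], size_t n)
{
    const char *eq = strchr(entry, '=');
    if (eq == NULL) {
        return NULL;
    }
    size_t keylen = (size_t)(eq - entry) + 1;   /* include the '=' */
    for (size_t i = 0; i < n; i++) {
        if (strncmp(env[i], entry, keylen) == 0) {
            return env[i];
        }
    }
    return NULL;
}

/* Count OCF_RESKEY_* variables in a[] whose value differs in, or is
 * missing from, b[]. Each mismatch is printed for inspection. */
static int count_stale_params(const char *const a[], size_t na,
                              const char *const b[], size_t nb)
{
    int mismatches = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < na; i++) {
        if (strncmp(a[i], PREFIX, strlen(PREFIX)) != 0 ||
            strchr(a[i], '=') == NULL) {
            continue;   /* not an instance attribute */
        }
        const char *other = find_same_key(a[i], b, nb);
        if (other == NULL || strcmp(a[i], other) != 0) {
            printf("%s  vs  %s\n", a[i], other ? other : "(unset)");
            mismatches++;
        }
    }
    return mismatches;
}
```

Passing the reload environment as a[] and the stop environment as b[] should report every instance attribute that 'stop' received with an outdated value. Comparing reload against notify pre-stop should report none, since notify already gets the updated set.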

Best,
Vladislav

Post by Andrew Beekhof
I suspect the lrmd needs to update its parameter cache for the reload operation.
David?

Andrew Beekhof

2015-02-23 19:21:00 UTC

Post by Vladislav Bogdanov

Post by Andrew Beekhof

Yes please.

I am not sure what would be needed to reproduce and fix that.
On the one hand, everything from crm_report (maybe except the digest hashes) will look OK. On the other, the variables are set to outdated values, and that is visible in the RA traces. Maybe it is enough to just try to reproduce with my latest patch to resource agents (included in 3.9.6)?
* create a clone resource (it is enough to set clone-max=1) with an RA which supports both reload and notify (maybe it is simpler to unconditionally set OCF_RESKEY_trace_ra=1 at the very beginning of the resource agent, before the OCF framework is imported, to get traces of all RA executions)
* enable notifications (and trace_ra) for that resource
* start the resource
* change parameters for the resource - that should cause a reload
* stop the resource
* compare the printenv output at the very beginning of the start, reload, notify pre-stop and stop action traces.
Everything should be clear once that is done, I think.

General rule of thumb... add 1 month of turnaround if I need to set up a cluster to reproduce, compared to looking at logs/PE files.
That's not me being mean, I simply don't have the bandwidth. Yesterday I did nothing but answer emails and I barely scratched the surface.

So the easier it is for me to reply, the sooner it's going to happen.

Post by Vladislav Bogdanov
Best,
Vladislav

Post by Andrew Beekhof
I suspect the lrmd needs to update its parameter cache for the reload operation.

Did you try David's fix?
(See, I didn't even find time to hunt down the right place for a one-line change.)

Post by Vladislav Bogdanov

Post by Andrew Beekhof
David?