[Pacemaker] Segfault on monitor resource

Discussion:

Oscar Salvador

2015-01-26 17:20:35 UTC

Hi!

I'm writing here because two days ago I experienced a strange problem in my
Pacemaker cluster.
Everything was working fine until, suddenly, a segfault occurred in the Nginx
resource's monitor operation:

Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7551
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations
(0.00us average, 0% utilization) in the last 10min
Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck
Timer (I_PE_CALC) just popped (900000ms)
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed to
state S_POLICY_ENGINE after C_TIMER_POPPED
Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness:
Ldirector-rsc can fail 999997 more times on lb02 before being forced off
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message:
Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph
7552 (ref=pe_calc-dc-1422155424-7644) derived from
/var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7552
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Segmentation fault ******* here it starts

As you can see in the last line, the monitor reported a segmentation fault.
Then:

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Killed
/usr/lib/ocf/resource.d//heartbeat/nginx: 910:
/usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork

I guess this is where Nginx was killed.

After that there are some other errors, until Pacemaker decides to move the
resources to the other node (lb01):

Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
invalid parameter
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected
action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph:
process_graph_event:476 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0,
magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=
3.14.40) : Old event
Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating
failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++,
time=1422155430)
Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile
/var/log/ha-log
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-Nginx-rsc (1)
Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
parameter' (rc=2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness:
Ldirector-rsc can fail 999997 more times on lb02 before being forced off
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_mysql (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_nginx (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_nginx6 (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
IP-rsc_elasticsearch (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
Ldirector-rsc (Started lb02 -> lb01)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
Nginx-rsc (Started lb02 -> lb01)
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent
update 23: fail-count-Nginx-rsc=1
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-Nginx-rsc (1422155430)

I see that Pacemaker is complaining about some errors like "invalid
parameter", for example in these lines:

Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
invalid parameter

Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
parameter' (rc=2)

To me it sounds like a syntax problem in the resource definitions, but I've
checked the config with crm_verify and there is no error:

root# (S) crm_verify -LVV
root# (S)

So I'm just wondering why pacemaker is complaining about an invalid
parameter.
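
A note on rc=2: the number in these log lines is the exit status of the resource agent's monitor action, and Pacemaker renders it through the OCF return-code table, where 2 is OCF_ERR_ARGS ("invalid parameter"). crm_verify checks the configuration, not what an agent exits with at run time, so a clean crm_verify does not rule this message out. Below is a minimal sketch of that table; the function name ocf_rc_name is made up for illustration and is not part of Pacemaker.

```shell
# Map an OCF resource-agent exit status to its symbolic name.
# ocf_rc_name is a hypothetical helper, not a Pacemaker command.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;           # logged as "invalid parameter"
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;        # logged as "not running"
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo UNKNOWN ;;
    esac
}

ocf_rc_name 2   # status of the failed Nginx-rsc monitor
ocf_rc_name 7   # status of the earlier Ldirector-rsc failure
```

Shell interpreters may also use 2 as a generic failure status of their own, so an agent script that aborts abnormally could plausibly end up reporting it without any parameter being wrong.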

These are my CIB objects:

node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
node $id="68328520-68e0-42fd-9adf-062655691643" lb02
primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
params ipv6addr="xxxxxxxxxxxxxx" \
op monitor interval="10s"
primitive Ldirector-rsc ocf:heartbeat:ldirectord \
op monitor interval="10s" timeout="30s"
primitive Nginx-rsc ocf:heartbeat:nginx \
op monitor interval="10s" timeout="30s"
location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
location cli-standby-IP-rsc_mysql IP-rsc_mysql \
rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
location cli-standby-IP-rsc_nginx IP-rsc_nginx \
rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx
IP-rsc_nginx6 IP-rsc_elasticsearch
order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc
Nginx-rsc IP-rsc_elasticsearch
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false"

Do you have some hints that I can follow?

Thanks in advance!

Oscar

Oscar Salvador

2015-01-26 17:22:57 UTC

Oh, I forgot some important details:

root# (S) crm status
============
Last updated: Mon Jan 26 18:21:35 2015
Last change: Sun Jan 25 05:19:13 2015 via crm_resource on lb01
Stack: Heartbeat
Current DC: lb01 (43b2c5a1-9552-4438-962b-6e98a2dd67c7) - partition with
quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
8 Resources configured.
============

Online: [ lb01 lb02 ]

IP-rsc_mysql (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_mysql6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch (ocf::heartbeat:IPaddr2): Started lb02
Ldirector-rsc (ocf::heartbeat:ldirectord): Started lb02
Nginx-rsc (ocf::heartbeat:nginx): Started lb02

This is running on:

Debian 7.8
pacemaker 1.1.7-1

emmanuel segura

2015-01-27 09:10:12 UTC

Maybe you can use sar to check whether your server was short of resources?

Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
(Nginx-rsc:monitor:stderr) Killed
/usr/lib/ocf/resource.d//heartbeat/nginx: 910:
/usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
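
As a rough sketch of what such a check could look like: the snippet below asks sar for memory and load figures around the time of the failure. It assumes sysstat was already installed and collecting data on Jan 25, and that the data files follow the Debian layout /var/log/sysstat/saDD. The helper name sa_file_for_day is made up for illustration.

```shell
# Build the path of the sysstat data file for a given day of the month.
# Assumption: Debian's sysstat layout, /var/log/sysstat/saDD.
sa_file_for_day() {
    day=${1#0}    # drop a leading zero so printf does not read it as octal
    printf '/var/log/sysstat/sa%02d\n' "$day"
}

sa_file=$(sa_file_for_day 25)    # the failure happened on Jan 25

# Only query if sar is installed and data for that day was collected.
if command -v sar >/dev/null 2>&1 && [ -r "$sa_file" ]; then
    sar -r -f "$sa_file" -s 04:00:00 -e 04:20:00   # memory utilisation
    sar -q -f "$sa_file" -s 04:00:00 -e 04:20:00   # run queue and load average
else
    echo "no sar data available in $sa_file" >&2
fi
```

A "Cannot fork" from the shell usually points at exhausted memory or a process limit, so those two reports are a reasonable first place to look.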

--
this is my life and I live it for as long as God wills

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Dejan Muhamedagic

2015-01-27 09:39:26 UTC

Hi,

What exactly did segfault? Do you have a core dump to examine?

Post by Oscar Salvador
As you can see, the last line.
(Nginx-rsc:monitor:stderr) Killed
/usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork

This could be related to the segfault, or due to some other serious
system error.

Post by Oscar Salvador
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
invalid parameter
I see that Pacemaker is complaining about some errors like "invalid
parameter"

That error code is what the nginx RA exited with. It's unusual,
but perhaps also due to the segfault.

Thanks,

Dejan

Oscar Salvador

2015-01-27 14:18:13 UTC

Hi,

I've checked the resource graphs I have, and resource usage looked fine, so I
don't think the problem was caused by high memory use or anything like that.
Unfortunately I don't have a core dump to analyze (I'll enable core dumps for
future cases), so all I have are the logs.

As for the line below, I thought it was the process in charge of monitoring
nginx that was killed by a segfault:

RA output: (Nginx-rsc:monitor:stderr) Segmentation fault

I've checked the Nginx logs and there is nothing relevant there; in fact
there is no activity at all, so I think something internal must have
caused the failure.
I'll enable core dumps; it's the only thing I can do for now.

Thank you very much

Oscar

Dejan Muhamedagic

2015-01-27 16:58:43 UTC

Post by Oscar Salvador
As for the line below, I thought it was the process in charge of monitoring
nginx that was killed by a segfault:
RA output: (Nginx-rsc:monitor:stderr) Segmentation fault

This is just output captured during the execution of the RA
monitor action. Anything run within the RA (which is just a shell
script) could have segfaulted.

Thanks,

Dejan
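
To illustrate the point above: a shell prints "Segmentation fault" on stderr when a command it started is killed by SIGSEGV, and that message is what lrmd captures as RA output. The sketch below simulates a crashing child and shows how the exit status identifies the signal. The child here is only a stand-in, not something the nginx RA actually runs.

```shell
# Simulate a helper command that dies from SIGSEGV, as any command called
# by a resource agent might. "ulimit -c 0" keeps the demo from leaving a
# core file behind.
status=0
sh -c 'ulimit -c 0; kill -s SEGV $$' || status=$?

# A status above 128 means "killed by signal (status - 128)".
echo "exit status: $status"
if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
fi
```

On Linux SIGSEGV is signal 11, so the status reported here is 139. The same check can be applied when running the agent's monitor action by hand, assuming OCF_ROOT and the resource parameters are exported in the environment first.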

Oscar Salvador

2015-01-27 17:12:34 UTC

Post by Dejan Muhamedagic
This is just output captured during the execution of the RA
monitor action. Anything run within the RA (which is just a shell
script) could have segfaulted.

Hi,

Yes, I see.
I've enabled core dumps on the system, so the next time I'll be able to
check what is causing this.

Thank you very much
Oscar Salvador
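
For reference, a hedged sketch of how the core-dump setting could be checked. The limit that matters is the one inherited by the daemon that runs the resource agent (lrmd in the logs above), not the one in an interactive shell. describe_core_limit is a hypothetical helper; pidof and /proc/PID/limits are assumed to be available, as on a typical Linux system.

```shell
# Interpret a core file size limit as printed by "ulimit -c".
# describe_core_limit is a hypothetical helper for illustration.
describe_core_limit() {
    case "$1" in
        0) echo "core dumps disabled" ;;
        *) echo "core dumps enabled (limit: $1)" ;;
    esac
}

# Limit of the current shell.
describe_core_limit "$(ulimit -c)"

# Limit actually in effect for lrmd, if it is running on this host.
lrmd_pid=$(pidof lrmd 2>/dev/null | cut -d' ' -f1) || true
if [ -n "$lrmd_pid" ] && [ -r "/proc/$lrmd_pid/limits" ]; then
    grep 'Max core file size' "/proc/$lrmd_pid/limits" || true
fi

# Where the kernel writes core files.
if [ -r /proc/sys/kernel/core_pattern ]; then
    cat /proc/sys/kernel/core_pattern
fi
```

If the lrmd line still shows a limit of 0, the setting has not reached the cluster daemons yet and would only take effect once they are started with the new limit.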
