[Pacemaker] IPaddr2 "Unknown interface" error caused a failover that didn't work

Luc Paulin

2015-09-30 18:15:55 UTC

Hi everyone,
I experienced a weird issue last night: our cluster tried to fail over
because of an "Unknown interface" error.

It looks like when the IPaddr2 monitor tried to check the status of eth0, it
didn't find the device. Both nodes are VMs. I haven't found any reason why
eth0 would have "disappeared".
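
To catch the device going missing outside of the cluster's own monitor, a
minimal sketch of an independent check is shown here. It assumes a Linux
guest where every network device appears as an entry under /sys/class/net;
the function name and the sysfs_root parameter are made up for illustration
and are not part of IPaddr2 or findif.

```python
import os
import sys


def interface_exists(name, sysfs_root="/sys/class/net"):
    """Return True if the kernel currently exposes a network device `name`.

    Entries under /sys/class/net are symlinks to device directories, and
    os.path.isdir() follows symlinks, so a present device yields True.
    `sysfs_root` is a hypothetical parameter that only exists to make the
    function testable against a scratch directory.
    """
    return os.path.isdir(os.path.join(sysfs_root, name))


if __name__ == "__main__":
    # Default to the NIC named in the failing resource (nic=eth0).
    nic = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    state = "present" if interface_exists(nic) else "MISSING"
    print(nic, state)
```

Run from cron or a loop with timestamps, something like this could show
whether eth0 really vanished at 21:25:04 or whether only the agent's lookup
failed.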

<LOG NODE1>
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: Unknown
interface [eth0] No such device.
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: [findif] failed
Sep 29 21:25:05 node-01 crmd[3369]: notice: process_lrm_event: Operation
vip_v207_174_monitor_10000: not configured (node=node-01, call=91, rc=6,
cib-update=73, confirmed=false)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_cs_dispatch: Update
relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-vip_v207_174 (2)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_perform_update: Sent
update 41: fail-count-vip_v207_174=2
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_cs_dispatch: Update
relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-vip_v207_174 (1443576306)
Sep 29 21:25:06 node-01 attrd[3367]: notice: attrd_perform_update: Sent
update 43: last-failure-vip_v207_174=1443576306
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
fwcorp-mailto-sysadmin_stop_0: ok (node=node-01, call=110, rc=0,
cib-update=74, confirmed=true)
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
change-default-fw_stop_0: ok (node=node-01, call=112, rc=0, cib-update=75,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v254_230)[4259]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
vip_v254_230_stop_0: ok (node=node-01, call=114, rc=0, cib-update=76,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v27_1)[4313]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
vip_v27_1_stop_0: ok (node=node-01, call=116, rc=0, cib-update=77,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v26_1)[4366]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
vip_v26_1_stop_0: ok (node=node-01, call=118, rc=0, cib-update=78,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v207_174)[4419]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]: notice: process_lrm_event: Operation
vip_v207_174_stop_0: ok (node=node-01, call=120, rc=0, cib-update=79,
confirmed=true)
</LOG NODE1>

<LOG NODE2>
Sep 29 21:22:48 node-02 crmd[3241]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:22:48 node-02 pengine[3240]: notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:22:48 node-02 pengine[3240]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:22:48 node-02 crmd[3241]: notice: run_graph: Transition 14769
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-786.bz2): Complete
Sep 29 21:22:48 node-02 crmd[3241]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 29 21:22:48 node-02 pengine[3240]: notice: process_pe_message:
Calculated Transition 14769: /var/lib/pacemaker/pengine/pe-input-786.bz2
Sep 29 21:25:06 node-02 crmd[3241]: warning: update_failcount: Updating
failcount for vip_v207_174 on node-01 after failed monitor: rc=6
(update=value++, time=1443576306)
Sep 29 21:25:06 node-02 crmd[3241]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:25:06 node-02 pengine[3240]: notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:06 node-02 pengine[3240]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:25:06 node-02 pengine[3240]: warning: unpack_rsc_op_failure:
Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:06 node-02 pengine[3240]: error: unpack_rsc_op: Preventing
vip_v207_174 from re-starting anywhere: operation monitor failed 'not
configured' (6)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
vip_v207_174#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
vip_v26_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
vip_v27_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
vip_v254_230#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
change-default-fw#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: LogActions: Stop
fwcorp-mailto-sysadmin#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]: notice: process_pe_message:
Calculated Transition 14770: /var/lib/pacemaker/pengine/pe-input-787.bz2
Sep 29 21:25:06 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 16: stop fwcorp-mailto-sysadmin_stop_0 on node-01
Sep 29 21:25:06 node-02 crmd[3241]: notice: abort_transition_graph:
Transition aborted by status-node-01-fail-count-vip_v207_174,
fail-count-vip_v207_174=2: Transient attribute change (modify cib=0.94.107,
source=te_update_diff:391,
path=/cib/status/node_state[@id='node-01']/transient_attributes[@id='node-01']/instance_attributes[@id='status-node-01']/nvpair[@id='status-node-01-fail-count-vip_v207_174'],
0)
Sep 29 21:25:07 node-02 crmd[3241]: notice: run_graph: Transition 14770
(Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-787.bz2): Stopped
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:25:07 node-02 pengine[3240]: notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:07 node-02 pengine[3240]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:25:07 node-02 pengine[3240]: warning: unpack_rsc_op_failure:
Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:07 node-02 pengine[3240]: error: unpack_rsc_op: Preventing
vip_v207_174 from re-starting anywhere: operation monitor failed 'not
configured' (6)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop
vip_v207_174#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop
vip_v26_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop
vip_v27_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop
vip_v254_230#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]: notice: LogActions: Stop
change-default-fw#011(node-01)
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 14: stop change-default-fw_stop_0 on node-01
Sep 29 21:25:07 node-02 pengine[3240]: notice: process_pe_message:
Calculated Transition 14771: /var/lib/pacemaker/pengine/pe-input-788.bz2
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 13: stop vip_v254_230_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 12: stop vip_v27_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 11: stop vip_v26_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: te_rsc_command: Initiating
action 3: stop vip_v207_174_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]: notice: run_graph: Transition 14771
(Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-788.bz2): Complete
Sep 29 21:25:07 node-02 crmd[3241]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
</LOG NODE2>
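
In both logs the failed monitor reports rc=6, which is the OCF "not
configured" return code, and the pengine lines on node-02 show the reaction
to it ("Preventing vip_v207_174 from re-starting anywhere"). A small sketch
for pulling the operation name and return code out of process_lrm_event
entries follows. The regular expression and the function name are
illustrative assumptions, and it expects each wrapped log entry to be joined
back onto a single line first.

```python
import re

# Return codes defined by the OCF resource agent API (subset).
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Matches e.g. "process_lrm_event: Operation vip_x_monitor_10000: ... rc=6,"
LRM_EVENT = re.compile(
    r"process_lrm_event:\s+Operation\s+(?P<op>\S+):.*?\brc=(?P<rc>\d+)"
)


def parse_lrm_event(line):
    """Return (operation, rc, ocf_name) for a crmd process_lrm_event line,
    or None if the line is some other kind of log entry."""
    m = LRM_EVENT.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    return m.group("op"), rc, OCF_RC.get(rc, "unknown")
```

Fed the 21:25:05 entry from node-01, this yields the monitor operation with
rc 6 (OCF_ERR_CONFIGURED), while the later stop operations come back with
rc 0.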

I found some posts that say to run sysctl -w
net.ipv4.conf.all.promote_secondaries=1 so that secondary addresses are not
removed when the primary address goes away, but in this case eth0 has a
single address, which is managed through IPaddr2 in the crm configuration.
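
For reference, the current value of that setting can be read without
changing it. This is a minimal sketch, assuming the usual Linux mapping of
sysctl names onto files under /proc/sys; the function name and the proc_root
parameter are hypothetical, and the simple dot-to-slash translation does not
handle interface names that themselves contain dots (e.g. VLAN devices).

```python
import os


def read_sysctl(name, proc_root="/proc/sys"):
    """Read a sysctl value by its dotted name, e.g.
    "net.ipv4.conf.all.promote_secondaries".

    `proc_root` is a hypothetical parameter so the function can be pointed
    at a scratch directory instead of the live /proc/sys tree.
    """
    path = os.path.join(proc_root, *name.split("."))
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    key = "net.ipv4.conf.all.promote_secondaries"
    print(key, "=", read_sysctl(key))
```

Running it on both nodes would confirm whether the two VMs are set the same
way before deciding whether the sysctl is relevant at all.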

Here's the cluster configuration:

<CONFIGURATION>
Cluster Name: nodecluster1
Corosync Nodes:
node-01 node-02
Pacemaker Nodes:
node-01 node-02

Resources:
 Group: lbpcivip
  Resource: vip_v207_174 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.x.174 cidr_netmask=27 broadcast=x.x.x.191 nic=eth0
   Operations: monitor interval=10s (vip_v207_174-monitor-interval-10s)
  Resource: vip_v26_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.26.1
   Operations: monitor interval=10s (vip_v26_1-monitor-interval-10s)
  Resource: vip_v27_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.27.1
   Operations: monitor interval=10s (vip_v27_1-monitor-interval-10s)
  Resource: vip_v254_230 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.254.230
   Operations: monitor interval=10s (vip_v254_230-monitor-interval-10s)
  Resource: change-default-fw (class=lsb type=fwdefaultgw)
   Operations: monitor interval=60s (change-default-fw-monitor-interval-60s)
  Resource: fwcorp-mailto-sysadmin (class=ocf provider=heartbeat type=MailTo)
   Attributes: email=***@touchtunes.com subject="[node - Clustered services]"
   Operations: monitor interval=60s (fwcorp-mailto-sysadmin-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.11-97629de
last-lrm-refresh: 1412269491
no-quorum-policy: ignore
stonith-enabled: false
</CONFIGURATION>

Does anyone have a suggestion on how I can solve this issue? Why didn't the
failover from node-01 to node-02 work?

If more information is required, let me know. Any suggestions would be
appreciated!

Thanx!

--
!!!!!
( o o )
--------------oOO----(_)----OOo--------------
Luc Paulin
email: paulinster(at)gmail.com
Skype: paulinster