Discussion:
[Pacemaker] RHEL 6.3 + fence_vmware_soap + esx 5.1
Mistina Michal
2013-07-13 12:05:50 UTC
Permalink
Hi,

Does somebody know how to set up fence_vmware_soap correctly so that it will
fence a VMware virtual machine on ESX 5.1?



My problem is that the fence_vmware_soap stonith resource agent times out,
and I don't know why.



[root at pcmk1 ~]# crm_verify -L -V

warning: unpack_rsc_op: Processing failed op
vm-fence-pcmk2_last_failure_0 on pcmk1: unknown exec error (-2)

warning: unpack_rsc_op: Processing failed op
vm-fence-pcmk1_last_failure_0 on pcmk2: unknown exec error (-2)

warning: common_apply_stickiness: Forcing vm-fence-pcmk2 away from
pcmk1 after 1000000 failures (max=1000000)

warning: common_apply_stickiness: Forcing vm-fence-pcmk1 away from
pcmk2 after 1000000 failures (max=1000000)



I have a 2-node cluster. If I manually reboot the VMware machine by calling
fence_vmware_soap directly, it works.

[root at pcmk1 ~]# fence_vmware_soap -a x.x.x.x -l administrator -p password -n
"pcmk2" -o reboot -z



My settings are:

[root at pcmk1 ~]# stonith_admin -M -a fence_vmware_soap

<resource-agent name="fence_vmware_soap" shortdesc="Fence agent for VMWare
over SOAP API">

<longdesc>fence_vmware_soap is an I/O Fencing agent which can be used with
the virtual machines managed by VMWare products that have SOAP API v4.1+.

.P

Name of virtual machine (-n / port) has to be used in inventory path format
(e.g. /datacenter/vm/Discovered virtual machine/myMachine). In the cases
when name of yours VM is unique you can use it instead. Alternatively you
can always use UUID (-U / uuid) to access virtual machine.</longdesc>
<vendor-url>http://www.vmware.com</vendor-url>
<parameters>
<parameter name="action" unique="0" required="1">
<getopt mixed="-o, --action=&lt;action&gt;"/>
<content type="string" default="reboot"/>
<shortdesc lang="en">Fencing Action</shortdesc>
</parameter>
<parameter name="ipaddr" unique="0" required="1">
<getopt mixed="-a, --ip=&lt;ip&gt;"/>
<content type="string"/>
<shortdesc lang="en">IP Address or Hostname</shortdesc>
</parameter>
<parameter name="login" unique="0" required="1">
<getopt mixed="-l, --username=&lt;name&gt;"/>
<content type="string"/>
<shortdesc lang="en">Login Name</shortdesc>
</parameter>
<parameter name="passwd" unique="0" required="0">
<getopt mixed="-p, --password=&lt;password&gt;"/>
<content type="string"/>
<shortdesc lang="en">Login password or passphrase</shortdesc>
</parameter>
<parameter name="passwd_script" unique="0" required="0">
<getopt mixed="-S, --password-script=&lt;script&gt;"/>
<content type="string"/>
<shortdesc lang="en">Script to retrieve password</shortdesc>
</parameter>
<parameter name="ssl" unique="0" required="0">
<getopt mixed="-z, --ssl"/>
<content type="boolean"/>
<shortdesc lang="en">SSL connection</shortdesc>
</parameter>
<parameter name="port" unique="0" required="0">
<getopt mixed="-n, --plug=&lt;id&gt;"/>

<content type="string"/>

<shortdesc lang="en">Physical plug number or name of virtual
machine</shortdesc>

</parameter>

<parameter name="uuid" unique="0" required="0">

<getopt mixed="-U, --uuid"/>

<content type="string"/>

<shortdesc lang="en">The UUID of the virtual machine to
fence.</shortdesc> </parameter> <parameter name="ipport" unique="0" required="0"> <getopt mixed="-u, --ipport=&lt;port&gt;"/>

<content type="string"/>

<shortdesc lang="en">TCP port to use for connection with
device</shortdesc> </parameter> <parameter name="verbose" unique="0" required="0"> <getopt mixed="-v, --verbose"/> <content type="boolean"/> <shortdesc lang="en">Verbose mode</shortdesc> </parameter> <parameter name="debug" unique="0" required="0"> <getopt mixed="-D, --debug-file=&lt;debugfile&gt;"/> <content type="string"/> <shortdesc lang="en">Write debug information to given file</shortdesc> </parameter> <parameter name="version" unique="0" required="0"> <getopt mixed="-V, --version"/> <content type="boolean"/> <shortdesc lang="en">Display version information and exit</shortdesc> </parameter> <parameter name="help" unique="0" required="0"> <getopt mixed="-h, --help"/> <content type="boolean"/> <shortdesc lang="en">Display help and exit</shortdesc> </parameter> <parameter name="separator" unique="0" required="0"> <getopt mixed="-C, --separator=&lt;char&gt;"/>

<content type="string" default=","/>

<shortdesc lang="en">Separator for CSV created by operation
list</shortdesc>

</parameter>

<parameter name="power_timeout" unique="0" required="0">

<getopt mixed="--power-timeout"/>

<content type="string" default="20"/>

<shortdesc lang="en">Test X seconds for status change after
ON/OFF</shortdesc>

</parameter>

<parameter name="shell_timeout" unique="0" required="0">

<getopt mixed="--shell-timeout"/>

<content type="string" default="3"/>

<shortdesc lang="en">Wait X seconds for cmd prompt after issuing
command</shortdesc>

</parameter>

<parameter name="login_timeout" unique="0" required="0">

<getopt mixed="--login-timeout"/>

<content type="string" default="5"/>

<shortdesc lang="en">Wait X seconds for cmd prompt after
login</shortdesc>

</parameter>

<parameter name="power_wait" unique="0" required="0">

<getopt mixed="--power-wait"/>

<content type="string" default="0"/>

<shortdesc lang="en">Wait X seconds after issuing ON/OFF</shortdesc>

</parameter>

<parameter name="delay" unique="0" required="0">

<getopt mixed="--delay"/>

<content type="string" default="0"/>

<shortdesc lang="en">Wait X seconds before fencing is
started</shortdesc>

</parameter>

<parameter name="retry_on" unique="0" required="0">

<getopt mixed="--retry-on"/>

<content type="string" default="1"/>

<shortdesc lang="en">Count of attempts to retry power on</shortdesc>

</parameter>

</parameters>

<actions>

<action name="on"/>

<action name="off"/>

<action name="reboot"/>

<action name="status"/>

<action name="list"/>

<action name="monitor"/>

<action name="metadata"/>

<action name="stop" timeout="20s"/>

<action name="start" timeout="20s"/>

</actions>

</resource-agent>



[root at pcmk1 ~]# crm configure show

node pcmk1

node pcmk2

primitive drbd_pg ocf:linbit:drbd \

params drbd_resource="postgres" \

op monitor interval="15" role="Master" \

op monitor interval="16" role="Slave" \

op start interval="0" timeout="240" \

op stop interval="0" timeout="120"

primitive pg_fs ocf:heartbeat:Filesystem \

params device="/dev/vg_local-lv_pgsql/lv_pgsql"
directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime"
fstype="xfs" \

op start interval="0" timeout="60" \

op stop interval="0" timeout="120"

primitive pg_lsb lsb:postgresql-9.2 \

op monitor interval="30" timeout="60" \

op start interval="0" timeout="60" \

op stop interval="0" timeout="60"

primitive pg_lvm ocf:heartbeat:LVM \

params volgrpname="vg_local-lv_pgsql" \

op start interval="0" timeout="30" \

op stop interval="0" timeout="30"

primitive pg_vip ocf:heartbeat:IPaddr2 \

params ip="x.x.x.x" iflabel="pcmkvip" \

op monitor interval="5"

primitive vm-fence-pcmk1 stonith:fence_vmware_soap \

params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="pcmk1" ssl="1" retry_on="10" shell_timeout="20" login_timeout="15"
action="reboot"

primitive vm-fence-pcmk2 stonith:fence_vmware_soap \

params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="pcmk2" ssl="1" retry_on="10" shell_timeout="20" login_timeout="15"
action="reboot"

group PGServer pg_lvm pg_fs pg_lsb pg_vip

ms ms_drbd_pg drbd_pg \

meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"

location l-st-pcmk1 vm-fence-pcmk1 -inf: pcmk1

location l-st-pcmk2 vm-fence-pcmk2 -inf: pcmk2

location master-prefer-node1 pg_vip 50: pcmk1

colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master

order ord_pg inf: ms_drbd_pg:promote PGServer:start

property $id="cib-bootstrap-options" \

dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \

cluster-infrastructure="openais" \

expected-quorum-votes="4" \

stonith-enabled="true" \

no-quorum-policy="ignore" \

maintenance-mode="false"

rsc_defaults $id="rsc-options" \

resource-stickiness="100"



Am I doing something wrong?



Best regards,

Michal Mistina



Andrew Beekhof
2013-07-15 01:05:48 UTC
Permalink
Hi,
Does somebody know how to set up fence_vmware_soap correctly so that it will fence a VMware virtual machine on ESX 5.1?
My problem is that the fence_vmware_soap stonith resource agent times out, and I don't know why.
Nothing in the stonith-ng logs?
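(e.g. something along the lines of
grep stonith-ng /var/log/messages
or the same against /var/log/cluster/corosync.log, depending on where your
logging goes)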
Mistina Michal
2013-07-15 10:56:24 UTC
Permalink
Hi Andrew.

Here is an excerpt from /var/log/messages with the stonith-ng sections.

Jul 15 09:53:38 PCMK1 stonith-ng[1538]: notice: stonith_device_action:
Device vm-fence-pcmk2 not found
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-12
Jul 15 09:53:38 PCMK1 crmd[1542]: info: process_lrm_event: LRM operation
vm-fence-pcmk2_monitor_0 (call=11, rc=7, cib-update=21, confirmed=true) not
running
Jul 15 09:53:38 PCMK1 lrmd: [1539]: info: rsc:vm-fence-pcmk2:12: start
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_device_register:
Added 'vm-fence-pcmk2' to the device list (1 active devices)
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_device_register from lrmd: rc=0
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-1
Jul 15 09:54:13 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 1). Killing with signal SIGTERM (15).
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 2). Killing with signal SIGKILL (9).
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: operation start[12] on
stonith::fence_vmware_soap::vm-fence-pcmk2 for client 1542, its parameters:
passwd=[password] shell_timeout=[20] ssl=[1] login=[administrator]
action=[reboot] crm_feature_set=[3.0.6] retry_on=[10] ipaddr=[x.x.x.x]
port=[T1-PCMK2] login_timeout=[15] CRM_meta_timeout=[20000] : pid [3332]
timed out
Jul 15 09:54:18 PCMK1 crmd[1542]: error: process_lrm_event: LRM operation
vm-fence-pcmk2_start_0 (12) Timed Out (timeout=20000ms)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-vm-fence-pcmk2 (INFINITY)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 24: fail-count-vm-fence-pcmk2=INFINITY
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-vm-fence-pcmk2 (1373874858)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 27: last-failure-vm-fence-pcmk2=1373874858
Jul 15 09:54:21 PCMK1 lrmd: [1539]: info: rsc:vm-fence-pcmk2:13: stop
Jul 15 09:54:21 PCMK1 stonith-ng[1538]: info: stonith_device_remove:
Removed 'vm-fence-pcmk2' from the device list (0 active devices)
Jul 15 09:54:21 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_device_remove from lrmd: rc=0
Jul 15 09:54:21 PCMK1 crmd[1542]: info: process_lrm_event: LRM operation
vm-fence-pcmk2_stop_0 (call=13, rc=0, cib-update=23, confirmed=true) ok

What does this output mean?

Best regards,
Michal Mistina

Andrew Beekhof
2013-07-16 03:23:10 UTC
Permalink
Post by Mistina Michal
Hi Andrew.
Here is an excerpt from /var/log/messages with the stonith-ng sections.
Device vm-fence-pcmk2 not found
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-12
Jul 15 09:53:38 PCMK1 crmd[1542]: info: process_lrm_event: LRM operation
vm-fence-pcmk2_monitor_0 (call=11, rc=7, cib-update=21, confirmed=true) not
running
Jul 15 09:53:38 PCMK1 lrmd: [1539]: info: rsc:vm-fence-pcmk2:12: start
Added 'vm-fence-pcmk2' to the device list (1 active devices)
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_device_register from lrmd: rc=0
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-1
Jul 15 09:54:13 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 1). Killing with signal SIGTERM (15).
you took too long, go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 2). Killing with signal SIGKILL (9).
seriously go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: operation start[12] on
passwd=[password] shell_timeout=[20] ssl=[1] login=[administrator]
action=[reboot] crm_feature_set=[3.0.6] retry_on=[10] ipaddr=[x.x.x.x]
port=[T1-PCMK2] login_timeout=[15] CRM_meta_timeout=[20000] : pid [3332]
timed out
whatever that agent is doing, it's taking too long
or you've not given it long enough
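
If it's the latter, the operation timeouts can be raised on the stonith
resource itself; a rough sketch (the numbers are only an example, not a
tested recommendation):

primitive vm-fence-pcmk2 stonith:fence_vmware_soap \
        params ipaddr="x.x.x.x" login="administrator" passwd="password" \
                port="pcmk2" ssl="1" action="reboot" \
        op start interval="0" timeout="120" \
        op monitor interval="60" timeout="120"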
Mistina Michal
2013-07-18 12:46:19 UTC
Permalink
Hi Andrew.
Thank you for the insight. I tried setting higher timeout limits on the
fence_vmware_soap properties in the CIB database. After I altered these
numbers I no longer saw the SIGTERM or SIGKILL messages.
However, automatic fencing was still not successful.
I don't understand why "manual fencing" with the fence_vmware_soap command
works, while automatic fencing with the same parameters doesn't.
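
(As a side test, I assume the cluster path itself can be exercised with
something like

stonith_admin -B pcmk2

i.e. asking stonith-ng to perform the reboot instead of running the agent by
hand; the split-brain test further below should go through the same path.)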

The corosync.log excerpt further down shows some parsing errors. I think
this relates to unusual characters in the names of the virtual machines
running on the ESX host. That would make sense if an unusual character were
used in the name of the fenced VMware machine, but it isn't. The corosync.log
shows the names of other virtual machines on the ESX host.

Is it safe to say the issue occurred within the fence_vmware_soap resource
agent because it cannot handle something, maybe the names of the virtual
machines? If so, I will try to update that agent. I am using
fence-agents-3.1.5-17.el6.x86_64.
Is there a chance that changing the timeout limits will help? I have a
feeling the timeouts don't solve anything; it times out because of something
else.

This is how the crm configuration looks now:
[root at pcmk1 ~]# crm configure show
node pcmk1
node pcmk2
primitive drbd_pg ocf:linbit:drbd \
params drbd_resource="postgres" \
op monitor interval="15" role="Master" \
op monitor interval="16" role="Slave" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
params device="/dev/vg_local-lv_pgsql/lv_pgsql"
directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime"
fstype="xfs" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
op monitor interval="30" timeout="60" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
params volgrpname="vg_local-lv_pgsql" \
op start interval="0" timeout="30" \
op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
params ip="x.x.x.x" iflabel="tstcapsvip" \
op monitor interval="5"
primitive vm-fence-pcmk1 stonith:fence_vmware_soap \
params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="PCMK1" ssl="1" retry_on="10" shell_timeout="120" login_timeout="120"
action="reboot" \
op start interval="0" timeout="120"
primitive vm-fence-pcmk2 stonith:fence_vmware_soap \
params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="PCMK2" ssl="1" retry_on="10" shell_timeout="120" login_timeout="120"
action="reboot" \
op start interval="0" timeout="120"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
meta target-role="Started"
ms ms_drbd_pg drbd_pg \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
location l-st-pcmk1 vm-fence-pcmk1 -inf: pcmk1
location l-st-pcmk2 vm-fence-pcmk2 -inf: pcmk2
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master
order ord_pg inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
cluster-infrastructure="openais" \
expected-quorum-votes="4" \
stonith-enabled="true" \
no-quorum-policy="ignore" \
maintenance-mode="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"

Command crm_verify -LV shows nothing.
[root at pcmk1 ~]# crm_verify -LV


[root at pcmk1 ~]# crm_mon -1
============
Last updated: Thu Jul 18 14:23:15 2013
Last change: Thu Jul 18 14:20:54 2013 via crm_resource on pcmk1
Stack: openais
Current DC: pcmk2 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 4 expected votes
8 Resources configured.
============

Online: [ pcmk1 pcmk2 ]

Resource Group: PGServer
pg_lvm (ocf::heartbeat:LVM): Started pcmk1
pg_fs (ocf::heartbeat:Filesystem): Started pcmk1
pg_lsb (lsb:postgresql-9.2): Started pcmk1
pg_vip (ocf::heartbeat:IPaddr2): Started pcmk1
Master/Slave Set: ms_drbd_pg [drbd_pg]
Masters: [ pcmk1 ]
Slaves: [ pcmk2 ]
vm-fence-pcmk1 (stonith:fence_vmware_soap): Started pcmk2
vm-fence-pcmk2 (stonith:fence_vmware_soap): Started pcmk1

If I simulate a split-brain by unplugging the cable from the secondary
server pcmk2, /var/log/cluster/corosync.log on the primary server pcmk1
shows this:
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: info:
can_fence_host_with_device: Refreshing port list for vm-fence-pcmk2
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [106.15],4222ac70-92c3-bddf-b524-24d848080cb2
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [107.25],42224003-b614-5eb2-f141-5437fc8319d8
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [107.29],4222719f-7bdc-84b2-4494-848a29c2bd5f
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI - WinXP with SP3 - MSDN
],4222238c-c927-3af1-f2e7-e0dd374d373b
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (31 32): ],4222238c-c927-3af1-f2e7-e0dd374d373b
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI WIN7 32-bit -
MSDN],42223e4a-9541-2326-2a21-3b3532756b47
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 22): [105.233],42220acd-6e21-4380-9b81-89d86f14317d
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (9 17): [106.21],42223377-1443-a44c-1dc0-815c2542898e
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.29],4222394a-70f1-4612-6fcd-4525e13b0cc4
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI W2K8 R2 SP1 STD - MSDN
],4222dc65-6752-b1b4-c0f7-38c94cd5609a
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (30 31): ],4222dc65-6752-b1b4-c0f7-38c94cd5609a
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.52],4222aa80-0fe6-66c4-8d11-fea5f547b566
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [106.14],422249fc-a902-ba5c-deb0-e6db6198b984
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (18 25): [106.2],4222851c-1a9d-021a-4e16-9f8adc5bcc42
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.28],422235ab-83c4-c0b7-812b-bc5b7019aff7
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [106.26],4222bbff-48eb-d60c-0347-430b8d72baa2
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [107.27],4222da62-3c55-37f8-f6b8-239657892914
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI WIN7 64-bit - MSDN
],4222289e-0bd2-4280-c0f4-548fd42e7eab
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (26 27): ],4222289e-0bd2-4280-c0f4-548fd42e7eab
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (17 26): [105.242],42228b51-4ef6-f9b8-b64a-882d68023074
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (20 29): [105.230],42223dcd-22c1-a0f7-c629-5c4489e2c55d
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ W2K3 R2 ENT 32-bit ENG
],4233c1c8-e0f9-26f3-b854-6376ec6b1d1c
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (25 26): ],4233c1c8-e0f9-26f3-b854-6376ec6b1d1c
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (9 17): [106.20],422285ba-6a31-0832-1b38-a910031cd057
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [106.27],4222d166-5647-79a3-d9d8-f90650b6188b
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (21 30): [105.231],4222308c-41c7-02e9-3b20-c6df71838db9
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (25 28): !!! [105.235],422283ac-c5d9-4bf1-96eb-a57d8d18c118
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (29 38): [105.235],422283ac-c5d9-4bf1-96eb-a57d8d18c118
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.13],42222137-0d67-ac9b-e3b6-11fb6d2c33e0
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (17 26): [105.241],4222a40f-d91a-0e4f-2292-ef92c4836bb5
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (17 26): [105.243],42222a9a-7440-6d19-b654-42c08a2abd69
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI W2K8 R2 SP1 ENT - MSDN
],42227507-c4fd-c5aa-b7d7-4ececd284f84
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (30 31): ],42227507-c4fd-c5aa-b7d7-4ececd284f84
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (0 1): [ MEDI_gw_chckpnt
],4222f42e-58c6-dc59-2a00-10041ad5ac08
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (18 19): ],4222f42e-58c6-dc59-2a00-10041ad5ac08
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 22): [105.234],422295e3-644e-8b51-a373-e7f166b2fd5d
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 22): [105.232],42228f9d-615f-1c3b-2158-d3ad08d40357
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (17 26): [105.240],4222b273-68e7-379d-b874-6a47211e9449
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [107.28],4222cbc8-565d-eee1-4430-555b059663d0
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 22): [105.236],4222115e-789a-66dd-95e9-786ec0d84ec0
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (13 21): [107.26],4222fb16-fadc-9031-8e3d-110225505a0f
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.12],42226bf9-8e78-9356-773c-ecde31cf2fa2
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: warning: parse_host_line:
Could not parse (12 20): [106.51],4222ae99-f1d9-9811-d72b-10e875c58f56
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: info:
can_fence_host_with_device: vm-fence-pcmk2 can not fence pcmk2:
dynamic-list
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: info: stonith_command:
Processed st_query from pcmk1: rc=0
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: error: remote_op_done:
Operation reboot of pcmk2 by <no-one> for
pcmk1[7496e5e6-4ab4-4028-b44d-c34c52a3fd04]: Operation timed out
Jul 18 14:31:00 [1498] pcmk1 crmd: info: tengine_stonith_callback:
StonithOp <remote-op state="0" st_target="pcmk2" st_op="reboot" />
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: tengine_stonith_callback:
Stonith operation 4 for pcmk2 failed (Operation timed out): aborting
transition.
Jul 18 14:31:00 [1498] pcmk1 crmd: info: abort_transition_graph:
tengine_stonith_callback:454 - Triggered transition abort (complete=0) :
Stonith failed
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: tengine_stonith_notify:
Peer pcmk2 was not terminated (reboot) by <anyone> for pcmk1: Operation
timed out (ref=ca100580-8e00-49d4-b895-c538139a28dd)
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: run_graph: ====
Transition 2 (Complete=7, Pending=0, Fired=0, Skipped=4, Incomplete=5,
Source=/var/lib/pengine/pe-warn-34.bz2): Stopped
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: unpack_config: On loss
of CCM Quorum: Ignore
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: pe_fence_node: Node
pcmk2 will be fenced because it is un-expectedly down
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: determine_online_status:
Node pcmk2 is unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
drbd_pg:1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
drbd_pg:1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
vm-fence-pcmk1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: stage6: Scheduling Node
pcmk2 for STONITH
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: LogActions: Stop
drbd_pg:1 (pcmk2)
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: LogActions: Stop
vm-fence-pcmk1 (pcmk2)
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: do_state_transition:
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jul 18 14:31:00 [1498] pcmk1 crmd: info: do_te_invoke:
Processing graph 3 (ref=pe_calc-dc-1374150660-46) derived from
/var/lib/pengine/pe-warn-35.bz2
Jul 18 14:31:00 [1498] pcmk1 crmd: info: te_rsc_command:
Initiating action 63: notify drbd_pg:0_pre_notify_stop_0 on pcmk1 (local)
Jul 18 14:31:00 pcmk1 lrmd: [1495]: info: rsc:drbd_pg:0:28: notify
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: te_fence_node:
Executing reboot fencing operation (53) on pcmk2 (timeout=60000)
Jul 18 14:31:00 [1494] pcmk1 stonith-ng: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
pcmk2: d69db4e3-7d3b-4bee-9bd5-aa7afb05c358
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: process_pe_message:
Transition 3: WARNINGs found during PE processing. PEngine Input stored in:
/var/lib/pengine/pe-warn-35.bz2
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: process_pe_message:
Configuration WARNINGs found during PE processing. Please run "crm_verify
-L" to identify issues.
Jul 18 14:31:01 [1498] pcmk1 crmd: info: process_lrm_event:
LRM operation drbd_pg:0_notify_0 (call=28, rc=0, cib-update=0,
confirmed=true) ok


Regards,
Michal Mistina
-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Tuesday, July 16, 2013 5:23 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] RHEL 6.3 + fence_vmware_soap + esx 5.1
Post by Mistina Michal
Hi Andrew.
Here is the ommited /var/log/messages with stonigh-ng sections.
Device vm-fence-pcmk2 not found
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-12
Jul 15 09:53:38 PCMK1 crmd[1542]: info: process_lrm_event: LRM operation
vm-fence-pcmk2_monitor_0 (call=11, rc=7, cib-update=21,
rsc:vm-fence-pcmk2:12: start
Added 'vm-fence-pcmk2' to the device list (1 active devices)
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_device_register from lrmd: rc=0
Jul 15 09:53:38 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_execute from lrmd: rc=-1
Jul 15 09:54:13 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 1). Killing with signal SIGTERM (15).
you took too long, go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 2). Killing with signal SIGKILL (9).
seriously go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: operation start[12] on
passwd=[password] shell_timeout=[20] ssl=[1] login=[administrator]
action=[reboot] crm_feature_set=[3.0.6] retry_on=[10] ipaddr=[x.x.x.x]
port=[T1-PCMK2] login_timeout=[15] CRM_meta_timeout=[20000] : pid
[3332] timed out
whatever that agent is doing, its taking to long or you've not given it long
enough
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 crmd[1542]: error: process_lrm_event: LRM operation
vm-fence-pcmk2_start_0 (12) Timed Out (timeout=20000ms)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-vm-fence-pcmk2 (INFINITY)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 24: fail-count-vm-fence-pcmk2=INFINITY
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-vm-fence-pcmk2 (1373874858)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 27: last-failure-vm-fence-pcmk2=1373874858
Jul 15 09:54:21 PCMK1 lrmd: [1539]: info: rsc:vm-fence-pcmk2:13: stop
Removed 'vm-fence-pcmk2' from the device list (0 active devices)
Jul 15 09:54:21 PCMK1 stonith-ng[1538]: info: stonith_command: Processed
st_device_remove from lrmd: rc=0
Jul 15 09:54:21 PCMK1 crmd[1542]: info: process_lrm_event: LRM operation
vm-fence-pcmk2_stop_0 (call=13, rc=0, cib-update=23, confirmed=true) ok
What does this output mean?
Best regards,
Michal Mistina
-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Monday, July 15, 2013 3:06 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] RHEL 6.3 + fence_vmware_soap + esx 5.1
Post by Mistina Michal
Hi,
Does somebody know how to set up fence_vmware_soap correctly so that it
will start fencing vmware machine in the esx 5.1?
Post by Mistina Michal
My problem is the fence_vmware_soap resource agent for stonith timed out.
Don't know why.
Nothing in the stonith-ng logs?
Post by Mistina Michal
[root at pcmk1 ~]# crm_verify -L -V
warning: unpack_rsc_op: Processing failed op
vm-fence-pcmk2_last_failure_0 on pcmk1: unknown exec error (-2)
Post by Mistina Michal
warning: unpack_rsc_op: Processing failed op
vm-fence-pcmk1_last_failure_0 on pcmk2: unknown exec error (-2)
Post by Mistina Michal
warning: common_apply_stickiness: Forcing vm-fence-pcmk2 away from
pcmk1 after 1000000 failures (max=1000000)
Post by Mistina Michal
warning: common_apply_stickiness: Forcing vm-fence-pcmk1 away from
pcmk2 after 1000000 failures (max=1000000)
Post by Mistina Michal
I have 2 node cluster. If I tried to manually reboot vmware machine by
calling fence_vmware_soap it worked.
Post by Mistina Michal
[root at pcmk1 ~]# fence_vmware_soap -a x.x.x.x -l administrator -p
password -n "pcmk2" -o reboot -z
My settings are.
[root at pcmk1 ~]# stonith_admin -M -a fence_vmware_soap <resource-agent
name="fence_vmware_soap" shortdesc="Fence agent for VMWare over SOAP
API"> <longdesc>fence_vmware_soap is an I/O Fencing agent which can
be used
with the virtual machines managed by VMWare products that have SOAP
API v4.1+.
Post by Mistina Michal
.P
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrew Beekhof
2013-07-22 03:23:50 UTC
Permalink
Post by Mistina Michal
Hi Andrew.
Thank you for the insight. I tried to set higher timeout limits within the
fence_vmware_soap properties in the CIB database. After I had altered these
numbers I didn't experience SIGTERM or SIGKILL any more.
However, automatic fencing was still not successful.
I don't understand why "manual fencing" using the fence_vmware_soap command
works, while automatic fencing with the same parameters doesn't.
Because it's not using the same parameters.
Until 1.1.10-rc6, Pacemaker used a calculated value for port and action - regardless of what you specified.

Look in "man stonithd" or the online docs for details on pcmk_host_map.
You'd probably want "pcmk1:PCMK1;pcmk2:PCMK2;"

Or just name the hosts in lowercase in vmware
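For illustration, a minimal sketch of what such a mapping could look like on
one of the existing stonith primitives (all values are placeholders; the
right-hand side of each pcmk_host_map pair has to match the VM name exactly
as vCenter reports it):

primitive vm-fence-pcmk2 stonith:fence_vmware_soap \
        params ipaddr="x.x.x.x" login="administrator" passwd="password" \
        ssl="1" pcmk_host_map="pcmk1:PCMK1;pcmk2:PCMK2" \
        op start interval="0" timeout="120"

The idea is that the node name the cluster wants to fence is translated to
the mapped value before it is handed to the agent as the port.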
Post by Mistina Michal
The corosync.log attached further in the text shows some parsing errors. I
think this is related to unusual characters used in the names of the virtual
machines running on the ESX host. That would make sense if an unusual
character were used in the name of the fenced vmware machine, but it isn't;
the corosync.log shows the names of other virtual machines on the ESX host.
Is it safe to say the issue occurred within the fence_vmware_soap resource
agent because it cannot handle something, maybe the names of the virtual
machines? If so, I will try to update that agent. I am using version
fence-agents-3.1.5-17.el6.x86_64.
Is there a chance that changing the timeout limits will help the situation?
I have a feeling the timeouts don't solve anything; it times out because of
something else.
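(As an aside, one way to see exactly which names the agent gets back from
vCenter, i.e. the same list the "Could not parse" messages below are produced
from, is to run its list action by hand. A sketch, using the same credentials
as above:

[root at pcmk1 ~]# fence_vmware_soap -a x.x.x.x -l administrator -p password -z -o list

Each output line is a VM name and its UUID separated by a comma, so names
containing commas, brackets or leading spaces are the likely candidates for
tripping the parser.)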
This is how the crm configuration looks now....
[root at pcmk1 ~]# crm configure show
node pcmk1
node pcmk2
primitive drbd_pg ocf:linbit:drbd \
params drbd_resource="postgres" \
op monitor interval="15" role="Master" \
op monitor interval="16" role="Slave" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
params device="/dev/vg_local-lv_pgsql/lv_pgsql"
directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime"
fstype="xfs" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
op monitor interval="30" timeout="60" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
params volgrpname="vg_local-lv_pgsql" \
op start interval="0" timeout="30" \
op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
params ip="x.x.x.x" iflabel="tstcapsvip" \
op monitor interval="5"
primitive vm-fence-pcmk1 stonith:fence_vmware_soap \
params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="PCMK1" ssl="1" retry_on="10" shell_timeout="120" login_timeout="120"
action="reboot" \
op start interval="0" timeout="120"
primitive vm-fence-pcmk2 stonith:fence_vmware_soap \
params ipaddr="x.x.x.x" login="administrator" passwd="password"
port="PCMK2" ssl="1" retry_on="10" shell_timeout="120" login_timeout="120"
action="reboot" \
op start interval="0" timeout="120"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
meta target-role="Started"
ms ms_drbd_pg drbd_pg \
meta master-max="1" master-node-max="1" clone-max="2"
clone-node-max="1" notify="true"
location l-st-pcmk1 vm-fence-pcmk1 -inf: pcmk1
location l-st-pcmk2 vm-fence-pcmk2 -inf: pcmk2
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master
order ord_pg inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
cluster-infrastructure="openais" \
expected-quorum-votes="4" \
stonith-enabled="true" \
no-quorum-policy="ignore" \
maintenance-mode="false"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
Command crm_verify -LV shows nothing.
[root at pcmk1 ~]# crm_verify -LV
[root at pcmk1 ~]# crm_mon -1
============
Last updated: Thu Jul 18 14:23:15 2013
Last change: Thu Jul 18 14:20:54 2013 via crm_resource on pcmk1
Stack: openais
Current DC: pcmk2 - partition WITHOUT quorum
Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
2 Nodes configured, 4 expected votes
8 Resources configured.
============
Online: [ pcmk1 pcmk2 ]
Resource Group: PGServer
pg_lvm (ocf::heartbeat:LVM): Started pcmk1
pg_fs (ocf::heartbeat:Filesystem): Started pcmk1
pg_lsb (lsb:postgresql-9.2): Started pcmk1
pg_vip (ocf::heartbeat:IPaddr2): Started pcmk1
Master/Slave Set: ms_drbd_pg [drbd_pg]
Masters: [ pcmk1 ]
Slaves: [ pcmk2 ]
vm-fence-pcmk1 (stonith:fence_vmware_soap): Started pcmk2
vm-fence-pcmk2 (stonith:fence_vmware_soap): Started pcmk1
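As a quick cross-check, sketched here on the assumption that stonith_admin
behaves as its help text describes, stonith-ng can be asked which devices it
believes are able to fence each node; if the agent's port list cannot be
parsed, a node may not show up against any device:

[root at pcmk1 ~]# stonith_admin --list pcmk1
[root at pcmk1 ~]# stonith_admin --list pcmk2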
If I simulate a split-brain by unplugging the cable from the secondary server
pcmk2, /var/log/cluster/corosync.log on the primary server pcmk1 says
this...
can_fence_host_with_device: Refreshing port list for vm-fence-pcmk2
Could not parse (13 21): [106.15],4222ac70-92c3-bddf-b524-24d848080cb2
Could not parse (13 21): [107.25],42224003-b614-5eb2-f141-5437fc8319d8
Could not parse (13 21): [107.29],4222719f-7bdc-84b2-4494-848a29c2bd5f
Could not parse (0 1): [ MEDI - WinXP with SP3 - MSDN
],4222238c-c927-3af1-f2e7-e0dd374d373b
Could not parse (31 32): ],4222238c-c927-3af1-f2e7-e0dd374d373b
Could not parse (0 1): [ MEDI WIN7 32-bit -
MSDN],42223e4a-9541-2326-2a21-3b3532756b47
Could not parse (13 22): [105.233],42220acd-6e21-4380-9b81-89d86f14317d
Could not parse (9 17): [106.21],42223377-1443-a44c-1dc0-815c2542898e
Could not parse (12 20): [106.29],4222394a-70f1-4612-6fcd-4525e13b0cc4
Could not parse (0 1): [ MEDI W2K8 R2 SP1 STD - MSDN
],4222dc65-6752-b1b4-c0f7-38c94cd5609a
Could not parse (30 31): ],4222dc65-6752-b1b4-c0f7-38c94cd5609a
Could not parse (12 20): [106.52],4222aa80-0fe6-66c4-8d11-fea5f547b566
Could not parse (13 21): [106.14],422249fc-a902-ba5c-deb0-e6db6198b984
Could not parse (18 25): [106.2],4222851c-1a9d-021a-4e16-9f8adc5bcc42
Could not parse (12 20): [106.28],422235ab-83c4-c0b7-812b-bc5b7019aff7
Could not parse (13 21): [106.26],4222bbff-48eb-d60c-0347-430b8d72baa2
Could not parse (13 21): [107.27],4222da62-3c55-37f8-f6b8-239657892914
Could not parse (0 1): [ MEDI WIN7 64-bit - MSDN
],4222289e-0bd2-4280-c0f4-548fd42e7eab
Could not parse (26 27): ],4222289e-0bd2-4280-c0f4-548fd42e7eab
Could not parse (17 26): [105.242],42228b51-4ef6-f9b8-b64a-882d68023074
Could not parse (20 29): [105.230],42223dcd-22c1-a0f7-c629-5c4489e2c55d
Could not parse (0 1): [ W2K3 R2 ENT 32-bit ENG
],4233c1c8-e0f9-26f3-b854-6376ec6b1d1c
Could not parse (25 26): ],4233c1c8-e0f9-26f3-b854-6376ec6b1d1c
Could not parse (9 17): [106.20],422285ba-6a31-0832-1b38-a910031cd057
Could not parse (13 21): [106.27],4222d166-5647-79a3-d9d8-f90650b6188b
Could not parse (21 30): [105.231],4222308c-41c7-02e9-3b20-c6df71838db9
Could not parse (25 28): !!! [105.235],422283ac-c5d9-4bf1-96eb-a57d8d18c118
Could not parse (29 38): [105.235],422283ac-c5d9-4bf1-96eb-a57d8d18c118
Could not parse (12 20): [106.13],42222137-0d67-ac9b-e3b6-11fb6d2c33e0
Could not parse (17 26): [105.241],4222a40f-d91a-0e4f-2292-ef92c4836bb5
Could not parse (17 26): [105.243],42222a9a-7440-6d19-b654-42c08a2abd69
Could not parse (0 1): [ MEDI W2K8 R2 SP1 ENT - MSDN
],42227507-c4fd-c5aa-b7d7-4ececd284f84
Could not parse (30 31): ],42227507-c4fd-c5aa-b7d7-4ececd284f84
Could not parse (0 1): [ MEDI_gw_chckpnt
],4222f42e-58c6-dc59-2a00-10041ad5ac08
Could not parse (18 19): ],4222f42e-58c6-dc59-2a00-10041ad5ac08
Could not parse (13 22): [105.234],422295e3-644e-8b51-a373-e7f166b2fd5d
Could not parse (13 22): [105.232],42228f9d-615f-1c3b-2158-d3ad08d40357
Could not parse (17 26): [105.240],4222b273-68e7-379d-b874-6a47211e9449
Could not parse (13 21): [107.28],4222cbc8-565d-eee1-4430-555b059663d0
Could not parse (13 22): [105.236],4222115e-789a-66dd-95e9-786ec0d84ec0
Could not parse (13 21): [107.26],4222fb16-fadc-9031-8e3d-110225505a0f
Could not parse (12 20): [106.12],42226bf9-8e78-9356-773c-ecde31cf2fa2
Could not parse (12 20): [106.51],4222ae99-f1d9-9811-d72b-10e875c58f56
dynamic-list
Processed st_query from pcmk1: rc=0
Operation reboot of pcmk2 by <no-one> for
pcmk1[7496e5e6-4ab4-4028-b44d-c34c52a3fd04]: Operation timed out
StonithOp <remote-op state="0" st_target="pcmk2" st_op="reboot" />
Stonith operation 4 for pcmk2 failed (Operation timed out): aborting
transition.
Stonith failed
Peer pcmk2 was not terminated (reboot) by <anyone> for pcmk1: Operation
timed out (ref=ca100580-8e00-49d4-b895-c538139a28dd)
Jul 18 14:31:00 [1498] pcmk1 crmd: notice: run_graph: ====
Transition 2 (Complete=7, Pending=0, Fired=0, Skipped=4, Incomplete=5,
Source=/var/lib/pengine/pe-warn-34.bz2): Stopped
State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: unpack_config: On loss
of CCM Quorum: Ignore
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: pe_fence_node: Node
pcmk2 will be fenced because it is un-expectedly down
Node pcmk2 is unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
drbd_pg:1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
drbd_pg:1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Action
vm-fence-pcmk1_stop_0 on pcmk2 is unrunnable (offline)
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: custom_action: Marking
node pcmk2 unclean
Jul 18 14:31:00 [1497] pcmk1 pengine: warning: stage6: Scheduling Node
pcmk2 for STONITH
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: LogActions: Stop
drbd_pg:1 (pcmk2)
Jul 18 14:31:00 [1497] pcmk1 pengine: notice: LogActions: Stop
vm-fence-pcmk1 (pcmk2)
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Processing graph 3 (ref=pe_calc-dc-1374150660-46) derived from
/var/lib/pengine/pe-warn-35.bz2
Initiating action 63: notify drbd_pg:0_pre_notify_stop_0 on pcmk1 (local)
Jul 18 14:31:00 pcmk1 lrmd: [1495]: info: rsc:drbd_pg:0:28: notify
Executing reboot fencing operation (53) on pcmk2 (timeout=60000)
initiate_remote_stonith_op: Initiating remote operation reboot for
pcmk2: d69db4e3-7d3b-4bee-9bd5-aa7afb05c358
/var/lib/pengine/pe-warn-35.bz2
Configuration WARNINGs found during PE processing. Please run "crm_verify
-L" to identify issues.
LRM operation drbd_pg:0_notify_0 (call=28, rc=0, cib-update=0,
confirmed=true) ok
Regards,
Michal Mistina
-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Tuesday, July 16, 2013 5:23 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] RHEL 6.3 + fence_vmware_soap + esx 5.1
Post by Mistina Michal
Hi Andrew.
Here is the omitted /var/log/messages with the stonith-ng sections.
Device vm-fence-pcmk2 not found
Processed
Post by Mistina Michal
st_execute from lrmd: rc=-12
Jul 15 09:53:38 PCMK1 crmd[1542]: info: process_lrm_event: LRM
operation
Post by Mistina Michal
vm-fence-pcmk2_monitor_0 (call=11, rc=7, cib-update=21,
rsc:vm-fence-pcmk2:12: start
Added 'vm-fence-pcmk2' to the device list (1 active devices)
Processed
Post by Mistina Michal
st_device_register from lrmd: rc=0
Processed
Post by Mistina Michal
st_execute from lrmd: rc=-1
Jul 15 09:54:13 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 1). Killing with signal SIGTERM (15).
you took too long, go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: vm-fence-pcmk2:start process (PID
3332) timed out (try 2). Killing with signal SIGKILL (9).
seriously go away
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 lrmd: [1539]: WARN: operation start[12] on
stonith::fence_vmware_soap::vm-fence-pcmk2 for client 1542, its
passwd=[password] shell_timeout=[20] ssl=[1] login=[administrator]
action=[reboot] crm_feature_set=[3.0.6] retry_on=[10] ipaddr=[x.x.x.x]
port=[T1-PCMK2] login_timeout=[15] CRM_meta_timeout=[20000] : pid
[3332] timed out
whatever that agent is doing, it's taking too long or you've not given it long
enough
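In other words, the start timeout and the agent's own timeouts all have to
cover however long the SOAP calls really take. A rough sketch of a more
generous definition (values illustrative; essentially the values used in the
configuration shown above):

primitive vm-fence-pcmk2 stonith:fence_vmware_soap \
        params ipaddr="x.x.x.x" login="administrator" passwd="password" \
        port="pcmk2" ssl="1" shell_timeout="120" login_timeout="120" \
        op start interval="0" timeout="120"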
Post by Mistina Michal
Jul 15 09:54:18 PCMK1 crmd[1542]: error: process_lrm_event: LRM
operation
Post by Mistina Michal
vm-fence-pcmk2_start_0 (12) Timed Out (timeout=20000ms)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-vm-fence-pcmk2 (INFINITY)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 24: fail-count-vm-fence-pcmk2=INFINITY
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_ais_dispatch: Update
relayed from pcmk2
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-vm-fence-pcmk2 (1373874858)
Jul 15 09:54:18 PCMK1 attrd[1540]: notice: attrd_perform_update: Sent
update 27: last-failure-vm-fence-pcmk2=1373874858
Jul 15 09:54:21 PCMK1 lrmd: [1539]: info: rsc:vm-fence-pcmk2:13: stop
Removed 'vm-fence-pcmk2' from the device list (0 active devices)
Processed
Post by Mistina Michal
st_device_remove from lrmd: rc=0
Jul 15 09:54:21 PCMK1 crmd[1542]: info: process_lrm_event: LRM
operation
Post by Mistina Michal
vm-fence-pcmk2_stop_0 (call=13, rc=0, cib-update=23, confirmed=true) ok
What does this output mean?
Best regards,
Michal Mistina
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Mistina Michal
2013-07-25 11:38:35 UTC
Permalink
Hi Andrew.
You are right. I renamed the vmware machines to lowercase and it worked. I
also tested a dash and a bracket [ ; with those unusual characters, stonith
failed.
However, my vmware machine now gets rebooted over and over, infinitely. I've
found somewhere on the web that this was a bug in pacemaker 1.1.7 which was
fixed in version 1.1.8, so I will try to compile pacemaker from source to get
the newest version.
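For reference, a rough sketch of the usual source build; the repository URL
and steps are the standard ones, but the exact tag name is an assumption and
on RHEL it may be cleaner to build RPM packages instead:

git clone https://github.com/ClusterLabs/pacemaker.git
cd pacemaker
git checkout Pacemaker-1.1.8    # hypothetical tag for the wanted release
./autogen.sh && ./configure && make && make install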

Thank you.

Best regards,
Michal Mistina
Post by Mistina Michal
Hi Andrew.
Thank you for the insight. I tried to set higher timeout limits within the
fence_vmware_soap properties in the CIB database. After I had altered these
numbers I didn't experience SIGTERM or SIGKILL any more.
However, automatic fencing was still not successful.
I don't understand why "manual fencing" using the fence_vmware_soap command
works, while automatic fencing with the same parameters doesn't.
Because it's not using the same parameters.
Until 1.1.10-rc6, Pacemaker used a calculated value for port and action -
regardless of what you specified.

Look in "man stonithd" or the online docs for details on pcmk_host_map.
You'd probably want "pcmk1:PCMK1;pcmk2:PCMK2;"

Or just name the hosts in lowercase in vmware
_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org