Discussion:
[Pacemaker] Two node cluster and no hardware device for stonith.
Andrea
2015-01-21 13:13:46 UTC
Permalink
Hi All,

I have a question about stonith.
In my scenario, I have to create a 2-node cluster, but I don't have any
hardware device for stonith: no APC, no IPMI, etc., none of the devices returned
by "pcs stonith list".
So, is there an option to do something?
This is my scenario:
- 2 nodes cluster
serverHA1
serverHA2

- Software
Centos 6.6
pacemaker.x86_64 1.1.12-4.el6
cman.x86_64 3.0.12.1-68.el6
corosync.x86_64 1.4.7-1.el6

-NO hardware device for stonith!

- Cluster creation ([ALL] operation done on all nodes, [ONE] operation done
on only one node)
[ALL] systemctl start pcsd.service
[ALL] systemctl enable pcsd.service
[ONE] pcs cluster auth serverHA1 serverHA2
[ALL] echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
[ONE] pcs cluster setup --name MyCluHA serverHA1 serverHA2
[ONE] pcs property set stonith-enabled=false
[ONE] pcs property set no-quorum-policy=ignore
[ONE] pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000
host_list=192.168.56.1 --clone
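
A ping clone like this is usually paired with a location rule on the pingd
attribute, so that resources move away from a node that loses connectivity to
the gateway. A sketch, assuming a hypothetical resource named MyResource:

[ONE] pcs constraint location MyResource rule score=-INFINITY pingd lt 1 or not_defined pingd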


In my test, when I simulate a network failure, split brain occurs, and when
the network comes back, one node kills the other node.
-log on node 1:
Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2

-log on node 2:
Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2


Is there a method to restart pacemaker when the network comes back, instead of
killing it?

Thanks
Andrea


_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Digimer
2015-01-21 16:18:05 UTC
Permalink
You really need a fence device, there isn't a way around it. By
definition, when a node needs to be fenced, it is in an unknown state
and it cannot be expected to operate predictably.

If you're using real hardware, then you can use a switched PDU
(network-connected power bar with individual outlet control) to do
fencing. I use the APC AP7900 in all my clusters and it works perfectly.
I know that some other brands work, too.

If your machines are virtual machines, then you can do fencing by talking
to the hypervisor. In this case, one node calls the host of the other
node and asks it to be terminated (fence_virsh and fence_xvm for KVM/Xen
systems, fence_vmware for VMWare, etc).

-- Digimer
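
For example, a fence_virsh device for KVM guests might look something like the
following. This is only a sketch: the hypervisor addresses, the SSH key path and
the guest (domain) names are placeholders, not details from this thread.

[ONE] pcs stonith create fence_serverHA1 fence_virsh ipaddr=kvm-host1.example.com login=root identity_file=/root/.ssh/id_rsa port=serverHA1 pcmk_host_list=serverHA1
[ONE] pcs stonith create fence_serverHA2 fence_virsh ipaddr=kvm-host2.example.com login=root identity_file=/root/.ssh/id_rsa port=serverHA2 pcmk_host_list=serverHA2

Each device points at the hypervisor that runs the guest named in "port", so the
surviving node can ask that host to power-cycle its peer.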
E. Kuemmerle
2015-01-22 09:03:38 UTC
Permalink
On 21.01.2015 11:18 Digimer wrote:
> On 21/01/15 08:13 AM, Andrea wrote:
>> > Hi All,
>> >
>> > I have a question about stonith
>> > In my scenarion , I have to create 2 node cluster, but I don't have any
>> > hardware device for stonith. No APC no IPMI ecc, no one of the list returned
>> > by "pcs stonith list"
>> > So, there is an option to do something?
>> > This is my scenario:
>> > - 2 nodes cluster
>> > serverHA1
>> > serverHA2
>> >
>> > - Software
>> > Centos 6.6
>> > pacemaker.x86_64 1.1.12-4.el6
>> > cman.x86_64 3.0.12.1-68.el6
>> > corosync.x86_64 1.4.7-1.el6
>> >
>> > -NO hardware device for stonith!
>> >
>> > - Cluster creation ([ALL] operation done on all nodes, [ONE] operation done
>> > on only one node)
>> > [ALL] systemctl start pcsd.service
>> > [ALL] systemctl enable pcsd.service
>> > [ONE] pcs cluster auth serverHA1 serverHA2
>> > [ALL] echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
>> > [ONE] pcs cluster setup --name MyCluHA serverHA1 serverHA2
>> > [ONE] pcs property set stonith-enabled=false
>> > [ONE] pcs property set no-quorum-policy=ignore
>> > [ONE] pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000
>> > host_list=192.168.56.1 --clone
>> >
>> >
>> > In my test, when I simulate network failure, split brain occurs, and when
>> > network come back, One node kill the other node
>> > -log on node 1:
>> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
>> >
>> > -log on node 2:
>> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
>> >
>> >
>> > There is a method to restart pacemaker when network come back instead of
>> > kill it?
>> >
>> > Thanks
>> > Andrea
> You really need a fence device, there isn't a way around it. By
> definition, when a node needs to be fenced, it is in an unknown state
> and it can not be predicted to operate predictably.
>
> If you're using real hardware, then you can use a switched PDU
> (network-connected power bar with individual outlet control) to do
> fencing. I use the APC AP7900 in all my clusters and it works perfectly.
> I know that some other brands work, too.
>
> If your machines are virtual machines, then you can do fencing by talking
> to the hypervisor. In this case, one node calls the host of the other
> node and asks it to be terminated (fence_virsh and fence_xvm for KVM/Xen
> systems, fence_vmware for VMWare, etc).
>
> -- Digimer
If you want to save money and you can solder a bit, I can recommend rcd_serial.
The required device is described in cluster-glue/stonith/README.rcd_serial.
It is very simple, but it has worked reliably for us for more than four years!

Eberhard




------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------


_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Michael Schwartzkopff
2015-01-22 09:15:13 UTC
Permalink
Am Donnerstag, 22. Januar 2015, 10:03:38 schrieb E. Kuemmerle:
> On 21.01.2015 11:18 Digimer wrote:
> > On 21/01/15 08:13 AM, Andrea wrote:
> >> > Hi All,
> >> >
> >> > I have a question about stonith
> >> > In my scenarion , I have to create 2 node cluster, but I don't have any
> >> > hardware device for stonith. No APC no IPMI ecc, no one of the list
> >> > returned by "pcs stonith list"
> >> > So, there is an option to do something?
> >> > This is my scenario:
> >> > - 2 nodes cluster
> >> > serverHA1
> >> > serverHA2
> >> >
> >> > - Software
> >> > Centos 6.6
> >> > pacemaker.x86_64 1.1.12-4.el6
> >> > cman.x86_64 3.0.12.1-68.el6
> >> > corosync.x86_64 1.4.7-1.el6
> >> >
> >> > -NO hardware device for stonith!

Are you sure that you do not have fencing hardware? Perhaps you just did not
configure it? Please read the manual of your BIOS and check your system board to
see whether you have an IPMI interface.


> >> > - Cluster creation ([ALL] operation done on all nodes, [ONE] operation
> >> > done
> >> > on only one node)
> >> > [ALL] systemctl start pcsd.service
> >> > [ALL] systemctl enable pcsd.service
> >> > [ONE] pcs cluster auth serverHA1 serverHA2
> >> > [ALL] echo "CMAN_QUORUM_TIMEOUT=0" >> /etc/sysconfig/cman
> >> > [ONE] pcs cluster setup --name MyCluHA serverHA1 serverHA2
> >> > [ONE] pcs property set stonith-enabled=false
> >> > [ONE] pcs property set no-quorum-policy=ignore
> >> > [ONE] pcs resource create ping ocf:pacemaker:ping dampen=5s
> >> > multiplier=1000
> >> > host_list=192.168.56.1 --clone
> >> >
> >> >
> >> > In my test, when I simulate network failure, split brain occurs, and
> >> > when
> >> > network come back, One node kill the other node
> >> > -log on node 1:
> >> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
> >> >
> >> > -log on node 2:
> >> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2

That is how fencing works.


Mit freundlichen Grüßen,

Michael Schwartzkopff

--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-01-22 10:08:59 UTC
Permalink
Michael Schwartzkopff <***@...> writes:

>
> Am Donnerstag, 22. Januar 2015, 10:03:38 schrieb E. Kuemmerle:
> > On 21.01.2015 11:18 Digimer wrote:
> > > On 21/01/15 08:13 AM, Andrea wrote:
> > >> > Hi All,
> > >> >
> > >> > I have a question about stonith
> > >> > In my scenarion , I have to create 2 node cluster, but I don't
>
> Are you sure that you do not have fencing hardware? Perhaps you just did nit
> configure it? Please read the manual of you BIOS and check your system
board if
> you have a IPMI interface.
>

> > >> > In my test, when I simulate network failure, split brain occurs, and
> > >> > when
> > >> > network come back, One node kill the other node
> > >> > -log on node 1:
> > >> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
> > >> >
> > >> > -log on node 2:
> > >> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
>
> That is how fencing works.
>
> Mit freundlichen Grüßen,
>
> Michael Schwartzkopff
>


Hi All

many thanks for your replies.
I will update my scenario and ask about adding some devices for stonith:
- Option 1
I will ask for 2 VMware virtual machines, so I can try fence_vmware
- Option 2
The project may also need shared storage. In that case, the shared
storage will be a NAS that I can attach to my nodes via iSCSI, and I
can try fence_scsi

I will post any news here

Many thanks to all for the support
Andrea




_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-01-27 10:35:32 UTC
Permalink
Andrea <***@...> writes:

>
> Michael Schwartzkopff <ms <at> ...> writes:
>
> >
> > Am Donnerstag, 22. Januar 2015, 10:03:38 schrieb E. Kuemmerle:
> > > On 21.01.2015 11:18 Digimer wrote:
> > > > On 21/01/15 08:13 AM, Andrea wrote:
> > > >> > Hi All,
> > > >> >
> > > >> > I have a question about stonith
> > > >> > In my scenarion , I have to create 2 node cluster, but I don't
> >
> > Are you sure that you do not have fencing hardware? Perhaps you just did
nit
> > configure it? Please read the manual of you BIOS and check your system
> board if
> > you have a IPMI interface.
> >
>
> > > >> > In my test, when I simulate network failure, split brain occurs, and
> > > >> > when
> > > >> > network come back, One node kill the other node
> > > >> > -log on node 1:
> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
> > > >> >
> > > >> > -log on node 2:
> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
> >
> > That is how fencing works.
> >
> > Mit freundlichen Grüßen,
> >
> > Michael Schwartzkopff
> >
>
> Hi All
>
> many thanks for your replies.
> I will update my scenario to ask about adding some devices for stonith
> - Option 1
> I will ask for having 2 vmware virtual machine, so i can try fance_vmware
> -Option 2
> In the project, maybe will need a shared storage. In this case, the shared
> storage will be a NAS that a can add to my nodes via iscsi. In this case I
> can try fence_scsi
>
> I will write here about news
>
> Many thanks to all for support
> Andrea
>



some news

- Option 2
In the customer environment I configured an iSCSI target that our project
will use as a cluster filesystem:

[ONE]pvcreate /dev/sdb
[ONE]vgcreate -Ay -cy cluster_vg /dev/sdb
[ONE]lvcreate -L*G -n cluster_lv cluster_vg
[ONE]mkfs.gfs2 -j2 -p lock_dlm -t ProjectHA:ArchiveFS /dev/cluster_vg/cluster_lv
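
For reference, the above assumes the clustered LVM/GFS2 stack is already in
place on both nodes. On CentOS 6 with cman that is usually something like the
following (package and service names are from memory, so treat this as an
assumption rather than a verified recipe):

[ALL] yum install -y lvm2-cluster gfs2-utils
[ALL] lvmconf --enable-cluster                    # switch LVM to cluster-wide locking
[ALL] service clvmd start && chkconfig clvmd on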

Now I can add a Filesystem resource:

[ONE]pcs resource create clusterfs Filesystem
device="/dev/cluster_vg/cluster_lv" directory="/var/mountpoint"
fstype="gfs2" "options=noatime" op monitor interval=10s clone interleave=true

and I can read and write from both nodes.


Now I'd like to use this device with fence_scsi.
Is that OK? Because I see this in the man page:
"The fence_scsi agent works by having each node in the cluster register a
unique key with the SCSI device(s). Once registered, a single node will
become the reservation holder by creating a "write exclusive,
registrants only" reservation on the device(s). The result is that only
registered nodes may write to the device(s)"
That's no good for me: I need both nodes to be able to write to the device.
So, do I need another device to use with fence_scsi? In that case I will try to
create two partitions, sdb1 and sdb2, on this device and use sdb1 for
clusterfs and sdb2 for fencing.


If I try to test this manually, I get the following before any operation:
[ONE]sg_persist -n --read-keys
--device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
PR generation=0x27, 1 registered reservation key follows:
0x98343e580002734d


Then, I try to register serverHA1's key:
[serverHA1]fence_scsi -d
/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 -f /tmp/miolog.txt -n
serverHA1 -o on

But nothing has changed
[ONE]sg_persist -n --read-keys
--device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
PR generation=0x27, 1 registered reservation key follows:
0x98343e580002734d


and in the log:
gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore
(node_key=4d5a0001, dev=/dev/sde)
gen 26 17:53:27 fence_scsi: [debug] main::do_reset (dev=/dev/sde, status=6)
gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore (err=0)

The same happens when I try on serverHA2.
Is this normal?
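
One way to test the eviction path by hand, as a sketch reusing the same device
path as above: remove the other node's key, then re-read the registrations and
the reservation itself.

[serverHA1] fence_scsi -d /dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 -n serverHA2 -o off
[serverHA1] sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
[serverHA1] sg_persist -n --read-reservation --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4

If the "off" worked, serverHA2's key should be gone from the key list while
serverHA1's key remains.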


In any case, I try to create a stonith device:
[ONE]pcs stonith create iscsi-stonith-device fence_scsi
pcmk_host_list="serverHA1 serverHA2"
devices=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 meta
provides=unfencing

and the cluster status is ok
[ONE] pcs status
Cluster name: MyCluHA
Last updated: Tue Jan 27 11:21:48 2015
Last change: Tue Jan 27 10:46:57 2015
Stack: cman
Current DC: serverHA1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
5 Resources configured


Online: [ serverHA1 serverHA2 ]

Full list of resources:

Clone Set: ping-clone [ping]
Started: [ serverHA1 serverHA2 ]
Clone Set: clusterfs-clone [clusterfs]
Started: [ serverHA1 serverHA2 ]
iscsi-stonith-device (stonith:fence_scsi): Started serverHA1



How can I test this from a remote connection?
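
To trigger a fence through the cluster itself, something along these lines
should work (a sketch; run it from the node that should survive, then check the
keys again):

[serverHA1] pcs stonith fence serverHA2
[serverHA1] sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4

stonith_admin --reboot serverHA2 is a lower-level alternative to the pcs command.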


Andrea

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
emmanuel segura
2015-01-27 10:44:55 UTC
Permalink
In a normal situation every node can write to your file system; fence_scsi is
used when your cluster is in split-brain, when one of your nodes doesn't
communicate with the other node. I don't think that is a good idea.


2015-01-27 11:35 GMT+01:00 Andrea <***@codices.com>:
> Andrea <***@...> writes:
>
>>
>> Michael Schwartzkopff <ms <at> ...> writes:
>>
>> >
>> > Am Donnerstag, 22. Januar 2015, 10:03:38 schrieb E. Kuemmerle:
>> > > On 21.01.2015 11:18 Digimer wrote:
>> > > > On 21/01/15 08:13 AM, Andrea wrote:
>> > > >> > Hi All,
>> > > >> >
>> > > >> > I have a question about stonith
>> > > >> > In my scenarion , I have to create 2 node cluster, but I don't
>> >
>> > Are you sure that you do not have fencing hardware? Perhaps you just did
> nit
>> > configure it? Please read the manual of you BIOS and check your system
>> board if
>> > you have a IPMI interface.
>> >
>>
>> > > >> > In my test, when I simulate network failure, split brain occurs, and
>> > > >> > when
>> > > >> > network come back, One node kill the other node
>> > > >> > -log on node 1:
>> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
>> > > >> >
>> > > >> > -log on node 2:
>> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
>> >
>> > That is how fencing works.
>> >
>> > Mit freundlichen Grüßen,
>> >
>> > Michael Schwartzkopff
>> >
>>
>> Hi All
>>
>> many thanks for your replies.
>> I will update my scenario to ask about adding some devices for stonith
>> - Option 1
>> I will ask for having 2 vmware virtual machine, so i can try fance_vmware
>> -Option 2
>> In the project, maybe will need a shared storage. In this case, the shared
>> storage will be a NAS that a can add to my nodes via iscsi. In this case I
>> can try fence_scsi
>>
>> I will write here about news
>>
>> Many thanks to all for support
>> Andrea
>>
>
>
>
> some news
>
> - Option 2
> In the customer environment I configured a iscsi target that our project
> will use as cluster filesystem
>
> [ONE]pvcreate /dev/sdb
> [ONE]vgcreate -Ay -cy cluster_vg /dev/sdb
> [ONE]lvcreate -L*G -n cluster_lv cluster_vg
> [ONE]mkfs.gfs2 -j2 -p lock_dlm -t ProjectHA:ArchiveFS /dev/cluster_vg/cluster_lv
>
> now I can add a Filesystem resource
>
> [ONE]pcs resource create clusterfs Filesystem
> device="/dev/cluster_vg/cluster_lv" directory="/var/mountpoint"
> fstype="gfs2" "options=noatime" op monitor interval=10s clone interleave=true
>
> and I can read and write from both node.
>
>
> Now I'd like to use this device with fence_scsi.
> It is ok? because I see in the man page this:
> "The fence_scsi agent works by having each node in the cluster register a
> unique key with the SCSI devive(s). Once registered, a single node will
> become the reservation holder by creating a "write exclu-sive,
> registrants only" reservation on the device(s). The result is that only
> registered nodes may write to the device(s)"
> It's no good for me, I need both node can write on the device.
> So, I need another device to use with fence_scsi? In this case I will try to
> create two partition, sdb1 and sdb2, on this device and use sdb1 as
> clusterfs and sdb2 for fencing.
>
>
> If i try to manually test this, I obtain before any operation
> [ONE]sg_persist -n --read-keys
> --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
> PR generation=0x27, 1 registered reservation key follows:
> 0x98343e580002734d
>
>
> Then, I try to set serverHA1 key
> [serverHA1]fence_scsi -d
> /dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 -f /tmp/miolog.txt -n
> serverHA1 -o on
>
> But nothing has changed
> [ONE]sg_persist -n --read-keys
> --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
> PR generation=0x27, 1 registered reservation key follows:
> 0x98343e580002734d
>
>
> and in the log:
> gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore
> (node_key=4d5a0001, dev=/dev/sde)
> gen 26 17:53:27 fence_scsi: [debug] main::do_reset (dev=/dev/sde, status=6)
> gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore (err=0)
>
> The same when i try on serverHA2
> It is normal?
>
>
> In any case, i try to create a stonith device
> [ONE]pcs stonith create iscsi-stonith-device fence_scsi
> pcmk_host_list="serverHA1 serverHA2"
> devices=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 meta
> provides=unfencing
>
> and the cluster status is ok
> [ONE] pcs status
> Cluster name: MyCluHA
> Last updated: Tue Jan 27 11:21:48 2015
> Last change: Tue Jan 27 10:46:57 2015
> Stack: cman
> Current DC: serverHA1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured
> 5 Resources configured
>
>
> Online: [ serverHA1 serverHA2 ]
>
> Full list of resources:
>
> Clone Set: ping-clone [ping]
> Started: [ serverHA1 serverHA2 ]
> Clone Set: clusterfs-clone [clusterfs]
> Started: [ serverHA1 serverHA2 ]
> iscsi-stonith-device (stonith:fence_scsi): Started serverHA1
>
>
>
> How I can try this from remote connection?
>
>
> Andrea
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



--
esta es mi vida e me la vivo hasta que dios quiera

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
emmanuel segura
2015-01-27 10:48:14 UTC
Permalink
Sorry, I forgot to tell you: you need to know that fence_scsi
doesn't reboot the evicted node, so you can combine fence_vmware with
fence_scsi as a second option.
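
If both agents end up being available, fencing levels can chain them so that
the hypervisor fence is tried first and the SCSI fence is the fallback. A
sketch, assuming hypothetical device names vmware-fence and scsi-fence:

[ONE] pcs stonith level add 1 serverHA1 vmware-fence
[ONE] pcs stonith level add 2 serverHA1 scsi-fence
[ONE] pcs stonith level add 1 serverHA2 vmware-fence
[ONE] pcs stonith level add 2 serverHA2 scsi-fence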

2015-01-27 11:44 GMT+01:00 emmanuel segura <***@gmail.com>:
> In normal situation every node can in your file system, fence_scsi is
> used when your cluster is in split-braint, when your a node doesn't
> comunicate with the other node, i don't is good idea.
>
>
> 2015-01-27 11:35 GMT+01:00 Andrea <***@codices.com>:
>> Andrea <***@...> writes:
>>
>>>
>>> Michael Schwartzkopff <ms <at> ...> writes:
>>>
>>> >
>>> > Am Donnerstag, 22. Januar 2015, 10:03:38 schrieb E. Kuemmerle:
>>> > > On 21.01.2015 11:18 Digimer wrote:
>>> > > > On 21/01/15 08:13 AM, Andrea wrote:
>>> > > >> > Hi All,
>>> > > >> >
>>> > > >> > I have a question about stonith
>>> > > >> > In my scenarion , I have to create 2 node cluster, but I don't
>>> >
>>> > Are you sure that you do not have fencing hardware? Perhaps you just did
>> nit
>>> > configure it? Please read the manual of you BIOS and check your system
>>> board if
>>> > you have a IPMI interface.
>>> >
>>>
>>> > > >> > In my test, when I simulate network failure, split brain occurs, and
>>> > > >> > when
>>> > > >> > network come back, One node kill the other node
>>> > > >> > -log on node 1:
>>> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
>>> > > >> >
>>> > > >> > -log on node 2:
>>> > > >> > Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
>>> >
>>> > That is how fencing works.
>>> >
>>> > Mit freundlichen Grüßen,
>>> >
>>> > Michael Schwartzkopff
>>> >
>>>
>>> Hi All
>>>
>>> many thanks for your replies.
>>> I will update my scenario to ask about adding some devices for stonith
>>> - Option 1
>>> I will ask for having 2 vmware virtual machine, so i can try fance_vmware
>>> -Option 2
>>> In the project, maybe will need a shared storage. In this case, the shared
>>> storage will be a NAS that a can add to my nodes via iscsi. In this case I
>>> can try fence_scsi
>>>
>>> I will write here about news
>>>
>>> Many thanks to all for support
>>> Andrea
>>>
>>
>>
>>
>> some news
>>
>> - Option 2
>> In the customer environment I configured a iscsi target that our project
>> will use as cluster filesystem
>>
>> [ONE]pvcreate /dev/sdb
>> [ONE]vgcreate -Ay -cy cluster_vg /dev/sdb
>> [ONE]lvcreate -L*G -n cluster_lv cluster_vg
>> [ONE]mkfs.gfs2 -j2 -p lock_dlm -t ProjectHA:ArchiveFS /dev/cluster_vg/cluster_lv
>>
>> now I can add a Filesystem resource
>>
>> [ONE]pcs resource create clusterfs Filesystem
>> device="/dev/cluster_vg/cluster_lv" directory="/var/mountpoint"
>> fstype="gfs2" "options=noatime" op monitor interval=10s clone interleave=true
>>
>> and I can read and write from both node.
>>
>>
>> Now I'd like to use this device with fence_scsi.
>> It is ok? because I see in the man page this:
>> "The fence_scsi agent works by having each node in the cluster register a
>> unique key with the SCSI devive(s). Once registered, a single node will
>> become the reservation holder by creating a "write exclu-sive,
>> registrants only" reservation on the device(s). The result is that only
>> registered nodes may write to the device(s)"
>> It's no good for me, I need both node can write on the device.
>> So, I need another device to use with fence_scsi? In this case I will try to
>> create two partition, sdb1 and sdb2, on this device and use sdb1 as
>> clusterfs and sdb2 for fencing.
>>
>>
>> If i try to manually test this, I obtain before any operation
>> [ONE]sg_persist -n --read-keys
>> --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
>> PR generation=0x27, 1 registered reservation key follows:
>> 0x98343e580002734d
>>
>>
>> Then, I try to set serverHA1 key
>> [serverHA1]fence_scsi -d
>> /dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 -f /tmp/miolog.txt -n
>> serverHA1 -o on
>>
>> But nothing has changed
>> [ONE]sg_persist -n --read-keys
>> --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
>> PR generation=0x27, 1 registered reservation key follows:
>> 0x98343e580002734d
>>
>>
>> and in the log:
>> gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore
>> (node_key=4d5a0001, dev=/dev/sde)
>> gen 26 17:53:27 fence_scsi: [debug] main::do_reset (dev=/dev/sde, status=6)
>> gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore (err=0)
>>
>> The same when i try on serverHA2
>> It is normal?
>>
>>
>> In any case, i try to create a stonith device
>> [ONE]pcs stonith create iscsi-stonith-device fence_scsi
>> pcmk_host_list="serverHA1 serverHA2"
>> devices=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 meta
>> provides=unfencing
>>
>> and the cluster status is ok
>> [ONE] pcs status
>> Cluster name: MyCluHA
>> Last updated: Tue Jan 27 11:21:48 2015
>> Last change: Tue Jan 27 10:46:57 2015
>> Stack: cman
>> Current DC: serverHA1 - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ serverHA1 serverHA2 ]
>>
>> Full list of resources:
>>
>> Clone Set: ping-clone [ping]
>> Started: [ serverHA1 serverHA2 ]
>> Clone Set: clusterfs-clone [clusterfs]
>> Started: [ serverHA1 serverHA2 ]
>> iscsi-stonith-device (stonith:fence_scsi): Started serverHA1
>>
>>
>>
>> How I can try this from remote connection?
>>
>>
>> Andrea
>>
>> _______________________________________________
>> Pacemaker mailing list: ***@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera



--
esta es mi vida e me la vivo hasta que dios quiera

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-01-27 12:29:12 UTC
Permalink
emmanuel segura <***@...> writes:

>
> sorry, but i forgot to tell you, you need to know the fence_scsi
> doesn't reboot the evicted node, so you can combine fence_vmware with
> fence_scsi as the second option.
>
For this, I'm trying to use a watchdog script:
https://access.redhat.com/solutions/65187

But when I start the watchdog daemon, all nodes reboot.
I will continue testing...
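
For what it's worth, my understanding of that setup is roughly the following;
the package, script path and service names are from memory, so treat them as an
assumption:

[ALL] yum install -y watchdog
[ALL] cp /usr/share/cluster/fence_scsi_check.pl /etc/watchdog.d/   # reboot if our key disappears
[ALL] service watchdog start && chkconfig watchdog on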



> 2015-01-27 11:44 GMT+01:00 emmanuel segura <emi2fast <at> gmail.com>:
> > In normal situation every node can in your file system, fence_scsi is
> > used when your cluster is in split-braint, when your a node doesn't
> > comunicate with the other node, i don't is good idea.
> >

So, will I see the key registrations change only when the nodes lose communication?





> >
> > 2015-01-27 11:35 GMT+01:00 Andrea <a.bacchi <at> codices.com>:
> >> Andrea <a.bacchi <at> ...> writes:
>





_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
emmanuel segura
2015-01-27 12:51:14 UTC
Permalink
When a node is dead the registration key is removed.
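
A quick way to watch that happen during a test, as a sketch reusing the device
path from earlier in the thread:

[ONE] watch -n 5 "sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4"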

2015-01-27 13:29 GMT+01:00 Andrea <***@codices.com>:
> emmanuel segura <***@...> writes:
>
>>
>> sorry, but i forgot to tell you, you need to know the fence_scsi
>> doesn't reboot the evicted node, so you can combine fence_vmware with
>> fence_scsi as the second option.
>>
> for this, i'm trying to use a watchdog script
> https://access.redhat.com/solutions/65187
>
> But when I start wachdog daemon, all node reboot.
> I continue testing...
>
>
>
>> 2015-01-27 11:44 GMT+01:00 emmanuel segura <emi2fast <at> gmail.com>:
>> > In normal situation every node can in your file system, fence_scsi is
>> > used when your cluster is in split-braint, when your a node doesn't
>> > comunicate with the other node, i don't is good idea.
>> >
>
> So, i will see key registration only when nodes loose comunication?
>
>
>
>
>
>> >
>> > 2015-01-27 11:35 GMT+01:00 Andrea <a.bacchi <at> codices.com>:
>> >> Andrea <a.bacchi <at> ...> writes:
>>
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



--
esta es mi vida e me la vivo hasta que dios quiera

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Vinod Prabhu
2015-01-27 13:05:53 UTC
Permalink
is stonith enabled in crm conf?

On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura <***@gmail.com> wrote:

> When a node is dead the registration key is removed.
>
> 2015-01-27 13:29 GMT+01:00 Andrea <***@codices.com>:
> > emmanuel segura <***@...> writes:
> >
> >>
> >> sorry, but i forgot to tell you, you need to know the fence_scsi
> >> doesn't reboot the evicted node, so you can combine fence_vmware with
> >> fence_scsi as the second option.
> >>
> > for this, i'm trying to use a watchdog script
> > https://access.redhat.com/solutions/65187
> >
> > But when I start wachdog daemon, all node reboot.
> > I continue testing...
> >
> >
> >
> >> 2015-01-27 11:44 GMT+01:00 emmanuel segura <emi2fast <at> gmail.com>:
> >> > In normal situation every node can in your file system, fence_scsi is
> >> > used when your cluster is in split-braint, when your a node doesn't
> >> > comunicate with the other node, i don't is good idea.
> >> >
> >
> > So, i will see key registration only when nodes loose comunication?
> >
> >
> >
> >
> >
> >> >
> >> > 2015-01-27 11:35 GMT+01:00 Andrea <a.bacchi <at> codices.com>:
> >> >> Andrea <a.bacchi <at> ...> writes:
> >>
> >
> >
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: ***@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
>
> --
> esta es mi vida e me la vivo hasta que dios quiera
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



--
OSS BSS Developer
Hand Phone: 9860788344
emmanuel segura
2015-01-27 13:12:46 UTC
Permalink
If you are using cman+pacemaker, you need to enable stonith and
configure it in your crm config.
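
In pcs terms that is roughly (a sketch):

[ONE] pcs property set stonith-enabled=true
[ONE] pcs stonith show --full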

2015-01-27 14:05 GMT+01:00 Vinod Prabhu <***@gmail.com>:

> is stonith enabled in crm conf?
>
> On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura <***@gmail.com>
> wrote:
>
>> When a node is dead the registration key is removed.
>>
>> 2015-01-27 13:29 GMT+01:00 Andrea <***@codices.com>:
>> > emmanuel segura <***@...> writes:
>> >
>> >>
>> >> sorry, but i forgot to tell you, you need to know the fence_scsi
>> >> doesn't reboot the evicted node, so you can combine fence_vmware with
>> >> fence_scsi as the second option.
>> >>
>> > for this, i'm trying to use a watchdog script
>> > https://access.redhat.com/solutions/65187
>> >
>> > But when I start wachdog daemon, all node reboot.
>> > I continue testing...
>> >
>> >
>> >
>> >> 2015-01-27 11:44 GMT+01:00 emmanuel segura <emi2fast <at> gmail.com>:
>> >> > In normal situation every node can in your file system, fence_scsi is
>> >> > used when your cluster is in split-braint, when your a node doesn't
>> >> > comunicate with the other node, i don't is good idea.
>> >> >
>> >
>> > So, i will see key registration only when nodes loose comunication?
>> >
>> >
>> >
>> >
>> >
>> >> >
>> >> > 2015-01-27 11:35 GMT+01:00 Andrea <a.bacchi <at> codices.com>:
>> >> >> Andrea <a.bacchi <at> ...> writes:
>> >>
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: ***@oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> --
>> esta es mi vida e me la vivo hasta que dios quiera
>>
>> _______________________________________________
>> Pacemaker mailing list: ***@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>
> --
> OSS BSS Developer
> Hand Phone: 9860788344
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


--
esta es mi vida e me la vivo hasta que dios quiera
Andrea
2015-01-27 13:24:46 UTC
Permalink
emmanuel segura <***@...> writes:

>
>
> if you are using cman+pacemaker you need to enabled the stonith and
configuring that in you crm config
>
>
> 2015-01-27 14:05 GMT+01:00 Vinod Prabhu
<***@gmail.com>:
> is stonith enabled in crm conf?
>

yes, stonith is enabled

[ONE]pcs property
Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.11-97629de
last-lrm-refresh: 1422285715
no-quorum-policy: ignore
stonith-enabled: true


If I disable it, the stonith device doesn't start.


>
> On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura <***@gmail.com> wrote:
> When a node is dead the registration key is removed.

So I should see 2 keys registered when I add the fence_scsi device?
But I don't see 2 keys registered...



Andrea






_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
emmanuel segura
2015-01-27 13:42:33 UTC
Permalink
please show your configuration and your logs.

2015-01-27 14:24 GMT+01:00 Andrea <***@codices.com>:
> emmanuel segura <***@...> writes:
>
>>
>>
>> if you are using cman+pacemaker you need to enabled the stonith and
> configuring that in you crm config
>>
>>
>> 2015-01-27 14:05 GMT+01:00 Vinod Prabhu
> <***@gmail.com>:
>> is stonith enabled in crm conf?
>>
>
> yes, stonith is enabled
>
> [ONE]pcs property
> Cluster Properties:
> cluster-infrastructure: cman
> dc-version: 1.1.11-97629de
> last-lrm-refresh: 1422285715
> no-quorum-policy: ignore
> stonith-enabled: true
>
>
> If I disable it, stonith device don't start
>
>
>>
>> On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura
> <***@gmail.com> wrote:When a node is dead
> the registration key is removed.
>
> So I must see 2 key registered when I add fence_scsi device?
> But I don't see 2 key registered...
>
>
>
> Andrea
>
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



--
esta es mi vida e me la vivo hasta que dios quiera

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-01-27 16:55:04 UTC
Permalink
emmanuel segura <***@...> writes:

>
> please show your configuration and your logs.
>
> 2015-01-27 14:24 GMT+01:00 Andrea <***@...>:
> > emmanuel segura <emi2fast <at> ...> writes:
> >
> >>
> >>
> >> if you are using cman+pacemaker you need to enabled the stonith and
> > configuring that in you crm config
> >>
> >>
> >> 2015-01-27 14:05 GMT+01:00 Vinod Prabhu
> > <***@...>:
> >> is stonith enabled in crm conf?
> >>
> >
> > yes, stonith is enabled
> >
> > [ONE]pcs property
> > Cluster Properties:
> > cluster-infrastructure: cman
> > dc-version: 1.1.11-97629de
> > last-lrm-refresh: 1422285715
> > no-quorum-policy: ignore
> > stonith-enabled: true
> >
> >
> > If I disable it, stonith device don't start
> >
> >
> >>
> >> On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura
> > <***@...> wrote:When a node is dead
> > the registration key is removed.
> >
> > So I must see 2 key registered when I add fence_scsi device?
> > But I don't see 2 key registered...
> >
> >

Sorry, I used the wrong device ID.
Now, with the correct device ID, I see 2 keys registered:


[ONE] sg_persist -n --read-keys
--device=/dev/disk/by-id/scsi-36e843b60f3d0cc6d1a11d4ff0da95cd8
PR generation=0x4, 2 registered reservation keys follow:
0x4d5a0001
0x4d5a0002


Tomorrow I will do some fencing tests...

thanks
Andrea





_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-01-30 11:38:20 UTC
Permalink
Andrea <***@...> writes:

>
> Sorry, I used wrong device id.
> Now, with the correct device id, I see 2 key reserved
>
> [ONE] sg_persist -n --read-keys
> --device=/dev/disk/by-id/scsi-36e843b60f3d0cc6d1a11d4ff0da95cd8
> PR generation=0x4, 2 registered reservation keys follow:
> 0x4d5a0001
> 0x4d5a0002
>
> Tomorrow i will do some test for fencing...
>

some news


When I try to fence serverHA2 with this command:
[ONE]pcs stonith fence serverHA2

everything seems to be OK, but serverHA2 freezes;
below are the logs from each node (serverHA2 freezes right after logging these lines).

The servers are 2 VMware virtual machines (I have asked for an account on the ESX
server to test fence_vmware; I'm waiting for a response).


log serverHA1


Jan 30 12:13:02 [2510] serverHA1 stonith-ng: notice: handle_request: Client
stonith_admin.1907.b13e0290 wants to fence (reboot) 'serverHA2' with device
'(any)'
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
serverHA2: 70b75107-8919-4510-9c6c-7cc65e6a00a6 (0)
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: notice:
can_fence_host_with_device: iscsi-stonith-device can fence (reboot)
serverHA2: static-list
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: info:
process_remote_stonith_query: Query result 1 of 2 from serverHA1 for
serverHA2/reboot (1 devices) 70b75107-8919-4510-9c6c-7cc65e6a00a6
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: info: call_remote_stonith:
Total remote op timeout set to 120 for fencing of node serverHA2 for
stonith_admin.1907.70b75107
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: info: call_remote_stonith:
Requesting that serverHA1 perform op reboot serverHA2 for stonith_admin.1907
(144s)
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: notice:
can_fence_host_with_device: iscsi-stonith-device can fence (reboot)
serverHA2: static-list
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: info:
stonith_fence_get_devices_cb: Found 1 matching devices for 'serverHA2'
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: warning: stonith_device_execute:
Agent 'fence_scsi' does not advertise support for 'reboot', performing 'off'
action instead
Jan 30 12:13:02 [2510] serverHA1 stonith-ng: info:
process_remote_stonith_query: Query result 2 of 2 from serverHA2 for
serverHA2/reboot (1 devices) 70b75107-8919-4510-9c6c-7cc65e6a00a6
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: notice: log_operation:
Operation 'reboot' [1908] (call 2 from stonith_admin.1907) for host 'serverHA2'
with device 'iscsi-stonith-device' returned: 0 (OK)
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: warning: get_xpath_object:
No match for //@st_delegate in /st-reply
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: notice: remote_op_done:
Operation reboot of serverHA2 by serverHA1 for
***@serverHA1.70b75107: OK
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: tengine_stonith_notify:
Peer serverHA2 was terminated (reboot) by serverHA1 for serverHA1: OK
(ref=70b75107-8919-4510-9c6c-7cc65e6a00a6) by client stonith_admin.1907
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: tengine_stonith_notify:
Notified CMAN that 'serverHA2' is now fenced
Jan 30 12:13:03 [2514] serverHA1 crmd: info: crm_update_peer_join:
crmd_peer_down: Node serverHA2[2] - join-2 phase 4 -> 0
Jan 30 12:13:03 [2514] serverHA1 crmd: info:
crm_update_peer_expected: crmd_peer_down: Node serverHA2[2] - expected
state is now down (was member)
Jan 30 12:13:03 [2514] serverHA1 crmd: info: erase_status_tag:
Deleting xpath: //node_state[@uname='serverHA2']/lrm
Jan 30 12:13:03 [2514] serverHA1 crmd: info: erase_status_tag:
Deleting xpath: //node_state[@uname='serverHA2']/transient_attributes
Jan 30 12:13:03 [2514] serverHA1 crmd: info: tengine_stonith_notify:
External fencing operation from stonith_admin.1907 fenced serverHA2
Jan 30 12:13:03 [2514] serverHA1 crmd: info: abort_transition_graph:
Transition aborted: External Fencing Operation
(source=tengine_stonith_notify:248, 1)
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: do_state_transition:
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 30 12:13:03 [2514] serverHA1 crmd: warning: do_state_transition:
Only 1 of 2 cluster nodes are eligible to run resources - continue 0
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Forwarding cib_modify operation for section status to master
(origin=local/crmd/333)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Forwarding cib_delete operation for section
//node_state[@uname='serverHA2']/lrm to master (origin=local/crmd/334)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Forwarding cib_delete operation for section
//node_state[@uname='serverHA2']/transient_attributes to master
(origin=local/crmd/335)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
--- 0.51.86 2
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
+++ 0.51.87 (null)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: +
/cib: @num_updates=87
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: +
/cib/status/node_state[@id='serverHA2']:
@crm-debug-origin=send_stonith_update, @join=down, @expected=down
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=serverHA1/crmd/333, version=0.51.87)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
--- 0.51.87 2
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
+++ 0.51.88 (null)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: --
/cib/status/node_state[@id='serverHA2']/lrm[@id='serverHA2']
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: +
/cib: @num_updates=88
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='serverHA2']/lrm: OK (rc=0, origin=serverHA1/crmd/334,
version=0.51.88)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
--- 0.51.88 2
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: Diff:
+++ 0.51.89 (null)
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: --
/cib/status/node_state[@id='serverHA2']/transient_attributes[@id='serverHA2']
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_perform_op: +
/cib: @num_updates=89
Jan 30 12:13:03 [2509] serverHA1 cib: info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='serverHA2']/transient_attributes: OK (rc=0,
origin=serverHA1/crmd/335, version=0.51.89)
Jan 30 12:13:03 [2514] serverHA1 crmd: info: cib_fencing_updated:
Fencing update 333 for serverHA2: complete
Jan 30 12:13:03 [2514] serverHA1 crmd: info: abort_transition_graph:
Transition aborted by deletion of lrm[@id='serverHA2']: Resource state removal
(cib=0.51.88, source=te_update_diff:429,
path=/cib/status/node_state[@id='serverHA2']/lrm[@id='serverHA2'], 1)
Jan 30 12:13:03 [2514] serverHA1 crmd: info: abort_transition_graph:
Transition aborted by deletion of transient_attributes[@id='serverHA2']:
Transient attribute change (cib=0.51.89, source=te_update_diff:391,
path=/cib/status/node_state[@id='serverHA2']/transient_attributes[@id='serverHA2
'], 1)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: process_pe_message:
Input has not changed since last time, not saving to disk
Jan 30 12:13:03 [2513] serverHA1 pengine: notice: unpack_config: On loss
of CCM Quorum: Ignore
Jan 30 12:13:03 [2513] serverHA1 pengine: info:
determine_online_status_fencing: Node serverHA2 is active
Jan 30 12:13:03 [2513] serverHA1 pengine: info: determine_online_status:
Node serverHA2 is online
Jan 30 12:13:03 [2513] serverHA1 pengine: info:
determine_online_status_fencing: Node serverHA1 is active
Jan 30 12:13:03 [2513] serverHA1 pengine: info: determine_online_status:
Node serverHA1 is online
Jan 30 12:13:03 [2513] serverHA1 pengine: info: clone_print: Clone
Set: ping-clone [ping]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Started: [ serverHA1 serverHA2 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: clone_print: Clone
Set: clusterfs-clone [clusterfs]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Started: [ serverHA1 serverHA2 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: native_print:
iscsi-stonith-device (stonith:fence_scsi): Started serverHA1
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
ping:0 (Started serverHA2)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
ping:1 (Started serverHA1)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
clusterfs:0 (Started serverHA2)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
clusterfs:1 (Started serverHA1)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
iscsi-stonith-device (Started serverHA1)
Jan 30 12:13:03 [2514] serverHA1 crmd: info: handle_response:
pe_calc calculation pe_calc-dc-1422616383-286 is obsolete
Jan 30 12:13:03 [2513] serverHA1 pengine: notice: process_pe_message:
Calculated Transition 189: /var/lib/pacemaker/pengine/pe-input-145.bz2
Jan 30 12:13:03 [2513] serverHA1 pengine: notice: unpack_config: On loss
of CCM Quorum: Ignore
Jan 30 12:13:03 [2513] serverHA1 pengine: info:
determine_online_status_fencing: - Node serverHA2 is not ready to run
resources
Jan 30 12:13:03 [2513] serverHA1 pengine: info: determine_online_status:
Node serverHA2 is pending
Jan 30 12:13:03 [2513] serverHA1 pengine: info:
determine_online_status_fencing: Node serverHA1 is active
Jan 30 12:13:03 [2513] serverHA1 pengine: info: determine_online_status:
Node serverHA1 is online
Jan 30 12:13:03 [2513] serverHA1 pengine: info: clone_print: Clone
Set: ping-clone [ping]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Started: [ serverHA1 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Stopped: [ serverHA2 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: clone_print: Clone
Set: clusterfs-clone [clusterfs]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Started: [ serverHA1 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: short_print:
Stopped: [ serverHA2 ]
Jan 30 12:13:03 [2513] serverHA1 pengine: info: native_print:
iscsi-stonith-device (stonith:fence_scsi): Started serverHA1
Jan 30 12:13:03 [2513] serverHA1 pengine: info: native_color:
Resource ping:1 cannot run anywhere
Jan 30 12:13:03 [2513] serverHA1 pengine: info: native_color:
Resource clusterfs:1 cannot run anywhere
Jan 30 12:13:03 [2513] serverHA1 pengine: info: probe_resources:
Action probe_complete-serverHA2 on serverHA2 is unrunnable (pending)
Jan 30 12:13:03 [2513] serverHA1 pengine: warning: custom_action: Action
ping:0_monitor_0 on serverHA2 is unrunnable (pending)
Jan 30 12:13:03 [2513] serverHA1 pengine: warning: custom_action: Action
clusterfs:0_monitor_0 on serverHA2 is unrunnable (pending)
Jan 30 12:13:03 [2513] serverHA1 pengine: warning: custom_action: Action
iscsi-stonith-device_monitor_0 on serverHA2 is unrunnable (pending)
Jan 30 12:13:03 [2513] serverHA1 pengine: notice: trigger_unfencing:
Unfencing serverHA2: node discovery
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
ping:0 (Started serverHA1)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
ping:1 (Stopped)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
clusterfs:0 (Started serverHA1)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
clusterfs:1 (Stopped)
Jan 30 12:13:03 [2513] serverHA1 pengine: info: LogActions: Leave
iscsi-stonith-device (Started serverHA1)
Jan 30 12:13:03 [2514] serverHA1 crmd: info: do_state_transition:
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]
Jan 30 12:13:03 [2514] serverHA1 crmd: info: do_te_invoke:
Processing graph 190 (ref=pe_calc-dc-1422616383-287) derived from
/var/lib/pacemaker/pengine/pe-input-146.bz2
Jan 30 12:13:03 [2513] serverHA1 pengine: notice: process_pe_message:
Calculated Transition 190: /var/lib/pacemaker/pengine/pe-input-146.bz2
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: te_fence_node:
Executing on fencing operation (5) on serverHA2 (timeout=60000)
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: notice: handle_request: Client
crmd.2514.b5961dc1 wants to fence (on) 'serverHA2' with device '(any)'
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: notice:
initiate_remote_stonith_op: Initiating remote operation on for serverHA2:
e19629dc-bec3-4e63-baf6-a7ecd5ed44bb (0)
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: info:
process_remote_stonith_query: Query result 2 of 2 from serverHA2 for
serverHA2/on (1 devices) e19629dc-bec3-4e63-baf6-a7ecd5ed44bb
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: info:
process_remote_stonith_query: All queries have arrived, continuing (2, 2, 2)
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: info: call_remote_stonith:
Total remote op timeout set to 60 for fencing of node serverHA2 for
crmd.2514.e19629dc
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: info: call_remote_stonith:
Requesting that serverHA2 perform op on serverHA2 for crmd.2514 (72s)
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: warning: get_xpath_object:
No match for //@st_delegate in /st-reply
Jan 30 12:13:03 [2510] serverHA1 stonith-ng: notice: remote_op_done:
Operation on of serverHA2 by serverHA2 for ***@serverHA1.e19629dc: OK
Jan 30 12:13:03 [2514] serverHA1 crmd: notice:
tengine_stonith_callback: Stonith operation
9/5:190:0:4e500b84-bb92-4406-8f9c-f4140dd40ec7: OK (0)
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: tengine_stonith_notify:
serverHA2 was successfully unfenced by serverHA2 (at the request of serverHA1)
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: run_graph:
Transition 190 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-146.bz2): Complete
Jan 30 12:13:03 [2514] serverHA1 crmd: info: do_log: FSA: Input
I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jan 30 12:13:03 [2514] serverHA1 crmd: notice: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]





log serverHA2



Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice:
can_fence_host_with_device: iscsi-stonith-device can fence (reboot)
serverHA2: static-list
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice: remote_op_done:
Operation reboot of serverHA2 by serverHA1 for
***@serverHA1.70b75107: OK
Jan 30 12:13:11 [2631] serverHA2 crmd: crit: tengine_stonith_notify:
We were alegedly just fenced by serverHA1 for serverHA1!
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
--- 0.51.86 2
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
+++ 0.51.87 (null)
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: +
/cib: @num_updates=87
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: +
/cib/status/node_state[@id='serverHA2']:
@crm-debug-origin=send_stonith_update, @join=down, @expected=down
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=serverHA1/crmd/333, version=0.51.87)
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
--- 0.51.87 2
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
+++ 0.51.88 (null)
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: --
/cib/status/node_state[@id='serverHA2']/lrm[@id='serverHA2']
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: +
/cib: @num_updates=88
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='serverHA2']/lrm: OK (rc=0, origin=serverHA1/crmd/334,
version=0.51.88)
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
--- 0.51.88 2
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: Diff:
+++ 0.51.89 (null)
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: --
/cib/status/node_state[@id='serverHA2']/transient_attributes[@id='serverHA2']
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_perform_op: +
/cib: @num_updates=89
Jan 30 12:13:11 [2626] serverHA2 cib: info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='serverHA2']/transient_attributes: OK (rc=0,
origin=serverHA1/crmd/335, version=0.51.89)
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice:
can_fence_host_with_device: iscsi-stonith-device can fence (on) serverHA2:
static-list
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice:
can_fence_host_with_device: iscsi-stonith-device can fence (on) serverHA2:
static-list
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: info:
stonith_fence_get_devices_cb: Found 1 matching devices for 'serverHA2'
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice: log_operation:
Operation 'on' [3037] (call 9 from crmd.2514) for host 'serverHA2' with device
'iscsi-stonith-device' returned: 0 (OK)
Jan 30 12:13:11 [2627] serverHA2 stonith-ng: notice: remote_op_done:
Operation on of serverHA2 by serverHA2 for ***@serverHA1.e19629dc: OK



I will continue testing....


Andrea


_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-02-02 11:22:53 UTC
Permalink
Hi,

I tried a network failure and it works.
During the failure, each node tries to fence the other node.
When the network comes back, the node with the network problem is fenced and reboots.
Moreover, cman issues a kill (cman) on one node; typically node1 kills (cman)
node2. So I have 2 situations:

1) Network failure on node2
When the network comes back, node2 is fenced and cman kills (cman) node2.
The watchdog script checks the key registrations and reboots node2.
After the reboot the cluster comes back with 2 nodes up.

2) Network failure on node1
When the network comes back, node1 is fenced, and cman kills (cman)
node2 (the cluster is down!).
The watchdog script checks the key registrations and reboots node1.
During the reboot the cluster is offline, because node1 is rebooting and cman on node
2 was killed.
After the reboot, node1 is up and fences node2. Now the watchdog reboots node2.
After the reboot, the cluster comes back with 2 nodes up.


The only "problem" is the downtime in situation 2, but it is acceptable in my
context.
I created my fence device with this command:
[ONE]pcs stonith create scsi fence_scsi pcmk_host_list="serverHA1 serverHA2"
pcmk_reboot_action="off" meta provides="unfencing" --force
as described here
https://access.redhat.com/articles/530533
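(For completeness, the watchdog half of that setup is sketched below. It assumes
the fence_scsi_check helper shipped with the fence agents; the exact file name
and path can differ per release, so treat it as a sketch rather than the exact
recipe:

[ALL] yum install -y watchdog
[ALL] cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/
[ALL] chkconfig watchdog on && service watchdog start

The watchdog daemon then runs the check periodically and reboots the node as
soon as the node's own reservation key disappears from the shared LUN.)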


If possible, I will test fence_vmware (without the watchdog script) and I will
post my results here.
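A sketch of what that device could look like with fence_vmware_soap; the vCenter
address, credentials and VM names below are placeholders, and the parameter names
should be double-checked against the agent metadata (pcs stonith describe
fence_vmware_soap):

[ONE] pcs stonith create vmfence fence_vmware_soap \
      ipaddr=vcenter.example.com login=fenceuser passwd=secret ssl=1 \
      pcmk_host_map="serverHA1:serverHA1-vm;serverHA2:serverHA2-vm" \
      op monitor interval=60s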

thanks to all
Andrea




_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Digimer
2015-02-02 15:53:22 UTC
Permalink
That the fence failed until the network came back makes your fence method
less than ideal. Will it eventually fence while the network is still failed?

Most importantly though: did cluster resources stay blocked while the fence
was pending? If so, then your cluster is safe, and that is the most
important part.

On 02/02/15 06:22 AM, Andrea wrote:
> Hi,
>
> I tryed a network failure and it works.
> During failure, each node try to fence other node.
> When network come back, the node with network problem is fenced and reboot.
> Moreover, the cman kill(cman) on one node, tipically node1 kill(cman) on
> node2, so, I have 2 situations:
>
> 1) Network failure on node2
> When network come back, node2 is fenced and cman kill (cman) on node2 .
> Watchdog script check for key registration, and reboot node2.
> After reboot cluster come back with 2 nodes up.
>
> 2) Network failure on node1
> When network come back, node1 is fenced, and cman kill(cman) on
> node2.(cluster is down!)
> Watchdog script check for key registration, and reboot node1.
> During reboot cluster is offline because node1 is rebooting and cman on node
> 2 was killed.
> After reboot, node1 is up and fence node2. Now, watchdog reboot node2.
> After reboot, cluster come back with 2 nodes up.
>
>
> The only "problem" is downtime in situation 2, but it is acceptable for my
> context.
> I created my fence device with this command:
> [ONE]pcs stonith create scsi fence_scsi pcmk_host_list="serverHA1 serverHA2"
> pcmk_reboot_action="off" meta provides="unfencing" --force
> as described here
> https://access.redhat.com/articles/530533
>
>
> If possible, I will test the fence_vmware (without Wachdog script) and i
> will post here my result
>
> thansk to all
> Andrea
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-02-04 16:40:29 UTC
Permalink
Digimer <***@...> writes:

>
> That fence failed until the network came back makes your fence method
> less than ideal. Will it eventually fence with the network still failed?
>
> Most importantly though; Cluster resources blocked while the fence was
> pending? If so, then your cluster is safe, and that is the most
> important part.
>
Hi Digimer

For fencing I'm using a remote NAS, attached via an iSCSI target.
During a network failure, for example on node2, each node tries to fence the
other node.
The fencing action on node1 succeeds, but on node2 it fails, because node2
can't see the iSCSI target (its network is down!).
I think that's the reason why node2 doesn't reboot right away: it can't operate
on the key reservation, so the watchdog can't check for it.
When the network comes back, the watchdog can check the key registration and
reboots node2.
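(For the record, that registration check boils down to reading the SCSI-3
reservation keys on the shared LUN, roughly like this, with the device path
below being a placeholder:

sg_persist -n -i -k -d /dev/disk/by-id/my-iscsi-lun

If the node's own key is no longer in the list, the watchdog script reboots the
node; while the network is down the read itself fails, which would explain why
node2 only reboots once the iSCSI target is reachable again.)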

For the clustered filesystem I planned to use a ping resource with a location
constraint, as described here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
If the node can't see the iSCSI target... then... stop the AppServer, Filesystem, etc.

But it doesn't work. On the node with the network failure I see in the log that
pingd is set to 0, but the Filesystem resource doesn't stop.

I will continue testing...

Thanks
Andrea






_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Dmitry Koterov
2015-02-05 02:38:35 UTC
Permalink
Could you please give a hint: how does one use fencing when the nodes are all
in different geo-distributed datacenters? How do people do that? Because there
could be a network disconnection between the datacenters, and then we have no
way to send a stonith signal anywhere.

On Wednesday, February 4, 2015, Andrea <***@codices.com> wrote:

> Digimer <***@...> writes:
>
> >
> > That fence failed until the network came back makes your fence method
> > less than ideal. Will it eventually fence with the network still failed?
> >
> > Most importantly though; Cluster resources blocked while the fence was
> > pending? If so, then your cluster is safe, and that is the most
> > important part.
> >
> Hi Digimer
>
> I'm using for fencing a remote NAS, attached via iscsi target.
> During network failure, for example on node2, each node try to fence other
> node.
> Fencing action on node1 get success, but on node2 fail, because it can't
> see
> iscsi target(network is down!) .
> I thinks it's the reason why node2 doesn't reboot now, because it can't
> make
> operation on key reservation and watchdog can't check for this.
> When network come back, watchdog can check for key registration and reboot
> node2.
>
> For clustered filesystem I planned to use ping resource with location
> constraint as described here
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
> If the node can't see iscsi target..then..stop AppServer, Filesystem ecc
>
> But it doesn't works. In the node with network failure i see in the log
> that
> pingd is set to 0 but Filesystem resource doesn't stop.
>
> I will continue testing...
>
> Thanks
> Andrea
>
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org <javascript:;>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
Digimer
2015-02-05 08:18:50 UTC
Permalink
That is the problem that makes geo-clustering very hard, to nearly
impossible. You can look at the Booth option for Pacemaker, but that
requires two (or more) full clusters, plus an arbitrator at a 3rd location.
Outside of this, there really is no way to have geo/stretch
clustering with automatic failover.

digimer

On 05/02/15 03:38 AM, Dmitry Koterov wrote:
> Could you please give a hint: how to use fencing in case the nodes are
> all in different geo-distributed datacenters? How people do that?
> Because there could be a network disconnection between datacenters, and
> we have no chance to send a stonith signal somewhere.
>
> On Wednesday, February 4, 2015, Andrea <***@codices.com
> <mailto:***@codices.com>> wrote:
>
> Digimer <***@...> writes:
>
> >
> > That fence failed until the network came back makes your fence method
> > less than ideal. Will it eventually fence with the network still
> failed?
> >
> > Most importantly though; Cluster resources blocked while the
> fence was
> > pending? If so, then your cluster is safe, and that is the most
> > important part.
> >
> Hi Digimer
>
> I'm using for fencing a remote NAS, attached via iscsi target.
> During network failure, for example on node2, each node try to fence
> other node.
> Fencing action on node1 get success, but on node2 fail, because it
> can't see
> iscsi target(network is down!) .
> I thinks it's the reason why node2 doesn't reboot now, because it
> can't make
> operation on key reservation and watchdog can't check for this.
> When network come back, watchdog can check for key registration and
> reboot
> node2.
>
> For clustered filesystem I planned to use ping resource with location
> constraint as described here
> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
> If the node can't see iscsi target..then..stop AppServer, Filesystem ecc
>
> But it doesn't works. In the node with network failure i see in the
> log that
> pingd is set to 0 but Filesystem resource doesn't stop.
>
> I will continue testing...
>
> Thanks
> Andrea
>
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org <javascript:;>
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-02-05 16:08:56 UTC
Permalink
> > On Wednesday, February 4, 2015, Andrea <***@...
> > <mailto:***@...>> wrote:
> >
> > Digimer <lists <at> ...> writes:
> >
> > >
> > > That fence failed until the network came back makes your fence method
> > > less than ideal. Will it eventually fence with the network still
> > failed?
> > >
> > > Most importantly though; Cluster resources blocked while the
> > fence was
> > > pending? If so, then your cluster is safe, and that is the most
> > > important part.
> > >
> > Hi Digimer
> >
> > I'm using for fencing a remote NAS, attached via iscsi target.
> > During network failure, for example on node2, each node try to fence
> > other node.
> > Fencing action on node1 get success, but on node2 fail, because it
> > can't see
> > iscsi target(network is down!) .
> > I thinks it's the reason why node2 doesn't reboot now, because it
> > can't make
> > operation on key reservation and watchdog can't check for this.
> > When network come back, watchdog can check for key registration and
> > reboot
> > node2.
> >
> > For clustered filesystem I planned to use ping resource with location
> > constraint as described here
> >
http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
> > If the node can't see iscsi target..then..stop AppServer, Filesystem ecc
> >
> > But it doesn't works. In the node with network failure i see in the
> > log that
> > pingd is set to 0 but Filesystem resource doesn't stop.
> >
> > I will continue testing...
> >
> > Thanks
> > Andrea
> >
> >
> >

Hi

I tested a location constraint with a ping resource to stop resources on the
disconnected node, but with stonith active it doesn't work.

I used a dummy resource for the test:
[ONE] pcs resource create mydummy ocf:pacemaker:Dummy op monitor
interval=120s --clone

Ping resource
[ONE] pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000
host_list=pingnode --clone

Location Constraint
[ONE] pcs constraint location mydummy rule score=-INFINITY pingd lt 1 or
not_defined pingd


If the ping node becomes unreachable from node2, I see the pingd attribute on
node2 set to 0 and the dummy resources stop on node2.
If I cut off the entire network on node2, I see the pingd attribute on node2
set to 0 but the dummy resource never stops.
During the network failure the stonith agent is active and tries to fence node1,
without success.
Why? Is it the failed fence action that blocks the location constraint?
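(While digging into that, a few read-only checks should show what the cluster
thinks is going on; a sketch using the stock pacemaker CLI tools, with option
spellings as of the 1.1.x series:

crm_mon -A -1                 # one-shot status including node attributes such as pingd
crm_simulate -s -L            # allocation scores computed from the live CIB
stonith_admin -H serverHA1    # fencing history for the node being fenced

If the fence of node1 is still pending, the scores and pending actions should
make that visible.)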


Andrea





_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
l***@alteeve.ca
2015-02-05 21:59:34 UTC
Permalink
On 2015-02-05 17:08, Andrea wrote:
> Hi
>
> I test location constraint with ping resource to stop resource on
> disconnected node, but with stonith active, doesn't works
>
> I used dummy resource for test
> [ONE]pcs resource create mydummy ocf:pacemaker:Dummy op monitor
> interval=120s --clone
>
> Ping resource
> [ONE] pcs resource create ping ocf:pacemaker:ping dampen=5s
> multiplier=1000
> host_list=pingnode --clone
>
> Location Constraint
> [ONE]pcs constraint location mydummy rule score=-INFINITY pingd lt 1 or
> not_defined pingd
>
>
> If the pingnode became not visible on node2, I will see pingd attribute
> on
> node2 set to 0 and dummy resources stop on node2.
> If I cut off nentire network on node2, I will see pingd attribute on
> node2
> set to 0 bud dummy resource never stop.
> During network failure...stonith agent is active and try to fence node
> 1
> without success.
> Why? Is the failed fence action that block location constraint?
>
>
> Andrea

When you disable stonith, pacemaker just assumes that "no contact" ==
"peer dead", so recovery happens. This gives a false sense of security,
though, because most people test by actually crashing a node, so there
is no risk of a split-brain. The problem is that, in the real world, this
cannot be assured. A node can be running just fine while only the connection
fails. If you disable stonith, you get a split-brain.

So when you enable stonith, and you really must, then pacemaker will
never make an assumption about the state of the peer. So when the peer
stops responding, pacemaker blocks and calls a fence. It will then sit
there and wait for the fence to succeed. If the fence *doesn't* succeed,
it ends up staying blocked. This is the proper behaviour!

Now, if you enable stonith *and* it is configured properly, then you
will see that recovery proceeds as expected *after* the fence action
completes successfully. So, set up stonith! :)
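(A quick way to confirm a device really works before trusting it is to fence a
node by hand, for example with pcs as used earlier in this thread:

pcs stonith fence serverHA2

and then verify the node was actually cut off; with fence_scsi that means its
key is gone and the watchdog reboots it.)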


_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Dejan Muhamedagic
2015-02-06 15:15:44 UTC
Permalink
Hi,

On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> That is the problem that makes geo-clustering very hard to nearly
> impossible. You can look at the Booth option for pacemaker, but that
> requires two (or more) full clusters, plus an arbitrator 3rd

A full cluster can consist of one node only. Hence, it is
possible to have a kind of stretch two-node [multi-site] cluster
based on tickets and managed by booth.
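(A minimal sketch of the booth side of such a setup, with placeholder addresses;
the option spelling varies a bit between booth releases:

# /etc/booth/booth.conf on both sites and the arbitrator
transport = UDP
port = 9929
site = 192.168.100.10
site = 192.168.200.10
arbitrator = 192.168.150.10
ticket = "ticketA"

Each site runs its own pacemaker cluster, one node or more, and booth grants the
ticket to at most one site at a time.)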

Thanks,

Dejan

> location. Outside of this though, there really is no way to have
> geo/stretch clustering with automatic failover.
>
> digimer
>
> On 05/02/15 03:38 AM, Dmitry Koterov wrote:
> >Could you please give a hint: how to use fencing in case the nodes are
> >all in different geo-distributed datacenters? How people do that?
> >Because there could be a network disconnection between datacenters, and
> >we have no chance to send a stonith signal somewhere.
> >
> >On Wednesday, February 4, 2015, Andrea <***@codices.com
> ><mailto:***@codices.com>> wrote:
> >
> > Digimer <***@...> writes:
> >
> > >
> > > That fence failed until the network came back makes your fence method
> > > less than ideal. Will it eventually fence with the network still
> > failed?
> > >
> > > Most importantly though; Cluster resources blocked while the
> > fence was
> > > pending? If so, then your cluster is safe, and that is the most
> > > important part.
> > >
> > Hi Digimer
> >
> > I'm using for fencing a remote NAS, attached via iscsi target.
> > During network failure, for example on node2, each node try to fence
> > other node.
> > Fencing action on node1 get success, but on node2 fail, because it
> > can't see
> > iscsi target(network is down!) .
> > I thinks it's the reason why node2 doesn't reboot now, because it
> > can't make
> > operation on key reservation and watchdog can't check for this.
> > When network come back, watchdog can check for key registration and
> > reboot
> > node2.
> >
> > For clustered filesystem I planned to use ping resource with location
> > constraint as described here
> > http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03s02.html
> > If the node can't see iscsi target..then..stop AppServer, Filesystem ecc
> >
> > But it doesn't works. In the node with network failure i see in the
> > log that
> > pingd is set to 0 but Filesystem resource doesn't stop.
> >
> > I will continue testing...
> >
> > Thanks
> > Andrea
> >
> >
> >
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: ***@oss.clusterlabs.org <javascript:;>
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> >_______________________________________________
> >Pacemaker mailing list: ***@oss.clusterlabs.org
> >http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> >Project Home: http://www.clusterlabs.org
> >Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >Bugs: http://bugs.clusterlabs.org
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person
> without access to education?
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Lars Ellenberg
2015-02-09 15:41:19 UTC
Permalink
On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> > That is the problem that makes geo-clustering very hard to nearly
> > impossible. You can look at the Booth option for pacemaker, but that
> > requires two (or more) full clusters, plus an arbitrator 3rd
>
> A full cluster can consist of one node only. Hence, it is
> possible to have a kind of stretch two-node [multi-site] cluster
> based on tickets and managed by booth.

In theory.

In practice, we rely on "proper behaviour" of "the other site",
in case a ticket is revoked, or cannot be renewed.

Relying on a single node for "proper behaviour" does not inspire
as much confidence as relying on a multi-node HA-cluster at each site,
which we can expect to ensure internal fencing.

With reliable hardware watchdogs, it still should be ok to do
"stretched two node HA clusters" in a reliable way.

Be generous with timeouts.

And document which failure modes you expect to handle,
and how to deal with the worst-case scenarios if you end up with some
failure case that you are not equipped to handle properly.

There are deployments which favor
"rather online with _potential_ split brain" over
"rather offline just in case".

Document this, print it out on paper,

"I am aware that this may lead to lost transactions,
data divergence, data corruption, or data loss.
I am personally willing to take the blame,
and live with the consequences."

Have some "boss" sign that ^^^
in the real world using a real pen.

Lars

--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA and Pacemaker support and consulting

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Dejan Muhamedagic
2015-02-10 14:58:57 UTC
Permalink
On Mon, Feb 09, 2015 at 04:41:19PM +0100, Lars Ellenberg wrote:
> On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote:
> > Hi,
> >
> > On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> > > That is the problem that makes geo-clustering very hard to nearly
> > > impossible. You can look at the Booth option for pacemaker, but that
> > > requires two (or more) full clusters, plus an arbitrator 3rd
> >
> > A full cluster can consist of one node only. Hence, it is
> > possible to have a kind of stretch two-node [multi-site] cluster
> > based on tickets and managed by booth.
>
> In theory.
>
> In practice, we rely on "proper behaviour" of "the other site",
> in case a ticket is revoked, or cannot be renewed.
>
> Relying on a single node for "proper behaviour" does not inspire
> as much confidence as relying on a multi-node HA-cluster at each site,
> which we can expect to ensure internal fencing.
>
> With reliable hardware watchdogs, it still should be ok to do
> "stretched two node HA clusters" in a reliable way.
>
> Be generous with timeouts.

As always.

> And document which failure modes you expect to handle,
> and how to deal with the worst-case scenarios if you end up with some
> failure case that you are not equipped to handle properly.
>
> There are deployments which favor
> "rather online with _potential_ split brain" over
> "rather offline just in case".

There's an arbitrator which should help in case of split brain.

> Document this, print it out on paper,
>
> "I am aware that this may lead to lost transactions,
> data divergence, data corruption, or data loss.
> I am personally willing to take the blame,
> and live with the consequences."
>
> Have some "boss" sign that ^^^
> in the real world using a real pen.

Well, of course running such a "stretch" cluster would be
rather different from a "normal" one.

The essential thing is that there's no fencing, unless configured
as a dead-man switch for the ticket. Given that booth has a
"sanity" program hook, maybe that could be utilized to verify if
this side of the cluster is healthy enough.

Thanks,

Dejan

> Lars
>
> --
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA and Pacemaker support and consulting
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrei Borzenkov
2015-02-11 06:10:45 UTC
Permalink
On Tue, 10 Feb 2015 15:58:57 +0100
Dejan Muhamedagic <***@fastmail.fm> wrote:

> On Mon, Feb 09, 2015 at 04:41:19PM +0100, Lars Ellenberg wrote:
> > On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote:
> > > Hi,
> > >
> > > On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> > > > That is the problem that makes geo-clustering very hard to nearly
> > > > impossible. You can look at the Booth option for pacemaker, but that
> > > > requires two (or more) full clusters, plus an arbitrator 3rd
> > >
> > > A full cluster can consist of one node only. Hence, it is
> > > possible to have a kind of stretch two-node [multi-site] cluster
> > > based on tickets and managed by booth.
> >
> > In theory.
> >
> > In practice, we rely on "proper behaviour" of "the other site",
> > in case a ticket is revoked, or cannot be renewed.
> >
> > Relying on a single node for "proper behaviour" does not inspire
> > as much confidence as relying on a multi-node HA-cluster at each site,
> > which we can expect to ensure internal fencing.
> >
> > With reliable hardware watchdogs, it still should be ok to do
> > "stretched two node HA clusters" in a reliable way.
> >
> > Be generous with timeouts.
>
> As always.
>
> > And document which failure modes you expect to handle,
> > and how to deal with the worst-case scenarios if you end up with some
> > failure case that you are not equipped to handle properly.
> >
> > There are deployments which favor
> > "rather online with _potential_ split brain" over
> > "rather offline just in case".
>
> There's an arbitrator which should help in case of split brain.
>

You can never really differentiate between a site being down and a site being
cut off due to a (network) infrastructure outage. An arbitrator can mitigate
split brain only to the extent you trust your network. You still have to take
a decision on what you value more - data availability or data consistency.

Long-distance clusters are really for disaster recovery. It is
convenient to have a single button that starts up all resources in a
controlled manner, but someone really needs to decide to push that
button.

> > Document this, print it out on paper,
> >
> > "I am aware that this may lead to lost transactions,
> > data divergence, data corruption, or data loss.
> > I am personally willing to take the blame,
> > and live with the consequences."
> >
> > Have some "boss" sign that ^^^
> > in the real world using a real pen.
>
> Well, of course running such a "stretch" cluster would be
> rather different from a "normal" one.
>
> The essential thing is that there's no fencing, unless configured
> as a dead-man switch for the ticket. Given that booth has a
> "sanity" program hook, maybe that could be utilized to verify if
> this side of the cluster is healthy enough.
>
> Thanks,
>
> Dejan
>
> > Lars
> >
> > --
> > : Lars Ellenberg
> > : http://www.LINBIT.com | Your Way to High Availability
> > : DRBD, Linux-HA and Pacemaker support and consulting
> >
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> >
> > _______________________________________________
> > Pacemaker mailing list: ***@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> _______________________________________________
> Pacemaker mailing list: ***@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Dejan Muhamedagic
2015-02-11 10:30:57 UTC
Permalink
On Wed, Feb 11, 2015 at 09:10:45AM +0300, Andrei Borzenkov wrote:
> On Tue, 10 Feb 2015 15:58:57 +0100
> Dejan Muhamedagic <***@fastmail.fm> wrote:
>
> > On Mon, Feb 09, 2015 at 04:41:19PM +0100, Lars Ellenberg wrote:
> > > On Fri, Feb 06, 2015 at 04:15:44PM +0100, Dejan Muhamedagic wrote:
> > > > Hi,
> > > >
> > > > On Thu, Feb 05, 2015 at 09:18:50AM +0100, Digimer wrote:
> > > > > That is the problem that makes geo-clustering very hard to nearly
> > > > > impossible. You can look at the Booth option for pacemaker, but that
> > > > > requires two (or more) full clusters, plus an arbitrator 3rd
> > > >
> > > > A full cluster can consist of one node only. Hence, it is
> > > > possible to have a kind of stretch two-node [multi-site] cluster
> > > > based on tickets and managed by booth.
> > >
> > > In theory.
> > >
> > > In practice, we rely on "proper behaviour" of "the other site",
> > > in case a ticket is revoked, or cannot be renewed.
> > >
> > > Relying on a single node for "proper behaviour" does not inspire
> > > as much confidence as relying on a multi-node HA-cluster at each site,
> > > which we can expect to ensure internal fencing.
> > >
> > > With reliable hardware watchdogs, it still should be ok to do
> > > "stretched two node HA clusters" in a reliable way.
> > >
> > > Be generous with timeouts.
> >
> > As always.
> >
> > > And document which failure modes you expect to handle,
> > > and how to deal with the worst-case scenarios if you end up with some
> > > failure case that you are not equipped to handle properly.
> > >
> > > There are deployments which favor
> > > "rather online with _potential_ split brain" over
> > > "rather offline just in case".
> >
> > There's an arbitrator which should help in case of split brain.
> >
>
> You can never really differentiate between site down and site cut off
> due to (network) infrastructure outage. Arbitrator can mitigate split
> brain only to the extent you trust your network. You still have to take
> decision what you value more - data availability or data consistency.

Right, that's why I mentioned the ticket loss policy. If booth drops
the ticket, pacemaker would fence the node (if loss-policy=fence).
Booth guarantees that no two sites will hold the ticket at the
same time. Of course, you have to trust booth to function
properly, but I guess that's a different story.
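(In crm shell terms that dead-man behaviour is just a ticket constraint on the
resources; a sketch, where AppServer stands in for whatever the site actually
runs:

crm configure rsc_ticket ticketA-appserver ticketA: AppServer loss-policy=fence

If booth revokes the ticket, or it cannot be renewed, the site loses the right
to run AppServer and, with loss-policy=fence, the nodes running it are fenced.)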

Thanks,

Dejan

> Long distance clusters are really for disaster recovery. It is
> convenient to have a single button that starts up all resources in
> controlled manner, but someone really need to decide to push this
> button.
>
> > > Document this, print it out on paper,
> > >
> > > "I am aware that this may lead to lost transactions,
> > > data divergence, data corruption, or data loss.
> > > I am personally willing to take the blame,
> > > and live with the consequences."
> > >
> > > Have some "boss" sign that ^^^
> > > in the real world using a real pen.
> >
> > Well, of course running such a "stretch" cluster would be
> > rather different from a "normal" one.
> >
> > The essential thing is that there's no fencing, unless configured
> > as a dead-man switch for the ticket. Given that booth has a
> > "sanity" program hook, maybe that could be utilized to verify if
> > this side of the cluster is healthy enough.
> >
> > Thanks,
> >
> > Dejan
> >
> > > Lars
> > >
> > > --
> > > : Lars Ellenberg
> > > : http://www.LINBIT.com | Your Way to High Availability
> > > : DRBD, Linux-HA and Pacemaker support and consulting
> > >
> > > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > >
> > > _______________________________________________
> > > Pacemaker mailing list: ***@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> >
> > _______________________________________________
> > Pacemaker mailing list: ***@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>

_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Andrea
2015-02-06 09:08:52 UTC
Permalink
<***@...> writes:

> > If the pingnode became not visible on node2, I will see pingd attribute
> > on
> > node2 set to 0 and dummy resources stop on node2.
> > If I cut off nentire network on node2, I will see pingd attribute on
> > node2
> > set to 0 bud dummy resource never stop.
> > During network failure...stonith agent is active and try to fence node
> > 1
> > without success.
> > Why? Is the failed fence action that block location constraint?
> >
> >
> > Andrea
>
> When you disable stonith, pacemaker just assumes that "no contact" ==
> "peer dead", so recovery happens. This is a very false sense of security
> though, because most people test by actually crashing a node, so there
> is no risk of a split-brain. The problem is, in the real world, this can
> not be assured. A node can be running just fine, but the connection
> fails. If you disable stonith, you get a split-brain.
>
> So when you enable stonith, and you really must, then pacemaker will
> never make an assumption about the state of the peer. So when the peer
> stops responding, pacemaker blocks and calls a fence. It will then sit
> there and wait for the fence to succeed. If the fence *doesn't* succeed,
> it ends up staying blocked. This is the proper behaviour!
>
> Now, if you enable stonith *and* it is configured properly, then you
> will see that recovery proceeds as expected *after* the fence action
> completes successfully. So, setup stonith! :)
>

Hi

I don't want to disable stonith; I have stonith enabled, and I use it.
The problem is that during the network failure on node2, the fence action is
triggered on that node, but it fails. It fails because it is the disconnected
node. It also doesn't reboot, because the watchdog can't check the key
registration.
Is there a method to stop resources on this node?
Maybe fence_scsi on a remote iSCSI target isn't the right solution?

Andrea




_______________________________________________
Pacemaker mailing list: ***@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org