[Bug 836107] New: DLM does not initiate recovery on node failure
https://bugzilla.novell.com/show_bug.cgi?id=836107#c0

Summary: DLM does not initiate recovery on node failure
Classification: openSUSE
Product: openSUSE Factory
Version: 13.1 Milestone 4
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: High Availability
AssignedTo: lzhong@suse.com
ReportedBy: rgoldwyn@suse.com
QAContact: qa-bugs@suse.de
CC: lmb@suse.com, ygao@suse.com
Found By: Development
Blocker: ---

DLM does not initiate node recovery on node failure/crash. Instead, it starts recovery only once the failed node has rebooted and dlm has been restarted.

It does initiate a fence request:

2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith

but fails because of no actor:

2013-08-21T18:21:58.158838-05:00 opensuse1 dlm_controld[1998]: 411 fence request 1084752300 no actor

On node recovery, it says the recovered node needs fencing:

2013-08-21T18:22:28.326046-05:00 opensuse1 dlm_controld[1998]: 442 daemon joined 1084752300 needs fencing

And finally it initiates recovery on rejoin.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c1
--- Comment #1 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c2
--- Comment #2 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c3
--- Comment #3 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c4
--- Comment #4 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c5
--- Comment #5 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c6
--- Comment #6 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c7
--- Comment #7 from Lidong Zhong
> Lidong: I have been able to set up ocfs2-tools so that the cluster filesystem can come online. Please (re)install ocfs2-kmp and the latest ocfs2-tools from Factory to check. Do cross-check using modinfo that you are using the correct ocfs2.ko module.
> You *will need* the o2cb RA, since it contains the cluster_stack setup. So don't remove it just yet, even though the logs ask you to.
> You may have to fix the Filesystem RA as well.
I built the cluster stack based on the latest version from the network:ha-clustering repo, including ocfs2-kmp:

linux-vphi:~ # modinfo ocfs2
filename:       /lib/modules/3.10.1-3.g0cd5432-desktop/kernel/fs/ocfs2/ocfs2.ko
alias:          fs-ocfs2
license:        GPL
author:         Oracle
version:        1.5.0
description:    OCFS2 1.5.0
srcversion:     A83BD5EC31B5785FE1BD58B
depends:        ocfs2_stackglue,quota_tree,ocfs2_nodemanager
intree:         Y
vermagic:       3.10.1-3.g0cd5432-desktop SMP preempt mod_unload modversions

linux-vphi:~ # uname -r
3.10.1-3.g0cd5432-desktop

The ocfs2 RA still failed to start, and the messages show:

2013-08-27T16:09:07.156448+08:00 opensuse131-1 Filesystem(ocfs2)[12121]: INFO: Running start for /dev/sdf2 on /mnt/shared
2013-08-27T16:09:07.163998+08:00 opensuse131-1 Filesystem(ocfs2)[12121]: ERROR: /dev/sdf2: ocfs2 is not compatible with your environment.
2013-08-27T16:09:07.166941+08:00 opensuse131-1 crmd[12025]: notice: process_lrm_event: LRM operation ocfs2_start_0 (call=28, rc=6, cib-update=14, confirmed=true) not configured
2013-08-27T16:09:07.171978+08:00 opensuse131-1 attrd[12023]: notice: attrd_cs_dispatch: Update relayed from opensuse131-2
2013-08-27T16:09:07.172610+08:00 opensuse131-1 attrd[12023]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ocfs2 (INFINITY)
2013-08-27T16:09:07.173206+08:00 opensuse131-1 attrd[12023]: notice: attrd_perform_update: Sent update 13: fail-count-ocfs2=INFINITY
2013-08-27T16:09:07.173600+08:00 opensuse131-1 attrd[12023]: notice: attrd_cs_dispatch: Update relayed from opensuse131-2
2013-08-27T16:09:07.173993+08:00 opensuse131-1 attrd[12023]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ocfs2 (1377590947)
2013-08-27T16:09:07.174155+08:00 opensuse131-1 attrd[12023]: notice: attrd_perform_update: Sent update 16: last-failure-ocfs2=1377590947
https://bugzilla.novell.com/show_bug.cgi?id=836107#c8
--- Comment #8 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c9
--- Comment #9 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c10
--- Comment #10 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c
Lars Marowsky-Bree
https://bugzilla.novell.com/show_bug.cgi?id=836107#c11
--- Comment #11 from Lars Marowsky-Bree
https://bugzilla.novell.com/show_bug.cgi?id=836107#c12
--- Comment #12 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c13
--- Comment #13 from Lidong Zhong
> 2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith
> but fails because of no actor

Here is how dlm_controld works when a node goes down. The other two nodes record the nodeids of the surviving members in their fence-actor lists. After a fence request completes, the acting node sends the fence result to the other nodes. In receive_fence_result(), a nodeid is cleared from the fence-actor list if the fence succeeded, so eventually all nodeids in the fence-actor lists are cleared.

> 2013-08-21T18:21:58.158838-05:00 opensuse1 dlm_controld[1998]: 411 fence request 1084752300 no actor

The actor returned from get_fence_actor() here is only used to check whether the local node is the one that should send the fence request. So I believe this log line is normal logic.
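To make the bookkeeping above concrete, here is a minimal sketch in C. It is illustrative only, not the actual dlm_controld source; MAX_NODES, fence_actors[] and num_actors are invented for the example:

/* Illustrative sketch of the fence-actor bookkeeping described above.
 * Not the actual dlm_controld source; all names here are invented. */
#define MAX_NODES 64

static int fence_actors[MAX_NODES];  /* surviving members, lowest nodeid first */
static int num_actors;

/* nodedown: every surviving node records the survivors as fence actors */
static void set_fence_actors(const int *members, int count)
{
    for (int i = 0; i < count; i++)
        fence_actors[i] = members[i];
    num_actors = count;
}

/* the lowest surviving nodeid is the one expected to run the agent;
 * each node compares this against its own nodeid */
static int get_fence_actor(void)
{
    return num_actors ? fence_actors[0] : 0;  /* 0 means "no actor" */
}

/* fence result received: on success every node clears its actor list;
 * on failure the failed actor is dropped so the next one takes over */
static void receive_fence_result(int from_nodeid, int result)
{
    if (result == 0) {
        num_actors = 0;
        return;
    }
    for (int i = 0; i < num_actors; i++) {
        if (fence_actors[i] == from_nodeid) {
            for (int j = i; j < num_actors - 1; j++)
                fence_actors[j] = fence_actors[j + 1];
            num_actors--;
            break;
        }
    }
}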
> On node recovery, it says the recovered node needs fencing
> 2013-08-21T18:22:28.326046-05:00 opensuse1 dlm_controld[1998]: 442 daemon joined 1084752300 needs fencing

When the node comes back up, the need_fencing flag is still set because the node was once lost. However, the flag is cleared as soon as the node is in the CLEAN state, which it will be. This message does not actually initiate a fence request.
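Again purely illustrative (invented names, not the dlm_controld source), the flag lifecycle being described is roughly:

/* Illustrative sketch of the need_fencing lifecycle described above. */
enum node_state { NODE_STATE_LOST, NODE_STATE_CLEAN };

struct node {
    int nodeid;
    int need_fencing;        /* set when the node is lost */
    enum node_state state;
};

/* nodedown: mark the node lost and in need of fencing */
static void node_lost(struct node *n)
{
    n->state = NODE_STATE_LOST;
    n->need_fencing = 1;
}

/* daemon rejoin: the flag is still set from the earlier loss, which
 * produces the "daemon joined ... needs fencing" log line; it is
 * simply cleared once the node is seen in the CLEAN state, and no
 * fence request is issued */
static void daemon_joined(struct node *n)
{
    if (n->state == NODE_STATE_CLEAN)
        n->need_fencing = 0;
}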
> And finally initiates recovery on rejoin.

Goldwyn, I could see the ocfs2 recovery log a few days ago, but there isn't any today. Have you made any changes to ocfs2? All in all, from my point of view DLM behaves normally during the fence.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c14
--- Comment #14 from Goldwyn Rodrigues
(In reply to comment #13)
> Goldwyn, I could see the ocfs2 recovery log a few days ago, but there isn't any today. Have you made any changes to ocfs2?

I was testing mkfs.ocfs2, so I had disabled the c-clusterfs resource. That is why you did not see any ocfs2 messages: ocfs2 was not mounted.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c15
--- Comment #15 from Goldwyn Rodrigues
(In reply to comment #13)
> > 2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith
> > but fails because of no actor
> Here is how dlm_controld works when a node goes down. The other two nodes record the nodeids of the surviving members in their fence-actor lists. After a fence request completes, the acting node sends the fence result to the other nodes. In receive_fence_result(), a nodeid is cleared from the fence-actor list if the fence succeeded, so eventually all nodeids in the fence-actor lists are cleared.

The problem here is that pacemaker is performing the fence for you. How does DLM know that the node has been fenced, so that it can clear the flag? What if the crashed machine never comes up at all?
https://bugzilla.novell.com/show_bug.cgi?id=836107#c16
--- Comment #16 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c17
--- Comment #17 from Goldwyn Rodrigues
> man dlm_stonith for details.

The dlm_stonith man page is not present in libdlm-4.0.2.tar.gz, but it is present in the upstream dlm git: https://git.fedorahosted.org/cgit/dlm.git/tree/fence/dlm_stonith.8
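For reference, dlm_stonith is essentially a small proxy that forwards dlm_controld's fence request to pacemaker's fencing daemon. A rough sketch of that interaction, assuming pacemaker's stonith_api_kick_helper() and stonith_api_time_helper() from <crm/stonith-ng.h> (an illustration of the idea, not the actual dlm_stonith source; fence_node() and the timeout value are invented):

/* Illustrative sketch only; not the actual dlm_stonith source.
 * Assumes pacemaker's helper API from <crm/stonith-ng.h>. */
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

/* Ask pacemaker to fence nodeid, then confirm that the recorded
 * fence time is newer than the nodedown time dlm_controld passed. */
static int fence_node(uint32_t nodeid, time_t nodedown_time)
{
    int rc = stonith_api_kick_helper(nodeid, 300 /* timeout */, true /* off */);
    if (rc != 0)
        return rc;                 /* the kick itself failed */

    time_t fenced_at = stonith_api_time_helper(nodeid, false);
    return (fenced_at >= nodedown_time) ? 0 : -1;
}

This is also how pacemaker-driven fencing can answer Goldwyn's question above: dlm_controld learns of a completed fence by asking pacemaker *when* the node was last fenced, rather than by fencing it itself.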
https://bugzilla.novell.com/show_bug.cgi?id=836107#c18
--- Comment #18 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c19
--- Comment #19 from Lars Marowsky-Bree
> Yes, dlm_stonith is the default fence agent if there is no dlm.conf. But there seems to be some problem when the agent is run. For example, the code in run_agent():
>
>     execlp(agent, agent, NULL);  // no args here
>     exit(EXIT_FAILURE);          // why not EXIT_SUCCESS?
execlp() only returns if executing the new process has fundamentally failed; otherwise, the new process replaces the current one and is itself responsible for setting the exit code. The code above is correct.
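To make the point concrete, here is a minimal self-contained example of the pattern (with /bin/true as a stand-in for the fence agent): the exit(EXIT_FAILURE) after execlp() only ever runs when the exec itself fails, while a successful exec hands control, and the exit status, to the new program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    const char *agent = "/bin/true";  /* stand-in for the fence agent */
    pid_t pid = fork();

    if (pid == 0) {
        /* child: on success execlp() never returns; the agent's own
         * exit status becomes the child's exit status */
        execlp(agent, agent, (char *)NULL);
        /* reached only if the exec itself failed, e.g. agent missing */
        perror("execlp");
        exit(EXIT_FAILURE);
    }

    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("agent exited with status %d\n", WEXITSTATUS(status));
    return 0;
}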
https://bugzilla.novell.com/show_bug.cgi?id=836107#c20
--- Comment #20 from Lidong Zhong