[Bug 836107] New: DLM does not initiate recovery on node failure
https://bugzilla.novell.com/show_bug.cgi?id=836107#c0

Summary: DLM does not initiate recovery on node failure
Classification: openSUSE
Product: openSUSE Factory
Version: 13.1 Milestone 4
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: High Availability
AssignedTo: lzhong@suse.com
ReportedBy: rgoldwyn@suse.com
QAContact: qa-bugs@suse.de
CC: lmb@suse.com, ygao@suse.com
Found By: Development
Blocker: ---

DLM does not initiate node recovery on node failure/crash. Instead, it starts recovery only once the failed node has rebooted and dlm has been restarted.

It does initiate a fence request:

2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith

but fails because of no actor:

2013-08-21T18:21:58.158838-05:00 opensuse1 dlm_controld[1998]: 411 fence request 1084752300 no actor

On node recovery, it says the recovered node needs fencing:

2013-08-21T18:22:28.326046-05:00 opensuse1 dlm_controld[1998]: 442 daemon joined 1084752300 needs fencing

And finally it initiates recovery on rejoin.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c1
--- Comment #1 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c2
--- Comment #2 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c3
--- Comment #3 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c4
--- Comment #4 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c5
--- Comment #5 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c6
--- Comment #6 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c7
--- Comment #7 from Lidong Zhong
> Lidong: I have been able to set up ocfs2-tools so that the cluster filesystem can come online. Please (re)install ocfs2-kmp and the latest ocfs2-tools from Factory to check. Do cross-check using modinfo that you are using the correct ocfs2.ko module.
> You *will need* the o2cb RA, since it contains the cluster_stack setup. So don't remove it just yet, even though the logs ask you to.
> You may have to fix the Filesystem RA as well.
I built the cluster stack based on the latest version from the network:ha-clustering repo, including ocfs2-kmp:

linux-vphi:~ # modinfo ocfs2
filename:       /lib/modules/3.10.1-3.g0cd5432-desktop/kernel/fs/ocfs2/ocfs2.ko
alias:          fs-ocfs2
license:        GPL
author:         Oracle
version:        1.5.0
description:    OCFS2 1.5.0
srcversion:     A83BD5EC31B5785FE1BD58B
depends:        ocfs2_stackglue,quota_tree,ocfs2_nodemanager
intree:         Y
vermagic:       3.10.1-3.g0cd5432-desktop SMP preempt mod_unload modversions

linux-vphi:~ # uname -r
3.10.1-3.g0cd5432-desktop

The ocfs2 RA still failed to start, and the messages show:

2013-08-27T16:09:07.156448+08:00 opensuse131-1 Filesystem(ocfs2)[12121]: INFO: Running start for /dev/sdf2 on /mnt/shared
2013-08-27T16:09:07.163998+08:00 opensuse131-1 Filesystem(ocfs2)[12121]: ERROR: /dev/sdf2: ocfs2 is not compatible with your environment.
2013-08-27T16:09:07.166941+08:00 opensuse131-1 crmd[12025]: notice: process_lrm_event: LRM operation ocfs2_start_0 (call=28, rc=6, cib-update=14, confirmed=true) not configured
2013-08-27T16:09:07.171978+08:00 opensuse131-1 attrd[12023]: notice: attrd_cs_dispatch: Update relayed from opensuse131-2
2013-08-27T16:09:07.172610+08:00 opensuse131-1 attrd[12023]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ocfs2 (INFINITY)
2013-08-27T16:09:07.173206+08:00 opensuse131-1 attrd[12023]: notice: attrd_perform_update: Sent update 13: fail-count-ocfs2=INFINITY
2013-08-27T16:09:07.173600+08:00 opensuse131-1 attrd[12023]: notice: attrd_cs_dispatch: Update relayed from opensuse131-2
2013-08-27T16:09:07.173993+08:00 opensuse131-1 attrd[12023]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ocfs2 (1377590947)
2013-08-27T16:09:07.174155+08:00 opensuse131-1 attrd[12023]: notice: attrd_perform_update: Sent update 16: last-failure-ocfs2=1377590947
https://bugzilla.novell.com/show_bug.cgi?id=836107#c8
--- Comment #8 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c9
--- Comment #9 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c10
--- Comment #10 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c
Lars Marowsky-Bree
https://bugzilla.novell.com/show_bug.cgi?id=836107#c11
--- Comment #11 from Lars Marowsky-Bree
https://bugzilla.novell.com/show_bug.cgi?id=836107#c12
--- Comment #12 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c13
--- Comment #13 from Lidong Zhong
> 2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith
> but fails because of no actor

Here is how dlm_controld works when a node goes down. The other two nodes record the nodeids of the surviving members in their fence-actor lists. After a fence request completes, the acting node sends the fence result to the other nodes. In receive_fence_result(), a nodeid is cleared from the fence-actor list if the fence succeeded, so eventually all nodeids in the fence-actor lists are cleared.

> 2013-08-21T18:21:58.158838-05:00 opensuse1 dlm_controld[1998]: 411 fence request 1084752300 no actor

The actor returned from get_fence_actor() here is only used to check whether the local node is the one that should send the fence request. So I believe this log line is normal logic.
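To make the bookkeeping above concrete, here is a minimal sketch in C. It is illustrative only, not the actual dlm_controld source; MAX_NODES, fence_actors[] and num_actors are invented for the example:

/* Illustrative sketch of the fence-actor bookkeeping described above.
 * Not the actual dlm_controld source; all names here are invented. */
#define MAX_NODES 64

static int fence_actors[MAX_NODES];  /* surviving members, lowest nodeid first */
static int num_actors;

/* nodedown: every surviving node records the survivors as fence actors */
static void set_fence_actors(const int *members, int count)
{
    for (int i = 0; i < count; i++)
        fence_actors[i] = members[i];
    num_actors = count;
}

/* the lowest surviving nodeid is the one expected to run the agent;
 * each node compares this against its own nodeid */
static int get_fence_actor(void)
{
    return num_actors ? fence_actors[0] : 0;  /* 0 means "no actor" */
}

/* fence result received: on success every node clears its actor list;
 * on failure the failed actor is dropped so the next one takes over */
static void receive_fence_result(int from_nodeid, int result)
{
    if (result == 0) {
        num_actors = 0;
        return;
    }
    for (int i = 0; i < num_actors; i++) {
        if (fence_actors[i] == from_nodeid) {
            for (int j = i; j < num_actors - 1; j++)
                fence_actors[j] = fence_actors[j + 1];
            num_actors--;
            break;
        }
    }
}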
> On node recovery, it says the recovered node needs fencing
> 2013-08-21T18:22:28.326046-05:00 opensuse1 dlm_controld[1998]: 442 daemon joined 1084752300 needs fencing

When the node comes back up, the need_fencing flag is still set because the node was once lost. However, the flag is cleared as soon as the node is in the CLEAN state, which it will be. This message does not actually initiate a fence request.
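Again purely illustrative (invented names, not the dlm_controld source), the flag lifecycle being described is roughly:

/* Illustrative sketch of the need_fencing lifecycle described above. */
enum node_state { NODE_STATE_LOST, NODE_STATE_CLEAN };

struct node {
    int nodeid;
    int need_fencing;        /* set when the node is lost */
    enum node_state state;
};

/* nodedown: mark the node lost and in need of fencing */
static void node_lost(struct node *n)
{
    n->state = NODE_STATE_LOST;
    n->need_fencing = 1;
}

/* daemon rejoin: the flag is still set from the earlier loss, which
 * produces the "daemon joined ... needs fencing" log line; it is
 * simply cleared once the node is seen in the CLEAN state, and no
 * fence request is issued */
static void daemon_joined(struct node *n)
{
    if (n->state == NODE_STATE_CLEAN)
        n->need_fencing = 0;
}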
> And finally initiates recovery on rejoin.

Goldwyn, I could see the ocfs2 recovery log a few days ago, but there isn't any today. Have you made any changes to ocfs2? All in all, from my point of view DLM behaves normally during the fence.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c14
--- Comment #14 from Goldwyn Rodrigues
(In reply to comment #13)
> Goldwyn, I could see the ocfs2 recovery log a few days ago, but there isn't any today. Have you made any changes to ocfs2?

I was testing mkfs.ocfs2, so I had disabled the c-clusterfs resource. That is why you did not see any ocfs2 messages: ocfs2 was not mounted.
https://bugzilla.novell.com/show_bug.cgi?id=836107#c15
--- Comment #15 from Goldwyn Rodrigues
(In reply to comment #13)
> > 2013-08-21T18:21:57.154662-05:00 opensuse1 dlm_controld[1998]: 410 fence request 1084752300 pid 3674 nodedown time 1377127317 fence_all dlm_stonith
> > but fails because of no actor
> Here is how dlm_controld works when a node goes down. The other two nodes record the nodeids of the surviving members in their fence-actor lists. After a fence request completes, the acting node sends the fence result to the other nodes. In receive_fence_result(), a nodeid is cleared from the fence-actor list if the fence succeeded, so eventually all nodeids in the fence-actor lists are cleared.

The problem here is that pacemaker is performing the fence for you. How does DLM know that the node has been fenced, so that it can clear the flag? What if the crashed machine never comes up at all?
https://bugzilla.novell.com/show_bug.cgi?id=836107#c16
--- Comment #16 from Goldwyn Rodrigues
https://bugzilla.novell.com/show_bug.cgi?id=836107#c17
--- Comment #17 from Goldwyn Rodrigues
> man dlm_stonith for details.

The dlm_stonith man page is not present in libdlm-4.0.2.tar.gz, but it is present in the upstream dlm git: https://git.fedorahosted.org/cgit/dlm.git/tree/fence/dlm_stonith.8
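For reference, dlm_stonith is essentially a small proxy that forwards dlm_controld's fence request to pacemaker's fencing daemon. A rough sketch of that interaction, assuming pacemaker's stonith_api_kick_helper() and stonith_api_time_helper() from <crm/stonith-ng.h> (an illustration of the idea, not the actual dlm_stonith source; fence_node() and the timeout value are invented):

/* Illustrative sketch only; not the actual dlm_stonith source.
 * Assumes pacemaker's helper API from <crm/stonith-ng.h>. */
#include <stdint.h>
#include <stdbool.h>
#include <time.h>
#include <crm/stonith-ng.h>

/* Ask pacemaker to fence nodeid, then confirm that the recorded
 * fence time is newer than the nodedown time dlm_controld passed. */
static int fence_node(uint32_t nodeid, time_t nodedown_time)
{
    int rc = stonith_api_kick_helper(nodeid, 300 /* timeout */, true /* off */);
    if (rc != 0)
        return rc;                 /* the kick itself failed */

    time_t fenced_at = stonith_api_time_helper(nodeid, false);
    return (fenced_at >= nodedown_time) ? 0 : -1;
}

This is also how pacemaker-driven fencing can answer Goldwyn's question above: dlm_controld learns of a completed fence by asking pacemaker *when* the node was last fenced, rather than by fencing it itself.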
https://bugzilla.novell.com/show_bug.cgi?id=836107#c18
--- Comment #18 from Lidong Zhong
https://bugzilla.novell.com/show_bug.cgi?id=836107#c19
--- Comment #19 from Lars Marowsky-Bree
> Yes, dlm_stonith is the default fence agent if there is no dlm.conf. But there seems to be some problem when the agent is run. For example, the code in run_agent():
>
>     execlp(agent, agent, NULL);  // no args here
>     exit(EXIT_FAILURE);          // why not EXIT_SUCCESS?
execlp() only returns if executing the new process has fundamentally failed; otherwise, the new process replaces the current one and is itself responsible for setting the exit code. The code above is correct.
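To make the point concrete, here is a minimal self-contained example of the pattern (with /bin/true as a stand-in for the fence agent): the exit(EXIT_FAILURE) after execlp() only ever runs when the exec itself fails, while a successful exec hands control, and the exit status, to the new program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    const char *agent = "/bin/true";  /* stand-in for the fence agent */
    pid_t pid = fork();

    if (pid == 0) {
        /* child: on success execlp() never returns; the agent's own
         * exit status becomes the child's exit status */
        execlp(agent, agent, (char *)NULL);
        /* reached only if the exec itself failed, e.g. agent missing */
        perror("execlp");
        exit(EXIT_FAILURE);
    }

    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("agent exited with status %d\n", WEXITSTATUS(status));
    return 0;
}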
https://bugzilla.novell.com/show_bug.cgi?id=836107#c20
--- Comment #20 from Lidong Zhong