[Bug 363249] New: openSUSE10.3 crashes with DRBD and OCFS under heavy I/O load
https://bugzilla.novell.com/show_bug.cgi?id=363249

Summary: openSUSE10.3 crashes with DRBD and OCFS under heavy I/O load
Product: openSUSE 10.3
Version: Final
Platform: i586
OS/Version: openSUSE 10.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: b.burkhardt@gmx.de
QAContact: qa@suse.de
Found By: ---

Hi,

when using DRBD and OCFS2, openSUSE10.3 crashes under heavy I/O load. The system reboots immediately, with no log entries in messages or warn. After searching some mailing lists I found the following boot parameter, which was said to solve the problem: "elevator=deadline". I tested it, but without success.

When testing with DRBD and ext2, the problem did not occur. So it seems to be OCFS2 on top of the DRBD device that is causing the problem.

System data:
- SuSE10.3, Kernel 2.6.22.17-0.1-bigsmp
- drbd-8.0.6-8
- ocfs-tools-1.2.6-18
- CPU: 2x Intel(R) Xeon(TM) CPU 2.80GHz

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User roland.kletzing@materna.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c1

roland kletzing <roland.kletzing@materna.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roland.kletzing@materna.de

--- Comment #1 from roland kletzing <roland.kletzing@materna.de> 2008-03-02 08:18:27 MST ---
Maybe you can try attaching a serial console to catch any messages from dmesg? Does the system respond to SysRq triggers after the crash?

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c2

--- Comment #2 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-03-14 03:41:53 MST ---
Hi Roland,

sorry for my delayed answer. The cluster system where the problem first occurred is by now operating live using another configuration (no ocfs2). Therefore I had to install a new test system (on some "old" lab hardware).

The problem still exists. There is no output on either the serial console or the VGA console connected to the server. The system just unexpectedly reboots with NO message before rebooting (console, logfile, serial console). The reboot occurs even if I stop the ocfs2 and o2cb services on the other cluster node.

I produced the I/O with the bonnie++ command:

  bonnie++ -u0 -d <path-to-ocfs2-mount> -s 9000

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c3

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeffm@novell.com
             Status|NEW                         |NEEDINFO
      Info Provider|                            |b.burkhardt@gmx.de

--- Comment #3 from Jeff Mahoney <jeffm@novell.com> 2008-03-20 09:35:59 MST ---
That's strange. I expect it's actually OCFS2 fencing itself, but I'll have to reproduce it to know for sure.

On 10.3 there should be a /sys/o2cb/fence_method file. It defaults to "restart," which means it calls emergency_restart() for an immediate reboot. If OCFS2 is fencing itself, it should just panic and hang the system if you execute the following command before initiating your test again:

  # echo panic > /sys/o2cb/fence_method

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c4

--- Comment #4 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-03-26 04:45:41 MST ---
Hi Jeff,

sorry, but when I try to change the value of "fence_method" as user "root", I get the following error:

  echo panic > /sys/o2cb/fence_method
  -bash: echo: write error: Invalid argument
  echo "panic" > /sys/o2cb/fence_method
  -bash: echo: write error: Invalid argument

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c5

--- Comment #5 from Jeff Mahoney <jeffm@novell.com> 2008-03-26 09:27:32 MST ---
Sorry, that code is a bit... dumb. It needs to be "echo -n", not just "echo".

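The "Invalid argument" error in comment #4 is consistent with the trailing newline that plain `echo` appends, which is why comment #5 asks for `echo -n`. The following is a minimal sketch of that write, assuming only the `/sys/o2cb/fence_method` path and the two values ("restart" and "panic") named in this thread. The helper name `set_fence_method` and its optional second argument are hypothetical; the second argument exists only so the write can be tried against an ordinary file. `printf '%s'` is used in place of `echo -n` because it emits no newline in any POSIX shell.

```shell
#!/bin/sh
# set_fence_method is a hypothetical helper, not part of ocfs2-tools.
# Default target is the sysfs file mentioned in comment #3; pass a second
# argument to write somewhere else for a dry run.
set_fence_method() {
    method=$1
    target=${2:-/sys/o2cb/fence_method}
    case $method in
        restart|panic) ;;   # the only two values named in this thread
        *) echo "unsupported fence method: $method" >&2; return 1 ;;
    esac
    # printf '%s' writes exactly the bytes of $method, with no trailing newline.
    printf '%s' "$method" > "$target"
}
```

Run as root, `set_fence_method panic` would be equivalent to `echo -n panic > /sys/o2cb/fence_method` in bash.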
https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c6

Birger Burkhardt <b.burkhardt@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|b.burkhardt@gmx.de          |

--- Comment #6 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-03-28 03:42:10 MST ---
Hi Jeff,

after executing the command

  echo -n "panic" > /sys/o2cb/fence_method

and starting bonnie++, the system "hangs" after a few minutes of running bonnie++. The output on the system console on both machines is:

  Kernel panic - not syncing: *** ocfs2 is verry sorry to be fencing this system by panicing ***

Since I have now set up this dedicated test system, I could also test some parameters or patches for you.

Thanks & best regards,
Birger

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c7

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |b.burkhardt@gmx.de

--- Comment #7 from Jeff Mahoney <jeffm@novell.com> 2008-03-28 07:36:35 MST ---
Ok, that means that heartbeat has timed out and one of the nodes thinks the other node has gone away.

Try just changing O2CB_HEARTBEAT_THRESHOLD to 31 in /etc/sysconfig/o2cb. This will be the default in new releases.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c8

Birger Burkhardt <b.burkhardt@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|b.burkhardt@gmx.de          |

--- Comment #8 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-03-28 14:03:01 MST ---
Hi Jeff,

I set this value on both systems, but the problem still exists; both machines still reboot without any output.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c9

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |b.burkhardt@gmx.de

--- Comment #9 from Jeff Mahoney <jeffm@novell.com> 2008-03-28 14:08:25 MST ---
Has it rebooted more than once since adding that value? The cluster service would have needed to be restarted for it to take effect.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c10

Birger Burkhardt <b.burkhardt@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|b.burkhardt@gmx.de          |

--- Comment #10 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-03-28 14:18:36 MST ---
Yes, they were both rebooted after editing the config file, and I checked the value after the reboot via

  /etc/init.d/o2cb status

https://bugzilla.novell.com/show_bug.cgi?id=363249

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |jeffm@novell.com
             Status|NEW                                       |ASSIGNED

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c11

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|openSUSE10.3 crashes with   |openSUSE10.3 crashes with
                   |DRBD and OCFS under heavy   |DRBD and OCFS2 under heavy
                   |I/O load                    |I/O load

--- Comment #11 from Jeff Mahoney <jeffm@novell.com> 2008-03-28 14:33:17 MST ---
Well, this means that either network connectivity or disk connectivity between the machines has been interrupted. With the threshold at 31, it means that either a full minute has passed without a disk heartbeat or 10 seconds have passed without network connectivity.

You can try using the new default for network connectivity:

  O2CB_IDLE_TIMEOUT_MS=30000

Something is causing latency and starvation in your environment. This isn't a bug in OCFS2. Your configuration isn't meeting the requirements of OCFS2.

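The figures in comment #11 can be reproduced with a little arithmetic. The sketch below assumes the relation given in the OCFS2 documentation for the disk heartbeat, where a node is declared dead after (threshold - 1) * 2 seconds; that relation is not stated in this thread. The function names are hypothetical and only illustrate the calculation.

```shell
#!/bin/sh
# Hypothetical helpers; none of these are part of ocfs2-tools.

# Seconds without a disk heartbeat before a node is considered dead,
# assuming the documented relation (threshold - 1) * 2.
hb_dead_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

# Inverse: the O2CB_HEARTBEAT_THRESHOLD needed to tolerate N seconds.
hb_threshold_for() {
    echo $(( $1 / 2 + 1 ))
}

# O2CB_IDLE_TIMEOUT_MS expressed in whole seconds.
idle_timeout_seconds() {
    echo $(( $1 / 1000 ))
}

hb_dead_seconds 31          # prints 60, the "full minute" of comment #11
hb_threshold_for 60         # prints 31
idle_timeout_seconds 30000  # prints 30
```

With the values suggested in comments #7 and #11, /etc/sysconfig/o2cb would contain `O2CB_HEARTBEAT_THRESHOLD=31` and `O2CB_IDLE_TIMEOUT_MS=30000`. As comment #9 notes, the cluster service has to be restarted before they take effect.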
https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c12

--- Comment #12 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-04-21 04:46:02 MST ---
Hi Jeff,

I applied the settings you recommended and the problem no longer occurs.

The network interconnection is realised with a crossover cable in this testing setup; the same error occurs when connected via a GBit switch. There are no errors on the interface. 2 hard disks are connected to a hardware RAID controller and set up as a RAID-0 (Mirror) array for maximum performance.

Can you please tell me the minimum requirements for running OCFS2?

Best regards,
Birger

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c13

--- Comment #13 from Jeff Mahoney <jeffm@novell.com> 2008-04-22 08:17:15 MST ---
Well, any latency introduced by drbd is really outside of the scope of OCFS2. I'm surprised that removing the switch solves the problem. Perhaps your switch is flaking out?

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c14

--- Comment #14 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-04-22 12:58:22 MST ---
Hi,

sorry, but I wrote that the error occurred even when I used a crossover cable, so the switch is not causing the problem. Yesterday the error occurred once again (with the new settings), so the problem is still not fixed.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c15

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mfasheh@novell.com

--- Comment #15 from Jeff Mahoney <jeffm@novell.com> 2008-04-22 13:05:45 MST ---
Ah ok, that's not quite what you said in your comment. Thanks for the clarification.

It could be that drbd is saturating your network connection and the keepalive packets aren't making it between the machines. You might want to try using a separate link for the cluster traffic.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c16

--- Comment #16 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-04-23 02:37:13 MST ---
I changed the network topology so that drbd and ocfs2 now each communicate over their own dedicated Ethernet interface, but the problem still exists.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c17

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lmb@novell.com

--- Comment #17 from Jeff Mahoney <jeffm@novell.com> 2008-05-05 11:20:42 MST ---
Is your drbd device configured in sync or async mode?

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c18

--- Comment #18 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-05-06 05:08:09 MST ---
The DRBD device is configured with "protocol C", which means synchronous.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c19

--- Comment #19 from Jeff Mahoney <jeffm@novell.com> 2008-06-18 10:44:29 MDT ---
Does changing O2CB_HEARTBEAT_THRESHOLD help at all? It defaults to 7, which means 12 seconds.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User b.burkhardt@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c20

--- Comment #20 from Birger Burkhardt <b.burkhardt@gmx.de> 2008-08-01 03:47:47 MDT ---
Please have a look at my comment #8 above, posted at 2008-03-28 14:03:01. No, it doesn't help at all.

https://bugzilla.novell.com/show_bug.cgi?id=363249

User jeffm@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=363249#c21

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |WONTFIX

--- Comment #21 from Jeff Mahoney <jeffm@novell.com> 2008-12-05 13:47:37 MST ---
OpenSUSE 11.1 has added robust userspace cluster support that eliminates the disk heartbeat. I'm going to close this one as WONTFIX.
