[Bug 584554] New: random 61s soft lockups using xenified kernel 2.6.31.12 or even 2.6.33-25
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c0

           Summary: random 61s soft lockups using xenified kernel 2.6.31.12 or even 2.6.33-25
    Classification: openSUSE
           Product: openSUSE 11.2
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.2
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: samuel.kvasnica@ims.co.at
         QAContact: qa@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=345885)
 --> (http://bugzilla.novell.com/attachment.cgi?id=345885)
syslog snippet

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.7) Gecko/20100110 Mandriva Linux/1.9.1.7-0.1mdv2010.0 (2010.0) Firefox/3.5.7

We experience random "61s" CPU soft lockups on kernel 2.6.31.12-xen. We also verified the same behavior with the latest head kernel, 2.6.33-25-xen.

We use a drbd + xen configuration: 2 drbd partitions and 1 xen pvm guest. The system is running almost idle. Using snmp logging, we can observe a rapid increase of load during the lockup. The system does not crash, but the corresponding CPU is loaded/blocked during this time.

We are not able to determine what really triggers the lockup. High load does not seem to correlate. Typically the system runs cleanly for up to 12 hours before a lockup happens. Once a lockup happens, more lockups typically follow within a few minutes. Afterwards, the system runs cleanly again. xen_safe_halt() typically appears in the stack trace.

The same system gives absolutely no lockups on older hardware (2-core Pentium-D, 8G RAM) with 1 month of uptime.

Our hardware:
- Supermicro X8SIL-F
- Chipset Ibex Peak, XEON X3450
- 8GB ECC RAM
- latest BIOS updates applied
- conservative BIOS settings
- changing the hyperthreading/turbo/C-state settings in BIOS makes no difference

A log of the last 2 events is attached (from the 2.6.33 kernel; 2.6.31 looks much the same).

Reproducible: Sometimes

Steps to Reproduce:
1. let system run
2. wait
3. wait, see above description...

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c1
--- Comment #1 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c
Jeff Mahoney
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c2
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c3
Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c4
--- Comment #4 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c5
--- Comment #5 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c6
--- Comment #6 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c7
--- Comment #7 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c8
--- Comment #8 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c9
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c10
--- Comment #10 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c11
--- Comment #11 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c12
--- Comment #12 from Jan Beulich
> Sorry for not taking care of the separate factory kernel bug - I've been terribly busy over the last few weeks. However, your last comment sounds positive! Which 2.6.31 kernel version should contain that fix? Or where can I find the relevant patch to try it out?

There is no separate patch; the fix is part of one of the upstream merge patches. The relevant change is that under Xen sched_clock_stable must never be set to a non-zero value.

> One more interesting piece of information: I did some testing on another new Supermicro system, this time an X8DTi-F with a XEON E5530 CPU (Intel 5520 chipset). In contrast to the X8SIL-F with the X3450 above, I cannot reproduce the 61s bug on this system using the same system images. So it seems to be chipset- or mainboard-related.

If you look at the condition under which sched_clock_stable was set to 1 previously, you'll realize that only Nehalems (and newer) would be affected.
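
The decision described in comment #12 can be modelled in a few lines. This is only a sketch for illustration, not the kernel source: it assumes the earlier condition keyed on the CPU reporting an invariant TSC (taken here to be bit 8 of the power-management feature word, which is an assumption and not stated in this thread), and `running_on_xen` is a hypothetical flag standing in for however the kernel knows it is running under Xen.

```python
# Simplified model of the sched_clock_stable decision; this is NOT kernel code.

# Assumption: the "invariant TSC" capability is bit 8 of the CPU's
# power-management feature word (x86_power). This is not stated in the thread.
INVARIANT_TSC_BIT = 1 << 8


def sched_clock_stable_before_fix(x86_power: int) -> int:
    """Assumed old behaviour: set the flag to 1 whenever the CPU reports
    an invariant TSC, regardless of whether the kernel runs under Xen."""
    return 1 if x86_power & INVARIANT_TSC_BIT else 0


def sched_clock_stable_after_fix(x86_power: int, running_on_xen: bool) -> int:
    """Behaviour stated in comment #12: under Xen the flag must never be
    set to a non-zero value. Outside Xen the old condition still applies."""
    if running_on_xen:  # hypothetical flag, for illustration only
        return 0
    return sched_clock_stable_before_fix(x86_power)
```

Under this model, a CPU without the invariant-TSC bit yields 0 both before and after the fix, so only CPUs that report the bit could have seen a change in behaviour under Xen.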
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c13
--- Comment #13 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c14
--- Comment #14 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c15
--- Comment #15 from Samuel Kvasnica
http://bugzilla.novell.com/show_bug.cgi?id=584554
http://bugzilla.novell.com/show_bug.cgi?id=584554#c16
Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c17
Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c18
Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c19
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c20
--- Comment #20 from Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c21
--- Comment #21 from Jan Beulich
> The "normal" 11.2 config is unfortunately not usable due to several issues, and a full upgrade to 11.3 on a running, tuned system is too risky just because of a kernel bug.

In order to give support, we generally need a reasonably consistent setup. Any exception to this may (and likely will) result in lowered priority and slower response times.

> But the previous 2.6.31.13 ran without lockup problems for several months. With 2.6.31.14, 2 systems locked up about 2 weeks after the upgrade.

Which still doesn't in any way mean that the old problem is back.

> Do you have an idea what was changed with respect to clock/timers between .13 and .14?

Not offhand, no. And particularly not because you're apparently talking about the differences between kernels that never got released.

> Is there any repository where I can get the last .13 kernel (I already removed it...)?

I don't think they're being kept. You may want to try older released update kernels (http://download.opensuse.org/pub/opensuse/update/11.2/rpm/x86_64/; there's a .12 and an earlier .14 there).
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c22
--- Comment #22 from Samuel Kvasnica
https://bugzilla.novell.com/show_bug.cgi?id=584554
https://bugzilla.novell.com/show_bug.cgi?id=584554#c23
--- Comment #23 from Jan Beulich