[Bug 392585] New: Dom0 instability and log messages under heavy load
https://bugzilla.novell.com/show_bug.cgi?id=392585

           Summary: Dom0 instability and log messages under heavy load
           Product: openSUSE 11.0
           Version: Factory
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Xen
        AssignedTo: cgriffin@novell.com
        ReportedBy: frank.arnold@amd.com
         QAContact: qa@suse.de
          Found By: ---

We installed openSUSE Beta (currently 3) Xen on two systems and ran some tests on it. Dom0 becomes unstable if we start enough guests which are running different stress tests. This can be seen by VNC connections of the guests not getting updated properly. Eventually, it recovers after some time. But we saw a complete lockup after 2 days of runtime with some overload, too.

Another indication is that we are getting lots of the following logs inside /var/log/messages, even with moderate load:

klogd: clocksource/{0,1,2,3}: Time went backwards: ret=12f357ae516ad delta=-17749258 shadow=12f356f3bdd4e offset=cb83438

We are not seeing this with upstream Xen.

Hardware:

1: Platform: Sahara (1P)
   Processor: Phenom 9600 Quad (Fam 16, Model 2, Stepping 2)
   Memory: 6GB
   BIOS: PSAD00-B
   OS: openSUSE 11.0 Beta3 32-bit

2: Motherboard: Asus M2N-MX SE Plus
   Processor: Athlon X2 5000+ (Fam 15, Model 107, Stepping 1)
   Memory: 4GB
   OS: openSUSE 11.0 Beta3 64-bit

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
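For anyone triaging a large /var/log/messages, the 'Time went backwards' line quoted in the report can be pulled apart with a short script. This is only a sketch: it assumes that ret, shadow and offset are printed in hexadecimal, that delta is a signed decimal, and that all of them are nanoseconds. The report does not state the units, and the names `PATTERN` and `parse_time_warning` are made up for this example.

```python
import re

# The message exactly as quoted in the report ("{0,1,2,3}" is the reporter's
# shorthand for the four clocksource instances).
LINE = ("klogd: clocksource/{0,1,2,3}: Time went backwards: "
        "ret=12f357ae516ad delta=-17749258 shadow=12f356f3bdd4e offset=cb83438")

# Assumption: ret/shadow/offset are hex, delta is signed decimal.
PATTERN = re.compile(
    r"Time went backwards: ret=(?P<ret>[0-9a-f]+) delta=(?P<delta>-?\d+) "
    r"shadow=(?P<shadow>[0-9a-f]+) offset=(?P<offset>[0-9a-f]+)"
)


def parse_time_warning(line):
    """Return the numeric fields of one warning, or None if the line differs."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "ret": int(match["ret"], 16),
        "delta": int(match["delta"]),
        "shadow": int(match["shadow"], 16),
        "offset": int(match["offset"], 16),
    }


fields = parse_time_warning(LINE)
# Assumption: delta is in nanoseconds, so divide by 1e6 for milliseconds.
delta_ms = fields["delta"] / 1e6
print(f"delta = {delta_ms:.3f} ms")
```

If the nanosecond assumption holds, the quoted delta corresponds to a backwards step of roughly 17.7 ms.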
https://bugzilla.novell.com/show_bug.cgi?id=392585

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jbeulich@novell.com
         AssignedTo|cgriffin@novell.com         |jbeulich@novell.com
             Status|NEW                         |ASSIGNED
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c1

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |frank.arnold@amd.com

--- Comment #1 from Jan Beulich <jbeulich@novell.com> 2008-05-21 09:00:18 MST ---
I don't think the 'time went backwards' messages, given the not too high delta, indicate a significant problem. Nevertheless, once the box locks up we'd need you to obtain some state information (sending 'd' over serial to the hypervisor as a first step). Also, more complete messages (to understand how bad the time issue is) should be attached. Likewise we'd want to see the hypervisor messages over the whole lifetime of the system (please use 'loglvl=all guest_loglvl=all' on the Xen command line). Finally, narrowing down the conditions for the problem to occur would be rather helpful.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c2

Frank Arnold <frank.arnold@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|frank.arnold@amd.com        |

--- Comment #2 from Frank Arnold <frank.arnold@amd.com> 2008-05-26 07:33:27 MDT ---
There's plain nothing on the output that would indicate a problem, it just hangs. We don't have the resources to investigate this further, so here is a summary of the last run (froze about 8 hours after start). Log files will be attached.

Conditions:

Box: Sahara/6GB/Phenom 9600 (you should have some of those available)
OS: openSUSE 11 Beta3 i386

Guests:

1 redhat_rhel5u1_32b_smp
  memory=640; shadow_memory=10; vcpus=2; pae=1; acpi=1; apic=1;
  kernbench

2 suse_sles10_32b_up
  memory=640; shadow_memory=10; vcpus=1; pae=1; acpi=1; apic=1;
  LTP

3 ms_winxp-sp2_32b_up
  memory=640; shadow_memory=10; vcpus=1; pae=1; acpi=1; apic=1;
  WinSST (AMD internal stress test)

4 redhat_rhel4u6_32bpae_smp
  memory=1024; shadow_memory=10; vcpus=2; pae=1; acpi=1; apic=1;
  CTCS

5 redhat_rhel5u1_32bpae_smp
  memory=1024; shadow_memory=10; vcpus=1; pae=1; acpi=1; apic=1;
  lmbench

6 suse_suse10_32bpae_smp
  memory=1024; shadow_memory=10; vcpus=1; pae=1; acpi=1; apic=1;
  CTCS
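To put the guest list in comment #2 into numbers, the totals can be added up as below. This is a rough sketch only. The per-guest memory and vcpus values are copied from the list; the physical core count (4, for a quad-core Phenom 9600) and the Dom0 vCPU count (4, guessed from the clocksource/{0,1,2,3} log lines) are assumptions that the thread does not confirm, and the variable names are invented for the example.

```python
# (name, memory in MB as given by the guest's memory= setting, vcpus)
GUESTS = [
    ("redhat_rhel5u1_32b_smp", 640, 2),
    ("suse_sles10_32b_up", 640, 1),
    ("ms_winxp-sp2_32b_up", 640, 1),
    ("redhat_rhel4u6_32bpae_smp", 1024, 2),
    ("redhat_rhel5u1_32bpae_smp", 1024, 1),
    ("suse_suse10_32bpae_smp", 1024, 1),
]

PHYSICAL_CORES = 4  # assumption: one quad-core Phenom 9600 (1P board)
DOM0_VCPUS = 4      # assumption: inferred from clocksource/{0,1,2,3}

guest_memory_mb = sum(mem for _, mem, _ in GUESTS)
guest_vcpus = sum(vcpus for _, _, vcpus in GUESTS)

# Ratio of runnable vCPUs (guests plus Dom0) to physical cores.
overcommit_ratio = (guest_vcpus + DOM0_VCPUS) / PHYSICAL_CORES

print(f"guest memory: {guest_memory_mb} MB")
print(f"guest vCPUs:  {guest_vcpus}")
print(f"vCPUs per physical core: {overcommit_ratio:.1f}")
```

Under those assumptions the six guests ask for 4992 MB and 8 vCPUs, so together with Dom0 about three vCPUs compete for each physical core when every guest is busy.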
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c3

--- Comment #3 from Frank Arnold <frank.arnold@amd.com> 2008-05-26 07:35:17 MDT ---
Created an attachment (id=218117)
 --> (https://bugzilla.novell.com/attachment.cgi?id=218117)
serial console output
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c4

--- Comment #4 from Frank Arnold <frank.arnold@amd.com> 2008-05-26 07:47:49 MDT ---
Created an attachment (id=218119)
 --> (https://bugzilla.novell.com/attachment.cgi?id=218119)
compressed /var/log/messages of the box
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c5

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|jbeulich@novell.com         |
          QAContact|qa@suse.de                  |jdouglas@novell.com

--- Comment #5 from Jan Beulich <jbeulich@novell.com> 2008-05-28 01:46:04 MDT ---
> There's plain nothing on the output that would indicate a problem, it just hangs. We don't have the resources to investigate this further, so here is a summary of the last run (froze about 8 hours after start). Log files will be attached.

Unless we would be able to reproduce this in our lab, there's pretty little we can do if you're unavailable to assist with the analysis. According to the logs you didn't even get Xen's response to pressing 'd' once the box hung (of course I can't tell whether you tried and it didn't work).
https://bugzilla.novell.com/show_bug.cgi?id=392585

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |jdouglas@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c6

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
      Info Provider|jdouglas@novell.com         |frank.arnold@amd.com

--- Comment #6 from Jan Beulich <jbeulich@novell.com> 2008-05-28 02:06:00 MDT ---
The kernel log provided could hint at a TSC instability, although I didn't think Phenoms would do hardware initiated clock modulation affecting the TSC tick rate behind the back of the OS. The box doesn't appear to have a HPET, so you should try clocksource=pit to see whether this alternatively has something to do with the PM timer latency issues currently being fixed upstream.

Also, I take it for granted that the problem happens regardless of the sync_console and watchdog command line options.

As the watchdog doesn't appear to trigger, it's pretty certain you'd be able to get output from Xen once hung - please get the output from pressing 'd' as indicated above (and be sure to switch input to Xen before attempting this).
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c7

--- Comment #7 from Frank Arnold <frank.arnold@amd.com> 2008-05-28 10:33:03 MDT ---
(In reply to comment #5 from Jan Beulich)
> Unless we would be able to reproduce this in our lab, there's pretty little we can do if you're unavailable to assist with the analysis. According to the logs you didn't even get Xen's response to pressing 'd' once the box hung (of course I can't tell whether you tried and it didn't work).

Sorry, forgot that one: pressing 'd' didn't do anything. But I'm sure I didn't switch the input to Xen... Probably I can do some more work on that next week. But it would be nice if you could at least try to reproduce it on a Sahara on your side.

Another side note: I installed a Windows XP Pro x64 guest while running another Windows Server 2003 instance which was unzipping a large archive on box #2 I mentioned above. While it was doing this the box froze, too. Unfortunately it had no serial connection. But this indicates once more that it seems to be a more common problem:
- it happens on 32-bit and 64-bit
- it happens with AMD processor family 15 and 16

I ran the same configuration as mentioned in comment #2 on a xen-unstable build with 2.6.18.8 Dom0 on the same box (#1). It did run for 16 hours without any issues.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c9

Frank Arnold <frank.arnold@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|frank.arnold@amd.com        |

--- Comment #9 from Frank Arnold <frank.arnold@amd.com> 2008-06-05 07:20:17 MDT ---
OK, here we go.

(In reply to comment #6 from Jan Beulich)
> The kernel log provided could hint at a TSC instability, although I didn't think Phenoms would do hardware initiated clock modulation affecting the TSC tick rate behind the back of the OS.

That's true, Phenoms don't have this issue.

> The box doesn't appear to have a HPET, so you should try clocksource=pit to see whether this alternatively has something to do with the PM timer latency issues currently being fixed upstream.

Tried clocksource=pit and got the same result.

> Also, I take it for granted that the problem happens regardless of the sync_console and watchdog command line options.

When I first saw this issue no debug options were used. Just a clean install.

> As the watchdog doesn't appear to trigger, it's pretty certain you'd be able to get output from Xen once hung - please get the output from pressing 'd' as indicated above (and be sure to switch input to Xen before attempting this).

Done. Serial logs from runs with clocksources PIT and ACPI PM are attached.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c10

--- Comment #10 from Frank Arnold <frank.arnold@amd.com> 2008-06-05 07:26:17 MDT ---
Created an attachment (id=220447)
 --> (https://bugzilla.novell.com/attachment.cgi?id=220447)
serial console logs
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c11

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |frank.arnold@amd.com

--- Comment #11 from Jan Beulich <jbeulich@novell.com> 2008-06-05 08:27:35 MDT ---
In order to make sense of the back traces - what exact kernel version did you run in Dom0?
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c12

Frank Arnold <frank.arnold@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|frank.arnold@amd.com        |

--- Comment #12 from Frank Arnold <frank.arnold@amd.com> 2008-06-05 08:36:36 MDT ---
# rpm -q --qf '%{name}-%{version}-%{release}.%{arch}\n' kernel-xen
kernel-xen-2.6.25.4-8.i586
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c13

--- Comment #13 from Jan Beulich <jbeulich@novell.com> 2008-06-05 09:50:39 MDT ---
All vCPU-s visible in the backtraces which apparently belong to dom0 are waiting for the xtime_lock. This may be a time handling issue in Xen, but it also may be a scheduler issue (in that the vCPU holding the lock doesn't get scheduled within a reasonable amount of time).
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c15

--- Comment #15 from Frank Arnold <frank.arnold@amd.com> 2008-06-17 06:54:29 MDT ---
Just figured out that some of our developers are working on an issue that might be related to this one. The 2.6.25 ticket spinlocks seem to cause trouble if this kernel is used in an HVM guest. At least they see a heavy performance decrease. There will be a presentation about it at the Xen Summit next week.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User thomas.friebel@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c16

Thomas Friebel <thomas.friebel@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thomas.friebel@amd.com

--- Comment #16 from Thomas Friebel <thomas.friebel@amd.com> 2008-06-18 04:30:46 MDT ---
Lock-holder preemption is a real problem for systems w/ ticket spinlocks (2.6.25+). Under CPU overcommitment the ticket locks will prevent the system from making progress. Exchanging the spinlock implementation with the 2.6.24 version will probably solve this.
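The effect described in comment #16 can be illustrated with a deliberately simplified toy model. It is not the kernel's spinlock code and not Xen's scheduler: it assumes a strict round-robin hypervisor, a critical section that takes exactly one time slice, and vCPU i holding ticket i. The function name `slices_to_drain` and its parameters are hypothetical. The only point is that a FIFO (ticket) lock must wait whenever the owner of the next ticket is descheduled, while a non-FIFO lock can be taken by whichever waiter happens to be running.

```python
def slices_to_drain(n_vcpus, n_pcpus, fifo):
    """Count time slices until every waiting vCPU has taken the lock once.

    Toy assumptions: vCPU i holds ticket i, the critical section lasts one
    slice, and n_pcpus vCPUs run per slice in strict round-robin order.
    """
    waiting = list(range(n_vcpus))  # ticket order equals vCPU id
    t = 0
    while waiting:
        # vCPUs that the (toy) hypervisor runs during slice t.
        running = {(t * n_pcpus + i) % n_vcpus for i in range(n_pcpus)}
        if fifo:
            # Ticket lock: only the owner of the next ticket may proceed.
            eligible = [waiting[0]] if waiting[0] in running else []
        else:
            # Non-FIFO lock: any waiter that is currently running may proceed.
            eligible = [v for v in waiting if v in running]
        if eligible:
            waiting.remove(eligible[0])
        t += 1
    return t


ticket_overcommitted = slices_to_drain(8, 4, fifo=True)
unfair_overcommitted = slices_to_drain(8, 4, fifo=False)
ticket_dedicated = slices_to_drain(8, 8, fifo=True)

print("ticket lock, 8 vCPUs on 4 pCPUs:  ", ticket_overcommitted, "slices")
print("non-FIFO lock, 8 vCPUs on 4 pCPUs:", unfair_overcommitted, "slices")
print("ticket lock, 8 vCPUs on 8 pCPUs:  ", ticket_dedicated, "slices")
```

In this model the ticket lock needs 14 slices with 8 vCPUs on 4 physical CPUs, against 8 slices for the non-FIFO lock and 8 slices for the ticket lock without overcommitment. Real behaviour depends on the scheduler and the workload, so the numbers only show the direction of the effect.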
https://bugzilla.novell.com/show_bug.cgi?id=392585

User mark.langsdorf@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c17

Mark Langsdorf <mark.langsdorf@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mark.langsdorf@amd.com

--- Comment #17 from Mark Langsdorf <mark.langsdorf@amd.com> 2008-06-19 11:37:32 MDT ---
From my testing, it looks like pinning the vcpus significantly reduces or completely stops the rate of "time going backwards" messages. I'm not sure what that implies, though.
https://bugzilla.novell.com/show_bug.cgi?id=392585

Thomas Friebel <thomas.friebel@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Normal                      |Major
         OS/Version|Other                       |openSUSE 11.0
           Priority|P5 - None                   |P2 - High
           Platform|Other                       |x86
https://bugzilla.novell.com/show_bug.cgi?id=392585

User aj@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c23

Andreas Jaeger <aj@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|lbendixs@novell.com         |

--- Comment #23 from Andreas Jaeger <aj@novell.com> 2008-10-24 05:13:21 MDT ---
What shall we do with this bug report?
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c24

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |frank.arnold@amd.com

--- Comment #24 from Jan Beulich <jbeulich@novell.com> 2008-10-27 02:46:57 MDT ---
Could this be tried with 11.1/SLE11 beta 4 (as soon as available), where the ticket lock implementation and a false positive issue in the reporting of 'time went backwards' have been fixed?
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c25

--- Comment #25 from Frank Arnold <frank.arnold@amd.com> 2008-10-27 04:52:06 MDT ---
(In reply to comment #24 from Jan Beulich)
> Could this be tried with 11.1/SLE11 beta 4 (as soon as available)

We'll do this. We should have results sometime during next week.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c26

Frank Arnold <frank.arnold@amd.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|frank.arnold@amd.com        |

--- Comment #26 from Frank Arnold <frank.arnold@amd.com> 2008-11-07 10:50:45 MST ---
(In reply to comment #24 from Jan Beulich)
> Could this be tried with 11.1/SLE11 beta 4 (as soon as available)

Done, and it's looking nice. Used a similar setup to the one mentioned in comment #2.

openSUSE 11.1 Beta3 x86_64 ran for 104 hours without issues. No 'time went backwards' messages.

SLE11 Beta4 x86_64 ran for 88 hours without issues. Also no 'time went backwards' messages.
https://bugzilla.novell.com/show_bug.cgi?id=392585

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c27

--- Comment #27 from Jan Beulich <jbeulich@novell.com> 2008-11-10 01:48:57 MST ---
The log messages may have been false positives (which we fixed in the SLE11 code); whether the instability was related to that I can't really tell. Backporting that fix shouldn't be difficult, but I'm not sure we really need to do anything beyond that given we know it's working better in current code.