[Bug 392585] New: Dom0 instability and log messages under heavy load
https://bugzilla.novell.com/show_bug.cgi?id=392585

Summary: Dom0 instability and log messages under heavy load
Product: openSUSE 11.0
Version: Factory
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Xen
AssignedTo: cgriffin@novell.com
ReportedBy: frank.arnold@amd.com
QAContact: qa@suse.de
Found By: ---

We installed openSUSE Beta (currently 3) Xen on two systems and ran some tests on it. Dom0 becomes unstable if we start enough guests which are running different stress tests. This can be seen by VNC connections of the guests not getting updated properly. Eventually, it recovers after some time. But we saw a complete lockup after 2 days of runtime with some overload, too.

Another indication is that we are getting lots of the following logs inside /var/log/messages, even with moderate load:

klogd: clocksource/{0,1,2,3}: Time went backwards: ret=12f357ae516ad delta=-17749258 shadow=12f356f3bdd4e offset=cb83438

We are not seeing this with upstream Xen.

Hardware:
1: Platform: Sahara (1P)
   Processor: Phenom 9600 Quad (Fam 16, Model 2, Stepping 2)
   Memory: 6GB
   BIOS: PSAD00-B
   OS: openSUSE 11.0 Beta3 32-bit
2: Motherboard: Asus M2N-MX SE Plus
   Processor: Athlon X2 5000+ (Fam 15, Model 107, Stepping 1)
   Memory: 4GB
   OS: openSUSE 11.0 Beta3 64-bit

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
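For anyone sifting through /var/log/messages for these entries, the "Time went backwards" line quoted in the report can be split into its fields with a short script. This is only a sketch: the pattern is derived from the single sample line in the report, it assumes a real log line carries one CPU number where the report writes {0,1,2,3}, and treating delta as nanoseconds is an assumption, not something stated in this bug.

```python
import re

# Pattern built from the one sample line in the report (assumption: real
# log lines carry a single CPU number after "clocksource/").
PATTERN = re.compile(
    r"clocksource/(?P<cpu>\d+): Time went backwards: "
    r"ret=(?P<ret>[0-9a-f]+) delta=(?P<delta>-?\d+) "
    r"shadow=(?P<shadow>[0-9a-f]+) offset=(?P<offset>[0-9a-f]+)"
)


def parse_backwards(line):
    """Return the fields of a 'Time went backwards' line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "ret": int(m.group("ret"), 16),        # printed in hex
        "delta": int(m.group("delta")),        # printed as signed decimal
        "shadow": int(m.group("shadow"), 16),  # printed in hex
        "offset": int(m.group("offset"), 16),  # printed in hex
    }


# Sample taken from the report, with CPU 0 substituted for {0,1,2,3}.
sample = ("klogd: clocksource/0: Time went backwards: ret=12f357ae516ad "
          "delta=-17749258 shadow=12f356f3bdd4e offset=cb83438")
rec = parse_backwards(sample)

# Assumption: delta is in nanoseconds, so this is the backwards step in ms.
delta_ms = rec["delta"] / 1e6
print(rec["cpu"], delta_ms)
```

If the nanosecond assumption holds, the sample corresponds to a backwards step of a bit under 18 ms. Counting matches per CPU over time would show whether the rate changes between test runs.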
https://bugzilla.novell.com/show_bug.cgi?id=392585
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c1
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c2
Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c3
--- Comment #3 from Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c4
--- Comment #4 from Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c5
Jan Beulich
> There's plain nothing on the output that would indicate a problem, it just hangs. We don't have the resources to investigate this further, so here is a summary of the last run (froze about 8 hours after start). Log files will be attached.
Unless we would be able to reproduce this in our lab, there's pretty little we can do if you're unavailable to assist with the analysis. According to the logs you didn't even get Xen's response to pressing 'd' once the box hung (of course I can't tell whether you tried and it didn't work).
https://bugzilla.novell.com/show_bug.cgi?id=392585
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c6
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c7
--- Comment #7 from Frank Arnold
> Unless we would be able to reproduce this in our lab, there's pretty little we can do if you're unavailable to assist with the analysis. According to the logs you didn't even get Xen's response to pressing 'd' once the box hung (of course I can't tell whether you tried and it didn't work).
Sorry, forgot that one: pressing 'd' didn't do anything. But I'm sure I didn't switch the input to Xen...

Probably I can do some more work on that next week. But it would be nice if you could at least try to reproduce it on a Sahara on your side.

Another side note: I installed a Windows XP Pro x64 guest while running another Windows Server 2003 instance which was unzipping a large archive on box #2 I mentioned above. While it was doing this the box froze, too. Unfortunately it had no serial connection. But this indicates once more that it seems to be a more common problem:
- it happens on 32-bit and 64-bit
- it happens with AMD processor family 15 and 16

I ran the same configuration as mentioned in comment #2 on a xen-unstable build with 2.6.18.8 Dom0 on the same box (#1). It did run for 16 hours without any issues.
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c9
Frank Arnold
> The kernel log provided could hint at a TSC instability, although I didn't think Phenoms would do hardware initiated clock modulation affecting the TSC tick rate behind the back of the OS.
That's true, Phenoms don't have this issue.
> The box doesn't appear to have a HPET, so you should try clocksource=pit to see whether this alternatively has something to do with the PM timer latency issues currently being fixed upstream.
Tried clocksource=pit and got the same result.
> Also, I take it for granted that the problem happens regardless of the sync_console and watchdog command line options.
When I first saw this issue no debug options were used. Just a clean install.
> As the watchdog doesn't appear to trigger, it's pretty certain you'd be able to get output from Xen once hung - please get the output from pressing 'd' as indicated above (and be sure to switch input to Xen before attempting this).
Done. Serial logs from runs with clocksources PIT and ACPI PM are attached.
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c10
--- Comment #10 from Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c11
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c12
Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c13
--- Comment #13 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c15
--- Comment #15 from Frank Arnold
https://bugzilla.novell.com/show_bug.cgi?id=392585
User thomas.friebel@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c16
Thomas Friebel
https://bugzilla.novell.com/show_bug.cgi?id=392585
User mark.langsdorf@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c17
Mark Langsdorf
From my testing, it looks like pinning the vcpus significantly reduces or completely stops the rate of "time going backwards" messages. I'm not sure what that implies, though.
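To make the pinning experiment easier to repeat, here is a minimal sketch that builds the commands for pinning each vCPU of a domain to the physical CPU with the same number. It assumes the xm toolstack and its `xm vcpu-pin <domain> <vcpu> <cpus>` form; the domain name "Domain-0", the count of 4 and the 1:1 mapping are illustrative assumptions, since the comment above does not say which domains were pinned or how.

```python
def pin_commands(domain, vcpu_count):
    """Build one 'xm vcpu-pin' argv per vCPU, mapping vCPU i to physical CPU i.

    The 1:1 mapping is an assumption for illustration only.
    """
    return [
        ["xm", "vcpu-pin", domain, str(vcpu), str(vcpu)]
        for vcpu in range(vcpu_count)
    ]


# Hypothetical example: a 4-vCPU Dom0 on a quad-core box.
commands = pin_commands("Domain-0", 4)
for argv in commands:
    # Only printed here; run them by hand or via subprocess.run(argv) as root.
    print(" ".join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; comparing the message rate before and after applying them would be one way to check the observation.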
https://bugzilla.novell.com/show_bug.cgi?id=392585
Thomas Friebel
https://bugzilla.novell.com/show_bug.cgi?id=392585
User aj@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c23
Andreas Jaeger
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c24
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c25
--- Comment #25 from Frank Arnold
> Could this be tried with 11.1/SLE11 beta 4 (as soon as available)
We'll do this. We should have results sometime during next week.
https://bugzilla.novell.com/show_bug.cgi?id=392585
User frank.arnold@amd.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c26
Frank Arnold
> Could this be tried with 11.1/SLE11 beta 4 (as soon as available)
Done, and it's looking nice. Used a similar setup to the one mentioned in comment #2.
- openSUSE 11.1 Beta3 x86_64 ran for 104 hours without issues. No 'time went backwards' messages.
- SLE11 Beta4 x86_64 ran for 88 hours without issues. Also no 'time went backwards' messages.
https://bugzilla.novell.com/show_bug.cgi?id=392585
User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=392585#c27
--- Comment #27 from Jan Beulich