[Bug 768714] New: BUG: soft lockup - CPU#1 stuck for 43s!
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c0

Summary: BUG: soft lockup - CPU#1 stuck for 43s!
Classification: openSUSE
Product: openSUSE 12.2
Version: Beta 2
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: per@opensuse.org
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:11.0) Gecko/20100101 Firefox/11.0

Hardware: HP ML570 G3, 32Gb RAM, xen-host
Software: Factory, but not quite beta2. Kernel 3.3.0-2-xen

I noticed this message on the dom0 console whilst I was logged in as root:

Message from syslogd@madrid at Jun 25 04:49:29 ...
kernel:[320139.193450] BUG: soft lockup - CPU#1 stuck for 43s! [swapper/1:0]

As far as I can tell it didn't cause any problems, but I thought I would report it anyway. I'll attach the complete listing from dmesg.

Reproducible: Always

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c1
--- Comment #1 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c2
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c3
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c4
--- Comment #4 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c5
Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c6
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c7
--- Comment #7 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c8
--- Comment #8 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c9
--- Comment #9 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c10
Johannes Weberhofer
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c11
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c12
--- Comment #12 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c
Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c13
--- Comment #13 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c14
--- Comment #14 from Johannes Weberhofer
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c15
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c16
--- Comment #16 from Per Jessen
Are all of the use cases virtualized? The traces are all different and don't make any sense -- especially the ethtool one. That one involves a 10usec delay going for 22s.
The first case I reported was on a xen host system, the second case (comment #12) is a normal server. The second case is consistently reproducible, btw.
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c17
--- Comment #17 from Jan Beulich
Jan, especially with the heavy i/o load case, do you think it's possible that guests aren't getting scheduled? I'm not suggesting it's a Xen bug since we see the same problem with Parallels - but perhaps it's a load issue?
Heavy load can certainly cause latency issues, but in the tens-of-milliseconds range, not many seconds (which would imply that several time slices of - by default - 30ms don't suffice to get enough non-interrupt work done in the guest to have the soft lockup timer run at least once). That is of course unless
- the system is insanely overcommitted, in which case this wouldn't be a bug at all, or
- all of the time is spent in handling interrupts in the guest.
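To put the scale argument above into numbers, here is a minimal arithmetic sketch. It assumes only the 30ms default time slice mentioned in the comment and the stuck durations reported in this bug; the function name is made up for illustration.

```python
def slices_elapsed(stuck_s: int, slice_ms: int = 30) -> int:
    """Whole scheduler time slices that fit into a stuck interval."""
    return (stuck_s * 1000) // slice_ms


# 43 s is the duration from the original report, 22 s from the ethtool traces.
for stuck in (43, 22):
    print(f"{stuck}s stuck is about {slices_elapsed(stuck)} slices of 30ms")
```

Even the shorter lockups span several hundred time slices, which is why plain scheduling latency looks like an unlikely explanation unless the host is heavily overcommitted.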
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c18
Jason Fillman
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c19
Cornelius Claussen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c20
Archie Cobbs
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c21
--- Comment #21 from Archie Cobbs
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c22
Jan Bouwhuis
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c23
Björn Voigt
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c24
--- Comment #24 from Jan Bouwhuis
Googling around has led to a few theories:
* BIOS update needed: http://communities.vmware.com/message/953833 (old thread)
* VMWare bug (?) not scheduling kernel enough: https://bugzilla.redhat.com/show_bug.cgi?id=790032
* Linux bug with boot parameter workaround: http://www.csharpest.net/?p=168
Regarding that last one, the claim is that setting the boot parameters "pci=noacpi acpi=off noapic" works around this problem.
I would be interested if other people can try that and see what happens (I will also be trying it).
Setting the boot parameters "pci=noacpi acpi=off noapic" did stabilize my VM. But under heavy load on the host system, soft lockups still occur.
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c25
Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c26
--- Comment #26 from Per Jessen
This problem now also appears when trying to install openSUSE 12.3 32bit on an HP DL360. I see multiple lines such as:
[ 44.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:732]
[ 84.212002] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:734]
[ 136.212002] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:922]
[ 176.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:924]
[ 220.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:952]
[ 260.212002] BUG: soft lockup - CPU#0 stuck for 23s! [ethtool:954]
[ 300.340002] BUG: soft lockup - CPU#1 stuck for 22s! [ethtool:982]
[ 340.212003] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:984]
This eventually prevents network configuration altogether, thereby making network install impossible.
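For anyone comparing traces across machines, the messages above can be tallied with a small script. This is only a sketch: the regular expression is derived from the message format shown in this bug, and the function name and sample are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 44.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:732]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(text):
    """Return one dict per soft lockup message found in dmesg-style text."""
    return [
        {
            "ts": float(m["ts"]),      # kernel timestamp in seconds
            "cpu": int(m["cpu"]),      # CPU reported as stuck
            "secs": int(m["secs"]),    # reported stuck duration
            "comm": m["comm"],         # task name, e.g. ethtool or swapper/1
            "pid": int(m["pid"]),
        }
        for m in LOCKUP_RE.finditer(text)
    ]


sample = """\
[ 44.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:732]
[ 84.212002] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:734]
kernel:[320139.193450] BUG: soft lockup - CPU#1 stuck for 43s! [swapper/1:0]
"""

events = parse_lockups(sample)
per_task = Counter(e["comm"] for e in events)
print(per_task)
```

Feeding it the full output of dmesg shows at a glance whether the lockups always hit the same task (here ethtool) or are spread over unrelated ones.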
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c27
--- Comment #27 from Jan Bouwhuis
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c28
--- Comment #28 from Per Jessen
I found out that a high load on the host that runs the VM increases the chance that soft lockups occur. Further, it matters a lot whether only one processor is assigned to the VM. Since I configured my openSUSE 12.2 VM to run on 1 CPU, I have not had any problems.
I tried starting the install with "maxcpus=0", no effect. I've also tried using "pci=noacpi acpi=off noapic", also no effect. Does anyone have any other work-around suggestions?
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c29
--- Comment #29 from Jan Bouwhuis
(In reply to comment #27)
I found out that a high load on the host that runs the VM increases the chance that soft lockups occur. Further, it matters a lot whether only one processor is assigned to the VM. Since I configured my openSUSE 12.2 VM to run on 1 CPU, I have not had any problems.
I tried starting the install with "maxcpus=0", no effect. I've also tried using "pci=noacpi acpi=off noapic", also no effect. Does anyone have any other work-around suggestions?
I have assigned only one CPU in the VM config. I have the standard kernel startup parameters. Kernel version: 3.4.33-2.24-desktop #1 SMP PREEMPT Tue Feb 26 03:34:33 UTC 2013 (5f00a32) x86_64 x86_64 x86_64 GNU/Linux.
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c30
--- Comment #30 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c31
--- Comment #31 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c32
Petrov Andrey
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c33
Evgeny Ivanov
This problem now also appears when trying to install openSUSE 12.3 32bit on an HP DL360. I see multiple lines such as:
[ 44.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:732]
[ 84.212002] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:734]
[ 136.212002] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:922]
[ 176.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:924]
[ 220.340002] BUG: soft lockup - CPU#1 stuck for 23s! [ethtool:952]
[ 260.212002] BUG: soft lockup - CPU#0 stuck for 23s! [ethtool:954]
[ 300.340002] BUG: soft lockup - CPU#1 stuck for 22s! [ethtool:982]
[ 340.212003] BUG: soft lockup - CPU#0 stuck for 22s! [ethtool:984]
This eventually prevents network configuration altogether, thereby making network install impossible.
The same problem occurs on an HP DL360-G5. After the error messages appear, all the VMs hang and then the server itself. It happens every 1-2 weeks.
Memory: 10 Gb
Kernel: 3.0.74-0.6.6-xen
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c34
Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c35
--- Comment #35 from Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c36
--- Comment #36 from Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c37
--- Comment #37 from Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c38
--- Comment #38 from Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c39
--- Comment #39 from Per Jessen
In digging into this we've uncovered that the kernel soft lockup only occurs when an interface has no "carrier signal", i.e. isn't plugged into anything. If you wire the empty Broadcom NIC port or ports into a switch, hub, or another RJ45 interface, which will bring up the link lights, and run "ethtool -e eth#", the kernel will not soft lockup; however, the system will still pause.
Interesting. ISTR Compaq/HP supplying RJ45 plugs with terminating resistors (or some such) in the old days - maybe plug one of those in? I think I'll try it.
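To see which ports fall into the "no carrier" case described above before running "ethtool -e" on them, the link state can be read from sysfs. The sketch below assumes the standard /sys/class/net/<iface>/carrier file; the function name and the "up"/"down"/"unknown" labels are illustrative choices.

```python
from pathlib import Path


def link_states(sysfs_net="/sys/class/net"):
    """Map each interface name to "up", "down" or "unknown" (carrier state)."""
    base = Path(sysfs_net)
    if not base.is_dir():
        return {}
    states = {}
    for iface in sorted(base.iterdir()):
        if not iface.is_dir():
            continue  # skip plain files such as bonding_masters
        try:
            value = (iface / "carrier").read_text().strip()
        except OSError:
            # Reading carrier fails when the interface is administratively
            # down, or the file may be missing; report that as unknown.
            states[iface.name] = "unknown"
            continue
        states[iface.name] = "up" if value == "1" else "down"
    return states


if __name__ == "__main__":
    for name, state in link_states().items():
        print(f"{name}: carrier {state}")
```

Interfaces reported as "down" are the ones that, according to the observation above, trigger the soft lockup when their EEPROM is dumped.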
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c40
--- Comment #40 from Darin Perusich
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c41
--- Comment #41 from Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c42
Darrell Fitzpatrick
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c43
--- Comment #43 from Darrell Fitzpatrick
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c44
Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c45
Takashi Iwai
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c46
Per Jessen
The problem likely depends on the driver, although the end result looks the same. To track the bugs further, we need to confirm that they are still present with the latest upstream kernels.
So, please test whether the latest 3.13.x or 3.14-rc kernel still shows the same issue. Test without any KMP, ideally with a vanilla kernel without any changes.
Have tried with current Factory (build 147) which uses kernel 3.14.0-rc7-1-default - problem persists.
If the problem is confirmed in the upstream code, we can forward the bugs to the upstream bugzilla. Again, the problem is likely h/w-dependent, so each case has to be confirmed.
It has appeared on every one of my boxes with Broadcom 5703 and 5704 interfaces, every time with the tg3 driver. The hardware is HP Proliant DL380 g2/g3/g4 and DL580 g2/g3/g4.
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c47
Richard Ems
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c48
Benjamin Poirier
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c49
Per Jessen
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c50
Ivan Petrov
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c51
Vitaliy Tomin
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c52
--- Comment #52 from Benjamin Poirier
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c53
--- Comment #53 from Per Jessen
While working with Broadcom we have found a fix for this problem. Broadcom is currently testing it on different tg3 adapter versions and will submit it upstream. Once the patch is finalized we will integrate it in openSUSE.
This is excellent news, thanks for the update!
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c54
Jeromy Smith
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c56
Al Cho
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c
Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c57
Benjamin Poirier
https://bugzilla.novell.com/show_bug.cgi?id=768714
https://bugzilla.novell.com/show_bug.cgi?id=768714#c58
--- Comment #58 from Swamp Workflow Management
http://bugzilla.novell.com/show_bug.cgi?id=768714
Swamp Workflow Management
http://bugzilla.novell.com/show_bug.cgi?id=768714
--- Comment #59 from Swamp Workflow Management
http://bugzilla.novell.com/show_bug.cgi?id=768714
Swamp Workflow Management
http://bugzilla.novell.com/show_bug.cgi?id=768714
Swamp Workflow Management