[Bug 742088] New: Kernel panic, hard lockup on cpu when thermal events occur using kernel 3.2.0-2
https://bugzilla.novell.com/show_bug.cgi?id=742088
https://bugzilla.novell.com/show_bug.cgi?id=742088#c0

Summary: Kernel panic, hard lockup on cpu when thermal events occur using kernel 3.2.0-2
Classification: openSUSE
Product: openSUSE 12.1
Version: Final
Platform: x86-64
OS/Version: SuSE Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: gudlaugu@raunvis.hi.is
QAContact: qa@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1

When a thermal event occurs (when the cpus are under load for some time) I get a kernel panic - not syncing with the message Watchdog detected hard lockup on cpu x where x has been 2 4 and 16. I can reproduce this every time the system is under load and it always results in a trace to intel_thermal. My system has 2 intel xeon 5670 processors.

Reproducible: Always

Steps to Reproduce:
1. Run a computer intensive process until the cpu fans increase their speed and the cpus reduce their speed

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=742088#c1
Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c2
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=742088#c3
Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c4
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=742088#c5
Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=742088#c
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c
Arvydas Dapkunas
https://bugzilla.novell.com/show_bug.cgi?id=742088#c6
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c7
Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c8
--- Comment #8 from Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c9
--- Comment #9 from Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c10
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c11
--- Comment #11 from Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=742088#c12
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c13
--- Comment #13 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c14
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=742088#c15
--- Comment #15 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c16
--- Comment #16 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c17
--- Comment #17 from Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c18
--- Comment #18 from Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c19
Rafael Wysocki
Created an attachment (id=497135) --> (http://bugzilla.novell.com/attachment.cgi?id=497135) [details] Screenshots of panic with kernel 3.4.2-29
This is a machine check exception, which looks like a hardware bug to me. Did any kernel work for you before on this machine?
https://bugzilla.novell.com/show_bug.cgi?id=742088#c20
jordan hargrave
https://bugzilla.novell.com/show_bug.cgi?id=742088#c21
Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c22
--- Comment #22 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c23
--- Comment #23 from Arvydas Dapkunas
https://bugzilla.novell.com/show_bug.cgi?id=742088#c24
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c25
--- Comment #25 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c26
--- Comment #26 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c27
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c28
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c29
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c30
--- Comment #30 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c31
--- Comment #31 from Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c32
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c33
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c34
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c35
Rafael Wysocki
That being said, I can open a new bug if you think these are two separate kernel bugs.
Please do. So far I don't see any reason to think that this is the same bug. What you're seeing probably is a Bluetooth subsystem problem. Is there anyone except for Federico, who saw the original problem with kernels 3.4+ and later?
https://bugzilla.novell.com/show_bug.cgi?id=742088#c36
Sean McNally
(In reply to comment #34)
That being said, I can open a new bug if you think these are two separate kernel bugs.
Please do. So far I don't see any reason to think that this is the same bug. What you're seeing probably is a Bluetooth subsystem problem.
Is there anyone except for Federico, who saw the original problem with kernels 3.4+ and later?
I have just started following this thread. Experienced similar symptoms on 11.4 with kernels > 3.4.0-15.1.

Following a clean install of 12.1, all was well with kernel(s) 3.1.10-1.16-default. Ensuing updates, via repo /Kernel/HEAD, to kernels 3.5.<anything>, up to and including 3.5.0-4.1, manifest these exact symptoms, albeit only on ATI graphics. (Strangely, an install on Intel Arrandale graphics performs admirably! I do not have an Nvidia-based platform to test.)

No "Bluetooth" hardware is involved, and these symptoms are manifest on the -vanilla versions of the 3.5+ kernels.
https://bugzilla.novell.com/show_bug.cgi?id=742088#c37
--- Comment #37 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c38
--- Comment #38 from Sean McNally
https://bugzilla.novell.com/show_bug.cgi?id=742088#c39
--- Comment #39 from Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c40
Federico Vecchiarelli
https://bugzilla.novell.com/show_bug.cgi?id=742088#c41
--- Comment #41 from Sean McNally
https://bugzilla.novell.com/show_bug.cgi?id=742088#c42
--- Comment #42 from Sean McNally
Sean, do you have any screenshots?
Apologies for the delay: somewhere in the Kernel repos, "mkinitrd" was changed, necessitating install of an "mkinitrd" from Factory.

Kernel: 3.5.0-8.1-vanilla
Failed as shown upon imposing load (YAST, Firefox and Flash (aarrgghh!)).
https://bugzilla.novell.com/show_bug.cgi?id=742088#c43
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=742088#c44
Sean McNally
What graphics driver is used on your machines?
I have only one (1) remaining ATI PC: 32-bit

seanm@linux-khk0:~> lsmod | grep radeon
radeon                986760  2
ttm                    69320  1 radeon
drm_kms_helper         36953  1 radeon
drm                   208159  4 radeon,ttm,drm_kms_helper
i2c_algo_bit           13199  1 radeon
i2c_core               34010  5 i2c_i801,radeon,drm_kms_helper,drm,i2c_algo_bit
hwmon                  12936  2 radeon,thermal_sys
seanm@linux-khk0:~>
seanm@linux-khk0:~> glxinfo | grep direct
direct rendering: Yes
seanm@linux-khk0:~> glxinfo | grep open
seanm@linux-khk0:~> glxinfo | grep OpenGL
OpenGL vendor string: X.Org R300 Project
OpenGL renderer string: Gallium 0.4 on ATI RV350
OpenGL version string: 2.1 Mesa 7.11
OpenGL extensions:
seanm@linux-khk0:~>

Further information: failure (kernel panic, hard stop) occurs with and without Desktop Effects. Failures occur under KDE and Gnome.

I have also installed (and subsequently removed) the "drm-radeon-kmp-desktop" (from user jobermayr), with only a slightly longer MTTF.

Failure occurs on -vanilla kernels and -desktop kernels. Failure occurs on latest kernels, which require the updated (Factory repo) mkinitrd 2.271.1+.

Every kernel flavor (-default, -desktop and -vanilla) runs happily for days and days at 3.1.10-16.1 levels.
https://bugzilla.novell.com/show_bug.cgi?id=742088#c46
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=742088#c47
--- Comment #47 from Guðlaugur Jóhannesson
https://bugzilla.novell.com/show_bug.cgi?id=742088#c48
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=742088#c49
Sean McNally
Sorry for being silent for so long, but I was on vacation. I just tried kernel 3.5.1-38-vanilla and I can no longer produce the hard lockup with a computer intensive task. So my issue has been solved, thank you.
Confirm resolution with kernel 3.5.1-2.1-vanilla from Kernel/Stable repositories. Previously-failing ATI-equipped PC has been running with and without loads for 7+ hours without issue.

3.5.2-vanilla has hit the repo, and I will test it along with a re-confirmation of the 3.5.0-8.1-vanilla failure.
https://bugzilla.novell.com/show_bug.cgi?id=742088#c50
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=742088#c51
--- Comment #51 from Thomas Renninger
has been running with and without loads for 7+ hours without issue
Not that the issue is papered over because 3.5.1 has a "power save" fix or similar.
Are you sure you took -vanilla flavors for successful tests?

It could be related to bug #756085. acpi-cpufreq driver was not loaded because of a module dependency. The fix only is in SUSE kernels (-desktop, -default, ...) where needed (latest 12.2 and master branch), but I have to re-submit mainline.

While on such modern CPUs acpi-cpufreq shouldn't make that much of a difference, this (loading or not loading the acpi-cpufreq driver) could be the reason for different thermal heat-up behavior under load.

If the thermal event condition happens, there should be an MCE or TRM interrupt incremented in /proc/interrupts. You might want to double check whether this is true.

Can someone say for sure whether the machine hangs immediately on the first thermal event or after some time (when quite a lot of events may have happened)?

If this can still be reliably reproduced with the latest kernel(s) in some way (put something in front of the fan slot until critical temp is reached if it's a laptop?), it would be great if blacklisting edac drivers as mentioned in comment #46 could be tested.
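The /proc/interrupts check described above could be scripted roughly as follows. This is only a minimal sketch, assuming the usual x86 layout in which thermal events are counted on a row labelled TRM and machine checks on a row labelled MCE, with one counter per CPU. The function name thermal_counts and the sample text are hypothetical and are not taken from the bug report.

```python
from pathlib import Path


def thermal_counts(text, labels=("TRM", "MCE")):
    """Return {label: [per-CPU counts]} for the requested /proc/interrupts rows."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in labels:
            continue  # header line or an unrelated interrupt row
        per_cpu = []
        for field in rest.split():
            if not field.isdigit():
                break  # the trailing description text starts here
            per_cpu.append(int(field))
        counts[label] = per_cpu
    return counts


# Hypothetical two-CPU excerpt, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:         42          0   IO-APIC-edge      timer
TRM:          3          5   Thermal event interrupts
MCE:          0          0   Machine check exceptions
"""

if __name__ == "__main__":
    path = Path("/proc/interrupts")
    # Fall back to the sample when not running on Linux.
    text = path.read_text() if path.exists() else SAMPLE
    for label, per_cpu in thermal_counts(text).items():
        print(label, "total:", sum(per_cpu), "per CPU:", per_cpu)
```

Running it once before applying load and again while the fans spin up should show whether the TRM or MCE counters moved before the machine hung, which would help answer whether the hang follows the first thermal event or only a larger number of them.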
https://bugzilla.novell.com/show_bug.cgi?id=742088#c52
--- Comment #52 from Sean McNally
has been running with and without loads for 7+ hours without issue
Not that the issue is papered over because 3.5.1 has a "power save" fix or similar.
Are you sure you took -vanilla flavors for successful tests?
Ran for 12+ hours, pushing the proc AND temps to highs without failure. Tests of 3.5.0 (-4, -8 and -10), -vanilla and -desktop, resulted in failure in under 5 minutes, at first loading (Firefox w/Flash).
It could be related to bug #756085. acpi-cpufreq driver was not loaded because of a module dependency. The fix only is in SUSE kernels (-desktop, -default, ...) where needed (latest 12.2 and master branch), but I have to re-submit mainline.
Found that interesting, as the failing 3.5.0 kernels (see above for versions and -flavors) would (almost) drain the battery while sitting on the CPU lockup and/or panic. (I noticed this after floundering my photographic talents getting screenshots.)

Further testing, with 3.5.2-1.1-vanilla AND 3.5.2-1.1-desktop (finally!), gave quite satisfactory results (NO lockups or panics). The proc load was assisted by a heretofore unseen CPU load from "tracker"! That being resolved (and suppressed), all is well.

Something(s) was/were missing in the 3.5.0 kernels vis-a-vis ATI/ATI-hybrid graphics. I say that as these kernels performed admirably with -Intel graphics (both -Arrandale Integrated and, surprisingly, my old -855). I do not have an Nvidia platform, so no observation there.

One last note: prior to the 3.5.1/3.5.2 kernels, I tested the -krm-radeon-desktop (jobermayr's kernel libs). The -drm's seemed to improve the performance of the ATI graphics, but only lengthened the MTTF by about 10 minutes. End result was either the CPU lockup or panic. Once I am satisfied with the stability of the 3.5.2 kernel, I may revisit that ATI -drm.