[Bug 773563] New: Kernel gets a bogus temperature reading (when under heavy load) and shuts the computer down.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c0

Summary: Kernel gets a bogus temperature reading (when under heavy load) and shuts the computer down.
Classification: openSUSE
Product: openSUSE 12.2
Version: Factory
Platform: x86-64
OS/Version: openSUSE 12.2
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: tittiatcoke@gmail.com
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.3 (KHTML, like Gecko) Chrome/22.0.1218.0 Safari/537.3 SUSE/22.0.1218.0

After upgrading to the 3.4 kernel, I noticed that under heavy load the kernel gets a strange temperature reading and, based on this reading, shuts the notebook down. I see the following behavior:

1) Load the notebook with gcc activity until all processors are at 100% load.
2) Shortly afterwards you get notifications that a thermal event has been reached (CPU's 100 degrees).
3) The kernel reacts and shuts down the system.

Strangely enough, the same task survives easily with a 3.3 kernel, and under Windows I do not have any problems either. Checking bugs.kernel.org, I found a report about this effect; it suggested loading the thermal module with nocrt=1 so that the action on the first trip point is not initiated. This indeed helps a lot, and it seems that even under high load the notebook doesn't get that hot. No shutdown is initiated either.

The notebook is a Lenovo T410 with an Intel i5 CPU. Running with the parameter or switching back to the 3.3 kernel resolves the issue.

Reproducible: Always

Steps to Reproduce: 1. 2. 3.

-- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
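When a workaround such as nocrt=1 is in play, it can help to record which thermal module parameters the running kernel actually reports. The following is a minimal sketch, not part of the original report. It assumes the parameters are exposed under /sys/module/thermal/parameters; not every parameter is necessarily visible there, depending on the kernel. The function name read_module_params is made up for illustration.

```python
from pathlib import Path


def read_module_params(params_dir="/sys/module/thermal/parameters"):
    """Return {name: value} for each parameter file in params_dir.

    The default path is an assumption; an empty dict is returned when the
    directory does not exist (for example, when the module is not loaded).
    """
    base = Path(params_dir)
    if not base.is_dir():
        return {}
    params = {}
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # The file exists but cannot be read by this user.
            params[entry.name] = None
    return params
```

Calling read_module_params() and attaching the resulting dictionary to a report documents the settings under which a test run was made.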
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c1
--- Comment #1 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c2
Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c3
--- Comment #3 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c
Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c4
Thomas Renninger
This is a known bug which was resolved for openSUSE (see bnc#756085). Oh dear, I need to double-check whether this is not yet resolved for 12.2.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c5
--- Comment #5 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c6
--- Comment #6 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c7
--- Comment #7 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c8
Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c9
Thomas Renninger
From what I can see, I'd say we may have a "kernel does not react quickly enough on thermal events" issue.
Sigh, it's somewhat difficult to retrieve ACPI thermal settings via sysfs. Can you provide the output of these commands, please:

    cd /sys/devices/virtual/thermal
    grep ".*" `find *` >/tmp/thermal_settings

Please also attach the output of acpidump.

It is best to look for a BIOS update before you continue with anything; it would be a waste if this had already been addressed there. Also browse your BIOS settings a bit after upgrading. I could imagine you will find some thermal or CPU power related settings.

Ah yes: the very first thing you should do is take a vacuum cleaner and clean the fan while the machine is switched off, or try to find out whether there is dust in the fans.
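For readers who prefer a script to the grep/find one-liner requested in this comment, here is a rough Python equivalent. It is a sketch only: the function name dump_tree is invented for illustration, the output order may differ from what grep prints, and unreadable sysfs attributes are skipped silently.

```python
import os


def dump_tree(root="/sys/devices/virtual/thermal"):
    """Return 'relative/path:line' strings for every readable file below root."""
    lines = []
    # os.walk does not descend into symlinked directories by default,
    # which keeps the walk from looping through sysfs cross-links.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so the traversal order is stable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            try:
                with open(path, "r", errors="replace") as fh:
                    content = fh.read()
            except OSError:
                continue  # write-only or otherwise unreadable attribute
            for line in content.splitlines():
                lines.append(f"{rel}:{line}")
    return lines
```

Writing "\n".join(dump_tree()) to a file gives an attachment comparable to /tmp/thermal_settings.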
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c10
Raymond Wooninck
From my point of view, it looks like the acpi_cpufreq driver keeps the CPU at its highest speed, hence the overheating. When the acpi_cpufreq driver is not loaded, the initial 2.4 GHz speed is reduced to around 1100 MHz, and this keeps the cores cooler. However, in the latter situation the speed is never restored to the maximum again (which I also indicated in bnc#767692).
I have done the tests with the 3.5.0 kernel, and I now want to test how the vanilla 3.4 kernel reacts. I will attach the output of the two requested commands.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c11
--- Comment #11 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c12
--- Comment #12 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c13
--- Comment #13 from Raymond Wooninck
From acpidump -> you seem to have a passive cooling and a critical trip point. Please double check which value the passive one has (I guess it is too near at
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c14
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c15
Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c16
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c17
--- Comment #17 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c18
--- Comment #18 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c19
Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c20
--- Comment #20 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c21
--- Comment #21 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c22
Thomas Renninger
I also put the hoover to work again and tried to clean the dust out of the laptop once more. I have to admit that this made the laptop run for a few seconds longer, but in the end I got the same issue again.

You probably have to open the chassis. It's possible that the dust is stuck behind the small fins of the fan protection. I did it once for a ThinkPad where a cable contact to the fan was loose. You should google and read up on some articles before you start. It was quite some fiddling; if you are lucky, they changed the design of how the devices are set up and it's a bit easier with your model (pull up the keyboard?).
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c23
--- Comment #23 from Rui Zhang
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c24
--- Comment #24 from Rui Zhang
say "watch -n1 grep . /sys/class/thermal/*/* >> log" and burn the processor.
Sorry, this command does not work. You just need to capture the thermal status every second before the shutdown, and attach the output here.
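One possible way to capture the thermal status once per second, as asked for here, is a small polling script. This is a sketch under assumptions, not something posted in the bug: the names snapshot and log_thermal are invented, and the fsync after each sample is there only so that the last samples have a better chance of reaching the disk if the machine shuts down abruptly.

```python
import glob
import os
import time


def snapshot(pattern="/sys/class/thermal/*/*"):
    """Return 'path:line' for every non-empty line of each readable file."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories such as 'power' or 'subsystem'
        try:
            with open(path, "r", errors="replace") as fh:
                text = fh.read()
        except OSError:
            continue
        lines.extend(f"{path}:{line}" for line in text.splitlines() if line)
    return lines


def log_thermal(logfile, pattern="/sys/class/thermal/*/*", interval=1.0, samples=None):
    """Append a timestamped snapshot to logfile every `interval` seconds.

    With samples=None the loop runs until it is interrupted.
    """
    count = 0
    with open(logfile, "a") as log:
        while samples is None or count < samples:
            log.write(f"--- {time.strftime('%H:%M:%S')}\n")
            for line in snapshot(pattern):
                log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample to disk right away
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)
```

Starting log_thermal("thermal.log") in one terminal and then loading the processor in another leaves a per-second record up to the point where the machine goes down.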
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c25
--- Comment #25 from Raymond Wooninck
-> I am not sure nocrt=1 is such a good workaround and would be careful with it.
It seems indeed that it prevents the kernel from performing the action at the critical trip point. Hopefully there is still a hardware mechanism left that would shut down the notebook completely. Fortunately I didn't constantly overheat my notebook, but the new workaround seems to be better and safer.
It was quite some fiddling, if you are lucky they changed the design of how the devices are set up and it's a bit easier with your model (pull up the keyboard?).
I will also check with our local helpdesk guy. Hopefully he has some more experience. I also want to see whether he still has a spare T410 that I could use for a quick test to see if things work properly there. If so, it would definitely be a hardware issue with my T410.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c26
--- Comment #26 from Raymond Wooninck
You can use thermal.psv=80 to override the passive trip point to 80C, but as the firmware may not generate thermal events for this new trip point, you need to use thermal.tzp= to set a polling frequency. (Note that this polls all the time, no matter whether the temperature is high or not.)
Good to know, but it seems that something similar is done when I change the thermal policy in the BIOS to react the same way as on battery. I guess this would be a safer way.
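To see how close the passive trip point sits to the critical one before or after such an override, the trip points of a thermal zone can be read from sysfs. The sketch below assumes the usual trip_point_N_type and trip_point_N_temp attributes with temperatures in millidegrees Celsius; the function names are illustrative and the zone directory has to be supplied by the caller.

```python
from pathlib import Path


def read_trip_points(zone_dir):
    """Return sorted (index, type, temp_celsius) tuples for one thermal zone."""
    zone = Path(zone_dir)
    trips = []
    for type_file in zone.glob("trip_point_*_type"):
        try:
            index = int(type_file.name.split("_")[2])
            kind = type_file.read_text().strip()
            millideg = int((zone / f"trip_point_{index}_temp").read_text().strip())
        except (OSError, ValueError):
            continue  # missing or unparsable attribute, skip this trip point
        trips.append((index, kind, millideg / 1000.0))
    return sorted(trips)


def passive_margin(trips):
    """Degrees Celsius between the passive and critical trip points, or None."""
    temps = {kind: temp for _, kind, temp in trips}
    if "passive" in temps and "critical" in temps:
        return temps["critical"] - temps["passive"]
    return None
```

A small margin would leave passive cooling little time to take effect before the critical action is triggered, which is the concern raised earlier in this thread.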
Re: the output of cpupower monitor without acpi_cpufreq loaded: why is the processor running at 1200 when it is busy? Who changes the CPU frequency?
What I noticed is that the CPU is running at high speed just before it reaches a thermal trip point. As soon as that one is reached then the CPU seems to be throttled. As soon as this happens, the CPU frequency stays on that low speed until a reboot.
BTW, can you please attach the output of "grep . /sys/class/thermal/*/*" without acpi_cpufreq loaded? say "watch -n1 grep . /sys/class/thermal/*/* >> log" and burn the processor. I'd like to see how thermal works before shutting down.
There are a lot of files in that path. Any particular one that you are interested in or you just want to see the temperature reported by the thermal_zone ?
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c27
--- Comment #27 from Rui Zhang
Re: the output of cpupower monitor without acpi_cpufreq loaded: why is the processor running at 1200 when it is busy? Who changes the CPU frequency?
What I noticed is that the CPU is running at high speed just before it reaches a thermal trip point. As soon as that one is reached then the CPU seems to be throttled. As soon as this happens, the CPU frequency stays on that low speed until a reboot.
If the thermal driver is working, the processor should be set back when the temperature drops. Did you notice this?
BTW, can you please attach the output of "grep . /sys/class/thermal/*/*" without acpi_cpufreq loaded? say "watch -n1 grep . /sys/class/thermal/*/* >> log" and burn the processor. I'd like to see how thermal works before shutting down.
There are a lot of files in that path. Any particular one that you are interested in or you just want to see the temperature reported by the thermal_zone ?
I'd like to see the temperature and what cooling state the processor is in at that time.
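If only the temperature and the cooling state are of interest, a narrower reader than a full dump is enough. The following sketch assumes the standard thermal_zone*/temp attribute (millidegrees Celsius) and the cooling_device*/type, cur_state and max_state attributes; the function name thermal_status is made up for illustration.

```python
from pathlib import Path


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def thermal_status(root="/sys/class/thermal"):
    """Return (zones, cooling): zone temperatures in Celsius and cooling states."""
    base = Path(root)
    zones = {}
    for zone in sorted(base.glob("thermal_zone*")):
        raw = _read(zone / "temp")
        try:
            zones[zone.name] = int(raw) / 1000.0
        except (TypeError, ValueError):
            zones[zone.name] = None  # temp missing or not a number
    cooling = {}
    for dev in sorted(base.glob("cooling_device*")):
        cooling[dev.name] = {
            "type": _read(dev / "type"),
            "cur_state": _read(dev / "cur_state"),
            "max_state": _read(dev / "max_state"),
        }
    return zones, cooling
```

Sampling thermal_status() repeatedly while the processor is loaded shows whether cur_state rises as the temperature climbs and falls back afterwards.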
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c28
--- Comment #28 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c29
--- Comment #29 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c30
--- Comment #30 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c31
Raymond Wooninck
We opened the laptop, took a hoover ... Since then I see the standard behavior ... So what we saw (machine shutdown, frequency throttled,...) is a perfectly running kernel picking up on BIOS notifications and reacting accordingly to
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c32
Thomas Renninger
Strangely enough the same task with an 3.3 kernel can easily survive and also under Windows I do not have any problems.
Anyway, I am closing this one as invalid.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c33
--- Comment #33 from Raymond Wooninck
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c34
--- Comment #34 from Rui Zhang
Created an attachment (id=500843) --> (http://bugzilla.novell.com/attachment.cgi?id=500843) [details] Thermal and Status output
This shows that thermal works correctly: it put the processors into a deeper cooling state when the temperature was 92C, and set them back to cooling state 0 when the temperature dropped. But IMO, without the acpi-cpufreq driver loaded, the processors enter T-states only to do the throttling; it is still not clear to me who decreases the CPU frequency.
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=773563#c35
Raymond Wooninck