[opensuse-factory] Thermal issues with kernel 3.4 (64bits)
Hi,

I just wanted to check if I am the only one that has thermal issues with the 3.4 kernel. I see the following behavior:

1) Load the notebook with a gcc activity until all processors are at 100% load
2) Shortly after, you get notifications that a thermal event is reached (CPUs > 100 degrees)
3) The kernel reacts and shuts down the system.

Strangely enough, the same task can easily survive with a 3.3 kernel, and under Windows I do not have any problems either.

Checking bugs.kernel.org, I found something regarding this effect, and they suggested loading the thermal module with nocrt=1 so that the action on the first trip point is not initiated. This indeed helps a lot, and it seems that even with a high load the notebook doesn't get that hot. Also, no shutdown is initiated.

I also tried kernel 3.5 from Kernel:HEAD, and this one has the same effect with the temperature. However, using the nocrt=1 option, I get a different undesired effect. Once the temperature gets to a certain point, the kernel lowers the maximum speed of the CPUs to the lowest point. Once the CPUs have cooled down, this maximum speed should be increased again. With kernel 3.4 this works as indicated. With kernel 3.5, however, this maximum speed is not changed anymore until a reboot.

I also filed the following two bugs for this:
https://bugzilla.novell.com/show_bug.cgi?id=773563
https://bugzilla.novell.com/show_bug.cgi?id=767692

Regards
Raymond
--
To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-factory+owner@opensuse.org
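The temperatures and trip points described in this report can be inspected from user space. The following is a minimal sketch, assuming the kernel exposes its thermal zones under /sys/class/thermal with values in millidegrees Celsius. The function name read_thermal_zones and its root parameter are illustrative and not part of any tool mentioned in this thread.

```python
from pathlib import Path


def read_thermal_zones(root="/sys/class/thermal"):
    """Return one dict per thermal zone with its temperature and trip points.

    Assumption: each zone directory holds a `temp` file and optional
    `trip_point_<n>_type` / `trip_point_<n>_temp` pairs, all in millidegrees C.
    """
    zones = []
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            temp_c = int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone without a readable temperature
        trips = []
        for type_file in sorted(zone.glob("trip_point_*_type")):
            temp_file = type_file.with_name(type_file.name.replace("_type", "_temp"))
            try:
                trip_type = type_file.read_text().strip()
                trip_c = int(temp_file.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # incomplete trip point entry
            trips.append((trip_type, trip_c))
        zones.append({"zone": zone.name, "temp_c": temp_c, "trips": trips})
    return zones
```

Calling read_thermal_zones() while the gcc load is running would show how close each zone is to its "critical" trip point, which is the one that leads to a shutdown.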
Hello, On Sat, 28 Jul 2012, Raymond Wooninck wrote:
I just wanted to check if I am the only one that has thermal issues with the 3.4 kernel ? I see the following behavior :
1) Load the notebook with a gcc activity until all processors are 100% load 2) Shortly after you get notifications that a thermal event is reached (CPU's > 100 degrees) 3) Kernel reacts and shuts down the system. [..] effect. Once the temperature gets to a certain point, the kernel lowers the speed of the CPU's to the lowest point.
Thermal throttling is done autonomously by the CPU.

Anyway: up until a few days ago, I used the "boxed" cooler/fan combo on my Athlon II X2 250 (2x3.0GHz). As the cooler wasn't squeaky clean anymore but had gathered some dust, the CPU got to ~59°C (k10temp)/69°C (it8720 MoBo sensor) rather fast when running both cores at 100% (cleaning the cooler recently helped a bit, but not much), with the fan spinning at its top speed of about 3200 min^-1. I've now changed to an "overspecced" cooler/fan (Scythe Shuriken Rev. B w/120mm fan) and guess what: just today I ran both cores for hours at 100%, and temps topped out at about 40°C/51°C (with a room temp of ~25°C!) with the fan spinning at a leisurely ~1300-1350 min^-1 (out of a ~500-2350 min^-1 range, controlled by "fancontrol"). Almost 20°C cooler, along with less noise? That was worth the expense of 24 EUR. :) Now I just hear the 7 disks spinning (I also changed the case fan and swapped the 8th disk for an SSD (Samsung SSD 830 128G; the 64G is half as fast, the 256G just a bit faster but more than twice as expensive. Makes for a snappy boot and made me convert / to ext4 *g*)).

But, as you have a notebook and are stuck with your cooling system, check up on dusted up coolers / vents etc. Carefully clean the fans. Google for how to (suck vs. blow).

Anyway: having the CPU throttling itself on load is not "normal". Whether the load comes from cinebench, prime95, gcc, mencoder, folding@home, seti@home or whatever is irrelevant.
Once the CPU's are cooled down, this maximum speed should be increased again.
ACK. -dnh --
What is the smartest way to get rid of the leading zero? -- N.K
Revolution, or rather overthrow. Or you wait until the next election. -- J. P. Meier
Only the rrrrrevolution helps! An election is at most a change of sign. -- W.Flamme
Where do you think terms like "red zero" and "black zero" come from?
On Saturday, July 28, 2012 22:43:35 David Haller wrote:
Thermal throttling is done autonomously by the CPU.
Are you sure that the hardware is taking care of this or should the kernel instruct the CPU to do so ?
But, as you have a notebook and are stuck with your cooling system, check up on dusted up coolers / vents etc. Carefully clean the fans. Google for how to (suck vs. blow).
That is the first thing that I did, but I can't imagine that the amount of dust is automatically changing with the kernel ? Otherwise we should introduce a dustlevel to each kernel. But seriously, I have cleaned the cooling system already but I can see a very big difference in behaviour between the 3.3 kernel series and the 3.4/3.5 kernels. Even running for 30 seconds on full load causes a full shutdown of the notebook with a 3.4 kernel. With a 3.3 kernel I hear the fans spinning up and hot air is nicely blown out of the notebook without causing the shutdown effect. If I use the nocrt=1 for the thermal module on a 3.4 system, I am able to reproduce the normal behaviour of the 3.3 kernel. Without that option, the kernel throws a thermal exception and shuts down the notebook.
Anyway: having the CPU throttling itself on load is not "normal". Whether the load comes from cinebench, prime95, gcc, mencoder, folding@home, seti@home or whatever is irrelevant.
Well, this is what both the 3.4 and 3.5 kernels are doing. The maximum speed is lowered in order to reduce the temperature (I assume).
Once the CPU's are cooled down, this maximum speed should be increased again.
This is what the 3.5 kernel is not doing. The maximum speed is no longer increased. For me it is strange to see that this all started when the kernel developers made the changes to the cpufreq module. First we got the situation that it was not automatically loaded and now we see thermal issues. Raymond
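Whether the maximum speed has been restored after cooling down can be checked by comparing the current frequency limit with the hardware maximum. The following is a minimal sketch, assuming the cpufreq sysfs files scaling_max_freq and cpuinfo_max_freq are present under /sys/devices/system/cpu (values in kHz). The function name capped_cpus is illustrative.

```python
from pathlib import Path


def capped_cpus(root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_limit_khz, hardware_max_khz)} for every CPU
    whose current frequency limit is below the hardware maximum."""
    capped = {}
    for cpufreq in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        try:
            limit = int((cpufreq / "scaling_max_freq").read_text())
            hw_max = int((cpufreq / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # CPU without readable cpufreq data
        if limit < hw_max:
            capped[cpufreq.parent.name] = (limit, hw_max)
    return capped
```

If capped_cpus() still returns entries long after the load has ended and the CPUs have cooled down, that would match the 3.5 behaviour described above.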
On 07/29/2012 04:06 AM, Raymond Wooninck wrote:
This is what the 3.5 kernel is not doing. The maximum speed is no longer increased.
For me it is strange to see that this all started when the kernel developers made the changes to the cpufreq module. First we got the situation that it was not automatically loaded and now we see thermal issues.
Why are you complaining about the issue here? First of all, I do not recall that you stated what CPU/chip set you have. For me, my AMD dual-core CPU in an HP dv2815nr works fine on all kernels up through 3.5. When I run a "make -j4" on the kernel, the fan speeds up, but the system never shuts down. If it did and the cooling system had been cleaned, I would report the symptoms and history to linux-kernel@vger.kernel.org and Rafael J. Wysocki <rjw@sisk.pl>, who is the maintainer for the CPU frequency drivers. Larry
On Sunday, July 29, 2012 09:26:48 Larry Finger wrote:
Why are you complaining about the issue here? First of all, I do not recall
I guess that you never read my first email, but that you are reacting to my last post. In the first email I wrote about this issue, I indicated that I wanted to validate whether I was the only one with this issue or whether others also see the same. I also mentioned that I had already filed two bug reports for the issue, so that if other people see the same issue, they could add to these bug reports. Today I tested further and it clearly has to do with the acpi_cpufreq driver. Using the kernel-vanilla from Kernel:HEAD the issue does not occur; however, the acpi_cpufreq driver is also not loaded there. If I load the acpi_cpufreq driver, then the notebook is shut down very shortly after the high load starts. Maybe you should validate which kernel you are using. I have a Lenovo T410 with an Intel I5 CPU (dual-core with hyperthreading).
it did and the cooling system had been cleaned, I would report the symptoms
Can we please stop constantly referring to the cooling system of the laptop itself? I have already stated a couple of times that, unless the kernel itself has influence on the amount of dust there, this has nothing to do with the issue itself. As indicated before, kernel 3.3.x runs fine without any issues, and kernel-vanilla 3.5.x (without the acpi_cpufreq module) also runs fine. The temperature of my laptop at 100% CPU load doesn't even reach 70 degrees, but as soon as I load the acpi_cpufreq module the temperature reaches 101 degrees within a matter of seconds. Still convinced that it is an issue of a dirty fan ?? Raymond
On 07/29/2012 09:44 AM, Raymond Wooninck wrote:
On Sunday, July 29, 2012 09:26:48 Larry Finger wrote:
Why are you complaining about the issue here? First of all, I do not recall
I guess that you never read my first email, but that you are reacting on my last post. In the first email I wrote about this issue, I indicated that I wanted to validate if I was the only one with this issue or that others also see the same.
Also I mentioned that I already filed two bug reports for the issue, so that if other people see the same issue, then they could add to these bug reports.
Today I tested further and it has clearly to do with the acpi_cpufreq driver. Using the kernel-vanilla from Kernel:HEAD the issue does not occur, however the acpi_cpufreq driver is also not loaded. If I load the acpi_cpufreq driver, then the notebook is shut down very shortly after the high load.
Maybe you should validate which kernel you are using. I have a Lenovo T410 with an Intel I5 CPU (dual-core with hyperthreading).
it did and the cooling system had been cleaned, I would report the symptoms Can we please stop with constantly referring to the cooling system of the laptop itself.
Reread what I said. I did not say your cpu fan was dirty. What I said was that I would not complain about CPU heating UNTIL I HAD CLEANED IT. My CPU is an AMD dual-core running at 2.0 GHz. I don't use any Intel CPUs and I have no idea what quirks they have. My kernel is the latest pull from the wireless-testing git tree, and reports 3.5.0-wl. Larry
On Sunday, 29 July 2012, 16:00:59, Larry Finger wrote:
it did and the cooling system had been cleaned, I would report the symptoms
Can we please stop with constantly referring to the cooling system of the laptop itself.
Reread what I said. I did not say your cpu fan was dirty. What I said was that I would not complain about CPU heating UNTIL I HAD CLEANED IT.
I guess you did not read his email properly. What Raymond tells you is that in this context cleaning is irrelevant because it was proven to not be the source of the issue. Hence it is useless (misses the point) to bring it up – even if just for the sake of bringing it up. Sven
On Sun, 29 Jul 2012 11:06:40 +0200, Raymond Wooninck <tittiatcoke@gmail.com> wrote:
Are you sure that the hardware is taking care of this or should the kernel instruct the CPU to do so ?
Nope, thermal throttling is built into the CPU and it activates it on its own with no way for software to intervene. Of course the kernel can lower the load on the CPU and thus keep the CPU cooler. Philipp
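On Intel systems the kernel usually counts how often the hardware throttled itself, which gives a way to see whether the CPU's own protection ever kicked in. The following is a minimal sketch, assuming the counters core_throttle_count and package_throttle_count are exposed under each CPU's thermal_throttle directory in sysfs. Their presence depends on CPU and kernel configuration, and the function name throttle_counts is illustrative.

```python
from pathlib import Path


def throttle_counts(root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: value}} for every CPU that exposes
    hardware thermal throttle counters."""
    counts = {}
    for directory in sorted(Path(root).glob("cpu[0-9]*/thermal_throttle")):
        entry = {}
        for name in ("core_throttle_count", "package_throttle_count"):
            try:
                entry[name] = int((directory / name).read_text())
            except (OSError, ValueError):
                pass  # counter missing or unreadable on this system
        if entry:
            counts[directory.parent.name] = entry
    return counts
```

Comparing the counters before and after a load run would indicate whether the CPU throttled on its own during that run.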
On Sunday 2012-07-29 11:06, Raymond Wooninck wrote:
On Saturday, July 28, 2012 22:43:35 David Haller wrote:
Thermal throttling is done autonomously by the CPU.
Are you sure that the hardware is taking care of this or should the kernel instruct the CPU to do so ?
When was the last time you were given a device that melted away, "per design", rather than due to "cheap" Asian hardware, without software instruction? I hear some old old Apple iMac PPC devices once did that, but I seriously doubt you have one of those.
On Tuesday, July 31, 2012 09:51:06 Jan Engelhardt wrote:
Are you sure that the hardware is taking care of this or should the kernel instruct the CPU to do so ?
When was the last time you were given a device that melted away, "per design", rather than due to "cheap" Asian hardware, without software instruction?
Well it is one of the first Intel I5 dual-cores with hyperthreading. But as indicated, if I leave the whole throttling over to the CPUs and don't do any dynamic frequency scaling (through the acpi_cpufreq kernel module), then everything works as expected. If I load the acpi_cpufreq module, then the CPUs are overheating within seconds. The difference is quite big. The same action on a kernel without acpi_cpufreq gives me 100% load on the 4 cores and the temperature remains below 70 degrees. Loading acpi_cpufreq, I get the same 100% load on all 4 cores, but the temperature rises to 101 degrees within about 5 seconds. I know that Lenovo is an Asian company, but it should still be based on the old Big Blue design. So either acpi_cpufreq found a little hole in the CPU design or it is able to block the automatic thermal throttling done by the CPU. Raymond
On Tue, Jul 31, 2012 at 2:54 AM, Raymond Wooninck <tittiatcoke@gmail.com> wrote:
On Tuesday, July 31, 2012 09:51:06 Jan Engelhardt wrote:
Are you sure that the hardware is taking care of this or should the kernel instruct the CPU to do so ?
When was the last time you were given a device that melted away, "per design", rather than due to "cheap" Asian hardware, without software instruction?
Well it is one of the first Intel I5 dual-cores with hyperthreading. But as indicated if I leave the whole throttling over to the CPU's and don't do any dynamic frequency scaling (through the acpi_cpufreq kernel module), then everything works as expected. If I load the acpi_cpufreq module, then the CPU's are overheating within seconds. The difference is quite big. The same action on a kernel without acpi_cpufreq gives me 100% load on the 4 cores and the temperature remains below 70 degrees. Loading the acpi_cpufreq, I get the same 100% load on all 4 cores, but the temperature rises to 101 degrees within about 5 seconds.
I know that Lenovo is an Asian company, but it should still be based on old big blue design. So either acpi_cpufreq found a little hole in the CPU design or it is able to block the automatic thermal throttling done by the CPU.
Make sure your BIOS is up-to-date. The early i3/i5/i7 Lenovo (thinkpad, at least) BIOSes had significant issues with thermal management. Thermal management is done by the EC (Embedded Controller) part of the BIOS. Trip points, actions-to-take, etc... are all there. Make sure your BIOS is up-to-date. -- Jon
On Tuesday, 31 July 2012, 11:17:55, Jon Nelson wrote:
Make sure your BIOS is up-to-date. The early i3/i5/i7 Lenovo (thinkpad, at least) BIOSes had significant issues with thermal management. Thermal management is done by the EC (Embedded Controller) part of the BIOS. Trip points, actions-to-take, etc... are all there. Make sure your BIOS is up-to-date.
If it was the BIOS, why does it not overheat with a different kernel version? Sven
On Tue, Jul 31, 2012 at 1:40 PM, Sven Burmeister <sven.burmeister@gmx.net> wrote:
Make sure your BIOS is up-to-date. The early i3/i5/i7 Lenovo (thinkpad, at least) BIOSes had significant issues with thermal management. Thermal management is done by the EC (Embedded Controller) part of the BIOS. Trip points, actions-to-take, etc... are all there. Make sure your BIOS is up-to-date.
If it was the BIOS, why does it not overheat with a different kernel version?
BIOS bugs can be triggered by different kernel behavior
On Tuesday, 31 July 2012, 13:59:29, Claudio Freire wrote:
If it was the BIOS, why does it not overheat with a different kernel version? BIOS bugs can be triggered by different kernel behavior
And how would one find the change in the kernel that causes the issue, or rather triggers the bug? By updating the BIOS and waiting until the next user runs into that issue? Sven
On Tue, Jul 31, 2012 at 2:14 PM, Sven Burmeister <sven.burmeister@gmx.net> wrote:
If it was the BIOS, why does it not overheat with a different kernel version? BIOS bugs can be triggered by different kernel behavior
And how would one find the change in the kernel that causes the issue, or rather triggers the bug? By updating the BIOS and waiting until the next user runs into that issue?
No, if you wanted to find the changeset that introduced the bug, you'd have to do bisection on the kernel tree. Not easy, but maybe worth it. I think there was a review magazine that does that quite often, and they have a neat toolset to do it automatically. I'm not sure whether the toolset is open source, though.
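The idea behind bisection is a binary search over the ordered list of changes between a known-good and a known-bad kernel. The following is a minimal sketch of that search. It assumes the first revision is good, the last is bad, and every revision after the first bad one is also bad. The names first_bad and is_bad are illustrative, and in practice `git bisect` performs this search on a real tree.

```python
def first_bad(revisions, is_bad):
    """Return the index of the first bad revision in an ordered list.

    Assumptions: revisions[0] is known good, revisions[-1] is known bad,
    and once a revision is bad all later ones are bad as well.
    is_bad is a callable that builds/tests one revision and returns a bool.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad
```

Each step halves the remaining range, so a range of a few thousand changesets needs roughly a dozen build-and-test cycles, each of which here would mean booting a kernel and running the gcc load.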
On Tuesday, July 31, 2012 13:59:29 Claudio Freire wrote:
Make sure your BIOS is up-to-date. The early i3/i5/i7 Lenovo (thinkpad, at least) BIOSes had significant issues with thermal management. Thermal management is done by the EC (Embedded Controller) part of the BIOS. Trip points, actions-to-take, etc... are all there. Make sure your BIOS is up-to-date.
If it was the BIOS, why does it not overheat with a different kernel version? BIOS bugs can be triggered by different kernel behavior
Well, I have updated the BIOS to the latest available version for my laptop and this didn't resolve the issue. Would have been strange if it did, but it would have been possible. As indicated before, if I use the same kernel but without the acpi_cpufreq kernel module the laptop behaves correctly and does NOT overheat. If I load the acpi_cpufreq module (and this one loads the mperf module) then the laptop overheats within seconds with full load. Raymond
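When comparing runs with and without the driver, it helps to record which of the two modules named above are actually loaded. The following is a minimal sketch, assuming the usual /proc/modules format in which the module name is the first whitespace-separated field on each line. The function names are illustrative, and a driver built into the kernel would not appear in this file.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def cpufreq_modules_loaded(path="/proc/modules"):
    """Report whether acpi_cpufreq and mperf are currently loaded as modules."""
    with open(path) as handle:
        modules = loaded_modules(handle.read())
    return {name: name in modules for name in ("acpi_cpufreq", "mperf")}
```

Logging the result of cpufreq_modules_loaded() next to each temperature reading makes it clear which configuration a given measurement belongs to.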
Hello, On Tue, 31 Jul 2012, Raymond Wooninck wrote:
On Tuesday, July 31, 2012 13:59:29 Claudio Freire wrote:
Make sure your BIOS is up-to-date. The early i3/i5/i7 Lenovo (thinkpad, at least) BIOSes had significant issues with thermal management. Thermal management is done by the EC (Embedded Controller) part of the BIOS. Trip points, actions-to-take, etc... are all there. Make sure your BIOS is up-to-date.
If it was the BIOS, why does it not overheat with a different kernel version? BIOS bugs can be triggered by different kernel behavior
Well, I have updated the BIOS to the latest available version for my laptop and this didn't resolve the issue. Would have been strange if it did, but it would have been possible.
As indicated before, if I use the same kernel but without the acpi_cpufreq kernel module the laptop behaves correctly and does NOT overheat. If I load the acpi_cpufreq module (and this one loads the mperf module) then the laptop overheats within seconds with full load.
Is "acpi_cpufreq" the right one for your CPU?

Anyway: autonomous thermal throttling by the CPU itself, using internal sensors, has been in Intel CPUs since IIRC at least the P4, and in AMD CPUs since the Athlon II or so[1]. I know, though, that my Athlon 500 (first ever sold) did _not_ have that. Your Core i5 definitely does have it.

Whatever: I'm beginning to suspect that acpi_cpufreq _misinterprets_ data coming from the sensors (CPU internal or external on the MoBo), thinks the CPU is overheating (cf. the very small delay between loading the module and "overheating"), and then the kernel tells the CPU to shut down (it can do that via "HLT" cmd and whatnot) or does a shutdown itself without the CPU _actually_ overheating. Too bad you have a laptop; in a desktop you might be able to monitor at least the heatsink temps via an external thermometer and compare the "sudden" leap to >100 degC to how the heatsink actually heats up.

Anyway: it being a bug in data interpretation by acpi_cpufreq would explain the symptoms, esp. also the kernel-version dependency.

What does 'sensors' output? And have you run a sensors-detect? Have you updated to the latest version of lm_sensors?

Just a few thoughts & HTH,
-dnh

[1] too lazy to look it up ;)
--
To resist the influence of others, knowledge of one's self is most important. -- Teal'C, Stargate SG-1, 9x14 - Stronghold
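One way to look at the "sudden leap" is to log timestamped readings and compute how fast the reported temperature rises. The following is a minimal sketch that works on (seconds, degrees Celsius) pairs collected by any means, for example from repeated 'sensors' runs. The function name max_rise_rate is illustrative, and what rate counts as implausible for a given heatsink is left to the reader.

```python
def max_rise_rate(samples):
    """Return the steepest temperature rise in degrees C per second.

    samples: list of (time_in_seconds, temp_in_celsius) pairs sorted by time.
    Pairs with identical timestamps are skipped. Returns 0.0 if fewer than
    two usable samples are given.
    """
    rates = [
        (later_temp - earlier_temp) / (later_time - earlier_time)
        for (earlier_time, earlier_temp), (later_time, later_temp)
        in zip(samples, samples[1:])
        if later_time > earlier_time
    ]
    return max(rates, default=0.0)
```

With the figures reported earlier in the thread, a rise from about 70 to 101 degrees within about 5 seconds corresponds to roughly 6 degrees per second.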
participants (8)
- Claudio Freire
- David Haller
- Jan Engelhardt
- Jon Nelson
- Larry Finger
- Philipp Thomas
- Raymond Wooninck
- Sven Burmeister