[opensuse-kernel] Fan speed control not working with kernel 3.19.4.1
Hello,
I am a tumbleweed user, and I have recently (since 20150421 or 20150422) been having issues with my laptop's CPU overheating. Most of the time, the fan does not change speed when the temperature increases, and I cannot control its speed manually with pwm1_enable set to 1.
I reported it on the factory mailing list, and together we could rule out the possibility of a hardware failure (most likely dust on the fan). In particular, through some (seemingly) random changes, I could get the fan to work under manual control. This setting does not appear to be reliably persistent upon reboot, but it proves that the fan can still spin fast enough to cool down the CPU, when the OS has full control on its speed.
The setting I have to mess with to reactivate the fan when it stops working is the "thermal.off=1" parameter of the kernel. I would like to be more accurate, but I can't: in the past few days, I have "reactivated" the fan control twice, and as far as I can tell the procedure was not the same (and the same procedure did not yield the same consequences every time).
For reference, the factory thread is here: http://lists.opensuse.org/opensuse-factory/2015-04/msg00299.html
In particular, I had gathered some information about my hardware on: https://gist.github.com/ThibautVerron/27da7b4da3940f9894c0
If some of this information is irrelevant, or some other is missing, don't hesitate to ask of course.
Thanks!
Thibaut
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
Hi Thibaut,
On Sunday 03 May 2015 at 00:33 +0200, Thibaut Verron wrote:
I am a tumbleweed user, and I have recently (since 20150421 or 20150422) been having issues with my laptop's CPU overheating. Most of the time, the fan does not change speed when the temperature increases, and I cannot control its speed manually with pwm1_enable set to 1.
I reported it on the factory mailing list, and together we could rule out the possibility of a hardware failure (most likely dust on the fan). In particular, through some (seemingly) random changes, I could get the fan to work under manual control. This setting does not appear to be reliably persistent upon reboot, but it proves that the fan can still spin fast enough to cool down the CPU, when the OS has full control on its speed.
The setting I have to mess with to reactivate the fan when it stops working is the "thermal.off=1" parameter of the kernel. I would like to be more accurate, but I can't : in the past few days, I have "reactivated" the fan control twice, and as far as I can tell the procedure was not the same (and the same procedure did not yield the same consequences every time).
For reference, the factory thread is here: http://lists.opensuse.org/opensuse-factory/2015-04/msg00299.html
I read this thread in part, and here are a few random comments:
* It is common that low fan speeds are reported incorrectly, this isn't specific to your laptop. The reason is that fan speed control is achieved by either PWM signal (most frequent) or by lowering the voltage. In both cases the rotation feedback signal gets harder to sense, and weaker signal translates to incorrect speed values being reported. While inconvenient, it should in general not result in any issue with thermal management.
* If a recent kernel somehow messed up with your system, it is possible that the problem survives warm reboots, including switching to other operating systems or back to older kernels. I urge you to always _cold_ boot the machine before every test you perform. On laptops, cold booting may require unplugging the AC adapter AND removing the battery AND waiting for a couple minutes.
* Your problems sound BIOS-related to me. In most BIOS there is an option to load setup or failsafe defaults. If you didn't try that yet, that would be worth trying.
* If the previous advice doesn't help, it might be worth re-flashing the BIOS even if no new version is available. If the BIOS code was somehow corrupted, that would restore it.
* When did you try manual fan speed control with pwm1_enable for the last time? If you normally don't need to do that, it is entirely possible that this has been broken for a longer time and you did not notice and this issue is unrelated with your current troubles. OTOH if the fan speed controller is hosed somehow, it wouldn't be all that surprising if that affects both manual and automatic modes.
FWIW there was no recent change to the eeepc-laptop driver.
As a closing note, I don't know how old your model is, but it should be noted that low-priced consumer hardware showing issues after 4-5 years is nothing out of the ordinary. This may not be what's happening here... but it may as well be "just" that.
In particular, I had gathered some information about my hardware on: https://gist.github.com/ThibautVerron/27da7b4da3940f9894c0
If some of this information is irrelevant, or some other is missing, don't hesitate to ask of course.
--
Jean Delvare
SUSE L3 Support
2015-05-05 22:08 GMT+02:00 Jean Delvare <jdelvare@suse.de>:
Hi Thibaut,
On Sunday 03 May 2015 at 00:33 +0200, Thibaut Verron wrote:
I am a tumbleweed user, and I have recently (since 20150421 or 20150422) been having issues with my laptop's CPU overheating. Most of the time, the fan does not change speed when the temperature increases, and I cannot control its speed manually with pwm1_enable set to 1.
I reported it on the factory mailing list, and together we could rule out the possibility of a hardware failure (most likely dust on the fan). In particular, through some (seemingly) random changes, I could get the fan to work under manual control. This setting does not appear to be reliably persistent upon reboot, but it proves that the fan can still spin fast enough to cool down the CPU, when the OS has full control on its speed.
The setting I have to mess with to reactivate the fan when it stops working is the "thermal.off=1" parameter of the kernel. I would like to be more accurate, but I can't : in the past few days, I have "reactivated" the fan control twice, and as far as I can tell the procedure was not the same (and the same procedure did not yield the same consequences every time).
For reference, the factory thread is here: http://lists.opensuse.org/opensuse-factory/2015-04/msg00299.html
I read this thread in part, and here are a few random comments:
* It is common that low fan speeds are reported incorrectly, this isn't specific to your laptop. The reason is that fan speed control is achieved by either PWM signal (most frequent) or by lowering the voltage. In both cases the rotation feedback signal gets harder to sense, and weaker signal translates to incorrect speed values being reported. While inconvenient, it should in general not result in any issue with thermal management.
Ok. Anyway, this part of the problem is not new, but now there is a logical explanation for that too.
* If a recent kernel somehow messed up with your system, it is possible that the problem survives warm reboots, including switching to other operating systems or back to older kernels. I urge you to always _cold_ boot the machine before every test you perform. On laptops, cold booting may require unplugging the AC adapter AND removing the battery AND waiting for a couple minutes.
I did try that occasionally when "messing around with settings", but you have a good point, I'll make that a habit. Rebooting vs turning off and back on the computer (without unplugging it or waiting more than 10s) did yield different results sometimes (but still nothing reproducible).
* Your problems sound BIOS-related to me. In most BIOS there is an option to load setup or failsafe defaults. If you didn't try that yet, that would be worth trying.
I have tried restoring the bios to factory settings (the only thing my bios seems to offer on this matter) and it didn't help. Thank you, indeed I forgot to mention that on the other thread.
* If the previous advice doesn't help, it might be worth re-flashing the BIOS even if no new version is available. If the BIOS code was somehow corrupted, that would restore it.
I could try that, but I'm wary of taking this step: flashing a bios with a computer that may shutdown because of critical temperature sounds a bit dangerous. The first time I got issues of this kind, I did update the bios (with the laptop directly on an indoor air conditioner, I could do that because it was July), and it did not help. But since nothing seems to be reproducible here, I might as well try again now...
* When did you try manual fan speed control with pwm1_enable for the last time? If you normally don't need to do that, it is entirely possible that this has been broken for a longer time and you did not notice and this issue is unrelated with your current troubles. OTOH if the fan speed controller is hosed somehow, it wouldn't be all that surprising if that affects both manual and automatic modes.
You have a good point of course, I usually only try manual control when I have problems with automatic control, so last time must have been in 2013, and I was not using openSUSE back then. It is quite possible that manual control on my laptop has never been working reliably with the openSUSE install. But is it possible that manual control be broken without affecting automatic control? I guess that's asking whether the kernel has a lower-level access to the fan controls than what's exposed in the /sys filesystem?
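As an illustration of what the /sys filesystem exposes here, the following is a minimal sketch (not taken from the thread) that lists hwmon devices offering a pwm1_enable attribute and reports their current mode. It assumes the generic hwmon sysfs layout under /sys/class/hwmon and the usual convention that 0 means no control, 1 manual and 2 automatic; individual drivers may differ, and the helper names `read_attr`, `mode_label` and `describe_fans` are made up for this example.

```python
from pathlib import Path

# Assumed meaning of pwm1_enable values (generic hwmon convention);
# a given driver may use other values.
PWM_MODES = {0: "no control (full speed)", 1: "manual", 2: "automatic"}


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if absent."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def mode_label(raw):
    """Translate a raw pwm1_enable value into a human-readable label."""
    if raw.isdigit() and int(raw) in PWM_MODES:
        return PWM_MODES[int(raw)]
    return "driver-specific value " + raw


def describe_fans(root="/sys/class/hwmon"):
    """List hwmon devices that expose pwm1_enable, with mode and readings."""
    fans = []
    for dev in sorted(Path(root).glob("hwmon*")):
        raw_mode = read_attr(dev / "pwm1_enable")
        if raw_mode is None:
            continue  # this device has no controllable fan
        fans.append({
            "device": dev.name,
            "name": read_attr(dev / "name"),
            "mode": mode_label(raw_mode),
            "pwm1": read_attr(dev / "pwm1"),
            "fan1_input": read_attr(dev / "fan1_input"),
        })
    return fans


if __name__ == "__main__":
    for fan in describe_fans():
        print(fan)
```

The sketch only reads attributes, so it needs no root privileges and cannot change the fan state; writing to pwm1_enable or pwm1 is what the manual-control attempts described in this thread amount to.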
FWIW there was no recent change to the eeepc-laptop driver.
As a closing note, I don't know how old your model is, but it should be noted that low-priced consumer hardware showing issues after 4-5 years is nothing out of the ordinary. This may not be what's happening here... but it may as well be "just" that.
That's right. But given that there were similar issues at age 1 (just past warranty expiration... a critical point in a computer's lifetime, I know), I would find it strange if this was a new issue. Of course, it is still possible that the problems I had 3 and 2 years ago were not merely software issues, but symptoms of a hardware failure that was (temporarily) mitigated with software tweaks.
Thank you very much for your comments and suggestions. If nobody has any other short-term suggestion, I'll try redownloading and flashing the bios indeed.
Thibaut
On Tue, 5 May 2015 23:02:33 +0200, Thibaut Verron wrote:
2015-05-05 22:08 GMT+02:00 Jean Delvare <jdelvare@suse.de>:
* If the previous advice doesn't help, it might be worth re-flashing the BIOS even if no new version is available. If the BIOS code was somehow corrupted, that would restore it.
I could try that, but I'm wary of taking this step: flashing a bios with a computer that may shutdown because of critical temperature sounds a bit dangerous. The first time I got issues of this kind, I did update the bios (with the laptop directly on an indoor air conditioner, I could do that because it was July), and it did not help. But since nothing seems to be reproducible here, I might as well try again now...
It happens that I am having similar trouble with an old laptop of mine at the moment, so I know how you feel. The laptop is shutting down when my daughter plays Minecraft on it. The BIOS version was old and known broken so I decided to attempt to update it. I don't know yet if it helped.
That being said my case was probably less critical than yours, as the problem would only happen under heavy load. Also according to the logs the shutdown was triggered by a thermal zone critical limit being reached, which means that it was a graceful emergency shutdown initiated by the ACPI thermal driver, not a hardware-triggered CPU power-off. There's no such thing running while no OS is running so I think I was safe flashing the BIOS. Do you have anything logged when the system shuts itself down?
If you give it a try, there are a few tricks: flash the BIOS after the machine has been off for a long time, do it in the morning when the ambient temperature is the lowest, in the coldest room, with the windows open (assuming it is colder outside.)
(...) But is it possible that manual control be broken without affecting automatic control? I guess that's asking whether the kernel has a lower-level access to the fan controls than what's exposed in the /sys filesystem?
Depends on what is broken exactly.
If the fan control output circuitry is dead, this will affect both the manual and the automatic modes. The BIOS or some user-space daemon will try to set the fan speed to the desired value but the fan will not react to the command. That can't be fixed.
If the BIOS code is broken, it will definitely affect the automatic mode, but it could also affect the manual mode, because in your case everything goes through an intermediate EC interface, including switching from automatic to manual mode and setting the speed manually. If the BIOS code is dead then even that can fail.
Broken manual control with working automatic control would only happen if the driver code is wrong and sends the wrong commands to the EC in manual mode. Unfortunately for laptop-specific kernel modules we generally don't get much help from vendors, so it's always best-effort and we can't guarantee it will work perfectly all the time.
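The failure modes described above can be summarized as a small lookup. This is only an illustrative sketch of that reasoning, not a diagnostic tool: the function name `likely_causes` and the cause labels are invented for the example, and intermittent behaviour like the one reported in this thread does not map cleanly onto two booleans.

```python
# Cause labels paraphrasing the three cases discussed above (hypothetical names).
CIRCUITRY = "fan control output circuitry is dead"
FIRMWARE = "BIOS/EC code is broken"
DRIVER = "driver sends wrong commands to the EC in manual mode"


def likely_causes(automatic_works, manual_works):
    """Map observed fan behaviour to the candidate causes named above."""
    if automatic_works and manual_works:
        return []  # nothing to explain
    if automatic_works:
        # Only manual control fails: the driver is the remaining suspect.
        return [DRIVER]
    if manual_works:
        # Dead circuitry would break both modes, so it is ruled out here.
        return [FIRMWARE]
    # Both modes fail: circuitry and firmware are both possible.
    return [CIRCUITRY, FIRMWARE]


if __name__ == "__main__":
    for auto in (True, False):
        for manual in (True, False):
            print(auto, manual, likely_causes(auto, manual))
```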
(...) As a closing note, I don't know how old your model is, but it should be noted that low-priced consumer hardware showing issues after 4-5 years is nothing out of the ordinary. This may not be what's happening here... but it may as well be "just" that.
That's right. But given that there were similar issues at age 1 (just past warranty expiration... a critical point in a computer's lifetime, I know), I would find it strange if this was a new issue.
I agree, it smells like the same issue you had back then resurfaced. Which isn't entirely surprising, BTW, as apparently you don't know how you solved the problem back then. Problems which solve themselves magically have a tendency to reappear in the same way.
Of course, it is still possible that the problems I had 3 and 2 years ago were not merely software issues, but symptoms of a hardware failure that was (temporarily) mitigated with software tweaks.
Thank you very much for your comments and suggestions. If nobody has any other short-term suggestion, I'll try redownloading and flashing the bios indeed.
--
Jean Delvare
SUSE L3 Support
2015-05-06 10:49 GMT+02:00 Jean Delvare <jdelvare@suse.de>:
On Tue, 5 May 2015 23:02:33 +0200, Thibaut Verron wrote:
2015-05-05 22:08 GMT+02:00 Jean Delvare <jdelvare@suse.de>:
* If the previous advice doesn't help, it might be worth re-flashing the BIOS even if no new version is available. If the BIOS code was somehow corrupted, that would restore it.
I could try that, but I'm wary of taking this step: flashing a bios with a computer that may shutdown because of critical temperature sounds a bit dangerous. The first time I got issues of this kind, I did update the bios (with the laptop directly on an indoor air conditioner, I could do that because it was July), and it did not help. But since nothing seems to be reproducible here, I might as well try again now...
It happens that I am having similar trouble with an old laptop of mine at the moment, so I know how you feel. The laptop is shutting down when my daughter plays Minecraft on it. The BIOS version was old and known broken so I decided to attempt to update it. I don't know yet if it helped.
That being said my case was probably less critical than yours, as the problem would only happen under heavy load.
While the CPU is not under too much stress, I can get a good 30-45 minutes of worktime before the computer becomes worryingly hot. I don't think that flashing a BIOS requires a lot of computations, and I don't recall it taking too long either, so it'd be a low risk. The consequences are more daunting than the chances of it happening.
Also according to the logs the shutdown was triggered by a thermal zone critical limit being reached, which means that it was a graceful emergency shutdown initiated by the ACPI thermal driver, not a hardware-triggered CPU power-off. There's no such thing running while no OS is running so I think I was safe flashing the BIOS. Do you have anything logged when the system shuts itself down?
I just tested it again, there seems to be a message (too fast to read), then it showed a tty prompt for a few seconds before going off. It is the same behavior as when I turn the computer off normally. So, yes, "graceful emergency shutdown". But that doesn't imply that there is no hardware emergency shutdown available... For example, I've also run a memtest (computer still hot), after two minutes it went off without warning. Or is there an ACPI driver in memtest too, but memtest doesn't need to take any care before shutting down?
If you give it a try, there are a few tricks: flash the BIOS after the machine has been off for a long time, do it in the morning when the ambient temperature is the lowest, in the coldest room, with the windows open (assuming it is colder outside.)
From the tests I had done back then, even a difference of 5-10° in room temperature didn't make much of a difference for the CPU. The air conditioner works nicely because it basically forces the bottom of the laptop to be at 20°... The fan was not spinning at all and the CPU (or at least its thermal sensor) was sitting at 35-40°. But with a cold computer I should be fine.
(...) But is it possible that manual control be broken without affecting automatic control? I guess that's asking whether the kernel has a lower-level access to the fan controls than what's exposed in the /sys filesystem?
Depends on what is broken exactly.
If the fan control output circuitry is dead, this will affect both the manual and the automatic modes. The BIOS or some user-space daemon will try to set the fan speed to the desired value but the fan will not react to the command. That can't be fixed.
This is not the case, since I can still get the fan to spin (or stop), sometimes. At least it's not entirely broken, I mean. It could be a poor contact or any other form of non-reliable hardware failure, though.
(...) As a closing note, I don't know how old your model is, but it should be noted that low-priced consumer hardware showing issues after 4-5 years is nothing out of the ordinary. This may not be what's happening here... but it may as well be "just" that.
That's right. But given that there were similar issues at age 1 (just past warranty expiration... a critical point in a computer's lifetime, I know), I would find it strange if this was a new issue.
I agree, it smells like the same issue you had back then resurfaced. Which isn't entirely surprising, BTW, as apparently you don't know how you solved the problem back then. Problems which solve themselves magically have a tendency to reappear in the same way.
And the next time the same magical handwaving and flag tweaking doesn't work as well. I'll look for an exorcist if all else fails.
Thanks again!
Thibaut
On Wednesday 06 May 2015 at 11:40 +0200, Thibaut Verron wrote:
While the CPU is not under too much stress, I can get a good 30-45 minutes of worktime before the computer becomes worryingly hot. I don't think that flashing a BIOS requires a lot of computations, and I don't recall it taking too long either, so it'd be a low risk. The consequences are more daunting than the chances of it happening.
Indeed it seems safe to attempt then.
Also according to the logs the shutdown was triggered by a thermal zone critical limit being reached, which means that it was a graceful emergency shutdown initiated by the ACPI thermal driver, not a hardware-triggered CPU power-off. There's no such thing running while no OS is running so I think I was safe flashing the BIOS. Do you have anything logged when the system shuts itself down?
I just tested it again, there seems to be a message (too fast to read), then it showed a tty prompt for a few seconds before going off. It is the same behavior as when I turn the computer off normally. So, yes, "graceful emergency shutdown".
FWIW I could even see the message after rebooting in the output of journalctl, so it was graceful enough that it could log the message to the disk.
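To check for such a message after the fact, something along these lines could be used. It is a rough sketch under two assumptions: that the journal is persistent (so `journalctl -k -b -1` can show the previous boot's kernel messages), and that the thermal driver's message contains the words "critical temperature", whose exact wording varies between kernel versions. The function names are invented for this example.

```python
import re
import subprocess

# Assumption: the shutdown message contains "critical temperature";
# the precise text depends on the kernel version.
CRITICAL_RE = re.compile(r"critical temperature", re.IGNORECASE)


def find_thermal_shutdowns(lines):
    """Return the log lines that look like a thermal emergency shutdown."""
    return [line for line in lines if CRITICAL_RE.search(line)]


def previous_boot_kernel_log():
    """Fetch kernel messages of the previous boot via journalctl."""
    proc = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=False,
    )
    return proc.stdout.splitlines()


if __name__ == "__main__":
    try:
        hits = find_thermal_shutdowns(previous_boot_kernel_log())
    except FileNotFoundError:
        hits = []
        print("journalctl not found on this system")
    for line in hits:
        print(line)
    if not hits:
        print("no thermal shutdown message found for the previous boot")
```

An empty result is consistent with either no shutdown at all or a hardware-triggered power-off, since the latter leaves the OS no time to log anything.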
But that doesn't imply that there is no hardware emergency shutdown available... For example, I've also run a memtest (computer still hot), after two minutes it went off without warning. Or is there an ACPI driver in memtest too, but memtest doesn't need to take any care before shutting down?
I very much doubt that memtest86 has any care for ACPI. So there must be indeed a hardware protection triggering, either at the processor level, or at the BIOS level with a SMI and some SMM code. After all, the ACPI-driven shut down is there to prevent the violent hardware one from triggering...
Either way, whatever happens in memtest86 will happen when flashing the BIOS, as these are comparable conditions and workloads. The only difference is the duration, of course.
Good luck,
--
Jean Delvare
SUSE L3 Support
On Wed, 06 May 2015 11:57:04 +0200, Jean Delvare <jdelvare@suse.de> wrote:
On Wednesday 06 May 2015 at 11:40 +0200, Thibaut Verron wrote:
While the CPU is not under too much stress, I can get a good 30-45 minutes of worktime before the computer becomes worryingly hot. I don't think that flashing a BIOS requires a lot of computations, and I don't recall it taking too long either, so it'd be a low risk. The consequences are more daunting than the chances of it happening.
Indeed it seems safe to attempt then.
Sorry, it has been some time since my last message here.
I have tested flashing the BIOS; it did not seem to change anything about the issue. I also tried flashing the BIOS and then booting immediately from a live USB (Ubuntu, since I could not get openSUSE to boot); the fan didn't spin fast enough to prevent the computer from growing abnormally hot.
I have also noticed in the past weeks a much decreased battery life, from more than 5 hours with a full charge down to less than 2 hours. I am mentioning it only because it could be another symptom of the same ACPI problem, but there are a lot of other possible explanations:
- usually I don't reboot the computer for weeks, rebooting it every 30 minutes (or more often when testing kernel settings) puts more stress on the battery;
- exposure to high temperatures could have aged the battery prematurely (the battery is only 15 months old, the original one died already).
For what it's worth, I've been monitoring this battery since day 1, and neither the max charge nor the current voltage seems to have dropped in May. But I'm not sure how much I should trust the reports from this battery.
Anyway, the problem has been here for more than a month now, and I am not able to reliably work around it (actually, neither reliably nor at all, at the moment). I have ordered a replacement computer for now. I will probably not attempt to open the laptop post-mortem, because I wouldn't know what to look for to investigate a hardware failure...
If someone has more ideas of things to try about this issue, I'm still interested. After all, I had plans for the afterlife of this computer, but they relied on uptimes longer than 45 minutes...
Thank you very much for your help!
Thibaut
participants (2)
- Jean Delvare
- Thibaut Verron