2015-05-06 10:49 GMT+02:00 Jean Delvare
On Tue, 5 May 2015 23:02:33 +0200, Thibaut Verron wrote:
2015-05-05 22:08 GMT+02:00 Jean Delvare
* If the previous advice doesn't help, it might be worth re-flashing the BIOS even if no new version is available. If the BIOS code was somehow corrupted, that would restore it.
I could try that, but I'm wary of taking this step: flashing a BIOS on a computer that may shut down because of critical temperature sounds a bit dangerous. The first time I got issues of this kind, I did update the BIOS (with the laptop sitting directly on an indoor air conditioner, which I could do because it was July), and it did not help. But since nothing seems to be reproducible here, I might as well try again now...
As it happens, I am having similar trouble with an old laptop of mine at the moment, so I know how you feel. The laptop shuts down when my daughter plays Minecraft on it. The BIOS version was old and known to be broken, so I decided to attempt to update it. I don't know yet whether it helped.
That being said my case was probably less critical than yours, as the problem would only happen under heavy load.
While the CPU is not under too much stress, I get a good 30-45 minutes of work time before the computer becomes worryingly hot. I don't think flashing a BIOS requires a lot of computation, and I don't recall it taking very long either, so the risk would be low. The consequences of a failure are daunting, but the chances of one are small.
Also, according to the logs, the shutdown was triggered by a thermal zone reaching its critical limit, which means it was a graceful emergency shutdown initiated by the ACPI thermal driver, not a hardware-triggered CPU power-off. Nothing like that runs when no OS is running, so I think I was safe flashing the BIOS. Do you have anything logged when the system shuts itself down?
I just tested it again: there seems to be a message (too fast to read), then a tty prompt shows for a few seconds before the machine goes off. It is the same behavior as when I turn the computer off normally. So yes, a "graceful emergency shutdown". But that doesn't mean there is no hardware emergency shutdown as well... For example, I also ran memtest (with the computer still hot), and after two minutes it went off without warning. Or does memtest have some ACPI thermal handling too, and simply power off because it has nothing to clean up first?
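For anyone who wants to see how close their machine is to that limit, here is a minimal Python sketch. It assumes the standard Linux thermal sysfs layout: /sys/class/thermal/thermal_zoneN with a temp file and trip_point_N_type / trip_point_N_temp pairs, all temperatures in millidegrees Celsius. It lists each zone's current temperature next to its "critical" trip point, which is the threshold at which the ACPI thermal driver starts the orderly emergency shutdown described above. The helper names critical_margins and read_millicelsius are hypothetical, and which zones and trip points exist varies from machine to machine.

```python
from pathlib import Path


def read_millicelsius(path: Path) -> float:
    """sysfs thermal files report millidegrees Celsius, e.g. "45000" -> 45.0."""
    return int(path.read_text().strip()) / 1000.0


def critical_margins(base: str = "/sys/class/thermal") -> dict:
    """Map each zone with a "critical" trip point to (current, critical) in degrees C."""
    result = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            current = read_millicelsius(zone / "temp")
            for type_file in sorted(zone.glob("trip_point_*_type")):
                if type_file.read_text().strip() != "critical":
                    continue
                # trip_point_N_type is paired with trip_point_N_temp
                temp_file = zone / type_file.name.replace("_type", "_temp")
                result[zone.name] = (current, read_millicelsius(temp_file))
        except (OSError, ValueError):
            # Some zones refuse reads (sensor not ready, etc.); skip them.
            continue
    return result


if __name__ == "__main__":
    for name, (current, critical) in critical_margins().items():
        print(f"{name}: {current:.1f} C now, critical at {critical:.1f} C "
              f"({critical - current:.1f} C of headroom)")
```

Running it periodically while the laptop heats up would show whether the shutdowns line up with a zone actually reaching its critical trip point, or with something firing earlier.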
If you give it a try, there are a few tricks: flash the BIOS after the machine has been off for a long time, do it in the morning when the ambient temperature is the lowest, in the coldest room, with the windows open (assuming it is colder outside).
From the tests I did back then, even a 5-10 °C difference in room temperature didn't make much of a difference for the CPU. The air conditioner works nicely because it basically forces the bottom of the laptop to stay at 20 °C... The fan was not spinning at all and the CPU (or at least its thermal sensor) was sitting at 35-40 °C. But with a cold computer I should be fine.
(...) But is it possible for manual control to be broken without affecting automatic control? I guess I'm asking whether the kernel has lower-level access to the fan controls than what's exposed in the /sys filesystem.
Depends on what is broken exactly.
If the fan control output circuitry is dead, this will affect both the manual and the automatic modes. The BIOS or some user-space daemon will try to set the fan speed to the desired value but the fan will not react to the command. That can't be fixed.
This is not the case, since I can still get the fan to spin (or stop), sometimes. At least it's not entirely broken, I mean. It could be a poor contact or any other form of non-reliable hardware failure, though.
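To check which mode the kernel currently reports for the fan, a rough Python sketch, assuming the generic hwmon sysfs interface: /sys/class/hwmon/hwmonN/pwmN_enable (0 = no fan speed control, i.e. full speed; 1 = manual, using pwmN; 2 and above = automatic, chip-specific) and pwmN (duty cycle, 0-255). The helper names fan_status and describe_mode are hypothetical. The sketch only reads; switching modes would mean writing to pwmN_enable as root. It may also list nothing at all if this laptop's fan is driven by the embedded controller without a hwmon pwm interface, or if the driver places its files under a device/ subdirectory, which this sketch does not handle.

```python
from pathlib import Path


def describe_mode(value: int) -> str:
    """Interpret pwmN_enable per the kernel hwmon sysfs interface."""
    if value == 0:
        return "full speed (no control)"
    if value == 1:
        return "manual"
    return "automatic"  # 2+ : automatic, details depend on the chip


def fan_status(base: str = "/sys/class/hwmon") -> list:
    """Return (chip, channel, mode, duty) for every pwmN_enable under base."""
    rows = []
    for hw in sorted(Path(base).glob("hwmon*")):
        name_file = hw / "name"
        chip = name_file.read_text().strip() if name_file.exists() else hw.name
        for enable in sorted(hw.glob("pwm[0-9]*_enable")):
            channel = enable.name[: -len("_enable")]  # e.g. "pwm1"
            try:
                mode = describe_mode(int(enable.read_text().strip()))
                duty_file = hw / channel  # current duty cycle, 0-255
                duty = int(duty_file.read_text().strip()) if duty_file.exists() else None
            except (OSError, ValueError):
                continue
            rows.append((chip, channel, mode, duty))
    return rows


if __name__ == "__main__":
    for chip, channel, mode, duty in fan_status():
        print(f"{chip} {channel}: {mode}, duty={duty}")
```

If the reported mode is automatic and the duty value climbs as the machine heats up while the fan stays still, that would point at the hardware side (output circuitry, connector, or the fan itself) rather than at the control logic.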
(...) As a closing note, I don't know how old your model is, but it should be noted that low-priced consumer hardware showing issues after 4-5 years is nothing out of the ordinary. This may not be what's happening here... but it may as well be "just" that.
That's right. But given that there were similar issues at age 1 (just past warranty expiration... a critical point in a computer's lifetime, I know), I would find it strange if this were a new issue.
I agree, it smells like the same issue you had back then resurfaced. Which isn't entirely surprising, BTW, as apparently you don't know how you solved the problem back then. Problems which solve themselves magically have a tendency to reappear in the same way.
And the next time, the same magical handwaving and flag tweaking doesn't work as well. I'll look for an exorcist if all else fails. Thanks again! Thibaut