[Bug 409064] New: HP tx2500z laptop shuts down immediately after boot b/c of thermal trip point
https://bugzilla.novell.com/show_bug.cgi?id=409064

Summary: HP tx2500z laptop shuts down immediately after boot b/c of thermal trip point
Product: openSUSE 11.0
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: joachim.deguara@amd.com
QAContact: qa@suse.de
CC: trenn@novell.com
Found By: ---

This HP laptop, the tx2500z, shuts down immediately while booting and all I could see from the console was that it was hitting thermal trip points ... at 65 degrees?! One time I was able to look in /proc/acpi/thermal_zone and the critical trip point was a negative number.

Now I have disabled the thermal driver (hope I am not killing my machine) by adding 'off=1' as an option to that driver in modprobe.d. The other way to get around it is booting with acpi=off, but that is obviously not ideal.

What is wrong here, is the ACPI from BIOS bad? I am attaching it.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c1

Thomas Renninger <trenn@novell.com> changed:

What   |Removed |Added
----------------------------------------------------------------------------
Status |NEW     |ASSIGNED

--- Comment #1 from Thomas Renninger <trenn@novell.com> 2008-07-14 15:40:15 MDT ---

Best would be to first look for a BIOS update. I know HPs with a broken critical trip point of 256C.

I am currently trying to convince people to add a FIRMWARE_BUG (severity, description, component) interface to the kernel. IMO Intel's linuxfirmwarekit, where you have to write scripts to check for BIOS bugs, is not worth much:
- Have of the bugs are already absorbed/ignored/worked around in the kernel.
- No kernel hacker has the time to write these scripts.
- It's often a one-liner to mark bugs as firmware bugs in the kernel, and this also documents the kernel a bit better.

I don't have much time, but IMO we really need better BIOS checking...
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c2

--- Comment #2 from Thomas Renninger <trenn@novell.com> 2008-07-14 15:40:57 MDT ---

Have of the bugs -> half of the bugs
User joachim.deguara@amd.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c3

--- Comment #3 from Joachim Deguara <joachim.deguara@amd.com> 2008-07-14 15:44:30 MDT ---

Created an attachment (id=227697) --> (https://bugzilla.novell.com/attachment.cgi?id=227697)
acpi.dump

Here is the dumped ACPI. I will check for a BIOS update, but can this also be fixed by fixing up the DSDT?

Also tell me if I am risking my system by running with ACPI but not letting the thermal module run.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c4

--- Comment #4 from Thomas Renninger <trenn@novell.com> 2008-07-15 02:40:41 MDT ---

> Also tell me if I am risking my system by running with ACPI but not letting the thermal module run.

On an HP, I'd say yes. There should always be HW switching off if it gets too hot, but HP laptops do their thermal management through ACPI only. So it could be that you constantly run at 90C, which might not destroy your HW, but shorten the lifetime, especially of your disk data...

> I will check for a BIOS update

I will wait a bit. If there really is a new BIOS, please tell me if you still see this and then upload an acpidump of the new BIOS. It does not make sense to work on old code.

Hm, this is probably a different problem from the I/O APIC problem I forwarded you some info about. But it could be related in that on both systems the ACPI code is taking wrong paths (on the HP nx6325 showing wrong trip points, all 16C; here returning a wrong temperature).

If you can see the negative temperature value in dmesg (then we could calculate back what we are searching for in the DSDT) or paste some output of related lines, this could also help.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c5

--- Comment #5 from Thomas Renninger <trenn@novell.com> 2008-07-15 05:56:21 MDT ---

Hmm, maybe HP also simply forgot the polling frequency. Then this could help:

    for x in /proc/acpi/thermal_zone/*/polling_frequency; do
        echo 5 > "$x"
    done

Then it would be a duplicate of this one: bug #387702

But as you see a negative temperature, I expect the kernel is processing ACPI code paths that should never have been processed (at least not in this way...).
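For anyone who prefers a script over the shell loop in comment #5, the same write can be sketched in Python. This is only an illustration: the function name `set_polling_frequency` and its `root` parameter are made up here, and the only thing taken from the thread is the `/proc/acpi/thermal_zone/*/polling_frequency` path and the value 5.

```python
import glob
import os


def set_polling_frequency(value, root="/proc/acpi/thermal_zone"):
    """Write `value` to every <zone>/polling_frequency file under `root`.

    Returns the list of files that were written. `root` is a parameter
    only so the function can be tried against a scratch directory.
    """
    written = []
    pattern = os.path.join(root, "*", "polling_frequency")
    for path in sorted(glob.glob(pattern)):
        # Same effect as `echo VALUE > path` in the shell loop.
        with open(path, "w") as handle:
            handle.write(f"{value}\n")
        written.append(path)
    return written
```

On a real system this would be called as `set_polling_frequency(5)` with root privileges; if no thermal zone exists, the function simply returns an empty list.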
User joachim.deguara@amd.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c6

--- Comment #6 from Joachim Deguara <joachim.deguara@amd.com> 2008-07-15 10:30:03 MDT ---

The BIOS currently installed is the latest. I disassembled the ACPI but I don't know how to read the therm trip stuff. Can we read it all out of the acpi dump I attached or do I need to do other diagnostics? Unfortunately I can't see the therm trip points (which I assume are bad) without the laptop shutting down immediately.

I don't think this is related to polling frequency, as it polls regularly and that is why it knows/thinks it is overheating.
User joachim.deguara@amd.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c7

--- Comment #7 from Joachim Deguara <joachim.deguara@amd.com> 2008-07-15 11:17:23 MDT ---

The same problem (shutting down immediately on boot) happens with SLED10 SP2 also. However, booting with acpi=off results in the machine not finishing boot, and adding the option off=1 to the module thermal does not help either.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c8

--- Comment #8 from Thomas Renninger <trenn@novell.com> 2008-07-15 11:42:15 MDT ---

This is the thermal zone stuff, I already cut out most parts. My comments start with #

Jean: There is one interesting line for you marked with "Jean" :)

    Scope (_TZ)
    {
        Name (TPC, 0x69)
        Name (TPTM, 0x4B)
        Name (TPAS, 0x5F)
        ThermalZone (THRM)
        {
            # Temperature method
            Method (_TMP, 0, NotSerialized)
            {
                If (ECON)
                {
                    # Read out from Embedded controller
                    Store (\_SB.PCI0.LPC0.EC0.RTMP, Local0)
                    Return (Add (0x0AAC, Multiply (Local0, 0x0A)))
                }
                Else
                {
                    Return (Add (0x0AAC, Multiply (TPTM, 0x0A)))
                }
            }

            # Critical temperature
            Method (_CRT, 0, Serialized)
            {
                # If Windows is earlier than Windows Vista (cmp. with OSTP func)
                # return 0x69 (TPC + 273 Kelvin) * 10, or here: (TPC * 10) + 2732
                If (LLess (TPOS, 0x40))
                {
                    Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
                }
                # IF OSI RETURNS TRUE FOR WINDOWS VISTA (or VISTA SP1) OR LINUX (cmp. OSTP func)
                # then zero is returned. I EXPECT THE ERROR IS HERE (see kernel code snippet at the end)
                Else
                {
                    Return (Zero)
                }
            }

            # Jean: They seem to provide a name for the thermal zone. Can/should we pass
            # this to hwmon?
            Name (REGN, "Processor Thermal Zone")
            ..
        }
    }

In drivers/acpi/thermal.c:

    /* Critical Shutdown (required) */
    if (flag & ACPI_TRIPS_CRITICAL) {
            status = acpi_evaluate_integer(tz->device->handle,
                            "_CRT", NULL, &tz->trips.critical.temperature);
            /*
             * Treat freezing temperatures as invalid as well; some
             * BIOSes return really low values and cause reboots at startup.
             * Below zero (Celcius) values clearly aren't right for sure..
             * ... so lets discard those as invalid.
             */
            if (ACPI_FAILURE(status) ||
                            tz->trips.critical.temperature <= 2732) {
                    tz->trips.critical.flags.valid = 0;
                    ACPI_EXCEPTION((AE_INFO, status,
                                    "No or invalid critical threshold"));

----

Because zero is returned and the kernel converts the Kelvin*10 value back (or rather compares it with Kelvin*10 temperature values), you get a negative Celsius value.

Ohh, it looks like the above code is in linux-next only? This should be the commit fixing it:

    commit a39a2d7c72b358c6253a2ec28e17b023b7f6f41c
    Author: Arjan van de Ven <arjan@linux.intel.com>
    Date:   Mon May 19 15:55:15 2008 -0700

        ACPI: Reject below-freezing temperatures as invalid critical temperatures

        My laptop thinks that it's a good idea to give -73C as the critical
        CPU temperature.... which isn't the best thing since it causes a
        shutdown right at bootup.

        Temperatures below freezing are clearly invalid critical thresholds
        so just reject these as such.

        Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
        Acked-by: Zhang Rui <rui.zhang@intel.com>
        Signed-off-by: Len Brown <len.brown@intel.com>

This is something that must pop up in stable kernels...
Increasing severity, I will take care of this tomorrow.

> The same problem (shutting down immediately on boot) happens with SLED10 SP2 also

Thanks a lot for trying that! I thought this came in because of the new ACPI thermal patches, but this shows how bad it is that we return OSI=Windows. They break SLES10 SP2 certification by adding Vista hotfixes. Linux must not return true for OSI=Windows... (A fight I lost with Intel: they want to let machines break so that we stay close to the Windows ACPI implementation and thereby fix other machines..., but this gives me new material...).
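The arithmetic behind comment #8 can be checked with a small Python sketch. It only models the numbers quoted in the thread (TPC = 0x69, the 0x0AAC offset, the `<= 2732` check); the function names are invented for illustration, and the real kernel check additionally looks at the ACPI evaluation status, which is left out here.

```python
# 0x0AAC == 2732: 0 degrees Celsius expressed in tenths of a Kelvin,
# the unit ACPI uses for _TMP, _CRT and _HOT.
ZERO_CELSIUS_DECIKELVIN = 0x0AAC


def decikelvin_to_celsius(raw):
    """Convert an ACPI temperature (tenths of a Kelvin) to degrees Celsius."""
    return (raw - ZERO_CELSIUS_DECIKELVIN) / 10


def critical_trip_is_valid(raw):
    """Model of the quoted check: at or below freezing counts as invalid."""
    return raw > ZERO_CELSIUS_DECIKELVIN


TPC = 0x69                                             # from the DSDT excerpt
good_crt = ZERO_CELSIUS_DECIKELVIN + TPC * 0x0A        # first branch of _CRT
bad_crt = 0                                            # Else branch: Return (Zero)

for label, raw in (("LLess branch", good_crt), ("Else branch", bad_crt)):
    print(label, raw, decikelvin_to_celsius(raw), critical_trip_is_valid(raw))
```

A returned zero therefore corresponds to roughly -273 C, which is below any real reading, so without the validity check the critical trip point is exceeded immediately.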
User joachim.deguara@amd.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c9

--- Comment #9 from Joachim Deguara <joachim.deguara@amd.com> 2008-07-15 12:04:22 MDT ---

Great job Thomas, looks like you found it. I booted with the thermal module's parameter nocrt=1 and sure enough found that the critical threshold is set to -273 whereas hot is set to 105 C. Currently the temperature is 62 C and there is a fan running, so I am not worried, but I look forward to your patch.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c10

--- Comment #10 from Thomas Renninger <trenn@novell.com> 2008-07-15 15:24:38 MDT ---

> and sure enough found that critical threshold is set to -273 whereas hot is set to 105 C

This is strange. The hot trip point is returned in the same way..., ohhh, you are right: the _HOT method uses LEqual, so it returns a real value only when TPOS equals 0x40 (Vista), which is apparently the value Linux ends up with here, while _CRT uses LLess and returns zero in that case.

    Method (_HOT, 0, Serialized)
    {
    !!! If (LEqual (TPOS, 0x40))
        {
            Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
        }
        Else
        {
            Return (Zero)
        }
    }

    Method (_CRT, 0, Serialized)
    {
    !!! If (LLess (TPOS, 0x40))
        {
            Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
        }
        Else
        {
            Return (Zero)
        }
    }
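To make the LEqual/LLess difference concrete, here is a small Python model of the two methods quoted in comment #10. It is a sketch only: `hot` and `crt` are illustrative stand-ins for _HOT and _CRT, and the TPOS values 0x20 and 0x80 are hypothetical inputs chosen to exercise the branches (only 0x40 appears in the thread).

```python
TPC = 0x69        # from the DSDT excerpt (105 decimal)
OFFSET = 0x0AAC   # 2732, i.e. 0 C in tenths of a Kelvin
VISTA = 0x40      # TPOS value the thread associates with Vista


def hot(tpos):
    """Model of _HOT: a real value only when TPOS == 0x40 (LEqual)."""
    return OFFSET + TPC * 0x0A if tpos == VISTA else 0


def crt(tpos):
    """Model of _CRT: a real value only when TPOS < 0x40 (LLess)."""
    return OFFSET + TPC * 0x0A if tpos < VISTA else 0


for tpos in (0x20, 0x40, 0x80):
    print(hex(tpos), "hot:", hot(tpos), "crt:", crt(tpos))
```

With TPOS at 0x40 the model gives a hot value of 3782 (105 C) and a critical value of 0, which matches what comment #9 reported: hot at 105 C and critical at -273.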
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c11

Thomas Renninger <trenn@novell.com> changed:

What          |Removed  |Added
----------------------------------------------------------------------------
Status        |ASSIGNED |NEEDINFO
Info Provider |         |joachim.deguara@amd.com

--- Comment #11 from Thomas Renninger <trenn@novell.com> 2008-07-16 09:47:18 MDT ---

I created a kiso (search for it on software.opensuse.org if interested, you use it:

    build-suse-kiso -b .../openSUSE-11.0-LATEST-DVD/i386/DVD1 -k kernel-default-2.6.25.11-0.2.i586.rpm -o /tmp/11.0_i386_fixed_kernel_kiso.iso

) It may also be nice for the blocker which did not get fixed yet for GM.

You find an x86_64 and an i386 iso here (hope it works, I expect you are the first to try out a kiso I created myself..., don't waste time, tell me if it does not work):
ftp.suse.com/pub/people/trenn/thermal_and_ide_irq_fixed_11_0

The kernel should have a funny name, -ide-pnp-fix (or similar, because of another fix).

The idea is to boot the fixed kernel, then exchange the installation media with your original one. You find an i386 and an x86_64 kiso to boot and test here:
ftp.suse.com/pub/people/trenn/pnp_ide_irq_fix/

From kiso README:

When using together with regular installation media,

1. Boot with kISO and select operation to perform from the boot menu. Boot loader loads kernel and initrd images into memory.

2. Wait until boot loader finishes loading.

3. All the needed contents from the kISO are contained in kernel and initrd images, so kISO can be removed from the drive once boot loader finishes loading. Remove kISO media from the drive and put in release installation media.

4-1. If the release installation media was put in before hardware probe is complete, installation system starts and proceeds the same way.

4-2. If the media was put in too late, linuxrc will complain that it can't find product repository and activate manual setup. Nothing to worry about. You just need to press enter a few more times. Proceed to #5.

5. Make sure the release installation media is in the drive and select language and keyboard map. It will give you Main Menu. Select "Start Installation or System" -> "Start Installation or Update" -> "CD-ROM". All the selections are the default, so just pressing enter several times is enough. linuxrc will report that no update was found which can be safely ignored. Installation continues as usual. Because manual mode was activated, installation system will ask a few more questions.

Other installation sources can be selected from step 5 as in any other installation media.

If kISO is used, the installation system installs kernel RPM from kISO such that the updated kernel is used from the first boot after installation. The kernel RPM is installed using 'rpm --nodeps' after all other packages are installed. Beware that it might violate some dependencies if other packages which depend on the kernel version are installed.
User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c12

Thomas Renninger <trenn@novell.com> changed:

What          |Removed                 |Added
----------------------------------------------------------------------------
Status        |NEEDINFO                |RESOLVED
Info Provider |joachim.deguara@amd.com |
Resolution    |                        |FIXED

--- Comment #12 from Thomas Renninger <trenn@novell.com> 2008-07-29 02:22:42 MDT ---

I added a patch to 11.0 a while ago. Now this should also work for SLE10-SP2:

Tue Jul 29 10:20:11 CEST 2008 - trenn@suse.de

- patches.arch/acpi_thermal_check_critical_temp.patch: ACPI: Reject below-freezing temperatures as invalid critical temperatures (bnc#409064).

Thanks for reporting.
User meissner@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=409064#c13

Marcus Meissner <meissner@novell.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC   |        |meissner@novell.com

--- Comment #13 from Marcus Meissner <meissner@novell.com> 2008-10-01 06:53:39 MDT ---

This bug was mentioned/fixed in the just released SLES 10 SP2 kernel update, version 2.6.16.60-0.29 (all but x86_64) and 2.6.16.60-0.30 (x86_64).