[Bug 807312] New: kernel 3.8 crashes (reboots notebook after a few minutes)
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c0
Summary: kernel 3.8 crashes (reboots notebook after a few minutes)
Classification: openSUSE
Product: openSUSE Factory
Version: 12.3 Beta 1
Platform: x86-64
OS/Version: openSUSE 12.2
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: rainer.klier@gmx.at
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0
i use a HP Compaq 8710w notebook with a Mobile PM965/GM965 chipset.
after installing kernel 3.8 (or 3.8.1) and rebooting everything seems normal.
but dmesg shows thousands of the following messages:
[ 282.308114] mei 0000:00:03.0: unexpected reset: dev_state = RESETING
i found out that this has to be this device (from lspci -vvv):
00:03.0 Communication controller: Intel Corporation Mobile PM965/GM965 MEI Controller (rev 0c)
Subsystem: Hewlett-Packard Company Device 30c3
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c
Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c1
Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c2
--- Comment #2 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c3
--- Comment #3 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c4
--- Comment #4 from Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c5
--- Comment #5 from Rainer Klier
pwmconfig and fancontrol are meant for systems with hardware monitoring chips driven natively by Linux. On laptops everything is under the control of ACPI, so it is completely expected that they did not work.
ok, understood. under kernel 3.7 the fan ran most of the time, so it seems obvious that the notebook never overheated under kernel 3.7. but starting with kernel 3.8 the fan does not always run. so, do i have any possibility to force the fan to run? i found /sys/devices/virtual/thermal and /sys/devices/virtual/thermal/thermal_zone0(12345) but all values there are read-only. is there any way to control/force the fan to run?
--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c6
Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c7
Takashi Iwai
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c8
--- Comment #8 from Rainer Klier
Could you check the kernel message before reboot via netconsole or such? At least there should be some dying message if it's a "sane" reboot.
i don't think it's a "sane" reboot. i agree with jean's assumption, that it is a kind of "rescue-reboot/-shutoff" to save the hardware from overheating. it happens immediately with no warning. i already checked /var/log/messages for anything, but found nothing. right now, i use the KDE4 desktop-widget for watching the temperatures and/or xsensors. and i see one of the 6 thermal zones which goes higher than the others. it is called "temp1" in xsensors and it is at about 73°C - 81°C.
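For anyone following the netconsole suggestion, here is a minimal sketch of how the module parameter is put together. The IP addresses, the interface name and the MAC address are placeholders, not values from this bug, and `netconsole_param` is a hypothetical helper; 6665 and 6666 are the default source and target ports described in the kernel's netconsole documentation.

```shell
#!/bin/sh
# Build the netconsole parameter string in the documented form:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# netconsole_param is a hypothetical helper; all arguments are supplied by the caller.
netconsole_param() {
    src_ip="$1"
    dev="$2"
    tgt_ip="$3"
    tgt_mac="$4"
    printf '6665@%s/%s,6666@%s/%s\n' "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}

# On the notebook, as root (addresses below are placeholders):
#   dmesg -n 8   # raise the console loglevel so all messages are sent
#   modprobe netconsole netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the receiving machine, listen for the UDP stream
# (netcat option syntax differs between variants):
#   nc -u -l 6666
```

Because the messages leave the machine over the network as they are printed, anything the kernel says right before the reboot has a chance to arrive even if it never reaches /var/log/messages.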
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c9
--- Comment #9 from Takashi Iwai
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c10
--- Comment #10 from Rainer Klier
A rescue reboot by the kernel doing by itself is a sort of "sane" reboot. The
ok.
reboot is done in the expected way. But this should leave some record in the log, usually. The fact that it's missing implies that something weird happens instead.
i fear so. the reboot happens suddenly. suddenly the screen goes black and a second later i see the bios-/boot-screen like the notebook was switched on. and there is nothing in /var/log/messages....
In such a case, try to login in a single user mode with "nomodeset" boot option
nomodeset is default "on" on my system. i don't use any KMS driver.
so that no KMS is kicked in, and without GUI. Do you still see the problem?
no. i think, this is because when the system runs in text-only-mode, the notebook doesn't become that hot, so it doesn't reboot. it even "survived" a whole night staying idle in KDE4. i just logged in to KDE4 and then did nothing. this seems to produce so little heat that it doesn't reboot. but when i work with the notebook, it suddenly reboots after a while. but it seems to be better, after i ran sensors-detect and created /etc/sysconfig/lm_sensors and /usr/lib/systemd/system/lm_sensors.service.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c11
--- Comment #11 from Rainer Klier
I don't know of any way, sorry.
i found out that i can control the fan when i do: echo "1" > /sys/devices/virtual/thermal/cooling_device0[123456789101112]/cur_state but not all enabled cooling_devices make the fan blow... and the strange thing is that when i do this to /sys/devices/virtual/thermal/cooling_device2,3,4,5,8,9,10,11 the temperature of /sys/devices/virtual/thermal/thermal_zone3 rises immediately. shouldn't the enabling of a cooling_device lower the temperature?
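Before writing to cur_state it can help to see what each cooling_device actually is. The sketch below only reads the standard thermal sysfs attributes type, cur_state and max_state; the function name `list_cooling_devices` is made up for illustration, and the directory defaults to the path quoted in this comment.

```shell
#!/bin/sh
# Print one line per cooling device: name, type, current and maximum state.
# list_cooling_devices is a hypothetical helper; it only reads sysfs, it never writes.
list_cooling_devices() {
    dir="${1:-/sys/devices/virtual/thermal}"
    for dev in "$dir"/cooling_device*; do
        [ -d "$dev" ] || continue
        type=$(cat "$dev/type" 2>/dev/null)
        cur=$(cat "$dev/cur_state" 2>/dev/null)
        max=$(cat "$dev/max_state" 2>/dev/null)
        printf '%s type=%s cur_state=%s max_state=%s\n' \
            "${dev##*/}" "$type" "$cur" "$max"
    done
}

# Usage on a live system:
#   list_cooling_devices
```

The type column shows which entries are reported as fans and which are something else, which may help when comparing the output between kernel versions.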
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c12
--- Comment #12 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c13
--- Comment #13 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c14
--- Comment #14 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c15
--- Comment #15 from Jean Delvare
In such a case, try to login in a single user mode with "nomodeset" boot option so that no KMS is kicked in, and without GUI. Do you still see the problem?
no. i think, this is because when the system runs in text-only-mode, the notebook doesn't become that hot, so it doesn't reboot.
You could run in text mode and run for example "md5sum /dev/zero" to create artificial load. Then maybe you'd be able to see something before it reboots.
but it seems to be better, after i ran sensors-detect and created /etc/sysconfig/lm_sensors and /usr/lib/systemd/system/lm_sensors.service.
For the record, I can't think of any reason why this would make a difference.
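A rough sketch of how the artificial-load test could be combined with temperature logging, so that the last readings survive a sudden reboot. The helper name `log_temps_once` and the log path are assumptions for illustration; the values are written exactly as sysfs reports them (millidegrees Celsius).

```shell
#!/bin/sh
# Append one timestamped line per thermal zone to a log file,
# then flush to disk so the data survives an abrupt reboot.
# log_temps_once is a hypothetical helper: $1 = thermal sysfs dir, $2 = log file.
log_temps_once() {
    dir="$1"
    log="$2"
    for zone in "$dir"/thermal_zone*; do
        # Skip zones whose temperature cannot be read
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        printf '%s %s %s\n' "$(date '+%H:%M:%S')" "${zone##*/}" "$milli" >> "$log"
    done
    sync
}

# Intended use from a text console (log path is a placeholder):
#   md5sum /dev/zero &
#   while sleep 2; do log_temps_once /sys/devices/virtual/thermal /root/temps.log; done
```

After the machine comes back up, the tail of the log shows which zone was hottest in the last seconds before the reboot.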
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c16
--- Comment #16 from Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c17
--- Comment #17 from Rainer Klier
(In reply to comment #10)
i think, this is because when the system runs in text-only-mode, the notebook doesn't become that hot, so it doesn't reboot.
You could run in text mode and run for example "md5sum /dev/zero" to create artificial load. Then maybe you'd be able to see something before it reboots.
ok, but if the load produces the reboot, as expected, it doesn't help. it only proves the suspicion. but i will try nevertheless.
but it seems to be better, after i ran sensors-detect and created /etc/sysconfig/lm_sensors and /usr/lib/systemd/system/lm_sensors.service.
For the record, I can't think of any reason why this would make a difference.
ok. what is the kernel module coretemp for?
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c18
--- Comment #18 from Rainer Klier
What I don't get is how you could possibly have 11 fans in a notebook. This looks plain wrong.
yes, maybe. but as you can see in https://bugzilla.redhat.com/show_bug.cgi?id=895276 others also have a similar situation in similar HP notebooks.
Could you please compare with what the openSUSE 12.2 kernel said?
i will try that also.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c19
--- Comment #19 from Jean Delvare
what is the kernel module coretemp for?
Temperature reporting to user-space only. No thermal management and no fan control.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c20
--- Comment #20 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c21
--- Comment #21 from Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c22
--- Comment #22 from Rainer Klier
Question #1: do you hear the fan spin when you force-enable cooling_device2?
when i do this, the first thing i notice is the rise of the temp of thermal_zone3 from 30°C up to 100°C. (and i didn't let it go higher...) "/sys/devices/virtual/thermal/thermal_zone3" corresponds to "temp4" from the output of sensors. the contents of the files there are:
/sys/devices/virtual/thermal/thermal_zone3/device/path: \_TZ_.TZ5_
/sys/devices/virtual/thermal/thermal_zone3/device/modalias: acpi:LNXTHERM:
/sys/devices/virtual/thermal/thermal_zone3/device/hid: LNXTHERM
/sys/devices/virtual/thermal/thermal_zone3/trip_point_0_temp: 110000
/sys/devices/virtual/thermal/thermal_zone3/trip_point_0_type: critical
immediately after the temp starts rising, i hear the fan spin up. when i write "0" to /sys/devices/virtual/thermal/cooling_device2/cur_state the temp drops and shortly after this, the fan stops. so it seems that changing the value of /sys/devices/virtual/thermal/cooling_device2/cur_state doesn't trigger the fan directly, but somehow raises the temp of /sys/devices/virtual/thermal/thermal_zone3, which triggers the fan to spin up. at least it looks like this.
Question #2: does it prevent the spurious reboot?
i really don't know. the spurious reboot didn't happen since i installed /etc/sysconfig/lm_sensors and /usr/lib/systemd/system/lm_sensors.service. i have the KDE4 hardware monitoring widget running, which shows the temps of all 6 thermal_zones (called temp1 - temp6 in the widget). and i watch the current values the whole time to detect the temp where it reboots, but it didn't happen since then. but i noticed that forcing the fan to spin with the above method (echo "1" > .....cooling_device2/cur_state) lowers the temp of thermal_zone5, starts the fan, BUT raises the temp of thermal_zone3. very strange. and as you can read in https://bugzilla.redhat.com/show_bug.cgi?id=895276 similar things happen with other HP notebooks from other people....
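The thermal sysfs files report temperatures in millidegrees Celsius, so the trip_point_0_temp of 110000 quoted above means a critical trip point at 110°C. A small sketch for printing every zone next to its trip points in whole degrees follows; `show_thermal_zones` is a hypothetical helper name, and the default directory is the one used throughout this bug.

```shell
#!/bin/sh
# Print each thermal zone with its current temperature and trip points,
# converted from millidegrees Celsius to whole degrees.
# show_thermal_zones is a hypothetical helper; it only reads sysfs.
show_thermal_zones() {
    dir="${1:-/sys/devices/virtual/thermal}"
    for zone in "$dir"/thermal_zone*; do
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        line="${zone##*/} temp=$((milli / 1000))C"
        for trip in "$zone"/trip_point_*_temp; do
            t=$(cat "$trip" 2>/dev/null) || continue
            # The matching *_type file names the trip point (e.g. critical)
            kind=$(cat "${trip%_temp}_type" 2>/dev/null)
            line="$line ${kind:-unknown}=$((t / 1000))C"
        done
        printf '%s\n' "$line"
    done
}

# Usage on a live system:
#   show_thermal_zones
```

Seeing the current value and the critical limit side by side makes it easier to tell how close a zone is to the point where the system would shut down.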
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c23
--- Comment #23 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c24
--- Comment #24 from Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c25
--- Comment #25 from Rainer Klier
These may not all be the same issue, as some users are reporting issues starting with kernel 3.7, which does work for you. Also most of the reports are
yes, i remember that with 3.7 i can confirm that the fan ran the whole time (like many of these other bugs report), which in my case helped, because the notebook didn't overheat. but i think the source of all these bugs is the same. it results in fan running the whole time, or, in my case (apparently with a fix for that whole-time-running-issue) running not at all, or too late to cool the system.
about issues after resume, while your system fails even without a suspend/resume cycle.
yes. but https://bugzilla.kernel.org/show_bug.cgi?id=56281 is also about overheating. in this bug the reporter writes: "If ambient temperatures are above 30 degrees and the computer is slightly used the critical shutdown temperature of 95° for temp1 is frequently reached. This results in an immediate shutdown of the system." this sounds exactly like my problem. the strange thing is, that since i changed the bios setting "fan always on at AC" (which it is strangely NOT) and installing/using /etc/sysconfig/lm_sensors and /usr/lib/systemd/system/lm_sensors.service it didn't happen again.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c26
--- Comment #26 from Jean Delvare
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c27
Stephan Kulow
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c28
--- Comment #28 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c29
--- Comment #29 from Rainer Klier
Can you double check, please.
in the meantime i am at kernel 3.11.4. and as i said in c25 it never happened again. /sys/devices/virtual/thermal/thermal_zone3 has 85° most of the time, and the fan is running most of the time. this is exactly how it was with kernel 3.7. the reboot never happened again.
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c30
Borislav Petkov
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c31
--- Comment #31 from Rainer Klier
https://bugzilla.novell.com/show_bug.cgi?id=807312
https://bugzilla.novell.com/show_bug.cgi?id=807312#c32
Jean Delvare