[opensuse] Hardware error messages...
I am getting these repeating error messages in all my kconsole windows on my laptop. I am running Leap 42.2 x64 with the KDE desktop. Can someone translate these into plain English and please don't tell me my brand new ASUS laptop is already broken! Please? I hope not...
KDE System Settings reports -
Version 5.8.2
Using: KDE Frameworks 5.26.0 Qt 5.6.1 (built against 5.6.1) The xcb windowing system
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354194] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489457890 SOCKET 0 APIC 1 microcode 9e
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354195] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008806080a
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354196] mce: [Hardware Error]: TSC 2e09ebedc3f6 mce:
Thanks in advance.. Marc..
-- "The Truth is out there" - Spooky
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On 03/13/2017 10:33 PM, Marc Chamberlin wrote:
[original message snipped]
A few suggestions:
1. If you bought the laptop with some version of Windows on it and it is still on it, does it work?
2. If you just installed the Linux system off a disk or dongle that you downloaded, and it never worked, try getting another download and installing it. Try to get the download on a different computer, so that if the download machine is having problems, you can eliminate that.
3. Sure, SUSE should work, but try some other distro, maybe a small one. PCLinuxOS also runs KDE; although not small, it uses the .rpm system and has excellent forum help.
4. Probably your new laptop is 64-bit, but are you sure? If you can load any Linux, type uname -a in a console window and make sure. You can also look at this URL and get some ideas: http://www.computerhope.com/issues/ch001121.htm
5. There is still a 32-bit version of Linux that you can get by downloading Linux Mint 17.1 "Rebecca" Cinnamon (32-bit) and trying that system. (Mint is a nice system, but it uses .deb instead of .rpm installs.)
6. Hope one of these suggestions helps. Please post back to the thread and tell us what you have found, and what helped.
Good luck! --doug
On 03/13/2017 10:49 PM, Doug wrote:
[...] 000000008806080a
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354196] mce: [Hardware Error]: TSC 2e09ebedc3f6 mce:
mce = Machine Check Exception.
The chances of this being a software problem are minuscule. MCE is a hardware exception (as shown in the error above). I don't even know if a BIOS misconfiguration could generate an MCE. It is generally a RAM, failing-capacitor or other hardware issue.
-- David C. Rankin, J.D.,P.E.
On 03/13/2017 07:33 PM, Marc Chamberlin wrote:
I am getting these repeating error messages in all my kconsole windows on my laptop. I am running Leap 42.2 x64 with the KDE desktop. Can someone translate these into plain English and please don't tell me my brand new ASUS laptop is already broken! Please? I hope not...
KDE System Settings reports -
Version 5.8.2
Using: KDE Frameworks 5.26.0 Qt 5.6.1 (built against 5.6.1) The xcb windowing system
This is odd, I will reformat this and resend.. Dunno why this happened, looked ok when I first sent it... Marc...
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354194] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489457890 SOCKET 0 APIC 1 microcode 9e
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354195] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008806080a
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354196] mce: [Hardware Error]: TSC 2e09ebedc3f6 mce:
Thanks in advance.. Marc..
Marc Chamberlin wrote:
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354194] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489457890 SOCKET 0 APIC 1 microcode 9e
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354195] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008806080a
Message from syslogd@marcslaptop at Mar 13 19:18:10 ... kernel:[18656.354196] mce: [Hardware Error]: TSC 2e09ebedc3f6 mce:
MCE = Machine Check Exception - yes, basically a hardware error. https://en.wikipedia.org/wiki/Machine-check_exception
AFAIR, the codes are listed in the Intel developer manuals.
-- Per Jessen, Zürich (3.9°C) http://www.cloudsuisse.com/ - your owncloud, hosted in Switzerland.
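A minimal sketch of the decoding Per points to, assuming the Linux convention that "Bank 128" is not a physical bank but the kernel's pseudo-bank for thermal-throttling events, whose STATUS field carries the raw IA32_THERM_STATUS register. The bit positions follow the Intel SDM thermal-monitoring description and should be double-checked against the manual for this CPU; the function name and dictionary keys are made up for illustration.

```python
# Sketch: decode the STATUS of an MCE logged in pseudo-bank 128 (thermal event),
# assuming it holds IA32_THERM_STATUS with the bit layout from the Intel SDM.


def decode_therm_status(status: int) -> dict:
    """Return the main IA32_THERM_STATUS fields (assumed SDM layout)."""
    return {
        "above_trip": bool(status & (1 << 0)),           # bit 0: thermal status (hot right now)
        "trip_logged": bool(status & (1 << 1)),          # bit 1: thermal status log (sticky)
        "prochot_active": bool(status & (1 << 2)),       # bit 2: PROCHOT# event
        "prochot_logged": bool(status & (1 << 3)),       # bit 3: PROCHOT# log (sticky)
        "power_limit_logged": bool(status & (1 << 11)),  # bit 11: power limitation log
        "degrees_below_tjmax": (status >> 16) & 0x7F,    # bits 22:16: digital readout
        "reading_valid": bool(status & (1 << 31)),       # bit 31: readout is valid
    }


if __name__ == "__main__":
    # STATUS values copied from the logs posted in this thread
    for value in (0x8805000B, 0x8806000A, 0x8806080A):
        print(hex(value), decode_therm_status(value))
```

For the values in this thread, bit 0 agrees with what mcelog prints further down: 8805000b decodes as above the trip temperature and 8806000a as below it.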
On 2017-03-14 07:32, Per Jessen wrote:
Marc Chamberlin wrote:
[MCE log snipped]
MCE = Machine Check Exception - yes, basically hardware error.
But sometimes I have seen MCEs appear with a new openSUSE release, then disappear with the next.
-- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
On 03/14/2017 05:10 AM, Carlos E. R. wrote:
On 2017-03-14 07:32, Per Jessen wrote:
Marc Chamberlin wrote:
[MCE log snipped]
MCE = Machine Check Exception - yes, basically hardware error.
But sometimes I have seen MCEs appear with a new openSUSE release, then disappear on the next.
Thanks Carlos, Per, Doug for taking the time to reply and for your thoughts! Much appreciated.. It appears this is a bug and has been reported. I followed the suggestion in the bug report to turn off the emergency kernel messages for now, to keep this from annoying me... https://bugzilla.opensuse.org/show_bug.cgi?id=1028027
On Monday, 13 March 2017 19:33:51 EET Marc Chamberlin wrote:
[MCE log snipped]
What is the full MCE message in the system log? Is thermald running?
-- Regards, Peter
On 03/14/2017 12:31 PM, auxsvr wrote:
On Monday, 13 March 2017 19:33:51 EET Marc Chamberlin wrote:
[MCE log snipped]
What is the full MCE message in the system log? Is thermald running?
Peter - I am not qualified to answer your question specifically; it is beyond my pay grade at the moment! Until you asked, I had never heard of thermald, and no, there is no process by that name running. I took a look with YaST to see if there is such a package, and sure enough there is! Looks interesting, especially since to my untrained eyes it appears from the log messages that this might be a thermal-related event of some kind. Should I download/install this daemon? I will give it a shot... I took a look at the messages log file, and this is a snapshot of what gets repeated a lot; taking a guess, this may contain the answer to your questions. Sorry it is rather a lot of info to grok, and I hope it does not violate forum protocol to post so much noise. :-D HTHs Marc...
2017-03-14T11:11:33.161786-07:00 marcslaptop kernel: [ 3555.558599] CPU4: Core temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161796-07:00 marcslaptop kernel: [ 3555.558600] CPU0: Core temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161797-07:00 marcslaptop kernel: [ 3555.558601] CPU6: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161797-07:00 marcslaptop kernel: [ 3555.558602] CPU3: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161798-07:00 marcslaptop kernel: [ 3555.558603] CPU5: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161798-07:00 marcslaptop kernel: [ 3555.558604] CPU1: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161799-07:00 marcslaptop kernel: [ 3555.558605] CPU2: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161799-07:00 marcslaptop kernel: [ 3555.558606] CPU7: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161799-07:00 marcslaptop kernel: [ 3555.558606] CPU4: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161800-07:00 marcslaptop kernel: [ 3555.558607] CPU0: Package temperature above threshold, cpu clock throttled (total events = 328)
2017-03-14T11:11:33.161800-07:00 marcslaptop kernel: [ 3555.558611] mce: [Hardware Error]: Machine check events logged
2017-03-14T11:11:33.161801-07:00 marcslaptop kernel: [ 3555.558618] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 128: 000000008805000b
2017-03-14T11:11:33.161801-07:00 marcslaptop kernel: [ 3555.558619] mce: [Hardware Error]: TSC 8cb22edccba mce:
2017-03-14T11:11:33.161802-07:00 marcslaptop kernel: [ 3555.558620] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489515093 SOCKET 0 APIC 1 microcode 9e
2017-03-14T11:11:33.161802-07:00 marcslaptop kernel: [ 3555.558621] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008805000b
2017-03-14T11:11:33.161802-07:00 marcslaptop kernel: [ 3555.558623] mce: [Hardware Error]: TSC 8cb22edccb6 mce:
2017-03-14T11:11:33.161803-07:00 marcslaptop kernel: [ 3555.558624] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489515093 SOCKET 0 APIC 0 microcode 9e
2017-03-14T11:11:33.169789-07:00 marcslaptop kernel: [ 3555.563587] CPU6: Package temperature/speed normal
2017-03-14T11:11:33.169799-07:00 marcslaptop kernel: [ 3555.563588] CPU4: Core temperature/speed normal
2017-03-14T11:11:33.169799-07:00 marcslaptop kernel: [ 3555.563589] CPU1: Package temperature/speed normal
2017-03-14T11:11:33.169800-07:00 marcslaptop kernel: [ 3555.563589] CPU5: Package temperature/speed normal
2017-03-14T11:11:33.169800-07:00 marcslaptop kernel: [ 3555.563590] CPU3: Package temperature/speed normal
2017-03-14T11:11:33.169801-07:00 marcslaptop kernel: [ 3555.563591] CPU0: Core temperature/speed normal
2017-03-14T11:11:33.169802-07:00 marcslaptop kernel: [ 3555.563591] CPU7: Package temperature/speed normal
2017-03-14T11:11:33.169802-07:00 marcslaptop kernel: [ 3555.563592] CPU2: Package temperature/speed normal
2017-03-14T11:11:33.169803-07:00 marcslaptop kernel: [ 3555.563592] CPU4: Package temperature/speed normal
2017-03-14T11:11:33.169803-07:00 marcslaptop kernel: [ 3555.563592] CPU0: Package temperature/speed normal
2017-03-14T11:11:33.169804-07:00 marcslaptop kernel: [ 3555.563593] mce: [Hardware Error]: Machine check events logged
2017-03-14T11:11:33.169804-07:00 marcslaptop kernel: [ 3555.563599] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 128: 000000008806000a
2017-03-14T11:11:33.169804-07:00 marcslaptop kernel: [ 3555.563601] mce: [Hardware Error]: TSC 8cb23bc20e3 mce:
2017-03-14T11:11:33.169805-07:00 marcslaptop kernel: [ 3555.563602] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489515093 SOCKET 0 APIC 1 microcode 9e
2017-03-14T11:11:33.169805-07:00 marcslaptop kernel: [ 3555.563603] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 128: 000000008806000a
2017-03-14T11:11:33.169806-07:00 marcslaptop kernel: [ 3555.563603] mce: [Hardware Error]: TSC 8cb23bc2f5e mce:
2017-03-14T11:11:33.169806-07:00 marcslaptop kernel: [ 3555.563604] mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1489515093 SOCKET 0 APIC 0 microcode 9e
2017-03-14T11:11:33.425799-07:00 marcslaptop mcelog[816]: mcelog: Family 6 Model 5e CPU: only decoding architectural errors
2017-03-14T11:11:33.425935-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
2017-03-14T11:11:33.426032-07:00 marcslaptop mcelog[816]: Email set up for sending
2017-03-14T11:11:33.426126-07:00 marcslaptop mcelog[816]: Hardware event. This is not a software error.
2017-03-14T11:11:33.426219-07:00 marcslaptop mcelog[816]: MCE 0
2017-03-14T11:11:33.426310-07:00 marcslaptop mcelog[816]: CPU 4 THERMAL EVENT TSC 8cb22edccba
2017-03-14T11:11:33.426402-07:00 marcslaptop mcelog[816]: TIME 1489515093 Tue Mar 14 11:11:33 2017
2017-03-14T11:11:33.426494-07:00 marcslaptop mcelog[816]: Processor 4 heated above trip temperature. Throttling enabled.
2017-03-14T11:11:33.426588-07:00 marcslaptop mcelog[816]: Please check your system cooling. Performance will be impacted
2017-03-14T11:11:33.426679-07:00 marcslaptop mcelog[816]: Running trigger `unknown-error-trigger'
2017-03-14T11:11:33.426770-07:00 marcslaptop mcelog[816]: STATUS 8805000b MCGSTATUS 0
2017-03-14T11:11:33.426863-07:00 marcslaptop mcelog[816]: MCGCAP c0a APICID 1 SOCKETID 0
2017-03-14T11:11:33.426960-07:00 marcslaptop mcelog[816]: CPUID Vendor Intel Family 6 Model 94
2017-03-14T11:11:33.427055-07:00 marcslaptop mcelog[816]: mcelog: Family 6 Model 5e CPU: only decoding architectural errors
2017-03-14T11:11:33.427145-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
2017-03-14T11:11:33.427234-07:00 marcslaptop mcelog[816]: Email set up for sending
2017-03-14T11:11:33.427323-07:00 marcslaptop mcelog[816]: Hardware event. This is not a software error.
2017-03-14T11:11:33.427412-07:00 marcslaptop mcelog[816]: MCE 1
2017-03-14T11:11:33.427504-07:00 marcslaptop mcelog[816]: CPU 0 THERMAL EVENT TSC 8cb22edccb6
2017-03-14T11:11:33.427596-07:00 marcslaptop mcelog[816]: TIME 1489515093 Tue Mar 14 11:11:33 2017
2017-03-14T11:11:33.427685-07:00 marcslaptop mcelog[816]: Processor 0 heated above trip temperature. Throttling enabled.
2017-03-14T11:11:33.427777-07:00 marcslaptop mcelog[816]: Please check your system cooling. Performance will be impacted
2017-03-14T11:11:33.427865-07:00 marcslaptop mcelog[816]: Running trigger `unknown-error-trigger'
2017-03-14T11:11:33.427956-07:00 marcslaptop mcelog[816]: STATUS 8805000b MCGSTATUS 0
2017-03-14T11:11:33.428046-07:00 marcslaptop mcelog[816]: MCGCAP c0a APICID 0 SOCKETID 0
2017-03-14T11:11:33.428137-07:00 marcslaptop mcelog[816]: CPUID Vendor Intel Family 6 Model 94
2017-03-14T11:11:33.428227-07:00 marcslaptop mcelog[816]: mcelog: Family 6 Model 5e CPU: only decoding architectural errors
2017-03-14T11:11:33.428316-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
2017-03-14T11:11:33.428406-07:00 marcslaptop mcelog[816]: Email set up for sending
2017-03-14T11:11:33.428495-07:00 marcslaptop mcelog[816]: Hardware event. This is not a software error.
2017-03-14T11:11:33.428583-07:00 marcslaptop mcelog[816]: MCE 0
2017-03-14T11:11:33.428671-07:00 marcslaptop mcelog[816]: CPU 4 THERMAL EVENT TSC 8cb23bc20e3
2017-03-14T11:11:33.428768-07:00 marcslaptop mcelog[816]: TIME 1489515093 Tue Mar 14 11:11:33 2017
2017-03-14T11:11:33.428857-07:00 marcslaptop mcelog[816]: Processor 4 below trip temperature. Throttling disabled
2017-03-14T11:11:33.428951-07:00 marcslaptop mcelog[816]: Running trigger `unknown-error-trigger'
2017-03-14T11:11:33.429041-07:00 marcslaptop mcelog[816]: STATUS 8806000a MCGSTATUS 0
2017-03-14T11:11:33.429132-07:00 marcslaptop mcelog[816]: MCGCAP c0a APICID 1 SOCKETID 0
2017-03-14T11:11:33.429221-07:00 marcslaptop mcelog[816]: CPUID Vendor Intel Family 6 Model 94
2017-03-14T11:11:33.429309-07:00 marcslaptop mcelog[816]: mcelog: Family 6 Model 5e CPU: only decoding architectural errors
2017-03-14T11:11:33.429398-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
2017-03-14T11:11:33.429486-07:00 marcslaptop mcelog[816]: Email set up for sending
2017-03-14T11:11:33.429576-07:00 marcslaptop mcelog[816]: Hardware event. This is not a software error.
2017-03-14T11:11:33.429665-07:00 marcslaptop mcelog[816]: MCE 1
2017-03-14T11:11:33.429755-07:00 marcslaptop mcelog[816]: CPU 0 THERMAL EVENT TSC 8cb23bc2f5e
2017-03-14T11:11:33.429853-07:00 marcslaptop mcelog[816]: TIME 1489515093 Tue Mar 14 11:11:33 2017
2017-03-14T11:11:33.429944-07:00 marcslaptop mcelog[816]: Processor 0 below trip temperature. Throttling disabled
2017-03-14T11:11:33.430038-07:00 marcslaptop mcelog[816]: Running trigger `unknown-error-trigger'
2017-03-14T11:11:33.430127-07:00 marcslaptop mcelog[816]: STATUS 8806000a MCGSTATUS 0
2017-03-14T11:11:33.430216-07:00 marcslaptop mcelog[816]: MCGCAP c0a APICID 0 SOCKETID 0
2017-03-14T11:11:33.430304-07:00 marcslaptop mcelog[816]: CPUID Vendor Intel Family 6 Model 94
2017-03-14T11:11:33.430399-07:00 marcslaptop mcelog[816]: mcelog: Cannot collect child 15729: No child processes
2017-03-14T11:12:21.401781-07:00 marcslaptop rsyslogd: -- MARK --
2017-03-14T11:12:22.320288-07:00 marcslaptop smartd[766]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 181
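Since the same burst repeats over and over, a small parser can summarize it instead of reading it by eye. This is a hypothetical helper; the regular expressions are written against the kernel lines in Marc's excerpt above and may not match the wording of other kernel versions.

```python
import re
from collections import Counter

# Patterns written against the kernel lines in the excerpt above (assumed format).
THROTTLED = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)
NORMAL = re.compile(r"CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature/speed normal")


def summarize(lines):
    """Count 'throttled' and 'normal' messages per (cpu, scope) and keep the
    largest 'total events' counter reported for each (cpu, scope)."""
    throttled, normal, totals = Counter(), Counter(), {}
    for line in lines:
        m = THROTTLED.search(line)
        if m:
            key = (int(m["cpu"]), m["scope"])
            throttled[key] += 1
            totals[key] = max(totals.get(key, 0), int(m["total"]))
            continue
        m = NORMAL.search(line)
        if m:
            normal[(int(m["cpu"]), m["scope"])] += 1
    return throttled, normal, totals


if __name__ == "__main__":
    sample = [
        "kernel: [ 3555.558601] CPU6: Package temperature above threshold, "
        "cpu clock throttled (total events = 328)",
        "kernel: [ 3555.563587] CPU6: Package temperature/speed normal",
    ]
    print(summarize(sample))
```

Fed the whole of /var/log/messages (reading it needs root), the per-CPU counts would show how often each core and the package crossed the threshold, while the largest "total events" value follows the kernel's own running counter.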
On 03/14/2017 01:44 PM, Marc Chamberlin wrote:
On 03/14/2017 12:31 PM, auxsvr wrote:
On Monday, 13 March 2017 19:33:51 EET Marc Chamberlin wrote:
[MCE log snipped]
What is the full MCE message in the system log? Is thermald running?
Peter - I tried thermald, but looking at the output in the messages log file, it kinda looks like it got off to a sick start! Dunno what this all means, but it doesn't look good... Marc..
2017-03-14T13:45:02.021834-07:00 marcslaptop systemd[1]: Starting Thermal Daemon Service...
2017-03-14T13:45:02.025438-07:00 marcslaptop thermald[27185]: 22 CPUID levels; family:model:stepping 0x6:5e:3 (6:94:3)
2017-03-14T13:45:02.025613-07:00 marcslaptop thermald[27185]: Polling mode is enabled: 4
2017-03-14T13:45:02.025790-07:00 marcslaptop thermald[27185]: failed to GET COUNT on /dev/acpi_thermal_rel
2017-03-14T13:45:02.026270-07:00 marcslaptop thermald[27185]: Using generated /var/run/thermald/thermal-conf.xml.auto
2017-03-14T13:45:02.027339-07:00 marcslaptop thermald[27185]: sysfs read failed constraint_0_max_power_uw
2017-03-14T13:45:02.028427-07:00 marcslaptop thermald[27185]: csys_fs::read exception 1/trip_point_0_hyst
2017-03-14T13:45:02.028548-07:00 marcslaptop thermald[27185]: csys_fs::read exception 1/trip_point_1_hyst
2017-03-14T13:45:02.029742-07:00 marcslaptop thermald[27185]: csys_fs::read exception 2/trip_point_0_hyst
2017-03-14T13:45:02.029870-07:00 marcslaptop thermald[27185]: csys_fs::read exception 2/trip_point_1_hyst
2017-03-14T13:45:02.030982-07:00 marcslaptop thermald[27185]: csys_fs::read exception 3/trip_point_0_hyst
2017-03-14T13:45:02.031085-07:00 marcslaptop thermald[27185]: csys_fs::read exception 3/trip_point_1_hyst
2017-03-14T13:45:02.032174-07:00 marcslaptop thermald[27185]: csys_fs::read exception 4/trip_point_0_hyst
2017-03-14T13:45:02.032276-07:00 marcslaptop thermald[27185]: csys_fs::read exception 4/trip_point_1_hyst
2017-03-14T13:45:02.035978-07:00 marcslaptop thermald[27185]: sysfs write failed trip_point_0_temp
2017-03-14T13:45:02.036130-07:00 marcslaptop thermald[27185]: csys_fs::read exception 6/trip_point_0_hyst
2017-03-14T13:45:02.036221-07:00 marcslaptop thermald[27185]: csys_fs::read exception 6/trip_point_1_hyst
2017-03-14T13:45:02.036310-07:00 marcslaptop thermald[27185]: csys_fs::read exception 6/trip_point_2_hyst
2017-03-14T13:45:02.036398-07:00 marcslaptop thermald[27185]: csys_fs::read exception 6/trip_point_3_hyst
2017-03-14T13:45:02.036487-07:00 marcslaptop thermald[27185]: csys_fs::read exception 6/trip_point_4_hyst
2017-03-14T13:45:02.037037-07:00 marcslaptop thermald[27185]: XML zone: invalid sensor type TMEM
2017-03-14T13:45:02.037151-07:00 marcslaptop thermald[27185]: Zone update failed: unable to bind
2017-03-14T13:45:02.044083-07:00 marcslaptop systemd[1]: Started Thermal Daemon Service.
On Tuesday, 14 March 2017 22:53:40 EET Marc Chamberlin wrote:
2017-03-14T13:45:02.021834-07:00 marcslaptop systemd[1]: Starting Thermal Daemon Service...
2017-03-14T13:45:02.025438-07:00 marcslaptop thermald[27185]: 22 CPUID levels; family:model:stepping 0x6:5e:3 (6:94:3)
2017-03-14T13:45:02.025613-07:00 marcslaptop thermald[27185]: Polling mode is enabled: 4
Some errors are expected, as they also appear on my laptop. Perhaps thermald has not yet been updated to fully support new processors?
2017-03-14T11:11:33.161786-07:00 marcslaptop kernel: [ 3555.558599] CPU4: Core temperature above threshold, cpu clock throttled (total events = 328)
These messages indicate that your CPU is overheating. Since it takes only a few ms to cool down, most likely turbo boost triggers the overheating in your case, just as in mine. thermald should handle this, although it is not always successful.
-- Regards, Peter
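One way to test Peter's turbo-boost theory is to switch turbo off temporarily and see whether the events stop. The sketch assumes the laptop uses the intel_pstate driver, which exposes /sys/devices/system/cpu/intel_pstate/no_turbo (0 = turbo allowed, 1 = turbo disabled); writing it needs root, and the setting does not survive a reboot. The helper names are illustrative.

```python
from pathlib import Path

# Present only when the intel_pstate cpufreq driver is in use (assumption).
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")


def turbo_enabled(path: Path = NO_TURBO):
    """True if turbo is allowed, False if disabled, None if the file is missing."""
    try:
        return path.read_text().strip() == "0"
    except FileNotFoundError:
        return None


def set_turbo(enabled: bool, path: Path = NO_TURBO) -> None:
    """Write 0 (allow turbo) or 1 (disable turbo). Needs root; lost on reboot."""
    path.write_text("0\n" if enabled else "1\n")


if __name__ == "__main__":
    print("turbo enabled:", turbo_enabled())
```

Run as root, set_turbo(False) followed by a look at the log for new "temperature above threshold" lines would give a quick yes/no on the theory; set_turbo(True) restores the default.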
On 2017-03-14 21:44, Marc Chamberlin wrote:
I took a look at messages log file and this is a snapshot of what gets repeated a lot, and taking a guess this may contain the answer to your questions. Sorry it is rather a lot of info to grok and I hope it does not violate forum protocol to post so much noise. :-D HTHs Marc...
It seems that the CPU overheats and the kernel throttles the CPU down. But within a second it says that the temperature is back to normal and throttling is disabled. Makes no sense.
Somewhere it also says that it is preparing an email.
-- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
On 03/14/2017 02:43 PM, Carlos E. R. wrote:
On 2017-03-14 21:44, Marc Chamberlin wrote:
I took a look at messages log file and this is a snapshot of what gets repeated a lot, and taking a guess this may contain the answer to your questions. Sorry it is rather a lot of info to grok and I hope it does not violate forum protocol to post so much noise. :-D HTHs Marc... It seems that CPU overheats, and the kernel throttles the CPU down. But within a second it says that the temperature is back to normal, and throttling is disabled. Makes no sense.
Somewhere it also says that it is preparing an email.
Nice to know, Carlos, that I am not the only one puzzled by this! LOL... I can't say anything about the email that is being prepared; I am not getting any automatically generated email from this error handler... And I have no idea where or how I would go about configuring things so that I would...
Thanks, Marc
On 2017-03-15 03:04, Marc Chamberlin wrote:
On 03/14/2017 02:43 PM, Carlos E. R. wrote:
Somewhere it also says that it is preparing an email.
Nice to know, Carlos, that I am not the only one puzzled by this! LOL... I can't say anything about the email that is being prepared, I am not getting any automatically generated email from this error handler... And I have no idea on where or how I would go about configuring things so that I would...
It would be local email. If you open a terminal you would typically see a message about mail waiting, which you could read with the command "mail". But normally that email would be sent to "root". If it is missing, the mail log would say what happened to it. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
On 03/14/2017 04:43 PM, Carlos E. R. wrote:
It seems that CPU overheats, and the kernel throttles the CPU down. But within a second it says that the temperature is back to normal, and throttling is disabled. Makes no sense.
Somewhere it also says that it is preparing an email.
It trips the temp limit, sends an e-mail and then, subsequent to that, checks again and is below the limit, e.g.
MCE 0 CPU 4 THERMAL EVENT TSC 8cb22edccba Processor 4 heated above trip temperature. Throttling enabled.
MCE 1 CPU 0 THERMAL EVENT TSC 8cb22edccb6 Processor 0 heated above trip temperature. Throttling enabled.
<stuff>
MCE 0 CPU 4 THERMAL EVENT TSC 8cb23bc20e3 Processor 4 below trip temperature. Throttling disabled
MCE 1 CPU 0 THERMAL EVENT TSC 8cb23bc2f5e Processor 0 below trip temperature. Throttling disabled
If this was the first I've seen of the error, I would be looking at cleaning dust bunnies out of the fan screens and making sure there wasn't a rat's nest around the CPU....
-- David C. Rankin, J.D.,P.E.
On 2017-03-15 03:32, David C. Rankin wrote:
On 03/14/2017 04:43 PM, Carlos E. R. wrote:
It seems that CPU overheats, and the kernel throttles the CPU down. But within a second it says that the temperature is back to normal, and throttling is disabled. Makes no sense.
Somewhere it also says that it is preparing an email.
It trips the temp limit, sends an e-mail and then subsequent to that checks again and is below the limit, e.g.
Yes, but:
CPU 6 too hot at 2017-03-14T11:11:33.161797
CPU 6 normal at 2017-03-14T11:11:33.169789
8 milliseconds later! That's impossible!
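Carlos's arithmetic can be checked directly from the two syslog timestamps, and the kernel's own timestamps in square brackets on the same two CPU6 "Package" lines ([ 3555.558601] and [ 3555.563587]) give a second, independent figure. A minimal sketch:

```python
from datetime import datetime

# Syslog (RFC 3339) timestamps of the CPU6 "Package temperature above threshold"
# and "Package temperature/speed normal" lines from Marc's excerpt.
too_hot = datetime.fromisoformat("2017-03-14T11:11:33.161797-07:00")
back_to_normal = datetime.fromisoformat("2017-03-14T11:11:33.169789-07:00")
syslog_ms = (back_to_normal - too_hot).total_seconds() * 1000

# Kernel timestamps (seconds since boot) printed in brackets on the same two lines.
kernel_ms = round((3555.563587 - 3555.558601) * 1000, 3)

print(f"syslog: {syslog_ms:.3f} ms, kernel: {kernel_ms:.3f} ms")
# prints "syslog: 7.992 ms, kernel: 4.986 ms"
```

Both clocks put the "too hot" and "normal" messages only a few milliseconds apart, which suggests the short interval is not just an artifact of syslog buffering.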
If this was the first I've seen of the error, I would be looking at cleaning dust bunnies out of the fan screens and making sure there wasn't a rat's nest around the CPU....
Yes, me too, but the time difference points at bugs. One way to have both events so close together is noise around the trip point, with no hysteresis designed into the detection. That would cause a storm of events, and in fact, there is one:
2017-03-14T11:11:33.429398-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
-- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
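Carlos's hysteresis hypothesis can be illustrated with a toy model: the same noisy temperature trace fed to a threshold detector with and without a hysteresis band. The trace, trip point and band width are invented numbers, and nothing here describes how the CPU's thermal sensor is actually implemented.

```python
def count_transitions(samples, trip, hysteresis=0.0):
    """Count hot/normal state changes of a simple threshold detector.

    The detector goes 'hot' at >= trip and only returns to 'normal' once the
    reading drops below trip - hysteresis (hysteresis=0 means a bare comparator).
    """
    hot = False
    transitions = 0
    for t in samples:
        if not hot and t >= trip:
            hot, transitions = True, transitions + 1
        elif hot and t < trip - hysteresis:
            hot, transitions = False, transitions + 1
    return transitions


# Hypothetical reading jittering by about 1 degree around a 95-degree trip point.
trace = [94, 95, 94, 96, 94, 95, 94, 96, 95, 94]

print(count_transitions(trace, trip=95))                # 8: an event storm
print(count_transitions(trace, trip=95, hysteresis=3))  # 1: a single "hot" event
```

With no band, every small wiggle across the trip point produces a pair of "above"/"below" events, which is the kind of burst that could leave mcelog with too many trigger children at once.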
On 03/14/2017 09:46 PM, Carlos E. R. wrote:
One way to have both events so near is noise around the trip point. No hysteresis cycle designed in the detection. It would cause a storm of events, and in fact, there is such:
2017-03-14T11:11:33.429398-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
Hmm..., I don't claim to understand the silicon level, but if the trigger is fluctuating that wildly, it smells like a voltage problem at the chip. Power supply issue? -- David C. Rankin, J.D.,P.E.
On 2017-03-15 04:05, David C. Rankin wrote:
On 03/14/2017 09:46 PM, Carlos E. R. wrote:
One way to have both events so near is noise around the trip point. No hysteresis cycle designed in the detection. It would cause a storm of events, and in fact, there is such:
2017-03-14T11:11:33.429398-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already
Hmm...,
I don't claim to understand the silicon level, but if the trigger is fluctuating that wildly, it smells like a voltage problem at the chip. Power supply issue?
The log doesn't mention the temperature it reads, so we can't know if it fluctuates wildly. Perhaps those details are in the missing email. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
On 03/14/2017 08:27 PM, Carlos E. R. wrote:
On 2017-03-15 04:05, David C. Rankin wrote:
On 03/14/2017 09:46 PM, Carlos E. R. wrote:
One way to have both events so near is noise around the trip point. No hysteresis cycle designed in the detection. It would cause a storm of events, and in fact, there is such:
2017-03-14T11:11:33.429398-07:00 marcslaptop mcelog[816]: mcelog: Too many trigger children running already Hmm...,
I don't claim to understand the silicon level, but if the trigger is fluctuating that wildly, it smells like a voltage problem at the chip. Power supply issue? The log doesn't mention the temperature it reads, so we can't know if it fluctuates wildly. Perhaps those details are in the missing email.
Interesting stuff Carlos, David... I am following along but not sure yet what to do... But I will add a couple of tidbits of info related to your thoughts...
1. I rebooted my laptop into Windoz10 and let it run Boinc and a few other automatic tasks overnight. No problems with Windoz10, and the laptop was running fine this morning... So if it is a hardware problem, Windoz10 is keeping quiet about it.. Dunno if there is someplace specifically I should look...
2. Executed mail and mail -u root; both responded by saying there was no mail....
Marc...
On 03/15/2017 12:37 AM, Marc Chamberlin wrote:
[earlier quoted discussion snipped]
Interesting stuff Carlos, David... I am following along but not sure yet what to do... But will add a couple tidbits of info related to your thoughts...
1. I rebooted my laptop into Windoz10 and let it run Boinc and a few other automatic tasks overnight. No problems with Windoz10 and laptop was running fine this morning... So if it is a hardware problem Windoz10 is keeping quiet about it.. Dunno if there is someplace specifically I should look...
2. Executed mail and mail -u root, both responded by saying there was no mail....
Marc...
Marc,

Open top, or htop or the like, and look for runaway processes continually pegging your CPU at 100+% to rule out it being an actual heating problem. (Even though you should be able to run at 100% continually without issue, cooling degrades as dust builds up, and that by itself can push temps over the limit.) Again, checking/cleaning the fan screens and cooling passages is a cheap, quick test/fix for laptop overtemp issues. If you can peg the fan with an opened paperclip or thumbtack inserted through a grille opening, you can use a shopvac to suck out the cooling passages and usually remove the bulk of the duff. But don't use the shopvac directly over the cooling passage unless you peg the fan; you can wildly overspeed the fan, causing it to come apart.

The other thought is to check any BIOS temperature limit settings (usually under the System Status menu, or Power or Temperature or something similar). Some BIOSes allow you to set limits in 5-degree increments over a 50-80 degree range appropriate for your processor.

The reason I also suggested a power issue is that fluctuating bus voltages can cause havoc with the processor reporting as well.

That about exhausts my "what to check for" list without knowing more. When did this first start, and were there any changes you made to SUSE before it began? What do you have running (other than KDE/Konsole) when the errors occur? Does it seem like the messages are in response to something you were doing at the time? Was your mail reader running, and how many mail accounts do you have it pulling from? Etc., etc., etc.

Errors on Windows and not on Linux, or vice versa, are not really that unexpected. Both do better/worse in some areas, and it's possible it is happening on windoze, but it's just not sophisticated enough to warn/e-mail you about the temperature excursions.

As Carlos said, post whatever additional information you have that might help us help you narrow it down :)

-- David C. Rankin, J.D.,P.E.
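David's suggestion is to use top or htop. For a scripted check, the sketch below does roughly what those tools do for their CPU column: it samples utime + stime from /proc/<pid>/stat twice and divides the difference by the clock-tick rate and the sampling interval. Figures are percent of one CPU, so a multi-threaded process can exceed 100%, matching David's "100+%". The function names are hypothetical; the field positions (utime is field 14, stime field 15) follow the proc(5) man page.

```python
import os
import time
from pathlib import Path


def parse_stat(line):
    """Return (command name, utime + stime in clock ticks) from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or ')',
    so split on the *last* ')' before reading the numeric fields.
    """
    comm = line[line.find("(") + 1 : line.rfind(")")]
    fields = line[line.rfind(")") + 2 :].split()
    # fields[0] is field 3 (state), so utime (field 14) is index 11, stime (15) is index 12.
    return comm, int(fields[11]) + int(fields[12])


def snapshot(proc="/proc"):
    """CPU ticks used so far by every process we are allowed to read, keyed by PID."""
    result = {}
    for entry in Path(proc).iterdir():
        if not entry.name.isdigit():
            continue  # skip /proc/self, /proc/cpuinfo, etc.
        try:
            result[int(entry.name)] = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited between listing and reading, or unreadable
    return result


def top_cpu(interval=2.0, limit=5):
    """Sample twice and return [(percent of one CPU, pid, name), ...], busiest first."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    usage = []
    for pid, (name, ticks) in after.items():
        if pid in before:
            delta = max(ticks - before[pid][1], 0)  # guard against PID reuse
            usage.append((100.0 * delta / ticks_per_sec / interval, pid, name))
    usage.sort(reverse=True)
    return usage[:limit]
```

On a Linux box, `for pct, pid, name in top_cpu(): print(f"{pct:6.1f}%  {pid:>7}  {name}")` lists the five busiest processes over a two-second window. A process that stays near or above 100% while the MCE messages appear, such as the stuck notes process ellanios82 mentions in the next reply, would point at a real, load-driven heating problem.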
On 15/03/17 08:49, David C. Rankin wrote:
Open top, or htop or the like and look for runaway processes continually pegging your CPU at 100+% to eliminate it being an actual heating problem.
- A month ago I had overheating problems: the cause was a stuck Xfce "notes" process.
regards
On 2017-03-15 06:37, Marc Chamberlin wrote:
Interesting stuff Carlos, David... I am following along but not sure yet what to do... But will add a couple tidbits of info related to your thoughts...
1. I rebooted my laptop into Windoz10 and let it run Boinc and a few other automatic tasks overnight. No problems with Windoz10 and laptop was running fine this morning... So if it is a hardware problem Windoz10 is keeping quiet about it.. Dunno if there is someplace specifically I should look...
2. Executed mail and mail -u root, both responded by saying there was no mail....
I did "journalctl | grep mail | less -S" on my machine to investigate where those emails went, and to my surprise, I see many of these:

Nov 27 16:53:33 Isengard mcelog[870]: Email set up for sending
Nov 27 16:53:33 Isengard mcelog[870]: Email set up for sending
Nov 27 16:53:33 Isengard mcelog[870]: Email set up for sending
Nov 27 16:53:33 Isengard mcelog[870]: Email set up for sending

So I look carefully:

Nov 27 16:53:33 Isengard mcelog[870]: Hardware event. This is not a software error.
Nov 27 16:53:33 Isengard mcelog[870]: MCE 0
Nov 27 16:53:33 Isengard mcelog[870]: CPU 0 THERMAL EVENT TSC 528603de34fc
Nov 27 16:53:33 Isengard mcelog[870]: TIME 1480262013 Sun Nov 27 16:53:33 2016
Nov 27 16:53:33 Isengard mcelog[870]: STATUS 80003cf MCGSTATUS 0
Nov 27 16:53:33 Isengard mcelog[870]: MCGCAP 806 APICID 0 SOCKETID 0
Nov 27 16:53:33 Isengard mcelog[870]: CPUID Vendor Intel Family 6 Model 76
Nov 27 16:53:33 Isengard mcelog[870]: Email set up for sending
Nov 27 16:53:33 Isengard postfix/smtpd[27627]: connect from localhost[::1]
Nov 27 16:53:33 Isengard postfix/smtpd[27627]: warning: Illegal address syntax from localhost[::1] in RCPT command: <>
Nov 27 16:53:33 Isengard postfix/smtpd[27627]: disconnect from localhost[::1]

(This is just a heat-up event, a real one in my case, from when I was testing the machine.)

The email fails because mcelog uses the wrong syntax (an empty RCPT address, <>), so postfix rejects it. The email is lost forever. Bug, IMHO.

-- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" (Minas Tirith))
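Since mcelog's emails are being dropped, the journal is the record that survives. The sketch below turns those lines into something countable: it tallies THERMAL EVENT records per CPU and decodes the TIME field of every mcelog record, which is seconds since the Unix epoch. The regular expressions are written against the line format quoted above and are an assumption; other mcelog versions may word their output differently.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches e.g. "... mcelog[870]: CPU 0 THERMAL EVENT TSC 528603de34fc"
THERMAL_RE = re.compile(r"mcelog\[\d+\]: CPU (\d+) THERMAL EVENT")
# Matches e.g. "... mcelog[870]: TIME 1480262013 Sun Nov 27 16:53:33 2016"
TIME_RE = re.compile(r"mcelog\[\d+\]: TIME (\d+)")


def summarize(journal_lines):
    """Count THERMAL EVENT records per CPU and decode every record's TIME field to UTC."""
    per_cpu = Counter()
    times = []
    for line in journal_lines:
        if m := THERMAL_RE.search(line):
            per_cpu[int(m.group(1))] += 1
        elif m := TIME_RE.search(line):
            # TIME is seconds since the Unix epoch; convert explicitly to UTC.
            times.append(datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc))
    return per_cpu, times


# Excerpt of the journal lines quoted above.
SAMPLE = [
    "Nov 27 16:53:33 Isengard mcelog[870]: Hardware event. This is not a software error.",
    "Nov 27 16:53:33 Isengard mcelog[870]: MCE 0",
    "Nov 27 16:53:33 Isengard mcelog[870]: CPU 0 THERMAL EVENT TSC 528603de34fc",
    "Nov 27 16:53:33 Isengard mcelog[870]: TIME 1480262013 Sun Nov 27 16:53:33 2016",
    "Nov 27 16:53:33 Isengard mcelog[870]: STATUS 80003cf MCGSTATUS 0",
]

if __name__ == "__main__":
    per_cpu, times = summarize(SAMPLE)
    print(dict(per_cpu), [t.isoformat() for t in times])
    # prints: {0: 1} ['2016-11-27T15:53:33+00:00']
```

On a live system, the lines could come from something like `journalctl -t mcelog --no-pager` instead of the sample. Many events per CPU within the same second would fit the "noise around the trip point" explanation discussed earlier in the thread.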
participants (7)
- auxsvr
- Carlos E. R.
- David C. Rankin
- Doug
- ellanios82
- Marc Chamberlin
- Per Jessen