[opensuse] mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction
push
Hi *,
on my main server after a cold boot I see the following messages in my journal:
...
kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 20: c8012a4000200e0f
kernel: mce: [Hardware Error]: TSC 0 mce: MISC 800000 mce:
kernel: mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1508767394 SOCKET 1 APIC 10 microcode 36
mcelog[2635]: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction
mcelog[2636]: Location: CPU 6 on socket 1
...
systemd[1]: Starting Machine Check Exception Logging Daemon...
systemd[1]: Started Machine Check Exception Logging Daemon.
mcelog[2628]: Hardware event. This is not a software error.
mcelog[2628]: MCE 0
mcelog[2628]: CPU 6 BANK 20
mcelog[2628]: MISC 800000
mcelog[2628]: TIME 1508767394 Mon Oct 23 16:03:14 2017
mcelog[2628]: MCG status:
mcelog[2628]: MCi status:
mcelog[2628]: Error overflow
mcelog[2628]: Corrected error
mcelog[2628]: MCi_MISC register valid
mcelog[2628]: MCA: BUS error: 1 6 Level-3 Generic Generic Other-transaction Request-did-not-timeout
mcelog[2628]: Running trigger `bus-error-trigger'
mcelog[2628]: QPI:
mcelog[2628]: Intel QPI physical layer detected a QPI in-band reset but aborted initialization
mcelog[2628]: STATUS c8012a4000200e0f MCGSTATUS 0
mcelog[2628]: MCGCAP 7000c16 APICID 10 SOCKETID 1
mcelog[2628]: CPUID Vendor Intel Family 6 Model 63
mcelog[2628]: <27>Oct 23 16:04:47 mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction
mcelog[2628]: <27>Oct 23 16:04:47 mcelog: Location: CPU 6 on socket 1
...
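For anyone who wants to cross-check mcelog's reading of the STATUS value, here is a minimal Python sketch that splits c8012a4000200e0f into the architectural IA32_MCi_STATUS fields. The bit positions are my reading of the machine-check chapter of the Intel SDM, and the function and key names are made up for this example; they are not part of mcelog or the kernel. The corrected-error count is only meaningful when MCG_CAP bit 10 is set, which is an assumption worth checking against the MCGCAP value in the log.

```python
# Sketch only: field positions assumed from the Intel SDM machine-check
# chapter; all names here are illustrative, not mcelog or kernel identifiers.

def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means corrected
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38, if MCG_CAP[10] is set
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


def decode_bus_error(mca_code: int):
    """Decode a bus/interconnect compound code of the form 000F 1PPT RRRR IILL.

    Returns None if the code does not match that pattern. Bit 12 (F) is the
    filtering bit and is ignored here.
    """
    if (mca_code & 0xE800) != 0x0800:
        return None
    return {
        "pp": (mca_code >> 9) & 0x3,         # participation; 3 = generic
        "timeout": bool((mca_code >> 8) & 1),
        "rrrr": (mca_code >> 4) & 0xF,       # request type; 0 = generic error
        "ii": (mca_code >> 2) & 0x3,         # 0 = memory, 2 = I/O, 3 = other transaction
        "ll": mca_code & 0x3,                # level field
    }


if __name__ == "__main__":
    status = decode_mci_status(0xC8012A4000200E0F)
    for name, value in status.items():
        print(f"{name}: {value}")
    print(decode_bus_error(status["mca_code"]))
```

Run against the value from the journal, this agrees with what mcelog printed: the valid, overflow and MISC-valid bits are set, the uncorrected bit is clear (so it is a corrected error), and the low 16 bits match the bus/interconnect pattern with "other transaction" and no timeout.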
This is with kernel 4.4.90-28 on openSUSE Leap 42.3, but after checking older journal entries I saw that it also happened with 4.4.87-25.

Machine specs:
- Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015
- 2 x 6 core CPU Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
- 128 GB RAM
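As a small cross-check between the kernel line and the specs above: assuming the second half of "PROCESSOR 0:306f2" is the CPUID signature (leaf 1, EAX), which is my reading of the kernel's output format and not something the log states, it can be unpacked into family, model and stepping with the usual CPUID rules. The function name below is invented for this sketch.

```python
# Sketch only: assumes 0x306f2 is the CPUID leaf-1 EAX signature.
# decode_cpuid_signature is a hypothetical helper name.

def decode_cpuid_signature(eax: int) -> dict:
    """Unpack family/model/stepping from a CPUID signature value."""
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended model only counts for family 6 and family 15;
    # the extended family only counts for family 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    if family == 0xF:
        family += ext_family

    return {"family": family, "model": model, "stepping": stepping}


if __name__ == "__main__":
    print(decode_cpuid_signature(0x306F2))
```

For 0x306f2 this gives family 6, model 63, which is the same "Family 6 Model 63" that mcelog reports in the log.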
The error does not happen when restarting the OS, only after a cold boot of the machine.
I couldn't find any relevant information on the net. Is CPU 1 damaged? Can I do anything to correct the problem - or should I just ignore it?
--
Michael Hirmke
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 01/11/17 07:22 AM, Michael Hirmke wrote:
> push

a) dust
b) oxidation
c) bad caps

(In my case I'd have to add 'cat hair' and clean the fans, remember to use a paper-clip.)

For two of the three, the solution is 'pull'. I suggest investing in one of those cans of compressed air with a nozzle.

While I sometimes have to pull memory and connectors, wipe the gold fingers with an antistatic cloth, blow air into the connectors and replace all same, I rarely have to replace capacitors. I have more than enough low-end mobos from the "Closet of Anxieties", but obviously that isn't the case with you.

PLEASE NOTE:
I'm not saying that this is an ultimate solution, and I'd be VERY reluctant to 'pull and polish' the CPUs, but this is a first line of wolf fencing of the problem.

NEXT UP: quality of the PSU from a cold start.

Heck, it's getting cold and I turn the computer on before the forced air heating has warmed the house ...

Time was I had a project in a portacabin. We'd turn the heating on in the portacabin and go get breakfast; come back an hour later. Any earlier and the electronics wouldn't turn on.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
Hi Anton,

thx for your answer.

> On 01/11/17 07:22 AM, Michael Hirmke wrote:
>> push
>
> a) dust

I already cleaned everything as carefully as possible.

> b) oxidation

Phew, I'd have to disassemble the fan and CPU to check that. I'd prefer to do this as a last resort. But of course you're right - this may be one reason.

> c) bad caps

Oops, changing caps is beyond my skills 8-<

[...]

> While I sometimes have to pull memory and connectors, wipe the gold fingers with an antistatic cloth, blow air into the connectors and replace all same, I rarely have to replace capacitors. I have more than enough low-end mobos from the "Closet of Anxieties", but obviously that isn't the case with you.

Indeed - this Supermicro mobo is a high-end mobo.

> PLEASE NOTE:
> I'm not saying that this is an ultimate solution, and I'd be VERY reluctant to 'pull and polish' the CPUs, but this is a first line of wolf fencing of the problem.

You are right, but I'm not very eager to do that 8-<

> NEXT UP: quality of the PSU from a cold start.
> Heck, it's getting cold and I turn the computer on before the forced air heating has warmed the house ...

I don't think this is a problem here, because the machines run the whole day, so everything is warm when one of them is switched off for a short while and then switched back on.

[...]

Bye.
Michael.
--
Michael Hirmke
On 01/11/17 08:03 AM, Michael Hirmke wrote:
> Hi Anton,
> thx for your answer.
>
>> On 01/11/17 07:22 AM, Michael Hirmke wrote:
>>> push
>>
>> a) dust
>
> I already cleaned everything as carefully as possible.

+1

>> b) oxidation
>
> Phew, I'd have to disassemble the fan and CPU to check that.

Yes/No/Maybe
Sometimes it's just a matter of unplugging what there is to be unplugged, including the power leads to the mobo, and whatever contacts are accessible. Wipe. Blow air.

> I'd prefer to do this as a last resort.

I dread the thought of pulling the CPUs! But you can dust off their fans and the power leads to those fans.
The only reasons I can imagine for pulling them are
a) the CPU really, really dies
b) you decide to upgrade to 8-core or 16-core

> But of course you're right - this may be one reason.
>
>> c) bad caps
>
> Oops, changing caps is beyond my skills 8-<

On a multi-layer board like a mobo, this is beyond mine too, though I've fixed up a flat-screen that died of this.

> [...]
>> While I sometimes have to pull memory and connectors, wipe the gold fingers with an antistatic cloth, blow air into the connectors and replace all same, I rarely have to replace capacitors. I have more than enough low-end mobos from the "Closet of Anxieties", but obviously that isn't the case with you.
>
> Indeed - this Supermicro mobo is a high-end mobo.

Right! Oh, what is it, how much did it set you back?
Obviously this is not something I'd expect to find in the Closet of Anxieties!
But nevertheless, clean what contacts you can clean.

>> NEXT UP: quality of the PSU from a cold start.
>> Heck, it's getting cold and I turn the computer on before the forced air heating has warmed the house ...
>
> I don't think this is a problem here, because the machines run the whole day, so everything is warm when one of them is switched off for a short while and then switched back on.

... your electricity bill, Bro, not mine!

Still, 'pull and polish'.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
Hi Anton,

[...]

>> Indeed - this Supermicro mobo is a high-end mobo.
>
> Right! Oh, what is it, how much did it set you back?

What do you mean by that?

> Obviously this is not something I'd expect to find in the Closet of Anxieties!
> But nevertheless, clean what contacts you can clean.

Yep.

[...]

>> I don't think this is a problem here, because the machines run the whole day, so everything is warm when one of them is switched off for a short while and then switched back on.
>
> ... your electricity bill, Bro, not mine!

Solar power is one of my closest friends :))

> Still, 'pull and polish'.

Yep.

Bye.
Michael.
--
Michael Hirmke
participants (2)
- Anton Aylward
- mh@mike.franken.de