David C. Rankin wrote:
Dave Plater wrote:
David C. Rankin wrote:
Anders Johansson wrote:
On Friday 04 April 2008 17:15:45 David C. Rankin wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message.
If the mce says it is a hardware problem, you can count on its being a hardware problem.
Anders
Jan, Anders, List:
The more I read, and the more I test, the more I am concerned that there may be a simmering issue with the x86_64 code. I installed a plain-Jane PCI-E ATI card running with the open source driver. Just as with the nvidia 8600GT card (using the open source "nv" driver), the system still gives occasional MCEs. Just as with the 8600, the MCEs do not have any effect on the system. If I weren't logging them with mcelog, I would never know they were occurring.
Reading the tech-docs, it is readily apparent that MCE doesn't necessarily mean hardware. Software is more than capable of causing them:
AMD64 Architecture Programmer's Manual Volume 2: System Programming
2.6.6 New Exception Conditions
"The AMD64 architecture defines a number of new conditions that can cause an exception to occur when the processor is running in long mode. Many of the conditions occur when software attempts to use an address that is not in canonical form. See “Vectors” on page 208 for information on the new exception conditions that can occur in long mode."
See: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2459...
See Also:
AMD64 -
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00...
Opteron Specific -
http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00...
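For readers unfamiliar with the "canonical form" mentioned in the quoted passage, here is a minimal Python sketch of the check. It assumes a 48-bit virtual address implementation, which is an assumption about the processor and not something stated in this thread; the function name `is_canonical` is made up for illustration. Under that assumption an address is canonical when bits 63 through 47 are all zeros or all ones. The sketch says nothing about whether a non-canonical address can produce a machine check.

```python
# Sketch only: assumes 48 implemented virtual-address bits (an assumption,
# not taken from the thread). is_canonical is a hypothetical helper name.
VIRTUAL_ADDRESS_BITS = 48


def is_canonical(addr: int, bits: int = VIRTUAL_ADDRESS_BITS) -> bool:
    """Return True if a 64-bit address is in canonical form.

    Canonical means bits 63 down to (bits - 1) are all equal,
    i.e. the address is a sign extension of its low `bits` bits.
    """
    if not 0 <= addr < 2**64:
        raise ValueError("address must fit in 64 bits")
    upper = addr >> (bits - 1)                # bits 63..47 when bits == 48
    all_ones = (1 << (64 - bits + 1)) - 1     # 17 one-bits when bits == 48
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    # Example values chosen for illustration only.
    for value in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(f"{value:#018x} canonical: {is_canonical(value)}")
```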
My question is, "What type of additional logging or data capture should I be doing in hopes of catching or narrowing down what the real cause of the MCE is?" I'm capturing the MCEs with mcelog running every minute under cron to ensure the buffers never get filled. But beyond that, I'm not doing any other special logging. The only hardware I haven't changed is the motherboard, and that tests fine. What else could I run/log/set that would give me the best chance of finding the real culprit?
Any help is much appreciated.
Hi, you're logging handled exceptions. You need to make the machine crash under stress, and the last exception will be the one. The most common cause of a software crash is a divide by zero, but they may even have fixed that; my experience stops at P3. I've followed this thread with great interest. Regards Dave
Thank you Dave,
I'm pulling my hair out on this one. One thing I haven't done is to post the actual MCEs I'm seeing. The mcelog and the syslog containing the MCEs are here:
http://www.3111skyline.com/download/lockup_x86-64/mcelog
http://www.3111skyline.com/download/lockup_x86-64/messages_20080407
The logs must both be read because MCEs were written to "mcelog" before I configured mcelog to write to /var/log/messages. Also, the data before approximately 3/31/08 just shows the fact that an MCE occurred but doesn't give the supporting details (it's included only for completeness). This is because I didn't have mcelog installed until then. Additionally, some of the 4/7 entries do not have details because I dorked the mcelog cron entry during an edit. It is fixed now. The mcelog is annotated with:
# #### nvidia 8600GT removed, driver blacklisted, ATI Radeon 1500 installed w/radeon driver #
to show when the nvidia card was changed and the nvidia kernel module removed. (Frequency of hardlocks is reduced, but MCEs are still reported and do occasionally hardlock.)
Grepping the files on ADDR | sort shows that the errors never occur at the same memory address. I really don't know what the ADDR means. I've been looking for some way to correlate "ADDR 2ba96974e8a0", for example, to what that means (video, main memory, bios, etc.). No luck so far.
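The "grep on ADDR | sort" step can also be sketched in Python, which makes it easy to count repeats and look at the magnitude of each value. This is only a sketch: it assumes each record carries a field of the form "ADDR <hex>", as in the example quoted above. `SAMPLE_LOG` is made-up input in which only the first address comes from this post, and the names `extract_addrs` and `summarize_addrs` are hypothetical. It does not interpret what ADDR refers to.

```python
import re
from collections import Counter

# Matches a field such as "ADDR 2ba96974e8a0" (format assumed from the post).
ADDR_PATTERN = re.compile(r"\bADDR[ \t]+([0-9a-fA-F]+)\b")

# Hypothetical sample text. Only the first address appears in the post;
# the other two are invented so the example has something to sort.
SAMPLE_LOG = """\
MCE 0
ADDR 2ba96974e8a0
MCE 1
ADDR 2b1f00c0ffe0
MCE 2
ADDR 2ad4deadb000
"""


def extract_addrs(log_text: str) -> list[int]:
    """Return every ADDR value found in the text, as integers."""
    return [int(m.group(1), 16) for m in ADDR_PATTERN.finditer(log_text)]


def summarize_addrs(addrs: list[int]) -> list[tuple[int, int]]:
    """Return (address, occurrence count) pairs sorted by address."""
    return sorted(Counter(addrs).items())


if __name__ == "__main__":
    for addr, count in summarize_addrs(extract_addrs(SAMPLE_LOG)):
        # addr >> 40 is the whole number of TiB below the address.
        print(f"{addr:#014x}  seen {count}x  offset ~{addr >> 40} TiB")
```

To run it against a real log, replace `SAMPLE_LOG` with the contents of the file, for example `open("mcelog").read()`. A repeated address would show up with a count above 1.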
If somebody smarter than I can tell what memory range we are dealing with, and hopefully what it means, it would be greatly appreciated.
Thanks!
P.S. - The complete collection of the AMD Technical Documentation for x86_64 and Opteron is also included in:
http://www.3111skyline.com/download/lockup_x86-64
if anyone is curious...
Hi David, does your bios support disabling video interrupts? If it does, disable it and try again. It's a pity mcelog is not a bit more specific about the actual instruction executed at time of exception. Regards Dave