Mailinglist Archive: opensuse (2348 mails)

Re: [Bulk] Re: [Bulk] Re: [opensuse] Novell Bugzilla - At it Again - Bugs Apparently Dismissed Without Sufficient Investigation
  • From: Dave Plater <davejplater@xxxxxxxxx>
  • Date: Mon, 07 Apr 2008 09:37:17 +0200
  • Message-id: <47F9CF2D.9030100@xxxxxxxxx>
David C. Rankin wrote:
Dave Plater wrote:
David C. Rankin wrote:
Anders Johansson wrote:
On Friday 04 April 2008 17:15:45 David C. Rankin wrote:

You seem to be misunderstanding what "mce" is. A machine check
exception is the hardware itself telling you that something has gone
badly wrong. There is no interpretation involved in the software. The
software just logs the message.

If the MCE says it is a hardware problem, you can count on its being
a hardware problem.

Anders

Jan, Anders, List:

The more I read, and the more I test, the more I am concerned that
there may be a simmering issue with the x86_64 code. I installed a
plain-jane PCI-e ATI card running with the open source driver. Just as
with the nvidia 8600GT card (using the open source "nv" driver), the
system still gives occasional MCEs. Just as with the 8600, the MCEs do
not have any effect on the system. If I wasn't logging them with mcelog,
I would never know they were occurring.

Reading the tech-docs, it is readily apparent that MCE doesn't
necessarily mean hardware. Software is more than capable of causing them:

AMD64 Architecture
Programmer's Manual
Volume 2:
System Programming

2.6.6 New Exception Conditions

"The AMD64 architecture defines a number of new conditions that can cause an exception to occur when the processor is running in long mode. Many of the conditions occur when software attempts to use an address that is not in canonical form. See 'Vectors' on page 208 for information on the new exception conditions that can occur in long mode."

See: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
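
For readers unfamiliar with the term, here is a minimal Python sketch of what "canonical form" means in the passage quoted above. It assumes a 48-bit virtual address width, and the function name is made up for illustration. The ADDR value that mcelog reports may be a physical rather than a virtual address, so this only illustrates the definition and says nothing about the cause of the MCEs.

```python
def is_canonical(addr: int, virt_bits: int = 48) -> bool:
    """Return True if a 64-bit address is in canonical form.

    An address is canonical when bits 63 down to (virt_bits - 1) are all
    copies of bit (virt_bits - 1), i.e. the value is sign-extended.
    virt_bits=48 is an assumption, not something stated in the thread.
    """
    if not 0 <= addr < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    upper = addr >> (virt_bits - 1)               # bits 63..47 when virt_bits is 48
    all_ones = (1 << (64 - virt_bits + 1)) - 1    # 17 one-bits when virt_bits is 48
    return upper == 0 or upper == all_ones


if __name__ == "__main__":
    # 0x2ba96974e8a0 is the example ADDR value quoted later in the thread.
    for value in (0x2BA96974E8A0, 0x0000800000000000, 0xFFFF800000000000):
        print(f"{value:#018x} canonical: {is_canonical(value)}")
```

Under the 48-bit assumption, the lower half of the address space ends at 0x00007fffffffffff and the upper half starts at 0xffff800000000000; anything in between is non-canonical.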

See Also:

AMD64 -

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00.html

Opteron Specific -

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html

My question is, "What type of additional logging or data capture should I be doing in hopes of catching or narrowing down what the real cause of the MCE is?" I'm capturing the MCEs with mcelog, which runs every minute under cron to ensure the buffers never get filled. But beyond that, I'm not doing any other special logging. The only hardware I haven't changed is the motherboard, and that tests fine. What else could I run/log/set that would give me the best chance of finding the real culprit?

Any help is much appreciated.

Hi, you're logging handled exceptions; you need to make the machine crash under stress, and the last exception will be the one. The most common cause of a software crash is a divide by zero, but they may even have fixed that; my experience stops at the P3.
I've followed this thread with great interest.
Regards
Dave



Thank you Dave,

I'm pulling my hair out on this one. One thing I haven't done is to post the actual MCEs I'm seeing. The mcelog and the syslog containing the MCEs are here:

http://www.3111skyline.com/download/lockup_x86-64/mcelog

http://www.3111skyline.com/download/lockup_x86-64/messages_20080407

Both logs must be read, because MCEs were written to "mcelog" before I configured mcelog to write to /var/log/messages. Also, the data before approximately 3/31/08 shows only that an MCE occurred, without the supporting details, because I didn't have mcelog installed until then (it is included just for completeness). Additionally, some of the 4/7 entries do not have details because I dorked the mcelog cron entry during an edit. It is fixed now. The mcelog is annotated with:

#
#### nvidia 8600GT removed, driver blacklisted, ATI Radeon 1500 installed w/radeon driver
#

To show when the nvidia card was changed and the nvidia kernel module removed. (The frequency of hard locks is reduced, but MCEs are still reported and the system does occasionally hard lock.)

Grepping the files for ADDR and piping the result through sort shows that the errors never occur at the same memory address. I really don't know what the ADDR means. I've been looking for some way to correlate "ADDR 2ba96974e8a0", for example, to what that means (video, main memory, BIOS, etc.). No luck so far.
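
The same check can be scripted. Below is a rough Python sketch, assuming each record carries the address as the word ADDR followed by a hexadecimal value, as in the single example quoted above. The function names are made up for illustration, and every sample value other than 2ba96974e8a0 is invented.

```python
import re
from collections import Counter

# Assumed layout: the token "ADDR", whitespace, then a hex value.
ADDR_RE = re.compile(r"\bADDR\s+([0-9a-fA-F]+)\b")


def addr_counts(log_text: str) -> Counter:
    """Count how often each ADDR value appears in the log text."""
    return Counter(int(m.group(1), 16) for m in ADDR_RE.finditer(log_text))


def repeated_addrs(log_text: str) -> list:
    """Return the addresses reported more than once, in ascending order."""
    return sorted(a for a, n in addr_counts(log_text).items() if n > 1)


if __name__ == "__main__":
    # Hypothetical sample; only the first address comes from the thread.
    sample = "ADDR 2ba96974e8a0\nADDR 1f00\nADDR 2ba96974e8a0\n"
    for addr in repeated_addrs(sample):
        print(f"{addr:#x} appears more than once")
```

An empty result from repeated_addrs matches the observation that no address repeats. To run it against a real log, read the file into a string and pass that in place of the sample.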

If somebody smarter than I can tell what memory range we are dealing with, and hopefully what it means, it would be greatly appreciated.

Thanks!

P.S. - The complete collection of the AMD Technical Documentation for x86_64 and Opteron is also included in:

http://www.3111skyline.com/download/lockup_x86-64

if anyone is curious...

Hi David, does your BIOS support disabling video interrupts?
If it does, disable them and try again. It's a pity mcelog is not a bit more specific about the actual instruction being executed at the time of the exception.
Regards
Dave
