Mailinglist Archive: opensuse (2348 mails)

Re: [opensuse] Novell Bugzilla - At it Again - Bugs Apparently Dismissed Without Sufficient Investigation
  • From: "David C. Rankin" <drankinatty@xxxxxxxxxxxxxxxxxx>
  • Date: Sun, 06 Apr 2008 22:14:52 -0500
  • Message-id: <47F991AC.4030209@xxxxxxxxxxxxxxxxxx>
Anders Johansson wrote:
> On Friday 04 April 2008 17:15:45 David C. Rankin wrote:
>
> You seem to be misunderstanding what "mce" is. A machine check
> exception is the hardware itself telling you that something has gone
> badly wrong. There is no interpretation involved in the software. The
> software just logs the message.
>
> If the mce says it is a hardware problem, you can count on its being
> a hardware problem.
>
> Anders

Jan, Anders, List:

The more I read, and the more I test, the more concerned I am that
there may be a simmering issue with the x86_64 code. I installed a
plain-Jane PCIe ATI card running with the open source driver. Just as
with the NVIDIA 8600GT card (using the open source "nv" driver), the
system still gives occasional MCEs. Just as with the 8600, the MCEs do
not have any effect on the system. If I weren't logging them with mcelog,
I would never know they were occurring.
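
One way I could check whether these are corrected errors, which would explain why nothing visibly breaks, is to decode the STATUS value that mcelog prints for each event. What follows is only a minimal sketch: the flag positions are the architectural MCi_STATUS bits, but decode_status is a name I made up, and the sample value is invented for illustration and is not taken from my logs.

```python
# Bit positions of the architectural flags in a 64-bit MCi_STATUS value.
FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status):
    """Return the set flag names, the low 16-bit MCA error code,
    and whether the record describes a hardware-corrected error."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,
        "corrected": "VAL" in flags and "UC" not in flags,
    }


# Hypothetical STATUS value, made up for illustration only.
sample = 0x9400000000000151
print(decode_status(sample))
```

If UC is clear on every event, the hardware fixed each error on its own, which would at least be consistent with the system carrying on as if nothing happened.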

Reading the tech docs, it is readily apparent that an MCE doesn't
necessarily mean a hardware fault. Software is more than capable of
causing them:

AMD64 Architecture Programmer's Manual, Volume 2: System Programming

2.6.6 New Exception Conditions

"The AMD64 architecture defines a number of new conditions that can cause an exception to occur when the processor is running in long mode. Many of the conditions occur when software attempts to use an address that is not in canonical form. See “Vectors” on page 208 for information on the new exception conditions that can occur in long mode."

See: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
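
For anyone unfamiliar with the term in that passage, "canonical form" means the unused upper bits of a 64-bit virtual address are all copies of the highest implemented bit. A minimal sketch of the check, assuming 48-bit virtual addresses (the is_canonical name and the VA_BITS constant are mine, not from the manual):

```python
VA_BITS = 48  # assumption: the processor implements 48-bit virtual addresses


def is_canonical(addr, va_bits=VA_BITS):
    """True if bits 63 down to (va_bits - 1) of a 64-bit address are all equal."""
    upper = (addr & 0xFFFFFFFFFFFFFFFF) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # top of the lower half
print(is_canonical(0xFFFF800000000000))  # bottom of the upper half
print(is_canonical(0x0000800000000000))  # inside the non-canonical gap
```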

See Also:

AMD64 -

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00.html

Opteron Specific -

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html

My question is, "What type of additional logging or data capture should I be doing in hopes of catching or narrowing down what the real cause of the MCE is?" I'm capturing the MCEs with mcelog, which runs every minute under cron to ensure the buffers never fill up. Beyond that, I'm not doing any other special logging. The only hardware I haven't changed is the motherboard, and that tests fine. What else could I run, log, or set that would give me the best chance of finding the real culprit?
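
The only extra step I can think of myself is to tally the logged events by CPU and bank to see whether they cluster in one place. Here is a rough sketch of that idea. The line shape it parses ("CPU <n> <bank> ..." or "CPU <n> BANK <bank>") is my assumption about what mcelog writes and may differ between versions, tally_by_cpu_bank is a made-up helper name, and the sample text is invented, not copied from my log.

```python
import re
from collections import Counter

# Assumed line shape: "CPU <n> <bank> <description>" or "CPU <n> BANK <bank>".
CPU_LINE = re.compile(r"^CPU\s+(\d+)\s+(?:BANK\s+)?(\d+)\b")


def tally_by_cpu_bank(log_text):
    """Count logged events per (cpu, bank) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Invented sample for illustration only; real entries carry more fields.
sample_log = """\
MCE 0
CPU 0 4 northbridge TSC 1a2b3c
STATUS 9400000000000151 MCGSTATUS 0
MCE 1
CPU 1 4 northbridge TSC 1a2b9f
STATUS 9400000000000151 MCGSTATUS 0
MCE 2
CPU 0 4 northbridge TSC 2b0000
STATUS 9400000000000151 MCGSTATUS 0
"""

for (cpu, bank), n in tally_by_cpu_bank(sample_log).most_common():
    print(f"CPU {cpu} bank {bank}: {n} event(s)")
```

In practice I would feed it the contents of whatever file cron appends the mcelog output to, rather than the inline sample.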

Any help is much appreciated.

--
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com