On Fri, Apr 4, 2008 at 9:20 AM, Anders Johansson <ajh@rydsbo.net> wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message.
If the mce says it is a hardware problem, you can count on its being a hardware problem.
Anders
No, you can't count on that, Anders. Do some research on MCE errors and you will find that they are often reported when there is absolutely nothing wrong with the machine. In fact, Dell had a huge thread on their internal blog about MCE errors reported by Linux users when Core 2 Duo machines arrived. They were more than a little miffed at getting calls because some developer of the mce package with a swollen head put in language insisting the problem was hardware, when others had clearly demonstrated that you could reach that part of the code with no hardware error at all.

It's quite possible for software bugs to hose things so badly that the mce modules think there was an error. Further, part of the mce software's job is to filter out the bogus MCE errors (or so says someone who shall remain nameless but whose email address is ak@suse.de). Now, if the software's job is to filter out bogus MCE events, that is a de facto assertion that lots of these events are bogus.

I've seen these in the past as well. Mine had to do with runaway keys, and the clue was the bit about TSC. Dual cores can get their timers to disagree to the point that it forces a failure. You would often see this with SpeedStep or PowerNow enabled, but simply locking the machine at the high-power setting would avoid the problem. For me, the nohpet kernel command-line parameter was required under SUSE 10.1. That solved all my instances. But that was on a Core 2 Duo.

--
----------JSA---------