Mailinglist Archive: opensuse (2348 mails)

Re: [opensuse] Unstable system - culprit identified
  • From: "David C. Rankin" <drankinatty@xxxxxxxxxxxxxxxxxx>
  • Date: Tue, 01 Apr 2008 09:43:04 -0500
  • Message-id: <47F249F8.5060506@xxxxxxxxxxxxxxxxxx>
Jan Engelhardt wrote:

On Tuesday 2008-04-01 06:33, David C. Rankin wrote:

I have moved /etc/modprobe.d/nvidia and blacklisted nvidia and it completed the refresh without issue. The nvidia module may be the culprit.
While I have nvidia unloaded, what is a good torture test I can try? Kick off a kernel recompile? (If so, how would I do it so it doesn't mess with my working kernel?) Or do you have another suggestion for torture testing?

zypper refresh, running mprime, SETI; kernel compiles also fit the
bill. Just do whatever you did before.


Jan,

You did it again! You solved the unsolvable. zypper refresh ran, and mprime ran overnight alongside vbox while XP did more updates and installs, with "zero" errors or warnings and "zero" additional mcelog errors:

02:26 nirvana~/tmp> sudo tail /var/log/mcelog
^^^^^
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 1 instruction cache TSC 6ed6056dfe
ADDR 2ad2c6d787a0
bit62 = error overflow (multiple errors)
memory/cache error 'instruction fetch mem transaction, instruction transaction, level 1'
STATUS d400000000000151 MCGSTATUS 0
#
#### nvidia driver blacklisted again
#

(no more errors)

08:44 nirvana~/tmp> sudo tail /var/log/mcelog
^^^^^
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 1 instruction cache TSC 6ed6056dfe
ADDR 2ad2c6d787a0
bit62 = error overflow (multiple errors)
memory/cache error 'instruction fetch mem transaction, instruction transaction, level 1'
STATUS d400000000000151 MCGSTATUS 0
#
#### nvidia driver blacklisted again
#

(still no more errors)

Woohoo, no new errors, no lockups, no reboots! The box was just happily crunching along on mprime.
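
For anyone repeating this kind of overnight test, comparing two tail outputs by eye works, but a checksum snapshot makes the "nothing new was logged" check less error-prone. The following is only a minimal sketch, assuming the log lives at /var/log/mcelog as above; the function names and the snapshot path are made up for illustration and are not part of mcelog.

```shell
#!/bin/sh
# Hypothetical helpers for checking whether a log file changed during a
# stress test. The names are illustrative; nothing here ships with mcelog.

# Save a checksum of the log before starting the stress test.
#   $1 = log file, $2 = snapshot file
mce_snapshot() {
    cksum < "$1" > "$2"
}

# Succeed (exit 0) if the log is byte-for-byte unchanged since the snapshot,
# fail (exit 1) otherwise.
#   $1 = log file, $2 = snapshot file
mce_unchanged() {
    [ "$(cksum < "$1")" = "$(cat "$2")" ]
}

# Usage sketch (run as root if the log is not world-readable):
#   mce_snapshot /var/log/mcelog /tmp/mcelog.before
#   ... run mprime, zypper refresh, and so on ...
#   mce_unchanged /var/log/mcelog /tmp/mcelog.before && echo "no new MCEs"
```

Reading from standard input keeps the file name out of the cksum output, so the comparison depends only on the log contents.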

Now, where do I go from here? I still don't fully understand where or with whom the error lies. Sure, removing the nvidia driver in combination with passing the acpi_use_timer_override kernel parameter seems to have fixed it, but to whom do we report the problem so it can be fixed? It looks like an nvidia issue, but with the driver being proprietary, does it fall on deaf ears? Or does it look like an address space problem, where the nvidia module is competing with some other process for address space because of the chipset/architecture, which would make it a kernel issue? Should I report it through bugzilla.novell and have them get the nvidia guys on board?
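
Whoever ends up receiving the report will probably want the exact state of the box recorded, since the fix involved two changes at once. A rough sketch for capturing that, assuming the usual /proc/cmdline and /proc/modules files; the two function names are hypothetical, and the optional second argument exists only so the checks can be pointed at a saved copy of either file:

```shell
#!/bin/sh
# Hypothetical helpers for recording test conditions in a bug report.

# Succeed if the named parameter appears on the kernel command line,
# either bare (param) or with a value (param=value).
#   $1 = parameter name, $2 = cmdline file (defaults to /proc/cmdline)
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -q -x -e "$1" -e "$1=.*"
}

# Succeed if the named module is currently loaded.
#   $1 = module name, $2 = modules file (defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Usage sketch:
#   has_kernel_param acpi_use_timer_override && echo "timer override: on"
#   module_loaded nvidia || echo "nvidia module: not loaded"
```

Running both checks before and after each stress run would at least show which of the two changes was in effect when the errors stopped.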

What are your thoughts, Jan? List?


--
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com