Mailinglist Archive: opensuse-bugs (14006 mails)

[Bug 376165] New: AMD Multicore - Lockup/ Reboot with MCE errors After nvidia kernel module load/install
  • From: bugzilla_noreply@xxxxxxxxxx
  • Date: Tue, 1 Apr 2008 23:33:45 -0600 (MDT)
  • Message-id: <bug-376165-21960@xxxxxxxxxxxxxxxxxxxxxxxxx/>
https://bugzilla.novell.com/show_bug.cgi?id=376165


Summary: AMD Multicore - Lockup/Reboot with MCE errors After
nvidia kernel module load/install
Product: openSUSE 10.3
Version: Final
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@xxxxxxxxxxxxxxxxxxxxxx
ReportedBy: drankinatty@xxxxxxxxxxxxxxxxxx
QAContact: qa@xxxxxxx
Found By: ---


System: Tyan Tomcat 8KE S2865ANRF
datasheet: ftp://ftp.tyan.com/datasheets/d_s2865_100.pdf

Processor: Opteron 180 (Ver. 2.0)

Memory: 2G OCZ Platinum PC3200 (Timings 2-3-2)(certified OK)

Video Card: MSI Nvidia 8600GT Twin Turbo

Power Supply: Corsair 550W

The Problem:

The system locks up or reboots randomly but frequently with the nvidia
kernel module loaded. When a lockup occurs, the keyboard's Caps Lock and
Scroll Lock lights flash at approximately 1 second intervals. This occurs with
any simple load applied to the system (zypper refresh, running kwrite, grep,
etc.). Without a load, the system will idle for days.

The nvidia module was installed via the 1-click install, which provided the
packages nvidia-gfxG01-kmp-default-169.12_2.6.22.17_0.1-0.1 and
x11-video-nvidiaG01-169.12-0.1. The lockups produce machine check events (MCEs)
stating that it is a hardware problem. However, the problem is caused by the
"inclusion" of the nvidia kernel module. It may *not* be the module itself, but
it may be caused by some address map or similar issue brought into play by
loading the module. Judging from the list, this problem seems to affect 10.3
x86_64 installs with multicore AMD processors on some motherboard
chipsets/architectures.

The work around (in this case):

Unload and blacklist the nvidia kernel module and pass
"acpi_use_timer_override" to the kernel at boot. The basic "nv" driver is used
after unloading. There have been no lockups and no further MCE activity, even
while simultaneously running mprime, running XP in VirtualBox installing
updates, and deleting /var/cache/zypp/zypp.db and forcing a zypper refresh. Of
course, operating without the nvidia kernel module cripples graphics
performance.
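
A minimal sketch for confirming that both halves of the workaround are in
effect on a running system. It assumes the usual Linux procfs layout, where
/proc/modules lists one loaded module per line with the module name in the
first field, and /proc/cmdline holds the kernel boot parameters. The function
names are hypothetical and not part of any openSUSE tool.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return module names from /proc/modules content (first field per line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def workaround_status(proc_modules_text: str, cmdline_text: str) -> dict[str, bool]:
    """Report whether the nvidia module is absent and the boot parameter is set."""
    boot_params = cmdline_text.split()
    return {
        "nvidia_unloaded": "nvidia" not in loaded_modules(proc_modules_text),
        "timer_override_set": "acpi_use_timer_override" in boot_params,
    }


def check_running_system() -> dict[str, bool]:
    """Read the live procfs files (Linux only) and evaluate the workaround."""
    return workaround_status(
        Path("/proc/modules").read_text(),
        Path("/proc/cmdline").read_text(),
    )
```

Calling check_running_system() on the affected machine should report both
values as True once the module is unloaded and the parameter has been passed
at boot. It does not verify that the module is blacklisted for future boots.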

Discussion:

After performing a fresh 10.3 install on the system, without any install
problems and with md5sum-verified media, the machine began experiencing lockups
frequently. At the time, I saw no correlation between the nvidia graphics
driver install and the lockups. (I may have installed the driver later, along
with updates; the lockups then started sometime the next day.)

With the nvidia driver installed, the machine will "idle" for days at a
time until any load is applied, then the lockups occur.

RAM, thermal and motherboard hardware have all been eliminated as
possibilities.

RAM: memtested, plus physically shipped to OCZ and verified OK.

Thermal: The box is an Antec P182 case w/3 120mm fans. Core 1 temps
idle at 30 degrees C and are 37-38 under load. Core 2 temps idle at 23 and
average 30 under load. The BIOS PC Health and lm-sensors temps match almost
exactly. All are well short of the 74 degree safe operating limit and well
short of the 85 degree shutdown temp.

Motherboard Hardware: The motherboard has a BIOS code window (LCD) on
it, and I have caught all the BIOS codes (including all self tests);
all the codes say everything is OK. I have gone through the BIOS and turned
off anything unnecessary (Ser 1, Ser 2, Parallel, AC97 sound, etc.). It doesn't
make any difference.

The problem here lies with the Tyan S2865 (and other manufacturers')
board/chipset architecture running multicore AMD processors and the apparent
"hardware" failure caused by loading the nvidia kernel module. Again, the cause
is perhaps not the module itself, but something that results from loading it,
whether that be address mapping, address space, etc., which causes the lockup.
That is the core issue. The driver itself worked great until the system would
freeze.

Included Files:

I am including the hwinfo, the dmesg (with and without
"acpi_use_timer_override" applied), the mcelog (note the comment I placed at
the end of the file to mark when I unloaded the nvidia driver) and the syslog
(complete from the initial install on 3/8/08).

Close:

As with all these problems, I will provide you with any additional
information or testing that you want me to do. Just ask. This may be a good one
to take care of before 11.0 comes out as the use of multicore AMD processors
and boards is on the rise. Thanks


