[Bug 376165] New: AMD Multicore - Lockup/Reboot with MCE errors After nvidia kernel module load/install
https://bugzilla.novell.com/show_bug.cgi?id=376165

Summary: AMD Multicore - Lockup/Reboot with MCE errors After nvidia kernel module load/install
Product: openSUSE 10.3
Version: Final
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: drankinatty@suddenlinkmail.com
QAContact: qa@suse.de
Found By: ---

System: Tyan Tomcat 8KE S2865ANRF
  datasheet: ftp://ftp.tyan.com/datasheets/d_s2865_100.pdf
Processor: Opteron 180 (Ver. 2.0)
Memory: 2G OCZ Platinum PC3200 (Timings 2-3-2)(certified OK)
Video Card: MSI Nvidia 8600GT Twin Turbo
Power Supply: Corsair 550W

The Problem:

The system will lock up or reboot randomly but frequently with the nvidia kernel module loaded. If a lockup is experienced, the keyboard's Caps Lock and Scroll Lock lights flash at approximately 1-second intervals. This occurs with any simple load applied to the system (zypper refresh, running kwrite, grep, etc.). Without a load, the system will idle for days.

The nvidia module was installed via 1-click install, which provided the packages nvidia-gfxG01-kmp-default-169.12_2.6.22.17_0.1-0.1 and x11-video-nvidiaG01-169.12-0.1.

The lockups produce machine check events stating that it is a hardware problem. However, the problem is caused by the "inclusion" of the nvidia kernel module. It may *not* be the module itself; it may be caused by some address map or similar issue brought into play by loading the module. Judging from the list, this problem seems to affect 10.3 x86_64 installs with multicore AMD processors on some motherboard chipsets/architectures.

The workaround (in this case):

Unload and blacklist the nvidia kernel module and pass "acpi_use_timer_override" to the kernel at boot.
The basic "nv" driver is used after unloading. There are no lockups and no further MCE activity, even while simultaneously running mprime, running XP in VirtualBox installing updates, and deleting /var/cache/zypp/zypp.db and forcing a zypper refresh. Of course, operating without the nvidia kernel module cripples the graphics performance.

Discussion:

After a fresh 10.3 install on the system, with no install problems and md5sum-verified media, the machine began experiencing frequent lockups. At the time, there was no apparent correlation between the nvidia graphics driver install and the lockups. (I may have installed the driver later along with updates; the lockups then started sometime the next day.) With the nvidia driver installed, the machine will "idle" for days at a time until any load is applied, then the lockups occur.

RAM, thermal and motherboard hardware are all eliminated as possibilities.

RAM: memtested, plus physically shipped to OCZ and verified OK.

Thermal: The box is an Antec P182 case w/3 120mm fans. Core 1 temps idle at 30 degrees C and are 37-38 under load. Core 2 temps idle at 23 and average 30 under load. The BIOS PC Health and lm-sensors temps match almost exactly. All are well short of the 74 degree safe operating limit and well short of the 85 degree shutdown temp.

Motherboard Hardware: The motherboard has a BIOS code window (LCD), and I have caught all the BIOS codes (including all self tests); all the codes say everything is OK. I have gone through the BIOS and turned off anything non-essential (Ser 1, Ser 2, Parallel, AC97 sound, etc.). It doesn't make any difference.

The problem here lies with the Tyan S2865 (and other manufacturers') board/chipset architecture running multicore AMD processors and the apparent "hardware" failure caused by loading the nvidia kernel module. Again, the lockup is perhaps not caused by the module itself, but by a result of its loading, whether that be address mapping, address space, etc.
That is the core issue. The driver itself worked great until the system would freeze.

Included Files:

I am including the hwinfo, dmesg (with and without "acpi_use_timer_override" applied), the mcelog (note the comment I placed at the end of the file to mark when I unloaded the nvidia driver) and the syslog (complete from the initial install on 3/8/08).

Close:

As with all these problems, I will provide any additional information or do any testing that you want. Just ask. This may be a good one to take care of before 11.0 comes out, as the use of multicore AMD processors and boards is on the rise.

Thanks

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=376165
User drankinatty@suddenlinkmail.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=376165#c9
--- Comment #9 from David Rankin
Guys,
Here are additional MCEs captured in syslog but absent from /var/log/mcelog. The nvidia driver was *not* loaded when these occurred. They were apparently written only to syslog because the crash/reboot occurred before the next cron run of /usr/sbin/mcelog (currently set to run at 1-minute intervals).
The following are some of the MCEs caught in /var/log/messages:
Apr 1 01:35:01 nirvana /usr/sbin/cron[3761]: (root) CMD (/usr/sbin/mcelog --k8 --syslog)
Apr 1 01:35:50 nirvana kernel: [ 198.079706] Machine check events logged
Apr 1 01:36:01 nirvana /usr/sbin/cron[3766]: (root) CMD (/usr/sbin/mcelog --k8 --syslog)
Apr 1 01:36:01 nirvana mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Apr 1 01:36:01 nirvana mcelog: Please contact your hardware vendor
Apr 1 01:36:01 nirvana mcelog: CPU 0 1 instruction cache
Apr 1 01:36:01 nirvana mcelog: TSC 6f61bca83a
Apr 1 01:36:01 nirvana mcelog: ADDR 2b66e64040f0
Apr 1 01:36:01 nirvana mcelog:
Apr 1 01:36:01 nirvana mcelog: memory/cache error 'instruction fetch mem transaction, instruction transaction, level 1'
Apr 1 01:36:01 nirvana mcelog: STATUS 9400000000000151 MCGSTATUS 0
Apr 1 01:36:01 nirvana mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Apr 1 01:36:01 nirvana mcelog: Please contact your hardware vendor
Apr 1 01:36:01 nirvana mcelog: CPU 1 1 instruction cache
Apr 1 01:36:01 nirvana mcelog: TSC 6f61bd690e
Apr 1 01:36:01 nirvana mcelog: ADDR ffff804454f0
Apr 1 01:36:01 nirvana mcelog:
Apr 1 01:36:01 nirvana mcelog: bit62 = error overflow (multiple errors)
Apr 1 01:36:01 nirvana mcelog: memory/cache error 'instruction fetch mem transaction, instruction transaction, level 1'
Apr 1 01:36:01 nirvana mcelog: STATUS d400000000000151 MCGSTATUS 0
Apr 1 01:37:01 nirvana /usr/sbin/cron[3805]: (root) CMD (/usr/sbin/mcelog --k8 --syslog)
Apr 1 01:38:01 nirvana /usr/sbin/cron[3812]: (root) CMD (/usr/sbin/mcelog --k8 --syslog)
Apr 1 01:39:01 nirvana /usr/sbin/cron[3820]: (root) CMD (/usr/sbin/mcelog --k8 --syslog)
Apr 1 01:39:01 nirvana mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Apr 1 01:39:01 nirvana mcelog: Please contact your hardware vendor
Apr 1 01:39:01 nirvana mcelog: CPU 0 1 instruction cache
Apr 1 01:39:01 nirvana mcelog: TSC 9c52fbb5cd
Apr 1 01:39:01 nirvana mcelog: ADDR 77a8d270
Apr 1 01:39:01 nirvana mcelog:
Apr 1 01:39:01 nirvana mcelog: bit62 = error overflow (multiple errors)
Apr 1 01:39:01 nirvana mcelog: memory/cache error 'instruction fetch mem transaction, instruction transaction, level 1'
Apr 1 01:39:01 nirvana mcelog: STATUS d400000000000151 MCGSTATUS 0
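As a reading aid for the STATUS values in the log above, here is a minimal Python sketch that splits such a value into its flag bits and error code. The bit positions and the "0000 0001 RRRR TTLL" memory-error layout are assumptions taken from the generic x86 machine-check architecture, not from anything in this report, and the function and table names are made up for illustration. It is not how mcelog itself is implemented.

```python
# Sketch only: decode an MCi_STATUS value as printed by mcelog.
# ASSUMPTION: bit meanings follow the generic x86 machine-check layout.
FLAG_BITS = {
    63: "VAL (entry valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# ASSUMPTION: memory-hierarchy error codes have the form 0000 0001 RRRR TTLL.
REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "evict", 8: "snoop",
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status_hex):
    """Return the set flags and, for memory errors, the decoded error code."""
    status = int(status_hex, 16)
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits hold the error code
    result = {"flags": flags, "error_code": code}
    if code >> 8 == 0x01:  # memory-hierarchy error format
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = LEVEL[code & 0x3]
    return result


if __name__ == "__main__":
    for value in ("9400000000000151", "d400000000000151"):
        print(value, decode_status(value))
```

Under these assumptions, the two STATUS values in the log differ only in bit 62, which agrees with the "bit62 = error overflow" lines that mcelog prints for the d4... entries.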
The complete collection of the MCEs from /var/log/messages is contained in the attachment "mce_syslog" provided along with this post.
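For anyone who wants to pull the same records out of a syslog file themselves, the following is a rough Python sketch. It assumes the line layout seen in the excerpt in this comment (timestamp, host, "mcelog:", text) and that each record starts with a "HARDWARE ERROR" line. The function name and the record fields are hypothetical.

```python
import re

# ASSUMPTION: lines look like
#   "Apr 1 01:36:01 nirvana mcelog: STATUS d400000000000151 MCGSTATUS 0"
MCELOG_LINE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+mcelog:\s*(.*)$")


def collect_mce_records(lines):
    """Group mcelog syslog lines into one dict per reported error."""
    records = []
    current = None
    for line in lines:
        match = MCELOG_LINE.match(line)
        if not match:
            continue  # cron, kernel and other non-mcelog lines
        stamp, text = match.groups()
        if text.startswith("HARDWARE ERROR"):
            # Each record in the excerpt begins with this banner line.
            current = {"time": stamp}
            records.append(current)
        elif current is None:
            continue  # mcelog text seen before any banner; ignore it
        elif text.startswith("CPU "):
            current["cpu"] = int(text.split()[1])
        elif text.startswith("ADDR "):
            current["addr"] = text.split()[1]
        elif text.startswith("STATUS "):
            current["status"] = text.split()[1]
    return records
```

It could be run over the whole log with something like `collect_mce_records(open("/var/log/messages"))`, after which each record's "status" field can be inspected further.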
Let me know what else I can provide and I will respond as soon as I can. Also, if an account on the box would be helpful, we can arrange that as well.
Thanks!
NOTE: For testing purposes the "acpi_use_timer_override" kernel parameter was removed after the 4/3 reboot at 0400.