Stability tweaks for nforce3 and 2.6 kernel
Hi,

I have an AMD Athlon(tm) 64 Processor 3200+ on a Gigabyte GA-K8NNXP motherboard, which uses the NVIDIA nForce3 150, plus an MSI FX5700-TD AGP graphics card, which uses the NVIDIA GeForce FX 5700. I am running SuSE Linux Professional 9.0 for AMD64, with the following supplementary changes:
1) KDE 3.2.1. This required that I remove the package kdebase3-suse.
2) Kernel 2.6.4, specifically kernel-default 2.6.4-5.
3) Nvidia driver 1.0-5332 for 2.6 kernel from http://www.minion.de

I have been having stability problems with my hardware and various combinations of software since I first assembled my machine. See
http://lists.suse.com/archive/suse-amd64/2004-Feb/0182.html

One recurring theme was that if I ran programs which did many memory allocations while I was using X11, I would get intermittent and unpredictable segfaults. Well, cross fingers and all that, I haven't been getting that problem any more.

What follows is a list of some of the things I did to get a stable system. No doubt, not all of them are necessary.

0) I used software versions as listed above.

1) As recommended, I updated my BIOS to the latest (F12):
http://www.giga-byte.com/Motherboard/Support/BIOS/BIOS_GA-K8NNXP.htm

2) Crucially, I added the kernel parameter "iommu=noagp".
I found a number of places where the nForce3, AGP and the Linux kernel are discussed. In particular, there is an old thread which discusses an interaction between previous generation AMD processors and NVIDIA AGP GART:
http://lwn.net/2002/0124/a/athlon-agp-problem.php3
On a hunch, I tried the kernel parameter "iommu=noagp" as documented at
http://www.x86-64.org/lists/discuss/msg03747.html
I made this change yesterday and it seems to work, so far. With this parameter, I have so far not seen any more segfaults. X seems to be stable as well.
My boot options now say:
kernel (hd0,1)/vmlinuz root=/dev/hda5 apm=off noapic iommu=noagp pci=noacpi vga=0x31a desktop hdc=ide-scsi hdclun=0 showopts
I probably still need to change a few of these (e.g. ide-scsi is deprecated for 2.6), but I would like to leave them as they are for a week to test stability.

3) I also had a problem in X, where SHIFT-BACKSPACE would kill X, similar to what is mentioned at
http://linux.derkeiler.com/Newsgroups/alt.os.linux.suse/2003-11/0420.html
I looked at /etc/X11/XF86Config and removed a line which contained
Option "XkbVariant" "pc105"
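To confirm that an option such as "iommu=noagp" actually reached the running kernel, the command line can be read back from /proc/cmdline. The following is a minimal Python sketch and not part of the original post. The helper names parse_cmdline and read_running_cmdline are made up for illustration, and the example string is the boot line quoted above without the GRUB "kernel (hd0,1)/vmlinuz" prefix.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_running_cmdline(path: str = "/proc/cmdline") -> dict:
    """Parse the command line of the running kernel (Linux only)."""
    return parse_cmdline(Path(path).read_text())


# Boot options as quoted in the post (illustrative input, not read from disk).
example = ("root=/dev/hda5 apm=off noapic iommu=noagp pci=noacpi "
           "vga=0x31a desktop hdc=ide-scsi hdclun=0 showopts")
params = parse_cmdline(example)
print(params["iommu"])     # noagp
print("noapic" in params)  # True
```

If an option is repeated on the command line, this sketch keeps only the last value.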
Hi all,
Sorry to report, my system is still not stable. Details below.
Best regards

On Saturday 10 April 2004 11:51, Paul C. Leopardi wrote:
I have been having stability problems with my hardware and various combinations of software since I first assembled my machine. What follows is a list of some of the things I did to get a stable system. No doubt, not all of them are necessary.
After submitting the last message, I started a large make, and two things happened:
1) The screen and keyboard froze, including the cursor.
2) After reboot, I saw the following error message in the logs:

Kernel BUG at rmap:569
invalid operand: 0000 [1]
CPU 0
Pid: 43, comm: kswapd0 Tainted: P
RIP: 0010:[try_to_unmap_one+163/608] <ffffffff80162793>{try_to_unmap_one+163}
RIP: 0010:[<ffffffff80162793>] <ffffffff80162793>{try_to_unmap_one+163}
Call Trace:
<ffffffff80162eb6>{try_to_unmap+1382}
<ffffffff80159b01>{shrink_zone+2353}
<ffffffff8015a2aa>{balance_pgdat+346}
<ffffffff8015a5a5>{kswapd+309}
<ffffffff801334a0>{autoremove_wake_function+0}
<ffffffff801334a0>{autoremove_wake_function+0}
<ffffffff8010febf>{child_rip+8}
<ffffffff8015a470>{kswapd+0}
<ffffffff8010feb7>{child_rip+0}

PS. I've also checked the logs to see the effect of "iommu=noagp" on the NVIDIA kernel module:

Before:
0: nvidia: loading NVIDIA Linux x86_64 NVIDIA Kernel Module 1.0-5332 Fri Jan 9 12:42:32 PST 2004
agpgart: Found an AGP 3.0 compliant device at 0000:00:00.0.
agpgart: Putting AGP V3 device at 0000:00:00.0 into 8x mode
agpgart: Putting AGP V3 device at 0000:02:00.0 into 8x mode

After:
0: nvidia: loading NVIDIA Linux x86_64 NVIDIA Kernel Module 1.0-5332 Fri Jan 9 12:42:32 PST 2004
0: NVRM: AGPGART: unable to retrieve symbol table
0: NVRM: not using NVAGP, kernel was compiled with GART_IOMMU support!!
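The Before/After comparison of the NVIDIA module messages could be automated when many boot configurations are being tried. This is a rough Python sketch under stated assumptions: the marker strings are copied from the two excerpts above and may be worded differently by other driver or kernel versions, and the function name agp_status and its return labels are invented for illustration.

```python
def agp_status(log_text: str) -> str:
    """Classify AGP use from kernel log text.

    Returns "no-agp" if the NVIDIA module reports it is not using NVAGP,
    "agpgart-active" if agpgart reports putting a device into an AGP mode,
    and "unknown" if neither marker is present.
    """
    agpgart_seen = False
    for line in log_text.splitlines():
        if "NVRM: not using NVAGP" in line:
            return "no-agp"
        if "agpgart: Putting AGP" in line:
            agpgart_seen = True
    return "agpgart-active" if agpgart_seen else "unknown"


before = """\
agpgart: Found an AGP 3.0 compliant device at 0000:00:00.0.
agpgart: Putting AGP V3 device at 0000:00:00.0 into 8x mode
agpgart: Putting AGP V3 device at 0000:02:00.0 into 8x mode
"""
after = """\
0: NVRM: AGPGART: unable to retrieve symbol table
0: NVRM: not using NVAGP, kernel was compiled with GART_IOMMU support!!
"""
print(agp_status(before))  # agpgart-active
print(agp_status(after))   # no-agp
```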
Hi,
A little more info below.
Best regards

On Saturday 10 April 2004 13:49, Paul C. Leopardi wrote:
Hi all, Sorry to report, my system is still not stable. Details below. After submitting the last message, I started a large make, and two things happened: 1) The screen and keyboard froze, including the cursor. 2) After reboot, I saw the following error message in the logs:
I tried a large make again. It succeeded. So I tried running a long computation. After a while, screen and keyboard froze again. This time, no error message in the logs. Freeze seems to correlate with the job taking up a large proportion of memory, or the virtual memory having to go into swap. It does not happen all the time.

I then tried re-adding "acpi=off" to kernel parameters, and tried running another long job. This time I got a segfault.

So, I'm going to take "acpi=off" out again to see if I can get a segfault, or if I only get freezes. I'll also try running from run level 3. I'll also try Option "NvAgp" "1" in /etc/X11/XF86Config as it states at http://minion.de/ and finally, I will try contacting zander@mail.minion.de.
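Since the freezes and segfaults seem to correlate with memory pressure, one way to probe this independently of any particular application would be a fill-and-verify test: allocate buffers, fill them with a known byte value, and check later that nothing has changed. The Python sketch below is only an illustration of that idea, not something used in this thread. The function name fill_and_verify and its parameters are hypothetical, and reaching swap requires a total size larger than free RAM, which can make the machine unresponsive.

```python
def fill_and_verify(chunk_mb: int, chunks: int, passes: int = 2) -> int:
    """Fill `chunks` buffers of `chunk_mb` MiB each with a known byte value,
    then re-read every buffer `passes` times and count failed checks."""
    size = chunk_mb * 1024 * 1024
    buffers = []
    for i in range(chunks):
        value = (i * 37 + 11) % 256  # a different byte value per buffer
        buffers.append((value, bytearray([value]) * size))

    failed_checks = 0
    for _ in range(passes):
        for value, buf in buffers:
            # Every byte should still equal `value`; count() scans the buffer.
            if buf.count(value) != size:
                failed_checks += 1
    return failed_checks


# Small demonstration run (4 MiB in total); 0 is expected on a healthy machine.
print(fill_and_verify(chunk_mb=1, chunks=4))
```

A result other than 0 would point at the hardware or kernel rather than at the program being tested, but a result of 0 proves little unless the run is large enough to reproduce the swapping conditions described above.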
I had the same problems (nForce3 chipset). I solved them by building my own 2.6 kernel.
-- Bob
Results so far, below. On Saturday 10 April 2004 18:03, Paul C. Leopardi wrote:
So, I'm going to take "acpi=off" out again to see if I can get a segfault, or if I only get freezes. I'll also try running from run level 3.

Without "acpi=off", booted to runlevel 5 then "init 3", so X is not running.
-> Still get segfault at point of swapping
-> Nvidia module is loaded
With "acpi=off", without "desktop",
booted to runlevel 3, checked Nvidia module not loaded.
Tested by running 2 copies of GluCat 0.1.5 (http://glucat.sf.net) gfft_test in
parallel.
-> Still get segfault at point of swapping
Tested by running 2 copies of make for GluCat 0.1.5 in parallel.
-> No segfault, but did not go into swap
Tested by running 1 copy of GluCat 0.1.5 gfft_test.
-> Still get segfault at point of swapping
Tested by using gdb to run GluCat 0.1.5 gfft_test.
-> Still get segfault at point of swapping
-> in gdb: Segmentation fault
-> std::__default_alloc_template
I'll also try Option "NvAgp" "1" in /etc/X11/XF86Config as it states at http://minion.de/ and finally, I will try contacting zander@mail.minion.de.
Not yet. Since "iommu=noagp" does not prevent crashes even in runlevel 3 with no nvidia driver loaded and with X not running, need to try "agp=off" first. See separate email for results.

Preliminary analysis:

In case you are wondering, I have tried GluCat 0.1.5 on the following architectures without seeing these segfault problems:
1) Red Hat Linux 8.0 using g++ 3.2 and Boost 1.30.2 on an Intel(R) Pentium(R) 4.
2) SuSE SEL 8 using g++ 3.2.2 and Boost 1.30.2 on a dual AMD Opteron(tm).
Current testing is on:
3) SuSE Linux 9.0 using g++ 3.3.2 and Boost 1.31.0 on an AMD Athlon(tm) 64.

Also, noticed that gdb itself crashed with a segfault. So, I'll bet the problem is not in my code.

Finally, the segfaults I am seeing here may be different to the ones I saw before "iommu=noagp". The earlier segfaults may have happened before swapping and seemed to be associated with X activity. The current segfaults seem to happen when swapping. I'm starting to suspect DMA.
Hi all, I've done testing for the moment. No luck. Summary below. On Sunday 11 April 2004 10:05, Paul C. Leopardi wrote:
I'll also try Option "NvAgp" "1" in /etc/X11/XF86Config as it states at http://minion.de/ and finally, I will try contacting zander@mail.minion.de.
Not yet. Since "iommu=noagp" does not prevent crashes even in runlevel 3 with no nvidia driver loaded and with X not running, need to try "agp=off" first. See separate email for results.
Here is a brief summary showing what happened when I tried "agp=off" with
various configurations. I had quite a number of other messages, most of which
I don't understand, and have kept more extensive logs.
Bottom line is that I have found no configuration stable enough to prevent gdb
from segfaulting or gpfing.
I noticed that the general protection fault did not happen the first time I ran
the program after booting, but often happened the second time, if the first
test was large enough to go into swap, and swap was non-zero when I started
the second test. Otherwise the fault happened the 3rd, 4th or 5th time.
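Because the faults seem to depend on whether swap was already in use when a test started, it may help to record swap usage before each run. The following Python sketch parses text in the /proc/meminfo format and computes used swap from the SwapTotal and SwapFree fields. The helper names and the sample numbers are hypothetical and are not taken from this machine.

```python
def parse_meminfo(text: str) -> dict:
    """Return {field: value in kB} for lines of /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def swap_used_kb(info: dict) -> int:
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


# Hypothetical sample; on a real system pass open("/proc/meminfo").read().
sample = """\
MemTotal:      1024000 kB
SwapTotal:     2000000 kB
SwapFree:      1500000 kB
"""
print(swap_used_kb(parse_meminfo(sample)))  # 500000
```

Logging this value together with the boot options and the test outcome for every run would show whether non-zero swap at the start of a test really predicts a fault.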
I'm not sure what my next step is, but I could try more testing with more
complete logging of configuration changes. If I was to report a bug to NVIDIA
or to Gigabyte Technology, I'm still not sure what I would be reporting.
Perhaps the next best thing would be to try to reproduce the problem on the
UNSW School of Mathematics cluster, which runs SuSE Enterprise Linux 8 on
dual Opteron nodes. The kernel and toolchain differ, as well as the hardware,
but the architecture is still AMD64, and I can run the same test program,
with more main memory and no possible interference from X11.
Best regards
1) 2.6.4-5 runlevel 5 with "nvidia" driver
apm=off acpi=off noapic agp=off pci=noacpi vga=0x31a desktop
hdc=ide-scsi hdclun=0 3 console=tty0
*** cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
reg02: base=0xd0000000 (3328MB), size= 16MB: write-combining, count=1
init: Switching to runlevel: 5
nvidia: module license 'NVIDIA' taints kernel.
nvidia: loading NVIDIA Linux x86_64 NVIDIA Kernel Module
1.0-5332 Fri Jan 9 12:42:32 PST 2004
NVRM: AGPGART: unable to retrieve symbol table
NVRM: not using NVAGP, kernel was compiled with GART_IOMMU support!!
*** glucat-0.1.5 gfft_test 11 gave general protection and segfaults
2) 2.6.4-5 Failsafe runlevel 3
showopts ide=nodma apm=off acpi=off noapic agp=off vga=normal
iommu=noforce maxcpus=0 hdc=ide-scsi hdclun=0 3
*** cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
*** nvidia kernel module was not loaded.
*** glucat-0.1.5 gfft_test 11 gave general protection and segfaults
*** gdb exited prematurely.
3) 2.4.21-209 Failsafe runlevel 2
showopts ide=nodma apm=off acpi=off noapic agp=off vga=normal
iommu=noforce maxcpus=0 hdc=ide-scsi hdclun=0 3
*** cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
*** nvidia kernel module was not loaded.
*** glucat-0.1.5 gfft_test 10 gave segfaults when run in gdb
Program received signal SIGSEGV, Segmentation fault.
0x0000002a956f9797 in
std::__default_alloc_template
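The configurations listed above differ in several boot options at once, which makes it hard to attribute a result to any single option. A small Python sketch like the one below could list exactly which options changed between two runs. The function name option_diff is made up for illustration; the two option strings are copied from configurations 1) and 2) above.

```python
def option_diff(a: str, b: str):
    """Return (options only in a, options only in b), each sorted."""
    set_a, set_b = set(a.split()), set(b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a)


config1 = ("apm=off acpi=off noapic agp=off pci=noacpi vga=0x31a desktop "
           "hdc=ide-scsi hdclun=0 3 console=tty0")
config2 = ("showopts ide=nodma apm=off acpi=off noapic agp=off vga=normal "
           "iommu=noforce maxcpus=0 hdc=ide-scsi hdclun=0 3")

only_1, only_2 = option_diff(config1, config2)
print(only_1)  # ['console=tty0', 'desktop', 'pci=noacpi', 'vga=0x31a']
print(only_2)  # ['ide=nodma', 'iommu=noforce', 'maxcpus=0', 'showopts', 'vga=normal']
```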
On Sat, 10 Apr 2004 11:51:25 +1000
"Paul C. Leopardi"
2) Crucially, I added the kernel parameter "iommu=noagp"
This just means you have effectively disabled AGP. A more straightforward setting would have been "agp=off" or disabling use of AGP in your X server configuration.

Most likely you have a buggy BIOS that sets up incorrect MTRRs.

Another great way to increase stability in a system is to not use the binary only Nvidia driver.

-Andi
Andi,
Thanks for your response. Answers below.
Best regards

On Saturday 10 April 2004 21:03, Andi Kleen wrote:
2) Crucially, I added the kernel parameter "iommu=noagp"

This just means you have effectively disabled AGP. A more straightforward setting would have been "agp=off" or disabling use of AGP in your X server configuration.

I will try "agp=off" and will test again, both in runlevel 3 without the X server, and in runlevel 5.
Most likely you have a buggy BIOS that sets up incorrect MTRRs.

How do I check this? Should I just look at /proc/mtrr or is there a better way?

leopardi@linfinit:/proc> cat mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
reg02: base=0xd0000000 (3328MB), size= 16MB: write-combining, count=1
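As a starting point for inspecting /proc/mtrr, the output could be parsed to total the write-back coverage and to list any registers whose ranges overlap. The Python sketch below assumes the line format shown above, with sizes given in MB; parse_mtrr and overlaps are illustrative names. It cannot show that the BIOS settings are correct: an overlap is not necessarily an error, and the write-back total still has to be compared by hand against the installed RAM, which is not stated in this thread.

```python
import re

MTRR_LINE = re.compile(
    r"reg(\d+): base=0x([0-9a-fA-F]+) \(\s*\d+MB\), "
    r"size=\s*(\d+)MB: ([\w-]+), count=\d+"
)


def parse_mtrr(text: str) -> list:
    """Parse /proc/mtrr text into (reg, base_bytes, size_bytes, type) tuples."""
    regions = []
    for line in text.splitlines():
        match = MTRR_LINE.search(line)
        if match:
            reg, base, size_mb, kind = match.groups()
            regions.append((int(reg), int(base, 16), int(size_mb) << 20, kind))
    return regions


def overlaps(regions: list) -> list:
    """Return pairs of register numbers whose address ranges intersect."""
    pairs = []
    for i, (reg_a, base_a, size_a, _) in enumerate(regions):
        for reg_b, base_b, size_b, _ in regions[i + 1:]:
            if base_a < base_b + size_b and base_b < base_a + size_a:
                pairs.append((reg_a, reg_b))
    return pairs


sample = """\
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
reg02: base=0xd0000000 (3328MB), size= 16MB: write-combining, count=1
"""
regions = parse_mtrr(sample)
write_back_mb = sum(size for _, _, size, kind in regions
                    if kind == "write-back") >> 20
print(write_back_mb)      # 1024
print(overlaps(regions))  # []
```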
Another great way to increase stability in a system is to not use the binary only Nvidia driver.

Unfortunately, XFree86 4.3 does not support the NVIDIA GeForce FX 5700. If I try using it I get jittery screens, blanking screens, etc. Should I try building XFree86 4.4 or the SUSE version 4.3.99? suse.de/people/matz/xfree86-fdo/ has binary RPMs for i586 but I can't find SRPMs or x86_64 RPMs.
I have ordered SUSE Linux Professional 9.1. Maybe some of these problems will be fixed there? At least I'll be able to try ditching the binary only Nvidia driver.
On Sat, 10 Apr 2004 22:02:14 +1000
"Paul C. Leopardi"
Another great way to increase stability in a system is to not use the binary only Nvidia driver.

Unfortunately, XFree86 4.3 does not support the NVIDIA GeForce FX 5700. If I try using it I get jittery screens, blanking screens, etc. Should I try building XFree86 4.4 or the SUSE version 4.3.99? suse.de/people/matz/xfree86-fdo/ has binary RPMs for i586 but I can't find SRPMs or x86_64 RPMs.
Rebuilding the SRPM will probably work.
I have ordered SUSE Linux Professional 9.1. Maybe some of these problems will be fixed there? At least I'll be able to try ditching the binary only Nvidia driver.
9.1 should support the card with the standard X server. -Andi
Participants (3): Andi Kleen, Bob Fischer, Paul C. Leopardi