Segmentation faults on Opteron System
I have a strange problem with my SMP system.

Hardware config: Tyan S2895 mobo, 2x Opteron 270 dual-core CPU, 16GB ECC Reg. RAM

We use this workstation for CFD analyses. When I start 4 jobs on 4 CPUs, at least 2 jobs die (segmentation fault). One job works fine. A parallel run using mpich as MPI aborts (the error message says that it has received a SIGABORT signal from a CPU / SIGSEGV).

The strange thing: if I remove 8GB of RAM, the system is stable. Neither an update from Suse 9 to Suse 10 nor a manually compiled kernel with the current stable version helped. A memtest86+ run showed no errors (running 5 days), so I think that a bad RAM module is not the reason. Furthermore, a CPU stress test with cpuburn-in worked well (24 hours running without any errors).

Might it be a problem with the memory management? Any hints how I can resolve my problem?

Christian
Christian Hopfensitz wrote:
I have a strange problem with my SMP system. Hardware config: Tyan S2895 mobo, 2x Opteron 270 dual-core CPU, 16GB ECC Reg. RAM. We use this workstation for CFD analyses. When I start 4 jobs on 4 CPUs, at least 2 jobs die (segmentation fault). One job works fine. A parallel run using mpich as MPI aborts (the error message says that it has received a SIGABORT signal from a CPU / SIGSEGV). The strange thing: if I remove 8GB of RAM, the system is stable. Neither an update from Suse 9 to Suse 10 nor a manually compiled kernel with the current stable version helped. A memtest86+ run showed no errors (running 5 days), so I think that a bad RAM module is not the reason. Furthermore, a CPU stress test with cpuburn-in worked well (24 hours running without any errors). Might it be a problem with the memory management?
Any hints how I can resolve my problem?
You haven't had much response, so here's my 2 pennyworth ...

(1) It could be a hardware problem. Tyan can be very helpful. You could contact them directly, or you could contact your hardware vendor and ask them to contact Tyan if they can't help you themselves.

(2) It could be a software problem. You're running a specific application? Try asking on a list related to that application, or on an MPI-related list.

HTH, Dave
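One way to narrow down hardware versus software might be to take the CFD code and MPI out of the picture and see whether plain parallel memory use still fails under the running kernel. The following is only a minimal sketch in Python, not something from the original report: the names `check_buffer` and `run_workers`, the worker count, the buffer size, and the byte patterns are all hypothetical placeholders. Each worker process fills a buffer with a byte pattern, reads it back, and reports how many bytes differ. To say anything about the 16GB configuration, the combined buffer sizes would presumably have to be raised well beyond 8GB.

```python
import multiprocessing as mp


def check_buffer(args):
    """Fill a buffer with one byte pattern and count bytes that read back wrong."""
    size_bytes, pattern = args
    # Allocate size_bytes bytes, every one set to the pattern value.
    buf = bytearray([pattern]) * size_bytes
    # Count how many bytes still hold the pattern; the rest are mismatches.
    return size_bytes - buf.count(bytes([pattern]))


def run_workers(n_workers=4, size_bytes=64 * 1024 * 1024,
                patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Run check_buffer in n_workers processes at once (hypothetical defaults)."""
    jobs = [(size_bytes, patterns[i % len(patterns)]) for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        return pool.map(check_buffer, jobs)


if __name__ == "__main__":
    # Placeholder sizes: 4 workers x 64 MiB. Increase for a meaningful test.
    for i, bad in enumerate(run_workers()):
        print(f"worker {i}: {bad} mismatched bytes")
```

If workers crash or report mismatches only with all 16GB installed, that would point away from the application; if they always pass, it proves little, since a short run like this is far less thorough than a dedicated memory tester.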
Participants (2): Christian Hopfensitz, Dave Howorth