Hi,

we have problems with the stability of our Opteron cluster running SuSE 9.1. From time to time the kswapd processes on the master node run at 99% on one of the two available CPUs, even though almost no swap space is used. Jobs submitted through OpenPBS stop at random times, especially when data is being transferred (via NFS). Sometimes dmesg shows:

centaur[31528]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
centaur[761]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
hybconvert[5561]: segfault at 0000002a96b5b000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5562]: segfault at 0000002a96e50000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5666]: segfault at 0000002a96b14000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5833]: segfault at 0000002a98eb6000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
igesconv[2804]: segfault at 0000000000000120 rip 0000000008093d40 rsp 00000000ffffca58 error 4
hybconvert[16799]: segfault at 0000002a973cc000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4
hybconvert[23026]: segfault at 0000002aa88c5000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4

We noticed that the numbers following rip and rsp are identical for some processes. What does all this indicate? Is bad memory causing such problems, or might this be a kernel issue?

Any help is greatly appreciated!

Regards,
Torsten
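To make the observation about repeated rip values easier to check, here is a minimal sketch that groups segfault lines by program and instruction pointer. It assumes the exact dmesg line format quoted in the message; the names `SEGFAULT_RE`, `group_by_rip`, `decode_error`, and `SAMPLE` are hypothetical helpers, not part of any existing tool. The `decode_error` function reads the number after "error" as an x86 page-fault error code (bit 0: page was present, bit 1: write access, bit 2: user mode), which is an assumption about how this kernel reports it.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   centaur[31528]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
# The format is assumed from the dmesg output quoted in the message.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)


def group_by_rip(log_text):
    """Map (program, rip) to the list of PIDs that faulted at that address."""
    groups = defaultdict(list)
    for match in SEGFAULT_RE.finditer(log_text):
        key = (match.group("prog"), match.group("rip"))
        groups[key].append(int(match.group("pid")))
    return dict(groups)


def decode_error(code):
    """Interpret the value as x86 page-fault error code bits (assumption)."""
    return {
        "page_present": bool(code & 1),  # 0 means the page was not mapped
        "write": bool(code & 2),         # 0 means the access was a read
        "user_mode": bool(code & 4),     # 1 means the fault came from user space
    }


# Three of the lines from the message, used as sample input.
SAMPLE = """\
centaur[31528]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
centaur[761]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
igesconv[2804]: segfault at 0000000000000120 rip 0000000008093d40 rsp 00000000ffffca58 error 4
"""

if __name__ == "__main__":
    for (prog, rip), pids in group_by_rip(SAMPLE).items():
        print(f"{prog} rip={rip} count={len(pids)} pids={pids}")
    print(decode_error(4))
```

Identical rip values across different PIDs of the same program mean the faults occurred at the same instruction address each time. That observation alone does not settle whether memory or the kernel is at fault.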
Participants (1): Torsten Wolf