Bad RAM or software issue

18 Jan 2005

Hi,

We have stability problems with our Opteron cluster running SuSE 9.1.
From time to time the kswapd processes on the master node run at 99% on
one of the two available CPUs, even though almost no swap space is in
use. Jobs submitted through OpenPBS stop at random times, especially
when data is being transferred (via NFS). Sometimes dmesg shows:

centaur[31528]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
centaur[761]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
hybconvert[5561]: segfault at 0000002a96b5b000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5562]: segfault at 0000002a96e50000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5666]: segfault at 0000002a96b14000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5833]: segfault at 0000002a98eb6000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
igesconv[2804]: segfault at 0000000000000120 rip 0000000008093d40 rsp 00000000ffffca58 error 4
hybconvert[16799]: segfault at 0000002a973cc000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4
hybconvert[23026]: segfault at 0000002aa88c5000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4

We noticed that the values following rip and rsp are identical for some
processes. What does all this indicate? Could bad memory be causing such
problems, or might this be a kernel issue? Any help is greatly
appreciated!
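
To make the repeated rip values easier to see, here is a minimal Python
sketch that groups the dmesg lines above by program name and rip. The
regular expression, the function names (group_by_rip, decode_error) and
the SAMPLE string are assumptions made for illustration and are not part
of any existing tool. decode_error splits the "error" value into the
standard x86 page-fault bits; it assumes the value can be read as a
plain integer, which holds for the value 4 seen here. The script only
summarizes the log and does not diagnose the cause.

```python
import re
from collections import defaultdict

# Hypothetical pattern for lines of the form shown in the dmesg output above:
#   prog[pid]: segfault at <addr> rip <rip> rsp <rsp> error <n>
SEGFAULT_RE = re.compile(
    r"^(?P<prog>[^\[\s]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)$"
)

def group_by_rip(log_text):
    """Map (program, rip) to the list of PIDs that faulted at that rip."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        m = SEGFAULT_RE.match(line.strip())
        if m:
            groups[(m["prog"], m["rip"])].append(int(m["pid"]))
    return dict(groups)

def decode_error(code):
    """Split an x86 page-fault error code into its three low bits."""
    return {
        "protection_violation": bool(code & 1),  # 0 means page not present
        "write": bool(code & 2),                 # 0 means read access
        "user_mode": bool(code & 4),             # 1 means fault in user mode
    }

# The dmesg lines quoted in this message.
SAMPLE = """\
centaur[31528]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
centaur[761]: segfault at 0000000000000000 rip 000000000807ca86 rsp 00000000ffffcfd8 error 4
hybconvert[5561]: segfault at 0000002a96b5b000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5562]: segfault at 0000002a96e50000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5666]: segfault at 0000002a96b14000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
hybconvert[5833]: segfault at 0000002a98eb6000 rip 0000002a95a75336 rsp 0000007fbfffb240 error 4
igesconv[2804]: segfault at 0000000000000120 rip 0000000008093d40 rsp 00000000ffffca58 error 4
hybconvert[16799]: segfault at 0000002a973cc000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4
hybconvert[23026]: segfault at 0000002aa88c5000 rip 0000002a95a75336 rsp 0000007fbfffb1c0 error 4
"""

if __name__ == "__main__":
    for (prog, rip), pids in group_by_rip(SAMPLE).items():
        print(f"{prog} rip={rip}: {len(pids)} fault(s), pids {pids}")
    print(decode_error(4))
```

Each program faults at a single rip in every one of its crashes, i.e. at
the same instruction address each time, and error 4 decodes to a
user-mode read of a page that is not present.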

Regards,
Torsten

Torsten Wolf
