Hi,

For the past few months we've been experiencing random but infrequent system crashes on our SuSE 7.2 Dell box, and we haven't been able to track down the cause. It was previously suggested on these lists that the Intel network card driver was at fault. We changed the driver, but the crash occurred again today!

Essentially, the system runs fine for anywhere from a few days to a month, then rapidly slows and locks up completely within 10 seconds (both remotely and at the terminal). The only recourse is a nasty hard reboot.

Naturally, our first port of call was to watch 'top' to see whether something was hogging memory or CPU. After failing to find anything, we wrote a script that writes top output to files 5 times every second. The peculiar thing we discovered is that, in the top log at the time of the crash, the system time reported by top changed! This happens precisely at the time of the crash, although we get approximately another 50 top outputs, representing the following 10 seconds, before the machine dies completely. Strangely enough, the file modification times remain as expected, even though top is suddenly reporting the wrong system time.

Very top line of top:

1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17

And 0.2 seconds later:

12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82

Has anyone got any ideas what could possibly cause strange behaviour like this, please? The system info is written below, and beneath my name is a more detailed top output.

A couple of other points:
- There are loads of Java processes listed, but to my knowledge these represent threads rather than processes.
- The Oracle process ora_s000_VCP seems to grow in memory as time passes (up to 150 MB of 1 GB). I'm not sure what this process does, so I'm wary of it.
- The two kpm processes near the end seem suspicious?
- SymLinkToTop is not a strange process; it is the process used to write the top log.

System: SuSE 7.2 running Oracle 8.1.7, Tomcat 4.0.1 on a Dell twin Intel CPU server.

Any help or ideas would be greatly appreciated! More top details below.

Kind regards,
David Molloy

------- TOP Details ---------

Two entries before time change:

1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17
253 processes: 249 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU1 states: 8.0% user, 91.0% system, 0.0% nice, 0.0% idle
Mem: 1028084K av, 944844K used, 83240K free, 0K shrd, 132604K buff
Swap: 1052248K av, 164K used, 1052084K free 456612K cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
 4171 root      17   0  1060  1060   716 R    99.9  0.1   8:20 SymLinkToTop
  912 root       9   0  4744  4744  3300 S     4.2  0.4  4853m gtop
25038 httpd      9   0  4252  4252  3516 S     2.1  0.4   0:11 httpd
25055 httpd      9   0  4264  4264  3516 S     2.1  0.4   0:11 httpd
    1 root       9   0   224   224   188 S     0.0  0.0   0:24 init
    2 root       9   0     0     0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0     0     0 SWN   0.0  0.0   0:04 ksoftirqd_CPU0
    4 root      19  19     0     0     0 SWN   0.0  0.0   0:08 ksoftirqd_CPU1
    5 root       9   0     0     0     0 SW    0.0  0.0  89:11 kswapd
    6 root       9   0     0     0     0 SW    0.0  0.0   0:00 kreclaimd
    7 root       9   0     0     0     0 SW    0.0  0.0   2:03 bdflush
    8 root       9   0     0     0     0 SW    0.0  0.0   3:26 kupdated
   11 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_0
   12 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_1
   14 root       9   0     0     0     0 SW    0.0  0.0   0:00 AIFd
  129 root       9   0   612   612   532 S     0.0  0.0   0:00 apcupsd
  130 root       9   0   588   588   504 S     0.0  0.0   1:00 apcupsd
  418 bin        9   0   512   512   428 S     0.0  0.0   0:00 portmap
  427 root       9   0   964   964   836 S     0.0  0.0   0:02 sshd
  440 root       9   0   636   636   532 S     0.0  0.0   0:03 syslogd
  444 root       9   0  1016  1016   452 S     0.0  0.0   0:00 klogd
  459 root       9   0   548   548   468 S     0.0  0.0   0:00 lpd
  472 root       9   0   704   704   608 S     0.0  0.0   0:00 rpc.statd
  474 root       9   0     0     0     0 SW    0.0  0.0   0:46 rpciod
  475 root       9   0     0     0     0 SW    0.0  0.0   0:00 lockd
<SNIP>

Last entry before time change:

1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17
253 processes: 249 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 0.0% user, 4.0% system, 0.0% nice, 95.0% idle
CPU1 states: 4.0% user, 95.0% system, 0.0% nice, 0.0% idle
Mem: 1028084K av, 944860K used, 83224K free, 0K shrd, 132604K buff
Swap: 1052248K av, 164K used, 1052084K free 456632K cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
 4171 root      16   0  1060  1060   716 R    99.9  0.1   8:20 SymLinkToTop
    1 root       9   0   224   224   188 S     0.0  0.0   0:24 init
    2 root       9   0     0     0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0     0     0 SWN   0.0  0.0   0:04 ksoftirqd_CPU0
    4 root      19  19     0     0     0 SWN   0.0  0.0   0:08 ksoftirqd_CPU1
    5 root       9   0     0     0     0 SW    0.0  0.0  89:11 kswapd
    6 root       9   0     0     0     0 SW    0.0  0.0   0:00 kreclaimd
    7 root       9   0     0     0     0 SW    0.0  0.0   2:03 bdflush
    8 root       9   0     0     0     0 SW    0.0  0.0   3:26 kupdated
   11 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_0
   12 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_1
   14 root       9   0     0     0     0 SW    0.0  0.0   0:00 AIFd
  129 root       9   0   612   612   532 S     0.0  0.0   0:00 apcupsd
  130 root       9   0   588   588   504 S     0.0  0.0   1:00 apcupsd
  418 bin        9   0   512   512   428 S     0.0  0.0   0:00 portmap
  427 root       9   0   964   964   836 S     0.0  0.0   0:02 sshd
  440 root       9   0   636   636   532 S     0.0  0.0   0:03 syslogd
  444 root       9   0  1016  1016   452 S     0.0  0.0   0:00 klogd
  459 root       9   0   548   548   468 S     0.0  0.0   0:00 lpd
  472 root       9   0   704   704   608 S     0.0  0.0   0:00 rpc.statd
  474 root       9   0     0     0     0 SW    0.0  0.0   0:46 rpciod
  475 root       9   0     0     0     0 SW    0.0  0.0   0:00 lockd
<SNIP>

0.2 seconds later, at the time change:

12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82
226 processes: 222 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 2.0% user, 100.0% system, 0.0% nice, -2.0% idle
CPU1 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
Mem: 1028084K av, 1015296K used, 12788K free, 0K shrd, 91128K buff
Swap: 1052248K av, 164K used, 1052084K free 575292K cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
  996 root      19   0  1044  1044   716 R    99.9  0.1   6:57 SymLinkToTop
    1 root       9   0   224   224   188 S     0.0  0.0   0:24 init
    2 root       9   0     0     0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0     0     0 SWN   0.0  0.0   0:04 ksoftirqd_CPU0
    4 root      19  19     0     0     0 SWN   0.0  0.0   0:08 ksoftirqd_CPU1
    5 root       9   0     0     0     0 SW    0.0  0.0  89:10 kswapd
    6 root       9   0     0     0     0 SW    0.0  0.0   0:00 kreclaimd
    7 root       9   0     0     0     0 SW    0.0  0.0   2:03 bdflush
    8 root       9   0     0     0     0 SW    0.0  0.0   3:25 kupdated
   11 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_0
   12 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_1
   14 root       9   0     0     0     0 SW    0.0  0.0   0:00 AIFd
  129 root       9   0   612   612   532 S     0.0  0.0   0:00 apcupsd
  130 root       9   0   588   588   504 S     0.0  0.0   0:59 apcupsd
  418 bin        9   0   512   512   428 S     0.0  0.0   0:00 portmap
  427 root       9   0   964   964   836 S     0.0  0.0   0:02 sshd
  440 root       9   0   636   636   532 S     0.0  0.0   0:03 syslogd
  444 root       9   0  1016  1016   452 S     0.0  0.0   0:00 klogd
  459 root       9   0   548   548   468 S     0.0  0.0   0:00 lpd
  472 root       9   0   704   704   608 S     0.0  0.0   0:00 rpc.statd
  474 root       9   0     0     0     0 SW    0.0  0.0   0:46 rpciod
  475 root       9   0     0     0     0 SW    0.0  0.0   0:00 lockd
<SNIP>

Next entry:

12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82
226 processes: 222 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 8.0% user, 88.0% system, 0.0% nice, 2.0% idle
CPU1 states: 0.0% user, 5.0% system, 0.0% nice, 94.0% idle
Mem: 1028084K av, 1015312K used, 12772K free, 0K shrd, 91128K buff
Swap: 1052248K av, 164K used, 1052084K free 575308K cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
  996 root      18   0  1044  1044   716 R    99.9  0.1   6:57 SymLinkToTop
    1 root       9   0   224   224   188 S     0.0  0.0   0:24 init
    2 root       9   0     0     0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0     0     0 SWN   0.0  0.0   0:04 ksoftirqd_CPU0
    4 root      19  19     0     0     0 SWN   0.0  0.0   0:08 ksoftirqd_CPU1
    5 root       9   0     0     0     0 SW    0.0  0.0  89:10 kswapd
    6 root       9   0     0     0     0 SW    0.0  0.0   0:00 kreclaimd
    7 root       9   0     0     0     0 SW    0.0  0.0   2:03 bdflush
    8 root       9   0     0     0     0 SW    0.0  0.0   3:25 kupdated
   11 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_0
   12 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_1
   14 root       9   0     0     0     0 SW    0.0  0.0   0:00 AIFd
  129 root       9   0   612   612   532 S     0.0  0.0   0:00 apcupsd
  130 root       9   0   588   588   504 S     0.0  0.0   0:59 apcupsd
  418 bin        9   0   512   512   428 S     0.0  0.0   0:00 portmap
  427 root       9   0   964   964   836 S     0.0  0.0   0:02 sshd
  440 root       9   0   636   636   532 S     0.0  0.0   0:03 syslogd
  444 root       9   0  1016  1016   452 S     0.0  0.0   0:00 klogd
  459 root       9   0   548   548   468 S     0.0  0.0   0:00 lpd
  472 root       9   0   704   704   608 S     0.0  0.0   0:00 rpc.statd
  474 root       9   0     0     0     0 SW    0.0  0.0   0:46 rpciod
  475 root       9   0     0     0     0 SW    0.0  0.0   0:00 lockd
<SNIP>

Very last entry before machine died completely (about 10 seconds later):

12:07pm up 21 days, 2:36, 9 users, load average: 1.61, 1.76, 1.80
243 processes: 240 sleeping, 1 running, 2 zombie, 0 stopped
CPU0 states: 2.0% user, 52.0% system, 0.0% nice, 45.0% idle
CPU1 states: 0.0% user, 52.0% system, 0.0% nice, 47.0% idle
Mem: 1028084K av, 1018600K used, 9484K free, 0K shrd, 91136K buff
Swap: 1052248K av, 164K used, 1052084K free 576280K cached

  PID USER     PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM   TIME COMMAND
  996 root      19   0  1044  1044   716 R    99.9  0.1   7:22 SymLinkToTop
  915 root      12   0  7468  7468  5708 S    95.3  0.7  1968m kpm
  916 root      11   0  7440  7440  5704 S    95.3  0.7  1985m kpm
  912 root       9   0  4744  4744  3300 S    42.9  0.4  4833m gtop
  766 root       9  -1 11228   10M  1652 S <   4.7  1.0  85:38 X
25012 root       9   0 69896   68M  7652 S     2.3  6.7   0:22 java
25040 httpd      9   0  4252  4252  3524 S     2.3  0.4   0:08 httpd
    1 root       9   0   224   224   188 S     0.0  0.0   0:24 init
    2 root       9   0     0     0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0     0     0 SWN   0.0  0.0   0:04 ksoftirqd_CPU0
    4 root      19  19     0     0     0 SWN   0.0  0.0   0:08 ksoftirqd_CPU1
    5 root       9   0     0     0     0 SW    0.0  0.0  89:10 kswapd
    6 root       9   0     0     0     0 SW    0.0  0.0   0:00 kreclaimd
    7 root       9   0     0     0     0 SW    0.0  0.0   2:03 bdflush
    8 root       9   0     0     0     0 SW    0.0  0.0   3:25 kupdated
   11 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_0
   12 root       9   0     0     0     0 SW    0.0  0.0   0:00 scsi_eh_1
   14 root       9   0     0     0     0 SW    0.0  0.0   0:00 AIFd
  129 root       9   0   612   612   532 S     0.0  0.0   0:00 apcupsd
  130 root       9   0   588   588   504 S     0.0  0.0   0:59 apcupsd
  418 bin        9   0   512   512   428 S     0.0  0.0   0:00 portmap
<SNIP>
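
For anyone wanting to check a similar log, the backward jump in the top header could be flagged automatically. What follows is only a minimal sketch in Python, not the script used to produce the log above. It assumes the header line has the exact "H:MMam/pm up N days, H:MM," shape seen in this post (an uptime under one day, or one shown in minutes, would not match), and the names `parse_header` and `find_backward_jumps` are made up for illustration.

```python
import re

# Matches the first line of a top snapshot in the shape shown in this post, e.g.
# "1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17"
# Assumption: uptime is always printed as "N days, H:MM".
HEADER = re.compile(
    r"(\d{1,2}):(\d{2})(am|pm)\s+up\s+(\d+)\s+days?,\s+(\d{1,2}):(\d{2}),"
)


def parse_header(line):
    """Return (clock_minutes, uptime_minutes), or None if not a header line."""
    match = HEADER.search(line)
    if match is None:
        return None
    hour, minute, ampm, days, up_hours, up_minutes = match.groups()
    # Convert the 12-hour clock to minutes since midnight (12am -> 0, 12pm -> 12).
    hour24 = int(hour) % 12 + (12 if ampm == "pm" else 0)
    clock = hour24 * 60 + int(minute)
    uptime = (int(days) * 24 + int(up_hours)) * 60 + int(up_minutes)
    return clock, uptime


def find_backward_jumps(lines):
    """Yield (line_index, clock_delta, uptime_delta) whenever uptime decreases.

    Uptime should never go backwards on a running system, so any decrease
    between consecutive header lines is reported. Deltas are in minutes.
    """
    previous = None
    for index, line in enumerate(lines):
        current = parse_header(line)
        if current is None:
            continue  # not a header line (process rows, CPU/Mem lines, ...)
        if previous is not None and current[1] < previous[1]:
            yield index, current[0] - previous[0], current[1] - previous[1]
        previous = current


if __name__ == "__main__":
    sample = [
        "1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17",
        "12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82",
    ]
    for index, clock_delta, uptime_delta in find_backward_jumps(sample):
        print(index, clock_delta, uptime_delta)
```

Run on the two header lines quoted in this post, the sketch reports that the clock and the uptime both moved back by the same amount, 92 minutes.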