Hi,
For the past few months we've been experiencing random but
infrequent system crashes on our SuSE 7.2 Dell box, and we
haven't been able to track down the cause. It was suggested
previously on these lists that the Intel network card driver was
at fault. We changed the driver, but the crash occurred again today!
Essentially, the system runs fine for anywhere from a few days
to a month, then rapidly slows and locks up completely within
10 seconds, both for remote sessions and at the terminal. The only
recourse is a nasty hard reboot. Naturally, our first port of call
was to watch 'top' to see whether something was hogging the
memory or CPU. After failing to find anything, we wrote a script
that logs the output of top to files 5 times every second.
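The logging script itself is not included in this post. Purely as an illustration, a logger of this kind could be sketched in Python as below. The function names, the log path, and the `top -b -n 1` batch invocation are assumptions, not what SymLinkToTop actually does. Each snapshot is stamped with both the wall clock and a monotonic clock, so a jump in the time top reports can be checked against a clock that cannot go backwards.

```python
import os
import subprocess
import time


def snapshot(cmd):
    """Run cmd once and return its output prefixed with two timestamps."""
    wall = time.time()        # wall clock; jumps if the system time is changed
    mono = time.monotonic()   # monotonic clock; never goes backwards
    result = subprocess.run(cmd, capture_output=True, text=True)
    header = "### wall=%.3f mono=%.3f rc=%d\n" % (wall, mono, result.returncode)
    return header + result.stdout


def log_forever(path, cmd=("top", "-b", "-n", "1"), interval=0.2):
    """Append a snapshot to path, then sleep `interval` seconds, forever.

    The default command assumes a top that supports batch mode (-b)
    and an iteration count (-n); adjust it for the installed version.
    """
    with open(path, "a") as log:
        while True:
            log.write(snapshot(list(cmd)))
            log.flush()
            os.fsync(log.fileno())  # push data to disk before a possible lockup
            time.sleep(interval)
```

A call such as `log_forever("/tmp/toplog.txt")` (hypothetical path) would aim for roughly 5 snapshots per second, though the real rate is lower because each run of top takes time. Sleeping between snapshots also keeps the logger from occupying a CPU by itself.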
The peculiar thing we discovered is that, in the top log, the
system time reported by top changes at the moment of the crash!
We still get approximately another 50 top outputs after that,
covering the roughly 10 seconds before the machine dies completely.
Strangely enough, the file modification times stay as expected,
even though top is suddenly reporting the wrong system time.
First line of the top output just before the change:
1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17
And 0.2 seconds later:
12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82
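The size of the jump between these two lines can be worked out mechanically. The following is a small sketch, not part of the original logging setup; the regular expression and the `parse_header` helper are illustrative names that assume the "H:MMam/pm up N days, H:MM" layout shown above.

```python
import re

HEADER = re.compile(
    r"(\d{1,2}):(\d{2})(am|pm)\s+up\s+(\d+)\s+days?,\s+(\d+):(\d{2}),"
)


def parse_header(line):
    """Return (clock_minutes, uptime_minutes) from a top summary line."""
    m = HEADER.search(line)
    if m is None:
        raise ValueError("not a top summary line: %r" % line)
    hour, minute, ampm, days, up_h, up_m = m.groups()
    # Convert the 12-hour clock to minutes since midnight.
    hour = int(hour) % 12 + (12 if ampm == "pm" else 0)
    clock = hour * 60 + int(minute)
    uptime = (int(days) * 24 + int(up_h)) * 60 + int(up_m)
    return clock, uptime


before = "1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17"
after = "12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82"

clock_before, up_before = parse_header(before)
clock_after, up_after = parse_header(after)
print("clock went back", clock_before - clock_after, "minutes")   # 92
print("uptime went back", up_before - up_after, "minutes")        # 92
```

Both the reported clock and the reported uptime step back by the same 92 minutes, which is worth keeping in mind when comparing against the file modification times.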
Does anyone have any idea what could cause strange behaviour
like this? The system info is below, and beneath my name is
more detailed top output.
A couple of other points:
- Lots of Java processes are listed, but to my knowledge these are
threads rather than separate processes.
- The Oracle process ora_s000_VCP seems to grow in memory as time
passes (up to 150 MB of the 1 GB available). I'm not sure what this
process does, so I'm wary of it.
- Do the two kpm processes near the end of the last entry look
suspicious?
- SymLinkToTop is nothing strange; it is the process that writes
the top log.
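To follow the memory of a particular process across the logged snapshots, the RSS column can be pulled out of each top entry. This is only a sketch: `to_kb` and `rss_by_command` are hypothetical helpers, the column positions assume the "PID USER PRI NI SIZE RSS ..." layout of the entries further down, and a trailing "M" is assumed to mean 1024 KB. ora_s000_VCP does not appear in the excerpts, so the sample lines are taken from the last entry.

```python
def to_kb(field):
    """Convert a top size field such as '4744' or '68M' to kilobytes."""
    if field.endswith("M"):
        return int(float(field[:-1]) * 1024)  # assumes M means 1024 KB
    return int(field)


def rss_by_command(top_text, command):
    """Return the RSS in KB of every process line whose COMMAND matches."""
    sizes = []
    for line in top_text.splitlines():
        tokens = line.split()
        # Process lines start with a numeric PID and have at least 12 columns;
        # this skips the summary, Mem/Swap and column-header lines.
        if len(tokens) < 12 or not tokens[0].isdigit():
            continue
        if tokens[-1] == command:
            sizes.append(to_kb(tokens[5]))  # sixth column is RSS
    return sizes


sample = """\
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
915 root 12 0 7468 7468 5708 S 95.3 0.7 1968m kpm
916 root 11 0 7440 7440 5704 S 95.3 0.7 1985m kpm
766 root 9 -1 11228 10M 1652 S < 4.7 1.0 85:38 X
25012 root 9 0 69896 68M 7652 S 2.3 6.7 0:22 java
"""

print(rss_by_command(sample, "kpm"))   # [7468, 7440]
print(rss_by_command(sample, "java"))  # [69632]
```

Running the same function over every logged snapshot and plotting the values against time would show whether a process such as ora_s000_VCP really grows steadily.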
System: SuSE 7.2 running Oracle 8.1.7 and Tomcat 4.0.1 on a Dell
server with two Intel CPUs.
Any help or ideas would be greatly appreciated! More top details
below.
Kind regards,
David Molloy
------- TOP Details ---------
Two entries before time change:
1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17
253 processes: 249 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
CPU1 states: 8.0% user, 91.0% system, 0.0% nice, 0.0% idle
Mem: 1028084K av, 944844K used, 83240K free, 0K shrd, 132604K buff
Swap: 1052248K av, 164K used, 1052084K free 456612K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
4171 root 17 0 1060 1060 716 R 99.9 0.1 8:20 SymLinkToTop
912 root 9 0 4744 4744 3300 S 4.2 0.4 4853m gtop
25038 httpd 9 0 4252 4252 3516 S 2.1 0.4 0:11 httpd
25055 httpd 9 0 4264 4264 3516 S 2.1 0.4 0:11 httpd
1 root 9 0 224 224 188 S 0.0 0.0 0:24 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:04 ksoftirqd_CPU0
4 root 19 19 0 0 0 SWN 0.0 0.0 0:08 ksoftirqd_CPU1
5 root 9 0 0 0 0 SW 0.0 0.0 89:11 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
7 root 9 0 0 0 0 SW 0.0 0.0 2:03 bdflush
8 root 9 0 0 0 0 SW 0.0 0.0 3:26 kupdated
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_1
14 root 9 0 0 0 0 SW 0.0 0.0 0:00 AIFd
129 root 9 0 612 612 532 S 0.0 0.0 0:00 apcupsd
130 root 9 0 588 588 504 S 0.0 0.0 1:00 apcupsd
418 bin 9 0 512 512 428 S 0.0 0.0 0:00 portmap
427 root 9 0 964 964 836 S 0.0 0.0 0:02 sshd
440 root 9 0 636 636 532 S 0.0 0.0 0:03 syslogd
444 root 9 0 1016 1016 452 S 0.0 0.0 0:00 klogd
459 root 9 0 548 548 468 S 0.0 0.0 0:00 lpd
472 root 9 0 704 704 608 S 0.0 0.0 0:00 rpc.statd
474 root 9 0 0 0 0 SW 0.0 0.0 0:46 rpciod
475 root 9 0 0 0 0 SW 0.0 0.0 0:00 lockd
<SNIP>
Last entry before time change:
1:39pm up 21 days, 4:07, 9 users, load average: 2.04, 2.14, 2.17
253 processes: 249 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 0.0% user, 4.0% system, 0.0% nice, 95.0% idle
CPU1 states: 4.0% user, 95.0% system, 0.0% nice, 0.0% idle
Mem: 1028084K av, 944860K used, 83224K free, 0K shrd, 132604K buff
Swap: 1052248K av, 164K used, 1052084K free 456632K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
4171 root 16 0 1060 1060 716 R 99.9 0.1 8:20 SymLinkToTop
1 root 9 0 224 224 188 S 0.0 0.0 0:24 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:04 ksoftirqd_CPU0
4 root 19 19 0 0 0 SWN 0.0 0.0 0:08 ksoftirqd_CPU1
5 root 9 0 0 0 0 SW 0.0 0.0 89:11 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
7 root 9 0 0 0 0 SW 0.0 0.0 2:03 bdflush
8 root 9 0 0 0 0 SW 0.0 0.0 3:26 kupdated
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_1
14 root 9 0 0 0 0 SW 0.0 0.0 0:00 AIFd
129 root 9 0 612 612 532 S 0.0 0.0 0:00 apcupsd
130 root 9 0 588 588 504 S 0.0 0.0 1:00 apcupsd
418 bin 9 0 512 512 428 S 0.0 0.0 0:00 portmap
427 root 9 0 964 964 836 S 0.0 0.0 0:02 sshd
440 root 9 0 636 636 532 S 0.0 0.0 0:03 syslogd
444 root 9 0 1016 1016 452 S 0.0 0.0 0:00 klogd
459 root 9 0 548 548 468 S 0.0 0.0 0:00 lpd
472 root 9 0 704 704 608 S 0.0 0.0 0:00 rpc.statd
474 root 9 0 0 0 0 SW 0.0 0.0 0:46 rpciod
475 root 9 0 0 0 0 SW 0.0 0.0 0:00 lockd
<SNIP>
0.2 seconds later, at the time change:
12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82
226 processes: 222 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 2.0% user, 100.0% system, 0.0% nice, -2.0% idle
CPU1 states: 0.0% user, 0.0% system, 0.0% nice, 100.0% idle
Mem: 1028084K av, 1015296K used, 12788K free, 0K shrd, 91128K buff
Swap: 1052248K av, 164K used, 1052084K free 575292K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
996 root 19 0 1044 1044 716 R 99.9 0.1 6:57 SymLinkToTop
1 root 9 0 224 224 188 S 0.0 0.0 0:24 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:04 ksoftirqd_CPU0
4 root 19 19 0 0 0 SWN 0.0 0.0 0:08 ksoftirqd_CPU1
5 root 9 0 0 0 0 SW 0.0 0.0 89:10 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
7 root 9 0 0 0 0 SW 0.0 0.0 2:03 bdflush
8 root 9 0 0 0 0 SW 0.0 0.0 3:25 kupdated
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_1
14 root 9 0 0 0 0 SW 0.0 0.0 0:00 AIFd
129 root 9 0 612 612 532 S 0.0 0.0 0:00 apcupsd
130 root 9 0 588 588 504 S 0.0 0.0 0:59 apcupsd
418 bin 9 0 512 512 428 S 0.0 0.0 0:00 portmap
427 root 9 0 964 964 836 S 0.0 0.0 0:02 sshd
440 root 9 0 636 636 532 S 0.0 0.0 0:03 syslogd
444 root 9 0 1016 1016 452 S 0.0 0.0 0:00 klogd
459 root 9 0 548 548 468 S 0.0 0.0 0:00 lpd
472 root 9 0 704 704 608 S 0.0 0.0 0:00 rpc.statd
474 root 9 0 0 0 0 SW 0.0 0.0 0:46 rpciod
475 root 9 0 0 0 0 SW 0.0 0.0 0:00 lockd
<SNIP>
Next entry:
12:07pm up 21 days, 2:35, 9 users, load average: 1.81, 1.81, 1.82
226 processes: 222 sleeping, 1 running, 3 zombie, 0 stopped
CPU0 states: 8.0% user, 88.0% system, 0.0% nice, 2.0% idle
CPU1 states: 0.0% user, 5.0% system, 0.0% nice, 94.0% idle
Mem: 1028084K av, 1015312K used, 12772K free, 0K shrd, 91128K buff
Swap: 1052248K av, 164K used, 1052084K free 575308K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
996 root 18 0 1044 1044 716 R 99.9 0.1 6:57 SymLinkToTop
1 root 9 0 224 224 188 S 0.0 0.0 0:24 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:04 ksoftirqd_CPU0
4 root 19 19 0 0 0 SWN 0.0 0.0 0:08 ksoftirqd_CPU1
5 root 9 0 0 0 0 SW 0.0 0.0 89:10 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
7 root 9 0 0 0 0 SW 0.0 0.0 2:03 bdflush
8 root 9 0 0 0 0 SW 0.0 0.0 3:25 kupdated
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_1
14 root 9 0 0 0 0 SW 0.0 0.0 0:00 AIFd
129 root 9 0 612 612 532 S 0.0 0.0 0:00 apcupsd
130 root 9 0 588 588 504 S 0.0 0.0 0:59 apcupsd
418 bin 9 0 512 512 428 S 0.0 0.0 0:00 portmap
427 root 9 0 964 964 836 S 0.0 0.0 0:02 sshd
440 root 9 0 636 636 532 S 0.0 0.0 0:03 syslogd
444 root 9 0 1016 1016 452 S 0.0 0.0 0:00 klogd
459 root 9 0 548 548 468 S 0.0 0.0 0:00 lpd
472 root 9 0 704 704 608 S 0.0 0.0 0:00 rpc.statd
474 root 9 0 0 0 0 SW 0.0 0.0 0:46 rpciod
475 root 9 0 0 0 0 SW 0.0 0.0 0:00 lockd
<SNIP>
Very last entry before the machine died completely (about 10 seconds later):
12:07pm up 21 days, 2:36, 9 users, load average: 1.61, 1.76, 1.80
243 processes: 240 sleeping, 1 running, 2 zombie, 0 stopped
CPU0 states: 2.0% user, 52.0% system, 0.0% nice, 45.0% idle
CPU1 states: 0.0% user, 52.0% system, 0.0% nice, 47.0% idle
Mem: 1028084K av, 1018600K used, 9484K free, 0K shrd, 91136K buff
Swap: 1052248K av, 164K used, 1052084K free 576280K cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
996 root 19 0 1044 1044 716 R 99.9 0.1 7:22 SymLinkToTop
915 root 12 0 7468 7468 5708 S 95.3 0.7 1968m kpm
916 root 11 0 7440 7440 5704 S 95.3 0.7 1985m kpm
912 root 9 0 4744 4744 3300 S 42.9 0.4 4833m gtop
766 root 9 -1 11228 10M 1652 S < 4.7 1.0 85:38 X
25012 root 9 0 69896 68M 7652 S 2.3 6.7 0:22 java
25040 httpd 9 0 4252 4252 3524 S 2.3 0.4 0:08 httpd
1 root 9 0 224 224 188 S 0.0 0.0 0:24 init
2 root 9 0 0 0 0 SW 0.0 0.0 0:00 keventd
3 root 19 19 0 0 0 SWN 0.0 0.0 0:04 ksoftirqd_CPU0
4 root 19 19 0 0 0 SWN 0.0 0.0 0:08 ksoftirqd_CPU1
5 root 9 0 0 0 0 SW 0.0 0.0 89:10 kswapd
6 root 9 0 0 0 0 SW 0.0 0.0 0:00 kreclaimd
7 root 9 0 0 0 0 SW 0.0 0.0 2:03 bdflush
8 root 9 0 0 0 0 SW 0.0 0.0 3:25 kupdated
11 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_0
12 root 9 0 0 0 0 SW 0.0 0.0 0:00 scsi_eh_1
14 root 9 0 0 0 0 SW 0.0 0.0 0:00 AIFd
129 root 9 0 612 612 532 S 0.0 0.0 0:00 apcupsd
130 root 9 0 588 588 504 S 0.0 0.0 0:59 apcupsd
418 bin 9 0 512 512 428 S 0.0 0.0 0:00 portmap
<SNIP>