Mailinglist Archive: opensuse-amd64 (299 mails)

Instable kswapd / kernel oopses
  • From: Kees Hoekzema <kees@xxxxxxxxxxxx>
  • Date: Mon, 12 Jan 2004 16:57:43 +0000 (UTC)
  • Message-id: <200401121757.37934.kees@xxxxxxxxxxxx>
Hello list,

We have a strange problem with one of our Opteron servers. It was very
unstable when we first put it into production, but after some tweaks
we managed to get it to run for more than 20 days without crashing (the
previous record was 2 hours).

Until today. We had to reboot the server, and after it came back up it was
very unstable again. It crashed every half hour: it was still pingable and an
already-running vmstat kept running, but no new commands or shells could be
started, effectively freezing the server.
This is the same behavior we saw when we first started using the server.
Back then we were unable to work out exactly what was wrong; it simply started
running stably after the umpteenth reboot, for no reason we could identify.

We are currently using only 4G of memory (6G is installed). This makes the
server stable but very slow. It almost looks as if the kernel's memory
management still thinks it has 6G of RAM and uses 2G of swap to compensate for
the missing 2G (we boot with mem=4g because we don't have physical access
to the server right now). Also, in 4G mode kswapd does strange things roughly
every 5 minutes. A snapshot of top shows it using a huge amount of CPU and
memory; it blocks the other processes completely, making the server
unavailable for several minutes.

top - 17:08:37 up 3:27, 2 users, load average: 9.65, 9.75, 8.59
Tasks: 350 total, 10 running, 340 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.0% user, 95.6% system, 0.0% nice, 0.4% idle
Mem: 3641112k total, 3367880k used, 273232k free, 16308k buffers
Swap: 4056360k total, 1372000k used, 2684360k free, 1940732k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
    7 root  25  0     0    0    0 R 96.5  0.0 42:24.97 kswapd
29695 acm   21  0   632  248  232 R 20.4  0.0  0:03.57 top
29619 mysql 17  0 1333m 2228 2084 R  6.8  0.1  0:01.26 mysqld
29651 mysql 18  0 1333m  936  856 R  6.1  0.0  0:00.48 mysqld
29623 mysql 18  0 1333m 1556 1452 R  5.1  0.0  0:00.92 mysqld
29650 mysql 17  0 1333m  692  664 R  5.0  0.0  0:00.29 mysqld

This is why, with only 4G of RAM, the load averages of the server are 8
to 10 times higher than normal. We can't figure out why.
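
For reference, the arithmetic behind the top snapshot above can be sketched in
Python. The figures are copied from the Mem: and Swap: lines; the helper name
used_without_cache is made up for this illustration, and treating "cached" as
reclaimable page cache is an assumption about how top reports it here.

```python
# Figures copied from the "Mem:" and "Swap:" lines of the top snapshot (in k).
mem_total_k = 3641112
mem_used_k = 3367880
buffers_k = 16308
cached_k = 1940732
swap_used_k = 1372000


def used_without_cache(used_k, buffers_k, cached_k):
    """Memory in use once buffers and page cache are subtracted.

    Assumption: 'cached' is page cache that the kernel could drop.
    """
    return used_k - buffers_k - cached_k


app_k = used_without_cache(mem_used_k, buffers_k, cached_k)
cache_share = cached_k / mem_total_k

print(f"used excluding buffers/cache: {app_k} k")
print(f"page cache share of RAM:      {cache_share:.0%}")
print(f"swap in use:                  {swap_used_k} k")
```

If that assumption holds, about 1.4G of RAM is held by processes, about 1.3G
has been pushed to swap, and roughly half of RAM is cache. That is only a
reading of the numbers, not a diagnosis.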

But that isn't the main problem. The main problem is the kernel oopses that
cause our server to freeze. We get this in our logs:
Jan 12 11:00:00 apollo /USR/SBIN/CRON[1973]: (root) CMD
(/usr/sbin/ntpdate ntp.xs4all.nl > /dev/null 2>&1)
Jan 12 11:00:02 apollo xinetd[746]: sysinfo: fork failed: Cannot allocate
memory (errno = 12)
Jan 12 11:00:22 apollo last message repeated 4 times
Jan 12 11:00:22 apollo xinetd[746]: service sysinfo: too many consecutive fork
failures
Jan 12 11:00:23 apollo xinetd[746]: sysinfo: fork failed: Cannot allocate
memory (errno = 12)
Jan 12 11:00:43 apollo last message repeated 4 times
Jan 12 11:00:43 apollo xinetd[746]: service sysinfo: too many consecutive fork
failures
Jan 12 11:01:15 apollo kernel: Unable to handle kernel paging request at
virtual address 0000007f804537e0
Jan 12 11:01:15 apollo kernel: printing rip:
Jan 12 11:01:15 apollo kernel: ffffffff801494f7
Jan 12 11:01:15 apollo kernel: PML4 3febf067 PGD 401a3067 PMD 0
Jan 12 11:01:15 apollo kernel: Oops: 0000
Jan 12 11:01:15 apollo kernel: CPU 0
Jan 12 11:01:15 apollo kernel: Pid: 7, comm: kswapd Not tainted
Jan 12 11:01:15 apollo kernel: RIP:
0010:[kmem_cache_reap+343/880]{kmem_cache_reap+343}
Jan 12 11:01:15 apollo kernel: RIP:
0010:[<ffffffff801494f7>]{kmem_cache_reap+343}
Jan 12 11:01:15 apollo kernel: RSP: 0018:0000010102c4fdf8 EFLAGS: 00010016
Jan 12 11:01:15 apollo kernel: RAX: 000ffffff0000000 RBX: 0000000000000001
RCX: 0000000000000019
Jan 12 11:01:15 apollo kernel: RDX: 0000007fffff8000 RSI: 0000000000000000
RDI: 00000100e78f3b10
Jan 12 11:01:15 apollo kernel: RBP: 00000100e78f4440 R08: 000000000000003a
R09: 00000100e78f3b30
Jan 12 11:01:15 apollo kernel: R10: 000001017fd06408 R11: 000001017fd06400
R12: 0000000000000058
Jan 12 11:01:15 apollo kernel: R13: 0000000000000001 R14: ffffffff7fffffff
R15: 0000000080000000
Jan 12 11:01:15 apollo kernel: FS: 000000000050e800(0000)
GS:ffffffff804bba80(0000) knlGS:0000000000000000
Jan 12 11:01:15 apollo kernel: CS: 0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Jan 12 11:01:15 apollo kernel: CR2: 0000007f804537e0 CR3: 0000000000101000
CR4: 00000000000006e0
Jan 12 11:01:15 apollo kernel: Process kswapd (pid: 7, stackpage=10102c4f000)
Jan 12 11:01:15 apollo kernel: Stack: 0000010102c4fdf8 0000000000000018
ffffffff8014ae20 00000100e78f3b20
Jan 12 11:01:15 apollo kernel: 0000000100000000 0000010100013008
0000000000000020 00000000000001d0
Jan 12 11:01:15 apollo kernel: 00000101000003c0 0000010102c4fe84
0000000000000000 0000000000000000
Jan 12 11:01:15 apollo kernel: Call Trace:
[shrink_cache+1104/1184]{shrink_cache+1104}
Jan 12 11:01:15 apollo kernel: Call Trace:
[<ffffffff8014ae20>]{shrink_cache+1104}
Jan 12 11:01:15 apollo kernel: [shrink_caches+41/128]{shrink_caches+41}
[try_to_free_pages_zone+98/272]{try_to_free_pages_zone+98}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b0f9>]{shrink_caches+41}
[<ffffffff8014b1b2>]{try_to_free_pages_zone+98}
Jan 12 11:01:15 apollo kernel:
[kswapd_balance_pgdat+113/224]{kswapd_balance_pgdat+113}
[kswapd_balance+28/64]{kswapd_balance+28}
Jan 12 11:01:15 apollo kernel:
[<ffffffff8014b3d1>]{kswapd_balance_pgdat+113}
[<ffffffff8014b45c>]{kswapd_balance+28}
Jan 12 11:01:15 apollo kernel: [kswapd+168/195]{kswapd+168}
[child_rip+8/16]{child_rip+8}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b5b8>]{kswapd+168}
[<ffffffff80110ae4>]{child_rip+8}
Jan 12 11:01:15 apollo kernel: [kswapd+0/195]{kswapd+0}
[child_rip+0/16]{child_rip+0}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b510>]{kswapd+0}
[<ffffffff80110adc>]{child_rip+0}
Jan 12 11:01:15 apollo kernel:
Jan 12 11:01:15 apollo kernel:
Jan 12 11:01:15 apollo kernel: Code: 48 0f b6 92 e0 b7 45 80 48 8b 14 d5 00 b6
45 80 48 8b 8a c8
Jan 12 12:11:09 apollo syslogd 1.4.1: restart.
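
As a cross-check on the oops above, the faulting address in CR2 can be
recomputed from the first instruction in the "Code:" line and the RDX value in
the register dump. The sketch below hand-decodes only that one instruction (it
is not a general disassembler) and assumes that "Code:" starts at the faulting
RIP and that the register dump shows the state at the time of the fault.

```python
import struct

# Values copied from the oops above.
code = bytes.fromhex("48 0f b6 92 e0 b7 45 80")  # first 8 bytes of "Code:"
rdx = 0x0000007FFFFF8000
cr2 = 0x0000007F804537E0

# Hand decoding (assumption: "Code:" starts at the faulting RIP):
#   48           REX.W prefix, 64-bit operand size
#   0f b6        MOVZX r64, r/m8 (zero-extending byte load)
#   92           ModRM: mod=10 (disp32 follows), reg=rdx, rm=rdx
#   e0 b7 45 80  signed 32-bit displacement, little-endian
assert code[:3] == b"\x48\x0f\xb6"
modrm = code[3]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
disp32 = struct.unpack("<i", code[4:8])[0]

# Effective address = base register + sign-extended displacement (mod 2**64).
fault_addr = (rdx + disp32) & 0xFFFFFFFFFFFFFFFF

print(f"mod={mod} reg={reg} rm={rm} disp32={disp32:#x}")
print(f"computed address: {fault_addr:#018x}")
print(f"matches CR2:      {fault_addr == cr2}")
```

The computed address equals the CR2 value, so the fault comes from a byte load
at a fixed kernel address plus RDX. With RDX at 0000007fffff8000, that looks
like a table lookup with a wildly out-of-range index, but that last part is an
inference from the register values, not something the log states.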

And output from vmstat during and after this oops:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs  us  sy   id  wa
 1  2      0 187088  37800 3940288    0    0  1372  5772  500   745   3   3   94   0
 1  2      0 186372  37916 3940844    0    0   560  2108  293   297   0  23   77   0
 0  2      0 186364  37924 3940844    0    0     0    12  140    31   0  35   65   0
 0  2      0 186364  37924 3940844    0    0     0     0  127     4   0   0  100   0
 0  2      0 186364  37924 3940844    0    0     0     0  148    12   0   0  100   0
(after which it stays at 100% idle while blocked processes build up, with not
even one running process).
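
The stall pattern in the vmstat output above (nothing runnable, processes
blocked, CPU fully idle) can be picked out mechanically. The following is a
minimal sketch; the names parse_vmstat and is_stalled and the exact stall
criteria are assumptions made for illustration, and the sample rows are copied
from the output above.

```python
# Column header and sample rows copied from the vmstat output above.
VMSTAT = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
1 2 0 187088 37800 3940288 0 0 1372 5772 500 745 3 3 94 0
1 2 0 186372 37916 3940844 0 0 560 2108 293 297 0 23 77 0
0 2 0 186364 37924 3940844 0 0 0 12 140 31 0 35 65 0
0 2 0 186364 37924 3940844 0 0 0 0 127 4 0 0 100 0
0 2 0 186364 37924 3940844 0 0 0 0 148 12 0 0 100 0
"""


def parse_vmstat(text):
    """Turn vmstat text (header line plus rows) into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, map(int, line.split()))) for line in lines[1:]]


def is_stalled(row):
    """Heuristic (an assumption): no runnable process, some blocked, CPU idle."""
    return row["r"] == 0 and row["b"] > 0 and row["id"] == 100


rows = parse_vmstat(VMSTAT)
stalled = [i for i, row in enumerate(rows) if is_stalled(row)]
print("stalled sample indices:", stalled)
```

Fed from a live vmstat, a check like this could log a timestamp for the moment
the box stops making progress, which may help line the freeze up with the
kernel messages.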

A rather nasty piece of work from... kswapd!
What is wrong with kswapd? It crashes our server if we use 6G and slows
our server to a crawl when using 4G (kernel option mem=4g).
Is this the BIOS bug mentioned earlier, or an undocumented AMD 'feature'? We
are going to take two DIMMs out of the server so it runs physically with 4G
(maybe we'll just slap it a bit too for the trouble it is giving us ;)).

But are there more things we can do? It won't boot a vanilla kernel (it
crashes during boot in... kswapd!), but it boots the latest 2.4.21-149 SUSE
kernel without problems.

-kees

