Instable kswapd / kernel oopses
Hello list,

We have a strange problem with one of our Opteron servers. It was very unstable when we first put it into production, but after some tweaks we managed to keep it running for more than 20 days without crashing (the previous record was 2 hours).

Until today. We had to reboot the server, and after it came back up it was very unstable again. It crashed every half hour: it was still pingable and a running vmstat would continue to run, but no new commands or shells could be started, effectively freezing the server. This was the same behavior we experienced when we first started to use the server. We were unable then to determine what exactly was wrong; it simply ran stable after the xx'th reboot, without any apparent reason.

We are currently using only 4G of memory (6G total available). This makes the server stable but very slow. It almost looks as if the kernel's memory management still thinks it has 6G of RAM and uses 2G of swap to compensate for the missing 2G (we are booting with mem=4g, since we don't have physical access to the server right now). Also, when running in 4G mode, kswapd does strange things roughly every 5 minutes: a snapshot of top shows it using a huge amount of CPU and MEM, and it blocks the other processes completely, making the server unavailable for several minutes.
top - 17:08:37 up 3:27, 2 users, load average: 9.65, 9.75, 8.59
Tasks: 350 total, 10 running, 340 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.0% user, 95.6% system, 0.0% nice, 0.4% idle
Mem: 3641112k total, 3367880k used, 273232k free, 16308k buffers
Swap: 4056360k total, 1372000k used, 2684360k free, 1940732k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
    7 root  25  0     0    0    0 R 96.5  0.0 42:24.97 kswapd
29695 acm   21  0   632  248  232 R 20.4  0.0  0:03.57 top
29619 mysql 17  0 1333m 2228 2084 R  6.8  0.1  0:01.26 mysqld
29651 mysql 18  0 1333m  936  856 R  6.1  0.0  0:00.48 mysqld
29623 mysql 18  0 1333m 1556 1452 R  5.1  0.0  0:00.92 mysqld
29650 mysql 17  0 1333m  692  664 R  5.0  0.0  0:00.29 mysqld

This is why, with only 4G of RAM, the server's load averages are 8 to 10 times higher than normal, but we can't figure out what causes it. That isn't the main problem, though. The main problem is the kernel oopses that cause our server to freeze. We get this in our logs:

Jan 12 11:00:00 apollo /USR/SBIN/CRON[1973]: (root) CMD (/usr/sbin/ntpdate ntp.xs4all.nl > /dev/null 2>&1)
Jan 12 11:00:02 apollo xinetd[746]: sysinfo: fork failed: Cannot allocate memory (errno = 12)
Jan 12 11:00:22 apollo last message repeated 4 times
Jan 12 11:00:22 apollo xinetd[746]: service sysinfo: too many consecutive fork failures
Jan 12 11:00:23 apollo xinetd[746]: sysinfo: fork failed: Cannot allocate memory (errno = 12)
Jan 12 11:00:43 apollo last message repeated 4 times
Jan 12 11:00:43 apollo xinetd[746]: service sysinfo: too many consecutive fork failures
Jan 12 11:00:22 apollo last message repeated 4 times
Jan 12 11:00:22 apollo xinetd[746]: service sysinfo: too many consecutive fork failures
Jan 12 11:00:23 apollo xinetd[746]: sysinfo: fork failed: Cannot allocate memory (errno = 12)
Jan 12 11:00:43 apollo last message repeated 4 times
Jan 12 11:00:43 apollo xinetd[746]: service sysinfo: too many consecutive fork failures
Jan 12 11:01:15 apollo kernel: Unable to handle kernel paging request at virtual address 0000007f804537e0
Jan 12 11:01:15 apollo kernel: printing rip:
Jan 12 11:01:15 apollo kernel: ffffffff801494f7
Jan 12 11:01:15 apollo kernel: PML4 3febf067 PGD 401a3067 PMD 0
Jan 12 11:01:15 apollo kernel: Oops: 0000
Jan 12 11:01:15 apollo kernel: CPU 0
Jan 12 11:01:15 apollo kernel: Pid: 7, comm: kswapd Not tainted
Jan 12 11:01:15 apollo kernel: RIP: 0010:[kmem_cache_reap+343/880]{kmem_cache_reap+343}
Jan 12 11:01:15 apollo kernel: RIP: 0010:[<ffffffff801494f7>]{kmem_cache_reap+343}
Jan 12 11:01:15 apollo kernel: RSP: 0018:0000010102c4fdf8 EFLAGS: 00010016
Jan 12 11:01:15 apollo kernel: RAX: 000ffffff0000000 RBX: 0000000000000001 RCX: 0000000000000019
Jan 12 11:01:15 apollo kernel: RDX: 0000007fffff8000 RSI: 0000000000000000 RDI: 00000100e78f3b10
Jan 12 11:01:15 apollo kernel: RBP: 00000100e78f4440 R08: 000000000000003a R09: 00000100e78f3b30
Jan 12 11:01:15 apollo kernel: R10: 000001017fd06408 R11: 000001017fd06400 R12: 0000000000000058
Jan 12 11:01:15 apollo kernel: R13: 0000000000000001 R14: ffffffff7fffffff R15: 0000000080000000
Jan 12 11:01:15 apollo kernel: FS: 000000000050e800(0000) GS:ffffffff804bba80(0000) knlGS:0000000000000000
Jan 12 11:01:15 apollo kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jan 12 11:01:15 apollo kernel: CR2: 0000007f804537e0 CR3: 0000000000101000 CR4: 00000000000006e0
Jan 12 11:01:15 apollo kernel: Process kswapd (pid: 7, stackpage=10102c4f000)
Jan 12 11:01:15 apollo kernel: Stack: 0000010102c4fdf8 0000000000000018 ffffffff8014ae20 00000100e78f3b20
Jan 12 11:01:15 apollo kernel: 0000000100000000 0000010100013008 0000000000000020 00000000000001d0
Jan 12 11:01:15 apollo kernel: 00000101000003c0 0000010102c4fe84 0000000000000000 0000000000000000
Jan 12 11:01:15 apollo kernel: Call Trace: [shrink_cache+1104/1184]{shrink_cache+1104}
Jan 12 11:01:15 apollo kernel: Call Trace: [<ffffffff8014ae20>]{shrink_cache+1104}
Jan 12 11:01:15 apollo kernel: [shrink_caches+41/128]{shrink_caches+41} [try_to_free_pages_zone+98/272]{try_to_free_pages_zone+98}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b0f9>]{shrink_caches+41} [<ffffffff8014b1b2>]{try_to_free_pages_zone+98}
Jan 12 11:01:15 apollo kernel: [kswapd_balance_pgdat+113/224]{kswapd_balance_pgdat+113} [kswapd_balance+28/64]{kswapd_balance+28}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b3d1>]{kswapd_balance_pgdat+113} [<ffffffff8014b45c>]{kswapd_balance+28}
Jan 12 11:01:15 apollo kernel: [kswapd+168/195]{kswapd+168} [child_rip+8/16]{child_rip+8}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b5b8>]{kswapd+168} [<ffffffff80110ae4>]{child_rip+8}
Jan 12 11:01:15 apollo kernel: [kswapd+0/195]{kswapd+0} [child_rip+0/16]{child_rip+0}
Jan 12 11:01:15 apollo kernel: [<ffffffff8014b510>]{kswapd+0} [<ffffffff80110adc>]{child_rip+0}
Jan 12 11:01:15 apollo kernel:
Jan 12 11:01:15 apollo kernel:
Jan 12 11:01:15 apollo kernel: Code: 48 0f b6 92 e0 b7 45 80 48 8b 14 d5 00 b6 45 80 48 8b 8a c8
Jan 12 12:11:09 apollo syslogd 1.4.1: restart.

And output from vmstat during and after this oops:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy  id wa
 1  2      0 187088  37800 3940288    0    0  1372  5772  500   745  3  3  94  0
 1  2      0 186372  37916 3940844    0    0   560  2108  293   297  0 23  77  0
 0  2      0 186364  37924 3940844    0    0     0    12  140    31  0 35  65  0
 0  2      0 186364  37924 3940844    0    0     0     0  127     4  0  0 100  0
 0  2      0 186364  37924 3940844    0    0     0     0  148    12  0  0 100  0

(after which it stays at 100% idle, with blocked processes building up and not a single running process).

A kinda nasty piece of work from.. kswapd! What is wrong with kswapd? It crashes our server if we use 6G and slows it down to a crawl when using 4G (kernel option mem=4g). Is this the BIOS bug mentioned earlier, or an undocumented AMD 'feature'? We are going to take two DIMMs out of the server so it physically runs with 4G (maybe we'll slap it a bit too, for the trouble it is giving us ;)).

But are there more things we can do? It won't boot a vanilla kernel (it crashes during boot on.. kswapd!), though it boots the latest 2.4.21-149 SUSE kernel without problems.

-kees
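For anyone who wants to spot this kind of freeze in a saved vmstat log, here is a minimal Python sketch. It assumes the 16-column vmstat layout quoted above; the names `VmstatRow`, `parse_vmstat` and `looks_frozen` are made up for this illustration, and the "frozen" rule (blocked processes, nothing runnable, 100% idle, no block I/O) is only a heuristic read off the sample rows, not a kernel-level diagnosis.

```python
from collections import namedtuple

# Column order of the vmstat rows quoted in the message. The "in" column is
# renamed to "intr" and "id" to "idle" to avoid clashing with Python names.
VmstatRow = namedtuple(
    "VmstatRow",
    "r b swpd free buff cache si so bi bo intr cs us sy idle wa",
)

# The five data rows from the message, copied as-is.
SAMPLE = """\
1 2 0 187088 37800 3940288 0 0 1372 5772 500 745 3 3 94 0
1 2 0 186372 37916 3940844 0 0 560 2108 293 297 0 23 77 0
0 2 0 186364 37924 3940844 0 0 0 12 140 31 0 35 65 0
0 2 0 186364 37924 3940844 0 0 0 0 127 4 0 0 100 0
0 2 0 186364 37924 3940844 0 0 0 0 148 12 0 0 100 0
"""


def parse_vmstat(text):
    """Return a VmstatRow for every line that holds exactly 16 integers.

    Header lines and anything else that does not match are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(VmstatRow._fields) and all(f.isdigit() for f in fields):
            rows.append(VmstatRow(*(int(f) for f in fields)))
    return rows


def looks_frozen(row):
    """Hypothetical heuristic: processes are blocked, none are runnable,
    the CPU is fully idle and no blocks are moving in or out."""
    return (
        row.b > 0
        and row.r == 0
        and row.idle == 100
        and row.bi == 0
        and row.bo == 0
    )


if __name__ == "__main__":
    for row in parse_vmstat(SAMPLE):
        status = "possibly frozen" if looks_frozen(row) else "ok"
        print(f"r={row.r} b={row.b} idle={row.idle}% cs={row.cs}: {status}")
```

On the sample rows this flags only the last two, which matches the point where the message says the machine sits at 100% idle with blocked processes.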
A bit more than four hours ago we removed two of the six memory modules, leaving only 4G of physically addressable memory. The server is now running without any swap usage at all, and it is more stable than it was earlier today before the removal.

Best regards,
Arjen

Kees Hoekzema wrote:
Hello list,
We have a strange problem with one of our Opteron servers. It was very unstable when we first put it into production, but after some tweaks we managed to keep it running for more than 20 days without crashing (the previous record was 2 hours).
Until today. We had to reboot the server, and after it came back up it was very unstable again. It crashed every half hour: it was still pingable and a running vmstat would continue to run, but no new commands or shells could be started, effectively freezing the server. This was the same behavior we experienced when we first started to use the server. We were unable then to determine what exactly was wrong; it simply ran stable after the xx'th reboot, without any apparent reason.
We are currently using only 4G of memory (6G total available). This makes the server stable but very slow. It almost looks as if the kernel's memory management still thinks it has 6G of RAM and uses 2G of swap to compensate for the missing 2G (we are booting with mem=4g, since we don't have physical access to the server right now). Also, when running in 4G mode, kswapd does strange things roughly every 5 minutes: a snapshot of top shows it using a huge amount of CPU and MEM, and it blocks the other processes completely, making the server unavailable for several minutes.
[snip]
A kinda nasty piece of work from.. kswapd! What is wrong with kswapd? It crashes our server if we use 6G and slows it down to a crawl when using 4G (kernel option mem=4g). Is this the BIOS bug mentioned earlier, or an undocumented AMD 'feature'? We are going to take two DIMMs out of the server so it physically runs with 4G (maybe we'll slap it a bit too, for the trouble it is giving us ;)).
But are there more things we can do? It won't boot a vanilla kernel (it crashes during boot on.. kswapd!), though it boots the latest 2.4.21-149 SUSE kernel without problems.
-kees
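The "without any swap usage at all" observation in the reply can be checked from the kernel's own counters. Below is a minimal Python sketch that computes swap in use from `/proc/meminfo`-style text. It assumes the `SwapTotal:` and `SwapFree:` lines are present and reported in kB; the function names are made up for this illustration, and the sample text is constructed from the `Swap:` line of the top snapshot in the original message rather than captured from the machine.

```python
def parse_meminfo(text):
    """Parse 'Key:   12345 kB' style lines into a {key: integer} dict.

    Lines whose first value is not a plain integer are ignored.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap in use, in kB, assuming SwapTotal and SwapFree are present."""
    return info["SwapTotal"] - info["SwapFree"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


# Illustrative text built from the top snapshot in the original message
# (Swap: 4056360k total, 2684360k free). It is not real /proc output.
SAMPLE_MEMINFO = """\
SwapTotal:     4056360 kB
SwapFree:      2684360 kB
"""

if __name__ == "__main__":
    used = swap_used_kb(parse_meminfo(SAMPLE_MEMINFO))
    print(f"swap in use: {used} kB")
```

With the sample values this reproduces the 1372000k of swap in use that top reported under mem=4g. After removing the two modules, the same calculation on the live file (via `read_meminfo()`) would be expected to come out at or near zero if the reply's description holds.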
participants (2)
- Arjen van der Meijden
- Kees Hoekzema