[Bug 685777] New: kernel wedging under load ...
https://bugzilla.novell.com/show_bug.cgi?id=685777
https://bugzilla.novell.com/show_bug.cgi?id=685777#c0

Summary: kernel wedging under load ...
Classification: openSUSE
Product: openSUSE 11.4
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: mmeeks@novell.com
QAContact: qa@suse.de
Found By: ---
Blocker: ---

I can no longer compile LibreOffice on 11.4 - under load (same load as openSUSE 11.3 survived interactively) it wedges the machine tight. The disk appears to stop responding, and while that resets, I get no interactivity (of course). I don't use swap (since I have an SSD and have 3GB of RAM).

my /var/log/messages is plagued with:

Apr 7 10:07:52 lenovo-w500 kernel: [ 104.537048] hub 1-0:1.0: connect-debounce failed, port 3 disabled

prolly un-related; but I get some flurry of oom-killer activity in the logs:

Apr 7 10:01:45 lenovo-w500 kernel: [51723.303571] Pid: 4462, comm: cc1plus Not tainted 2.6.37.1-1.2-desktop #1
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303572] Call Trace:
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303594] [<c02062a3>] try_stack_unwind+0x173/0x190
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303599] [<c0204ebf>] dump_trace+0x3f/0xe0
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303603] [<c020630b>] show_trace_log_lvl+0x4b/0x60
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303606] [<c0206338>] show_trace+0x18/0x20
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303611] [<c068d44a>] dump_stack+0x6d/0x72
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303616] [<c02da1e4>] dump_header+0x84/0x1e0
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303620] [<c02da7b0>] oom_kill_process+0x90/0x190
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303624] [<c02dab77>] out_of_memory+0xd7/0x200
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303628] [<c02defc8>] __alloc_pages_nodemask+0x678/0x690
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303642] [<c030e417>] alloc_pages_current+0x77/0xd0
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303646] [<c02e14e1>] __do_page_cache_readahead+0xf1/0x220
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303650] [<c02e18ee>] ra_submit+0x1e/0x30
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303653] [<c02d9747>] filemap_fault+0x347/0x410
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303660] [<c02f5212>] __do_fault+0x52/0x510
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303664] [<c02f93f9>] handle_mm_fault+0x169/0x410
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303668] [<c0692ff0>] do_page_fault+0x170/0x4b0
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303672] [<c06909c6>] error_code+0x5a/0x60
Apr 7 10:01:45 lenovo-w500 kernel: [51723.303690] [<081aad20>] 0x81aad20

other interesting things I've not seen recently:

Apr 7 10:02:05 lenovo-w500 rtkit-daemon[1735]: The canary thread is apparently starving. Taking action.
Apr 7 10:02:06 lenovo-w500 rtkit-daemon[1735]: Demoting known real-time threads.
Apr 7 10:02:07 lenovo-w500 rtkit-daemon[1735]: Demoted 0 threads.

my kernel is:

Linux lenovo-w500 2.6.37.1-1.2-desktop #1 SMP PREEMPT 2011-02-21 10:34:10 +0100 i686 i686 i386 GNU/Linux

rpm -q --changelog kernel-desktop | head
* Mon Feb 21 2011 tiwai@suse.de
- ALSA: caiaq - Fix possible string-buffer overflow (bnc#672499, CVE-2011-0712).
- commit f6a72cc

hwinfo attached - a modern Lenovo W500 laptop.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

https://bugzilla.novell.com/show_bug.cgi?id=685777#c1

--- Comment #1 from Michael Meeks <mmeeks@novell.com> 2011-04-07 09:16:13 UTC ---
Created an attachment (id=423646)
 --> (http://bugzilla.novell.com/attachment.cgi?id=423646)
hwinfo

https://bugzilla.novell.com/show_bug.cgi?id=685777#c2

--- Comment #2 from Michael Meeks <mmeeks@novell.com> 2011-04-07 09:49:34 UTC ---
The great news is - that upgrading to the tumbleweed kernel:

Linux lenovo-w500 2.6.38-18-desktop #1 SMP PREEMPT 2011-03-20 22:25:37 +0100 i686 i686 i386 GNU/Linux

makes things wedge even more quickly when building; I didn't even get my mail client started before it died ;-)

Ho hum; looks like sticking with openSUSE 11.3 is the only option [ amazing ]

https://bugzilla.novell.com/show_bug.cgi?id=685777#c3

Michael Meeks <mmeeks@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Critical                    |Major

--- Comment #3 from Michael Meeks <mmeeks@novell.com> 2011-04-07 10:07:26 UTC ---
Hah - so, there is a PEBKAC element here; I had not enabled icecream - which throttles compiler processes to a manageable number - down from 60+ to one (on this machine).

Having said that, the OOM killer / load balancing seems to have done a pretty terrible job in this instance; easy to reproduce: just checkout libreoffice, zypper si -d libreoffice-bootstrap and autogen with:

'--with-num-cpus=16' '--with-max-jobs=4'

:-)

https://bugzilla.novell.com/show_bug.cgi?id=685777#c

Jeff Mahoney <jeffm@novell.com> changed:

           What    |Removed                                    |Added
----------------------------------------------------------------------------
                 CC|                                           |jeffm@novell.com
         AssignedTo|kernel-maintainers@forge.provo.novell.com  |mgorman@novell.com

https://bugzilla.novell.com/show_bug.cgi?id=685777#c4

--- Comment #4 from Mel Gorman <mgorman@novell.com> 2011-04-15 18:56:11 UTC ---
Can we see the full OOM message, particularly the memory-related information that is usually displayed after the stack trace? What would also be helpful is the contents of /proc/slabinfo as close to the time of failure as possible. Ideally include the kernel's config but if you don't have it handy, I'll assume it's the default openSUSE 11.4 kernel.

https://bugzilla.novell.com/show_bug.cgi?id=685777#c5

--- Comment #5 from Michael Meeks <mmeeks@novell.com> 2011-05-20 11:48:34 UTC ---
Hi Mel, really sorry about this - but while it was trivially reproducible, I can't find the time to dig into this;

zypper si -d libreoffice-bootstrap
git clone git://anongit.freedesktop.org/libreoffice/bootstrap
./autogen.sh --with-num-cpus=16 --with-max-jobs=4
make

should (after some hours the first time, and some minutes the second time) cause the wedge of wedges :-)

Might even be a useful test case in general - we do a lot of I/O and expose a lot of parallelism wrt. compiling I suppose - and each C++ compile chews a great chunk of memory :-)

Anyhow - if you can't get to that I guess we should close this.

https://bugzilla.novell.com/show_bug.cgi?id=685777#c6

--- Comment #6 from Mel Gorman <mgorman@novell.com> 2011-07-26 16:45:18 UTC ---
In partial response to this bug, I made a number of reclaim-related fixes in mainline that are now merged. Most of them dealt with kswapd consuming too much CPU or dumping too much memory, but I thought they might be related to this bug so I prepared a backport.

I configured a laptop with 3G memory and set it up for building libreoffice, including the disabling of swap. The machine did not lock up but the OOM trigger did fire and I see

Jul 26 17:27:08 micromek kernel: [ 1838.738311] Node 0 DMA32 free:6704kB min:6916kB low:8644kB high:10372kB active_anon:2851928kB inactive_anon:6040kB active_file:2848kB inactive_file:2900kB unevictable:2584kB isolated(anon):0kB isolated(file):1588kB present:3009804kB mlocked:0kB dirty:0kB writeback:3856kB mapped:1960kB shmem:4520kB slab_reclaimable:10596kB slab_unreclaimable:22220kB kernel_stack:2392kB pagetables:16920kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:480 all_unreclaimable? no

Page cache from files is reduced to negligible levels while anonymous memory is through the roof. This system is genuinely out of memory due to a lack of swap space, as the available memory is close to the minimum watermarks. When I examined the OOM message, 92% of memory was consumed by cc1plus.

I don't think this is a kernel issue as such. Can you check the g++ versions you are using in openSUSE 11.3 and 11.4? A difference in version could explain an increase in memory usage for newer optimisations that were just enough to push your system over the edge.
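
The reading of the zone line in comment #6 can be checked mechanically. The following is a minimal sketch, assuming Python 3; the helper names `parse_zone_kb` and `summarise` are hypothetical and not part of any kernel tooling. It pulls the `name:NNNkB` fields out of the quoted line (shortened here to the fields used), compares `free` against the `min` watermark, and computes how much of `present` is anonymous versus file-backed.

```python
import re

# Subset of the zone line quoted in comment #6 (values copied from the thread).
ZONE_LINE = (
    "Node 0 DMA32 free:6704kB min:6916kB low:8644kB high:10372kB "
    "active_anon:2851928kB inactive_anon:6040kB active_file:2848kB "
    "inactive_file:2900kB isolated(file):1588kB present:3009804kB"
)


def parse_zone_kb(line):
    """Return {field: kB} for every 'name:NNNkB' pair found in the line."""
    return {name: int(kb) for name, kb in re.findall(r"([\w()]+):(\d+)kB", line)}


def summarise(zone):
    """Compare free memory with the min watermark and split anon vs. file."""
    anon = zone["active_anon"] + zone["inactive_anon"]
    file_cache = zone["active_file"] + zone["inactive_file"]
    return {
        "below_min_watermark": zone["free"] < zone["min"],
        "anon_share": anon / zone["present"],
        "file_share": file_cache / zone["present"],
    }


if __name__ == "__main__":
    result = summarise(parse_zone_kb(ZONE_LINE))
    print("free below min watermark:", result["below_min_watermark"])
    print("anonymous share of zone: {:.1%}".format(result["anon_share"]))
    print("file cache share of zone: {:.2%}".format(result["file_share"]))
```

With the numbers from the thread, free memory sits just under the minimum watermark and anonymous pages account for about 95% of the zone, which is consistent with the "genuinely out of memory" conclusion above.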

https://bugzilla.novell.com/show_bug.cgi?id=685777#c7

--- Comment #7 from Michael Meeks <mmeeks@novell.com> 2011-07-27 17:13:03 UTC ---
Mel ! thanks so much for looking into that. Meanwhile I had had a similar problem with just overloading the machine by other means I think.

I wonder - so, it is entirely possible that my problems are related to the Intel SSD I have, is it possible that that exacerbates the paging issues ?

Is it possible you have a new kernel I can try to see if the issue is fixed ? I feel rather motivated to test it after all your hard work :-) Sadly I'm on vacation shortly for two weeks though so ...

Thanks anyhow !

https://bugzilla.novell.com/show_bug.cgi?id=685777#c8

--- Comment #8 from Mel Gorman <mgorman@novell.com> 2011-07-28 10:05:44 UTC ---
(In reply to comment #7)
> Mel ! thanks so much for looking into that. Meanwhile I had had a similar problem with just overloading the machine by other means I think.
>
> I wonder - so, it is entirely possible that my problems are related to the Intel SSD I have, is it possible that that exacerbates the paging issues ?

What is the reproduction scenario?

The speed of the SSD could be masking the fact that the machine is almost out-of-memory but not enough to trigger the OOM killer. If you have no swap configured and anonymous memory is occupying a high percentage of memory (e.g. 85%) then a significant percentage of time will be spent paging to and from the SSD. On a slower disk, the machine would become extremely unresponsive. This thrashing would be visible as a high page in/out rate in "vmstat -n 1". The percentage of memory that is anonymous can be determined from the nr_active_anon and nr_inactive_anon fields in /proc/vmstat.

Can you tell me if this is the case? If so, it's not a bug in the kernel unless it is a memory leak that is causing the lack of memory. Just attaching the contents of /proc/vmstat when the machine is running very slow would be helpful in determining what is going on here.

> Is it possible you have a new kernel I can try to see if the issue is fixed?

I didn't build a new kernel with the backport yet. However, unless there is sufficient free memory and kswapd is still using a lot of CPU, the patches will not help.
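
The /proc/vmstat check suggested in comment #8 could look roughly like this. It is a minimal sketch, assuming Python 3 on Linux; the function names are hypothetical, and taking total RAM from `os.sysconf("SC_PHYS_PAGES")` is an assumption of this sketch rather than something stated in the thread. The counters `nr_active_anon` and `nr_inactive_anon` are page counts, so they are divided by the total number of physical pages.

```python
import os


def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def anon_fraction(vmstat_text, total_pages):
    """Fraction of RAM held by anonymous pages (active + inactive)."""
    counters = parse_vmstat(vmstat_text)
    anon_pages = counters["nr_active_anon"] + counters["nr_inactive_anon"]
    return anon_pages / total_pages


if __name__ == "__main__" and os.path.exists("/proc/vmstat"):
    with open("/proc/vmstat") as f:
        text = f.read()
    # Assumption: total RAM in pages as reported by sysconf.
    total = os.sysconf("SC_PHYS_PAGES")
    print("anonymous memory: {:.1%} of RAM".format(anon_fraction(text, total)))
```

Run while the machine is slow, a value around or above the 85% mentioned above, together with a high page in/out rate in "vmstat -n 1", would point at memory pressure rather than a kernel bug.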

https://bugzilla.novell.com/show_bug.cgi?id=685777#c9

Mel Gorman <mgorman@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CLOSED
         Resolution|                            |FIXED

--- Comment #9 from Mel Gorman <mgorman@suse.com> 2011-10-10 07:56:13 UTC ---
It still looks like the original problem was the machine genuinely going OOM due to the lack of swap when it was required. The reclaim-related patches that were developed as a result have been pushed to mainline but a backport is not justified as they wouldn't affect this particular problem. The second reported problem looks like a thrashing issue that is also due to a lack of memory rather than a kernel issue. There is a mainline bug where too much memory can be reclaimed if transparent hugepage support is available but I don't see evidence that it is happening here so I'm closing this bug for the moment.

https://bugzilla.novell.com/show_bug.cgi?id=685777#c10

--- Comment #10 from Michael Meeks <mmeeks@suse.com> 2011-11-01 16:26:33 UTC ---
I was using:

Linux lenovo-w500 2.6.37.1-1.2-desktop #1 SMP PREEMPT 2011-02-21 10:34:10 +0100 i686 i686 i386 GNU/Linux

and had a couple of similar hangs while loading my machine over the last week: annoying. Then I wrote this test which kills it dead.

Interestingly, having upgraded to 3.1.0 ( to try un-successfully to fix the churning connect debounce syslog thrash ;-) - this now crashes successfully on start: which is ideal; whereas in 2.6.37 it totalled the machine.

So - progress somewhere I suppose :-)

https://bugzilla.novell.com/show_bug.cgi?id=685777#c11

--- Comment #11 from Michael Meeks <mmeeks@suse.com> 2011-11-01 16:33:27 UTC ---
Created an attachment (id=459760)
 --> (http://bugzilla.novell.com/attachment.cgi?id=459760)
alloc lots of memory and touch it.

https://bugzilla.novell.com/show_bug.cgi?id=685777#c12

--- Comment #12 from Mel Gorman <mgorman@suse.com> 2011-11-01 17:37:39 UTC ---
Has swap been configured? Your test case is filling the system with 4G of anonymous pages on a machine with 3G of RAM. If there is no swap configured, I would expect it to go OOM due to overcommit being enabled by default. With swap configured, it will grind heavily while it thrashes swap for a while but should recover eventually.

However, if you have wireless on this laptop, it is possible that the thrashing is due to trying to allocate order-1 pages for frames depending on the model of card. Similarly, fork-heavy workloads require order-1 pages. On systems with a lot of anonymous pages (such as your testcase) and without swap, 2.6.37 behaves badly and can take an extremely long time to fork new processes because basically it has little choice other than to wait for processes to exit.

Enabling swap mitigates this somewhat but there were also a number of changes made between 2.6.37 and 3.1 related to allocating high-order pages quickly that would make a noticeable difference. They are not likely to get backported any time soon, though.
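
The attachment from comment #11 is not reproduced in this thread, so the following is not Michael's test. It is only a minimal sketch, assuming Python 3, of the general technique its description names: allocate anonymous memory and touch every page so the kernel has to back it with real RAM. The function name `alloc_and_touch` and the 64 MiB demo size are hypothetical choices for illustration.

```python
import mmap

DEMO_BYTES = 64 * 1024 * 1024  # deliberately small; hypothetical default


def alloc_and_touch(n_bytes, page_size=mmap.PAGESIZE):
    """Map n_bytes of anonymous memory and write one byte into every page.

    Returns the mapping (keep a reference to hold the memory) and the
    number of pages that were touched.
    """
    region = mmap.mmap(-1, n_bytes)  # fd -1 requests an anonymous mapping
    touched = 0
    for offset in range(0, n_bytes, page_size):
        region[offset] = 1  # writing forces the page to be allocated
        touched += 1
    return region, touched


if __name__ == "__main__":
    region, pages = alloc_and_touch(DEMO_BYTES)
    print("touched", pages, "pages of", mmap.PAGESIZE, "bytes")
    region.close()
```

Raising the size above physical RAM on a machine without swap is expected, per comment #12, to end in the OOM killer, so only do that on a machine you are prepared to lose.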

https://bugzilla.novell.com/show_bug.cgi?id=685777#c13

--- Comment #13 from Michael Meeks <mmeeks@suse.com> 2011-11-01 18:23:25 UTC ---
Ah - interesting; so the system was odd - with some things remaining interactive - eg. mingetty - but not able to log-in: which points to the process forking problem.

Anyhow - if it is fixed for kernel<next> - I'm happy. I have no swap, I'm on an SSD.