[Bug 1129214] New: kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214

Bug ID: 1129214
Summary: kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.0
Hardware: x86-64
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: chris@computersalat.de
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 800058 --> http://bugzilla.opensuse.org/attachment.cgi?id=800058&action=edit
all 'kernel' messages from 'messages' log

This BUG makes the system 'unresponsive' and almost 'unusable'. Have a look at the attachment for more info. If you need more info, please tell me how and what to provide. Thank you

-- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c1
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c3
--- Comment #3 from Christian Wittmer
> These all seem to be temporary allocation stalls, 3 hitting one window Mar 8 and one Mar 13. One hitting xfs, 2 skb and one vfs allocation paths. So nothing really systematic. Checking the free memory proves that the memory was mostly used.
> I strongly suspect that some part of the memory reclaim (shrinkers probably) took an excessive amount of time to make forward progress. We would need vmscan tracepoints data to know better though.
How can I provide such info about 'vmscan tracepoints data'?
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c6
Christian Wittmer
> There are two ways. You can either mount tracefs and enable $TRACEFS_MNT/events/vmscan/{mm_vmscan_direct_reclaim_begin,mm_vmscan_direct_reclaim_end,mm_shrink_slab_start,mm_shrink_slab_end} and
How do I enable them?
> read the output from $TRACEFS_MNT/trace_pipe or use trace-cmd to do the same.
Sorry, I am not that professional. I would need more detailed info about what to do ...
> Btw. is this reproducible?
Probably, when the 'rsync' jobs are running again ... Some days ago the server died because of a stack trace. I needed to 'reset', which caused a raid failure. After repairing the raid (rebuild is still running ... 900min) I will provide logs ...
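For the first of the two ways quoted above (enabling the events directly in tracefs), a minimal sketch could look like the following. It assumes tracefs is mounted at /sys/kernel/tracing; the TRACEFS_MNT default, the function name and the output file name are illustrative and not taken from the thread.

```shell
#!/bin/sh
# Sketch: enable or disable the four vmscan tracepoints named in the thread.
# TRACEFS_MNT is an assumption; point it at wherever tracefs is mounted.
TRACEFS_MNT="${TRACEFS_MNT:-/sys/kernel/tracing}"

VMSCAN_EVENTS="mm_vmscan_direct_reclaim_begin mm_vmscan_direct_reclaim_end mm_shrink_slab_start mm_shrink_slab_end"

# set_vmscan_events 1 enables the events, set_vmscan_events 0 disables them.
set_vmscan_events() {
    value="$1"
    for ev in $VMSCAN_EVENTS; do
        f="$TRACEFS_MNT/events/vmscan/$ev/enable"
        if [ -w "$f" ]; then
            echo "$value" > "$f"
        else
            echo "cannot write $f (is tracefs mounted, are you root?)" >&2
            return 1
        fi
    done
}

# Typical use, as root (hypothetical file name):
#   mount -t tracefs nodev /sys/kernel/tracing   # only if not mounted yet
#   set_vmscan_events 1
#   cat "$TRACEFS_MNT/trace_pipe" > /tmp/vmscan-trace.txt   # Ctrl-C to stop
#   set_vmscan_events 0
```

Reading trace_pipe blocks until events arrive, so it is usually left running in a second terminal while the rsync jobs are active.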
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c7
Christian Wittmer
> Looks like rsync is exerting a lot of memory pressure, and potentially writeback at the same time cannot keep up (it seems that the md layer has pending data to be flushed). Perhaps also a stacktrace of the blocked tasks (echo w > /proc/sysrq-trigger) would give a bit more insight next time this occurs.
So when 'rsync' is exerting memory pressure again I should run the 'echo w ...' command? Where will I see the 'bit more insight' then?
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c8
Anthony Iliopoulos
> (In reply to Anthony Iliopoulos from comment #5)
> > Looks like rsync is exerting a lot of memory pressure, and potentially writeback at the same time cannot keep up (it seems that the md layer has pending data to be flushed). Perhaps also a stacktrace of the blocked tasks (echo w > /proc/sysrq-trigger) would give a bit more insight next time this occurs.
> So when 'rsync' is exerting memory pressure again I should run the 'echo w ...' command? Where will I see the 'bit more insight' then?
Hi Christian,

When you see things getting stuck/not progressing, you can do echo w > /proc/sysrq-trigger as root. This will dump a list of all blocked tasks and their stack traces to the kernel ring buffer log, which you can then view by running dmesg as root (save it to a file and attach it here).

This will hopefully shed some more light, as we will be able to see where each blocked task is waiting, which could help us better identify what is going on.

[You can see an example of what this looks like on a per-process basis by doing cat /proc/<pid>/stack (this will dump the current stack of the task running with the specified pid, irrespective of its state, e.g. whether it is blocked or not)].
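The two steps described above (trigger the dump, then save the kernel log) could be wrapped up roughly as follows. This is only a sketch: the function name, the default output file name and the SYSRQ_TRIGGER and DMESG_CMD variables are illustrative assumptions, and it has to be run as root.

```shell
#!/bin/sh
# Sketch: dump blocked-task stack traces and save the kernel log to a file.
# SYSRQ_TRIGGER and DMESG_CMD are overridable assumptions for illustration.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
DMESG_CMD="${DMESG_CMD:-dmesg}"

dump_blocked_tasks() {
    out="${1:-blocked-tasks-$(date +%Y%m%d-%H%M%S).log}"
    # 'w' asks the kernel to log all tasks that are currently blocked.
    echo w > "$SYSRQ_TRIGGER" || return 1
    # The traces end up in the kernel ring buffer; dmesg reads it back.
    $DMESG_CMD > "$out" || return 1
    # Print the file name so it is easy to find and attach to the bug.
    echo "$out"
}

# Typical use, as root, while the system appears stuck:
#   dump_blocked_tasks /root/blocked-tasks.log
```

Running it a few times, some seconds apart, shows whether the same tasks stay blocked in the same place.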
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c9
Anthony Iliopoulos
> (In reply to Michal Hocko from comment #4)
> > There are two ways. You can either mount tracefs and enable $TRACEFS_MNT/events/vmscan/{mm_vmscan_direct_reclaim_begin,mm_vmscan_direct_reclaim_end,mm_shrink_slab_start,mm_shrink_slab_end} and
> How do I enable them?
> > read the output from $TRACEFS_MNT/trace_pipe or use trace-cmd to do the same.
> Sorry, I am not that professional. I would need more detailed info about what to do ...
You can install the trace-cmd package, and run something like:

    sudo trace-cmd start -e mm_vmscan_direct_reclaim_begin -e mm_vmscan_direct_reclaim_end -e mm_shrink_slab_start -e mm_shrink_slab_end

This will enable tracing and recording of those enabled events in a dedicated kernel buffer. You can dump the buffer anytime via trace-cmd show. You can stop the tracing of the events via trace-cmd stop. The recorded events buffer will still be available for reading/dumping into a file via trace-cmd show, until you do trace-cmd reset, which will clear everything.
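As a rough sketch, the start / stop / show sequence described above could be scripted like this. The function names, the TRACE_CMD variable and the default output file name are illustrative assumptions; only the trace-cmd subcommands and event names come from the comment above.

```shell
#!/bin/sh
# Sketch: drive trace-cmd for the four vmscan events named in the thread.
# TRACE_CMD is an overridable assumption (e.g. "sudo trace-cmd").
TRACE_CMD="${TRACE_CMD:-trace-cmd}"

VMSCAN_EVENTS="mm_vmscan_direct_reclaim_begin mm_vmscan_direct_reclaim_end mm_shrink_slab_start mm_shrink_slab_end"

# Build the "-e <event>" argument list for trace-cmd start.
vmscan_event_args() {
    args=""
    for ev in $VMSCAN_EVENTS; do
        args="$args -e $ev"
    done
    echo "${args# }"
}

# Start recording the events into the kernel trace buffer.
start_vmscan_trace() {
    # Word splitting of the unquoted expansions is intentional here.
    $TRACE_CMD start $(vmscan_event_args)
}

# Stop tracing and write the buffer contents to a file.
save_vmscan_trace() {
    $TRACE_CMD stop && $TRACE_CMD show > "${1:-vmscan-trace.txt}"
}

# Typical use, as root:
#   start_vmscan_trace            # before the rsync jobs begin
#   save_vmscan_trace trace.txt   # once the stall has been observed
#   trace-cmd reset               # clear everything afterwards
```

The saved file can then be attached to the bug report.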
> > Btw. is this reproducible?
> Probably, when the 'rsync' jobs are running again ...
> Some days ago the server died because of a stack trace. I needed to 'reset', which caused a raid failure. After repairing the raid (rebuild is still running ... 900min) I will provide logs ...
I'd suggest setting up kdump (unless it's already set by default), see [1] for more info and setting up via yast etc.

Next time you need to forcibly reset/reboot the system for whatever reason, you can trigger a crash via echo c > /proc/sysrq-trigger. This will force a kernel reboot as well as leverage kdump to keep a crash-dump of the kernel memory that we can use to analyze the problem in-depth.

[1] https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha....
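Since echo c crashes the machine immediately, it may be worth checking first that a crash kernel is actually loaded, otherwise no dump will be kept. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (which reads 1 when a crash kernel is loaded); the function and variable names are illustrative and not part of the suggestion above:

```shell
#!/bin/sh
# Sketch: only trigger the deliberate crash when kdump looks ready.
# Both paths are overridable assumptions for illustration.
KEXEC_CRASH_LOADED="${KEXEC_CRASH_LOADED:-/sys/kernel/kexec_crash_loaded}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Succeeds when the kernel reports a loaded crash (kdump) kernel.
kdump_ready() {
    [ "$(cat "$KEXEC_CRASH_LOADED" 2>/dev/null)" = "1" ]
}

# WARNING: on a real system this crashes the kernel on purpose.
crash_with_dump() {
    if kdump_ready; then
        echo c > "$SYSRQ_TRIGGER"
    else
        echo "no crash kernel loaded; refusing to crash without kdump" >&2
        return 1
    fi
}

# Typical use, as root, instead of pressing the reset button:
#   kdump_ready && echo "kdump is armed"
#   crash_with_dump
```

The guard only checks that a crash kernel is loaded; whether the dump is then written successfully still depends on the kdump configuration described in [1].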