[Bug 1129214] New: kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

bugzilla_noreply＠novell.com

14 Mar 2019

11:07

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214

Bug ID: 1129214
Summary: kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.0
Hardware: x86-64
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: chris@computersalat.de
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 800058
--> http://bugzilla.opensuse.org/attachment.cgi?id=800058&action=edit
all 'kernel' messages from 'messages' log

This BUG makes the system 'unresponsive' and almost 'unusable'. Have a look at the attachment for more info.

If you need more info, please tell me how and what to provide.

Thank you

--
You are receiving this mail because:
You are on the CC list for the bug.

bugzilla_noreply＠novell.com

14 Mar

11:26

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c1

Takashi Iwai changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ailiopoulos@suse.com,
                   |                            |jeffm@suse.com,
                   |                            |mhocko@suse.com,
                   |                            |tiwai@suse.com,
                   |                            |vbabka@suse.com

--- Comment #1 from Takashi Iwai ---
It looks like a page allocation stall from xfs_inode_alloc(). Its order is 0, so this is a really critical failure. Adding both FS and MM people to Cc, who should have a better clue.

Meanwhile, could you check whether the problem still happens with the latest Leap 15.0 KOTD, available in the OBS Kernel:openSUSE-15.0 repo, just to be sure?
http://download.opensuse.org/repositories/Kernel:/openSUSE-15.0/standard/

bugzilla_noreply＠novell.com

12:58

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c3

--- Comment #3 from Christian Wittmer ---
(In reply to Michal Hocko from comment #2)
> These all seem to be temporary allocation stalls: three hitting in one window on Mar 8 and one on Mar 13. One hit the xfs, two the skb, and one the vfs allocation path. So nothing really systematic. Checking the free memory shows that the memory was mostly used.
>
> I strongly suspect that some part of the memory reclaim (probably the shrinkers) took an excessive amount of time to make forward progress. We would need vmscan tracepoint data to know better, though.

How can I provide this 'vmscan tracepoint data'?

bugzilla_noreply＠novell.com

18 May

20:02

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c6

Christian Wittmer changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(mhocko@suse.com)

--- Comment #6 from Christian Wittmer ---
(In reply to Michal Hocko from comment #4)
> There are two ways. You can either mount tracefs and enable
> $TRACEFS_MNT/events/vmscan/{mm_vmscan_direct_reclaim_begin,mm_vmscan_direct_reclaim_end,mm_shrink_slab_start,mm_shrink_slab_end} and

How do I enable them?

> read the output from $TRACEFS_MNT/trace_pipe or use trace-cmd to do the same.

Sorry, I am not that professional. I would need more detailed info about what to do ...

> Btw. is this reproducible?

Probably, when the 'rsync' jobs are running again ...
A few days ago the server died because of a stack trace. I needed to 'reset' it, which caused a RAID failure. After repairing the RAID (the rebuild is still running ... 900 min) I will provide logs ...

bugzilla_noreply＠novell.com

20:06

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c7

Christian Wittmer changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(ailiopoulos@suse.com)

--- Comment #7 from Christian Wittmer ---
(In reply to Anthony Iliopoulos from comment #5)
> Looks like rsync is exerting a lot of memory pressure, and potentially writeback cannot keep up at the same time (it seems that the md layer has pending data to be flushed). Perhaps a stack trace of the blocked tasks (echo w > /proc/sysrq-trigger) would also give a bit more insight next time this occurs.

So when 'rsync' is exerting memory pressure again, should I run the 'echo w ...' command? Where will I see the 'bit more insight' then?

bugzilla_noreply＠novell.com

20:23

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c8

Anthony Iliopoulos changed:

           What    |Removed                         |Added
----------------------------------------------------------------------------
              Flags|needinfo?(ailiopoulos@suse.com) |

--- Comment #8 from Anthony Iliopoulos ---
(In reply to Christian Wittmer from comment #7)
> (In reply to Anthony Iliopoulos from comment #5)
> > Looks like rsync is exerting a lot of memory pressure, and potentially writeback cannot keep up at the same time (it seems that the md layer has pending data to be flushed). Perhaps a stack trace of the blocked tasks (echo w > /proc/sysrq-trigger) would also give a bit more insight next time this occurs.
>
> So when 'rsync' is exerting memory pressure again, should I run the 'echo w ...' command? Where will I see the 'bit more insight' then?

Hi Christian,

When you see things getting stuck or not progressing, you can run echo w > /proc/sysrq-trigger as root. This dumps a list of all blocked tasks and their stack traces to the kernel ring buffer, which you can then view by running dmesg as root (save it to a file and attach it here). This will hopefully shed some more light, as we will be able to see where each blocked task is waiting, which could help us better identify what is going on.

[You can see an example of what this looks like on a per-process basis by running cat /proc/<pid>/stack. This dumps the current stack of the task with the specified pid, irrespective of its state, e.g. whether it is blocked or not.]
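The capture step described above can be sketched as a small shell helper. This is only an illustration, assuming a POSIX shell run as root. The function name dump_blocked_tasks and the SYSRQ_TRIGGER and DMESG override variables are hypothetical; they are not part of any tool and exist only so the helper can be tried without touching the real kernel interface.

```shell
#!/bin/sh
# Hypothetical helper: ask the kernel to log all blocked (uninterruptible)
# tasks, then save the kernel ring buffer to a file for the bug report.
dump_blocked_tasks() {
    out="${1:-blocked-tasks.txt}"
    # Assumption: both defaults can be overridden for a dry run.
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    # 'w' asks for a dump of tasks in uninterruptible (blocked) state.
    echo w > "$trigger" || return 1
    # Unquoted on purpose so that an override such as "cat saved.log"
    # is split into a command and its argument.
    ${DMESG:-dmesg} > "$out"
}
```

Run as root while the system is stalling, for example dump_blocked_tasks /root/blocked-tasks.txt, and attach the resulting file to the bug.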

bugzilla_noreply＠novell.com

20:47

New subject: [Bug 1129214] kernel 4.12.14-lp150.12.48-default BUG: workqueue lockup

http://bugzilla.opensuse.org/show_bug.cgi?id=1129214
http://bugzilla.opensuse.org/show_bug.cgi?id=1129214#c9

Anthony Iliopoulos changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(mhocko@suse.com)  |

--- Comment #9 from Anthony Iliopoulos ---
(In reply to Christian Wittmer from comment #6)
> (In reply to Michal Hocko from comment #4)
> > There are two ways. You can either mount tracefs and enable
> > $TRACEFS_MNT/events/vmscan/{mm_vmscan_direct_reclaim_begin,mm_vmscan_direct_reclaim_end,mm_shrink_slab_start,mm_shrink_slab_end} and
>
> How do I enable them?
>
> > read the output from $TRACEFS_MNT/trace_pipe or use trace-cmd to do the same.
>
> Sorry, I am not that professional. I would need more detailed info about what to do ...

You can install the trace-cmd package and run something like:

sudo trace-cmd start -e mm_vmscan_direct_reclaim_begin -e mm_vmscan_direct_reclaim_end -e mm_shrink_slab_start -e mm_shrink_slab_end

This enables tracing and records the selected events in a dedicated kernel buffer. You can dump the buffer at any time via trace-cmd show, and stop tracing the events via trace-cmd stop. The recorded events remain available for reading or dumping into a file via trace-cmd show until you run trace-cmd reset, which clears everything.
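For the other route mentioned in comment #4, writing to the tracefs files directly, a minimal sketch might look like the following. It assumes a POSIX shell run as root and tracefs mounted at /sys/kernel/tracing; the function name set_vmscan_events is hypothetical, and the mount point is passed as an argument because it can differ between systems.

```shell
#!/bin/sh
# The four vmscan tracepoints named in comment #4.
VMSCAN_EVENTS="mm_vmscan_direct_reclaim_begin mm_vmscan_direct_reclaim_end mm_shrink_slab_start mm_shrink_slab_end"

# Hypothetical helper: switch those tracepoints on (1) or off (0).
# $1 = tracefs mount point (assumed default below), $2 = 1 or 0.
set_vmscan_events() {
    tracefs="${1:-/sys/kernel/tracing}"
    value="${2:-1}"
    for ev in $VMSCAN_EVENTS; do
        f="$tracefs/events/vmscan/$ev/enable"
        if [ ! -w "$f" ]; then
            echo "cannot write $f (is tracefs mounted, are you root?)" >&2
            return 1
        fi
        echo "$value" > "$f"
    done
}
```

As root this could be used as set_vmscan_events /sys/kernel/tracing 1, followed by cat /sys/kernel/tracing/trace_pipe > vmscan-trace.txt while the stall is happening, and set_vmscan_events /sys/kernel/tracing 0 afterwards. If tracefs is not mounted there, it may be reachable under /sys/kernel/debug/tracing instead.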

> > Btw. is this reproducible?
>
> Probably, when the 'rsync' jobs are running again ...
> A few days ago the server died because of a stack trace. I needed to 'reset' it, which caused a RAID failure. After repairing the RAID (the rebuild is still running ... 900 min) I will provide logs ...

I'd suggest setting up kdump (unless it is already set up by default); see [1] for more info, including how to set it up via YaST. The next time you need to forcibly reset or reboot the system for whatever reason, you can trigger a crash via echo c > /proc/sysrq-trigger. This forces a kernel reboot and uses kdump to keep a crash dump of the kernel memory, which we can use to analyze the problem in depth.

[1] https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha....
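Before forcing a crash it may be worth confirming that a crash kernel is actually loaded, since otherwise no dump would be written. The following is a minimal sketch, assuming a POSIX shell and the kernel's /sys/kernel/kexec_crash_loaded flag; the function name kdump_loaded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a crash (kdump) kernel is loaded.
# $1 = path of the flag file; the default is the kernel's sysfs flag,
# which reads "1" when a crash kernel has been loaded via kexec.
kdump_loaded() {
    flag="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}
```

A possible use is kdump_loaded && echo c > /proc/sysrq-trigger, run as root. Note that this crashes the machine immediately, so it only makes sense when a forced reset is needed anyway.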
