[kernel-bugs] [Bug 1174116] New: Kernel repeatedly crashing
https://bugzilla.suse.com/show_bug.cgi?id=1174116

            Bug ID: 1174116
           Summary: Kernel repeatedly crashing
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.2
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: msvec@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 839676
  --> https://bugzilla.suse.com/attachment.cgi?id=839676&action=edit
kernel log

This is the latest 15.2 with kernel-default-5.3.18-lp152.20.7.1.x86_64

The only thing I needed to do was:
 - kinit
 - access a file in the NFS directory

See the attached crash log

--
You are receiving this mail because:
You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c1

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |msvec@suse.com,
                   |                            |tiwai@suse.com
              Flags|                            |needinfo?(msvec@suse.com)

--- Comment #1 from Takashi Iwai <tiwai@suse.com> ---
Could you test with the kernel in OBS Kernel:openSUSE-15.2 repo?

Your problem looks different from others we've already addressed, but just to make sure.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c3

Michal Svec <msvec@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(msvec@suse.com)   |

--- Comment #3 from Michal Svec <msvec@suse.com> ---
Same with kernel-default-5.3.18-lp152.85.1.gc6b9ae1.x86_64
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c4

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mhocko@suse.com,
                   |                            |vbabka@suse.com

--- Comment #4 from Takashi Iwai <tiwai@suse.com> ---
Thanks. Could you attach the crash log from the new kernel as well?

The Oops looks like some mm-related bug (RIP at kmem_cache_alloc).
Michal H, Vlastimil, could you take a look?
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c5

--- Comment #5 from Vlastimil Babka <vbabka@suse.com> ---
Try booting with kernel parameter slub_debug=FU
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c6

--- Comment #6 from Michal Svec <msvec@suse.com> ---
Created attachment 839704
  --> https://bugzilla.suse.com/attachment.cgi?id=839704&action=edit
kernel log (Kernel:openSUSE-15.2)

(In reply to Takashi Iwai from comment #4)
> Thanks. Could you attach the crash log from the new kernel as well?

Attached.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c7

--- Comment #7 from Vlastimil Babka <vbabka@suse.com> ---
> (In reply to Takashi Iwai from comment #4)
> > Thanks. Could you attach the crash log from the new kernel as well?
>
> Attached.

Was this with slub_debug=FU?
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c8

--- Comment #8 from Michal Svec <msvec@suse.com> ---
(In reply to Vlastimil Babka from comment #7)
> > (In reply to Takashi Iwai from comment #4)
> > > Thanks. Could you attach the crash log from the new kernel as well?
> >
> > Attached.
>
> Was this with slub_debug=FU ?

That was without, just with a different kernel.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c9

--- Comment #9 from Michal Svec <msvec@suse.com> ---
(In reply to Vlastimil Babka from comment #5)
> Try booting with kernel parameter slub_debug=FU

And this seems to have helped. The host doesn't crash any more (with the 15.2 stock kernel).
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c10

--- Comment #10 from Vlastimil Babka <vbabka@suse.com> ---
(In reply to Michal Svec from comment #9)
> And this seems to have helped. The host doesn't crash any more (with the 15.2 stock kernel).

Eh, too bad, it was supposed to catch the culprit, not prevent the bug from manifesting by changed layout/timing...

Let's try being more specific about the debugging and limit it only to suspected caches. What happens if you boot with just this parameter?

slub_nomerge

And then, cat /proc/slabinfo | grep biovec, is the output like this?

gehinom:~ # cat /proc/slabinfo | grep biovec
biovec-128 272 272 2048 16 8 : tunables 0 0 0 : slabdata 17 17 0
biovec-64  544 544 1024 32 8 : tunables 0 0 0 : slabdata 17 17 0

Then try booting N times with a special parameter for each biovec-X in the list above, e.g. in my case, 2 times:

slub_nomerge slub_debug=FU,biovec-64
slub_nomerge slub_debug=FU,biovec-128

Thanks.
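The per-cache procedure from comment #10 can be sketched as a small script. This is only an illustration, not something posted in the bug: the helper names `biovec_caches` and `debug_boot_params` are hypothetical, and the sample text is the slabinfo output quoted in comment #10. On a live system the input would come from /proc/slabinfo, which normally needs root to read.

```python
def biovec_caches(slabinfo_text):
    """Return the names of slab caches whose name starts with 'biovec'."""
    names = []
    for line in slabinfo_text.splitlines():
        fields = line.split()
        # The cache name is the first whitespace-separated field of each row.
        if fields and fields[0].startswith("biovec"):
            names.append(fields[0])
    return names


def debug_boot_params(cache_names, flags="FU"):
    """Build one 'slub_nomerge slub_debug=<flags>,<cache>' string per cache."""
    return [f"slub_nomerge slub_debug={flags},{name}" for name in cache_names]


# Sample input copied from comment #10 (hypothetical variable name).
SAMPLE = """\
biovec-128 272 272 2048 16 8 : tunables 0 0 0 : slabdata 17 17 0
biovec-64 544 544 1024 32 8 : tunables 0 0 0 : slabdata 17 17 0
"""

if __name__ == "__main__":
    # One boot attempt per printed line.
    for params in debug_boot_params(biovec_caches(SAMPLE)):
        print(params)
```

Each printed line is the parameter set for one test boot, so the number of boots equals the number of biovec caches found.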
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c11

--- Comment #11 from Michal Svec <msvec@suse.com> ---
beholder:~$ sudo cat /proc/slabinfo | grep biovec
biovec-max 230 272 4096  8 8 : tunables 0 0 0 : slabdata 34 34 0
biovec-128 208 240 2048 16 8 : tunables 0 0 0 : slabdata 15 15 0
biovec-64  480 576 1024 32 8 : tunables 0 0 0 : slabdata 18 18 0
biovec-16  256 256  256 32 2 : tunables 0 0 0 : slabdata  8  8 0

The system does not crash with any of biovec-128, biovec-64, or biovec-16.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c12

--- Comment #12 from Michal Svec <msvec@suse.com> ---
FTR, with "slub_nomerge" alone it crashes as well.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c13

--- Comment #13 from Michal Svec <msvec@suse.com> ---
This one worked for some time but then crashed too:

slub_nomerge slub_debug=FU,biovec-128
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c14

--- Comment #14 from Vlastimil Babka <vbabka@suse.com> ---
(In reply to Michal Svec from comment #13)
> This one worked for some time but then crashed too:
>
> slub_nomerge slub_debug=FU,biovec-128

And the crash looked the same as before? Not different?

So if the crashing or not is not that deterministic, let's try "slub_nomerge slub_debug=FU,biovec-*" and let it run for some time. Thanks.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c15

--- Comment #15 from Michal Svec <msvec@suse.com> ---
Created attachment 839764
  --> https://bugzilla.suse.com/attachment.cgi?id=839764&action=edit
kernel log (slub_nomerge slub_debug=FU,biovec-128)

> This one worked for some time but then crashed too:
>
> slub_nomerge slub_debug=FU,biovec-128

Here's the crash log.
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c16

--- Comment #16 from Michal Svec <msvec@suse.com> ---
Created attachment 839773
  --> https://bugzilla.suse.com/attachment.cgi?id=839773&action=edit
kernel log (slub_nomerge slub_debug=F,biovec-128)

This one also crashes:

BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.85.gc6b9ae1-default root=UUID=54125a0f-3254-4609-9079-1ea01e4433d8 resume=/dev/disk/by-uuid/80e7c610-537c-4261-98ae-2ec8bcdc1d22 splash=silent quiet showopts slub_nomerge slub_debug=F,biovec-128
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c17

--- Comment #17 from Michal Svec <msvec@suse.com> ---
Created attachment 839778
  --> https://bugzilla.suse.com/attachment.cgi?id=839778&action=edit
kernel log (slub_nomerge slub_debug=FU,biovec-*)

> So if the crashing or not is not that deterministic, let's try "slub_nomerge slub_debug=FU,biovec-*"

Crashed.

BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.85.gc6b9ae1-default root=UUID=54125a0f-3254-4609-9079-1ea01e4433d8 resume=/dev/disk/by-uuid/80e7c610-537c-4261-98ae-2ec8bcdc1d22 splash=silent quiet showopts slub_nomerge slub_debug=FU,biovec-*
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c18

--- Comment #18 from Vlastimil Babka <vbabka@suse.com> ---
So the problem moved to sunrpc? Maybe we just need a crashdump then, can you set up kdump?
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c19

--- Comment #19 from Vlastimil Babka <vbabka@suse.com> ---
BTW, can you post a full dmesg from boot? There's an I taint, which means the kernel is not happy about something in the firmware, so there should be some details.
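For readers unfamiliar with taint letters, the sketch below decodes a taint bitmask such as the value in /proc/sys/kernel/tainted. It covers only a subset of the bits listed in the kernel's tainted-kernels documentation, and the names `TAINT_BITS` and `decode_taint` are hypothetical. The I flag mentioned in comment #19 corresponds to bit 11 (value 2048).

```python
# Subset of taint bits, as listed in the kernel's tainted-kernels documentation.
# Bits that are not in this table are silently ignored by decode_taint().
TAINT_BITS = {
    0: ("P", "proprietary module was loaded"),
    7: ("D", "kernel died recently (oops or BUG)"),
    9: ("W", "kernel issued a warning"),
    11: ("I", "workaround for a bug in platform firmware was applied"),
    12: ("O", "externally built (out-of-tree) module was loaded"),
    13: ("E", "unsigned module was loaded"),
}


def decode_taint(mask):
    """Return (letter, meaning) pairs for every known bit set in mask."""
    return [
        (letter, meaning)
        for bit, (letter, meaning) in sorted(TAINT_BITS.items())
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    # 2048 is bit 11 alone, i.e. only the I taint is set.
    for letter, meaning in decode_taint(2048):
        print(f"{letter}: {meaning}")
```

The taint value alone does not say which firmware workaround was applied; the full dmesg requested above is still needed for that.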
https://bugzilla.suse.com/show_bug.cgi?id=1174116

Vlastimil Babka <vbabka@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jack@suse.com,
                   |                            |nfbrown@suse.com
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c23

--- Comment #23 from Michal Svec <msvec@suse.com> ---
FTR, the system has now been running for 3 days just fine with these settings:

BOOT_IMAGE=/boot/vmlinuz-5.3.18-lp152.85.gc6b9ae1-default root=UUID=54125a0f-3254-4609-9079-1ea01e4433d8 resume=/dev/disk/by-uuid/80e7c610-537c-4261-98ae-2ec8bcdc1d22 splash=silent quiet showopts slub_nomerge slub_debug=FU crashkernel=202M,high crashkernel=72M,low panic_on_oops=1
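Since several command lines with different SLUB options appear in this thread, a small parser can help confirm which options a given boot actually used. This is a rough sketch under assumptions: the function name `parse_slub_options` is hypothetical, it only understands the `slub_nomerge` and `slub_debug=<flags>[,<cache>...]` forms seen in this bug, and the flag meanings are the commonly documented ones for SLUB debugging.

```python
# Commonly documented slub_debug flag letters (subset).
SLUB_DEBUG_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking (alloc/free owner)",
    "T": "trace",
}


def parse_slub_options(cmdline):
    """Extract slub_nomerge and slub_debug settings from a kernel command line.

    Only the 'slub_debug=<flags>[,<cache>...]' form is handled; a bare
    'slub_debug' token without '=' is not recognised by this sketch.
    """
    opts = {"nomerge": False, "flags": "", "caches": []}
    for token in cmdline.split():
        if token == "slub_nomerge":
            opts["nomerge"] = True
        elif token.startswith("slub_debug="):
            value = token.split("=", 1)[1]
            flags, _, caches = value.partition(",")
            opts["flags"] = flags
            # An empty cache list means the flags apply to all caches.
            opts["caches"] = caches.split(",") if caches else []
    return opts


if __name__ == "__main__":
    # On a running system the string would come from /proc/cmdline.
    example = "splash=silent quiet showopts slub_nomerge slub_debug=FU panic_on_oops=1"
    opts = parse_slub_options(example)
    print(opts)
    for flag in opts["flags"]:
        print(flag, "=", SLUB_DEBUG_FLAGS.get(flag, "unknown to this sketch"))
```

With the settings from comment #23 the cache list is empty, which matches the observation that global debugging (rather than per-cache debugging) is what keeps the system stable.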
https://bugzilla.suse.com/show_bug.cgi?id=1174116#c24

Vlastimil Babka <vbabka@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|kernel-bugs@opensuse.org    |vbabka@suse.com

--- Comment #24 from Vlastimil Babka <vbabka@suse.com> ---
I've created a debug kernel that disables freelist hardening; hopefully that will tell us more about the freelist corruptions. It's building at https://build.opensuse.org/project/show/home:vbabka:bsc1174116

Once it's built, please download, install, and boot it without any extra options this time, except panic_on_oops and kdump.