[Bug 1200259] New: Kernel Panic after Update to 5.3.18-150300.59.68.1
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259

            Bug ID: 1200259
           Summary: Kernel Panic after Update to 5.3.18-150300.59.68.1
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.3
          Hardware: x86-64
                OS: openSUSE Leap 15.3
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: fgruener@web.de
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 859446
  --> http://bugzilla.opensuse.org/attachment.cgi?id=859446&action=edit
dmesg from crash>log

After upgrading the default kernel ("kernel-default"), provided as a regular update from the official openSUSE Leap 15.3 repository (SLES upstream), from 5.3.18-150300.59.63.1-default to the newest Leap standard kernel 5.3.18-150300.59.68.1-default, I have already encountered several kernel panics. Sometimes this happened 2 to 3 times within a single day; the last time it took 2 to 3 days until it reoccurred. I have not yet found a way to reproduce the issue step by step; it happens simply while using my PC.

crash> log revealed a null pointer exception. Please see the attached dmesg.
Enabling kdump revealed the following backtrace:

crash> bt
PID: 23800  TASK: ffff8d1b32ae8000  CPU: 0  COMMAND: "fstrim"
 #0 [ffffb2134512f790] machine_kexec at ffffffffa7e6fe01
 #1 [ffffb2134512f7e8] __crash_kexec at ffffffffa7f595fd
 #2 [ffffb2134512f8b0] crash_kexec at ffffffffa7f5a4bd
 #3 [ffffb2134512f8c8] oops_end at ffffffffa7e36d3f
 #4 [ffffb2134512f8e8] no_context at ffffffffa7e82bbf
 #5 [ffffb2134512f950] do_page_fault at ffffffffa7e83e40
 #6 [ffffb2134512f980] page_fault at ffffffffa880130e
    [exception RIP: bfq_bio_bfqg+37]
    RIP: ffffffffa8277b55  RSP: ffffb2134512fa30  RFLAGS: 00010002
    RAX: 000000000000001f  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: ffff8d1b8f614e00  RSI: ffff8d1b7fd47200  RDI: ffff8d1b7fd47200
    RBP: ffff8d1887ff0800   R8: ffff8d199aeb54b8   R9: ffff8d199aeb5488
    R10: 0000000000000000  R11: ffff8d1ab7742e00  R12: ffff8d17c7744640
    R13: ffff8d1887ff0800  R14: ffff8d1b7fd47200  R15: ffff8d1b8d91a894
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb2134512fa40] bfq_bic_update_cgroup at ffffffffa8277e78
 #8 [ffffb2134512fa78] bfq_bio_merge at ffffffffa826ee9f
 #9 [ffffb2134512fad0] blk_mq_submit_bio at ffffffffa8248769
#10 [ffffb2134512fb58] submit_bio_noacct at ffffffffa823c343
#11 [ffffb2134512fbe8] submit_bio at ffffffffa823c3db
#12 [ffffb2134512fc38] submit_bio_wait at ffffffffa8234cc4
#13 [ffffb2134512fc78] blkdev_issue_discard at ffffffffa8243d20
#14 [ffffb2134512fd08] ext4_trim_fs at ffffffffc079a7ea [ext4]
#15 [ffffb2134512fe10] ext4_ioctl at ffffffffc0790ef6 [ext4]
#16 [ffffb2134512fef8] ksys_ioctl at ffffffffa80fadc2
#17 [ffffb2134512ff30] __x64_sys_ioctl at ffffffffa80fadf6
#18 [ffffb2134512ff38] do_syscall_64 at ffffffffa7e0538b
#19 [ffffb2134512ff50] entry_SYSCALL_64_after_hwframe at ffffffffa880008c
    RIP: 00007f73ecd54c47  RSP: 00007ffc4e041c08  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 000055764b87c310  RCX: 00007f73ecd54c47
    RDX: 00007ffc4e041c20  RSI: 00000000c0185879  RDI: 0000000000000003
    RBP: 000055764b8788c0   R8: 000055764b87c310   R9: 0000000000000002
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc4e041d00
    R13: 000055764b8788c0  R14: 0000000000000003  R15: 000055764b87a890
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

According to the changelog of the latest update, a patch was added to address:

- bfq: Update cgroup information before merging bio (bsc#1197926).

Searching the internet for "bfq_bic_update_cgroup+0x28/0x1b0 core dump opensuse", I also found:
https://lkml.kernel.org/linux-block/20220330124255.24581-2-jack@suse.cz/T/

That patch changes code in the same area as the functions in this stack:

@@ -2457,10 +2457,17 @@ static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,
+       bfq_bic_update_cgroup(bic, bio);

Maybe bio is not fully initialized? Maybe my reference is wrong, but it is remarkable that the kernel post contains a change in the same stack.

As this is my first post, I am not quite sure what other information might be helpful here. Please ask if you need further information.

Thanks already for taking this issue into account. And for your openSUSE.

Best regards
cyp

-- 
You are receiving this mail because:
You are on the CC list for the bug.
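The user-space half of the backtrace already shows which ioctl fstrim issued: ORIG_RAX is 0x10 (the ioctl syscall on x86-64) and RSI holds the request number 0xc0185879. The following is a minimal sketch for decoding that number, assuming the standard Linux ioctl layout (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31). The helper name decode_ioctl is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Sketch only: split a Linux ioctl request number into its fields.
# decode_ioctl is a hypothetical helper name used for illustration.
decode_ioctl() {
    req=$(( $1 ))
    dir=$(( (req >> 30) & 0x3 ))      # 1 = write, 2 = read, 3 = read and write
    size=$(( (req >> 16) & 0x3fff ))  # size of the argument structure in bytes
    typ=$(( (req >> 8) & 0xff ))      # "magic" type character
    nr=$(( req & 0xff ))              # command number within that type
    printf 'dir=%d size=%d type=0x%x nr=%d\n' "$dir" "$size" "$typ" "$nr"
}

# Value of RSI from the user-space register dump above:
# decode_ioctl 0xc0185879
# prints: dir=3 size=24 type=0x58 nr=121
```

Type 0x58 is 'X', and command 121 with a 24-byte argument matches FITRIM, defined as _IOWR('X', 121, struct fstrim_range) in linux/fs.h. This is consistent with ext4_trim_fs appearing in the kernel part of the stack.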
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259

Grüner <fgruener@web.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Kernel Panic after Update   |Kernel Panic in
                   |to 5.3.18-150300.59.68.1    |bfq_bic_update_cgroup while
                   |                            |fstrim after Update to
                   |                            |Kernel
                   |                            |5.3.18-150300.59.68.1
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c1

--- Comment #1 from Grüner <fgruener@web.de> ---
Digging further, I could identify a link between the execution of fstrim and the crashes. Since the installation of the new kernel, fstrim has never completed; the kernel panic always occurs. Maybe this helps when trying to reproduce the error.

Package version used: util-linux-2.36.2-150300.4.20.1.x86_64, updated in March, when it was still running under an older kernel without issues. The kernel panics first occurred with the introduction of the new kernel.

fstrim --version
fstrim from util-linux 2.36.2
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259

Grüner <fgruener@web.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fgruener@web.de
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c2

Ralf Kölmel <ralf.koelmel@kit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ralf.koelmel@kit.edu

--- Comment #2 from Ralf Kölmel <ralf.koelmel@kit.edu> ---
I have several systems running the 5.3.18-150300.59.68.1 kernel, and "fstrim -av" is executed weekly or daily as a cron job. So far I have not seen this kernel crash. I tried to execute fstrim manually to replicate the problem, but without success. One difference is that I use btrfs instead of ext4, but the code paths on the lower levels should be the same.
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c3

--- Comment #3 from Grüner <fgruener@web.de> ---
I was able to narrow it down further. I started the system in rescue mode. Then:

#systemctl isolate rescue.target
#fstrim -av => went through
#fstrim -av => Kernel panic

Restarted again in normal mode:

#systemctl isolate rescue.target
#fstrim -av => Kernel panic

Identical stack trace. The core dump is now more manageable: only 800 MB instead of the previous 8 GB. I can also attach the full dmesg if needed, to show details of the system.

Later I can try to isolate the call I suspect, to see if this helps. But this might only remove the symptom, not the root cause of why the structure handed over to bfq_bic_update_cgroup() is not valid.
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c4

--- Comment #4 from Ralf Kölmel <ralf.koelmel@kit.edu> ---
Judging by the kernel stack trace, this seems related to the bfq I/O scheduler. Can you check whether you can also replicate the problem with another I/O scheduler (e.g. none or mq-deadline, set with the command: echo "none" >> /sys/block/<device>/queue/scheduler)? With btrfs and bfq I have not been able to reproduce the problem so far. There seems to be a special runtime condition in your case.
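Before switching schedulers it helps to see which one is active on each device. The kernel marks the active scheduler with square brackets in /sys/block/<device>/queue/scheduler. A minimal sketch follows; the function names active_scheduler and list_schedulers are hypothetical and only used for illustration.

```shell
#!/bin/sh
# Sketch only: report the active I/O scheduler per block device.
# active_scheduler and list_schedulers are hypothetical helper names.

# Print the bracketed entry of a sysfs scheduler line, e.g.
# "mq-deadline kyber [bfq] none" gives "bfq". A line without brackets
# is printed unchanged.
active_scheduler() {
    case "$1" in
        *\[*\]*) printf '%s\n' "$1" | sed 's/.*\[\([^]]*\)\].*/\1/' ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# Walk all block devices that expose a scheduler file.
list_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Calling list_schedulers prints one "device: scheduler" line per block device, which makes it easy to confirm that a change written to the scheduler file actually took effect.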
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c5

--- Comment #5 from Grüner <fgruener@web.de> ---
# Check which Devices are trimmed
fstrim -anv
/raid/local: 0 B (dry run) trimmed on /dev/md0
/windows/D: 0 B (dry run) trimmed on /dev/sdb1

# What is behind md0
cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdb3[4] sdc3[3]
      536867680 blocks super 1.0 2 near-copies [2/2] [UU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

unused devices: <none>

# Show Schedulers
for i in md0 sdb sdc; do echo -n "$i:" ; cat /sys/block/$i/queue/scheduler; done
md0:none
sdb:mq-deadline kyber [bfq] none
sdc:mq-deadline kyber [bfq] none

# What Filesystem is behind
mount
/dev/sdb1 on /windows/D type fuseblk (rw,nosuid,nodev,noexec,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096)
/dev/md0 on /raid/local type ext4 (rw,relatime,stripe=8)

# Test sdb1 with bfq scheduler 100 times
for i in {0..100}; do fstrim -v /windows/D; done
=> No crash

# update Scheduler for raid devices
echo "none" >> /sys/block/sdb/queue/scheduler
echo "none" >> /sys/block/sdc/queue/scheduler
cat /sys/block/sdb/queue/scheduler
[none] mq-deadline kyber bfq
cat /sys/block/sdc/queue/scheduler
[none] mq-deadline kyber bfq

# run fstrim
fstrim -v /raid/local
=> Fine

echo "bfq" >> /sys/block/sdb/queue/scheduler
echo "bfq" >> /sys/block/sdc/queue/scheduler
cat /sys/block/sdb/queue/scheduler
mq-deadline kyber [bfq] none
cat /sys/block/sdc/queue/scheduler
mq-deadline kyber [bfq] none
cat /sys/block/md0/queue/scheduler
none
fstrim -v /raid/local
=> Kernel Panic

Based on this picture, the differences are RAID and a different filesystem.
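The test above suggests a stopgap: keep bfq off the disks backing md0 while trimming. A minimal sketch follows, assuming the usual sysfs layout (/sys/block/<md>/slaves/ lists the array members and /sys/class/block/<partition>/partition marks partitions). SYSFS_ROOT, parent_disk and set_md_member_scheduler are hypothetical names introduced here. This only sidesteps the symptom and does not fix the kernel.

```shell
#!/bin/sh
# Sketch only: set the I/O scheduler on every disk backing an md array.
# SYSFS_ROOT is a hypothetical override so the logic can be tried on a
# fake directory tree; on a real system it defaults to /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Map an array member such as "sdb3" to its whole disk "sdb".
# Members that are already whole disks are returned unchanged.
parent_disk() {
    member=$1
    if [ -e "$SYSFS_ROOT/class/block/$member/partition" ]; then
        basename "$(dirname "$(readlink -f "$SYSFS_ROOT/class/block/$member")")"
    else
        printf '%s\n' "$member"
    fi
}

# Usage: set_md_member_scheduler <md device> <scheduler>
set_md_member_scheduler() {
    md=$1
    sched=$2
    for slave in "$SYSFS_ROOT/block/$md/slaves"/*; do
        [ -e "$slave" ] || continue
        disk=$(parent_disk "$(basename "$slave")")
        printf '%s\n' "$sched" > "$SYSFS_ROOT/block/$disk/queue/scheduler"
    done
}
```

On the system described above this would be run as root as "set_md_member_scheduler md0 none" before fstrim, and again with bfq afterwards to restore the previous setting.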
http://bugzilla.opensuse.org/show_bug.cgi?id=1200259#c9

--- Comment #9 from Grüner <fgruener@web.de> ---
Many thanks for the great support and the fast solution!