[Bug 1212848] New: System lock up - kernel null pointer dereference - page fault
https://bugzilla.suse.com/show_bug.cgi?id=1212848

Bug ID: 1212848
Summary: System lock up - kernel null pointer dereference - page fault
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: db@mail25.net
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Created attachment 867882 --> https://bugzilla.suse.com/attachment.cgi?id=867882&action=edit
journal-logs

Booted up my laptop from full shutdown, logged in, Slack (Flatpak) started automatically, clicked on a few Slack channels and the system froze, couldn't switch between TTYs either. Forced shutdown with the power button.

OS: openSUSE Tumbleweed 20230624
Kernel: 6.3.9-1-default
DE: GNOME 44.2 Wayland
Laptop: Lenovo ThinkPad T14s Gen 1 (AMD 4650U with integrated graphics)

Attached some logs from the journal.

--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c1

--- Comment #1 from David B <db@mail25.net> ---
Yesterday I experienced it again, this time while shutting down the system: GNOME had already shut down and I just saw the messages in the terminal.
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c2

--- Comment #2 from David B <db@mail25.net> ---
Created attachment 867901 --> https://bugzilla.suse.com/attachment.cgi?id=867901&action=edit
shutdown-logs.txt
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c4

--- Comment #4 from David B <db@mail25.net> ---
(In reply to Takashi Iwai from comment #3)
> As TW is moving to 6.4.x kernel, could you test with the latest kernel in OBS Kernel:stable repo?

Ok, will see how it goes.

uname -r
6.4.1-1.gb8cc951-default
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c5

David B <db@mail25.net> changed:

What |Removed |Added
----------------------------------------------------------------------------
Flags|needinfo?(db@mail25.net) |

--- Comment #5 from David B <db@mail25.net> ---
I just experienced another lock-up after using the laptop for several hours. I had to force a shutdown, and it seems the system couldn't write the full trace, because it's not in the journal. This is all I found:
liep. 04 16:01:11 kernel: BUG: kernel NULL pointer dereference, address: 000000000000051f
liep. 04 16:01:11 kernel: #PF: supervisor write access in kernel mode
liep. 04 16:01:11 kernel: #PF: error_code(0x0002) - not-present page
liep. 04 16:01:11 kernel: PGD 0 P4D 0
liep. 04 16:01:11 kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
liep. 04 16:01:11 kernel: CPU: 5 PID: 3793 Comm: gnome-shell Not tainted 6.4.1-1.gb8cc951-default #1 openSUSE Tumbleweed (unreleased) 65162174696c49cc99e8e33b7df9ef2d74c390c3
liep. 04 16:01:11 kernel: Hardware name: LENOVO 20UH002DMH/20UH002DMH, BIOS R1CET74W(1.43 ) 03/01/2023
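For readers less familiar with the "#PF" lines above: the value in error_code(0x0002) is an x86 page-fault error code whose low bits say what kind of access faulted. The following is a minimal sketch of how those bits can be decoded; the function and table names are hypothetical, and only the five low bits are covered.

```python
from typing import List

# Low bits of the x86 page-fault error code. The table and function names
# are hypothetical; the bit meanings follow the architecture definition.
# Each entry: (mask, meaning if set, meaning if clear or None).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel (supervisor) mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error_code(code: int) -> List[str]:
    """Return a human-readable description of each decoded bit."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


if __name__ == "__main__":
    # The value from the journal excerpt above.
    print(decode_pf_error_code(0x0002))
```

Decoding 0x0002 gives a write from kernel mode to a page that is not present, which matches the two "#PF" lines. The faulting address 0x51f is very small, which is why the kernel reports it as a NULL pointer dereference; it would be consistent with a write at an offset from a NULL base pointer, but the truncated log does not show where it happened.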
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c6

--- Comment #6 from Takashi Iwai <tiwai@suse.com> ---
So there seems to be a kernel crash, but we can't judge whether it follows the same pattern as with the previous kernel, because the log is cut off in the middle. The next option would be to set up kdump and capture the crash log.
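kdump only captures a crash if memory was reserved for the crash kernel at boot and that kernel has been loaded. As a rough sketch (not the official kdump setup procedure, and with hypothetical helper names), this can be checked by looking for a crashkernel= parameter in /proc/cmdline and reading /sys/kernel/kexec_crash_loaded:

```python
from pathlib import Path
from typing import Optional


def crashkernel_param(cmdline: str) -> Optional[str]:
    """Return the value of crashkernel= from a kernel command line, or None."""
    prefix = "crashkernel="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token[len(prefix):]
    return None


def crash_kernel_loaded(sysfs_path: str = "/sys/kernel/kexec_crash_loaded") -> bool:
    """True if the sysfs flag says a crash kernel is currently loaded."""
    try:
        return Path(sysfs_path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable: treat as "not loaded".
        return False


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("crashkernel reservation:", crashkernel_param(cmdline))
    print("crash kernel loaded:", crash_kernel_loaded())
```

If the first value is None or the second is False, a lock-up like the one above would most likely leave no dump behind.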
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c8

--- Comment #8 from David B <db@mail25.net> ---
(In reply to Takashi Iwai from comment #7)
> Also, could you try the kernel in OBS home:tiwai:kernel:stable-kasan repo? It's a kernel with KASAN enabled, and hopefully it can catch something useful. Note that the kernel will run a bit slower due to the overhead of KASAN.

I've installed your kernel. There haven't been any freezes yet, but I see one bug in dmesg:
[ 9.253637] ==================================================================
[ 9.253646] BUG: KASAN: slab-out-of-bounds in amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu]
[ 9.254277] Write of size 8 at addr ffff888126db1a98 by task kworker/3:0/36

[ 9.254285] CPU: 3 PID: 36 Comm: kworker/3:0 Not tainted 6.4.1-3.gb8cc951-default #1 openSUSE Tumbleweed (unreleased) 9c9f1db9bae7100475d5e6bab98f1d56aac2818e
[ 9.254294] Hardware name: LENOVO 20UH002DMH/20UH002DMH, BIOS R1CET74W(1.43 ) 03/01/2023
[ 9.254299] Workqueue: events amdgpu_device_delayed_init_work_handler [amdgpu]
[ 9.254827] Call Trace:
[ 9.254830] <TASK>
[ 9.254833] dump_stack_lvl+0x47/0x60
[ 9.254841] print_report+0xcf/0x640
[ 9.254847] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 9.254853] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.255391] kasan_report+0xb1/0xe0
[ 9.255396] ? amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.255933] amdgpu_sw_ring_ib_mark_offset+0x2c1/0x2e0 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256469] gfx_v9_0_ring_emit_ib_gfx+0x4cc/0xd50 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] ? amdgpu_sw_ring_ib_begin+0x1b4/0x3e0 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] amdgpu_ib_schedule+0x7cb/0x1520 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] gfx_v9_0_ring_test_ib+0x344/0x510 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] ? __pfx_gfx_v9_0_ring_test_ib+0x10/0x10 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] ? __schedule+0xc87/0x4ec0
[ 9.256953] amdgpu_ib_ring_tests+0x2bc/0x490 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] amdgpu_device_delayed_init_work_handler+0x15/0x30 [amdgpu 9cffb4e0926e4a6d8349208c6da17c87b48a9ea2]
[ 9.256953] process_one_work+0x76f/0x1350
[ 9.256953] worker_thread+0xef/0x13b0
[ 9.256953] ? __pfx_worker_thread+0x10/0x10
[ 9.256953] kthread+0x2a3/0x370
[ 9.256953] ? __pfx_kthread+0x10/0x10
[ 9.256953] ret_from_fork+0x2c/0x50
[ 9.256953] </TASK>

[ 9.256953] Allocated by task 433:
[ 9.256953] kasan_save_stack+0x20/0x40
[ 9.256953] kasan_set_track+0x25/0x30
[ 9.256953] __kasan_kmalloc+0xaa/0xb0
[ 9.256953] __kmalloc+0x5e/0x160
[ 9.256953] amdgpu_ring_mux_init+0x6e/0x1d0 [amdgpu]
[ 9.256953] gfx_v9_0_sw_init+0xf43/0x2860 [amdgpu]
[ 9.256953] amdgpu_device_init+0x3bbc/0x7db0 [amdgpu]
[ 9.256953] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
[ 9.256953] amdgpu_pci_probe+0x273/0x9a0 [amdgpu]
[ 9.256953] local_pci_probe+0xdd/0x190
[ 9.256953] pci_device_probe+0x23a/0x770
[ 9.256953] really_probe+0x3e2/0xb80
[ 9.256953] __driver_probe_device+0x18c/0x450
[ 9.256953] driver_probe_device+0x4a/0x120
[ 9.256953] __driver_attach+0x1e1/0x4a0
[ 9.256953] bus_for_each_dev+0xf4/0x170
[ 9.256953] bus_add_driver+0x29e/0x570
[ 9.256953] driver_register+0x134/0x460
[ 9.256953] do_one_initcall+0x8e/0x310
[ 9.256953] do_init_module+0x238/0x730
[ 9.256953] load_module+0x5b41/0x6dd0
[ 9.256953] __do_sys_init_module+0x1df/0x210
[ 9.256953] do_syscall_64+0x60/0x90
[ 9.256953] entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 9.256953] The buggy address belongs to the object at ffff888126db1a00 which belongs to the cache kmalloc-128 of size 128
[ 9.256953] The buggy address is located 24 bytes to the right of allocated 128-byte region [ffff888126db1a00, ffff888126db1a80)

[ 9.256953] The buggy address belongs to the physical page:
[ 9.256953] page:00000000ee84b3fa refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x126db0
[ 9.256953] head:00000000ee84b3fa order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 9.256953] flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 9.256953] page_type: 0xffffffff()
[ 9.256953] raw: 0017ffffc0010200 ffff8881000428c0 dead000000000122 0000000000000000
[ 9.256953] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 9.256953] page dumped because: kasan: bad access detected

[ 9.256953] Memory state around the buggy address:
[ 9.256953]  ffff888126db1980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9.256953]  ffff888126db1a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 9.256953] >ffff888126db1a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9.256953]                             ^
[ 9.256953]  ffff888126db1b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9.256953]  ffff888126db1b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 9.256953] ==================================================================
[ 9.263148] Disabling lock debugging due to kernel taint
[ 9.270245] fbcon: amdgpudrmfb (fb0) is primary device
[ 9.271048] [drm] DSC precompute is not needed.
[ 9.318160] Console: switching to colour frame buffer device 240x67
[ 9.340456] amdgpu 0000:06:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 9.691782] BTRFS: device label root devid 1 transid 4302 /dev/dm-0 scanned by (udev-worker) (568)
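The numbers in the KASAN report above can be cross-checked with a little arithmetic. The sketch below uses hypothetical helper names and assumes generic KASAN, where each shadow byte in the "Memory state" dump describes 8 bytes of memory. It recomputes where the write landed relative to the 128-byte object, and which shadow byte in the marked row corresponds to the buggy address.

```python
# Assumption: generic KASAN, one shadow byte per 8 bytes of memory.
KASAN_GRANULE = 8


def describe_access(access_addr: int, obj_start: int, obj_size: int):
    """Return (position, distance in bytes) of an access relative to an object."""
    obj_end = obj_start + obj_size  # exclusive end, as in [start, end)
    if access_addr >= obj_end:
        return ("right", access_addr - obj_end)
    if access_addr < obj_start:
        return ("left", obj_start - access_addr)
    return ("inside", access_addr - obj_start)


def shadow_index(access_addr: int, row_start: int) -> int:
    """Index of the shadow byte within a dump row that covers access_addr."""
    return (access_addr - row_start) // KASAN_GRANULE


# Values copied from the report above.
ACCESS = 0xFFFF888126DB1A98
OBJECT = 0xFFFF888126DB1A00
ROW = 0xFFFF888126DB1A80

if __name__ == "__main__":
    print(describe_access(ACCESS, OBJECT, 128))
    print(shadow_index(ACCESS, ROW))
```

The result agrees with the report: the 8-byte write starts 24 bytes past the end of the kmalloc-128 object allocated in amdgpu_ring_mux_init, and it falls on the fourth shadow byte (index 3) of the row marked with ">". The sixteen 00 shadow bytes in the row at ffff888126db1a00 cover exactly 16 x 8 = 128 addressable bytes, and the fc bytes around them are, as far as I know, the slab redzone marker.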
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c13

--- Comment #13 from Dronskowski <malted.dev@gmail.com> ---
Thanks, I'll try it when I get home. Meanwhile, reverting to 6.3.8 seems to have solved it for now as well.

A little more information: the bug was reported and confirmed on the Freedesktop GitLab:
https://gitlab.freedesktop.org/drm/amd/-/issues/2658
Reports are also starting to pile up on Arch, Fedora, etc.

bsc#1212833 seems to be a duplicate: same symptoms, same time of appearance (6.3.9), and always with the amdgpu module loaded.
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c14

--- Comment #14 from David B <db@mail25.net> ---
(In reply to Takashi Iwai from comment #12)
> Thanks for the info. Yes, this looks pretty relevant, and interestingly, the whole PR isn't included in the linux-next tree. Maybe an oversight? We need to ask Alex and Dave.
>
> I'm building a test kernel with the backport of those patches in OBS home:tiwai:bsc1212848 repo. It'll take some time (an hour or so) to build. Once the build finishes, could you try it out?

I've been running the patched kernel for a few hours and so far so good. Would it make sense to have a kernel with both the patch and KASAN, so we could see immediately whether the memory corruption happens after booting?
https://bugzilla.suse.com/show_bug.cgi?id=1212848#c17

--- Comment #17 from David B <db@mail25.net> ---
(In reply to Takashi Iwai from comment #15)
> OK, looks good, so far.
>
> The patches are now merged to the stable git branch for TW, and the repo in OBS Kernel:stable will be updated tomorrow (it's updated daily), followed by the automatic rebuild in my OBS home:tiwai:kernel:stable-kasan repo. So check out tomorrow's build.
>
> The update to TW will happen later, usually together with other updates like the upstream 6.4.x update.

I just wanted to confirm that I installed 6.4.1-8.g3561b10-default from your KASAN repo yesterday evening. KASAN didn't show any memory bugs and I didn't experience any crashes or other bugs, so the patches seem to have solved it. Thank you for looking into this, Takashi.