[Bug 1220814] New: kernel 6.7.6-1-default Oops: unable to handle page faults and workqueue lockup (udev related?)
https://bugzilla.suse.com/show_bug.cgi?id=1220814

Bug ID: 1220814
Summary: kernel 6.7.6-1-default Oops: unable to handle page faults and workqueue lockup (udev related?)
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: michael.burge77@gmail.com
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Created attachment 873182 --> https://bugzilla.suse.com/attachment.cgi?id=873182&action=edit
full journalctl of the third crash

I have an old portable with a wonky dc power jack which tends to generate a few quick AC0<->BAT0 events in succession. Starting with kernel 6.7.6-1-default the system locks up. Log is full of
```
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL.
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error
Mar 01 17:34:02 ost (udev-worker)[23694]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Process '/usr/sbin/tlp auto' terminated by signal KILL.
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to wait for spawned command '/usr/sbin/tlp auto': Input/output error
Mar 01 17:34:17 ost (udev-worker)[24044]: AC0: Failed to execute '/usr/sbin/tlp auto', ignoring: Input/output error
Mar 01 17:34:17 ost plasmashell[2682]: plasma-pk-updates: acPluggedChanged onBattery: false -> true
Mar 01 17:34:17 ost kernel: BUG: unable to handle page fault for address: 0000000000007b8a
Mar 01 17:34:17 ost kernel: #PF: supervisor read access in kernel mode
Mar 01 17:34:17 ost kernel: #PF: error_code(0x0000) - not-present page
```
until it locks up with
```
Mar 01 18:11:29 ost kernel: note: kworker/11:5[23457] exited with irqs disabled
Mar 01 18:11:50 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:54 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:55 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:11:58 ost plasmashell[5912]: [5912:0301/181158.744595:ERROR:command_buffer_proxy_impl.cc(128)] ContextResult::kTransientFailure: Failed to send GpuControl.CreateCommandBuffer.
Mar 01 18:12:05 ost systemd-logind[1443]: Power key pressed short.
Mar 01 18:12:06 ost kernel: BUG: workqueue lockup - pool cpus=11 node=0 flags=0x0 nice=0 stuck for 37s!
```
at which point only holding down the power button works.

--
You are receiving this mail because:
You are the assignee for the bug.
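For anyone comparing their own logs against this report, the following is a minimal sketch that counts the three kinds of lines quoted above in saved journal output. The patterns are copied from the log excerpts in this report; the category names (`tlp_killed`, `page_fault`, `workqueue_lockup`) and the function name `summarize` are hypothetical labels chosen for illustration, not anything udev or the kernel defines.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this report.
# The keys are hypothetical labels, chosen only for this sketch.
PATTERNS = {
    "tlp_killed": re.compile(
        r"Process '/usr/sbin/tlp auto' terminated by signal KILL"
    ),
    "page_fault": re.compile(
        r"BUG: unable to handle page fault for address: [0-9a-f]+"
    ),
    "workqueue_lockup": re.compile(r"BUG: workqueue lockup"),
}


def summarize(lines):
    """Count how many of the given log lines match each pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Calling `summarize(open("journal.txt"))` on a saved journal (the file name is an assumption) gives a rough picture of how often the killed `tlp auto` runs precede a page fault; it does not establish causality.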
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c1
Anthony Iliopoulos
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c2
--- Comment #2 from Michael Burge
If you could set up kdump [1] and reproduce the issue (with the latest stable kernel), there may be more hints there for analysis.
Also please set /proc/sys/kernel/panic_on_oops = 1.
[1] https://doc.opensuse.org/documentation/leap/tuning/html/book-tuning/cha-tuning-kexec.html#cha-tuning-kdump-basic
Thank you, I've done as recommended, but am currently rocking 6.7.7-1-default without issue. Will update if I get anything. Regards
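Before trying to reproduce, it can help to confirm that the `panic_on_oops` setting requested above is actually in effect. A minimal sketch, assuming the standard procfs path named in the request; the helper names `parse_flag` and `panic_on_oops_enabled` are hypothetical:

```python
from pathlib import Path


def parse_flag(text):
    """Interpret the contents of a sysctl file as a boolean (non-zero means on)."""
    return int(text.strip()) != 0


def panic_on_oops_enabled(path="/proc/sys/kernel/panic_on_oops"):
    """Return True if the sysctl file at `path` holds a non-zero value."""
    return parse_flag(Path(path).read_text())
```

This only reads the value. Changing it requires root, and a value written directly to procfs does not survive a reboot unless it is also set in the system's sysctl configuration.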
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c3
--- Comment #3 from Michael Burge
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c4
--- Comment #4 from Michael Burge
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c5
--- Comment #5 from Anthony Iliopoulos
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c6
--- Comment #6 from Michael Burge
please attach the kdump.
Would that be the vmcore file? It is 332MiB..
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c7
--- Comment #7 from Anthony Iliopoulos
(In reply to Anthony Iliopoulos from comment #5)
please attach the kdump.
Would that be the vmcore file? It is 332MiB..
Yes, a tarball with everything under /var/crash/2024-03-19-12-05 please. If this exceeds the bugzilla attachment limit, then any other file sharing service would be fine (not sure if opensuse.org offers something like that).
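Packing the crash directory can be sketched as follows, using Python's standard `tarfile` module. The function name `pack_crash_dir` and the output file name in the usage note are assumptions for illustration; only the `/var/crash/2024-03-19-12-05` path comes from this thread.

```python
import tarfile
from pathlib import Path


def pack_crash_dir(crash_dir, out_path):
    """Create a gzip-compressed tarball containing everything under crash_dir."""
    crash_dir = Path(crash_dir)
    with tarfile.open(str(out_path), "w:gz") as tar:
        # Store entries under the directory's own name rather than its full path.
        tar.add(str(crash_dir), arcname=crash_dir.name)
    return Path(out_path)
```

For example, `pack_crash_dir("/var/crash/2024-03-19-12-05", "crash-2024-03-19.tar.gz")` would need read access to the crash directory, which typically means running as root.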
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c8
--- Comment #8 from Michael Burge
(In reply to Michael Burge from comment #6)
(In reply to Anthony Iliopoulos from comment #5)
please attach the kdump.
Would that be the vmcore file? It is 332MiB..
Yes, a tarball with everything under /var/crash/2024-03-19-12-05 please. If this exceeds the bugzilla attachment limit, then any other file sharing service would be fine (not sure if opensuse.org offers something like that).
https://drive.google.com/file/d/1d5pA29GWpGVuks2xNAxN56NfUojMRZLR/view?usp=s...
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c9
--- Comment #9 from Michael Burge
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c10
--- Comment #10 from Anthony Iliopoulos
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c11
--- Comment #11 from Michael Burge
I think upstream commit 4207b556e62f ("kernfs: RCU protect kernfs_nodes and avoid kernfs_idr_lock in kernfs_find_and_get_node_by_id()") fixes this issue, and this was backported to v6.8.6 which is now available in TW.
Could you please give it a try and see if this is still reproducible?
I am quite confident this indeed fixes it. The timeframe is only ~24 hrs, but I've observed the signs of the underlying state changes that would previously have crashed it (i.e. the corresponding journalctl entries, software and physical battery indicators, and fan power curve changes) go by uneventfully, for lack of a better word. Thank you for the insights and support!
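To check whether a given machine is already on a kernel at or past the version mentioned above, a release string can be compared against 6.8.6. This is a minimal sketch: the threshold comes from the statement in this thread that the fix was backported to v6.8.6, while the helper names and the assumption that the leading `major.minor.patch` of the release string reflects the upstream stable version are mine. A distribution can also carry the patch in an older kernel, so a `False` result is not proof that the fix is absent.

```python
import re

# Stable version named in this thread as containing the backported fix.
FIXED_IN = (6, 8, 6)


def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a string such as '6.7.6-1-default'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_kernfs_fix(release):
    """Return True if the release is at or after the version in FIXED_IN."""
    return kernel_version_tuple(release) >= FIXED_IN
```

On a running system the release string can be obtained with `platform.release()` from the standard library, for example `has_kernfs_fix(platform.release())`.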
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c12
Anthony Iliopoulos
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c13
--- Comment #13 from Michael Burge
https://bugzilla.suse.com/show_bug.cgi?id=1220814#c14
--- Comment #14 from Michael Burge