[Bug 1207948] New: Lenovo T14s Gen3 AMD resume from hibernation ends in kernel panic
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948

            Bug ID: 1207948
           Summary: Lenovo T14s Gen3 AMD resume from hibernation ends in kernel panic
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: mjambor@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 864736
  --> http://bugzilla.opensuse.org/attachment.cgi?id=864736&action=edit
hwinfo output

With the current openSUSE Tumbleweed (kernel-default-6.1.8-1.1.x86_64), attempts to resume from hibernation (from "suspend to disk") end in a kernel panic. After I type in my disk encryption password (I have kernel and initrd on an unencrypted partition), the initrd takes over, but a short while afterwards the system hangs, does nothing for about 30 seconds, and then CapsLock starts blinking. Usually the screen stays black, but two or three times I have seen the contents of the screen restored and only then did the laptop freeze.

None of my attempts to get at backtraces or error messages were successful. I tried no_console_suspend=1, but that only meant I saw the cursor blinking before the machine froze, with no messages appearing, and my attempts to use EFI pstore also led to nothing.

Resume from suspend to RAM works fine.

I am attaching output from hwinfo.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c1

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tiwai@suse.com

--- Comment #1 from Takashi Iwai <tiwai@suse.com> ---
Is it a regression by the recent update?

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c2

--- Comment #2 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #1)
> Is it a regression by the recent update?

I don't know, but probably not. The first kernel I installed on the machine was 6.1.7-1.1 and (IIRC) it also did not work, with the same symptoms.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c3

--- Comment #3 from Takashi Iwai <tiwai@suse.com> ---
Then could you try a few other older kernels? You can find the (unofficial) builds of each last TW/stable kernels in OBS home:tiwai:kernel:6.0, home:tiwai:kernel:5.19, ...

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c8

--- Comment #8 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #3)
> Then could you try a few other older kernels? You can find the (unofficial) builds of each last TW/stable kernels in OBS home:tiwai:kernel:6.0, home:tiwai:kernel:5.19, ...

Kernel 6.0.12 also panicked in the same way during resume from hibernation. Kernel 5.9.14 did not panic and resumed fine, but graphics did not even come up with it, so the problematic bit probably does not exist in that kernel at all, rather than something having regressed.

I tried booting into multi-user.target with 6.1.8, but even then the same issue was there.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c9

--- Comment #9 from Martin Jambor <mjambor@suse.com> ---
(In reply to Jiri Slaby from comment #5)
> Just to sum up. no_console_suspend shows only a blinking cursor, right?

Sometimes resuming manages to restore the screen (console or even graphical) just before locking up, and sometimes it does not. In the experiment with no_console_suspend=1 I saw just a blank screen with a blinking cursor, which was a difference: without it the cursor never blinks, even when resume manages to restore the console before locking up.

> Does pstore store anything to efi after all?

I tried following
https://blogs.oracle.com/linux/post/pstore-linux-kernel-persistent-storage-f...
I got as far as "cat /sys/module/pstore/parameters/backend" resulting in "efi", but no file ever appeared in /sys/fs/pstore.

> Do you have kdump set up?

I don't think so. So no, unless it is somehow on by default.

> No crash generated in /var/crash/? (Note kdump is currently broken after each kernel update, see bug 1207114. You need to delete /boot/initrd-kdump and restart kdump service.)

Thanks, but I'll need to look up how the whole thing works before attempting this anyway.

(In reply to Jiri Slaby from comment #6)
> Yeah and I also asked you about:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/power/basic-pm-debugging.rst
> What last step (from [freezer, devices, ...]) does not expose the crash?

I'm only starting to read the document now.

(In reply to Jiri Slaby from comment #7)
> And yet (sorry), brand new BIOSes might be buggy, so no BIOS update available (e.g. via fwupd)?

BIOS version is 1.25, which is the latest one according to the Lenovo website.

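The pstore check described in comment #9 can be condensed into a small helper. This is only a sketch: the two sysfs paths are the ones quoted in the comment, while the function name and its optional path arguments are made up for illustration (they allow a dry run against scratch files).

```shell
#!/bin/sh
# Sketch: report the active pstore backend and count saved records.
# pstore_status and its two optional arguments are hypothetical names;
# the default paths are the ones mentioned in comment #9.
pstore_status() {
    backend_file=${1:-/sys/module/pstore/parameters/backend}
    pstore_dir=${2:-/sys/fs/pstore}

    if [ -r "$backend_file" ]; then
        echo "backend: $(cat "$backend_file")"
    else
        echo "backend: unknown (pstore module not loaded?)"
    fi

    # Count directory entries; a missing directory counts as zero records.
    count=$(ls -1 "$pstore_dir" 2>/dev/null | wc -l)
    echo "records: $((count))"
}
```

Run as root after a reboot that follows the panic; "records: 0" with "backend: efi" matches what was observed here, i.e. the backend is registered but nothing was written.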
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c10

--- Comment #10 from Takashi Iwai <tiwai@suse.com> ---
If you still can't get any logs -- could you try to hibernate/resume without the amdgpu native graphics, e.g. with nomodeset boot option? The graphics would be broken after the resume, but the question is whether this also causes the panic or not. If this works better, the culprit is likely the amdgpu driver.

We may exclude other drivers similarly by blacklisting or whatever, and check whether it changes anything, too.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c11

--- Comment #11 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #10)
> If you still can't get any logs -- could you try to hibernate/resume without the amdgpu native graphics, e.g. with nomodeset boot option? The graphics would be broken after the resume, but the question is whether this also causes the panic or not. If this works better, the culprit is likely the amdgpu driver.

Even after adding the nomodeset kernel parameter in GRUB, the kernel panic was still there (and yeah, when resuming the screen went completely blank).

> We may exclude other drivers similarly by blacklisting or whatever, and check whether it changes anything, too.

So would the output from lsmod be helpful, or something like it?

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c12

--- Comment #12 from Martin Jambor <mjambor@suse.com> ---
(In reply to Jiri Slaby from comment #6)
> Yeah and I also asked you about:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/power/basic-pm-debugging.rst
> What last step (from [freezer, devices, ...]) does not expose the crash?

None of them exposes it - even "core" returns fine without any problems - except for "none", which of course results in the panic.

Funny thing: when triggering the suspend with

  echo platform > /sys/power/disk
  echo disk > /sys/power/state

(as opposed to "systemctl hibernate"), the virtual console echo works at the end of the unsuccessful resume from disk. I.e. I can see the letters I type, but there is no reaction from bash or anything, and the next bash prompt is not displayed either, until the panic, when everything stops. Is it perhaps some watchdog that is triggering it?

I'll try to figure out how to have persistent logs nowadays and see if anything is there...

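The pm_test procedure from comment #12 can be sketched as a helper that runs one test cycle per level. The level names are the ones accepted by /sys/power/pm_test according to basic-pm-debugging.rst; the function name and the sysfs_root parameter are illustrative assumptions, added so the logic can be tried against a scratch directory first.

```shell
#!/bin/sh
# Sketch: run one hibernation test cycle at a given pm_test level.
# pm_test_cycle and sysfs_root are hypothetical names; on a real
# system sysfs_root defaults to /sys and the function must run as root.
pm_test_cycle() {
    level=$1
    sysfs_root=${2:-/sys}

    case "$level" in
        freezer|devices|platform|processors|core|none) ;;
        *) echo "unknown pm_test level: $level" >&2; return 1 ;;
    esac

    # Select how far the cycle goes before it is rolled back.
    echo "$level" > "$sysfs_root/power/pm_test" || return 1
    # Same two writes as in the comment above.
    echo platform > "$sysfs_root/power/disk" || return 1
    echo disk > "$sysfs_root/power/state"
}
```

A typical run would go from the shallowest level to the deepest, for example "for l in freezer devices platform processors core; do pm_test_cycle "$l"; done", and finish with "echo none > /sys/power/pm_test" to restore normal behavior. Note that with "none" the final write performs a real hibernation.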
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c13

--- Comment #13 from Martin Jambor <mjambor@suse.com> ---
Unfortunately, I did not find anything really interesting. Apparently my experiments with /sys/power/pm_test resulted in some kernel warnings with backtraces in the thunderbolt module, but those were not present in other "boots" that ended up in a panic. It seems like nothing from any real resume is in the journal.

Blacklisting the thunderbolt module does not help.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948

Martin Jambor <mjambor@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mbrugger@suse.com

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c15

--- Comment #15 from Martin Jambor <mjambor@suse.com> ---
Created attachment 864876
  --> http://bugzilla.opensuse.org/attachment.cgi?id=864876&action=edit
dmesg.txt from kdump

dmesg captured by kdump during resume. The key bit seems to be:

[ 205.834253] ath11k_pci 0000:01:00.0: PM: **** DPM device timeout ****
[ 205.834273] Call Trace:
[ 205.834279] <TASK>
[ 205.834289] __schedule+0x360/0x1350
[ 205.834315] ? __mod_timer+0x26e/0x390
[ 205.834331] schedule+0x5a/0xd0
[ 205.834342] schedule_timeout+0x87/0x150
[ 205.834355] ? __bpf_trace_tick_stop+0x10/0x10
[ 205.834372] __mhi_pm_resume+0x1f4/0x3d0 [mhi b0be4565fae9a4eff439b816de4091ab6cb78e61]
[ 205.834393] ? destroy_sched_domains_rcu+0x30/0x30
[ 205.834405] ? pci_pm_poweroff_noirq+0x100/0x100
[ 205.834421] ath11k_mhi_resume+0x17/0x50 [ath11k_pci 04cba7fcb154366ae79dc3f99b9ba743c987dca7]
[ 205.834447] ath11k_core_resume+0x55/0x120 [ath11k f34d9e4b2e3e853770e5fae7af817515de79e808]
[ 205.834483] ath11k_pci_pm_resume+0x2e/0x60 [ath11k_pci 04cba7fcb154366ae79dc3f99b9ba743c987dca7]
[ 205.834494] ? pci_pm_poweroff_noirq+0x100/0x100
[ 205.834504] dpm_run_callback+0x4a/0x150
[ 205.834517] device_resume+0x104/0x270
[ 205.834527] ? dpm_show_time.cold+0x62/0x62
[ 205.834539] async_resume+0x19/0x30
[ 205.834546] async_run_entry_fn+0x2e/0x110
[ 205.834557] process_one_work+0x20f/0x3d0
[ 205.834567] worker_thread+0x4a/0x3b0
[ 205.834574] ? process_one_work+0x3d0/0x3d0
[ 205.834580] kthread+0xda/0x100
[ 205.834587] ? kthread_complete_and_exit+0x20/0x20
[ 205.834593] ret_from_fork+0x22/0x30
[ 205.834606] </TASK>
[ 205.834611] Kernel panic - not syncing: ath11k_pci 0000:01:00.0: unrecoverable failure

I also have the vmcore file, if it can be useful.

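To pick the offending device out of a longer dmesg capture such as the kdump attachment, a filter along these lines may help. It is a sketch that assumes the message format shown in comment #15 ("<driver> <address>: PM: **** DPM device timeout ****"); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read dmesg text on stdin and print "driver address" for every
# device that hit the DPM watchdog. The pattern is derived only from the
# single log line quoted in comment #15.
dpm_timeout_devices() {
    sed -n 's/^.*] \(.*\): PM: \*\*\*\* DPM device timeout \*\*\*\*.*$/\1/p'
}
```

Usage would be, for example, "dpm_timeout_devices < dmesg.txt" on the attached file, or "dmesg | dpm_timeout_devices" on a live system.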
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c16

--- Comment #16 from Martin Jambor <mjambor@suse.com> ---
After unloading modules ath11k_pci and ath11k, resuming from hibernation (suspend to disk) works as expected.

What next?

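Until the driver is fixed, the manual unload from comment #16 could be automated with a systemd-sleep hook. The following is an untested sketch, assuming hibernation is triggered through systemd (e.g. "systemctl hibernate"), which calls executables in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument and the sleep action as the second. The function name and the MODPROBE override are made up for illustration.

```shell
#!/bin/sh
# Sketch of a system-sleep hook body: unload the WiFi driver before
# hibernation and load it again after resume.
# MODPROBE is a hypothetical override so the logic can be dry-run
# with MODPROBE=echo instead of touching real modules.
wifi_sleep_hook() {
    phase=$1     # "pre" or "post"
    action=$2    # "suspend", "hibernate", ...
    modprobe_cmd=${MODPROBE:-modprobe}

    # Suspend to RAM works fine in this report, so leave it alone.
    [ "$action" = hibernate ] || return 0

    case "$phase" in
        # -r also removes ath11k if nothing else is using it.
        pre)  $modprobe_cmd -r ath11k_pci ;;
        post) $modprobe_cmd ath11k_pci ;;
    esac
}
```

An installed hook script would end with the line wifi_sleep_hook "$@". A dry run such as MODPROBE=echo wifi_sleep_hook pre hibernate only prints the modprobe arguments.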
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c17

--- Comment #17 from Takashi Iwai <tiwai@suse.com> ---
Hm, it looks like an endless wait in __mhi_pm_resume(). There should be the timeout_ms entry for the mhi bus in /sys/kernel/debug/. What value is it shown there?

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c18

--- Comment #18 from Takashi Iwai <tiwai@suse.com> ---
Also, just to be sure: there are a few changes in linux-next for mhi bus code (for 6.3). Can anyone test quickly linux-next kernel to see whether the problem persists (e.g. with kernel-vanilla in OBS Kernel:linux-next repo)?

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c20

--- Comment #20 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #17)
> Hm, it looks like an endless wait in __mhi_pm_resume(). There should be the timeout_ms entry for the mhi bus in /sys/kernel/debug/. What value is it shown there?

Do you mean this?

# cat /sys/kernel/debug/mhi/mhi0/timeout_ms
90000 ms

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c21

--- Comment #21 from Takashi Iwai <tiwai@suse.com> ---
(In reply to Martin Jambor from comment #20)
> (In reply to Takashi Iwai from comment #17)
> > Hm, it looks like an endless wait in __mhi_pm_resume(). There should be the timeout_ms entry for the mhi bus in /sys/kernel/debug/. What value is it shown there?
>
> Do you mean this?
>
> # cat /sys/kernel/debug/mhi/mhi0/timeout_ms
> 90000 ms

Ah, that explains why the panic is triggered. The timeout in the driver core for the resume hang watchdog is 60 seconds, while this MHI bus timeout is set to 90 seconds. So the watchdog triggers the panic before the MHI bus driver goes out of the event loop.

Could you try to write a smaller value such as 20000 there, and retry the hibernate/resume?

# echo -n 20000 > /sys/kernel/debug/mhi/mhi0/timeout_ms

This should avoid the panic, at least, even though the WiFi might be broken after the resume.

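The comparison made in comment #21 can be written down as a quick check. This is a sketch only: the debugfs path and the "90000 ms" output format are the ones quoted in the thread, and the 60-second default is the watchdog value stated in comment #21 (on a given kernel build it is set by CONFIG_DPM_WATCHDOG_TIMEOUT and may differ). The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: warn when the MHI timeout is not shorter than the resume
# watchdog timeout, which is the condition that leads to the panic.
# mhi_timeout_check, timeout_file and watchdog_s are illustrative names.
mhi_timeout_check() {
    timeout_file=${1:-/sys/kernel/debug/mhi/mhi0/timeout_ms}
    watchdog_s=${2:-60}    # value quoted in comment #21

    # The file reads like "90000 ms"; keep only the number.
    mhi_ms=$(awk '{ print $1; exit }' "$timeout_file") || return 2
    [ -n "$mhi_ms" ] || return 2

    if [ "$mhi_ms" -ge $((watchdog_s * 1000)) ]; then
        echo "panic risk: MHI timeout ${mhi_ms} ms >= watchdog ${watchdog_s} s"
        return 1
    fi
    echo "ok: MHI timeout ${mhi_ms} ms < watchdog ${watchdog_s} s"
}
```

With the original value of 90000 ms the check reports a panic risk; after writing 20000 as suggested above it reports ok. Reading the debugfs file requires root.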
http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c22

--- Comment #22 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #21)
> Could you try to write a smaller value such as 20000 there, and retry the hibernate/resume?
>
> # echo -n 20000 > /sys/kernel/debug/mhi/mhi0/timeout_ms
>
> This should avoid the panic, at least, even though the WiFi might be broken after the resume.

That is exactly what happened. The notebook was unresponsive for about 20 seconds, but then it did wake up and everything works except for wifi, which appears gone (and attempting to remedy it by rmmod ath11k_pci does not work as the command never terminates).

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c23

--- Comment #23 from Martin Jambor <mjambor@suse.com> ---
(In reply to Martin Jambor from comment #22)
> That is exactly what happened. The notebook was unresponsive for about 20 seconds, but then it did wake up and everything works except for wifi, which appears gone (and attempting to remedy it by rmmod ath11k_pci does not work as the command never terminates).

Oh, after a very long time it did terminate, and re-loading the module afterwards eventually even fixed the wifi connectivity.

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c24

--- Comment #24 from Martin Jambor <mjambor@suse.com> ---
(In reply to Takashi Iwai from comment #18)
> Also, just to be sure: there are a few changes in linux-next for mhi bus code (for 6.3). Can anyone test quickly linux-next kernel to see whether the problem persists (e.g. with kernel-vanilla in OBS Kernel:linux-next repo)?

I'm sorry but I probably cannot test this - the kernel from kernel-vanilla-6.2~rc7.next.20230213-1.1.g059273c.x86_64 does not seem to be able to unlock my encrypted partition?

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c27

Richard Weinberger <richard@nod.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |richard@nod.at

--- Comment #27 from Richard Weinberger <richard@nod.at> ---
FWIW, I think this report matches what I have already reported to linux-wireless.
https://lore.kernel.org/linux-wireless/1263051271.53086.1674425560245.JavaMa...

Thanks,
//richard

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948

Michal Suchanek <msuchanek@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.kernel.org
                   |                            |/show_bug.cgi?id=214649

http://bugzilla.opensuse.org/show_bug.cgi?id=1207948#c28

--- Comment #28 from Vlastimil Babka <vbabka@suse.com> ---
(In reply to Martin Jambor from comment #24)
> I'm sorry but I probably cannot test this - the kernel from kernel-vanilla-6.2~rc7.next.20230213-1.1.g059273c.x86_64 does not seem to be able to unlock my encrypted partition?

Same here, although normal kernel-vanilla rc8 did unlock fine. So it's either something coming from the upstream -next, or some difference in config in the repo that provides kernel-vanilla-next?

(In reply to Takashi Iwai from comment #25)
> Meanwhile, could you confirm that the same behavior appears with the 6.2-rc8 kernel from OBS Kernel:HEAD?

I can confirm that, yeah. Will try your backports then.
