[Bug 1190608] New: Boot fails/hangs/freezes with kernel 5.14.2-1-default; System works with 5.14.1-1-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608

Bug ID: 1190608
Summary: Boot fails/hangs/freezes with kernel 5.14.2-1-default; System works with 5.14.1-1-default
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: awoo@posteo.de
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36
Build Identifier:

My initial description, with some photos of my screen as the boot freezes can be found at https://www.reddit.com/r/openSUSE/comments/ppyfcd/optimus_laptop_hangs_on_bo...

I received some initial guidance by https://www.reddit.com/u/MasterPatricko/ (Thank you!)

The system is a Tongfang GK5CP0Z / XMG Neo 15 Early 2019 with NVIDIA RTX 2060. I am using it with an external HDMI monitor. HDMI is only available in nvidia mode, so I use prime-select to choose nvidia over intel. Nvidia 470.63.01-43.1 is in use.

Kernel 5.14.1-1-default boots and runs perfectly.
Kernel 5.14.2-1-default fails to boot.

Photo of screen as it freezes on boot: https://imgur.com/eFegmCV

Kernel parameters (as copied from grub.cfg):
root=/dev/mapper/system-root ${extra_cmdline} resume=/dev/system/swap acpi_osi=! acpi_osi=Linux acpi_os_name=Linux acpi_rev_override=1 nouveau.modeset=0 nouveau.runpm=0 pcie_aspm=force drm.vblankoffdelay=1 scsi_mod.use_blk_mq=1 mem_sleep_default=deep mitigations=auto

Reproducible: Always

Steps to Reproduce:
1. Boot into zypper post snapshot after installing updates yesterday
2. Kernel 5.14.2-1-default starts booting
3. Prints twice "xhci_hcd: can't change power state from D3cold to D0 (config space inaccessible)"
4. Prints some other USB-related messages (see photo)
5. Freeze

Actual Results:
Freezes.
No reaction of caps lock LED.
No reaction of virtual terminal hotkey (Ctrl+Alt+F1 - F12)
Can only hard-reboot by holding physical power button.

Expected Results:
Kernel 5.14.2-1-default should boot just the same as 5.14.1-1-default does.

I tried booting the broken kernel (5.14.2-1-default) from its zypper post snapshot by appending an additional boot param `systemd.unit=multi-user.target`, but that made no difference at all. Same freeze, same messages.

Journalctl does not seem to contain any of the frozen boots.

Output of lsmod: https://paste.opensuse.org/36050901
Output of lspci: https://paste.opensuse.org/12105741

-- 
You are receiving this mail because:
You are the assignee for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c1

--- Comment #1 from Bunte Katze <awoo@posteo.de> ---
The photo at https://imgur.com/eFegmCV shows a section of kernel boot messages when boot freezes with kernel 5.14.2-1-default.

Here is the same section of kernel boot messages from a successful boot with the previous kernel, 5.14.1-1-default: https://paste.opensuse.org/79702865

You can search for "usb: port power management may be unreliable" to find the place shortly before kernel 5.14.2-1-default freezes.
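For anyone comparing logs the same way, here is a minimal sketch of how the marker line from the comment above could be used to cut the relevant section out of a kernel log. The function name show_after_marker is made up for illustration, and it assumes a grep that supports -A (GNU and BSD grep do).

```sh
# Print the marker line plus the N lines that follow it (default 10).
# Reads a kernel log from stdin; -F treats the marker as a fixed string.
show_after_marker() {
    marker=$1
    count=${2:-10}
    grep -F -A "$count" -- "$marker"
}

# Possible usage on a boot that did reach the journal:
#   journalctl -k -b | show_after_marker "usb: port power management may be unreliable" 5
```

Since the frozen boots never reach the journal, this only helps on the working kernel; the output can then be compared by eye against the photo of the frozen screen.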
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c2

--- Comment #2 from Bunte Katze <awoo@posteo.de> ---
As proposed by https://www.reddit.com/user/ditzypoet/ I tried booting kernel 5.14.2-1-default with only minimal boot params:

linuxefi /boot/vmlinuz-5.14.2-1-default root=/dev/mapper/system-root ${extra_cmdline} resume=/dev/system/swap mitigations=auto

However, the result was unfortunately *exactly the same* as with my default boot params (which grew to a long list after applying a number of workarounds to fix the adjustable laptop screen backlight, keyboard backlight etc.):

linuxefi /boot/vmlinuz-<kernel version> root=/dev/mapper/system-root ${extra_cmdline} resume=/dev/system/swap acpi_osi=! acpi_osi=Linux acpi_os_name=Linux acpi_rev_override=1 nouveau.modeset=0 nouveau.runpm=0 pcie_aspm=force drm.vblankoffdelay=1 scsi_mod.use_blk_mq=1 mem_sleep_default=deep mitigations=auto
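To see at a glance which workaround parameters were dropped in the minimal test, the two command lines can be compared token by token. The following is only a sketch; params_only_in_first is a hypothetical helper name, and it assumes that parameters are separated by single spaces and contain no spaces themselves.

```sh
# List the parameters that occur in the first kernel command line ($1)
# but not in the second ($2), one per line.
# The body runs in a subshell so that 'set -f' does not leak out.
params_only_in_first() (
    set -f                      # no filename expansion while splitting $1
    for p in $1; do
        case " $2 " in
            *" $p "*) ;;        # also present in the second command line
            *) printf '%s\n' "$p" ;;
        esac
    done
)

# Possible usage (single quotes keep ${extra_cmdline} literal):
#   params_only_in_first 'root=/dev/mapper/system-root acpi_osi=Linux mitigations=auto' \
#                        'root=/dev/mapper/system-root mitigations=auto'
```

Feeding it the full and the minimal command line from this comment would list exactly the parameters that the minimal boot ruled out as the cause.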
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c3

--- Comment #3 from Bunte Katze <awoo@posteo.de> ---
I used minimal kernel params and appended `module_blacklist` for all nvidia modules: https://imgur.com/a/1tX1HoL

The result was the same as before. The issue does not seem to be related to the nvidia driver.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c4

--- Comment #4 from Takashi Iwai <tiwai@suse.com> ---
Could you check the kernel in OBS Kernel:stable repo? It contains the newer 5.14.x kernel for the upcoming TW.

Note that it's an unofficial build, and it won't boot with Secure Boot.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c5

--- Comment #5 from Bunte Katze <awoo@posteo.de> ---
Mid-air comment collision.

My comment:

1) I have now tried nvidia.prime=intel, but unfortunately that did not make a difference: https://imgur.com/a/SvJfthS

2) I also tried it in combination with modprobe.blacklist=nvidia, but that did not help either. (Which is no surprise, since disabling all nvidia drivers made no difference: https://imgur.com/a/1tX1HoL)

3) Given that the suspicious "can't change power state from D3cold to D0 (config space inaccessible)" message comes from xhci_hcd, and the issue occurs just the same when using module_blacklist on *all* nvidia modules (https://imgur.com/a/1tX1HoL), this issue does not seem to be related to nvidia at all.

4) On reddit, u/ditzypoet made me aware (https://www.reddit.com/r/openSUSE/comments/ppyfcd/optimus_laptop_hangs_on_bo...) of this repository: https://github.com/ReimuNotMoe/Linux-on-GK5CP6V-S

Therefore, I tried to boot with the following kernel params:

linuxefi /boot/vmlinuz-5.14.2-1-default root=/dev/mapper/system-root ${extra_cmdline} noresume mitigations=auto pci=nommconf acpi_osi=Linux intel_iommu=off

However, that did not make any difference: https://imgur.com/a/iK9rlXQ

--

My response:

@Takashi Iwai Thank you for your hint! I will try this kernel. Presumably I only need to disable Secure Boot in the BIOS setup? I'll let you know once I have figured this out and tried the kernel!
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c6

--- Comment #6 from Bunte Katze <awoo@posteo.de> ---
@Takashi Iwai I am now running 5.14.5-1.gfdb6afd-default from https://download.opensuse.org/repositories/Kernel:/stable/standard/

It boots fine, and I get to a graphical desktop if I add `nvidia.prime=intel`. The nvidia driver is of course missing.

Can we conclude anything from that now?
- Is it a regression in 5.14.2 that was fixed in 5.14.3 or later?
- Or is this caused by the nvidia module being installed, even if I disabled it with module_blacklist=nvidia,nvidia_drm,nvidia_modeset,nvidia_uvm,i2c_nvidia_gpu ?

I will see if I manage to compile the nvidia module for that kernel and either boot successfully or find it being the cause of the issue.
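The module_blacklist= value used above is just a comma-separated list of module names. A small sketch for assembling it without typos follows; blacklist_param is a hypothetical helper name, not part of any package.

```sh
# Join the given module names into a module_blacklist= kernel parameter.
blacklist_param() {
    list=""
    for m in "$@"; do
        list="${list:+$list,}$m"   # add a comma only between entries
    done
    printf 'module_blacklist=%s\n' "$list"
}

# Possible usage with the modules named in this comment:
#   blacklist_param nvidia nvidia_drm nvidia_modeset nvidia_uvm i2c_nvidia_gpu
```

The resulting string can be appended to the linuxefi line in the GRUB editor for a one-off test boot.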
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c7

Michael Pujos <pujos.michael@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pujos.michael@gmail.com

--- Comment #7 from Michael Pujos <pujos.michael@gmail.com> ---
I'm fairly convinced that the suse-prime udev rules for NVIDIA power management are causing this issue:

https://github.com/openSUSE/SUSEPrime/blob/master/90-nvidia-udev-pm-G05.rule... (installed in /usr/lib/udev/rules.d/90-nvidia-udev-pm-G05.rules) and:

https://github.com/openSUSE/SUSEPrime/blob/master/09-nvidia-modprobe-pm-G05.... (installed in /usr/lib/modprobe.d/09-nvidia-modprobe-pm-G05.conf)

With the new suse-prime 0.8.2 that was just published in TW, there is a new nvidia.prime= parameter that allows overriding whatever driver (intel, nvidia) was previously set by the user. nvidia.prime=intel will cause the nvidia modules not to be loaded, allowing the machine to boot.

The next testing step would be to disable the power management. Try commenting out all lines in /usr/lib/udev/rules.d/90-nvidia-udev-pm-G05.rules (and /usr/lib/modprobe.d/09-nvidia-modprobe-pm-G05.conf, although this one might not be necessary), then attempt to boot normally (or use nvidia.prime=nvidia) to check if it boots.
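A rough sketch of the "comment out all lines" step, written so that the packaged file is left alone and a commented copy is produced instead. comment_out is a hypothetical helper name. Placing the copy in /etc/udev/rules.d/ relies on the precedence described in udev(7), where a file in /etc overrides a file of the same name in /usr/lib; that placement is my assumption, not something suggested in this thread.

```sh
# Write a copy of a rules/conf file ($1) to $2 with every active line
# commented out. Empty lines and lines already starting with '#' are
# left as they are.
comment_out() {
    sed 's/^[^#]/#&/' "$1" > "$2"
}

# Possible usage:
#   comment_out /usr/lib/udev/rules.d/90-nvidia-udev-pm-G05.rules \
#               /tmp/90-nvidia-udev-pm-G05.rules
#   # then, as root, copy the result to /etc/udev/rules.d/ under the same name
```

Working on a copy also makes the test easy to undo: deleting the file in /etc restores the packaged rules.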
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c8

--- Comment #8 from Bunte Katze <awoo@posteo.de> ---
@Takashi Iwai So we have a huge step forward here: I was able to recompile the nvidia kernel module for 5.14.5-1.1.gfdb6afd:

`zypper in -f kernel-default-devel=5.14.5-1.1.gfdb6afd nvidia-gfxG05-kmp-default`

With `prime-select nvidia` I could switch to the newly installed module. When I rebooted (5.14.5-1.gfdb6afd-default, now with nvidia support), everything ran perfectly! No more boot hang, and the messages following "usb: port power management may be unreliable" are back to normal:

Sep 17 21:03:44 felicity kernel: usb: port power management may be unreliable
Sep 17 21:03:44 felicity kernel: xhci_hcd 0000:01:00.2: xHCI Host Controller
Sep 17 21:03:44 felicity kernel: xhci_hcd 0000:01:00.2: new USB bus registered, assigned bus number 3
Sep 17 21:03:44 felicity kernel: xhci_hcd 0000:01:00.2: hcc params 0x0180ff05 hci version 0x110 quirks 0x0000000000000010

So 5.14.2-1-default seems to have a regression relative to 5.14.1-1-default, while 5.14.5-1.1.gfdb6afd (or possibly an earlier version such as 5.14.3 or 5.14.4) fixes it.
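To check a kernel log for the failure signature without reading it line by line, something like the following sketch could be used. has_power_state_error is a hypothetical name; the search string is the message quoted in the original report.

```sh
# Succeed (exit 0) if the kernel log on stdin contains the power state
# error from the original report, fail (exit 1) otherwise.
has_power_state_error() {
    grep -q -F "can't change power state from D3cold to D0"
}

# Possible usage on the running system:
#   if journalctl -k -b | has_power_state_error; then
#       echo "power state error present in this boot"
#   fi
```

On the log excerpt quoted above the check would fail, which matches the observation that the messages are back to normal.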
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c9

--- Comment #9 from Bunte Katze <awoo@posteo.de> ---
@Michael Pujos I did already try nvidia.prime=intel with 5.14.2-1-default, but that changed nothing. The nvidia driver appears to not be the culprit.

As Takashi Iwai advised, I tried a later kernel from https://download.opensuse.org/repositories/Kernel:/stable/standard/ (5.14.5-1.gfdb6afd-default), which worked, and continued working as I installed the nvidia kernel module and switched back from intel to nvidia. (I am currently writing this on that new setup.)

Do you think it is still necessary for bug diagnostics to disable the udev power management rules? (https://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c7)

I will have a hard time doing any file modifications, since I can only get to the broken kernel by booting a readonly btrfs snapshot - which fails to boot. (It should probably be possible to choose a previous kernel from the boot menu? I can try if you think it will help figure out the bug, even though 5.14.5 turned out working fine.)

Unfortunately, snapper even cleaned out my previous working setup (kernel 5.14.1) before I could increase the NUMBER_LIMIT in /etc/snapper/configs/root. ¯\_(ツ)_/¯
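For reference, a small sketch for reading the current NUMBER_LIMIT from a snapper config file before changing it. get_number_limit is a hypothetical helper name, and it assumes the usual KEY="value" layout of files under /etc/snapper/configs/.

```sh
# Print the value of NUMBER_LIMIT from a snapper config file ($1).
# Similarly named keys such as NUMBER_LIMIT_IMPORTANT are not matched.
get_number_limit() {
    sed -n 's/^NUMBER_LIMIT="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$1"
}

# Possible usage:
#   get_number_limit /etc/snapper/configs/root
# Raising the limit afterwards (the value here is only illustrative):
#   snapper -c root set-config "NUMBER_LIMIT=10-20"
```

Checking the value first makes it easier to judge how many snapshots survive a few rounds of updates before the last known-good kernel is cleaned up.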
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c10

--- Comment #10 from Bunte Katze <awoo@posteo.de> ---
@Michael Pujos PS: I just found out I still run 0.7.17-2.2 anyway. I hope the update to 0.8 will not bring new bad surprises. ;)
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c11

--- Comment #11 from Michael Pujos <pujos.michael@gmail.com> ---
The very latest TW snapshot has suse-prime 0.8.2. Although it has rather large changes, it should not bring bad surprises, since the udev rules for dynamic power management I mentioned were already in 0.7.17.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c12

Lee <caramilk@gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |caramilk@gmx.com

--- Comment #12 from Lee <caramilk@gmx.com> ---
For me it's suse-prime pulling in bbswitch and bbswitch-kmp-default, which freeze my optimus laptop on boot. I rolled back to a snapshot and upgraded without installing bbswitch and bbswitch-kmp-default; the system boots fine afterwards.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c13

--- Comment #13 from Bunte Katze <awoo@posteo.de> ---
@Michael Pujos Indeed I am now running bbswitch 0.8-11.28, and it all runs fine on kernel 5.14.5. I believe bbswitch is not the culprit here.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c14

--- Comment #14 from Bunte Katze <awoo@posteo.de> ---
@Lee What you describe sounds like an entirely unrelated issue. If you think otherwise, and especially if your issue is about kernel 5.14.2, then please add more information about your system, versions, details about the boot freeze etc., to help with diagnosis.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c15

--- Comment #15 from Michael Pujos <pujos.michael@gmail.com> ---
bbswitch is used (if installed) only in "intel" or "intel2" mode for prime-select, to disable the NVIDIA card entirely. Otherwise, it does nothing (the module is not even loaded).
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c16

--- Comment #16 from Lee <caramilk@gmx.com> ---
I have a 2020 version Predator Triton 300 with an i7 10750H and an nvidia 2070 max-q. It freezes when booting the 5.14.2 kernel, right at the line "xhci_hcd: can't change power state from D3Hot to D0".

The strange thing is that if I have bbswitch and bbswitch-kmp-default installed, any older kernel will freeze when booting at exactly the same line, even 5.13.13. That's why I suspect it's something not related to kernel versions or nvidia.

I rolled back the update, then updated without bbswitch and bbswitch-kmp-default, and 5.14.2 boots normally. I haven't tried whether the 5.14.5 kernel can coexist with bbswitch and bbswitch-kmp-default.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c17

--- Comment #17 from Michael Pujos <pujos.michael@gmail.com> ---
@Lee

The bbswitch module is only loaded by the prime-select systemd service (which in turn calls /usr/sbin/prime-select) if prime-select operates in Intel mode ('prime-select intel' or 'prime-select intel2'). That service (eventually loading the bbswitch module in the cases mentioned above) is called at the very end of the boot process, just before the Display Manager is spawned.

Old versions of the suse-prime package did not pull bbswitch (you had to use the suse-prime-bbswitch package for that) but the new version does as it combines both packages.

The use of bbswitch to disable a PCI device (the NVIDIA card) can be dangerous and may not work everywhere, though it works on my laptop with a Pascal NVIDIA GPU, on all kernels. The suse-prime package update made all users use bbswitch now (again, only in Intel mode) which is probably not a good idea.

I will create a bug report for the upstream project.
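Whether bbswitch actually got loaded on a given boot can be checked from the lsmod output. A minimal sketch follows; module_loaded is a hypothetical helper name, and it expects lsmod's usual format with one header line followed by one module per line.

```sh
# Succeed (exit 0) if module $1 appears in the lsmod output on stdin.
module_loaded() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Possible usage:
#   if lsmod | module_loaded bbswitch; then
#       echo "bbswitch is loaded"
#   fi
```

According to the description above, this should only report the module as loaded after booting in 'intel' or 'intel2' mode.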
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c18

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sndirsch@suse.com

--- Comment #18 from Stefan Dirsch <sndirsch@suse.com> ---
(In reply to Michael Pujos from comment #17)
> @Lee
>
> The bbswitch module is only loaded by the prime-select systemd service (which in turn calls /usr/sbin/prime-select) if prime-select operates in Intel mode ('prime-select intel' or 'prime-select intel2'). That service (eventually loading the bbswitch module in the cases mentioned above) is called at the very end of the boot process, just before the Display Manager is spawned.
>
> Old versions of the suse-prime package did not pull bbswitch (you had to use the suse-prime-bbswitch package for that) but the new version does as it combines both packages.
>
> The use of bbswitch to disable a PCI device (the NVIDIA card) can be dangerous and may not work everywhere, though it works on my laptop with a Pascal NVIDIA GPU, on all kernels. The suse-prime package update made all users use bbswitch now (again, only in Intel mode) which is probably not a good idea.
>
> I will create a bug report for the upstream project.

Thanks for taking care, Michael!
https://github.com/openSUSE/SUSEPrime/issues/70
http://bugzilla.opensuse.org/show_bug.cgi?id=1190608#c19

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #19 from Stefan Dirsch <sndirsch@suse.com> ---
The initial issue appears to be fixed with later kernels, and Lee's issue with bbswitch has meanwhile been addressed by
https://github.com/openSUSE/SUSEPrime/issues/70
https://build.opensuse.org/request/show/920110

I think we can close this ticket now.