[Bug 1188745] New: kexec reboot fails since 5.13.4
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745

Bug ID: 1188745
Summary: kexec reboot fails since 5.13.4
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: mrueckert@suse.com
QA Contact: qa-bugs@suse.de
CC: sndirsch@suse.com
Found By: ---
Blocker: ---

Jul 26 01:47:44 fortress kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0a:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
Jul 26 01:47:48 fortress kernel: NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x23:0x65:1204)
Jul 26 01:47:48 fortress kernel: NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
Jul 26 01:47:48 fortress kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to allocate NvKmsKapiDevice
Jul 26 01:47:48 fortress kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to register device

The same worked fine with the same version of the nvidia driver (470.57.02) on 5.13.2. I can provide full boot log if needed.

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c1
--- Comment #1 from Stefan Dirsch <sndirsch@suse.com> ---
No idea. Maybe rebuilding/reinstalling the nvidia kernel module helps. Do we officially support kexec reboot?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c2
--- Comment #2 from Marcus Rückert <mrueckert@suse.com> ---
1. rebuilding didn't help
2. https://documentation.suse.com/sles/15-SP3/html/SLES-all/cha-tuning-kexec.ht...
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c3
--- Comment #3 from Stefan Dirsch <sndirsch@suse.com> ---
Ok. But a regular boot works, right? If it does, then does a kexec reboot of the same 5.13.4 kernel work?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745
Stefan Dirsch <sndirsch@suse.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |IN_PROGRESS
                 CC|                            |mrueckert@suse.com
              Flags|                            |needinfo?(mrueckert@suse.com)
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c4
Marcus Rückert <mrueckert@suse.com> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
              Flags|needinfo?(mrueckert@suse.com) |
--- Comment #4 from Marcus Rückert <mrueckert@suse.com> ---
1. yes, booting the 5.13.4 kernel works fine.
2. doing a normal reboot also works fine (deleting reboot.target and running systemctl daemon-reload).
only kexec reboot does not work anymore. maybe something changed about device (de)initialization between 5.13.2 and 5.13.4?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c5
--- Comment #5 from Stefan Dirsch <sndirsch@suse.com> ---
Yeah, who knows ...
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c6
Marcus Rückert <mrueckert@suse.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|mrueckert@suse.com          |jroedel@suse.com
--- Comment #6 from Marcus Rückert <mrueckert@suse.com> ---
Jeff mentioned some iommu changes between 5.13.2 and 5.13.4 could cause this.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c7
Stefan Dirsch <sndirsch@suse.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeffm@suse.com
--- Comment #7 from Stefan Dirsch <sndirsch@suse.com> ---
Jeff Mahoney?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745
Stefan Dirsch <sndirsch@suse.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|jeffm@suse.com              |
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c10
--- Comment #10 from Joerg Roedel <jroedel@suse.com> ---
(In reply to Marcus Rückert from comment #0)
> Jul 26 01:47:44 fortress kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0a:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
This is an interrupt translation request while IRQ remapping is disabled for the device. Do the command line parameters differ between the first and the kexec kernel? There are no AMD IOMMU driver changes between 5.13.2 and 5.13.4, only ARM-SMMU and Intel VT-d fixes. So nothing changed on the AMD driver side.
> I can provide full boot log if needed.
What hardware does this happen on? A full boot log might also help.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c12
--- Comment #12 from Marcus Rückert <mrueckert@suse.com> ---
forgot to mention: cmdline params are the same. the problem happens no matter if the old running kernel is 5.13.2 or 5.13.4. if the target kernel is 5.13.4 it breaks.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c13
--- Comment #13 from Joerg Roedel <jroedel@suse.com> ---
I looked through the failure log file, but this didn't give any insights. What I found so far is that the device sends an interrupt request before any IRQs have been configured by the device driver. The IOMMU is configured to still block all interrupt requests from the device. This is the only situation where this can happen (in fact, when IV=1b and IntCtl=00b in the DTE for the device).
As there are no AMD IOMMU changes in the range, my guess is that some other change between 5.13.2 and 5.13.4 influenced the behaviour of the Nvidia module, so that it now sends an IRQ when it really shouldn't.
There are two options on how to proceed:
* Bisect upstream stable kernels between 5.13.2 and 5.13.4 to find the commit which introduced the issue
* Work around the problem by booting with 'intremap=off' on the kernel command line.
Please let me know what you prefer.
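For the second option in comment #13, a minimal sketch of how one might confirm that the workaround is really in effect after a (kexec) reboot. The helper name `has_intremap_off` is hypothetical and not part of the report; it only looks for the literal `intremap=off` token on a kernel command line.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): succeed if the given kernel
# command line contains the exact parameter intremap=off.
has_intremap_off() {
    # $1: command line string; if omitted, read the running kernel's cmdline
    cmdline=${1-$(cat /proc/cmdline)}
    for param in $cmdline; do
        # Compare whole tokens so that e.g. "xintremap=off" does not match
        [ "$param" = "intremap=off" ] && return 0
    done
    return 1
}
```

Called without an argument on the affected machine (`has_intremap_off && echo active`), it reads `/proc/cmdline`, so running it before and after the kexec also shows whether both kernels were started with the same setting.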
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c14
--- Comment #14 from Marcus Rückert <mrueckert@suse.com> ---
bisect would mean rebuilding the kernel over and over and rebooting the machine for each kernel?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c15
--- Comment #15 from Joerg Roedel <jroedel@suse.com> ---
(In reply to Marcus Rückert from comment #14)
> bisect would mean rebuilding the kernel over and over and rebooting the machine for each kernel?
Yes, there are 610 commits between 5.13.2 and 5.13.4, so ~9-10 rounds of compiling/testing. I don't see another way to find the offending commit, sorry.
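The round count in comment #15 follows from bisection halving the candidate range on every build/test cycle. A small sketch of that arithmetic; the function name `bisect_rounds` is an illustrative assumption, and a real `git bisect` run can need a round more or less depending on the shape of the history.

```shell
#!/bin/sh
# Worst-case number of build/test rounds needed to bisect N candidate commits:
# each round halves the remaining range, i.e. ceil(log2(N)).
bisect_rounds() {
    remaining=$1
    rounds=0
    while [ "$remaining" -gt 1 ]; do
        # Keep the larger half to model the worst case
        remaining=$(( (remaining + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}

bisect_rounds 610
```

For 610 commits this prints 10, the upper end of the quoted ~9-10 rounds.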
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c16
--- Comment #16 from Marcus Rückert <mrueckert@suse.com> ---
some more details
kernel 5.13.8-1 (current TW kernel)
1. kexec from 470.63.01 into system with 470.42.01 - failed kexec
2. power cycle and boot directly into system with 470.42.01 - working kexec with 470.42.01,
3. kexec from 470.42.01 into system with 470.57.02 - breaks already when 470.57.02 tries to boot
From the log it looks like this step was still working with 5.13.2
```
root@fortress ~ # for boot in 28 27 26 25 24 23 22 21 ; do echo "boot ID: -${boot}" ; journalctl -b -${boot} --no-tail | rg -i "(NVRM: loading NVIDIA UNIX x86_64 Kernel Module|kernel: Linux version|INVALID_DEVICE_REQUEST)" ; done
boot ID: -28
Jul 18 02:27:17 fortress kernel: Linux version 5.13.1-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Mon Jul 12 06:35:58 UTC 2021 (72aabc2)
Jul 18 02:27:26 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.42.01 Tue Jun 15 21:26:37 UTC 2021
boot ID: -27
Jul 19 11:55:44 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 19 11:56:25 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.42.01 Tue Jun 15 21:26:37 UTC 2021
boot ID: -26
Jul 19 20:17:24 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 19 20:17:34 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
boot ID: -25
Jul 23 18:16:01 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 23 18:16:17 fortress kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0a:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
Jul 23 18:16:18 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
boot ID: -24
Jul 23 18:17:46 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 23 18:18:04 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
boot ID: -23
Jul 23 18:32:45 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 23 18:32:57 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
boot ID: -22
Jul 23 18:37:54 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 23 18:38:12 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
boot ID: -21
Jul 25 03:05:38 fortress kernel: Linux version 5.13.2-1-default (geeko@buildhost) (gcc (SUSE Linux) 11.1.1 20210625 [revision 62bbb113ae68a7e724255e17143520735bcb9ec9], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-4) #1 SMP Thu Jul 15 03:36:02 UTC 2021 (89416ca)
Jul 25 03:05:52 fortress kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0a:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
Jul 25 03:05:52 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
```
the boot ID -25 was the reason why i thought it was a kernel regression in the first place. But further testing seems to show we have at least a partial problem in the nvidia driver. Do we have anyone from nvidia whom we could CC on the bug to get some input from them?
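The per-boot scan in comment #16 can be reduced to a yes/no answer per boot. A minimal sketch, assuming plain `grep` instead of `rg`; the helper name `iommu_fault_seen` is hypothetical and only checks for the `INVALID_DEVICE_REQUEST` string quoted in the logs above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and report whether the
# AMD-Vi INVALID_DEVICE_REQUEST event appears anywhere in them.
iommu_fault_seen() {
    if grep -q 'INVALID_DEVICE_REQUEST'; then
        echo "fault"
    else
        echo "clean"
    fi
}
```

On the affected machine it could be fed one boot at a time, for example `journalctl -k -b -25 | iommu_fault_seen`.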
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c17
--- Comment #17 from Marcus Rückert <mrueckert@suse.com> ---
5.14.0~rc6 doesn't fix it either.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c18
--- Comment #18 from Joerg Roedel <jroedel@suse.com> ---
So this looks like an NVidia driver problem which is triggered by a kernel change in the v5.14 kernel. Stefan, can you have a look please?
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c19
--- Comment #19 from Stefan Dirsch <sndirsch@suse.com> ---
Hmm. The issue already occurred with 5.13.4! Or did you make some backports from 5.14 to 5.13.4, which may explain this behaviour? I could ask nVidia if they can reproduce the issue on Kernel >= 5.13.4 or 5.14 (if needed). I'm not sure if they do tests with kexec at all ... maybe they never supported it.
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c23
--- Comment #23 from Marcus Rückert <mrueckert@suse.com> ---
retested with 510.39.01 on 5.15.12 - failed
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c27
Marcus Rückert <mrueckert@suse.com> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
              Flags|needinfo?(mrueckert@suse.com) |
--- Comment #27 from Marcus Rückert <mrueckert@suse.com> ---
```
Feb 25 11:48:22 fortress kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Feb 25 11:48:22 fortress kernel:
Feb 25 11:48:22 fortress kernel: nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Feb 25 11:48:22 fortress kernel: iommu ivhd0: AMD-Vi: Event logged [INVALID_DEVICE_REQUEST device=0a:00.0 pasid=0x00000 address=0xfffffffdf8000000 flags=0x0a00]
Feb 25 11:48:22 fortress kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 510.54 Tue Feb 8 04:42:21 UTC 2022
Feb 25 11:48:23 fortress kernel: nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
Feb 25 11:48:23 fortress kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
Feb 25 11:48:23 fortress kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 510.54 Tue Feb 8 04:34:06 UTC 2022
Feb 25 11:48:23 fortress kernel: [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
Feb 25 11:48:27 fortress kernel: NVRM: GPU 0000:0a:00.0: RmInitAdapter failed! (0x23:0x65:1401)
Feb 25 11:48:27 fortress kernel: NVRM: GPU 0000:0a:00.0: rm_init_adapter failed, device minor number 0
Feb 25 11:48:27 fortress kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to allocate NvKmsKapiDevice
Feb 25 11:48:27 fortress kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000a00] Failed to register device
```
so no
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c28
--- Comment #28 from Stefan Dirsch <sndirsch@suse.com> ---
Thanks for re-testing, Marcus!
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c31
Peter Sütterlin <P.Suetterlin@royac.iac.es> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |P.Suetterlin@royac.iac.es
--- Comment #31 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
Not sure if it is related in any way, but I do get the same error after upgrading to kernel 6.0 / TW 20221012:
kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.141.03 Thu Jun 30 18:34:41 UTC 2022
kernel: [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
kernel: NVRM: GPU 0000:02:00.0: Failed to copy vbios to system memory.
kernel: NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x30:0xffff:874)
kernel: NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000200] Failed to allocate NvKmsKapiDevice
kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000200] Failed to register device
The card is
02:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
(That's a Mobile / Optimus chip)
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745
Min-Seok Oh <recoverpoint@naver.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |recoverpoint@naver.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c35
--- Comment #35 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
(In reply to Stefan Dirsch from comment #34)
> Hmm. Of course you could try with
> options nvidia-drm modeset=0
> in /etc/modprobe.d/50-nvidia-default.conf. But with that you can't use Wayland any longer. But since you're using suse-prime you're apparently still using X anyway.
Hmm, tried that, but it is still always loading nvidia-modeset!? Needless to say it still won't work:
[drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000200] Failed to allocate NvKmsKapiDevice
[drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000200] Failed to register device
(With latest TW and kernel 6.0.3). Should I open a separate bug? I had only posted here because it's the same error message...
http://bugzilla.opensuse.org/show_bug.cgi?id=1188745#c37
--- Comment #37 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
--> https://bugzilla.opensuse.org/show_bug.cgi?id=1204756