[Bug 1219444] New: amdgpu critical error
https://bugzilla.suse.com/show_bug.cgi?id=1219444

Bug ID: 1219444
Summary: amdgpu critical error
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.5
Hardware: x86-64
OS: openSUSE Leap 15.5
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: teuniz@protonmail.com
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

Created attachment 872372
--> https://bugzilla.suse.com/attachment.cgi?id=872372&action=edit
Output of dmesg

The kernel crashes approx every 5 minutes. I reverted back to kernel 5.14.21-150500.55.19-default because with that one it crashes approx once a day.

Operating System: openSUSE Leap 15.5
KDE Plasma Version: 5.27.9
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 5.14.21-150500.55.44-default (64-bit)
Graphics Platform: X11
Processors: 32 × 13th Gen Intel Core i9-13900K
Memory: 31.0 GiB of RAM
Graphics Processor: AMD Radeon Pro W6600
Manufacturer: HP
Product Name: HP Z2 Tower G9 Workstation Desktop PC

dmesg | grep amdgpu

[ 1.540640] [drm] amdgpu kernel modesetting enabled.
[ 1.540703] amdgpu: CRAT table not found
[ 1.540705] amdgpu: Virtual CRAT table created for CPU
[ 1.540712] amdgpu: Topology: Add CPU node
[ 1.542670] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[ 1.542671] amdgpu: ATOM BIOS: 113-D5330400-100
[ 1.542770] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[ 1.542771] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[ 1.542799] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[ 1.542800] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 1.542801] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[ 1.542845] [drm] amdgpu: 8176M of VRAM memory ready
[ 1.542845] [drm] amdgpu: 15892M of GTT memory ready.
[ 1.548699] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[ 1.548704] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[ 2.854516] amdgpu 0000:03:00.0: amdgpu: STB initialized to 2048 entries
[ 2.895100] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[ 3.094413] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 3.115717] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 3.115740] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2b00 (59.43.0)
[ 3.115745] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 3.115777] amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
[ 3.165133] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[ 3.268063] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 3.268478] amdgpu: sdma_bitmap: ffff
[ 3.302091] amdgpu: HMM registered 8176MB device memory
[ 3.302135] amdgpu: SRAT table not found
[ 3.302136] amdgpu: Virtual CRAT table created for GPU
[ 3.302599] amdgpu: Topology: Add dGPU node [0x73e3:0x1002]
[ 3.302601] kfd kfd: amdgpu: added device 1002:73e3
[ 3.302617] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[ 3.302658] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 3.302659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 3.302659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 3.302660] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 3.302660] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 3.302661] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 3.302661] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 3.302662] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 3.302662] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 3.302663] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 3.302663] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 3.302664] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 3.302665] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 3.302665] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 3.302666] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 3.302666] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 3.303573] [drm] Initialized amdgpu 3.49.0 20150101 for 0000:03:00.0 on minor 0
[ 3.308709] fbcon: amdgpudrmfb (fb0) is primary device
[ 3.505728] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 3.505731] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[ 3.505733] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x0000073A
[ 3.505733] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: DCEDMC (0x3)
[ 3.505734] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 3.505735] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 3.505735] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 3.505735] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 3.505736] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 3.524299] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 4.537456] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 5.416287] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 5.416312] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[ 5.416319] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 5.416324] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: unknown (0x0)
[ 5.416329] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 5.416333] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
[ 5.416336] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 5.416340] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 5.416343] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 73.156519] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 73.156538] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[ 73.156546] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x0000073A
[ 73.156551] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: DCEDMC (0x3)
[ 73.156562] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 73.156566] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 73.156570] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 73.156578] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 73.156582] amdgpu 0000:03:00.0: amdgpu: RW: 0x0

uname -a

5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64 x86_64 x86_64 GNU/Linux

lspci

VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 WKS-XL [Radeon PRO W6600]

--
You are receiving this mail because:
You are the assignee for the bug.
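The per-field lines that follow each MMVM_L2_PROTECTION_FAULT_STATUS value in the dmesg excerpt above appear to be bit fields of that status word. The following is a minimal sketch for cross-checking them. The field layout (FIELDS) is an assumption inferred from the values printed in this log, not copied from the kernel headers, and the name decode_fault_status is hypothetical.

```python
# Sketch: split an amdgpu L2 protection fault status word into its fields.
# ASSUMPTION: the (shift, width) layout below is inferred from the decoded
# values shown in the dmesg output of this report; verify it against the
# kernel source before relying on it.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID" in the log
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Return a mapping of field name to value for one status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value taken from the log lines at 3.505733 and 73.156546.
    for name, value in decode_fault_status(0x0000073A).items():
        print(f"{name}: {value:#x}")
```

With this layout, 0x0000073A decodes to the same values the driver printed (client ID 0x3, WALKER_ERROR 0x5, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1), and the all-zero status at 5.416319 decodes to all-zero fields.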
https://bugzilla.suse.com/show_bug.cgi?id=1219444

Teuniz XXX <teuniz@protonmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #872372|Output of dmesg             |Output of /var/log/messages
        description|                            |
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c1

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|                            |needinfo?(teuniz@protonmail.com)
                 CC|                            |tiwai@suse.com

--- Comment #1 from Takashi Iwai <tiwai@suse.com> ---
Please test with a recent upstream kernel, e.g. from the OBS Kernel:stable:Backport repo:
http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/

If the problem persists, we'd need to report it to the upstream devs.
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c2

--- Comment #2 from Teuniz XXX <teuniz@protonmail.com> ---
Thanks, I just installed kernel

6.7.3-lp155.2.g0fa3c9e-default #1 SMP PREEMPT_DYNAMIC Thu Feb 1 05:38:11 UTC 2024 (0fa3c9e) x86_64 x86_64 x86_64 GNU/Linux

from the repo you mentioned and it booted without any error messages.

I'll continue to use this kernel and I'll let you know next week how it goes.

Have a nice weekend.
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c3

--- Comment #3 from Teuniz XXX <teuniz@protonmail.com> ---
After one week of testing, it seems that kernel 6.7.3-lp155.2.g0fa3c9e-default solves the problem. I haven't noticed any error messages or instabilities. Thank you for pointing me to that repo!

The only downside of that kernel is that I can't run VirtualBox. I need to run

sudo /usr/sbin/vboxconfig

which in turn tries to compile a kernel module but exits with an error, because the newer kernel was compiled with GCC 13 while the installed compiler is GCC 7.5.

Output of /var/log/virtualbox.log:

=== Building 'vboxdrv' module ===
make[1]: Entering directory '/usr/src/kernel-modules/virtualbox/src/vboxdrv'
make V= CONFIG_MODULE_SIG= CONFIG_MODULE_SIG_ALL= -C /lib/modules/6.7.3-lp155.2.g0fa3c9e-default/build M=/usr/src/kernel-modules/virtualbox/src/vboxdrv SRCROOT=/usr/src/kernel-modules/virtualbox/src/vboxdrv -j32 modules
make[2]: Entering directory '/usr/src/linux-6.7.3-lp155.2.g0fa3c9e-obj/x86_64/default'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
You are using: gcc (SUSE Linux) 7.5.0
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/linux/SUPDrv-linux.o
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrv.o
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvGip.o
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvSem.o
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvTracer.o
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPLibAll.o
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/linux/SUPDrv-linux.o] Error 1
make[4]: *** Waiting for unfinished jobs....
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvGip.o] Error 1
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrv.o] Error 1
CC [M] /usr/src/kernel-modules/virtualbox/src/vboxdrv/common/string/strformatrt.o
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvSem.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvTracer.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPLibAll.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/common/string/strformatrt.o] Error 1
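The compiler mismatch reported in the vboxdrv build log above can be detected before wading through the individual errors. The following is a minimal sketch that extracts both GCC versions from the "The kernel was built by:" and "You are using:" lines and compares their major numbers. The function names compiler_versions and major_mismatch are hypothetical, and the regular expressions assume the exact wording seen in this log.

```python
import re

# ASSUMPTION: the two marker phrases are worded exactly as in the log above.
KERNEL_RE = re.compile(r"The kernel was built by: gcc \(.*?\) (\d+(?:\.\d+)*)")
LOCAL_RE = re.compile(r"You are using: gcc \(.*?\) (\d+(?:\.\d+)*)")


def compiler_versions(log: str):
    """Return (kernel_gcc, local_gcc) version strings, or None if not found."""
    kernel = KERNEL_RE.search(log)
    local = LOCAL_RE.search(log)
    if kernel is None or local is None:
        return None
    return kernel.group(1), local.group(1)


def major_mismatch(log: str) -> bool:
    """True if both versions are present and their major numbers differ."""
    versions = compiler_versions(log)
    if versions is None:
        return False
    kernel, local = versions
    return kernel.split(".")[0] != local.split(".")[0]


if __name__ == "__main__":
    # Path taken from the report; adjust it if the log lives elsewhere.
    with open("/var/log/virtualbox.log", encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    print(compiler_versions(text), major_mismatch(text))
```

For the log in this comment the sketch would report GCC 13.2.1 for the kernel and 7.5.0 for the local compiler, which matches the ‘-mharden-sls=all’ errors: the kernel build system passes a flag that the older compiler does not recognize.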
https://bugzilla.suse.com/show_bug.cgi?id=1219444

Teuniz XXX <teuniz@protonmail.com> changed:

           What    |Removed                          |Added
----------------------------------------------------------------------------
              Flags|needinfo?(teuniz@protonmail.com) |
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c4

--- Comment #4 from Takashi Iwai <tiwai@suse.com> ---
Yes, the lack of KMP is a known issue with the TW kernel build for Leap, unfortunately.

Honestly speaking, fixing this kind of bug for amdgpu on the SLE15-SP5 kernel is really tough. It seems to hit only certain models / hardware configs.

You may also try the Leap 15.6 kernel instead of the TW backport kernel; it should be new enough and receive most of the fixes from the latest code. The vbox driver should be available for Leap 15.6, too, but maybe only at some later point after the kABI freeze.
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c5

--- Comment #5 from Teuniz XXX <teuniz@protonmail.com> ---
I believe the root cause is not a kernel bug but a bug in Mesa:

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30510
https://bugzilla.redhat.com/show_bug.cgi?id=2299241
https://forums.opensuse.org/t/amdgpu-instability-on-tumbleweed-20240911/1785...

Is it possible to backport the Mesa fix into Leap 15.5 / 15.6?

The problem is that, with the newer kernels, the crash still happens (but less often than with the stock kernel). Also, I can't run VirtualBox, which I really need. And Tumbleweed is not an option either.

Thanks!
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c6

--- Comment #6 from Teuniz XXX <teuniz@protonmail.com> ---
Do I need to create a new bug report for this (because it's related to Mesa)?
https://bugzilla.suse.com/show_bug.cgi?id=1219444
https://bugzilla.suse.com/show_bug.cgi?id=1219444#c7

Takashi Iwai <tiwai@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|Kernel                      |X.Org
         QA Contact|qa-bugs@suse.de             |gfx-bugs@suse.de
           Assignee|kernel-bugs@opensuse.org    |gfx-bugs@suse.de

--- Comment #7 from Takashi Iwai <tiwai@suse.com> ---
Let's just reassign the component.