[Bug 1190188] New: AMD GPU reset causes sporadic system freezes with kernel 5.3.18-59.19
https://bugzilla.suse.com/show_bug.cgi?id=1190188

Bug ID: 1190188
Summary: AMD GPU reset causes sporadic system freezes with kernel 5.3.18-59.19
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.3
Hardware: x86-64
OS: openSUSE Leap 15.3
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: runger@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Hardware: Company-provided laptop, Lenovo ThinkPad T14s, AMD Ryzen Pro 7.

After updating the kernel from 5.3.18-59.16 to .19, the following behavior is observed:
The system boots, GNOME starts, and after a while (I have not seen a pattern) the whole system freezes. A hard reset is necessary to get out of that condition.

'dmesg' did not yield useful logs for me. Below is a collection of logs from /var/log/messages at the time of the freezes.

------
2021-08-27T08:00:12.349977+02:00 localhost fwupd[2890]: ERROR:esys:src/tss2-esys/esys_context.c:69:Esys_Initialize() Initialize default tcti.
ErrorCode (0x000a000a)
2021-08-27T08:00:12.056977+02:00 localhost fwupd[2890]: 06:00:12:0056 FuEngine failed to add device usb:04:00:01:03:03: failed to read SPI chip ID: failed to read chip ID: endpoint stalled or request not supported
2021-08-27T08:00:10.696953+02:00 localhost gnome-shell[2242]: JS ERROR: TypeError: corner.set_style_pseudo_class is not a function#012updateElementPositions/connectCorner/corner._buttonStyleChangedSignalId<@/home/ralf/.local/share/gnome-shell/extensions/dash-to-panel@jderose9.github.com/panel.js:574:21#012_loadBackground/signalId<@resource:///org/gnome/shell/ui/layout.js:621:13#012_emit@resource:///org/gnome/gjs/modules/signals.js:135:27#012SystemBackground/id<@resource:///org/gnome/shell/ui/background.js:526:17
021-08-27T07:44:33.816089+02:00 localhost kernel: [ 89.019320] xhci_hcd 0000:06:00.3: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 13
2021-08-27T07:44:46.862479+02:00 localhost kernel: [ 102.069523] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=518, emitted seq=521
2021-08-27T07:44:46.862482+02:00 localhost kernel: [ 102.069646] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2013 thread gnome-shel:cs0 pid 2042
2021-08-27T07:44:46.862485+02:00 localhost kernel: [ 102.069652] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
2021-08-27T07:44:49.416068+02:00 localhost kernel: [ 104.622887] amdgpu 0000:06:00.0: amdgpu: failed send message: DisallowGfxOff (8) param: 0x00000000 response 0xffffffc2
2021-08-27T07:44:55.420076+02:00 localhost kernel: [ 110.623526] [drm:smu_v12_0_gfx_off_control [amdgpu]] *ERROR* disable gfxoff timeout and failed!
2021-08-27T07:44:55.420094+02:00 localhost kernel: [ 110.623532] amdgpu 0000:06:00.0: amdgpu: Failed to disable gfxoff!
2021-08-27T07:44:56.776102+02:00 localhost kernel: [ 111.979537] WARNING: CPU: 5 PID: 191 at ../drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c:104 rn_vbios_smu_send_msg_with_param+0xab/0x1e0 [amdgpu]
2021-08-27T07:44:59.372083+02:00 localhost kernel: [ 114.579105] amdgpu 0000:06:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
2021-08-27T07:44:59.372099+02:00 localhost kernel: [ 114.579109] amdgpu 0000:06:00.0: amdgpu: Failed to power gate SDMA!
2021-08-27T07:44:59.632106+02:00 localhost kernel: [ 114.839020] amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
2021-08-27T07:45:02.188068+02:00 localhost kernel: [ 117.394329] amdgpu 0000:06:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
021-08-27T08:10:38.516112+02:00 localhost kernel: [ 636.719266] ACPI Error: Aborting method \_SB.UBTC.ECRD due to previous error (AE_NOT_EXIST) (20200925/psparse-531)
2021-08-27T08:10:38.516115+02:00 localhost kernel: [ 636.719274] ACPI Error: Aborting method \_SB.UBTC.NTFY due to previous error (AE_NOT_EXIST) (20200925/psparse-531)
2021-08-27T08:10:38.516118+02:00 localhost kernel: [ 636.719279] ACPI Error: Aborting method \_SB.PCI0.LPC0.EC0._Q4F due to previous error (AE_NOT_EXIST) (20200925/psparse-531)
021-08-27T07:56:51.724723+02:00 localhost kernel: [ 629.931089] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
2021-08-27T07:56:56.856044+02:00 localhost kernel: [ 629.931264] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
2021-08-27T07:56:56.856064+02:00 localhost kernel: [ 635.061135] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=70578, emitted seq=70580
2021-08-27T07:56:56.856070+02:00 localhost kernel: [ 635.061258] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 2374 thread Xwayland:cs0 pid 2375
2021-08-27T07:56:56.856073+02:00 localhost kernel: [ 635.061264] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
2021-08-27T07:56:57.160276+02:00 localhost kernel: [ 635.363002] [drm] free PSP TMR buffer
2021-08-27T07:56:57.192050+02:00 localhost kernel: [ 635.396068] amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
2021-08-27T07:56:57.192070+02:00 localhost kernel: [ 635.396258] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
2021-08-27T07:56:57.192075+02:00 localhost kernel: [ 635.396502] [drm] PSP is resuming...
2021-08-27T07:56:57.209813+02:00 localhost kernel: [ 635.416407] [drm] reserve 0x400000 from 0xf41f800000 for PSP TMR
2021-08-27T07:56:57.608048+02:00 localhost kernel: [ 635.811004] amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
2021-08-27T07:56:57.608069+02:00 localhost kernel: [ 635.811834] amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
2021-08-27T07:56:57.852039+02:00 localhost kernel: [ 636.055472] [drm] kiq ring mec 2 pipe 1 q 0
2021-08-27T07:56:57.864034+02:00 localhost kernel: [ 636.069062] [drm] DMUB hardware initialized: version=0x01000000
2021-08-27T07:56:58.324045+02:00 localhost kernel: [ 636.194260] [drm] Failed to add display topology, DTM TA is not initialized.
2021-08-27T07:56:58.352064+02:00 localhost kernel: [ 636.527082] [drm] Failed to add display topology, DTM TA is not initialized.
2021-08-27T07:56:58.352089+02:00 localhost kernel: [ 636.556442] [drm] VCN decode and encode initialized successfully(under DPG Mode).
2021-08-27T07:56:58.352094+02:00 localhost kernel: [ 636.556590] [drm] JPEG decode initialized successfully.
2021-08-27T07:56:58.352097+02:00 localhost kernel: [ 636.556597] amdgpu 0000:06:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
2021-08-27T07:56:58.352100+02:00 localhost kernel: [ 636.556599] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2021-08-27T07:56:58.352104+02:00 localhost kernel: [ 636.556601] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2021-08-27T07:56:58.352107+02:00 localhost kernel: [ 636.556603] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
2021-08-27T07:56:58.352109+02:00 localhost kernel: [ 636.556604] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
2021-08-27T07:56:58.352112+02:00 localhost kernel: [ 636.556606] amdgpu 0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
2021-08-27T07:56:58.352115+02:00 localhost kernel: [ 636.556607] amdgpu 0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
2021-08-27T07:56:58.352117+02:00 localhost kernel: [ 636.556609] amdgpu 0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
2021-08-27T07:56:58.352120+02:00 localhost kernel: [ 636.556611] amdgpu 0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
2021-08-27T07:56:58.352123+02:00 localhost kernel: [ 636.556613] amdgpu 0000:06:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
2021-08-27T07:56:58.352125+02:00 localhost kernel: [ 636.556615] amdgpu 0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
2021-08-27T07:56:58.352128+02:00 localhost kernel: [ 636.556616] amdgpu 0000:06:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
2021-08-27T07:56:58.352131+02:00 localhost kernel: [ 636.556618] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
2021-08-27T07:56:58.352134+02:00 localhost kernel: [ 636.556619] amdgpu 0000:06:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
2021-08-27T07:56:58.352137+02:00 localhost kernel: [ 636.556621] amdgpu 0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
2021-08-27T07:56:58.356074+02:00 localhost kernel: [ 636.562581] [drm] recover vram bo from shadow start
2021-08-27T07:56:58.356099+02:00 localhost kernel: [ 636.562584] [drm] recover vram bo from shadow done
2021-08-27T07:56:58.356102+02:00 localhost kernel: [ 636.562587] [drm] Skip scheduling IBs!
2021-08-27T07:56:58.356105+02:00 localhost kernel: [ 636.562648] amdgpu 0000:06:00.0: amdgpu: GPU reset(2) succeeded!
2021-08-27T07:56:58.356108+02:00 localhost kernel: [ 636.562683] [drm] Skip scheduling IBs!
2021-08-27T07:56:58.356129+02:00 localhost kernel: [ 636.562809] gmc_v9_0_process_interrupt: 10 callbacks suppressed
2021-08-27T07:56:58.356132+02:00 localhost kernel: [ 636.562818] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32771, for process Xwayland pid 2374 thread Xwayland:cs0 pid 2375)
2021-08-27T07:56:58.356136+02:00 localhost kernel: [ 636.562822] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010082d000 from client 27
2021-08-27T07:56:58.356139+02:00 localhost kernel: [ 636.562824] amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00440C11
2021-08-27T07:56:58.356142+02:00 localhost kernel: [ 636.562825] amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: 0x6
2021-08-27T07:56:58.356143+02:00 localhost kernel: [ 636.562827] amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x1
2021-08-27T07:56:58.356146+02:00 localhost kernel: [ 636.562829] amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x0
2021-08-27T07:56:58.356149+02:00 localhost kernel: [ 636.562831] amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x1
2021-08-27T07:56:58.356152+02:00 localhost kernel: [ 636.562835] amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x0
2021-08-27T07:56:58.356155+02:00 localhost kernel: [ 636.562839] amdgpu 0000:06:00.0: amdgpu: RW: 0x1
2021-08-27T07:56:58.357727+02:00 localhost kernel: [ 636.563630] amdgpu 0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:32771, for process Xwayland pid 2374 thread Xwayland:cs0 pid 2375)
2021-08-27T07:56:58.357740+02:00 localhost kernel: [ 636.563633] amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x000080010082d000 from client 27
2021-08-27T07:57:18.136423+02:00 localhost kernel: [ 656.340199] BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
2021-08-27T07:57:20.150733+02:00 localhost kernel: [ 658.357082] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=70588, emitted seq=70590
2021-08-27T07:57:20.150759+02:00 localhost kernel: [ 658.357269] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2304 thread gnome-shel:cs0 pid 2336
2021-08-27T07:57:20.150763+02:00 localhost kernel: [ 658.357276] amdgpu 0000:06:00.0: amdgpu: GPU reset begin!
-------

--
You are receiving this mail because:
You are on the CC list for the bug.
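
For triage, the reset cycles in an excerpt like the one in this report can be counted with a short script. This is only a minimal sketch: the helper name `summarize_amdgpu_resets` and the category labels are made up for illustration, and the regular expressions are assumptions derived from the message texts quoted in the report, so they may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns are assumptions based on the kernel messages quoted in this report.
PATTERNS = {
    "ring_timeout": re.compile(r"amdgpu_job_timedout.*ring (\S+) timeout"),
    "reset_begin": re.compile(r"amdgpu: GPU reset begin!"),
    "reset_ok": re.compile(r"amdgpu: GPU reset\(\d+\) succeeded!"),
}


def summarize_amdgpu_resets(lines):
    """Count ring timeouts, reset attempts and completed resets.

    Hypothetical helper: accepts any iterable of log lines
    (a list of strings or an open file object).
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)


# Shortened sample lines modelled on the log excerpt in this report.
sample = [
    "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=518, emitted seq=521",
    "kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!",
    "kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) succeeded!",
    "kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset begin!",
]
print(summarize_amdgpu_resets(sample))
```

To run it against a real log, pass an open file instead of the sample list, for example `open("/var/log/messages", errors="replace")`. A `reset_begin` count higher than `reset_ok` would suggest that at least one reset never completed, which may correspond to a freeze.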
https://bugzilla.suse.com/show_bug.cgi?id=1190188#c1

--- Comment #1 from Ralf Unger <runger@suse.com> ---
Update: I just encountered the same issue on the 5.3.18-59.16 kernel. It seems less frequent there, though.
https://bugzilla.suse.com/show_bug.cgi?id=1190188#c2

Takashi Iwai <tiwai@suse.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC | |tiwai@suse.com

--- Comment #2 from Takashi Iwai <tiwai@suse.com> ---
OK, so this doesn't look like a regression, per se.

You can try the latest firmware from the kernel-firmware-amdgpu package in the OBS Kernel:stable:Backport repo. I'm not sure whether it's relevant, but it might be worth a try.