[Bug 1190188] New: AMD GPU reset causes sporadic system freezes with kernel 5.3.18-59.19

4 Sep 2021

      https://bugzilla.suse.com/show_bug.cgi?id=1190188

            Bug ID: 1190188
           Summary: AMD GPU reset causes sporadic system freezes with
                    kernel 5.3.18-59.19
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.3
          Hardware: x86-64
                OS: openSUSE Leap 15.3
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: runger@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Hardware: Company-provided laptop, Lenovo ThinkPad T14s with AMD Ryzen 7 PRO.

After updating the kernel from 5.3.18-59.16 to 5.3.18-59.19, the following
behavior is observed:

The system boots and GNOME starts; after a while (I have not seen a pattern)
the whole system freezes. A hard reset is necessary to recover from that
condition.

'dmesg' did not yield useful logs for me. Below is a collection of log entries
from /var/log/messages around the time of the freezes.

------
2021-08-27T08:00:12.349977+02:00 localhost fwupd[2890]:
ERROR:esys:src/tss2-esys/esys_context.c:69:Esys_Initialize() Initialize default
tcti. ErrorCode (0x000a000a)

2021-08-27T08:00:12.056977+02:00 localhost fwupd[2890]: 06:00:12:0056 FuEngine
failed to add device usb:04:00:01:03:03: failed to read SPI chip ID: failed to
read chip ID: endpoint stalled or request not supported

2021-08-27T08:00:10.696953+02:00 localhost gnome-shell[2242]: JS ERROR:
TypeError: corner.set_style_pseudo_class is not a
function#012updateElementPositions/connectCorner/corner._buttonStyleChangedSignalId<@/home/ralf/.local/share/gnome-shell/extensions/dash-to-panel@jderose9.github.com/panel.js:574:21#012_loadBackground/signalId<@resource:///org/gnome/shell/ui/layout.js:621:13#012_emit@resource:///org/gnome/gjs/modules/signals.js:135:27#012SystemBackground/id<@resource:///org/gnome/shell/ui/background.js:526:17

2021-08-27T07:44:33.816089+02:00 localhost kernel: [   89.019320] xhci_hcd
0000:06:00.3: ERROR Transfer event TRB DMA ptr not part of current TD ep_index
2 comp_code 13

2021-08-27T07:44:46.862479+02:00 localhost kernel: [  102.069523]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=518,
emitted seq=521
2021-08-27T07:44:46.862482+02:00 localhost kernel: [  102.069646]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process
gnome-shell pid 2013 thread gnome-shel:cs0 pid 2042
2021-08-27T07:44:46.862485+02:00 localhost kernel: [  102.069652] amdgpu
0000:06:00.0: amdgpu: GPU reset begin!
2021-08-27T07:44:49.416068+02:00 localhost kernel: [  104.622887] amdgpu
0000:06:00.0: amdgpu: failed send message: DisallowGfxOff (8)     param:
0x00000000 response 0xffffffc2
2021-08-27T07:44:55.420076+02:00 localhost kernel: [  110.623526]
[drm:smu_v12_0_gfx_off_control [amdgpu]] *ERROR* disable gfxoff timeout and
failed!
2021-08-27T07:44:55.420094+02:00 localhost kernel: [  110.623532] amdgpu
0000:06:00.0: amdgpu: Failed to disable gfxoff!
2021-08-27T07:44:56.776102+02:00 localhost kernel: [  111.979537] WARNING: CPU:
5 PID: 191 at
../drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c:104
rn_vbios_smu_send_msg_with_param+0xab/0x1e0 [amdgpu]

2021-08-27T07:44:59.372083+02:00 localhost kernel: [  114.579105] amdgpu
0000:06:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the
right state!
2021-08-27T07:44:59.372099+02:00 localhost kernel: [  114.579109] amdgpu
0000:06:00.0: amdgpu: Failed to power gate SDMA!
2021-08-27T07:44:59.632106+02:00 localhost kernel: [  114.839020] amdgpu
0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0
test failed (-110)
2021-08-27T07:45:02.188068+02:00 localhost kernel: [  117.394329] amdgpu
0000:06:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the
right state!

2021-08-27T08:10:38.516112+02:00 localhost kernel: [  636.719266] ACPI Error:
Aborting method \_SB.UBTC.ECRD due to previous error (AE_NOT_EXIST)
(20200925/psparse-531)
2021-08-27T08:10:38.516115+02:00 localhost kernel: [  636.719274] ACPI Error:
Aborting method \_SB.UBTC.NTFY due to previous error (AE_NOT_EXIST)
(20200925/psparse-531)
2021-08-27T08:10:38.516118+02:00 localhost kernel: [  636.719279] ACPI Error:
Aborting method \_SB.PCI0.LPC0.EC0._Q4F due to previous error (AE_NOT_EXIST)
(20200925/psparse-531)

2021-08-27T07:56:51.724723+02:00 localhost kernel: [  629.931089]
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed
out!
2021-08-27T07:56:56.856044+02:00 localhost kernel: [  629.931264]
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed
out!
2021-08-27T07:56:56.856064+02:00 localhost kernel: [  635.061135]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=70578, emitted seq=70580
2021-08-27T07:56:56.856070+02:00 localhost kernel: [  635.061258]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process
Xwayland pid 2374 thread Xwayland:cs0 pid 2375
2021-08-27T07:56:56.856073+02:00 localhost kernel: [  635.061264] amdgpu
0000:06:00.0: amdgpu: GPU reset begin!
2021-08-27T07:56:57.160276+02:00 localhost kernel: [  635.363002] [drm] free
PSP TMR buffer
2021-08-27T07:56:57.192050+02:00 localhost kernel: [  635.396068] amdgpu
0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
2021-08-27T07:56:57.192070+02:00 localhost kernel: [  635.396258] [drm] PCIE
GART of 1024M enabled (table at 0x000000F400900000).
2021-08-27T07:56:57.192075+02:00 localhost kernel: [  635.396502] [drm] PSP is
resuming...
2021-08-27T07:56:57.209813+02:00 localhost kernel: [  635.416407] [drm] reserve
0x400000 from 0xf41f800000 for PSP TMR
2021-08-27T07:56:57.608048+02:00 localhost kernel: [  635.811004] amdgpu
0000:06:00.0: amdgpu: SMU is resuming...
2021-08-27T07:56:57.608069+02:00 localhost kernel: [  635.811834] amdgpu
0000:06:00.0: amdgpu: SMU is resumed successfully!
2021-08-27T07:56:57.852039+02:00 localhost kernel: [  636.055472] [drm] kiq
ring mec 2 pipe 1 q 0
2021-08-27T07:56:57.864034+02:00 localhost kernel: [  636.069062] [drm] DMUB
hardware initialized: version=0x01000000
2021-08-27T07:56:58.324045+02:00 localhost kernel: [  636.194260] [drm] Failed
to add display topology, DTM TA is not initialized.
2021-08-27T07:56:58.352064+02:00 localhost kernel: [  636.527082] [drm] Failed
to add display topology, DTM TA is not initialized.
2021-08-27T07:56:58.352089+02:00 localhost kernel: [  636.556442] [drm] VCN
decode and encode initialized successfully(under DPG Mode).
2021-08-27T07:56:58.352094+02:00 localhost kernel: [  636.556590] [drm] JPEG
decode initialized successfully.
2021-08-27T07:56:58.352097+02:00 localhost kernel: [  636.556597] amdgpu
0000:06:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
2021-08-27T07:56:58.352100+02:00 localhost kernel: [  636.556599] amdgpu
0000:06:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
2021-08-27T07:56:58.352104+02:00 localhost kernel: [  636.556601] amdgpu
0000:06:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
2021-08-27T07:56:58.352107+02:00 localhost kernel: [  636.556603] amdgpu
0000:06:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
2021-08-27T07:56:58.352109+02:00 localhost kernel: [  636.556604] amdgpu
0000:06:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
2021-08-27T07:56:58.352112+02:00 localhost kernel: [  636.556606] amdgpu
0000:06:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
2021-08-27T07:56:58.352115+02:00 localhost kernel: [  636.556607] amdgpu
0000:06:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
2021-08-27T07:56:58.352117+02:00 localhost kernel: [  636.556609] amdgpu
0000:06:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
2021-08-27T07:56:58.352120+02:00 localhost kernel: [  636.556611] amdgpu
0000:06:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
2021-08-27T07:56:58.352123+02:00 localhost kernel: [  636.556613] amdgpu
0000:06:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
2021-08-27T07:56:58.352125+02:00 localhost kernel: [  636.556615] amdgpu
0000:06:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
2021-08-27T07:56:58.352128+02:00 localhost kernel: [  636.556616] amdgpu
0000:06:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
2021-08-27T07:56:58.352131+02:00 localhost kernel: [  636.556618] amdgpu
0000:06:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
2021-08-27T07:56:58.352134+02:00 localhost kernel: [  636.556619] amdgpu
0000:06:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
2021-08-27T07:56:58.352137+02:00 localhost kernel: [  636.556621] amdgpu
0000:06:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
2021-08-27T07:56:58.356074+02:00 localhost kernel: [  636.562581] [drm] recover
vram bo from shadow start
2021-08-27T07:56:58.356099+02:00 localhost kernel: [  636.562584] [drm] recover
vram bo from shadow done
2021-08-27T07:56:58.356102+02:00 localhost kernel: [  636.562587] [drm] Skip
scheduling IBs!
2021-08-27T07:56:58.356105+02:00 localhost kernel: [  636.562648] amdgpu
0000:06:00.0: amdgpu: GPU reset(2) succeeded!
2021-08-27T07:56:58.356108+02:00 localhost kernel: [  636.562683] [drm] Skip
scheduling IBs!
2021-08-27T07:56:58.356129+02:00 localhost kernel: [  636.562809]
gmc_v9_0_process_interrupt: 10 callbacks suppressed
2021-08-27T07:56:58.356132+02:00 localhost kernel: [  636.562818] amdgpu
0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4
pasid:32771, for process Xwayland pid 2374 thread Xwayland:cs0 pid 2375)
2021-08-27T07:56:58.356136+02:00 localhost kernel: [  636.562822] amdgpu
0000:06:00.0: amdgpu:   in page starting at address 0x000080010082d000 from
client 27
2021-08-27T07:56:58.356139+02:00 localhost kernel: [  636.562824] amdgpu
0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00440C11
2021-08-27T07:56:58.356142+02:00 localhost kernel: [  636.562825] amdgpu
0000:06:00.0: amdgpu:      Faulty UTCL2 client ID: 0x6
2021-08-27T07:56:58.356143+02:00 localhost kernel: [  636.562827] amdgpu
0000:06:00.0: amdgpu:      MORE_FAULTS: 0x1
2021-08-27T07:56:58.356146+02:00 localhost kernel: [  636.562829] amdgpu
0000:06:00.0: amdgpu:      WALKER_ERROR: 0x0
2021-08-27T07:56:58.356149+02:00 localhost kernel: [  636.562831] amdgpu
0000:06:00.0: amdgpu:      PERMISSION_FAULTS: 0x1
2021-08-27T07:56:58.356152+02:00 localhost kernel: [  636.562835] amdgpu
0000:06:00.0: amdgpu:      MAPPING_ERROR: 0x0
2021-08-27T07:56:58.356155+02:00 localhost kernel: [  636.562839] amdgpu
0000:06:00.0: amdgpu:      RW: 0x1
2021-08-27T07:56:58.357727+02:00 localhost kernel: [  636.563630] amdgpu
0000:06:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4
pasid:32771, for process Xwayland pid 2374 thread Xwayland:cs0 pid 2375)
2021-08-27T07:56:58.357740+02:00 localhost kernel: [  636.563633] amdgpu
0000:06:00.0: amdgpu:   in page starting at address 0x000080010082d000 from
client 27

2021-08-27T07:57:18.136423+02:00 localhost kernel: [  656.340199] BTRFS info
(device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
2021-08-27T07:57:20.150733+02:00 localhost kernel: [  658.357082]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
seq=70588, emitted seq=70590
2021-08-27T07:57:20.150759+02:00 localhost kernel: [  658.357269]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process
gnome-shell pid 2304 thread gnome-shel:cs0 pid 2336
2021-08-27T07:57:20.150763+02:00 localhost kernel: [  658.357276] amdgpu
0000:06:00.0: amdgpu: GPU reset begin!
-------
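
For anyone triaging the excerpt above, the GPU timeouts can be pulled out of the
wrapped log text with a small script. This is only a sketch: the function name
find_ring_timeouts and the regular expression are my own assumptions, written
against the "ring gfx timeout" lines as they appear in this report, and they may
not match the message format of other kernel versions.

```python
import re

# Pattern for the amdgpu job-timeout message as it appears in this report,
# after wrapped lines have been rejoined (an assumption, not a kernel contract).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] \[drm:amdgpu_job_timedout \[amdgpu\]\] "
    r"\*ERROR\* ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(text):
    """Return (uptime_seconds, ring, signaled_seq, emitted_seq) per timeout."""
    # The mail wrapped long log lines, so collapse every whitespace run
    # (including newlines) into a single space before matching.
    flat = " ".join(text.split())
    return [
        (float(m["uptime"]), m["ring"], int(m["signaled"]), int(m["emitted"]))
        for m in TIMEOUT_RE.finditer(flat)
    ]


if __name__ == "__main__":
    sample = (
        "2021-08-27T07:44:46.862479+02:00 localhost kernel: [  102.069523]\n"
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
        "signaled seq=518,\n"
        "emitted seq=521\n"
    )
    for uptime, ring, signaled, emitted in find_ring_timeouts(sample):
        # Jobs emitted but not yet signaled were still pending at the timeout.
        print(f"{uptime:.3f}s ring={ring} pending={emitted - signaled}")
```

The difference between the emitted and signaled sequence numbers gives the
number of jobs still outstanding on the ring when the timeout fired.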

-- 
You are receiving this mail because:
You are on the CC list for the bug.

bugzilla_noreply＠suse.com