Bug ID: 1200380
Summary: [amdgpu] OpenCL workload can reliably crash GPU unrecoverably and also reliably cause reboot
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: andreas_nordal_4@hotmail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

AMD Ryzen 5 3400G with Radeon Vega Graphics
Linux 5.18.1-1-default
amdocl 11.12-1.27 from
https://download.opensuse.org/repositories/home:/patrikjakobsson:/aomp/openSUSE_Tumbleweed/

* I have one (proprietary) OpenCL workload that reliably causes my computer to
freeze and reboot.
* While investigating this, I have been trying various test programs that each
do a subset of it. A common theme is that after 2-3 runs, with kernel traces
appearing in dmesg, the GPU disappears and never comes back (at least for
OpenCL). The desktop thankfully stays responsive afterwards, but clinfo reports
"Number of devices 0", and I have to reboot manually to fix it.
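The "Number of devices 0" check can be scripted so it is easy to repeat after each test run. This is only a minimal sketch: it assumes clinfo prints lines of the form "Number of devices <N>" as quoted above, and the function names are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_device_counts(clinfo_output: str) -> list[int]:
    """Return every 'Number of devices' value found in clinfo's text output."""
    pattern = re.compile(r"Number of devices\s+(\d+)")
    return [int(m.group(1)) for m in pattern.finditer(clinfo_output)]


def opencl_device_present(clinfo_output: str) -> bool:
    """True if at least one platform reports one or more devices."""
    return any(count > 0 for count in parse_device_counts(clinfo_output))


if __name__ == "__main__":
    if shutil.which("clinfo") is None:
        print("clinfo not found in PATH")
    else:
        try:
            result = subprocess.run(
                ["clinfo"], capture_output=True, text=True, timeout=60
            )
            print("OpenCL device present:", opencl_device_present(result.stdout))
        except subprocess.TimeoutExpired:
            print("clinfo did not finish within 60 seconds")
```

Running it once before and once after each test program shows at which run the device disappears.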

The first run, with the trace in dmesg:

[   96.228040] [drm] Fence fallback timer expired on ring gfx
[  119.268047] [drm] Fence fallback timer expired on ring gfx
[  123.972050] [drm] Fence fallback timer expired on ring gfx
[  124.676047] [drm] Fence fallback timer expired on ring gfx
[  138.014780] ------------[ cut here ]------------
[  138.014784] refcount_t: decrement hit 0; leaking memory.
[  138.014789] WARNING: CPU: 5 PID: 469 at lib/refcount.c:31 refcount_warn_saturate+0xe5/0xf0
[  138.014795] Modules linked in: zram xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc overlay cmac nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs af_packet iscsi_ibft iscsi_boot_sysfs dmi_sysfs joydev r8169 atlantic realtek mdio_devres libphy macsec hid_generic intel_rapl_msr intel_rapl_common usbhid snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep eeepc_wmi asus_wmi snd_pcm kvm battery snd_timer sparse_keymap snd platform_profile irqbypass rfkill wmi_bmof pcspkr efi_pstore k10temp i2c_piix4 soundcore tiny_power_button gpio_amdpt gpio_generic button acpi_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse configfs ip_tables x_tables ext4 mbcache jbd2 amdgpu crct10dif_pclmul
[  138.014837]  crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_dp_helper drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crypto_simd xhci_pci_renesas cryptd drm xhci_hcd nvme sp5100_tco nvme_core cec ccp usbcore rc_core wmi video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
[  138.014855] CPU: 5 PID: 469 Comm: kworker/5:2 Not tainted 5.18.1-1-default #1 openSUSE Tumbleweed 76e6f98ac75e9529b1c9bcf308333727a2e8ebe3
[  138.014858] Hardware name: Komplett Komplett PC/PRIME B450-PLUS, BIOS 2008 12/06/2019
[  138.014859] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  138.015078] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
[  138.015081] Code: 48 c7 c7 98 ab 8b 94 c6 05 53 b6 9f 01 01 e8 f3 7e 52 00 0f 0b c3 cc 48 c7 c7 68 ab 8b 94 c6 05 3d b6 9f 01 01 e8 dc 7e 52 00 <0f> 0b c3 cc 0f 1f 80 00 00 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8
[  138.015082] RSP: 0018:ffff9d34807dfde8 EFLAGS: 00010282
[  138.015084] RAX: 0000000000000000 RBX: ffff8f9929bec000 RCX: 0000000000000027
[  138.015085] RDX: ffff8f9fc0b62528 RSI: 0000000000000001 RDI: ffff8f9fc0b62520
[  138.015086] RBP: ffff8f9929bec000 R08: 0000000000000000 R09: ffff9d34807dfc18
[  138.015087] R10: 0000000000000003 R11: ffff8f9fc07fffe8 R12: 0000000000000001
[  138.015087] R13: ffff8f98c2fb4800 R14: ffff8f98c2fb4b90 R15: ffff8f98c807ec00
[  138.015089] FS:  0000000000000000(0000) GS:ffff8f9fc0b40000(0000) knlGS:0000000000000000
[  138.015090] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  138.015091] CR2: 00007f0e47700000 CR3: 0000000694010000 CR4: 00000000003506e0
[  138.015092] Call Trace:
[  138.015094]  <TASK>
[  138.015096]  put_pasid_state_wait+0x9f/0xb0 [iommu_v2 6923f1fc45e94345ae677d5df84b0cda035fac36]
[  138.015100]  ? mmu_notifier_unregister+0xa9/0xe0
[  138.015103]  amd_iommu_unbind_pasid+0xa5/0xc0 [iommu_v2 6923f1fc45e94345ae677d5df84b0cda035fac36]
[  138.015106]  kfd_iommu_unbind_process+0x4b/0x60 [amdgpu 425139e59bdaa506cde69fb30b88f0810eed06e1]
[  138.015303]  kfd_process_wq_release+0x1e5/0x350 [amdgpu 425139e59bdaa506cde69fb30b88f0810eed06e1]
[  138.015497]  process_one_work+0x208/0x3c0
[  138.015500]  worker_thread+0x4a/0x3b0
[  138.015502]  ? process_one_work+0x3c0/0x3c0
[  138.015503]  kthread+0xda/0x100
[  138.015505]  ? kthread_complete_and_exit+0x20/0x20
[  138.015507]  ret_from_fork+0x22/0x30
[  138.015510]  </TASK>
[  138.015511] ---[ end trace 0000000000000000 ]---

The second run, when the GPU disappeared:

[  157.924040] [drm] Fence fallback timer expired on ring gfx
[  175.524041] [drm] Fence fallback timer expired on ring gfx
[  197.372016] amdgpu: qcm fence wait loop timeout expired
[  197.372021] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  197.372062] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[  197.372071] amdgpu: Failed to suspend process 0x800c

As for the freeze and reboot, I don't have a trace of that (yet). I checked
journalctl, but the log from that part does not seem to have survived the
reboot.
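For reference, kernel messages from the boot before the current one can be requested with `journalctl -k -b -1`. A small sketch of how that lookup could be scripted follows. It assumes the journal is stored persistently, otherwise nothing from the earlier boot is available, and the helper names are hypothetical.

```python
import subprocess


def previous_boot_kernel_log_cmd() -> list[str]:
    """Build the journalctl command for kernel messages of the previous boot."""
    # -k: kernel messages only, -b -1: the boot before the current one.
    return ["journalctl", "-k", "-b", "-1", "--no-pager"]


def read_previous_boot_kernel_log() -> str:
    """Run journalctl and return its output, or an empty string on failure."""
    try:
        result = subprocess.run(
            previous_boot_kernel_log_cmd(),
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    return result.stdout if result.returncode == 0 else ""
```

Even with a persistent journal, the last messages before a hard freeze may never reach the disk, so an empty or truncated result does not rule out a kernel trace.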

