[Bug 1200380] New: [amdgpu] OpenCL workload can reliably crash GPU unrecoverably and also reliably cause reboot
http://bugzilla.opensuse.org/show_bug.cgi?id=1200380

Bug ID: 1200380
Summary: [amdgpu] OpenCL workload can reliably crash GPU unrecoverably and also reliably cause reboot
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: andreas_nordal_4@hotmail.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

AMD Ryzen 5 3400G with Radeon Vega Graphics
Linux 5.18.1-1-default
amdocl 11.12-1.27 from https://download.opensuse.org/repositories/home:/patrikjakobsson:/aomp/openS...

* I have one (proprietary) OpenCL workload that reliably causes my computer to freeze and reboot.
* In investigating this, I'm trying out various test programs that do a subset of it. A common theme is that after 2-3 runs, with kernel traces appearing in dmesg, the GPU disappears and never comes back (at least the OpenCL part): The desktop is responsive and fine afterwards, thankfully, but clinfo reports "Number of devices 0", and I have to manually reboot to fix it.

The first run, with the trace in dmesg:

[ 96.228040] [drm] Fence fallback timer expired on ring gfx
[ 119.268047] [drm] Fence fallback timer expired on ring gfx
[ 123.972050] [drm] Fence fallback timer expired on ring gfx
[ 124.676047] [drm] Fence fallback timer expired on ring gfx
[ 138.014780] ------------[ cut here ]------------
[ 138.014784] refcount_t: decrement hit 0; leaking memory.
[ 138.014789] WARNING: CPU: 5 PID: 469 at lib/refcount.c:31 refcount_warn_saturate+0xe5/0xf0
[ 138.014795] Modules linked in: zram xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc overlay cmac nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs af_packet iscsi_ibft iscsi_boot_sysfs dmi_sysfs joydev r8169 atlantic realtek mdio_devres libphy macsec hid_generic intel_rapl_msr intel_rapl_common usbhid snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep eeepc_wmi asus_wmi snd_pcm kvm battery snd_timer sparse_keymap snd platform_profile irqbypass rfkill wmi_bmof pcspkr efi_pstore k10temp i2c_piix4 soundcore tiny_power_button gpio_amdpt gpio_generic button acpi_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse configfs ip_tables x_tables ext4 mbcache jbd2 amdgpu crct10dif_pclmul
[ 138.014837] crc32_pclmul crc32c_intel ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_dp_helper drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci crypto_simd xhci_pci_renesas cryptd drm xhci_hcd nvme sp5100_tco nvme_core cec ccp usbcore rc_core wmi video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
[ 138.014855] CPU: 5 PID: 469 Comm: kworker/5:2 Not tainted 5.18.1-1-default #1 openSUSE Tumbleweed 76e6f98ac75e9529b1c9bcf308333727a2e8ebe3
[ 138.014858] Hardware name: Komplett Komplett PC/PRIME B450-PLUS, BIOS 2008 12/06/2019
[ 138.014859] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 138.015078] RIP: 0010:refcount_warn_saturate+0xe5/0xf0
[ 138.015081] Code: 48 c7 c7 98 ab 8b 94 c6 05 53 b6 9f 01 01 e8 f3 7e 52 00 0f 0b c3 cc 48 c7 c7 68 ab 8b 94 c6 05 3d b6 9f 01 01 e8 dc 7e 52 00 <0f> 0b c3 cc 0f 1f 80 00 00 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8
[ 138.015082] RSP: 0018:ffff9d34807dfde8 EFLAGS: 00010282
[ 138.015084] RAX: 0000000000000000 RBX: ffff8f9929bec000 RCX: 0000000000000027
[ 138.015085] RDX: ffff8f9fc0b62528 RSI: 0000000000000001 RDI: ffff8f9fc0b62520
[ 138.015086] RBP: ffff8f9929bec000 R08: 0000000000000000 R09: ffff9d34807dfc18
[ 138.015087] R10: 0000000000000003 R11: ffff8f9fc07fffe8 R12: 0000000000000001
[ 138.015087] R13: ffff8f98c2fb4800 R14: ffff8f98c2fb4b90 R15: ffff8f98c807ec00
[ 138.015089] FS: 0000000000000000(0000) GS:ffff8f9fc0b40000(0000) knlGS:0000000000000000
[ 138.015090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 138.015091] CR2: 00007f0e47700000 CR3: 0000000694010000 CR4: 00000000003506e0
[ 138.015092] Call Trace:
[ 138.015094] <TASK>
[ 138.015096] put_pasid_state_wait+0x9f/0xb0 [iommu_v2 6923f1fc45e94345ae677d5df84b0cda035fac36]
[ 138.015100] ? mmu_notifier_unregister+0xa9/0xe0
[ 138.015103] amd_iommu_unbind_pasid+0xa5/0xc0 [iommu_v2 6923f1fc45e94345ae677d5df84b0cda035fac36]
[ 138.015106] kfd_iommu_unbind_process+0x4b/0x60 [amdgpu 425139e59bdaa506cde69fb30b88f0810eed06e1]
[ 138.015303] kfd_process_wq_release+0x1e5/0x350 [amdgpu 425139e59bdaa506cde69fb30b88f0810eed06e1]
[ 138.015497] process_one_work+0x208/0x3c0
[ 138.015500] worker_thread+0x4a/0x3b0
[ 138.015502] ? process_one_work+0x3c0/0x3c0
[ 138.015503] kthread+0xda/0x100
[ 138.015505] ? kthread_complete_and_exit+0x20/0x20
[ 138.015507] ret_from_fork+0x22/0x30
[ 138.015510] </TASK>
[ 138.015511] ---[ end trace 0000000000000000 ]---

The second run, when the GPU disappeared:

[ 157.924040] [drm] Fence fallback timer expired on ring gfx
[ 175.524041] [drm] Fence fallback timer expired on ring gfx
[ 197.372016] amdgpu: qcm fence wait loop timeout expired
[ 197.372021] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 197.372062] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[ 197.372071] amdgpu: Failed to suspend process 0x800c

As for the freeze and reboot, I don't have a trace of that (yet). I checked journalctl, but it seems that part didn't survive.

--
You are receiving this mail because:
You are the assignee for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1200380#c1
Patrik Jakobsson
http://bugzilla.opensuse.org/show_bug.cgi?id=1200380#c2
Patrik Jakobsson