[Bug 1195311] New: Apparent memory leak in radeon (and amdgpu) or ttm
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311

Bug ID: 1195311
Summary: Apparent memory leak in radeon (and amdgpu) or ttm
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: x86-64
OS: openSUSE Tumbleweed
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: aaronpuchert@alice-dsl.net
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 855726
  --> http://bugzilla.opensuse.org/attachment.cgi?id=855726&action=edit
Last output of bcc-tools' memleak after process exit

Kernel: 5.16.2-1-default
CPU: AMD A10-5750M APU (Piledriver, Family 15h)
GPU: Builtin Radeon HD 8650G ("Richland", ARUBA, Northern Islands), used here.
     Discrete Radeon HD 8650M ("Sun Pro", HAINAN, Southern Islands), inactive.

With some rendering applications (such as openSUSE:Factory/xonotic) I'm observing a massive increase in used memory that isn't recovered when the application is closed. I couldn't observe this with other 3D games, but since the memory stays gone after the process has exited, it seems to be necessarily a kernel issue.
In fact, diffing /proc/meminfo before and after gives me no idea where the memory might be:

--- meminfo.before
+++ meminfo.after
@@ -1,44 +1,44 @@
 MemTotal: 7307764 kB
-MemFree: 5424968 kB
+MemFree: 1809372 kB
-MemAvailable: 6022788 kB
+MemAvailable: 2647392 kB
-Buffers: 145620 kB
+Buffers: 147420 kB
-Cached: 654312 kB
+Cached: 891696 kB
 SwapCached: 0 kB
-Active: 371420 kB
+Active: 415308 kB
-Inactive: 1031956 kB
+Inactive: 1261828 kB
 Active(anon): 1072 kB
-Inactive(anon): 625400 kB
+Inactive(anon): 659872 kB
-Active(file): 370348 kB
+Active(file): 414236 kB
-Inactive(file): 406556 kB
+Inactive(file): 601956 kB
 Unevictable: 132 kB
 Mlocked: 132 kB
 SwapTotal: 2097148 kB
 SwapFree: 2097148 kB
-Dirty: 44 kB
+Dirty: 1536 kB
 Writeback: 0 kB
-AnonPages: 592320 kB
+AnonPages: 592880 kB
-Mapped: 310056 kB
+Mapped: 310476 kB
-Shmem: 23028 kB
+Shmem: 22924 kB
-KReclaimable: 81092 kB
+KReclaimable: 82920 kB
-Slab: 150960 kB
+Slab: 154736 kB
-SReclaimable: 81092 kB
+SReclaimable: 82920 kB
-SUnreclaim: 69868 kB
+SUnreclaim: 71816 kB
-KernelStack: 5120 kB
+KernelStack: 5184 kB
-PageTables: 13176 kB
+PageTables: 13220 kB
 NFS_Unstable: 0 kB
 Bounce: 0 kB
 WritebackTmp: 0 kB
 CommitLimit: 5751028 kB
-Committed_AS: 2195056 kB
+Committed_AS: 2190264 kB
 VmallocTotal: 34359738367 kB
-VmallocUsed: 42340 kB
+VmallocUsed: 42356 kB
 VmallocChunk: 0 kB
 Percpu: 2800 kB
 HardwareCorrupted: 0 kB
-AnonHugePages: 249856 kB
+AnonHugePages: 202752 kB
 ShmemHugePages: 0 kB
 ShmemPmdMapped: 0 kB
-FileHugePages: 0 kB
+FileHugePages: 2048 kB
 FilePmdMapped: 0 kB
 CmaTotal: 0 kB
 CmaFree: 0 kB
@@ -48,6 +48,6 @@
 HugePages_Surp: 0
 Hugepagesize: 2048 kB
 Hugetlb: 0 kB
-DirectMap4k: 481020 kB
+DirectMap4k: 3833596 kB
-DirectMap2M: 6023168 kB
+DirectMap2M: 3719168 kB
-DirectMap1G: 1048576 kB
+DirectMap1G: 0 kB

Available memory is down by roughly 3.5G (MemFree by 3615596 kB, MemAvailable by 3375396 kB), but neither in kernel nor in user space is there an increase that might justify it. I also diffed the output of "grep . /proc/[0-9]*/statm" before and after, and the difference in resident set sizes was unremarkable (on the order of a couple hundred pages in total).

So after a reboot I ran /usr/share/bcc/tools/memleak from bcc-tools, and after closing the game (and making sure the process is indeed no longer there) the top 10 "leaks" end with this (the script just traces allocations that haven't been freed, these are not necessarily leaks):

3681759232 bytes in 4055 allocations from stack
    __alloc_pages+0x178 [kernel]
    __alloc_pages+0x178 [kernel]
    ttm_pool_alloc+0x24a [ttm]
    ttm_tt_populate+0x9f [ttm]
    ttm_bo_handle_move_mem+0x152 [ttm]
    ttm_bo_validate+0xc1 [ttm]
    ttm_bo_init_reserved+0x1d1 [ttm]
    ttm_bo_init+0x5a [ttm]
    radeon_bo_create+0x150 [radeon]
    radeon_gem_object_create+0xb0 [radeon]
    radeon_gem_create_ioctl+0x68 [radeon]
    drm_ioctl_kernel+0xb0 [drm]
    drm_ioctl+0x220 [drm]
    radeon_drm_ioctl+0x49 [radeon]
    __x64_sys_ioctl+0x82 [kernel]
    do_syscall_64+0x5c [kernel]
    entry_SYSCALL_64_after_hwframe+0x44 [kernel]

Line info shouldn't be necessary; at least I could easily follow the call stack just from the function names. I'll attach the full top 10 for reference, but the next stack only accounts for ~200M, so it's probably not that important. (Also, it looks to me like it is filling the page cache, which is probably intentionally not freed.)

The title claims that this also affects amdgpu because we tried another machine that has a (single) desktop GPU, an R9 270X (PITCAIRN, also Southern Islands), run with amdgpu via radeon.si_support=0 amdgpu.si_support=1. We observe the same call stack with memleak, except with radeon replaced by amdgpu. (Presumably they just copied that over.) So either the problem is common to both drivers or it is somewhere else in the stack.

--
You are receiving this mail because:
You are the assignee for the bug.
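The before/after comparison of /proc/meminfo in the report can also be scripted. The following is a minimal sketch, not something used in the report: the helper names `parse_meminfo` and `meminfo_delta` are made up for illustration, and the embedded snapshots contain only a few of the fields quoted in the diff. It prints the fields that changed and converts the memleak total to kB so it can be set against the drop in MemFree.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values


def meminfo_delta(before, after):
    """Return {field: after - before} for fields whose value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


# Excerpt of the two snapshots from the report (values in kB).
BEFORE = """\
MemTotal: 7307764 kB
MemFree: 5424968 kB
MemAvailable: 6022788 kB
Cached: 654312 kB
Slab: 150960 kB
"""
AFTER = """\
MemTotal: 7307764 kB
MemFree: 1809372 kB
MemAvailable: 2647392 kB
Cached: 891696 kB
Slab: 154736 kB
"""

delta = meminfo_delta(parse_meminfo(BEFORE), parse_meminfo(AFTER))
for name, change in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{name:<14}{change:+d} kB")

# Outstanding ttm_pool_alloc allocations reported by memleak, converted to kB.
memleak_kb = 3681759232 // 1024
print(f"memleak total {memleak_kb} kB, MemFree drop {-delta['MemFree']} kB")
```

With these numbers the outstanding allocations (3595468 kB) are close to the drop in MemFree (3615596 kB). The two figures come from different runs, so the comparison is only indicative.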
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c1

--- Comment #1 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
Now debugfs has an interesting file: /sys/kernel/debug/dri/0/radeon_gem_info. Obviously there is some fluctuation, but aggregating gives us a clear picture:

$ sed 's/bo\[0x[0-9a-f]*\] *\([0-9]*\)kB *[0-9]*MB *[A-Z]* *pid *[0-9]*/\1/g' radeon_gem_info.before | paste -sd+ - | bc
87628
$ sed 's/bo\[0x[0-9a-f]*\] *\([0-9]*\)kB *[0-9]*MB *[A-Z]* *pid *[0-9]*/\1/g' radeon_gem_info.during | paste -sd+ - | bc
1392096
$ sed 's/bo\[0x[0-9a-f]*\] *\([0-9]*\)kB *[0-9]*MB *[A-Z]* *pid *[0-9]*/\1/g' radeon_gem_info.after | paste -sd+ - | bc
86888

This was a different run, but the leak was there again. It doesn't show up in radeon_gem_info, however, where we're pretty much back to the original state. I can attach the files, but they don't seem so interesting.

Sadly, playing around with this locked up my machine a couple of times (had to reboot via SysRq keys), with the journal holding this nugget:

BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 17577 Comm: cat Not tainted 5.16.2-1-default #1 openSUSE Tumbleweed b40a195b7ff0f3399a616c3290f963c4ad189e84 Hardware name: LENOVO 20255/Lenovo G505s, BIOS 83CN35WW(V2.05) 12/06/2013 RIP: 0010:radeon_debugfs_gem_info_show+0x4d/0xd0 [radeon] Code: 00 4c 89 f7 e8 c4 5c 18 ed 48 8b 5d 00 48 39 eb 74 7a 45 31 ff 49 c7 c5 66 a6 7f c0 48 8b 83 e0 01 00 00 49 c7 c1 66 a6 7f c0 <8b> 40 10 83 f8 02 77 21 8b 04 85 e0 22 7b c0 49 c7 c1 61 a6 7f c0 RSP: 0018:ffffaf75c1ba3cb0 EFLAGS: 00010216 RAX: 0000000000000000 RBX: ffff96e8b22f5400 RCX: 0000000000000001 RDX: 0000000000010000 RSI: ffffffffc07eccfd RDI: ffff96e78e9dd1b0 RBP: ffff96e891575cd8 R08: ffff96e78e9dd1af R09: ffffffffc07fa666 R10: ffffffffffffffff R11: ffff96e78e9dd1af R12: 
ffff96e8b91cae10 R13: ffffffffc07fa666 R14: ffff96e891575cb8 R15: 00000000000003d0 FS: 00007fcf1948d740(0000) GS:ffff96e9a7800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000203ab8000 CR4: 00000000000406f0 Call Trace: <TASK> seq_read_iter+0x11c/0x4b0 ? aa_file_perm+0x11c/0x490 seq_read+0xfd/0x140 full_proxy_read+0x53/0x80 vfs_read+0x95/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x5c/0x80 ? handle_mm_fault+0xb2/0x280 ? do_user_addr_fault+0x1d7/0x690 ? do_syscall_64+0x69/0x80 ? exc_page_fault+0x68/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fcf195ab852 Code: 18 02 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007ffccd004178 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcf195ab852 RDX: 0000000000020000 RSI: 00007fcf19167000 RDI: 0000000000000003 RBP: 00007fcf19167000 R08: 00007fcf19166010 R09: 0000000000000000 R10: 00007fcf1949a4b8 R11: 0000000000000246 R12: 0000000000020000 R13: 0000000000000003 R14: 00007ffccd004e6a R15: 0000000000020000 </TASK> Modules linked in: udp_diag tcp_diag inet_diag af_packet snd_seq snd_seq_device dmi_sysfs msr uvcvideo rtsx_usb_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc dm_crypt essiv authenc trusted asn1_encoder tee ath9k ath9k_common ath9k_hw ath edac_mce_amd kvm_amd ccp snd_hda_codec_conexant snd_hda_codec_generic kvm ledtrig_audio mac80211 snd_hda_codec_hdmi snd_hda_intel libarc4 pktcdvd cfg80211 snd_intel_dspcfg irqbypass snd_intel_sdw_acpi ideapad_laptop snd_hda_codec sparse_keymap platform_profile snd_hda_core wmi rfkill snd_hwdep snd_pcm alx snd_timer efi_pstore pcspkr tiny_power_button joydev ac fan snd i2c_piix4 thermal k10temp mdio soundcore button acpi_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse 
configfs ip_tables x_tables ext4 mbcache jbd2 amdgpu iommu_v2 gpu_sched rtsx_usb_sdmmc mmc_core hid_generic rtsx_usb usbhid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel radeon aesni_intel ohci_pci i2c_algo_bit drm_ttm_helper ttm crypto_simd xhci_pci xhci_pci_renesas drm_kms_helper cryptd wdat_wdt ehci_pci ohci_hcd syscopyarea ehci_hcd serio_raw sysfillrect xhci_hcd sysimgblt fb_sys_fops sp5100_tco cec rc_core sr_mod cdrom drm usbcore battery video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs CR2: 0000000000000010 ---[ end trace c396c07901b3cc6a ]--- ------------[ cut here ]------------ Voluntary context switch within RCU read-side critical section! WARNING: CPU: 0 PID: 17577 at kernel/rcu/tree_plugin.h:316 rcu_note_context_switch+0x56e/0x5d0 Modules linked in: udp_diag tcp_diag inet_diag af_packet snd_seq snd_seq_device dmi_sysfs msr uvcvideo rtsx_usb_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc dm_crypt essiv authenc trusted asn1_encoder tee ath9k ath9k_common ath9k_hw ath edac_mce_amd kvm_amd ccp snd_hda_codec_conexant snd_hda_codec_generic kvm ledtrig_audio mac80211 snd_hda_codec_hdmi snd_hda_intel libarc4 pktcdvd cfg80211 snd_intel_dspcfg irqbypass snd_intel_sdw_acpi ideapad_laptop snd_hda_codec sparse_keymap platform_profile snd_hda_core wmi rfkill snd_hwdep snd_pcm alx snd_timer efi_pstore pcspkr tiny_power_button joydev ac fan snd i2c_piix4 thermal k10temp mdio soundcore button acpi_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse configfs ip_tables x_tables ext4 mbcache jbd2 amdgpu iommu_v2 gpu_sched rtsx_usb_sdmmc mmc_core hid_generic rtsx_usb usbhid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel radeon aesni_intel ohci_pci i2c_algo_bit drm_ttm_helper ttm crypto_simd xhci_pci xhci_pci_renesas drm_kms_helper cryptd wdat_wdt ehci_pci ohci_hcd syscopyarea ehci_hcd serio_raw sysfillrect xhci_hcd sysimgblt fb_sys_fops sp5100_tco cec rc_core sr_mod cdrom drm 
usbcore battery video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs CPU: 0 PID: 17577 Comm: cat Tainted: G D 5.16.2-1-default #1 openSUSE Tumbleweed b40a195b7ff0f3399a616c3290f963c4ad189e84 Hardware name: LENOVO 20255/Lenovo G505s, BIOS 83CN35WW(V2.05) 12/06/2013 RIP: 0010:rcu_note_context_switch+0x56e/0x5d0 Code: 00 48 89 be 40 08 00 00 48 89 86 48 08 00 00 48 89 10 e9 40 fd ff ff 48 c7 c7 40 dd 24 ae c6 05 1a 2c de 01 01 e8 20 f8 8e 00 <0f> 0b e9 db fa ff ff c6 43 15 00 48 8b 73 20 ba 01 00 00 00 48 8b RSP: 0018:ffffaf75c1ba36c8 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffff96e9a7834640 RCX: 0000000000000027 RDX: ffff96e9a7822948 RSI: 0000000000000001 RDI: ffff96e9a7822940 RBP: ffffaf75c1ba3778 R08: 0000000000000000 R09: ffffaf75c1ba3500 R10: ffffaf75c1ba34f8 R11: ffffffffaeb58308 R12: 0000000000000000 R13: ffff96e8e975d100 R14: 0000000000000007 R15: ffff96e8e975d100 FS: 00007fcf1948d740(0000) GS:ffff96e9a7800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000203ab8000 CR4: 00000000000406f0 Call Trace: <TASK> __schedule+0xaf/0x10c0 ? enqueue_task_fair+0x87/0x630 ? enqueue_task+0x4b/0x140 ? _flat_send_IPI_mask+0x21/0x30 schedule+0x4b/0xc0 schedule_timeout+0x115/0x150 wait_for_completion+0x89/0xe0 virt_efi_query_variable_info+0x141/0x150 efi_query_variable_store+0x5b/0x1a0 efivar_entry_set_safe+0xbd/0x210 efi_pstore_write+0x124/0x1a0 [efi_pstore e8887364a0c84df2100f8a2427f7437b9c33b134] ? pstore_dump+0x182/0x340 pstore_dump+0x182/0x340 kmsg_dump+0x46/0x60 oops_end+0x63/0xd0 page_fault_oops+0x158/0x2a0 ? 
search_bpf_extables+0x5f/0x80 exc_page_fault+0x68/0x150 asm_exc_page_fault+0x1e/0x30 RIP: 0010:radeon_debugfs_gem_info_show+0x4d/0xd0 [radeon] Code: 00 4c 89 f7 e8 c4 5c 18 ed 48 8b 5d 00 48 39 eb 74 7a 45 31 ff 49 c7 c5 66 a6 7f c0 48 8b 83 e0 01 00 00 49 c7 c1 66 a6 7f c0 <8b> 40 10 83 f8 02 77 21 8b 04 85 e0 22 7b c0 49 c7 c1 61 a6 7f c0 RSP: 0018:ffffaf75c1ba3cb0 EFLAGS: 00010216 RAX: 0000000000000000 RBX: ffff96e8b22f5400 RCX: 0000000000000001 RDX: 0000000000010000 RSI: ffffffffc07eccfd RDI: ffff96e78e9dd1b0 RBP: ffff96e891575cd8 R08: ffff96e78e9dd1af R09: ffffffffc07fa666 R10: ffffffffffffffff R11: ffff96e78e9dd1af R12: ffff96e8b91cae10 R13: ffffffffc07fa666 R14: ffff96e891575cb8 R15: 00000000000003d0 seq_read_iter+0x11c/0x4b0 ? aa_file_perm+0x11c/0x490 seq_read+0xfd/0x140 full_proxy_read+0x53/0x80 vfs_read+0x95/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x5c/0x80 ? handle_mm_fault+0xb2/0x280 ? do_user_addr_fault+0x1d7/0x690 ? do_syscall_64+0x69/0x80 ? exc_page_fault+0x68/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fcf195ab852 Code: 18 02 00 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007ffccd004178 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcf195ab852 RDX: 0000000000020000 RSI: 00007fcf19167000 RDI: 0000000000000003 RBP: 00007fcf19167000 R08: 00007fcf19166010 R09: 0000000000000000 R10: 00007fcf1949a4b8 R11: 0000000000000246 R12: 0000000000020000 R13: 0000000000000003 R14: 00007ffccd004e6a R15: 0000000000020000 </TASK> ---[ end trace c396c07901b3cc6b ]--- Perhaps it's a different issue, but if some data structures are corrupted, it might be a symptom of the same cause. The "context switch within RCU read-side critical section" is probably just caused by the original dump and not a "real" issue. 
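The sed/paste/bc pipeline earlier in this comment sums the per-object sizes in radeon_gem_info. A rough Python equivalent is sketched below. The line layout is inferred only from the sed pattern (`bo[0x...] <size>kB <size>MB <DOMAIN> pid <pid>`), the two sample lines are invented placeholders rather than real driver output, and `total_gem_kb` is a hypothetical helper name.

```python
import re

# Pattern mirroring the sed expression: capture the size in kB of each buffer object.
BO_LINE = re.compile(
    r"bo\[0x[0-9a-f]*\]\s*(\d+)kB\s*\d+MB\s*[A-Z]*\s*pid\s*\d+"
)


def total_gem_kb(text):
    """Sum the kB sizes of all buffer-object lines found in the text."""
    return sum(int(match.group(1)) for match in BO_LINE.finditer(text))


# Invented sample input; real input would be a saved copy of radeon_gem_info.
SAMPLE = """\
bo[0x00000000]     4096kB        4MB   VRAM pid     1234
bo[0x00000001]     1024kB        1MB    GTT pid     1234
"""

print(total_gem_kb(SAMPLE))
```

Unlike the sed version, lines that do not match the pattern are simply ignored instead of being passed through to the sum.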
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c2

--- Comment #2 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
Let's have a brief look at this:

static int radeon_debugfs_gem_info_show(struct seq_file *m, void *unused)
{
    struct radeon_device *rdev = (struct radeon_device *)m->private;
    struct radeon_bo *rbo;
    unsigned i = 0;

    mutex_lock(&rdev->gem.mutex);
    list_for_each_entry(rbo, &rdev->gem.objects, list) {
        unsigned domain;

        domain = radeon_mem_type_to_domain(rbo->tbo.resource->mem_type);
        // ...
    }
    mutex_unlock(&rdev->gem.mutex);
    return 0;
}

$ zstd -d /usr/lib/modules/5.16.2-1-default/kernel/drivers/gpu/drm/radeon/radeon.ko.zst -o radeon.ko
$ objdump --no-show-raw-insn --disassemble=radeon_debugfs_gem_info_show radeon.ko

But I'll annotate the code a bit and guess call targets.

000000000002c140 <radeon_debugfs_gem_info_show>:
   2c140: call 2c145 <radeon_debugfs_gem_info_show+0x5> ; ???
   2c145: push %r15
   2c147: push %r14
   2c149: push %r13
   2c14b: push %r12
   2c14d: mov %rdi,%r12
   2c150: push %rbp
   2c151: push %rbx
   2c152: mov 0x70(%rdi),%rbp ; rdev = m->private (?, 5.16 has 0x60)
   2c156: lea 0x1cb8(%rbp),%r14 ; &rdev->gem.mutex
   2c15d: add $0x1cd8,%rbp ; &rdev->gem.objects
   2c164: mov %r14,%rdi
   2c167: call 2c16c <radeon_debugfs_gem_info_show+0x2c> ; mutex_lock
   2c16c: mov 0x0(%rbp),%rbx ; rdev->gem.objects
   2c170: cmp %rbp,%rbx
   2c173: je 2c1ef <radeon_debugfs_gem_info_show+0xaf>
   2c175: xor %r15d,%r15d
   2c178: mov $0x0,%r13
   2c17f: mov 0x1e0(%rbx),%rax ; rbo->tbo.resource (?, 5.16 has 0x1d8)
   2c186: mov $0x0,%r9
   2c18d: ====> mov 0x10(%rax),%eax ; rbo->tbo.resource->mem_type
   [...]

Sometimes the offsets didn't quite match my 5.16 tree, hence the question marks. So it appears that %rax holds rbo->tbo.resource = 0. Is this allowed, or another sign of corruption?
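The numbers in the oops and in the disassembly can be cross-checked with a little arithmetic. This is only a sanity check using values quoted in this thread (the RIP offset +0x4d, the function's start address 0x2c140 in radeon.ko, RAX = 0, and the displacement 0x10 of the marked instruction); the variable names are chosen here for illustration.

```python
# Values copied from the oops in comment #1 and the objdump output in comment #2.
func_start = 0x2C140   # address of radeon_debugfs_gem_info_show in radeon.ko
rip_offset = 0x4D      # RIP: radeon_debugfs_gem_info_show+0x4d/0xd0
rax = 0x0              # RAX at the time of the fault
displacement = 0x10    # mov 0x10(%rax),%eax

faulting_insn = func_start + rip_offset
faulting_addr = rax + displacement

print(hex(faulting_insn))  # address of the instruction that faulted
print(hex(faulting_addr))  # address the instruction tried to read
```

The first value is 0x2c18d, the instruction marked with the arrow, and the second is 0x10, which matches CR2 in the oops. Both are consistent with reading mem_type through a NULL rbo->tbo.resource.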
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c3

--- Comment #3 from Patrik Jakobsson <patrik.jakobsson@suse.com> ---
Which kernel version are you running? I tried to reproduce this with a Pitcairn (AMDGPU driver) on a 5.14 kernel but cannot see a leak. I'll try a newer kernel to see if things are different.
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c4

--- Comment #4 from Patrik Jakobsson <patrik.jakobsson@suse.com> ---
Still no leak on a 5.16 based kernel. I've tried running xonotic for a few minutes. Is there a better way to trigger the leak/bug?
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c5

--- Comment #5 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
(In reply to Patrik Jakobsson from comment #3)
> Which kernel version are you running?

The report was with 5.16.2-1-default, now 5.16.8-1-default. I can still observe it.

(In reply to Patrik Jakobsson from comment #4)
> Still no leak on a 5.16 based kernel. I've tried running xonotic for a few
> minutes. Is there a better way to trigger the leak/bug?

There probably is, but I haven't seen this anywhere else, including other 3D games. That is a bit strange, because as far as I understand many applications create GEM objects and BOs (buffer objects?).

For what it's worth, we can't reproduce it anymore on the bigger machine with the Pitcairn card (on amdgpu), but the notebook with the (integrated) Aruba card (on radeon) is still showing the leak. (Same kernel version.) I can try to collect traces with e.g. bcc, but I don't really know what to look for. If you have an idea what to collect, I should be able to carry that out.
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c6

--- Comment #6 from Aaron Puchert <aaronpuchert@alice-dsl.net> ---
Interestingly, I can't reproduce it with DRI_PRIME=1 on the same notebook. On the kernel side it's the same driver (radeon), but in userspace it uses radeonsi_dri.so instead of r600_dri.so.
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311 http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c7 --- Comment #7 from Aaron Puchert <aaronpuchert@alice-dsl.net> --- Interesting, just got this briefly after login in Xorg: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1924 at drivers/gpu/drm/ttm/ttm_bo.c:411 ttm_bo_release+0x358/0x380 [ttm] Modules linked in: af_packet snd_seq snd_seq_device dmi_sysfs msr uvcvideo rtsx_usb_ms memstick videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc dm_crypt essiv authenc trusted asn1_encoder tee ath9k ath9k_common ath9k_hw ath mac80211 libarc4 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic ledtrig_audio snd_hda_intel cfg80211 ideapad_laptop sparse_keymap platform_profile rfkill wmi snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core i2c_piix4 pcspkr joydev ac alx mdio edac_mce_amd kvm_amd ccp pktcdvd kvm irqbypass efi_pstore thermal fan snd_hwdep snd_pcm k10temp tiny_power_button button snd_timer snd soundcore acpi_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse configfs ip_tables x_tables ext4 mbcache jbd2 amdgpu iommu_v2 gpu_sched hid_generic usbhid rtsx_usb_sdmmc mmc_core rtsx_usb crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd wdat_wdt ohci_pci radeon serio_raw xhci_pci xhci_pci_renesas sr_mod cdrom xhci_hcd sp5100_tco ehci_pci ohci_hcd ehci_hcd usbcore drm_ttm_helper ttm battery video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs CPU: 1 PID: 1924 Comm: Xorg.bin Not tainted 5.16.8-1-default #1 openSUSE Tumbleweed 257f8f36371552cd38032922fd021edb6811ecdc Hardware name: LENOVO 20255/Lenovo G505s, BIOS 83CN35WW(V2.05) 12/06/2013 RIP: 0010:ttm_bo_release+0x358/0x380 [ttm] Code: 00 e8 fc 26 0b ec 48 8b 43 e8 eb a8 be 03 00 00 00 e8 8c 92 e1 eb e9 9b fd ff ff e8 d2 04 0b ec e9 91 fd ff ff 48 89 e8 eb 8a <0f> 0b e9 db fc ff ff e8 bc 04 0b ec e9 ca fe ff ff be 03 00 00 00 RSP: 
0018:ffffbe5a8016bd60 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff9b9b4b9079d8 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff9b9b618188d0 RDI: ffff9b9b4b9079d8 RBP: ffff9b9b507b46f0 R08: ffff9b9b4b9079d8 R09: 0000000000000064 R10: 0000000000000010 R11: ffff9b9b80fe0490 R12: ffff9b9b4b907878 R13: ffff9b9c09215060 R14: ffff9b9b6d3eef00 R15: 0000000000000000 FS: 00007f7dc09bed80(0000) GS:ffff9b9c67880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005612f00c8610 CR3: 0000000131830000 CR4: 00000000000406e0 Call Trace: <TASK> ? __inode_wait_for_writeback+0x7e/0xe0 ? fsnotify_grab_connector+0x49/0x80 radeon_bo_unref+0x1a/0x30 [radeon d8b9f51f0af1bd1c0421cb4011e3d0de89217c67] radeon_gem_object_free+0x30/0x50 [radeon d8b9f51f0af1bd1c0421cb4011e3d0de89217c67] drm_gem_dmabuf_release+0x36/0x50 dma_buf_release+0x3a/0x90 __dentry_kill+0xf8/0x170 __fput+0xe3/0x250 task_work_run+0x5c/0x90 exit_to_user_mode_prepare+0x224/0x230 syscall_exit_to_user_mode+0x18/0x40 do_syscall_64+0x69/0x80 ? exit_to_user_mode_prepare+0x19b/0x230 ? syscall_exit_to_user_mode+0x18/0x40 ? do_syscall_64+0x69/0x80 ? syscall_exit_to_user_mode+0x18/0x40 ? do_syscall_64+0x69/0x80 ? do_syscall_64+0x69/0x80 ? syscall_exit_to_user_mode+0x18/0x40 ? 
do_syscall_64+0x69/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f7dc025988b Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 75 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe13cfffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: 0000000000000000 RBX: 00007ffe13d00018 RCX: 00007f7dc025988b RDX: 00007ffe13d00018 RSI: 0000000040086409 RDI: 0000000000000010 RBP: 0000000040086409 R08: 0000000000b8a58b R09: 0000000000000000 R10: 00007f7dbf930220 R11: 0000000000000246 R12: 00005612ef5e0d78 R13: 0000000000000010 R14: 0000000000000438 R15: 00005612ef555fa0 </TASK> ---[ end trace 50208e5f30f02cc1 ]---

The warning at ttm_bo.c:411 should be WARN_ON_ONCE(bo->pin_count);

Going through my journal, I find it 5 more times, the earliest with 5.16.2-1-default, though in that session I was playing around with /sys/kernel/debug/dri/0/radeon_gem_info, so maybe that's not so interesting. The other 4 are from my latest 4 boots, all shortly after login, just after kscreen_backend_launcher fiddled around with xrandr. It happens that I installed an external monitor four boots ago, so I guess that's related. Still, it's interesting that we get a warning in radeon_gem_object_free, and because it warns only once, maybe there is more going on after that.
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311 http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c8 --- Comment #8 from Aaron Puchert <aaronpuchert@alice-dsl.net> --- Ok, the earliest occurrence of that (with a different line number though) is June 12 of last year with 5.12.9-1-default. ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1748 at drivers/gpu/drm/ttm/ttm_bo.c:518 ttm_bo_release+0x2bf/0x310 [ttm] Modules linked in: udp_diag tcp_diag inet_diag af_packet snd_seq snd_seq_device dmi_sysfs msr uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc rtsx_usb_ms memstick dm_crypt essiv authenc trusted ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 edac_mce_amd pktcdvd kvm_amd snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg ccp ideapad_laptop platform_profile alx snd_intel_sdw_acpi sparse_keymap kvm snd_hda_codec rfkill wmi libarc4 snd_hda_core irqbypass snd_hwdep snd_pcm joydev efi_pstore snd_timer snd wdat_wdt mdio pcspkr soundcore k10temp sp5100_tco thermal tiny_power_button i2c_piix4 fan ac acpi_cpufreq button nls_iso8859_1 nls_cp437 vfat fat fuse configfs amdgpu iommu_v2 gpu_sched rtsx_usb_sdmmc mmc_core rtsx_usb radeon i2c_algo_bit drm_ttm_helper ttm crct10dif_pclmul crc32_pclmul drm_kms_helper crc32c_intel ghash_clmulni_intel xhci_pci xhci_pci_renesas xhci_hcd aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec ohci_pci rc_core ohci_hcd crypto_simd drm cryptd ehci_pci ehci_hcd usbcore serio_raw sr_mod cdrom battery video sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs CPU: 2 PID: 1748 Comm: kwin_x11:rcs0 Not tainted 5.12.9-1-default #1 openSUSE Tumbleweed Hardware name: LENOVO 20255/Lenovo G505s, BIOS 83CN35WW(V2.05) 12/06/2013 RIP: 0010:ttm_bo_release+0x2bf/0x310 [ttm] Code: e9 a1 fd ff ff e8 e1 82 ae e4 e9 d2 fd ff ff 49 8b 7e 90 b9 4c 1d 00 00 31 d2 be 01 00 00 00 e8 07 a4 ae e4 49 8b 46 e0 eb 9e <0f> 0b 41 
c7 86 94 00 00 00 00 00 00 00 49 8d 76 08 31 d2 4c 89 ef RSP: 0018:ffffbdda020cfe30 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 000000000000007d RDX: 0000000000000001 RSI: ffff9b1fc0055800 RDI: ffffffffc0684aa8 RBP: ffff9b1fc75e06e8 R08: 0000000000000000 R09: ffff9b1fdc7a5b38 R10: ffff9b1fc340c908 R11: ffff9b2091c59110 R12: ffffffffc0684aa8 R13: ffff9b1f020abc78 R14: ffff9b1f020abde0 R15: ffff9b203143c880 FS: 00007f0ad4aff640(0000) GS:ffff9b20e7900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9934008000 CR3: 0000000132334000 CR4: 00000000000406e0 Call Trace: radeon_bo_unref+0x1a/0x30 [radeon] radeon_gem_object_free+0x30/0x50 [radeon] drm_gem_dmabuf_release+0x36/0x50 [drm] dma_buf_release+0x3a/0x80 __dentry_kill+0xfa/0x170 __fput+0xe3/0x240 task_work_run+0x65/0xa0 exit_to_user_mode_prepare+0x168/0x170 syscall_exit_to_user_mode+0x18/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0adc2ef00b Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 be 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f0ad4afeab8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: 0000000000000000 RBX: 00007f0ad4afeb08 RCX: 00007f0adc2ef00b RDX: 00007f0ad4afeb08 RSI: 0000000040086409 RDI: 000000000000000a RBP: 0000000040086409 R08: 0000000000000000 R09: 00000000ffffffff R10: 0000000000000000 R11: 0000000000000246 R12: 00005573c297ea28 R13: 000000000000000a R14: 00007f0aa33d5190 R15: 00005573c297eb80 ---[ end trace 025a972c0b4b1fe6 ]---

It looks a bit different, but we're warning on the same thing [1]:

518:    if (WARN_ON_ONCE(bo->pin_count)) {
519:        bo->pin_count = 0;
520:        ttm_bo_move_to_lru_tail(bo, &bo->mem, NULL);
521:    }

It's in kwin_x11, but that might be a coincidence.
Again it appears immediately after kscreen_backend_launcher fiddles around with xrandr, this time not on login but after disconnecting an external monitor. (The earlier connection that extends the screen shows no warning.)

Same on July 25 with 5.13.2-1-default, here it's drivers/gpu/drm/ttm/ttm_bo.c:437 [2] in kwin_x11, again after disconnecting an external monitor. And on September 8 with 5.13.13-1-default, same line number, same situation.

But it might be unrelated: the leak in Xonotic is observable entirely without an external monitor.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/driver...
[2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/driver...
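Collecting these occurrences from the journal by hand is tedious. A small sketch for pulling the source location out of WARNING header lines follows. The two sample lines are the headers quoted in comments #7 and #8, while the regular expression and the helper name `warning_sites` are assumptions made for illustration and may not cover every header variant the kernel prints.

```python
import re
from collections import defaultdict

# Matches headers such as:
# WARNING: CPU: 1 PID: 1924 at drivers/gpu/drm/ttm/ttm_bo.c:411 ttm_bo_release+0x358/0x380 [ttm]
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def warning_sites(log_text):
    """Map each warning function to the sorted list of (file, line) sites seen."""
    sites = defaultdict(set)
    for match in WARNING_RE.finditer(log_text):
        sites[match["func"]].add((match["file"], int(match["line"])))
    return {func: sorted(locations) for func, locations in sites.items()}


LOG = """\
WARNING: CPU: 1 PID: 1924 at drivers/gpu/drm/ttm/ttm_bo.c:411 ttm_bo_release+0x358/0x380 [ttm]
WARNING: CPU: 2 PID: 1748 at drivers/gpu/drm/ttm/ttm_bo.c:518 ttm_bo_release+0x2bf/0x310 [ttm]
"""

for func, locations in warning_sites(LOG).items():
    for path, line in locations:
        print(f"{func}: {path}:{line}")
```

Grouping by function rather than by line number keeps occurrences together even though the line of the WARN_ON_ONCE moves between kernel versions.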
http://bugzilla.opensuse.org/show_bug.cgi?id=1195311#c9

--- Comment #9 from Patrik Jakobsson <patrik.jakobsson@suse.com> ---
The warning seems to come from dma_buf_release(), and similar issues are already reported upstream in [1] and [2]. It looks like the leak happens because the refcounting of DMABUF-allocated BOs is incorrect, and you need a multi-GPU system to observe the leak. I'll try replicating this on a multi-GPU system.

AMD is more likely to be able to fix the problem. Can you please also submit your issue to https://gitlab.freedesktop.org/drm/amd/-/issues

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/1902
[2] https://gitlab.freedesktop.org/drm/amd/-/issues/1894