[Bug 1204380] New: Swap usage going up until OOM killer kicks in
https://bugzilla.suse.com/show_bug.cgi?id=1204380

            Bug ID: 1204380
           Summary: Swap usage going up until OOM killer kicks in
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.4
          Hardware: x86-64
                OS: openSUSE Leap 15.4
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: jgross@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

On my laptop running openSUSE Leap 15.4 swap usage keeps growing, until after
several days processes are killed by the OOM killer. Swap usage isn't going
down significantly even after stopping most user processes. After a reboot the
same pattern shows up: swap usage is going up again.

After 3 days of uptime, I'm seeing 2.5 GB of swap used:

# free
              total        used        free      shared  buff/cache   available
Mem:       32476508    28181440      749956    22399648    26361408     4295068
Swap:      33554428     2566144    30988284

# cat /proc/meminfo
MemTotal:       32476508 kB
MemFree:          764712 kB
MemAvailable:    4309652 kB
Buffers:             748 kB
Cached:         25370992 kB
SwapCached:        10000 kB
Active:          3887988 kB
Inactive:       25143556 kB
Active(anon):    1726396 kB
Inactive(anon): 24281868 kB
Active(file):    2161592 kB
Inactive(file):   861688 kB
Unevictable:      778528 kB
Mlocked:           13880 kB
SwapTotal:      33554428 kB
SwapFree:       30988284 kB
Dirty:              6120 kB
Writeback:             0 kB
AnonPages:       4375308 kB
Mapped:           536888 kB
Shmem:          22399772 kB
KReclaimable:     989620 kB
Slab:            1231732 kB
SReclaimable:     989620 kB
SUnreclaim:       242112 kB
KernelStack:       22832 kB
PageTables:        99672 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    49792680 kB
Committed_AS:   36645776 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       67064 kB
VmallocChunk:          0 kB
Percpu:             7296 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1413120 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1328552 kB
DirectMap2M:    30789632 kB
DirectMap1G:     1048576 kB

# uname -a
Linux jglap 5.14.21-150400.24.21-default #1 SMP PREEMPT_DYNAMIC Wed Sep 7
06:51:18 UTC 2022 (974d0aa) x86_64 x86_64 x86_64 GNU/Linux

# df -k | grep tmpfs
devtmpfs            4096      0      4096   0% /dev
tmpfs           16238252  11580  16226672   1% /dev/shm
tmpfs            6495304  35264   6460040   1% /run
tmpfs               4096      0      4096   0% /sys/fs/cgroup
tmpfs            3247648    140   3247508   1% /run/user/1000

# ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 28         gross      600        393216     2          dest
0x00000000 31         gross      600        524288     2          dest
0x00000000 32         gross      600        393216     2          dest
0x00000000 35         gross      600        524288     2          dest
0x00000000 38         gross      600        524288     2          dest
0x00000000 41         gross      600        524288     2          dest

Summing up the respective values of all processes from
/proc/<pid>/smaps_rollup shows much lower swap usage:

Swap:      116900 kB
Pss_Shmem:  51861 kB
SwapPss:    96114 kB
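For reference, the summation was done roughly like this, as a minimal sketch
(values in smaps_rollup are in kB; reading all processes needs root, and
short-lived processes may vanish between the glob and the read, hence the
error suppression):

# sum the Swap lines across all processes' smaps_rollup files
# (substitute Pss_Shmem: or SwapPss: for the other totals)
for f in /proc/[0-9]*/smaps_rollup; do cat "$f" 2>/dev/null; done |
    awk '/^Swap:/ { sum += $2 } END { print sum " kB" }'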
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c1

Vlastimil Babka <vbabka@suse.com> changed:

           What    |Removed                  |Added
----------------------------------------------------------------------------
                 CC|                         |mhocko@suse.com, vbabka@suse.com
           Assignee|kernel-bugs@opensuse.org |vbabka@suse.com

--- Comment #1 from Vlastimil Babka <vbabka@suse.com> ---
Was there a previous Leap 15.4 kernel version where you could say this wasn't
happening?

It looks like a shmem memory leak without visible users. I'd recommend booting
with page_owner=on, if you can tolerate some associated memory/CPU overhead.
Once Shmem becomes prominent enough (which can be before swap starts to be
used), try capturing the largest memory users with the help of
tools/vm/page_owner_sort.c (details in Documentation/mm/page_owner.rst), and
hopefully that will tell us something.
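For reference, a minimal sketch of the collection workflow described above
(file names are arbitrary; debugfs must be mounted, and the sorter is built
from the kernel source tree):

# 1. boot with page_owner=on added to the kernel command line, then:
gcc -o page_owner_sort tools/vm/page_owner_sort.c
# 2. dump the currently recorded allocations and sort them
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt page_owner_sorted.txt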
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c2

--- Comment #2 from Jürgen Groß <jgross@suse.com> ---
(In reply to Vlastimil Babka from comment #1)
> Was there a previous Leap 15.4 kernel version where you could say this
> wasn't happening?
Sorry, no. I updated the system from 15.3 just 2 weeks ago.
> It looks like a shmem memory leak without visible users.
> I'd recommend booting with page_owner=on, if you can tolerate some
> associated memory/CPU overhead. Once Shmem becomes prominent enough (which
> can be before swap starts to be used), try capturing the largest memory
> users with the help of tools/vm/page_owner_sort.c (details in
> Documentation/mm/page_owner.rst), and hopefully that will tell us something.
Okay, will try that.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c3

--- Comment #3 from Jürgen Groß <jgross@suse.com> ---
I have collected the allocation data from /sys/kernel/debug/page_owner now.
Before doing that, I stopped the processes with large memory sizes (firefox,
thunderbird, ...), in order to get their memory freed. There were still more
than 10 GB of shared memory allocated in the system when gathering the data,
while summing up the known users of shared memory (processes, tmpfs, IPC)
added up to less than 100 MB.

Interpreting the allocation data is not really easy, as the resulting output
is a 5.6 GB file. And using page_owner_sort doesn't help me a lot, probably
because I don't know the proper parameters to extract the needed data.

Am I right that entries showing a timestamp for "free", i.e. not containing
"free_ts 0 ns", relate to memory having been freed again? I have my doubts,
as summing up the memory sizes of those entries results in only about 1 GB of
memory being used, which clearly contradicts the shared memory size shown.

Please educate me on how to use the gathered data, or tell me where to put
the file for your analysis.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c4

--- Comment #4 from Vlastimil Babka <vbabka@suse.com> ---
(In reply to Jürgen Groß from comment #3)
> Interpreting the allocation data is not really easy, as the resulting
> output is a 5.6 GB file. And using page_owner_sort doesn't help me a lot,
> probably because I don't know the proper parameters to extract the needed
> data.
The default of page_owner_sort should group the same kinds of allocations,
tell you how many there were as the "X times" count, and also sort by X, so
the most prominent ones should come first. So attaching, say, the first 10k
lines of the sorted output should hopefully be enough to find the culprit.
> Am I right that entries showing a timestamp for "free", i.e. not containing
> "free_ts 0 ns", relate to memory having been freed again? I have my doubts,
> as summing up the memory sizes of those entries results in only about 1 GB
> of memory being used, which clearly contradicts the shared memory size
> shown.
It's the timestamp of the last freeing, which means the page could have been
allocated again while the old timestamp and info stays. You'd basically need
to compare whether the allocation ts is newer. The "-f" parameter does that
and should count only pages that are currently allocated, not freed. Note
that specifying just '-f' seems to drop the otherwise implied default '-t',
so you have to pass it too, see below.
> Please educate me on how to use the gathered data, or tell me where to put
> the file for your analysis.
I'd run:

./page_owner_sort -tf page_owner_full.txt page_owner_sorted.txt

and attach the first 10k lines of page_owner_sorted.txt. Thanks.
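For the record, the ts/free_ts comparison described above can also be checked
by hand. A sketch in gawk (assuming the blank-line-separated record format of
the page_owner dump; match() with a third argument is a gawk extension):

gawk -v RS= '
    # keep only records whose allocation ts is newer than their free_ts,
    # i.e. pages that were allocated again after the recorded freeing
    match($0, /ts ([0-9]+) ns, free_ts ([0-9]+) ns/, m) && m[1] + 0 > m[2] + 0 {
        print $0 "\n"
    }
' page_owner_full.txt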
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c5

--- Comment #5 from Jürgen Groß <jgross@suse.com> ---
Created attachment 862336
  --> https://bugzilla.suse.com/attachment.cgi?id=862336&action=edit
first 10000 lines of page_owner_sorted.txt

Collected page owner data related to comment #3, sorted via "page_owner_sort
-tf" and clipped via "head -10000".

I don't see any valuable data in this, but maybe I'm overlooking something.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c6

--- Comment #6 from Vlastimil Babka <vbabka@suse.com> ---
Yeah, that's really weird. The sorter might be buggy, though; I notice
there's a very recent mainline commit 57eb60c04d2c7b0de91eac2bc5d0331f8fe72fd7
fixing the -f option. Does it look more useful if you build the sorter from
the latest Linus master, or omit the -f on the older version?
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c7

--- Comment #7 from Jürgen Groß <jgross@suse.com> ---
(In reply to Vlastimil Babka from comment #6)
> Yeah, that's really weird. The sorter might be buggy, though; I notice
> there's a very recent mainline commit
> 57eb60c04d2c7b0de91eac2bc5d0331f8fe72fd7 fixing the -f option. Does it look
> more useful if you build the sorter from the latest Linus master, or omit
> the -f on the older version?
I DID build it from Linus master (6.1-rc1).
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c8

--- Comment #8 from Vlastimil Babka <vbabka@suse.com> ---
Does the full 5.7GB file get any smaller with compression? Could you upload
it to wotan or somewhere?
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c9

--- Comment #9 from Jürgen Groß <jgross@suse.com> ---
(In reply to Vlastimil Babka from comment #8)
> Does the full 5.7GB file get any smaller with compression? Could you upload
> it to wotan or somewhere?
Slightly (83 MB now) :-)

You can find it at /home/jgross/page_owner_sorted.txt.gz
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c10

--- Comment #10 from Jürgen Groß <jgross@suse.com> ---
(In reply to Jürgen Groß from comment #9)
> (In reply to Vlastimil Babka from comment #8)
> > Does the full 5.7GB file get any smaller with compression? Could you
> > upload it to wotan or somewhere?
> Slightly (83 MB now) :-)
> You can find it at /home/jgross/page_owner_sorted.txt.gz
And of course /home/jgross/page_owner_full.txt.gz
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c11

Vlastimil Babka <vbabka@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |tzimmermann@suse.com

--- Comment #11 from Vlastimil Babka <vbabka@suse.com> ---
Looks like the sorting tool is still broken. Grepping the full output for
"shmem", I can see tons of entries like this:

Page allocated via order 0, mask 0x1120d2(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 2603, ts 155051094696842 ns, free_ts 154475925820958 ns
PFN 12866 type Reclaimable Block 25 type Reclaimable Flags 0xfffffc0180014(uptodate|lru|swapbacked|unevictable|node=0|zone=1|lastcpupid=0x1fffff)
 prep_new_page+0x93/0xb0
 get_page_from_freelist+0x19e9/0x1ce0
 __alloc_pages+0x180/0x320
 alloc_pages_vma+0x8b/0x260
 shmem_alloc_page+0x3f/0x90
 shmem_alloc_and_acct_page+0x72/0x1c0
 shmem_getpage_gfp+0x2eb/0x870
 shmem_read_mapping_page_gfp+0x49/0xf0
 shmem_get_pages+0x1c6/0x600 [i915]
 __i915_gem_object_get_pages+0x34/0x40 [i915]
 i915_gem_set_domain_ioctl+0x2b6/0x360 [i915]
 drm_ioctl_kernel+0xb4/0x100 [drm]
 drm_ioctl+0x35a/0x400 [drm]
 __x64_sys_ioctl+0x8f/0xd0
 do_syscall_64+0x58/0x80
 entry_SYSCALL_64_after_hwframe+0x61/0xcb

Note the 'ts' and 'free_ts' values are actually irrelevant - if a page shows
up in the page_owner dump, it is currently allocated; the filtering in the
sorting tool is bogus. So we can simply do
# grep __i915_gem_object_get_pages page_owner_full.txt | wc -l
3080094
Even if all of it is order-0, that amounts to 11.7 GB. Comment #3 says "There
were still more than 10GB of shared memory", so I guess that's all
attributable to i915? CCing tzimmermann.
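As a plausibility check of that figure (assuming order-0 means 4 KiB pages):
3080094 pages × 4096 bytes = 12,616,065,024 bytes ≈ 11.75 GiB, matching the
11.7 GB quoted above.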
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c12

--- Comment #12 from Jürgen Groß <jgross@suse.com> ---
(In reply to Vlastimil Babka from comment #11)
> Looks like the sorting tool is still broken. Grepping the full output for
> "shmem", I can see tons of entries like this:
> Page allocated via order 0, mask 0x1120d2(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 2603, ts 155051094696842 ns, free_ts 154475925820958 ns
> PFN 12866 type Reclaimable Block 25 type Reclaimable Flags 0xfffffc0180014(uptodate|lru|swapbacked|unevictable|node=0|zone=1|lastcpupid=0x1fffff)
>  prep_new_page+0x93/0xb0
>  get_page_from_freelist+0x19e9/0x1ce0
>  __alloc_pages+0x180/0x320
>  alloc_pages_vma+0x8b/0x260
>  shmem_alloc_page+0x3f/0x90
>  shmem_alloc_and_acct_page+0x72/0x1c0
>  shmem_getpage_gfp+0x2eb/0x870
>  shmem_read_mapping_page_gfp+0x49/0xf0
>  shmem_get_pages+0x1c6/0x600 [i915]
>  __i915_gem_object_get_pages+0x34/0x40 [i915]
>  i915_gem_set_domain_ioctl+0x2b6/0x360 [i915]
>  drm_ioctl_kernel+0xb4/0x100 [drm]
>  drm_ioctl+0x35a/0x400 [drm]
>  __x64_sys_ioctl+0x8f/0xd0
>  do_syscall_64+0x58/0x80
>  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> Note the 'ts' and 'free_ts' values are actually irrelevant - if a page
> shows up in the page_owner dump, it is currently allocated; the filtering
> in the sorting tool is bogus. So we can simply do
> # grep __i915_gem_object_get_pages page_owner_full.txt | wc -l
> 3080094
> Even if all of it is order-0, that amounts to 11.7 GB. Comment #3 says
> "There were still more than 10GB of shared memory", so I guess that's all
> attributable to i915?
Seems to be a good guess. PID 2603 is /usr/bin/plasmashell, BTW.

Right now I have about 17 GB of shared mem, and

# grep __i915_gem_object_get_pages /sys/kernel/debug/page_owner | wc -l

gives me 4259344.
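By the same arithmetic as above (order-0, 4 KiB pages): 4259344 × 4096 bytes
≈ 17.4 GB (16.2 GiB), consistent with the ~17 GB of shared memory.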
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c13

Thomas Zimmermann <tzimmermann@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jgross@suse.com
              Flags|        |needinfo?(jgross@suse.com)

--- Comment #13 from Thomas Zimmermann <tzimmermann@suse.com> ---
Hi,

there is plenty of internal information under /sys/kernel/debug/dri/0/. After
running the leaky code for a while, can you please retrieve the following
files:

sudo cat /sys/kernel/debug/dri/0/clients
sudo cat /sys/kernel/debug/dri/0/framebuffer
sudo cat /sys/kernel/debug/dri/0/i915_gem_objects

Please also see if their content changes over time. The last file has
information about memory allocation that might be helpful. Does it go up?

Best regards
Thomas
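A simple way to watch those files over time, as a sketch (interval and log
file name are arbitrary):

# log the i915 object statistics every 10 minutes
while true; do
    date -Is >> i915_gem_objects.log
    sudo cat /sys/kernel/debug/dri/0/i915_gem_objects >> i915_gem_objects.log
    sleep 600
done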
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c14

Jürgen Groß <jgross@suse.com> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jgross@suse.com) |

--- Comment #14 from Jürgen Groß <jgross@suse.com> ---
(In reply to Thomas Zimmermann from comment #13)
> Hi,
> there is plenty of internal information under /sys/kernel/debug/dri/0/.
> After running the leaky code for a while, can you please retrieve the
> following files:
> sudo cat /sys/kernel/debug/dri/0/clients
> sudo cat /sys/kernel/debug/dri/0/framebuffer
> sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
> Please also see if their content changes over time. The last file has
> information about memory allocation that might be helpful. Does it go up?
# cat /sys/kernel/debug/dri/0/clients
             command   pid dev master a   uid      magic
                   X  2321   0      y y     0          0
                   X  2321   0      n y     0          2
                   X  2321   0      n y     0          3
                   X  2321   0      n y     0          4
                   X  2321   0      n y     0          5
                   X  2321   0      n y     0          6
                   X  2321   0      n y     0          7
                   X  2321   0      n y     0          1
             firefox 23630 128      n n  1000          0
                   X  2321   0      n y     0          8
                   X  2321   0      n y     0          9
                   X  2321   0      n y     0         10
     thunderbird-bin 23974 128      n n  1000          0
                   X  2321   0      n y     0         11
                   X  2321   0      n y     0         12
                   X  2321   0      n y     0         13
                   X  2321   0      n y     0         14
             Keybase 24435 128      n n  1000          0
                   X  2321   0      n y     0         15
            electron 24645 128      n n  1000          0
                   X  2321   0      n y     0         16
                   X  2321   0      n y     0         17
            electron 24832 128      n n  1000          0
                   X  2321   0      n y     0         18
                   X  2321   0      n y     0         19
         RDD Process 25416 128      n n  1000          0
                   X  2321   0      n y     0         20
                   X  2321   0      n y     0         21
                   X  2321   0      n y     0         22

# cat /sys/kernel/debug/dri/0/framebuffer
framebuffer[140]:
        allocated by = X
        refcount=7
        format=XR24 little-endian (0x34325258)
        modifier=0x100000000000001
        size=5760x1200
        layers:
                size[0]=5760x1200
                pitch[0]=23040
                offset[0]=0
        obj[0]:
                name=0
                refcount=7
                start=00000000
                size=29360128
                imported=no
framebuffer[97]:
        allocated by = [fbcon]
        refcount=1
        format=XR24 little-endian (0x34325258)
        modifier=0x0
        size=1920x1080
        layers:
                size[0]=1920x1080
                pitch[0]=7680
                offset[0]=0
        obj[0]:
                name=0
                refcount=3
                start=00000000
                size=8294400
                imported=no

# cat /sys/kernel/debug/dri/0/i915_gem_objects
1166 shrinkable [0 free] objects, 1291399168 bytes
system: total:0x00000007be357000, available:0x00000007be357000 bytes
stolen-system: total:0x0000000004000000, available:0x0000000004000000 bytes

Note that shared memory allocations were up to about 20 GB when taking this
data; according to the page owner data most of that was allocated by i915.

Some minutes after the above snapshot I'm seeing:

# cat /sys/kernel/debug/dri/0/i915_gem_objects
1224 shrinkable [0 free] objects, 716410880 bytes
system: total:0x00000007be357000, available:0x00000007be357000 bytes
stolen-system: total:0x0000000004000000, available:0x0000000004000000 bytes

Shared memory usage hasn't really dropped, though.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c15

Thomas Zimmermann <tzimmermann@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
              Flags|        |needinfo?(jgross@suse.com)

--- Comment #15 from Thomas Zimmermann <tzimmermann@suse.com> ---
Hi

(In reply to Jürgen Groß from comment #14)
> (In reply to Thomas Zimmermann from comment #13)
> > Hi,
> > there is plenty of internal information under /sys/kernel/debug/dri/0/.
> > After running the leaky code for a while, can you please retrieve the
> > following files:
> > sudo cat /sys/kernel/debug/dri/0/clients
> > sudo cat /sys/kernel/debug/dri/0/framebuffer
> > sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
> > Please also see if their content changes over time. The last file has
> > information about memory allocation that might be helpful. Does it go up?
> # cat /sys/kernel/debug/dri/0/clients
>              command   pid dev master a   uid      magic
>                    X  2321   0      y y     0          0
>                    X  2321   0      n y     0          2
>                    X  2321   0      n y     0          3
>                    X  2321   0      n y     0          4
>                    X  2321   0      n y     0          5
>                    X  2321   0      n y     0          6
>                    X  2321   0      n y     0          7
>                    X  2321   0      n y     0          1
>              firefox 23630 128      n n  1000          0
>                    X  2321   0      n y     0          8
>                    X  2321   0      n y     0          9
>                    X  2321   0      n y     0         10
>      thunderbird-bin 23974 128      n n  1000          0
>                    X  2321   0      n y     0         11
>                    X  2321   0      n y     0         12
>                    X  2321   0      n y     0         13
>                    X  2321   0      n y     0         14
>              Keybase 24435 128      n n  1000          0
>                    X  2321   0      n y     0         15
>             electron 24645 128      n n  1000          0
>                    X  2321   0      n y     0         16
>                    X  2321   0      n y     0         17
>             electron 24832 128      n n  1000          0
>                    X  2321   0      n y     0         18
>                    X  2321   0      n y     0         19
>          RDD Process 25416 128      n n  1000          0
>                    X  2321   0      n y     0         20
>                    X  2321   0      n y     0         21
>                    X  2321   0      n y     0         22
Quite a bit of X here, but maybe not a problem.
> # cat /sys/kernel/debug/dri/0/framebuffer
> framebuffer[140]:
>         allocated by = X
>         refcount=7
>         format=XR24 little-endian (0x34325258)
>         modifier=0x100000000000001
>         size=5760x1200
>         layers:
>                 size[0]=5760x1200
>                 pitch[0]=23040
>                 offset[0]=0
>         obj[0]:
>                 name=0
>                 refcount=7
>                 start=00000000
>                 size=29360128
>                 imported=no
> framebuffer[97]:
>         allocated by = [fbcon]
>         refcount=1
>         format=XR24 little-endian (0x34325258)
>         modifier=0x0
>         size=1920x1080
>         layers:
>                 size[0]=1920x1080
>                 pitch[0]=7680
>                 offset[0]=0
>         obj[0]:
>                 name=0
>                 refcount=3
>                 start=00000000
>                 size=8294400
>                 imported=no
Looks normal.
> # cat /sys/kernel/debug/dri/0/i915_gem_objects
> 1166 shrinkable [0 free] objects, 1291399168 bytes
That's a lot of objects. I have ~250 to 350 on my system with i915 and Tumbleweed. It goes up and down, but remains within that range.
> system: total:0x00000007be357000, available:0x00000007be357000 bytes
> stolen-system: total:0x0000000004000000, available:0x0000000004000000 bytes
> Note that shared memory allocations were up to about 20 GB when taking this
> data; according to the page owner data most of that was allocated by i915.
> Some minutes after the above snapshot I'm seeing:
> # cat /sys/kernel/debug/dri/0/i915_gem_objects
> 1224 shrinkable [0 free] objects, 716410880 bytes
The number of objects is going up, but the backing memory is going down.
Maybe you're leaking object instances (in contrast to full buffers).

What's your graphics environment? Can you run with a different desktop/window
manager and observe the changes to these files? I'd like to rule out a bug in
the userspace code.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c16

Jürgen Groß <jgross@suse.com> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jgross@suse.com) |

--- Comment #16 from Jürgen Groß <jgross@suse.com> ---
Switching the desktop would hit me rather hard, as this is my primary work
system. Recreating the whole setup with 3 screens and several virtual
desktops would probably cost me several hours, and I have a lot of work...

Yesterday I rebooted the system due to swap usage having grown to more than
25 GB again. I'll monitor /sys/kernel/debug/dri/0/i915_gem_objects to see how
the numbers change over time. Right now I'm seeing:

1833 shrinkable [0 free] objects, 3151917056 bytes
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c17

Thomas Zimmermann <tzimmermann@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
              Flags|        |needinfo?(jgross@suse.com)

--- Comment #17 from Thomas Zimmermann <tzimmermann@suse.com> ---
(In reply to Jürgen Groß from comment #16)
> Switching the desktop would hit me rather hard, as this is my primary work
> system. Recreating the whole setup with 3 screens and several virtual
> desktops would probably cost me several hours, and I have a lot of work...
> Yesterday I rebooted the system due to swap usage having grown to more than
> 25 GB again. I'll monitor /sys/kernel/debug/dri/0/i915_gem_objects to see
> how the numbers change over time. Right now I'm seeing:
> 1833 shrinkable [0 free] objects, 3151917056 bytes
What is your current desktop environment?

I understand that such a test is a lot of work, but we need to rule out
possible sources of the problem, and one of them is the desktop environment.
I assume you cannot test with a different graphics driver/hardware either?
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c18

Jürgen Groß <jgross@suse.com> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jgross@suse.com) |

--- Comment #18 from Jürgen Groß <jgross@suse.com> ---
(In reply to Thomas Zimmermann from comment #17)
> (In reply to Jürgen Groß from comment #16)
> > Switching the desktop would hit me rather hard, as this is my primary
> > work system. Recreating the whole setup with 3 screens and several
> > virtual desktops would probably cost me several hours, and I have a lot
> > of work...
> > Yesterday I rebooted the system due to swap usage having grown to more
> > than 25 GB again. I'll monitor /sys/kernel/debug/dri/0/i915_gem_objects
> > to see how the numbers change over time. Right now I'm seeing:
> > 1833 shrinkable [0 free] objects, 3151917056 bytes
> What is your current desktop environment?
KDE
> I understand that such a test is a lot of work, but we need to rule out
> possible sources of the problem, and one of them is the desktop
> environment.
The main problem here is that this is my main work machine. I don't have a
spare one, so reconfiguring it brings my daily work to a stop for quite some
time, and running in the reconfigured setup for some days would probably slow
down my normal work during that time.
> I assume you cannot test with a different graphics driver/hardware either?
No. I'd be happy to run a debug kernel, though.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c19

--- Comment #19 from Thomas Zimmermann <tzimmermann@suse.com> ---
Give me a bit, I'll try to reproduce the problem locally.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c20

Thomas Zimmermann <tzimmermann@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
              Flags|        |needinfo?(jgross@suse.com)

--- Comment #20 from Thomas Zimmermann <tzimmermann@suse.com> ---
(In reply to Thomas Zimmermann from comment #19)
> Give me a bit, I'll try to reproduce the problem locally.
I had the system running for a while, but nothing happened: 330 objects with
100 MiB of memory, give or take, and ~600 with Firefox and Thunderbird
running.

Is there a program that you typically run that could make the problem show
up?
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c21

Jürgen Groß <jgross@suse.com> changed:

           What    |Removed                    |Added
----------------------------------------------------------------------------
              Flags|needinfo?(jgross@suse.com) |

--- Comment #21 from Jürgen Groß <jgross@suse.com> ---
(In reply to Thomas Zimmermann from comment #20)
> (In reply to Thomas Zimmermann from comment #19)
> > Give me a bit, I'll try to reproduce the problem locally.
> I had the system running for a while, but nothing happened: 330 objects
> with 100 MiB of memory, give or take, and ~600 with Firefox and Thunderbird
> running.
My system is going down to 330 objects at night, too. I don't think this is the problem. Look for the output of "grep Shmem: /proc/meminfo". This value is going up by 5-6 GB per day on my system.
> Is there a program that you typically run that could make the problem show
> up?
I'm running electron (a matrix.org client) and keybase, which are not _that_
common. Stopping those won't make shared memory usage go down significantly,
though.
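A simple way to track that growth rate, as a sketch (log file name and
interval are arbitrary):

# append a timestamped Shmem sample once per hour
while sleep 3600; do
    printf '%s %s\n' "$(date -Is)" "$(grep Shmem: /proc/meminfo)" >> shmem.log
done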
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c22

--- Comment #22 from Jürgen Groß <jgross@suse.com> ---
Another data point: I've rebooted my system and did _not_ start keybase,
electron and the Slack web-client. After more than 3 hours there seems to be
no shared memory leak (all shared memory found to be in use in /proc/meminfo
can be attributed either to processes or to i915 objects in use).

I'll wait until tomorrow morning and will then start electron.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c23

--- Comment #23 from Jürgen Groß <jgross@suse.com> ---
With electron running, shared memory consumption is going up at roughly
200-300 MB per hour.

Stopping electron again has no effect, i.e. shared memory consumption keeps
rising at the same rate.
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c24

--- Comment #24 from Thomas Zimmermann <tzimmermann@suse.com> ---
Hi,

thanks for the more detailed analysis. It makes me wonder if the problem is
really in the kernel. Maybe one of the programs keeps leaking graphics
buffers. A graphics buffer is like a regular memory buffer: as long as a
program keeps it allocated, the kernel will not release the memory pages.

(In reply to Jürgen Groß from comment #23)
> With electron running, shared memory consumption is going up at roughly
> 200-300 MB per hour.
> Stopping electron again has no effect, i.e. shared memory consumption keeps
> rising at the same rate.
Could it be that the electron process keeps running in the background?
https://bugzilla.suse.com/show_bug.cgi?id=1204380#c25

--- Comment #25 from Jürgen Groß <jgross@suse.com> ---
(In reply to Thomas Zimmermann from comment #24)
> Hi,
> thanks for the more detailed analysis. It makes me wonder if the problem is
> really in the kernel. Maybe one of the programs keeps leaking graphics
> buffers.
> A graphics buffer is like a regular memory buffer: as long as a program
> keeps it allocated, the kernel will not release the memory pages.
> (In reply to Jürgen Groß from comment #23)
> > With electron running, shared memory consumption is going up at roughly
> > 200-300 MB per hour.
> > Stopping electron again has no effect, i.e. shared memory consumption
> > keeps rising at the same rate.
> Could it be that the electron process keeps running in the background?
No, unless it has a sub-process with a different name. I've checked with
"ps -ef | grep electron".