[Bug 1199970] New: reaim-io-disk ext4 regression in 5.18 in mballoc
https://bugzilla.suse.com/show_bug.cgi?id=1199970

            Bug ID: 1199970
           Summary: reaim-io-disk ext4 regression in 5.18 in mballoc
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: jack@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Marvin has bisected a small (5%) performance regression in ext4 down to commit
077d0c2c78df ("ext4: make mb_optimize_scan performance mount option work with
extents"). The bisection result looks like:

                good-27b38686   bad-077d0c2c
Hmean disk-1        537.83          529.10
Hmean disk-5       3433.28         3431.71
...
Hmean disk-25     16999.09        16290.18 *  -3.67%*
Hmean disk-29     19162.99        18385.46 *  -4.73%*
Hmean disk-33     21859.13        20793.95 *  -4.54%*

The regression was reported on laurel1. Most likely the overhead of mballoc
has increased somewhat due to the rbtree search for the best group to use.
https://bugzilla.suse.com/show_bug.cgi?id=1199970

Jan Kara <jack@suse.com> changed:

           What    |Removed                  |Added
----------------------------------------------------------------------------
                 CC|                         |kernel-performance-bugs@suse.de
           Assignee|kernel-bugs@opensuse.org |jack@suse.com
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c1

--- Comment #1 from Jan Kara <jack@suse.com> ---
Marvin7 also tracked this commit down as the culprit for an
io-fsmark-small-file-stream-ext4 regression:

                            good-27b38686    bad-077d0c2c
1st-qrtle   1-files/sec        46497.40        21683.10 ( -53.29%)
2nd-qrtle   1-files/sec        45571.70        14200.20 ( -69.02%)
3rd-qrtle   1-files/sec        45103.40        12218.30 ( -73.31%)
Max-90      1-files/sec        45103.40        12218.30 ( -73.31%)
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c2

--- Comment #2 from Jan Kara <jack@suse.com> ---
And the marvin7 reaim-io-disk-ext4 regression:

                good-27b38686             bad-077d0c2c
Hmean disk-1      3456.22 (   4.15%)      3267.97 (  -1.53%)
Hmean disk-25   219298.25 (  -1.17%)    198938.99 * -10.34%*
Hmean disk-49   436201.78 (  -0.30%)    378865.98 * -13.40%*
Hmean disk-73   600000.00 (   0.82%)    504608.30 * -15.21%*
Hmean disk-97   734848.49 *  -4.29%*    621794.87 * -19.02%*
Hmean disk-121  828767.12 (  -0.46%)    666055.05 * -20.00%*
Hmean disk-145  877016.13 (  -2.82%)    744863.01 * -17.47%*
Hmean disk-169  944134.08 (  -0.00%)    803486.53 * -14.90%*
Hmean disk-193  900466.56 *  -4.82%*    851470.59 * -10.00%*
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c3

--- Comment #3 from Jan Kara <jack@suse.com> ---
The result looks reproducible on hardy4 as well:

                   baseline              mb_optimize_scan
Hmean disk-1      2114.16 (   0.00%)      2099.37 (  -0.70%)
Hmean disk-41    87794.43 (   0.00%)     83787.47 *  -4.56%*
Hmean disk-81   148170.73 (   0.00%)    135527.05 *  -8.53%*
Hmean disk-121  177506.11 (   0.00%)    166284.93 *  -6.32%*
Hmean disk-161  220951.51 (   0.00%)    207563.39 *  -6.06%*
Hmean disk-201  208722.74 (   0.00%)    203235.59 (  -2.63%)
Hmean disk-241  222051.60 (   0.00%)    217705.51 (  -1.96%)
Hmean disk-281  252244.17 (   0.00%)    241132.72 *  -4.41%*
Hmean disk-321  255844.84 (   0.00%)    245412.84 *  -4.08%*

For reference, this is a 5.18 kernel with only the mb_optimize_scan=0/1 mount
option changed between the two runs.
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c4

--- Comment #4 from Jan Kara <jack@suse.com> ---
OK, some updates after a rather long time. The regression seems to be mostly
triggered by reaim rapidly creating, fsyncing, and deleting small files. I was
able to reproduce the regression using a simpler stress-unlink benchmark:

  stress-unlink -s -c 10000 -f 22528 16 0 /mnt

Results with this benchmark are:

                         AVG        STDDEV
mb_optimize_scan=0:  28.285800    0.156846
mb_optimize_scan=1:  30.896600    0.323324
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c5

--- Comment #5 from Jan Kara <jack@suse.com> ---
Created attachment 860162
  --> https://bugzilla.suse.com/attachment.cgi?id=860162&action=edit
Stress-unlink benchmark
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c6

--- Comment #6 from Jan Kara <jack@suse.com> ---
For explanation: this invocation of the benchmark spawns 16 processes, and
each process runs a create-a-22k-file, fsync, unlink sequence in a loop 10000
times.
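For illustration, a minimal userspace sketch of one such worker loop. This is
not the benchmark's actual source (that is attachment 860162 in comment #5);
the file naming, open flags, and error handling below are assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : "/mnt";
        static char buf[22528];              /* 22k of file data (-f 22528) */
        char path[4096];
        int i, fd;

        memset(buf, 'x', sizeof(buf));
        snprintf(path, sizeof(path), "%s/stress-unlink.%d", dir, getpid());

        for (i = 0; i < 10000; i++) {        /* -c 10000 iterations */
                fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write");
                        close(fd);
                        return 1;
                }
                fsync(fd);                   /* forces a jbd2 commit */
                close(fd);
                unlink(path);                /* delete right away */
        }
        return 0;
}

Running 16 instances of this in parallel against the same filesystem
approximates the "16 processes" part of the invocation above.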
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c7

--- Comment #7 from Jan Kara <jack@suse.com> ---
From the tracing data it is clear that with mb_optimize_scan=1, jbd2 does
considerably more IO:

mb_optimize_scan=0: Stats for process [jbd2/sdb1-8] (36694)
                    Queued writes 657489 (2629956 KB)
mb_optimize_scan=1: Stats for process [jbd2/sdb1-8] (35727)
                    Queued writes 745845 (2983380 KB)

The number of commits is actually somewhat lower with mb_optimize_scan=1:

commits mb_optimize_scan=0: 26367
commits mb_optimize_scan=1: 25582

So commits are considerably larger with mb_optimize_scan=1. The load dirties
only inodes and block & inode bitmaps. So likely with mb_optimize_scan=1 we
spread processes over more groups, which results in dirtying more metadata
blocks. In theory each process can dirty up to 6 blocks per commit (the unlink
and the create can each dirty one block bitmap, one inode bitmap, and one
inode table block). Given we have 16 processes, this can result in commits up
to 96 blocks large. The mb_optimize_scan=0 average is 23.9 blocks per commit;
the mb_optimize_scan=1 average is 28.1 blocks per commit. And counting the
groups touched in each commit confirms that with mb_optimize_scan=1 we are
indeed touching more groups per commit.

Now I have to think whether this wider spreading of processes is a desirable
thing or not...
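As a sanity check of the arithmetic above, here is the worst-case bound worked
out in code; all inputs are the figures quoted in this comment, nothing here
is measured:

#include <stdio.h>

int main(void)
{
        int nproc = 16;           /* benchmark worker processes */
        int ops = 2;              /* one create + one unlink per iteration */
        int blocks_per_op = 3;    /* block bitmap + inode bitmap +
                                     inode table block */

        /* worst case: every process dirties disjoint metadata blocks */
        printf("max blocks per commit: %d\n", nproc * ops * blocks_per_op);

        /* observed averages quoted above sit well below this bound */
        printf("observed: 23.9 (scan=0) vs 28.1 (scan=1) blocks/commit\n");
        return 0;
}

This prints 96, matching the bound in the comment; the gap between the two
observed averages is what the extra jbd2 IO amounts to.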
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c8

Jan Kara <jack@suse.com> changed:

           What    |Removed      |Added
----------------------------------------------------------------------------
             Status|NEW          |IN_PROGRESS

--- Comment #8 from Jan Kara <jack@suse.com> ---
OK, I was looking more into why allocations get spread among more groups with
mb_optimize_scan=1. Let me summarize my current findings here so that I don't
forget while I'm on vacation next week :).

Things get somewhat obscured by group preallocations, because small
allocations (below 64k, which is our case) get allocated from those. Group
preallocations are per-CPU and they get initialized like:

mb_optimize_scan=0:
  49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113
  33 33 97 81 49 81 17 49

mb_optimize_scan=1:
  127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117
  116 115 116 114 113 111 110 109 108 107 106 105 104 104

So we can see that the groups from which group preallocations get allocated
drift with mb_optimize_scan=1, while they keep jumping among the same groups
with mb_optimize_scan=0. This is likely because with mb_optimize_scan=0 we
always start searching for free space in the goal group, which is determined
by the inode, and the inode's group is determined by the parent directory. So
in that case we always start the search in the same group. With
mb_optimize_scan=1 we always call ext4_mb_choose_next_group_cr0() to determine
the first group to search. The drifting seems to be caused by the fact that
each free space update (e.g. in mb_mark_used()) calls
mb_set_largest_free_order(), which deletes the group from its
bb_largest_free_order_node list, recomputes the order, and inserts the group
at the tail of the list again.

Anyway, this seems a bit suboptimal, because inode - data block locality is
desirable. Also, given the group selection algorithm, all allocations of the
same size will be contending on the same group. After I return from vacation
I'll take this upstream to discuss whether the current behavior needs
tweaking.
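For illustration, a userspace toy model of that list discipline. This is NOT
the kernel code: the group count, the single flat queue, and the
one-allocation-per-requeue pattern are simplified assumptions. It only shows
why head-of-list selection plus tail re-insertion makes the starting group
drift, while a fixed goal group does not.

#include <stdio.h>

#define NGROUPS 128

/* stand-in for one largest-free-order list, as a circular queue */
static int queue[NGROUPS];
static int head;

static int pick_cr0(void)
{
        int grp = queue[head];

        /*
         * The allocation updates the group's free space, so the group is
         * deleted from the list and re-added at the tail. With a flat
         * queue of all groups, that is equivalent to advancing the head.
         */
        head = (head + 1) % NGROUPS;
        return grp;
}

int main(void)
{
        int i;

        for (i = 0; i < NGROUPS; i++)
                queue[i] = NGROUPS - 1 - i;     /* highest group first */

        printf("mb_optimize_scan=1 (cr0 list head): ");
        for (i = 0; i < 16; i++)
                printf("%d ", pick_cr0());
        printf("\n");

        printf("mb_optimize_scan=0 (goal group):    ");
        for (i = 0; i < 16; i++)
                printf("%d ", 49);      /* goal fixed by the parent dir */
        printf("\n");
        return 0;
}

The first line drifts downward (127 126 125 ...) much like the observed
mb_optimize_scan=1 sequence above, while the second keeps returning to the
same group, like the mb_optimize_scan=0 case.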
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c9

--- Comment #9 from Jan Kara <jack@suse.com> ---
I have posted patches that fix the regression for me here:

  https://lore.kernel.org/all/20220823134508.27854-1-jack@suse.cz

However, they apparently do not completely fix the regression for RPi users,
so there is still more investigation going on.
https://bugzilla.suse.com/show_bug.cgi?id=1199970#c10

Jan Kara <jack@suse.com> changed:

           What    |Removed      |Added
----------------------------------------------------------------------------
             Status|IN_PROGRESS  |RESOLVED
         Resolution|---          |FIXED

--- Comment #10 from Jan Kara <jack@suse.com> ---
Forgot to update this one. Patches fixing these problems were already merged
upstream in September:

4fca50d440cc ("ext4: make mballoc try target group first even with
              mb_optimize_scan")
1940265ede66 ("ext4: avoid unnecessary spreading of allocations among groups")
613c5a85898d ("ext4: make directory inode spreading reflect flexbg size")
a9f2a2931d0e ("ext4: use locality group preallocation for small closed files")
83e80a6e3543 ("ext4: use buckets for cr 1 block scan instead of rbtree")

Closing the bug.