[Bug 1030310] New: Regression in WAL and COMMIT times for pgioperf-bench on ext3
http://bugzilla.suse.com/show_bug.cgi?id=1030310

            Bug ID: 1030310
           Summary: Regression in WAL and COMMIT times for pgioperf-bench
                    on ext3
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-maintainers@forge.provo.novell.com
          Reporter: jack@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Commit 06bd3c36a733ac27962fea7d6f47168841376824 "ext4: fix data exposure
after a crash" regressed WAL times of pgioperf-bench on an ext3 filesystem
~100x on balada.arch.suse.de. COMMIT times regressed ~10x.

The benchmark does random 8k writes into a large data file (2 * memory size),
calling fsync() after each 800MB written, while cycling 8k writes through a
WAL file and fsyncing it every 5 writes. In parallel, lots of random reads
are going on.

The regression is caused by the fact that before the above-mentioned commit
the data file inode was always attached to a transaction in ext4_write_end(),
so random writes were continuously trickled to the filesystem as WRITE_SYNC
writes. After the patch this no longer happens and the writes are written
back by the flush worker using async writes. These likely get starved by the
parallel readers, and as a result the WAL writer gets stuck in write(2) for
extended periods of time.
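[Editorial note: for illustration, a minimal C sketch of the WAL-writer part
of the workload described above - not the actual pgioperf source; the file
name, WAL size and iteration count are made-up placeholders.]

    /*
     * Illustrative sketch only, not the actual pgioperf source: 8k writes
     * cycling through a WAL file, with an fsync() after every 5 writes.
     * File name, WAL size and iteration count are placeholders.
     */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define WAL_SIZE (16 << 20)   /* assumed 16MB WAL file */
    #define BLK      8192         /* 8k writes, as in the benchmark */

    int main(void)
    {
            static char buf[BLK];
            off_t off = 0;
            int fd, i;

            fd = open("pgioperf.wal", O_WRONLY | O_CREAT, 0644);
            if (fd < 0)
                    return 1;
            memset(buf, 'w', sizeof(buf));
            for (i = 1; i <= 1000000; i++) {
                    if (pwrite(fd, buf, BLK, off) != BLK)
                            return 1;
                    off = (off + BLK) % WAL_SIZE;  /* cycle through the file */
                    if (i % 5 == 0)
                            fsync(fd);             /* fsync every 5 writes */
            }
            close(fd);
            return 0;
    }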
http://bugzilla.suse.com/show_bug.cgi?id=1030310

Jan Kara <jack@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Priority|P5 - None                   |P3 - Medium
                CC|                            |kernel-performance-bugs@sus
                  |                            |e.de
          Assignee|kernel-maintainers@forge.pr |jack@suse.com
                  |ovo.novell.com              |
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c1

--- Comment #1 from Jan Kara <jack@suse.com> ---
The regression is specific to the 'nodelalloc' configuration of ext4 - ext3
is one case where this configuration is used. ext4 with delalloc has shown
the "regressed" behavior for a long time already, and the above-mentioned
commit merely brings nodelalloc mode performance in line with the (bad)
performance of delalloc mode.
http://bugzilla.suse.com/show_bug.cgi?id=1030310

Jan Kara <jack@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Status|NEW                         |IN_PROGRESS
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c2

--- Comment #2 from Jan Kara <jack@suse.com> ---
OK, so I've been testing with SLE12 SP3 kernels on ives.arch.suse.de. One
thing I've noticed is that the writeback-throttling (wbt) patches have been
backported from upstream to SLE12-SP3, and they screw this workload badly.
Read numbers look all nice and dandy:

read[21335]: avg: 6.3 msec; max: 303.8 msec
read[21335]: avg: 6.4 msec; max: 224.6 msec
read[21335]: avg: 6.2 msec; max: 286.5 msec
read[21335]: avg: 6.2 msec; max: 264.8 msec
read[21335]: avg: 6.4 msec; max: 258.2 msec
read[21335]: avg: 6.4 msec; max: 280.4 msec

However, the writer struggles hard to make any progress at all - usually
pgioperf.log contains only entries like:

wal[27108]: avg: 0.0 msec; max: 6.1 msec
wal[27108]: avg: 0.0 msec; max: 10.9 msec
commit[27108]: avg: 2.6 msec; max: 2361.9 msec
wal[27108]: avg: 0.0 msec; max: 16.5 msec

which are from the time before the readers actually managed to start. After
that, the writers are blocked until the whole benchmark completes.

Analysis of blktrace data has shown that the wbt logic interacts badly with
CFQ. As a result of wbt throttling, CFQ always sees just one write request at
a time: when such a request is seen, the async queue gets scheduled,
eventually gets its time slot, and submits that one write request. Once it
completes, the readers are scheduled again, since the async queue has no more
IO. As a result we complete about one write every couple of seconds, which is
far too slow - a single writeback pass through the data file contains more
writes than we can complete during the whole benchmark run.

TODO item: disable writeback throttling for non-multiqueue devices by
default.
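[Editorial note: a minimal sketch of the per-device workaround implied by the
TODO above, assuming a kernel built with CONFIG_BLK_WBT and a device named
"sda" (both assumptions); writing 0 to the wbt_lat_usec queue attribute
disables writeback throttling for that device.]

    /*
     * Sketch of the workaround: disable writeback throttling on one
     * device by writing 0 to its wbt_lat_usec queue attribute.
     * Assumptions: kernel with CONFIG_BLK_WBT, device named "sda".
     */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/block/sda/queue/wbt_lat_usec", "w");

            if (!f) {
                    perror("wbt_lat_usec");
                    return 1;
            }
            /* 0 disables WBT; a positive value sets the target read
             * latency in microseconds; -1 restores the default. */
            fprintf(f, "0\n");
            fclose(f);
            return 0;
    }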
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c3

--- Comment #3 from Jan Kara <jack@suse.com> ---
Even after disabling writeback throttling we can see a noticeable effect of
the commit mentioned in comment 0 on the numbers. Without the commit, both
read and write numbers are rather stable:

read[992]: avg: 18.3 msec; max: 881.4 msec
read[998]: avg: 18.3 msec; max: 881.2 msec
read[1001]: avg: 18.3 msec; max: 881.4 msec
read[993]: avg: 18.3 msec; max: 881.0 msec
wal[991]: avg: 0.0 msec; max: 0.2 msec
wal[991]: avg: 0.0 msec; max: 0.2 msec
commit[991]: avg: 53.9 msec; max: 928.4 msec

With the commit, read numbers are better but show large spikes:

read[16644]: avg: 7.8 msec; max: 369.4 msec
read[16653]: avg: 7.8 msec; max: 369.6 msec
read[16647]: avg: 6.6 msec; max: 369.4 msec
read[16651]: avg: 7.8 msec; max: 369.5 msec
read[16645]: avg: 7.8 msec; max: 369.5 msec
read[16655]: avg: 7.8 msec; max: 20493.0 msec

Write numbers are worse, also with large spikes:

wal[16642]: avg: 101.0 msec; max: 100993.5 msec
wal[16642]: avg: 76.8 msec; max: 70830.4 msec
wal[16642]: avg: 19.5 msec; max: 16180.1 msec
commit[16642]: avg: 81.9 msec; max: 15718.7 msec
wal[16642]: avg: 18.7 msec; max: 13445.0 msec
...
commit[16642]: avg: 106.7 msec; max: 21362.7 msec
wal[16642]: avg: 60.2 msec; max: 21059.4 msec
wal[16642]: avg: 30.7 msec; max: 23651.2 msec
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c4

--- Comment #4 from Jan Kara <jack@suse.com> ---
Created attachment 719156
  --> http://bugzilla.suse.com/attachment.cgi?id=719156&action=edit
Log when WBT is disabled, CFQ used, my async starvation fix applied
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c5

--- Comment #5 from Jan Kara <jack@suse.com> ---
Created attachment 719157
  --> http://bugzilla.suse.com/attachment.cgi?id=719157&action=edit
Log when WBT is disabled, CFQ used, my async starvation fix applied, ext4
change reverted
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c6

--- Comment #6 from Jan Kara <jack@suse.com> ---
Created attachment 719160
  --> http://bugzilla.suse.com/attachment.cgi?id=719160&action=edit
IOwatcher graph from run with ext4 patch included
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c7

--- Comment #7 from Jan Kara <jack@suse.com> ---
Created attachment 719162
  --> http://bugzilla.suse.com/attachment.cgi?id=719162&action=edit
IOwatcher graph from run with ext4 patch reverted
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c8

--- Comment #8 from Jan Kara <jack@suse.com> ---
The IOwatcher graphs (from part of the benchmarking run) show very nicely the
difference between the two kernels.
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c9

--- Comment #9 from Jan Kara <jack@suse.com> ---
OK, I've been doing more investigation of this issue. In the end, the culprit
of the stalls is transaction commit times. During the benchmark run the
average commit time is ~18s, with a standard deviation of ~41s! The top 5
commit times are: 274.466639s, 126.467347s, 86.992429s, 34.351563s,
31.517653s.

The reason transaction commits take so long (although the transactions are
pretty small) is that the flusher worker holds a transaction open in
ext4_writepages() while writing back pages. This writeback gets throttled by
CFQ, so it takes a long time for ext4_writepages() to complete and thus for
the transaction handle to be dropped, which in turn is what allows the
transaction commit to complete.

A relatively simple solution to this problem is to start a transaction only
once we find a page that needs block allocation / extent conversion in
ext4_writepages(). With this change, transaction commit times drop to 0.1s on
average with a standard deviation of 0.15s, and the top 5 commit times are:
0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s.

The benchmark numbers themselves also look better after the change. For
reads, the results look like:

read[23390]: avg: 10.7 msec; max: 358.5 msec
read[23387]: avg: 10.7 msec; max: 358.8 msec
read[23394]: avg: 10.7 msec; max: 358.9 msec
read[23392]: avg: 10.7 msec; max: 358.6 msec
read[23395]: avg: 10.7 msec; max: 358.6 msec
read[23382]: avg: 10.7 msec; max: 358.7 msec
read[23381]: avg: 10.7 msec; max: 358.9 msec
read[23385]: avg: 10.7 msec; max: 358.4 msec
read[23393]: avg: 10.7 msec; max: 359.0 msec
read[23389]: avg: 10.7 msec; max: 358.6 msec
read[23388]: avg: 10.7 msec; max: 358.7 msec
read[23386]: avg: 10.7 msec; max: 358.3 msec
read[23396]: avg: 10.7 msec; max: 359.0 msec
read[23383]: avg: 10.7 msec; max: 358.5 msec
read[23391]: avg: 10.7 msec; max: 358.9 msec
read[23384]: avg: 10.7 msec; max: 359.0 msec

with a maximum observed read latency of ~500 msec. Average wal times are 0.0
msec, with maximums in the 10-20 msec range and one 300 msec sample. Commit
times also look reasonable: averages are in the 30-50 msec range and maximums
peak at 10 seconds - still quite big, but an order of magnitude better than
with the unpatched kernel.
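[Editorial note: the shape of the change can be sketched as C-style
pseudocode with hypothetical helper names - this is not the actual ext4
patch, only an illustration of the idea described above.]

    /*
     * C-style pseudocode, hypothetical helpers, not the actual patch.
     * Pages whose blocks are already mapped are submitted with no
     * transaction handle held; a handle is started only when a page
     * needs block allocation or unwritten-extent conversion, and it is
     * dropped again before the IO is submitted.
     */
    while ((page = next_dirty_page(mapping)) != NULL) {
            if (page_blocks_fully_mapped(page)) {
                    /* No metadata update needed, so no handle is held
                     * and a journal commit never blocks on this IO. */
                    submit_page_io(page);
                    continue;
            }
            /* Only now pin a transaction, for as short a time as
             * possible. */
            handle = journal_start(journal, credits_for(page));
            allocate_or_convert_blocks(handle, page);
            journal_stop(handle);
            submit_page_io(page);
    }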
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c10

--- Comment #10 from Jan Kara <jack@suse.com> ---
Created attachment 720108
  --> http://bugzilla.suse.com/attachment.cgi?id=720108&action=edit
Log when WBT is disabled, ext4_writepages() change applied
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c11

--- Comment #11 from Jan Kara <jack@suse.com> ---
Created attachment 720110
  --> http://bugzilla.suse.com/attachment.cgi?id=720110&action=edit
IOwatcher graph from run with ext4_writepages() change
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c12

--- Comment #12 from Jan Kara <jack@suse.com> ---
Created attachment 720111
  --> http://bugzilla.suse.com/attachment.cgi?id=720111&action=edit
[PATCH 1/3] ext4: Allow IO submission without io_end
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c13

--- Comment #13 from Jan Kara <jack@suse.com> ---
Created attachment 720113
  --> http://bugzilla.suse.com/attachment.cgi?id=720113&action=edit
[PATCH 2/3] ext4: Don't allocate io_end for writeback from ext4_writepage()
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c14

--- Comment #14 from Jan Kara <jack@suse.com> ---
Created attachment 720114
  --> http://bugzilla.suse.com/attachment.cgi?id=720114&action=edit
[PATCH 3/3] ext4: Submit mapped buffers without transaction
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c15

--- Comment #15 from Jan Kara <jack@suse.com> ---
I've attached, for reference, the patches that achieve the improvement. I'm
now working on further testing of these patches and will push them upstream
if functional testing passes.
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c16

Mel Gorman <mgorman@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                CC|                            |mgorman@suse.com

--- Comment #16 from Mel Gorman <mgorman@suse.com> ---
(In reply to Jan Kara from comment #15)
> I've attached for reference patches that achieve the improvement. I'm now
> working on further testing of these patches and will push them upstream if
> functional testing passes.
FWIW, I've queued these on the machine where I first spotted the problem:
v4.11-rc5 vanilla against these patches only. I looked at the patches but I'm
not qualified to review them properly.

As an aside, how did you measure the transaction commit times - tracing
jbd2_start_commit + jbd2_end_commit and taking the difference?
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c17

--- Comment #17 from Mel Gorman <mgorman@suse.com> ---
ext4 looks good:

pgioperfbench ext4
                               4.11.0-rc5            4.11.0-rc5
                                  vanilla         transact-v1r1
Min      commit      13.10 (  0.00%)       11.00 ( 16.03%)
Min      read      1053.00 (  0.00%)     1077.30 ( -2.31%)
Min      wal          0.00 (  0.00%)        0.00 (  0.00%)
Max-95%  commit    6154.80 (  0.00%)       78.90 ( 98.72%)
Max-95%  read      1473.40 (  0.00%)     1097.10 ( 25.54%)
Max-95%  wal       6359.70 (  0.00%)        0.10 (100.00%)
Max-99%  commit    8933.20 (  0.00%)      382.50 ( 95.72%)
Max-99%  read      1696.80 (  0.00%)     1097.10 ( 35.34%)
Max-99%  wal       7013.60 (  0.00%)        0.20 (100.00%)
Max      commit   10651.00 (  0.00%)     3090.50 ( 70.98%)
Max      read      1696.80 (  0.00%)     1097.10 ( 35.34%)
Max      wal      76206.20 (  0.00%)       41.40 ( 99.95%)
Mean     commit     828.89 (  0.00%)       57.06 ( 93.12%)
Mean     read      1111.46 (  0.00%)     1088.66 (  2.05%)
Mean     wal       1241.19 (  0.00%)        0.08 ( 99.99%)

This is a limited view of the report but it's fairly obvious it's good: max
wal latency down from 76 seconds to ~41 ms, read latencies very similar,
commit times way down. However, there appears to be some read starvation
going on, because the number of read samples is far lower (not shown in the
report). A manual check shows 416 read samples with the vanilla kernel and 80
with the patches.

The story is much more severe for ext3:

                               4.11.0-rc5            4.11.0-rc5
                                  vanilla         transact-v1r1
Min      commit      12.40 (  0.00%)        9.80 ( 20.97%)
Min      read      1046.90 (  0.00%)
Min      wal          0.00 (  0.00%)        0.00 (  0.00%)
Max-95%  commit    4156.80 (  0.00%)      101.40 ( 97.56%)
Max-95%  read      1296.10 (  0.00%)             (100.00%)
Max-95%  wal       4623.20 (  0.00%)        0.10 (100.00%)
Max-99%  commit    6352.20 (  0.00%)      521.90 ( 91.78%)
Max-99%  read      1296.20 (  0.00%)             (100.00%)
Max-99%  wal       5346.10 (  0.00%)        0.20 (100.00%)
Max      commit   36643.40 (  0.00%)     2212.40 ( 93.96%)
Max      read      1296.20 (  0.00%)             (100.00%)
Max      wal      45138.40 (  0.00%)      124.40 ( 99.72%)

Those blank entries for read are somewhat of a reporting bug, but they occur
because no samples were recorded. A manual check verifies this: 304 samples
with the vanilla kernel and 0 with the patches applied. The per-sample graphs
(not presented) show that commit and wal times are consistently very low, but
the lack of reads is a concern.

A partially manual rerun to see what the readers were doing was not
particularly revealing. A reader is stuck in read, as you'd expect:

delboy:~ # cat /proc/3177/stack
[<ffffffff810b5876>] io_schedule+0x16/0x40
[<ffffffff811a8466>] wait_on_page_bit_common+0x116/0x1c0
[<ffffffff811ab367>] generic_file_read_iter+0x157/0x8b0
[<ffffffffa01afd9a>] ext4_file_read_iter+0x4a/0xd0 [ext4]
[<ffffffff8123b27e>] __vfs_read+0xbe/0x130
[<ffffffff8123c13e>] vfs_read+0x9e/0x170
[<ffffffff8123d666>] SyS_read+0x46/0xa0
[<ffffffff810039ae>] do_syscall_64+0x6e/0x180
[<ffffffff8176be2f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

The readers are not completely stalled: tracing one of them shows that reads
are completing, but apparently not enough of them to meet the threshold at
which pgioperf reports something. It could be another flaw in the benchmark,
and the reason fewer reads are recorded may simply be that writes are no
longer stalling, but it's worth checking out. One major observation
supporting the theory that this is a basic timing issue is that the time the
benchmark takes to complete is greatly reduced:

                4.11.0-rc5    4.11.0-rc5
                   vanilla transact-v1r1
User                 14.51          8.91
System              188.92         97.41
Elapsed            4432.20       2660.06

That's way faster, and this may all be down to timing.
Hence, there may be no problem with the patches as such; what is needed is to
adjust the benchmark to report stall times more frequently and to increase
the number of samples it takes before the benchmark is considered complete.
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c18

--- Comment #18 from Mel Gorman <mgorman@suse.com> ---
> Those blank entries for read are somewhat of a reporting bug but occur due
> to no samples being recorded. A manual check verifies. 304 samples with the
> vanilla kernel and 0 with the patches applied.
A rerun with the same patches and configuration did capture some read samples
-- 64 of them -- and the latency was fine. It's very possible this is
entirely a limitation of the benchmark: it should report sample times more
frequently, with more samples captured overall, so that it still runs for a
decent length of time.
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c19

--- Comment #19 from Jan Kara <jack@suse.com> ---
(In reply to Mel Gorman from comment #16)
> As an aside, how did you measure the transaction commit times, tracing
> jbd2_end_commit+jbd2_start_commit and subtracting the difference?
Yes.
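[Editorial note: a minimal sketch of such a measurement, with assumptions:
tracefs mounted at /sys/kernel/debug/tracing, the jbd2_start_commit and
jbd2_end_commit tracepoints enabled beforehand via their
events/jbd2/.../enable files, and the default trace_pipe line layout with the
timestamp as the fourth whitespace-separated field.]

    /*
     * Pairs jbd2_start_commit with jbd2_end_commit by transaction ID
     * and prints the difference as the commit time. Assumes tracefs is
     * mounted at /sys/kernel/debug/tracing, both tracepoints are
     * enabled, and the default trace_pipe format
     * "task-pid [cpu] flags timestamp: event: ...".
     */
    #include <stdio.h>
    #include <string.h>

    #define SLOTS 65536

    int main(void)
    {
            static double start[SLOTS]; /* start time, keyed by tid % SLOTS */
            char line[512];
            FILE *tp = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");

            if (!tp) {
                    perror("trace_pipe");
                    return 1;
            }
            while (fgets(line, sizeof(line), tp)) {
                    double ts;
                    unsigned int tid;
                    char *p;

                    /* Fourth field is the timestamp, e.g. "123.456789:". */
                    if (sscanf(line, "%*s %*s %*s %lf", &ts) != 1)
                            continue;
                    if ((p = strstr(line, "jbd2_start_commit:")) &&
                        sscanf(p, "jbd2_start_commit: dev %*u,%*u"
                                  " transaction %u", &tid) == 1)
                            start[tid % SLOTS] = ts;
                    else if ((p = strstr(line, "jbd2_end_commit:")) &&
                             sscanf(p, "jbd2_end_commit: dev %*u,%*u"
                                       " transaction %u", &tid) == 1 &&
                             start[tid % SLOTS] > 0.0)
                            printf("transaction %u committed in %.6fs\n",
                                   tid, ts - start[tid % SLOTS]);
            }
            fclose(tp);
            return 0;
    }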
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c20

--- Comment #20 from Jan Kara <jack@suse.com> ---
Regarding the read starvation - I didn't observe anything like that in my
testing, although I do agree that after my patches there were fewer read
samples, since writes weren't backlogged as much behind a blocked journal.
I'll try to run your version of the benchmark, see what happens, and possibly
come up with better parameters for the benchmark.
http://bugzilla.suse.com/show_bug.cgi?id=1030310#c21

--- Comment #21 from Jan Kara <jack@suse.com> ---
Ah, now I've realized what the problem you're observing could be! The output
from the pgioperf processes is cached in glibc's stdio buffers, and these can
hold a significant amount of output. In my private version of pgioperf I
fflush() the output after each line, which deals with the issue for me...
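[Editorial note: a minimal illustration of that fix; the reporting helper
below is hypothetical, the point is only the fflush() after each line.]

    /*
     * Hypothetical reporting helper: fflush() after each line so that
     * samples leave glibc's stdio buffer immediately instead of
     * sitting there until the process exits.
     */
    #include <stdio.h>
    #include <unistd.h>

    static void report(const char *kind, double avg_ms, double max_ms)
    {
            printf("%s[%d]: avg: %.1f msec; max: %.1f msec\n",
                   kind, getpid(), avg_ms, max_ms);
            fflush(stdout); /* don't let samples linger in the buffer */
    }

    int main(void)
    {
            report("wal", 0.0, 6.1);
            return 0;
    }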
http://bugzilla.suse.com/show_bug.cgi?id=1030310 http://bugzilla.suse.com/show_bug.cgi?id=1030310#c22 Jan Kara <jack@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|IN_PROGRESS |RESOLVED Resolution|--- |FIXED --- Comment #22 from Jan Kara <jack@suse.com> --- I have pushed the fix that got merged into ext4 tree to users/jack/SLE12-SP3/for-next (as patches.fixes/ext4-avoid-unnecessary-transaction-stalls-during-writeback.patch). Closing the bug. -- You are receiving this mail because: You are on the CC list for the bug.