OK, I've been investigating this issue further. In the end, the culprit behind the stalls is transaction commit time. During the benchmark run the average commit time is ~18s with a standard deviation of ~41s! The top 5 commit times are:

274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s

The reason transaction commits take so long (although the transactions are pretty small) is that the flusher worker holds a transaction handle open in ext4_writepages() while writing back pages. This writeback gets throttled by CFQ, so it takes a long time for ext4_writepages() to complete and thus for the transaction handle to be dropped, which in turn is what allows the transaction commit to complete.

A relatively simple solution to this problem is to start a transaction only once we find a page that needs block allocation / extent conversion in ext4_writepages(). With this change transaction commit times drop to 0.1s on average with a standard deviation of 0.15s, and the top 5 commit times are:

0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s

The benchmark numbers themselves also look better after the change. For reads the results look like:

read[23390]: avg: 10.7 msec; max: 358.5 msec
read[23387]: avg: 10.7 msec; max: 358.8 msec
read[23394]: avg: 10.7 msec; max: 358.9 msec
read[23392]: avg: 10.7 msec; max: 358.6 msec
read[23395]: avg: 10.7 msec; max: 358.6 msec
read[23382]: avg: 10.7 msec; max: 358.7 msec
read[23381]: avg: 10.7 msec; max: 358.9 msec
read[23385]: avg: 10.7 msec; max: 358.4 msec
read[23393]: avg: 10.7 msec; max: 359.0 msec
read[23389]: avg: 10.7 msec; max: 358.6 msec
read[23388]: avg: 10.7 msec; max: 358.7 msec
read[23386]: avg: 10.7 msec; max: 358.3 msec
read[23396]: avg: 10.7 msec; max: 359.0 msec
read[23383]: avg: 10.7 msec; max: 358.5 msec
read[23391]: avg: 10.7 msec; max: 358.9 msec
read[23384]: avg: 10.7 msec; max: 359.0 msec

with maximum observed read latency ~500 msec. Average wal times are 0.0 msec, with maximums in the 10-20 msec range and one 300 msec sample.
Commit times also look reasonable. Averages are in the 30-50 msec range and maximums peak at 10 seconds - that's still quite big, but an order of magnitude better than with the unpatched kernel.
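
To illustrate the idea of starting the transaction lazily, here is a minimal user-space sketch. It is not the actual patch and does not use any real ext4 or jbd2 interfaces: struct sim_page, struct sim_stats, writeback_eager() and writeback_lazy() are hypothetical names made up for this illustration. It assumes that, once started, the handle is held until the end of the writeback pass. The sketch only counts how many pages get written while a handle is held, which is what keeps a running transaction from committing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page cache page; not a kernel type. */
struct sim_page {
	bool dirty;        /* page has to be written back */
	bool needs_alloc;  /* needs block allocation / extent conversion */
};

/* Counters describing one simulated writeback pass. */
struct sim_stats {
	int handles_started;     /* how many times a handle was started */
	int pages_under_handle;  /* pages written while a handle was held */
	int pages_written;       /* pages written in total */
};

/* Old behaviour: start the handle up front and hold it for the whole pass. */
static struct sim_stats writeback_eager(const struct sim_page *pages, size_t n)
{
	struct sim_stats st = { 0, 0, 0 };

	assert(pages != NULL || n == 0);
	st.handles_started = 1;
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		st.pages_written++;
		st.pages_under_handle++;  /* every write pins the transaction */
	}
	return st;
}

/*
 * Proposed behaviour (sketch): start the handle only once we hit the first
 * page that needs allocation. Pure overwrites never start a handle at all.
 */
static struct sim_stats writeback_lazy(const struct sim_page *pages, size_t n)
{
	struct sim_stats st = { 0, 0, 0 };
	bool have_handle = false;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		if (!have_handle && pages[i].needs_alloc) {
			have_handle = true;
			st.handles_started++;
		}
		st.pages_written++;
		if (have_handle)
			st.pages_under_handle++;
	}
	return st;
}
```

For a pass over pages that are all plain overwrites, writeback_lazy() never starts a handle, so throttled writeback of those pages cannot delay a transaction commit in this model.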