[opensuse-kernel] after upgrade from 11.2 to 12.1: disk io hog/starvation issue on HW Raid ext4
Hi y'all,

I've been directed here by the openSUSE forums about a problem we are having with our server since we upgraded from openSUSE 11.2 (kernel 2.6.31, I believe) to 12.1.

The problem is that one process can hog all disk IO and starve the others. For example, a Progress database restore of a multi-GB DB starves everything else, for example mysqld. We see fsync latencies for mysqld of 15s+ with the cfq block IO scheduler, and still 5s+ with the deadline scheduler and read_expire reduced to 20ms. I've been unable to reduce the latency for other processes any further.

Our guess at the culprit is that the improvement made in 2.6.37 for SMP ext4 block IO throughput (300-400% according to the Linux 2.6.37 page on Kernel Newbies) has made it possible for one process to be fast enough to create this starvation problem. Or maybe some kernel bug.

Anybody have any pointers on how to rein in disk IO hogs in 3.1?

best regards
remo

Some info about the server: Dell T710 with 2 Xeon 6-core procs, 48GB memory, 6x300GB disks in RAID10 on an H700 RAID controller.

We didn't mess with many default SUSE kernel values, except swappiness, the default blocksize of the tape driver, the max semaphore and shared memory segment values (/proc/sys/kernel/shmmax, shmmni, shmall), and of course the IO scheduler, as the deadline scheduler makes the system less unusable...

I'll gladly provide any other info y'all might need to help us improve this starvation issue.
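P.S. In case it helps, roughly the kind of commands involved on our side (sda is our RAID volume; the strace line is just one way to watch the per-call fsync times, not necessarily the best):

  # watch how long mysqld spends in fsync/fdatasync (-T prints time spent per syscall)
  strace -f -T -e trace=fsync,fdatasync -p $(pidof mysqld)

  # current scheduler, and the deadline knob we lowered
  cat /sys/block/sda/queue/scheduler
  echo 20 > /sys/block/sda/queue/iosched/read_expire    # in ms, default is 500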
Hello,

On Wed 18-01-12 10:32:58, rst@suissimage.ch wrote:
> I've been directed here by the openSUSE forums about a problem we are having with our server since we upgraded from openSUSE 11.2 (kernel 2.6.31, I believe) to 12.1.
> The problem is that one process can hog all disk IO and starve the others. For example, a Progress database restore of a multi-GB DB starves everything else, for example mysqld. We see fsync latencies for mysqld of 15s+ with the cfq block IO scheduler, and still 5s+ with the deadline scheduler and read_expire reduced to 20ms.
OK, I presume you used ext4 in both 11.2 and 12.1, didn't you? Also, what were the fsync latencies with 11.2? And what is the size of the restored file (in particular in comparison with the amount of memory)?

> I've been unable to reduce the latency for other processes any further.
> Our guess at the culprit is that the improvement made in 2.6.37 for SMP ext4 block IO throughput (300-400% according to Kernel Newbies) has made it possible for one process to be fast enough to create this starvation problem.
I don't think that change was the reason (if you mean commit bd2d0210). The claimed throughput improvement can be observed only for a big number of threads (in the buffer layer they contend more for locks), but that does not seem to be your problem. So I'd rather suspect changes in fsync() handling (we send a disk cache flush more often and force a transaction commit more often in the 3.1 kernel - the 2.6.31 kernel had bugs and didn't properly assure all data is on disk after fsync), or maybe some changes in the writeback code.

> Or maybe some kernel bug.
> Anybody have any pointers on how to rein in disk IO hogs in 3.1?
> Some info about the server: Dell T710 with 2 Xeon 6-core procs, 48GB memory, 6x300GB disks in RAID10 on an H700 RAID controller.
If the server has a UPS, so you are certain power cannot just abruptly fail, you can mount the filesystem with the nobarrier mount option. That will probably speed up your IO.

> We didn't mess with many default SUSE kernel values, except swappiness, the default blocksize of the tape driver, the max semaphore and shared memory segment values (/proc/sys/kernel/shmmax, shmmni, shmall), and of course the IO scheduler, as the deadline scheduler makes the system less unusable...
> I'll gladly provide any other info y'all might need to help us improve this starvation issue.
If you cannot use nobarrier, or it does not help, you can use 'blktrace' to record what's going on in the IO scheduler while fsync is hanging. I'm not sure how reproducible the big fsync latencies are, but from your report it seems they are rather common. So just start:

  blktrace -d <device>

run the DB restore to trigger big latencies, and after some long fsync occurs stop blktrace, pack the resulting files, and attach them to a bugzilla you create for this ;) Feel free to assign it to me (jack@suse.com) so that it does not get missed.
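Concretely, a capture session could look roughly like this (the device name and the file prefix are only examples):

  # record block-layer events for the RAID volume into files named restore-trace.blktrace.<cpu>
  blktrace -d /dev/sda -o restore-trace

  # ... reproduce the slow fsync, then stop blktrace with Ctrl-C ...

  # pack the per-CPU trace files for the bug report
  tar czf restore-trace.tar.gz restore-trace.blktrace.*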
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
On 18/01/12 11:40, Jan Kara wrote:
>> Some info about the server: Dell T710 with 2 Xeon 6-core procs, 48GB memory, 6x300GB disks in RAID10 on an H700 RAID controller.
> If the server has a UPS, so you are certain power cannot just abruptly fail, you can mount the filesystem with the nobarrier mount option. That will probably speed up your IO.
Also try the deadline and the noop scheduler, and see what difference you get.
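For example (assuming the array shows up as sda), you can flip schedulers on the fly and check which one is active:

  cat /sys/block/sda/queue/scheduler          # the active scheduler is shown in brackets
  echo noop > /sys/block/sda/queue/scheduler
  echo deadline > /sys/block/sda/queue/scheduler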
Hi Jan, Maurice and Cristian,

thanks a lot for the quick replies.

Yes, we were using ext4 on 11.2 as well. We had no noticeable latencies with 11.2, so I don't know exact values. The restored DB file is 20GB on a machine with 48GB of RAM.

We are already using the deadline scheduler. Noop seemed worse, but I didn't make any quantitative measurements. Cfq was definitely waaay worse.

Interesting what you wrote about the bug with fsync in 2.6.31. If fsync didn't work as it should have, then maybe that's why massive write IO by one process didn't impact the others as much.

I had already planned to try the nobarrier mount option; glad y'all are recommending it as well. Seeing we have a UPS and a BBU on the RAID card, we should be fine ;-).

Just remounted our filesystems with barrier=0. It didn't help much, if at all.

So I did some initial blktrace, blkparse and btt runs, and boy does that deliver loads of numbers. Here are the first few lines from a btt run over the combined trace, and the tail of the blkparse output:

Total (sda):
 Reads Queued:            979,     8,332KiB
 Writes Queued:        60,185,    17,860MiB
 Read Dispatches:         943,     8,332KiB
 Write Dispatches:     58,089,    17,860MiB
 Reads Requeued:            0
 Writes Requeued:           0
 Reads Completed:         941,     8,324KiB
 Writes Completed:     58,089,    17,860MiB
 Read Merges:              35,       636KiB
 Write Merges:          2,096,    10,368KiB
 IO unplugs:              337
 Timer unplugs:             0

Throughput (R/W): 142KiB/s / 305,389KiB/s
Events (sda): 423,130 entries
Skips: 0 forward (0 - 0.0%)

300MB/s write ain't that bad for 6x300GB 10K RPM SAS drives in RAID10. I am not sure our system was any faster under 11.2/2.6.31.

==================== All Devices ====================

ALL          MIN             AVG             MAX            N
Q2Q     0.000000163     0.000956222     1.197385582      61163
Q2G     0.000000175     0.000348236     0.061476246    1416792
S2G     0.000854097     0.028974235     0.061474444      16992
G2I     0.000000246     0.000002060     0.003133293    1416792
Q2M     0.000000139     0.000000233     0.000007519      51144
I2D     0.000000118     0.006148337     0.048895925    1416792
M2D     0.000001943     0.017204386     0.041711330      51144
D2C     0.000022381     0.078485527     1.721045923      61162
Q2C     0.000023937     0.085357206     1.721609484      61162

==================== Device Overhead ====================

DEV        |       Q2G       G2I       Q2M        I2D       D2C
(  8,  0)  |   9.4506%   0.0559%   0.0002%  166.8560%  91.9495%
Overall    |   9.4506%   0.0559%   0.0002%  166.8560%  91.9495%

==================== Device Merge Information ====================

DEV        |      #Q      #D  Ratio |  BLKmin  BLKavg  BLKmax      Total
(  8,  0)  | 1416768 1416768    1.0 |       8     605     640  857713920

What worries/puzzles me here is the Device Merge Ratio of 1.0... If that means what I fear it means, then that might be the cause. Now about fixing that... Maybe some buffers or queues have wrong values? Googling for "btt device merge ratio very low" didn't give me much.

In the meantime, here is a first update. I'll wait with creating a bug report and uploading the blktrace, as that is 22MB of data - and maybe the merge ratio above already is the cause or a good enough pointer towards it.

thanks a lot for the help
Remo
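P.S. For completeness, the remount and the analysis pipeline were roughly the following (the mount point and file names are just examples from my notes):

  # turn off write barriers on the already mounted data filesystem
  mount -o remount,barrier=0 /data

  # merge the per-CPU blktrace files and dump a combined binary trace for btt
  blkparse -i restore-trace -d restore-trace.bin > restore-trace.txt

  # summary statistics (Q2Q, Q2C, merge ratio, ...) from the combined trace
  btt -i restore-trace.bin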
Hello,

On Thu 19-01-12 19:01:18, rst@suissimage.ch wrote:
> Yes, we were using ext4 on 11.2 as well. We had no noticeable latencies with 11.2, so I don't know exact values. The restored DB file is 20GB on a machine with 48GB of RAM.
OK.

> We are already using the deadline scheduler. Noop seemed worse, but I didn't make any quantitative measurements. Cfq was definitely waaay worse.
> Interesting what you wrote about the bug with fsync in 2.6.31. If fsync didn't work as it should have, then maybe that's why massive write IO by one process didn't impact the others as much.
> I had already planned to try the nobarrier mount option; glad y'all are recommending it as well. Seeing we have a UPS and a BBU on the RAID card, we should be fine ;-).
> Just remounted our filesystems with barrier=0. It didn't help much, if at all.
OK, so that's one thing less to care about.

> So I did some initial blktrace, blkparse and btt runs, and boy does that deliver loads of numbers.
> Here are the first few lines from a btt run over the combined trace, and the tail of the blkparse output:
>
> Total (sda):
>  Reads Queued:            979,     8,332KiB
>  Writes Queued:        60,185,    17,860MiB
>  Read Dispatches:         943,     8,332KiB
>  Write Dispatches:     58,089,    17,860MiB
>  Reads Requeued:            0
>  Writes Requeued:           0
>  Reads Completed:         941,     8,324KiB
>  Writes Completed:     58,089,    17,860MiB
>  Read Merges:              35,       636KiB
>  Write Merges:          2,096,    10,368KiB
>  IO unplugs:              337
>  Timer unplugs:             0
>
> Throughput (R/W): 142KiB/s / 305,389KiB/s
> Events (sda): 423,130 entries
> Skips: 0 forward (0 - 0.0%)
>
> 300MB/s write ain't that bad for 6x300GB 10K RPM SAS drives in RAID10. I am not sure our system was any faster under 11.2/2.6.31.
Yeah, 300MB/s looks reasonable. That's 100MB/s per drive. You could maybe do more with good SAS drives, but it's definitely not going to be the difference between "not noticeable latency" and "15 second latency". So I don't think the latency is caused by a drop in throughput as such.
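If you ever want a quick sanity check of raw sequential write throughput independent of the database, something like this is usually enough (the target path is just an example on the filesystem in question):

  # writes 4GB and forces the data to disk before reporting the rate
  dd if=/dev/zero of=/data/ddtest bs=1M count=4096 conv=fdatasync
  rm /data/ddtest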
> ==================== All Devices ====================
>
> ALL          MIN             AVG             MAX            N
> Q2Q     0.000000163     0.000956222     1.197385582      61163
> Q2G     0.000000175     0.000348236     0.061476246    1416792
> S2G     0.000854097     0.028974235     0.061474444      16992
> G2I     0.000000246     0.000002060     0.003133293    1416792
> Q2M     0.000000139     0.000000233     0.000007519      51144
> I2D     0.000000118     0.006148337     0.048895925    1416792
> M2D     0.000001943     0.017204386     0.041711330      51144
> D2C     0.000022381     0.078485527     1.721045923      61162
> Q2C     0.000023937     0.085357206     1.721609484      61162
>
> ==================== Device Overhead ====================
>
> DEV        |       Q2G       G2I       Q2M        I2D       D2C
> (  8,  0)  |   9.4506%   0.0559%   0.0002%  166.8560%  91.9495%
> Overall    |   9.4506%   0.0559%   0.0002%  166.8560%  91.9495%
>
> ==================== Device Merge Information ====================
>
> DEV        |      #Q      #D  Ratio |  BLKmin  BLKavg  BLKmax      Total
> (  8,  0)  | 1416768 1416768    1.0 |       8     605     640  857713920
>
> What worries/puzzles me here is the Device Merge Ratio of 1.0... If that means what I fear it means, then that might be the cause. Now about fixing that... Maybe some buffers or queues have wrong values?
I don't think that's a problem. The average write request is 302 KB, which isn't bad (512 KB is the maximum), and the throughput isn't bad either. If we had too-small requests, throughput would suffer.

Other numbers look pretty normal as well. So on average we are doing well. It's just that fsync takes longer than it used to.

> In the meantime, here is a first update. I'll wait with creating a bug report and uploading the blktrace, as that is 22MB of data - and maybe the merge ratio above already is the cause or a good enough pointer towards it.
One more question - can you run 'echo w >/proc/sysrq-trigger' at the moment fsync is hanging, then take the output of 'dmesg' and add it to the bug as well? From that we should see what exactly fsync is waiting on. If you cannot easily detect when fsync is hanging, just sample /proc/<pid>/stack of the process whose fsync sometimes hangs, and also of the flush-8:0 and jbd2/sda processes, every second or so.
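A crude sampling loop along these lines should do (process names are examples; the jbd2 thread is usually named after the partition, e.g. jbd2/sda3-8, and a threaded process like mysqld may need /proc/<pid>/task/*/stack to cover all its threads):

  # untested sketch: dump the kernel stacks of the interesting tasks once per second
  while true; do
      date
      for pid in $(pidof mysqld) $(pgrep 'flush-8:0') $(pgrep 'jbd2/sda'); do
          echo "== pid $pid =="
          cat /proc/$pid/stack
      done
      sleep 1
  done > stack-samples.txt 2>&1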
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
blktrace / seekwatcher are good places to start. See:

https://github.com/znmeb/LinuxCon2009/tree/master/Linux_Server_Profiling_Usi...
https://github.com/znmeb/LinuxCon2009/tree/master/Modeling_the_Linux_Block_I...
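For example, turning an existing blktrace capture into a graph should be roughly this (the trace prefix and output name are placeholders):

  # plot seeks, throughput and IOPS over time from a recorded blktrace
  seekwatcher -t restore-trace -o restore-trace.png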
--
http://twitter.com/znmeb http://borasky-research.net
"A mathematician is a device for turning coffee into theorems." -- Paul Erdős
participants (4)
- Cristian Rodríguez
- Jan Kara
- M. Edward (Ed) Borasky
- rst@suissimage.ch