
Hi Jan, Maurice and Cristian,

thanks a lot for the quick replies.

Yes, we were using ext4 on 11.2 as well. We had no noticeable latencies with 11.2, so I don't know exact values. The restored DB file is 20GB, on a machine with 48GB of RAM.

We are already using the deadline scheduler. Noop seemed worse, though I didn't make any quantitative measurements. Cfq was definitely waaay worse.

Interesting what you wrote about the fsync bug in 2.6.31. If fsync didn't work as it should have, then maybe that's why massive write io by one process didn't impact the others as much.

I had already planned to try the nobarrier mount option; glad y'all are recommending it as well. Seeing we have a UPS and a BBU on the Raid Card, we should be fine ;-). I just remounted our filesystems with barrier=0. It didn't help much, if at all.

So I did some initial blktrace, blkparse, btt runs, and boy does that deliver loads of numbers. Here are the summary tail from blkparse and the first sections from a btt over the combined trace:

Total (sda):
 Reads Queued:         979,    8,332KiB   Writes Queued:      60,185,   17,860MiB
 Read Dispatches:      943,    8,332KiB   Write Dispatches:   58,089,   17,860MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:      941,    8,324KiB   Writes Completed:   58,089,   17,860MiB
 Read Merges:           35,      636KiB   Write Merges:        2,096,   10,368KiB
 IO unplugs:           337                Timer unplugs:           0

Throughput (R/W): 142KiB/s / 305,389KiB/s
Events (sda): 423,130 entries
Skips: 0 forward (0 - 0.0%)

300MB/s write ain't that bad for 6x300GB 10KRpm SAS drives in RAID10. I am not sure our system was any faster under 11.2/2.6.31.

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Q       0.000000163   0.000956222   1.197385582       61163
Q2G       0.000000175   0.000348236   0.061476246     1416792
S2G       0.000854097   0.028974235   0.061474444       16992
G2I       0.000000246   0.000002060   0.003133293     1416792
Q2M       0.000000139   0.000000233   0.000007519       51144
I2D       0.000000118   0.006148337   0.048895925     1416792
M2D       0.000001943   0.017204386   0.041711330       51144
D2C       0.000022381   0.078485527   1.721045923       61162
Q2C       0.000023937   0.085357206   1.721609484       61162

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   9.4506%   0.0559%   0.0002% 166.8560%  91.9495%
---------- | --------- --------- --------- --------- ---------
   Overall |   9.4506%   0.0559%   0.0002% 166.8560%  91.9495%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax     Total
---------- | -------- -------- ------- | -------- -------- -------- ---------
 (  8,  0) |  1416768  1416768     1.0 |        8      605      640 857713920

What worries/puzzles me here is the Device Merge Ratio of 1.0... If that means what I fear it means, then that might be the cause. Now about fixing that... Maybe some buffers or queues have wrong values? Googling for "btt device merge ratio very low" didn't give me much.

In the meantime, this first update. I will hold off on creating a bug report and uploading the blktrace, as that one is 22MB of data, and maybe the merge ratio above already is the cause, or at least a good enough pointer towards it.

Thanks a lot for the help,
Remo
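PS: For reference, the post-processing of the trace was roughly along these lines (a sketch only - the output prefix and options are illustrative, not the exact invocation):

  # merge the per-CPU blktrace files into text output and a binary
  # stream that btt can read
  blkparse -i sda -d sda.bin > sda.txt

  # per-phase latency breakdown (Q2Q, Q2C, D2C, merge ratio, ...)
  btt -i sda.bin > sda_btt.txt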
-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]
Sent: Wednesday, 18 January 2012 15:40
To: Remo Strotkamp
Cc: opensuse-kernel@opensuse.org
Subject: Re: [opensuse-kernel] after upgrade from 11.2 to 12.1: disk io hog/starvation issue on HW Raid ext4
Hello,
On Wed 18-01-12 10:32:58, rst@suissimage.ch wrote:
> I've been directed here by the opensuse forums about a problem we are having with our server since we upgraded from opensuse 11.2 (kernel 2.6.31, I believe) to 12.1.
> The problem is that one process can hog all disk io and starve the others. For example, a progress database restore of a multi-GB DB starves all others, for example mysqld. We see latencies on fsync for mysqld of 15s+ with the cfq block io scheduler, and still 5s+ with the deadline block io scheduler and read_expire reduced to 20ms.

OK, I presume you used ext4 in both 11.2 and 12.1, didn't you? Also, what were the fsync latencies with 11.2? And what is the size of the restored file (in particular in comparison with the amount of memory)?
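(Side note on the tuning mentioned in the quote above, for anyone reproducing it - sda is assumed and the values are simply the ones quoted, not a recommendation:)

  cat /sys/block/sda/queue/scheduler                    # lists available schedulers, active one in []
  echo deadline > /sys/block/sda/queue/scheduler        # switch to deadline
  echo 20 > /sys/block/sda/queue/iosched/read_expire    # default is 500 (ms)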
> Been unable to reduce latency for other processes any further.
> Our guess as to the culprit is that the improvement made in 2.6.37 for smp ext4 block io throughput (300-400% according to "Linux 2.6.37 - Linux Kernel Newbies") has made it possible for one process to be that fast and created this starvation problem.

I don't think that change was the reason (if you mean commit bd2d0210). The claimed throughput improvement can be observed only for a big number of threads (in the buffer layer they contend more for locks), but that does not seem to be your problem. So I'd rather suspect changes in fsync() handling (we send disk cache flushes more often and force transaction commits more often in the 3.1 kernel - the 2.6.31 kernel had bugs and didn't properly assure all data is on disk after fsync), or maybe some changes in the writeback code.
> Or maybe some kernel bug.
> Anybody have any pointers about how to rein in disk-io hogs in 3.1?
> Some info about the server: Dell T710 with 2 Xeon 6-core procs, 48GB memory, 6x300GB disks in RAID10 on an H700 Raid Controller.

If the server has a UPS, so you are certain power cannot just abruptly fail, you can mount the filesystem with the nobarrier mount option. That will probably speed up your IO.
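(A minimal sketch of that suggestion, assuming an ext4 filesystem mounted at /data - the mount point is a placeholder, and only with UPS + BBU as discussed:)

  # switch off write barriers on a mounted ext4 filesystem (until the next remount)
  mount -o remount,nobarrier /data

  # to make it persistent, add "nobarrier" to the options field
  # of the filesystem's line in /etc/fstab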
> We didn't mess with many default Suse kernel values, except swappiness, the default blocksize of the tape driver, and the max semaphore and shared memory segment values (/proc/sys/kernel/shmmax, shmmni, shmall). And of course the io scheduler, as the deadline scheduler makes the system less unusable...
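(For completeness, the tunables quoted above are the usual sysctl knobs; a sketch with purely illustrative values, not the ones actually used:)

  sysctl -w vm.swappiness=10           # default is 60
  sysctl -w kernel.shmmax=34359738368  # max size of a single shared memory segment, bytes
  sysctl -w kernel.shmall=8388608      # total shared memory, in pages
  sysctl -w kernel.shmmni=4096         # max number of segments
  # put the same key=value lines into /etc/sysctl.conf to survive a reboot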
> I'll gladly provide any other info y'all might need to help us improve this starvation issue.

If you cannot use nobarrier, or it does not help, you can use 'blktrace' to record what's going on in the IO scheduler while fsync is hanging. I'm not sure how reproducible the big fsync latencies are, but from your report it seems they are rather common. So just start:

  blktrace -d <device>

and run the DB restore to trigger big latencies; after some long fsync occurs, stop blktrace, pack the resulting files, and attach them to a bugzilla you create for this ;) Feel free to assign it to me (jack@suse.com) so that it does not get missed.
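(Spelled out, that capture workflow might look roughly like this; the device name and file names are placeholders:)

  # record block-layer events for the device while the restore runs
  blktrace -d /dev/sda -o fsync-trace

  # ...reproduce a long fsync, then stop blktrace with Ctrl-C...

  # pack the per-CPU trace files for the bug report
  tar czf fsync-trace.tar.gz fsync-trace.blktrace.*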
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org