
Hi Jan, Maurice and Cristian,

thanks a lot for the quick replies.

Yes, we were using ext4 on 11.2 as well. We had no noticeable latencies with 11.2, so I don't know exact values. The restored DB file is 20GB, on a machine with 48GB of RAM.

We are already using the deadline scheduler. Noop seemed worse, though I didn't make any quantitative measurements. Cfq was definitely waaay worse.

Interesting what you wrote about the fsync bug in 2.6.31. If fsync didn't work as it should have, then maybe that's why massive write io by one process didn't impact the others as much.

I had already planned to try the nobarrier mount option; glad y'all are recommending it as well. Seeing we have a UPS and a BBU on the Raid Card, we should be fine ;-). I just remounted our filesystems with barrier=0. It didn't help much, if at all.

So I did some initial blktrace, blkparse, btt runs, and boy does that deliver loads of numbers. Here are the summary tail from blkparse and the first sections from a btt over the combined trace:

Total (sda):
 Reads Queued:         979,    8,332KiB   Writes Queued:      60,185,   17,860MiB
 Read Dispatches:      943,    8,332KiB   Write Dispatches:   58,089,   17,860MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:      941,    8,324KiB   Writes Completed:   58,089,   17,860MiB
 Read Merges:           35,      636KiB   Write Merges:        2,096,   10,368KiB
 IO unplugs:           337                Timer unplugs:           0

Throughput (R/W): 142KiB/s / 305,389KiB/s
Events (sda): 423,130 entries
Skips: 0 forward (0 - 0.0%)

300MB/s write ain't that bad for 6x300GB 10KRpm SAS drives in RAID10. I am not sure our system was any faster under 11.2/2.6.31.

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Q       0.000000163   0.000956222   1.197385582       61163
Q2G       0.000000175   0.000348236   0.061476246     1416792
S2G       0.000854097   0.028974235   0.061474444       16992
G2I       0.000000246   0.000002060   0.003133293     1416792
Q2M       0.000000139   0.000000233   0.000007519       51144
I2D       0.000000118   0.006148337   0.048895925     1416792
M2D       0.000001943   0.017204386   0.041711330       51144
D2C       0.000022381   0.078485527   1.721045923       61162
Q2C       0.000023937   0.085357206   1.721609484       61162

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   9.4506%   0.0559%   0.0002% 166.8560%  91.9495%
---------- | --------- --------- --------- --------- ---------
   Overall |   9.4506%   0.0559%   0.0002% 166.8560%  91.9495%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax     Total
---------- | -------- -------- ------- | -------- -------- -------- ---------
 (  8,  0) |  1416768  1416768     1.0 |        8      605      640 857713920

What worries/puzzles me here is the Device Merge Ratio of 1.0... If that means what I fear it means, then that might be the cause. Now about fixing that... Maybe some buffers or queues have wrong values? Googling for "btt device merge ratio very low" didn't give me much.

In the meantime, this first update. I will hold off on creating a bug report and uploading the blktrace, as that one is 22MB of data, and maybe the merge ratio above already is the cause, or at least a good enough pointer towards it.

Thanks a lot for the help,
Remo
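PS: For reference, the post-processing of the trace was roughly along these lines (a sketch only - the output prefix and options are illustrative, not the exact invocation):

  # merge the per-CPU blktrace files into text output and a binary
  # stream that btt can read
  blkparse -i sda -d sda.bin > sda.txt

  # per-phase latency breakdown (Q2Q, Q2C, D2C, merge ratio, ...)
  btt -i sda.bin > sda_btt.txt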
-----Original Message-----
From: Jan Kara [mailto:jack@suse.cz]
Sent: Wednesday, 18 January 2012 15:40
To: Remo Strotkamp
Cc: opensuse-kernel@opensuse.org
Subject: Re: [opensuse-kernel] after upgrade from 11.2 to 12.1: disk io hog/starvation issue on HW Raid ext4
Hello,
On Wed 18-01-12 10:32:58, rst@suissimage.ch wrote:
> I've been directed here by the opensuse forums about a problem we are having with our server since we upgraded from opensuse 11.2 (kernel 2.6.31, I believe) to 12.1.
> The problem is that one process can hog all disk io and starve the others. For example, a progress database restore of a multi-GB DB starves all others, for example mysqld. We see latencies on fsync for mysqld of 15s+ with the cfq block io scheduler, and still 5s+ with the deadline block io scheduler and read_expire reduced to 20ms.

OK, I presume you used ext4 in both 11.2 and 12.1, didn't you? Also, what were the fsync latencies with 11.2? And what is the size of the restored file (in particular in comparison with the amount of memory)?
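(Side note on the tuning mentioned in the quote above, for anyone reproducing it - sda is assumed and the values are simply the ones quoted, not a recommendation:)

  cat /sys/block/sda/queue/scheduler                    # lists available schedulers, active one in []
  echo deadline > /sys/block/sda/queue/scheduler        # switch to deadline
  echo 20 > /sys/block/sda/queue/iosched/read_expire    # default is 500 (ms)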
> Been unable to reduce latency for other processes any further.
> Our guess as to the culprit is that the improvement made in 2.6.37 for smp ext4 block io throughput (300-400% according to "Linux 2.6.37 - Linux Kernel Newbies") has made it possible for one process to be that fast and created this starvation problem.

I don't think that change was the reason (if you mean commit bd2d0210). The claimed throughput improvement can be observed only for a big number of threads (in the buffer layer they contend more for locks), but that does not seem to be your problem. So I'd rather suspect changes in fsync() handling (we send disk cache flushes more often and force transaction commits more often in the 3.1 kernel - the 2.6.31 kernel had bugs and didn't properly assure all data is on disk after fsync), or maybe some changes in the writeback code.
> Or maybe some kernel bug.
> Anybody have any pointers about how to rein in disk-io hogs in 3.1?
> Some info about the server: Dell T710 with 2 Xeon 6-core procs, 48GB memory, 6x300GB disks in RAID10 on an H700 Raid Controller.

If the server has a UPS, so you are certain power cannot just abruptly fail, you can mount the filesystem with the nobarrier mount option. That will probably speed up your IO.
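(A minimal sketch of that suggestion, assuming an ext4 filesystem mounted at /data - the mount point is a placeholder, and only with UPS + BBU as discussed:)

  # switch off write barriers on a mounted ext4 filesystem (until the next remount)
  mount -o remount,nobarrier /data

  # to make it persistent, add "nobarrier" to the options field
  # of the filesystem's line in /etc/fstab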
> We didn't mess with many default Suse kernel values, except swappiness, the default blocksize of the tape driver, and the max semaphore and shared memory segment values (/proc/sys/kernel/shmmax, shmmni, shmall). And of course the io scheduler, as the deadline scheduler makes the system less unusable...
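(For completeness, the tunables quoted above are the usual sysctl knobs; a sketch with purely illustrative values, not the ones actually used:)

  sysctl -w vm.swappiness=10           # default is 60
  sysctl -w kernel.shmmax=34359738368  # max size of a single shared memory segment, bytes
  sysctl -w kernel.shmall=8388608      # total shared memory, in pages
  sysctl -w kernel.shmmni=4096         # max number of segments
  # put the same key=value lines into /etc/sysctl.conf to survive a reboot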
> I'll gladly provide any other info y'all might need to help us improve this starvation issue.

If you cannot use nobarrier, or it does not help, you can use 'blktrace' to record what's going on in the IO scheduler while fsync is hanging. I'm not sure how reproducible the big fsync latencies are, but from your report it seems they are rather common. So just start:

  blktrace -d <device>

and run the DB restore to trigger big latencies; after some long fsync occurs, stop blktrace, pack the resulting files, and attach them to a bugzilla you create for this ;) Feel free to assign it to me (jack@suse.com) so that it does not get missed.
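(Spelled out, that capture workflow might look roughly like this; the device name and file names are placeholders:)

  # record block-layer events for the device while the restore runs
  blktrace -d /dev/sda -o fsync-trace

  # ...reproduce a long fsync, then stop blktrace with Ctrl-C...

  # pack the per-CPU trace files for the bug report
  tar czf fsync-trace.tar.gz fsync-trace.blktrace.*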
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org