[opensuse] Raid5/LVM2/XFS alignment
Hi, does anybody have some notes about tuning md RAID5, LVM and XFS? I'm getting 20 MB/s with dd and I think it can be improved. I'll add config parameters as soon as I get home. I'm using md RAID5 on a motherboard with an NVIDIA SATA controller, 4x 500 GB Samsung SATA2 disks and LVM, on openSUSE 10.3 x86_64.

Regards,
Ciro
I have not done any RAID5 performance testing; 20 MB/s seems pretty bad, but not outrageous I suppose. I can get about 4-5 GB/min from new SATA drives, so about 75 MB/s from a single raw drive (i.e. dd if=/dev/zero of=/dev/sdb bs=4k).

You don't say how you're invoking dd. The default bs is only 512 bytes, I think, and that is totally inefficient with the Linux kernel. I typically use 4k, which maps to what the kernel uses, i.e. dd if=/dev/zero of=big-file bs=4k count=1000 should give you a simple but meaningful test.

I think the default stride is 64K per drive, so if you're writing 3x 64K at a time, you may get perfect alignment and avoid the overhead of having to recalculate the checksum all the time. As another data point, I would bump that up to 30x 64K and see if you continue to get speed improvements.

So tell us the write speed for bs=512, bs=4k, bs=192k and bs=1920k, and the read speeds for the same, i.e. dd if=big-file of=/dev/null bs=4k, etc. I would expect the write speed to go up with each increase in bs, but the read speed to be more or less constant.

Then you need to figure out what sort of real-world block sizes you're going to be using. Once you have a bs, or a collection of bs sizes, that matches your needs, then you can start tuning your stack.

Greg
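For convenience, here is that test matrix spelled out as commands. This is only a rough sketch: /mnt/custom/ddtest is a placeholder path on the array, and each write run produces roughly 1 GB of scratch data.

# Write speed at each block size (about 1 GB per run):
dd if=/dev/zero of=/mnt/custom/ddtest bs=512   count=2000000
dd if=/dev/zero of=/mnt/custom/ddtest bs=4k    count=250000
dd if=/dev/zero of=/mnt/custom/ddtest bs=192k  count=5000
dd if=/dev/zero of=/mnt/custom/ddtest bs=1920k count=500

# Read speed at the same block sizes; flush the page cache before each run
# so the reads actually hit the disks (needs root, kernel 2.6.16 or newer):
for bs in 512 4k 192k 1920k; do
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/custom/ddtest of=/dev/null bs=$bs
done

rm /mnt/custom/ddtest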
Hi, posted the first mail from my cell phone, so couldn't add more info...

- I created the RAID with chunk size = 256k.

mainwks:~ # mdadm --misc --detail /dev/md2
/dev/md2:
        Version : 01.00.03
  Creation Time : Sun Jan 27 20:08:48 2008
     Raid Level : raid5
     Array Size : 1465151232 (1397.28 GiB 1500.31 GB)
  Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Jan 28 17:42:51 2008
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           Name : 2
           UUID : 65cb16de:d89af60e:6cac47da:88828cfe
         Events : 12

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       4       8       81        3      active sync   /dev/sdf1

- Speed reported by hdparm:

mainwks:~ # hdparm -tT /dev/sdc
/dev/sdc:
 Timing cached reads:   1754 MB in 2.00 seconds = 877.60 MB/sec
 Timing buffered disk reads:  226 MB in 3.02 seconds = 74.76 MB/sec

mainwks:~ # hdparm -tT /dev/md2
/dev/md2:
 Timing cached reads:   1250 MB in 2.00 seconds = 624.82 MB/sec
 Timing buffered disk reads:  620 MB in 3.01 seconds = 206.09 MB/sec

- LVM:

mainwks:~ # vgdisplay data
  Incorrect metadata area header checksum
  --- Volume group ---
  VG Name               data
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.36 TB
  PE Size               4.00 MB
  Total PE              357702
  Alloc PE / Size       51200 / 200.00 GB
  Free  PE / Size       306502 / 1.17 TB
  VG UUID               KpUAeN-mPjO-2K8t-hiLX-FF0C-93R2-IP3aFI

mainwks:~ # pvdisplay /dev/sdc1
  Incorrect metadata area header checksum
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               data
  PV Size               1.36 TB / not usable 3.75 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              357702
  Free PE               306502
  Allocated PE          51200
  PV UUID               Axl2c0-RP95-WwO0-inHP-aJEF-6SYJ-Fqhnga

- XFS:

mainwks:~ # xfs_info /dev/data/test
meta-data=/dev/mapper/data-test  isize=256    agcount=16, agsize=1638400 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=16     swidth=48 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=16384, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

- The reported dd:

mainwks:~ # dd if=/dev/zero bs=1024k count=100 of=/mnt/custom/t3
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.11596 s, 20.5 MB/s

- New dd (seems to give a better result):

mainwks:~ # dd if=/dev/zero bs=1024k count=1000 of=/mnt/custom/t0
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6218 s, 77.0 MB/s

Ciro
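As a side note on reading these dumps (an editorial cross-check, not part of the original exchange): xfs_info reports sunit/swidth in filesystem blocks, so with bsize=4096 the values above work out to 64 KiB / 192 KiB, which does not match the 256 KiB md chunk or the 3 x 256 KiB = 768 KiB stripe. A quick way to compare the two layers:

mdadm --detail /dev/md2 | grep 'Chunk Size'    # 256K per the output above
xfs_info /dev/data/test | grep sunit           # sunit=16, swidth=48 blks
# sunit  = 16 blocks * 4 KiB = 64 KiB   (vs. the 256 KiB chunk)
# swidth = 48 blocks * 4 KiB = 192 KiB  (vs. 768 KiB across the 3 data disks)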
Not sure I followed why the old and new dd were so different. I do see the old one only had 5 seconds' worth of data; not much to base a test run on.

If you really have 1 MB average write sizes, you should read http://oss.sgi.com/archives/xfs/2007-06/msg00411.html for a tuning sample. Basically that post recommends:

- chunk size = 256 KB
- LVM align = 3x chunk size = 768 KB (assumes a 4-disk RAID5)

and tuning the XFS bsize/sunit/swidth to match.

But that all _assumes_ a large data write size. If you have a more typical desktop load, then the average write is way below that and you need to reduce all of the above (except bsize; I think a 4K bsize is always best with Linux, but I'm not positive about that). Also, dd is only able to simulate a sequential data stream. If you don't have that kind of load, once again you need to reduce the chunk size. I think the generically preferred chunk size is 64 KB; with some database apps, that can drop down to 4 KB.

So really and truly, you need to characterize your workload before you start tuning. OTOH, if you just want bragging rights, test with and tune for a big average write, but be warned that your typical performance will be going down at the same time that your large-write performance is going up.

Greg
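A hedged sketch of what matching the XFS geometry to this array could look like (256 KB chunk, 4-disk RAID5, so 3 data chunks per stripe). The LV path is the one from the xfs_info output above; note that mkfs.xfs recreates the filesystem and destroys its contents:

# stripe unit = md chunk size, stripe width = number of data disks
mkfs.xfs -f -b size=4096 -d su=256k,sw=3 /dev/data/test

# equivalently, in 512-byte sectors: sunit = 256k/512 = 512, swidth = 3*512 = 1536
# mkfs.xfs -f -b size=4096 -d sunit=512,swidth=1536 /dev/data/test

# an existing XFS filesystem can also be given the geometry at mount time:
# mount -o sunit=512,swidth=1536 /dev/data/test /mnt/custom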
Hi, I found that thread too; the problem is I'm not sure how to tune the LVM alignment. Maybe --stripes and --stripesize at LV creation time? I can't find an option for pvcreate or vgcreate. It will basically be a repository for media files: movies, backups, ISO images, etc. For the rest (documents, ebooks and music) I'll create other LVs with ext3.

Regards,
Ciro
OK, I guess you know reads are not significantly impacted by the tuning we're talking about. This is mostly about tuning for RAID5 write performance.

Anyway, are you planning to stripe together multiple md RAID5 arrays via LVM? I believe that is what --stripes and --stripesize are for (i.e. if you had 8 drives, you could create 2 RAID5 arrays and use LVM to interleave them with --stripes 2). I've never used that feature.

You need to worry about the VG extents. I think vgcreate --physicalextentsize is what you need to tune. I would make each extent an even number of stripes in size, i.e. 768 KB * N. Maybe use N=10, so -s 7680K.

Assuming you're not using LVM stripes, and since this appears to be a new setup, I would also use -C or --contiguous to ensure all the data is sequential. It may be overkill, but it will further ensure you _avoid_ LV extents that don't end on a stripe boundary (a stripe == 3 RAID5 chunks for you).

Then if you are going to use the snapshot feature, you need to set your chunk size efficiently. If you are only going to have large files, then I would use a large LVM snapshot chunk size; 256 KB seems like a good choice, but I have not benchmarked snapshot chunk sizes.

Greg
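A rough sketch of the commands these suggestions translate to, assuming /dev/md2 is the only PV (as in the dumps above); the VG/LV names and the 200 GB size are placeholders:

pvcreate /dev/md2
# physical extent size = a whole number of 768 KB stripes (N=10, per the suggestion above)
vgcreate -s 7680K data /dev/md2
# contiguous allocation, so LV extents stay sequential on the array
lvcreate -C y -L 200G -n media data
# then build the filesystem with matching su/sw, as in the earlier mkfs.xfs sketch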
2008/1/28, Greg Freemyer <greg.freemyer@gmail.com>:
> OK, I guess you know reads are not significantly impacted by the tuning we're talking about. [...]

Yep, I know...

> Anyway, are you planning to stripe together multiple md RAID5 arrays via LVM? [...]

No, I don't plan to use something like that.

> You need to worry about the VG extents. I think vgcreate --physicalextentsize is what you need to tune. [...]

Well, I'm not sure about the PE parameter; it doesn't affect every write operation as far as I know. Using a large number just helps the allocation process (LV creation/grow), and a small number helps with allocation granularity (slower creation/grow of the LV).

> Assuming you're not using LVM stripes, and since this appears to be a new setup, I would also use -C or --contiguous to ensure all the data is sequential. [...]

Taking note...

> Then if you are going to use the snapshot feature, you need to set your chunk size efficiently. [...]

I read about that, but I probably won't use snapshots with this VG.

Thanks,
Ciro
Just for the record: while dealing with a bug that made the RAID hang, I found a workaround that also gave me a performance boost:

echo 4096 > /sys/block/md2/md/stripe_cache_size

Result:

mainwks:~ # dd if=/dev/zero bs=1024k count=1000 of=/datos/test
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.78341 s, 155 MB/s

mainwks:~ # rm /datos/test

mainwks:~ # dd if=/dev/zero bs=1024k count=20000 of=/datos/test
20000+0 records in
20000+0 records out
20971520000 bytes (21 GB) copied, 199.135 s, 105 MB/s

Ciro
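For completeness, a sketch of what this setting costs and how it could be made persistent. The memory estimate uses the usual md rule of thumb (entries x page size x member disks), and /etc/init.d/boot.local is an assumption about where openSUSE runs local boot commands:

# stripe_cache_size is counted in pages per member device, so roughly:
#   4096 * 4 KiB * 4 disks = 64 MiB of RAM dedicated to the stripe cache
echo 4096 > /sys/block/md2/md/stripe_cache_size

# re-apply at boot (path assumed; adjust for your setup)
echo 'echo 4096 > /sys/block/md2/md/stripe_cache_size' >> /etc/init.d/boot.local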
Ciro,

105 MB/s seems strange to me. I would have expected 75 MB/s or 225 MB/s.

For normal non-full-stripe I/O, it should be 75 MB/s * 4 / 4, where 75 MB/s is what I typically see for one drive, the first 4 is the number of drives that can be doing parallel I/O, and the second 4 is the number of I/Os per write. That is, when you do a non-full-stripe write, the kernel has to read the old checksum, read the old chunk data, recalculate the checksum, write the new chunk data, and write the checksum.

Out of curiosity, on the dd line, do you get better performance if you set your block size to exactly one stripe, i.e. 3x 256 KB = 768 KB? I've read that Linux's RAID5 implementation is optimized to handle full-stripe writes. Writing 3 chunks produces: calculate the new checksum from all new data, then write d1, d2, d3, p. So to get 3 chunks of 256 KB to the drives, the kernel ends up invoking 4 writes of 256 KB, or 75 MB/s * 4 * 3 / 4 = 225 MB/s.

If you have everything optimized, I think you should see the same performance with a 2-stripe write, i.e. 6x 256 KB. If your optimization is wrong, you will see a speed improvement with the bigger write even though the alignment between your writes and stripes is wrong, because the bigger write guarantees at least one full-stripe write.

Thanks,
Greg
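A small sketch of the comparison Greg asks for, reusing the /datos path from the earlier test; counts are chosen so each run writes roughly 1.5 GB:

# one full stripe per write: 3 data chunks * 256 KiB = 768 KiB
dd if=/dev/zero of=/datos/stripetest bs=768k count=2000

# two full stripes per write: should be roughly the same speed if alignment is right
dd if=/dev/zero of=/datos/stripetest bs=1536k count=1000

rm /datos/stripetest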
participants (2)

- Ciro Iriarte
- Greg Freemyer