On Jan 28, 2008 6:41 PM, Ciro Iriarte <cyruspy@gmail.com> wrote:
2008/1/28, Greg Freemyer <greg.freemyer@gmail.com>:
On Jan 28, 2008 3:51 PM, Ciro Iriarte <cyruspy@gmail.com> wrote:
2008/1/28, Greg Freemyer <greg.freemyer@gmail.com>:
On Jan 28, 2008 11:25 AM, Ciro Iriarte <cyruspy@gmail.com> wrote:
Hi, does anybody have some notes about tuning md raid5, LVM and XFS? I'm getting 20 MB/s with dd and I think it can be improved. I'll add config parameters as soon as I get home. I'm using md raid5 on a motherboard with an nvidia SATA controller, 4x500GB Samsung SATA2 disks, and LVM on openSUSE 10.3 x86_64.
Regards, Ciro --
I have not done any raid5 performance testing. 20 MB/sec seems pretty bad, but not outrageous I suppose. I can get about 4-5 GB/min from new SATA drives, so about 75 MB/sec from a single raw drive (ie. dd if=/dev/zero of=/dev/sdb bs=4k).
You don't say how you're invoking dd. The default bs is only 512 bytes I think, and that is totally inefficient with the Linux kernel.
I typically use 4k, which maps to what the kernel uses. ie. dd if=/dev/zero of=big-file bs=4k count=1000 should give you a simple but meaningful test.
I think the default chunk size is 64K per drive, so if you're writing 3x 64K at a time, you may get perfect alignment and avoid the overhead of having to recalculate the parity all the time.
As another data point, I would bump that up to 30x 64K and see if you continue to get speed improvements.
So tell us the write speed for bs=512 bs=4k bs=192k bs=1920k
And the read speeds for the same. ie. dd if=big-file of=/dev/null bs=4k, etc.
I would expect the write speed to go up with each increase in bs, but the read speed to be more or less constant. Then you need to figure out what sort of real-world block sizes you're going to be using. Once you have a bs, or collection of bs sizes, that matches your needs, you can start tuning your stack.
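A minimal sketch of that sweep, assuming GNU dd (for conv=fdatasync) and a scratch directory on the filesystem under test. The function names count_for and sweep are made up for illustration. The four block sizes are 512, 4k, 192k and 1920k expressed in bytes, and every run writes the same total so the numbers stay comparable.

```shell
#!/bin/sh
# count_for BS_BYTES TOTAL_BYTES: how many blocks of BS_BYTES fit in TOTAL_BYTES
count_for() {
    echo $(( $2 / $1 ))
}

# sweep DIR TOTAL_BYTES: write, then read back, one test file per block size
sweep() {
    dir=$1
    total=$2
    for bs in 512 4096 196608 1966080; do
        count=$(count_for "$bs" "$total")
        echo "== bs=$bs count=$count =="
        # conv=fdatasync makes dd flush to disk before it reports a speed
        dd if=/dev/zero of="$dir/ddtest.$bs" bs="$bs" count="$count" conv=fdatasync 2>&1 | tail -n 1
        # read pass; this may be served from the page cache if the file is small
        dd if="$dir/ddtest.$bs" of=/dev/null bs="$bs" 2>&1 | tail -n 1
        rm -f "$dir/ddtest.$bs"
    done
}
```

Something like "sweep /mnt/custom 1073741824" would write 1 GiB per block size. Use a total well above the machine's RAM if the read figures are meant to reflect the disks rather than the cache.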
Greg
Hi, I posted the first mail from my cell phone, so I couldn't add more info.
- I created the raid with chunk size = 256k.
mainwks:~ # mdadm --misc --detail /dev/md2
/dev/md2:
        Version : 01.00.03
  Creation Time : Sun Jan 27 20:08:48 2008
     Raid Level : raid5
     Array Size : 1465151232 (1397.28 GiB 1500.31 GB)
  Used Dev Size : 976767488 (465.76 GiB 500.10 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 2
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Mon Jan 28 17:42:51 2008
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 256K
           Name : 2
           UUID : 65cb16de:d89af60e:6cac47da:88828cfe
         Events : 12
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       4       8       81        3      active sync   /dev/sdf1
- Speed reported by hdparm:
mainwks:~ # hdparm -tT /dev/sdc
/dev/sdc:
 Timing cached reads:   1754 MB in 2.00 seconds = 877.60 MB/sec
 Timing buffered disk reads:  226 MB in 3.02 seconds = 74.76 MB/sec
mainwks:~ # hdparm -tT /dev/md2
/dev/md2:
 Timing cached reads:   1250 MB in 2.00 seconds = 624.82 MB/sec
 Timing buffered disk reads:  620 MB in 3.01 seconds = 206.09 MB/sec
- LVM:
mainwks:~ # vgdisplay data
  Incorrect metadata area header checksum
  --- Volume group ---
  VG Name               data
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1.36 TB
  PE Size               4.00 MB
  Total PE              357702
  Alloc PE / Size       51200 / 200.00 GB
  Free  PE / Size       306502 / 1.17 TB
  VG UUID               KpUAeN-mPjO-2K8t-hiLX-FF0C-93R2-IP3aFI
mainwks:~ # pvdisplay /dev/sdc1
  Incorrect metadata area header checksum
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               data
  PV Size               1.36 TB / not usable 3.75 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              357702
  Free PE               306502
  Allocated PE          51200
  PV UUID               Axl2c0-RP95-WwO0-inHP-aJEF-6SYJ-Fqhnga
- XFS:
mainwks:~ # xfs_info /dev/data/test
meta-data=/dev/mapper/data-test  isize=256    agcount=16, agsize=1638400 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=16     swidth=48 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=16384, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
- The reported dd:
mainwks:~ # dd if=/dev/zero bs=1024k count=100 of=/mnt/custom/t3
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.11596 s, 20.5 MB/s
- New dd (seems to give a better result):
mainwks:~ # dd if=/dev/zero bs=1024k count=1000 of=/mnt/custom/t0
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.6218 s, 77.0 MB/s
Ciro
Not sure I followed why the old and new dd were so different. I do see the old one only had 5 seconds' worth of data. That is not much data to base a test run on.
If you really have 1MB avg. write sizes, you should read http://oss.sgi.com/archives/xfs/2007-06/msg00411.html for a tuning sample.
Basically that post recommends:
chunk size = 256KB
LVM align = 3x chunk size = 768KB (assumes a 4-disk raid5)
And tune the XFS bsize/sunit/swidth to match.
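The arithmetic behind those numbers can be sketched as follows. The helper name raid5_geometry is made up for illustration. The su= and sw= suboptions are the mkfs.xfs -d stripe unit and stripe width settings, and the values printed are only a starting point to check against your own array. For comparison, the xfs_info output earlier in the thread shows sunit=16 and swidth=48 in 4096-byte blocks, which is 64KB and 192KB, so that filesystem is not matched to the 256K chunk.

```shell
#!/bin/sh
# raid5_geometry CHUNK_KB DISKS: print data disks, full-stripe size and a mkfs.xfs hint
raid5_geometry() {
    chunk_kb=$1
    disks=$2
    data_disks=$(( disks - 1 ))              # raid5 spends one chunk per stripe on parity
    stripe_kb=$(( chunk_kb * data_disks ))   # size of one full data stripe
    echo "data_disks=$data_disks stripe_kb=$stripe_kb mkfs_opts=su=${chunk_kb}k,sw=${data_disks}"
}

# 4-disk raid5 with 256KB chunks, as in this thread
raid5_geometry 256 4
```

With those values the filesystem would be created with something like "mkfs.xfs -d su=256k,sw=3" on the target LV.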
But that all _assumes_ a large data write size. If you have a more typical desktop load, then the average write is way below that and you need to really reduce all of the above, except bsize. (I think 4K bsize is always best with Linux, but I'm not positive about that.)
Also, dd is only able to simulate a sequential data stream. If you don't have that kind of load, once again you need to reduce the chunk size. I think the generically preferred chunk size is 64KB. With some database apps, that can drop down to 4KB.
So really and truly, you need to characterize your workload before you start tuning.
OTOH, if you just want bragging rights, test with and tune for a big average write, but be warned your typical performance will be going down at the same time that your large write performance is going up.
Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
Hi, I found that thread too. The problem is I'm not sure how to tune the LVM alignment. Maybe --stripes and --stripesize at LV creation time? I can't find an option for pvcreate or vgcreate. It will basically be a repository for media files, movies, backups, ISO images, etc. For the rest (documents, ebooks and music) I'll create other LVs with ext3.
Regards, Ciro
Ok, I guess you know reads are not significantly impacted by the tuning we're talking about. This is mostly about tuning for raid5 write performance.
Anyway, are you planning to stripe together multiple md raid5 arrays via LVM? I believe that is what --stripes and --stripesize are for. (ie. If you have 8 drives, you could create 2 raid5 arrays, and use LVM to interleave them by using --stripes 2.) I've never used that feature.
You need to worry about the vg extents. I think vgcreate --physicalextentsize is what you need to tune. I would make each extent a whole number of stripes in size, ie. 768KB * N. Maybe use N=10, so -s 7680K.
Assuming you're not using LVM stripes, and since this appears to be a new setup, I would also use -C or --contiguous to ensure all the data is sequential. It may be overkill, but it will further ensure you _avoid_ LV extents that don't end on a stripe boundary. (A stripe == 3 raid5 chunks for you.)
Then if you are going to use the snapshot feature, you need to set your chunksize efficiently. If you are only going to have large files, then I would use a large LVM snapshot chunksize. 256KB seems like a good choice, but I have not benchmarked snapshot chunksizes.
Greg
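As a quick check of the extent-size suggestion, here is a small sketch that tests whether a physical extent size is a whole number of full stripes. The helper name extent_check is made up for illustration, and sizes are in KB. It only does the arithmetic. It does not check whether a given LVM version will accept a particular extent size (some versions may insist on a power of two), so treat -s 7680K as something to try rather than a known-good value.

```shell
#!/bin/sh
# extent_check PE_KB STRIPE_KB: report whether a PE holds a whole number of full stripes
extent_check() {
    pe_kb=$1
    stripe_kb=$2
    rem=$(( pe_kb % stripe_kb ))
    if [ "$rem" -eq 0 ]; then
        echo "PE ${pe_kb}K = $(( pe_kb / stripe_kb )) full stripes"
    else
        echo "PE ${pe_kb}K is not stripe-aligned (${rem}K left over)"
    fi
}

extent_check 4096 768   # the 4.00 MB PE size shown by vgdisplay in this thread
extent_check 7680 768   # the 768KB * 10 suggestion
```

For the current 4 MB extents this reports 256K left over, while 7680K comes out as exactly 10 stripes.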