[[[ SIDE NOTE: ...before I forget ;-), the unit I describe below differs from a "dedicated storage subsystem" in that it is my home server, which also manages my internet connections (web proxy, routing, email, file serving, etc.), all running via OpenSuse. That it is critical for so many things is a main reason I have been reluctant to go with less tested and less flexible alternatives.

In emergency situations, I've been able to boot from a suse rescue CD, or boot single-user directly from the disk, and bring up the system service-by-service, by hand, directly from the HW-init boot step. This allowed me to make temporary patches or do whatever was necessary to get my system back to normal running, and then examine more permanent corrections at leisure. Sometimes I had a hand-booted system staying up for weeks because I didn't want to address boot problems at that moment. The fact that I couldn't do something simple like hand-load a driver (via modprobe) in a shell and continue the boot process was (is?) a major reliability issue in some other boot & service managers. I still see that as an issue -- at least on a piecemeal system as linux has been.

Background of the advice I was given: this started on the xfs list with me wondering about the effect of spindle counts on IOPS and performance. I've tried to trim old HW details and non-perf/RAID info. The whole discussion is in the xfs archives from 2013 if you want the original text... Stan said:
Hey Linda, if you're going to re-architect your storage, the first thing I'd do is ditch that RAID50 setup. RAID50 exists strictly to reduce some of the penalties of RAID5, but then you find new downsides specific to RAID50, including the alignment issues you mentioned.
(my RAID50 alignment was 768K) I was wondering how I might increase my random I/O performance and gave some specs about my setup back then. At the time I had an LSI 9280 card, compared to the 9286 I have now. Main differences/features of the 9286 over the previous card:
 * 2 CPUs vs. 1;
 * a 12Gb bus (vs. 6Gb); and
 * 4K sector support.
The CPU count was mostly about supporting multi-checksum RAID configs like RAID50 (mentioned below). Most of the rest of this is from Stan, with exact text publicly available in the xfs archives.

-------- Original Message --------
Subject: Re: RAID setups, usage, Q's' effect of spindle groups...
Date: Mon, 21 Jan 2013 07:38:09 -0600
From: Stan Hoeppner
To: Linda Walsh
CC: xfs-oss

The 2108 ROC ASIC in the 9280 doesn't have sufficient horsepower for good performance with dual parity arrays, but that pales in comparison to the performance drop due to the RMW-induced seek latency.
not to mention the diskspace hit; RAID10 would be a bit too decadent for my usage/budget.
When one perceives the capacity overhead of RAID1/10 as an intolerable cost, instead of a benefit, one is forever destined to suffer from the poor performance of parity RAID schemes. ...
On #3, currently using 12.31TB in 20 partitions ...details elided....
So you have 24x 2TB 7.2K SATA drives total in two 630Js, correct?
I was mostly interested in how increasing the number of spindles in a RAID50 would help parallelism.
[how would this help performance overall...] The answer is simple too: Parity RAID sucks. If you want anything more than a trivial increase in performance, you need to ditch parity RAID. Given the time and effort involved in rearranging all of your disks to get one or two more RAID5 arrays with fewer disks per array into a RAID50, it doesn't make sense to do so when you can simply create one large RAID10, and be done monkeying around and second guessing. You'll have the performance you're seeking. Actually far, far more.
Consider this -- my max read and write rates on my large array are both 1GB/s. There's no way I could get that with a RAID10 setup without a much larger number of disks.
On the contrary. The same disks in RAID10 will walk all over your RAID50 setup. Let's discuss practical use and performance instead of peak optimums, shall we? Note that immediately below I'm simply educating you, not recommending a 12 drive RAID10. Recommendations come later.

In this one array you have 12 drives, 3x 4-drive RAID5 arrays in RAID50, for 9 effective data spindles. An equivalent 12 drive RAID10 would yield 6 data spindles. For a pure streaming read workload with all drives evenly in play, the RAID50 might be ~50% faster. For a purely random read workload, about the same, although in both cases 50x or more slower than the streaming read case due to random seeks. With a pure streaming allocation write workload with perfect stripe filling and no RMW, the RAID50 will be faster, but by less than the 50% above due to parity calcs in the ASIC.

Now it gets interesting. With a purely random, non-aligned, non-allocation write workload on the RAID50, RMW cycles will abound, driving seek latency through the roof while the ASIC performs a parity calc on each stripe update. Throughput here will be in the low tens of MBs per second, tops. RAID10 simply writes each sector--done. Throughput will be in the high tens to 100s of MB/s. So in this scenario RAID10 will be anywhere from 5-10x or more faster depending on the distribution of the writes across the drives.

Another factor here is that RMW reads from the disks go into the LSI cache for parity recalculation, eating cache bandwidth and capacity and decreasing writeback efficiency. With RAID10 you get full cache bandwidth for sinking incoming writes and performing flush scheduling, both being extremely important for random write workloads.

Food for thought: a random write workload of ~500MB with RAID10 will complete almost instantly after the controller cache consumes it. With RAID50 you have to go through hundreds or thousands of RMW cycles on the disks, so the same operation will take many minutes.
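A rough sanity check of the 12-drive comparison above can be done with back-of-the-envelope arithmetic. This sketch uses an assumed per-drive random IOPS figure (~90 for a 7.2K SATA disk -- my number, not from the thread) and only counts physical disk IOs per small write; it deliberately ignores the seek-latency and cache effects described above, which is why it shows only the floor of the RAID10 advantage:

```python
# Back-of-the-envelope model of the 12-drive RAID50 vs RAID10 comparison.
# IOPS_PER_DRIVE is an assumption for illustration, not a measurement.

DRIVES = 12
IOPS_PER_DRIVE = 90          # assumed random IOPS for a 7.2K RPM SATA disk

# RAID50 as 3 x 4-drive RAID5: 3 data drives per leg.
raid50_data_spindles = 3 * (4 - 1)        # 9, as stated in the text
# RAID10: half the drives hold mirror copies.
raid10_data_spindles = DRIVES // 2        # 6

# Physical disk IOs per small unaligned write:
# RAID5 RMW: read old data + read old parity + write data + write parity.
raid5_ios_per_write = 4
# RAID10: write the sector to both halves of the mirror.
raid10_ios_per_write = 2

raid50_write_iops = DRIVES * IOPS_PER_DRIVE / raid5_ios_per_write
raid10_write_iops = DRIVES * IOPS_PER_DRIVE / raid10_ios_per_write

print(raid50_data_spindles, raid10_data_spindles)   # 9 6
print(raid50_write_iops, raid10_write_iops)         # 270.0 540.0
```

Counting IOs alone already gives RAID10 a 2x random-write edge; the 5-10x figure in the text comes on top of this, from the extra seek latency and cache pressure that the RMW cycles cause and that this simple model leaves out.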
Let's look at more real-world scenarios. Take your example of the nightly background processes kicking in. This constitutes a mixed random read and write workload. In this situation every RMW can create 3 seeks per drive write: read, write, parity write. Add a seek for a pending read operation and that makes 4 seeks. But the problem isn't just the seeks; it is the inter-seek latency due to the slow 7.2K RPM platters having to spin under the head for the next read or write. This scenario makes scheduling in the controller and the drives themselves very difficult, adding more latency.

With RAID10 in this scenario, you simply have write/read/write/read/etc. You're not performing 2 extra seeks for each write, so you're not incurring that latency between operations, nor the scheduling complexity, thus driving throughput much higher. In this scenario the 6 disk RAID10 may be 10x to as much as 50x faster than the RAID50, depending on the access/seek patterns.

I've obviously not covered this in much technical detail, as storage behavior is quite complex. I've attempted to give you a high-level overview of the behavioral differences between parity and non-parity RAID, the potential performance differences with various workloads, and the differences between "peak" performance and actual performance. While your RAID50 may have greater theoretical peak streaming performance, the RAID10 will typically, literally, run circles around it with most day-to-day mixed IO workloads. While the RAID50 may have a peak throughput of ~1GB/s, it may only attain that 1-10% of the time. The RAID10 may have a peak throughput of "only" ~700MB/s, but will likely achieve it more than 60% of the time. And as a result its performance degradation will be much more graceful under concurrent workloads, due to the dramatically lower IO completion latencies.
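The seek accounting in the mixed-workload scenario above can be sketched numerically. The seeks-per-write counts (3 for a RAID5 RMW, 1 for a mirrored write) come from the text; the ~12ms average seek-plus-rotation figure for a 7.2K drive is my assumption for illustration:

```python
# Seek-time sketch of the nightly mixed read/write burst described above.
# AVG_SEEK_MS is an assumed figure for a 7.2K RPM drive, not a measurement.

AVG_SEEK_MS = 12.0

def mixed_seek_time_ms(writes, reads, seeks_per_write):
    """Total head-movement time for an interleaved read/write burst."""
    return (writes * seeks_per_write + reads) * AVG_SEEK_MS

burst_writes, burst_reads = 100, 100

# RAID50: each RMW write costs ~3 seeks (read, data write, parity write).
raid50_ms = mixed_seek_time_ms(burst_writes, burst_reads, seeks_per_write=3)
# RAID10: each write is a single seek per drive.
raid10_ms = mixed_seek_time_ms(burst_writes, burst_reads, seeks_per_write=1)

print(raid50_ms, raid10_ms)   # 4800.0 2400.0
```

Note this naive seek count alone only shows a 2x gap; the 10x-50x range quoted above comes from the additional rotational waits between dependent RMW operations and the scheduling breakdown, neither of which this flat model captures.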
Though I admit, concurrency would rise... but I generate most of my workload, so usually I don't have too many things going on at the same time... a few maybe...
But I'd guess it's at times like this when you bog down the RAID50 with mixed workloads and become annoyed. You typically don't see that with the non-parity arrays.
When an xfs_fsr kicks in and starts swallowing disk-cache, *ahem*, and the daily backup kicks in, AND the daily 'rsync' to create a static snapshot... things can slow down a bit.. but rare am I up at those hours...
And this is one scenario where the RAID10 would run circles around the RAID50.
You'll need more drives to maintain the same usable capacity,
(oh, a minor detail! ;^))...
Well, how much space do you really need for a one-person development operation plus a home media/etc. storage system? 10TB, 24TB, 48TB? Assuming you have both 630Js filled with 24x 2TB drives, that's 48TB raw. If you have 6x 4-drive RAID5s in multiple RAID50 spans, you have 18x 2TB = 36TB of capacity. Your largest array is 12 drives with 9 effective spindles of throughput.

You've split up your arrays for different functions, limiting some workloads to fewer spindles of performance and leaving spindles idle that could otherwise be actively adding performance to active workloads. You've created partitions directly on the array disk devices, with various LVM devices and filesystems on those for various purposes, again limiting some filesystems to less performance than your total spindles can give you.

The change I recommend you consider is to do something similar to what we do with SAN storage consolidation. Create a single large-spindle-count non-parity array on the LSI. In this case that would be a 24 drive RAID10 with a strip (sunit) of 32KB, yielding a stripe width (swidth) of 384KB, which should work very well with all of your filesystems and workloads, giving a good probability of full stripe writes. You'd have ~24TB of usable space. All of your workloads would have 12 spindles of non-parity performance, peak streaming read/write of ~1.4GB/s, and random read/write mixed workload throughput of a few hundred MB/s, simply stomping what you have now.

You'd be very hard pressed to bog down this 12 spindle non-parity array. Making a conservative guesstimate, I'd say the mixed random IO throughput would be on the order of 30x-50x that of your current RAID5/50 arrays combined. In summary, you'd gain a staggering performance increase you simply wouldn't have considered possible with your current hardware. You'd "sacrifice" 12TB of your 48TB of raw space to achieve it. That 30-50x increase in random IOPS is exactly why many folks gladly "waste money on extra drives".
After you see the dramatic performance increase you'll wonder why you ever considered spending money on high RPM SAS drives to reduce RAID5 latency. Put these 24 7.2K SATA drives in this RAID10 up against 24 15K SAS drives in a 6x4 RAID50. Your big slow Hitachis will best the nimble SAS 15ks in random IOPS, probably by a wide margin. Simply due to RMW. Yes, RMW will hammer 15K drives that much. RMW hammers all spinning rust, everything but SSDs.
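The 7.2K-SATA-beats-15K-SAS claim above can be illustrated with the same IO-counting approach. The per-drive IOPS figures here are my assumptions (typical datasheet-class numbers), not from the thread, and the model again ignores the extra seek latency RMW adds, so it understates the gap the text describes:

```python
# Rough check: 24 slow 7.2K SATA drives in RAID10 vs 24 fast 15K SAS
# drives in a 6x4 RAID50, small random writes. Per-drive IOPS figures
# are assumed typical values, not measurements from the thread.

DRIVES = 24
SATA_7200_IOPS = 90       # assumed random IOPS, 7.2K SATA
SAS_15K_IOPS = 175        # assumed random IOPS, 15K SAS

raid10_write_iops = DRIVES * SATA_7200_IOPS / 2   # 2 IOs per mirrored write
raid50_write_iops = DRIVES * SAS_15K_IOPS / 4     # 4 IOs per RAID5 RMW

print(raid10_write_iops, raid50_write_iops)       # 1080.0 1050.0
```

Even before counting RMW seek latency, the mirrored slow drives already edge out the 15K parity set on random writes; the latency and scheduling effects described earlier then widen that into the margin Stan describes.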
[wondering about increasing spindle account and effect on perf in RAID50]
Optimizing the spindle count of constituent RAID5s in a RAID50 to gain performance is akin to a downhill skier manically waxing his skis every day, hoping to shave 2 seconds off a 2 minute course.
Thanks for any insights...(I'm always open to learning how wrong I am! ;-))...
If nothing else, I hopefully got the point across as to how destructive parity RAID read-modify-write operations are to performance. It's simply impossible to get good mixed IO performance from parity RAID unless one's workloads always fit in controller write cache, or one has SSD storage.

=== comment about piecemeal increases in a 24-unit disk housing (same author):

A trap many/most home users fall into is buying such a chassis with 4 drives in RAID5, then adding sets of 4 drives in RAID5 as "budget permits", ending up with 6 separate arrays and thus 6 different data silos, each of low performance. This may be better/easier on the wallet for those who can't/don't plan and save, but in the end they get 1/6th of their spindle performance for any given workload, and a difficult migration path to the performance the drives are actually capable of. Which is obviously why I recommend acquiring the end-game solution in one shot and configuring for maximum performance from day one.