[[[ SIDE NOTE: ...before I forget ;-), the unit I describe below differs from a "dedicated storage subsystem" in that it is my home server, which also manages my internet connections (web proxy, routing, email, file serving, etc.), all running via OpenSuse. That it is critical for so many things is a main reason I have been reluctant to go with less tested and less flexible alternatives.

In emergency situations, I've been able to boot from a suse rescue CD, or boot single-user directly from the disk, and bring up the system service-by-service, by hand, directly from the HW-init boot step. This allowed me to make temporary patches or do whatever was necessary to get my system back to normal running, and then examine more permanent corrections at leisure. Sometimes I had a hand-booted system staying up for weeks because I didn't want to address boot problems at that moment. The fact that I couldn't do something simple like hand-load a driver (via modprobe) in a shell and continue the boot process was (is?) a major reliability issue in some other boot & service managers. I still see that as an issue -- at least on a piecemeal system as linux has been.

Background of the advice I was given: this started on the xfs list with me wondering about the effect of spindle counts on IOPS and performance. I've tried to trim old HW details and non-perf/RAID info. The whole discussion is in the xfs archives from 2013 if you want the original text... Stan said:
Hey Linda, if you're going to re-architect your storage, the first thing I'd do is ditch that RAID50 setup. RAID50 exists strictly to reduce some of the penalties of RAID5, but then you find new downsides specific to RAID50, including the alignment issues you mentioned.
(my RAID50 alignment was 768K) I was wondering how I might increase my random I/O performance and gave some specs about my setup back then. At the time I had an LSI 9280 card, compared to the 9286 I have now. Main differences/features of the 9286 over the previous card:
 * 2 CPUs vs. 1;
 * a 12Gb bus (vs. 6Gb); and
 * 4K sector support.
The CPU count was mostly about supporting multi-checksum RAID configs like RAID50 (mentioned below). Most of the rest of this is from Stan, with exact text publicly available in the xfs archives.

-------- Original Message --------
Subject: Re: RAID setups, usage, Q's' effect of spindle groups...
Date: Mon, 21 Jan 2013 07:38:09 -0600
From: Stan Hoeppner
To: Linda Walsh
CC: xfs-oss

The 2108 ROC ASIC in the 9280 doesn't have sufficient horsepower for good performance with dual parity arrays, but that pales in comparison to the performance drop due to the RMW-induced seek latency.
not to mention the diskspace hit; RAID10 would be a bit too decadent for my usage/budget.
When one perceives the capacity overhead of RAID1/10 as an intolerable cost, instead of a benefit, one is forever destined to suffer from the poor performance of parity RAID schemes. ...
On #3, currently using 12.31TB in 20 partitions ...details elided....
So you have 24x 2TB 7.2K SATA drives total in two 630Js, correct?
I was mostly interested in how increasing the number of spindles in a RAID50 would help parallelism.
[how would this help performance overall...] The answer is simple too: Parity RAID sucks. If you want anything more than a trivial increase in performance, you need to ditch parity RAID. Given the time and effort involved in rearranging all of your disks to get one or two more RAID5 arrays with fewer disks per array into a RAID50, it doesn't make sense to do so when you can simply create one large RAID10, and be done monkeying around and second guessing. You'll have the performance you're seeking. Actually far, far more.
Consider this -- my max read and write rates on my large array are both 1GB/s. There's no way I could get that with a RAID10 setup without a much larger number of disks.
On the contrary. The same disks in RAID10 will walk all over your RAID50 setup. Let's discuss practical use and performance instead of peak optimums, shall we? Note that immediately below I'm simply educating you, not recommending a 12 drive RAID10. Recommendations come later.

In this one array you have 12 drives, 3x 4-drive RAID5 arrays in RAID50, for 9 effective data spindles. An equivalent 12 drive RAID10 would yield 6 data spindles. For a pure streaming read workload with all drives evenly in play, the RAID50 might be ~50% faster. For a purely random read workload, about the same, although in both cases 50x or more slower than the streaming read case due to random seeks. With a pure streaming allocation write workload with perfect stripe filling and no RMW, the RAID50 will be faster, but by less than the 50% above due to parity calcs in the ASIC.

Now it gets interesting. With a purely random, non-aligned, non-allocation write workload on the RAID50, RMW cycles will abound, driving seek latency through the roof while the ASIC performs a parity calc on each stripe update. Throughput here will be in the low tens of MBs per second, tops. RAID10 simply writes each sector--done. Throughput will be in the high tens to 100s of MB/s. So in this scenario RAID10 will be anywhere from 5-10x or more faster depending on the distribution of the writes across the drives.

Another factor here is that RMW reads from the disks go into the LSI cache for parity recalculation, eating cache bandwidth and capacity and decreasing writeback efficiency. With RAID10 you get full cache bandwidth for sinking incoming writes and performing flush scheduling, both being extremely important for random write workloads.

Food for thought: a random write workload of ~500MB with RAID10 will complete almost instantly after the controller cache consumes it. With RAID50 you have to go through hundreds or thousands of RMW cycles on the disks, so the same operation will take many minutes.
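A rough sanity check of the 12-drive comparison above can be done with back-of-the-envelope arithmetic. This sketch uses an assumed per-drive random IOPS figure (~90 for a 7.2K SATA disk -- my number, not from the thread) and only counts physical disk IOs per small write; it deliberately ignores the seek-latency and cache effects described above, which is why it shows only the floor of the RAID10 advantage:

```python
# Back-of-the-envelope model of the 12-drive RAID50 vs RAID10 comparison.
# IOPS_PER_DRIVE is an assumption for illustration, not a measurement.

DRIVES = 12
IOPS_PER_DRIVE = 90          # assumed random IOPS for a 7.2K RPM SATA disk

# RAID50 as 3 x 4-drive RAID5: 3 data drives per leg.
raid50_data_spindles = 3 * (4 - 1)        # 9, as stated in the text
# RAID10: half the drives hold mirror copies.
raid10_data_spindles = DRIVES // 2        # 6

# Physical disk IOs per small unaligned write:
# RAID5 RMW: read old data + read old parity + write data + write parity.
raid5_ios_per_write = 4
# RAID10: write the sector to both halves of the mirror.
raid10_ios_per_write = 2

raid50_write_iops = DRIVES * IOPS_PER_DRIVE / raid5_ios_per_write
raid10_write_iops = DRIVES * IOPS_PER_DRIVE / raid10_ios_per_write

print(raid50_data_spindles, raid10_data_spindles)   # 9 6
print(raid50_write_iops, raid10_write_iops)         # 270.0 540.0
```

Counting IOs alone already gives RAID10 a 2x random-write edge; the 5-10x figure in the text comes on top of this, from the extra seek latency and cache pressure that the RMW cycles cause and that this simple model leaves out.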
Let's look at more real-world scenarios. Take your example of the nightly background processes kicking in. This constitutes a mixed random read and write workload. In this situation every RMW can create 3 seeks per drive write: read, write, parity write. Add a seek for a pending read operation and that makes 4 seeks. But the problem isn't just the seeks; it is the inter-seek latency due to the slow 7.2K RPM platters having to spin under the head for the next read or write. This scenario makes scheduling in the controller and the drives themselves very difficult, adding more latency.

With RAID10 in this scenario, you simply have write/read/write/read/etc. You're not performing 2 extra seeks for each write, so you're not incurring that latency between operations, nor the scheduling complexity, thus driving throughput much higher. In this scenario the 6 disk RAID10 may be 10x to as much as 50x faster than the RAID50, depending on the access/seek patterns.

I've obviously not covered this in much technical detail, as storage behavior is quite complex. I've attempted to give you a high-level overview of the behavioral differences between parity and non-parity RAID, the potential performance differences with various workloads, and the differences between "peak" performance and actual performance. While your RAID50 may have greater theoretical peak streaming performance, the RAID10 will typically, literally, run circles around it with most day-to-day mixed IO workloads. While the RAID50 may have a peak throughput of ~1GB/s, it may only attain that 1-10% of the time. The RAID10 may have a peak throughput of "only" ~700MB/s, but will likely achieve it more than 60% of the time. And as a result its performance degradation will be much more graceful under concurrent workloads, due to the dramatically lower IO completion latencies.
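The seek accounting in the mixed-workload scenario above can be sketched numerically. The seeks-per-write counts (3 for a RAID5 RMW, 1 for a mirrored write) come from the text; the ~12ms average seek-plus-rotation figure for a 7.2K drive is my assumption for illustration:

```python
# Seek-time sketch of the nightly mixed read/write burst described above.
# AVG_SEEK_MS is an assumed figure for a 7.2K RPM drive, not a measurement.

AVG_SEEK_MS = 12.0

def mixed_seek_time_ms(writes, reads, seeks_per_write):
    """Total head-movement time for an interleaved read/write burst."""
    return (writes * seeks_per_write + reads) * AVG_SEEK_MS

burst_writes, burst_reads = 100, 100

# RAID50: each RMW write costs ~3 seeks (read, data write, parity write).
raid50_ms = mixed_seek_time_ms(burst_writes, burst_reads, seeks_per_write=3)
# RAID10: each write is a single seek per drive.
raid10_ms = mixed_seek_time_ms(burst_writes, burst_reads, seeks_per_write=1)

print(raid50_ms, raid10_ms)   # 4800.0 2400.0
```

Note this naive seek count alone only shows a 2x gap; the 10x-50x range quoted above comes from the additional rotational waits between dependent RMW operations and the scheduling breakdown, neither of which this flat model captures.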
Though I admit, concurrency would rise... but I generate most of my workload, so usually I don't have too many things going on at the same time... a few maybe...
But I'd guess it's at times like this when you bog down the RAID50 with mixed workloads and become annoyed. You typically don't see that with the non-parity arrays.
When an xfs_fsr kicks in and starts swallowing disk-cache, *ahem*, and the daily backup kicks in, AND the daily 'rsync' to create a static snapshot... things can slow down a bit.. but rare am I up at those hours...
And this is one scenario where the RAID10 would run circles around the RAID50.
You'll need more drives to maintain the same usable capacity,
(oh, a minor detail! ;^))...
Well, how much space do you really need for a one-person development operation plus a home media/etc. storage system? 10TB, 24TB, 48TB? Assuming you have both 630Js filled with 24x 2TB drives, that's 48TB raw. If you have 6x 4-drive RAID5s in multiple RAID50 spans, you have 18x 2TB = 36TB of capacity. Your largest array is 12 drives with 9 effective spindles of throughput.

You've split up your arrays for different functions, limiting some workloads to fewer spindles of performance and leaving spindles idle that could otherwise be actively adding performance to active workloads. You've created partitions directly on the array disk devices, with various LVM devices and filesystems on those for various purposes, again limiting some filesystems to less performance than your total spindles can give you.

The change I recommend you consider is to do something similar to what we do with SAN storage consolidation. Create a single large-spindle-count non-parity array on the LSI. In this case that would be a 24 drive RAID10 with a strip (sunit) of 32KB, yielding a stripe width (swidth) of 384KB, which should work very well with all of your filesystems and workloads, giving a good probability of full stripe writes. You'd have ~24TB of usable space. All of your workloads would have 12 spindles of non-parity performance, peak streaming read/write of ~1.4GB/s, and random read/write mixed workload throughput of a few hundred MB/s, simply stomping what you have now.

You'd be very hard pressed to bog down this 12 spindle non-parity array. Making a conservative guesstimate, I'd say the mixed random IO throughput would be on the order of 30x-50x that of your current RAID5/50 arrays combined. In summary, you'd gain a staggering performance increase you simply wouldn't have considered possible with your current hardware. You'd "sacrifice" 12TB of your 48TB of raw space to achieve it. That 30-50x increase in random IOPS is exactly why many folks gladly "waste money on extra drives".
After you see the dramatic performance increase you'll wonder why you ever considered spending money on high RPM SAS drives to reduce RAID5 latency. Put these 24 7.2K SATA drives in this RAID10 up against 24 15K SAS drives in a 6x4 RAID50. Your big slow Hitachis will best the nimble SAS 15ks in random IOPS, probably by a wide margin. Simply due to RMW. Yes, RMW will hammer 15K drives that much. RMW hammers all spinning rust, everything but SSDs.
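The 7.2K-SATA-beats-15K-SAS claim above can be illustrated with the same IO-counting approach. The per-drive IOPS figures here are my assumptions (typical datasheet-class numbers), not from the thread, and the model again ignores the extra seek latency RMW adds, so it understates the gap the text describes:

```python
# Rough check: 24 slow 7.2K SATA drives in RAID10 vs 24 fast 15K SAS
# drives in a 6x4 RAID50, small random writes. Per-drive IOPS figures
# are assumed typical values, not measurements from the thread.

DRIVES = 24
SATA_7200_IOPS = 90       # assumed random IOPS, 7.2K SATA
SAS_15K_IOPS = 175        # assumed random IOPS, 15K SAS

raid10_write_iops = DRIVES * SATA_7200_IOPS / 2   # 2 IOs per mirrored write
raid50_write_iops = DRIVES * SAS_15K_IOPS / 4     # 4 IOs per RAID5 RMW

print(raid10_write_iops, raid50_write_iops)       # 1080.0 1050.0
```

Even before counting RMW seek latency, the mirrored slow drives already edge out the 15K parity set on random writes; the latency and scheduling effects described earlier then widen that into the margin Stan describes.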
[wondering about increasing spindle account and effect on perf in RAID50]
Optimizing the spindle count of constituent RAID5s in a RAID50 to gain performance is akin to a downhill skier manically waxing his skis every day, hoping to shave 2 seconds off a 2 minute course.
Thanks for any insights...(I'm always open to learning how wrong I am! ;-))...
If nothing else, I hopefully got the point across as to how destructive parity RAID read-modify-write operations are to performance. It's simply impossible to get good mixed IO performance from parity RAID unless one's workloads always fit in controller write cache, or one has SSD storage.

=== comment about piecemeal increases in a 24-unit disk housing (same author):

A trap many/most home users fall into is buying such a chassis with 4 drives in RAID5, then adding sets of 4 drives in RAID5 as "budget permits", ending up with 6 separate arrays and thus 6 different data silos, each of low performance. This may be better/easier on the wallet for those who can't/don't plan and save, but in the end they get 1/6th of their spindle performance for any given workload, and a difficult migration path to the performance the drives are actually capable of. Which is obviously why I recommend acquiring the end-game solution in one shot and configuring for maximum performance from day one.