On 28/02/18 20:54, Greg Freemyer wrote:
On Tue, Feb 27, 2018 at 4:57 PM, Wols Lists <antlists@youngman.org.uk> wrote:
On 27/02/18 00:57, Greg Freemyer wrote:
Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be much faster than on just a 6 (or 60). That assumes the failed drive could simply be copied from its partner in the mirror pair.
Actually, it would be even faster than that. Do you know the difference between Raid-1+0 and linux md-raid-10? md-raid-10 has the disadvantage (at least from the developer's point of view) that the drives are mirrors of each other, so rebuilding one drive places a lot of stress on its mirror.
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror!
Unusual!
So if you have, say, 20 drives, with your raid-61 configured as 8,2, that means you have two logical 10-drive (8 data + 2 parity) raid-6 arrays, mirrored. But the blocks are scattered at random across all 20 drives. So if a drive fails, let's say a 10TB one, the rebuild can copy roughly 0.5TB from EVERY other drive and rebuild the failed one.
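(A rough sketch of that rebuild spread, purely as an illustration: it uses plain random placement rather than the actual algorithm, and the 20-drive, 8+2-mirrored figures are just the example above.)

import random
from collections import Counter

N_DRIVES = 20      # the example above: 20 drives, raid-61 configured as 8,2
CHUNKS = 20        # a 10-chunk (8 data + 2 parity) raid-6 stripe, stored twice
STRIPES = 50000
rng = random.Random(1)

# Scatter each stripe's 20 chunks across 20 distinct drives pseudo-randomly
# (plain random shuffling here, standing in for the real algorithm).
layout = [rng.sample(range(N_DRIVES), CHUNKS) for _ in range(STRIPES)]

failed = 0
reads = Counter()  # chunks each surviving drive must supply to the rebuild
for stripe in layout:
    for pos, drive in enumerate(stripe):
        if drive == failed:
            # the second copy of this chunk sits 10 positions along the stripe
            twin_drive = stripe[(pos + 10) % CHUNKS]
            reads[twin_drive] += 1

print("lost chunks:", sum(reads.values()))
print("reads per survivor (min/max):", min(reads.values()), max(reads.values()))

With a conventional mirror pair all of those reads would land on the one partner drive; here they spread almost evenly over the other 19.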
Say what?
Putting thinking hat on!
Whoa, that is very cool if I have it right!
Somebody posted to the linux-raid list about a CRUSH algorithm, I think it was called. This enables you to spec local storage, different controllers, network storage etc, and ensure that blocks are scattered over all of them. The intent was that you could lose a controller, or a network link, or whatever, and still guarantee that a complete stripe of blocks could be found elsewhere. But I get the impression that it's computationally expensive - I wanted a simple algorithm that got you most of the benefits for a tiny fraction of the cost.
The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The only snag with my algorithm is that, iirc, you can get a pathological failure if you don't have at least twice the drives. So an 8,2 setup might need 33 drives for the algorithm to work.
I'm confused here.
If the number of drives is high enough, it's easy to prove that the pathological setup cannot occur. Unfortunately, every simulation I've run with fewer than that IS pathological :-( (By that, I mean that a single drive failure could destroy all copies of some blocks.)
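(A minimal sketch of the check those simulations are doing - not the actual test code, just an illustration of the pathological condition: a layout fails if one drive ends up holding every copy of some block.)

def pathological_drives(layout):
    """layout maps block id -> list of drives holding that block's copies.
    Returns the drives whose single failure would destroy some block outright."""
    return {copies[0] for copies in layout.values() if len(set(copies)) == 1}

# Toy example: both copies of block 7 landed on drive 3, so losing drive 3
# loses block 7 completely - the layout is pathological.
layout = {6: [0, 5], 7: [3, 3], 8: [1, 4]}
print(pathological_drives(layout) or "layout is safe")   # -> {3}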
Let's say I decide to be intentional about building an 80TB usable LV with your setup. If I use 10TB drives, does that mean I'd have to buy 33 x 10TB drives? At $400/drive, that's $13.2K just for the drives (chassis, controllers, etc. not included). That seems like a lot of money for 80TB usable.
I'm trying to remember my maths. That's 8 drives of data plus 2 of parity, twice: 20 drives. So you would need either 21 or 41 drives. But 41 sounds wrong; it should certainly work with 31, and it should be possible to do it with 21. Maybe I just need to improve my algorithm.
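(Working the 80TB example through with those figures, as a back-of-the-envelope check - my own arithmetic, using only the numbers already quoted in this thread.)

data, parity, copies = 8, 2, 2        # raid-61 "8,2", each stripe stored twice
usable_tb, drive_tb, price = 80, 10, 400

efficiency = data / ((data + parity) * copies)   # 8 useful chunks in every 20
raw_tb = usable_tb / efficiency                  # 200 TB of raw space
capacity_drives = raw_tb / drive_tb              # 20 drives just for capacity

print(f"efficiency {efficiency:.0%}, raw {raw_tb:.0f} TB, "
      f"{capacity_drives:.0f} drives = ${capacity_drives * price:,.0f}")
# efficiency 40%, raw 200 TB, 20 drives = $8,000
# The scattering constraint then pushes the count past 20 (21, 31 or 41 as
# discussed above); 33 drives at $400 is where the $13.2K figure comes from.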
Of course, if that's the case, it would fall back to a simpler algorithm, probably the one that leads to a mirror. Or, at least for raid-6, it would know that if all copies of a block were stored on the same drive, it could rebuild that block from parity. But that's not a good idea :-(
I attach my test code. Have a play. Note that you need to make sure that the primes aren't pathological - they must not be a factor of any of the other numbers. If you have any queries, I'll try to remember what I was doing and explain. There should be an email from me on the raid list that explains it all; I'll hunt it up later, but it's now my bed time ... :-)

Cheers,
Wol
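(The attached test code isn't reproduced here. Purely as a guess at the general shape of a prime-stepped scatter - the placement rule and every name below are assumptions, not the attachment - it might look something like this, with the prime required to share no factor with the drive count or stripe width.)

from math import gcd

def scatter(n_stripes, n_drives, chunks_per_stripe, prime):
    # A "pathological" prime is one that is a factor of one of the other
    # numbers; this assert is the constraint mentioned above.
    assert gcd(prime, n_drives) == 1 and gcd(prime, chunks_per_stripe) == 1
    layout, pos = [], 0
    for _ in range(n_stripes):
        stripe = []
        for _ in range(chunks_per_stripe):
            stripe.append(pos % n_drives)   # drive holding this chunk
            pos += prime                    # step on by the prime
        layout.append(stripe)
    return layout

# e.g. 21 drives, 20-chunk (8+2 mirrored) stripes, stepping by 13
for stripe in scatter(3, 21, 20, 13):
    print(stripe)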