[opensuse] Spec'ing a 50 TB dedicated storage subsystem and room to grow?
All, I bought a used iSCSI-based Drobo in 2016 with 50TB in it. It works great, but it isn't fast by today's standards (100 MB/sec absolute max), and I never got it to work from openSUSE (Windows and Mac did work). I now need to buy something similar, but with fast I/O and Linux support: 10 Gbit Ethernet and fibre channel may be my only options? Or are there point-to-point SAS external racks I should consider?

Also, I will likely want to put an SSD-based cache in the mix at some point. It could be integrated into the storage subsystem, or I could use something like bcache and have the SSD be in the server. It just needs to be reliable and fast. Recommendations? (I'm thinking used, and $10K at the high end for the subsystem including PCIe cards and switches, if needed.)

== details

I need to buy a server with 50TB usable disk for a production environment. A high-speed disk subsystem is critical, and the project may scale up over time. It doesn't need to be a fail-over cluster, just a single server. I'm looking at used equipment most likely.

I was thinking one with 4 CPU sockets would let me start with 2 CPUs and then expand to 4 later on. Lots of RAM capacity would also be great.

My first thought was to get a Dell R920 and throw a bunch of disks in it: https://goo.gl/images/x3Rfj2 - 24 disk slots! But then I looked at what size drives are available for those slots: 2.5 inch, and I only see 2TB drives max. I'm hoping to use a Dell server because it's the preferred brand at my (new) job, but unless I'm missing something a standalone rack-mount Dell won't work.

So now I'm looking at a used R820 most likely, and some sort of disk subsystem to support it. I picked it because it is only a 5-year-old design and it has 4 CPU sockets I can expand into as the server demand increases over time. Also, 3TB max RAM is way above what this will require, even long term: https://media.adn.de/media/DE/DocLib/Poweredge_Easy_Matrix.pdf

A used, barebones R820 can be had for under $2K.
Just add CPUs and RAM. Greg -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
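On the bcache idea above, a minimal sketch of putting an SSD in the server as a cache in front of the storage array (device names are assumptions; needs bcache-tools and a bcache-enabled kernel):

```shell
# Hedged sketch only - /dev/md0 (backing array) and /dev/nvme0n1 (SSD)
# are placeholder names; adjust for the real hardware.
make-bcache -B /dev/md0         # format the big, slow backing device
make-bcache -C /dev/nvme0n1     # format the SSD as a cache device

# Attach the cache set to the backing device; <cache-set-uuid> is printed
# by the -C step (or shown by 'bcache-super-show /dev/nvme0n1').
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Writeback mode caches writes as well; the default (writethrough) only
# accelerates reads.
echo writeback > /sys/block/bcache0/bcache/cache_mode

mkfs.xfs /dev/bcache0           # the filesystem then lives on /dev/bcache0
```

Whether the SSD sits in the server (bcache) or in the storage subsystem is mostly a question of which link you want the cache to hide.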
Greg Freemyer wrote:
== details
I need to buy a server with 50TB usable disk for a production environment. A high speed disk subsystem is critical. And the project may scale up over time.
It doesn't need to be a fail-over cluster, just a single server. I'm looking at used equipment most likely.
I was thinking one with 4 CPU sockets would let me start with 2 CPUs and then expand to 4 later on. Lots of RAM capacity would also be great.
My first thought was to get a Dell R920 and throw a bunch of disks in it.
24 disk slots!
But then I looked at what size drives are available for those slots: 2.5 inch. I only see 2TB drives max.
Greg, for storage servers we use some fairly plain Supermicro SC846 boxes with 24 x 3.5" slots. They were bought 2nd hand; I think we have 12 or 14, with some being used for spare parts (power supplies), 8 currently in production. They were cheap, maybe 400-500 CHF a piece. They have server boards with IPMI etc., but are not overly beefy: one quad-core CPU, 8 GB RAM. They're mainly used for serving disk to Xen guests over iSCSI. You might look for something like that and then upgrade with faster SATA, if needed. We have been experimenting with adding a large SSD cache (PCI-X) to each box, but I don't have any numbers yet. Otherwise, in the 2nd-hand area, EMC and Fujitsu both do some very nice storage arrays; IBM and HP too, for that matter. Be careful that you get one with SATA, not fibre-channel, disks - SATA is easier/cheaper to replace. HTH Per -- Per Jessen, Zürich (-1.9°C) http://www.cloudsuisse.com/ - your owncloud, hosted in Switzerland.
On 23.02.2018 20:23, Per Jessen wrote:
Otherwise, in the 2nd hand area, EMC and Fujitsu both do some very nice storage arrays. IBM and HP too for that matter. Careful you get one with SATA, not fibre-channel disks. SATA is easier/cheaper to replace.
I do not think anyone has manufactured arrays with FC disks for several years now. Another important consideration is that most storage vendors will only accept disks made for their storage systems - special firmware and/or NVRAM content. Meaning you cannot use (or replace a broken disk with) a random cheap SATA disk anyway; you need to get a compatible one.
Andrei Borzenkov wrote:
On 23.02.2018 20:23, Per Jessen wrote:
Otherwise, in the 2nd hand area, EMC and Fujitsu both do some very nice storage arrays. IBM and HP too for that matter. Careful you get one with SATA, not fibre-channel disks. SATA is easier/cheaper to replace.
I do not think anyone still manufactures arrays with FC disks since several years.
I agree, but when you're out looking for full-size (3.5") 2nd-hand drives, they often turn up. If you're not aware ... in the article description it is sometimes just marked as "FC".
Other important consideration is that most storage vendors will only accept disks made for these storage systems - special firmware or/and NVRAM content. Meaning you cannot use (or replace broken disk with) random cheap SATA disk anyway, you need to get compatible one.
I only have actual experience with older Compaq/HP and IBM, and they don't have that problem. For what Greg is looking for, I think the Supermicro boxes would be a good solution, with plain commodity hardware. -- Per Jessen, Zürich (-2.0°C) http://www.dns24.ch/ - your free DNS host, made in Switzerland.
On Fri, Feb 23, 2018 at 12:23 PM, Per Jessen <per@computer.org> wrote:
Greg Freemyer wrote:
== details
I need to buy a server with 50TB usable disk for a production environment. A high speed disk subsystem is critical. And the project may scale up over time.
It doesn't need to be a fail-over cluster, just a single server. I'm looking at used equipment most likely.
I was thinking one with 4 CPU sockets would let me start with 2 CPUs and then expand to 4 later on. Lots of RAM capacity would also be great.
My first thought was to get a Dell R920 and throw a bunch of disks in it.
24 disk slots!
But then I looked at what size drives are available for those slots: 2.5 inch. I only see 2TB drives max.
Greg, for storage servers, we use some fairly plain Supermicro SC846 boxes with 24 x 3.5" slots. They were bought 2nd hand, I think we have 12 or 14, with some being used for spare parts. (power supplies), 8 currently in production.
Thanks, I was thinking of something along those lines. Alternatively, a used NetApp filer isn't too expensive.
They were cheap, maybe 400-500chf a piece. They have server boards, with ipmi board etc., but not overly beefy. One quad-core CPU, 8Gb RAM. They're mainly used for serving disk to xen guests over iscsi.
Again, exactly what I was thinking. 10 Gbit/sec? 40 Gbit/sec? Is it even feasible to consider SAS HBAs as the interconnect from the main server to the storage server? Do we have kernel support for that on the subsystem end? If SAS is an option, does it have significantly less overhead than iSCSI? (It seems like it would.)
You might look for something like that and then upgrade with faster SATA, if needed. We have been experimenting with adding a large SSD cache (pcix) to each box, but I don't have any numbers yet.
I have a Windows box I put a 2TB PCIe SSD in as a disk cache. It makes a huge difference.
Otherwise, in the 2nd hand area, EMC and Fujitsu both do some very nice storage arrays. IBM and HP too for that matter. Careful you get one with SATA, not fibre-channel disks. SATA is easier/cheaper to replace.
I'm very familiar with the HP line of 2005. Not so much anything newer.
HTH Per
Thanks Greg
Greg Freemyer wrote:
On Fri, Feb 23, 2018 at 12:23 PM, Per Jessen <per@computer.org> wrote:
Greg Freemyer wrote:
== details
I need to buy a server with 50TB usable disk for a production environment. A high speed disk subsystem is critical. And the project may scale up over time.
It doesn't need to be a fail-over cluster, just a single server. I'm looking at used equipment most likely.
I was thinking one with 4 CPU sockets would let me start with 2 CPUs and then expand to 4 later on. Lots of RAM capacity would also be great.
My first thought was to get a Dell R920 and throw a bunch of disks in it.
24 disk slots!
But then I looked at what size drives are available for those slots: 2.5 inch. I only see 2TB drives max.
Greg, for storage servers, we use some fairly plain Supermicro SC846 boxes with 24 x 3.5" slots. They were bought 2nd hand, I think we have 12 or 14, with some being used for spare parts. (power supplies), 8 currently in production.
Thanks, I was thinking of something along those lines. Alternatively a NetApp filer isn't too expensive used.
They were cheap, maybe 400-500chf a piece. They have server boards, with ipmi board etc., but not overly beefy. One quad-core CPU, 8Gb RAM. They're mainly used for serving disk to xen guests over iscsi.
Again, exactly what I was thinking. 10 Gbit/sec? 40 Gbit/sec?
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
Is it even feasible to consider SAS HBAs as the interconnect from the main server to the storage server? Do we have kernel support for that on subsystem end? If SAS is an option, does it have significantly less overhead than iSCSI? (Seems like it would).
We haven't looked into it.
You might look for something like that and then upgrade with faster SATA, if needed. We have been experimenting with adding a large SSD cache (pcix) to each box, but I don't have any numbers yet.
I have a Windows box I put a 2TB PCIx SSD in as disk cache. It makes a huge difference.
Yup, I see that too on workstations; we're waiting to see how the concurrent workload from a bunch of Xen guests will benefit. -- Per Jessen, Zürich (-0.3°C) http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
On Saturday, 2018-02-24 at 12:45 +0100, Per Jessen wrote:
Greg Freemyer wrote:
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
I have not seen a clear case for SSD caching in Linux. There are two or three alternatives, and at least one of them has been abandoned; the others are too complicated. I considered using an SSD as a cache on my desktop, but had to abandon the idea. If you get something conclusive I'd be interested to know, although your use case is very different from mine. -- Cheers, Carlos E. R. (from openSUSE 42.3 x86_64 "Malachite" at Telcontar)
On 24/02/18 21:18, Carlos E. R. wrote:
On Saturday, 2018-02-24 at 12:45 +0100, Per Jessen wrote:
Greg Freemyer wrote:
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
I have not seen clear cases for SSD cache in Linux. There are two or three alternatives, and at least one of them has been abandoned. Others are too complicated. I considered using an SSD as cache on my desktop, but had to abandon the idea.
If you get something conclusive I'd be interested to know, although your use case is very different from mine.
If you're using Linux md-raid, then SSD journal support is being added. This appears to have quite an impact - mostly on write speed, of course. Cheers, Wol
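The md journal Wol mentions is configured at array-creation time; a sketch, assuming mdadm 3.4+ and placeholder device names:

```shell
# Hedged sketch - /dev/sd[b-e] (spinning disks) and /dev/nvme0n1p1 (an SSD
# partition) are placeholders. The journal device closes the RAID-5/6
# write hole and can absorb write bursts before they hit the spindles.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde \
      --write-journal /dev/nvme0n1p1
```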
On Sun, Feb 25, 2018 at 12:15 PM, Wols Lists <antlists@youngman.org.uk> wrote:
On 24/02/18 21:18, Carlos E. R. wrote:
On Saturday, 2018-02-24 at 12:45 +0100, Per Jessen wrote:
Greg Freemyer wrote:
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
I have not seen clear cases for SSD cache in Linux. There are two or three alternatives, and at least one of them has been abandoned. Others are too complicated. I considered using an SSD as cache on my desktop, but had to abandon the idea.
If you get something conclusive I'd be interested to know, although your use case is very different from mine.
If you're using linux md-raid, then SSD journal support is being added. This appears to have quite an impact. Mostly write speed, of course.
That is great news regardless; I'd never thought about the benefits of that before. As to the storage server, I don't know yet if it will be a built-from-scratch unit like Per described, or a used NetApp filer. I haven't started to price used NetApps yet. Greg
On Saturday, February 24, 2018, Carlos E. R. <robin.listas@telefonica.net> wrote:
On Saturday, 2018-02-24 at 12:45 +0100, Per Jessen wrote:
Greg Freemyer wrote:
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
I have not seen clear cases for SSD cache in Linux. There are two or three alternatives, and at least one of them has been abandoned. Others are too complicated. I considered using an SSD as cache on my desktop, but had to abandon the idea.
If you get something conclusive I'd be interested to know, although your use case is very different from mine.
Carlos,

It is way too use-case specific for my findings to be useful. I work with data sets that are too big to leverage the typical kernel block-buffering mechanism, even on 64 GB machines.

As an example, Friday I had to confirm that a 150GB tar file (*.tgz), provided to me on a thumb drive, wasn't encrypted. I didn't give it a whole lot of thought: I copied it to my laptop's rotating drive and started to untar it. After an hour I realized I had made a mistake and killed the untar; a few hundred thousand files had been extracted at that point. An SSD cache, I believe, would have made that job far faster, but note it would need to also function as a write cache. I don't know if the Linux SSD cache schemes offer write-caching.

In the meantime, my colleagues at work told me late Friday that we should consider using this opportunity to replace our 2010-era VMware ESXi server with a newer one (still used, but maybe a 2012 server design with 2015-released CPUs like the E5-4527 (uses DDR3 RAM)). So, the land of ever-changing specs continues to exist.

If we indeed go that route, VMware's vSphere package (~$4500) supports multi-node ESXi setups (including fail-over) using an SSD in the host hypervisor node as a disk cache, but in write-through mode only. I.e., writes are not accelerated by the cache, but subsequent reads don't have to go to disk. The contents of the cache can even move between two nodes if a VM is switched to a different node for load-balancing reasons. Load balancing VMs between ESXi nodes is outside my personal knowledge base at the moment, but maybe it is headed my way.

Putting the SSD cache in the main server has a lot of merit because it would allow the speed of NVMe SSDs to be leveraged. Then maybe another SSD cache in the backend shared storage server to perform write-caching! The one in the storage server could be a cheaper SATA-interfaced SSD without any performance hit, in all likelihood.
Greg
On 2018-02-25 at 12:27 -0500, Greg Freemyer wrote:
On Saturday, February 24, 2018, Carlos E. R. <> wrote:
On Saturday, 2018-02-24 at 12:45 +0100, Per Jessen wrote:
Greg Freemyer wrote:
Nah. They're old servers, only 3Gbps SATA and PCI-x, we use multiple 1GigE cards, some with bonding. If the SSD caches don't add anything significant, we'll upgrade the motherboards and controllers. The chassis is still perfect.
I have not seen clear cases for SSD cache in Linux. There are two or three alternatives, and at least one of them has been abandoned. Others are too complicated. I considered using an SSD as cache on my desktop, but had to abandon the idea.
If you get something conclusive I'd be interested to know, although your use case is very different from mine.
Carlos,
It is way too use case specific for my findings to be useful. I work with data sets that are too big to leverage the typical kernel block buffering mechanism, even on 64 GB machines.
As an example, Friday I had to confirm a 150GB tar file (*.tgz) provided to me on a thumb drive wasn't encrypted. I didn't give it a whole lot of thought, I copied it to my laptop's rotating drive and started to untar it. After an hour I realized I made a mistake and killed the untar. A few hundred thousand files had been extracted at that point. A SSD cache I believe would have made that job far faster, but note it would need to also function as a write cache.
I don't know if the Linux ssd cache schemes offer write-caching.
I think it does, yes.
In the meantime, my colleagues at work told me late Friday that we should consider using this opportunity to replace our 2010 era VMware ESXi server with a newer one (still used, but maybe a 2012 server design with 2015 released CPUs like the E5-4527 (uses DDR3 ram)). So, the land-of-ever-changing-specs continues to exist.
If we indeed go that route, VMware's vSphere package (~$4500) supports multi-node ESXi setups (including fail-over) using a SSD in the host hypervisor node as a disk cache, but in write-through mode only. Ie. Writes are not accelerated by the cache, but subsequent reads don't have to go to disk. The contents of the cache can even move between 2 nodes if a VM is switched to a different node for load balancing reasons. Load balancing VMs between ESXi nodes is outside my personal knowledge base at the moment, but maybe it is headed my way.
Careful with VMware: the current version of Workstation does not accept my CPU (too old; the CPU has to be newer than 2011): <https://kb.vmware.com/kb/51643>. I don't know if this affects you, as it's a different product.
Putting the SSD cache in the main server has a lot of merit because it would allow the speed of NVMe SSDs to be leveraged. Then maybe another SSD cache in the backend shared storage server to perform write-caching!
The one in the storage server could be a cheaper SATA interfaced SSD without any performance hit in all likelihood.
Greg
-- Cheers, Carlos E. R. (from openSUSE 42.3 x86_64 "Malachite" (Minas Tirith))
I'd piecemeal it... Get a PC with a PCIe x16 or wider slot and put in an LSI MegaRAID 9286CV-8e. It has 8 external 6Gb/s channels. Have it talking to two (obsolete) LSI DE1600-SAS 12x3.5" enclosures, and one Areca ARC-8026 (loud fans) 24x3.5" enclosure. The LSI controller will be a bit finicky on disks it will accept as "good"... if you want good performance, you'll want to get enterprise-level drives.

I put 24 4TB Hitachi Ultrastars in them about 5-6 years ago. I have them set up as a RAID10, so only 48TB of available storage, but if bought today, 6TB disks might be a better buy.

Since the PC would be running Linux, you can export the disks as SMB (CIFS), NFS or iSCSI. I found SMB gave me 125 MB/s (that's 125 million bytes/sec) writes and 119 MB/s reads over a 1G connect. Going to 10G, transfers start to be CPU-limited under SMB (dunno about iSCSI or NFS), but right now I get:

  h> bin/iotest
  Using bs=16.0M, count=64, iosize=1.0G
  R:1073741824 bytes (1.0GB) copied, 1.88318 s, 544MB/s
  W:1073741824 bytes (1.0GB) copied, 4.95738 s, 207MB/s

(those are binary prefixes, so about 570 MB/s read and 217 MB/s write in decimal units).

The SAS enclosures can be daisy-chained for up to 8 enclosures (might be more), so you could have 192TB of storage with 4TB disks. Dunno if that is what you were thinking of or not?
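A quick check on the binary-vs-decimal conversion of those iotest figures (a sketch using awk; the byte count and timings are from the run above):

```shell
# dd-style figures: 1 GiB moved in 1.88318 s (read) and 4.95738 s (write).
# Convert each to decimal MB/s (1e6 bytes) and binary MiB/s (2^20 bytes).
bytes=1073741824
awk -v b="$bytes" 'BEGIN {
    printf "read:  %.0f MB/s (%.0f MiB/s)\n", b/1.88318/1e6, b/1.88318/2^20
    printf "write: %.0f MB/s (%.0f MiB/s)\n", b/4.95738/1e6, b/4.95738/2^20
}'
# read:  570 MB/s (544 MiB/s)
# write: 217 MB/s (207 MiB/s)
```

So the 544/207 numbers are MiB/s; in decimal megabytes they come out to roughly 570 and 217.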
On Sun, Feb 25, 2018 at 12:42 PM, L A Walsh <suse@tlinx.org> wrote:
I'd piecemeal it...
Get a PC w/a PCIe x16 or more slot and put an LSI MegaRaid 9286CV-8e. It has 8 external 6Gb channels.
Have it talking to two (obsolete) LSI DE1600-SAS 12x3.5 enclosures, and 1 Areca ARC-8026 (loud fans), 24x3.5 enclosures. They LSI controller will be a bit finicky on disks it will accept as "good"...if you want good performance, you'll want to get Enterprise level drives.
I put 24 4tB Hitachi Ultrastars in them about 5-6 years ago. Have them setup as a RAID10, so only 48TB available storage, but if bought today, 6TB disks might be better buy.
Since the PC would be running linux, you can export the disks as SMB(CIFS), NFS or iSCSI.
I found SMB gave me 125mB (thats 125mill) writes and 119mB reads over a 1G connect. Going to 10G transfers start to be cpu limited under smb (dunno about iSCSI or NFS), but right now, get: h> bin/iotest Using bs=16.0M, count=64, iosize=1.0G R:1073741824 bytes (1.0GB) copied, 1.88318 s, 544MB/s W:1073741824 bytes (1.0GB) copied, 4.95738 s, 207MB/s
(those are base2 prefixes, so about 584mB read and 222mB writes for decimal).
The sas enclosures can be daisy chained for up to 8 enclosures (might be more), so could have 192tB of storage w/4tB disks.
Dunno if that is what you were thinking of or not?
Linda, Thank you greatly. That is very much along the lines of what I'm thinking if we self-build. A used NetApp is the other option. I hope to decide this week; the goal is to get something running by around Mar 15. Greg
A few clarifications:
LSI MegaRaid 9286CV-8e. It has 8 external 6Gb channels.
Note: though the MegaRaid is a SAS controller, it will work with SATA disks as easily as SAS disks.
The LSI controller will be a bit finicky on disks it will accept as "good"...if you want good performance, you'll want to get Enterprise level drives.
Clarification: drives that don't test good on a HW controller like the LSI will likely work with software RAID. However, "work" doesn't mean it will work well. The reason the LSI controller kicked many drives: RPMs out of spec. Instead of 7200 RPM, a few were faster, seeming to be 7920 RPM drives - a benefit if used as a single desktop drive - though many were worse, performing more like 5400 RPM drives. 9 out of 12 Deskstars were too far out of RPM range to be considered 'good' by the LSI controller. Note: for drives that were slower, the slower speed could have come from bad sectors or tracks that had been remapped. Testing 12 Ultrastars showed less than a 1% variation.

FWIW, re: reliability: I have 24 2TB Ultrastars with 3-year warranties that have been running 24/7 since July 2009. I've had 2 drives go bad from that batch (both within the past year). I dropped in spares purchased with the initial disks and the RAID rebuilt in the background, all with no system downtime. I added a 2nd container (in the Areca) holding 24 4TB Ultrastars (with 5-year warranties) in July 2014. No failures on that batch (yet).

BTW, I got my Areca @ Circuit City, but this looks like the same unit: https://www.amazon.com/External-SATA-Expander-Enclosure-DS-24E/dp/B004IY1PIK and this might be a newer version of it: http://www.areca.com.tw/products/sascableexpander8028.htm Note, you'll need to buy the SFF-8088 cable to connect the LSI controller with the expansion bay.

My *beef* (didn't try NetApp specifically) was that none of the off-the-shelf solutions had performance worth diddly. Some came with 1Gb Ethernet, but then had max speeds that were abysmal -- <=10MB/s. Also, another gotcha (NetApp rings a bell here) is that you needed to buy a separate add-on license to serve different things, so if you buy it with iSCSI today, you can't even try NFS or SMB without a separate license for them. You CAN buy one with all the licenses up front... but so much better to use Linux and a free license.
(Sorry for another long post, but lots of bewildering info...) I was given advice that it is best to fill out your enclosures when you buy them (and not worry about later upgrades), so you can take advantage of a fully parallel and redundant setup. To expand, you add more enclosures.
On 26/02/18 03:00, L A Walsh wrote:
Some came with 1Gb ethernet, but then had max speeds that were abysmal -- <=10MB/s.
Just remember - I know 10% is crap, really - but you are getting 10% of the rated speed, not 1%: Ethernet is quoted in bits, and disk speed is quoted in bytes. Cheers, Wol
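The arithmetic, spelled out as a one-liner:

```shell
# 10 MB/s of payload is 10*8 = 80 Mbit/s on the wire; a 1GigE link is
# 1000 Mbit/s, so that's 8% of line rate (roughly 10%, not 1%),
# ignoring framing overhead.
awk 'BEGIN { printf "%.0f%% of line rate\n", 10*8/1000*100 }'
# 8% of line rate
```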
On Sun, Feb 25, 2018 at 10:00 PM, L A Walsh <suse@tlinx.org> wrote:
A few clarifications:
LSI MegaRaid 9286CV-8e. It has 8 external 6Gb channels.
--- Note: though the MegaRaid is a SAS controller, it will work with SATA disks as easily as SAS disks.
The LSI controller will be a bit finicky on disks it will accept as "good"...if you want good performance, you'll want to get Enterprise level drives.
--- Clarification: Drives that don't test good on a HW controller like the LSI will likely work on a software-raid. However "work" doesn't mean it will work well. The reason the LSI controller kicked many drives: RPMs out of spec, so instead of 7200RPM, a few were faster, seeming to be 7920 RPM drives -- a benefit if used as a single Desktop drive, though many were worse ... performing more like 5400 RPM drives.
9 out of 12 Deskstars were too far out of RPM range to be considered 'good' by the LSI controller.
Linda, If it's that picky, don't even think about a shingled drive (SMR). They can have highly variable performance as the drive does internal housekeeping. In particular, I occasionally stream a few TB at a time straight to an array; even a 2TB front-end cache might fill up and kick in housekeeping that would trigger a drive eviction. Greg
On Sun, Feb 25, 2018 at 8:53 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
NetApp used is the other option. I hope to decide this week. The goal is to get something running by around Mar 15.
Is it going to be FAS or E-Series (I do not expect more exotic options)?
On Mon, Feb 26, 2018 at 2:03 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
On Sun, Feb 25, 2018 at 8:53 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
NetApp used is the other option. I hope to decide this week. The goal is to get something running by around Mar 15.
Is it going to be FAS or E-Series (I do not expect more exotic options)?
Right now we're looking at a FAS6210 with 12 x 6TB drives. That meets the short-term need with dual controllers and lots of room to grow. $6K used. Greg
On 02/25/2018 09:42 AM, L A Walsh wrote:
The sas enclosures can be daisy chained for up to 8 enclosures (might be more), so could have 192tB of storage w/4tB disks.
BTW, I've been using Seagate 10-TB disks and they've been quite reliable and fast. They don't do the shingled recording thing either. Regards, Lew
On Sun, Feb 25, 2018 at 1:01 PM, Lew Wolfgang <wolfgang@sweet-haven.com> wrote:
On 02/25/2018 09:42 AM, L A Walsh wrote:
The sas enclosures can be daisy chained for up to 8 enclosures (might be more), so could have 192tB of storage w/4tB disks.
BTW, I've been using Seagate 10-TB disks and they've been quite reliable and fast. They don't do the shingled recording thing either.
Lew, The only 10TB drives I have bought were shingled, and from time to time very slow. Is it the helium-filled ones you're buying?

Also, do you have thoughts on SAS vs. SATA for a busy fileserver? I.e., we are looking at running 11,000 tapes' worth of session scans onto this fileserver in the first 45 days of use. That's about 30 tape drives spinning 24 hours a day and seeking from one filemark to the next at high speed to get the session data. Then building a pretty good-sized database describing the sessions; we expect just the session DB to be about 2 TB. Lots of other activity will also be hitting this same NAS from other projects. We process tapes for our clients, often at very large scales - that 11,000 tapes is for a single project. Greg
On 02/25/2018 02:21 PM, Greg Freemyer wrote:
On Sun, Feb 25, 2018 at 1:01 PM, Lew Wolfgang <wolfgang@sweet-haven.com> wrote:
On 02/25/2018 09:42 AM, L A Walsh wrote:
The sas enclosures can be daisy chained for up to 8 enclosures (might be more), so could have 192tB of storage w/4tB disks.
BTW, I've been using Seagate 10-TB disks and they've been quite reliable and fast. They don't do the shingled recording thing either. Lew,
The only 10TB I have bought were shingled and from time to time very slow.
I checked the detailed specs with Seagate and determined that they aren't shingled. We purchased 7 of them for testing before buying more. Configured as a six-disk RAID6 array with a hot spare, we get about 1.6-GB/sec write rates. This is with lots of 4-GB files of random data. I can get more measurements tomorrow if you're interested. We have another system with a 23-disk RAID6 array using the 10-TB disks that I can't access now. This is with an LSI SAS 9380-8e controller and the SAS versions of the disks.
is it the helium-filled you're buying?
Yes. The cover is welded in place.
Also, do you have thoughts of SAS v SATA for a busy fileserver?
I've been told that SAS is better for a busy server, but I never took the time to quantify the difference. Note, though, that you can get 12-Gb/sec with SAS but only 6-Gb/sec with SATA, so that might make a difference. For the record, we use Supermicro 4U chassis. The newest server is the one I can't reach right now; it has five of the 4U slave chassis, and one 4U with the processor. The slaves have 44 3.5-in hotswap bays, but we haven't filled them all yet. Each slave chassis has its own LSI RAID controller with battery-backed cache. It's a monster of a system. I'll get some numbers tomorrow. Regards, Lew
Le 26/02/2018 à 02:10, Lew Wolfgang a écrit :
LSI RAID controller with battery cache backup. It's a monster of a system. I'll get some numbers tomorrow.
I read this thread a bit like I watch "Black Mirror" :-), but keep on, maybe we will have a similar one in some years :-) jdd -- http://dodin.org
On 25/02/18 17:42, L A Walsh wrote:
I put 24 4TB Hitachi Ultrastars in them about 5-6 years ago. Have them set up as a RAID10, so only 48TB available storage, but if bought today, 6TB disks might be a better buy.
Someone else will probably have to write the code, but I've been thinking about spec'ing a raid-60 or 61 mode for md-raid. 24 drives would make sense as three 8-drive raid-6 arrays, so you'd have 18 data disks and 6 parity disks. But this is a mode like linux raid-10 (which is *not* raid-1+0), so your data and parity would be scattered over all 24 drives. The important point of this is that if a drive fails, it gets rebuilt from all the other drives, and doesn't hammer just a few of them. If that sounds like a good idea, I'm told it's probably only a minor mod to implement it ... Cheers, Wol
On Sun, Feb 25, 2018 at 5:25 PM, Wol's lists <antlists@youngman.org.uk> wrote:
On 25/02/18 17:42, L A Walsh wrote:
I put 24 4TB Hitachi Ultrastars in them about 5-6 years ago. Have them set up as a RAID10, so only 48TB available storage, but if bought today, 6TB disks might be a better buy.
Someone else will probably have to write the code, but I've been thinking about spec'ing a raid-60 or 61 mode for md-raid. 24 drives would make sense as 3 8-drive raid-6, so you'd have 18 data disks and 6 parity disks.
But this is a mode like linux raid-10 (which is *not* raid-1+0), so your data and parity would be scattered over all 24 drives. The important point of this is that if a drive fails, it gets rebuilt from all the other drives, and doesn't hammer just a few of them.
If that sounds like a good idea, I'm told it's probably only a minor mod to implement it ...
Cheers, Wol
Wol, Interesting idea about raid 60 and/or 61. In addition to the high-speed NAS / iSCSI server I need to build, I should probably build a 100TB archive NAS. Let's say a single raid 6 with 12 x 10TB drives. That seems to be about the limit of what I would want in a single raid 6, so raid 60 is at least interesting for volumes above 100TB. Even with a 12 x 10TB raid 6, I can't imagine how slow the rebuilds would be. Days? A week? Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be greatly faster than on just a 6 (or 60). That assumes the failed drive could just be copied over from the mirror pair of that drive. FYI: I've had Raid 6 rebuilds with 5 x 10TB shingled drives take several days. It's pretty concerning to realize that with RAID 6 on 10TB drives, a couple of disk issues a couple of days apart are enough to put your data at serious risk. Greg
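As a back-of-envelope sketch of the "Days? A week?" question above (the 150 MB/s and 20 MB/s sustained rates are assumed figures for illustration, not measurements):

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours to re-write one full drive at a sustained rate."""
    return capacity_tb * 1e6 / rate_mb_s / 3600

# A healthy 10TB drive rebuilt at an assumed ~150 MB/s sustained:
print(round(rebuild_hours(10, 150), 1))        # ~18.5 hours

# The same drive throttled to an assumed ~20 MB/s by array load or
# SMR write stalls:
print(round(rebuild_hours(10, 20) / 24, 1))    # ~5.8 days
```

So "several days" for a shingled-drive rebuild is exactly what this arithmetic predicts once the sustained rate collapses.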
On 27/02/18 00:57, Greg Freemyer wrote:
Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be greatly faster than on just a 6 (or 60). That assumes the failed drive could just be copied over from the mirror pair of that drive.
Actually, it would be faster even than that. Do you know the difference between Raid-1+0 and linux md-raid-10? linux-10 has the disadvantage (at least from the developer's point of view) that the drives are mirrors of each other, and thus rebuilding one drive places a lot of stress on said mirror. The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror! So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one. The standard algorithm would hammer one other drive and quite possibly tip that over the edge too. The only snag with my algorithm is that, iirc, you can get a pathological failure if you don't have at least twice the drives. So an 8,2 setup might need 33 drives for the algorithm to work. Of course, if that's the case, it would fall back to a simpler algorithm, probably the one that leads to a mirror. Or at least for raid-6, it would know that if all copies of a block were stored on the one drive, it could rebuild that block from parity. But that's not a good idea :-( Cheers, Wol
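The distributed-rebuild arithmetic in that paragraph, as a one-liner (20 drives and a 10TB failed drive, as in the example above):

```python
failed_tb, n_drives = 10, 20
# With blocks scattered over all drives, rebuilding the failed drive
# reads a small slice from every survivor instead of hammering one mirror.
per_survivor_tb = failed_tb / (n_drives - 1)
print(round(per_survivor_tb, 2))   # roughly the "0.5TB from EVERY other drive"
```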
On Tue, Feb 27, 2018 at 4:57 PM, Wols Lists <antlists@youngman.org.uk> wrote:
On 27/02/18 00:57, Greg Freemyer wrote:
Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be greatly faster than on just a 6 (or 60). That assumes the failed drive could just be copied over from the mirror pair of that drive.
Actually, it would be faster even than that. Do you know the difference between Raid-1+0, and linux md-raid-10? linux-10 has the disadvantage (at least from the developer's point of view) that the drives are mirrors of each other, and thus rebuilding one drive places a lot of stress on said mirror.
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror!
Unusual!
So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one.
Say what? Putting thinking hat on! Whoa, that is very cool if I have it right!
The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The only snag with my algorithm is that, iirc, you can get a pathological failure if you don't have at least twice the drives. So an 8,2 setup might need 33 drives for the algorithm to work.
I'm confused here. Let's say I decide to be intentional about building an 80TB usable LV with your setup. If I use 10TB drives, does that mean I'd have to buy 33 x 10TB drives? At $400/drive, that's $13.2K just for the drives (chassis, controllers, etc. not included). That seems like a lot of money for 80TB usable.
Of course, if that's the case, it would fall back to a simpler algorithm, probably the one that leads to a mirror. Or at least for raid-6, it would know that if all copies of a block were stored on the one drive, it could rebuild that block from parity. But that's not a good idea :-(
Cheers, Wol
Greg
On 28/02/18 20:54, Greg Freemyer wrote:
On Tue, Feb 27, 2018 at 4:57 PM, Wols Lists <antlists@youngman.org.uk> wrote:
On 27/02/18 00:57, Greg Freemyer wrote:
Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be greatly faster than on just a 6 (or 60). That assumes the failed drive could just be copied over from the mirror pair of that drive.
Actually, it would be faster even than that. Do you know the difference between Raid-1+0, and linux md-raid-10? linux-10 has the disadvantage (at least from the developer's point of view) that the drives are mirrors of each other, and thus rebuilding one drive places a lot of stress on said mirror.
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror!
Unusual!
So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one.
Say what?
Putting thinking hat on!
Whoa, that is very cool if I have it right!
Somebody posted to the linux-raid list about a CRUSH algorithm, I think it was called. This enables you to spec local storage, different controllers, network storage etc, and ensure that blocks are scattered over all of them. The intent was that you could lose a controller, or a network link, or whatever, and still guarantee that a complete stripe of blocks could be found elsewhere. But I get the impression that it's computationally expensive - I wanted a simple algorithm that got you most of the benefits for a tiny fraction of the cost.
The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The only snag with my algorithm is that, iirc, you can get a pathological failure if you don't have at least twice the drives. So an 8,2 setup might need 33 drives for the algorithm to work.
I'm confused here.
If the number of drives is high enough, it's easy to prove that the pathological setup cannot occur. Unfortunately, every simulation I've run with less than that IS pathological :-( (By that, I mean that a single drive failure could destroy all copies of some blocks :-( )
Let's say I decide to be intentional about building an 80TB usable LV with your setup. If I use 10TB drives, does that mean I'd have to buy 33 x 10TB drives? At $400/drive, that's $13.2K just for the drives (chassis, controllers, etc. not included). That seems like a lot of money for 80TB usable.
I'm trying to remember my maths. That's 8 drives for data plus 2 parity, twice. 20 drives. You would need either 21 or 41 drives. But 41 sounds wrong, it should certainly work with 31. It should be possible to do it with 21, maybe I just need to improve my algorithm.
Of course, if that's the case, it would fall back to a simpler algorithm, probably the one that leads to a mirror. Or at least for raid-6, it would know that if all copies of a block were stored on the one drive, it could rebuild that block from parity. But that's not a good idea :-(
I attach my test code. Have a play. Note that you need to make sure that the primes aren't pathological - they must not be a factor of any of the other numbers. Any queries, ask and I'll try to remember what I was doing. There should be an email from me on the raid list that explains it all; I'll hunt it up later, but it's now my bed time ... :-) Cheers, Wol
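Wol's attached test code isn't reproduced in the archive, so as a hypothetical stand-in, here is a minimal sketch of a prime-stride scattering scheme and the pathology check he describes (the placement rule, the +1 offset, and the strides 3 and 7 are guesses for illustration, not his actual algorithm):

```python
def copies(block: int, n_drives: int, p1: int = 3, p2: int = 7):
    """Physical drives holding the two copies of a logical block.

    Each copy walks the drives with a different prime stride; the +1
    offset keeps block 0's two copies apart.
    """
    return (block * p1) % n_drives, (block * p2 + 1) % n_drives

def pathological(n_drives: int, n_blocks: int = 1000) -> bool:
    """True if some block's two copies share a physical drive, i.e. a
    single drive failure could destroy both copies of that block."""
    return any(a == b for a, b in (copies(blk, n_drives) for blk in range(n_blocks)))

print(pathological(20))   # False: this layout scatters safely over 20 drives
print(pathological(21))   # True: block 5 puts both copies on drive 15
```

The point matches the remark about the primes: whether a drive count is pathological depends on how the strides interact with the number of drives, so each combination has to be checked.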
On 28/02/18 21:21, Wol's lists wrote:
Let's say I decide to be intentional about building an 80TB usable LV with your setup. If I use 10TB drives, does that mean I'd have to buy 33 x 10TB drives? At $400/drive, that's $13.2K just for the drives (chassis, controllers, etc. not included). That seems like a lot of money for 80TB usable.
I'm trying to remember my maths. That's 8 drives for data plus 2 parity, twice. 20 drives. You would need either 21 or 41 drives. But 41 sounds wrong, it should certainly work with 31. It should be possible to do it with 21, maybe I just need to improve my algorithm.
I think I've sussed what was up with my brain-fade - I'm getting my logical and physical drive sizes mixed up. If we created our raid-61 as a 7,2 with let's say 22 drives, that would splatter 14 logical 16(ish) tera drives across 22 physical 10tera drives. Adding a new physical drive would increase the size of each logical drive by 7-800gig. At the end of the day this is all trade-offs - there comes a point where things don't add up because you've got too few drives (you might need to go 6,2 to get the non-pathological algorithm, which increases your need for drives because of the increased parity etc etc), or the number of drives might be so high as to be unmanageable. This algorithm was really conceived for arrays with about 100 disks! Cheers, Wol
This algorithm was really conceived for arrays with about 100 disks!
I don't think I'm going there anytime soon! But, it's not totally crazy to buy this 48-slot chassis and fill it up with 2TB drives. http://www.chenbro.com/en-global/products/RackmountChassis/4U_Chassis/NR4070... 2TB drives for $60 are somewhat common, so for about $5K you would have a fairly large spindle-count array for testing your code. What capacity LV would that make with your raid 61 variant? Does your variant require all drives be the same size? If not, does the code support reshaping to let the LV slowly grow over time? I.e., start with 48 x 2TB drives, then replace some with 4TB to grow the LV while maintaining the data? If the code were stable, I might consider doing that in the future. If you wanted sub-1TB drives to test with, these guys sell used drives, but you would need to call them: They have inventory in Atlanta and Las Vegas. I bought a bunch (qty 50?) of 40GB drives from them about 8 years ago. They won't have drives that small now, of course. https://www.usmicrocorpretail.com/ fyi: I used to audit their disk wiping process for used drives. Every drive gets a 3-pass wipe minimum before it is resold. They have thousands of used (and wiped) drives in inventory. I don't think they sell low volumes (1 or 2 at a time), but I think they would sell 48 in a whack. Greg
On 01/03/18 12:20, Greg Freemyer wrote:
This algorithm was really conceived for arrays with about 100 disks!
I don't think I'm going there anytime soon!
But, it's not totally crazy to buy this 48-slot chassis and fill it up with 2TB drives.
http://www.chenbro.com/en-global/products/RackmountChassis/4U_Chassis/NR4070...
Does it work on a 240V supply? :-)
2TB drives for $60 are somewhat common, so for about $5K you would have a fairly large spindle count array for testing your code.
That would be brilliant if I could afford it. I don't know as it would be much use to me at home though ...
What capacity LV would that make with your raid 61 variant?
Okay, that's 48TB for a logical raid-6. That gives me 6 logical 8TB drives (or 8 x 6TB) so I could create a 32 or 36TB array. Doesn't sound a lot for 96TB of actual disk, but a mirror would be 48TB so I'm losing maybe 12TB (25%) for parity data - not bad.
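The capacity arithmetic above, spelled out (sizes from the 48 x 2TB example in the thread):

```python
n_drives, drive_tb = 48, 2
per_side_tb = n_drives * drive_tb / 2      # mirror layer halves 96TB to 48TB

def raid6_usable(n_logical: int, logical_drive_tb: float) -> float:
    """raid-6 gives up two drives' worth of capacity to parity."""
    return (n_logical - 2) * logical_drive_tb

print(raid6_usable(8, per_side_tb / 8))    # 36.0 TB with 8 logical 6TB drives
print(raid6_usable(6, per_side_tb / 6))    # 32.0 TB with 6 logical 8TB drives
```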
Does your variant require all drives be the same size? If not, does the code support reshaping to let the LV slowly grow over time? I.e., start with 48 x 2TB drives, then replace some with 4TB to grow the LV while maintaining the data? If the code were stable, I might consider doing that in the future.
I don't think raid-10 needs all the drives to be the same size, and this code will probably piggy-back on that code. That said, mixing different size drives is likely to cause the algorithm to go loopy ... and if your logical array is eight drives, you'd probably need to replace 8 drives at a time for any semblance of safety.
If you wanted sub-1TB drives to test with, these guys sell used drives, but you would need to call them: They have inventory in Atlanta and Las Vegas. I bought a bunch (qty 50?) of 40GB drives from them about 8 years ago. They won't have drives that small now, of course.
https://www.usmicrocorpretail.com/
fyi: I used to audit their disk wiping process for used drives. Every drive gets a 3-pass wipe minimum before it is resold. They have thousands of used (and wiped) drives in inventory. I don't think they sell low volumes (1 or 2 at a time), but I think they would sell 48 in a whack.
I think trans-atlantic shipping would bump the price up :-) I used to wipe friends' drives by shoving them in one of my systems, converting from fat/ntfs to ext, and hammering the drive. At work years ago, we just used a slack dvd and "dd if=/dev/zero of=/dev/sda", except my colleagues didn't bother to let it run. Maybe 1/2hr and kill it, but that would have done a reasonable job of trashing it - especially for someone who doesn't really know how to try and recover anything. (They drew lots who got which machine, and I got the machine I wanted - I think it had three or four 18GB WD Bigfoot drives, which were pretty massive back then...) But thanks for the info - I might try and get some kind corp to donate me one of those to play with :-) Cheers, Wol
On Thu, Mar 1, 2018 at 8:39 AM, Wols Lists <antlists@youngman.org.uk> wrote:
On 01/03/18 12:20, Greg Freemyer wrote:
This algorithm was really conceived for arrays with about 100 disks!
I don't think I'm going there anytime soon!
But, it's not totally crazy to buy this 48-slot chassis and fill it up with 2TB drives.
http://www.chenbro.com/en-global/products/RackmountChassis/4U_Chassis/NR4070...
Does it work on a 240V supply? :-)
It's a modular design, so I'm sure you can order it as desired. It can hold 4 PSUs, so 2 hot / 1 spare, or 3 hot / 1 spare.
2TB drives for $60 are somewhat common, so for about $5K you would have a fairly large spindle count array for testing your code.
That would be brilliant if I could afford it. I don't know as it would be much use to me at home though ...
What capacity LV would that make with your raid 61 variant?
Okay, that's 48TB for a logical raid-6. That gives me 6 logical 8TB drives (or 8 x 6TB) so I could create a 32 or 36TB array. Doesn't sound a lot for 96TB of actual disk, but a mirror would be 48TB so I'm losing maybe 12TB (25%) for parity data - not bad.
That's actually not bad at all for what you're doing. I must have misunderstood earlier comments.
Does your variant require all drives be the same size? If not, does the code support reshaping to let the LV slowly grow over time? I.e., start with 48 x 2TB drives, then replace some with 4TB to grow the LV while maintaining the data? If the code were stable, I might consider doing that in the future.
I don't think raid-10 needs all the drives to be the same size, and this code will probably piggy-back on that code.
That said, mixing different size drives is likely to cause the algorithm to go loopy ... and if your logical array is eight drives, you'd probably need to replace 8 drives at a time for any semblance of safety.
That's acceptable for lots of use cases, probably including mine. (Note I currently have 2 different needs for a storage sub-system. One is for an extremely fast database back-end that will get pounded 24-hours a day. One is for a lightly used archive server. I've used a Drobo in the past for an archive server, and as slow as it is, it worked fine.)
If you wanted sub-1TB drives to test with, these guys sell used drives, but you would need to call them: They have inventory in Atlanta and Las Vegas. I bought a bunch (qty 50?) of 40GB drives from them about 8 years ago. They won't have drives that small now, of course.
https://www.usmicrocorpretail.com/
fyi: I used to audit their disk wiping process for used drives. Every drive gets a 3-pass wipe minimum before it is resold. They have thousands of used (and wiped) drives in inventory. I don't think they sell low volumes (1 or 2 at a time), but I think they would sell 48 in a whack.
I think trans-atlantic shipping would bump the price up :-)
I assume similar companies exist in Europe. Their real business is data destruction. They go into facilities with lots of confidential data and wipe the computers before anything leaves the building. They typically do that in exchange for getting the computers for free. They make their income by selling the refurbished used machines. INOP drives get run through a chipper/shredder.
I used to wipe friends' drives by shoving them in one of my systems, converting from fat/ntfs to ext, and hammering the drive.
At work years ago, we just used a slack dvd and "dd if=/dev/zero of=/dev/sda", except my colleagues didn't bother to let it run. Maybe 1/2hr and kill it, but that would have done a reasonable job of trashing it - especially for someone who doesn't really know how to try and recover anything.
Nothing wrong with using dd for that, but you do have to let it run. Some filesystems start writing data 1/3rd of the way in and grow towards the front and back ends. A 30-minute wipe may not even get to the real data. I had one matter where a Windows NTFS system was "overwritten" with a Linux system. I was able to recover the full Windows system and testify as to what it contained. FYI: the Linux system was installed, but barely used. My impression is they were trying to destroy the data on the NTFS filesystem, but they failed.
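The "let it run" point can be put in numbers; a quick sketch assuming a ~100 MB/s sequential write rate (an illustrative figure, not a measured one):

```python
def wiped_fraction(minutes: float, rate_mb_s: float, disk_gb: float) -> float:
    """Fraction of the disk a partial dd pass actually overwrites."""
    return minutes * 60 * rate_mb_s / (disk_gb * 1000)

# A 30-minute dd killed early covers only the first ~9% of a 2TB disk,
# nowhere near data that starts a third of the way in.
print(wiped_fraction(30, 100, 2000))   # 0.09
```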
But thanks for the info - I might try and get some kind corp to donate me one of those to play with :-)
Good Luck on that! Seriously, it seems like a reasonable trade. Someone provides you $5K or so of equipment, and you provide the world with your Raid 61 software.
Cheers, Wol
Greg
On 28.02.2018 00:57, Wols Lists wrote:
On 27/02/18 00:57, Greg Freemyer wrote:
Raid 61 would also be interesting. It seems to me the rebuild time on a raid 61 could be greatly faster than on just a 6 (or 60). That assumes the failed drive could just be copied over from the mirror pair of that drive.
Actually, it would be faster even than that. Do you know the difference between Raid-1+0, and linux md-raid-10? linux-10 has the disadvantage (at least from the developer's point of view) that the drives are mirrors of each other, and thus rebuilding one drive places a lot of stress on said mirror.
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror! So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one. The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The rebuild of a physical drive is limited by that very drive. There are implementations that have a "virtual spare", distributing spare sectors across all drives; the rebuild can then be parallelized onto this "spare". This actually dramatically reduces the time to restore full array redundancy. The failed drive still needs to be rebuilt when it is replaced, but that happens while the array is already fully functional and redundant, and so is not time critical.
The only snag with my algorithm is that, iirc, you can get a pathological failure if you don't have at least twice the drives. So an 8,2 setup might need 33 drives for the algorithm to work. Of course, if that's the case, it would fall back to a simpler algorithm, probably the one that leads to a mirror. Or at least for raid-6, it would know that if all copies of a block were stored on the one drive, it could rebuild that block from parity. But that's not a good idea :-(
Cheers, Wol
On 01/03/18 03:59, Andrei Borzenkov wrote:
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror! So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one. The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The rebuild of a physical drive is limited by that very drive. There are implementations that have a "virtual spare", distributing spare sectors across all drives; the rebuild can then be parallelized onto this "spare". This actually dramatically reduces the time to restore full array redundancy.
The failed drive still needs to be rebuilt when it is replaced, but that happens while the array is already fully functional and redundant, and so is not time critical.
That's called raid-6 :-) Raid-6 is still fully redundant with one failed drive. If you mirror it to raid-61, that means you can lose at least four drives and still not lose data. My algorithm would probably reduce the risk to the array (the old drives don't get hammered in a rebuild) but I don't see how it could speed the rebuild up that much, as the bottleneck is the drive being rebuilt, and there's absolutely nothing you can do about that. Cheers, Wol
On Thu, Mar 1, 2018 at 12:37 PM, Wols Lists <antlists@youngman.org.uk> wrote:
On 01/03/18 03:59, Andrei Borzenkov wrote:
The point of the work I've spec'd is that the blocks are scattered according to a pseudo-random algorithm, such that there is no such mirror! So if you have say 20 drives, with your raid-61 configured as 8,2, that would mean you have two logical 8-drive raid-6 arrays, mirrored. But the blocks are scattered at random across your 20 drives. So if a drive fails, let's say it's 10TB, the rebuild can copy 0.5TB from EVERY other drive, and rebuild the failed one. The standard algorithm would hammer one other drive and quite possibly tip that over the edge too.
The rebuild of a physical drive is limited by that very drive. There are implementations that have a "virtual spare", distributing spare sectors across all drives; the rebuild can then be parallelized onto this "spare". This actually dramatically reduces the time to restore full array redundancy.
The failed drive still needs to be rebuilt when it is replaced, but that happens while the array is already fully functional and redundant, and so is not time critical.
That's called raid-6 :-)
No.
Raid-6 is still fully redundant with one failed drive. If you mirror it to raid-61, that means you can lose at least four drives and still not lose data.
If you mirror it, you can lose half of all drives.
My algorithm would probably reduce the risk to the array (the old drives don't get hammered in a rebuild) but I don't see how it could speed the rebuild up that much, as the bottleneck is the drive being rebuilt, and there's absolutely nothing you can do about that.
As you wish. I'm not trying to sell you anything.
Andrei, et al -- ...and then Andrei Borzenkov said... % % On Thu, Mar 1, 2018 at 12:37 PM, Wols Lists <antlists@youngman.org.uk> wrote: ... % > Raid-6 is still fully redundant with one failed drive. If you mirror it % > to raid-61, that means you can lose at least four drives and still not % > lose data. % % If you mirror it, you can lose half of all drives. [snip] Well, hold on... You could either mirror each drive A1-A2, B1-B2, C1-C2, ... and then stripe all #1 into array 1 and mirror it onto array #2, and that would indeed give you the ability to lose many drives. You might even simply mirror A1-A2, B1-B2, C1-C2, ... and then build a large volume from each of those mirrored devices. But if you create your large array #1 first using RAID whatever and then create another #2 and mirror that way, then you can lose only the redundancy from each side before losing the entire stripe (#1 or #2). In the worst case, if it were RAID 0, then losing a single drive would destroy the array. So do you mirror and then stripe, or do you stripe and then mirror? How many drives you can lose where will be different depending on the approach. This has all been delightfully fascinating for me, one of those sad household users stuck behind a small pocketbook :-)/2 Thank you all! HTH & HAND :-D -- David T-G See http://justpickone.org/davidtg/email/ See http://justpickone.org/davidtg/tofu.txt
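The mirror-then-stripe vs stripe-then-mirror question above can be checked exhaustively; a small sketch for 8 drives (4 mirror pairs vs two mirrored 4-drive stripes; the drive-numbering conventions are mine, for illustration):

```python
from itertools import combinations

def raid10_survives(failed: set, n_pairs: int) -> bool:
    """Mirror pairs (2i, 2i+1), then stripe: dies only if a pair loses both."""
    return all(not {2 * i, 2 * i + 1} <= failed for i in range(n_pairs))

def raid01_survives(failed: set, n_pairs: int) -> bool:
    """Stripe A = drives 0..n-1, stripe B = n..2n-1, mirrored: needs one intact stripe."""
    a, b = set(range(n_pairs)), set(range(n_pairs, 2 * n_pairs))
    return not (failed & a) or not (failed & b)

n = 4  # 8 drives total
for k in (2, 3, 4):
    sets = [set(s) for s in combinations(range(2 * n), k)]
    print(k,
          sum(raid10_survives(s, n) for s in sets),
          sum(raid01_survives(s, n) for s in sets))
```

Mirror-then-stripe survives far more random failure combinations, exactly as the email suspects; stripe-then-mirror only survives when all failures happen to land in the same stripe.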
On 01/03/18 11:47, David T-G wrote:
So do you mirror and then stripe, or do you stripe and then mirror? How many drives you can lose where will be different depending on the approach.
Or do you do as linux does with its raid-10 mode, which is neither stripe nor mirror ... https://en.wikipedia.org/wiki/Non-standard_RAID_levels#LINUX-MD-RAID-10 That's what's behind my raid-61, where the simple mechanics of 6 & 1 guarantee you can lose at least 4 drives, but the scattering action means you can probably survive losing a lot more. Have you been reading the linux raid wiki? :-) https://raid.wiki.kernel.org/index.php/Linux_Raid Cheers, Wol
[[[ SIDE NOTE:... before I forget ;-), the unit I describe below is different from a "dedicated storage subsystem" in that it is my home server that also manages my internet connections (web proxy, routing, email, file serving, etc.) all running via openSUSE. That it is critical for so many things is a main reason I have been reluctant to go with less tested and less flexible alternatives. In emergency situations, I've been able to boot from a SUSE rescue CD OR boot in single-user directly from the disk, and bring up the system, service-by-service, by hand directly from the HW-init boot step. This allowed me to make temporary patches or do what was necessary to get my system back to normal running, and to then examine more permanent corrections at leisure. Sometimes, I had a hand-booted system staying up for weeks as I didn't want to address boot problems at that moment. The fact that I couldn't do something simple like hand-load a driver (via modprobe) in a shell and continue the boot process was (is?) a major reliability issue in some other boot & service managers. I still see that as an issue -- at least on a piecemeal system such as linux has been. Background of the advice I was given: This started on the xfs list with me wondering about the effect of spindle numbers on IOPS and performance. I tried to trim old HW stuff and non-perf/raid info. The whole discussion would be in the xfs archives from 2013 if you want the original text... Stan said:
Hey Linda, if you're going to re-architect your storage, the first thing I'd do is ditch that RAID50 setup. RAID50 exists strictly to reduce some of the penalties of RAID5. But then you find new downsides specific to RAID50, including the alignment issues you mentioned.
(my RAID50 alignment was 768K) I was wondering about how I might increase my random I/O performance and gave some specs about my setup back then. At the time I had a 9280 LSI card, compared to the 9286 I have now. Main diffs or features of the 9286 over the previous card: * - 2 cpu's vs. 1; * - 12Gb bus (vs. 6Gb), and * - 4k sector support. The cpu thing -- was mostly about supporting multi-checksum RAID configs like RAID50 (mentioned below). Most of the rest of this was from Stan with exact text, publicly available in the xfs archives. -------- Original Message -------- Subject: Re: RAID setups, usage, Q's' effect of spindle groups... Date: Mon, 21 Jan 2013 07:38:09 -0600 From: Stan Hoeppner To: Linda Walsh CC: xfs-oss The 2108 ROC ASIC in the 9280 doesn't have sufficient horsepower for good performance with dual parity arrays, but that pales in comparison to the performance drop due to the RMW-induced seek latency.
Not to mention the diskspace hit; RAID10 would be a bit too decadent for my usage/budget.
When one perceives the capacity overhead of RAID1/10 as an intolerable cost, instead of a benefit, one is forever destined to suffer from the poor performance of parity RAID schemes. ...
On #3, currently using 12.31 TB in 20 partitions ...details elided....
So you have 24x 2TB 7.2K SATA drives total in two 630Js, correct?
I was mostly interested in how increasing the number of spindles in a RAID50 would help parallelism
[how would this help performance overall...] The answer is simple too: Parity RAID sucks. If you want anything more than a trivial increase in performance, you need to ditch parity RAID. Given the time and effort involved in rearranging all of your disks to get one or two more RAID5 arrays with fewer disks per array into a RAID50, it doesn't make sense to do so when you can simply create one large RAID10, and be done monkeying around and second guessing. You'll have the performance you're seeking. Actually far, far more.
Consider this -- my max read and write (both), on my large array is 1GB/s. There's no way I could get that with a RAID10 setup without a much larger number of disks.
On the contrary. The same disks in RAID10 will walk all over your RAID50 setup. Let's discuss practical use and performance instead of peak optimums, shall we? Note that immediately below I'm simply educating you, not recommending a 12-drive RAID10. Recommendations come later.

In this one array you have 12 drives, 3x 4-drive RAID5 arrays in RAID50, for 9 effective data spindles. An equivalent 12-drive RAID10 would yield 6 data spindles. For a pure streaming read workload with all drives evenly in play, the RAID50 might be ~50% faster. For a purely random read workload, about the same, although in both cases 50x or more slower than the streaming read case due to random seeks. With a pure streaming allocation write workload with perfect stripe filling and no RMW, the RAID50 will be faster, but by less than the 50% above due to parity calcs in the ASIC.

Now it gets interesting. With a purely random, non-aligned, non-allocation write workload on the RAID50, RMW cycles will abound, driving seek latency through the roof while the ASIC performs a parity calc on each stripe update. Throughput here will be in the low tens of MB/s, tops. RAID10 simply writes each sector--done. Throughput will be in the high tens to 100s of MB/s. So in this scenario RAID10 will be anywhere from 5-10x or more faster, depending on the distribution of the writes across the drives. Another factor here is that RMW reads from the disks go into the LSI cache for parity recalculation, eating cache bandwidth and capacity and decreasing writeback efficiency. With RAID10 you get full cache bandwidth for sinking incoming writes and performing flush scheduling, both extremely important for random write workloads.

Food for thought: a random write workload of ~500MB with RAID10 will complete almost instantly after the controller cache consumes it. With RAID50 you have to go through hundreds or thousands of RMW cycles on the disks, so the same operation will take many minutes.
Let's look at more real-world scenarios. Take your example of the nightly background processes kicking in. This constitutes a mixed random read and write workload. In this situation every RMW can create 3 seeks per drive write: read, write, parity write. Add a seek for a pending read operation, and you have 4 seeks. But the problem isn't just the seeks; it is the inter-seek latency due to the slow 7.2K RPM platters having to spin under the head for the next read or write. This scenario makes scheduling, in the controller and in the drives themselves, very difficult, adding more latency. With RAID10 in this scenario, you simply have write/read/write/read/etc. You're not performing 2 extra seeks for each write, so you're not incurring that latency between operations, nor the scheduling complexity, thus driving throughput much higher. In this scenario, the 6-disk RAID10 may be 10x to as much as 50x faster than the RAID50, depending on the access/seek patterns.

I've obviously not covered this in much technical detail, as storage behavior is quite complex. I've attempted to give you a high-level overview of the behavioral differences between parity and non-parity RAID, the potential performance differences with various workloads, and the differences between "peak" performance and actual performance. While your RAID50 may have greater theoretical peak streaming performance, the RAID10 will typically, literally, run circles around it with most day-to-day mixed-IO workloads. While the RAID50 may have a peak throughput of ~1GB/s, it may only attain that 1-10% of the time. The RAID10 may have a peak throughput of "only" ~700MB/s, but may achieve that more than 60% of the time. And as a result its performance degradation will be much more graceful with concurrent workloads, due to the dramatically lower IO completion latencies.
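The rough arithmetic behind the RMW penalty described above can be sketched in a few lines; note the per-drive IOPS figure is an assumed typical ballpark for 7.2K SATA drives, not a number from the thread:

```python
# Back-of-envelope random-write model for the 12-drive comparison above.
# Assumption: ~75 random IOPS per 7.2K RPM SATA drive (typical ballpark, not measured).
DRIVE_IOPS = 75

def raid10_random_write_iops(drives: int) -> float:
    # Each logical write costs 2 disk writes (one per mirror half).
    return drives * DRIVE_IOPS / 2

def raid5_random_write_iops(drives: int) -> float:
    # Each small write costs 4 disk I/Os: read data, read parity,
    # write data, write parity -- the read-modify-write cycle.
    return drives * DRIVE_IOPS / 4

print(raid10_random_write_iops(12))  # -> 450.0
print(raid5_random_write_iops(12))   # -> 225.0
```

Even this simple model, which ignores the extra seek latency and cache pressure described above, puts RAID10 at twice the small-write throughput; the seek pile-up widens the real-world gap well beyond 2x.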
Though I admit, concurrency would rise... but I generate most of my workload, so usually I don't have too many things going on at the same time... a few maybe...
But I'd guess it's at times like this when you bog down the RAID50 with mixed workloads and become annoyed. You typically don't see that with the non-parity arrays.
When an xfs_fsr kicks in and starts swallowing disk-cache, *ahem*, and the daily backup kicks in, AND the daily 'rsync' to create a static snapshot... things can slow down a bit.. but rare am I up at those hours...
And this is one scenario where the RAID10 would run circles around the RAID50.
You'll need more drives to maintain the same usable capacity,
(oh, a minor detail! ;^))...
Well, how much space do you really need in a one-person development operation plus home media/etc. storage system? 10TB, 24TB, 48TB? Assuming you have both 630Js filled with 24x 2TB drives, that's 48TB raw. If you have 6x 4-drive RAID5s in multiple RAID50 spans, you have 18x 2TB = 36TB of capacity. Your largest array is 12 drives with 9 effective spindles of throughput. You've split up your arrays for different functions, limiting some workloads to fewer spindles of performance and leaving spindles sitting idle that could otherwise be actively adding performance to active workloads. You've created partitions directly on the array disk devices and have various LVM devices and filesystems on those for various purposes, again limiting some filesystems to less performance than your total spindles can give you.

The change I recommend you consider is something similar to what we do with SAN storage consolidation. Create a single large-spindle-count non-parity array on the LSI. In this case that would be a 24-drive RAID10 with a strip (sunit) of 32KB, yielding a stripe width (swidth) of 384KB, which should work very well with all of your filesystems and workloads, giving a good probability of full stripe writes. You'd have ~24TB of usable space. All of your workloads would have 12 spindles of non-parity performance, peak streaming read/write of ~1.4GB/s, and random read/write mixed-workload throughput of a few hundred MB/s, simply stomping what you have now. You'd be very hard pressed to bog down this 12-spindle non-parity array. Making a conservative guesstimate, I'd say the mixed random IO throughput would be on the order of 30x-50x that of your current RAID5/50 arrays combined.

In summary, you'd gain a staggering performance increase you simply wouldn't have considered possible with your current hardware. You'd "sacrifice" 12TB of your 48TB of raw space to achieve it. That 30-50x increase in random IOPS is exactly why many folks gladly "waste money on extra drives".
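The geometry in that recommendation can be checked with a few lines; the drive count, drive size, and 32KB strip are all figures from the paragraph above:

```python
# Geometry of the recommended 24-drive RAID10 (figures from the thread).
drives, drive_tb, strip_kib = 24, 2, 32

data_spindles = drives // 2             # mirroring halves the effective spindles
swidth_kib = data_spindles * strip_kib  # stripe width = strip size * data spindles
usable_tb = data_spindles * drive_tb

print(data_spindles, swidth_kib, usable_tb)  # -> 12 384 24
```

On XFS these values would map to something like `mkfs.xfs -d su=32k,sw=12`, so the filesystem aligns allocations to the array geometry.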
After you see the dramatic performance increase you'll wonder why you ever considered spending money on high RPM SAS drives to reduce RAID5 latency. Put these 24 7.2K SATA drives in this RAID10 up against 24 15K SAS drives in a 6x4 RAID50. Your big slow Hitachis will best the nimble SAS 15ks in random IOPS, probably by a wide margin. Simply due to RMW. Yes, RMW will hammer 15K drives that much. RMW hammers all spinning rust, everything but SSDs.
[wondering about increasing spindle account and effect on perf in RAID50]
Optimizing the spindle count of constituent RAID5s in a RAID50 to gain performance is akin to a downhill skier manically waxing his skis every day, hoping to shave 2 seconds off a 2 minute course.
Thanks for any insights...(I'm always open to learning how wrong I am! ;-))...
If nothing else, I hopefully got the point across as to how destructive parity-RAID read-modify-write operations are to performance. It's simply impossible to get good mixed-IO performance from parity RAID, unless one's workloads always fit in controller write cache, or one has SSD storage.

=== comment about piecemeal increases in a 24-unit disk housing (same author):

A trap many/most home users fall into is buying such a chassis and 4 drives in RAID5, then adding sets of 4 drives in RAID5 as "budget permits", ending up with 6 separate arrays and thus 6 different data silos, each of low performance. This may be better/easier on the wallet for those who can't/don't plan and save, but in the end they have 1/6th of their spindle performance for any given workload, and a difficult migration path to get the actual performance the drives are capable of. Which is obviously why I recommend acquiring the end-game solution in one shot and configuring for maximum performance from day one.
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
I have some more information that I want to post, even if you don't use it now. First is on HD manufacturer reliability. Please look at the charts (they have them by quarter), but this one seems to show the last 3 years (2015-2017): https://www.extremetech.com/extreme/175089-who-makes-the-most-reliable-hard-...

I am hoping to get this off quickly, so excuse any raw wording. Out of the manufacturers that are still around today, Seagate has consistently been near the bottom in quality -- and not just over the past 3 years; this has been true since at least 2000, and probably going back to 1990 and before. The chart shows Seagate annualized failure rates as high as 29.08%. Most of its listed drives are under 3%, but nearly all are above 1%/year. Second worst is WDC -- though I believe some of their newer drives (not seen in the above chart, but in some quarterly charts) may be benefiting from their purchase of the Hitachi (HGST) business.

In this chart, Hitachi has the lowest failure rate, followed by Toshiba. This chart doesn't show disks by other manufacturers. The info about Hitachi (which was first to offer 5-year warranties on their enterprise drives -- they still do) I knew before seeing this chart. They have consistently had the highest quality. Conversely -- at least for enterprise drives -- Seagate has consistently had the worst.
From the little I've seen, Seagate drives are often as expensive as Hitachi, often more. So do yourself a favor and go with Hitachi.
I have another post about RAIDs that I'm searching my archives for. I'll get back to looking for it, but want to send this one off sooner rather than later. Drive manufacturers DO make a difference. If you don't want headaches, go w/enterprise Hitachi drives (Ultrastars). I can't speak for their consumer line's longevity, as I never had them long enough to know, but as I wrote earlier, their consumer drives were more likely to have a fairly wide spread on RPMs.
-l
L A Walsh wrote:
I have some more information, that I want to post, even if you don't use it now... .
First is on HD manufacturer reliability.
Please look at the charts (they have them by quarter), but this one seems to show the last 3 years (2015-2017):
https://www.extremetech.com/extreme/175089-who-makes-the-most-reliable-hard-...
I was just about to suggest the Backblaze statistics, then I clicked on the link :-)
In this chart, Hitachi has the lowest failure rate followed by Toshiba. This chart doesn't show disks by other manufacturers.
Hitachi hard drives are good. We started buying them when they were still IBM Ultrastars. We also use WDC RE4s; they're good too, but I don't have enough drives for any statistics.
Drive manufacturers DO make a difference. If you don't want headaches go w/Enterprise Hitachi drives (Ultrastars).
+1
--
Per Jessen, Zürich (-9.0°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
On Monday, 2018-02-26 at 17:45 -0800, L A Walsh wrote:
I have some more information, that I want to post, even if you don't use it now... .
First is on HD manufacturer reliability.
Please look at the charts (they have them by quarter), but this one seems to show the last 3 years (2015-2017):
https://www.extremetech.com/extreme/175089-who-makes-the-most-reliable-hard-...
Interesting.
I am hoping to get this off quickly, so excuse any raw wording -- but the manufacturer with the worst reliability (and not just over the past 3 years; this has been true since at least 2000 and probably going back to 1990 and before).
Out of the manufacturers that are still around today, Seagate has consistently been near the bottom in quality.
The chart shows Seagate annualized failure rates as high as 29.08%. Most of its listed drives are under 3%, but nearly all are above 1%/year.
Well, it is curious, but the only drives that failed on me were not Seagates. None of the Seagates I have had failed while in use, with perhaps two exceptions: one that was bad since day one, so I had it replaced by the shop, and another that developed failures later (start of the bathtub curve) and was replaced under warranty. Both were 500 GB units, so that was long ago. A laptop drive failed after perhaps 4 or 5 years of use, which could be normal; I replaced it, without data loss, with the same model. The worst I had was a Fujitsu, long ago. People I know have had similar experiences, so we all switched to Seagate.
In this chart, Hitachi has the lowest failure rate followed by Toshiba. This chart doesn't show disks by other manufacturers.
My normal shop doesn't even have Hitachi. It has Toshiba, though. My secondary shop has one Hitachi: 3 TB - Hitachi HGST Deskstar NAS, black, 136€
I can't speak for their consumer line's longevity, as I never had them long enough to know, but as I wrote earlier, their consumer drives were more likely to have a fairly wide spread on RPMs.
Well, of course all the disks I personally bought are consumer drives.
--
Cheers,
Carlos E. R. (from openSUSE 42.3 x86_64 "Malachite" at Telcontar)
On 27/02/18 01:45, L A Walsh wrote:
The chart shows Seagate annualized failure rates as high as 29.08%. Most of its listed drives are under 3%, but nearly all are above 1%/year.
Note that Seagate had a bad batch of disks ... Apparently the Barracudas (NOT raid-certified!!!) have a bit of a design fault that lets dust into the mechanism. I can't believe they haven't fixed this by now ... And this seems to have bitten the 3TB model extremely hard.

There's a study by some web storage company (stores your backups for you for peanuts, charges mega-bucks if you need your data back...) that uses desktop drives because they store a paranoid number of copies. They track drive reliability, and the 3TB Barracudas were just plain awful, to the extent that if one drive failed, they'd rip out the entire rack of 30 or so on the assumption that a rebuild was almost certain to tip several more over the edge.

Cheers,
Wol
On 2018-02-23 18:01, Greg Freemyer wrote:
All,
I bought a used iSCSI based Drobo in 2016 with 50TB in it. It works great, but it isn't fast by today's standards. (100 MB/sec absolute max). And I never got it to work from openSUSE (Windows and Mac did work).
I now need to buy something similar, but with fast I/O and Linux support: 10 Gbit Ethernet and fibre-channel may be my only options? Or are there point-to-point SAS external racks I should consider?
Also, I will likely want to put an SSD-based cache in the mix at some point. It could be integrated into the storage subsystem, or I could use something like bcache and have the SSD be in the server. It just needs to be reliable and fast.
Recommendations? (I'm thinking used and $10K at the high-end for the subsystem including PCIx cards and switches, if needed.)
== details
I need to buy a server with 50TB usable disk for a production environment. A high speed disk subsystem is critical. And the project may scale up over time.
It doesn't need to be a fail-over cluster, just a single server. I'm looking at used equipment most likely.
I was thinking one with 4 CPU sockets would let me start with 2 CPUs and then expand to 4 later on. Lots of RAM capacity would also be great.
My first thought was to get a Dell R920 and throw a bunch of disks in it.
24 disk slots!
But then I looked at what size drives are available for those slots: 2.5 inch. I only see 2TB drives max.
I'm hoping to use a Dell server because it's the preferred brand at my (new) job. Unless I'm missing something a standalone rack mount Dell won't work.
So, now I'm looking at a used R820 most likely, plus some sort of disk subsystem to support it. I picked it because it is only a 5-year-old design and it has 4 CPU sockets I can expand into as server demand increases over time. Also, its 3TB max RAM is way above what this will require, even long term.
https://media.adn.de/media/DE/DocLib/Poweredge_Easy_Matrix.pdf
A used, barebones R820 can be had for under $2K. Just add CPUs and ram.
Greg
Is a single node a strict requirement? Why not a small Ceph cluster: three OSD nodes, each with four 8TB Barracuda drives?
Hans
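For a rough sense of what that suggested layout yields, assuming Ceph's default 3-way replication (my assumption; a replication factor isn't specified above):

```python
# Usable capacity of the suggested Ceph cluster: 3 nodes x 4 OSDs x 8TB,
# assuming a replicated pool with size=3 (Ceph's default; an assumption here).
nodes, osds_per_node, osd_tb, replicas = 3, 4, 8, 3

raw_tb = nodes * osds_per_node * osd_tb
usable_tb = raw_tb / replicas

print(raw_tb, usable_tb)  # -> 96 32.0
```

At 3x replication that is ~32TB usable, short of the 50TB target; hitting 50TB would need more or larger OSDs, or an erasure-coded pool.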
participants (11): Andrei Borzenkov, Carlos E. R., David T-G, Greg Freemyer, jdd@dodin.org, L A Walsh, Lew Wolfgang, Per Jessen, suse@a-domani.nl, Wol's lists, Wols Lists