[opensuse] mdraid questions
Hi Folks,

Background:

I'm working on a box that has to record lots of data "very" reliably. The current iteration requires 4.1-TB of storage with an average write bandwidth of at least 260-MB/sec. The hardware is a 4U rack system with 24 hotswap 1-TB disks on the front, with two 1-TB internal disks. It has two LSI RAID controllers: one handles the 24 hotswap disks, the other handles the two internal disks configured as a RAID-1 mirror. openSuSE 12.1 is on the internal RAID.

I configured the 24 data disks as two RAID-6 arrays each with 11 disks, leaving two disks for global auto-hotswap.

I then configured the two arrays as a mdraid level 1 mirror. This allows for the failure of up to 15 disks, depending on which ones failed. The mdraid also allows for on-line real-time duplication of the write data stream, and for each side of the mirror to be shipped separately for physical transport reliability.

I've got this configuration up and running with an average write bandwidth of about 560-MB/sec. So far, so good.

Questions:

I'm assuming that I can physically remove one of the RAID-6 arrays and install the disks in another (identical) chassis. An empty and uninitialized array will also be in the second chassis. Do I need to do anything with mdadm to cause the mirror on the second chassis to rebuild the empty array with the data from the first chassis? Would it be a matter of just plugging in the disks, turning on the power and stepping back? Or will some mdadm ju-ju be required? Will the wrong array be replicated?

In case you're wondering, the next iteration will require 2-TB data disks. We also have a complete clone of the whole system to be used as a spare if necessary. This data HAS to be recorded! I realize this configuration is probably overkill, but it's better to be safe than sorry.

Is there anything else I'm forgetting to consider?

Thanks in advance for your opinions.
Regards, Lew -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Lew Wolfgang wrote:
I'm assuming that I can physically remove one of the RAID-6 arrays and install the disks in another (identical) chassis. An empty and uninitialized array will also be in the second chassis. Do I need to do anything with mdadm to cause the mirror on the second chassis to rebuild the empty array with the data from the first chassis? Would it be a matter of just plugging in the disks, turn on the power and step back? Or will some mdadm ju-ju be required? Will the wrong array be replicated?
I expect some mdadm ju-ju will be required. After all, it has no way of telling that it's supposed to overwrite those disks, unless you tell it. You'll first need to tell the LSI controller how to use them, as well. But you're going to test this procedure before you put it into service, right? So it should be fairly easy to follow your nose and figure out the precise ju-ju that's needed. Cheers, Dave
On 05/02/2012 08:42 AM, Dave Howorth wrote:
Lew Wolfgang wrote:
I'm assuming that I can physically remove one of the RAID-6 arrays and install the disks in another (identical) chassis. An empty and uninitialized array will also be in the second chassis. Do I need to do anything with mdadm to cause the mirror on the second chassis to rebuild the empty array with the data from the first chassis? Would it be a matter of just plugging in the disks, turn on the power and step back? Or will some mdadm ju-ju be required? Will the wrong array be replicated? I expect some mdadm ju-ju will be required. After all, it has no way of telling that it's supposed to overwrite those disks, unless you tell it. You'll first need to tell the LSI controller how to use them, as well.
Hi Dave, The LSI controllers shouldn't be a problem. I've tested this in the past with 3ware controllers and it worked even between controller models. The RAID configuration is kept on the disks themselves, the controller didn't "keep state", at least for 3ware.
But you're going to test this procedure before you put it into service, right? So it should be fairly easy to follow your nose and figure out the precise ju-ju that's needed.
Yup, it will be tested. I'm just trying to do my homework first. Regards, Lew
Lew Wolfgang wrote:
The LSI controllers shouldn't be a problem. I've tested this in the past with 3ware controllers and it worked even between controller models. The RAID configuration is kept on the disks themselves, the controller didn't "keep state", at least for 3ware.
Yes, you should be able to plug an existing RAID6 disk pack into a new controller and have it keep working. But the controller will have to be told what to do with new disks. It doesn't know that you want another RAID6 creating. You might want to use them as hot spares, or JBOD, or something. Then you'll have to tell mdadm to use the new RAID6 to replace the missing half of its mirror as Per says.
the on-site operators aren't Linux geeks, so we wanted the process to be as turn-key as possible.
That sounds like you won't have ssh access. Will there be a telephone, or does it all have to be written on paper? It might be easier to initialize all the disks back at your base and move them back and forth a couple of times to find the easiest incantation to tell the on-site people. I guess you'll also need to tell them how to identify and swap out a failed disk and add a replacement to the system. Are they familiar with the controller and its web interface?
Lew Wolfgang wrote:
Questions:
I'm assuming that I can physically remove one of the RAID-6 arrays and install the disks in another (identical) chassis. An empty and uninitialized array will also be in the second chassis. Do I need to do anything with mdadm to cause the mirror on the second chassis to rebuild the empty array with the data from the first chassis? Would it be a matter of just plugging in the disks, turn on the power and step back? Or will some mdadm ju-ju be required? Will the wrong array be replicated?
Definitely _some_ mdadm ju-ju required - it sounds like this:

1. assemble a new RAID6 of set#1 (existing data)
2. create RAID1 = set#1 + faulty.
3. create RAID6 of set#2 (uninitialized array)
4. hot-add this to your RAID1.
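The procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe: the device names /dev/md0, /dev/sdb1 and /dev/sdc1 are assumed, the hardware RAID-6 sets are assumed to have been created on the LSI controller already, and set#1 is assumed to carry an existing md superblock, so it is assembled degraded rather than re-created (when creating a fresh mirror, mdadm accepts the keyword "missing" in place of the absent member). The script only prints the commands unless RUN is set to an empty string.

```shell
#!/bin/sh
# Sketch: start the mirror on the populated half, then hot-add the empty half.
# All device names below are assumptions for illustration.
MD=${MD:-/dev/md0}
OLD=${OLD:-/dev/sdb1}   # RAID-6 set holding the existing data (assumed name)
NEW=${NEW:-/dev/sdc1}   # freshly created, empty RAID-6 set (assumed name)
RUN=${RUN-echo}         # dry run by default; invoke with RUN= to execute

rebuild_mirror() {
    # Assemble the RAID-1 with only the populated member and force it to start.
    $RUN mdadm --assemble --run "$MD" "$OLD"
    # Hot-add the empty member; md then resyncs onto it from the active half.
    $RUN mdadm "$MD" --add "$NEW"
    # Report array state; resync progress also shows up in /proc/mdstat.
    $RUN mdadm --detail "$MD"
}

rebuild_mirror
```

Because the member that is added is always the resync target, naming the populated set explicitly in the assemble step is what guards against the "wrong array" being replicated.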
In case you're wondering, the next iteration will require 2-TB data disks. We also have a complete clone of the whole system to be used as a spare if necessary. This data HAS to be recorded! I realize this configuration is probably overkill, but it's better to be safe than sorry.
Is there anything else I'm forgetting to consider?
Battery-backed write caches or site-wide UPS? Staging disk purchases by production dates? Your initial RAID61 config will survive up to five concurrent disk failures (worst case) - is there a better config? -- Per Jessen, Zürich (15.2°C)
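The worst-case figure of five can be checked with a little arithmetic. The sketch below assumes simultaneous failures with no time for a rebuild onto a hot spare, and it does not count the two spares themselves; the variable names are made up for illustration.

```shell
#!/bin/sh
# Failure tolerance of a RAID-1 mirror built from two RAID-6 sets.
disks_per_set=11   # from the thread: two 11-disk RAID-6 sets
parity=2           # RAID-6 survives two failed disks per set

# Worst case: failures land so that both sets lose parity+1 disks.
# Data is lost at 2*(parity+1) failures, so one fewer is always survivable.
worst_case=$(( 2 * (parity + 1) - 1 ))

# Best case: one set is lost entirely and the other loses only `parity` disks.
best_case=$(( disks_per_set + parity ))

echo "worst case survivable failures: $worst_case"
echo "best case survivable failures:  $best_case"
```

The best case here comes to 13 data disks; the figure of 15 in the original post presumably also counts the two idle hot spares as disks that may fail.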
On 05/02/2012 12:11 PM, Per Jessen wrote:
Is there anything else I'm forgetting to consider? Battery-backed write caches or site-wide UPS? Staging disk purchases by production dates?
Your initial RAID61 config will survive up to five concurrent disk failures (worst case) - is there a better config?
Hi Per, Yes, we have battery-backed write caches (supercapacitor actually) and a good UPS. Staging disk purchases is a good suggestion! I initially wanted to set the 24 disks up as a RAID-60 with two hotspares. I tried this and actually measured 1.3-GB/sec average write bandwidth. These are SATA disks, not SAS. This would have allowed up to six disk failures, which would have been good enough I'm sure. But the customer also wanted full duplication of all recorded data in the field. Not only that, but the on-site operators aren't Linux geeks, so we wanted the process to be as turn-key as possible. mdraid-1 seemed to fit the bill. Regards, Lew
Hi Folks,

I submitted some questions about mdraid last week and I finally have it all sorted out. I include my original email below and describe my adventure if anyone's interested.

You'll recall that I have one LSI hardware raid controller with 24 1-TB drives. I configured two 11-disk RAID-6 arrays with two global hotswap disks. The raids appear as /dev/sdb and /dev/sdc. I then created a mdraid-1 with each RAID-6 as one of the mirrors. At the end of a data collection session the plan was to break the mdraid mirror and have two independent and identical arrays that could be returned to the depot for subsequent processing.

I discovered that breaking a mdraid-1 mirror is not to be taken lightly! Indeed, I couldn't do it at all. I was able to "fail" one of the arrays and move it to another chassis, but I couldn't fail the remaining mirror due to the device being busy. Further, the failed array was corrupted by the process of removing the mdraid's superblock.

So I decided to try leaving the mdraid mirrors intact, and just physically move each RAID-6 array to another chassis. This had issues because the device names changed and the mdraid wouldn't build. So I used the uuid to assemble the arrays, which worked for the array that didn't have its name changed. In other words, if the RAID-6 array was /dev/sdb1 in the first chassis, and it remained as /dev/sdb1 in the second, all was well. But I was also able to re-assemble the array with the changed name using mdadm commands, which I'll show below.
Here's how I used the mdadm command:

Build the mdraid RAID-1 array:

    mdadm --create /dev/md0 -l 1 -n 2 /dev/sdb1 /dev/sdc1
    echo -e 'DEVICE containers partitions' >> /etc/mdadm.conf
    echo -e 'ARRAY /dev/md0 UUID=d78d0e68:26e47cf4:c9aea959:c0f7dd94' >> /etc/mdadm.conf

(where the uuid is obtained before breaking the array by "mdadm --detail /dev/md0")

Then build the filesystem of choice and mount /dev/md0 as appropriate. (Note: I had to use parted to create the disk label and partition 1.)

Move the /dev/sdb array to another chassis; power should be off for safety. Power on in its new home and, as long as /etc/mdadm.conf contains the right uuid, the array should assemble.

When /dev/sdc is moved to another chassis, and then shows up as /dev/sdb, do this:

    mdadm --assemble --scan --uuid=d78d0e68:26e47cf4:c9aea959:c0f7dd94
    mdadm --run /dev/md0

and then mount /dev/md0 as appropriate. Make sure the uuid is correct. I also made sure that /etc/mdadm.conf was the same on all boxes.

That's it! I acknowledge that there are probably other ways to do this, but considering our constraints and untrained operators in the field, this should work well for us.

Regards, Lew

On 05/02/2012 08:25 AM, Lew Wolfgang wrote:
Hi Folks,
Background:
I'm working on a box that has to record lots of data "very" reliably. The current iteration requires 4.1-TB of storage with an average write bandwidth of at least 260-MB/sec. The hardware is a 4U rack system with 24 hotswap 1-TB disks on the front, with two 1-TB internal disks. It has two LSI RAID controllers, one handles the 24 hotswap disks, the other the two internal disks configured as a RAID-1 mirror. openSuSE 12.1 is on the internal RAID.
I configured the 24 data disks as two RAID-6 arrays each with 11 disks, leaving two disks for global auto-hotswap.
I then configured the two arrays as a mdraid level 1 mirror. This allows for the failure of up to 15 disks, depending on which ones failed. The mdraid also allows for on-line real-time duplication of the write data stream. It also allows for each side of the mirror to be shipped separately for physical transport reliability.
I've got this configuration up and running with an average write bandwidth of about 560-MB/sec. So far, so good.
Questions:
I'm assuming that I can physically remove one of the RAID-6 arrays and install the disks in another (identical) chassis. An empty and uninitialized array will also be in the second chassis. Do I need to do anything with mdadm to cause the mirror on the second chassis to rebuild the empty array with the data from the first chassis? Would it be a matter of just plugging in the disks, turn on the power and step back? Or will some mdadm ju-ju be required? Will the wrong array be replicated?
In case you're wondering, the next iteration will require 2-TB data disks. We also have a complete clone of the whole system to be used as a spare if necessary. This data HAS to be recorded! I realize this configuration is probably overkill, but it's better to be safe than sorry.
Is there anything else I'm forgetting to consider?
Thanks in advance for your opinions.
Regards, Lew
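For operators in the field, the reassembly steps from the follow-up message could be wrapped in a small script along these lines. This is a sketch only: the UUID is the one quoted in the follow-up, while the mount point /mnt/data and the function name are hypothetical. It prints the commands by default and runs them only when RUN is set to an empty string.

```shell
#!/bin/sh
# Sketch: assemble the mirror by UUID so the /dev/sdX name does not matter.
UUID=${UUID:-d78d0e68:26e47cf4:c9aea959:c0f7dd94}  # UUID from the follow-up
MD=${MD:-/dev/md0}
MNT=${MNT:-/mnt/data}   # hypothetical mount point
RUN=${RUN-echo}         # dry run by default; invoke with RUN= to execute

assemble_by_uuid() {
    # Scan all candidate devices and pick up members carrying this UUID.
    $RUN mdadm --assemble --scan --uuid="$UUID"
    # Start the array even though only one half of the mirror is present.
    $RUN mdadm --run "$MD"
    # Mount the filesystem that lives on the md device.
    $RUN mount "$MD" "$MNT"
}

assemble_by_uuid
```

For populating /etc/mdadm.conf, "mdadm --detail --scan" prints ARRAY lines in the configuration-file format, which avoids copying the UUID by hand.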
participants (3)
- Dave Howorth
- Lew Wolfgang
- Per Jessen