On 12/07/2013 12:40 AM, jdd wrote:
I don't know if it is directly related, but...
I was told (with references) that most RAID systems do not properly handle data when *several* disks are *partly* failing at the same time, and this is quite often the case.
It's very common for hard drives to fail silently: some sectors become damaged but are not read at the time, so the defect is not seen by the RAID system.
In this situation, the system just stops.
The reason is that only sectors that are actually read get checked.
In my opinion, and for my use (mostly archives), this makes RAID useless.
So I suspect one would have to accurately monitor the SMART data on every hard disk in the RAID array to prevent the problem, but I can't say for certain.
I know people who use different makes/models of HDD in the array to try to limit the risk of simultaneous failure.
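As a rough illustration of the SMART monitoring jdd mentions, here is a minimal sketch that polls the overall health verdict of each disk backing an array with smartmontools' smartctl. The device list is a placeholder, not anyone's actual setup, and it assumes smartmontools is installed (and usually root privileges).

#!/usr/bin/env python3
# Minimal sketch: poll the SMART overall-health verdict on every disk that
# backs an array. Assumes smartmontools (smartctl) is installed; usually
# needs root. The device names below are placeholders.
import subprocess

ARRAY_MEMBERS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder devices

def smart_ok(device: str) -> bool:
    """Return True if smartctl -H reports the drive as healthy."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True, check=False)
    # ATA drives report "PASSED"; SCSI/SAS drives report "SMART Health Status: OK".
    return "PASSED" in result.stdout or "SMART Health Status: OK" in result.stdout

if __name__ == "__main__":
    for dev in ARRAY_MEMBERS:
        print(f"{dev}: {'healthy' if smart_ok(dev) else 'needs attention'}")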
Interesting and valid observations. I've got some experience with 3Ware, and now LSI, hardware RAID controllers. I've got a requirement to record lots of data in the field, after which the disks are removed and returned to the depot for reading and processing. I use 24-bay hot-swap chassis configured as two 11-disk RAID-6 arrays with two global hot spares. The OS lives on two internal disks configured as RAID-1 on a second hardware RAID controller.

The RAID controllers initiate "Patrol Reads" on their own to look for hidden sector-read issues. These patrols run at low priority and don't interfere with read/write performance. The arrays are also "verified" daily, also in the background.

But I did have one issue where the depot chassis (an older box) wasn't able to correctly sync with newer 6-Gb/s SATA drives. The data were written correctly, but in the second chassis it appeared to the controller as if random drives were failing. I removed the disks and re-installed them in a chassis that was known to work. The second controller identified that the array was damaged, but then proceeded to accurately recover it. It took about 12 hours, but it worked! I still don't know "why" it worked, but it sure did save my bacon.

Another observation: don't use RAID-5, which can only tolerate a single failed disk. The most stressful time for an array is when a failed disk is replaced and the array is rebuilt. With RAID-5, if you suffer a second disk failure during the rebuild, your data is toast. With RAID-6, a rebuild can experience a second disk failure and still recover the data. Sure, RAID-6 has greater storage overhead, but if your data is important, disks are cheap these days.

BTW, I've measured 1.6 GB/sec of continuous write bandwidth to one of these 11-disk RAID-6 arrays from a single-threaded process.

Regards,
Lew
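For readers on Linux software RAID (md) rather than the 3Ware/LSI hardware described above, the closest equivalent of a Patrol Read or background verify is a scrub triggered through the md sysfs interface. The sketch below illustrates that analogue only, not the controllers' own mechanism; /dev/md0 is a placeholder array name and the script must run as root.

#!/usr/bin/env python3
# Sketch of the same idea as a controller "Patrol Read"/verify, but for Linux
# software RAID (md): kick off a background check so that rarely-read sectors
# get exercised before they matter. Run as root; md0 is a placeholder.
from pathlib import Path

MD = Path("/sys/block/md0/md")  # adjust to the md array being scrubbed

def start_check() -> None:
    """Ask the md layer to read and verify every stripe in the background."""
    (MD / "sync_action").write_text("check\n")

def scrub_status() -> str:
    """Report the current sync action and the mismatches found so far."""
    action = (MD / "sync_action").read_text().strip()
    mismatches = (MD / "mismatch_cnt").read_text().strip()
    return f"action={action}, mismatch_cnt={mismatches}"

if __name__ == "__main__":
    start_check()
    print(scrub_status())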
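To make the RAID-5 versus RAID-6 trade-off concrete, here is the back-of-the-envelope arithmetic for an 11-disk array like the ones described above. The per-disk capacity is an assumed placeholder; the 1.6 GB/sec figure is the measured write bandwidth quoted in the message.

#!/usr/bin/env python3
# Back-of-the-envelope numbers behind the RAID-6 recommendation, using the
# 11-disk arrays described above. Per-disk capacity is a placeholder;
# 1.6 GB/sec is the measured write bandwidth quoted in the message.
DISKS = 11
DISK_TB = 4.0        # assumed capacity per disk, TB (placeholder)
WRITE_BW_GBS = 1.6   # measured sequential write bandwidth, GB/sec

raid5_usable = (DISKS - 1) * DISK_TB   # one parity disk; a 2nd failure during rebuild loses data
raid6_usable = (DISKS - 2) * DISK_TB   # two parity disks; survives a failure during rebuild
per_disk_mb_s = WRITE_BW_GBS * 1000 / (DISKS - 2)  # data striped across the 9 data disks

print(f"RAID-5 usable: {raid5_usable:.0f} TB (tolerates 1 failed disk)")
print(f"RAID-6 usable: {raid6_usable:.0f} TB (tolerates 2 failed disks)")
print(f"~{per_disk_mb_s:.0f} MB/sec written per data disk at {WRITE_BW_GBS} GB/sec aggregate")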