On 12/07/2013 12:40 AM, jdd wrote:
I don't know if it is directly related, but...
I was told (with references) that most RAID systems do not properly handle data when *several* disks are *partly* failing at the same time, and this is quite often the case.
It's very common for hard drives to fail silently: some sectors become damaged but are not read at the time, so the defect is not seen by the RAID system.
In this situation, the system just stops.
The reason is that only sectors that are actually read get checked.
In my opinion, and for my use (mostly archives), this makes RAID useless.
So I suspect one would have to accurately monitor the SMART data on every hard disk in the RAID array to prevent the problem, but I can't say for certain.
I know people who use different makes/models of HDD in the array to try to limit the risk of simultaneous failure.
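As a rough illustration of the SMART monitoring jdd mentions, here is a minimal sketch that polls the overall health verdict of each disk backing an array with smartmontools' smartctl. The device list is a placeholder, not anyone's actual setup, and it assumes smartmontools is installed (and usually root privileges).

#!/usr/bin/env python3
# Minimal sketch: poll the SMART overall-health verdict on every disk that
# backs an array. Assumes smartmontools (smartctl) is installed; usually
# needs root. The device names below are placeholders.
import subprocess

ARRAY_MEMBERS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]  # placeholder devices

def smart_ok(device: str) -> bool:
    """Return True if smartctl -H reports the drive as healthy."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True, check=False)
    # ATA drives report "PASSED"; SCSI/SAS drives report "SMART Health Status: OK".
    return "PASSED" in result.stdout or "SMART Health Status: OK" in result.stdout

if __name__ == "__main__":
    for dev in ARRAY_MEMBERS:
        print(f"{dev}: {'healthy' if smart_ok(dev) else 'needs attention'}")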
Interesting and valid observations. I've got some experience with 3Ware, and now LSI, hardware RAID controllers. I've got a requirement to record lots of data in the field, after which the disks are removed and returned to the depot for reading and processing. I use 24-bay hot-swap chassis configured as two 11-disk RAID-6 arrays with two global hot spares. The OS lives on two internal disks configured as RAID-1 on a second hardware RAID controller.

The RAID controllers initiate "Patrol Reads" on their own to look for hidden sector-read issues. These patrols run at low priority and don't interfere with read/write performance. The arrays are also "verified" daily, also in the background.

But I did have one issue where the depot chassis (an older box) wasn't able to correctly sync with newer 6-Gb/s SATA drives. The data were written correctly, but in the second chassis it appeared to the controller as if random drives were failing. I removed the disks and re-installed them in a chassis that was known to work. The second controller identified that the array was damaged, but then proceeded to accurately recover it. It took about 12 hours, but it worked! I still don't know "why" it worked, but it sure did save my bacon.

Another observation: don't use RAID-5, which can only tolerate a single failed disk. The most stressful time for an array is when a failed disk is replaced and the array is rebuilt. With RAID-5, if you suffer a second disk failure during the rebuild, your data is toast. With RAID-6, a rebuild can experience a second disk failure and still recover the data. Sure, RAID-6 has greater storage overhead, but if your data is important, disks are cheap these days.

BTW, I've measured 1.6 GB/sec of continuous write bandwidth to one of these 11-disk RAID-6 arrays from a single-threaded process.

Regards,
Lew
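For readers on Linux software RAID (md) rather than the 3Ware/LSI hardware described above, the closest equivalent of a Patrol Read or background verify is a scrub triggered through the md sysfs interface. The sketch below illustrates that analogue only, not the controllers' own mechanism; /dev/md0 is a placeholder array name and the script must run as root.

#!/usr/bin/env python3
# Sketch of the same idea as a controller "Patrol Read"/verify, but for Linux
# software RAID (md): kick off a background check so that rarely-read sectors
# get exercised before they matter. Run as root; md0 is a placeholder.
from pathlib import Path

MD = Path("/sys/block/md0/md")  # adjust to the md array being scrubbed

def start_check() -> None:
    """Ask the md layer to read and verify every stripe in the background."""
    (MD / "sync_action").write_text("check\n")

def scrub_status() -> str:
    """Report the current sync action and the mismatches found so far."""
    action = (MD / "sync_action").read_text().strip()
    mismatches = (MD / "mismatch_cnt").read_text().strip()
    return f"action={action}, mismatch_cnt={mismatches}"

if __name__ == "__main__":
    start_check()
    print(scrub_status())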
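To make the RAID-5 versus RAID-6 trade-off concrete, here is the back-of-the-envelope arithmetic for an 11-disk array like the ones described above. The per-disk capacity is an assumed placeholder; the 1.6 GB/sec figure is the measured write bandwidth quoted in the message.

#!/usr/bin/env python3
# Back-of-the-envelope numbers behind the RAID-6 recommendation, using the
# 11-disk arrays described above. Per-disk capacity is a placeholder;
# 1.6 GB/sec is the measured write bandwidth quoted in the message.
DISKS = 11
DISK_TB = 4.0        # assumed capacity per disk, TB (placeholder)
WRITE_BW_GBS = 1.6   # measured sequential write bandwidth, GB/sec

raid5_usable = (DISKS - 1) * DISK_TB   # one parity disk; a 2nd failure during rebuild loses data
raid6_usable = (DISKS - 2) * DISK_TB   # two parity disks; survives a failure during rebuild
per_disk_mb_s = WRITE_BW_GBS * 1000 / (DISKS - 2)  # data striped across the 9 data disks

print(f"RAID-5 usable: {raid5_usable:.0f} TB (tolerates 1 failed disk)")
print(f"RAID-6 usable: {raid6_usable:.0f} TB (tolerates 2 failed disks)")
print(f"~{per_disk_mb_s:.0f} MB/sec written per data disk at {WRITE_BW_GBS} GB/sec aggregate")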