Mailinglist Archive: opensuse (2806 mails)
Re: [opensuse] Failed RAID Please Help
- From: "Brian K. White" <brian@xxxxxxxxx>
- Date: Wed, 23 Jul 2008 23:24:24 -0400
- Message-id: <02f401c8ed3c$c5a31880$a900000a@miata>
----- Original Message ----- From: "John Andersen" <jsamyth@xxxxxxxxx>
Sent: Wednesday, July 23, 2008 8:50 PM
Subject: Re: [opensuse] Failed RAID Please Help
On Wed, Jul 23, 2008 at 3:47 PM, Rodney Baker <rodney.baker@xxxxxxxxxxxx> wrote:
On Thu, 24 Jul 2008 03:30:54 John Andersen wrote:
If it's raid0 you have bigger problems, about the same problems as if
you had used LVM and skipped raid altogether, but even given
the lack of redundancy, LVM makes more sense than raid0
in linux. So I'm guessing no sane person would use raid0
just to concatenate drives in linux, and you probably don't have raid0.
Hmmm; last time I saw him my doctor said he thought I was still sane, yet I'm
using raid0 for exactly that purpose...
My previous experience with LVM was that it was a PITA to set up and then it
got corrupted due to a power outage. As a result /home was completely
I learned from that - I won't use LVM again. /home is now on a raid1 array,
with nightly backups to an external drive, and non-critical data (e.g. stuff
downloaded from the net) goes onto a raid0 array that I used to concat three
smaller partitions that
Don't assume from the fact that you have not YET had a failure on raid0 that
it is any safer than LVM. It's about the same risk. Loss of any one of
the partitions may cause loss of ALL data.
Depending on what file system you format the raid0 with it could be really
serious to just have a couple sectors go bad.
Raid0 composed of 3 drives TRIPLES your chance of loss, because a fault
on any ONE drive may render the whole thing broken. If you had a
1 in 10000 chance of a drive failure previously, you now have a 3 in 10000 chance.
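
As a rough check on the tripling claim above, here is a minimal sketch. It assumes the drives fail independently and all share the same failure probability, which the thread does not state; the function name is made up for illustration.

```python
def array_loss_probability(p_drive: float, n_drives: int) -> float:
    """Chance that at least one of n independent drives fails.

    Assumes independent failures with an identical per-drive
    probability (an illustrative assumption, not from the thread).
    """
    # The array survives only if every drive survives.
    return 1.0 - (1.0 - p_drive) ** n_drives


p_single = 1 / 10000
p_array = array_loss_probability(p_single, 3)
print(f"single drive: {p_single:.6f}, 3-drive raid0: {p_array:.6f}")
```

For small probabilities the exact figure is only a hair under 3 in 10000, so "triples" is a fair approximation.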
Further, I think a more accurate and scarier way to represent it is:
If the MTBF of one drive is 600,000 hours,
Then MTBF of a 3 drive raid0 is only 200,000 hours.
(600k is a typical estimate for commodity sata drives)
Worse, commodity sata drives only have a duty cycle of 30%.
So, if you are running these 24/7 instead of 8 hours a day, then the individual mtbf drops to merely 200,000 hours and the mtbf for the array drops to merely 66,666 hours.
So the lifetime of the array is only a little better than 10% of the nominal/advertised lifetime of a drive.
And, on top of all that, remember the M in MTBF: MEAN time between failures. That 66,666 hour estimate is the average, so half or more of all such arrays will die even sooner, some much sooner.
7.6 years sounds like a long time but that's total drive failure. Data corruption happens long before that.
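
The arithmetic above can be reproduced with a short sketch. It follows the simplified model used in this post (array MTBF is the drive MTBF divided by the number of drives, and MTBF scales down linearly with hours run per day), treating the 30% duty cycle as roughly 8 hours a day. The last step, the share of arrays that die before the mean, additionally assumes exponentially distributed failures, which is my assumption and not something the figures above establish. All names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760


def derate_for_duty(mtbf_hours: float, rated_hours_per_day: float,
                    actual_hours_per_day: float = 24) -> float:
    """Scale MTBF linearly when a drive runs more hours than rated (simplified model)."""
    return mtbf_hours * rated_hours_per_day / actual_hours_per_day


def raid0_mtbf(drive_mtbf_hours: float, n_drives: int) -> float:
    """Any single drive failure kills a raid0 array, so divide by the drive count."""
    return drive_mtbf_hours / n_drives


nominal = 600_000                          # typical commodity sata estimate from the post
derated = derate_for_duty(nominal, 8)      # run 24/7 instead of 8 hours a day
array_mtbf = raid0_mtbf(derated, 3)        # 3-drive raid0
array_years = array_mtbf / HOURS_PER_YEAR

# Assumption: exponential failure times, so P(fail before the mean) = 1 - e^-1.
fraction_before_mean = 1 - math.exp(-1)

print(f"derated drive mtbf: {derated:,.0f} h")
print(f"array mtbf: {array_mtbf:,.0f} h ({array_years:.1f} years)")
print(f"share of nominal: {array_mtbf / nominal:.1%}")
print(f"arrays failing before the mean: {fraction_before_mean:.0%}")
```

Under the exponential assumption, noticeably more than half of such arrays would fail before the mean is reached.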
I don't know where they get those huge mtbf estimates anyway. I see drives fail all the time in as little as a year. Some last 10 years, true, but many last 1, 2, or 3. If your power conditioning, air temperature and cleanliness aren't all *perfect*, that surely drops all the numbers way down too. Running hot and suffering power fluctuations and surges, both on the power connector and on the data connector, definitely kills drives early, and what most people have in their homes is pretty bad power, pretty dusty air, and neither cold enough air nor enough air flow. Those ridiculously long mtbf estimates are probably simply what's required just to make a drive last a year or so in normal conditions.
Don't bet that your UPS does any power conditioning either. The cheap ones mostly don't. They are simply switches, and as long as there is power available from the wall, you are directly connected to the wall. Maybe there is a little surge absorption in play, like what a cheap power strip has, which is just about worthless for the purposes of this topic. Its value is that maybe you don't lose your whole room full of hardware when lightning hits your circuit. It does just about nothing for the 24/7 general dirtiness of most wall power, which gradually kills hardware a lot sooner than if the power was perfect 24/7 over the same period of time.
I'm seeing one out of ten drives die within 3 years even _in_ a perfectly controlled and protected environment: consistent low temperature, good strong airflow over the drives, 100% power-conditioning UPSes, closed room (no constant influx of new dust) so the parts all stay clean. And that's with 100% duty cycle, 5 year warranty u320 scsi drives, not just commodity ide and sata drives. By "die" I also mean merely that the raid card they are connected to has marked them bad, meaning it detected a single data discrepancy. That's a far cry from total drive failure, and it happens a lot more easily and a lot sooner on average.
Conversely, I have seen linux's software raid mark drives bad when really there was nothing wrong with them. Depending on the controller, I've seen dmraid mark up to 50% of drives bad when they were really all 100% ok. Those same exact drives, on the same exact motherboards & cases, in the same exact server farm/power/air temp/etc..., running the same exact OS & software, but plugged into a real raid card instead of using software raid, were fine and still are to this day. They are under heavier load now, actually, since the servers in question never made it out of testing/vetting while the drives were "dying" so often, but are in full production now. That was just using raid10 in software too, not even the extra complication of raid5.
Raid0 has its uses, but it definitely should be used with very open eyes and the acceptance that the array will likely die and all data will be gone in as little as a year or maybe three. Just do whatever you have to to somehow arrange to be ok with that.
Brian K. White brian@xxxxxxxxx http://www.myspace.com/KEYofR
filePro BBx Linux SCO FreeBSD #callahans Satriani Filk!