On 11/07/17 22:16, Greg Freemyer wrote:
On Tue, Jul 11, 2017 at 4:45 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2017-07-11 22:35, Dave Howorth wrote:
On Tue, 11 Jul 2017 22:18:03 +0200 "Carlos E. R." <> wrote:
For those who don't know, a desktop drive is "within spec" if it returns one soft read error per 10^14 bits read - roughly 12TB. In other words, read a 6TB drive end-to-end twice, and the manufacturer says "if you get a read error, that's normal". But it will cause an array to fail if you haven't set it up properly ...
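To see why "a 6TB drive read twice" matches that spec, the arithmetic is a quick back-of-envelope check (the 10^14-bits figure is the error rate commonly quoted on desktop drive datasheets):

```shell
# One unrecoverable read error expected per 10^14 bits read.
bits=100000000000000           # 10^14 bits
bytes=$((bits / 8))            # 12500000000000 bytes = 12.5 decimal TB
tb=$((bytes / 1000000000000))  # truncated to whole terabytes
echo "one expected read error per ~${tb}TB read"
```

That is about two full passes over a 6TB drive.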
What would one do to set them up properly? :-?
You need to set up the timeouts in Linux to be longer than the ones imposed by the firmware on the drive. I'm sorry but I don't remember whether it's a kernel thing or a mdadm thing. I expect the linux raid wiki knows.
Wow :-(
That can be minutes.
That's actually a big difference between "set-up properly" or not.
A drive used in a raid-1/5/6 should be set to fail fast instead of retry for a minute or two.
Drives designed for use in a raid array will come from the factory that way.
If you're using a desktop drive you really need to try and set the retry time down low.
Except you can't :-( Read the wiki page on timeouts. It seems to affect drives over 1TB, but on those the drive's timeout is no longer an adjustable parameter - it's fixed at about 2 1/2 minutes and that's that :-( That's why you have to adjust the Linux timeout up appropriately. There's a script on the wiki that will do it for you. Note that setting a long timeout in Linux on all drives isn't a problem; it's just that you daren't let Linux time out quicker than the drive does.
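The idea behind that script can be sketched roughly like this (the device list and the 180-second value are illustrative assumptions, not the wiki's exact choices; SCT ERC is specified in deciseconds, so 70 means 7.0 seconds):

```shell
#!/bin/sh
# For each array member: try to enable SCT ERC so the drive fails fast.
# If the drive refuses (typical of desktop drives), raise the kernel's
# SCSI timeout above the drive's ~2.5 minute internal retry limit instead.
for dev in sda sdb sdc; do
    if smartctl -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        # Drive now gives up after 7s, well inside the kernel's
        # default 30s command timeout.
        echo "$dev: SCT ERC set to 7s, default kernel timeout is fine"
    else
        # Desktop drive: can't shorten the drive's retries, so
        # lengthen Linux's patience instead.
        echo 180 > "/sys/block/$dev/device/timeout"
        echo "$dev: kernel timeout raised to 180s"
    fi
done
```

Note that SCT ERC resets on power cycle on most drives, so this has to run at every boot (a udev rule or boot script is the usual place).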
Then, in addition, you should be running a scrub routinely. That reads all the sectors on the physical media looking for bad ones. If it finds any, it recreates the data from the other drives and rewrites the sector; hopefully that fixes it.
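For md arrays this is driven through sysfs; a minimal sketch, assuming an array at /dev/md0 (the array name is an example):

```shell
# Start a check scrub: read everything, count inconsistencies,
# but only rewrite sectors that actually fail to read.
echo check > /sys/block/md0/md/sync_action

# Progress shows up in /proc/mdstat; when it finishes, see how many
# inconsistent stripes were found.
cat /sys/block/md0/md/mismatch_cnt

# A repair scrub rewrites inconsistent stripes instead of just
# counting them:
#   echo repair > /sys/block/md0/md/sync_action
```

Most distributions ship a cron job or systemd timer that kicks off a check scrub monthly.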
Not quite true ... The point of a scrub is that it reads the drive end to end. Nowadays drives are computers in their own right, with loads of error correction built into the drive, so if the drive has difficulty reading a sector, it will rewrite it internally. Note that magnetic media decays just like RAM, only on a timescale of years rather than nanoseconds. It's made worse if *parts* of the drive are repeatedly rewritten - that wears down the data next to them. But yes, if a block fails, it will get recalculated and rewritten.

The other thing is I think scrub also updates the mismatch count? This is where my knowledge is currently very patchy, but a non-zero mismatch count means the data is inconsistent on disk. I think a check scrub just looks for and counts mismatches. A repair scrub will copy drive 1 over the other drive(s) for a mirror, and recalculate and overwrite parity for raid 5/6.

For raids 1 and 5, that's about all you can do. For raid 6, there's a program, raid6check, which will try to work out which block is corrupt and recalculate it. It's pretty good, in that it will identify and fix a single-block corruption, and is unlikely to mis-identify a more complex (and unfixable) problem as a fixable single-block error. Getting raid to do this for you automatically is highly contentious - the raid guys say there are a lot of possible causes and don't want an attempted fix to make matters worse. Personally I think the current situation is sub-optimal, but that's my opinion, not theirs ...

Cheers,
Wol