Neil Brown changed bug 920205
What    Removed  Added
Status  NEW      CONFIRMED

Comment # 3 on bug 920205 from
(In reply to Peter van Hoof from comment #2)
> > The script does not report mismatches.
> > Detecting mismatches isn't really the point of running a 'check'.
> > The main point is to read all blocks and make sure that none of them have gone
> > bad (i.e. cannot be read).
> > If any blocks are bad, they will automatically be fixed (by re-writing) if possible.
> >   If that isn't possible, the drive will be removed from the array.
> 
> I don't think this is correct. My information is that a 'check' doesn't
> repair anything. You need to run 'repair' to get the mismatches fixed.

Any access that results in a read error will trigger an attempt to correct
that read error, if possible.
The difference between 'check' and 'repair' is that if a 'mismatch' is found,
then repair will repair it, but check will not.
A 'mismatch' is where the parity calculated from the data disks does not match
the parity stored in the parity block.  This should be extremely rare.  A read
error is more likely.
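
As a rough illustration of the check/repair distinction, here is a toy
Python model of a single RAID5 stripe. It is only a sketch: the names
xor_parity and scrub_stripe are made up for this example, the blocks are
tiny byte strings, and it assumes that 'repair' resolves a mismatch by
recomputing parity from the data blocks. It is not how md is implemented.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(data_blocks, stored_parity, action="check"):
    """Toy model of scrubbing one stripe (hypothetical helper, not md code).

    Returns (mismatch_found, parity_after_scrub).
    """
    expected = xor_parity(data_blocks)
    if expected == stored_parity:
        return False, stored_parity      # parity agrees: nothing to do
    if action == "repair":
        return True, expected            # assumption: repair rewrites the parity
    return True, stored_parity           # check only counts the mismatch


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
    stale = b"\x00\x00"                  # deliberately wrong parity
    print(scrub_stripe(data, stale, "check"))
    print(scrub_stripe(data, stale, "repair"))
```

Note that a read error is a different case from what this models: there the
block cannot be read at all, and the rewrite is attempted regardless of
whether 'check' or 'repair' was requested.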


It seems I was wrong about mdadm --monitor sending mail for mismatches.

If "--syslog" is enabled (which currently requires direct editing of the
systemd unit file) then the mismatch count will be logged to systemd, but
it is never emailed.  I should probably change that.
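
Until that changes, the count can be read directly from sysfs, using the
mismatch_cnt file quoted below. The following is a minimal sketch of a
script that could be run by hand or from cron after a check. The function
names and the sysfs_block parameter are made up for this example, and it
assumes that arrays appear as /sys/block/md*/md/mismatch_cnt.

```python
import glob
import os


def read_mismatch_counts(sysfs_block="/sys/block"):
    """Return {array_name: mismatch_cnt} for each md array found under sysfs_block."""
    counts = {}
    pattern = os.path.join(sysfs_block, "md*", "md", "mismatch_cnt")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sysfs_block>/md2/md/mismatch_cnt
        name = path.split(os.sep)[-3]
        try:
            with open(path) as f:
                counts[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue                     # unreadable or unexpected content: skip
    return counts


def arrays_with_mismatches(counts):
    """Keep only the arrays whose mismatch count is non-zero."""
    return {name: n for name, n in counts.items() if n > 0}


if __name__ == "__main__":
    for name, n in arrays_with_mismatches(read_mismatch_counts()).items():
        print(f"{name}: mismatch_cnt={n}")
```

The script prints nothing when every array is clean, so a cron job that
mails any output would only send mail when a mismatch count is non-zero.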

> At the moment I have
> 
> # cat /sys/block/md2/md/mismatch_cnt
> 16
> 
> from the April 1 check so I should have had an email from mdadm... I did a
> manual repair of this RAID5 array last month, but obviously the problems are
> back. I would love to know which disk is responsible for this so that I can
> examine it more closely...

It may not be one particular disk that is the problem, though that is certainly
a possibility.
It could be bad memory, or a problem with the drive controller, or maybe even a
cable problem.
My guess is that the controller is the most likely source of error, but I
wouldn't trust that guess very much.

The only way I know of to isolate this sort of problem is to replace components
(or swap components between similar systems) until the problem disappears (or
moves to the other system).

