Comment # 3 on bug 920205 from Neil Brown

(In reply to Peter van Hoof from comment #2)
> > The script does not report mismatches.
> > Detecting mismatches isn't really the point of running a 'check'.
> > The main point is to read all block and make sure that none of them have gone
> > bad (i.e. cannot be read).
> > If any blocks are bad, they will automatically be fix (re-writing) if possible.
> >   If that isn't possible, the drive will be removed from the array.
> 
> I don't think this is correct. My information is that a 'check' doesn't
> repair anything. You need to run 'repair' to get the mismatches fixed.

Any access that results in a read error will trigger an attempt to correct that
error, if possible.
The difference between 'check' and 'repair' is that if a 'mismatch' is found,
then repair will repair it, but check will not.
A 'mismatch' is where the parity calculated from the data blocks does not match
the parity stored in the parity block.  This should be extremely rare.  A read
error is more likely.
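
For reference, a 'check' or 'repair' pass can be started by hand through the md
sysfs files. The following is only a minimal sketch: the function names are
made up for illustration, and the md directory is passed in as a parameter
(for the array discussed here it would be /sys/block/md2/md). Writing to
sync_action requires root.

```shell
#!/bin/sh
# Sketch only: md_start_scrub and md_mismatch_count are hypothetical helper
# names, not part of mdadm. The sysfs files sync_action and mismatch_cnt are
# the standard md interface.

md_start_scrub() {
    # $1 = md sysfs directory, e.g. /sys/block/md2/md
    # $2 = "check" (count mismatches only) or "repair" (also rewrite them)
    case "$2" in
        check|repair) ;;
        *) echo "unsupported action: $2" >&2; return 1 ;;
    esac
    printf '%s\n' "$2" > "$1/sync_action"
}

md_mismatch_count() {
    # $1 = md sysfs directory; prints the count left by the last pass
    cat "$1/mismatch_cnt"
}
```

Typical use would be `md_start_scrub /sys/block/md2/md check`, then, once
sync_action reads "idle" again, `md_mismatch_count /sys/block/md2/md`.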


It seems I was wrong about mdadm --monitor sending mail for mismatches.

If "--syslog" is enabled (which currently requires direct editing of the
systemd unit file) then the mismatch count will be logged to systemd, but
it is never emailed.  I should probably change that.
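
Until that changes, a small script run after the periodic check could report
non-zero counts itself. This is only a sketch under assumptions: the function
name is hypothetical, and the optional argument (default /sys/block) exists
only so the function can be pointed at a test directory. How the output gets
delivered (cron mail, for instance) is left to the local setup.

```shell
#!/bin/sh
# Sketch only: md_report_mismatches is a hypothetical helper, not an mdadm
# feature. It prints one line per md array whose mismatch_cnt is non-zero.

md_report_mismatches() {
    sys_block="${1:-/sys/block}"
    for f in "$sys_block"/md*/md/mismatch_cnt; do
        # Skip if the glob matched nothing or the file is unreadable
        [ -r "$f" ] || continue
        cnt=$(cat "$f")
        # Skip zero counts (and anything that is not a number)
        [ "$cnt" -gt 0 ] 2>/dev/null || continue
        # Reduce ".../md2/md/mismatch_cnt" to the device name "md2"
        dev=${f#"$sys_block"/}
        dev=${dev%%/*}
        echo "$dev: mismatch_cnt=$cnt"
    done
}
```

If the function prints nothing, no array reported a mismatch after its last
check.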

> At the moment I have
> 
> # cat /sys/block/md2/md/mismatch_cnt
> 16
> 
> from the April 1 check so I should have had an email from mdadm... I did a
> manual repair of this RAID5 array last month, but obviously the problems are
> back. I would love to know which disk is responsible for this so that I can
> examine it more closely...

It may not be one particular disk that is the problem, though that is certainly
a possibility.
It could be bad memory, or a problem with the drive controller, or maybe even a
cable problem.
My guess is that the controller is the most likely source of error, but I
wouldn't trust that guess very much.

The only way I know of to isolate this sort of problem is to replace components
(or swap components between similar systems) until the problem disappears (or
moves to the other system).