What | Removed | Added |
---|---|---|
Status | NEW | CONFIRMED |
(In reply to Peter van Hoof from comment #2)
> > The script does not report mismatches.
> > Detecting mismatches isn't really the point of running a 'check'.
> > The main point is to read all block and make sure that none of them have gone
> > bad (i.e. cannot be read).
> > If any blocks are bad, they will automatically be fix (re-writing) if possible.
> > If that isn't possible, the drive will be removed from the array.
>
> I don't think this is correct. My information is that a 'check' doesn't
> repair anything. You need to run 'repair' to get the mismatches fixed.

Any access that results in a read error will attempt to correct that read error if possible.

The difference between 'check' and 'repair' is that if a 'mismatch' is found, then repair will repair it, but check will not.

A 'mismatch' is where the parity calculated from the data disks does not match the parity stored in the parity block. This should be extremely rare. A read error is more likely.

It seems I was wrong about mdadm --monitor sending mail for mismatches. If "--syslog" is enabled (which currently requires direct editing of the systemd unit file) then the mismatch count will be logged to systemd, but it is never emailed. I should probably change that.

> At the moment I have
>
> # cat /sys/block/md2/md/mismatch_cnt
> 16
>
> from the April 1 check so I should have had an email from mdadm... I did a
> manual repair of this RAID5 array last month, but obviously the problems are
> back. I would love to know which disk is responsible for this so that I can
> examine it more closely...

It may not be one particular disk that is the problem, though that is certainly a possibility. It could be bad memory, or a problem with the drive controller, or maybe even a cable problem. My guess is that the controller is the most likely source of error, but I wouldn't trust that guess very much.
The only way I know of to isolate this sort of problem is to replace components (or swap components between similar systems) until the problem disappears (or moves to the other system).
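
Since the mismatch count is currently not emailed, one stopgap is to read it yourself after each check. The following is only a rough sketch, not anything mdadm ships: it assumes the `/sys/block/mdX/md/mismatch_cnt` layout quoted above, and the function names `read_mismatch_counts` and `arrays_with_mismatches` are made up for illustration. It reports arrays with a non-zero count; it cannot tell you which disk or component is at fault.

```python
from pathlib import Path


def read_mismatch_counts(sysfs_root="/sys/block"):
    """Return {array_name: mismatch_cnt} for each md array under sysfs_root.

    Assumes the sysfs layout /sys/block/mdX/md/mismatch_cnt.
    """
    root = Path(sysfs_root)
    counts = {}
    if not root.is_dir():
        return counts
    for entry in root.iterdir():
        if not entry.name.startswith("md"):
            continue  # not an md array (e.g. sda)
        cnt_file = entry / "md" / "mismatch_cnt"
        try:
            counts[entry.name] = int(cnt_file.read_text().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or unexpected content: skip rather than guess.
            continue
    return counts


def arrays_with_mismatches(counts):
    """Return the sorted names of arrays whose mismatch count is non-zero."""
    return sorted(name for name, cnt in counts.items() if cnt > 0)


if __name__ == "__main__":
    counts = read_mismatch_counts()
    for name in arrays_with_mismatches(counts):
        print(f"{name}: mismatch_cnt={counts[name]}")
```

Run from cron after the scheduled check, the printed lines would normally be mailed by cron itself, so no mail handling is needed in the script.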