[Bug 920205] mdcheck does not report errors in RAID5 array
http://bugzilla.suse.com/show_bug.cgi?id=920205

Neil Brown <nfbrown@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |CONFIRMED

--- Comment #3 from Neil Brown <nfbrown@suse.com> ---
(In reply to Peter van Hoof from comment #2)
> > The script does not report mismatches. Detecting mismatches isn't really
> > the point of running a 'check'. The main point is to read all blocks and
> > make sure that none of them have gone bad (i.e. cannot be read). If any
> > blocks are bad, they will automatically be fixed (by re-writing) if
> > possible. If that isn't possible, the drive will be removed from the array.
>
> I don't think this is correct. My information is that a 'check' doesn't
> repair anything. You need to run 'repair' to get the mismatches fixed.

Any access that results in a read error will attempt to correct that read error if possible.

The difference between 'check' and 'repair' is that if a 'mismatch' is found, then repair will repair it, but check will not. A 'mismatch' is where the parity calculated from the data disks does not match the parity stored in the parity block. This should be extremely rare. A read error is more likely.

It seems I was wrong about mdadm --monitor sending mail for mismatches. If "--syslog" is enabled (which currently requires direct editing of the systemd unit file) then the mismatch count will be logged to systemd, but it is never emailed. I should probably change that.
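
To make the 'mismatch' definition above concrete, here is a minimal sketch in Python. It is a simplified model of a single stripe, not the md driver's code: the function names `xor_parity` and `scrub_stripe` are made up for illustration, and it assumes that RAID5 parity is the byte-wise XOR of the data blocks and that a repair simply recomputes the parity from the data.

```python
def xor_parity(data_blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    if not data_blocks:
        raise ValueError("need at least one data block")
    length = len(data_blocks[0])
    if any(len(block) != length for block in data_blocks):
        raise ValueError("all data blocks must have the same size")
    parity = bytearray(length)
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def scrub_stripe(data_blocks, stored_parity, repair=False):
    """Compare stored parity with parity recomputed from the data.

    Returns (mismatch_found, parity_after_scrub). In this model a
    'check' (repair=False) only reports the mismatch, while a 'repair'
    (repair=True) also replaces the stored parity with the recomputed one.
    """
    expected = xor_parity(data_blocks)
    if expected == stored_parity:
        return False, stored_parity
    return True, (expected if repair else stored_parity)
```

Note that a mismatch only says that data and parity disagree within a stripe. It does not say which block is the wrong one, which is why it cannot point at a particular disk.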

> At the moment I have
>
> # cat /sys/block/md2/md/mismatch_cnt
> 16
>
> from the April 1 check so I should have had an email from mdadm... I did a
> manual repair of this RAID5 array last month, but obviously the problems are
> back. I would love to know which disk is responsible for this so that I can
> examine it more closely...
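
Since the mismatch count is not emailed, the counter quoted above could also be polled by a small script. The following is only a sketch: the sysfs path layout is the one shown in the `cat` command above, while the function names and the `sysfs_root` parameter are hypothetical and exist only so the code can be tried against a test directory.

```python
from pathlib import Path


def read_mismatch_count(array, sysfs_root="/sys/block"):
    """Read mismatch_cnt for an md array such as 'md2'."""
    path = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def mismatch_report(array, sysfs_root="/sys/block"):
    """Return a one-line warning, or None if the counter is zero."""
    count = read_mismatch_count(array, sysfs_root)
    if count == 0:
        return None
    return f"{array}: mismatch_cnt is {count}"
```

For example, `mismatch_report("md2")` could be called from a cron job after the scheduled check, with any non-None result passed on to whatever notification mechanism is in use.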

It may not be one particular disk that is the problem, though that is certainly a possibility. It could be bad memory, or a problem with the drive controller, or maybe even a cable problem. My guess is that the controller is the most likely source of error, but I wouldn't trust that guess very much.

The only way I know of to isolate this sort of problem is to replace components (or swap components between similar systems) until the problem disappears (or moves to the other system).

-- 
You are receiving this mail because:
You are on the CC list for the bug.