[kernel-bugs] [Bug 1177595] many device resets and I/O errors during mdadm scrub

15 Oct 2020

https://bugzilla.suse.com/show_bug.cgi?id=1177595
https://bugzilla.suse.com/show_bug.cgi?id=1177595#c1

--- Comment #1 from Coly Li <colyli@suse.com> ---
(In reply to Peter van Hoof from comment #0)
...
We have a disk server with a Supermicro S3008 L8e SAS controller and 6 SAS
drives of 12 TB each in an mdadm RAID5 software raid configuration. When
starting a scrub of the RAID array with
echo check > /sys/block/md0/md/sync_action
after about 0.5 - 1.5 hours of running the scrub, a lot of error messages
start appearing in the syslog. Mostly there are lots of cryptic messages
like this:
kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12),
sub_code(0x011a)
these are interspersed with other error messages about device resets and I/O
errors:
kernel: sd 6:0:2:0: Power-on or device reset occurred
kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 op
0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 
kernel: sd 6:0:2:0: [sdc] tag#1073 CDB: Read(10) 28 00 26 73 b5 65 00 00 01
00 
kernel: sd 6:0:2:0: [sdc] tag#1073 FAILED Result: hostbyte=DID_SOFT_ERROR
driverbyte=DRIVER_OK
These errors happen on all 6 disks in the RAID array (only sdc is shown
here, but the problems on the other disks are essentially identical).
I have also seen I/O errors in the output of smartctl -a (while the scrub
was ongoing), but that may simply be due to the device being reset during
the call...
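As an aid to reading the messages quoted above, here is a minimal sketch that splits the log_info word and the Read(10) CDB into their fields. The function names decode_log_info and decode_read10 are made up for illustration, and the bit layout of log_info (sub_code 16 bits, code 8 bits, originator 4 bits, bus type 4 bits) is an assumption that happens to agree with the originator/code/sub_code values the driver itself printed.

```python
# Sketch only. The log_info field layout is an assumption (sub_code: 16 bits,
# code: 8 bits, originator: 4 bits, bus_type: 4 bits); it matches the values
# the kernel printed next to log_info(0x3112011a) in the report.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value):
    """Split a 32-bit log_info word into its assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_read10(cdb_hex):
    """Return (LBA, block count) from a Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


if __name__ == "__main__":
    info = decode_log_info(0x3112011A)
    print(info["originator"], hex(info["code"]), hex(info["sub_code"]))
    lba, blocks = decode_read10("28 00 26 73 b5 65 00 00 01 00")
    # The kernel reports sectors in 512-byte units; the factor 8 assumes
    # 4096-byte logical blocks on the drive.
    print(lba, blocks, lba * 8)
```

The CDB addresses LBA 645117285 for one block, and 645117285 * 8 = 5160938280, the sector in the blk_update_request line. That would be consistent with drives using 4096-byte logical blocks, but this is an inference from the numbers, not something stated in the report.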
Initially we thought these were hardware problems and we had the server
thoroughly checked by the manufacturer. They swapped out all the hardware,
but the problems would not go away. They concluded that it must be a
software (i.e., driver) issue. I cannot be completely certain, but it looks
like the problems started after upgrading openSUSE 15.1 -> 15.2. The kernel
was fully patched at the time we detected the problems on 29 September. Tests
showed that the previously installed kernel version also had the same
problem. It is likely that all kernel versions shipped with openSUSE 15.2
show this problem.
We currently mount the RAID5 array in read-only mode to prevent the I/O
errors from corrupting the file system. This severely limits the
functionality of the server.

I have heard of similar issues when the hard drives were device-managed
SMR.

What are the exact models of these hard drives?
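
One possible way to collect the model strings, as a rough sketch: read the vendor and model attributes that the kernel exposes in sysfs for each sd* device. The function name drive_models and its sys_block parameter are made up for illustration. The sysfs model string may be truncated, so the output of smartctl -i is probably the better source for the full model number.

```python
from pathlib import Path


def read_attr(dev, name):
    """Read one sysfs attribute of a block device, or '' if it is absent."""
    path = dev / "device" / name
    return path.read_text().strip() if path.is_file() else ""


def drive_models(sys_block="/sys/block"):
    """Return {device name: (vendor, model)} for the sd* devices in sysfs."""
    return {
        dev.name: (read_attr(dev, "vendor"), read_attr(dev, "model"))
        for dev in sorted(Path(sys_block).glob("sd*"))
    }


if __name__ == "__main__":
    for name, (vendor, model) in drive_models().items():
        print(name, vendor, model)
```

As far as I know, a device-managed SMR drive presents itself as an ordinary block device, so the model number has to be checked against the manufacturer's data sheet.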

Thanks.

Coly Li

-- 
You are receiving this mail because:
You are the assignee for the bug.
