Bug ID 1177595
Summary many device resets and I/O errors during mdadm scrub
Classification openSUSE
Product openSUSE Distribution
Version Leap 15.2
Hardware x86-64
OS openSUSE Leap 15.2
Status NEW
Severity Critical
Priority P5 - None
Component Kernel
Assignee kernel-bugs@opensuse.org
Reporter pvh@oma.be
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

We have a disk server with a Supermicro S3008 L8e SAS controller and six 12 TB
SAS drives in an mdadm software RAID5 configuration. When we start a scrub of
the RAID array with

echo check > /sys/block/md0/md/sync_action

a large number of error messages start appearing in the syslog after about 0.5
to 1.5 hours of scrubbing. Most of them are cryptic messages like this:

kernel: mpt3sas_cm0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a)
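
For reference, the log_info word in the message above can be split into the
fields the driver prints. This is a minimal sketch; it assumes the bit layout
bus_type(4) / originator(4) / code(8) / sub_code(16) and the originator
numbering 0=IOP, 1=PL, 2=IR, neither of which is stated in this report. The
names LogInfo and decode_log_info are made up for illustration. It does not
say what code 0x12 / sub_code 0x011a means.

```python
from typing import NamedTuple

# Assumed originator numbering (not taken from this report): 0=IOP, 1=PL, 2=IR.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


class LogInfo(NamedTuple):
    bus_type: int
    originator: str
    code: int
    sub_code: int


def decode_log_info(value: int) -> LogInfo:
    """Split a 32-bit log_info word into its fields (assumed layout)."""
    return LogInfo(
        bus_type=(value >> 28) & 0xF,      # top 4 bits
        originator=ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        code=(value >> 16) & 0xFF,         # next 8 bits
        sub_code=value & 0xFFFF,           # low 16 bits
    )


info = decode_log_info(0x3112011A)
print(
    f"bus_type={info.bus_type} originator({info.originator}) "
    f"code(0x{info.code:02x}) sub_code(0x{info.sub_code:04x})"
)
# prints: bus_type=3 originator(PL) code(0x12) sub_code(0x011a)
```

The decoded originator, code and sub_code agree with what the kernel printed,
so the same helper could be used to group the messages by code and sub_code.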

These messages are interspersed with other messages about device resets and
I/O errors:

kernel: sd 6:0:2:0: Power-on or device reset occurred

kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
kernel: sd 6:0:2:0: [sdc] tag#1073 CDB: Read(10) 28 00 26 73 b5 65 00 00 01 00
kernel: sd 6:0:2:0: [sdc] tag#1073 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
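
As a cross-check, the failed CDB can be decoded and compared with the sector
number in the blk_update_request line. This is a minimal sketch that assumes
the standard READ(10) layout (opcode 0x28, 4-byte big-endian LBA in bytes 2-5,
2-byte transfer length in bytes 7-8); decode_read10 is an illustrative name.
The 4096-byte logical block size it prints is an inference from the two
numbers, not something we verified on the drives.

```python
from typing import Tuple


def decode_read10(cdb_hex: str) -> Tuple[int, int]:
    """Return (lba, transfer_length_in_blocks) for a READ(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8
    return lba, blocks


lba, blocks = decode_read10("28 00 26 73 b5 65 00 00 01 00")
kernel_sector = 5160938280  # from the blk_update_request line (512-byte units)

# If the two numbers describe the same location, their ratio is the number of
# 512-byte sectors per logical block on the drive.
ratio, remainder = divmod(kernel_sector, lba)
print(lba, blocks, ratio, remainder, ratio * 512)
# prints: 645117285 1 8 0 4096
```

The LBA in the CDB and the sector in the kernel message therefore point at the
same location, and the failed command was a read of a single logical block.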

These errors happen on all 6 disks in the RAID array (only sdc is shown here,
but the problems on the other disks are essentially identical).
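
To compare the disks, the messages can be tallied per device. The sketch below
is one possible way to do that; the regular expressions are written against
the message formats quoted above, summarize is an illustrative name, and the
sample input is just the lines from this report.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this report.
IO_ERROR = re.compile(r"blk_update_request: I/O error, dev (\w+), sector (\d+)")
RESET = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def summarize(lines):
    """Count I/O errors per block device and resets per SCSI address."""
    io_errors = Counter()
    resets = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            io_errors[match.group(1)] += 1
            continue
        match = RESET.search(line)
        if match:
            resets[match.group(1)] += 1
    return io_errors, resets


SAMPLE = [
    "kernel: sd 6:0:2:0: Power-on or device reset occurred",
    "kernel: blk_update_request: I/O error, dev sdc, sector 5160938280 "
    "op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0",
    "kernel: sd 6:0:2:0: [sdc] tag#1073 CDB: Read(10) 28 00 26 73 b5 65 00 00 01 00",
]

io_errors, resets = summarize(SAMPLE)
print(dict(io_errors), dict(resets))
# prints: {'sdc': 1} {'6:0:2:0': 1}
```

Run over the full kernel log instead of SAMPLE, this would show whether the
resets and I/O errors are spread evenly over all six disks.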

I have also seen I/O errors in the output of smartctl -a while the scrub was
running, but that may simply be due to the device being reset during the
call.

Initially we thought these were hardware problems and we had the server
thoroughly checked by the manufacturer. They swapped out all the hardware, but
the problems did not go away. They concluded that it must be a software
(i.e., driver) issue. I cannot be completely certain, but it looks like the
problems started after upgrading from openSUSE Leap 15.1 to 15.2. The kernel
was fully patched at the time we detected the problems on 29 September. Tests
showed that the previously installed kernel version has the same problem. It
is likely that all kernel versions shipped with openSUSE Leap 15.2 are
affected.

We currently mount the RAID5 array in read-only mode to prevent the I/O errors
from corrupting the file system. This severely limits the functionality of the
server.

