[opensuse] md RAID5 problem, Sata hard reset I/O error, ICH9R, driver bug in opensuse 11.2?
Hello! Has anyone else seen this problem? Could it be a bug in 11.2?

Description:
-----------------------------------------------------------------------------------------
I installed 11.2 on my server last weekend, moving from 11.0 (new install). For many years I've been running RAID5 over three partitions of three HDDs using md and ext3, without a hitch. I've also had other RAID constellations running in parallel on the same HDDs and others, also for years without any problems.

When installing 11.2 I also changed the partitioning and created a new, otherwise identical RAID5 setup. However, when copying back the data from my backup the array suddenly broke down, with 2 of the drives marked faulty and removed!

I retried a number of times to try to understand what was happening. I tried both the default and the Xen kernel (both x86_64). I always got the same error.

When testing, I saw that first one drive would get an exception and a "hard reset" and be removed from the array - always the last raid device (2). This happened after copying several GB of data (usually after around 8 GB). If I then stopped and re-added the "faulty" drive, it would get back in sync and I could continue copying another several GB until it got the hard reset again. If I didn't stop to re-add it, then the second-to-last raid device (1) would get a hard reset and be removed as faulty - obviously causing the 3-disk RAID5 to break down.

After trying this for a while I instead created a RAID1 setup using the same 3 partitions and ext4. I've had no problems at all with this setup, filling it to 90% (about 80 GB) and running it for 5 days.

Worth mentioning is that I've never had any problems with these physical HDDs, which are now around 4-5 years old. I'm also running root and other partitions and a few VMs using LVM and Xen from the very same HDDs. I checked SMART, including a long self-test, without finding any errors. I've been running exactly the same hardware for a few years without any problems.
When googling I saw a similar problem reported, but with nvidia (sata_nv) and some change introduced in the 2.6.26 kernel or thereabouts. I don't know if this issue is related. In this case it's Intel ICH9R.
-----------------------------------------------------------------------------------------
Product: openSUSE 11.2 (final)

uname -a:
Linux serv 2.6.31.8-0.1-xen #1 SMP 2009-12-15 23:55:40 +0100 x86_64 x86_64 x86_64 GNU/Linux

dmesg:
-----------------------------------------------------------------------------------------
[ 2.992003] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.997047] ata3.00: ATA-7: SAMSUNG SP2004C, VM100-33, max UDMA7
[ 2.997049] ata3.00: 390721968 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 3.002142] ata3.00: configured for UDMA/133
...
[15431.871709] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[15431.871723] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[15431.871724] res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[15431.871733] ata3.00: status: { DRDY }
[15431.871739] ata3: hard resetting link
[15432.355696] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[15432.365819] ata3.00: configured for UDMA/133
[15432.365825] ata3.00: device reported invalid CHS sector 0
[15432.365831] end_request: I/O error, dev sdc, sector 185181205
[15432.365836] md: super_written gets error=-5, uptodate=0
[15432.365840] raid5: Disk failure on sdc5, disabling device.
[15432.365841] raid5: Operation continuing on 2 devices.
[15432.365853] ata3: EH complete
[15432.443698] RAID5 conf printout:
[15432.443706] --- rd:3 wd:2
[15432.443711] disk 0, o:1, dev:sda5
[15432.443715] disk 1, o:1, dev:sdb5
[15432.443718] disk 2, o:0, dev:sdc5
[15432.463691] RAID5 conf printout:
[15432.463696] --- rd:3 wd:2
[15432.463700] disk 0, o:1, dev:sda5
[15432.463703] disk 1, o:1, dev:sdb5
-----------------------------------------------------------------------------------------
sudo /usr/sbin/smartctl -a /dev/sdc: OK
-----------------------------------------------------------------------------------------
Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2004C
Serial Number: S07GJ10Y946521
Firmware Version: VM100-33
User Capacity: 200,049,647,616 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Fri Jan 29 07:59:58 2010 CET
-----------------------------------------------------------------------------------------
lspci -nn:
-----------------------------------------------------------------------------------------
00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller [8086:2922] (rev 02)
-----------------------------------------------------------------------------------------
Mobo: Gigabyte GA-P35C-DS3R

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
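The dmesg excerpt in the message above can be condensed into a short summary of what failed and in which order. The following is only a minimal sketch, assuming Python 3 and log lines shaped like the ones quoted; `summarize` and `ATA_OPCODES` are hypothetical names introduced for illustration and are not part of any tool mentioned in the thread. The opcode table is a small assumed subset of the ATA command set (0xEA is FLUSH CACHE EXT), and reading `rd`/`wd` as configured versus working member counts is likewise an assumption about the md printout.

```python
import re

# Lines copied from the dmesg excerpt quoted in the message above.
LOG = """\
[15431.871709] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[15431.871723] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[15431.871724] res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[15431.871739] ata3: hard resetting link
[15432.365831] end_request: I/O error, dev sdc, sector 185181205
[15432.365840] raid5: Disk failure on sdc5, disabling device.
[15432.443706] --- rd:3 wd:2
"""

# Assumption: a small subset of ATA command opcodes, enough for this log.
ATA_OPCODES = {
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}


def summarize(log):
    """Collect the failed command, resets, I/O errors and md state from a log."""
    summary = {
        "commands": [],        # (device, command name) for each failed command
        "timed_out": False,    # True if any result line reports a timeout
        "resets": [],          # ports that were hard reset
        "io_errors": [],       # (block device, sector) pairs
        "failed_members": [],  # array members that md disabled
        "raid_disks": None,    # "rd" value, assumed to be configured members
        "working_disks": None, # "wd" value, assumed to be working members
    }
    for line in log.splitlines():
        m = re.search(r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/", line)
        if m:
            opcode = int(m.group(2), 16)
            name = ATA_OPCODES.get(opcode, hex(opcode))
            summary["commands"].append((m.group(1), name))
        if "(timeout)" in line:
            summary["timed_out"] = True
        m = re.search(r"(ata\d+): hard resetting link", line)
        if m:
            summary["resets"].append(m.group(1))
        m = re.search(r"I/O error, dev (\w+), sector (\d+)", line)
        if m:
            summary["io_errors"].append((m.group(1), int(m.group(2))))
        m = re.search(r"raid5: Disk failure on (\w+),", line)
        if m:
            summary["failed_members"].append(m.group(1))
        m = re.search(r"rd:(\d+) wd:(\d+)", line)
        if m:
            summary["raid_disks"] = int(m.group(1))
            summary["working_disks"] = int(m.group(2))
    return summary


for key, value in summarize(LOG).items():
    print(f"{key}: {value}")
```

Run against the excerpt, the sketch points at a cache flush command timing out on ata3.00, followed by the link reset, the I/O error on sdc, and md dropping sdc5. It only restates what the log already says and does not identify a cause.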
On Fri, Jan 29, 2010 at 2:20 PM, Dan Kopparhed <dank@kth.se> wrote:
Hello! Has anyone else seen this problem? Could it be a bug in 11.2?
Description:
-----------------------------------------------------------------------------------------
I installed 11.2 on my server last weekend, moving from 11.0 (new install). For many years I've been running RAID5 over three partitions of three HDDs using md and ext3, without a hitch. I've also had other RAID constellations running in parallel on the same HDDs and others, also for years without any problems. When installing 11.2 I also changed the partitioning and created a new, otherwise identical RAID5 setup. However, when copying back the data from my backup the array suddenly broke down with 2 of the drives marked faulty and removed!
... snip ...

I am experiencing an even more fundamental problem. I have created a RAID1 array using 2 equal size partitions. However, when I boot the system /dev/md0 does not get initialized. I have created a new initrd with the "md" drivers. I will post a new thread with the details.

-- Arun Khan