I am currently trying to nail down whether the problems I am seeing with software RAID 5 are related to hardware or software.

The symptoms: the last drive in the array (sdg in this example; the array consists of 6 partitions, each the only partition on a 146 GByte drive) is all of a sudden marked as 'failed'. Upon reboot, I did a thorough check of the drive, using the 'validate' tool built into the SCSI card (an LSI card) and the SMART self-test. No errors were found. I added the partition back to the array, and everything looked fine... until a week later, when the same thing happened again.

I can't take the machine down right now, as it is now "live", but again, the smart tools come back empty (see the relevant log entries below).

'smartd' logs in /var/log/messages:

Jun 28 05:00:20 smartd[2114]: Device: /dev/sdd, starting scheduled Short Self-Test.
Jun 28 06:00:20 smartd[2114]: Device: /dev/sde, starting scheduled Short Self-Test.
Jun 28 07:00:21 smartd[2114]: Device: /dev/sdf, starting scheduled Short Self-Test.
Jun 28 08:00:20 smartd[2114]: Device: /dev/sdg, starting scheduled Short Self-Test.
Jun 28 16:13:56 kernel: SCSI error : <1 0 2 0> return code = 0x10000
Jun 28 16:13:56 kernel: end_request: I/O error, dev sdg, sector 85940159
Jun 28 16:13:56 kernel: raid5: Disk failure on sdg1, disabling device. Operation continuing on 5 devices
Jun 28 16:13:56 kernel: klogd 1.4.1, ---------- state change ----------
Jun 28 16:13:56 kernel: Inspecting /boot/System.map-2.6.5-7.75-smp
Jun 28 16:13:56 kernel: Loaded 24168 symbols from /boot/System.map-2.6.5-7.75-smp.
Jun 28 16:13:56 kernel: Symbols match kernel version 2.6.5.
Jun 28 16:13:56 kernel: No module symbols loaded - kernel modules not enabled.
Jun 28 16:13:56 kernel: RAID5 conf printout:
Jun 28 16:13:56 kernel:  --- rd:6 wd:5 fd:1
Jun 28 16:13:56 kernel:  disk 0, o:1, dev:sdb1
Jun 28 16:13:56 kernel:  disk 1, o:1, dev:sdc1
Jun 28 16:13:56 kernel:  disk 2, o:1, dev:sdd1
Jun 28 16:13:56 kernel:  disk 3, o:1, dev:sde1
Jun 28 16:13:56 kernel:  disk 4, o:1, dev:sdf1
Jun 28 16:13:56 kernel:  disk 5, o:0, dev:sdg1
Jun 28 16:13:56 kernel: RAID5 conf printout:
Jun 28 16:13:56 kernel:  --- rd:6 wd:5 fd:1
Jun 28 16:13:56 kernel:  disk 0, o:1, dev:sdb1
Jun 28 16:13:56 kernel:  disk 1, o:1, dev:sdc1
Jun 28 16:13:56 kernel:  disk 2, o:1, dev:sdd1
Jun 28 16:13:56 kernel:  disk 3, o:1, dev:sde1
Jun 28 16:13:56 kernel:  disk 4, o:1, dev:sdf1

smartctl -a /dev/sdg

smartctl version 5.30 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: SEAGATE ST3146807LC Version: 0007
Serial number: 3HY12ZX300007428D5Z4
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Jun 28 17:38:11 2004 UTC
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        68 C
Vendor (Seagate) cache information
  Blocks sent to initiator = 821298259
  Blocks received from initiator = 830338302
  Blocks read from cache and sent to initiator = 1895657476
  Number of read and write commands whose size <= segment size = 302039058
  Number of read and write commands whose size > segment size = 253913
Vendor (Seagate) factory information
  number of hours powered up = 766.22
  number of minutes until next internal SMART test = 88

Error counter log:
           Errors Corrected     Total     Total    Correction    Gigabytes     Total
               delay:        [rereads/   errors    algorithm     processed     uncorrected
             minor | major   rewrites]  corrected  invocations  [10^9 bytes]   errors
read:      8489179       0          0    8489179     8489179     13923.322          0
write:           0       0          0          0           0      1255.104          0
verify:     107204       0          0     107204      107204       293.631          0

Non-medium error count:      678

SMART Self-test log
Num  Test              Status      segment  LifeTime  LBA_first_err  [SK ASC ASQ]
     Description                   number   (hours)
# 1  Background short  Completed      -        761          -        [-   -    -]
# 2  Background short  Completed      -        739          -        [-   -    -]
# 3  Background long   Completed      -        726          -        [-   -    -]
# 4  Background short  Completed      -        678          -        [-   -    -]
# 5  Background short  Completed      -        640          -        [-   -    -]
# 6  Background short  Completed      -        614          -        [-   -    -]
# 7  Background long   Completed      -        599          -        [-   -    -]
# 8  Background short  Completed      -        590          -        [-   -    -]
# 9  Background short  Completed      -        567          -        [-   -    -]
#10  Background short  Completed      -        543          -        [-   -    -]
#11  Background short  Completed      -        332          -        [-   -    -]
#12  Background short  Completed      -          2          -        [-   -    -]
#13  Background short  Completed      -          2          -        [-   -    -]

Long (extended) Self Test duration: 3072 seconds [51.2 minutes]

FWIW: this is a Penguin Computing Altus 3200, two Opteron CPUs, 16 Gig RAM, 7x143 GByte drives, LSI SCSI card (no hardware RAID)

----------------------------------------------------------------
Johannes Ullrich jullrich@euclidian.com
contact: http://johannes.homepc.org/contact.htm
----------------------------------------------------------------
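
As a cross-check on the kernel messages quoted above, here is a minimal Python sketch that pulls the I/O error and the RAID member failure out of such log lines and converts the failing sector into a byte offset. It assumes 512-byte sectors counted from the start of the device. The function name `scan` and the regular expressions are illustrative only: they are written against the exact message wording in this log and are not part of any existing tool.

```python
import re

# Patterns written against the kernel message wording quoted in this post;
# other kernel versions may phrase these messages differently.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
DISK_FAIL = re.compile(r"raid5: Disk failure on (\w+), disabling device")

SECTOR_BYTES = 512  # assumption: 512-byte sectors


def scan(lines):
    """Return (io_errors, failed_members) found in an iterable of log lines."""
    io_errors = []  # tuples of (device, sector, byte_offset)
    failed = []     # RAID member names, e.g. "sdg1"
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            sector = int(m.group(2))
            io_errors.append((m.group(1), sector, sector * SECTOR_BYTES))
        m = DISK_FAIL.search(line)
        if m:
            failed.append(m.group(1))
    return io_errors, failed


# Three lines copied from the log above.
sample = [
    "Jun 28 16:13:56 kernel: SCSI error : <1 0 2 0> return code = 0x10000",
    "Jun 28 16:13:56 kernel: end_request: I/O error, dev sdg, sector 85940159",
    "Jun 28 16:13:56 kernel: raid5: Disk failure on sdg1, disabling device. "
    "Operation continuing on 5 devices",
]

io_errors, failed = scan(sample)
for dev, sector, offset in io_errors:
    print(f"{dev}: sector {sector} is about {offset / 1e9:.1f} GB into the device")
print("failed members:", failed)
```

Run against the full /var/log/messages (for example `scan(open("/var/log/messages"))`), this would show whether the weekly failures hit the same sector each time, which might help separate a marginal spot on the disk from a bus or cabling problem.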