[opensuse] mdadm keeps breaking my array
Hi,

I'm running openSUSE 11.1 with KDE 3.5.10. /home mounts a RAID1 mirrored array, /dev/md0, made up of two identical 1TB Western Digital disks, bought about 6 to 9 months ago.

mdadm has kicked one of these disks out of the array twice in the last 24 hours. The first time this happened, I issued

# mdadm --manage /dev/md0 --add /dev/sdd1
mdadm: re-added /dev/sdd1

so it appeared to be happy to put this disk back into the array. However, tonight it's happened again, and the same command results in

# mdadm --manage /dev/md0 --add /dev/sdd1
mdadm: Cannot open /dev/sdd1: Device or resource busy

Does this mean that sdd is failing and needs to be replaced? I sort of know the answer I'm going to get, but thought I'd check here first, in case I was missing something obvious.

Bob
--
Registered Linux User #463880 FSFE Member #1300
GPG-FP: A6C1 457C 6DBA B13E 5524 F703 D12A FB79 926B 994E
openSUSE 11.1, Kernel 2.6.27.7-9-default, KDE 3.5.10
Intel Core2 Quad Q9400 2.66GHz, 4GB DDR RAM, nVidia GeForce 9200GS
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
On Friday 2009 January 23 12:14:42 Bob Williams wrote:
> /home mounts a RAID1 mirrored array, /dev/md0, made up of two identical 1TB Western Digital disks, bought about 6 to 9 months ago.
> mdadm has kicked one of these disks out of the array twice in the last 24 hours.
> [T]onight it's happened again, and the same command results in
> # mdadm --manage /dev/md0 --add /dev/sdd1
> mdadm: Cannot open /dev/sdd1: Device or resource busy
> Does this mean that sdd is failing and needs to be replaced? I sort of know the answer I'm going to get, but thought I'd check here first, in case I was missing something obvious.
Could be cable, controller, disk, or even some rare kernel bug, but it is probably the disk. If you can get the RAID1 sync'd and healthy again, you might try shutting down, swapping the cables, and seeing if the problem follows the disk or the cable.
--
Boyd Stephen Smith Jr.
bss@iguanasuicide.net
ICQ: 514984 YM/AIM: DaTwinkDaddy
http://iguanasuicide.net/
On Friday 23 January 2009 18:57:48 Boyd Stephen Smith Jr. wrote:
> On Friday 2009 January 23 12:14:42 Bob Williams wrote:
>> Does this mean that sdd is failing and needs to be replaced? I sort of know the answer I'm going to get, but thought I'd check here first, in case I was missing something obvious.
> Could be cable, controller, disk, or even some rare kernel bug, but it is probably the disk. If you can get the RAID1 sync'd and healthy again, you might try, shutting down, swapping the cables, and seeing if the problem follows the disk or the cable.
Actually, what I forgot to mention is that the two drives are attached to a PCI SATA-RAID controller card. I rebooted the computer after posting my last message, and on entering the card's setup, it appears the card has also got these drives set up as a RAID1 array (I must have set this up myself sometime, probably in a previous computer). I destroyed this array, so the two drives are independent at this level, and rebooted.

This time, mdadm --manage is rebuilding the array again. After 90 minutes, it's showing

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdd1[1] sdc1[0]
      976759864 blocks super 1.0 [2/1] [U_]
      [===>.................] recovery = 17.9% (174884480/976759864) finish=550.0min speed=24294K/sec
      bitmap: 124/466 pages [496KB], 1024KB chunk

unused devices: <none>

so I'll be optimistic, and assume the two RAID setups were tripping over each other.

If it fails again, I'll try your suggestion.

Thanks.

Bob
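The [2/1] [U_] fields in the /proc/mdstat output above are what show md0 running on one of its two members while the recovery proceeds. A minimal sketch, assuming Python 3, of how that status could be checked from a script. The names degraded_arrays and SAMPLE_MDSTAT are illustrative only (they are not part of mdadm), and the sample text is copied from the message above.

```python
import re

# Sample text copied from the /proc/mdstat output quoted in the thread.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdd1[1] sdc1[0]
      976759864 blocks super 1.0 [2/1] [U_]
      [===>.................] recovery = 17.9% (174884480/976759864) finish=550.0min speed=24294K/sec
      bitmap: 124/466 pages [496KB], 1024KB chunk

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_status} for arrays with a missing member.

    The member status is the trailing field such as "UU" (all members up)
    or "U_" (second member down or still being rebuilt).
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Only the status field consists purely of U and _ inside brackets.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string.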
On 2009-01-23T20:23:27, Bob Williams <linux@barrowhillfarm.org.uk> wrote:
> Actually, what I forgot to mention, is the two drives are attached to a PCI SATA-RAID controller card. I rebooted the computer after posting my last message, and on entering the card's setup, it appears the card has also got these drives setup as a RAID1 array (I must have set this up myself, sometime. Probably in a previous computer). I destroyed this array, so the two drives are independent at this level, and rebooted.
> This time, mdadm --manage is rebuilding the array again. After 90 minutes, it's showing
The kernel/md is not kicking the drive out of the array without a reason, and not without an error message. Check your logs as to what the reason is.

Regards,
Lars
--
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
On Friday 23 January 2009 23:06:42 Lars Marowsky-Bree wrote:
> The kernel/md is not kicking the drive out of the array without a reason, and not without an error message. Check your logs as to what the reason is.
OK. The rebuild worked OK last night, and /dev/md0 is running on two disks ATM.

I've found the following in /var/log/messages...

Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x80000 action 0xe frozen
Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: irq_stat 0x01100010, PHY RDY changed
Jan 23 17:21:18 barrowhillfarm kernel: ata7: SError: { 10B8B }
Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan 23 17:21:18 barrowhillfarm kernel: res 2a/2d:01:01:00:00/00:00:00:00:2a/00 Emask 0x12 (ATA bus error)
Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: status: { DF DRQ }
Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: error: { ABRT }
Jan 23 17:21:18 barrowhillfarm kernel: ata7: hard resetting link
Jan 23 17:21:25 barrowhillfarm kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
Jan 23 17:21:25 barrowhillfarm kernel: ata7.00: configured for UDMA/100
Jan 23 17:21:25 barrowhillfarm kernel: ata7: EH complete
Jan 23 17:21:25 barrowhillfarm kernel: sd 6:0:0:0: [sdd] 1953525168 512-byte hardware sectors: (1000GB/931GiB)
Jan 23 17:21:25 barrowhillfarm kernel: sd 6:0:0:0: [sdd] Write Protect is off
Jan 23 17:21:25 barrowhillfarm kernel: sd 6:0:0:0: [sdd] Mode Sense: 00 3a 00 00
Jan 23 17:21:25 barrowhillfarm kernel: sd 6:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 23 17:21:25 barrowhillfarm kernel: end_request: I/O error, dev sdd, sector 1953519813
Jan 23 17:21:25 barrowhillfarm kernel: md: super_written gets error=-5, uptodate=0
Jan 23 17:21:25 barrowhillfarm kernel: raid1: Disk failure on sdd1, disabling device.
Jan 23 17:21:25 barrowhillfarm kernel: raid1: Operation continuing on 1 devices.
Jan 23 17:21:25 barrowhillfarm kernel: md: recovery of RAID array md0
Jan 23 17:21:25 barrowhillfarm kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 23 17:21:25 barrowhillfarm kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jan 23 17:21:25 barrowhillfarm kernel: md: using 128k window, over a total of 976759864 blocks.
Jan 23 17:21:25 barrowhillfarm kernel: md: resuming recovery of md0 from checkpoint.
Jan 23 17:21:25 barrowhillfarm kernel: md: md0: recovery done.

This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?

Are there any more detailed logs I should be looking for?

Thanks,
Bob
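A small worked calculation may help place the failing sector. This is a sketch in Python 3 that uses only the two numbers printed in the log above; the function name distance_from_end is illustrative. It shows how far sector 1953519813 sits from the end of sdd. The mdstat output earlier in the thread reports "super 1.0", a metadata format that mdadm stores near the end of the member device, so a failed write this close to the end of the disk would be consistent with the "super_written gets error=-5" line, assuming (the thread does not say) that sdd1 extends to roughly the end of the disk.

```python
import errno

# Both values are taken from the kernel messages quoted in the thread.
TOTAL_SECTORS = 1953525168   # "[sdd] 1953525168 512-byte hardware sectors"
FAILED_SECTOR = 1953519813   # "end_request: I/O error, dev sdd, sector 1953519813"
SECTOR_BYTES = 512


def distance_from_end(total_sectors, sector, sector_bytes=SECTOR_BYTES):
    """Return (sectors, bytes) between the given sector and the end of the disk."""
    sectors = total_sectors - sector
    return sectors, sectors * sector_bytes


sectors_left, bytes_left = distance_from_end(TOTAL_SECTORS, FAILED_SECTOR)
print(sectors_left, "sectors before the end of the disk")
print(round(bytes_left / 2**20, 2), "MiB before the end of the disk")

# The md message reports error=-5; on Linux, errno 5 is the generic I/O error.
print("errno 5 is", errno.errorcode[5])
```

The arithmetic alone cannot distinguish a bad sector from a link problem; the "ATA bus error" and "hard resetting link" lines just before the I/O error are also part of the picture.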
On Saturday, 2009-01-24 at 11:23 -0000, Bob Williams wrote:
> This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?
Try running SMART self-tests on the disks: test one first, check whether the array needs rebuilding when the test finishes, then test the other.

--
Cheers, Carlos E. R.
On Saturday 24 January 2009 11:48:43 Carlos E. R. wrote:
> On Saturday, 2009-01-24 at 11:23 -0000, Bob Williams wrote:
>> This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?
> Try running "smart" tests on the disks: first one, check if it needs rebuilding when it finishes, then the other.
Sorry, can you tell me the command I need? Do I need any particular software installed?

Bob
Bob Williams wrote:
> On Saturday 24 January 2009 11:48:43 Carlos E. R. wrote:
>> On Saturday, 2009-01-24 at 11:23 -0000, Bob Williams wrote:
>>> This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?
>> Try running "smart" tests on the disks: first one, check if it needs rebuilding when it finishes, then the other.
> Sorry, can you tell me the command I need. Do I need any particular software installed?
You need smartmontools installed - then try something like this:

smartctl -a <disk device>

--
/Per Jessen, Zürich
On Saturday 24 January 2009 12:19:45 Per Jessen wrote:
> Bob Williams wrote:
>> On Saturday 24 January 2009 11:48:43 Carlos E. R. wrote:
>>> On Saturday, 2009-01-24 at 11:23 -0000, Bob Williams wrote:
>>>> This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?
>>> Try running "smart" tests on the disks: first one, check if it needs rebuilding when it finishes, then the other.
>> Sorry, can you tell me the command I need. Do I need any particular software installed?
> You need smartmontools installed - then try something like this:
> smartctl -a <disk device>
> -- /Per Jessen, Zürich
Thank you. Both devices in the array passed:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Bob
Bob Williams wrote:
> On Saturday 24 January 2009 12:19:45 Per Jessen wrote:
>> Bob Williams wrote:
>>> On Saturday 24 January 2009 11:48:43 Carlos E. R. wrote:
>>>> On Saturday, 2009-01-24 at 11:23 -0000, Bob Williams wrote:
>>>>> This seems to imply that sdd has a bad sector at 1953519813, but md seems quite happy to rebuild the array?? Does that mean that, by chance, it didn't use that bad sector when rebuilding, but tomorrow it might try writing there, triggering another failure?
>>>> Try running "smart" tests on the disks: first one, check if it needs rebuilding when it finishes, then the other.
>>> Sorry, can you tell me the command I need. Do I need any particular software installed?
>> You need smartmontools installed - then try something like this:
>> smartctl -a <disk device>
>> -- /Per Jessen, Zürich
> Thank you. Both devices in the array passed:
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> Bob
You could also run a long selftest on them:

smartctl -t long <disk device>

--
/Per Jessen, Zürich
On Saturday, 2009-01-24 at 12:38 -0000, Bob Williams wrote:
>> You need smartmontools installed - then try something like this:
>> smartctl -a <disk device>
> Thank you. Both devices in the array passed:
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
Ok, but you need to trigger the actual tests. First the short test, then the long one. The explanation is in the man page. You trigger it, then wait till it finishes. You can continue using the computer during the test.

Trigger only one side of the array each time - IMO.

--
Cheers, Carlos E. R.
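The order described above (short test, then long test, one disk at a time) can be written down as a list of smartctl invocations. The following is only a sketch, assuming Python 3 and smartmontools; the helper names selftest_commands, plan and run are made up for illustration, and /dev/sdc and /dev/sdd are the member disks named earlier in the thread. The -t short, -t long and -l selftest options are documented in the smartctl man page. The sketch does not wait between steps: smartctl prints an estimated completion time when a test is started, and the self-test log should only be read after that time has passed.

```python
import subprocess


def selftest_commands(device, kind="short"):
    """Build the two smartctl invocations for one self-test on one disk."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return [
        # Start the self-test; it runs inside the drive firmware.
        ["smartctl", "-t", kind, device],
        # After the test has finished: read the self-test log for the result.
        ["smartctl", "-l", "selftest", device],
    ]


def plan(devices):
    """Short test then long test for each disk, one disk at a time."""
    steps = []
    for dev in devices:
        for kind in ("short", "long"):
            steps.extend(selftest_commands(dev, kind))
    return steps


def run(argv):
    """Run one step and return its output (needs root and smartmontools)."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


# Print the plan instead of running it, so nothing touches the disks here.
for step in plan(["/dev/sdc", "/dev/sdd"]):
    print(" ".join(step))
```

Testing one member at a time follows the advice in the message above, so that the other half of the mirror is not busy with a self-test at the same moment.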
On Saturday 24 January 2009 13:57:13 Carlos E. R. wrote:
> On Saturday, 2009-01-24 at 12:38 -0000, Bob Williams wrote:
>>> You need smartmontools installed - then try something like this:
>>> smartctl -a <disk device>
>> Thank you. Both devices in the array passed:
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
> Ok, but you need to trigger the actual tests. First the short test, then the long one. The explanation is in the man page. You trigger it, then wait till it finishes. You can continue using the computer during the test.
> Trigger only one side of the array each time - IMO.
Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.

Many thanks for your help, guys. As usual, I've learnt a lot :)

Bob
On Sunday, 2009-01-25 at 09:44 -0000, Bob Williams wrote:
> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
There is another possibility. Some manufacturers, like Seagate, have a standalone utility that runs those tests under control of the CPU, including a controller and cable test (so they say). It is a bootable floppy or ISO image.

--
Cheers, Carlos E. R.
On Sunday 25 January 2009 11:36:54 Carlos E. R. wrote:
> On Sunday, 2009-01-25 at 09:44 -0000, Bob Williams wrote:
>> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
> There is another possibility. Some manufacturers, like seagate, have a stand alone utility that run those tests under control of the cpu, including controller and cable test (so they say). It is a floppy or iso bootable image.
These drives are Western Digital Caviar Greens. I don't remember getting any CD/floppy with them, but I admit I don't pay much attention when I do, as utility disks supplied with hardware are generally designed to run under Windows.

I'll have a look on the WD website, though. Thanks for the tip.

Bob
On Sunday, 2009-01-25 at 12:06 -0000, Bob Williams wrote:
> On Sunday 25 January 2009 11:36:54 Carlos E. R. wrote:
>> There is another possibility. Some manufacturers, like seagate, have a stand alone utility that run those tests under control of the cpu, including controller and cable test (so they say). It is a floppy or iso bootable image.
> These drives are Western Digital Caviar Greens. I don't remember getting any CD/floppy with them, but I admit I don't pay much attention when I do, as utility disks supplied with hardware are generally designed to run under Windows.
> I'll have a look on the WD website, though. Thanks for the tip.
Yes, Seagate has the utility for download somewhere on their site. You don't get the floppy/CD unless you buy the thing in a nice box in one of those nice computer supermarket stores.

As to being a Windows program... not quite. It is a boot diskette or CD, usually with FreeDOS or MS-DOS, sometimes perhaps Linux. Different manufacturers do things differently, though. I could not find the up-to-date Fujitsu utility, for instance.

--
Cheers, Carlos E. R.
Carlos E. R. wrote:
> On Sunday, 2009-01-25 at 12:06 -0000, Bob Williams wrote:
>> On Sunday 25 January 2009 11:36:54 Carlos E. R. wrote:
>>> There is another possibility. Some manufacturers, like seagate, have a stand alone utility that run those tests under control of the cpu, including controller and cable test (so they say). It is a floppy or iso bootable image.
>> These drives are Western Digital Caviar Greens. I don't remember getting any CD/floppy with them, but I admit I don't pay much attention when I do, as utility disks supplied with hardware are generally designed to run under Windows.
>> I'll have a look on the WD website, though. Thanks for the tip.
> Yes, Seagate has the utility for download somewhere on their site. You don't get the floppy/cd unless you buy the thing in a nice box on one of those nice computer supermarket stores.
> As to being a windows program... not quite, it is a boot diskette or cd, usually with freedos or msdos, some times perhaps linux. Different manufacturer do things differently, though. I could not find the up to date Fujitsu utility, for instance.
> -- Cheers, Carlos E. R.
The "Ultimate Boot CD" has most of the disk manufacturers' utilities on it.

Regards
Dave P
On Sunday 25 January 2009 12:06:57 Bob Williams wrote:
> On Sunday 25 January 2009 11:36:54 Carlos E. R. wrote:
>> On Sunday, 2009-01-25 at 09:44 -0000, Bob Williams wrote:
>>> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
>> There is another possibility. Some manufacturers, like seagate, have a stand alone utility that run those tests under control of the cpu, including controller and cable test (so they say). It is a floppy or iso bootable image.
> These drives are Western Digital Caviar Greens. I don't remember getting any CD/floppy with them, but I admit I don't pay much attention when I do, as utility disks supplied with hardware are generally designed to run under Windows.
> I'll have a look on the WD website, though. Thanks for the tip.
Well, I just looked at the Western Digital website. They provide diagnostic tools for DOS or Windows, but it doesn't look as if these tools do any more than smartctl can do.
On Sunday, 2009-01-25 at 15:37 -0000, Bob Williams wrote:
> Well, I just looked at the Western Digital website. They provide diagnostic tools for DOS or Windows, but it doesn't look as if these tools do anymore than smartctl can do.
The one from Seagate does a bit more. It has two alternatives: run the internal tests, which is the same as smartctl does, or run equivalent tests from the CPU. The second class of tests includes the connection from the computer to the disk, as data has to travel over the cables (the internal tests are run entirely by the HD firmware, i.e., inside the drive only).

I don't know about other manufacturers from personal experience, so I can't comment; but I expect they are similar.

--
Cheers, Carlos E. R.
Carlos E. R. wrote:
> On Sunday, 2009-01-25 at 09:44 -0000, Bob Williams wrote:
>> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
> There is another possibility. Some manufacturers, like seagate, have a stand alone utility that run those tests under control of the cpu, including controller and cable test (so they say). It is a floppy or iso bootable image.
> -- Cheers, Carlos E. R.
If you go to the Seagate website, they have a Linux version of the mentioned diagnostic that even their tech support people did not know existed.

http://www.seagate.com/www/en-us/support/downloads/seatools/

At the bottom of the page is a link to the command-line version. It worked fine under openSUSE 11.0 when I had two bad Barracuda drives in a week.

Also, there is a recall of certain Seagate and Maxtor drive models with a model number starting with ST, including some in the Barracuda line. I had two Maxtor drives go bad within two weeks of each other; they were actually Barracuda models. Seagate replaced them free; both were slightly over a year old.

Both smartmon (long test only) and the Seagate diagnostic showed the failure as bad media.

I have not tried the Seagate test on an 11.1 system; the disks are running great.
On 2009-01-25T09:44:21, Bob Williams <linux@barrowhillfarm.org.uk> wrote:
>> Trigger only one side of the array each time - IMO.
> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
Well, not necessarily.

If it was a read error, rebuilding the array may have rewritten the sector, and that might have caused it to be remapped to one of the spares; so the error may now be "gone".

Cabling errors may be flukes though, maybe you pressed the connector in better now ;-)

Regards,
Lars
On Monday 26 January 2009 10:11:12 Lars Marowsky-Bree wrote:
> On 2009-01-25T09:44:21, Bob Williams <linux@barrowhillfarm.org.uk> wrote:
>>> Trigger only one side of the array each time - IMO.
>> Both disks passed the extended (long) test (whew!), so that narrows down the problem to the cabling, or maybe the controller card.
> Well, not necessarily.
> If it was a read error, rebuilding the error may have rewritten the sector, and that might have caused it to be remapped to one of the spares; so the error may now be "gone".
> Cabling errors may be flukes though, maybe you pressed the connector in better now ;-)
Well, I did a reboot this morning, and /dev/md0 failed the automatic fsck, forcing me to do a manual fsck. There was a whole heap of lost inodes, bad block mapping, multiple block mapping etc. So maybe the array instability was a software problem, not hardware related.

Anyway, all's well now, thanks.

Bob
On 2009-01-26T15:54:00, Bob Williams <linux@barrowhillfarm.org.uk> wrote:
> Well, I did a reboot this morning, and /dev/md0 failed the automatic fsck, forcing me to do a manual fsck. There was a whole heap of lost inodes, bad block mapping, multiple block mapping etc. So maybe the array instability was a software problem, not hardware related.
> Anyway, all's well now, thanks.
I doubt that; the md code is very stable, as are the filesystems. I'd guess your hardware has some serious issues ... But best of luck.

Regards,
Lars
On Tuesday 27 January 2009 14:05:20 Lars Marowsky-Bree wrote:
> On 2009-01-26T15:54:00, Bob Williams <linux@barrowhillfarm.org.uk> wrote:
>> Well, I did a reboot this morning, and /dev/md0 failed the automatic fsck, forcing me to do a manual fsck. There was a whole heap of lost inodes, bad block mapping, multiple block mapping etc. So maybe the array instability was a software problem, not hardware related.
>> Anyway, all's well now, thanks.
> I doubt that; the md code is very stable, as are the filesystems. I'd guess your hardware has some serious issues ... But best of luck.
> Regards, Lars
You may well be right, given that you've a lot more knowledge about Linux systems than I have. I have set up smartctl to test both drives every night, mdadm to send me an e-mail if the array degrades, and a full backup to another drive every night, so I'll just sit back and wait for more information from my system.

I may well be back for more advice, later...

Bob
Bob Williams wrote:
> I've found the following in /var/log/messages...
> Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x80000 action 0xe frozen
> Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: irq_stat 0x01100010, PHY RDY changed
> Jan 23 17:21:18 barrowhillfarm kernel: ata7: SError: { 10B8B }
> Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> Jan 23 17:21:18 barrowhillfarm kernel: res 2a/2d:01:01:00:00/00:00:00:00:2a/00 Emask 0x12 (ATA bus error)
> Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: status: { DF DRQ }
> Jan 23 17:21:18 barrowhillfarm kernel: ata7.00: error: { ABRT }
Cool, that means mdadm is kicking your disk out because it is broken, not because of a bug.. ;)

--
"We have art in order not to die of the truth" - Friedrich Nietzsche
Cristian Rodríguez R.
Software Developer
Platform/OpenSUSE - Core Services
SUSE LINUX Products GmbH
Research & Development
http://www.opensuse.org/
Bob Williams wrote:
> Actually, what I forgot to mention, is the two drives are attached to a PCI SATA-RAID controller card. I rebooted the computer after posting my last message, and on entering the card's setup, it appears the card has also got these drives setup as a RAID1 array (I must have set this up myself, sometime. Probably in a previous computer). I destroyed this array, so the two drives are independent at this level, and rebooted.
> This time, mdadm --manage is rebuilding the array again. After 90 minutes, it's showing
> # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdd1[1] sdc1[0]
>       976759864 blocks super 1.0 [2/1] [U_]
>       [===>.................] recovery = 17.9% (174884480/976759864) finish=550.0min speed=24294K/sec
>       bitmap: 124/466 pages [496KB], 1024KB chunk
> unused devices: <none>
> so I'll be optimistic, and assume the two RAID setups were tripping over each other.
> If it fails again, I'll try your suggestion.
> Thanks.
> Bob
Bob,

This should not have mattered. I have almost an identical setup on an 11.1 box. I have a PCI/SATA controller card with the card configured as RAID1 and using mdraid. So far, I haven't had as much as a hiccup out of the setup (and that is with my 6 year old daughter as its primary driver).

I would vote for swapping cables just to eliminate that as a problem before looking at more costly options, but I wouldn't rule out sdd being the culprit.

--
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
participants (9)
- Bob Williams
- Boyd Stephen Smith Jr.
- Carlos E. R.
- Cristian Rodríguez
- Dave Plater
- David C. Rankin
- Lars Marowsky-Bree
- Per Jessen
- upscope