[opensuse] raid question
Hi all,

I have a little trouble with a software RAID1 array that I built with openSUSE 10.0. One disk has dropped out of the array, and one partition is now flagged (F) for failed. Here is the output of cat /proc/mdstat:

    Personalities : [raid1]
    md1 : active raid1 sda3[0]
          129017920 blocks [2/1] [U_]
    md0 : active raid1 sdb1[2](F) sda1[0]
          26217984 blocks [2/1] [U_]

I am quite far from the server's location, so I need to know: how reasonable is it to assume the disk is not broken and just run 'badblocks -f'?

And if I want to replace it, which is the easiest way? Do I partition the new disk here (maybe with fdisk, though I do not know how to mark the partitions as RAID with id=fd), swap it for the old one, and then use raidhotadd? Or, if the new disk gets the /dev/sdb identifier anyway, will the kernel do it for me at boot time?

Thanks in advance,
L.Cerini

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
On Friday 22 June 2007 17:40, Lorenzo Cerini wrote:
...
Hello Lorenzo,

It seems your RAID arrays are in a pretty bad state: both /dev/md1 and /dev/md0 are running degraded, each on a single disk ([2/1] [U_]). Here's a suggestion on how to troubleshoot it:

1. If you have important data on that server, back it up first to a safe location other than the server itself, using scp, rsync, anything.

2. You can try to rebuild the arrays one by one.

   For /dev/md1 (assuming the missing member is sdb3):
       mdadm /dev/md1 -a /dev/sdb3

   For /dev/md0, remove the failed (F) member first:
       mdadm /dev/md0 -r /dev/sdb1
   then add it again:
       mdadm /dev/md0 -a /dev/sdb1

To prepare the new disk, first note the current partition scheme on the server (fdisk -l /dev/sda, fdisk -l /dev/sdb). You must make the partitions on the new disk EXACTLY like the existing ones. Then partition it with fdisk, for example for sdb:

    fdisk /dev/sdb
    n (new)
    primary partition (1-4)
    First cylinder: (just enter)
    Last cylinder: +1000M (1GB)

Repeat for the other partitions. Then change the type of each partition to software RAID:

    t
    L (for list of codes)
    fd (software raid)

Repeat for the other partitions, then:

    w (save)

--------------
I have a note with the actual menu, attached as a text file; hopefully it can go through the list.

Remember, back up your data first! Keep safe.

HTH,
--
Fajar Priyanto | Reg'd Linux User #327841 | Linux tutorial http://linux2.arinet.org
6:23pm up 5:51, 2.6.18.2-34-default GNU/Linux
Let's use OpenOffice. http://www.openoffice.org
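Before removing and re-adding members as suggested above, it can help to confirm which members the kernel has actually flagged. The following is only a minimal sketch: `failed_members` is a hypothetical helper name, not part of mdadm, and it assumes the /proc/mdstat layout quoted in this thread (array name, colon, state, personality, then members).

```shell
#!/bin/sh
# failed_members: read mdstat-formatted text on stdin and print one
# "array member" pair per line for every member flagged (F).
# Hypothetical helper; assumes lines like "md0 : active raid1 sdb1[2](F) sda1[0]".
failed_members() {
    awk '
        $2 == ":" && $1 ~ /^md/ {
            # Members start at field 5, after "active" and the personality.
            for (i = 5; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*$/, "", dev)   # strip the "[2](F)" suffix
                    print $1, dev
                }
            }
        }
    '
}

# Example with the output quoted in the original post.
# On a live system: failed_members < /proc/mdstat
failed_members <<'EOF'
Personalities : [raid1]
md1 : active raid1 sda3[0]
      129017920 blocks [2/1] [U_]
md0 : active raid1 sdb1[2](F) sda1[0]
      26217984 blocks [2/1] [U_]
EOF
```

Note that md1 does not show up here: its second member is missing altogether rather than flagged (F), which is why only md0 needs the remove step before the add.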
On Fri, 2007-06-22 at 18:24 +0700, Fajar Priyanto wrote:
First cylinder: (just enter) Last cylinder: +1000M (1GB)
I want my 24M back!
--
"Why can't humans just reboot instead of sleeping, so much wasted cycles" -Zombie Coder.
Jonathan Arsenault
I thank you all. As Carlos said in one of the last posts: first replace, then investigate. It was not an option to have the accounting office's server of a shipping company (which means they work 24 hours a day, 7 days a week) stopped for any reason for more than 20 minutes. I just replaced the disk, added the new disk to the RAID and let it resync. Besides, just for completeness, the failed disk did have real hardware trouble.

L.Cerini
The Friday 2007-06-22 at 12:40 +0200, Lorenzo Cerini wrote:
Hi all, I have a little trouble with a software RAID1 array that I built with openSUSE 10.0.
One disk has dropped out of the array, and one partition is now flagged (F) for failed. Here is the output of cat /proc/mdstat: ... I am quite far from the server's location, so I need to know: how reasonable is it to assume the disk is not broken and just run 'badblocks -f'?
And if I want to replace it, which is the easiest way?

Be aware that software RAID will remove a disk for a simple "glitch", a temporary failure. Just scan the logs for errors, check the drive (SMART tests), etc. An attempt to write to a bad block would show up in the log.

Then re-enable the disk, and watch it.

--
Cheers, Carlos E. R.
So maybe it is better if I re-enable the disk and see what happens. Data loss is not a concern, since we regularly back up everything useful at midnight.

L.
The Friday 2007-06-22 at 13:59 +0200, Lorenzo Cerini wrote:
So maybe it is better if I re-enable the disk and see what happens. Data loss is not a concern, since we regularly back up everything useful at midnight.

Just have a look at the logs first.

--
Cheers, Carlos E. R.
I found this in my logs. I don't think it is a bad-block problem, but I cannot tell whether it is a hardware problem:

Jun 19 03:16:42 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:42 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:46 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:46 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:50 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:50 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:53 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:53 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:57 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:57 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:57 axis kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed
Jun 19 03:16:57 axis kernel: end_request: I/O error, dev sdb, sector 25690407
Jun 19 03:16:57 axis kernel: raid1: Disk failure on sdb1, disabling device.
Jun 19 03:16:57 axis kernel: Operation continuing on 1 devices
Jun 19 03:16:57 axis kernel: raid1: sdb1: rescheduling sector 25690344
Jun 19 03:16:57 axis kernel: RAID1 conf printout:
Jun 19 03:16:57 axis kernel: --- wd:1 rd:2
Jun 19 03:16:57 axis kernel: disk 0, wo:0, o:1, dev:sda1
Jun 19 03:16:57 axis kernel: disk 1, wo:1, o:0, dev:sdb1
Jun 19 03:16:57 axis kernel: RAID1 conf printout:
Jun 19 03:16:57 axis kernel: --- wd:1 rd:2
Jun 19 03:16:57 axis kernel: disk 0, wo:0, o:1, dev:sda1
Jun 19 03:16:57 axis kernel: raid1: sda1: redirecting sector 25690344 to another mirror

L.
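The log above reports two different sector numbers for the same failed read: 25690407 on the whole disk (sdb) and 25690344 as seen by raid1 on the member (sdb1). A small worked check, as a sketch only: `sector_offset` is a hypothetical helper, and reading the difference as the start sector of sdb1 is an assumption, since the partition table is not shown anywhere in the thread (fdisk -lu /dev/sdb would confirm it).

```shell
#!/bin/sh
# sector_offset DISK_SECTOR MEMBER_SECTOR
# Prints the difference between a whole-disk sector number and the
# corresponding sector number reported for the RAID member partition.
# Hypothetical helper for illustration only.
sector_offset() {
    echo $(( $1 - $2 ))
}

disk_sector=25690407   # "end_request: I/O error, dev sdb, sector 25690407"
member_sector=25690344 # "raid1: sdb1: rescheduling sector 25690344"

# Assumption: if both lines describe the same failed read, this value
# is where sdb1 starts on the disk.
sector_offset "$disk_sector" "$member_sector"
```

The result matters when addressing the sector by hand: a tool reading /dev/sdb needs the whole-disk number, while one reading /dev/sdb1 or /dev/sda1 needs the member-relative number.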
The Friday 2007-06-22 at 14:34 +0200, Lorenzo Cerini wrote:
I found this in my logs. I don't think it is a bad-block problem, but I cannot tell whether it is a hardware problem:

I think so.

Jun 19 03:16:57 axis kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jun 19 03:16:57 axis kernel: ata2: error=0x40 { UncorrectableError }
Jun 19 03:16:57 axis kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed

I'm not familiar with SCSI errors, but if I interpret it correctly, the drive tried to relocate the bad sector to somewhere else and failed. That's not good: it might mean that the drive has no more spare sectors for remapping and is thus at the end of its life.

You should investigate the SMART logs (smartctl -a /dev/sdb). If you can determine that what I said is the case, then you should replace the drive promptly. If in doubt, run the short and long diagnostics. They don't catch everything, but if they say "bad", it is bad.

See? First admin rule: read the logs ;-)

--
Cheers, Carlos E. R.
Trouble is, those are SATA disks, not SCSI. So I have no smartctl (or at least that is how smartctl answers me).

L.
The Friday 2007-06-22 at 16:07 +0200, Lorenzo Cerini wrote:
Trouble is, those are SATA disks, not SCSI. So I have no smartctl (or at least that is how smartctl answers me).

Doesn't "smartctl" work with SATA drives yet? I thought that had been solved.

--
Cheers, Carlos E. R.
Carlos E. R. wrote:
The Friday 2007-06-22 at 16:07 +0200, Lorenzo Cerini wrote:
Trouble is, those are SATA disks, not SCSI. So I have no smartctl (or at least that is how smartctl answers me).
Doesn't "smartctl" work with SATA drives yet? I thought that had been solved.

It works here on 10.2, but the OP is using 10.0. It seems that it was not fixed there.

Regards,
Chris
--
http://rauchs-home.de - home of yet another suse repository ;)
On 6/22/07, Rauch Christian wrote:
Carlos E. R. wrote:
The Friday 2007-06-22 at 16:07 +0200, Lorenzo Cerini wrote:
Trouble is, those are SATA disks, not SCSI. So I have no smartctl (or at least that is how smartctl answers me).
Doesn't "smartctl" work with SATA drives yet? I thought that had been solved.
It works here on 10.2, but the OP is using 10.0. It seems that it was not fixed there.
Regards, Chris

It used to require a "-d ata" argument. The OP should try that.

Also, SATA drives do not reallocate on read, only on write.

Since the OP has the sector number, he should use dd to read the sector from the good drive into a temp file, then use dd to write it back out to the failed drive. In theory the bad drive will see that someone is writing to a bad sector and remap it to one of the spare sectors.

FYI: There was some discussion about mdraid doing this automatically on a failed read, but I don't think it has been implemented yet.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
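The dd round trip suggested above might look roughly like the sketch below. `copy_sector` is a hypothetical helper, and a 512-byte sector size is assumed. It is exercised here on scratch image files only; pointing it at real devices is an untested assumption, the sector number has to match the device being addressed (whole disk versus partition), and whether a write would actually trigger a remap on this particular drive is exactly what the thread goes on to dispute.

```shell
#!/bin/sh
# copy_sector SRC DST SECTOR
# Copy one 512-byte sector from SRC to the same position in DST,
# going through a temp file (read first, then write), as in the
# suggestion above. Hypothetical helper; assumes 512-byte sectors.
copy_sector() {
    src=$1 dst=$2 sector=$3
    tmp=$(mktemp) || return 1
    # Step 1: read the sector from the good copy into the temp file.
    dd if="$src" of="$tmp" bs=512 skip="$sector" count=1 2>/dev/null &&
    # Step 2: write it to the same sector of the other copy.
    # conv=notrunc keeps dd from truncating a regular-file target.
    dd if="$tmp" of="$dst" bs=512 seek="$sector" count=1 conv=notrunc 2>/dev/null
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Safe demonstration on two 8-sector scratch images.
good=$(mktemp) && bad=$(mktemp) || exit 1
dd if=/dev/urandom of="$good" bs=512 count=8 2>/dev/null
dd if=/dev/zero    of="$bad"  bs=512 count=8 2>/dev/null
copy_sector "$good" "$bad" 5 && echo "sector 5 copied"
rm -f "$good" "$bad"
```

Going through a temp file mirrors the two-step description; a single dd with both skip= and seek= would do the same job in one pass.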
The Friday 2007-06-22 at 16:52 -0400, Greg Freemyer wrote:
Also, SATA drives do not reallocate on read, only on write.

Same as PATA. It is done by the disk hardware, with no CPU intervention.

Since the OP has the sector number, he should use dd to read the sector from the good drive into a temp file, then use dd to write it back out to the failed drive. In theory the bad drive will see that someone is writing to a bad sector and remap it to one of the spare sectors.

That's not possible; it seems you haven't read the previous posts:

| Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
| Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed
| Jun 19 03:16:57 axis kernel: end_request: I/O error, dev sdb, sector 25690407

Remapping has already failed.

--
Cheers, Carlos E. R.
On 6/22/07, Carlos E. R. wrote:
The Friday 2007-06-22 at 16:52 -0400, Greg Freemyer wrote:
Also, SATA drives do not reallocate on read, only on write.
Same as PATA. It is done by the disk hardware, with no CPU intervention.
Since the OP has the sector number, he should use dd to read the sector from the good drive into a temp file, then use dd to write it back out to the failed drive. In theory the bad drive will see that someone is writing to a bad sector and remap it to one of the spare sectors.
That's not possible; it seems you haven't read the previous posts:
| Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
| Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed
| Jun 19 03:16:57 axis kernel: end_request: I/O error, dev sdb, sector 25690407
Remapping has already failed.

It failed on a read, the way I read it. I suggested doing a write.

I don't know what subsystem generated the above. Maybe the dmraid layer tried a write after the failed read? I don't know, but I would still try to do a write manually via dd.

FYI: I don't think the SCSI layer's interpretation of SATA error codes is 100% accurate, so I would not trust anything SATA related that is reported by the SCSI layer that libata is currently kludged underneath.

Hopefully someday libata will become its own full-fledged subsystem without any of the SCSI core code causing confusion.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
The Friday 2007-06-22 at 18:33 -0400, Greg Freemyer wrote:
On 6/22/07, Carlos E. R. wrote:
| Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
| Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed
| Jun 19 03:16:57 axis kernel: end_request: I/O error, dev sdb, sector 25690407
Remapping has already failed.
It failed on a read, the way I read it.

That is irrelevant: remapping was triggered and failed.

I suggested doing a write. I don't know what subsystem generated the above. Maybe the dmraid layer tried a write after the failed read? I don't know, but I would still try to do a write manually via dd.

Something tried remapping and failed, and that is the important thing. If it is the HD's own remapping that failed, as I think it is, then the failure is critical and the HD needs replacing ASAP, no toying.

In fact, trying to write to that sector will probably fail and the remapping will fail, too. Should fail.

That's why reading the SMART log is so important in this case. If he has 10.0 and smartctl can't read it, then he should boot a 10.2 rescue system and read that log.

Or replace the disk first, investigate later.

--
Cheers, Carlos E. R.
On 6/22/07, Carlos E. R. wrote:
The Friday 2007-06-22 at 18:33 -0400, Greg Freemyer wrote:
On 6/22/07, Carlos E. R. wrote:
| Jun 19 03:16:57 axis kernel: sdb: Current: sense key: Medium Error
| Jun 19 03:16:57 axis kernel: Additional sense: Unrecovered read error - auto reallocate failed
| Jun 19 03:16:57 axis kernel: end_request: I/O error, dev sdb, sector 25690407
Remapping has already failed.
It failed on a read, the way I read it.
That is irrelevant: remapping was triggered and failed.
I suggested doing a write. I don't know what subsystem generated the above. Maybe the dmraid layer tried a write after the failed read? I don't know, but I would still try to do a write manually via dd.
Something tried remapping and failed, and that is the important thing. If it is the HD's own remapping that failed, as I think it is, then the failure is critical and the HD needs replacing ASAP, no toying.
In fact, trying to write to that sector will probably fail and the remapping will fail, too. Should fail.
That's why reading the SMART log is so important in this case. If he has 10.0 and smartctl can't read it, then he should boot a 10.2 rescue system and read that log.
Or replace the disk first, investigate later.

I googled the error message and found it in http://tldp.org/HOWTO/archived/SCSI-Programming-HOWTO/SCSI-Programming-HOWTO...

So it appears to be coming out of the SCSI layer that sits above libata. As I said before, I would not trust those error messages, since there is no good mapping of ATA errors into the SCSI world.

Checking the SMART logs makes sense, but based on that single error I would not be replacing hardware.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
participants (6)
- Carlos E. R.
- Fajar Priyanto
- Greg Freemyer
- Jonathan Arsenault
- Lorenzo Cerini
- Rauch Christian