[opensuse] software raid missing a drive??
Hi,

When checking /proc/mdstat I see this:

linux:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdb3[1]
      155219904 blocks [2/1] [_U]

unused devices: <none>
linux:~ #

Could someone help me interpret what is shown here?
As I see it, one drive is missing from the raid-array, but then it says: unused devices: none. It is /dev/hda3 and /dev/hdb3 that were initially set up as raid1.
And if hda3 is missing, can I add it back to a live and running system?
What happens when I add it and the drive is broken?

TIA
--
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
Leen de Braal wrote:
When checking /proc/mdstat I see this:
linux:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdb3[1]
      155219904 blocks [2/1] [_U]

unused devices: <none>
linux:~ #
I would suggest using mdadm --detail /dev/md0; I think its output is a bit clearer.
Could someone help me interpret what is shown here?
Going by mine, it reads as an active raid1 array running only on hdb3, which is the second device in the array (numbering starts at 0). In summary, it is a 2-disk RAID running on 1 disk; the first disk is missing.
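For reference, [2/1] in that output means the array expects 2 devices and has 1 active, and in [_U] the underscore marks the missing first slot while U marks the device that is up. A minimal sketch of checking this from a script, assuming mdstat-format text on stdin; the function name degraded_arrays is made up for illustration and is not part of mdadm:

```shell
#!/bin/sh
# List md arrays whose member-status string (e.g. [_U]) contains an
# underscore, meaning at least one slot has no working device.
# Reads mdstat-format text on stdin. The function name is illustrative.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the current array name
        name != "" && /blocks/ {
            status = $NF                   # last field, e.g. [_U]
            if (status ~ /_/) print name, status
            name = ""
        }
    '
}

# Sample taken from the output quoted in this thread.
degraded_arrays <<'EOF'
Personalities : [raid1]
md0 : active raid1 hdb3[1]
      155219904 blocks [2/1] [_U]

unused devices: <none>
EOF
# prints: md0 [_U]
```

On a live system the same function could be fed with degraded_arrays < /proc/mdstat.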
As I see it, one drive is missing from the raid-array, but then it says: unused devices: none. It is /dev/hda3 and /dev/hdb3 that were initially set up as raid1.
mdadm --detail /dev/md0 will tell you more.
And if hda3 is missing, can I add it back to a live and running system?
Yes. mdadm /dev/md0 -a /dev/hda3
What happens when I add it and the drive is broken?
It will tell you. Again, use mdadm --detail /dev/md0 for info on what is happening to your array.
--
Joe Morris
Registered Linux user 231871 running openSUSE 10.2 x86_64
Leen de Braal wrote:
When checking /proc/mdstat I see this:
linux:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdb3[1]
      155219904 blocks [2/1] [_U]

unused devices: <none>
linux:~ #
I would suggest using mdadm --detail /dev/md0; I think its output is a bit clearer.
Could someone help me interpret what is shown here?
Going by mine, it reads as an active raid1 array running only on hdb3, which is the second device in the array (numbering starts at 0). In summary, it is a 2-disk RAID running on 1 disk; the first disk is missing.
As I see it, one drive is missing from the raid-array, but then it says: unused devices: none. It is /dev/hda3 and /dev/hdb3 that were initially set up as raid1.
mdadm --detail /dev/md0 will tell you more.
    Update Time : Tue Mar 6 13:22:40 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : aed4ffaa:f90aa9b6:be5af158:c22c8924
         Events : 0.9596839

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       3       67        1      active sync   /dev/hdb3
linux:~ #
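The State line is the quickest thing to check in that output. A small sketch, assuming the text of mdadm --detail is piped in on stdin; array_state and is_degraded are hypothetical helper names, not mdadm features:

```shell
#!/bin/sh
# Print the value of the "State :" line from mdadm --detail text on stdin.
array_state() {
    awk -F' : ' '$1 ~ /^ *State$/ { sub(/ +$/, "", $2); print $2 }'
}

# Succeed (exit status 0) if the reported state includes "degraded".
is_degraded() {
    array_state | grep -q degraded
}

# Sample lines copied from the output above.
array_state <<'EOF'
    Update Time : Tue Mar 6 13:22:40 2007
          State : clean, degraded
 Active Devices : 1
EOF
# prints: clean, degraded
```

On a real system this might be used as: mdadm --detail /dev/md0 | is_degraded && echo "md0 is degraded"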
And if hda3 is missing, can I add it back to a live and running system?
Yes. mdadm /dev/md0 -a /dev/hda3
What happens when I add it and the drive is broken?
It will tell you. Again, use mdadm --detail /dev/md0 for info on what is happening to your array.
linux:~ # mdadm /dev/md0 -a /dev/hda3
mdadm: hot added /dev/hda3
linux:~ # cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda3[2] hdb3[1]
      155219904 blocks [2/1] [_U]
      [>....................] recovery = 0.0% (150208/155219904) finish=120.3min speed=21458K/sec

unused devices: <none>
linux:~ #

It seems, that resyncing is going on now.

    Update Time : Tue Mar 6 13:26:30 2007
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 3% complete

           UUID : aed4ffaa:f90aa9b6:be5af158:c22c8924
         Events : 0.9596970

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       3       67        1      active sync   /dev/hdb3
       2       3        3        0      spare rebuilding   /dev/hda3
linux:~ #

Still asking myself how this could have happened? Any idea?
--
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444
Leen de Braal wrote:
    Update Time : Tue Mar 6 13:22:40 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : aed4ffaa:f90aa9b6:be5af158:c22c8924
         Events : 0.9596839

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       3       67        1      active sync   /dev/hdb3
linux:~ #
You should look for a message from mdadm in your syslog which may give you some ideas. Since it looks like it is still working, I would lean toward either controller or possibly cable problems.
It seems, that resyncing is going on now.
    Update Time : Tue Mar 6 13:26:30 2007
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 3% complete

           UUID : aed4ffaa:f90aa9b6:be5af158:c22c8924
         Events : 0.9596970

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       3       67        1      active sync   /dev/hdb3
       2       3        3        0      spare rebuilding   /dev/hda3
linux:~ #
Check it every once in a while till you can see it has finished resyncing.
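Checking by hand works; as a rough sketch, the progress figure can also be pulled out of mdstat-format text with a few lines of shell. The function name rebuild_progress is invented here, and the sample is the recovery line posted earlier in this thread:

```shell
#!/bin/sh
# Print "<array> <percent>" for each array that is currently recovering
# or resyncing. Prints nothing when no rebuild is running.
# Reads mdstat-format text on stdin.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the current array name
        /(recovery|resync) =/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i
        }
    '
}

rebuild_progress <<'EOF'
md0 : active raid1 hda3[2] hdb3[1]
      155219904 blocks [2/1] [_U]
      [>....................] recovery = 0.0% (150208/155219904) finish=120.3min speed=21458K/sec
EOF
# prints: md0 0.0%
```

A wait loop could then be as simple as: while [ -n "$(rebuild_progress < /proc/mdstat)" ]; do sleep 60; done. Treat that as an untested assumption about your kernel's mdstat layout.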
Still asking myself how this could have happened? Any idea?
Perhaps a loose cable, a controller going bad, or perhaps a drive that is about to fail.
--
Joe Morris
Registered Linux user 231871 running openSUSE 10.2 x86_64
The Tuesday 2007-03-06 at 13:28 +0100, Leen de Braal wrote:
Still asking myself how this could have happened? Any idea?
Look at the logs... it's the only way. It could be a glitch: there is a temporary problem at some point, a disk is removed, and it awaits manual intervention. It will automatically activate a spare if one is available, though.
--
Cheers,
Carlos E. R.
The Tuesday 2007-03-06 at 13:28 +0100, Leen de Braal wrote:
Still asking myself how this could have happened? Any idea?
Look at the logs... it's the only way. It could be a glitch: there is a temporary problem at some point, a disk is removed, and it awaits manual intervention. It will automatically activate a spare if one is available, though.
Found:

Mar 5 00:17:14 linux kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 5 00:17:14 linux kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=273480054, high=16, low=5044598, sector=273480053
Mar 5 00:17:14 linux kernel: ide: failed opcode was: unknown
Mar 5 00:17:14 linux kernel: end_request: I/O error, dev hda, sector 273480053
Mar 5 00:17:14 linux kernel: raid1: Disk failure on hda3, disabling device.
Mar 5 00:17:14 linux kernel: Operation continuing on 1 devices
Mar 5 00:17:14 linux kernel: raid1: hda3: rescheduling sector 271343408
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 0, wo:1, o:0, dev:hda3
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: raid1: hdb3: redirecting sector 271343408 to another mirror

I will check if this will come back after the resync.
Thanks all for the explanation.
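To find such events without reading the whole log, a filter along these lines might do. It is only a sketch: md_failures is a made-up name, and it assumes the classic syslog layout shown above (month, day, time, host, then the kernel message).

```shell
#!/bin/sh
# Print "<month> <day> <time> <device>" for every md "Disk failure" line
# found in syslog-format text on stdin.
md_failures() {
    awk '/Disk failure on/ {
        dev = $0
        sub(/^.*Disk failure on /, "", dev)   # drop everything up to the device name
        sub(/,.*$/, "", dev)                  # drop the trailing ", disabling device."
        print $1, $2, $3, dev
    }'
}

# Three lines copied from the log above.
md_failures <<'EOF'
Mar 5 00:17:14 linux kernel: end_request: I/O error, dev hda, sector 273480053
Mar 5 00:17:14 linux kernel: raid1: Disk failure on hda3, disabling device.
Mar 5 00:17:14 linux kernel: Operation continuing on 1 devices
EOF
# prints: Mar 5 00:17:14 hda3
```

It could be pointed at a real log with md_failures < /var/log/messages, if that is where your syslog writes kernel messages.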
--
Cheers,
Carlos E. R.
--
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444
On 3/6/07, Leen de Braal wrote:
The Tuesday 2007-03-06 at 13:28 +0100, Leen de Braal wrote:
Still asking myself how this could have happened? Any idea?
Look at the logs... it's the only way. It could be a glitch: there is a temporary problem at some point, a disk is removed, and it awaits manual intervention. It will automatically activate a spare if one is available, though.
Found:
Mar 5 00:17:14 linux kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 5 00:17:14 linux kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=273480054, high=16, low=5044598, sector=273480053
Mar 5 00:17:14 linux kernel: ide: failed opcode was: unknown
Mar 5 00:17:14 linux kernel: end_request: I/O error, dev hda, sector 273480053
Mar 5 00:17:14 linux kernel: raid1: Disk failure on hda3, disabling device.
Mar 5 00:17:14 linux kernel: Operation continuing on 1 devices
Mar 5 00:17:14 linux kernel: raid1: hda3: rescheduling sector 271343408
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 0, wo:1, o:0, dev:hda3
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: raid1: hdb3: redirecting sector 271343408 to another mirror
Is the above telling me that hda3 was removed from the mirror because of a single bad sector?
That seems extremely aggressive.
I know there is some LKML discussion of needing to have MD automatically detect the above and simply rewrite the failed sector with data from the good mirrored sector.
During the write /dev/hda should re-map the failed sector and continue running fine (i.e., all disk sector remapping for failures happens on writes, AIUI).
If a disk is currently failed after a single sector read error, I can see why the kernel developers are looking into alternate ways to handle the situation.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
On 3/6/07, Leen de Braal wrote:
The Tuesday 2007-03-06 at 13:28 +0100, Leen de Braal wrote:
Still asking myself how this could have happened? Any idea?
Look at the logs... it's the only way. It could be a glitch: there is a temporary problem at some point, a disk is removed, and it awaits manual intervention. It will automatically activate a spare if one is available, though.
Found:
Mar 5 00:17:14 linux kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 5 00:17:14 linux kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=273480054, high=16, low=5044598, sector=273480053
Mar 5 00:17:14 linux kernel: ide: failed opcode was: unknown
Mar 5 00:17:14 linux kernel: end_request: I/O error, dev hda, sector 273480053
Mar 5 00:17:14 linux kernel: raid1: Disk failure on hda3, disabling device.
Mar 5 00:17:14 linux kernel: Operation continuing on 1 devices
Mar 5 00:17:14 linux kernel: raid1: hda3: rescheduling sector 271343408
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 0, wo:1, o:0, dev:hda3
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: raid1: hdb3: redirecting sector 271343408 to another mirror
Is the above telling me that hda3 was removed from the mirror because of a single bad sector?
That seems extremely aggressive.
Me too
I know there is some LKML discussion of needing to have MD automatically detect the above and simply rewrite the failed sector with data from the good mirrored sector.
During the write /dev/hda should re-map the failed sector and continue running fine (i.e., all disk sector remapping for failures happens on writes, AIUI).
If a disk is currently failed after a single sector read error, I can see why the kernel developers are looking into alternate ways to handle the situation.
It is running OK now; as far as I can see, everything is in sync. For me it means that I will have to pay more attention to monitoring this kind of error. I will look into mdadm, as I have seen that it has parameters that can make it do this and report to me by mail or something like that.
Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century
--
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444
Leen de Braal wrote:
It is running OK now; as far as I can see, everything is in sync. For me it means that I will have to pay more attention to monitoring this kind of error. I will look into mdadm, as I have seen that it has parameters that can make it do this and report to me by mail or something like that.
Make sure mdadm is being started to monitor your RAID, and configure it for your situation with the /etc/sysconfig editor. It works well.
--
Joe Morris
Registered Linux user 231871 running openSUSE 10.2 x86_64
The Tuesday 2007-03-06 at 16:26 +0100, Leen de Braal wrote:
Look at the logs... it's the only way. It could be a glitch: there is a temporary problem at some point, a disk is removed, and it awaits manual intervention. It will automatically activate a spare if one is available, though.
Found:
Mar 5 00:17:14 linux kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 5 00:17:14 linux kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=273480054, high=16, low=5044598, sector=273480053
Mar 5 00:17:14 linux kernel: ide: failed opcode was: unknown
Mar 5 00:17:14 linux kernel: end_request: I/O error, dev hda, sector 273480053
Mar 5 00:17:14 linux kernel: raid1: Disk failure on hda3, disabling device.
Mar 5 00:17:14 linux kernel: Operation continuing on 1 devices
Mar 5 00:17:14 linux kernel: raid1: hda3: rescheduling sector 271343408
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 0, wo:1, o:0, dev:hda3
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: RAID1 conf printout:
Mar 5 00:17:14 linux kernel: --- wd:1 rd:2
Mar 5 00:17:14 linux kernel: disk 1, wo:0, o:1, dev:hdb3
Mar 5 00:17:14 linux kernel: raid1: hdb3: redirecting sector 271343408 to another mirror
Is the above telling me that hda3 was removed from the mirror because of a single bad sector?
Yes...
That seems extremely aggressive.
Quite so.
Me too
I know there is some LKML discussion of needing to have MD automatically detect the above and simply rewrite the failed sector with data from the good mirrored sector.
During the write /dev/hda should re-map the failed sector and continue running fine (i.e., all disk sector remapping for failures happens on writes, AIUI).
Yes, that should work. The disk firmware remaps bad sectors when writing. Alternatively, the software could remap a sector, but it would do that at the layer above the mirror, i.e., at the ext3 level, for example, meaning on both disks. But that is not automatic either, AFAIK.
If a disk is currently failed after a single sector read error, I can see why the kernel developers are looking into alternate ways to handle the situation.
Seems so.
It is running OK now; as far as I can see, everything is in sync. For me it means that I will have to pay more attention to monitoring this kind of error. I will look into mdadm, as I have seen that it has parameters that can make it do this and report to me by mail or something like that.
You can set it to email you, even to page or phone you, I think. Also, you can find the error in the SMART log of that HD, using smartctl. It should be possible to deduce whether the sector was remapped by looking at Reallocated_Sector_Ct.
--
Cheers,
Carlos E. R.
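A hedged sketch of reading that counter: it assumes the usual ten-column attribute table printed by smartctl -A, with the raw value in the last column. The function name smart_raw_value and the sample row are made up for illustration; they are not output from the disk in this thread.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute from "smartctl -A" text on stdin.
# $1 is the attribute name, for example Reallocated_Sector_Ct.
smart_raw_value() {
    awk -v attr="$1" '$2 == attr { print $10 }'
}

# Illustrative row only (invented values, not from the disk discussed here).
smart_raw_value Reallocated_Sector_Ct <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
EOF
# prints: 1
```

On a real system it might be used as: smartctl -A /dev/hda | smart_raw_value Reallocated_Sector_Ct. A raw value above 0 would suggest that the drive has remapped at least one sector.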
On Tuesday 06 March 2007, Leen de Braal wrote:
It is running OK now; as far as I can see, everything is in sync. For me it means that I will have to pay more attention to monitoring this kind of error.
mdadm can help with that. It has a monitor mode that you can run, which will send email if this sort of problem happens.
--
John Andersen
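For the record, monitor mode is started with mdadm --monitor. The sketch below only builds and prints such a command line rather than running it; the address root and the 300 second delay are assumed values, and monitor_command is a hypothetical wrapper name.

```shell
#!/bin/sh
# Build (but do not run) an mdadm monitor command line.
# $1: mail address for alerts (assumed default: root)
# $2: polling delay in seconds (assumed default: 300)
monitor_command() {
    mail_to=${1:-root}
    delay=${2:-300}
    printf 'mdadm --monitor --scan --daemonise --mail=%s --delay=%s\n' "$mail_to" "$delay"
}

monitor_command root 300
# prints: mdadm --monitor --scan --daemonise --mail=root --delay=300
```

Running mdadm --monitor --scan --oneshot --test should send one test alert per array, which is a handy way to confirm that mail delivery works before relying on it.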
participants (5)
- Carlos E. R.
- Greg Freemyer
- Joe Morris (NTM)
- John Andersen
- Leen de Braal