Re: [SLE] OOps, never mind: Re: [SLE] Help with disk integrity and RAID-1 please
Well, since you didn't post this to the list, but it seems to be valuable, I hope you (and the list) will forgive me if I forward this to all. I found it very useful, many thanks! (I also like your document on setting up a spam/virus filtering mail system :)

Cheers,
Simon

--- Stephen Carter <stephen@retnet.co.uk> wrote:
Simon,
Not beating my drum here, hence the direct post, but I've written a small guide on setting up software RAID1 and it includes a few pointers on e-mail notification and drive replacement.
It's aimed at beginners building a bootable SuSE 9.3 RAID 1 setup from scratch, and it's by no means a complete readme, but it may help in some regard.
The on-line version of the guide is: http://www.retnet.co.uk/modules.php?name=News&file=article&sid=54
which also includes a link at the top for a downloadable pdf version.
Cheers,
SteveC
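(For anyone who just wants the notification part: it generally comes down to mdadm's monitor mode. A minimal sketch, with a placeholder mail address -- the guide above has the SuSE-specific details:

  # in /etc/mdadm.conf: where failure alerts get mailed (example address)
  MAILADDR root@localhost

  # run the monitor as a daemon, polling the arrays every 30 minutes
  mdadm --monitor --scan --daemonise --delay=1800

Many distributions will start a monitor daemon for you at boot once MAILADDR is set.)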
Simon Roberts <thorpflyer@yahoo.com> 10/10/05 6:55 am >>>

Silly me: when I rub the sleep out of my eyes and do a long test, no, the disk is indeed dying. It reported happy before I told it to do any explicit tests, then again after a short test, but part way through a long test it's complaining of seek errors, and says it has only a day to live.
Pretty cool utility the SMART stuff though! Ideal for managing an array and preemptively replacing stuff before it's too late.
Thanks, Simon
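(For anyone finding this thread in the archive: the tests described above map to roughly these smartctl invocations, assuming the suspect disk is /dev/hde -- substitute your own device:

  smartctl -H /dev/hde            # overall health self-assessment
  smartctl -t short /dev/hde      # quick self-test, a couple of minutes
  smartctl -t long /dev/hde       # full surface scan, can take hours
  smartctl -l selftest /dev/hde   # results of the self-tests
  smartctl -a /dev/hde            # everything: attributes, error log, test log

As happened here, the short test can pass while the long test still turns up seek or read errors.)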
--- Simon Roberts <thorpflyer@yahoo.com> wrote:
Following another post pointing out the existence of the smartctl test interface, it looks as if this drive of mine might actually be ok. Is there any possibility that I screwed up the configuration and, in effect, switched off the other drive from the RAID array, rather than it being taken down for errors? If I did, how might I get it back, can I just zero its contents and add it to the array again? And any pointer as to the command(s) to re-add it? (I know how to use dd to zero it).
TIA, Simon
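(For the archive: if the dropped disk does turn out to be healthy, re-adding it is normally just an mdadm --add. A minimal sketch, assuming the removed member is /dev/hde1 -- double-check the device name before doing anything destructive:

  # optional: clear the old RAID superblock instead of dd'ing the whole partition
  mdadm --zero-superblock /dev/hde1

  # hot-add the partition back; the kernel resyncs it against the good mirror
  mdadm /dev/md0 --add /dev/hde1

cat /proc/mdstat will then show the resync progress.)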
--- Michael W Cocke <cocke@catherders.com> wrote:
On Sat, 8 Oct 2005 09:26:28 -0700 (PDT), you wrote:
Please forgive me if this shows up twice; I tried to send it once, but it has taken an improbably long time and still not shown up, so it's time to try again.
Following a premature disk failure (at three months), I created a RAID 1 array. I understand the basic idea of RAID, but have never used the tools to do it before (not on Linux, not on anything).
As I built it, I knew there were many things I didn't know about, but hoped I could learn slowly in "spare" time. For example: does RAID move bad blocks on its elements, or does it just dump the doubtful device? If RAID finds a disk problem, does it tell me about it, and if so how? If RAID rejects a device, particularly if it's for "transient" reasons like a single bad sector, can I re-prepare the disk manually and get it back into service? If I have to replace a failed disk, how do I do that?
Anyway, these questions are still unanswered (after about 3 months...) and guess what: I'm pretty sure I have a drive failure. It makes odd noises, like the other one did :( I poked around, and managed to work out the existence of the mdadm command, and found this:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Sep  1 05:49:50 2005
     Raid Level : raid1
     Array Size : 156280192 (149.04 GiB 160.03 GB)
    Device Size : 156280192 (149.04 GiB 160.03 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Oct  8 09:38:25 2005
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : b829bc95:3f42a40e:5a8be8f6:4fadb25c
         Events : 0.1345011

    Number   Major   Minor   RaidDevice   State
       0       0        0        -        removed
       1      34        1        1        active sync   /dev/hdg1
I don't really know what I'm looking at, but the output looks bad, right?
I also found this in dmesg's output:
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdg1 ...
md: adding hdg1 ...
md: adding hde1 ...
md: created md0
md: bind<hde1>
md: bind<hdg1>
md: running: <hdg1><hde1>
md: kicking non-fresh hde1 from array!
md: unbind<hde1>
md: export_rdev(hde1)
raid1: raid set md0 active with 1 out of 2 mirrors
md: ... autorun DONE.
Which also looks bad, don't you think?
So, can anyone please tell me in the short term:
1) Is hde indeed out of the array as it appears?
Yes.
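(You can confirm the same thing with a quick

  cat /proc/mdstat

which reports [2/2] [UU] for a healthy two-disk mirror, and [2/1] with an underscore in place of the missing member for a degraded one.)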
2) How can I determine what the failure is? (is it "a few" bad sectors, too many to want to reuse the drive, or a more complete failure)
There is no such thing as a 'partial drive failure' on an IDE drive. Bad sector marking/remapping is handled by the on-board electronics; if the alternate sector map is full, the drive is a short time away from complete failure. Since you describe odd noises, you don't even need to worry about that - it's junk.
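(How close a drive is to that point shows up in its SMART attribute table; a rough check, again assuming the suspect disk is /dev/hde:

  smartctl -A /dev/hde | grep -i -e reallocated -e pending

Reallocated_Sector_Ct and Current_Pending_Sector climbing toward their thresholds are the usual sign that the spare-sector pool is running out.)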
3) Can I reformat, move bad sectors, clean up the drive (if it's a minor failure) and get it back into service, and if so how?
See #2 above.
4) If I elect/have to replace the drive, what do I do to make it take up its ordained place in the md array?
Power down the system, replace the drive, power up the system. The only real recovery headache with a RAID is if the boot drive is the one that failed... In that case, you need to have made certain that ALL the disks are bootable (lilo can do that, I don't know about grub), or else have an alternate boot method.
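(Roughly, the software side after the physical swap is: partition the new disk to match the survivor, add the partition back into the array, and put a boot loader on both disks. A sketch, assuming the new disk appears as /dev/hde and the surviving one is /dev/hdg -- verify the device names before copying partition tables:

  # copy the partition layout from the good disk to the new one
  sfdisk -d /dev/hdg | sfdisk /dev/hde

  # add the new partition to the mirror; it resyncs automatically
  mdadm /dev/md0 --add /dev/hde1

  # grub (legacy) example: install to the new disk's MBR as well,
  # assuming /boot lives on its first partition
  grub
  grub> device (hd0) /dev/hde
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit

With lilo, the raid-extra-boot option in lilo.conf tells it to write boot records to the array members.)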
Then in the longer term, where should I be looking for the docs so I can know this for myself in future?
All of the docs on the Linux software RAID system that I've seen are lousy... The code is still evolving, and it seems to be being written by people who aren't into docs. O'Reilly has 'Managing RAID on Linux'
=== message truncated ===

"You can tell whether a man is clever by his answers. You can tell whether a man is wise by his questions." - Naguib Mahfouz