Re: [SLE] OOps, never mind: Re: [SLE] Help with disk integrity and RAID-1 please
Well, since you didn't post this to the list, but it seems to be valuable, I hope you (and the list) will forgive me if I forward this to all. I found it very useful, many thanks! (I also like your document on setting up a spam/virus filtering mail system :)

Cheers,
Simon

--- Stephen Carter <stephen@retnet.co.uk> wrote:
Simon,
Not beating my drum here, hence the direct post, but I've written a small guide on setting up software RAID1 and it includes a few pointers on e-mail notification and drive replacement.
It's aimed at beginners building a bootable SuSE 9.3 RAID 1 setup from scratch, and it's by no means a complete readme, but it may help in some regard.
The on-line version of the guide is: http://www.retnet.co.uk/modules.php?name=News&file=article&sid=54
which also includes a link at the top for a downloadable pdf version.
Cheers,
SteveC
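(For anyone who just wants the notification part: it generally comes down to mdadm's monitor mode. A minimal sketch, with a placeholder mail address -- the guide above has the SuSE-specific details:

  # in /etc/mdadm.conf: where failure alerts get mailed (example address)
  MAILADDR root@localhost

  # run the monitor as a daemon, polling the arrays every 30 minutes
  mdadm --monitor --scan --daemonise --delay=1800

Many distributions will start a monitor daemon for you at boot once MAILADDR is set.)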
Simon Roberts <thorpflyer@yahoo.com> 10/10/05 6:55 am >>>

Silly me: when I rub the sleep out of my eyes and do a long test, no, the disk is indeed dying. It reported happy before I told it to do any explicit tests, then again after a short test, but part way through a long test it's complaining of seek errors, and says it has only a day to live.
Pretty cool utility the SMART stuff though! Ideal for managing an array and preemptively replacing stuff before it's too late.
Thanks, Simon
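(For anyone finding this thread in the archive: the tests described above map to roughly these smartctl invocations, assuming the suspect disk is /dev/hde -- substitute your own device:

  smartctl -H /dev/hde            # overall health self-assessment
  smartctl -t short /dev/hde      # quick self-test, a couple of minutes
  smartctl -t long /dev/hde       # full surface scan, can take hours
  smartctl -l selftest /dev/hde   # results of the self-tests
  smartctl -a /dev/hde            # everything: attributes, error log, test log

As happened here, the short test can pass while the long test still turns up seek or read errors.)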
--- Simon Roberts <thorpflyer@yahoo.com> wrote:
Following another post pointing out the existence of the smartctl test interface, it looks as if this drive of mine might actually be ok. Is there any possibility that I screwed up the configuration and, in effect, switched off the other drive from the RAID array, rather than it being taken down for errors? If I did, how might I get it back, can I just zero its contents and add it to the array again? And any pointer as to the command(s) to re-add it? (I know how to use dd to zero it).
TIA, Simon
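(For the archive: if the dropped disk does turn out to be healthy, re-adding it is normally just an mdadm --add. A minimal sketch, assuming the removed member is /dev/hde1 -- double-check the device name before doing anything destructive:

  # optional: clear the old RAID superblock instead of dd'ing the whole partition
  mdadm --zero-superblock /dev/hde1

  # hot-add the partition back; the kernel resyncs it against the good mirror
  mdadm /dev/md0 --add /dev/hde1

cat /proc/mdstat will then show the resync progress.)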
--- Michael W Cocke <cocke@catherders.com> wrote:
On Sat, 8 Oct 2005 09:26:28 -0700 (PDT), you wrote:
Please forgive me if this shows up twice; I tried to send it once, but it has taken an improbably long time and still not shown up, so it's time to try again.
Following a premature disk failure (at three months), I created a RAID 1 array. I understand the basic idea of RAID, but have never used the tools to do it before (not on Linux, not on anything).
As I built it, I knew there were many things I didn't know about, but hoped I could learn slowly in "spare" time. For example: does RAID move bad blocks on its elements, or does it just dump the doubtful device? If RAID finds a disk problem, does it tell me about it, and if so how? If RAID rejects a device, particularly if it's for "transient" reasons like a single bad sector, can I re-prepare the disk manually and get it back into service? If I have to replace a failed disk, how do I do that?
Anyway, these questions are still unanswered (after about 3 months...) and guess what: I'm pretty sure I have a drive failure. It makes odd noises, like the other one did :( I poked around, and managed to work out the existence of the mdadm command, and found this:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Sep  1 05:49:50 2005
     Raid Level : raid1
     Array Size : 156280192 (149.04 GiB 160.03 GB)
    Device Size : 156280192 (149.04 GiB 160.03 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Oct  8 09:38:25 2005
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : b829bc95:3f42a40e:5a8be8f6:4fadb25c
         Events : 0.1345011

    Number   Major   Minor   RaidDevice   State
       0       0        0        -        removed
       1      34        1        1        active sync   /dev/hdg1
I don't really know what I'm looking at, but the output looks bad, right?
I also found this in dmesg's output:
md: Autodetecting RAID arrays.
md: autorun ...
md: considering hdg1 ...
md: adding hdg1 ...
md: adding hde1 ...
md: created md0
md: bind<hde1>
md: bind<hdg1>
md: running: <hdg1><hde1>
md: kicking non-fresh hde1 from array!
md: unbind<hde1>
md: export_rdev(hde1)
raid1: raid set md0 active with 1 out of 2 mirrors
md: ... autorun DONE.
Which also looks bad, don't you think?
So, can anyone please tell me in the short term:
1) Is hde indeed out of the array as it appears?
Yes.
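(You can confirm the same thing with a quick

  cat /proc/mdstat

which reports [2/2] [UU] for a healthy two-disk mirror, and [2/1] with an underscore in place of the missing member for a degraded one.)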
2) How can I determine what the failure is? (is it "a few" bad sectors, too many to want to reuse the drive, or a more complete failure)
There is no such thing as a 'partial drive failure' on an IDE drive. Bad sector marking/remapping is handled by the on-board electronics; if the alternate sector map is full, the drive is a short time away from complete failure. Since you describe odd noises, you don't even need to worry about that - it's junk.
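(How close a drive is to that point shows up in its SMART attribute table; a rough check, again assuming the suspect disk is /dev/hde:

  smartctl -A /dev/hde | grep -i -e reallocated -e pending

Reallocated_Sector_Ct and Current_Pending_Sector climbing toward their thresholds are the usual sign that the spare-sector pool is running out.)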
3) Can I reformat, move bad sectors, clean up the drive (if it's a minor failure) and get it back into service, and if so how?
See #2 above.
4) If I elect/have to replace the drive, what do I do to make it take up its ordained place in the md array?
Power down the system, replace the drive, power up the system. The only real recovery headache with a RAID is if the boot drive is the one that failed... In that case, you need to have made certain that ALL the disks are bootable (lilo can do that, I don't know about grub), or else have an alternate boot method.
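(Roughly, the software side after the physical swap is: partition the new disk to match the survivor, add the partition back into the array, and put a boot loader on both disks. A sketch, assuming the new disk appears as /dev/hde and the surviving one is /dev/hdg -- verify the device names before copying partition tables:

  # copy the partition layout from the good disk to the new one
  sfdisk -d /dev/hdg | sfdisk /dev/hde

  # add the new partition to the mirror; it resyncs automatically
  mdadm /dev/md0 --add /dev/hde1

  # grub (legacy) example: install to the new disk's MBR as well,
  # assuming /boot lives on its first partition
  grub
  grub> device (hd0) /dev/hde
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit

With lilo, the raid-extra-boot option in lilo.conf tells it to write boot records to the array members.)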
Then in the longer term, where should I be looking for the docs so I can know this for myself in future?
All of the docs on the Linux software RAID system that I've seen are lousy... The code is still evolving, and it seems to be being written by people who aren't into docs. O'Reilly has 'Managing RAID on Linux'
=== message truncated ===

"You can tell whether a man is clever by his answers. You can tell whether a man is wise by his questions." - Naguib Mahfouz