Mailinglist Archive: opensuse (1196 mails)

Re: [opensuse] Does openSuSE Ever run fsck on disks in dmraid array with nvidia controller?
  • From: "David C. Rankin" <drankinatty@xxxxxxxxxxxxxxxxxx>
  • Date: Wed, 27 Jan 2010 23:54:52 -0600
  • Message-id: <4B6126AC.9080706@xxxxxxxxxxxxxxxxxx>
On 01/23/2010 01:58 PM, Carlos E. R. wrote:
>> I can't believe that is the way dmraid works. In the past, I have actually
>> preferred it to software raid because I can rebuild a new disk following a
>> drive failure and remake the array before I ever have to boot the operating
>> system. But, if you can never e2fsck -fcy /dev/sda, sdb, etc. without first
>> disabling the array in the bios, you are basically playing Russian roulette
>> with disk errors on any single disk in the array.
>
> If you do that, you will corrupt your filesystem, and not be able to
> reenable the array. Both images will be different, and, I guess, when the
> array is reenabled the newer copy will simply be written over the older
> copy, including the badblocks; ie, bad blocks in side 0 will be copied and
> marked bad to side 1, even if there are none there. And perhaps, good
> blocks on 0 will overwrite bad blocks on 1.

Happily, there is some magic somewhere in dmraid that makes this work OK. The
key is to boot from some other media so that the two disks in question are not
mounted or used by the booted OS. After fsck'ing the disk in question, mount
both disks, say under /mnt/a and /mnt/b. Then, if 'b' was the drive fsck'ed, do a
cp -a /mnt/a/* /mnt/b. Then re-enable the array in the bios on the next boot and
all is well.
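
The sequence above can be sketched as a dry run that only prints the commands and touches no disk. The partition names /dev/sda1 and /dev/sdb1 and the function name plan_resync are illustrative placeholders, not taken from the setup described here. The sketch assumes it would only ever be acted on from rescue media with the array disabled.

```shell
# Print (do not run) the offline check-and-copy steps.
# $1 = member assumed to hold the good copy, $2 = member that gets the fsck.
# Both defaults are hypothetical device names.
plan_resync() {
    good=${1:-/dev/sda1}
    checked=${2:-/dev/sdb1}
    echo "e2fsck -fcy $checked"      # forced check with read-only bad-block scan
    echo "mkdir -p /mnt/a /mnt/b"
    echo "mount $good /mnt/a"
    echo "mount $checked /mnt/b"
    echo "cp -a /mnt/a/* /mnt/b"     # same copy as in the text; the glob skips top-level dotfiles
    echo "umount /mnt/a /mnt/b"
}

plan_resync
```

Calling plan_resync with two arguments prints the same plan for other partitions, which makes it easy to review the commands before typing any of them by hand.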

As for the 'bad block' terminology, I may be off. What I'm talking about, the
thing I have to fsck to cure and that isn't being handled under dmraid, is the
offline uncorrectable errors in the SMART data:

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 207
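
To watch that counter over time, one option is to pull the raw value (the last column) for attribute 198 out of the attribute table. A minimal sketch follows; the helper name raw_offline_uncorrectable is made up for illustration, and it assumes input in the same column layout as the line above, for example from `smartctl -A /dev/sda` (the device name is a placeholder).

```shell
# Read SMART attribute lines on stdin and print the raw value
# (last field) of attribute ID 198, Offline_Uncorrectable.
raw_offline_uncorrectable() {
    awk '$1 == 198 { print $NF }'
}

# Sample input: the attribute line quoted above, on a single line.
printf '%s\n' '198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 207' \
    | raw_offline_uncorrectable
# prints 207
```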

I had been adding the bad block scan to fsck, perhaps wrongly, in order to force
a surface scan. The error above was the one that prompted the decision to fsck
the disk with the bad block check enabled. Whatever errors fsck finds and fixes
with the -fcy option are the errors that are NOT being fixed when the disks are
configured in a dmraid array. That was the genesis of the post and of my
concern about the robustness of dmraid over time.

The question is still pending though. How are these errors (whatever they are)
handled by suse when a disk is part of a dmraid array? You cannot fsck the disk
while it is part of the array because fsck exits with the error that the
disk is 'under the exclusive control of another process' and refuses to test the
disk. My experience here is that the disk is FINE and just needs a simple fsck,
and so far that has been true in two out of three dmraid arrays I have had
errors on during the past year (in one case, the disk was cratering).

What I want to know is: "Is there any individual disk level error correction
performed on disks in dmraid arrays, or is it as it looks -- no fsck error
correction is ever performed on individual disks in a dmraid array?"

The ultimate question is: should I ditch dmraid entirely and go with mdraid?
dmraid has been absolutely bullet-proof, as have my mdraid installs. But, if
mdraid can handle single disk error correction within an array where dmraid
can't, that would be enough justification for me to dump dmraid in favor of
mdraid.

dmraid and mdraid are incredibly similar in the flexibility they offer as a raid
solution. In either case you have mirrored copies where, upon failure or just
for kicks, you can rip one disk out of the array and boot and run on a single
disk, or put it in another box, without any concern about raid incompatibility.
Both just use normal partitions and disks; the primary difference is in how
they are joined, either through a bios function or through a software function.
I have liked dmraid because I can easily mirror /, /boot, /home and swap on each
disk so that, in the event of disk failure, the most work one disk can ever need
to act as a standalone is to reinstall grub in the mbr if the boot loader code
just happened to be on the other disk. (small potatoes)

But, if there is a fundamental difference in the single disk maintenance/error
correction area, that would make a big difference in the robustness of one
versus the other. I just don't know the answer to that question, and after
extensive googling it doesn't seem like anyone else does either (or they just
haven't discussed it).

Thus my post -- Anybody know the answer?? Anders, are you a raid guy too??

(P.S. -- you know it is a good problem when the question is hard to frame :-)


--
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com