On 01/23/2010 01:58 PM, Carlos E. R. wrote:
I can't believe that is the way dmraid works. In the past, I have actually preferred it to software raid because I can rebuild a new disk following a drive failure and remake the array before I ever have to boot the operating system. But, if you can never e2fsck -fcy /dev/sda, sdb, etc. without first disabling the array in the bios, you are basically playing Russian roulette with disk errors on any single disk in the array.

If you do that, you will corrupt your filesystem, and not be able to re-enable the array. Both images will be different, and, I guess, when the array is re-enabled the newer copy will simply be overwritten by the older copy, including the bad blocks; i.e., bad blocks on side 0 will be copied and marked bad on side 1, even if there are none there. And perhaps good blocks on 0 will overwrite bad blocks on 1.
Happily, there is some magic somewhere in dmraid that makes this work OK. The key is to boot from some other media so that the two disks in question are not mounted or used by the booted OS. After fsck'ing the disk in question, mount both disks, say under /mnt/a and /mnt/b. Then, if 'b' was the drive fsck'ed, do a cp -a /mnt/a/* /mnt/b. Then re-enable the array in the bios on the next boot and all is well.

As for the 'bad block' terminology, I may be off. What I'm talking about having to fsck to cure, and what isn't being cured in dmraid, is the offline uncorrectable errors:

    198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 207

I had been adding the bad block scan to fsck, perhaps wrongly so, but to force a surface scan. The error above was the error that prompted the response to fsck the disk with the bad block check enabled. Whatever errors fsck finds and fixes with the -fcy option are the errors that are NOT being fixed when the disks are configured in a dmraid array. That was the genesis of the post and of the concern about the robustness of dmraid over time.

The question is still pending though. How are these errors (whatever they are) handled by suse when a disk is part of a dmraid array? You cannot fsck the disk when it is part of the array, because fsck exits with the error that the disk is 'under the exclusive control of another process' and refuses to test the disk. My experience here is that the disk is FINE and just needs a simple fsck, and so far that has been true in two out of the three dmraid arrays I have had errors on during the past year (in the third case, the disk was cratering).

What I want to know is: "Is there any individual disk level error correction performed on disks in dmraid arrays, or is it as it looks -- no fsck error correction is ever performed on individual disks in a dmraid array?"

The ultimate question is whether I should ditch dmraid entirely and go with mdraid. dmraid has been absolutely bullet-proof, as have my mdraid installs.
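For anyone wanting to track the number in that SMART row over time, here is a minimal sketch of pulling it out with awk. It assumes the usual ten-column layout of the attribute table (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value), which is what the row quoted above looks like; the function name smart_raw_value is just something made up for this example.

```shell
# Print the raw value (10th column) of one named SMART attribute.
# Attribute rows are read from stdin; $1 is the attribute name.
# smart_raw_value is a hypothetical helper name, not part of smartmontools.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# The row quoted in this message, used here as sample input.
line='198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 207'
printf '%s\n' "$line" | smart_raw_value Offline_Uncorrectable

# On a live system the rows would come from smartctl instead (needs root;
# /dev/sda is only an example device name):
#   smartctl -A /dev/sda | smart_raw_value Offline_Uncorrectable
```

For the sample row this prints the last field, which as far as I understand it is the drive's own count of sectors it could not read during its offline scan. If that count keeps climbing between checks, that would point at the disk rather than the filesystem.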
If mdraid can handle single disk error correction within an array where dmraid can't, that alone would be enough justification for me to dump dmraid in favor of mdraid. dmraid and mdraid are incredibly similar in the flexibility they offer as a raid solution. In either case you have mirrored copies where, upon failure, or just for kicks, you can rip one disk out of the array and boot and run on a single disk, or put it in another box without any concern about raid incompatibility. Both just use normal partitions and disks; the only primary difference is in how they are joined, either through a bios function or through a software function.

I have liked dmraid because I can easily mirror /, /boot, /home and swap on each disk so that, in the event of disk failure, the most work one disk can ever need to act as a standalone is to reinstall grub in the mbr if the boot loader code just happened to be on the other disk (small potatoes). But, if there is a fundamental difference in the single disk maintenance/error correction area, that would make a big difference in the robustness of one versus the other.

I just don't know the answer to that question, and after extensive googling it doesn't seem like anyone else does either (or they just haven't discussed it). Thus my post -- Anybody know the answer?? Anders, are you a raid guy too??

(P.S. -- you know it is a good problem when the question is hard to frame :-)

--
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com