[opensuse] RAID failure - md1: bitmap initialization failed: -5 (how to recover?)
Guys,

I have an older openSuSE box (11.0 i586) that serves as a back-office fax server (hylafax/avantfax) and it no longer boots due to an mdraid failure. At first I thought it would be a simple hardware drive failure where I could simply fail the drive, remove it from the array, replace it, and then rebuild, but that is not the case. I have poked around a bit, and it looks like the disk hardware is fine, but for some reason the bitmap file and superblock for the array cannot be found. I have copied the boot messages from the screen; they are shown below:

(booting fallback image)
md: raid0 personality registered for level 0
xor: measuring software checksum speed
<snip>
xor: using function: p5_max (2935.000 MB/sec)
async_tx: api initialized (sync-only)
<snip>
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: md1 stopped.
md: bind (sdb5)
md: bind (sda5)
md: md1 array is not clean -- starting background reconstruction
raid1: raid set md1 active with 2 out of 2 mirrors
md1: bitmap file is out of date (148 < 149) -- forcing full recovery
md1: bitmap file is out of date, doing full recovery
md1: bitmap initialization failed: -5
md1: failed to create bitmap (-5)
md: pers->run() failed ...
mdadm: failed to RUN_ARRAY /dev/md1: Input/Output error
mdadm: device /dev/md1 already active - cannot assemble it
Waiting for device /dev/md1 to appear: ok
/dev/md1: unknown volume type
invalid root filesystem -- exiting to /bin/sh
$

The output from the normal boot image is the same, aside from it also trying resume=/dev/sdb5, which fails.

What is the best approach to correcting the mdraid problem? Or, ignoring the raid issue, how do I recover the partitions from the disks? Attempting a normal mount (ro) from the recovery console fails due to an unknown filesystem type (filesystem_raid -- I didn't catch the exact message). But that makes sense; all the partitions are of type 'fd' (Linux raid autodetect). cat /proc/mdstat shows both disks used in the md1 array, but it doesn't know what personality it is. Manually attempting to mount /dev/md1 complains about an unknown filesystem and being unable to find superblock information. That just sounds bad...

I don't know for certain, but I assume the reboot that produced the error was caused by a power outage where power was on and off several times, long enough to exhaust the UPS. But that has happened numerous times before and there has never been an issue on reboot.

The big question is how to approach data recovery? Favorite links or tools? I have a write-up or two squirreled away somewhere from years past about dd/dd_rescue for copying the partitions, but I haven't done this on an md array before. Any help or suggestions will be much appreciated.

--
David C. Rankin, J.D.,P.E.
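Before touching anything, the state of the member superblocks can be inspected read-only from the rescue shell. A minimal sketch, using the device names from the boot messages above (nothing below writes to the disks):

# what the kernel auto-assembled so far
cat /proc/mdstat

# per-member superblock: event counters, update times, bitmap info
mdadm --examine /dev/sda5 /dev/sdb5

# array-level view, if the md device node exists
mdadm --detail /dev/md1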
On 12/06/2013 02:23 PM, David C. Rankin wrote:
The big question is how to approach data recovery? Favorite links or tools? I have a write-up or two squirreled away somewhere from years past about dd/dd_rescue for copying the partitions, but I haven't done this on a md array before. Any help, or suggestions will be much appreciated.
It appears I'm not alone with this type of raid failure:

http://forums.opensuse.org/english/get-technical-help-here/install-boot-logi...

Apparently, for reasons unknown, temporary errors writing data to one drive can cause the drives to fall out of sync. In that case, mdraid cannot re-sync or rebuild because it doesn't know which drive holds the complete data. The approach seems to be the same as for a failed drive: try to determine which is the good drive with mdadm (-E|-D), pick one, fail it, remove it, try booting on the remaining good drive, then zero-superblock the removed drive, re-add it to the array, and it should re-sync.

If anyone has any other thoughts on the matter, let me know. Thanks.

--
David C. Rankin, J.D.,P.E.
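A rough sketch of that sequence for a two-disk RAID1, using the device names from the original post. Which member is the stale one still has to be decided first, so treat this as the shape of the commands rather than a recipe (the "sda5 is stale" choice below is only an assumption for illustration):

# compare the members first -- the higher Events count is normally the fresher copy
mdadm --examine /dev/sda5
mdadm --examine /dev/sdb5

# assuming sda5 turns out to be the stale member:
mdadm /dev/md1 --fail /dev/sda5
mdadm /dev/md1 --remove /dev/sda5

# after confirming the array runs degraded on sdb5 alone,
# wipe the stale metadata and re-add the member
mdadm --zero-superblock /dev/sda5
mdadm /dev/md1 --add /dev/sda5

# watch the re-sync
cat /proc/mdstat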
Dang, that's an awfully old thread that you linked. I've lost half of a mirrored array and simply failed the bad drive and proceeded as you indicated, but in that case I knew exactly which drive was bad. I never did a dd, just reinitialized the bad one, put it in as a spare, and got the hell out of mdadm's way while it rebuilt the array. It took quite a while. Most of the details are lost to the fog of time.

"David C. Rankin" <drankinatty@suddenlinkmail.com> wrote:
On 12/06/2013 02:23 PM, David C. Rankin wrote:
The big question is how to approach data recovery? Favorite links or tools? I have a write-up or two squirreled away somewhere from years past about dd/dd_rescue for copying the partitions, but I haven't done this on a md array before. Any help, or suggestions will be much appreciated.
It appears I'm not alone with this type of raid failure:
http://forums.opensuse.org/english/get-technical-help-here/install-boot-logi...
Apparently, for reasons unknown, temporary errors in writing data to one drive will cause the drives to fall out of sync. In this case, mdraid cannot re-sync or rebuild because it doesn't know which is the complete drive. The approach seems to be the same as a failed drive. Try and determine which is the good drive with mdadm (-E|-D), pick one, fail it, remove it, try booting on the remaining good drive, then zero-superblock on the removed drive, re-add the drive to the array and it should re-sync.
If anyone has any other thoughts on the matter, let me know. Thanks.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On 07/12/2013 02:33, David C. Rankin wrote:
If anyone has any other thoughts on the matter, let me know. Thanks.
I don't know if it is directly related, but...

I was told (with references) that most RAID systems do not manage data properly when *several* disks are *partly* failing at the same time, and this is quite often the case.

It's very common for hard drives to fail silently because some sectors become damaged but are not read at the moment, so the defect is not seen by the RAID system. In that situation, the system stops. The reason is that only the sectors actually read are checked.

In my opinion, and for my use (mostly archives), this makes RAID of little use. So I suppose one has to monitor the SMART data accurately on every hard disk in the RAID array to prevent the problem, but I can't certify this.

I know people who use different makes/models of HDD for the array to try to limit the risk of simultaneous failure.

jdd
--
http://www.dodin.org
On 12/07/2013 12:40 AM, jdd wrote:
I don't know if it is directly related, but...
I was told (with references) that most RAID systems do not manage data properly when *several* disks are *partly* failing at the same time, and this is quite often the case.
It's very common for hard drives to fail silently because some sectors become damaged but are not read at the moment, so the defect is not seen by the RAID system.
In that situation, the system stops.
The reason is that only the sectors actually read are checked.
In my opinion, and for my use (mostly archives), this makes RAID of little use.
So I suppose one has to monitor the SMART data accurately on every hard disk in the RAID array to prevent the problem, but I can't certify this.
I know people who use different makes/models of HDD for the array to try to limit the risk of simultaneous failure.
Interesting and valid observations. I've got some experience with 3Ware, and now LSI, hardware RAID controllers. I've got a requirement to record lots of data in the field; the disks are then removed and returned to the depot for reading and processing. I use 24-bay hot-swap chassis configured as two 11-disk RAID-6 arrays with two global hot spares. The OS lives on two internal disks configured as RAID-1 using a second hardware RAID controller.

The RAID controllers will initiate "Patrol Reads" all on their own to look for hidden sector-read issues. These patrols are done at low priority and don't interfere with read/write performance. The arrays are also "verified" on a daily basis, also in the background.

But I did have one issue where the depot chassis (an older box) wasn't able to correctly sync with newer 6-Gb SATA drives. The data were written correctly, but in the second chassis it appeared to the controller as if random drives were failing. I removed the disks and re-installed them into a chassis that was known to work. The second controller identified that the array was damaged, but then proceeded to accurately recover it. It took about 12 hours, but it worked! I still don't know "why" it worked, but it sure did save my bacon.

Another observation: don't use RAID-5, which can operate with only one failed disk. The most stressful time for an array is when a failed disk is replaced and the array is rebuilt. With RAID-5, if you suffer a second disk failure during the rebuild, your data is toast. With RAID-6, a rebuild can experience a second disk failure and still recover the data. Sure, RAID-6 has greater storage overhead, but if your data is important, disks are cheap these days.

BTW, I've measured 1.6 GB/sec of continuous write bandwidth to one of these 11-disk RAID-6 arrays from a single-threaded process.

Regards,
Lew
On 07/12/2013 16:45, Lew Wolfgang wrote:
hot-swap chassis configured as two 11-disk RAID-6 arrays with two global hot spares.
good
The RAID controllers will initiate "Patrol Reads" all on their own to look for hidden sector-read issues.
that is.

11 1TB HDDs - a dream :-)

Very nice solution for a high-availability system :-) I don't have such a requirement, and I only make redundant archives, but I see too many people relying on two-disk NAS boxes pretending to make RAID.

backup/archive/availability is (are?) an interesting topic, but a never-ending discussion :-)

thanks
jdd
--
http://www.dodin.org
On 12/07/2013 11:52 AM, jdd wrote:
On 07/12/2013 16:45, Lew Wolfgang wrote:
hot-swap chassis configured as two 11-disk RAID-6 arrays with two global hot spares.
good
The RAID controllers will initiate "Patrol Reads" all on their own to look for hidden sector-read issues.
that is.
11 1TB HDDs - a dream :-)
very nice solution for High availability system :-)
I don't have such requirement, and I only make redundant archives
but I see too many people relying on two-disk NAS boxes pretending to make RAID.
backup/archive/availability is (are?) an interesting topic, but a never-ending discussion :-)
thanks jdd
OK jdd, Lew, JA, All (sorry jdd -- you get 2 copies)

I need your help. I need to make sure I don't screw anything up attempting to remedy the situation. I have booted the box with the 11.0 DVD and entered the Recovery Console. I have 3 mdraid partitions on this box:

/dev/md0  sda1/sdb1  /boot
/dev/md1  sda5/sdb5  /
/dev/md2  sda7/sdb7  /home

After booting the 11.0 install DVD into the Recovery Console, mdraid found and assembled all arrays. md0 and md2 are fine; it is just md1 that is the problem:

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md1 : inactive sda5[0] sdb5[1]
      41945504 blocks super 1.0

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

I don't think there is any way I can guess which of sda5 or sdb5 has the latest bitmap. I really don't think it matters, but I thought I would check. I just need to know the best way to go about fixing it so I do not lose the data.

So, is there a way I can use --fail and --remove, then test which disk to keep as the good one before doing --zero-superblock on the other, and then --add to recreate the array and force a rebuild? Or is this just a case where you have to stop the array, recreate it using one of the drives plus 'missing', and then re-add the other drive after the array is up and running? I believe you have to use --force to get it to run with a missing device?

The on-disk array information for both disks (sda5/sdb5) shows the exact same Update Time (Tue Nov 19 15:28:38 2013). The only differences in the output are the checksums (both shown correct) and the Events: 148/149, hence the reported error:

md1: bitmap file is out of date (148 < 149)

I have the complete output of:

# mdadm --examine /dev/sd[ab]5

here: (1.7 Meg) http://www.3111skyline.com/dl/screenshots/suse/mdadm-examine.jpg

What is the best way to do this? It seems simple, but I would rather look before I leap here. If there is a better way to attempt to get these partitions to re-sync, I am more than happy to give it a try. Thanks.

--
David C. Rankin, J.D.,P.E.
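As a read-only sanity check before any of the above, the two members can be compared directly; a small sketch (nothing here modifies the disks):

mdadm --examine /dev/sda5 | grep -E 'Events|Update Time|State'
mdadm --examine /dev/sdb5 | grep -E 'Events|Update Time|State'

# the member with the higher Events count (sdb5 at 149 here)
# is the one md considers most current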
On 12/08/2013 02:17 AM, David C. Rankin wrote:
What is the best way to do this? It seems simple, but I would rather look before I leap here. If there is a better way to attempt to get these partitions to re-sync, I am more than happy to give it a try. Thanks.
I have worked through https://raid.wiki.kernel.org/index.php/RAID_Recovery to the point of recreating the array. The wiki says to stop and ask them before moving on to re-create, which is where most data is lost. So I have posted the question to the linux-raid list at kernel.org. I'll let you know what the answer is.

Just for completeness here, the mdadm --examine info for the md1 members (sda5/sdb5) is:

          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : e45cfbeb:77c2b93b:43d3d214:390d0f25
           Name : 1
  Creation Time : Thu Aug 21 06:43:22 2008
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 41945504 (20.00 GiB 21.48 GB)
     Array Size : 41945504 (20.00 GiB 21.48 GB)
   Super Offset : 41945632 sectors
          State : clean
    Device UUID : e8c1c580:db4d853e:6fac1c8f:fb5399d7
Internal Bitmap : -81 sectors from superblock
    Update Time : Tue Nov 19 15:28:38 2013
       Checksum : d37d1086 - correct
         Events : 148
     Array Slot : 0 (0,1)
    Array State : Uu

          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : e45cfbeb:77c2b93b:43d3d214:390d0f25
           Name : 1
  Creation Time : Thu Aug 21 06:43:22 2008
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 41945504 (20.00 GiB 21.48 GB)
     Array Size : 41945504 (20.00 GiB 21.48 GB)
   Super Offset : 41945632 sectors
          State : clean
    Device UUID : 6edfa3f8:c8c4316d:66c19315:5eda0911
Internal Bitmap : -81 sectors from superblock
    Update Time : Tue Nov 19 15:28:38 2013
       Checksum : 39ef40a5 - correct
         Events : 149
     Array Slot : 1 (0,1)
    Array State : uU

Attempting stop and then assemble with:

# mdadm --stop /dev/md1
# mdadm --assemble --force /dev/md1 /dev/sd[ab]5

The messages captured in the logs are:

Rescue Kernel: md: md1: stopped.
Rescue Kernel: md: unbind<sda5>
Rescue Kernel: md: export_rdev(sda5)
Rescue Kernel: md: unbind<sdb5>
Rescue Kernel: md: export_rdev(sdb5)
Rescue Kernel: md: md1: stopped.
Rescue Kernel: md: md1 raid array is not clean -- starting background reconstruction
Rescue Kernel: md: raid1: raid set md1 active with 2 out of 2 mirrors
Rescue Kernel: md1: bitmap file is out of date (148 < 149) -- forcing full recovery
Rescue Kernel: md1: bitmap file is out of date, doing full recovery
Rescue Kernel: md1: bitmap initialisation failed: -5
Rescue Kernel: md1: failed to create bitmap (-5)

Then on the command line I have:

mdadm: failed to RUN_ARRAY /dev/md1: Input/Output error

--
David C. Rankin, J.D.,P.E.
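One thing that may be worth keeping for later, assuming the array can eventually be assembled and run (which has not happened yet at this point in the thread): mdadm's grow mode can drop and recreate an internal bitmap, which removes the stale-bitmap problem altogether instead of forcing a full recovery. A sketch only:

# only valid on a running array
mdadm --grow /dev/md1 --bitmap=none       # remove the out-of-date internal bitmap
mdadm --grow /dev/md1 --bitmap=internal   # create a fresh one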
On 12/8/2013 10:04 AM, David C. Rankin wrote:
attempting stop and then assemble with:
# mdadm --stop /dev/md1
# mdadm --assemble --force /dev/md1 /dev/sd[ab]5
Maybe a step missing in there: http://www.dslreports.com/forum/r21036244-

--
_____________________________________
---This space for rent---
On 12/08/2013 07:31 PM, John Andersen wrote:
On 12/8/2013 10:04 AM, David C. Rankin wrote:
attempting stop and then assemble with:
# mdadm --stop /dev/md1
# mdadm --assemble --force /dev/md1 /dev/sd[ab]5
Maybe a step missing in there: http://www.dslreports.com/forum/r21036244-
JA,

It is amazing what you can do within the Rescue Console. The convenience of: plug in the network, echo "somename" > /etc/HOSTNAME, ifup eth0, dhcpcd eth0, vi /etc/ssh/sshd_config, PermitRootLogin yes, cd /root, mkdir .ssh, cd .ssh, rsync you@yourbox:~/.ssh/id_dsa.pub authorized_keys, rcsshd start, go to your box and finish the setup...

But I digress. Here is the current information on my mdraid saga with --verbose given.

Both /dev/md0 (sd[ab]1) and /dev/md2 (sd[ab]7) assemble and mount just fine:

nemtemp:~ # mount
<snip>
/dev/md0 on /mnt/boot type ext3 (rw)
/dev/md2 on /mnt/home type ext3 (rw)

nemtemp:~ # ll /mnt/boot
total 13247
-rw------- 1 root root 512 2008-08-21 07:54 backup_mbr
lrwxrwxrwx 1 root root   1 2008-08-21 06:49 boot -> .
<snip>

nemtemp:~ # ll /mnt/home
total 60
drwxr-xr-x  8 1010 users 4096 2013-07-16 21:30 assistance
drwxr-xr-x 20 1000 1051  4096 2011-03-16 22:34 backup
<snip>

Looking at /dev/md1 (/dev/sd[ab]5):

nemtemp:~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md1 : inactive sda5[0] sdb5[1]
      41945504 blocks super 1.0

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

unused devices: <none>

nemtemp:~ # mdadm --stop /dev/md1
mdadm: stopped /dev/md1

nemtemp:~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

unused devices: <none>

nemtemp:~ # mdadm --verbose --assemble --force /dev/md1 /dev/sd[ab]5
mdadm: looking for devices for /dev/md1
mdadm: /dev/sda5 is identified as a member of /dev/md1, slot 0.
mdadm: /dev/sdb5 is identified as a member of /dev/md1, slot 1.
mdadm: added /dev/sdb5 to /dev/md1 as 1
mdadm: added /dev/sda5 to /dev/md1 as 0
mdadm: failed to RUN_ARRAY /dev/md1: Input/output error

The log from the start attempt:

Dec 9 00:16:11 Rescue kernel: md: md1 stopped.
Dec 9 00:16:11 Rescue kernel: md: bind<sdb5>
Dec 9 00:16:11 Rescue kernel: md: bind<sda5>
Dec 9 00:16:11 Rescue kernel: md: md1: raid array is not clean -- starting background reconstruction
Dec 9 00:16:11 Rescue kernel: raid1: raid set md1 active with 2 out of 2 mirrors
Dec 9 00:16:11 Rescue kernel: md1: bitmap file is out of date (148 < 149) -- forcing full recovery
Dec 9 00:16:11 Rescue kernel: md1: bitmap file is out of date, doing full recovery
Dec 9 00:16:12 Rescue kernel: md1: bitmap initialisation failed: -5
Dec 9 00:16:12 Rescue kernel: md1: failed to create bitmap (-5)
Dec 9 00:16:12 Rescue kernel: md: pers->run() failed ...

nemtemp:~ # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md1 : inactive sda5[0] sdb5[1]
      41945504 blocks super 1.0

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

unused devices: <none>

I'm not sure how to proceed safely from here. Is there anything else I should try before attempting to --create the array again? If I do create the array with 1 drive and "missing", should I then use --add or --re-add to add the other drive? Also, since /dev/sda5 shows Events: 148 and /dev/sdb5 shows Events: 149, should I choose /dev/sdb5 as the one to preserve and let "missing" take the place of /dev/sda5?
If so, then does the following create statement look correct?

mdadm --create --verbose --level=1 --metadata=1.0 --raid-devices=2 \
      /dev/md1 /dev/sdb5 missing

Should I also use --force?

If attempting to assemble with "missing", and the create command gives problems because the unused device still has the minor number 1 for md1, is it better to --zero-superblock the device not included as "missing", or is it better to just unplug that drive, preserve its superblock data in case it is needed, and boot with a single disk spinning?

Sorry for all the questions, but I just want to make sure I don't do something to compromise the data. I have run md and dm raid for years, and have had to recover dmraid a couple of times, but I have never had this much trouble with an mdraid array. I found parts of man mdadm that I didn't know existed before :p

With the information for both drives looking good with --examine, the Update Time (Tue Nov 19 15:28:38 2013) being identical, and the Events being off by only 1, I can't see a reason the drives should not just assemble and run as they are. What say the experts?

Here is the --detail and --examine information for the drives for completeness:

nemtemp:~ # mdadm --detail /dev/md1
/dev/md1:
        Version : 01.00.03
  Creation Time : Thu Aug 21 06:43:22 2008
     Raid Level : raid1
  Used Dev Size : 20972752 (20.00 GiB 21.48 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Nov 19 15:28:38 2013
          State : active, Not Started
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : 1
           UUID : e45cfbeb:77c2b93b:43d3d214:390d0f25
         Events : 148

    Number   Major   Minor   RaidDevice State
       0       8        5        0      active sync   /dev/sda5
       1       8       21        1      active sync   /dev/sdb5

nemtemp:/ # mdadm -E /dev/sda5
/dev/sda5:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : e45cfbeb:77c2b93b:43d3d214:390d0f25
           Name : 1
  Creation Time : Thu Aug 21 06:43:22 2008
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 41945504 (20.00 GiB 21.48 GB)
     Array Size : 41945504 (20.00 GiB 21.48 GB)
   Super Offset : 41945632 sectors
          State : clean
    Device UUID : e0c1c580:db4d853e:6fac1c8f:fb5399d7
Internal Bitmap : -81 sectors from superblock
    Update Time : Tue Nov 19 15:28:38 2013
       Checksum : d37d1086 - correct
         Events : 148
     Array Slot : 0 (0, 1)
    Array State : Uu

nemtemp:/ # mdadm -E /dev/sdb5
/dev/sdb5:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : e45cfbeb:77c2b93b:43d3d214:390d0f25
           Name : 1
  Creation Time : Thu Aug 21 06:43:22 2008
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 41945504 (20.00 GiB 21.48 GB)
     Array Size : 41945504 (20.00 GiB 21.48 GB)
   Super Offset : 41945632 sectors
          State : active
    Device UUID : 6edfa3f8:c8c4316d:66c19315:5eda0911
Internal Bitmap : -81 sectors from superblock
    Update Time : Tue Nov 19 15:28:38 2013
       Checksum : 39ef40a5 - correct
         Events : 149
     Array Slot : 1 (0, 1)
    Array State : uU

--
David C. Rankin, J.D.,P.E.
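For completeness, a sketch of the last-resort path being asked about, assuming sdb5 is kept because of its higher event count. The metadata version, level and device order have to match the original exactly, and getting --create wrong can destroy the data, so this is only the shape of the commands, not a recommendation:

mdadm --stop /dev/md1
mdadm --create --verbose /dev/md1 --level=1 --metadata=1.0 \
      --raid-devices=2 /dev/sdb5 missing

# check the filesystem on the degraded array read-only before going further
fsck.ext3 -n /dev/md1

# then wipe the stale member and add it back so it re-syncs from sdb5
mdadm --zero-superblock /dev/sda5
mdadm /dev/md1 --add /dev/sda5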
On 12/08/2013 08:15 PM, David C. Rankin wrote:
I'm not sure how to proceed safely from here. Is there anything else I should try before attempting to --create the array again? If I do create the array with 1 drive and "missing", should I then use --add or --re-add to add the other drive? Also, since /dev/sda5 shows Events: 148 and /dev/sdb5 shows Events: 149, should I choose /dev/sdb5 as the one to preserve and let "missing" take the place of /dev/sda5? If so, then does the following create statement look correct:
mdadm --create --verbose --level=1 --metadata=1.0 --raid-devices=2 \
      /dev/md1 /dev/sdb5 missing
Should I also use --force?
Well, before taking drastic steps, I checked whether the partitions are mountable -- they are! Whoop!

nemtemp:/mnt # mdadm --verbose --assemble /dev/md1 /dev/sdb5
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdb5 is identified as a member of /dev/md1, slot 1.
mdadm: no uptodate device for slot 0 of /dev/md1
mdadm: added /dev/sdb5 to /dev/md1 as 1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).

nemtemp:/mnt # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md1 : inactive sdb5[1](S)
      20972752 blocks super 1.0

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

unused devices: <none>

nemtemp:/mnt # mdadm --run /dev/md1
mdadm: failed to run array /dev/md1: Input/output error

Hmm, this is just raid1, mirrored ext3, so mounting the members directly should work:

nemtemp:/mnt # mkdir sda
nemtemp:/mnt # mkdir sdb
nemtemp:/mnt # mdadm --stop /dev/md1
mdadm: stopped /dev/md1

nemtemp:/mnt # mount -o ro /dev/sdb5 /mnt/sdb/
mount: unknown filesystem type 'linux_raid_member'

nemtemp:/mnt # mount -t ext3 -o ro /dev/sdb5 /mnt/sdb/
nemtemp:/mnt # l sdb
total 116
drwxr-xr-x 21 root root 4096 2013-01-25 17:06 ./
drwxr-xr-x  7 root root  140 2013-12-08 06:38 ../
drwxr-xr-x  2 root root 4096 2010-12-05 06:43 bin/
drwxr-xr-x  2 root root 4096 2008-08-21 06:48 boot/
drwxr-xr-x  2 root root 4096 2008-08-22 01:54 data/
drwxr-xr-x  5 root root 4096 2008-08-21 06:48 dev/
<snip>

nemtemp:/mnt # mount -t ext3 -o ro /dev/sda5 /mnt/sda
nemtemp:/mnt # l sda
total 116
drwxr-xr-x 21 root root 4096 2013-01-25 17:06 ./
drwxr-xr-x  7 root root  140 2013-12-08 06:38 ../
drwxr-xr-x  2 root root 4096 2010-12-05 06:43 bin/
drwxr-xr-x  2 root root 4096 2008-08-21 06:48 boot/
drwxr-xr-x  2 root root 4096 2008-08-22 01:54 data/
drwxr-xr-x  5 root root 4096 2008-08-21 06:48 dev/
<snip>

nemtemp:/mnt # mount
<snip>
/dev/md0 on /mnt/boot type ext3 (rw)
/dev/md2 on /mnt/home type ext3 (rw)
/dev/sdb5 on /mnt/sdb type ext3 (ro)
/dev/sda5 on /mnt/sda type ext3 (ro)

Both drives are fine!! Why is mdadm having problems? Would a newer mdadm be worth a shot? I'd rather figure out why my version (2.6.4) isn't working, but I think I've pretty much tried everything up to the point of having to use the --create mode and risk data loss.

JA, all, anybody have any other suggestions?

--
David C. Rankin, J.D.,P.E.
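Since the members mount cleanly read-only, a sketch of pulling a copy of the data off before any further mdadm surgery (the destination host and path are placeholders only):

# copy the newer member (sdb5, Events 149) while it is mounted ro
rsync -aH /mnt/sdb/ backuphost:/backup/faxsrv-sdb5/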
On 12/8/2013 6:50 PM, David C. Rankin wrote:
Both drives are fine!! Why is mdadm having problems? Would a newer mdadm be worth a shot? I'd rather figure out why my version (2.6.4) isn't working, but I think I've pretty much tried everything up to the point of having to use the --create mode and risk data loss. JA, all, anybody have any other suggestions?
There seems to be a lot of talk that google dredged up about this problem being related to specific versions. So it might be ok to try a new version, as long as you can get it installed.

--
_____________________________________
---This space for rent---
On 12/08/2013 09:17 PM, John Andersen wrote:
On 12/8/2013 6:50 PM, David C. Rankin wrote:
Both drives are fine!! Why is mdadm having problems? Would a newer mdadm be worth a shot? I'd rather figure out why my version (2.6.4) isn't working, but I think I've pretty much tried everything up to the point of having to use the --create mode and risk data loss. JA, all, anybody have any other suggestions?
There seems to be a lot of talk that google dredged up about this problem being related to specific versions. So it might be ok to try a new version, as long as you can get it installed.
Yes, I saw that too, but it mostly had to do with the array outgrowing its space, so I didn't think that was it -- but it may be. I have an Arch install disk with mdadm 3.3.2. I should be able to boot that and see if it can assemble the array.

If it does, then what? If the assemble/sync succeeds, then the event count should be corrected and it should boot and assemble again under the 11.0 version -- right? If not, then what do we do? I can't just chroot into one of the / partitions, update mdadm, and then create a single-disk array -- can I? Then --add the other partition to the raid set and have everything sync?

I'll try the newer mdadm and see what happens. I've already rsync'ed the needed data off /dev/sda5 mounted -o ro under the Rescue Console, so at least the configs, mail, mysql tables and other data are safe.

Any other ideas, let me know. If not, I'll report back after the 3.3.2 test.

--
David C. Rankin, J.D.,P.E.
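If the newer-mdadm route is tried, the attempt from the live environment would look roughly like this (a sketch; a newer mdadm may auto-assemble the array under a different name such as /dev/md126 or /dev/md127):

# from the newer live environment
mdadm --assemble --scan
cat /proc/mdstat        # see what got assembled and whether a resync starts

# or explicitly, if nothing was auto-assembled:
mdadm --assemble --force /dev/md1 /dev/sda5 /dev/sdb5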
On 09/12/2013 04:52, David C. Rankin wrote:
Any other ideas, let me know. If not, I'll report back after the 3.3.2 test.
dd if=/dev/sdX of=/dev/null to see if some sectors of one disk are failing? Mounting a partition does not prove the data is good, and ddrescue is rather more stress (and is very long).

I don't remember if there is a system on these disks, but to mount the root system I use: http://dodin.info/wiki/index.php?n=Doc.AccesRootAvecOpensuseRescue

jdd
--
http://www.dodin.org
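A sketch of that read test (purely read-only; the block size is just a suggestion):

# read every sector of each member and discard the data;
# conv=noerror continues past read errors so they all get reported
dd if=/dev/sda5 of=/dev/null bs=1M conv=noerror
dd if=/dev/sdb5 of=/dev/null bs=1M conv=noerror

# afterwards, check the kernel log for I/O errors
dmesg | tail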
On Sun, 08 Dec 2013 20:50:13 -0600 David C. Rankin wrote:
On 12/08/2013 08:15 PM, David C. Rankin wrote:
I'm not sure how to proceed safely from here. Is there anything else I should try before attempting to --create the array again? If I do create the array with 1 drive and "missing", should I then use --add or --re-add to add the other drive? Also, since /dev/sda5 shows Events: 148 and /dev/sdb5 shows Events: 149, should I choose /dev/sdb5 as the one to preserve and let "missing" take the place of /dev/sda5? If so, then does the following create statement look correct:
mdadm --create --verbose --level=1 --metadata=1.0 --raid-devices=2 \
      /dev/md1 /dev/sdb5 missing
Should I also use --force?
Well, before taking drastic steps, I checked whether the partitions are mountable -- they are! Whoop!

nemtemp:/mnt # mdadm --verbose --assemble /dev/md1 /dev/sdb5
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdb5 is identified as a member of /dev/md1, slot 1.
mdadm: no uptodate device for slot 0 of /dev/md1
mdadm: added /dev/sdb5 to /dev/md1 as 1
mdadm: /dev/md1 assembled from 1 drive - need all 2 to start it (use --run to insist).

nemtemp:/mnt # cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda7[0] sdb7[1]
      221929772 blocks super 1.0 [2/2] [UU]
      bitmap: 0/424 pages [0KB], 256KB chunk

md1 : inactive sdb5[1](S)
      20972752 blocks super 1.0

md0 : active raid1 sda1[0] sdb1[1]
      104376 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 8KB chunk

unused devices: <none>

nemtemp:/mnt # mdadm --run /dev/md1
mdadm: failed to run array /dev/md1: Input/output error

Hmm, this is just raid1, mirrored ext3, so mounting should work:

nemtemp:/mnt # mkdir sda
nemtemp:/mnt # mkdir sdb
nemtemp:/mnt # mdadm --stop /dev/md1
mdadm: stopped /dev/md1

nemtemp:/mnt # mount -o ro /dev/sdb5 /mnt/sdb/
mount: unknown filesystem type 'linux_raid_member'

nemtemp:/mnt # mount -t ext3 -o ro /dev/sdb5 /mnt/sdb/
nemtemp:/mnt # l sdb
total 116
drwxr-xr-x 21 root root 4096 2013-01-25 17:06 ./
drwxr-xr-x  7 root root  140 2013-12-08 06:38 ../
drwxr-xr-x  2 root root 4096 2010-12-05 06:43 bin/
drwxr-xr-x  2 root root 4096 2008-08-21 06:48 boot/
drwxr-xr-x  2 root root 4096 2008-08-22 01:54 data/
drwxr-xr-x  5 root root 4096 2008-08-21 06:48 dev/
<snip>

nemtemp:/mnt # mount -t ext3 -o ro /dev/sda5 /mnt/sda
nemtemp:/mnt # l sda
total 116
drwxr-xr-x 21 root root 4096 2013-01-25 17:06 ./
drwxr-xr-x  7 root root  140 2013-12-08 06:38 ../
drwxr-xr-x  2 root root 4096 2010-12-05 06:43 bin/
drwxr-xr-x  2 root root 4096 2008-08-21 06:48 boot/
drwxr-xr-x  2 root root 4096 2008-08-22 01:54 data/
drwxr-xr-x  5 root root 4096 2008-08-21 06:48 dev/
<snip>

nemtemp:/mnt # mount
<snip>
/dev/md0 on /mnt/boot type ext3 (rw)
/dev/md2 on /mnt/home type ext3 (rw)
/dev/sdb5 on /mnt/sdb type ext3 (ro)
/dev/sda5 on /mnt/sda type ext3 (ro)

Both drives are fine!! Why is mdadm having problems? Would a newer mdadm be worth a shot? I'd rather figure out why my version (2.6.4) isn't working, but I think I've pretty much tried everything up to the point of having to use the --create mode and risk data loss. JA, all, anybody have any other suggestions?
Hi David,

Wouldn't the safest course of action be to clone these partitions and make backups of the others before proceeding? If the additional event on sdb5 was a read/write error causing the read-only bit to be set, that could explain mdadm being 'cranky' about mounting and using the partition. That would also explain you being able to mount the partition manually in read-only mode.

Along the same lines, are these drives SMART capable? Have you tried using smartctl to check on the status of sdb? Of particular interest would be the number of reallocated sectors and whether or not that number is perceptibly climbing -- in which case drive failure could be imminent.

I wish you the best of luck & regards,

Carl
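A sketch of that SMART check (attribute names vary somewhat by vendor, so the grep pattern is only illustrative):

# overall health verdict plus the full attribute table
smartctl -H -A /dev/sdb

# the attributes most worth watching on a suspect member
smartctl -A /dev/sdb | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'

# optionally run a long self-test and check the result later
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb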
On 12/08/2013 09:47 PM, Carl Hartung wrote:
Hi David,
Wouldn't the safest course of action be to clone these partitions and make backups of the others before proceeding? If the additional event on sdb5 was a read/write error causing the read only bit to be set, that could explain mdadm being 'cranky' about mounting and using the partition. That would also explain you being able to mount the partition manually in read only mode.
Along the same lines, are these drives SMART capable? Have you tried using smartctl to check on the status of sdb? Of particular interest would be the number of reallocated sectors and whether or not that number is perceptibly climbing -- in which case drive failure could be imminent.
I wish you the best of luck & regards,
Carl
Thanks Carl,

Yes, I am going to copy the partition as a backup. I was able to rsync all the important data off the drives when I mounted them individually in the Recovery Console. Given that this is 11.0, the data is all that is important -- the configs, /var/spool, mysql tables, and web data. That is all safe.

Now it is more a question of "what happened?" JA and the folks on the mdraid list think it is probably the old version of mdadm not handling the error right. We shall see. mdraid really isn't magic. It either works or it doesn't; if it doesn't, then there is either a glaring reason visible with --examine or --detail, or it is just a bug in mdadm. Most everything is handled automatically after you issue --assemble, so there are not a lot of user screw-ups you can make :-)

(that's a good thing)

--
David C. Rankin, J.D.,P.E.
On 12/08/2013 10:35 PM, David C. Rankin wrote:
Now it is more a question of "what happened?" JA and the folks on the mdraid list think it is probably the old version of mdadm not handling the error right.
We shall see. mdraid really isn't magic. It either works or it doesn't, if it doesn't, then there is either a glaring reason seen using --examine or --detail or it is just a bug with mdadm. Most everything is handled automatically after you issue --assemble, there are not a lot of user screw-ups you can make :-)
(that's a good thing)
After two days pulling my hair out, the answer was as simple as popping in the Arch install CD and rebooting!! There is nothing special about which disk you use, you just have to use one with a newer version of mdadm. The 11.0 Rescue disk had mdadm 2.6.4; the Arch disk had 3.3.2.

After it booted, I checked cat /proc/mdstat, and /dev/md1 (md126 under mdadm 3.3.2) showed the array with status "resync pending". Perfect. All I had to do was:

# mkdir /mnt/mdmnt
# mount /dev/md126 /mnt/mdmnt

BINGO! The drive mounted flawlessly, and a quick check of cat /proc/mdstat showed the array re-syncing just fine.

That was a whole lot of work to find out that it was an old mdadm causing all the problems -- live and learn... That is also a good lesson for anyone running an older version of mdadm: if anything happens to cause a problem, the FIRST thing to do is get the latest version of mdadm and boot with that!

Thanks to all that helped. This problem is SOLVED!

--
David C. Rankin, J.D.,P.E.
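For anyone following along, a couple of read-only ways to watch the re-sync finish (a sketch; md126 is simply the name the newer mdadm assigned here):

# live progress of the rebuild
watch -n 5 cat /proc/mdstat

# the array's own view, including the resync status
mdadm --detail /dev/md126

# once it is clean, print an ARRAY line suitable for /etc/mdadm.conf
mdadm --detail --scan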
On 09/12/2013 06:36, David C. Rankin wrote:
Thanks to all that helped. This problem is SOLVED!
great :-)

jdd
--
http://www.dodin.org
participants (5): Carl Hartung, David C. Rankin, jdd, John Andersen, Lew Wolfgang