[Bug 752869] New: md raid1 doesn't boot after removing 1 disk, when the server is turned off
https://bugzilla.novell.com/show_bug.cgi?id=752869#c0

           Summary: md raid1 doesn't boot after removing 1 disk, when the server is turned off
    Classification: openSUSE
           Product: openSUSE 12.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 12.1
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Other
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: wvvelzen@gmail.com
         QAContact: qa-bugs@suse.de

Created an attachment (id=482002)
 --> (http://bugzilla.novell.com/attachment.cgi?id=482002)
State after booting with 1 raid disk removed

I was testing an md raid1 setup for different failure scenarios, and came across the following problem. Given the following raid setup:

# grep '/dev/md' /etc/fstab
/dev/md0  swap   swap  defaults                                 0 0
/dev/md2  /      ext4  noatime,data=writeback,noacl,user_xattr  1 1
/dev/md1  /boot  ext4  noatime,data=writeback,noacl,user_xattr  1 2
/dev/md3  /home  ext4  noatime,data=writeback,noacl,user_xattr  1 2

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4]
md3 : active raid1 sdb5[2] sda5[0]
      248460152 blocks super 1.0 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[0]
      522228 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sdb1[2] sda1[0]
      2095092 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid1 sdb3[2] sda3[0]
      41946040 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

When I turn off the server and remove one disk (the second one in this case), the server doesn't boot properly and ends up in the emergency console. I will attach a screen photograph of this situation. Switching between systemd and SysV init with F5 on the boot screen doesn't make a difference. Starting in 'failsafe' mode doesn't help either.
Adding an empty disk, or a pre-partitioned disk, in place of the removed disk doesn't help either. On previous versions of openSUSE (10.3) this test case worked just fine.

However, when I hot-remove one disk of the raid1 array on a running server, so that md knows the array is degraded and can save this state to the md superblock, I can reboot just fine without any problems.

Reproducible: Always

Expected Results:
The raid1 in this state should have just booted the OS. This is the purpose of having a raid1 in the first place!

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=752869#c1
--- Comment #1 from kk zhang
https://bugzilla.novell.com/show_bug.cgi?id=752869#c
Petr Uzel
https://bugzilla.novell.com/show_bug.cgi?id=752869#c2
--- Comment #2 from Wilfred van Velzen
mdadm --stop /dev/md0
mdadm --stop /dev/md2

mdadm --assemble /dev/md0 /dev/sda1 --run
mdadm --add /dev/md0 /dev/sdb1
[repeated for all the arrays]

And wait for the synchronization to finish before rebooting, checking this with:

cat /proc/mdstat
So it isn't an unrecoverable state, but still rather inconvenient. Especially if the machine is at a co-location and you have to drive for X amount of time to get to the location, find out what happened, and fix it...
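The check above can be scripted. A minimal sketch (my own, not from the bug) that flags degraded arrays in mdstat-formatted output; a sample string stands in for the real /proc/mdstat so it runs without a live md setup:

```shell
#!/bin/sh
# Sketch: detect degraded raid1 arrays in mdstat-style output.
# The sample below is illustrative; on a real system you would read
# /proc/mdstat instead.
mdstat_sample='md2 : active raid1 sda3[0]
      41946040 blocks super 1.0 [2/1] [U_]
md1 : active raid1 sda2[0] sdb2[2]
      522228 blocks super 1.0 [2/2] [UU]'

degraded=$(printf '%s\n' "$mdstat_sample" | awk '
    /^md/ { array = $1 }            # remember which array this stanza is for
    /\[U*_+U*\]/ { print array }    # a "_" in the status means a missing member
')
echo "degraded arrays: $degraded"   # prints: degraded arrays: md2
```

Replacing the sample with `$(cat /proc/mdstat)` turns this into a quick post-resync sanity check.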
https://bugzilla.novell.com/show_bug.cgi?id=752869#c3
--- Comment #3 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c4
--- Comment #4 from Wilfred van Velzen
It sounds like you don't have an mdadm.conf in your initrd. Did you convert this from non-RAID to RAID without rerunning mkinitrd ??
The raid setup was created during installation of openSUSE...
Could you:

cd /tmp
zcat /boot/initrd | cpio -idv
and see if "/tmp/etc/mdadm.conf" gets created? If it does, please attach it.
It is:

AUTO -all
ARRAY /dev/md2 metadata=1.0 name=linux:2 UUID=bf722ec4:89570cf1:4dd162d9:8c34c87b
ARRAY /dev/md0 metadata=1.0 name=linux:0 UUID=fd8ef37a:564a9254:72876573:b839300c

It's different from the one in /etc though:

DEVICE containers partitions
ARRAY /dev/md0 UUID=fd8ef37a:564a9254:72876573:b839300c
ARRAY /dev/md1 UUID=ada7fa3a:9242009a:cd9f63b7:a3d1b3d3
ARRAY /dev/md2 UUID=bf722ec4:89570cf1:4dd162d9:8c34c87b
ARRAY /dev/md3 UUID=a938e1fc:3a39681c:8f8d2c1b:789e6370
If it doesn't, please run mkinitrd, then run the above commands again and see if an mdadm.conf is there.
I did this anyway, but the contents of /tmp/etc/mdadm.conf remain the same.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c5
--- Comment #5 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c6
--- Comment #6 from Wilfred van Velzen
Thanks.
I think /lib/udev/rules.d/64-md-raid.rules in the initrd might be causing the problem.
This file doesn't exist on the system:

nnn:/etc/udev/rules.d # ls -l
total 252
-rw-r--r-- 1 root root    571 Oct 29 19:37 51-lirc.rules
-rw-r--r-- 1 root root  13890 Oct 30 07:23 55-hpmud.rules
-rw-r--r-- 1 root root 161732 Mar 22 18:11 55-libsane.rules
-rw-r--r-- 1 root root    692 Oct 30 07:23 56-hpmud_support.rules
-rw-r--r-- 1 root root  46551 Mar 22 18:11 56-sane-backends-autoconfig.rules
-rw-r--r-- 1 root root   2016 Oct 29 20:13 70-kpartx.rules
-rw-r--r-- 1 root root    532 Dec 15 15:15 70-persistent-cd.rules
-rw-r--r-- 1 root root    948 Mar 28 11:09 70-persistent-net.rules
-rw-r--r-- 1 root root    182 Oct 29 20:13 71-multipath.rules
-rw-r--r-- 1 root root     93 Oct 29 20:09 99-iwlwifi-led.rules
To confirm this I'd like you to create an initrd without this file and see what happens.
I think the easiest might be to edit /lib/mkinitrd/scripts/setup-udev.sh (after taking a copy just to be safe) and remove the line:
64-md-raid.rules \
and then run mkinitrd again.
Could you try that and let me know the result?
So this file isn't included by mkinitrd anyway...?

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c7
--- Comment #7 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c8
--- Comment #8 from Wilfred van Velzen
Look in /lib/udev/rules.d, not /etc/udev/rules.d
Ah, sorry, I should have looked more carefully...
(confusing, isn't it... don't look in /run/udev/rules.d whatever you do :-)
It's empty?

Anyway... I edited /lib/mkinitrd/scripts/setup-udev.sh and ran mkinitrd, and checked with 'zcat /boot/initrd | cpio -idv' what was in /tmp/etc/mdadm.conf. It still contains just the two ARRAY lines for /dev/md2 and /dev/md0, as shown before.

Regardless, I tried the cold-remove-one-raid-disk test. But I still get to the emergency console, like before, when I start the server in this state. After this, when I reboot with the two disks, it boots but the array is in degraded mode:

:/proc # cat mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sda2[0] sdb2[2]
      522228 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sda5[0] sdb5[2]
      248460152 blocks super 1.0 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

md2 : active raid1 sda3[0]
      41946040 blocks super 1.0 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active (auto-read-only) raid1 sda1[0] sdb1[2]
      2095092 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

I don't think this happened before, but this is easily fixed with some mdadm commands...

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c9
--- Comment #9 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c10
--- Comment #10 from Wilfred van Velzen
And you checked that "64-md-raid.rules" wasn't in the initrd in any directory?
Yep.
Very strange.
Would it be possible for me to get a copy of your initrd? It might be too big to attach, in which case you need to find somewhere to upload it that I could grab it from.
http://www.sercom.nl/tmp/initrd.zip
https://bugzilla.novell.com/show_bug.cgi?id=752869#c11
--- Comment #11 from Wilfred van Velzen
https://bugzilla.novell.com/show_bug.cgi?id=752869#c12
--- Comment #12 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c13
--- Comment #13 from Wilfred van Velzen
https://bugzilla.novell.com/show_bug.cgi?id=752869#c14
--- Comment #14 from Wilfred van Velzen
Could you try again with one disk missing and get another photo of the emergency console screen?
I did a 'zypper up', rebooted, and made sure the state of the array was OK. It was:

/proc # cat mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid1 sda5[0] sdb5[2]
      248460152 blocks super 1.0 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[2]
      522228 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[0] sdb3[2]
      41946040 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active (auto-read-only) raid1 sda1[0] sdb1[2]
      2095092 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

(Except for the auto-read-only state of the raid that contains the swap partition. But I think this is a different issue.)

After this I did the test with 1 disk removed. This got me to the emergency mode again (see the new attachment). After adding the disk and rebooting, the state of the array was degraded again, like before, for the root partition:

/proc # cat mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sda2[0] sdb2[2]
      522228 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md3 : active raid1 sda5[0] sdb5[2]
      248460152 blocks super 1.0 [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

md2 : active raid1 sda3[0]
      41946040 blocks super 1.0 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active (auto-read-only) raid1 sda1[0] sdb1[2]
      2095092 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c15
--- Comment #15 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c16
--- Comment #16 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c17
--- Comment #17 from Wilfred van Velzen
Could you please install the appropriate rpm from here:
Could you provide a (more) direct download link? The "Go to download repository" link on that page gives a 404 error. And the regular openSUSE package search doesn't show your repository either...?

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c18
--- Comment #18 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c19
--- Comment #19 from Wilfred van Velzen
https://bugzilla.novell.com/show_bug.cgi?id=752869#c20
--- Comment #20 from Christian Boltz
Sorry, I thought it was public... I wonder how I make it public.
AFAIK, home:*:branches:* projects have the publish flag disabled by default. Enable it in the project config - or just use 'osc getbinaries' to download the RPMs.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c21
--- Comment #21 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c22
--- Comment #22 from Wilfred van Velzen
Thanks for testing and the screen shot. I think we are getting there.
I suspect that if you switched back to sysvinit now it would work.
I didn't try that.
The remaining problem is an interaction with systemd.
[...]
If this works as expected I'll ask for a maintenance update with these fixes.
This worked as expected. The server now booted fine with the one disk removed!
Thanks for your help in isolating this (and sorry that it has taken 2 months!).
No problem, I'm used to much worse response times in bugzilla. :-/ This wasn't even a bug you would notice during normal operation. And I'm happy to help improve openSUSE!

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
https://bugzilla.novell.com/show_bug.cgi?id=752869#c23
--- Comment #23 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c24
--- Comment #24 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=752869#c25
--- Comment #25 from Marcus Meissner
https://bugzilla.novell.com/show_bug.cgi?id=752869#c26
--- Comment #26 from Neil Brown
https://bugzilla.novell.com/show_bug.cgi?id=752869#c27
--- Comment #27 from Swamp Workflow Management