[Bug 832501] New: boot on raid device is not started if degraded; fix provided
https://bugzilla.novell.com/show_bug.cgi?id=832501
https://bugzilla.novell.com/show_bug.cgi?id=832501#c0

Summary: boot on raid device is not started if degraded; fix provided
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 12.3
Status: NEW
Severity: Major
Priority: P5 - None
Component: Other
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: peter.maloney@brockmann-consult.de
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=550422) --> (http://bugzilla.novell.com/attachment.cgi?id=550422)
patch for /var/mkinitrd/scripts/, maybe not src files

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0

If your /boot is on a separate raid device from your /, mkinitrd does not add any information to the initrd to start the raid device, so boot will fail. I don't know why booting works when the RAID is clean; perhaps systemd starts it in that case.

Ubuntu 12.04 (grub 1.99) can boot with a degraded raid as long as you manually fix the metadata version of the device (change it to 0.90, possibly 1.0, but not 1.2, which is the default on the CLI and in the Ubuntu installer), so I was sad to see that the latest openSUSE does not work (even though previous versions did). But I was happy to see that openSUSE will work with my fix and without changing the metadata, because openSUSE uses grub 2.00 and the installer uses metadata 1.0 instead of 1.2.

I have fixed the problem on my machine by editing the mkinitrd scripts. I don't know if I did a nice clean job that will work on other systems, so please validate it. I have also added some extra output in verbose mode.

In my solution, I check whether mdadm.conf exists, and if not, generate one. This is because the openSUSE installer did not generate one for me in my most hackish of tests. This seems like a good way to prevent some problems, even if they are the user's fault.
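The mdadm.conf check described above could look roughly like this (a minimal sketch, not the attached patch; the function name, message wording, and the ability to pass an alternate path are illustrative assumptions):

```shell
#!/bin/sh
# Rough sketch of the mdadm.conf fallback described above -- illustrative
# only, not the attached patch. If no config exists, generate ARRAY lines
# from a live scan so the initrd knows which arrays to start.
MDADM_CONF=/etc/mdadm.conf

ensure_mdadm_conf() {
    conf=${1:-$MDADM_CONF}            # allow overriding the path for testing
    if [ ! -s "$conf" ]; then
        echo "mkinitrd: no $conf found, generating one from a scan" >&2
        mdadm -D --scan > "$conf" || return 1
    fi
}
```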
In my solution, I am not sure if there is a problem when you have no mdadm.conf, or when your mdadm.conf has entries for things you don't want to be required for boot, so that the initrd tries to start them too. I check /sys/devices/virtual/block/ to see if there are devices before trying to handle them; if there are devices but no mdadm.conf, I use <(mdadm -D --scan) to read the output instead of the file.

Reproducible: Always

Steps to Reproduce:
Set up a test machine:
  2 x 16 GB virtual disks
  md0 is raid1 (sda1 and sdb1), mounted on /boot as ext4
  md1 is raid1 (sda2 and sdb2), and is an LVM PV
  /dev/suse is the LVM VG containing PV /dev/md1
  /dev/suse/root is from VG /dev/suse, mounted on / as ext4
  /dev/suse/swap is from VG /dev/suse, and is swap

On the command line, you could create the devices like this:

  mdadm --create /dev/md0 -n 2 -x 0 -l 1 -e 1.0 missing /dev/sdb1
  mdadm --create /dev/md1 -n 2 -x 0 -l 1 -e 1.0 missing /dev/sdb2
  mkfs.ext4 -L boot /dev/md0
  pvcreate /dev/md1
  vgcreate suse /dev/md1
  lvcreate -L 4GB -n swap suse
  lvcreate -l 100%FREE -n root suse
  mkfs.ext4 -L root /dev/suse/root
  mkswap /dev/suse/swap

After the machine is up, run this to ensure the machine should be ready to boot with either disk missing:

  grub2-install /dev/sda
  grub2-install /dev/sdb
  mkinitrd
  grub2-mkconfig -o /boot/grub2/grub.cfg

Then shut it down, remove a disk (I removed the 2nd for most of my tests, because VirtualBox snapshots mess up if you boot from the one you add afterwards), and boot it up.

Actual Results:
You get a very long wait (at least 60 seconds) and then you get emergency mode. Normal startup was blocked because fsck could not open /dev/md0; it could not open it because /dev/md0 is started and exists, but is not running (as if --run was not used when assembling).

Expected Results:
You get a successful boot with degraded arrays.
The systemd log shows you something like this:

Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.device/start timed out.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.device.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for /boot.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for Local File Systems.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for Remote File Systems (Pre).
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job local-fs.target/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job boot.mount/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for File System Check on /dev/disk/by-uuid/a16b10b0-d038-4946-ad88-97c0617bbf8c.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job systemd-fsck@dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.service/start failed with result 'dependency'.

-- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
--- Comment #1 from Peter Maloney
--- Comment #2 from Peter Maloney
--- Comment from FeiXiang Zhang
--- Comment #3 from Olaf Hering
--- Comment #4 from Olaf Hering
--- Comment #5 from Peter Maloney
--- Comment #6 from Neil Brown
If your /boot is on a separate raid device from your /, mkinitrd does not add any information in the initrd to start the raid device, so boot will fail.
Hi Peter,

I'm a bit confused by this part of the problem description. As I understand it, the initrd does not need to access the /boot filesystem at all. The boot loader (e.g. grub) does, of course, so that it can load the kernel and the initrd, but all the initrd needs access to is the root filesystem and the swap partition. Once it mounts root, the scripts there take over to mount /boot and anything else.

Clearly you are having a problem, and it does seem to be related to the md device containing /boot, but I think it needs to be fixed in the regular boot scripts, not in the initrd.

Handling freshly degraded arrays at boot is somewhat tricky with the dependency-driven boot sequence that systemd uses. As devices are discovered, udev runs "mdadm -I $DEVICE" and mdadm incrementally assembles the arrays. Once all components are there, the array is started. But if all components never arrive, the array will never be started by that mechanism alone. To address this you can run "mdadm -IRs", which essentially says "all devices have arrived, time to start any remaining md arrays which are degraded". systemd needs to do this when it times out waiting for a device, but I don't know how to tell it to (not that I have really looked recently).

The initrd does have a call to "mdadm -IRs" halfway through timing out for the root device. This is why your boot works if you tell the initrd to assemble the boot device. But that isn't really the right fix.

I'll do some reading about systemd and see if I can figure out how to give it an action to perform on timeout.
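The two-stage mechanism Neil describes can be sketched as follows (an illustrative fragment, not the actual openSUSE rule files; the exact rule syntax and file locations vary between distributions and mdadm versions):

```shell
# Illustrative sketch of incremental assembly (comments only; not a real
# rules file from this distribution).
#
# 1. udev feeds each discovered RAID member to mdadm as it appears:
#      SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
#          RUN+="/sbin/mdadm -I $env{DEVNAME}"
#
#    An array starts automatically once all of its components have arrived.
#
# 2. If some components never arrive (a failed disk), something must declare
#    device discovery finished and start the remaining arrays degraded:
#      mdadm -IRs
#
#    The open question in this bug is how to make systemd run step 2 when it
#    times out waiting for a device.
```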
--- Comment #7 from Marco M.
--- Comment #8 from Marco M.
--- Comment #9 from Neil Brown
--- Comment #10 from Andrey Borzenkov
I don't know why you don't get a login prompt
See bnc#852021
--- Comment #11 from Marco M.
I don't know why you don't get a login prompt, but it might be worth trying to boot with plymouth - that might be confusing things.
I think you press 'e' at the grub menu and it puts you in a simple editor. Find the kernel command line and add plymouth.enable=0 to the end. See if that provides a password prompt in emergency mode.
I added plymouth.enable=0 but the emergency shell is still not working.
--- Comment #12 from Marco M.
(In reply to comment #9)
I don't know why you don't get a login prompt
See bnc#852021
It looks very similar! I'm going to try the suggested patch as soon as possible and I'll let you know the result, thank you.
--- Comment #13 from Marco M.
(In reply to comment #10)
(In reply to comment #9)
I don't know why you don't get a login prompt
See bnc#852021
It looks very similar! I'm going to try the suggested patch as soon as possible and I'll let you know the result, thank you.
OK, the emergency shell problem is the same as described in bnc#852021, and the proposed patch has worked for me (in the sense that it solved the login prompt problem; of course I still have the main problem we are facing here). I'm of course available to test a patch.
--- Comment #14 from Neil Brown
--- Comment #15 from Marco M.
Created an attachment (id=569764) --> (http://bugzilla.novell.com/attachment.cgi?id=569764) mdadm RPM for testing
Please test this rpm and confirm that it fixes the problem.
I installed the rpm with this command: rpm -ivh --replacepkgs --force

--force was necessary; otherwise rpm complains that the already installed package is newer than the one I was installing.

I added the nofail option to all mounted standard (NO RAID) partitions on the disk that I was about to remove (the absence of a partition that is listed in fstab as automounted triggers the emergency shell).

I pulled out the sdb disk and the system booted as expected! So the patch is working fine for me!
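The nofail trick mentioned above looks roughly like this in /etc/fstab (the UUID and mount point below are placeholders, not values from this report):

```shell
# /etc/fstab entry for a non-RAID partition on the removable disk;
# "nofail" lets boot continue instead of dropping to the emergency shell
# when the device is absent (UUID and mount point are placeholders):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  ext4  defaults,nofail  0  2
```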
--- Comment #16 from Neil Brown
--- Comment #17 from Bernhard Wiedemann
--- Comment #18 from Peter Maloney
Could you please confirm whether you were seeing this in 12.3 or in a 13.1 beta?
@Neil It was 12.3; I have not tested 13.1 at all yet.
--- Comment #19 from Marco M.
--- Comment #20 from Neil Brown
--- Comment #21 from Benjamin Brunner
--- Comment #22 from Swamp Workflow Management
--- Comment from Marco M.
--- Comment #24 from Marco M.
--- Comment #25 from Andrei Borzenkov
I was unable to boot with a degraded raid 1 array (both boot and root were on raid)
I cannot reproduce this using boot on MD RAID with current Factory. Your log shows:

[  143.781288] linux-m61d dracut-initqueue[270]: Warning: Cancelling resume operation. Device not found.
[  144.094926] linux-m61d systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-1f3d0926\x2da660\x2d45a6\x2da2b0\x2dbfdf165d64b5.device.
[  144.096300] linux-m61d systemd[1]: Dependency failed for /sysroot.
[  144.098235] linux-m61d systemd[1]: Dependency failed for Initrd Root File System.
[  144.099596] linux-m61d systemd[1]: Dependency failed for Reload Configuration from the Real Root.
[  144.100150] linux-m61d systemd[1]: Dependency failed for File System Check on /dev/disk/by-uuid/1f3d0926-a660-45a6-a2b0-bfdf165d64b5.
[  144.553192] linux-m61d dracut-initqueue[270]: Scanning devices md2 for LVM logical volumes SystemVG/rootLV

So it is actually a problem of LVM on RAID, not RAID itself:

/dev/mapper/SystemVG-rootLV: LABEL="rootFS" UUID="1f3d0926-a660-45a6-a2b0-bfdf165d64b5" TYPE="ext4"

Please provide your initrd that fails, as already requested.
--- Comment #26 from Marco M.
--- Comment #27 from Marco M.
--- Comment #30 from Marco M.
If I edit /etc/systemd/system.conf, uncomment #DefaultTimeoutStartSec=90s and change it to 300s, and run "mkinitrd", then the boot succeeds again, without the need to set rd.retry=80.
This workaround has also worked fine for me. I did these two tests:

1) Degraded raid1 device with root on it, plus one missing data filesystem (a simple partition with a filesystem on it, no raid involved): the root filesystem is mounted and the emergency shell appears.

2) Degraded raid1 device with root on it, no other missing filesystem (there was a missing filesystem, but I marked it with the "nofail" tag in /etc/fstab): the system boots correctly into runlevel 5 (into the systemd equivalent of runlevel 5...).
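The timeout workaround above amounts to something like the following (a sketch under the assumption that the stock commented-out 90s line is present; the function name and the overridable path are illustrative, added so the edit can be tried on a copy first):

```shell
#!/bin/sh
# Sketch of the DefaultTimeoutStartSec workaround described above.
# Assumes the stock commented-out line "#DefaultTimeoutStartSec=90s".
raise_device_timeout() {
    conf=${1:-/etc/systemd/system.conf}   # pass a path to test on a copy
    # uncomment the line and raise the timeout from 90s to 300s
    sed -i 's/^#DefaultTimeoutStartSec=90s/DefaultTimeoutStartSec=300s/' "$conf"
}
# After editing, rebuild the initrd so early boot picks up the new timeout:
#   raise_device_timeout && mkinitrd
```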
--- Comment #32 from Neil Brown
--- Comment #33 from Marco M.
--- Comment #34 from Neil Brown
--- Comment #35 from Marco M.
--- Comment #36 from Neil Brown
some minutes more than normal
When booting with a newly degraded array I would expect an extra delay of about 2 minutes: there is a default timeout of 180 seconds and a magic factor of 2/3 applied. If you are seeing a longer delay, or a delay when the array was not newly degraded, then that might be a bug. Otherwise it is acting as expected.

The yast installer/grub install issue is quite separate. Your best bet would be to open a new bug focusing on just that issue. Feel free to add me to 'cc', but I'm not likely to be the one to push it to resolution (I hope).

Thanks for the positive report - I'll close this bug now on the assumption that the delays you see match expectations. If they don't and you want to pursue that issue, please re-open.
--- Comment #37 from Marco M.
If you are seeing a longer delay, or a delay when the array was not newly degraded, then that might be a bug. Otherwise it is acting as expected.
It is working as expected.
The yast installer/grub install issue is quite separate. Your best bet would be to open a new bug focusing on just that issue. Feel free to add me to 'cc', but I'm not likely to be the one to push it to resolution (I hope).
I'll build a new clean test environment and I'll open a new bug as you suggested. Thank you very much.