[Bug 919284] New: boot fails with LVMs-on-RAID; systemd "Timed out" and "Dependency failed" errors - referred from systemd-info ML to 'downstream'
http://bugzilla.opensuse.org/show_bug.cgi?id=919284

Bug ID: 919284
Summary: boot fails with LVMs-on-RAID; systemd "Timed out" and "Dependency failed" errors - referred from systemd-info ML to 'downstream'
Classification: openSUSE
Product: openSUSE Distribution
Version: 13.2
Hardware: x86-64
OS: openSUSE 13.2
Status: NEW
Severity: Major
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: h15234@mailas.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---
If it's about systemd not properly activating LVM volumes, I would open a bug report at openSUSE. It doesn't sound like an upstream issue as such.
From this thread: http://lists.opensuse.org/opensuse/2015-02/msg00602.html
So despite systemd reporting that fscks were not done on HOME and VAR,
I'm working on an openSUSE 13.2 machine running systemd v210. Its disks are all on RAID.

/boot is on RAID1 on /dev/md126. The remaining partitions are on LVM-on-RAID10.

The LVs are:

LV_ROOT VG0 -wi-ao---  20.00g
LV_SWAP VG0 -wi-ao---   8.00g
LV_HOME VG0 -wi-ao--- 100.00g
LV_VAR  VG0 -wi-ao---   1.00g

The system fails to boot, dropping to a maintenance-mode prompt. Simply hitting Ctrl-D to continue finishes booting the system. After boot, checking:

journalctl -b | egrep -i "Timed out|result=dependency" | egrep -i "dev|mount"

Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_HOME.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_HOME.device.
Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_HOME.service/start finished, result=dependency
Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_HOME.
Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_VAR.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_VAR.device.
Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_VAR.service/start finished, result=dependency
Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_VAR.
Feb 20 08:16:15 ender systemd[1]: Job dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device.

This is reported ON the same system, i.e. all the LVs are correctly mounted and fully functional. Why are these time-outs and dependency failures occurring? What do I need to change/fix to make sure it does not happen, and to avoid getting dropped into emergency mode on boot?
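As an aside, the odd-looking "dev-disk-by\x2did-..." unit names in the journal come from systemd's unit-name escaping of device paths: "/" separators become "-", and any literal "-" inside a path component is escaped as "\x2d". A simplified shell sketch of just that part of the rule (the real tool is systemd-escape, which also escapes other special characters):

```shell
# Simplified illustration of systemd device-unit naming; handles only
# "-" and "/" (the two characters appearing in the paths above).
escape_path() {
  # First escape dashes inside components, then strip the leading "/"
  # and turn the remaining "/" separators into "-".
  printf '%s\n' "$1" | sed -e 's/-/\\x2d/g' -e 's/^\///' -e 's/\//-/g'
}

escape_path /dev/VG0/LV_HOME                  # -> dev-VG0-LV_HOME
escape_path /dev/disk/by-id/dm-name-VG0-LV_HOME
# -> dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME
```

This matches the unit names in the timeout messages, e.g. /dev/VG0/LV_HOME corresponds to dev-VG0-LV_HOME.device.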
hanlon FWIW: I originally posted this to the systemd-info mailing list; it got this reply:

On Fri, 20.02.15 08:38, h15234@mailas.com (h15234@mailas.com) wrote:

> I'm working on a machine running systemd v210 (Opensuse 13.2)

This is a really old systemd version, please ask downstream for help on such an old version!

> Its disks are all on RAID.
>
> /boot is on RAID1 on /dev/md126
>
> The remaining partitions are on LVM-on-RAID10
>
> The LVs are
>
> LV_ROOT VG0 -wi-ao--- 20.00g
> LV_SWAP VG0 -wi-ao--- 8.00g
> LV_HOME VG0 -wi-ao--- 100.00g
> LV_VAR VG0 -wi-ao--- 1.00g

Well, LVM and RAID are nothing we support upstream, please talk to the LVM/MD communities or downstream for help.

Sorry,
Lennart

... the volumes are mounted anyway? Yes.
What about ROOT?
same. all volumes are mounted and fully available after boot completes.
Which version of Grub? Of LVM?
grub-0.97-200.1.3.x86_64
lvm2-2.02.98-43.17.1.x86_64
> /boot is on RAID1 on /dev/md126

That seems strange. It doesn't look like an LVM mapper address. I'm a great supporter of LVM, but I never put /boot on LVM. I know I can, but it's too much hassle when things go wrong.
That's not an LV. /boot is on an ext4-formatted RAID volume.
> Simply hitting Ctrl-D to continue, finishes booting the system.

Is this an "every time" occurrence or intermittent?
every time. fully reproducible.
Does it only occur on 'cold boots' when the disks are being spun up or also on reboots when the disks are already spinning?
both.
One "solution" is to edit the GRUB boot command line and add bootdelay=10 and possibly lvmwait=/dev/VG0/LV_HOME
I've already tried both, as well as x-systemd.device-timeout=5m in /etc/fstab. None of them, individually or in any combination, changes the problem.
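For reference, the fstab form of that timeout option looks like the following. The mount points and filesystem type here are assumptions for illustration, not taken from the reporter's actual /etc/fstab:

```
/dev/VG0/LV_HOME  /home  ext4  defaults,x-systemd.device-timeout=5m  0  2
/dev/VG0/LV_VAR   /var   ext4  defaults,x-systemd.device-timeout=5m  0  2
```

The option only extends how long systemd waits for the backing .device unit; it does not by itself cause the LV to be activated.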
If this is an 'every time' problem, I'd first make sure the device mapper module is in the initrd of your boot.
It already is:

cat /etc/sysconfig/kernel
INITRD_MODULES="... raid0 raid1 raid10 raid456 dm-mod ..."
*IF* this is not a disk drive timing problem, then it's a problem with LVM activation.
Agreed, which I'm guessing is why the /etc/systemd/system/lvm_local.service, with its explicit LVM Before/Requires dependencies and ExecStart=/sbin/vgchange --available y, seems to fix the problem. With the fix in place, booting finishes perfectly. Absolutely no more errors or warnings in dmesg, journalctl, etc.
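The workaround unit itself is not attached to the bug; a minimal sketch of what such an /etc/systemd/system/lvm_local.service might look like follows. The exact ordering targets and the udev-settle dependency are assumptions, only the vgchange invocation is taken from the report:

```ini
# Hypothetical reconstruction for illustration; the reporter's actual
# unit may differ in its dependencies.
[Unit]
Description=Local workaround: activate LVM volume groups before local mounts
DefaultDependencies=no
Requires=systemd-udev-settle.service
After=systemd-udev-settle.service
Before=local-fs-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/vgchange --available y

[Install]
WantedBy=local-fs.target
```

The key point is ordering: activating all VGs before local-fs-pre.target means the dev-VG0-*.device units exist before systemd starts waiting on the fsck/mount jobs that depend on them.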
However, I'd also look at /etc/sysconfig/dmraid for timeout values.
It's already set:

DMRAID_DEVICE_TIMEOUT="120"

-- You are receiving this mail because: You are on the CC list for the bug.
Anton Aylward
--- Comment #2 from han l2
The vgchange command is there in
/usr/lib/systemd/system/lvm2-monitor.service
I'd check that on your system and see whether it exists. If so, I would enable and activate it. I'd prefer to use the supplied unit rather than a local/custom one.
Already enabled:

systemctl status lvm2-monitor
lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
   Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; enabled)
   Active: active (exited) since Mon 2015-02-23 13:17:37 PST; 19h ago
     Docs: man:dmeventd(8)
           man:lvcreate(8)
           man:lvchange(8)
           man:vgchange(8)
  Process: 32103 ExecStart=/sbin/lvm vgchange --monitor y (code=exited, status=0/SUCCESS)
 Main PID: 32103 (code=exited, status=0/SUCCESS)

Enabling/disabling it makes no difference.
Andrei Borzenkov
--- Comment #4 from han l2
Please try to reproduce with the added kernel parameters "systemd.log_level=debug rd.udev.log-priority=debug udev.log-priority=debug" and with "quiet" removed. If you can, upload the output of "journalctl -b" after booting.
Adding those params causes boot to fail.

(1) This boots:

kernel /vmlinuz root=/dev/VG0/ROOT lvmwait=/dev/VG0/ROOT ... noquiet net.ifnames=0 biosdevname=0 systemd.log_target=journal-or-kmsg systemd.log_level=debug loglevel=debug

(2) This fails to boot:

kernel /vmlinuz root=/dev/VG0/ROOT lvmwait=/dev/VG0/ROOT ... noquiet net.ifnames=0 biosdevname=0 systemd.log_target=journal-or-kmsg systemd.log_level=debug loglevel=debug rd.udev.log-priority=debug udev.log-priority=debug

I'm trying to understand why. In case (2), journalctl output ends with:

...
Feb 24 11:33:52 hanl systemd[1]: Starting Trigger Flushing of Journal to Persistent Storage...
Feb 24 11:33:52 hanl systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Feb 24 11:33:52 hanl systemd[1856]: Executing: /usr/bin/systemctl kill --kill-who=main --signal=SIGUSR1 systemd-journald.service
Feb 24 11:33:52 hanl systemd-journal[861]: Permanent journal is using 65.0M (max allowed 50.0M, trying to leave 887.9M free of 795.9M available → current limit 65.0M).
Feb 24 11:33:54 hanl systemd-journal[861]: Time spent on flushing to /var is 1.276575s for 18188 entries.
Feb 24 11:33:54 hanl systemd[1]: Got disconnect on private connection.
Feb 24 11:33:54 hanl systemd[1]: Received SIGCHLD from PID 1857 (plymouth).
Feb 24 11:33:54 hanl systemd[1]: Child 1857 (plymouth) died (code=exited, status=0/SUCCESS)
Feb 24 11:33:54 hanl systemd[1]: Child 1857 belongs to plymouth-read-write.service
Feb 24 11:33:54 hanl systemd[1]: plymouth-read-write.service changed start -> dead
Feb 24 11:33:54 hanl systemd[1]: Job plymouth-read-write.service/start finished, result=done
Feb 24 11:33:54 hanl systemd[1]: plymouth-read-write.service: cgroup is empty
Feb 24 11:33:54 hanl systemd[1]: Accepted new private connection.
Feb 24 11:33:54 hanl systemd[1]: sysinit.target changed dead -> active
Feb 24 11:33:54 hanl systemd[1]: Job sysinit.target/start finished, result=done
Feb 24 11:33:54 hanl systemd[1]: ConditionPathExists=/etc/alsa/state-daemon.conf failed for alsa-state.service.
Feb 24 11:33:54 hanl systemd[1]: Starting of alsa-state.service requested but condition failed. Ignoring.
Feb 24 11:33:54 hanl systemd[1]: Job alsa-state.service/start finished, result=done
Feb 24 11:33:54 hanl systemd[1]: ConditionPathExists=!/etc/alsa/state-daemon.conf succeeded for alsa-restore.service.
Feb 24 11:33:54 hanl systemd[1]: alsa-restore.service changed dead -> start
Feb 24 11:33:54 hanl systemd[1]: basic.target changed dead -> active
Feb 24 11:33:54 hanl systemd[1]: Job basic.target/start finished, result=done
Feb 24 11:33:54 hanl systemd[1]: ConditionVirtualization=false succeeded for mcelog.service.
Feb 24 11:33:54 hanl systemd[1]: mcelog.service changed dead -> running
Feb 24 11:33:54 hanl systemd[1]: Job mcelog.service/start finished, result=done
Feb 24 11:33:54 hanl systemd[1]: Got message type=signal sender=n/a destination=n/a object=/org/freedesktop/systemd1/agent interface=org.freedesktop.systemd1.Agent member=Released cookie=1 reply_cookie=0 error=n/a
Feb 24 11:33:54 hanl systemd[1]: Got disconnect on private connection.
Feb 24 11:33:52 hanl systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
Feb 24 11:33:54 hanl systemd[1860]: Executing: /usr/sbin/alsactl restore
Feb 24 11:33:54 hanl systemd[1861]: Executing: /usr/sbin/mcelog --ignorenodev --daemon --foreground
Feb 24 11:33:54 hanl systemd[1]: Started Restore Sound Card State.
Feb 24 11:34:01 hanl systemd[1]: Starting Stop Read-Ahead Data Collection...
Feb 24 11:34:01 hanl systemd[1864]: Executing: /usr/bin/systemd-notify --readahead=done
Feb 24 11:34:01 hanl systemd[1]: Started Stop Read-Ahead Data Collection.
Feb 24 11:34:34 hanl systemd-journal[861]: Suppressed 7548 messages from /
Feb 24 11:34:34 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:34:34 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:34:34 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:34:34 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:34:34 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:34:34 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:35:14 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:35:14 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:35:14 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:35:14 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:35:14 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:35:14 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:35:54 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:35:54 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:35:54 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:35:54 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:35:54 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:35:54 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:36:34 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:36:34 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:36:34 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1
Feb 24 11:36:34 hanl systemd[1]: Got notification message for unit systemd-journald.service
Feb 24 11:36:34 hanl systemd[1]: systemd-journald.service: Got notification message from PID 861 (WATCHDOG=1...)
Feb 24 11:36:34 hanl systemd[1]: systemd-journald.service: got WATCHDOG=1

Here, the system hung and required a hard reboot.
Peter B
--- Comment #5 from Andrei Borzenkov
Adding those params causes boot to fail.
(1) This boots
kernel /vmlinuz root=/dev/VG0/ROOT lvmwait=/dev/VG0/ROOT ... noquiet net.ifnames=0 biosdevname=0 systemd.log_target=journal-or-kmsg systemd.log_level=debug loglevel=debug
OK, so attach this one at least.
(2) This fails to boot
kernel /vmlinuz root=/dev/VG0/ROOT lvmwait=/dev/VG0/ROOT ... noquiet net.ifnames=0 biosdevname=0 systemd.log_target=journal-or-kmsg systemd.log_level=debug loglevel=debug rd.udev.log-priority=debug udev.log-priority=debug
Probably too much udev output overloads the journal.
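If the theory is that the udev debug flood overwhelms journald (note the "Suppressed 7548 messages" line above), one hedged way to test it would be to relax journald's rate limiting for a boot. This fragment for /etc/systemd/journald.conf is a suggestion, not something tried in the thread; setting either value to 0 switches rate limiting off:

```ini
[Journal]
# Assumption: disabling rate limiting for a test boot lets the full
# udev debug output through instead of being suppressed.
RateLimitInterval=0
RateLimitBurst=0
```

This only changes what journald keeps; if the hang is caused by the sheer volume of messages rather than their suppression, it may make things worse, so it is only a diagnostic step.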
--- Comment #6 from han l2
OK, so attach this one at least.
I've been cutting out the cruft from the log, step by step, looking for what might matter. Here's the latest version of the filtered log. Look for the timing around 'my' lvm_local.service.
--- Comment #7 from Andrei Borzenkov
I've been cutting out the cruft from the log step by step looking for what might matter.
Please never trim logs you are asked to provide unless explicitly asked to do so. Also, your logs are ellipsized for some reason (they should not be when redirected to a file), which removes vital information; please use journalctl --full in this case.
Here's the latest version of the filtered log.
Look for the timing around 'my' lvm_local.service
Well, by your own words, when you use lvm_local.service the problem is not observed. I asked you to attach logs from when you can reproduce the problem ...
Bernhard Wiedemann
--- Comment #13 from Andrei Borzenkov
Andrei, this looks again like our infamous lvmetad issue.
Yes, it is possible, but the problem is I cannot reproduce it - I tried to create a similar setup and it worked. So it is impossible to say anything without having logs from a boot where it fails.
han l2