[opensuse] boot fails with LVMs-on-RAID; systemd "Timed out" and "Dependency failed" errors; referred from systemd-info ML to 'downstream'
I'm working on an opensuse 13.2 machine running systemd v210. Its disks are all on RAID.

/boot is on RAID1 on /dev/md126. The remaining partitions are on LVM-on-RAID10.

The LVs are

  LV_ROOT VG0 -wi-ao---  20.00g
  LV_SWAP VG0 -wi-ao---   8.00g
  LV_HOME VG0 -wi-ao--- 100.00g
  LV_VAR  VG0 -wi-ao---   1.00g

The system fails to boot, dropping to a maintenance-mode prompt. Simply hitting Ctrl-D to continue finishes booting the system.

After boot, checking

  journalctl -b | egrep -i "Timed out|result=dependency" | egrep -i "dev|mount"

shows

  Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_HOME.device/start timed out.
  Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_HOME.device.
  Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_HOME.service/start finished, result=dependency
  Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_HOME.
  Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_VAR.device/start timed out.
  Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_VAR.device.
  Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_VAR.service/start finished, result=dependency
  Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_VAR.
  Feb 20 08:16:15 ender systemd[1]: Job dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device/start timed out.
  Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device.

This is reported ON the same system, i.e. all the LVs are correctly mounted and fully functional. Why are these time-outs and dependency failures occurring? What do I need to change/fix to make sure this does not happen, and to avoid getting dropped into emergency mode on boot?
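An aside on the \x2d sequences in those journal lines: systemd derives .device unit names from device paths by escaping special characters in each path component and joining the components with '-'. A rough sketch of that mangling, simplified to hyphens only (the real systemd-escape logic covers more characters than just '-'):

```shell
# Approximate systemd's device-path-to-unit-name mangling (simplified
# sketch: only '-' is escaped; real systemd-escape handles more chars).
device_unit_name() {
  # strip the leading '/', escape '-' inside components as \x2d,
  # then turn the remaining '/' separators into '-'
  printf '%s' "$1" | sed -e 's|^/||' -e 's|-|\\x2d|g' -e 's|/|-|g'
  printf '.device\n'
}

device_unit_name /dev/VG0/LV_HOME
# -> dev-VG0-LV_HOME.device
device_unit_name /dev/disk/by-id/dm-name-VG0-LV_HOME
# -> dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device
```

That is why /dev/VG0/LV_HOME shows up as dev-VG0-LV_HOME.device, and the by-id symlink (whose name itself contains hyphens) shows up with the \x2d escapes.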
hanlon

FWIW: I originally posted this to the systemd-info mailing list; that got this reply:

On Fri, 20.02.15 08:38, h15234@mailas.com (h15234@mailas.com) wrote:

> I'm working on a machine running systemd v210 (Opensuse 13.2)

This is a really old systemd version, please ask downstream for help on such an old version!

> Its disks are all on RAID.
>
> /boot is on RAID1 on /dev/md126
>
> The remaining partitions are on LVM-on-RAID10
>
> The LVs are
>
> LV_ROOT VG0 -wi-ao--- 20.00g
> LV_SWAP VG0 -wi-ao--- 8.00g
> LV_HOME VG0 -wi-ao--- 100.00g
> LV_VAR VG0 -wi-ao--- 1.00g

Well, LVM and RAID are nothing we support upstream, please talk to the LVM/MD communities or downstream for help.

Sorry, Lennart

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
h15234@mailas.com wrote:
I'm working on an opensuse 13.2 machine running systemd v210.
Its disks are all on RAID.
/boot is on RAID1 on /dev/md126
md126? That doesn't sound right, but it might not be a problem.
The remaining partitions are on LVM-on-RAID10
The LVs are
LV_ROOT VG0 -wi-ao---  20.00g
LV_SWAP VG0 -wi-ao---   8.00g
LV_HOME VG0 -wi-ao--- 100.00g
LV_VAR  VG0 -wi-ao---   1.00g
The system fails to boot, dropping to a maintenance mode prompt.
Simply hitting Ctrl-D to continue, finishes booting the system.
After boot, checking
journalctl -b | egrep -i "Timed out|result=dependency" | egrep -i "dev|mount"
Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_HOME.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_HOME.device.
Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_HOME.service/start finished, result=dependency
Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_HOME.
Feb 20 08:16:15 ender systemd[1]: Job dev-VG0-LV_VAR.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-VG0-LV_VAR.device.
Feb 20 08:16:15 ender systemd[1]: Job systemd-fsck@dev-VG0-LV_VAR.service/start finished, result=dependency
Feb 20 08:16:15 ender systemd[1]: Dependency failed for File System Check on /dev/VG0/LV_VAR.
Feb 20 08:16:15 ender systemd[1]: Job dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device/start timed out.
Feb 20 08:16:15 ender systemd[1]: Timed out waiting for device dev-disk-by\x2did-dm\x2dname\x2dVG0\x2dLV_HOME.device.
This is reported ON the same system. I.e., all the LVs are correctly mounted and fully functional.
So despite systemd reporting that fscks were not done on HOME and VAR, the volumes are mounted anyway? What about ROOT?

--
Per Jessen, Zürich (4.6°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
On 02/20/2015 12:07 PM, h15234@mailas.com wrote:
I'm working on an opensuse 13.2 machine running systemd v210.
Which version of Grub? Of LVM?
Its disks are all on RAID.
/boot is on RAID1 on /dev/md126
That seems strange. It doesn't seem like an LVM mapper address. I'm a great supporter of LVM but I never put /boot on LVM. I know I can, but it's too much hassle when things go wrong.
The remaining partitions are on LVM-on-RAID10
The LVs are
LV_ROOT VG0 -wi-ao---  20.00g
LV_SWAP VG0 -wi-ao---   8.00g
LV_HOME VG0 -wi-ao--- 100.00g
LV_VAR  VG0 -wi-ao---   1.00g
The system fails to boot, dropping to a maintenance mode prompt.
Simply hitting Ctrl-D to continue, finishes booting the system.
Is this an "every time" occurrence or intermittent? Does it only occur on 'cold boots' when the disks are being spun up, or also on reboots when the disks are already spinning?

IIRC, with LVM there is some unit that introduces a delay so that LVM devices have time to appear. It sounds like either this simply isn't being done, or it isn't working. Your 'manual' start works because human-level 'reflexes' have given the disk subsystem time to come up and be operational.

One "solution" is to edit the GRUB boot command line and add bootdelay=10, and possibly lvmwait=/dev/VG0/LV_HOME or whatever is appropriate. Ten seconds may seem too much, but see if it works and then reduce it step by step. If this turns out to be useful, you can add the options permanently by editing /etc/default/grub.

If this is an 'every time' problem, I'd first make sure the device mapper module is in the initrd of your boot. The device mapper is a low-level volume manager; higher-level volume managers such as LVM2 use this driver. If the device mapper isn't part of the boot kernel, it has to be loaded as a module from the ROOT FS, which isn't available yet.

*IF* this is not a disk drive timing problem, then it's a problem with LVM activation. Somewhere along the line the equivalent of the command

  # lvm vgchange -ay

isn't being done, or isn't being done early enough, or there isn't the 'wait' around it. That part *may* be systemd unit related -- not systemd itself, but the unit that specifies startup of the disk subsystem. Since I don't run LVM on RAID I can't tell you where to go for that. Worse, it may not be a static file but one produced by a unit generator. It may also be related to a udev rule, since the rules there pertain to the startup of both LVM and RAID. Once again, I don't run RAID so I can't be specific. However, I'd also look at /etc/sysconfig/dmraid for timeout values.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
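To make the suggested kernel parameters permanent, the /etc/default/grub edit would look roughly like this (a sketch: it assumes GRUB2's /etc/default/grub as named above; whether bootdelay/lvmwait are honoured depends on the initrd and distribution):

```
# /etc/default/grub (sketch; append the suggested options to whatever
# is already on the kernel command line -- the "..." stands for the
# existing options and should not be copied literally)
GRUB_CMDLINE_LINUX_DEFAULT="... bootdelay=10 lvmwait=/dev/VG0/LV_HOME"

# afterwards, regenerate the grub config:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```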
So despite systemd reporting that fscks were not done on HOME and VAR, the volumes are mounted anyway?
yes.
What about ROOT ?
same. all volumes are mounted and fully available after boot completes.
Which version of Grub? Of LVM?
grub-0.97-200.1.3.x86_64
lvm2-2.02.98-43.17.1.x86_64
/boot is on RAID1 on /dev/md126

That seems strange. It doesn't seem like an LVM mapper address. I'm a great supporter of LVM but I never put /boot on LVM. I know I can, but it's too much hassle when things go wrong.
That's not an LV. /boot is on an ext4-partitioned RAID vol.
Simply hitting Ctrl-D to continue, finishes booting the system. Is this an "every time" occurrence or intermittent?
every time. fully reproducible.
Does it only occur on 'cold boots' when the disks are being spun up or also on reboots when the disks are already spinning?
both.
One "solution" is to edit the GRUB boot command line and add bootdelay=10 and possibly lvmwait=/dev/VG0/LV_HOME
I've already tried both, as well as x-systemd.device-timeout=5m in /etc/fstab. None, individually or in any combination, changes the problem.
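For anyone reproducing this, the fstab variant of that experiment would look something like the following (a sketch: the mount option follows the x-systemd.device-timeout syntax mentioned above; the ext4 filesystem type is an assumption, since the thread only states /boot's type):

```
# /etc/fstab (sketch) - per-device timeout hint for systemd
# ext4 is assumed here; substitute the actual filesystem type
/dev/VG0/LV_HOME  /home  ext4  defaults,x-systemd.device-timeout=5m  1 2
/dev/VG0/LV_VAR   /var   ext4  defaults,x-systemd.device-timeout=5m  1 2
```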
If this is an 'every time' problem I'd first make sure the device mapper module is in the initrd of your boot.
it already is:

  cat /etc/sysconfig/kernel
  INITRD_MODULES="... raid0 raid1 raid10 raid456 dm-mod ..."
*IF* this is not a disk drive timing problem then it's a problem with LVM activation.
Agreed, which I'm guessing is why

  /etc/systemd/system/lvm_local.service

with the explicit LVM Before/Requires dependencies, and the

  ExecStart=/sbin/vgchange --available y

seems to fix the problem. With the fixes in place, booting finishes perfectly. Absolutely no more errors or warnings in dmesg, journalctl, etc.
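The unit file itself is never quoted in the thread; a minimal sketch of what such a workaround unit might look like (the unit name and ExecStart line are from the post above; the specific ordering targets and dependency directives chosen here are assumptions, not the poster's actual file):

```ini
# /etc/systemd/system/lvm_local.service (sketch, not the actual file)
[Unit]
Description=Activate local LVM volume groups (boot workaround)
DefaultDependencies=no
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service
Before=local-fs-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/vgchange --available y

[Install]
WantedBy=local-fs.target
```

The point of the Before=local-fs-pre.target ordering is that the LVs are activated before systemd starts waiting on the .device units for /home, /var, etc.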
However I'd look at /etc/sysconfig/dmraid as well for timeout values
It's already set:

  DMRAID_DEVICE_TIMEOUT="120"

I'm betting on it being systemd-related. Unfortunately v210 is 'old' to begin with, which is why upstream apparently won't get involved in any discussion. I've read that one of the SUSE devs is having luck with v219 on 13.2; maybe that'll get upgraded.

On top of that, Opensuse has already had some issues with setting systemd dependencies correctly -- mostly what I've read is around wicked, though (what a mess!). Wouldn't be surprised if it shows up somewhere else too.
h15234@mailas.com wrote:
/boot is on RAID1 on /dev/md126 That seems strange. It doesn't seem like a LVM mapper address. I'm a great supporter of LVM but I do not put /boot on LVM ever. I know I can, but its too much hassle when things go wrong.
That's not an LV.
/boot is on an ext4-partitioned RAID vol.
md126 suggests it was auto-created at some point. I don't think it's a big deal.
*IF* this is not a disk drive timing problem then it's a problem with LVM activation.
agreed. which I'm guessing is why the
/etc/systemd/system/lvm_local.service
with the explicit LVM Before/Requires dependencies, and the
ExecStart=/sbin/vgchange --available y
seems to fix the problem.
I have seen this before, although not specifically related to systemd. I know I have had situations where 'vgchange -ay' simply wasn't done at the right time.
I'm betting on it being systemd-related.
Unfortunately v210 is 'old' to begin with, which is why upstream won't apparently get involved in any discussion. I've read that one of the Suse devs is having luck with v219 on 13.2; maybe that'll get upgraded.
If it's about systemd not properly activating LVM volumes, I would open a bug report @ opensuse. Doesn't sound like an upstream issue as such.

/Per
On Tue, Feb 24, 2015, at 06:56 AM, Per Jessen wrote:
I have seen this before, although not specifically related to systemd. I know I have had situations where 'vgchange -ay' simply wasn't done at the right time.
I'm betting on it being systemd-related.
Unfortunately v210 is 'old' to begin with, which is why upstream won't apparently get involved in any discussion. I've read that one of the Suse devs is having luck with v219 on 13.2; maybe that'll get upgraded.
If it's about systemd not properly activating LVM volumes, I would open a bugreport @ opensuse. Doesn't sound like an upstream issue as such.
Ok, moved it here: https://bugzilla.opensuse.org/show_bug.cgi?id=919284 Thanks.
On 02/24/2015 09:25 AM, h15234@mailas.com wrote:
*IF* this is not a disk drive timing problem then it's a problem with LVM activation. Agreed, which I'm guessing is why the
/etc/systemd/system/lvm_local.service
with the explicit LVM Before/Requires dependencies, and the
ExecStart=/sbin/vgchange --available y
seems to fix the problem. With the fixes in place booting finishes perfectly. Absolutely no more errors or warnings in dmesg, journalctl, etc.
That makes perfect sense. It's the sort of thing that should have been carried over from an init.d script. However, I'd ask why it is in a unit rather than in the udev rules.
Anton Aylward wrote:
On 02/24/2015 09:25 AM, h15234@mailas.com wrote:
*IF* this is not a disk drive timing problem then it's a problem with LVM activation. Agreed, which I'm guessing is why the
/etc/systemd/system/lvm_local.service
with the explicit LVM Before/Requires dependencies, and the
ExecStart=/sbin/vgchange --available y
seems to fix the problem. With the fixes in place booting finishes perfectly. Absolutely no more errors or warnings in dmesg, journalctl, etc.
That makes perfect sense. It's the sort of thing that should have been carried over from an init.d script. However, I'd ask why it is in a unit rather than in the udev rules.
Probably because it used to be done by /etc/init.d/boot.lvm
On 02/24/2015 12:01 PM, Per Jessen wrote:
That makes perfect sense. It's the sort of thing that should have been carried over from an init.d script. However, I'd ask why it is in a unit rather than in the udev rules.
Probably because it used to be done by /etc/init.d/boot.lvm
As a "first generation" change, yes, I agree. But for 13.2, a more mature "second" or "third" generation of the change should be in place. This is the sort of thing that belongs in udev rules; it's what they are for. Other device activation, at boot and with plugin devices, is all handled there.
Anton Aylward wrote:
On 02/24/2015 12:01 PM, Per Jessen wrote:
That makes perfect sense. It's the sort of thing that should have been carried over from an init.d script. However, I'd ask why it is in a unit rather than in the udev rules.
Probably because it used to be done by /etc/init.d/boot.lvm
As a "first generation" change, yes, I agree. But for 13.2, a more mature "second" or "third" generation of the change should be in place. This is the sort of thing that belongs in udev rules; it's what they are for. Other device activation, at boot and with plugin devices, is all handled there.
Without having thought it through, I think you're probably right, but OTOH there is little if anything gained by such a change. Besides, we've had udev for quite some time; maybe there's a reason the lvm activation was kept as an init-script, I dunno.
On Wed, Feb 25, 2015, at 05:30 AM, Anton Aylward wrote:
As a "first generation" change, yes, I agree. But for 13.2, a more mature "second" or "third" generation of the change should be in place. This is the sort of thing that belongs in udev rules; it's what they are for. Other device activation, at boot and with plugin devices, is all handled there.
Starting with the idea that udev is the right way to handle "this", I figured a fix could maybe be put together in /etc/udev/rules.d. I started by looking at the base rules:

  cd /usr/lib/udev/rules.d
  ls -1 *dm* *lvm* *raid* *disk* *ide* *ata* *scsi* | sort | uniq
  ls: cannot access *ata*: No such file or directory
  10-dm.rules
  11-dm-lvm.rules
  11-dm-mpath.rules
  13-dm-disk.rules
  55-scsi-sg3_id.rules
  56-idedma.rules
  58-scsi-sg3_symlink.rules
  63-md-raid-arrays.rules
  64-md-raid-assembly.rules
  69-dm-lvm-metad.rules
  80-udisks2.rules
  80-udisks.rules
  95-dm-notify.rules

I honestly have no clue where I'd even start. Other than "that's what RedHat did", I don't understand the flow.

I notice the 'udisks/udisks2' rule(s):

  rpm -ql `rpm -qa | grep -i udisks` | grep udev
  /usr/lib/udev/rules.d/80-udisks2.rules
  /usr/lib/udev/rules.d/80-udisks.rules
  /usr/lib/udev/udisks-dm-export
  /usr/lib/udev/udisks-part-id
  /usr/lib/udev/udisks-probe-ata-smart
  /usr/lib/udev/udisks-probe-sas-expander

On this machine it's running:

  ps ax | grep -i udisk
  2868 ?  Ssl  0:20 /usr/lib/udisks2/udisksd --no-debug

But checking other machines around the LAN, it's not. None of those other machines have the same config though, so I don't know what's necessary here. Reading http://www.freedesktop.org/wiki/Software/udisks/ I can see what it does generally, but what's udisks(2) for in an opensuse env? Do I need any of it here? Where does udisks(2) fit in?
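For what it's worth, the LVM auto-activation hook in that listing is 69-dm-lvm-metad.rules; its core is a rule along these lines (a paraphrased sketch, not the verbatim rule -- the match keys and pvscan options differ across LVM versions and distributions):

```
# Sketch in the spirit of 69-dm-lvm-metad.rules (illustrative only):
# when a block device carrying an LVM2 PV signature appears, ask LVM
# to scan it and auto-activate any VGs that become complete.
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="LVM2_member", \
    RUN+="/sbin/lvm pvscan --cache --activate ay --major $major --minor $minor"
```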
On 02/24/2015 09:25 AM, h15234@mailas.com wrote:
Unfortunately v210 is 'old' to begin with, which is why upstream won't apparently get involved in any discussion. I've read that one of the Suse devs is having luck with v219 on 13.2; maybe that'll get upgraded.
I'm not sure that this is a purely systemd problem. Yes, you've solved it by adding a unit, but as I say, disk activation is something udev rules should be taking care of. Regardless, I find that the vgchange command is there in my

  /usr/lib/systemd/system/lvm2-monitor.service

I'd check that on your system and see whether it's enabled, etc.
On top of that Opensuse's already had some issues with setting systemd dependencies correctly -- mostly what I've read is around wicked, though (what a mess!) Wouldn't be surprised if it shows up somewhere else too.
Wicked is a mess in its own right, and at the top of the list of reasons I'm not moving from 13.1 to 13.2 yet.
On Tue, Feb 24, 2015, at 07:13 AM, Anton Aylward wrote:
I'd check that on your system and see whether its enabled, etc.
Let's move comments to bug for now. Less messy than 2 places.
Wicked is its own mess in its own right and the top of the list of reasons I'm not moving from 13.1 to 13.2 yet.
Yep, that's a "beer discussion". TBH, I've never before seen a major new piece of functionality released without an alternative, while DISABLING current/working function. It's apparently been broken for the *entire* release cycle. For all the complaining about systemd (not from me, generally), at least it was introduced in Opensuse in parallel with sysvinit for one, maybe more, releases. We had a choice.
participants (3)
- Anton Aylward
- h15234@mailas.com
- Per Jessen