[opensuse] Boot problems after Leap 15.2 upgrade on libvirt host
Hello,

Yesterday, having successfully updated two VMs I run services on to Leap 15.2 in the weeks before, I figured upgrading the host they run on shouldn't be a problem. So, I replaced 15.1 with releasever in my repo addresses, ran zypper --releasever=15.2 ref and then zypper --releasever=15.2 dup and kexec'ed as usual... And the server didn't come up.

When I plugged a monitor in, I was greeted with a GRUB rescue prompt. It said it can't load normal.mod from /@/.snapshots/{default subvolume snapshot number}/snapshot/boot/grub2/i386-pc. Well, we know that grub resides on a separate subvolume in openSUSE's default btrfs layout, and lo and behold, it was there, and pointing GRUB to the proper place was enough to get a normal shell. From there I tried sourcing the config file, which failed, apparently because it couldn't find the font. So I booted from a pendrive, checked that everything looked all right, looked up the correct command line and manually pointed GRUB to the kernel and initrd. I thought: okay, that can be figured out soon enough, I'll ask about it later.

Everything seemed fine; neither the RAID controller (some LSI MegaRAID that doesn't even support JBOD) nor btrfs reported any errors. But then docker in both VMs failed to start, throwing errors that, after some internet searching, were attributed to heavy IO loads (something about failed atomic operations). And when I manually stopped and started docker, it turned out docker.sock had magically spawned as a folder and I had to delete it before docker could start, too.

I tried rebooting everything (guests are shut down and started up again by libvirt-guests) to no avail. Everything went down identically. I tried force reinstalling grub2-i386-pc and grub2-branding-openSUSE. I've looked all over the journal for anything that would indicate where the problem is and found nothing.
Potentially important info:
- Both the host and VMs use btrfs on GPT on BIOS (no EFI in the picture)
- Each of them has several GBs of unallocated space
- The storage is a MegaRAID RAID 10 array
- I've been using the hardware without any problems with 15.0 and 15.1
- The VMs are configured with qcow2 images via VirtIO SCSI with discard configured (otherwise the host file system gets clogged up with unused allocations - not too much disk space here)
- I hadn't had any problems prior to upgrading the host to 15.2; the VMs have been running 15.2 without any difficulty for the last week or two. I've even installed some updates since upgrading them

(I know, I know, not the most efficient way of doing things. It's served me reasonably well for the last 1.5 years)

What more can I do to diagnose the problems? What could be the reasons? AFAIK, all the hardware is still going strong and, once all the dancing around with grub, rmdir and systemctl is done with, everything runs as it should. (Maybe a little slower? I might be imagining it or it could be unprimed disk caches)

I think that later today, I'll revert to the snapshot from before the upgrade and see if that helps for the time being.

Regards
Radosław Wyrzykowski

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 7/19/20 5:52 AM, Radosław Wyrzykowski wrote:
> What more can I do to diagnose the problems? What could be the reasons? AFAIK, all the hardware is still going strong and, once all the dancing around with grub, rmdir and systemctl is done with, everything runs as it should. (Maybe a little slower? I might be imagining it or it could be unprimed disk caches) I think that later today, I'll revert to the snapshot from before the upgrade and see if that helps for the time being.
>
> Regards
> Radosław Wyrzykowski
This is a GUESS... I have only toyed with qcow2 VMs under KVM, so I am not sure what needs to be rebuilt on a kernel update, but with VirtualBox (installed from the Oracle rpm, with manually built drivers), kernel-devel, kernel-source and kernel-syms are needed.

However, there is a problem in 15.2 with the purge-kernels process: a kernel update will update the kernel but not the -devel, -source or -syms packages, leaving the system incapable of building whatever kernel modules need to be built from source.

Like I said, this is a guess, but if part of your setup requires dkms module builds, this is one place to check.

--
David C. Rankin, J.D.,P.E.
So, I rolled back to 15.1 and… Docker on the VMs (which were 15.2 the whole time) started up properly. I looked in the journal and it seems like there's a timeout set in the unit file that it was hitting, and then systemd continually restarted it every couple of seconds… Not sure which change that 15.2 made on the host is the cause… Maybe turning off autostart for the VMs in libvirt and relying on libvirt-guests to boot them after the host boot process is done would help… (There are still some atomic operation errors logged in the journal but it seems they don't cause a delay long enough to trigger the timeout. Alternatively, there is no timeout in the unit file shipped by 15.1. Gotta check.)

As for GRUB, even though I reinstalled it and rebuilt the config, it still boots to a rescue shell with the same problem - it's looking for its modules in the default snapshot subvolume and not the dedicated GRUB subvolume. I compared grub.cfg with the one in one of the VMs (which was booting normally the whole time, using the same filesystem layout as the host). It seems the file on the host is missing all of the custom SUSE btrfs support stuff and was, in fact, missing it before the upgrade for some reason. Of course, I hadn't detected it before, because I use kexec to reboot the server after updates.

I don't really want to start sifting through my backups to narrow down exactly when it happened, but since the files in /etc/grub.d are intact, it means some variable unset itself somehow. Maybe someone could point me in the right direction? Now that I think about it, I've been seeing something about the device not being found and then grub installing itself 'successfully' anyway after grub updates lately… Could be related.

Regards
Radosław Wyrzykowski

PS. David, sorry if my answer (no out-of-tree modules) didn't get to you. It seems KMail somehow ended up sending it to opensuse-factory and I hadn't noticed.
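The grub.cfg comparison described in this message can be made a little more systematic. The following is only a sketch, assuming both files were produced by grub2-mkconfig, which wraps the output of each /etc/grub.d script in "### BEGIN ... ###" and "### END ... ###" marker lines. The function names section_sizes and compare_configs are made up for this illustration; nothing here is an openSUSE tool.

```python
import re

# grub2-mkconfig surrounds the output of every generator script with
# "### BEGIN /etc/grub.d/NN_name ###" and "### END /etc/grub.d/NN_name ###".
BEGIN_MARKER = re.compile(r"^### BEGIN (?P<script>\S+) ###\s*$")
END_MARKER = re.compile(r"^### END (?P<script>\S+) ###\s*$")


def section_sizes(cfg_text):
    """Map each generator script to the number of non-blank lines it emitted."""
    sizes = {}
    current = None
    for line in cfg_text.splitlines():
        begin = BEGIN_MARKER.match(line)
        if begin:
            current = begin.group("script")
            sizes.setdefault(current, 0)
            continue
        if END_MARKER.match(line):
            current = None
            continue
        if current is not None and line.strip():
            sizes[current] += 1
    return sizes


def compare_configs(host_text, guest_text):
    """Return {script: (host_lines, guest_lines)} for sections whose size differs."""
    host = section_sizes(host_text)
    guest = section_sizes(guest_text)
    return {
        script: (host.get(script, 0), guest.get(script, 0))
        for script in sorted(set(host) | set(guest))
        if host.get(script, 0) != guest.get(script, 0)
    }

# Hypothetical usage, with copies of the two files saved under example names:
#   with open("host-grub.cfg") as h, open("guest-grub.cfg") as g:
#       print(compare_configs(h.read(), g.read()))
```

A section that is empty on the host but populated in the guest points at the /etc/grub.d script whose output went missing, which narrows down which setting in /etc/default/grub that script depends on. Line counts are a coarse signal only; a plain diff of the two sections is still needed to see what actually changed.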
On 28/07/2020 18.18, Radosław Wyrzykowski wrote:
> So, I rolled back to 15.1 and… Docker on the VMs (which were 15.2 the whole time) started up properly. I looked in the journal and it seems like there's a timeout set in the unit file that it was hitting and then systemd continually restarted it every couple of seconds…
You can probably increase the timeout.

--
Cheers / Saludos,
Carlos E. R.
(from 15.1 x86_64 at Telcontar)
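One common way to raise a start timeout without editing the packaged unit file is a systemd drop-in that overrides TimeoutStartSec. The sketch below only writes such a file; it assumes the affected unit inside each guest is docker.service, and the 300-second value, the file name timeout.conf and both function names are illustrative choices, not something taken from the thread.

```python
from pathlib import Path


def render_timeout_dropin(timeout_seconds):
    """Return the text of a drop-in that overrides only the start timeout."""
    return f"[Service]\nTimeoutStartSec={timeout_seconds}s\n"


def write_timeout_dropin(unit, timeout_seconds, base_dir="/etc/systemd/system"):
    """Write <base_dir>/<unit>.d/timeout.conf and return its path."""
    dropin_dir = Path(base_dir) / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "timeout.conf"
    path.write_text(render_timeout_dropin(timeout_seconds))
    return path

# Hypothetical usage, run as root inside a guest:
#   write_timeout_dropin("docker.service", 300)
```

After the file is in place, systemctl daemon-reload followed by a restart of the unit picks up the override, and systemctl cat docker.service shows the resulting configuration. A longer timeout only hides the slow start; it does not explain why the guests became slower on the 15.2 host.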
participants (3)
- Carlos E. R.
- David C. Rankin
- Radosław Wyrzykowski