[Bug 1217416] New: MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely
https://bugzilla.suse.com/show_bug.cgi?id=1217416 Bug ID: 1217416 Summary: MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: x86-64 OS: openSUSE Tumbleweed Status: NEW Severity: Critical Priority: P5 - None Component: MicroOS Assignee: kubic-bugs@opensuse.org Reporter: jreder@dimedis.de QA Contact: qa-bugs@suse.de Target Milestone: --- Found By: --- Blocker: --- User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0 Build Identifier: We run K8s clusters using MicroOS for the nodes. K8s is provisioned with Terraform on Hetzner cloud platform using kube-hetzner Terraform module. Clusters ran fine for about six months with automatic upgrades and rebooting controlled using Kured. On November 16th the automatic upgrade failed on all nodes. On reboot after transactional-update ran: - /.snapshots is not mounted and empty - health-checker fails due to this - kubelet does not start - later transactional-update runs - it performs an "unused snapshot cleanup" and removes **all snapshots** from the filesystem - on next reboot the node is dead - no root filesystem present anymore This is the corresponding log output from transactional-update after the first boot: Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: Creating a new grub2 config Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:02 tukit 4.4.0 started Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:02 Options: call 52 bash -c /usr/sbin/grub2-mkconfig > /boot/grub2/grub.cfg Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:cont> Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:03 Executing `bash -c /usr/sbin/grub2-mkconfig > /boot/grub2/grub.cfg`: Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Generating grub configuration file ... Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found theme: /boot/grub2/themes/openSUSE/theme.txt Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found linux image: /boot/vmlinuz-6.6.1-1-default Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found initrd image: /boot/initrd-6.6.1-1-default Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Warning: os-prober will not be executed to detect other bootable partitions. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Systems on them will not be added to the GRUB boot configuration. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Check GRUB_DISABLE_OS_PROBER documentation entry. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Adding boot menu entry for UEFI Firmware Settings ... Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: done Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:04 Application returned with exit status 0. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:04 Transaction completed. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:04 tukit 4.4.0 started Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:04 Options: close 52 Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:cont> Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:05 New default snapshot is #52 (/.snapshots/52/snapshot). Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:05 Transaction completed. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: Warning: The following files were changed in the snapshot, but are shadowed by Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: other mounts and will not be visible to the system: Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: /.snapshots/52/snapshot/var/lib/dbus/machine-id Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: WARNING: This snapshot has been created from a different base (47) Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: than the previous default snapshot (51) and does not Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: contain the changes from the latter. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: New default snapshot is #52 (/.snapshots/52/snapshot). Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: transactional-update finished Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 tukit 4.4.0 started Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Options: reboot kured Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Triggering reboot using kured Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Transaction completed. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Deactivated successfully. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Finished Update the system. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Consumed 1min 15.285s CPU time. -- Boot 1a23c0e12e33484c99064b5dcc589168 -- Nov 20 00:07:10 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Starting Update the system... Nov 20 00:07:10 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Checking for newer version. Nov 20 00:07:11 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: transactional-update 4.4.0 started Nov 20 00:07:11 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Options: cleanup dup reboot Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Separate /var detected. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Adding cleanup algorithm to snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19832]: Snapshot '47' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set cleanup algorithm for snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Adding "important=yes" to snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19838]: Snapshot '47' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set "important=yes for snapshot" #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Mark unused snapshot #51 for deletion Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19847]: Snapshot '51' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set cleanup algorithm for snapshot #51 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/1 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/12 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/15 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/2 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/3 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/4 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/41 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/48 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/49 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/5 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/50 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/51 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/52 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/6 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/7 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/work-etc Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: 2023-11-20 00:07:12 tukit 4.4.0 started Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: 2023-11-20 00:07:12 Options: -c52 open Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: ERROR: Base snapshot '52' does not exist. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: transactional-update finished Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Main process exited, code=exited, status=1/FAILURE Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Failed with result 'exit-code'. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Failed to start Update the system. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Consumed 1.375s CPU time. This happened: - at the same time on two different nodes in two different clusters - as soon as we reboot another node on which a transactional-update ran after 16th November 2023 So it looks this is not specific to - K8s - the nodes / network / anything in the infrastructure Reproducible: Always Steps to Reproduce: Sorry, can't provide information here since all this is automatically controlled by the system itself. Actual Results: After reboot of an updated node it doesn't start any workloads. It looks that it removes all snapshots from the system. On next boot there is no root filesystem anymore. The node is dead. Expected Results: Node is alive after transactional-update, even after several reboots. transactional-update should not do anything if the /.snapshots folder is empty! It should not delete all snapshots! I started a discussion for this case in the forum of the kube-hetzner project: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/10... -- You are receiving this mail because: You are the assignee for the bug.
participants (1)
-
bugzilla_noreply@suse.com