[Bug 1217416] New: MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely
https://bugzilla.suse.com/show_bug.cgi?id=1217416 Bug ID: 1217416 Summary: MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: x86-64 OS: openSUSE Tumbleweed Status: NEW Severity: Critical Priority: P5 - None Component: MicroOS Assignee: kubic-bugs@opensuse.org Reporter: jreder@dimedis.de QA Contact: qa-bugs@suse.de Target Milestone: --- Found By: --- Blocker: --- User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0 Build Identifier: We run K8s clusters using MicroOS for the nodes. K8s is provisioned with Terraform on Hetzner cloud platform using kube-hetzner Terraform module. Clusters ran fine for about six months with automatic upgrades and rebooting controlled using Kured. On November 16th the automatic upgrade failed on all nodes. On reboot after transactional-update ran: - /.snapshots is not mounted and empty - health-checker fails due to this - kubelet does not start - later transactional-update runs - it performs an "unused snapshot cleanup" and removes **all snapshots** from the filesystem - on next reboot the node is dead - no root filesystem present anymore This is the corresponding log output from transactional-update after the first boot: Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: Creating a new grub2 config Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:02 tukit 4.4.0 started Nov 19 00:34:02 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:02 Options: call 52 bash -c /usr/sbin/grub2-mkconfig > /boot/grub2/grub.cfg Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:cont> Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:03 Executing `bash -c /usr/sbin/grub2-mkconfig > /boot/grub2/grub.cfg`: Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Generating grub configuration file ... Nov 19 00:34:03 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found theme: /boot/grub2/themes/openSUSE/theme.txt Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found linux image: /boot/vmlinuz-6.6.1-1-default Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Found initrd image: /boot/initrd-6.6.1-1-default Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Warning: os-prober will not be executed to detect other bootable partitions. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Systems on them will not be added to the GRUB boot configuration. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Check GRUB_DISABLE_OS_PROBER documentation entry. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: Adding boot menu entry for UEFI Firmware Settings ... Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: done Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:04 Application returned with exit status 0. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30290]: 2023-11-19 00:34:04 Transaction completed. Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:04 tukit 4.4.0 started Nov 19 00:34:04 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:04 Options: close 52 Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: /var/lib/kubelet/pods not reset as customized by admin to unconfined_u:object_r:container_file_t:s0 Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots not reset as customized by admin to unconfined_u:object_r:cont> Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:05 New default snapshot is #52 (/.snapshots/52/snapshot). Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[30938]: 2023-11-19 00:34:05 Transaction completed. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: Warning: The following files were changed in the snapshot, but are shadowed by Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: other mounts and will not be visible to the system: Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: /.snapshots/52/snapshot/var/lib/dbus/machine-id Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: WARNING: This snapshot has been created from a different base (47) Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: than the previous default snapshot (51) and does not Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: contain the changes from the latter. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: New default snapshot is #52 (/.snapshots/52/snapshot). Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[18583]: transactional-update finished Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 tukit 4.4.0 started Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Options: reboot kured Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Triggering reboot using kured Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt transactional-update[31158]: 2023-11-19 00:34:05 Transaction completed. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Deactivated successfully. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Finished Update the system. Nov 19 00:34:05 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Consumed 1min 15.285s CPU time. -- Boot 1a23c0e12e33484c99064b5dcc589168 -- Nov 20 00:07:10 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Starting Update the system... Nov 20 00:07:10 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Checking for newer version. Nov 20 00:07:11 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: transactional-update 4.4.0 started Nov 20 00:07:11 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Options: cleanup dup reboot Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Separate /var detected. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Adding cleanup algorithm to snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19832]: Snapshot '47' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set cleanup algorithm for snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Adding "important=yes" to snapshot #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19838]: Snapshot '47' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set "important=yes for snapshot" #47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Mark unused snapshot #51 for deletion Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19847]: Snapshot '51' not found. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19760]: ERROR: cannot set cleanup algorithm for snapshot #51 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/1 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/12 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/15 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/2 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/3 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/4 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/41 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/47 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/48 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/49 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/5 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/50 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/51 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/52 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/6 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/7 Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: Deleting unused overlay /var/lib/overlay/work-etc Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: 2023-11-20 00:07:12 tukit 4.4.0 started Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: 2023-11-20 00:07:12 Options: -c52 open Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: ERROR: Base snapshot '52' does not exist. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt transactional-update[19758]: transactional-update finished Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Main process exited, code=exited, status=1/FAILURE Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Failed with result 'exit-code'. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Failed to start Update the system. Nov 20 00:07:12 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: transactional-update.service: Consumed 1.375s CPU time. This happened: - at the same time on two different nodes in two different clusters - as soon as we reboot another node on which a transactional-update ran after 16th November 2023 So it looks this is not specific to - K8s - the nodes / network / anything in the infrastructure Reproducible: Always Steps to Reproduce: Sorry, can't provide information here since all this is automatically controlled by the system itself. Actual Results: After reboot of an updated node it doesn't start any workloads. It looks that it removes all snapshots from the system. On next boot there is no root filesystem anymore. The node is dead. Expected Results: Node is alive after transactional-update, even after several reboots. transactional-update should not do anything if the /.snapshots folder is empty! It should not delete all snapshots! I started a discussion for this case in the forum of the kube-hetzner project: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/10... -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1217416 https://bugzilla.suse.com/show_bug.cgi?id=1217416#c2 --- Comment #2 from Jörn Reder <jreder@dimedis.de> --- Sorry for the delay (was a few days off) and thanks for looking into this and it's good news that you could reproduce the problem when .snapshots mount is missing. I can provide logs from the health-checker service, but as far as I understood it does not give us a reason for the missing mountpoint. Hopefully you can read more out of it: Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Starting MicroOS Health Checker... Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Clearing GRUB flag Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1199]: grub2-editenv: error: cannot open `/boot/grub2/grubenv': Read-only file system. Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Starting health check Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1207]: <10>Nov 19 03:01:11 root: ERROR: "/usr/libexec/health-checker/btrfs-subvolumes-mounted.sh check" failed Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1212]: active Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Health check failed! Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt root[1233]: Machine didn't come up correctly, do a rollback Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1233]: <9>Nov 19 03:01:11 root: Machine didn't come up correctly, do a rollback Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1234]: mount: /.snapshots: mount point not mounted or bad option. Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1234]: dmesg(1) may have more information after failed mount system call. Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1235]: ERROR: Could not set default subvolume: Inappropriate ioctl for device Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt root[1236]: ERROR: btrfs set-default 449 failed! Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1236]: <10>Nov 19 03:01:11 root: ERROR: btrfs set-default 449 failed! Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1237]: /usr/sbin/health-checker: line 91: telem_send_payload: command not found Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: health-checker.service: Main process exited, code=exited, status=1/FAILURE Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: health-checker.service: Failed with result 'exit-code'. Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Failed to start MicroOS Health Checker. BTW: we could rescue most of the machines where this happened by going back to a working snapshot. I wrote a short report of that in the discussion forum of the kube-hetzner project: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/10... Since then all nodes run fine again and performed one transactional-update + reboot without problems. Nevertheless, it would be good if health-checker and transactional-update would stop if a .snapshots mountpoint is missing, as this can cause more damage than it helps. -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1217416 https://bugzilla.suse.com/show_bug.cgi?id=1217416#c3 --- Comment #3 from OBSbugzilla Bot <bwiedemann+obsbugzillabot@suse.com> --- This is an autogenerated message for OBS integration: This bug (1217416) was mentioned in https://build.opensuse.org/request/show/1154848 Factory / transactional-update -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@suse.com