Kubic 20220320 pods CrashLoopBackOff

Hi,

Ran into a bit of a pickle with Kubic yesterday after the last update. The cluster had been running 1.23.4 since January without issues, but after an update I noticed that coredns and kured were crashlooping[1]. Rolling back did not solve the issue. Judging by the log I also made available[2], the pods seemingly can't get an address for some reason. I tried a fresh install with the latest ISO, but sadly landed at the same issue. Any ideas what could cause this?

1: https://paste.opensuse.org/11246326
2: https://drive.google.com/file/d/1Cm6bH1XV4AhsyCK5hptXxqZkw2mWHyev/view?usp=s...

--
Br,
A.

On 2022-03-22 12:42, Attila Pinter wrote:
Hi Attila,

Do you have the logs from your kubeadm upgrade or your kubeadm runs of your fresh install?

I'm confused how you can be hitting this issue when openQA does not seem to be..
https://openqa.opensuse.org/tests/2256317#step/kubeadm/23

Regards,
Richard

------- Original Message ------- On Tuesday, March 22nd, 2022 at 7:22 PM, rbrown <rbrown@suse.de> wrote:
Hi Richard,

Yeah, it looks like a strange one, but I noticed that someone on the Kubic Matrix channel also ran into this issue. The kubeadm logs from journalctl[1] and the logs from /var/log/containers[2] are available, unfortunately only from the new cluster (1 control plane, 1 worker). Let me know if I can provide anything else or try something on the systems.

1: https://paste.opensuse.org/83126572
2: https://drive.google.com/file/d/1PBajdUtKxpXZAygnfr3UE2MTLN51b4f3/view?usp=s...

--
Br,
A.

------- Original Message ------- On Tuesday, March 22nd, 2022 at 8:22 PM, Attila Pinter <adathor@protonmail.com> wrote:
I reinstalled my test cluster yesterday, which fixed most of the issues, but a t-u update broke things again. I have disabled the t-u timer for now. So it looks like the pods are not getting an address, which is super weird:

```
kube-system   kured-vfzmr   0/1   CrashLoopBackOff   3 (18s ago)   35s   <none>   dev-k8s-master-1   <none>   <none>
```

I also broke my small test cluster, so I'm reinstalling that one now. It is up to date, so I can play and test things on it.

--
Br,
A.

On Wednesday, March 23rd, 2022 at 12:34 PM, Attila Pinter <adathor@protonmail.com> wrote:
One quick question: is it possible that t-u is updating the cluster? Using an old image from January to install a cluster, it starts at 1.23.0, and after a t-u dup it comes back as 1.23.4. I was under the impression that upgrading is only possible with kubicctl/kubeadm, but I could be wrong.

--
Br,
A.
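[Editor's note: on Kubic, `transactional-update dup` runs a full distribution upgrade in a new snapshot, and that upgrade includes the kubernetes packages, so node binaries can move from 1.23.0 to 1.23.4 without kubicctl or kubeadm ever being invoked. A minimal sketch for comparing the node's package version against what the cluster reports; the guard makes it a harmless no-op on machines without the tooling:]

```shell
# Sketch: compare the kubeadm shipped on the node with the kubelet versions
# the cluster reports. A t-u dup can bump the former without an upgrade run.
if command -v kubeadm >/dev/null 2>&1; then
  echo "node kubeadm: $(kubeadm version -o short)"
  # the VERSION column below is the kubelet version each node reports
  kubectl get nodes -o wide
else
  note="kubeadm not found; run this on a cluster node"
  echo "$note"
fi
```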

------- Original Message ------- On Wednesday, March 23rd, 2022 at 2:14 PM, Attila Pinter <adathor@protonmail.com> wrote:
Finished reinstalling my test cluster - from the latest Kubic ISO - so it is now on the latest Kubic (1.23.4), and I see the same CrashLoopBackOff as before: https://paste.opensuse.org/92716101.

--
Br,
A.

------- Original Message ------- On Wednesday, March 23rd, 2022 at 2:55 PM, Attila Pinter <adathor@protonmail.com> wrote:
Some additional logs from the weave, weave-init, and coredns pods: https://paste.opensuse.org/87628794. Seems to me the issue is in there, though I could be wrong.

Br,
A.

On Wed, 2022-03-23 at 09:03 +0000, Attila Pinter wrote:
I have the same problem with my Kubic cluster:

```
$ kubectl -n kube-system logs weave-net-8mmmc -c weave-init
modprobe: can't load module nfnetlink (kernel/net/netfilter/nfnetlink.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel
```

If anyone needs more info I'd be glad to provide it.

Thanks,
Robert
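[Editor's note: modprobe's "invalid module format" commonly means the module tree on disk no longer matches the running kernel, e.g. after an update that has not been rebooted into, or a rollback that left a newer module tree behind. A minimal sketch for checking this; the status messages are my own wording:]

```shell
# Compare the running kernel with the module trees installed on disk.
# If they disagree, modprobe will fail for any module, nfnetlink included.
running="$(uname -r)"
if [ -d "/lib/modules/$running" ]; then
  status="module tree present for the running kernel ($running)"
else
  status="no /lib/modules/$running -- modules and kernel are out of sync; reboot or roll back"
fi
echo "$status"
ls /lib/modules 2>/dev/null || true   # every installed kernel's module tree
```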

On 2022-03-23 10:51, Robert Munteanu wrote:
Hi Robert,

Very interesting! This might be the clue I've been missing - if the kernel updated in a way that broke networking, then the problem isn't anything to do with the recent kubernetes package updates, but with the kernel.

That would explain why I couldn't find fault in what I'd done recently ;)

The last kernel update was in snapshot 0319.. can people roll their Kubic hosts back to snapshots older than that and tell me if the problems go away? It would be a huge help if we can narrow the problem down to that kernel update.

Regards,
Richard
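[Editor's note: for anyone trying the rollback Richard asks for, a sketch of the steps on a MicroOS/Kubic host. The snapshot number is a placeholder, and the whole thing is guarded so it only acts on a transactional system:]

```shell
# Roll the host back to a snapshot older than the 0319 kernel update.
if command -v transactional-update >/dev/null 2>&1; then
  snapper list                      # pick a snapshot dated before 2022-03-19
  transactional-update rollback 42  # 42 is a hypothetical snapshot number
  echo "now reboot to boot the rolled-back snapshot"
else
  note="transactional-update not found; commands shown for reference only"
  echo "$note"
fi
```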

On 2022-03-23 11:10, rbrown wrote:
This was the 0319 kernel update changeset, for the record: https://build.opensuse.org/package/rdiff/openSUSE:Factory/kernel-source?link...

Does anyone see any changes that might explain the nfnetlink module not loading?

On Wed, 2022-03-23 at 11:10 +0100, rbrown wrote:
The last kernel update was in snapshot 0319..can people roll their Kubic hosts back to snapshots older than that and tell me if the problems go away?
Hi Richard,

For the record, that caught my attention since it was also in Attila's paste output. I rolled back all my Kubic VMs to a snapshot taken around "2022-03-18 01:43:57" and things are getting back to normal.

FWIW, this is the node 'wide' output for a node after I rolled it back:

```
kubic-worker-1   Ready      <none>   484d   v1.23.0   10.25.0.43   <none>   openSUSE MicroOS   5.16.14-1-default   cri-o://1.22.0
```

and this is the output for a node that was not rolled back (ignore the NotReady status, it was just rebooted):

```
kubic-worker-2   NotReady   <none>   484d   v1.23.4   10.25.0.40   <none>   openSUSE MicroOS   5.16.15-1-default   cri-o://1.23.2
```

Now I have to wait a bit since I reached the DockerHub pull limits, but things are stabilising.

Thanks,
Robert

Hi Richard,

On Tue, 2022-03-22 at 13:22 +0100, rbrown wrote:
I'm confused how you can be hitting this issue when openQA does not seem to be..
openQA is actually hitting this issue, but it went unnoticed:
https://openqa.opensuse.org/tests/2256317#step/kubeadm/17

The coredns pods are in a CrashLoopBackOff state, but there seems to be no assertion related to that.

Thanks,
Robert

Hi all,

On 22.03.22 at 12:42 Attila Pinter wrote:
Anyone using k3s on openSUSE MicroOS? I noticed problems with coredns the day before yesterday on some of my single-node k3s clusters. Two x86 machines and a Raspi4 are showing lots of pods in CrashLoopBackOff, while another single-node x86 machine and my 3-node x86 cluster are running fine. All are on the same kernel version (5.16.15-1-default) and k3s version (1.22.6) with Cilium 1.11.2. Errors from the coredns pods indicate that they cannot talk to outside DNS servers anymore due to timeouts. The hosts themselves are working fine as far as I can see, no errors. [...]
```
CoreDNS-1.8.6 linux/arm64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 2710547001195683759.7881225443334916626. HINFO: read udp 10.0.0.190:35403->192.168.99.1:53: i/o timeout
[ERROR] plugin/errors: 2 2710547001195683759.7881225443334916626. HINFO: read udp 10.0.0.190:35711->192.168.99.121:53: i/o timeout
[...]
```
As far as I understood, Attila's and Robert's problem (Kubic 20220320 pods CrashLoopBackOff) was caused by weave not working properly, so I do not think this is related. However, I wanted to report it here in case anyone is experiencing similar issues...

Kind Regards,
Johannes

--
Johannes Kastl
Linux Consultant & Trainer
Tel.: +49 (0) 151 2372 5802
Mail: kastl@b1-systems.de
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg
http://www.b1-systems.de
GF: Ralph Dehner
Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537
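[Editor's note: to separate a CoreDNS problem from plain upstream reachability (the 192.168.99.x timeouts in Johannes's log above), the upstream Kubernetes DNS debugging guide suggests resolving names from a throwaway pod. A hedged sketch, guarded so it only acts where kubectl exists; the pod name is arbitrary:]

```shell
# Resolve an in-cluster name from inside the pod network. If this works but
# external names time out, the fault is upstream reachability, not CoreDNS.
if command -v kubectl >/dev/null 2>&1; then
  kubectl run -it --rm dnscheck \
    --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
    --restart=Never -- nslookup kubernetes.default
else
  note="kubectl not found; run this against the affected cluster"
  echo "$note"
fi
```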

Hi all,

On 01.04.22 at 10:42 Johannes Kastl wrote:
Anyone using k3s on openSUSE MicroOS?
I noticed problems with coredns the day before yesterday on some of my singlenode k3s clusters. Two x86 machines and a Raspi4 are showing lots of pods in CrashLoopBackOff, while another singlenode x86 and my 3-node x86 cluster is running fine.
All on the same kernel version (5.16.15-1-default) and k3s version (1.22.6) with Cilium 1.11.2.
As the problem also appeared on the "good" nodes, I did some more debugging, and reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve the problem. I'll report the bug upstream at k3s and see what I can find out.

Kind Regards,
Johannes

On 04.04.22 at 14:24 Johannes Kastl wrote:
reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve

Meh. This should of course read "5.16.14-1-default (14, not 15!)"...
Johannes

On Mon, 2022-04-04 at 14:29 +0200, Johannes Kastl wrote:
On 04.04.22 at 14:24 Johannes Kastl wrote:
reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve

Meh. This should of course read "5.16.14-1-default (14, not 15!)"...
FWIW, this was the revert that fixed my weave CNI problems as well.

Thanks,
Robert

Hi Robert, On 04.04.22 at 15:18 Robert Munteanu wrote:
On Mon, 2022-04-04 at 14:29 +0200, Johannes Kastl wrote:
FWIW, this was the revert that fixed my weave CNI problems as well.
That is where I got the idea from, although I had hoped the problems would not be related... Unfortunately this does not seem to be fixed in 5.17.1, which one of my boxes already got (the one where I left the transactional-update timer enabled, while I stopped and disabled it on the others). And, of course, one of my boxes does not have a BTRFS snapshot with the old kernel. :-)

Johannes

On 04.04.22 at 14:29 Johannes Kastl wrote:
Here is the bug report: https://bugzilla.opensuse.org/show_bug.cgi?id=1198064

Kind Regards,
Johannes
participants (4)
- Attila Pinter
- Johannes Kastl
- rbrown
- Robert Munteanu