Kubic 20220320 pods CrashLoopBackOff

Hi,

Ran into a bit of a pickle with Kubic yesterday after the last update. The cluster had been running 1.23.4 since January without issues, but after an update I noticed that coredns and kured were crashlooping[1]. Rolling back did not solve the issue. Judging by the log I also made available[2], the pods seemingly can't get an address for some reason. I tried a fresh install with the latest ISO, but sadly landed at the same issue. Any ideas what could cause this?

1: https://paste.opensuse.org/11246326
2: https://drive.google.com/file/d/1Cm6bH1XV4AhsyCK5hptXxqZkw2mWHyev/view?usp=s...

--
Br,
A.

On 2022-03-22 12:42, Attila Pinter wrote:
Hi Attila,

Do you have the logs from your kubeadm upgrade or your kubeadm runs of your fresh install?

I'm confused how you can be hitting this issue when openQA does not seem to be..
https://openqa.opensuse.org/tests/2256317#step/kubeadm/23

Regards,
Richard

------- Original Message ------- On Tuesday, March 22nd, 2022 at 7:22 PM, rbrown <rbrown@suse.de> wrote:
Hi Richard,

Yeah, it looks like a strange one, but I noticed that someone on the Kubic Matrix channel also ran into this issue. The kubeadm logs from journalctl[1] and the logs from /var/log/containers[2] are available, unfortunately only from the new cluster (1 control plane, 1 worker). Let me know if I can provide anything else or try something on the systems.

1: https://paste.opensuse.org/83126572
2: https://drive.google.com/file/d/1PBajdUtKxpXZAygnfr3UE2MTLN51b4f3/view?usp=s...

--
Br,
A.

------- Original Message ------- On Tuesday, March 22nd, 2022 at 8:22 PM, Attila Pinter <adathor@protonmail.com> wrote:
I reinstalled my test cluster yesterday, which fixed most of the issues, but a t-u update broke things again. I have disabled the t-u timer for now. So it looks like the pods are not getting an address, which is super weird:

```
kube-system   kured-vfzmr   0/1   CrashLoopBackOff   3 (18s ago)   35s   <none>   dev-k8s-master-1   <none>   <none>
```

I also broke my small test cluster, so I'm reinstalling that one now. It is up to date, so I can play and test things on it.

--
Br,
A.

On Wednesday, March 23rd, 2022 at 12:34 PM, Attila Pinter <adathor@protonmail.com> wrote:
One quick question: is it possible that t-u is updating the cluster? Using an old image from January to install a cluster, it starts at 1.23.0, and after a t-u dup it comes back as 1.23.4. I was under the impression that upgrading is only possible with kubicctl/kubeadm, but I could be wrong.

--
Br,
A.
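[Editor's note: on Kubic, `transactional-update dup` runs a full distribution upgrade in a new snapshot, and that upgrade includes the kubernetes packages, so node binaries can move from 1.23.0 to 1.23.4 without kubicctl or kubeadm ever being invoked. A minimal sketch for comparing the node's package version against what the cluster reports; the guard makes it a harmless no-op on machines without the tooling:]

```shell
# Sketch: compare the kubeadm shipped on the node with the kubelet versions
# the cluster reports. A t-u dup can bump the former without an upgrade run.
if command -v kubeadm >/dev/null 2>&1; then
  echo "node kubeadm: $(kubeadm version -o short)"
  # the VERSION column below is the kubelet version each node reports
  kubectl get nodes -o wide
else
  note="kubeadm not found; run this on a cluster node"
  echo "$note"
fi
```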

------- Original Message ------- On Wednesday, March 23rd, 2022 at 2:14 PM, Attila Pinter <adathor@protonmail.com> wrote:
Finished reinstalling my test cluster - from the latest Kubic ISO - so it is now on the latest Kubic (1.23.4), and I see the same CrashLoopBackOff as before: https://paste.opensuse.org/92716101.

--
Br,
A.

------- Original Message ------- On Wednesday, March 23rd, 2022 at 2:55 PM, Attila Pinter <adathor@protonmail.com> wrote:
Some additional logs from the weave, weave-init, and coredns pods: https://paste.opensuse.org/87628794. Seems to me the issue is in there, though I could be wrong.

Br,
A.

On Wed, 2022-03-23 at 09:03 +0000, Attila Pinter wrote:
I have the same problem with my Kubic cluster:

```
$ kubectl -n kube-system logs weave-net-8mmmc -c weave-init
modprobe: can't load module nfnetlink (kernel/net/netfilter/nfnetlink.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel
```

If anyone needs more info I'd be glad to provide it.

Thanks,
Robert
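[Editor's note: modprobe's "invalid module format" commonly means the module tree on disk no longer matches the running kernel, e.g. after an update that has not been rebooted into, or a rollback that left a newer module tree behind. A minimal sketch for checking this; the status messages are my own wording:]

```shell
# Compare the running kernel with the module trees installed on disk.
# If they disagree, modprobe will fail for any module, nfnetlink included.
running="$(uname -r)"
if [ -d "/lib/modules/$running" ]; then
  status="module tree present for the running kernel ($running)"
else
  status="no /lib/modules/$running -- modules and kernel are out of sync; reboot or roll back"
fi
echo "$status"
ls /lib/modules 2>/dev/null || true   # every installed kernel's module tree
```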

On 2022-03-23 10:51, Robert Munteanu wrote:
Hi Robert,

Very interesting! This might be the clue I've been missing - if the kernel updated in a way that broke networking, then the problem isn't anything to do with the recent kubernetes package updates, but with the kernel.

That would explain why I couldn't find fault in what I'd done recently ;)

The last kernel update was in snapshot 0319.. can people roll their Kubic hosts back to snapshots older than that and tell me if the problems go away? It would be a huge help if we can narrow the problem down to that kernel update.

Regards,
Richard
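[Editor's note: for anyone trying the rollback Richard asks for, a sketch of the steps on a MicroOS/Kubic host. The snapshot number is a placeholder, and the whole thing is guarded so it only acts on a transactional system:]

```shell
# Roll the host back to a snapshot older than the 0319 kernel update.
if command -v transactional-update >/dev/null 2>&1; then
  snapper list                      # pick a snapshot dated before 2022-03-19
  transactional-update rollback 42  # 42 is a hypothetical snapshot number
  echo "now reboot to boot the rolled-back snapshot"
else
  note="transactional-update not found; commands shown for reference only"
  echo "$note"
fi
```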

On 2022-03-23 11:10, rbrown wrote:
This was the 0319 kernel update changeset, for the record: https://build.opensuse.org/package/rdiff/openSUSE:Factory/kernel-source?link...

Does anyone see any changes that might explain the nfnetlink module not loading?

On Wed, 2022-03-23 at 11:10 +0100, rbrown wrote:
The last kernel update was in snapshot 0319..can people roll their Kubic hosts back to snapshots older than that and tell me if the problems go away?
Hi Richard,

For the record, that caught my attention since it was also in Attila's paste output. I rolled back all my Kubic VMs to a snapshot taken around "2022-03-18 01:43:57" and things are getting back to normal.

FWIW, this is the node 'wide' output for a node after I rolled it back:

```
kubic-worker-1   Ready      <none>   484d   v1.23.0   10.25.0.43   <none>   openSUSE MicroOS   5.16.14-1-default   cri-o://1.22.0
```

and this is the output for a node that was not rolled back (ignore the NotReady status, it was just rebooted):

```
kubic-worker-2   NotReady   <none>   484d   v1.23.4   10.25.0.40   <none>   openSUSE MicroOS   5.16.15-1-default   cri-o://1.23.2
```

Now I have to wait a bit since I reached the DockerHub pull limits, but things are stabilising.

Thanks,
Robert

Hi Richard,

On Tue, 2022-03-22 at 13:22 +0100, rbrown wrote:
I'm confused how you can be hitting this issue when openQA does not seem to be..
openQA is actually hitting this issue, but it went unnoticed:
https://openqa.opensuse.org/tests/2256317#step/kubeadm/17

The coredns pods are in a CrashLoopBackOff state, but there seems to be no assertion related to that.

Thanks,
Robert

Hi all,

On 22.03.22 at 12:42 Attila Pinter wrote:
Anyone using k3s on openSUSE MicroOS? I noticed problems with coredns the day before yesterday on some of my single-node k3s clusters. Two x86 machines and a Raspi4 are showing lots of pods in CrashLoopBackOff, while another single-node x86 machine and my 3-node x86 cluster are running fine. All are on the same kernel version (5.16.15-1-default) and k3s version (1.22.6) with Cilium 1.11.2. Errors from the coredns pods indicate that they cannot talk to outside DNS servers anymore due to timeouts. The hosts themselves are working fine as far as I can see, no errors. [...]
```
CoreDNS-1.8.6 linux/arm64, go1.17.1, 13a9191
[ERROR] plugin/errors: 2 2710547001195683759.7881225443334916626. HINFO: read udp 10.0.0.190:35403->192.168.99.1:53: i/o timeout
[ERROR] plugin/errors: 2 2710547001195683759.7881225443334916626. HINFO: read udp 10.0.0.190:35711->192.168.99.121:53: i/o timeout
[...]
```
As far as I understood, Attila's and Robert's problem (Kubic 20220320 pods CrashLoopBackOff) was caused by weave not working properly, so I do not think this is related. However, I wanted to report it here in case anyone is experiencing similar issues...

Kind Regards,
Johannes

--
Johannes Kastl
Linux Consultant & Trainer
Tel.: +49 (0) 151 2372 5802
Mail: kastl@b1-systems.de
B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg
http://www.b1-systems.de
GF: Ralph Dehner
Unternehmenssitz: Vohburg / AG: Ingolstadt, HRB 3537
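[Editor's note: to separate a CoreDNS problem from plain upstream reachability (the 192.168.99.x timeouts in Johannes's log above), the upstream Kubernetes DNS debugging guide suggests resolving names from a throwaway pod. A hedged sketch, guarded so it only acts where kubectl exists; the pod name is arbitrary:]

```shell
# Resolve an in-cluster name from inside the pod network. If this works but
# external names time out, the fault is upstream reachability, not CoreDNS.
if command -v kubectl >/dev/null 2>&1; then
  kubectl run -it --rm dnscheck \
    --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
    --restart=Never -- nslookup kubernetes.default
else
  note="kubectl not found; run this against the affected cluster"
  echo "$note"
fi
```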

Hi all,

On 01.04.22 at 10:42 Johannes Kastl wrote:
Anyone using k3s on openSUSE MicroOS?
I noticed problems with coredns the day before yesterday on some of my singlenode k3s clusters. Two x86 machines and a Raspi4 are showing lots of pods in CrashLoopBackOff, while another singlenode x86 and my 3-node x86 cluster is running fine.
All on the same kernel version (5.16.15-1-default) and k3s version (1.22.6) with Cilium 1.11.2.
As the problem also appeared on the "good" nodes, I did some more debugging, and reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve the problem. I'll report the bug upstream at k3s and see what I can find out.

Kind Regards,
Johannes

On 04.04.22 at 14:24 Johannes Kastl wrote:
reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve

Meh. This should of course read "5.16.14-1-default (14, not 15!)"...
Johannes

On Mon, 2022-04-04 at 14:29 +0200, Johannes Kastl wrote:
On 04.04.22 at 14:24 Johannes Kastl wrote:
reverting to a snapshot with kernel 5.14.1 (14, not 15!) seems to solve

Meh. This should of course read "5.16.14-1-default (14, not 15!)"...
FWIW, this was the revert that fixed my weave CNI problems as well.

Thanks,
Robert

Hi Robert, On 04.04.22 at 15:18 Robert Munteanu wrote:
On Mon, 2022-04-04 at 14:29 +0200, Johannes Kastl wrote:
FWIW, this was the revert that fixed my weave CNI problems as well.
That is where I got the idea from, although I had hoped the problems would not be related... Unfortunately this does not seem to be fixed in 5.17.1, which one of my boxes already got (the one where I left the transactional-update timer enabled, while I stopped and disabled it on the others). And, of course, one of my boxes does not have a BTRFS snapshot with the old kernel. :-)

Johannes

On 04.04.22 at 14:29 Johannes Kastl wrote:
Here is the bug report: https://bugzilla.opensuse.org/show_bug.cgi?id=1198064

Kind Regards,
Johannes
participants (4)
- Attila Pinter
- Johannes Kastl
- rbrown
- Robert Munteanu