On Thu, 2022-03-24 at 09:37 +0100, Robert Munteanu wrote:
On Wed, 2022-03-23 at 11:54 +0100, Robert Munteanu wrote:
Hi Thorsten,
On Wed, 2022-03-23 at 11:42 +0100, Thorsten Kukuk wrote:
On Wed, Mar 23, Robert Munteanu wrote:
I have the same problem with my Kubic cluster:
$ kubectl -n kube-system logs weave-net-8mmmc -c weave-init
modprobe: can't load module nfnetlink (kernel/net/netfilter/nfnetlink.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel
Works fine for me (but it's not running Kubernetes):

microos:~ # lsmod | grep nfnetlink
nfnetlink              20480  0
microos:~ # cat /etc/os-release
NAME="openSUSE MicroOS"
# VERSION="20220321"
Could it be that your disk is full, or that nfnetlink.ko.zst is corrupted in some other way? What does "rpm -V kernel-default" say?
I already rolled back, following Richard's request to test older snapshots, and the problem is (temporarily) solved.
FWIW, the old kernel version was 5.16.14 and the new one was 5.16.15.
I don't think free space is an issue: all my Kubic VMs have at least 5 GB available on the root partition after a rollback, and 8 GB on /var. I don't think the snapshot rollback freed up that much space.
Also, the problem affected all 3 VMs and no pods would get an IP address, so I'm not sure this was a disk space problem.
If you think this is useful, I will try and get another VM rolled forward to the latest kernel version and check the free space issue and the kernel RPM integrity.
Here's the update:
$ rpm -q kernel-default
kernel-default-5.16.15-1.1.x86_64
$ rpm -V kernel-default
(no errors, that is)
$ modprobe nfnetlink
$ lsmod | grep nfnetlink
nfnetlink              20480  0
The problem now is that, somehow, the rollback left the static manifests in /etc/kubernetes/manifests pinned to v1.23.0, whose images are gone from the registry, so the API server is down:
Mar 24 08:32:10 kubic-master-1 kubelet[1224]: E0324 08:32:10.629167 1224 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.opensuse.org/kubic/kube-apiserver:v1.23.0\\\"\"" pod="kube-system/kube-apiserver-kubic-master-1" podUID=6498bfe4d6f53138be78d065788b23e4
I'm going to try and patch them in place to point to 1.23.4, which is what I assume is available right now. However, container images disappearing from the registry really makes it hard to troubleshoot or roll back.
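For the record, my plan is a simple in-place rewrite of the image tags, along these lines (this assumes the v1.23.4 images are actually published under registry.opensuse.org/kubic and that the static manifests end in .yaml; the kubelet watches the manifests directory, so it should restart the static pods on its own):

$ # rewrite the pinned image tag in all static pod manifests (glob is an assumption)
$ sudo sed -i 's/v1\.23\.0/v1.23.4/g' /etc/kubernetes/manifests/*.yaml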
This gets even more fun. Once I manually modprobe nfnetlink, I see the following in the weave-init container logs:

modprobe: can't load module ip_set (kernel/net/netfilter/ipset/ip_set.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel

I resorted to manually loading the ip_set module as well, and after that the init container did not complain anymore (see below for making this persistent across reboots). However, this still did not resolve the problem. Only 2/3 weave-net pods are up now:

$ kubectl get pod -l name=weave-net
NAME              READY   STATUS    RESTARTS   AGE
weave-net-6db9t   2/2     Running   0          117m
weave-net-96vnw   1/2     Running   0          118m
weave-net-xxrtp   2/2     Running   0          118m

The only thing that stands out for the not-ready weave pod is a large number of 'Vetoed installation of hairpin flow...' messages:

$ kubectl logs weave-net-6db9t -c weave | grep -c 'Vetoed installation of hairpin'
30
$ kubectl logs weave-net-96vnw -c weave | grep -c 'Vetoed installation of hairpin'
579
$ kubectl logs weave-net-xxrtp -c weave | grep -c 'Vetoed installation of hairpin'
6

If I try to get the weave status, all pods report something similar to:

$ kubectl exec weave-net-xxrtp -c weave -- /home/weave/weave --local status
        Version: 2.8.1 (failed to check latest version - see logs; next check at 2022/03/24 15:14:54)

        Service: router
       Protocol: weave 1..2
           Name: b2:4c:a5:c8:b2:89(kubic-master-1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 2
    Connections: 2 (2 established)
          Peers: 3 (with 6 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

Listing pods with kubectl --all-namespaces -o wide shows that weave has not allocated any IP addresses to any pods; only the ones that use the host network have them.

Another piece of information: running `watch ip addr` reveals a huge churn in IP addresses; I see a couple of virtual addresses coming and going on every refresh cycle.
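To avoid having to modprobe by hand after every reboot, I'm thinking of declaring the two modules on each host, roughly like this (assuming systemd-modules-load is what loads extra modules on MicroOS/Kubic; the file name weave.conf is just my choice):

$ # hypothetical file name; any *.conf under /etc/modules-load.d should work
$ printf 'nfnetlink\nip_set\n' | sudo tee /etc/modules-load.d/weave.conf
$ sudo systemctl restart systemd-modules-load.service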
I checked the systemd logs for a certain pod, but nothing stands out to me:

Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.555204679Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/0a17ec7c-6a80-4c87-ad07-474e65f1c1df Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.555636823Z" level=info msg="Checking pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 for CNI network weave (type=weave-net)"
Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.735153225Z" level=info msg="Ran pod sandbox d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 with infra container: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=366231ac-359a-4a98-abdd-b9bd968889d6 name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:17 kubic-worker-2 kubelet[1235]: E0324 11:22:17.736658 1235 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"prometheus-pushgateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=prometheus-pushgateway pod=prometheus-pushgateway-8655bf87b9-7s8z6_lmn-system(94debd66-9a71-4822-8439-0d9cb6edc00f)\"" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" podUID=94debd66-9a71-4822-8439-0d9cb6edc00f
Mar 24 11:22:18 kubic-worker-2 kubelet[1235]: I0324 11:22:18.473417 1235 kuberuntime_manager.go:517] "Sandbox for pod has no IP address. Need to start a new one" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6"
Mar 24 11:22:18 kubic-worker-2 kubelet[1235]: I0324 11:22:18.473727 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerStarted Data:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732}
Mar 24 11:22:18 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:18.505034449Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/0a17ec7c-6a80-4c87-ad07-474e65f1c1df Networks:[{Name:weave Ifname:eth0}] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:18 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:18.505271666Z" level=info msg="Deleting pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 from CNI network \"weave\" (type=weave-net)"
Mar 24 11:22:25 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:25.860145921Z" level=info msg="Running pod sandbox: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=c238805b-4797-4c57-ad72-a74caf002bfa name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:26 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:26.097940633Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:26 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:26.098189993Z" level=info msg="Adding pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 to CNI network \"weave\" (type=weave-net)"
Mar 24 11:22:26 kubic-worker-2 kubelet[1235]: I0324 11:22:26.611351 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerDied Data:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732}
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.585356194Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.585713497Z" level=info msg="Checking pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 for CNI network weave (type=weave-net)"
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.767330381Z" level=info msg="Ran pod sandbox b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 with infra container: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=c238805b-4797-4c57-ad72-a74caf002bfa name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:34 kubic-worker-2 kubelet[1235]: E0324 11:22:34.769600 1235 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"prometheus-pushgateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=prometheus-pushgateway pod=prometheus-pushgateway-8655bf87b9-7s8z6_lmn-system(94debd66-9a71-4822-8439-0d9cb6edc00f)\"" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" podUID=94debd66-9a71-4822-8439-0d9cb6edc00f
Mar 24 11:22:35 kubic-worker-2 kubelet[1235]: I0324 11:22:35.744818 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerStarted Data:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4}
Mar 24 11:22:35 kubic-worker-2 kubelet[1235]: I0324 11:22:35.745477 1235 kuberuntime_manager.go:517] "Sandbox for pod has no IP address. Need to start a new one" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6"
Mar 24 11:22:35 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:35.753802361Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[{Name:weave Ifname:eth0}] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:35 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:35.753897070Z" level=info msg="Deleting pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 from CNI network \"weave\" (type=weave-net)"

I have no idea how to go on right now; any ideas would be appreciated.

Thanks,
Robert