On Thu, 2022-03-24 at 09:37 +0100, Robert Munteanu wrote:
On Wed, 2022-03-23 at 11:54 +0100, Robert Munteanu wrote:
Hi Thorsten,
On Wed, 2022-03-23 at 11:42 +0100, Thorsten Kukuk wrote:
On Wed, Mar 23, Robert Munteanu wrote:
I have the same problem with my Kubic cluster:
$ kubectl -n kube-system logs weave-net-8mmmc -c weave-init
modprobe: can't load module nfnetlink (kernel/net/netfilter/nfnetlink.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel
Works fine for me (but it's not running Kubernetes):

microos:~ # lsmod | grep nfnetlink
nfnetlink              20480  0
microos:~ # cat /etc/os-release
NAME="openSUSE MicroOS"
# VERSION="20220321"
Could it be that your disk is full, or that nfnetlink.ko.zst is corrupted in some other way? What does "rpm -V kernel-default" say?
I already rolled back, following Richard's request to test older snapshots, and the problem is (temporarily) solved.
FWIW, the old kernel version was 5.16.14 and the new one was 5.16.15.
I don't think free space is an issue: all my Kubic VMs have at least 5 GB available on the root partition after a rollback, and 8 GB on /var. I don't think the snapshot rollback freed up that much space.
Also, the problem affected all 3 VMs and no pods would get an IP address, so I'm not sure this was a disk space problem.
If you think this is useful, I will try and get another VM rolled forward to the latest kernel version and check the free space issue and the kernel RPM integrity.
Here's the update:
$ rpm -q kernel-default
kernel-default-5.16.15-1.1.x86_64
$ rpm -V kernel-default
(no errors, that is)
$ modprobe nfnetlink
$ lsmod | grep nfnetlink
nfnetlink              20480  0
The problem now is that, somehow, the rollback left the static manifests in /etc/kubernetes/manifests pinned to v1.23.0, whose images are gone from the registry, so the API server is down:
Mar 24 08:32:10 kubic-master-1 kubelet[1224]: E0324 08:32:10.629167 1224 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with ImagePullBackOff: \"Back-off pulling image \\\"registry.opensuse.org/kubic/kube-apiserver:v1.23.0\\\"\"" pod="kube-system/kube-apiserver-kubic-master-1" podUID=6498bfe4d6f53138be78d065788b23e4
I'm going to try and patch them in place to point to 1.23.4, which is what I assume is available right now. However, container images disappearing from the registry really makes it hard to troubleshoot or roll back.
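For the record, my plan is a simple in-place rewrite of the image tags, along these lines (this assumes the v1.23.4 images are actually published under registry.opensuse.org/kubic and that the static manifests end in .yaml; the kubelet watches the manifests directory, so it should restart the static pods on its own):

$ # rewrite the pinned image tag in all static pod manifests (glob is an assumption)
$ sudo sed -i 's/v1\.23\.0/v1.23.4/g' /etc/kubernetes/manifests/*.yaml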
This gets even more fun. Once I manually modprobe nfnetlink, I see the following in the weave-init container logs:

modprobe: can't load module ip_set (kernel/net/netfilter/ipset/ip_set.ko.zst): invalid module format
Ignore the error if "xt_set" is built-in in the kernel

I resorted to manually loading the ip_set module as well, and after that the init container did not complain anymore (see below for making this persistent across reboots). However, this still did not resolve the problem. Only 2/3 weave-net pods are up now:

$ kubectl get pod -l name=weave-net
NAME              READY   STATUS    RESTARTS   AGE
weave-net-6db9t   2/2     Running   0          117m
weave-net-96vnw   1/2     Running   0          118m
weave-net-xxrtp   2/2     Running   0          118m

The only thing that stands out for the not-ready weave pod is a large number of 'Vetoed installation of hairpin flow...' messages:

$ kubectl logs weave-net-6db9t -c weave | grep -c 'Vetoed installation of hairpin'
30
$ kubectl logs weave-net-96vnw -c weave | grep -c 'Vetoed installation of hairpin'
579
$ kubectl logs weave-net-xxrtp -c weave | grep -c 'Vetoed installation of hairpin'
6

If I try to get the weave status, all pods report something similar to:

$ kubectl exec weave-net-xxrtp -c weave -- /home/weave/weave --local status
        Version: 2.8.1 (failed to check latest version - see logs; next check at 2022/03/24 15:14:54)

        Service: router
       Protocol: weave 1..2
           Name: b2:4c:a5:c8:b2:89(kubic-master-1)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 2
    Connections: 2 (2 established)
          Peers: 3 (with 6 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

Listing pods with kubectl --all-namespaces -o wide shows that weave has not allocated any IP addresses to any pods; only the ones that use the host network have them.

Another piece of information: running `watch ip addr` reveals a huge churn in IP addresses; I see a couple of virtual addresses coming and going on every refresh cycle.
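To avoid having to modprobe by hand after every reboot, I'm thinking of declaring the two modules on each host, roughly like this (assuming systemd-modules-load is what loads extra modules on MicroOS/Kubic; the file name weave.conf is just my choice):

$ # hypothetical file name; any *.conf under /etc/modules-load.d should work
$ printf 'nfnetlink\nip_set\n' | sudo tee /etc/modules-load.d/weave.conf
$ sudo systemctl restart systemd-modules-load.service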
I checked the systemd logs for a certain pod, but nothing stands out to me:

Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.555204679Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/0a17ec7c-6a80-4c87-ad07-474e65f1c1df Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.555636823Z" level=info msg="Checking pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 for CNI network weave (type=weave-net)"
Mar 24 11:22:17 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:17.735153225Z" level=info msg="Ran pod sandbox d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 with infra container: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=366231ac-359a-4a98-abdd-b9bd968889d6 name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:17 kubic-worker-2 kubelet[1235]: E0324 11:22:17.736658 1235 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"prometheus-pushgateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=prometheus-pushgateway pod=prometheus-pushgateway-8655bf87b9-7s8z6_lmn-system(94debd66-9a71-4822-8439-0d9cb6edc00f)\"" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" podUID=94debd66-9a71-4822-8439-0d9cb6edc00f
Mar 24 11:22:18 kubic-worker-2 kubelet[1235]: I0324 11:22:18.473417 1235 kuberuntime_manager.go:517] "Sandbox for pod has no IP address. Need to start a new one" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6"
Mar 24 11:22:18 kubic-worker-2 kubelet[1235]: I0324 11:22:18.473727 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerStarted Data:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732}
Mar 24 11:22:18 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:18.505034449Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/0a17ec7c-6a80-4c87-ad07-474e65f1c1df Networks:[{Name:weave Ifname:eth0}] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:18 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:18.505271666Z" level=info msg="Deleting pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 from CNI network \"weave\" (type=weave-net)"
Mar 24 11:22:25 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:25.860145921Z" level=info msg="Running pod sandbox: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=c238805b-4797-4c57-ad72-a74caf002bfa name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:26 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:26.097940633Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:26 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:26.098189993Z" level=info msg="Adding pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 to CNI network \"weave\" (type=weave-net)"
Mar 24 11:22:26 kubic-worker-2 kubelet[1235]: I0324 11:22:26.611351 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerDied Data:d6a47c65d1bfe28aefe9f3b43c7653fff5546e14f67dfd409156a23b7e20e732}
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.585356194Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.585713497Z" level=info msg="Checking pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 for CNI network weave (type=weave-net)"
Mar 24 11:22:34 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:34.767330381Z" level=info msg="Ran pod sandbox b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 with infra container: lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6/POD" id=c238805b-4797-4c57-ad72-a74caf002bfa name=/runtime.v1.RuntimeService/RunPodSandbox
Mar 24 11:22:34 kubic-worker-2 kubelet[1235]: E0324 11:22:34.769600 1235 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"prometheus-pushgateway\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=prometheus-pushgateway pod=prometheus-pushgateway-8655bf87b9-7s8z6_lmn-system(94debd66-9a71-4822-8439-0d9cb6edc00f)\"" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" podUID=94debd66-9a71-4822-8439-0d9cb6edc00f
Mar 24 11:22:35 kubic-worker-2 kubelet[1235]: I0324 11:22:35.744818 1235 kubelet.go:2101] "SyncLoop (PLEG): event for pod" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6" event=&{ID:94debd66-9a71-4822-8439-0d9cb6edc00f Type:ContainerStarted Data:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4}
Mar 24 11:22:35 kubic-worker-2 kubelet[1235]: I0324 11:22:35.745477 1235 kuberuntime_manager.go:517] "Sandbox for pod has no IP address. Need to start a new one" pod="lmn-system/prometheus-pushgateway-8655bf87b9-7s8z6"
Mar 24 11:22:35 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:35.753802361Z" level=info msg="Got pod network &{Name:prometheus-pushgateway-8655bf87b9-7s8z6 Namespace:lmn-system ID:b5b362d92151bde559115c2dc5bb01bceef216434b77a4b7d6a7d72e6439e9a4 UID:94debd66-9a71-4822-8439-0d9cb6edc00f NetNS:/var/run/netns/374582ee-18ec-438b-a813-ff8ad4a44ea6 Networks:[{Name:weave Ifname:eth0}] RuntimeConfig:map[weave:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Mar 24 11:22:35 kubic-worker-2 crio[1216]: time="2022-03-24 11:22:35.753897070Z" level=info msg="Deleting pod lmn-system_prometheus-pushgateway-8655bf87b9-7s8z6 from CNI network \"weave\" (type=weave-net)"

I have no idea how to go on right now; any ideas would be appreciated.

Thanks,
Robert