[Bug 1171770] New: Worker nodes can't access kube-api
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770

Bug ID: 1171770
Summary: Worker nodes can't access kube-api
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: 64bit
OS: Linux
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kubic
Assignee: kubic-bugs@opensuse.org
Reporter: contact@ffreitas.io
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Release: 20200514

Deployment method:

```
kubicctl init
kubicctl node add worker01.local
```

After this deployment, none of my pods deployed on worker nodes can access the kube-api. For example, with kured I get:

```
time="2020-05-14T22:18:03Z" level=info msg="Kubernetes Reboot Daemon: 1.3.0"
time="2020-05-14T22:18:03Z" level=info msg="Node ID: worker01"
time="2020-05-14T22:18:03Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2020-05-14T22:18:03Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2020-05-14T22:18:03Z" level=info msg="Blocking Pod Selectors: []"
time="2020-05-14T22:18:03Z" level=info msg="Reboot on: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
time="2020-05-14T22:18:33Z" level=fatal msg="Error testing lock: Get https://10.96.0.1:443/apis/apps/v1/namespaces/kube-system/daemonsets/kured: dial tcp 10.96.0.1:443: i/o timeout"
```

I found a similar issue on reddit: https://www.reddit.com/r/kubernetes/comments/gjhxcj/fresh_kubeadm_install_po...

-- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c1
--- Comment #1 from Francisco Freitas
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c2
--- Comment #2 from Quentin Onno
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c4
--- Comment #4 from Francisco Freitas
> kured 1.3.0 is not the latest build; the latest contains kured 1.4.0.
>
> I'm not sure where the problem is; you can try to use flannel instead of weave for the pod network. But flannel is not really maintained anymore and reports a lot of iptables errors. DNS isn't fully working either.

Destroyed my cluster, did a transactional-update. Same issue with the latest kured version:

```
time="2020-05-17T09:53:27Z" level=info msg="Kubernetes Reboot Daemon: 1.4.0"
time="2020-05-17T09:53:27Z" level=info msg="Node ID: worker01"
time="2020-05-17T09:53:27Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2020-05-17T09:53:27Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 1h0m0s"
time="2020-05-17T09:53:27Z" level=info msg="Blocking Pod Selectors: []"
time="2020-05-17T09:53:27Z" level=info msg="Reboot on: SunMonTueWedThuFriSat between 00:00 and 23:59 UTC"
```

It does not come from kured: I got the same issue with multiple services (for example haproxy-ingress). For the CNI I tested:
- cilium (built a yaml from the github repository and put it in /usr/shared/k8s-yaml/cilium)
- weavenet (default init)
- flannel (kubicctl init --pod-network flannel)

Same issue with all of them.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c6
--- Comment #6 from Francisco Freitas
> With flannel the error is for me much rarer than with weave. But it looks like the best way to find out whether a cluster is affected or not is: run a busybox container and use nslookup to resolve a host. On an affected cluster you will run into a timeout (temporary failure in name resolution); otherwise you should get a response immediately.
>
> A second Kubernetes cluster is running fine for me without these issues.
>
> By the way, kured is also broken, since the last systemd update is incompatible ...

Again, not a kured issue for me as it affects other services. What is the configuration on your unaffected cluster? Is it a fresh install?
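[Editor's note] The busybox check described above can be run as a one-off pod; the pod name and image tag here are illustrative, not from the thread:

```shell
# Launch a throwaway busybox pod and try to resolve the in-cluster API
# service name; on an affected cluster this times out ("temporary failure
# in name resolution") instead of answering immediately.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- \
  nslookup kubernetes.default.svc.cluster.local
```

This requires a reachable cluster, so it is a diagnostic sketch rather than something runnable in isolation.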
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c8
--- Comment #8 from Francisco Freitas
> (In reply to Francisco Freitas from comment #6)
> > Again, not a kured issue for me as it affects other services.
> kured cannot reboot the system anymore since systemd moved binaries, so you are affected by this.

The issue I want to point to here is the timeout to 10.96.0.1, which is the kube-api service; kured is just an example I took. I've seen the issue you're talking about, but it's still not the one I'm hoping to solve.

> > What is the configuration on your unaffected cluster?
> > Is it a fresh install?
> It's a multi-master setup, but no fresh install, only always updated. So not really comparable.

I can't rollback here. I must start a new environment.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c9
--- Comment #9 from Richard Brown
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c10
--- Comment #10 from Francisco Freitas
> Hi all - I've been looking at this all day, here is the current status:
>
> I can confirm it happens with both kubicctl and kubeadm clusters made from the current snapshots.
>
> We know this doesn't occur on kubicctl clusters with multi-masters, which suggests haproxy somehow works around the issue.

Might want to verify this: I tried a multi-master deployment two releases back. I had no issue with the master nodes, but I still got the issue on the worker nodes. (I will test it on the latest release again tonight.)

> This now leads me to wonder if the kernel or runc updates are to blame, which I will look at tomorrow, unless someone beats me to it first.
>
> Sorry that this doesn't look like it will be a quick fix. Anyone got any other info that might help?

Couldn't it be tested by downgrading the kernel?
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c11
--- Comment #11 from Richard Brown
> Couldn't it be tested by downgrading the kernel?

Sure, but a) from where? And b) I've worked enough today; I think I'd like a bit of a break before picking this up tomorrow ;)
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c12
--- Comment #12 from Francisco Freitas
> Sure, but a) from where?

It was just a genuine question. I remember using the tumbleweed-cli to access the history repositories; I do not know if it can be done with Kubic.

> And b) I've worked enough today; I think I'd like a bit of a break before picking this up tomorrow ;)

I was not hoping for you to work on this single issue again today :p
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c13
--- Comment #13 from Francisco Freitas
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c14
--- Comment #14 from Richard Brown
> The error is also present on a multi-master cluster. I deployed a cluster from the release 20200516 using the following commands:
>
> ```
> kubicctl init --haproxy loadbalancer --multi-master loadbalancer.cluster.local
> kubicctl node add --type master master02
> kubicctl node add --type master master03
> kubicctl node add worker01
> ```

So.. here is what I've tried:
- kubeadm init --image-repository to use only upstream containers - problem still occurs
- rebuilt kubernetes 1.18.2 containers - problem still occurs
- deployed it on kubernetes 1.17.5 - problem still occurs
- only upstream weave, cilium and other CNI providers - problem still occurs
- used https://download.opensuse.org/history/ to move my nodes to every version of Kubic we've had in May - problem still occurs

I'm officially flummoxed - does anyone have any idea when this last worked for sure? Because I'm running out of things to rule out.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c15
--- Comment #15 from Francisco Freitas
> So.. here is what I've tried:
> - kubeadm init --image-repository to use only upstream containers - problem still occurs
> - rebuilt kubernetes 1.18.2 containers - problem still occurs
> - deployed it on kubernetes 1.17.5 - problem still occurs
> - only upstream weave, cilium and other CNI providers - problem still occurs
> - used https://download.opensuse.org/history/ to move my nodes to every version of Kubic we've had in May - problem still occurs
>
> I'm officially flummoxed - does anyone have any idea when this last worked for sure? Because I'm running out of things to rule out.

Last time I successfully installed a Kubic cluster was on April 7th, with the following configuration:
- upstream cilium for the CNI
- single master
- release 20200405, updated from a 20200108 iso
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c16
--- Comment #16 from Richard Brown
> Last time I successfully installed a Kubic cluster was on April 7th, with the following configuration:
> - upstream cilium for the CNI
> - single master
> - release 20200405, updated from a 20200108 iso

Do you (or anyone else) have an ISO that old somewhere I can download, to see if I can narrow this down further?
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c17
--- Comment #17 from Francisco Freitas
> Do you (or anyone else) have an ISO that old somewhere I can download, to see if I can narrow this down further?

I only got the 20200108 iso. Will it do the trick for you?
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c18
--- Comment #18 from Francisco Freitas
> Do you (or anyone else) have an ISO that old somewhere I can download, to see if I can narrow this down further?

In case you need it, I've managed to upload it here: https://send.firefox.com/download/a4f0c1b25d2d81a9/#72p-GCmfwurTQhAPLPwEsw
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c19
--- Comment #19 from Richard Brown
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c20
--- Comment #20 from Richard Brown
> Figure out why the heck /etc/sysctl.d/70-yast.conf's blocking of IP forwarding is taking effect when /usr/lib/sysctl.d/90-yast.conf should be overriding it :)

Correction: /usr/lib/sysctl.d/90-kubeadm.conf is what should be overriding 70-yast.conf.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c21
--- Comment #21 from Francisco Freitas
> No need for an ISO, found what I believe to be the trigger for the problem.
>
> WORKAROUND:
> Delete /etc/sysctl.d/70-yast.conf
> If the cluster is already bootstrapped, reboot all nodes. Cluster communications work properly afterwards.
>
> NEXT STEP:
> Figure out why the heck /etc/sysctl.d/70-yast.conf's blocking of IP forwarding is taking effect when /usr/lib/sysctl.d/90-yast.conf should be overriding it :)

Nice! I will try it out tonight. Thanks for the workaround.
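[Editor's note] The precedence the workaround relies on can be sketched without touching a real system. This simulation uses throwaway mock directories (not the real /etc and /usr/lib): sysctl --system applies drop-ins in lexical order of basename, with /etc/sysctl.d shadowing /usr/lib/sysctl.d for identical names, so 90-kubeadm.conf should win over 70-yast.conf - the bug is that on affected nodes it apparently did not.

```shell
# Simulate sysctl.d precedence with mock directories standing in for
# /etc/sysctl.d and /usr/lib/sysctl.d. Files apply in lexical order of
# basename and the later assignment wins, so 90-kubeadm.conf (= 1) is
# expected to override 70-yast.conf (= 0).
tmp=$(mktemp -d)
mkdir -p "$tmp/etc" "$tmp/lib"
printf 'net.ipv4.ip_forward = 0\n' > "$tmp/etc/70-yast.conf"
printf 'net.ipv4.ip_forward = 1\n' > "$tmp/lib/90-kubeadm.conf"

effective=""
for name in $( (ls "$tmp/etc"; ls "$tmp/lib") | sort -u ); do
  # a file in /etc shadows one with the same basename in /usr/lib
  if [ -f "$tmp/etc/$name" ]; then f="$tmp/etc/$name"; else f="$tmp/lib/$name"; fi
  v=$(sed -n 's/^net\.ipv4\.ip_forward[[:space:]]*=[[:space:]]*//p' "$f")
  [ -n "$v" ] && effective=$v
done
echo "effective net.ipv4.ip_forward = $effective"
rm -rf "$tmp"
```

Deleting the mock 70-yast.conf (the thread's workaround) would leave 90-kubeadm.conf as the only assignment, with the same result.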
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c22
--- Comment #22 from Rafael Fernández López
From what I see, it should include `net.ipv6.conf.all.forwarding = 1` as well. I cannot explain why this is happening in a better way right now though.
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c23
--- Comment #23 from Rafael Fernández López
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c24
--- Comment #24 from Francisco Freitas
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c25
--- Comment #25 from Richard Brown
> I have the impression that `net.ipv6.conf.all.forwarding = 0` being set by `/etc/sysctl.d/70-yast.conf` has an impact here.
>
> I did override this setting by creating a `/etc/sysctl.d/91-kubeadm.conf` file with contents:
>
> ```
> net.ipv4.ip_forward = 1
> net.ipv6.conf.all.forwarding = 1
> ```
>
> After rebooting the node, everything works fine. As Richard mentioned, removing `/etc/sysctl.d/70-yast.conf` altogether and rebooting also does the trick.
>
> This makes me think that the override in `/usr/lib/sysctl.d/90-kubeadm.conf` is not enough; it currently has:
>
> ```
> # The file is provided as part of the kubernetes-kubeadm package
> net.ipv4.ip_forward = 1
> ```
>
> From what I see, it should include `net.ipv6.conf.all.forwarding = 1` as well. I cannot explain why this is happening in a better way right now though.

I tried this before making my post, and it didn't work for me.. but I trust your observation also, so I'm putting it in a patch for kubernetes1.18 and kubernetes1.17 and testing those packages :) Thanks!
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c26
--- Comment #26 from Richard Brown
> (In reply to Rafael Fernández López from comment #22)
> > From what I see, it should include `net.ipv6.conf.all.forwarding = 1` as well. I cannot explain why this is happening in a better way right now though.
> I tried this before making my post, and it didn't work for me.. but I trust your observation also, so I'm putting it in a patch for kubernetes1.18 and kubernetes1.17 and testing those packages :) Thanks!

Put the change in the package, and confirmed: it does not work to add `net.ipv6.conf.all.forwarding = 1`. However, I can confirm that if I copy 90-kubeadm.conf to /etc/sysctl.d, then it works. This means something is incorrectly parsing (or not parsing) /usr/lib/sysctl.d. Now we just need to figure out what.
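[Editor's note] A quick, non-destructive way to audit a node for this mismatch, assuming the packaged drop-in path named in the thread, is to compare what 90-kubeadm.conf requests with what the kernel actually has:

```shell
# Compare the forwarding value requested by the kubeadm drop-in with the
# live kernel value; prints "n/a" if either is missing on this machine.
want=$(sed -n 's/^net\.ipv4\.ip_forward[[:space:]]*=[[:space:]]*//p' \
       /usr/lib/sysctl.d/90-kubeadm.conf 2>/dev/null)
have=$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null)
echo "90-kubeadm.conf wants: ${want:-n/a}, kernel has: ${have:-n/a}"
```

If the two values disagree, the thread's workaround applies: copy the drop-in into /etc/sysctl.d (or delete 70-yast.conf) and reboot.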
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c27
--- Comment #27 from Richard Brown
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c28
--- Comment #28 from Aleksa Sarai
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c29
--- Comment #29 from Martin Weiss
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c30
--- Comment #30 from Rafael Fernández López
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c31
--- Comment #31 from Rafael Fernández López
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c32
--- Comment #32 from Aleksa Sarai
> FYI - just hit the same issue on SLES 15 SP1 with kernel 4.12.14-197.40-default, and we realized that while all other interfaces have forwarding = 1, JUST the eth0 and lo interfaces have 0!
>
> While all other ipv4 forwarding flags were 1, we saw these two at 0:
>
> /proc/sys/net/ipv4/conf/eth0/forwarding 0
> /proc/sys/net/ipv4/conf/lo/forwarding 0

Dammit. Yeah, I had noticed this last week (when I was figuring out how forwarding configuration worked), but I misunderstood what I was looking at -- my assumption was that forwarding meant forwarding in *both* directions. But I think it only refers to forwarding *incoming* packets (so forwarding being disabled on the host still allows forwarded packets from the container to go to the internet).

> BUT - a sysctl --system (with only net.ipv4.ip_forward=1 in the conf) did NOT change the interfaces from 0 to 1!!

Yeah, this behaviour is expected (if misguided IMHO). The kernel treats setting this sysctl to its current value as a no-op. I guess we'll need to explicitly do `echo 1 | tee /proc/sys/net/ipv[46]/conf/*/forwarding` somewhere...
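[Editor's note] The per-interface flags described above can be listed with a short loop; the global net.ipv4.ip_forward switch does not guarantee each interface's flag, and re-applying an unchanged global value is a no-op, so the explicit per-interface write is the suggested escape hatch (root required for the write):

```shell
# List the per-interface IPv4 forwarding flags.
for f in /proc/sys/net/ipv4/conf/*/forwarding; do
  [ -e "$f" ] || continue   # glob matched nothing (non-Linux, no /proc)
  printf '%s = %s\n' "$f" "$(cat "$f")"
done

# Force forwarding on for every interface (run as root), as suggested above:
#   echo 1 | tee /proc/sys/net/ipv4/conf/*/forwarding
```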
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c33
--- Comment #33 from Martin Weiss
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c34
--- Comment #34 from Martin Weiss
http://bugzilla.opensuse.org/show_bug.cgi?id=1171770#c36
--- Comment #36 from Richard Brown
participants (1)
- bugzilla_noreply@suse.com