Puzzling rootless docker issue on TW
I've been attempting to help a user in the forums[1] with an issue running Docker rootless. He's unable to pull images and gets the following error:

Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 10.0.2.3:53: read udp 10.0.2.100:48971->10.0.2.3:53: i/o timeout

He also opened a discussion in the Docker forums (who referred him to our forums), and I've been chatting with the guy (Ákos) who was helping him over there in that thread[2].

The puzzling thing for me is that I can't duplicate the issue. I've tried on both GNOME and KDE, and I just cannot reproduce the problem.

In troubleshooting the issue together with Ákos, we seem to have isolated it to something in the slirp4netns network that's used in that configuration. Nobody involved has customized that setup at all, yet it fails for both of them and doesn't fail for me.

The issue *seems* to be that the DNS server set up in that network (10.0.2.3) is not responsive for either of them, but it is for me.

Wondering if anyone here has any ideas about what else to look at beyond what we already have in these two threads. I'm completely out of ideas.

Jim

[1] https://forums.opensuse.org/t/rootless-docker-i-o-timeout-with-docker-pull/169468/12
[2] https://forums.docker.com/t/rootless-docker-i-o-timeout-with-docker-pull/137848

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
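The error shows the lookup going from 10.0.2.100 to 10.0.2.3:53, the DNS server slirp4netns provides inside the rootless network namespace. One way to test that resolver directly, with Docker out of the picture, is a probe like the sketch below. It assumes `dig` (from bind-utils) is available inside the namespace; `probe_dns` is a hypothetical helper name, not part of Docker or slirp4netns.

```shell
#!/usr/bin/env bash
# probe_dns RESOLVER NAME
# Hypothetical helper: returns 0 if RESOLVER answers an A query for NAME,
# non-zero on timeout, on any dig error, or when the answer is empty
# (e.g. SERVFAIL or NXDOMAIN).
probe_dns() {
  local resolver="$1" name="$2" answer
  # +short prints only the answer records; +time=2 +tries=1 keeps a dead
  # resolver from stalling the check for long. dig exits non-zero when no
  # server could be reached.
  answer=$(dig @"$resolver" "$name" A +short +time=2 +tries=1 2>/dev/null) || return 1
  # An empty answer section means the name was not resolved.
  [ -n "$answer" ]
}
```

From a shell inside the namespace, `probe_dns 10.0.2.3 registry-1.docker.io && echo answered || echo 'no answer'` separates a silent forwarder from other pull failures; the same query against the host's own resolver, run on the host, gives a baseline.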
On 9/29/23 14:19, Jim Henderson wrote:
I've been attempting to help a user in the forums[1] with an issue running Docker rootless - he's running into an issue with being able to pull images resulting in the following error:
Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on 10.0.2.3:53: read udp 10.0.2.100:48971->10.0.2.3:53: i/o timeout
He also opened a discussion in the Docker forums (who referred him to our forums), and I've been chatting with the guy (Ákos) who was helping him over there in that thread[2].
The puzzling thing for me is that I can't duplicate the issue. I've tried on GNOME and KDE both, and I just cannot reproduce the problem.
In troubleshooting the issue together with Ákos, we seem to have isolated the issue to something in the slirp4netns network that's used in that configuration - but nobody involved has customized that setup at all - yet it fails for both of them and doesn't fail for me.
The issue *seems* to be that the DNS server set up in that network (10.0.2.3) is not responsive for either of them - but it is for me.
Wondering if anyone here has any ideas about what else to look at beyond what we already have in these two threads. I'm completely out of ideas.
Jim
[1] https://forums.opensuse.org/t/rootless-docker-i-o-timeout-with-docker- pull/169468/12 [2] https://forums.docker.com/t/rootless-docker-i-o-timeout-with-docker- pull/137848
Hi Jim,

Could there be a firewall blocking it?

--
Regards,
Joe
On Fri, 29 Sep 2023 14:45:05 -0400, Joe Salmeri wrote:
Could there be a firewall blocking it?
Actually, this got me thinking: in the host there isn't any firewall configuration blocking it, but in the network namespace provided by slirp4netns, that might be a different issue. I've no idea where that iptables firewall configuration comes from, but it definitely is different in my setup than on the host (which makes sense to me).

What the failures are showing for the two who are seeing them is an ICMP response failure from the gateway in the rootless network namespace to the tap0 interface that's defined in the namespace (i.e., 10.0.2.2 is the gateway, 10.0.2.100 is the tap0 interface within the namespace).

It seems almost certain to be something within this virtual userspace-defined network.

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
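The ICMP observation can be checked by hand from inside the namespace. The sketch below reads the default route (expected to point at 10.0.2.2 in this setup) and pings it. It assumes iproute2's `ip` and iputils' `ping` are usable there; `default_gw` and `gw_reachable` are hypothetical names, and whether slirp4netns answers the echo requests is exactly what is being tested.

```shell
#!/usr/bin/env bash
# default_gw: read `ip route show default` output on stdin and print the
# address that follows "via" (nothing if there is no gateway).
default_gw() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "via") { print $(i + 1); exit } }'
}

# gw_reachable [GW]
# Send three ICMP echo requests (2 s reply timeout each) to GW, defaulting to
# the default gateway of the current network namespace. Succeeds if at least
# one reply arrives; returns 2 if no gateway could be determined.
gw_reachable() {
  local gw="${1:-$(ip route show default | default_gw)}"
  [ -n "$gw" ] || return 2
  ping -c 3 -W 2 "$gw" >/dev/null 2>&1
}
```

Inside the namespace, `gw_reachable && echo 'gateway answers' || echo 'no ICMP reply'` reproduces the check without a packet capture.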
Jim,

Do the following:

docker ps

Get the hash (ID) for the container, or look in the logs for the hash, then do the following:

docker logs --follow CONTAINER_ID
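Chuck's two steps can be combined into one helper. A minimal sketch: `follow_latest_logs` is a hypothetical name, and it relies on `docker ps -q -l`, which prints only the ID of the most recently created container. It only helps once a container exists, which is not the case when the image pull itself fails.

```shell
#!/usr/bin/env bash
# follow_latest_logs: stream the logs of the most recently created container.
# `docker ps -q -l` prints just that container's ID (in any state).
follow_latest_logs() {
  local id
  id=$(docker ps -q -l) || return 1
  if [ -z "$id" ]; then
    echo "no containers found" >&2
    return 1
  fi
  # --follow keeps streaming new output until interrupted.
  docker logs --follow "$id"
}
```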
On Fri, 29 Sep 2023 14:45:05 -0400, Joe Salmeri wrote:
Could there be a firewall blocking it ?
Actually, this got me to thinking - in the host there isn't any firewall configuration blocking it, but in the network namespace provided by slirp4netns, that might be a different issue. I've no idea where that iptables firewall configuration comes from, but it definitely is different in my setup than the host (which makes sense to me).
What the failures are showing for the two who are seeing them is an ICMP response failure from the gateway in the rootless network namespace to the tap0 interface that's defined in the namespace (ie, 10.0.2.2 is the gateway, 10.0.2.100 is the tap0 interface within the namespace).
It seems almost certain to be something within this virtual userspace- defined network.
-- Jim Henderson Please keep on-topic replies on the list so everyone benefits
-- Terror PUP a.k.a Chuck "PUP" Payne ----------------------------------------- Discover it! Enjoy it! Share it! openSUSE Linux. ----------------------------------------- openSUSE -- Terrorpup openSUSE Ambassador/openSUSE Member skype,twiiter,identica,friendfeed -- terrorpup freenode(irc) --terrorpup/lupinstein Register Linux Userid: 155363 openSUSE Community Member since 2008.
On Fri, 29 Sep 2023 16:17:47 -0400, Chuck Payne wrote:
Do the following
docker ps
Get the hash (ID) for the container, or look in the logs for the hash, then do the following:
docker logs --follow CONTAINER_ID
This will give you all the output of the docker container since you started running it.
Thanks, Chuck. Unfortunately, the issue happens before the container image is even pulled down. Rootless docker uses a userspace network layer that seems to interfere in some (but not all) instances with even pulling the image to start with.

We're looking at the network configuration inside that userspace layer, using the 'nsenter' command to look at a tap interface that sits between the rootless docker daemon and the outside world.

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
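For the nsenter step, a small wrapper saves retyping the namespace flags. A sketch, assuming the daemon's PID file is at $XDG_RUNTIME_DIR/docker.pid (the usual location for rootless Docker) and that `nsenter` comes from util-linux; `rootless_ns` is a hypothetical name. Entering the mount namespace (-m) as well as the network one should let commands see the same /etc/resolv.conf the daemon does.

```shell
#!/usr/bin/env bash
# rootless_ns CMD [ARGS...]
# Run CMD inside the user, network and mount namespaces of the rootless
# Docker daemon. Assumption: dockerd wrote its PID to
# $XDG_RUNTIME_DIR/docker.pid.
rootless_ns() {
  if [ -z "${XDG_RUNTIME_DIR:-}" ]; then
    echo "XDG_RUNTIME_DIR is not set" >&2
    return 1
  fi
  local pidfile="$XDG_RUNTIME_DIR/docker.pid"
  if [ ! -r "$pidfile" ]; then
    echo "cannot read $pidfile (is rootless dockerd running?)" >&2
    return 1
  fi
  # -U: user ns, -n: network ns, -m: mount ns; --preserve-credentials keeps
  # the caller's UID/GID instead of switching to root in the user ns.
  nsenter -U --preserve-credentials -n -m -t "$(cat "$pidfile")" "$@"
}
```

Typical uses would be `rootless_ns ip addr show tap0`, `rootless_ns ip route`, or `rootless_ns cat /etc/resolv.conf`.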
On Fri, Sep 29, 2023 at 5:45 PM Jim Henderson
On Fri, 29 Sep 2023 16:17:47 -0400, Chuck Payne wrote:
Do the following
docker ps
get the hash for the container, or look in the logs for the hash, then do the following
docker logs --follow
This will give you all the out of the docker container since you started to run it .
Thanks, Chuck - unfortunately, the issue happens before the container image is even pulled down. Rootless docker uses a userspace network layer that seems to interfere in some (but not all instances) with even pulling the image to start with.
We're looking at network configuration inside that userspace stuff using the 'nsenter' command to look at a tap interface that sits between the rootless docker daemon and the outside world.
-- Jim Henderson Please keep on-topic replies on the list so everyone benefits
Then maybe run Wireshark while you are trying to pull it, to see what errors might be there. Good luck.

--
Terror PUP a.k.a Chuck "PUP" Payne
On Fri, 29 Sep 2023 19:56:30 -0400, Chuck Payne wrote:
Then maybe run Wireshark while you are trying to pull it, to see what errors might be there. Good luck.
Yep, we did that as well; you can see the traces in the thread on the Docker forums. The issue is that inside the userspace network, for some reason, ICMP responses are not able to be returned. We can't figure out why that is.

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
On 2023-09-29 22:08, Jim Henderson wrote:
On Fri, 29 Sep 2023 19:56:30 -0400, Chuck Payne wrote:
Then maybe a wireshark which you are trying to pull it to see what errors might be there. Good Luck. Yep, we did that as well - you can see the traces in the thread on the Docker forums. The issue is that inside the userspace network space, for some reason, ICMP responses are not able to be returned. We can't figure out why that is.
Hi Jim,

Interestingly enough, I have a similar issue with my laptop. It may not be the same, but it sounds strikingly similar.

::: TL;DR :::

In your case, can your user try to replicate the problem in a VM, using bridged networking?

::: Details :::

At times, I cannot access/ping one box on my LAN from my laptop. When the issue crops up, the packets seem to make it to my box (confirmed via tcpdump), but ping doesn't see them, as if they're blocked. This happens even with firewalld disabled.

I can definitely access this box if my wireless NIC is the only one enabled (not the wired one too), which to me sounded like a routing issue. As I've been busy, and it's not critical to access this one box, I haven't spent too many cycles on it.

Earlier in the week, I spun up a new TW VM on my laptop. I set up bridge mode networking with my wired NIC. The VM can access that other box!

I thought, aha, perhaps the firewall rules are borked:

# fgrep FirewallBackend firewalld.conf
# FirewallBackend
FirewallBackend=nftables

From the VM, I dumped the rules and slurped them into my host OS, but it didn't make a difference. Here's what I did in case it helps your situation:

VM # nft list ruleset > good.rules
# nft -f good.rules

Next on my list was to try and figure out which firewall packages are installed during a TW install and re-install them, to see if I can replicate my VM's environment.

Thx!
-pablo
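One caveat with the `nft list ruleset > good.rules` / `nft -f good.rules` steps: `nft -f` adds the file's contents to whatever is already loaded, and the `nft list ruleset` output does not start with `flush ruleset`, so the host can end up with both rule sets merged rather than replaced. A sketch that builds a replacement file; `make_replace_file` is a hypothetical name, and because firewalld manages its own tables, restarting or reloading firewalld afterwards may undo the change.

```shell
#!/usr/bin/env bash
# make_replace_file SRC DEST
# Write DEST = "flush ruleset" followed by SRC (a saved `nft list ruleset`
# dump), so that loading DEST replaces the current ruleset in one atomic
# transaction instead of merging into it.
make_replace_file() {
  local src="$1" dest="$2"
  if [ ! -r "$src" ]; then
    echo "cannot read $src" >&2
    return 1
  fi
  { echo "flush ruleset"; cat "$src"; } > "$dest"
}
```

After `make_replace_file good.rules replace.rules`, running `nft -c -f replace.rules` checks the file without applying anything, and `nft -f replace.rules` loads it.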
On Sat, 30 Sep 2023 06:18:49 -0400, Pablo Sanchez wrote:
Hi Jim,
Interestingly enough, I have a similar issue with my laptop. It may not be the same but it sounds strikingly similar.
::: TL;DR :::
In your case, can your user try to replicate the problem in a VM, using bridged networking?
I can ask the user who's having the issue; I've not been able to reproduce
the issue at all, but the user in the thread on the openSUSE forums
(tilfischer) as well as the individual helping in the Docker forums
(rimelek) can. The former is running on bare metal, the latter is running
TW inside an lxd VM.
I've been testing in VMware Workstation 17.0.2.
All are on the 20230926 release of TW. Rimelek's installation uses a pre-built
VM image from the lxd repositories, but both mine and tilfischer's
were installed from media using default options (his was KDE, mine was
GNOME, but I also tried KDE) and then updated with zypper dup.
I feel I need to emphasize that it's not a physical network issue, but a
virtual network issue in how rootless docker works.
I'm going to explain this in more detail, partly to help clarify the issue
for those reading this thread, and partly to help me make sure I
understand what I'm seeing.
Rootless docker is a way to run the docker daemon as a user, without root
privileges. In order to connect to the network, it uses a userspace
network tool (slirp4netns) that creates what appears to be a virtual
routed network. This is configured automatically by the
/usr/bin/dockerd-rootless-setuptool.sh script.
What you end up with is a network configuration that looks like this:
host <----> userspace network <----> docker networks
That "userspace network" is only present inside the host - it's a tap
interface that has its own subnet (10.0.2.x), and is configured with its
own routes and iptables firewall rules:
--- snip ---
localhost:/home/jhenderson # iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (1 references)
target     prot opt source               destination

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
localhost:/home/jhenderson #
--- snip ---
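When comparing a working and a failing setup, the chain default policies (FORWARD is DROP in the listing above) are a quick first thing to diff. A sketch that extracts them from `iptables -L` output; `chain_policies` is a hypothetical name, and `iptables -S`, which prints `-P CHAIN POLICY` lines directly, would serve as well.

```shell
#!/usr/bin/env bash
# chain_policies: read `iptables -L` output on stdin and print
# "CHAIN POLICY" for each built-in chain. User-defined chains show
# "(N references)" instead of a policy and are skipped. Also handles the
# "(policy ACCEPT 0 packets, 0 bytes)" form printed with -v.
chain_policies() {
  sed -n 's/^Chain \([^ ]*\) (policy \([A-Z]*\).*/\1 \2/p'
}
```

Running `iptables -L | chain_policies` inside the namespace on each machine and diffing the two results narrows down whether the rule sets differ at all.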
From the host's perspective, it doesn't exist as a network interface at
all:
--- snip ---
jhenderson@localhost:~> ifconfig
ens33: flags=4163
OK, so I've found that in the lxd image, it fails just as it does in rimelek's setup, so I've been able to reliably reproduce it. I've also got it configured on the host that runs the lxd image, and it works there.

From there I've been able to determine that the traffic is never leaving the userspace network. Running Wireshark both inside the userspace network and outside it, I see the requests inside the userspace network, and no traffic on the host's network at all.

What I was hoping to see was a DNS lookup request and response, followed by nothing, but the DNS request isn't even getting out. When I do the trace on the host (where it works for me), I see traffic on the host's external network.

So it seems that the issue is that traffic isn't passing from the userspace network to the real-world network.

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
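The two-sided comparison described above can be scripted so both captures use the same filter. A sketch, assuming tcpdump is installed both inside the namespace and on the host; `capture_dns` is a hypothetical helper, and the interface names (tap0, ens33) are the ones from this thread.

```shell
#!/usr/bin/env bash
# capture_dns IFACE OUTFILE [COUNT]
# Record up to COUNT (default 20) DNS packets seen on IFACE into a pcap
# file that Wireshark can open. -n disables name resolution so the
# capture itself does not generate extra DNS traffic.
capture_dns() {
  local iface="$1" out="$2" count="${3:-20}"
  tcpdump -n -i "$iface" -c "$count" -w "$out" udp port 53
}
```

For example, `capture_dns tap0 ns.pcap` in a shell inside the namespace and `capture_dns ens33 host.pcap` on the host while retrying the pull; a query that shows up in ns.pcap with no counterpart in host.pcap matches the failure described above.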
On Sat, 30 Sep 2023 21:23:26 -0000 (UTC), Jim Henderson wrote:
OK, so I've found that in the lxd image, it fails as with rimelek's setup. So I've been able to reliably reproduce it.
I've also got it configured on the host that runs the lxd image, and it works there.
From there I've been able to determine that the traffic is never leaving the userspace network. Running wireshark both inside the userspace network and outside it, I see the requests inside the userspace network, and no traffic on the host's network at all.
What I was hoping to see was a DNS lookup request and response, followed by nothing - but the DNS request isn't even getting out.
When I do the trace on the host (where it works for me), I see traffic on the host's external network.
So it seems that the issue is that traffic isn't passing from the userspace network to the real-world network.
The user who reported the issue has figured it out. Because we generate a symlink for /etc/resolv.conf rather than a real file, slirp4netns doesn't work. The slirp4netns documentation specifically states that the file has to be a real file, not a symlink. Changing it to a real file resolved the issue for him. He's going to report this through Bugzilla so a permanent fix can be implemented.

His post on his resolution can be found at [1].

[1] https://forums.docker.com/t/rootless-docker-i-o-timeout-with-docker-pull/137848/29

--
Jim Henderson
Please keep on-topic replies on the list so everyone benefits
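Since the fix came down to /etc/resolv.conf being a symlink, a quick check is worth running on any affected machine. A sketch; `check_resolv_conf` is a hypothetical name, and the requirement that slirp4netns needs a regular file is taken from the report above rather than verified here.

```shell
#!/usr/bin/env bash
# check_resolv_conf [FILE]
# Report whether FILE (default /etc/resolv.conf) is a symlink or a regular
# file. Returns 1 for a symlink, 2 if the file is missing, 0 otherwise.
check_resolv_conf() {
  local f="${1:-/etc/resolv.conf}"
  # Test -L first: -f follows symlinks and would also be true for a link.
  if [ -L "$f" ]; then
    echo "$f is a symlink to $(readlink "$f")"
    return 1
  elif [ -f "$f" ]; then
    echo "$f is a regular file"
  else
    echo "$f is missing" >&2
    return 2
  fi
}
```

To turn the link into a regular file with the same contents (as root), one option is `cp --remove-destination "$(readlink -f /etc/resolv.conf)" /etc/resolv.conf`, followed by `systemctl --user restart docker` as the rootless user. Whatever manages the link (netconfig on openSUSE, for example) may recreate it later, which is why a permanent fix through Bugzilla matters.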
participants (4)
- Chuck Payne
- Jim Henderson
- Joe Salmeri
- Pablo Sanchez