DNS status update; changes on anna/elsa
Hi @here

Whoever decided that all outgoing DNS traffic should only go to 185.85.248.19 (iodine.enidan.com): I decided that it's time to be a bit more generic and use 9.9.9.10, 9.9.9.9, 8.8.8.8 and 1.1.1.1 (in this order) for now for all DNS queries that go out from our infra.opensuse.org network into the world.

I also changed from dnsmasq to bind on anna/elsa and (re-)enabled the infra.opensuse.org zone on chip.infra.opensuse.org (including in-addr.arpa). At the moment, FreeIPA is still authoritative for all infra.opensuse.org DNS entries - nothing changed here. But now it's just a single "click" away to make chip our "one and only" DNS hidden master server.

Please note that we have had another hidden master for a while: scar is providing DNS for all OpenVPN clients:

  ~> host lrupp.vpn.opensuse.org
  lrupp.vpn.opensuse.org has address 192.168.253.202
  lrupp.vpn.opensuse.org has address 192.168.252.202
  lrupp.vpn.opensuse.org mail is handled by 1 relay.infra.opensuse.org.

  ~> host lrupp.tcp.vpn.opensuse.org
  lrupp.tcp.vpn.opensuse.org has address 192.168.253.202
  lrupp.tcp.vpn.opensuse.org mail is handled by 1 relay.infra.opensuse.org.

  ~> host lrupp.udp.vpn.opensuse.org
  lrupp.udp.vpn.opensuse.org has address 192.168.252.202
  lrupp.udp.vpn.opensuse.org mail is handled by 1 relay.infra.opensuse.org.

  ~> host 192.168.253.202
  202.253.168.192.in-addr.arpa domain name pointer lrupp.tcp.vpn.opensuse.org.

  ~> host 192.168.252.202
  202.252.168.192.in-addr.arpa domain name pointer lrupp.udp.vpn.opensuse.org.

This means that hosts that currently use anna/elsa as resolver should show which VPN "user" is or was connected to a machine.

The next task is to get LDAP authentication on chip up and running for the WebUI. At the moment it looks like either the tool does not like me or I do not like the LDAP settings in our FreeIPA - who knows...

Regarding DNSSEC, I got good news from SUSE IT: while they currently face some issues with our registrar, they want to support us as well as possible.
So we might end up with some temporary workaround - but that should not block us. We might even get a dedicated account at another registrar in the future, to manage the domains under openSUSE Heroes control completely on our own. While this is currently not 100% clear, I see it as a very positive sign that SUSE IT is hearing us and trying their best to support us.

Meanwhile, I'd like to get our DNS setup in order. Anyone who would like to join me in this is more than welcome!

With kind regards, Lars
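For reference, the forwarder order described above corresponds roughly to a bind options block like this (a minimal sketch, not the actual anna/elsa configuration):

```
options {
    // Quad9 first, then Google and Cloudflare as fallback,
    // matching the order 9.9.9.10, 9.9.9.9, 8.8.8.8, 1.1.1.1
    forwarders {
        9.9.9.10;
        9.9.9.9;
        8.8.8.8;
        1.1.1.1;
    };
    // fall back to iterative resolution if the forwarders fail
    forward first;
};
```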
Lars Vogdt wrote:
Hi @here
Whoever decided that all outgoing DNS traffic should only go to 185.85.248.19 (iodine.enidan.com): I decided that it's time to be a bit more generic and use 9.9.9.10, 9.9.9.9, 8.8.8.8 and 1.1.1.1 (in this order) for now for all DNS queries that go out from our infra.opensuse.org network into the world.
Lars, there is an open ticket on why and what was done. Check out 92089. -- Per Jessen, Zürich (10.6°C) Member, openSUSE Heroes
Per Jessen wrote:
Lars Vogdt wrote:
Hi @here
Whoever decided that all outgoing DNS traffic should only go to 185.85.248.19 (iodine.enidan.com): I decided that it's time to be a bit more generic and use 9.9.9.10, 9.9.9.9, 8.8.8.8 and 1.1.1.1 (in this order) for now for all DNS queries that go out from our infra.opensuse.org network into the world.
Lars, there is an open ticket on why and what was done. Check out 92089.
Would you mind reverting to the setup using iodine.enidan.com? You are ruining my stats. It has been working a lot better for the last 4-5 days, and using those dodgy cloudflare and google resolvers is not going to help. -- Per Jessen, Zürich (11.1°C) Member, openSUSE Heroes
On May 11, 2021 9:20:41 PM UTC, Per Jessen <per@opensuse.org> wrote:
Lars, there is an open ticket on why and what was done. Check out 92089.
Would you mind reverting to the setup using iodine.enidan.com?
Fine with me (but at the moment, your server is just added to the list, so it might not receive that much traffic). But I wonder why you don't collect stats directly on anna/elsa and instead try to work on symptoms? For me, this would mean enabling more logging on anna/elsa to really see what's happening.

Note: by relying on just one external DNS server, you make the redundancy of anna/elsa obsolete. If there is a problem with - or on the way to - your server, the whole setup is broken.
You are ruining my stats. It has been working a lot better for the last 4-5 days, and using those dodgy cloudflare and google resolvers is not going to help.
Quad9 is marked as privacy-friendly and GDPR-compliant. The Google and Cloudflare ones are the usual suspects for reliable (while privacy-unfriendly) DNS with quick round-trip times. I'm happy to discuss which forwarding DNS we can/should use for our internal hosts. Worst case, we can even skip forwarding and always ask the root DNS. What's your opinion?

My first suspicion with your email problem is DNSSEC, as you already mentioned in the ticket. I will tune the logging a bit (which seems to be easier with bind than with dnsmasq or pdns) to get more information over the next hours.

Regards, Lars
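Such query logging can be switched on in bind with a named.conf fragment roughly like this (a sketch; the log path and rotation settings are illustrative assumptions, not our actual config):

```
logging {
    channel querylog {
        // path and rotation values are examples only
        file "/var/log/named/query.log" versions 3 size 20m;
        severity info;
        print-time yes;
    };
    // attaching a channel to the "queries" category enables query logging
    category queries { querylog; };
};
```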
Lars Vogdt wrote:
Am May 11, 2021 9:20:41 PM UTC schrieb Per Jessen <per@opensuse.org>:
Lars, there is an open ticket on why and what was done. Check out 92089.
Would you mind reverting to the setup using iodine.enidan.com?
Fine with me (but at the moment, your server is just added to the list, so it might not receive that much traffic).
Yup, I see it - it's a pretty complex setup you have :-)
But I wonder why you don't collect stats directly on anna/elsa and instead try to work on symptoms?
Do we have anything to go on? Looking at my own mailservers, I never have this issue, so whatever causes it on anna/elsa must be something in the environment.
Note: by relying on just one external DNS server, you make the redundancy of anna/elsa obsolete. If there is a problem with - or on the way to - your server, the whole setup is broken.
True - I certainly didn't want to keep it going for more than a few days.
You are ruining my stats. It has been working a lot better for the last 4-5 days, and using those dodgy cloudflare and google resolvers is not going to help.
Quad9 is marked as privacy-friendly and GDPR-compliant. The Google and Cloudflare ones are the usual suspects for reliable (while privacy-unfriendly) DNS with quick round-trip times. I'm happy to discuss which forwarding DNS we can/should use for our internal hosts.
I only said 'dodgy' because one of them seems to cause the problems.
Worst case, we can even skip forwarding and always ask the root DNS. What's your opinion?
That would be my suggestion now - get rid of the forwarders. You have eliminated dnsmasq, but there are still 'no host found' entries in the log. -- Per Jessen, Zürich (11.9°C) Member, openSUSE Heroes
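Getting rid of the forwarders would mean letting bind resolve iteratively from the root servers, roughly like this (a minimal sketch; the hint file name and path vary by distribution):

```
options {
    recursion yes;
    // no "forwarders" statement: resolve iteratively,
    // starting at the root servers
};

zone "." IN {
    type hint;
    file "root.hint";  // name/path varies by distribution
};
```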
On Wed, 12 May 2021 10:09:58 +0200, Per Jessen <per@opensuse.org> wrote:
Fine with me (but at the moment, your server is just added to the list, so it might not receive that much traffic).
Yup, I see it - it's a pretty complex setup you have :-)
Not really. Just grown over the years :-)
That would be my suggestion now - get rid of the forwarders. You have eliminated dnsmasq, but there are still 'no host found' in the log.
From what I currently see on mx1, each of the "Domain not found" reports is valid (there is really no domain, or the hostname of the sender address is wrong or does not exist).

What I am a bit curious about: I normally run at least a local caching DNS server on my MX - to avoid the extra round trips. In addition, as far as I know, none of our internal machines are using the MX for outgoing emails - so why should we rely on anna/elsa for our MX at all? My suggestion would be to run a reliable, caching DNS on MX1 & MX2, which is using external DNS either as forwarders or the root NS directly.

For anna/elsa, I think we can gather some statistics from bind now and see who is generating most of the queries and where we see broken external DNS.

BTW: your email setup on MX1,2 is way more complex than my named.conf ;-)

With kind regards, Lars
Lars Vogdt wrote:
That would be my suggestion now - get rid of the forwarders. You have eliminated dnsmasq, but there are still 'no host found' in the log.
From what I currently see on mx1, each of the "Domain not found" reports is valid (there is really no domain, resp. the hostname of the sender address is wrong or does not exist).
Hmm, try grep'ing the log for 'hrusecky.net':

  # host hrusecky.net
  hrusecky.net mail is handled by 20 alt2.aspmx.l.google.com.
  hrusecky.net mail is handled by 30 aspmx2.googlemail.com.
  hrusecky.net mail is handled by 30 aspmx3.googlemail.com.
  hrusecky.net mail is handled by 30 aspmx4.googlemail.com.
  hrusecky.net mail is handled by 10 aspmx.l.google.com.
  hrusecky.net mail is handled by 20 alt1.aspmx.l.google.com.

Still, today I see a lot less of 'Host not found', that is good.
What I am a bit curious about: I normally run at least a local caching DNS server on my MX - to avoid the extra round trips. In addition: as far as I know, none of our internal machines are using the MX for outgoing Emails - so why should we rely on anna/elsa for our MX at all?
mailman3 might be the only one?
My suggestion would be to run a reliable, caching DNS on MX1 & MX2, which is using external DNS either as forwarders or the root NS directly.
So far mx[12] have been using anna+elsa as resolvers. My personal preference is to avoid running a resolving DNS locally; I believe it is better to run one or two centrally, to benefit from caching of requests from many machines.
For anna/elsa, I think we can gather some statistics from bind now and see who is generating most of the queries and where we see broken external DNS.
BTW: your Email setup on MX1,2 is way more complex than my named.conf ;-)
Really?? :-) -- Per Jessen, Zürich (14.1°C) Member, openSUSE Heroes
On 12/05/2021 11.36, Per Jessen wrote:
So far mx[12] have been using anna+elsa as resolvers. My personal preference is to avoid running a resolving DNS locally; I believe it is better to run one or two centrally, to benefit from caching of requests from many machines.
MX1+2 could use a local caching resolver, that uses anna+elsa to forward cache misses. This way you get faster lookups and still have shared caches.
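A local caching resolver that forwards cache misses to anna+elsa could look roughly like this with unbound (a sketch; the two addresses are RFC 5737 placeholders, not the real anna/elsa IPs):

```
server:
    # answer only local queries from this machine
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow

forward-zone:
    name: "."
    # forward cache misses to the central resolvers
    forward-addr: 192.0.2.1   # anna (placeholder IP)
    forward-addr: 192.0.2.2   # elsa (placeholder IP)
```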
On Wed, 12 May 2021 11:49:29 +0200, "Bernhard M. Wiedemann" <bernhardout@lsmod.de> wrote:
On 12/05/2021 11.36, Per Jessen wrote:
So far mx[12] have been using anna+elsa as resolvers. My personal preference is to avoid running a resolving DNS locally; I believe it is better to run one or two centrally, to benefit from caching of requests from many machines.
MX1+2 could use a local caching resolver, that uses anna+elsa to forward cache misses. This way you get faster lookups and still have shared caches.
Yes: this is a setup I would expect.

I also see a lot of traffic reaching anna, while elsa is used far less. While we might be able to tune this a bit in the DNS setup, I think we should make more use of the resolver options on our internal machines:

  options attempts:1 timeout:1 rotate

would be my recommended line in /etc/resolv.conf for internal machines.

  attempts:1 -> switch to another nameserver if the 1st request fails
  timeout:1  -> switch to another nameserver if no answer arrives within 1 second
  rotate     -> rotate requests across the nameservers in the list

Without the line above, our clients' behavior is:
* wait 30 seconds if a DNS error occurs or the first DNS server in the list can not be reached
* requests always go to the first "nameserver" entry in the list - the other nameserver entries are only used in case of errors

IMHO this somehow cries to be managed via Salt, but so far I could not see that we have a "base" or "common" role defined in our Salt repo?

Regards, Lars
On 12/05/2021 13.38, Lars Vogdt wrote:
attempts:1 -> switch to another nameserver, if the 1st request fails
man resolv.conf says:

  attempts:n
      Sets the number of times the resolver will send a query to its
      name servers before giving up and returning an error to the
      calling application. The default is RES_DFLRETRY (currently 2,
      see <resolv.h>). The value for this option is silently capped
      to 5.

https://github.com/bminor/glibc/blob/master/resolv/res_send.c#L517 suggests that it always tries the next NS after the first failure. So with the default attempts:2 it does

  DNS1 DNS2 DNS1 DNS2

and with attempts:1 it does

  DNS1 DNS2

so it would fail earlier (after 2s instead of 4s) if all nameservers are unavailable, which should be fine as long as the network is reliable.
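The "2s instead of 4s" figure can be sanity-checked with simple arithmetic (a sketch that assumes a flat per-query timeout and ignores glibc's per-retry backoff):

```shell
# Worst-case time until the resolver gives up, with timeout:1 and
# two nameserver entries in /etc/resolv.conf.
timeout=1   # seconds per query (timeout:1)
nservers=2  # nameserver lines in /etc/resolv.conf
for attempts in 1 2; do
  # each attempt round queries every nameserver once
  echo "attempts:${attempts} -> worst case $((timeout * nservers * attempts))s"
done
```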
Hello, on Wednesday, 12 May 2021 13:38:01 CEST, Lars Vogdt wrote: [...]
would be my recommended line in /etc/resolv.conf for internal machines.
attempts:1 -> switch to another nameserver, if the 1st request fails timeout:1 -> switch to another nameserver, if not getting an answer after 1 second rotate -> rotate requests to the nameservers in the list
Makes sense, but maybe use attempts:2 to give the nameservers a second chance if they all fail in the first round (which is hopefully unlikely). [Since you asked about the forwarders in an earlier mail - I'm fine with getting rid of the forwarders and directly asking the root DNS servers. I even do that on my laptop ;-) and for servers it makes even more sense.]
IMHO this somehow cries to be managed via Salt, but so far I could not see that we have a "base" or "common" role defined in our Salt repo?
We have that ;-)

- pillar/common.sls
- pillar/virt_cluster/*.sls (for cluster-specific config, nameserver IPs might fit into this category)
- salt/role/base.sls (includes several profile.*)

AFAIK resolv.conf is not managed in salt yet - feel free to salt it ;-)

Regards, Christian Boltz
-- Yes, English can be weird. It can be understood through tough thorough thought, though. [https://twitter.com/iowahawkblog/status/594168269759623168]
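Salting resolv.conf could start from a state like this (a sketch only; the file name, search domain, and nameserver values are assumptions, not existing repo content):

```
# salt/profile/resolv.sls (hypothetical file name)
/etc/resolv.conf:
  file.managed:
    - user: root
    - group: root
    - mode: '0644'
    - contents: |
        search infra.opensuse.org
        options attempts:1 timeout:1 rotate
        # anna and elsa (placeholder IPs, not the real ones)
        nameserver 192.0.2.1
        nameserver 192.0.2.2
```

In practice the nameserver IPs would probably come from pillar (e.g. pillar/virt_cluster/*.sls, as Christian suggests) rather than being hardcoded.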
Sorry, I didn't know I was the source of some issues - I was migrating my nameservers over the last few days, so "Host not found" might have been a correct answer for hrusecky.net. On May 12, 2021 11:36:17 AM GMT+02:00, Per Jessen <per@opensuse.org> wrote:
Lars Vogdt wrote:
That would be my suggestion now - get rid of the forwarders. You have eliminated dnsmasq, but there are still 'no host found' in the log.
From what I currently see on mx1, each of the "Domain not found" reports is valid (there is really no domain, resp. the hostname of the sender address is wrong or does not exist).
Hmm, try grep'ing the log for 'hrusecky.net' :
  # host hrusecky.net
  hrusecky.net mail is handled by 20 alt2.aspmx.l.google.com.
  hrusecky.net mail is handled by 30 aspmx2.googlemail.com.
  hrusecky.net mail is handled by 30 aspmx3.googlemail.com.
  hrusecky.net mail is handled by 30 aspmx4.googlemail.com.
  hrusecky.net mail is handled by 10 aspmx.l.google.com.
  hrusecky.net mail is handled by 20 alt1.aspmx.l.google.com.
Still, today I see a lot less of 'Host not found', that is good.
What I am a bit curious about: I normally run at least a local caching DNS server on my MX - to avoid the extra round trips. In addition: as far as I know, none of our internal machines are using the MX for outgoing Emails - so why should we rely on anna/elsa for our MX at all?
mailman3 might be the only one?
My suggestion would be to run a reliable, caching DNS on MX1 & MX2, which is using external DNS either as forwarders or the root NS directly.
So far mx[12] have been using anna+elsa as resolvers. My personal preference is to avoid running a resolving DNS locally; I believe it is better to run one or two centrally, to benefit from caching of requests from many machines.
For anna/elsa, I think we can gather some statistics from bind now and see who is generating most of the queries and where we see broken external DNS.
BTW: your Email setup on MX1,2 is way more complex than my named.conf ;-)
Really?? :-)
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Michal Hrušecký wrote:
Sorry, I didn't know I was the source of some issues - I was migrating my nameservers over the last few days, so "Host not found" might have been a correct answer for hrusecky.net.
No problem - I only happened to recognise your address. It is only one of many that regularly has not been found -

  +------------+----------+
  | date(ts)   | count(*) |
  +------------+----------+
  | 2021-01-29 |        2 |
  | 2021-01-31 |        1 |
  | 2021-02-07 |        1 |
  | 2021-02-17 |        1 |
  | 2021-02-19 |        1 |
  | 2021-02-24 |        1 |
  | 2021-02-25 |        3 |
  | 2021-03-02 |        1 |
  | 2021-03-03 |        1 |
  | 2021-03-05 |        2 |
  | 2021-03-07 |        2 |
  | 2021-03-14 |        2 |
  | 2021-03-15 |        6 |
  | 2021-03-23 |        1 |
  | 2021-03-29 |        3 |
  | 2021-03-31 |        1 |
  | 2021-04-02 |        2 |
  | 2021-04-06 |        5 |
  | 2021-04-07 |        3 |
  | 2021-04-08 |        2 |
  | 2021-04-12 |        2 |
  | 2021-04-13 |        2 |
  | 2021-04-15 |        3 |
  | 2021-04-16 |        4 |
  | 2021-04-17 |        4 |
  | 2021-04-18 |        2 |
  | 2021-04-19 |        1 |
  | 2021-04-21 |        2 |
  | 2021-04-22 |        3 |
  | 2021-04-24 |        1 |
  | 2021-04-26 |        1 |
  | 2021-04-28 |        1 |
  | 2021-05-02 |        4 |
  | 2021-05-03 |        1 |
  | 2021-05-04 |        1 |
  | 2021-05-05 |        1 |
  | 2021-05-06 |        5 |
  | 2021-05-07 |        1 |
  | 2021-05-08 |        2 |
  | 2021-05-09 |        1 |
  | 2021-05-10 |        5 |
  | 2021-05-11 |        2 |
  +------------+----------+

-- Per Jessen, Zürich (14.2°C) Member, openSUSE Heroes
JFYI: https://monitor.opensuse.org/grafana/d/mvnQn-jGk/mail-metrics-copy?orgId=1&from=now-90d&to=now hopefully helps us a bit to see what happens on our mail servers. I did not check the forums machine yet, as this was not on my radar so far...

Regards, Lars
participants (5)
- Bernhard M. Wiedemann
- Christian Boltz
- Lars Vogdt
- Michal Hrušecký
- Per Jessen