[Bug 960153] New: Network manager and DNS forwarders don't work together
http://bugzilla.suse.com/show_bug.cgi?id=960153

            Bug ID: 960153
           Summary: Network manager and DNS forwarders don't work together
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 42.1
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Network
          Assignee: bnc-team-screening@forge.provo.novell.com
          Reporter: jack@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created attachment 660275
  --> http://bugzilla.suse.com/attachment.cgi?id=660275&action=edit
My /etc/sysconfig/network/config file

I'm using NetworkManager to manage my network connection, including updating /etc/resolv.conf (since this is a laptop which I connect at different places). As I need to use openvpn to connect to the SUSE internal network, I'm using dnsmasq to forward resolution of SUSE domain names to SUSE name servers, and the rest is resolved through the local name server. This used to work fine with 13.2.

Now I have updated to 42.1 and my resolv.conf contains:

    # Generated by NetworkManager
    search suse.cz suse.de
    nameserver 192.168.2.1

When 'netconfig update' is run, it complains that resolv.conf has been updated manually and so it won't touch it. I have to run 'netconfig update -f' to force updating /etc/resolv.conf and then my DNS setup starts working again.

So apparently some cooperation between NetworkManager and netconfig has been broken with 42.1...

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=960153
Chenzi Cao
http://bugzilla.suse.com/show_bug.cgi?id=960153
Jean Delvare
http://bugzilla.suse.com/show_bug.cgi?id=960153
Ludwig Nussel
http://bugzilla.suse.com/show_bug.cgi?id=960153
Bugzilla Account Maintenance <autobugz> changed:
What |Removed |Added
----------------------------------------------------------------------------
Assignee|rlmu@suse.com |bnc-team-screening@forge.pr
| |ovo.novell.com
Chenzi Cao
http://bugzilla.suse.com/show_bug.cgi?id=960153
Ludwig Nussel
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c3
Jonathan Kang
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c4
Ludwig Nussel
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c5
--- Comment #5 from Jonathan Kang
Unfortunately, that is not how it works in openSUSE. We have netconfig; NetworkManager is meant to provide the information to netconfig, which in turn takes the necessary steps to update e.g. /etc/resolv.conf. I guess some patch that detects use of netconfig got lost?
Ah. You are right.

(In reply to Jan Kara from comment #0)
Created attachment 660275 [details] My /etc/sysconfig/network/config file
I'm using NetworkManager to manage my network connection, including updating /etc/resolv.conf (since this is a laptop which I connect at different places). As I need to use openvpn to connect to the SUSE internal network, I'm using dnsmasq to forward resolution of SUSE domain names to SUSE name servers, and the rest is resolved through the local name server. This used to work fine with 13.2.
Now I have updated to 42.1 and my resolv.conf contains:

    # Generated by NetworkManager
    search suse.cz suse.de
    nameserver 192.168.2.1

When 'netconfig update' is run, it complains that resolv.conf has been updated manually and so it won't touch it. I have to run 'netconfig update -f' to force updating /etc/resolv.conf and then my DNS setup starts working again.
So apparently some cooperation between NetworkManager and netconfig has been broken with 42.1...
Try adding "rc-manager=netconfig" in the main section of NetworkManager.conf to see if it fixes this problem.
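For illustration, the suggested change can be sketched as a small script. This is only a sketch under assumptions: the helper name set_option is made up here, and Python's configparser merely approximates the key file format that NetworkManager.conf uses (it drops comments when rewriting the file, for instance), so editing the file by hand works just as well.

```python
import configparser
import io


def set_option(conf_text, section, key, value):
    """Return conf_text with key=value set in [section].

    Hypothetical helper, not part of NetworkManager or netconfig.
    Note that comments in the original text are not preserved.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the case of keys as written
    parser.read_string(conf_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, value)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    sample = "[main]\nplugins=keyfile\n"  # made-up file contents
    print(set_option(sample, "main", "rc-manager", "netconfig"))
```

The result still has to be written back to NetworkManager.conf, and NetworkManager has to be restarted before it reads the changed file.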
http://bugzilla.suse.com/show_bug.cgi?id=960153
Jean Delvare
http://bugzilla.suse.com/show_bug.cgi?id=960153
Jean Delvare
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c6
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c8
--- Comment #8 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c9
--- Comment #9 from Jonathan Kang
Actually, the issue does not appear to be quite fixed. After I changed the settings and rebooted the machine, netconfig created the correct /etc/resolv.conf (I'm using 'Auto eth0' for the network connection). However, this morning when I resumed the machine from suspend-to-RAM and NetworkManager reconnected, /etc/resolv.conf was suddenly replaced by the problematic one (i.e., the one generated directly by NM).
The new issue you mentioned seems similar to the one reported in the Red Hat Bugzilla [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373485
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c10
--- Comment #10 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c11
--- Comment #11 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
Ludwig Nussel
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c13
--- Comment #13 from Jonathan Kang
Created attachment 710597 [details] Journal log from the resume
Here are interesting system messages from the reboot (full messages attached).
I suppose these logs are from a suspend rather than a reboot. I used your /etc/sysconfig/network/config file to replace the one on my test machine, connected to the VPN using vpnc, and ran "netconfig update". It did not complain that /etc/resolv.conf had been updated manually. Did I miss anything? Can you elaborate on how to reproduce this bug?
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c14
--- Comment #14 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c15
--- Comment #15 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c16
Jonathan Kang
Please add

    [logging]
    level=DEBUG

to /etc/NetworkManager/NetworkManager.conf and then attach the output of "journalctl -b -u NetworkManager". Thanks.
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c17
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c18
--- Comment #18 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c19
--- Comment #19 from Jonathan Kang
Created attachment 711392 [details] Debug messages from network manager
Here are the requested debug messages from NetworkManager. The last two resume attempts resulted in NM directly modifying /etc/resolv.conf. The attempt before that did not.
Sorry for missing a key point. You need to restart NetworkManager for the newly added configuration to take effect.
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c20
--- Comment #20 from Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c21
--- Comment #21 from Jonathan Kang
So 'netconfig modify' did not manage to finish in the allocated 1s, so NetworkManager killed it. I'm a bit puzzled that I don't see further debug messages showing that resolv.conf was updated directly, but whatever. The code pretty clearly shows that if updating through netconfig fails, NetworkManager does the update directly.
Yes. There isn't any debug message when NM writes directly to resolv.conf. And from reading those debug messages, it's obvious that netconfig was killed by NM.
So I dug a bit more into netconfig to see what it is doing. I've instrumented netconfig script with some debug messages and here is the output (messages should be relatively self-explanatory):
    Entering netconfig 11:00:15:691579073
    Before modify 11:00:15:697381985
    Entering modify 11:00:15:698074098
    Done reading modify input 11:00:15:700162938
    Determining state 11:00:15:702411665
    Exiting from modify 11:00:15:703101298
    After modify 11:00:15:703873893
    Before modules 11:00:15:704495286
    Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014
    dns-resolver begins 11:00:15:707300709
    dns-resolver parsed policy 11:00:15:717535930
    dns-resolver before resolv.conf 11:00:15:718404372
    dns-resolver file created 11:00:15:721682996
    dns-resolver header 11:00:15:723014609
    dns-resolver searchlist 11:00:15:723693154
    dns-resolver nameserver 11:00:15:724343748
    dns-resolver done 11:00:16:693133913
    <killed>
We can see that we spent most of the 1s time limit in netconfig_check_md5_and_move() (that is the only thing happening between the 'nameserver' and 'done' messages in dns-resolver). During successful resume attempts, when resolv.conf gets properly updated, I can see netconfig finishing just barely under 1s. So that explains why the problem sometimes happens and sometimes does not.
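The gap can be checked mechanically. The following is a minimal sketch, assuming that each timestamp follows its message and has the form HH:MM:SS:nanoseconds; the function names are made up for this illustration and are not part of netconfig.

```python
# Instrumented netconfig output as quoted in this report.
LOG = """\
Entering netconfig 11:00:15:691579073
Before modify 11:00:15:697381985
Entering modify 11:00:15:698074098
Done reading modify input 11:00:15:700162938
Determining state 11:00:15:702411665
Exiting from modify 11:00:15:703101298
After modify 11:00:15:703873893
Before modules 11:00:15:704495286
Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014
dns-resolver begins 11:00:15:707300709
dns-resolver parsed policy 11:00:15:717535930
dns-resolver before resolv.conf 11:00:15:718404372
dns-resolver file created 11:00:15:721682996
dns-resolver header 11:00:15:723014609
dns-resolver searchlist 11:00:15:723693154
dns-resolver nameserver 11:00:15:724343748
dns-resolver done 11:00:16:693133913
"""


def parse_line(line):
    """Split 'message HH:MM:SS:NNNNNNNNN' into (message, nanoseconds)."""
    label, stamp = line.rsplit(" ", 1)
    hours, minutes, seconds, nanos = (int(part) for part in stamp.split(":"))
    return label, ((hours * 60 + minutes) * 60 + seconds) * 1_000_000_000 + nanos


def gaps(log):
    """Return (from_message, to_message, seconds) for consecutive messages."""
    events = [parse_line(line) for line in log.splitlines() if line.strip()]
    return [(a[0], b[0], (b[1] - a[1]) / 1e9) for a, b in zip(events, events[1:])]


def slowest_step(log):
    """Return the pair of consecutive messages with the largest gap."""
    return max(gaps(log), key=lambda gap: gap[2])


if __name__ == "__main__":
    start, end, seconds = slowest_step(LOG)
    print(f"{start} -> {end}: {seconds:.3f}s")
```

On this log the largest gap is the one between 'dns-resolver nameserver' and 'dns-resolver done', roughly 0.97s of the 1s budget.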
Looking into the netconfig_check_md5_and_move() function, I'm not hugely surprised that it takes close to 1s. All that spawning of subshells, gawk processes, and md5sum processes is going to take some time when everything is resuming and fighting for CPU and disk. That being said, I can investigate that more closely if you think it would be worth it.
This information should be sufficient and is very helpful. Thanks.
But overall it seems that the 1s timeout to run netconfig is just too short in some cases and bumping it up should fix my issues.
Quite right. 1 second does not seem to be long enough. This is the documentation of the nm_utils_kill_child_sync() function:
Kill a child process synchronously and wait. The function first checks if the child already terminated and if it did, return the exit status. Otherwise send one @sig signal. @sig will always be sent unless the child already exited. If the child does not exit within @wait_before_kill_msec milliseconds, the function will send %SIGKILL and waits for the child indefinitely. If @wait_before_kill_msec is zero, no %SIGKILL signal will be sent.
So we can set @wait_before_kill_msec to a bigger value (> 1 second). Or set @wait_before_kill_msec to 0 (we don't try to kill netconfig, we let it exit by itself).
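To make the two options concrete, here is a rough Python analogue of the behaviour described in that documentation. It is a sketch under assumptions, not NetworkManager's code: kill_child_sync and spawn_stubborn_child are made-up names, and the stubborn child merely stands in for a process that does not exit when signalled.

```python
import signal
import subprocess
import sys


def kill_child_sync(child, sig=signal.SIGTERM, wait_before_kill_msec=1000):
    """Signal a child and wait for it, escalating to SIGKILL on timeout.

    Returns the child's return code (negative signal number if it was
    terminated by a signal). With wait_before_kill_msec == 0 no SIGKILL
    is ever sent and the wait is unbounded.
    """
    status = child.poll()
    if status is not None:
        return status  # already exited, nothing is sent
    child.send_signal(sig)
    if wait_before_kill_msec == 0:
        return child.wait()  # let the child exit by itself
    try:
        return child.wait(timeout=wait_before_kill_msec / 1000)
    except subprocess.TimeoutExpired:
        child.kill()  # SIGKILL on POSIX
        return child.wait()


def spawn_stubborn_child():
    """Start a child that ignores SIGTERM and sleeps for a while."""
    code = (
        "import signal, time; "
        "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
        "print('ready', flush=True); time.sleep(30)"
    )
    child = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the handler is installed
    return child


if __name__ == "__main__":
    child = spawn_stubborn_child()
    print(kill_child_sync(child, signal.SIGTERM, 200))
```

With a larger timeout a slow child gets more time before the SIGKILL; with zero it is never force-killed, at the price of the caller blocking until the child exits.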
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c22
Jonathan Kang
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c23
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c24
--- Comment #24 from Jonathan Kang
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c25
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c26
Jan Kara
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c27
--- Comment #27 from Jonathan Kang
So far no problems. I'd consider the problem fixed...
Thanks for it. I'll submit the fix then.
http://bugzilla.suse.com/show_bug.cgi?id=960153
Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c29
Jonathan Kang
http://bugzilla.suse.com/show_bug.cgi?id=960153
Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c30
--- Comment #30 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c31
--- Comment #31 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c32
--- Comment #32 from Bernhard Wiedemann
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c33
Olaf Hering
So we can set @wait_before_kill_msec to a bigger value (> 1 second). Or set @wait_before_kill_msec to 0 (we don't try to kill netconfig, we let it exit by itself).
As can be seen from bug reports, calling netconfig is not an atomic operation. A timeout may lead to corrupted config files.
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c34
--- Comment #34 from Jonathan Kang
As can be seen from bug reports, calling netconfig is not an atomic operation. A timeout may lead to corrupted config files.
Yes, calling netconfig is not an atomic operation. Maybe we should wait until netconfig exits by itself.
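As background on the atomicity concern: a common way to keep an interrupted writer from leaving a half-written file is to write to a temporary file in the same directory and rename it over the target. The sketch below shows that general pattern only. It is not a description of what netconfig does, and atomic_write is a made-up name.

```python
import os
import tempfile


def atomic_write(path, text):
    """Replace path with text so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "resolv.conf")
    atomic_write(target, "nameserver 192.168.2.1\n")
    print(open(target).read(), end="")
```

If the writer is killed before the rename, the target keeps its previous contents and only a stray temporary file is left behind.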
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c35
--- Comment #35 from Bernhard Wiedemann
http://bugzilla.suse.com/show_bug.cgi?id=960153
Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c37
Jonathan Kang
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c39
--- Comment #39 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
http://bugzilla.suse.com/show_bug.cgi?id=960153#c40
--- Comment #40 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=960153
Swamp Workflow Management