(In reply to Jan Kara from comment #20) > > So 'netconfig modify' did not manage to finish in the allocated 1s, so > NetworkManager killed it. I'm a bit puzzled that I don't see further debug > messages showing that resolv.conf was updated directly but whatever. The > code is pretty clearly showing that if updating using netconfig fails, it > does the update directly. Yes. There isn't any debug message when NM directly writes to resolv.conf. And by reading those debug messagesk, it's obvious that netconfig was killed by NM. > > So I dug a bit more into netconfig to see what it is doing. I've > instrumented netconfig script with some debug messages and here is the > output (messages should be relatively self-explanatory): > > Entering netconfig 11:00:15:691579073 > Before modify 11:00:15:697381985 > Entering modify 11:00:15:698074098 > Done reading modify input 11:00:15:700162938 > Determining state 11:00:15:702411665 > Exiting from modify 11:00:15:703101298 > After modify 11:00:15:703873893 > Before modules 11:00:15:704495286 > Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014 > dns-resolver begins 11:00:15:707300709 > dns-resolver parsed policy 11:00:15:717535930 > dns-resolver before resolv.conf 11:00:15:718404372 > dns-resolver file created 11:00:15:721682996 > dns-resolver header 11:00:15:723014609 > dns-resolver searchlist 11:00:15:723693154 > dns-resolver nameserver 11:00:15:724343748 > dns-resolver done 11:00:16:693133913 > <killed> > > We can see that we spent most of the 1s time limit in > netconfig_check_md5_and_move() (that is the only thing happening between > 'nameserver' and 'done' messages in dns-resolver. During successful resume > attempts when resolv.conf gets properly updated, I can see netconfig > finishing just barely under 1s. So that explains why the problem sometimes > happens and sometimes not. > > Looking into netconfig_check_md5_and_move() function, I'm not hugely > surprised that it takes close to 1s. That spawning of subshells, gawk > processes, md5sum processes is going to take up some time when everything is > resuming and fighting for CPU and disk. That being said I can investigate > that more closely if you think it would be worth it. These info are should be sufficient and very helpful. Thanks. > > But overall it seems that the 1s timeout to run netconfig is just too short > in some cases and bumping it up should fix my issues. Quite right. 1 second seems to be not long enough. This is the documentation of nm_utils_kill_child_sync() function call: > Kill a child process synchronously and wait. The function first checks if > the child already terminated and if it did, return the exit status. > Otherwise send one @sig signal. @sig will always be sent unless the child > already exited. If the child does not exit within @wait_before_kill_msec > milliseconds, the function will send %SIGKILL and waits for the child > indefinitly. If @wait_before_kill_msec is zero, no %SIGKILL signal will be > sent. So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set @wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit itself).