Comment # 21 on bug 960153 from Jonathan Kang

(In reply to Jan Kara from comment #20)
> 
> So 'netconfig modify' did not manage to finish in the allocated 1s, so
> NetworkManager killed it. I'm a bit puzzled that I don't see further debug
> messages showing that resolv.conf was updated directly but whatever. The
> code is pretty clearly showing that if updating using netconfig fails, it
> does the update directly.

Yes. There isn't any debug message when NM directly writes to resolv.conf. And
by reading those debug messagesk, it's obvious that netconfig was killed by
NM.
> 
> So I dug a bit more into netconfig to see what it is doing. I've
> instrumented netconfig script with some debug messages and here is the
> output (messages should be relatively self-explanatory):
> 
> Entering netconfig  11:00:15:691579073
> Before modify  11:00:15:697381985
> Entering modify  11:00:15:698074098
> Done reading modify input  11:00:15:700162938
> Determining state  11:00:15:702411665
> Exiting from modify  11:00:15:703101298
> After modify  11:00:15:703873893
> Before modules  11:00:15:704495286
> Running module /etc/netconfig.d//dns-resolver  11:00:15:705215014
> dns-resolver begins  11:00:15:707300709
> dns-resolver parsed policy  11:00:15:717535930
> dns-resolver before resolv.conf  11:00:15:718404372
> dns-resolver file created  11:00:15:721682996
> dns-resolver header  11:00:15:723014609
> dns-resolver searchlist  11:00:15:723693154
> dns-resolver nameserver  11:00:15:724343748
> dns-resolver done  11:00:16:693133913
> <killed>
> 
> We can see that we spent most of the 1s time limit in
> netconfig_check_md5_and_move() (that is the only thing happening between
> 'nameserver' and 'done' messages in dns-resolver. During successful resume
> attempts when resolv.conf gets properly updated, I can see netconfig
> finishing just barely under 1s. So that explains why the problem sometimes
> happens and sometimes not.
> 
> Looking into netconfig_check_md5_and_move() function, I'm not hugely
> surprised that it takes close to 1s. That spawning of subshells, gawk
> processes, md5sum processes is going to take up some time when everything is
> resuming and fighting for CPU and disk. That being said I can investigate
> that more closely if you think it would be worth it.

These info are should be sufficient and very helpful. Thanks.
> 
> But overall it seems that the 1s timeout to run netconfig is just too short
> in some cases and bumping it up should fix my issues.

Quite right. 1 second seems to be not long enough.

This is the documentation of nm_utils_kill_child_sync() function call:
> Kill a child process synchronously and wait. The function first checks if
> the child already terminated and if it did, return the exit status.
> Otherwise send one @sig signal. @sig  will always be sent unless the child
> already exited. If the child does not exit within @wait_before_kill_msec
> milliseconds, the function will send %SIGKILL and waits for the child
> indefinitly. If @wait_before_kill_msec is zero, no %SIGKILL signal will be
> sent.

So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set
@wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit
itself).