http://bugzilla.suse.com/show_bug.cgi?id=960153 http://bugzilla.suse.com/show_bug.cgi?id=960153#c21 --- Comment #21 from Jonathan Kang <sckang@suse.com> --- (In reply to Jan Kara from comment #20)
So 'netconfig modify' did not manage to finish in the allocated 1s, so NetworkManager killed it. I'm a bit puzzled that I don't see further debug messages showing that resolv.conf was updated directly but whatever. The code is pretty clearly showing that if updating using netconfig fails, it does the update directly.
Yes. There isn't any debug message when NM directly writes to resolv.conf. And by reading those debug messagesk, it's obvious that netconfig was killed by NM.
So I dug a bit more into netconfig to see what it is doing. I've instrumented netconfig script with some debug messages and here is the output (messages should be relatively self-explanatory):
Entering netconfig 11:00:15:691579073 Before modify 11:00:15:697381985 Entering modify 11:00:15:698074098 Done reading modify input 11:00:15:700162938 Determining state 11:00:15:702411665 Exiting from modify 11:00:15:703101298 After modify 11:00:15:703873893 Before modules 11:00:15:704495286 Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014 dns-resolver begins 11:00:15:707300709 dns-resolver parsed policy 11:00:15:717535930 dns-resolver before resolv.conf 11:00:15:718404372 dns-resolver file created 11:00:15:721682996 dns-resolver header 11:00:15:723014609 dns-resolver searchlist 11:00:15:723693154 dns-resolver nameserver 11:00:15:724343748 dns-resolver done 11:00:16:693133913 <killed>
We can see that we spent most of the 1s time limit in netconfig_check_md5_and_move() (that is the only thing happening between 'nameserver' and 'done' messages in dns-resolver. During successful resume attempts when resolv.conf gets properly updated, I can see netconfig finishing just barely under 1s. So that explains why the problem sometimes happens and sometimes not.
Looking into netconfig_check_md5_and_move() function, I'm not hugely surprised that it takes close to 1s. That spawning of subshells, gawk processes, md5sum processes is going to take up some time when everything is resuming and fighting for CPU and disk. That being said I can investigate that more closely if you think it would be worth it.
These info are should be sufficient and very helpful. Thanks.
But overall it seems that the 1s timeout to run netconfig is just too short in some cases and bumping it up should fix my issues.
Quite right. 1 second seems to be not long enough. This is the documentation of nm_utils_kill_child_sync() function call:
Kill a child process synchronously and wait. The function first checks if the child already terminated and if it did, return the exit status. Otherwise send one @sig signal. @sig will always be sent unless the child already exited. If the child does not exit within @wait_before_kill_msec milliseconds, the function will send %SIGKILL and waits for the child indefinitly. If @wait_before_kill_msec is zero, no %SIGKILL signal will be sent.
So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set @wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit itself). -- You are receiving this mail because: You are on the CC list for the bug.