[Bug 337075] New: NTP daemon often fails to start under 10.3
https://bugzilla.novell.com/show_bug.cgi?id=337075
Summary: NTP daemon often fails to start under 10.3
Product: openSUSE 10.3
Version: Final
Platform: Other
OS/Version: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Basesystem
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: martin.burnicki@meinberg.de
QAContact: qa@suse.de
Found By: Customer
The startup procedure for the NTP service in openSUSE 10.3 urgently needs some
cleanup. After booting, the start script seems to be run several times, the
first time even before the network interface is up.
If ntpd is finally started then it may stop itsel
If there is an initial time offset which has to be adjusted then running
ntpdate before starting ntpd even yields an error m
The first time, ntpdate is called just after the loopback interface has been set
up, but before eth0 is up. From the syslog:
----------------------------------------------------------------------------
Oct 25 13:03:18 pc-martin5 kernel: Mobile IPv6
Oct 25 13:03:21 pc-martin5 ifup: lo
Oct 25 13:03:21 pc-martin5 ifup: lo
Oct 25 13:03:21 pc-martin5 ifup: IP address: 127.0.0.1/8
Oct 25 13:03:21 pc-martin5 ifup:
Oct 25 13:03:22 pc-martin5 ntpdate[2881]: can't find host 0.pool.ntp.org
Oct 25 13:03:22 pc-martin5 ntpdate[2881]: can't find host 1.pool.ntp.org
Oct 25 13:03:22 pc-martin5 ntpdate[2881]: can't find host 2.pool.ntp.org
Oct 25 13:03:22 pc-martin5 ntpdate[2881]: no servers can be used, exiting
Oct 25 13:03:23 pc-martin5 ifup: eth0 device: Realtek Semiconductor \
Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
Oct 25 13:03:24 pc-martin5 kernel: r8169: eth0: link up
Oct 25 13:03:24 pc-martin5 ifup-dhcp: eth0 (DHCP)
Oct 25 13:03:24 pc-martin5 ifup-dhcp: IP/Netmask: 172.16.3.106
----------------------------------------------------------------------------
So it's not surprising that ntpdate is unable to find any host, nor to use any
server. The corresponding lines from boot.msg:
----------------------------------------------------------------------------
Setting up hostname 'pc-martin5'done
Setting up loopback interface lo
lo IP address: 127.0.0.1/8
Checking for network time protocol daemon (NTPD): unused
Can't determine current runlevel
done
System Boot Control: The system has been set up
----------------------------------------------------------------------------
The above happens in runlevel N.
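For illustration, the ordering problem visible in the syslog excerpt above (ntpdate logging before eth0 reports "link up") could be checked mechanically. The following Python sketch is not part of the report; the function names are made up for this example, and it assumes the classic "Mon DD HH:MM:SS host tag[pid]: message" syslog layout shown above:

```python
import re

# Matches classic syslog lines such as
# "Oct 25 13:03:22 pc-martin5 ntpdate[2881]: can't find host 0.pool.ntp.org"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def first_index(lines, tag, needle=""):
    """Index of the first line logged by `tag` whose message contains `needle`."""
    for i, line in enumerate(lines):
        m = LINE_RE.match(line)
        if m and m.group("tag") == tag and needle in m.group("msg"):
            return i
    return None


def ran_before_link_up(lines, program="ntpdate", iface="eth0"):
    """True if `program` logged anything before the kernel reported `iface` link up."""
    prog_at = first_index(lines, program)
    link_at = first_index(lines, "kernel", f"{iface}: link up")
    if prog_at is None:
        return False  # the program never ran at all
    # If the link never came up, the program certainly ran too early.
    return link_at is None or prog_at < link_at
```

Fed with the lines from the excerpt, ran_before_link_up() reports that ntpdate ran before the eth0 link was up. It only looks at the order of the lines, not at the timestamps.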
The next time the NTP startup script is run is when runlevel 5 is entered,
after e.g. xinetd has been started:
----------------------------------------------------------------------------
Oct 25 13:03:28 pc-martin5 xinetd[3280]: Started working: 2 available services
Oct 26 13:03:30 pc-martin5 ntpdate[3209]: step time server 85.25.139.186 \
offset 86400.268576 sec
Oct 26 13:03:30 pc-martin5 syslog-ng[2221]: STATS: dropped 0
Oct 26 13:03:30 pc-martin5 ntpd[3298]: ntpd 4.2.4p3@1.1502-o Fri Sep 21 \
21:36:25 UTC 2007 (1)
Oct 26 13:03:30 pc-martin5 ntpd[3299]: precision = 1.000 usec
Oct 26 13:03:30 pc-martin5 ntpd[3299]: ntp_io: estimated max descriptors: \
1024, initial socket boundary: 16
Oct 26 13:03:30 pc-martin5 ntpd[3299]: unable to bind to wildcard socket \
address 0.0.0.0 - another process may be \
running - EXITING
Oct 26 13:03:31 pc-martin5 ntpdate[3176]: step time server 81.169.172.219 \
offset -0.004415 sec
Oct 26 13:03:31 pc-martin5 /usr/sbin/cron[3403]: (CRON) STARTUP (V5.0)
----------------------------------------------------------------------------
In this case ntpd is started while either ntpdate or an ntpd from a previous call
is still running. The messages "unable to bind to wildcard socket" and "EXITING"
are logged if the NTP daemon is unable to open its well-known port 123, which is
normally the case if another ntpd is already running, or if ntpdate has not yet
finished. Unless run with the -q, -d, or -u parameter, ntpdate also opens port
123 to send request packets to an upstream NTP server. This is intentional, to
keep ntpdate from changing the system time if ntpd is already running.
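The bind failure described above is easy to reproduce outside of ntpd. As a rough sketch (Python, with a helper name invented for this example, not anything taken from ntpd or the rc script), a start script could probe whether a UDP port is still held by another process before launching the daemon:

```python
import socket


def udp_port_is_free(port, host="0.0.0.0"):
    """Try to bind a UDP socket to (host, port); return False if the bind fails.

    Note: binding port 123 requires root privileges, so for an unprivileged
    caller this returns False for port 123 even if nothing is listening.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Covers "address already in use" as well as "permission denied".
        return False
    finally:
        sock.close()
    return True
```

Run as root, udp_port_is_free(123) would return False for as long as an earlier ntpd or ntpdate still holds the port. There is an unavoidable race between such a check and the actual start of ntpd, so this only narrows the window and is no replacement for starting the daemon exactly once.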
So once runlevel 5 has been reached ntpd is _not_ running anymore, even though
the startup messages on the splash screen suggest that everything is OK.
Interestingly, ntpd seems to keep running if the initial time offset to be
adjusted by ntpdate is small. So a safe way to reproduce the behaviour seems to
be to change the RTC time (e.g. in the BIOS setup) before the OS is booted, so
that the initial time difference exceeds the sanity limit of ntpd, which should
normally be corrected by ntpdate before ntpd is started.
BTW, maybe it would be a good idea to use the new features of ntpd which
make running ntpdate obsolete, at least for this purpose.
I've tested this on a fresh installation with recent online repos. The machine
was an x86_64, but that should not matter.
Martin
Mark Gordon
https://bugzilla.novell.com/show_bug.cgi?id=337075#c1
Mark Gordon
https://bugzilla.novell.com/show_bug.cgi?id=337075#c2
Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c3
--- Comment #3 from Martin Burnicki
I don't really understand. xntp starts too early because the init script checks whether the network is up. NetworkManager reports this as soon as it has been started, not when the interfaces are up, which is annoying! But after each interface up/down xntp is restarted, so it stops and starts like before, only with the network interface up.
As already mentioned, this is a fresh installation of 10.3 on a workstation, so NetworkManager is not being used, in accordance with the proposal made by the installation routine.
I agree that it's basically just annoying if ntpdate/ntpd are run unsuccessfully before the network interfaces are up. Normally this should just result in an error message, and when ntpdate/ntpd are run the next time, after all interfaces are up, they should succeed. Apparently startup of ntpd even does succeed in some cases. However, in some cases it does not.
BTW, this problem has also been discussed on the opensuse-de mailing list on opensuse.org a few days ago. See:
http://lists.opensuse.org/cgi-bin/search.cgi?query=startet+beim+Booten+nicht
I don't know if you speak German. However, your name suggests you might be a native German ;-)
From what I've observed it seems that at least ntpd does not start successfully if there is an initial time offset which has to be compensated.
In this case the last start of ntpd occurs while either ntpdate or an ntpd from an earlier start is still running. NTP port 123 is then still open, so ntpd cannot bind to it and terminates itself with the appropriate error message.
Did you know that recent versions of ntpd support dynamic interfaces? I.e. you could start ntpd before all network interfaces are up. Ntpd then does a cyclic rescan and binds to any new interfaces. Ntpdate, however, does not support dynamic scanning of interfaces, so it will always fail the first time it is run (before the interfaces are up), and may succeed the second time.
I could even imagine this is the basic problem here. As already mentioned, the ntpdate/ntpd pair is run twice: the first time when initial network support (i.e. the loopback interface) is available, and the second time when full network support is available and the final runlevel is entered. I have not yet run more checks, but there are a few potential problems.
Imagine there is some initial time offset when ntpd is run the first time. Since no network interfaces are up, ntpdate cannot query its upstream servers and thus cannot compensate the initial time offset. Immediately after ntpdate has been run, ntpd is started the first time, also while the network is not yet fully up. Ntpd could stay in memory due to the new feature of dynamic interface scanning. However, it would have to wait until it could bind to the network interfaces in order to reach its upstream servers and determine its own time offset. At that point the large initial time offset would still be there, since ntpdate could not compensate it. If at this stage ntpd determines that the initial time offset exceeds the sanity limit (~1000 seconds), it would stop itself with an appropriate error message. This could be avoided by adding the -g option to ntpd's command line, which lets ntpd accept an initial offset exceeding that sanity limit.
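To make the sanity-limit reasoning above concrete, here is a deliberately simplified model in Python. It is only a sketch: the function name is invented, the threshold is the ~1000 seconds mentioned above, and the real ntpd logic has more states than this (for example, -g only covers the first adjustment after startup).

```python
# The "sanity limit (~1000 seconds)" mentioned above; assumed to be the
# value in effect, it is not read from any real configuration here.
PANIC_THRESHOLD_S = 1000.0


def ntpd_would_exit(initial_offset_s, started_with_g=False):
    """Simplified model: does ntpd give up on the first measured offset?

    Without -g an offset beyond the sanity limit makes ntpd stop itself;
    with -g the first offset is accepted regardless of its size.
    """
    if started_with_g:
        return False
    return abs(initial_offset_s) > PANIC_THRESHOLD_S
```

With the offsets from the syslog in the initial report, ntpd_would_exit(86400.268576) is True unless -g is given, while the later offset of -0.004415 seconds is harmless either way.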
So when the final runlevel is entered, the first instance of ntpd could still be running. In this case ntpdate might not work either, because ntpd has port 123 open. On the other hand, ntpdate could run successfully if it happens to run before ntpd has detected the newly appeared network interfaces and bound to them.
Anyway, now the second instance of ntpd is started while the first instance may still be running. This would cause the error message mentioned above.
A quick workaround might be to kill existing instances of ntpd before the rcntp script runs ntpdate and then starts ntpd. However, the proper solution might be to avoid ntpdate and run ntpd only once, with the -g option to let it accept large initial time offsets. If ntpdate is really to be used, it does not make sense to run it before the network interfaces are up.
Additionally, the "iburst" keyword should be added to all server lines in ntp.conf anyway, in order to speed up synchronization.
Martin
--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
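As an illustration of the "iburst" suggestion in the comment above, the following Python sketch appends the keyword to every "server" line of an ntp.conf text that does not have it yet. The function name is invented for this example, and skipping 127.127.x.x addresses (reference clocks such as the local clock) is an assumption of the sketch, not something requested in this thread.

```python
def add_iburst(conf_text):
    """Return ntp.conf text with "iburst" appended to network server lines."""
    out = []
    for line in conf_text.splitlines():
        # Keep a trailing "# comment" apart from the directive itself.
        body, sep, comment = line.partition("#")
        fields = body.split()
        if (
            len(fields) >= 2
            and fields[0] == "server"
            and not fields[1].startswith("127.127.")  # reference clock, skip
            and "iburst" not in fields[2:]
        ):
            body = body.rstrip() + " iburst"
            line = body + (" " + sep + comment if sep else "")
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")
```

Lines that are commented out, lines that already carry the keyword, and non-server directives are passed through unchanged.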
https://bugzilla.novell.com/show_bug.cgi?id=337075#c4
--- Comment #4 from Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c5
Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c6
--- Comment #6 from Martin Burnicki
1) (yes, I'm a native German) the person in the thread you pointed to above used NetworkManager, so finding the bug is simple :-D
2) comment #1 refers partly to the nm-ntp problem, but the initial problem is the same as here.
3) I'm not sure which kind of network management tool is used as standard (nm or ifup/ifdown)
From my observations nm is used as default on laptops whereas ifup/ifdown is preferred on workstations/servers. This makes sense, IMHO.
4) if you type in rcntp restart the init script will stop and start the service. the stop contains a killproc in line #211. the "old" ntp should not listen on the device/port.
Yes, I know. If I type "rcntp restart" after the final runlevel has been reached then ntpd starts up and works fine, even if ntpd has not been running before.
5) ntp will be restarted _only_ with networkmanager enabled. please check this with # grep NETWORKMANAGER /etc/sysconfig/network/config
pc-martin5:~ # grep NETWORKMANAGER /etc/sysconfig/network/config
NETWORKMANAGER=no
# This variable has no effect if NETWORKMANAGER=no
# This variable has no effect if NETWORKMANAGER=no
The output above is from the same system as the messages in my initial report, so this clearly indicates that it is started twice even though networkmanager is _not_ used.
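The grep check above can also be done programmatically. This is a small Python sketch under the assumption that /etc/sysconfig/network/config uses plain KEY=value or KEY="value" lines, as the grep output suggests; both function names are made up for this example.

```python
def sysconfig_value(text, key):
    """Return the value assigned to `key` in sysconfig-style text, or None."""
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            # As in a shell, a later assignment overrides an earlier one.
            value = rest.strip().strip('"').strip("'")
    return value


def uses_networkmanager(text):
    """True only if NETWORKMANAGER is explicitly set to yes."""
    return (sysconfig_value(text, "NETWORKMANAGER") or "no").lower() == "yes"
```

For the output shown above, uses_networkmanager() returns False; the two comment lines that merely mention NETWORKMANAGER=no are ignored.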
6) I'll take a look at the "-g" option, but I'm working on another solution to fix the nm-ntp problem. this bugfix should fix your problem (if you are not using nm), too.
I'm involved in the development of the NTP package (search for "burnicki" in NTP's bugzilla), though I'm mainly testing and trying to find/fix bugs. So if you let me know some details I can give you some hints or comments from NTP's point of view ...
https://bugzilla.novell.com/show_bug.cgi?id=337075#c7
--- Comment #7 from Michael Skibbe
I'm involved in the development of the NTP package (search for "burnicki" in NTP's bugzilla), though I'm mainly testing and trying to find/fix bugs. So if you let me know some details I can give you some hints or comments from NTP's point of view ...
We want to start ntp without servers, so that local time clocks are managed by ntp even if you don't have a single server you can reach. After the status of a network interface changes we want to dynamically load servers into the running ntpd.
https://bugzilla.novell.com/show_bug.cgi?id=337075#c8
Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c9
Martin Burnicki
https://bugzilla.novell.com/show_bug.cgi?id=337075#c10
Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c11
Martin Burnicki
can you please test these packages? [...] but please keep in mind - these are only testing packages without warranty.
Of course ...
Hm, I've uninstalled the old xntp* packages and removed all config files, then installed the new ones from the links you gave me (fine that they are now called ntp* rather than xntp*, BTW).
Looking into ntp.conf I saw that servers should now be configured using "rcntp addserver", and would be saved in /etc/ntp.server, so I ran "rcntp addserver 0.pool.ntp.org". At this time ntpd was not yet running, so I got an error saying "ntpdc: read: connection refused". Maybe the rcntp script should check whether ntpd is running before calling ntpdc.
Then I started ntpd by entering "rcntp start", and tried to add another pool server by typing "rcntp addserver 1.pool.ntp.org". Though the server entry was added to ntp.servers, I got an error message saying "***Permission denied", and the output of "ntpq -p" indicated that ntpd did not use those servers but only the local clock.
Permission problems made me think of apparmor, so I stopped apparmor and then tried to add a 3rd pool server. This worked correctly, and the 3rd pool server was displayed in the "ntpq -p" billboard. So the apparmor rules need to be modified.
After reboot apparmor was running again, and I saw an "audit:" entry concerning permission problems with "ntp.keys" in the syslog. I'm assuming this entry has been created by apparmor. After I had disabled apparmor permanently the "audit:" message did not appear anymore in the syslog after reboot.
Unfortunately the pool servers were not being used, neither after "rcntp restart" nor after stopping/starting ntpd, and also after a reboot I only saw the local clock listed as used by ntpd.
BTW, maybe "rcntp status" could also show the output of "ntpq -p", like this is done by other rc scripts?
I also tried to configure the pool servers using yast, but obviously this also needs some modifications in order to read the configured servers from, and save them to, ntp.server.
So I think a few small corrections are still required. If I should do more testing, just let me know.
Martin
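The suggestion in the comment above, that rcntp check whether ntpd is running before calling ntpdc, could look roughly like this. It is a Python sketch rather than the shell of the real rc script; the function name is invented, and the default PID file path is the one visible in the ntpd command line quoted elsewhere in this thread.

```python
import os


def ntpd_is_running(pid_file="/var/run/ntp/ntpd.pid"):
    """Best-effort check: does the PID recorded in `pid_file` still exist?"""
    try:
        with open(pid_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # no PID file, unreadable, or not a number
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except ProcessLookupError:
        return False  # stale PID file
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

A wrapper around "rcntp addserver" could then skip the ntpdc call, or print a clearer message, when this returns False. Note that it only proves that some process with that PID exists, not that it is ntpd.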
https://bugzilla.novell.com/show_bug.cgi?id=337075#c12
Michael Skibbe
https://bugzilla.novell.com/show_bug.cgi?id=337075#c13
Martin Burnicki
https://bugzilla.novell.com/show_bug.cgi?id=337075#c14
--- Comment #14 from Martin Burnicki
User mskibbe@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c15
Michael Skibbe
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c16
Martin Burnicki
User mskibbe@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c17
Michael Skibbe
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c18
--- Comment #18 from Martin Burnicki
yast there is no need to fix the yast module because i want to drop the ntp.server(s) file(s) and only use the ntp.conf
I've seen this in the latest versions in your home repo (4.2.4p4-36.1).
i have to take a look into code / doc if there is an other way to tell ntp to listen on a new device.
In the meantime I've also made some tests on a laptop with Network Manager. Unfortunately this has not been successful:
I've configured 3 pool servers plus the local clock, which is there by default.
- If I run "ntpq -p" immediately after reboot then it says: "No association IDs returned". This happens regardless of whether the LAN (cable) is connected during boot, or not. This means there's not even an association for the local clock, which I find quite weird.
- After I've run "rcntp restart" the "ntpq -p" billboard shows the local clock, and also the pool servers if the network is already reachable.
I've also talked to Frank Kardel, who is one of the core developers of NTP. He told me that he would expect the dynamic configuration to work correctly. However, the "dynamic" keyword would be required for the server associations. The "dynamic" keyword had been introduced to work around some problems which can arise if the NTP daemon starts before the network is up. In the current development version (which will become the next release) that keyword has been obsoleted, and the behaviour introduced with "dynamic" has become the default.
According to Frank, the best solution would be to use the routing socket to be able to react to network changes. However, as Frank says, the way routing sockets are implemented under Linux differs strongly from the implementation under other Unix-like systems, so "someone" would have to implement the routing socket support for Linux in ntpd.
Anyway, even the "dynamic" keyword used in the stable version 4.2.4p4-63.1 does not seem to help.
BTW, I've seen:
# ps ax |grep ntp
4549 ? Ss 0:00 /usr/sbin/ntpd -p /var/run/ntp/ntpd.pid \
-g -u ntp -U 60 -i /var/lib/ntp -c /etc/ntp.conf~
Is it intentional that /etc/ntp.conf~ (a backup file) is used as configuration file?
thanks for your testing!
I'm happy if I can help. I'd also appreciate if NTP runs under Linux without problems.
Martin
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c19
--- Comment #19 from Martin Burnicki
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c20
--- Comment #20 from Martin Burnicki
In the meantime I've also made some tests on a laptop with Network Manager.
Unfortunately this has not been successful:
I've configured 3 pool servers plus the local clock, which is there by default.
- If I run "ntpq -p" immediately after reboot then it says: "No association IDs returned". This happens regardless of whether the LAN (cable) is connected during boot, or not. This means there's not even an association for the local clock, which I find quite weird.
- After I've run "rcntp restart" the "ntpq -p" billboard shows the local clock, and also the pool servers if the network is already reachable.
Hm, my latest tests with NM were a little bit too fast. The update to the latest NTP RPM had undone my sysconfig modification "-U 60", which should let ntpd rescan the network interfaces once every 60 seconds rather than every 5 minutes or so. So I'd have had to wait longer for the interface rescan.
Under certain conditions ntpd finds the configured servers after a port rescan even if it had been started before the network connection is up. So there's still hope and I'll investigate more on this.
I think the final goal should be to get ntpd working correctly without having to add the servers from ntp.conf dynamically. Dynamic configuration is great, however, if the IPs of the upstream servers are passed via DHCP.
User mskibbe@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c21
--- Comment #21 from Michael Skibbe
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c22
--- Comment #22 from Martin Burnicki
the ntp.conf~ was only a temporary file which is created and deleted by the init script. i renamed this file ntp.conf.tmp.
I've seen this when I had a look at the rc script. I'd appreciate it if we could get it working without having to ntp.conf file and found that it basically works. Finally it's mostly a timing problem. See below.
i didn't find information about dynamic keyword, yet. but i think there is an alternative!
I'm afraid we need the "dynamic" keyword anyway in ntp.4.2.4. It is used to indicate that a peer configuration can be done later, even if currently no interface can be determined for that peer address. Look for FLAG_DYNAMIC in ntp_peer.c. Please note this has become the default behaviour in the current ntp-dev, so in the next stable release this keyword will be obsolete, but the behaviour will be the same as now with "dynamic".
What I've done and observed so far is:
- Modified the rc script such that it doesn't touch the ntp.conf file
- Run ntpd with the standard ntp.conf file and "-U 60"
- Removed the /etc/sysconfig/network/*ntp script
- Stopped ntpd
- Set system time 2 hours past
- Reboot without network connection
- After reboot "local clock" is shown by "ntpq -p"
- Connected network cable
- New interface detected by ntpd < 60 secs later
- DNS resolving of peers ~5 mins later (see below)
- Associations added after DNS success
- Synchronizing to pool server
The ugly things are:
- It takes up to 60 seconds until the interface is detected
- It takes even more until DNS resolving is retried and succeeds
ntpdc allows rescanning devices with ifreload. the init script now reloads the interfaces when nm tells it that a device changed. this _should_ work and fix your problem from comment #20. are you able to test this?
Yes. This is a very good idea. I hope this also speeds up the DNS lookup. This ntpdc command should be put into the /etc/sysconfig/network/*ntp file, and nothing else, if possible.
And there's one more problem: If it takes some time until the network connection comes up then ntpd synchronizes to the "local clock" association before, which obviously "eats up" the "-g" parameter. This means if the initial time offset exceeds 1000 seconds then ntpd stops itself. A workaround for this should be to remove the local clock from ntp.conf. The "local clock" time source is only a fallback which shall allow ntpd to serve its time to other clients even if no upstream server is available.
More tests to come.
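The workaround of removing the local clock from ntp.conf could be scripted along the following lines. This is a Python sketch with an invented function name; it assumes the local clock is configured in the usual way, as a reference clock at an address of the form 127.127.1.x with an optional matching "fudge" line, and it comments the lines out rather than deleting them.

```python
# Address prefix of the undisciplined local clock driver (type 1) in ntp.conf.
LOCAL_CLOCK_PREFIX = "127.127.1."


def disable_local_clock(conf_text):
    """Return ntp.conf text with local clock server/fudge lines commented out."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 2
            and fields[0] in ("server", "fudge")
            and fields[1].startswith(LOCAL_CLOCK_PREFIX)
        ):
            line = "# " + line  # keep the line for reference, but inactive
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")
```

Network server lines and already commented lines are left untouched, so running the function twice gives the same result as running it once.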
the -U method isn't a real solution for the problem because nm should tell the init script immediately after a interface change to load the server and i'm willing to wait 60 seconds.
Right. I'd also appreciate it if the ntpdc ifreload command would make this obsolete. However, the best solution would be if ntpd could receive network routing information through the routing socket. Don't you know one of your colleagues at Novell who's familiar with the Linux routing sockets and could help to port the implementation of those sockets in ntpd to Linux?
Martin
Michael Skibbe
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c23
--- Comment #23 from Martin Burnicki
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c24
--- Comment #24 from Martin Burnicki
Michael Skibbe
Hendrik Vogelsang
Andreas Schneider
User varkoly@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c26
Peter Varkoly
User martin.burnicki@meinberg.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=337075#c27
--- Comment #27 from Martin Burnicki
this must be fixed by the ntp-developers
I know this issue has just recently been assigned to you. However, I don't know
whether you have fully read all the comments and not just my last one.
What has to be fixed by the NTP developers is just one point. However, there are
others (e.g. using ntpd with -g rather than running ntpdate before starting
ntpd) which have to be fixed in the openSUSE/Novell rc script.
So I think "Wontfix" is not the correct status for this, since changes have to
be made by you folks.
Martin Burnicki