[Bug 545191] New: ntpd sync fails/noisy with kernel-xen (clocksource = jiffies); works as expected with kernel-default (clocksource = hpet)
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c1

           Summary: ntpd sync fails/noisy with kernel-xen (clocksource = jiffies); works as expected with kernel-default (clocksource = hpet)
    Classification: openSUSE
           Product: openSUSE 11.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 11.1
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: pgnet.dev@gmail.com
         QAContact: qa@suse.de
          Found By: ---

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0 FirePHP/0.3

i've set up two identical servers -- same mobo (Asus M387A-CM), BIOS version (v1903), CPU (AMD Phenom II X4 2.6GHz), OS (opensuse 11.1), & kernel version (2.6.27.29-0.1).

the difference between the 2 is that server (1) is running a Xen Dom0 kernel, and server (2) is running the default, non-Xen kernel.

both machines are setup to use NTP for time services. i've identically config'd ntp, using 3 low-latency Stratum 2 servers.

  cat /etc/ntp.conf
    restrict default nomodify notrap noquery
    restrict 127.0.0.1
    restrict 192.168.1.0 mask 255.255.255.0 notrust nomodify notrap
    server ntp1.stsn.net
    server clock.develooper.com
    server 192.83.249.28
    driftfile /var/lib/ntp/drift/ntp.drift
    logfile /var/log/ntpd/ntp.log
    statsdir /var/log/ntpd/
    filegen peerstats file peerstats type day enable
    filegen loopstats file loopstats type day enable
    filegen clockstats file clockstats type day enable

i restarted ntpd on both servers at the same time, and waited 2 hours.

server (2) is synced (took about 10 minutes, actually), with low jitter and offset, to stratum3; server (1) hasn't managed to make it out of stratum 16 ...
(1)

  uname -a
    Linux server02 2.6.27.29-0.1-xen #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    xen jiffies
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    jiffies
  cat /proc/sys/xen/independent_wallclock
    1

  service ntp stop && sntp -P no -r us.pool.ntp.org && service ntp start && date
    Shutting down network time protocol daemon (NTPD)  done
    Starting network time protocol daemon (NTPD)  done
    Wed Oct 7 16:28:44 PDT 2009
  ..
  ..
  date
    Wed Oct 7 18:50:21 PDT 2009

  ntpq -p -c rv
    ...
    stratum=16, precision=-8, rootdelay=0.000, rootdispersion=10.485,
    ...
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *time-sj.stsn.ne 198.60.22.240    2 u   57   64  377   27.612  -1390.8 658.318
    +clock-a.develoo 204.123.2.72     2 u   15   64  377   28.280  -1478.4 652.232
    +zorro.sf-bay.or 216.218.254.202  2 u   52   64  377   17.287  -1395.2 650.788

(2)

  uname -a
    Linux server03 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    hpet acpi_pm jiffies tsc
  cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    hpet

  service ntp stop && sntp -P no -r us.pool.ntp.org && service ntp start && date
    Shutting down network time protocol daemon (NTPD)  done
    Starting network time protocol daemon (NTPD)  done
    Wed Oct 7 16:28:46 PDT 2009
  ..
  ..
  date
    Wed Oct 7 18:51:07 PDT 2009

  ntpq -p -c rv
    ...
    stratum=3, precision=-20, rootdelay=47.377, rootdispersion=14.367,
    ...
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *time-sj.stsn.ne 198.60.22.240    2 u   64   64  377   28.779   -4.716   2.679
    +clock-b.develoo 204.123.2.72     2 u   59   64  377   26.669   -0.886   3.322
    +zorro.sf-bay.or 216.218.254.202  2 u   51   64  377   16.563   -0.645   2.906

finally, i swapped the kernel-default & kernel-xen between the two machines, and saw the same/similar results.

if i understand the above correctly, this says "problem with timing @ Xen kernel" to me ...

this manifests as problematic for DomU localtime-dependent on Dom0 sync using xen clocksource for apps (e.g. Dovecot) sensitive to noisy/incorrect/backward-moving time.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=545191
zhu rensheng
http://bugzilla.novell.com/show_bug.cgi?id=545191
account disabled
http://bugzilla.novell.com/show_bug.cgi?id=545191
Jason Douglas
http://bugzilla.novell.com/show_bug.cgi?id=545191
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c1
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c2
account disabled
Did you force the clocksource to jiffies
yes. but, it doesn't matter.

in the kernel-default case, the default clocksource is "hpet", and available_clocksource includes {hpet, tsc, acpi_pm & jiffies}

  with current_clocksource = "hpet", all's fine.

in the kernel-xen case, the Dom0's default clocksource is "xen", and available_clocksource includes {xen, jiffies}

  with BOTH current_clocksource = "xen" & "jiffies", the jitter's unacceptably large, preventing sync to timesources.

docs are, at best, unclear as to whether Dom0's clocksource should be == "xen" or "jiffies" when using ntpd for clock sync.

i.e. any/all clocksource in Xen Dom0 + ntpd can't sync due, apparently, to excessive jitter.
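For anyone repeating the comparison, the two sysfs files quoted in this thread can be read with a few lines of Python. This is only a minimal sketch: the helper names `read_clocksources` and `other_clocksources` are made up for illustration, and the `base` parameter is an assumption added so the helper can be pointed at a test directory instead of the real sysfs path.

```python
from pathlib import Path

# sysfs directory quoted in the report; passed as a parameter (an assumption
# of this sketch) so the helpers can also be run against a test directory.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def other_clocksources(base=CLOCKSOURCE_DIR):
    """List the available clocksources other than the one currently selected."""
    current, available = read_clocksources(base)
    return [name for name in available if name != current]
```

On the Dom0 described above, `read_clocksources()` would be expected to report "jiffies" as current with "xen" and "jiffies" available, leaving "xen" as the one alternative to try.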
explanation of the "noisy" part of the problem description
per above,

  case (1) jitter = {658.318, 652.232, 650.788} <- noisy, can't sync
  case (2) jitter = {2.679, 3.322, 2.906} <- typical/'quiet', sync is OK
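The same comparison can be made mechanically from the `ntpq -p` peer lines. The sketch below is illustrative only: the peer lines are copied from the original report, while `parse_peers` and `mean_jitter` are hypothetical helper names, and the parser assumes the ten-column peer layout shown in the report.

```python
from statistics import mean

# Peer lines copied from the report: server (1) kernel-xen, server (2) kernel-default.
XEN_PEERS = """\
*time-sj.stsn.ne 198.60.22.240   2 u 57 64 377 27.612 -1390.8 658.318
+clock-a.develoo 204.123.2.72    2 u 15 64 377 28.280 -1478.4 652.232
+zorro.sf-bay.or 216.218.254.202 2 u 52 64 377 17.287 -1395.2 650.788
"""
DEFAULT_PEERS = """\
*time-sj.stsn.ne 198.60.22.240   2 u 64 64 377 28.779 -4.716 2.679
+clock-b.develoo 204.123.2.72    2 u 59 64 377 26.669 -0.886 3.322
+zorro.sf-bay.or 216.218.254.202 2 u 51 64 377 16.563 -0.645 2.906
"""


def parse_peers(text):
    """Parse ntpq -p peer lines into dicts; non-peer lines are skipped."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # separator or blank line
        try:
            delay, offset, jitter = (float(v) for v in fields[7:10])
        except ValueError:
            continue  # column header line
        peers.append({
            "remote": fields[0].lstrip("*+-#"),  # drop the tally character
            "stratum": int(fields[2]),
            "delay": delay,
            "offset": offset,
            "jitter": jitter,
        })
    return peers


def mean_jitter(text):
    """Average the jitter column (milliseconds) over all parsed peers."""
    return mean(peer["jitter"] for peer in parse_peers(text))
```

Calling `mean_jitter(XEN_PEERS) / mean_jitter(DEFAULT_PEERS)` on these figures gives a ratio of roughly 220, which is the "noisy" versus "quiet" gap described above.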
http://bugzilla.novell.com/show_bug.cgi?id=545191
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c3
Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c4
--- Comment #4 from account disabled
I have to admit that I have no clue what to do with this: I know very little about NTP, but I do know that others are using NTP on Xen without issues.
i'm not at all convinced that it's a problem with ntp, but suspect, instead, kernel clocksource ...

in no particular order, here are a number of references that -- at best -- demonstrate confusion in abundance,

  http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01201.html
  http://www.linux-archive.org/debian-kernel/326055-bug-534978-clock-drift-xen...
  http://www.novell.com/communities/node/8629/time-synchronization-xen-setup
  http://www.linux.org.za/Lists-Archives/glug-tech-0905/msg00271.html
  http://www.gossamer-threads.com/lists/linux/kernel/1039416
  http://lists.ntp.isc.org/pipermail/questions/2009-August/024110.html
  https://bugs.launchpad.net/xen/+bug/146924
  http://lists.ntp.isc.org/pipermail/questions/2006-June/010460.html

a common issue, questioned but not resolved, seems to be the type of clocksource. apparently (?), different OS's kernel-xen have different available/default clocksources ... which exhibit different behaviors. i'm still reading up ...

so, yes, "others" get it to work. but under what reproducible conditions? hardly well documented, afaict ...

i've now been able to reproduce the problem i've reported here on multiple motherboards from multiple vendors -- removing, in effect, the "it's your rtc that's dead" argument.

the systems demonstrating this issue are all opensuse 11.1 with release- &/or SL111- version kernels. *solaris systems are having no such problems -- but, of course, that's apples and oranges. i've not yet done these tests on other Linux setups.

note, also, that the situation of how time's kept is (?) changing as kernels evolve towards pv-ops, away from 'xenified'. J Fitzhardinge (he's @suse, no?) is involved in those discussions (e.g., http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01201.html).
attaching kernel and hypervisor messages
in the Dom0 case, i suspect? captured via serial port? or are dmesg & xm dmesg output sufficient?
http://bugzilla.novell.com/show_bug.cgi?id=545191
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c5
Jan Beulich
in the Dom0 case, i suspect? captured via serial port? or are dmesg & xm dmesg output sufficient?
xm dmesg output will be sufficient if you make sure the log level is high enough, and no messages are discarded. dmesg won't be enough - it's mainly the boot messages I'm after (i.e. /var/log/boot.msg).

But you asking this question makes me ask another one: Are you seeing this issue in Dom0, DomU, or both?
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c6
--- Comment #6 from account disabled
With "others" I don't mean people using other distros - SLE11 is working fine for them, and as long as you use an up-to-date kernel, that's gonna be the same as SLE11's.
then more confusion ... something in my env is consistently problematic. i've now chatted with enough people across multiple distros, hardware, configs, etc to know that problems with timekeeping in xen are not uncommon. the whole business seems a bit of a mess, atm ...
in the Dom0 case, i suspect? captured via serial port? or are dmesg & xm dmesg output sufficient?
xm dmesg output will be sufficient if you make sure the log level is high enough, and no messages are discarded.
dmesg won't be enough - it's mainly the boot messages I'm after (i.e. /var/log/boot.msg).
i'll put that together in a minute, and post ...
But you asking this question makes me ask another one: Are you seeing this issue in Dom0, DomU, or both?
both.

the problem i've reported above/here is 'just' Dom0.

in DomU, if:

  /proc/sys/xen/independent_wallclock --> 0
  /sys/devices/system/clocksource/clocksource0/current_clocksource --> xen

then, of course, DomU simply tracks Dom0. checking sync between the two, DomU tracks Dom0 faithfully, without problem, in this case.

in the DomU (paravirt) case where i've independent, ntp-driven clock, with,

  /proc/sys/xen/independent_wallclock --> 1
  /sys/devices/system/clocksource/clocksource0/current_clocksource --> jiffies

the DomU jitter is somewhat worse than the Dom0 jitter. and, as one might expect, sync is just as elusive.
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c7
account disabled
xm dmesg output will be sufficient if you make sure the log level is high enough, and no messages are discarded.
@ my opensuse 11.1 test box, Xen Dom0 booted as:

  title Xen (symlink) NORMAL
      root (hd0,0)
      kernel /xen.gz dom0_mem=768M loglvl=all loglvl_guest=all vga=gfx-1280x1024x32 console=vga,com1 com1=57600,8n1 cpufreq=xen:performance cpuidle iommu=1
      module /vmlinuz-xen root=LABEL=DOM0_ROOT resume=LABEL=DOM0_SWAP showopts splash=silent vga=0x31a console=tty0 console=xvc0,57600 elevator=cfq reassigndev=0000:04:07.0 iommu=off
      module /initrd-xen

attachment (bootlogs.txt) contains -> xm dmesg and, cat /var/log/boot.msg
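To double-check which options the hypervisor was actually booted with (for instance, that the log levels requested earlier in the thread are set), the Xen line from the boot entry can be split into name/value pairs. A rough sketch follows, with `parse_options` as a hypothetical helper name and the option string copied from the boot entry in this comment. It keeps only the last value when an option is repeated, so it is not meant for the Dom0 kernel line, where `console=` appears twice.

```python
def parse_options(cmdline):
    """Split a boot line into {name: value}; bare flags map to None.

    The first token is taken to be the image path and is skipped.
    A repeated option keeps only its last value.
    """
    options = {}
    for token in cmdline.split()[1:]:
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


# Hypervisor line copied from the boot entry quoted in this comment.
XEN_LINE = (
    "/xen.gz dom0_mem=768M loglvl=all loglvl_guest=all "
    "vga=gfx-1280x1024x32 console=vga,com1 com1=57600,8n1 "
    "cpufreq=xen:performance cpuidle iommu=1"
)

xen_options = parse_options(XEN_LINE)
# Log-level options as passed to Xen, plus any flags given without a value.
log_levels = {k: v for k, v in xen_options.items() if k.startswith("loglvl")}
bare_flags = [k for k, v in xen_options.items() if v is None]
```

Here `log_levels` holds both `loglvl` and `loglvl_guest` set to "all", and `bare_flags` contains only `cpuidle`.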
http://bugzilla.novell.com/show_bug.cgi?id=545191
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c8
--- Comment #8 from Jan Beulich
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c9
--- Comment #9 from account disabled
But please, a valid configuration is needed here (i.e. either 11.2 hypervisor+kernel, or the 11.1 pair). Further, please remove the "cpuidle" parameter you pass to Xen.
valid, or just available? i use the stable/release kernel for 11.1, and the up-to-date, not-broken-like-3.3, 3.4.x hypervisor from OBS. that's been discussed -- perhaps even directly with you? -- previously in bugzilla. 11.2's got a number of significant problems with raid, etc that make it currently unusable. atm, there are no other functional options.
Also, did you try other clocksources for Xen itself to use? I don't really think this should make a difference, but it'd be good to know for sure anyway.
the only available clocksources are reported, @ Dom0, as

  cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    xen jiffies

per above, i've tried both ...
But, as said earlier, I don't think I have any pointers as to where to continue if the above adjustment doesn't result in any behavioral change.
did you contact J Fitzhardinge @ your company? this seems to be something he's directly involved in ...
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c10
--- Comment #10 from account disabled
please remove the "cpuidle" parameter you pass to Xen.
in issue https://bugzilla.novell.com/show_bug.cgi?id=530035, wherein you "suggest to simply be patient. After all, this is not some critical problem we're talking about", removing cpuidle breaks power mgmt.

apparently, on 10/1/2009, amd posted a patch, supposedly fixing those issues, that "has been submitted to xen-unstable but has not yet made it into the tree." ...
http://bugzilla.novell.com/show_bug.cgi?id=545191
User jbeulich@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c11
--- Comment #11 from Jan Beulich
valid, or just available?
Valid.
i use the stable/release kernel for 11.1, and the up-to-date, not-broken-like-3.3, 3.4.x hypervisor from OBS. that's been discussed -- perhaps even directly with you? -- previously in bugzilla.
If you mean power management being broken in 3.3, that certainly doesn't matter for this bug? We don't ship 3.4.x for 11.1, so we won't be able to help you with this combination either.
the only available clocksources are reported, @ Dom0, as
But I asked about Xen's clocksources, not Dom0's.
did you contact J Fitzhardinge @ your company? this seems to be something he's directly involved in ...
Who told you that Jeremy would work for Novell? He's at Citrix.

(In reply to comment #10)
in issue https://bugzilla.novell.com/show_bug.cgi?id=530035, wherein you "suggest to simply be patient. After all, this is not some critical problem we're talking about", removing cpuidle breaks power mgmt.
It may break xenpm, but it certainly doesn't break power management.
apparently, on 10/1/2009, amd posted a patch, supposedly fixing those issues, that "has been submitted to xen-unstable but has not yet made it into the tree." ...
I saw that, and meanwhile it has been checked in. So once we'll have a product based on 3.5 or newer, this will work. Backporting such a change would require it to e.g. fix a severe issue, which isn't the case here.
http://bugzilla.novell.com/show_bug.cgi?id=545191
User pgnet.dev@gmail.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=545191#c12
account disabled