[Bug 1220119] New: Some tests on o3 trigger dmesg "clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:"
https://bugzilla.suse.com/show_bug.cgi?id=1220119

            Bug ID: 1220119
           Summary: Some tests on o3 trigger dmesg "clocksource: timekeeping
                    watchdog on CPU0: Marking clocksource 'tsc' as unstable
                    because the skew is too large:"
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: petr.vorel@suse.com
        QA Contact: qa-bugs@suse.de
                CC: jwiesner@suse.com
  Target Milestone: ---
          Found By: ---
           Blocker: ---

Some tests running on o3 (the openQA instance at https://openqa.opensuse.org/)
trigger this message in dmesg. All tests run in a VM via qemu.

This is on x86_64 on my private openQA worker, triggered by ltp_net_tcp_cmds
for 2 tests: arp and host (neither test failed):

OpenQA::run_ltp.pm: Starting ipneigh01_arp
[ 410.818828][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 410.820365][ C0] clocksource: 'kvm-clock' wd_nsec: 506724586 wd_now: 610448443a wd_last: 60e6144350 mask: ffffffffffffffff
[ 410.821824][ C0] clocksource: 'tsc' cs_nsec: 49302930542 cs_now: 17c1364ba0a cs_last: 153fdc32526 mask: ffffffffffffffff
[ 410.823293][ C0] clocksource: Clocksource 'tsc' skewed 48796205956 ns (48796 ms) over watchdog 'kvm-clock' interval of 506724586 ns (506 ms)
[ 410.824843][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource.
[ 410.825707][ C0] tsc: Marking TSC unstable due to clocksource watchdog
[ 411.109710][T28135] Initializing XFRM netlink socket

OpenQA::run_ltp.pm: Starting ipneigh01_ip
OpenQA::run_ltp.pm: Starting arping01
OpenQA::run_ltp.pm: Starting clockdiff01
OpenQA::run_ltp.pm: Starting ftp
OpenQA::run_ltp.pm: Starting host
[ 410.818845][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 410.820687][ C0] clocksource: 'kvm-clock' wd_nsec: 506741379 wd_now: 61044885d3 wd_last: 60e6144350 mask: ffffffffffffffff
[ 410.821794][ C0] clocksource: 'tsc' cs_nsec: 804834172 cs_now: 154a546a082 cs_last: 153fdc32526 mask: ffffffffffffffff
[ 410.822919][ C0] clocksource: Clocksource 'tsc' skewed 298092793 ns (298 ms) over watchdog 'kvm-clock' interval of 506741379 ns (506 ms)
[ 410.824088][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource.
[ 410.824768][ C0] tsc: Marking TSC unstable due to clocksource watchdog
[ 411.126759][T28138] Initializing XFRM netlink socket

Also, I originally noted this on ltp_net_tcp_cmds running on ppc64le on o3
(kernel tests in https://openqa.opensuse.org/group_overview/32), where the host
test keeps failing (but the arp test is working). I don't see it on
ltp_net_tcp_cmds on x86_64 on o3. I was not able to trigger this in a VM on my
laptop (x86_64).
Others found it in the systemd-networkd test on Tumbleweed on x86_64
(https://openqa.opensuse.org/group_overview/1):

[ 570.754611][ C0] Showing busy workqueues and worker pools:
[ 570.755654][ C0] workqueue events: flags=0x0
[ 570.756580][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 570.756586][ C0] pending: drm_fb_helper_damage_work
[ 570.759551][ C0] sysrq: Show Blocked State
zDBFr-0-
[ 311.682306][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 311.684482][ C0] clocksource: 'kvm-clock' wd_nsec: 621509306 wd_now: 49d9caac63 wd_last: 49b4bf31a9 mask: ffffffffffffffff
[ 311.686513][ C0] clocksource: 'tsc' cs_nsec: 952336665 cs_now: a3a422af12 cs_last: a327415f8e mask: ffffffffffffffff
[ 311.688559][ C0] clocksource: Clocksource 'tsc' skewed 330827359 ns (330 ms) over watchdog 'kvm-clock' interval of 621509306 ns (621 ms)
[ 311.690452][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource.
[ 311.691559][ C0] tsc: Marking TSC unstable due to clocksource watchdog

The test also runs on aarch64, but the error is not there. Is it a problem of
the openQA setup? There are other bugs.
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.suse.com/show_bug.cgi?id=1217217
                 CC|                            |petr.vorel@suse.com
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://bugzilla.suse.com/show_bug.cgi?id=1218105
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c1

--- Comment #1 from Jiri Wiesner <jwiesner@suse.com> ---
Could we get a supportconfig tarball from both the KVM host and the guest?
It'll save us some back-and-forth requests for data.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c2

--- Comment #2 from Jiri Wiesner <jwiesner@suse.com> ---
(In reply to Petr Vorel from comment #0)
> This is on x86_64 on my private openQA worker by ltp_net_tcp_cmds, triggered
> for 2 tests: arp and host (both tests did not fail):
The current clocksource - the one used for providing time, kvm-clock - remained
active after the error appeared. The error should have no effect on how
kvm-clock works, and I do not see a reason why the error should affect the
workload. The clocksource watchdog gets executed every 500 milliseconds in
softirq context (a timer handler). The kvm-clock clocksource actually reported
a value close to 500 milliseconds, so kvm-clock is probably fine.
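For reference, which clocksource a guest is actually using for timekeeping, and
which other clocksources are registered next to it, can be checked through the
standard sysfs interface (a quick sketch; on these guests the first command is
expected to print kvm-clock):

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource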
> [ 410.818828][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 410.820365][ C0] clocksource: 'kvm-clock' wd_nsec: 506724586 wd_now: 610448443a wd_last: 60e6144350 mask: ffffffffffffffff
> [ 410.821824][ C0] clocksource: 'tsc' cs_nsec: 49302930542 cs_now: 17c1364ba0a cs_last: 153fdc32526 mask: ffffffffffffffff
> [ 410.818845][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 410.820687][ C0] clocksource: 'kvm-clock' wd_nsec: 506741379 wd_now: 61044885d3 wd_last: 60e6144350 mask: ffffffffffffffff
> [ 410.821794][ C0] clocksource: 'tsc' cs_nsec: 804834172 cs_now: 154a546a082 cs_last: 153fdc32526 mask: ffffffffffffffff
There is something rather conspicuous about the errors - there are two of them.
The error is supposed to be printed only once per clocksource being marked
unstable. It is as if the clocksource watchdog ran in two threads (there is
supposed to be only one).
> Also, originally I noted this on ltp_net_tcp_cmds running on ppc64le on o3
> (kernel tests in https://openqa.opensuse.org/group_overview/32)
I'd be interested to see the kernel log from that test. While logged in, I can see a failed ltp_net_tcp_cmds/host test but I don't know how to display the kernel log.
> Others found that systemd-networkd test on Tumbleweed on x86_64
> (https://openqa.opensuse.org/group_overview/1):
> [ 311.682306][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> [ 311.684482][ C0] clocksource: 'kvm-clock' wd_nsec: 621509306 wd_now: 49d9caac63 wd_last: 49b4bf31a9 mask: ffffffffffffffff
> [ 311.686513][ C0] clocksource: 'tsc' cs_nsec: 952336665 cs_now: a3a422af12 cs_last: a327415f8e mask: ffffffffffffffff
The interval, as reported by kvm-clock, is much longer than 500 milliseconds,
indicating a delayed expiration of the timer, which could be a consequence of
the system being under high load.
> Is it a problem of openQA setup? There are other bugs.
There is almost certainly something wrong with the tsc clocksource (not
kvm-clock). Perhaps the TSC counter is somehow virtualized (or para-virtualized
by Qemu, I don't know) and the virtualized TSC misbehaves under certain
conditions.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c3

--- Comment #3 from Petr Vorel <petr.vorel@suse.com> ---
(In reply to Jiri Wiesner from comment #2)
...
> I'd be interested to see the kernel log from that test. While logged in, I
> can see a failed ltp_net_tcp_cmds/host test but I don't know how to display
> the kernel log.
https://openqa.opensuse.org/tests/3947323/file/serial0.txt

This is available on most machines: Logs & Assets tab => serial0.txt. I'll
attach it. The test also sporadically fails on aarch64 on o3 (I attach dmesg as
well), but there is nothing strange with the clocksource there. IMHO the test
failure is unrelated to the clocksource problem.

Also, dmesg before reboot (on some machines serial0.txt contains just firmware
output before the log):
https://openqa.opensuse.org/tests/3947323/logfile?filename=shutdown_ltp-dmes...
> > Others found that systemd-networkd test on Tumbleweed on x86_64
> > (https://openqa.opensuse.org/group_overview/1):
> > [ 311.682306][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
> > [ 311.684482][ C0] clocksource: 'kvm-clock' wd_nsec: 621509306 wd_now: 49d9caac63 wd_last: 49b4bf31a9 mask: ffffffffffffffff
> > [ 311.686513][ C0] clocksource: 'tsc' cs_nsec: 952336665 cs_now: a3a422af12 cs_last: a327415f8e mask: ffffffffffffffff
> The interval, as reported by kvm-clock, is much longer than 500 milliseconds
> indicating a delayed expiration of the timer, which could be a consequence of
> the system being under high load.
> > Is it a problem of openQA setup? There are other bugs.
> There is almost certainly something wrong with the tsc clocksource (not
> kvm-clock). Perhaps, the TSC counter is somehow virtualized (or
> para-virtualized by Qemu, I don't know) and the virtualized TSC misbehaves
> under certain conditions.
Interesting.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c4

--- Comment #4 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872885
  --> https://bugzilla.suse.com/attachment.cgi?id=872885&action=edit
dmesg on ppc64le (affected)

https://openqa.opensuse.org/tests/3939629/file/serial0.txt
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c5

--- Comment #5 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872886
  --> https://bugzilla.suse.com/attachment.cgi?id=872886&action=edit
test output on ppc64le (affected)

https://openqa.opensuse.org/tests/3939629/file/serial_terminal.txt
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c6

--- Comment #6 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872887
  --> https://bugzilla.suse.com/attachment.cgi?id=872887&action=edit
dmesg on aarch64 (unaffected)

https://openqa.opensuse.org/tests/3940457/file/serial0.txt

Although "host openqa.opensuse.org" fails here as well, unlike ppc64le there is
no clocksource problem on aarch64.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c7

--- Comment #7 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872888
  --> https://bugzilla.suse.com/attachment.cgi?id=872888&action=edit
test output on aarch64 (unaffected)

https://openqa.opensuse.org/tests/3940457/file/serial_terminal.txt
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c8

--- Comment #8 from Jiri Wiesner <jwiesner@suse.com> ---
(In reply to Jiri Wiesner from comment #2)
> There is almost certainly something wrong with the tsc clocksource (not
> kvm-clock). Perhaps, the TSC counter is somehow virtualized (or
> para-virtualized by Qemu, I don't know) and the virtualized TSC misbehaves
> under certain conditions.
A brief look at the code executed by the guest shows that kvm-clock is the
paravirtualized clocksource, whereas the tsc clocksource executes rdtscp
directly. My guess is that the hypervisor shields the guest from the fact that
time flows differently on its pCPUs. In line with this, the readout of
kvm-clock would result in smaller counter values, probably related to the CPU
time for which the vCPUs were able to run on the hypervisor, whereas a direct
readout of the value of the real TSC counter (on the pCPU) gives much larger
counter values and larger differences. Although, if this were true, I am unsure
how NTP would be able to synchronize time in the guest so that userspace gets
correct timestamps.

Whether or not the tsc clocksource gets initialized during boot depends on the
flags exposed by the vCPU, hence it depends on the exact type of the vCPU. On
guests where the vCPU sets the TSC-related flags (as seen in /proc/cpuinfo on
x86: tsc rdtscp constant_tsc tsc_known_freq tsc_deadline_timer tsc_adjust), the
tsc clocksource will be initialized and checked by the watchdog. It is possible
that the values read by the tsc clocksource are entirely misleading and
irrelevant as far as timekeeping in the guest is concerned. Passing
tsc=nowatchdog to the guest kernel would disable the clocksource watchdog
checks on the tsc clocksource.

I am looking at the ppc64le and aarch64 logs. There are no clocksource watchdog
errors:
$ grep -E 'clocksource|watchdog' serial0.ppc64le.3939629.txt
[ 0.000000][ T0] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
[ 0.000002][ T0] clocksource: timebase mult[1f40000] shift[24] registered
[ 0.007938][ T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[ 0.058782][ T1] clocksource: Switched to clocksource timebase

$ grep -E 'clocksource|watchdog' serial0.aarch64.3940457.txt
[ 0.000000][ T0] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0xb8812736b, max_idle_ns: 440795202655 ns
[ 0.050199][ T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[ 0.594978][ T1] clocksource: Switched to clocksource arch_sys_counter
[ 11.709872][ T49] watchdog: Delayed init of the lockup detector failed: -19
[ 11.711996][ T49] watchdog: Hard watchdog permanently disabled

Then I realized that the clocksource watchdog is only enabled on x86_64 and
i386:

$ grep -r CLOCKSOURCE_WATCHDOG config
config/x86_64/default:CONFIG_CLOCKSOURCE_WATCHDOG=y
config/x86_64/default:CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
config/x86_64/default:# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
config/i386/pae:CONFIG_CLOCKSOURCE_WATCHDOG=y
config/i386/pae:CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
config/i386/pae:# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set

How are the ppc64le and aarch64 results relevant to the clocksource watchdog
errors observed on x86_64 guests? The ltp_net_tcp_cmds test failing should not
have anything to do with the clocksource watchdog error - there is no evidence
for that as far as I can see.
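For the x86_64 guests themselves, the two points above - whether the vCPU
exposes the TSC-related flags and whether the running kernel has the watchdog
built in - can be checked from inside the VM (a sketch using the standard /proc
and /boot paths; adjust the config path if the kernel ships /proc/config.gz
instead):

$ grep -m1 -o 'constant_tsc\|nonstop_tsc\|tsc_adjust\|tsc_known_freq\|rdtscp' /proc/cpuinfo | sort -u
$ grep CLOCKSOURCE_WATCHDOG "/boot/config-$(uname -r)"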
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed                     |Added
----------------------------------------------------------------------------
Attachment #872885        |dmesg on ppc64le (affected) |dmesg on ppc64le
description               |https://openqa.opensuse.org |(unaffected)
                          |/tests/3939629/file/serial0 |https://openqa.opensuse.org
                          |.txt                        |/tests/3939629/file/serial0
                          |                            |.txt
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed                     |Added
----------------------------------------------------------------------------
Attachment #872886        |test output on ppc64le      |test output on ppc64le
description               |(affected)                  |(unaffected)
                          |https://openqa.opensuse.org |https://openqa.opensuse.org
                          |/tests/3939629/file/serial_ |/tests/3939629/file/serial_
                          |terminal.txt                |terminal.txt
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872886 is     |0       |1
obsolete                  |        |
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872888 is     |0       |1
obsolete                  |        |
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c9

--- Comment #9 from Jiri Wiesner <jwiesner@suse.com> ---
Created attachment 872890
  --> https://bugzilla.suse.com/attachment.cgi?id=872890&action=edit
dmesg on x86

(In reply to Jiri Wiesner from comment #1)
> Could we get a supportconfig tarball from both the KVM host and the guest?
> It'll save us some back-and-forth requests for data.
May I please, at least, get a supportconfig tarball from the KVM host - the
private openQA worker? I guess the guest is ephemeral. I, at least, have a
kernel log from it.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c10

--- Comment #10 from Petr Vorel <petr.vorel@suse.com> ---
(In reply to Jiri Wiesner from comment #9)
> Created attachment 872890 [details]
> dmesg on x86

Thanks!

> (In reply to Jiri Wiesner from comment #1)
> > Could we get a supportconfig tarball from both the KVM host and the guest?
> > It'll save us some back-and-forth requests for data.
> May I please, at least, get a supportconfig tarball from the KVM host - the
> private openQA worker? I guess the guest is ephemeral. I, at least, have a
> kernel log from it.
Sure, I'll try to get to it today (supportconfig from my worker, which is
affected, will be trivial; I'll also try to get it from networkd_dhcp on o3,
for which it might be harder to get permission). Just a tgz of supportconfig
without any switches?
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c11

--- Comment #11 from Jiri Wiesner <jwiesner@suse.com> ---
(In reply to Petr Vorel from comment #10)
> supportconfig from my worker which is affected will be trivial

At the risk of sounding strange, there is nothing trivial about it from my POV.

> I'll try also to get it from networkd_dhcp on o3, which might be harder to
> get permission).

That'd be nice.

> Just tgz of supportconfig without any switch?

I think it should be fine without any switches.
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872885 is     |0       |1
obsolete                  |        |
https://bugzilla.suse.com/show_bug.cgi?id=1220119

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872887 is     |0       |1
obsolete                  |        |
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c12

--- Comment #12 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872900
  --> https://bugzilla.suse.com/attachment.cgi?id=872900&action=edit
supportconfig output from openqaworker26

openqaworker26 is one of the openQA workers (another is e.g. openqaworker21)
where systemd-networkd (from the Tumbleweed tests group in
https://openqa.opensuse.org/group_overview/1) triggered the issue. The log is
unfortunately taken 4 days after the problem was triggered.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c13

--- Comment #13 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872902
  --> https://bugzilla.suse.com/attachment.cgi?id=872902&action=edit
dmesg on x86_64 (systemd-networkd)

https://openqa.opensuse.org/tests/3945256/file/serial0.txt
worker openqaworker26
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c14

--- Comment #14 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872904
  --> https://bugzilla.suse.com/attachment.cgi?id=872904&action=edit
supportconfig output from openqaworker21

Another worker which triggered the problem.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c15

Jiri Wiesner <jwiesner@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Status|NEW     |IN_PROGRESS

--- Comment #15 from Jiri Wiesner <jwiesner@suse.com> ---
Sorry, Petr, could you use "supportconfig -l" from now on? It also collects
rotated logs. My mistake. Support engineers always somehow get us the
supportconfigs; I tend to forget what options they use.
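For completeness, the collection step on the worker is a single command run as
root (a sketch; -l is the switch requested above, and the resulting tarball
typically lands under /var/log, named scc_*.txz on recent supportutils
versions):

# supportconfig -l
# ls -lh /var/log/scc_*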
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c16

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872904 is     |0       |1
obsolete                  |        |

--- Comment #16 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872905
  --> https://bugzilla.suse.com/attachment.cgi?id=872905&action=edit
supportconfig output from openqaworker21
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c17

Petr Vorel <petr.vorel@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872900 is     |0       |1
obsolete                  |        |

--- Comment #17 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 872906
  --> https://bugzilla.suse.com/attachment.cgi?id=872906&action=edit
supportconfig output from openqaworker26
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c18

Jiri Wiesner <jwiesner@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
              Flags|        |needinfo?(claudio.fontana@suse.com)
                 CC|        |claudio.fontana@suse.com

--- Comment #18 from Jiri Wiesner <jwiesner@suse.com> ---
Both openqaworker21 and openqaworker26 are 1-socket machines. This means that
there should not be any discrepancies between TSC reads from different CPUs
(possibly belonging to different sockets), which can be seen on 8-socket and
larger machines. So, there is no need for TSC warp corrections. The CPUs are
AMD EPYC 7773X with these flags: tsc, rdtscp, constant_tsc, nonstop_tsc,
tsc_scale. The criteria for disabling the clocksource watchdog on the TSC are
constant_tsc, nonstop_tsc and tsc_adjust, which means the TSC gets checked by
the watchdog by default on these machines.

The guest boot after which the systemd-networkd test failed happened on the
17th:
$ grep 'PM: RTC' serial0-worker-openqaworker26.txt
[ 0.237369][ T1] PM: RTC time: 03:17:39, date: 2024-02-17

The KVM host was up since the 16th and there have been no clocksource watchdog
errors since the machine has booted up:

$ grep 'PM: RTC\|clocksource' messages.txt
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484873504 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb6ad8590d, max_idle_ns: 440795233723 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: PM: RTC time: 03:32:35, date: 2024-02-16
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: Switched to clocksource tsc-early
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: tsc: Refined TSC clocksource calibration: 2199.998 MHz
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
2024-02-16T03:32:56+0000 openqaworker26 kernel: clocksource: Switched to clocksource tsc

This means that the TSC counters work fine on the pCPUs. It is almost
impossible for the HPET (the watchdog) and the TSC (the clocksource) to be both
consistently wrong with the same time skew.
The log from the VM is mind-boggling. It seems as if the timer handler for the clocksource watchdog that was supposed to expire around 315.260000 ran 4 times:
$ grep 'PM: RTC\|clocksource' serial0-worker-openqaworker26.txt
[ 0.000001][ T0] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.004859][ T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
[ 0.145004][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[ 0.152187][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns
[ 0.234335][ T1] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[ 0.237369][ T1] PM: RTC time: 03:17:39, date: 2024-02-17
[ 0.526994][ T1] clocksource: Switched to clocksource kvm-clock
[ 0.537249][ T1] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[ 0.647119][ T1] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb63109b96, max_idle_ns: 440795265316 ns

The first execution of the handler:
[ 315.260520][ C0] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 44383091202 wd_nsec: 506649480

The second execution of the handler:
[ 315.260551][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 315.262450][ C0] clocksource: 'kvm-clock' wd_nsec: 506673719 wd_now: 4ae9f21fd6 wd_last: 4acbbee59f mask: ffffffffffffffff
[ 315.264225][ C0] clocksource: 'tsc' cs_nsec: 656770889 cs_now: a5e0ab6d74 cs_last: a58a8c159a mask: ffffffffffffffff
[ 315.265736][ C0] clocksource: Clocksource 'tsc' skewed 150097170 ns (150 ms) over watchdog 'kvm-clock' interval of 506673719 ns (506 ms)
[ 315.267289][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource.
[ 315.268301][ C0] tsc: Marking TSC unstable due to clocksource watchdog

The third execution of the handler:
[ 315.260531][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 315.262425][ C0] clocksource: 'kvm-clock' wd_nsec: 506655009 wd_now: 4ae9f1d6c0 wd_last: 4acbbee59f mask: ffffffffffffffff
[ 315.264455][ C0] clocksource: 'tsc' cs_nsec: 748234030 cs_now: a5eca9c820 cs_last: a58a8c159a mask: ffffffffffffffff
[ 315.266134][ C0] clocksource: Clocksource 'tsc' skewed 241579021 ns (241 ms) over watchdog 'kvm-clock' interval of 506655009 ns (506 ms)
[ 315.268044][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource.
[ 315.269278][ C0] tsc: Marking TSC unstable due to clocksource watchdog

The fourth execution of the handler:
[ 315.260529][ C0] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1438452798 wd_nsec: 506651809

I would like to investigate this later. I will need to take into account that
the guest has voluntary preemption enabled.
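As a sanity check on the numbers above, the "skewed" value in each message is
simply the difference between the interval measured by the clocksource under
test and the interval measured by the watchdog, e.g. for the second handler
execution:

$ echo $((656770889 - 506673719))   # tsc cs_nsec minus kvm-clock wd_nsec
150097170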
Claudio, could someone from your team please elaborate on whether it makes
sense at all to use the tsc clocksource in KVM guests? The tsc clocksource is
not the current clocksource (kvm-clock is), but watchdog checks are enabled for
it. I think the watchdog check errors are just noise because the tsc
clocksource is useless, and the tsc clocksource or, at least, its watchdog
checks should be disabled.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c19

Ivan Ivanov <ivan.ivanov@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |ivan.ivanov@suse.com

--- Comment #19 from Ivan Ivanov <ivan.ivanov@suse.com> ---
Created attachment 872923
  --> https://bugzilla.suse.com/attachment.cgi?id=872923&action=edit
Hm, happens here too on Thinkpad T495. Screenshot attached
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c20

--- Comment #20 from Jiri Wiesner <jwiesner@suse.com> ---
(In reply to Ivan Ivanov from comment #19)
> Created attachment 872923 [details]
> Hm, happens here too on Thinkpad T495. Screenshot attached
Since this is a kernel running on a bare metal machine, it is a different
issue. Yours is actually a hardware issue: either the HPET or the TSC is wrong.
I bet it's the HPET. You could pass tsc=nowatchdog to the kernel to disable the
checks (on any kernel). If you ran the latest kernel on Leap 15.4/5/6 you could
pass tsc=watchdog to force the TSC to act as a watchdog (implemented by my
out-of-tree patch that was ignored upstream). Since this is a 1-socket machine
and you probably do not need TSC warp detection, you could also use
"tsc=reliable tsc=watchdog" to accomplish a similar effect as with tsc=watchdog
and my out-of-tree patch in Leap.

AFAIK, the clocksource watchdog error often tends to happen during boot and on
machines that are under thermal stress. It's rather unnecessary to have the TSC
marked unstable when there is a good chance that it is the HPET readout that is
wrong. That's why I implemented the semantic change of tsc=watchdog. Another
option would be to postpone the clocksource watchdog checks until after the
system has fully booted up. I can just see the cold reception a patch like this
would receive upstream. They do like their clocksource watchdog (even if its
actions - switching to the HPET and causing a performance degradation - are
uncalled for).
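On openSUSE such a kernel parameter can be made persistent through GRUB; a
minimal sketch, assuming the stock /etc/default/grub layout (substitute
whichever tsc= option applies, and adjust if a different bootloader setup is in
use):

# sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 tsc=nowatchdog"/' /etc/default/grub
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot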
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c21

--- Comment #21 from Jiri Wiesner <jwiesner@suse.com> ---
Petr, I need to ask: Is the log in comment 13 a composite, meaning collated
from logs coming from different boots of the VM?

As for the association between the test failure and the occurrences of the
clocksource watchdog error, does the clocksource watchdog error (or
"clocksource: Long readout interval...") always happen when the
ltp_net_tcp_cmds test fails? It seems to me that, since causation has not been
established, there could be a hidden root cause causing both the test to fail
as well as the (possibly occasional) clocksource errors.

Regarding kernel options, it is also possible to use tsc=reliable, which is
meant for virtualized environments, or tsc=unstable to disable the watchdog
checks.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c22

--- Comment #22 from Petr Vorel <petr.vorel@suse.com> ---
(In reply to Jiri Wiesner from comment #21)
> Petr, I need to ask: Is the log in comment 13 a composite, meaning collated
> from logs coming from different boots of the VM?
No, a single boot. FYI these serial0.txt files are created as QEMU output from
the SUT, I believe by these QEMU parameters: "-chardev
ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0" (from
[1]; this file will not last, but any new systemd-networkd@64bit job from the
Tumbleweed group [2] should have it in the future). I mean, the QEMU machine
boots writing dmesg to this file, the tests run and the machine shuts down. We
also have serial0.txt when we test on bare metal machines; again, it's always
from a single boot.
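Put together with the -m 1536 -cpu host settings mentioned further down, the
relevant part of the worker's qemu invocation would look roughly like this (a
sketch assembled only from the options quoted in this bug, not the full
openQA-generated command line):

qemu-system-x86_64 -m 1536 -cpu host \
    -chardev ringbuf,id=serial0,logfile=serial0,logappend=on \
    -serial chardev:serial0
# (remaining openQA-generated options omitted)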
> As for the association between the test failure and the occurrences of the
> clocksource watchdog error, does the clocksource watchdog error (or
> "clocksource: Long readout interval...") always happen when the
> ltp_net_tcp_cmds test fails?
FYI this log is from a different test (a pure openQA test, which I don't
maintain), from the job
opensuse-Tumbleweed-DVD-x86_64-Build20240216-systemd-networkd@64bit running at
https://openqa.opensuse.org/tests/3945256 (i.e. on an o3 worker). The one I was
able to reproduce with ltp_net_tcp_cmds (i.e. a job using LTP for testing) was
also on Tumbleweed, but on my private worker.
> It seems to me that, since causation has not been established, there could
It's also unclear to me why this happens randomly. Maybe we could find out more
from dmesg on the actual KVM host. openQA KVM hosts are quite often overloaded;
maybe that influences the clocksource on the SUT.
> be a hidden root cause causing both the test to fail as well as the (possibly
> occasional) clocksource errors. Regarding kernel options, it is also possible
> to use tsc=reliable, which is meant for virtualized environments, or
> tsc=unstable to disable the watchdog checks.
Do you think we should start using a tsc kernel parameter?

BTW these machines are configured with: -m 1536 -cpu host (some have more CPUs
or RAM for particular tests). I.e. if it happens to us, it can happen to
anybody using VMs in the cloud, right?

[1] https://openqa.opensuse.org/tests/3945256/file/autoinst-log.txt
[2] https://openqa.opensuse.org/group_overview/1
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c23

--- Comment #23 from Jiri Wiesner <jwiesner@suse.com> ---
Created attachment 872947
  --> https://bugzilla.suse.com/attachment.cgi?id=872947&action=edit
test 01 debug scripts, scheduling latency with ftrace

(In reply to Petr Vorel from comment #22)
> (In reply to Jiri Wiesner from comment #21)
> > Petr, I need to ask: Is the log in comment 13 a composite, meaning collated
> > from logs coming from different boots of the VM?
> No, single boot. I mean, QEMU machine boots writing dmesg to this file, tests
> runs and machine shutdown.
Oh, my. It is time to face the uncomfortable reality that the timer handler really got executed 4 times, somehow interleaving the execution of some of the handlers.
> > It seems to me that, since causation has not been established, there could
> It's also unclear for me why this happens randomly. Maybe we could find more
> from dmesg on actual KVM host. openQA KVM hosts are quite often overloaded,
> maybe that influence clock clocksource on SUT.
I think you may be onto something. An overloaded KVM host would struggle with
running vCPU threads in a timely manner. Since we know the TSC hardware is fine
(no watchdog errors on the host), the TSC readouts might reflect the actual
passage of time as opposed to kvm-clock, which is managed by KVM/Qemu. The
interval reported in cs_nsec is always longer than the interval in wd_nsec,
which corroborates the hypothesis.

I am attaching debugging scripts to measure scheduling latency on the KVM host.
They are run as root:
# ./run
You should be able to leave them running even for days until the issue has been
reproduced. But please check that they are not filling up the disk too quickly.
I could use results from an active KVM host from several hours of runtime as a
baseline. Also, I need to check if I set the thresholds adequately.
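Independently of the attached scripts (which are the authoritative method
here), a rough cross-check of worst-case wakeup latency on the host is possible
with the stock ftrace wakeup tracer, assuming tracefs is mounted at
/sys/kernel/tracing and CONFIG_SCHED_TRACER is enabled - a sketch:

# cd /sys/kernel/tracing
# echo 0 > tracing_on
# echo wakeup > current_tracer        # track wakeup-to-run latency of the highest-priority waking task
# echo 0 > tracing_max_latency        # reset the recorded maximum
# echo 1 > tracing_on
# sleep 600; cat tracing_max_latency  # worst latency seen so far, in microseconds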
> > be a hidden root cause causing both the test to fail as well as the
> > (possibly occasional) clocksource errors. Regarding kernel options, it is
> > also possible to use tsc=reliable, which is meant for virtualized
> > environments, or tsc=unstable to disable the watchdog checks.
> Do you think we should start using tsc kernel param?
The more I think about it, the less I am convinced the watchdog errors are
undesirable. What you actually want to know is whether something goes so wrong
that the TSC reads do not match the kvm-clock reads, because that is all the
watchdog check really is in this case. The TSC clocksource may get marked
unstable, but that does not have any effect on the currently active clocksource
- kvm-clock.
> BTW these machines are configured: -m 1536 -cpu host (sometimes have more CPU
> or RAM for particular tests), e.g. if it happens to us, it can happen to
> anybody using VMs on cloud, right?
If this is caused by an overloaded KVM host, I guess it cannot happen in a
cloud environment because cloud VMs have some guarantees tied to them as far as
the number of pCPUs is concerned.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c24

--- Comment #24 from Petr Vorel <petr.vorel@suse.com> ---
(In reply to Jiri Wiesner from comment #23)
...
> I think you may be onto something. An overloaded KVM host would struggle with
> running vCPU threads in a timely manner. Since we know the TSC hardware is
> fine (no watchdog errors on the host), the TSC readouts might reflect the
> actual passage of time as opposed to kvm-clock, which is managed by KVM/Qemu.
> The interval reported in cs_nsec is always longer than the interval in
> wd_nsec, which corroborates the hypothesis. I am attaching debugging scripts
> to measure scheduling latency on the KVM host. They are run as root:
> # ./run
> You should be able to leave them running even for days until the issue has
> been reproduced. But please check that they are not filling up the disk too
> quickly. I could use results from an active KVM host from several hours of
> runtime as a baseline. Also, I need to check if I set the thresholds
> adequately.
I'm sorry, I was busy with travelling and other tasks. I have now started
running the script on the o3 workers openqaworker21 and openqaworker24 (these
two are affected). So far they have taken only a few MB; based on the size
taken I'll decide on Friday whether to leave them running over the weekend (or
cancel and start them again on Monday).
> > > be a hidden root cause causing both the test to fail as well as the
> > > (possibly occasional) clocksource errors. Regarding kernel options, it is
> > > also possible to use tsc=reliable, which is meant for virtualized
> > > environments, or tsc=unstable to disable the watchdog checks.
> > Do you think we should start using tsc kernel param?
> The more I think about it the less I am convinced the watchdog errors are
> undesirable. You actually what to know is something goes so wrong that the
> TSC reads do not match the kvm-clock reads because that is all the watchdog
> check really is in this case. The TSC clocksource may get marked unstable but
> that does not have any effect on the currently active clocksource -
> kvm-clock.
Feel free to stop your time investment if you think it's just an innocent log
message. (I guess we will see soon from the logs.) Also, the last job where I
saw "clocksource: timekeeping watchdog on CPU0" is
opensuse-Tumbleweed-DVD-x86_64-Build20240303-systemd-networkd@64bit
(https://openqa.opensuse.org/tests/3984236/file/serial0.txt); it's not in other
jobs. But that might only mean the KVM hosts were less overloaded.
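A quick way to check any given openQA job for the message, straight from the
serial log URL quoted above (the grep pattern is the watchdog error; the
command prints the number of hits):

$ curl -s https://openqa.opensuse.org/tests/3984236/file/serial0.txt \
      | grep -c "Marking clocksource 'tsc' as unstable"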
> > BTW these machines are configured: -m 1536 -cpu host (sometimes have more
> > CPU or RAM for particular tests), e.g. if it happens to us, it can happen
> > to anybody using VMs on cloud, right?
> If this is caused by an overloaded KVM host I guess it cannot happen in cloud
> environment because cloud VMs have some guarantees tied to them as the number
> of pCPUs is considered.
Good to know, thanks!
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c25

--- Comment #25 from Petr Vorel <petr.vorel@suse.com> ---
@Jiri, I got the event "clocksource: timekeeping watchdog on CPU0: Marking
clocksource 'tsc' as unstable because the skew is too large" (at least). It's
on openqaworker24, I'll give you access on Monday.
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c26

--- Comment #26 from Petr Vorel <petr.vorel@suse.com> ---
Created attachment 873381
  --> https://bugzilla.suse.com/attachment.cgi?id=873381&action=edit
dmesg on affected system
(https://openqa.opensuse.org/tests/4001564/file/serial0.txt)
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c27

--- Comment #27 from Jiri Wiesner <jwiesner@suse.com> ---
At first sight, the hypervisor does not look overloaded. The CPU utilization on
openqaworker24 was medium. The hourly average did not go above 20%. These are
the 3 highest hourly averages:
             user    nice  system    idle  iowait     irq  softirq
CPU all      15.7     0.2     2.6    80.8     0.0     0.0      0.1
CPU all      14.3     0.0     2.8    82.3     0.0     0.0      0.1
CPU all      14.6     0.1     3.0    81.8     0.0     0.0      0.1
Hourly averages cannot exclude the possibility of spikes of activity happening
on the system. The scripts measured scheduling latency. There were plenty of
scheduling latency values larger than 30 milliseconds (the values after "lat"
are in microseconds):

 qemu-system-x86-45417   [060] ...2. 1803295.566825: schedlatwake: comm qemu-system-x86 pid 45417 prio 120 lat 184196877
...
 qemu-system-x86-129996  [080] ...2. 1976791.187134: schedlatwake: comm qemu-system-x86 pid 129996 prio 120 lat 48920224
 qemu-system-x86-93208   [002] ...2. 1888512.540509: schedlatwake: comm qemu-system-x86 pid 93208 prio 120 lat 48650888
 qemu-system-x86-9531    [008] ...2. 1849588.538004: schedlatwake: comm qemu-system-x86 pid 9531 prio 120 lat 48236447
...
 qemu-system-x86-54060   [017] ...2. 1846462.878024: schedlatwake: comm qemu-system-x86 pid 54060 prio 120 lat 7010089
 qemu-system-x86-62362   [016] ...2. 1991589.150792: schedlatwake: comm qemu-system-x86 pid 62362 prio 120 lat 6967569
 qemu-system-x86-120933  [060] ...2. 1990822.180850: schedlatwake: comm qemu-system-x86 pid 120933 prio 120 lat 6966636

The maximum was 184 seconds, which is really extreme. I am not sure whether
this could be caused by some bug in the synthetic event code of ftrace. If the
scheduling latency is real, it means that various qemu threads cannot run in a
timely manner, experiencing delays of several seconds or tens of seconds. Each
VM creates more than 20 threads, one of them being the vCPU. I am not sure if
delays of threads other than the vCPU thread can cause problems, but I assume
they may.

My next step would be capturing a full context switch trace with snapshotting.
A snapshot would be stored when a value of scheduling latency exceeds a
threshold. This will allow me to verify those scheduling latency values as well
as check for spikes of activity.
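A hypothetical post-processing step over the collected trace output (the file
name trace.txt is illustrative, not the scripts' actual output name) to pull
out the largest "lat" values, in microseconds:

$ grep schedlatwake trace.txt \
      | awk '{ for (i = 1; i < NF; i++) if ($i == "lat") print $(i + 1) }' \
      | sort -rn | head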
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c28

Jiri Wiesner <jwiesner@suse.com> changed:

           What           |Removed |Added
----------------------------------------------------------------------------
Attachment #872947 is     |0       |1
obsolete                  |        |

--- Comment #28 from Jiri Wiesner <jwiesner@suse.com> ---
Created attachment 873455
  --> https://bugzilla.suse.com/attachment.cgi?id=873455&action=edit
test 02 debug scripts, scheduling latency with ftrace

The full context switch trace showed that I made a mistake in my approach. The
scheduling latency reported by the test01 scripts is not real. Measuring
scheduling latency between the sched/sched_waking and raw_syscalls/sys_exit
tracepoints is a simplification that cannot be applied to qemu processes that
spend a long time in kernel mode (running the code executed by the vCPU) after
being woken up. I have started the test02 scripts on both openqaworker21 and
openqaworker24.
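The actual test02 scripts are in the attachment above; purely as an
illustration of the snapshot-on-threshold idea with stock ftrace (assuming
tracefs at /sys/kernel/tracing and CONFIG_TRACER_SNAPSHOT enabled), the shape
is roughly:

# cd /sys/kernel/tracing
# echo 1 > snapshot                           # allocate the snapshot buffer
# echo 1 > events/sched/sched_switch/enable   # record context switches
# echo 1 > events/sched/sched_waking/enable
# ... a watcher detects a latency above the threshold, then:
# echo 1 > snapshot                           # freeze a copy of the ring buffer
# cat snapshot > /tmp/sched-snapshot.txt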
https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c29

Jiri Slaby <jslaby@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jslaby@suse.com

--- Comment #29 from Jiri Slaby <jslaby@suse.com> ---
There has been no progress here for a while. By any chance, has this been fixed
in the meantime? Is the needinfo still relevant?