[Bug 1220119] Some tests on o3 triggers dmesg "clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:"

7 Mar 2024


https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c24

--- Comment #24 from Petr Vorel <petr.vorel@suse.com> ---
(In reply to Jiri Wiesner from comment #23)
...
...
I think you may be onto something. An overloaded KVM host would struggle
with running vCPU threads in a timely manner. Since we know the TSC hardware
is fine (no watchdog errors on the host), the TSC readouts might reflect the
actual passage of time as opposed to kvm-clock, which is managed by
KVM/Qemu. The interval reported in cs_nsec is always longer than the
interval in wd_nsec, which corroborates the hypothesis. I am attaching
debugging scripts to measure scheduling latency on the KVM host. They are
run as root:
# ./run
You should be able to leave them running even for days until the issue has
been reproduced. But please check that they are not filling up the disk too
quickly. I could use results from an active KVM host from several hours of
runtime as a baseline. Also, I need to check if I set the thresholds
adequately.
I'm sorry, I was busy with travelling and other tasks. I have now started
the scripts on the o3 workers openqaworker21 and openqaworker24 (both are
affected). So far the output takes up only a few MB. Depending on how fast it
grows, I'll decide on Friday whether to leave the scripts running over the
weekend (or cancel them and start again on Monday).
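
To make the cs_nsec versus wd_nsec comparison discussed above easier to repeat
across many serial logs, here is a minimal Python sketch that extracts both
intervals from the watchdog detail lines and reports their difference. The
function name watchdog_skew and the SAMPLE text are hypothetical; the numbers
in SAMPLE are made up for illustration and do not come from any o3 job, and
the regular expression assumes the usual "'<name>' wd_nsec: N" and
"'<name>' cs_nsec: N" layout of the kernel message.

```python
import re

# Matches the detail lines printed after the "Marking clocksource ... as
# unstable" warning, e.g. "'kvm-clock' wd_nsec: 500000000 ...".
DETAIL = re.compile(r"'(?P<name>[^']+)' (?P<kind>wd|cs)_nsec: (?P<nsec>-?\d+)")


def watchdog_skew(log_text):
    """Pair each wd_nsec line with the cs_nsec line that follows it."""
    results = []
    watchdog = None
    for match in DETAIL.finditer(log_text):
        if match["kind"] == "wd":
            watchdog = (match["name"], int(match["nsec"]))
        elif watchdog is not None:
            cs_nsec = int(match["nsec"])
            results.append({
                "watchdog": watchdog[0],
                "clocksource": match["name"],
                "wd_nsec": watchdog[1],
                "cs_nsec": cs_nsec,
                # Positive when the checked clocksource saw more time pass
                # than the watchdog clocksource did.
                "skew_nsec": cs_nsec - watchdog[1],
            })
            watchdog = None
    return results


# Hypothetical excerpt with invented values, for illustration only.
SAMPLE = """\
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
clocksource:   'kvm-clock' wd_nsec: 500000000
clocksource:   'tsc' cs_nsec: 560000000
"""

if __name__ == "__main__":
    for entry in watchdog_skew(SAMPLE):
        print(entry)
```

A consistently positive skew_nsec across reports would match the observation
that the cs_nsec interval is always longer than the wd_nsec interval.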
...
...
...
be a hidden root cause causing both the test to fail as well as the
(possibly occasional) clocksource errors. Regarding kernel options, it is
also possible to use tsc=reliable, which is meant for virtualized
environments, or tsc=unstable to disable the watchdog checks.
Do you think we should start using the tsc kernel parameter?
The more I think about it the less I am convinced the watchdog errors are
undesirable. You actually want to know if something goes so wrong that the
TSC reads do not match the kvm-clock reads because that is all the watchdog
check really is in this case. The TSC clocksource may get marked unstable
but that does not have any effect on the currently active clocksource -
kvm-clock.
Feel free to stop investing your time if you think it's just an innocent log
message. (I guess we will see that soon from the logs.)
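
The claim that kvm-clock stays the active clocksource can be checked inside a
guest through sysfs. Below is a minimal Python sketch, assuming the standard
files current_clocksource and available_clocksource under
/sys/devices/system/clocksource/clocksource0; the function name
clocksource_status and its base parameter are hypothetical and exist only so
the paths can be swapped out for testing.

```python
from pathlib import Path

# Standard sysfs location of the clocksource attributes on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=SYSFS_DIR):
    """Report the active clocksource and whether tsc is still selectable."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return {
        "current": current,
        "available": available,
        "tsc_available": "tsc" in available,
    }


if __name__ == "__main__":
    # Only meaningful on a Linux system that exposes the sysfs files.
    if SYSFS_DIR.is_dir():
        print(clocksource_status())
```

If "current" still reads kvm-clock after the watchdog message appeared, the
guest's timekeeping was not switched by the TSC being marked unstable.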

Also, the last job in which I saw "clocksource: timekeeping watchdog on CPU0" is
opensuse-Tumbleweed-DVD-x86_64-Build20240303-systemd-networkd@64bit
(https://openqa.opensuse.org/tests/3984236/file/serial0.txt); it has not shown
up in other jobs. But that might only mean the KVM hosts were less overloaded.
...
...
BTW these machines are
configured with -m 1536 -cpu host (sometimes with more CPUs or RAM for
particular tests), i.e. if it happens to us, it can happen to anybody using
VMs in the cloud, right?
If this is caused by an overloaded KVM host, I guess it cannot happen in a
cloud environment, because cloud VMs come with some guarantees since the
number of pCPUs is taken into account.
Good to know, thanks!