[Bug 1220119] Some tests on o3 triggers dmesg "clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:"

22 Feb 2024


      https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c23

--- Comment #23 from Jiri Wiesner <jwiesner@suse.com> ---
Created attachment 872947
  --> https://bugzilla.suse.com/attachment.cgi?id=872947&action=edit
test 01 debug scripts, scheduling latency with ftrace

(In reply to Petr Vorel from comment #22)
...
(In reply to Jiri Wiesner from comment #21)
...
Petr, I need to ask:
Is the log in comment 13 a composite, meaning collated from logs coming from
different boots of the VM?
No, single boot.
I mean, QEMU machine boots writing dmesg to this file, tests runs and
machine shutdown.
Oh, my. It is time to face the uncomfortable reality that the timer handler
really got executed 4 times, somehow interleaving the execution of some of the
handlers.
...
...
It seems to me that, since causation has not been established, there could
It's also unclear for me why this happens randomly. Maybe we could find more
from dmesg on actual KVM host. openQA KVM hosts are quite often overloaded,
maybe that influence clock clocksource on SUT.
I think you may be onto something. An overloaded KVM host would struggle with
running vCPU threads in a timely manner. Since we know the TSC hardware is fine
(no watchdog errors on the host), the TSC readouts might reflect the actual
passage of time as opposed to kvm-clock, which is managed by KVM/Qemu. The
interval reported in cs_nsec is always longer than the interval in wd_nsec,
which corroborates the hypothesis. I am attaching debugging scripts to measure
scheduling latency on the KVM host. They are run as root:
# ./run
You should be able to leave them running even for days until the issue has been
reproduced. But please check that they are not filling up the disk too quickly.
I could use results from an active KVM host from several hours of runtime as a
baseline. Also, I need to check if I set the thresholds adequately.
...
...
be a hidden root cause causing both the test to fail as well as the
(possibly occasional) clocksource errors. Regarding kernel options, it is
also possible to use tsc=reliable, which is meant for virtualized
environments, or tsc=unstable to disable the watchdog checks.
Do you think we should start using tsc kernel param?
The more I think about it the less I am convinced the watchdog errors are
undesirable. You actually what to know is something goes so wrong that the TSC
reads do not match the kvm-clock reads because that is all the watchdog check
really is in this case. The TSC clocksource may get marked unstable but that
does not have any effect on the currently active clocksource - kvm-clock.
...
BTW these machines are
configured: -m 1536 -cpu host (sometimes have more CPU or RAM for particular
tests), e.g. if it happens to us, it can happen to anybody using VMs on
cloud, right?
If this is caused by an overloaded KVM host I guess it cannot happen in cloud
environment because cloud VMs have some guarantees tied to them as the number
of pCPUs is considered.
-- 
You are receiving this mail because:
You are on the CC list for the bug.

[Bug 1220119] Some tests on o3 triggers dmesg "clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:"

bugzilla_noreply＠suse.com