[Bug 1220119] Some tests on o3 triggers dmesg "clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:"

12 Mar 2024

https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c27

--- Comment #27 from Jiri Wiesner <jwiesner@suse.com> ---
At first sight, the hypervisor does not look overloaded. The CPU utilization on
openqaworker24 was moderate: the hourly average did not go above 20%. These are
the 3 highest hourly averages:
...
           user    nice  system    idle  iowait     irq softirq
CPU all    15.7     0.2     2.6    80.8     0.0     0.0     0.1
CPU all    14.3     0.0     2.8    82.3     0.0     0.0     0.1
CPU all    14.6     0.1     3.0    81.8     0.0     0.0     0.1
...
Hourly averages cannot exclude the possibility of spikes of activity happening
on the system.

The scripts measured scheduling latency. There were plenty of scheduling
latency values larger than 30 milliseconds (the values after "lat" are in
microseconds):
 qemu-system-x86-45417   [060] ...2. 1803295.566825: schedlatwake: comm qemu-system-x86 pid 45417 prio 120 lat 184196877
...
 qemu-system-x86-129996  [080] ...2. 1976791.187134: schedlatwake: comm qemu-system-x86 pid 129996 prio 120 lat 48920224
 qemu-system-x86-93208   [002] ...2. 1888512.540509: schedlatwake: comm qemu-system-x86 pid 93208 prio 120 lat 48650888
 qemu-system-x86-9531    [008] ...2. 1849588.538004: schedlatwake: comm qemu-system-x86 pid 9531 prio 120 lat 48236447
...
 qemu-system-x86-54060   [017] ...2. 1846462.878024: schedlatwake: comm qemu-system-x86 pid 54060 prio 120 lat 7010089
 qemu-system-x86-62362   [016] ...2. 1991589.150792: schedlatwake: comm qemu-system-x86 pid 62362 prio 120 lat 6967569
 qemu-system-x86-120933  [060] ...2. 1990822.180850: schedlatwake: comm qemu-system-x86 pid 120933 prio 120 lat 6966636
The maximum was 184 seconds, which is really extreme. I am not sure whether
this could be caused by some bug in the synthetic event code of ftrace.
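
A minimal sketch of how "schedlatwake" lines like the ones quoted in this
comment could be post-processed to list the samples above 30 milliseconds in
seconds. The field layout is assumed from the quoted samples only; the function
name, the regular expression and the default threshold are hypothetical and are
not taken from the measurement scripts.

```python
import re

# Field layout assumed from the quoted samples:
#   <task>-<pid> [cpu] flags timestamp: schedlatwake: comm X pid N prio N lat N
LINE_RE = re.compile(
    r"\s(?P<ts>\d+\.\d+): schedlatwake: comm (?P<comm>\S+) "
    r"pid (?P<pid>\d+) prio (?P<prio>\d+) lat (?P<lat>\d+)"
)


def parse_latencies(lines, threshold_us=30_000):
    """Return (pid, latency in seconds) for samples above threshold_us,
    sorted from the largest latency to the smallest."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a schedlatwake line (for example an elision "...")
        lat_us = int(m.group("lat"))  # "lat" is reported in microseconds
        if lat_us > threshold_us:
            hits.append((int(m.group("pid")), lat_us / 1_000_000))
    return sorted(hits, key=lambda h: h[1], reverse=True)


sample = [
    " qemu-system-x86-54060   [017] ...2. 1846462.878024: schedlatwake: "
    "comm qemu-system-x86 pid 54060 prio 120 lat 7010089",
    "...",
    " qemu-system-x86-45417   [060] ...2. 1803295.566825: schedlatwake: "
    "comm qemu-system-x86 pid 45417 prio 120 lat 184196877",
]

for pid, secs in parse_latencies(sample):
    print(f"pid {pid}: {secs:.1f} s")
```

For the two sample lines this reports pid 45417 at about 184.2 s and pid 54060
at about 7.0 s, which matches the conversion from microseconds used above.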

If the scheduling latency is real, it means that various qemu threads cannot
run in a timely manner and experience delays of several seconds or tens of
seconds. Each VM creates more than 20 threads, one of which is the vCPU thread.
I am not sure whether delays of threads other than the vCPU thread can cause
problems, but I assume they may.

My next step would be capturing a full context switch trace with snapshotting.
A snapshot would be stored when a value of scheduling latency exceeds a
threshold. This will allow me to verify those scheduling latency values as well
as check for spikes of activity.
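
One way such a threshold-triggered snapshot could be armed is with the ftrace
"snapshot" event trigger combined with a filter on the "lat" field. The sketch
below is only an illustration under several assumptions: that the synthetic
event is exposed under events/synthetic/schedlatwake, that the kernel is built
with snapshot support, that enabling sched:sched_switch is what "full context
switch trace" refers to, and that 30000 microseconds is a suitable threshold.
The function name and its parameters are hypothetical.

```python
from pathlib import Path


def arm_latency_snapshot(tracefs, threshold_us, event="schedlatwake"):
    """Enable context switch tracing and install a snapshot trigger that
    fires when the synthetic event reports lat > threshold_us.

    tracefs is the tracefs mount point (commonly /sys/kernel/tracing).
    Returns the trigger string that was written.
    """
    root = Path(tracefs)

    # Assumption: context switches are recorded through sched:sched_switch.
    (root / "events/sched/sched_switch/enable").write_text("1\n")

    # "snapshot if <filter>" is the documented event trigger syntax; "lat" is
    # the field name seen in the quoted trace output.
    trigger = f"snapshot if lat > {threshold_us}"

    # Append rather than truncate, so that triggers already installed on this
    # event are left in place.
    with open(root / "events/synthetic" / event / "trigger", "a") as f:
        f.write(trigger + "\n")
    return trigger
```

Run as root, a call such as arm_latency_snapshot("/sys/kernel/tracing", 30000)
would arm the trigger; the captured buffer could then be read from the
"snapshot" file in the same directory.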
-- 
You are receiving this mail because:
You are the assignee for the bug.

bugzilla_noreply＠suse.com