https://bugzilla.suse.com/show_bug.cgi?id=1220119
https://bugzilla.suse.com/show_bug.cgi?id=1220119#c27

--- Comment #27 from Jiri Wiesner <jwiesner@suse.com> ---
At first sight, the hypervisor does not look overloaded. The CPU utilization on openqaworker24 was medium. The hourly average did not go above 20%. These are the 3 highest hourly averages:

         user  nice  system  idle  iowait  irq  softirq
CPU all  15.7   0.2     2.6  80.8     0.0  0.0      0.1
CPU all  14.3   0.0     2.8  82.3     0.0  0.0      0.1
CPU all  14.6   0.1     3.0  81.8     0.0  0.0      0.1

Hourly averages cannot exclude the possibility of short spikes of activity happening on the system. The scripts measured scheduling latency. There were plenty of scheduling latency values larger than 30 milliseconds (the values after "lat" are in microseconds):

qemu-system-x86-45417 [060] ...2. 1803295.566825: schedlatwake: comm qemu-system-x86 pid 45417 prio 120 lat 184196877
...
qemu-system-x86-129996 [080] ...2. 1976791.187134: schedlatwake: comm qemu-system-x86 pid 129996 prio 120 lat 48920224
qemu-system-x86-93208 [002] ...2. 1888512.540509: schedlatwake: comm qemu-system-x86 pid 93208 prio 120 lat 48650888
qemu-system-x86-9531 [008] ...2. 1849588.538004: schedlatwake: comm qemu-system-x86 pid 9531 prio 120 lat 48236447
...
qemu-system-x86-54060 [017] ...2. 1846462.878024: schedlatwake: comm qemu-system-x86 pid 54060 prio 120 lat 7010089
qemu-system-x86-62362 [016] ...2. 1991589.150792: schedlatwake: comm qemu-system-x86 pid 62362 prio 120 lat 6967569
qemu-system-x86-120933 [060] ...2. 1990822.180850: schedlatwake: comm qemu-system-x86 pid 120933 prio 120 lat 6966636

The maximum was 184 seconds, which is really extreme. I am not sure whether this could be caused by some bug in the synthetic event code of ftrace. If the scheduling latency is real, it means that various qemu threads cannot run in a timely manner and experience delays of several seconds or tens of seconds. Each VM creates more than 20 threads, one of which is the vCPU thread. I am not sure whether delays of threads other than the vCPU thread can cause problems, but I assume they may.

My next step would be capturing a full context switch trace with snapshotting. A snapshot would be stored whenever a scheduling latency value exceeds a threshold. This will allow me to verify those scheduling latency values as well as check for spikes of activity.
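
For anyone who wants to repeat the unit conversion on the trace output, here is a minimal Python sketch. It is not the script used for the measurements. It only assumes the schedlatwake line layout shown in the excerpt above and the 30 millisecond threshold mentioned in the comment; the names LINE_RE, THRESHOLD_US, parse_latencies and over_threshold are made up for illustration.

```python
import re

# Field layout taken from the schedlatwake lines quoted in the comment.
# Assumption: "comm" contains no whitespace, as in the quoted lines.
LINE_RE = re.compile(
    r"schedlatwake: comm (?P<comm>\S+) pid (?P<pid>\d+) "
    r"prio (?P<prio>\d+) lat (?P<lat>\d+)"
)

THRESHOLD_US = 30_000  # 30 milliseconds, expressed in microseconds


def parse_latencies(text):
    """Return a list of (pid, latency_us) for every schedlatwake line in text."""
    return [(int(m["pid"]), int(m["lat"])) for m in LINE_RE.finditer(text)]


def over_threshold(samples, threshold_us=THRESHOLD_US):
    """Keep samples strictly above the threshold, largest latency first."""
    kept = [s for s in samples if s[1] > threshold_us]
    return sorted(kept, key=lambda s: s[1], reverse=True)


# Three lines copied from the trace excerpt.
SAMPLE = """\
qemu-system-x86-45417 [060] ...2. 1803295.566825: schedlatwake: comm qemu-system-x86 pid 45417 prio 120 lat 184196877
qemu-system-x86-129996 [080] ...2. 1976791.187134: schedlatwake: comm qemu-system-x86 pid 129996 prio 120 lat 48920224
qemu-system-x86-62362 [016] ...2. 1991589.150792: schedlatwake: comm qemu-system-x86 pid 62362 prio 120 lat 6967569
"""

if __name__ == "__main__":
    for pid, lat_us in over_threshold(parse_latencies(SAMPLE)):
        # Microseconds to seconds: divide by 1e6.
        print(f"pid {pid}: {lat_us / 1e6:.3f} s")
```

Dividing the "lat" field by 1e6 is what turns 184196877 into roughly 184 seconds. Whether those values are real or an ftrace artifact is exactly what the planned snapshot trace is meant to settle; this sketch cannot tell the two apart.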