At first sight, the hypervisor does not look overloaded. CPU utilization on openqaworker24 was moderate: the hourly average did not go above 20%. These are the 3 highest hourly averages (values in percent):

>          user  nice  system  idle  iowait  irq  softirq
> CPU all  15.7   0.2     2.6  80.8     0.0  0.0      0.1
> CPU all  14.3   0.0     2.8  82.3     0.0  0.0      0.1
> CPU all  14.6   0.1     3.0  81.8     0.0  0.0      0.1

Hourly averages, however, cannot rule out short spikes of activity on the system.

The scripts measured scheduling latency. There were plenty of scheduling latency values larger than 30 milliseconds (the values after "lat" are in microseconds):

> qemu-system-x86-45417 [060] ...2. 1803295.566825: schedlatwake: comm qemu-system-x86 pid 45417 prio 120 lat 184196877
> ...
> qemu-system-x86-129996 [080] ...2. 1976791.187134: schedlatwake: comm qemu-system-x86 pid 129996 prio 120 lat 48920224
> qemu-system-x86-93208 [002] ...2. 1888512.540509: schedlatwake: comm qemu-system-x86 pid 93208 prio 120 lat 48650888
> qemu-system-x86-9531 [008] ...2. 1849588.538004: schedlatwake: comm qemu-system-x86 pid 9531 prio 120 lat 48236447
> ...
> qemu-system-x86-54060 [017] ...2. 1846462.878024: schedlatwake: comm qemu-system-x86 pid 54060 prio 120 lat 7010089
> qemu-system-x86-62362 [016] ...2. 1991589.150792: schedlatwake: comm qemu-system-x86 pid 62362 prio 120 lat 6967569
> qemu-system-x86-120933 [060] ...2. 1990822.180850: schedlatwake: comm qemu-system-x86 pid 120933 prio 120 lat 6966636

The maximum was about 184 seconds, which is really extreme. I am not sure whether this could be caused by some bug in the synthetic event code of ftrace. If the scheduling latency is real, it means that various qemu threads cannot run in a timely manner and experience delays of several seconds or even tens of seconds. Each VM creates more than 20 threads, one of which is the vCPU thread. I am not sure whether delays of threads other than the vCPU thread can cause problems, but I assume they may.
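For anyone who wants to re-check the numbers, here is a minimal sketch of how the quoted schedlatwake lines could be filtered against the 30 ms threshold and converted to seconds. The field layout in the regular expression is inferred only from the output quoted above, and the function names and the threshold constant are my own, not part of the measurement scripts.

```python
import re

# Field layout inferred from the quoted trace lines only (an assumption):
#   <task>-<pid> [<cpu>] <flags> <timestamp>: schedlatwake: comm ... pid ... prio ... lat <us>
LINE_RE = re.compile(
    r"\[(?P<cpu>\d+)\]\s+\S+\s+(?P<ts>\d+\.\d+):\s+schedlatwake:\s+"
    r"comm\s+(?P<comm>\S+)\s+pid\s+(?P<pid>\d+)\s+prio\s+(?P<prio>\d+)\s+"
    r"lat\s+(?P<lat_us>\d+)"
)

THRESHOLD_US = 30_000  # 30 milliseconds, expressed in microseconds


def parse_line(line):
    """Return the fields of one schedlatwake line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "timestamp": float(m.group("ts")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "prio": int(m.group("prio")),
        "lat_us": int(m.group("lat_us")),
    }


def over_threshold(lines, threshold_us=THRESHOLD_US):
    """Return events with latency above the threshold, largest first."""
    events = (parse_line(line) for line in lines)
    slow = [e for e in events if e is not None and e["lat_us"] > threshold_us]
    return sorted(slow, key=lambda e: e["lat_us"], reverse=True)


if __name__ == "__main__":
    sample = [
        "qemu-system-x86-45417 [060] ...2. 1803295.566825: schedlatwake: "
        "comm qemu-system-x86 pid 45417 prio 120 lat 184196877",
        "qemu-system-x86-120933 [060] ...2. 1990822.180850: schedlatwake: "
        "comm qemu-system-x86 pid 120933 prio 120 lat 6966636",
    ]
    for e in over_threshold(sample):
        print(f"pid {e['pid']} on CPU {e['cpu']}: {e['lat_us'] / 1e6:.1f} s")
```

Sorting by latency makes it easy to see whether the extreme values cluster on particular CPUs or at particular timestamps, which may help tell a tracing bug apart from real delays.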
My next step would be capturing a full context switch trace with snapshotting. A snapshot would be stored whenever a scheduling latency value exceeds a threshold. This would allow me to verify those scheduling latency values as well as check for spikes of activity.