http://bugzilla.suse.com/show_bug.cgi?id=1131437
http://bugzilla.suse.com/show_bug.cgi?id=1131437#c19

--- Comment #19 from Giovanni Gherdovich <giovanni.gherdovich@suse.com> ---
Created attachment 804342
  --> http://bugzilla.suse.com/attachment.cgi?id=804342&action=edit
migrations and frequency scaling plot for dbench / 2 clients on marvin4

I don't have answers yet, but I made some plots to show migrations and the
three turbostat metrics Avg_MHz, Busy% and Bzy_MHz. These numbers were not
taken from the userspace turbostat tool but computed from the tracepoint data
of power:pstate_sample (the intel_pstate freq scaling driver/governor).

What's apparent in the plots is that on the SLE15 kernel both clients see a
very high Bzy_MHz (average frequency excluding idle time). This value is
extremely close to the 1-core-turbo frequency (3.1 GHz on Marvin4), i.e. the
frequency available when all cores but one are idle.

On the other hand, Bzy_MHz on v5.0 shows:

* the first client getting a value on par with SLE15
* the second client never getting more than the max non-turbo p-state (a.k.a.
  "base frequency", 2.3 GHz on Marvin4). It's actually often less than that.

So this is consistent with the remark that "on v5.0 only one client at a time
gets low latency".

The Busy% signal doesn't look very different on the two kernels, which hints
that idling shouldn't be the root cause of the problem.

The migration pattern isn't aberrant: except for an initial phase lasting
around 5 seconds (the first 10 pages of the flipbook), the clients don't roam
around much but appear to stick to a small set of cpus.

It remains to be seen why v5.0 can't unlock 1-core-turbo for both clients as
SLE15 does.

Some notes on the plots and frequency formulas:

* the most unorthodox of the plots is the migrations panel: it should be
  thought of as a bundle of NCPUS horizontal stripes, each representing the
  occupation of one cpu over time.
* in the migrations panel, cpus are sorted numerically and not according to
  topology. I may upload an additional diagram showing what a NUMA node looks
  like in such a plot; roughly speaking, 0-11 and 24-35 are NUMA node #0, and
  12-23 and 36-47 are NUMA node #1.
* the power:pstate_sample tracepoint gives delta_APERF, delta_MPERF and
  delta_TSC, which is all that's needed to compute the "turbostat metrics".
  The "delta" part means "since the previous pstate_sample hit". Quick recap:
  * APERF is a counter ticking at the actual frequency of the core. Stops at
    idle.
  * MPERF is a counter ticking at the constant frequency of the max non-turbo
    p-state, also called "base frequency". Stops at idle.
  * TSC, or Time Stamp Counter, is exactly like MPERF but doesn't stop at idle
    (on Marvin4 -- older processors have a so-called "non-invariant TSC",
    which means it stops at idle and is then exactly the same as MPERF).
* the formulas are (straight from turbostat source code):
  * Avg_MHz = delta_APERF * base_freq / delta_TSC
    (here we're computing the length of the time interval by counting TSC
    ticks, since we know there are 2300M of those per second)
  * Busy% = delta_MPERF / delta_TSC
    (here we're using the fact that MPERF and TSC tick at the same speed, but
    the latter doesn't stop at idle)
  * Bzy_MHz = delta_APERF / delta_MPERF * base_freq
    (here we use the relative speed of APERF with respect to MPERF as a
    multiplier to MPERF's frequency; we also use the fact that APERF and
    MPERF don't tick when in idle)
* the attached page is one out of 360, since each page shows half a second of
  activity and a dbench run is 3 minutes.
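
For reference, the three formulas above can be written as a short Python
sketch. This is not the script used to produce the plots: the function name,
the argument names and the sample numbers are made up for illustration, and
base_freq is assumed to be given in MHz (2300 for Marvin4). Busy% is returned
as a fraction between 0 and 1, as in the formula above.

```python
def turbostat_metrics(delta_aperf, delta_mperf, delta_tsc, base_freq_mhz=2300.0):
    """Compute (Avg_MHz, Busy%, Bzy_MHz) from one set of counter deltas.

    Hypothetical helper: the deltas are assumed to be the APERF, MPERF and
    TSC increments since the previous pstate_sample hit on the same cpu.
    """
    if delta_tsc <= 0:
        raise ValueError("delta_tsc must be positive")

    # delta_TSC / base_freq is the length of the interval, so this is the
    # average frequency over the whole interval, idle time included.
    avg_mhz = delta_aperf * base_freq_mhz / delta_tsc

    # MPERF and TSC tick at the same speed, but MPERF stops at idle.
    busy = delta_mperf / delta_tsc

    # Speed of APERF relative to MPERF, scaled by MPERF's frequency.
    # If the cpu was idle for the whole interval there is nothing to average.
    if delta_mperf > 0:
        bzy_mhz = delta_aperf / delta_mperf * base_freq_mhz
    else:
        bzy_mhz = 0.0

    return avg_mhz, busy, bzy_mhz


if __name__ == "__main__":
    # Made-up sample: a 0.5 s interval (1150M TSC ticks at 2300 MHz), with the
    # cpu busy 40% of the time and running at 3100 MHz while busy.
    avg_mhz, busy, bzy_mhz = turbostat_metrics(620_000_000, 460_000_000, 1_150_000_000)
    print(f"Avg_MHz={avg_mhz:.1f} Busy%={busy:.1%} Bzy_MHz={bzy_mhz:.1f}")
```

Note that Avg_MHz = Busy% * Bzy_MHz follows directly from the formulas, which
is a handy sanity check when computing the metrics from the trace data.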