http://bugzilla.suse.com/show_bug.cgi?id=1131437
http://bugzilla.suse.com/show_bug.cgi?id=1131437#c19
--- Comment #19 from Giovanni Gherdovich ---
Created attachment 804342
--> http://bugzilla.suse.com/attachment.cgi?id=804342&action=edit
migrations and frequency scaling plot for dbench / 2 clients on marvin4
I don't have answers yet, but I made some plots to show migrations and the
three turbostat metrics Avg_MHz, Busy% and Bzy_MHz. These numbers were not
taken from the userspace turbostat tool but computed from the tracepoint data
of power:pstate_sample (the intel_pstate freq scaling driver/governor).
What's apparent in the plots is that on the SLE15 kernel both clients see a
very high Bzy_MHz (average frequency excluding idle time). This value is
extremely close to the 1-core-turbo frequency (3.1 GHz on Marvin4), i.e. the
frequency available when all cores but one are idle.
On the other hand, Bzy_MHz on v5.0 shows:
* the first client getting a value on par with SLE15
* the second client never getting more than the max non-turbo p-state
(a.k.a. "base frequency", 2.3 GHz on Marvin4). It's actually often less than
that.
So this is consistent with the remark that "on v5.0 only one client at a time
gets low latency".
The Busy% signal doesn't look very different on the two kernels, which hints
that idling shouldn't be the root cause of the problem.
The migration pattern isn't aberrant: except for an initial phase lasting
around 5 seconds (the first 10 pages of the flipbook), the clients don't roam
around much but appear to stick to a small set of cpus.
It remains to be seen why v5.0 can't unlock 1-core-turbo for both clients as
SLE15 does.
Some notes on the plots and frequency formulas:
* the most unorthodox of the plots is the migrations panel: it should be
thought of as a bundle of NCPUS horizontal stripes, each representing a cpu's
occupation over time.
* in the migrations panel cpus are sorted numerically and not according to
topology. I may upload an additional diagram showing what a NUMA node looks
like in such a plot; roughly speaking, cpus 0-11 and 24-35 are NUMA node #0
and 12-23 and 36-47 are NUMA node #1.
* the power:pstate_sample tracepoint gives delta_APERF, delta_MPERF and
delta_TSC, which is all that's needed to compute the "turbostat
metrics". The "delta" part means "since the previous pstate_sample
hit". Quick recap:
* APERF is a counter ticking at the actual frequency of the core. Stops at
idle.
* MPERF is a counter ticking at the constant frequency of the max non-turbo
p-state, also called "base frequency". Stops at idle.
* TSC or Time Stamp Counter is exactly like MPERF but doesn't stop at idle
(on Marvin4 -- older processors have a so-called "non-invariant TSC", which
means it stops at idle and is then exactly the same as MPERF).
* the formulas are (straight from turbostat source code):
* Avg_MHz = delta_APERF * base_freq / delta_TSC
(in the above we're computing the length of the time interval by counting
TSC ticks, since we know there are 2300M of those per second)
* Busy% = delta_MPERF / delta_TSC
(in the above we're using that MPERF and TSC tick at the same speed,
but the latter doesn't stop at idle)
* Bzy_MHz = delta_APERF / delta_MPERF * base_freq
(Here we use the relative speed of APERF wrt MPERF as a multiplier to
MPERF's frequency. Also, we use the fact that neither APERF nor MPERF ticks
while idle).
* the attached page is one out of 360, since each page shows half a second of
activity and a dbench run is 3 minutes.
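
For reference, the three formulas above can be sketched in a few lines of
Python. This is only an illustration, not the script used for the plots: the
function name turbostat_metrics and the sample deltas are made up, and the
2300 MHz default is the Marvin4 base frequency quoted above. Busy% is returned
as a fraction, exactly as the formula is written.

```python
import math

BASE_FREQ_MHZ = 2300  # max non-turbo p-state on Marvin4, as stated above


def turbostat_metrics(delta_aperf, delta_mperf, delta_tsc,
                      base_freq_mhz=BASE_FREQ_MHZ):
    """Return (Avg_MHz, Busy, Bzy_MHz) for one pstate_sample interval.

    Hypothetical helper: the arguments are the three deltas reported by the
    power:pstate_sample tracepoint.
    """
    if delta_tsc <= 0:
        raise ValueError("delta_tsc must be positive")
    # Average frequency over the whole interval, idle time included.
    avg_mhz = delta_aperf * base_freq_mhz / delta_tsc
    # Fraction of the interval spent not idle (MPERF stops at idle, TSC doesn't).
    busy = delta_mperf / delta_tsc
    # Average frequency while not idle; undefined if the cpu never left idle.
    if delta_mperf > 0:
        bzy_mhz = delta_aperf / delta_mperf * base_freq_mhz
    else:
        bzy_mhz = math.nan
    return avg_mhz, busy, bzy_mhz


# Made-up sample: a 0.5 s interval (0.5 * 2300M TSC ticks), cpu busy half of
# the time and running at 3100 MHz while busy.
avg_mhz, busy, bzy_mhz = turbostat_metrics(
    delta_aperf=775_000_000,
    delta_mperf=575_000_000,
    delta_tsc=1_150_000_000,
)
print(f"Avg_MHz={avg_mhz:.0f} Busy={busy:.2f} Bzy_MHz={bzy_mhz:.0f}")
```

Note that Avg_MHz = Busy * Bzy_MHz follows directly from the three
definitions, which is a handy sanity check on the computed values.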