Comment # 19 on bug 1131437 from
Created attachment 804342 [details]
migrations and frequency scaling plot for dbench / 2 clients on marvin4

I don't have answers yet, but I made some plots to show migrations and the
three turbostat metrics Avg_MHz, Busy% and Bzy_MHz. These numbers were not
taken from the userspace turbostat tool but computed from the tracepoint data
of power:pstate_sample (the intel_pstate freq scaling driver/governor).

What's apparent in the plots is that on the SLE15 kernel both clients see a
very high Bzy_MHz (average frequency excluding idle time). This value is
extremely close to the 1-core-turbo frequency (3.1 GHz on Marvin4), i.e. the
frequency available when all cores but one are idle.

On the other hand, Bzy_MHz on v5.0 shows:

* the first client getting a value on par with SLE15
* the second client never getting more than the max non-turbo p-state
  (a.k.a. "base frequency", 2.3 GHz on Marvin4). It's actually often less than
  that.

So this is consistent with the remark that "on v5.0 only one client at a time
gets low latency".

The Busy% signal doesn't look very different on the two kernels, which hints
that idling shouldn't be the root cause of the problem.

The migration pattern isn't aberrant: except for an initial phase lasting
around 5 seconds (the first 10 pages of the flipbook), the clients don't roam
around much but appear to stick to a small set of cpus.

It remains to be seen why v5.0 can't unlock 1-core-turbo for both clients as
SLE15 does.

Some notes on the plots and frequency formulas:

* the most unorthodox of the plots is the migrations panel: it should be
  thought of as a bundle of NCPUS horizontal stripes, each representing a cpu's
  occupation over time.
* in the migrations panel cpus are sorted numerically and not according to
  topology. I may upload an additional diagram showing what a NUMA node looks
  like in such a plot; roughly speaking, 0-11 and 24-35 are NUMA node #0 and
  12-23 and 36-47 are NUMA node #1
* the power:pstate_sample tracepoint gives delta_APERF, delta_MPERF and
  delta_TSC, which is all that's needed to compute the "turbostat
  metrics". The "delta" part means "since the previous pstate_sample
  hit". Quick recap:

  * APERF is a counter ticking at the actual frequency of the core. Stops at
    idle.
  * MPERF is a counter ticking at the constant frequency of the max non-turbo
    p-state, also called "base frequency". Stops at idle.
  * TSC or Time Stamp Counter is exactly like MPERF but doesn't stop at idle
    (on Marvin4 -- older processors have a so-called "non-invariant TSC" which
    means it stops at idle, and is then exactly the same as MPERF).

  * the formulas are (straight from turbostat source code):

    * Avg_MHz = delta_APERF * base_freq / delta_TSC
      (in the above we're computing the length of the time interval counting
      TSC ticks, since we know there are 2300M a second of those)
    * Busy% = delta_MPERF / delta_TSC
      (in the above we're using that MPERF and TSC tick at the same speed,
      but the latter doesn't stop at idle)
    * Bzy_MHz = delta_APERF / delta_MPERF * base_freq
      (Here we use the relative speed of APERF wrt MPERF as a multiplier to
      MPERF's frequency. Also, we use that APERF and MPERF don't tick when in
      idle).

* the attached page is one out of 360, since each page shows half a second of
  activity and a dbench run is 3 minutes.
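
For reference, the three formulas above can be sketched in a few lines of
Python. This is only an illustration and not the script used for the plots:
the function name, the argument names and the sample numbers are made up, and
the 2300 MHz default is the Marvin4 base frequency quoted above.

```python
# Hypothetical helper: turns one set of pstate_sample deltas into the three
# "turbostat metrics". Names and defaults are assumptions for illustration.
BASE_FREQ_MHZ = 2300  # max non-turbo p-state on Marvin4, per the notes above


def turbostat_metrics(delta_aperf, delta_mperf, delta_tsc,
                      base_freq_mhz=BASE_FREQ_MHZ):
    """Return (avg_mhz, busy, bzy_mhz) for one sampling interval.

    busy is returned as a ratio between 0 and 1, exactly as in the formula
    above; multiply by 100 to get a percentage.
    """
    # delta_tsc / base_freq_mhz is the interval length in microseconds,
    # since TSC ticks base_freq_mhz million times per second.
    avg_mhz = delta_aperf * base_freq_mhz / delta_tsc
    # MPERF and TSC tick at the same rate, but only TSC keeps going at idle.
    busy = delta_mperf / delta_tsc
    # APERF/MPERF is the speed relative to base frequency while not idle.
    # A fully idle interval has delta_mperf == 0, so guard the division.
    bzy_mhz = delta_aperf / delta_mperf * base_freq_mhz if delta_mperf else 0.0
    return avg_mhz, busy, bzy_mhz


# Made-up sample: a 10 ms interval (23M TSC ticks), cpu busy half of the
# time and running at 3.1 GHz while busy.
avg_mhz, busy, bzy_mhz = turbostat_metrics(
    delta_aperf=15_500_000, delta_mperf=11_500_000, delta_tsc=23_000_000)
print(f"Avg_MHz={avg_mhz:.0f} Busy%={100 * busy:.1f} Bzy_MHz={bzy_mhz:.0f}")
```

A handy sanity check on the output is that Avg_MHz equals Busy (as a ratio)
times Bzy_MHz, which follows directly from the three formulas.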

