Re: Proposal: Update default build flags to enable frame pointers

27 Nov 2024

On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
...
On 25.11.24 at 23:59, Andrii Nakryiko wrote:
...
[...] DWARF's .eh_frame-based stack unwinding tables. The latter
actually doesn't lend itself well to system-wide profiling, and involves
compromises in memory usage, CPU usage, and just plain reliability (no
one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking
about frame pointers, which cost CPU for all users. On one hand you're
But how much CPU does it actually cost? You can see 1-2% in
*benchmarks*, and some Python-specific *microbenchmark* showed up to
10% overhead, which is due to extremely bloated code that spills/fills
stuff onto/from the stack (so yeah, reducing the number of registers by
one does exacerbate an already existing problem, sure). In no way is
Python a high-performance solution to anything; any Python
application that really cares about and needs performance has been
offloading CPU-intensive work to C/C++-based libraries for a long time
(the PyTorches of the world and such).

So all the worries about frame pointer CPU overhead are very
overblown, I believe. As I mentioned, with the Fedora change there was
a lot of back and forth, and ultimately frame pointers were enabled by
default. So far no one is arguing to disable them, because there
is no perceptible CPU overhead associated with that.
...
arguing that DWARF profiling costs CPU for users that want to profile
shipped packages, but at the same time you wave aside the CPU cost that
this has for all users. Tumbleweed isn't just a distribution for server
farms, it's a distribution used for desktops as well, even by
non-programmers. I work for a software company as well. But we might not
be representative of all users. (And at my company, or at least in my
team, we don't profile shipped packages but our own binaries.)
Reliability is certainly a problem: "perf" limits the size of the
captured portion of the stack, and it affects the overhead. There seems
to be some kind of maximum for that limit. But:
* Sometimes the result is still good enough and allows drawing
   conclusions. With less frequent samples you can drive up the captured
   stack size significantly.
* Some profilers aren't limited by BPF and can collect the entire stack,
   no matter how big it is. Their overhead is higher, so you need a lower
   sampling frequency, but if you're not profiling a lot you might be
   willing to pay that price. The overhead can also hide issues, but
   that's a "can".
* If I want to profile a specific application, I can build it myself.
   You can always do this two-level profiling: first figure out the
   problematic binaries, then zoom into them. Of course this takes more
   time, but it's not made impossible by distribution choices.
To summarize: DWARF profiling doesn't work perfectly, but it might work
well enough in a large enough number of cases, and in the remaining
cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack
traces not for profiling, but to pinpoint which user code path causes
some "issue", where the issue itself can be a multitude of things
(resource utilization, suspicious calling patterns, lock misuse,
whatever you can come up with). Stack traces answer questions like
"where in the code" and "how did we get to that place", regardless of
what the actual question you are trying to answer is.

Even more broad and ad hoc are bpftrace-based scripts. They are
written quickly, often as one-time throwaways, to pinpoint or debug
specific issues. They very often capture and use stack traces to
pinpoint the source of the problem in the application code. While
comprehensive tools like perf might afford to try to utilize
DWARF-based stack unwinding (however unreliable and expensive), all of
the more lightweight tooling just can't afford to even attempt to
tackle this problem in any reasonable way.
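
To illustrate why frame-pointer unwinding is within reach of lightweight tooling, here is a toy model in Python. It assumes the usual x86-64 layout (saved frame pointer at fp, return address at fp + 8); the `unwind` helper, the dict-based "memory", and every address are made up for illustration and are not taken from any real tool.

```python
# Toy model of frame-pointer unwinding: the stack frames form a linked list,
# so a trace is just a bounded pointer chase. All addresses are hypothetical.

WORD = 8  # bytes per machine word on a 64-bit target


def unwind(memory, fp, pc, max_depth=64):
    """Return code addresses from the innermost frame outwards.

    memory maps an address to the word stored there; memory[fp] holds the
    caller's frame pointer and memory[fp + WORD] holds the return address.
    """
    trace = [pc]
    for _ in range(max_depth):
        if fp == 0 or fp not in memory:
            break  # end of chain, or unreadable memory
        ret = memory.get(fp + WORD)
        if ret is None:
            break  # return address slot is unreadable
        trace.append(ret)
        fp = memory[fp]  # follow the link to the caller's frame
    return trace


# Three made-up frames: leaf -> caller -> outermost (saved fp of 0 ends it).
memory = {
    0x7000: 0x7100, 0x7008: 0x401234,
    0x7100: 0x7200, 0x7108: 0x401100,
    0x7200: 0x0,    0x7208: 0x401000,
}

trace = unwind(memory, fp=0x7000, pc=0x401300)
print([hex(addr) for addr in trace])
```

The walk needs two reads per frame and no lookup tables. DWARF-based unwinding instead has to find and interpret per-address unwind rules for every frame.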

Stack traces have uses way beyond profiling. Making them effortless is
extremely useful to the distro ecosystem itself, IMO.