Re: Proposal: Update default build flags to enable frame pointers

28 Nov 2024

      On Wed, 27 Nov 2024, Andrii Nakryiko wrote:
...
On Wed, Nov 27, 2024 at 6:42 AM Richard Biener <rguenther@suse.de> wrote:
...
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
...
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert
<aaronpuchert@alice-dsl.net> wrote:
...
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
...
[...] DWARF's .eh_frame-based stack unwinding tables. The latter
actually doesn't lend itself well to system-wide profiling, and involves
compromises in memory usage, CPU usage, and just plain reliability (no
one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking
about frame pointers, which cost CPU for all users. On one hand you're
[...]
...
To summarize: DWARF profiling doesn't work perfectly, but it might work
well enough in a large enough number of cases, and in the remaining
cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack
traces not for profiling, but to pinpoint which user code path causes
some "issue", where the issue itself can be a multitude of things
(resource utilization, suspicious calling patterns, lock misuse,
whatever you can come up with). Stack traces answer questions like
"where in the code" and "how did we get to that place", regardless of
the actual question you are trying to answer.
So how does FP unwinding not run into the same issue as DWARF
unwinding?  If the actual unwinding happens at sampling time
(to save you from capturing the very same stack and register state)
then with deep call stacks you spend a lot of time in the kernel
and hit reliability issues when you fault.
Ok, so with FPs, stack unwinding just reads memory that belongs to the
current thread. That memory in 99.9% of cases will be physically present
in memory (not paged out). So the process is pretty reliable (and
fast), if all functions maintain frame pointers.
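To make the frame-pointer walk concrete, here is a minimal sketch over simulated memory. It assumes an x86-64-style layout in which the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. The function name `unwind_fp`, the `memory` dict, and all addresses are made up for illustration; this is not code from the kernel or from any profiler.

```python
WORD = 8  # assumed 64-bit machine words


def unwind_fp(memory, fp, pc, max_depth=64):
    """Walk a frame-pointer chain in simulated memory.

    memory maps address -> word value. Assumed layout per frame:
    memory[fp] is the caller's saved frame pointer and
    memory[fp + WORD] is the return address into the caller.
    """
    trace = [pc]
    while fp != 0 and len(trace) < max_depth:
        saved_fp = memory.get(fp)
        ret_addr = memory.get(fp + WORD)
        if saved_fp is None or ret_addr is None:
            break  # unreadable memory: stop instead of faulting
        trace.append(ret_addr)
        # Sanity heuristic: the stack grows down, so a caller's frame
        # must sit at a higher address than the current one.
        if saved_fp != 0 and saved_fp <= fp:
            break
        fp = saved_fp
    return trace


# Three nested frames; a saved frame pointer of 0 marks the outermost one.
memory = {
    0x7000: 0x7040, 0x7008: 0x401200,
    0x7040: 0x7080, 0x7048: 0x401100,
    0x7080: 0x0,    0x7088: 0x401000,
}
trace = unwind_fp(memory, fp=0x7000, pc=0x401234)
```

Each step is two reads from the thread's own stack, which is why the walk is cheap. The address-ordering check is only a heuristic: nothing in the chain itself proves that a given register value really is a frame pointer.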
With DWARF-based unwinding there are three classes of solutions.
1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's
pretty big and is normally not used by the application, chances are
high that the contents of this section won't be physically mapped into
memory. So to access it one would need to cause a page fault and wait
for the kernel to fulfill it. It's slower, but the real issue is that
stack traces with BPF and perf are captured in non-sleepable context
(NMI due to perf event, or tracepoints, or kprobes, none of which
allow page faults). So you'll, at a minimum, run into the problem
with .eh_frame effectively not being available when necessary.
I guess given RT is now first-class in the kernel there must be
an existing code path to defer this work to a kthread and put the
task to sleep instead?  All the discussed "better alternative" unwind
info formats suggest that the info be read in on demand, which of
course wouldn't be possible from NMI context either.
...
2) Pre-processing of .eh_frame. Your tool would need to "discover" a
set of processes of interest, read .eh_frame and pre-process it into
something compact and easy to look up, and put it into memory (e.g. for a
BPF program), then when the time comes, look up those tables to do
unwinding. Main issues: lots of memory to preallocate, even if it
might never be needed, lots of upfront CPU spend, even if it might
never be needed, and also you can only unwind processes you knew about
at the pre-processing time. If some process started afterwards, you
won't have the data, so unwinding will still fail.
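As a rough illustration of what "compact and easy to look up" could mean, the sketch below reduces unwind info to sorted PC ranges searched with binary search. It is heavily simplified: real .eh_frame CFI describes per-register recovery rules, whereas here each range carries a single assumed CFA offset. The names `register_process`, `cfa_offset_for`, and `tables_by_pid`, as well as the row format, are hypothetical.

```python
import bisect

tables_by_pid = {}  # filled once, at pre-processing time


def build_table(rows):
    """Sort (start_pc, end_pc, cfa_offset) rows for binary search."""
    rows = sorted(rows)
    starts = [row[0] for row in rows]
    return starts, rows


def register_process(pid, rows):
    """Pre-process one known process; costs memory and CPU up front."""
    tables_by_pid[pid] = build_table(rows)


def cfa_offset_for(pid, pc):
    """Return the CFA offset covering pc, or None if nothing is known."""
    table = tables_by_pid.get(pid)
    if table is None:
        return None  # process appeared after pre-processing: no data
    starts, rows = table
    i = bisect.bisect_right(starts, pc) - 1
    if i < 0:
        return None  # pc lies below the first known range
    start, end, cfa_offset = rows[i]
    return cfa_offset if pc < end else None


register_process(100, [(0x2000, 0x2040, 32), (0x1000, 0x1080, 16)])
known = cfa_offset_for(100, 0x1010)    # pid was pre-processed
unknown = cfa_offset_for(200, 0x1010)  # pid was never pre-processed
```

The lookup for pid 200 returns `None`, which mirrors the last issue in the list: a process that starts after the pre-processing pass has no table, so unwinding it fails.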
I don't see why you need to process .eh_frame into something different.
...
3) Stack contents capture. That's what perf supports. Capture some
relatively large portion of the current thread's stack, hoping to
capture enough. Post-process afterwards in user space, doing unwinding
using the captured snapshot of the stack and .eh_frame. Main issues:
whatever amount of stack you captured might not be enough. And that
depends entirely on specific functions that were active during the
snapshot. If some function has lots of local variables stored on
stack, it might use up the entire captured stack, and so you won't
even find other frames. And it's impossible to predict and mitigate.
And, of course, it's expensive to copy so much memory between kernel
and user space, and .eh_frame-based processing is still relatively
slow.
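The truncation problem with stack snapshots comes down to simple arithmetic, sketched below. The frame sizes and the 8192-byte snapshot size are assumed values chosen for illustration, and `frames_recoverable` is a hypothetical helper, not part of perf.

```python
def frames_recoverable(frame_sizes, captured_bytes):
    """Count the frames, innermost first, that fit fully in the snapshot."""
    used = 0
    count = 0
    for size in frame_sizes:
        if used + size > captured_bytes:
            break  # this frame and every outer one are cut off
        used += size
        count += 1
    return count


SNAPSHOT = 8192  # assumed snapshot size in bytes

small_frames = [64, 128, 96, 48]    # every frame is small
big_local = [64, 16384, 96, 48]     # second frame has a 16 KiB local buffer

full = frames_recoverable(small_frames, SNAPSHOT)
truncated = frames_recoverable(big_local, SNAPSHOT)
```

With small frames all four are recoverable. With one large on-stack buffer only the innermost frame survives, and the tool cannot know this in advance because it depends on which functions happen to be active at sampling time.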
There is just no good and reliable solution for DWARF-based stack unwinding.
The issue is FP based unwinding cannot _ever_ be reliable as you
can't know whether you have a valid FP and all FP based unwinders
have to apply heuristics here.  You'd have to consult .eh_frame
to see whether you have a frame pointer.

Yeah, so FP based unwinding is heuristically "fast" - when it works.
But you don't get to know whether it does or not ;)
...
...
You can do DWARF unwinding in the kernel as well (as well as
stack capturing for FP based unwinding).
Not really. The kernel itself doesn't support it, and if you were to
implement it in a BPF program yourself you'd run into many practical
issues. The most damning would be the inability to page in memory that
contains .eh_frame section contents from non-sleepable context, in
which the stack trace is typically captured (NMI, tracepoints, kprobes).
See above.
I'm asking for the kernel to gain support.  You are asking for all
programs and libraries to get frame-pointers.
...
...
...
Even more broad and ad-hoc are bpftrace-based scripts; they are
written quickly, often as one-time throw-aways, to pinpoint or debug
specific issues. They very often capture and use stack traces to
pinpoint the source of the problem in the application code. While
comprehensive tools like perf might afford to try to utilize
DWARF-based stack unwinding (however unreliable and expensive), all of
the more lightweight tooling just can't afford to even attempt to
tackle this problem in any reasonable way.
Why so?  Because they are not helped by appropriate tooling?  I
think this is all because of FUD and NIH.
See above. There are fundamental limitations, not just lack of some
common implementation to share. But yes, having to add extra
dependency on something so finicky (and still unreliable,
conceptually) doesn't help the case.
FP unwinding is unreliable by definition.  DWARF unwinding is
reliable and precise by definition.
...
...
Frame pointers are not a requirement for unwinding and they are not
in any way "better" than DWARF.  Unwinding IP with a frame chain
might be cheaper, but I doubt it would matter much for your
ad-hoc bpftrace scripts.
They are. And yes, it will matter A LOT for those bpftrace scripts. See above.
I'd say it's a missing BPF feature that it doesn't abstract (for example
by a kernel API) taking backtraces.
...
...
...
Stack traces have uses way beyond profiling. Making them effortless is
extremely useful to the distro ecosystem itself, IMO.
backtrace() is effortless stack tracing, and it uses DWARF.  There's
libbacktrace as well (used by GCC go which uses DWARF and not frame
pointers).
backtrace() is used from *inside* the application to get its own stack
trace. Here we are talking about capturing stack traces from outside
applications without application realizing or being specially built
for that.
So?  You were generalizing to the distro ecosystem.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)