On Wed, 27 Nov 2024, Andrii Nakryiko wrote:
On Wed, Nov 27, 2024 at 6:42 AM Richard Biener <rguenther@suse.de> wrote:
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 25.11.24 at 23:59, Andrii Nakryiko wrote:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
[...]
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuse, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of the actual question you are trying to answer.
So how does FP unwinding not run into the same issue as DWARF unwinding? If the actual unwinding happens at sampling time (to save you from capturing the very same stack and register state) then with deep call stacks you spend a lot of time in the kernel and hit reliability issues when you fault.
Ok, so with FPs, stack unwinding just reads memory that belongs to the current thread. In 99.9% of cases that memory will be physically present (not paged out). So the process is pretty reliable (and fast), if all functions maintain frame pointers.
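To make the frame-pointer walk concrete, here is a minimal sketch in Python over a simulated stack. It assumes an x86-64-style layout (saved caller FP at [fp], return address at [fp + 8]); the addresses and the unwind_fp name are made up for illustration and this is not how the kernel implements it.

```python
# Toy model of frame-pointer unwinding (x86-64-style layout assumed):
#   [fp]     -> saved frame pointer of the caller
#   [fp + 8] -> return address into the caller
WORD = 8

def unwind_fp(memory, fp, max_depth=64):
    """Walk the frame-pointer chain in `memory` (dict: address -> 64-bit word)."""
    return_addrs = []
    for _ in range(max_depth):
        if fp == 0 or fp not in memory or fp + WORD not in memory:
            break  # end of chain, or the read would leave the stack
        return_addrs.append(memory[fp + WORD])
        next_fp = memory[fp]
        if next_fp != 0 and next_fp <= fp:
            break  # stack grows down, so caller frames must sit at higher addresses
        fp = next_fp
    return return_addrs

# Hypothetical three-frame stack: leaf -> middle -> main
stack = {
    0x7000: 0x7040, 0x7008: 0x401234,  # leaf's frame: caller fp, return into middle
    0x7040: 0x7090, 0x7048: 0x401100,  # middle's frame: return into main
    0x7090: 0x0,    0x7098: 0x401000,  # main's frame: chain terminates
}
trace = unwind_fp(stack, 0x7000)
```

Each step is two word-sized reads from the thread's own stack, which is why the walk needs no side tables.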
With DWARF-based unwinding there are three classes of solutions.
1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's pretty big and is normally not used by the application, chances are high that the contents of this section won't be physically mapped into memory. So to access it one would need to cause a page fault and wait for the kernel to fulfill it. It's slower, but the real issue is that stack traces with BPF and perf are captured in non-sleepable context (NMI due to perf event, or tracepoints, or kprobes, none of which allow page faults). So you'll, at a minimum, run into the problem of .eh_frame effectively not being available when necessary.
I guess given RT is now first-class in the kernel there must be an existing code path to defer this work to a kthread and put the task to sleep instead? All of the discussed "better alternative" unwind-info formats suggest that the info be read in on demand, which of course wouldn't be possible from NMI context either.
2) Pre-processing of .eh_frame. Your tool would need to "discover" a set of processes of interest, read .eh_frame and pre-process it into something compact and easy to look up, and put it into memory (e.g. for a BPF program), then when the time comes, look up those tables to do unwinding. Main issues: lots of memory to preallocate, even if it might never be needed, lots of upfront CPU spend, even if it might never be needed, and also you can only unwind processes you knew about at pre-processing time. If some process started afterwards, you won't have the data, so unwinding will still fail.
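As a rough illustration of what "compact and easy to look up" could mean, the sketch below keeps a sorted table of PC ranges and binary-searches it. The table contents, the single "CFA = SP + offset" rule, and the lookup_cfa_offset name are assumptions for the example; real .eh_frame rules are considerably richer.

```python
import bisect

# Hypothetical pre-processed table: (start_pc, end_pc, cfa_offset_from_sp).
# Real .eh_frame rules are far richer; this keeps only "CFA = SP + offset".
TABLE = sorted([
    (0x401000, 0x401040, 16),
    (0x401040, 0x4010c0, 48),
    (0x4010c0, 0x401100, 32),
])
STARTS = [row[0] for row in TABLE]

def lookup_cfa_offset(pc):
    """Return the CFA offset for `pc`, or None if no row covers it."""
    i = bisect.bisect_right(STARTS, pc) - 1
    if i < 0:
        return None
    start, end, offset = TABLE[i]
    return offset if start <= pc < end else None

hit = lookup_cfa_offset(0x401050)
miss = lookup_cfa_offset(0x500000)  # e.g. code from a binary that was never pre-processed
```

A miss is the failure mode described above: with no row for the PC, the unwinder has nothing to go on.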
I don't see why you need to process .eh_frame into something different.
3) Stack contents capture. That's what perf supports. Capture some relatively large portion of the current thread's stack, hoping to capture enough. Post-process afterwards in user space, doing unwinding using the captured snapshot of the stack and .eh_frame. Main issues: whatever amount of stack you captured might not be enough. And that depends entirely on the specific functions that were active during the snapshot. If some function has lots of local variables stored on the stack, it might use up the entire captured stack, and so you won't even find other frames. And it's impossible to predict and mitigate. And, of course, it's expensive to copy so much memory between kernel and user space, and .eh_frame-based processing is still relatively slow.
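The truncation problem can be shown with simple arithmetic. The sketch below assumes a fixed snapshot size and made-up frame sizes (innermost frame first); frames_in_snapshot is a hypothetical helper, not part of perf.

```python
def frames_in_snapshot(frame_sizes, capture_bytes):
    """Count how many frames, innermost first, fit entirely in the captured bytes."""
    covered = 0
    used = 0
    for size in frame_sizes:
        if used + size > capture_bytes:
            break
        used += size
        covered += 1
    return covered

CAPTURE = 8192  # hypothetical snapshot size in bytes
shallow = [64, 128, 96, 256, 48]        # five small frames
big_locals = [64, 16384, 96, 256, 48]   # second frame holds a 16 KiB local buffer

shallow_frames = frames_in_snapshot(shallow, CAPTURE)
big_frames = frames_in_snapshot(big_locals, CAPTURE)
```

With the same call depth, one large local buffer leaves only the innermost frame recoverable from the snapshot.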
There is just no good and reliable solution for DWARF-based stack unwinding.
The issue is FP based unwinding cannot _ever_ be reliable as you can't know whether you have a valid FP and all FP based unwinders have to apply heuristics here. You'd have to consult .eh_frame to see whether you have a frame pointer. Yeah, so FP based unwinding is heuristically "fast" - when it works. But you don't get to know whether it does or not ;)
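The kind of heuristic meant here might look like the following sketch. The specific checks (alignment, stack bounds, strictly increasing addresses) and the plausible_fp name are assumptions for illustration; actual unwinders differ in what they check.

```python
def plausible_fp(fp, prev_fp, stack_lo, stack_hi, align=8):
    """Heuristic only: True if `fp` could be a caller's frame pointer."""
    return (
        fp % align == 0                  # pointer-aligned
        and stack_lo <= fp < stack_hi    # inside the thread's stack mapping
        and fp > prev_fp                 # caller frames sit at higher addresses
    )

STACK_LO, STACK_HI = 0x7000, 0x8000      # hypothetical stack bounds
ok = plausible_fp(0x7040, 0x7000, STACK_LO, STACK_HI)
misaligned = plausible_fp(0x7041, 0x7000, STACK_LO, STACK_HI)
# A register reused for general-purpose data can still pass every check:
false_positive = plausible_fp(0x7f00, 0x7000, STACK_LO, STACK_HI)
```

The checks reject obvious garbage, but a value that merely looks like a stack address passes them, so a successful walk does not prove the trace is correct.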
You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Not really. The kernel itself doesn't support it, and if you were to implement it in a BPF program yourself you'd run into many practical issues. The most damning would be the inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which the stack trace is typically captured (NMI, tracepoints, kprobes). See above.
I'm asking for the kernel to gain support. You are asking for all programs and libraries to get frame pointers.
Even more broad and ad hoc are bpftrace-based scripts: they are written quickly, often as one-time throw-aways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way.
Why so? Because they are not helped by appropriate tooling? I think this is all because of FUD and NIH.
See above. There are fundamental limitations, not just the lack of some common implementation to share. But yes, having to add an extra dependency on something so finicky (and still unreliable, conceptually) doesn't help the case.
FP unwinding is unreliable by definition. DWARF unwinding is reliable and precise by definition.
Frame pointers are not a requirement for unwinding and they are not in any way "better" than DWARF. Unwinding IP with a frame chain might be cheaper, but I doubt it would matter much for your ad-hoc bpftrace scripts.
They are. And yes, it will matter A LOT for those bpftrace scripts. See above.
I'd say it's a missing BPF feature that it doesn't abstract (for example by a kernel API) taking backtraces.
Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.
backtrace() is effortless stack tracing, and it uses DWARF. There's libbacktrace as well (used by GCC Go, which uses DWARF and not frame pointers).
backtrace() is used from *inside* the application to get its own stack trace. Here we are talking about capturing stack traces from outside applications, without the application realizing it or being specially built for that.
So? You were generalizing to the distro ecosystem.

Richard.

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)