On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 25.11.24 at 23:59, Andrii Nakryiko wrote:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
But how much CPU does it actually cost? You can see 1-2% in *benchmarks*, and some Python-specific *microbenchmark* showed up to 10% overhead, which is due to extremely bloated code that spills/fills stuff onto/from the stack (so yeah, reducing the number of available registers by one does exacerbate an already existing problem, sure). In no way is the latter a high-performance solution to anything: any Python application that really cares about performance has been offloading CPU-intensive work to C/C++-based libraries for a long time (the PyTorches of the world and the like). So all the worries about frame pointer CPU overhead are very overblown, I believe. As I mentioned, with the Fedora change there was a lot of back and forth, and ultimately frame pointers were enabled by default. So far no one is arguing to disable them, because there is no perceptible CPU overhead associated with that.
arguing that DWARF profiling costs CPU for users that want to profile shipped packages, but at the same time you wave aside the CPU cost that this has for all users. Tumbleweed isn't just a distribution for server farms, it's a distribution used for desktops as well, even by non-programmers. I work for a software company as well. But we might not be representative of all users. (And at my company, or at least in my team, we don't profile shipped packages but our own binaries.)
Reliability is certainly a problem: "perf" limits the size of the captured portion of the stack, and it affects the overhead. There seems to be some kind of maximum for that limit. But:

* Sometimes the result is still good enough and allows drawing conclusions. With less frequent samples you can drive up the captured stack size significantly.
* Some profilers aren't limited by BPF and can collect the entire stack, no matter how big it is. Their overhead is higher, so you need a lower sampling frequency, but if you're not profiling a lot you might be willing to pay that price. The overhead can also hide issues, but that's a "can".
* If I want to profile a specific application, I can build it myself. You can always do this two-level profiling: first figure out the problematic binaries, then zoom into them. Of course this takes more time, but it's not made impossible by distribution choices.
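
To put rough numbers on the trade-off between sampling frequency and captured stack size, here is a back-of-the-envelope sketch. It only models the raw volume of stack bytes copied per CPU, not the cost of unwinding afterwards. The figures are assumptions for illustration, not measurements: 99 Hz as a typical sampling rate, 8192 bytes as the default dump size for "perf record --call-graph dwarf", and 65528 bytes as the upper limit. The function names are made up for this example.

```python
# Rough model: bytes of user stack copied per second on one CPU.
# All constants below are illustrative assumptions, not measured values.

def copy_rate_bytes_per_sec(sample_hz: int, stack_dump_bytes: int) -> int:
    """Stack bytes copied per second at a given sampling rate and dump size."""
    return sample_hz * stack_dump_bytes


def max_hz_for_budget(budget_bytes_per_sec: int, stack_dump_bytes: int) -> int:
    """Highest whole-number sampling rate that stays within the byte budget."""
    return budget_bytes_per_sec // stack_dump_bytes


DEFAULT_DUMP = 8192   # assumed default dump size in bytes
LARGE_DUMP = 65528    # assumed maximum dump size in bytes
BASELINE_HZ = 99      # assumed typical sampling frequency

baseline = copy_rate_bytes_per_sec(BASELINE_HZ, DEFAULT_DUMP)
deep_hz = max_hz_for_budget(baseline, LARGE_DUMP)

print(f"baseline copy rate: {baseline} bytes/s per CPU")
print(f"same budget with {LARGE_DUMP}-byte dumps: about {deep_hz} Hz")
```

Under these assumptions, capturing roughly eight times as much stack per sample means sampling roughly eight times less often for the same copy volume.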
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuse, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of what the actual question you are trying to answer is. Even more broad and ad hoc are bpftrace-based scripts: they are written quickly, often as one-time throwaways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way. Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.