Proposal: Update default build flags to enable frame pointers
Hello,

I would like to change openSUSE's default build flags in rpm and rpm-config-SUSE to incorporate frame pointers (and equivalents on other architectures) to support real-time profiling and observability on SUSE distributions. In practice, this would mean essentially adopting the same tunables that exist in Fedora[1] for openSUSE: turn them on by default and allow packages or OBS projects to selectively opt out as needed.

The reasons to do so are threefold:

* It is a major competitive disadvantage for us to lack the ability to do cheap real-time profiling and observability. Fedora[2], Ubuntu[3], Arch Linux[4], and AlmaLinux[5] all now do this, and thus support this capability.
* The performance hit of having it versus not is insignificant[6].
* There are new tools in both the cloud-native and regular systems development worlds that leverage this, and openSUSE should be an enabler of those technologies.

I want openSUSE to be a great place for people to develop and optimize workloads on, especially desktop ones, where most of the tooling we have for tracing and profiling is broken without frame pointers (see Sysprof and Hotspot from GNOME and KDE respectively, which both rely on frame pointers to have cheap real-time tracing for performance analysis). And given that we advertise as the "makers' choice", I think it would definitely be on-brand for us to have the capability to better support makers and shakers.

For those interested in more detail about frame pointers, Brendan Gregg has a decent post about it[7]. I have also submitted a parallel request for openSUSE Leap 16 to have this feature enabled too[8].

I truly believe this would give us an even better footing with the broader community of developers and operators and make SUSE distributions very attractive for FOSS and proprietary software alike.

Best regards,
Neal

[1]: https://src.fedoraproject.org/rpms/redhat-rpm-config/blob/93063bb396395b9a20...
[2]: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer
[3]: https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-b...
[4]: https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0026-fno-omit...
[5]: https://almalinux.org/blog/2024-10-22-introducing-almalinux-os-kitten/
[6]: https://www.phoronix.com/review/fedora-38-beta-benchmarks
[7]: https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointer...
[8]: https://code.opensuse.org/leap/features/issue/175

-- 真実はいつも一つ!/ Always, there's only one truth!
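For concreteness, the change could look roughly like Fedora's implementation: a distribution-wide macro that is on by default, and that a package can undefine to opt out. A sketch only; the macro name below mirrors Fedora's redhat-rpm-config, and the exact openSUSE spelling would be settled in review:

```
# Distribution macros: enabled by default.
%_include_frame_pointers 1

# Flags appended to the default build flags when the macro is set
# (with per-architecture equivalents where these are x86-specific):
-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer

# In an individual spec file (or OBS project config) that needs to opt out:
%undefine _include_frame_pointers
```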
Hey Neal,

For what it's worth, I think the same. Also, these days I've been using bpftrace quite extensively and I noticed it segfaults on Tumbleweed; that's really sad.

Huge plus one from my side (again: for what it's worth).

Alessio

On Sat, 16/11/2024 at 13:24 -0500, Neal Gompa wrote:
[...]
On 16.11.24 at 22:55, Alessio Biancalana via openSUSE Factory wrote:
For what it's worth I think the same. Also, these days I've been using bpftrace quite extensively and I noticed it segfaults on tumbleweed, that's really sad.
Maybe https://bugzilla.opensuse.org/show_bug.cgi?id=1233220? Should be fixed with the next snapshot, sorry for that.
On Sat, 16 Nov 2024, Neal Gompa wrote:
[...]
I truly believe this would give us an even better footing with the broader community of developers and operators and make SUSE distributions very attractive for FOSS and proprietary software alike.
It seems this is optimizing the system for profiling it rather than using it, which is an odd thing to do. I realize not having frame pointers can make accurate profiling more difficult (though I do this every day); still, taking a 1-10% hit on cpython seems bad.

I also object to enforcing this for 32-bit x86, which is a very register-starved architecture (x86-64 is only slightly better in this regard).

I'll note -mno-omit-leaf-frame-pointer is x86 specific - is the proposal only directed at x86 and x86-64?

How do you enforce frame pointers for JITed code, or for code generated by compilers that are not GCC (Rust, Go, etc.)? Or why do you choose to "ignore" profiling those?

In any case - I propose shipping all packages with debug info included, since that greatly improves the profiling experience - even more so than enabling frame pointers. Bandwidth and disk is cheap these days.

Thanks,
Richard.
-- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On 18/11/2024 09.30, Richard Biener wrote:
I also object to enforcing this for 32-bit x86, which is a very register-starved architecture (x86-64 is only slightly better in this regard).
I think we can leave i586 as is and just improve x86_64. It has 8 additional general-purpose registers, so losing 1 of 15 is not as bad as losing 1 of 7 on i586. But I guess most of the performance impact comes from the 3 extra instructions needed to save and restore RBP. Is there an option to not use frame pointers in small functions (where the performance impact would be the largest)?

The "Shadow Stacks" alternative mentioned in [4] also sounds interesting. CPUs with this seem to have been around for 6 years already... But it still needs code to use it.

Not sure about other 64-bit architectures. aarch64, riscv, ppc64 - maybe later?
In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.
We have debuginfod these days that should allow to pull in required info on demand. And while desktop machines are very powerful these days, running VMs and containers greatly profits from small footprints. Ciao Bernhard M.
[4]: https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0026-fno-omit...
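The debuginfod workflow mentioned above looks roughly like this (a sketch; the server URL is the public openSUSE instance, and tools that link against libdebuginfod, such as gdb and perf, honor the environment variable automatically):

```shell
# Point debuginfo-aware tools at the openSUSE debuginfod server.
export DEBUGINFOD_URLS="https://debuginfod.opensuse.org/"

# Explicitly fetch debug info for a binary on demand (cached locally):
debuginfod-find debuginfo /usr/bin/ls
```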
On Mon, 18 Nov 2024, Bernhard M. Wiedemann via openSUSE Factory wrote:
On 18/11/2024 09.30, Richard Biener wrote:
I also object to enforcing this for 32-bit x86, which is a very register-starved architecture (x86-64 is only slightly better in this regard).
I think we can leave i586 as is and just improve x86_64. It has 8 additional general-purpose registers, so losing 1 of 15 is not as bad as losing 1 of 7 on i586.
But I guess most of the performance impact comes from the 3 extra instructions needed to save and restore RBP. Is there an option to not use frame pointers in small functions (where the performance impact would be the largest)?
That's -momit-leaf-frame-pointer - small functions are usually leaf (do not call other functions).
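The effect of that flag is easy to see on a small example (a sketch; assumes gcc is installed on an x86-64 or aarch64 host, both of which accept -momit-leaf-frame-pointer):

```shell
# A leaf function: it calls nothing, so -momit-leaf-frame-pointer applies.
cat > leaf.c <<'EOF'
int square(int x) { return x * x; }
EOF

# Full frame pointers: the prologue sets up rbp (x86-64) / x29 (aarch64).
gcc -O2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -S -o fp.s leaf.c

# Same, but with leaf frames omitted: the prologue disappears again.
gcc -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer -S -o leaffp.s leaf.c

grep -cE 'rbp|x29' fp.s
grep -cE 'rbp|x29' leaffp.s || echo "no frame pointer in leaf function"
```

Non-leaf functions keep their frame pointer either way, so unwinding through call chains still works; only the innermost frame may be missed.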
The "Shadow Stacks" alternative mentioned in [4] also sounds interesting. CPUs with this seem to be around for 6 years already... But it still needs code to use it.
Not sure about other 64-bit architectures. aarch64, riscv, ppc64 - maybe later?
All those RISCy architectures have enough registers to spare, or like power and s390 have other ways to unwind for backtraces (link register).
In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.
We have debuginfod these days that should allow to pull in required info on demand.
I don't see whole-system profiling pulling that in.
And while desktop machines are very powerful these days, running VMs and containers greatly profits from small footprints.
Code without frame pointers is smaller as well. Legacy hardware without tons of storage also benefits more from a few % extra performance than your up-to-date European HW.

Richard.
Ciao Bernhard M.
[4]: https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0026-fno-omit...
-- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Mon, Nov 18, 2024 at 8:40 AM Richard Biener <rguenther@suse.de> wrote:
[...]
Code without frame pointers is smaller as well. Legacy hardware without tons of storage also benefits more from a few % extra performance than your up-to-date European HW.
But the storage increase is several orders of magnitude smaller than using DWARF data.

-- 真実はいつも一つ!/ Always, there's only one truth!
On Thu, Nov 21, 2024 at 12:32 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 18. 11. 24, 16:22, Neal Gompa wrote:
But the storage increase is several orders of magnitude smaller than using DWARF data.
Then we could use ORC. Which is a reason of NOT having framepointers in the kernel, while being able to unwind.
Sure, do we already build our kernels with ORC?

-- 真実はいつも一つ!/ Always, there's only one truth!
On 21. 11. 24, 13:30, Neal Gompa wrote:
On Thu, Nov 21, 2024 at 12:32 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 18. 11. 24, 16:22, Neal Gompa wrote:
But the storage increase is several orders of magnitude smaller than using DWARF data.
Then we could use ORC. Which is a reason of NOT having framepointers in the kernel, while being able to unwind.
Sure, do we already build our kernels with ORC?
Yes, since ORC's day 0. We used to have a DWARF unwinder before that, and attempts to push it upstream led to ORC.

-- js suse labs
On 21. 11. 24, 16:10, Jiri Slaby wrote:
[...]
Yes, since ORC's day 0. We used to have a DWARF unwinder before that, and attempts to push it upstream led to ORC.
BTW let me cite from https://docs.kernel.org/arch/x86/orc-unwinder.html:

====
9.2. ORC vs frame pointers

With frame pointers enabled, GCC adds instrumentation code to every function in the kernel. The kernel's .text size increases by about 3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel Gorman [1] have shown a slowdown of 5-10% for some workloads.
====

9.3, "ORC vs DWARF", is interesting reading too:

====
The simpler debuginfo format also enables the unwinder to be much faster than DWARF, which is important for perf and lockdep. In a basic performance test by Jiri Slaby [2], the ORC unwinder was about 20x faster than an out-of-tree DWARF unwinder. (Note: That measurement was taken before some performance tweaks were added, which doubled performance, so the speedup over DWARF may be closer to 40x.)
====

[1] https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/
[2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz

regards,
-- js suse labs
Jiri Slaby wrote:
On 18. 11. 24, 16:22, Neal Gompa wrote:
But the storage increase is several orders of magnitude smaller than using DWARF data.
Then we could use ORC. Which is a reason of NOT having framepointers in the kernel, while being able to unwind.
ORC exists only for the kernel, while this proposal involves enabling frame pointers for user-space applications, no?

The user-space analogue of ORC is probably the SFrame proposal (https://sourceware.org/binutils/wiki/sframe), but it's still "in the making" and needs proper integration into the kernel (and especially with BPF it will require even more work). All that is planned and there is interest in making this happen (myself included), but it is not a viable alternative to frame pointers right now, and probably won't be for at least the next few years, until Clang gets support for SFrame, kernel support lands, and BPF support exists *and* various BPF-based tools and profilers adopt a new asynchronous way of capturing stack traces (that's what's needed to be able to use SFrame from the BPF side).

So all in all, I don't think there is any realistic alternative to frame pointers now and for the foreseeable future.
On Monday 2024-11-18 09:30, Richard Biener wrote:
In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.
I wish people would stop with the "bandwidth/disk is cheap" nonsense. Tumbleweed is in the same size class as some Call Of Duty episode. Memes aside, it still downloads in roughly the same time as a SUSE installation did 20 years ago. And there was a technology shift where disk sizes experienced a massive crunch in favor of speed. Buy a machine today, and you get the same disk size as in 2007.
On Mon, Nov 18, 2024 at 3:30 AM Richard Biener <rguenther@suse.de> wrote:
[...]
It seems this is optimizing the system for profiling it rather than using it, which is an odd thing to do. I realize not having frame pointers can make accurate profiling more difficult (though I do this every day); still, taking a 1-10% hit on cpython seems bad.
It's not just about making accurate profiling easier; it's also about making it cheap. Frame pointers make it so you can sample at any point on a running system. Users can sample as they observe problems and report to developers. Developers and operators can observe real workloads without changing the system configuration. This is why the Go compiler has had frame pointers on by default for 6 years. It's also why other operating systems have frame pointers on everywhere except 32-bit x86. Linux is the outlier.
I also object to enforcing this for 32-bit x86, which is a very register-starved architecture (x86-64 is only slightly better in this regard).
Sure, we can leave it out by default for i586/x86_32.
I'll note -mno-omit-leaf-frame-pointer is x86 specific - is the proposal only directed to x86 and x86-64?
No. As I said, the idea is to use equivalent flags on all supported architectures.
How do you enforce frame pointers for JITed code, or for code generated by compilers that are not GCC (Rust, Go, etc.)? Or why do you choose to "ignore" profiling those?
Go already does this and has for most of the past decade (which is why cloud-native observability tools even exist at all). I'd like to turn it on in Rust as well. In Fedora, it *was* turned on in Rust, controlled by the same rpm macro that affected the compiler flags for relevant architectures[1]. [1]: https://pagure.io/fedora-rust/rust-packaging/blob/1402e757e3200e6f06b8a9c0db...
In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.
Actually, DWARF-based profiling isn't very good, which is why so few people do it. It's slow and memory-intensive, so the sampling capability is much poorer. Richard W.M. Jones did a decent comparison of this and the problems with DWARF profiling versus leveraging frame pointers[2].

Also, disk space available on systems has, curiously enough, remained mostly the same for 20 years. There was a brief period where storage *did* go up, but the introduction of flash storage brought it back down on most computer systems. It's also wildly expensive to have a lot of storage on portable computers, which are now what most people have. And bandwidth costs vary based on what part of the world you live in: it's cheaper in Europe than it is in the Americas, and especially so in Asia and Africa.

[2]: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwarf-my-verdict/

-- 真実はいつも一つ!/ Always, there's only one truth!
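The difference between the two approaches is visible directly in perf's options (real perf-record flags; the workload name is a placeholder):

```shell
# Frame-pointer unwinding: one pointer chase per frame at sample time.
perf record -F 99 --call-graph fp -- ./myapp

# DWARF mode: perf copies a chunk of the user stack (8 KiB by default)
# into the ring buffer for every sample and unwinds it afterwards,
# which is where the memory and CPU cost comes from.
perf record -F 99 --call-graph dwarf -- ./myapp

perf report
```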
On 18.11.24 at 13:02, Neal Gompa wrote:
Actually, DWARF-based profiling isn't very good, which is why so few people do it. It's slow and memory-intensive, so the sampling capability is much poorer.
That's an oversimplification. It's certainly slower and generates larger profiles, but it's also much more powerful: you can see inlined functions and get an association to source code lines in the annotation. Sometimes knowing the function names is enough, but especially in a larger code base, or if you're not familiar enough with the code to match assembly with source lines manually, this can be enormously helpful. For my part, I prefer DWARF profiling when I have the debug info around.
Aaron Puchert wrote:
On 18.11.24 at 13:02, Neal Gompa wrote:
Actually, DWARF based profiling isn't very good, which is why so few people do it. It's slow, it's memory intensive, thus the sampling capability is much poorer.
That's an oversimplification. It's certainly slower and generates larger profiles, but it's also much more powerful. You can see inlined functions and get an association to source code lines in the annotation. Sometimes knowing the function names is enough, but especially in a larger code base or if you're not so familiar with the code that you can match assembly with source lines manually, this can be enormously helpful. For my part, I prefer DWARF profiling when I have the debug info around.
You are mixing up two separate things: stack trace capture (a.k.a. unwinding) and stack trace symbolization.

What you say about DWARF and inline functions concerns stack trace symbolization, which involves translating a collection of raw memory addresses (representing function call frames) into something human-readable. No matter how you got the raw stack trace data (addresses, just numbers), whether with frame pointers or not, you need to symbolize it to make sense of it. At that point, either basic ELF symbol-based symbolization can be done (cheap, pretty fast, but missing inline functions and source code location information) or, if DWARF is available, it's possible to get inline function calls and/or source code information. But again, this is beside the frame-pointers-or-not discussion.

Frame pointers help to get (mostly) reliable stack trace *capture*/unwinding cheaply and efficiently. The alternatives would be ORC (in the kernel only), SFrame (once it's properly supported), or DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling and involves compromises in memory usage, CPU usage, and plain reliability (no one will guarantee that you captured all of the stack, for example). It's too much to describe all the nuances of frame-pointer vs DWARF-based (.eh_frame-based) stack unwinding here; the latter just doesn't work well, and frame pointers are currently the only reliable and efficient way to do this, until something like SFrame gets adopted and supported widely.
On 25.11.24 at 23:59, Andrii Nakryiko wrote:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling and involves compromises in memory usage, CPU usage, and plain reliability (no one will guarantee that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On the one hand you're arguing that DWARF profiling costs CPU for users that want to profile shipped packages, but at the same time you wave aside the CPU cost that this has for all users. Tumbleweed isn't just a distribution for server farms; it's used for desktops as well, even by non-programmers. I work for a software company as well, but we might not be representative of all users. (And at my company, or at least in my team, we don't profile shipped packages but our own binaries.)

Reliability is certainly a problem: "perf" limits the size of the captured portion of the stack, and it affects the overhead. There seems to be some kind of maximum for that limit. But:

* Sometimes the result is still good enough and allows drawing conclusions. With less frequent samples you can drive up the captured stack size significantly.
* Some profilers aren't limited by BPF and can collect the entire stack, no matter how big it is. Their overhead is higher, so you need a lower sampling frequency, but if you're not profiling a lot you might be willing to pay that price. The overhead can also hide issues, but that's a "can".
* If I want to profile a specific application, I can build it myself. You can always do this two-level profiling: first figure out the problematic binaries, then zoom into them. Of course this takes more time, but it's not made impossible by distribution choices.

To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
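For reference, the limit being discussed is perf's per-sample user-stack copy in DWARF mode: 8192 bytes by default, with a documented cap of 65528 bytes (see perf-record(1); the binary name is a placeholder):

```shell
# Trade higher per-sample cost for deeper captured stacks by raising
# the stack dump size, here combined with a lower sampling frequency.
perf record -F 49 --call-graph dwarf,65528 -- ./myapp
```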
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 25.11.24 at 23:59, Andrii Nakryiko wrote:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling and involves compromises in memory usage, CPU usage, and plain reliability (no one will guarantee that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
But how much CPU does it actually cost? You see 1-2% in *benchmarks*, and some Python-specific *microbenchmark* showed up to 10% overhead, due to extremely bloated code that spills/fills stuff onto/from the stack (so yes, reducing the number of registers by one does exacerbate an already existing problem). The latter is in no way a high-performance solution to anything; any Python application that really cares about performance has been offloading CPU-intensive work to C/C++-based libraries for a long time (PyTorches of the world and such).

So all the worries about frame pointer CPU overhead are very overblown, I believe. As I mentioned, with the Fedora change there was a lot of back and forth, and ultimately frame pointers were enabled by default; so far no one is arguing to disable them, because there is no perceptible CPU overhead associated with that.
arguing that DWARF profiling costs CPU for users that want to profile shipped packages, but at the same time you wave aside the CPU cost that this has for all users. Tumbleweed isn't just a distribution for server farms, it's a distribution used for desktops as well, even by non-programmers. I work for a software company as well. But we might not be representative of all users. (And at my company, or at least in my team, we don't profile shipped packages but our own binaries.)
Reliability is certainly a problem: "perf" limits the size of the captured portion of the stack, and it affects the overhead. There seems to be some kind of maximum for that limit. But:
* Sometimes the result is still good enough and allows drawing conclusions. With less frequent samples you can drive up the captured stack size significantly.
* Some profilers aren't limited by BPF and can collect the entire stack, no matter how big it is. Their overhead is higher, so you need a lower sampling frequency, but if you're not profiling a lot you might be willing to pay that price. The overhead can also hide issues, but that's a "can".
* If I want to profile a specific application, I can build it myself. You can always do this two-level profiling: first figure out the problematic binaries, then zoom into them. Of course this takes more time, but it's not made impossible by distribution choices.
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuses, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of what the actual question you are trying to answer is. Even more broad and ad-hoc are bpftrace-based scripts; they are written quickly, often as one-time throw-aways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way. Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
[...]
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuses, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of what the actual question you are trying to answer is.
So how does FP unwinding not run into the same issue as DWARF unwinding? If the actual unwinding happens at sampling time (to save you from capturing the very same stack and register state) then with deep call stacks you spend a long time in the kernel and hit reliability issues when you fault. You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Even more broad and ad-hoc are bpftrace-based scripts, they are written quickly, often as one-time throw-aways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way.
Why so? Because they are not helped by appropriate tooling? I think this is all because of FUD and NIH. Frame pointers are not a requirement for unwinding and they are not in any way "better" than DWARF. Unwinding IP with a frame chain might be cheaper, but I doubt it would matter much for your ad-hoc bpftrace scripts.
Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.
backtrace() is effortless stack tracing, and it uses DWARF. There's libbacktrace as well (used by GCC go which uses DWARF and not frame pointers). Richard. -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Wed, Nov 27, 2024 at 6:42 AM Richard Biener <rguenther@suse.de> wrote:
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
[...]
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuses, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of what the actual question you are trying to answer is.
So how does FP unwinding not run into the same issue as DWARF unwinding? If the actual unwinding happens at sampling time (to save you from capturing the very same stack and register state) then with deep call stacks you spend a long time in the kernel and hit reliability issues when you fault.
Ok, so with FPs, stack unwinding just reads memory that belongs to the current thread. That memory in 99.9% of cases will be physically present in memory (not paged out). So the process is pretty reliable (and fast), if all functions maintain frame pointers.

With DWARF-based unwinding there are three classes of solutions.

1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's pretty big and is normally not used by the application, chances are high that the contents of this section won't be physically mapped into memory. So to access it one would need to cause page fault and wait for the kernel to fulfill it. It's slower, but the real issue is that stack traces with BPF and perf are captured in non-sleepable context (NMI due to perf event, or tracepoints, or kprobes, none of which allow page faults). So you'll, at the minimum, run into the problem with .eh_frame effectively not being available when necessary.

2) Pre-processing of .eh_frame. Your tool would need to "discover" a set of processes of interest, read .eh_frame and pre-process it into something compact and easy to look up, and put it into memory (e.g. for a BPF program), then when the time comes, look up those tables to do unwinding. Main issues: lots of memory to preallocate, even if it might never be needed; lots of upfront CPU spent, even if it might never be needed; and also you can only unwind processes you knew about at pre-processing time. If some process started afterwards, you won't have the data, so unwinding will still fail.

3) Stack contents capture. That's what perf supports. Capture some relatively large portion of the current thread's stack, hoping to capture enough. Post-process afterwards in user space, doing the unwinding using the captured snapshot of the stack and .eh_frame. Main issues: whatever amount of stack you captured might not be enough, and that depends entirely on the specific functions that were active during the snapshot. If some function has lots of local variables stored on the stack, it might use up the entire captured stack, and so you won't even find other frames. And it's impossible to predict and mitigate. And, of course, it's expensive to copy so much memory between kernel and user space, and .eh_frame-based processing is still relatively slow.

There is just no good and reliable solution for DWARF-based stack unwinding.
You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Not really. Kernel itself doesn't support it, and if you were to implement it in a BPF program yourself you'd run into many practical issues. The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes). See above.
Even more broad and ad-hoc are bpftrace-based scripts, they are written quickly, often as one-time throw-aways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way.
Why so? Because they are not helped by appropriate tooling? I think this is all because of FUD and NIH.
See above. There are fundamental limitations, not just lack of some common implementation to share. But yes, having to add extra dependency on something so finicky (and still unreliable, conceptually) doesn't help the case.
Frame pointers are not a requirement for unwinding and they are not in any way "better" than DWARF. Unwinding IP with a frame chain might be cheaper, but I doubt it would matter much for your ad-hoc bpftrace scripts.
They are. And yes, it will matter A LOT for those bpftrace scripts. See above.
Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.
backtrace() is effortless stack tracing, and it uses DWARF. There's libbacktrace as well (used by GCC go which uses DWARF and not frame pointers).
backtrace() is used from *inside* the application to get its own stack trace. Here we are talking about capturing stack traces from outside the application, without the application realizing it or being specially built for that.
Richard.
-- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Am 27.11.24 um 20:15 schrieb Andrii Nakryiko:
1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's pretty big and is normally not used by the application, chances are high that the contents of this section won't be physically mapped into memory. So to access it one would need to cause page fault and wait for the kernel to fulfill it. It's slower, but the real issue is that stack traces with BPF and perf are captured in non-sleepable context (NMI due to perf event, or tracepoints, or kprobes, none of which allow page faults). So you'll, at the minimum, run into the problem with .eh_frame effectively not being available when necessary.
So we would just need to find a way to fault all .eh_frame pages of affected processes into memory before we start sampling? My very limited understanding is that the perf_event_open API (or is it just perf?) reports the start of processes and mapping of DSOs (PERF_RECORD_MMAP?), so couldn't one use that opportunity to fault the range in? Or am I overlooking something?
On Wed, Nov 27, 2024 at 5:20 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 27.11.24 um 20:15 schrieb Andrii Nakryiko:
1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's pretty big and is normally not used by the application, chances are high that the contents of this section won't be physically mapped into memory. So to access it one would need to cause page fault and wait for the kernel to fulfill it. It's slower, but the real issue is that stack traces with BPF and perf are captured in non-sleepable context (NMI due to perf event, or tracepoints, or kprobes, none of which allow page faults). So you'll, at the minimum, run into the problem with .eh_frame effectively not being available when necessary.
So we would just need to find a way to fault all .eh_frame pages of affected processes into memory before we start sampling? My very limited understanding is that the perf_event_open API (or is it just perf?) reports the start of processes and mapping of DSOs (PERF_RECORD_MMAP?), so couldn't one use that opportunity to fault the range in? Or am I overlooking something?
You are overlooking *a lot* here. There is nothing (even conceptually) simple here, unfortunately. And you are not the first one who'd like to solve these problems. There are tons of technical and conceptual problems here, none of which are easily solvable. Not because no one tried, but because there are fundamental and technical (and sometimes even political) issues.
On Wed, 27 Nov 2024, Andrii Nakryiko wrote:
On Wed, Nov 27, 2024 at 6:42 AM Richard Biener <rguenther@suse.de> wrote:
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).
The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're
[...]
To summarize: DWARF profiling doesn't work perfectly, but it might work well enough in a large enough number of cases, and in the remaining cases it's not made impossible. You just have to invest a bit more time.
There is an entire world of tools that use BPF and capture stack traces not for profiling, but to pinpoint which user code path causes some "issue", where the issue itself can be a multitude of things (resource utilization, suspicious calling patterns, lock misuses, whatever you can come up with). Stack traces answer questions like "where in the code" and "how did we get to that place", regardless of what the actual question you are trying to answer is.
So how does FP unwinding not run into the same issue as DWARF unwinding? If the actual unwinding happens at sampling time (to save you from capturing the very same stack and register state) then with deep call stacks you spend a long time in the kernel and hit reliability issues when you fault.
Ok, so with FPs, stack unwinding just reads memory that belongs to the current thread. That memory in 99.9% of cases will be physically present in memory (not paged out). So the process is pretty reliable (and fast), if all functions maintain frame pointers.
With DWARF-based unwinding there are three classes of solutions.
1) On-the-fly unwinding based on .eh_frame DWARF data. Given it's pretty big and is normally not used by the application, chances are high that the contents of this section won't be physically mapped into memory. So to access it one would need to cause page fault and wait for the kernel to fulfill it. It's slower, but the real issue is that stack traces with BPF and perf are captured in non-sleepable context (NMI due to perf event, or tracepoints, or kprobes, none of which allow page faults). So you'll, at the minimum, run into the problem with .eh_frame effectively not being available when necessary.
I guess given RT is now first-class in the kernel there must be an existing code path to defer this work to a kthread and put the task to sleep instead? All the discussed "better alternative" unwind-info formats suggest that the info be read in on demand, which of course wouldn't be possible from NMI context either.
2) Pre-processing of .eh_frame. Your tool would need to "discover" a set of processes of interest, read .eh_frame and pre-process it into something compact and easy to lookup, and put it into memory (e.g. for BPF program), then when the time comes, look up those tables to do unwinding. Main issues: lots of memory to preallocate, even if it might never be needed, lots of upfront CPU spend, even if it might never be needed, and also you can only unwind processes you knew about at the pre-processing time. If some process started afterwards, you won't have the data, so unwinding will still fail.
I don't see why you need to process .eh_frame into something different.
3) Stack contents capture. That's what perf supports. Capture some relatively large portion of the current thread's stack, hoping to capture enough. Post-process afterwards in user space, doing unwinding using captured snapshot of the stack and .eh_frame. Main issues: whatever amount of stack you captured, might not be enough. And that depends entirely on specific functions that were active during the snapshot. If some function has lots of local variables stored on stack, it might use up the entire captured stack, and so you won't even find other frames. And it's impossible to predict and mitigate. And, of course, it's expensive to copy so much memory between kernel and user space, and .eh_frame-based processing is still relatively slow.
There is just no good and reliable solution for DWARF-based stack unwinding.
The issue is FP based unwinding cannot _ever_ be reliable as you can't know whether you have a valid FP and all FP based unwinders have to apply heuristics here. You'd have to consult .eh_frame to see whether you have a frame pointer. Yeah, so FP based unwinding is heuristically "fast" - when it works. But you don't get to know whether it does or not ;)
You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Not really. Kernel itself doesn't support it, and if you were to implement it in a BPF program yourself you'd run into many practical issues. The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes). See above.
I'm asking for the kernel to gain support. You are asking for all programs and libraries to get frame pointers.
Even more broad and ad-hoc are bpftrace-based scripts, they are written quickly, often as one-time throw-aways, to pinpoint or debug specific issues. They very often capture and use stack traces to pinpoint the source of the problem in the application code. While comprehensive tools like perf might afford to try to utilize DWARF-based stack unwinding (however unreliable and expensive), all of the more lightweight tooling just can't afford to even attempt to tackle this problem in any reasonable way.
Why so? Because they are not helped by appropriate tooling? I think this is all because of FUD and NIH.
See above. There are fundamental limitations, not just lack of some common implementation to share. But yes, having to add extra dependency on something so finicky (and still unreliable, conceptually) doesn't help the case.
FP unwinding is unreliable by definition. DWARF unwinding is reliable and precise by definition.
Frame pointers are not a requirement for unwinding and they are not in any way "better" than DWARF. Unwinding IP with a frame chain might be cheaper, but I doubt it would matter much for your ad-hoc bpftrace scripts.
They are. And yes, it will matter A LOT for those bpftrace scripts. See above.
I'd say it's a missing BPF feature that it doesn't abstract taking backtraces (for example behind a kernel API).
Stack traces have uses way beyond profiling. Making them effortless is extremely useful to the distro ecosystem itself, IMO.
backtrace() is effortless stack tracing, and it uses DWARF. There's libbacktrace as well (used by GCC go which uses DWARF and not frame pointers).
backtrace() is used from *inside* the application to get its own stack trace. Here we are talking about capturing stack traces from outside the application, without the application realizing it or being specially built for that.
So? You were generalizing to the distro ecosystem. Richard. -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Hello, On Thu, Nov 28, 2024 at 09:21:22AM +0100, Richard Biener wrote:
On Wed, 27 Nov 2024, Andrii Nakryiko wrote:
On Wed, Nov 27, 2024 at 6:42 AM Richard Biener <rguenther@suse.de> wrote:
On Tue, 26 Nov 2024, Andrii Nakryiko wrote:
3) Stack contents capture. That's what perf supports. Capture some relatively large portion of the current thread's stack, hoping to capture enough. Post-process afterwards in user space, doing unwinding using captured snapshot of the stack and .eh_frame. Main issues: whatever amount of stack you captured, might not be enough. And that depends entirely on specific functions that were active during the snapshot. If some function has lots of local variables stored on stack, it might use up the entire captured stack, and so you won't even find other frames. And it's impossible to predict and mitigate. And, of course, it's expensive to copy so much memory between kernel and user space, and .eh_frame-based processing is still relatively slow.
There is just no good and reliable solution for DWARF-based stack unwinding.
The issue is FP based unwinding cannot _ever_ be reliable as you can't know whether you have a valid FP and all FP based unwinders have to apply heuristics here. You'd have to consult .eh_frame to see whether you have a frame pointer.
Yeah, so FP based unwinding is heuristically "fast" - when it works. But you don't get to know whether it does or not ;)
It sounds like this is a case of the perfect getting in the way of the good. Profiling is all statistics, and never 100% reliable. What level of unreliability is there with frame pointers? It has been pointed out that with the current state of the tools and overall ecosystem, frame pointers help a lot of people a lot of the time, for a number of different use cases that eventually result in improvements for all users. Sounds like a useful feature, even if not perfect. Thanks Michal
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Not really. Kernel itself doesn't support it, and if you were to implement it in a BPF program yourself you'd run into many practical issues. The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes). See above.
We won't enable FPs in kernels in any case. Assembly is annotated only by ORC, and many asm functions do not save/restore the FP, so it'd be kind of useless anyway. regards, -- js suse labs
On Thu, Nov 28, 2024 at 1:33 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
You can do DWARF unwinding in the kernel as well (as well as stack capturing for FP based unwinding).
Not really. Kernel itself doesn't support it, and if you were to implement it in a BPF program yourself you'd run into many practical issues. The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes). See above.
We won't enable FPs in kernels in any case. Assembly is annotated only by ORC, and many asm functions do not save/restore the FP, so it'd be kind of useless anyway.
Are you sure we are discussing the same topic here? This proposal is about frame pointers and capturing stack traces **for user space**. But you keep bringing up kernel stack traces. Two different beasts, with vastly different limitations (for one, the kernel memory in which those ORC tables are stored is guaranteed to be mapped in physical memory, so available at all times, including in NMI context; there is no such guarantee for .eh_frame for user space, nor for SFrame tables, btw). ORC does indeed work in the kernel, and we do use ORC instead of frame pointers **for kernel**. There is no ORC for user space, though. SFrame is sort of that thing, but as I said, there is a ton of technical work to make it happen, and we are moving in that direction but are not there yet. Meanwhile frame pointers are the only reasonable alternative.
regards, -- js suse labs
On 03. 12. 24, 1:51, Andrii Nakryiko wrote:
On Thu, Nov 28, 2024 at 1:33 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
You can do DWARF unwinding in the kernel as well (as well as
Here "kernel"
stack capturing for FP based unwinding).
Not really. Kernel itself doesn't support it, and if you were to
Here "kernel".
implement it in a BPF program yourself you'd run into many practical issues. The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes). See above.
We won't enable FPs in kernels in any case. Assembler is annotated only
I follow "kernel".
by ORC and many asm functions do not save/restore FP, so it'd be kind of useless anyway.
Are you sure we are discussing the same topic here? This proposal is about frame pointers and capturing stack traces **for user space**. But you keep bringing up kernel stack traces.
I didn't start this "kernel" subthread. But OK, good to know/be assured this is only about userspace. thanks, -- js suse labs
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable. Before ORC became so. Note we use ORC for livepatching and ORC was developed to be reliable for those purposes. thanks, -- js suse labs
On Thu, Nov 28, 2024 at 1:37 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable. Before ORC became so. Note we use ORC for livepatching and ORC was developed to be reliable for those purposes.
Again, ORC is for kernels. We are discussing user space. There is no ORC for user space. ORC gets the benefit of always being in physical memory, and so can be relied upon in NMI context. Even if we had ORC for user space (SFrame is meant to be that), there is still a complication of user space application's ORC/SFrame data not being physically mapped into memory. Which means there is an extra delayed memory paging necessity, which means asynchronous user stack trace capture, and a bunch of associated kernel infrastructure around that. Which is what is being developed as part of SFrame integration. Please stop misleading others with references to ORC in the context of this frame pointer discussion. No one is proposing to enable frame pointers *for kernel* (we have ORC there and it does work, no need to improve anything). But we need frame pointers *for user space*, because we don't yet have ORC equivalent and necessary asynchronous stack trace capturing machinery in the kernel.
thanks, -- js suse labs
On 03. 12. 24, 1:55, Andrii Nakryiko wrote:
On Thu, Nov 28, 2024 at 1:37 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable. Before ORC became so. Note we use ORC for livepatching and ORC was developed to be reliable for those purposes.
Again, ORC is for kernels. We are discussing user space. There is no ORC for user space.
Yes, you wrote: There is just no good and reliable solution for DWARF-based stack unwinding. I wrote: We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable.
Please stop misleading others
How does that mislead others? thanks, -- js suse labs
On Tue, 3 Dec 2024, Jiri Slaby wrote:
On 03. 12. 24, 1:55, Andrii Nakryiko wrote:
On Thu, Nov 28, 2024 at 1:37 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable. Before ORC became so. Note we use ORC for livepatching and ORC was developed to be reliable for those purposes.
Again, ORC is for kernels. We are discussing user space. There is no ORC for user space.
Yes, you wrote: There is just no good and reliable solution for DWARF-based stack unwinding.
I wrote: We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable.
Probably with always mapped .eh_frame
Please stop misleading others
How does that mislead others?
It doesn't, I think. But the argument is "A is not implemented, let's do B instead". Yes, the paging issue is there with .eh_frame (but also with the stack, just more unlikely). And yes, you don't want to do even more work in NMI context. But I'd argue you want to do minimal work there only anyway, not even FP based unwinding - after a process switch you'd likely get TLB misses there. Not even considering RT ... So the solution is IMO to suspend the thread associated with the event from NMI and schedule a kernel worker thread (or whatever you have at your disposal these days for this), and exit from NMI context as fast as possible. Alternatively make sure the page we'd fault gets faulted in later (asynchronously) and give up on the unwind at that point (this is likely what we do when an FP based unwind hits this?). You should get your .eh_frame faulted in after a few events then. You could also have a user-space helper to pre-fault .eh_frame of running processes (probably reasonably "easy" as root via scanning /proc and using ptrace). Richard.
thanks,
-- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Hello, On Tue, 3 Dec 2024, Richard Biener wrote:
It doesn't, I think. But the argument is "A is not implemented, let's do B instead". Yes, the paging issue is there with .eh_frame (but also with the stack, just more unlikely). And yes, you don't want to do even more work in NMI context. But I'd argue you want to do minimal work there only anyway, not even FP based unwinding - after a process switch you'd likely get TLB misses there. Not even considering RT ... So the solution is IMO to suspend the thread associated with the event from NMI and schedule a kernel worker thread (or whatever you have at your disposal these days for this), and exit from NMI context as fast as possible. Alternatively make sure the page we'd fault gets faulted in later (asynchronously) and give up on the unwind at that point (this is likely what we do when an FP based unwind hits this?). You should get your .eh_frame faulted in after a few events then.
Exactly. And certainly eh_frame (or sframe) based stack traces will have a lower overhead than copying dozens of pages of stack at each event to/from the trace buffers. Ciao, Michael.
On Mon, Dec 2, 2024 at 9:19 PM Jiri Slaby <jslaby@suse.cz> wrote:
On 03. 12. 24, 1:55, Andrii Nakryiko wrote:
On Thu, Nov 28, 2024 at 1:37 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable. Before ORC became so. Note we use ORC for livepatching and ORC was developed to be reliable for those purposes.
Again, ORC is for kernels. We are discussing user space. There is no ORC for user space.
Yes, you wrote: There is just no good and reliable solution for DWARF-based stack unwinding.
"for user space", which is the whole context of this proposal. We are not discussing stack unwinding for kernel stack traces. That's what is misleading, you keep mixing in stack unwinding of *kernel stack traces*.
I wrote: We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable.
If it's good and reliable, why is it not in the kernel anymore? But also I explained one pretty big source of unreliability:
The most damning would be inability to page in memory that contains .eh_frame section contents from non-sleepable context, in which stack trace is typically captured (NMI, tracepoints, kprobes).
I wasn't even talking about complexity and reliability of *interpreting* .eh_frame contents by kernel code (which from what I gather was a source of bugs and that's why Linus wanted it ripped out from the kernel; but I haven't researched this topic, so I'm sure I'm missing lots of details).
Please stop misleading others
How does that mislead others?
See above. Forget about kernel stack unwinding for this proposal (i.e., capturing stack trace of the kernel code), it's not about that. It's about stack unwinding of *user space code*.

And here we are discussing stack unwinding performed *by kernel* code for *user space code*, on request of tools like perf and BPF (this request is triggered from kernel context in response to perf events, i.e., NMI, or based on a multitude of possible in-kernel events triggering BPF programs). The logic of stack unwinding in such cases lives in and is executed by kernel code, but it's capturing stack traces of user space applications. This is a very important distinction.

Capturing *kernel stack traces* is basically a solved problem with ORC. Capturing user stack traces from user space, either from inside the user space application itself or through a ptrace-based debugger/profiler, is also a separate and, arguably, easier problem because of being able to page in .eh_frame data from disk, wait for that, etc (i.e., it's possible to sleep and wait for arbitrary amounts of time, if necessary).

But capturing user stack traces in NMI (think perf events, like CPU cycles sampling) is a harder problem. We are working towards solving that by postponing stack trace capture until exiting from kernel back to user space, but we (kernel) are not there yet.
thanks, -- js suse labs
Am 27.11.24 um 04:42 schrieb Andrii Nakryiko:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).

The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're

But how much CPU does it actually cost? You can see 1-2% in *benchmarks*,
Sure, that was my understanding. But 1% across everything is not a small regression. I know you're arguing that lots of tools don't even support DWARF unwinding, but let's just focus on those that do: even if 2% of users do profiling, and the overhead is 20%, we have an average CPU usage cost of 0.4% for DWARF, which is still below 1%. (That's assuming those users are profiling all the time.) So total CPU usage across all users is going to go up if we enable FP.

And it's not given at all that we get benefits of the same order. In individual programs we can certainly get much more, but across all workloads I doubt it. Most CPU-intensive applications are already subject to heavy profiling.

From the Fedora Proposal:
Compiling the kernel with GCC is 2.4% slower with frame pointers

That's quite a bit for a compiler. The LLVM compile-time tracker [1] shows "instructions" instead of "cycles" per default because the latter is too noisy for the regressions that we're interested in, which is basically everything >0.5%. Given that I've spent time on trying to get some 0.5% improvements, I'm not sure that I'm willing to accept a 2.4% regression. But if we enable frame pointers globally and I disable them in LLVM, I make other people mad.
there is no perceptible CPU overhead associated with that.
Not perceptible yes, because it's rather equitably distributed. But in terms of total cost it's much more significant than most perceptible regressions. I've said this in the beginning: I'm not against frame pointers, but I don't like how quickly we're willing to wave away the cost and acceptability of out-of-band alternatives. [1] <http://llvm-compile-time-tracker.com/>
On Wed, Nov 27, 2024 at 5:08 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 27.11.24 um 04:42 schrieb Andrii Nakryiko:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).

The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're

But how much CPU does it actually cost? You can see 1-2% in *benchmarks*,
Sure, that was my understanding. But 1% across everything is not a small
My whole point was that *benchmark* results are just part of the story, and we can't just take 1% and use that as "this is how much slower everything will be". And yet, you go ahead and do exactly that, doing some hypothetical math about collective slowdowns, with average usage costs, etc. I'm sorry, I see this as a completely useless hypothetical exercise.

On the other hand, you completely discount and doubt any improvements enabled by more readily available profiling and observability tooling, even though it was *already* shown both in server-side land (e.g., in Meta fleet) and in the Fedora distro ecosystem.
regression. I know you're arguing that lots of tools don't even support DWARF unwinding, but let's just focus on those that do: even if 2% of users do profiling, and the overhead is 20%, we have an average CPU usage cost of 0.4% for DWARF, which is still below 1%. (That's assuming those users are profiling all the time.) So total CPU usage across all users is going to go up if we enable FP.
And it's not given at all that we get benefits of the same order. In individual programs we can certainly get much more, but across all workloads I doubt it. Most CPU-intensive applications are already subject to heavy profiling.
From the Fedora Proposal:
Compiling the kernel with GCC is 2.4% slower with frame pointers

That's quite a bit for a compiler. The LLVM compile-time tracker [1] shows "instructions" instead of "cycles" per default because the latter is too noisy for the regressions that we're interested in, which is basically everything >0.5%. Given that I've spent time on trying to get some 0.5% improvements, I'm not sure that I'm willing to accept a 2.4% regression. But if we enable frame pointers globally and I disable them in LLVM, I make other people mad.
there is no perceptible CPU overhead associated with that.
Not perceptible yes, because it's rather equitably distributed. But in terms of total cost it's much more significant than most perceptible regressions.
In server land, calculating and comparing these total costs seems a bit more reasonable (even though world and life are too complicated to just do such simple math). And there it was shown many times over that frame pointers enabled way more cost savings than any initial costs.

For non-server use cases, IMO, what matters is whether the user experiences noticeable degradation in the short term (with an eye towards medium-to-long term benefits to the ecosystem in general). And if kernel compilation takes a second longer, I don't see it as a big deal. But I'll leave it to the openSUSE community to argue these numbers. I tried to clarify the upsides and downsides of both FP- and .eh_frame-based unwinding, which I believe I did. Convincing anyone about what is an acceptable cost of enabling frame pointers is not something I want to do.
I've said this in the beginning: I'm not against frame pointers, but I don't like how quickly we're willing to wave away the cost and acceptability of out-of-band alternatives.
There is nothing "quickly" here. The Fedora conversation was long and drawn out. Pro-FP folks prevailed, ultimately. We don't seem to be regretting this. openSUSE is just being asked not to fall behind, and is certainly not a pioneer here (no offense and disrespect intended).
Am 28.11.24 um 06:53 schrieb Andrii Nakryiko:
My whole point was that *benchmark* results are just part of the story, and we can't just take 1% and use that as "this is how much slower everything will be". And yet, you go ahead and do exactly that, doing some hypothetical math about collective slowdowns, with average usage costs, etc. I'm sorry, I see this as a completely useless hypothetical exercise.

It's not hypothetical at all. Frame pointers affect every function in every binary. There is no reason why it should affect something like GCC more or less than maybe harfbuzz or some GTK library. I'm not even taking outliers like that Python benchmark into account. There is going to be some deviation, but 1-2% overall definitely sounds plausible. (Given the average function length and the additional instructions needed.)

On the other hand, you completely discount and doubt any improvements enabled by more readily available profiling and observability tooling, even though it was *already* shown both in server-side land (e.g., in Meta fleet) and in the Fedora distro ecosystem.
I don't discount these improvements, but (focusing on the distro ecosystem):

* profiling and performance optimizations for a large part happen regardless of whether we build with frame pointers and
* a 1-2% improvement across everything would be massive. The only way I see that happen is by compiler improvements, and I'm not sure they're going to happen this way.

The second point is much easier in the Meta fleet. I don't doubt at all that you've gained even more than that.
On Thu, Nov 28, 2024 at 4:48 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 28.11.24 um 06:53 schrieb Andrii Nakryiko:
My whole point was that *benchmark* results are just part of the story, and we can't just take 1% and use that as "this is how much slower everything will be". And yet, you go ahead and do exactly that, doing some hypothetical math about collective slowdowns, with average usage costs, etc. I'm sorry, I see this as a completely useless hypothetical exercise.

It's not hypothetical at all. Frame pointers affect every function in every binary. There is no reason why it should affect something like GCC more or less than maybe harfbuzz or some GTK library. I'm not even taking outliers like that Python benchmark into account. There is going to be some deviation, but 1-2% overall definitely sounds plausible. (Given the average function length and the additional instructions needed.)

On the other hand, you completely discount and doubt any improvements enabled by more readily available profiling and observability tooling, even though it was *already* shown both in server-side land (e.g., in Meta fleet) and in the Fedora distro ecosystem.
I don't discount these improvements, but (focusing on the distro ecosystem):
* profiling and performance optimizations for a large part happen regardless of whether we build with frame pointers and
This is a subjective statement not backed by any data. I have a completely opposite experience (and expectation as well).
* a 1-2% improvement across everything would be massive. The only way I see that happen is by compiler improvements, and I'm not sure they're going to happen this way.
You are overpivoting on this 1-2%. This is based on a specific *benchmark*, not some end-to-end real life performance. We can't just blindly take those numbers and generalize them. Not everything will be slowed down; lots of code is a) not performance critical, b) might not even benefit from that extra register, c) is not called frequently enough for those few instructions to set up the %rbp register to matter or be measurable (and think about it, if those 2-3 instructions matter, then perhaps your function doesn't even do all that much useful work and should/would be inlined by the compiler).

And about your general point that a 1-2% improvement across everything could only come from the compiler. This is an unnecessary and unrealistic expectation. Most of the code isn't performance critical. On the other hand, that small portion of code that is performance critical could get way more than a 1-2% speed up if someone notices inefficiency there. And it's much easier to spot inefficiency when you have easily attainable (without going through tons of trouble) system-wide profiling.

Anyways, I think we are in the territory of subjective opinions, and there just isn't objective hard data to prove anything conclusively and without any doubt. There has to be some value judgement by whoever is going to make this decision about frame pointers. The good thing is that Fedora showed that this change is a) useful and b) not really detrimental to overall user experience performance-wise.

Also, in this part of the discussion we are still focusing purely on *profiling* use cases, which is just one aspect that would benefit from frame pointers. All the observability tooling, ad-hoc debugging, etc., based on bpftrace or pure BPF would tremendously benefit both users and developers, if they just work out of the box. This can't be discounted. Actually, I'd suggest anyone interested to go and explore uprobes tracing and USDTs. They feel like magic, and combined with stack traces they are extremely powerful.
The second point is much easier in the Meta fleet. I don't doubt at all that you've gained even more than that.
On Thu, Nov 28, 2024 at 02:08:25AM +0100, Aaron Puchert wrote:
Am 27.11.24 um 04:42 schrieb Andrii Nakryiko:
On Tue, Nov 26, 2024 at 6:43 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 25.11.24 um 23:59 schrieb Andrii Nakryiko:
[...] DWARF's .eh_frame-based stack unwinding tables. The latter actually doesn't lend itself well to system-wide profiling, and involves compromises in memory usage, CPU usage, and just plain reliability (no one will guarantee you that you captured all of the stack, for example).

The first two arguments are problematic: remember that we're talking about frame pointers, which cost CPU for all users. On one hand you're

But how much CPU does it actually cost? You can see 1-2% in *benchmarks*,
Sure, that was my understanding. But 1% across everything is not a small regression. I know you're arguing that lots of tools don't even support DWARF unwinding, but let's just focus on those that do: even if 2% of users do profiling, and the overhead is 20%, we have an average CPU usage cost of 0.4% for DWARF, which is still below 1%. (That's assuming those users are profiling all the time.) So total CPU usage across all users is going to go up if we enable FP.
That's not how it works. Most code is used rarely, and performance of such code is largely irrelevant, even for slowdowns much bigger than 1%. On the other hand, there are code paths which are performance critical and in need of optimization. Having frame pointers results in having more, and easier to use, tools for identifying the code in which performance matters. That is the first step towards optimizing it, and reducing the total CPU time needs of the system, for all users.

It does not have to be frame pointers in general. They just happen to be a feature that helps make performance measurement more usable today.

Thanks
Michal
Am 18.11.24 um 09:30 schrieb Richard Biener:

In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.

That's probably too much, given that debug info can be much larger than the actual binary. But I suggested elsewhere in the thread that I'd like to keep .symtab, and maybe even some light debug info? In this case it seems we'd only need .debug_frame. (But I don't know how that compares to binary sizes and the remainder of debug info on average.)

The broader argument would be that it's valuable to know where you are (and it also helps for crash stacks), while we don't pay for full debug info which is dominated by information about types, variables, and so on, which is only needed in rare cases.

We could also add something like DW_TAG_inlined_subroutine to get more fine-grained call stacks. But this is already more than frame pointers would provide.

There is also the option to compress debug info, but I don't know how well this is supported by the tools in question, especially "perf".
On Tue, Nov 19, 2024 at 7:37 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 18.11.24 um 09:30 schrieb Richard Biener:

In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.

That's probably too much, given that debug info can be much larger than the actual binary. But I suggested elsewhere in the thread that I'd like to keep .symtab, and maybe even some light debug info? In this case it seems we'd only need .debug_frame. (But I don't know how that compares to binary sizes and the remainder of debug info on average.)
The broader argument would be that it's valuable to know where you are (and it also helps for crash stacks), while we don't pay for full debug info which is dominated by information about types, variables, and so on, which is only needed in rare cases.
We could also add something like DW_TAG_inlined_subroutine to get more fine-grained call stacks. But this is already more than frame pointers would provide.
There is also the option to compress debug info, but I don't know how well this is supported by the tools in question, especially "perf".
Are we not using minidebuginfo and dwz still? Both of those have been in use in Fedora for over a decade[1][2]. These have been in place in the Red Hat world for a while, and still frame pointers were needed to provide a better experience. [1]: https://fedoraproject.org/wiki/Features/MiniDebugInfo [2]: https://fedoraproject.org/wiki/Features/DwarfCompressor -- 真実はいつも一つ!/ Always, there's only one truth!
Am 20.11.24 um 01:40 schrieb Neal Gompa:
On Tue, Nov 19, 2024 at 7:37 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 18.11.24 um 09:30 schrieb Richard Biener:

In any case - I propose shipping all packages with debug info included since that greatly improves the profiling experience - even more so than by enabling frame-pointers. Bandwidth and disk is cheap these days.

That's probably too much, given that debug info can be much larger than the actual binary. But I suggested elsewhere in the thread that I'd like to keep .symtab, and maybe even some light debug info? In this case it seems we'd only need .debug_frame. (But I don't know how that compares to binary sizes and the remainder of debug info on average.)
The broader argument would be that it's valuable to know where you are (and it also helps for crash stacks), while we don't pay for full debug info which is dominated by information about types, variables, and so on, which is only needed in rare cases.
We could also add something like DW_TAG_inlined_subroutine to get more fine-grained call stacks. But this is already more than frame pointers would provide.
There is also the option to compress debug info, but I don't know how well this is supported by the tools in question, especially "perf". Are we not using minidebuginfo and dwz still?
I don't think we ship with any debug info (not even symbols), but I'm happy to be proven wrong. We use dwz, but my understanding is that it doesn't do much for .debug_frame. The main benefit is that it deduplicates things like type information that is emitted into every TU. The frame information should only have duplicates for non-inlined inline functions, which are probably not so common with optimizations on. The compression that I'm talking about is general purpose compression such as gzip or zstd, which is a relatively recent feature.
These have been in place in the Red Hat world for a while, and still frame pointers were needed to provide a better experience.
To quote that Wiki page: “Debug info for backtraces relies on two types of information, the function names in the symbol tables, and (optionally) the linenumber debug information.” There is no mention of .debug_frame, only .symtab and .debug_line. (I agree about .symtab, but I'm not sure about .debug_line.) Aaron
On Nov 20 2024, Aaron Puchert wrote:
That's probably too much, given that debug info can be much larger than the actual binary. But I suggested elsewhere in the thread that I'd like to keep .symtab, and maybe even some light debug info? In this case it seems we'd only need .debug_frame. (But I don't know how that compares to binary sizes and the remainder of debug info on average.)
.debug_frame is the same as .eh_frame which is always available. -- Andreas Schwab, SUSE Labs, schwab@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different."
Am 20.11.24 um 09:10 schrieb Andreas Schwab:
On Nov 20 2024, Aaron Puchert wrote:
That's probably too much, given that debug info can be much larger than the actual binary. But I suggested elsewhere in the thread that I'd like to keep .symtab, and maybe even some light debug info? In this case it seems we'd only need .debug_frame. (But I don't know how that compares to binary sizes and the remainder of debug info on average.)

.debug_frame is the same as .eh_frame which is always available.
I thought they were slightly different, but I was also wondering why I didn't find .debug_frame anywhere. Indeed, "perf record --call-graph dwarf" seems to work on packaged binaries. I was always wondering why stacks are cut off, but that is likely due to the fixed size being recorded. Increasing it gives me longer stacks. Great, so we don't actually need any additional debug info here?

However, missing .symtab is still an issue. Several stack frames show up as hex addresses. In fact Clang, where I tried this, is likely a lucky case because it's compiled with default visibility for non-inline functions. So we still get most functions via .dynsym. Most binaries will have a lot fewer symbol names in a profile.

Aaron
Am 16.11.24 um 19:24 schrieb Neal Gompa:
I would like to change openSUSE's default build flags in rpm and rpm-config-SUSE to incorporate frame pointers (and equivalents on other architectures) to support real-time profiling and observability on SUSE distributions.
Profiling doesn't necessarily need call stacks. Often you can learn a great deal from just seeing the hot functions. I'd say that I work with plain profiles ("perf record" without -g) most of the time and add call stacks only when I'm unsure why a function is so frequently called. Stacks can sometimes even obscure hotspots, depending on how you're reading them: if a hot function distributes its cost among many different call stacks, it will not stick out in a top-down flame graph. (You'll have to do a bottom-up.)

There is another obstacle, one that applies even to profiling without call stacks: there are no symbols. We strip .symtab along with debug info. Without that, the stacks that you get are meaningless. However, if you meant to include that: I would like symbols to be kept (basically replace --strip by --strip-debug), and I think it would be less controversial: no runtime impact and only slightly larger binaries.

With my compiler hat on, frame pointers often don't make sense. If there are no VLAs and allocas, the compiler knows how large a stack frame is and can simply add (for stacks that grow down) that fixed size in the epilogue. The frame pointer on the stack contains no information that the compiler doesn't already have. Furthermore, the frame pointer could be overwritten by stack buffer overflows. That's not a very strong argument, because an attacker would much rather overwrite the return address, and we have stack protectors, but "don't compute at runtime what you can compute at compile-time".
* The performance hit for having it vs not is insignificant[6].
This is a tricky argument. There are lots of things one could do that individually have little impact (maybe 1–10%), but those little things add up. One guy wants to add frame pointers for profiling, another wants stack protectors, the next guy wants automatic initialization of local variables (likely C++26) or mandatory boundary checks. They all claim that it adds just a little (on average).

But what if the benefit is also small? If we take the average cost (which is typically small for most things you can add) we should also take the average benefit. That might not be much larger since lots of people, even Linux users, will never run "perf record". Even among developers, profiling might be restricted to self-built binaries.

This is not my own situation: I do profiling of packaged binaries quite regularly. But we should get ourselves out of the way and think about the larger user base. The people that will profile packaged binaries are likely the packagers themselves, so in this bubble we're a bit biased.
I want openSUSE to be a great place for people to develop and optimize workloads on, especially desktop ones, where most of the tooling we have for tracing and profiling is broken without frame pointers (see Sysprof and Hotspot from GNOME and KDE respectively, which both rely on frame pointers to have cheap real-time tracing for performance analysis).
I don't know about any of those, but I'd assume they're just GUIs around "perf"? If you have an Intel CPU since Haswell (Zen 4 or so should also have LBR, but I haven't tried it yet), "perf record --call-graph lbr" works even without frame pointers and is, in my experience, pretty reliable.

Just to make clear: I don't want this to be seen as an argument against frame pointers, but I think the case isn't terribly clear. It's a feature that mostly benefits package maintainers and other people that want to tweak the distro, when they're trying to investigate performance issues, while the costs, however small they may be, are paid by everybody all the time.

Aaron
On Wednesday 2024-11-20 01:13, Aaron Puchert wrote:
Am 16.11.24 um 19:24 schrieb Neal Gompa:
* The performance hit for having it vs not is insignificant[6].
There are lots of things one could do that individually have little impact (maybe 1–10%), but those little things add up. One guy wants to add frame pointers for profiling, another wants stack protectors, the next guy wants automatic initialization of local variables (likely C++26) or mandatory boundary checks. They all claim that it adds just a little (on average).
…and we just recently went from FORTIFY_SOURCE=2 to FORTIFY_SOURCE=3, so the allocations for 2024 are now used up ;-)
Aaron Puchert wrote:
Am 16.11.24 um 19:24 schrieb Neal Gompa:
I would like to change openSUSE's default build flags in rpm and rpm-config-SUSE to incorporate frame pointers (and equivalents on other architectures) to support real-time profiling and observability on SUSE distributions. Profiling doesn't necessarily need call stacks. Often you can learn a great deal from just seeing the hot functions. I'd say that I work with
"Doesn't necessarily need call stacks" doesn't mean that one doesn't need call stacks. I'd say call stacks (stack traces) are more often than not useful and provide tons of contextual information that pure hot function view just can't. And the more complicated the application the more they are important.
plain profiles ("perf record" without -g) most of the time and add call stacks only when I'm unsure why a function is so frequently called. Stacks can sometimes even obscure hotspots, depending on how you're reading them: if a hot function distributes its cost among many different call stacks, it will not stick out in a top-down flame graph. (You'll have to do a bottom-up.)
Flame graphs are not the only way to look at a collection of stack traces, so I'm not sure all the above are reasonable objections. For instance, here at Meta, we use both flame graphs and an alternative so-called "graph profiler" view, which allows you to get an extremely detailed and customizable view and slicing of profiles. Aggregating across many calls, filtering out some of the call paths reaching some function while keeping others, etc. But first and foremost is getting the stack trace data. That's what the frame pointer proposal is trying to enable globally and without users having to jump through unreasonable hoops.
There is another obstacle, one that applies even to profiling without call stacks: there are no symbols. We strip .symtab along with debug info. Without that, the stacks that you get are meaningless.
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing the build ID and doing symbolization offline, using the build ID to look up DWARF information for the profiled executable/library. The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
With my compiler hat on, frame pointers often don't make sense. If there are no VLAs and allocas, the compiler knows how large a stack frame is and can simply add (for stacks that grow down) that fixed size in the epilogue. The frame pointer on the stack contains no information that the compiler doesn't already have.
Sure, from compiler point of view. But here the concern is profilers and other stack trace-based tools used for observability, profiling, and debugging.
* The performance hit for having it vs not is insignificant[6]. This is a tricky argument. There are lots of things one could do that individually have little impact (maybe 1–10%), but those little things add up. One guy wants to add frame pointers for profiling, another wants stack protectors, the next guy wants automatic initialization of local variables (likely C++26) or mandatory boundary checks. They all claim that it adds just a little (on average).
But what if the benefit is also small? If we take the average cost (which is typically small for most things you can add) we should also take the average benefit. That might not be much larger since lots of people, even Linux users, will never run "perf record". Even among developers, profiling might be restricted to self-built binaries.
This has been discussed ad nauseam in https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer and elsewhere. Profiling the workload is much more common than you might think, and as it becomes more accessible (because frame pointers are there and tools just work out of the box), it will be used even more frequently. Even if a user doesn't do their own performance investigation, the original application author can send simple perf-based (or whatnot) commands to run and report back data for further optimization.

And cumulatively, the gains we get from more available profiling data far outweigh any of the potential one-time regression due to frame pointers. There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data. Even within Fedora's ecosystem there were almost immediate reports on how having frame pointers enabled in libc enabled significant performance improvements (I'm too lazy to look for the links, please don't ask).
I want openSUSE to be a great place for people to develop and optimize workloads on, especially desktop ones, where most of the tooling we have for tracing and profiling is broken without frame pointers (see Sysprof and Hotspot from GNOME and KDE respectively, which both rely on frame pointers to have cheap real-time tracing for performance analysis). I don't know about any of those, but I'd assume they're just GUIs around "perf"? If you have an Intel CPU from Haswell onward (Zen 4 or so should also have LBR, but I haven't tried it yet), "perf record --call-graph lbr" works even without frame pointers, and in my experience is pretty reliable.
There are many profilers and various tools (especially BPF-based) that have nothing to do with perf, so no, I don't think one can just generalize to "perf UI". And as for LBR, we tried LBR as an augmentation for stack traces to get through function calls inside some libraries that we don't control and couldn't enable frame pointers for. It doesn't work all that great in practice, unfortunately. LBR is limited to just 16 or 32 entries, and that's often not enough. But also LBR doesn't record a stack trace, it records a trace of function returns (and other stuff, if you set it up correctly), and that's not 100% compatible with stack traces. LBRs have some exciting uses (I have a tool, retsnoop, that benefits a lot from LBR, but for an entirely different use case), but it's certainly not a replacement for frame pointers and stack traces.
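The depth limitation can be illustrated with a toy model. The 32-entry figure matches the LBR depth on recent Intel CPUs mentioned above; the call chain itself is invented, and this sketch also glosses over the point that LBR records branches rather than a stack snapshot:

```python
# Toy illustration of why a 16/32-entry LBR falls short for deep stacks:
# the hardware ring buffer only keeps the most recent N branch records,
# so the outermost callers of a deep call chain are lost. Frame names
# are invented for illustration.

LBR_DEPTH = 32

# A call chain 50 frames deep, outermost first.
call_chain = [f"frame_{i}" for i in range(50)]

# An LBR-style capture retains only the innermost LBR_DEPTH entries.
lbr_capture = call_chain[-LBR_DEPTH:]

print(len(lbr_capture))           # 32
print("frame_0" in lbr_capture)   # False: outer frames (main etc.) are gone
print("frame_49" in lbr_capture)  # True: innermost frames survive
```

A frame-pointer walk, by contrast, has no hardware-imposed depth cap; it stops only at the configured software limit or the end of the chain.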
On 26.11.24 at 00:22, Andrii Nakryiko wrote:
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing build ID and doing symbolization offline using build ID to lookup DWARF information for profiled executable/library.
The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
Isn't this entire proposal about the usability of profiling? I'm not just worried about symbolization, but about the bigger picture. Especially if we're profiling the entire system, we're easily talking about gigabytes of debug info to download.
With my compiler hat on, frame pointers often don't make sense. If there are no VLAs and allocas, the compiler knows how large a stack frame is and can simply add (for stacks that grow down) that fixed size in the epilogue. The frame pointer on the stack contains no information that the compiler doesn't already have.
Sure, from the compiler's point of view. But here the concern is profilers and other stack trace-based tools used for observability, profiling, and debugging.
My concern is not just profilers, since we're talking about a distribution default. The point that others have made and that I've just repeated here is that we're spending a register on a relatively register-sparse architecture for something that's not relevant for the program itself, but instrumentation for outside observers. And shipping with instrumentation by default just doesn't feel right. That is not a "real" argument, I know.
Profiling the workload is much more common than you might think, and as it becomes more accessible (because frame pointers are there and tools just work out of the box), it will be used even more frequently. Even if a user doesn't do their own performance investigation, the original application author can send simple perf-based (or whatnot) commands to run and report back data for further optimization.
That's a bit hypothetical. In my experience, profiling is not even terribly common among C++ developers, which is to say that most teams have dedicated performance experts and most others rarely touch a profiler. Most developers profile their own applications, right after building them. That seems natural; after all, you'll want to improve performance, and for that you'll need to touch the source and recompile. Application authors or package maintainers asking their users for profiling sounds reasonable, but I haven't seen it yet. Realistically, if users complain about a performance problem, it's probably big enough that the overhead of DWARF and a reduced sample frequency are not an obstacle. And even truncated stacks could be enough for the developer to figure out the root cause. For what it's worth, I've just today profiled an awfully slow clang-tidy job and I could have attached a debugger to see what was wrong.
There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data.
Meta has a large paid workforce though, while this is at least in part a community project. If SUSE wants to add FPs because they need it to improve the distro, I don't think anybody is going to stand in their way. But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community. If they exist, please step up. All we have is a proposal that was copied from Fedora and no Tumbleweed user that said "I want this and this is how I would use it." Meta is also a data center operator, while SUSE mostly ships software to my knowledge. So they don't have a big server farm where they could or would want to do whole system profiling. Individual maintainers might do performance work on their packages, but for that they don't need FPs as a distro default. SUSE customers might want this for their deployments, but this is again speculation. If that is actually the case, I think someone from SUSE should tell us. My employer (a large SUSE customer, I believe) would probably not be interested so much, because we mainly run our own binaries and use just the OS base in deployment.
Even within Fedora's ecosystem there were almost immediate reports on how having frame pointers in libc enabled significant performance improvements.
Surely libc doesn't have deep call stacks, so I assume this is more of an awareness issue than deficits in DWARF unwinding? I also just played around with profiling a bit more in the last days than usual.
On Tue, Nov 26, 2024 at 7:54 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
Am 26.11.24 um 00:22 schrieb Andrii Nakryiko:
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing build ID and doing symbolization offline using build ID to lookup DWARF information for profiled executable/library.
The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
Isn't this entire proposal about the usability of profiling?
The very first paragraph of this proposal asks to enable frame pointers "to support real-time profiling and observability". Profiling *and observability*. Not the same thing. Both are important. But whatever the wording, easily and cheaply available stack traces are the point here. What applications and tools do with them is orthogonal. No one can or should prescribe how those stack traces are to be used. Stack traces answer the question of "where in the code" (and "how did we get to that point"); everything else is up to applications and tools.
I'm not just worried about symbolization, but about the bigger picture. Especially if we're profiling the entire system, we're easily talking about gigabytes of debug info to download.
What does debug info have to do with this proposal? I already explained that stack unwinding and stack symbolization are two different problems. You don't need DWARF to make sense out of stack traces (though DWARF is extremely valuable for even better stack traces, of course). Unstripped ELF symbols would be needed, of course, but that's nothing in comparison to gigabytes of DWARF. And then again, symbolization doesn't even have to happen immediately or even on the same host. Build ID is there to enable such scenarios.
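The build-ID workflow described here can be sketched as follows. Everything in this example is toy data: the build IDs, offsets, and symbol tables are invented, and a real pipeline would read GNU build-ID notes from the mapped ELF files and fetch symbol data from something like a debuginfod server.

```python
# Sketch of offline symbolization keyed by build ID: the profiled host
# records only (build_id, file_offset) pairs; a separate host resolves
# them later against stored symbol data. All data below is invented.

# Captured on the production host: no DWARF, no ELF symbols needed.
raw_trace = [
    ("b1d5f0c3", 0x1468),  # (build ID of the mapped file, file offset)
    ("b1d5f0c3", 0x1234),
    ("a9e80217", 0x0560),  # a frame inside a shared library
]

# Available on the symbolization host, e.g. fetched by build ID:
# sorted (start_offset, symbol) tables per binary.
symbol_tables = {
    "b1d5f0c3": [(0x1000, "main"), (0x1200, "parse"), (0x1400, "hash")],
    "a9e80217": [(0x0400, "malloc"), (0x0700, "free")],
}

def symbolize(build_id, offset):
    """Return the last symbol starting at or before the offset."""
    name = "??"
    for start, sym in symbol_tables.get(build_id, []):
        if start <= offset:
            name = sym
    return name

print([symbolize(b, off) for b, off in raw_trace])
```

The point of the split: the production host's work stays tiny (capture addresses plus build IDs), while the heavyweight data (ELF symbols, DWARF) lives entirely on the symbolization side.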
With my compiler hat on, frame pointers often don't make sense. If there are no VLAs and allocas, the compiler knows how large a stack frame is and can simply add (for stacks that grow down) that fixed size in the epilogue. The frame pointer on the stack contains no information that the compiler doesn't already have.
Sure, from the compiler's point of view. But here the concern is profilers and other stack trace-based tools used for observability, profiling, and debugging.
My concern is not just profilers, since we're talking about a distribution default. The point that others have made and that I've just repeated here is that we're spending a register on a relatively register-sparse architecture for something that's not relevant for the program itself, but instrumentation for outside observers. And shipping with instrumentation by default just doesn't feel right. That is not a "real" argument, I know.
It's a tradeoff, as everything in life. But supporters of frame pointers are arguing that overhead is negligible, and even if measurable in some cases, having stack traces (for profiling or other uses) easily accessible to tools is to everyone's benefit, and ultimately leads to more performance wins thanks to improved debuggability and observability of applications.
Profiling the workload is much more common than you might think, and as it becomes more accessible (because frame pointers are there and tools just work out of the box), it will be used even more frequently. Even if a user doesn't do their own performance investigation, the original application author can send simple perf-based (or whatnot) commands to run and report back data for further optimization.
That's a bit hypothetical. In my experience, profiling is not even terribly common among C++ developers, which is to say that most teams have dedicated performance experts and most others rarely touch a profiler. Most developers profile their own applications, right after building them. That seems natural; after all, you'll want to improve performance, and for that you'll need to touch the source and recompile.
Application authors or package maintainers asking their users for profiling sounds reasonable, but I haven't seen it yet. Realistically, if users complain about a performance problem, it's probably big enough that the overhead of DWARF and a reduced sample frequency are not an obstacle. And even truncated stacks could be enough for the developer to figure out the root cause.
For what it's worth, I've just today profiled an awfully slow clang-tidy job and I could have attached a debugger to see what was wrong.
The more barriers to do something one has, the less likely it is that someone will attempt that thing. This very much applies to stack traces without frame pointers. Here are a few links I've collected some time after the Fedora proposal went through.

From [0]: "Here is a little gem that I would have been unlikely to find without system-wide frame-pointers".

From [1]: "My final summary here is that for most purposes you would be better off using frame pointers, and it's a good thing that Fedora 38 now compiles everything with frame pointers. It should result in easier performance analysis, and even makes continuous performance analysis more plausible." Author of [1] also points out shortcomings of DWARF-based unwinding, btw. Like truncated ("detached") stack traces, overhead, etc.

I never tried to collect every single blog post about the usefulness of frame pointer-based stack traces available system-wide, tbh. I have a few more links to internal posts which I can't, unfortunately, share. The theme is often the same: because it was trivial to get started, someone actually started, found something they wouldn't otherwise find, fixed it, and moved on.

Note also that Apple has made the decision to always have frame pointers ([2]), probably having a good reason. And yeah, every Mac user "pays" for that, which doesn't seem to really be a problem in practice.

[0] https://blogs.gnome.org/chergert/2023/10/03/what-have-frame-pointers-given-u...
[1] https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwarf-my-verdict/
[2] https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple...
There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data.
Meta has a large paid workforce though, while this is at least in parts a community project. If SUSE wants to add FPs because they need it to improve the distro, I don't think anybody is going to stand in their way. But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up. All we have is a proposal that was copied from Fedora and no Tumbleweed user that said "I want this and this is how I would use it."
I'll leave it up to the openSUSE community to decide on how reasonable it is to expect random distro users to find this thread and present "things I'd do with stack traces" write-ups to the community. In my experience things just don't work like that. You have to provide the means and, ideally, popularize and educate people, and only then you'll reap the benefits.
Meta is also a data center operator, while SUSE mostly ships software to my knowledge. So they don't have a big server farm where they could or would want to do whole system profiling. Individual maintainers might do performance work on their packages, but for that they don't need FPs as a distro default.
I'm not a package maintainer, but if I had to rebuild the whole world just to profile my application that uses glibc, I'd never even start. I actually was in similar situations where others asked me to help, and yeah, I didn't even start, because that's a bit too much to ask from me. But as I mentioned before and in another reply, it's not *just* about profiling. There is a whole cohort of tools like BCC tools [3] and various bpftrace-based scripts, many of which do rely on stack traces to help users understand the source of problems originating with applications (not necessarily applications written or maintained by those users). [3] https://github.com/iovisor/bcc/tree/master/tools
SUSE customers might want this for their deployments, but this is again speculation. If that is actually the case, I think someone from SUSE should tell us. My employer (a large SUSE customer I believe) would probably not be interested so much, because we mainly run our own binaries and use just the OS base in deployment.
Even within Fedora's ecosystem there were almost immediate reports on how having frame pointers in libc enabled significant performance improvements.
Surely libc doesn't have deep call stacks, so I assume this is more of an awareness issue than deficits in DWARF unwinding? I also just played around with profiling a bit more in the last days than usual.
Depth of call stack is just one possible problem. Stack usage is absolutely different and orthogonal. You can have a call stack with just two functions, but if they use multi-KB stack variables, you'll blow your stack capture allowance very quickly. But do take a look at [1] I mentioned above.
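A quick back-of-envelope for the byte-budget point above. The 8 KiB figure is perf's default per-sample stack dump in DWARF mode; the frame sizes are hypothetical:

```python
# Why "deep stacks" isn't the only failure mode for DWARF-mode capture:
# perf copies a fixed chunk of raw stack per sample (8 KiB by default),
# and even a couple of frames with large locals can exhaust it.
# The frame sizes below are made up.

DUMP_BYTES = 8 * 1024  # perf's default --call-graph dwarf dump size

# Just two frames, each with a multi-KiB local buffer:
frame_bytes = [6 * 1024, 6 * 1024]

captured = 0
frames_fully_captured = 0
for size in frame_bytes:
    if captured + size > DUMP_BYTES:
        break  # the rest of the stack is truncated
    captured += size
    frames_fully_captured += 1

print(frames_fully_captured)  # 1: the second frame already doesn't fit
```

So a two-function call stack can already blow the allowance, which is the "stack usage is orthogonal to depth" point in a nutshell.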
On 27.11.24 at 05:31, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 7:54 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 26.11.24 at 00:22, Andrii Nakryiko wrote:
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing build ID and doing symbolization offline using build ID to lookup DWARF information for profiled executable/library.
The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
Isn't this entire proposal about the usability of profiling? I'm not just worried about symbolization, but about the bigger picture. Especially if we're profiling the entire system, we're easily talking about gigabytes of debug info to download.
What does debug info have to do with this proposal? I already explained that stack unwinding and stack symbolization are two different problems. You don't need DWARF to make sense out of stack traces (though DWARF is extremely valuable to have even better stack traces, of course). Not having stripped ELF symbols would be needed, of course, but that's nothing in comparison to gigabytes of DWARF.
Let me just summarize: I said "we need symbols", you said "that's a separate issue", then I said "well but then you need debug info which could be quite large", then you say "you don't need debug info if you have symbols". So do you agree that we should have symbols or not? We all understand that unwinding and symbolization are technically different. But a bunch of hex numbers are not going to tell you anything. So you'll need symbolization at some point, and what's the point of making unwinding easy if symbolization remains hard?
There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data. Meta has a large paid workforce though, while this is at least in parts a community project. If SUSE wants to add FPs because they need it to improve the distro, I don't think anybody is going to stand in their way. But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up. All we have is a proposal that was copied from Fedora and no Tumbleweed user that said "I want this and this is how I would use it." I'll leave it up to the openSUSE community to decide on how reasonable it is to expect random distro users to find this thread, and present "things I'd do with stack traces" write ups to the community. In my experience things just don't work like that. You have to provide the means and, ideally, popularize and educate people, and only then you'll reap the benefits.
Meta is also a data center operator, while SUSE mostly ships software to my knowledge. So they don't have a big server farm where they could or would want to do whole system profiling. Individual maintainers might do performance work on their packages, but for that they don't need FPs as a distro default. I'm not a package maintainer, but if I had to rebuild the whole world just to profile my application that uses glibc, I'd never even start.
I profile a lot and never rebuild the whole world or even just glibc. But suppose you use glibc or libstdc++ and for some reason spend a lot of time in there that you want to understand: there is no need to rebuild the whole world, and rebuilding glibc is easy. Just check out the package, add the flag, and build. The build might take a bit, I don't know. But the setup is one minute.
Even within Fedora's ecosystem there were almost immediate reports on how having frame pointers in libc enabled significant performance improvements. Surely libc doesn't have deep call stacks, so I assume this is more of an awareness issue than deficits in DWARF unwinding? I also just played around with profiling a bit more in the last days than usual. Depth of call stack is just one possible problem. Stack usage is absolutely different and orthogonal.
Of course I meant the size in bytes. Unfortunately [1] seems to use blunt instruments. I would also be mad at a 60 GB trace file, but you can of course reduce sampling frequency. There are formulas that tell you how many samples you need for which level of accuracy, and if you want something like 1% error, you need 10,000 samples, which puts you at 320 MB for the stack traces. And if your workload runs long enough, reducing the frequency can also reduce overhead to acceptable levels. I'm not suggesting that the current DWARF unwinding is perfect, and I'd much rather see an unwinder that works in BPF. But it's not quite as bad as some make it look.
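The arithmetic here checks out under the usual 1/sqrt(N) sampling-error model. The 32 KiB per-sample dump used below is an assumption chosen to match the 320 MB estimate above; perf's actual default in DWARF mode is 8 KiB:

```python
import math

# Relative error of a sampled proportion scales like 1/sqrt(N):
# for roughly 1% error you need on the order of 10,000 samples.
target_error = 0.01
samples = math.ceil(1 / target_error**2)
print(samples)  # 10000

# DWARF-mode capture cost: each sample carries a raw stack chunk.
dump_bytes = 32 * 1024  # assumed per-sample dump; perf defaults to 8 KiB
total = samples * dump_bytes
print(total / 2**20)  # 312.5 (MiB), i.e. roughly the ~320 MB cited above
```

With frame pointers the per-sample cost is a short list of return addresses instead of a raw stack chunk, which is why the same sample count costs orders of magnitude less.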
On Wed, Nov 27, 2024 at 5:57 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 27.11.24 at 05:31, Andrii Nakryiko wrote:
On Tue, Nov 26, 2024 at 7:54 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 26.11.24 at 00:22, Andrii Nakryiko wrote:
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing build ID and doing symbolization offline using build ID to lookup DWARF information for profiled executable/library.
The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
Isn't this entire proposal about the usability of profiling? I'm not just worried about symbolization, but about the bigger picture. Especially if we're profiling the entire system, we're easily talking about gigabytes of debug info to download.
What does debug info have to do with this proposal? I already explained that stack unwinding and stack symbolization are two different problems. You don't need DWARF to make sense out of stack traces (though DWARF is extremely valuable to have even better stack traces, of course). Not having stripped ELF symbols would be needed, of course, but that's nothing in comparison to gigabytes of DWARF.
Let me just summarize: I said "we need symbols", you said "that's a separate issue", then I said "well but then you need debug info which could be quite large", then you say "you don't need debug info if you have symbols". So do you agree that we should have symbols or not?
Where is the contradiction? Yes, we need symbols. ELF symbols are enough to have basic stack trace symbolization. But even that is not an absolute requirement, because you can capture build ID + file offset and offload symbolization to a different host and fetch full ELF and DWARF data separately without affecting production workload. But then again, this is symbolization issues, not stack unwinding.
We all understand that unwinding and symbolization are technically different. But a bunch of hex numbers are not going to tell you anything. So you'll need symbolization at some point, and what's the point of making unwinding easy if symbolization remains hard?
Two separate problems. Different sets of solutions. Nothing in common between stack unwinding and symbolization besides the word "DWARF". Absolutely misleading conversation and arguments. Not sure if intentional or not, but it's just completely beside the topic of this proposal.
There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data. Meta has a large paid workforce though, while this is at least in parts a community project. If SUSE wants to add FPs because they need it to improve the distro, I don't think anybody is going to stand in their way. But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up. All we have is a proposal that was copied from Fedora and no Tumbleweed user that said "I want this and this is how I would use it." I'll leave it up to the openSUSE community to decide on how reasonable it is to expect random distro users to find this thread, and present "things I'd do with stack traces" write ups to the community. In my experience things just don't work like that. You have to provide the means and, ideally, popularize and educate people, and only then you'll reap the benefits.
Meta is also a data center operator, while SUSE mostly ships software to my knowledge. So they don't have a big server farm where they could or would want to do whole system profiling. Individual maintainers might do performance work on their packages, but for that they don't need FPs as a distro default. I'm not a package maintainer, but if I had to rebuild the whole world just to profile my application that uses glibc, I'd never even start.
I profile a lot and never rebuild the whole world or even just glibc.
But suppose you use glibc or libstdc++ and for some reason spend a lot of time in there that you want to understand: there is no need to rebuild the whole world, and rebuilding glibc is easy. Just check out the package, add the flag, and build. The build might take a bit, I don't know. But the setup is one minute.
I believe you. But profiling something on your development machine is just one of many possible scenarios. Rebuilding libc for a production machine with a production workload that you *need* to profile and/or debug is a completely different one. You have an interesting line of counter-arguing here, tbh, using your personal use cases and approaches as a general argument against a change with much wider implications beyond your specific patterns.
Even within Fedora's ecosystem there were almost immediate reports on how having frame pointers in libc enabled significant performance improvements. Surely libc doesn't have deep call stacks, so I assume this is more of an awareness issue than deficits in DWARF unwinding? I also just played around with profiling a bit more in the last days than usual. Depth of call stack is just one possible problem. Stack usage is absolutely different and orthogonal.
Of course I meant the size in bytes.
Then I don't understand what you were trying to argue with "surely libc doesn't have deep call stacks". What about stack usage of an application itself that calls into libc? Why is libc called out separately? Anyhow, I explained three broad approaches to DWARF-based unwinding and their main limitations. I'm not sure I can add much to that here.
Unfortunately [1] seems to use blunt instruments. I would also be mad at a 60 GB trace file, but you can of course reduce sampling frequency. There are formulas that tell you how many samples you need for which level of accuracy, and if you want something like 1% error, you need 10,000 samples, which puts you at 320 MB for the stack traces. And if your workload runs long enough, reducing the frequency can also reduce overhead to acceptable levels.
I'm not suggesting that the current DWARF unwinding is perfect, and I'd much rather see an unwinder that works in BPF. But it's not quite as bad as some make it look.
On Tue, Nov 26, 2024 at 10:55 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 26.11.24 at 00:22, Andrii Nakryiko wrote:
Yes, stripped out ELF symbols are pretty annoying. But. That's a separate issue. And also it's possible to avoid needing them by capturing build ID and doing symbolization offline using build ID to lookup DWARF information for profiled executable/library.
The important part, again, is *capturing the stack trace*. You seem to be more worried about symbolization, which is an entirely different problem.
Isn't this entire proposal about the usability of profiling? I'm not just worried about symbolization, but about the bigger picture. Especially if we're profiling the entire system, we're easily talking about gigabytes of debug info to download.
With my compiler hat on, frame pointers often don't make sense. If there are no VLAs and allocas, the compiler knows how large a stack frame is and can simply add (for stacks that grow down) that fixed size in the epilogue. The frame pointer on the stack contains no information that the compiler doesn't already have.
Sure, from the compiler's point of view. But here the concern is profilers and other stack trace-based tools used for observability, profiling, and debugging.
My concern is not just profilers, since we're talking about a distribution default. The point that others have made and that I've just repeated here is that we're spending a register on a relatively register-sparse architecture for something that's not relevant for the program itself, but instrumentation for outside observers. And shipping with instrumentation by default just doesn't feel right. That is not a "real" argument, I know.
I think this is a matter of perspective. My point of view is that not having out-of-the-box instrumentation makes it hard for observability to develop further in spaces outside of the Kubernetes/Go case (where this is already the default).
Profiling the workload is much more common than you might think, and as it becomes more accessible (because frame pointers are there and tools just work out of the box), it will be used even more frequently. Even if a user doesn't do their own performance investigation, the original application author can send simple perf-based (or whatnot) commands to run and report back data for further optimization.
That's a bit hypothetical. In my experience, profiling is not even terribly common among C++ developers, which is to say that most teams have dedicated performance experts and most others rarely touch a profiler. Most developers profile their own applications, right after building them. That seems natural; after all, you'll want to improve performance, and for that you'll need to touch the source and recompile.
Application authors or package maintainers asking their users for profiling sounds reasonable, but I haven't seen it yet. Realistically, if users complain about a performance problem, it's probably big enough that the overhead of DWARF and a reduced sample frequency are not an obstacle. And even truncated stacks could be enough for the developer to figure out the root cause.
This is something that both the GNOME and KDE communities are actively exploring. I know that the GNOME folks who maintain Sysprof have been working on making the tool better for this *exact* use-case. I'm sure folks that work on KDE are working on something similar.
For what it's worth, I've just today profiled an awfully slow clang-tidy job and I could have attached a debugger to see what was wrong.
There is zero doubt that whatever we (Meta) lost due to disabling frame pointers has been recovered many times over through high quality and widely available profiling data.
Meta has a large paid workforce though, while this is at least in parts a community project. If SUSE wants to add FPs because they need it to improve the distro, I don't think anybody is going to stand in their way. But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up. All we have is a proposal that was copied from Fedora and no Tumbleweed user that said "I want this and this is how I would use it."
Most of the people who want this stuff are generally unable to figure out how to advocate for this in distributions. That's why *I* did it. Also, as a Tumbleweed user, I would hope that *I* count. I would also like to see us doing this more in KDE (and KDE upstream does qualification for Linux using openSUSE), and whole-system tracing requires the whole distribution being built for it.
Meta is also a data center operator, while SUSE mostly ships software to my knowledge. So they don't have a big server farm where they could or would want to do whole system profiling. Individual maintainers might do performance work on their packages, but for that they don't need FPs as a distro default.
On the contrary, performance bottlenecks can show up in random places. Whole-system tracing is incredibly useful with modern application middleware that can go fairly deep.
SUSE customers might want this for their deployments, but this is again speculation. If that is actually the case, I think someone from SUSE should tell us. My employer (a large SUSE customer I believe) would probably not be interested so much, because we mainly run our own binaries and use just the OS base in deployment.
As long as some of the OS is linked in your application stack, it can be useful.

--
真実はいつも一つ!/ Always, there's only one truth!
On Wed, 27 Nov 2024 at 04:54 +0100, Aaron Puchert wrote:
But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up.
Howdy! The very first reply to this thread was mine, and it was pretty idiotic from my side not to introduce myself. I'm Alessio, I hack things on openSUSE when I have some free time, I maintain some packages, and I try to champion all-things-observability both in the community _and_ in my day-to-day at SUSE.

When I first saw this proposal coming to Fedora I was extremely thrilled, and when this expanded to other communities like Arch Linux I was very much fascinated by how the discussion evolved. I'm very thrilled as well that this topic is reaching openSUSE at last. I think advanced diagnostics are something in the scope of every operating system, and hacking on a really wide variety of software in my free time has led me to do some of this analysis using both sysprof _and_ eBPF. I think one of the greatest performance slowdowns in GTK was discovered thanks to system-wide frame pointer availability.

This means to me that other distros are currently better suited than openSUSE for these needs around the development of desktop environments and low-level system software in general. I would love openSUSE to really stand up as "the makers' choice", as that is our claim and I think we have a lot of edge on that front to contribute to the ecosystem.

As Neal pointed out, right now eBPF profiling isn't really a thing because eBPF unwinding itself isn't really a thing.

Coming to GNOME, as Neal was reporting, Christian Hergert wrote a couple of posts that were really eye-opening about this kind of system-wide profiling even being possible.

https://fedoramagazine.org/performance-profiling-in-fedora-linux/
https://blogs.gnome.org/chergert/2022/12/31/frame-pointers-and-other-practic...

On the contrary, he also wrote something about system-wide profiling actually _being_ possible even without frame pointers, _but_ the overhead on the observed system IMHO makes it almost pointless (sorry for the blunt judgement here):

https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-frame-pointers/

Obviously I don't spend all my time just watching flame graphs on my monitor. At the same time, I think dismissing this as something totally unneeded wouldn't pay homage to my favorite distro.

I hope my contribution was valuable to you in some way.

Alessio
On Wed, 27 Nov 2024, Alessio Biancalana via openSUSE Factory wrote:
On Wed, 27 Nov 2024 at 04:54 +0100, Aaron Puchert wrote:
But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up.
Howdy! The very first reply to this thread was mine, and it was pretty idiotic from my side not to introduce myself. I'm Alessio, I hack things on openSUSE when I have some free time, I maintain some packages and I try to champion all-things-observability both in the community _and_ in my day by day at SUSE.
When I first saw this proposal coming to Fedora I was extremely thrilled, and when this expanded to other communities like Arch Linux I was very much fascinated by how the discussion evolved. I'm very thrilled as well that this topic is reaching openSUSE at last. I think advanced diagnostics are something in the scope of every operating system, and hacking on a really wide variety of software in my free time has led me to do some of this analysis using both sysprof _and_ eBPF. I think one of the greatest performance slowdowns in GTK was discovered thanks to system-wide frame pointer availability.
This means to me that other distros are currently more ideal than openSUSE to suit these needs about the development of desktop environments and in general low level system software. I would love openSUSE to really stand up as "the makers' choice", as that is our claim and I think we have a lot of edge on that front to contribute to the ecosystem.
As Neal pointed out, right now eBPF profiling isn't really a thing because eBPF unwinding itself isn't really a thing.
Coming to GNOME, as Neal was reporting, Christian Hergert wrote a couple of posts that were really eye-opening about this kind of system-wide profiling even being possible.
https://fedoramagazine.org/performance-profiling-in-fedora-linux/ https://blogs.gnome.org/chergert/2022/12/31/frame-pointers-and-other-practic...
On the contrary, he also wrote something about system-wide profiling actually _being_ possible even without frame pointers _but_ the overhead on the observed system IMHO makes it almost pointless (sorry for the blunt judgement here):
Where do you see it being "pointless"? Because 10% of the samples (on his system) are from unwinding? I've just tested doing a whole-system profile on SLE15 SP6 and I see at most 5% overhead from perf (perf report isn't the most useful tool here). What's the perf overhead on a system with FPs? (Does it reliably show the kernel side doing unwinding?) It will very much still show that greatest GTK slowdown; I bet even --call-graph lbr might have shown it.

Richard.
https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-frame-pointers/
Obviously I don't spend all my time just watching flame graphs on my monitor. At the same time, I think dismissing this as something totally unneeded wouldn't pay homage to my favorite distro.
I hope my contribution was valuable to you in some way.
Alessio
-- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On 27.11.24 at 10:25, alessio.biancalana@suse.com wrote:

On Wed, 27 Nov 2024 at 04:54 +0100, Aaron Puchert wrote:

But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.

If they exist, please step up.

Howdy! The very first reply to this thread was mine, and it was pretty idiotic from my side not to introduce myself. I'm Alessio, I hack things on openSUSE when I have some free time, I maintain some packages and I try to champion all-things-observability both in the community _and_ in my day by day at SUSE.

Thanks for stepping up!

As Neal pointed out, right now eBPF profiling isn't really a thing because eBPF unwinding itself isn't really a thing.
Coming to GNOME, as Neal was reporting, Christian Hergert wrote a couple of posts that were really eye-opening about this kind of system-wide profiling even being possible.
https://fedoramagazine.org/performance-profiling-in-fedora-linux/
Got it, so the idea is to profile something like a GUI application with lots of dependencies. Most of the work doesn't happen in the application itself, but in all kinds of libraries (fonts, rendering, IO, etc.) that all need FPs. That's different from my profiling, where the application itself sits on the CPU all the time.

Do you think that SUSE will systematically use this in the long term? I know data center operators do this (Meta, Google, Netflix), because they run their software at a large scale and want to reduce their server footprint. They have dedicated performance engineers. SUSE isn't quite in the same situation, because most usage of the software that we build happens out of our sight. (You could profile OBS or openQA, but I'm not sure how useful that would be.)

So would we systematically look at usage scenarios? Would you expect that package maintainers do profiling for their packages?
https://blogs.gnome.org/chergert/2022/12/31/frame-pointers-and-other-practic...
On the contrary, he also wrote something about system-wide profiling actually _being_ possible even without frame pointers _but_ the overhead on the observed system IMHO makes it almost pointless (sorry for the blunt judgement here):
https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-frame-pointers/

Here I have to agree with Richard, 10% overhead by itself is not much. You're not going to profile all the time, so the cost should be OK. And the overhead itself should also not distort that much. One issue with profiling overhead is that it can hide latency or affect caches and other CPU structures. But this depends on sampling frequency. If you have too much overhead, reduce the frequency and repeat your workload until you have enough samples for a statistically meaningful result.

Let's put the near term aside for now. What happens in the long term? When all distros have FP enabled, are we still going to see effort poured into out-of-band unwinding? Or is it just going to be written off? As I've written in reply to Andrii, in the compiler world a 2% regression is quite significant. So I'd like to see a way out.
I hope my contribution was valuable to you in some way.

It was, thanks!
On Wed, Nov 27, 2024 at 6:48 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
[...]
Let's put the near term aside for now. What happens in the long term? When all distros have FP enabled, are we still going to see effort poured into out-of-band unwinding? Or is it just going to be written off?
As I've written in reply to Andrii, in the compiler world a 2% regression is quite significant. So I'd like to see a way out.
There is a proposal, called SFrame (see [0]), in the works that is meant to make frame pointers unnecessary. It requires support at all levels of the software stack: the compiler, the linker, the kernel itself, BPF support, *and* BPF-based tooling all have to change the way they capture stack traces. It will take some time, but once all those pieces are ready and widely adopted, then frame pointers might not be necessary anymore.

[0] https://sourceware.org/binutils/wiki/sframe

[...]
On Wed, Nov 27, 2024 at 9:48 PM Aaron Puchert <aaronpuchert@alice-dsl.net> wrote:
On 27.11.24 at 10:25, alessio.biancalana@suse.com wrote:
On Wed, 27 Nov 2024 at 04:54 +0100, Aaron Puchert wrote:
But at least right now I don't think they have dedicated people for performance, and I'm not aware of anyone regularly doing this kind of work in the community.
If they exist, please step up.
Howdy! The very first reply to this thread was mine, and it was pretty idiotic from my side not to introduce myself. I'm Alessio, I hack things on openSUSE when I have some free time, I maintain some packages and I try to champion all-things-observability both in the community _and_ in my day by day at SUSE.
Thanks for stepping up!
As Neal pointed out, right now eBPF profiling isn't really a thing because eBPF unwinding itself isn't really a thing.
Coming to GNOME, as Neal was reporting, Christian Hergert wrote a couple of posts that were really eye-opening about this kind of system-wide profiling even being possible.
https://fedoramagazine.org/performance-profiling-in-fedora-linux/
Got it, so the idea is to profile something like a GUI application with lots of dependencies. Most of the work doesn't happen in the application itself, but in all kinds of libraries (fonts, rendering, IO, etc.) that all need FPs. That's different from my profiling, where the application itself sits on the CPU all the time.
Right, your kind of profiling is the simple kind. And surprisingly uncommon. It's rare for application stacks to be "close to metal" and lack a deep middleware stack. Even server workloads typically have a few layers between the developer code and the metal these days.
Do you think that SUSE will systematically use this in the long term? I know data center operators do this (Meta, Google, Netflix), because they run their software at a large scale and want to reduce their server footprint. They have dedicated performance engineers. SUSE isn't quite in the same situation, because most usage of the software that we build happens out of our sight. (You could profile OBS or openQA, but I'm not sure how useful that would be.)
Can you please stop focusing on the wrong thing here? One of the key values of this is that you don't *need* dedicated performance engineers to do performance analysis. It becomes accessible to everyone. I think it will naturally get used as it becomes available. Within three months of it existing in Fedora, GNOME folks who develop GNOME on Fedora had made tremendous strides in GLib, GTK, and VTE. It has continued to pay off substantially simply because the capability is *there*. There's a lot of potential with openSUSE having it too, particularly with communities that use openSUSE as the principal development and qualification platform (such as KDE).
So would we systematically look at usage scenarios? Would you expect that package maintainers do profiling for their packages?
https://blogs.gnome.org/chergert/2022/12/31/frame-pointers-and-other-practic...
Let's put the near term aside for now. What happens in the long term? When all distros have FP enabled, are we still going to see effort poured into out-of-band unwinding? Or is it just going to be written off?
As I've written in reply to Andrii, in the compiler world a 2% regression is quite significant. So I'd like to see a way out.
The truth is that it was already written off. Nobody does it for serious system-wide tracing. There is no way out until someone comes up with an actually useful SFrames proposal that includes all Linux architectures (which does not exist yet and may not exist for another ten years). As a hobbyist Linux desktop systems developer, I'm tired of being hampered by our own stack because people think we don't deserve working tools. :( -- 真実はいつも一つ!/ Always, there's only one truth!
cc:ing Andrii Nakryiko here, one of our resident profiling gurus at Meta, to bring him into the conversation. Apologies for yet another subthread, but there are different questions from Aaron and Jiri, and since Andrii is not subscribed to the lists this is the only way to keep both packaging and factory lists in sync.

Cheers,

-- Michel

On Sat, Nov 16, 2024 at 01:24:05PM -0500, Neal Gompa wrote:
Hello,
I would like to change openSUSE's default build flags in rpm and rpm-config-SUSE to incorporate frame pointers (and equivalents on other architectures) to support real-time profiling and observability on SUSE distributions.
In practice, this would mean essentially adopting the same tunables that exist in Fedora[1] for openSUSE to turn them on by default and to allow packages or OBS projects to selectively opt out as needed.
The reasons to do so are threefold:
* It is a major competitive disadvantage for us to lack the ability to do cheap real-time profiling and observability. Fedora[2], Ubuntu[3], Arch Linux[4], and AlmaLinux[5] all now do this, and thus support this capability.
* The performance hit of having it vs. not is insignificant[6].
* There are new tools in both the cloud-native and regular systems development worlds that leverage this, and openSUSE should be an enabler of those technologies.
I want openSUSE to be a great place for people to develop and optimize workloads on, especially desktop ones, where most of the tooling we have for tracing and profiling is broken without frame pointers (see Sysprof and Hotspot from GNOME and KDE respectively, which both rely on frame pointers to have cheap real-time tracing for performance analysis). And given that we advertise as the "makers' choice", I think it would definitely be on-brand for us to have the capability to better support makers and shakers.
For those interested in more detail about frame pointers, Brendan Gregg has a decent post about it[7]. I have also submitted a parallel request for openSUSE Leap 16 to have this feature enabled too[8].
I truly believe this would give us an even better footing with the broader community of developers and operators and make SUSE distributions very attractive for FOSS and proprietary software alike.
Best regards, Neal
[1]: https://src.fedoraproject.org/rpms/redhat-rpm-config/blob/93063bb396395b9a20... [2]: https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer [3]: https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-b... [4]: https://gitlab.archlinux.org/archlinux/rfcs/-/blob/master/rfcs/0026-fno-omit... [5]: https://almalinux.org/blog/2024-10-22-introducing-almalinux-os-kitten/ [6]: https://www.phoronix.com/review/fedora-38-beta-benchmarks [7]: https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointer... [8]: https://code.opensuse.org/leap/features/issue/175
-- 真実はいつも一つ!/ Always, there's only one truth!
-- _o) Michel Lind _( ) identities: https://keyoxide.org/5dce2e7e9c3b1cffd335c1d78b229d2f7ccc04f2
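For concreteness, the per-package opt-out mentioned in the proposal is a one-line spec file change in the Fedora implementation referenced in [1]. The macro name below is as Fedora spells it; an openSUSE implementation could expose the same or a similarly named tunable:

```
# In the affected package's spec file: opt this one package out of the
# distribution-wide frame pointer default (Fedora's macro name shown;
# the openSUSE name would be decided during implementation).
%undefine _include_frame_pointers
```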
On Fri, Nov 22, 2024 at 9:44 AM Michel Lind <michel@michel-slm.name> wrote:
cc:ing Andrii Nakryiko here, one of our resident profiling gurus at Meta, to bring him into the conversation. Apologies for yet another subthread, but there are different questions from Aaron and Jiry and since Andrii is not subscribed to the lists this is the only way to keep both packaging and factory lists in sync.
Thanks Michel! It turned out I can reply through the website UI (which I did in three separate replies), but thanks for looping me in!

TL;DR, I'm very supportive of enabling frame pointers. There is no practical alternative to them right now. Maybe in a few years (and I'm one of the people working towards SFrame becoming one of those alternatives), but definitely not in the next year or two (if not more).

We went through a lot (and I mean A LOT) of frame pointer discussion with the Fedora proposal, and I'm not sure I see the value in doing all that over again. It's been proven useful and no significant regression was reported (as far as I'm aware). In big production (at Meta), there is no way we are going back to disabling frame pointers; it's been that useful and powers tons of efficiency work for many people in the company.

I hope the openSUSE community will follow the example of Fedora and Ubuntu and will make the entire software ecosystem much more amenable to profiling and optimization work. Even if most users are not the immediate *users of this*, they get all the benefits of the optimization work this change enables for tons of other developers.
participants (13)
- Aaron Puchert
- Alessio Biancalana
- alessio.biancalana@suse.com
- Andreas Schwab
- Andrii Nakryiko
- Bernhard M. Wiedemann
- Jan Engelhardt
- Jiri Slaby
- Michael Matz
- Michal Suchánek
- Michel Lind
- Neal Gompa
- Richard Biener