On Mon, Dec 2, 2024 at 9:19 PM Jiri Slaby <jslaby@suse.cz> wrote:
On 03. 12. 24, 1:55, Andrii Nakryiko wrote:
On Thu, Nov 28, 2024 at 1:37 AM Jiri Slaby <jslaby@suse.cz> wrote:
On 27. 11. 24, 20:15, Andrii Nakryiko wrote:
There is just no good and reliable solution for DWARF-based stack unwinding.
I don't understand this either. We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable, before ORC became so. Note that we use ORC for livepatching, and ORC was developed to be reliable for that purpose.
Again, ORC is for kernels. We are discussing user space. There is no ORC for user space.
Yes, you wrote: "There is just no good and reliable solution for DWARF-based stack unwinding."
"for user space", which is the whole context of this proposal. We are not discussing stack unwinding for kernel stack traces. That's what is misleading, you keep mixing in stack unwinding of *kernel stack traces*.
I wrote: "We used to have DWARF annotations and an unwinder in SUSE kernels. It was both good and reliable."
If it's good and reliable, why is it not in the kernel anymore? But I also explained one pretty big source of unreliability:
The most damning would be the inability to page in memory that contains the .eh_frame section contents from a non-sleepable context, which is where stack traces are typically captured (NMI, tracepoints, kprobes).
I wasn't even talking about the complexity and reliability of kernel code *interpreting* .eh_frame contents (which, from what I gather, was a source of bugs, and that's why Linus wanted it ripped out of the kernel; but I haven't researched this topic, so I'm sure I'm missing lots of details).
Please stop misleading others
How does that mislead others?
See above. Forget about kernel stack unwinding for this proposal (i.e., capturing stack traces of kernel code), it's not about that. It's about stack unwinding of *user space code*.

And here we are discussing stack unwinding performed *by kernel* code for *user space code*, at the request of tools like perf and BPF (this request is triggered from kernel context in response to perf events, i.e., NMI, or based on a multitude of possible in-kernel events triggering a BPF program). The logic of stack unwinding in such cases lives in and is executed by kernel code, but it's capturing stack traces of user space applications. This is a very important distinction.

Capturing *kernel stack traces* is basically a solved problem with ORC.

Capturing user stack traces from user space, either from inside the user space application itself or through a ptrace-based debugger or profiler, is also a separate and, arguably, easier problem, because it is possible to page in .eh_frame data from disk, wait for that, etc. (i.e., it's possible to sleep and wait for arbitrary amounts of time, if necessary).

But capturing user stack traces in NMI (think perf events, like CPU cycles sampling) is a harder problem. We are working towards solving that by postponing stack trace capture until exiting from the kernel back to user space, but we (kernel) are not there yet.
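To make the distinction concrete, here is a toy Python model of the idea described above: the sample is only recorded in the non-sleepable context, and the actual user stack unwind is postponed until the task is about to return to user space, where paging in unwind data is allowed. This is only a sketch under stated assumptions, not how the kernel implements it; every name in it (Task, PageFault, on_nmi_sample, on_return_to_user, and so on) is hypothetical and exists purely for illustration.

```python
class PageFault(Exception):
    """Non-resident memory was touched in a context that cannot sleep."""


class Task:
    """Hypothetical stand-in for a user space task being profiled."""

    def __init__(self, user_stack, unwind_data_resident=False):
        self.user_stack = list(user_stack)      # innermost frame first
        self.unwind_data_resident = unwind_data_resident
        self.pending_samples = []               # timestamps awaiting a stack
        self.completed_samples = []             # (timestamp, stack) pairs


def read_unwind_data(task, can_sleep):
    """Model access to .eh_frame-like data that may not be in memory."""
    if not task.unwind_data_resident:
        if not can_sleep:
            # In NMI-like context we cannot wait for the page to be read in.
            raise PageFault("unwind data not resident and sleeping not allowed")
        task.unwind_data_resident = True        # pretend we paged it in


def unwind_user_stack(task, can_sleep):
    """Return a copy of the user stack, if the unwind data is reachable."""
    read_unwind_data(task, can_sleep)
    return list(task.user_stack)


def on_nmi_sample(task, timestamp):
    """NMI-like context: do not unwind, only remember that a stack is wanted."""
    task.pending_samples.append(timestamp)


def on_return_to_user(task):
    """Sleepable context just before returning to user space."""
    if not task.pending_samples:
        return
    # One unwind serves every pending sample in this toy model, because the
    # user stack is not modified between the samples and this point.
    stack = unwind_user_stack(task, can_sleep=True)
    for timestamp in task.pending_samples:
        task.completed_samples.append((timestamp, stack))
    task.pending_samples.clear()
```

In this model, calling unwind_user_stack(task, can_sleep=False) on a task whose unwind data is not resident raises PageFault, which mirrors the NMI problem, while on_nmi_sample() followed later by on_return_to_user() completes the samples without ever needing to sleep in the sampling context.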
thanks,
--
js
suse labs