@dysfun right but couldn't you do the cheap thing when _capturing_ the backtrace and keep the expensive stuff for later? I guess not?
@fasterthanlime Symbol resolution should be the most expensive part, but that is already done lazily. Capturing a backtrace requires, for each frame, iterating through the .eh_frame_hdr of every dynamic library to find the right unwind info, and then interpreting the unwinding bytecode from the start of the function up to the current instruction pointer to find where each register is saved. There is a paper about compiling this bytecode to executable code, which is much faster.
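To make the per-frame lookup cost concrete, here is a minimal sketch of the "find the right unwind info" step. It assumes a simplified model: each loaded module has an address range and a table sorted by function start, loosely modeled on the binary search table in .eh_frame_hdr. The `Module`, `TableEntry`, and `lookup` names are made up for illustration, and no real ELF parsing happens here.

```rust
use std::ops::Range;

/// One row of a simplified lookup table, loosely modeled on the sorted
/// binary search table in `.eh_frame_hdr`. Field names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TableEntry {
    /// Address of the first instruction of a function.
    pub function_start: u64,
    /// Offset of the unwind record describing that function.
    pub fde_offset: u64,
}

/// A loaded library or executable, reduced to what the lookup needs.
pub struct Module {
    pub range: Range<u64>,
    /// Must be sorted by `function_start`.
    pub table: Vec<TableEntry>,
}

/// Binary search for the last function starting at or before `pc`.
/// This table has no end addresses, so a real unwinder would still have to
/// check that `pc` falls inside the function the record describes.
pub fn find_entry(table: &[TableEntry], pc: u64) -> Option<TableEntry> {
    let idx = table.partition_point(|e| e.function_start <= pc);
    idx.checked_sub(1).map(|i| table[i])
}

/// The per-frame work: scan the modules, then search the matching table.
pub fn lookup(modules: &[Module], pc: u64) -> Option<TableEntry> {
    modules
        .iter()
        .find(|m| m.range.contains(&pc))
        .and_then(|m| find_entry(&m.table, pc))
}
```

Even in this toy version, every frame pays for a scan over the modules plus a binary search, and that is before any unwinding bytecode gets interpreted.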
@bjorn3 Is there a fast way to do this without debuginfo? What does `perf` do when you tell it to not use dwarf?
@fasterthanlime @bjorn3 frame pointers, which are disabled by default because you know, back in the old days of 32 bit x86, you didn't have many registers, so you'd abuse the frame pointer to get one, which made your program a bit faster, but also way harder to profile efficiently, making it effectively slower again, and nowadays the tradeoff doesn't really make sense, but it still sticks around.
`-Cforce-frame-pointers=true` enables them
@nilstrieb @bjorn3 (hi nora!!!) I remember reading about major Linux distributions switching to enabling them by default. I'm surprised that it's not the default in Rust yet!!
@nilstrieb @bjorn3 But yeah, I would be interested in a solution based on frame pointers!
@fasterthanlime @nilstrieb Frame pointers are now on for the standard library to allow people to use them everywhere without having to recompile the standard library. We tried enabling them by default, but the perf loss for running rustc compiled with frame pointers was just a bit too high unfortunately.
@fasterthanlime @bjorn3 point of clarification: .eh_frame is not debug info, it's part of the ABI. It does use almost exactly the same format as DWARF .debug_frame, though. Other platforms have alternate solutions here, like Apple's compact unwind tables.
@bjorn3 @fasterthanlime would it be possible to capture the backtrace for almost free if one pinky promised frame pointers were in use? Then possibly one shouldn’t have to completely unwind and execute the bytecode, or am I missing something?
@nrab @fasterthanlime Yes, but you'd still have to write your own code for this. backtrace-rs doesn't support it out of the box. Also make sure all system libraries use frame pointers too.
@bjorn3 @fasterthanlime I know what I’m doing this weekend then
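For anyone curious what the frame-pointer approach boils down to, here is a minimal sketch of the walk itself. It runs over simulated memory (a `HashMap` from address to 64-bit word) rather than the real stack, so it stays safe and testable. The layout assumed is the usual x86-64 and AArch64 one: the word at `fp` is the caller's saved frame pointer and the word at `fp + 8` is the return address. `Memory` and `walk_frame_pointers` are hypothetical names, not part of backtrace-rs.

```rust
use std::collections::HashMap;

/// Simulated process memory: address -> 64-bit word stored there.
pub type Memory = HashMap<u64, u64>;

/// Follows the chain of frame records starting at `fp` and collects the
/// return addresses. Stops on unreadable memory, at a null saved frame
/// pointer, after `max_frames`, or when the chain stops moving toward
/// higher addresses (the stack grows down, so callers live higher up).
pub fn walk_frame_pointers(mem: &Memory, mut fp: u64, max_frames: usize) -> Vec<u64> {
    let mut return_addresses = Vec::new();
    while fp != 0 && return_addresses.len() < max_frames {
        let Some(&saved_fp) = mem.get(&fp) else { break };
        let Some(&ret) = mem.get(&fp.wrapping_add(8)) else { break };
        return_addresses.push(ret);
        if saved_fp <= fp {
            // Null, looping, or backwards chain: treat as the end.
            break;
        }
        fp = saved_fp;
    }
    return_addresses
}
```

A real implementation would read the frame pointer register and dereference raw stack addresses, which needs `unsafe` and only gives sane results if every frame on the stack (system libraries included) actually keeps a frame pointer.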

@fasterthanlime @dysfun Polar Signals has done a bunch of work to make stack traces fast (kinda a necessity for an enable-in-production profiling tool).

If I remember correctly, they implement¹ just enough of a DWARF interpreter to traverse the call stack - at the cost of sometimes failing to produce a stack trace. backtrace-rs could probably do something similar? It doesn't need to unwind the stack - it doesn't need to find all the locals and run their destructors - it just needs to find the return address. Once you've saved all the return addresses you can defer symbol lookup to later (possibly even on another machine, or requiring hitting the network for debuginfod symbols).
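The "save the return addresses now, look up symbols later" split could look roughly like this. It is a sketch under assumptions: `RawTrace` and `SymbolTable` are hypothetical types, and the symbol table is just a map from function start address to name, where a real one would come from debuginfo or a symbol server.

```rust
use std::collections::BTreeMap;

/// What gets stored at capture time: nothing but raw addresses.
pub struct RawTrace {
    pub return_addresses: Vec<u64>,
}

/// Toy symbol table keyed by function start address. A real table would
/// also know function sizes; this one assumes functions are contiguous.
pub struct SymbolTable {
    by_start: BTreeMap<u64, String>,
}

impl SymbolTable {
    pub fn new(symbols: &[(u64, &str)]) -> Self {
        let by_start = symbols
            .iter()
            .map(|&(start, name)| (start, name.to_string()))
            .collect();
        SymbolTable { by_start }
    }

    /// Name of the last function starting at or before `addr`, if any.
    pub fn resolve(&self, addr: u64) -> Option<&str> {
        self.by_start
            .range(..=addr)
            .next_back()
            .map(|(_, name)| name.as_str())
    }
}

impl RawTrace {
    /// The expensive step, run only when the trace is actually printed.
    /// Unresolved addresses are kept as hex so no frame is lost.
    pub fn symbolize(&self, symbols: &SymbolTable) -> Vec<String> {
        self.return_addresses
            .iter()
            .map(|&addr| match symbols.resolve(addr) {
                Some(name) => name.to_string(),
                None => format!("{addr:#x}"),
            })
            .collect()
    }
}
```

Because `RawTrace` is just a list of integers, it can be serialized and symbolized somewhere else entirely, as long as the module load addresses travel with it.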

¹: in eBPF, obviously, because they're running from kernel context

Debug Daily. Optimize Always | Polar Signals

Polar Signals Cloud is an always-on, zero-instrumentation continuous profiling product that helps improve performance, understand incidents, and lower infrastructure costs.

@RAOF @fasterthanlime @dysfun yes you only have to do some of the work when doing backtraces instead of proper unwinding

there's a lot of tradeoffs here in terms of correctness/speed/detail. if i wanted to dig into optimizing "oops all backtraces" in rust i would be looking into what samply does, which is specifically designed to cache and optimize the relevant data, and is all rust code

https://github.com/mstange/samply/

GitHub - mstange/samply: Command-line sampling profiler for macOS, Linux, and Windows

@RAOF @fasterthanlime @dysfun

you can also "avoid" the work of backtracing by just saving the entire stack (usually a couple MB) and general purpose cpu registers

and then do the actual analysis only if you really want to print it (this is a more extreme version of the "defer the symbol lookup" trick RAOF mentioned)

this is how minidumps work, saving all the work for later. it was also apparently a classic Trick that sampling profilers did, since dumping the entire stack on every sample and processing it in the background was faster than doing a backtrace on the spot.

samply is however optimized enough that it's faster to unwind than do the dump-the-stack hack
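As a rough illustration of the dump-the-stack trick, the sketch below copies a chunk of stack bytes plus one register at capture time and only walks it on demand. It assumes a little-endian 64-bit target and, purely to keep the later walk short, frame pointers; a real minidump processor would run a full unwinder over the saved bytes instead. `StackSnapshot` and its fields are hypothetical names.

```rust
/// Everything saved at capture time: raw stack bytes and one register.
pub struct StackSnapshot {
    /// Address that `bytes[0]` was copied from (lowest address saved).
    pub stack_base: u64,
    pub bytes: Vec<u8>,
    /// Frame pointer register value at capture time.
    pub frame_pointer: u64,
}

impl StackSnapshot {
    /// Reads a little-endian u64 out of the saved bytes, if `addr` is covered.
    fn read_u64(&self, addr: u64) -> Option<u64> {
        let offset = addr.checked_sub(self.stack_base)? as usize;
        let end = offset.checked_add(8)?;
        let chunk = self.bytes.get(offset..end)?;
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        Some(u64::from_le_bytes(word))
    }

    /// The deferred analysis: walk frame records inside the saved copy.
    /// Terminates because `fp` must strictly increase and stay in bounds.
    pub fn return_addresses(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut fp = self.frame_pointer;
        loop {
            let Some(saved_fp) = self.read_u64(fp) else { break };
            let Some(ret) = self.read_u64(fp.wrapping_add(8)) else { break };
            out.push(ret);
            if saved_fp <= fp {
                break;
            }
            fp = saved_fp;
        }
        out
    }
}
```

Capture is then a single memory copy, and all the pointer chasing happens later, off the hot path or on another thread.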

@Gankra @RAOF @dysfun fascinating!

that would do great as an eyre handler (with the required adaptation work)

@Gankra @RAOF @fasterthanlime @dysfun The unwinding part of samply is done by framehop: https://github.com/mstange/framehop

The tricky part about using it from backtrace-rs would probably be the detection of when libraries are loaded into / unloaded from the process. Or in other words, the tricky part about caching unwind rules is knowing when to invalidate the cache. I don't know how libunwind does that part.
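To illustrate the invalidation problem, here is a minimal sketch of a rule cache that is flushed whenever a "generation" number changes. Everything here is an assumption for illustration: `UnwindRule` is a stand-in for whatever a real unwinder caches, and where `current_generation` comes from (some signal that a library was loaded or unloaded) is exactly the open question. This is not how framehop or libunwind are structured.

```rust
use std::collections::HashMap;

/// Stand-in for a cached unwind rule (hypothetical, heavily simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UnwindRule {
    pub cfa_offset: i64,
}

/// Cache from instruction address to unwind rule, valid for one generation
/// of the process's set of loaded libraries.
pub struct RuleCache {
    generation: u64,
    rules: HashMap<u64, UnwindRule>,
    /// Number of times the slow path ran, exposed for demonstration.
    pub misses: usize,
}

impl RuleCache {
    pub fn new() -> Self {
        RuleCache { generation: 0, rules: HashMap::new(), misses: 0 }
    }

    /// Returns the cached rule for `pc`, computing it on a miss.
    /// If the library set changed since the last call, every cached rule
    /// is dropped first, since any address may now mean something else.
    pub fn get_or_compute(
        &mut self,
        pc: u64,
        current_generation: u64,
        compute: impl FnOnce(u64) -> UnwindRule,
    ) -> UnwindRule {
        if current_generation != self.generation {
            self.rules.clear();
            self.generation = current_generation;
        }
        if let Some(rule) = self.rules.get(&pc) {
            return *rule;
        }
        self.misses += 1;
        let rule = compute(pc);
        self.rules.insert(pc, rule);
        rule
    }
}
```

Flushing everything on any change is the bluntest possible policy; dropping only the entries inside the address range of the library that went away would keep more of the cache warm, at the cost of tracking which rule came from which module.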

GitHub - mstange/framehop: Stack unwinding library in Rust

@Gankra @RAOF @fasterthanlime @dysfun Framepointer-based unwinding has the advantage that you don't need to know anything about which libraries are loaded and where to find their unwind info. So it's much easier to manage, in addition to the faster unwinding speed.

@fasterthanlime Stuff like this is exactly what I meant here: https://github.com/rust-lang/rfcs/pull/2154#issuecomment-1753333675

Debuginfo has a chicken-egg problem where nobody has an incentive to really optimize it (meaning the entire ecosystem including unwinder, symbolication, managing symbol files, CI, etc) because nobody wants to use it in prod because it's slow. But maybe stuff like framehop (used by samply) is going to change this one day?

Debuginfo-based panic locations by main-- · Pull Request #2154 · rust-lang/rfcs
