Claire Huang wrote an undergraduate honors thesis, supervised by @steveblackburn and @caizixian https://www.steveblackburn.org/pubs/theses/huang-2025.pdf

She uses sampled PEBS counters and data linear addresses (DLA) on Intel chips to understand the structure and attribution of load latencies in MMTk.

After identifying L1 misses in the trace loop as a significant overhead, she adds prefetching and reduces GC time by around 10% across a range of benchmarks, with larger gains on Zen 4.
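To give a sense of the trick, here is a minimal Rust sketch of prefetching in a mark loop. This is not Huang's or MMTk's actual code: the Node type, the mark function, and PREFETCH_DISTANCE are all made-up names for illustration. Newly discovered references pass through a small FIFO window; a prefetch is issued when a reference enters the window, and the object is only dereferenced after it has cycled through, giving the load time to resolve.

```rust
use std::collections::VecDeque;

// Toy heap node: a mark flag plus outgoing edges, addressed by index.
struct Node {
    marked: bool,
    children: Vec<usize>,
}

// Tuning knob: too small and the prefetch hasn't landed when we
// dereference; too large and the line may be evicted before use.
const PREFETCH_DISTANCE: usize = 8;

/// Mark loop with a FIFO prefetch window over the work list.
fn mark(heap: &mut [Node], roots: &[usize]) {
    let mut window: VecDeque<usize> = VecDeque::with_capacity(PREFETCH_DISTANCE);
    let mut work: Vec<usize> = roots.to_vec();

    loop {
        // Refill the window, prefetching each reference as soon as it is known.
        while window.len() < PREFETCH_DISTANCE {
            match work.pop() {
                Some(idx) => {
                    #[cfg(target_arch = "x86_64")]
                    unsafe {
                        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                        _mm_prefetch::<_MM_HINT_T0>(&heap[idx] as *const Node as *const i8);
                    }
                    window.push_back(idx);
                }
                None => break,
            }
        }
        // Dereference only the oldest entry, whose prefetch has had
        // PREFETCH_DISTANCE iterations to complete.
        match window.pop_front() {
            Some(idx) => {
                if !heap[idx].marked {
                    heap[idx].marked = true;
                    work.extend(heap[idx].children.iter().copied());
                }
            }
            None => break, // window and work list both drained: marking done
        }
    }
}

fn main() {
    // Tiny 4-node graph: 0 -> {1, 2}, 2 -> {3}.
    let mut heap = vec![
        Node { marked: false, children: vec![1, 2] },
        Node { marked: false, children: vec![] },
        Node { marked: false, children: vec![3] },
        Node { marked: false, children: vec![] },
    ];
    mark(&mut heap, &[0]);
    assert!(heap.iter().all(|n| n.marked));
}
```

The same idea underlies the OCaml PR below, which uses a ring buffer to delay dereferencing newly discovered references during marking.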

@wingo @steveblackburn @caizixian This also reminds me that Stephen Dolan added prefetching to OCaml's GC a few years ago: https://github.com/ocaml/ocaml/pull/10195

"On the few programs it's been tested on, marking time is reduced by 1/3 - 2/3, leading to overall performance improvements of anywhere around 5-20%, depending on how much GC the program does. (More benchmarking is needed!)"

I originally heard about it from this podcast episode: https://signalsandthreads.com/memory-management/
