Claire Huang wrote an undergraduate honours thesis, supervised by @steveblackburn and @caizixian https://www.steveblackburn.org/pubs/theses/huang-2025.pdf

She uses sampling PEBS counters and the data linear address (DLA) field on Intel chips in an attempt to understand the structure and attribution of load latencies in MMTk.

After identifying L1 misses in the trace loop as a significant overhead, she adds prefetching, reducing GC time by around 10% across a range of benchmarks, and by more on Zen 4.

Lots of fun details in Huang's thesis: static vs dynamic prefetching (static is fine), computation of how much one could gain if cache-miss latency were eliminated, what the GC time would be if hardware prefetchers were disabled (20-80% slower; see attached figure); mutator time without prefetchers (sometimes it's better??!?); how to use "perf mem"; all good stuff!
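For the curious, the general trick behind prefetching in a trace loop is to put a small FIFO between popping an object off the mark stack and actually tracing it, issuing a software prefetch when the object enters the FIFO so the cache line has (hopefully) arrived by the time the object is visited. Here is a minimal C sketch of that idea; the names, structure, and `PF_DIST` value are mine, not MMTk's or OCaml's:

```c
#include <stddef.h>

#define PF_DIST 8  /* objects kept in flight between prefetch and visit */

typedef struct Obj {
    int marked;
    size_t nchildren;
    struct Obj *child[4];
} Obj;

/* Mark `o` and push its unmarked children onto the mark stack. */
static void visit(Obj *o, Obj **stack, size_t *top, size_t max, size_t *count) {
    if (o->marked) return;
    o->marked = 1;
    (*count)++;
    for (size_t i = 0; i < o->nchildren && *top < max; i++)
        if (o->child[i] && !o->child[i]->marked)
            stack[(*top)++] = o->child[i];
}

/* Mark everything reachable from `roots`; returns the number of objects
 * marked.  Objects pass through a small ring-buffer FIFO: prefetched on
 * entry, traced only on exit, so the cache miss overlaps useful work. */
size_t mark(Obj **roots, size_t nroots) {
    enum { MAX = 4096 };
    Obj *stack[MAX];
    size_t top = 0, count = 0;
    Obj *ring[PF_DIST];
    size_t rhead = 0, rcount = 0;

    for (size_t i = 0; i < nroots && top < MAX; i++)
        stack[top++] = roots[i];

    while (top > 0 || rcount > 0) {
        /* Refill the FIFO from the mark stack, prefetching as we go. */
        while (top > 0 && rcount < PF_DIST) {
            Obj *o = stack[--top];
            __builtin_prefetch(o, 0 /* read */, 3 /* keep in all caches */);
            ring[(rhead + rcount) % PF_DIST] = o;
            rcount++;
        }
        /* Trace the oldest entry: the one prefetched longest ago. */
        Obj *o = ring[rhead];
        rhead = (rhead + 1) % PF_DIST;
        rcount--;
        visit(o, stack, &top, MAX, &count);
    }
    return count;
}
```

An object can enter the FIFO more than once (it may be pushed before it is marked); the `marked` check in `visit` makes duplicates harmless. Real implementations go further: tuning the prefetch distance per microarchitecture, prefetching headers and fields separately, and so on.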

I don't know what's in the water at ANU, but they have been doing lots of great work at all levels recently.

@wingo Thanks Andy.

What’s in the water? Just a wonderfully talented and intellectually generous bunch of students. In particular there’s a great culture of generosity in the lab that @caizixian and other students have built. That culture encourages talented undergrads like Claire to join and flourish. I’m very grateful.

@steveblackburn @caizixian sincere congrats on building such an environment. as a faraway « consumer » of y’all’s work i appreciate whatever it is that went into the making; the generosity shines through!

@wingo @caizixian

It did not come from nowhere. @xiyang set the tone, passing the torch to the current students. He epitomised technical excellence + intellectual generosity. Thank you, Xi 🙏

@wingo @steveblackburn @caizixian This also reminds me that Stephen Dolan added prefetching to OCaml's GC a few years ago: https://github.com/ocaml/ocaml/pull/10195

"On the few programs it's been tested on, marking time is reduced by 1/3 - 2/3, leading to overall performance improvements of anywhere around 5-20%, depending on how much GC the program does. (More benchmarking is needed!)"

I originally heard about it from this podcast episode: https://signalsandthreads.com/memory-management/

@wingo @steveblackburn @caizixian The term "GC performance" is doing some heavy lifting here. GC time is dominated by marking. Was the garbage collector generational, to avoid tracing the old gen? Was the improvement perceptible on a workload, or did the wasted hardware slots just get consumed by real work on the other SMT thread? I'm always amazed when papers are surprised that prefetching makes a difference. My unfortunate experience with most GC papers is that they tend to be full of half-truths.
@irogers @steveblackburn @caizixian in this case they are interested in the “main” spaces, so they configure the collectors in a non-generational, non-concurrent, non-parallel mode. lots of details in the paper