Paul Khuong

@pkhuong@discuss.systems
439 Followers
357 Following
1.7K Posts
is it e-graph or egraph

*Boosts very welcome*

A collective I am part of is looking for a SuperMicro X11 server. We're hoping to find a second-hand one we could buy!

Do any of you tech people know where one could buy a second-hand SuperMicro X11 server? Or know of a company that might be getting rid of their old ones?

If you have leads on an X12, we would also like to hear about it.

Ideally it would be in Montreal, so we could pick it up, but we're also open to hearing about any and all opportunities!

I think I have a design problem that wants an ECS. Tell me why I'm wrong :D
@cfbolz When sampling at a fixed byte period, I worry about aliasing between the fixed-period sampling process and the profilee's potentially periodic allocation pattern. Does PyPy naturally introduce enough nondeterminism to make that a non-issue?
https://mastoxiv.page/@arXiv_csPL_bot/114731825902548197
arXiv cs.PL bot (@arXiv_csPL_bot@mastoxiv.page)

Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine
Christoph Jung, C. F. Bolz-Tereick
https://arxiv.org/abs/2506.16883 https://arxiv.org/pdf/2506.16883 https://arxiv.org/html/2506.16883
arXiv:2506.16883v1
Abstract: Compared to the more commonly used time-based profiling, allocation profiling provides an alternate view of the execution of allocation heavy dynamically typed languages. However, profiling every single allocation in a program is very inefficient. We present a sampling allocation profiler that is deeply integrated into the garbage collector of PyPy, a Python virtual machine. This integration ensures tunable low overhead for the allocation profiler, which we measure and quantify. Enabling allocation sampling profiling with a sampling period of 4 MB leads to a maximum time overhead of 25% in our benchmarks, over un-profiled regular execution.

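To make the aliasing worry concrete, here's a toy simulation (nothing to do with PyPy's actual sampler; the sites, sizes, and jitter scheme are all made up): when the allocation pattern repeats with a period that divides the fixed sampling period, every sample lands on the same allocation site, while randomizing the distance to the next sample spreads hits roughly in proportion to bytes allocated per site.

```c
/* Toy model, not PyPy's sampler: sites, sizes, and the jitter scheme are
 * invented.  Two allocation sites fire in a fixed repeating pattern; we
 * count which site each byte-threshold sample gets attributed to. */
#include <stdio.h>
#include <stdlib.h>

#define CYCLE  64              /* bytes allocated per loop iteration (48 + 16) */
#define PERIOD (CYCLE * 256)   /* fixed sampling period: an exact multiple of the cycle */

static void simulate(int jitter, unsigned seed)
{
    srand(seed);
    long hits[2] = {0, 0};     /* samples attributed to each allocation site */
    long next = jitter ? 1 + rand() % (2 * PERIOD) : PERIOD;
    long used = 0;
    const int sizes[2] = {48, 16};   /* site 0 then site 1, every iteration */

    for (long i = 0; i < 1000000; i++) {
        for (int site = 0; site < 2; site++) {
            used += sizes[site];
            if (used >= next) {              /* threshold crossed: sample this site */
                hits[site]++;
                next += jitter ? 1 + rand() % (2 * PERIOD) : PERIOD;
            }
        }
    }
    printf("%-10s site0=%ld site1=%ld\n",
           jitter ? "randomized" : "fixed", hits[0], hits[1]);
}

int main(void)
{
    simulate(0, 42);   /* fixed period: every sample lands on the same site */
    simulate(1, 42);   /* randomized gaps: hits split roughly 3:1, i.e. by bytes */
    return 0;
}
```

Randomizing each inter-sample gap (drawing it from a distribution around the nominal period rather than using a constant) is the usual way to break that kind of resonance.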
Since execution performance is readily quantified, it is most often measured and optimized--even when increased performance is of marginal value; viz. ("Dubious Achievement", Comm. of the ACM 34, 4 (April 1991), 18.)
Smart of the authors to wait until page 4 to introduce a novel fraktur-themed concept. Now too invested to nope out

[arXiv] TreeTracker Join: Simple, Optimal, Fast
https://arxiv.org/abs/2403.01631

TreeTracker gives a very simple breakdown of the core differences between a naive binary join and an optimal multi-way join.

Boxed types, what's the history / origin?

I can't find anything on https://en.wikipedia.org/wiki/Boxing_(computer_programming)

tips or hints?

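For context on what the term means (a minimal C sketch; the types and helpers are invented for the example and aren't tied to any particular runtime): an unboxed value is just its bits, held directly in a variable, register, or array slot, while a boxed value lives behind a pointer to a heap cell that usually carries a type tag, which is what lets it flow through uniform "any value" interfaces.

```c
#include <stdio.h>
#include <stdlib.h>

/* Unboxed: the value is just its bits, held directly in a variable,
 * register, or array slot. */
typedef long unboxed_int;

/* Boxed: the value lives in a heap cell behind a pointer, usually with a
 * type tag so generic code can tell what it is holding. */
enum tag { TAG_INT, TAG_DOUBLE };
typedef struct box {
    enum tag tag;
    union { long i; double d; } payload;
} box;

static box *box_int(long i)
{
    box *b = malloc(sizeof *b);   /* boxing = allocate + tag + store */
    b->tag = TAG_INT;
    b->payload.i = i;
    return b;
}

/* Uniform-representation code like this only works because every argument
 * is a pointer-sized box, whatever its dynamic type. */
static void print_any(const box *b)
{
    if (b->tag == TAG_INT)
        printf("int: %ld\n", b->payload.i);
    else
        printf("double: %f\n", b->payload.d);
}

int main(void)
{
    unboxed_int x = 42;        /* no allocation, no indirection */
    box *y = box_int(x);       /* one allocation, one pointer chase to read */
    print_any(y);
    free(y);
    return 0;
}
```

The trade-off is visible right there: boxing pays an allocation and a pointer chase per value in exchange for a uniform representation.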

What's the state of Java decompilation these days, can I finally Just Do It?

By which I mean: take a Java binary, run it through the decompiler, directly compile its output, and without manual intervention it Just Works. Previous problems include (but are not limited to) the output of the decompiler being bare Java source not contained in a project, obfuscated names being invalid identifiers, uninitialized variables tripping up the Definite Assignment logic despite never actually being read...

Claire Huang wrote an undergraduate honours thesis, supervised by @steveblackburn and @caizixian https://www.steveblackburn.org/pubs/theses/huang-2025.pdf

She uses sampling PEBS counters and data linear addressing (DLA) on Intel chips to attempt to understand the structure and attribution of load latencies in MMTk.

After identifying L1 misses in the trace loop as a significant overhead, she adds prefetching and reduces GC time by 10% or so across a range of benchmarks, and more on Zen4.

Lots of fun details in Huang's thesis: static vs dynamic prefetching (static is fine); computation of how much one could gain if cache-miss latency were eliminated; what the GC time would be if hardware prefetchers were disabled (20-80% slower; see attached figure); mutator time without prefetchers (sometimes it's better??!?); how to use "perf mem"; all good stuff!

I don't know what's in the water at ANU but they have been doing lots of great work at all levels recently

@wingo Thanks Andy.

What’s in the water? Just a wonderfully talented and intellectually generous bunch of students. In particular there’s a great culture of generosity in the lab that @caizixian and other students have built. That culture encourages talented undergrads like Claire to join and flourish. I’m very grateful.

@steveblackburn @caizixian sincere congrats on building such an environment. as a faraway « consumer » of y’all’s work i appreciate whatever it is that went into the making; the generosity shines through!

@wingo @caizixian

It did not come from nowhere. @xiyang set the tone, passing the torch to the current students. He epitomised technical excellence + intellectual generosity. Thank you, Xi 🙏

@wingo @steveblackburn @caizixian This also reminds me that Stephen Dolan added prefetching to OCaml's GC a few years ago: https://github.com/ocaml/ocaml/pull/10195

"On the few programs it's been tested on, marking time is reduced by 1/3 - 2/3, leading to overall performance improvements of anywhere around 5-20%, depending on how much GC the program does. (More benchmarking is needed!)"

I originally heard about it from this podcast episode: https://signalsandthreads.com/memory-management/

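The shape of the trick in both the MMTk trace loop and the OCaml PR, as I understand it (a rough sketch of the general prefetch-buffer idea only; the object layout and mark-stack helpers below are invented, and neither codebase looks like this): don't trace an object the moment it's discovered; route it through a small FIFO, issue a software prefetch as it enters, and only dereference it once it reaches the other end, by which point the cache line has hopefully arrived.

```c
/* Sketch only: invented object layout and mark-stack helpers. */
#include <stddef.h>

typedef struct obj {
    size_t nfields;
    struct obj *fields[];          /* outgoing pointers */
} obj;

extern int  is_marked(obj *);      /* hypothetical mark-bit helpers */
extern void set_marked(obj *);
extern obj *pop_mark_stack(void);  /* returns NULL when the stack is empty */
extern void push_mark_stack(obj *);

enum { FIFO = 16 };                /* prefetch distance, in objects */

/* Assumes the roots have already been pushed onto the mark stack. */
void mark_loop(void)
{
    obj *fifo[FIFO] = {0};
    size_t head = 0;
    size_t pending = 0;            /* non-NULL objects still in the FIFO */

    for (;;) {
        obj *next = pop_mark_stack();
        if (next == NULL && pending == 0)
            break;                          /* stack drained, FIFO empty: done */

        /* Rotate the FIFO: take out the entry inserted FIFO iterations ago,
         * put the freshly popped object in its place. */
        obj *cur = fifo[head % FIFO];
        fifo[head % FIFO] = next;
        head++;

        if (next != NULL) {
            __builtin_prefetch(next);       /* GCC/Clang builtin: start the load now */
            pending++;
        }
        if (cur == NULL)
            continue;                       /* this FIFO slot was empty */
        pending--;

        if (is_marked(cur))
            continue;
        set_marked(cur);
        for (size_t i = 0; i < cur->nfields; i++)
            if (cur->fields[i] != NULL)
                push_mark_stack(cur->fields[i]);   /* children get prefetched on their turn */
    }
}
```

The FIFO depth is the main knob: too shallow and the prefetch hasn't completed by the time the object is traced, too deep and the line may already be evicted again before use.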
@wingo @steveblackburn @caizixian The term "GC performance" is doing some heavy lifting here. GC time is dominated by marking. Was the garbage collector generational to avoid tracing the old gen? Was the performance perceptible on a workload or did the wasted hardware slots just get consumed by real work on the other SMT thread? I'm always amazed by papers being surprised that prefetching makes a difference. My unfortunate experience with most GC papers is they tend to be full of half-truths.
@irogers @steveblackburn @caizixian in this case they are interested in the “main” spaces so they configure the collectors in a non-generational, non-concurrent, non-parallel mode. lots of details in the paper