Paul Khuong

@pkhuong@discuss.systems
439 Followers
357 Following
1.7K Posts
is it e-graph or egraph

*Boosts very welcome*

A collective I am part of is looking for a SuperMicro X11 server. We're hoping to find a second-hand one we could buy!

Do any of you tech people know where one could buy a second-hand SuperMicro X11 server? Or know of a company that might be getting rid of their old ones?

If you have leads on an X12, we would also like to hear about it.

Ideally it would be in Montreal, so we could pick it up, but we're also open to hearing about any and all opportunities!

I think I have a design problem that wants an ECS. Tell me why I'm wrong :D
@cfbolz When sampling at a fixed byte period, I worry about aliasing between the fixed-period sampling process and the profilee's potentially periodic allocation pattern. Does PyPy naturally introduce enough nondeterminism to make that a non-issue?
https://mastoxiv.page/@arXiv_csPL_bot/114731825902548197
arXiv cs.PL bot (@arXiv_csPL_bot@mastoxiv.page)

Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine
Christoph Jung, C. F. Bolz-Tereick
https://arxiv.org/abs/2506.16883 https://arxiv.org/pdf/2506.16883 https://arxiv.org/html/2506.16883
arXiv:2506.16883v1
Abstract: Compared to the more commonly used time-based profiling, allocation profiling provides an alternate view of the execution of allocation heavy dynamically typed languages. However, profiling every single allocation in a program is very inefficient. We present a sampling allocation profiler that is deeply integrated into the garbage collector of PyPy, a Python virtual machine. This integration ensures tunable low overhead for the allocation profiler, which we measure and quantify. Enabling allocation sampling profiling with a sampling period of 4 MB leads to a maximum time overhead of 25% in our benchmarks, over un-profiled regular execution.

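To make the aliasing worry concrete, here's a toy simulation (nothing to do with PyPy's actual sampler; the sites, sizes, and jitter scheme are all made up): when the allocation pattern repeats with a period that divides the fixed sampling period, every sample lands on the same allocation site, while randomizing the distance to the next sample spreads hits roughly in proportion to bytes allocated per site.

```c
/* Toy model, not PyPy's sampler: sites, sizes, and the jitter scheme are
 * invented.  Two allocation sites fire in a fixed repeating pattern; we
 * count which site each byte-threshold sample gets attributed to. */
#include <stdio.h>
#include <stdlib.h>

#define CYCLE  64              /* bytes allocated per loop iteration (48 + 16) */
#define PERIOD (CYCLE * 256)   /* fixed sampling period: an exact multiple of the cycle */

static void simulate(int jitter, unsigned seed)
{
    srand(seed);
    long hits[2] = {0, 0};     /* samples attributed to each allocation site */
    long next = jitter ? 1 + rand() % (2 * PERIOD) : PERIOD;
    long used = 0;
    const int sizes[2] = {48, 16};   /* site 0 then site 1, every iteration */

    for (long i = 0; i < 1000000; i++) {
        for (int site = 0; site < 2; site++) {
            used += sizes[site];
            if (used >= next) {              /* threshold crossed: sample this site */
                hits[site]++;
                next += jitter ? 1 + rand() % (2 * PERIOD) : PERIOD;
            }
        }
    }
    printf("%-10s site0=%ld site1=%ld\n",
           jitter ? "randomized" : "fixed", hits[0], hits[1]);
}

int main(void)
{
    simulate(0, 42);   /* fixed period: every sample lands on the same site */
    simulate(1, 42);   /* randomized gaps: hits split roughly 3:1, i.e. by bytes */
    return 0;
}
```

Randomizing each inter-sample gap (drawing it from a distribution around the nominal period rather than using a constant) is the usual way to break that kind of resonance.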
Since execution performance is readily quantified, it is most often measured and optimized--even when increased performance is of marginal value; viz. ("Dubious Achievement", Comm. of the ACM 34, 4 (April 1991), 18.)
Smart of the authors to wait until page 4 to introduce a novel fraktur-themed concept. Now too invested to nope out

[arXiv] TreeTracker Join: Simple, Optimal, Fast
https://arxiv.org/abs/2403.01631

TreeTracker gives a very simple breakdown of the core differences between a naive binary join and an optimal multi-way join.

Boxed types, what's the history / origin?

I can't find anything on https://en.wikipedia.org/wiki/Boxing_(computer_programming)

tips or hints?

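For context on what the term means (a minimal C sketch; the types and helpers are invented for the example and aren't tied to any particular runtime): an unboxed value is just its bits, held directly in a variable, register, or array slot, while a boxed value lives behind a pointer to a heap cell that usually carries a type tag, which is what lets it flow through uniform "any value" interfaces.

```c
#include <stdio.h>
#include <stdlib.h>

/* Unboxed: the value is just its bits, held directly in a variable,
 * register, or array slot. */
typedef long unboxed_int;

/* Boxed: the value lives in a heap cell behind a pointer, usually with a
 * type tag so generic code can tell what it is holding. */
enum tag { TAG_INT, TAG_DOUBLE };
typedef struct box {
    enum tag tag;
    union { long i; double d; } payload;
} box;

static box *box_int(long i)
{
    box *b = malloc(sizeof *b);   /* boxing = allocate + tag + store */
    b->tag = TAG_INT;
    b->payload.i = i;
    return b;
}

/* Uniform-representation code like this only works because every argument
 * is a pointer-sized box, whatever its dynamic type. */
static void print_any(const box *b)
{
    if (b->tag == TAG_INT)
        printf("int: %ld\n", b->payload.i);
    else
        printf("double: %f\n", b->payload.d);
}

int main(void)
{
    unboxed_int x = 42;        /* no allocation, no indirection */
    box *y = box_int(x);       /* one allocation, one pointer chase to read */
    print_any(y);
    free(y);
    return 0;
}
```

The trade-off is visible right there: boxing pays an allocation and a pointer chase per value in exchange for a uniform representation.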

What's the state of Java decompilation these days, can I finally Just Do It?

By which I mean: take a Java binary, run it through the decompiler, directly compile its output, and without manual intervention it Just Works. Previous problems include (but are not limited to) the output of the decompiler being bare Java source not contained in a project, obfuscated names being invalid identifiers, uninitialized variables tripping up the Definite Assignment logic despite never actually being read...

Claire Huang wrote an undergraduate honours thesis, supervised by @steveblackburn and @caizixian https://www.steveblackburn.org/pubs/theses/huang-2025.pdf

She uses sampling PEBS counters and data linear addressing (DLA) on Intel chips to attempt to understand the structure and attribution of load latencies in MMTk.

After identifying L1 misses in the trace loop as a significant overhead, she adds prefetching and reduces GC time by 10% or so across a range of benchmarks, and more on Zen4.

Lots of fun details in Huang's thesis: static vs dynamic prefetching (static is fine); computation of how much one could gain if cache-miss latency were eliminated; what the GC time would be if hardware prefetchers were disabled (20-80% slower; see attached figure); mutator time without prefetchers (sometimes it's better??!?); how to use "perf mem"; all good stuff!

I don't know what's in the water at ANU but they have been doing lots of great work at all levels recently

@wingo Thanks Andy.

What’s in the water? Just a wonderfully talented and intellectually generous bunch of students. In particular there’s a great culture of generosity in the lab that @caizixian and other students have built. That culture encourages talented undergrads like Claire to join and flourish. I’m very grateful.

@steveblackburn @caizixian sincere congrats on building such an environment. as a faraway « consumer » of y’all’s work i appreciate whatever it is that went into the making; the generosity shines through!

@wingo @caizixian

It did not come from nowhere. @xiyang set the tone, passing the torch to the current students. He epitomised technical excellence + intellectual generosity. Thank you, Xi 🙏

@wingo @steveblackburn @caizixian This also reminds me that Stephen Dolan added prefetching to OCaml's GC a few years ago: https://github.com/ocaml/ocaml/pull/10195

"On the few programs it's been tested on, marking time is reduced by 1/3 - 2/3, leading to overall performance improvements of anywhere around 5-20%, depending on how much GC the program does. (More benchmarking is needed!)"

I originally heard about it from this podcast episode: https://signalsandthreads.com/memory-management/

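The shape of the trick in both the MMTk trace loop and the OCaml PR, as I understand it (a rough sketch of the general prefetch-buffer idea only; the object layout and mark-stack helpers below are invented, and neither codebase looks like this): don't trace an object the moment it's discovered; route it through a small FIFO, issue a software prefetch as it enters, and only dereference it once it reaches the other end, by which point the cache line has hopefully arrived.

```c
/* Sketch only: invented object layout and mark-stack helpers. */
#include <stddef.h>

typedef struct obj {
    size_t nfields;
    struct obj *fields[];          /* outgoing pointers */
} obj;

extern int  is_marked(obj *);      /* hypothetical mark-bit helpers */
extern void set_marked(obj *);
extern obj *pop_mark_stack(void);  /* returns NULL when the stack is empty */
extern void push_mark_stack(obj *);

enum { FIFO = 16 };                /* prefetch distance, in objects */

/* Assumes the roots have already been pushed onto the mark stack. */
void mark_loop(void)
{
    obj *fifo[FIFO] = {0};
    size_t head = 0;
    size_t pending = 0;            /* non-NULL objects still in the FIFO */

    for (;;) {
        obj *next = pop_mark_stack();
        if (next == NULL && pending == 0)
            break;                          /* stack drained, FIFO empty: done */

        /* Rotate the FIFO: take out the entry inserted FIFO iterations ago,
         * put the freshly popped object in its place. */
        obj *cur = fifo[head % FIFO];
        fifo[head % FIFO] = next;
        head++;

        if (next != NULL) {
            __builtin_prefetch(next);       /* GCC/Clang builtin: start the load now */
            pending++;
        }
        if (cur == NULL)
            continue;                       /* this FIFO slot was empty */
        pending--;

        if (is_marked(cur))
            continue;
        set_marked(cur);
        for (size_t i = 0; i < cur->nfields; i++)
            if (cur->fields[i] != NULL)
                push_mark_stack(cur->fields[i]);   /* children get prefetched on their turn */
    }
}
```

The FIFO depth is the main knob: too shallow and the prefetch hasn't completed by the time the object is traced, too deep and the line may already be evicted again before use.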
@wingo @steveblackburn @caizixian The term "GC performance" is doing some heavy lifting here. GC time is dominated by marking. Was the garbage collector generational to avoid tracing the old gen? Was the performance perceptible on a workload or did the wasted hardware slots just get consumed by real work on the other SMT thread? I'm always amazed by papers being surprised that prefetching makes a difference. My unfortunate experience with most GC papers is they tend to be full of half-truths.
@irogers @steveblackburn @caizixian in this case they are interested in the “main” spaces so they configure the collectors in a non-generational, non-concurrent, non-parallel mode. lots of details in the paper