Paul Khuong

So, like I mentioned, I joined Antithesis a little while back. When I did, I pitched them on a crazy idea. Antithesis... Hypothesis... Hegel!

A remarkably short number of months later: Hegel. Hegel is a property-based testing protocol and family of client libraries that make it easy to do Hypothesis-grade PBT everywhere.

Today: Rust. Tomorrow: The world! (Muahaha)

https://antithesis.com/blog/2026/hegel/

Hypothesis, Antithesis, synthesis

Introducing Hegel, our new family of property-based testing libraries.

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
https://arxiv.org/abs/2603.19173

> Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.

Cf. Sec. 4.2 SOL Bound Derivation, 4.3 Metric: SOL Score

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
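
A minimal sketch of one way to read "how much of the gap ... a candidate kernel closes", assuming a simple linear form over runtimes. This is an assumption, not the paper's definition: Sec. 4.3 of the paper gives the actual SOL Score, which may differ. The function and parameter names are hypothetical.

```python
def sol_score(t_candidate: float, t_baseline: float, t_sol: float) -> float:
    """Fraction of the baseline-to-SOL runtime gap closed by a candidate.

    ASSUMED linear form, for illustration only:
      0.0 -> candidate is as slow as the scoring baseline
      1.0 -> candidate reaches the Speed-of-Light bound
    Values below 0 mean the candidate is slower than the baseline.
    """
    if not 0 < t_sol < t_baseline:
        raise ValueError("expected 0 < t_sol < t_baseline")
    return (t_baseline - t_candidate) / (t_baseline - t_sol)


if __name__ == "__main__":
    # Hypothetical runtimes in microseconds.
    print(sol_score(t_candidate=75.0, t_baseline=100.0, t_sol=50.0))
```

The point of such a metric is that the denominator is fixed by the hardware bound and the release-defined baseline, so the target does not move when someone ships a faster software baseline.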

The PS Portal's handling of captive portals is the best I've seen so far (the portal acts as an AP, another device connects to it and clicks on the magic button that unlocks the network). I wonder why I don't see that everywhere; how much did that feature add to their COGS?

won't say I'm totally proud of myself here, but once I saw that the Claude C compiler was super buggy according to YARPGen and Csmith, I had a hard time preventing myself from doing something about it

https://john.regehr.org/writing/claude_c_compiler.html

A blog post I wrote about folios

https://blogs.oracle.com/linux/intro-to-folios

If you're a Swiss resident (or know one...), you can get a custom hardware design onto an ASIC for FREE!!!

You get back one chip with your design (and a bunch of others!) and a board to help with bring-up and testing.

Also, this is "resident": no need to be a student or at uni or anything, you just need a valid Swiss address.

I know I might be repeating myself here because I mentioned it a few months back, but this bears repeating IMHO.

https://swisschips.ethz.ch/news-and-events/tiny-tapeout-submission-form.html

PS: Can I haz boost for reach?

Tiny Tapeout Submission Form: for shuttles & workshops

Shrunk the machine code a little here: https://github.com/pkhuong/tiny_batcher. The x86-64 build now clocks in at 199 bytes (224 on aarch64)! I think x86-64 might benefit from using the high-byte registers, but hopefully simpler tricks can get us to <= 3 cache lines (192 bytes with 64-byte lines).
GitHub - pkhuong/tiny_batcher: a size-optimised sorting library for C and C++

As I compared https://github.com/pkhuong/tiny_batcher/blob/main/reference.py with the expected number of compare-exchanges for Batcher's odd-even merge sort, it became clear that Knuth's Algorithm M (Sec. 5.2.2, p. 111 in TAoCP vol. 3, 2nd ed.) isn't that algorithm.

Is there a reference for the exact number of compare-exchanges I should expect for small n?

@jix maybe?
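
For what it's worth, a minimal sketch that counts compare-exchanges by generating the textbook recursive odd-even merge sort network. It is not taken from `reference.py`. For n that isn't a power of two, it assumes one common convention: build the network for the next power of two and drop every comparator that touches a padded wire. Other conventions may give different counts. All function names are hypothetical.

```python
from itertools import product


def _merge(lo, hi, r):
    """Odd-even merge of wires lo..hi (inclusive) at stride r."""
    step = 2 * r
    if step < hi - lo:
        yield from _merge(lo, hi, step)
        yield from _merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def _sort(lo, hi):
    """Odd-even merge sort of wires lo..hi (inclusive, power-of-two width)."""
    if hi > lo:
        mid = lo + (hi - lo) // 2
        yield from _sort(lo, mid)
        yield from _sort(mid + 1, hi)
        yield from _merge(lo, hi, 1)


def batcher_network(n):
    """Comparators (i, j) with i < j for n inputs.

    ASSUMED convention for non-power-of-two n: pad to the next power of
    two (conceptually with +inf) and drop comparators touching wires >= n.
    """
    size = 1
    while size < n:
        size *= 2
    return [(i, j) for i, j in _sort(0, size - 1) if j < n]


def sorts_all_binary_inputs(n):
    """Zero-one principle: a network that sorts all 0/1 inputs sorts everything."""
    net = batcher_network(n)
    for bits in product((0, 1), repeat=n):
        x = list(bits)
        for i, j in net:
            if x[i] > x[j]:
                x[i], x[j] = x[j], x[i]
        if x != sorted(x):
            return False
    return True


def closed_form(k):
    """Comparator count for n = 2**k: (k^2 - k + 4) * 2^(k-2) - 1."""
    return (k * k - k + 4) * 2**k // 4 - 1


if __name__ == "__main__":
    for n in range(2, 17):
        print(n, len(batcher_network(n)))
```

For n = 2, 4, 8, 16 the generated networks have 1, 5, 19, and 63 comparators, matching the closed form. The zero-one check is exponential in n, so it's only practical as a sanity check for small sizes.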

A couple months later, I still like my take on generic comparison sorting without function pointers https://github.com/pkhuong/tiny_batcher

We're hiring in the US (EST ideal) for large-scale distributed database work (C++20; think Snowflake-style compute/storage separation). Hit me up if you have any questions.

https://job-boards.greenhouse.io/redpandadata/jobs/4659296005

Senior Software Engineer, Core SQL

United States, Canada