Paul Khuong

ppb/examples/picoscope.c at main · pkhuong/ppb

A non-allocating lexer for protocol buffers.

GitHub

There are a bunch of simplistic JSON parsers. I felt the need to write something similar for protobuf, so here's https://github.com/pkhuong/ppb

The basic idea: you predeclare all the fields you're interested in (optionally including catch-all fields), prescan to gather per-field stats like number of occurrences and total number of bytes (just enough to preallocate), then lex the same bytes again to get individual field info, in order.

GitHub - pkhuong/ppb: A non-allocating lexer for protocol buffers

A non-allocating lexer for protocol buffers.

GitHub
I find proof engineering with Frama-C's WP plugin can be pretty churny, but nearly all of the churn is in annotations. Is there a good way to isolate proof-only changes from regular code changes in version control?

Speedability of computably approximable reals and their approximations

George Barmpalias, Nan Fang, Wolfgang Merkle, Ivan Titov
https://arxiv.org/abs/2603.26484 https://arxiv.org/pdf/2603.26484 https://arxiv.org/html/2603.26484

arXiv:2603.26484v1 Announce Type: new
Abstract: An approximation of a real is a sequence of rational numbers that converges to the real. An approximation is left-c.e. if it is computable and nondecreasing and is d.c.e. if it is computable and has bounded variation. A real is computably approximable if it has some computable approximation, and left-c.e. and d.c.e. reals are defined accordingly.
An approximation $\{a_s\}_{s \in \omega}$ is speedable if there exists a nondecreasing computable function $f$ such that the approximation $\{a_{f(s)}\}_{s \in \omega}$ converges in a certain formal sense faster than $\{a_s\}_{s \in \omega}$. This leads to various notions of speedability for reals, e.g., one may require for a computably approximable real that either all or some of its approximations of a specific type are speedable.
Merkle and Titov established the equivalence of several speedability notions for left-c.e. reals that are defined in terms of left-c.e. approximations. We extend these results to d.c.e. reals and d.c.e. approximations, and we prove that in this setting, being speedable is equivalent to not being Martin-L\"{o}f random. Finally, we demonstrate that every computably approximable real has a computable approximation that is speedable.
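For a rough sense of the central definition (my paraphrase of the Merkle–Titov-style notion, not the paper's exact statement): an approximation $\{a_s\}$ of $\alpha$ is $\rho$-speedable, for $\rho \in (0,1)$, if there is a nondecreasing computable $f$ with

$$\liminf_{s \to \infty} \frac{|\alpha - a_{f(s)}|}{|\alpha - a_s|} \le \rho,$$

and speedable if it is $\rho$-speedable for some $\rho < 1$, i.e., the reindexed approximation is infinitely often a constant factor closer to the limit.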

toXiv_bot_toot

https://arxiv.org/abs/2601.09477 that's good work, and I enjoyed the bit where they swapped out the DFT for a WHT in Pagh's thing (you don't always want full https://arxiv.org/abs/1909.01410)
Engineering Compressed Matrix Multiplication with the Fast Walsh-Hadamard Transform

We present an implementation of Pagh's compressed matrix multiplication algorithm, a randomized algorithm that constructs sketches of matrices to compute an unbiased estimate of their product. By leveraging fast polynomial multiplication via the FFT, the algorithm achieves high performance when the product matrix is sparse or contains only a small number of entries with magnitudes significantly larger than the rest. We show empirically that the algorithm is practical and can outperform state-of-the-art DGEMM implementations when the product matrix has few nonzero entries or is otherwise dominated by a small subset of elements with large magnitude. As a minor theoretical contribution, we replace the FFT with the Fast Walsh-Hadamard Transform (FWHT) in sketched multiplication, preserving all correctness and variance guarantees of the original algorithm. Experiments with our carefully engineered multithreaded CPU implementation for dense double-precision matrices on 64-core CPU nodes across a range of synthetic benchmarks, exhibiting variable sparsity patterns, show that the FWHT variant is up to 4 times faster than the FFT-based version. Under favorable sparsity and magnitude patterns in the product matrix, our FWHT-based implementation achieves a speedup of up to 40 over DGEMM from Intel MKL, with low probability of error in the estimates. Our implementation is released as free software and comes with NumPy-compatible Python bindings.

arXiv.org
@wolf480pl @pervognsen more or less crazy than cross-modifying code? (https://github.com/backtrace-labs/dynamic_flag)
GitHub - backtrace-labs/dynamic_flag: A C library for runtime-flippable feature flags on Linux/x86-64, with negligible overhead in the common case


GitHub
@itamarst @amonakov @pervognsen you can specify the TID. IIRC, the problem with specifying the core is that you can get access to counters that don't clearly map to a process/uid.
@TaliaRinger Sounds like federated learning. I've heard of use cases for ML at the edge, and maybe institutions with proprietary data they don't want to share but are willing to mix into a shared model.

So like I mentioned I joined Antithesis a little while back. When I did, I pitched them on a crazy idea. Antithesis... Hypothesis... Hegel!

A remarkably short number of months later: Hegel. Hegel is a property-based testing protocol and a family of client libraries that make it easy to do Hypothesis-grade PBT everywhere.

Today: Rust. Tomorrow: The world! (Muahaha)

https://antithesis.com/blog/2026/hegel/

Hypothesis, Antithesis, synthesis

Introducing Hegel, our new family of property-based testing libraries.

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
https://arxiv.org/abs/2603.19173

> Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.

Cf. Sec. 4.2 SOL Bound Derivation, 4.3 Metric: SOL Score

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.

arXiv.org