Paul Khuong


Speedability of computably approximable reals and their approximations

George Barmpalias, Nan Fang, Wolfgang Merkle, Ivan Titov
https://arxiv.org/abs/2603.26484 https://arxiv.org/pdf/2603.26484 https://arxiv.org/html/2603.26484

Abstract: An approximation of a real is a sequence of rational numbers that converges to the real. An approximation is left-c.e. if it is computable and nondecreasing and is d.c.e. if it is computable and has bounded variation. A real is computably approximable if it has some computable approximation, and left-c.e. and d.c.e. reals are defined accordingly.
An approximation $\{a_s\}_{s \in \omega}$ is speedable if there exists a nondecreasing computable function $f$ such that the approximation $\{a_{f(s)}\}_{s \in \omega}$ converges in a certain formal sense faster than $\{a_s\}_{s \in \omega}$. This leads to various notions of speedability for reals, e.g., one may require for a computably approximable real that either all or some of its approximations of a specific type are speedable.
Merkle and Titov established the equivalence of several speedability notions for left-c.e. reals that are defined in terms of left-c.e. approximations. We extend these results to d.c.e. reals and d.c.e. approximations, and we prove that in this setting, being speedable is equivalent to not being Martin-L\"{o}f random. Finally, we demonstrate that every computably approximable real has a computable approximation that is speedable.
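The paper's "converges in a certain formal sense faster" is made precise there; here is only a toy illustration under my own assumptions (the sequence, the speed-up function, and the "faster means the error ratio tends to 0" reading are all illustrative, not the paper's definitions):

```python
from fractions import Fraction

# Toy illustration: approximate x = 1/3 by the computable,
# nondecreasing (left-c.e.-style) approximation a_s = sum_{i<=s} 4^-i,
# and speed it up with the nondecreasing computable f(s) = 2s.
x = Fraction(1, 3)

def a(s):
    return sum(Fraction(1, 4**i) for i in range(1, s + 1))

def f(s):
    return 2 * s

for s in range(1, 6):
    slow = abs(x - a(s))        # error 4^-s / 3
    fast = abs(x - a(f(s)))     # error 4^-2s / 3
    # The ratio fast/slow is 4^-s, so the speeded approximation's
    # error vanishes relative to the original's -- one natural
    # (though here only illustrative) reading of "converges faster".
    assert fast < slow
```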


https://arxiv.org/abs/2601.09477 That's good work, and I enjoyed the bit where they swapped out the DFT for a WHT in Pagh's thing (you don't always want the full https://arxiv.org/abs/1909.01410)
Engineering Compressed Matrix Multiplication with the Fast Walsh-Hadamard Transform

We present an implementation of Pagh's compressed matrix multiplication algorithm, a randomized algorithm that constructs sketches of matrices to compute an unbiased estimate of their product. By leveraging fast polynomial multiplication via the FFT, the algorithm achieves high performance when the product matrix is sparse or contains only a small number of entries with magnitudes significantly larger than the rest. We show empirically that the algorithm is practical and can outperform state-of-the-art DGEMM implementations when the product matrix has few nonzero entries or is otherwise dominated by a small subset of elements with large magnitude. As a minor theoretical contribution, we replace the FFT with the Fast Walsh-Hadamard Transform (FWHT) in sketched multiplication, preserving all correctness and variance guarantees of the original algorithm. Experiments with our carefully engineered multithreaded CPU implementation for dense double-precision matrices on 64-core CPU nodes across a range of synthetic benchmarks, exhibiting variable sparsity patterns, show that the FWHT variant is up to 4 times faster than the FFT-based version. Under favorable sparsity and magnitude patterns in the product matrix, our FWHT-based implementation achieves a speedup of up to 40 over DGEMM from Intel MKL, with low probability of error in the estimates. Our implementation is released as free software and comes with NumPy-compatible Python bindings.
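The FWHT the abstract swaps in for the FFT is the standard add/subtract butterfly recursion; a minimal plain-Python sketch (not the paper's engineered multithreaded implementation) of the unnormalized transform:

```python
def fwht(v):
    """In-place Fast Walsh-Hadamard Transform (unnormalized).

    Like the FFT it runs in O(n log n), but its butterflies use only
    additions and subtractions on real values -- no complex twiddle
    factors -- which is part of what makes it attractive for sketched
    matrix-product estimation.  len(v) must be a power of two.
    """
    h, n = 1, len(v)
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    return v

# The unnormalized FWHT is an involution up to a factor of n:
# applying it twice multiplies every entry by len(v).
v = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
w = fwht(fwht(list(v)))
assert w == [x * len(v) for x in v]
```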


So like I mentioned I joined Antithesis a little while back. When I did, I pitched them on a crazy idea. Antithesis... Hypothesis... Hegel!

A remarkably short number of months later: Hegel. Hegel is a property-based testing protocol and a family of client libraries that make it easy to do Hypothesis-grade PBT everywhere.
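Hegel's actual Rust API isn't shown here, but the core loop that any PBT library automates can be sketched in a few lines of plain Python (`check_property`, `gen_list`, and the example property are all illustrative names, nothing from Hegel itself):

```python
import random

def check_property(prop, gen, runs=200, seed=0):
    """Minimal property-based testing loop: draw `runs` random inputs
    from `gen` and assert `prop` holds for every one.  Real PBT
    libraries (Hypothesis, and presumably Hegel) add input shrinking,
    smarter generation, and replayable failures on top of this."""
    rng = random.Random(seed)
    for _ in range(runs):
        x = gen(rng)
        assert prop(x), f"property falsified by {x!r}"

# Example: sorting is idempotent on random integer lists.
gen_list = lambda rng: [rng.randint(-100, 100)
                        for _ in range(rng.randint(0, 20))]
check_property(lambda xs: sorted(sorted(xs)) == sorted(xs), gen_list)
```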

Today: Rust. Tomorrow: The world! (Muahaha)

https://antithesis.com/blog/2026/hegel/

Hypothesis, Antithesis, synthesis

Introducing Hegel, our new family of property-based testing libraries.

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
https://arxiv.org/abs/2603.19173

> Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization.

Cf. Sec. 4.2 SOL Bound Derivation, 4.3 Metric: SOL Score

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
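The paper defines its SOL Score precisely in Sec. 4.3; as a rough mental model only, a "fraction of the gap closed" metric of that shape looks like the following (the function name, formula, and timings are my illustration, not the paper's):

```python
def gap_closure_score(t_baseline, t_sol, t_candidate):
    """Illustrative gap-closure metric, *not* the paper's exact SOL
    Score (see its Sec. 4.3): 0.0 means the candidate kernel merely
    matches the scoring baseline's runtime, 1.0 means it reaches the
    analytic Speed-of-Light bound.  All times in the same unit."""
    assert t_sol < t_baseline, "the SOL bound must beat the baseline"
    return (t_baseline - t_candidate) / (t_baseline - t_sol)

# Hypothetical kernel timings in microseconds.
assert gap_closure_score(100.0, 20.0, 100.0) == 0.0  # no improvement
assert gap_closure_score(100.0, 20.0, 60.0) == 0.5   # halfway there
assert gap_closure_score(100.0, 20.0, 20.0) == 1.0   # hits the bound
```

One virtue of scoring against an analytic bound, as the abstract notes, is that the target is fixed: it cannot be gamed by regressing the software baseline.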

The PS Portal's handling of captive portals is the best I've seen so far (the portal acts as an AP, another device connects to it and clicks on the magic button that unlocks the network). I wonder why I don't see that everywhere; how much did that feature add to their COGS?

won't say I'm totally proud of myself here, but once I saw that the Claude C compiler was super buggy according to YARPGen and Csmith, I had a hard time preventing myself from doing something about it

https://john.regehr.org/writing/claude_c_compiler.html


A blog post I wrote about folios

https://blogs.oracle.com/linux/intro-to-folios

If you're a Swiss resident (or know one...), you can get a custom hardware design onto an ASIC for FREE!!!

You get back one chip with your design (and a bunch of others!) and some board to help bring up/testing.

Also, this is "resident", no need to be a student or at uni or anything, you just need a valid Swiss address ...

I know I might be repeating myself here because I mentioned it a few months back, but this bears repeating IMHO.

https://swisschips.ethz.ch/news-and-events/tiny-tapeout-submission-form.html

PS: Can I haz boost for reach?

Tiny Tapeout Submission Form: for shuttles & workshops

SwissChips
Shrunk the machine code a little here https://github.com/pkhuong/tiny_batcher The x86-64 build now clocks in at 199 bytes (224 on aarch64)! I think x86-64 might benefit from using the high-byte registers, but hopefully simpler tricks can get us to <= 3 cache lines.

As I compared https://github.com/pkhuong/tiny_batcher/blob/main/reference.py with the expected number of compare-exchanges for Batcher's odd-even merge sort, it became clear that Knuth's Algorithm M isn't quite that network (5.2.2, p. 111 in TAoCP Vol. 3, 2nd ed.).

Is there a reference for the exact number of compare-exchange I should expect for small n?

@jix maybe?
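For what it's worth, the comparator counts for power-of-two n can be enumerated directly from the widely cited loop formulation of Batcher's odd-even mergesort network (this is the textbook construction, not reference.py, so any discrepancy with the repo is exactly the question above):

```python
def batcher_comparators(n):
    """Enumerate the compare-exchange pairs of Batcher's odd-even
    mergesort network for n a power of two (standard loop form)."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs

# For n = 2^p the count is (p^2 - p + 4) * 2^(p-2) - 1, i.e.
# 1, 5, 19, 63, 191 comparators for n = 2, 4, 8, 16, 32.
for p in range(1, 6):
    n = 2 ** p
    expected = (p * p - p + 4) * 2 ** p // 4 - 1
    assert len(batcher_comparators(n)) == expected
```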