A lot of what I did in my latest blog post was tweaking Swift code so the compiler would emit the SIMD instructions I wanted.
Interesting to watch this clip of Kieran Kunhya (FFmpeg) and Jean-Baptiste Kempf (VLC) talking about how no amount of intrinsics or autovectorization (roughly what I was doing) can get within an order of magnitude of the runtime speed of handwritten assembly for SIMD math.

