Day 20 of Advent of Compiler Optimisations!

Loop over 65,536 integers doing comparisons — that's 65,536 iterations, right? Wrong! With the right flags, the compiler processes 8 integers per iteration using SIMD instructions. Same number of assembly instructions, 8× the throughput. What's the trick that makes this possible?
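The iteration arithmetic can be modelled in plain C++. This is only a sketch: `clamp_scalar`, `clamp_blocked`, `kCount` and `kLanes` are hypothetical names, and the clamp is an assumed stand-in for "doing comparisons", not the code from the linked post. The blocked version is a hand-written picture of what a vectorised loop does with eight 32-bit lanes (8 × 32 = 256 bits); a real compiler would emit SIMD instructions for the inner loop rather than run it lane by lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCount = 65536;  // element count mentioned in the post
constexpr std::size_t kLanes = 8;      // eight 32-bit ints per 256-bit register

// Scalar form: one comparison per loop iteration.
// Returns the number of loop iterations executed.
std::size_t clamp_scalar(std::vector<std::int32_t>& data, std::int32_t limit) {
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < data.size(); ++i, ++iterations) {
        if (data[i] > limit) data[i] = limit;
    }
    return iterations;
}

// Blocked form: each outer iteration handles kLanes elements, which is the
// shape of loop an auto-vectoriser aims for. Returns outer iterations only.
std::size_t clamp_blocked(std::vector<std::int32_t>& data, std::int32_t limit) {
    assert(data.size() % kLanes == 0);  // sketch assumes no leftover tail
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < data.size(); i += kLanes, ++iterations) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (data[i + lane] > limit) data[i + lane] = limit;
        }
    }
    return iterations;
}
```

For `kCount` elements the scalar loop runs 65,536 times and the blocked loop 8,192 times, with identical results in memory.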

Read more: https://xania.org/202512/20-simd-city
Watch: https://youtu.be/d68x8TF7XJs

#AoCO2025

SIMD City: Auto-vectorisation — Matt Godbolt’s blog

Doing more with less: vectorising can speed your code up 8x or more!

@mattgodbolt I was a bit surprised to see vpmaskmovd advocated without noting that it's quite slow on AMD through Zen 4 (fixed in Zen 5). There are certainly other autovectorizations (including the one with vpmaxsd) that don't have this problem – most of the time, if the instruction is available, you want to use it.
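To make the two code shapes concrete, here is a minimal sketch with hypothetical function names. The first loop stores only when the comparison holds, which is the kind of source a vectoriser may turn into a masked store. The second stores unconditionally, which fits a packed max followed by an ordinary store. Which instructions actually come out depends on the compiler, its version, and the target flags, so treat the comments as possibilities rather than guarantees.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conditional store: memory is written only where v[i] < floor.
// A vectoriser may need a masked store to preserve that behaviour.
void raise_to_floor_branchy(std::vector<std::int32_t>& v, std::int32_t floor) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] < floor) v[i] = floor;
    }
}

// Unconditional store: every element is rewritten with max(v[i], floor).
// This form can map onto a packed max and a plain vector store.
void raise_to_floor_max(std::vector<std::int32_t>& v, std::int32_t floor) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] = std::max(v[i], floor);
    }
}

// Usage check: both forms leave the same values in the vector.
bool forms_agree(std::vector<std::int32_t> input, std::int32_t floor) {
    std::vector<std::int32_t> other = input;
    raise_to_floor_branchy(input, floor);
    raise_to_floor_max(other, floor);
    return input == other;
}
```

The final values match, but the set of memory writes differs: the second form writes every element, the first only some. That difference is one reason a compiler cannot always rewrite the first form into the second on its own.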

I'm sure you know this, but readers might not.

@raph I don't think I advocated specific instructions here, I just showed what the compiler chose. I am targeting a specific Intel CPU here to show how it generates code. I don't have any personal experience with AMD CPU performance, so this was news to me: thanks for sharing!
@mattgodbolt Probably "advocating" is too strong a word, and of course this optimization makes perfect sense when generating code for this CPU. I'm spending a lot of time these days (with fearless-simd and Vello) figuring out how to deliver code that's super-performant across a wide range of chips, and of course that has its own challenges. I'm looking forward to AVX-512 becoming more common, as the masked operations there are sweet.

@mattgodbolt Speaking of which: Go is getting experimental SIMD intrinsics. See https://github.com/golang/go/issues/73787

Is there any hope of getting a version with that enabled in Compiler Explorer? It would greatly help discussions, I believe, because it would make it easy to link to snippets that generate suboptimal sequences.

Only involves x86_64 so far. Requires building the compiler with GOEXPERIMENT=simd (given that there's tip, maybe you do custom builds already?)

simd/archsimd: architecture-specific SIMD intrinsics under a GOEXPERIMENT · Issue #73787 · golang/go

Update (12/16/2025): The AMD64 low-level SIMD package is now available in Go 1.26 RC1 under GOEXPERIMENT=simd. Also, the package is renamed to simd/archsimd, per #76473. See #73787 (comment). Upd...

@Merovius we do custom builds of many compilers! Feel free to submit a PR: we have documentation on how to add new compilers :)
@mattgodbolt 👍 I'll look into it
@Merovius @mattgodbolt Just to clarify, I’m 90% sure it only requires setting GOEXPERIMENT=simd when building the application (go build), not when building the compiler. A stock compiler is fine.
@prattmic @mattgodbolt for this one it needed to be enabled at build time as well, but I’ll try it out before submitting a PR, thanks

@Merovius @mattgodbolt @prattmic

Just verified, plain build does the right thing w/ GOEXPERIMENT=simd (and of course I am cross-compiling to amd64 and then using Apple’s emulation, as one does).

@mattgodbolt Now do it with floats 🙃

(I spent a lot of time trying to convince GCC/Clang to optimize various vectorizable float loops with just local assumptions without the big guns of `-fchange-how-floats-work-globally`, but they are surprisingly bad at that.)
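One reason float loops are harder is that floating-point addition is not associative, so under default rules the compiler has to keep a reduction in source order. A minimal sketch of a purely local workaround follows, with hypothetical function names: the reordering is written into the source as eight independent partial sums, so no global flag is needed to permit it. Whether a given GCC or Clang version then vectorises the inner loop is not guaranteed, and for general inputs the two functions can differ in the last bits.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Strict left-to-right sum: each add depends on the previous one, and the
// order must be preserved under default floating-point rules.
float sum_in_order(const std::vector<float>& v) {
    float total = 0.0f;
    for (float x : v) total += x;
    return total;
}

// Eight independent partial sums, combined at the end. The changed
// summation order is explicit in the source rather than granted by a flag.
float sum_in_lanes(const std::vector<float>& v) {
    constexpr std::size_t kLanes = 8;
    std::array<float, kLanes> partial{};  // zero-initialised accumulators
    const std::size_t blocked = v.size() - v.size() % kLanes;
    for (std::size_t i = 0; i < blocked; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            partial[lane] += v[i + lane];
        }
    }
    float total = 0.0f;
    for (float p : partial) total += p;
    // Leftover elements that do not fill a whole block of kLanes.
    for (std::size_t i = blocked; i < v.size(); ++i) total += v[i];
    return total;
}
```

For small whole-number inputs both functions return exactly the same value, since every intermediate sum is representable; the difference only shows up when rounding is involved.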