Day 20 of Advent of Compiler Optimisations!

Loop over 65,536 integers doing comparisons — that's 65,536 iterations, right? Wrong! With the right flags, the compiler processes 8 integers per iteration using SIMD instructions. Same number of assembly instructions, 8× the throughput. What's the trick that makes this possible?

#AoCO2025

SIMD City: Auto-vectorisation — Matt Godbolt’s blog

Doing more with less: vectorising can speed your code up 8x or more!
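
A minimal sketch of the idea in the post, not the blog's own code: the function names, the max-style comparison, and the iteration counters are assumptions for illustration. The second function spells out by hand the shape a compiler might produce when it can use 256-bit registers (for example with something like `-O3` and an AVX2-capable target on GCC or Clang): each outer iteration handles a block of 8 `int`s, so 65,536 elements need 8,192 trips instead of 65,536.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCount = 65536;  // element count mentioned in the post
constexpr std::size_t kLanes = 8;      // eight 32-bit ints fit in one 256-bit register

// Scalar form: one comparison per loop iteration.
// Returns how many loop iterations ran (illustration only).
std::size_t max_scalar(int* x, const int* y) {
    assert(x != nullptr && y != nullptr);
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < kCount; ++i) {
        if (y[i] > x[i]) x[i] = y[i];
        ++iterations;
    }
    return iterations;
}

// Hand-blocked form: the inner loop stands in for a single SIMD
// instruction that compares 8 lanes at once. A vectorising compiler
// does this transformation itself; nobody needs to write it this way.
std::size_t max_blocked(int* x, const int* y) {
    assert(x != nullptr && y != nullptr);
    static_assert(kCount % kLanes == 0, "no remainder loop needed");
    std::size_t iterations = 0;
    for (std::size_t i = 0; i < kCount; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            if (y[i + lane] > x[i + lane]) x[i + lane] = y[i + lane];
        }
        ++iterations;
    }
    return iterations;
}
```

Both functions leave `x` with the same contents; only the trip count differs. Whether a particular compiler actually vectorises the scalar form depends on its version, flags, and target.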

Raph Levien Dec 20

@mattgodbolt I was a bit surprised to see vpmaskmovd advocated without noting that it's quite slow on AMD through Zen 4 (fixed in Zen 5). There are certainly other autovectorizations (including the one with vpmaxsd) that don't have this problem – most of the time, if the instruction is available, you want to use it.

I'm sure you know this, but readers might not.
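
To make the distinction concrete, here is a hedged sketch of two source forms that compute the same result; the function names are hypothetical and this is not taken from the blog post. With a conditional store, the compiler may not write elements the source leaves untouched, which is one reason a masked store such as `vpmaskmovd` can appear. With an unconditional store of the maximum, a packed max such as `vpmaxsd` followed by a plain store becomes possible. Which instructions a given compiler actually emits is an assumption to check in Compiler Explorer, not a guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Store only when y[i] is larger: elements that already hold the
// maximum are never written.
void max_conditional(int* x, const int* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (y[i] > x[i]) x[i] = y[i];
    }
}

// Always store the maximum: every element is written, so no mask is
// needed to decide which lanes to store.
void max_unconditional(int* x, const int* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::max(x[i], y[i]);
    }
}
```

The two functions produce identical arrays. They differ only in which elements are written, and that difference is what can change the compiler's choice of instructions.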

matt godbolt Dec 20

@raph I don't think I advocated specific instructions here, I just showed what the compiler chose. I am targeting a specific Intel CPU here to show how it generates code. I don't have any personal experience with AMD CPU performance, so this was news to me: thanks for sharing!

Raph Levien Dec 20

@mattgodbolt Probably "advocating" is too strong a word, and of course this optimization makes perfect sense when generating code for this CPU. I'm spending a lot of time these days (with fearless-simd and Vello) figuring out how to deliver code that's super-performant across a wide range of chips, and of course that has its own challenges. I'm looking forward to AVX-512 becoming more common, as the masked operations there are sweet.

Merovius Dec 21

@mattgodbolt Speaking of which: Go is getting experimental SIMD intrinsics. See https://github.com/golang/go/issues/73787

Is there any hope of getting a version with that enabled in compiler explorer? It would greatly help discussions, I believe, because it would make it easy to link to snippets that generate suboptimal sequences.

Only involves x86_64 so far. Requires building the compiler with GOEXPERIMENT=simd (given that there's tip, maybe you do custom builds already?)

simd/archsimd: architecture-specific SIMD intrinsics under a GOEXPERIMENT · Issue #73787 · golang/go

Update (12/16/2025): The AMD64 low-level SIMD package is now available in Go 1.26 RC1 under GOEXPERIMENT=simd. Also, the package is renamed to simd/archsimd, per #76473. See #73787 (comment). Upd...

matt godbolt Dec 21

@Merovius we do custom builds of many compilers! Feel free to submit a PR: we have documentation on how to add new compilers :)

Merovius Dec 21

@mattgodbolt 👍 I'll look into it

Michael Pratt Dec 21

@Merovius @mattgodbolt Just to clarify, I’m 90% sure it only requires setting GOEXPERIMENT=simd when building the application (go build), not when building the compiler. A stock compiler is fine.

Merovius Dec 21

@prattmic @mattgodbolt for this one it needed to be enabled at build time as well, but I’ll try it out before submitting a PR, thanks

dr2chase Dec 21

@Merovius @mattgodbolt @prattmic

Just verified, plain build does the right thing w/ GOEXPERIMENT=simd (and of course I am cross-compiling to amd64 and then using Apple’s emulation, as one does).

Xarn Dec 21

@mattgodbolt Now do it with floats 🙃

(I spent a lot of time trying to convince GCC/Clang to optimize various vectorizable float loops with just local assumptions without the big guns of `-fchange-how-floats-work-globally`, but they are surprisingly bad at that.)
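
A hedged illustration of why float loops are harder, using hypothetical function names: floating-point addition is not associative, so a compiler following strict IEEE semantics cannot reorder a sum into SIMD lanes on its own. Global flags such as `-ffast-math` (presumably what the joke flag above alludes to) relax that everywhere. One local alternative is to write the reordering into the source, as in the second function below, which keeps eight independent partial sums. Whether a given compiler then vectorises it is an assumption to verify, not a promise.

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right sum: each addition depends on the previous one,
// and the compiler may not reorder the additions by default.
float sum_in_order(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Eight independent partial sums, combined at the end. The programmer
// has chosen this order explicitly, so the result can differ from
// sum_in_order, but the inner loop now maps onto 8 float lanes.
float sum_eight_lanes(const float* data, std::size_t n) {
    assert(n % 8 == 0);  // sketch only: no remainder handling
    float lanes[8] = {};
    for (std::size_t i = 0; i < n; i += 8) {
        for (std::size_t lane = 0; lane < 8; ++lane) {
            lanes[lane] += data[i + lane];
        }
    }
    float total = 0.0f;
    for (std::size_t lane = 0; lane < 8; ++lane) {
        total += lanes[lane];
    }
    return total;
}
```

For inputs where every intermediate sum is exactly representable, both functions agree. For inputs mixing very large and very small magnitudes they can return different values, which is exactly the behaviour change a compiler is not allowed to introduce silently.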

matt godbolt Dec 21

@horenmar give me ten minutes... You'll see ..

matt godbolt Dec 21

@horenmar https://hachyderm.io/@mattgodbolt/115757816470759311 :)

Xarn Dec 21

@mattgodbolt I have a version with more float simd :-P

https://codingnest.com/files/The%20Compiler%20Is%20Smarter%20Than%20You.pdf

Xarn Dec 21

@mattgodbolt Wait, I actually did one that's entirely about floats.

https://codingnest.com/files/Fun,%20Safe,%20Math%20Optimizations.pdf