Day 11 of Advent of Compiler Optimisations!

A clever loop that counts set bits using the "clear bottom bit" trick: value &= value - 1. Works great, generates tight assembly. But change one compiler flag to target a slightly newer CPU and something extraordinary happens to your loop. The compiler spots a pattern you didn't even know was there. What replaces your careful bit manipulation?

Read more: https://xania.org/202512/11-pop-goes-the-weasel-er-count
Watch: https://youtu.be/Hu0vu1tpZnc

#AoCO2025

Apologies the blog post didn't commit until just now...

Fixing my process so I won't make this mistake again too

@mattgodbolt i'm banned off your website for some reason

@tauon same.

@mattgodbolt are you blocking Hetzner IP ranges or something broke?

@mo @mattgodbolt would be funny if that was the reason, my ip address should appear as coming from my university lol
@mo @tauon sorry I typod something in my deploy scripts...

@mattgodbolt last picture[1] still returns AccessDenied, but other parts of the page loads fine, tnx

[1] https://xania.org/202512/11-popcount-clang-opt.png

@tauon

@mo @tauon thanks! On it! I tested this stuff haha but until you go live it's always a crapshoot
@mo @tauon oops
@mo @tauon fixed now. And my release scripts updated

@mattgodbolt getting a
<Error>
<script id="__gaOptOutExtension"/>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
</Error>

when following your link

@funkylab sorry about that; should be fixed now
@mattgodbolt no need to be sorry; thank you for doing the Advent of compiler optimizations! It has definitely sparked some joy amongst me and a group of my friends!

@mattgodbolt popcount, probably. :-)

Edit: Interestingly, the code you used in the video as a starting point is exactly what I use as a fallback when the CPU lacks a popcount instruction:

@soc perhaps my overly marketing-ey social media made this one a bit too easy :-). But yes!
@mattgodbolt Oooops ... apologies ... I didn't realize your post's intent. 😬
I just read the question, and answered it with the thing that came to my mind immediately.
@mattgodbolt Makes me realize I still haven't checked whether my compiler intrinsic for sorting floating-point numbers by total order has any benefits over the written-in-code version ...
@mattgodbolt sigh, I wonder what the best options would be for gcc to generate the fastest code on current hardware, portability be damned.
-O3 -march=native -mtune=native ...?
@sol_hsa @mattgodbolt Me too! Always wanting to lower the CPU usage for my MAME builds 😀
@mattgodbolt funnily, sometimes compilers tried to do that 'optimization' even if the target CPU had no popcnt :/ http://msinilo.pl/blog2/post/compilers-are-too-smart/
Compilers are (too) smart

@mattgodbolt Re:LLVM's loop deletion pass, I feel like this is something that should work regardless of the target (as long as it declares that it has the instruction), yet in practice it doesn't seem to happen all the time  

e.g those for
- mips64 https://aoco.compiler-explorer.com/z/GxofKqr1n
- sparc64 https://aoco.compiler-explorer.com/z/o8jTfTzn4
In which it misses the opportunity to fold the loop

Compiler Explorer - C++ (x86-64 clang 21.1.0)

unsigned population_count_builtin(long long value) { // Note how this is compiled down to a single instruction always return __builtin_popcountll(value); } unsigned population_count_loop(long long value) { // But sometimes the loop stays intact unsigned result = 0; while (value) { value &= value - 1; ++result; } return result; }

I knew pretty quickly which instruction the compiler would choose when you said, “population count”.

One server could not run the software I’d deployed. “Illegal instruction” it would say, then core. Some time spent in gdb and I discovered the popcnt instruction was the instruction in question, which led me to discover that this server’s CPU was very very old.

@mattgodbolt

@mattgodbolt thanks for linking to my blog post!
@mattgodbolt So I tried this out with Rust and saw the same codegen that you describe (kinda crazy!). However, I tried the -march=sandybridge thing, but it didn't emit the xor eax, eax you said. I did see it with C++, though.
How would one go about benchmarking if the sandybridge popcnt issue is real so that I might try to fix this in rustc?
@fp take a look at things like uarch-bench by Travis Downs; setting up very subtle timing loops with carefully crafted dependency chains