Today's fun lesson in Rust firmware size optimization: a case where being clever and doing something "more efficient" made things bigger and slower.

Specifically, a bitwise shift that cost 128 bytes, or 2% of the entire flash size.

1/

#rustlang

In my keypad scanning firmware, the code looks for buttons at every intersection of an 8x8 grid, for a total of 64 possible locations. Currently, it returns this as a u64, where each bit indicates key up or down at each possible location.

The actual scanning code processes units of 8 bits, for efficiency. So at each step it needs to take 8 bits and shift and combine it into a u64. That happens here:

https://github.com/cbiffle/keypad-go-firmware/blob/5442d69db9854ea33b6bff88c0d6caad13d2e53b/src/scanner.rs#L408
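
A rough sketch of that combining step, and of reading a key back out later. This is not the code at the link above: the names `merge_row` and `is_down` are hypothetical, and the layout (row `i` occupying bits `8*i` through `8*i + 7`) is an assumption made for illustration.

```rust
/// Hypothetical helper: fold one 8-bit row of key states into the
/// 64-bit result. Assumes row `row_index` (0..8) lands in bits
/// `8 * row_index ..= 8 * row_index + 7`.
fn merge_row(acc: u64, row_index: u32, row_bits: u8) -> u64 {
    // This is a variable 64-bit shift: the shift amount is not a
    // compile-time constant.
    acc | (u64::from(row_bits) << (8 * row_index))
}

/// Hypothetical helper: test whether key number `key` (0..64) is down.
/// Building the mask needs a second variable 64-bit shift.
fn is_down(state: u64, key: u32) -> bool {
    state & (1u64 << key) != 0
}
```

Both helpers shift a `u64` by an amount that is only known at run time, which is the operation the rest of the thread is about.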

2/

Why did I do it this way? Habit, mostly -- left over from my time writing C, where returning integers is cheap but returning arrays can be expensive, if it's possible at all.

But underpinning this "cheap" code is a 64-bit shift, both to create the bitmask, and to interpret it later.

My processor doesn't _have_ 64-bit shift instructions. I assumed that my request would be optimized into some other form. But it wasn't, and in hindsight, I'm not even sure how that optimization would work.

So this wound up generating a call to the builtin `__aeabi_llsl` function.
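
To give a feel for why this takes a whole routine, here is a sketch of a variable 64-bit left shift built only from 32-bit operations, assuming the value lives in two 32-bit registers. It is an illustration, not the actual `__aeabi_llsl` implementation, and the name `shl64_via_u32` is hypothetical.

```rust
/// Illustrative only: left-shift a 64-bit value, held as two 32-bit
/// halves, by `n` bits using nothing wider than 32-bit operations.
/// `n` must be in 0..64. Returns the new (lo, hi) halves.
fn shl64_via_u32(lo: u32, hi: u32, n: u32) -> (u32, u32) {
    if n == 0 {
        // Nothing moves. Handled separately so the `32 - n` shift
        // below never becomes a shift by 32.
        (lo, hi)
    } else if n < 32 {
        // Bits leaving the top of the low half carry into the high half.
        (lo << n, (hi << n) | (lo >> (32 - n)))
    } else {
        // Everything in the low half moves entirely into the high half.
        (0, lo << (n - 32))
    }
}
```

With a constant shift amount, only one of those branches survives and it collapses to a couple of instructions. With a run-time amount, all of the cases have to be handled somewhere.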

3/

Not only does this bring in a 44-byte generic 64-bit shift routine, it also inserts a function call that doesn't get inlined. Function calls aren't free; the calling routine has to arrange all its state _just so_ to produce the environment the *called* routine expects.

So overall, the two calls to this 44-byte routine added another 84 bytes of overhead, by effectively de-optimizing the functions that contained the shifts. That's where the 128-byte total comes from.

4/

@cliffle If there was just one call to the builtin then I guess LTO would inline it, yeah?

It sounds like the (mythical) Sufficiently Smart Compiler[*] could have chosen to inline both instances for a small binary size saving as well, maybe?

[*] Linker, compiler, who even keeps track these days?

@projectgus the line between linker and compiler is arbitrary, yeah.

But in this case, no! LTO won't inline these shift routines. 🤷