Three-instruction NEON float prefix sum. I'd wanted to abuse FCMLA (floating-point complex multiply-accumulate) for non-complex arithmetic for so long, and I finally came up with something :)

With two unnecessary multiplies to save one instruction, this may only work out on Apple CPUs, but it's a bit of fun.

(For loops you can broadcast the carried value with vfmaq_laneq_f32(scan, ones, prev, 3) for three multiplies saving two instructions. LLVM fights you on that, though.)
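
A rough scalar model of that loop structure, in case it helps to see the carry spelled out. This is only a sketch: `vec4`, `fma_lane`, `scan_block` and `prefix_sum_blocks` are made-up names, `fma_lane` imitates what `vfmaq_laneq_f32(a, b, v, lane)` computes per lane (`a[i] + b[i] * v[lane]`), and the in-block scan is a plain scalar placeholder, not the FCMLA trick.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-in for a 4-lane NEON float vector (float32x4_t). Hypothetical type. */
typedef struct { float lane[4]; } vec4;

/* Models vfmaq_laneq_f32(a, b, v, lane): a[i] + b[i] * v[lane] for each lane i. */
static vec4 fma_lane(vec4 a, vec4 b, vec4 v, int lane) {
    vec4 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i] * v.lane[lane];
    }
    return r;
}

/* Placeholder for the in-register scan: inclusive prefix sum of 4 floats. */
static vec4 scan_block(const float *p) {
    vec4 r;
    float sum = 0.0f;
    for (int i = 0; i < 4; i++) {
        sum += p[i];
        r.lane[i] = sum;
    }
    return r;
}

/* Inclusive prefix sum over n floats, n a multiple of 4 (assumption of this sketch). */
void prefix_sum_blocks(const float *in, float *out, size_t n) {
    assert(n % 4 == 0);
    const vec4 ones = {{1.0f, 1.0f, 1.0f, 1.0f}};
    vec4 prev = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t i = 0; i < n; i += 4) {
        vec4 scan = scan_block(in + i);
        /* Add the running total (lane 3 of the previous result) to every lane. */
        prev = fma_lane(scan, ones, prev, 3);
        for (int j = 0; j < 4; j++) {
            out[i + j] = prev.lane[j];
        }
    }
}
```

The multiply by `ones` is pure overhead arithmetically; its only job is to let the indexed form of the multiply-add do the lane broadcast.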

[oops, see reply]

@dougall do you have any insight into the latency of these NEON instructions? I've heard compilers are usually a bit conservative about using them for that reason.

@Alonely0 I only really know the Apple ones by heart, but the broadcasts are mostly free there. FADD is about 2 or 3 cycles (got faster on M4, but not FADDP, so it's still 3 cycles), but FMUL/FMLA/FCMLA are all 4 cycles.

Apparently the Cortex-A57 had 10-cycle FMLA, so the conservatism might be justified.

@Alonely0 By "broadcasts" I meant the indexed modes of the instructions. For comparison the other methods would use shuffles, usually EXT – any one-or-two-register shuffle is 2c (which is the minimum latency for any NEON operation in any implementation, no fast adds/bitops like you get on x86).
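
For reference, a sketch of what a shuffle-based in-register scan usually looks like: two EXTs against a zero vector and two FADDs. It's modeled in scalar C so it runs anywhere; `f32x4`, `ext`, `add` and `scan_ext` are made-up names, with `ext` imitating `vextq_f32(a, b, n)`. This is presumably the kind of baseline being compared against, not code from the thread.

```c
#include <assert.h>

/* Scalar stand-in for float32x4_t. Hypothetical type. */
typedef struct { float lane[4]; } f32x4;

/* Models vextq_f32(a, b, n): lanes n..3 of a, followed by lanes 0..n-1 of b. */
static f32x4 ext(f32x4 a, f32x4 b, int n) {
    assert(n >= 0 && n < 4);
    f32x4 r;
    for (int i = 0; i < 4; i++) {
        int k = i + n;
        r.lane[i] = (k < 4) ? a.lane[k] : b.lane[k - 4];
    }
    return r;
}

/* Models vaddq_f32: lane-wise add. */
static f32x4 add(f32x4 a, f32x4 b) {
    f32x4 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}

/* Inclusive prefix sum of one 4-lane vector using shift-and-add steps. */
f32x4 scan_ext(f32x4 v) {
    const f32x4 zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    /* Shift up by one lane and add: [v0, v0+v1, v1+v2, v2+v3]. */
    f32x4 s = add(v, ext(zero, v, 3));
    /* Shift up by two lanes and add: [v0, v0+v1, v0+v1+v2, v0+v1+v2+v3]. */
    return add(s, ext(zero, s, 2));
}
```

Each EXT feeds the following FADD, so the four operations form one serial dependency chain.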