Three instruction NEON float prefix sum. I'd wanted to abuse FCMLA (floating-point complex multiply accumulate) for non-complex arithmetic for so long, and I finally came up with something :)

With two unnecessary multiplies to save one instruction, this may only work out on Apple CPUs, but it's a bit of fun.

(For loops you can broadcast the carried value with vfmaq_laneq_f32(scan, ones, prev, 3), at the cost of three multiplies, saving two instructions. LLVM fights you on that, though.)
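A portable scalar model of what that fma-by-ones broadcast computes (the function name and the use of plain arrays are illustrative, not from the post):

```c
#include <stdint.h>

// Scalar model of vfmaq_laneq_f32(scan, ones, prev, 3):
// each lane becomes scan[i] + ones[i] * prev[3], i.e. the carried
// total (last lane of the previous block's inclusive prefix sums)
// is broadcast-added to the current block's scan.
void carry_broadcast(float scan[4], const float prev[4]) {
    for (int i = 0; i < 4; i++)
        scan[i] += 1.0f * prev[3];
}
```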

[oops, see reply]

Correction: two instruction NEON float prefix sum.

I guess I'm a bit out of practice, was focusing on the complex ops too much, and two seemed too good to be true.

@dougall This is awesome. How low can integers go?
@zeux Good question... For integers the main trick is SWAR: multiplying by 0x0101010101010101 is a prefix sum if the sum of the first 7 bytes fits in a byte. For a general solution, the same idea replaces the first FCMLA step in NEON for u8s and u16s, multiplying by 0x101 or 0x10001. The last trick (multiply-add "by element") also works for the final step, but only for u16s and u32s.
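The byte-wise SWAR multiply is portable, so it can be sketched directly in scalar C (function name is mine):

```c
#include <stdint.h>

// SWAR prefix sum over 8 packed bytes: multiplying by
// 0x0101010101010101 makes byte k of the product the sum of bytes
// 0..k (little-endian), provided the sum of the first 7 bytes fits
// in a byte, so no carries leak between lanes.
// E.g. bytes (1, 2, 3, 4, 0, 0, 0, 0) -> (1, 3, 6, 10, 10, 10, 10, 10).
uint64_t prefix_sum_u8x8(uint64_t x) {
    return x * 0x0101010101010101ull;
}
```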

@dougall I was hopeful for the SWAR trick with uint16 on x64 but in addition to range issues it was just plain slower than the standard log2 construction. I’m mainly interested in uint32x4 now but that needs u64 multiply…

It’s pretty disappointing that you need 4 instructions to do 6 scalar adds…

@zeux @dougall 3 scalar adds! Even with a reduction tree it's 4 at most. (a+b, c+d, (a+b)+c, (a+b)+(c+d)).
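That count can be written out in plain C (variable names are mine): three dependent adds serially, or four adds as a tree with depth 2 instead of 3:

```c
#include <stdint.h>

// Prefix sum of four values via a reduction tree: four adds total,
// but cd is independent of ab, so the dependency depth is only 2.
void prefix4_tree(const uint32_t x[4], uint32_t out[4]) {
    uint32_t ab = x[0] + x[1]; // a+b
    uint32_t cd = x[2] + x[3]; // c+d (independent of a+b)
    out[0] = x[0];             // a
    out[1] = ab;               // a+b
    out[2] = ab + x[2];        // (a+b)+c
    out[3] = ab + cd;          // (a+b)+(c+d)
}
```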

@rygorous @zeux Yeah, this is what I've got, which I'd count as three SIMD instructions for three scalar adds:

v = vaddq_u32(v, vreinterpretq_u32_u64(vshlq_n_u64(vreinterpretq_u64_u32(v), 32)));
v = vmlaq_laneq_u32(v, c0011, v, 1);

The first two instructions could be changed to a 64-bit USRA, if it doesn't overflow, and you don't mind pairs being in swapped order, but I think that's a bit of a stretch.
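A scalar model of what a 64-bit USRA by 32 does to one lane helps see the swapped order (function name is mine; this models one lane only):

```c
#include <stdint.h>

// USRA Vd.2D, Vn.2D, #32 computes d += n >> 32 per 64-bit lane.
// With v as both source and destination, a lane holding the pair
// (a in the low half, b in the high half) becomes:
//   low half = a+b, high half = b
// so the pair sum lands in the *low* element (swapped vs. the
// shift-left-and-add version), and a+b must not carry into the
// high half.
uint64_t usra32_lane(uint64_t lane) {
    return lane + (lane >> 32);
}
```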

(for a loop you can add:

v = vmlaq_laneq_u32(v, c1111, running, 3);
running = v;

Four for four, but the latency is probably a problem.)

@dougall @zeux You never need the middle portions of a prefix sum across loop iterations (or at least I've never seen it), and you do get ADDV for the full horizontal sum.

For that matter, I end up needing the exclusive (0, a, a+b, a+b+c) variant with (a+b+c+d) added to running total much more often than inclusive.

Either way, IMO getting the per-iteration total off your main prefix sum is usually a mistake. Tally them separately.
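A scalar sketch of that shape — exclusive scan per block, with the block total tallied into a separate running accumulator rather than read back off the scan (names are mine, not from the thread):

```c
#include <stdint.h>

// Exclusive prefix sum of a 4-wide block, offset by the carried
// running total; the per-block total is accumulated separately
// instead of being taken from the last lane of an inclusive scan.
void exclusive_block(const uint32_t x[4], uint32_t out[4],
                     uint32_t *running) {
    uint32_t carry = *running;
    out[0] = carry;            // 0     + carry
    out[1] = carry + x[0];     // a     + carry
    out[2] = out[1] + x[1];    // a+b   + carry
    out[3] = out[2] + x[2];    // a+b+c + carry
    *running = carry + x[0] + x[1] + x[2] + x[3];
}
```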

@dougall @zeux My usual setup is something like ADDV + ADD for the scalar total (loop-carried) and then inside iteration: prefix sum, broadcast sum total (overlaps with prefix sum) and add carried sum total to item prefix sum.

Can't recall an instance where I've cared about individual cycle-level latency of the actual prefix sum portion.

@rygorous @zeux Interesting... ADDV felt like wasted work since we compute it anyway, and can accumulate without latency problems:

v = vaddq_u32(v, vreinterpretq_u32_u64(vshlq_n_u64(vreinterpretq_u64_u32(v), 32)));
v = vmlaq_laneq_u32(v, c0011, v, 1);
out = vmlaq_laneq_u32(v, c1111, acc, 3); // doesn't change v
acc = vaddq_u32(acc, v); // 2c latency

But I guess for prefix-exclusive, the ADDV should work out to the same number of instructions, with an earlier, less confusing critical path.

@dougall @rygorous Yeah this version is worse for me vs the naive version (add/ext/add/ext + add/dup). Need to see if I can make USRA work...

@dougall @rygorous The single-mlaq variant is also worse, if the final addition is done via add/dup. There, of course, mla is still on the critical path. Fixing that by using VADDV seems to regress perf further :( I might be doing something wrong.

And yeah, for USRA to work things can't overflow, which I can't guarantee. A reversed order might have worked, but the overflow still has to be avoided; this is one of the issues I ran into with my SWAR attempts, and I had to hack around it with extra ops to mask bits off.

@zeux @rygorous Curious... Yeah, I'd check the compiler output, because it makes a mess of this if it can see either of the constants: https://godbolt.org/z/sz3qazq3M

That seems most likely. If you're not latency bound I'd try the four instruction version (just goes 2c->3c latency) and if you are latency bound you could try to sum even wider.

Compiler Explorer - C (armv8-a clang (trunk))

// 5 SIMD instructions
void prefix_sum_u32(uint32_t *in, uint32_t *out, size_t n) {
  uint32x4_t c0011 = {0, 0, 1, 1};
  uint32x4_t c1111 = {1, 1, 1, 1};
  uint32x4_t acc = vdupq_n_u32(0);

  // hide constants from compiler
  asm("" : "+w"(c1111));
  asm("" : "+w"(c0011));

  for (size_t i = 0; i < n; i += 4) {
    uint32x4_t v = vld1q_u32(in + i);
    v = vaddq_u32(v, vreinterpretq_u32_u64(
                         vshlq_n_u64(vreinterpretq_u64_u32(v), 32)));
    v = vmlaq_laneq_u32(v, c0011, v, 1);
    uint32x4_t result = vmlaq_laneq_u32(v, c1111, acc, 3);
    acc = vaddq_u32(acc, v);
    vst1q_u32(out + i, result);
  }
}

// 8 SIMD instructions
void prefix_sum_u32_bad(uint32_t *in, uint32_t *out, size_t n) {
  uint32x4_t c0011 = {0, 0, 1, 1};
  uint32x4_t c1111 = {1, 1, 1, 1};
  uint32x4_t acc = vdupq_n_u32(0);

  for (size_t i = 0; i < n; i += 4) {
    uint32x4_t v = vld1q_u32(in + i);
    v = vaddq_u32(v, vreinterpretq_u32_u64(
                         vshlq_n_u64(vreinterpretq_u64_u32(v), 32)));
    v = vmlaq_laneq_u32(v, c0011, v, 1);
    uint32x4_t result = vmlaq_laneq_u32(v, c1111, acc, 3);
    acc = vaddq_u32(acc, v);
    vst1q_u32(out + i, result);
  }
}

@dougall @rygorous Yeah this is maddening. You need c0011 to be a constant so that the compiler can hoist the load out of the loop, but you also need it to be opaque to the compiler! I'm tired of LLVM constantly making wrong decisions around intrinsics.

Anyway if I propagate the value manually but load the initial state out of a const volatile, I get maybe a little faster with a single-mla variant? I can measure ~2% delta on a larger algorithm so probably 5+% on this loop. two-mla variant is ~same.

@dougall @rygorous Oh the asm trick actually works inside the loop? That kinda defies expectations. But maybe with that I can actually ship this...
@rygorous @zeux (USRA should also work to save an instruction for a suffix sum (fully reversed order), I think, which might be acceptable in more situations.)
@rygorous @dougall Yeah I meant even if you aren’t reusing any results. I assume it’s more work to do dependent adds (addv vs add) but still would have hoped it’d be less.