@dzaima

(granted, those hypotheses are a rather long list... even then, those feel like weird things to complain about relative to "need to crack SIMD ops into 8 uops (or crazier things for reduces, esp. vfredosum, or the billion different funky loads/stores)")

@fclc I don't understand the particular badness of "implicit state" here though; the vtype bits are pretty cleanly just bits of the nearest preceding vsetvli/vsetivli instruction (vsetvl can just cause a stall, it's for context restore anyway).

Is it wanting to do work with instrs without yet having even fetched some preceding ones? Wanting to compute stuff before the cycle or two (or whatever) of the vtype being forwarded? The last-vtype forwarding taking too much area/power? Something else?

Of course going to a 64-bit encoding would immediately mean being at least as bad as a separate vsetvl; 48-bit would work, but of course isn't particularly compatible with the other discussed idea of getting rid of RVC.
@miquelp Question is what you'd be losing space relative to; cause fitting the 5 vtype bits into the instructions with a 32-bit encoding would take up a rather massive amount of encoding space (3×5 reg bits, 5 vtype, 1 masking, leaves 11 bits of opcode (or 9 bits if not stealing RVC's encoding space) for fitting in all vector instrs, and you still have to leave space for non-vector instrs).
The clang-22 thing can be fixed by a `__asm__(" # force to mem " : "+m"(state->c.v.inner));` before `state->c.v.outer--;`.
@pkhuong Changing the `uint8_t` fields to `uint32_t` further reduces it to 173 bytes on clang≤21 from reduced prefixes, but 22-prerelease gets fancy and does some SIMD, bloating up to 195
@pkhuong x86-64 clang (17…22-prerelease, i.e. everything I have locally) at -Os compiles tiny_batcher_generate to 179 bytes.
I think it should be fine to only set & use `ret.right = 0` for the done case, for -2 bytes; needs an `__asm__` no-op after `done:` to prevent it from being duplicated though.
(no I don't know if there's an actually-correct way to use llvm-mca on compiler-explorer for loops, besides manually copying the desired specific assembly around)
So it basically just ends up being a code size "bench".
The gcc "multiply" ends up effectively roughly timing `count==1`, and "multiply_restrict" roughly timing `count==5`;
The clang "multiply" has an 8x vectorized loop, and an 8x unrolled loop, depending on a runtime aliasing test (never both used at once, but that's what llvm-mca simulates anyways), plus a 1x loop for tail. "multiply_restrict" is that but without the unrolled loop as there's never aliasing, i.e. alike `count==9`.
@funkylab llvm-mca assumes the whole assembly block is a single loop iteration (all jumps (incl. the backwards ones of the loops) assumed always-untaken or something); it is entirely meaningless when applied to a whole C loop (for that you'd need to look only at the desired loop body, exclude initialization/tail element handling/etc., and divide by the vectorization width / unroll count)