Someone asked a #RVV #riscv #vectorextension question and I replied on Twitter.
https://twitter.com/Anno0770/status/1772927877790507222

I am copying things here because of the dumb login wall of Twitter. *sigh*


This tweet was in response to jonmasters quote-tweeting https://twitter.com/pshufb/status/1772745519242051899 with #yolo, and to someone asking something along the lines of "Can we do better even?"

There are multiple things going on here. I am not sure I get all of them right, but I'll try.

The idea of having a dynamic vector length is not a problem. This has been a common architectural feature of vector computers for a while.
RISC-V, however, does things not seen in other vector architectures.

Wojciech Muła (@pshufb) on X

My gut feelings about RVV was -- unfortunately -- right. In a simple procedure a compiler put FIVE vsetvli instructions. The vsetvli instruction changes behaviour of the subsequent instructions until the next vsetvli. For in-order CPUs it's not a big deal, for OOO -- good luck.

RISC-V wants to support 8-bit, 16-bit, 32-bit and 64-bit data types (someone threatened quad precision (128-bit), but I don't think that happened) while using 32-bit-long instructions.
Furthermore, RISC-V wants the ability to merge vectors (treat multiple consecutive vector registers as one).
Furthermore, they want to be able to use less than one vector register: 1/2, 1/4, or 1/8th of a register. This also leaves the question of what happens with the unused tail end. Do we want it preserved, or can it be trashed?
Let's look at how the VE ISA used in the NEC #VectorEngine deals with this.
You want your vectorized SAXPY to look something like:
SAXPY(Vector{Float} out, float a, Vector{Float} x, Vector{Float} y, int n)
    i = 0;
    while (i != n) {
        work = setVL(n - i)        // hardware decides this iteration's length
        vr v1 = x[i:(i+work)]
        vr v2 = y[i:(i+work)]
        v1 = v1 * a;
        v2 += v1
        out[i:(i+work)] = v2
        i += work;
    }
So the idea is: you have a loop, you don't know how often it will run, and the hardware makes a good decision about which vector length to use every iteration, given its capabilities. All those questions RVV faces are valid questions, and the #VectorEngine also has to answer them.
It does so like this:
Tails are undefined behaviour; always assume they are trashed. There is no means to join registers together.
All vector elements are always 64 bits wide. If you load 32-bit data, half the register goes unused.
But if you are working with 32-bit or 16-bit floats, we offer separate packed instructions which can operate on a register full of those smaller floats; it's still 64-bit data, though, so make sure it's aligned. Vector length is the number of 64-bit-wide vector elements that are used. Yes, you can request a vector operation of length 1 or 5. You request a vector length using the LVL instruction. See image.
So in this design the MVL is a constant and doesn't depend on anything.
If you use packed data (analogous to operating on 16- or 32-bit elements in RVV), you use a different instruction. VE has enough room for instructions since it is a 64-bit-wide ISA. When you LVL, you don't need to hear back.
Now back to RVV and its LVL analog. Yup, that analog is so long I can't take a proper screenshot.
This is the complete instruction:
vsetvli x1, n, e64, m1, ta, mu
Or: tell me in register x1 how many of the n 64-bit values you can process using one vector register. Also set the tail policy to trashing and the mask policy to undisturbed.
Why does LVL need to set tail policies? I can find excuses but no good reason.
Why do I need to be told how many elements the hardware chooses to process, instead of me telling it?
Some thought they would be clever:
So every time we change the width of the data type we are processing, or we start a new iteration, we produce a scalar which the next loop iteration will inevitably depend on, all because some liked the aesthetics of allowing work to be split across the last two iterations of a loop.
The observation "This allows software to avoid needing to explicitly calculate a running maximum of vector lengths observed during a stripmined loop." is misleading. That a loop is stripmined is not relevant here. There is no need to "observe" anything "running". An ISA could just tell, or define, the lengths, and then one could calculate them ahead of time. Once. Instead of having the hardware check every iteration. If it weren't deterministic, you could do some interesting power-saving stuff. ... Tangent forks here.
Due to all of this complexity, compilers just lower every vector instruction to a vsetvli followed by the vector instruction, and then try to clean up the mess afterwards. (They don't have to, but they do.) And they are not great at the cleaning up yet.
Hope that wasn't too false. I am open to corrections.
I cut off a tangent about a particularly nasty thing which this interface allows, but I thought it too unholy to consider.
Now I found something in the docs which hints at it. I am not 100% convinced that RVV implementations do that, but they might.
Given these snippets from the spec: what if we have a 512-bit-long vector register and we claim to support 8-bit, 16-bit, 32-bit and 64-bit?
But we only put in 8 ALUs which don't support "packing":
8 64-bit operations, or 8 32-bit, or 8 16-bit, or 8 8-bit words.
😱 vsetvli would always return 8 times LMUL, and the hardware would do stripping under the hood to make that work.
I really hope they thought about that well or added wording which forbids this. I am not going to.