Ok fedi, you're full of game devs and weird microarchitecture experts and generally the right kind of people to ask...

I'm thinking about a generic data representation for multi-bit vectors in ngscopeclient.

Right now we support single-bit digital signals (one byte aka C++ bool per sample), analog signals (one float32 per sample), and arbitrary struct/class types (for protocol decoder output).

Notably missing is multi-bit digital vectors. There is some legacy code in libscopehal for a "digital bus" datatype that has one std::vector<bool> per sample, but this doesn't scale for obvious reasons (a separate heap allocation per sample), ngscopeclient can't display it, and no supported filter creates or accepts it.

In order to fully support FPGA internal LAs, regular LAs, VCD import/export, integration with RTL simulators, and a multitude of other use cases we need to handle multi-bit vectors.

So the question is, what should the representation look like?

Considerations include, but are not limited to:
* GPUs naturally want to work with int32s when doing memory accesses, and have consecutive threads access consecutive memory addresses. Trying to write a stateful digital decode that makes a roughly linear pass over a signal may require a weird non-linear sample order
* We want to be efficient for both CPU and GPU processing
* We don't want a huge amount of memory overhead if we have say a 50 million point 2-bit wide vector
* Merging of N single-bit signals into one N-bit signal, or splitting one N-bit signal to N single-bit signals, should be reasonably efficient, e.g. to allow tree expansion of vectors as a bunch of rows
* We need to handle vectors as small as 2 bits for some random state variable up to 256 or 512 bits for a large AXI interface etc
* Some filter blocks, e.g. Boolean / bitwise operations, may need to generalize to arbitrarily wide vectors. Others, like a decode for a specific protocol, may only need to account for a fixed list of sizes (say 16/32/64/128/256/512) or even a single size.
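To put rough numbers on the memory-overhead point: for that 50 million point, 2-bit vector, the storage cost differs by an order of magnitude depending on packing. A quick back-of-envelope (the function names are illustrative only, not anything in libscopehal):

```cpp
#include <cstddef>
#include <cstdint>

// Storage math for a 50M-sample, 2-bit-wide vector under three
// hypothetical packings. Returns size in bytes.
constexpr size_t kSamples = 50'000'000;
constexpr size_t kWidth   = 2;    // bits per sample

// One byte per bit (like the current bool-per-sample digital format)
constexpr size_t BytePerBit()
{ return kSamples * kWidth; }                                   // 100 MB

// One int32 word per sample regardless of width
constexpr size_t WordPerSample()
{ return kSamples * sizeof(int32_t); }                          // 200 MB

// Densely packed bits, rounded up to whole 32-bit words
constexpr size_t PackedBits()
{ return ((kSamples * kWidth + 31) / 32) * sizeof(uint32_t); }  // 12.5 MB
```

So dense packing is ~8x smaller than the byte-per-bit layout here, at the cost of shift/mask work on every access.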
So like, at the lowest level:
* One bit per byte?
* One bit from each of eight channels per byte?
* Eight bits from one channel per byte?
* Do we perhaps want multiple of these for different use cases? If so, how do we convert/adapt between them?
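For concreteness, here's what fetching a single sample looks like under each of those three layouts (sketch only; the helper names and layouts are hypothetical, not existing ngscopeclient code):

```cpp
#include <cstddef>
#include <cstdint>

// Option 1: one bit per byte, one array per channel.
// chan[i] is 0 or 1 for sample i.
inline int GetBit_BytePerBit(const uint8_t* chan, size_t i)
{ return chan[i] & 1; }

// Option 2: one bit from each of eight channels per byte (interleaved).
// Byte i holds sample i for channels 0..7, channel number = bit position.
inline int GetBit_Interleaved(const uint8_t* data, size_t i, int channel)
{ return (data[i] >> channel) & 1; }

// Option 3: eight consecutive samples of one channel per byte (packed).
// Sample i of this channel lives in bit (i % 8) of byte (i / 8).
inline int GetBit_Packed(const uint8_t* chan, size_t i)
{ return (chan[i / 8] >> (i % 8)) & 1; }
```

Note how option 2 makes "read one word of the vector at time i" cheap but "scan one bit lane" strided, while option 3 is the opposite; that tension is basically the whole question.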
@azonenberg what are your top most common access patterns?

are you going to read that data or send it off to sth with a fixed interface? then maybe align yourself with that.
are you going to run operators on the data? would they work if you compacted the bits into bigger types like ints? Say checking for the value of a single bit could be done with two bitshifts, I think.
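Concretely, that two-bitshift check would look something like this (a sketch, assuming an unsigned 32-bit word so the right shift is logical rather than sign-extending):

```cpp
#include <cstdint>

// Isolate bit n of a 32-bit word with two shifts: push the target bit
// up to the MSB, then shift it back down to bit 0. Must be an unsigned
// type; a signed int would sign-extend on the right shift.
inline uint32_t BitViaShifts(uint32_t x, int n)
{ return (x << (31 - n)) >> 31; }
```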

@tammeow Things I will definitely want to do:

* Render it in a logic analyzer style hex waveform display
* Render as N separate single-bit vectors
* Do protocol operations on the parallel data as an integer

but again, I don't yet know what I will end up doing for sure because I don't even have the ability to ingest and render the data yet. the drivers that will collect it don't exist because we have no way to display said data once collected.

So the whole ecosystem of decodes and hardware support isn't there

@tammeow it's a lot easier to optimize a block you've already written than to hypothesize about what the inner loop of one you haven't envisioned yet is going to look like
@azonenberg hmm fair. so if you went for a naive implementation with wrapping for conversion, you would be able to trace which format would be more useful ig? would you at least be able to predict if you are more likely to view all channels in one go vs each on their own? if you do each on their own, then that is also the preferable memory representation. wanna get dat cache line alignment.

@tammeow I'm expecting a tree-style logic analyzer view where it defaults to the word but you can expand it to see the bits.

But the rendering will be done in a shader so I can for example fetch a single int32 for 32 samples, then render the pixels separately for each row

Maybe what I need to do is start by working backwards, make a dummy generator that creates like a 32 bit counter or something and try actually writing a rendering shader and see how it performs
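That working-backwards experiment might start with something like this on the CPU side before touching shader code (pure sketch; none of these names exist in ngscopeclient):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Dummy generator: a free-running 32-bit counter, one word per sample.
std::vector<uint32_t> MakeCounterWaveform(size_t depth)
{
    std::vector<uint32_t> samples(depth);
    for(size_t i = 0; i < depth; i++)
        samples[i] = static_cast<uint32_t>(i);
    return samples;
}

// CPU stand-in for the per-row shader fetch: bit `row` of sample `i`,
// i.e. the on/off pixel value for one row of the expanded bit view.
inline int RowPixel(const std::vector<uint32_t>& samples, size_t i, int row)
{ return (samples[i] >> row) & 1; }
```

The counter is a nice test pattern because every bit lane toggles at a different, known rate, so rendering bugs show up immediately.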

@azonenberg yeah! oh gosh i realize that this could be silli if the endianness was mixed, but we mostly live in little endianness land.

so to fetch a value it would be sth like this ig:
// n_sample from 0 to 31; i_samples32 holds 32 single-bit samples
// (shift-up-then-down only works on an unsigned type, since a signed
// right shift sign-extends -- shift-and-mask is safer)
int sample_n_value = (i_samples32 >> n_sample) & 1;