Ok fedi, you're full of game devs and weird microarchitecture experts and generally the right kind of people to ask...

I'm thinking about a generic data representation for multi-bit vectors in ngscopeclient.

Right now we support single-bit digital signals (one byte aka C++ bool per sample), analog signals (one float32 per sample), and arbitrary struct/class types (for protocol decoder output).

Notably missing is multi-bit digital vectors. There is some legacy code in libscopehal for a "digital bus" datatype that has one std::vector&lt;bool&gt; per sample but this... doesn't scale (a separate heap-backed container for every sample), ngscopeclient can't display them, and no supported filter creates or accepts them.

In order to fully support FPGA internal LAs, regular LAs, VCD import/export, integration with RTL simulators, and a multitude of other use cases, we need to handle multi-bit vectors.

So the question is, what should the representation look like?

Considerations include, but are not limited to:
* GPUs naturally want to work with int32s when doing memory accesses, and have consecutive threads access consecutive memory addresses. Trying to write a stateful digital decode that makes a roughly linear pass over a signal may require a weird non-linear sample order
* We want to be efficient for both CPU and GPU processing
* We don't want a huge amount of memory overhead if we have say a 50 million point 2-bit wide vector
* Merging of N single-bit signals into one N-bit signal, or splitting one N-bit signal to N single-bit signals, should be reasonably efficient, e.g. to allow tree expansion of vectors as a bunch of rows
* We need to handle vectors as small as 2 bits for some random state variable up to 256 or 512 bits for a large AXI interface etc
* Some filter blocks, for e.g. Boolean / bitwise operations, may need to generalize to arbitrarily wide vectors. Others, like a decode for a specific protocol, may only need to account for a fixed list of sizes (say 16/32/64/128/256/512) or even a single size.
So like, at the lowest level:
* One bit per byte?
* One bit from each of eight channels per byte?
* Eight bits from one channel per byte?
* Do we perhaps want multiple of these for different use cases? If so, how do we convert/adapt between them?
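For illustration (made-up sample values), here's what those three layouts look like in memory for a 2-bit-wide signal with eight samples taking the values 0,1,2,3,0,1,2,3:

```cpp
#include <array>
#include <cstdint>

// bit 0 (LSB) per sample: 0,1,0,1,0,1,0,1
// bit 1 (MSB) per sample: 0,0,1,1,0,0,1,1

// Option A: one bit per byte, sample-major (2 bytes per sample, 16 bytes total)
std::array<uint8_t, 16> optionA = {0,0, 1,0, 0,1, 1,1, 0,0, 1,0, 0,1, 1,1};

// Option B: one bit from each channel per byte -- all bits of sample k packed
// into the low bits of byte k (one byte per sample, 8 bytes total)
std::array<uint8_t, 8> optionB = {0, 1, 2, 3, 0, 1, 2, 3};

// Option C: eight consecutive samples of one channel per byte, channel-major,
// LSB = earliest sample (2 bytes total)
std::array<uint8_t, 2> optionC = {0xAA, 0xCC};
```

Option B is the "uint per sample" shape narrowed to a byte; option C is the densest but puts consecutive samples of one bit lane in one byte, which is the layout that forces packing/unpacking when a filter wants whole samples.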

@azonenberg You would have fewer and more efficient memory accesses if many bits were packed into a byte. Do GPUs have intrinsics for packing/unpacking? Could SIMD play a role?

Maybe some benchmarking is required on a specific test case. Are you thinking a 2 bit vector would be one bit for two channels, or would it be two sequential bits for one channel?

@0h00000000 I hope not to be doing packing/unpacking at all, which is why I'm thinking a 32-bit vector should map to one uint32 per sample, rather than one sample storing 32 consecutive values from one channel.

Most of the time, when you're working on parallel data, it's logically an N-bit word.