Ok fedi, you're full of game devs and weird microarchitecture experts and generally the right kind of people to ask...

I'm thinking about a generic data representation for multi-bit vectors in ngscopeclient.

Right now we support single-bit digital signals (one byte aka C++ bool per sample), analog signals (one float32 per sample), and arbitrary struct/class types (for protocol decoder output).

Notably missing are multi-bit digital vectors. There is some legacy code in libscopehal for a "digital bus" datatype with one std::vector<bool> per sample, but this doesn't scale for obvious reasons (a separately allocated container per sample), ngscopeclient can't display it, and no supported filter creates or accepts it.

In order to fully support FPGA internal LAs, regular LAs, VCD import/export, integration with RTL simulators, and a multitude of other use cases, we need to handle multi-bit vectors.

So the question is, what should the representation look like?

Considerations include, but are not limited to:
* GPUs naturally want to work with int32s when doing memory accesses, and want consecutive threads to access consecutive memory addresses (coalescing). Writing a stateful digital decode that makes a roughly linear pass over a signal may therefore require a weird non-linear sample order
* We want to be efficient for both CPU and GPU processing
* We don't want a huge amount of memory overhead if we have say a 50 million point 2-bit wide vector
* Merging of N single-bit signals into one N-bit signal, or splitting one N-bit signal to N single-bit signals, should be reasonably efficient, e.g. to allow tree expansion of vectors as a bunch of rows
* We need to handle vectors as small as 2 bits for some random state variable up to 256 or 512 bits for a large AXI interface etc
* Some filter blocks, for e.g. Boolean / bitwise operations, may need to generalize to arbitrarily wide vectors. Others, like a decode for a specific protocol, may only need to account for a fixed list of sizes (say 16/32/64/128/256/512) or even a single size.
@azonenberg For GPU you can have one channel per GPU thread, then the whole threadgroup loads one 32-bit value at a time and each thread works on a different bit. If you have more than 32 channels, you give your GPU code a stride for accessing successive samples.

@crzwdjk That presumes a lot of things about application architecture. If you have 32 channels on a parallel bus, 99% of the time I expect you would be trying to do 32-bit integer operations on the entire value and not breaking it up.

So if you had a warp of 64 threads, you'd be processing 64 separate (not necessarily logically consecutive) 32-bit samples concurrently, one per thread.

But we'd need to think about thread patterns for decoding things that scale to thousands or tens of thousands of threads.