Ok fedi, you're full of game devs and weird microarchitecture experts and generally the right kind of people to ask...

I'm thinking about a generic data representation for multi-bit vectors in ngscopeclient.

Right now we support single-bit digital signals (one byte aka C++ bool per sample), analog signals (one float32 per sample), and arbitrary struct/class types (for protocol decoder output).

Notably missing is multi-bit digital vectors. There is some legacy code in libscopehal for a "digital bus" datatype that has one std::vector<bool> per sample but this... doesn't scale for obvious reasons, ngscopeclient can't display them, and no supported filter creates or accepts them.

In order to fully support FPGA internal LAs, regular LAs, VCD import/export, integration with RTL simulators, and a multitude of other use cases, we need to handle multi-bit vectors.

So the question is, what should the representation look like?

Considerations include, but are not limited to:
* GPUs naturally want to work with int32s when doing memory accesses, and have consecutive threads access consecutive memory addresses. Trying to write a stateful digital decode that makes a roughly linear pass over a signal may require a weird non-linear sample order
* We want to be efficient for both CPU and GPU processing
* We don't want a huge amount of memory overhead if we have say a 50 million point 2-bit wide vector
* Merging of N single-bit signals into one N-bit signal, or splitting one N-bit signal to N single-bit signals, should be reasonably efficient, e.g. to allow tree expansion of vectors as a bunch of rows
* We need to handle vectors as small as 2 bits for some random state variable up to 256 or 512 bits for a large AXI interface etc
* Some filter blocks, e.g. Boolean / bitwise operations, may need to generalize to arbitrarily wide vectors. Others, like a decode for a specific protocol, may only need to account for a fixed list of sizes (say 16/32/64/128/256/512) or even a single size.
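
To make the trade-offs above concrete, here is one candidate layout, purely as a sketch and not anything that exists in libscopehal today: a "bit-plane" representation where bit b of every sample lives in its own array, packed 32 samples per 32-bit word. The names `BitPlaneVector`, `Merge`, `Split` and `And` are all hypothetical, and the one-byte-per-sample input is an assumption based on the current single-bit format described earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit-plane container (NOT an existing libscopehal type).
// Plane b holds bit b of every sample, packed 32 samples per uint32_t word.
struct BitPlaneVector
{
	size_t width;                               // bits per sample
	size_t depth;                               // number of samples
	std::vector<std::vector<uint32_t>> planes;  // planes[bit][sample / 32]

	BitPlaneVector(size_t w, size_t d)
		: width(w), depth(d), planes(w, std::vector<uint32_t>((d + 31) / 32, 0))
	{}

	// Payload storage in bytes (ignores std::vector bookkeeping)
	size_t Bytes() const
	{ return width * ((depth + 31) / 32) * sizeof(uint32_t); }

	// Read bit `bit` of sample `i`
	bool GetBit(size_t bit, size_t i) const
	{ return (planes[bit][i / 32] >> (i % 32)) & 1u; }

	// Gather the low 64 bits of sample `i`; touches one word per plane
	uint64_t GetValue(size_t i) const
	{
		uint64_t v = 0;
		for(size_t b = 0; b < width && b < 64; b++)
			v |= static_cast<uint64_t>(GetBit(b, i)) << b;
		return v;
	}
};

// Merge N one-byte-per-sample signals into an N-bit vector (signal b becomes plane b)
BitPlaneVector Merge(const std::vector<std::vector<uint8_t>>& bits)
{
	assert(!bits.empty());
	BitPlaneVector out(bits.size(), bits[0].size());
	for(size_t b = 0; b < bits.size(); b++)
	{
		assert(bits[b].size() == out.depth);
		for(size_t i = 0; i < out.depth; i++)
			if(bits[b][i])
				out.planes[b][i / 32] |= 1u << (i % 32);
	}
	return out;
}

// Split: expand a single plane back to one byte per sample
std::vector<uint8_t> Split(const BitPlaneVector& v, size_t bit)
{
	std::vector<uint8_t> out(v.depth);
	for(size_t i = 0; i < v.depth; i++)
		out[i] = v.GetBit(bit, i);
	return out;
}

// Bitwise AND for any width: each output word depends only on the matching
// input words, so the inner loop maps to one GPU thread per word index.
BitPlaneVector And(const BitPlaneVector& a, const BitPlaneVector& b)
{
	assert(a.width == b.width && a.depth == b.depth);
	BitPlaneVector out(a.width, a.depth);
	for(size_t p = 0; p < a.width; p++)
		for(size_t w = 0; w < out.planes[p].size(); w++)
			out.planes[p][w] = a.planes[p][w] & b.planes[p][w];
	return out;
}
```

With this layout a 50 million point 2-bit vector needs 2 * 1562500 words, about 12.5 MB of payload, and bitwise filters generalize to any width for free. The cost is that a protocol decode wanting the whole value of one sample has to touch one word per plane, which is exactly the kind of access-pattern question that would decide between this and a packed-per-sample layout.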

@azonenberg From my point of view it would help to have some examples of the kind of access patterns you’re anticipating. Like, what’s the shape of your most bandwidth-heavy filters, what do they read, how big is their working window, and what do they write out?
(I’m coming at this from the GPU compute kernel angle. I’ve gone as far as implementing Huffman coders in shaders, so I know a little about bit twiddling on these things. I don’t have much clue about EE signal processing.)

@pmdj So part of the problem is that I don't even have much in the way of CPU based decoding working on parallel bus data right now, almost everything is serial, because we don't have the data type to represent it!

Entire classes of instrument drivers have been blocked on this.

At least to start I'd be working with buses like AXI, but I'm not sure how I would want to actually represent that as a decode yet because of the potential to have request/reply reordering.

@pmdj Like I may have to pick something plausible, implement it, start building an ecosystem of drivers and decodes, and only *then* discover something is suboptimal and revamp it