Ok fedi, you're full of game devs and weird microarchitecture experts and generally the right kind of people to ask...

I'm thinking about a generic data representation for multi-bit vectors in ngscopeclient.

Right now we support single-bit digital signals (one byte aka C++ bool per sample), analog signals (one float32 per sample), and arbitrary struct/class types (for protocol decoder output).

Notably missing is multi-bit digital vectors. There is some legacy code in libscopehal for a "digital bus" datatype that has one std::vector<bool> per sample but this... doesn't scale for obvious reasons, ngscopeclient can't display them, and no supported filter creates or accepts them.

In order to fully support FPGA internal LAs, regular LAs, VCD import/export, integration with RTL simulators, and a multitude of other use cases, we need to handle multi-bit vectors.

So the question is, what should the representation look like?

Considerations include, but are not limited to:
* GPUs naturally want to work with int32s when doing memory accesses, and have consecutive threads access consecutive memory addresses. Trying to write a stateful digital decode that makes a roughly linear pass over a signal may require a weird non-linear sample order
* We want to be efficient for both CPU and GPU processing
* We don't want a huge amount of memory overhead if we have say a 50 million point 2-bit wide vector
* Merging of N single-bit signals into one N-bit signal, or splitting one N-bit signal to N single-bit signals, should be reasonably efficient, e.g. to allow tree expansion of vectors as a bunch of rows
* We need to handle vectors as small as 2 bits for some random state variable up to 256 or 512 bits for a large AXI interface etc
* Some filter blocks, for e.g. Boolean / bitwise operations, may need to generalize to arbitrarily wide vectors. Others, like a decode for a specific protocol, may only need to account for a fixed list of sizes (say 16/32/64/128/256/512) or even a single size.
So like, at the lowest level
* One bit per byte?
* One bit from each of eight channels per byte?
* Eight bits from one channel per byte?
* Do we perhaps want multiple of these for different use cases? If so, how do we convert/adapt between them?
@azonenberg Amiga people know all about "chunky to planar" conversion
@azonenberg what are your top most common access patterns?

are you going to read that data, or send it off to something with a fixed interface? then maybe align yourself with that.
are you going to run operators on the data? would they work if you compacted the bits into bigger types like ints? checking the value of a single bit could be done with two bit shifts, I think.
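
The double-shift bit test mentioned here does work; the usual shift-and-mask idiom is equivalent. A minimal sketch (unsigned types so the right shift is logical):

```cpp
#include <cassert>
#include <cstdint>

// Isolate bit n of a packed 32-bit word via two shifts: push bit n up to
// bit 31, then shift it back down to bit 0 (word is unsigned, so both
// shifts are logical)
uint32_t BitViaShifts(uint32_t word, unsigned n)
{
    return (word << (31 - n)) >> 31;
}

// Equivalent shift-and-mask form
uint32_t BitViaMask(uint32_t word, unsigned n)
{
    return (word >> n) & 1;
}
```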

@tammeow Things I will definitely want to do:

* Render it in a logic analyzer style hex waveform display
* Render as N separate single-bit vectors
* Do protocol operations on the parallel data as an integer
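
For the hex waveform display in the first bullet, each packed sample needs to be formatted with the right number of nibbles for the bus width. A hypothetical helper, sketched for widths up to 32 bits:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format one packed sample as zero-padded hex, one nibble per 4 bits of
// bus width (hypothetical helper, widths up to 32 bits)
std::string FormatSampleHex(uint32_t sample, unsigned widthBits)
{
    int nibbles = (widthBits + 3) / 4;
    char buf[16];
    snprintf(buf, sizeof(buf), "%0*X", nibbles, (unsigned)sample);
    return buf;
}
```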

but again, I don't yet know what I will end up doing for sure because I don't even have the ability to ingest and render the data yet. the drivers that will collect it don't exist because we have no way to display said data once collected.

So the whole ecosystem of decodes and hardware support isn't there

@tammeow it's a lot easier to optimize a block you've already written than to hypothesize about what the inner loop of one you haven't envisioned yet is going to look like
@azonenberg hmm fair. so if you went for a naive implementation with wrapping for conversion, you would be able to trace which format would be more useful ig? would you at least be able to predict if you are more likely to view all channels in one go vs each on their own? if you do each on their own, then that is also the preferable memory representation. wanna get dat cache line alignment.

@tammeow I'm expecting a tree-style logic analyzer view where it defaults to the word but you can expand it to see the bits.

But the rendering will be done in a shader so I can for example fetch a single int32 for 32 samples, then render the pixels separately for each row

Maybe what I need to do is start by working backwards, make a dummy generator that creates like a 32 bit counter or something and try actually writing a rendering shader and see how it performs

@azonenberg yeah! oh gosh, i realize this could be silly if the endianness were mixed, but we mostly live in little-endian land.

so to fetch a value it would be something like this, i guess:
// n_sample from 0 to 31, counting from the LSB
// (i_samples32 must be unsigned so the right shift is logical)
int sample_n_value = (i_samples32 << (31 - n_sample)) >> 31;

@azonenberg You would have fewer and more efficient memory accesses if many bits were packed into a byte. Do GPUs have intrinsics for packing/unpacking? Could SIMD play a role?

Maybe some benchmarking is required on a specific test case. Are you thinking a 2 bit vector would be one bit for two channels, or would it be two sequential bits for one channel?

@0h00000000 I hope to not be doing packing/unpacking at all, hence why I am thinking a 32 bit vector should map to a uint32 per sample rather than one sample storing 32 consecutive values from one channel.

Most of the time you are working on parallel data it's logically a N-bit word
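
With one word per sample, a width-agnostic bitwise filter collapses to a trivial loop. A sketch for widths up to 32 bits (wider buses would iterate over multiple 32-bit slices per sample):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Bitwise AND of two packed vector signals; the same inner loop works for
// any bus width up to 32 since unused high bits just stay zero
std::vector<uint32_t> BitwiseAnd(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b)
{
    std::vector<uint32_t> out(std::min(a.size(), b.size()));
    for(size_t i = 0; i < out.size(); i++)
        out[i] = a[i] & b[i];
    return out;
}
```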

@azonenberg I think I would round the number of channels up to the next power-of-two size in bytes, and then store one sample of all the channels per byte/word/whatever
@rebelmike that's kinda what I'm thinking is probably the best route, maybe with striding or interleaving for >32 bits for better GPU memory locality
@azonenberg From my point of view it would help to have some examples of the kind of access patterns you’re anticipating. Like, what’s the shape of your most bandwidth-heavy filters, what do they read, how big is their working window, and what do they write out?
(I’m coming at this from the GPU compute kernel angle. I’ve gone as far as implementing Huffman coders in shaders, so I know a little about bit twiddling on these things. I don’t have much clue about EE signals processing.)

@pmdj So part of the problem is that I don't even have much in the way of CPU based decoding working on parallel bus data right now, almost everything is serial, because we don't have the data type to represent it!

Entire classes of instrument drivers have been blocked on this.

At least to start I'd be working with buses like AXI, but I'm not sure how I would want to actually represent that as a decode yet because of the potential for request/reply reordering.

@pmdj Like I may have to pick something plausible, implement it, start building an ecosystem of drivers and decodes, and only *then* discover something is suboptimal and revamp it
@azonenberg At the risk of stating the obvious, a good starting point may be storing samples larger than 32 bits in SOA-style arrays of 32-bit slices. These arrays can of course be located consecutively in the same buffer. It may be easier to do this than using some kind of chunked ordering which requires special striding logic depending on sample size.
For <32b samples, you can have each thread read a uchar4 or ushort2, then redistribute components among threads with SIMD shuffles.
@azonenberg For GPU you can have one channel per GPU thread, then the whole threadgroup loads one 32-bit value at a time and each thread works on a different bit. If you have more than 32 channels, you're looking at giving your GPU code a stride for accessing successive samples.

@crzwdjk That presumes a lot of things about application architecture. If you have 32 channels on a parallel bus, 99% of the time I expect you would be trying to do 32-bit integer operations on the entire value and not breaking it up.

So if you had a warp of 64 threads, you'd be processing 64 separate (not necessarily logically consecutive) 32-bit samples concurrently, one per thread.

But we'd need to think about thread patterns for decoding things that scale to thousands or tens of thousands of threads.

@azonenberg oh hey, I was just thinking about this kinda thing :v
@lethalbit Well, I would love to hear your thoughts. Maybe this can be a topic for a future developer call as well (we're overdue for one, but I've been swamped and haven't had time to even schedule one lately).
@azonenberg this sounds like something @dysfun would have Some Opinions about