I'm like 30% sure SDL3 is not the problem, or at least not the only problem, because I tried resetting the streams every frame with SDL_ClearAudioStream and it still accumulates latency (in addition to now sounding atrocious due to missing samples).
I've also seen this happen with pipewire before in other situations, and it was resolved by bypassing pipewire.
ok I did it. I've got a program that writes a pipewire stream of F64 audio samples where each sample is the total elapsed time since the first frame, expressed in minutes.
I've got a second program that reads that pipewire stream and checks the offset against its own elapsed time since the first sample processed. This program prints out the calculated drift every second.
The results are interesting.
In the first version of this, both programs just measured the time using std::chrono::steady_clock::time_point. This resulted in an oscillating drift that was well under a millisecond at its peak and nothing to be concerned about.
This is good! That means there's no place whatsoever within pipewire on my computer, for this specific audio setup, where any intermediary buffers might be growing and adding more latency as the programs run.
This is not the interesting case.
In the second version, I changed the first program to instead calculate elapsed time as the frame number * the sampling interval, and left the second program alone.
In this version, the calculated drift is essentially the difference between the progress through the stream vs the amount of time that actually passed from the perspective of the observer. In this version, the amount of drift rises gradually. It seems the stream is advancing just a touch faster than it should.
The samples in the stream are reporting that more time has elapsed in the "recording" than actually has transpired according to the clock. The amount of drift accumulated seems to be a millisecond every few minutes.
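For scale, that drift rate pencils out to a clock mismatch of only a few parts per million. A quick back-of-envelope (assuming "a millisecond every few minutes" means roughly 1 ms per 3 minutes, which is my reading, not a measured number):

```python
# Back-of-envelope: ~1 ms of drift accumulated every ~3 minutes implies the
# sample clock and the wall clock disagree by only a few parts per million.
drift_seconds = 1e-3          # ~1 ms of accumulated drift (assumed)
interval_seconds = 3 * 60     # over roughly 3 minutes (assumed)
ppm = drift_seconds / interval_seconds * 1e6
print(f"clock mismatch ~= {ppm:.1f} ppm")   # ~5.6 ppm

# At 48 kHz that's roughly this many "extra" samples per hour:
extra_per_hour = 48000 * 3600 * ppm / 1e6
print(f"~{extra_per_hour:.0f} surplus samples/hour at 48 kHz")
```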
I'm honestly not sure what to make of that.
I think my conclusions from this are
1. the latency drift I observed with my experiments with pipewire today is probably inconsequential.
2. there is probably nothing sinister about pipewire.
3. if you have a chain of nodes that are a mix of push- and pull-driven and have different buffering strategies, you are in the Cool Zone
4. my program is probably going to have to handle "leap samples" in some situations. I admit I wasn't expecting that, but it feels obvious in retrospect.
that or I'm just good at creating wizard problems for myself. either way I'm in a good mood.
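The "leap samples" bookkeeping from conclusion 4 is at least cheap. A minimal sketch (names and the 5.56 ppm figure are illustrative, derived from the drift rate above, not from anything measured precisely):

```python
# Hypothetical "leap sample" scheduler: given a measured clock mismatch in
# ppm, compute how often one sample must be dropped (or duplicated) to keep
# the stream aligned with the observer's clock. All names illustrative.
def leap_sample_interval(ppm: float, sample_rate: int) -> float:
    """Seconds between single-sample corrections."""
    samples_between_corrections = 1e6 / ppm   # one surplus sample per this many
    return samples_between_corrections / sample_rate

print(leap_sample_interval(5.56, 48000))  # ~3.75 s between single-sample nudges
```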
some small problems with this system:
1. I've had to turn down the sampling rate so I can convolve longer samples. 22050 Hz works out ok for what I've been messing with so far, though, so maybe it's not that big a deal. longer samples kinda make things muddy anyway
2. now I want to do multiple convolutions at once and layer things and that's probably not happening on this hardware XD
@aeva The actual reason for that was almost certainly memory access patterns. Thread invocations in PS waves are generally launched and packed to have nice memory access patterns (as much as possible), compute waves and invocations launch more or less in order and micro-managing memory access is _your_ problem.
This really matters for 2D because there's lots of land mines there wrt ordering, but for 1D, not so much.
@aeva To give a concrete example: suppose you're doing some simple compute shader where all you're doing is
cur_pixel = img.load(x, y)
processed = f(cur_pixel, x, y)
img.store(x, y, processed)
and you're dispatching 16x16 thread groups, (x,y) = DispatchThreadID, yada yada, all totally vanilla, right?
@aeva well, suppose we're working in 32-thread waves internally (totally hypothetical number)
now those 32 invocations get (in the very first thread group) x=0,...,15 for y=0 and then y=1.
Say the image is R8G8B8A8 pixels and the internal image layout stores aligned groups of 4 texels next to each other and then goes to the next y, and the next 4-wide strip of texels is actually stored something like 256 bytes away or whatever.
@aeva so, x=0,..,3 y=0 are all good, these are all adjacent, straight shot, read 16 consecutive bytes, great.
x=0,...,3 y=1 in threads 16..19 are also good, these are the next 16 bytes in memory.
But if we have 256-byte cache lines (another Totally Hypothetical Number), well, those 32 bytes are all we get.
x=4,..,7 for y=0 and 1 are in the cache line at offset 256, x=8,...,11 for y=0,1 at offset 512, x=12,...,15 at offset 768.
@aeva And caches are usually built to have multiple "banks" that each handle a fraction of a cache line. Let's say our hypothetical cache has 16 16-byte banks to cover each 256B cache line.
Well, all the requests we get from that nice sequential load go into the first 2 banks and the rest gets nothing.
So that's lopsided and causes problems, and will often mean you lose a lot of your potential cache bandwidth because you only actually get that if your requests are nicely distributed over mem.
@aeva long story short, this whole thing with your thread groups being a row-major array of 16x16 pixels can kind of screw you over, if the underlying image layout is Not Like That.
This happens all the time.
Ordering and packing of PS invocations into waves is specifically set up by the GPU vendor to play nice with whatever memory pipeline, caches, and texture/surface layouts it has.
In CS, all of that is Your Job, generally given no information about the real memory layout.
Good luck!
@aeva If you do know what the real memory layout is, you can make sure consecutive invocations have nice memory access patterns, but outside consoles (where you often get those docs), eh, good luck with that.
The good news is that with 1D, this problem doesn't exist, because 1D data is sequential everywhere.
So as long as you're making sure adjacent invocations grab adjacent indices, your memory access patterns are generally fine.
(Once you do strided, you're back in the danger zone.)
As part of my GDC 2019 session, Optimizing DX12/DXR GPU Workloads using Nsight Graphics: GPU Trace and the Peak-Performance-Percentage (P3) Method, I presented an optimization technique named thread…
@aeva I don't know what value Slice has with the sizes you pass in, but it would be really bad if Slice works out to be some medium to large power of 2.
The issue is that the "i" loop steps through sequential samples, but from invocation to invocation (which is the dimension that matters) the loads inside are strided to be "Slice" elements apart.
You really want that to be the other way round. Ideally sequential loads between invocations.
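The cost of getting this backwards is easy to put numbers on. A sketch counting how many (hypothetical, 256-byte) cache lines one 32-wide wave touches for a single load, with 4-byte elements; the 1024 stride is just an example of a "medium power of 2":

```python
# Illustration of why stride between invocations matters: count how many
# hypothetical 256-byte cache lines a 32-wide wave touches for one load,
# comparing sequential vs strided indexing across invocations. 4-byte elems.
LINE = 256
ELEM = 4

def lines_touched(stride_elems: int, wave: int = 32) -> int:
    lines = {(tid * stride_elems * ELEM) // LINE for tid in range(wave)}
    return len(lines)

print(lines_touched(1))      # sequential: all 32 loads fit in 1 cache line
print(lines_touched(1024))   # stride 1024 elements: 32 separate lines
```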
@aeva separately, you don't want that % SizeA in there; it doesn't have to be bad, but it can be, and I don't know how good shader compilers are about optimizing induction variables like that
might want to keep that as an actual counter and just do (in the modified loop)
j += GROUP_SIZE;
j -= (j >= SizeA) ? SizeA : 0;
(you also need SizeA >= GROUP_SIZE now, but I don't think that changes anything in your case)
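A quick sanity check that the branchless conditional-subtract form really matches the modulo whenever SizeA >= GROUP_SIZE (the values below are arbitrary test picks, not anything from the actual shader):

```python
# Verify: j += GROUP_SIZE; j -= (j >= SizeA) ? SizeA : 0;
# is equivalent to (j + GROUP_SIZE) % SizeA when SizeA >= GROUP_SIZE.
GROUP_SIZE = 64   # arbitrary example group size

def step(j: int, size_a: int) -> int:
    j += GROUP_SIZE
    if j >= size_a:        # one conditional subtract suffices: j < 2*size_a
        j -= size_a
    return j

for size_a in (64, 100, 1000, 4096):
    for j in range(0, size_a, 7):
        assert step(j, size_a) == (j + GROUP_SIZE) % size_a
print("ok")
```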
@aeva even on a GPU, if you do enough MADs per sample eventually you're going to be compute bound with this approach, but I'd be shocked if you were anywhere close to that right now.
First-order it's going to be all futzing with memory access.
@aeva I mean, you can literally do the math!
If you're on a GPU, then even on a mobile GPU from several years ago, you're in the TFLOP/s range by now for actual math.
So, ballpark 1e12 MADs per second.
48kHz stereo is ballpark 1e5 samples per second.
Math-wise, that means you can in theory do 1e7 MADs per sample, enough for brute-force direct convolution with a >3 minute IR. You're probably not doing that.
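Writing that back-of-envelope out explicitly (same ballpark numbers as the post, nothing measured):

```python
# The back-of-envelope math from the post, written out:
mads_per_second = 1e12          # ~1 TFLOP/s-class GPU, counting MADs
samples_per_sec = 48_000 * 2    # 48 kHz stereo ~= 1e5 samples/s
mads_per_sample = mads_per_second / samples_per_sec
print(f"{mads_per_sample:.2e} MADs available per output sample")

# Brute-force direct convolution needs one MAD per IR sample, so that
# budget covers an impulse response of this many seconds (at 48 kHz):
ir_seconds = mads_per_sample / 48_000
print(f"~{ir_seconds / 60:.1f} minute impulse response, brute force")
```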
@aeva You can always do better convolution algs, but even for brute-force, the math is just not the problem for IR sizes you're likely using.
But as written in your code, you also have two loads for every MAD, and there's nowhere near that level of load bandwidth available, not even if it's all L1 hits.
Making it sequential across invocations should help noticeably. But beyond that, you'll need to save loads.
@aeva I don't know about ideal, but there definitely is some mileage to be had in loading one of the two into registers/shared memory in blocks, double-buffering the next fetch, and having only one load via L1/tex in the inner loop.
That said the better FFT-based conv kinda nukes that.
Good news: FFT-based conv kinda automatically exploits all the sharing for you!
Bad news: that means you're now down to loading and using each IR FFT coeff exactly once.
@aeva It is work-efficient and gets both your load count and your mul-add count way down, but it also means what's left is kinda BW bound by construction and there's not much you can do about it.
(I mean you can make the block sizes small enough that you're still summing tons of terms again, but that's defeating the purpose.)
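To make the "automatic sharing" concrete, here's a toy block convolution through a textbook radix-2 FFT, checked against the direct form. Pure stdlib and deliberately tiny; a real implementation would use a proper FFT library plus overlap-add/save, and this is a sketch of the idea, not anyone's actual code:

```python
import cmath

def fft(a, invert=False):
    """Textbook recursive radix-2 Cooley-Tukey; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd  = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft_convolve(x, h):
    n = 1
    while n < len(x) + len(h) - 1:
        n *= 2
    X = fft(list(x) + [0] * (n - len(x)))
    H = fft(list(h) + [0] * (n - len(h)))   # each IR coefficient touched once
    Y = [a * b for a, b in zip(X, H)]       # pointwise product replaces O(n^2) MADs
    y = fft(Y, invert=True)
    return [v.real / n for v in y[: len(x) + len(h) - 1]]

def direct_convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, -0.25, 0.125]
a, b = fft_convolve(x, h), direct_convolve(x, h)
print(max(abs(p - q) for p, q in zip(a, b)))   # tiny (~1e-15): they agree
```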
@aeva thanks! That does seem lower than what I remember getting… though probably still means that for the amount of computation we usually have to do in a frame, it's CPU all the way
unless maybe in The Future we make the model a lot bigger
@aeva yeah, I think that's still the conclusion
which is unfortunate because I would really like to be paid to do that! but it is difficult to argue for when, like, even if you implement everything very well, the roundtrip already has highs that would (in the presence of the rest of the audio stack) cause issues. like, right now our model runs, on a weakish desktop, with 2ms averages and 3 to 4ms highs, for a 10ms frame, and that's already kind of as high as we dare going
@aeva noise / echo / reverb suppression and/or speech enhancement in various different configurations, for voice calling. so essentially "run smallish neural network on audio fast"
and yeah sure we can make the task arbitrarily more complex by making the model larger but then we need to do annoying things like justify the added compute by showing a major quality improvement. maybe requirements will do it for us eventually if someone decides we must have fullband stereo audio or something
@rygorous @aeva Not sure if still relevant, but if bandwidth is still an issue you could try 16 bit fixed point or 24 bit fixed point and unpack to float in the shader.
I would expect audio devices to operate on fixed point natively to begin with so some API working in floats probably is adding conversion overhead.
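A minimal sketch of that packing idea, assuming Q15-style 16-bit fixed point (the scaling convention and function names here are illustrative; a shader would do the equivalent unpack per fetched element, halving the bytes moved versus 32-bit float):

```python
# Store samples as signed 16-bit fixed point, expand to float on load.
def pack_q15(x: float) -> int:
    """Clamp a [-1, 1) float to a signed 16-bit Q15 value (as raw u16 bits)."""
    v = int(round(max(-1.0, min(x, 32767 / 32768)) * 32768))
    return v & 0xFFFF

def unpack_q15(v: int) -> float:
    if v >= 32768:          # sign-extend the 16-bit value
        v -= 65536
    return v / 32768.0

for x in (-1.0, -0.5, 0.0, 0.25, 0.9999):
    err = abs(unpack_q15(pack_q15(x)) - x)
    assert err <= 1 / 32768   # worst-case quantization error: one LSB
print("round-trip error within 1/32768")
```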