@aeva And caches are usually built to have multiple "banks" that each handle a fraction of a cache line. Let's say our hypothetical cache has 16 16-byte banks to cover each 256B cache line.

Well, all the requests we get from that nice sequential load go into the first 2 banks and the rest gets nothing.

So that's lopsided and causes problems, and will often mean you lose a lot of your potential cache bandwidth because you only actually get that if your requests are nicely distributed over mem.
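
A quick sketch of that lopsidedness, using the hypothetical 16-bank / 256B-line cache above (the 32-byte request size is a made-up example):

```python
LINE_BYTES, BANKS = 256, 16
BANK_BYTES = LINE_BYTES // BANKS   # 16 bytes per bank

def bank(addr):
    # which bank within its cache line this byte lands in
    return (addr % LINE_BYTES) // BANK_BYTES

# a nice sequential load covering the first 32 bytes of a line
touched = {bank(a) for a in range(32)}
print(sorted(touched))   # [0, 1] -- the other 14 banks sit idle
```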

@aeva long story short, this whole thing with your thread groups being a row-major array of 16x16 pixels can kind of screw you over, if the underlying image layout is Not Like That.

This happens all the time.

Ordering and packing of PS invocations into waves is specifically set up by the GPU vendor to play nice with whatever memory pipeline, caches, and texture/surface layouts it has.

In CS, all of that is Your Job, generally given no information about the real memory layout.

Good luck!

@aeva If you do know what the real memory layout is, you can make sure consecutive invocations have nice memory access patterns, but outside consoles (where you often get those docs), eh, good luck with that.

The good news is that with 1D, this problem doesn't exist, because 1D data is sequential everywhere.

So as long as you're making sure adjacent invocations grab adjacent indices, your memory access patterns are generally fine.

(Once you do strided, you're back in the danger zone.)

@aeva also I want to emphasize that this Purely Hypothetical Example with row-major invocation layout in CS vs. a column-heavy layout in the HW is of course entirely hypothetical and in no way inspired by real events such as https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/

@rygorous that sounds likely. I don't think I accounted for memory layout of the texture. I assume this is also why Epic seems to be so fond of putting everything in scan line order these days?
@rygorous so, my program as written is two linear memory reads, some basic arithmetic, and some wave ops. I think it should be pretty cache efficient, or at least I don't have any obvious ideas for making it more so. I would think all the extra raster pipeline stuff would not be worth it, but the opportunity to move one of the loads into an earlier shader stage to effectively make it invariant across the wave and make use of the ROP to implement most of the dot product seems maybe worthwhile?
@rygorous the ROP is, like, free math, right?
@aeva Not really. The "math" may be free but the queue spots are not, and you'll likely end up waiting longer in the shader to get to emit your output than you would've spent just doing the math directly.
@aeva Looking at the shader you posted yesterday (?) at https://github.com/Aeva/convolver/blob/excelsior/src/convolver.cs.glsl, you're in the Danger Zone(tm)
@aeva the issue is SliceStart is derived from LaneIndex (Subgroup invocation ID) which is then multiplied by Slice

@aeva I don't know what value Slice has with the sizes you pass in, but it would be really bad if Slice works out to be some medium to large power of 2.

The issue is that the "i" loop goes through sequential samples within an invocation, but from invocation to invocation (which is the dimension that matters), the loads inside are strided to be "Slice" elements apart.

You really want that to be the other way round. Ideally sequential loads between invocations.

@aeva so basically, try making the loop be "for (i = LaneIndex; i < SizeB; i += GROUP_SIZE)" and poof, suddenly those loads are mostly-sequential invocation to invocation instead of always hitting a few dozen cache lines
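
To make the strided-vs-sequential difference concrete, here's a rough model counting how many distinct cache lines a single loop iteration touches across a 32-wide group (the Slice value and line size are made-up numbers):

```python
GROUP_SIZE = 32
SLICE = 1024        # hypothetical value of Slice, in elements
LINE_ELEMS = 32     # 128-byte cache line / 4-byte elements

# original loop: on iteration i, lane L loads element L*SLICE + i
strided = {(lane * SLICE + 0) // LINE_ELEMS for lane in range(GROUP_SIZE)}

# rewritten "for (i = LaneIndex; i < SizeB; i += GROUP_SIZE)":
# on its first iteration, lane L simply loads element L
sequential = {lane // LINE_ELEMS for lane in range(GROUP_SIZE)}

print(len(strided), len(sequential))   # 32 cache lines vs 1
```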

@aeva separately, you don't want that % SizeA in there; it doesn't have to be bad, but it can be, and I don't know how good shader compilers are at optimizing induction variables like that

might want to keep that as an actual counter and just do (in the modified loop)

j += GROUP_SIZE;
j -= (j >= SizeA) ? SizeA : 0;

(you also need SizeA >= GROUP_SIZE now, but I don't think that changes anything in your case)
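
The counter trick stays equivalent to the modulo precisely because, with SizeA >= GROUP_SIZE, j can only overshoot SizeA by less than one SizeA per step, so a single conditional subtract suffices. A quick check (all sizes made up):

```python
GROUP_SIZE = 32
SizeA = 100           # must be >= GROUP_SIZE for this to hold
lane = 7
j = lane
for step in range(1000):
    # counter matches the % SizeA form at every iteration
    assert j == (lane + step * GROUP_SIZE) % SizeA
    j += GROUP_SIZE
    j -= SizeA if j >= SizeA else 0
print("counter matches % SizeA at every step")
```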

@aeva even on a GPU, if you do enough MADs per sample eventually you're going to be compute bound with this approach, but I'd be shocked if you were anywhere close to that right now.

First-order it's going to be all futzing with memory access.

@aeva I mean, you can literally do the math!

If you're on a GPU, then even on a mobile GPU from several years ago, you're in the TFLOP/s range by now for actual math.

So, ballpark 1e12 MADs per second.

48kHz stereo is ballpark 1e5 samples per second.

Math-wise, that means you can in theory do 1e7 MADs per sample, enough for brute-force direct convolution with a >3 minute IR. You're probably not doing that.
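
Roughly that arithmetic, spelled out (all numbers are ballpark, as in the post):

```python
mads_per_sec = 1e12                  # ~1 TFLOP/s of usable MADs
sample_rate = 48_000
samples_per_sec = sample_rate * 2    # 48 kHz stereo

mads_per_sample = mads_per_sec / samples_per_sec   # ~1e7
# brute-force direct convolution costs one MAD per IR tap per
# output sample, so that budget is also the affordable IR length
ir_seconds = mads_per_sample / sample_rate
print(round(ir_seconds / 60, 1))     # ~3.6 minutes of IR
```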

@aeva You can always do better convolution algs, but even for brute-force, the math is just not the problem for IR sizes you're likely using.

But as written in your code, you also have two loads for every MAD, and there's nowhere near that level of load bandwidth available, not even if it's all L1 hits.

Making it sequential across invocations should help noticeably. But beyond that, you'll need to save loads.

@rygorous huh. so is the ideal pattern something like out[0...N] += IR[0] * in[0...N], where the IR[0] is loaded once, and you basically just peel the first MAD for each sample being processed at once, and then do it all again for IR[1] etc. And the += would have to be an atomic add 🤔

@aeva I don't know about ideal, but there definitely is some mileage to be had in loading one of the two into registers/shared memory in blocks, double-buffering the next fetch, and having only one load via L1/tex in the inner loop.
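
A rough load-count model of that blocking idea (group size, block size, and MAD count are all made up; this only counts loads, not the double-buffering mechanics):

```python
GROUP_SIZE = 32
BLOCK = 64          # hypothetical shared-memory staging block size
n_mads = 4096       # MADs per invocation, made up

# naive inner loop: both operands come from memory on every MAD
naive_loads = 2 * n_mads

# blocked: the group cooperatively stages BLOCK coefficients into
# shared memory, so staging costs BLOCK loads spread over the
# GROUP_SIZE lanes; each MAD then reads one operand from memory
# and one from shared memory
blocks = n_mads // BLOCK
staging = blocks * (BLOCK // GROUP_SIZE)   # per-lane staging loads
blocked_loads = n_mads + staging

print(naive_loads, blocked_loads)   # roughly 2x fewer L1/tex loads
```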

That said the better FFT-based conv kinda nukes that.

Good news: FFT-based conv kinda automatically exploits all the sharing for you!

Bad news: that means you're now down to loading and using each IR FFT coeff exactly once.

@aeva It is work-efficient and gets both your load count and your mul-add count way down, but it also means what's left is kinda BW bound by construction and there's not much you can do about it.

(I mean you can make the block sizes small enough that you're still summing tons of terms again, but that's defeating the purpose.)
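
Back-of-envelope operation counts for that tradeoff (IR length and block size are hypothetical; the FFT costs are the usual rough N log N estimates, not exact):

```python
import math

sample_rate = 48_000
ir_len = 2 * sample_rate   # hypothetical 2-second IR, in taps
block = 4096               # overlap-add/partition block size

# brute force: one MAD per IR tap per output sample
direct_mads = ir_len

# partitioned FFT convolution with IR spectra precomputed: per
# output sample, amortize one forward + one inverse FFT of size
# 2*block over the block, plus one complex multiply-accumulate
# per partition (~4 real MADs each)
n = 2 * block
fft_mads = 2 * n * math.log2(n) / block
pointwise_mads = 4 * (ir_len / block)

print(int(direct_mads), int(fft_mads + pointwise_mads))
```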

@rygorous ok weird thing happened just now, I gave your suggestion about changing the iteration order another try and did notice the worst-case runs were cheaper while the average was about the same (this is not the weird part), but then I dropped GROUP_SIZE (sets both the work group size and required subgroup size) from 32 to 16 and the average time went from 7.7 ms to 6.175 ms and the best recorded time went from 6.8 ms to 1.8 ms.
@aeva this is a Side Question, do you have any idea how much of that is just fixed costs that don’t depend on the amount of compute at all? I‘m wondering because last time we looked at „should we run our audio processing on Not CPU“ the conclusion was a clear nope, latency to dispatch any work in 10ms batches already kills us, but likely a lot / most (?) of that was tflite not being set up for this type thing rather than the underlying systems, and we never had time to dig deeper than that
@halcy I'm seeing round trip times as low as 1ms with an average of about 6. I'm using a UMA GPU though, which lets me leave memory mapped to both the GPU and CPU. Most of my perf is down to how much I can juice the compute shader and bringing down the timing variance caused by synchronization scheduler noise. Right now I have to leave 2 ms of head room or the audio becomes unstable, so my latency is roughly 8 ms.

@aeva thanks! That does seem lower than what I remember getting…. though probably still means that for the amount of computation we usually have to do in a frame, it‘s CPU all the way

unless maybe in The Future we make the model a lot bigger

@halcy I don't think audio processing on the GPU is worthwhile unless you're doing an absurdly expensive transform like what I'm doing or you're able to dispatch a large enough batch to saturate the GPU. there's a sweet spot where it is faster to do the round trip to the GPU, because time is convoluted

@aeva yeah, I think that‘s still the conclusion

which is unfortunate because I would really like to be paid to do that! but it is difficult to argue for when like, even if you implement everything very well the roundtrip already has highs that would (in presence of the rest of the audio stack) cause issues. like, right now our model runs, on a weakish desktop, with 2ms averages and 3 to 4ms highs, for a 10ms frame, and that’s already kind of as high as we dare going

@aeva (and there’s also power efficiency because Phones Exist, and there it becomes even harder still)
@halcy what sort of audio processing are you doing?

@aeva noise / echo / reverb suppression and/or speech enhancement in various different configurations, for voice calling. so essentially „run smallish neural network on audio fast“

and yeah sure we can make the task arbitrarily more complex by making the model larger but then we need to do annoying things like justify the added compute by showing a major quality improvement. maybe requirements will do it for us eventually if someone decides we must have fullband stereo audio or something

@halcy oh just make it worse and slap the "AI" sticker on it
@aeva pretty sure the ad copy is extremely AI‘d up already. and unfortunately, if we make it worse, people will increasingly click the little „call audio was bad“ button in the little window you sometimes get at the end of the call, making a number go down that then causes me (well, okay, our PM) to panic and stop rollout
@halcy wait that button actually does something ?!

@aeva doesn’t immediately file a bug, but when you press the uhhh idk what it looks like now they‘ve been messing with it but, thumbs down or below 4 stars or sad smiley face, whatever it is, button, ideally also with a Problem Token (the little what-was-actually-bad questionnaire), then at least for media (so calling a/v - can’t speak to how frontend or chat or whatever do it) in Teams/Skype it goes to a big database along with technical telemetry that is usually correlated with call quality (stuff like „percent of aborted calls“), which then feeds into a quite thorough A/B system, and if we spot regressions in what are considered Tier 1 metrics, rollout stops (no statsig positive change is okay generally if you can justify why the change fixes a rare bug, adds a useful feature or whatever. Though of course if you can move a problem token or T1 metric in the right direction, that’s even better).

Mostly we catch issues before changes make it to public, though

@aeva anyways what I‘m saying is you should definitely always vote 5 because that makes my KPI numbers go up
@halcy i'm reminded of this xkcd for some strange reason https://xkcd.com/958/
@halcy anyways, any time i get the urge to press one of those buttons, i'll gladly stab that 5 for you ♡
@aeva in no jokes land, please do press the buttons that most reflect the call experience, which will make my life easier by contributing to a realistic picture of where we‘re at and what we most need to work on and what is and is not working

@halcy the little reaction emoji thing has been broken for months, there's no overlay for it anymore so i don't see when people press it and i'm not sure if they see when i do. could you pass that along to whoever's problem that is?

also there doesn't seem to be an option to make it not use the GPU anymore which is somewhat problematic for me since i have to completely shut down teams when doing perf stuff

@aeva i can try, but these sort of reports tend to not go anywhere unless I can repro
@halcy i can try and provide more info tomorrow if you want
@aeva if there’s an obvious way to reproduce the emoji issue and I can do it on my machine, I can file a bug relatively easily, so sure
@aeva (i‘ve filed bugs based on second hand info a few times before but they tend to just go nowhere, unfortunately)