I'm slowly working through the vulkan spec writing a compute-only vulkan program from scratch that doesn't render anything, and it's going pretty well because the spec is really well written and I already know more or less exactly what I want to do anyway, but I just want to say just how silly (fun) it feels to write a program like this because you get to just skip over large swaths of the API.
Like, I'm working from the spec because the tutorials all make it more complicated.
I think it's cute that practically every vulkan command has one or more optional args to let you enter Hard Mode
(sorry for the double post, I added this to the wrong thread)
I wonder how many people have actually managed to knuckle down and write a complete, useful vulkan program from scratch (no copy pasting from tutorials and stack overflow, no offloading significant parts to 3rd party libraries like VMA)
To think if I power through and get this thing working I could potentially be like the 20th person to bother
oh, update on my little vulkan compute project, last night I got as far as repeatedly dispatching an empty compute shader and allocating some memory. I'm in the home stretch! I think I just need to figure out the resources / resource binding stuff and then I'll be able to start on my DSP experiment :3
which mostly means the next things are figuring out the least effort way of getting audio data into C++ (probably stb_vorbis?) and writing even more boilerplate for alsa...
I reworked it so the convolution shader processes the audio in tandem with playback, so I'm *very* close to getting this working with live audio streams.
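For readers wondering what "processing in tandem with playback" can look like, here is a minimal CPU-side sketch of block-wise convolution using overlap-add. This is an assumption about the general technique, not the shader described in the thread; `StreamingConvolver` and its method names are hypothetical.

```python
def convolve(signal, kernel):
    """Direct time-domain convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


class StreamingConvolver:
    """Hypothetical helper: convolves audio one block at a time (overlap-add)."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Part of each block's result spills past the block; carry it forward.
        self.tail = [0.0] * (len(self.kernel) - 1)

    def process(self, block):
        full = convolve(block, self.kernel)
        # Add the spill-over left by earlier blocks.
        for i, t in enumerate(self.tail):
            full[i] += t
        n = len(block)
        self.tail = full[n:]   # save the new spill-over for the next call
        return full[:n]        # exactly one block of output per block of input


kernel = [1.0, 0.5, 0.25]
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

conv = StreamingConvolver(kernel)
streamed = []
for start in range(0, len(signal), 3):   # feed blocks of up to 3 samples
    streamed.extend(conv.process(signal[start:start + 3]))

reference = convolve(signal, kernel)[:len(signal)]
```

Each call returns as many samples as it was given, so the output can be handed to the audio device block by block while the next block is still being recorded or decoded.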
But more importantly, I used this to convolve my song "strange birds" with a choir-ish fanfare sound effect from a game I used to play as a kid and the result is like the grand cosmos opened up before me and I'm awash in the radiant light of the universe. Absolutely incredible.
ok the problem I'm having with latency now is that the audio latency in the system grows over time and I'm not sure why. like it starts snappy and after running for a short while it gets super laggy :/
I'm guessing it's because SDL3 can and will resize buffers as it wants to, whereas I'd rather it just go crazy if it underruns.
What I want to do is have a fixed size buffer for input and output, enough that I can have the output double or triple buffered to smooth over hitches caused by linux. if my program can't keep up I don't want it to quietly allocate more runway I want it to scream at me LOUDLY and HORRIBLY, but it won't do that because I'll rejigger my program until it is perfect.
What actually happens is (sdl? poopwire?) just infinitybuffers so it never hitches and I get a second of latency after a little bit
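A minimal sketch of the fixed-size behaviour wished for here, assuming a single-threaded producer and consumer. `FixedRingBuffer`, `Overrun` and `Underrun` are hypothetical names, and nothing below is SDL3 or PipeWire API. The point is that capacity never grows, so queued latency is bounded by `capacity / sample_rate`, and any failure to keep up is loud.

```python
class Overrun(Exception):
    """Producer wrote more than the buffer can hold."""


class Underrun(Exception):
    """Consumer asked for more samples than are queued."""


class FixedRingBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0.0] * capacity
        self.read_pos = 0
        self.count = 0   # samples currently queued

    def write(self, samples):
        if self.count + len(samples) > self.capacity:
            raise Overrun(f"need {len(samples)}, free {self.capacity - self.count}")
        for s in samples:
            self.data[(self.read_pos + self.count) % self.capacity] = s
            self.count += 1

    def read(self, n):
        if n > self.count:
            raise Underrun(f"need {n}, queued {self.count}")
        out = []
        for _ in range(n):
            out.append(self.data[self.read_pos])
            self.read_pos = (self.read_pos + 1) % self.capacity
            self.count -= 1
        return out


buf = FixedRingBuffer(4)
buf.write([1.0, 2.0, 3.0])
first = buf.read(2)
buf.write([4.0, 5.0, 6.0])   # wraps around the end of the storage
second = buf.read(4)
```

Sizing the capacity to two or three output periods gives the double or triple buffering described above without leaving room for latency to accumulate.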
I'm like 30% sure SDL3 is not the problem or at least not the only problem because I tried resetting the streams every frame with SDL_ClearAudioStream and it still accumulates latency (in addition to also now sounding atrocious due to missing samples).
I've also seen this happen with pipewire before in other situations, and it was resolved by bypassing pipewire.
ok I did it. I've got a program that writes a pipewire stream of F64 audio samples where each sample is the total elapsed time since the first frame, expressed in minutes.
I've got a second program that reads that pipewire stream, and checks the offset against its own elapsed time since the first sample processed. This program prints out the calculated drift every second.
The results are interesting.
In the first version of this, both programs just measured the time using std::chrono::steady_clock::time_point. This resulted in an oscillating drift that was well under a millisecond at its peak and nothing to be concerned about.
This is good! That means there's no place whatsoever within pipewire on my computer for this specific audio setup where any intermediary buffers might be growing and adding more latency as the programs run.
This is not the interesting case.
In the second version, I changed the first program to instead calculate elapsed time as the frame number * the sampling interval, and left the second program alone.
In this version, the calculated drift is essentially the difference between the progress through the stream vs the amount of time that actually passed from the perspective of the observer. In this version, the amount of drift rises gradually. It seems the stream is advancing just a touch faster than it should.
The samples in the stream are reporting that more time has elapsed in the "recording" than actually has transpired according to the clock. The amount of drift accumulated seems to be a millisecond every few minutes.
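To put a rough number on that, here is a small worked calculation. It assumes "a millisecond every few minutes" means 1 ms per 3 minutes and a nominal rate of 48000 Hz; both figures are illustrative guesses, not measurements from the thread. It models the stream as being consumed slightly faster than the nominal rate and compares stream time (frames / nominal rate) with wall-clock time.

```python
def drift_ppm(drift_seconds, wall_seconds):
    """Relative clock error in parts per million."""
    return drift_seconds / wall_seconds * 1e6


def simulated_drift(wall_seconds, nominal_rate, actual_rate):
    """Stream time minus wall time when frames arrive at actual_rate
    but are timestamped as if they arrived at nominal_rate."""
    frames_delivered = wall_seconds * actual_rate
    stream_seconds = frames_delivered / nominal_rate
    return stream_seconds - wall_seconds


def seconds_per_extra_sample(nominal_rate, ppm):
    """How long it takes for the surplus to add up to one whole sample."""
    return 1.0 / (nominal_rate * ppm * 1e-6)


NOMINAL_RATE = 48_000.0           # assumed, not stated in the thread
ppm = drift_ppm(1e-3, 180.0)      # assumed: 1 ms over 3 minutes
actual_rate = NOMINAL_RATE * (1.0 + ppm * 1e-6)

print(f"clock error: {ppm:.2f} ppm")
print(f"effective rate: {actual_rate:.3f} Hz")
print(f"one extra sample every {seconds_per_extra_sample(NOMINAL_RATE, ppm):.2f} s")
```

Under these assumptions the mismatch is a few parts per million, which is the kind of disagreement you would expect between two independent clocks rather than a buffer that keeps growing.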
I'm honestly not sure what to make of that.
I think my conclusions from this are
1. the latency drift I observed with my experiments with pipewire today is probably inconsequential.
2. there is probably nothing sinister about pipewire.
3. if you have a chain of nodes that are a mix of push or pull driven and have different buffering strategies, you are in the Cool Zone
4. my program is probably going to have to handle "leap samples" in some situations. I admit I wasn't expecting that, but it feels obvious in retrospect.
that or I'm just good at creating wizard problems for myself. either way I'm in a good mood.
some small problems with this system:
1. I've had to turn down the sampling rate so I can convolve longer samples. 22050 hz works out ok though for what I've been messing with so far, so maybe it's not that big a deal. longer samples kinda make things muddy anyway
2. now I want to do multiple convolutions at once and layer things and that's probably not happening on this hardware XD
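A quick back-of-the-envelope for why lowering the sampling rate helps so much. It assumes a direct time-domain convolution (the thread does not say how the shader works) and uses 44100 Hz as the comparison rate, which is also an assumption. The function names are hypothetical.

```python
def kernel_taps(sample_rate, kernel_seconds):
    """Number of kernel samples for a clip of the given duration."""
    return int(sample_rate * kernel_seconds)


def macs_per_second(sample_rate, kernel_seconds):
    """Multiply-accumulates needed per second of real-time output:
    one pass over the whole kernel for every output sample."""
    return sample_rate * kernel_taps(sample_rate, kernel_seconds)


for rate in (44_100, 22_050):
    print(rate, "Hz:", macs_per_second(rate, 1.0), "MACs/s for a 1 s kernel")
```

The cost grows with the square of the sampling rate for a kernel of fixed duration, so halving the rate cuts the work to a quarter, or allows a kernel four times longer in seconds for the same budget.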
I figure I should probably start recording my convolution experiments for reference, and this thread seems as good a place as any to post them.
Tonight's first experiment: An excerpt from a The King In Yellow audio book convolved with a short clip from the Chrono Cross OST (Chronopolis)
Tonight's second convolution experiment: The same audio book excerpt, but convolved with a frog instead.
Recordings of speech seem to convolve really well with music and weird samples like this, but it really depends on the voice and what you pick as a kernel.
@aeva These are wonderful!
By inverse do you mean swapping which is the source and which is the filter? I'm only 75% sure but I _think_ that would be mathematically identical.
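For two complete, finite clips the swap is indeed identical, since convolution is commutative. A small pure-Python check with made-up sample values:

```python
import math


def convolve(a, b):
    """Direct convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


source = [0.2, -0.5, 0.9, 0.1, -0.3]   # stand-in for the "source" clip
kernel = [1.0, 0.6, 0.3]               # stand-in for the "filter" clip

ab = convolve(source, kernel)
ba = convolve(kernel, source)
same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(ab, ba))
print("commutative:", same)
```

This only holds when both inputs are used in full. If one of them is truncated to a window, as a real-time implementation usually has to do, which one gets truncated changes the result.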
@aeva Oh I see!
Something I really want to try is moving that 1s clip window around over time, maybe oscillating or just moving it slower than the playback speed. Seems like that could sound wonderfully dynamic.
Glad you are convolving. If the rest of civilization is convolving at the same rate as you, how do you know?
So, you gonna start a convolution?
@aeva The actual reason for that was almost certainly memory access patterns. Thread invocations in PS waves are generally launched and packed to have nice memory access patterns (as much as possible), compute waves and invocations launch more or less in order and micro-managing memory access is _your_ problem.
This really matters for 2D because there's lots of land mines there wrt ordering, but for 1D, not so much.
@aeva To give a concrete example: suppose you're doing some simple compute shader where all you're doing is
cur_pixel = img.load(x, y)
processed = f(cur_pixel, x, y)
img.store(x, y, processed)
and you're dispatching 16x16 thread groups, (x,y) = DispatchThreadID, yada yada, all totally vanilla, right?
@aeva well, suppose we're working in 32-thread waves internally (totally hypothetical number)
now those 32 invocations get (in the very first thread group) x=0,...,15 for y=0 and then y=1.
Say the image is R8G8B8A8 pixels and the internal image layout stores aligned groups of 4 texels next to each other and then goes to the next y, and the next 4-wide strip of texels is actually stored something like 256 bytes away or whatever.
@aeva so, x=0,..,3 y=0 are all good, these are all adjacent, straight shot, read 16 consecutive bytes, great.
x=0,...,3 y=1 in threads 16..19 are also good, these are the next 16 bytes in memory.
But if we have 256-byte cache lines (another Totally Hypothetical Number), well, those 32 bytes are all we get.
x=4,..,7 for y=0 and 1 are in the cache line at offset 256, x=8,...,11 for y=0,1 at offset 512, x=12,...,15 at offset 768.
@aeva And caches are usually built to have multiple "banks" that each handle a fraction of a cache line. Let's say our hypothetical cache has 16 16-byte banks to cover each 256B cache line.
Well, all the requests we get from that nice sequential load go into the first 2 banks and the rest gets nothing.
So that's lopsided and causes problems, and will often mean you lose a lot of your potential cache bandwidth because you only actually get that if your requests are nicely distributed over mem.
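The hypothetical numbers above can be checked with a tiny simulation. Everything here mirrors the made-up hardware in this example (32-thread waves, 256-byte cache lines, 16 banks of 16 bytes, 4-texel strips with the next strip 256 bytes away); the address formula is one layout consistent with that description, not any real GPU's. The second mapping is added only for comparison.

```python
CACHE_LINE_BYTES = 256
BANK_BYTES = 16
BYTES_PER_TEXEL = 4      # R8G8B8A8
STRIP_WIDTH = 4          # texels stored next to each other
STRIP_STRIDE = 256       # bytes from one 4-wide strip to the next
WAVE_SIZE = 32


def texel_address(x, y):
    """Byte offset of texel (x, y); valid for y < 16 in this toy layout."""
    strip = x // STRIP_WIDTH
    row_bytes = STRIP_WIDTH * BYTES_PER_TEXEL
    return strip * STRIP_STRIDE + y * row_bytes + (x % STRIP_WIDTH) * BYTES_PER_TEXEL


def footprint(coords):
    """Cache lines and banks touched by a set of texel loads."""
    addrs = [texel_address(x, y) for x, y in coords]
    lines = {a // CACHE_LINE_BYTES for a in addrs}
    banks = {(a % CACHE_LINE_BYTES) // BANK_BYTES for a in addrs}
    return lines, banks


# Row-major 16-wide thread group: the first wave covers x=0..15 for y=0 and y=1.
row_major = [(t % 16, t // 16) for t in range(WAVE_SIZE)]
# For comparison: a wave shaped to match the layout, 4 wide and 8 tall.
layout_matched = [(t % 4, t // 4) for t in range(WAVE_SIZE)]

rm_lines, rm_banks = footprint(row_major)
lm_lines, lm_banks = footprint(layout_matched)
print("row-major:      lines", sorted(rm_lines), "banks", sorted(rm_banks))
print("layout-matched: lines", sorted(lm_lines), "banks", sorted(lm_banks))
```

In this toy model the row-major wave touches four cache lines but only two banks, while the layout-matched wave reads one cache line spread over eight banks.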
@aeva long story short, this whole thing with your thread groups being a row-major array of 16x16 pixels can kind of screw you over, if the underlying image layout is Not Like That.
This happens all the time.
Ordering and packing of PS invocations into waves is specifically set up by the GPU vendor to play nice with whatever memory pipeline, caches, and texture/surface layouts it has.
In CS, all of that is Your Job, generally given no information about the real memory layout.
Good luck!
@aeva If you do know what the real memory layout is, you can make sure consecutive invocations have nice memory access patterns, but outside consoles (where you often get those docs), eh, good luck with that.
The good news is that with 1D, this problem doesn't exist, because 1D data is sequential everywhere.
So as long as you're making sure adjacent invocations grab adjacent indices, your memory access patterns are generally fine.
(Once you do strided, you're back in the danger zone.)
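The 1D case can be sketched the same way, reusing the hypothetical 32-thread wave and 256-byte cache line and assuming 4-byte elements (say, float audio samples). The stride of 64 elements is an arbitrary illustration.

```python
def lines_touched(indices, element_bytes=4, line_bytes=256):
    """Number of distinct cache lines hit by one wave's loads."""
    return len({(i * element_bytes) // line_bytes for i in indices})


WAVE_SIZE = 32
adjacent = [t for t in range(WAVE_SIZE)]       # thread t reads element t
strided = [t * 64 for t in range(WAVE_SIZE)]   # thread t reads element 64 * t

print("adjacent:", lines_touched(adjacent), "cache line(s)")
print("strided: ", lines_touched(strided), "cache line(s)")
```

With adjacent indices the whole wave is served from a single cache line; with the stride, every thread lands in a different one.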
@aeva I built my own audio system and hate every time I have to work on it, so I guess different strokes and all that.
(fwiw:
https://shirakumo.github.io/libmixed/
https://shirakumo.github.io/cl-libmixed/
https://shirakumo.github.io/harmony/ )
@shinmera @aeva [i know nothing about audio processing so i'm like 99.9% sure that there's a good reason why the following doesn't make sense; asking the following out of curiosity]
can the ear-destruction be avoided by like... doing some kind of analysis/checks on the final sample before sending it to the audio device...? (e.g. checking & asserting that its amplitude is less than some upper bound?)
[but if it were that easy, it probably would have been the first thing anyone would try, so]
@JamesWidman @aeva Doing that kind of analysis would be quite difficult, since you can't just check individual samples, and it's not immediately obvious what is a symphonic sequence and what is erroneous noise. A lot of the time what causes horrendous noise is also not in the data, but in the way the data is sent (buffer over or underruns).
As aeva said, usually making sure the volume is low enough is a good enough fix.
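A minimal sketch of both ideas, assuming float samples nominally in [-1, 1]. The function names and the 0.25 gain are hypothetical. It shows why a per-sample amplitude check is a weak safety net: harsh full-scale noise passes it, and problems caused by buffer over or underruns never show up in the sample values being checked.

```python
def peak(samples):
    """Largest absolute sample value in a block."""
    return max((abs(s) for s in samples), default=0.0)


def within_limit(samples, limit=1.0):
    """The naive check: is every sample below an upper bound?"""
    return peak(samples) <= limit


def apply_master_gain(samples, gain=0.25):
    """The low-tech fix: keep the overall output volume low."""
    return [s * gain for s in samples]


runaway = [1.0, -2.0, 0.5]           # out-of-range values from a bug
harsh_noise = [1.0, -1.0] * 4        # in range, but still unpleasant to hear

quiet = apply_master_gain(runaway)
print(within_limit(runaway), within_limit(harsh_noise), within_limit(quiet))
```

The check catches the out-of-range block but waves the noise through, whereas a conservative master gain limits how loud either of them can get.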
@aeva Agreed! My DSP project is the most coding fun I've had in years, with bonus fun sounds too 🥳
The Graphics Programmer to Audio Programmer pipeline is real
@aeva also probably the reason why 20th level Wizards aren't running the world
With like say a Monk or Barbarian it's pretty obvious. If you just maintain a large enough distance you're pretty safe.
But Wizards? They got Power Word Kill and Wish and Time Stop and yet somehow those poor border towns are overrun by *Giant Rats* and Goblins? Why do they let this happen?
Answer: on it on it on it, as soon as they figure out why the enchanted quill refuses to draw black runes without magenta ink