In my everlasting quest to beat big triangle I revisited an #OpenCL FFT implementation I made for uni in 2021. A couple of days in, with the knowledge I have now, I've already got it down to 0.0033 seconds (3.3 ms) end-to-end latency, including host/device transfer, for a 1 million point complex FFT. This is on the low-cost 5700 XT.

Graphs and new code to follow soon; check https://github.com/Dantali0n/oCLFFT

GitHub - Dantali0n/oCLFFT: OpenCL Fast Fourier Transform using cooley-tukey in-place bit-reversal algorithm with lookup tables.

The current size limit is somewhere between 4 and 8 million points, because my 8 GB card runs out of memory. The implementation is about 12x as fast as #fftw in _measure_ mode on the Ryzen 5900X.
I would compare against hipFFT, VkFFT, etc. if they weren't such a pain in the ass to install, link and use. Might find the motivation at some later date; first, more tuning.