You'd think there's a million ready-to-use "sort some things, on the GPU" codes out there, but outside of CUDA there isn't all that much. The ones that work, that is.

So I spent way too much time debugging sorting of a bunch of points! As a side effect though, the Unity HDRP GPU sorting utility got a fix pull request :) https://github.com/Unity-Technologies/Graphics/pull/7954

Improve robustness of GPUSort wrt out of bounds buffer reads/writes by aras-p · Pull Request #7954 · Unity-Technologies/Graphics

Purpose of this PR GPUSort utility (which seems to be used for the "software" Line Rendering system in HDRP) in some cases produces incorrect results, at least on Metal (most notably, when the size...

@aras hey, I recently reimplemented bitonic sort on GPU (Shadertoy); what kind of sort were you after?
@webanck "any", for starters. As can be seen from the GitHub PR link, right now I'm using the (bitonic) sort from Unity's HDRP. I just had to fix it first, since it was producing incorrect results in some cases lol :)
@aras @webanck one thing that’s nice about odd-even sort is that it never makes the sequence less sorted in the intermediate steps, so you can amortize the passes over multiple frames in particle systems. I guess if you tracked which step you were in for the bitonic sort, you could render the reversed subsequences in reverse order to get the same effect for bitonic.
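To illustrate the "never less sorted" property, here is a minimal CPU sketch in Python of odd-even transposition sort spread over several "frames". The function names and the `passes_per_frame` parameter are made up for this example and do not come from any particular engine. Each pass only swaps adjacent out-of-order pairs, and each such swap removes exactly one inversion, so the inversion count can only go down from frame to frame.

```python
def count_inversions(a):
    """Number of pairs (i, j) with i < j and a[i] > a[j]; 0 means sorted."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])


def odd_even_pass(a, phase):
    """One pass: compare-exchange disjoint neighbour pairs starting at index
    `phase` (0 or 1). On a GPU each pair would be handled by one thread."""
    for i in range(phase, len(a) - 1, 2):
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]


def sort_over_frames(a, passes_per_frame=2):
    """Sort `a` in place, running only a few passes per (hypothetical) frame.
    Returns the inversion count observed at the end of each frame."""
    history = [count_inversions(a)]
    total_passes = len(a)  # n alternating passes are enough for n elements
    done = 0
    while done < total_passes:
        for _ in range(min(passes_per_frame, total_passes - done)):
            odd_even_pass(a, done % 2)
            done += 1
        history.append(count_inversions(a))
    return history


particles = [5, 1, 4, 2, 8, 0, 3]
history = sort_over_frames(particles)
print(history)    # inversion count per frame, never increasing
print(particles)  # fully sorted once all passes have run
```

A bitonic network does not have this property, because its intermediate steps deliberately build descending runs.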
@aras I see, thanks for the link to the blog post (https://poniesandlight.co.uk/reflect/bitonic_merge_sort/) on your PR, I didn't find that when implementing from the GPU Gems and Wikipedia descriptions. Could be cleaner, but this was my result on Shadertoy: https://www.shadertoy.com/view/DtjyWz.
Implementing Bitonic Merge Sort in Vulkan Compute

In which I describe how to implement bitonic sorting networks in compute shaders
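For reference, a minimal CPU sketch in Python of a bitonic sorting network in the "flip" and "disperse" formulation (as I understand it from the linked post), where every compare-exchange puts the smaller value at the lower index. This is not the Unity HDRP code or the Shadertoy one, and all names are hypothetical. The input length is rounded up to a power of two only virtually: slots past the end are treated as +infinity and are never read or written, which is one way to avoid out-of-bounds accesses for arbitrary sizes.

```python
def bitonic_sort(values):
    """Sort ascending with a bitonic network; returns a new list."""
    a = list(values)
    n = len(a)

    # Network size: next power of two >= n. Slots n..size-1 are virtual.
    size = 1
    while size < n:
        size *= 2

    def compare_exchange(i, j):
        # Always i < j. A partner at j >= n is virtual +infinity padding, so
        # the pair is already in order: skip it instead of touching memory.
        if j >= n:
            return
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]

    h = 2
    while h <= size:
        # "Flip": inside each block of h, compare each element with its mirror.
        half = h // 2
        for t in range(size // 2):  # one GPU thread per t
            block = (t // half) * h
            offset = t % half
            compare_exchange(block + offset, block + h - 1 - offset)

        # "Disperse": halve the block size until blocks hold two elements.
        hh = h // 2
        while hh >= 2:
            half = hh // 2
            for t in range(size // 2):
                i = (t // half) * hh + (t % half)
                compare_exchange(i, i + half)
            hh //= 2
        h *= 2
    return a


print(bitonic_sort([7, 3, 9, 1, 4, 8, 2]))  # length 7, not a power of two
```

The skip in `compare_exchange` is only valid because every comparator in this formulation is ascending, so the virtual padding never needs to move.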

@aras was your previous (or was there more than one?) Unity PR accepted and submitted? At the time we mused whether submitting PRs was worth the effort.
@MouseByTheSea I got a response that they will look into it! :)
@aras that’s great news, I’ll ask again in a month :) if you can not get a PR accepted then there’s no hope for the rest of us :)
@MouseByTheSea comments on the GitHub PR indicate positive things!
@aras it’s gone in! Great work btw, and it’s good to see them accepting PRs.

@aras
This here is god's work.
Thank you.

Yep, it would be undefined behaviour on non-D3D APIs. (By default on Vulkan, as you noted.)

@aras you may have seen this already, but AMD has a ParallelSort library: https://gpuopen.com/fidelityfx-parallel-sort/
AMD FidelityFX Parallel Sort

AMD FidelityFX Parallel Sort makes sorting data on the GPU quicker, and easier. Use our SM6.0 compute shaders to get your data in order.

@mjp I have not, actually! And googling for various GPU sorting keywords did not, ahem, sort the result of this towards the top. Will look, thanks!

@mjp @aras AMD's parallel radix sort is mostly optimized for 32-bit integers and does not have great performance for 64-bit.

A faster implementation of radix sort would be Onesweep: https://arxiv.org/pdf/2206.01784.pdf

It relies on inter-workgroup synchronisation in order to reduce per-pass global memory accesses from 3 to 2.
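To make the "3 versus 2" count concrete, here is a minimal CPU sketch in Python of a plain least-significant-digit radix sort for 32-bit keys. It is not Onesweep and not AMD's implementation; the function name and `bits_per_pass` are assumptions for this example. Each digit pass touches the key array three times: one read for the histogram, one read for the scatter, and one write. As I read the paper, Onesweep saves one of those by computing all digit histograms up front and fusing the scan into the scatter pass.

```python
def radix_sort_u32(keys, bits_per_pass=8):
    """Stable LSD radix sort of unsigned 32-bit integers; returns a new list."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    src = list(keys)

    for shift in range(0, 32, bits_per_pass):
        # Access 1 (read): histogram of the current digit.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # Exclusive prefix sum turns counts into starting offsets per digit.
        offsets = [0] * radix
        running = 0
        for d in range(radix):
            offsets[d] = running
            running += counts[d]

        # Access 2 (read) and access 3 (write): stable scatter into place.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst

    return src


print(radix_sort_u32([0xDEADBEEF, 42, 7, 0xFFFFFFFF, 0, 1 << 31, 42]))
```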

@dtiselice @mjp yeah I saw OneSweep (as well as a port of it to HLSL/Unity here https://github.com/b0nes164/ShaderOneSweep) but was wondering how "portable" it is. E.g. this blog post says that the OneSweep approach just "can't" work on Metal https://betterprogramming.pub/memory-bandwidth-optimized-parallel-radix-sort-in-metal-for-apple-m1-and-beyond-4f4590cfd5d3 but admittedly I have not investigated why that would be the case.
GitHub - b0nes164/ShaderOneSweep: A compute shader implementation of the OneSweep sorting algorithm.


@aras @mjp It doesn't work on Metal because of the inter-workgroup synchronisation that's needed.

There's also Modern GPU's mergesort implementation that's portable: https://moderngpu.github.io/mergesort.html


@aras Looks like a 3d gaussian splatting renderer in the making!
@aras thank you Aras for the fixes. We encourage everyone to help us. If it wasn't for Natasha and Seb this work would be dead. So we shouldn't take it for granted, it is a privilege to have it out, and we fought current Unity management to not kill these features.
We have a full prototype of the hair rasterizer as well, in a smaller GitHub project in Python and HLSL:
https://github.com/johnpars/LineSoftwareRasterizer
If anyone buys me a MacBook Pro I'll port it to Metal :)
@aras honestly, good compute code is also lacking in the wild because it is hard to write GPU compute without a giant infrastructure framework. Unity is not lean, Unreal is crazy, frameworks are all C++, and CUDA shrinks your options.
This is why I've written CoalPy, so people can write HLSL and tie it together with simple Python bindings. It also includes integrated imgui support, live editing, and window management, and can use a Vulkan backend. The goal is to democratize real HLSL compute.
https://coalpy.org

Compute Abstraction Layer for Python.
