You'd think there's a million ready-to-use "sort some things, on the GPU" codes out there, but outside of CUDA there isn't all that much. The ones that work, that is.

So I spent way too much time debugging sorting of a bunch of points! As a side effect though, Unity HDRP GPU sorting utility got a fix pull request :) https://github.com/Unity-Technologies/Graphics/pull/7954

@aras you may have seen this already, but AMD has a ParallelSort library: https://gpuopen.com/fidelityfx-parallel-sort/

@mjp @aras AMD's parallel radix sort is mostly optimized for 32-bit integers and does not have great performance for 64-bit.

A faster implementation of radix sort would be Onesweep: https://arxiv.org/pdf/2206.01784.pdf

It relies on inter-workgroup synchronisation (workgroups wait on results published by other workgroups) in order to reduce per-pass global memory accesses from roughly 3 to 2 per key.
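
For anyone following along, here is a minimal CPU-side sketch of a least-significant-digit radix sort, only to show where the per-pass memory traffic comes from. It is plain Python, not any of the GPU libraries mentioned in this thread; the function names `radix_sort_lsd` and `num_passes` and the 8-bit digit size are assumptions made for illustration. In this naive form each digit pass reads the keys for the histogram, reads them again for the scatter, and writes them once. As far as I understand the paper, Onesweep avoids the separate per-pass histogram read by building the histograms up front and resolving offsets during the scatter.

```python
def num_passes(key_bits, radix_bits=8):
    """Number of digit passes needed for keys of the given width (ceiling division)."""
    return -(-key_bits // radix_bits)


def radix_sort_lsd(keys, key_bits=32, radix_bits=8):
    """Sort non-negative integers below 2**key_bits, least significant digit first."""
    buckets = 1 << radix_bits
    mask = buckets - 1
    src = list(keys)
    for shift in range(0, key_bits, radix_bits):
        # Step 1: histogram of the current digit (first read of every key).
        counts = [0] * buckets
        for k in src:
            counts[(k >> shift) & mask] += 1

        # Step 2: exclusive prefix sum turns counts into starting offsets.
        offsets = [0] * buckets
        running = 0
        for d in range(buckets):
            offsets[d] = running
            running += counts[d]

        # Step 3: stable scatter (second read of every key, plus one write).
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src


if __name__ == "__main__":
    data = [170, 45, 75, 90, 802, 24, 2, 66]
    print(radix_sort_lsd(data))
    print(num_passes(32), num_passes(64))
```

With 8-bit digits a 64-bit key needs twice as many passes as a 32-bit key, which is one reason 64-bit sorting costs noticeably more in any radix sort.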

@dtiselice @mjp yeah I saw OneSweep (as well as a port of it to HLSL/Unity here https://github.com/b0nes164/ShaderOneSweep) but was wondering how "portable" it is. E.g. this blog post says that the OneSweep approach just "can't" work on Metal https://betterprogramming.pub/memory-bandwidth-optimized-parallel-radix-sort-in-metal-for-apple-m1-and-beyond-4f4590cfd5d3 but admittedly I have not investigated why that would be the case.

@aras @mjp It doesn't work on Metal because of the inter-workgroup synchronisation that's needed.

There's also Modern GPU's mergesort implementation that's portable: https://moderngpu.github.io/mergesort.html
