Rewrote my acceleration structure for neighbourhood search based on some insight from @vassvik and got significant speedups on large fluid simulations :) Here’s ~1/2 million particles ✨
The core idea is to take the 10x10x10 footprint accessed by all threads in the workgroup, split it into chunks, and fetch those chunks cooperatively so that each fetch lines up with subgroup boundaries and introduces no divergence. The fetched data is then stored to shared memory.
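For anyone who wants to see the index mapping spelled out, here is a rough CPU-side sketch in Python. It is not the actual shader. It assumes an 8x8x8 workgroup with a one-cell halo (which is one way to end up with a 10x10x10 footprint), a subgroup width of 32, and one subgroup-sized chunk per fetch. The names `cooperative_fetch`, `load_cell`, and `shared_index` are made up for illustration. The out-of-range lanes in the last chunk are clamped to the final cell rather than branched around, which is one possible way to keep every lane on the same path.

```python
# Emulates only the thread-to-cell mapping; the loops stand in for GPU parallelism.
TILE = 8                      # assumed workgroup size per axis (8*8*8 = 512 threads)
HALO = 1                      # assumed neighbour search radius, in cells
SIDE = TILE + 2 * HALO        # 10, giving the 10x10x10 footprint
CELLS = SIDE ** 3             # 1000 cells to stage in shared memory
SUBGROUP = 32                 # assumed subgroup width
NUM_SUBGROUPS = TILE ** 3 // SUBGROUP               # 16 subgroups per workgroup
NUM_CHUNKS = (CELLS + SUBGROUP - 1) // SUBGROUP     # 32 chunks of 32 cells


def shared_index(x, y, z):
    """Flat index into the staged tile for local coordinates in [0, SIDE)."""
    return (z * SIDE + y) * SIDE + x


def cooperative_fetch(load_cell, origin):
    """Stage the footprint around `origin` (the workgroup's minimum cell corner).

    `load_cell(x, y, z)` is a hypothetical stand-in for the global-memory read.
    """
    shared = [None] * CELLS
    for subgroup in range(NUM_SUBGROUPS):
        # The chunk index depends only on the subgroup id, so every lane in a
        # subgroup runs the same number of iterations (two each here).
        for chunk in range(subgroup, NUM_CHUNKS, NUM_SUBGROUPS):
            for lane in range(SUBGROUP):
                # Clamp instead of branching: lanes past the end of the last
                # chunk redundantly reload the final cell.
                flat = min(chunk * SUBGROUP + lane, CELLS - 1)
                x = flat % SIDE
                y = (flat // SIDE) % SIDE
                z = flat // (SIDE * SIDE)
                shared[flat] = load_cell(
                    origin[0] + x - HALO,
                    origin[1] + y - HALO,
                    origin[2] + z - HALO,
                )
    return shared


# Usage: a fake grid whose cell value is just its own coordinate.
tile = cooperative_fetch(lambda x, y, z: (x, y, z), origin=(16, 24, 32))
print(tile[shared_index(0, 0, 0)])
```

With these assumed sizes, 1000 cells round up to 32 chunks, so each of the 16 subgroups handles exactly two chunks and only the last chunk contains clamped lanes.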