Finally got the proofs for my manuscript about porting #GPUSPH to run on CPU (yes, we went the other way). They completely bungled all the figures and listings. This will be a nightmare to fix, because I have to comment on each of them to say “this figure goes there, and the figure that should be here is at that other place”, and I'm ready to bet they'll *still* mess it up after I'm done adding all the comments.

And they are getting paid big monies because gold open access!

There are 16 Listings and 19 Figures in this manuscript. This will take the whole day. Help.

#spreadsheet to the rescue! Let's use #LibreOffice #Calc to write the comments for me.

Column A: the original Listing/Figure
Column B: the Listing/Figure whose image they actually used in its place
Column C: XLOOKUP the current A in the B column and return the corresponding A (this tells you where the correct content for each Listing/Figure currently sits)
Column D: TEXTJOIN to produce a comment that tells the typesetter how to shift images around (see the sketch below)
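Something like this (a sketch: ranges and wording assumed, with the 16+19 items in rows 2–36):

```
C2: =XLOOKUP(A2, $B$2:$B$36, $A$2:$A$36)
D2: =TEXTJOIN(" ", 1, "The content that belongs here as", A2, "is currently placed at", C2)
```

Fill down, then paste column D into the proof comments.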

#LOCalc

Oh good, I can ask to view the revised proof and specify that I have “Specific quality concerns related to Figures”.

It's out, if anyone is curious:

https://doi.org/10.1002/cpe.8313

This is a “how to” guide. #GPUSPH, as the name suggests, was designed from the ground up to run on #GPU (w/ #CUDA, for historical reasons). We wrote a CPU version a long time ago for a publication that required a comparison, but it was never maintained. In 2021, I finally took the plunge and, taking inspiration from #SYCL, adapted the device code into functor form, so that it could be “trivially” compiled for CPU as well.

#HPC #GPGPU

This was actually relatively simple, because we were already making extensive use of parameter structures to account for the different variables and states that needed to be tracked for each individual formulation/combination of options. Turning these into functors, with the device kernels as their `void operator()() const`, was a snap (it could have been automated, but I did it by hand to take advantage of the review process to clean up a few things).
The result is something that can be trivially parallelized with #OpenMP.
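In sketch form, the shape of the change looks something like this (a minimal illustration with made-up names, not GPUSPH's actual code):

```cpp
#include <cstddef>

// Outside of nvcc/hipcc, turn the device attributes into no-ops so the
// same functor source also builds as plain C++ for the CPU backend.
#if !defined(__CUDACC__) && !defined(__HIPCC__)
#define __host__
#define __device__
#endif

// Hypothetical parameter structure promoted to a functor; the member
// names are illustrative.
struct density_update_params {
	const float *rho_in;  // input densities
	const float *drho;    // density derivatives
	float *rho_out;       // output densities
	float dt;             // time step

	// the former device-kernel body, now a per-particle call operator
	__host__ __device__
	void operator()(size_t i) const
	{
		rho_out[i] = rho_in[i] + dt*drho[i];
	}
};

// CPU backend: 'launching' the kernel is just an OpenMP-parallel loop.
template<typename Kernel>
void run_kernel_cpu(Kernel const& k, size_t nparts)
{
	#pragma omp parallel for
	for (size_t i = 0; i < nparts; ++i)
		k(i);
}

#if defined(__CUDACC__) || defined(__HIPCC__)
// GPU backend: a thin generic kernel invoking the same call operator,
// one thread per particle.
template<typename Kernel>
__global__ void run_kernel_gpu(Kernel k, size_t nparts)
{
	size_t i = blockIdx.x*blockDim.x + threadIdx.x;
	if (i < nparts) k(i);
}
#endif
```

The point being: once the kernel body lives in the call operator, the CPU “launch” is just a loop, and a single pragma parallelizes it.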
As an alternative, it's possible to use the #multiGPU support in #GPUSPH to run the CPU code in parallel. This is obviously less efficient, although it may be a good idea in #NUMA setups (one thread per NUMA node, with OpenMP for the cores within each node). That hybrid setup is not implemented yet.
#GPUSPH also supports multi-node execution via #MPI (you can run across multiple GPUs on separate nodes).
This can be used with the new CPU backend too, but it's untested and I don't expect it to be particularly efficient, because kernel execution and data transfers are not yet asynchronous in the CPU backend. (It's honestly a low-priority task for us, and given that we're short on hands we'll prioritize other things for the time being.)
One of the nice things about the refactoring I had to do to introduce CPU support is that it also allowed me to trivially add support for #AMD #HIP / #ROCm.
That, and the fact that AMD engineers have written a drop-in replacement (rocThrust) for the Thrust library that we depend on in a couple of places. (That dependency is also one of the things holding back a full #SYCL port of #GPUSPH, BTW.)
Ironically, the hardest part of introducing HIP support was having to work around the incompatibilities between their headers and the NVIDIA ones for the “small vector” types (`float4` and the like), which tripped us up all over our code. I opened a bunch of tickets, and most of them got resolved (the last one with the release of HIP/ROCm 6.1).
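For flavor, this is the kind of single choke-point one ends up wanting for those types (a sketch only: the helper name is made up, and the actual incompatibilities we hit, and their upstream fixes, were subtler than this):

```cpp
// One compatibility header for the small vector types, so the rest of
// the code never cares which vendor's headers defined them.
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>   // HIP's float4, make_float4, ...
#else
#include <cuda_runtime.h>      // NVIDIA's float4, make_float4, ...
#endif

// Hypothetical helper: a single construction entry point.
__host__ __device__ inline float4
make_vec4(float x, float y, float z, float w)
{
	return make_float4(x, y, z, w);
}
```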