I've just reviewed a manuscript about the recent progress made in introducing #GPU support in a classic, large #CFD code that already has good support for massive simulations in traditional #HPC settings (CPU clusters).

I'm always fascinated by the stark difference between the kind of work that goes into this process and what went into the *reverse* process that we followed for #GPUSPH, which was developed for GPUs from the start and only recently ported to CPUs, through the approach described in this paper, which I'm sure I've already mentioned here:

https://doi.org/10.1002/cpe.8313

When I get to review this kind of article, I always feel the urge to start a dialogue with the authors about these differences, but that's not really my role as a reviewer, so I have to hold back and limit my comments to what the role requires.

So I guess you get to read about the stuff I couldn't write in my reviewer comments.

1/n

One of the ways in which #GPUSPH is atypical is that we don't rely much on existing libraries. As I've had the opportunity to mention before, the only hard dependency in GPUSPH is Thrust, a library that provides parallel implementations of some commonly used primitives (think sorting, reductions, scans, etc.) on both CPUs and NVIDIA GPUs (via CUDA).
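To give an idea of what those primitives look like in practice, here's a minimal, self-contained sketch (hypothetical buffers, not actual GPUSPH code):

```cpp
// Minimal sketch of the Thrust primitives mentioned above
// (illustrative only, not GPUSPH code); compile as a .cu file with nvcc.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

int main() {
    // A device-side array of 2^20 ones
    thrust::device_vector<float> v(1 << 20, 1.0f);

    // Sort in place on the device
    thrust::sort(v.begin(), v.end());

    // Reduction: sum of all elements
    float total = thrust::reduce(v.begin(), v.end(), 0.0f);

    // Exclusive prefix sum (scan), e.g. to compute offsets into a packed array
    thrust::device_vector<float> offsets(v.size());
    thrust::exclusive_scan(v.begin(), v.end(), offsets.begin());

    return total > 0.0f ? 0 : 1;
}
```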

Even this dependency alone has in fact been a constant source of pain, and it's the biggest roadblock to true hardware independence for GPUSPH: the only reason we've been able to introduce support for AMD GPUs is that AMD took an interest in forking Thrust into rocThrust. This helped us add HIP support, at the expense of some additional pain («are we using the correct version of Thrust for the backend?»).

The obvious benefit of relying on external libraries is that the only work you have to do is learning how to use the library's API. And of course such libraries are often highly optimized, and can offer much better performance than code hand-written by someone whose primary specialization lies elsewhere.

2/n

As an example, when we started working on the semi-implicit integrator for #GPUSPH we had to address the new issue of solving the large linear system that comes from the implicit treatment of the viscous term.
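Schematically (glossing over all the SPH-specific details, so this is a sketch rather than the exact formulation from the paper), treating the viscous term implicitly turns the velocity update into a linear system:

$$
\frac{\mathbf{u}^{n+1} - \mathbf{u}^{n}}{\Delta t} = \nu\,\nabla^2 \mathbf{u}^{n+1} + \mathbf{f}^{n}
\quad\Longrightarrow\quad
\left(\mathbf{I} - \Delta t\,\nu\,\mathbf{L}\right)\mathbf{u}^{n+1} = \mathbf{u}^{n} + \Delta t\,\mathbf{f}^{n}
$$

where $\mathbf{L}$ is the discrete (SPH) Laplacian: a large, sparse system with one row per particle, to be solved at every time step.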

This is where most #CFD codes rely on one of the many high quality #BLAS libraries, either directly or through something like the excellent #PETSc —at which point they are dependent on that library's capabilities for parallelization and hardware support.

Fresh from our recent experience with Thrust breaking down on some NVIDIA GPUs (despite being designed for them), we decided instead to roll our own. The first implementation we achieved was nothing exceptional, but it did the job well enough to convince us to pursue this path:

https://doi.org/10.1016/j.jcp.2018.07.060

3/n

That first implementation didn't even support the multi-GPU and multi-node features of #GPUSPH (it could only run on a single GPU), but it paved the way for the full version, which took advantage of the whole GPUSPH infrastructure in multiple ways.

First of all, we didn't have to worry about how to encode the matrix and its sparsity, because we could compute the coefficients on the fly, operating with the same neighbor-list traversal logic used in the rest of the code; this allowed us to minimize memory use and increase code reuse.
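In other words, a matrix-free operator. A rough sketch of the shape it takes (names and the coefficient formula are purely illustrative, this is not GPUSPH code):

```cpp
// Matrix-free "A times x": the matrix is never stored; each row's
// coefficients are recomputed on the fly while walking the neighbors list.
// Illustrative only; compile as a .cu file with nvcc.

// Placeholder for the actual SPH coefficient (kernel gradient, viscosity, ...)
__device__ float computeCoefficient(float4 pi, float4 pj)
{
    float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
    float r2 = dx*dx + dy*dy + dz*dz + 1e-6f;
    return 1.0f / r2; // purely illustrative, not a real SPH coefficient
}

__global__ void matrixFreeProduct(
    const float4       *pos,        // particle positions
    const unsigned int *neibsList,  // flattened neighbors list
    const unsigned int *neibsCount, // number of neighbors per particle
    const float        *x,          // input vector
    float              *y,          // output vector y = A x
    unsigned int numParticles, unsigned int maxNeibs)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float diag = 1.0f; // e.g. the identity part of (I - dt*nu*L)
    float sum  = 0.0f;

    for (unsigned int n = 0; n < neibsCount[i]; ++n) {
        unsigned int j = neibsList[i * maxNeibs + n];
        float a_ij = computeCoefficient(pos[i], pos[j]);
        diag -= a_ij;          // off-diagonal terms also contribute to the diagonal
        sum  += a_ij * x[j];
    }
    y[i] = diag * x[i] + sum;
}
```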

Secondly, we gained control over the accuracy of intermediate operations, allowing us to use compensated sums wherever needed.
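For reference, this is the textbook Kahan-style compensated sum I mean (a serial sketch; the same idea carries over per thread or inside a reduction):

```cpp
// Kahan (compensated) summation: the running compensation `c` recovers
// the low-order bits that a plain accumulation would lose.
float compensated_sum(const float *values, int n)
{
    float sum = 0.0f;
    float c   = 0.0f;           // running compensation
    for (int i = 0; i < n; ++i) {
        float y = values[i] - c;
        float t = sum + y;      // low-order bits of y are lost here...
        c = (t - sum) - y;      // ...and recovered here
        sum = t;
    }
    return sum;
}
```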

Thirdly, we could leverage the multi-GPU and multi-node capabilities already present in GPUSPH to distribute computations across all available devices.

And last but not least, we actually found ways to improve the classic #CG and #BiCGSTAB linear solvers, achieving excellent accuracy and convergence even without preconditioners, while making the algorithms themselves more parallel-friendly:

https://doi.org/10.1016/j.jcp.2022.111413
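For context, this is the classic unpreconditioned CG loop (textbook version, not the reworked variants described in the paper). Note the two separate dot products per iteration: each is a global reduction, hence a synchronization point, which is part of what makes the textbook formulations less parallel-friendly:

```cpp
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec &a, const Vec &b)
{
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Textbook unpreconditioned CG for A x = b; the mat-vec is passed in as a
// callback (e.g. a matrix-free operator like the one sketched above).
void cg(const std::function<void(const Vec &, Vec &)> &matvec,
        const Vec &b, Vec &x, int maxIter, double tol)
{
    const size_t n = b.size();
    Vec r(n), p(n), Ap(n);

    matvec(x, Ap);                          // r = b - A x0
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rr = dot(r, r);                  // initial reduction

    for (int k = 0; k < maxIter && std::sqrt(rr) > tol; ++k) {
        matvec(p, Ap);
        double alpha = rr / dot(p, Ap);     // reduction #1 of the iteration
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rrNew = dot(r, r);           // reduction #2 of the iteration
        double beta = rrNew / rr;
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
    }
}
```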

4/n

#LinearAlgebra #NumericalAnalysis

(Oh, and of course our implementation is hardware-independent.)

I was tempted to point the authors of the paper I was reviewing to that last paper, but the truth is that they would have gotten little benefit from it unless they were willing to invest the time in implementing the solver themselves, since AFAIK no existing solver uses our approach (and they couldn't easily have extracted it from the bowels of GPUSPH).

These two are in fact the largest downsides of rolling your own code: time, and integration.

It does take more time to reach a satisfactory result when rolling your own code; on the other hand, it can be a useful investment that reduces the maintenance burden in the future (in our case, we didn't have to worry about how to solve those linear systems when introducing AMD GPU support).

And while having the code well integrated into the existing system can really help reduce overhead, it also makes it harder to reuse somewhere else.

5/n

Neither approach is necessarily superior to the other: factors such as code legacy, available time (and people), and priorities have a large impact on the path one decides to take. But I'm glad we started from scratch: not being held back by old existing code allowed us to focus on a "GPU-first" (in fact, GPU-only) approach for GPUSPH, at a time when the GPGPU field was young and available libraries were largely absent. Having also helped port an important pre-existing code from CPU to GPU, I can appreciate the different mindsets that go into the two paths.

This double experience has had a significant impact on all my subsequent programming endeavours: even when I'm writing a quick prototype for serial CPU execution, even when I'm doing this in languages that aren't exactly amenable to massive parallelization (I'm looking at you, Python), I find myself constantly thinking about how to structure both the data and the algorithm so that a future parallelization, on either CPU or GPU, would be easier to achieve.

And I'm left wondering if I'm managing to transmit this mindset to my students.

6/6