That first implementation didn't even support the multi-GPU and multi-node features of #GPUSPH (it could only run on a single GPU), but it paved the way for the full version, which took advantage of the whole GPUSPH infrastructure in multiple ways.

First of all, we didn't have to worry about how to encode the matrix and its sparsity, because we could compute the coefficients on the fly, operating with the same neighbors-list traversal logic used in the rest of the code; this allowed us to minimize memory use and increase code reuse.
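To illustrate the idea (this is a hypothetical sketch, not GPUSPH's actual code: the names `matvec`, `neighbors` and `coeff` are illustrative), a "matrix-free" matrix-vector product never stores the sparse matrix; each coefficient is recomputed while walking the neighbor list, just like any other SPH interaction loop:

```python
def matvec(x, neighbors, coeff):
    """Compute y = A @ x without ever materializing A.

    neighbors[i]: indices of the particles interacting with particle i;
    coeff(i, j): recomputes the matrix entry A[i][j] on the fly.
    (Illustrative sketch; GPUSPH does this in CUDA over its own
    neighbor-list structures.)
    """
    y = [0.0] * len(x)
    for i in range(len(x)):
        acc = coeff(i, i) * x[i]       # diagonal (self) term
        for j in neighbors[i]:
            acc += coeff(i, j) * x[j]  # off-diagonal terms, recomputed
        y[i] = acc
    return y
```

Since iterative solvers like CG and BiCGSTAB only ever need products A·x, this is all the "matrix" they require.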

Secondly, we gained control over the accuracy of intermediate operations, allowing us to use compensated sums wherever needed.
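For readers unfamiliar with compensated summation: the idea is to carry along the low-order bits that plain floating-point accumulation discards. The thread doesn't say which exact scheme GPUSPH uses; as one example, here is Neumaier's variant of Kahan summation:

```python
def compensated_sum(values):
    """Neumaier's compensated summation (one possible scheme;
    not necessarily the exact variant used in GPUSPH)."""
    total = 0.0
    c = 0.0  # running compensation: the error lost by each addition
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            c += (total - t) + v   # low-order bits of v were lost
        else:
            c += (v - t) + total   # low-order bits of total were lost
        total = t
    return total + c
```

On a pathological input like `[1.0, 1e100, 1.0, -1e100]`, a naive sum returns 0.0, while the compensated version recovers the correct 2.0.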

Thirdly, we could leverage the multi-GPU and multi-node capabilities already present in GPUSPH to distribute computations across all available devices.

And last but not least, we actually found ways to improve the classic #CG and #BiCGSTAB linear solvers, achieving excellent accuracy and convergence even without preconditioners, while making the algorithms themselves more parallel-friendly:

https://doi.org/10.1016/j.jcp.2022.111413

4/n

#LinearAlgebra #NumericalAnalysis

Even now, Thrust as a dependency is one of the main reasons why we have a #CUDA backend, a #HIP / #ROCm backend and a pure #CPU backend in #GPUSPH, but not a #SYCL or #OneAPI backend (which would allow us to extend hardware support to #Intel GPUs). <https://doi.org/10.1002/cpe.8313>

This is also one of the reasons why we implemented our own #BLAS routines when we introduced the semi-implicit integrator. A side-effect of this choice is that it allowed us to develop the improved #BiCGSTAB that I've had the opportunity to mention before <https://doi.org/10.1016/j.jcp.2022.111413>. Sometimes I do wonder if it would be appropriate to “excorporate” it into its own library for general use, since it's something that would benefit others. OTOH, this one was developed specifically for GPUSPH and it's tightly integrated with the rest of it (including its support for multi-GPU), and refactoring it into a standalone library along the lines of cuBLAS is

a. too much effort
b. probably not worth it.

Again, following @eniko's original thread, it's really not that hard to roll your own, and probably less time-consuming than trying to wrangle your way through an API that may or may not fit your needs.

6/

This, BTW, is why we developed a semi-implicit formulation that handles the viscous part (only) implicitly:
https://www.sciencedirect.com/science/article/abs/pii/S002199911830593X
https://www.sciencedirect.com/science/article/abs/pii/S0021999122004752
(the latter may be of interest to you if you use #BiCGSTAB or #ConjugateGradient methods to solve linear systems that have nothing to do with SPH, BTW).

And now comes the new difficulty: THERMAL!

6/n

Although the focus of the paper is on #SmoothedParticleHydrodynamics and the simulation of viscous fluids, the key finding is of more general interest: the numerical stability of #BiCGSTAB can be improved by rewriting the standard formulation (https://en.wikipedia.org/wiki/Biconjugate_gradient_stabilized_method) to avoid catastrophic cancellation in the computation of some important coefficients. Anyone interested in solving large linear systems would benefit from adopting the proposed alternative form of the method.
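For context, this is a sketch of the classic unpreconditioned BiCGSTAB as given on the linked Wikipedia page (NOT the paper's improved variant; the rewriting the paper proposes concerns how some of the scalar coefficients below are computed):

```python
def bicgstab(A, b, x0=None, tol=1e-10, maxiter=200):
    """Textbook unpreconditioned BiCGSTAB (van der Vorst).

    A: callable returning the matrix-vector product A(x);
    b: right-hand side, as a list of floats.
    """
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    axpy = lambda a, u, v: [a * ui + vi for ui, vi in zip(u, v)]

    x = list(x0) if x0 is not None else [0.0] * n
    r = [bi - Axi for bi, Axi in zip(b, A(x))]
    r_hat = list(r)                     # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    for _ in range(maxiter):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        rho = rho_new
        p = axpy(beta, axpy(-omega, v, p), r)  # p = r + beta*(p - omega*v)
        v = A(p)
        alpha = rho / dot(r_hat, v)
        s = axpy(-alpha, v, r)                 # s = r - alpha*v
        t = A(s)
        omega = dot(t, s) / dot(t, t)
        x = axpy(alpha, p, axpy(omega, s, x))  # x += alpha*p + omega*s
        r = axpy(-omega, t, s)                 # r = s - omega*t
        if dot(r, r) ** 0.5 < tol:
            break
    return x
```

The scalar updates (rho, beta, alpha, omega) involve differences of nearly equal quantities in finite precision, which is where cancellation can bite; see the paper for the reformulated, more robust version.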

I'm still preparing my post series on #GPGPU, but in the meantime you can read a bit about it in our most recently published paper:
“A numerically robust, parallel-friendly variant of #BiCGSTAB for the semi-implicit integration of the viscous term in #SmoothedParticleHydrodynamics”, freely accessible for the next 50 days from this link:
https://authors.elsevier.com/a/1fN8C508HsZiG