New article about Optimizing GPU Programs from Java using Babylon and HAT.
Using matrix multiplication as an example, this article explains how Java developers can tune GPU compute kernels written in Java to achieve performance close to native cuBLAS, scaling from 7 GFLOP/s on CPUs to 14 TFLOP/s on an NVIDIA A10 GPU, using only Babylon's code reflection APIs and carefully designed APIs for GPUs.
https://openjdk.org/projects/babylon/articles/hat-matmul/hat-matmul
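
For readers who want a feel for the numbers quoted above, here is a minimal sketch of a plain-Java CPU baseline and of how a GFLOP/s figure is usually derived for matrix multiplication (2 * n^3 floating-point operations for square n x n matrices). It deliberately uses no Babylon or HAT APIs. The class name, method names, and matrix size are illustrative assumptions and are not taken from the article, and the measured throughput will vary by machine.

```java
public class MatMulBaseline {

    // Naive matrix multiplication on row-major float arrays: c = a * b, all n x n.
    // This is a hypothetical CPU baseline, not the kernel from the article.
    public static void matmul(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    // An n x n matmul performs n^3 multiplies and n^3 adds, i.e. 2 * n^3 flops.
    public static double gflops(int n, double seconds) {
        return 2.0 * n * n * n / seconds / 1e9;
    }

    public static void main(String[] args) {
        int n = 512; // assumed size, chosen only to keep the run short
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        float[] c = new float[n * n];
        java.util.Arrays.fill(a, 1.0f);
        java.util.Arrays.fill(b, 2.0f);

        long start = System.nanoTime();
        matmul(a, b, c, n);
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("n=%d, %.3f s, %.2f GFLOP/s%n", n, seconds, gflops(n, seconds));
    }
}
```

A single untimed warm-up call before the measured run would give the JIT compiler a chance to optimize the loop, which makes the reported figure more representative.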

