Mixed-precision numerics in scientific applications: survey and perspectives
My work for the month is to optimise the matrix addressing scheme of OpenFOAM to reduce cache misses. The initial idea is to replace the LDU matrix addressing scheme with a diagonal matrix addressing scheme using multiple arrays for structured meshes. I'll try it on a simple 2D Poisson equation with a Gauss-Seidel solver to check the performance benefits, and will eventually introduce a new solver as a plugin if the benefits are noticeable.
Wish me luck.
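A toy sketch of the idea in the post above: on a structured mesh, a cell's neighbours sit at fixed index offsets, so a solver can walk memory contiguously instead of chasing indirect face-addressing arrays. This is illustrative Python under my own assumptions (a 5-point stencil, zero Dirichlet boundaries, made-up function name), not OpenFOAM code:

```python
import numpy as np

def gauss_seidel_structured(f, h, sweeps):
    """Gauss-Seidel sweeps for the 2D Poisson equation -lap(u) = f
    with zero Dirichlet boundaries. On a structured mesh the four
    neighbours are at fixed offsets (i +/- 1, j +/- 1), so no
    indirect addressing array is needed and accesses stay local."""
    u = np.zeros_like(f)
    n, m = f.shape
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                # 5-point stencil update using in-place (Gauss-Seidel) values
                u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                  u[i, j - 1] + u[i, j + 1] +
                                  h * h * f[i, j])
    return u
```

By contrast, OpenFOAM's LDU scheme stores off-diagonal coefficients per face and reaches neighbours through owner/neighbour index arrays, which is fully general for unstructured meshes but adds an extra indirection per coefficient access; that indirection is what direct stencil addressing avoids on structured meshes.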
We are excited to announce support for Flux for #Kubeflow v2.2 to enable AI/ML workloads paired with #HPC simulation in #Kubernetes! 🥳
https://bsky.app/profile/vsoch.bsky.social/post/3mhssklh5xk2q
See the full post above to learn more, or jump into the demo! https://youtu.be/NbP0NdSDwog?si=DLHkdtYVnWa5lobg

We are excited to announce support for Flux for #Kubeflow v2.2 to enable AI/ML workloads paired with #HPC simulation. Flux adds a ZeroMQ bootstrap, support for #PMIx, more flavors of #MPI, and bypasses potential etcd and kube-sched bottlenecks. We are excited to bring this to the larger community! 🥳
Lustre Users Group 2026
April 27 – 29, 2026
Indianapolis, IN
Make plans to be with us in Indianapolis, IN for the conference for all things related to Lustre shared parallel storage. The Opening Reception is the evening of April 27th, and the conference presentations run April 28th–29th, 2026. Learn about new features and improvements in Lustre, including the most recent release, Lustre 2.17, with Hybrid IO, Dynamic LNet NID Configuration, and Nodemap enhancements.
Deadline extended until March 16th!
- Final weekend to submit your feedback -
Can you spare a minute for a very short survey? If you have ever used our OpenMP API Examples book, we are asking for your feedback on how we can improve it. The survey is short and quick.
Survey: https://link.openmp.org/4
(We will *not* add you to our contact list or sell your information)
#openmp #parallel #programming #HPC
Since I no longer work directly w/ model trainers, I rely on public info to understand the infrastructure reqs of newer model architectures. This paper is a great explainer of how MOE taxes compute/memory/network: https://arxiv.org/abs/2603.07685v1
My notes here: https://glennklockwood.com/garden/expert-parallelism

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
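The sparsity the abstract describes — each token activating only a subset of experts, so total parameters grow faster than per-token compute — can be illustrated with a toy top-k router in NumPy. This is a minimal sketch under my own assumptions (linear "experts", a plain argsort gate, a serial dispatch loop), not Megatron Core's dispatcher:

```python
import numpy as np

def moe_forward(tokens, gate_w, experts_w, top_k=2):
    """Toy MoE layer: route each token to its top_k experts by gate
    score and mix the expert outputs with a softmax over the selected
    scores. Per-token compute scales with top_k, not with the total
    number of experts."""
    logits = tokens @ gate_w                       # (T, E) gate scores
    topk = np.argsort(logits, axis=1)[:, -top_k:]  # (T, top_k) expert ids
    sel = np.take_along_axis(logits, topk, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # routing weights
    out = np.zeros_like(tokens)
    for e in range(gate_w.shape[1]):               # "dispatch" per expert
        hit = topk == e                            # tokens that chose e
        rows = hit.any(axis=1)
        if rows.any():
            out[rows] += (tokens[rows] @ experts_w[e]) * w[hit][:, None]
    return out
```

In a real system the per-expert loop becomes an all-to-all dispatch across GPUs and the experts are full MLPs, which is where the coupled memory/communication/computation constraints in the abstract come from.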