Mixed-precision numerics in scientific applications: survey and perspectives
My work for the month is to optimise the matrix addressing scheme of OpenFOAM to reduce cache misses. The initial idea is to replace the LDU matrix addressing scheme with a diagonal matrix addressing scheme using multiple arrays for structured meshes. I'll try it on a simple 2D Poisson equation with a Gauss-Seidel solver to check the performance benefits, and will eventually introduce a new solver as a plugin if the benefits are noticeable.
Wish me luck.
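A toy sketch of the idea in the post above: on a structured mesh, a cell's neighbours sit at fixed index offsets, so a solver can walk memory contiguously instead of chasing indirect face-addressing arrays. This is illustrative Python under my own assumptions (a 5-point stencil, zero Dirichlet boundaries, made-up function name), not OpenFOAM code:

```python
import numpy as np

def gauss_seidel_structured(f, h, sweeps):
    """Gauss-Seidel sweeps for the 2D Poisson equation -lap(u) = f
    with zero Dirichlet boundaries. On a structured mesh the four
    neighbours are at fixed offsets (i +/- 1, j +/- 1), so no
    indirect addressing array is needed and accesses stay local."""
    u = np.zeros_like(f)
    n, m = f.shape
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                # 5-point stencil update using in-place (Gauss-Seidel) values
                u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                  u[i, j - 1] + u[i, j + 1] +
                                  h * h * f[i, j])
    return u
```

By contrast, OpenFOAM's LDU scheme stores off-diagonal coefficients per face and reaches neighbours through owner/neighbour index arrays, which is fully general for unstructured meshes but adds an extra indirection per coefficient access; that indirection is what direct stencil addressing avoids on structured meshes.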
We are excited to announce support for Flux for #Kubeflow v2.2 to enable AI/ML workloads paired with #HPC simulation in #Kubernetes! 🥳
https://bsky.app/profile/vsoch.bsky.social/post/3mhssklh5xk2q
See the full post above to learn more, or jump into the demo! https://youtu.be/NbP0NdSDwog?si=DLHkdtYVnWa5lobg

We are excited to announce support for Flux for #Kubeflow v2.2 to enable AI/ML workloads paired with #HPC simulation. Flux adds a ZeroMQ bootstrap, support for #PMIx, more flavors of #MPI, and bypasses potential etcd and kube-sched bottlenecks. We are excited to bring this to the larger community! 🥳
Lustre Users Group 2026
April 27 – 29, 2026
Indianapolis, IN
Make plans to be with us in Indianapolis, IN for the conference for all things related to Lustre shared parallel storage. The Opening Reception is the evening of April 27th, and the conference presentations run April 28th–29th, 2026. Learn about new features and improvements in Lustre, including the most recent release, Lustre 2.17, with Hybrid IO, Dynamic LNet NID Configuration, and Nodemap enhancements.
Deadline extended until March 16th!
- Final weekend to submit your feedback -
Can you spare a minute for a very short survey? If you have ever used our OpenMP API Examples book, we are asking for your feedback on how we can improve it. The survey is short and quick.
Survey: https://link.openmp.org/4
(We will *not* add you to our contact list or sell your information)
#openmp #parallel #programming #HPC
Since I no longer work directly w/ model trainers, I rely on public info to understand the infrastructure reqs of newer model architectures. This paper is a great explainer of how MOE taxes compute/memory/network: https://arxiv.org/abs/2603.07685v1
My notes here: https://glennklockwood.com/garden/expert-parallelism

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
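The sparsity the abstract describes — each token activating only a subset of experts, so total parameters grow faster than per-token compute — can be illustrated with a toy top-k router in NumPy. This is a minimal sketch under my own assumptions (linear "experts", a plain argsort gate, a serial dispatch loop), not Megatron Core's dispatcher:

```python
import numpy as np

def moe_forward(tokens, gate_w, experts_w, top_k=2):
    """Toy MoE layer: route each token to its top_k experts by gate
    score and mix the expert outputs with a softmax over the selected
    scores. Per-token compute scales with top_k, not with the total
    number of experts."""
    logits = tokens @ gate_w                       # (T, E) gate scores
    topk = np.argsort(logits, axis=1)[:, -top_k:]  # (T, top_k) expert ids
    sel = np.take_along_axis(logits, topk, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # routing weights
    out = np.zeros_like(tokens)
    for e in range(gate_w.shape[1]):               # "dispatch" per expert
        hit = topk == e                            # tokens that chose e
        rows = hit.any(axis=1)
        if rows.any():
            out[rows] += (tokens[rows] @ experts_w[e]) * w[hit][:, None]
    return out
```

In a real system the per-expert loop becomes an all-to-all dispatch across GPUs and the experts are full MLPs, which is where the coupled memory/communication/computation constraints in the abstract come from.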