Sync training across geos isn’t new, tho doing it b/c of training data governance is. But training across AMD+NVIDIA is new; leave it to DOE to demonstrate such odd methods!

Unclear what separates “federated learning” from multicluster training tho.

https://www.sandia.gov/labnews/2025/12/18/three-national-security-laboratories-one-ai-model/

#AI

Three national security laboratories, one AI model

Sandia, Los Alamos and Lawrence Livermore national laboratories have proven that it's possible to share a large language model without compromising sensitive data from each lab.

LabNews

Glenn K. Lockwood 2d ago

Very last minute, but I'm giving a talk online tomorrow (Thurs Dec 18) about my analysis of over 85K model training checkpoints and implications for system design. Punchline is "less bandwidth makes training go faster."

Registration required: https://www.vastdata.com/events/vast-live-smarter-not-faster-the-storage-reality-hidden-in-85-000-ai-checkpoints?utm_medium=Social&utm_source=Glenn&utm_campaign=

#AI #storage

Smarter, Not Faster: The Storage Reality Hidden in 85,000 AI Checkpoints

Stop chasing multi-terabyte-per-second performance for your global storage. Focus on "checkpoint overlap," not raw bandwidth. Invest your budget in what matters most: GPUs.

VAST Data

Glenn K. Lockwood 3d ago

NERSC recently replaced its FDR InfiniBand storage fabric wholesale with RoCE. The IB fabric was a greenfield installation back when I started in 2015, and replacing it with a competing technology in production is quite the feat. Glad to hear it succeeded.

https://www.nersc.gov/news-and-events/news/network-upgrades-pave-the-way-for-a-faster-future

Network Upgrades Pave the Way to a Faster Future - NERSC: National Energy Research Scientific Computing Center

Glenn K. Lockwood 5d ago

RE: https://mast.hpc.social/@thedeadline/115724682400656262

Does this mean no more dirt-cheap NRE from Slurm? Or will Slurm development no longer be coin-operated? Would love to see serious engineering effort go into modernizing Slurm, but this could go in many directions.

#HPC

Glenn K. Lockwood Dec 6

This is kinda funny. At the risk of punching down, I know of at least one big storage deal that my employer got a crack at as a result of DDN doubling the price of the flash after their initial bid.

If that’s what they consider helping people deal with rising flash prices, I hope they keep doing it! It’s a great strategy to help DDN’s competitors.
From: @insidehpc.com
https://rss-parrot.net/u/insidehpc.com/status/1764933474992899678

Glenn K. Lockwood Dec 5

Philosophical Q: what is the role of industry in the CS peer review process? I am no longer a researcher and haven't published since (so far), but I'm still invited to review papers/proposals/projects/abstracts. It's not really my job to do this anymore, but I still feel partly obligated.

Adding to the complexity are the weird semi-conflicts. If the research runs on a platform that competes with my employer's product, is that a conflict?

Would love opinions.

Glenn K. Lockwood Dec 4

So yesterday I flew home from Oak Ridge, TN to San Francisco by way of Dulles Airport. My two flights emitted the same amount of CO2 as running a 100 MW data center for about 70-80 minutes. Or running HPL on the Frontier supercomputer for about 7 hours.

#HPC #AI
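
For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the numbers from the post above. It assumes both comparisons draw electricity at the same carbon intensity, which the post does not state, so it compares energy rather than CO2 directly. The function and constant names are made up for illustration.

```python
# Energy equivalents implied by the post's figures.
# Assumption: both comparisons use the same grid carbon intensity,
# so equal CO2 means equal energy. No emissions data is used here.
DATACENTER_POWER_MW = 100.0          # from the post
DATACENTER_MINUTES = (70.0, 80.0)    # "about 70-80 minutes", from the post
HPL_RUN_HOURS = 7.0                  # "about 7 hours", from the post


def energy_mwh(power_mw: float, minutes: float) -> float:
    """Energy in MWh for a constant power draw over a duration in minutes."""
    return power_mw * minutes / 60.0


def implied_power_mw(energy: float, hours: float) -> float:
    """Average power in MW that consumes `energy` MWh over `hours` hours."""
    return energy / hours


low, high = (energy_mwh(DATACENTER_POWER_MW, m) for m in DATACENTER_MINUTES)
print(f"Data center energy: {low:.0f} to {high:.0f} MWh")
print(
    "Implied average power of the 7-hour run: "
    f"{implied_power_mw(low, HPL_RUN_HOURS):.1f} to "
    f"{implied_power_mw(high, HPL_RUN_HOURS):.1f} MW"
)
```

Under that assumption the data center figure works out to roughly 117 to 133 MWh, which a 7-hour run would match at an average draw of about 16.7 to 19.0 MW.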

Glenn K. Lockwood Dec 4

Honest Q: what, exactly, is an AI factory?

Glenn K. Lockwood Dec 3

François Tessier Dec 3

📢 Call for Papers: 7th International Workshop on Extreme-Scale #Storage and #Analysis (ESSA 2026), held in conjunction with IPDPS (New Orleans, May 2026)!
More info here: sites.google.com/view/essa-2026
#HPC #Cloud #IPDPS

Glenn K. Lockwood Dec 3

Helios sounds like AMD's answer to NVIDIA's rack-scale NVLink, but it uses UALink over Ethernet with custom Broadcom scale-up switches. Interestingly, HPE will ship its Helios rack before its own Cray GX rack. Another example of #HPC playing second fiddle to #AI.

https://www.amd.com/en/newsroom/press-releases/2025-12-2-amd-and-hpe-expand-collaboration-to-advance-open-r.html