Slicing Is All You Need: Towards a Universal One-Sided Distributed MatMul

https://arxiv.org/abs/2510.08874

#HackerNews #Slicing #MatMul #Distributed #Computing #OneSided #AI #Research

Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings, including 1D, 2D, 1.5D, and 2.5D algorithms. However, each existing algorithm supports only a subset of possible partitionings, so multiple algorithm implementations are required to cover the full space. If no implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. The resulting list of local matrix multiplies can then either be executed directly or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication over intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our implementation is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.
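The slicing idea from the abstract can be sketched with plain index arithmetic. The sketch below is a hypothetical 1-D illustration (the function name and boundary representation are mine, not from the paper): for two different partitionings of the shared contraction dimension, intersecting the tile boundary intervals yields the list of local tile multiplies.

```python
def tile_overlaps(a_cuts, b_cuts):
    """Index arithmetic ("slicing") over one shared dimension.

    a_cuts / b_cuts are sorted tile-boundary lists (e.g. [0, 4, 8] means
    tiles [0,4) and [4,8)).  Returns (a_tile, b_tile, (lo, hi)) triples:
    the tile pairs whose index ranges overlap, and on what slice.
    """
    pairs = []
    for i in range(len(a_cuts) - 1):
        for j in range(len(b_cuts) - 1):
            lo = max(a_cuts[i], b_cuts[j])
            hi = min(a_cuts[i + 1], b_cuts[j + 1])
            if lo < hi:  # the two tiles share part of the k-dimension
                pairs.append((i, j, (lo, hi)))
    return pairs

# A is split [0,4),[4,8); B is split [0,3),[3,6),[6,8) on the same axis:
print(tile_overlaps([0, 4, 8], [0, 3, 6, 8]))
# [(0, 0, (0, 3)), (0, 1, (3, 4)), (1, 1, (4, 6)), (1, 2, (6, 8))]
```

Each returned triple names one local multiply; per the abstract, that list can be executed directly or reordered to overlap communication with compute.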

arXiv.org
All in on MatMul? Don’t Put All Your Tensors in One Basket!

Matrix multiplication dominates AI hardware and research. Betting everything on MatMul risks an innovation monoculture — it’s time to diversify our compute bets.

SIGARCH
Detailing the tiling scheme used for a CUDA kernel doing matrix-matrix multiplication #gpu #cuda #cplusplus #matmul #gemm https://indii.org/blog/gpu-matrix-multiply-tiling/
Matrix Multiplication On GPU: Part 2, Tiling

Breaking down large matrix multiplications into tiles

indii.org
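The tiling scheme the post details for CUDA shared memory can be sketched at a higher level with NumPy blocks (the tile size and function name below are my own, purely illustrative):

```python
import numpy as np

def matmul_tiled(A, B, tile=64):
    """Blocked matrix multiply: the same tiling idea a CUDA GEMM kernel
    uses with shared memory, expressed with NumPy slices.  Illustration
    only -- NumPy slicing clamps at array bounds, so ragged edges work."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):          # rows of C, one tile at a time
        for j in range(0, n, tile):      # columns of C
            for p in range(0, k, tile):  # accumulate over the k-dimension
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 80)
assert np.allclose(matmul_tiled(A, B, tile=32), A @ B)
```

On a GPU the payoff is that each tile of A and B is loaded into fast shared memory once and reused across a whole tile of C, instead of being re-fetched from global memory per output element.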
Advanced GEMM Optimization on Modern x86-64 Multi-Core Processors

This blog post explains how to optimize multi-threaded FP32 matrix multiplication for modern processors using FMA3 and AVX2 vector instructions. The optimized custom implementation resembles the BLIS design and outperforms existing BLAS libraries (including OpenBLAS and MKL) on a wide range of matrix sizes.

salykova

Ummm...this is totally gonna fuck #NVidia's share value! 😂😂😂

That's what they get for relying on throwing hardware at the problem when they could have fixed the software algorithms! #AI #MatMul #MatrixMath

https://arstechnica.com/information-technology/2024/06/researchers-upend-ai-status-quo-by-eliminating-matrix-multiplication-in-llms/

Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Running AI models without floating point matrix math could mean far less power consumption.

Ars Technica

My #Python code using #numpy is taking ~20 minutes to multiply two large-ish matrices (~17k x ~33k, ~33k x 700). Is this normal? I feel like this shouldn't be normal.

#NumPy #LinearAlgebra #Matmul
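It shouldn't be: twenty minutes is orders of magnitude off what a BLAS-backed NumPy needs for those shapes. A back-of-the-envelope check (a hypothetical diagnostic, not from the post) makes that concrete:

```python
import time
import numpy as np

# The multiply in question is (17000 x 33000) @ (33000 x 700),
# i.e. 2*m*k*n floating-point operations:
m, k, n = 17_000, 33_000, 700
flops = 2 * m * k * n
print(f"{flops / 1e9:.0f} GFLOPs")  # ~785 GFLOPs total

# A NumPy linked against an optimized BLAS (OpenBLAS, MKL) sustains tens
# of GFLOP/s on a CPU, so seconds, not minutes.  Time a small probe
# matmul to estimate the achievable rate on this machine:
a = np.random.rand(2000, 2000)
t0 = time.perf_counter()
a @ a
rate = 2 * 2000**3 / (time.perf_counter() - t0)
print(f"~{rate / 1e9:.1f} GFLOP/s -> est. {flops / rate:.1f} s for the full multiply")
```

If the estimate and reality disagree by orders of magnitude, `np.show_config()` will reveal whether NumPy is actually linked against an optimized BLAS; a plain float64 dtype (not object arrays) also matters.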

Another incredible thread from Horace He:

https://twitter.com/cHHillee/status/1630274804795445248?s=20

This time in response to a thread from @karpathy where slightly increasing the size of the embedding matrix resulted in a large speedup.

This thread covers some nuances of memory accesses and utilization for matmul, the basics of which are introduced here:

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
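One effect behind that speedup, tile quantization, can be shown with a toy utilization formula (the tile size of 64 and the function below are my assumptions for illustration, not NVIDIA's model):

```python
import math

def tile_utilization(dim, tile=64):
    """Fraction of launched tile work that is useful when a GEMM
    dimension `dim` is covered by fixed-size tiles (toy model)."""
    tiles = math.ceil(dim / tile)
    return dim / (tiles * tile)

print(tile_utilization(4096))  # 1.0 -- dimension divides evenly into tiles
print(tile_utilization(65))    # 0.5078125 -- a second, nearly empty tile runs
```

This is why padding a dimension up to the next multiple of the tile size (e.g. a vocabulary of 50257 padded to 50304, a multiple of 64) can make a matmul faster even though it does strictly more arithmetic.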

#CUDA #matmul

Horace He on Twitter

“Recently, Karpathy tweeted that *increasing* the size of his matmul made it run faster. But... why? Many people seem content to leave this as black magic. But luckily, this *can* be understood! Here's a plot of FLOPs achieved for square matmuls. Let's explain each curve! 1/19”

Twitter