#FuriosaAI, a Seoul-based #AIchipdeveloper, is reportedly seeking $300 million to $500 million in a Series D funding round. The company’s flagship #RNGD chip is optimised for #tensorcontraction, a generalisation of #matrixmultiplication that the chip executes as a native primitive for greater efficiency, leading to faster #AIperformance. https://siliconangle.com/2026/01/19/ai-chip-developer-furiosaai-reportedly-raising-500m/?Pirates.BZ #Pirates #Tech #Startup #News
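
To make the relationship concrete: matrix multiplication is the rank-2 special case of tensor contraction, so a chip whose native primitive is contraction covers matmul for free. A minimal NumPy illustration of the math (nothing here is FuriosaAI-specific):

```python
import numpy as np

# Matrix multiplication as a tensor contraction over the shared index k:
# C[i, j] = sum_k A[i, k] * B[k, j]
A = np.random.rand(4, 8)
B = np.random.rand(8, 3)
C = np.einsum("ik,kj->ij", A, B)
assert np.allclose(C, A @ B)

# The same primitive expresses higher-rank contractions directly,
# e.g. a batched attention-style product over a shared feature axis d:
Q = np.random.rand(2, 16, 64)              # (batch, seq, d)
K = np.random.rand(2, 16, 64)
scores = np.einsum("bqd,bkd->bqk", Q, K)   # no flattening into matmuls
```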

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

#CUDA #CUBLAS #MatrixMultiplication #Package

https://hgpu.org/?p=30469

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA …

hgpu.org
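
The paper's RL loop isn't reproduced in this snippet, but its headline claim (surpassing cuBLAS) implies a benchmark harness of roughly the following shape. This is a hypothetical sketch using PyTorch as a stand-in front end, where torch.matmul on fp16 CUDA tensors dispatches to a cuBLAS-backed kernel; the function name and problem sizes are placeholders:

```python
import torch

def tflops_hgemm(fn, M=4096, N=4096, K=4096, iters=50):
    """Rough TFLOP/s of a half-precision matmul callable (hypothetical harness)."""
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    fn(a, b)                                   # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(a, b)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters       # average time per GEMM
    return 2 * M * N * K / (ms * 1e-3) / 1e12  # 2*M*N*K FLOPs per GEMM

baseline = tflops_hgemm(torch.matmul)          # cuBLAS-backed baseline
# a candidate kernel "surpasses cuBLAS" if tflops_hgemm(candidate) > baseline
```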
Ah, another groundbreaking revelation from the Department of Redundancy Department. 🤦‍♂️ It turns out slicing bread isn't just for sandwiches anymore; it's the ultimate solution for distributed matrix multiplication! 🥪🔢 Just slice and dice, and voilà, computational problems disappear, or so we're told! 🙄
https://arxiv.org/abs/2510.08874 #SlicingBread #MatrixMultiplication #ComputationalSolutions #TechHumor #HackerNews #ngated
Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication

Many important applications across science, data analytics, and AI workloads depend on distributed matrix multiplication. Prior work has developed a large array of algorithms suitable for different problem sizes and partitionings, including 1D, 2D, 1.5D, and 2.5D algorithms. A limitation of current work is that each existing algorithm supports only a subset of partitionings. Multiple algorithm implementations are required to support the full space of possible partitionings. If no algorithm implementation is available for a particular set of partitionings, one or more operands must be redistributed, increasing communication costs. This paper presents a universal one-sided algorithm for distributed matrix multiplication that supports all combinations of partitionings and replication factors. Our algorithm uses slicing (index arithmetic) to compute the sets of overlapping tiles that must be multiplied together. This list of local matrix multiplies can then either be executed directly or reordered and lowered to an optimized IR to maximize overlap. We implement our algorithm using a high-level C++-based PGAS programming framework that performs direct GPU-to-GPU communication using intra-node interconnects. We evaluate performance for a wide variety of partitionings and replication factors, finding that our work is competitive with PyTorch DTensor, a highly optimized distributed tensor library targeting AI models.

arXiv.org
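
The core trick the abstract names, slicing as index arithmetic, is easy to see in miniature. This is not the paper's C++/PGAS implementation; it's a hypothetical Python sketch for a 1D partitioning of the contraction dimension, where interval overlaps alone determine which local multiplies must run:

```python
def intervals(K, parts):
    """Split [0, K) into `parts` contiguous tiles, as (start, end) pairs."""
    step = -(-K // parts)                      # ceiling division
    return [(i, min(i + step, K)) for i in range(0, K, step)]

def overlapping_multiplies(K, a_parts, b_parts):
    """Tile pairs of A (columns) and B (rows) whose K-ranges overlap."""
    work = []
    for ia, (a0, a1) in enumerate(intervals(K, a_parts)):
        for ib, (b0, b1) in enumerate(intervals(K, b_parts)):
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:                        # non-empty overlap => one local matmul
                work.append((ia, ib, (lo, hi)))
    return work

# A's columns split into 3 tiles, B's rows into 4, over K = 100:
print(overlapping_multiplies(100, 3, 4))
```

A work list like this is what the paper then executes directly or reorders and lowers to an IR to overlap communication with compute.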
Congratulations, you've unlocked the secret to making matrix multiplication sound like a cult initiation ceremony! 🎉🔮 Why focus on the math when you can drown in a tsunami of bureaucratic jargon and committees instead? 🙃📑
https://www.sigarch.org/dont-put-all-your-tensors-in-one-basket-hardware-lottery/ #matrixmultiplication #cultinitiation #bureaucraticjargon #tsunamioffun #HackerNews #ngated
All in on MatMul? Don’t Put All Your Tensors in One Basket!

Matrix multiplication dominates AI hardware and research. Betting everything on MatMul risks an innovation monoculture — it’s time to diversify our compute bets.

SIGARCH

**Title:** JAX Pallas: Mosaic GPU Breakthrough in Collective Matrix Multiplication
**Post:**
🚀 Today, JAX Pallas: Mosaic GPU pushes the limits of computation with Collective Matrix Multiplication! Computing and communicating at the same time makes full use of the GPU, handling large, high-throughput data flows. This technique opens new paths for AI, ML, and scientific research. 🌐
#JAX #GPU #MatrixMultiplication #AI #MachineLearning #Technology

https://www.reddit.com/r/programming/co
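
For those who'd rather see the idea than the announcement: below is a minimal collective matmul in plain JAX (shard_map plus an all-reduce), not the Pallas:Mosaic GPU kernel the article covers; the array shapes and mesh axis name are arbitrary:

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(jax.devices(), axis_names=("x",))

def local_matmul(a_blk, b_blk):
    partial = a_blk @ b_blk                        # per-device partial product
    return jax.lax.psum(partial, axis_name="x")    # all-reduce across shards

# A sharded along its columns (the contraction dim), B along its rows:
matmul = shard_map(local_matmul, mesh=mesh,
                   in_specs=(P(None, "x"), P("x", None)),
                   out_specs=P(None, None))        # replicated result

a = jnp.ones((256, 512))
b = jnp.ones((512, 128))
c = jax.jit(matmul)(a, b)                          # equals a @ b
```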

📚🔬 Behold! Another riveting tale of matrix multiplication that promises to make your brain cells do backflips. 🤸‍♂️🎉 Multithreaded #FP32 #optimizations that require you to sacrifice your first-born to hyperparameters just to squeeze out a few extra FLOPS of performance. ⚙️🛠️ And if you want the actual code, here's a hint: #sgemm.c. Happy debugging! 🖥️💥
https://salykova.github.io/gemm-cpu #matrixmultiplication #multithreading #debugging #HackerNews #ngated
Advanced Matrix Multiplication Optimization on Modern Multi-Core Processors

A detailed blog post on optimizing multi-threaded matrix multiplication for x86 processors to achieve OpenBLAS/MKL-like performance.

salykova
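
Snark aside, the technique underneath the post is loop blocking for cache reuse. The blog's sgemm.c is hand-tuned C with SIMD microkernels and threading; the NumPy sketch below shows only the blocking structure, and the block sizes are arbitrary placeholders rather than the tuned hyperparameters the post jokes about:

```python
import numpy as np

def blocked_sgemm(A, B, mb=64, nb=64, kb=64):
    """FP32 matmul with cache blocking; equivalent to A @ B."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for k0 in range(0, K, kb):                 # stream one K-panel at a time...
        for m0 in range(0, M, mb):             # ...and reuse it across row blocks
            for n0 in range(0, N, nb):
                C[m0:m0+mb, n0:n0+nb] += (
                    A[m0:m0+mb, k0:k0+kb] @ B[k0:k0+kb, n0:n0+nb]
                )
    return C

A = np.random.rand(256, 320).astype(np.float32)
B = np.random.rand(320, 192).astype(np.float32)
assert np.allclose(blocked_sgemm(A, B), A @ B, atol=1e-3)
```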
🚀 Oh wow, hold the phone, folks! Matrix multiplication is so central to #computing, you'd think it just solved world hunger 🤯. Apparently, the secret to AI supremacy lies not in groundbreaking innovation, but in rehashing #kernels across platforms—because obviously, speed is all that matters 🙄.
https://burn.dev/blog/sota-multiplatform-matmul/ #MatrixMultiplication #AIInnovation #SpeedMatters #HackerNews #ngated
Burn

Next Generation AI Infrastructure

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

#CUDA #Compilers #Sparse #MatrixMultiplication

https://hgpu.org/?p=29951

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent …

hgpu.org
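
The snippet cuts off, but the trade-off it names is easy to demonstrate: compressed sparse formats shrink memory at the cost of indirect, irregular accesses, which is exactly what a GPU compiler transformation for sparse matmul has to tame. A small SciPy illustration (the density and shapes are arbitrary):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# 1%-dense matrix in CSR form: values + column indices + row pointers.
A = sparse_random(1024, 1024, density=0.01, format="csr", dtype=np.float32)
B = np.random.rand(1024, 64).astype(np.float32)

C = A @ B   # SpMM: row pointers and column indices drive gathers from B

dense_bytes = A.shape[0] * A.shape[1] * 4
csr_bytes = A.data.nbytes + A.indices.nbytes + A.indptr.nbytes
print(f"CSR footprint: {csr_bytes / dense_bytes:.1%} of dense")
```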
🎉 Big news, folks! AMD's CDNA 4 is here to revolutionize... slightly. 📉 Chester Lam's riveting exposé on how marginally improving matrix multiplication with low-precision data types is the cutting edge of mediocrity. Who knew that less precision could be the future of tech? 🤔💡
https://chipsandcheese.com/p/amds-cdna-4-architecture-announcement #AMD #CDNA4 #Revolution #MatrixMultiplication #LowPrecision #HackerNews #ngated
AMD’s CDNA 4 Architecture Announcement

CDNA 4 is AMD’s latest compute-oriented GPU architecture, and represents a modest update over CDNA 3.

Chips and Cheese
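
Snark aside, the reason low-precision matmul keeps headlining GPU architectures is simple arithmetic: narrower operands halve data movement, and the accuracy cost depends on where rounding happens. A small illustration of operand rounding alone (nothing CDNA 4-specific; accumulator precision, which matrix units also vary, would add further error):

```python
import numpy as np

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
ref = A @ B                                     # float64 reference

def quantized_matmul(dtype):
    """Round the operands to `dtype`, then multiply in float32."""
    return A.astype(dtype).astype(np.float32) @ B.astype(dtype).astype(np.float32)

for dtype in (np.float32, np.float16):
    err = np.abs(quantized_matmul(dtype) - ref).max()
    print(dtype.__name__, "max abs error:", err)
```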