A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

#CUDA #Compilers #Sparse #MatrixMultiplication

https://hgpu.org/?p=29951

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent …

hgpu.org
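The compactness/irregularity tradeoff the abstract mentions is easy to see in a minimal CSR (compressed sparse row) sketch. This plain-Python illustration is mine, not the paper's transformation: it shows how the format stores only nonzeros, while the indirect `x[col_idx[k]]` read is exactly the data-dependent access pattern that defeats coalescing on GPUs.

```python
# Minimal CSR (compressed sparse row) sketch: the format stores only the
# nonzeros, but the stored column indices force data-dependent reads of x.

def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x. The indirect access x[col_idx[k]] is the source of the
    irregular memory traffic sparse-compiler work targets."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

A = [[5, 0, 0],
     [0, 0, 3],
     [2, 0, 1]]
vals, cols, ptr = to_csr(A)
print(csr_matvec(vals, cols, ptr, [1, 2, 3]))  # [5, 9, 5]
```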
🎉 Big news, folks! AMD's CDNA 4 is here to revolutionize... slightly. 📉 Chester Lam's riveting expose on how marginally improving matrix multiplication with low precision data types is the cutting edge of mediocrity. Who knew that less precision could be the future of tech? 🤔💡
https://chipsandcheese.com/p/amds-cdna-4-architecture-announcement #AMD #CDNA4 #Revolution #MatrixMultiplication #LowPrecision #HackerNews #ngated
AMD’s CDNA 4 Architecture Announcement

CDNA 4 is AMD’s latest compute-oriented GPU architecture, and represents a modest update over CDNA 3.

Chips and Cheese
MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration

General matrix-vector multiplication (GeMV) remains a critical latency bottleneck in large language model (LLM) inference, even with quantized low-bit models. Processing-Using-DRAM (PUD), an analog in-DRAM computing technique, has the potential to repurpose on-device DRAM as a GeMV engine, offering additional high-throughput processing capabilities to widespread consumer devices without DRAM modifications. However, applying PUD to GeMV operations in the LLM inference pipeline incurs significant overheads before and after in-DRAM computation, diminishing the benefits of its high-throughput processing capabilities. This paper presents MVDRAM, the first practical system to accelerate GeMV operations for low-bit LLM inference using unmodified DRAM. By leveraging the data sharing patterns and mathematical linearity in GeMV operations, MVDRAM orchestrates the processor and DRAM to eliminate the costs associated with pre-arranging inputs and bit-transposition of outputs required in conventional PUD approaches. Our experimental evaluation with four DDR4 DRAM modules shows that MVDRAM achieves comparable or even better inference speed than the processor-based implementation for GeMV operations in low-bit (under 4-bit) LLM. In particular, MVDRAM achieves up to 7.29× speedup and 30.5× energy efficiency for low-bit GeMV operations. For end-to-end LLM inference, MVDRAM achieves 2.18× and 1.31× throughput improvements, along with 3.04× and 2.35× energy efficiency, for 2-bit and 4-bit quantized low-bit models, respectively. MVDRAM has the potential to redefine the AI hardware landscape by demonstrating the feasibility of standard DRAM as an LLM accelerator.

arXiv.org
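The "mathematical linearity" the abstract leverages can be sketched in plain Python (function names here are mine, not MVDRAM's): a low-bit weight matrix decomposes into bit-planes, A = Σᵢ 2ⁱ·Aᵢ, so A·x = Σᵢ 2ⁱ·(Aᵢ·x), and each Aᵢ·x involves only 0/1 weights — the kind of bulk bitwise work in-DRAM techniques can compute natively.

```python
# GeMV linearity with low-bit weights: a 2-bit unsigned matrix A splits
# into bit-planes A = sum_i 2^i * A_i, so A @ x = sum_i 2^i * (A_i @ x).
# Each bit-plane product uses only 0/1 weights.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def bit_plane(A, i):
    """Extract bit i of every (non-negative integer) entry of A."""
    return [[(v >> i) & 1 for v in row] for row in A]

def gemv_by_bit_planes(A, x, bits):
    y = [0] * len(A)
    for i in range(bits):
        partial = matvec(bit_plane(A, i), x)  # 0/1-weight GeMV
        y = [acc + (p << i) for acc, p in zip(y, partial)]
    return y

A = [[3, 1], [2, 0]]  # 2-bit weights
x = [4, 5]
print(gemv_by_bit_planes(A, x, bits=2))  # [17, 8], same as matvec(A, x)
```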

Optimizing Matrix Multiplication: The Core of Modern Computing 🚀 Investigating the lower bound of 6 operations for matrix multiplication & special cases like Google's PageRank. From AI breakthroughs to web search, efficient matrices drive innovation behind the scenes.

#MatrixMultiplication #AI #Optimization https://t.co/FIxJ497NRw

Optimizing Matrix Operations: The Backbone of Modern Computing

This paper explores the fundamental importance of efficient matrix operations in modern computing, with a focus on the minimum number of computational operations required for matrix multiplication. We discuss the mathematical proof that establishes a lower bound of six operations for general matrix multiplication, and examine exceptions to this rule, such as matrices containing only zeros and ones, as utilized in the Google PageRank algorithm. The paper also highlights the pervasive nature of these optimizations in programming languages and their critical role in various applications, including web search algorithms and artificial intelligence.

Zenodo
Deep Dive into Matrix Optimization on AMD GPUs:

Writing Super-Fast Matrix Multiplication with HIP, RGP, and ISA

seb-v
Advanced GEMM Optimization on Modern x86-64 Multi-Core Processors

This blog post explains how to optimize multi-threaded FP32 matrix multiplication for modern processors using FMA3 and AVX2 vector instructions. The optimized custom implementation resembles the BLIS design and outperforms existing BLAS libraries (including OpenBLAS and MKL) on a wide range of matrix sizes. Tags: high-performance GEMM on CPU, fast SGEMM in C, high-performance matrix multiplication on CPU, SGEMM optimization on CPU.

salykova
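The core loop restructuring behind BLIS-style GEMM can be sketched without any intrinsics. This toy plain-Python version is only an illustration of the blocking idea (the post's implementation uses FMA3/AVX2 in C, and real kernels tune block sizes to cache and register capacities; `bs=2` here is arbitrary): the three GEMM loops are split so a small tile of C is updated while its operands stay resident in fast memory.

```python
# Toy cache-blocking sketch of GEMM: compute C in small bs x bs tiles so
# the A and B operands of each tile stay in fast memory. Block size 2 is
# arbitrary; production kernels tune it to cache/register sizes.

def gemm_blocked(A, B, bs=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for p0 in range(0, k, bs):
                # "micro-kernel": update one bs x bs tile of C
                for i in range(i0, min(i0 + bs, n)):
                    for p in range(p0, min(p0 + bs, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + bs, m)):
                            C[i][j] += a * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm_blocked(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```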

How much can we gain from Tensor Kernel Fusion on GPUs?

#CUDA #MatrixMultiplication #LinearAlgebra #Performance

https://hgpu.org/?p=29255

How much can we gain from Tensor Kernel Fusion on GPUs?

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, where it involves combining multiple consecutive kernels into a single larger kernel. This…

hgpu.org
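The memory-traffic argument for fusion can be shown with a host-side analogy (plain Python, my own example — the paper is about GPU tensor kernels): unfused, each "kernel" writes and re-reads a full intermediate array; fused, one pass computes the whole expression, which on a GPU eliminates round-trips through global memory.

```python
# Fusion analogy: three separate "kernels" vs one fused pass computing
# relu(a*x + b). The fused version touches each element exactly once.

def unfused(xs, a, b):
    t = [a * x for x in xs]          # kernel 1: scale (writes t)
    t = [v + b for v in t]           # kernel 2: bias  (re-reads t)
    return [max(v, 0.0) for v in t]  # kernel 3: ReLU  (re-reads t)

def fused(xs, a, b):
    return [max(a * x + b, 0.0) for x in xs]  # single traversal

xs = [-2.0, 0.5, 3.0]
print(unfused(xs, 2.0, -1.0))  # [0.0, 0.0, 5.0]
print(fused(xs, 2.0, -1.0))    # same result, one pass over the data
```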

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

#OpenCL #FPGA #MatrixMultiplication #BLAS #LinearAlgebra #GEMM #Package

https://hgpu.org/?p=29241

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a compl…

hgpu.org
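For context on the algorithm named in the title: a single Strassen step multiplies 2×2 (block) matrices with 7 multiplications instead of the naive 8, and applied recursively this gives O(n^log₂7) ≈ O(n^2.807). A minimal sketch in plain Python (scalar entries here; the FPGA work operates on blocks):

```python
# One Strassen step on 2x2 matrices: 7 multiplications instead of 8.
# Applied recursively to block matrices: O(n^log2(7)) ~ O(n^2.807).

def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))  # [[19, 22], [43, 50]]
```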

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

#CUDA #SYCL #MKL #CUBLAS #MatrixMultiplication #LinearAlgebra #Performance #Package

https://hgpu.org/?p=29229

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

Matrix multiplication is fundamental in the backpropagation algorithm used to train deep neural network models. Libraries like Intel’s MKL or NVIDIA’s cuBLAS implemented new and optimiz…

hgpu.org
New Breakthrough Brings Matrix Multiplication Closer to Ideal

By eliminating a hidden inefficiency, computer scientists have come up with a new way to multiply large matrices that’s faster than ever.

Quanta Magazine