“Beating NumPy’s Matrix Multiplication In 150 Lines Of C Code”, Aman Salykov (https://salykova.github.io/matmul-cpu).
Via HN: https://news.ycombinator.com/item?id=40870345
On Lobsters: https://lobste.rs/s/6cktqx/beating_numpy_s_matrix_multiplication
#C #MatrixMultiplication #Math #Performance #BLAS #LinearAlgebra #MatMul #Speed #NumPy #Optimization
Advanced GEMM Optimization on Modern x86-64 Multi-Core Processors
This blog post explains how to optimize multi-threaded FP32 matrix multiplication for modern processors using FMA3 and AVX2 vector instructions. The optimized custom implementation resembles the BLIS design and outperforms existing BLAS libraries (including OpenBLAS and MKL) on a wide range of matrix sizes.
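For readers unfamiliar with the BLIS-style design mentioned in the description, here is a rough, portable sketch of the core idea: C is computed in small MR x NR tiles whose partial sums stay in local accumulators for the whole k loop. This is not the author's code. The names sgemm_naive, micro_kernel and sgemm_tiled and the 4 x 4 tile size are hypothetical, and the sketch uses plain scalar C with no AVX2/FMA intrinsics, no packing, no cache blocking and no threading, all of which a real implementation would presumably add.

```c
#include <stddef.h>

/* Hypothetical tile sizes, chosen only for illustration. */
enum { MR = 4, NR = 4 };

/* Reference version: C += A * B for row-major A (m x k), B (k x n), C (m x n). */
void sgemm_naive(size_t m, size_t n, size_t k,
                 const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = C[i * n + j];
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Micro-kernel: adds the product of an MR x k panel of A and a k x NR panel
 * of B to one MR x NR tile of C. Each step of the p loop is a rank-1 update
 * of the accumulator tile; in a vectorized kernel the accumulators would
 * live in SIMD registers and each update would be a fused multiply-add. */
void micro_kernel(size_t k,
                  const float *A, size_t lda,
                  const float *B, size_t ldb,
                  float *C, size_t ldc)
{
    float acc[MR][NR] = {{0.0f}};

    for (size_t p = 0; p < k; p++)
        for (size_t i = 0; i < MR; i++)
            for (size_t j = 0; j < NR; j++)
                acc[i][j] += A[i * lda + p] * B[p * ldb + j];

    /* C is touched only once per tile, after the whole k loop. */
    for (size_t i = 0; i < MR; i++)
        for (size_t j = 0; j < NR; j++)
            C[i * ldc + j] += acc[i][j];
}

/* Tiled driver. Simplifying assumption: m is a multiple of MR and
 * n is a multiple of NR, so no edge handling is needed. */
void sgemm_tiled(size_t m, size_t n, size_t k,
                 const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            micro_kernel(k, &A[i * k], k, &B[j], n, &C[i * n + j], n);
}
```

Both functions compute the same result; the tiled version only changes the order of work so that each element of C is loaded and stored once per tile instead of being re-read inside the inner loop, which is what makes the later vectorization steps pay off.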