Mastodawn

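🌗 Strangely, GPU matrix multiplication runs faster when given "predictable" data!
➤ Uncovering the unexpected link between GPU performance and data predictability: the behind-the-scenes influence of dynamic power
✤ https://www.thonking.ai/p/strangely-matrix-multiplications
While benchmarking matrix multiplication on GPUs, the author stumbled on a puzzling phenomenon: when the input data is "predictable" (for example, all zeros or all ones), the computation runs faster than it does with random input. The finding came from comparing benchmark results for the CUTLASS library against CuBLAS called through PyTorch. At first, CUTLASS's built-in benchmarking tool showed a significant performance advantage, but that advantage vanished when the test was run from Python. After carefully comparing the code, he found that the CUTLASS benchmarking tool initializes its input data with integers by default. He then ran experiments with `torch.zeros` and `torch.randn`, confirming that the contents of the data really do affect matrix multiplication execution time.
Digging further, the author shows that the cause behind this phenomenon lies in semiconductor…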
#GPU performance #MatrixMultiplication #CUTLASS #CuBLAS #PowerLimit #Semiconductors #DynamicPower #A100

Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short]

Great minds discuss flops per watt.

Thonk From First Principles
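
The experiment summarized in the post above can be sketched roughly as follows. This is a minimal sketch, not the author's actual benchmark: `torch.zeros` and `torch.randn` come from the post, while the helper names `mean_matmul_seconds` and `compare_fills`, the matrix size, the half-precision dtype, and the iteration counts are all assumptions chosen for illustration.

```python
import time

try:
    import torch
except ImportError:  # lets the file import on machines without PyTorch
    torch = None


def mean_matmul_seconds(a, b, iters=20, warmup=3):
    """Average wall-clock seconds for a @ b, synchronizing if on a GPU."""
    on_gpu = a.is_cuda
    for _ in range(warmup):  # warm-up runs absorb one-time setup costs
        a @ b
    if on_gpu:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if on_gpu:
        # GPU kernels launch asynchronously; wait before stopping the clock
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


def compare_fills(n=4096, dtype=None, device=None, iters=20):
    """Time square matmuls for all-zero, all-one, and random-normal inputs."""
    if torch is None:
        raise RuntimeError("PyTorch is required for this sketch")
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    if dtype is None:
        # Assumption: half precision on GPU, float32 when falling back to CPU
        dtype = torch.float16 if device == "cuda" else torch.float32
    fills = {
        "zeros": lambda: torch.zeros(n, n, dtype=dtype, device=device),
        "ones": lambda: torch.ones(n, n, dtype=dtype, device=device),
        "randn": lambda: torch.randn(n, n, dtype=dtype, device=device),
    }
    return {
        name: mean_matmul_seconds(make(), make(), iters)
        for name, make in fills.items()
    }
```

On a CUDA machine, `compare_fills()` returns the mean seconds per multiplication for each fill pattern. Whether a gap shows up, and how large it is, will likely depend on the specific GPU and its power limits.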

HGPU group May 3

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

#Triton #CUDA #CUBLAS #LLM #Performance #Package

https://hgpu.org/?p=30763

NVIDIA’s CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Acce…

hgpu.org

HGPU group May 3

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

#CUDA #CUBLAS

https://hgpu.org/?p=30760

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute…

hgpu.org

HGPU group Dec 21

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

#CUDA #CUBLAS #MatrixMultiplication #Package

https://hgpu.org/?p=30469

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA …

hgpu.org

N-gated Hacker News Dec 4, 2025

Oh great, another #AI claiming it can multiply matrices faster than #cuBLAS 😴. Reinforcement learning to the rescue! Because when in doubt, throw AI at it and pray for miracles 🙏.
https://github.com/deepreinforce-ai/CUDA-L2 #Matrix #Multiplication #ReinforcementLearning #TechHumor #HackerNews #ngated

GitHub - deepreinforce-ai/CUDA-L2: CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

GitHub

Hacker News Dec 4, 2025

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication Through RL

https://github.com/deepreinforce-ai/CUDA-L2

#HackerNews #CUDA #L2 #cuBLAS #Matrix #Multiplication #RL #Performance

Habr Sep 18, 2025

Nvidia CMP: microscopes for hammering nails?! Digging deeper…

Why does a graphics card with decent compute capabilities run 20 times slower in Stable Diffusion than an RTX 3060? Why does it become a favorite in LM Studio, while in ComfyUI the carriage turns into a pumpkin? Why does FurMark stutter on the CMP 90HX, while on the CMP 50HX the "donut" spins almost normally? The answers lie in various software restrictions, which can be uncovered through experiments. I bought three Nvidia mining cards to find out whether they can be made to work efficiently. This time we look at: performance statistics in LM Studio, how dismal things are in ComfyUI and Stable Diffusion, the anatomy of GPU program code, why performance optimizations backfire on CMP cards, and which compute modes can unlock their potential.

https://habr.com/ru/articles/948396/

#llm #nvidia #cmp #50hx #90hx #lm_studio #mining #cuda #cublas #40hx

Habr

FredPlus10 Oct 15, 2024

TIL: Even though #Cublas always assumes column-major order, the docs of #cudaMemcpy2D assume row-major order!
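
To illustrate why the column-major assumption matters, here is a small NumPy sketch. It makes no real cuBLAS calls: `gemm_col_major` is a hypothetical stand-in that, like cuBLAS, reads flat buffers as column-major, and `matmul_row_major` shows the commonly used operand-swap workaround (computing C^T = B^T A^T), which is an addition here and is not mentioned in the post.

```python
import numpy as np


def gemm_col_major(m, n, k, a_buf, b_buf):
    """Hypothetical stand-in for a column-major GEMM.

    Reads a_buf as an m x k matrix and b_buf as a k x n matrix, both
    column-major, and returns the m x n product as a flat column-major buffer.
    """
    a = np.reshape(a_buf, (m, k), order="F")
    b = np.reshape(b_buf, (k, n), order="F")
    return (a @ b).ravel(order="F")


def matmul_row_major(a, b):
    """Multiply row-major a (m x k) by b (k x n) using only the column-major GEMM."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    a_flat = np.ascontiguousarray(a).ravel()  # row-major buffer of a
    b_flat = np.ascontiguousarray(b).ravel()  # row-major buffer of b
    # A row-major buffer read as column-major is the transpose, so swapping
    # the operands computes C^T = B^T A^T in column-major layout ...
    c_flat = gemm_col_major(n, m, k, b_flat, a_flat)
    # ... and that buffer is exactly C in row-major layout.
    return c_flat.reshape(m, n)


rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 5))

# Reading a's row-major buffer as a column-major 4 x 3 matrix yields a.T
a_misread = np.reshape(a.ravel(), (4, 3), order="F")
```

The swap avoids any explicit transposition or copy: the same bytes are simply reinterpreted, which is why this workaround is popular when row-major code has to call a column-major library.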

HGPU group Jun 2, 2024

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

#CUDA #SYCL #MKL #CUBLAS #MatrixMultiplication #LinearAlgebra #Performance #Package

https://hgpu.org/?p=29229

Matrix multiplication is fundamental in the backpropagation algorithm used to train deep neural network models. Libraries like Intel’s MKL or NVIDIA’s cuBLAS implemented new and optimiz…

hgpu.org

Peter Guhl Nov 30, 2023

Not sure who needs to know that, but if you get a #CUBLAS error 15 with #llama.cpp and the .cu-file has something about f16 at about the line which fails, starting main with --memory-f32 may be a workaround. Had this with the #NVIDIA #Tesla #M40 24GB.
#AI #MachineLearning #CUDA #llama2 #Meta