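🌗 Strangely, GPU matrix multiplication runs faster when given "predictable" data!
➤ Uncovering the unexpected link between GPU performance and data predictability: the behind-the-scenes influence of dynamic power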
https://www.thonking.ai/p/strangely-matrix-multiplications
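While benchmarking matrix multiplication on GPUs, the author stumbled on a puzzling phenomenon: when the input data is "predictable" (for example, all zeros or all ones), the computation runs faster than it does with random input. The finding came from comparing benchmark results for the CUTLASS library against CuBLAS called through PyTorch. At first, CUTLASS's own benchmarking tool showed a significant performance advantage, but that advantage disappeared when the same kernels were tested from Python. After carefully comparing the code, he found that the CUTLASS benchmarking tool initializes its input data with integers by default. He then ran experiments with `torch.zeros` and `torch.randn`, confirming that the contents of the data really do affect matrix multiplication execution time.
Digging further, the author reveals that the cause of this phenomenon is, in semiconductors, the…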
#GPU performance #MatrixMultiplication #CUTLASS #CuBLAS #PowerLimits #Semiconductors #DynamicPower #A100
Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! [short]

Great minds discuss flops per watt.

Thonk From First Principles
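
The zeros-versus-random comparison described in the post above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the helper name `time_matmul`, the matrix size, the dtype, and the iteration counts are all assumptions. The effect is reported for a power-limited GPU, so the CPU fallback here only keeps the sketch runnable and should not be expected to show any difference.

```python
import time

import torch


def time_matmul(make_input, n=4096, iters=50, warmup=10):
    """Return mean seconds per (n x n) @ (n x n) matmul.

    make_input is a tensor factory such as torch.zeros or torch.randn.
    The name and defaults are hypothetical, chosen for illustration.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Assumption: half precision on GPU; float32 on CPU for portability.
    dtype = torch.float16 if device == "cuda" else torch.float32
    a = make_input((n, n), device=device, dtype=dtype)
    b = make_input((n, n), device=device, dtype=dtype)

    # Warm-up iterations so one-time setup costs are not timed.
    for _ in range(warmup):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels to finish

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()  # kernels launch asynchronously
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # Use a small size on CPU so the demo finishes quickly.
    size = 4096 if torch.cuda.is_available() else 256
    for name, factory in [("zeros", torch.zeros),
                          ("ones", torch.ones),
                          ("randn", torch.randn)]:
        seconds = time_matmul(factory, n=size)
        print(f"{name:>5}: {seconds * 1e3:.3f} ms per matmul")
```

Any gap between the `zeros` and `randn` timings will depend on the GPU, its power limit, and the matrix size, so the numbers are best read as a qualitative check rather than a benchmark.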

NVIDIA introduces CuTe DSL in CUTLASS 4: Python closes in on C++ performance

Can you match C++ performance while writing Python, with no magic, no sugar-coating, and no weeks spent waiting on compilation? NVIDIA says yes: the new CuTe DSL in CUTLASS 4 promises "C++-level" Tensor Core performance with the convenience of Pythonic APIs.

Read more:
https://pressmind.org/nvidia-wprowadza-cute-dsl-w-cutlass-4-python-zbliza-sie-do-c-w-wydajnosci/

#PressMindLabs #cutedsl #cutlass #gemm #nvidia #pythonjit

#cutlass is a short, broad sabre or slashing sword with a straight or slightly curved blade sharpened on the cutting edge and a hilt often featuring a solid cupped or basket-shaped guard
#Cutlass #JetFire This was the first car in history with a factory turbo (Video) https://www.zougla.gr/automoto/automoto-news/afto-itan-to-proto-aftokinito-me-ergostasiako-turbo-stin-istoria-vinteo/?utm_source=dlvr.it&utm_medium=mastodon
Someone discovered that slapping the word "cutlass" on a #kernel magically boosts #performance by 100 tflops! ⚡🔪 Meanwhile, #GitHub is busy throwing #AI #buzzwords around like confetti, because who needs actual substance when you have Sparkly New Features™? 🙄🎉
https://github.com/triton-lang/triton/pull/7298 #cutlass #boost #tech #news #optimization #HackerNews #ngated
[Gluon][Tutorial] Persistent attention by Mogball · Pull Request #7298 · triton-lang/triton

Rewrite the attention kernel to be persistent. This gives better performance at low-contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the so...

GitHub

Hacker News post title, something like "Nvidia software runs significantly faster when kernel name has 'cutlass' in it."

WHAT?

Hacker News commenter replies "The Volkswagen emissions testing model"

AH, I SEE 😂😂😂

https://github.com/triton-lang/triton/pull/7298
https://news.ycombinator.com/item?id=44530581

#hn #nvidia #github #programming #cutlass
