Part2: #dailyreport #cuda #nvidia #gentoo #llvm #clang

I learned about CMake config files and the differences between
a compiler runtime library (GCC: libgcc and libatomic,
LLVM/Clang: compiler-rt, MSVC: vcruntime.lib), a C
standard library (glibc, musl), a C++ standard library
(GCC: libstdc++, LLVM: libc++, MSVC STL), a linker
(GCC: binutils, LLVM: lld), and an ABI; and between a
“toolchain” and a “build pipeline”.

Gentoo STL:
- libstdc++: sys-devel/gcc
- libc++: llvm-runtimes/libcxx

Gentoo libc: sys-libs/glibc and sys-libs/musl

I learned how NVIDIA CUDA and cuDNN are distributed and what
tooling PyTorch has.

Also, I updated my daemon + script that reports the heaviest
currently running process; I share it as a package in my
Gentoo overlay.

Part1: #dailyreport #cuda #nvidia #gentoo #llvm #clang
#programming #gcc #c++ #linux #toolchain #pytorch

I am compiling PyTorch with CUDA and cuDNN. PyTorch is
mainly a Python library; its core is the Caffe2 C++
library.

The main dependency of Caffe2 with CUDA support is
NVIDIA's "cutlass" library (a collection of CUDA C++
template abstractions). This library contains "CUDA code"
that may be compiled either with nvcc, the NVIDIA CUDA
compiler distributed with nvidia-cuda-toolkit, or with the
LLVM Clang++ compiler. However, LLVM supports CUDA only up
to version 12.1, though it can still be used to compile CUDA
for the sm_52 architecture. Looks like kneeling before NVIDIA. :)

Before installing dev-libs/cutlass you should do:
export CUDAARCHS=75

I successfully compiled cutlass; now I am trying to
compile PyTorch's CUDA code with the Clang++ compiler.
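For reference, a minimal sketch of what the Clang++ CUDA path looks like on a standalone source file (the file name, kernel, and compile command here are my own illustration, not from the PyTorch build; `--cuda-gpu-arch` selects the target architecture, e.g. sm_52):

```cuda
// axpy.cu — minimal CUDA source to test the Clang++ CUDA path.
// Hypothetical compile command (toolkit paths may differ per system):
//   clang++ --cuda-gpu-arch=sm_52 axpy.cu -lcudart -o axpy
#include <cstdio>
#include <cuda_runtime.h>

__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // y <- a*x + y, one element per thread
}

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory: visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    axpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);  // 256 threads per block
    cudaDeviceSynchronize();                        // wait for the kernel to finish

    printf("y[0] = %f\n", y[0]);  // expect 5.0 = 3*1 + 2
    cudaFree(x); cudaFree(y);
    return 0;
}
```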

A Cult AI Computer’s Boom and Bust:
I am aware that CUDA isn’t a language. But 🤷‍♂️

📺 https://www.youtube.com/watch?v=sV7C6Ezl35A

#video #yt #youtube #ai #boom #bust #it #ipl #aicomputing #history #aicult #aiboom #cuda #lisp #code

What is CUDA? - Computerphile (YouTube)

Ask HN: How to learn CUDA to professional level | Hacker News
https://news.ycombinator.com/item?id=35756489

📌 Summary:
This thread collects the experience and advice of many developers and CUDA users on how to reach a professional level of CUDA programming. The core of learning CUDA is understanding the GPU's parallel architecture and the CUDA programming model, with NVIDIA's official CUDA Programming Guide and books as the foundation. Beginners should have a solid C/C++ background and start by implementing small, simple parallel programs, gradually getting familiar with the toolchain, the compiler, and hardware limits. On the hardware side, a reasonably recent NVIDIA card with up-to-date drivers (e.g. GTX 1080 or newer) is recommended to ensure CUDA Toolkit compatibility.

In practice, debugging and performance tuning are unavoidable, including details such as memory layout, warp scheduling, synchronization, and L2 cache management. Several developers advise getting the code correct first and optimizing step by step afterwards, to avoid the bugs that premature optimization brings. As for application areas, CUDA is mostly used in high-performance computing, 3D game graphics, and machine learning/AI; if the goal is AI model development, higher-level frameworks such as PyTorch or TensorFlow may be a better fit. Suggested learning paths include reading open-source projects (e.g. Leela Chess Zero), using NVIDIA's official courses, reading high-performance-computing books, and participating in community discussions or real projects.

Hardware compatibility also cannot be ignored: GPUs of different generations and models differ in instruction sets and hardware resources, so for beginners it is more efficient to pick a target GPU and develop a project against that specific architecture. For advanced use, abstraction layers such as CUTLASS can lower the barrier to entry. Overall, mastering CUDA takes a lot of time and patience; a 6-to-8-week study plan, executed step by step, is suggested to become competitive in the job market.

🎯 Key Points:
→ Getting started
★ Learn the fundamental theory and API from NVIDIA's official CUDA Programming Guide and books
★ Know C or C++ and clearly understand parallel programming concepts
★ Practice: start with simple parallel tasks (e.g. matrix multiplication) and gradually increase complexity
★ Use suitable GPU hardware: a recent NVIDIA card (GTX 1080, RTX 20-series or newer) with matching driver versions

→ Learning workflow and tools
★ Install and get familiar with the CUDA Toolkit (e.g. version 12.9.1), NVIDIA Nsight, and debugging tools such as compute-sanitizer
★ Read and analyze public CUDA projects on GitHub to understand real-world usage from actual code
★ Practice techniques such as shared memory, warp scheduling, and Tensor Core acceleration to improve performance
★ Use LLMs (large language models) or community resources to help understand code and troubleshoot

→ Advanced challenges and application areas
★ Memory management, instruction-set diversity, and compatibility across GPU architectures are the hard parts
★ CUDA is mostly used in high-performance domains such as 3D game graphics and AI training; AI developers often prefer higher-level frameworks like PyTorch/TensorFlow
★ Practical advice: ensure correctness first, then optimize performance, avoiding memory errors
★ Read HPC (high-performance computing) and parallel-computing books such as "Programming Massively Parallel Processors" and "Scientific Parallel Computing"
★ Understand the differences between GPU vendors and APIs; when needed, tools such as HIPIFY help with cross-platform porting
★ Combine real projects with study to gradually build a complete skill set

🔖 Keywords:
#CUDA #GPUProgramming #ParallelComputing #NVIDIA #HPC
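As the thread suggests, a simple parallel task like matrix multiplication is a good first exercise. A textbook sketch of the naive kernel (names and launch parameters are mine, not from the thread):

```cuda
// Naive matrix multiply: one thread computes one element of C = A * B.
// A is MxK, B is KxN, C is MxN, all row-major.
__global__ void matmul(const float* A, const float* B, float* C,
                       int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot of row(A) and col(B)
        C[row * N + col] = acc;
    }
}

// Launch sketch: 16x16 threads per block, grid covering the whole C matrix.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// matmul<<<grid, block>>>(dA, dB, dC, M, N, K);
```

Tiling this through shared memory is then the natural next optimization step the thread's advice points toward.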

Ask HN: How to learn CUDA to professional level

Discussion: https://news.ycombinator.com/item?id=44216123

#cuda


All You Need Is Binary Search! A Practical View on Lightweight Database Indexing on GPUs

#CUDA #Databases #Performance

https://hgpu.org/?p=29922


Performing binary search on a sorted dense array is a widely used baseline when benchmarking sophisticated index structures: It is simple, fast to build, and indexes the dataset with minimal memory…

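The baseline the paper benchmarks against can be sketched as a per-thread lower-bound binary search over a sorted, dense device array (a generic sketch, not code from the paper):

```cuda
// Returns the index of the first element >= key (lower bound)
// in a sorted, dense device array.
__device__ int lower_bound(const int* sorted, int n, int key) {
    int lo = 0, hi = n;                // invariant: answer is in [lo, hi)
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;  // overflow-safe midpoint
        if (sorted[mid] < key) lo = mid + 1;
        else                   hi = mid;
    }
    return lo;
}

// Each thread probes one key; the loop above diverges little because
// all threads take the same number of iterations, log2(n).
__global__ void probe(const int* sorted, int n,
                      const int* keys, int* out, int m) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) out[i] = lower_bound(sorted, n, keys[i]);
}
```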

GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

#OpenCL #CUDA #Concurrency #Memory #ModelCheck

https://hgpu.org/?p=29921


GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional…

🌖 Highly efficient matrix transpose in Mojo 🔥
➤ High-performance GPU computing implemented in Mojo
https://veitner.bearblog.dev/highly-efficient-matrix-transpose-in-mojo/
This article shows step by step how to implement a highly efficient matrix-transpose kernel for the Hopper architecture in the Mojo language. The best kernel reaches a bandwidth of 2775.49 GB/s, 84.1056% of peak. The author compares this with his earlier pure-CUDA implementation on the same H100 hardware, which reached 2771.35 GB/s, showing that Mojo can match CUDA's performance on the same task. The article covers the basic approach, TMA (Tensor Memory Accelerator), and optimization techniques such as swizzling and thread coarsening, with detailed code examples and performance comparisons.
+ Wow, Mojo really has potential! Matching CUDA, and even beating it in some respects, is impressive.
+ The article explains things very clearly; even someone unfamiliar with Mojo can follow it. The code examples are practical and can be used directly.
#GPUProgramming #Mojo #MatrixOperations #CUDA
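For comparison, the standard CUDA form of such a kernel is a tiled transpose through shared memory (a textbook sketch, not the post's Hopper/TMA code; the +1 padding avoids shared-memory bank conflicts, which the post addresses with swizzling instead):

```cuda
#define TILE 32

// Transpose an NxN matrix through a shared-memory tile so that both the
// global load and the global store are coalesced.
__global__ void transpose(const float* in, float* out, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 pad: avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read

    __syncthreads();  // whole tile must be loaded before anyone writes

    // Swap the block indices so the write is coalesced too.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```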

Bringing GPU-Level Performance to Enterprise Java: A Practical Guide to CUDA Integration

#cuda #gpu #java #performance

https://www.infoq.com/articles/cuda-integration-for-java/


Learn how to offload compute-heavy Java tasks to the GPU using JNI and CUDA for ten to one hundred times performance improvement in secure and data-parallel workloads.

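The pattern the article describes can be sketched as a native entry point that Java reaches through JNI, which in turn moves data to the GPU and launches a kernel (the class and method names below are hypothetical, not from the article):

```cuda
// Hypothetical JNI bridge: the Java side declares
//   public static native void scale(float[] data, float factor);
// in class com.example.Gpu; this native side copies the array to the GPU,
// runs a kernel, and copies the result back into the Java array.
#include <jni.h>
#include <cuda_runtime.h>

__global__ void scale_kernel(float* d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;  // scale one element per thread
}

extern "C" JNIEXPORT void JNICALL
Java_com_example_Gpu_scale(JNIEnv* env, jclass, jfloatArray arr, jfloat factor) {
    jsize n = env->GetArrayLength(arr);
    jfloat* host = env->GetFloatArrayElements(arr, nullptr);  // pin/copy from JVM

    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev, factor, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    env->ReleaseFloatArrayElements(arr, host, 0);  // write results back to the JVM
}
```

The JNI copy across the JVM boundary is the main overhead here, which is why the article's 10-100x claims apply to compute-heavy, data-parallel workloads where the kernel time dominates the transfer.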