Mastodawn

Interesting #CUDA alternatives:

• ROCm (AMD): https://rocm.docs.amd.com
• HIP (CUDA portability layer): https://github.com/ROCm-Developer-Tools/HIP
• oneAPI (Intel): https://www.intel.com/oneapi
• SYCL (Khronos): https://www.khronos.org/sycl/
• OpenCL: https://www.khronos.org/opencl/
• BarraCUDA (experimental): https://github.com/BarraCUDA/BarraCUDA

PitCrew 22h ago

ICYMI: NVIDIA recommended driver 580.126.18 released for Linux

#CUDA #GeForce #Linux #LinuxGaming #NVIDIA #OpenGL #PCGaming #RTXOn #Vulkan

https://www.gamingonlinux.com/2026/02/nvidia-recommended-driver-580-126-18-released-for-linux/

Leibniz Supercomputing Centre 2d ago

Debugging comes before deployment: two #Workshops at LRZ present practical #debugging solutions for #supercomputing

🪲 On March 3, the agenda covers the cross-platform tools from Linaro Forge, which also help with #programming: https://app1.edoobox.com/en/LRZ/Online%20Courses/Online%20Course.ed.3b183bea439d_9895629709.Debugging%20and%20Optimising%20Parallel%20Codes%20with%20Linaro%20Forge

🐞 On March 5, the TotalView tools follow, which are aimed mainly at improving complex applications written in #c, C++, #Fortran, or #CUDA: https://app1.edoobox.com/en/LRZ/Online%20Courses/Online%20Course.ed.3fff44cf6f7d_10196942266.Debugging%20with%20TotalView

#code #software #openSource

Habr 2d ago

From MNIST to Transformer. Hello CUDA. Basics, Setup, and Our First Kernel

We live in an era when AI has become accessible to everyone. But behind the magic of PyTorch lies colossal engineering work and complex computation that remains a black box for most people. I want to launch a long series of articles, From MNIST to Transformer, whose goal is to walk step by step from a simple CUDA kernel to building the Transformer architecture, the foundation of modern LLMs. We will not use ready-made high-level libraries. We will examine how everything works under the hood and rebuild their key mechanisms by hand at the lowest level. Only this way can you truly understand how LLMs work and what lies behind them. Get ready: there will be a lot of C++ and CUDA code, memory management, and deep dives into GPU architecture. And, of course, the math behind it all. Let's go!

https://habr.com/ru/articles/996610/

#cuda #c++ #gpgpu #ml #lowlevel_programming

Pekka Jääskeläinen 3d ago

A journal article about chipStar is finally published! https://doi.org/10.1177/10943420261423001

chipStar is a compilation tool/runtime for porting CUDA/HIP applications to OpenCL/SPIR-V-capable platforms. Its origin is the HIPCL "prototype" developed within my research group, mainly by Michal Babej. It was then refined into chipStar in close collaboration with Argonne, Intel, and Paulius Velesko (PGLC), among others. #opencl #spirv #cuda #ijhpca

The code is here: https://github.com/CHIP-SPV/chipStar

HGPU group 3d ago

Deep Kernel Fusion for Transformers

#CUDA #LLM #Performance

https://hgpu.org/?p=30570

Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a maj…

HGPU group 3d ago

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

#CUDA #LLM #CodeGeneration #Package

https://hgpu.org/?p=30569

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly we…

HGPU group 3d ago

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

#CUDA #OpenMP #HPC #CodeGeneration #LLM

https://hgpu.org/?p=30568

Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs u…

sayzard 3d ago

AISatoshi (@AiXsatoshi)

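Performance report from running the Minimax-m2.5-NVFP4 model on vLLM with CUDA 12.8: 84.5 tok/s with NVFP4 and 109.6 tok/s with AWQ. The author mentions considering an upgrade because CUDA 13 seems to have better FP4 optimization. A real-world benchmark data point on how the GPU/CUDA version affects quantization performance.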

https://x.com/AiXsatoshi/status/2023016702318129524

#minimax #vllm #cuda #nvfp4
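
To put the two throughput figures from the post above (84.5 tok/s and 109.6 tok/s) side by side, here is a minimal Python sketch of the ratio. The function and variable names are made up for illustration, and the calculation assumes both numbers were measured on the same hardware and workload, which the post does not confirm.

```python
def relative_speedup(baseline_tok_s: float, candidate_tok_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tok_s / baseline_tok_s


# Figures quoted in the post (tokens per second under CUDA 12.8).
nvfp4_tok_s = 84.5
awq_tok_s = 109.6

speedup = relative_speedup(nvfp4_tok_s, awq_tok_s)
summary = f"AWQ vs NVFP4: {speedup:.2f}x ({(speedup - 1) * 100:.1f}% more tokens/s)"
print(summary)
```

In this single report AWQ comes out roughly 30% ahead of NVFP4. It is one data point, not a general ranking of the two quantization formats.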