A new benchmark shows that larger CUDA tile sizes can cut Flash Attention throughput by 18–43% across sequence lengths. The study dives into kernel design, TFLOPS loss, and what it means for transformer model efficiency on NVIDIA GPUs. Open‑source researchers can use these insights to tune their kernels and reclaim performance. #FlashAttention #CUDATiles #GPUPerformance #TFLOPS

🔗 https://aidailypost.com/news/large-cuda-tiles-reduce-flash-attention-tflops-by-1843-across
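For context on what a "TFLOPS" figure in an attention benchmark means, here is a minimal sketch of the usual FLOP accounting. The shapes and timings below are hypothetical illustrations, not numbers from the linked article:

```python
def attention_flops(batch, heads, seq_len, head_dim):
    """Approximate FLOPs for one forward attention pass:
    2*s^2*d for Q @ K^T plus 2*s^2*d for P @ V, per head and batch
    (softmax and masking are ignored as lower-order terms)."""
    return 4 * batch * heads * seq_len**2 * head_dim

def tflops(flops, seconds):
    """Achieved throughput in TFLOPS given measured wall time."""
    return flops / seconds / 1e12

# Hypothetical shapes: 32 heads, 4096 tokens, head_dim 128.
f = attention_flops(1, 32, 4096, 128)
base = tflops(f, 0.0005)  # suppose one pass takes 0.5 ms
# An 18-43% throughput loss scales the achieved-TFLOPS figure,
# not the FLOP count, which is fixed by the problem shape:
worst, best = base * (1 - 0.43), base * (1 - 0.18)
print(round(base, 1), round(worst, 1), round(best, 1))
```

This is why a fixed-shape benchmark can report throughput loss as a TFLOPS delta: the numerator never changes, only the measured time does.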

A GPU performance comparison using an 8192x8192 BF16 matrix-multiplication benchmark. The B200 leads with 1,629.45 TFLOPS and a time of 306.85 ms, well ahead of the H200 SXM (680 TFLOPS), the MI300X (464.9 TFLOPS), and the RTX lineup. The Tesla V100 and Colab T4 are "slow as turtles." Conclusion: a Strix Halo mini PC (around 59 TFLOPS) is enough, with an RTX 3090 added if CUDA is needed. #GPU #TFLOPS #PerformanceBenchmark #GamingPC #AI #AMD #NVIDIA #ROCm #MLX #Kaggle #Colab #DGXSpark #TechNews #Tech #TestingGPU #Benchmarks #Telecom #TechCompare #VietnamTech
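A minimal sketch of the FLOP arithmetic behind a square-matmul benchmark like this one. The 59 TFLOPS figure is the Strix Halo estimate quoted above; the per-multiply timing derived from it is my own back-of-envelope illustration, not a measurement:

```python
def matmul_tflops(n, seconds):
    """Achieved TFLOPS for one n x n x n matrix multiply:
    a square matmul costs 2*n^3 FLOPs (n multiplies plus n adds
    per output element, n^2 output elements)."""
    return 2 * n**3 / seconds / 1e12

# One 8192x8192 multiply is ~1.1 TFLOP of work regardless of dtype:
work_tflop = 2 * 8192**3 / 1e12

# At ~59 TFLOPS sustained, a single multiply of this size would
# take roughly 18.6 ms:
seconds = 2 * 8192**3 / 59e12
print(round(work_tflop, 2), round(seconds * 1000, 1))
```

Dividing the same 2*n^3 work figure by each GPU's measured time is how the TFLOPS rankings in the post are produced.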


[Gluon][Tutorial] Persistent attention by Mogball · Pull Request #7298 · triton-lang/triton

Rewrite the attention kernel to be persistent. This gives better performance at low contexts. However, fp16 at large context has suffered a bit due to a ptxas instruction scheduling issue in the so...

GitHub
😂 Ah, the classic tale of tech sorcery where simply naming your kernel "cutlass" magically unlocks 100 #tflops of speed! Meanwhile, x.com is still busy booting you off your browser faster than you can say "incompatibility." 🏴‍☠️🔗📉
https://twitter.com/cis_female/status/1943069934332055912 #techhumor #cutlass #xcom #incompatibility #HackerNews #ngated
sophia (@cis_female) on X

> fp8 is 100 tflops faster when the kernel name has "cutlass" in it kms https://t.co/KpZjwSAkrM

X (formerly Twitter)
Nintendo Switch 2: TFLOPS power revealed

New leaks hint at the Nintendo Switch 2's compute power. It is less powerful than the Xbox Series S, but DLSS support could make the difference.

CeoTech
#China's secretive #Tianhe 3 #supercomputer uses a homegrown hybrid #CPU and rivals US systems with 1.57 #Exaflops of performance. The #NUDT #MT3000 features a unique heterogeneous architecture that combines general-purpose CPU cores with 96 control cores and 1,536 accelerator cores. The MT-3000 processor reportedly achieves 11.6 FP64 #TFLOPS of peak performance and demonstrates a power efficiency of 45.4 #GigaFLOPS/Watt at an operating frequency of 1.20 GHz https://www.tomshardware.com/tech-industry/supercomputers/chinas-secretive-tianhe-3-supercomputer-uses-homegrown-hybrid-cpu-rivals-us-systems-with-157-exaflops-of-performance-report #hpc #sanctions
China's secretive Tianhe 3 supercomputer uses homegrown hybrid CPU — rivals US systems with 1.57 Exaflops of performance: Report

Tianhe 3 could achieve peak performance of 1.57 ExaFLOPS.

Tom's Hardware
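The reported MT-3000 figures can be sanity-checked with simple arithmetic. The per-processor power draw and processor count below are derived values, not numbers from the report, and assume perfect linear scaling, which no real system achieves:

```python
# Back-of-envelope checks on the reported Tianhe-3 / MT-3000 figures.
peak_tflops = 11.6          # reported FP64 peak per MT-3000 processor
eff_gflops_per_watt = 45.4  # reported power efficiency
system_exaflops = 1.57      # reported system peak

# Implied power draw per processor: peak FLOPS / (FLOPS per watt).
watts = peak_tflops * 1e12 / (eff_gflops_per_watt * 1e9)

# Processors needed to reach 1.57 EFLOPS if peak scaled perfectly
# (a lower bound on the real count, since scaling is never linear):
count = system_exaflops * 1e18 / (peak_tflops * 1e12)
print(round(watts, 1), round(count))
```

The implied ~255 W per processor is within the normal envelope for HPC accelerators, which lends the efficiency figure some plausibility.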