Mastodawn

Tom Maiaroto (@tmaiaroto)

Reports reaching about 96 tokens/sec with a 128k context window using Atomic's E4B setup. Flash attention has to stay off and the -ctk f16 and -ctv f16 options have to be set to avoid crashes; the 8-bit assistant or Q4_K_M can both be used. The results come from a llama-swap based test.

https://x.com/tmaiaroto/status/2052650641802383456

#llm #inference #quantization #flashattention #llama

Tom Maiaroto (@tmaiaroto) on X

@ItsmeAjayKV @UnslothAI @googlegemma Ok, finally got the magic settings for E4B with Atomic's stuff. About 96 tokens/sec with 128k context window. Keep flash attention off and use -ctk f16 -ctv f16 otherwise it crashes (or did for me). I also use the 8bit assistant but Q4_K_M works too. This is from my llama-swap

X (formerly Twitter)
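
For a sense of what -ctk f16 -ctv f16 costs at a 128k context, here is a rough sketch of the usual KV cache size estimate. The layer, head, and dimension values are hypothetical placeholders chosen for illustration, not E4B's actual architecture, and models with sliding-window or shared-KV layers will need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size for a plain full-attention decoder.

    K and V each hold n_ctx * n_kv_heads * head_dim elements per layer.
    bytes_per_elem=2 corresponds to an f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions for illustration only (not taken from the post).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=128 * 1024)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

The estimate scales linearly with context length, so halving the context window halves the cache.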

sayzard 2d ago

Dan McAteer (@daniel_mac8)

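SubQ claims to be 52x faster than FlashAttention at up to 1 million tokens of context and 20x cheaper than Opus, raising hopes that it could be the biggest breakthrough since the Transformer. The claims still need independent measurement, so for now it is best read as a notable new-technology candidate in AI infrastructure and attention optimization.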

https://x.com/daniel_mac8/status/2051710659822305661

#subq #flashattention #llm #attention #aiinfrastructure

Dan McAteer (@daniel_mac8) on X

SubQ is either the biggest breakthrough since the Transformer... > 52x faster than FlashAttention at 1mm tok context > 20x cheaper than Opus ...or it's AI Theranos. Requested early access so hopefully can investigate soon.

X (formerly Twitter)

sayzard Apr 27

Sandro (@pupposandro)

Reports 89.7 tokens per second with Qwen3.6-27B at 60K context on a single RTX 3090. By merging sliding window Flash Attention and a two-phase cache into Luce DFlash, the author says it runs 3.64x faster than full attention and reaches 100% speculative acceptance.

https://x.com/pupposandro/status/2048781323301515443

#qwen #flashattention #llm #performance #optimization

Sandro (@pupposandro) on X

89.7 tok/s with Qwen3.6-27B at 60K context on a single RTX 3090. 3.64x faster than full attention, 100% speculative acceptance. Just merged sliding window flash attention + two-phase cache into Luce DFlash. FA now attends to the last 2048 KV positions instead of the full 60K,

X (formerly Twitter)
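
The idea of attending to only the last 2048 KV positions can be sketched in plain Python. This is a minimal single-query illustration of sliding window attention, not Luce DFlash's implementation; the function name `attend` and its `window` parameter are made up for this example.

```python
import math


def attend(query, keys, values, window=None):
    """Single-query softmax attention over cached keys and values.

    If `window` is set, only the most recent `window` positions are used,
    which is the sliding window variant.
    """
    if window is not None:
        if window < 1:
            raise ValueError("window must be at least 1")
        keys, values = keys[-window:], values[-window:]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


# Tiny cache of three positions; zero keys give uniform attention weights.
keys = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
print(attend([1.0, 0.0], keys, values))            # full attention: [2.0]
print(attend([1.0, 0.0], keys, values, window=2))  # last two only: [2.5]
```

With a window of 2048 over a 60K cache, each decode step reads roughly 29 times fewer KV positions, at the cost of ignoring older context in those layers.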

N-gated Hacker News Mar 12

🤔 Ah, the classic tale of a tech enthusiast playing "will-it-blend?" with TPUs and Flash Attention! 🤪 Our hero Archer FAFO (Finds A Free Option) decides to port algorithms like he's playing a game of Tetris—except it's on a free-tier #TPU in #Colab, which is basically like using a Ferrari to deliver pizza for free. 🍕🚗
https://archerzhang.me/forcing-flash-attention-onto-a-tpu #techenthusiast #FlashAttention #freeoptions #algorithmshack #HackerNews #ngated

Forcing Flash Attention onto a TPU and Learning the Hard Way · Archer Zhang

This is the fifth post in a series on LLM internals. Part 1 covered attention, Part 2 covered generation, Part 3 covered the Flash Attention algorithm, Part ...

AI Daily Post Mar 7

New benchmark shows that larger CUDA tiles can cut Flash Attention throughput by 18-43% across sequence lengths. The study dives into kernel design, TFLOPS loss, and what it means for transformer model efficiency on NVIDIA GPUs. Open-source researchers can use these insights to tune their kernels and reclaim performance. #FlashAttention #CUDATiles #GPUPerformance #TFLOPS

🔗 https://aidailypost.com/news/large-cuda-tiles-reduce-flash-attention-tflops-by-1843-across

sayzard Mar 5

Yuchen Jin (@Yuchenj_UW)

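The author mentions experimental, developer-oriented use cases, such as asking the model to write kernels for B200s that beat FlashAttention-4 or to come up with new research ideas for making NanoGPT faster, and says they will test them soon.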

https://x.com/Yuchenj_UW/status/2029642799277318503

#nanogpt #flashattention #gpu #kernels

Yuchen Jin (@Yuchenj_UW) on X

@DeryaTR_ @_overment 🫡 I have some too, like asking it to write kernels on B200s better than FlashAttention-4, or come up with new research ideas to make nanogpt faster, will test today

X (formerly Twitter)

TechLİfe Feb 16

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works

https://techlife.blog/posts/llm-inference-optimization/

#LLM #Inference #PagedAttention #vLLM #FlashAttention #SpeculativeDecoding #MachineLearning #GPUOptimization #KVCache

The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works

A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching — the clever tricks that make modern LLMs respond in milliseconds instead of minutes.

TechLife

Habr Feb 6

Triton, Flash-attention, Sage-attention, and bitsandbytes with ROCm 7 on Windows

At the end of January 2026, triton-windows 3.6.0.post25 was released, which makes it possible to use flash-attention, sage-attention (v1), and other Triton-based libraries on AMD cards with rocWMMA support on Windows. Also, although the official bitsandbytes repository has not yet accepted the PR for ROCm 7 support, it can still be built with a few small code changes, which I have already made in my fork. In this article I explain how to install all of this, and as an example we run a couple of tests in ComfyUI, including with the recent LTX-2, and build a QLoRA adapter for the Gemma 3 model.

https://habr.com/ru/articles/987672/

#triton #amd #rx7900 #sageattention #flashattention #bitsandbytes #rocm #rocm7 #comfyui #ltx2

Triton, Flash-attention, Sage-attention, and bitsandbytes with ROCm 7 on Windows

Habr

Reddit Tech VN Bot Dec 31

🖥️ Tried Qwen3‑30B (a3b VL Q4_XS) on a P40 GPU with Flash Attention. It reaches 100k context, but at around 60K it starts repeating passages and performance drops sharply. With FA turned off and the MoE weights moved to the CPU, speed drops about 5x, and the K-cache is slow at Q4/Q5. The user is looking for ways to tune the settings. #AI #LLM #Qwen30B #FlashAttention #GPU #LocalLLaMA #trí_tự_nhiên #công_nghệ

https://www.reddit.com/r/LocalLLaMA/comments/1q03z3j/p40_qwen30b_60k_context_window_ceiling_with_flash/