Linoy Tsaban (@linoy_tsaban)

Announces that the editing model FLUX.2 [klein] 9B has been updated to a new version, 9B-KV, which applies KV-cache optimization to cut computation and speed up inference by up to 2.5x for multi-reference editing. The author particularly praises how naturally it edits around the bullets.

https://x.com/linoy_tsaban/status/2032133741175611408

#flux2 #kvcache #model #editing #llm

Linoy Tsaban (@linoy_tsaban) on X

My favorite editing model, FLUX.2 [klein] 9B, just got 2x faster: Meet FLUX.2 [klein] 9B-KV 😍💨 > Using KV-Cache Optimization to reduce computation & speed up inference by up to 2.5 times for multi-reference editing love how well it edits "around" the bullets

X (formerly Twitter)
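
The tweet doesn't say how the KV-cache optimization works internally. As a rough sketch of the general idea (not FLUX's actual implementation), the key/value projections of the fixed reference-image tokens can be computed once and reused across denoising steps, so only the changing latent tokens are re-projected; all shapes and step counts below are made-up assumptions.

```python
# Minimal sketch (not FLUX's actual code): cache the K/V projections of the
# fixed reference-image tokens once instead of recomputing them at every
# denoising step. Names, shapes and step counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # head dimension (assumed)
W_k, W_v = rng.standard_normal((2, d, d))

def project_kv(tokens):
    """Project tokens to keys/values -- the part worth caching."""
    return tokens @ W_k, tokens @ W_v

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

ref_tokens = rng.standard_normal((2 * 1024, d))   # two reference images (assumed)
ref_k, ref_v = project_kv(ref_tokens)             # computed once, reused every step

for step in range(28):                            # denoising steps (assumed count)
    latent = rng.standard_normal((1024, d))       # stand-in for the current latent tokens
    lat_k, lat_v = project_kv(latent)             # only the changing tokens are re-projected
    k = np.concatenate([ref_k, lat_k])
    v = np.concatenate([ref_v, lat_v])
    out = attention(latent, k, v)
```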

Awni Hannun (@awnihannun)

A tweet arguing that "a long KV cache with sparse lookup (similar to DSA)" strikes the right balance for the Transformer architecture. It explains that memory grows linearly with tokens (good for long-term memory and in-context learning) while compute stays (almost) linear. An architecture-optimization proposal.

https://x.com/awnihannun/status/2024580405844914184

#transformer #kvcache #sparseattention #incontextlearning

Awni Hannun (@awnihannun) on X

A long KV cache with sparse lookup (kind of like DSA) strikes me as the right balance for a Transformer. - Memory is not fixed but scales linearly with tokens (which is good for remembering things + in-context learning) - Compute is (almost) linear rather than quadratic

X (formerly Twitter)
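
A minimal sketch of the trade-off being described, assuming the "sparse lookup" means something like top-k retrieval over the cached keys (loosely DSA-flavoured, not the actual DSA algorithm): the cache grows linearly with tokens, while each new token attends to only a fixed-size subset of it.

```python
# Sketch of sparse lookup over a growing KV cache (illustrative, not DSA itself):
# memory grows linearly with tokens, but each query attends only to the top-k
# most relevant cached entries, so the attention itself stays O(k) per token.
import numpy as np

rng = np.random.default_rng(0)
d, k_top = 64, 128

keys, values = [], []                        # the linearly growing cache

def sparse_attend(q):
    K = np.stack(keys)                       # (n, d)
    V = np.stack(values)
    scores = K @ q / np.sqrt(d)              # (n,) relevance of every cached key
    idx = np.argpartition(scores, -min(k_top, len(scores)))[-k_top:]  # top-k lookup
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ V[idx]

for t in range(1000):                        # toy decode loop
    x = rng.standard_normal(d)               # stand-in for the new token's hidden state
    keys.append(x); values.append(x)         # memory: O(tokens)
    out = sparse_attend(x)                   # dense scoring here is the part real systems
                                             # replace with a cheap index to stay near-linear
```
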
The Hidden Engineering Behind Fast AI: How LLM Inference Actually Works

A deep dive into PagedAttention, speculative decoding, FlashAttention, and continuous batching — the clever tricks that make modern LLMs respond in milliseconds instead of minutes.

TechLife
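
Of the techniques the article lists, PagedAttention is the most KV-cache-specific. A toy sketch of its core idea follows: a per-sequence block table maps logical token positions to fixed-size physical blocks, so cache memory is allocated on demand and reclaimed when a request finishes. This is an illustration, not vLLM's actual code.

```python
# Toy block-table allocator in the spirit of PagedAttention (not vLLM's code):
# KV entries live in fixed-size physical blocks; a per-sequence block table maps
# logical token positions to (block, offset), so memory is allocated on demand.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block pool
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.lengths = {}                            # seq_id -> tokens written

    def append(self, seq_id, kv_entry):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # current block full: grab a new one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        block, offset = table[n // BLOCK_SIZE], n % BLOCK_SIZE
        return block, offset                         # where kv_entry would be stored

    def free(self, seq_id):                          # finished sequence returns its blocks
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for t in range(40):                                  # two sequences decoding concurrently
    cache.append("seq-a", kv_entry=None)
    cache.append("seq-b", kv_entry=None)
cache.free("seq-a")                                  # seq-a's blocks become reusable
```
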
Nvidia introduces Dynamic Memory Sparsification. The technique reduces the KV cache by a factor of eight by dynamically removing unimportant tokens during inference. According to the paper with the University of Edinburgh, accuracy is preserved while the hardware requirements for long contexts drop massively. First implementations already exist. #Nvidia #DMS #KVCache
https://www.all-ai.de/news/news26/nvidia-speicher-8x
Nvidia significantly reduces AI memory requirements with its new DMS technique

The dynamic compression shrinks the KV cache by a factor of eight while model accuracy stays the same.

All-AI.de
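
The blurb doesn't describe how "unimportant" tokens are identified. A generic eviction sketch in the same spirit (prune the cache to a fixed budget by dropping the lowest-scoring entries; the scoring here is a placeholder, not the DMS method) looks roughly like this:

```python
# Generic KV-cache eviction sketch (illustrative; not NVIDIA's DMS algorithm):
# keep the cache within a fixed budget by dropping the entries with the lowest
# importance score, e.g. accumulated attention weight received so far.
import numpy as np

def evict_to_budget(keys, values, importance, budget):
    """keys/values: (n, d) arrays; importance: (n,) scores; keep the top `budget`."""
    if len(importance) <= budget:
        return keys, values, importance
    keep = np.argsort(importance)[-budget:]          # indices of the most important tokens
    keep.sort()                                      # preserve original token order
    return keys[keep], values[keep], importance[keep]

rng = np.random.default_rng(0)
n, d, budget = 4096, 64, 512                         # 8x reduction, matching the claimed factor
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
importance = rng.random(n)                           # stand-in for accumulated attention mass
keys, values, importance = evict_to_budget(keys, values, importance, budget)
print(keys.shape)                                    # (512, 64)
```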

Where the money spent on neural networks goes, and why

A fact little known among ordinary people: neural networks have no "conversations" at all. The "dialogue" you see in the web interface is a deception, a pretty magic trick. Every time you write a new message, all the old messages are processed again from scratch. For neural networks, truly reusable tasks do not exist. If the result came out slightly different, the web interface simply would not show you the changed messages; otherwise the user would feel like they were in a madhouse, with the AI constantly gaslighting them by silently rewriting old answers. In practice, the conversation history in AI chats is pinned down one way or another. And that would cost a fortune. Interesting.

https://habr.com/ru/companies/bar/articles/991126/

#LLM #transformer #attention #KVcache #inference #GPU #CUDA #ChatGPT #Claude #tokens

Where the money spent on neural networks goes, and why

A fact little known among ordinary people: neural networks have no "conversations" at all. The "dialogue" you see in the web interface is a deception, a pretty magic trick. Every time you write a new message, all...

Habr
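
The economic point in the teaser can be made concrete with a bit of arithmetic: if every new message pushes the whole history through the model again, the total number of prompt tokens processed grows roughly quadratically with the number of turns. The token counts below are made up for illustration.

```python
# Back-of-the-envelope sketch of why stateless chat turns get expensive:
# each turn re-sends (and re-prefills) the whole history, so total prompt
# tokens grow roughly quadratically with the number of turns. Numbers are made up.
tokens_per_message = 300        # assumed average size of one user+assistant exchange
turns = 50

total_prompt_tokens = 0
history = 0
for turn in range(1, turns + 1):
    history += tokens_per_message          # conversation so far
    total_prompt_tokens += history         # the whole history is processed again this turn

print(total_prompt_tokens)                 # 382_500 tokens prefilled for a 15_000-token chat
```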

𝗭𝗲𝗻 𝗠𝗮𝗴𝗻𝗲𝘁𝘀 (@ZenMagnets)

Shares the discovery of a simple workaround for GLM-4.7-Flash's huge ("FATASS") KV cache problem. A one-line vllm fix that turns on MLA reportedly fits a 200k context in about 10GB instead of 180GB, making it possible to run GLM-4.7-Flash-NVFP4 with the full 200k context on a single 32GB 5090 GPU. Recommends using MLA as @Zai_org intended.

https://x.com/ZenMagnets/status/2013838570059170117

#glm4.7flash #vllm #kvcache #mla #gpu

𝗭𝗲𝗻 𝗠𝗮𝗴𝗻𝗲𝘁𝘀 (@ZenMagnets) on X

Easy Workaround for FATASS KV Cache on GLM-4.7-Flash found. One line vllm fix to make 200k context fit in 10gb instead of 180gb by turning on MLA as @Zai_org intended. This means a single 32gb 5090 can run GLM-4.7-Flash-NVFP4 with full 200k context! (if you're not also

X (formerly Twitter)
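
Why MLA shrinks the cache this dramatically comes down to what gets stored per token: full attention caches K and V for every KV head, while MLA caches one small compressed latent per layer. The dimensions below are assumptions chosen to land in the same ballpark as the quoted 180GB vs ~10GB figures, not GLM-4.7-Flash's real configuration.

```python
# Why MLA shrinks the KV cache (illustrative arithmetic; the dimensions below are
# assumptions, not GLM-4.7-Flash's real configuration).
def kv_bytes_per_token(layers, entries_per_layer, dtype_bytes=2):
    return layers * entries_per_layer * dtype_bytes

layers, kv_heads, head_dim = 48, 32, 128
latent_dim = 512                                   # MLA's compressed KV latent (assumed)

standard = kv_bytes_per_token(layers, 2 * kv_heads * head_dim)   # K and V for every head
mla      = kv_bytes_per_token(layers, latent_dim)                # one latent per layer

ctx = 200_000
print(standard * ctx / 1e9, "GB without MLA")      # ~157 GB with these made-up dims
print(mla * ctx / 1e9, "GB with MLA")              # ~9.8 GB
```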

GLM-4-32B-0414 stands out for having only **2 KV heads**, which saves a significant amount of KV-cache memory thanks to GQA. Unfortunately, GLM-4.7-Flash dropped this design, reducing the memory-optimization benefit. #AI #LLM #GLM #KVCache #GQA #ArtificialIntelligence #LanguageModel #AIoptimization

https://www.reddit.com/r/LocalLLaMA/comments/1qiphdr/two_heads_is_all_i_need/
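
The saving follows directly from the KV-head count, since KV-cache size scales linearly with it under GQA. A back-of-the-envelope comparison (layer count, head size and context length are assumptions; only the 2-KV-head figure comes from the post):

```python
# KV-cache size scales linearly with the number of KV heads under GQA (sketch;
# layers, head_dim and ctx below are assumptions, only the "2 KV heads" figure
# comes from the post).
def kv_cache_gb(kv_heads, layers=61, head_dim=128, ctx=128_000, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * ctx / 1e9   # K and V

print(kv_cache_gb(kv_heads=32))   # ~128 GB with MHA-style per-head KV
print(kv_cache_gb(kv_heads=2))    # ~8 GB with only 2 KV heads
```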

The MI455X has 24 LPDDR5X modules on board.

The MI455X carries 24 LPDDR5X modules, which are expected to be used for the KV cache. Peak bandwidth is estimated at 1.63 TB/s, and it is mentioned in comparison with AMD's Venice accelerator.

https://news.hada.io/topic?id=25990

#lpddr5x #kvcache #amd #mi455x #memory

The MI455X has 24 LPDDR5X modules on board.

1. They are reportedly going to be used for the KV cache. 2. The specs are guesswork, but since Samsung has promoted an 8533 64GB LPDDR5X part, a peak of up to 1.63...

GeekNews
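
The quoted 1.63 TB/s figure is consistent with a simple estimate, assuming each LPDDR5X package exposes a 64-bit interface at 8533 MT/s (the excerpt itself flags the specs as guesswork):

```python
# Where the ~1.63 TB/s estimate can come from (speculative, mirroring the post's
# own caveat): 24 LPDDR5X packages, each assumed to expose a 64-bit interface
# at 8533 MT/s.
modules = 24
bus_bits_per_module = 64      # assumption: x64 LPDDR5X package
mt_per_s = 8533e6             # transfers per second per pin

bandwidth_bytes = modules * bus_bits_per_module / 8 * mt_per_s
print(bandwidth_bytes / 1e12) # ~1.64 TB/s, matching the quoted 1.63 TB/s estimate

capacity_gb = modules * 64    # if each package is the 64GB part mentioned in the post
print(capacity_gb)            # 1536 GB of KV-cache capacity
```
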
NVIDIA’s new Inference Context Memory Storage Platform reshapes AI inference by treating KV cache as a multi-tier memory hierarchy—from HBM to NVMe SSD. This enables longer context windows, persistent reasoning, and scalable multi-agent inference while keeping hot data in GPU memory and offloading cold context to SSD.
https://www.buysellram.com/blog/nvidia-unveils-the-inference-context-memory-storage-platform/
#NVIDIA #Rubin #AI #Inference #LLM #AIInfrastructure #MemoryHierarchy #HBM #NVMe #DPU #BlueField4 #AIHardware #GPU #DRAM #KVCache #DataCenter #tech
NVIDIA Unveils the Inference Context Memory Storage Platform — A New Era for Long-Context AI

NVIDIA’s Inference Context Memory Storage Platform redefines AI memory architecture, enabling long-context inference with HBM4, BlueField-4 DPUs, and Spectrum-X networking. Learn how this shift impacts GPU and DRAM markets.

BuySellRam
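
The announcement describes a tiering policy more than an API. A generic hot/cold tiering sketch (keep recently used KV blocks in the fast tier, spill the coldest ones to larger, slower tiers, promote on access) conveys the shape of it; this is an illustration, not NVIDIA's platform.

```python
# Generic multi-tier KV-cache sketch (illustrative only, not NVIDIA's platform):
# recently touched blocks stay in the small fast tier ("HBM"); least-recently-used
# blocks are demoted to larger, slower tiers ("DRAM", "SSD") and promoted on access.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, capacities):                  # e.g. {"HBM": 4, "DRAM": 16, "SSD": 1_000}
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, block_id, kv_block):
        self._insert(0, block_id, kv_block)

    def get(self, block_id):
        for level, (_, _, store) in enumerate(self.tiers):
            if block_id in store:
                kv_block = store.pop(block_id)
                self._insert(0, block_id, kv_block)   # promote hot block back to the top tier
                return kv_block
        raise KeyError(block_id)

    def _insert(self, level, block_id, kv_block):
        name, cap, store = self.tiers[level]
        store[block_id] = kv_block
        store.move_to_end(block_id)
        if len(store) > cap:                          # demote the coldest block down one tier
            victim, victim_block = store.popitem(last=False)
            self._insert(level + 1, victim, victim_block)

cache = TieredKVCache({"HBM": 4, "DRAM": 16, "SSD": 1_000})
for i in range(40):
    cache.put(i, kv_block=f"kv-{i}")
cache.get(0)                                          # old context comes from a cold tier and is promoted
```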