KV-Cache в LLM: разбираем инференс через 9 ключевых вопросов

Почему Cache Read и Cache Write стоят денег и как работает Prompt Caching? Разбираем KV-Cache через 9 ключевых вопросов. Разобраться

https://habr.com/ru/articles/1021832/

#машинное_обучение #машинное_обучение_нейросети #llm #gpu #transformers #kvcache #prompt_caching #attention #vllm #prefix_caching

KV-Cache в LLM: разбираем инференс через 9 ключевых вопросов

Не так давно лимиты на использование Claude Code резко уменьшились, и люди стали лучше считать свои токены. Я не стал исключением, поэтому первым делом собрал информацию по использованию токенов в...

Хабр

Prince Canuma (@Prince_Canuma)

MLX에 TriAttention을 구현한 결과가 공개됐습니다. Gemma-4-31B-IT에서 BF16 기준 6만 토큰까지 KV 캐시를 최대 81% 압축할 수 있다고 합니다. TurboQuant처럼 KV 값을 양자화하는 대신, TriAttention은 중요도가 낮은 토큰을 아예 제거하는 방식입니다.

https://x.com/Prince_Canuma/status/2042021304270819394

#mlx #triaattention #gemma #kvcache #quantization

Prince Canuma (@Prince_Canuma) on X

Just implemented TriAttention in MLX and the results are wild! You can get up to 81% KV compression at 60K tokens for Gemma-4-31B-IT in BF16 🔥 Unlike TurboQuant, which quantizes KV cache values, TriAttention prunes low-importance tokens entirely by scoring keys using

X (formerly Twitter)
The $100 Billion Selloff Started by a 12-Page PDF From Two Google Researchers

What the headlines missed, what six dev teams found, and the 4-bit config that actually works.

Medium

From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

https://news.future-shock.ai/the-weight-of-remembering/

#HackerNews #LLMarchitectures #KVcache #AIoptimization #technews

The Weight of Remembering

How the KV cache gives every AI conversation a physical weight in silicon, and what happens when the memory runs out.

Future Shock Newsletter

The key takeaway isn’t just compression—it’s where the bottleneck shifts. KV cache has been dominating memory footprint in long-context inference, so reducing it changes the cost structure significantly. But it doesn’t remove the constraint entirely.

https://www.buysellram.com/blog/will-googles-turboquant-ai-compression-finally-demolish-the-ai-memory-wall/

#AI #ArtificialIntelligence #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #LLMInference #AIInfrastructure #MemoryBottleneck #ModelEfficiency #AIHardware #DataCenter

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?

Will TurboQuant end the HBM shortage? Explore Google’s 6x KV cache compression, the Jevons Paradox, and how to manage GPU assets as the AI Memory Wall moves.

BuySellRam

The key takeaway isn’t just compression—it’s where the bottleneck shifts. KV cache has been dominating memory footprint in long-context inference, so reducing it changes the cost structure significantly. But it doesn’t remove the constraint entirely:
https://www.buysellram.com/blog/will-googles-turboquant-ai-compression-finally-demolish-the-ai-memory-wall/

#AI #ArtificialIntelligence #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #LLMInference #AIInfrastructure #MemoryBottleneck #ModelEfficiency #AIHardware #DataCenter #technology

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?

Will TurboQuant end the HBM shortage? Explore Google’s 6x KV cache compression, the Jevons Paradox, and how to manage GPU assets as the AI Memory Wall moves.

BuySellRam

The AI world is buzzing over TurboQuant, Google Research’s new answer to the AI Memory Wall. This isn't just an incremental update; it’s a fundamental shift in how we think about hardware efficiency.

By combining two new methods—PolarQuant and QJL—Google has managed to compress the Key-Value (KV) cache by 6x with zero accuracy loss. For those running H100s, this translates to an 8x speedup in attention processing.

Why it matters:

Beyond Brute Force: Much like DeepSeek-R1, Google is proving that high-level math can bypass the need for endless HBM expansion.

The "Memory Wall" Pivot: TurboQuant moves the bottleneck from memory bandwidth to compute, effectively "stretching" the life of existing silicon.

The Jevons Paradox: History shows that when we make a resource (memory) 6x more efficient, we don't use less of it—we build models 10x larger.

Is this the end of the global DRAM shortage, or just the beginning of a much larger scaling era?

https://www.buysellram.com/blog/will-googles-turboquant-ai-compression-finally-demolish-the-ai-memory-wall/

#AI #ArtificialIntelligence #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #LLMInference #AIInfrastructure #MemoryBottleneck #ModelEfficiency #AIHardware #DataCenter #deepseek #technology

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?

Will TurboQuant end the HBM shortage? Explore Google’s 6x KV cache compression, the Jevons Paradox, and how to manage GPU assets as the AI Memory Wall moves.

BuySellRam

The AI world is buzzing over TurboQuant, Google Research’s new answer to the AI Memory Wall. This isn't just an incremental update; it’s a fundamental shift in how we think about hardware efficiency.

By combining two new methods—PolarQuant and QJL—Google has managed to compress the Key-Value (KV) cache by 6x with zero accuracy loss. For those running H100s, this translates to an 8x speedup in attention processing.

Why it matters:

Beyond Brute Force: Much like DeepSeek-R1, Google is proving that high-level math can bypass the need for endless HBM expansion.

The "Memory Wall" Pivot: TurboQuant moves the bottleneck from memory bandwidth to compute, effectively "stretching" the life of existing silicon.

The Jevons Paradox: History shows that when we make a resource (memory) 6x more efficient, we don't use less of it—we build models 10x larger.

Is this the end of the global DRAM shortage, or just the beginning of a much larger scaling era?

#AI #ArtificialIntelligence #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #LLMInference #AIInfrastructure #MemoryBottleneck #ModelEfficiency #AIHardware #DataCenter #tech

Google’s TurboQuant is being positioned as a breakthrough that could finally break the AI “memory wall”—but the reality is more nuanced.

In this analysis, we explore how TurboQuant achieves up to 6× memory reduction and 8× performance gains by compressing KV cache during inference, enabling more efficient use of existing GPUs like A100 and H100.

The upside is clear: lower infrastructure costs, extended hardware lifecycles, and the potential to run long-context AI workloads on more affordable systems. However, compression is not a silver bullet. The compute overhead of decompression, the persistent weight memory requirements, and the long-term effects of the Jevons Paradox suggest that demand for high-performance hardware is far from over.

https://www.buysellram.com/blog/will-googles-turboquant-ai-compression-finally-demolish-the-ai-memory-wall/

#AI #ArtificialIntelligence #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #LLMInference #AIInfrastructure #MemoryBottleneck #ModelEfficiency #AIHardware #DataCenter #tech

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?

Will TurboQuant end the HBM shortage? Explore Google’s 6x KV cache compression, the Jevons Paradox, and how to manage GPU assets as the AI Memory Wall moves.

BuySellRam

Google’s TurboQuant is being positioned as a breakthrough that could finally break the AI “memory wall”—but the reality is more nuanced.
In this analysis, we explore how TurboQuant achieves up to 6× memory reduction and 8× performance gains by compressing KV cache during inference, enabling more efficient use of existing GPUs like A100 and H100.
https://www.buysellram.com/blog/will-googles-turboquant-ai-compression-finally-demolish-the-ai-memory-wall/

#AI #TurboQuant #Google #AIMemoryWall #AICompression #KVCache #ModelEfficiency #AIHardware #DataCenter #technology

Will Google's TurboQuant AI Compression Finally Demolish the AI Memory Wall?

Will TurboQuant end the HBM shortage? Explore Google’s 6x KV cache compression, the Jevons Paradox, and how to manage GPU assets as the AI Memory Wall moves.

BuySellRam