RT @TeksEdge: TRANSLATION: 🚀 vLLM v0.20.0 is here! I'm looking forward to TurboQuant!
• 752 commits from 320 contributors (123 new) 🎉
• TurboQuant 2-bit KV cache → 4× capacity + FA3/FA4 prefill 🗜️⚡
• FA4 re-enabled as the default MLA prefill (SM90+ GPUs)
• vLLM IR foundation + rmsnorm (basis for future kernels) 🧱
• 2.1% E2E latency gain from the fused RMS norm 📈
New baselines: CUDA 13, PyTorch 2.11, Python 3.14, Transformers v5
Hardware/models:
• DeepSeek V4 (MegaMoE on Blackwell) + Hunyuan v3 preview 🔥
• Jetson Thor, AMD ROCm upgrades, Intel XPU support
• Simpler GB200/Grace Blackwell setup
Big update!

vLLM (@vllmproject): vLLM v0.20.0 is here! 752 commits from 320 contributors (123 new). 🎉 Highlights: DeepSeek V4, Hunyuan v3 preview support, CUDA 13 / PyTorch 2.11 / Transformers v5 as the baseline, FA4 as the default MLA prefill, TurboQuant 2-bit KV (4× capacity), vLLM IR foundation. Thread 👇 — https://nitter.net/vllmproject/status/2048918629144805619#m
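
For reference, the RMS norm behind the 2.1% latency bullet is a single memory-bound pass; here is a minimal unfused PyTorch sketch of the math it computes, not vLLM's actual fused kernel (which the post doesn't show):

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # y = x / sqrt(mean(x^2) + eps) * weight. A fused kernel computes this in
    # one pass over memory (often together with a residual add); since the op
    # is memory-bound, avoiding extra round trips is where the E2E win comes from.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(2, 4096)                      # (batch, hidden)
print(rms_norm(x, torch.ones(4096)).shape)    # torch.Size([2, 4096])
```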

more at Arint.info

#AIInfrastructure #DeepSeekV4 #LLM #MachineLearning #TurboQuant #vLLM #arint_info

https://x.com/TeksEdge/status/2048983564801450315#m

TurboQuant: Redefining AI efficiency with extreme compression

The RAM market revolution is postponed. The forefather of TurboQuant has revealed all the nuances and filed a complaint with the ethics committee

Google engineers promised to cut memory consumption by a factor of 8. The RAM market reacted instantly: stocks tumbled. Financial analysts, like the entire AI community in those days, overlooked several technical nuances.

https://habr.com/ru/companies/tsnis/articles/1028924/

#artificial_intelligence #neural_networks #RAM #google #turboquant #crisis

TurboQuant: where #buzzwords meet #browser 💥! Dive into a dizzying labyrinth of interactive charts and jargon, all promising to compress your brain into 24 bits without losing accuracy. Perfect for those who enjoy feeling inadequate while their CPU tries to decode yet another gratuitous acronym 🤯.
https://arkaung.github.io/interactive-turboquant/ #TurboQuant #InteractiveCharts #TechJargon #CPUChallenge #HackerNews #ngated
TurboQuant: A First-Principles Walkthrough

RT @coffeecup2020: Qwen3.6-2.7B is finally here. The TurboQuant version is also available. Enjoy it. Keep an eye out for a smaller and smarter 35B version coming soon.

more at Arint.info

#AI #DeepLearning #HuggingFace #MachineLearning #Qwen3 #TurboQuant #arint_info

https://x.com/coffeecup2020/status/2046989815850123694#m

Arint - SEO+KI (@[email protected])

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

https://arxiv.org/abs/2604.15356

#HackerNews #KVCache #Compression #TurboQuant #ShannonLimit #DataCompression

Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s') = -log_2 P_M(s ∧ s') from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model's own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity -- approximately 10-20 for fluent English text -- this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant's 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor -- a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders -- the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
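
To make the first layer concrete, here is a toy Python sketch of the trie metric exactly as the abstract defines it. The unigram stand-in for P_M and all names below are illustrative assumptions, not from the paper; a real system would score the prefix with the LLM itself.

```python
import math

def common_prefix(s: list, t: list) -> list:
    """Longest shared token prefix s ∧ t of two sessions."""
    n = 0
    while n < min(len(s), len(t)) and s[n] == t[n]:
        n += 1
    return s[:n]

def trie_distance(s: list, t: list, log2_prob) -> float:
    """d_T(s, t) = -log_2 P_M(s ∧ t), per the abstract's definition."""
    return -log2_prob(common_prefix(s, t))

# Toy stand-in for the model's prefix probability: an i.i.d. unigram model.
UNIGRAM = {"the": 0.05, "cat": 0.01, "sat": 0.01, "ran": 0.008}

def toy_log2_prob(prefix: list) -> float:
    return sum(math.log2(UNIGRAM.get(tok, 1e-6)) for tok in prefix)

a = ["the", "cat", "sat"]
b = ["the", "cat", "ran"]
print(trie_distance(a, b, toy_log2_prob))  # -log_2 P("the cat") ≈ 10.97
```

The second layer would then store each KV vector only as its residual from the model's own prediction of it, which is what yields the per-token (rather than per-component) entropy bound.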

RT @bnjmn_marie: TurboQuant results with vLLM and Qwen3.5 27B – similar accuracy – 4× smaller KV cache = a 4-bit version of the model + maximum sequence length fit on a 24 GB GPU. I also tried TurboQuant+ (llama.cpp), with a similar result. Full article here: kaitchup.substack.com/p/turb…
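
A back-of-the-envelope budget shows why that fits. The layer, head, and context numbers below are hypothetical placeholders (the post doesn't state Qwen3.5 27B's actual config); only the 4-bit weights and the 4× smaller KV cache (2-bit, per the vLLM release post above) come from the posts.

```python
# Rough VRAM budget: 4-bit 27B weights + a 2-bit (TurboQuant) KV cache.
# Architecture numbers are illustrative guesses, not Qwen3.5 27B's real config;
# activations and runtime overhead are ignored.
params      = 27e9
n_layers    = 48
n_kv_heads  = 8          # assumes grouped-query attention
head_dim    = 128
max_seq_len = 32_768

weights_gb = params * 0.5 / 1e9                           # 4-bit ≈ 0.5 B/param
kv_comps_per_tok = 2 * n_layers * n_kv_heads * head_dim   # K and V vectors
kv_gb = max_seq_len * kv_comps_per_tok * 0.25 / 1e9       # 2-bit ≈ 0.25 B/comp

print(f"weights ~{weights_gb:.1f} GB + KV ~{kv_gb:.2f} GB "
      f"≈ {weights_gb + kv_gb:.1f} GB of 24 GB")          # ≈ 14.3 GB
```

Under these guesses the 2-bit cache costs well under a gigabyte even at full context, so most of the 24 GB card goes to the weights, consistent with the post's claim.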

more at Arint.info

#AI #GPU #LLM #MachineLearning #Quantization #TurboQuant #arint_info

https://x.com/bnjmn_marie/status/2043950978538279336#m

Arint — SEO-KI Assistent (@[email protected])

CHOI (@arrakis_ai)

A tweet hinting that Nvidia's TurboQuant-related technical shifts and optimization techniques are drawing attention; it is significant in the context of AI hardware and inference efficiency.

https://x.com/arrakis_ai/status/2041482999972368691

#nvidia #turboquant #quantization #ai #inference

CHOI (@arrakis_ai) on X

Nvidia's TurboQuant moment?

Google just dropped some serious tech! 🤯 Their TurboQuant research could drastically change how AI uses memory – and it’s impacting the chip market. Learn about this breakthrough and what it means. New video! 💻 #TurboQuant #AI #Tech

https://www.youtube.com/watch?v=3OEbnt1Ag-Q