TurboQuant: Redefining AI efficiency with extreme compression

https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

I did not understand what PolarQuant is.

Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?

1. Efficiently and recursively transform KV embeddings into polar coordinates.
2. Quantize the resulting angles without any explicit normalization. This saves memory thanks to a key insight: the angles follow a distribution that has an analytical form.
Reminds me vaguely of the Burrows-Wheeler transform in bzip2.
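A toy NumPy sketch of that reading (the adjacent-component pairing, the bit width, and the function names are my assumptions, not the paper's actual recursive algorithm):

```python
import numpy as np

def pairs_to_polar(v):
    # Pair up adjacent components and convert each (x, y) pair
    # to a magnitude and an angle in [-pi, pi].
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)

def quantize_angles(theta, bits=4):
    # Uniformly quantize angles. No per-vector normalization is
    # needed because angles already live in a fixed range.
    levels = 2 ** bits
    step = 2 * np.pi / levels
    idx = np.round((theta + np.pi) / step).astype(int) % levels
    return idx, idx * step - np.pi  # integer codes, dequantized angles

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
mag, theta = pairs_to_polar(v)
codes, theta_hat = quantize_angles(theta, bits=4)

# Reconstruct and measure the error introduced by angle quantization.
v_hat = np.empty_like(v)
v_hat[0::2] = mag * np.cos(theta_hat)
v_hat[1::2] = mag * np.sin(theta_hat)
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```

With 4-bit angles the worst-case angular error is π/16, so the relative reconstruction error stays below ~0.2 regardless of the input.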
TurboQuant Animated: Watch Vector Quantization Happen

Interactive 2D and 3D animations showing every step of TurboQuant: normalize, rotate, quantize, reconstruct. Add your own points and see the compression error in real time.
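Those four animated steps can be sketched as a pipeline (a sketch under my assumptions: a random orthogonal rotation and a uniform scalar quantizer, which may differ from TurboQuant's actual choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(d):
    # QR of a Gaussian matrix gives a random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x, bits=4):
    # Uniform scalar quantizer on [-1, 1].
    levels = 2 ** bits - 1
    x = np.clip(x, -1.0, 1.0)
    return np.round((x + 1) / 2 * levels) / levels * 2 - 1

d = 16
v = rng.standard_normal(d)

# 1. normalize, 2. rotate, 3. quantize, 4. reconstruct
norm = np.linalg.norm(v)
u = v / norm                  # normalize onto the unit sphere
R = random_rotation(d)
q = quantize(R @ u, bits=4)   # quantize the rotated coordinates
v_hat = norm * (R.T @ q)      # undo the rotation and rescale

print("relative error:", np.linalg.norm(v - v_hat) / norm)
```

The rotation is what makes the quantizer behave: it spreads the vector's energy evenly across coordinates, so a simple per-coordinate grid works well.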

I like the visualization, but I don’t understand the grid quantization. If every point is on the unit circle, aren’t all the center grid coordinates unused?
I think the grid can be the surface of the unit sphere.
The way I understand it, it's a way of compressing vectors by switching from their per-component representation to a polar-coordinate representation, where nearby vectors are clumped together onto a single line, allowing them to be described by different lengths along that line.
This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.

I think it is, though:

“TurboQuant, QJL, and PolarQuant are more than just practical engineering solutions; they’re fundamental algorithmic contributions backed by strong theoretical proofs. These methods don't just work well in real-world applications; they are provably efficient and operate near theoretical lower bounds.”

Maybe they quantized the model parameters a bit too much...
I also instinctively reacted to that fragment, but at this point I think this is overreacting to a single expression. It's not just a normal thing to say in English, it's something people have been saying for a long time before LLMs existed.

There are tells all over the page:

> Redefining AI efficiency with extreme compression

"Redefine" is a favorite word of AI. Honestly no need to read further.

> the key-value cache, a high-speed "digital cheat sheet" that stores frequently used information under simple labels

No competent engineer would describe a cache as a "cheat sheet". Cheat sheets are static, but caches dynamically update during execution. Students don't rewrite their cheat sheets during the test, do they? LLMs love their inaccurate metaphors.

> QJL: The zero-overhead, 1-bit trick

> It reduces each resulting vector number to a single sign bit (+1 or -1). This algorithm essentially creates a high-speed shorthand that requires zero memory overhead.

Why does it keep emphasizing zero overhead? Why is storing a single bit a "trick?" Either there's currently an epidemic of algorithms that use more than one bit to store a bit, or the AI is shoving in extra plausible-sounding words to pad things out. You decide which is more likely.

It's 1:30am and I can't sleep, and I still regret wasting my time on this slop.
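Prose quality aside, the mechanism the quote describes is concrete: project with a random (JL-style) matrix, then keep only the sign of each projected coordinate. A toy sketch of that classic sign-random-projection idea (a SimHash-style illustration with my own naming, not the paper's code or exact estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 256  # original dimension, projected dimension

# JL-style random projection: each row is a random Gaussian direction.
S = rng.standard_normal((m, d))

def sign_sketch(x):
    # Keep only the sign of each projected coordinate: 1 bit each.
    return np.sign(S @ x)

# Angles are approximately preserved through the sign bits: the
# fraction of matching signs encodes the angle between two vectors.
x, y = rng.standard_normal(d), rng.standard_normal(d)
bx, by = sign_sketch(x), sign_sketch(y)
match = np.mean(bx == by)
angle_est = np.pi * (1 - match)   # sign-agreement -> angle estimate
angle_true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(angle_est, angle_true)
```

So "1-bit" is doing real work here: you trade exact coordinates for many cheap sign bits whose agreement rate still recovers geometry.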

There is also the possibility that the article went through the hands of the company's communications department, whose writers probably write at LLM level.
Looks like Google canned all their tech writers just to pivot the budget into H100s for training these very same writers
It is AI generated. Or it was written by someone a bit far from the technical advances, IMHO. The Johnson-Lindenstrauss lemma is a very specific and powerful concept, whereas the article's QJL explanation is vacuous. A knowledgeable human would not have left the reader wondering how it relates to the lemma.
This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
DRIVE: One-bit Distributed Mean Estimation
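For context, the rotate-then-sign-then-rescale idea reads roughly like this (a sketch from my understanding of the abstract; the least-squares scale below is standard math, but it may differ from DRIVE's exact estimator and bias-correction mechanism):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

d = 128
x = rng.standard_normal(d)
R = random_rotation(d)

z = R @ x
bits = np.sign(z)             # one bit per coordinate
scale = np.abs(z).sum() / d   # least-squares scale for a sign vector

x_hat = R.T @ (scale * bits)  # reconstruct: unrotate the scaled signs
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

The rotation makes the coordinates of z behave like i.i.d. Gaussians, which is exactly what lets a single shared scale plus sign bits do so well.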

Sounds like Multi-Head Latent Attention (MLA) from DeepSeek
Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that’s why the paper reports accuracy-vs-space but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. “Several orders of magnitude” speedups for vector search? Really? Has anyone actually reproduced these results?