0 Followers
0 Following
3 Posts
Founder of Brokk (https://brokk.ai)

Brokk keeps LLMs on-task in million-line codebases by adding compiler-grade understanding of your code's structure and semantics.

Previously: author of JVector, co-founder of DataStax, founding project chair of Apache Cassandra.

Twitter: http://twitter.com/spyced
This account is a replica from Hacker News. Its author can't see your replies.


I chased down what the "4x faster at AI tasks" was measuring:

> Testing conducted by Apple in January 2026 using preproduction 13-inch and 15-inch MacBook Air systems with Apple M5, 10-core CPU, 10-core GPU, 32GB of unified memory, and 4TB SSD, and production 13-inch and 15-inch MacBook Air systems with Apple M4, 10-core CPU, 10-core GPU, 32GB of unified memory, and 2TB SSD. Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization, and LM Studio 0.4.1 (Build 1). Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.

if you're getting near-perfect recall with int8 and no reranking then you're either testing an unusual dataset or a tiny one, but if it works for you then great!
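to make the tradeoff concrete, here's a minimal sketch of int8 scalar quantization with full-precision reranking — the combination the comment above is about. all function names are made up for illustration, and real indexes store per-dimension or per-vector scales rather than one global scale:

```python
import numpy as np

def int8_quantize(vectors):
    """Scalar-quantize float vectors to int8 with a single global scale.

    Illustrative only: production indexes use finer-grained scales,
    but the compression idea is the same (4 bytes -> 1 byte per dim).
    """
    scale = np.abs(vectors).max() / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def search_with_rerank(query, vectors, q_vectors, scale, k=10, overquery=4):
    """Shortlist with cheap int8 distances, then rerank exactly."""
    # cheap pass: approximate squared distances in quantized space
    q_query = np.round(query / scale).astype(np.int32)
    approx = ((q_vectors.astype(np.int32) - q_query) ** 2).sum(axis=1)
    candidates = np.argsort(approx)[: k * overquery]
    # expensive pass: exact float distances, but only on the shortlist
    exact = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(exact)[:k]]

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64)).astype(np.float32)
q8, scale = int8_quantize(data)
query = rng.standard_normal(64).astype(np.float32)
top = search_with_rerank(query, data, q8, scale, k=5)
```

the point is that the rerank step is what rescues recall: int8 alone ranks candidates approximately, and the full-precision pass over a small overqueried shortlist fixes the ordering. drop the rerank and you're trusting the quantized distances outright, which is where recall quietly degrades on realistic datasets.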

two great points here:
(1) quantization is how you speed up vector indexes, and
(2) how you build your graph matters much much less*

These are the insights behind DiskANN, which has replaced HNSW in most production systems.

past that, well, you should really go read the DiskANN paper instead of this article; product quantization is far, far more effective than simple int8 or binary quant.
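for anyone who hasn't seen product quantization before, here's a toy sketch of the core idea (this is not DiskANN's implementation, and the helper names are made up): split each vector into m subvectors, learn a small codebook per slice, and store each vector as m one-byte codes.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Tiny Lloyd's k-means, just enough to train a PQ codebook."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            pts = x[assign == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids

def pq_train(vectors, m=8, k=256):
    """Learn one k-centroid codebook per subvector slice."""
    d = vectors.shape[1] // m
    return [kmeans(vectors[:, i * d:(i + 1) * d], k, seed=i) for i in range(m)]

def pq_encode(vectors, codebooks):
    """Encode each vector as m one-byte centroid ids."""
    d = vectors.shape[1] // len(codebooks)
    codes = []
    for i, cb in enumerate(codebooks):
        sub = vectors[:, i * d:(i + 1) * d]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes.append(dists.argmin(axis=1))
    return np.stack(codes, axis=1).astype(np.uint8)

rng = np.random.default_rng(1)
data = rng.standard_normal((2000, 64)).astype(np.float32)
books = pq_train(data, m=8, k=256)
codes = pq_encode(data, books)  # 256 bytes of float32 -> 8 bytes per vector
```

that's 32x compression in this toy setup versus 4x for int8, and because the codebooks adapt to the data distribution, PQ loses far less distance information per bit than uniform scalar quantization does.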

here's my writeup from a year and a half ago: https://dev.to/datastax/why-vector-compression-matters-64l

and if you want to skip forward several years to the cutting edge, check out https://arxiv.org/abs/2509.18471 and the references list for further reading

* but it still matters more than a lot of people thought circa 2020
