TurboQuant KV Compression and SSD Expert Streaming for M5 Pro and iOS

https://github.com/SharpAI/SwiftLM

We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) to native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime and eliminates Python overhead entirely.
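
To make the codebook step concrete, here is a minimal Swift sketch of scalar Lloyd-Max quantization, assuming a precomputed 4-bit codebook. The LloydMaxQuantizer type and its layout are hypothetical; the shipped path is C++ with the dequantization lookup fused into Metal shaders, but the per-element logic is the same:

```swift
import Foundation

// Hypothetical 4-bit Lloyd-Max scalar quantizer for KV cache values.
// Centroids are assumed to be trained offline; TurboQuant's actual
// codebooks and bit widths may differ.
struct LloydMaxQuantizer {
    let codebook: [Float]    // 16 centroids, sorted ascending
    let boundaries: [Float]  // 15 decision thresholds between centroids

    init(codebook: [Float]) {
        precondition(codebook.count == 16, "4-bit example codebook")
        self.codebook = codebook.sorted()
        // For a fixed codebook, the optimal decision boundaries are the
        // midpoints between adjacent centroids.
        self.boundaries = zip(self.codebook, self.codebook.dropFirst())
            .map { ($0 + $1) / 2 }
    }

    // Map each value to the index of its nearest centroid (4 bits each).
    func quantize(_ values: [Float]) -> [UInt8] {
        values.map { v in
            var lo = 0, hi = boundaries.count  // lower-bound binary search
            while lo < hi {
                let mid = (lo + hi) / 2
                if v > boundaries[mid] { lo = mid + 1 } else { hi = mid }
            }
            return UInt8(lo)
        }
    }

    // Dequantize by table lookup, the step a fused shader can perform
    // inline while reading the compressed KV cache.
    func dequantize(_ codes: [UInt8]) -> [Float] {
        codes.map { codebook[Int($0)] }
    }
}
```

Because dequantization is a single table lookup per element, it is cheap to fuse into the attention kernel instead of materializing a decompressed cache.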

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
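
Below is a hedged Swift sketch of the streaming idea: map the weight file read-only and hand out zero-copy views of individual experts, so pages fault in from NVMe on first touch and the OS page cache keeps hot experts resident. ExpertStore, the flat per-expert layout, and the sizes in the usage comment are illustrative assumptions, not SwiftLM's actual on-disk format:

```swift
import Foundation

// Hypothetical expert store: the full weight file stays on NVMe and is
// mmap'ed read-only, so nothing is resident until a page is touched.
final class ExpertStore {
    private let base: UnsafeRawPointer
    private let length: Int
    private let expertBytes: Int

    init?(path: String, expertCount: Int, expertBytes: Int) {
        let fd = open(path, O_RDONLY)
        guard fd >= 0 else { return nil }
        defer { close(fd) }
        self.expertBytes = expertBytes
        self.length = expertCount * expertBytes
        guard let map = mmap(nil, length, PROT_READ, MAP_PRIVATE, fd, 0),
              map != MAP_FAILED else { return nil }
        self.base = UnsafeRawPointer(map)
    }

    // Zero-copy view of one expert's weights; touching it faults the
    // pages in from SSD, or hits the page cache for hot experts.
    func weights(forExpert id: Int) -> UnsafeRawBufferPointer {
        UnsafeRawBufferPointer(start: base + id * expertBytes,
                               count: expertBytes)
    }

    deinit {
        munmap(UnsafeMutableRawPointer(mutating: base), length)
    }
}

// Per forward pass, only the router's top-k experts are touched, so
// only those pages cross NVMe into the GPU's working set.
// let store = ExpertStore(path: "experts.bin",
//                         expertCount: 128, expertBytes: 512 << 20)
// for id in topKExpertIDs { run(expert: store!.weights(forExpert: id)) }
```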

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

We also tested Qwen 4B on an iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM

What tokens/s are you getting with the 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.md.
Yeah, I'd like to see this added to the README.
I'll add more details. We just wired up the pipeline on both Mac and iOS.