Another step towards (good) local models? Combined with the recent write-up (https://simonwillison.net/2026/Mar/18/llm-in-a-flash/) on memory mapping being "good actually" for getting chonky boi models onto small devices, it feels like we're getting closer.
https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality/
Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally
Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of Qwen3.5-397B-A17B running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite …
Simon Willison’s Weblog
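For anyone curious what "memory mapping" buys you here, this is a minimal sketch of the idea in Python with numpy. The file name and sizes are made up for illustration and have nothing to do with the actual Qwen setup in the linked post; the point is just that a mapped weight file gets paged in from flash on demand instead of being loaded into RAM up front.

```python
import numpy as np

HIDDEN = 1024  # hypothetical hidden size, kept small so the demo file stays tiny
PATH = "demo-weights.fp16.bin"  # hypothetical raw weight dump, not a real model file

# Write a dummy weight file once so the sketch is self-contained.
np.random.randn(HIDDEN, HIDDEN).astype(np.float16).tofile(PATH)

# Map the file instead of reading it all into RAM: the OS pages bytes in from
# disk/flash only when a slice is actually touched, so resident memory stays small.
weights = np.memmap(PATH, dtype=np.float16, mode="r", shape=(HIDDEN, HIDDEN))

# A matmul against the mapped array only pages in the rows it actually reads.
x = np.random.randn(HIDDEN).astype(np.float16)
y = weights @ x
print(y.shape)  # (1024,)
```

Real runtimes like llama.cpp lean on the same OS facility when they mmap model files, with much smarter scheduling and quantization layered on top, but the page-in-on-demand behaviour is the core trick.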