anemll 5d ago

iPhone 17 Pro Demonstrated Running a 400B LLM

https://twitter.com/anemll/status/2035901335984611412

Anemll (@anemll) on X

Running 400B model on iPhone! 0.6 t/s Credit @danveloper @alexintosh @danpacary @anemll

X (formerly Twitter)

firstbabylonian

> SSD streaming to GPU

Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?

1: https://arxiv.org/abs/2312.11514

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

arXiv.org
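
For a rough feel of the "windowing" idea mentioned in the abstract, here is a minimal sketch, not the paper's implementation. It assumes each token activates a known set of neuron ids, and that the neurons used by the last `window` tokens stay in DRAM, so only the missing ones are read from flash. The function name `windowed_loads` and the toy data are made up for illustration.

```python
from collections import deque


def windowed_loads(active_per_token, window=2):
    """Count neurons that must be read from flash for each token.

    active_per_token: list of sets of neuron ids (hypothetical toy data).
    window: how many previous tokens' neurons are assumed to stay in DRAM.
    """
    recent = deque()  # active sets of the last `window` tokens
    loads = []
    for active in active_per_token:
        cached = set().union(*recent)        # neurons already in DRAM
        loads.append(len(active - cached))   # only the missing ones are read
        recent.append(active)
        if len(recent) > window:
            recent.popleft()                 # oldest token leaves the window
    return loads


tokens = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5, 6}]
loads = windowed_loads(tokens, window=2)
naive = sum(len(t) for t in tokens)          # reload everything per token
print(loads, sum(loads), "vs naive", naive)
```

With this toy input, overlapping activations between neighbouring tokens mean fewer reads than reloading every active neuron each time. How much a real model benefits depends on how much consecutive tokens actually overlap.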

simonw 5d ago

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

Autoresearching Apple’s “LLM in a Flash” to run Qwen 397B locally

Here's a fascinating piece of research by Dan Woods, who managed to get a custom version of Qwen3.5-397B-A17B running at 5.5+ tokens/second on a 48GB MacBook Pro M3 Max despite …

Simon Willison’s Weblog

zozbot234 5d ago

A similar approach was recently featured here: https://news.ycombinator.com/item?id=47476422 Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model. (Unless you want to use wear-resistant Intel Optane storage, but that was power hungry and thus unsuitable for a mobile device.)

Flash-MoE: Running a 397B Parameter Model on a Laptop | Hacker News

simonw 5d ago

Yeah, this new post is a continuation of that work.

Aurornis 5d ago

> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
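
As a back-of-the-envelope illustration, the arithmetic might look like the sketch below. It assumes the "A17B" in Qwen3.5-397B-A17B means roughly 17B parameters are active per token, and it assumes 4 bits per weight purely for illustration; the thread does not say which quantization the demo used. The helper name `moe_footprint` is hypothetical.

```python
def moe_footprint(total_params_b, active_params_b, bits_per_weight):
    """Return (active fraction, active GB, total GB) for an MoE model.

    Parameter counts are in billions, so params * bytes-per-weight gives GB
    (1 GB = 1e9 bytes). Ignores KV cache, activations, and runtime overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    active_gb = active_params_b * bytes_per_weight
    total_gb = total_params_b * bytes_per_weight
    return active_params_b / total_params_b, active_gb, total_gb


# Assumed figures: 397B total, 17B active, 4-bit weights (illustrative only).
frac, active_gb, total_gb = moe_footprint(397, 17, bits_per_weight=4)
print(f"{frac:.1%} of weights active per token")
print(f"~{active_gb:.1f} GB active vs ~{total_gb:.1f} GB total")
```

Under those assumptions the per-token working set is a small fraction of the full model, which is what makes streaming the rest from storage plausible. The set of active experts changes from token to token, though, so the same weights do not simply stay resident.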

foobiekr 5d ago

This is not entirely dissimilar to what Cerebras does with their weight streaming.