Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

https://github.com/t8/hypura

Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
Yeah, the title comes from nowhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing of that here...
I'm referencing it as being possible. I didn't share benchmarks because, candidly, the performance would be so slow it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e. smaller MoE models where not all experts need to be loaded in memory simultaneously).
It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design, except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.
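For anyone unfamiliar with the distinction being drawn: here's a minimal sketch of the two access styles (mmap-backed reads vs explicit positioned reads) against a stand-in file. The file name and slice offsets are made up; this shows the mechanics only, not the performance difference.

```python
import mmap, os, tempfile

# Create a small stand-in "weights" file (random bytes, 1 MiB).
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(1 << 20))
tmp.close()

offset, length = 4096, 8192    # pretend this slice is one expert's weights

# Style 1: mmap -- the kernel faults pages in on first access.
with open(tmp.name, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    via_mmap = bytes(m[offset:offset + length])

# Style 2: explicit positioned read -- easy to batch and issue in parallel.
fd = os.open(tmp.name, os.O_RDONLY)
via_pread = os.pread(fd, length, offset)
os.close(fd)
os.unlink(tmp.name)

assert via_mmap == via_pread   # same bytes, very different I/O behavior
```

The overhead argument in the linked threads is about style 1: page faults are serviced one at a time on demand, while explicit reads can be queued deep enough to keep an NVMe busy.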

It was written by an LLM, so... yeah.
Except this isn't using heavily quantised versions of the model, which would reduce quality.
Intel Optane rolling in its grave.

Still have 4 brand-new ones in my storage unit. Just in case of moments like these.

Joke aside (I do have them, though!), I don't think Optane is that much use (not to mention mine are only 256 GiB each). It's a useful legacy crutch if you have legacy software that isn't designed to issue multiple reads/writes in parallel. If your software does, Optane is really not faster than NVMe, especially these modern ones.

It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert layers immediately after routing); it's the wear-out resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen models, which is not append-only as it was with the pure attention model) and maybe even per-layer activations, though those have the least use given how ephemeral they are.
Is it too late for Intel to bring them back to life?
Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but it seems unlikely given the previous commercial failure.

Wouldn't be Intel if they didn't quit halfway through on a good thing.

Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the max one could push through PCIe anyhow.
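Roughly, yes; a quick check with nominal spec numbers (list figures, not measurements) suggests four fast Gen4 drives land just under an x16 slot's usable bandwidth:

```python
# Nominal spec math for the four-drive RAID 0 idea; not measurements.
pcie4_per_lane = 1.97e9          # ~1.97 GB/s usable per PCIe 4.0 lane
slot_bw = 16 * pcie4_per_lane    # ~31.5 GB/s for an x16 slot

drive_bw = 7e9                   # ~7 GB/s for a fast Gen4 NVMe drive
raid0_bw = 4 * drive_bw          # ~28 GB/s striped across four drives

# Four drives roughly fill, but do not exceed, the x16 slot.
print(raid0_bw <= slot_bw)
```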

Memristors are also missing from this AI hype, even though they were supposedly just around the corner ten years back.

the practical question is whether the read pattern is sequential enough to actually saturate nvme bandwidth, or if the attention-layer access pattern ends up random enough to kill throughput. sequential reads on a decent nvme get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.

for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential that's ~300 seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.
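the arithmetic, for anyone who wants to plug in their own numbers (bandwidth figures are the assumptions from above, not measurements):

```python
# Back-of-envelope for streaming a dense 1T-param model from NVMe.
params = 1e12                # 1T parameters
weights = params * 2         # fp16 -> 2 TB read per full forward pass

seq_bw = 7e9                 # optimistic sequential NVMe read, 7 GB/s
rand_bw = 0.5e9              # pessimistic random-read throughput, 500 MB/s

print(weights / seq_bw)      # ~286 s/token streaming sequentially
print(weights / rand_bw)     # ~4000 s/token if reads degrade to random
```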

still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.

> for a 1T model youd need to stream something like 2TB of weights per forward pass

Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem of each individual expert layer being quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.

But across a sequence you still have to load most of them.
Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful and capable of achieving multiple tokens/sec.
4K random reads with a queue depth of 1 on an M1 Max run at about 65 MB/s.
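One way to read that number: at QD1, 65 MB/s implies each 4 KiB read costs roughly 63 µs, so fetching even one small expert serially in 4 KiB chunks takes noticeable time. Quick arithmetic using the quoted figure (the 8 MiB expert size is an assumption for illustration):

```python
# Implied per-read latency from the 65 MB/s QD1 figure above.
read_size = 4 * 1024             # 4 KiB
throughput = 65e6                # bytes/s at queue depth 1

latency = read_size / throughput          # ~63 microseconds per read
reads_needed = (8 * 2**20) // read_size   # one ~8 MiB expert = 2048 reads
serial_time = reads_needed * latency      # ~0.13 s if issued one at a time

print(latency, reads_needed, serial_time)
```

Which is the case for either larger reads or deeper queues when pulling expert weights off disk.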
For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.