Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

https://github.com/t8/hypura

Where does "1T parameter model" come from? I can only see models with 70B params or less mentioned in the repo.
Yeah, the title comes from nowhere in the link. No doubt it's possible, but all that matters is speed, and we learn nothing of that here...
I'm referencing it as being possible. I didn't share benchmarks because, candidly, the performance would be so slow it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but capable of achieving multiple tokens/sec (i.e. smaller MoE models where not all experts need to be loaded in memory simultaneously).
It will be interesting to compare this to https://news.ycombinator.com/item?id=47476422 and https://news.ycombinator.com/item?id=47490070 . Very similar design, except that this is apparently using mmap, which according to the earlier experiment incurs significant overhead.
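For anyone unfamiliar with the distinction being drawn: here's a minimal sketch of the two access styles (mmap-backed reads vs explicit positioned reads) against a stand-in file. The file name and slice offsets are made up; this shows the mechanics only, not the performance difference.

```python
import mmap, os, tempfile

# Create a small stand-in "weights" file (random bytes, 1 MiB).
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(1 << 20))
tmp.close()

offset, length = 4096, 8192    # pretend this slice is one expert's weights

# Style 1: mmap -- the kernel faults pages in on first access.
with open(tmp.name, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    via_mmap = bytes(m[offset:offset + length])

# Style 2: explicit positioned read -- easy to batch and issue in parallel.
fd = os.open(tmp.name, os.O_RDONLY)
via_pread = os.pread(fd, length, offset)
os.close(fd)
os.unlink(tmp.name)

assert via_mmap == via_pread   # same bytes, very different I/O behavior
```

The overhead argument in the linked threads is about style 1: page faults are serviced one at a time on demand, while explicit reads can be queued deep enough to keep an NVMe busy.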

It was written by an LLM, so... yeah.
Except this isn't using heavily quantised versions of the model, which would reduce quality.
Intel Optane rolling in its grave.

Still have 4 brand-new ones in my storage unit. Just in case of moments like these.

Joke aside (I do have them, though!), I don't think Optane is that much use (not to mention mine are only 256 GiB each). It's a useful legacy crutch if you have legacy software that isn't designed to issue multiple reads/writes in parallel. If your software does, Optane is really not faster than NVMe, especially these modern ones.

It's not about being faster (except for small reads where latency dominates, which is actually relevant when reading a handful of expert layers immediately after routing); it's the wear-out resistance, which opens up the possibility of storing the KV-cache (including the "linear" KV-cache of recent Qwen models, which is not append-only as it was with the pure attention model) and maybe even per-layer activations, though those have the least use given how ephemeral they are.
Is it too late for Intel to bring them back to life?
Yes, their NAND division has been sold; it is now mostly under Solidigm. Maybe Solidigm could bring it back, but it seems unlikely given the previous commercial failure.

Wouldn't be Intel if they didn't quit halfway through on a good thing.

Still, couldn't one get a RAID 0 card with four drives to saturate an x16 link? That's already the max one could push through PCIe anyhow.
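Roughly, yes; a quick check with nominal spec numbers (list figures, not measurements) suggests four fast Gen4 drives land just under an x16 slot's usable bandwidth:

```python
# Nominal spec math for the four-drive RAID 0 idea; not measurements.
pcie4_per_lane = 1.97e9          # ~1.97 GB/s usable per PCIe 4.0 lane
slot_bw = 16 * pcie4_per_lane    # ~31.5 GB/s for an x16 slot

drive_bw = 7e9                   # ~7 GB/s for a fast Gen4 NVMe drive
raid0_bw = 4 * drive_bw          # ~28 GB/s striped across four drives

# Four drives roughly fill, but do not exceed, the x16 slot.
print(raid0_bw <= slot_bw)
```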

Memristors are also missing from this AI hype, even though they were supposedly just around the corner ten years back.

the practical question is whether the read pattern is sequential enough to actually saturate nvme bandwidth, or if the attention-layer access pattern ends up random enough to kill throughput. sequential reads on a decent nvme get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.

for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential that's ~300 seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.
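the arithmetic, for anyone who wants to plug in their own numbers (bandwidth figures are the assumptions from above, not measurements):

```python
# Back-of-envelope for streaming a dense 1T-param model from NVMe.
params = 1e12                # 1T parameters
weights = params * 2         # fp16 -> 2 TB read per full forward pass

seq_bw = 7e9                 # optimistic sequential NVMe read, 7 GB/s
rand_bw = 0.5e9              # pessimistic random-read throughput, 500 MB/s

print(weights / seq_bw)      # ~286 s/token streaming sequentially
print(weights / rand_bw)     # ~4000 s/token if reads degrade to random
```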

still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.

> for a 1T model youd need to stream something like 2TB of weights per forward pass

Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem of each individual expert layer being quite small (a few MiB each, give or take), but those reads are large enough for the NVMe.

But across a sequence you still have to load most of them.
Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful and capable of achieving multiple tokens/sec.
4K random reads with a queue depth of 1 on an M1 Max run at about 65 MB/s.
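One way to read that number: at QD1, 65 MB/s implies each 4 KiB read costs roughly 63 µs, so fetching even one small expert serially in 4 KiB chunks takes noticeable time. Quick arithmetic using the quoted figure (the 8 MiB expert size is an assumption for illustration):

```python
# Implied per-read latency from the 65 MB/s QD1 figure above.
read_size = 4 * 1024             # 4 KiB
throughput = 65e6                # bytes/s at queue depth 1

latency = read_size / throughput          # ~63 microseconds per read
reads_needed = (8 * 2**20) // read_size   # one ~8 MiB expert = 2048 reads
serial_time = reads_needed * latency      # ~0.13 s if issued one at a time

print(latency, reads_needed, serial_time)
```

Which is the case for either larger reads or deeper queues when pulling expert weights off disk.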
For a lot of local workloads, sub-1 tok/s is useless in foreground and perfectly acceptable in background. If the choice is “this crashes” vs “this finishes overnight,” that’s still a meaningful capability jump.