Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
Still have 4 brand new ones in my storage unit, just in case of moments like these.
Joke aside (I do have them tho!), I don't think Optane is of much use here (not to mention my units are only 256GiB). It is a useful crutch if you have legacy software that is not designed to issue multiple reads/writes in parallel. If your software does issue them in parallel, Optane is really not faster than NVMe, especially the modern drives.
Wouldn't be Intel if they didn't quit halfway through on a good thing.
Still, couldn't one get a RAID 0 card with four drives to saturate an x16 slot? That's already the max one could push through PCIe anyhow.
the practical question is whether the read pattern is sequential enough to actually saturate NVMe bandwidth or if the attention layer access pattern ends up being random enough to kill throughput. sequential reads on a decent NVMe drive get you 5-7 GB/s, random reads drop to maybe 500 MB/s depending on queue depth.
for a 1T model you'd need to stream something like 2TB of weights per forward pass at fp16. even at peak sequential that's roughly 300 seconds per token which is... not great for interactive use but maybe fine for batch inference where you don't care about latency.
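For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It assumes a dense model where every weight is read from storage once per token, with nothing cached in RAM. The function name `seconds_per_token` is made up for illustration, and the bandwidth figures are just the ballpark numbers quoted above, not measurements.

```python
def seconds_per_token(params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Seconds needed to stream every weight once at the given bandwidth (GB/s)."""
    total_bytes = params * bytes_per_param
    return total_bytes / (bandwidth_gb_s * 1e9)


# Assumption: 1T parameters at fp16 (2 bytes each) = 2 TB streamed per token.
scenarios = {
    "sequential, 7 GB/s": 7.0,
    "sequential, 5 GB/s": 5.0,
    "random, 0.5 GB/s": 0.5,
}

for label, bandwidth in scenarios.items():
    secs = seconds_per_token(1e12, 2, bandwidth)
    print(f"{label}: {secs:.0f} s/token")
```

Under these assumptions the sequential case lands somewhere between roughly 290 and 400 seconds per token, and the random-read case is about ten times worse.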
still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.
> for a 1T model you'd need to stream something like 2TB of weights per forward pass
Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem that each individual expert layer is quite small (a few MiB each, give or take), but those reads are large enough for NVMe to handle well.
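To put a rough number on the sparsity argument, here is a minimal sketch. The 3% active fraction is a purely hypothetical value picked for illustration, not a figure for any specific model, and `moe_seconds_per_token` is a made-up helper name. It also ignores any weights that stay resident in RAM, so it is a pessimistic estimate.

```python
def moe_seconds_per_token(
    total_params: float,
    active_fraction: float,
    bytes_per_param: float,
    bandwidth_gb_s: float,
) -> float:
    """Seconds to read only the weights that are active for one token."""
    active_bytes = total_params * active_fraction * bytes_per_param
    return active_bytes / (bandwidth_gb_s * 1e9)


# Hypothetical: 1T total parameters, fp16, 5 GB/s sustained reads.
dense = moe_seconds_per_token(1e12, 1.0, 2, 5.0)    # every weight read
sparse = moe_seconds_per_token(1e12, 0.03, 2, 5.0)  # assumed 3% active per token

print(f"dense:  {dense:.0f} s/token")
print(f"sparse: {sparse:.0f} s/token")
```

With those assumed numbers the per-token read drops from hundreds of seconds to around ten, provided the expert reads are big enough to keep the drive near its sequential throughput.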