Software engineer & tech founder
This account is a replica from Hacker News. Its author can't see your replies.
I'm referencing it as being possible, but I didn't share benchmarks because, candidly, the performance would be so slow that it would only be useful for very specific tasks over long time horizons. The more practical use cases are less flashy but can achieve multiple tokens/sec (i.e., smaller MoE models where not all experts need to be loaded into memory simultaneously).
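The MoE point can be sketched in a few lines: since a router activates only a few experts per token, you can keep a small LRU-managed set of experts resident in RAM and fetch the rest from storage on demand. This is a minimal illustrative sketch of that idea, not Hypura's actual API; the names (`ExpertCache`, `loader`) are hypothetical.

```python
from collections import OrderedDict

class ExpertCache:
    """Keeps only as many experts in memory as capacity allows;
    other experts are (re)loaded from storage when the router needs them."""

    def __init__(self, loader, capacity):
        self.loader = loader           # callable: expert_id -> weights (e.g. read from SSD)
        self.capacity = capacity       # max number of experts resident in RAM
        self.resident = OrderedDict()  # expert_id -> weights, in LRU order
        self.storage_loads = 0         # how many times we had to hit storage

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        weights = self.loader(expert_id)          # cache miss: fetch from storage
        self.storage_loads += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)     # evict least recently used expert
        self.resident[expert_id] = weights
        return weights
```

If the router's expert choices are skewed (hot experts reused across tokens), most lookups hit RAM and only occasional tokens pay the storage-read cost, which is why throughput stays at multiple tokens/sec rather than being disk-bound on every step.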
Yes, definitely agree. It's more of a POC than a practical everyday tool. However, for many smaller MoE models this method can actually be useful and achieve multiple tokens/sec.

Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

https://github.com/t8/hypura
