Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
https://github.com/t8/hypura

It will be interesting to compare this to
https://news.ycombinator.com/item?id=47476422 and
https://news.ycombinator.com/item?id=47490070 . Very similar designs, except that this one apparently uses mmap, which the earlier experiment found to incur significant overhead.
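For context on the mmap point: the two approaches being contrasted are mapping the weight file and letting the OS page it in on first access (which incurs page faults) versus issuing explicit reads into a buffer. A minimal sketch of both, with hypothetical function names and a small temp file standing in for a weight file:

```python
import mmap
import os
import tempfile

def read_weights_pread(path, offset, size):
    # Explicit read: a single syscall copies the bytes into our buffer.
    with open(path, "rb") as f:
        return os.pread(f.fileno(), size, offset)

def read_weights_mmap(path, offset, size):
    # mmap: map the whole file; pages are faulted in lazily on first
    # access, which is the overhead the earlier experiment measured.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + size])

# Demo: a 2-"tensor" file, read the second tensor both ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 4096 + b"\x02" * 4096)
    path = tmp.name

a = read_weights_pread(path, 4096, 4096)
b = read_weights_mmap(path, 4096, 4096)
assert a == b == b"\x02" * 4096
os.unlink(path)
```

Both return identical bytes; the difference shows up in when the I/O cost is paid (upfront syscall vs. demand paging), which matters for a scheduler deciding what to keep resident.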
(One of those threads: "Flash-MoE: Running a 397B Parameter Model on a Laptop".)