Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon
https://github.com/t8/hypura

It will be interesting to compare this to
https://news.ycombinator.com/item?id=47476422 and
https://news.ycombinator.com/item?id=47490070 . Very similar designs, except that this one apparently uses mmap, which the earlier experiment found to incur significant overhead.
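For context on the mmap point: the two approaches being contrasted are mapping the weight file and letting the OS page it in on first access (which incurs page faults) versus issuing explicit reads into a buffer. A minimal sketch of both, with hypothetical function names and a small temp file standing in for a weight file:

```python
import mmap
import os
import tempfile

def read_weights_pread(path, offset, size):
    # Explicit read: a single syscall copies the bytes into our buffer.
    with open(path, "rb") as f:
        return os.pread(f.fileno(), size, offset)

def read_weights_mmap(path, offset, size):
    # mmap: map the whole file; pages are faulted in lazily on first
    # access, which is the overhead the earlier experiment measured.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + size])

# Demo: a 2-"tensor" file, read the second tensor both ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x01" * 4096 + b"\x02" * 4096)
    path = tmp.name

a = read_weights_pread(path, 4096, 4096)
b = read_weights_mmap(path, 4096, 4096)
assert a == b == b"\x02" * 4096
os.unlink(path)
```

Both return identical bytes; the difference shows up in when the I/O cost is paid (upfront syscall vs. demand paging), which matters for a scheduler deciding what to keep resident.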
(One of those threads: "Flash-MoE: Running a 397B Parameter Model on a Laptop".)