This is now my go-to library for serving an LLM.

PagedAttention, distributed serving, a *very* nice Python interface… this project has it all!

https://github.com/vllm-project/vllm

#ai #vllm #opensource
