The future is cyberpunk and not in a good way.
If your response to a natural disaster is focused only on politics and shows little empathy, you are a truly horrible human being.
Merry Christmas and Happy New Year y’all! 🎄⛄️
#WeekendReads https://hao-ai-lab.github.io/blogs/distserve/
Or rather, it might better be titled An Introduction to LLM Dynamic Batching Inference.

Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
TL;DR: LLM apps today have diverse latency requirements. For example, a chatbot may require a fast initial response (e.g., under 0.2 seconds) but only moderate decoding speed, which needs to match human reading speed, whereas code completion requires fast end-to-end generation time for real-time code suggestions.
In this blog post, we show that existing serving systems that optimize for throughput are not optimal under these latency criteria. We advocate goodput, the number of completed requests per second that adhere to Service Level Objectives (SLOs), as a better measure of LLM serving performance, since it accounts for both cost and user satisfaction.
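To make the goodput metric concrete, here is a minimal sketch of how it could be computed from per-request measurements. The RequestStats class, the goodput() helper, and the 50 ms per-token decode threshold are illustrative assumptions and not code from the DistServe post; only the 0.2-second first-token target comes from the TL;DR above.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float   # time to first token (prefill latency), in seconds
    tpot_s: float   # time per output token (decode latency), in seconds

def goodput(requests: list[RequestStats],
            duration_s: float,
            ttft_slo_s: float = 0.2,    # from the chatbot example above
            tpot_slo_s: float = 0.05) -> float:  # assumed "reading speed" SLO
    """Completed requests per second that meet BOTH latency SLOs."""
    ok = sum(1 for r in requests
             if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s)
    return ok / duration_s

# Example: 10 requests served in 2 seconds, one misses the TTFT SLO.
stats = [RequestStats(ttft_s=0.15, tpot_s=0.04)] * 9 + [RequestStats(0.35, 0.04)]
print(goodput(stats, duration_s=2.0))  # 4.5 req/s goodput vs. 5.0 req/s raw throughput
```

The point of the example: raw throughput counts all 10 requests, while goodput only credits the 9 that stayed within both SLOs, which is why a throughput-optimized system can look good on paper yet leave users unsatisfied.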
I was today years old when I learned that sharks don’t have bones.
The world should brace for a Trump win. 😬