Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

#HeterogeneousSystems #GPUcluster #LLM

https://hgpu.org/?p=29242

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inferenc…