Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
#HeterogeneousSystems #GPUcluster #LLM
https://hgpu.org/?p=29242

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inferenc…