Tangram: Hiding GPU Heterogeneity for Efficient LLM Parallelization
#GPUcluster #LLM #Performance
https://hgpu.org/?p=30879

The scale of LLM training jobs requires parallelization planning over large GPU clusters. Due to different GPU types and interconnects added over time, these GPU clusters are increasingly heterogen…
hgpu.org
Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters
#GPUcluster #TaskScheduling #Package
https://hgpu.org/?p=30451

Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limi…
hgpu.org
Collective Communication for 100k+ GPUs
The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. T…
hgpu.org
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
#CUDA #GPUcluster #Communication
https://hgpu.org/?p=30035

The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, i…
hgpu.org
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
#GPUcluster
https://hgpu.org/?p=29950

Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU comm…
hgpu.org
FLASH: Fast All-to-All Communication in GPU Clusters
#GPUcluster #Communication #MPI
https://hgpu.org/?p=29914

Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers; and GPU clu…
hgpu.org
Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
#CUDA #MPI #GPUcluster #TaskScheduling #DeepLearning #DL #PyTorch
https://hgpu.org/?p=29319

Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Effi…
hgpu.org
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
#HeterogeneousSystems #GPUcluster #LLM
https://hgpu.org/?p=29242

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inferenc…
hgpu.org
Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach
#SYCL #GPUcluster #HPC #Package
https://hgpu.org/?p=29182

Reducing the need for users to manually manage the details of work and data distribution is an important goal of high-level many-task runtime systems. For distributed memory platforms this means th…
hgpu.org
Job posting: Project Assistant in Machine Learning