The orchestration of deep neural network (DNN) model inference on GPU
clusters presents two significant challenges: achieving high accelerator
efficiency given the batching properties of model inference while meeting
latency service level objectives (SLOs), and adapting to workload changes both
in terms of short-term fluctuations and long-term resource allocation. To
address these challenges, we propose Symphony, a centralized scheduling system
that can scale to millions of requests per second and coordinate tens of
thousands of GPUs. Our system utilizes a non-work-conserving scheduling
algorithm capable of achieving high batch efficiency while also enabling robust
autoscaling. Additionally, we developed an epoch-scale algorithm that allocates
models to sub-clusters based on the compute and memory needs of the models.
Through extensive experiments, we demonstrate that Symphony achieves up to
4.7x higher goodput than prior systems.
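
To make the idea of non-work-conserving scheduling concrete, the following is a minimal sketch, not Symphony's actual algorithm. It assumes a linear batch latency model (a fixed cost plus a per-request cost) and defers dispatch on an idle GPU until waiting for one more request would risk the earliest deadline in the queue. All names and parameters (`Request`, `batch_latency_ms`, `should_dispatch`, `alpha_ms`, `beta_ms`, `max_batch`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    arrival_ms: float   # time the request entered the queue
    deadline_ms: float  # absolute time by which it must finish (arrival + SLO)


def batch_latency_ms(batch_size: int, alpha_ms: float = 1.0, beta_ms: float = 5.0) -> float:
    """Assumed linear latency model: fixed cost beta plus alpha per request."""
    return alpha_ms * batch_size + beta_ms


def should_dispatch(queue: Sequence[Request], now_ms: float, max_batch: int = 8) -> bool:
    """Decide whether an idle GPU should start the queued batch now.

    A work-conserving scheduler would dispatch as soon as the queue is
    non-empty. This sketch instead holds the GPU idle so the batch can grow,
    and dispatches only when the batch is full or when admitting one more
    request would no longer leave time to meet the earliest deadline.
    """
    if not queue:
        return False
    if len(queue) >= max_batch:
        return True
    earliest_deadline = min(r.deadline_ms for r in queue)
    # Latest start time that still meets the deadline with one extra request.
    latest_start = earliest_deadline - batch_latency_ms(len(queue) + 1)
    return now_ms >= latest_start


if __name__ == "__main__":
    queue = [Request(arrival_ms=0.0, deadline_ms=20.0)]
    print(should_dispatch(queue, now_ms=0.0))   # deferred: slack remains
    print(should_dispatch(queue, now_ms=13.0))  # dispatched: no slack left
```

Under the assumed model, a batch of 8 costs 13 ms in total (about 1.6 ms per request) versus 6 ms for a single request, which is why deliberately leaving the GPU idle for a short time can raise overall efficiency while still meeting the SLO.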