Symphony: Optimized Model Serving using Centralized Orchestration

Canumalla, Anirudh; Chen, Lequn; Deng, Weixin; Krishnamurthy, Arvind; Philipose, Matthai; Xin, Yu

Publication date: 14 August 2023

Abstract

The orchestration of deep neural network (DNN) model inference on GPU clusters presents two significant challenges: achieving high accelerator efficiency given the batching properties of model inference while meeting latency service level objectives (SLOs), and adapting to workload changes both in terms of short-term fluctuations and long-term resource allocation. To address these challenges, we propose Symphony, a centralized scheduling system that can scale to millions of requests per second and coordinate tens of thousands of GPUs. Our system utilizes a non-work-conserving scheduling algorithm capable of achieving high batch efficiency while also enabling robust autoscaling. Additionally, we developed an epoch-scale algorithm that allocates models to sub-clusters based on the compute and memory needs of the models. Through extensive experiments, we demonstrate that Symphony achieves up to 4.7x higher goodput than prior systems.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2308.07470

Last time updated on 18/08/2023