Towards QoS-Aware and Resource-Efficient GPU Microservices Based on
  Spatial Multitasking GPUs In Datacenters by Zhang, Wei et al.
1st Wei Zhang
Shanghai Jiao Tong University
2nd Quan Chen
Shanghai Jiao Tong University
3rd Kaihua Fu
Shanghai Jiao Tong University
4th Ningxin Zheng
Shanghai Jiao Tong University
5th Zhiyi Huang
University of Otago
6th Jingwen Leng
Shanghai Jiao Tong University
7th Chao Li
Shanghai Jiao Tong University
8th Wenli Zheng
Shanghai Jiao Tong University
9th Minyi Guo
Shanghai Jiao Tong University
Abstract—Prior research focuses on CPU-based microservices, but its
findings do not apply to GPU-based microservices because the contention
patterns differ. It is challenging to optimize resource utilization
while guaranteeing the QoS of GPU microservices. We find that the
overhead is caused by inter-microservice communication, GPU resource
contention, and imbalanced throughput within the microservice pipeline.
We propose Camelot, a runtime system that manages GPU microservices with
these factors in mind. In Camelot, a global memory-based communication
mechanism enables onsite data sharing that significantly reduces the
end-to-end latencies of user queries. We also propose two
contention-aware resource allocation policies that either maximize the
peak supported service load or minimize the resource usage at low load
while ensuring the required QoS. The two policies consider the
microservice pipeline effect and the runtime GPU resource contention
when allocating resources for the microservices. Compared with
state-of-the-art work, Camelot increases the supported peak load by up
to 64.5% with limited GPUs, and reduces resource usage by 35% at low
load while achieving the desired 99%-ile latency target.
I. INTRODUCTION
Datacenters [1] host latency-critical user-facing applications, such as
web search [2] and web services [3]. These applications have strict
Quality of Service (QoS) requirements in terms of tail latency, and
require frequent bug fixes and feature updates. To meet these
requirements, service design is shifting from a monolithic architecture
to a microservice architecture [4], where a complex user-facing service
is decomposed into multiple loosely coupled microservices. Each
microservice provides a specialized functionality. A microservice-based
application involves the interoperation of multiple microservices, each
of which can be implemented, deployed, and updated independently without
compromising the application’s integrity. Such independence improves the
application’s scalability, portability, and availability. Given these
advantages, the microservice architecture has been widely adopted by
Internet giants such as Netflix, Amazon, Apple and eBay [5]–[7].
Similarly, user-facing services on GPUs (e.g., intelligent personal
assistants [8], graph processing [9], and deep learning [10]) are also
shifting towards the microservice architecture (referred to as “GPU
microservices”). Figure 1 shows an example of deploying an application
that has three microservice stages on GPUs. In the figure, multiple
microservices run on a spatial multitasking GPU concurrently, since the
current Volta Multi-Process Service (MPS) [11] allows multiple
applications to share GPU computational resources for better resource
efficiency. As the figure suggests, the back pressure effects caused by
the dependencies between the microservices result in expensive
overhead [4]. A QoS violation in one microservice cascades quickly
through the entire service and amplifies the end-to-end violation.
Therefore, even though the QoS requirements of user-facing applications
are similar for microservices and monoliths, the tail latency required
for each individual microservice is much stricter than for traditional
monolithic applications.
Fig. 1: An example of deploying GPU microservices. [Figure: a query
passes through Microservice1, Microservice2, and Microservice3, which
run on the SMs of GPU1 and GPU2 and exchange data through global memory
and host memory before producing the output. Markers indicate SM
contention, global memory (capacity and bandwidth) contention, and PCIe
bus contention.]
Besides guaranteeing the QoS of microservices, it is cost efficient to
maximize the supported peak load of a user-facing application with
limited resources, and to minimize the resource usage of a service with
varying load. There is some prior research on characterizing and
managing resources for CPU microservices [5], [12]–[14]. Thanks to the
containerized deployment pattern [13] of CPU microservices, interference
between them can be encapsulated and resolved at the container level: a
container can impose limits on the CPU and memory resources consumed by
a microservice.
However, prior analysis and resource management policies do not apply to
GPU microservices. While CPU microservices contend for CPU and memory
bandwidth, GPU microservices contend for SMs, global memory capacity and
bandwidth, and PCIe bandwidth (as shown in Figure 1). In addition, there
is no containerized environment that enables fine-grained resource
sharing for spatial multitasking GPUs.
arXiv:2005.02088v1 [cs.DC] 5 May 2020
Balancing the throughput of the microservices to improve the
microservice pipeline efficiency and eliminate the back pressure effect
is challenging on GPUs. Laius [15] is the state-of-the-art work that
manages resources on spatial multitasking GPUs. It improves GPU
utilization by co-locating user-facing applications and batch
applications while ensuring the QoS of the user-facing applications.
However, Laius is not able to handle GPU microservices that show complex
dependency relationships, because it assumes that the co-located tasks
are independent of each other.
We find that the communication overhead between microservices, the
pipeline efficiency (determined by the number of SMs allocated to each
microservice, and the number of instances in each microservice stage),
and the global memory bandwidth contention together determine the tail
latencies of GPU microservices. We have two insights: 1) the
communication between GPU microservices results in long end-to-end
latencies due to the limited PCIe bandwidth; 2) the global memory
capacity of a GPU becomes one of the main limitations for microservice
co-location, because each microservice occupies a large global memory
space.
Because there are no standard GPU microservice benchmarks, we first
develop Camelot suite, a benchmark suite that includes both real and
artifact GPU microservices. The real-system workloads include end-to-end
services that cover natural language processing (NLP), deep neural
networks (DNN), and image processing. We use cutting-edge models such as
LSTM [16], BERT [17], VGG [18], and DC-GAN [19] to build the real-system
benchmarks, and the benchmarks are programmed in Python, C/C++, and
CUDA. The artifact benchmark comprises compute-intensive,
memory-intensive, and PCIe-intensive microservices. We can emulate
various end-to-end services using the artifact benchmark.
Because the load of a user-facing service varies (diurnal
load pattern [1]) and the contention scenario is only known
at runtime, an online method is required to manage the
GPU microservices. We therefore propose a runtime system
named Camelot to manage GPU resources online. In Camelot,
a global memory-based communication mechanism enables
fast data transfer between microservices on the same GPU;
two contention-aware resource allocation policies identify the
optimal GPU resource allocations that minimize the resource
usage or maximize the throughput while ensuring the required
QoS. The allocation decisions are made based on the pipeline
effect of microservices and the runtime contention behaviors.
To enable effective resource allocation, we also propose a performance
predictor that precisely predicts the global memory bandwidth usage,
duration, and throughput of each microservice under various resource
configurations. This paper makes three main contributions.
• Comprehensive characterization of GPU microser-
vices. The characterization reveals the challenges in man-
aging GPU microservices. We will open source both the
benchmark suite and the runtime system1.
• A global memory-based communication mechanism
for GPU microservices. Adopting the mechanism, the
microservices on the same GPU communicate directly
without the expensive CPU-GPU data copies.
• A lightweight GPU resource allocation policy. The pol-
icy considers communication overhead, global memory
capacity, shared resource contention, and pipeline stall
when managing the GPU resources.
We implement Camelot and evaluate it on a GPU server with two Nvidia
2080Ti GPUs, and on a DGX-2 machine with Nvidia V100 GPUs. According to
our experimental results, Camelot increases the supported peak load by
up to 73.9% and 64.5% compared with equal allocation (EA) and Laius,
respectively, and reduces the GPU resource usage by 46.5% compared with
EA and 35% compared with Laius at low load while ensuring the required
QoS.
II. RELATED WORK
There have been some efforts on related topics: resource
management and scheduling in datacenters, benchmark suites
for user-facing services, and microservice architecture.
Microservice Architecture. Gan et al. proposed a microservice benchmark
suite, DeathStarBench [4], and used it to study the architectural
characteristics of microservices. Li et al. [5] presented a data
flow-driven approach to semi-automatically decompose cloud services into
microservices. Zhou et al. [20] identified the gap between existing
benchmarks and industrial microservices, and proposed a medium-size
microservice benchmark system. There also exist some efforts on
measuring the performance of microservice-based applications [13],
[21]–[23]. Gribaudo et al. [24] provided a simulation-based approach to
explore the impact of microservice-based architectures on performance
and dependability, given a desired configuration. However, these studies
target CPU microservices and are not applicable to GPU microservices.
Resource scheduling for CPU microservices. There has been a large amount
of prior work on improving utilization while avoiding QoS violations for
CPU microservices. Bao et al. [13] analyzed the performance degradation
of microservices from the perspective of service overhead and developed
a workflow-based scheduler that minimizes end-to-end latency and
improves utilization. Based on the characteristics of the workload,
HyScale and ATOM [25], [26] designed hybrid resource controllers that
combine horizontal and vertical scaling to dynamically partition
resources and improve the response time of microservices. Considering
the complexity of performance prediction, Seer [14] proposed an online
performance prediction system.
Resource management on GPUs. DART [27] employed a pipeline-based
scheduling architecture with data parallelism, where heterogeneous CPUs
and GPUs are arranged into nodes with different parallelism levels.
Laius [15] allocated computational resources to co-located applications
to maximize the throughput of batch applications while guaranteeing the
required QoS of user-facing services. Baymax [8] reordered GPU kernels
to ensure QoS at co-location on time-sharing accelerators. However, none
of them considers the dependency relationship between microservices as
Camelot does. Ignoring the characteristics of the microservice
architecture results in low resource utilization compared with Camelot.
1 The source code is available on GitHub. The link is currently hidden
due to double-blind review but is available on request.
III. REPRESENTATIVE MICROSERVICES
In this section, we describe Camelot suite, a GPU mi-
croservice benchmark suite that includes four representative
end-to-end user-facing GPU microservices. Besides, we design
an artifact benchmark comprised of compute-, memory- and
PCIe-intensive microservices for extensive evaluation. We
build Camelot suite based on four guidelines:
• Functional integrity - The benchmarks should reflect real-world
requirements, show full functionality, and be deployable on real
systems.
• Programming Heterogeneity - The benchmarks should allow programming
language and framework heterogeneity, with each tier developed in the
most suitable language, only requiring a well-designed API for
microservices to communicate with each other.
• Modularity - According to Conway’s Third Law [28], the artifact
benchmarks should be independent and modularized. This modularity
prevents vague boundaries and sets up the “inter-operate, not integrate”
relationship between the artifact benchmarks.
• Representativeness - The computational parts of the microservices
should come from popular open-source applications and state-of-the-art
approaches used in academia and industry.
A. Real system GPU Microservices
Following the above guidelines, we choose user-facing applications that
use common deep learning techniques and implement them in the
microservice architecture. Table I lists the end-to-end user-facing
applications, which cover a wide spectrum of real applications based on
GPU microservices.
TABLE I: End-to-end GPU microservices in Camelot suite
Workload          | Microservices             | Implementation | Language
Img-to-img [29]   | Face recognition          | FR-API [30]    | Python & CUDA
                  | Image enhancement         | FSRCNN [31]    |
Img-to-text [32]  | Image feature extraction  | VGG [18]       | C++ & CUDA
                  | Image caption             | LSTM           |
Text-to-img [33]  | Semantic understanding    | LSTM [16]      | C++ & CUDA
                  | Image generation          | DC-GAN [19]    |
Text-to-text [34] | Text summarization        | BERT [17]      | Python & CUDA
                  | Text translation          | OpenNMT [35]   |
Figure 2 illustrates the tiered view of Camelot Suite, spanning the
query taxonomy it supports and the end-to-end applications it contains.
These applications are widely used in natural language processing
(text-to-text), image processing (img-to-img), image generation
(text-to-img), and image captioning (img-to-text). Their functionalities
are described as follows.
Fig. 2: Tier-level view of the Camelot Suite. [Figure: the query
taxonomy (text query, image query) mapped to the text2text, text2img,
img2img, and img2text applications and their text or image results.]
Fig. 3: Scalability of the artifact benchmarks. (a) Processing time of
the compute-intensive microservice (latency in s versus computational
resource percentage, for c1, c2, and c3). (b) Memory bandwidth of the
memory-intensive microservice (memory bandwidth in GB/s versus
computational resource percentage, for m1, m2, and m3).
Natural language processing applications belong to the text-to-text
class [34], [36] and consist of two GPU microservices. The first
microservice is a text summarization task implemented with BERT. Text
summarization turns a text or a collection of texts into a short summary
that contains the key information. The second microservice is sentence
translation, which translates the text summary output by the first stage
into another language.
Image processing applications belong to the img-to-img class [29], [31],
[37]. The first stage of the application is the face recognition service
based on an open-source project named “face-recognition”. The second
stage is the image enhancement service, which is implemented using the
FSRCNN model. This is an image-based application where users send
requests and upload images to the microservices. The face recognition
service first locates the faces in the image and crops out the tiny
faces. Next, the image enhancement service further processes each tiny
face to generate a high pixel (64×64) face image.
Image generation applications generate new images according to a text
description, and belong to the text-to-img class. For example, if the
input to the neural network is “flowers of pink petals”, the output will
be an image containing these elements. The task consists of two
parts [33], [38]–[41]: (1) Use natural language processing to understand
the description in the input, with LSTM. (2) Use a generative network to
output an accurate image that expresses the text, with a deep
convolutional generative adversarial network (DC-GAN).
Fig. 4: Low throughput and QoS violation. (a) The supported peak
throughput of each microservice if it is assigned a whole GPU (bars
Stage1, Stage2, and Total for img-to-img, text-to-img, img-to-text, and
text-to-text; annotated values: 12827, 3963, 1462, 10000). (b) The QoS
violation of microservices with the balanced deployment policy (bars:
processing time in s of stage1 and stage2, co-located and offline;
stars: normalized 99% tail latency against the 1x QoS target).
Image captioning applications generate a human-readable textual
description for a given image, and belong to the img-to-text class [32],
[42]. The benchmark involves two models: (1) The feature extraction
model. Given an image, it extracts significant features, which are
usually represented by a vector of fixed length. VGG is usually used for
feature extraction. (2) The language model. For image description, a
neural network such as a language model can predict a sequence of words
in a description based on the extracted features. A common method is to
use a recurrent neural network, such as a Long Short-Term Memory (LSTM)
network, as the language model.
B. Artifact Benchmarks for Extensive Study
The artifact benchmarks are ported from three workloads in Rodinia [43]:
one PCIe-intensive, one compute-intensive, and one memory-intensive. By
connecting the artifact benchmarks as needed, we are able to build
various end-to-end GPU microservices. The arithmetic intensities of the
compute-intensive microservice and the memory-intensive microservice can
be configured accordingly. Figure 3 shows the scalability of the
microservices with different compute intensities and memory intensities.
In the figure, c3 is configured to be more compute intensive than c2 and
c1, and m1 is more memory intensive than m2 and m3. The two
microservices are sensitive to the resource allocation, and are thus
suitable for studying resource management for general GPU microservices.
IV. INVESTIGATING GPU MICROSERVICES
We use the real system benchmarks in Camelot suite to
investigate the effectiveness of the current service deployment
methods for GPU microservices. Specifically, we seek to
answer two research questions. 1) Can the current deployment
methods effectively utilize GPU resources? 2) If not, what are
the main factors that result in the inefficiency?
A. Inefficient Microservice Pipeline
We use two Nvidia RTX 2080Ti GPUs as the experimental
platform to perform the investigation. Because our study does
not rely on any specific feature of the 2080Ti, it applies to other
spatial multitasking GPUs.
Fig. 5: Breaking down the end-to-end latency of a query. (a) Img-to-img:
data transfer 41%, processing 59%. (b) Img-to-text: data transfer 32%,
processing 68%. (c) Text-to-img: data transfer 26%, processing 74%.
(d) Text-to-text: data transfer 17%, processing 83%.
Standalone deployment policy deploys each microservice on a standalone
GPU, and relies on cross-GPU data copies to perform the communication
between the microservices. In this experiment, we gradually increase the
load of each benchmark until its 99%-ile latency reaches the QoS target,
and report the peak throughput (i.e., queries per second, QPS) of the
benchmark in Figure 4a. Let T1 and T2 represent the time spent by a user
query on the two microservice stages when the latency of the query
reaches the QoS target. The bar “Total” in the figure shows the peak
throughput of the benchmark, while “Stage1” and “Stage2” show the
achievable throughputs of the two microservice stages when their
processing times are kept shorter than T1 and T2 respectively.
As shown in Figure 4a, the peak supported throughput of a benchmark is
determined by the microservice stage that has the lowest throughput. For
instance, the peak throughputs of img-to-img and img-to-text are
determined by the first microservice stage and the second microservice
stage respectively. Therefore, the standalone deployment policy results
in a low peak throughput for GPU microservices due to the inefficient
microservice pipeline. This is because it does not
consider the differences between the microservices.
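The bottleneck relation above can be illustrated with a minimal Python sketch. The function names and the per-stage throughput numbers are hypothetical and are not measurements from the paper; the sketch only assumes a linear pipeline in which every query visits every stage.

```python
def pipeline_throughput(stage_qps):
    """Peak throughput of a linear pipeline is bounded by its slowest stage."""
    if not stage_qps:
        raise ValueError("pipeline needs at least one stage")
    return min(stage_qps)


def bottleneck_stage(stage_qps):
    """Return the 1-based index of the stage that limits the pipeline."""
    return stage_qps.index(min(stage_qps)) + 1


# Hypothetical per-stage peak QPS (illustrative values only).
stages = [120.0, 450.0]
print(pipeline_throughput(stages))  # 120.0
print(bottleneck_stage(stages))     # 1
```

In this illustration the second stage could serve far more queries than it ever receives, which is the kind of wasted capacity the standalone policy leaves on its GPU.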
Balanced deployment policy is designed based on the fact that a
user-facing application achieves the highest throughput when the
throughputs of its microservice stages are identical. The policy
allocates the computational resources (i.e., SMs) to the microservices
in a fine-grained manner accordingly. The fine-grained allocation is
enabled by the Nvidia Volta MPS technique [11]. To achieve the balanced
deployment, the throughput and processing time of each microservice
stage of each benchmark are profiled offline, and the SM allocation is
carefully tuned so that the throughputs of the two stages are identical,
while still ensuring that the aggregated processing time is shorter than
the QoS target. For instance, if some SMs of the GPU for Stage2 of the
img-to-img benchmark can be allocated to Stage1, the peak throughput of
img-to-img can be
improved.
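As an illustration of this tuning step, the following Python sketch searches SM splits for two stages and keeps the split with the highest bottleneck throughput that still meets the QoS target. It is a simplified sketch under stated assumptions, not the procedure used in the paper: the SMs are modeled as a single pool expressed as 100%, the search is a plain grid search, and the profile functions and numbers are hypothetical.

```python
def balanced_split(stage1, stage2, qos_target_s, step=10):
    """Return (sm1, sm2, bottleneck_qps) maximizing the slower stage's throughput.

    Each stage is a (throughput_fn, latency_fn) pair of callables that map an
    SM percentage to the offline-profiled QPS and processing time in seconds.
    Splits whose summed processing time exceeds the QoS target are skipped.
    Returns None if no split satisfies the QoS target.
    """
    qps1, lat1 = stage1
    qps2, lat2 = stage2
    best = None
    for sm1 in range(step, 100, step):
        sm2 = 100 - sm1
        if lat1(sm1) + lat2(sm2) > qos_target_s:
            continue
        bottleneck = min(qps1(sm1), qps2(sm2))
        if best is None or bottleneck > best[2]:
            best = (sm1, sm2, bottleneck)
    return best


# Hypothetical profiles (not from the paper): stage 1 is four times heavier.
heavy = (lambda sm: 2.0 * sm, lambda sm: 4.0 / sm)
light = (lambda sm: 8.0 * sm, lambda sm: 1.0 / sm)

print(balanced_split(heavy, light, qos_target_s=0.2))  # (80, 20, 160.0)
```

With these made-up profiles the search settles on the split where both stages deliver the same throughput, which is the balance condition the policy aims for.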
Figure 4b shows the QoS violation of the benchmarks when
the balanced deployment policy is adopted. In the figure,
the stars represent the normalized 99%-ile latencies of the
benchmarks (corresponding to the right y-axis). The bars
“stage1 (offline)”, “stage2 (offline)”, “stage1 (co-located)”,
“stage2 (co-located)” represent the offline-profiled processing
time of the first and the second microservice stages, and the
actual processing time of the first and the second microservice
stages respectively (the left y-axis).
Fig. 6: The global memory usage of the first microservice (FR-API in
Table I) in img-to-img with different batch sizes. [Figure: x-axis:
batch size (32, 64, 128, 256, 512); left y-axis: GPU global memory
usage (0 to 12GB); right y-axis: GPU utilization (0 to 100%); a
horizontal line marks the global memory capacity, and an annotation
marks out of memory (21GB).]
We make two observations from Figure 4b. First, all the benchmarks
suffer from QoS violations with the balanced deployment policy. This is
mainly because the microservices on the same GPU contend for PCIe
bandwidth and global memory bandwidth (Figure 1), although the SMs are
explicitly allocated. The unstable runtime contention behavior results
in long tail latency. Second, the actual processing times of both stages
are longer than their offline-profiled processing times due to the
shared resource contention. The unbalanced performance degradations
caused by the contention in turn make the pipeline inefficient. Our
evaluation in Section VIII-D also verifies the necessity of managing
global memory bandwidth contention for GPU microservices.
The current service deployment policies result in low peak
throughputs or QoS violations of GPU microservices due to
the inefficient microservice pipeline, without tuning the SM
allocation online based on runtime contention behaviors.
B. Large Communication Overhead
Besides the inefficient pipeline, the communication over-
head between microservices contributes to the long end-to-
end latency. As shown in Figure 1, microservices communicate
through the main memory. When a microservice m1 sends the
result to the next microservice m2, its data is first transferred
from the global memory used by m1 to the main memory, and
then transferred back to the global memory used by m2, even
if m1 and m2 are on the same GPU. This is because m2 is
not allowed to access m1’s data directly.
Figure 5 shows the breakdown of the end-to-end latencies of the queries
in the benchmarks. As shown in the figure, the communication time takes
a large percentage of the end-to-end latency for all the real
applications. The data transfer time (host to device/device to host)
takes 32.4% to 46.9% of the end-to-end latency. If the long
communication time is eliminated, we can greatly reduce the end-to-end
latency of user queries. In this case, the supported peak load can be
further increased, and
fewer GPU resources are required to support a low load.
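To make the potential saving concrete, the short Python sketch below computes the latency that would remain if the data transfer share of a query were removed entirely. This is an upper bound on the benefit under a simplifying assumption (all transfer time is avoidable, which ignores the initial input and final output copies); the function name is hypothetical, and the shares are the data transfer percentages shown in Fig. 5.

```python
def latency_without_transfer(end_to_end_s, transfer_share):
    """Latency left if the data transfer share of a query is eliminated."""
    if not 0.0 <= transfer_share < 1.0:
        raise ValueError("transfer_share must be in [0, 1)")
    return end_to_end_s * (1.0 - transfer_share)


# Data transfer shares as shown in Fig. 5; latency normalized to 1.0.
fig5_shares = {
    "img-to-img": 0.41,
    "img-to-text": 0.32,
    "text-to-img": 0.26,
    "text-to-text": 0.17,
}
for name, share in fig5_shares.items():
    remaining = latency_without_transfer(1.0, share)
    print(f"{name}: {remaining:.2f}x of the original latency")
```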
C. Limited Global Memory Space
Current machine learning models often use a large batch size to improve
throughput, and the models themselves are large. In this scenario, it is
hard to co-locate microservices on the same GPU due to the limited
global memory space. As an example, Figure 6 shows the global memory
usage and the corresponding GPU utilization when the first microservice
of img-to-img uses different batch sizes. As shown in the figure, the
global memory of a GPU is only able to host the microservice with a
batch size smaller than 256, while the GPU utilization is lower than
25%. In this scenario, we are not able to allocate the remaining free
computational resource of the GPU to other microservices.
The unified memory technique, which automatically swaps data between the
main memory and the global memory, can enable the reallocation. However,
it incurs heavy data transfer through the PCIe bus [44]. The transfer
significantly slows down the communication between microservices
(discussed in Section VI). The limited global memory space of GPUs also
contributes to the inefficiency of microservice pipelines.
Besides the computational resources on each of the GPUs,
the resource allocation for improving the pipeline effect of
GPU microservices has to consider the global memory space
as one of the main constraints.
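The effect of treating global memory as a first-class constraint can be sketched as follows. The Python example uses a plain first-fit placement, which is a stand-in heuristic chosen for illustration and not Camelot's allocation policy; the class, function, and instance footprints are hypothetical, while the 11 GB capacity matches the 2080Ti used in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    mem_gb: float          # free global memory
    sm_pct: int = 100      # free SM share in percent
    hosted: list = field(default_factory=list)

    def fits(self, mem_gb, sm_pct):
        """An instance fits only if both memory and SM share are available."""
        return mem_gb <= self.mem_gb and sm_pct <= self.sm_pct

    def place(self, name, mem_gb, sm_pct):
        self.mem_gb -= mem_gb
        self.sm_pct -= sm_pct
        self.hosted.append(name)


def first_fit(instances, gpus):
    """Place (name, mem_gb, sm_pct) instances; return names that fit nowhere."""
    unplaced = []
    for name, mem_gb, sm_pct in instances:
        for gpu in gpus:
            if gpu.fits(mem_gb, sm_pct):
                gpu.place(name, mem_gb, sm_pct)
                break
        else:
            unplaced.append(name)
    return unplaced


# Two GPUs with 11 GB each; instance footprints are hypothetical.
gpus = [Gpu(mem_gb=11.0), Gpu(mem_gb=11.0)]
left_over = first_fit([("stage1", 9.0, 25), ("stage2", 4.0, 50)], gpus)
print([g.hosted for g in gpus], left_over)
```

In this example the first GPU still has 75% of its SMs free after hosting stage1, yet stage2 must go to the second GPU because only 2 GB of global memory remain.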
V. THE CAMELOT METHODOLOGY
In this section, we propose Camelot, a runtime system that
maximizes the supported peak load of GPU microservices with
limited GPUs and minimizes resource usage at low load while
ensuring the QoS requirements.
A. Design Principles of Camelot
Following the investigation in Section IV, we design Camelot
around three design principles.
(1) Camelot should minimize the communication over-
head between microservices. The CPU-GPU data transfer
between microservices results in the long end-to-end latency.
In addition, the PCIe bandwidth contention between microservice
instances also leads to increased communication overhead and long
latency.
(2) Camelot should maximize pipeline efficiency while
achieving the required QoS online. The pipeline efficiency
is affected by both the percentage of SM resources allocated to
each microservice and the runtime contention behaviors, since
the microservices on the same GPU contend for the shared
resources (e.g., global memory bandwidth).
(3) Camelot should schedule microservices across mul-
tiple GPUs considering the limited global memory space.
Since the global memory space is one of the resource bottlenecks for GPU
microservices, Camelot should be able to use multiple GPUs to host an
end-to-end microservice-based application. Like the SMs, the GPU memory
space is one of the main constraints when scheduling the microservices.
B. Overview of Camelot
Figure 7 shows the design overview of Camelot. As shown in the figure,
Camelot adopts a global memory-based communication mechanism to reduce
the communication overhead between microservices on the same GPU. We
also propose two contention-aware resource allocation policies for
Camelot: one maximizes the supported peak load of an end-to-end
microservice-based application with limited GPUs, and the other
minimizes the resource usage at low load, both while ensuring the
desired 99%-ile latency target.
Fig. 7: Design overview of Camelot. [Figure: user queries enter Camelot,
where performance models feed the contention-aware resource allocation
(Policy 1: maximize peak load; Policy 2: minimize resource usage); the
allocation decision drives a process pool that runs the microservice
pipeline on GPU-0 to GPU-X with global memory-based communication.]
The global memory-based communication eliminates the back-and-forth data
transmissions between CPU memory and the global memory of the GPU for
microservice communication (Section VI). It does so by passing only the
handle of the data to be transferred, which stays in the global memory,
to the receiver. The mechanism resolves the long communication
overhead that results in the long end-to-end latency.
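The idea of passing a handle instead of the data can be illustrated with a host-side analogy. The Python sketch below uses the standard library's `multiprocessing.shared_memory` rather than GPU global memory, so it is an analogy under stated assumptions and not Camelot's implementation; the function name is hypothetical. On a GPU, the CUDA runtime offers a comparable handle-based facility through `cudaIpcGetMemHandle` and `cudaIpcOpenMemHandle`; whether Camelot relies on it is not stated in this section.

```python
from multiprocessing import shared_memory


def pass_by_handle(payload: bytes):
    """Share a buffer by name (the 'handle') instead of copying it to the receiver."""
    # Sender side: the result is written once into a shared segment.
    sender = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        sender.buf[:len(payload)] = payload
        handle = sender.name  # only this short string is sent to the receiver

        # Receiver side: attach to the same segment through the handle.
        receiver = shared_memory.SharedMemory(name=handle)
        try:
            received = bytes(receiver.buf[:len(payload)])
        finally:
            receiver.close()
    finally:
        sender.close()
        sender.unlink()  # release the segment once the receiver is done
    return handle, received


handle, data = pass_by_handle(b"M1 result")
print(data)
```

In a real deployment the sender and the receiver would be separate processes, and only the handle would travel between them; both sides are kept in one function here so the sketch stays self-contained.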
The two resource allocation policies allocate GPU computational
resources (i.e., SMs) to the microservices based on the performance of
each microservice with various resource configurations, and the runtime
contention behaviors (Section VII). The challenging part here is that
Camelot needs to constrain the degradation due to the runtime
contention. Otherwise, the user-facing service would suffer from QoS
violations. By considering both the performance and the contention, the
two policies handle the inefficient pipeline and the shared resource
contention.
Specifically, when a user query q is submitted, it is processed in the
following steps. 1) The query q is pushed into a wait queue, where it
waits to be issued to the GPU. 2) Once enough queries are received or
the first query in the queue is about to suffer a QoS violation, the
queued queries are batched and issued. 3) According to the batch size,
Camelot calculates the GFLOPs (giga floating-point operations) of the
batch of queries, and predicts the global memory usage, global memory
bandwidth usage, processing time, and throughput (executed requests per
second) of each microservice stage under various computational resource
quotas. The prediction is based on an offline-trained performance
model. 4) Based on the prediction, Camelot identifies the percentage of
computational resources that should be allocated to each microservice
and determines the number of instances for each microservice stage.
5) When co-locating these microservice instances, Camelot considers the
reduced communication overhead with the global memory-based
communication, the contention on the global memory bandwidth, and the
limited global memory space on each GPU. Camelot uses the process pool
technique proposed
in Laius [15] to realize the dynamic SM allocation.
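Steps 1) and 2) above can be sketched as a small batching queue. The Python code below is a minimal sketch under stated assumptions: the class name, the fixed estimated processing time, and the rule "issue when the oldest query's waiting time plus the estimated processing time reaches the QoS target" are illustrative choices, since this section does not specify how Camelot detects an imminent QoS violation.

```python
from collections import deque


class QueryBatcher:
    """Queue queries and decide when a batch should be issued to the GPU."""

    def __init__(self, max_batch, qos_target_s, est_processing_s):
        self.max_batch = max_batch
        self.qos_target_s = qos_target_s
        self.est_processing_s = est_processing_s  # assumed known in advance
        self.queue = deque()  # entries are (arrival_time, query)

    def submit(self, query, now):
        self.queue.append((now, query))

    def should_issue(self, now):
        if not self.queue:
            return False
        if len(self.queue) >= self.max_batch:
            return True  # enough queries have been received
        oldest_arrival, _ = self.queue[0]
        waited = now - oldest_arrival
        # Issue early if waiting longer would push the oldest query past QoS.
        return waited + self.est_processing_s >= self.qos_target_s

    def pop_batch(self):
        n = min(self.max_batch, len(self.queue))
        return [self.queue.popleft()[1] for _ in range(n)]


batcher = QueryBatcher(max_batch=4, qos_target_s=1.0, est_processing_s=0.3)
batcher.submit("q0", now=0.0)
print(batcher.should_issue(now=0.1))  # False: the batch is small and q0 can wait
print(batcher.should_issue(now=0.8))  # True: q0 would otherwise miss its target
```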
VI. REDUCING COMMUNICATION TIME
In this section, we present a global memory-based communication
mechanism that enables fast communication between microservices on the
same GPU.
Fig. 8: Comparison of the default and the proposed global memory-based
communication mechanisms. (a) The default mechanism. (b) The proposed
mechanism. [Figure: microservices M1, M2, and M3 on one GPU exchange
data through CPU memory in (a) and through GPU global memory in (b).]
A. Characterizing the Contention on PCIe Bus
Figure 8 compares the traditional main memory-based communication and
the proposed global memory-based communication mechanism between
microservices. During the execution of GPU microservices, since the
input of the next stage depends on the output of the previous stage, the
results of a microservice stage must be transferred to the next stage.
As shown in Figure 8(a), adjacent microservices in the pipeline (e.g.,
M1 and M2, M2 and M3) communicate with each other by copying data back
and forth between GPU global memory and the CPU memory. The default
communication mechanism results in long communication latency and low
data transfer bandwidth, especially when multiple microservices co-run
on the same GPU.
To show the impact of the default communication mechanism, we perform an
experiment that runs multiple instances of a PCIe-intensive microservice
M concurrently on a GPU. M copies 5GB of data from the main memory to
the global memory. In the experiment, each instance of M is allocated
only 10% of the computational resource to eliminate the impact of
contention on the SMs. Figure 9 shows the data transfer time over the
PCIe bus of an instance of M. In the figure, the x-axis shows the number
of instances of M on the GPU.
As shown in Figure 9, the data transfer time increases
when more than three instances are co-located. The increased
data transfer time is due to the contention on the PCIe
bandwidth. The theoretical peak bandwidth of the 16x PCIe
3.0 bus used in our platform is 15,800MB/s and the effective
bandwidth is 12,160MB/s [45], while a single memcpy task uses
PCIe bandwidth of 3,150MB/s according to our measurement.
If the memcpy task transfers data from pinned memory, a
single such memcpy task is able to consume all the PCIe
bandwidth.
In this scenario, if a GPU hosts more than $\lfloor 12160 / 3150 \rfloor = 3$
PCIe-intensive microservice instances, the microservices contend
for the limited PCIe bandwidth and suffer from the long
communication time. The long communication time results in
the long end-to-end latency of user queries.
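The threshold above follows from simple arithmetic on the reported bandwidths, and the short sketch below reproduces it. The function names are hypothetical, the 5GB payload is taken as 5 x 1024 MB, and the even-sharing model in `transfer_time_s` is an assumption made only for illustration; the measured times are the ones reported in Figure 9.

```python
import math

EFFECTIVE_PCIE_MBPS = 12_160  # effective bandwidth of the 16x PCIe 3.0 bus [45]
SINGLE_MEMCPY_MBPS = 3_150    # measured bandwidth of a single memcpy task


def max_uncontended_instances(bus_mbps: float, task_mbps: float) -> int:
    """Largest number of memcpy tasks whose combined demand fits on the bus."""
    return math.floor(bus_mbps / task_mbps)


def transfer_time_s(data_mb: float, instances: int,
                    bus_mbps: float = EFFECTIVE_PCIE_MBPS,
                    task_mbps: float = SINGLE_MEMCPY_MBPS) -> float:
    """Per-instance transfer time, ASSUMING the bus is shared evenly once saturated."""
    per_task_mbps = min(task_mbps, bus_mbps / instances)
    return data_mb / per_task_mbps


if __name__ == "__main__":
    print(max_uncontended_instances(EFFECTIVE_PCIE_MBPS, SINGLE_MEMCPY_MBPS))
    for k in (1, 3, 4, 8):
        print(k, round(transfer_time_s(5 * 1024, k), 2))
```

Under this simplified model the per-instance time stays flat up to three instances and grows from the fourth instance onward, which matches the trend described for Figure 9.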
[Fig. 9: The kernel processing time and the PCIe transfer time for PCIe-intensive microservice. x-axis: the number of instances (1 to 10); y-axis: time (s); series: memcpy time.]
[Fig. 10: The global memory-based communication. The host processes of M1 and M2 exchange a global memory handle through host shared memory (shm_head, shm_data1 ... shm_datan); M2 then accesses the M1 result in GPU global memory by the handle.]
B. Global Memory-Based Communication
As Figure 8(a) shows, the data that should be passed
from M1 to M2 is already in the global memory space of
M1, although the data is not accessible to M2. If M1 is
able to share the data with M2, the expensive memcpy (from
device to host, and from host to device) can be eliminated. We
design a global memory-based communication mechanism as
shown in Figure 8(b) to achieve this purpose. In more detail,
adopting the global memory-based mechanism, the result of a
microservice M is temporarily stored in the global memory.
Another microservice is able to access the data from the global
memory directly without copying data back and forth between
the global memory and the main memory.
Figure 10 illustrates the design of global memory-based
communication mechanism. As shown in the figure, when a
microservice M1 needs to pass its result to microservice M2
on the same GPU, its process on the host passes a global
memory handle (8 bytes) to the process of M2 on the CPU
side. Once M2 gets the data handle, it is able to directly
access the data from the global memory. We implement the
mechanism using the CUDA IPC (inter-process communica-
tion) technique provided by Nvidia. The sender process gets
the IPC handle for a given global memory pointer using
cudaIpcGetMemHandle(), passes it to the receiver process
using standard IPC mechanisms on the host side, and the
receiver process uses cudaIpcOpenMemHandle() to retrieve
the device pointer from the IPC handle.
Figure 11 shows the communication time between two
microservices on the same GPU using the default and the
global memory-based mechanisms. In the figure, the two
microservices do not contend for the PCIe bandwidth. As the
figure shows, the global memory-based mechanism greatly
reduces the communication time when the data to be passed
is larger than 0.02MB. The larger the data to be transferred,
the larger the performance gain achieved with the global
memory-based mechanism. In contrast, if the data to be
transferred between two microservices is small (e.g., only 2 bytes),
the traditional main memory-based mechanism takes less time.
This is mainly because CUDA IPC incurs a slight fixed overhead
when probing, transferring, and decoding the IPC handle in
the global memory-based communication mechanism.
[Fig. 11: Communicating with the main memory-based and global memory-based mechanisms. x-axis: amount of data transferred (MB); y-axis: time (ms); series: main-memory-based, global-memory-based.]
Besides reducing the communication time, the mechanism
also reduces the global memory space usage of the microser-
vices. With the traditional mechanism, M1 and M2 save two
copies of the transferred data. On the contrary, with the global
memory-based mechanism, only M1 saves a single copy of
the transferred data. M1 and M2 each also save an IPC handle
of 8 bytes. Since the data transferred between
microservices is often larger than 8 bytes, the global memory-based
mechanism does not consume extra global memory
space. Instead, it reduces the global memory usage.
It is worth noting that the microservices on different GPUs
are not able to communicate through the global memory-based
mechanism. Therefore, the microservices that require heavy
communication should be placed on the same GPU.
VII. ALLOCATING GPU RESOURCES
In this section, we present two contention-aware resource
allocation policies for GPU microservices. The first policy
maximizes the supported peak load of GPU microservices with
limited GPUs while avoiding QoS violation. The second policy
minimizes GPU resource usage while ensuring the QoS
when the load of a service is low.
A. Low Overhead Performance Prediction
Camelot predicts the processing duration, the global mem-
ory bandwidth usage, and the throughput of each microservice
to support the two resource allocation policies. The throughput
represents the number of queries that can be processed per
second at a microservice. For each microservice, we train a
performance model that predicts its processing duration, global
memory bandwidth usage, and throughput.
The model for a microservice takes its input batchsize and
percentage of computational resources as the input features,
as they strongly affect the microservice's performance. The
input batchsize reflects the workload of a query, and the
percentage of computational resources reflects the computational
ability used to process the query. These features
can be collected by profiling tools such as Nsight Compute
provided by Nvidia [46]. To collect training samples for a
microservice, we submit queries with different batch sizes,
execute them with different computational resource quotas
and collect the corresponding duration. During the profiling,
queries are executed in solo-run mode to avoid interference
due to shared resource contention.
[Fig. 12: Errors of predicting duration, global memory bandwidth and throughput with DT, LR, and RF. y-axis: prediction error (%); series: processing time, global memory bandwidth, throughput.]
Since the QoS target of a user query is hundreds of
milliseconds to support smooth user interaction [47], it is
crucial to choose the modeling technique that shows both
high accuracy and low complexity. We evaluate a spectrum
of broadly used low latency algorithms for the microservice
performance prediction: Linear regression (LR) [48], Decision
Tree (DT) [49], and Random forest (RF) [50].
To evaluate the accuracy of the three modeling techniques,
we use 70% of the collected samples to train the models
and use the rest for testing. Figure 12 presents the prediction
errors of the duration, global memory bandwidth usage, and
throughput of each microservice in Camelot suite with LR,
DT, and RF. In general, DT and RF show high accuracy for
predicting the microservice performance. Besides accuracy,
we also measure the execution time of the different prediction
models. Predicting with DT takes less than 1 ms,
while the RF model takes more than 5 ms. We therefore
choose DT as the modeling technique to train the performance
models. Besides, Camelot also predicts the FLOPs (floating
point operations) and the required global memory space of the
microservices with different workloads. LR is able to precisely
capture such linear relationships.
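To illustrate the last point, the sketch below fits a one-feature linear model with ordinary least squares and evaluates it on a 70%/30% split, mirroring the evaluation protocol described above. It is only a sketch: the samples are synthetic (a made-up linear footprint as a function of batch size, not measurements from the paper), and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_abs_pct_error(model, xs, ys):
    """Mean absolute percentage error of the fitted line on (xs, ys)."""
    a, b = model
    return 100.0 * sum(abs((a * x + b) - y) / y for x, y in zip(xs, ys)) / len(xs)


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them into 70% training and 30% testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# SYNTHETIC samples (batch size, global memory footprint in MB); not measurements.
samples = [(s, 300.0 + 12.5 * s) for s in range(1, 65)]
train, test = split_70_30(samples)
model = fit_line([s for s, _ in train], [m for _, m in train])
error = mean_abs_pct_error(model, [s for s, _ in test], [m for _, m in test])
print(model, error)
```

Because the synthetic footprint is exactly linear in the batch size, the fitted slope and intercept recover the generating values and the test error is essentially zero; real profiles would carry measurement noise.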
We do not use black box methods, such as Reinforcement
Learning [51] or Bayesian optimization [52], to predict the
performance of microservices online, because in-production
GPUs lack the ability to obtain runtime statistics online with
low overhead. In a datacenter, it is acceptable to profile a
service and build a new model before running it permanently.
Similar to prior work on datacenters [15], the profiling is done
offline so it does not incur runtime overhead.
B. Case 1: Maximizing Peak Load
The peak load of an end-to-end service is determined by the
smallest peak load of its microservices. Therefore, the design
principle here is maximizing the smallest throughput of the
microservices in an end-to-end service, while still ensuring the
end-to-end latency shorter than the QoS target. Camelot tunes
the number of microservice instances for each microservice
stage, and the SM resource quota for each microservice
instance. Other resources (such as global memory bandwidth)
cannot be explicitly allocated.
We formalize the above problem as a single-objective
optimization problem, where the objective function maximizes
the smallest throughput of the microservices, and
the constraints are the global memory capacity, the global memory
bandwidth, the computational resources on the GPUs, and the QoS
target of the microservices. The number of instances
for each microservice stage and the resource quota allocated
to each process are derived from the feasible solutions of this
optimization problem.
TABLE II: The variables used in the optimization problem
Variable | Variable description | Provided by
Ai | the ith part of Microservice A | Benchmarks
pi | the computational resource quotas allocated to the ith microservice | Section VII-B
s | the batchsize of Microservice Ai | scheduler
C | the total number of GPUs | Section VII-C
BW | the available global memory bandwidth | Nvprof
I | the maximal client CUDA contexts supported by Volta MPS server per-device | Volta MPS
R | the overall computational resources | Nvprof
Ni | the number of the i-th microservice's processes | scheduler
f(pi) | the predicted throughput of Ai | Section VII-A
g(pi) | the predicted global memory bandwidth usage of Ai | Section VII-A
M(i, s) | the global memory footprint of Ai with batch size s | Section VII-A
C(i, s) | the amount of calculations of Ai with batchsize s | Section VII-A
G | the GFLOPS of the used GPU | Nsight Compute
F | the global memory capacity of the used GPU | Nvprof
The constraints in the optimization problem are as follows.
First, to avoid global memory bandwidth contention, the
accumulated global memory bandwidth required by all the
microservices on a GPU should be less than the available
global memory bandwidth of the GPU. Second, the accumu-
lated computational resource quotas allocated to concurrent
instances should not exceed the overall available computational
resources. Third, the number of microservice instances
on a GPU should not exceed 48 (Volta MPS allows at most
48 client-server connections per device). Fourth, the end-to-end
time required by the whole user-facing application should be smaller
than the QoS target.
Assume a user-facing GPU application Q has n microservice
stages. Equation 1 shows the objective and the constraints of
the optimization problem. Table II lists the variables used to
maximize the supported peak load by solving the optimization
problem in Equation 1.
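\[
\begin{aligned}
\text{Objective:}\quad & \max\Big(\min_{i=1}^{n} N_i \times f(p_i)\Big) \\
\text{Constraint-1:}\quad & \sum_{i=1}^{n} N_i \times p_i \le C \times R, \quad 0 \le p_i \le R \\
\text{Constraint-2:}\quad & \sum_{i=1}^{n} N_i \le C \times I, \quad 0 \le N_i \le I \\
\text{Constraint-3:}\quad & \sum_{i=1}^{n} N_i \times b(p_i) \le BW \\
\text{Constraint-4:}\quad & \sum_{i=1}^{n} N_i \times M(i, s) \le F \\
\text{Constraint-5:}\quad & \sum_{i=1}^{n} g(p_i) \le QoS
\end{aligned}
\tag{1}
\]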
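As a minimal sketch, the objective and constraints of Equation 1 can be written as a feasibility check plus a scoring function. The class and function names, the toy predictors, and the toy platform values are all hypothetical; the sketch follows the equation as printed, reading b(pi) as the per-instance global memory bandwidth and g(pi) as the per-stage processing duration, which is how the constraints use them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    """Per-stage predictors; all callables are hypothetical stand-ins."""
    throughput: Callable[[float], float]  # f(p_i), queries/s per instance
    bandwidth: Callable[[float], float]   # b(p_i), GB/s per instance
    duration: Callable[[float], float]    # g(p_i), ms per query
    mem_footprint: float                  # M(i, s), GB per instance


@dataclass
class Platform:
    num_gpus: int        # C
    quota_max: float     # R, compute quota of one GPU (100 means 100%)
    max_clients: int     # I, MPS client limit per GPU
    bandwidth: float     # BW
    mem_capacity: float  # F
    qos_ms: float        # QoS target


def is_feasible(stages, n, p, hw):
    """Check Constraint-1 to Constraint-5 for instance counts n and quotas p."""
    if any(not (0 <= pi <= hw.quota_max) for pi in p):
        return False
    if any(not (0 <= ni <= hw.max_clients) for ni in n):
        return False
    return (
        sum(ni * pi for ni, pi in zip(n, p)) <= hw.num_gpus * hw.quota_max
        and sum(n) <= hw.num_gpus * hw.max_clients
        and sum(ni * st.bandwidth(pi) for st, ni, pi in zip(stages, n, p)) <= hw.bandwidth
        and sum(ni * st.mem_footprint for st, ni in zip(stages, n)) <= hw.mem_capacity
        and sum(st.duration(pi) for st, pi in zip(stages, p)) <= hw.qos_ms
    )


def peak_load(stages, n, p):
    """Objective of Equation 1: aggregate throughput of the slowest stage."""
    return min(ni * st.throughput(pi) for st, ni, pi in zip(stages, n, p))


def toy_stage(rate: float) -> Stage:
    """Build a stage whose predictors are simple made-up formulas."""
    return Stage(
        throughput=lambda p: rate * p,
        bandwidth=lambda p: 2.0 * p,
        duration=lambda p: 1000.0 / (rate * p) if p > 0 else float("inf"),
        mem_footprint=1.5,
    )


# Toy two-stage pipeline on a toy two-GPU platform (illustrative values only).
stages = [toy_stage(1.0), toy_stage(2.0)]
hw = Platform(num_gpus=2, quota_max=100.0, max_clients=48,
              bandwidth=616.0, mem_capacity=11.0, qos_ms=100.0)
print(is_feasible(stages, [2, 1], [50.0, 50.0], hw), peak_load(stages, [2, 1], [50.0, 50.0]))
```

In the toy setting, giving the slower first stage two instances balances the two stages, so neither limits the pipeline more than the other.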
C. Case 2: Minimizing Resource Usage
In this policy, Camelot first minimizes the number of GPUs
required to support the low load, and then minimizes the
resource usage in each of the GPUs. This design choice is
able to reduce the search space for resolving the optimization
problem described later.
To determine the minimum number of GPUs required,
Camelot relies on the predicted number of floating point
operations and the global memory footprint of microservices
with different loads (C(i, s) and M(i, s) in Table II). Based
on the rated GFLOPS (Giga floating-point operations per
second) and the global memory capacity of a GPU, Equation 2
calculates the minimum number of GPUs y required. In the
equation, G and F represent the GFLOPS and the global
memory capacity of the used GPU, respectively. As the
equation shows, the minimum number of GPUs required is
calculated under the constraints of both the computational ability
and the global memory space.
[Fig. 13: The deployment scheme of Camelot. Step 1: rough pruning of [N1 ... Nn] and [P1 ... Pn]. Step 2: placing the instances on a sorted GPU list (GPU1, GPU2, GPU3).]
\[
y = \max\left( \frac{\sum_{i=1}^{n} C(i, s)}{G}, \; \frac{\sum_{i=1}^{n} M(i, s)}{F} \right)
\tag{2}
\]
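Equation 2 can be sketched in a few lines. The function name and the example numbers are hypothetical. Two points are assumptions that the equation itself does not state: the result is rounded up to a whole number of GPUs (with at least one GPU), and C(i, s) and G are expressed in compatible units so that their ratio counts GPUs.

```python
import math


def min_gpus(calculations, footprints, gpu_compute, gpu_mem):
    """Equation 2: the larger of the compute-bound and memory-bound GPU counts."""
    compute_bound = sum(calculations) / gpu_compute  # sum of C(i, s) over G
    memory_bound = sum(footprints) / gpu_mem         # sum of M(i, s) over F
    # ASSUMPTION: round up to whole GPUs and never return fewer than one.
    return max(1, math.ceil(max(compute_bound, memory_bound)))


# Illustrative values only: three stages whose footprints sum to 18 GB on 11 GB GPUs.
y = min_gpus(calculations=[3.0, 2.5, 1.0], footprints=[5.0, 9.0, 4.0],
             gpu_compute=10.0, gpu_mem=11.0)
print(y)
```

In this example the memory term (18 / 11) dominates the compute term (6.5 / 10), so the global memory capacity decides the GPU count.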
Equation 3 shows the objective and the constraints that further
reduce the resource usage in the y GPUs. When choosing
the batch size to run microservices, Camelot considers the
global memory footprint of different batch sizes (M(i, s)). When
global memory resources are scarce, an excessive batch size
puts pressure on the global memory space. Therefore, the batch
size should also be considered as a variable when determining
the resource allocation in the next stage.
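\[
\begin{aligned}
\text{Objective:}\quad & \min\Big(\sum_{i=1}^{n} N_i \times p_i\Big), \quad 0 \le p_i \le R \\
\text{Constraint-1:}\quad & \sum_{i=1}^{n} N_i \le I, \quad 0 \le N_i \le I \\
\text{Constraint-2:}\quad & \sum_{i=1}^{n} N_i \times b(p_i) \le BW \\
\text{Constraint-3:}\quad & \sum_{i=1}^{n} N_i \times M(i, s) \le F \\
\text{Constraint-4:}\quad & \sum_{i=1}^{n} g(p_i) \le QoS
\end{aligned}
\tag{3}
\]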
By solving the two optimization problems, Camelot finds
out the resource quota for each microservice stage, and the
number of instances for each microservice stage. We currently
adopt the simulated annealing algorithm [53] to solve the
optimization problems.
In more detail, we maintain a vector V of length 2N,
V = [n1, n2, ..., nN, p1, p2, ..., pN], where N is the number of microservice
stages. For microservice stage i, we deploy ni instances
and allocate pi percent of the computing resources to each
instance. The computing resources of an entire
GPU amount to 100%. Similar to the traditional simulated annealing
algorithm, Camelot iterates to search for an
optimal V. In each iteration, the current state V
randomly moves in one direction to obtain a new candidate state
V′. Camelot checks whether V′ meets the constraints,
such as the memory bandwidth constraint (as shown in Equation 3). If
the new state is valid, Camelot calculates the throughput
of the new state and compares it with the global optimal
throughput. If the new state's throughput is higher, Camelot
updates the global optimal throughput. If not, Camelot may still
accept the throughput of V′ as the global optimal throughput
with a certain acceptance probability. The
acceptance probability decreases as the iterations proceed.
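The search loop described above can be sketched as follows. This is a textbook simulated annealing variant, not Camelot's implementation: it tracks the current state and the best state separately, uses a geometric cooling schedule, and runs on a toy two-stage problem. The names `anneal`, `score`, `is_valid`, `neighbor`, and `RATES`, as well as the step sizes and the single-GPU quota constraint, are assumptions made for illustration.

```python
import math
import random


def anneal(score, is_valid, initial, neighbor,
           steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Maximize score(state) over valid states with simulated annealing."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if is_valid(candidate):
            cand_score = score(candidate)
            delta = cand_score - current_score
            # Always accept improvements; accept worse states with a
            # probability that shrinks as the temperature cools.
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling
    return best, best_score


RATES = (1.0, 3.0)  # hypothetical queries/s per 1% of SMs for each stage


def score(v):
    """Pipeline throughput of V = (n1, n2, p1, p2): the slowest stage."""
    n1, n2, p1, p2 = v
    return min(n1 * p1 * RATES[0], n2 * p2 * RATES[1])


def is_valid(v):
    """Toy constraint: quotas of all instances fit in one GPU (100%)."""
    n1, n2, p1, p2 = v
    return (min(n1, n2) >= 1 and min(p1, p2) >= 5 and max(p1, p2) <= 100
            and n1 * p1 + n2 * p2 <= 100)


def neighbor(v, rng):
    """Move one randomly chosen dimension of V by one step."""
    v = list(v)
    i = rng.randrange(len(v))
    step = 1 if i < 2 else 5  # instance counts move by 1, quotas by 5%
    v[i] += rng.choice((-step, step))
    return tuple(v)


best, best_score = anneal(score, is_valid, (1, 1, 10, 10), neighbor)
print(best, best_score)
```

For this toy problem the best achievable score is 75 (75% of the SMs for stage 1 and 25% for stage 2), which bounds whatever the search returns; the exact state found depends on the random seed.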
D. Deployment scheme across multiple GPUs
Distributing microservice instances to multiple GPUs takes
two steps. The first step again searches for the number
of instances for each microservice stage and the computing
resource quota allocated to each instance. The second step
finds a deployment scheme according to the number of
instances for each stage and the computing resource quotas found in
the first step. However, it is impractical to search exhaustively
for the optimal deployment scheme for all instances. To speed
up the entire search process, we use a specific deployment
strategy to quickly find a reasonable deployment scheme,
as shown in Figure 13.
A GPU has multiple resource dimensions including com-
puting resource, global memory capacity, global memory
bandwidth and PCIe bandwidth, etc. When deploying the
instances of a microservice stage, we sort the remaining GPUs
according to their available resources. The partial ordering of
resources during GPU sorting is related to the characteristics
of the microservice. The experiments in
Section IV show that, for GPU microservices, the global
memory capacity becomes the major resource bottleneck.
Therefore, Camelot sets the global memory capacity as the
highest priority resource in the deployment scheme. For example,
for applications that take up a lot of global memory space,
the GPUs are sorted according to the size of their remaining
global memory. If two GPUs have the same amount of remaining
global memory, they are sorted according to the other
resource dimensions.
GPUs with fewer remaining resources have higher priority, and
Camelot tries to deploy instances on the GPU with the highest priority
first. In this way, Camelot avoids excessive fragmentation
of the resources available in the resource pool. In addition,
deploying instances of the same stage on the same GPU as
much as possible allows multiple instances to share models,
reducing the consumption of GPU global memory, which is
often the most stressed resource during allocation.
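The placement strategy described above can be sketched as a greedy loop that sorts the GPUs by remaining resources before placing each instance. The `Gpu` class, the `place_instances` function, and the example capacities are hypothetical, and the sketch makes two simplifications: only global memory and SM quota are tracked, and model sharing between instances of the same stage is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    free_mem: float    # remaining global memory (GB)
    free_quota: float  # remaining SM quota (%)
    placed: list = field(default_factory=list)


def place_instances(gpus, instances):
    """Place (stage, mem, quota) instances, trying the fullest GPU first."""
    for stage, mem, quota in instances:
        # Fewer remaining resources means higher priority; global memory is
        # the primary sort key and the SM quota breaks ties.
        for gpu in sorted(gpus, key=lambda g: (g.free_mem, g.free_quota)):
            if gpu.free_mem >= mem and gpu.free_quota >= quota:
                gpu.free_mem -= mem
                gpu.free_quota -= quota
                gpu.placed.append(stage)
                break
        else:
            raise RuntimeError(f"no GPU can host an instance of {stage}")
    return gpus


# Illustrative values only: two GPUs with 11 GB each, five instances to place.
gpus = [Gpu("GPU1", 11.0, 100.0), Gpu("GPU2", 11.0, 100.0)]
instances = [("stage1", 4.0, 25.0)] * 3 + [("stage2", 3.0, 25.0)] * 2
place_instances(gpus, instances)
for gpu in gpus:
    print(gpu.name, gpu.placed, gpu.free_mem, gpu.free_quota)
```

In this example the first GPU is filled completely (its 11 GB are used by two stage-1 instances and one stage-2 instance) before the second GPU receives the remaining instances, which is the fragmentation-avoiding behavior the text describes.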
VIII. EVALUATION OF CAMELOT
In this section, we evaluate the effectiveness of Camelot in
maximizing the supported peak load and minimizing resource
usage at low load, while ensuring the required QoS.
We evaluate Camelot on a machine equipped with two
Nvidia RTX 2080Ti GPUs and a DGX-2 machine [54] that
is equipped with Nvidia V100 GPUs. Table III summarizes the
detailed software and hardware experimental configurations.
Camelot does not rely on any special hardware features of
2080Ti or V100, and can be easily set up on other GPUs
with the Volta or Turing architecture. The peak global memory
bandwidths of the 2080Ti and V100 GPUs are 616 GB/s
and 897 GB/s, respectively [55]. They are used as constraints
in the resource allocation policies. We use both the real
system benchmarks and the artifact benchmarks in Camelot
suite as user-facing GPU microservices. Except for the
large-scale evaluation in Section VIII-F, we report the experimental
results on the machine equipped with two 2080Ti GPUs.
[Fig. 14: The supported peak loads of the benchmarks with EA, Laius and Camelot. The stars show the normalized 99%-ile latencies of the benchmarks with Camelot (corresponding to the right y-axis). Panels: (a) Img-to-img, (b) Img-to-text, (c) Text-to-img, (d) Text-to-text. x-axis: batchsize; left y-axis: throughput (query/s); right y-axis: normalized 99% tail latency.]
TABLE III: Hardware and software specifications.
Item | Specification
Hardware | Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz; Two Nvidia GeForce RTX 2080Ti
Hardware | Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz; NVIDIA DGX-2 with 16 Tesla V100s-SXM3
Software | Ubuntu 16.04.5 LTS with kernel 4.15.0-43-generic; CUDA Driver 410.78; CUDA SDK 10.0; CUDNN 7.4.2
Since we do not find prior work on resource management
for GPU microservices, we compare Camelot with the Even
allocation (“EA” for short) policy and with Laius [15], which is
proposed for managing the applications co-located on spatial
multitasking GPUs. EA evenly allocates all the GPU resources
to the microservices in a user-facing application. On a spatial
multitasking GPU, Laius predicts the computational resource
required by a user-facing query and dynamically reallocates
the remaining computational resources to batch applications
for maximizing their throughputs. Since Laius is designed
for the single-GPU situation, we schedule the microservices of a
benchmark on a single GPU with Laius. The total throughput
of the benchmark with Laius is calculated by aggregating the
throughputs on all the GPUs.
A. Maximizing the Supported Peak Load
In this subsection, we evaluate Camelot in maximizing the
supported peak load while ensuring the required QoS with a
given number of GPUs.
Figure 14 shows the supported peak loads of the bench-
marks normalized to their QoS targets with EA, Laius and
Camelot, while ensuring the 99%-ile latency target. In the
figure, the x-axis shows the batch size of processing user
queries. Camelot increases the supported peak loads of the
benchmarks by 12% to 73.9% compared with EA, and by
10% to 64.5% compared with Laius.
EA results in low peak loads of the benchmarks because
it does not consider the pipeline effect of the microservices.
While the peak load of a benchmark is determined by
the peak load of the microservice stage that shows the
lowest throughput, EA's resource allocation does not balance
the throughputs of the microservice stages. In addition, the
benchmarks achieve slightly higher peak load with Laius
compared with EA. This is mainly because we already optimize
Laius to balance the throughputs of the microservice stages.
However, Laius still performs worse than Camelot, because
it does not schedule microservice instances across multiple
GPUs as Camelot does. In this case, the microservices suffer
from higher contention with Laius compared with Camelot. In
addition, the benchmarks suffer from the long communication
overhead without the global memory-based communication in
EA and Laius.
[Fig. 15: The detailed resource allocation with Camelot. (a) The number of instances for each microservice stage. (b) The percentage of SMs allocated to each microservice. x-axis: test cases 1-16; series: stage1, stage2.]
In more detail, Figure 15 shows the number of instances in
each microservice stage, and the percentage of SMs allocated
to each microservice instance with Camelot. The 16 test cases
in Figure 14 are numbered 1-16 for simplicity in
Figure 15. As this figure shows, for the microservice
stage that has a long processing time (e.g., stage 1 for
img-to-img), Camelot automatically creates more instances to
increase its total throughput. In this way, Camelot improves
the pipeline efficiency of GPU microservices.
B. Minimizing Resource Usage
Figure 16 shows the normalized GPU resource usage of
the benchmarks at low load and the corresponding 99%-ile
latency with Camelot and Laius. We use 30% of
the peak load as the low load in the experiment, as reported
by Google's research [1]. In this figure, the resource usage is
normalized to the scenario in which each microservice stage uses
an individual GPU. The experiments with other loads show
similar results.
Observed from Figure 16, Camelot reduces the GPU re-
source usage by 46.5% on average while ensuring the QoS
of all the benchmarks. Camelot is effective in this scenario
because it precisely predicts the duration of a microservice
with different GPU resource configurations, and schedules mi-
croservice instances considering the runtime shared resource
contention (global memory bandwidth, and PCIe bandwidth).
Laius also reduces the resource usage compared with the naive
deployment by 20.2% on average. However, because it does
not optimize the inter-microservice contention and does not
adjust the number of instances for each microservice stage,
it requires more resources than Camelot to ensure the QoS of
user-facing applications. Camelot reduces the GPU resource
usage by 35%, while Laius results in slight QoS violation for
3 out of the 4 benchmarks.
[Fig. 16: The efficiency of Camelot and Laius in reducing the resource usage. x-axis: benchmarks (imgtoimg, imgtotext, texttoimg, texttotext); left y-axis: normalized minimized resources; right y-axis: normalized 99%-ile latency; series: Camelot-stage1, Camelot-stage2, Laius-stage1, Laius-stage2, Camelot 99%-ile latency, Laius 99%-ile latency.]
[Fig. 17: The resource usages of the benchmarks under different load levels with Camelot, and the 99%-ile latencies of the benchmarks with Camelot, and Camelot-NC. x-axis: load levels 1-4 for each of imgtoimg, imgtotext, texttoimg, texttotext; left y-axis: normalized minimized resources (stage1, stage2); right y-axis: normalized 99%-ile latency (Camelot, Camelot-NC).]
C. Adapting to Different Loads
In this subsection, we evaluate Camelot in adapting to
different loads. For each benchmark, we report its resource
usages and the corresponding 99%-ile latencies under four
different loads with Camelot in Figure 17. In the figure, the
load of level i is higher than the load of level j, if i > j.
As this figure shows, Camelot reduces more resource
usage when the load is lower, and always guarantees the QoS of
the benchmarks. Camelot is able to fine tune the GPU resource
allocation based on the load, and the contention between the
microservices on the same GPU.
D. Effectiveness of Constraining Global Memory Bandwidth Contention
Camelot predicts the global memory bandwidth usage of
all the microservices, and makes sure that the accumulated
bandwidth usage of the concurrent tasks is smaller than the
peak global memory bandwidth of the GPU. To show the
effectiveness of this constraint, we implement Camelot-NC,
a system that disables the constraint in Camelot.
[Fig. 18: The throughputs of the artifact benchmarks with EA, Laius, and Camelot. x-axis: the 27 benchmarks, labeled from p1+c1+m1 to p3+c3+m3; y-axis: throughput (query/s); series: EA, Laius, Camelot.]
Figure 17 also shows the 99%-ile latency of the benchmarks
with Camelot-NC. As this figure shows, user-facing
services in 10 out of the 16 test cases suffer from
QoS violation with Camelot-NC. For instance, the 99%-ile
latency of img-to-img is up to 1.55X of its QoS target with
Camelot-NC. The QoS violation is due to the unmanaged
global memory bandwidth contention.
E. Generalizing for Complex Microservices
Besides the real-system benchmarks, we create 3× 3× 3 =
27 more benchmarks using the artifact benchmark in Camelot
suite (3 microservices with different compute intensities, 3
microservices with different memory access intensities, and
3 microservices with different PCIe intensities) to evaluate
Camelot for complex microservices. The microservices are
denoted by c1, c2, c3, m1, m2, m3, p1, p2, and p3, respectively.
ci/mi/pi is more compute/memory/PCIe intensive than
cj/mj/pj, if i > j.
Figure 18 shows the supported peak loads of the 27 artifact
benchmarks with EA, Laius, and Camelot. In the figure,
“pi+cj+mk” represents a benchmark that is built by pipelining
a PCIe-intensive microservice pi, a compute-intensive
microservice cj, and a memory-intensive microservice mk.
As this figure shows, on average, Camelot improves
the supported peak load of the 27 benchmarks by 44.91%
compared with EA, and by 39.72% compared with Laius.
Corresponding to Figure 18, Figure 20 shows the resource
allocation with Camelot for the 27 benchmarks. Observed
from this figure, Camelot launches different numbers of in-
stances for different microservice stages, and allocates dif-
ferent percentages of the SMs to the microservices. For
instance, Camelot launches 1 instance of Microservice-1, 2
instances of Microservice-2, and 5 instances of Microservice-3
in the first benchmark. In addition, Camelot allocates different
percentages of the SMs to the same microservice when it is
linked in different benchmarks. It reveals that Camelot is able
to automatically adjust the resource allocation based on the
features of the microservices.
Figure 21 shows the resource usages and the corresponding
99%-ile latencies of the 27 benchmarks at low load with
Camelot. Camelot significantly reduces the resource usage by
61.6% on average. In addition, the GPU resource allocations
vary for the 27 benchmarks. This is because Camelot adjusts
the resource allocation based on the characteristics of the pipelined
microservices. To conclude, Camelot is generalizable for complex
microservices.
F. Large Scale Evaluation on DGX-2
We also evaluate Camelot on a large-scale DGX-2 machine
in maximizing the supported peak load. We do not show the
result of minimizing the resource usage here because it is the
same as the one on RTX 2080Ti.
Figure 19 shows the supported peak loads of the bench-
marks normalized to their QoS targets with EA, Laius and
Camelot, while ensuring the 99%-ile latency target. In the
figure, the x-axis shows the batch sizes of processing user
queries. As this figure shows, Camelot increases the
supported peak load by 50.1% on average for all the benchmarks
compared with EA, while keeping their 99%-ile
latency within the required QoS target. Camelot is scalable on
large-scale GPU machines.
[Fig. 19: The supported peak loads of the benchmarks on DGX-2 with EA, Laius and Camelot. The stars show the normalized 99%-ile latencies of the benchmarks with Camelot (corresponding to the right y-axis). Panels: (a) Img-to-img, (b) Img-to-text, (c) Text-to-img, (d) Text-to-text. x-axis: batchsize; left y-axis: normalized throughput (query/s); right y-axis: normalized 99% tail latency.]
[Fig. 20: Resource allocation for maximizing the peak supported load of the benchmarks with Camelot. x-axis: benchmarks 1-27; y-axis: resource percentage; series: MicroService-1, MicroService-2, MicroService-3.]
[Fig. 21: Resource allocation for the benchmarks at low loads and the corresponding 99%-ile latencies with Camelot. x-axis: benchmarks 1-27; left y-axis: normalized minimized resources percentage (stage1, stage2, stage3); right y-axis: normalized 99%-ile latency.]
G. Overhead of Camelot
Offline overhead. The overhead of training models offline
for predicting microservice performance is acceptable. We
collect the training samples of all the microservices within
a single day using a single GPU. We can further speed up the
sample collection by using multiple GPUs. As for the online
prediction, each prediction completes in 1 ms, which is much
shorter than the QoS target of a service.
Resource allocation overhead. As stated in Section VII, Camelot
needs to solve the optimization problem using the simulated
annealing algorithm to identify the appropriate resource allocation.
Our measurement shows that this operation completes in 5ms.
Communication overhead. Camelot needs to set up global memory-based
communication for microservices that require data transfer.
The setup operation based on the CUDA IPC technique for a
pair of microservices is only done once when the end-to-end
service is launched. The setup operation completes in 1ms.
To conclude, the overhead of Camelot is acceptable for
real-system deployment.
IX. CONCLUSION
GPU microservices perform poorly because of the main
memory-based communication between the microservices, the
pipeline inefficiency, and the global memory bandwidth
contention. To address these problems, we propose Camelot, a
runtime system that manages GPU resources online. Camelot
uses a global memory-based communication mechanism to
eliminate the large communication overhead. We also propose
two contention-aware resource allocation policies that
consider the pipeline efficiency and the shared resource
contention. Experimental results show that, compared with the
state-of-the-art work, Camelot increases the peak supported
load by up to 64.5% and reduces resource usage by 35% at
low load while achieving the desired 99%-ile latency target.
REFERENCES
[1] L. A. Barroso and U. Hölzle, “The datacenter as a computer: An
introduction to the design of warehouse-scale machines,” Synthesis
lectures on computer architecture, vol. 4, no. 1, pp. 1–108, 2009.
[2] A. Broder, “A taxonomy of web search,” in ACM Sigir forum, vol. 36,
no. 2. ACM, 2002, pp. 3–10.
[3] S. A. McIlraith, T. C. Son, and H. Zeng, “Semantic web services,” IEEE
intelligent systems, vol. 16, no. 2, pp. 46–53, 2001.
[4] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno,
J. Hu, B. Ritchken, B. Jackson et al., “An open-source benchmark
suite for microservices and their hardware-software implications for
cloud & edge systems,” in the Twenty-Fourth International Conference
on Architectural Support for Programming Languages and Operating
Systems (ASPLOS). ACM, 2019, pp. 3–18.
[5] S. Li, H. Zhang, Z. Jia, Z. Li, C. Zhang, J. Li, Q. Gao, J. Ge, and
Z. Shan, “A dataflow-driven approach to identifying microservices from
monolithic applications,” Journal of Systems and Software, vol. 157, p.
110380, 2019.
[6] “The evolution of microservices.” https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference.
[7] “Microservices workshop: Why, what, and how to get there.”
http://www.slideshare.net/adriancockcroft/microservices-workshop-craft-conference.
[8] Q. Chen, H. Yang, J. Mars, and L. Tang, “Baymax: Qos awareness and
increased utilization for non-preemptive accelerators in warehouse scale
computers,” ACM SIGPLAN Notices, vol. 51, no. 4, pp. 681–696, 2016.
[9] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn,
N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph
processing,” in the ACM International Conference on Management of
data (SIGMOD). ACM, 2010, pp. 135–146.
[10] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,
no. 7553, p. 436, 2015.
[11] NVIDIA, “Multi-process service.”
https://docs.nvidia.com/deploy/mps/index.html#topic_6_1_2, 2015.
[12] A. Sriraman, A. Dhanotia, and T. F. Wenisch, “Softsku: Optimizing
server architectures for microservice diversity @scale,” in the 46th
International Symposium on Computer Architecture (ISCA), 2019, pp.
513–526.
[13] L. Bao, C. Wu, X. Bu, N. Ren, and M. Shen, “Performance modeling
and workflow scheduling of microservice-based applications in clouds,”
IEEE Transactions on Parallel and Distributed Systems, 2019.
[14] Y. Gan, Y. Zhang, K. Hu, D. Cheng, Y. He, M. Pancholi, and C. De-
limitrou, “Seer: Leveraging big data to navigate the complexity of
performance debugging in cloud microservices,” in the Twenty-Fourth
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS). ACM, 2019, pp. 19–33.
[15] W. Zhang, W. Cui, K. Fu, Q. Chen, D. E. Mawhirter, B. Wu, C. Li, and
M. Guo, “Laius: Towards latency awareness and improved utilization
of spatial multitasking accelerators in datacenters,” in Proceedings of
the ACM International Conference on Supercomputing (ICS), 2019, pp.
58–68.
[16] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent
neural network architectures for large scale acoustic modeling,” in the
Fifteenth annual conference of the international speech communication
association, 2014.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[19] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
learning with deep convolutional generative adversarial networks,” arXiv
preprint arXiv:1511.06434, 2015.
[20] X. Zhou, X. Peng, T. Xie, J. Sun, C. Xu, C. Ji, and W. Zhao,
“Poster: Benchmarking microservice systems for software engineering
research,” in the 40th International Conference on Software Engineering:
Companion (ICSE-Companion). IEEE, 2018, pp. 323–324.
[21] T. Ueda, T. Nakaike, and M. Ohara, “Workload characterization for
microservices,” in the international symposium on workload character-
ization (IISWC). IEEE, 2016, pp. 1–10.
[22] M. Amaral, J. Polo, D. Carrera, I. Mohomed, M. Unuvar, and M. Stein-
der, “Performance evaluation of microservices architectures using con-
tainers,” in the 14th International Symposium on Network Computing
and Applications. IEEE, 2015, pp. 27–34.
[23] W. Hasselbring, “Microservices for scalability: keynote talk abstract,”
in the 7th ACM/SPEC on International Conference on Performance
Engineering. ACM, 2016, pp. 133–134.
[24] M. Gribaudo, M. Iacono, and D. Manini, “Performance evaluation of
massively distributed microservices based applications,” in the 31st
European Conference on Modelling and Simulation (ECMS). European
Council for Modelling and Simulation, 2017, pp. 598–604.
[25] A. U. Gias, G. Casale, and M. Woodside, “Atom: Model-driven au-
toscaling for microservices,” in the 39th International Conference on
Distributed Computing Systems (ICDCS). IEEE, 2019, pp. 1994–2004.
[26] A. Kwan, J. Wong, H.-A. Jacobsen, and V. Muthusamy, “Hyscale:
Hybrid and network scaling of dockerized microservices in cloud data
centres,” in the 39th International Conference on Distributed Computing
Systems (ICDCS). IEEE, 2019, pp. 80–90.
[27] Y. Xiang and H. Kim, “Pipelined data-parallel cpu/gpu scheduling for
multi-dnn real-time inference,” in Real-Time Systems Symposium (RTSS).
IEEE, 2019, pp. 392–405.
[28] “Conway’s law.” http://www.melconway.com/Home/Conways_Law.html.
[29] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Finding tiny faces in the
wild with generative adversarial network,” in the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2018, pp. 21–30.
[30] “Facial recognition api for python,”
https://github.com/ageitgey/face_recognition.
[31] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution
convolutional neural network,” in European conference on computer
vision (ECCV). Springer, 2016, pp. 391–407.
[32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A
neural image caption generator,” in the IEEE conference on Computer
Vision and Pattern Recognition (CVPR), 2015, pp. 3156–3164.
[33] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee,
“Generative adversarial text to image synthesis,” arXiv preprint
arXiv:1605.05396, 2016.
[34] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of trans-
fer learning with a unified text-to-text transformer,” arXiv preprint
arXiv:1910.10683, 2019.
[35] “Opennmt: An open source neural machine translation system,”
https://opennmt.net/.
[36] Y. Liu and M. Lapata, “Text summarization with pretrained encoders,”
arXiv preprint arXiv:1908.08345, 2019.
[37] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and
C. Change Loy, “The devil of face recognition is in the noise,” in the
European Conference on Computer Vision (ECCV), 2018, pp. 765–780.
[38] L. Gao, D. Chen, J. Song, X. Xu, D. Zhang, and H. T. Shen, “Perceptual
pyramid adversarial networks for text-to-image synthesis,” 2019.
[39] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory
generative adversarial networks for text-to-image synthesis,” in the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2019,
pp. 5802–5810.
[40] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics
disentangling for text-to-image generation,” in the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2327–
2336.
[41] L. Gao, D. Chen, J. Song, X. Xu, D. Zhang, and H. T. Shen, “Perceptual
pyramid adversarial networks for text-to-image synthesis,” 2019.
[42] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for
generating image descriptions,” in the IEEE conference on Computer
Vision and Pattern Recognition (CVPR), 2015, pp. 3128–3137.
[43] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and
K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”
in International Symposium on Workload Characterization (IISWC).
IEEE, 2009, pp. 44–54.
[44] W. Li, G. Jin, X. Cui, and S. See, “An evaluation of unified memory
technology on nvidia gpus,” in the 15th IEEE/ACM International Sym-
posium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 2015,
pp. 1092–1098.
[45] A. Goldhammer and J. Ayer Jr, “Understanding performance of pci
express systems,” Xilinx WP350, Sept, vol. 4, 2008.
[46] “Nvidia nsight compute.”
https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
[47] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the
ACM, vol. 56, no. 2, pp. 74–80, 2013.
[48] G. A. Seber and A. J. Lee, Linear regression analysis. John Wiley &
Sons, 2012, vol. 329.
[49] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier
methodology,” IEEE transactions on systems, man, and cybernetics,
vol. 21, no. 3, pp. 660–674, 1991.
[50] G. Biau and E. Scornet, “A random forest guided tour,” Test, vol. 25,
no. 2, pp. 197–227, 2016.
[51] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement
learning: A survey,” Journal of Artificial Intelligence Research, vol. 4,
pp. 237–285, 1996.
[52] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza-
tion of machine learning algorithms,” in Advances in Neural Information
Processing Systems, 2012, pp. 2951–2959.
[53] P. J. Van Laarhoven and E. H. Aarts, “Simulated annealing,” in Simulated
annealing: Theory and applications. Springer, 1987, pp. 7–15.
[54] NVIDIA, “Nvidia dgx-2 system user guide.” https://docs.nvidia.com/
dgx/dgx2-user-guide/index.html, 2019.
[55] “Nvidia tesla v100 gpu architecture.” https://images.nvidia.com/content/
volta-architecture/pdf/volta-architecture-whitepaper.pdf, 2017.
