TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep
  Learning Inference in Function as a Service Environments by Dakkak, Abdul et al.
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning
Inference in Function as a Service Environments
Abdul Dakkak¹, Cheng Li¹, Simon Garcia de Gonzalo¹, Jinjun Xiong², and Wen-mei Hwu³
dakkak@illinois.edu, cli99@illinois.edu, grcdgnz2@illinois.edu, jinjun@us.ibm.com, w-hwu@illinois.edu
¹ Department of Computer Science, University of Illinois, Urbana-Champaign
² IBM Thomas J. Watson Research Center, Yorktown Heights, NY
³ Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign
Abstract
Deep neural networks (DNNs) have become core computation
components within low latency Function as a Service (FaaS)
prediction pipelines, including image recognition, object
detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing,
as the de-facto backbone of modern computing infrastructure
for both enterprise and consumer applications, has
to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency
guarantees, and minimizing resource waste. The current
solution for guaranteeing isolation within FaaS is suboptimal:
it suffers from “cold start” latency. A major cause of this
inefficiency is the need to move large amounts of model data
within and across servers. We propose TrIMS as a novel
solution to address these issues. Our proposed solution consists of
a persistent model store across the GPU, CPU, local storage,
and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application
APIs and container technologies for easy and transparent
integration with FaaS, Deep Learning (DL) frameworks, and user
code. We demonstrate our solution by interfacing TrIMS with
the Apache MXNet framework and show up to 24×
speedup in latency for image classification models and up to
210× speedup for large models. We achieve up to 8× system
throughput improvement.
1. Introduction
The recent trend of computing sees a confluence between arti-
ficial intelligence, driven primarily by deep learning (DL), and
cloud computing with both gaining traction within enterprise
and consumer applications. Key to this trend is the superior
performance, accessibility, and accuracy of deep neural networks
(DNNs) in a wide array of intelligent tasks such as image
recognition, object detection, natural language understanding,
speech synthesis, and personalized recommendation.
Today, many business-logic and consumer applications rely
on DL inference as core components within their prediction
pipelines. These pipelines tend to be deployed to the cloud
through Function as a Service (FaaS) platforms [8, 1, 5, 10],
since they abstract away low-level details such as system setup,
dev-ops, and monitoring — promising service isolation, decen-
tralization, and scalability, while still being more cost-effective
compared to dedicated servers. Since FaaS services execute
arbitrary user pipelines, FaaS systems must execute code in
isolation, through virtual machines (VMs) or containers.
Current off-the-shelf DL inference [12, 2, 7, 4, 11] is performed
through HTTP APIs and uses pre-built general models
(model catalogs) deployed by the cloud provider or user-defined
models deployed by the user. Within the FaaS pipelines,
users interact with these models using the HTTP inference
APIs and construct their prediction pipelines by defining glue
code that parses the input, performs the model prediction, and
processes the output. There are two ways to perform model
inference: batch prediction and online prediction. Batch prediction
is performed offline on a large set of inputs, while online
prediction is usually performed in real-time on a one-by-one
basis [15, 18]. In this paper we focus on online prediction
within a latency sensitive FaaS prediction pipeline.
DL Service providers are aware of the “cold start” cost of
inference, and therefore eagerly persist models within their
catalog — keeping the models in memory (“warm”) to guar-
antee the promised latency. For example, Amazon ML at-
tempts to respond to most real-time prediction requests within
100ms [16]. Without model persistence, network overhead
contributes to a significant portion of the end-to-end inference
latency. For user-deployed models, the inference latency
is not only affected by the network, but is also dominated by
the model inference “cold start”. The “cold start” latency can
be seconds to minutes depending on the model size and the
deployment set-up. To avoid the “cold start” overhead, users
have to pay [13, 3] an hourly cost to persist their models.
FaaS can be used to express latency sensitive prediction
pipelines that leverage a chain or ensemble of models. How-
ever, the current practice of integrating FaaS with model cata-
logs is inefficient for this usage — the network latency asso-
ciated with the inference limits how complex or intelligent a
pipeline can be — making these pipelines out of reach for most
but the cloud giants. For example, Google Translate targets a
200ms per sentence end-to-end latency to avoid user-visible
degradation of service [35]. To meet the latency requirement,
Google implements a monolithic in-house pipeline that uses
fast intranet interconnects. Current FaaS users cannot ex-
press such a complex pipeline using modular DL inferences
to achieve comparable latency.
Cloud computing, as the de-facto backbone of modern computing
infrastructure, has to be able to enable this scenario in
a cost-effective way. We envision a future FaaS infrastructure
that avoids the network overhead, thus making it feasible to build
complex latency sensitive pipelines with modular DL inference
components, while better leveraging the hardware
resources. This enables the development of complex applications
based on FaaS; e.g. users can build a personal assistant
(similar to Amazon’s Alexa or Apple’s Siri) by employing
off-the-shelf DL inference components and still achieve latency
comparable to that of the complex monolithic applications from cloud
giants. To achieve this goal, we advocate for collocating
prediction pipelines with model serving within FaaS, effectively
bringing the compute nearer to the model and circumventing
the network latency.
[Figure 1 (image not reproduced): pie charts for each model (AlexNet,
GoogLeNet, CaffeNet, RCNN-ILSVRC13, Inception-v3, Inception-v4,
InceptionBN-v2, ResNet101, ResNet101-v2, ResNet152, ResNeXt50-32x4d,
SqueezeNet, SqueezeNet-v1.1, VGG16, VGG16_SOD, VGG19, WRN50-v2) and each
framework/device pair (MXNet CPU, MXNet GPU, Caffe CPU, Caffe GPU,
Caffe2 CPU, Caffe2 GPU, TF CPU, TF GPU). Legend: Model Loading,
Input Processing, Compute.]
Figure 1: Percentage of time spent in model loading, inference
computation, and image preprocessing for “cold start” online
DL inference (batchsize = 1) using CPU and GPU for MXNet,
Caffe, Caffe2, and TensorFlow on an IBM S822LC with Pascal
GPUs. The speedup of using GPU over CPU for the inference
compute alone is shown between the pie charts. Inference
time for all frameworks is dominated by model loading except
for small models, such as SqueezeNet, where the model size is
a few megabytes. For TensorFlow, high GPU initialization
overhead impacts the end-to-end time and the achieved speedup.
arXiv:1811.09732v1 [cs.DC] 24 Nov 2018
The idea of collocating compute with data is not new and
has been explored in other domains such as databases and
near memory acceleration. This paper does not deal with the
mechanics of collocating compute and data, since they have
been explored elsewhere [29, 76, 50, 36, 49, 37]. Instead, we
tackle the challenge faced by collocating model serving and
user code within FaaS — the current method of user code
isolation incurs a high “cold start” latency for each invocation
of the DL inference in the pipeline.
We observe that for “cold start” model inference, model
loading (I/O, data structure deserialization, GPU data move-
ment) is the main source of “cold start” latency. Figure 1
shows the “cold start” inference time breakdown for popu-
lar DL frameworks: Caffe [46], Caffe2 [45], MXNet [22],
and TensorFlow [19]. For GPU inference, data movement
is another contributing factor making GPU less attractive for
accelerating inference — even though GPUs offer a significant
compute speed advantage, as shown in Figure 1.
We also observe that in a cloud setting DL models are shared
extensively across user FaaS pipelines. For example, Google
reported that 41 natural translation models can accommodate
over 75% of their translation requests in [7]. Because model
parameters are constant, we can leverage model sharing across
pipelines by persisting model parameters in GPU and/or CPU
memory, hence eliminating the model loading overhead, de-
creasing the end-to-end latency, and reducing the memory
footprint for DL inferences.
In this paper, we propose a Transparent and Isolated Model
Sharing (TrIMS) scheme to address the “cold start” latency
challenge faced by collocating user code with model catalogs
within FaaS — it does so while maintaining the isolation con-
straints, minimizing model-loading overhead, and increasing
hardware resource utilization. We describe TrIMS’s model re-
source manager (MRM) which offers a multi-tiered cache for
DL models to be shared across user pipelines. By decreasing
model loading and data movement overhead, TrIMS decreases
latency of end-to-end model inference, making inference on
GPU a viable target. TrIMS also increases memory efficiency
for cloud data centers while maintaining accuracy.
Specifically, this paper makes the following contributions:
• We characterize the “cold start” overhead for online DL
model inference across popular DL frameworks, such as
Caffe, Caffe2, MXNet, and TensorFlow, on both CPUs and
GPUs and identify model loading as the bottleneck.
• We propose TrIMS to mitigate the model loading overhead
faced by collocating user code with model catalogs within
FaaS, and increase the model serving efficiency by sharing
DL models across all levels of the memory hierarchy in
the cloud environment — GPU, CPU, local storage, and
remote storage. To our knowledge, this work is the first
to propose sharing DL models across isolated online pre-
diction pipelines while increasing hardware efficiency and
decreasing latency.
• We implement TrIMS within Apache MXNet [22] and evaluate
the impact on online inference performance for a
representative set of models and systems. We show that TrIMS
provides 1.12× – 24× speedup on small (less than 600MB)
models and 5× – 210× speedup on large (up to 6GB) models
and is within 20% of ideal speedup (with ideal being
that model loading and data movement take no time,
i.e. the same as persisting the model), and gives 8× system
throughput improvement without loss of accuracy.
• TrIMS eliminates a substantial part of the non-compute
components of the end-to-end latency, making DL model
inference on GPUs and other novel compute accelerators
more viable. We identify remaining latency components for
inference, motivating future microarchitecture techniques
for further inference latency improvements.
• We architect TrIMS so that it can be easily integrated
with existing frameworks without user code changes. The
method is designed to be compatible with existing framework
usage patterns, and requires minimal modifications
for framework developers.
The rest of this paper is organized as follows: Sections 2
and 3 describe current overheads and practice for inference
serving. Sections 4 and 5 detail our design and implementation.
Section 6 describes our evaluation setup and experiment
results. Section 7 outlines related work. Section 8 concludes.
[Figure 2 (image not reproduced): layer graph with nodes labeled Data,
Conv (×5), Pool (×3), and FC (×3); the numbers 1 to 16 mark the
parameter tensors listed in Table 1.]
Figure 2: The DL inference graph for AlexNet [47]. The input dimensions and the memory footprint are shown in Table 1.
Index  Name          Dim            MF (MB)
1      conv1_bias    96             0.001
2      conv1_weight  96×3×11×11     0.270
3      conv2_weight  256×48×5×5     2.458
4      conv2_bias    256            0.002
5      conv3_weight  384×256×3×3    7.078
6      conv3_bias    384            0.003
7      conv4_bias    384            0.003
8      conv4_weight  384×192×3×3    5.3086
9      conv5_weight  256×192×3×3    3.539
10     conv5_bias    256            0.002
11     fc6_bias      4096           0.033
12     fc6_weight    4096×9216      301.990
13     fc7_weight    4096×4096      134.218
14     fc7_bias      4096           0.033
15     fc8_bias      1000           0.008
16     fc8_weight    1000×4096      32.768
Table 1: Memory footprint (MF) for each layer in Figure 2.
2. Deep Learning Inference Overhead
A single DL inference is much less computationally intensive
than training, making it more sensitive to the data loading and
deserialization overhead. A DL inference compute graph is
a DAG composed of a set of network layers. Each computa-
tional layer is parameterized through weights and constants.
The model parameters along with the compute topology identify
the model¹. Each layer operator is a function of the
incoming edges in the graph and the weights/constants. An
inference pass iterates through the layers of a compute graph
and applies the layer operators to its input. Figure 2 shows the
inference compute graph for AlexNet [47] and Table 1 lists
the dimension and memory footprint for each layer.
For GPUs, the compute graph and associated weights are
loaded and copied to GPU memory ahead of the computation.
Memory for intermediate layer outputs also needs to be allocated.
AlexNet, for example, requires 516MB of extra GPU
memory to store the intermediate results during the inference
process. These intermediate outputs are not constant and can-
not be shared, since they depend on the user’s input. However,
layer weights are constant and can be shared across processes.
For AlexNet, this results in sharing 238MB of constant data.
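To make the split between shareable weights and per-request data concrete, the parameter counts can be derived from the Dim column of Table 1. The sketch below is illustrative only: the names `ALEXNET_DIMS` and `footprint_mb` are made up for this example, and the 4 bytes per parameter (float32) is an assumption of the sketch, not something stated in the table.

```python
from math import prod

# Parameter tensor shapes copied from the Dim column of Table 1.
ALEXNET_DIMS = {
    "conv1_bias": (96,), "conv1_weight": (96, 3, 11, 11),
    "conv2_weight": (256, 48, 5, 5), "conv2_bias": (256,),
    "conv3_weight": (384, 256, 3, 3), "conv3_bias": (384,),
    "conv4_bias": (384,), "conv4_weight": (384, 192, 3, 3),
    "conv5_weight": (256, 192, 3, 3), "conv5_bias": (256,),
    "fc6_bias": (4096,), "fc6_weight": (4096, 9216),
    "fc7_weight": (4096, 4096), "fc7_bias": (4096,),
    "fc8_bias": (1000,), "fc8_weight": (1000, 4096),
}


def param_count(shape):
    """Number of scalar parameters in a tensor of the given shape."""
    return prod(shape)


def footprint_mb(dims, bytes_per_param=4):
    # bytes_per_param=4 assumes float32 weights (an assumption of this sketch).
    return sum(param_count(s) for s in dims.values()) * bytes_per_param / 1e6


total_params = sum(param_count(s) for s in ALEXNET_DIMS.values())
fc_params = sum(param_count(s) for n, s in ALEXNET_DIMS.items() if n.startswith("fc"))
fc_share = fc_params / total_params  # fraction of parameters in the FC layers
```

Under these assumptions the three fully connected layers hold about 96% of the roughly 61 million parameters, so most of the constant, shareable data sits in a handful of large tensors.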
When compute is optimized, the overhead of model loading
is magnified. Figure 1 shows that GPU outperforms the CPU
in terms of compute, thus making model loading a bottleneck
for end-to-end inference. Without data transfer overhead the
NVIDIA Tesla V100 GPU using Tensor Cores can achieve
70× higher throughput on CNNs and 130× higher throughput
on RNNs compared to a high-end CPU server [14]. Reducing
the data movement overhead makes GPU a more appealing
option for DL inference.
To mitigate the model loading overhead, cloud services and
previous work [54, 21, 27] persist model catalogs in mem-
ory or perform inference in batches. These strategies require
knowledge of the model requests, have potential resource
waste since the system resources are persisted within pro-
cesses for models even when they are not used, or increase the
latency of requests if batching the inferences.
3. Current Prediction Pipelines in FaaS
Function as a Service (FaaS) is a cost-effective way for users
to deploy functions or pipelines that are executed within
the cloud. Users define prediction pipelines that use models
they deployed or ones found within the model catalog.
The pipelines are then mapped to a fabric of containers
(used to maintain software stack separation, virtualize system
resources, and provide isolation) that run on physical
machines. Unlike traditional cloud execution, the functions
executed in a FaaS are short lived and are priced on a
per-invocation basis (with function execution time and resource
utilization being the main cost factors). Because cloud providers
use a per-API call and per-resource utilization price model,
resource waste affects the cloud user’s total cost of ownership.
¹ Throughout this paper, sharing a layer means that we are sharing both
the weights and constants that parameterize the layer.
[Figure 3 (image not reproduced): diagram labels: User Function 1 to 4;
Container 1 to 4; Server 1 to 3; Container Orchestrator; Network Barrier;
API Management; Model Server 1 to N; Visual Recognition, Scene
Understanding, and Text to Speech endpoints; Scene Understanding API;
Text to Speech API; Deployed AlexNet; AlexNet Model Inference. Edge
legend: Function call, Provisioned on, HTTP REST API call. Callouts (1)
to (9) are referenced in the caption and the text.]
Figure 3: An example of using DL inference in the cloud. (1) Application code calls functions from their (2) deployed model or
a (3) model catalog. The code is then provisioned onto a (4) container running on a (5) server by the cloud provider. The code
performs API calls to (6) perform AlexNet inference and (7) the scene understanding API. (9) AlexNet is deployed by users
through the cloud provider’s cloud deployment mechanism.
To motivate our work, we use an image to scene description
pipeline deployed within FaaS as an example, illustrated
in Figure 3. The pipeline takes an image input and outputs a
textual description, leveraging a deployed AlexNet and an
off-the-shelf scene understanding model from the cloud provider’s
model catalog. Both the AlexNet model inference (2) and the
scene understanding API (3) are called within User Function
3 (1). Cloud providers then provision the function to run
within a container (4) on a cloud server (5). When user code
is triggered, both (6) the deployed AlexNet model and (7)
the scene understanding endpoints are called through HTTP
REST API calls. Meeting the latency requirements for this
application is challenging because of the multiple
over-the-network requests.
To avoid the network latency, a common practice is to collocate
the model within the deployed functions or the application
pipelines. However, such embedding requires a copy of the
model to be loaded privately for each function or application
pipeline. For example, (4) and (8) have to load (2) AlexNet
twice on the same machine, wasting memory resources.
The private loads also introduce latency overhead, since the model
needs to be loaded for the first function invocation. Since
isolation must be guaranteed in FaaS, the persistence schemes
described in Section 2 are not a solution. Similarly,
batching does not apply to low latency inference.
In a cloud setting DL models are shared extensively across
user functions, for example between the 4 user functions
shown in Figure 3. Based on this observation, we propose
TrIMS to eliminate such model loading overhead and hardware
resource waste, while maintaining resource utilization
efficiency and decreasing inference latency in user processes.
TrIMS achieves this by folding “private copies” of the model
into a shared copy under the hood. This is performed by
decoupling the model persistence from the user-code execution,
enabling model sharing, isolation, and low latency inference.
[Figure 4 (image not reproduced): Client 1 to Client 4 issue Open and
Close requests to the TrIMS MRM, which can download models from cloud
storage. The MRM database is organized by namespace: MXNet/ and Caffe2/
(each with AlexNet, VGG16, Inception v4, DenseNet), Glove/ and FastText/
(each with English, Spanish, French, Chinese).]
Figure 4: Multiple processes can perform IPC requests to the
TrIMS Model Resource Manager (MRM) server; for example
Client1, Client2, and Client3 are performing an Open request,
while Client4 is performing a Close request. TrIMS’s MRM
is responsible for loading and managing the placement of the
models in GPU memory, CPU memory, or local disk.
4. TrIMS Design
TrIMS consists of two components: a Model Resource Man-
ager (MRM) server and framework clients. MRM manages the
model resources resident in the system memory and abstracts
away the model loading from framework clients. Each framework
client communicates with MRM through inter-process
communication (IPC), as shown in Figure 4. Since TrIMS
follows the original DL framework’s API and semantics —
returning the same data structures as the unmodified frame-
work — user code can leverage TrIMS transparently without
any code modification.
4.1. TrIMS Model Resource Manager (MRM)
TrIMS’s MRM is a model server daemon that performs model
management and placement. MRM maintains a database of
models, addressing them using namespaces, with the framework
as well as the model name and version used to distinguish
frameworks and models. Figure 4 shows MRM managing
models for the MXNet and Caffe2 DL frameworks as well as
word vector embedding models for FastText and Glove.
The MRM placement manager then maps the models into
GPU memory, CPU memory, local storage, or cloud
storage. The four levels are analogous to the traditional CPU
cache hierarchy. Because of this, we will simply refer to these
four levels of the memory hierarchy as “cache” in the rest of this
paper whenever there is no ambiguity.
After system cold boot, initial model requests miss the
GPU, CPU, and local storage caches, causing the model to
be downloaded from the cloud storage and loaded into the
“caches” to serve both the current request and future requests.
When one of the caches becomes full, one or more models are
evicted from the cache.
For inter-process communication, TrIMS uses gRPC [38] to
send and receive messages between the MRM and its clients.
TrIMS leverages the CUDA runtime’s cudaIpc* functions to share GPU
memory across processes. MRM abstracts away the model
management, exposing two API functions to be used by the
clients: trims::open and trims::close to load and close
a model, respectively. MRM maintains a reference count for
each model to determine the number of users currently using
the shared model. The API is shown in Figure 5.
[Figure 5 (image not reproduced): TrIMS Client 1 (unmodified user code,
Caffe library, gRPC stub) and TrIMS Client 2 (unmodified user code,
MXNet library, gRPC stub) talk to the gRPC server of the TrIMS Model
Resource Manager, which holds the model resource database. The figure
lists the following messages and calls:]
struct ModelRequest {
  string model_name;
  string path;
  ReqConfig config;
}
struct ModelHandle {
  string id;
  string model_id;
  int64 byte_count;
  int sharing_granularity;
  void* device_raw_ptr;
  bytes ipc_handle;
  Layer[] layers;
}
OpenRequest(ModelRequest)
OpenResponse(ModelHandle)
CloseRequest(ModelRequest)
CloseResponse(Void)
Figure 5: When user code loads a model using the original
framework API, instead of loading the model directly from
disk, the corresponding TrIMS client sends an Open request
with ModelRequest structure to the MRM, and receives a
response of type ModelHandle, from which it constructs the
compute graph with model weights. When user code unloads
a model, instead of directly destroying the allocated
memory, the TrIMS client sends out a Close request
with ModelHandle and TrIMS MRM does the housekeeping.
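The open/close protocol with reference counting described above can be sketched as a small in-process class. This is a toy stand-in under stated assumptions, not the TrIMS implementation: there is no gRPC or GPU memory here, and the names `ModelResourceManager`, `ModelEntry`, and the handle format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    model_id: str
    byte_count: int
    ref_count: int = 0


class ModelResourceManager:
    """Toy in-process stand-in for the MRM; no gRPC or GPU memory involved."""

    def __init__(self):
        self._models = {}       # model name -> ModelEntry
        self._handles = {}      # handle id -> model name
        self._next_handle = 0

    def open(self, model_name, byte_count):
        # The first open creates the entry; later opens share the same entry.
        entry = self._models.setdefault(
            model_name, ModelEntry(model_id=model_name, byte_count=byte_count))
        entry.ref_count += 1
        handle_id = f"h{self._next_handle}"
        self._next_handle += 1
        self._handles[handle_id] = model_name
        return handle_id

    def close(self, handle_id):
        # Only decrement; the entry stays cached even at ref_count == 0.
        model_name = self._handles.pop(handle_id)
        self._models[model_name].ref_count -= 1

    def ref_count(self, model_name):
        return self._models[model_name].ref_count


mrm = ModelResourceManager()
h1 = mrm.open("mxnet/alexnet", byte_count=1024)  # byte_count is a placeholder value
h2 = mrm.open("mxnet/alexnet", byte_count=1024)
mrm.close(h1)
```

After these calls one client still holds the model, so its reference count is 1 and the entry is not a candidate for eviction.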
4.1.1. Loading Models When loading a model, MRM performs
shape inference on the model to estimate its memory
footprint when running on GPU. Shape inference is a simple
arithmetic computation performed by any framework to determine
the amount of internal memory to allocate for a model.
After shape inference, MRM follows the state diagram shown
in Figure 7 and needs to handle three cases:
GPU cache hit (model is persistent in GPU memory):
MRM increments the model’s reference count and creates
a shared memory handle from the device memory owned by
MRM. The handle is then returned to the framework client.
Model eviction is triggered when the intermediate results for
a model are larger than the available free memory.
GPU cache miss / CPU cache hit (model is persistent
in CPU memory): The server queries the current memory
utilization of the GPU to see if the model can be copied to
GPU memory. If it can, then GPU memory is allocated and
the model is copied; if not, then some memory needs to be
reclaimed, entering the memory reclamation procedure.
CPU and GPU cache miss (model is not persistent in
memory): If the data is not on local storage, then MRM
downloads the model from the cloud. If the data is on disk, then
MRM loads the data from disk using the framework’s serializer.
Pinned memory is allocated on the CPU and the model
weights are copied to it. MRM then follows the same logic as
when the data is persistent in CPU memory.
import mxnet as mx
from mlprovider import vision, nlp, audio
def serve_request(net, request):
  img_input       = <<<process input>>>
  img_labels      = vision.classify(img_input)
  img_description = nlp.sentence_generate(img_labels)
  # use a distinct local name so the imported audio module is not shadowed
  speech          = audio.synthasize(img_description)
  return speech
[Figure 6 (image not reproduced): the user function above runs inside
the User 1 container on top of the TrIMS MXNet framework client and
reaches the TrIMS MRM through container IPC; User 2 and User 3
containers connect the same way. The MRM holds vision, language, audio,
and text models across GPU memory, CPU memory, local storage, and
cloud storage.]
Figure 6: Cloud providers can use TrIMS MRM as a container
plugin to provision containers running untrusted user functions while still
leveraging model sharing. User code is executed within
isolated containers and can get the benefits of TrIMS without
code modifications. Sharing occurs when the users utilize
the same models as their peers, which is not uncommon
in cloud settings using cloud provided APIs.
[Figure 7 (image not reproduced): flowchart. A LOAD MODEL RPC REQUEST
reaches the model database (TrIMS MRM). GPU HIT: increment ref count and
return GPU memory pointer. GPU MISS / CPU HIT: check whether the model
fits in GPU; YES: allocate GPU memory and copy model to GPU; NO: reclaim
GPU memory and ITERATE. GPU MISS / CPU MISS: load model from disk and
copy to CPU memory.]
Figure 7: The logic for caching models on both GPU and CPU.
The TrIMS client initiates the load model call to TrIMS MRM
and gets back a pointer to GPU memory.
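The three loading cases and the flow in Figure 7 can be sketched as a small simulation. This is a minimal sketch under stated assumptions: sizes are plain numbers in MB, dictionaries stand in for GPU and pinned CPU memory, the reclaim step uses no particular eviction policy, and the class name `TieredCache` and its methods are hypothetical.

```python
class TieredCache:
    """Toy model of the GPU / CPU / disk lookup; sizes are in MB."""

    def __init__(self, gpu_capacity_mb):
        self.gpu_capacity_mb = gpu_capacity_mb
        self.gpu = {}        # name -> size_mb, models resident in GPU memory
        self.cpu = {}        # name -> size_mb, models resident in CPU memory
        self.ref_count = {}  # name -> number of clients using the model

    def _gpu_free_mb(self):
        return self.gpu_capacity_mb - sum(self.gpu.values())

    def _reclaim_gpu(self, needed_mb):
        # Evict unused models until the request fits (no specific policy here).
        for name in [n for n in self.gpu if self.ref_count.get(n, 0) == 0]:
            if self._gpu_free_mb() >= needed_mb:
                break
            del self.gpu[name]
        if self._gpu_free_mb() < needed_mb:
            raise MemoryError("model does not fit in GPU memory")

    def load(self, name, size_mb):
        """Return which case served the request: 'gpu_hit', 'cpu_hit', or 'miss'."""
        if name in self.gpu:
            case = "gpu_hit"
        else:
            if name in self.cpu:
                case = "cpu_hit"
            else:
                case = "miss"
                self.cpu[name] = size_mb  # stands in for the disk/cloud load
            if self._gpu_free_mb() < size_mb:
                self._reclaim_gpu(size_mb)
            self.gpu[name] = size_mb      # stands in for allocate + copy to GPU
        self.ref_count[name] = self.ref_count.get(name, 0) + 1
        return case

    def unload(self, name):
        self.ref_count[name] -= 1  # the model stays cached at zero references


cache = TieredCache(gpu_capacity_mb=500)
first = cache.load("a", 300)   # nothing cached yet
second = cache.load("a", 300)  # now resident in GPU memory
```

In this sketch a model evicted from the GPU stays in CPU memory, so a later request for it takes the cheaper CPU-hit path instead of going back to disk.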
4.1.2. Reclaiming Memory and Evicting Models Memory
reclamation is performed when the memory space for MRM at
a specific cache level is full. Which model to evict to reclaim
memory is determined by the eviction policy. TrIMS supports
a pluggable set of common eviction policies such as least
recently used (LRU) and least commonly used (LCU). For
the CPU and GPU level caches, one needs to make sure that
eviction does not interfere with the user’s code. Models within the
MRM database are not candidates for reclamation if they are in
use, i.e. if the reference count of the model is non-zero. Evicting
a model that is currently being used (effectively freeing GPU
memory that is in use) causes undefined behavior in the
user’s code.
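As an illustration of an eviction policy that respects reference counts, the sketch below picks LRU victims while skipping models that are in use. The function name `pick_lru_victims`, its arguments, and the model sizes are hypothetical and chosen only for this example.

```python
from collections import OrderedDict


def pick_lru_victims(last_used, ref_counts, sizes_mb, needed_mb):
    """Choose models to evict, least recently used first, skipping models in use.

    last_used: OrderedDict of model name -> None, ordered oldest to newest use.
    Returns the list of victims, or None if not enough memory can be reclaimed.
    """
    victims, reclaimed = [], 0
    for name in last_used:
        if reclaimed >= needed_mb:
            break
        if ref_counts.get(name, 0) > 0:
            continue  # in use: evicting it would free memory a client still reads
        victims.append(name)
        reclaimed += sizes_mb[name]
    return victims if reclaimed >= needed_mb else None


# Illustrative sizes in MB; not taken from the paper.
sizes = {"vgg16": 500, "alexnet": 240, "resnet": 170}
order = OrderedDict.fromkeys(["vgg16", "alexnet", "resnet"])  # oldest first
order.move_to_end("vgg16")  # vgg16 was just used again
victims = pick_lru_victims(order, {"alexnet": 1}, sizes, needed_mb=150)
```

Here `alexnet` is the oldest entry but has a non-zero reference count, so the next oldest unused model is selected instead.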
4.1.3. Unloading Models When a TrIMS framework client
unloads a model (or the user process exits), a model unload
request is sent to MRM. MRM looks up the model in the
database and decrements its reference count. By default MRM
does not free resources for models that have a zero reference
count (not currently used), but MRM can be configured to
eagerly reclaim these models.
4.2. TrIMS Frameworks
MRM can handle requests from multiple TrIMS-enabled
frameworks, managing their weights (which have different
data layouts) in separate namespaces. As shown in Figure 5,
when a TrIMS framework performs a model load request, the
framework’s name and version are sent along with the request.
The server can then perform the model unmarshaling from
disk using the format supported by the framework.
To enable TrIMS in a framework, the functions to load
and unload models need to be modified to perform gRPC
requests to MRM. Since each framework may have its own
serialization format, MRM needs support for that format
in order to unmarshal the data from disk to memory.
With these changes, any type of network supported
by the framework (CNN, RNN, Word2Vec, etc.) and any
compute pattern is automatically supported by TrIMS.
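One way to picture per-framework namespaces and format support is a registry that maps a framework name and version to an unmarshaling function. This is a sketch under stated assumptions: the key layout, the function names, and the `toyfw` format are all hypothetical, since the text only says that models are namespaced and that each framework's format must be supported by MRM.

```python
from typing import Callable, Dict, Tuple

# (framework name, framework version) -> function turning raw bytes into weights.
Deserializer = Callable[[bytes], dict]
_DESERIALIZERS: Dict[Tuple[str, str], Deserializer] = {}


def register_format(framework: str, version: str, fn: Deserializer) -> None:
    _DESERIALIZERS[(framework, version)] = fn


def namespace_key(framework: str, version: str, model: str, model_version: str) -> str:
    # Hypothetical key layout used only for this example.
    return f"{framework}/{version}/{model}/{model_version}"


def unmarshal(framework: str, version: str, raw: bytes) -> dict:
    fn = _DESERIALIZERS.get((framework, version))
    if fn is None:
        raise ValueError(f"no serializer registered for {framework} {version}")
    return fn(raw)


# Stand-in format: one comma-separated line of floats stored under the name "w".
register_format(
    "toyfw", "1.0",
    lambda raw: {"w": [float(x) for x in raw.decode().split(",")]})
weights = unmarshal("toyfw", "1.0", b"0.5,1.5")
key = namespace_key("toyfw", "1.0", "alexnet", "1")
```

Adding a new framework then amounts to registering one more deserializer; the lookup and caching logic stay unchanged.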
User application rewriting overhead — Since MRM does
not modify the framework’s API, code that is linked with a
TrIMS-enabled framework does not require any change. TrIMS
works within Python, Java, or R. This is an attractive feature,
since the benefits of TrIMS can be leveraged by cloud providers
transparently to the user.
Sharing Granularity — TrIMS supports fixed-size block,
layer, and model level sharing granularity. Sub-model level
sharing granularity is interesting when considering layers or
memory across models. For example, models trained using
transfer learning [65] share the frozen layer weights. Block
level granularity can also be used to share fixed-size buffers.
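Layer-level sharing can be illustrated with a store that keeps each distinct layer buffer once. Identifying identical layers by a SHA-256 digest of their bytes is an assumption of this sketch, as are the class name `LayerStore` and the tiny placeholder buffers; the text does not say how TrIMS detects shared layers.

```python
import hashlib


class LayerStore:
    """Stores each distinct layer buffer once, keyed by a digest of its bytes."""

    def __init__(self):
        self._buffers = {}  # digest -> bytes

    def add_model(self, layers):
        """layers: dict of layer name -> bytes. Returns layer name -> digest."""
        refs = {}
        for name, buf in layers.items():
            digest = hashlib.sha256(buf).hexdigest()
            self._buffers.setdefault(digest, buf)  # keep only the first copy
            refs[name] = digest
        return refs

    def stored_bytes(self):
        return sum(len(b) for b in self._buffers.values())


store = LayerStore()
base = {"conv1": b"\x01" * 64, "fc": b"\x02" * 32}
tuned = {"conv1": b"\x01" * 64, "fc": b"\x03" * 32}  # frozen conv1, retrained fc
base_refs = store.add_model(base)
tuned_refs = store.add_model(tuned)
```

The two models together contain 192 bytes of layer data, but the frozen `conv1` buffer is stored once, so the store holds 128 bytes.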
Multi-GPU and Multi-Node Support — Multi-GPU is usu-
ally used when performing batched inference [17, 21]. TrIMS
inherently supports multi-GPU systems by leveraging Unified
Memory (UM) [6]. Support for Multi-GPU sharing can also
be performed without relying on UM by making the TrIMS
framework client query the device ID of the current GPU
context when a model is loaded. The framework client can
then send the device ID along with the request. TrIMS MRM
would then load the model into the GPU with that device ID.
When a request loads a model on a GPU and the requested
model is persistent on another GPU, MRM will perform GPU
peer-to-peer memory copy if supported.
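The device-ID based placement described above reduces to a small decision: share the model if it is already on the requested GPU, copy it from a peer GPU if that is supported, and otherwise fall back to the normal load path. The function `plan_gpu_load` and its argument shapes are hypothetical; no CUDA calls are made in this sketch.

```python
def plan_gpu_load(model, device_id, resident, peer_access):
    """Decide how to bring `model` onto GPU `device_id`.

    resident: dict of model name -> set of device IDs already holding the model.
    peer_access: set of (src, dst) device pairs that support peer-to-peer copies.
    Returns ("share", device_id), ("peer_copy", src_device), or ("load", None).
    """
    devices = resident.get(model, set())
    if device_id in devices:
        return ("share", device_id)
    for src in sorted(devices):
        if (src, device_id) in peer_access:
            return ("peer_copy", src)
    return ("load", None)  # fall back to the CPU-memory / disk path


resident = {"alexnet": {0}}
peer_access = {(0, 1)}
plan_same = plan_gpu_load("alexnet", 0, resident, peer_access)
plan_peer = plan_gpu_load("alexnet", 1, resident, peer_access)
plan_none = plan_gpu_load("alexnet", 2, resident, peer_access)
```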
Multiple independent instances of TrIMS MRM can be
loaded for multi-node support and an off-the-shelf task
scheduling and load balancing middleware can be used to
route and load balance inference requests. TrIMS can be set up
to advertise the models that have already been loaded by users
and the current system load to the load balancer.
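A load balancer could use the advertised information roughly as follows. The routing rule here (prefer nodes that already hold the model, then pick the least loaded) is an assumption made for illustration; the text leaves the policy to off-the-shelf middleware, and `choose_node` and the node dictionary layout are hypothetical.

```python
def choose_node(model, nodes):
    """Pick a node for an inference request.

    nodes: dict of node name -> {"models": set of loaded models, "load": float}.
    Prefers nodes that already advertise the model, then the lowest load,
    then the node name as a deterministic tie-breaker.
    """
    if not nodes:
        raise ValueError("no nodes available")
    return min(
        nodes,
        key=lambda n: (model not in nodes[n]["models"], nodes[n]["load"], n))


nodes = {
    "node-a": {"models": {"alexnet"}, "load": 0.9},
    "node-b": {"models": set(), "load": 0.1},
    "node-c": {"models": {"alexnet"}, "load": 0.4},
}
target = choose_node("alexnet", nodes)
```

With these example values the request goes to `node-c`: it already has the model loaded and is less busy than `node-a`, even though `node-b` is idle.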
4.3. Inference Isolation and Fairness
To enable seamless container isolation, TrIMS provides a
Docker [51] volume plugin that allows service providers to pro-
vision the container with a communication link to the TrIMS
MRM. The TrIMS MRM process runs in the host system with
a link for frameworks to communicate with it across container
boundaries. Figure 6 shows how untrusted user code can be
run on a multi-tenant system while maintaining isolation. The
code shows how users can use DL models, provided by the
cloud provider, to create an image to audio pipeline. The user
uses the cloud provided vision, text, and audio models via a
library that is part of a model catalog. All user code executes
within a container that communicates with the MRM via the
container’s IPC mechanism.
5. Implementation
The experiments reported in this paper are based on an
implementation of TrIMS on top of Apache MXNet², a
popular machine learning framework. The TrIMS MRM
includes serialization code from MXNet to unmarshal MXNet
models from disk. We also modify the MXNet framework
to integrate it with TrIMS while keeping the MXNet APIs
unchanged. Communication between the MXNet framework
client and the MRM uses Google’s gRPC [38] with the
packets encoded using Protocol Buffers [9].
² The source code for TrIMS is open source and is found at
http://github.com/REMOVED/DURING/REVIEW
To validate the efficiency and generality of our proposal,
we follow a few principles throughout our implementation —
even if disregarding some would have given us better speedup:
Backward Compatible: The implementation needs to work
with the existing framework’s code base and language bindings,
i.e. we should be able to run preexisting MXNet code
written in Python or Scala with no modifications.
Simple and Minimal: The implementation needs to be simple
and modify the framework code as little as possible.
Our modifications add only 1500 lines of code (less than
0.5% of the MXNet code base) to the framework (800 lines
for the server and 700 lines for the client) and are self contained.
Configurable: The implementation has knobs to tweak everything
from the eviction strategy for memory sharing and the
amount of memory that can be used to whether TrIMS is
enabled and which levels of cache are enabled.
Fast, Concurrent and Scalable: We communicate using
gRPC and use efficient data structures [41] for the MRM
database to make the serving fast and concurrent. The memory
sharing strategy in TrIMS is scalable and can handle a large
amount of load.
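The configuration knobs listed above could be grouped into a single validated settings object. The sketch below is purely illustrative: the class `TrimsConfig`, its field names, and the default values are hypothetical and are not the actual TrIMS configuration interface.

```python
from dataclasses import dataclass

KNOWN_POLICIES = ("lru", "lcu")
KNOWN_LEVELS = ("gpu", "cpu", "disk", "cloud")


@dataclass(frozen=True)
class TrimsConfig:
    # All names and defaults below are assumptions made for this example.
    enabled: bool = True
    eviction_policy: str = "lru"
    gpu_memory_limit_mb: int = 4096
    cache_levels: tuple = ("gpu", "cpu", "disk")

    def __post_init__(self):
        if self.eviction_policy not in KNOWN_POLICIES:
            raise ValueError(f"unknown eviction policy: {self.eviction_policy}")
        unknown = set(self.cache_levels) - set(KNOWN_LEVELS)
        if unknown:
            raise ValueError(f"unknown cache levels: {sorted(unknown)}")
        if self.gpu_memory_limit_mb <= 0:
            raise ValueError("gpu_memory_limit_mb must be positive")


default_cfg = TrimsConfig()
cpu_only_cfg = TrimsConfig(cache_levels=("cpu", "disk"), eviction_policy="lcu")
```

Validating the settings at construction time keeps an invalid policy name or cache level from surfacing later, in the middle of a model load.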
5.1. TrIMS Apache MXNet Framework
We implement TrIMS on top of the Apache MXNet framework
client by modifying the MXPredCreate and MXPredFree in
the MXNet C predict API’s implementations. When TrIMS
is enabled, trims::open and trims::close are called as
part of the predictor creation and deletion. Listing 1 shows the
main modification to the original MXNet code.
Like most open-source DL Frameworks, MXNet is opti-
mized for training and not inference. We apply a set of op-
timizations to the original MXNet to improve the inference
latency. The optimizations avoid eager initialization of CUDA
resources, remove cuDNN algorithm selection for backward
propagation, and simplify the random resource generation.
With our optimizations, MXNet is 6× faster for inference on
average than the vanilla MXNet for the suite of models we use.
We use the modified MXNet as our baseline for evaluation.
5.2. GPU Memory Sharing
We perform GPU memory sharing using CUDA’s cudaIpc* runtime
functions. For pre-Volta GPUs, the CUDA IPC mechanism utilizes
CUDA MPS, an intermediate user process where the memory
allocations are performed. This means that all CUDA operations
end up serialized and executed within the same CUDA MPS
context, enabling different processes to share the same GPU
virtual address space (VAS). For Volta GPUs, NVIDIA introduced
a new feature that allows contexts to share page-table mappings.
This makes
MXAPIPredictor *MXPredCreate(MXPredParams *p) {
  MXAPIPredictor *ret = new MXAPIPredictor();
  {...} // load in the symbol and model parameters
  {... shapes = infer_model_shapes(p); ... }
  if (trims::ENABLED) {
    // TrIMS path: obtain the shared model instead of loading it
    auto tinfo = trims::open(...);
    ret->handle_id = std::get<0>(tinfo);
    ret->model_id = std::get<1>(tinfo);
  } else {
    // original model loading
    dmlc::MemoryStream fi(p->buf, p->size);
    NDArray::Load(&fi, &data, &names);
    {...}
  }
  // set up the predictor (common to both paths)
  {...}
  return ret;
}
void MXPredFree(PredictorHandle handle) {
  auto pred = (MXAPIPredictor *) handle;
  if (trims::ENABLED) trims::close(pred); // release the shared model
  delete pred;
}
Listing 1: To integrate TrIMS with MXNet we modify both the
MXPredCreate and MXPredFree functions. MXPredCreate
loads the model and initializes the compute graph to perform
inference. If TrIMS is enabled, we call trims::open instead
of NDArray::Load, which loads the model from disk. To
correctly free the models, we modify the MXPredFree function
to call trims::close. MXPredFree is called in the Predictor
destructor or at process exit.
it possible for user processes to run using different contexts
while still sharing memory. For CUDA 9.2, CUDA MPS is still
invoked to keep shared allocations and communicate across
them, but, with the exception of a handful of functions, most
CUDA operations are performed without IPC communication.
Because sharing may serialize through CUDA MPS, one slight
disadvantage of the CUDA IPC functions is that they have a
measurable overhead, which can become a bottleneck. When
sharing models at layer granularity, networks with a large
number of layers, such as ResNet269-v2, incur high overhead. We
remedy this by sharing at the granularity of a group of layers
or of the whole model.
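One way to picture sharing per group of layers is to pack consecutive layers into groups so that each group needs only one shared allocation. The sketch below is a hypothetical illustration: the function name group_layers, the greedy packing rule, and the min_group_bytes threshold are all assumptions and do not describe how TrIMS actually forms groups.

```cpp
#include <cstddef>
#include <vector>

// Greedily pack consecutive layers into groups of at least `min_group_bytes`,
// so that each group could be shared through a single IPC handle.
// Returns, for each group, the number of layers it contains.
std::vector<std::size_t> group_layers(const std::vector<std::size_t>& layer_bytes,
                                      std::size_t min_group_bytes) {
  std::vector<std::size_t> groups;
  std::size_t count = 0;
  std::size_t bytes = 0;
  for (std::size_t sz : layer_bytes) {
    ++count;
    bytes += sz;
    if (bytes >= min_group_bytes) {  // group is large enough, close it
      groups.push_back(count);
      count = 0;
      bytes = 0;
    }
  }
  if (count > 0) groups.push_back(count);  // trailing partial group
  return groups;
}
```

With a threshold of zero every layer forms its own group (layer granularity), and with a threshold larger than the whole model a single group results (model granularity), so the two extremes discussed above fall out as special cases.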
The CUDA IPC overhead is measurable, and we can statically
estimate whether using TrIMS is beneficial with the empirical
formula ρ = b/q − n × (o + s), where n is the number of objects
to share (1 when the sharing granularity is at the model level,
and the number of layers when the granularity is at the layer
level); o is the overhead of sharing CUDA memory via CUDA IPC;
s is the overhead of obtaining a CUDA device pointer from a
shared CUDA IPC handle; b is the number of bytes the model
occupies on disk; and q is the disk I/O bandwidth. The
system-dependent terms (o, s, and q) can be computed once at
system startup and cached for use by TrIMS. If ρ is positive,
then its magnitude is correlated with the speedup one gets using
TrIMS. This equation can be used within the TrIMS framework to
determine at runtime whether to call TrIMS to share a model
and at what granularity to share it.
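The formula can be evaluated directly. The sketch below assumes o and s are measured in seconds and q in bytes per second (the text does not state units). The type and function names, and the policy of preferring layer granularity whenever it still yields a positive ρ, are hypothetical choices made for illustration.

```cpp
#include <cstddef>

struct SharingCosts {
  double o;  // assumed: seconds to share one allocation via CUDA IPC
  double s;  // assumed: seconds to obtain a device pointer from an IPC handle
  double q;  // assumed: disk I/O bandwidth in bytes per second
};

// rho = b / q - n * (o + s); a positive value suggests sharing pays off.
double rho(double model_bytes, std::size_t n_objects, const SharingCosts& c) {
  return model_bytes / c.q - static_cast<double>(n_objects) * (c.o + c.s);
}

enum class Granularity { None, Layer, Model };

// Hypothetical policy: share per layer if that is still beneficial,
// otherwise fall back to whole-model sharing, otherwise do not share.
Granularity choose_granularity(double model_bytes, std::size_t n_layers,
                               const SharingCosts& c) {
  if (rho(model_bytes, n_layers, c) > 0) return Granularity::Layer;
  if (rho(model_bytes, 1, c) > 0) return Granularity::Model;
  return Granularity::None;
}
```

For example, with a 500 MB model, q = 500 MB/s, and o = s = 1 ms, sharing 100 layers gives ρ = 1 − 0.2 = 0.8, while sharing 1000 layers gives ρ = 1 − 2 = −1, so only whole-model sharing remains beneficial in the second case.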
Name      CPU                          GPU              Memory  GPU Memory  Cached Reads  Buffered Disk Reads
System 1  Intel Core i9-7900X          TITAN Xp P110    32 GB   12 GB       8 GB/sec      193.30 MB/sec
System 2  Intel Xeon E5-2698 v4        Tesla V100-PCIE  256 GB  16 GB       10 GB/sec     421.30 MB/sec
System 3  IBM S822LC Power8 w/ NVLink  Tesla P100-SXM2  512 GB  16 GB       27 GB/sec     521.32 MB/sec
Table 2: We evaluate TrIMS on 3 systems which represent both cloud offerings and consumer desktop system configurations
currently used for DL inference. We use the Linux hdparm tool to measure the cached disk reads.
6. Evaluation
We evaluate TrIMS on 3 systems (shown in Table 2) using
37 pre-trained small models (shown in Table 3) and 8 large
models (shown in Table 4). The systems selected represent
different types of instances that are currently provisioned in the
cloud. System 3 uses the NVLink bus [32, 63], which allows up
to 35GB/s transfer between CPU and GPU. System 3 is used
as a proxy for understanding our proposed method’s behavior
on high end cloud instances and next generation interconnects
currently being deployed on HPC and cloud systems [68, 64].
Multi-GPU results are similar to the single-GPU results shown
below and are omitted for simplicity.
We used image processing models as a representative workload
because these are currently the most plentiful in FaaS
pipelines. TrIMS is agnostic to the compute patterns of a
network, and the analysis would apply to other types of networks
such as RNNs, word embeddings, or matrix factorization.
The 37 pre-trained image processing models, shown in Table 3,
were selected based on their popularity in both research and
usage. Some of the networks have variants. These are used to
simulate user trained models, where the same network structure
can have different weights. Large models are used to
show how TrIMS performs with increasing model sizes.
Throughout this section we compare our performance
within a FaaS setting against ideal (where model loading
and data movement take no time, so ideal is faster than
model persistence) and use end-to-end “cold-start” inference
as the baseline, since that is what is currently employed by
FaaS environments.
6.1. Latency Improvement
We measure the end-to-end “cold-start” inference of MXNet
with and without TrIMS – for the sake of clarity we omit the
input processing time. Figure 8 shows the achieved speedup
on a representative set of the models compared against MXNet
that does not utilize TrIMS. We show two cases: (a) our best
case (when there is a GPU cache hit) and (b) our worst case
(when the cache misses both the CPU and GPU).
For the best case analysis (Figure 8a), the server needs to create
the CUDA IPC handles and the framework client needs to embed
the GPU device pointers within the framework’s container.
This introduces a slight overhead; however, it is within 20%
of the ideal, defined as the time for inference with the model
loading and deserialization times set to zero. We see that
latency speedup improves proportionally to the model size,
the system’s data movement bandwidth, the system’s compute
resources, and the model’s compute complexity.
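The relationship between the baseline, the ideal, and a measured TrIMS run can be written down as simple arithmetic. The sketch below is illustrative only: the Breakdown structure and function names are hypothetical, and it assumes the end-to-end time splits cleanly into a load part (disk I/O, deserialization, and data transfer), an initialization part, and a compute part.

```cpp
struct Breakdown {
  double load;     // assumed: disk I/O, deserialization, and data transfer
  double init;     // assumed: framework initialization excluding weight copies
  double compute;  // inference computation
};

// End-to-end "cold-start" time of the baseline.
double end_to_end(const Breakdown& b) { return b.load + b.init + b.compute; }

// Ideal speedup: baseline time divided by the time with loading set to zero.
double ideal_speedup(const Breakdown& b) {
  return end_to_end(b) / (b.init + b.compute);
}

// Fraction of the ideal reached by a measured end-to-end time with sharing.
double fraction_of_ideal(const Breakdown& b, double measured_time) {
  return (b.init + b.compute) / measured_time;
}
```

As a made-up example, a run with 8 time units of loading, 1 of initialization, and 1 of compute has an ideal speedup of 10 / 2 = 5; a shared run that takes 2.5 units reaches 80% of that ideal.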
For small models, where the I/O overhead is very low, for
example SqueezeNet (which has a 5MB memory footprint),
we observe only marginal speedup (1.04×). These models are
designed to have a small footprint — targeting edge devices
— and are rarely used within the cloud. For state-of-the-art
networks, such as VGG16-SOD, we observe 24× speedup
on System 1. Even with fast disk and the NVLink intercon-
nect, which mitigates I/O overhead by offering greater data
movement bandwidth, System 3 achieves 6× speedup for
VGG16-SOD.
For the worst case analysis (Figure 8b), the MRM needs to
load the data from disk, persist the model on the CPU, copy
the data to the GPU, and send the GPU memory handles to the
client. Although we get a slowdown, this case assumes there
is no model sharing across pipelines and is therefore uncommon
in a cloud setting.
6.2. Speedup Breakdown
To understand where the new bottlenecks are for the inference
using TrIMS, we look at System 3 where we achieve the lowest
speedup and measure the (a) time to perform inference compu-
tation, (b) time to initialize the model (this includes copying
the data to the GPU when not using TrIMS), (c) model load-
ing from disk, and (d) model sharing overhead introduced by
TrIMS. As can be seen in Figure 9, without using TrIMS an
average of 86% of the time is spent loading and initializing the
model while only 7% is spent performing computation. When
using TrIMS we eliminate the model loading from disk and
remove the need to perform memory copies to the GPU. Even
though we introduce overhead, we still gain a 4.8× geometric
mean speedup.
6.3. Large Model Evaluation
We evaluate our method using large models, which are common
for medical image analysis, NLP, and time series modeling.
We generated the large models by starting with the
regular AlexNet and VGG16 networks, keeping their compute
graph topology, and rescaling the input dimensions to generate
enlarged models. Table 4 shows the 8 models selected for
evaluation, their memory footprint, and their input sizes.
Figure 10 shows that by removing model loading overhead,
inference on large models becomes compute bound and gives
an advantage to faster GPUs. This is why System 1 achieves
less speedup than System 2 for the more compute intensive
[Figure 8 (plot omitted): two bar-chart panels, (a) and (b), showing speedup (y-axis) for models 2, 6, 8, 12, 13, 14, 16, 19, 20, 21, 22, 24, 28, 31, 33, and 36 (x-axis) on System 1, System 2, and System 3, with markers for the ideal speedup.]
Figure 8: A representative sample of the models shown in Table 3 are chosen and are run on the systems in Table 2 to achieve (a)
the best case end-to-end time — when the model has been pre-loaded in GPU memory — and (b) the worst case end-to-end time
— when the model misses both the CPU and GPU persistence and needs to be loaded from disk. The speedups are normalized to
end-to-end running time of the model without TrIMS. The yellow dots show the ideal speedup; the speedup achieved by removing
any I/O and data-transfer overhead — keeping only the framework initialization and compute. For models 33 and 36, the achieved
speedup is shown on the bar (white) and the ideal speedup is shown on top of the bar (black).
[Figure 9 (plot omitted): normalized percentage of time for each operation (left y-axis, 0 to 100) for models 1 to 37 and their geometric mean (gm), with speedup on a log scale (right y-axis). Legend: Compute w/o TrIMS, Init w/o TrIMS, Model Loading w/o TrIMS, Compute w/TrIMS, Init w/TrIMS, Model Sharing w/TrIMS, Model Loading w/TrIMS, Speedup.]
Figure 9: Detailed normalized times of operations with and without TrIMS on System 3 using the models in Table 3. The duration
for TrIMS is normalized to the end-to-end time of not using TrIMS. Model initialization is the time spent setting up the CUDA
contexts for the model, initializing the compute state, and (in the case of not using TrIMS) copying the weights to GPU
memory. Compute is the time spent performing inference computation. Model sharing is the overhead introduced by using TrIMS
and includes the gRPC communication and sharing GPU data using CUDA IPC. Through TrIMS we effectively eliminated model
loading and data movement.
[Figure 10 (plot omitted): speedup on a log scale (left y-axis) for large models 1 to 8 (x-axis) on System 1, System 2, and System 3, with the percentage of time spent in compute (right y-axis, 0 to 100).]
Figure 10: Large models in Table 4 are run to achieve the best
case end-to-end time — when the model has been pre-loaded
in GPU memory. The speedups are normalized to end-to-end
running time of the model without TrIMS. The red dots show
the percentage of time spent performing the compute. We see
linear speedup until the inference becomes compute bound.
VGG16 network (for example, model 7), since model inference
computation accounts for 85% of the time on System 1 and 50% on
System 2. We expect this to be a more pronounced bottleneck
for lower end GPUs and less of an issue for specialized low
latency inference accelerators.
We also observe that TrIMS increases the memory efficiency
of the GPU. Without TrIMS, two inferences using model 8 cannot
be run concurrently, since they overrun the GPU memory.
TrIMS avoids multiple private copies of the 6.4GB model on
the same machine, enabling concurrent runs of large models.
6.4. Workload Modeling
Finally, we perform workload modeling to understand the
behavior of TrIMS on a multi-tenant oversubscribed system.
The workload is selected from the 37 small models shown in
Table 3 following a Pareto distribution. Since the models
cannot all be resident on the GPU at the same time (in total
they have 2× the GPU memory footprint), the TrIMS MRM needs
to exercise the model reclamation and eviction procedure.
[Figure 11 (plot omitted): three panels, one each for System 1, System 2, and System 3, plotting the number of concurrent requests (y-axis) against the percentage of models (x-axis).]
Figure 11: We vary the percentage of models run (from Table 3) and we sample them following a Pareto distribution (with X = 1
and l = 1). We also vary the concurrency level (number of inferences performed concurrently) ranging it from 1 to 10. The
iso-curves show the geometric mean of the speedups for Systems 1, 2, and 3.
Because of limited space, we only present the results for the
LRU eviction strategy, but our observations are valid for other
eviction strategies.
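To make the LRU strategy concrete, the following is a minimal sketch of a byte-budgeted LRU cache of resident models. It is not the MRM implementation: the class name, its interface, and the decision to ignore whether a model is currently in use by a client are simplifying assumptions made for illustration.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical byte-budgeted LRU cache of resident models.
class LruModelCache {
 public:
  explicit LruModelCache(std::size_t capacity_bytes) : capacity_(capacity_bytes) {}

  // Records a request for `model`. Returns true on a cache hit. On a miss the
  // model is made resident, evicting least recently used models until it
  // fits; names of evicted models are appended to `evicted` if non-null.
  bool request(const std::string& model, std::size_t bytes,
               std::vector<std::string>* evicted) {
    auto it = index_.find(model);
    if (it != index_.end()) {
      order_.splice(order_.begin(), order_, it->second);  // mark most recent
      return true;
    }
    if (bytes > capacity_) return false;  // can never be made resident
    while (used_ + bytes > capacity_) {
      const Entry& victim = order_.back();  // least recently used
      used_ -= victim.bytes;
      if (evicted) evicted->push_back(victim.name);
      index_.erase(victim.name);
      order_.pop_back();
    }
    order_.push_front(Entry{model, bytes});
    index_[model] = order_.begin();
    used_ += bytes;
    return false;
  }

  std::size_t used_bytes() const { return used_; }

 private:
  struct Entry {
    std::string name;
    std::size_t bytes;
  };
  std::size_t capacity_;
  std::size_t used_ = 0;
  std::list<Entry> order_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

A production version would presumably also need to skip models that are still referenced by running clients; that bookkeeping is deliberately left out here.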
Figure 11 shows the iso-efficiency curves for the geometric
mean speedup³ as we vary the concurrency level
and number of models to run. We can see that even in an
oversubscribed setting, we can still service 10 clients
concurrently and reduce the overall batch execution time (by up
to 8×), while incurring only a 20% latency penalty for each
request. This slowdown is due to the cost of evicting models to
accommodate the larger memory footprint, causing subsequent
uses of the model to miss the GPU cache.
For all three systems, we can observe an over-subscription
sweet spot, where the percentage of models and the number of
concurrent requests can be increased while the batch execution
speedup is preserved at 1×. All systems show a sweet spot
when 40% of models are actively being requested. For Systems
1 and 3, the number of concurrent requests can be increased
to 4, and for System 2 it can be increased to 6. The
difference in the over-subscription sweet spot can be explained
by the different compute capabilities of the systems.
Systems 1 and 3 are provisioned with Pascal generation GPUs,
while System 2 has the latest Volta generation. Essentially,
because we are successful in moving the inference bottleneck
from I/O to compute, the sweet spot is determined by the
available computing resources. In practice, cloud providers
can perform sensitivity analysis to determine the number of
models hosted on each server and the number of concurrent
requests to service based on the service’s target requirements.
By removing model loading overhead, our speedup is
bounded by the framework’s inference pipeline. Frameworks
that are optimized for inference garner greater benefits. For
older generation or lower end GPUs, compute would likely
dominate inference. Therefore, if cloud providers are only
interested in maintaining latency, they can utilize these older or
lower end GPUs, which have a lower initial cost of ownership.
³ We measure the speedup value using the geometric mean across the 95%
latency speedup of each model.
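The geometric mean used to summarize per-model speedups (footnote 3) can be computed as in the sketch below. The function name is hypothetical, the computation is done in log space purely as a numerical convenience, and all inputs are assumed to be strictly positive.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-model speedups, computed in log space.
// Assumes every speedup is > 0; returns 0 for an empty input.
double geometric_mean(const std::vector<double>& speedups) {
  if (speedups.empty()) return 0.0;
  double log_sum = 0.0;
  for (double s : speedups) log_sum += std::log(s);
  return std::exp(log_sum / static_cast<double>(speedups.size()));
}
```

Unlike the arithmetic mean, the geometric mean is not dominated by a few very large speedups, which matters here because the largest models show much higher speedups than the smallest ones.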
7. Related Works
Recent related work has explored techniques to enable model
serving at cloud scale. TensorFlow-Serving [54] provides soft
model isolation to guard against concurrently running requests
interfering with each other’s performance. TFX [21] uses a
dedicated thread pool to hide model-loading overhead and provide
thread-level user isolation. Clipper [27] combines concurrent
streams of DL requests into batches to better utilize the GPU
at the cost of longer latency. All of these techniques suffer
from their inability to provide user isolation or handle low
latency inference.
Recent work [56, 44, 55, 31] leverages CUDA IPC to improve
various intra-node and inter-node MPI collectives of a single
process/application, and thus facilitates porting HPC
applications to GPUs and improves their performance.
MVAPICH2 [53], for instance, supports the use of MPI
calls directly over GPU memory. Unlike these works, TrIMS
leverages CUDA IPC to persist data structures across
processes and thus actively seeks to improve I/O and memory
footprint, instead of multi-GPU coordination.
To reduce DL model inference latency and memory requirements,
a large body of work has recently been performed on
compacting and accelerating convolutional neural networks
(CNNs). Quantization [34, 70, 66, 39, 26] reduces the number
of bits required to represent each weight by rescaling the
weights to a domain smaller than the 32 bits required for
floating point representation (usually 8 bits). Network pruning
and sharing [59, 39, 23, 77] reduces redundant parameters to
which the performance is not sensitive. Although these model
optimization techniques can make the I/O vs. compute problem
less severe, they have drawbacks and limited application scope.
Quantized model inference suffers from accuracy loss, while
pruned networks can significantly increase the computation
intensity due to sparsity, especially on GPUs. Moreover, these
techniques currently work only with convolutional layers or
fully connected layers and do not apply to other types of model
inference, such as the fully connected layers used in word
embedding. Model optimizations could also leverage TrIMS,
enabling the sharing of optimized CNN models.
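For readers unfamiliar with quantization, the rescaling of 32-bit weights to 8 bits can be sketched as follows. This shows plain symmetric linear quantization, chosen here only because it is the simplest variant; it is not the specific method of any of the cited works, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Quantized {
  std::vector<std::int8_t> values;  // one byte per weight instead of four
  float scale;                      // real value ~= scale * quantized value
};

// Symmetric linear quantization of float weights to signed 8-bit integers.
Quantized quantize(const std::vector<float>& w) {
  float max_abs = 0.0f;
  for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
  Quantized q;
  q.scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
  q.values.reserve(w.size());
  for (float x : w) {
    long r = std::lround(x / q.scale);        // nearest integer step
    r = std::max(-127L, std::min(127L, r));   // clamp to the int8 range used
    q.values.push_back(static_cast<std::int8_t>(r));
  }
  return q;
}

// Recover an approximate float weight from its quantized form.
float dequantize(const Quantized& q, std::size_t i) {
  return q.scale * static_cast<float>(q.values[i]);
}
```

The rounding step is where the accuracy loss mentioned above comes from: weights that differ by less than one scale step map to the same 8-bit value.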
Although various CPU/GPU virtualization techniques [52,
67, 30] and GPU multi-tenancy [28, 57, 72, 20] can improve
ID Name # Layers ILS MWMF
1 AlexNet [47] 16 516 238
2 GoogLeNet [61] 116 111 27
3 CaffeNet [47] 16 512 233
4 RCNN-ILSVRC13 [33] 16 479 221
5 DPN68 [24] 361 122 49
6 DPN92 [24] 481 340 145
7 Inception-v3 [62] 472 257 92
8 Inception-v4 [60] 747 399 164
9 InceptionBN-v2 [43] 416 313 129
10 InceptionBN-v3 [62] 416 142 44
11 Inception-ResNet-v2 [60] 1102 493 214
12 LocationNet [69] 514 666 285
13 NIN [48] 24 131 29
14 ResNet101 [40] 526 423 170
15 ResNet101-v2 [40] 522 428 171
16 ResNet152 [40] 777 548 231
17 ResNet152-11k [40] 769 721 311
18 ResNet152-v2 [40] 761 340 231
19 ResNet18-v2 [40] 99 154 45
20 ResNet200-v2 [40] 1009 589 248
21 ResNet269-v2 [40] 1346 889 391
22 ResNet34-v2 [40] 179 222 84
23 ResNet50 [40] 268 270 98
24 ResNet50-v2 [40] 259 275 98
25 ResNeXt101 [71] 526 375 170
26 ResNeXt101-32x4d [71] 522 378 170
27 ResNeXt26-32x4d [71] 147 147 59
28 ResNeXt50 [71] 271 222 96
29 ResNeXt50-32x4d [71] 267 224 96
30 SqueezeNet-v1.0 [42] 52 34 4.8
31 SqueezeNet-v1.1 [42] 52 28 4.8
32 VGG16 [58] 32 1228 528
33 VGG16-SOD [75] 32 1198 514
34 VGG16-SOS [74] 32 1195 513
35 VGG19 [58] 38 1270 549
36 WRN50-v2 [73] 267 758 264
37 Xception [25] 236 244 88
Table 3: The small models are popular models used in the
literature and are used as proxy models that offer a wide variety
of sizes and computational complexity. Image classification
models are used since they are the most commonly used.
Both internal layer sizes (ILS) and the model weights memory
footprint (MWMF) are shown in megabytes. The number
of models is chosen so that their total footprint is 2× larger
than the available 16 GB memory on Systems 2 and 3.
ID Name Input Dims MWMF
1 AlexNet-S1 [47] 227×227 238
2 AlexNet-S2 [47] 454×454 770
3 AlexNet-S3 [47] 681×681 1694
4 AlexNet-S4 [47] 908×908 3010
5 VGG16-S1 [58] 224×224 528
6 VGG16-S2 [58] 448×448 1704
7 VGG16-S3 [58] 672×672 3664
8 VGG16-S4 [58] 896×896 6408
Table 4: Large models were used to evaluate our method. The
models were generated by taking AlexNet and VGG16 and
scaling the number of input features. Large models arise in
either medical image analysis, NLP, or time series analysis
where down-sampling decreases the accuracy or the network
requires a large window of features to give accurate results.
system utilization and throughput through time sharing or
parallel sharing of the CPU or GPU, they do not help solve the
model loading overhead within inference processes. TrIMS is
orthogonal to these techniques and can be integrated into
containers as a plugin. Also, in the same way that NVIDIA Volta
added the capability of effectively sharing memory across user
processes without the need for a proxy process (the CUDA MPS
server), the ability to share memory across different VMs
(using a third level of virtual memory translation, as CPUs do)
would enable TrIMS to work across VMs.
8. Conclusion
Collocating compute with model serving within FaaS overcomes
the network barrier but suffers from high “cold start”
latency. We propose TrIMS to mitigate the major source of
“cold start” latency, the model loading overhead, and to
make building complex latency sensitive pipelines with modular
DL components feasible. We do so by decoupling compute
from model persistence and leveraging model sharing across
user pipelines. TrIMS moves the bottleneck of DL model
inference to compute, thus making GPU acceleration more
appealing and making specialized novel inference architectures
more tractable.
TrIMS was evaluated on three systems that represent current
cloud system offerings. We used 45 DL models and show
a speedup of up to 24× for small models and up to 210×
for large models. When running concurrent inference, we
can increase the overall system throughput by up to 8×. Our
methodology, when applied to DL frameworks, offers advantages
to both cloud providers and users. The isolation, along
with the significant memory reduction through model sharing,
enables cloud providers to over-provision hardware resources,
thus decreasing the total cost of ownership. The benefits of
TrIMS to the cloud providers can be passed down to the users
in the form of reduced latency or cost of inference.
TrIMS is a generic memory sharing technique that can be
used when computation requires a large number of constant
parameters to be in situ on the CPU or GPU, while still
maintaining isolation between users. As such, the proposed method
can be easily generalized to any application or algorithm that
spans multiple processes and requires a large amount of constant
data resources. While we motivated our work with deep
learning, other types of applications such as image processing,
physical simulation, or in-memory databases can benefit from
our approach.
References
[1] Amazon Lambda. http://aws.amazon.com/lambda. Accessed:
2018-8-04.
[2] Amazon Rekognition. https://aws.amazon.com/rekognition.
Accessed: 2018-8-04.
[3] Amazon SageMaker. https://aws.amazon.com/
machine-learning. Accessed: 2018-8-04.
[4] Azure Cognitive Services. https://azure.microsoft.com/
en-us/services/cognitive-services. Accessed: 2018-8-04.
[5] Azure Functions. https://azure.microsoft.com/en-us/
services/functions. Accessed: 2018-8-04.
[6] CUDA Unified Memory. https://devblogs.nvidia.com/tag/
unified-memory. Accessed: 2018-8-04.
[7] Google Cloud AI. https://cloud.google.com/products/
machine-learning. Accessed: 2018-8-04.
[8] Google Cloud Functions. https://cloud.google.com/
functions. Accessed: 2018-8-04.
[9] Google Protocol Buffers. https://developers.google.com/
protocol-buffers. Accessed: 2018-8-04.
[10] IBM OpenWhisk. http://www.ibm.com/cloud-computing/
bluemix/openwhisk. Accessed: 2018-8-04.
[11] IBM Watson. https://www.ibm.com/watson. Accessed: 2018-8-
04.
[12] Machine Learning on AWS. https://aws.amazon.com/
machine-learning. Accessed: 2018-8-04.
[13] Node allocation for online prediction. https://cloud.google.
com/ml-engine/docs/tensorflow/prediction-overview#
node-allocation. Accessed: 2018-8-04.
[14] Nvidia Inference Technical Overview. https://images.nvidia.
com/content/pdf/inference-technical-overview.pdf. Ac-
cessed: 2018-8-04.
[15] Online versus Batch Prediction. https://cloud.google.com/
ml-engine/docs/tensorflow/online-vs-batch-prediction.
Accessed: 2018-8-04.
[16] Requesting Real-time Predictions. https://docs.
aws.amazon.com/machine-learning/latest/dg/
requesting-real-time-predictions.html. Accessed:
2018-8-04.
[17] TensorFlow Serving. https://www.tensorflow.org/serving.
Accessed: 2018-8-04.
[18] Using the Model to Make Predictions. https:
//docs.aws.amazon.com/machine-learning/latest/dg/
using-the-model-to-make-predictions.html. Accessed:
2018-8-04.
[19] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv-
ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga,
Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Va-
sudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
Tensorflow: A system for large-scale machine learning. In Proceedings
of the 12th USENIX Conference on Operating Systems Design and
Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016.
USENIX Association.
[20] Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata
Ghose, Jayneel Gandhi, Adwait Jog, Christopher J Rossbach, and Onur
Mutlu. Mask: Redesigning the gpu memory hierarchy to support
multi-application concurrency. In Proceedings of the Twenty-Third
International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 503–518. ACM, 2018.
[21] Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu
Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent
Koc, et al. Tfx: A tensorflow-based production-scale machine learning
platform. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 1387–
1395. ACM, 2017.
[22] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang,
Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet:
A flexible and efficient machine learning library for heterogeneous
distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[23] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and
Yixin Chen. Compressing neural networks with the hashing trick. In
International Conference on Machine Learning, pages 2285–2294,
2015.
[24] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan,
and Jiashi Feng. Dual path networks. In Advances in Neural Informa-
tion Processing Systems, pages 4470–4478, 2017.
[25] François Chollet. Xception: Deep learning with depthwise separable
convolutions. arXiv preprint, 2016.
[26] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Bina-
ryconnect: Training deep neural networks with binary weights during
propagations. In Advances in neural information processing systems,
pages 3123–3131, 2015.
[27] Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin,
Joseph E Gonzalez, and Ion Stoica. Clipper: A low-latency online
prediction serving system. In NSDI, pages 613–627, 2017.
[28] CUDA Multi-Process Service(MPS). https://docs.nvidia.com/
deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf.
Accessed: 2017-3-30.
[29] Thomas W Dinsmore. In-memory analytics. In Disruptive Analytics,
pages 97–116. Springer, 2016.
[30] Jose Duato, Antonio J Pena, Federico Silla, Juan C Fernandez, Rafael
Mayo, and Enrique S Quintana-Orti. Enabling cuda acceleration within
virtual machines using rcuda. In High Performance Computing (HiPC),
2011 18th International Conference on, pages 1–10. IEEE, 2011.
[31] Iman Faraji and Ahmad Afsahi. Hyper-q aware intranode mpi collec-
tives on the gpu. In Proceedings of the First International Workshop
on Extreme Scale Programming Models and Middleware, pages 47–50.
ACM, 2015.
[32] Denis Foley and John Danskin. Ultra-performance pascal gpu and
nvlink interconnect. IEEE Micro, 37(2):7–17, 2017.
[33] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmen-
tation. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 580–587, 2014.
[34] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Com-
pressing deep convolutional networks using vector quantization. arXiv
preprint arXiv:1412.6115, 2014.
[35] Google Translate: Breaking language barriers in emerging markets.
https://goo.gl/TkffQq. Accessed: 2017-18-04.
[36] Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and
Janardhan Kulkarni. Do the hard stuff first: Scheduling dependent com-
putations in data-analytics clusters. arXiv preprint arXiv:1604.07371,
2016.
[37] Robert Grandl, Arjun Singhvi, and Aditya Akella. F2: Separating com-
pute from data in cluster computing. arXiv preprint arXiv:1703.10272,
2017.
[38] gRPC. https://www.grpc.io. Accessed: 2018-8-04.
[39] Song Han, Huizi Mao, and William J Dally. Deep compression: Com-
pressing deep neural networks with pruning, trained quantization and
huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–778,
2016.
[41] Maurice Herlihy, Nir Shavit, and Moran Tzafrir. Hopscotch hashing.
In International Symposium on Distributed Computing, pages 350–364.
Springer, 2008.
[42] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf,
William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and <0.5 MB model size. arXiv preprint
arXiv:1602.07360, 2016.
[43] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. arXiv
preprint arXiv:1502.03167, 2015.
[44] Feng Ji, Ashwin M Aji, James Dinan, Darius Buntinas, Pavan Bal-
aji, Rajeev Thakur, Wu-chun Feng, and Xiaosong Ma. Dma-assisted,
intranode communication in gpu accelerated systems. In High Perfor-
mance Computing and Communication & 2012 IEEE 9th International
Conference on Embedded Software and Systems (HPCC-ICESS), 2012
IEEE 14th International Conference on, pages 461–468. IEEE, 2012.
[45] Yangqing Jia. Caffe2. https://www.caffe2.ai, 2017.
[46] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Dar-
rell. Caffe: Convolutional architecture for fast feature embedding. In
Proceedings of the 22nd ACM international conference on Multimedia,
pages 675–678. ACM, 2014.
[47] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In Advances in
neural information processing systems, pages 1097–1105, 2012.
[48] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv
preprint arXiv:1312.4400, 2013.
[49] Ji Liu, Esther Pacitti, and Patrick Valduriez. A survey of schedul-
ing frameworks in big data systems. International Journal of Cloud
Computing, pages 1–27, 2017.
[50] Lei Lu, Hui Zhang, Evgenia Smirni, Guofei Jiang, and Kenji Yoshi-
hira. Predictive vm consolidation on multiple resources: Beyond load
balancing. In Quality of Service (IWQoS), 2013 IEEE/ACM 21st Inter-
national Symposium on, pages 1–10. IEEE, 2013.
[51] Dirk Merkel. Docker: lightweight linux containers for consistent
development and deployment. Linux Journal, 2014(239):2, 2014.
[52] Roberto Morabito, Jimmy Kjällman, and Miika Komu. Hypervisors
vs. lightweight virtualization: a performance comparison. In Cloud
Engineering (IC2E), 2015 IEEE International Conference on, pages
386–393. IEEE, 2015.
[53] Mvapich2. http://mvapich.cse.ohio-state.edu/userguide/
gdr/2.2. Accessed: 2017-3-30.
[54] Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen,
Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan
Soyke. Tensorflow-serving: Flexible, high-performance ml serving.
arXiv preprint arXiv:1712.06139, 2017.
[55] Antonio J Pena and Sadaf R Alam. Evaluation of inter-and intra-node
data transfer efficiencies between gpu devices and their impact on
scalable applications. In Cluster, Cloud and Grid Computing (CCGrid),
2013 13th IEEE/ACM International Symposium on, pages 144–151.
IEEE, 2013.
[56] Sreeram Potluri, Hao Wang, Devendar Bureddy, Ashish Kumar Singh,
Carlos Rosales, and Dhabaleswar K Panda. Optimizing mpi communi-
cation on multi-gpu systems using cuda inter-process communication.
In Parallel and Distributed Processing Symposium Workshops & PhD
Forum (IPDPSW), 2012 IEEE 26th International, pages 1848–1857.
IEEE, 2012.
[57] Dipanjan Sengupta, Raghavendra Belapure, and Karsten Schwan.
Multi-tenancy on gpgpu-based servers. In Proceedings of the 7th
international workshop on Virtualization technologies in distributed
computing, pages 3–10. ACM, 2013.
[58] Karen Simonyan and Andrew Zisserman. Very deep convolu-
tional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[59] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for
deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[60] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A
Alemi. Inception-v4, inception-resnet and the impact of residual con-
nections on learning. In AAAI, volume 4, page 12, 2017.
[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew
Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
[62] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2818–2826, 2016.
[63] Nathan R Tallent, Nitin A Gawande, Charles Siegel, Abhinav Vishnu,
and Adolfy Hoisie. Evaluating on-node gpu interconnects for deep
learning workloads. In International Workshop on Performance Mod-
eling, Benchmarking and Simulation of High Performance Computer
Systems, pages 3–21. Springer, 2017.
[64] Arnold Tharrington, Wael R Elwasif, and Don Maxwell. Experiences
evaluating functionality and performance of ibm power8+ systems. In
High Performance Computing: ISC High Performance 2017 Interna-
tional Workshops, DRBSD, ExaComm, HCPM, HPC-IODC, IWOPH,
IXPUG, P^3MA, VHPC, Visualization at Scale, WOPSSS, Frankfurt,
Germany, June 18-22, 2017, Revised Selected Papers, volume 10524,
page 254. Springer, 2017.
[65] Lisa Torrey and Jude Shavlik. Transfer learning. Handbook of Research
on Machine Learning Applications and Trends: Algorithms, Methods,
and Techniques, 1:242, 2009.
[66] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving
the speed of neural networks on cpus. In Proc. Deep Learning and
Unsupervised Feature Learning NIPS Workshop, volume 1, page 4.
Citeseer, 2011.
[67] Virtual gpu.
https://www.nvidia.com/en-us/design-visualization/technologies/virtual-gpu.
Accessed: 2017-3-30.
[68] RL Vogt, PR Kotta, and CN Meissner. Science and technology re-
view march 2017. Technical report, Lawrence Livermore National
Laboratory (LLNL), Livermore, CA, 2017.
[69] Tobias Weyand, Ilya Kostrikov, and James Philbin. Planet-photo geolo-
cation with convolutional neural networks. In European Conference
on Computer Vision, pages 37–55. Springer, 2016.
[70] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng.
Quantized convolutional neural networks for mobile devices. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4820–4828, 2016.
[71] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming
He. Aggregated residual transformations for deep neural networks.
In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE
Conference on, pages 5987–5995. IEEE, 2017.
[72] Tsung Tai Yeh, Amit Sabne, Putt Sakdhnagool, Rudolf Eigenmann,
and Timothy G Rogers. Pagoda: Fine-grained gpu resource virtual-
ization for narrow tasks. In Proceedings of the 22nd ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, pages
221–234. ACM, 2017.
[73] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.
arXiv preprint arXiv:1605.07146, 2016.
[74] Jianming Zhang, Shugao Ma, Mehrnoosh Sameki, Stan Sclaroff,
Margrit Betke, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir
Mech. Salient object subitizing. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 4045–4054,
2015.
[75] Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price,
and Radomir Mech. Unconstrained salient object detection via pro-
posal subset optimization. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 5733–5742, 2016.
[76] Michael Zheludkov, Timur Isachenko, et al. High Performance in-memory
computing with Apache Ignite. Lulu.com, 2017.
[77] Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards
compact cnns. In European Conference on Computer Vision, pages
662–677. Springer, 2016.
