EmBench: Quantifying Performance Variations of Deep Neural Networks
  across Modern Commodity Devices by Almeida, Mario et al.
PREPRINT: Accepted at the 3rd International Workshop on Embedded and Mobile Deep Learning (EMDL), 2019
EmBench: Quantifying Performance Variations of
Deep Neural Networks across Modern Commodity Devices
Mario Almeida†*, Stefanos Laskaridis†*
Ilias Leontiadis†*, Stylianos I. Venieris†*, Nicholas D. Lane†,‡
†Samsung AI Center, Cambridge ‡University of Oxford
* Indicates equal contribution.
{mario.a,stefanos.l,i.leontiadis,s.venieris,nic.lane}@samsung.com
ABSTRACT
In recent years, advances in deep learning have resulted in un-
precedented leaps in diverse tasks spanning from speech and object
recognition to context awareness and health monitoring. As a re-
sult, an increasing number of AI-enabled applications are being
developed targeting ubiquitous and mobile devices. While deep
neural networks (DNNs) are getting bigger and more complex, they
also impose a heavy computational and energy burden on the host
devices, which has led to the integration of various specialized pro-
cessors in commodity devices. Given the broad range of competing
DNN architectures and the heterogeneity of the target hardware,
there is an emerging need to understand the compatibility between
DNN-platform pairs and the expected performance benefits on each
platform. This work attempts to demystify this landscape by sys-
tematically evaluating a collection of state-of-the-art DNNs on a
wide variety of commodity devices. In this respect, we identify
potential bottlenecks in each architecture and provide important
guidelines that can assist the community in the co-design of more
ecient DNNs and accelerators.
KEYWORDS
Deep neural networks; on-device inference; mobile devices
ACM Reference Format:
Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris,
Nicholas D. Lane. 2019. EmBench: Quantifying Performance Variations of
Deep Neural Networks across Modern Commodity Devices. In Proc. of
The 3rd International Workshop on Deep Learning for Mobile Systems and
Applications, (EMDL’19), June 21, 2019, Seoul, Republic of Korea. ACM, NY,
NY, USA. 6 pages. DOI: https://doi.org/10.1145/3325413.3329793
1 INTRODUCTION
With a demonstrated state-of-the-art accuracy in a wide range of
AI tasks, the popularity of deep neural networks (DNNs) is on the
rise. Since 2012 and the introduction of AlexNet [17], a myriad
Model                 Year  FLOPs (M)  Params (M)  Top-1 acc. (%)  Top-5 acc. (%)
AlexNet [17]          2012     955.21       61.10           56.52           79.07
VGG 11 [24]           2014    8171.57      132.86           69.02           88.63
VGG 13 [24]           2014   11895.04      133.05           69.93           89.25
VGG 16 [24]           2014   16063.36      138.36           71.59           90.38
VGG 19 [24]           2014   20231.68      143.67           72.38           90.88
ResNet 18 [11]        2015    1836.82       11.69           69.76           89.08
ResNet 34 [11]        2015    3692.78       21.80           73.31           91.42
ResNet 50 [11]        2015    4154.96       25.56           76.13           92.86
ResNet 101 [11]       2015    7892.77       44.55           77.37           93.55
ResNet 152 [11]       2015   11636.60       60.19           78.31           94.05
Inception-v3 [25]     2015    5730.17       27.16           75.64           92.59
Inception-v4 [25]     2015   12561.10       42.68           80.08           94.89
SqueezeNet-v1 [14]    2016     865.78        1.25           58.09           80.42
SqueezeNet-v1.1 [14]  2016     377.80        1.24           58.18           80.62
DenseNet 121 [12]     2017    2928.89        7.98           74.43           91.97
DenseNet 169 [12]     2017    3473.88       14.15           75.60           92.81
DenseNet 201 [12]     2017    4435.03       20.01           76.87           93.37
DenseNet 161 [12]     2017    7902.37       28.68           77.14           93.56
Xception [6]          2017    8494.59       22.86           78.89           94.29
MobileNetV2 [23]      2017     336.43        3.50           71.81           90.42
ShuffleNet-v2.05 [20] 2018      52.32        1.37           60.55           81.75
ShuffleNet-v2.1 [20]  2018     160.09        2.28           69.36           88.32
MnasNet [26]          2018     649.51        4.38           61.95           84.73
PNASNet [19]          2018   25945.87       86.06           82.74           95.99
NasNet [29]           2018   24882.21       88.75           82.51           96.02
NasNet mobile [29]    2018     667.75        5.29           74.08           91.74
Table 1: DNN benchmarks.
of models have been competing for improved predictive power
(Table 1). Nevertheless, accuracy gains have often been achieved
at the expense of an increase in model complexity, inference time
and resource requirements. With DNN models becoming ubiqui-
tous across multiple scenarios and compute devices, from large-
scale cloud services [7] to resource-constrained mobile systems
[18], predicting the processing performance of each DNN becomes
a challenging task. Given the wide range of competing DNN archi-
tectures and the heterogeneity of the target hardware, there is an
increasing need to gain insight into how and why different design
decisions impact the accuracy and performance of these networks
upon deployment.
EmBench aims to provide a comprehensive analysis of widely
used deep neural networks, with a focus on evaluating which mod-
els thrive under which target platforms, while identifying their
bottlenecks and sources of inefficiency. To this end, we analyze a
set of popular DNNs (Table 1) targeting various compute platforms
(Table 2). First, we perform a macro analysis of these networks in
terms of their complexity, inference latency and throughput for
multiple batch sizes. Then, we perform a deeper analysis of the
most prominent network operations, with a focus on detecting
non-trivial differences between execution across the target platforms.
More specifically, we provide the following contributions:
• We demonstrate that different networks are handled quite
differently by each target platform, making it challenging
to design an efficient model in a hardware-agnostic manner.
• We provide insights about how different batch sizes can
affect performance on five different hardware architectures.
• We analyze the inference latency of all networks and show
that the trade-off between actual processing speed and
accuracy depends on the underlying hardware and its
optimizations.
• We break down the overall DNN workload into individ-
ual operations and unveil any opportunities for further
improvements on each platform.
2 RELATED WORK
So far, a few studies have focused on analyzing the system-level
properties of DNNs on deployment platforms. Canziani et al. [4]
presented a system-level analysis of 14 convolutional neural net-
works (CNNs) on the NVIDIA Jetson TX1 platform. Despite the fact
that the analysis spanned across multiple metrics, the study was
conducted over a limited number of networks and targeted solely
a single platform. Bianco et al. [2] extended the covered space by
evaluating a wider range of networks and targeting one embedded
(Jetson TX1) and one high-end compute platform (NVIDIA Titan X
GPU). Both studies conducted an analysis of the selected networks
across multiple dimensions, including accuracy, compute speed,
memory footprint and power consumption. Nevertheless, by in-
cluding a total of two platforms –and given the heterogeneity of
currently available devices– the presented insights are not directly
transferable to platforms with different characteristics.
In this work, we expand to a broad range of both high- and
low-end devices, spanning from the latest server-grade RTX 2080
Ti GPU and the embedded NVIDIA Jetson AGX down to the
mobile-ready Qualcomm Kryo 385 CPU and the low-power Intel
Neural Compute Stick 2. Furthermore, we present a microscopic
view of how well different layer types are mapped to each hardware
architecture, aiming to provide insights for the hardware-aware
design of novel DNNs.
In a slightly different setting, Huang et al. [13] concentrated on
the task of object detection and evaluated a wide set of CNN-based
object detectors in terms of processing performance and detection
accuracy. With a focus on the mobile space, Ignatov et al. [15]
assembled a benchmark suite of representative AI tasks to assess
the processing capabilities of currently available smartphones. In
this paper, we adopt a wider scope than [13] by treating network
architectures in a task-agnostic manner and target more diverse
families of devices compared to [15].
Last, Zhang et al. [28] study the key performance differences
among different machine learning frameworks across different edge
platforms. In contrast, we treat the framework as an invariant and focus our
endeavours on the inference behaviour of the devices at hand for a
significantly greater variety of models.
Platform                  Cores   Clock freq. (GHz)  Memory (GB)  Technology (nm)  TDP (W)
Intel Xeon 4116*          12      2.1                256          14               85
NVIDIA RTX 2080 Ti        512†    1.5                11           12               250
Qualcomm Kryo 385 CPU     4+4**   2.8 + 1.8          6            10               5
Intel NCS 2               16‡     0.7                0.5          16               1
NVIDIA Tegra Xavier GPU   512†    1.3                16           12               30
* HyperThreading enabled.
** 4 high-performance ARM A75 + 4 high-efficiency ARM A55.
† 512 CUDA cores w/ 64 Tensor Cores.
‡ Movidius SHAVE cores.
Table 2: Evaluated platforms.
3 HARDWARE PLATFORMS
The large compute and memory demands of modern DNN work-
loads have led to an emergence of specialized processors with the
goal to facilitate their high-performance deployment. Depending on
the target scenario, each platform has employed different hardware
optimizations to satisfy system-level constraints, including latency,
throughput, temperature, power dissipation and form factor.
In desktop and server environments, high-end devices are typ-
ically employed in order to maximize throughput at the penalty
of substantial power consumption. In this context, powerful –and
massively parallel– but power-costly platforms have been designed.
A representative example is the latest NVIDIA GeForce RTX 2080
Ti GPU which is based on the NVIDIA Turing architecture. By in-
troducing the 2nd generation of Tensor Cores –a set of specialized
hardware units tailored for DNN processing– this particular GPU
provides hardware support for 16-bit floating-point as well as 8-
and 4-bit fixed-point precision and enables the highly optimized
execution of matrix operations with mixed-precision arithmetic.
With a large bandwidth of 616 GB/s to an 11 GB off-chip RAM, the
RTX 2080 Ti GPU has been optimized for high throughput, reaching
its peak performance when processing inputs in batches, yielding
up to 13.4 TFLOP/s (FP32) at the cost of a 250-watt peak power.
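As a rough illustration of why batching matters on this GPU, the two peak figures quoted above can be combined into a roofline-style ridge point. The roofline model is not part of the methodology described here; it is only an assumed first-order approximation, and the function names below are hypothetical.

```python
# Roofline-style estimate built from the peak figures quoted in the text.
# Assumption: a simple roofline model, which the original study does not use.

PEAK_FLOPS = 13.4e12     # FP32 peak of the RTX 2080 Ti in FLOP/s (from the text)
PEAK_BANDWIDTH = 616e9   # off-chip memory bandwidth in bytes/s (from the text)


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Upper bound on achieved FLOP/s for a kernel with the given intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
print(f"ridge point: {ridge:.1f} FLOP/byte")
```

Under this simplified model, a kernel would need roughly 22 FLOPs per byte fetched from off-chip memory before the compute units, rather than the memory system, become the limit. Batching raises the reuse of each fetched weight, which is one plausible reading of why peak performance is reached with batched inputs.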
On the other end of the spectrum, severely power-constrained
systems such as IoT and mobile devices are equipped with pro-
cessing units that comply with the respective thermal and form-
factor constraints. In this space, low-power miniaturized SoCs,
such as the Qualcomm Snapdragon™ 845 (SDM845), have been
explicitly designed with the objective to provide the processing
support for on-device DNNs while respecting the typical <10 watts
power budget of modern consumer and robotic systems. Further
towards customization, neural accelerators offer sub-watt solutions
for rapid DNN inference in severely constrained embedded systems.
To this end, full-custom chips have been designed with the goal
to extract the highest possible performance at a minimal power
dissipation. A representative and widely available instance is the
Intel Neural Compute Stick (NCS) 2 mounting the Movidius Myriad
X accelerator, delivering up to 1 TOP/s under 1 watt.
Targeting a mid-range power envelope of a few tens of watts,
which is typical in complex embedded systems such as driverless
cars and robots, several devices have been designed to reach a
configurable trade-off between power consumption and processing
speed. A representative example is the latest NVIDIA Jetson Xavier
SoC. Jetson Xavier hosts a 512-core NVIDIA Volta GPU with 64
1st-generation Tensor Cores. To support different power budgets while
sustaining high performance, the platform can be configured with
a range of frequencies up to 1.3 GHz at a peak TDP of 30 watts.
[Figure 1 plot: log-log scatter of inference time (s) against FLOPs (billions), one marker style per network.]
Figure 1: Runtime vs. number of FLOPs on a single NVIDIA
RTX 2080 Ti GPU. Different markers represent different networks.
Each network has an increasing amount of FLOPs
based on the batch size (i.e., FLOPs/inference × batch size).
Given the diversity of DNN-enabled applications, the variability
across deployment scenarios, network architectures and hardware
platforms leads to an emerging need for evaluating the compatibility
between dierent network-platform pairs. In this respect, Section 4
examines the eciency of mapping a substantially comprehensive
range of networks (Table 1) on each platform of Table 2 with the
potential to guide both novel network and hardware designs.
4 EVALUATION
In this section, we analyze the data that we have collected by
benchmarking the set of networks from Table 1 targeting various
commodity compute platforms (Table 2). Each configuration comprises
a choice of i) DNN, ii) batch size and iii) target platform. For each
configuration, we initially perform a macroscopic analysis with
respect to complexity, measured in number of floating-point
operations, and achieved latency on the selected device. We further
analyze the latency of each operation type within the network to
identify processing bottlenecks and compare the computational
efficiency across the target devices.
Benchmarking procedure. Following the effort of the Open
Neural Network Exchange (ONNX) format to provide a uniform interface
among deep-learning frameworks, we adopted Facebook’s
machine-learning toolchain (PyTorch, Caffe2, FAI-PEP) for the majority of
our experiments due to its imperative interface and support for
ONNX. First, the pretrained versions of all DNNs were collected
in a PyTorch format. Specifically, for the experiments on a
workstation hosting the Xeon 4116 CPU and RTX 2080 Ti GPU, and on
the NVIDIA Jetson Xavier SoC, PyTorch¹ v1.0.0 with CUDA 10 and
cuDNN 7.5 is used directly. The two platforms run GNU/Linux
Ubuntu 18.04 LTS, compiled for x86-64 and 64-bit ARM, respectively.
For the Xavier SoC, we set the GPU frequency to its maximum
setting to obtain the peak performance. On the mobile side, the
evaluated networks were converted to Caffe2 to run on SDM845’s
¹ https://pytorch.org/
[Figure 2 plot: one panel per evaluated network; x-axis: batch_size (log scale), y-axis: inferences/s (log scale); one series per target: GPU-NVIDIA-2080Ti, GPU-NVIDIA-Xavier, CPU-XEON-4116, Intel-NCS2, CPU-SDM845, CPU-SDM845-NNPACK.]
Figure 2: Achieved throughput of evaluated DNNs on different
platforms with increasing batch size.
Kryo CPU, running Android 9 (Pie). The Caffe2 backend for mobile
platforms was configured to employ the highly optimized
implementations of the NNPACK² package. To systematically measure
on-device performance, we employ Facebook’s AI Performance
Evaluation Platform³ (FAI-PEP). Finally, when targeting Intel NCS
2, the networks were converted to ONNX and subsequently to the
Intel Movidius graph file format through the Intel OpenVINO™
toolkit. To time the execution of DNNs on NCS 2, we use the
program counters of the Myriad X chip.
² https://github.com/Maratyszcza/NNPACK
³ https://github.com/facebook/FAI-PEP
[Figure 3 plot: six panels (CPU-XEON-4116, GPU-NVIDIA-2080Ti, GPU-NVIDIA-Xavier, NCS2, SDM845-CPU, SDM845-CPU (nnpack)); x-axis: latency (ms), y-axis: accuracy; labelled networks include alexnet, densenet201, mnasnet, mobilenet_v2, nasnet_mobile, resnet152, shufflenet_v2_05 and vgg16.]
Figure 3: Inferences-per-second vs. accuracy for various target platforms (batch size = 1). The marker size and color represent
the number of FLOPs and parameters, respectively, in each of the evaluated networks.
Across all platforms, each experiment is run 50 times, with ten
warm-up runs for a uniform initial cache state and a cool-off period
of 2 seconds between runs to avoid frequency scaling due to heat,
and the minimum latency achieved is reported.
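The measurement protocol described above can be sketched as a small timing harness. This is a minimal sketch rather than the FAI-PEP implementation: the function name `benchmark` and its parameters are hypothetical, and it assumes the ten warm-up runs are performed in addition to the 50 timed runs, which the text does not state explicitly.

```python
import time
from typing import Callable, List


def benchmark(run_once: Callable[[], None],
              measured_runs: int = 50,
              warmup_runs: int = 10,
              cooloff_s: float = 2.0,
              sleep: Callable[[float], None] = time.sleep) -> float:
    """Return the minimum latency in seconds over `measured_runs` timed runs.

    `run_once` must block until the inference result is ready (on a GPU this
    would mean including a device synchronisation inside it).
    """
    # Warm-up runs are executed but not timed (assumed to be extra runs).
    for _ in range(warmup_runs):
        run_once()

    latencies: List[float] = []
    for i in range(measured_runs):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
        # Cool-off between consecutive runs to limit thermal throttling.
        if i < measured_runs - 1:
            sleep(cooloff_s)

    # The minimum is reported, as in the procedure described in the text.
    return min(latencies)
```

Passing the `sleep` function as a parameter is a design choice made here so the harness can be exercised quickly with a no-op sleep; a real run would keep the default.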
FLOPs analysis. Popular DNNs tend to vary quite a lot in
terms of their complexity. As seen in Table 1, some networks
require up to 3 orders of magnitude more floating-point operations
(FLOPs) to perform image recognition over the same dataset (e.g.,
ImageNet [10]), often with little to no improvement in accuracy,
such as in the case of MobileNetV2 vs. VGG16.
We first investigate if the FLOPs are a good indicator of inference
time. Figure 1 shows how the actual inference time on the 2080
Ti GPU varies with respect to FLOPs. The amount of FLOPs on
the x-axis is the network’s FLOPs times the batch size. We used
powers of two for the batch sizes up to 512. We can see that for
networks with similar FLOPs, the actual GPU time can vary by at
least one order of magnitude. Nonetheless, for the same network,
the GPU time grows slowly with the batch size up to a point where
time growth becomes almost exponential. At this point, the GPU
resources are fully utilized and the cost of further batching is no
longer amortized.
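To make the x-axis of Figure 1 concrete, the sketch below computes FLOPs per inference times batch size for a few networks from Table 1 and derives an achieved compute rate from a measured time. The helper names are hypothetical, and any time passed to `achieved_gflops_per_s` is a placeholder supplied by the caller, not a value from this study.

```python
# Per-inference FLOPs in millions, copied from Table 1.
FLOPS_M = {
    "mobilenet_v2": 336.43,
    "resnet50": 4154.96,
    "vgg16": 16063.36,
}

# Powers of two up to 512, as used for the batch sizes in the text.
BATCH_SIZES = [2 ** k for k in range(10)]


def total_gflops(network: str, batch_size: int) -> float:
    """x-axis value of Figure 1: FLOPs/inference x batch size, in billions."""
    return FLOPS_M[network] * batch_size / 1e3


def achieved_gflops_per_s(network: str, batch_size: int,
                          measured_time_s: float) -> float:
    """Achieved compute rate for one batch, given a measured batch time."""
    return total_gflops(network, batch_size) / measured_time_s


for name in FLOPS_M:
    print(name, [round(total_gflops(name, b), 2) for b in BATCH_SIZES])
```

Comparing `achieved_gflops_per_s` across batch sizes for a single network is one way to see where the device saturates: the rate would be expected to climb while the hardware is underutilized and flatten once it is fully used.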
Impact of batch size. Figure 2 depicts the effect of batch size
on the achieved throughput across models and platforms. On the
one side of the spectrum, given the memory-abundant workstation
setup, the highest performing batch sizes for the high-end 2080
Ti GPU typically lie between 128 and 256. At this point, the GPU
reaches the peak utilization of its resources and, consequently, after
that the overhead of further batching is no longer amortized. In
this case, the GPU is able to exploit the intra-batch parallelism of
large batches and boost the achieved throughput, while not being
limited by the available RAM. On the other hand, although the
Xeon 4116 CPU, like the high-end GPU, has no significant memory
constraints, it rarely improves its throughput beyond a batch
size of 32, at which point its resources, which are fewer than those
of the resource-rich GPU, are typically already fully utilized.
On the other side of the spectrum, large batches tend to ex-
haust the memory-limited devices. In this respect, the mobile Kryo
385 CPU (CPU-SDM845 in Figure 2) shows modest throughput
improvement for increasing batch size due to both the reduced
memory bandwidth and amount of compute resources compared
to its high-end counterparts, and its maximum batch size is
typically up to 32 due to the small off-chip memory capacity. Note
that using NNPACK’s CPU optimized convolution layers (CPU-
SDM845-NNPACK in Figure 2) improves the number of inferences
per second by 2.97× on average and up to 4.53× for densenet121.
Moreover, NCS 2 outperforms its mobile counterpart across the
evaluated networks, despite the fact that it was designed to operate
with a batch size of 1. Finally, the embedded Tegra Xavier GPU
demonstrates a similar pattern to RTX 2080 Ti by being able to
exploit batch processing to improve its throughput, although its
median optimal batch size is around 64. The highest throughput
is observed at the point where its resources are fully utilized, after
which further batching degrades the achieved throughput due to
excessive uncompensated overhead.
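The notion of an optimal batch size used above can be sketched as follows: throughput is the batch size divided by the time to process one batch, and the optimal batch size is the one that maximizes it. The latencies in this example are hypothetical placeholders chosen only to show the saturation pattern; they are not measurements from this study.

```python
from typing import Dict


def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Inferences per second when one batch takes `batch_latency_s` seconds."""
    return batch_size / batch_latency_s


def best_batch_size(batch_latencies_s: Dict[int, float]) -> int:
    """Batch size with the highest throughput among the measured ones."""
    return max(batch_latencies_s,
               key=lambda b: throughput(b, batch_latencies_s[b]))


# Hypothetical per-batch latencies (seconds), NOT measured values.
HYPOTHETICAL_LATENCY_S = {1: 0.004, 8: 0.008, 32: 0.020,
                          128: 0.064, 256: 0.140, 512: 0.320}

for b, t in HYPOTHETICAL_LATENCY_S.items():
    print(f"batch {b:>3}: {throughput(b, t):8.1f} inferences/s")
print("best batch size:", best_batch_size(HYPOTHETICAL_LATENCY_S))
```

With these placeholder numbers the throughput rises up to a batch size of 128 and then falls, mirroring the qualitative behaviour described for the GPUs.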
Accuracy vs. time. Figure 3 shows the accuracy and achieved
latency of selected representative networks as measured across the
evaluated platforms. To investigate the trade-off between accuracy
and achieved processing speed, we analyze the inference latency
(i.e., batch size of 1) against the accuracy of these networks. The
[Figure 4 plot: six stacked horizontal bar charts (CPU-XEON-4116, GPU-NVIDIA-2080Ti, GPU-NVIDIA-Xavier, NCS2, SDM845-CPU, SDM845-CPU (nnpack)); x-axis: fraction of inference time (0 to 1), y-axis: network; bars are split by operation type (e.g., Conv2d, BatchNorm2d, Linear, MaxPool2d, ReLU, AvgPool2d, other).]
Figure 4: Percentage of time spent per layer for various target platforms (batch size = 1). Only a sample of networks is shown.
results (Figure 3) demonstrate that there are significant differences
among the evaluated hardware platforms.
In the workstation setting, we observe that the server-grade CPU
is sub-optimal when it comes to supporting very deep networks (e.g.,
the ResNet family) or those with large number of FLOPs (e.g., VGG).
As a result, none of the state-of-the-art networks could sustain
more than 15 FPS. Instead, the RTX 2080 Ti achieves a throughput
improvement over the Xeon 4116 CPU in the range 7×-50× with an
average of 15× across the networks. In particular, networks such as
ResNet yield a speedup of ≈ 14×, whereas simpler networks such as
AlexNet can handle an impressive 680 FPS. Similarly, with respect
to power eciency, RTX 2080 Ti yields an average improvement of
5.3× in inferences/s/W over the Xeon CPU.
In the <30 watts range, the Tegra Xavier GPU manages to
outperform the Xeon CPU with raw speedups between 2×-15× (5.1×
average) and yields an average power efficiency gain of 14.4× across
the networks. Compared to the RTX 2080 Ti GPU, although the
Tegra Xavier GPU reaches between 26-37% (34% average) of its raw
performance, it achieves a 2.8× average improvement in power
efficiency, demonstrating its suitability for applications where
inferences/s/W is the primary metric of interest.
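The relation between raw speedup and power-efficiency gain can be checked with a short calculation. This sketch assumes that each platform draws exactly its TDP from Table 2; the text does not state how power was obtained, so this is an assumption made for illustration, and the function name is hypothetical.

```python
# TDP values in watts, copied from Table 2.
TDP_W = {
    "xeon_4116": 85,
    "rtx_2080_ti": 250,
    "tegra_xavier": 30,
}


def efficiency_gain(speedup: float, device: str, baseline: str) -> float:
    """Gain in inferences/s/W of `device` over `baseline`.

    Assumption (not from the text): power draw equals the TDP of each platform.
    """
    return speedup * TDP_W[baseline] / TDP_W[device]


# Average raw speedups quoted in the text.
xavier_vs_xeon = efficiency_gain(5.1, "tegra_xavier", "xeon_4116")
xavier_vs_rtx = efficiency_gain(0.34, "tegra_xavier", "rtx_2080_ti")
rtx_vs_xeon = efficiency_gain(15.0, "rtx_2080_ti", "xeon_4116")

print(f"Xavier vs Xeon: {xavier_vs_xeon:.2f}x")
print(f"Xavier vs RTX 2080 Ti: {xavier_vs_rtx:.2f}x")
print(f"RTX 2080 Ti vs Xeon: {rtx_vs_xeon:.2f}x")
```

Under the TDP assumption these evaluate to about 14.45, 2.83 and 5.1, which is close to the reported 14.4×, 2.8× and 5.3×. The remaining gaps may come from rounding of the reported averages or from a different power measure.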
Due to their resource constraints, NCS 2 and the Kryo mobile CPU exhibit
substantially lower raw performance when compared to the
server-grade CPU and GPU platforms; NCS 2 can handle up to 23 FPS
on average whereas the Kryo processor achieves less than 6 FPS
(35× slower than the RTX 2080 Ti GPU). Despite the expected
degradation in terms of raw throughput, NCS 2 achieves a
power efficiency gain in the range 3×-88× (41.6× on average) over
the power-hungry RTX 2080 Ti GPU. In this respect, the NCS 2
platform extracts the maximum performance out of its 1-watt TDP
and constitutes a powerful candidate device for performing DNN
inference in very-low-power settings.
Interestingly, by observing the accuracy-latency Pareto fronts
in Figure 3, mobile devices demonstrate substantially different
patterns compared to the server-grade experiments. First, deeper
networks with a large number of operations suffer from a big penalty
in raw performance. For instance, ResNet on the mobile CPU
results in a minimal throughput of 0.2 inferences per second, which
is 250× slower than the RTX 2080 Ti GPU. At the same time,
networks that are optimized for mobile devices do significantly better:
MobileNetV2 is 6.4× faster on NCS 2 and 16.3× faster on the Kryo
processor compared to ResNet. Despite the accuracy penalty, the
design decision of using depthwise separable convolutions seems
to offer a considerable performance benefit in the mobile space.
At the same level of accuracy, VGG16 is uniformly slower across
devices by a considerable margin (ranging from 2× to 6×). Finally,
ShuffleNetV2 is a notable example that achieved significantly
improved performance on the Kryo CPU when compared to server-grade
runs, demonstrating the benefits of pointwise group convolution
and channel shuffling for resource-constrained devices.
Overall, from a high-level view, the Pareto fronts of low-power
devices (i.e., sub-figures on the second row of Figure 3) comprise
different networks compared to their more power-consuming
counterparts. The Pareto front of the Xeon 4116 CPU comprises AlexNet,
MobileNetV2 and ResNet152, while those of the RTX 2080 Ti and Tegra
Xavier GPUs contain AlexNet, VGG16 and ResNet152. On the other
hand, NCS 2 also includes DenseNet201 on its Pareto front, with
VGG16 being excluded due to its very latency-expensive NCS 2
mapping, while ShuffleNetV2 and NASNet-mobile also appear as
Pareto-optimal networks for the CPU of SDM845. As a result, the
direct use of hardware-agnostic metrics, such as number of FLOPs,
can often be misleading and not accurately indicate how efficiently
a DNN is mapped on a particular platform.
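An accuracy-latency Pareto front of the kind discussed above can be extracted as in the following sketch. A network is kept if no other network is at least as fast and at least as accurate while being strictly better in one of the two. The top-1 accuracies are taken from Table 1, whereas the latencies are hypothetical placeholders chosen for illustration, not measured values.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of non-dominated networks, sorted by increasing latency.

    `points` maps a network name to (latency_ms, accuracy); lower latency
    and higher accuracy are better.
    """
    front = []
    for name, (lat, acc) in points.items():
        dominated = any(
            o_lat <= lat and o_acc >= acc and (o_lat < lat or o_acc > acc)
            for other, (o_lat, o_acc) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# (latency_ms, top-1 accuracy). Accuracies from Table 1; latencies are
# HYPOTHETICAL placeholders, not measurements from this study.
EXAMPLE = {
    "alexnet": (2.0, 56.52),
    "mobilenet_v2": (6.0, 71.81),
    "vgg16": (9.0, 71.59),
    "resnet152": (20.0, 78.31),
    "densenet201": (25.0, 76.87),
}

print(pareto_front(EXAMPLE))
```

Because the front depends on the per-platform latencies, running the same function on measurements from two devices can yield different sets of networks, which is the effect described in the text.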
Per-layer analysis. To further investigate why some
networks are mapped more efficiently on certain platforms, we also
look into the breakdown of inference time within each operation
(Figure 4). As already demonstrated in the literature, the majority
of time is spent on convolution operations, ranging from 65% of
the time on the desktop GPU to 89% of the time on mobile processors.
This difference further demonstrates the need for optimizing these
operations on mobile platforms. For NVIDIA GPU accelerators,
the second most time-consuming operation for this workload was
Batch Normalization (19% and 15% of time on the RTX 2080 Ti
and Xavier GPUs respectively) whereas for the Xeon CPU Max
Pool becomes substantial by occupying 10% of the time. Finally,
on the Kryo mobile processor 10% of the time is spent on
fully-connected layers. AlexNet clearly reveals that fully-connected layers
are not a good fit for the characteristics of mobile platforms, taking
more than 70% of the computation time compared to below
30% on server-grade GPUs. The primary factor behind the slow
execution of fully-connected layers is the limited off-chip
memory bandwidth of existing mobile platforms, which determines the
processing speed of the inherently memory-bound operations of
fully-connected layers.
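The memory-bound nature of fully-connected layers can be illustrated by estimating their arithmetic intensity, i.e., the FLOPs performed per byte moved. The sketch below is a simplified model under stated assumptions: FP32 values, every weight read from off-chip memory once per batch, biases ignored, and an illustrative 4096×4096 layer (a size used in the AlexNet and VGG classifiers). The function name is hypothetical.

```python
def fc_arithmetic_intensity(in_features: int, out_features: int,
                            batch_size: int = 1,
                            bytes_per_value: int = 4) -> float:
    """FLOPs per byte for a fully-connected layer (simplified model).

    Assumptions: one multiply and one add per weight per sample, weights
    fetched once per batch, inputs and outputs moved once, biases ignored.
    """
    flops = 2 * in_features * out_features * batch_size
    values_moved = (in_features * out_features
                    + batch_size * (in_features + out_features))
    return flops / (bytes_per_value * values_moved)


single = fc_arithmetic_intensity(4096, 4096, batch_size=1)
batched = fc_arithmetic_intensity(4096, 4096, batch_size=64)
print(f"batch 1:  {single:.2f} FLOP/byte")
print(f"batch 64: {batched:.2f} FLOP/byte")
```

At a batch size of 1 the layer performs about half a FLOP per byte under these assumptions, so its speed is set almost entirely by memory bandwidth; batching raises the intensity because each fetched weight is reused across samples.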
5 DISCUSSION
The analysis of Section 4 uncovers a number of notable insights.
Performance variability: We note that the performance of
each network varies substantially across platforms depending on
the network architecture and the type of operations used. Therefore,
an interesting research direction would be to design tools that can
automatically select when to use and fuse together these building
blocks, depending on the hardware architecture of the target device
as well as the latency and throughput requirements [1, 3, 9].
Mobile-specic optimizations: Our results further demon-
strate the importance of mobile-specic operations such as point-
wise and depthwise separable convolutions. Furthermore, our
benchmarks show that mobile devices are inecient in handling
larger models due to their reduced memory capabilities. Therefore,
while compression and quantization techniques might not result
in a big performance gain on desktop environments, they do make
a big dierence on mobile and embedded devices, both compute-
and memory-wise. Towards this direction, binary networks [8, 22]
seem to oer a promising alternative for maximal compression,
but currently require specialized hardware support to exploit the
speedup potential. Finally, it is possible to dynamically ooad
computation from device to the edge or cloud, in order to facilitate
computation and minimize latency [5].
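To give a sense of why depthwise separable convolutions help, the sketch below compares the multiply-accumulate (MAC) count of a standard convolution with that of a depthwise convolution followed by a pointwise (1×1) convolution. The layer shape in the example is an arbitrary illustration, not taken from any evaluated network, and the function names are hypothetical. MAC counts are hardware-agnostic, so, in line with the findings above, the actual speedup on a given platform may differ.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int,
                       h_out: int, w_out: int) -> int:
    """MACs of a standard k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_macs(k: int, c_in: int, c_out: int,
                             h_out: int, w_out: int) -> int:
    """MACs of a k x k depthwise convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * c_out * h_out * w_out
    return depthwise + pointwise


# Illustrative layer shape (hypothetical, not from the evaluated networks).
K, C_IN, C_OUT, H, W = 3, 128, 128, 28, 28

ratio = (depthwise_separable_macs(K, C_IN, C_OUT, H, W)
         / standard_conv_macs(K, C_IN, C_OUT, H, W))
print(f"separable / standard MACs: {ratio:.4f}")
```

The ratio equals 1/C_out + 1/k², so for 3×3 kernels with many output channels the separable form needs roughly one eighth to one ninth of the MACs.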
Importance of hardware support: Most of the examined mo-
bile devices either do not come with automated libraries for target-
ing their accelerator backends (GPU, DSP, NPU), or provide limited
support for DNN operators. Our results demonstrate that most of
the time is consumed on operations such as convolutions and fully-
connected layers. Instead of being limited to the CPUs, software
optimizations and support for exploiting the available hardware
accelerators of target mobile and embedded chipsets should be
prioritized in order to accelerate these common operations in a
transparent and homogeneous way [27].
Batch size: While most devices are optimized for larger batch
sizes, most real-time mobile applications require smaller batch
sizes to minimize latency. We observe that across all platforms the
hardware is not fully utilized for smaller batch sizes. One possible
direction is to study how multiple networks with small batch sizes
can be optimally collocated to overprovision these resources and
thus maximize their utilization [16, 21].
6 CONCLUSIONS
In this work, we attempted to shed some light on the performance
of DNN inference by analyzing more than 25 networks on a wide
variety of commodity devices. Our results provide useful insights
about the performance and suitability of these models.
Furthermore, we identified model design choices that work better on each
platform and we uncovered a number of performance bottlenecks.
We believe that these results can help the research community in
three ways: i) to identify the best possible building blocks when
designing models for these platforms, ii) shed light on the capabili-
ties of these devices and iii) provide insights about possible future
hardware and software optimizations.
REFERENCES
[1] Anubhav Ashok, Nicholas Rhinehart, Fares Beainy, and Kris M Kitani. N2N
learning: Network to Network Compression via Policy Gradient Reinforcement
Learning. In ICLR, 2018.
[2] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. Benchmark
Analysis of Representative Deep Neural Network Architectures. IEEE Access,
6:64270–64277, 2018.
[3] Han Cai et al. ProxylessNAS: Direct Neural Architecture Search on Target Task
and Hardware. In ICLR, 2019.
[4] A. Canziani, E. Culurciello, and A. Paszke. Evaluation of Neural Network Archi-
tectures for Embedded Systems. In ISCAS, pages 1–4, 2017.
[5] Alejandro Cartas, Martin Kocour, Aravindh Raman, Ilias Leontiadis, Jordi Luque,
Nishanth Sastry, Jose Nuñez-Martinez, Diego Perino, and Carlos Segura. A
Reality Check on Inference at Mobile Networks Edge. In Proceedings of the 2nd
International Workshop on Edge Systems, Analytics and Networking, EdgeSys ’19,
pages 54–59, New York, NY, USA, 2019. ACM.
[6] François Chollet. Xception: Deep Learning with Depthwise Separable Convolu-
tions. In CVPR, pages 1251–1258, 2017.
[7] E. Chung et al. Serving DNNs in Real Time at Datacenter Scale with Project
Brainwave. IEEE Micro, 38(2):8–20, 2018.
[8] Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training Deep Neural
Networks with Weights and Activations Constrained to +1 or -1. In NIPS, 2016.
[9] Elliot J. Crowley, Gavin Gray, and Amos J Storkey. Moonshine: Distilling with
Cheap Convolutions. In NeurIPS, 2018.
[10] L. Fei-Fei, J. Deng, and K. Li. ImageNet: Constructing a Large-Scale Image
Database. Journal of Vision, 9(8):1037–1037, 2010.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learn-
ing for Image Recognition. In CVPR, 2016.
[12] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger.
Densely Connected Convolutional Networks. In CVPR, 2017.
[13] Jonathan Huang et al. Speed/Accuracy Trade-Offs for Modern Convolutional
Object Detectors. In CVPR, pages 7310–7311, 2017.
[14] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J
Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer
parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[15] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley,
and Luc Van Gool. AI Benchmark: Running Deep Neural Networks on Android
Smartphones. In ECCVW, 2018.
[16] Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wen-
cong Xiao, and Fan Yang. Multi-tenant GPU Clusters for Deep Learning Work-
loads: Analysis and Implications.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification
with Deep Convolutional Neural Networks. In NIPS, pages 1097–1105, 2012.
[18] Nicholas D. Lane and Petko Georgiev. Can Deep Learning Revolutionize Mobile
Sensing? In HotMobile, 2015.
[19] Chenxi Liu et al. Progressive Neural Architecture Search. In ECCV, 2018.
[20] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet v2:
Practical Guidelines for Efficient CNN Architecture Design. In ECCV, 2018.
[21] Deepak Narayanan, Keshav Santhanam, Amar Phanishayee, and Matei Zaharia.
Accelerating Deep Learning Workloads through Efficient Multi-Model Execution.
In NIPS SysML Workshop.
[22] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi.
XNOR-Net: ImageNet Classification using Binary Convolutional Neural Networks. In
ECCV, 2016.
[23] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2:
Inverted Residuals and Linear Bottlenecks. In CVPR, 2018.
[24] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks
for Large-Scale Image Recognition. In ICLR, 2015.
[25] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi.
Inception-v4, Inception-ResNet and the Impact of Residual Connections on
Learning. In AAAI, 2017.
[26] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le.
MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv preprint
arXiv:1807.11626, 2018.
[27] C. Wu et al. Machine Learning at Facebook: Understanding Inference at the
Edge. In HPCA, 2019.
[28] Xingzhou Zhang, Yifan Wang, and Weisong Shi. pCAMP: Performance
Comparison of Machine Learning Packages on the Edges. USENIX Workshop on Hot
Topics in Edge Computing (HotEdge 18), 2(1), 2018.
[29] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning
Transferable Architectures for Scalable Image Recognition. In CVPR, 2018.
