Modeling the Resource Requirements of Convolutional Neural Networks on
  Mobile Devices by Lu, Zongqing et al.
arXiv:1709.09503v1 [cs.CV] 27 Sep 2017
Modeling the Resource Requirements of Convolutional Neural
Networks on Mobile Devices
Zongqing Lu
Peking University
zongqing.lu@pku.edu.cn
Swati Rallapalli
IBM Research
srallapalli@us.ibm.com
Kevin Chan
Army Research Laboratory
kevin.s.chan.civ@mail.mil
Thomas La Porta
Pennsylvania State University
tlp@cse.psu.edu
ABSTRACT
Convolutional Neural Networks (CNNs) have revolutionized the
research in computer vision, due to their ability to capture complex
patterns, resulting in high inference accuracies. However, the
increasingly complex nature of these neural networks means that
they are particularly suited for server computers with powerful
GPUs. We envision that deep learning applications will be even-
tually and widely deployed on mobile devices, e.g., smartphones,
self-driving cars, and drones. Therefore, in this paper, we aim to
understand the resource requirements (time, memory) of CNNs on
mobile devices. First, by deploying several popular CNNs on mobile
CPUs and GPUs, we measure and analyze the performance and
resource usage for every layer of the CNNs. Our findings point
out the potential ways of optimizing the performance on mobile
devices. Second, we model the resource requirements of the different
CNN computations. Finally, based on the measurement, profiling,
and modeling, we build and evaluate our modeling tool, Augur,
which takes a CNN configuration (descriptor) as the input and es-
timates the compute time and resource usage of the CNN, to give
insights about whether and how efficiently a CNN can be run on
a given mobile platform. In doing so Augur tackles several chal-
lenges: (i) how to overcome profiling and measurement overhead;
(ii) how to capture the variance in different mobile platforms with
different processors, memory, and cache sizes; and (iii) how to ac-
count for the variance in the number, type and size of layers of the
different CNN configurations.
KEYWORDS
Convolutional neural networks; modeling; mobile devices
1 INTRODUCTION
Deep learning has become the norm of state-of-the-art learning
systems, especially in computer vision. Convolutional Neural
Networks (CNNs) have demonstrated impressive performance on
various computer vision tasks from classification and detection to
segmentation and captioning. A CNN consists of different types
of layers (e.g., convolutional, pooling, fully connected), where each
layer performs a certain transform on the input data and outputs the
data to the next layer. Different CNNs for computer vision tasks
have been designed, from a few layers to a thousand layers. However,
the core of these networks is naturally the convolutional layers,
which consist of a set of learnable kernels that are convolved across
the length and width of the input image to produce output features.
There are several frameworks that support the training (forward
and backward pass) and inference (only forward pass) phases of
CNNs, including Caffe [1], TensorFlow [6], Torch [8], Theano [7],
etc. All of these frameworks are designed and optimized for both
training and inference on computers with powerful GPUs.
However, we envision that deep learning applications will be
eventually and widely deployed on mobile devices. It is also ex-
pected that for computer vision tasks mobile devices will only per-
form inference (forward pass), since training can be carried out
offline by computers with powerful GPUs. In the rest of this paper,
the terms “inference”, “test”, and “forward pass” are used interchangeably.
Since both the frameworks and the CNN models are designed
for computers with powerful GPUs, they may not work effectively
and efficiently on mobile devices due to several factors,
e.g., constrained memory and limited computing capability. CNNs
for vision tasks are very complex. For example, VGGNet [21] has
528MB of parameters and requires over 15G FLOPs (FLoating-point
OPerations) to classify a single image. Due to the large number
of parameters and FLOPs, and the need to enable running these
CNNs on resource-constrained mobile devices, several works fo-
cus on accelerating the computing of CNNs on mobile devices by
compressing parameters [16, 23], by cloud offload [11], and by dis-
tributing computation to heterogeneous processors on-board [18].
However, complementary to these techniques, our goal is to model
the resource requirements of CNNs accurately. The motivation is
that our models can serve as a guideline for deciding when performance
optimizations, offloading, etc. are required to successfully run analytics
tasks on mobile devices. For instance, using the output of our
models, one could decide to run all the convolutional layers on the
mobile device while offloading the fully connected layers to the cloud
so as to cut down on the memory requirement on the mobile device.
Although accurately modeling the resource requirements of CNNs
is very hard, we make progress towards achieving it.
This paper overviews the workflow of CNNs, shares the experiences
of deploying CNNs on mobile devices, gives the performance
measurements and analysis, and models the resource requirements
of the inference phase of the CNNs on mobile devices. In doing so
we face significant challenges. (i) Profiling overhead: to measure
timing of GPU computations, we need to add a synchronization
call that waits for all the results to come back before recording
the time. As pointed out by [3], this causes an overhead, as some
cores may be idling while waiting for the rest of the cores to com-
plete the computation. We address this challenge by amortizing
this measurement cost by executing the computing task a large
number of times and averaging the running time. This ensures
that the overhead per iteration is negligible. (ii) Different types of
layers: CNNs are composed of various types of layers, so modeling
the resource requirements of all the different types is challenging.
However, since the main computation of all these layers boils
down to matrix multiplication, we are able to model the different
layers by abstracting out the details and focusing on the core of the
computation. (iii) How matrix multiplication scales: as the core of
the computation of CNNs, it is important to understand how the
computation scales with the sizes of matrices in terms of the re-
source requirements. Due to the large number of combinations of
matrix sizes, this can be very challenging. However, by extracting
the matrix multiplication sizes of popular CNNs, we observe that
all of them result in a small set of matrix sizes and thus we are
able to accurately model them for different mobile platforms.
Contributions: (i) We deploy the popular CNN models includ-
ing AlexNet [17], VGGNet [21], GoogleNet [22], and ResNet [12]
using the Caffe framework [15] on mobile platforms (i.e., NVIDIA
TK1 and TX1), where the inference phase is run on both CPUs
and GPUs (§3). (ii) We measure and analyze the performance and
resource usage of the inference phase of these CNN models on a
layerwise granularity. Our findings point out the potential ways
of optimizing the computing of CNNs on mobile devices (§4). (iii)
We profile and model the resource requirements of CNNs. We also
build a modeling tool, Augur, which takes a CNN model descriptor
as the input and estimates the resource requirements of the CNN
so as to give insights on how well the CNN can be run on a mobile
platform without having to implement and deploy it (§5).
2 BACKGROUND
2.1 Overview of CNNs
Our goal is to model the resource requirements of the forward pass
of a CNN. e CNN architecture is typically composed of convolu-
tional, normalization, and subsampling layers optionally followed
by fully connected layers. We overview these layers below, as it
lays the foundations for modeling the resource requirements.
Convolutional Layer:e convolutional (CONV) layers form the
core of CNNs. e parameters of this layer are a set of kernels
(weights) and biases learned during the training phase. During the
forward pass, kernels are convolved across the width, height, and
depth of the input, computing the dot product between the kernel
and the input and producing the output volume. Since the main
operation is dot product between the kernels and local regions of
the input, the forward pass of a CONV layer can be formulated as
a matrix multiplication. For the input volume, each local region
(a block of pixels) is stretched into a column of a matrix, and the
number of columns is the total number of local regions. e kernel
is stretched into a column of another matrix, and the number of
columns is the number of kernels. Finally, the product of thematrix
multiplication is reshaped to the output volume with a depth equal
to the number of kernels. For example, the input of AlexNet [227×
227 × 3] (width × height × depth) is convolved with 96 kernels
at size [11 × 11 × 3] and with a stride 4, and hence there are 55
locations along both width and height. So, the matrix for the input
is [3025 × 363], the matrix of the kernels is [363 × 96], and the
producedmatrix is [3025×96] and finally reshaped to [55×55×96].
The CONV layer is commonly implemented using the matrix
multiplication function of Basic Linear Algebra Subprograms (BLAS)
on CPUs and cuBLAS [2] on CUDA GPUs for acceleration. However,
as many values in the input volume are replicated multiple
times in the matrix stretched from the input volume, this matrix uses more
memory than the input volume itself.
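As an illustration of the reshaping described above, the following pure-Python sketch lowers a small convolution to a matrix multiplication and estimates how much larger the stretched input matrix is than the input volume for the AlexNet example. The function names (`conv_output_size`, `im2col`, `matmul`) and the tiny test input are illustrative assumptions, not part of Caffe, BLAS, or cuBLAS, and zero padding is assumed to be absent.

```python
def conv_output_size(in_size, kernel, stride):
    """Number of kernel positions along one spatial axis (no padding assumed)."""
    return (in_size - kernel) // stride + 1


def im2col(volume, kernel, stride):
    """Stretch each local region of a [height][width][depth] volume into one row."""
    height, width = len(volume), len(volume[0])
    out_h = conv_output_size(height, kernel, stride)
    out_w = conv_output_size(width, kernel, stride)
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = []
            for ky in range(kernel):
                for kx in range(kernel):
                    # Append all depth values of this pixel to the row.
                    row.extend(volume[oy * stride + ky][ox * stride + kx])
            rows.append(row)
    return rows


def matmul(a, b):
    """Naive product of an (m x k) and a (k x n) matrix given as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Sizes for the AlexNet example in the text: 227x227x3 input, 11x11x3 kernels, stride 4.
regions = conv_output_size(227, 11, 4) ** 2   # 55 * 55 local regions
region_len = 11 * 11 * 3                      # values per stretched region
input_values = 227 * 227 * 3                  # values in the input volume
stretched_values = regions * region_len       # values in the stretched input matrix
expansion = stretched_values / input_values   # replication factor of the stretched matrix

# Tiny check of the lowering: 4x4x1 input, one 2x2x1 kernel of ones, stride 2.
tiny = [[[float(4 * y + x)] for x in range(4)] for y in range(4)]
tiny_rows = im2col(tiny, kernel=2, stride=2)  # 4 regions with 4 values each
kernel_matrix = [[1.0], [1.0], [1.0], [1.0]]  # one stretched kernel as a column
tiny_out = matmul(tiny_rows, kernel_matrix)   # [4 x 1], to be reshaped to 2x2x1
```

Under these assumptions the stretched matrix for the AlexNet example holds roughly seven times as many values as the input volume, which is the extra memory the paragraph above refers to.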
Pooling Layer:e pooling (POOL) layer commonly sits between
CONV layers and performs downsampling to reduce the spatial
size (width and height). e pooling is performed on local regions
with the kernel size defined by a CNN model. e most common
pooling operation in the state-of-the-art CNN models is max pool-
ing. e pooling layer independently operates on the input volume
without parameters, and hence its implementation is simple.
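To make the max pooling operation concrete, here is a minimal sketch that pools a single 2D feature map. The function name `max_pool` and the sample values are hypothetical; a real framework applies the same operation independently to every feature map in the input volume.

```python
def max_pool(feature_map, kernel, stride):
    """Max pooling over one 2D feature map given as a list of rows (no padding)."""
    out_h = (len(feature_map) - kernel) // stride + 1
    out_w = (len(feature_map[0]) - kernel) // stride + 1
    return [
        [
            max(
                feature_map[oy * stride + ky][ox * stride + kx]
                for ky in range(kernel)
                for kx in range(kernel)
            )
            for ox in range(out_w)
        ]
        for oy in range(out_h)
    ]


# A 4x4 feature map pooled with a 2x2 kernel and stride 2 shrinks to 2x2.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool(fmap, kernel=2, stride=2)
```

The layer has no learned parameters; only the kernel size and stride from the model definition are needed.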
Normalization Layer: Two types of normalization layers are commonly
used in CNNs: local response normalization (LRN) and batch
normalization (BatchNorm). However, LRN has been superseded
by other techniques, such as BatchNorm, and thus here we
only detail BatchNorm.
BatchNorm is introduced to reduce the internal covariate shift
during training [14]. During the test phase, BatchNorm normalizes the
input volume on each dimension (width × height), e.g., for the i-th
dimension, as follows,
$$\hat{x}^{(i)} = \frac{x^{(i)} - E[x^{(i)}]}{\sqrt{\mathrm{Var}[x^{(i)}]}},$$
where $E[x^{(i)}]$ and $\mathrm{Var}[x^{(i)}]$ are learned during the training phase
for dimension $i$.
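The test-time normalization above can be sketched in a few lines of Python. The function name `batchnorm_inference` is hypothetical, and the optional `eps` argument is an assumption added for numerical safety; it is not part of the formula in the text and defaults to zero so that the code matches the formula exactly.

```python
import math


def batchnorm_inference(values, mean, var, eps=0.0):
    """Normalize one dimension with statistics learned during training.

    eps is an assumed stabilizer (not in the text's formula); 0.0 reproduces it exactly.
    """
    scale = math.sqrt(var + eps)
    return [(v - mean) / scale for v in values]


# One dimension of the input volume, with a learned mean of 4.0 and variance of 4.0.
channel = [2.0, 4.0, 6.0]
normalized = batchnorm_inference(channel, mean=4.0, var=4.0)
```

Because the mean and variance are fixed after training, the operation is a per-element subtract and divide, which is cheap in FLOPs but touches every value of the input volume.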
Fully Connected Layer: Each neuron in a fully connected (FC)
layer is connected to all activations in the previous layer. Due
to the full connectivity, there is a huge number of parameters,
which places a heavy burden on memory usage and computation.
Recently, FC layers have fallen out of favor; e.g., the latest CNNs,
i.e., GoogleNet and ResNet, have only one fully connected layer
as the classifier. This dramatically reduces the number of parameters,
e.g., 26MB of parameters in GoogleNet versus 233MB in AlexNet.
Moreover, it was found that FC layers of VGGNet can be removed
with no performance reduction. Therefore, it is anticipated that
CNNs will eliminate the use of FC layers. The forward pass of FC
layers is also implemented as a matrix multiplication.
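A rough sketch of why FC layers are so heavy: each layer needs one weight per input-output pair, so its parameter count and its multiply-accumulate count are both the product of the two sizes. The helper `fc_cost` is hypothetical, and the three layer shapes below are the commonly cited AlexNet FC shapes, which are an assumption here since the text does not list them; 4-byte floating-point values are also assumed.

```python
def fc_cost(in_features, out_features, bytes_per_value=4):
    """Parameter count, multiply-accumulates, and bytes of one FC layer."""
    weights = in_features * out_features
    biases = out_features
    return {
        "params": weights + biases,
        "macs": weights,  # one multiply-accumulate per weight for a single input
        "bytes": (weights + biases) * bytes_per_value,
    }


# Assumed AlexNet FC shapes (input size, output size); not given in the text.
ALEXNET_FC = [(9216, 4096), (4096, 4096), (4096, 1000)]

costs = [fc_cost(i, o) for i, o in ALEXNET_FC]
total_params = sum(c["params"] for c in costs)
total_macs = sum(c["macs"] for c in costs)
total_mib = sum(c["bytes"] for c in costs) / 2**20
```

Under these assumptions the three layers account for roughly 59M multiply-accumulates and a little over 220MB of parameters, which would be consistent with FC layers dominating the 233MB quoted for AlexNet, although the exact accounting used in the paper is not stated.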
Besides these four layers, the rectified linear unit (ReLU) layer, which
applies an elementwise function, e.g., max(0,x), on the input volume,
is also commonly used in CNNs. However, ReLU is simple,
has no parameters, and does not change the size of the input volume.
Thus we skip the details of the ReLU layer.
2.2 Related Work
Although CNNs have been applied to various computer vision ap-
plications on different computing platforms, only a few works con-
sider running CNNs on mobile devices, which we envision to be a
significant future area for the deployment of deep learning appli-
cations.
Among these works, many focus on accelerating the computing
of CNNs, e.g., by compressing parameters [16, 23], by cloud offload
[11], and by distributing computation to heterogeneous processors
on-board [18]. Some consider reducing the memory usage to bet-
ter fit mobile devices while maintaining high inference accuracy,
Table 1: CNN models
Layer     AlexNet   VGGNet   GoogLeNet   ResNet
CONV      5         13       57          53
POOL      3         5        14          2
NORM      2         -        2           53
ReLU      7         15       57          49
FC        3         3        1           1
Concat    -         -        9           -
Scale     -         -        -           53
Eltwise   -         -        -           16
Total     20        36       140         227
Table 2: Timing benchmarks on AlexNet
Layer columns give the layerwise pass (ms); the line below each row gives the share of the layerwise total.
Platform   CONV         POOL       LRN          ReLU       FC           Total (ms)    Forward Pass (ms)
TK1 CPU    318.7±0.2    6.1±0.1    103.8±0.0    4.6±0.0    186.3±0.1    619.8±0.2     619.5±0.2
           51.42%       0.99%      16.74%       0.75%      30.05%
TK1 GPU    24.6±3.5     2.3±0.6    2.4±0.5      5.2±1.2    35.1±5.9     73.3±10.7     54.7±2.4
           33.53%       3.15%      3.22%        7.11%      47.95%
TX1 CPU    66.9±5.3     7.6±0.0    172.4±0.3    2.4±0.0    644.7±5.3    894.3±4.8     892.7±2.3
           7.48%        0.85%      19.28%       0.27%      72.09%
TX1 GPU    24.2±8.3     1.3±2.6    2.7±3.0      5.9±5.9    15.2±4.7     52.8±15.7     29.3±6.5
           45.79%       2.51%      5.12%        11.23%     28.76%
FLOPs      666M         1M         2M           0.7M       59M          729M
           91.36%       0.14%      0.27%        0.10%      8.09%
Table 3: Timing benchmarks on VGGNet
Layer columns give the layerwise pass (ms); the line below each row gives the share of the layerwise total.
Platform   CONV           POOL        ReLU        FC            Total (ms)     Forward Pass (ms)
TK1 CPU    7160.5±0.7     60.1±0.1    95.6±0.1    381.6±0.2     7697.9±0.6     7697.8±0.5
           93.02%         0.78%       1.24%       4.96%
TK1 GPU    263.1±19.3     7.2±0.5     17.5±1.2    57.6±0.5      347.6±20.1     326.7±2.1
           75.68%         2.06%       5.03%       16.58%
TX1 CPU    1952.9±12.2    71.3±1.5    52.5±1.9    747.7±24.9    2824.6±23.2    2809.1±10.6
           69.14%         2.52%       1.86%       26.47%
TX1 GPU    136.3±5.4      3.4±1.6     9.9±4.9     32.8±1.3      184.2±7.4      175.3±2.0
           73.98%         1.84%       5.35%       17.82%
FLOPs      15360M         6M          14M         124M          15503M
           99.08%         0.04%       0.09%       0.79%
Table 4: Memory of CNN models on platforms (MB)
Type/Platform      AlexNet   VGGNet   GoogleNet   ResNet
Weights & Biases   233       528      26          97
Data               8         110      53          221
Workspace          11        168      46          79
TK1 CPU            324       972      161         409
TK1 GPU            560       1508     196         533
TX1 CPU            362       1013     200         453
TX1 GPU            589       1537     226         562
e.g., [10, 13]. e resource bolenecks of running CNNs on mo-
bile devices are preliminarily investigated in [19]. Different CNNs
are benchmarked in [9], but it does not consider how to model the
resource requirements of CNNs.
While CNNs grow from a few layers to a thousand layers, the
computational capability of mobile devices continues to improve.
As a result, different mobile devices perform differently on differ-
ent CNNs, and hence custom optimization and offloading may or
may not be needed. It depends on whether and how efficiently a
CNN can be run on a given mobile platform. This question motivates
our work.
3 MEASUREMENT SET-UP
To understand the resource requirement of the forward pass of
CNNs, we deployed several CNN models on two mobile platforms
using the popular deep learning framework – Caffe.
Platforms: Although some frameworks (e.g., Caffe, Torch) can
run on Android and iOS, they do not support GPU acceleration
on off-the-shelf mobile devices, such as smartphones or tablets. To
understand the performance of CNNs on both mobile CPUs and
GPUs, in this paper, we focus on two developer kits for low power
edge devices – NVIDIA TK1 and TX1.
TK1 is equipped with a 2.3GHz quad-core ARM Cortex-A15 32-bit
CPU, a 192-CUDA-core Kepler GPU, and 2GB DDR3L RAM. TX1 is
more powerful and has a 1.9GHz quad-core ARM Cortex-A57 64-bit
CPU, a 256-CUDA-core Maxwell GPU, and 4GB LPDDR4 RAM. The
system-on-chip (including CPU and GPU) of TK1 and TX1 also appears
in many off-the-shelf mobile devices, such as Google Nexus
9 and Pixel C. However, none of these devices are enabled to support
CUDA, on which deep learning frameworks are built for GPU
acceleration. Thus, for ease of experimentation we choose NVIDIA
TK1 and TX1, the results of which should indicate the performance
of CNNs on mobile devices.
Framework: ere are several frameworks for deep neural net-
works. As mentioned before, most of the frameworks use BLAS
on CPU and cuBLAS on GPUs for the CNN computations and thus
show similar performance. In this paper, we use the popular Caffe
framework, where the choice of BLAS is OpenBLAS [5].
CNN Models: For the measurement, we consider the most popular
CNN models including AlexNet, VGGNet (VGG-16), GoogleNet,
and ResNet (ResNet-50). Although the architectures of these mod-
els are quite different, from several layers to more than one hun-
dred layers and from regular stacked layers to branched and stacked
layers, they are mainly built on the basic layers of CNNs. Table 1
shows how many of these layers each model contains.
4 INITIAL MEASUREMENT STUDY
In this section, we investigate the resource requirements and bot-
tlenecks of running several well known CNN models on mobile
platforms.
4.1 Timing
First, we measure the timing of each model on different platforms
using CPU and GPU in terms of (i) the complete forward pass, i.e., timing
measured for the entire forward pass, and (ii) the summation
of individual layer times. We also calculate the number of FLOPs
for each model and each type of layer.
AlexNet has the fewest layers among these models and
indeed requires the least amount of computation in terms of FLOPs,
i.e., 729M. As shown in Table 2, on the CPU of both TK1 and TX1,
the summation of layerwise timing closely matches that of
a full forward pass, which is about 600ms (on TK1) and 900ms
(on TX1). Surprisingly, although TX1 has a more powerful CPU, the
forward pass on TX1 is slower than on TK1. The CONV layers on TX1
run much faster than on TK1 (more than 4x), but the FC layers are
much slower (more than 3x). Since the basic computation of both
Table 5: Timing benchmarks on GoogleNet
Layer columns give the layerwise pass (ms); the line below each row gives the share of the layerwise total.
Platform   CONV          POOL         LRN          ReLU         Concat      FC         Total (ms)     Forward Pass (ms)
TK1 CPU    755.3±0.2     68.8±0.1     214.3±0.2    22.8±0.0     2.0±0.0     2.7±0.0    1066.2±0.3     1065.6±0.2
           70.84%        6.45%        20.10%       2.14%        0.19%       0.26%
TK1 GPU    186.9±45.0    20.6±4.9     6.5±1.5      35.3±9.9     13.0±4.4    2.4±0.8    269.3±65.6     167.0±44.3
           69.40%        7.65%        2.40%        13.10%       4.81%       0.90%
TX1 CPU    174.4±3.6     89.9±0.2     349.4±0.6    9.5±0.1      1.7±0.1     5.7±0.0    630.9±3.5      637.9±14.7
           27.64%        14.24%       55.38%       1.50%        0.36%       0.90%
TX1 GPU    165.9±48.8    18.5±11.2    3.3±2.3      49.5±31.2    15.4±9.8    1.2±1.1    258.1±89.8     143.9±59.2
           64.28%        7.16%        1.28%        19.16%       5.96%       0.46%
FLOPs      1585M         13M          3M           3M           -           1M         1606M
           98.80%        0.80%        0.20%        0.20%        -           0.06%
Table 6: Timing benchmarks on ResNet
Layer columns give the layerwise pass (ms); the line below each row gives the share of the layerwise total.
Platform   CONV          POOL        BatchNorm     ReLU         Scale        Eltwise      FC          Total (ms)      Forward Pass (ms)
TK1 CPU    1830.4±0.4    8.8±0.0     97.1±0.1      64.0±0.1     42.0±0.1     24.8±0.1     5.4±0.0     2072.7±0.4      2072.2±0.3
           88.31%        0.42%       4.68%         3.09%        2.03%        1.20%        0.26%
TK1 GPU    245.8±16.3    5.5±0.6     249.5±11.6    38.7±2.0     76.0±3.3     47.0±2.7     3.9±0.1     673.0±33.4      149.4±4.9
           36.53%        0.81%       37.08%        5.75%        11.29%       6.98%        0.58%
TX1 CPU    362.3±5.4     13.7±0.2    83.5±0.3      33.2±0.1     31.9±3.6     20.4±4.2     22.2±0.1    567.6±7.6       566.8±9.7
           63.83%        2.41%       14.7%         5.86%        5.62%        3.59%        3.92%
TX1 GPU    279.4±42.6    3.0±2.7     198.1±36.8    63.6±31.3    79.8±24.2    34.8±12.9    1.8±2.4     664.7±116.5     104.4±14.0
           42.03%        0.45%       29.80%        9.57%        12.01%       5.24%        0.27%
FLOPs      3866M         2M          32M           9M           11M          6M           2M          3922M
           98.59%        0.05%       0.81%         0.23%        0.27%        0.14%        0.05%
CONV and FC is matrix multiplication, the results seem contradic-
tory at first. However, we investigate and explain the reasons for
the behavior below.
First, even though the clock is slower on TX1 compared to TK1,
i.e., 1.9 GHz vs. 2.3 GHz, TX1 runs more instructions per clock
cycle compared to TK1 (3 vs. 2) and hence the performance of
the TX1 CPU is expected to be better than that of the TK1 CPU, as we see for
the CONV layers. Second, FC layers have many more parameters
than the CONV layers. Therefore, FC layers are bottlenecked by
memory, whereas CONV layers are compute bound. Third, the
L1 data cache size is 32 KB on both, and the L2 cache is larger on TK1
than on TX1. Even if the cache size were the same on both, because the
address is longer on TX1 (64 bit vs. 32 bit), more memory is used
up for addressing and less memory is available to hold
the data itself in the cache. This means that we need to fetch data
from RAM to the cache more often while executing the FC layers
on TX1 due to the large number of parameters, which causes the
slowdown.
GPUs can significantly accelerate the computation of a CNN
and thus improve the performance over CPUs. The more advanced
TX1 GPU outperforms the TK1 GPU as expected. However, we face
one challenge: the summation of layerwise timing does not match
the timing of the full forward pass on GPUs. The reason for the
mismatch is that CUDA supports asynchronous programming. Before
time measurement, an API (cudaDeviceSynchronize) has to
be called to make sure that all cores have finished their tasks. This
explicit synchronization is the overhead of measuring time on the
GPUs. Therefore, the sum of layerwise timing on GPUs is longer
than a full forward pass.
VGGNet has more than 2x as many CONV layers as AlexNet (Table 1). However,
the number of operations is 20x that of AlexNet because VGGNet
uses much larger feature maps. While other results follow
a similar pattern to AlexNet, the throughput of both CPU and GPU
on VGGNet is higher than on AlexNet. For example, the throughput
of the TK1 CPU on AlexNet is about 1 GFLOPS (GFLOPs per second)
and on VGGNet is about 2 GFLOPS. This is mainly because both CPU and
GPU have better throughput on matrix multiplication with larger
size.
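The throughput figures quoted above follow from dividing the total FLOPs by the full forward pass time. A minimal worked example, using the TK1 CPU numbers from Tables 2 and 3 (the function name `throughput_gflops` is an illustrative choice):

```python
def throughput_gflops(flops_millions, time_ms):
    """Achieved throughput in GFLOPS from total FLOPs (in millions) and time (in ms)."""
    flops = flops_millions * 1e6
    seconds = time_ms * 1e-3
    return flops / seconds / 1e9


# TK1 CPU: total FLOPs and full forward pass time from Tables 2 and 3.
alexnet_tk1_cpu = throughput_gflops(729, 619.5)
vggnet_tk1_cpu = throughput_gflops(15503, 7697.8)
```

With these inputs AlexNet comes out slightly above 1 GFLOPS and VGGNet at about 2 GFLOPS, matching the rounded values in the text.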
GoogleNet has more than 50 CONV layers, many more than AlexNet.
However, its CONV layers have only about twice the FLOPs of
those of AlexNet. The main reason is that the size of the kernels
and feature maps is small, which dramatically reduces the number
of operations. Similar to AlexNet, GoogleNet also employs LRN,
which significantly affects the performance on CPU for both TK1
and TX1. For example, it takes more than 55% of total time on the TX1
CPU. GoogleNet has a layer, named Concat, that does not involve
any computation, but concatenates the outputs from previous layers,
thus involving memory operations only.
The difference between layerwise timing and full forward pass
on GoogleNet is much larger than on AlexNet and VGGNet, as shown
in Table 5. GoogleNet has many more layers than AlexNet and
VGGNet and thus much more measuring overhead on GPUs. The
measuring overhead may be larger than the compute time when
the computation of a layer does not cost much time, e.g., ReLU
layers. Due to this measurement artifact, in Table 5, ReLU layers
cost more time on GPUs than CPUs. This motivates us to
devise measurement techniques that can overcome these measurement
overheads, as we see later in §5.
ResNet has more than two hundred layers. ResNet includes BatchNorm,
Scale, and Eltwise layers that are not commonly used by other
models. These layers are not expensive in terms of FLOPs, as shown
in Table 6. We observe that the computation of Scale and Eltwise
costs more on GPUs than CPUs, which is again due to the measurement
overhead on GPUs as discussed above. Interestingly, although
ResNet has more FLOPs (2x) than GoogleNet, a full forward
pass is faster than GoogleNet on TX1. This is because LRN
of GoogleNet is very expensive (55% of total time) on the TX1 CPU.
Moreover, GoogleNet has more CONV layers and the underlying
matrix multiplication is smaller than that of ResNet. As GPU throughput
is higher on matrix multiplication with larger size, ResNet is
faster than GoogleNet on TX1 GPU.
4.2 Memory
The memory requirement to run a CNN comes from three major
sources: (i) the memory that holds the parameters of the CNN; (ii)
the memory that stores intermediate data of the CNN; and (iii) the
workspace for computation. A majority of the CNN parameters
come from CONV and FC layers (i.e., weights and biases). Interme-
diate data is the output of each layer (i.e., the input of next layer),
e.g., feature maps. Some types of layers require additional space to
perform computation, e.g., on CONV layers, memory is needed
to hold the matrix stretched from the input data for matrix multiplication.
The workspace memory is mostly consumed by the matrix
multiplication of CONV layers. The NVIDIA CUDA Deep Neural
Network library (cuDNN) [4] can reduce the workspace by sacrificing
the speed of computing on GPUs. However, as the workspace
is not the most significant part, cuDNN cannot reduce the memory
usage of CNNs significantly.
Table 4 shows the memory requirement of weights and biases of
CONV and FC layers, intermediate data, and workspace of CONV
layers for each CNN, obtained by parsing the model descriptor (e.g., a prototxt
file in Caffe). Table 4 also gives the measured memory usage
of Caffe running each CNN on these platforms. One can see
that deeper CNNs (from AlexNet to ResNet) may not require more
memory, especially GoogleNet, which requires the least memory
among them. Memory usage on TX1 is higher than on TK1, because
TK1 is running a 32-bit OS while TX1 is running a 64-bit OS, which
incurs more memory usage for the framework itself.
To speed up the computation of CNNs, all memory should be
allocated beforehand and not released during the computation. Although
existing frameworks (e.g., Caffe¹) follow this rule, they are
designed for training and testing (scoring) on workstations with
powerful GPUs, and thus not quite suitable for mobile devices in
terms of memory management.
Unified Memory Architecture: Unlike workstations², where GPUs
have their dedicated memory, mobile platforms usually have a unified
memory architecture, where the GPU shares system memory with
the CPU. On workstations, in the current implementation of Caffe,
data is transferred to and from the memory of the GPU for access,
which is efficient on workstations. However, on a unified memory
architecture, e.g., TK1 and TX1, memory transfer from CPU to GPU
simply generates a redundant data copy in system memory. As
shown in Table 4, on both TK1 and TX1, the memory usage on
GPU is always more than on CPU, and the additional memory is actually
used to hold a redundant copy of the parameters of each CNN
(mostly weights and biases). For example, running AlexNet on TK1
GPU takes 560MB memory, which is 236MB more than on TK1 CPU,
while the weights and biases of AlexNet are 233MB in total. This also
holds for other CNNs.
¹ Caffe allocates the memory for intermediate data on demand (lazily) during the first
run, and thus it takes longer than later runs.
² Although GPUs on workstations can also directly access host memory over PCIe, e.g.,
by CUDA kernels, reading data over PCIe is limited by PCIe bandwidth (up to 32GB/s),
which is much slower than reading data from GPU memory (limit 200GB/s).
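The comparison between the extra GPU memory and the parameter size can be checked directly against Table 4. The short sketch below recomputes the GPU-minus-CPU difference for every model on both platforms; the dictionary layout and the helper `gpu_extra_mb` are illustrative choices, while all numbers are copied from Table 4.

```python
# Weights and biases per model (MB), from Table 4.
WEIGHTS_MB = {"AlexNet": 233, "VGGNet": 528, "GoogleNet": 26, "ResNet": 97}

# Measured memory usage of Caffe (MB), from Table 4.
MEASURED_MB = {
    "TK1": {
        "CPU": {"AlexNet": 324, "VGGNet": 972, "GoogleNet": 161, "ResNet": 409},
        "GPU": {"AlexNet": 560, "VGGNet": 1508, "GoogleNet": 196, "ResNet": 533},
    },
    "TX1": {
        "CPU": {"AlexNet": 362, "VGGNet": 1013, "GoogleNet": 200, "ResNet": 453},
        "GPU": {"AlexNet": 589, "VGGNet": 1537, "GoogleNet": 226, "ResNet": 562},
    },
}


def gpu_extra_mb(platform):
    """Additional memory used when running on the GPU instead of the CPU, per model."""
    cpu = MEASURED_MB[platform]["CPU"]
    gpu = MEASURED_MB[platform]["GPU"]
    return {model: gpu[model] - cpu[model] for model in WEIGHTS_MB}


tk1_extra = gpu_extra_mb("TK1")
tx1_extra = gpu_extra_mb("TX1")
# Gap between the extra GPU memory and the parameter size on TK1, per model.
tk1_gap = {model: tk1_extra[model] - WEIGHTS_MB[model] for model in WEIGHTS_MB}
```

On TK1 the extra memory exceeds the parameter size by only a few MB for AlexNet and VGGNet and by somewhat more for GoogleNet and ResNet, so the redundant copy of the parameters appears to explain most, though not necessarily all, of the difference.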
Mobile GPUs can directly access data by mapping host memory
without degrading performance or incurring memory transfer
overhead (i.e., zero-copy memory). Existing frameworks, including
Caffe, Torch, and Theano, do not take into consideration the
unified memory architecture of mobile platforms. However,
the unified memory architecture can be exploited to design
a tailored computing framework for mobile devices. (i) We can
eliminate memory transfers between CPU and GPU. (ii) We can
compute a CNN in the most efficient way; i.e., each layer can be
executed on the most efficient unit, switching back and forth be-
tween GPU and CPU, without incurring additional memory trans-
fer overhead.
4.3 Analysis
FLOPs. As the throughput of both CPU and GPU is higher on
CNNs with more FLOPs and a significant amount of memory operations
is involved in a CNN computation, FLOPs cannot accurately
reflect the compute time of a CNN. For example, ResNet is faster
than GoogleNet on GPUs, though it involves more FLOPs. Therefore,
estimating the compute time of CNNs directly from their FLOPs
is not feasible.
CONV and FC Layer. e computation of CONV and FC layers
in most models accounts for a majority of FLOPs. erefore, can
one measure these layers instead of the entire network? However,
this approach encounters other difficulties, i.e., layerwise measur-
ing overhead on GPUs, and we have no way to know the exact
overhead for each layer, which is hidden by GPUs.
Matrix Multiplication. e core of CONV and FC layers are ma-
trix multiplications. erefore, rather than going into the details
of each of the individual layers, if we are able to extract the ma-
trix multiplication part of the layer, we will be able to accurately
capture the resource requirements of these layers.
5 AUGUR
We aim to build a modeling tool that can estimate the resource
requirements of any given CNN descriptor on specific mobile platforms
without implementation and deployment. This way, we can
take the costs into consideration during the design of a CNN. This
is critical when designing CNNs for resource-constrained mobile
devices.
5.1 Profiling
The basic idea is simple. We first find the matrix multiplications
that form the core of the CNN computation. Then we measure
their performance based on the BLAS and cuBLAS libraries, which
are commonly used for matrix multiplications on CPUs and GPUs
respectively.
Extract matrix sizes: To find all matrix multiplications and their
sizes, we need to parse the descriptor of a CNN. The dimensions of the
input (e.g., images and feature maps) and network parameters (e.g.,
convolution kernels) determine the sizes of the two matrices (that are to be
multiplied) at a CONV or FC layer. As the dimension of feature
maps can be changed by some other layers, e.g., POOL layers, we
[Figure 1: Matrix multiplication and forward pass of AlexNet, VGGNet, GoogleNet, and ResNet on mobile platforms. Each panel plots compute time (ms) per model for two series, Matrix Mul. and Forward Pass; values below are Matrix Mul. / Forward Pass (share).]
(a) TK1 CPU: AlexNet 493.2 / 619.5 (79.61%), VGGNet 7390.6 / 7697.8 (96.01%), GoogleNet 747.5 / 1065.6 (70.15%), ResNet 1889 / 2072.2 (91.16%)
(b) TX1 CPU: AlexNet 696.4 / 892.7 (78.01%), VGGNet 2495.9 / 2809.1 (88.85%), GoogleNet 159 / 637.9 (24.93%), ResNet 405 / 566.8 (71.45%)
(c) TK1 GPU: AlexNet 38.7 / 54.7 (70.75%), VGGNet 264.6 / 326.7 (80.99%), GoogleNet 32.4 / 167 (19.40%), ResNet 85.2 / 149.4 (57.03%)
(d) TX1 GPU: AlexNet 21.6 / 29.3 (73.72%), VGGNet 137.5 / 175.3 (78.44%), GoogleNet 28.4 / 143.9 (19.74%), ResNet 63 / 104.4 (60.35%)
[Figure 2: Matrix multiplication on TK1 CPU with varying n, m, and k. Panels: (a) effect of n and m, where k = 576; (b) effect of n and k, where m = 28^2; (c) effect of m and k, where n = 256. Vertical axis: compute time (ms).]
[Figure 3: Matrix multiplication on TX1 CPU with varying n, m, and k. Panels: (a) effect of n and m, where k = 576; (b) effect of n and k, where m = 28^2; (c) effect of m and k, where n = 256. Vertical axis: compute time (ms).]
need to trace the dimension of feature maps layer by layer. However,
this can be easily done by parsing the parameter settings at
each layer, such as zero-padding ($P$), stride ($S$), and the number of output
feature maps ($N$). For instance, in the case of a CONV layer, let $I$
denote the spatial dimension of the input feature map, $O$ denote
the spatial dimension of the output feature map, and $K$ denote the 3D
volume of the convolution kernels. Then, we have:
$$O_w = \lfloor (I_w - K_w + 2P)/S \rfloor + 1$$
$$O_h = \lfloor (I_h - K_h + 2P)/S \rfloor + 1.$$
Then, the matrix multiplication at the CONV layer is $[(O_w \cdot O_h) \times
(K_w \cdot K_h \cdot K_d)][(K_w \cdot K_h \cdot K_d) \times N]$.
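The layer-by-layer tracing described above can be sketched as a small function that walks a list of layers and emits the two matrix shapes for every CONV layer. The dictionary-based layer format, the function names, and the POOL and second CONV settings are hypothetical stand-ins for a parsed descriptor; only the first CONV layer (227 × 227 × 3 input, 96 kernels of 11 × 11 × 3, stride 4) comes from the text. Square kernels and floor rounding for POOL layers are assumed for brevity.

```python
def out_size(in_size, kernel, pad, stride):
    """Spatial output size: floor((I - K + 2P) / S) + 1."""
    return (in_size - kernel + 2 * pad) // stride + 1


def trace_matmuls(width, height, depth, layers):
    """Return [(input matrix shape, kernel matrix shape)] for each CONV layer."""
    shapes = []
    for layer in layers:
        k = layer["kernel"]
        pad = layer.get("pad", 0)
        stride = layer.get("stride", 1)
        out_w = out_size(width, k, pad, stride)
        out_h = out_size(height, k, pad, stride)
        if layer["type"] == "CONV":
            inner = k * k * depth                 # K_w * K_h * K_d
            n = layer["num_output"]               # number of output feature maps N
            shapes.append(((out_w * out_h, inner), (inner, n)))
            depth = n                             # CONV changes the depth to N
        # POOL layers only change the spatial size, which still has to be traced.
        width, height = out_w, out_h
    return shapes


# Hypothetical parsed descriptor; only the first layer is taken from the text.
layers = [
    {"type": "CONV", "kernel": 11, "stride": 4, "num_output": 96},
    {"type": "POOL", "kernel": 3, "stride": 2},
    {"type": "CONV", "kernel": 5, "pad": 2, "num_output": 256},
]
shapes = trace_matmuls(227, 227, 3, layers)
first_macs = shapes[0][0][0] * shapes[0][0][1] * shapes[0][1][1]
```

The first entry reproduces the [3025 × 363][363 × 96] multiplication from Section 2.1; the second shows how an intermediate POOL layer changes the size of the next multiplication even though it performs no matrix multiplication itself.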
Mitigate measurement overhead: Layerwise timing measurement incurs
heavy overhead on GPUs and causes a large deviation from a full
forward pass. Moreover, the overhead is not fixed and varies over each
measurement. As illustrated in Tables 5 and 6, the measurement
overhead (the difference between the sum of layerwise measurements and
the full forward pass) of GoogleNet (131 measurements) on TX1 GPU is
128 ms, while the overhead of ResNet (227 measurements) is 595 ms.
Therefore, we need a way to mitigate the overhead for accurate timing
of matrix multiplications.
Timing measurements on GPUs can only be recorded after all cores
finish their tasks. In a full forward pass, timing is only recorded at
the last layer. Therefore, a core may be assigned the computation of
following layers and can thus continue computing without
synchronization. For example, after finishing the multiply-add
operations for the matrix multiplication at a CONV layer, a core can
continue to calculate the max function of the next ReLU layer on the
output of the multiply-add operations. If layerwise timing is
recorded, all cores have to wait until all multiply-add operations of
the CONV layer have been completed.
The idea of mitigating the measurement overhead is simple. To
benchmark a matrix multiplication, we keep the GPU iteratively running
the matrix multiplication so that GPU cores can continuously perform
multiply-add operations without synchronization before the end time is
recorded. Then, the measurement overhead is amortized over all the
iterations, giving accurate timing estimates. When the number of
iterations is large enough, the overhead is negligible. In our
experiments, we measure the timing of a large number of computing
iterations on a matrix multiplication and use the average time per
iteration as the compute time of the matrix multiplication.
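The amortization idea can be sketched as a small timing harness. This is a minimal sketch, not the authors' benchmarking code: `amortized_time_ms`, `op`, and `sync` are hypothetical names, where `op` stands for one launch of the matrix multiplication and `sync` for whatever device synchronization call the GPU framework provides (a no-op by default).

```python
import time


def amortized_time_ms(op, iterations=1000, sync=lambda: None):
    """Average per-iteration time of `op`, synchronizing only at the ends."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    sync()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iterations):
        op()  # no synchronization between iterations
    sync()  # single synchronization before reading the end time
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / iterations
```

Because the synchronization cost is paid once rather than per iteration, its share of the reported per-iteration time shrinks as `iterations` grows.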
Fraction of forward pass spent by matrix multiplication: In
Figure 1, we study the fraction of forward pass time spent by matrix
multiplication (matmul) operations. We do so by extracting the matmul
operations, measuring them, and then comparing with the full forward
pass measurement. Note that due to the averaging methodology explained
above, the measurement overhead for matmul operations in this section
is negligible.
First, as seen in Figure 1a, matmul operations on TK1 CPU take a large
portion of forward pass time: 79.61%, 96.01%, 70.15%, and 91.16% for
AlexNet, VGGNet, GoogleNet, and ResNet, respectively. Note that this
also approximates the time taken by CONV and FC layers from Tables 2,
3, 5, and 6 (81.47%, 97.98%, 71.1%, and 88.57%). Second, the trend is
similar on TX1 CPU, as depicted in Figure 1b, except for GoogleNet
(only about 25% of the time is spent on matmul operations), which is
caused by the particular combination of the architecture of TX1 CPU
and GoogleNet, as discussed in §4.1. Third, the trend on TK1 and TX1
GPUs is similar to the trend on TX1 CPU, as seen in Figures 1c and 1d.
One thing to note is that while matmul operations of GoogleNet only
take about 20% of the total forward pass time, our previous
measurement in Table 5 showed that CONV and FC layers take about 60%
of the total forward pass time.
Figure 4: Matrix multiplication on TK1 GPU with varying n, m, and k.
Panels (compute time in ms): (a) effect of n and m, where k = 576
(plotted against n, one curve per m from 7² to 224², with n = 96
marked); (b) effect of k and n, where m = 56² (plotted against k, one
curve per n from 32 to 512); (c) effect of m and k, where n = 256
(plotted against m on a logarithmic time axis, one curve per k from
64 × 1² to 64 × 7²).
Figure 5: Matrix multiplication on TX1 GPU with varying n, m, and k.
Panels (compute time in ms): (a) effect of n and m, where k = 576
(plotted against n, one curve per m from 7² to 224², with n = 96
marked); (b) effect of k and n, where m = 56² (plotted against k, one
curve per n from 32 to 512); (c) effect of m and k, where n = 256
(plotted against m on a logarithmic time axis, one curve per k from
64 × 1² to 64 × 7²).
Figure 6: Memory estimate of NIN and VGG19M. Memory usage (MB), given
as Estimate / TK1 / TX1: NIN on CPU 60 / 116 / 156; NIN on GPU
89 / 150 / 181; VGG19M on CPU 398 / 581 / 625; VGG19M on GPU
486 / 675 / 705.
Figure 7: Timing estimate of NIN. Compute time (ms), given as estimate
vs. measured forward pass (accuracy): TK1 CPU 513.6 vs. 542.7
(94.64%); TX1 CPU 106.7 vs. 121.7 (87.67%); TK1 GPU 48.1 vs. 59.8
(80.44%); TX1 GPU 18.2 vs. 23.2 (78.45%).
Figure 8: Timing profiling of VGG19M. Compute time (ms, logarithmic
axis), given as estimate vs. measured forward pass (accuracy): TK1 CPU
8818 vs. 9363 (94.18%); TX1 CPU 2193 vs. 2562 (85.60%); TK1 GPU 276.5
vs. 328.1 (84.27%); TX1 GPU 141.7 vs. 178.3 (79.47%).
We believe this is because the matmul operations are run without
taking dependencies into account, whereas GoogleNet consists of
inception components, each of which has four branches of CONV layers
in parallel. Before proceeding to the next inception component, all
four branches of CONV layers have to be completed. How to handle such
dependencies is part of our future work.
In summary, for most cases, matmul operations take a large proportion
(more than 60%) of the compute time of a CNN on mobile platforms.
Thus, we can predict matmul time to approximately estimate the compute
time of a CNN.
5.2 Modeling
So far, we have exactly measured matmul time. In this section, we
aim to model this time, to be able to predict the compute time just
from the matrix sizes. To do so, we benchmark several matrix sizes, as
explained below, to understand the relationship between the size of
the matrices and the compute time.
Given the matmul of [n × k] and [k × m] (the number of FLOPs is
n × m × k) performed by a CONV layer, n is the number of kernels, k is
the size of a kernel in 3D (width × height × depth, where depth is the
number of input feature maps), and m is the spatial size (width ×
height) of output feature maps.
CNNs follow special rules on these parameters of CONV layers. The
number of kernels n is usually a multiple of 16, commonly from 32 to
512. The spatial size of a kernel is commonly 1², 3², 5², 7², or 11².
The depth of a kernel is usually the number of kernels in the previous
CONV layer and hence also a multiple of 16, except for the first CONV
layer, where the depth is the number of channels of the input image,
typically equal to three. The spatial size m of the output feature
maps of a CNN gradually reduces; it is common to have 224², 112², 56²,
28², 14², or 7², though AlexNet has slightly different ones, i.e.,
55², 27², and 13². Based on these typical parameter settings, we
carried out experiments on matmul with varying n, m, and k. The FC
layer is currently used in CNNs only as a classifier (e.g., in
GoogleNet and ResNet) and thus its compute time is negligible compared
to the forward pass. Therefore, we do not consider the size of
matrices for FC layers in the modeling.
Simple linearity on CPU: Figures 2 and 3 illustrate the performance
of matmul on TK1 CPU and TX1 CPU, respectively. The settings of n, m,
and k are: n = [32, 64, 96, 128, 256, 512], m = [7², 14², 28², 56²,
112²], and k = [64 × 1², 64 × 3², 128 × 3², 64 × 5², 256 × 3²,
64 × 7²]. In each figure, we fix one of the three parameters and vary
the other two; data points are shown as small circles; black circles
are labeled with coordinates to highlight the setting of the varying
parameters.
From Figures 2a, 2b, and 2c, it is observed that the compute time of
matmul on TK1 CPU scales linearly with n, m, and k. The linearity can
also be observed on TX1 CPU, as depicted in Figures 3a, 3b, and 3c.
Thus, we have a linear model per CPU device, which predicts the matmul
time given the matrix sizes.
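One simple way to realize such a per-device linear model is an ordinary least-squares fit of compute time against the FLOP count n · m · k. The sketch below assumes that form, which the text does not spell out; `fit_linear` and `predict_cpu_ms` are hypothetical names, and the timings are synthetic values generated from a made-up line, not measurements from TK1 or TX1.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    count = len(xs)
    if count < 2 or count != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def predict_cpu_ms(model, n, m, k):
    """Predicted matmul time for an [n x k][k x m] product."""
    a, b = model
    return a * (n * m * k) + b


# Synthetic benchmark: sizes from the typical ranges, times from a made-up line.
sizes = [(n, m, k) for n in (32, 128, 512) for m in (7**2, 28**2) for k in (64, 576)]
flops = [n * m * k for n, m, k in sizes]
times_ms = [1e-6 * f + 0.5 for f in flops]  # assumed coefficients, illustration only

model = fit_linear(flops, times_ms)
print(predict_cpu_ms(model, 256, 28**2, 576))
```

A separate model would be fitted for each CPU, since the coefficients depend on the device.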
Complex linearity on GPU: Figures 4 and 5 illustrate the performance
of matmul with varying settings of n, m, and k on TK1 GPU and TX1 GPU,
respectively. The compute time of matmul on GPUs exhibits a more
complex relationship with n, m, and k.
Figure 4a mainly depicts the effect of n, which is bipartite. For all
the settings of m, the compute time has a monotonic relationship with
n from n = 32 to 128, except for n = 96, which incurs an even longer
compute time than n = 128, while from n = 128 to 512 the compute time
exhibits a perfect linear relationship with n. A similar result is
also found on TX1 GPU, as shown in Figure 5a. Although TX1 GPU has
more CUDA cores (256 compared to 192 cores in TK1 GPU) and generally
computes matmuls faster than TK1 GPU, it also exhibits this pattern at
n = 96. This artifact is related to the algorithm that determines how
the CUDA cores compute matmul in parallel. Since cuBLAS is not an
open-source library, it is hard to trace the exact reason. However, it
is indicated [2] that matmul works best if n and m are multiples of
128 on the Maxwell architecture (TX1 GPU) and if n is a multiple of
256 and m a multiple of 192 on the Kepler architecture (TK1 GPU). This
may explain why it behaves differently when n is small.
For given values of n and m, the compute time increases linearly with
k on TK1 GPU and TX1 GPU, as depicted in Figures 4b and 5b,
respectively. While the compute time increases with m on both TK1 GPU
and TX1 GPU, as depicted in Figures 4c and 5c, the effect of m is
tripartite. The compute time has three separate linear relationships
with m (different coefficients), e.g., from 7² to 28², from 28² to
56², and from 56² to 224² on TK1 GPU, as highlighted by different
regions in Figure 4c. In each such region, the compute time for
different values of k scales linearly with m at mostly the same
coefficient. Moreover, in the middle region (i.e., between 28² and 56²
in Figure 4c and between 14² and 28² in Figure 5c), the compute time
increases with m more slowly than in the other two regions. This is
especially true on TX1 GPU, where the region is much flatter and tends
to plateau. This region should be the transition area, where cuBLAS
adopts different schemes based on m and the number of CUDA cores to
assign the workload of matmul to CUDA cores. The transition area is
different on TK1 GPU and TX1 GPU, mainly because they have different
numbers of CUDA cores.
Based on the characteristics discussed above, we are able to
model the compute time of matmul on a specific GPU, though we
need more data points than on a CPU.
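One way to capture region-dependent slopes in m is piecewise-linear interpolation between benchmarked points, for fixed n and k. The sketch below uses that technique as a stand-in; it is not necessarily the fitting procedure used for Augur. The function name is hypothetical and the benchmark points are made-up values chosen only to show a flatter middle region.

```python
from bisect import bisect_right


def piecewise_linear(points, x):
    """Interpolate y at x from (x, y) points sorted by distinct x values.

    Outside the benchmarked range, the first or last segment is extended.
    """
    if len(points) < 2:
        raise ValueError("need at least two benchmark points")
    xs = [px for px, _ in points]
    # Index of the right endpoint of the segment to use, clamped to valid segments.
    hi = min(max(bisect_right(xs, x), 1), len(points) - 1)
    (x0, y0), (x1, y1) = points[hi - 1], points[hi]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Made-up (m, compute time in ms) points for one fixed (n, k) on one GPU.
benchmarks = [(7**2, 1.0), (28**2, 4.0), (56**2, 5.0), (224**2, 80.0)]
print(piecewise_linear(benchmarks, 40**2))
```

The breakpoints differ per device, which matches the need for more benchmark points on a GPU than on a CPU.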
5.3 Accuracy
Based on the measurement, profiling, and modeling of CNNs on mobile
devices, we built the modeling tool, Augur, which estimates the
compute time and memory usage for any given CNN. Augur first parses
the descriptor of a CNN. Based on the type and setting of each layer,
it calculates the minimal memory needed to run the CNN. The memory
includes data, parameters, and workspace. Then, Augur extracts matmuls
from the computation of the CNN. Based on the models of TK1 and TX1 on
matmul, i.e., the linear fits obtained from Figures 2 and 4 for TK1
and Figures 3 and 5 for TX1, Augur calculates the compute time of
individual matmuls and then uses their summation as the estimate of
the compute time of the CNN.
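The timing part of this pipeline can be sketched as follows. This is a rough illustration, not Augur's implementation: the layer descriptor is a hypothetical list of dictionaries rather than a Caffe descriptor, only stacked (unbranched) CNNs with square kernels are handled, pooling is assumed to use the same floor formula as convolution (some frameworks round pooling sizes up instead), and `predict_matmul_ms` is any per-device model supplied by the caller.

```python
def estimate_cnn_time_ms(layers, input_hw, input_depth, predict_matmul_ms):
    """Sum predicted matmul times over the CONV layers of a stacked CNN."""
    height, width = input_hw
    depth = input_depth
    total_ms = 0.0
    for layer in layers:
        if layer["type"] not in ("conv", "pool"):
            continue  # other layer types keep the shape in this sketch
        size, pad, stride = layer["kernel"], layer["pad"], layer["stride"]
        height = (height - size + 2 * pad) // stride + 1
        width = (width - size + 2 * pad) // stride + 1
        if layer["type"] == "conv":
            n = layer["num_output"]   # number of kernels
            k = size * size * depth   # 3D kernel size
            m = height * width        # spatial size of the output maps
            total_ms += predict_matmul_ms(n, m, k)
            depth = n                 # becomes the next layer's kernel depth
    return total_ms


# Hypothetical two-CONV network and a made-up linear model (1e-6 ms per FLOP).
toy_net = [
    {"type": "conv", "kernel": 3, "pad": 1, "stride": 1, "num_output": 64},
    {"type": "relu"},
    {"type": "pool", "kernel": 2, "pad": 0, "stride": 2},
    {"type": "conv", "kernel": 3, "pad": 1, "stride": 1, "num_output": 128},
]
print(estimate_cnn_time_ms(toy_net, (224, 224), 3, lambda n, m, k: 1e-6 * n * m * k))
```

Swapping the `predict_matmul_ms` callable is all that is needed to target a different device model.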
To verify the accuracy of Augur, we model two CNNs (i.e., NIN [20] and
VGG19M; see footnote 3) and compare the estimates to the memory usage
and compute time measured using Caffe. Figure 6 depicts the memory
usage of NIN and VGG19M on different processing units. The estimate of
memory usage is always less than the actual usage, because the
estimate does not take into account the memory usage of Caffe itself,
which is framework-dependent. However, it is easy to incorporate that
if a specific framework is targeted to perform the CNN computation.
Note that the estimate of Augur is accurate on the memory usage of
data, parameters, and workspace, as discussed in §4.2.
Footnote 3: VGG19M is a modified version of VGGNet with more CONV
layers. The FC layers in the original VGGNet are replaced by a CONV
layer and a POOL layer to reduce memory usage.
Figures 7 and 8 evaluate the accuracy of Augur’s compute time
estimation of NIN and VGG19M, respectively. From Figures 7 and 8, we
observe that the estimate based on only matmul can approximate the
compute time of NIN and VGG19M on both CPUs and GPUs, with more than
78% accuracy for all the cases. Since matmul generally takes a larger
proportion of the compute time on CPUs than on GPUs, as discussed in
§4.1, the estimate on CPUs (up to 94%) is closer to the actual compute
time than on GPUs (up to 84%). Moreover, a more powerful processing
unit can perform matmul faster, but the speedup is not the same across
all operations. Therefore, the matmul of a CNN takes a smaller
proportion of the compute time on a more powerful processing unit.
This explains why the estimate on TK1 CPU (or TK1 GPU) is more
accurate than on TX1 CPU (or TX1 GPU) for the same CNN.
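The percentages in Figure 7 appear to be the ratio of the estimate to the measured forward pass time. The short check below recomputes them from the bar labels of Figure 7 under that assumed definition; `accuracy_percent` is a hypothetical helper name. The recomputed values agree with the labels to within the rounding of the displayed numbers.

```python
# Bar labels read from Figure 7 (NIN): (estimate in ms, measured forward pass in ms).
FIGURE7_NIN_MS = {
    "TK1 CPU": (513.6, 542.7),
    "TX1 CPU": (106.7, 121.7),
    "TK1 GPU": (48.1, 59.8),
    "TX1 GPU": (18.2, 23.2),
}


def accuracy_percent(estimate_ms, measured_ms):
    """Estimate as a percentage of the measured time (assumed definition)."""
    if measured_ms <= 0:
        raise ValueError("measured time must be positive")
    return 100.0 * estimate_ms / measured_ms


for device, (estimate, measured) in FIGURE7_NIN_MS.items():
    print(f"{device}: {accuracy_percent(estimate, measured):.2f}%")
```

The same ratio applied to the labels of Figure 8 yields the VGG19M percentages.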
In summary, Augur can estimate whether and how efficiently a
CNN can be run on mobile devices before any deployment. It can
also help the design of CNNs for resource-constrained mobile de-
vices. When designing a CNN model using Augur, designers can
estimate the resource usage and compute time without implemen-
tation and deployment and tune the model to satisfy their specific
needs.
6 DISCUSSION
Augur can be extended to support additional mobile platforms by
simply profiling matrix multiplication operations on them. Matrix
multiplications account for most of the computation of a CNN (more
than 90% of FLOPs from Tables 2, 3, 5, and 6) and commonly take a
dominant proportion of the compute time. Thus, matrix multiplication
is currently exploited by Augur to estimate the compute time of a CNN.
To obtain a more precise estimate, additional factors need to be taken
into consideration, e.g., memory operations and CNN architectures
(stacked or branched). Enhancing Augur with these features is our
future work.
Moreover, we observe that a framework customized for running CNNs on
mobile platforms is highly desired. The framework should be optimized
for performing the test phase of CNNs and tailored to the
characteristics of mobile platforms, e.g., the unified memory
architecture.
7 CONCLUSION
In this paper, we aim to model the resource requirements of CNNs
on mobile devices. By deploying several popular CNNs on mobile
CPUs and GPUs, we measured and analyzed the performance and
resource usage at a layerwise granularity. Our findings pointed
out the potential ways of optimizing the performance of CNNs on
mobile devices. As matrix multiplications form the core computa-
tions of a CNN, we profiled and modeled matrix multiplications
on mobile platforms. Based on the measurement, profiling, and
modeling, we built Augur that can estimate the compute time and
memory usage of the CNN so as to give insights on whether and
how efficiently the CNN can be run on a mobile platform without
implementation and deployment. Therefore, it is a powerful tool that
helps the design of CNNs for resource-constrained mobile devices.
REFERENCES
[1] Caffe. http://caffe.berkeleyvision.org/.
[2] cuBLAS. https://developer.nvidia.com/cublas.
[3] CUDA C Programming Guide. https://docs.nvidia.com/cuda/.
[4] cuDNN. https://developer.nvidia.com/cudnn/.
[5] OpenBLAS. http://www.openblas.net/.
[6] TensorFlow. http://www.tensorflow.org/.
[7] Theano. http://deeplearning.net/software/theano/.
[8] Torch. http://torch.ch/.
[9] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An
Analysis of Deep Neural Network Models for Practical Applications.
arXiv preprint arXiv:1605.07678 (2016).
[10] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014.
Compressing deep convolutional networks using vector quantization.
arXiv preprint arXiv:1412.6115 (2014).
[11] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal,
Alec Wolman, and Arvind Krishnamurthy. 2016. MCDNN: An
Approximation-Based Execution Framework for Deep Stream Processing
Under Resource Constraints. In International Conference on Mobile
Systems, Applications, and Services (MobiSys’16).
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
Residual Learning for Image Recognition. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR’16).
[13] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf,
William J Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level
accuracy with 50x fewer parameters and <0.5MB model size. arXiv
preprint arXiv:1602.07360 (2016).
[14] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate
Shift. In International Conference on Machine Learning (ICML’15).
[15] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev,
Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell.
2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In
ACM International Conference on Multimedia (MM’14).
[16] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang,
and Dongjun Shin. 2016. Compression of Deep Convolutional Neural
Networks for Fast and Low Power Mobile Applications. In International
Conference on Learning Representations (ICLR’16).
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012.
ImageNet Classification with Deep Convolutional Neural Networks. In
Neural Information Processing Systems Conference (NIPS’12).
[18] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio
Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. 2016. DeepX: A
Software Accelerator for Low-Power Deep Learning Inference on Mobile
Devices. In International Conference on Information Processing in
Sensor Networks (IPSN’16).
[19] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio
Forlivesi, and Fahim Kawsar. 2015. An Early Resource Characterization
of Deep Learning on Wearables, Smartphones and Internet-of-Things
Devices. In International Workshop on Internet of Things towards
Applications (IoT-App’15).
[20] Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network in Network.
In International Conference on Learning Representations (ICLR’14).
[21] Karen Simonyan and Andrew Zisserman. 2015. Very Deep
Convolutional Networks for Large-Scale Image Recognition. In
International Conference on Learning Representations (ICLR’15).
[22] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. 2015. Going Deeper with Convolutions. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR’15).
[23] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng.
2016. Quantized Convolutional Neural Networks for Mobile Devices. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
