Scaling Up Deep Neural Network Optimization for Edge Inference by Lu, Bingqian et al.
Bingqian Lu∗
UC Riverside
Jianyi Yang†
UC Riverside
Shaolei Ren‡
UC Riverside
Abstract
Deep neural networks (DNNs) have been increasingly deployed on and integrated
with edge devices, such as mobile phones, drones, robots and wearables. To
run DNN inference directly on edge devices (a.k.a. edge inference) with a satisfactory
performance, optimizing the DNN design (e.g., network architecture
and quantization policy) is crucial. While state-of-the-art DNN designs have
leveraged performance predictors to speed up the optimization process, they are
device-specific (i.e., each predictor for only one target device) and hence cannot
scale well in the presence of extremely diverse edge devices. Moreover, even
with performance predictors, the optimizer (e.g., search-based optimization) can
still be time-consuming when optimizing DNNs for many different devices. In
this work, we propose a new DNN optimization framework which: (1) leverages
scalable performance predictors that can estimate the resulting performance (e.g.,
inference accuracy/latency/energy) given a DNN-device pair; and (2) uses a neural
network-based automated optimizer that takes both device features and optimization
parameters as input, and then directly outputs the optimal DNN design without
going through a lengthy optimization process for each individual device.
1 Background and Motivation
Deep neural networks (DNNs) have been increasingly deployed on and integrated with edge devices,
such as mobile phones, drones, robots and wearables. Compared to cloud-based inference, running
DNN inference directly on edge devices (a.k.a. edge inference) has several major advantages, including
being free from the network connection requirement, saving bandwidth, and better protecting
user privacy as a result of local data processing. For example, it is very common to include one or
multiple DNNs in today’s mobile apps [39].
To achieve a satisfactory user experience for edge inference, an appropriate DNN design is needed
to optimize a multi-objective performance metric, e.g., good accuracy while keeping the latency
and energy consumption low. A complex DNN model involves multi-layer perceptrons with up to
billions of parameters, imposing a stringent computational and memory requirement that is often too
prohibitive for edge devices. Thus, the DNN models running on an edge device must be judiciously
optimized using, e.g., neural architecture search (NAS) and model compression [5–7, 20, 22, 34, 37].
The DNN design choices we focus on in this work mainly refer to the network architecture and
compression scheme (e.g., pruning and quantization policy), which constitute an exponentially large
space. Note that other DNN design parameters, such as the learning rate and the choice of optimizer for
DNN training, can also be included in the proposed framework. For example, if we want to consider
learning rate and DNN architecture optimization, the accuracy predictor can take the learning rate and
architecture as the input and be trained by using different DNN samples with distinct architectures
and learning rates.
∗E-mail: blu029@ucr.edu
†E-mail: jyang239@ucr.edu
‡E-mail: sren@ece.ucr.edu
Position Paper. arXiv:2009.00278v2 [cs.LG] 7 Sep 2020
Given different design choices, DNN models can exhibit dramatically different performance tradeoffs
in terms of various important performance metrics (e.g., accuracy, latency, energy and robustness).
In general, there is not a single DNN model that performs Pareto optimally on all edge devices.
For example, with the same DNN model in Facebook’s app, the resulting latencies on different
devices can vary by a factor of 10 and cause poor user experiences [39]. Thus, device-aware DNN
optimization is needed [22, 24, 35, 39].
Figure 1: Statistics of the year in which mobile CPUs were designed, as of late 2018 [39]. The pie chart has slices for 2005-2010, 2011, 2012, 2013-2014, and 2015+; the extracted slice labels are 1.80%, 15.60%, 54.70%, 4.20%, and 23.60%.
Designing an optimal DNN for even a single edge device
often needs repeated design iterations and is non-trivial
[8, 38]. Worse yet, DNN model developers often need to
serve extremely diverse edge devices. For example, the
DNN-powered voice assistant application developed by
a third party can be used by many different edge device
vendors, and Facebook’s DNN model for style transfer
is run on billions of mobile devices, more than half of
which still use CPUs designed in 2012 or before (shown
in Fig. 1) [39]. In the mobile market alone, there are
thousands of system-on-chips (SoCs) available. Only the top
30 SoCs each hold more than 1% of the market share, and
they collectively account for 51% of the whole market [39].
Thus, the practice of repeatedly optimizing DNN models,
once for each edge device, can no longer meet the demand
in view of the extremely diverse edge devices.
Therefore, it has become crucially important to scale up the optimization of DNNs for edge inference
using automated approaches.
2 State of the Art and Limitations
Network architecture is a key design choice that affects the resulting performance of DNN models on
edge devices. Due to the huge space for network architectures, traditional hand-tuned architecture
designs can take months or even longer to train a DNN with a satisfactory performance [14,40]. Thus,
they have become obsolete and been replaced with automated approaches [34]. Nonetheless, the
early NAS approaches often require training each DNN candidate (albeit usually on a small proxy
dataset), which hence still results in a high complexity and search time. To address this issue, DNN
optimization and training need to be decoupled. For example, the current “once-for-all” technique
can generate a nearly unlimited number (> 10^19) of DNN models of different architectures all at once [6].
Consequently, DNN model developers can now focus on the optimization of network architecture,
without having to train a DNN for each candidate architecture. Thus, instead of DNN training, we
focus on the scalability of optimizing DNN designs, with an emphasis on the neural architecture.
It is known that even the relative ranking of different DNN models in terms of latency/energy
performance can change on different devices: a DNN model with the lowest latency on one device
may not have the best latency performance on another device [24, 39]. Thus, NAS on a single
target device cannot result in the optimal DNN model for all other devices, motivating device-aware
NAS. In general, the device-aware NAS process is guided by an objective function, e.g.,
accuracy loss + weight_1 · energy + weight_2 · latency. Thus, it is crucial to efficiently evaluate
the resulting inference accuracy/latency/energy performance given a DNN candidate [23,27,29,31,36].
Towards this end, proxy models have been leveraged to calculate latency/energy for each candidate,
but they are not very accurate on all devices [38]. Alternatively, actual latency measurement on real
devices for each candidate is also considered, but it is time-consuming [34].
More recently, performance predictors or lookup tables have been utilized to assist with NAS (and
model compression) [5,22,23,27,29,31,33,36,37]: train a machine learning model or build a lookup
table to estimate the resulting accuracy/latency/energy performance for a candidate DNN design on
the target device. Therefore, by using search techniques aided by performance predictors or lookup
tables, an optimal DNN can be identified out of numerous candidates for a target edge device without
actually deploying or running each candidate DNN on the device [6, 37].
Figure 2: Illustration of the existing device-aware DNN optimization (i.e., one optimization for a
single device) [6, 11, 37]. For each device, Step 1 (build performance predictors or lookup tables) and
Step 2 (optimization, e.g., evolutionary search) are repeated.
Nonetheless, as illustrated in Fig. 2, the existing latency/energy predictors or lookup tables [6, 7,
11, 27, 31, 37] are device-specific and only take the DNN features as input to predict the inference
latency/energy performance on a particular target device. For example, according to [7], the average
inference latencies of 4k randomly selected sample DNNs are measured on a mobile device and then
used to train an average latency predictor for that specific device (plus additional 1k samples for
testing). Assuming that each measurement takes 30 seconds, it takes a total of 40+ hours just to collect
training and testing samples in order to build the latency predictor for one single device, let alone
the additional time spent on latency predictor training and other performance predictors. Likewise, to
estimate the inference latency, 350K operator-level latency records are profiled to construct a lookup
table in [11], which is inevitably time-consuming. Clearly, building performance predictors or lookup
tables incurs a significant overhead by itself [6, 7, 11, 27, 31, 37].
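The arithmetic behind the 40+ hour figure, and how it grows when the same procedure is repeated for every device, can be sketched as follows. The sample counts and the 30-second measurement time come from the text above; the device counts in the loop are illustrative assumptions only.

```python
# Back-of-the-envelope cost of collecting latency samples when every device
# needs its own predictor. Sample counts and per-measurement time follow the
# text; the device counts in the loop are illustrative assumptions.

SECONDS_PER_MEASUREMENT = 30
TRAIN_SAMPLES = 4_000
TEST_SAMPLES = 1_000


def collection_hours(num_devices: int) -> float:
    """Hours spent measuring latency samples across num_devices devices."""
    samples_per_device = TRAIN_SAMPLES + TEST_SAMPLES
    return num_devices * samples_per_device * SECONDS_PER_MEASUREMENT / 3600


for n in (1, 30, 1000):
    print(f"{n:>5} device(s): {collection_hours(n):,.1f} hours")
```

The cost is linear in the number of devices, which is the scaling behavior the device-specific approach cannot avoid.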
More crucially, without taking into account the device features, the resulting performance predictors
or lookup tables only apply to the individual device on which the performance is measured. For
example, as shown in Fig. 4 in [11], the same convolution operator can result in dramatically different
latencies and even different latency patterns on two different devices — Samsung S8 with Snapdragon
835 mobile CPU and Hexagon v62 DSP with 800 MHz frequency. As a consequence, the existing
performance predictors or lookup tables that are time-consuming to build cannot scale up to extremely
diverse edge devices.
In addition, the optimizer (e.g., a simple evolutionary search-based algorithm or more advanced
exploration strategies [23, 27, 29, 31]) to identify an optimal architecture for each device also takes
non-negligible time or CPU-hours. For example, even with limited rounds of evolutionary search,
30 minutes to several hours are needed by the DNN optimization process for each device [6, 17, 37].
In [11], the search time may be reduced to a few minutes by only searching for architectures similar
to an already well-designed baseline DNN model, but this comes at the expense of a
very limited search space and possibly missing better DNN designs. Therefore, combined together,
the total search cost for edge devices is still non-negligible, especially given the extremely diverse
edge devices for which scalability is very important.
There have also been many prior studies on DNN model compression, such as pruning and quantization
[1, 9, 10, 13, 15, 16, 20, 21, 25, 28], matrix factorization [12, 26], and knowledge distillation [30],
among others. Like the current practice of NAS, the existing optimizers for compression techniques
typically target a single device (e.g., optimally deciding the quantization and pruning policy
for an individual target device), thus making the overall optimization cost increase linearly with the
number of target devices and lacking scalability [37].
In summary, the state-of-the-art device-aware DNN optimization still takes a large amount of time
and effort for even a single device [6, 7, 11, 37], and cannot scale to extremely diverse edge devices.
3 Proposed Approach
3.1 Overview
To scale up the optimization of DNNs for edge inference, our key idea is learning to optimize: instead
of performing DNN design optimization repeatedly (once for an individual device), we first learn a
DNN optimizer from DNN optimization on sample devices, and then apply the learnt DNN optimizer
to new unseen devices and directly obtain the optimal DNN design.
Figure 3: Overview of our proposed automated DNN optimizer for edge inference. Once the optimizer
is trained, the optimal DNN design for a new device is obtained almost instantly (i.e., in only one
inference pass). In the figure, the device features d and optimization parameters λ are fed to the
Stage 2 optimizer network (parameterized by Θ), whose output x̂_Θ(d, λ) is passed to the Stage 1
performance predictors for accuracy, latency, and energy (parameterized by Θ_A, Θ_L, and Θ_E). Together
these form the objective function f̂(x; d, λ) = −Acc_{Θ_A}(x) + λ_1 · Energy_{Θ_E}(x; d) + λ_2 · Latency_{Θ_L}(x; d).
Offline training uses real and synthetic training devices; the trained optimizer is then used online.
More specifically, we take a departure from the existing practice by: (1) leveraging new performance
predictors that can estimate the resulting inference latency/energy performance given a DNN-device
pair; and (2) using an automated optimizer which takes the device features and optimization
parameters as input, and then directly outputs the optimal DNN design. This is illustrated in Fig. 3.
Our latency/energy performance predictors take as explicit input both the DNN features and device
features, and hence they are reusable and can output the resulting performance for new unseen
devices. Our automated optimizer utilizes a neural network to approximate the optimal DNN design
function, and is intended to cut the search time that would otherwise be incurred for each device. The
initial overhead of training our performance predictors and optimizer is admittedly higher than the
current practice of only training device-specific predictors, but the overall overhead is expected to be
significantly lower, considering the extreme diversity of edge devices.
Mathematically, the problem of designing an optimal DNN for an edge device can be stated as
min_{x∈X} f(x; d, λ), where f(x; d, λ) is the objective function taking into account various metrics
such as accuracy and latency, x is the representation of the DNN design choice (e.g., a combination of
DNN architecture, quantization, and pruning scheme), X is the design space under consideration, d is
the representation of an edge device (e.g., CPU/RAM/GPU/OS configuration), and λ is the parameter
for the objective function. To make the problem easier to solve, instead of simply concatenating naive
one-hot encodings with numerical features, we learn low-dimensional continuous representations of
device features and DNN designs [23, 27, 37].
We consider a commonly-used objective function for DNN optimization as follows:
\[ f(x; d, \lambda) = -\mathrm{accuracy}(x) + \lambda_1 \cdot \mathrm{energy}(x; d) + \lambda_2 \cdot \mathrm{latency}(x; d), \tag{1} \]
where accuracy(x) is the prediction accuracy, energy(x; d) and latency(x; d) are average energy
consumption and inference latency given the DNN design choice represented by x on the device d,
and the weight parameters λ_1 and λ_2 are added to adjust the importance of energy and latency relative
to accuracy, respectively. The weight parameters may vary for different devices, and hence we treat
them as input to our automated optimizer, in addition to the device feature input. This objective
function is also equivalent to optimizing one metric subject to constraints on other metrics [11, 37].
Note that we can also include other performance metrics such as robustness to adversarial samples.
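As a small illustration of Eq. (1), the sketch below evaluates the objective for two candidate designs on two devices and shows that the minimizer can change with the device. The design names, device names, and all numbers are made up for illustration; they are not measurements from any real device.

```python
# Toy evaluation of Eq. (1):
#   f(x; d, lam) = -accuracy(x) + lam1 * energy(x; d) + lam2 * latency(x; d)
# All designs, devices, and numbers are hypothetical.
from typing import Dict, Tuple

# Hypothetical accuracy of two candidate designs (independent of the device).
ACCURACY: Dict[str, float] = {"small": 0.70, "large": 0.76}

# Hypothetical (energy in J, latency in s) of each design on each device.
COST: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("small", "fast_phone"): (0.10, 0.02),
    ("large", "fast_phone"): (0.15, 0.03),
    ("small", "old_phone"): (0.20, 0.10),
    ("large", "old_phone"): (0.60, 0.40),
}


def objective(x: str, d: str, lam1: float, lam2: float) -> float:
    """Objective of Eq. (1) for design x on device d."""
    energy, latency = COST[(x, d)]
    return -ACCURACY[x] + lam1 * energy + lam2 * latency


def best_design(d: str, lam1: float, lam2: float) -> str:
    """Exhaustive minimization over the (tiny) design space."""
    return min(ACCURACY, key=lambda x: objective(x, d, lam1, lam2))


for device in ("fast_phone", "old_phone"):
    print(device, "->", best_design(device, lam1=0.1, lam2=0.5))
```

With both weights set to zero the objective reduces to accuracy alone, so the more accurate design wins on every device; non-zero weights make the choice device-dependent.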
3.2 Training Performance Predictors and Optimizer
Our proposed design builds on top of two-stage training as described below.
Stage 1: Training performance predictors. To speed up the training of our optimizer network, we
need to quickly evaluate the objective function given different DNN designs on different devices. Instead
of actually measuring the performance for each DNN design candidate (which is time-consuming),
we utilize performance predictors. In our example, we have accuracy/latency/energy predictors.
Concretely, the accuracy predictor can be a simple Gaussian process model as used in [11] or a neural
network, whose input is the DNN design choice represented by x, and it does not depend on the edge
device feature d. On the other hand, the latency/energy predictor neural network will use both device
feature d and DNN design representation x as input, and output the respective performance. They
are each trained by running DNNs with sampled designs on training devices and using mean squared
error (i.e., the error between the predicted performance and the true measured value) as the loss
function. The key difference between our design and [11, 37] is that our latency/energy performance
predictors use device features as part of the input and hence can apply to new unseen devices without
training new performance predictors.
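The idea of feeding device features into the latency predictor can be sketched on synthetic data as follows. Everything specific here is an assumption for illustration: a single DNN feature (GFLOPs), a single device feature (throughput in GFLOP/s), noise-free synthetic "measurements", and a log-linear model standing in for the predictor neural network. Only the structure (inputs x and d, mean squared error loss, reuse on an unseen device) mirrors the text.

```python
# Toy device-aware latency predictor trained with mean squared error.
# Assumptions (not from the paper): one DNN feature, one device feature,
# synthetic labels, and a log-linear model instead of a neural network.
import math

designs = [1.0, 2.0, 4.0, 8.0]     # hypothetical DNN feature x: GFLOPs
train_devices = [1.0, 2.0, 4.0]    # hypothetical device feature d: GFLOP/s


def measure_latency(x, d):
    """Stand-in for an on-device latency measurement (seconds)."""
    return x / d


def features(x, d):
    # Both the DNN feature and the device feature are inputs to the predictor.
    return [math.log2(x), math.log2(d), 1.0]


samples = [(features(x, d), math.log2(measure_latency(x, d)))
           for x in designs for d in train_devices]

w = [0.0, 0.0, 0.0]
lr = 0.05
for _ in range(2000):
    grad = [0.0, 0.0, 0.0]
    for phi, y in samples:
        err = sum(wi * pi for wi, pi in zip(w, phi)) - y
        for j, pj in enumerate(phi):
            grad[j] += 2.0 * err * pj / len(samples)  # gradient of the MSE loss
    w = [wi - lr * gi for wi, gi in zip(w, grad)]


def predict_latency(x, d):
    """Predicted latency (seconds) for DNN feature x on device feature d."""
    return 2.0 ** sum(wi * pi for wi, pi in zip(w, features(x, d)))


# Query a device that was never measured (3 GFLOP/s is not in train_devices).
print(round(predict_latency(6.0, 3.0), 3))
```

Because d is an input rather than an implicit constant, the same trained weights answer queries for devices outside the training set, which is what a device-specific predictor cannot do.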
We denote the set of training edge device features as D′_T, where each element d ∈ D′_T corresponds
to the feature of one available training device. To generate training samples, we can randomly sample
some DNN designs (e.g., randomly select some architectures) plus existing DNN designs if available,
and then measure their corresponding performances on training devices as the labels. We denote
the trained accuracy/energy/latency predictor neural networks by Acc_{Θ_A}(x), Energy_{Θ_E}(x; d), and
Latency_{Θ_L}(x; d), respectively, where Θ_A, Θ_E, and Θ_L are learnt parameters for the three respective
networks. Thus, the predicted objective function f̂(x; d, λ) can be expressed as
\[ \hat{f}(x; d, \lambda) = -\mathrm{Acc}_{\Theta_A}(x) + \lambda_1 \cdot \mathrm{Energy}_{\Theta_E}(x; d) + \lambda_2 \cdot \mathrm{Latency}_{\Theta_L}(x; d). \tag{2} \]
The accuracy/energy/latency predictor neural networks are called performance networks, to be
distinguished from the optimizer network we introduce below.
Since collecting energy/latency performances on real training devices is time-consuming, we can
use iterative training to achieve better sample efficiency. Specifically, we can first choose a small
training set of DNN designs at the beginning, and then iteratively include an exploration set of new
DNN designs X_explore to update the performance networks. This is described in Algorithm 1. The
crux is how to choose the exploration set X_explore. Some prior studies have considered Bayesian
optimization to balance exploration vs. exploitation [29, 31], and we leave the choice of X_explore in
each iteration as our future work.
Stage 2: Training the automated optimizer. Given an edge device represented by feature d and
optimization parameter λ, the representation of the corresponding optimal DNN design can be
expressed as a function x∗(d, λ). The current practice of DNN optimization is to repeatedly run an
optimizer (e.g., search-based algorithm), once for a single device, to minimize the predicted objective
function [11, 37]. Nonetheless, obtaining x∗(d, λ) is non-trivial for each device and not scalable
to extremely diverse edge devices. Thus, we address the scalability issue by leveraging the strong
prediction power of another fully-connected neural network parameterized by Θ to approximate the
optimal DNN design function x∗(d, λ). We call this neural network the optimizer network, whose output
is denoted by x̂_Θ(d, λ), where Θ is the network parameter that needs to be learnt. Once Θ is learnt,
when a new device arrives, we can directly predict the corresponding optimal DNN design choice
x̂_Θ(d, λ).
For training purposes, in addition to features of real available training devices D′_T, we can also
generate a set of additional synthetic device features D_S to augment the training samples. We denote
the combined set of devices for training as D_T = D′_T ∪ D_S, and the training set of optimization
parameters as Λ_T, which is chosen according to practical needs (e.g., latency may be more important
than energy or vice versa). Next, we discuss two different methods to train the optimizer network.
Training Method 1: A straightforward method of training the optimizer network is to use
the optimal DNN design x∗(d, λ) as the ground-truth label for input sample (d, λ) ∈ (D_T, Λ_T).
Specifically, we can use the mean squared error loss
\[ \min_{\Theta} \frac{1}{N} \sum_{(d, \lambda) \in (D_T, \Lambda_T)} \left| \hat{x}_{\Theta}(d, \lambda) - x^{*}(d, \lambda) \right|^2 + \mu \|\Theta\|, \tag{3} \]
where N is the total number of training samples, µ‖Θ‖ is the regularizer to avoid over-fitting, and the
ground-truth optimal DNN design x∗(d, λ) is obtained by using an existing optimization algorithm
(e.g., evolutionary search in [11, 37]) based on the predicted objective function. Concretely, the
optimal DNN design used as the ground truth is x∗(d, λ) = arg min_x f̂(x; d, λ), where f̂(x; d, λ)
is the predicted objective function with parameters Θ_A, Θ_E, and Θ_L learnt in Stage 1.
Algorithm 1: Training Performance and Optimizer Networks
Input: Real training devices D′_T, synthetic training devices D_S, training set of optimization
parameters Λ_T, trained DNN models and their corresponding design space X, initial exploration set
X_explore, initial training set of sampled DNN designs X_T ⊂ X and the corresponding
accuracy/energy/latency labels measured on real training devices, and maximum iteration rounds
MaxIterate
Output: Performance network parameters Θ_A, Θ_E, Θ_L, and optimizer network parameter Θ
Initialize: Randomize Θ_A, Θ_E, Θ_L, and Θ;
for i = 1 to MaxIterate do
    for x ∈ X_explore ⊂ X and d ∈ D′_T do
        X_T ← X_T ∪ {x};
        Measure accuracy(x) for a new accuracy label;
        Measure energy(x; d) and latency(x; d) for new energy and latency labels, respectively;
        Update Θ_A, Θ_E, and Θ_L by training performance networks as described in Stage 1;
    end
    Choose a new X_explore;
end
if Training Method 1 is used then
    Fix Θ_A, Θ_E, Θ_L, and obtain x∗(d, λ) = arg min_x f̂(x; d, λ), ∀(d, λ) ∈ (D_T, Λ_T);
    Update Θ by training the optimizer network using Method 1;
else
    Fix Θ_A, Θ_E, Θ_L, and update Θ by training the optimizer network using Method 2;
end
return Θ_A, Θ_E, Θ_L, and Θ;
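Training Method 1 can be sketched on a toy problem as follows. The specifics are assumptions chosen to keep the example small: a scalar design x, a scalar device "slowness" feature d, closed-form toy predictors in place of the trained performance networks, an exhaustive grid search standing in for evolutionary search, a linear model on hand-picked features in place of the fully-connected optimizer network, and no regularizer (µ = 0).

```python
# Toy sketch of Training Method 1: labels from search, then MSE regression.
# Assumptions (not from the paper): closed-form toy predictors, grid search
# instead of evolutionary search, a linear optimizer model, and mu = 0.


def predicted_objective(x, d, lam):
    """Frozen toy f_hat(x; d, lam) = -Acc(x) + lam * Latency(x; d)."""
    accuracy = 2.0 * x - x * x   # toy accuracy predictor, peaks at x = 1
    latency = d * x              # toy latency predictor, grows with slowness d
    return -accuracy + lam * latency


def search_label(d, lam, steps=151):
    """Stand-in for a search-based optimizer: exhaustive 1-D grid search."""
    grid = [i / 100 for i in range(steps)]
    return min(grid, key=lambda x: predicted_objective(x, d, lam))


def features(d, lam):
    return [1.0, d, lam, d * lam]


def optimizer(theta, d, lam):
    """Toy optimizer model x_hat_theta(d, lam)."""
    return sum(t * f for t, f in zip(theta, features(d, lam)))


# Training inputs (d, lam) and their search-generated ground-truth labels.
train_inputs = [(i / 5, j / 2) for i in range(1, 6) for j in range(3)]
labels = [search_label(d, lam) for d, lam in train_inputs]

theta = [0.0, 0.0, 0.0, 0.0]
lr = 0.2
for _ in range(5000):
    grad = [0.0] * 4
    for (d, lam), x_star in zip(train_inputs, labels):
        err = optimizer(theta, d, lam) - x_star
        for k, f in enumerate(features(d, lam)):
            grad[k] += 2.0 * err * f / len(train_inputs)  # gradient of Eq. (3), mu = 0
    theta = [t - lr * g for t, g in zip(theta, grad)]

# Unseen device (d = 0.5 is not in the training grid): one forward pass.
x_new = optimizer(theta, 0.5, 0.8)
print(round(x_new, 3))
```

Note that one search is run per training pair (d, λ) before any regression starts, which is the label-generation cost that motivates an alternative training method.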
Training Method 2: While Method 1 is intuitive, generating many training samples by obtaining
the optimal DNN design x∗(d, λ), even based on the predicted objective function, can be slow
[11, 37]. To reduce the cost of generating training samples, we can directly minimize the predicted
objective function f̂(x; d, λ) = −Acc_{Θ_A}(x) + λ_1 · Energy_{Θ_E}(x; d) + λ_2 · Latency_{Θ_L}(x; d) in an
unsupervised manner, without using the optimal DNN design choice x∗(d, λ) as the ground-truth
label. Specifically, given the input samples (d, λ) ∈ (D_T, Λ_T) including both real and synthetic device
features, we optimize the optimizer network parameter Θ to directly minimize the following loss:
\[ \min_{\Theta} \frac{1}{N} \sum_{(d, \lambda) \in (D_T, \Lambda_T)} \hat{f}\left(\hat{x}_{\Theta}(d, \lambda); d, \lambda\right) + \mu \|\Theta\|. \tag{4} \]
The output of the optimizer network directly minimizes the predicted objective function, and hence
represents the optimal DNN design. Thus, our training of the optimizer network in Method 2 is
unsupervised and guided only by the predicted objective function. When updating the optimizer
network parameter Θ, the parameters for performance predictors Θ_A, Θ_E, and Θ_L learnt in Stage
1 are fixed without updating. In other words, by viewing the concatenation of optimizer network
and performance predictor networks as a single neural network (illustrated in Fig. 3), we update the
parameters (Θ) in the first few layers while freezing the parameters (Θ_A, Θ_E, Θ_L) in the last few
layers to minimize the loss expressed in Eqn. (4).
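Training Method 2 can be sketched on a toy problem as follows. The specifics are assumptions chosen to keep the example small: a scalar design x, a scalar device "slowness" feature d, closed-form toy predictors in place of the trained (and frozen) performance networks, a linear model on hand-picked features in place of the fully-connected optimizer network, and no regularizer (µ = 0). The point is only that the gradient flows through the frozen predictors into the optimizer parameters, with no ground-truth labels.

```python
# Toy sketch of Training Method 2: minimize the predicted objective directly,
# with the performance predictors frozen and no ground-truth labels.
# Assumptions (not from the paper): closed-form toy predictors, a linear
# optimizer model on hand-picked features, and mu = 0.


def predicted_objective(x, d, lam):
    """Frozen toy f_hat(x; d, lam) = -Acc(x) + lam * Latency(x; d)."""
    accuracy = 2.0 * x - x * x   # toy accuracy predictor, peaks at x = 1
    latency = d * x              # toy latency predictor, grows with slowness d
    return -accuracy + lam * latency


def objective_grad_x(x, d, lam):
    """d f_hat / d x: the gradient passed back through the frozen predictors."""
    return 2.0 * x - 2.0 + lam * d


def features(d, lam):
    return [1.0, d, lam, d * lam]


def optimizer(theta, d, lam):
    """Toy optimizer model x_hat_theta(d, lam)."""
    return sum(t * f for t, f in zip(theta, features(d, lam)))


# Training inputs (d, lam); no labels are generated.
train_inputs = [(i / 5, j / 2) for i in range(1, 6) for j in range(3)]

theta = [0.0, 0.0, 0.0, 0.0]
lr = 0.2
for _ in range(5000):
    grad = [0.0] * 4
    for d, lam in train_inputs:
        g_x = objective_grad_x(optimizer(theta, d, lam), d, lam)
        for k, f in enumerate(features(d, lam)):
            grad[k] += g_x * f / len(train_inputs)  # chain rule: df/dx * dx/dtheta
    theta = [t - lr * g for t, g in zip(theta, grad)]

# Unseen device (d = 0.5 is not in the training grid): one forward pass.
x_new = optimizer(theta, 0.5, 0.8)
print(round(x_new, 3), round(predicted_objective(x_new, 0.5, 0.8), 3))
```

In this toy setting the minimizer of the predicted objective is known in closed form (x = 1 − λd/2), so the output for the unseen device can be checked directly; with real performance networks no such check is available and the predicted objective itself is the only guide.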
4 Remarks
In this work, we propose a scalable automated DNN optimizer for edge inference and present an
example of training the optimizer, noting that there can be more advanced approaches for training the
optimizer (e.g., better accuracy or fewer samples needed to build performance predictors). The key
point we would like to highlight in this work is that performing DNN optimization for each individual
device as considered in the existing research is not scalable and that a well-trained optimizer network
can significantly reduce the DNN optimization cost in view of extremely diverse edge devices. We
now offer the following remarks.
• DNN update. When a new training dataset is available and the DNN models need to be updated
for edge devices, we only need to build a new accuracy predictor on (a subset of) the new dataset and
re-train the optimizer network. The average energy/latency predictors remain unchanged, since they
are not much affected by training datasets. Thus, the time-consuming part of building energy/latency
predictors in our proposed approach is a one-time effort and can be re-used for future tasks.
• Generating optimal DNN design. Once the optimizer network is trained, we can directly generate
the optimal DNN design represented by x̂_Θ(d, λ) given a newly arrived edge device d and optimization
parameter λ. Then, the representation x̂_Θ(d, λ) is mapped to the actual DNN design choice
using the learnt decoder. Even though the optimizer network may not always result in the optimal
DNN designs for all edge devices, it can at least help us narrow down the DNN design to a much
smaller space, over which fine tuning the DNN design becomes much easier than over a large design
space.
• Empirical effectiveness. Using performance predictors to guide the optimizer is relevant to
optimization from samples [3, 4]. While in theory optimization from samples may result in bad
outcomes because the predictors may output values with significant errors, the existing NAS and
compression approaches using performance predictors [6, 11, 23, 27, 37] have empirically shown that
such optimization from samples works very well and is able to significantly improve DNN designs in
the context of DNN optimization. This is partly due to the fact that the predicted objective function
only serves as a guide and hence does not need to achieve close to 100% prediction accuracy.
• Relationship to the existing approaches. Our proposed design advances the existing prediction-assisted
DNN optimization approaches [11, 37] by making the DNN optimization process scalable to
numerous diverse edge devices. If our approach is applied to only one edge device, then it actually
reduces to the methods in [11, 37]. Specifically, since the device feature d is fixed given only one
device, we can remove it from our design illustrated in Fig. 3. As a result, our performance predictors
are the same as those in [11,37]. Additionally, our optimizer network can be eliminated, or reduced to
a trivial network that has a constant input neuron directly connected to the output layers without any
hidden layers. Thus, when there is only one edge device, our approach is essentially identical to those
in [11, 37]. Therefore, even in the worst event that the optimizer network or performance predictor
network does not generalize well to some new unseen edge devices (due to, e.g., poor training and/or
lack of edge device samples), we can always optimize the DNN design for each individual device,
one at a time, and fall back to the state of the art [11, 37] without additional penalties.
• When scalability is not needed. It has been widely recognized that a single DNN model cannot
perform the best on many devices, and device-aware DNN optimization is crucial [6, 11, 35, 37, 39].
Thus, we focus on the scalability of DNN optimization for extremely diverse edge devices. On the
other hand, if there are only a few target devices (e.g., a vendor develops its own specialized DNN
model for only a few products), then the need for scalability is not justified and our approach will
reduce to [11, 37].
• GAN-based DNN design. There have been recent attempts to reduce the DNN design space
by training generative adversarial networks [18]. Nonetheless, they only produce DNN design
candidates that are more likely to satisfy the accuracy requirement, and do not perform energy or
latency optimization for DNN designs. Thus, a scalable performance evaluator is still needed to
identify an optimal DNN design for diverse edge devices. By contrast, our approach is inspired
by “learning to optimize” [2]: our optimizer network takes almost no time (i.e., only one optimizer
network inference) to directly produce an optimal DNN design, and can also produce multiple optimal
DNN designs by varying the optimization parameter λ to achieve different performance tradeoffs.
• Ensemble. To mitigate potentially bad predictions produced by our optimizer or performance
networks, we can use an ensemble. For example, an ensemble of latency predictors can be used to
smooth the latency prediction, while an ensemble of the optimizer network can be used to generate
multiple optimal DNN designs, out of which we select the best one based on (an ensemble of)
performance predictors.
• Learning to optimize. Our proposed optimizer network is relevant to the concept of learning to
optimize [2], but employs a different loss function in Method 2 which does not utilize ground-truth
optimal DNN designs as labels. The recent study [19] considers related unsupervised learning to
find optimal power allocation in an orthogonal problem context of multi-user wireless networks, but
the performance is evaluated based on theoretical formulas. By contrast, we leverage performance
predictors to guide the training of our optimizer network and use iterative training.
• Public datasets for future research. Finally, the lack of access to many diverse edge devices
is a practical challenge that prevents many researchers from studying or experimenting with
scalable DNN optimization for edge inference. While there are large datasets available on
(architecture, accuracy) [32], to our knowledge, there do not exist similar publicly-available
datasets containing (architecture, energy, latency, device) for a wide variety of devices. If such
datasets can be made available, they will tremendously help researchers build novel automated optimizers
to scale up the DNN optimization for heterogeneous edge devices, benefiting every stakeholder
in edge inference, be it a gigantic player or a small start-up.
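The ensemble remark above can be sketched as follows: several latency predictors are averaged, several optimizers each propose a design, and the proposal with the lowest ensemble-predicted objective is kept. All predictors, optimizers, and coefficients below are hypothetical closed-form stand-ins, not trained networks.

```python
# Toy sketch of the ensemble idea: average latency predictors, then pick the
# best of several optimizer proposals under the ensemble-predicted objective.
# All functions and coefficients are hypothetical stand-ins.
from statistics import mean

# Hypothetical latency predictors that disagree slightly with each other.
latency_predictors = [
    lambda x, d: 0.9 * d * x,
    lambda x, d: 1.0 * d * x,
    lambda x, d: 1.1 * d * x,
]


def ensemble_latency(x, d):
    """Smoothed latency prediction: the mean over the predictor ensemble."""
    return mean(p(x, d) for p in latency_predictors)


def predicted_objective(x, d, lam):
    accuracy = 2.0 * x - x * x   # toy accuracy predictor
    return -accuracy + lam * ensemble_latency(x, d)


# Hypothetical optimizer ensemble; each member maps (d, lam) to a design.
optimizers = [
    lambda d, lam: 1.0 - 0.4 * d * lam,
    lambda d, lam: 1.0 - 0.5 * d * lam,
    lambda d, lam: 1.0 - 0.7 * d * lam,
]


def ensemble_design(d, lam):
    """Keep the proposal with the lowest ensemble-predicted objective."""
    candidates = [opt(d, lam) for opt in optimizers]
    return min(candidates, key=lambda x: predicted_objective(x, d, lam))


print(ensemble_design(1.0, 1.0))
```

Selecting among a handful of proposals costs only a few predictor evaluations per device, so it keeps the near-instant character of the optimizer network while guarding against a single bad output.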
References
[1] Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. Fused-layer cnn accelerators.
In MICRO, 2016.
[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom
Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by
gradient descent. In NIPS, 2016.
[3] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The power of optimization from samples.
In NIPS, 2016.
[4] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. The limitations of optimization from
samples. In STOC, 2017.
[5] Ermao Cai, Da-Cheng Juan, Dimitrios Stamoulis, and Diana Marculescu. NeuralPower: Predict
and deploy energy-efficient convolutional neural networks. In ACML, 2017.
[6] Han Cai, Chuang Gan, and Song Han. Once for all: Train one network and specialize it for
efficient deployment. In ICLR, 2019.
[7] Han Cai, Ligeng Zhu, and Song Han. ProxylessNas: Direct neural architecture search on target
task and hardware. In ICLR, 2019.
[8] Hsin-Pai Cheng, Tunhou Zhang, Yukun Yang, Feng Yan, Harris Teague, Yiran Chen, and Hai Li.
MSNet: Structural wired neural architecture search for internet of things. In ICCV Workshop,
2019.
[9] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and
acceleration for deep neural networks. 2017. Available at: https://arxiv.org/abs/1710.09282.
[10] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep
neural networks with binary weights during propagations. In NeurIPS, 2015.
[11] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat
Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. ChamNet: Towards efficient network
design through platform-aware model adaptation. In CVPR, 2019.
[12] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting
linear structure within convolutional networks for efficient evaluation. In NeurIPS, 2014.
[13] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai
Qian, Yu Bai, Geng Yuan, Xiaolong Ma, Yipeng Zhang, Jian Tang, Qinru Qiu, Xue Lin, and
Bo Yuan. CirCNN: Accelerating and compressing deep neural networks using block-circulant
weight matrices. In MICRO, 2017.
[14] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey.
Journal of Machine Learning Research, 20(55):1–21, 2019.
[15] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[16] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for
efficient neural network. In NeurIPS, 2015.
[17] Weiwen Jiang, Lei Yang, Sakyasingha Dasgupta, Jingtong Hu, and Yiyu Shi. Standing on
the shoulders of giants: Hardware and neural architecture co-search with hot start. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020.
[18] Sheng-Chun Kao, Arun Ramamurthy, and Tushar Krishna. Generative design of hardware-aware
dnns, 2020.
[19] F. Liang, C. Shen, W. Yu, and F. Wu. Towards optimal power control via ensembling deep
neural networks. IEEE Transactions on Communications, 68(3):1760–1776, 2020.
[20] Ning Liu, Xiaolong Ma, Zhiyuan Xu, Yanzhi Wang, Jian Tang, and Jieping Ye. AutoCompress:
An automatic dnn structured pruning framework for ultra-high compression rates. In AAAI,
2020.
[21] Wei Liu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and Bin
Ren. Patdnn: Achieving real-time DNN execution on mobile devices with pattern-based weight
pruning. In ASPLOS, 2020.
[22] Qing Lu, Weiwen Jiang, Xiaowei Xu, Yiyu Shi, and Jingtong Hu. On neural architecture search
for resource-constrained hardware platforms. In ICCAD, 2019.
[23] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture
optimization. In NeurIPS, 2018.
[24] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines
for efficient cnn architecture design. In ECCV, 2018.
[25] Bradley McDanel, Surat Teerapittayanon, and HT Kung. Embedded binarized neural networks.
2017. Available at: https://arxiv.org/abs/1709.02260.
[26] Seyed Yahya Nikouei, Yu Chen, Sejun Song, Ronghua Xu, Baek-Young Choi, and Timothy
Faughnan. Smart surveillance as an edge network service: From harr-cascade, svm to a
lightweight cnn. In CIC, 2018.
[27] Xuefei Ning, Wenshuo Li, Zixuan Zhou, Tianchen Zhao, Yin Zheng, Shuang Liang, Huazhong
Yang, and Yu Wang. A surgery of the neural architecture evaluators. arXiv preprint
arXiv:2008.03064, 2020.
[28] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet
classification using binary convolutional neural networks. In ECCV, 2016.
[29] Binxin Ru, Xingchen Wan, Xiaowen Dong, and Michael Osborne. Neural architecture search
using bayesian optimisation with weisfeiler-lehman kernel. arXiv preprint arXiv:2006.07556,
2020.
[30] Ragini Sharma, Saman Biookaghazadeh, Baoxin Li, and Ming Zhao. Are existing knowledge
transfer techniques effective for deep learning with edge devices? In EDGE, 2018.
[31] Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, and Tong Zhang. Multi-objective
neural architecture search via predictive network performance optimization. arXiv preprint
arXiv:1911.09336, 2019.
[32] Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter.
NAS-Bench-301 and the case for surrogate benchmarks for neural architecture search. arXiv
preprint arXiv:2008.09777, 2020.
[33] D. Stamoulis, E. Cai, D. Juan, and D. Marculescu. HyperPower: Power- and
memory-constrained hyper-parameter optimization for neural networks. In DATE, 2018.
[34] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and
Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
[35] Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han.
HAT: Hardware-aware transformers for efficient natural language processing. In ACL, 2020.
[36] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. AlphaX:
exploring neural architectures with deep neural networks and monte carlo tree search. arXiv
preprint arXiv:1903.11059, 2019.
[37] Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song
Han. APQ: Joint search for network architecture, pruning and quantization policy. In CVPR,
2020.
[38] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet
design via differentiable neural architecture search. In CVPR, 2019.
[39] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan,
Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin
Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang,
Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. Machine
learning at Facebook: Understanding inference at the edge. In HPCA, 2019.
[40] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR,
2017.
