TEA-DNN: the Quest for Time-Energy-Accuracy Co-optimized Deep Neural
  Networks by Cai, Lile et al.
TEA-DNN: the Quest for Time-Energy-Accuracy
Co-optimized Deep Neural Networks
Lile Cai*
I2R, Singapore
caill@i2r.a-star.edu.sg
Anne-Maelle Barneche*
SUPELEC, France
anne-maelle.barneche@supelec.fr
Arthur Herbout*
ECP, France
arthur.herbout@student.ecp.fr
Chuan Sheng Foo
I2R, Singapore
foo_chuan_sheng@i2r.a-star.edu.sg
Jie Lin
I2R, Singapore
lin-j@i2r.a-star.edu.sg
Vijay Ramaseshan Chandrasekhar†
I2R, Singapore
vijay@i2r.a-star.edu.sg
Mohamed M. Sabry†
NTU, Singapore
msabry@ntu.edu.sg
Abstract—Embedded deep learning platforms have witnessed
two simultaneous improvements. First, the accuracy of convolu-
tional neural networks (CNNs) has been significantly improved
through the use of automated neural-architecture search (NAS)
algorithms to determine CNN structure. Second, there has
been increasing interest in developing hardware accelerators
for CNNs that provide improved inference performance and
energy consumption compared to GPUs. Such embedded deep
learning platforms differ in the amount of compute resources and
memory-access bandwidth, which would affect performance and
energy consumption of CNNs. It is therefore critical to consider
the available hardware resources in the network architecture
search. To this end, we introduce TEA-DNN, a NAS algorithm
targeting multi-objective optimization of execution time, energy
consumption, and classification accuracy of CNN workloads
on embedded architectures. TEA-DNN leverages energy and
execution time measurements on embedded hardware when
exploring the Pareto-optimal curves across accuracy, execution
time, and energy consumption and does not require additional
effort to model the underlying hardware. We apply TEA-DNN
for image classification on actual embedded platforms (NVIDIA
Jetson TX2 and Intel Movidius Neural Compute Stick). We
highlight the Pareto-optimal operating points that emphasize the
necessity to explicitly consider hardware characteristics in the
search process. To the best of our knowledge, this is the most
comprehensive study of Pareto-optimal models across a range of
hardware platforms using actual measurements on hardware to
obtain objective values.
Index Terms—Neural architecture search, hardware con-
straints, multi-objective optimization
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have achieved
state-of-the-art performance in image classification, object
detection and many other applications [1]. To achieve better
accuracy, CNN models have become increasingly deeper and
require more computing and memory resources [2], [3]. This
poses a challenge when these models are deployed to run
on resource-limited devices, such as mobile and embedded
platforms, as the memory on these devices may not be large
enough to hold the models or running the model may consume
more power than the device can supply.
*Equal contribution. †Joint corresponding authors.
Much effort has been devoted to designing CNN models that
can run efficiently on these devices, for instance, by manually
designing more efficient convolution operations and network
architectures [4]–[6]. However, this approach demands expert
knowledge and obtaining an optimal model is difficult – one
has to carefully balance the trade-off between accuracy and
computational resources. An alternative approach is to use
automated neural architecture search (NAS) algorithms to
find optimal models under hardware constraints [7]–[9]. NAS
algorithms usually consist of three components: a controller,
a trainer and an evaluator. The controller is responsible for
sampling models from the search space. The trainer is then
responsible for training the sampled models. Finally, the eval-
uator is responsible for evaluating the optimization objectives
(e.g., model accuracy) on currently sampled models. Following
this evaluation, the parameters of the controller are updated
to increase the likelihood that it subsequently samples better
models.
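The controller, trainer and evaluator loop described above can be sketched in a few lines of Python. This is a toy illustration only, not the method used in this work: the function names, the weight-based controller update and the stand-in scoring function are all assumptions made for the example.

```python
import random

def sample_model(weights, rng):
    """Controller: sample one option per decision, proportionally to its weight."""
    return [rng.choices(range(len(w)), weights=w)[0] for w in weights]

def train_and_evaluate(model):
    """Stand-in for the trainer and evaluator (hypothetical toy score, not real accuracy)."""
    return sum(1.0 for choice in model if choice == 0) / len(model)

def nas_loop(num_decisions=4, num_options=3, iterations=50, seed=0):
    rng = random.Random(seed)
    weights = [[1.0] * num_options for _ in range(num_decisions)]
    best_model, best_score = None, float("-inf")
    for _ in range(iterations):
        model = sample_model(weights, rng)       # controller samples a model
        score = train_and_evaluate(model)        # trainer + evaluator score it
        # Controller update: reinforce the sampled choices in proportion to the score.
        for decision, choice in enumerate(model):
            weights[decision][choice] += score
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

best_model, best_score = nas_loop()
```

In a real NAS system the scoring step would train the sampled network and measure its objectives, which is by far the most expensive part of each iteration.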
Due to the variation in platform hardware/software config-
urations, models optimized for one platform can be subop-
timal for another. Consider two hardware platforms, namely
the Nvidia TITAN X GPU [10] and the Intel Movidius Neural
Compute Stick (NCS) [11], which respectively exemplify a
high-performance and an embedded platform. Figure 1 displays
two Pareto-optimal curves (inference time versus classification
error) of CNN models targeting TITAN X GPU and NCS, both
executed on the TITAN X GPU (for measurement details see
Section IV). It can be seen that the Pareto-optimal models
searched for the NCS are far from the Pareto curve for the
GPU, implying that a platform-agnostic NAS may result in
highly suboptimal models – in this case resulting in up to 2×
increase in execution time to achieve comparable accuracy.
The problem revealed in Fig. 1 demonstrates that in order
to obtain an optimal model for a hardware platform, its
corresponding characteristics have to be taken into consider-
ation during the search process. To this end, we introduce
TEA-DNN (Time-Energy-Accuracy co-optimized Deep Neu-
ral Networks), a NAS framework that explicitly considers two
hardware metrics – inference time and energy consumption
– in addition to classification accuracy as objective metrics.
978-1-7281-2954-9/19/$31.00 ©2019 IEEE
arXiv:1811.12065v2 [cs.NE] 21 Oct 2019
Fig. 1. Pareto-optimal models searched on Movidius are suboptimal on
TITAN X GPU.
We formulate the neural architecture search problem as a
multi-objective optimization problem and leverage Bayesian
optimization to search for Pareto-optimal solutions. While
Bayesian optimization has been used to obtain hardware-aware
neural networks [12], it was only used to search for several
hyper-parameters with a fixed network architecture. To the
best of our knowledge, our work is the first to apply Bayesian
optimization for neural architecture search. Furthermore, TEA-
DNN does not require modeling the hardware platform and
instead leverages the ability to directly measure energy and
execution time on actual hardware. We summarize our contri-
butions as follows:
• We propose a time, energy and accuracy co-optimization
framework for CNNs.
• We employ Bayesian optimization to search for CNN
structures that yield Pareto-optimal operating conditions.
• We demonstrate how different device configurations can
lead to different trade-off behaviors.
• We demonstrate that optimal models searched on one
hardware platform are not optimal for another and thus
reiterate the importance of hardware-aware NAS.
II. RELATED WORK
A. Neural Architecture Search (NAS)
Early versions of NAS algorithms [7] employed recurrent
neural networks (RNNs) to predict the architecture of a target
CNN where the weights of the RNN are updated using
reinforcement learning. [8] follows the same framework as
proposed in [7], but instead of using an RNN to predict the
entire network architecture, the algorithm only predicts the
optimal structure for one convolutional module (or “cell”).
Identical cells are then stacked multiple times to form the full
network. [9] replaced reinforcement learning with progressive
search, which can yield better models with fewer samples.
B. Hardware-Aware NAS
Explicitly incorporating hardware constraints into NAS has
been an active research topic in recent years. HyperPower
[12] approximates the power and memory consumption of a
network using linear regression. These approximations are
then used in the acquisition function of a Bayesian
optimization algorithm to avoid sampling models that violate power or
memory constraints. MnasNet [13] focuses on finding optimal
networks for mobile devices and uses inference time as one
of the objectives. DPP-Net [14] performs neural architecture
search on different devices and considers more optimization
objectives: error rate, number of parameters, FLOPs, memory,
and inference time.
While our work is closely related to MnasNet and DPP-
Net in that we all search for Pareto-optimal networks for
a specific device, our approach is unique in two respects.
Firstly, we perform true multi-objective optimization instead of
combining several objectives into a single objective as done
in MnasNet. Secondly, unlike DPP-Net, we do not use any
surrogate functions to approximate the optimization objectives.
Instead, we directly measure the real-world values for all
the three objectives (i.e., time, energy and accuracy). This
eliminates the need to model the targeted hardware, which
is a challenging task given the diversity of hardware platform
configurations.
III. TEA-DNN OPTIMIZATION FRAMEWORK
A. System Overview
We formulate the neural network architecture search
problem as a multi-objective optimization problem
$\min_x \big(\mathrm{error}(x), \mathrm{energy}(x), \mathrm{time}(x)\big)$, where we wish to
find a network architecture parameterized by $x$ (see Section
III-B for details) that minimizes classification error, energy
consumption, and inference time. We do not assume a
closed-form model for energy consumption or inference time,
but evaluate them directly on actual hardware to measure
real-world performance. Networks are trained and evaluated
on GPUs for efficiency, as we assume that classification error
is not affected by the specific hardware a network is run on.
As formulated, this is an instance of a black-box opti-
mization problem where the objective functions can only be
evaluated (and are not differentiable), and where function eval-
uations (especially classification error, which requires training
the model) are costly. Note that no single “best solution”
exists for a multi-objective optimization problem. A solution
is instead defined by a Pareto optimal set of points, for which
improvement in any objective function cannot be made without
negatively affecting some other objectives.
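The notion of a Pareto-optimal set can be made concrete with a short Python sketch. The helper names `dominates` and `pareto_set` and the candidate values below are hypothetical and chosen only for illustration; each tuple stands for (error in %, energy in J, time in ms), all to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(points):
    """Keep the points that no other point dominates (all objectives are minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (error %, energy J, time ms) triplets, not measurements from this work.
candidates = [
    (22.0, 2.0, 60.0),
    (25.0, 1.2, 35.0),
    (23.5, 1.2, 33.0),
    (26.0, 2.5, 70.0),
]
front = pareto_set(candidates)
```

Here the second candidate is dominated by the third and the fourth by the first, so only the first and third remain: neither can be improved in one objective without getting worse in another.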
We chose to employ a Bayesian optimization algorithm
[15] (detailed in Section III-C) to solve this optimization
problem. We provide a brief overview and refer the reader
to the comprehensive review in [16]. Bayesian optimization
algorithms perform a sequential exploration of the parame-
ter space while building a surrogate probabilistic model to
approximate the objective functions. This model is used to
select points at which to next evaluate the objective functions,
and the obtained function values are then used to update
the model. The algorithm proceeds iteratively following this
select-evaluate-update loop, such that points in the Pareto
optimal set are selected more frequently as the algorithm
progresses. We stop the algorithm after 400 points have been
sampled. A schematic overview of our search algorithm is
shown in Fig. 2.
  
[Figure 2 diagram: blocks for Bayesian Optimization, TEA-DNN Models, Training on GPU, Measurement on Devices, and Multi-objective Function (Accuracy, Inference Time, Energy); arrows labeled Sample, Deploy, and Update parameters.]
Fig. 2. System diagram of the proposed TEA-DNN optimization framework.
B. Search Space
We search over the subset of network architectures that can
be described as repetitions of a modular network “cell”, as
proposed by [9]. The overall network architecture is predefined
(illustrated in Fig. 3(a)) and consists of cells with either stride
1 or 2 (i.e., slide the filter every 1 or 2 pixels). As a common
heuristic, the number of filter channels is doubled after the
stride 2 cells. As such, the network architecture is uniquely
determined by the initial filter channel number F, the number
of cell repeats N and the cell structure. F and N are hyper-
parameters that are pre-specified and the cell structure is
searched using Bayesian optimization.
Specifically, each cell is composed of 5 building blocks and
each building block (illustrated in Fig. 3(b)) is parameterized
by 4 parameters (I1, I2, O1, O2) for a 20-dimensional parame-
ter space. I1 and I2 denote the inputs, and O1 and O2 specify
the operations applied to the respective inputs. The input space
of each building block consists of the outputs of all preceding
blocks in the current cell as well as outputs from the two
preceding cells. The operation space includes the following
eight functions commonly used in top performing CNNs:
1) max 3× 3: 3× 3 max pooling
2) identity: identity mapping
3) sep 3× 3: 3× 3 depthwise-separable convolution
4) conv 3× 3: 3× 3 convolution
5) sep 5× 5: 5× 5 depthwise-separable convolution
6) conv 5× 5: 5× 5 convolution
7) sep 7× 7: 7× 7 depthwise-separable convolution
8) conv 7× 7: 7× 7 convolution
The outputs of the two operations are combined by element-wise
addition, and the final output of the cell is the concatenation
of all unused building block outputs. The search space described
above has a size on the order of $10^{14}$
($2^2 \times 8^2 \times 3^2 \times 8^2 \times 4^2 \times 8^2 \times 5^2 \times 8^2 \times 6^2 \times 8^2 \approx 5.6 \times 10^{14}$).
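The search-space size quoted above can be checked with a short Python sketch, which also draws one cell uniformly at random. The integer encoding of inputs and operations is an assumption made for this example; only the counts (5 blocks, 8 operations, and 2 to 6 candidate inputs per block) come from the description above.

```python
import random

NUM_BLOCKS = 5   # building blocks per cell
NUM_OPS = 8      # operations listed in Section III-B

def num_inputs(block):
    """Candidate inputs for a block: the two preceding cells plus all earlier blocks."""
    return block + 2

def search_space_size():
    size = 1
    for block in range(NUM_BLOCKS):
        # Two input choices (I1, I2) and two operation choices (O1, O2) per block.
        size *= num_inputs(block) ** 2 * NUM_OPS ** 2
    return size

def sample_cell(rng):
    """Draw (I1, I2, O1, O2) uniformly at random for each block: 20 parameters in total."""
    return [
        (rng.randrange(num_inputs(b)), rng.randrange(num_inputs(b)),
         rng.randrange(NUM_OPS), rng.randrange(NUM_OPS))
        for b in range(NUM_BLOCKS)
    ]

size = search_space_size()            # 556627761561600, about 5.6e14
cell = sample_cell(random.Random(0))  # one hypothetical 20-parameter encoding
```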
  
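[Figure 3 diagram: (a) Image, Cell (str 1) × N, Cell (str 2), Cell (str 1) × N, Cell (str 2), Cell (str 1) × N, Globalpool, Softmax; (b) a building block that applies operations O1 and O2 to inputs I1 and I2 and combines the results with Add.]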
Fig. 3. The network architecture predefined for CIFAR-10 (a) and the
architecture of one building block (b).
TABLE I
SPECIFICATIONS OF THE DEVICES USED IN OUR EXPERIMENTS.

                  GTX TITAN X       Jetson TX2        Movidius
Processing Unit   3072 CUDA cores   256 CUDA cores    Myriad 2 VPU
FLOPS             6.7T FP32         1.5T FP32         2T FP16
Memory            12 GByte GDDR5    8 GByte LPDDR4    4 GBit LPDDR3
Mem. Bandwidth    336.6 GBytes/s    59.7 GBytes/s     4 GBits/s
Power             250 W             15 W              1 W
C. Multi-Objective Bayesian Optimization
Bayesian optimization is a sequential model-based approach
that approximates each objective function with a Gaussian
process (GP) model. For a particular objective function (e.g.,
classification error) let $f(x)$ be its surrogate GP model, $x_{1:n}$
be the evaluated network architectures in the search space,
$f_i = f(x_i)$ be the objective function value for network $x_i$,
and $y_i$ be the actual measured function value. The GP model
assumes that $\mathbf{f} = f_{1:n}$ are jointly Gaussian with mean $\mathbf{m}$
and covariance $\mathbf{K}$ and observations $\mathbf{y} = y_{1:n}$ are normally
distributed given $\mathbf{f}$:
$$\mathbf{f} \sim \mathcal{N}(\mathbf{m}, \mathbf{K}), \qquad \mathbf{y} \mid \mathbf{f}, \sigma^2 \sim \mathcal{N}(\mathbf{f}, \sigma^2 \mathbf{I}). \tag{1}$$
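For reference, the model in (1) yields the standard closed-form GP prediction at an unevaluated architecture $x_*$. The symbols $k(\cdot,\cdot)$ (the covariance function that generates $\mathbf{K}$), $\mathbf{k}_*$, $m_*$, $\mu_*$ and $\sigma_*^2$ are notation introduced here for illustration and do not appear elsewhere in the text:

$$\mu_* = m_* + \mathbf{k}_*^{\top} (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{m}), \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top} (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_*,$$

where $\mathbf{k}_* = [k(x_*, x_1), \ldots, k(x_*, x_n)]^{\top}$ and $m_*$ is the prior mean at $x_*$. The posterior mean $\mu_*$ serves as the surrogate prediction and $\sigma_*^2$ quantifies its uncertainty, which is what an acquisition function trades off when choosing the next point.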
Each iteration of Bayesian optimization consists of 3 steps:
1) Selecting the next point (network architecture to evaluate)
$x_{n+1}$ by maximizing an acquisition function, which
specifies a likely candidate that improves the objective(s).
We used the PESMO (Predictive Entropy Search
Multi-objective) [15] acquisition function in our experiments
that chooses points which maximally reduce the entropy
of the current posterior distribution given by the GPs over
the Pareto set.
2) Evaluating the objective functions at $x_{n+1}$.
3) Updating the parameters $\mathbf{m}$ and $\mathbf{K}$ for the GP models.
To employ Bayesian optimization for neural architecture
search, we use the 20-dimensional parameterization of the
search space as described in Section III-B. Our three objective
functions are the 1) error rate (i.e., 1− accuracy), 2) inference
time and 3) energy consumption, and we used the open-
source PESMO implementation in Spearmint [15] for our
experiments.
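The select-evaluate-update iteration can be illustrated with a self-contained Python sketch. It is deliberately simplified and is not the Spearmint/PESMO setup used in our experiments: it optimizes a single toy objective over one variable, uses a zero-mean GP with a squared-exponential kernel, and swaps in a lower-confidence-bound acquisition in place of PESMO. All function names, the kernel length scale and the noise level are assumptions made for the example.

```python
import math
import random

def rbf(a, b, length_scale=0.2):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_posterior(xs, ys, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_star = [rbf(a, x_query) for a in xs]
    alpha = solve(gram, ys)        # (K + sigma^2 I)^-1 y
    beta = solve(gram, k_star)     # (K + sigma^2 I)^-1 k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * b for k, b in zip(k_star, beta)), 0.0)
    return mean, math.sqrt(var)

def objective(x):
    """Toy stand-in for an expensive measurement such as error rate."""
    return (x - 0.3) ** 2

def bayes_opt(iterations=10, seed=0):
    rng = random.Random(seed)
    grid = [i / 50 for i in range(51)]       # candidate points in [0, 1]
    xs = [rng.random(), rng.random()]        # two random initial evaluations
    ys = [objective(x) for x in xs]
    for _ in range(iterations):
        # 1) Select: minimize a lower confidence bound (an assumption, not PESMO).
        scores = []
        for x in grid:
            mean, std = gp_posterior(xs, ys, x)
            scores.append(mean - 2.0 * std)
        x_next = grid[scores.index(min(scores))]
        # 2) Evaluate the objective at the selected point.
        y_next = objective(x_next)
        # 3) Update: the GP is refit on all observations in the next iteration.
        xs.append(x_next)
        ys.append(y_next)
    best = ys.index(min(ys))
    return xs[best], ys[best]

x_best, y_best = bayes_opt()
```

In the multi-objective setting one such GP is maintained per objective, and the acquisition function reasons about the Pareto set rather than a single minimum.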
IV. EXPERIMENTAL SETUP
We evaluate TEA-DNN models on different deep-learning
hardware platforms, representing embedded and server-based
systems. Table I summarizes the properties of these platforms.
A. Training Setup in Search Process
In our experiments, models are trained and tested on the
CIFAR-10 dataset [17], which is a popular benchmarking
dataset for image classification. CIFAR-10 has 50,000 training
images and 10,000 test images of dimension 32 × 32 × 3.
We removed 5,000 images from the training set for use as
a validation set and train on the remaining 45,000 images.
During the search process, each model is trained for 10 epochs
with a batch size of 32. We use the RMSProp optimizer [18]
with momentum and decay both set to 0.9. The learning rate
is set to 0.01, and decayed by 0.94 every 2 epochs. Weight
decay is set to 0.00004. For data augmentation, images are first
zero-padded with 4 pixels on each side to 40 × 40, and then
randomly cropped to 32× 32, followed by random horizontal
flip and random adjustment of brightness and contrast. The
initial channel number F is set to 24 and the number of cell
repeats N is set to 2 in the search process.
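The learning-rate schedule used during the search can be written out as a small Python sketch. It assumes a staircase decay (the rate drops at epoch boundaries rather than continuously) and that a final partial batch is still processed; neither detail is specified above, and the function names are ours.

```python
import math

def learning_rate(epoch, base_lr=0.01, decay=0.94, epochs_per_decay=2):
    """Staircase schedule: multiply the rate by `decay` every `epochs_per_decay` epochs."""
    return base_lr * decay ** (epoch // epochs_per_decay)

def steps_per_epoch(num_images=45000, batch_size=32):
    """Number of batches per epoch, counting a final partial batch (an assumption)."""
    return math.ceil(num_images / batch_size)

# Learning rate at the start of each of the 10 search epochs.
schedule = [learning_rate(epoch) for epoch in range(10)]
```

Under these assumptions the rate falls from 0.01 to about 0.0078 over the 10 search epochs, so the short training runs use a nearly constant learning rate.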
B. Time and Energy Measurement
The measurement method for each device is detailed
below:
• TITAN X GPU: A model is run on the CIFAR-10
validation subset (5,000 images) with a batch size of 100.
During the run, power is queried every 20ms using the
NVIDIA Management Library (NVML) [19]. A typical
power curve is displayed in Fig. 4(a). We use a threshold
of 80W to separate the curve into working and idle states.
The threshold is chosen empirically, and we have found
that it works well for all models we tested. The inference
time is then computed as $t_2 - t_1$, where $t_1$ and $t_2$
delimit the working state (Fig. 4), and energy is computed
by integrating the power curve.
• Jetson TX2: A model is run on the CIFAR-10
validation subset with a batch size of 100. During
the run, power and the corresponding time stamps
are obtained using the Python library provided by [20].
A typical power curve is presented in Fig. 4(b). The
threshold is set to 1W. Inference time and energy are
computed in the same way as for the TITAN X GPU.
• Movidius NCS: We use the profiling tool provided by
the Movidius Neural Compute SDK [21] to obtain the
inference time. During profiling, a power meter [22]
is attached to the NCS to monitor power consumption.
A typical power curve recorded by the power meter is
shown in Fig. 4(c). The threshold is set to 0.45W and
energy is computed in the same way as for the TITAN X GPU
and Jetson TX2.
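The threshold-based procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the measurement code used in our experiments: the working state is taken to span the first to the last sample above the threshold, energy is integrated with the trapezoidal rule, and the power trace is hypothetical.

```python
def time_and_energy(timestamps, power, threshold):
    """Return (inference time in s, energy in J) from a sampled power trace.

    Assumes the working state runs from the first to the last sample whose
    power exceeds the threshold, and integrates power with the trapezoidal rule.
    """
    active = [i for i, p in enumerate(power) if p > threshold]
    if not active:
        return 0.0, 0.0
    start, end = active[0], active[-1]
    energy = 0.0
    for i in range(start, end):
        dt = timestamps[i + 1] - timestamps[i]
        energy += 0.5 * (power[i] + power[i + 1]) * dt
    return timestamps[end] - timestamps[start], energy

# Hypothetical trace sampled every 20 ms, with an 80 W threshold as for the TITAN X GPU.
timestamps = [i * 0.02 for i in range(6)]
power = [15.0, 15.0, 120.0, 130.0, 125.0, 15.0]
duration, energy = time_and_energy(timestamps, power, threshold=80.0)
```

For this trace the working state covers the three samples above 80 W, giving a duration of about 0.04 s and an energy of about 5.05 J.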
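[Figure 4 panels (a), (b), (c): power curves with thresholds of 80W, 1W and 0.45W respectively; t1 and t2 mark the boundaries of the working state.]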
Fig. 4. Power curves on TITAN X GPU (a), Jetson TX2 (b) and Movidius
NCS (c). The dashed orange line represents the threshold above which the
device is considered under load.
V. RESULTS AND DISCUSSIONS
A. Evolution of Pareto Curves
To validate the effectiveness of Bayesian optimization in
searching for time-energy-accuracy co-optimized DNN models,
we compare the Pareto curves found by Bayesian optimization
to those from random sampling of network architectures on
Movidius NCS in Fig. 5. Randomly sampled models are
generated by selecting inputs and operations for a building
block (Section III-B) uniformly at random. We observe that
Bayesian optimization is able to explore the search space more
effectively in that it has a more spread-out distribution of
points (models), and more points seem to lie in the Pareto
optimal set. Also, it is able to return better models using fewer
function evaluations (sampled models). For instance, after
sampling 100 models, Bayesian optimization is able to find a
model with an error rate of 22.16% and energy consumption
of 2.02J, whereas random sampling can only find a model
with much higher error rate (25.88%) at similar energy levels
(1.99J); after sampling 300 models, Bayesian optimization
finds models that achieve a better trade-off between error
rate and energy: a model with an error rate of 23.42% and
energy consumption of 1.16J, while the best model found
through random sampling has a higher error rate (23.90%)
and consumes more energy (1.32J). This clearly demonstrates
the effectiveness of Bayesian optimization in searching for
TEA-DNN models.
B. Cross-Device Evaluation of Pareto-Optimal Models
When performing neural network search on different plat-
forms, a natural question arises as to whether the set of Pareto-
optimal models searched for one platform is also Pareto-
optimal for another. We perform cross-device evaluation of
Pareto-optimal models to answer this question. First, we
evaluate Pareto-optimal models searched for the TITAN X
GPU on the Jetson TX2 (Fig. 6) and Movidius NCS (Fig. 7).
We observe that models optimal on the TITAN X GPU are
not guaranteed to be optimal on embedded devices, in that
more than half of the Pareto-optimal models for the GPU
(blue points) lie to the upper-right (i.e., are inferior in one or
both dimensions) of the original Pareto curve for each of the
embedded devices (orange line). A GPU-optimal model can
incur significantly higher computational cost than a device-
optimal model at similar accuracy levels: in Fig. 7(a), the
GPU-optimal model (point b) takes 2× the inference time
compared to the device-optimal model (point a) (66.30ms
vs. 32.52ms). We also note that models can behave very
differently on different platforms. For example, in Fig. 7(b),
model c consumes more energy than model d on the TITAN
X GPU (508J vs. 489J), but less on the Movidius (1.05J
vs. 1.26J), and forms the new Pareto curve on Movidius.
These highly platform-dependent behaviors clearly indicate
that incorporating the energy and execution time of the targeted
platform is key in TEA-DNN.
In addition, we evaluate the set of Pareto-optimal models
for the Jetson TX2 and Movidius NCS on the TITAN X
GPU, as illustrated in Fig. 8 and Fig. 9. One interesting
phenomenon is that, in both cases, the last model on the Pareto
curve of the embedded device joins the new Pareto curve of the
GPU. For example, in Fig. 8(a), the Jetson-optimal model a
achieves a lower error rate than GPU-optimal model b (22.86%
vs. 23.18%) while also running faster and consuming less energy
(6.08s vs. 8.18s and 815J vs. 1160J), and thus replaces b to
  
[Figure 5 panels: 100 models, 200 models, 300 models, 400 models]
Fig. 5. Evolution of Pareto curves (error rate vs. energy) found by Bayesian optimization (top row) and random sampling (bottom row) on Movidius NCS.
Figures are plotted for every 100 sampled models. In each subfigure, we show the coordinates for the last two points for easier comparison.
Fig. 6. Evaluating Pareto-optimal models searched for TITAN X GPU on
Jetson TX2.
[Figure 7 panels (a) and (b), with labeled points a, b, c, d and e]
Fig. 7. Evaluating Pareto-optimal models searched for TITAN X GPU on
Movidius NCS.
become the new Pareto point. This suggests that the limited
resources on embedded platforms yield CNN architectures
that consume fewer compute resources on the high-performance
GPU.
C. Accuracy Benchmarking
We select a model from the Pareto curve of Movidius that
achieves a low error rate without consuming too much
energy, i.e., Model e in Fig. 7(b). To obtain the final model,
we set the batch size to 64 and train for 300 epochs. The
initial learning rate is 0.01 and is decayed by 0.5 every 100
epochs. Weight decay is 0.0001. Other settings are the same
as the training setup in search process. We experiment with
[Figure 8 panels (a) and (b), each with labeled points a and b]
Fig. 8. Evaluating the Pareto-optimal models searched for Jetson TX2 on
TITAN X GPU.
Fig. 9. Evaluating the Pareto-optimal models searched for Movidius NCS on
TITAN X GPU.
different N and F (Section III-B) and the results are reported
on the CIFAR-10 test subset (10,000 images) in Table II. It can
be seen that deeper (larger N) and wider (larger F) models achieve
lower error rates, at the cost of more parameters, energy
consumption and running time. The cell structure of Model e
is shown in Fig. 10. We note that this model employs quite a
few parameter-efficient operations, e.g., max-pooling, identity,
and 3×3 convolutions.
In Table II, we also compare the TEA-DNN models with
the state-of-the-art image classification model DenseNet [23].
DenseNet is a carefully designed network that connects each
layer to every subsequent layer. It limits the number
of feature maps that each layer produces in order to improve
parameter efficiency. Neither this connection pattern nor the
feature map limit is employed in TEA-DNN. It can be seen
that the TEA-DNN model with N=2 and F=12 consumes less
energy than DenseNet on TITAN X GPU and Jetson TX2,
while DenseNet has a lower error rate and a shorter running time.
Note that DenseNet is not currently supported by Movidius
NCS as the parser cannot fuse batch norm parameters in the
composite function into the concatenated input.
TABLE II
BENCHMARKING OF TEA-DNN MODELS ON CIFAR-10 TEST WITH
VARYING N AND F.

                        Error  #Params  Energy                      Time
Model                   (%)             TITANX  Jetson  Movidius    TITANX  Jetson  Movidius
TEA-DNN (N=2, F=12)     10.20  0.3M     690J    175J    1.03J       7.62s   94.8s   54.9ms
TEA-DNN (N=3, F=12)     9.79   0.5M     876J    233J    1.09J       8.38s   101s    71.2ms
TEA-DNN (N=3, F=16)     8.22   0.9M     1142J   287J    1.11J       9.68s   106s    53.3ms
TEA-DNN (N=2, F=24)     7.89   1.3M     1131J   310J    1.04J       9.38s   107s    48.9ms
TEA-DNN (N=3, F=24)     7.44   2.0M     1473J   395J    1.28J       11.08s  119s    66.4ms
TEA-DNN (N=3, F=48)     6.62   8.0M     2675J   730J    1.73J       15.76s  164s    121ms
DenseNet (L=40, k=12)   5.44   1.1M     1018J   207J    n.a.        5.24s   46.5s   n.a.
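Table II shows DenseNet finishing sooner yet consuming more energy than TEA-DNN (N=2, F=12) on both the TITAN X GPU and the Jetson TX2. Dividing energy by time gives the average power over the run, which makes the difference explicit. The Python sketch below uses only values from Table II; the helper name `average_power` is ours, and the reading assumes that the energy and time entries cover the same interval.

```python
def average_power(energy_joules, time_seconds):
    """Average power in watts over a run: energy divided by duration."""
    return energy_joules / time_seconds

# Energy (J) and time (s) taken from Table II.
tea_dnn_titan = average_power(690.0, 7.62)     # TEA-DNN (N=2, F=12) on TITAN X
densenet_titan = average_power(1018.0, 5.24)   # DenseNet (L=40, k=12) on TITAN X
tea_dnn_jetson = average_power(175.0, 94.8)    # TEA-DNN (N=2, F=12) on Jetson TX2
densenet_jetson = average_power(207.0, 46.5)   # DenseNet (L=40, k=12) on Jetson TX2
```

Under this reading, DenseNet draws roughly twice the average power of the TEA-DNN model on both devices, which is how it can be faster and still use more energy.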
  
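[Figure 10 diagram: a cell with inputs H c−2 and H c−1 and five building blocks, each adding the outputs of two operations drawn from sep 3x3, conv 3x3, conv 5x5, max 3x3 and identity; unused block outputs are concatenated to form the cell output H c.]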
Fig. 10. The cell structure of the TEA-DNN models evaluated in Table II.
VI. CONCLUSIONS
In this work, we propose the TEA-DNN framework that
employs Bayesian optimization to search for time-energy-
accuracy co-optimized CNN models. We apply TEA-DNN
on three different devices: TITAN X GPU, Jetson TX2 and
Movidius NCS. Comparison with random sampling shows that
Bayesian optimization is able to explore the search space
more effectively and to return better models using fewer
sampled models. Detailed cross-device evaluation of Pareto-
optimal models demonstrates that optimal models searched for
one hardware platform are not guaranteed to be optimal for
another, and models can behave very differently on different
platforms. Our comprehensive experiments reveal the highly
platform-dependent behaviours of neural network models and
reiterate the importance of explicitly considering hardware
characteristics in neural architecture search.
REFERENCES
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville,
Deep learning, vol. 1, MIT press Cambridge, 2016.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Identity
mappings in deep residual networks,” in European conference on
computer vision. Springer, 2016, pp. 630–645.
[3] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A
Alemi, “Inception-v4, inception-resnet and the impact of residual
connections on learning.,” in AAAI, 2017, vol. 4, p. 12.
[4] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam,
“Mobilenets: Efficient convolutional neural networks for mobile vision
applications,” arXiv preprint arXiv:1704.04861, 2017.
[5] X Zhang, X Zhou, M Lin, and J Sun, “Shufflenet: An extremely efficient
convolutional neural network for mobile devices,” arXiv
preprint arXiv:1707.01083, 2017.
[6] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q
Weinberger, “Condensenet: An efficient densenet using learned group
convolutions,” group, vol. 3, no. 12, pp. 11, 2017.
[7] Barret Zoph and Quoc V Le, “Neural architecture search with reinforce-
ment learning,” arXiv preprint arXiv:1611.01578, 2016.
[8] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le,
“Learning transferable architectures for scalable image recognition,”
arXiv preprint arXiv:1707.07012, vol. 2, no. 6, 2017.
[9] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-
Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy, “Progressive
neural architecture search,” arXiv preprint arXiv:1712.00559, 2017.
[10] TITAN X GPU, https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x.
[11] Movidius Neural Compute Stick, https://developer.movidius.com/.
[12] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana Mar-
culescu, “Hyperpower: Power-and memory-constrained hyper-parameter
optimization for neural networks,” in Design, Automation & Test in
Europe Conference & Exhibition (DATE), 2018, pp. 19–24.
[13] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V
Le, “Mnasnet: Platform-aware neural architecture search for mobile,”
arXiv preprint arXiv:1807.11626, 2018.
[14] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min
Sun, “Dpp-net: Device-aware progressive search for pareto-optimal
neural architectures,” arXiv preprint arXiv:1806.08198, 2018.
[15] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and
Ryan Adams, “Predictive entropy search for multi-objective bayesian
optimization,” in ICML, 2016, pp. 1492–1501.
[16] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando
De Freitas, “Taking the human out of the loop: A review of bayesian
optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175,
2016.
[17] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of
features from tiny images,” Tech. Rep., Citeseer, 2009.
[18] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide
the gradient by a running average of its recent magnitude,” COURSERA:
Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[19] NVIDIA Management Library (NVML),
https://developer.nvidia.com/nvidia-management-library-nvml.
[20] Lukas Cavigelli, “Convenient power measurements on the jetson
tx2/tegra x2 board,” 2018.
[21] Intel Movidius Neural Compute SDK, https://movidius.github.io/ncsdk/.
[22] Power-Z USB TD Tester, https://www.unionrepair.com/how-to-use-power-z-usb-pd-tester-voltage-current-type-c-meter-km001/.
[23] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Wein-
berger, “Densely connected convolutional networks,” in Proceedings of
the IEEE conference on computer vision and pattern recognition, 2017,
pp. 4700–4708.
ACKNOWLEDGEMENTS
This research is supported by A*STAR under its
Hardware-Software Co-optimisation for Deep Learning
(Project No.A1892b0026). The computational work for this
article was partially performed on resources of the National
Supercomputing Centre, Singapore (https://www.nscc.sg).
