Abstract: Embedded deep learning platforms have witnessed two simultaneous improvements. First, the accuracy of convolutional neural networks (CNNs) has been significantly improved through the use of automated neural-architecture search (NAS) algorithms to determine CNN structure. Second, there has been increasing interest in developing hardware accelerators for CNNs that provide improved inference performance and energy consumption compared to GPUs. Such embedded deep learning platforms differ in the amount of compute resources and memory-access bandwidth, which would affect performance and energy consumption of CNNs. It is therefore critical to consider the available hardware resources in the network architecture search. To this end, we introduce TEA-DNN, a NAS algorithm targeting multi-objective optimization of execution time, energy consumption, and classification accuracy of CNN workloads on embedded architectures. TEA-DNN leverages energy and execution time measurements on embedded hardware when exploring the Pareto-optimal curves across accuracy, execution time, and energy consumption and does not require additional effort to model the underlying hardware. We apply TEA-DNN for image classification on actual embedded platforms (NVIDIA Jetson TX2 and Intel Movidius Neural Compute Stick). We highlight the Pareto-optimal operating points that emphasize the necessity to explicitly consider hardware characteristics in the search process. To the best of our knowledge, this is the most comprehensive study of Pareto-optimal models across a range of hardware platforms using actual measurements on hardware to obtain objective values.
I. INTRODUCTION
Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in image classification, object detection and many other applications [1]. To achieve better accuracy, CNN models have become increasingly deep and require more computing and memory resources [2], [3]. This poses a challenge when these models are deployed on resource-limited devices, such as mobile and embedded platforms: the memory on these devices may not be large enough to hold the models, or running a model may consume more power than the device can supply.

* Equal contribution. † Joint corresponding authors.
Much effort has been devoted to designing CNN models that can run efficiently on these devices, for instance, by manually designing more efficient convolution operations and network architectures [4]-[6]. However, this approach demands expert knowledge, and obtaining an optimal model is difficult: one has to carefully balance the trade-off between accuracy and computational resources. An alternative approach is to use automated neural architecture search (NAS) algorithms to find optimal models under hardware constraints [7]-[9]. NAS algorithms usually consist of three components: a controller, a trainer and an evaluator. The controller samples models from the search space, the trainer trains the sampled models, and the evaluator evaluates the optimization objectives (e.g., model accuracy) on the currently sampled models. Following this evaluation, the parameters of the controller are updated to increase the likelihood that it subsequently samples better models.

Due to the variation in platform hardware/software configurations, models optimized for one platform can be suboptimal for another. Consider two hardware platforms, the NVIDIA TITAN X GPU [10] and the Intel Movidius Neural Compute Stick (NCS) [11], which respectively exemplify a high-performance and an embedded platform. Figure 1 displays two Pareto-optimal curves (inference time versus classification error) of CNN models targeting the TITAN X GPU and the NCS, both executed on the TITAN X GPU (for measurement details see Section IV). The Pareto-optimal models searched for the NCS are far from the Pareto curve for the GPU, implying that a platform-agnostic NAS may result in highly suboptimal models, in this case an up to 2× increase in execution time to achieve comparable accuracy.

978-1-7281-2954-9/19/$31.00 ©2019 IEEE
The problem revealed in Fig. 1 demonstrates that, in order to obtain an optimal model for a hardware platform, the characteristics of that platform have to be taken into consideration during the search process. To this end, we introduce TEA-DNN (Time-Energy-Accuracy co-optimized Deep Neural Networks), a NAS framework that explicitly considers two hardware metrics (inference time and energy consumption) in addition to classification accuracy as objective metrics. We formulate the neural architecture search problem as a multi-objective optimization problem and leverage Bayesian optimization to search for Pareto-optimal solutions. While Bayesian optimization has been used to obtain hardware-aware neural networks [12], it was only used to search for several hyper-parameters with a fixed network architecture. To the best of our knowledge, our work is the first to apply Bayesian optimization for neural architecture search. Furthermore, TEA-DNN does not require modeling the hardware platform and instead leverages the ability to directly measure energy and execution time on actual hardware. We summarize our contributions as follows:
• A time, energy and accuracy co-optimization framework for NAS.
• Employing Bayesian optimization to search for CNN structures that yield Pareto-optimal operating conditions.
• We demonstrate how different device configurations can lead to different trade-off behaviors.
• We demonstrate that optimal models searched on one hardware platform are not optimal for another and thus reiterate the importance of hardware-aware NAS.
II. RELATED WORK
A. Neural Architecture Search (NAS)
Early versions of NAS algorithms [7] employed recurrent neural networks (RNNs) to predict the architecture of a target CNN, where the weights of the RNN are updated using reinforcement learning. [8] follows the same framework as proposed in [7], but instead of using an RNN to predict the entire network architecture, the algorithm only predicts the optimal structure for one convolutional module (or "cell"). Identical cells are then stacked multiple times to form the full network. [9] replaced reinforcement learning with progressive search, which can yield better models with fewer samples.
B. Hardware-Aware NAS
Explicitly incorporating hardware constraints into NAS has been an active research topic in recent years. HyperPower [12] approximates the power and memory consumption of a network using linear regression. These approximations are then used in the acquisition function of a Bayesian optimization algorithm to avoid sampling models that violate power or memory constraints. MnasNet [13] focuses on finding optimal networks for mobile devices and uses inference time as one of the objectives. DPP-Net [14] performs neural architecture search on different devices and considers more optimization objectives, namely, error rate, number of parameters, FLOPs, memory, and inference time.
While our work is closely related to MnasNet and DPP-Net in that we all search for Pareto-optimal networks for a specific device, our approach is unique in two respects. Firstly, we perform true multi-objective optimization instead of combining several objectives into a single objective as done in MnasNet. Secondly, unlike DPP-Net, we do not use any surrogate functions to approximate the optimization objectives. Instead, we directly measure the real-world values for all three objectives (i.e., time, energy and accuracy). This eliminates the need to model the targeted hardware, which is a challenging task given the diversity of hardware platform configurations.
III. TEA-DNN OPTIMIZATION FRAMEWORK

A. System Overview
We formulate the neural network architecture search problem as a multi-objective optimization problem $\min_{x} \, (\mathrm{error}(x), \mathrm{energy}(x), \mathrm{time}(x))$, where we wish to find a network architecture parameterized by $x$ (see Section III-B for details) that minimizes classification error, energy consumption, and inference time. We do not assume a closed-form model for energy consumption or inference time, but evaluate them directly on actual hardware to measure real-world performance. Networks were trained and evaluated on GPUs for efficiency as we assume that classification error is not affected by the specific hardware a network is run on.
As formulated, this is an instance of a black-box optimization problem where the objective functions can only be evaluated (and are not differentiable), and where function evaluations (especially classification error, which requires training the model) are costly. Note that no single "best solution" exists for a multi-objective optimization problem. A solution is instead defined by a Pareto optimal set of points, for which improvement in any objective function cannot be made without negatively affecting some other objectives.
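The definition of the Pareto optimal set above can be made concrete with a small filter over measured objective tuples. The following is only an illustrative sketch: the function names and the sample values are hypothetical and are not taken from the TEA-DNN implementation or its measurements. All objectives are assumed to be minimized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_set(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical (error %, energy J, time ms) tuples, for illustration only.
models = [
    (22.0, 2.0, 60.0),
    (23.5, 1.2, 35.0),
    (24.0, 1.3, 40.0),  # worse than the second model in all three objectives
    (26.0, 1.0, 30.0),
]
front = pareto_set(models)
```

Here the third model is dropped because the second one is better in every objective, while the remaining three each win in at least one objective and therefore all belong to the Pareto set.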
We chose to employ a Bayesian optimization algorithm [15] (detailed in Section III-C) to solve this optimization problem. We provide a brief overview and refer the reader to the comprehensive review in [16]. Bayesian optimization algorithms perform a sequential exploration of the parameter space while building a surrogate probabilistic model to approximate the objective functions. This model is used to select the points at which to evaluate the objective functions next, and the obtained function values are then used to update the model. The algorithm proceeds iteratively following this select-evaluate-update loop, such that points in the Pareto optimal set are selected more frequently as the algorithm progresses. We stop the algorithm after 400 points have been sampled. A schematic overview of our search algorithm is shown in Fig. 2.
B. Search Space
We search over the subset of network architectures that can be described as repetitions of a modular network "cell", as proposed by [9]. The overall network architecture is predefined (illustrated in Fig. 3(a)) and consists of cells with either stride 1 or stride 2 (i.e., the filter slides 1 or 2 pixels at a time). As a common heuristic, the number of filter channels is doubled after each stride-2 cell. As such, the network architecture is uniquely determined by the initial filter channel number F, the number of cell repeats N and the cell structure. F and N are hyperparameters that are pre-specified, and the cell structure is searched using Bayesian optimization.
Specifically, each cell is composed of 5 building blocks and each building block (illustrated in Fig. 3(b)) is parameterized by 4 parameters $(I_1, I_2, O_1, O_2)$, for a 20-dimensional parameter space. $I_1$ and $I_2$ denote the inputs, and $O_1$ and $O_2$ specify the operations applied to the respective inputs. The input space of each building block consists of the outputs of all preceding blocks in the current cell as well as outputs from the two preceding cells. The operation space includes the following eight functions commonly used in top performing CNNs:
1) max 3 × 3: 3 × 3 max pooling
2) identity: identity mapping
3) sep 3 × 3: 3 × 3 depthwise-separable convolution
4) conv 3 × 3: 3 × 3 convolution
5) sep 5 × 5: 5 × 5 depthwise-separable convolution
6) conv 5 × 5: 5 × 5 convolution
7) sep 7 × 7: 7 × 7 depthwise-separable convolution
8) conv 7 × 7: 7 × 7 convolution
The search space described above has an order of 10
. The outputs of the two operations are then combined by elementwise addition. The final output of the cell is the concatenation of all unused building block outputs. 
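To illustrate the 20-dimensional cell parameterization, the sketch below samples a cell uniformly at random and flattens it into a parameter vector. The integer encoding is an assumption made for this example (input indices 0 and 1 stand for the two preceding cells, index 2 + k for the output of block k in the current cell), and the function names are hypothetical; the paper does not specify how TEA-DNN encodes these choices internally.

```python
import random
from typing import List, Tuple

# The eight operations listed in the text, in the same order.
OPERATIONS = [
    "max 3x3", "identity", "sep 3x3", "conv 3x3",
    "sep 5x5", "conv 5x5", "sep 7x7", "conv 7x7",
]
NUM_BLOCKS = 5

Block = Tuple[int, int, int, int]  # (I1, I2, O1, O2)


def sample_cell(rng: random.Random) -> List[Block]:
    """Sample one cell by choosing inputs and operations uniformly at random."""
    cell = []
    for b in range(NUM_BLOCKS):
        # Assumed encoding: two preceding cells plus the b earlier blocks.
        num_inputs = 2 + b
        i1 = rng.randrange(num_inputs)
        i2 = rng.randrange(num_inputs)
        o1 = rng.randrange(len(OPERATIONS))
        o2 = rng.randrange(len(OPERATIONS))
        cell.append((i1, i2, o1, o2))
    return cell


def flatten(cell: List[Block]) -> List[int]:
    """Flatten the 5 blocks x 4 parameters into a 20-dimensional vector."""
    return [value for block in cell for value in block]


def unused_blocks(cell: List[Block]) -> List[int]:
    """Blocks whose output feeds no other block; these are concatenated
    to form the cell output."""
    used = {i - 2 for (i1, i2, _, _) in cell for i in (i1, i2) if i >= 2}
    return [b for b in range(NUM_BLOCKS) if b not in used]


cell = sample_cell(random.Random(0))
vector = flatten(cell)
```

Under this encoding the last block can never be consumed by another block, so it always contributes to the concatenated cell output.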
Each iteration of Bayesian optimization consists of 3 steps:
1) Selecting the next point (network architecture to evaluate) $x_{n+1}$ by maximizing an acquisition function, which specifies a likely candidate that improves the objective(s). We used the PESMO (Predictive Entropy Search Multiobjective) [15] acquisition function in our experiments, which chooses points that maximally reduce the entropy of the current posterior distribution given by the GPs over the Pareto set.
2) Evaluating the objective functions at $x_{n+1}$.
3) Updating the parameters m and K for the GP models.
To employ Bayesian optimization for neural architecture search, we use the 20-dimensional parameterization of the search space as described in Section III-B. Our three objective functions are 1) the error rate (i.e., 1 − accuracy), 2) the inference time and 3) the energy consumption, and we used the open-source PESMO implementation in Spearmint [15] for our experiments.
IV. EXPERIMENTAL SETUP
We evaluate TEA-DNN models on different deep-learning hardware platforms, representing embedded and server-based systems. Table I summarizes the properties of these platforms.
A. Training Setup in Search Process
In our experiments, models are trained and tested on the CIFAR-10 dataset [17], which is a popular benchmarking dataset for image classification. CIFAR-10 has 50,000 training images and 10,000 test images of dimension 32 × 32 × 3. We removed 5,000 images from the training set for use as a validation set and train on the remaining 45,000 images. During the search process, each model is trained for 10 epochs with a batch size of 32. We use the RMSProp optimizer [18] with momentum and decay both set to 0.9. The learning rate is set to 0.01 and decayed by a factor of 0.94 every 2 epochs. Weight decay is set to 0.00004. For data augmentation, images are first zero-padded with 4 pixels on each side to 40 × 40 and then randomly cropped to 32 × 32, followed by a random horizontal flip and random adjustment of brightness and contrast. The initial channel number F is set to 24 and the number of cell repeats N is set to 2 in the search process. Running one iteration of sequential search takes about 35 minutes, with 25 minutes for model training and 10 minutes for Bayesian optimization computation (parameter updates and acquisition function evaluation).
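The learning-rate schedule described above can be written as a one-line function. This sketch assumes a staircase schedule, i.e., the rate is held constant within each 2-epoch window and multiplied by 0.94 at the window boundary; the text does not state whether the decay is applied in steps or continuously, and the function name is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 0.01,
                  decay: float = 0.94, step: int = 2) -> float:
    """Staircase exponential decay: multiply by `decay` every `step` epochs.

    `epoch` is zero-based. The staircase behavior is an assumption.
    """
    return base_lr * decay ** (epoch // step)


# Learning rate for each of the 10 search epochs.
search_schedule = [learning_rate(e) for e in range(10)]
```

The same function would describe the final-training schedule of Section V-C by calling it with `decay=0.5` and `step=100`.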
B. Time and Energy Measurement
TEA-DNN models are deployed automatically on the corresponding device for time and energy measurement in the search process. The measurement method for each device is detailed below:
• TITAN X GPU: A model is launched to run on the CIFAR-10 validation subset (5,000 images) with a batch size of 100. During the process, power is queried every 20 ms using the NVIDIA Management Library (NVML) [19]. A typical power curve is displayed in Fig. 4(a). We use a threshold of 80 W to separate the curve into working and idle states. The threshold is decided empirically, and we have found that it works well for all models we tested. The inference time is then computed as $t_2 - t_1$ and energy is computed by integration.
• Jetson TX2: A model is launched to run on the CIFAR-10 validation subset with a batch size of 100. During the process, power and the corresponding time stamp are obtained using the Python library provided by [20]. A typical power curve is presented in Fig. 4(b). The threshold is set to 1 W. Inference time and energy are computed in the same way as for the TITAN X GPU.
• Movidius NCS: We use the profiling tool provided by the Movidius Neural Compute SDK [21] to obtain the inference time. During profiling, a power meter [22] is attached to the NCS to monitor power consumption. The power meter is controlled by the Power-Z software, which enables us to start, stop and output measurements automatically in the search process. A typical power curve recorded by the power meter is shown in Fig. 4(c). The threshold is set to 0.45 W and energy is computed in the same way as for the TITAN X GPU and Jetson TX2.
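The threshold-based procedure above can be sketched as follows. This is an illustration under stated assumptions, not the measurement code used in the paper: it takes $t_1$ and $t_2$ to be the first and last samples whose power exceeds the threshold, assumes a single working interval, and integrates power with the trapezoidal rule. The function name and the sample trace are hypothetical.

```python
from typing import Sequence, Tuple


def time_and_energy(timestamps: Sequence[float], power: Sequence[float],
                    threshold: float) -> Tuple[float, float]:
    """Return (inference time in s, energy in J) from a sampled power trace.

    timestamps are in seconds and power in watts. The working state is
    assumed to span the first to the last sample above `threshold`.
    """
    active = [i for i, p in enumerate(power) if p > threshold]
    if not active:
        raise ValueError("no sample exceeds the threshold")
    first, last = active[0], active[-1]

    # Trapezoidal integration of power over [t1, t2].
    energy = 0.0
    for i in range(first, last):
        dt = timestamps[i + 1] - timestamps[i]
        energy += 0.5 * (power[i] + power[i + 1]) * dt

    return timestamps[last] - timestamps[first], energy


# Hypothetical trace sampled every 20 ms, with an 80 W threshold.
t = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]
p = [15.0, 15.0, 120.0, 130.0, 120.0, 15.0]
elapsed, joules = time_and_energy(t, p, threshold=80.0)
```

For this trace the working interval covers the three samples above 80 W, giving roughly 0.04 s and 5 J. Any brief dip below the threshold inside the interval would still be included in the integral under these assumptions.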
V. RESULTS AND DISCUSSIONS
A. Evolution of Pareto Curves
To validate the effectiveness of Bayesian optimization in searching for time-energy-accuracy co-optimized DNN models, we compare the Pareto curves found by Bayesian optimization to those from random sampling of network architectures on the Movidius NCS in Fig. 5. Randomly sampled models are generated by selecting inputs and operations for each building block (Section III-B) uniformly at random. We observe that Bayesian optimization explores the search space more effectively: its points (models) are more spread out, and more of them appear to lie in the Pareto optimal set. It also returns better models using fewer function evaluations (sampled models). For instance, after sampling 100 models, Bayesian optimization finds a model with an error rate of 22.16% and energy consumption of 2.02 J, whereas random sampling only finds a model with a much higher error rate (25.88%) at a similar energy level (1.99 J). After sampling 300 models, Bayesian optimization finds models that achieve a better trade-off between error rate and energy: a model with an error rate of 23.42% and energy consumption of 1.16 J, while the best model found through random sampling has a higher error rate (23.90%) and consumes more energy (1.32 J). This demonstrates the effectiveness of Bayesian optimization in searching for TEA-DNN models.
B. Cross-Device Evaluation of Pareto-Optimal Models
When performing neural network search on different platforms, a natural question arises as to whether the set of Pareto-optimal models searched for one platform is also Pareto-optimal for another. We perform cross-device evaluation of Pareto-optimal models to answer this question. First, we evaluate Pareto-optimal models searched for the TITAN X GPU on the Jetson TX2 (Fig. 6) and Movidius NCS (Fig. 7). We observe that models optimal on the TITAN X GPU are not guaranteed to be optimal on embedded devices, in that more than half of the Pareto-optimal models for the GPU (blue points) lie to the upper-right (i.e., are inferior in one or both dimensions) of the original Pareto curve for each of the embedded devices (orange line). A GPU-optimal model can incur significantly higher computational cost than a device-optimal model at similar accuracy levels: in Fig. 7(a), the GPU-optimal model (point b) takes 2× the inference time of the device-optimal model (point a) (66.30 ms vs. 32.52 ms). We also note that models can behave very differently on different platforms. For example, in Fig. 7(b), model c consumes more energy than model d on the TITAN X GPU (508 J vs. 489 J), but less on the Movidius (1.05 J vs. 1.26 J), and forms the new Pareto curve on Movidius. These highly platform-dependent behaviors clearly indicate that incorporating energy and execution time of the targeted platform is key in TEA-DNN.
In addition, we evaluate the set of Pareto-optimal models for the Jetson TX2 and Movidius NCS on the TITAN X GPU, as illustrated in Fig. 8 and Fig. 9. One interesting phenomenon is that the last model on the Pareto curve of each embedded device always becomes part of the new Pareto curve of the GPU. For example, in Fig. 8(a), the Jetson-optimal model a achieves a lower error rate than GPU-optimal model b (22.86% vs. 23.18%), and also runs faster and consumes less energy than model b (6.08 s vs. 8.18 s and 815 J vs. 1160 J), and thus replaces b to become the new Pareto point. This suggests that the limited resources on embedded platforms yield CNN architectures that consume fewer compute resources on the high-performance GPU.
C. Accuracy Benchmarking
We select a model from the Pareto curve of Movidius that achieves a low error rate without consuming too much energy, i.e., Model e in Fig. 7(b). To obtain the final model, we set the batch size to 64 and train for 300 epochs. The initial learning rate is 0.01 and is decayed by a factor of 0.5 every 100 epochs. Weight decay is 0.0001. Other settings are the same as the training setup in the search process. We experiment with different N and F (Section III-B), and the results are reported on the CIFAR-10 test subset (10,000 images) in Table II. It can be seen that deeper (larger N) and wider (larger F) models achieve lower error rates, at the cost of more parameters, energy consumption and running time. The cell structure of Model e is shown in Fig. 10. We note that this model employs quite a few parameter-efficient operations, e.g., max-pooling, identity, and 3 × 3 convolutions.
In Table II, we also compare the TEA-DNN models with the state-of-the-art image classification model DenseNet [23]. DenseNet is a carefully designed network that connects each layer to every subsequent layer. It limits the number of feature maps that each layer produces in order to improve parameter efficiency. Such a connection pattern and feature map limit are not employed in TEA-DNN. It can be seen that the TEA-DNN model with N=2 and F=12 consumes less energy than DenseNet on the TITAN X GPU and Jetson TX2, while DenseNet has a lower error rate and faster running time. Note that DenseNet is not currently supported by the Movidius NCS, as the parser cannot fuse batch norm parameters in the composite function into the concatenated input.
VI. CONCLUSIONS
In this work, we introduce the TEA-DNN framework that employs Bayesian optimization to search for time-energy-accuracy co-optimized CNN models. We apply TEA-DNN on three different devices: TITAN X GPU, Jetson TX2 and Movidius NCS. Comparison with random sampling shows that Bayesian optimization is able to explore the search space more effectively and to return better models using fewer sampled models. Detailed cross-device evaluation of Pareto-optimal models demonstrates that optimal models searched for one hardware platform are not guaranteed to be optimal for another, and that models can behave very differently on different platforms. Our comprehensive experiments reveal the highly platform-dependent behaviors of neural network models and reiterate the importance of explicitly considering hardware characteristics in neural architecture search.
