Differentiable neural architecture search methods have become popular in automated machine learning, mainly due to their low search costs and flexibility in designing the search space. However, these methods have difficulty optimizing the network for hardware efficiency, so the searched network is often unfriendly to hardware. This paper addresses this problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off accuracy against latency with a balancing coefficient. The core of latency prediction is to encode each network architecture and feed it into a multi-layer regressor, with the training data collected by randomly sampling a number of architectures and evaluating them on the hardware. We evaluate our approach on NVIDIA Tesla-P100 GPUs. With 100K sampled architectures (requiring a few hours), the latency prediction module achieves a relative error lower than 10%. Equipped with this module, the search method can reduce the latency by 20% while preserving the accuracy. Our approach can also be transplanted to a wide range of hardware platforms with very little effort, or used to optimize other non-differentiable factors such as power consumption.
Introduction
Neural architecture search (NAS) is an important topic in an emerging research field named automated machine learning (AutoML). The idea is to design automatic algorithms that explore a complicated space containing a very large number of network architectures and find the best one(s) among them. Existing NAS algorithms are roughly categorized into two types [7, 28], namely, heuristic search and differentiable search, which differ in whether the processes of sampling networks from the space and training the sampled networks are jointly optimized. Heuristic NAS methods (which use reinforcement learning [36, 37, 16] or genetic algorithms [23, 30, 22] for heuristic sampling) are often computationally expensive because sampled networks are trained repeatedly, while differentiable NAS methods [18, 2] are faster because a larger fraction of the training is shared among sampled architectures.
* This work was done when Yuhui Xu and Xin Chen were interns at Huawei Noah's Ark Lab. We would like to thank Longhui Wei, Zhengsu Chen, and An Xiao for instructive discussions.
Figure 1: The goal of this paper is to introduce latency prediction to differentiable NAS methods towards a tradeoff between network performance and efficiency.
Besides recognition accuracy, efficiency is also pursued in many real-world scenarios. This often requires the searched architecture to have a low latency at inference time. In this respect, it is straightforward to adopt a multi-objective training scheme in which accuracy and latency are optimized together. This is easy for heuristic search methods [27, 29, 10], but relatively difficult for their differentiable counterparts, since latency is non-differentiable with respect to network parameters, except in scenarios where the search space is very simple, e.g., the networks are chain-style so that the latency can be obtained via a lookup table [29].
This paper explores latency-aware differentiable architecture search in a complicated space, e.g., the DARTS [18] search space, which contains a few nodes as well as topological connections between them and thus exceeds the ability of table lookup. In addition, the relationship between the latency and FLOPs of an architecture can be complex, so the latency is unlikely to be predictable by an empirically designed arithmetic function of the FLOPs.
Our idea is to train a differentiable latency prediction module (LPM) that is able to predict the latency of an architecture. LPM is a multi-layer neural network, with the input being an encoded form of an architecture, e.g., a fixed-length code of architectural parameters, and the output being the latency of the architecture. We train LPM by sampling a large number of architectures from the search space and measuring the latency of each of them. Note that latency is closely related to the machine configuration, so LPM needs to be trained for each specified hardware/software environment. In practice, we sampled 100K architectures from the DARTS space for training, which took around 9 hours on a single NVIDIA Tesla-P100 GPU. The average relative error of latency prediction is smaller than 10%, which is accurate enough for our purpose, as verified in experiments.
Equipped with LPM, we add the latency term to the loss function of DARTS. By setting different balancing coefficients, we can easily trade off accuracy against speed, which is what we desire. We evaluate our approach on CIFAR10 and ImageNet, two standard image classification benchmarks. On CIFAR10, we achieve accuracy similar to the baseline, but our architecture is 20% faster. On ImageNet, we also reduce the network latency significantly without losing much accuracy.
The remainder of this paper is organized as follows. Section 2 briefly reviews the previous literature, and Section 3 elaborates the algorithm for latency-aware architecture search. Experiments are shown in Section 4, and conclusions are drawn in Section 5.
Related Works
The past few years have witnessed the blooming development of deep learning and manually-designed convolutional neural networks (CNNs) have pushed a wide range of computer vision tasks to new state-of-the-art performances [15, 25, 9, 12] . Lately, neural architecture search (NAS) has been attracting more attention due to its strong ability in automatically discovering network architectures with high performance.
According to the methodology used to explore the search space, existing NAS approaches can roughly be divided into two categories, namely, heuristic search and differentiable search. In some pioneering work in this area, architectures were sampled from the search space and trained from scratch to evaluate their capability, for which some heuristic algorithms, such as evolutionary algorithms and reinforcement learning, acted as parameterized controllers of the sampling process. Among them, Liu et al. [17], Xie et al. [30] and Real et al. [22] adopted evolutionary algorithms as the controller, in which genetic operations were used to modify the architecture, and Real et al. [22] showed that better evolutionary algorithms lead to stronger architectures. Another line of heuristics replaced evolutionary algorithms with reinforcement learning (RL) [36, 1, 37, 34, 16], in which a meta-controller is trained to generate the hyper-parameters of each candidate.
A crucial drawback of the above methods is the large search cost (hundreds or even thousands of GPU-days). In order to accomplish the search process with an acceptable cost, differentiable search methods were designed. In DARTS [18] , Liu et al. proposed to introduce a set of architectural parameters to relax the search space so that the search process can be finished in a single training process, where the network parameters and the architectural parameters are jointly optimized and the final architecture is generated according to the architectural parameters. Following DARTS, ProxylessNAS [2] adopted a similar differentiable framework and proposed to search architectures directly on the target dataset. To improve the stability of DARTS, P-DARTS [3] proposed to progressively enlarge the search depth to bridge the depth gap, and PC-DARTS [32] enabled partial channel connection so that a large batch size can be used in the search process.
There also exist efforts in studying the hardware applicability of the discovered architecture in terms of FLOPs and/or latency. It is relatively easy for heuristic search methods to achieve this goal, because hardware constraints like FLOPs or latency can be conveniently measured for any sampled architecture [27, 8]. Regarding differentiable NAS approaches, SNAS [31] added FLOPs and memory access constraints by factorizing the architectural parameters and measuring the costs on each operation in the search space. ProxylessNAS [2] and FBNet [29] adopted latency constraints, which is feasible because their search spaces are chain-styled and the latency is accessible with a lookup table. To the best of our knowledge, no existing work has done the job in a complicated, differentiable search space, e.g., the search space of DARTS-based approaches.
Approach
The goal of DARTS is to search for robust cell architectures from which the evaluation network is constructed. Specifically, a cell is represented by a directed acyclic graph (DAG) of N nodes, $\{x_0, x_1, \ldots, x_{N-1}\}$, where each node represents a set of feature maps. The first two nodes are the output feature maps of previous cells or operations and act as input nodes. The information flow between an intermediate node j and its predecessor node i is carried by an edge $E_{(i,j)}$, on which the candidate operations o(·) in the operation space O are weighted by the normalized architectural parameters $\alpha_{(i,j)}$, i < j, and formulated as:
$$f_{i,j}(x_i) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha^{o}_{i,j})}{\sum_{o' \in \mathcal{O}} \exp(\alpha^{o'}_{i,j})} \, o(x_i).$$
An intermediate node is the summation of the outputs of its preceding edges, which is represented as $x_j = \sum_{i<j} f_{i,j}(x_i)$, and the output node is the concatenation of all intermediate nodes in the channel dimension, which is denoted by $x_{\mathrm{output}} = \mathrm{concat}(x_2, x_3, \ldots, x_{N-1})$.
Figure 2: We sample 10K architectures from the DARTS search space, and plot the FLOPs as well as latency of each architecture when it is applied to ImageNet classification. Note the inconsistency between FLOPs and latency: under a given FLOPs, the smallest latency is often 32% smaller than the largest one, or 8% smaller than the median.
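The edge and node computations described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the normalization of the architectural parameters is a softmax as in DARTS; the scalar toy operations and the names `softmax`, `mixed_edge` and `intermediate_node` are illustrative choices, not taken from the paper.

```python
import math

def softmax(logits):
    """Normalize a list of architectural parameters into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_edge(x, alpha, ops):
    """Output of one edge: softmax-weighted sum of all candidate operations on x."""
    return sum(w * op(x) for w, op in zip(softmax(alpha), ops))

def intermediate_node(inputs, alphas, ops):
    """Output of node j: sum of the edge outputs over all predecessors i < j."""
    return sum(mixed_edge(x, a, ops) for x, a in zip(inputs, alphas))

# Toy scalar stand-ins for real operations (convolution, pooling, skip, none).
toy_ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

# With equal parameters, every operation receives the same weight.
edge_out = mixed_edge(3.0, [0.0, 0.0, 0.0], toy_ops)
node_out = intermediate_node([3.0, 6.0], [[0.0] * 3] * 2, toy_ops)
```

In a real over-parameterized network, `x` would be a feature-map tensor and each operation a learnable layer; only the weighting scheme carries over from this sketch.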
In this manner, DARTS defines an over-parameterized network h(x; ω, α), where ω and α denote the network and architectural parameters, respectively. With a bi-level optimization process, ω and α are trained on a proxy dataset, and α is used to determine the final architecture.
Despite the satisfying performance of the searched architecture, we are not sure if the architecture is also optimized in terms of efficiency, e.g., latency. In particular, DARTS involves many inter-layer connections which may bring memory access issues and slow down the architecture.
Differentiable Latency Prediction
We desire an approach that can automatically optimize the latency for DARTS and, in general, for differentiable search methods with complicated spaces. We first show that this task is feasible. Figure 2 shows the relationship between latency and FLOPs (testing the accuracy for so many architectures is intractable). Even under the same FLOPs, the latency of an architecture can vary a lot. That is to say, there is room for improvement for latency-aware search algorithms.
The key is to design a differentiable loss function that can predict the latency of an architecture, so that it can be integrated into the over-parameterized network optimization process. However, the latency of an architecture is determined by many complicated factors, some of which are hardware-specific, and thus it is very difficult to provide an accurate latency for any given architecture.
Training a Latency Prediction Module
We present a learning-based solution. The latency prediction module (LPM) is a multi-layer regression network, with the input being an encoded sub-network architecture and the output being the predicted latency. Throughout this paper, we only investigate the normal cell and ignore the reduction cell, because the final reduction cell is often composed of weight-free operators and contributes little to the network latency. Moreover, encoding the reduction cell introduces noise to the latency prediction model.
To encode the sub-network, we first recall that each cell of DARTS contains four intermediate nodes with 14 edges and 8 operations on each edge, while the sub-network preserves two edges for each node and only one operation on each selected edge. We use 14 × 8 bits to represent each cell: a bit is 1 if it corresponds to the chosen operation on a preserved edge, otherwise it is 0. In other words, only 8 out of 14 × 8 bits are 1. The 112D vector is propagated through four fully-connected layers with 112, 256, 64 and 1 neurons, respectively, and the final one is the output (latency). We use sigmoid as the activation function for each layer, excluding the last one. For later convenience, we denote a sub-network by γ and the latency prediction function by LPM(γ).
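The encoding and the regressor described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the edge-major bit ordering, the reading of the layer sizes as 112 → 112 → 256 → 64 → 1, the random initialization, and all function names are hypothetical; the weights are untrained, so the output is not a meaningful latency.

```python
import math
import random

NUM_EDGES, NUM_OPS = 14, 8  # 14 edges and 8 operations per edge, as stated in the text

def encode_cell(chosen):
    """Build the 112-bit code from {edge index: operation index} for the 8 preserved edges."""
    code = [0.0] * (NUM_EDGES * NUM_OPS)
    for edge, op in chosen.items():
        code[edge * NUM_OPS + op] = 1.0  # assumed edge-major layout
    return code

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully-connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lpm_forward(code, layers):
    """Forward pass: sigmoid after every layer except the last (the latency output)."""
    h = code
    for idx, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            h = [sigmoid(v) for v in h]
    return h[0]

rng = random.Random(0)
sizes = [112, 112, 256, 64, 1]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

# Hypothetical sub-network: two preserved edges per intermediate node.
example = {0: 3, 1: 0, 2: 5, 4: 1, 5: 2, 7: 6, 9: 4, 12: 0}
code = encode_cell(example)
predicted = lpm_forward(code, layers)
```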
Data collection. We first collect a dataset of (architecture, latency) pairs. On an NVIDIA Tesla-P100 GPU (used in all experiments of our work), we randomly sample 100K architectures from the DARTS space, and evaluate the latency of each architecture with randomized network weights. For a better transferability of the searched architectures, the latency is measured under the ImageNet setting with an input image size of 224 × 224 and is an average of 20 measurements. The entire process takes around 9 hours. Though 100K is a small number compared to the entire search space (there are $1.0 \times 10^{9}$ distinct normal cells), it is enough for the learning task. Then we partition the latency data into two parts: 80K pairs are used for training and the remaining 20K for validation.
Training details. On the 80K training set, the network is trained from scratch for 1,000 epochs using a batch size of 200. We use momentum SGD with a fixed learning rate of 0.01, a momentum of 0.9, a weight decay of $1 \times 10^{-5}$, and a mean square error (MSE) loss function.
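The figure of roughly 1.0 × 10^9 distinct normal cells can be reproduced with a short count. The sketch below assumes that each of the four intermediate nodes keeps two of its 2, 3, 4 and 5 candidate input edges, that each kept edge takes one of 7 operations (the 8 candidates minus none), and that structurally equivalent cells are counted separately; these counting conventions are our reading of the text, not a statement from the paper.

```python
from math import comb

# Candidate input edges of the four intermediate nodes (2 + 3 + 4 + 5 = 14 edges).
candidate_inputs = [2, 3, 4, 5]

# Number of ways to keep two input edges for every node.
topologies = 1
for n_in in candidate_inputs:
    topologies *= comb(n_in, 2)

ops_per_edge = 7        # 8 candidate operations minus "none" (assumption)
preserved_edges = 8     # two per intermediate node

num_cells = topologies * ops_per_edge ** preserved_edges
print(topologies, num_cells)
```

Under these assumptions the count comes to 180 topologies and 1,037,664,180 cells, so the 100K sampled architectures cover about 0.01% of the space.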
Latency prediction results. Here we evaluate LPM using both absolute and relative errors between the prediction and the ground-truth on the testing set. As shown in Table 1 , with an increasing amount of training data, the testing error goes down accordingly. On the other hand, the improvement of accuracy becomes marginal when the amount of training data is larger than 40K. With 80K training data, the latency prediction results are satisfying, with an absolute error smaller than 2ms and a relative error smaller than 10%. As we shall see in search experiments, such accuracy is enough for finding efficient architectures.
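For concreteness, the two error measures can be written as below. This is a sketch that assumes the absolute error is the mean of |prediction − ground truth| and the relative error is the mean of that difference divided by the ground truth; the paper does not spell out the exact definitions, and the sample values are made up for illustration.

```python
def mean_absolute_error(predicted, measured):
    """Average |prediction - ground truth|, in the same unit as the inputs."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)

def mean_relative_error(predicted, measured):
    """Average |prediction - ground truth| / ground truth, as a fraction."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Made-up latencies in milliseconds, for illustration only.
predicted_ms = [20.0, 30.0]
measured_ms = [22.0, 28.0]

abs_err = mean_absolute_error(predicted_ms, measured_ms)
rel_err = mean_relative_error(predicted_ms, measured_ms)
```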
Latency-Aware Architecture Search
Finally, we present the complete search algorithm, which we call latency-aware differentiable neural architecture search (LA-DNAS). In particular, this paper follows the search space and optimization methods of DARTS, so we name our models LA-DARTS.
To this end, we incorporate the trained LPM into the over-parameterized network for architecture search. This is not straightforward due to the difference between the architectural parameters used in latency prediction and those used in architecture search. Specifically, DARTS optimizes an over-parameterized network in which architectural parameters are continuous and all 14 edges are present, whereas LPM is trained on sub-networks with 8 edges and a single operation on each edge. Directly passing the on-training architectural parameters, α, into LPM would therefore incur significant prediction errors.
Our solution to this problem involves sampling a batch of sub-networks from the on-training over-parameterized network, by which we estimate the expected latency of a sub-network sampled from the over-parameterized network. We denote this quantity by LAT(α) and compute it as follows:
$$\mathrm{LAT}(\alpha) = \mathbb{E}_{\gamma \sim S(\alpha)}\left[\mathrm{LPM}(\gamma)\right].$$
Here, S(α) is a distribution determined by α. In practice, we uniformly sample 8 out of 14 edges from α, and then randomly choose the operation on each edge according to the current weights of the operations (excluding none, which does not appear in the final architecture). We use a batch size of M = 20, sample M sub-networks, $\{\gamma_m\}_{m=1}^{M}$, and thus have $\mathrm{LAT}(\alpha) \approx \frac{1}{M} \sum_{m=1}^{M} \mathrm{LPM}(\gamma_m)$. The gradient of LAT(α) with respect to α is estimated by averaging the gradients with respect to $\gamma_m$. The final loss function of the search process is written as:
$$L_{\mathrm{total}}(\alpha) = L_{\mathrm{val}}(\alpha) + \lambda \cdot \mathrm{LAT}(\alpha).$$
Here, the balancing coefficient, λ, controls the tradeoff between accuracy and latency: a smaller λ favors accuracy over latency and vice versa. Note that λ has a unit of $s^{-1}$.
We will show in experiments that choosing a proper λ is not difficult, and adjusting λ can lead to different properties of architectures.
Our design is easily plugged into the bi-level optimization process followed by DARTS-based approaches. The only modification involves replacing the original loss function over α, L_val(α), with L_total(α). The extra computational cost of LAT(α) is negligible.
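The sampling-based latency estimate and the combined loss can be sketched as follows. This is an illustration under several assumptions: two edges are kept per intermediate node (so that the 8 sampled edges match the sub-networks LPM was trained on), operations are drawn with probability proportional to the softmax weights with none excluded, `NONE_OP` is a hypothetical index, and `toy_lpm` is a stand-in for the trained predictor that returns seconds. Only the value of the loss is computed; the gradient estimate described in the text is not shown.

```python
import math
import random

EDGES_PER_NODE = [2, 3, 4, 5]  # candidate input edges of the four intermediate nodes
NUM_OPS = 8
NONE_OP = 7                    # hypothetical index of the "none" operation

def sample_subnetwork(alpha, rng):
    """Draw one sub-network {edge: op} with 8 edges from 14 lists of 8 logits."""
    chosen, offset = {}, 0
    for n_in in EDGES_PER_NODE:
        # Keep two of this node's input edges, chosen uniformly at random.
        for edge in rng.sample(range(offset, offset + n_in), 2):
            ops = [o for o in range(NUM_OPS) if o != NONE_OP]
            weights = [math.exp(alpha[edge][o]) for o in ops]
            chosen[edge] = rng.choices(ops, weights=weights, k=1)[0]
        offset += n_in
    return chosen

def estimate_latency(alpha, lpm, rng, m=20):
    """Monte Carlo estimate of LAT(alpha): mean predicted latency of m samples."""
    return sum(lpm(sample_subnetwork(alpha, rng)) for _ in range(m)) / m

def total_loss(val_loss, alpha, lpm, rng, lam):
    """Validation loss plus lambda times the estimated latency."""
    return val_loss + lam * estimate_latency(alpha, lpm, rng)

def toy_lpm(subnetwork):
    """Stand-in predictor (seconds): 20 ms base plus 1 ms per 'heavy' operation."""
    return 0.020 + 0.001 * sum(1 for op in subnetwork.values() if op < 4)

rng = random.Random(0)
alpha = [[0.0] * NUM_OPS for _ in range(14)]
sub = sample_subnetwork(alpha, rng)
lat = estimate_latency(alpha, toy_lpm, rng)
loss = total_loss(1.5, alpha, toy_lpm, rng, lam=0.2)
```

Because λ multiplies a latency in seconds, the latency term stays small relative to a typical classification loss, which is consistent with λ having a unit of s^-1.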
Discussions and Relationship to Prior Work
To the best of our knowledge, this is the first work that introduces a latency-aware method to a complicated search space. The main difficulty lies in designing a differentiable loss function for latency prediction, while this issue does not exist for heuristic search methods. There are a lot of efforts in applying latency constraints to heuristic search [27, 29, 10] .
On the other hand, in differentiable architecture search, FBNet [29] integrated latency into the loss function by constructing a look-up table. Although this method works well in the chain-style search space, it can fail in the search space of DARTS due to its much higher complexity. In comparison, our approach is feasible for a wider range of search spaces. Also, there were efforts [31] in introducing naturally differentiable quantities, e.g., FLOPs (a linear function of α), to the loss function of differentiable frameworks. Our approach, in comparison, is more general.
A potential extension of our work is to apply it to even more complicated search spaces, where there is more room for improving network efficiency, or to use it to optimize other non-differentiable factors of a network, such as power consumption. One may argue that recognition accuracy is also a kind of non-differentiable quantity, but it is overcomplicated, and one cannot expect it to be predictable by a multi-layer regression network on top of an encoded network architecture.
Experiments
We evaluate LA-DNAS on standard image classification datasets, i.e., CIFAR10 and ImageNet, and study several of its important properties.
Experiments on CIFAR10
First, we evaluate LA-DNAS on CIFAR10 [14]. The CIFAR10 dataset consists of 60K colored natural images of 10 categories at a resolution of 32×32, split into 50K training and 10K testing images. We use DARTS [18] and PC-DARTS [32] as our two baseline methods. Following DARTS and PC-DARTS, we use an individual stage for architecture search and conduct another standalone training process from scratch to evaluate the optimal architecture obtained in the search phase. In the search stage, the goal is to determine the best sets of architectural parameters, namely $\alpha^{o}_{i,j}$ in DARTS and $\alpha^{o}_{i,j}$, $\{\beta_{i,j}\}$ in PC-DARTS, for each edge $E_{(i,j)}$. To this end, the training set is partitioned into two parts, with the first part used for optimizing network parameters, e.g., convolutional weights, and the second part used for optimizing architectural parameters. For fair comparison, the operation space O remains the same as the convention, which contains 8 choices, i.e., 3×3 and 5×5 separable convolution, 3×3 and 5×5 dilated separable convolution, 3×3 max-pooling, 3×3 average-pooling, skip-connect (identity), and zero (none).
Following DARTS and PC-DARTS, in the search stage, the over-parameterized network is constructed by stacking 8 cells (6 normal cells and 2 reduction cells, where cells of the same type share the same architecture), and each cell consists of N = 6 nodes. We train the network for 50 epochs, with the initial number of channels being 16.
In the search experiments based on DARTS and PC-DARTS, the network weights are optimized by momentum SGD, with a batch size of 64 for DARTS and 256 for PC-DARTS, an initial learning rate of 0.025 for DARTS and 0.1 for PC-DARTS (annealed down to zero following the cosine schedule without restart), a momentum of 0.9, and a weight decay of $3 \times 10^{-4}$. We use an Adam optimizer [13] for architectural parameters, with a fixed learning rate of $3 \times 10^{-4}$ for DARTS and $6 \times 10^{-4}$ for PC-DARTS, a momentum of (0.5, 0.999) and a weight decay of $10^{-3}$. For PC-DARTS, we freeze architectural parameters and only allow network parameters to be tuned in the first 15 epochs.
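The cosine schedule without restart mentioned above is commonly written as lr(t) = 0.5 · lr_0 · (1 + cos(π t / T)). The sketch below applies this standard form to the DARTS search setting (initial rate 0.025, 50 epochs); whether the authors step the schedule per epoch or per iteration is not stated, so per-epoch stepping is an assumption here.

```python
import math

def cosine_lr(epoch, total_epochs, initial_lr, final_lr=0.0):
    """Cosine annealing without restart, from initial_lr down to final_lr."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rates for the 50 search epochs of the DARTS setting.
schedule = [cosine_lr(e, 50, 0.025) for e in range(51)]
lr_start, lr_mid, lr_end = schedule[0], schedule[25], schedule[50]
```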
• Evaluation on CIFAR10
The evaluation scenario simply follows that of DARTS and PC-DARTS. The evaluation network is built by stacking 20 cells (18 normal cells and 2 reduction cells). The initial number of channels is 36, and the network is trained from scratch for 600 epochs using a batch size of 128. We use the SGD optimizer with an initial learning rate of 0.025 (annealed down to zero following a cosine schedule without restart), a momentum of 0.9, a weight decay of $3 \times 10^{-4}$ and gradient norm clipping at 5. Drop-path with a rate of 0.2 as well as cutout [5] is also applied for regularization. The balancing coefficient λ is set to 0.2. The GPU latency on CIFAR10 is measured on one Tesla-P100 GPU with a batch size of 32 (input image size 32 × 32) and is the average of 200 measurements.
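The averaging procedure for latency measurement can be sketched generically as below. This is not the authors' measurement script: the warm-up runs, the helper name `measure_latency`, and the CPU-only dummy workload are assumptions made for illustration. On a GPU, the timed callable would additionally need to block until the device has finished, since kernel launches are asynchronous.

```python
import time

def measure_latency(run_once, repeats=200, warmup=10):
    """Average wall-clock time of run_once() over `repeats` timed calls, in seconds."""
    for _ in range(warmup):          # untimed runs to reach a steady state (assumption)
        run_once()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

def dummy_forward():
    """CPU stand-in for one forward pass of a batch through the network."""
    return sum(i * i for i in range(1000))

avg_seconds = measure_latency(dummy_forward)
```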
We conduct latency-aware architecture search on both DARTS and PC-DARTS. As demonstrated in Table 2, LA-DARTS (2nd order) achieves a 2.72% test error with only 2.7M parameters and a latency of 28.4ms on CIFAR10. To achieve a similar classification performance, the original DARTS (2nd order) needs 3.3M parameters with a latency of 40.9ms. The original PC-DARTS requires a much higher latency of 40.7ms on CIFAR10 and 3.6M parameters to achieve nearly the same performance as LA-PC-DARTS. LA-PC-DARTS further improves the classification performance to a 2.61% test error with only 2.6M parameters and a latency of 27.7ms. SNAS [31] obtained nearly the same latency (27.4ms), but its test error is significantly higher (+0.47%) than that of LA-PC-DARTS. P-DARTS [3] achieved a test error of 2.50% by search space regularization, at the cost of a latency as high as 40.9ms.
• The Impact of the Balancing Coefficient
The balancing coefficient λ is an important factor that controls the impact of the latency constraint and directly determines the latency of the searched architectures. To show the impact of λ, different values of λ are adopted to balance the performance and latency of the searched architectures. In this experiment, we set PC-DARTS [32] as the baseline method (λ = 0.00) and choose λ = 0.10, λ = 0.15 and λ = 0.20 to conduct three independent search runs. The normal cells of the searched architectures and their corresponding latency and test errors are shown in Fig. 6. As λ increases, the latency of the searched architectures is reduced while the performance remains relatively stable. This means that our latency optimization can effectively decrease the latency without affecting the performance of the searched architectures. However, if we continue to increase λ beyond 0.2, parameter-free operations dominate the searched architectures, and thus much larger test errors are reported.
• Robustness to Latency Prediction Error
As shown in Section 3.3, the latency prediction module (LPM) still suffers an absolute error of 0.82 ms. We perform additional experiments to demonstrate that an LPM of such precision is enough to offer a good latency constraint and that the framework is robust to the latency prediction error. Random noise with a distribution of N(0, 0.025) is added to the predicted latency. We compare the latency of the architectures searched under the LPM constraint when λ = 0.10 and λ = 0.20. As shown in Table 3, with the injected noise, the LPM still effectively guides the search towards latency-aware architectures under different balancing coefficients, which shows the robustness of the proposed LPM and the latency-aware architecture search framework.
Table 3: Comparison to latency-aware architecture search with added noise. Latency is measured on CIFAR10. λ is the balancing coefficient.
• Comparison to FLOPs-Aware Architecture Search
To show the effectiveness of latency-aware architecture search, we conduct FLOPs-aware architecture search as the control group. Unlike the latency of an architecture, its FLOPs do not depend on the routing of connections but only on the operations themselves, so it is easy to apply the FLOPs constraint as a differentiable term. We measure the FLOPs of each operation in the search space and use a lookup table to compute the overall FLOPs by adding up the FLOPs of each involved operation. A balancing coefficient η is adopted to balance performance and FLOPs in the search scenario. We conduct two independent FLOPs-aware architecture searches with η = 0.005 and η = 0.007, and the latency of the discovered architectures is compared with that of the architectures found by latency-aware architecture search with λ = 0.100 and λ = 0.200. The result shows that the latency-aware approach can discover architectures with lower latency than the FLOPs-aware approach when the searched architectures have comparable FLOPs.
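The lookup-table FLOPs term can be sketched as follows. This illustration assumes that the overall FLOPs are the sum, over edges, of the per-operation FLOPs weighted by the softmax of the architectural parameters, and that the term is added to the validation loss with coefficient η in the same way as the latency term; the table values, the unit of FLOPs, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Normalize architectural parameters into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_flops(alpha, flops_table):
    """Sum over edges of the weighted per-operation FLOPs from the lookup table."""
    return sum(
        sum(w * f for w, f in zip(softmax(edge_logits), flops_table))
        for edge_logits in alpha
    )

def flops_aware_loss(val_loss, alpha, flops_table, eta):
    """Validation loss plus eta times the differentiable FLOPs term."""
    return val_loss + eta * expected_flops(alpha, flops_table)

# Hypothetical per-operation costs (arbitrary unit) for a 3-operation toy space.
toy_table = [1.0, 2.0, 3.0]
toy_alpha = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # two edges, equal weights

flops = expected_flops(toy_alpha, toy_table)
loss = flops_aware_loss(1.0, toy_alpha, toy_table, eta=0.005)
```

Because this term depends only on which operations are weighted and not on how edges are routed, it cannot distinguish two cells with the same operations but different connectivity, which is the limitation the comparison above is meant to expose.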
Experiments on ImageNet
ILSVRC2012 [4], a subset of ImageNet, is used to test the transferability of architectures discovered on CIFAR10. ILSVRC2012 consists of 1,000 object categories, with 1.28M training and 50K validation images for the recognition task. All images are of high resolution and roughly equally distributed over all classes. Following the conventions [37, 18, 32], we apply the mobile setting, where the input image size is fixed to be 224×224 and the number of multi-add operations does not exceed 600M in the testing stage.
Table 5: Comparison with state-of-the-art architectures on ImageNet (mobile setting). Latency is measured on one Tesla-P100 GPU with a batch size of 32 and an input size of 224×224. C denotes the number of initial channels of the architecture.
The evaluation on ILSVRC2012 follows DARTS and PC-DARTS, which also starts with three convolution layers of stride 2 to reduce the resolution of feature maps from 224×224 of the input images to 28×28. 14 cells (12 normal cells and 2 reduction cells) are stacked beyond this point, with an initial channel number of 48. The network is trained from scratch for 250 epochs using a batch size of 1,024 on 8 Tesla V100 GPUs. The network parameters are optimized using an SGD optimizer with a momentum of 0.9, an initial learning rate of 0.5 (decayed down to zero linearly), and a weight decay of $3 \times 10^{-5}$. Additional enhancements are adopted including label smoothing and an auxiliary loss tower during training. Learning rate warm-up is applied for the first 5 epochs. The GPU latency on ILSVRC2012 is also measured on one Tesla-P100 GPU with the same setting as CIFAR10 except for the input image size.
As shown in Table 5, with the same number of initial channels of 48, LA-DARTS has a 19% lower latency than the original DARTS, and the latency of LA-PC-DARTS is 22.4ms, 29% lower than that of PC-DARTS (31.7ms). To further boost the performance under the mobile setting, we increase the initial number of channels of LA-DARTS to 54 and LA-PC-DARTS to 56. Consequently, the top-1 test error of LA-PC-DARTS, from 48 to 56 channels, is improved to 24.1%, which outperforms PC-DARTS by 0.2%, yet its latency (26.1ms) is still 18% lower than that of PC-DARTS.
In the future, with a larger search space, we expect that our algorithm has larger room of improvement in reducing the network latency.
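The relative latency reductions quoted for LA-PC-DARTS against PC-DARTS on ImageNet can be checked directly from the reported latencies (31.7ms for PC-DARTS, 22.4ms and 26.1ms for LA-PC-DARTS with 48 and 56 initial channels). The helper name below is illustrative.

```python
def latency_reduction_percent(baseline_ms, ours_ms):
    """Relative reduction of ours_ms with respect to baseline_ms, in percent."""
    return 100.0 * (baseline_ms - ours_ms) / baseline_ms

pc_darts_ms = 31.7
reduction_c48 = latency_reduction_percent(pc_darts_ms, 22.4)  # 48 initial channels
reduction_c56 = latency_reduction_percent(pc_darts_ms, 26.1)  # 56 initial channels

print(round(reduction_c48), round(reduction_c56))
```

The two values round to 29% and 18%, matching the figures stated in the text.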
Conclusions
This paper presented a differentiable method for predicting the latency of an architecture in a complicated search space, and incorporated this module into differentiable architecture search. This enables us to control the balance between recognition accuracy and inference speed. We design the latency prediction module as a multi-layer regression network, and train it by sampling a number of architectures from the pre-defined search space. Our pipeline is easily transplanted to a wide range of hardware/software configurations, and helps to design machine-friendly architectures.
Our work sheds light on future research in this direction. As researchers continue exploring larger NAS spaces, it will be increasingly difficult for non-differentiable search methods to converge in reasonable search time. A larger search space will also provide more room for optimizing latency, as well as other non-differentiable factors such as power consumption, of the searched architecture. We thus expect more efforts beyond this preliminary work.
A. Visualizing Cells
To guarantee that readers can reproduce our search results, here we attach all normal and/or reduction cells that did not appear in the main paper due to the space limit.
A.1. Reduction Cells on CIFAR10
Reduction cells of architectures found on CIFAR10 with different balancing coefficients are shown in Figure 5 . The balancing coefficients λ are 0.00, 0.10, 0.15 and 0.20, respectively. Latency optimization is combined with PC-DARTS and λ = 0.00 is the same as the original PC-DARTS. The latency is measured on CIFAR10.
A.2. Cells of LA-DARTS and LA-PC-DARTS
The normal and reduction cells of LA-DARTS and LA-PC-DARTS are shown in Figure 6 . The balancing coefficient λ is 0.20 for both LA-DARTS and LA-PC-DARTS. Besides, the normal and reduction cells of DARTS (2nd) and PC-DARTS are shown in Figure 7 . 
