Latency-Aware Differentiable Neural Architecture Search by Xu, Yuhui et al.
Latency-Aware Differentiable Neural
Architecture Search
Yuhui Xu1∗ Lingxi Xie2 Xiaopeng Zhang2 Xin Chen3 Bowen Shi1
Qi Tian2 Hongkai Xiong1
1Shanghai Jiao Tong University 2Huawei Noah’s Ark Lab 3Tongji University
Abstract. Differentiable neural architecture search methods have become
popular in recent years, mainly due to their low search costs and
flexibility in designing the search space. However, these methods have
difficulty optimizing the network, so that the searched network is often
unfriendly to hardware. This paper deals with this problem by adding a
differentiable latency loss term to the optimization, so that the search
process can trade off accuracy against latency with a balancing coefficient.
The core of latency prediction is to encode each network architecture
and feed it into a multi-layer regressor, whose training data can be
easily collected by randomly sampling a number of architectures and
evaluating them on the hardware. We evaluate our approach on NVIDIA
Tesla-P100 GPUs. With 100K sampled architectures (requiring a few hours),
the latency prediction module arrives at a relative error lower than 10%.
Equipped with this module, the search method can reduce the latency by 20%
while preserving the accuracy. Our approach can also be transplanted to a
wide range of hardware platforms with very little effort, or be used to
optimize other non-differentiable factors such as power consumption.
1 Introduction
Neural architecture search (NAS) is an important topic in an emerging research
field named automated machine learning (AutoML). The idea is to design
automatic algorithms that explore a complicated space containing a very large
number of network architectures and find the best one(s) among them.
Existing NAS algorithms are roughly categorized into two types [8,29], namely,
heuristic search and differentiable search, which differ in whether the
processes of sampling networks from the space and training the sampled
networks are jointly optimized. Heuristic NAS methods (which use
reinforcement learning [37,38,17] or genetic algorithms [24,31,23] for heuristic
sampling) are often computationally expensive because sampled networks are
trained repeatedly, while differentiable NAS methods [19,3] are faster because
a larger fraction of training is shared among sampled architectures.
∗ This work was done when Yuhui Xu and Xin Chen were interns at Huawei Noah’s
Ark Lab.
arXiv:2001.06392v2 [cs.CV] 26 Mar 2020
[Figure 1 graphic not reproduced. Labels in the left panel: Proxy dataset,
Training Data, Validation Data, Super-Net, Net-weights, Arch-weights,
Training loss, Validation loss, LPM, LADNAS, Latency friendly architectures,
Deploy. Right panel: three cells over the nodes c_{k-2}, c_{k-1}, 0, 1, 2, 3
and c_{k}: (a) Latency: 30.4ms, (b) Latency: 27.0ms, (c) Latency: 28.3ms.]
Figure 1: Left: the goal of this paper is to introduce latency prediction to
differentiable NAS methods towards a tradeoff between network performance and
efficiency. Right: the latency of architectures in the DARTS space is difficult
to predict, due to the potentially complex topology. Four skip-connect and four
sep-conv-3x3 operators can compose into different cells which have the same
FLOPs but different latency values. The blue and yellow arrows in each cell
indicate skip-connect and sep-conv-3x3 operators, respectively.
Besides recognition accuracy, efficiency is also required in many real-world
scenarios. This often requires the searched architecture to have a low latency
at inference time. To this end, it is straightforward to adopt a
multi-objective training scheme in which accuracy and latency are optimized
together. This is easy for heuristic search methods [28,30,11] but relatively
difficult for their differentiable counterparts, since latency is
non-differentiable with respect to network parameters, except when the search
space is very simple, e.g., the networks are chain-style so that the latency
can be obtained via a lookup table [30].
This paper explores latency-aware differentiable architecture search in a
complicated space, e.g., the DARTS [19] space, which contains a few nodes
as well as topological connections between them and thus exceeds the ability
of table lookup. As shown in Figure 1, the relationship between the latency
and FLOPs of an architecture can be complex, so it is unlikely that latency
can be predicted with an empirically designed arithmetic function of the FLOPs.
Our idea is to train a differentiable latency prediction module (LPM)
that is able to predict the latency of an architecture. LPM is a multi-layer neural
network, with the input being an encoded form of an architecture, e.g., a fixed-
length code of architectural parameters, and the output being the latency of the
architecture. We train LPM by sampling a large number of architectures from
the search space and measuring the latency of each of them. Note that although
latency is closely related to the machine configuration, LPM is adaptive and
can be trained for each specified hardware/software environment. In practice,
we sampled 100K architectures from the DARTS space for training, which took
around 9 hours on a single NVIDIA Tesla-P100 GPU (with a batch size of 32), or
24 hours on an Intel E5-1620 CPU (with a batch size of 1). The average relative
error of latency prediction is around 5% or lower (see Table 1), which is
(as verified in experiments) accurate enough for our purpose, i.e., searching
for latency-friendly architectures.
Equipped with LPM, we add the latency term to the loss function of DARTS.
By setting different balancing coefficients, we can easily trade off accuracy
against speed, which is what we desire. We evaluate our approach on CIFAR10
and ImageNet, two standard image classification benchmarks. We arrive at
classification accuracy similar to the baseline, but our architecture is
15%–20% faster. In addition, our approach is easily transplanted to different
hardware environments at acceptable cost. We train two LPMs, on GPU and CPU
respectively, and they show different properties, i.e., the optimal
architecture found on one device is often sub-optimal on another,
demonstrating the need for hardware-specific architecture design.
The remainder of this paper is organized as follows. Section 2 briefly reviews
the previous literature, and Section 3 elaborates the algorithm for latency-aware
architecture search. Experiments on both GPU and CPU are shown in Section 4,
and conclusions are drawn in Section 5.
2 Related Works
The past years have witnessed a rapid development of deep learning, and
manually-designed convolutional neural networks (CNNs) have pushed a wide
range of computer vision tasks to new state-of-the-art performance
[16,26,10,13]. Lately, neural architecture search (NAS) has been attracting
attention due to its ability to automatically discover network architectures
with high performance.
According to the methodology used to explore the search space, existing NAS
approaches can roughly be divided into two categories, namely, heuristic search
and differentiable search. In some pioneering work in this area, architectures
were sampled from the search space and trained from scratch to evaluate their
capability, for which heuristic algorithms, such as evolutionary algorithms and
reinforcement learning, act as parameterized controllers of the sampling
process. Among them, Liu et al. [18], Xie et al. [31] and Real et al. [23]
adopted evolutionary algorithms as the controller, in which genetic operations
were used to modify the architecture, and Real et al. [23] showed that better
evolutionary algorithms lead to stronger architectures. Another line of
heuristics replaced evolutionary algorithms with reinforcement learning (RL)
[37,1,38,35,17], in which a meta-controller is trained to generate the
hyper-parameters of each candidate.
A crucial drawback of the above methods is the large search cost (hundreds
or even thousands of GPU-days). In order to accomplish the search process with
an acceptable cost, differentiable search methods were designed. In DARTS [19],
Liu et al. introduced a set of architectural parameters to relax the search space
so that the search process can be finished in a single training process, where the
network parameters and the architectural parameters are jointly optimized and
the final architecture is generated according to the architectural parameters. Fol-
lowing DARTS, ProxylessNAS [3] adopted a similar differentiable framework and
proposed to search architectures directly on the target dataset. To improve the
stability of DARTS, P-DARTS [4] proposed to progressively enlarge the search
depth to bridge the depth gap, and PC-DARTS [33] enabled partial channel
connection so that a large batch size can be used in the search process.
There also exist efforts in studying the hardware applicability of the dis-
covered architecture in terms of FLOPs and/or latency. It is relatively easy
for heuristic search methods to achieve this goal, because hardware constraints
like FLOPs or latency can be conveniently measured for any sampled architec-
ture [28,9]. Regarding differentiable NAS approaches, SNAS [32] added FLOPs
and memory access constraints by factorizing the architectural parameters and
measuring the costs on each operation in the search space. ProxylessNAS [3] and
FBNet [30] adopted latency constraints since the search space is chain-styled and
those constraints are accessible with a lookup table. To the best of our knowl-
edge, no existing work has done the job in a complicated, differentiable search
space, e.g., the search space of DARTS-based approaches.
3 Approach
3.1 DARTS and the Difficulty of Latency Prediction
The goal of DARTS is to search for robust cell architectures to construct
the evaluation network. Specifically, a cell is represented by a directed
acyclic graph (DAG) of N nodes, $\{x_0, x_1, \ldots, x_{N-1}\}$, where each
node represents a set of feature maps. The first two nodes are the output
feature maps of previous cells or operations and act as input nodes. The
information flow between an intermediate node j and its predecessor node i is
carried by an edge E(i,j), on which a set of candidate operations o(·) in the
operation space O are weighted by the normalized architectural parameters
$\alpha^{(i,j)}$, i < j, and formulated as:
$$f_{i,j}(x_i) = \sum_{o \in \mathcal{O}_{i,j}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)} \, o(x_i). \tag{1}$$
An intermediate node is the summation of the outputs of its preceding edges,
which is represented as $x_j = \sum_{i<j} f_{i,j}(x_i)$, and the output node is
the concatenation of all intermediate nodes in the channel dimension, which is
denoted by $x_{\mathrm{output}} = \mathrm{concat}(x_2, x_3, \ldots, x_{N-1})$.
In this manner, DARTS defines an over-parameterized network $h(x; \omega, \alpha)$
where ω and α denote the network and architectural parameters. With a bi-level
optimization process, ω and α are trained on a proxy dataset and α is used to
determine the final architecture.
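As a minimal sketch of Eq. (1), the snippet below computes the softmax-weighted
mixture of candidate operations on a single edge. The scalar "operations" in
toy_ops are hypothetical stand-ins chosen only to keep the example small; in
DARTS each operation acts on a set of feature maps.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_edge(x, alphas, ops):
    """Eq. (1): softmax(alpha)-weighted sum of the candidate operations applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Hypothetical scalar stand-ins for the candidate operations on one edge.
toy_ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]

# With equal architectural parameters every operation gets weight 1/3.
edge_out = mixed_edge(3.0, [0.0, 0.0, 0.0], toy_ops)
```

Raising one entry of alphas shifts the mixture towards the corresponding
operation, which is the relaxation that lets the architecture be optimized by
gradient descent.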
Despite the satisfying performance of the searched architecture, we are not
sure if the architecture is also optimized in terms of efficiency, e.g., latency. In
particular, DARTS involves many inter-layer connections (e.g., each cell receives
input from two previous cells) which may bring memory access issues and slow
down the architecture. More importantly, such a complex architecture brings
uncertainty in latency estimation, because the cost of memory access is often
difficult to measure, unlike that of a specific operator. Hence, summing up the
latency of all layers (stored in a lookup table [30]) is no longer accurate.
[Figure 2 graphic not reproduced. Left: scatter plot of FLOPs versus latency
with the annotated values 17.1, 18.5, 25.0, 32.3, 25.3 and 23.2. Right: two
normal cells over the nodes c_{k-2}, c_{k-1}, 0, 1, 2, 3 and c_{k}; the first
uses only dil_conv_3×3 and dil_conv_5×5 operators, the second uses
dil_conv_3×3, dil_conv_5×5, sep_conv_3×3, avg_pool_3×3 and skip_connect.]
Figure 2: We sample 10K architectures from the DARTS space and plot the
FLOPs as well as latency of each of them on ImageNet data (224 × 224). Left:
under a specified FLOPs, the smallest latency can be 32% smaller than the
largest one, or 8% smaller than the median. Right: the slowest (top, 32.3ms)
and fastest (bottom, 23.2ms) architectures (in normal cells) under 490M FLOPs
(the purple dashed line), in which the main difference is caused by the varying
latency/FLOPs ratio, e.g., the dil-conv-3x3 operator has nearly half the FLOPs
yet requires around 70% of the latency compared to sep-conv-3x3.
We verify this statement by observing the relationship between latency and
FLOPs, the latter being closely related to the sum of the latency of individual
layers. As shown in Figure 2, although latency and FLOPs are positively
correlated, architectures with the same FLOPs can still have very different
latency, with the fastest one being at least 30% faster than the slowest one.
Consequently, FLOPs-aware search methods [32] are not guaranteed to produce
efficient results, and there is room for latency-aware search algorithms.
3.2 Latency-Aware Differentiable Architecture Search
We present a search framework which we call latency-aware differentiable neural
architecture search (LA-DNAS). In particular, this paper follows the search space
and optimization methods of DARTS, so we name our models LA-DARTS.
The key of LA-DARTS is to design a differentiable loss function that can
predict the latency of the architecture parameter, α, so that it can be integrated
into the over-parameterized network optimization process. We denote this func-
tion as LAT(α), which is the expectation of latency when an architecture is
sampled according to the weights of α:
$$\mathrm{LAT}(\alpha) = \mathbb{E}_{\gamma \sim S(\tilde{\alpha})}\left[\mathrm{LAT}(\gamma)\right], \tag{2}$$
where γ denotes a discretized sub-architecture that DARTS allows to appear,
and $S(\tilde{\alpha})$ denotes that the sampling process is parameterized by
$\tilde{\alpha}$ ($\tilde{\alpha}$ denotes the probabilities obtained by
passing each edge of α through softmax). In practice, we uniformly sample 8 out
of 14 edges from $\tilde{\alpha}$, and then randomly choose the operation on
each edge according to the current weights of the operations (excluding none,
which does not appear in the final architecture). We use a batch size of
M = 20, sample M sub-architectures, $\{\gamma_m\}_{m=1}^{M}$, and thus have
$\mathrm{LAT}(\alpha) \approx \frac{1}{M}\sum_{m=1}^{M} \mathrm{LPM}(\gamma_m)$,
where LPM(·) denotes a latency prediction function which will be detailed in
the next subsection. The final loss function of the search process is written
as:
$$\mathcal{L}_{\mathrm{total}}(\alpha) = \mathcal{L}_{\mathrm{val}}(\alpha) + \lambda \cdot \mathrm{LAT}(\alpha). \tag{3}$$
Here, the balancing coefficient, λ, controls the tradeoff between accuracy and
latency: a smaller λ prefers accuracy to latency and vice versa. Note that λ
has a unit of sec^{-1}. We will show in experiments that choosing a proper λ is
not difficult, and that adjusting λ leads to architectures with different
properties. Since LAT(α) is differentiable, this loss function is easily
optimized following the bi-level optimization of DARTS-based approaches.
[Figure 3 graphic not reproduced. It shows the super-network (Image, stacked
normal and reduction cells, Softmax), M sub-architectures γ_1, ..., γ_M sampled
from S(α) and encoded as binary codes, the LPM outputs (e.g., LPM(γ_1) = 21.3ms
and LPM(γ_M) = 25.2ms), and their average LAT(α) ≈ (1/M) Σ_m LPM(γ_m) = 23.5ms.]
Figure 3: Illustration of the proposed latency-aware differentiable architecture
search (best viewed in color). The latency of the current over-parameterized
network is estimated by sampling sub-networks from it, feeding them into the
pre-trained latency prediction module (LPM), and averaging the results. The
binary code indicates the encoded architectures, in which we use a simplified
super-network with |O| = 3 for better visualization (in DARTS, |O| = 8).
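The sampling estimate of LAT(α) and the total loss of Eq. (3) can be sketched
as follows. This is an illustration under stated assumptions, not the authors'
code: two incoming edges are kept per intermediate node (as described for the
sub-architecture in Section 3.3), `none` is excluded so that seven operations
remain, and toy_lpm is a made-up stand-in for a trained LPM.

```python
import math
import random

NUM_INPUTS = [2, 3, 4, 5]  # incoming edges of the four intermediate nodes (14 in total)
NUM_OPS = 7                # assumption: the eight DARTS operations minus `none`

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_sub_architecture(alpha, rng):
    """Draw gamma: two incoming edges per node, one operation per kept edge.

    alpha is a list of 14 lists of NUM_OPS logits; the result maps
    edge index -> operation index for the 8 kept edges.
    """
    gamma, offset = {}, 0
    for fan_in in NUM_INPUTS:
        for local in rng.sample(range(fan_in), 2):
            edge = offset + local
            probs = softmax(alpha[edge])
            gamma[edge] = rng.choices(range(NUM_OPS), weights=probs)[0]
        offset += fan_in
    return gamma

def estimate_latency(alpha, lpm, rng, num_samples=20):
    """LAT(alpha) ~= mean of lpm(gamma_m) over M sampled sub-architectures."""
    samples = [sample_sub_architecture(alpha, rng) for _ in range(num_samples)]
    return sum(lpm(g) for g in samples) / num_samples

def total_loss(val_loss, latency, lam):
    """Eq. (3): L_total = L_val + lambda * LAT(alpha)."""
    return val_loss + lam * latency

# Hypothetical stand-in predictor: a made-up cost per operation index.
toy_cost = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

def toy_lpm(gamma):
    return sum(toy_cost[op] for op in gamma.values())

rng = random.Random(0)
alpha = [[0.0] * NUM_OPS for _ in range(14)]
lat = estimate_latency(alpha, toy_lpm, rng)
loss = total_loss(val_loss=0.5, latency=lat, lam=0.2)
```

The default of 20 samples mirrors the batch size M = 20 stated in the text; the
validation loss of 0.5 is an arbitrary placeholder.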
To compute the gradient of LAT(α) with respect to α, we have:
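$$\frac{\partial \mathrm{LAT}(\alpha)}{\partial \alpha} \approx \frac{1}{M} \sum_{m=1}^{M} \frac{\partial \mathrm{LAT}(\gamma_m)}{\partial \gamma_m} \cdot \frac{\partial \gamma_m}{\partial \tilde{\alpha}} \cdot \frac{\partial \tilde{\alpha}}{\partial \alpha} \approx \frac{1}{M} \sum_{m=1}^{M} \frac{\partial \mathrm{LAT}(\gamma_m)}{\partial \gamma_m} \cdot \frac{\partial \tilde{\alpha}}{\partial \alpha}. \tag{4}$$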
Here, since $\gamma_m$ is the binarization of $\tilde{\alpha}$, we use the
straight-through gradient estimator [2]: the gradient passes straight through
$\gamma_m$, so that $\partial \gamma_m / \partial \tilde{\alpha} \approx I$.
The overall pipeline of our approach is illustrated in Figure 3.
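To make Eq. (4) concrete, the sketch below evaluates the straight-through
estimate for a single edge and a single sample: the gradient with respect to
the sampled code is multiplied by the softmax Jacobian, with the binarization
treated as the identity. The linear predictor with weights w is a hypothetical
stand-in; for such a linear predictor the estimate coincides with the exact
gradient of the expected latency, whereas for a nonlinear LPM it is only an
approximation.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs):
    """J[i][j] = d probs[i] / d logits[j] = probs[i] * (delta_ij - probs[j])."""
    n = len(probs)
    return [[probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(n)]
            for i in range(n)]

def straight_through_grad(alpha_edge, dlat_dgamma):
    """Eq. (4) for one edge and one sample, with d gamma / d alpha_tilde taken as I."""
    jac = softmax_jacobian(softmax(alpha_edge))
    n = len(alpha_edge)
    return [sum(dlat_dgamma[i] * jac[i][j] for i in range(n)) for j in range(n)]

# Hypothetical linear predictor LPM(gamma) = w . gamma, so dLPM/dgamma = w.
w = [1.0, 4.0, 9.0]
alpha_edge = [0.2, -0.5, 0.1]
grad = straight_through_grad(alpha_edge, w)
```

Because the softmax outputs always sum to one, the entries of grad sum to zero:
raising every logit by the same amount leaves the predicted latency unchanged.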
3.3 Training a Latency Prediction Module
It remains to design a latency prediction module (LPM) which outputs a
value of LPM(γm) for each sampled sub-architecture, γm. We present a
learning-based solution for the following reasons. First, we believe that
latency is learnable. In other words, there exist network architecture patterns
that correspond to latency, so that a deep network can learn to predict it
given sufficient training data. Second, latency prediction does not need to be
very accurate: small errors are acceptable (in the experimental section, we
verify that the error of our prediction is sufficiently small, and more
importantly, that small inaccuracy barely harms search performance). Third, we
can easily transplant the learning-based approach to other devices without much
expertise, which eases the deployment of NAS on a wide range of hardware. We
will show an example in Section 4.3.
LPM(γm) is a multi-layer regression network, with the input being an en-
coded sub-architecture and the output being the predicted latency. Through-
out this paper, we only investigate the normal cell and ignore the reduction
cell, because the final reduction cell is often composed of weight-free operators
and contributes little to the network latency. On the other hand, encoding the
reduction cell introduces noise to the latency prediction model.
To encode the sub-architecture, we first recall that each cell of DARTS con-
tains four intermediate nodes with 14 edges and 8 operations on each edge, while
the sub-architecture preserves two edges for each node and only one operation
on each selected edge. We use 14×8 bits to represent each cell: a bit is 1 if it cor-
responds to the chosen operation on a preserved edge, otherwise it is 0. In other
words, only 8 out of 14 × 8 bits are 1. The 112D vector is propagated through
four fully-connected layers with 112, 256, 64 and 1 neurons, respectively, and
the final one is the output (latency). We use sigmoid as the activation function
for each layer, excluding the last one.
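The encoding and the regressor described above can be sketched as follows. The
bit layout (edge index times 8 plus operation index) and the reading of the
layer widths as 112 → 112 → 256 → 64 → 1 are assumptions made for illustration,
and the weights are random placeholders rather than a trained LPM.

```python
import math
import random

NUM_EDGES, NUM_OPS = 14, 8
LAYER_SIZES = [112, 112, 256, 64, 1]  # input width, then the four fully-connected layers

def encode(gamma):
    """gamma maps edge index -> operation index for the 8 preserved edges."""
    code = [0.0] * (NUM_EDGES * NUM_OPS)
    for edge, op in gamma.items():
        code[edge * NUM_OPS + op] = 1.0  # assumed bit layout
    return code

def init_mlp(rng):
    """Random weights, only so that the forward pass can run; not a trained LPM."""
    layers = []
    for fan_in, fan_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers

def forward(layers, code):
    """Sigmoid after every layer except the last, which outputs the latency."""
    h = code
    for index, (weights, bias) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]
        if index < len(layers) - 1:
            h = [1.0 / (1.0 + math.exp(-v)) for v in h]
    return h[0]

# Two preserved edges per intermediate node (edges 0-1, 2-4, 5-8 and 9-13).
gamma = {0: 1, 1: 3, 2: 0, 4: 6, 5: 2, 8: 7, 9: 1, 13: 4}
code = encode(gamma)
latency = forward(init_mlp(random.Random(0)), code)
```

With this layout exactly 8 of the 112 bits are set, matching the description of
the code in the text.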
Data collection. We first collect a dataset of (architecture, latency) pairs.
On an NVIDIA Tesla-P100 GPU (used in all experiments of our work), we
randomly sample 100K architectures from the DARTS space, and evaluate the
latency of each architecture with randomized network weights. For a better trans-
ferability of the searched architectures, the latency is measured under the Im-
ageNet setting with an input image size of 224 × 224 and is an average of 20
measurements. The entire process takes around 9 hours. We also evaluate the
latency for the same set of architectures on an Intel E5-1620 CPU, which takes
24 hours. Though 100K is a small number compared to the entire search space
(there are $1.0 \times 10^{9}$ distinct normal cells), it is enough for the
learning task. We then partition the latency data into two parts: 80K pairs are
used for training and the remaining 20K are held out for testing.
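One way to arrive at a search-space size of this magnitude is the count below.
It assumes that each of the four intermediate nodes keeps two of its 2, 3, 4
and 5 incoming edges and that each of the 8 kept edges takes one of seven
operations (the eight candidates minus `none`); the paper does not spell out
its own counting, so this is a plausibility check only.

```python
from math import comb

incoming_edges = [2, 3, 4, 5]  # candidate incoming edges of the four intermediate nodes
ops_per_edge = 7               # assumption: eight operations minus `none`

# Number of ways to keep two incoming edges at every intermediate node.
edge_choices = 1
for fan_in in incoming_edges:
    edge_choices *= comb(fan_in, 2)

# One operation on each of the 8 kept edges.
num_cells = edge_choices * ops_per_edge ** 8
```

Under these assumptions the count is 180 × 7^8 = 1,037,664,180, i.e. roughly
1.0 × 10^9, so the 100K sampled architectures cover about 0.01% of the space.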
Training and inference. On the 80K training set, the network is trained
from scratch for 1,000 epochs using a batch size of 200. We use momentum
SGD with a fixed learning rate of 0.01, a momentum of 0.9, a weight decay of
$1 \times 10^{-5}$, and a mean squared error (MSE) loss function. We evaluate
LPM using both absolute and relative errors between the prediction and the
ground-truth on the testing set. As shown in Table 1, with an increasing amount
of training data, the testing error goes down accordingly. On the other hand,
the improvement in accuracy becomes marginal when the amount of training data
is larger than 60K. With 80K training data, the latency prediction results are
satisfying, with an absolute error smaller than 1ms on GPU or smaller than 10ms
on CPU, and a relative error smaller than 4% on GPU or smaller than 6% on CPU.
Table 1: Absolute and relative errors of the LPM over the 20K testing
architectures when using different numbers of training architectures. On GPU
and CPU, we sample the same set of 100K architectures and use the same data
split.
                      NVIDIA Tesla-P100 GPU           Intel E5-1620 CPU
Training Data         10K   20K   40K   60K   80K     10K    20K    40K    60K   80K
Absolute Error (ms)   1.79  1.41  1.09  0.84  0.82    20.24  16.64  10.42  8.50  8.27
Relative Error (%)    8.09  6.23  4.79  3.57  3.45    12.13  10.10  7.69   5.79  5.32
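The two error measures reported for the LPM can be sketched as below. The exact
definitions are not given in the text, so mean absolute error and mean absolute
error relative to the measured latency are assumed here, and the latency values
are made up for illustration.

```python
def absolute_error(predicted, measured):
    """Mean absolute difference, in the same unit as the inputs (ms here)."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)

def relative_error(predicted, measured):
    """Mean absolute difference relative to the measured value, in percent."""
    ratios = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up latencies in milliseconds, for illustration only.
measured = [20.0, 25.0, 40.0]
predicted = [21.0, 24.0, 42.0]

abs_err = absolute_error(predicted, measured)
rel_err = relative_error(predicted, measured)
```

For these toy values the absolute error is 4/3 ms and the relative error is
about 4.67%.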
To further show the consistency between the ground-truth and predicted
latency values, we also sample 2K architectures from the testing set and
compute the Kendall-τ coefficient. The τ-value is 0.83 for GPU and 0.75 for
CPU, indicating that 92% and 87% of architecture pairs have the same relative
ranking in the ground-truth and predicted lists. As we shall see in
experiments, such accuracy is sufficient for finding efficient yet powerful
architectures.
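The link between the Kendall-τ coefficient and the fraction of consistently
ranked pairs can be checked with a few lines. Without ties, τ equals the
fraction of concordant pairs minus the fraction of discordant pairs, so the
concordant fraction is (1 + τ) / 2. The helper names and the toy rankings below
are illustrative.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """tau = (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if (x1 - x2) * (y1 - y2) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def concordant_fraction(tau):
    """Without ties, tau = 2p - 1, hence p = (1 + tau) / 2."""
    return (1.0 + tau) / 2.0

# Toy rankings: one swapped pair out of six gives tau = 4/6.
toy_tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])

gpu_fraction = concordant_fraction(0.83)
cpu_fraction = concordant_fraction(0.75)
```

With the reported τ-values this gives about 91.5% for GPU and 87.5% for CPU,
in line with the 92% and 87% quoted above up to rounding.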
3.4 Discussions and Relationship to Prior Works
To the best of our knowledge, this is the first work that introduces a
latency-aware method to a complicated search space. The main difficulty lies in
designing a differentiable loss function for latency prediction, an issue that
does not exist for heuristic search methods. There have been many efforts to
apply latency constraints to heuristic search [28,30,11].
On the other hand, in differentiable architecture search, FBNet [30]
integrated latency into the loss function by constructing a look-up table.
Although this method works well in a chain-style search space, it can fail in
the search space of DARTS due to its much higher complexity. In comparison, our
approach is more capable and is feasible for a wider range of search spaces.
Also, there were efforts [32] to introduce naturally differentiable quantities,
e.g., FLOPs (a linear function of α), into the loss function of differentiable
frameworks. Our approach, in comparison, is more general.
4 Experiments
We evaluate our approach on two standard image classification benchmarks, i.e.,
CIFAR10 and ImageNet, to study several of its important properties. We first
use latency prediction on an NVIDIA Tesla-P100 GPU, and then generalize it
to an Intel E5-1620 CPU.
4.1 Experiments on CIFAR10
Firstly, we evaluate our LA-DNAS on CIFAR10 [15]. The CIFAR10 dataset consists
of 60K colored natural images with 32 × 32 resolution in 10 categories, split
into 50K training and 10K testing images. We use DARTS [19] and PC-DARTS [33]
as our two baseline methods. Following DARTS and PC-DARTS, we use an individual
stage for architecture search and conduct another standalone training process
from scratch to evaluate the optimal architecture obtained in the search phase.
In the search stage, the goal is to determine the best sets of architectural
parameters, namely $\{\alpha^{o}_{i,j}\}$ in DARTS and $\{\alpha^{o}_{i,j}\}$,
$\{\beta_{i,j}\}$ in PC-DARTS, for each edge E(i,j). To this end, the training
set is partitioned into two parts, with the first part used for optimizing
network parameters, e.g., convolutional weights, and the second part used for
optimizing architectural parameters. For fair comparison, the operation space O
remains the same as the convention, which contains 8 choices, i.e.,
sep-conv-3x3, sep-conv-5x5, dil-conv-3x3, dil-conv-5x5, max-pool-3x3,
avg-pool-3x3, skip-connect (identity), and zero (none).
Following DARTS and PC-DARTS, in the search period, the over-parameterized
network is constructed by stacking 8 cells (6 normal cells and 2 reduction
cells; cells of the same type share the same architecture), and each cell
consists of N = 6 nodes. We train the network for 50 epochs, with the initial
number of channels being 16. In the search phase, the network weights are
optimized by momentum SGD, with a batch size of 64 for DARTS and 256 for
PC-DARTS, an initial learning rate of 0.025 for DARTS and 0.1 for PC-DARTS
(annealed down to zero following the cosine schedule without restart), a
momentum of 0.9, and a weight decay of $3 \times 10^{-4}$. We use an Adam
optimizer [14] for architectural parameters, with a fixed learning rate of
$3 \times 10^{-4}$ for DARTS and $6 \times 10^{-4}$ for PC-DARTS, a momentum of
(0.5, 0.999) and a weight decay of $10^{-3}$. For PC-DARTS, we freeze
architectural parameters and only allow network parameters to be tuned in the
first 15 epochs. For P-DARTS [4], we add the proposed module in the last search
stage.
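The learning-rate annealing mentioned above can be sketched with the commonly
used cosine form shown below. The exact formula is not stated in the text, so
this per-epoch variant without restarts or warm-up is an assumption.

```python
import math

def cosine_lr(epoch, total_epochs, initial_lr):
    """Cosine annealing from initial_lr at epoch 0 down to zero at total_epochs."""
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Search-phase settings quoted in the text: 50 epochs, 0.025 (DARTS) or 0.1 (PC-DARTS).
darts_schedule = [cosine_lr(e, 50, 0.025) for e in range(51)]
pc_darts_midpoint = cosine_lr(25, 50, 0.1)
```

The schedule starts at the initial learning rate, passes through half of it at
the midpoint, and reaches zero at the final epoch.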
• Evaluation on CIFAR10
The evaluation scenario simply follows that of DARTS and PC-DARTS. The
evaluation network is built by stacking 20 cells (18 normal cells and 2
reduction cells). The initial number of channels is 36. The entire 50K training
set is used, and the network is trained from scratch for 600 epochs using a
batch size of 128. We use the SGD optimizer with an initial learning rate of
0.025 (annealed down to zero following a cosine schedule without restart), a
momentum of 0.9, a weight decay of $3 \times 10^{-4}$ and gradient norm
clipping at 5. Drop-path with a rate of 0.2 as well as cutout [6] is also
applied for regularization. The balancing coefficient λ is set to 0.2. The GPU
latency on CIFAR10 is measured on one Tesla-P100 GPU with a batch size of 32
(input image size 32 × 32) and is the average of 200 measurements.
We conduct latency-aware architecture search on DARTS, P-DARTS, and
PC-DARTS. As demonstrated in Table 2, LA-DARTS (2nd order) achieves a
2.72% test error with only 2.7M parameters and a latency of 28.4ms on CIFAR10.
To achieve a similar classification performance, the original DARTS (2nd order)
needs 3.3M parameters with a latency of 40.9ms. SNAS [32] can obtain relatively
good latency with a mild FLOPs constraint; however, this constraint leads to
much worse performance. Compared to P-DARTS and PC-DARTS, their latency-aware
variants report nearly the same performance but with roughly 10% and 30%
relative drops in latency, respectively.
Table 2: Comparison with state-of-the-art network architectures on CIFAR10.
Latency is measured on an NVIDIA Tesla-P100 GPU with a batch size of 32 and
an input size of 32 × 32. In latency-aware approaches, training the LPM
requires an additional 0.4 GPU-days.
Architecture                       Test Err.   Params  Latency  Search Cost  Search Method
                                   (%)         (M)     (ms)     (GPU-days)
DenseNet-BC [13]                   3.46        25.6    -        -            manual
NASNet-A [38] + cutout             2.65        3.3     -        1800         RL
AmoebaNet-A [23] + cutout          3.34±0.06   3.2     -        3150         evolution
AmoebaNet-B [23] + cutout          2.55±0.05   2.8     -        3150         evolution
Hierarchical Evolution [18]        3.75±0.12   15.7    -        300          evolution
PNAS [17]                          3.41±0.09   3.2     -        225          SMBO
ENAS [22] + cutout                 2.89        4.6     -        0.5          RL
NAONet-WS [20]                     3.53        3.1     -        0.4          NAO
SNAS (mild) [32] + cutout          2.98        2.9     30.2     1.5          gradient-based
ProxylessNAS [3] + cutout          2.08        -       -        4.0          gradient-based
BayesNAS [36] + cutout             2.81±0.04   3.4     -        0.2          gradient-based
GDAS [7] + cutout                  2.93        3.4     30.6     0.3          gradient-based
DARTS (2nd order) [19] + cutout    2.76±0.09   3.3     40.9     0.3          gradient-based
LA-DARTS (2nd order) + cutout      2.72±0.05   2.7     28.4     0.3+0.4      gradient-based
P-DARTS [4] + cutout               2.50        3.4     40.9     0.3          gradient-based
LA-P-DARTS + cutout                2.52±0.08   3.3     35.8     0.3+0.4      gradient-based
PC-DARTS [33] + cutout             2.57±0.07   3.6     40.7     0.1          gradient-based
LA-PC-DARTS + cutout               2.61±0.10   2.6     27.7     0.1+0.4      gradient-based
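The relative latency reductions quoted for the latency-aware variants follow
directly from the latencies reported in Table 2. The short calculation below
reproduces them; only the helper name relative_drop is introduced here.

```python
def relative_drop(baseline_ms, new_ms):
    """Relative latency reduction in percent."""
    return 100.0 * (baseline_ms - new_ms) / baseline_ms

# (baseline latency, latency-aware latency) in ms, taken from Table 2.
pairs = {
    "DARTS (2nd order)": (40.9, 28.4),
    "P-DARTS": (40.9, 35.8),
    "PC-DARTS": (40.7, 27.7),
}
drops = {name: relative_drop(base, new) for name, (base, new) in pairs.items()}
```

This gives roughly 30.6% for DARTS, 12.5% for P-DARTS and 31.9% for PC-DARTS,
consistent with the approximate 10% and 30% figures stated for P-DARTS and
PC-DARTS.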
• The Impact of the Balancing Coefficient
The balancing coefficient λ is an important factor that controls the impact of
the latency constraint and directly determines the latency of the searched
architectures. To show the impact of λ, we adopt different values of λ to
balance the performance and latency of the searched architectures. In this
experiment, we set PC-DARTS [33] as the baseline method (λ = 0.00) and choose
λ = 0.10, λ = 0.15 and λ = 0.20 to conduct three independent search runs. The
normal cells of the searched architectures and their corresponding latency and
test errors are shown in Figure 4. As λ increases, the latency of the searched
architectures is reduced while the performance remains relatively stable. This
means that our latency optimization can effectively decrease the latency
without noticeably affecting the performance of the searched architectures.
However, if we continue to increase λ beyond 0.2, parameter-free operations
will dominate the searched architectures and much larger test errors are
reported.
[Figure 4 graphics not reproduced. Panels: (a) λ = 0.00, Lat.: 40.7ms,
Err.: 2.57%; (b) λ = 0.10, Lat.: 35.5ms, Err.: 2.64%; (c) λ = 0.15,
Lat.: 31.2ms, Err.: 2.69%; (d) λ = 0.20, Lat.: 27.7ms, Err.: 2.61%.]
Figure 4: The normal cells found on CIFAR10 with different balancing
coefficients. The balancing coefficients λ are 0.00, 0.10, 0.15 and 0.20,
respectively. Latency optimization is added upon PC-DARTS, and λ = 0.00 is the
same as the original PC-DARTS. The latency here is measured on CIFAR10.
• Robustness to Latency Prediction Error
As shown in Section 3.3, the latency prediction module (LPM) still suffers
an absolute error of 0.82ms. We perform additional experiments to demonstrate
that an LPM of such precision is enough to offer a good latency constraint and
that the framework is robust to the latency prediction error. Random noise
drawn from N(0, 0.025) is added to the predicted latency. We compare the
latency of the architectures searched under the LPM constraint with λ = 0.10
and λ = 0.20. As shown in Table 3, with the injected noise, the LPM still
effectively guides the search towards latency-aware architectures under
different balancing coefficients, which shows the robustness of the proposed
LPM and the latency-aware architecture search framework.
Table 3: Left: latency-aware architecture search with added noise. Right:
comparing latency-aware search to FLOPs-aware search. Here, η and λ are the
balancing coefficients for FLOPs-aware architecture search and latency-aware
architecture search, respectively. All numbers (FLOPs and latency) are measured
on CIFAR10 using an NVIDIA Tesla-P100 GPU.
Left:
Methods          λ      Latency   Test Error
LPM w/o noise    0.1    35.5ms    2.64%
LPM w/o noise    0.2    27.7ms    2.61%
LPM w/ noise     0.1    36.1ms    2.69%
LPM w/ noise     0.2    28.3ms    2.72%
Right:
Methods          η/λ    FLOPs     Latency
FLOPs-aware      0.005  533M      29.1ms
FLOPs-aware      0.007  462M      25.2ms
Latency-aware    0.100  551M      28.0ms
Latency-aware    0.200  460M      23.2ms
• Comparison to FLOPs-Aware Architecture Search
To show the effectiveness of latency-aware architecture search, we conduct
FLOPs-aware architecture search as the control group. Unlike the latency of an
architecture, FLOPs depend only on the operations themselves and not on how
they are connected, so it is easy to apply the FLOPs constraint as a
differentiable term. We measure the FLOPs of each operation in the search space
and use a lookup table to compute the overall FLOPs by adding up the FLOPs of
each involved operation. A balancing coefficient η is adopted to balance
performance and FLOPs during search. We conduct two independent FLOPs-aware
architecture searches with η = 0.005 and η = 0.007, and compare the latency of
the discovered architectures with that of the architectures found by
latency-aware architecture search with λ = 0.100 and λ = 0.200. The results
show that, when the searched architectures have comparable FLOPs, the
latency-aware approach discovers architectures with lower latency than the
FLOPs-aware approach.
Table 4: Comparison with state-of-the-art architectures on ImageNet (mobile
setting). Latency is measured on one Tesla-P100 GPU with a batch size of 32
and an input size of 224 × 224. In latency-aware approaches, training the LPM
requires an additional 0.4 GPU-days
Architecture                  Test Err. (%) Params  ×+     Latency  Search Cost  Search Method
                              top-1  top-5  (M)     (M)    (ms)     (GPU-days)
Inception-v1 [27]             30.2   10.1   6.6     1448   -        -            manual
MobileNet [12]                29.4   10.5   4.2     569    -        -            manual
MobileNet 1.4× (v2) [25]      25.3   -      6.9     585    27.7     -            manual
ShuffleNet 2× (v1) [34]       26.4   10.2   ∼5      524    -        -            manual
ShuffleNet 2× (v2) [21]       25.1   -      ∼5      591    -        -            manual
NASNet-A [38]                 26.0   8.4    5.3     564    48.7     1800         RL
NASNet-B [38]                 27.2   8.7    5.3     488    -        1800         RL
NASNet-C [38]                 27.5   9.0    4.9     558    -        1800         RL
AmoebaNet-A [23]              25.5   8.0    5.1     555    -        3150         evolution
AmoebaNet-B [23]              26.0   8.5    5.3     555    -        3150         evolution
AmoebaNet-C [23]              24.3   7.6    6.4     570    -        3150         evolution
PNAS [17]                     25.8   8.1    5.1     588    47.3     225          SMBO
MnasNet-92 [28]               25.2   8.0    4.4     388    -        -            RL
SNAS (mild) [32]              27.3   9.2    4.3     522    23.0     1.5          gradient-based
ProxylessNAS (GPU)‡ [3]       24.9   7.5    7.1     465    -        8.3          gradient-based
BayesNAS [36]                 26.5   8.9    3.9     -      -        0.2          gradient-based
GDAS [7]                      26.0   8.5    5.3     581    32.2     0.3          gradient-based
DARTS (2nd order) [19]        26.7   8.7    4.7     574    28.5     0.3          gradient-based
LA-DARTS                      25.2   8.0    5.1     575    26.2     0.3+0.4      gradient-based
P-DARTS (CIFAR10) [4]         24.4   7.4    4.9     557    29.0     0.3          gradient-based
LA-P-DARTS (CIFAR10)          24.6   7.4    4.8     550    27.1     0.3+0.4      gradient-based
PC-DARTS (CIFAR10) [33]       25.1   7.8    5.3     586    31.7     0.1          gradient-based
LA-PC-DARTS (CIFAR10)         24.9   7.9    5.3     598    26.1     0.1+0.4      gradient-based
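The lookup-table FLOPs term used in the FLOPs-aware control experiment can be
sketched as follows. This is only an illustration under stated assumptions:
the per-operation FLOPs values are made up, the softmax-weighted (DARTS-style)
expectation over candidate operations is our assumption about how the term is
made differentiable, and the names `FLOPS_TABLE`, `expected_flops` and
`flops_aware_loss` are hypothetical.

```python
import math

# Hypothetical per-operation cost in MFLOPs (illustrative values only).
FLOPS_TABLE = {
    "skip_connect": 0.0,
    "avg_pool_3x3": 0.1,
    "sep_conv_3x3": 2.0,
    "sep_conv_5x5": 4.0,
}
OPS = list(FLOPS_TABLE)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_flops(alphas, ops=OPS, table=FLOPS_TABLE):
    """Sum over edges of the softmax-weighted FLOPs of the candidate ops.

    alphas: one list of architecture logits per edge, aligned with `ops`.
    The result is a smooth function of the logits, so it can serve as a
    differentiable penalty.
    """
    total = 0.0
    for edge_logits in alphas:
        weights = softmax(edge_logits)
        total += sum(w * table[op] for w, op in zip(weights, ops))
    return total


def flops_aware_loss(val_loss, alphas, eta):
    """Assumed objective form: validation loss + eta * expected FLOPs."""
    return val_loss + eta * expected_flops(alphas)


if __name__ == "__main__":
    uniform_edge = [0.0, 0.0, 0.0, 0.0]
    print(expected_flops([uniform_edge, uniform_edge]))
    print(flops_aware_loss(0.30, [uniform_edge, uniform_edge], eta=0.005))
```

Note that the penalty only sees which operations are chosen, not how the edges
are wired, which matches the observation that FLOPs are independent of the
connection pattern while measured latency is not.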
4.2 Experiments on ImageNet
ILSVRC2012 [5], a subset of ImageNet, is used to test the transferability of
architectures discovered on CIFAR10. It consists of 1,000 object categories,
with 1.28M training images and 50K validation images for the recognition task.
All images are high-resolution and roughly equally distributed over all classes.
Following the conventions [38,19,33], we apply the mobile setting, where the
input image size is fixed to 224 × 224 and the number of multiply-add
operations does not exceed 600M in the testing stage.
The network evaluated on ILSVRC2012 follows DARTS, P-DARTS, and PC-DARTS: it
starts with three convolution layers of stride 2 that reduce the resolution of
feature maps from the 224 × 224 input to 28 × 28, after which 14 cells (12
normal cells and 2 reduction cells) are stacked. The network is trained from
scratch for 250 epochs using a batch size of 1,024 on 8 Tesla V100 GPUs. The
network parameters are optimized using an SGD optimizer with a momentum of 0.9,
an initial learning rate of 0.5 (decayed linearly to zero), and a weight decay
of 3 × 10⁻⁵. Additional enhancements, including label smoothing and an
auxiliary loss tower, are adopted during training. Learning rate warm-up is
applied for the first 5 epochs. The latency is measured following the same
setting used on CIFAR10.
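The learning-rate schedule described above can be written as a small function.
The sketch below is a minimal illustration, assuming a per-epoch update and a
linear warm-up (the text states that warm-up is used for 5 epochs but not its
shape); the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.5, total_epochs=250, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch.

    Linear decay from base_lr to zero over total_epochs, scaled by an
    assumed linear warm-up factor during the first warmup_epochs epochs.
    """
    decayed = base_lr * (1.0 - epoch / total_epochs)
    if epoch < warmup_epochs:
        return decayed * (epoch + 1) / warmup_epochs
    return decayed


if __name__ == "__main__":
    for epoch in (0, 4, 5, 125, 249):
        print(epoch, round(learning_rate(epoch), 4))
```

With these defaults the rate ramps up over the first five epochs and then
falls linearly, reaching half of the initial value at epoch 125.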
As shown in Table 4, with approximately the same FLOPs, LA-DARTS has a
19% lower latency than the original DARTS. Also, the latency of LA-P-DARTS
and LA-PC-DARTS is 27.1ms and 26.1ms, 7% and 18% lower than the original
version, respectively, while the accuracy of the searched architectures is not
impacted (within an acceptable range of ±0.2%). With a larger search space, we
expect our algorithm to have more room for reducing the network latency.
4.3 Transplanting to CPU
Last but not least, we transplant the proposed pipeline to search for efficient
architectures on an Intel E5-1620 CPU. We use the LPM trained and evaluated
in Section 3.3, which, with 80K training architectures, reports an absolute
error of 8.27ms and a relative error of 5.32% (see Table 1). We use this LPM to
replace the one used in Section 4.1, and adjust the balancing coefficient, λ,
to smaller values since the latency on CPU is often much larger.
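For readers who want to evaluate a latency predictor in the same spirit, the
two error measures quoted above can be computed as in the sketch below. The
exact definitions belong to Section 3.3; here we assume the absolute error is
the mean of |predicted − measured| and the relative error is the mean of
|predicted − measured| / measured. The latencies in the example are made up.

```python
def absolute_error(predicted, measured):
    """Mean absolute difference, in the same units as the latencies."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)


def relative_error(predicted, measured):
    """Mean absolute difference relative to the measured latency."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


if __name__ == "__main__":
    # Hypothetical CPU latencies in milliseconds, for illustration only.
    measured = [100.0, 200.0]
    predicted = [110.0, 190.0]
    print(absolute_error(predicted, measured))
    print(relative_error(predicted, measured))
```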
We use PC-DARTS to search on CIFAR10. With two balancing coefficients,
λ = 0.025 and λ = 0.015, we obtain two architectures denoted by
LA-PC-DARTS-A and LA-PC-DARTS-B, respectively. As shown in Table 5, increasing
λ reduces both the latency and the accuracy of the searched architecture, the
same trend as when searching on GPU. Compared with the original PC-DARTS,
LA-PC-DARTS-B enjoys a nearly 30% advantage in CPU latency while reporting
comparable accuracy. LA-PC-DARTS-A runs 40% faster on CPU with a 0.1%
accuracy drop. We continue evaluating LA-PC-DARTS-B on ILSVRC2012 and
obtain a 25.1% top-1 test error, the same as the original PC-DARTS, yet the
CPU latency (114.1ms on ILSVRC2012) is 30% lower than that of PC-DARTS
(164.1ms). The normal cells of LA-PC-DARTS-A and LA-PC-DARTS-B are
shown in Figure 5.
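The latency reductions quoted in this paragraph follow from the reported
latencies (CIFAR10 values from Table 5, ILSVRC2012 values from the text). A
short worked computation, with the helper name `relative_reduction` being our
own:

```python
def relative_reduction(baseline_ms, new_ms):
    """Fraction by which new_ms is lower than baseline_ms."""
    return (baseline_ms - new_ms) / baseline_ms


# CPU latency on CIFAR10: PC-DARTS vs. LA-PC-DARTS-B and LA-PC-DARTS-A.
cifar_b = relative_reduction(208.3, 148.3)
cifar_a = relative_reduction(208.3, 122.1)
# CPU latency on ILSVRC2012: PC-DARTS vs. LA-PC-DARTS-B.
imagenet_b = relative_reduction(164.1, 114.1)

print(f"{cifar_b:.1%} {cifar_a:.1%} {imagenet_b:.1%}")
# prints: 28.8% 41.4% 30.5%
```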
Table 5: Results of latency-aware search using PC-DARTS on CIFAR10, with the
LPM trained on an NVIDIA Tesla-P100 GPU and an Intel E5-1620 CPU,
respectively. C-Latency and G-Latency are measured on the same CPU and GPU
Architecture             Test Err. (%)   Params (M)   C-Latency (ms)   G-Latency (ms)
PC-DARTS [33]            2.57            3.6          208.3            40.7
LA-PC-DARTS (GPU)        2.61            2.6          136.7            27.7
LA-PC-DARTS-A (CPU)      2.75            2.8          122.1            30.5
LA-PC-DARTS-B (CPU)      2.65            3.2          148.3            37.3
Figure 5: The normal cells found on CIFAR10 with latency-aware search on CPU.
We use PC-DARTS with different balancing coefficients, and λ = 0 leads to the
architecture shown in Figure 8 (c). Panels:
(a) λ = 0.025, Lat.: 122.1ms, Err.: 2.75%
(b) λ = 0.015, Lat.: 148.3ms, Err.: 2.65%
Table 5 also implies that CPU and GPU prefer different architectures. In
particular, the architecture found on GPU is faster than LA-PC-DARTS-A on
GPU, but slower on CPU. To further investigate the difference, we sample 2K
architectures from the testing set of the LPM, and find that the Kendall-τ
coefficient between the ground-truth CPU and GPU latency is only 0.37 (69% of
relative rankings are consistent). We believe such inconsistency is caused by
hardware factors. Fortunately, with our approach, one can obtain efficient
architectures on different devices without knowing much about them: the
coefficient between the predicted and ground-truth latency is 0.83 on GPU (92%
consistent) and 0.75 on CPU (87% consistent), both of which are accurate enough
to find efficient architectures.
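The relation between a Kendall-τ coefficient and the fraction of consistently
ranked pairs can be made explicit. When there are no ties, τ = (C − D) / (C + D)
for C concordant and D discordant pairs, so the concordant fraction is
(1 + τ) / 2. The sketch below is a plain-Python illustration (tau-a, function
names our own); as a toy input it uses the four CPU/GPU latency pairs from
Table 5 rather than the 2K sampled architectures, which are not available
here.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def consistent_fraction(tau):
    """Fraction of concordant pairs implied by tau, assuming no ties."""
    return (1.0 + tau) / 2.0


if __name__ == "__main__":
    # CPU and GPU latencies (ms) of the four architectures in Table 5.
    cpu = [208.3, 136.7, 122.1, 148.3]
    gpu = [40.7, 27.7, 30.5, 37.3]
    print(kendall_tau(cpu, gpu))
    for tau in (0.37, 0.83, 0.75):
        print(tau, consistent_fraction(tau))
```

Applying `consistent_fraction` to 0.37, 0.83 and 0.75 gives 0.685, 0.915 and
0.875, in line with the 69%, 92% and 87% figures quoted in the text.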
5 Conclusions
This paper presented a differentiable method for predicting the latency of an
architecture in a complicated search space, and incorporated this module into
differentiable architecture search. This enables us to control the balance of recog-
nition accuracy and inference speed. We design the latency prediction module
as a multi-layer regression network, and train it by sampling a number of archi-
tectures from the pre-defined search space. Our pipeline is easily transplanted to
a wide range of hardware/software configurations, and helps to design machine-
friendly architectures.
Our work sheds light on future research in this direction. As researchers
continue exploring larger NAS search spaces, it will become increasingly
difficult for non-differentiable search methods to converge in reasonable
search time. A larger search space will also provide more room for optimizing
the latency, as well as other non-differentiable factors such as power
consumption, of the searched architecture. We thus expect more efforts beyond
this preliminary work.
Acknowledgements We thank Longhui Wei, Zhengsu Chen, An Xiao, Lanfei
Wang, and Kaifeng Bi for instructive discussions.
References
1. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures
using reinforcement learning. In: ICLR (2017)
2. Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv preprint
arXiv:1308.3432 (2013)
3. Cai, H., Zhu, L., Han, S.: ProxylessNAS: Direct neural architecture search on target
task and hardware. arXiv preprint arXiv:1812.00332 (2018)
4. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture
search: Bridging the depth gap between search and evaluation. arXiv preprint
arXiv:1904.12760 (2019)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale
hierarchical image database. In: CVPR (2009)
6. DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural net-
works with cutout. arXiv preprint arXiv:1708.04552 (2017)
7. Dong, X., Yang, Y.: Searching for a robust neural architecture in four GPU hours.
In: CVPR. pp. 1761–1770 (2019)
8. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: A survey. J. Mach.
Learn. Res. 20, 55:1–55:21 (2018)
9. Guo, Z., Zhang, X., Mu, H., Heng, W., Liu, Z., Wei, Y., Sun, J.: Single
path one-shot neural architecture search with uniform sampling. arXiv preprint
arXiv:1904.00420 (2019)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
11. Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W.,
Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for MobileNetV3.
arXiv preprint arXiv:1905.02244 (2019)
12. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for
mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
13. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: CVPR (2017)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images.
Tech. rep., Citeseer (2009)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: NIPS (2012)
17. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A.,
Huang, J., Murphy, K.: Progressive neural architecture search. In: ECCV (2018)
18. Liu, H., Simonyan, K., Vinyals, O., Fernando, C., Kavukcuoglu, K.: Hierarchical
representations for efficient architecture search. In: ICLR (2018)
19. Liu, H., Simonyan, K., Yang, Y.: DARTS: Differentiable architecture search. arXiv
preprint arXiv:1806.09055 (2018)
20. Luo, R., Tian, F., Qin, T., Chen, E., Liu, T.Y.: Neural architecture optimization.
In: NeurIPS (2018)
21. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for
efficient CNN architecture design. In: ECCV (2018)
22. Pham, H., Guan, M.Y., Zoph, B., Le, Q.V., Dean, J.: Efficient neural architecture
search via parameter sharing. In: ICML (2018)
23. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image clas-
sifier architecture search. arXiv preprint arXiv:1802.01548 (2018)
24. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Ku-
rakin, A.: Large-scale evolution of image classifiers. In: ICML (2017)
25. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
Inverted residuals and linear bottlenecks. In: CVPR. pp. 4510–4520 (2018)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: ICLR (2015)
27. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR (2015)
28. Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: MnasNet: Platform-aware
neural architecture search for mobile. arXiv preprint arXiv:1807.11626 (2018)
29. Wistuba, M., Rawat, A., Pedapati, T.: A survey on neural architecture search.
arXiv preprint arXiv:1905.01392 (2019)
30. Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia,
Y., Keutzer, K.: FBNet: Hardware-aware efficient ConvNet design via differentiable
neural architecture search. In: CVPR (2018)
31. Xie, L., Yuille, A.: Genetic CNN. In: ICCV (2017)
32. Xie, S., Zheng, H., Liu, C., Lin, L.: SNAS: Stochastic neural architecture search.
arXiv preprint arXiv:1812.09926 (2018)
33. Xu, Y., Xie, L., Zhang, X., Chen, X., Qi, G.J., Tian, Q., Xiong, H.: PC-DARTS:
Partial channel connections for memory-efficient differentiable architecture search.
arXiv preprint arXiv:1907.05737 (2019)
34. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolu-
tional neural network for mobile devices. In: CVPR (2018)
35. Zhong, Z., Yan, J., Wu, W., Shao, J., Liu, C.L.: Practical block-wise neural network
architecture generation. In: CVPR (2018)
36. Zhou, H., Yang, M., Wang, J., Pan, W.: BayesNAS: A Bayesian approach for neural
architecture search (2019)
37. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In:
ICLR (2017)
38. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures
for scalable image recognition. In: CVPR (2018)
A Visualization of Searched Architectures
To help readers reproduce our search results, we attach here all normal and
reduction cells that did not appear in the main article due to the space limit.
A.1 Reduction Cells on CIFAR-10
The reduction cells of architectures found on CIFAR-10 with different balancing
coefficients are shown in Figure 6. The balancing coefficients λ are 0.00, 0.10,
0.15 and 0.20, respectively. Latency optimization is combined with PC-DARTS,
and λ = 0.00 is the same as the original PC-DARTS. The latency is measured
on CIFAR-10.
Figure 6: The corresponding reduction cells found on CIFAR-10 with different
balancing coefficients. The balancing coefficients λ are 0.00, 0.10, 0.15 and 0.20,
respectively. Latency optimization is combined with PC-DARTS and λ = 0.00 is
the same as the original PC-DARTS. The latency here is measured on CIFAR-10.
Panels:
(a) λ = 0.00, Latency: 40.7ms, Err.: 2.57%
(b) λ = 0.10, Latency: 35.5ms, Err.: 2.64%
(c) λ = 0.15, Latency: 31.2ms, Err.: 2.69%
(d) λ = 0.20, Latency: 27.7ms, Err.: 2.61%
A.2 Cells of LA-DARTS, LA-PC-DARTS and LA-P-DARTS
The normal and reduction cells of LA-DARTS, LA-PC-DARTS and LA-P-DARTS
are shown in Figure 7. The balancing coefficient λ is 0.20 for both LA-DARTS
and LA-PC-DARTS and 0.10 for LA-P-DARTS. In addition, the normal and
reduction cells of DARTS (2nd-order) and PC-DARTS are shown in Figure 8.
Figure 7: Normal cells and reduction cells of LA-DARTS (Test error: 2.72%),
LA-PC-DARTS (Test error: 2.61%) and LA-P-DARTS (Test error: 2.52%).
Panels:
(a) The normal cell of LA-DARTS
(b) The reduction cell of LA-DARTS
(c) The normal cell of LA-PC-DARTS
(d) The reduction cell of LA-PC-DARTS
(e) The normal cell of LA-P-DARTS
(f) The reduction cell of LA-P-DARTS
A.3 Cells of LA-PC-DARTS-A and LA-PC-DARTS-B
LA-PC-DARTS-A and LA-PC-DARTS-B are architectures found by CPU-aware search.
Their normal and reduction cells are shown in Figure 9. The balancing
coefficient λ is 0.025 for LA-PC-DARTS-A and 0.015 for LA-PC-DARTS-B.
Figure 8: Normal cells and reduction cells of DARTS (2nd) (Test error: 2.76%)
and PC-DARTS (Test error: 2.57%). Panels:
(a) The normal cell of DARTS (2nd)
(b) The reduction cell of DARTS (2nd)
(c) The normal cell of PC-DARTS
(d) The reduction cell of PC-DARTS
Figure 9: Normal cells and reduction cells of LA-PC-DARTS-A (Test error:
2.75%, CPU latency: 122.1ms) and LA-PC-DARTS-B (Test error: 2.65%, CPU
latency: 148.3ms). Panels:
(a) The normal cell of LA-PC-DARTS-A
(b) The reduction cell of LA-PC-DARTS-A
(c) The normal cell of LA-PC-DARTS-B
(d) The reduction cell of LA-PC-DARTS-B
