Densely Connected Search Space for More Flexible Neural Architecture
  Search by Fang, Jiemin et al.
Jiemin Fang1∗, Yuzhu Sun1∗, Qian Zhang2, Yuan Li2, Wenyu Liu1, Xinggang Wang1
1School of EIC, Huazhong University of Science and Technology 2Horizon Robotics
{jaminfong, yuzhusun, liuwy, xgwang}@hust.edu.cn
{qian01.zhang, yuan.li}@horizon.ai
Abstract
In recent years, neural architecture search (NAS) has dramatically advanced the development
of neural network design. While most previous works are computationally
intensive, differentiable NAS methods reduce the search cost by constructing a
super network in a continuous space covering all possible architectures to search
for. However, few of them can search for the network width (the number of filters /
channels) because it is intractable to integrate architectures with different widths
into one super network following conventional differentiable NAS paradigm. In
this paper, we propose a novel differentiable NAS method which can search for
the width and the spatial resolution of each block simultaneously. We achieve
this by constructing a densely connected search space and name our method as
DenseNAS. Blocks with different width and spatial resolution combinations are
densely connected to each other. The best path in the super network is selected
by optimizing the transition probabilities between blocks. As a result the overall
depth distribution of the network is optimized globally in a graceful manner. In
the experiments, DenseNAS obtains an architecture with 75.9% top-1 accuracy on
ImageNet and the latency is as low as 24.3ms on a single TITAN-XP. The total
search time is merely 23 hours on 4 GPUs. †
1 Introduction
Designing deep neural networks has been an important topic for deep learning. Better designed
network architectures usually lead to significant performance improvement. In recent years, neural
architecture search (NAS) [32, 33, 25, 26] has demonstrated success in designing neural architectures
automatically. Many architectures produced by NAS methods have achieved higher accuracy than
those manually designed in tasks such as image classification, semantic segmentation and object
detection. NAS methods not only boost the model performance, but also liberate human experts from
the tedious architecture tweaking work.
The more elements in the architecture design process can be searched automatically, the less burden
human experts bear. What elements can be searched for depends on how the search space is
constructed. In most previous works, there are two main kinds of search space. One is to repeat
the cell structure to construct the network [33, 25, 26, 20] and search for the topological connection
between different nodes in each cell. The other one [29, 4, 30, 9] stacks the mobile convolution blocks
[27] with more succinct connections in the network. How to best search among operation types and
connection patterns is widely explored in many previous works. Searching for network scale (width
and depth) is less straightforward. While Reinforcement Learning (RL) [33, 29] and Evolutionary
Algorithm (EA) [26] based NAS methods can easily search for the depth and width due to their ability
∗The work was done during an internship at Horizon Robotics.
†The related code is available at https://github.com/JaminFong/DenseNAS
Preprint. Under review.
arXiv:1906.09607v1 [cs.CV] 23 Jun 2019
to handle a discrete space, they are extremely computationally expensive. Differentiable [20, 4, 31]
and one-shot [3, 1] methods produce high-performance architectures with much less search cost, but
network scale search is intractable for these methods. Their search process relies on a super network
which covers all the possible sub-architectures in the search space. Searching for the network scale
requires integrating architectures with different scales into the super network. While the depth search
(the number of total layers) of the architecture can be handled by equipping the layer in the super
network with the identity connection as a candidate operation [4, 30], searching for the width is more
difficult to deal with, because once the number of output channels in a layer changes, the number of
input channels in the following layer needs to be changed accordingly. Therefore designing a super
network supporting width search remains a challenging problem.
Yet the scale of a network is too crucial to be left out of the NAS process and tuned
manually in an ad hoc manner. Inappropriate width or depth choices
cause drastic accuracy degradation or unsatisfactory model size. In particular, even slight changes
to the width of the architecture can sharply increase the model size. In this paper,
we aim at solving the width search problem by developing a novel differentiable NAS method:
DenseNAS. Our solution is to construct a new densely connected search space and design a super
network to be a continuous representation of the search space. As shown in Fig. 1, each block in
the super network is connected to several subsequent blocks. From the beginning to the end of the
network, the number of filters (i.e., width) of each block increases gradually with a small stride. The
fine-grained width distribution in the network guarantees that the search space can cover as many
width values as possible.
In the search space, there are multiple blocks with several widths under the same spatial resolution
setting. We relax the search space as we assign a probability parameter to each output path of each
block. During the search process, the probability distribution of output paths is optimized. The best
width growing path in the super network is selected using this probability distribution to derive the
final architecture. Because the spatial resolution in each block is associated with the width, the block
widths and the layers to carry out spatial down-sampling are optimized and determined at the same
time.
In summary, starting from solving the network width searching problem in differentiable NAS, we
propose a new densely connected search space. The novel search space design even enables flexible
architecture search beyond network widths, e.g. the number of blocks, the layers to do spatial
down-sampling, etc. As a result the overall distribution of depths of the whole network is globally
and automatically optimized. With DenseNAS, we obtain an architecture with 75.9% top-1 accuracy
on ImageNet with a low latency on the GPU device (24.3 ms on one TITAN-XP). The search cost is
only 92 GPU hours, 23 hours on 4 GPUs.
[Figure 1 diagram: a chain of densely connected blocks with width labels Co, Co+c, Co+2c, Co+3c, Co+4c, Co+5c, ..., Co+nc]
Figure 1: The search space with densely connected blocks. The width in each block increases
gradually with a small stride. Each block is connected to several subsequent blocks. Under one
spatial resolution, there are multiple blocks with various widths. We search for the width growing
path in the super network and the positions of spatial operations are determined simultaneously.
2 Related Work
Differentiable NAS Methods Recently, the emergence of differentiable NAS methods greatly
reduces the search cost while achieving superior results. DARTS [20] is the first work to utilize the
gradient-based method to search neural architectures. They construct a super network and relax the
architecture representation by assigning continuous weights to the candidate operations. They search
on a small dataset, e.g., CIFAR-10 [15], and then transfer the architecture to a large dataset, e.g.,
[Figure 2 diagram: two panels, each showing Input channels → 1x1 Conv → ReLU6 → Depthwise kxk Conv → ReLU6 → 1x1 Conv → Output channels; the second panel is labelled stride=2]
Figure 2: The MBConv structures used in the candidate operations. Below: normal block. Upper: reduction
block.
Op Type        Kernel (k)   Expansion (t)
mbconv_k3e3    3×3          3
mbconv_k3e6    3×3          6
mbconv_k5e3    5×5          3
mbconv_k5e6    5×5          6
mbconv_k7e3    7×7          3
mbconv_k7e6    7×7          6
skip           -            -
Table 1: Candidate operations for the layer in the search space.
ImageNet [8]. ProxylessNAS [4] reduces the memory consumption by adopting a dropping path
strategy. They only select a set of paths in the super network to update during the search. They carry
out search directly on the large-scale dataset, i.e., ImageNet. FBNet [30] searches on a subset of
ImageNet and uses the Gumbel Softmax function [14, 22] to better optimize the distribution of architecture
probabilities. Although the differentiable NAS methods mentioned above achieve remarkable results,
the width of the architecture is manually set. It is challenging to integrate architectures with multiple
widths into the super network. Hence adjusting the width of the network still requires many trials by
experienced engineers.
Search Space Design NASNet [33] is the first work that proposes the cell structure to construct the
search space. They search for the operation types and the topological connection in the cell and repeat
the cell to form the whole architecture. The depth of the architecture (i.e., the number of repetitions of
the cell), the widths and the occurrences of down-sampling operations are all set by hand. Afterwards,
many works [19, 25, 26, 20] adopt a similar cell-based search space. MnasNet [29] uses a block-wise
search space. ProxylessNAS [4], FBNet [30] and ChamNet [7] simplify the search space by searching
mostly for the expansion ratios and kernel sizes of the mobile inverted bottleneck convolution (i.e.
MBConv) [27] layers. Auto-DeepLab [18] creatively designs a two-level hierarchical search space for
a segmentation network. The search space is also based on the cell structure and contains complicated
operations on the spatial resolution. Our work is also fundamentally different from DenseNet [13].
Even though the blocks in our super net are densely connected, only one path will be selected to
derive the final architecture which contains no densely connected blocks, as shown in Fig. 4.
3 Method
In this work, we use the differentiable neural architecture search method [20, 31, 4, 30] to solve the
architecture design problem. In this section, we first introduce how to design a search space motivated
by the width search problem. Secondly, we demonstrate the method of relaxing the search space into
a continuous representation. Finally, we explain our search algorithm.
3.1 Densely Connected Search Space
Considering the cell-based search space [33, 19, 20] usually leads to complicated architectures
which are not latency-friendly, we design the search space based on the mobile inverted bottleneck
convolution (i.e. MBConv) proposed in MobileNetV2 [27]. As shown in Fig. 3, we define the search
space on three different levels (the layer, the block and the network). At the layer level, each layer
consists of all the candidate operations. For the block level, one block can be separated into two
components: the head layers and the stacking layers. For the network level, the whole network is
constructed using blocks with incremental widths. Next we will describe the layer, the block and the
network structures in detail.
3.1.1 The Structure of a Layer
We define the layer to be the elementary structure in our search space. One layer represents a
set of candidate operations. The candidate operations are defined as a set of MBConv layers (as
shown in Fig. 2) with kernel sizes of {3, 5, 7} and expansion ratios of {3, 6}. We include the skip
connection as a candidate operation for the depth search. If the skip connection is chosen, it means
the corresponding layer is removed from the resulting architecture, effectively reducing its depth.
The set of operations in our search space is shown in Tab. 1.
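As a minimal illustration of the layer-level candidate set, the names in Tab. 1 can be enumerated from the kernel sizes and expansion ratios. The function name `candidate_ops` and the `include_skip` flag are hypothetical and are not taken from the released code.

```python
from itertools import product

KERNEL_SIZES = (3, 5, 7)        # kernel sizes listed in Sec. 3.1.1
EXPANSION_RATIOS = (3, 6)       # expansion ratios listed in Sec. 3.1.1


def candidate_ops(include_skip=True):
    """Return operation names following the naming style of Tab. 1.

    `include_skip` is a hypothetical switch: setting it to False gives a
    candidate set without the skip connection.
    """
    ops = [f"mbconv_k{k}e{t}" for k, t in product(KERNEL_SIZES, EXPANSION_RATIOS)]
    if include_skip:
        ops.append("skip")
    return ops


all_ops = candidate_ops()                       # six MBConv variants plus skip
mbconv_only = candidate_ops(include_skip=False)
```

Choosing `skip` in a layer corresponds to removing that layer, which is how the depth is searched.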
3.1.2 The Structure of a Block
Each block is composed of several layers. We divide the block into two parts, the head layers and the
stacking layers. We set a fixed width and a corresponding spatial resolution for one block. For the
head layers, they take input tensors with various numbers of channels and spatial resolutions. The
head layers transform all the input tensors to one tensor with the set number of channels and spatial
resolution. The head layers exclude the skip connection because a head layer is required in every block. Following the
head layers are a number of stacking layers (in our case, three). The operations in the stacking layers
are carried out with the same number of channels and spatial resolution.
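[Figure 3 diagram. Bottom: Input 3×224×224, then densely connected blocks grouped into Stage1 to Stage5 with spatial resolutions 112×112, 56×56, 28×28, 14×14 and 7×7, followed by Pooling and FC; block width labels include 16, 24, 32, 40, 48, 64, 80, 96, 112, 128, 144, 160, 224, 288, 352, 416 and 480. Upper left: a block whose head layers take inputs of shapes Cn-1×Hn-1×Wn-1, Cn-2×Hn-2×Wn-2, ..., Cn-m×Hn-m×Wn-m and whose stacking layers operate at Cn×Hn×Wn. Upper right: a layer with candidate operations k3e3, k3e6, k5e3, ..., Skip.]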
Figure 3: We define our search space on three levels. Bottom: The whole network is constructed by
densely connected blocks. Different colors of blocks represent different stages. Each path among
blocks represents one candidate network architecture. Upper left: A block structure contains the
head layers and the stacking layers. The head layers take multiple tensors from the outputs of the
predecessor blocks and output tensors with a consistent number of channels and shape, which the
stacking layers operate on. Upper right: A layer is a set of candidate operations.
3.1.3 The Structure of the Network
Previous works [29, 4, 30] that use MBConv-based blocks to form the architecture set a fixed number
of blocks, and the resulting architecture contains all the blocks. We instead design more blocks with
various widths in our search space, and allow the searched architecture to contain only a subset of the
blocks, giving the search algorithm the freedom to choose blocks of certain widths while discarding
others.
We define the whole super network architecture as Arch and assume that it includes N blocks:
Arch = {B1, B2, ..., BN}. As Fig. 3 shows, we partition the entire network into several stages.
Each stage holds a different range of widths and a fixed spatial resolution. From the beginning to the
end of the super network, the width of the blocks grows gradually with a small stride. In the early
stage of the network, we set a smaller growing stride for the width because large width setting in the
early network stage will cause huge computational cost. The growing stride becomes larger as we
move to later stages. As shown in Fig. 3, in stage 3, the spatial resolution is set as 28× 28 and the
width growing stride is 8. The width growing stride changes to 16 in stage 4 and 64 in stage 5. This
design of the super network allows searching over different widths in each block, differentiating our
approach from all existing ones.
Each block in the super network connects to m subsequent blocks. We define the connection between
the block Bi and its subsequent block Bj (j > i) as Cij . The spatial resolutions of Bi and Bj are
Hi × Wi (normally Hi = Wi) and Hj × Wj respectively. We constrain connections to only
exist between blocks whose spatial resolutions differ by no more than a factor of two. Therefore, Cij
exists when j − i ≤ m and Hi/Hj ≤ 2. The search space is constructed based on the densely
connected blocks. However, only one path will be selected to derive the final architecture. In our
method, not only the number of layers in each block but also the block widths and the number
of blocks are searched for. The layers to carry out spatial down-sampling are determined in the
meantime. Our goal is to find a good path in the search space which represents the best depth and
width configuration of the architecture.
The operations of the first two layers in the network are fixed. The remaining layers are all searched for by
our method. Inspired by MobileNetV2 [27], the first layer is set as a plain convolution which outputs
the tensor with the shape of 16× 112× 112. The second layer is a MBConv with the kernel size of
3× 3 and the expansion ratio of 1, which outputs a 24× 56× 56 tensor. The number of the blocks
and the distribution of widths in the search space can all be optimized using the loss function. Our
design of the search space (i.e. the super network) is illustrated in Fig. 3.
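The connection rule of the super network can be sketched as follows. This is an illustrative reading rather than the released implementation: the function `build_connections` and the toy resolutions are hypothetical, and the factor-of-two constraint is interpreted as the resolution of the earlier block being at most twice that of the later block, since resolution only shrinks along the network.

```python
def build_connections(resolutions, m):
    """Return all (i, j) pairs for which a connection C_ij exists.

    resolutions[i] is the spatial size H_i of block B_i (assuming H_i = W_i).
    A block connects to at most m subsequent blocks, and only when the two
    resolutions differ by no more than a factor of two.
    """
    connections = []
    n = len(resolutions)
    for i in range(n):
        last = min(i + m, n - 1)
        for j in range(i + 1, last + 1):
            # Resolution is non-increasing downstream, so compare H_i to H_j.
            if resolutions[i] / resolutions[j] <= 2:
                connections.append((i, j))
    return connections


# Hypothetical toy super network with six blocks.
toy_resolutions = [56, 56, 28, 28, 14, 14]
toy_connections = build_connections(toy_resolutions, m=3)
```

In this toy setting block 1 (resolution 56) is within reach of block 4 (resolution 14) by index, but the pair is excluded because the resolutions differ by a factor of four.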
3.2 Continuous Relaxation of Search Space
As we construct the search space, we relax the architectures into continuous representations. The re-
laxation is implemented on the layer and block level. After relaxation, we can search for architectures
via back propagation.
3.2.1 Relaxation on the Layer Level
Let O be the set of candidate operations described in 3.1.1. We assign an architecture parameter
$\alpha^\ell_o$ to the candidate operation $o \in O$ in layer $\ell$. We relax the layer by making it a weighted sum
of candidate operations. Each architecture weight of the operation is computed as a softmax of the
architecture parameter over all the operations in the layer:
$$w^\ell_o = \frac{\exp(\alpha^\ell_o)}{\sum_{o' \in O} \exp(\alpha^\ell_{o'})}. \tag{1}$$
The output of layer $\ell$ can be expressed as
$$x_{\ell+1} = \sum_{o \in O} w^\ell_o \cdot o(x_\ell). \tag{2}$$
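A minimal numerical sketch of Eqs. (1) and (2) is given here, with scalars standing in for tensors and simple functions standing in for the MBConv candidates. The names `softmax` and `mixed_layer` and the toy operations are assumptions made for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats, as in Eq. (1)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_layer(x, ops, alphas):
    """Eq. (2): weighted sum of all candidate operations applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Toy candidates; the last one plays the role of the skip connection.
toy_ops = [lambda x: 2.0 * x, lambda x: x + 1.0, lambda x: x]
uniform_out = mixed_layer(3.0, toy_ops, alphas=[0.0, 0.0, 0.0])
```

With equal architecture parameters every candidate receives weight 1/3, so the output is the plain average of the three candidate outputs.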
3.2.2 Relaxation on the Block Level
We set bi to be the output tensor of the ith block Bi. As described in Sec. 3.1.3, each block connects
into m subsequent blocks. To relax the block connections into a continuous representation, we assign
each output path of the block a block-level architecture parameter. Namely for the path from block
Bi to Bj , the path between them has a parameter βij . Similar to how we compute the weight of each
operation above, we compute the probability of each path using a softmax function over all m output
paths of block Bi:
$$p_{ij} = \frac{\exp(\beta_{ij})}{\sum_{k=1}^{m} \exp(\beta_{ik})}, \tag{3}$$
where k indexes the m output paths of Bi.
For block Bi, we assume it also takes m′ input tensors from the predecessor blocks, which are Bi−m′ ,
Bi−m′+1, Bi−m′+2 ... Bi−1. As shown in Fig. 3, the input tensors from these blocks differ in number
of channels and spatial resolution. Therefore each of the input tensor is transformed by a head layer
in Bi to a uniform size and then summed together. Let Hik denote the transformation applied to input
tensor from Bi−k by the kth head layer in block Bi, where k = 1..m′. The sum of the transformed
input tensors can be computed by:
$$x_i = \sum_{k=1}^{m'} p_{i-k,i} \cdot H_{ik}(x_{i-k}). \tag{4}$$
It is worth noting that the path probabilities are normalized on the block output dimension but applied
on the block input dimension (more specifically on the head layers). The head layer is essentially
a weighted-sum mixture of the candidate operations. The layer-level parameter α controls which
operation to be selected, while the outer block-level parameter β determines which block to connect.
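The interplay of Eqs. (3) and (4) can be illustrated with a small sketch in which scalars stand in for tensors and identity functions stand in for the head layers. All names (`path_probabilities`, `block_input`, the toy dictionaries) are hypothetical.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def path_probabilities(betas):
    """Eq. (3): betas[i][j] is beta_ij; normalize over the outputs of block i."""
    probs = {}
    for i, out_betas in betas.items():
        targets = sorted(out_betas)
        p = softmax([out_betas[j] for j in targets])
        probs[i] = dict(zip(targets, p))
    return probs


def block_input(j, outputs, probs, head_layers):
    """Eq. (4): sum of head-layer outputs weighted by incoming path probabilities."""
    total = 0.0
    for i, out_probs in probs.items():
        if j in out_probs:
            total += out_probs[j] * head_layers[(i, j)](outputs[i])
    return total


toy_betas = {0: {1: 0.0, 2: 0.0}, 1: {2: 0.0}}      # block 0 feeds 1 and 2
toy_probs = path_probabilities(toy_betas)
toy_heads = {(0, 1): lambda x: x, (0, 2): lambda x: x, (1, 2): lambda x: x}
x2 = block_input(2, {0: 4.0, 1: 10.0}, toy_probs, toy_heads)
```

In this toy example the weights entering block 2 are 0.5 and 1.0, which do not sum to one. This reflects the remark above that probabilities are normalized over the outputs of each block but applied on the input side.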
3.3 Search Algorithm
3.3.1 Search Procedure
Benefiting from the continuously relaxed representation of the search space, we can search for the
architecture by updating the architecture parameters (introduced in 3.2) using gradient descent. We
find that in the beginning of the search, all the weights of the operations are under-trained. The
operations or architectures which converge faster are more likely to be strengthened, which leads
to shallow architectures. And the distribution of architecture parameters in the preliminary training
stage has a great impact on the later stage training. To tackle this, we split our search procedure
into two stages. In the first stage, we only train the weights of the super network for enough epochs
to get operations sufficiently trained until the accuracy of the model is not too low. In the second
stage, we activate the architecture optimization. We optimize the operation weights by descending
$\nabla_w \mathcal{L}_{train}(w, \alpha, \beta)$ on the training set, and optimize the architecture parameters by descending
$\nabla_{\alpha,\beta} \mathcal{L}_{val}(w, \alpha, \beta)$ on the validation set. We alternate the optimization process of weights and
architecture by epoch.
When the search procedure is finished, we derive the architecture based on the architecture parameters.
At the layer level, we select the candidate operation with the maximum architecture weight, i.e.,
$\arg\max_{o \in O} \alpha^\ell_o$. At the network level, we use the Viterbi algorithm [10] to select the path connecting the
blocks with the highest total transition probability based on the output path probability of each block.
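A sketch of the network-level derivation step is given here as a Viterbi-style dynamic program over the block graph. It assumes that the total transition probability of a path is the product of its path probabilities, and that the path runs from the first to the last block; the function `best_path` and the toy probabilities are hypothetical.

```python
import math


def best_path(probs, num_blocks):
    """Return (path, probability) of the most probable route from block 0
    to block num_blocks - 1, where probs[i][j] = p_ij for j > i."""
    best_log = [-math.inf] * num_blocks   # best log-probability to reach each block
    prev = [None] * num_blocks            # back-pointers for path recovery
    best_log[0] = 0.0
    for j in range(1, num_blocks):
        for i in range(j):
            p = probs.get(i, {}).get(j, 0.0)
            if p > 0.0 and best_log[i] + math.log(p) > best_log[j]:
                best_log[j] = best_log[i] + math.log(p)
                prev[j] = i
    # Follow the back-pointers from the last block to the first.
    path = [num_blocks - 1]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(best_log[-1])


toy_probs = {0: {1: 0.6, 2: 0.4}, 1: {2: 0.3, 3: 0.7}, 2: {3: 1.0}}
toy_path, toy_prob = best_path(toy_probs, num_blocks=4)
```

Here the route 0 → 1 → 3 (probability 0.42) beats 0 → 2 → 3 (0.40) and 0 → 1 → 2 → 3 (0.18), so block 2 is dropped from the derived architecture.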
3.3.2 Multi-objective Optimization
Similar to [4, 30], we integrate multi-objective optimization into the search process. Taking latency
as an example, we create a lookup table which records the latency of each operation. The latency of
each module is measured separately on the target device. For a stacking layer, the latency is computed
as:
$$\mathrm{latency}^\ell = \sum_{o \in O} w^\ell_o \cdot \mathrm{latency}^\ell_o, \tag{5}$$
where $\mathrm{latency}^\ell_o$ refers to the pre-measured latency of operation $o \in O$ in layer $\ell$. For a head layer
of block i that takes its input tensor from the output of block j, the latency is estimated as:
$$\mathrm{latency}^\ell = p_{j,i} \cdot \Big( \sum_{o \in O} w^\ell_o \cdot \mathrm{latency}^\ell_o \Big), \tag{6}$$
where $p_{j,i}$ is the weight of the connection from block j to block i. The latency of the whole
architecture can thus be obtained by:
$$\mathrm{latency} = \sum_{\ell} \mathrm{latency}^\ell. \tag{7}$$
We design a loss function with the latency-based regularization to achieve the multi-objective
optimization:
$$\mathcal{L}(w, \alpha, \beta) = \mathcal{L}_{CE} + \lambda \log_\tau \mathrm{latency}, \tag{8}$$
where λ and τ are the hyper-parameters to control the magnitude of the latency term.
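The latency estimate and the regularized loss of Eqs. (5) to (8) can be sketched with plain floats as follows. The function names and the toy numbers are assumptions; in the actual search these quantities would be differentiable tensors.

```python
import math


def expected_layer_latency(weights, latencies, path_prob=1.0):
    """Eq. (5) for a stacking layer; passing path_prob = p_ji gives Eq. (6)
    for a head layer fed by block j."""
    return path_prob * sum(w * t for w, t in zip(weights, latencies))


def total_loss(ce_loss, layer_latencies, lam, tau):
    """Eq. (7) sums the per-layer estimates; Eq. (8) adds lam * log_tau(latency)."""
    latency = sum(layer_latencies)
    return ce_loss + lam * math.log(latency, tau)


# Toy values in milliseconds (hypothetical, not measured).
stack_latency = expected_layer_latency([0.5, 0.5], [1.0, 3.0])                 # 2.0
head_latency = expected_layer_latency([1.0, 0.0], [4.0, 8.0], path_prob=0.5)   # 2.0
toy_loss = total_loss(1.0, [stack_latency, head_latency], lam=0.25, tau=2.0)
```

Because the latency enters through a logarithm, doubling the estimated latency adds a constant amount to the loss, whose size is set by λ and τ.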
3.3.3 Search Acceleration
The super network includes all the possible paths and operations in the search space. To decrease the
memory consumption and accelerate the search process, we adopt the drop-path training strategy. The
one-shot search method [1] drops out some paths when training the super network. This technique
makes the performance prediction of the stand-alone model more accurate. In this work, when
training the weights of operations, we sample one path of the candidate operations according to
the architecture weight distribution $\{w^\ell_o \mid o \in O\}$ in each layer. The drop-path training not only
accelerates the search but also weakens the coupling effect between operation weights for different
architectures in the search space. Following ProxylessNAS [4], we sample two operations in each
layer according to the architecture weight distribution to update the architecture parameters. To keep
the architecture weights of the operations not sampled unchanged, we compute a re-balancing bias to
adjust the sampled and newly updated parameters:
$$\mathrm{bias}_s = \ln \frac{\sum_{o \in O_s} \exp(\alpha^\ell_o)}{\sum_{o \in O_s} \exp(\alpha'^\ell_o)}, \tag{9}$$
where $O_s$ refers to the set of sampled operations, $\alpha^\ell_o$ refers to the originally sampled architecture
parameter in layer $\ell$ and $\alpha'^\ell_o$ refers to the updated architecture parameter.
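The effect of the re-balancing bias in Eq. (9) can be checked numerically. The sketch assumes that the bias is added to the updated parameters of the sampled operations; under that assumption the softmax weights of the operations that were not sampled stay exactly as before the update. The function `rebalance` and the toy values are hypothetical.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rebalance(alphas_old, alphas_new, sampled):
    """Compute the bias of Eq. (9) and add it to the updated sampled parameters."""
    num = sum(math.exp(alphas_old[o]) for o in sampled)
    den = sum(math.exp(alphas_new[o]) for o in sampled)
    bias = math.log(num / den)
    balanced = list(alphas_new)
    for o in sampled:
        balanced[o] += bias
    return balanced, bias


old = [0.1, -0.2, 0.3, 0.0]
sampled = [0, 2]                      # indices of the two sampled operations
new = [0.5, -0.2, 0.1, 0.0]           # only the sampled entries were updated
balanced, bias = rebalance(old, new, sampled)
w_before, w_after = softmax(old), softmax(balanced)
```

The bias restores the summed exponentials of the sampled operations, so the softmax denominator is unchanged and only the relative weights inside the sampled set move.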
4 Experiments
To demonstrate the effectiveness of our proposed method, we apply it to the ImageNet classification
problem [8] to search for an architecture with high accuracy and low latency.
4.1 Implementation Details
Before the search process, we build a lookup table for the module latency of the super network as
described in 3.3.2. We set the input shape as (3, 224, 224) with the batch size of 32 for the network.
Each module of the network is measured on one TITAN-XP for 2000 times and the average latency is
recorded. All models and experiments are implemented using PyTorch [24].
For the search process, we randomly choose 100 classes from the original 1000 classes of the
ImageNet training set. We sample 20% data in each class of the ImageNet subset to form the
validation dataset. The remaining data is used for training. The original validation dataset of
ImageNet is only used for testing our final searched architecture. The search process runs 250 epochs
in total. In the first search stage, we only train the operation weights in the super network for 150
epochs on the divided training dataset. Only one path of the mixed operation is sampled in each step
to update operation weights. For the last 100 epochs, the updating of architecture parameters (α, β)
and operation weights (w) alternates for each epoch. For the training data preprocessing, we use the
standard GoogleNet [28] data augmentation. We set the batch size to 352 on 4 GPUs. We use the
SGD optimizer with 0.9 momentum and $4 \times 10^{-5}$ weight decay to update the operation weights. The
learning rate decays from 0.2 to $10^{-4}$ with the cosine annealing schedule [21]. We use the Adam
optimizer [2] with $10^{-3}$ weight decay, β = (0.5, 0.999) and fixed learning rate of $3 \times 10^{-4}$ to update
the architecture parameters.
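For reference, a cosine annealing schedule without restarts can be written in a few lines. The sketch assumes a single annealing cycle over `total_steps` and uses the search-stage endpoints stated above (0.2 and 10^-4) as defaults; whether the schedule is stepped per iteration or per epoch is not specified here and is left to the caller.

```python
import math


def cosine_lr(step, total_steps, lr_max=0.2, lr_min=1e-4):
    """Learning rate after `step` of `total_steps`, decaying from lr_max to lr_min."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


# Example: one value per epoch over a hypothetical 250-epoch run.
schedule = [cosine_lr(epoch, 250) for epoch in range(251)]
```

The rate starts at `lr_max`, passes through the midpoint of the two endpoints halfway through, and reaches `lr_min` at the final step.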
For retraining the final derived architecture, we use the same data augmentation strategy as the search
process on the whole ImageNet dataset. We train the model for 240 epochs with a batch size of 1024
on 8 GPUs. The optimizer is SGD with 0.9 momentum and $4 \times 10^{-5}$ weight decay. The 0.1-weighted
label smoothing is used both in the search and retraining process. The learning rate decays from 0.5
to $1 \times 10^{-4}$ with the cosine annealing schedule.
Table 2: Our results on ImageNet classification compared with other methods. Our models achieve
higher accuracy with lower latency. The search cost of DenseNAS is less than other methods in
terms of GPU hours. For the GPU latency, we measure all the models with the same setup (on one
TITAN-XP with batch size of 32).
Model                   #Params (M)   #FLOPs (M)   GPU Latency   Top-1/Top-5 Acc (%)   Search Time (GPU hours)
1.0-MobileNetV1 [12]    4.2           575          16.8ms        70.6 / 89.5           -
1.0-MobileNetV2 [27]    3.4           300          19.5ms        72.0 / -              -
1.4-MobileNetV2 [27]    6.9           585          28.0ms        74.7 / -              -
NASNet-A [33]           5.3           564          -             74.0 / 91.6           48K
AmoebaNet-A [26]        5.1           555          -             74.5 / 92.0           76K
MnasNet [29]            4.2           317          19.7ms        74.0 / 91.8           91K
DARTS [20]              4.7           574          36.0ms        73.3 / 91.3           96
FBNet-B [30]            4.5           295          18.9ms        74.1 / -              216
FBNet-C [30]            5.5           375          22.1ms        74.9 / -              216
Proxyless(GPU) [4]      7.1           465          22.1ms        75.1 / 92.5           200
Proxyless(mobile) [4]   4.1           320          21.3ms        74.6 / 92.2           200
DenseNAS-A              7.9           501          24.3ms        75.9 / 92.6           92
DenseNAS-B              6.9           414          21.1ms        74.7 / 92.0           92
DenseNAS-C              6.7           383          19.2ms        74.2 / 91.8           92
4.2 Experimental Results
Our ImageNet results are shown in Tab. 2. We set the GPU latency as our secondary optimization
objective. Our models achieve superior accuracy with low latency. They significantly outperform the
[Figure 4 diagram: panels labelled DenseNAS-A, DenseNAS-B and DenseNAS-C; visible layer labels include Conv 3×3, MB6 3×3 and MB6 7×7]
Figure 4: The best architectures obtained by DenseNAS. Blocks in the network are separated by lines.
ones designed manually in terms of accuracy. DenseNAS-A achieves 75.9% top-1 accuracy, better
than 1.4-MobileNetV2 (+1.2%) with lower latency (-3.7ms, relative 15.2%). Comparing DenseNAS-
B with NASNet-A [33], AmoebaNet-A [26] and DARTS [20], we achieve higher accuracy with a lot
fewer FLOPs and lower latency. Compared with other NAS methods, DenseNAS achieves superior
accuracy with the similar latency, yet the whole search process takes about only 23 hours on 4 GPUs
(92 GPU hours in total), which is 522× faster than NASNet, 826× faster than AmoebaNet, 989×
faster than MnasNet, around 2.3× faster than FBNet [30] and ProxylessNAS [4]. For FBNet and
ProxylessNAS, the widths of blocks in the search space are set and adjusted by hand. The widths
of our networks are all searched automatically.
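The speed-up factors quoted above follow directly from the search-time column of Tab. 2. The short check here assumes that "48K", "76K" and "91K" denote 48,000, 76,000 and 91,000 GPU hours; the variable names are illustrative.

```python
# Search cost in GPU hours, read from Tab. 2 (K interpreted as thousands).
SEARCH_COST_GPU_HOURS = {
    "NASNet-A": 48_000,
    "AmoebaNet-A": 76_000,
    "MnasNet": 91_000,
    "FBNet": 216,
    "ProxylessNAS": 200,
}
DENSENAS_GPU_HOURS = 92   # 23 hours on 4 GPUs

speedups = {
    name: cost / DENSENAS_GPU_HOURS
    for name, cost in SEARCH_COST_GPU_HOURS.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

The ratios for FBNet and ProxylessNAS come out slightly above and slightly below 2.3 respectively, which matches the "around 2.3×" wording.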
4.3 Comparison with Fixed-block Search
Table 3: The comparison results of DenseNAS models and the models searched under
the fixed-block search space (Fixed-A and -B).
Model        Latency   Top-1 (%)   Top-5 (%)
Fixed-A      25.3ms    74.7        92.0
DenseNAS-A   24.3ms    75.9        92.6
Fixed-B      22.6ms    73.4        91.2
DenseNAS-B   21.1ms    74.7        92.0
To demonstrate the effectiveness and efficiency
of our proposed method, we carry out the same
search process under the fixed-block search space.
The design of the fixed-block search space entirely
follows the popular human-designed network MobileNetV2 [27]. The number of bottleneck blocks
in the search space is set to 7 and the widths of
the blocks are set as [16, 24, 32, 64, 96, 160, 320],
which are the same as MobileNetV2. The block
connections are abandoned and all the blocks in the
search space are used for deriving the final architecture. The other search settings are the same as our
proposed method for fair comparison. The results are shown in Tab. 3.
4.4 Different Latency Settings
[Figure 5 plot: legend entries DenseNAS, Fixed-A, Fixed-B, MobileNetV2-1.0, MobileNetV2-1.3, MobileNetV2-1.4]
Figure 5: The comparison of model performances under different latency settings.
Our proposed method can search for architectures according to various latency requirements. We
conduct search experiments under different latency settings. For the loss function defined in Eq. (8), we
set τ to 15 and choose λ from {0.25, 0.3, 0.35, 0.4, 0.45}. For the first 150 epochs, only operation weights are
updated. The super network with pre-trained operation weights can therefore be shared between search
processes under different latency settings, which saves search time.
The architectures obtained by DenseNAS are shown in Fig. 4. The faster model tends to be shallower.
The block widths and the number of blocks are all searched automatically. Fig. 5 further shows
that DenseNAS models outperform the models searched under the fixed-block search space and
MobileNetV2 models with different latency settings.
5 Conclusion
The proposed DenseNAS is a differentiable NAS method for searching network widths. DenseNAS
can also optimize the spatial down-sampling positions and the distribution of depths at the network
scale rather than the block scale. DenseNAS makes neural architecture design more automatic. The
results of large-scale experiments reflect its efficiency and effectiveness. In future work,
we would like to study our DenseNAS method on some dense prediction tasks, such as designing
feature pyramid [16, 11, 17] and semantic segmentation [23, 5, 6] networks, because DenseNAS has
the advantage of searching for the spatial down-sampling positions and the performance of dense
prediction tasks is more sensitive to the spatial resolution of feature maps.
Acknowledgement
We thank Liangchen Song for the discussion and assistance.
References
[1] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc V. Le. Understanding
and simplifying one-shot architecture search. In Proceedings of the 35th International Conference on
Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 549–558,
2018.
[2] Yoshua Bengio and Yann LeCun, editors. 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[3] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. SMASH: one-shot model architecture
search through hypernetworks. CoRR, abs/1708.05344, 2017.
[4] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and
hardware. In International Conference on Learning Representations, 2019.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab:
Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution
for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[7] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing
Hu, Yiming Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and Niraj K. Jha. Chamnet: Towards
efficient network design through platform-aware model adaptation. CoRR, abs/1812.08934, 2018.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical
image database. In CVPR, 2009.
[9] Jiemin Fang, Yukang Chen, Xinbang Zhang, Qian Zhang, Chang Huang, Gaofeng Meng, Wenyu Liu, and
Xinggang Wang. EAT-NAS: elastic architecture transfer for accelerating large-scale neural architecture
search. CoRR, abs/1901.05884, 2019.
[10] G David Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, 2015.
[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile
vision applications. CoRR, abs/1704.04861, 2017.
[13] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected
convolutional networks. In CVPR, pages 2261–2269, 2017.
[14] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings, 2017.
[15] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical
report, 1(4):1–7, 2009.
[16] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching
for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA, pages 2169–2178, 2006.
[17] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie.
Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 936–944, 2017.
[18] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L. Yuille, and Li Fei-Fei.
Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. CoRR,
abs/1901.02985, 2019.
[19] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan L.
Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, pages 19–35,
2018.
[20] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In
International Conference on Learning Representations, 2019.
[21] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings, 2017.
[22] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation
of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[23] Maria Papadomanolaki, Maria Vakalopoulou, and Konstantinos Karantzalos. A novel object-based deep
learning framework for semantic segmentation of very high-resolution remote sensing data: Comparison
with convolutional and fully convolutional networks. Remote Sensing, 11(6):684, 2019.
[24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming
Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[25] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search
via parameter sharing. In ICML, pages 4092–4101, 2018.
[26] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier
architecture search. CoRR, abs/1802.01548, 2018.
[27] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted
residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR,
abs/1801.04381, 2018.
[28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru
Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[29] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware
neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
[30] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter
Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-aware efficient ConvNet design via differentiable
neural architecture search. CoRR, abs/1812.03443, 2018.
[31] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In
International Conference on Learning Representations, 2019.
[32] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR,
abs/1611.01578, 2016.
[33] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for
scalable image recognition. CoRR, abs/1707.07012, 2017.
