ADWPNAS: Architecture-Driven Weight Prediction for Neural Architecture
  Search by Xu Zhang et al.
Xu Zhang
SYSU
zhangx629@mail2.sysu.edu.cn
Junzhou Chen
SYSU
chenjunzhou@mail.sysu.edu.cn
Bo Gu†
SYSU
gubo@mail.sysu.edu.cn
Abstract
How to discover and evaluate the true strength of models
quickly and accurately is one of the key challenges in
Neural Architecture Search (NAS). To cope with this problem,
we propose an Architecture-Driven Weight Prediction
(ADWP) approach for NAS.
In our approach, we first design an architecture-intensive
search space and then train a HyperNetwork by inputting
stochastic encoding architecture parameters. In the trained
HyperNetwork, weights of convolution kernels can be well
predicted for neural architectures in the search space. Con-
sequently, the target architectures can be evaluated effi-
ciently without any finetuning, thus enabling us to search
for the optimal architecture in the space of general networks
(macro-search). Through real experiments, we evaluate the
performance of the models discovered by the proposed AD-
WPNAS and results show that one search procedure can be
completed in 4.0 GPU hours on CIFAR-10. Moreover, the
discovered model obtains a test error of 2.41% with only
1.52M parameters which is superior to the best existing
models.
1. Introduction
Designing efficient and effective neural architectures has
always been of vital importance for deep learning. In re-
cent years, Neural Architecture Search (NAS) [30, 24, 31,
21, 23, 20] has demonstrated superior capabilities in discov-
ering excellent neural architectures automatically. Specif-
ically, neural architectures obtained through NAS meth-
ods achieve outstanding performance on the tasks of com-
puter vision, such as image classification [5], object detec-
tion [10] and semantic segmentation [17].
Most NAS methods rely on reinforcement learning
(RL) [30, 31, 2] or evolutionary algorithms (EA) [23, 24,
19], which incur intensive computation during the search
procedure [31, 23]. For instance, an RL-based method [30]
needs 2000 GPU-days to obtain the final architecture by
training and evaluating more than 20,000 candidate
architectures, and an EA-based method [23] takes 3150
GPU-days to discover its best architecture.
†Corresponding author.
Figure 1. Overview of the intensive-space. We represent the
search space of each cell as a DAG with ordered nodes. Different
operations (colored circles) transform one node (gray square)
to intermediate features in a predetermined direction (black
arrow). Meanwhile, each node is the sum of the intermediate
features transformed from the previous nodes. We prune the search
space into an intensive-space as described in Sec. 3.1.2. The solid
arrows indicate the operations reserved after pruning, and the
dotted ones indicate the removed operations.
arXiv:2003.01335v1 [cs.NE] 3 Mar 2020
In this paper, we propose an Architecture-Driven Weight
Prediction (ADWP) approach for neural architecture search
(NAS), aiming to search directly for the optimal architecture
rather than for the best cells [20]. As detailed in Fig. 1, the
search space of a cell is represented as a directed acyclic
graph (DAG), which is composed of three elements: ordered
nodes, colored circles and arrows. The DAG contains
a large number of sub-graphs, each of which represents a
neural architecture, so together they form a huge search
space. Finding an optimal architecture in such a huge space
is not a trivial task. To this end, the search space is pruned
into an intensive-space which consists of only the most likely
operations.
Then, as detailed in Fig. 3, we leverage the intensive-space
to build a HyperNetwork with a pre-defined number
of cells. Each cell may contain multiple GeneratingBlocks
and ConvBlocks. Each ConvBlock corresponds to a
convolution operation in the intensive-space, and each
GeneratingBlock generates the weights for its ConvBlock. To
predict the weights well, the HyperNetwork is trained
iteratively by feeding in stochastic encoding architecture
parameters until the GeneratingBlocks converge. After
training, target architectures with predicted weights can be
evaluated efficiently without any finetuning, thus allowing us
to search directly for the optimal architecture (i.e.,
macro-search). Moreover, the models discovered by the proposed
ADWPNAS achieve superior or comparable results.
In summary, the contribution of our paper is three-fold:
1. We propose ADWPNAS, which predicts the weights of
the target architectures instead of training them.
In this way, ADWPNAS greatly improves search
efficiency, enabling us to find competitive models in a few
GPU hours with macro-search.
2. In the search procedure, we replace the original search
space with an intensive-space, which is intensive with
respect to neural architectures. This
significantly reduces the computational cost.
3. We obtain a series of comparable results with fewer
GPU resources. On CIFAR-10, ADWPNAS is able to
complete a search procedure within 4.0 GPU hours and
attains comparable performance with only 1.52M
parameters, which is at least 40% fewer than those of
models discovered by the existing approaches.
2. Related Work
Recently, NAS has made significant progress. Neural
architectures [18, 19, 24, 30, 31] obtained by NAS methods
have surpassed manually designed ones in many fields. NAS
methods can be divided into the following two categories:
micro search and macro search.
Micro search algorithms aim to find the best neural
cells, and then stack them to construct networks at differ-
ent depths according to actual needs [20, 29, 28, 9]. Liu et
al. [20] relax the search space to continuous and then use a
gradient descent method to search for the final neural cells
in shallow depth. By stacking the cells into a deep network,
they achieve comparable performance in the classification
task. In ProxylessNAS [4], Cai et al. also search for the
best cells and then expand on the depth of the network by
stacking them together. Dong et al. [9] search for the cells
by using differentiable architecture sampler and reduce the
search cost to 4 GPU-hours. One of the great advantages
of micro search algorithms is that the network is easy to
extend in terms of depth after obtaining the best cells.
However, the network obtained by stacking multiple cells is
probably not globally optimal.
Macro search algorithms, on the other hand, directly
search for the optimal network instead of cells [3, 16, 27, 5].
Baker et al. [1] use reinforcement learning techniques to
train the Q-Learning agent and then select the CNN layer
by the agent. Based on DARTS [20], Chen et al. [5] allow
the depth of the network to deepen in the search procedure,
and achieve macro search to a certain extent. These
traditional methods can obtain a globally optimal network in
theory. However, in practice, it is inefficient and difficult
to search for architectures with depths similar to [11, 13]
due to the huge search space; e.g., a search space with a
depth of 12 contains $10^{29}$ possible networks [21]. The
proposed ADWPNAS also falls into the macro search category.
Different from the traditional methods, we learn a HyperNetwork
with reasonable weight prediction to improve search
efficiency, thus enabling us to search for the optimal
architecture at different depths.
3. Methodology
The key challenge of NAS is how to evaluate a large
number of models efficiently and accurately under resource
constraints, so as to derive the optimal neural architecture.
Generally, it is efficient to explore in a search space with in-
tensive and effective architectures through a well-designed
search strategy. Hence, we propose ADWPNAS to derive
the optimal neural architecture in an efficient way and
elaborate on two aspects: search space (Sec. 3.1) and search
strategy (Sec. 3.2).
3.1. Architecture Search Space
In differentiable architecture search [20, 4, 5], the
optimal neural architecture is usually obtained by selecting
the most likely operations from distinct nodes in the search
space (hereafter termed the original search space). However,
the sub-space formed by selecting, for each node, the K most
likely operations among its connections to all the previous
nodes usually remains stable after training for a period of
time (this sub-space is called the intensive-space below).
Therefore, the architecture finally obtained is generally
included in the intensive-space. That is to say, it is
reasonable to first derive an intensive-space from the
original search space and then to obtain the optimal
architecture from it.
[Figure 2: diagram omitted. Labels: encoding architecture parameters $\alpha_l$ of the $l$-th cell, Input, Softmax, GeneratingBlock (FC, ReLU, FC, Reshape), weight tensor $w_\alpha$ (output channels, input channels), ConvBlock, nodes $N_{k-i}$ and $N_k$, feature map.]
Figure 2. Network structure of a GeneratingBlock connected with a ConvBlock. In the $l$-th cell, the ConvBlock represents a convolution operation
between the nodes $N_{k-i}$ and $N_k$ and transforms the $(k-i)$-th node to an intermediate feature with the weights generated by the corresponding
GeneratingBlock. Meanwhile, the GeneratingBlock takes the encoding architecture parameters $\alpha_l$ as input and outputs weights for the
ConvBlock, as described in Sec. 3.2.1.
3.1.1 Original Search Space
In this stage, we utilize the search space presented in
[30, 31, 20] as our original search space to build a
network of $L$ cells that contains only normal and reduction
cells. Each cell is represented as a DAG of $M$
nodes, $\{N_0, N_1, N_2, \cdots, N_{M-1}\}$, in which each node is
connected with the previous nodes by a set of operations $o(\cdot)$, e.g.,
convolution, pooling, zero. Thus, all the operations in a cell
constitute the original search space, denoted as $\mathcal{O}$. To make
the search space continuous, a softmax function is applied
to all possible operations, mixed with weights $\alpha^{(i,j)}$ for each
pair of nodes $(i, j)$:
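$$F_{i,j}(N_i) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(N_i) \tag{1}$$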
where $F_{i,j}$ represents the information flow of feature maps
and $\alpha^{(i,j)}$ is a vector of dimension $|\mathcal{O}|$. Meanwhile,
each node takes all the previous nodes as inputs, and outputs
the sum of the features transformed from the inputs.
Thus, each intermediate node can be computed as
$N_j = \sum_{i<j} F_{i,j}(N_i)$. Apart from this, each cell takes the
outputs of the previous two cells as two input nodes ($N_0$ and
$N_1$) and gets the output node $N_{M-1}$ by concatenating the
intermediate nodes $\{N_2, N_3, \cdots, N_{M-2}\}$ along the channel
dimension. Therefore, the structure of each cell is
represented as a set of variables $\{\alpha^{(i,j)}\}$. As the two types
of cells share their respective mixing weights $\alpha_{normal}$ and
$\alpha_{reduce}$, the architecture parameters are encoded as
$\alpha = \{\alpha_{normal}, \alpha_{reduce}\}$.
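As a minimal sketch of Eq. (1) and of the node sum $N_j = \sum_{i<j} F_{i,j}(N_i)$, the following Python snippet uses scalars in place of feature maps and three toy operations. The function names (`softmax`, `mixed_edge`, `node_output`) and the toy operations are illustrative assumptions and are not taken from the ADWPNAS implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_edge(alpha_edge, ops, x):
    """Eq. (1): softmax-weighted sum of all candidate operations on one edge."""
    probs = softmax(alpha_edge)
    return sum(p * op(x) for p, op in zip(probs, ops))


def node_output(j, nodes, alpha, ops):
    """N_j = sum over i < j of F_{i,j}(N_i); alpha[(i, j)] holds the logits of edge (i, j)."""
    return sum(mixed_edge(alpha[(i, j)], ops, nodes[i]) for i in range(j))


# Toy scalar stand-ins for real operations (hypothetical, for illustration only).
ops = [
    lambda x: 0.0,      # "zero"
    lambda x: x,        # "identity"
    lambda x: 2.0 * x,  # stand-in for a convolution
]

nodes = [1.0, 3.0]  # the two input nodes N_0 and N_1
alpha = {(0, 2): [0.0, 0.0, 0.0], (1, 2): [0.0, 0.0, 0.0]}  # uniform logits
n2 = node_output(2, nodes, alpha, ops)
print(n2)
```

With uniform logits every operation receives the same mixing weight, so each edge simply averages the outputs of its candidate operations.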
3.1.2 Intensive-space Deriving
As shown in Fig. 1, for each node, the $K$ most likely
non-zero operations from the previous nodes are selected to form
an intensive-space, denoted by $\mathcal{O}^*$. Assuming there are $M$
nodes (excluding two input nodes and one output node) in a
cell with the original space, the cell with the intensive-space
contains $M \times K$ operations. Formally, the intensive-space
can be denoted by $\mathcal{O}^* = \{\mathcal{O}^*_{i,j} \mid 0 \le i < j,\ 1 \le j < M\}$,
where $\mathcal{O}^*_{i,j}$ indicates the set of selected
operations between nodes $i$ and $j$.
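One possible reading of this selection step is sketched below in Python: for every node, the candidate operations on all incoming edges are ranked by probability and the K best non-zero ones are kept. The data layout (`probs` as a nested dictionary), the function name `derive_intensive_space` and the operation names are hypothetical, and the handling of ties is an assumption, since the text does not specify it.

```python
def derive_intensive_space(probs, K, zero_op="zero"):
    """Keep the K most likely non-zero operations entering each node.

    probs: {j: {(i, op_name): probability}} for every node j, where i < j
           is the index of a previous node.
    Returns {j: [(i, op_name), ...]} sorted by decreasing probability.
    """
    space = {}
    for j, candidates in probs.items():
        non_zero = [(p, key) for key, p in candidates.items() if key[1] != zero_op]
        # sorted() is stable, so ties keep their original dictionary order.
        ranked = sorted(non_zero, key=lambda item: item[0], reverse=True)
        space[j] = [key for _, key in ranked[:K]]
    return space


# Hypothetical probabilities for a single node j = 2 with two previous nodes.
probs = {
    2: {
        (0, "sep_conv_3x3"): 0.30,
        (0, "zero"): 0.40,
        (1, "sep_conv_3x3"): 0.10,
        (1, "max_pool_3x3"): 0.20,
    }
}
space = derive_intensive_space(probs, K=2)
print(space)
```

Note that the zero operation is skipped even when it has the highest probability, which matches the "non-zero" wording above.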
On the one hand, if the intensive-space is determined only
by the validation accuracy, architectures with similar
accuracy may have quite different architecture parameters.
On the other hand, due
to continuous relaxation [20], architecture parameters are
updated continuously. Hence, the difference between the
parameters $\alpha_t$ at the $t$-th epoch and $\alpha_{t-1}$ at the previous epoch is
relatively small, while the corresponding accuracies might be
quite different. Therefore, we define a criterion to evaluate
the superiority of a search space that considers both the
accuracy and the stability. Specifically, the accuracy is denoted
by $\mathrm{acc}(\alpha_t)$ and the stability of an intensive-space is defined as
follows:
$$s(\mathcal{O}^*_{t-i}, \mathcal{O}^*_t) = 1 - \frac{C(\mathcal{O}^*_{t-i}, \mathcal{O}^*_t)}{M \times K} \tag{2}$$
where $C(\mathcal{O}^*_{t-i}, \mathcal{O}^*_t)$ denotes the number of changed
operations between the intensive-spaces $\mathcal{O}^*_{t-i}$ and $\mathcal{O}^*_t$. $M$
represents the number of nodes in an intensive-space and $K$
indicates the number of the most likely non-zero operations
connected to each node. Backtracking $n$ epochs, the
superiority is expressed as follows:
$$S(\mathcal{O}^*_t) = \prod_{0 \le i \le n} s(\mathcal{O}^*_{t-i}, \mathcal{O}^*_t) \, \mathrm{acc}(\alpha_{t-i}) \tag{3}$$
The defined superiority has the property that it equals
1 when there is no change between $\mathcal{O}^*_{t-i}$ and $\mathcal{O}^*_t$ and the
corresponding accuracy reaches 100%, simultaneously.
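The stability and superiority criteria of Eqs. (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: an intensive-space is stored as a dictionary from node index to a list of selected operations, a "changed" operation is counted as one that is selected at epoch t but was not selected at epoch t-i, and accuracies are given as fractions in [0, 1]. The function names are hypothetical.

```python
def changed_ops(space_old, space_new):
    """C(.,.): operations selected in space_new that were not selected in space_old."""
    return sum(len(set(space_new[j]) - set(space_old[j])) for j in space_new)


def stability(space_old, space_new, M, K):
    """Eq. (2): 1 - C / (M * K)."""
    return 1.0 - changed_ops(space_old, space_new) / (M * K)


def superiority(spaces, accs, n, M, K):
    """Eq. (3): product over i = 0..n of stability times accuracy.

    spaces[-1] and accs[-1] belong to the current epoch t, spaces[-2] and
    accs[-2] to epoch t-1, and so on; at least n + 1 epochs are required.
    """
    current = spaces[-1]
    result = 1.0
    for i in range(n + 1):
        result *= stability(spaces[-1 - i], current, M, K) * accs[-1 - i]
    return result


# Toy history: one node, K = 2, and one operation changes between epochs.
spaces = [
    {2: [(0, "a"), (1, "b")]},  # epoch t-1
    {2: [(0, "a"), (1, "c")]},  # epoch t
]
accs = [0.5, 0.8]
S = superiority(spaces, accs, n=1, M=1, K=2)
print(S)
```

In this toy history the term for i = 0 contributes the current accuracy alone, because an intensive-space is always fully stable with respect to itself.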
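After deriving the intensive-space, it is relaxed by applying
the softmax function:
$$N_j = \sum_{i=0}^{j-1} \sum_{o \in \mathcal{O}^*_{i,j}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{i'=0}^{j-1} \sum_{o' \in \mathcal{O}^*_{i',j}} \exp\left(\alpha_{o'}^{(i',j)}\right)} \, o(N_i) \tag{4}$$
$$\text{s.t.} \quad \sum_{i=0}^{j-1} |\mathcal{O}^*_{i,j}| = K \tag{5}$$
where $\alpha^{(i,j)}$ is a vector of dimension $|\mathcal{O}^*_{i,j}|$. Additionally,
the probability of an operation $o$ on the edge $(i, j)$ is defined as
$$\frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{i'=0}^{j-1} \sum_{o' \in \mathcal{O}^*_{i',j}} \exp\left(\alpha_{o'}^{(i',j)}\right)}.$$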
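Unlike the per-edge softmax of Eq. (1), the relaxation of the intensive-space normalizes over all K operations entering a node, whichever previous node they come from. A minimal Python sketch of this joint normalization is given below; the dictionary layout and the name `node_probabilities` are assumptions made for illustration.

```python
import math


def node_probabilities(alpha_j):
    """Probability of each retained operation entering node j.

    alpha_j: {(i, op_name): logit} for the K operations kept for node j.
    The softmax is taken jointly over every incoming edge, so the
    probabilities of all K operations sum to one.
    """
    m = max(alpha_j.values())
    exps = {key: math.exp(v - m) for key, v in alpha_j.items()}
    total = sum(exps.values())
    return {key: e / total for key, e in exps.items()}


# Hypothetical logits for K = 3 operations entering node j = 2.
alpha_j = {(0, "sep_conv_3x3"): 0.0, (0, "identity"): 0.0, (1, "max_pool_3x3"): 0.0}
p = node_probabilities(alpha_j)
print(p)
```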
Moreover, since there are two types of cells with
the original search space, we respectively derive two
types of intensive-spaces from the normal and reduction
cells, which are denoted by O∗normal and O∗reduce. Fur-
thermore, the intensive-spaces are encoded as O∗ =
{O∗normal,O∗reduce}.
3.2. Search Procedure
Let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ represent the training
and validation losses, respectively. The optimization objective of
NAS is to find the optimal architecture $\alpha^*$ that minimizes
$\mathcal{L}_{val}(w^*, \alpha^*)$, where $w^*$ indicates the weights of the
network corresponding to $\alpha^*$. $w^*$ is obtained by
minimizing the training loss, expressed as
$w^* = \arg\min_w \mathcal{L}_{train}(w, \alpha^*)$.
As pointed out in [6], it is uncertain whether
weight-sharing methods are able to reflect the real strength
of architectures. We therefore propose ADWPNAS to find
non-shared, reasonable weights for each architecture. To
be specific, the problem of NAS is divided into two
consecutive sub-problems:
1) finding the optimal weights $w^*_\alpha$ for a given
architecture $\alpha$:
$$w^*_\alpha = \arg\min_{w} \mathcal{L}_{val}(w, \alpha) \tag{6}$$
2) searching for the optimal architecture $\alpha^*$:
$$\alpha^* = \arg\min_{\alpha} \mathcal{L}_{val}(w^*_\alpha, \alpha) \tag{7}$$
[Figure 3: diagram omitted. Upper panel: a MiniBatch passes through a stack of normal cells (repeated n times) and reduction cells to produce the Loss. Lower panel (Cell Structure): Input, $\alpha_l$, Softmax, GeneratingBlock, Generate Weights, ConvBlock, Other operations, Loss, Update Parameters, gates $g_\alpha$ and $g_G$.]
Figure 3. Upper side: Overview of the HyperNetwork for ADWPNAS.
A neural network contains a pre-defined number of cells,
where reduction cells are located at 1/3 and 2/3 of the network’s
depth. For a minibatch of input images, the Loss can be calculated by
the HyperNetwork with the generated weights. Lower side: The
structure of a cell. 1) A ConvBlock indicates a convolution operation,
and the other operations consist of the pooling and identity
operations in the intensive-space. All the cells share similar structures, in
which the weights of the convolution operations are generated by the
GeneratingBlocks. 2) After obtaining the Loss by the HyperNetwork,
we leverage gates $g_\alpha$ and $g_G$ to control whether or not to update
the corresponding parameters in the training and search procedures,
as described in Sec. 3.2.
3.2.1 Weight Prediction and HyperNetwork Training
To solve the sub-problem 1), it is straightforward to get
the optimal weights by training each architecture. How-
ever, the computational cost is usually unaffordable. In this
regard, we approximate the weights through prediction by
building a HyperNetwork.
The HyperNetwork is constructed in the way as shown
in the upper side of Fig. 3, where all the normal cells share
the intensive-space O∗normal and the reduction ones share
O∗reduce. Meanwhile, architecture parameters are extended
from two types of cells to the entire network. That is, each
cell in the HyperNetwork has independent architecture pa-
rameters. Formally, an architecture with L cells can be rep-
resented as a set of vectors αarch = {αl|0 ≤ l < L}, where
αl represents a vector of parameters associated with lth cell.
Furthermore, as described in Fig. 2, the weights of each
convolution operation in the cells are predicted by the
corresponding GeneratingBlock. To be specific, a ConvBlock
indicates a convolution operation between the $(k-i)$-th node
and the $k$-th one in the $l$-th cell, and a GeneratingBlock is
composed of a softmax function and two fully-connected
layers. The GeneratingBlock takes the encoding architecture
parameters as input to generate the weight matrix. Then,
the generated weight matrix is reshaped to match the
numbers of input and output channels in the ConvBlock. Note
that the softmax function is applied to derive the probabilities
of all the operations in the $l$-th cell. Moreover, there is
a one-to-one correspondence between ConvBlocks and
GeneratingBlocks. Therefore, we formulate the process of
deriving the weights $w_\alpha$ for the architecture with
parameters $\alpha_{arch}$ in the HyperNetwork as:
$$w_\alpha = G(w_G, \alpha_{arch}) \tag{8}$$
where $G$ denotes the GeneratingBlocks and $w_G$ represents
the parameters of the GeneratingBlocks.
For a batch of input images $x_{train}$, we can calculate
the loss by the HyperNetwork with the generated
weights in the ConvBlocks, as detailed in the lower side of
Fig. 3. We denote this process by
$Loss = \mathrm{ConvBlocks}(x_{train})$.
With the GeneratingBlocks, finding $w^*_\alpha$ (Eq. 6) is
transformed into solving the following $\mathcal{L}_{val}$ minimization
problem:
$$G^* = \arg\min_{G} \mathcal{L}_{val}(G(w^*_G, \alpha_{arch}), \alpha_{arch}) \tag{9}$$
$$\text{s.t.} \quad w^*_G = \arg\min_{w_G} \mathcal{L}_{train}(G(w_G, \alpha_{arch}), \alpha_{arch}) \tag{10}$$
Then, the task of finding $w^*_\alpha$ for $\alpha$ reduces to optimizing
the GeneratingBlocks to generate weights by training the
HyperNetwork.
HyperNetwork training. In the forward propagation,
we change the encoding architecture parameters randomly
at each iteration since the purpose is to generate weights for
given architectures with different αarch.
In the backward propagation, the parameters of the
GeneratingBlocks rather than those of the ConvBlocks are updated
iteratively by gradient descent. For ease of implementation,
a binary gate ($g_G$) controls whether the parameters are
updated: a value of 1 means the gate is open and 0 means
it is off.
Details are given in Stage 1 of Algorithm 1.
3.2.2 Architecture Search
After training the HyperNetwork, the optimization
objective (Eq. 7) can be transformed into the following by
$G^*(w^*_G, \alpha_{arch})$:
$$\alpha^*_{arch} = \arg\min_{\alpha'_{arch}} \mathcal{L}_{val}(G^*(w^*_G, \alpha'_{arch}), \alpha'_{arch}) \tag{11}$$
$$\text{s.t.} \quad \alpha'_{arch} = \arg\min_{\alpha_{arch}} \mathcal{L}_{train}(G^*(w^*_G, \alpha_{arch}), \alpha_{arch}) \tag{12}$$
To obtain a discrete architecture, we retain the top-$T$ most
likely operations (connected to each node from all the
previous nodes) according to $\alpha^*_{arch}$ and set $T = 2$ following
existing work [20].
Algorithm 1: Search Procedure based on ADWPNAS
Hyper Parameters: Number of Cells: $L$; Number of HyperNetwork Training Iterations: $I_{train}$; Number of Cross-search Iterations to Start: $I^{start}_{cross-search}$; Number of Cross-search Iterations to End: $I^{end}_{cross-search}$; Number of Total Iterations: $I_{total}$.
Input: intensive-space: $\mathcal{O}^*$, training dataset: $x_{train}$, validation dataset: $x_{val}$.
Output: The Discrete Architecture Obtained.

Stage 1. HyperNetwork training:
  for $i = 0 : I_{train}$ do
    Generate architecture parameters $\alpha_{arch}$ randomly;
    Obtain $w_\alpha$ by GeneratingBlocks($\alpha_{arch}$);
    Obtain Loss by ConvBlocks($x_{train}$);
    Update the parameters of GeneratingBlocks $w_G$;
  end
  Return the trained GeneratingBlocks $G^*$.

Stage 2. Architecture search:
  Initialize architecture parameters $\alpha_{arch}$ randomly;
  for $i = I_{train} : I_{total}$ do
    Obtain $w_\alpha$ by GeneratingBlocks($\alpha_{arch}$);
    Obtain Loss by ConvBlocks($x_{train}$);
    Update architecture parameters $\alpha_{arch}$ by gradient descent;
    if $I^{start}_{cross-search} \le i < I^{end}_{cross-search}$ then
      Update the parameters of GeneratingBlocks $w_G$;
    end
  end
  Obtain the best architecture parameters $\alpha^*_{arch}$ and return the discrete architecture.
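The update schedule implied by Algorithm 1 can be summarized by the state of the two binary gates at each iteration. The sketch below is an illustration rather than the authors' code: the function name `gates` is hypothetical, the default iteration counts are the values reported in Sec. 4.2, and treating $g_\alpha$ as closed during Stage 1 is an assumption inferred from the fact that $\alpha_{arch}$ is sampled randomly there.

```python
def gates(i, i_train=60, i_start=90, i_end=100, i_total=120):
    """Return (g_alpha, g_G) for iteration i; 1 means the parameters are updated."""
    if not 0 <= i < i_total:
        raise ValueError("iteration index out of range")
    if i < i_train:
        # Stage 1: HyperNetwork training, alpha_arch is sampled randomly.
        return 0, 1
    if i_start <= i < i_end:
        # Stage 2, cross-search: alpha_arch and w_G are updated together.
        return 1, 1
    # Stage 2, basic-search: only alpha_arch is updated.
    return 1, 0


for i in (0, 59, 60, 89, 90, 99, 100, 119):
    print(i, gates(i))
```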
Since the weights of convolution kernels can be derived
through prediction instead of training, target architectures
can be evaluated efficiently without any finetuning, which
enables us to directly search for the optimal architecture
(i.e., macro-search). As shown in Algorithm 1, unlike
in the HyperNetwork training stage, the encoding architecture
parameters are updated by gradient descent rather
than being randomly generated at each iteration. Meanwhile,
the binary gate $g_\alpha$ is on while $g_G$ is off. We refer
to this process as the basic-search.
The basic-search is able to guide the architecture parameters
$\alpha_{arch}$ to converge. In the converged region
of $\alpha_{arch}$, however, there may exist some noise in the
predicted weights, because the HyperNetwork is trained with
randomly generated architecture parameters. To deal with
this problem, cross-search is adopted
when $\alpha_{arch}$ is converging in the basic-search. In the
cross-search, $g_G$ is open as well as $g_\alpha$, which allows the
weights of all the GeneratingBlocks and the architecture
parameters to be updated simultaneously. After the cross-search,
we close $g_G$ and perform basic-search again to obtain the
architecture parameters $\alpha^*_{arch}$. The optimal architecture can
then be derived from $\alpha^*_{arch}$.

Method                                     Test Error (%)  Params (M)  Search Cost (GPU-days)  Search Method
DenseNet-BC [13]                           3.46            1.7         –                       Manual
NASNet-A [31] + cutout                     2.65            3.3         1800                    RL
AmoebaNet-A + cutout [23]                  3.34            3.2         3150                    Evolution
AmoebaNet-B + cutout [23]                  2.55            2.8         3150                    Evolution
PNAS [18]                                  3.41            3.2         225                     SMBO
ENAS + cutout [21]                         2.89            4.6         0.5                     RL
DARTS (first order) + cutout [20]          3.00            3.3         1.5                     Gradient
DARTS (second order) + cutout [20]         2.76            3.3         4.0                     Gradient
SNAS + moderate constraint + cutout [28]   2.85            2.80        1.5                     Gradient
GDAS + cutout [9]                          2.93            3.40        0.21                    Gradient
GDAS(FRC) + cutout [9]                     2.82            2.50        0.17                    Gradient
ADWPNAS(8-layers) + cutout                 2.77            1.52        0.2                     Gradient
ADWPNAS(8-layers) + cutout + AutoAugment   2.41            1.52        0.2                     Gradient
ADWPNAS(14-layers) + cutout                2.55            2.62        0.4                     Gradient
ADWPNAS(14-layers) + cutout + AutoAugment  2.08            2.62        0.4                     Gradient
Table 1. Comparison with state-of-the-art architectures on CIFAR-10. The search costs are taken from the original papers. Note that
the search cost for ADWPNAS includes the HyperNetwork training cost and the architecture search cost, but excludes the intensive-space
deriving cost (0.34 GPU-days). Our experiments are run on a TITAN RTX GPU.
4. Experiments and Results
Our experiments on CIFAR-10 [15] consist of three
parts: intensive-space deriving (Sec. 4.1), search procedure
(Sec. 4.2) and architecture evaluation (Sec. 4.3).
Additionally, the transferability of the architectures learned
on CIFAR-10 is investigated by evaluating them on
ImageNet [25].
4.1. Intensive-space Deriving
In the proposed ADWPNAS, our original search space
consists of the following 8 operations: 3 × 3 depthwise-
separable conv, 5 × 5 depthwise-separable conv, 3 × 3
dilated-separable conv, 5 × 5 dilated-separable conv, 3 × 3
average pooling, 3× 3 max pooling, identity and zero.
The backbone network is constructed by stacking L cells
and each cell contains $M = 7$ nodes. The inputs are the
first and second nodes of the $l$-th cell, which are equal to the
outputs of the $(l-2)$-th and $(l-1)$-th cells respectively, and
the output is set to be the 6th node. Reduction cells, detailed
in the upper side of Fig. 3, are located at 1/3 and 2/3 of the
network’s depth and are connected to normal cells by
operations with a stride of two.
We search for two intensive-spaces², denoted by $\mathcal{O}^*_8$ and
$\mathcal{O}^*_{14}$, which are obtained under the configurations of 8 cells
and 14 cells, respectively, with $K = 6$ and $n = 2$. The other
experimental settings are the same as in DARTS [20], except that
the batch size is set to 200 and the number of training epochs
is 30.
² The intensive-spaces are provided in the supplementary material.
[Figure 4: plot omitted. x-axis: Epochs (0 to 30); y-axis: Valid Accuracy / Superiority (%) (0 to 80); curves: Valid Accuracy, Superiority (K=6).]
Figure 4. This figure presents the validation accuracy and
superiority of an intensive-space with $K = 6$.
Discussion about deriving the intensive-space. Fig. 4
illustrates the validation accuracy and the superiority of an
intensive-space (K = 6). Although the two have similar
trends, the superiority shows significant differences at dif-
ferent times (epochs) while the validation accuracy gradu-
ally converges. This result indicates that the superiority is
helpful to identify the performance of intensive-spaces.
Method                           Top-1 Error (%)  Top-5 Error (%)  Params (M)  +× (M)  Search Cost (GPU-days)  Search Method
Inception-v1 [26]                30.2             10.1             6.6         1448    –                       Manual
MobileNet [12]                   29.4             10.5             4.2         569     –                       Manual
NASNet-A [31]                    26.0             8.4              5.3         564     1800                    RL
AmoebaNet-A [23]                 25.5             8.0              5.1         555     3150                    Evolution
AmoebaNet-B [23]                 26.0             8.5              5.3         555     3150                    Evolution
AmoebaNet-C [23]                 24.3             7.6              6.4         570     3150                    Evolution
PNAS [18]                        25.8             8.1              5.1         588     225                     SMBO
DARTS (second order) [20]        26.7             8.7              4.7         574     4.0                     Gradient
SNAS + moderate constraint [28]  27.3             9.2              4.3         522     1.5                     Gradient
GDAS [9]                         26.0             8.5              5.3         581     0.21                    Gradient
GDAS(FRC) [9]                    27.5             9.1              4.4         497     0.17                    Gradient
ADWPNAS(8-layers)                27.6             9.4              3.7         389     0.2                     Gradient
ADWPNAS(14-layers)               26.4             8.6              5.3         565     0.4                     Gradient
Table 2. Comparison with state-of-the-art architectures on ImageNet (mobile setting). +× indicates the number of multiply-add operations.
4.2. Search Procedure
Our search procedure is composed of two stages. In
the first stage, the HyperNetwork is trained from scratch
with randomly generated architecture parameters. In the
second stage, we search for the best architecture
based on the HyperNetwork by gradient descent.
Both stages share the same backbone network,
detailed in Fig. 3. In a GeneratingBlock of the $l$-th cell, we use
the encoding architecture parameters $\alpha_l$ as input and apply a
softmax function to calculate the probabilities of all the
operations in the cell. Then, the probability (associated with
the operation) is used as the input of the first fully-connected
layer, which outputs a vector of size 64. The second
fully-connected layer takes the 64-dimensional vector as input and
generates a matrix of shape $(1, c^{out}_l \times c^{in}_l \times w_l \times h_l)$.
The output matrix is then reshaped to $(c^{out}_l \times c^{in}_l \times w_l \times h_l)$
to serve as the weights, where $c^{in}_l$ and $c^{out}_l$ respectively indicate the
numbers of input and output channels.
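The data flow through a GeneratingBlock can be sketched in pure Python as below. Several points are assumptions rather than details confirmed by the text: the whole probability vector of the cell is fed to the first fully-connected layer, a ReLU sits between the two layers (as the labels of Fig. 2 suggest), the weights are initialized uniformly at random, and all class and function names are hypothetical. Only the hidden size of 64 and the final reshape come from the description above.

```python
import math
import random


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in entries) and zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x, weight, bias):
    """y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def reshape(flat, shape):
    """Nest a flat list into the given shape (row-major order)."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(flat[k * step:(k + 1) * step], shape[1:]) for k in range(shape[0])]


class GeneratingBlock:
    """softmax -> FC(n_ops, 64) -> ReLU -> FC(64, c_out*c_in*w*h) -> reshape."""

    def __init__(self, n_ops, c_out, c_in, k_w, k_h, hidden=64, seed=0):
        rng = random.Random(seed)
        self.shape = (c_out, c_in, k_w, k_h)
        self.fc1 = init_linear(n_ops, hidden, rng)
        self.fc2 = init_linear(hidden, c_out * c_in * k_w * k_h, rng)

    def __call__(self, alpha_l):
        probs = softmax(alpha_l)
        hidden = [max(0.0, v) for v in linear(probs, *self.fc1)]  # ReLU
        flat = linear(hidden, *self.fc2)
        return reshape(flat, self.shape)


# Hypothetical cell with 6 operations; the ConvBlock is a 3x3 conv, 3 -> 2 channels.
block = GeneratingBlock(n_ops=6, c_out=2, c_in=3, k_w=3, k_h=3)
alpha_l = [0.1 * k for k in range(6)]
weights = block(alpha_l)
print(len(weights), len(weights[0]), len(weights[0][0]), len(weights[0][0][0]))
```

In a real implementation the two layers would be trainable modules of a deep learning framework, and the returned tensor would be used directly as the kernel of the corresponding convolution.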
In the procedure, we search for architectures with
a batch size of 98 under the configuration of $I_{train} = 60$,
$I^{start}_{cross-search} = 90$, $I^{end}_{cross-search} = 100$ and $I_{total} = 120$.
For $w_G$, we leverage the SGD [22] optimizer with momentum
$\beta = 0.9$ and weight decay 3e-4. The learning rate is
initialized to 0.025 and annealed down to 0 following a cosine
schedule. The architecture parameters $\alpha_{arch}$ are generated
randomly at each iteration in the HyperNetwork training stage. In the search
stage, however, $\alpha_{arch}$ is optimized by the Adam [14]
optimizer with momentum $\beta = (0.5, 0.999)$, learning rate 3e-4
and weight decay 1e-3. Other experimental
settings are the same as in [20].
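The text only states that the learning rate of $w_G$ starts at 0.025 and is annealed to 0 with a cosine schedule. A minimal sketch of the commonly used closed form is given below; the function name `cosine_lr`, the choice of `total_steps`, and whether the schedule is stepped per iteration or per epoch are assumptions.

```python
import math


def cosine_lr(step, total_steps, lr_init=0.025, lr_min=0.0):
    """Cosine annealing from lr_init (step 0) down to lr_min (step total_steps)."""
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)


# Hypothetical horizon of 120 steps, matching the total number of iterations.
for step in (0, 30, 60, 90, 120):
    print(step, cosine_lr(step, total_steps=120))
```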
Note that $\mathcal{O}^*_8$ and $\mathcal{O}^*_{14}$ are leveraged to search for
architectures³ with 8 layers and 14 layers, respectively.
³ The architectures are provided in the supplementary material.
Discussion about the cross-search. In
Fig. 5, the red line indicates the validation accuracy in
the search procedure with cross-search, while the blue one
indicates the accuracy without cross-search. Note that
the red line describes the acquisition process for the 14-layer
model, and the experimental configuration of the blue
line is the same as that of the red one except that
$I^{start}_{cross-search} = I^{end}_{cross-search} = 120$.
Clearly, the red line reaches higher accuracy than the blue one
after cross-search. Cross-search is therefore critical for
improving the accuracy of the search procedure and for
eliminating the noise caused by training the HyperNetwork randomly.
Complexity analysis. We analyze the complexity of the
intensive-space for neural architectures in the search
procedure. Without considering graph isomorphism, there are
$\left(\frac{K!}{T!(K-T)!}\right)^{M-3}$ possible sub-graphs contained in each of
our discretized cells (recall that there are two input nodes and one
output node). Since we learn each cell to derive the final
architecture, the total number of architectures is up to
$\left(\frac{K!}{T!(K-T)!}\right)^{(M-3) \times L}$ when there are $L$ cells in the
HyperNetwork. Therefore, before discretization, the continuous
spaces of the HyperNetworks with 8 and 14 cells cover
$\left(\frac{6 \times 5}{2}\right)^{4 \times 8} \approx 10^{37}$ and $\left(\frac{6 \times 5}{2}\right)^{4 \times 14} \approx 10^{65}$ architectures,
respectively. Both are far greater than the $10^{25}$ of DARTS.
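The orders of magnitude above can be checked with a few lines of Python, using the values K = 6, T = 2 and M = 7 given in the text. The function name `num_architectures` is introduced here for illustration only.

```python
import math


def num_architectures(K, T, M, L):
    """(K choose T) ** ((M - 3) * L): choices per intermediate node, over all cells."""
    per_node = math.comb(K, T)  # ways to keep T of the K retained operations
    return per_node ** ((M - 3) * L)


for L in (8, 14):
    n = num_architectures(K=6, T=2, M=7, L=L)
    print(L, f"{n:.2e}", "order of magnitude:", len(str(n)) - 1)
```

Since Python integers have arbitrary precision, the counts are exact, and the number of decimal digits gives the exponent directly.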
4.3. Architecture Evaluation
Evaluation on CIFAR-10. After the search procedure,
we evaluate the final architectures by training them from scratch
on CIFAR-10 and report their accuracy on the test set. When
training with only the standard cutout [8] trick, we follow
the same settings as [20], but train for 1000 epochs with a batch
size of 98 to reach convergence. When adding the AutoAugment [7]
technique, the number of epochs increases to 1500 and the
other settings remain the same.
The comparison between the models discovered by ADWPNAS
and other state-of-the-art models is summarized
in Tab. 1. The 8-layer model discovered by our approach
not only achieves a test error rate of 2.77% on CIFAR-10,
but also reduces the number of parameters to 1.52M, which is
40% fewer than those of other models with the same level of accuracy.
With the AutoAugment trick, it achieves a lower error
rate of 2.41%. Furthermore, the 14-layer model, with more
parameters, achieves better results, namely test error
rates of 2.55% and 2.08% (with the AutoAugment trick).
Even so, it still contains fewer parameters than most of the
other state-of-the-art models in Tab. 1.
[Figure 5: plot omitted. x-axis: Epochs (0 to 120); y-axis: Valid Accuracy (%) (20 to 80); curves: with cross-search, without cross-search.]
Figure 5. Validation accuracy of the search process with and
without cross-search.
Evaluation on ImageNet. The experimental setup on
ImageNet is the same as that of DARTS, except for a batch size
of 256 and an auxiliary weight of 0.7. The experimental
results are shown in Tab. 2. ADWPNAS(8-layers),
trained with an initial channel size of 50, uses 16% fewer
model parameters and 20% fewer multiply-add operations
than GDAS(FRC) [9], and achieves a similar top-1 error
rate of 27.6%. Furthermore, we successfully transfer the
14-layer model to ImageNet with competitive performance.
5. Conclusion
In this paper, we propose an architecture-driven weight
prediction approach for neural architecture search, which
is efficient and reduces the search cost by about $10^4$ times
compared to the NAS approach [31]. Moreover, the models
discovered by our ADWPNAS achieve comparable results
with fewer parameters, especially on the CIFAR-10
dataset.
References
[1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh
Raskar. Designing neural network architectures using rein-
forcement learning. arXiv preprint arXiv:1611.02167, 2016.
2
[2] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le.
Neural optimizer search with reinforcement learning. In Pro-
ceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 459–468. JMLR. org, 2017. 1
[3] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun
Wang. Efficient architecture search by network transforma-
tion. In Thirty-Second AAAI Conference on Artificial Intelli-
gence, 2018. 2
[4] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct
neural architecture search on target task and hardware. arXiv
preprint arXiv:1812.00332, 2018. 2
[5] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Pro-
gressive differentiable architecture search: Bridging the
depth gap between search and evaluation. arXiv preprint
arXiv:1904.12760, 2019. 1, 2
[6] Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fair-
nas: Rethinking evaluation fairness of weight sharing neural
architecture search. arXiv preprint arXiv:1907.01845, 2019.
4
[7] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasude-
van, and Quoc V Le. Autoaugment: Learning augmentation
policies from data. arXiv preprint arXiv:1805.09501, 2018.
7
[8] Terrance DeVries and Graham W Taylor. Improved regular-
ization of convolutional neural networks with cutout. arXiv
preprint arXiv:1708.04552, 2017. 7
[9] Xuanyi Dong and Yi Yang. Searching for a robust neural ar-
chitecture in four gpu hours. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 1761–1770, 2019. 2, 6, 7, 8
[10] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn:
Learning scalable feature pyramid architecture for object de-
tection. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7036–7045, 2019. 1
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016. 2
[12] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-
tional neural networks for mobile vision applications. arXiv
preprint arXiv:1704.04861, 2017. 7
[13] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil-
ian Q Weinberger. Densely connected convolutional net-
works. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4700–4708, 2017. 2, 6
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980,
2014. 7
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple
layers of features from tiny images. Technical report, Cite-
seer, 2009. 6
[16] Xin Li, Yiming Zhou, Zheng Pan, and Jiashi Feng. Partial
order pruning: for best speed/accuracy trade-off in neural
architecture search. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 9145–
9153, 2019. 2
[17] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig
Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-
deeplab: Hierarchical neural architecture search for semantic
image segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 82–92,
2019. 1
[18] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon
Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan
Huang, and Kevin Murphy. Progressive neural architecture
search. In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 19–34, 2018. 2, 6, 7
[19] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha
Fernando, and Koray Kavukcuoglu. Hierarchical repre-
sentations for efficient architecture search. arXiv preprint
arXiv:1711.00436, 2017. 1, 2
[20] Hanxiao Liu, Karen Simonyan, and Yiming Yang.
Darts: Differentiable architecture search. arXiv preprint
arXiv:1806.09055, 2018. 1, 2, 3, 5, 6, 7
[21] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and
Jeff Dean. Efficient neural architecture search via parameter
sharing. arXiv preprint arXiv:1802.03268, 2018. 1, 2, 6
[22] Boris T Polyak and Anatoli B Juditsky. Acceleration of
stochastic approximation by averaging. SIAM Journal on
Control and Optimization, 30(4):838–855, 1992. 7
[23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V
Le. Regularized evolution for image classifier architecture
search. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 4780–4789, 2019. 1, 6, 7
[24] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Sax-
ena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey
Kurakin. Large-scale evolution of image classifiers. In Pro-
ceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 2902–2911. JMLR.org, 2017.
1, 2
[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, et al. Imagenet large
scale visual recognition challenge. International journal of
computer vision, 115(3):211–252, 2015. 6
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. Going deeper with
convolutions. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1–9, 2015.
7
[27] Tom Veniat and Ludovic Denoyer. Learning time/memory-
efficient deep architectures with budgeted super networks.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3492–3500, 2018. 2
[28] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin.
Snas: stochastic neural architecture search. arXiv preprint
arXiv:1812.09926, 2018. 2, 6, 7
[29] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hy-
pernetworks for neural architecture search. arXiv preprint
arXiv:1810.05749, 2018. 2
[30] Barret Zoph and Quoc V Le. Neural architecture search with
reinforcement learning. arXiv preprint arXiv:1611.01578,
2016. 1, 2, 3
[31] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V
Le. Learning transferable architectures for scalable image
recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 8697–8710,
2018. 1, 2, 3, 6, 7, 8
