BNAS: An Efficient Neural Architecture Search Approach Using Broad
  Scalable Architecture by Ding, Zixiang et al.
Zixiang Ding, Yaran Chen, Nannan Li, Dongbin Zhao, Fellow, IEEE, Zhiquan Sun,
and C.L.Philip Chen, Fellow, IEEE
Abstract—Efficient Neural Architecture Search (ENAS)
achieves high efficiency in learning high-performance architectures
via parameter sharing and reinforcement learning. In the architecture
search phase, ENAS employs a deep scalable architecture as its search
space, and training this architecture consumes most of the search cost.
Moreover, the time consumed by model training is proportional to the
depth of the deep scalable architecture. As a result, reducing the
number of layers of the scalable architecture is an effective way to
accelerate the search process of ENAS, but it suffers from a
prohibitive performance drop.
In this paper, we propose Broad Neural Architecture Search (BNAS), in
which we elaborately design a broad scalable architecture dubbed Broad
Convolutional Neural Network (BCNN) to solve the above issue. On one
hand, the proposed broad scalable architecture trains fast due to its
shallow topology. Moreover, we adopt the reinforcement learning and
parameter sharing used in ENAS as the optimization strategy of BNAS.
Hence, the proposed approach can achieve higher search efficiency. On
the other hand, the broad scalable architecture extracts multi-scale
features and enhancement representations, and feeds them into a global
average pooling layer to yield more reasonable and comprehensive
representations. Therefore, the performance of the broad scalable
architecture can be ensured. In particular, we also develop two
variants of BNAS that modify the topology of BCNN. To verify the
effectiveness of BNAS, several experiments are performed, and the
experimental results show that 1) BNAS delivers a search cost of 0.19
day, which is 2.37x less expensive than ENAS, the best-ranked
reinforcement learning based NAS approach, 2) compared with several
small-sized models (about 0.5 and 1.1 million parameters), the
architecture learned by BNAS obtains state-of-the-art performance
(3.58% and 3.24% test error) on CIFAR-10, and 3) the learned
architecture achieves 25.3% top-1 error on ImageNet using just 3.9
million parameters.
I. INTRODUCTION
As a biologically inspired deep learning technique, convolutional
neural networks (CNNs) are powerful tools for solving many intractable
problems, e.g. computer vision tasks [1, 2, 3, 4], game artificial
intelligence [5, 6], intelligent transportation
Z. Ding, Y. Chen, N. Li and D. Zhao are with the State Key Laboratory
of Management and Control for Complex Systems, Institute of Automation,
Chinese Academy of Sciences, Beijing 100190, China, and also with the
University of Chinese Academy of Sciences, Beijing 100049, China (e-mail:
{dingzixiang2018, chenyaran2013, linannan2017, dongbin.zhao}@ia.ac.cn).
Z. Sun is with the School of Automation and Electric Engineering,
University of Science and Technology Beijing, Beijing 100083, China
(e-mail: 41723308@xs.ustb.edu.cn).
C. L. P. Chen is with the School of Computer Science & Engineering, South
China University of Technology, Guangzhou, Guangdong 510006, China,
and also with the College of Navigation, Dalian Maritime University, Dalian
116026, China (e-mail: philip.chen@ieee.org).
This work was supported by No. GJHZ1849 International Partnership
Program of Chinese Academy of Sciences.
Fig. 1. Deep scalable architecture of ENAS for CIFAR-10. [Diagram: Input,
Normal Cell ×N, Reduction Cell, Normal Cell ×N, Reduction Cell,
Normal Cell ×N, Softmax.]
systems [7] and intelligent robots [8]. However, their remarkable
performance mainly depends on human experts with a considerable amount
of expertise. Neural Architecture Search (NAS) [9], which automates the
process of model design, has gained ground in recent years. Computer
vision tasks (e.g. image classification [10, 11, 12, 13], semantic
segmentation [14]) and other artificial intelligence tasks (e.g.
natural language processing [9, 15, 16]) can all be solved by NAS with
surprising performance. However, early approaches [9, 11, 17] suffered
from inefficiency. To solve this issue, several one-shot approaches
[10, 15, 16, 18, 19, 20] were proposed. Generally speaking, one-shot
NAS approaches sample Cells (a Normal Cell and a Reduction Cell), a
micro search space presented in [11], from a family of predefined
candidate operations according to a policy. The sampled Cells are
treated as the building blocks of a deep scalable architecture (refer
to Fig. 1), i.e. the child model, whose performance is used for
updating the parameters of the policy. These one-shot approaches avoid
retraining each candidate deep scalable architecture from scratch, so
that high efficiency can be ensured.
In particular, Efficient Neural Architecture Search (ENAS) [16]
delivers an efficiency of 0.45 GPU day (the best among reinforcement
learning based NAS frameworks) by sharing the parameters of a one-shot
model in which all possible child models are defined. In order to
improve the performance and robustness of the discovered architecture,
ENAS sets the number of Cells to 8 (6 Normal Cells and 2 Reduction
Cells) and 17 (15 Normal
arXiv:2001.06679v2 [stat.ML] 1 Jul 2020
Fig. 2. Performance (ranked from best to worst) of ten candidate
architectures of ENAS when using various numbers of cells of the deep
scalable architecture in the search phase on CIFAR-10 without the
Cutout [21] technique (x-axis: Architecture; y-axis: Accuracy (%)).
Shallow case: test accuracy of architectures using 5 cells in the
search phase and 17 cells in the derivation phase. Deep case: test
accuracy of architectures using 8 cells in the search phase and 17
cells in the derivation phase.
Cells and 2 Reduction Cells) in the architecture search and
architecture derivation phases, respectively. The architecture search
phase is a loop: 1) ENAS trains the one-shot model on a proxy data set
(e.g. CIFAR-10) for a single epoch, where a recurrent neural network
(RNN) controller is adopted to sample a child model from the one-shot
model in each step. 2) To learn better architectures, the RNN
controller is optimized by reinforcement learning for a fixed number of
steps. Here, a child model represented by a sequence is sampled by the
RNN controller. For reinforcement learning, the sequence is treated as
a list of action tokens, and the validation accuracy of the child model
is treated as the reward. On one hand, the time consumed by model
training and inference is proportional to the depth of the model. On
the other hand, the topologies of the one-shot model and the child
model are both deep in ENAS. As a result, training the one-shot model
and obtaining the validation accuracy of the child model consume most
of the search cost in ENAS. Fortunately, reducing the depth of the
scalable architecture in the architecture search phase can ameliorate
the above issue.
Experiments on ENAS¹ are carried out to compare efficiency and
accuracy when using 5 (shallow case) and 8 (deep case, also the default
setting of ENAS) Cells in the architecture search phase of ENAS on
CIFAR-10. Experimental results (Fig. 2) show that depth reduction of
the scalable architecture is able to improve the search efficiency of
ENAS, but suffers from a prohibitive performance drop. With a single
GeForce GTX 2080Ti GPU, the shallow case and the deep case cost 0.26
and 0.38 day, respectively. However, the deep case delivers better
performance (a gap of about 0.3% between the top performing
architectures of the two cases) than the shallow one. Hence, designing
a high-performance scalable architecture with a shallow topology is an
effective way to develop a more efficient NAS approach.
In this paper, we propose Broad Neural Architecture Search
(BNAS), an automatic architecture search approach with state-
of-the-art efficiency. Different from other NAS approaches,
¹ https://github.com/melodyguan/enas
in BNAS, an elaborately designed broad scalable architecture dubbed
Broad Convolutional Neural Network (BCNN), instead of a deep one, is
discovered by parameter sharing and reinforcement learning. BCNN
consists of convolution blocks and enhancement blocks. On one hand,
each convolution block is fully connected with the enhancement blocks.
On the other hand, all outputs of convolution blocks and enhancement
blocks are fed into the global average pooling layer as inputs.
Furthermore, we also develop two variants of BNAS by modifying the
topology of the broad scalable architecture: 1) Cascade of Convolution
blocks with its Last block connected to the Enhancement blocks Broad
Neural Architecture Search (BNAS-CCLE) and 2) Cascade of Convolution
blocks and Enhancement blocks Broad Neural Architecture Search
(BNAS-CCE). The proposed broad scalable architectures extract
multi-scale features and enhancement representations, and feed them
into a global average pooling layer to yield more reasonable and
comprehensive representations, so that the performance of architectures
learned by BNAS and its two variants can be ensured. Our contributions
can be summarized as follows:
• We propose BNAS to further improve the efficiency of NAS by
replacing the deep scalable architecture with a broad one dubbed
BCNN, which is elaborately designed to achieve satisfactory
performance and fast search efficiency simultaneously.
• We also propose two variants of BNAS, dubbed BNAS-CCLE and
BNAS-CCE, which modify the topology of BCNN. BNAS and its two
variants all excel at discovering high-performance architectures
with fast search efficiency.
• We achieve 2.37x less search cost (0.19 day with a single GeForce
GTX 1080Ti GPU on CIFAR-10) than ENAS [16], the best-ranked
reinforcement learning based NAS approach. Furthermore, through
extensive experiments on CIFAR-10, we show that the architecture
learned by BNAS obtains state-of-the-art performance (3.58% and
3.24% test error) compared with several small-sized models (0.5 and
1.1 million parameters).
• We transfer the learned architecture to a large scale image
classification task to shed light on the powerful transferability
and multi-scale feature extraction capacity of BCNN. The learned
architecture achieves 25.7% top-1 error using just 3.9 million
parameters on ImageNet.
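As a quick check, the 2.37x figure in the third contribution is the ratio of the two search costs reported in this paper (0.45 GPU day for ENAS and 0.19 day for BNAS). The worked arithmetic below uses only those two numbers:

```latex
\[
\frac{\text{search cost of ENAS}}{\text{search cost of BNAS}}
  = \frac{0.45}{0.19} \approx 2.37
\]
```

Table II lists 0.19 GPU day for BNAS-CCE and 0.20 GPU day for BNAS and BNAS-CCLE; with 0.20 the same ratio is 2.25.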
The remainder of this paper is organized as follows. In Section II, we
review work related to this paper. Then, the proposed BNAS is described
in Section III. Subsequently, two variants of BNAS are given in Section
IV. Next, experiments on two data sets are performed, and qualitative
and quantitative analyses are given in Section V. Finally, we draw some
conclusions in Section VI.
II. RELATED WORK
A. Neural Architecture Search
Recently, various methods were adopted to improve the search
efficiency of NAS [22, 23, 24, 25]. Based on the Sequential Model-Based
Optimization (SMBO) strategy, Progressive Neural Architecture Search
(PNAS) [23] was proposed to
Fig. 3. Structure of broad learning system. [Diagram: the input X
generates the mapped features Z_1, Z_2, ..., Z_n; the mapped features
generate the enhancement nodes H_1, H_2, ..., H_m; all mapped features
and enhancement nodes are connected to the output Ŷ.]
determine the structure of convolutional neural networks in order of
increasing complexity. A multi-objective evolutionary algorithm was
proposed to improve the efficiency of architecture search in LEMONADE
[24]. Moreover, Pareto-optimal neural architectures were discovered for
device-related and device-agnostic objectives in DPP-Net [25]. However,
these approaches were still not efficient enough because they retrained
each candidate model from scratch.
A great number of one-shot approaches [10, 15, 16, 26, 27] have been
presented to further improve the efficiency of NAS. These one-shot
approaches define all possible candidate architectures in a one-shot
model to avoid retraining each child model from scratch. SMASH [10]
used a hyper-network to generate the weights of a designed architecture
so that the search process could be accelerated greatly. The
aforementioned ENAS [16] used parameter sharing to avoid retraining
each candidate deep architecture from scratch. Furthermore, Liu et al.
[15] proposed Differentiable ARchiTecture Search (DARTS). DARTS
searched for computation Cells within a continuous domain, formulating
NAS in a differentiable way, and was three orders of magnitude less
expensive than previous approaches [9, 11]. Subsequently, several
variants of DARTS (e.g., P-DARTS [26], PC-DARTS [27]) were proposed.
Based on the framework of DARTS, P-DARTS modified the scalable
architecture in a progressive way, and delivered a search efficiency of
0.3 GPU day. PC-DARTS activated partial channel connections of the Cell
to reduce the memory usage of DARTS, so that a larger batch size could
be set in the architecture search phase. The larger batch size helped
reduce the search cost and uncertainty in NAS. Moreover, PC-DARTS
achieved state-of-the-art efficiency (0.1 day) on a single GeForce GTX
1080Ti GPU.
B. Broad Learning System
The Broad Learning System (BLS) [28, 29] was proposed as a developed
model of the Random Vector Functional-Link Neural Network (RVFLNN)
[30, 31], which took input data directly to build enhancement nodes.
Different from RVFLNN, BLS first established a set of mapped features
from the input data to achieve satisfactory performance.
We show the structure of BLS in Fig. 3. Feature mapping nodes and
enhancement nodes are the two main components of BLS. In [28], the
mechanism of BLS was introduced: 1) Nonlinear transformation functions
of the feature mapping nodes were applied to generate the mapped
features of the input data. 2) The mapped features were enhanced by
enhancement nodes with randomly generated weights to generate
enhancement features. 3) All the mapped features and enhancement
features were used to deliver the final result. Chen et al. introduced
several variants of BLS in [29], e.g. Cascade of Convolution Feature
mapping nodes Broad Learning System (CCFBLS), Cascade of Feature
mapping nodes Broad Learning System (CFBLS), and Cascade of Feature
mapping nodes and Enhancement nodes Broad Learning System (CFEBLS).
Below, CCFBLS, CFBLS and CFEBLS are introduced in detail.
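The three-step mechanism of BLS described above can be written compactly. The block below is a sketch in the notation commonly used in the BLS literature, not a quotation from this paper: the transformation names $f_i$ and $g_j$, the weights and biases $W_{e_i}, \beta_{e_i}, W_{h_j}, \beta_{h_j}$, and the output weights $W$ are assumed names introduced for illustration, while $X$, $Z_i$, $H_j$ and $\hat{Y}$ follow the labels of Fig. 3.

```latex
\begin{align*}
Z_i &= f_i(X W_{e_i} + \beta_{e_i}), \quad i = 1, \ldots, n, \\
H_j &= g_j([Z_1, \ldots, Z_n]\, W_{h_j} + \beta_{h_j}), \quad j = 1, \ldots, m, \\
\hat{Y} &= [Z_1, \ldots, Z_n \mid H_1, \ldots, H_m]\, W.
\end{align*}
```

The first line corresponds to step 1 (mapped features), the second to step 2 (enhancement features with randomly generated $W_{h_j}, \beta_{h_j}$), and the third to step 3 (all features jointly produce the final result).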
CCFBLS is made up of feature mapping nodes and enhancement nodes. As
described in [29], the mapped features were generated by a cascade of
convolution and pooling operations in the feature mapping nodes. Then,
these mapped features were enhanced by a nonlinear activation function
to obtain a series of enhancement features. Finally, all of the mapped
features and enhancement features were connected directly with the
desired output. CCFBLS was not only broad but also deep. As a result,
CCFBLS could extract multi-layer features and representations that
were more reasonable and comprehensive than those of models with only
a deep structure. There were two main differences between CCFBLS and
the other two variants: 1) The feature mapping nodes of CFBLS and
CFEBLS were neurons rather than convolution and pooling operations. 2)
Their topologies were different. For CFBLS, the output of each group
of feature mapping nodes was fed into every group of enhancement nodes
as input. For CFEBLS, the first group of enhancement nodes accepted
the outputs of every group of feature mapping nodes as input, and each
other group of enhancement nodes treated the output of its previous
group as input.
Motivated by traditional BLS, we elaborately design a new broad
scalable architecture of convolutional neural network dubbed BCNN.
Furthermore, we propose BNAS by combining the proposed broad scalable
architecture and parameter sharing to improve the efficiency of ENAS
and other Cell-based NAS frameworks. Beyond that, we also develop two
variants of BNAS inspired by the variants of BLS.
Fig. 4. The core of BNAS: broad convolutional neural network. Top: the
topology of BCNN (Input; convolution blocks Conv_1, Conv_2, ..., Conv_u;
enhancement blocks En_1, En_2, ..., En_v; Global Average Pooling; FC).
Bottom left: the structure of a convolution block (deep Cell ×k, broad
Cell ×1). Bottom right: the structure of an enhancement block
(enhancement Cell ×1).
III. BROAD NEURAL ARCHITECTURE SEARCH
BNAS combines a broad scalable architecture with parameter sharing to
deliver faster search efficiency than ENAS, the best-ranked
reinforcement learning based NAS approach. On one hand, we introduce
the design details of the proposed broad scalable architecture dubbed
BCNN, which can resolve the side effect of layer reduction in
Cell-based NAS approaches. On the other hand, the efficient
optimization strategy of BNAS is given.
A. Broad Convolutional Neural Network
As described in Section I, a set of comparative experiments shows that
layer reduction is able to improve the search efficiency of ENAS but
suffers from a prohibitive performance drop. Inspired by BLS, which
achieves satisfactory performance using a broad topology, we propose a
broad scalable architecture dubbed BCNN that delivers fast training and
inference speed, to ameliorate the above issue.
For an intuitive understanding, the structure of BCNN and its two
important components, the convolution and enhancement blocks, are
depicted in Fig. 4. The proposed BCNN consists of u convolution blocks
denoted as Conv_i (i = 1, 2, ..., u) and v enhancement blocks denoted
as En_j (j = 1, 2, ..., v), which are used for feature extraction and
enhancement, respectively. In each convolution block, there are k + 1
convolution Cells: k deep Cells and a single broad Cell, which are
utilized for deep and broad feature extraction, respectively. Moreover,
u is determined by the size of the input images. For example, we set
u = 2 for the experiments on CIFAR-10 with 32 × 32 pixels. The other
two parameters k and v need to be defined by the user. Through
substantial experiments on CIFAR-10, we will analyze the influence of
k and v in Section V-D1.
In each convolution block, the deep Cells and the broad Cell have the
same topology but different strides: one for the deep Cells and two for
the broad Cell. In order to extract broad features from the output
features of the final deep Cell, the broad Cell returns feature maps
with half the width, half the height and double the channels (i.e.
broad features). In each enhancement block, there is a single
enhancement Cell with a stride of one and a different topology from the
convolution Cells. The proposed BCNN stacks u convolution blocks one
after another, and feeds the outputs of every convolution block into
each enhancement block as input to obtain enhancement feature
representations. The convolution and enhancement features from every
convolution and enhancement block are all connected with the global
average pooling layer to yield more reasonable and comprehensive
representations, which ensures the performance of the proposed BCNN.
For clarity, the formulaic expressions of BCNN are given below.
For convolution block Conv_i, its deep feature mappings $Z_h^{(i)}$
($h = 1, 2, \ldots, k$) and broad feature mapping $Z_{k+1}^{(i)}$ can
be defined as

$$Z_h^{(i)} = \varphi\big(Z_{h-2}^{(i)}, Z_{h-1}^{(i)}; \{W_h^{(i)\mathrm{deep}}, \beta_h^{(i)\mathrm{deep}}\}\big), \quad i = 1, 2, \ldots, u, \tag{1}$$

$$Z_{k+1}^{(i)} = \varphi\big(Z_{k-1}^{(i)}, Z_k^{(i)}; \{W_{k+1}^{(i)\mathrm{broad}}, \beta_{k+1}^{(i)\mathrm{broad}}\}\big), \quad i = 1, 2, \ldots, u, \tag{2}$$

where $\{W_h^{(i)\mathrm{deep}}, \beta_h^{(i)\mathrm{deep}}\}$ and
$\{W_{k+1}^{(i)\mathrm{broad}}, \beta_{k+1}^{(i)\mathrm{broad}}\}$ are
the weight and bias matrices of the deep Cells and the broad Cell in
convolution block i, respectively. Moreover, $\varphi(\cdot)$ is a set
of transformations (e.g. depthwise-separable convolution [32], pooling,
skip connection) applied by the deep Cells and the broad Cell. In other
words, each Cell in the convolution block uses the outputs of its
previous two Cells as inputs in order to combine various features.
However, $Z_{-1}^{(i)}$ and $Z_0^{(i)}$ are left undefined in (1). A
complementary expression is given as

$$\{Z_{-1}^{(i)}, Z_0^{(i)}\} = \{Z_k^{(i-1)}, Z_{k+1}^{(i-1)}\}, \quad i = 2, 3, \ldots, u. \tag{3}$$

Moreover, a convolution with 3 × 3 kernel size is inserted as the stem
of BCNN to provide the input for the first and second convolution
Cells. As a result, the output of the 3 × 3 convolution can be
represented as $Z_\xi^{(1)}$, where $\xi \le 0$.
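The stride rules of the deep and broad Cells determine the feature-map shape after every Cell. The following minimal sketch tracks those shapes through the u convolution blocks. It assumes square inputs and a stem that keeps the spatial size; the function name `bcnn_conv_shapes` and the value `stem_channels=16` are hypothetical and chosen only for illustration, while u = 2 and 32 × 32 inputs follow the CIFAR-10 setting stated in the text.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def bcnn_conv_shapes(input_hw: int, stem_channels: int, u: int, k: int) -> List[Tuple[str, Shape]]:
    """Track the output shape of every Cell in the u convolution blocks.

    Assumptions (not from the paper): square inputs and a stem that keeps
    the spatial size and emits `stem_channels` channels.
    """
    h = w = input_hw
    c = stem_channels
    shapes = [("stem", (h, w, c))]
    for i in range(1, u + 1):
        # k deep Cells with stride 1: spatial size and channels are unchanged.
        for cell in range(1, k + 1):
            shapes.append((f"Z_{cell}^({i}) deep", (h, w, c)))
        # One broad Cell with stride 2: half width, half height, double channels.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((f"Z_{k + 1}^({i}) broad", (h, w, c)))
    return shapes


if __name__ == "__main__":
    for name, shape in bcnn_conv_shapes(input_hw=32, stem_channels=16, u=2, k=1):
        print(name, shape)
```

With u = 2, the two broad Cells reduce a 32 × 32 input to 8 × 8, which illustrates why u is tied to the input image size.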
For enhancement block En_j (j = 1, 2, ..., v), its enhancement feature
representations $H^{(j)}$ can be mathematically expressed as

$$H^{(j)} = \phi\big(\delta(Z_0^{(1)}, Z_{k+1}^{(1)}, \cdots, Z_{k+1}^{(u-1)}), Z_{k+1}^{(u)}; \{W^{(j)}, \beta^{(j)}\}\big), \tag{4}$$

where $W^{(j)}$ and $\beta^{(j)}$ are the weight and bias matrices of
the enhancement Cell in En_j, respectively. Moreover, $\phi(\cdot)$ is
a set of transformations applied by the enhancement Cell, and
$\delta(\cdot)$ is a function combining 1 × 1 convolution and
concatenation.
Here, prior knowledge is incorporated into the connections between
convolution blocks and enhancement blocks. Based on substantial
experiments, we find that low-resolution feature maps are more
important than high-resolution feature maps for achieving high
performance. In other words, to design a BCNN with strong performance,
more deep and broad feature maps of Conv_r than of Conv_s should be fed
into the enhancement block, where r > s and 0 < s, r ≤ u. To build this
prior knowledge into BCNN, a group of convolutions with 1 × 1 kernel
size is employed in each connection between a convolution block (except
the last one) and an enhancement block. These 1 × 1 convolutions accept
the feature representations from the final deep Cell in each
convolution block, and output a group of feature maps with different
importance. The importance is represented by the number of output
channels: the more output channels, the more important the feature
maps. Furthermore, these 1 × 1 convolutions employ different strides so
that all input feature maps can be concatenated at the same size. As a
result, the output of
$\delta(Z_0^{(1)}, Z_{k+1}^{(1)}, \cdots, Z_{k+1}^{(u-1)})$ is obtained
by concatenating the outputs of the 1 × 1 convolutions whose inputs are
$Z_0^{(1)}, Z_{k+1}^{(1)}, \cdots, Z_{k+1}^{(u-1)}$.
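The role of the strides and channel counts in δ(·) can be made concrete with a small planning sketch. It is an illustration under stated assumptions rather than the authors' implementation: the target spatial size is assumed to be that of the last broad feature map, and the rule that each lower-resolution input receives twice the output channels of the previous one is borrowed from the GAP-layer allocation described later in the paper. The names `delta_plan` and `base_channels` are hypothetical.

```python
from typing import List, Tuple


def delta_plan(inputs: List[Tuple[str, int]], target_hw: int,
               base_channels: int) -> Tuple[List[Tuple[str, int, int]], int]:
    """Plan one 1x1 convolution per input of delta(.).

    `inputs` lists (name, spatial size) from highest to lowest resolution.
    Returns (name, stride, out_channels) per input and the channel count
    of the concatenated result.
    Assumptions (not from the paper): the target size is given by the
    caller, and importance doubles from one input to the next.
    """
    plan = []
    for rank, (name, hw) in enumerate(inputs):
        if hw % target_hw != 0:
            raise ValueError(f"{name}: size {hw} is not a multiple of {target_hw}")
        stride = hw // target_hw                   # brings every map to target_hw
        out_channels = base_channels * 2 ** rank   # hypothetical importance rule
        plan.append((name, stride, out_channels))
    total_channels = sum(c for _, _, c in plan)    # channels after concatenation
    return plan, total_channels


if __name__ == "__main__":
    # CIFAR-10 setting with u = 2: the stem output is 32x32, the first
    # broad feature map is 16x16, and the assumed target is 8x8.
    plan, total = delta_plan([("Z_0^(1)", 32), ("Z_k+1^(1)", 16)],
                             target_hw=8, base_channels=8)
    print(plan, total)
```

Because every planned convolution emits maps of the same spatial size, the outputs can be concatenated along the channel axis, which is the condition stated in the text.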
To ensure the performance of BCNN, all outputs of each convolution and
enhancement block are connected directly with the global average
pooling (GAP) layer to yield more reasonable and comprehensive
representations. Here, the output of the last deep Cell in each
convolution block is connected in order to feed multi-scale features
into the GAP layer, so that the final output of the GAP layer can be
expressed as

$$O = \psi\big(Z_k^{(1)}, Z_k^{(2)}, \cdots, Z_k^{(u)}, H^{(1)}, H^{(2)}, \cdots, H^{(v)}\big), \tag{5}$$

where $\psi(\cdot)$ is a function combining 1 × 1 convolution,
concatenation and global average pooling. Similarly, the aforementioned
prior knowledge is also incorporated into the connections between
convolution blocks and the GAP layer. Moreover, the outputs of all
enhancement blocks are given equal importance.
B. Efficient Optimization Strategy
In BNAS, the combination of parameter sharing presented in ENAS [16]
and reinforcement learning is adopted to find a broad scalable
architecture with satisfactory performance. Here, a Long Short-Term
Memory (LSTM) [33] controller with parameters θ is trained in a loop
with two phases: broad one-shot model training and LSTM training.
Moreover, the broad one-shot model defines all possible broad scalable
architectures.
In the first phase, the broad one-shot model is trained for a single
epoch by selective activation, where the architecture trained in each
step is a subset of the broad one-shot model:
1) The LSTM first generates two types of Cells, a convolution Cell and
an enhancement Cell, as a list of tokens $a_{1:T}$ according to a
sampling policy $\pi(\cdot)$, and the Cells are stacked into a broad
child model m (refer to Fig. 4).
2) The sampled broad child model m (a sub-network of the broad
one-shot model) is trained by a gradient-based algorithm on a
mini-batch of data.
3) The above steps are repeated until the termination condition is
reached.
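The selective activation used in the first phase can be illustrated with a toy sketch of parameter sharing. It is deliberately simplified and rests on assumptions: each (node, operation) pair of the one-shot model holds a single scalar weight, uniform sampling stands in for the LSTM controller, and the loss 0.5·(w − 1)² is a hypothetical placeholder for the real training loss. The operation names are those that appear in Fig. 6.

```python
import random
from typing import Dict, List, Tuple

OPS = ["sep3x3", "sep5x5", "avg", "max", "skip"]  # operation labels seen in Fig. 6

Key = Tuple[int, str]  # (node index, operation name)


def build_one_shot(num_nodes: int) -> Dict[Key, float]:
    """One shared (toy, scalar) weight per (node, operation) pair."""
    return {(n, op): 0.0 for n in range(num_nodes) for op in OPS}


def sample_child(num_nodes: int, rng: random.Random) -> List[Key]:
    """Pick one operation per node; uniform sampling stands in for the LSTM."""
    return [(n, rng.choice(OPS)) for n in range(num_nodes)]


def train_step(shared: Dict[Key, float], child: List[Key], lr: float = 0.1) -> None:
    """Update only the shared weights that the sampled child activates."""
    for key in child:
        grad = shared[key] - 1.0  # gradient of the toy loss 0.5 * (w - 1)^2
        shared[key] -= lr * grad


if __name__ == "__main__":
    rng = random.Random(0)
    shared = build_one_shot(num_nodes=3)
    for _ in range(5):  # a few mini-batch steps, each with a fresh child
        train_step(shared, sample_child(3, rng))
    print({k: round(w, 3) for k, w in shared.items() if w != 0.0})
```

Every child reads and writes the same weight table, so a later child that reuses a (node, operation) pair starts from the weights left by earlier children instead of training from scratch.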
In the second phase, the LSTM is trained by reinforcement learning for
a fixed number of steps:
1) Similar to step 1) of broad one-shot model training, a broad child
model m is sampled by the LSTM.
2) The weights w of the sampled broad child model are inherited from
the broad one-shot model.
3) The validation accuracy R(m; w) is obtained on the desired task.
4) R(m; w) is treated as the reward of reinforcement learning to guide
the LSTM towards discovering Cells with better performance.
5) The above steps are repeated until the termination condition is
reached.
Here, BNAS requires the LSTM controller to maximize the expected reward
$J(\theta)$, where

$$J(\theta) = \mathbb{E}_{\pi(a_{1:T};\theta)}[R(m; w)]. \tag{6}$$

Moreover, a policy gradient algorithm, REINFORCE [34], is applied to
compute the policy gradient $\nabla_\theta J(\theta)$, where

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{\pi(a_{1:T};\theta)}\big[\nabla_\theta \log \pi(a_t \mid a_{(t-1):1}; \theta)\, R(m; w)\big]. \tag{7}$$

After many iterations of this loop, Cells with satisfactory performance
can be found.
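A single-sample estimate of the gradient in (7) can be sketched in a few lines. The sketch below makes several simplifying assumptions that differ from BNAS: the policy is a table of independent softmax logits per decision step instead of an LSTM (so each step ignores the previous actions), the reward is a hypothetical toy function instead of validation accuracy, and no baseline is used. All function names are introduced here for illustration.

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_actions(theta: List[List[float]], rng: random.Random) -> List[int]:
    """Sample a_1..a_T with one categorical draw per decision step."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]


def reinforce_gradient(theta: List[List[float]], actions: Sequence[int],
                       reward: float) -> List[List[float]]:
    """Single-sample estimate of (7): sum over t of grad log pi(a_t) * R."""
    grad = []
    for row, a in zip(theta, actions):
        probs = softmax(row)
        # d/d logit_j of log softmax(row)[a] equals 1[j == a] - probs[j].
        grad.append([((1.0 if j == a else 0.0) - p) * reward
                     for j, p in enumerate(probs)])
    return grad


def train_controller(reward_fn: Callable[[Sequence[int]], float], steps_t: int,
                     num_ops: int, iterations: int, lr: float,
                     seed: int = 0) -> List[List[float]]:
    """Gradient ascent on J(theta) using one sampled child per iteration."""
    rng = random.Random(seed)
    theta = [[0.0] * num_ops for _ in range(steps_t)]
    for _ in range(iterations):
        actions = sample_actions(theta, rng)
        grad = reinforce_gradient(theta, actions, reward_fn(actions))
        for row, g_row in zip(theta, grad):
            for j, g in enumerate(g_row):
                row[j] += lr * g
    return theta


if __name__ == "__main__":
    # Hypothetical reward: fraction of steps that pick operation 0.
    toy_reward = lambda acts: sum(1 for a in acts if a == 0) / len(acts)
    theta = train_controller(toy_reward, steps_t=3, num_ops=4, iterations=500, lr=0.5)
    print([[round(p, 2) for p in softmax(row)] for row in theta])
```

In BNAS the reward passed to the update would be the validation accuracy R(m; w) of the sampled broad child model, evaluated with weights inherited from the broad one-shot model.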
IV. VARIANTS OF BROAD NEURAL
ARCHITECTURE SEARCH
BLS is a flexible paradigm that can be modified under different
constraints. Moreover, several variants have been proposed in [29],
e.g., CCFBLS and CFEBLS. Motivated by these two variants of BLS, we
also propose two developed versions
Fig. 5. The topologies of the broad scalable architectures employed in
the two variants of BNAS (each panel: Input; Conv_1, Conv_2, ...,
Conv_u; En_1, En_2, ..., En_v; Global Average Pooling; FC). (a) The
broad scalable architecture of BNAS-CCLE. (b) The broad scalable
architecture of BNAS-CCE.
of BNAS that modify the topology of BCNN: 1) Cascade of Convolution
blocks with its Last block connected to the Enhancement blocks Broad
Neural Architecture Search (BNAS-CCLE) and 2) Cascade of Convolution
blocks and Enhancement blocks Broad Neural Architecture Search
(BNAS-CCE). In this section, we only introduce the difference between
BNAS and its two variants, i.e. the broad scalable architectures used
in BNAS-CCLE and BNAS-CCE, which are shown in Fig. 5.
A. BNAS-CCLE
The broad scalable architecture of BNAS-CCLE is a developed CCFBLS
[29], which is not only broad but also deep. In the broad scalable
architecture of BNAS-CCLE, each enhancement block only takes the output
of the last convolution block as input. In other words, we assume that
the output of the last convolution block carries all of the importance
for each enhancement block. For clarity, the formulaic expressions of
the broad scalable architecture used in BNAS-CCLE are given below.
The expressions for the convolution blocks can be found in (1) to (3).
For enhancement block En_j, its enhancement feature representations
$H^{(j)}$ can be defined as

$$H^{(j)} = \phi\big(Z_k^{(u)}, Z_{k+1}^{(u)}; \{W^{(j)}, \beta^{(j)}\}\big), \quad j = 1, 2, \ldots, v, \tag{8}$$

where $W^{(j)}$ and $\beta^{(j)}$ are the weight and bias matrices of
the enhancement Cell in enhancement block j, respectively. Similarly,
$\phi(\cdot)$ is a set of transformations applied by the enhancement
Cell.
TABLE I
OPTIMAL HYPER-PARAMETERS ABOUT BROAD SCALABLE
ARCHITECTURES OF BNAS AND ITS TWO VARIANTS

| Approach  | v_s | v_d | k_d |
|-----------|-----|-----|-----|
| BNAS      | 2   | 1   | 1   |
| BNAS-CCLE | 2   | 1   | 1   |
| BNAS-CCE  | 2   | 2   | 2   |

To improve performance, we also incorporate the aforementioned prior
knowledge used in BCNN into the connections between convolution blocks
and the GAP layer. On one hand, for the GAP layer, each convolution
block is twice as important as its previous one. For instance, the GAP
layer accepts a and 2a channels from Conv_x and Conv_y (y = x + 1),
respectively. On the other hand, all enhancement blocks are equally
important. For instance, the GAP layer accepts b channels from both
En_x and En_y (1 ≤ x, y ≤ v).
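Under this allocation, the number of channels reaching the GAP layer can be counted directly. The sketch below assumes that Conv_1 contributes a channels, so that Conv_i contributes $2^{i-1} a$, and that each of the v enhancement blocks contributes b channels; the symbol $C_{\mathrm{GAP}}$ is a name introduced here for illustration.

```latex
\[
C_{\mathrm{GAP}} = \sum_{i=1}^{u} 2^{\,i-1} a + \sum_{j=1}^{v} b
                 = (2^{u} - 1)\,a + v\,b
\]
```

For the CIFAR-10 setting u = 2, this gives $C_{\mathrm{GAP}} = 3a + vb$.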
B. BNAS-CCE
The broad scalable architecture of BNAS-CCE is a developed CFEBLS [29]
that employs a deeper topology than other variants of BLS. The broad
scalable architecture of BNAS-CCE differs from those of BNAS and
BNAS-CCLE in two ways: 1) The output of each convolution block is fed
into the first enhancement block as its input. 2) Enhancement blocks
are stacked one after another rather than in parallel. For clarity, the
formulaic expressions of the broad scalable architecture used in
BNAS-CCE are given below.
Similarly, the expressions for the convolution blocks are given by (1)
to (3). Furthermore, the enhancement feature representations $H^{(j)}$
($j = 1, 2, \ldots, v$) can be divided into three cases:

$$H^{(j)} = \begin{cases}
\phi\big(\delta(Z_{k+1}^{(1)}, \cdots, Z_{k+1}^{(u-1)}), Z_{k+1}^{(u)}; \{W^{(j)}, \beta^{(j)}\}\big), & \text{if } j = 1, \\
\phi\big(Z_{k+1}^{(u)}, H^{(1)}; \{W^{(j)}, \beta^{(j)}\}\big), & \text{if } j = 2, \\
\phi\big(H^{(j-2)}, H^{(j-1)}; \{W^{(j)}, \beta^{(j)}\}\big), & \text{otherwise},
\end{cases} \tag{9}$$

where $W^{(j)}$ and $\beta^{(j)}$ are the weight and bias matrices of
the enhancement Cell in enhancement block j, respectively. Similarly,
$\phi(\cdot)$ is a set of transformations applied by the enhancement
Cell, and $\delta(\cdot)$ is the aforementioned function combining
1 × 1 convolution and concatenation.

TABLE II
COMPARISON OF THE PROPOSED BNAS WITH OTHER NAS APPROACHES ON CIFAR-10
FOR DIFFERENT-SIZE MODELS UNDER IDENTICAL TRAINING CONDITIONS.

| Architecture | Error (%) | Params (M) | Search Cost (GPU Days) | Search Method | Topology |
|---|---|---|---|---|---|
| LEMONADE + cutout [24] | 4.57 | 0.5 | 80 | evolution | deep |
| DPP-Net + cutout [25] | 4.62 | 0.5 | 4.00 | evolution | deep |
| BNAS + cutout (ours) | 3.83 | 0.5 | 0.20 | RL | broad |
| BNAS-CCLE + cutout (ours) | 3.63 | 0.5 | 0.20 | RL | broad |
| BNAS-CCE + cutout (ours) | 3.58 | 0.6 | 0.19 | RL | broad |
| LEMONADE + cutout [24] | 3.69 | 1.1 | 80 | evolution | deep |
| DPP-Net + cutout [25] | 4.78 | 1.0 | 4.00 | evolution | deep |
| BNAS + cutout (ours) | 3.46 | 1.1 | 0.20 | RL | broad |
| BNAS-CCLE + cutout (ours) | 3.40 | 1.1 | 0.20 | RL | broad |
| BNAS-CCE + cutout (ours) | 3.24 | 1.0 | 0.19 | RL | broad |
| AmoebaNet-A + cutout [17] | 3.34 ± 0.06 | 3.2 | 3150 | evolution | deep |
| AmoebaNet-B + cutout [17] | 2.55 ± 0.05 | 2.8 | 3150 | evolution | deep |
| Hierarchical Evo [22] | 3.75 ± 0.12 | 15.7 | 300 | evolution | deep |
| LEMONADE + cutout [24] | 3.05 | 4.7 | 80 | evolution | deep |
| PNAS [23] | 3.41 | 3.2 | 225 | SMBO | deep |
| DARTS (second order) + cutout [15] | 2.83 ± 0.06 | 3.3 | 4.00 | gradient-based | deep |
| DARTS (first order) + cutout [15] | 3.00 | 2.9 | 1.50 | gradient-based | deep |
| P-DARTS + cutout [26] | 2.50 | 3.4 | 0.30 | gradient-based | deep |
| PC-DARTS + cutout [27] | 2.57 ± 0.07 | 3.6 | 0.10 | gradient-based | deep |
| NASNet-A + cutout [11] | 2.65 | 3.3 | 1800 | RL | deep |
| NASNet-B + cutout [11] | 3.73 | 2.6 | 1800 | RL | deep |
| MANAS + cutout [35] | 2.63 | 3.4 | 2.80 | RL | deep |
| ENAS + cutout [16] | 2.89 | 4.6 | 0.45 | RL | deep |
| BNAS + cutout (ours) | 2.97 | 4.7 | 0.20 | RL | broad |
| BNAS-CCLE + cutout (ours) | 2.95 | 4.1 | 0.20 | RL | broad |
| BNAS-CCE + cutout (ours) | 2.88 | 4.8 | 0.19 | RL | broad |
Similar to BNAS and BNAS-CCLE, in the broad scalable architecture of
BNAS-CCE the 1 × 1 convolution based prior knowledge is inserted
between the convolution blocks and the first enhancement block, and
also into the connections between the convolution blocks and the GAP
layer. Different from BNAS and BNAS-CCLE, the enhancement blocks in the
broad scalable architecture of BNAS-CCE are not equally important for
the GAP layer. For instance, the GAP layer accepts c channels from both
En_x and En_y (1 ≤ x, y < v, y = x + 1), and all channels from En_v.
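The three enhancement topologies defined by (4), (8) and (9) differ only in which feature maps each enhancement block receives. The following sketch lists those inputs symbolically so the variants can be compared side by side. The function name `enhancement_inputs` and the string labels are hypothetical; the wiring itself follows the three equations.

```python
from typing import List, Tuple


def enhancement_inputs(variant: str, u: int, v: int) -> List[Tuple[str, str]]:
    """Return the two symbolic inputs of En_1..En_v for a given variant."""
    last_deep = f"Z_k^({u})"
    last_broad = f"Z_k+1^({u})"
    earlier_broad = [f"Z_k+1^({i})" for i in range(1, u)]

    if variant == "BNAS":
        # Eq. (4): every block sees delta(stem, earlier broad maps) and the last broad map.
        fused = "delta(" + ", ".join(["Z_0^(1)"] + earlier_broad) + ")"
        return [(fused, last_broad) for _ in range(v)]

    if variant == "BNAS-CCLE":
        # Eq. (8): every block sees only the last convolution block.
        return [(last_deep, last_broad) for _ in range(v)]

    if variant == "BNAS-CCE":
        # Eq. (9): blocks are cascaded; only the first uses delta(.).
        fused = "delta(" + ", ".join(earlier_broad) + ")"
        inputs = []
        for j in range(1, v + 1):
            if j == 1:
                inputs.append((fused, last_broad))
            elif j == 2:
                inputs.append((last_broad, "H^(1)"))
            else:
                inputs.append((f"H^({j - 2})", f"H^({j - 1})"))
        return inputs

    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    for name in ("BNAS", "BNAS-CCLE", "BNAS-CCE"):
        print(name, enhancement_inputs(name, u=2, v=3))
```

In BNAS and BNAS-CCLE all v blocks receive identical inputs and run in parallel, whereas in BNAS-CCE each block from the third onward depends only on the two preceding enhancement outputs.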
V. EXPERIMENTS AND ANALYSIS
In this section, we perform a set of experiments to examine several
properties of BNAS and its two variants. First, a large number of
experiments are performed with BNAS and its two variants to determine
hyper-parameters (e.g., the number of enhancement blocks in the two
phases, the number of deep Cells in the architecture derivation phase).
Next, BNAS and its two variants are applied to discover architectures
on CIFAR-10 in order to test the search efficiency and the performance
of the learned architectures. Subsequently, the learned architecture
with the best performance is chosen to solve a large scale image
classification task on ImageNet. The experiment on ImageNet not only
evaluates the transferability of the architectures discovered by BNAS
and its two variants, but also examines the multi-scale feature
extraction capacity of the proposed broad scalable architectures.
Finally, qualitative and quantitative analyses are given for the
experimental results on CIFAR-10 and ImageNet.
A. Hyper-parameter Determination for Broad Scalable Architectures
For the architecture search phase, we set the number of deep Cells to
zero by default to deliver fast search speed. Moreover, the number of
enhancement blocks $v_s$ is selected
8(a)
Cell i
Cell i-1
sep
3×3
add
avg
concat
(b)
avg
add
sep
3×3
sep
3×3
add
sep
5×5
skip
add
sep
5×5
sep
3×3
add
sep
3×3
cell i
cell i-1
sep 
5×5 
add
sep 
5×5 
sep 
5×5 
add
sep 
3×3 
sep 
5×5 
add
sep 
3×3 
skip
add
sep 
5×5 
sep 
5×5 
add
sep 
3×3 
concat
(c)
cell i
cell i-1
sep
5×5
add
skip
concat
(d)
skip
add
sep
5×5
skip
add
sep
5×5
sep
5×5
add
skip
sep
5×5
add
sep
5×5
cell i
cell i-1
sep 
5×5 
add
sep 
3×3 
sep 
3×3 
add
sep 
5×5 
sep 
5×5 
add
sep 
3×3 
sep 
3×3 
add
skip
sep 
3×3 
add
sep 
3×3 
concat
(e)
cell i
cell i-1
sep 
5×5 
add
sep 
5×5 
avg
add
sep 
3×3 
sep 
5×5 
add
skip 
concat
(f)
Cell i
Cell i-1
sep
5×5
add
max
concat
sep
3×3
add
sep
3×3
sep
5×5
add
sep
3×3
avg
add
sep
5×5
max
add
avg
sep 
3×3 
add
avg
avg
add
sep 
3×3 
Fig. 6. Optimal architecture discovered by BNAS using various broad scalable architectures: (a) The convolution Cell for BNAS. (b) The enhancement
Cell for BNAS. (c) The convolution Cell for BNAS-CCLE. (d) The enhancement Cell for BNAS-CCLE. (e) The convolution Cell for BNAS-CCE. (f) The
enhancement Cell for BNAS-CCE.
from [1,2]. For the phase of architecture derivation, there are
two hyper-parameters need to be determined for the broad
scalable architectures: the number of enhancement block vd
and the number of deep Cell kd. As described in Section I,
large discrepancy of scalable architectures between two phases
leads to prohibitive performance reduction. As a result, we
choose vd and kd from [1,2,3] and [0,1,2], respectively.
The experimental settings for hyper-parameters determination
are identical to those described in Section V-B, except for
the number of training epochs in the architecture derivation
phase. The experimental results of hyper-parameters
determination for BNAS and BNAS-CCE can be found in the
APPENDIX. The related experiments for BNAS-CCLE behave
similarly to those for BNAS, so only the results of BNAS are
shown for that pair. In each case, we employ the mean value
and variance of the accuracies of five candidate architectures
as evaluation indices. According to the APPENDIX, the optimal
hyper-parameters of BNAS and its variants are shown in TABLE I.
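As a minimal sketch of how these evaluation indices can be computed, the snippet below summarizes the five accuracies reported in TABLE IV for BNAS with vs=1, vd=1 and kd=0. The function name `summarize` is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption that happens to reproduce the reported 95.40 ± 0.19 for this row.

```python
from statistics import mean, pvariance, variance

# Accuracies (%) of Arch-1..Arch-5 for BNAS, vs=1, vd=1, kd=0 (TABLE IV).
accuracies = [94.56, 95.50, 95.51, 95.64, 95.78]


def summarize(values):
    """Return (mean, population variance), both rounded to two decimals.

    Assumption: the "Var" column is a population variance (divide by n).
    """
    return round(mean(values), 2), round(pvariance(values), 2)


if __name__ == "__main__":
    m, v = summarize(accuracies)
    print(f"{m:.2f} ± {v:.2f}")
    # For comparison, the sample variance (divide by n - 1) is larger.
    print(f"sample variance: {variance(accuracies):.2f}")
```

For this row the sample variance would round to 0.23 instead, which is why the population form is assumed here.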
B. Architecture Search on CIFAR-10
Similarly, CIFAR-10 is chosen as the search data set, and a
series of standard data augmentation techniques are applied,
the details of which can be found in ENAS [16]. In BNAS, we
choose five candidate operations: 3 × 3 depthwise-separable
convolution, 5 × 5 depthwise-separable convolution, 3 × 3 max
pooling, 3 × 3 average pooling and skip connection, as the
components of the convolution Cell and the enhancement Cell
with 7 nodes.
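The search space described above can be sketched as follows. This is only an illustration under stated assumptions: it follows the ENAS-style convention in which the first two nodes are the Cell inputs and every remaining node adds the results of two operations applied to two earlier nodes, and it replaces the reinforcement learning controller with uniform random sampling. The operation identifiers and the function `sample_cell` are hypothetical names.

```python
import random

# The five candidate operations listed in the text (identifiers are made up).
CANDIDATE_OPS = [
    "sep_conv_3x3",
    "sep_conv_5x5",
    "max_pool_3x3",
    "avg_pool_3x3",
    "skip_connect",
]

NUM_NODES = 7        # nodes per Cell, as stated in the text
NUM_INPUT_NODES = 2  # assumption: nodes 0 and 1 are the outputs of Cell i-1 and Cell i


def sample_cell(rng):
    """Sample one Cell uniformly at random (stand-in for the RL controller).

    Each non-input node picks two earlier nodes and one operation per
    picked node; the two results are assumed to be combined by addition.
    """
    cell = []
    for node in range(NUM_INPUT_NODES, NUM_NODES):
        inputs = [rng.randrange(node) for _ in range(2)]
        ops = [rng.choice(CANDIDATE_OPS) for _ in range(2)]
        cell.append({"node": node, "inputs": inputs, "ops": ops})
    return cell


if __name__ == "__main__":
    for entry in sample_cell(random.Random(0)):
        print(entry)
```

Under these assumptions a Cell is fully described by five such entries, one per non-input node.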
In the architecture search phase, Nesterov momentum is
adopted and the learning rate follows the cosine schedule
with lmax=0.05, lmin=0.0005, T0=10 and Tmul=2 [36] for
training the broad scalable architectures. Furthermore, the
experiment runs for 150 epochs with batch size 128. For
updating the parameters θ of the controller, the Adam
optimizer with a learning rate of 0.0035 is applied. On one
hand, we train 5 candidate architectures for 310 epochs on
CIFAR-10 for hyper-parameters determination. On the other
hand, we train 10 candidate architectures for 630 epochs on
CIFAR-10 for architecture derivation. Moreover, we adopt
identical experimental settings in the architecture search
phase for each case.
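The cosine schedule with warm restarts [36] can be sketched as below, using the values lmax=0.05, lmin=0.0005, T0=10 and Tmul=2 given above. The function names are hypothetical, and updating the rate once per epoch (rather than per batch) is an assumption made to keep the sketch short.

```python
import math


def sgdr_learning_rate(epoch, l_max=0.05, l_min=0.0005, t_0=10, t_mul=2):
    """Learning rate at a 0-based epoch under cosine annealing with restarts."""
    cycle_len, t_cur = t_0, epoch
    # Walk through the cycles until the epoch falls inside the current one.
    while t_cur >= cycle_len:
        t_cur -= cycle_len
        cycle_len *= t_mul
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / cycle_len))
    return l_min + (l_max - l_min) * cosine


def cycle_boundaries(total_epochs, t_0=10, t_mul=2):
    """Epochs at which a full cosine cycle ends within the given budget."""
    boundaries, end, cycle_len = [], 0, t_0
    while end + cycle_len <= total_epochs:
        end += cycle_len
        boundaries.append(end)
        cycle_len *= t_mul
    return boundaries


if __name__ == "__main__":
    print(cycle_boundaries(630))
    print([round(sgdr_learning_rate(e), 4) for e in (0, 5, 9, 10)])
```

With T0=10 and Tmul=2 the cycles end after 10, 30, 70, 150, 310 and 630 epochs, so the budgets of 150, 310 and 630 epochs quoted above each coincide with the end of a complete cycle.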
The diagrams of the top performing convolution Cells and
enhancement Cells discovered by BNAS using various broad
scalable architectures are shown in Fig. 6. In each case, a
family of broad scalable architectures with the same topology
but different sizes is constructed by modifying the number of
initial channels. The comparisons of BNAS using various
broad scalable architectures with other NAS approaches on
CIFAR-10 for different-size models under identical training
conditions are shown in TABLE II. Moreover, a popular data
augmentation technique, Cutout [21], is applied for BNAS and
its two variants in the architecture derivation phase but not
in the architecture search phase.
C. Transferability of Learned Architecture on ImageNet
In this part, we transfer the architecture learned by BNAS
on CIFAR-10 to a large scale image classification task.
A large model stacked from the best performing Cells is built
for ImageNet 2012. This experiment is performed not only to
verify the transferability of the architectures discovered by
BNAS and its two variants, but also to demonstrate the
multi-scale feature extraction capacity of the proposed broad
scalable architectures.
TABLE III
COMPARISON OF BNAS WITH OTHER STATE-OF-THE-ART IMAGE
CLASSIFIERS ON IMAGENET
Architecture               Top-1 error (%)  Top-5 error (%)  Params (M)
Inception-v1 [37]          30.2             10.1             -
MobileNet-224 [38]         29.4             -                6
ShuffleNet (2x) [39]       29.1             10.2             10
AmoebaNet-A [17]           25.5             8.0              5.1
AmoebaNet-B [17]           26.0             8.5              5.3
NASNet-A [11]              26.0             8.4              5.3
NASNet-B [11]              27.2             8.7              5.3
NASNet-C [11]              27.5             9.0              4.9
PNASNet [23]               25.8             8.1              5.1
LEMONADE [24]              26.9             9.0              4.9
DARTS [15]                 26.7             8.7              4.7
FBNet-B [40]               25.9             -                4.5
P-DARTS (CIFAR-10) [26]    24.4             7.4              4.9
P-DARTS (CIFAR-100) [26]   24.7             7.5              5.1
PC-DARTS (CIFAR-10) [27]   25.1             7.8              5.3
BNAS (ours)                25.7             8.5              3.9
Similar to the experiments on CIFAR-10, some data
augmentation techniques, for instance random cropping and
flipping, are applied to the input images, whose size is
224×224. In this experiment, the broad scalable architecture
of BNAS-CCLE consists of five convolution blocks and a single
enhancement block. Beyond that, only one deep Cell is
employed in each convolution block for deep representation
extraction. Furthermore, we train BNAS-CCLE for 150 epochs
with batch size 256 using the SGD optimizer with momentum 0.9
and weight decay 3 × 10^-5. The initial learning rate is set
to 0.1 and decayed by a factor of 0.1 at the 70th, 100th and
130th epochs. The settings for other hyper-parameters, e.g.,
label smoothing and gradient clipping bounds, can be found in
DARTS [15].
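The step-wise decay described above can be written as a small function. This is a sketch only: the name `step_lr` is hypothetical, and it assumes 0-based epoch indices with the decay taking effect from the milestone epoch onward, which the text does not specify.

```python
from bisect import bisect_right

# Epochs at which the learning rate is multiplied by the decay factor.
MILESTONES = (70, 100, 130)


def step_lr(epoch, base_lr=0.1, gamma=0.1, milestones=MILESTONES):
    """Learning rate after applying one decay per milestone already reached."""
    num_decays = bisect_right(milestones, epoch)
    return base_lr * gamma ** num_decays


if __name__ == "__main__":
    for epoch in (0, 69, 70, 100, 130, 149):
        print(epoch, f"{step_lr(epoch):.0e}")
```

Over the 150 training epochs this yields four plateaus: 0.1, 0.01, 0.001 and 0.0001.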
TABLE III summarizes the results in terms of accuracy and
parameter count, and compares them with other state-of-the-
art image classifiers on ImageNet.
D. Results Analysis
1) Hyper-parameters: According to the APPENDIX, we can
draw the following conclusions about the hyper-parameters of
the broad scalable architectures.
Firstly, broad scalable architectures without deep Cells are
prone to yield poor accuracy and robustness in the
architecture derivation phase. Therefore, the deep
representations extracted by deep Cells are necessary in the
proposed broad scalable architectures.
Fig. 7. Visualization of the influence of deep Cell and enhancement Cell for
BNAS: (a) The influence of deep Cell for BNAS with vs=1. (b) The influence
of deep Cell for BNAS with vs=2. (c) The influence of enhancement Cell
for BNAS with vs=1. (d) The influence of enhancement Cell for BNAS with
vs=2.
Fig. 8. Visualization of the influence of deep Cell and enhancement Cell for
BNAS-CCE: (a) The influence of deep Cell for BNAS-CCE with vs=1. (b)
The influence of deep Cell for BNAS-CCE with vs=2. (c) The influence
of enhancement Cell for BNAS-CCE with vs=1. (d) The influence of
enhancement Cell for BNAS-CCE with vs=2.
Secondly, the deep Cell and the enhancement Cell play
different roles in BNAS and BNAS-CCE, as follows (only
kd = 1 and kd = 2 are compared here, due to the poor
performance of kd = 0):
• Fig. 7(a) and (b) visualize the influence of the deep Cell
for BNAS under the same conditions. Evidently, unlike in a
deep scalable architecture, the deep Cell does not always
improve the performance of the broad scalable architecture.
BCNN is both broad and deep, but its broad topology is
predominant, so that additional deep Cells may have a
negative effect on performance.
• Fig. 7(c) and (d) visualize the influence of the enhancement
Cell for BNAS under the same conditions. Viewed globally,
the broad scalable architecture of BNAS with a single
enhancement block delivers the best performance. As
aforementioned, every enhancement block receives the same
input, and its connection to the GAP layer has the same
importance. This leads to much redundancy in the
representations that the GAP layer accepts from the
enhancement blocks, so that the case of vd=1 yields the best
performance.
• Fig. 8(a) and (b) visualize the influence of the deep Cell
for BNAS-CCE under the same conditions. Different from
BNAS, the performance of BNAS-CCE tends to increase with
the number of deep Cells. The topology difference between
the broad scalable architectures of BNAS and BNAS-CCE is
the primary cause of this phenomenon. In BNAS-CCE, the deep
topology is predominant due to the cascade of enhancement
blocks. As a result, more deep representations contribute
to performance improvement.
• Fig. 8(c) and (d) visualize the influence of the enhancement
Cell for BNAS-CCE under the same conditions. Similar to the
influence of the deep Cell in BNAS, additional enhancement
Cells do not improve the performance of the broad scalable
architecture of BNAS-CCE, which uses a cascade of
enhancement blocks. Two factors result in this situation:
1) the connection between each enhancement block and the
GAP layer, and 2) the identical spatial size of the outputs
of the enhancement blocks. The combination of these two
factors leads to substantial contradictory information
being fed into the GAP layer, which needs high-quality
representations as input for high performance. As a result,
more enhancement blocks in the broad scalable architecture
of BNAS-CCE do not improve its performance.
Moreover, we perform only a single run for each architecture
in the hyper-parameters determination experiment, so the
results shown in the APPENDIX may be unstable due to the
impact of randomness and nondeterminism.
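The comparisons above can be checked against the mean accuracies reported in TABLE IV. The snippet below is a small sketch of such a check; the dictionary layout and the helper names are hypothetical, and only the mean column of TABLE IV is transcribed.

```python
# Mean accuracies (%) from TABLE IV: (vs, vd) -> (kd=0, kd=1, kd=2).
TABLE_IV_MEANS = {
    "BNAS": {
        (1, 1): (95.40, 96.49, 96.50),
        (1, 2): (95.46, 96.43, 96.40),
        (1, 3): (95.12, 96.47, 96.51),
        (2, 1): (95.31, 96.70, 96.62),
        (2, 2): (94.60, 96.55, 96.61),
        (2, 3): (95.33, 96.62, 96.54),
    },
    "BNAS-CCE": {
        (1, 1): (94.64, 95.96, 96.25),
        (1, 2): (95.01, 95.83, 96.17),
        (1, 3): (95.47, 96.17, 96.04),
        (2, 1): (95.51, 96.40, 96.41),
        (2, 2): (95.31, 96.23, 96.56),
        (2, 3): (95.53, 96.13, 96.50),
    },
}


def deeper_wins(approach):
    """Count (vs, vd) settings where kd=2 has a higher mean than kd=1."""
    return sum(k2 > k1 for _, k1, k2 in TABLE_IV_MEANS[approach].values())


def no_deep_cell_is_worst(approach):
    """True if kd=0 has the lowest mean in every (vs, vd) setting."""
    return all(k0 < min(k1, k2) for k0, k1, k2 in TABLE_IV_MEANS[approach].values())


def best_setting(approach):
    """Return the (vs, vd, kd) triple with the highest mean accuracy."""
    means = TABLE_IV_MEANS[approach]
    candidates = [(vs, vd, kd) for (vs, vd) in means for kd in range(3)]
    return max(candidates, key=lambda s: means[(s[0], s[1])][s[2]])


if __name__ == "__main__":
    for name in TABLE_IV_MEANS:
        print(name, deeper_wins(name), no_deep_cell_is_worst(name), best_setting(name))
```

On these means, kd=2 beats kd=1 in 3 of 6 settings for BNAS and in 5 of 6 settings for BNAS-CCE, and kd=0 is the worst choice in every setting, which agrees with the observations listed above.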
2) Performance: For the experiments on CIFAR-10, we use
three groups of learned architectures to construct a family
of broad architectures, where three models with different
numbers of parameters (about 0.5, 1.1 and 4 million) are
built from the learned architecture of each group. In the
first and second blocks of TABLE II, DPP-Net [25] and
LEMONADE [24] are chosen as the comparative NAS approaches.
BNAS and its two variants all deliver small-sized broad
scalable architectures with the best accuracy for the
small-scale image classification task. In particular, for the
models with 0.5 million parameters, BNAS-CCE outperforms the
comparative NAS methods by 1.04%, which is a substantial
improvement. In the third block of TABLE II, a large-sized
model is constructed and several state-of-the-art NAS
approaches (e.g., AmoebaNet [17], NASNet [11], DARTS [15],
ENAS [16], P-DARTS [26] and PC-DARTS [27]) are chosen for
comparison with the proposed BNAS and its two variants. All
proposed approaches achieve competitive results. Moreover,
BNAS-CCE delivers the best performance among the three
proposed NAS approaches, with 2.88% test error (better than
ENAS) and 4.8 million parameters.
Furthermore, two aspects (accuracy and parameter count) are
compared for the experiment on ImageNet. Here, we transfer
only the architecture learned by BNAS-CCLE to ImageNet.
Moreover, we choose not only NAS approaches (second block
of TABLE III) but also manually designed models (first block
of TABLE III) as the comparative methods. In terms of
accuracy, BNAS-CCLE achieves 25.7% top-1 test error, which
is competitive with the state-of-the-art model designed by
P-DARTS [26]. This supports the transferability of the
learned architecture and the multi-scale feature extraction
capacity of the proposed broad scalable architecture for
large scale image classification. In terms of parameters,
BNAS-CCLE obtains the above competitive accuracy with 3.9
million parameters, which is state-of-the-art among NAS
approaches. Here, the multi-scale features extracted by the
broad scalable architecture are fused to yield more
reasonable and comprehensive representations, so that
BNAS-CCLE can make more accurate decisions for image
classification with few parameters.
From the above, the family of broad scalable architectures
discovered by BNAS and its two variants performs well on
both small and large scale image classification tasks.
On one hand, compared with other NAS approaches for
automatic small-sized model design (DPP-Net [25], LEMONADE
[24]), BNAS and its two variants achieve strong performance
with few parameters (especially 0.5 million) on CIFAR-10,
which demonstrates their effectiveness. On the other hand,
the results on ImageNet demonstrate both the transferability
of the optimal architecture learned by BNAS-CCLE and the
multi-scale feature extraction capacity of the proposed
broad scalable architectures.
3) Efficiency: As shown in TABLE II, BNAS using various
broad scalable architectures is about 16580x and 9470x
faster than AmoebaNet and NASNet, respectively, i.e., roughly
four orders of magnitude. Compared with relatively efficient
NAS methods (e.g., Hierarchical Evo [22], PNAS [23] and
LEMONADE [24]), BNAS and its two variants use about 1580x,
1180x and 420x less computational resources, respectively.
Furthermore, several state-of-the-art efficient NAS
approaches, DPP-Net [25], DARTS [15], P-DARTS [26], PC-DARTS
[27] and ENAS [16], are compared in detail with the proposed
approach below.
First of all, we compare DPP-Net with our approach. The
proposed approach is about 21x faster than DPP-Net.
Moreover, the performance of our approach is better, as
aforementioned. Compared with DARTS, a gradient-based NAS
approach, BNAS and its variants are about 7.9x and 21x
faster than its first-order and second-order approximations,
respectively. However, BNAS-CCE outperforms only DARTS with
first-order approximation, not DARTS with second-order
approximation, which uses 21x more computational resources
than our approach. Based on DARTS, two improved versions
dubbed P-DARTS and PC-DARTS obtain state-of-the-art
efficiency. Our approach is 1.6x faster than P-DARTS but
0.09 GPU day slower than PC-DARTS.
In particular, the search cost of BNAS and its two variants
is about 2.37x less than that of ENAS. As aforementioned,
our approach also adopts the reinforcement learning and
parameter sharing used in ENAS. As a result, we can conclude
that the proposed broad scalable architectures improve the
efficiency of Cell search space based NAS approaches.
Moreover, the efficiency of NAS may be improved further by
combining the proposed broad scalable architecture with
PC-DARTS.
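The speed-up factors quoted in this subsection are ratios of search costs in GPU days. As a worked sketch, the snippet below inverts them using the 0.19 GPU day reported for BNAS to obtain the baseline cost each ratio implies, and expresses a ratio in orders of magnitude. The implied costs are back-computed from the ratios in the text, not copied from TABLE II, and the helper names are hypothetical.

```python
import math

BNAS_COST = 0.19  # search cost of BNAS in GPU days, as reported in the text

# Speed-up factors of BNAS over each method, as quoted in the text.
REPORTED_SPEEDUP = {
    "AmoebaNet": 16580,
    "NASNet": 9470,
    "Hierarchical Evo": 1580,
    "PNAS": 1180,
    "LEMONADE": 420,
    "DPP-Net": 21,
    "DARTS (first order)": 7.9,
    "DARTS (second order)": 21,
    "ENAS": 2.37,
    "P-DARTS": 1.6,
}


def implied_cost(speedup, ours=BNAS_COST):
    """Baseline search cost (GPU days) implied by speedup = baseline / ours."""
    return speedup * ours


def orders_of_magnitude(speedup):
    """Base-10 logarithm of a speed-up factor."""
    return math.log10(speedup)


if __name__ == "__main__":
    for name, ratio in REPORTED_SPEEDUP.items():
        print(f"{name:22s} ~{implied_cost(ratio):9.2f} GPU days, "
              f"10^{orders_of_magnitude(ratio):.2f}")
    # PC-DARTS is quoted as an absolute difference rather than a ratio.
    print(f"{'PC-DARTS':22s} ~{BNAS_COST - 0.09:9.2f} GPU days")
```

For example, the 2.37x factor corresponds to roughly 0.45 GPU day for ENAS, and a 16580x factor is about 4.2 orders of magnitude.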
From the above, BNAS and its two variants deliver a search
cost of 0.19 day on a single GPU, which ranks the best among
reinforcement learning based NAS approaches. The proposed
broad scalable architectures, which simultaneously reduce the
search cost and the performance drop, are the main difference
between our approach and ENAS. Furthermore, broad scalable
architectures can be used in Cell search space based NAS
approaches (e.g., P-DARTS and PC-DARTS). As a result, we
believe that the combination of the proposed broad scalable
architecture and PC-DARTS can deliver state-of-the-art
efficiency.
VI. CONCLUSIONS
In this paper, we propose a NAS approach using a broad
scalable architecture, dubbed BNAS. The core idea is to
design a scalable architecture with broad topology, dubbed
BCNN, to replace the deep one and thereby accelerate the
search process further. In particular, we also propose two
variants of BNAS, dubbed BNAS-CCLE and BNAS-CCE, to examine
the generalization performance of the proposed approach.
The main difference between BNAS and its variants is the
topology of the broad scalable architecture. Focusing on two
hyper-parameters of the proposed broad scalable architectures,
substantial experiments are performed to determine their
optimal values for BNAS and its two variants. Experimental
results show that the deep Cell and the enhancement Cell play
the same roles in BNAS and BNAS-CCLE, but not in BNAS-CCE,
whose scalable architecture is deeper.
Through substantial experiments on CIFAR-10 and ImageNet,
the effectiveness of the broad scalable architecture in
improving efficiency and performance is demonstrated for
automatic architecture design. In terms of efficiency,
our approach delivers 0.19 GPU day (2.37x less than ENAS),
which ranks the best among reinforcement learning based NAS
approaches on CIFAR-10. In terms of performance, our
approach achieves state-of-the-art performance for both
small and large scale image classification tasks, in
particular for small-sized models on CIFAR-10. For ImageNet,
BNAS achieves competitive performance with a
state-of-the-art parameter count.
Furthermore, the broad scalable architecture can be used in
Cell-based NAS approaches. As a result, we are going to
insert the proposed broad scalable architecture into other
NAS frameworks (e.g., evolutionary computation and
gradient-based ones) for efficiency improvement.
REFERENCES
[1] D. Zhao, Y. Chen, and L. Lv, “Deep reinforcement learn-
ing with visual attention for vehicle classification,” IEEE
Transactions on Cognitive and Developmental Systems,
vol. 9, no. 4, pp. 356–367, 2017.
[2] Y. Chen, D. Zhao, L. Lv, and Q. Zhang, “Multi-task
learning for dangerous object detection in autonomous
driving,” Information Sciences, vol. 432, pp. 559–571,
2018.
[3] X. Liu, B. Hu, Q. Chen, X. Wu, and J. You, “Stroke
sequence-dependent deep convolutional neural network
for online handwritten chinese character recognition,”
IEEE Transactions on Neural Networks and Learning
Systems, 2020.
[4] D. Mellouli, T. M. Hamdani, J. J. Sanchez-Medina, M. B.
Ayed, and A. M. Alimi, “Morphological convolutional
neural network architecture for digit recognition,” IEEE
Transactions on Neural Networks and Learning Systems,
vol. 30, no. 9, pp. 2876–2885, 2019.
[5] K. Shao, Y. Zhu, and D. Zhao, “Starcraft micromanage-
ment with reinforcement learning and curriculum transfer
learning,” IEEE Transactions on Emerging Topics in
Computational Intelligence, vol. 3, no. 1, pp. 73–84,
2019.
[6] K. Shao, D. Zhao, N. Li, and Y. Zhu, “Learning battles in
vizdoom via deep reinforcement learning,” in 2018 IEEE
Conference on Computational Intelligence and Games
(CIG). IEEE, 2018, pp. 1–4.
[7] D. Li, D. Zhao, Y. Chen, and Q. Zhang, “Deepsign: Deep
learning based traffic sign recognition,” in 2018 Interna-
tional Joint Conference on Neural Networks (IJCNN).
IEEE, 2018, pp. 1–6.
[8] H. Li, Q. Zhang, and D. Zhao, “Deep reinforcement
learning-based automatic exploration for navigation in
unknown environment,” IEEE Transactions on Neural
Networks and Learning Systems, vol. 31, no. 6, pp. 2064–
2076, 2020.
[9] B. Zoph and Q. V. Le, “Neural architecture search with
reinforcement learning,” in International Conference on
Learning Representations (ICLR), 2017.
[10] A. Brock, T. Lim, J. M. Ritchie, and N. J. Weston,
“Smash: One-shot model architecture search through
hypernetworks,” in International Conference on Learning
Representations (ICLR), 2018.
[11] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le,
“Learning transferable architectures for scalable image
recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2018,
pp. 8697–8710.
[12] Y. Chen, R. Gao, F. Liu, and D. Zhao, “Modulenet:
Knowledge-inherited neural architecture search,” arXiv
preprint arXiv:2004.05020, 2020.
[13] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely
automated cnn architecture design based on blocks,”
IEEE Transactions on Neural Networks and Learning
Systems, vol. 31, no. 4, pp. 1242–1254, 2020.
[14] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L.
Yuille, and L. Fei-Fei, “Auto-deeplab: Hierarchical neural
architecture search for semantic image segmentation,” in
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2019, pp. 82–92.
[15] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differen-
tiable architecture search,” in International Conference
on Learning Representations (ICLR), 2018.
[16] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean,
“Efficient neural architecture search via parameters shar-
ing,” in International Conference on Machine Learning
(ICML), 2018, pp. 4095–4104.
[17] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regular-
ized evolution for image classifier architecture search,”
in Proceedings of the AAAI Conference on Artificial
Intelligence (AAAI), vol. 33, 2019, pp. 4780–4789.
[18] Z. Ding, Y. Chen, N. Li, and D. Zhao, “Simplified
space based neural architecture search,” in 2019 IEEE
Symposium Series on Computational Intelligence (SSCI).
IEEE, 2019, pp. 43–49.
[19] N. Li, Y. Chen, Z. Ding, and D. Zhao, “Light-weight neu-
ral architecture search for resource-constrainted device,”
in 2019 Chinese Automation Congress (CAC), 2019.
[20] N. Li, Y. Chen, Z. Ding, and D. Zhao, “Shift-invariant
convolutional network search,” in 2020 International
Joint Conference on Neural Networks (IJCNN). IEEE,
2020.
[21] T. DeVries and G. W. Taylor, “Improved regularization
of convolutional neural networks with cutout,” arXiv
preprint arXiv:1708.04552, 2017.
[22] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and
K. Kavukcuoglu, “Hierarchical representations for effi-
cient architecture search,” in International Conference on
Learning Representations (ICLR), 2018.
[23] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li,
L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Pro-
gressive neural architecture search,” in Proceedings of
the European Conference on Computer Vision (ECCV),
2018, pp. 19–34.
[24] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-
objective neural architecture search via lamarckian evo-
lution,” in International Conference on Learning Repre-
sentations (ICLR), 2018.
[25] J. Dong, A. Cheng, D. Juan, W. Wei, and M. Sun, “Dpp-
net: Device-aware progressive search for pareto-optimal
neural architectures,” in Proceedings of the European
Conference on Computer Vision (ECCV), 2018, pp. 517–
531.
[26] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive
differentiable architecture search: Bridging the depth
gap between search and evaluation,” in Proceedings of
the IEEE International Conference on Computer Vision
(ICCV), 2019, pp. 1294–1303.
[27] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and
H. Xiong, “PC-DARTS: Partial channel connections for
memory-efficient architecture search,” in International
Conference on Learning Representations (ICLR), 2019.
[28] C. P. Chen and Z. Liu, “Broad learning system: An ef-
fective and efficient incremental learning system without
the need for deep architecture,” IEEE Transactions on
Neural Networks and Learning Systems, vol. 29, no. 1,
pp. 10–24, 2017.
[29] C. P. Chen, Z. Liu, and S. Feng, “Universal approxima-
tion capability of broad learning system and its structural
variations,” IEEE Transactions on Neural Networks and
Learning Systems, vol. 30, no. 4, pp. 1191–1204, 2018.
[30] Y. Pao and Y. Takefuji, “Functional-link net computing:
theory, system architecture, and functionalities,” Com-
puter, vol. 25, no. 5, pp. 76–79, 1992.
[31] Y. Pao, G. Park, and D. J. Sobajic, “Learning and gener-
alization characteristics of the random vector functional-
link net,” Neurocomputing, vol. 6, no. 2, pp. 163–180,
1994.
[32] F. Chollet, “Xception: Deep learning with depthwise
separable convolutions,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), 2017, pp. 1251–1258.
[33] S. Hochreiter and J. Schmidhuber, “Long short-term
memory,” Neural Computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[34] R. J. Williams, “Simple statistical gradient-following al-
gorithms for connectionist reinforcement learning,” Ma-
chine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[35] F. M. Carlucci, P. Esperanca, R. Tutunov, M. Singh,
V. Gabillon, A. Yang, H. Xu, Z. Chen, and J. Wang,
“Manas: Multi-agent neural architecture search,” arXiv
preprint arXiv:1909.01051, 2019.
[36] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient
descent with warm restarts,” in International Conference
on Learning Representations (ICLR), 2017.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabi-
novich, “Going deeper with convolutions,” in Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015, pp. 1–9.
[38] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko,
W. Wang, T. Weyand, M. Andreetto, and H. Adam,
“Mobilenets: Efficient convolutional neural networks
for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[39] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet:
An extremely efficient convolutional neural network for
mobile devices,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
2018, pp. 6848–6856.
[40] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian,
P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware
efficient convnet design via differentiable neural architec-
ture search,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2019,
pp. 10734–10742.
APPENDIX
TABLE IV
HYPER-PARAMETERS DETERMINATION FOR BNAS AND BNAS-CCE
                          Accuracy (%)
Approach   vs  vd  kd  Arch-1  Arch-2  Arch-3  Arch-4  Arch-5  Mean ± Var (%)
BNAS       1   1   0   94.56   95.50   95.51   95.64   95.78   95.40 ± 0.19
BNAS       1   1   1   96.42   96.63   96.56   96.21   96.64   96.49 ± 0.03
BNAS       1   1   2   96.32   96.78   96.51   96.58   96.32   96.50 ± 0.03
BNAS       1   2   0   94.34   95.53   95.60   96.02   95.80   95.46 ± 0.34
BNAS       1   2   1   96.40   96.56   96.36   96.48   96.37   96.43 ± 0.01
BNAS       1   2   2   95.85   96.65   96.51   96.41   96.58   96.40 ± 0.08
BNAS       1   3   0   93.95   95.75   95.57   95.37   94.97   95.12 ± 0.41
BNAS       1   3   1   96.49   96.59   96.53   96.24   96.50   96.47 ± 0.01
BNAS       1   3   2   96.57   96.79   96.01   96.56   96.64   96.51 ± 0.07
BNAS       2   1   0   95.20   95.65   94.74   95.86   95.11   95.31 ± 0.16
BNAS       2   1   1   96.86   96.82   96.28   96.77   96.79   96.70 ± 0.05
BNAS       2   1   2   96.83   96.36   96.43   96.76   96.72   96.62 ± 0.04
BNAS       2   2   0   94.72   93.89   93.12   95.56   95.72   94.60 ± 0.98
BNAS       2   2   1   96.79   96.46   96.45   96.56   96.49   96.55 ± 0.02
BNAS       2   2   2   96.73   96.37   96.48   96.72   96.74   96.61 ± 0.02
BNAS       2   3   0   95.44   95.52   95.28   95.55   94.85   95.33 ± 0.07
BNAS       2   3   1   96.70   96.72   96.47   96.55   96.66   96.62 ± 0.01
BNAS       2   3   2   96.76   96.01   96.63   96.50   96.82   96.54 ± 0.08
BNAS-CCE   1   1   0   95.24   95.34   94.11   93.76   94.74   94.64 ± 0.38
BNAS-CCE   1   1   1   96.08   95.91   96.07   96.17   95.55   95.96 ± 0.05
BNAS-CCE   1   1   2   96.40   95.79   95.96   96.42   96.68   96.25 ± 0.11
BNAS-CCE   1   2   0   94.89   95.41   94.91   94.78   95.05   95.01 ± 0.05
BNAS-CCE   1   2   1   95.82   95.89   96.13   96.00   95.33   95.83 ± 0.08
BNAS-CCE   1   2   2   96.38   96.55   95.94   96.15   95.83   96.17 ± 0.07
BNAS-CCE   1   3   0   95.73   95.42   95.43   95.26   95.53   95.47 ± 0.02
BNAS-CCE   1   3   1   96.30   95.93   96.46   96.25   95.92   96.17 ± 0.05
BNAS-CCE   1   3   2   96.42   95.74   96.07   95.94   96.01   96.04 ± 0.05
BNAS-CCE   2   1   0   95.51   95.51   95.03   95.85   95.65   95.51 ± 0.07
BNAS-CCE   2   1   1   96.22   96.17   96.67   96.53   96.40   96.40 ± 0.04
BNAS-CCE   2   1   2   95.62   96.73   96.40   96.64   96.68   96.41 ± 0.17
BNAS-CCE   2   2   0   95.06   95.78   94.74   95.38   95.61   95.31 ± 0.14
BNAS-CCE   2   2   1   96.32   96.11   95.91   96.39   96.42   96.23 ± 0.04
BNAS-CCE   2   2   2   96.71   96.56   96.13   96.63   96.77   96.56 ± 0.05
BNAS-CCE   2   3   0   95.39   95.53   95.53   95.31   95.87   95.53 ± 0.04
BNAS-CCE   2   3   1   95.70   96.07   96.25   96.54   96.10   96.13 ± 0.07
BNAS-CCE   2   3   2   96.44   96.63   96.48   96.41   96.56   96.50 ± 0.01
