S3NAS: Fast NPU-aware Neural Architecture Search Methodology by Lee, Jaeseong et al.
S3NAS: Fast NPU-aware Neural Architecture Search
Methodology
Jaeseong Lee
Seoul National University
thnkinbtfly@iris.snu.ac.kr
Duseok Kang
Seoul National University
kangds0829@snu.ac.kr
Soonhoi Ha∗
Seoul National University
sha@snu.ac.kr
ABSTRACT
As the application area of convolutional neural networks (CNN) is
growing in embedded devices, it becomes popular to use a hardware
CNN accelerator, called neural processing unit (NPU), to achieve
higher performance per watt than CPUs or GPUs. Recently,
automated neural architecture search (NAS) has emerged as the default
technique to find a state-of-the-art CNN architecture with higher
accuracy than manually-designed architectures for image classification.
In this paper, we present a fast NPU-aware NAS methodology,
called S3NAS, to find a CNN architecture with higher accuracy than
the existing ones under a given latency constraint. It consists of
three steps: supernet design, Single-Path NAS for fast architecture
exploration, and scaling. To widen the search space of the supernet
structure that consists of stages, we allow stages to have a different
number of blocks and blocks to have parallel layers of different ker-
nel sizes. For a fast neural architecture search, we apply a modified
Single-Path NAS technique to the proposed supernet structure. In
this step, we assume a latency constraint shorter than the required one
to reduce the search space and the search time. The last step is to
scale up the network maximally within the latency constraint. For
accurate latency estimation, an analytical latency estimator is de-
vised, based on a cycle-level NPU simulator that runs an entire CNN
considering the memory access overhead accurately. With the pro-
posed methodology, we are able to find a network in 3 hours using
TPUv3, which shows 82.72% top-1 accuracy on ImageNet with 11.66
ms latency. Code is released at https://github.com/cap-lab/S3NAS
1 INTRODUCTION
As the need for deep learning applications based on convolutional
neural networks (CNNs) grows in embedded systems, improving the
accuracy of CNNs under a given set of constraints on latency and
energy consumption has attracted keen interest from researchers as
a challenging problem in various areas. A popular
hardware solution is to develop a hardware accelerator, called neu-
ral processing unit (NPU), that achieves higher performance per
watt than CPUs or GPUs.
For a given hardware platform, several software techniques have
been proposed to accelerate CNNs by approximate computing since
deep learning applications can tolerate a certain range of computation
inaccuracy. Examples of this software approach are filter pruning [17],
quantization [22], and low-rank approximation [16].
Accelerating CNNs is helpful to improve the accuracy by running a
more compute-intensive CNN with higher accuracy within a given
time budget.
∗Corresponding Author.
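[Figure 1 diagram: the search strategy selects an architecture A ∈ 𝒜 from the search space 𝒜 and passes it to the performance estimation strategy, which returns a performance estimate of A to the search strategy.]
Figure 1: Typical procedure of NAS, adopted from [6]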
On the other hand, various algorithmic solutions have been
proposed to improve the CNN architecture by introducing new op-
erations, optimizing the hyper-parameters, or searching for better
network architecture. New operations such as depth-wise convolution
(DWConv) [3] and mobile inverted bottleneck (MBConv) [27]
have been developed to replace the regular full convolution. Recently,
automated neural architecture search (NAS) has emerged as
the default technique to find a CNN architecture with higher accuracy
than manually-designed architectures, particularly for image
classification.
A NAS technique explores a predefined search space and esti-
mates the performance for each candidate architecture to find an
optimal one with the highest accuracy under a given latency con-
straint. Thus there are three factors that affect the performance of
NAS, as shown in Figure 1: search space, search strategy, and perfor-
mance estimation. The search space of a NAS technique is usually
restricted by a supernet that defines the topology of the largest net-
work to explore. Since the performance of a network depends on
the hardware platform, the NAS technique needs to be customized
to a given hardware platform. While numerous NAS techniques
have been proposed with various search strategies recently, their
assumed hardware platforms are mostly GPUs. In this paper, we
present a customized NAS technique for an NPU, which produces
a CNN architecture with a better accuracy-latency tradeoff than
existing models.
One of the most closely related works is the recently proposed
NAS technique tailored for Google’s Edge-TPU [9]. While MBConv
is widely used for GPU-aware NAS techniques, they prefer to use
a single full convolution by fusing expansion layer and DWConv
layer in some parts of the network, observing that the Edge-TPU
runs the fused full convolution faster even though the required
number of MAC (multiply-accumulate) operations is much larger.
It confirms that the number of MAC operations is not a proper
measure of latency, and platform-specific performance estimation
is required.
Since an NPU is much faster than a GPU, it enables us to explore
a wider search space for NAS under a given latency constraint.
Since there are many factors to define the search space, such as
the number of layers, channels, kernel sizes, and so on, the search
[arXiv:2009.02009v1 [cs.LG] 4 Sep 2020]
[Figure 2 diagram: (1) Supernet design: ❶ alter the number of blocks per stage in the supernet; ❷ enable search of DWConv with multiple kernel sizes (parallel DW ?x? layers between 1x1 Convs, joined by ConCat). (2) Apply modified Single-Path NAS. (3) Scale and add SE + h-swish (example block: Conv 1x1, parallel DW 3x3 and DW 7x7, ConCat, SE).]
Figure 2: Overview of the proposed NAS technique.
space grows exponentially as the allowed computation complex-
ity grows. Hence, reducing the search space, as well as the search
time, is very challenging for NPU-aware NAS techniques. While
the aforementioned work for Google’s Edge TPU trains each archi-
tecture candidate from scratch to estimate the performance, it is not
computationally efficient. In contrast, we adopt a fast differentiable
hardware-aware One-Shot NAS, called Single-Path NAS [28], in
order to reduce the search time.
Figure 2 shows an overview of the proposed NAS methodology
that consists of three steps. In the first step, we change the supernet
structure of the Single-Path NAS, which has a hierarchical structure
based on MobileNetV2 [27]: a supernet structure consists of a series
of stages that contain a series of blocks containing an MBConv
micro-architecture inside. Since the network accuracy depends on
the supernet structure, we make two extensions on the supernet
structure to widen the search space. First, we allow stages to have
a different number of blocks, called depth of the stage, considering
the effect of stage depth on the accuracy and the latency. Second,
we add parallel layers with different kernel sizes in each block,
adopting the idea of mixed depthwise convolution [33] (MixConv).
With the extended supernet structure, we apply the Single-Path
NAS, which is also extended to support the extended supernet
structure. In this step, we assume a latency constraint shorter than
the required one to reduce the search space and the search time. The
last step is to scale up the baseline CNN, adopting the compound
scaling technique proposed in [32], until the latency constraint is
met. The proposed NAS methodology is named S3NAS since it
consists of three steps: Supernet design, Single-Path NAS, and Scaling
and post-processing.
For accurate latency estimation, an analytical latency estimator
is devised, based on a cycle-level NPU simulator that runs an entire
CNN considering the memory access overhead accurately. Since
the NPU assumed in this paper can execute depth-wise separable
convolution (DWConv), squeeze-and-excitation (SE), and h-swish
activation function efficiently, the proposed supernet prefers DWConv
to regular convolution. Observing that the accuracy improves
by around 1% if SE and the h-swish activation function are used,
we add a post-processing phase after a CNN is found by
NAS to add SE layers and to replace ReLU with the h-swish activation
function.
Experiments show that the proposed NAS technique could improve
the accuracy-latency tradeoff over existing SoTA CNN models.
Our best model achieves 82.72% top-1 accuracy on ImageNet
with 11.66ms latency without any special data augmentation.
Note that the latency is estimated by cycle-accurate simulation.
For a fair comparison with the related work, the latency of each
compared network is also estimated with the same simulator.
2 RELATED WORK
2.1 Neural Architecture Search (NAS)
After an automated NAS technique based on reinforcement learn-
ing successfully found a better CNN architecture than manually-
designed architectures [36], extensive research has been conducted
to develop various NAS techniques based on reinforcement learning
[31, 37]. However, these NAS techniques are computationally
intensive because they train each candidate architecture from
scratch to estimate its quality. Thus, the one-shot neural architecture
search approach [24] was introduced to reduce the search
cost. In this approach, an over-parameterized super-model network
is defined, and architecture search is performed by parameter opti-
mization to reduce the complexity of the network. Gradient-based
differentiable search has gained increasing popularity, and various
NAS techniques have been proposed with different super-models
and hyper-parameters [2, 5, 8, 19, 24].
Among diverse techniques to decrease the search cost, Single-
Path NAS [28] was recently proposed to find a good architecture
faster than the existing differentiable NAS techniques. This tech-
nique is extended to broaden the search space by including the
squeeze-and-excitation (SE) block in the search space [29]. Our
work is grounded on the original Single-Path NAS technique.
2.2 Hardware-friendly Neural Architecture
Design
Finding a hardware-friendly neural architecture has been facilitated
as NAS algorithms have improved. MNASNet [31] added a latency term in
the objective function to discover better architectures with a given
latency constraint on their target hardware platform. Efficient-
Net [32], whose search method is similar to MNASNet, introduced
a novel scaling method, called compound scaling, to find more accu-
rate networks as the latency constraint or FLOPS increases. Instead
of finding a network directly for a given long latency constraint,
they scale up the depth and the width of a small network with
shorter latency and the input image size in a balanced way. They
could achieve a set of networks with state-of-the-art performance
over a range of latency constraints. For hardware platforms that do
not support SE blocks and the swish activation function efficiently,
they removed these from the search space and named the resultant
network EfficientNet-lite.
While EfficientNet searches a set of networks over a range of
latency constraints by scaling up, Once-For-All [1] network takes
an opposite approach, scaling down. They first train a super-graph
architecture by a novel method called progressive shrinking and
then search for a sub-graph network that achieves good accuracy for a
given latency constraint without re-training, using only cheap fine-tuning. They
claim that a scaled-down network from the super-graph gives better
accuracy than a network that is trained from scratch. They could
[Figure 3 diagram: the network is drawn as a sequence of stages, each a sequence of blocks, starting from a [224, 224, 3] input. Width of block: the number of channels in the final output feature map of the block. Width of stage: the width of the final block in the stage. Cumulative depth up to stage S: the number of blocks from the first block of the CNN to the last block in stage S. Each block takes an input of shape [H, W, c], applies a 1x1 Conv (expansion layer) giving [H, W, ? x c], then parallel DW ?x? superkernels whose outputs are concatenated (ConCat) into [H/s, W/s, ? x c], then a 1x1 Conv giving [H/s, W/s, c']; the input is added to the output if stride = 1 and c = c'.]
Figure 3: Definitions used in this paper. We depict our search space as an example.
find more accurate networks than EfficientNet for small latency
constraints.
To explore more efficient neural architectures on specific hard-
ware, some NAS methods have proposed to define the design space
of architecture exploration, tailored for the hardware platform.
Gupta et al. [9] devised a building block named fused inverted
bottleneck convolution block and showed that this block is often
more efficient than MBConv on their target NPU, Edge-TPU. They
adopted the compound scaling method to find high-performing architectures
on Edge-TPU. Our work is closely related to this method.
We devise a building block that consists of parallel DWConv layers
with different kernel sizes, based on a preliminary experiment
showing that it is better than the alternative building blocks in
terms of performance per latency [33]. We also enlarge the search
space by allowing stages to have a different number of blocks in
the baseline supernet.
2.3 Deciding the Depth of Stages
A neural network typically consists of multiple stages, a sequence
of blocks with the same number of output channels (width). There
are studies on how to assign the number of blocks (depth) to each
stage. Meng et al. [21] observed that the way of assigning depth
to each stage affects the accuracy. Moreover, they argued that the
good depth assignment of each stage could be inherited from the
shallow ones as the total depth is increased, and proposed a layer-
growing NAS method that could significantly reduce the search
space. Furthermore, Radosavovic et al. [25] discovered that among
neural architectures with similar computational complexity, the
ones whose stage width and depth have a quantized linear relation-
ship tend to have higher accuracy. Based on similar observations,
we apply this design principle to change the structure of the conven-
tional One-Shot NAS supernet. In addition, we argue that placing
more blocks in a stage with a larger width is beneficial.
2.4 Depthwise Convolution with Multiple
Kernel Sizes
While the original DWConv block uses a single kernel size for
depthwise convolution, mixing multiple kernel sizes for depthwise
convolution, named MixConv [33], was recently proposed. Mixing
multiple kernel sizes can be understood as having parallel branches
inside a block. It is shown that MixConv is more efficient than ordi-
nary DWConv [33]. There exist some recent NAS methods [4, 20]
that also broaden their search space using DWConv with multiple
kernel sizes to find better neural architectures. We adopt this ap-
proach in the supernet and formulate a differentiable latency model
of this operation, enabling a latency-aware differentiable One-Shot
NAS with MixConv.
3 BACKGROUND
In this section, we will briefly review the Single-Path NAS tech-
nique and our target NPU. Before going further, we define some
terms used in this paper, as shown in Figure 3. A neural
architecture consists of stages at the top level. A stage consists of
a sequence of blocks whose output feature maps have the same
dimension. In the proposed supernet, a block is defined as MBConv
that typically starts with 1×1 conv (expansion layer) and ends with
1×1 conv. Adopting the MixConv approach, the depthwise convo-
lution layer consists of parallel superkernels whose kernel size will
be determined during the NAS process. The width of block denotes
the number of channels in the final output feature map of the block,
and the width of stage is the width of the final block in the stage.
We will call the total number of blocks starting from the very first
block in the network up to the last block in a specific stage S, as
the cumulative depth up to stage S.
3.1 Single-Path NAS
Differentiable NAS methods usually define architecture parameters
to choose which convolution layer to use in the block, training each
convolution layer independently. Single-Path NAS [28] reduces the
search cost by decreasing the number of trainable parameters by
sharing the kernel weights between convolution layers. The key
idea is designing an over-parameterized depthwise convolution
kernel named superkernel, and letting each depthwise convolution
kernel of candidate MBConvs directly inherit the weights of this
superkernel.
Let $w_{k,e}$ denote the depthwise convolution kernel of the candidate
MBConv with kernel size $k$ and expansion ratio $e$ ($\mathrm{MBConv}_{k,e}$).
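[Figure 4 diagram: the superkernel $w_{*,6}$ ($k = ?$) equals $w_{3,6}$ (used for $k \ge 3$) plus $\mathbf{1}(\text{use } w_{5\setminus 3,6}) \cdot w_{5\setminus 3,6}$ (used for $k \ge 5$), where $\mathbf{1}(\text{use kernel size 5}) = \mathbf{1}(\text{use } w_{5\setminus 3,6}) = \mathbf{1}(\|w_{5\setminus 3,6}\|^2 > t_{k=5})$, i.e., the outer part is used when $\|w_{5\setminus 3,6}\|^2$ is large enough.]
Figure 4: A searchable superkernel which can decide the kernel size.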
First, they introduce a large $w_{5,6}$, which is the DWConv kernel of
$\mathrm{MBConv}_{5,6}$. Then, the inner core of $w_{5,6}$ can be considered as $w_{3,6}$,
a DWConv kernel of $\mathrm{MBConv}_{3,6}$. A superkernel containing these
two kernel size options can be expressed as in Figure 4:
$$w_{*,6} = w_{3,6} + \mathbf{1}(\text{use kernel size 5}) \cdot w_{5\setminus 3,6} \tag{1}$$
where $w_{5\setminus 3,e}$ means the outer part, $w_{5,e} - w_{3,e}$. Next, they formulate
conditions to determine the kernel size. They define a certain
threshold value $t$ and compare the norm of the kernel weights with
the threshold. If the norm of a subset weight is larger than the
threshold, it remains in the supernet. To this end, Eq. (1) is changed
as follows:
$$w_{*,6}(t_{k=5}) = w_{3,6} + \mathbf{1}(\|w_{5\setminus 3,6}\|^2 > t_{k=5}) \cdot w_{5\setminus 3,6} \tag{2}$$
The threshold value is also trainable, so it is chosen automatically
during training. To enable back-propagation, they relax $\mathbf{1}(x > t)$
to $\sigma(x - t)$, where $\sigma$ is the sigmoid function, when computing gradients. In addition, they optimize
kernel weights and threshold values simultaneously. For a given
tight search time, this method is shown to be more effective than
the other methods [29].
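The kernel-size decision of Eq. (2) and the sigmoid surrogate used for gradients can be sketched as follows. This is a minimal illustration, assuming a single-channel 5x5 kernel stored as nested lists and a squared L2 norm inside the indicator; the helper names and all numeric values are hypothetical and are not taken from the Single-Path NAS code.

```python
import math

def split_superkernel(w5):
    """Split a 5x5 kernel into its 3x3 inner core and the outer ring (w5 minus core)."""
    core = [[w5[r][c] if 1 <= r <= 3 and 1 <= c <= 3 else 0.0 for c in range(5)]
            for r in range(5)]
    ring = [[w5[r][c] - core[r][c] for c in range(5)] for r in range(5)]
    return core, ring

def squared_norm(kernel):
    """Squared L2 norm of a 2-D kernel."""
    return sum(v * v for row in kernel for v in row)

def effective_kernel(w5, t_k5):
    """Eq. (2): keep the outer ring only if its squared norm exceeds the threshold."""
    core, ring = split_superkernel(w5)
    use_5x5 = 1.0 if squared_norm(ring) > t_k5 else 0.0
    kernel = [[core[r][c] + use_5x5 * ring[r][c] for c in range(5)] for r in range(5)]
    return kernel, int(use_5x5)

def relaxed_indicator_grad(x, t):
    """Gradient of sigmoid(x - t) w.r.t. x, used in place of the zero gradient of 1(x > t)."""
    s = 1.0 / (1.0 + math.exp(-(x - t)))
    return s * (1.0 - s)

if __name__ == "__main__":
    w5 = [[0.1] * 5 for _ in range(5)]   # hypothetical weights
    _, chosen = effective_kernel(w5, t_k5=0.5)
    print("uses 5x5 ring:", chosen)
```

With these hypothetical weights the outer ring has a squared norm of about 0.16, so a threshold of 0.5 reduces the superkernel to its 3x3 core, while a threshold of 0.1 keeps the full 5x5 kernel.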
Moreover, we can vary the number of channels by varying the
expansion ratio of each block: we can use only the first half of the channels
of $w_{5,6}$ and $w_{3,6}$ as $w_{5,3}$ and $w_{3,3}$, respectively. By defining another
set of trainable thresholds, the following formula is defined to
determine the expansion ratio:
$$w_{*,*}(t_{e=3}, t_{e=6}, t_{k=5}) = \mathbf{1}(\|w_{*,3}(t_{k=5})\|^2 > t_{e=3}) \cdot w_{*,3}(t_{k=5}) + \mathbf{1}(\|w_{*,3}(t_{k=5})\|^2 > t_{e=3}) \cdot \mathbf{1}(\|w_{*,6\setminus 3}(t_{k=5})\|^2 > t_{e=6}) \cdot w_{*,6\setminus 3}(t_{k=5}) \tag{3}$$
where $w_{k,6\setminus 3}$ means the remaining half of the channels, $w_{k,6} - w_{k,3}$.
Note that if $t_{e=3}$ is sufficiently large, all channels can be removed
to make the block a plain skip connection. Thus, they replace the
original depthwise convolution kernel of $\mathrm{MBConv}_{5,6}$ with $w_{*,*}$,
yielding a differentiable and searchable MBConv with respect to
the kernel size and expansion ratio.
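The nested structure of Eq. (3) can be read as a three-way decision. The sketch below assumes the two squared norms have already been computed; the function name and the example values are hypothetical.

```python
def choose_expansion(sq_norm_first_half, sq_norm_second_half, t_e3, t_e6):
    """Expansion ratio implied by Eq. (3): 0 (skip connection), 3, or 6.

    The second half of the channels can only be kept if the first half is kept,
    because its indicator is multiplied by the first-half indicator in Eq. (3).
    """
    keep_first = sq_norm_first_half > t_e3
    keep_second = keep_first and sq_norm_second_half > t_e6
    if not keep_first:
        return 0
    return 6 if keep_second else 3

if __name__ == "__main__":
    print(choose_expansion(0.9, 0.8, t_e3=0.5, t_e6=0.5))  # both halves kept
    print(choose_expansion(0.9, 0.1, t_e3=0.5, t_e6=0.5))  # first half only
    print(choose_expansion(0.2, 0.8, t_e3=0.5, t_e6=0.5))  # block becomes a skip
```

The third call shows why a large $t_{e=3}$ removes the whole block: the second half is discarded even though its own norm exceeds $t_{e=6}$.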
They also design a differentiable latency-aware loss function
to consider hardware latency in the search algorithm. To this end,
they define a function to estimate latency as follows:
$$L^l_e = \mathbf{1}(\|w_{*,3}\|^2 > t_{e=3}) \cdot \big(P^l_{5,3} + \mathbf{1}(\|w_{*,6\setminus 3}\|^2 > t_{e=6}) \cdot (P^l_{5,6} - P^l_{5,3})\big) \tag{4}$$
$$L^l = P^l_{3,6}/P^l_{5,6} \cdot L^l_e + \mathbf{1}(\|w_{5\setminus 3,6}\|^2 > t_{k=5}) \cdot L^l_e \cdot (1 - P^l_{3,6}/P^l_{5,6}) \tag{5}$$
where $P^l_{k,e}$ is a profiled latency value for $\mathrm{MBConv}_{k,e}$ for the $l$th
block in the supernet. Note that they used $P^l_{3,6}$, $P^l_{5,3}$, and $P^l_{5,6}$ only to
formulate $L^l$, and the latency for $\mathrm{MBConv}_{3,3}$ is approximated using
these values. The latency-aware loss function is designed as follows:
$$CE + \lambda \cdot \log\Big(\sum_l L^l\Big) \tag{6}$$
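As a sanity check on Eqs. (4) to (6), the sketch below evaluates the block latency for each hard decision. The profiled values are made up for illustration and the function names are hypothetical. It shows that the three profiled configurations are reproduced and that only the $\mathrm{MBConv}_{3,3}$ case is approximated.

```python
import math

def block_latency(p36, p53, p56, use_e3, use_e6, use_k5):
    """Latency of one block from Eqs. (4) and (5); the use_* arguments are 0/1 indicators."""
    l_e = use_e3 * (p53 + use_e6 * (p56 - p53))          # Eq. (4)
    ratio = p36 / p56
    return ratio * l_e + use_k5 * l_e * (1.0 - ratio)    # Eq. (5)

def latency_aware_loss(cross_entropy, block_latencies, lam):
    """Eq. (6): cross-entropy plus a weighted log of the summed block latencies."""
    return cross_entropy + lam * math.log(sum(block_latencies))

if __name__ == "__main__":
    p36, p53, p56 = 0.8, 0.6, 1.0   # hypothetical profiled latencies in ms
    print("MBConv5,6:", block_latency(p36, p53, p56, 1, 1, 1))  # recovers p56
    print("MBConv3,6:", block_latency(p36, p53, p56, 1, 1, 0))  # recovers p36
    print("MBConv5,3:", block_latency(p36, p53, p56, 1, 0, 1))  # recovers p53
    print("MBConv3,3:", block_latency(p36, p53, p56, 1, 0, 0))  # approximated as p53 * p36 / p56
    print("skip     :", block_latency(p36, p53, p56, 0, 0, 0))  # zero latency
```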
Finally, they search for a neural architecture in two phases. First,
they train the supernet by randomly choosing one of the candidate
subgraphs in each training step. In this phase, they use the cross-entropy
loss only. Next, they enable the latency-aware loss function and
train the supernet with it to decide the threshold values.
By doing this, they could obtain a high-quality neural architecture
with only eight epochs on the ImageNet training set.¹
3.2 Target NPU
Even though the proposed methodology can be applied to any type
of NPU, the current implementation is made for an adder-tree type
NPU, called MIDAP [15]. It has a fully-pipelined micro-architecture
that consists of separate hardware modules and memory modules
for convolution, activation function, and various reduction operations.
Since it enables us to make a fully static schedule of operations
without resource contention in the data path, we can estimate the
end-to-end latency of a CNN quite accurately with an analytical model.
An unexpected delay may arise from off-chip DRAM access latency that is
not fully hidden by double buffering.
Another good feature of MIDAP is that it efficiently supports
the following operations that would lower the MAC (multiply-
accumulate) utilization in other NPUs that have many MAC units:
pooling, DWConv, and squeeze-and-excitation (SE). For DWConv
operation, it does not use an adder tree but an alternative hardware
logic that consists of a set of individual accumulators connected
to the multiply units. For pooling and SE operations, reduction
logic is included in the pipeline. Note that MIDAP has not been
implemented as a real hardware chip yet but as a virtual prototype
with a cycle-accurate simulator. Thanks to the cycle-accurate simu-
lator that considers the DRAM access contention and parametrized
DRAM access delay, we could build an accurate analytical model for
end-to-end latency estimation, based on the profiling result with
the simulator.
Inverted bottleneck with depth-wise convolution (MBConv) [27]
is a popular building block in recent mobile-friendly networks.
However, it is not efficiently supported in existing NPUs that do
not have specialized hardware units for DWConv [7, 9]. Thus Gupta
et al. [9] replaced an MBConv block with a fused building block that
fuses an expansion layer and DWConv in MBConv into a single full
convolution. Even though the fused block increases the number of
multiplications significantly, it improves the MAC utilization by a larger
margin, so the fused block is observed to be faster than MBConv on their
target NPU, Edge-TPU. By adding this building block to their search
space, they could successfully obtain neural architectures
for Edge-TPU that differ from those for GPUs.
¹ In our implementation, we changed the probability of selecting each candidate
MBConv to be equal.
Since DWConv is efficiently supported in MIDAP, however, the
improvement of MAC utilization by fusing does not outweigh the
increased computation complexity, as observed in preliminary
experiments. The experimental setup is similar to the main experimental
setup that will be explained in Section 5.2. The experimental result
is shown in Table 1. The latency constraint for the fused block
experiment is set to 7.0 ms, while the others are set to 2.15 ms. In the
combined experiment, we use the fused block in the 1st and the
2nd stages and MBConv in the remaining stages, since the latency
gap between the two building blocks is too large. As shown in the table,
the MBConv block shows the best tradeoff between accuracy and
latency. Hence, we prefer MBConv to the fused building block as
the basic building block in the supernet for MIDAP.
Table 1: NAS comparison with different building blocks
Building block                    Accuracy (%)   Latency (ms)
Fused inverted bottleneck conv    77.34          6.90
MBConv                            77.75          2.05
Combined                          76.55          2.20
4 PROPOSED S3NAS METHODOLOGY
In this section, we explain the proposed S3NAS methodology that
consists of three steps as displayed in Figure 2.
4.1 Supernet Design
The number of blocks is one of the key parameters in neural net-
works. It is observed that the total number of blocks affects the
accuracy of neural architecture [11, 32]. In conventional One-Shot
NAS methods, each stage in the supernet has the same number of
blocks [2, 28, 34]. On the other hand, some recent studies [21, 25]
report that the way of assigning the number of blocks in each stage
has a noticeable impact on the accuracy, even with the same num-
ber of blocks in total. Hence we allow stages in the supernet to
have a different number of blocks.
We investigate the impact of assigning the number of blocks in
the supernet with another preliminary experiment. We construct
a network based on MobileNetV2, which has four blocks in every
stage, and observe the change in accuracy as we remove two
blocks from a different stage in each experiment. Figure 5 shows that
MBConvs with larger width have more impact on accuracy.
As the number of multiplications in a DWConv is $W \times H \times C \times K^2$,
DWConv layers in later stages tend to have shorter latency since the
reduction of $H \times W$ is larger than the increase of $C$. Thus, the impact
on latency of increasing the number of blocks in a later stage
is not significant, as displayed in Figure 5.
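The multiplication count can be evaluated directly for the stage shapes that label the x-axis of Figure 5. The sketch below assumes a 3x3 kernel and ignores the expansion ratio, which scales every stage by the same factor if it is the same in all stages; the function name is hypothetical.

```python
def dwconv_mults(width, height, channels, kernel):
    """Number of multiplications in one depthwise convolution: W * H * C * K^2."""
    return width * height * channels * kernel ** 2

# (channels, height, width) of each stage's output feature map, from the Figure 5 labels.
STAGES = {
    "1st": (24, 56, 56),
    "2nd": (32, 28, 28),
    "3rd": (64, 14, 14),
    "4th": (96, 14, 14),
    "5th": (160, 7, 7),
}

KERNEL = 3  # assumed kernel size, for illustration only

mults = {name: dwconv_mults(w, h, c, KERNEL) for name, (c, h, w) in STAGES.items()}

if __name__ == "__main__":
    for name, count in mults.items():
        print(f"{name}: {count} multiplications")
```

Under these assumptions the 5th stage needs roughly a tenth of the multiplications of the 1st stage, even though it has more than six times as many channels.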
Thus, we place more blocks in stages with larger width in the
supernet, making the cumulative depth up to a specific stage
proportional to the width of the stage, which is similar to PyramidNet
[10]. A recent study [25] also claims that neural architectures
with a linear relationship between the cumulative depth and the
width tend to have higher accuracy for a similar amount of
computation. Our experiment shows that our modification
to the supernet enhances the efficiency of the search result in terms of
accuracy as well as latency (Table 4).
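One way to turn this design rule into per-stage depths is sketched below. The stage widths, the block budget, and the rounding rule are all assumptions made for illustration; they are not the values or the procedure used for the S3NAS supernet.

```python
def depths_from_widths(widths, total_blocks):
    """Per-stage depths such that cumulative depth is roughly proportional to stage width.

    Assumes `widths` is non-decreasing and that the last stage is the widest.
    """
    widest = widths[-1]
    depths, previous = [], 0
    for width in widths:
        cumulative = round(total_blocks * width / widest)
        cumulative = max(cumulative, previous + 1)  # every stage keeps at least one block
        depths.append(cumulative - previous)
        previous = cumulative
    return depths

if __name__ == "__main__":
    # Hypothetical stage widths and block budget.
    print(depths_from_widths([32, 64, 128, 256], total_blocks=16))
```

For the hypothetical widths above, the cumulative depths are 2, 4, 8, and 16, so wider stages receive more blocks than they would under a uniform assignment of four blocks per stage.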
[Figure 5 plot: x-axis: stage index (output feature map dimensions): 1st (24x56x56), 2nd (32x28x28), 3rd (64x14x14), 4th (96x14x14), 5th (160x7x7). Y-axes: impact on top-1 accuracy (%) and impact on latency on NPU (ms). Series: 1x1 Conv latency, DWConv latency, sum latency, and impact on accuracy.]
Figure 5: Impact of reducing the number of blocks in different
stages in a MobileNetV2-based model
Another feature of the proposed supernet is to use mixed con-
volution (MixConv) that mixes different kernel sizes in the depth-
wise convolution layer [33]. Some recent NAS methods [4, 20] also
broaden their search space using DWConv with various kernel
sizes and could find better neural architectures.
Figure 6 depicts our building block structure. This block starts
and ends with 1×1 convolution, with N searchable superkernels
in the middle. Each searchable superkernel is designed similarly
to Eq. (3), while we may use different threshold values in each
superkernel. The kernel sizes and expansion ratios are selected
among predetermined values. If the j-th searchable superkernel
chooses an expansion ratio ej , the j-th kernel has ej times more
channels than the first 1×1 convolution. Compared with the original
MixConv suggested in [33], the proposed building block supports
more diverse combinations of kernel sizes and expansion ratios. It
enhances the efficiency of search results on our target NPU (Table 5).
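[Figure 6 diagram: a 1x1 Conv, followed by parallel DW ?x? superkernels whose outputs are concatenated (ConCat), followed by a 1x1 Conv; a skip-connection adds (⊕) the block input to the output.]
Figure 6: Our MixConv-based building block for supernet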
We finish this subsection by highlighting the merit of Single-Path
NAS in building a MixConv-based differentiable NAS.
Conventional multi-path NAS methods would have difficulties when
adding inverted bottleneck convolution with MixConv to their
search space. Since the number of possible choices of such blocks
grows proportionally to the partition number, multi-path NAS
methods would introduce a significant increase in memory requirements
and search time. On the contrary, MixConv can be efficiently
supported in Single-Path NAS, as explained below.
4.2 Modified Single-Path NAS
We use a latency estimation model and a loss formula that differ from
those of the original Single-Path NAS technique explained in Section 3.1.
4.2.1 Differentiable Latency Model. Suppose we concatenate $N$
searchable superkernels to build a MixConv-based building block,
and let $\vec{k} = (k_1, \cdots, k_N)$, $\vec{e} = (e_1, \cdots, e_N)$, where $k_j$ and $e_j$ denote the
kernel size and the expansion ratio of the $j$th searchable superkernel.
The estimated latency of a DWConv operation depends on the
kernel size and the expansion ratio.
For latency formulation, we first define two condition variables,
$F_{j,k_j}$ and $G_{j,e_j}$, that denote whether the $j$th searchable superkernel
chooses the kernel size $k_j$ and the expansion ratio $e_j$, respectively.
For example, $F_{j,k_j}$ is 1 if and only if the $j$th searchable superkernel
chooses $k_j$, and 0 otherwise.
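Let $\kappa_1 < \cdots < \kappa_K$ be the candidate kernel sizes, and let
$0 = \epsilon_1 < \cdots < \epsilon_E$ be the candidate expansion ratios of the $j$th searchable
superkernel. Suppose $k_j = \kappa_c$; then $F_{j,k_j}$ can be
formulated as follows:
$$F_{j,k_j} = \Big( \prod_{2 \le i \le c} \mathbf{1}(\|w_{j,\kappa_i \setminus \kappa_{i-1},\epsilon_E}\|^2 > t_{j,\kappa_i}) \Big) \cdot f_{j,k_j}, \quad \text{where} \quad f_{j,k_j} = \begin{cases} \mathbf{1}(\|w_{j,\kappa_{c+1} \setminus \kappa_c,\epsilon_E}\|^2 < t_{j,\kappa_{c+1}}), & \text{if } c < K \\ 1, & \text{if } c = K \end{cases} \tag{7}$$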
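[Figure 7 diagram: the $j$th superkernel that chose $\kappa_2$ ($k_j = \kappa_2$) is $w_{j,\kappa_1,\epsilon_E}$ (used by default) $+\ 1 \cdot w_{j,\kappa_2\setminus\kappa_1,\epsilon_E}$ (used iff $k_j \ge \kappa_2$) $+\ 0 \cdot w_{j,\kappa_3\setminus\kappa_2,\epsilon_E}$ (used iff $k_j \ge \kappa_3$), with $F_{j,\kappa_2} = \mathbf{1}(\text{use } w_{j,\kappa_2\setminus\kappa_1,\epsilon_E}) \cdot \mathbf{1}(\text{not use } w_{j,\kappa_3\setminus\kappa_2,\epsilon_E}) = \mathbf{1}(\|w_{j,\kappa_2\setminus\kappa_1,\epsilon_E}\|^2 > t_{j,\kappa_2}) \cdot \mathbf{1}(\|w_{j,\kappa_3\setminus\kappa_2,\epsilon_E}\|^2 < t_{j,\kappa_3})$.]
Figure 7: An example of Eq. (7) when a searchable superkernel
chose kernel size $\kappa_2$.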
Figure 7 depicts an example of this formula when the $j$th searchable
superkernel, which has four candidate kernel sizes $\kappa_1 < \cdots < \kappa_4$,
chooses $\kappa_2$ as the kernel size: $k_j = \kappa_2$. It means that the weights $w_{j,\kappa_1,\epsilon_E}$
and $w_{j,\kappa_2\setminus\kappa_1,\epsilon_E}$ are used, but the remaining weights starting from
$w_{j,\kappa_3\setminus\kappa_2,\epsilon_E}$ are not used. Since $w_{j,\kappa_1,\epsilon_E}$ is always used, it is not
included in the formula. To use $w_{j,\kappa_2\setminus\kappa_1,\epsilon_E}$, its norm has to be
larger than $t_{j,\kappa_2}$, while the norm of $w_{j,\kappa_3\setminus\kappa_2,\epsilon_E}$ should not be larger
than $t_{j,\kappa_3}$ to avoid the use of larger kernel sizes.
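The indicator of Eq. (7) can be sketched as follows for hard (0/1) decisions. The ring norms and thresholds are passed as plain lists, where entry 0 belongs to the ring $\kappa_2 \setminus \kappa_1$; the function name and the numeric values are hypothetical.

```python
def kernel_choice_indicator(c, ring_sq_norms, thresholds):
    """F_{j,k_j} of Eq. (7) for k_j = kappa_c, with c counted from 1.

    ring_sq_norms[i - 2] and thresholds[i - 2] belong to the ring
    kappa_i \\ kappa_{i-1}, for i = 2..K.
    """
    num_sizes = len(ring_sq_norms) + 1   # K
    value = 1
    for i in range(2, c + 1):            # every ring up to kappa_c must be kept
        value *= int(ring_sq_norms[i - 2] > thresholds[i - 2])
    if c < num_sizes:                    # the next ring must be rejected
        value *= int(ring_sq_norms[c - 1] < thresholds[c - 1])
    return value

if __name__ == "__main__":
    norms = [0.9, 0.2, 0.7]       # hypothetical squared norms of the three rings
    thresholds = [0.5, 0.5, 0.5]  # hypothetical thresholds
    print([kernel_choice_indicator(c, norms, thresholds) for c in range(1, 5)])
```

In this example only the indicator for $\kappa_2$ is 1. The third ring exceeds its threshold, but it cannot be selected because the second ring was rejected, which matches the nesting shown in Figure 7.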
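We can formulate $G_{j,e_j}$ similarly:
$$G_{j,e_j} = \Big( \prod_{2 \le i \le d} \mathbf{1}(\|w_{j,*,\epsilon_i \setminus \epsilon_{i-1}}\|^2 > t_{j,\epsilon_i}) \Big) \cdot g_{j,e_j}, \quad \text{where} \quad g_{j,e_j} = \begin{cases} \mathbf{1}(\|w_{j,*,\epsilon_{d+1} \setminus \epsilon_d}\|^2 < t_{j,\epsilon_{d+1}}), & \text{if } d < E \\ 1, & \text{if } d = E \end{cases}$$
when $e_j = \epsilon_d$. Then the condition for a MixConv-based building
block to choose $\vec{k}$, $\vec{e}$ can be expressed as $\prod_{j}^{N} F_{j,k_j} G_{j,e_j}$.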
Now, the estimated latency of a single block is formulated as
follows:
$$L = \sum_{\vec{k}, \vec{e}} \Big( P(\vec{k}, \vec{e}) \prod_{j}^{N} F_{j,k_j} G_{j,e_j} \Big) \tag{8}$$
where $P(\vec{k}, \vec{e})$ denotes the profiled latency value of a MixConv-based
building block corresponding to $\vec{k}$, $\vec{e}$.
Unlike the original Single-Path NAS, which approximates the latency
in Eq. (5) in some cases, we use the profiled latency value in
all cases. Note that an expansion ratio can be zero, and if only one
superkernel has a nonzero expansion ratio, the MixConv block is
reduced to a plain MBConv block. Finally, we estimate the total
latency by summing these estimated latencies over all blocks,
$\sum L$.
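Eq. (8) amounts to a table lookup weighted by the indicators. The sketch below assumes hard indicators given as dictionaries and a profiled-latency table keyed by the pair of tuples ($\vec{k}$, $\vec{e}$); the data layout, the function name, and the latency value are hypothetical.

```python
from itertools import product

def mixconv_block_latency(profiled, F, G):
    """Eq. (8): sum over all (k, e) of P(k, e) * prod_j F[j][k_j] * G[j][e_j].

    F[j] maps each candidate kernel size of superkernel j to its indicator,
    G[j] maps each candidate expansion ratio of superkernel j to its indicator.
    """
    n = len(F)
    total = 0.0
    for ks in product(*[list(F[j]) for j in range(n)]):
        for es in product(*[list(G[j]) for j in range(n)]):
            weight = 1
            for j in range(n):
                weight *= F[j][ks[j]] * G[j][es[j]]
            if weight:  # configurations with zero weight contribute nothing
                total += profiled[(ks, es)] * weight
    return total

if __name__ == "__main__":
    # Two superkernels: the first chose (k=5, e=3), the second chose (k=3, e=3).
    F = [{3: 0, 5: 1}, {3: 1, 5: 0}]
    G = [{0: 0, 3: 1}, {0: 0, 3: 1}]
    profiled = {((5, 3), (3, 3)): 0.42}  # hypothetical profiled latency in ms
    print(mixconv_block_latency(profiled, F, G))
```

With hard indicators exactly one configuration has a nonzero product, so the sum returns that configuration's profiled latency.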
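[Figure 8 diagram: left, a block whose parallel DWConv branches are DW 5x5, DW 7x7, and DW 5x5; right, the equivalent block with branches DW 7x7 and DW 5x5. In both, the branches sit between two 1x1 Convs, are joined by ConCat, and are bypassed by a skip-connection (⊕).]
Figure 8: Two implementations of the same block. While their
FLOPS are identical, their estimated latencies are different.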
Since each superkernel is treated independently, some superkernels
may have the same kernel size and expansion ratio. Then,
even if two superkernel configurations express an equivalent block,
as illustrated in Figure 8, they may have different estimated latency
values, which is an artifact of the proposed profiling-based latency
estimation method. To avoid this artifact, we enforce that there is
only one kernel for each kernel size in the MixConv block. That is,
we merge two kernels of the same size into one; for instance, the
left MixConv is translated to the right MixConv in Figure 8 before
latency estimation.
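The merging step can be sketched as a canonicalization applied before the latency lookup. The sketch assumes that merging two kernels of the same size adds their expansion ratios (their channels are concatenated) and that branches with a zero expansion ratio are dropped; both rules and the function name are assumptions, not details confirmed by the text.

```python
def canonicalize(kernel_sizes, expansion_ratios):
    """Merge superkernels with the same kernel size into one branch.

    Returns (kernel sizes, expansion ratios) sorted by kernel size, so that
    equivalent blocks map to the same key of a profiled-latency table.
    """
    merged = {}
    for k, e in zip(kernel_sizes, expansion_ratios):
        if e > 0:  # a zero expansion ratio means the branch is absent
            merged[k] = merged.get(k, 0) + e
    sizes = tuple(sorted(merged))
    return sizes, tuple(merged[k] for k in sizes)

if __name__ == "__main__":
    # Left block of Figure 8: DW 5x5, DW 7x7, DW 5x5 (hypothetical expansion ratios of 2 each).
    print(canonicalize((5, 7, 5), (2, 2, 2)))
```

Both blocks of Figure 8 then share one table entry, so their estimated latencies agree.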
Figure 9 shows the estimated latency and simulated latency of
100 randomly generated models in our search space. It validates
the accuracy of the proposed latency model, whose mean absolute
percentage error (MAPE) is about 0.16%.
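For reference, MAPE over paired estimates and simulator measurements is computed as below. The latency values in the usage example are made up and are not the data behind Figure 9.

```python
def mape(estimated, simulated):
    """Mean absolute percentage error of the estimates, relative to the simulated values."""
    errors = [abs(e - s) / s for e, s in zip(estimated, simulated)]
    return 100.0 * sum(errors) / len(errors)

if __name__ == "__main__":
    estimated = [2.00, 2.41, 2.79]   # hypothetical estimated latencies in ms
    simulated = [2.01, 2.40, 2.80]   # hypothetical simulated latencies in ms
    print(f"MAPE = {mape(estimated, simulated):.2f}%")
```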
4.2.2 Differentiable Loss with Target Latency. Most existing hardware-aware
differentiable NAS methods, including Single-Path NAS,
whose loss function is defined in Eq. (6), define some hyperparameters
to balance accuracy and latency. Since there is no
information on the target latency in the loss function, in case there
is a strict latency constraint, they have to pay additional search
costs to tune the hyperparameters so that the final architecture has no
larger latency than the constraint. In addition, this process needs
to be repeated whenever the target latency is changed.
Figure 9: Accuracy of the proposed latency estimation
model: MAPE is about 0.16%. (Scatter plot of simulated latency (ms)
against estimated latency (ms), both ranging from 1.8 to 3, with the
x = y line.)
We propose to modify the loss function to activate the latency-
aware loss term only when the estimated latency is larger than the
latency constraint as follows:
$$CE + \lambda_1 \cdot \log\Big(1 + \lambda_2 \cdot \mathrm{ReLU}\Big(\big(\sum L\big) - T\Big)\Big) \qquad (9)$$
Although this is not a panacea, this modification significantly eases
the search process, which will be discussed in section 5.2 with
various experiments.
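The behavior of Eq. (9) can be illustrated with a scalar sketch. It is not the differentiable implementation used in the search: the function name is hypothetical, `target` stands for the latency constraint T, and the default values of `lam1` and `lam2` are the second-phase settings reported in Section 5.1.

```python
import math


def latency_aware_loss(cross_entropy, estimated_latency, target,
                       lam1=15.0, lam2=100.0):
    """Scalar form of Eq. (9): CE + lam1 * log(1 + lam2 * ReLU(sum(L) - T))."""
    # ReLU: the latency term is active only above the constraint.
    excess = max(0.0, estimated_latency - target)
    return cross_entropy + lam1 * math.log(1.0 + lam2 * excess)
```

When the estimated latency is at or below the constraint, the penalty vanishes and the loss reduces to the cross entropy; above the constraint it grows logarithmically with the excess latency.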
4.3 Post-processing
In the second step, we intentionally use a shorter latency constraint
to reduce the search space for the baseline network. After finding the
baseline network with a shorter latency, we apply compound scaling to
find an architecture that meets the final latency constraint. In this
step, we also conduct post-processing to add SE blocks and the h-swish
activation function if beneficial.
4.3.1 Compound Scaling. It is well known that increasing depth [11],
width [35], or input image size improves accuracy while increasing
latency. However, if only one of these three factors is increased,
the accuracy improvement saturates quickly. Observing this fact,
Tan et al. [32] proposed a compound scaling method that increases
all three factors together. A scaling coefficient is defined for each
factor. By judiciously assigning the scaling coefficients in a balanced
fashion, they could improve the accuracy much more than by scaling
a single factor only. Adopting this approach, we apply compound
scaling to the baseline architecture obtained in the previous
step. Based on the ratio between the true latency constraint and the
assumed latency constraint in the second step, we find the scaling
coefficients considering the estimated latency increment. To keep
the linear relationship between the width and cumulative depth,
we use the same scaling coefficient for width and depth, unlike
[32]. Note that how to realize scaling depends on the baseline
architecture. While the baseline architecture assumed in [32] has a
series of identical blocks in each stage, a stage consists of
heterogeneous blocks in our baseline architecture. Thus depth scaling is
not realized by merely adding new blocks in each stage: we need
to choose what types of blocks to add in each stage. We first increase
the number of blocks that have more parameters. To compute how
many blocks to add in a stage, we multiply the depth of the stage by
the depth coefficient and round the result. Width scaling
is applied to all blocks equally. Finally, we consider latency when
we scale.
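The depth computation can be sketched as follows. The function name and the coefficient value in the usage note are hypothetical, and rounding half up is an assumption, since the rounding mode is not specified above.

```python
import math


def scale_depths(stage_depths, depth_coefficient):
    """Multiply each stage depth by the depth coefficient and round the result."""
    # floor(x + 0.5) rounds half up; Python's built-in round() rounds half to even.
    return [int(math.floor(d * depth_coefficient + 0.5)) for d in stage_depths]
```

For example, with stage depths of 3, 4, 7, 4, and 11 and an illustrative coefficient of 1.2, the scaled depths are 4, 5, 8, 5, and 13. Which block types fill the added slots (blocks with more parameters first) is decided separately.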
4.3.2 Add h-swish and SE. In addition to compound scaling, we
add two components in the post-processing step: the h-swish activation
function and the squeeze-and-excitation (SE) block. A recent study [23]
reports that SE and the h-swish activation function do not hinder
8-bit quantization: a network with SE and h-swish could be quantized
without noticeable accuracy loss.
Extensive studies have been conducted to find a better activation
function than ReLU, which led to the swish activation function [26].
Several neural networks [20, 32, 33] use the swish activation
function instead of ReLU to improve accuracy. Howard et al. [12]
proposed a quantization-friendly version of the swish activation
function, called h-swish, that has a similar impact on accuracy. So,
we replace ReLU with the h-swish [12] activation function.
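For reference, the two activation functions can be written in scalar form as below, following their definitions in [26] and [12]: swish(x) = x · sigmoid(x) and h-swish(x) = x · ReLU6(x + 3) / 6. The function names are ours.

```python
import math


def swish(x):
    """swish(x) = x * sigmoid(x), as defined in [26]."""
    return x / (1.0 + math.exp(-x))


def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, as defined in [12]."""
    relu6 = min(max(x + 3.0, 0.0), 6.0)
    return x * relu6 / 6.0
```

h-swish replaces the sigmoid with a piecewise-linear function, which is what makes it friendlier to quantization.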
Squeeze-and-Excitation (SE) is a lightweight operation which is
shown to be beneficial to accuracy [13]. Figure 10 depicts the
structure of an SE block. For a given input feature map, it first computes
a representative value for the
global spatial information of each feature channel by global average
pooling. After this squeeze operation generates channel-wise
statistics, the excitation operation captures channel-wise dependencies by
two cascaded fully-connected layers to produce activation values,
which represent the importance of each feature channel. Finally,
channel-wise multiplication is performed between the activation
values induced by the excitation operation and the input feature
map for each channel. The SE block is used in many recent
architectures [12, 25, 32]. By adding SE blocks to the baseline network, we
also observe an accuracy improvement.
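The squeeze, excitation, and channel-wise multiplication steps described above can be sketched in plain Python. This is an illustration on nested lists, not the trained-network implementation: the function name and the weight layout (`w1` of shape C x R, `w2` of shape R x C, with R the hidden size) are assumptions.

```python
import math


def se_block(x, w1, b1, w2, b2):
    """Apply an SE block to a feature map x given as H x W x C nested lists.

    Returns the rescaled feature map and the per-channel activation values.
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    r = len(b1)
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(x[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
         for ch in range(c)]
    # Excitation, first layer: fully-connected + ReLU.
    hidden = [max(0.0, sum(z[ch] * w1[ch][k] for ch in range(c)) + b1[k])
              for k in range(r)]
    # Excitation, second layer: fully-connected + sigmoid.
    s = [1.0 / (1.0 + math.exp(-(sum(hidden[k] * w2[k][ch] for k in range(r)) + b2[ch])))
         for ch in range(c)]
    # Channel-wise multiplication of the input by the activation values.
    out = [[[x[i][j][ch] * s[ch] for ch in range(c)] for j in range(w)]
           for i in range(h)]
    return out, s
```

The returned activation values `s` are the quantities whose distribution over images is examined in Section 4.3.3.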
Figure 10: Structure of SE block. (The input feature map [H, W, C] is
squeezed by global average pooling; the excitation applies FC + ReLU
followed by FC + Sigmoid to produce the activation of excitation
[1, 1, C], which is combined with the input by channel-wise
multiplication to give the output [H, W, C].)
4.3.3 Selective removal of SE. Figure 11 depicts an example
distribution of activation values produced by two different SE blocks
for three different images. The authors of the original paper [13]
conjectured that if such a distribution from an SE block does not differ
widely between image classes, the SE block is not important. Thus,
after training, they obtained averaged activation values of an SE
block over multiple images in the same class. They compared the
distributions of the averaged values over different image classes.
They observed that removing the SE blocks that have similar
distributions over different image classes incurs only a marginal loss in
accuracy.
Figure 11: Distribution of activation values produced by two
different SE blocks for three different images. (Two panels; x-axis:
channel index; y-axis: activation, from 0 to 1; legend: image 0,
image 1, image 2.)
Inspired by this observation, we propose to remove SE blocks
selectively to minimize the additional computation cost caused by
SE blocks. We obtain activation values from an SE block for each
input image and measure how the distribution of activation values
varies over different input images. For each channel c, we calculate
the standard deviation σc of activation values over different images.
If σc is small in most channels, the activation values from the SE
block do not differ much over images. Conceptually, it implies that
the SE block does not help to discriminate further which channel
is more influential. From the engineering perspective, it means
that the channel-wise multiplication of the SE block is similar to a constant
multiplication, which can be handled by the following convolutional
layer.
We define a metric, the average of the standard deviation values
σc over all channels, that represents the diversity of the activation
distribution over different images. If the metric value is small, we
remove the SE block. For example, in Figure 11, our metric of the
SE block on the left side has a value of 0.021, while the right side
has a value of 0.118, more than 5x larger than the left side; the left
side is a better candidate for SE block removal. When we remove SE
blocks according to this metric, the accuracy is found to be similar,
while the latency becomes shorter (Table 6).
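The metric can be sketched as follows. The function names are hypothetical, the use of the population standard deviation is an assumption, and the removal threshold is a free parameter, since no specific value is given above.

```python
from statistics import pstdev


def se_diversity_metric(activations):
    """Average over channels of the per-channel standard deviation.

    activations[i][c]: activation value of channel c of one SE block
    for input image i.
    """
    num_channels = len(activations[0])
    # sigma_c: spread of channel c's activation over the sampled images.
    sigmas = [pstdev([image[c] for image in activations])
              for c in range(num_channels)]
    return sum(sigmas) / num_channels


def should_remove_se(activations, threshold):
    """Mark an SE block for removal when its metric is below the threshold."""
    return se_diversity_metric(activations) < threshold
```

A block whose activation values are nearly identical for all images gets a metric close to zero, which corresponds to the near-constant multiplication case discussed above.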
5 EXPERIMENTS
5.1 Setup
We evaluate the proposed NAS technique for image classification
with the ImageNet dataset. The current implementation targets
MIDAP [15], which can perform DWConv and SE operations
efficiently, so MBConv is preferred to full 3-D convolution as the
basic building block, as explained above. Latencies on the target
NPU are obtained with the cycle-accurate simulator (footnote 2).
A superkernel has two parameters to search: expansion ratio
and kernel size. To limit the search space, we choose the expansion
ratio among 0, 2, 4, and 6, and the kernel size between 3 and 5 when
MBConv or full convolution is used as the building block. In the
case of the MixConv-based building block, we use N=3 superkernels
whose expansion ratio is 0 or 2; the sum of the expansion ratios of
the three superkernels has the same range as the expansion ratio of a
single MBConv block. To allow the three superkernels to have different
kernel sizes, we let one of the three superkernels be able to have 7 as
the kernel size.
Footnote 2: https://github.com/cap-lab/MidapSim
In the first phase of the neural architecture search, we train the
supernet by randomly choosing one of the candidate subgraphs in
each training step. We train the supernet for 8 epochs, with λ1 = 0
in the loss function of Eq. (9), focusing only on the accuracy. We
decay the learning rate by a factor of 0.97 every 2.4 epochs, starting from
0.064. The other settings for network training are displayed in Table 2.
Gradient clipping with a value of 10 is used in this phase. In the
second phase, we set λ1 = 15, λ2 = 100 to consider latency in the
loss function, and optimize the weights and threshold values of the
supernet for 2 epochs. After this second phase finishes, the final
architecture topology is decided.
Next, we train the final architecture from scratch on ImageNet for
350 epochs to determine the filter weights, using the same
settings described in Table 2. Unlike the search phase, the learning
rate is increased from 0 to 0.064 in the first 5 epochs, then decayed
by a factor of 0.97 every 2.4 epochs. Since we observed that the batch size is
critical to accuracy when using the EfficientNet training code, we
use a large batch size. Both network architecture search and final
training are conducted on Google Cloud TPUs.
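The learning rate schedule described above can be sketched as a function of the epoch. Several details are assumptions made for this illustration: the warmup is taken to be linear, the decay is applied in discrete steps (staircase), and the decay is counted from the end of the warmup.

```python
import math


def learning_rate(epoch, base_lr=0.064, decay=0.97, decay_epochs=2.4,
                  warmup_epochs=5.0):
    """Warmup followed by a step-wise exponential decay."""
    if epoch < warmup_epochs:
        # Assumed linear warmup from 0 to base_lr.
        return base_lr * epoch / warmup_epochs
    # Assumed staircase decay: one factor of `decay` per completed interval.
    steps = math.floor((epoch - warmup_epochs) / decay_epochs)
    return base_lr * decay ** steps
```

Setting `warmup_epochs=0` gives the schedule of the search phase, which starts directly from 0.064.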
Table 2: Settings for network training, which are similar
to [32] (see footnote 3).

Setting                      Value
train batch size             1024
optimizer                    RMSProp with decay 0.9, momentum 0.9, epsilon 0.001
image preprocessing          Inception preprocessing
weight decay                 1e-5
label smoothing              0.1
stochastic depth [14]        0.2
exponential moving average   0.9999
5.2 Supernet Design
Table 3: The supernet architecture of the proposed NAS technique

input         block      width   depth   strides
224^2 × 3     7×7 conv   32      1       2
112^2 × 32    TBD        32      d1      2
56^2 × 32     TBD        64      d2      2
28^2 × 64     TBD        128     d3      2
14^2 × 128    TBD        160     d4      1
14^2 × 160    TBD        256     d5      2
7^2 × 256     fc         1280    1       1
7^2 × 1280    avgpool    1280    1       1
1^2 × 1280    fc         1000    1       1
In the proposed NAS technique, two major extensions are made
to the supernet, compared with the original Single-Path NAS
technique. Table 3 shows the proposed supernet architecture with
configuration parameters, block types, and depths. It starts with a 7x7
convolution layer, followed by 5 stages that have a different number
of blocks for feature extraction and 2 fully-connected networks for
classification.

Footnote 3: The setting is similar to the EfficientNet training code:
https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Figure 12: ours-M+ architecture. Height of each block in the picture is proportional to expansion ratio. SE-applied blocks are
depicted as dotted blocks.
The first extension is to allow stages to have a different number
of blocks. To verify the benefit of this extension, we design two
kinds of MBConv-based supernets with 20 blocks in total: a supernet
with constant depth (baseline) and a supernet with linear depth, where
the cumulative depth up to a specific stage is proportional to the
width of the stage.
Table 4: Comparison of two MBConv-based supernets with
different block assignment schemes to stages

strategy              accuracy (%)   latency (ms)
constant (baseline)   75.85          1.28
linear                77.09          1.24
As shown in Table 4, the supernet with linear depth outperforms
the supernet with constant depth in terms of accuracy at a similar
latency. It confirms that this simple change of block assignment
in the supernet gives a notable accuracy boost under the same latency
constraint, without any additional optimization techniques.
Table 5: Comparison of two supernets with different
building blocks: MBConv and MixConv

building block      accuracy (%)   latency (ms)
DWConv (baseline)   75.85          1.28
MixConv             76.58          1.30
The second extension is to use multiple parallel superkernels
in an MBConv block. To verify its benefit, we compare two
different supernets with the same number of blocks in each stage.
The accuracy and latency of the baseline supernet
are the same as in the previous experimental result shown in Table 4.
Table 5 shows that the extended supernet with MixConv-based
building blocks gives a better accuracy-latency tradeoff.
5.3 Comparison with existing models
We apply the proposed NAS method with the supernet architecture
described above. The depths of the 5 stages are set to 3, 4, 7, 4, and 11,
respectively. The latency constraint is set to 2.5 ms, which corresponds to
the latency of EfficientNet-B1 on our target NPU, MIDAP. Table 6
compares our search results with the state-of-the-art models:
EdgeTPU [9], EfficientNet [32], and Once-For-All [1]. The latency of the
other models is obtained by running the network on the MIDAP
cycle-accurate simulator. We compare the accuracy without
quantization, assuming that quantization effects will be similar for all
models.
As shown in Table 6, the baseline model, ours-M, found by the
proposed NAS technique has higher accuracy than the other models
on our target NPU; ours-M achieves more than 1.7% higher top-1
accuracy than EfficientNet-lite2 with similar latency. Moreover, it
is 0.5% higher than EfficientNet-B1, even without using SE and the
h-swish activation function. Note that the number of parameters and
the number of FLOPS in ours-M are larger than in EfficientNet-B1. It
implies that the complexity of the network is not a direct indicator
of the end-to-end latency of the network. The end-to-end latency
depends on the NPU architecture, and the proposed NAS technique
could find a larger network with shorter latency by adding the
latency factor to the loss function directly. The main benefit comes
from the different block assignment to stages.
We improve the baseline network by adding the h-swish activation
function and squeeze-and-excitation (SE) blocks to get the ours-M+
model. Figure 12 shows the topology of the ours-M+ architecture,
in which the height of each block is proportional to the expansion
ratio of the block. Compared with the baseline network, ours-M, we
achieve around a 1% accuracy boost with ours-M+, at the cost
of a 16% latency increase. This model outperforms the other models,
with 0.5% higher accuracy and 14% lower latency than EfficientNet-B2. Since
EfficientNet-B2 is too large to run with the default configuration
on MIDAP, we increase the memory size for filter weights.
Next, we applied compound scaling [32] to ours-M+ to obtain
ours-L+ and ours-XL+. When we determine the scaling coefficients, we
keep the linear relationship between the cumulative depth and
width of each stage, and scale the input image size more
aggressively than [32]. We make the number of filters a multiple
of 16 to maximize the MAC unit utilization on MIDAP. When we
train our scaled models, we set the dropout ratio to 0.4, similar to
EfficientNet-B4 training. The accuracy of ours-L+ is higher than that of
EfficientNet-B3 and EfficientNet-lite4, while the accuracy of
ours-XL+ is similar to that of EfficientNet-B4. Note that the difference between
the searched network and EfficientNet decreases as the network
size increases.
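Keeping the number of filters a multiple of 16 during width scaling can be sketched as below. The function name and the coefficient values in the usage note are hypothetical, and rounding to the nearest multiple is an assumption; rounding up would be an equally plausible choice.

```python
def scale_width(num_filters, width_coefficient, divisor=16):
    """Scale a filter count and snap it to a multiple of `divisor`."""
    scaled = num_filters * width_coefficient
    # Round to the nearest multiple of `divisor`, but never below one multiple.
    snapped = int(scaled + divisor / 2) // divisor * divisor
    return max(divisor, snapped)
```

For instance, scaling 160 filters by an illustrative coefficient of 1.2 gives 192, while scaling 64 filters by 1.4 gives 96 instead of 89.6.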
Table 6: Performance comparison among network models with the ImageNet dataset. (∗: Trained again with our training code.
∗∗: Trained again with official EfficientNet code using baseline preprocessing only. ††: We convert the wall clock time to
GPU-hours [29] to compare with the other methods. +: using squeeze-excitation and the h-swish function. †: When we compare
the search cost, we compare the time needed to get one final architecture. ‡: The search time contains training time.)

Model                    FP32 Top-1 acc (%)   latency (ms)   #Params       #FLOPS         Search cost (h)
EdgeTPU-S [9]            77.23                4.41           5.4M          2.35B          40,000
Inception V3 [30]        78.8                 6.75           23.8M         5.71B          manual
EfficientNet-B1 [32]     78.8                 2.47           7.8M          0.69B          40,000
EfficientNet-lite2 [32]  77.6                 2.51           6.1M          0.86B          40,000
random selection         78.55∼79.19          2.45∼2.49      8.7M∼11.0M    0.97B∼1.14B    720(7200)‡
random search            78.93                2.49           10.4M         1.06B          3(30)††
ours-M                   79.35                2.47           12.8M         1.29B          3(30)††
EdgeTPU-M [9]            78.69                6.86           6.9M          3.66B          40,000
EfficientNet-B2 [32]     79.8                 3.31           9.1M          0.99B          40,000
EfficientNet-lite3 [32]  79.8(79.15)∗∗        3.50           8.2M          1.38B          40,000
Once-For-All [1]         80.0(78.50)∗         2.18           9.1M          0.60B          1,200†
ours-M+                  80.28                2.86           15.4M         1.29B          3(30)††
EfficientNet-B3 [32]     81.0                 5.22           12.2M         1.83B          40,000
EfficientNet-lite4 [32]  81.5(80.38)∗∗        5.84           13.0M         2.55B          40,000
ours-L+                  81.49                5.23           20.7M         2.47B          3(30)††
EdgeTPU-L [9]            80.62                15.55          10.6M         9.66B          40,000
EfficientNet-B4 [32]     82.6                 11.42          19.3M         4.39B          40,000
ours-XL+                 82.67                11.87          27.9M         5.95B          3(30)††
ours-XL-rmSE+            82.72                11.66          26.9M         5.95B          3(30)††
We then selectively removed SE blocks from ours-XL+,
resulting in ours-XL-rmSE+. We collected the activation values using
10K images randomly sampled from the training dataset and
calculated the metric explained in Sec. 4.3.3. After removing SE blocks
from ours-XL+ based on the metric, only about 60% of the blocks
in the network have SE blocks. As a result, we could make the
latency shorter, while the accuracy was slightly improved over
ours-XL+. This model achieves 82.72% top-1 accuracy with only
11.66 ms latency. It is much better than EfficientNet-EdgeTPU-L [9],
which achieves 80.62% FP32 top-1 accuracy with more than 20 ms on
EdgeTPU. Our architecture on MIDAP is about 2 times faster with
2.1% higher accuracy.
Finally, we compare the search time. Since the TPU is faster than
the GPU, we report the wall clock time and the estimated GPU time
(in parentheses), which is 10 times longer than the wall clock time,
in the last column of Table 6. Our method takes 3 hours, which is
much faster than the other methods. Note that we compare the total
time to get one architecture from scratch without trained weights.
Once-For-All [1] would require only a short fine-tuning time after
a neural architecture is searched. In contrast, we need to train the
network after a network architecture is found. It took 40 hours on
TPUv3 to train ours-M+.
5.4 Ablation Studies
5.4.1 Comparison with Random Search. While most NAS techniques
are not compared with a random search method, the authors of
[18] reported that a random search method is highly competitive.
So we conducted an experiment to compare the proposed NAS
technique with two random search methods, exploring the same
search space defined by the supernet structure of ours-M. First, we
designed a simple random search method that has a time
complexity similar to that of the proposed technique. In this method, we randomly
generate 15 models having a latency similar to that of ours-M, from the
same search space. Then we train each of them for 1 epoch with
cosine learning rate decay. After evaluating each of them, we choose
the architecture with the highest top-1 accuracy and fully train
it. In the second method, called random selection, we randomly
generate 20 models having a latency similar to that of ours-M, train
them fully, and take the architecture with the highest top-1 accuracy.
Since the random selection method performs search and training
simultaneously, it is slower than the proposed technique by a factor
equal to the number of randomly generated models.
Comparison results are reported in Table 6. It is confirmed that
both random selection and random search are quite competitive,
but noticeably inferior to ours-M in terms of accuracy. In detail,
the worst case of random selection showed 0.8% lower accuracy
than ours-M. The best performance obtained from 20 randomly
generated models is 79.19%, still lower than the accuracy of ours-M.
Note that random search and random selection show similar
performance that is no lower than that of the other networks. It means
that the search space defined by the supernet architecture has a
more significant effect on the accuracy than the search method.
Table 7: Comparison between compound scaling and direct
search on large target latency.

                  accuracy (%)   latency (ms)
Compound scale    81.78          10.81
Direct search     81.87          10.84
5.4.2 Comparison between compound scaling and direct search.
There are two methods to find an architecture with a loose
latency constraint. One is to use compound scaling that scales a small
network with shorter latency, and the other is to search a network
directly. To compare these two methods, we first scaled ours-M
using the same scaling coefficients that we used to scale ours-M+
to ours-L+ and trained it. When conducting a direct search, we
scaled the depth and width of the supernet and the input image size
first and applied the proposed NAS technique to the scaled
supernet. We used batch size 512 instead of 1024 during the architecture
search due to the memory limitation of the TPU. The comparison result
is shown in Table 7 in terms of top-1 accuracy (%) and the latency
on the target NPU (ms). The two results were similar, while the direct search
needed 10 hours on TPUv3. This means that compound scaling is an
effective method to find a large network quickly.
Table 8: Impact of SE and h-swish on accuracy.

          ReLU    h-swish
w/o SE    79.35   79.36
w/ SE     80.04   80.28
5.4.3 Effect of SE and h-swish. To examine how SE and h-swish
impact accuracy individually, we compare four combinations as
displayed in Table 8. The baseline is ours-M, which uses neither SE
nor the h-swish activation function. Replacing ReLU with h-swish
gives a marginal improvement in accuracy, while adding SE blocks
improves the accuracy noticeably. Adding both SE and the h-swish
activation function improves the accuracy by around 1%.
6 CONCLUSION
In this work, we propose a fast NPU-aware NAS methodology
extending the Single-Path NAS technique [28]. We modify the
supernet architecture by varying the number of blocks in stages
and adding mixed depthwise convolution [33] to the search space.
By modifying the loss function to directly include the target latency,
compared against the latency estimated with a cycle-accurate simulator
of the target NPU, we could
find a better baseline architecture with a shorter latency than the
latency constraint. Using a tight latency constraint, we can reduce
the search space to find the baseline network fast. Afterward, we
apply compound scaling to find a larger network than the baseline
network, and add SE blocks and h-swish activation functions in the
post-processing step. Through the proposed NAS methodology, we
could obtain a network with 82.72% accuracy and 11.66 ms latency
on our target NPU, without special data augmentation in training.
It dominates the existing network models on the target NPU. It
confirms the importance of supernet architecture design for a given
NPU and the effectiveness of the three-step approach in the proposed
NAS methodology: supernet design, Single-Path NAS with a tighter
latency constraint, and compound scaling and post-processing.
ACKNOWLEDGMENTS
This work is supported by the National Research Foundation of
Korea (NRF) grant funded by the Korea government (MSIP)
(NRF-2019R1A2B5B02069406). Also, we acknowledge support from the
TensorFlow Research Cloud (TFRC) program.
REFERENCES
[1] Han Cai, Chuang Gan, and Song Han. 2019. Once for all: Train one network and
specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019).
[2] Han Cai, Ligeng Zhu, and Song Han. 2018. Proxylessnas: Direct neural archi-
tecture search on target task and hardware. arXiv preprint arXiv:1812.00332
(2018).
[3] François Chollet. 2017. Xception: Deep learning with depthwise separable con-
volutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition. 1251–1258.
[4] Xiangxiang Chu, Xudong Li, Yi Lu, Bo Zhang, and Jixiang Li. 2020. MixPath:
A Unified Approach for One-shot Neural Architecture Search. arXiv preprint
arXiv:2001.05887 (2020).
[5] Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. 2019. Scar-
letnas: Bridging the gap between scalability and fairness in neural architecture
search. arXiv preprint arXiv:1908.06022 (2019).
[6] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2018. Neural architecture
search: A survey. arXiv preprint arXiv:1808.05377 (2018).
[7] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin,
Sicheng Zhao, and Kurt Keutzer. 2018. Squeezenext: Hardware-aware neural
network design. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops. 1638–1647.
[8] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei,
and Jian Sun. 2019. Single path one-shot neural architecture search with uniform
sampling. arXiv preprint arXiv:1904.00420 (2019).
[9] Suyog Gupta and Berkin Akin. 2020. Accelerator-aware Neural Network Design
using AutoML. arXiv preprint arXiv:2003.02838 (2020).
[10] Dongyoon Han, Jiwhan Kim, and Junmo Kim. 2017. Deep pyramidal residual
networks. In Proceedings of the IEEE conference on computer vision and pattern
recognition. 5927–5935.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition. 770–778.
[12] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingx-
ing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. 2019.
Searching for mobilenetv3. In Proceedings of the IEEE International Conference on
Computer Vision. 1314–1324.
[13] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
[14] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. 2016.
Deep networks with stochastic depth. In European conference on computer vision.
Springer, 646–661.
[15] Donghyun Kang, Jintaek Kang, Hyungdal Kwon, Hyunsik Park, and Soonhoi
Ha. 2019. A Novel Convolutional Neural Network Accelerator That Enables
Fully-Pipelined Execution of Layers. In 2019 IEEE 37th International Conference
on Computer Design (ICCD). IEEE, 698–701.
[16] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and
Dongjun Shin. 2015. Compression of deep convolutional neural networks for
fast and low power mobile applications. arXiv preprint arXiv:1511.06530 (2015).
[17] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016.
Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
[18] Liam Li and Ameet Talwalkar. 2019. Random search and reproducibility for
neural architecture search. arXiv preprint arXiv:1902.07638 (2019).
[19] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable
architecture search. arXiv preprint arXiv:1806.09055 (2018).
[20] Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and
Jianchao Yang. 2019. AtomNAS: Fine-Grained End-to-End Neural Architecture
Search. arXiv preprint arXiv:1912.09640 (2019).
[21] Rang Meng, Weijie Chen, Di Xie, Yuan Zhang, and Shiliang Pu. 2020.
Neural Inheritance Relation Guided One-Shot Layer Assignment Search.
arXiv:cs.CV/2002.12580
[22] Eunhyeok Park, Junwhan Ahn, and Sungjoo Yoo. 2017. Weighted-entropy-based
quantization for deep neural networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 5456–5464.
[23] Eunhyeok Park and Sungjoo Yoo. 2020. PROFIT: A Novel Training Method
for sub-4-bit MobileNet Models. In Proceedings of the European Conference on
Computer Vision (ECCV), To appear in ECCV 2020.
[24] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018. Efficient
neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268
(2018).
[25] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr
Dollár. 2020. Designing Network Design Spaces. arXiv preprint arXiv:2003.13678
(2020).
[26] Prajit Ramachandran, Barret Zoph, and Quoc V Le. 2017. Searching for activation
functions. arXiv preprint arXiv:1710.05941 (2017).
[27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-
Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In
Proceedings of the IEEE conference on computer vision and pattern recognition.
4510–4520.
[28] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi
Priyantha, Jie Liu, and Diana Marculescu. 2019. Single-path nas: Designing
hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877
(2019).
[29] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Nis-
sanka Bodhi Priyantha, Jie Liu, and Diana Marculescu. 2020. Single-path mobile
automl: Efficient convnet design and nas hyperparameter optimization. IEEE
Journal of Selected Topics in Signal Processing (2020).
[30] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. 2016. Rethinking the inception architecture for computer vision. In
Proceedings of the IEEE conference on computer vision and pattern recognition.
2818–2826.
[31] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew
Howard, and Quoc V Le. 2019. Mnasnet: Platform-aware neural architecture
search for mobile. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2820–2828.
[32] Mingxing Tan and Quoc V Le. 2019. Efficientnet: Rethinking model scaling for
convolutional neural networks. arXiv preprint arXiv:1905.11946 (2019).
[33] Mingxing Tan and Quoc V Le. 2019. Mixconv: Mixed depthwise convolutional
kernels. CoRR, abs/1907.09595 (2019).
[34] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming
Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. 2019. Fbnet:
Hardware-aware efficient convnet design via differentiable neural architecture
search. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 10734–10742.
[35] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv
preprint arXiv:1605.07146 (2016).
[36] Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement
learning. arXiv preprint arXiv:1611.01578 (2016).
[37] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning
transferable architectures for scalable image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition. 8697–8710.
