Fast Hardware-Aware Neural Architecture Search by Zhang, Li Lyna et al.
Integrating Hardware Diversity with Neural Architecture Search
for Efficient Convolutional Neural Networks
Li Lyna Zhang1 Yuqing Yang1 Yuhang Jiang2 Wenwu Zhu2 Yunxin Liu1
1Microsoft Research Asia, 2Tsinghua University
{lzhani, yuqing.yang, yunxin.liu}@microsoft.com
{jyh17}@mails.tsinghua.edu.cn, {wwzhu}@tsinghua.edu.cn
Abstract
Designing accurate and efficient convolutional neural architectures for a vast range of
hardware is challenging because hardware designs are complex and diverse. This paper
addresses the hardware diversity challenge in Neural Architecture Search (NAS). Unlike
previous approaches that apply search algorithms to a small, human-designed search space
without considering hardware diversity, we propose HURRICANE¹, which combines automatic
hardware-aware search over a much larger search space with a two-stage search algorithm
to efficiently generate tailored models for different types of hardware. Extensive
experiments on ImageNet show that our algorithm consistently achieves a much lower
inference latency with similar or better accuracy than state-of-the-art NAS methods on
three types of hardware. Remarkably, HURRICANE achieves a 76.67% top-1 accuracy on
ImageNet with an inference latency of only 16.5 ms on DSP, which is 3.47% more accurate
and 6.35× faster than FBNet-iPhoneX. For VPU, HURRICANE achieves a 0.53% higher top-1
accuracy than Proxyless-mobile with a 1.49× speedup. Even for the well-studied mobile
CPU, HURRICANE achieves a 1.39% higher top-1 accuracy than FBNet-iPhoneX with comparable
inference latency. HURRICANE also reduces the training time by 30.4%, or even 54.7%
(with less than 0.5% accuracy loss), compared to Singlepath-Oneshot.
1. Introduction
Neural Architecture Search (NAS) is a powerful mechanism to automatically generate
efficient Convolutional Neural Networks (CNNs) without requiring huge manual efforts of
human experts to design good CNN models [10, 31, 32, 25, 9, 1]. However, searching for
efficient and accurate CNNs for the massive number of smart devices is difficult with
current NAS approaches, both because of the growing variety of hardware devices and
because of the intrinsically huge search cost.
Figure 1. HURRICANE constructs a hardware-specialized search space (by latency
constraints) from a large global search space containing more hardware-efficient
architectures than previous NAS, and employs a two-stage search algorithm to accelerate
the search. (a) Previous hardware-aware NAS: a single search space for all hardware.
(b) Proposed hardware-aware NAS: a specialized search space for each type of hardware.
The operator pool is much larger than that of previous NAS to tackle hardware diversity.
¹ HURRICANE will be open sourced after review.
Unaware of Hardware Diversity. Most existing NAS methods focus on searching for the most
powerful models. The common way to guarantee inference efficiency (e.g., model inference
latency on real hardware) is to limit the model's FLOPs². Some recent hardware-aware NAS
methods [9, 25, 5, 27] consider model-inference performance, but they only target a
single type of hardware: smartphones from different manufacturers that are all based on
ARM processors. As shown in Figure 1(a), existing hardware-aware NAS approaches
[4, 27, 24] use an identical, manually elaborated search space for different types of
hardware platforms. However, the emerging massive smart devices (e.g., IoT devices) are
equipped with very diverse processors, such as CPUs, GPUs, DSPs, FPGAs, and various AI
accelerators that have fundamentally different hardware designs. Such hardware diversity
makes FLOPs an improper metric for predicting model-inference performance, and makes a
manually designed search space far from ideal for finding efficient models. This calls
for new methods that automatically generate hardware-aware search spaces, leverage the
characteristics of each hardware platform, and reduce the reliance on human effort.
² In this paper, the definition of FLOPs follows [30], i.e., the number of multiply-adds.
arXiv:1910.11609v2 [cs.CV] 24 Dec 2019
Figure 2. Performance of widely used operators in NAS (c.f. Table 2). (a) Latency and
FLOPs on three types of hardware: (1) DSP (Hexagon™ 685 DSP), (2) CPU (Snapdragon 845
ARM CPU), (3) VPU (Movidius™ Myriad™ X Vision Processing Unit). The input and output
feature maps all have the same size, $56^2 \times 64$. (b) Latency for different input
feature map sizes on VPU.
To demonstrate this, we measure the performance of a set of widely used neural network
operators (a.k.a. operations) on three types of mobile processors: the Hexagon™ 685 DSP,
the Snapdragon 845 ARM CPU, and the Movidius™ Myriad™ X Vision Processing Unit (VPU).
Figure 2 shows the results, from which we make the following key observations. First,
Figure 2(a) shows that even when operators have similar FLOPs, the same operator may
have very different inference latency on different processors. For example, the latency
of operator SEP 5 is nearly 12× higher than that of operator Choice 3 on the CPU, but
the difference on the VPU is less than 4×. Therefore, FLOPs is not the right metric for
determining inference latency on different hardware. Second, the relative effectiveness
of operators also differs across processors. For example, operator SEP 3 has the
smallest latency on the DSP, but operator Choice 3 has the smallest latency on the VPU.
Thus, different processors should choose different operators for the best trade-off
between model accuracy and inference latency. Furthermore, as shown in Figure 2(b), the
computational complexity and latency of the same operator are also affected by the
execution context, such as the input feature map shape and the number of channels. This
context is determined by the layer in which the operator is placed. That is, even for
the same type of hardware, the optimal operators may change from layer to layer. Thus,
it is difficult to cover hardware diversity with a manually designed search space.
Motivated by these observations, we argue that there is no one-size-fits-all model for
different hardware platforms, and thus propose HURRICANE (shown in Figure 1(b)) to
generate different models tailored for different types of hardware. To cover the
diversity of hardware platforms, we construct a much larger candidate operator pool
(32 operators in our implementation) and propose a hardware-aware search space
generation algorithm that automatically generates a hardware-specialized search space
for each type of hardware. As detailed in Section 3.1, our mechanism is based on real
performance profiled on the target hardware.
To reduce the intensive search cost, we propose a two-stage search algorithm for
one-shot NAS³, which searches the complete architecture through a sequence of simpler
sub-network searches. The method is inspired by layer diversity: different layers may
have different impacts on inference latency [10] and model accuracy [29, 14]. We argue
that exploring more architecture selections in the layers close to the classification
output may help find better architectures within a limited sampling budget, and that
limiting the latency of the layers close to the data input is critical when searching
for low-latency models.
Section 3 gives more details about HURRICANE. In Section 4, we evaluate the
effectiveness of HURRICANE on the ImageNet 2012 and OUI-Adience-Age datasets over three
hardware platforms (Figure 2(a)). On all three platforms, HURRICANE consistently
achieves the same level of (or better) accuracy with much lower inference latency than
state-of-the-art hardware-aware NAS methods. Remarkably, HURRICANE reduces the inference
latency by 6.35× on DSP compared to FBNet-iPhoneX and by 1.49× on VPU compared to
Proxyless-mobile. Compared to Singlepath-Oneshot, HURRICANE reduces the training time by
30.4%, or even 54.7% (with less than 0.5% accuracy loss), on ImageNet.
³ In this paper we adopt one-shot NAS because of its simplicity; however, our
acceleration could also be combined with other NAS methods.
2. Related Work
Hardware aware NAS. Recent methods [25, 27, 4, 9, 1] adopt a layer-level hierarchical
search space with a fixed macro-structure, allowing different layer structures at
different resolution blocks of a network. The goal becomes searching for an operator for
each layer so that the architecture achieves competitive accuracy under given
constraints. To find hardware-efficient architectures, search spaces have been built on
increasingly efficient building blocks. [25, 4, 24] build upon the MobileNetV2 [21]
structure (MB k e). [27, 9] build their search spaces on ShuffleNetV1 [30] and
ShuffleNetV2 [16] (Choice k). As these structures are primarily designed for mobile
CPUs, the efficiency of such manually designed search spaces on other hardware is
unknown.
To measure model efficiency, many NAS methods [31, 32] adopt the hardware-agnostic
metric FLOPs. However, an architecture with lower FLOPs is not necessarily faster [23].
Recently, gradient-based methods [4, 24, 27] have adopted direct metrics such as
measured latency, but only for mobile CPUs. They profile every operator's latency and
build a prediction model. The latency is then treated as a differentiable regularization
loss. However, the multi-objective loss is not optimal because accuracy changes much
more dramatically with latency for small models, as [10] pointed out. In contrast, we
follow One-Shot NAS [9, 5] and apply the latency constraints directly.
One-Shot NAS. Starting from ENAS [17], weight sharing became popular as it accelerates
the search process and makes the search cost feasible. Recent one-shot methods encode
the search space into an over-parameterized supernet, where each path is a stand-alone
model. During supernet training, architectures are sampled by different proxies (e.g.,
reinforcement learning) and their weights are updated. However, Singlepath-Oneshot [9]
and FairNAS [5] observe that such coupling of architecture search and weight sharing
can make it problematic to fairly evaluate the performance of candidate architectures.
Our work is built upon Singlepath-Oneshot [9], which decouples supernet training from
evolutionary architecture search by uniform sampling.
3. Methodology
In this paper, HURRICANE searches for an architecture of the following form for a given
hardware platform $h$ (any of CPU, DSP, NPU, VPU, etc.) and latency constraint
$\tau_c^{(h)}$:
$$\max \; \mathrm{ACC}_{val}(a) \quad \text{s.t.} \quad \tau(a, h) \le \tau_c^{(h)} \tag{1}$$
That is, HURRICANE finds an architecture $a$ that achieves the maximum accuracy
$\mathrm{ACC}_{val}(a)$ on the validation set while its inference latency $\tau(a, h)$
stays within the constraint $\tau_c^{(h)}$.
3.1. Hardware-aware Search Space
We follow the layer-level hierarchical search space design of recent hardware-aware NAS
[27, 25]. Apart from the first and the last three fixed layers, each learnable layer can
choose an operator from a candidate pool. For each target hardware, we encode the
specialized search space into an over-parameterized supernet for one-shot NAS.
Diverse Candidate Operators Pool. Compared with
the small operator pool in previous works, we employ a
much bigger pool of candidate operators from the primary
blocks of off-the-shelf networks. Our pool contains up to 32
operators (detailed in Table 1) that leverage different com-
putation and memory complexity. They are built upon the
following 4 basic structures from current efficient models:
• SEP: depthwise-separable convolution. Following
DARTS [15], we applied the depthwise-separable con-
volution twice. SEP has a larger FLOP count than oth-
ers, but less memory access complexity.
• MB: mobile inverted bottleneck convolution in Mo-
bileNetV2 [21]. MB has a medium memory access
cost due to its shortcut and add operation. Its com-
putation complexity is decided by the kernel size k and
expansion rate e.
• Choice: basic building block in ShuffleNetV2 [16].
Following [9], we add a similar operator ChoiceX.
Choice and ChoiceX have much smaller FLOPs than
the others, but the memory complexity is high due to
the channel split and concat operation.
• SE: squeeze-and-excitation network [11]. To balance the impact on latency and
accuracy, we follow the settings in MobileNetV3 [10]. We set the reduction ratio r to 4
and replace the original sigmoid function with a hard version of swish,
$\mathrm{hswish}[x] = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$. We apply the SE module to
the above operators to generate new operators. The computation complexity of SE is
decided by its insertion position, while its memory access cost is relatively low.
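As a small illustration of the hswish formula in the SE bullet above, the following sketch evaluates it next to the smooth swish it approximates. It is only a scalar, pure-Python rendering of the formula; the function names are chosen here for illustration and nothing about the authors' implementation is implied.

```python
import math


def relu6(x: float) -> float:
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def hswish(x: float) -> float:
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x: float) -> float:
    """Smooth swish, x * sigmoid(x), shown only for comparison."""
    return x / (1.0 + math.exp(-x))


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  hswish={hswish(x):+.4f}  swish={swish(x):+.4f}")
```

The hard version is piecewise polynomial: it is exactly 0 for x ≤ -3 and exactly x for x ≥ 3, so it avoids the exponential of the smooth form.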
Hardware Aware Search Space Generation. The enriched pool of candidate operators covers
the diversity of hardware platforms. However, the huge search cost makes it
impracticable to construct the over-parameterized supernet by repeating all candidate
operators in every learnable layer, as previous works do. To address this issue, we
propose a layer-wise hardware-aware search space generation algorithm that generates a
search space specialized for each hardware platform. To select the operators that are
most efficient on the target hardware platform, we sort every layer's candidate
operators in non-increasing order of their scores in Equation 2. For each layer, we keep
the first p operators, i.e., those with the highest scores, to construct the reduced
search space, whose size is then $p^n$ (n is the number of learnable layers).
Operator       Kernel (k)   Expansion (e)   Number
SEP k          3, 5, 7      -               3
SEP k SE       3, 5, 7      -               3
MB k e         3, 5, 7      1, 3, 6         9
MB k e SE      3, 5, 7      1, 3, 6         9
Choice k       3, 5, 7      -               3
Choice k SE    3, 5, 7      -               3
ChoiceX        3            -               1
ChoiceX SE     3            -               1
Table 1. Diverse candidate operators. For the depthwise convolution in each operator,
the kernel size k can be 3, 5, or 7. For MB, the expansion ratio e can be 1, 3, or 6.
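The operator counts in Table 1 and the effect of keeping only p operators per layer can be checked with a short enumeration. This is a sketch: the underscore-joined operator names are a naming convention assumed here, and the values n = 20 and p = 4 are taken from the experiment setup in Section 4.1.

```python
from itertools import product

KERNELS = (3, 5, 7)      # kernel sizes k allowed in Table 1
EXPANSIONS = (1, 3, 6)   # expansion ratios e allowed for MB


def build_operator_pool() -> list:
    """Enumerate the candidate operators of Table 1 by (illustrative) name."""
    base = [f"SEP_{k}" for k in KERNELS]
    base += [f"MB_{k}_{e}" for k, e in product(KERNELS, EXPANSIONS)]
    base += [f"Choice_{k}" for k in KERNELS]
    base.append("ChoiceX")  # Table 1 lists ChoiceX with k = 3 only
    # Each base operator also appears with an SE module attached.
    return base + [f"{name}_SE" for name in base]


def search_space_size(ops_per_layer: int, num_layers: int) -> int:
    """Number of architectures when every layer picks one of ops_per_layer operators."""
    return ops_per_layer ** num_layers


if __name__ == "__main__":
    pool = build_operator_pool()
    print(len(pool))                          # 32 operators in total
    print(search_space_size(len(pool), 20))   # full pool repeated in all 20 layers
    print(search_space_size(4, 20))           # reduced space with p = 4
```

Repeating all 32 operators in 20 layers would give 32^20 = 2^100 architectures, against 4^20 = 2^40 for p = 4, which is the gap the layer-wise filtering is meant to close.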
Hardware aware profiling and sorting. The score of candidate operator $op$ at the
$i$-th learnable layer, $\mathrm{score}_{op}^{(i)}$, considers both (approximate)
representation capacity and real hardware performance:
$$\mathrm{score}_{op}^{(i)} = (F_{op})^{\alpha} \, (P_{op})^{\beta} \, \left(\tau_{\ell_i(a)=op}(a, h)\right)^{-1} \tag{2}$$
where $F_{op}$ and $P_{op}$ are the FLOPs count and the number of parameters of operator
$op$, respectively, and $\ell_i(a) = op$ denotes an architecture $a$ whose $i$-th
learnable layer is $op$. The parameters $\alpha$ and $\beta$ are non-negative constants.
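A minimal sketch of how Equation 2 could drive the per-layer selection is given below. The `OpProfile` record, the helper names, and the three profile entries are hypothetical and not measured data; the latency term is simplified to a single profiled number per operator and layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpProfile:
    name: str
    flops: float        # F_op: multiply-adds of the operator
    params: float       # P_op: number of parameters
    latency_ms: float   # profiled latency on the target hardware at this layer


def score(op: OpProfile, alpha: float, beta: float) -> float:
    """Equation 2: F^alpha * P^beta / latency."""
    return (op.flops ** alpha) * (op.params ** beta) / op.latency_ms


def top_p_operators(profiles, p, alpha, beta):
    """Sort one layer's candidates by non-increasing score and keep the first p names."""
    ranked = sorted(profiles, key=lambda op: score(op, alpha, beta), reverse=True)
    return [op.name for op in ranked[:p]]


# Made-up profile of three candidates at one layer (not measured data).
LAYER_PROFILE = [
    OpProfile("op_a", flops=100e6, params=1e5, latency_ms=2.0),
    OpProfile("op_b", flops=20e6, params=5e4, latency_ms=1.0),
    OpProfile("op_c", flops=400e6, params=4e5, latency_ms=4.0),
]

if __name__ == "__main__":
    # alpha = beta = 0 ranks purely by latency.
    print(top_p_operators(LAYER_PROFILE, p=2, alpha=0.0, beta=0.0))
    # alpha = 1 rewards FLOPs delivered per millisecond.
    print(top_p_operators(LAYER_PROFILE, p=2, alpha=1.0, beta=0.0))
```

With α = β = 0 the score reduces to inverse latency, so the fastest operators win; raising α or β shifts the ranking toward operators with more capacity per unit of latency.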
Exploring Operator. As observed under layer diversity, some layers (commonly those close
to the output) contribute little to the latency (due to their small feature map size)
but have a large impact on accuracy. For these learnable layers, we add an extra
exploring operator besides the first p operators. Since the exploring operator is mainly
intended to improve accuracy, its score need not be among the top-ranked. For our
backbone network (shown in Table 2), it is natural to add the exploring operator to the
last 4 layers because they have the smallest feature map size.
Robustness verification. While training the over-parameterized supernet, the weights of
an operator are updated only when the operator lies on a valid sampled path
(Equation 3). The supernet will be hard to converge if (1) most architectures in the
search space exceed the latency constraint or (2) some of the candidate operators are
rarely sampled. We can avoid wasting valuable computing power on such a
hardly-converging search space by verifying its sampling characteristics beforehand. The
appearance of either symptom means that the architectures in the generated search space
are too heavyweight for the target platform under the latency constraint. We can then
adjust (α, β) so that the algorithm puts more weight on latency, or even skip some
learnable layers by assigning them an empty operator list.
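One way to check the two symptoms before training is a quick Monte Carlo estimate over uniformly sampled architectures. The sketch below is an assumption-laden illustration: it treats total latency as the sum of per-layer latencies from a hypothetical lookup table `layer_latency`, which is not necessarily how the latency predictor of this paper works, and the function name and return values are invented here.

```python
import random
from collections import Counter


def sampling_report(space, layer_latency, budget_ms, trials=10_000, seed=0):
    """Return (valid_ratio, min_operator_share) for uniformly sampled architectures.

    space[i] is the operator list of layer i; layer_latency[i][op] is the
    (assumed additive) latency contribution of op at layer i.
    """
    rng = random.Random(seed)
    valid = 0
    counts = Counter()
    for _ in range(trials):
        arch = [rng.choice(ops) for ops in space]
        total = sum(layer_latency[i][op] for i, op in enumerate(arch))
        if total <= budget_ms:
            valid += 1
            counts.update(enumerate(arch))  # count (layer, operator) pairs
    valid_ratio = valid / trials
    if valid == 0:
        return valid_ratio, 0.0
    # Share of valid samples that contain the least frequently used operator.
    rarest = min(counts[(i, op)] for i, ops in enumerate(space) for op in ops)
    return valid_ratio, rarest / valid
```

A low `valid_ratio` corresponds to symptom (1), and a `min_operator_share` near zero corresponds to symptom (2); either would suggest revisiting (α, β) before spending GPU time on the supernet.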
3.2. Search Algorithm and Two-Stage NAS Acceleration
Like other one-shot NAS methods, our search algorithm consists of supernet training and
architecture search. All candidate architectures in the hardware-aware search space
$A_h$ are encoded into an over-parameterized supernet. For every input minibatch, a
specific architecture $a$ is randomly sampled from $A_h$ under the latency constraint:
$$a \in \{\, a \in A_h \mid \tau(a, h) \le \tau_c^{(h)} \,\} \tag{3}$$
Its weights $W(a)$ are inherited from the shared weights $W$. In the backward pass, the
weights of $a$ are updated by the gradient $\nabla L_B(\mathcal{N}(a, W(a)))$ [3, 9],
where $\mathcal{N}(a, W(a))$ denotes the network with sampled architecture $a$ and
weights $W(a)$, and $L_B(\cdot)$ denotes the training loss over the input minibatch $B$.
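The constrained sampling of Equation 3 can be sketched with rejection sampling: draw one operator per layer uniformly and retry until the latency budget is met. Rejection sampling, the `predict_latency` callback, and the update counter standing in for a gradient step are all assumptions made for this illustration; the paper does not spell out these details here.

```python
import random


def sample_architecture(space, predict_latency, budget_ms, rng, max_tries=1000):
    """Uniformly sample one operator per layer; reject samples over the budget."""
    for _ in range(max_tries):
        arch = tuple(rng.choice(ops) for ops in space)
        if predict_latency(arch) <= budget_ms:
            return arch
    raise RuntimeError("no architecture within the latency constraint was found")


def count_weight_updates(space, predict_latency, budget_ms, num_batches, seed=0):
    """Count how often each shared (layer, operator) weight would be updated."""
    rng = random.Random(seed)
    updates = {(i, op): 0 for i, ops in enumerate(space) for op in ops}
    for _ in range(num_batches):
        arch = sample_architecture(space, predict_latency, budget_ms, rng)
        for i, op in enumerate(arch):
            updates[(i, op)] += 1  # stand-in for one gradient step on W(a)
    return updates
```

Only the operators on the sampled path are touched in each minibatch, so an operator that never appears in a valid architecture never receives an update, which is the situation the robustness verification of Section 3.1 guards against.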
After the supernet is trained (only once), all candidate architectures inherit weights
from it. These weights are used to approximately evaluate the accuracy of each
architecture and are not updated during architecture search. We adopt evolutionary
search [9] to find the winning architecture with the highest validation accuracy.
Supernet training is computationally intensive. To reduce its cost, we introduce a
two-stage acceleration algorithm that leverages layer diversity in accuracy and latency.
Layer grouping. That different CNN layers show different sensitivity with respect to
accuracy has been observed in other domains [6, 14, 29]. NAS should spend more effort on
searching for the ideal operators for the critical layers, as the operator selection for
non-critical layers has less impact on the final accuracy. However, it is difficult to
perform an accuracy sensitivity analysis for each individual layer in the NAS scenario.
Fortunately, previous works [6, 8] have revealed different behaviors between the earlier
layers (close to the data input) and the later layers (close to the classification
output) of CNN models. The earlier layers extract low-level features from inputs (e.g.,
edges and colors), are computation intensive, and demand more data to converge, while
the later layers capture high-level class-specific features but are less computation
intensive. Inspired by this, our intuition is that operator search for the later layers
is more critical than for the earlier layers. To this end, we divide the n layers of the
CNN model into two groups: the earlier t layers (less critical) and the later n − t
layers (more critical).
Two-Stage Search Algorithm. Algorithm 1 illustrates the main procedure of hardware-aware
NAS with two-stage search acceleration. Each stage starts with a different winning
architecture and runs a one-shot NAS to search the target group of learnable layers. The
remaining learnable layers are treated as non-active fixed layers and use the operators
of the corresponding layers in the winning architecture. Architecture search on top of
fixed convolutional layers has also appeared in other NAS works [13]. In the beginning,
we set up the initial winning architecture $a_{win0}$ with the highest-scoring operator
in every layer.
First, Stage1 (I = 1) searches the later n − t layers for $a_{win0}$. We mark the later
n − t layers as active and the earlier t layers as non-active. The non-active layers are
fixed to the corresponding layer structures of architecture $a_{win0}$, while each
active layer i chooses from its full generated operator list $\ell_i(A)$. The one-shot
NAS method itself is similar to [9], except that we constrain the search space by
hardware latency⁴ rather than FLOPs. After a complete one-shot search, a new winning
architecture $a_{win1}$ is generated. $a_{win1}$ keeps the same operators in the
non-active earlier t layers and takes new operators in the active later n − t layers.
Second, Stage2 (I = 2) starts with the new winning architecture $a_{win1}$ and searches
the earlier t layers. The later n − t layers are fixed to the corresponding layer
operators of $a_{win1}$, and the earlier t layers are active for another one-shot
search. Stage2 returns the final architecture $a_{win2}$.
Hyper-parameter t. The layer grouping boundary t in Algorithm 1 impacts the
effectiveness and efficiency of the two-stage acceleration. Specifically, when t = 0 the
two-stage acceleration falls back to the original one-shot NAS and searches all n
learnable layers at once, so no search cost reduction is achieved. While a larger t
reduces the supernet training time considerably, it can harm the search for optimal
architectures. In this paper, we set t = 8 (counting only learnable layers) according to
the natural resolution changes of the supernet. According to our empirical results, with
t = 8 the two-stage search algorithm achieves comparable or better accuracy and a
promising search time reduction on two datasets (c.f. Table 5). We leave a deeper study
of how to choose t to future work.
4. Evaluation
4.1. Experiment Setup
Hardware Platforms and Measurements. We target three representative mobile hardware
platforms that are widely used for CNN deployment: (1) DSP (Qualcomm's Hexagon™ 685
DSP), (2) CPU (Qualcomm's Snapdragon 845 ARM CPU), and (3) VPU (Intel's Movidius™
Myriad™ X Vision Processing Unit). To fully utilize this hardware at inference time, we
use the inference engine provided by each hardware vendor. Specifically, DSP and CPU
latency are measured with the Snapdragon Neural Processing Engine SDK [18], while VPU
latency is measured with the Intel OpenVINO™ Toolkit [12].
⁴ To do this, we build a latency-prediction model; details can be found in Section A of
the supplementary materials.
Algorithm 1 Hardware-aware NAS with acceleration
Require: h (hardware platform), τ_c^(h) (latency constraint)
Require: I (number of steps, I ∈ {1, 2}), t (hyper-parameter)
function TWOSTAGECONSTRAINEDNAS(A, I, t)
    ▷ ℓ_i(a) denotes the i-th layer of architecture a
    ▷ ℓ_i(A) denotes all the candidate operators in the i-th layer of search space A,
      sorted in non-increasing order of score_op^(i)
    for i ← 1 to n do
        ▷ Init a_win with the top-ranked candidate operator
        ℓ_i(a_win) ← ℓ_i(A)[0]
    end for
    for iter ← 1 to I do
        ▷ SELECT(x, a, b) returns a if x holds, otherwise b
        Fixed ← SELECT(iter is odd, [1, t], [t + 1, n])
        for i ← 1 to n do
            φ_i ← SELECT(i ∈ Fixed, {ℓ_i(a_win)}, ℓ_i(A))
        end for
        A_iter ← {a | ℓ_i(a) ∈ φ_i, 1 ≤ i ≤ n}
        a_win ← CONSTRAINEDONESHOTNAS(A_iter, τ_c^(h))
    end for
    return a_win
end function
A_h ← HWAWARESEARCHSPACE(h, τ_c^(h))    ▷ c.f. Sec 3.1
a_win ← TWOSTAGECONSTRAINEDNAS(A_h, I, t)
retrain a_win
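The control flow of Algorithm 1 can be rendered as a short Python sketch. The constrained one-shot NAS itself is passed in as a callback `one_shot_nas`, a hypothetical interface that receives the stage's search space and returns one operator per layer; supernet training and the latency constraint are assumed to live inside that callback.

```python
def two_stage_constrained_nas(space, num_stages, t, one_shot_nas):
    """Two-stage search over `space`, a per-layer list of score-sorted operators.

    Layers are 0-indexed here: layers 0..t-1 are the earlier group and layers
    t..n-1 the later group.
    """
    n = len(space)
    # Initial winner: the top-ranked operator in every layer.
    a_win = [ops[0] for ops in space]
    for stage in range(1, num_stages + 1):
        if stage % 2 == 1:
            fixed = set(range(0, t))   # Stage1: freeze earlier layers, search later ones
        else:
            fixed = set(range(t, n))   # Stage2: freeze later layers, search earlier ones
        stage_space = [[a_win[i]] if i in fixed else list(space[i]) for i in range(n)]
        a_win = list(one_shot_nas(stage_space))
    return a_win


if __name__ == "__main__":
    # Toy stand-in for the one-shot search: always pick the last candidate.
    def pick_last(stage_space):
        return [ops[-1] for ops in stage_space]

    toy_space = [["x", "y", "z"] for _ in range(4)]
    print(two_stage_constrained_nas(toy_space, num_stages=1, t=2, one_shot_nas=pick_last))
    print(two_stage_constrained_nas(toy_space, num_stages=2, t=2, one_shot_nas=pick_last))
```

Because fixed layers are represented as single-operator lists, each stage trains a supernet over only the active layers, which is where the training-time saving comes from.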
Latency Constraints. For better comparison with other
works, we set the latency constraints to be smaller than the
best latency of models from other works [25, 10, 27, 4],
which are 310 ms (CPU), 17 ms (DSP) and 36 ms (VPU).
Hardware Aware Search Space. The supernet [9] consists of stem layers and 20 learnable
layers (n = 20) (c.f. Table 2). For a fair comparison, we let every learnable layer have
4 (p = 4) candidate operators. The exploring operator is applied to the last 4 learnable
layers because they have the smallest feature map size. Thus the total size of the
hardware-aware search space is $4^{16} \times 5^{4}$, which is similar to the size of
manually designed search spaces in other works. As shown in Table 2, the hardware-aware
search space is generated according to the different characteristics of each hardware
platform. More analysis and hardware insights are given in Section 4.3.1.
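As a quick check of the size claim, the count can be worked out explicitly. The comparison value $4^{20}$ assumes a manually designed space with four candidates in each of 20 layers, which is used here only as a reference point:

$$4^{16} \times 5^{4} = 4{,}294{,}967{,}296 \times 625 = 2{,}684{,}354{,}560{,}000 \approx 2.7 \times 10^{12}, \qquad 4^{20} = 1{,}099{,}511{,}627{,}776 \approx 1.1 \times 10^{12}.$$

The ratio between the two is $(5/4)^4 \approx 2.44$, so adding one exploring operator to the last four layers keeps the search space within the same order of magnitude.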
Output shape      Layers   Hardware   Candidate operators
$56^2 \times 64$    1-4      DSP        SEP 3, Choice 3, MB 3 1, ChoiceX
                           CPU        Choice 3, Choice 3 SE, MB 3 1, ChoiceX
                           VPU        Choice 3, Choice 5, Choice 7, SEP 3
$28^2 \times 160$   5-8      DSP        Choice 3, ChoiceX, MB 3 1, Choice 3 SE
                           CPU        Choice 3, ChoiceX, Choice 5, MB 3 1
                           VPU        Choice 3, Choice 5, Choice 7, ChoiceX
$14^2 \times 320$   9-16     DSP        Choice 3, Choice 3 SE, ChoiceX, MB 3 1
                           CPU        Choice 3, Choice 3 SE, Choice 5, Choice 5 SE
                           VPU        Choice 3, Choice 5, Choice 7, ChoiceX
$7^2 \times 640$    17-20    DSP        Choice 3, Choice 3 SE, ChoiceX, MB 3 1, MB 3 3
                           CPU        Choice 3, Choice 5, Choice 3 SE, Choice 7, MB 5 1
                           VPU        Choice 3, Choice 5, Choice 7, MB 3 1, MB 7 1
Table 2. Hardware-aware search space for each mobile hardware. Each of layers 1-16 has
4 operators for selection; each of layers 17-20 has 5 operators. The input/output
channel and stride settings for each layer are the same as in Singlepath-Oneshot [9].
Model                       Approach                Acc (%)  DSP (ms)  CPU‡ (ms)  VPU (ms)  FLOPs  Supernet training time reduction (%), Stage1 / Stage1 + Stage2 †
MobileNetV2 [21]            Manual                  72.00    10.1      432.4      45.2      300M   -
MobileNetV3-Large1.0 [10]   RL [25]+NetAdapt [28]   75.20    48.3      411.4      43.9      219M   -
MnasNet-A1 [25]             RL                      75.20    149.0     1056.1     52.4      312M   -
FBNet-iPhoneX [27]          gradient                73.20    105.0     313.0      45.6      322M   -
FBNet-S8 [27]               gradient                73.27    293.0     369.6      45.1      293M   -
Proxyless-R (mobile) [4]    gradient                74.60    534.6     616.5      53.1      333M   -
Singlepath-Oneshot* [9]     oneshot                 74.30    270.6     455.8      38.7      319M   0
HURRICANE (DSP)             oneshot                 76.67    16.5      576.7      45.4      709M   61.4 (76.57) / 36.4 (76.63)
HURRICANE (CPU)             oneshot                 74.59    80.1      301.3      38.9      327M   57.7 (74.59) / 33.9 (74.59)
HURRICANE (VPU)             oneshot                 75.13    390.8     645.3      35.6      409M   45.0 (74.63) / 20.8 (75.13)
Table 3. Compared with various state-of-the-art efficient models on ImageNet, HURRICANE
is the only NAS method that achieves high accuracy and low latency on all the target
hardware. The results suggest that FLOPs is not an accurate indicator of actual
inference latency. As the hardware measurement settings in related NAS works differ, we
measure the latency on our own hardware platforms. ‡: CPU latency is measured on a
single CPU core with float32 precision. *: For a fair comparison, we use the block
search model instead of block search + channel search. †: In the form "x (y)", "x" is
the training time reduction and "y" is the accuracy achieved.
Group              NAS                 Acc (%)   CPU (ms)
Similar latency    FBNet-iPhoneX       73.20     313.0
                   FBNet-S8            73.27     369.6
                   HURRICANE (CPU)     74.59     301.3
Similar accuracy   Proxyless-mobile    74.60     616.5
                   MnasNet-A1          75.20     1056.1
                   HURRICANE (CPU-1)   74.98     381.2
Table 4. Compared with models of the same-level CPU inference latency, HURRICANE (CPU)
improves the top-1 accuracy from 73.27% to 74.59% on ImageNet. Compared with models of
the same-level top-1 accuracy, HURRICANE (CPU-1) reduces the CPU inference time by
1.62X - 2.77X.
One-shot Search. HURRICANE is built on top of Singlepath-Oneshot [9, 20]. Once supernet
training finishes, we perform a 20-iteration evolution search over a total of 1,000
architectures, as in Singlepath-Oneshot. To avoid measuring the latency of every
candidate architecture during search, we build a latency-prediction model with high
accuracy: the average estimation error for DSP, CPU, and VPU is 4.7%, 4.2%, and 0.08%,
respectively. More details about the predictor are provided in the supplementary
materials.
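To make the evolution search in the One-shot Search paragraph concrete, here is a generic latency-constrained evolutionary loop. It is a sketch under assumptions rather than the procedure of [9]: the population of 50 is chosen so that 20 iterations cover at most 1,000 candidates, while `top_k`, `mutate_prob`, the 50/50 split between mutation and crossover, and the `fitness` and `predict_latency` callbacks are hypothetical.

```python
import random


def evolutionary_search(space, fitness, predict_latency, budget_ms,
                        iterations=20, population=50, top_k=10,
                        mutate_prob=0.1, seed=0, max_tries=10_000):
    """Return the best feasible architecture found and its fitness."""
    rng = random.Random(seed)

    def feasible(arch):
        return predict_latency(arch) <= budget_ms

    def random_arch():
        for _ in range(max_tries):
            arch = tuple(rng.choice(ops) for ops in space)
            if feasible(arch):
                return arch
        raise RuntimeError("no feasible architecture found")

    def mutate(arch):
        # Resample each layer's operator with probability mutate_prob.
        return tuple(rng.choice(space[i]) if rng.random() < mutate_prob else op
                     for i, op in enumerate(arch))

    def crossover(a, b):
        # Pick each layer's operator from one of the two parents.
        return tuple(rng.choice(pair) for pair in zip(a, b))

    evaluated = {}  # architecture -> fitness, computed with inherited weights
    candidates = [random_arch() for _ in range(population)]
    for it in range(iterations):
        for arch in candidates:
            if arch not in evaluated:
                evaluated[arch] = fitness(arch)
        if it == iterations - 1:
            break
        parents = sorted(evaluated, key=evaluated.get, reverse=True)[:top_k]
        candidates = []
        while len(candidates) < population:
            if rng.random() < 0.5:
                child = mutate(rng.choice(parents))
            else:
                child = crossover(rng.choice(parents), rng.choice(parents))
            if feasible(child):  # children over the latency budget are discarded
                candidates.append(child)
    best = max(evaluated, key=evaluated.get)
    return best, evaluated[best]
```

In the setting of this paper, `fitness` would correspond to validation accuracy evaluated with weights inherited from the trained supernet, and `predict_latency` to the latency-prediction model, so no candidate needs to be trained or measured on the device during search.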
4.2. Searching on ImageNet Dataset
We compare the models found by HURRICANE with various state-of-the-art efficient models.
The primary metrics we care about are top-1 accuracy on the ImageNet dataset and
inference latency on the three hardware platforms.
Dataset and Training Details. Following [4], we ran-
domly split the original training set into two parts: 50,000
images for validation (50 images for each class exactly) and
the rest as the training set. The original validation set is
used for testing, on which all the evaluation results are re-
ported. We follow most of the training settings and hyper-
parameters used in Singlepath-Oneshot [9], with two ex-
ceptions: (i) For supernet training, the epochs change with
different hardware-aware search spaces (listed in Table 2),
and we stop at the same level training loss as Singlepath-
Oneshot. (ii) For architecture retraining, we change linear
learning rate decay to cosine decay from 0.4 to 0. The batch
size is 1,024. Training uses 4 NVIDIA V100 GPUs.
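The cosine learning rate decay from 0.4 to 0 used for retraining follows the standard half-cosine schedule, sketched below. Whether the rate is updated per step or per epoch, and whether any warm-up is used, is not stated in this section, so the per-step form and the function name are assumptions.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 0.4,
              final_lr: float = 0.0) -> float:
    """Half-cosine decay from base_lr at step 0 to final_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, round(cosine_lr(step, 100), 4))
```

Compared with linear decay, the cosine schedule stays near the base rate longer at the start and flattens out close to zero at the end.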
                                           Singlepath-Oneshot      HURRICANE               HURRICANE
                                           (20 layers)             (Stage1: 12 layers)     (Stage1 + Stage2: 20 layers)
Task  Dataset   Search Space               Acc (%)  Train iters    Acc (%)  Train iters    Acc (%)  Train iters
1     ImageNet  Manually-designed [9]      73.72    144,360        74.01    72,180         74.16    105,864
2     OUI       Manually-designed [9]      86.41    235,800        86.44    112,660        86.90    133,620
3     OUI       Hardware-aware (DSP)       87.22    569,850        86.56    128,380        87.62    150,650
4     OUI       Hardware-aware (CPU)       87.02    476,840        86.75    144,100        87.33    168,990
5     OUI       Hardware-aware (VPU)       86.99    524,000        86.93    133,620        87.07    157,200
Table 5. Compared to Singlepath-Oneshot [9], our two-stage search method achieves higher
accuracy with a much lower search cost (a reduction of 26.7%-73.6%) on both manually
designed and hardware-aware search spaces. We list the training iterations on ImageNet
(batch size 1,024) and OUI (batch size 64) for search cost comparison.
Comparing to State-of-the-art Efficient Models. Table 3 summarizes our experiment
results on ImageNet. HURRICANE surpasses state-of-the-art models. Compared to
MobileNetV2 (top-1 accuracy 72.0%), HURRICANE improves the accuracy by 2.59% to 4.67%
across the target hardware platforms. Compared to state-of-the-art models searched by
NAS, HURRICANE achieves the lowest inference latency on DSP, CPU, and VPU, with better
or comparable accuracy. This demonstrates that it is essential to leverage hardware
diversity in NAS to achieve the best efficiency on different hardware platforms.
Specifically, HURRICANE (DSP) achieves 76.67% accuracy, better than MnasNet-A1 (+1.47%),
FBNet-iPhoneX (+3.47%), FBNet-S8 (+3.4%), Proxyless-R (+2.07%), and Singlepath-Oneshot
(+2.37%). Its latency on DSP is 16.5 ms, a 6.35× inference speedup over FBNet-iPhoneX.
Interestingly, HURRICANE (DSP) is faster than other NAS models despite a much larger
FLOPs count. This runs against the widely accepted belief that a smaller FLOPs count
results in lower latency. Our study of the DSP indicates that complicated operators with
small kernels are the most suitable for this platform; the hardware-aware search space
takes full advantage of this and benefits from such operators (e.g., MB 3 3 in Table 2).
More discussion is given in the Hardware Insights of Section 4.3.1. For VPU and CPU,
Singlepath-Oneshot and FBNet-iPhoneX are the fastest existing NAS models, respectively.
Compared to Singlepath-Oneshot, HURRICANE (VPU) is 0.83% more accurate and 3.1 ms
faster. Compared to FBNet-iPhoneX, HURRICANE (CPU) achieves 1.39% higher accuracy with
an 11.7 ms latency reduction.
Table 3 indicates that MnasNet-A1 and MobileNetV3-Large1.0 achieve 0.61% higher accuracy
than HURRICANE (CPU). To further compare efficiency on CPU, we group related NAS models
into a same-level latency group and a same-level accuracy group in Table 4. For
fairness, we do not compare with MobileNetV3-Large1.0, as it adopts a second,
fine-grained search by NetAdapt on top of MnasNet-A1. The results in Table 4 suggest
that HURRICANE (CPU) achieves the highest accuracy in the same-level CPU inference time
group, and that HURRICANE (CPU-1) achieves a 1.62X - 2.77X lower inference time in the
same-level accuracy group.
Finally, Table 3 shows that none of the existing efficient models is efficient on all
three target hardware platforms. Remarkably, by introducing hardware diversity into the
search space, HURRICANE is the only hardware-aware NAS method that consistently finds
architectures with better accuracy and much lower latency on all the diverse hardware
platforms.
Search Cost Analysis. To compare the search cost,
we report supernet training time reduction compared with
Singlepath-Oneshot instead of exact GPU search days as
[4, 27] for two reasons: (i): the GPU search days are highly
relevant with the experiment environments (e.g., different
GPU hardware) and the code implementation (e.g., Ima-
geNet distributed training). (ii): The primary time cost
comes from supernet training in Singlepath-Oneshot, as the
evolution search is fast that architectures only perform in-
ference.
Compared with Singlepath-Oneshot, HURRICANE (Stage1 + Stage2)
reduces supernet training time by 30.4% and finds models with
better performance. Furthermore, HURRICANE (Stage1) alone
already achieves better classification accuracy than other NAS
methods (with a loss of ≤ 0.5% compared with our final result)
while reducing training time by 54.7% on average, which is
roughly a 2× training time speedup. This demonstrates the
effectiveness of exploring more architecture selections in the
later layers.
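The relation between a reported reduction in training time and the
equivalent speedup factor is simple arithmetic: if training takes a
fraction (1 - r) of the baseline time, the speedup is 1 / (1 - r). The
minimal Python sketch below applies this to the two reductions quoted
above. The function name is ours and purely illustrative.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction r (0 <= r < 1) into a speedup 1 / (1 - r)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Reductions in supernet training time relative to Singlepath-Oneshot,
# taken from the text above.
for label, r in [("Stage1 + Stage2", 0.304), ("Stage1 only", 0.547)]:
    print(f"{label}: {r:.1%} less time -> {speedup_from_reduction(r):.2f}x speedup")
```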
4.3. Ablation Study
4.3.1 Efficiency of hardware-aware search space
We compare our hardware-aware search spaces (listed in
Table 2) with (i) Manual: the search space used in
Singlepath-Oneshot [9], which is designed by domain experts,
and (ii) Random: a search space randomly generated from our
large operator pool (cf. Table 1). An ideal search space
contains many high-accuracy architectures within the latency
constraint, so that the optimal ones are easily sampled by the
search algorithm. Since it is difficult to compare accuracy
due to the very expensive training cost, we uniformly sample
2,000 architectures from the three different search spaces and
compare their latency on the target hardware. Figure 3 shows
that the latency of most architectures in our
hardware-specialized search spaces is very low. This indicates
that our search space is more compact and easier for
constrained sampling than the randomly-designed and
manually-designed search spaces.

Figure 3. Our hardware-specialized search spaces achieve better hardware efficiency than randomly- and manually-designed search spaces.
We randomly sample 2,000 architectures from three different search spaces and benchmark their inference latency on hardware. The x-axis
is log-scaled for better comparison.
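The sampling-based comparison can be sketched as follows. This is only an
illustration under stated assumptions: the operator names, the per-operator
latencies, and the latency limit are made up, and latency is assumed to be
additive over layers, whereas the paper benchmarks sampled architectures on
the real hardware.

```python
import random

# HYPOTHETICAL per-operator latencies in ms (illustrative, not measurements).
OP_LATENCY_MS = {"op_fast": 0.5, "op_mid": 1.5, "op_slow": 4.0, "op_very_slow": 9.0}

NUM_LAYERS = 20  # number of learnable layers in the supernet


def sample_architecture(space, rng):
    """Uniformly pick one candidate operator for every layer."""
    return [rng.choice(candidates) for candidates in space]


def estimate_latency_ms(arch):
    """ASSUMPTION: latency is the sum of per-layer operator latencies."""
    return sum(OP_LATENCY_MS[op] for op in arch)


def fraction_within(space, limit_ms, n_samples=2000, seed=0):
    """Fraction of uniformly sampled architectures that meet the latency limit."""
    rng = random.Random(seed)
    hits = sum(
        estimate_latency_ms(sample_architecture(space, rng)) <= limit_ms
        for _ in range(n_samples)
    )
    return hits / n_samples


# A space restricted to hardware-friendly operators vs. one using the full pool.
specialized_space = [["op_fast", "op_mid"]] * NUM_LAYERS
unrestricted_space = [list(OP_LATENCY_MS)] * NUM_LAYERS

limit = 40.0  # hypothetical latency constraint in ms
print("specialized :", fraction_within(specialized_space, limit))
print("unrestricted:", fraction_within(unrestricted_space, limit))
```

In this toy setting every architecture from the restricted space satisfies
the constraint, while only a small share of the unrestricted samples does,
which is the sense in which a specialized space is easier for constrained
sampling.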
Hardware Insights. We share important insights from
hardware profiling and search space generation.
• Hexagon™ 685 DSP. Small-kernel convolutions (k ≤ 3) are
well optimized on this platform. As a result, all the
operators in the search space have k = 3. This also
allows the search space to contain complicated operators
(with large FLOPs) with small kernels, because they run
more efficiently on this platform than less complicated
operators with bigger kernels. That is why our
HURRICANE (DSP) is faster than other NAS models despite
having much larger FLOPs.
• Myriad™ X VPU. The efficiency is strongly impacted by
whether the operator is natively supported by the AI
accelerator. For example, the SE module has low efficiency
on this platform, because it has to fall back to relatively
slow CPU execution. In contrast, convolutions with bigger
kernels (k = 7) are executed much more efficiently than on
other platforms. This explains why the search space for
this platform selects no SE operators but many more
large-kernel operators (especially in the earlier layers).
• Snapdragon 845 ARM CPU. Even though it involves complex
memory operations, Choice 3 (i.e., the ShuffleNetV2 unit) is
the most efficient operator on this platform.
4.3.2 Effectiveness of two-stage search algorithm
Our two-stage search algorithm groups CNN layers into 12
later and 8 earlier layers: HURRICANE-Stage1 searches the
later layers first, and HURRICANE-Stage2 then searches the
earlier layers for the winning architectures from Stage1. To
further demonstrate the effectiveness of the two-stage search
method, we compare it with Singlepath-Oneshot on a sequence
of tasks in Table 5. Singlepath-Oneshot globally searches
n = 20 layers. For each task, we run the Singlepath-Oneshot
and HURRICANE search algorithms on the same search space.
Tasks 1 and 2 use the original search space of
Singlepath-Oneshot. Tasks 3, 4, and 5 use the hardware-aware
search spaces proposed by HURRICANE for DSP, CPU, and VPU,
respectively.
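To illustrate why splitting the search reduces the number of candidates,
the following sketch counts architectures for a global search over 20
layers versus a 12-layer stage followed by an 8-layer stage. It assumes a
uniform number K of candidate operators per layer (K = 4 is a hypothetical
value) and that each stage enumerates only its own group of layers for a
single fixed setting of the other group. With several Stage1 winners, the
Stage2 count would scale accordingly.

```python
def space_size(num_layers: int, choices_per_layer: int) -> int:
    """Number of distinct architectures when every layer picks one of K operators."""
    return choices_per_layer ** num_layers


K = 4                # ASSUMPTION: uniform number of candidate operators per layer
LATER_LAYERS = 12    # searched in Stage1
EARLIER_LAYERS = 8   # searched in Stage2

# One global search over all 20 layers (as in Singlepath-Oneshot).
global_size = space_size(LATER_LAYERS + EARLIER_LAYERS, K)

# Two stages run one after the other, so their candidate counts add up.
stage1_size = space_size(LATER_LAYERS, K)
stage2_size = space_size(EARLIER_LAYERS, K)
two_stage_size = stage1_size + stage2_size

print(f"global search : {global_size:.3e}")
print(f"two-stage     : {two_stage_size:.3e}")
print(f"ratio         : {global_size / two_stage_size:.1f}x")
```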
Dataset and Training Details. We run the experiments on
ImageNet and, for simplicity, also on OUI-Adience-Age (OUI).
OUI-Adience-Age [7] is a small 8-class dataset consisting of
17,000 face images. We split OUI into training and test sets
with an 8:2 ratio. We adopt the same hyper-parameter settings
as Singlepath-Oneshot, except that we reduce the initial
learning rate from 0.5 to 0.1 and the batch size from 1024 to
64. The supernet is trained until convergence. For
architecture retraining, we train for 400 epochs and replace
the linear learning rate decay with cyclic decay [22] with a
[0, 1] bound. We use one NVIDIA Tesla P100 for training.
Results and Analysis. Table 5 summarizes the experiment
results. The searched models outperform manually designed
light-weight models such as MobileNetV2 (top-1 accuracy:
72.00% on ImageNet, 85.67% on OUI-Adience-Age). For each
task, our proposed method achieves not only higher accuracy
but also lower search cost.
As shown in Table 5, in most tasks a single search step
(Stage1) of HURRICANE already achieves a comparable top-1
accuracy (or an even better one in some tasks), while the
number of training iterations is significantly reduced (by
50%-77.5%). This indicates that operators in the later CNN
layers are more critical for final accuracy. If the
computation budget (e.g., training time) allows, HURRICANE can
benefit from the second search step (Stage2): the accuracy
improves by 0.15%-1.06% at an additional cost of only
4.0%-23.3% of the training iterations. Our gains mainly come
from the search space size reduction achieved by the two-stage
search algorithm. In summary, we demonstrate that the
two-stage search algorithm can find better architectures with
less search cost.
5. Conclusion
In this paper, we propose HURRICANE to address the challenge
of hardware diversity in NAS. By exploring a hardware-aware
search space and a two-stage search algorithm, we demonstrate
that HURRICANE is able to find better models specialized for
different hardware platforms and outperforms previous NAS
methods in both accuracy and training time.
References
[1] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay
Vasudevan, and Quoc V. Le. Understanding and Simplifying
One-Shot Architecture Search. In ICML, 2018. 1, 3
[2] Christopher M. Bishop. Pattern Recognition and Machine
Learning, 2006. 10
[3] Andrew Brock, Theodore Lim, James Millar Ritchie, and
Nicholas J Weston. Smash: One-shot model architecture
search through hypernetworks. In 6th International Confer-
ence on Learning Representations, 2018. 4
[4] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct
neural architecture search on target task and hardware. In
ICLR, 2019. 2, 3, 5, 6, 7
[5] Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. Fair-
nas: Rethinking evaluation fairness of weight sharing neural
architecture search. arXiv preprint, arXiv:1907.01845, 2019.
1, 3
[6] Matthew D. Zeiler and Rob Fergus. Visualizing and
understanding convolutional networks. ECCV, 2014. 4
[7] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and
Gender Estimation of Unfiltered Faces. In TIFS, 2014. 8, 10
[8] Deeptha Girish, Vineeta Singh, and Anca Ralescu. Unsu-
pervised clustering based understanding of cnn. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 9–11, 2019. 4
[9] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng,
Zechun Liu, Yichen Wei, and Jian Sun. Single path
one-shot neural architecture search with uniform sampling.
arXiv:1904.00420, 2019. 1, 3, 4, 5, 6, 7
[10] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh
Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun
Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and
Hartwig Adam. Searching for mobilenetv3. arXiv preprint,
arXiv:1905.02244, 2019. 1, 2, 3, 5, 6
[11] Jie Hu, Li Shen, and Samuel Albanie. Squeeze-and-
excitation networks. arXiv preprint, arXiv:1709.01507,
2017. 3
[12] Intel. Intel distribution of openvino toolkit.
https://software.intel.com/en-us/
openvino-toolkit, 2019. 5
[13] Haifeng Jin, Qingquan Song, and Xia Hu. Auto-Keras: Ef-
ficient Neural Architecture Search with Network Morphism.
arXiv:1806.10282, 2018. 5
[14] Hao Li, Asim Kadav, Igor Durdanovic, Hannan Samet, and
Hans Peter Graf. Pruning filters for efficient convnets. In
ICLR, 2017. 2, 4
[15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS:
Differentiable Architecture Search. ICLR, 2019. 3
[16] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun.
Shufflenet v2: Practical guidelines for efficient cnn architec-
ture design. ECCV, 2018. 3
[17] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and
Jeff Dean. Efficient Neural Architecture Search via Parame-
ter Sharing. In ICML, 2018. 3
[18] Qualcomm. Snapdragon neural processing engine sdk.
https://developer.qualcomm.com/docs/
snpe/setup.html, 2019. 5
[19] Carl Edward Rasmussen and Christopher K.I. Williams.
Gaussian processes for machine learning (adaptive compu-
tation and machine learning), 2005. 10
[20] Megvii Research. Shufflenet series.
https://github.com/megvii-model/ShuffleNet-Series,
2019. 5
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-
moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks. CVPR, 2018. 3, 6
[22] Leslie N. Smith. Cyclical learning rates for training neural
networks. In WACV, 2017. 8
[23] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana
Marculescu. Hyperpower: Power- and memory-constrained
hyper-parameter optimization for neural networks. The Jour-
nal of Machine Learning Research, 2018. 3
[24] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios
Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Mar-
culescu. Single-path nas: Designing hardware-efficient con-
vnets in less than 4 hours. CVPR, 2018. 2, 3
[25] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan,
Mark Sandler, Andrew Howard, and Quoc V. Le. Mnasnet:
Platform-aware neural architecture search for mobile. In
CVPR, 2019. 1, 3, 5, 6
[26] Robert Tibshirani. Regression shrinkage and selection via
the lasso, 1994. 10
[27] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang,
Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing
Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient
convnet design via differentiable neural architecture search. In
CVPR, 2019. 1, 2, 3, 5, 6, 7
[28] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec
Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Ne-
tadapt: Platform-aware neural network adaptation for mobile
applications. ECCV, 2018. 6
[29] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all
layers created equal? arXiv preprint, arXiv:1902.01996v3,
2019. 2, 4
[30] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.
Shufflenet: An extremely efficient convolutional neural net-
work for mobile devices. CVPR, 2018. 1, 3
[31] Barret Zoph and Quoc V. Le. Neural architecture search with
reinforcement learning. arXiv preprint, arXiv:1611.01578,
2016. 1, 3
[32] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V.
Le. Learning transferable architectures for scalable image
recognition. In CVPR, 2018. 1, 3
A. Details on Experiments
Supernet Architecture. Table 6 shows the overall ar-
chitecture of the supernet.
Input shape    Block             channels   repeat   stride
224² × 3       3 × 3 conv        16         1        2
112² × 16      learnable layer   64         4        2
56² × 64       learnable layer   160        4        2
28² × 160      learnable layer   320        8        2
14² × 320      learnable layer   640        4        2
7² × 640       1 × 1 conv        1024       -        -
7² × 1024      GAP               -          1        1
1024           fc                1000       1        1
Table 6. Supernet architecture. The "Block" column denotes the block
type. The "stride" column gives the stride of the first block in
each repeated group. In our paper, we search the operators for a total
of 20 learnable layers.
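As a consistency check on Table 6, the sketch below walks through the
strided stages and recomputes the input shape of each row, as well as the
total number of learnable layers. It assumes that padding is chosen so that
a stride of 2 exactly halves the spatial resolution. The variable and
function names are ours.

```python
# (block, out_channels, repeat, stride) for the strided stages of Table 6.
STAGES = [
    ("3x3 conv", 16, 1, 2),
    ("learnable layer", 64, 4, 2),
    ("learnable layer", 160, 4, 2),
    ("learnable layer", 320, 8, 2),
    ("learnable layer", 640, 4, 2),
]


def trace_shapes(resolution=224, channels=3):
    """Return the (resolution, channels) entering each stage, plus the final output."""
    shapes = [(resolution, channels)]
    for _block, out_channels, _repeat, stride in STAGES:
        # Only the first block of a repeated group is strided (see the caption),
        # so each stage divides the spatial resolution by its stride exactly once.
        resolution //= stride
        channels = out_channels
        shapes.append((resolution, channels))
    return shapes


num_learnable = sum(
    repeat for block, _channels, repeat, _stride in STAGES if block == "learnable layer"
)

for (res, ch), (block, *_rest) in zip(trace_shapes(), STAGES):
    print(f"{res}x{res}x{ch:<4} -> {block}")
print("learnable layers:", num_learnable)
```

The traced shapes end at 7 × 7 × 640, which matches the input of the
1 × 1 convolution row, and the repeats of the learnable stages sum to 20.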
Latency Predictor. In a constrained optimization problem,
it is often necessary to check whether an architecture exceeds
the latency constraint. To reduce the cost and complexity of
frequently connecting to hardware, we develop a latency
predictor that consists of multiple independent
hardware-specific predictors. Each predictor takes the
sequence of operators in all layers as input and predicts the
real latency of the whole architecture on the target hardware
platform. We use the latency predictor during search, and
report the on-device measured latency as the final latency of
the searched architectures.
For each target hardware platform, we uniformly sample
2,000 candidate architectures from the corresponding
hardware-aware search space (cf. Table 2); 80% of them are
used to build the latency predictor and the rest are used for
testing. We encode each architecture by the operators at each
learnable layer and the search space. First, we flatten the
hardware-aware search space into an 84-dimensional vector
O = {o_1, o_2, ..., o_i, ...} (cf. Table 2: 4 × 16 + 5 × 4 = 84),
where each o_i represents an operator. Then, we encode each
architecture a as an 84-dimensional binary vector
B = {b_1, b_2, ..., b_i, ...}, where the binary value b_i
indicates whether operator o_i appears at the corresponding
learnable layer of architecture a. The binary vectors are used
as the training data. Different regression models are selected
for latency prediction on the different hardware platforms.
Specifically, we build a GaussianProcessRegressor with a
Matern kernel (length scale=1.5, nu=0.35) [19], a Lasso
Regression model (alpha=0.01) [26], and a Bayesian Ridge
Regression [2] model for DSP, CPU, and VPU, respectively. The
RMSE (root mean square error) of the prediction model is
0.82ms (DSP), 21.84ms (CPU), and 0.03ms (VPU). These results
suggest that the latency predictors for DSP and VPU are very
accurate, as their RMSE is very low. Although the RMSE on CPU
is relatively high, it corresponds to a small 4.2% latency
estimation error relative to the average latency of the test
set.
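The binary encoding and the RMSE metric described above can be sketched in
plain Python as follows. This is a minimal illustration, not the authors'
code: the placement of the four 5-candidate layers at the end, the example
architecture, and the latency values are all assumptions made for the
example.

```python
import math

# ASSUMPTION: which layers have 5 candidates is not stated; we place them last.
CHOICES_PER_LAYER = [4] * 16 + [5] * 4  # 4 x 16 + 5 x 4 = 84 operator slots


def encode(arch):
    """One-hot encode an architecture given as one operator index per learnable layer."""
    if len(arch) != len(CHOICES_PER_LAYER):
        raise ValueError("expected one operator index per learnable layer")
    bits = []
    for op_index, num_choices in zip(arch, CHOICES_PER_LAYER):
        if not 0 <= op_index < num_choices:
            raise ValueError("operator index out of range for this layer")
        layer_bits = [0] * num_choices
        layer_bits[op_index] = 1  # b_i = 1 iff operator o_i is used at this layer
        bits.extend(layer_bits)
    return bits


def rmse(predicted, measured):
    """Root mean square error between predicted and measured latencies (ms)."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


arch = [0] * 16 + [4] * 4           # a hypothetical architecture
vector = encode(arch)
print(len(vector), sum(vector))     # vector length and number of active bits

# Made-up latencies, only to show how the metric is computed.
print(rmse([10.0, 12.0, 9.0], [11.0, 12.0, 8.0]))
```

Vectors of this form could then be passed to any off-the-shelf regressor,
for instance the `GaussianProcessRegressor`, `Lasso`, and `BayesianRidge`
estimators in scikit-learn, although the text does not state which library
was used.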
OUI Dataset. As discussed in Section 4.3.2, we use the
small OUI-Adience-Age dataset [7] to analyze the effectiveness
of the two-stage algorithm. For architecture retraining, we
split OUI into a training set and a test set with an 8:2
ratio. For architecture search, we randomly split the training
set into two parts: 5,567 images for validation and the rest
as the training set for the supernet.
B. HURRICANE Searched Architectures
Figure 5 visualizes the architectures searched on the three
hardware platforms: DSP, CPU, and VPU. The structure of each
operator is plotted in Figure 4. The results indicate some
interesting behaviors: (i) on all three hardware platforms,
the last four learnable layers tend to select operators with
a large FLOPs count; (ii) for the early learnable layers
(i.e., feature maps with high resolution), HURRICANE (CPU)
prefers smaller operators (i.e., Choice 3) than DSP and VPU
do. This is because a large FLOPs count heavily impacts CPU
efficiency. In contrast, DSP and VPU have much higher
parallelism, which keeps larger operators efficient.
(a) Choice k (b) ChoiceX (c) SEP k (d) MB k e
Figure 4. Operators in Table 1. SE∗ indicates the position to add squeeze-and-excitation block. When SE∗ is enabled, the operations
are Choice k SE, ChoiceX SE, SEP k SE, MB k e SE. k indicates kernel size, where k = 3, 5, 7. e indicates the expansion rate, where
e = 1, 3, 6.
Figure 5. Structures of searched architectures on ImageNet in Table 3.
