Performance-Oriented Neural Architecture Search by Anderson, Andrew et al.
Andrew Anderson∗, Jing Su†, Rozenn Dahyot‡ and David Gregg§
School of Computer Science and Statistics, Trinity College Dublin
Dublin, Ireland
Email: ∗andersan@cs.tcd.ie, †jing.su@tcd.ie, ‡Rozenn.Dahyot@tcd.ie, §dgregg@cs.tcd.ie
Abstract—Hardware-Software Co-Design is a highly successful
strategy for improving performance of domain-specific com-
puting systems. We argue for the application of the same
methodology to deep learning; specifically, we propose to extend
neural architecture search with information about the hardware
to ensure that the model designs produced are highly efficient in
addition to the typical criteria around accuracy.
Using the task of keyword spotting in audio on edge computing
devices, we demonstrate that our approach results in a neural
architecture that is not only highly accurate, but also efficiently
mapped to the computing platform which will perform the
inference. Using our modified neural architecture search, we
demonstrate a 0.88% increase in TOP-1 accuracy with a 1.85×
reduction in latency for keyword spotting in audio on an
embedded SoC, and a 1.59× reduction on a high-end GPU.
Index Terms—Deep Neural Networks, Neural Architecture
Search
I. MOTIVATION
For modern edge computing systems, the task of implement-
ing deep learning applications with high efficiency has become
critical. For example, automatic speech recognition [13] (ASR)
is an area where significant industrial effort is being focused
to move the problem on-device and avoid an always-on
connection to cloud computing resources.
The typical approach is first to select a deep neural network
(DNN) architecture and train it to acceptable accuracy on
some task. Techniques such as weight pruning [25] and model
quantization [22] are then applied, so that the inference has
acceptable latency and memory consumption on the target
device. This general workflow is applied in the vast majority
of research, for example, in the recent on-device ASR work
at Google [12].
However, certain features of neural architecture, especially
in convolutional neural networks (CNNs), present hard
roadblocks for performance. For example, the sizes and shapes
chosen for convolution kernels determine whether or not fast
and memory-efficient convolution algorithms can be applied
at inference time [1].
In this paper, we argue for a more flexible and holistic
approach to optimization of deep learning applications. We
propose to adapt techniques from hardware-software co-design
to the domain of deep learning, by performing neural archi-
tecture search with both classification accuracy and inference
performance as primary objectives.
Extending neural architecture search with information about
the hardware on which the inference will be deployed allows
decisions to be made at a high level concerning the structure of
the neural network which have profound implications for per-
formance. Without this information, we are searching “in the
dark” with regard to inference performance, and it is unlikely
that the search will find high efficiency neural architectures,
except by chance.
Contribution
In this paper, we extend neural architecture search with
an inference performance cost model to allow performance
concerns to be directly integrated in the architecture search.
We propose a neural architecture search based on the idea of
iterative refinement of stepwise-optimal candidates that allows
the user to control the depth and breadth of the search.
We demonstrate that it is possible to find variations of neural
architecture with equivalent classification accuracy but greatly
improved inference performance, even before weight pruning
or model quantization are considered.
Paper Structure
We first present some background on neural architecture
search, and inference performance in deep neural networks.
Next, we present our performance-oriented neural architec-
ture search algorithm. Using our approach, we explore varia-
tions of the initial seed architecture and collect classification
accuracy and inference performance statistics.
We evaluate our approach on a real-world edge platform
by considering the application of a keyword spotting (KWS)
pipeline on an ARM Cortex-A class processor. For context,
we also benchmark the models produced by our approach
on a typical high-end GPU commonly used for deep-learning
workloads. Finally, we present some related work and discuss
potential extensions of our approach.
II. BACKGROUND
Many deep learning models in widespread use today are
still designed by humans [17], [14] or according to simple
schemes where submodules are programmatically stacked to
a configurable depth to construct the network [19], [23], [18].
Neural architecture search is typically performed in con-
junction with hyperparameter optimization to automatically
discover model architectures which can be trained to a high
degree of accuracy for a specific learning task [27], [10], [9].
arXiv:2001.02976v1 [cs.LG] 9 Jan 2020
However, the search process may discover many model
architectures which can be trained to an approximately equiv-
alent degree of accuracy. When resources are no object at
inference time, any of these models will suffice. However,
when performing inference on edge devices, with limited
memory and compute capacities, it is crucial to optimize the
model to minimize storage and computational requirements.
Variants of neural architecture search have been success-
fully applied to many networks in different contexts in prior
work [27]. However, these investigations have focused only
on improving the accuracy of the trained networks.
Yang et al. propose the use of co-design methods to produce
a customized hardware design to accelerate a specific neural
network [24]. Our work differs from the work of Yang et
al. primarily in that we propose a generic method by which
the entire space of customized networks derived from some
initial seed network can automatically be explored, rather than
designing a custom network by hand.
Cai et al. [5] propose a modification to the training process
of the neural network to directly incorporate parameter search.
However, a key difference with our work is that Cai et al.
require a set of fixed alternatives to be chosen in advance,
rather than our fine-grained bounds for individual architectural
parameters. Nevertheless, the approach of Cai et al. is shown
to yield improvements in two image classification tasks.
Pruning weights with small values induces sparsity in
the model [11] and is a popular technique for decreasing
operation count. However, the degree to which a model can be
pruned without failing to meet classification accuracy targets
is unpredictable and largely model-dependent. Furthermore, it
is not always clear how to select an effective pruning approach
for any given model beyond trial and error.
Quantization is another major approach to reducing model
size and improving inference performance. By converting
weights to a smaller numeric format, the model size is
reduced. Although quantization does not change the number
of operations, computation on smaller types is often faster,
especially on FPGA [3], or custom accelerators [11], where
the arithmetic implementation in hardware can be customized
to match the model quantization scheme.
On general purpose processors and GPUs, quantization is
restricted to the use of the numeric types already present in
hardware. Nonetheless, quantization often results in a net gain
in inference performance [8].
A. DNN Convolution
Convolution is the most computationally heavy primitive
in the majority of popular deep neural networks. As such, it
presents a natural target for optimization.
Figure 1 shows the data flow in the DNN convolution oper-
ation. The H and W parameters are determined by the size
of the input feature maps, but the k, C, and M parameters
characterize the convolution being applied. The filter size of
the convolution is k × k points, and the number of feature
maps, C, determines how many of these k × k filters are
trained. Since convolutions are connected input-to-output in
the neural network, the number of feature maps on the output,
M, matches the number of feature maps C on the input of the
next convolution in the chain.
Fig. 1: DNN Convolution: C input feature maps, each of size
H × W, are convolved with M multichannel filters, each with
C channels and a k × k kernel, resulting in M output feature
maps.
Since inference performance is dominated by the convo-
lutional layers, the size and number of convolutional filters
trained in each convolutional layer are natural candidates
for search. With very few or very small filters, the network
encodes much less information than with more numerous or
larger filters. However, a reduction in filter quantity or filter
size reduces both the operation count, and the model size,
resulting in more efficient inference.
Striking a balance between these opposing objectives is the
goal of this paper. We propose to incorporate inference per-
formance concerns directly in the neural architecture search,
by extending neural architecture search with a new objective
function incorporating inference performance.
Inference Performance: The inference performance of
convolutional DNNs is dominated by the time spent executing
the convolutional layers. $k_h k_w$ multiplications and additions
are used to compute the output of a single filter. All C filters
are then applied at H × W input points, where H and W
are subdivided by their respective strides, $S_h$, $S_w$, if present.
C additions per output point are then performed to compute
one output feature map. Since there are M output feature
maps, the total operation count is
$$M\left(C\left\lceil\frac{H}{S_h}\right\rceil\left\lceil\frac{W}{S_w}\right\rceil(2k_hk_w) + C\left\lceil\frac{H}{S_h}\right\rceil\left\lceil\frac{W}{S_w}\right\rceil\right)$$
or more succinctly,
$$\mathrm{FP}_{\mathrm{OPS}} = MC\left\lceil\frac{H}{S_h}\right\rceil\left\lceil\frac{W}{S_w}\right\rceil(2k_hk_w + 1) \tag{1}$$
Similarly, the number of weight parameters for a
convolutional layer can be expressed as
$$\mathrm{FP}_{\mathrm{PARAMS}} = MC(k_hk_w) \tag{2}$$
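As a minimal sketch of Equations 1 and 2, the per-layer operation and parameter counts can be computed directly from the layer shape, and summed over several layers. The function names and the example layer shapes below are illustrative assumptions and are not taken from the experiments.

```python
import math


def conv_ops(M, C, H, W, kh, kw, sh=1, sw=1):
    """Operation count of one convolutional layer (Equation 1)."""
    out_points = math.ceil(H / sh) * math.ceil(W / sw)
    return M * C * out_points * (2 * kh * kw + 1)


def conv_params(M, C, kh, kw):
    """Weight parameter count of one convolutional layer (Equation 2)."""
    return M * C * kh * kw


def network_ops(layers):
    """Sum of Equation 1 over a list of layer descriptions (keyword dicts)."""
    return sum(conv_ops(**layer) for layer in layers)


# Illustrative shapes only (assumed, not the measured shapes of any model).
first = dict(M=100, C=1, H=40, W=32, kh=4, kw=10, sh=1, sw=2)
second = dict(M=100, C=100, H=20, W=8, kh=3, kw=3)

print(conv_ops(**first))             # 5184000
print(conv_params(100, 1, 4, 10))    # 4000
print(network_ops([first, second]))  # 35584000
```

Because M and C enter the count as a product, halving the number of filters in every layer reduces the cost of each interior layer by roughly a factor of four under this model.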
There are many special-case algorithms for computing a direct
convolution, which all have the same asymptotic complexity,
but exploit spatial locality and other features of the convolution
to make efficient use of different primitive operations, such as
matrix multiplication [20]. The choice of filter size and shape
determines which of these algorithms can actually be used.
III. OUR APPROACH
Prior work [4], [27] on neural architecture search ranges in
scope from the design of whole-network connective structure
to the value ranges of specific hyperparameters. In order to
demonstrate our performance-oriented search, we focus on the
two kinds of parameters which most immediately influence
operation count in the network: filter size, k, and feature map
count, M . We consider the filter size k to consist of two
distinct sub-parameters, kh and kw, to allow for non-square
filters, since these are observed in several expert-designed
networks for keyword spotting.
Taking the expert-designed network from the work of De
Prado et al. [8] as the initial seed architecture, we explore
variant networks with the same connective structure, but
different kh, kw, and M parameters on all layers.
In principle, this search can be extended to incorporate many
other parameters, provided that their impact on the operation
count in the network can be specified.
A. Search
We begin by constructing a search space by fixing a range for
each of the parameters kh, kw and M , for each convolution
in the network. Each point in this search space is a unique
configuration of the neural network. There are two key metrics
of network quality which we consider: accuracy and cost.
To model how well any given configuration of the network
classifies, we measure the TOP-1 accuracy across some fixed
number of inferences. Using the formula in Equation 1, we
also compute the cost of the configuration as the sum of the
operation count in all convolutional layers.
In principle, any measure of accuracy and of cost can be
used. For example, to relax the accuracy requirement, the
average top-5 accuracy could be considered instead of TOP-1.
Since kh, kw, and M are arbitrary (positive) integers,
the configuration space of the network is extremely large.
Recall that these parameters can vary independently for each
convolutional layer in the network. In practice, for any one
layer, kh and kw are bounded above by the height H and
width W of the input feature maps, since setting kh = H
and kw = W means that the convolution produces a single
scalar output. However, the number of filters M is unbounded
in practice, although we would expect diminishing returns in
terms of accuracy as the filter count grows very large.
Even by fixing kh, kw, and M to be bounded above by
their settings in the seed network, many configurations remain,
the evaluation of which requires some number of training
iterations.
Prior work on Neural Architecture Search has employed
numerous approaches to deal with the size and complexity
of the search spaces involved. All of the approaches share
a common design element, in that they attempt to learn the
structure of the search space so that the majority of the search
time is spent evaluating candidate configurations which are
likely to be of high quality.
Two of the most popular approaches are to use reinforcement
learning or hypernetworks to learn the structure of the
search space. In the “SMASH” approach of Brock et al. [4],
a hypernetwork is trained to generate the weights of candidate
configurations, so that candidates can be ranked without
training each one from scratch. In the approach of Zoph and
Le [27], a recurrent neural network (RNN) is trained with
reinforcement learning, and learns to propose strings
specifying the configuration of the network.
This approach works by feeding the accuracy of the child
network (the CNN matching the proposed configuration) back
into the parent RNN to modify the loss.
In principle, any method may be chosen to explore the
configuration space of the network. Regardless of the choice
of method, the ideal result is a set of configurations of the
network which are high-quality, in the sense that they have
high accuracy and low cost. However, according to the needs
of the programmer, some of these high-quality configurations
may be better than others. For example, the best candidate
for a search that aims strictly to find the least-cost network
meeting some accuracy target may differ from the best
candidate for a search that aims to find the most accurate
network that does not exceed a certain cost.
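The two selection rules just described can be sketched as follows. The candidate names, accuracies and costs are hypothetical values chosen only to show that the two aims can pick different configurations from the same set.

```python
# Each candidate is (name, accuracy, cost); all values are hypothetical.
candidates = [
    ("a", 0.95, 220.0),
    ("b", 0.94, 90.0),
    ("c", 0.93, 40.0),
    ("d", 0.91, 20.0),
]


def least_cost_meeting(cands, min_accuracy):
    """Cheapest candidate whose accuracy reaches the target, or None."""
    feasible = [c for c in cands if c[1] >= min_accuracy]
    return min(feasible, key=lambda c: c[2]) if feasible else None


def most_accurate_within(cands, max_cost):
    """Most accurate candidate whose cost stays within the budget, or None."""
    feasible = [c for c in cands if c[2] <= max_cost]
    return max(feasible, key=lambda c: c[1]) if feasible else None


print(least_cost_meeting(candidates, 0.93)[0])     # c
print(most_accurate_within(candidates, 100.0)[0])  # b
```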
To evaluate a candidate model while exploring the search
space, some budget of training iterations is required. Depend-
ing on the available computing resources for training, this
may be larger or smaller. With a smaller training iteration
budget, more points in the search space can be explored in
a fixed amount of time than with a larger budget. However,
with a larger budget, the candidates will be evaluated in more
depth, meaning discriminating between them will be easier.
The stopping criterion for one search phase may be a wall-
clock time elapsed, or a certain number of models explored.
In either case, after executing a search phase, we obtain a set
of candidate models trained for some number of iterations.
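A search phase with either stopping criterion can be sketched as a simple loop. This is only an illustration: uniform random sampling stands in for whatever search method is chosen, and `sample` and `evaluate` are hypothetical callables (a real `evaluate` would train the candidate for the iteration budget and also report its accuracy).

```python
import random
import time


def search_phase(sample, evaluate, max_models=None, max_seconds=None):
    """Collect evaluated candidates until a model-count or wall-clock limit is hit."""
    if max_models is None and max_seconds is None:
        raise ValueError("at least one stopping criterion is required")
    start = time.monotonic()
    results = []
    while True:
        if max_models is not None and len(results) >= max_models:
            break
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        config = sample()
        results.append((config, evaluate(config)))
    return results


rng = random.Random(0)


def sample():
    # Hypothetical single-layer configuration drawn uniformly at random.
    return {"kh": rng.randint(1, 5), "kw": rng.randint(1, 5), "M": rng.randint(1, 100)}


def evaluate(config):
    # Stand-in that returns only a cost proxy; no training is performed here.
    return config["M"] * (2 * config["kh"] * config["kw"] + 1)


results = search_phase(sample, evaluate, max_models=5)
print(len(results))  # 5
```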
B. Refinement
The second phase of our approach is the refinement of the set
of candidate models. We construct the Pareto frontier of the
model set produced from the search stage with respect to the
two criteria of accuracy and cost.
If all, or very many, candidates on the Pareto frontier share a
common substructure (i.e. common settings for any kh, kw, or
M parameter), we fix that substructure, and repeat the search
phase. Since we have reduced the search space by fixing some
substructure, each of these iterative refinements becomes faster
to evaluate. This means that we can explore more of the search
space around refined configurations (with a fixed training
iteration budget), or evaluate candidates more precisely (with
an increased training iteration budget). In this way, the search
process works at a progressively finer granularity the more
refinement steps are taken.
Fig. 2: Keyword Spotting workflow. A one-second mono-channel speech signal is used as input. An MFCC spectrogram is
computed and used as the input to a neural network. At inference, an unknown signal is processed through this workflow to
produce a label prediction.
[Figure 3: six convolutional units in sequence. Filter size and
stride per unit: 4x10, 1:2; 3x3, 2:2; 3x3, 1:1; 3x3, 1:1;
3x3, 1:1; 3x3, 1:1. Each unit has 100 filters.]
Fig. 3: Seed CNN Architecture. Each unit performs a
convolution, batch normalization, scale, and ReLU activation. The
first and second unit perform strided convolutions with the
indicated vertical and horizontal subsampling factors, but the
remaining convolutions are stride 1. All units have 100 filters
of size 3×3 except the first unit, where the filter size is 4×10.
A Pareto improvement is a change that makes at least
one criterion better without making any other criterion worse,
given a certain initial configuration. A configuration is Pareto
optimal when no further Pareto improvements can be made.
The Pareto frontier is the set of all Pareto optimal
configurations, i.e. configurations of the network where no other
configuration can achieve at least as much accuracy with lower
cost, nor strictly more accuracy with equivalent cost.
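The definition of the Pareto frontier given above can be expressed compactly in code. The following is a minimal quadratic-time sketch over (name, accuracy, cost) tuples; the candidate values are hypothetical and the function names are our own.

```python
def dominates(b, a):
    """True if b is at least as accurate and at most as costly as a,
    and strictly better in at least one of the two criteria."""
    _, acc_a, cost_a = a
    _, acc_b, cost_b = b
    no_worse = acc_b >= acc_a and cost_b <= cost_a
    strictly_better = acc_b > acc_a or cost_b < cost_a
    return no_worse and strictly_better


def pareto_frontier(cands):
    """Candidates that no other candidate dominates, in input order."""
    return [a for a in cands if not any(dominates(b, a) for b in cands)]


# Hypothetical (name, accuracy, cost) candidates.
candidates = [
    ("a", 0.94, 100.0),
    ("b", 0.93, 50.0),
    ("c", 0.92, 80.0),   # dominated by b
    ("d", 0.95, 200.0),
    ("e", 0.94, 150.0),  # dominated by a
]

frontier = pareto_frontier(candidates)
print([name for name, _, _ in frontier])  # ['a', 'b', 'd']
```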
In practice, we found that fixing the most commonly ob-
served parameter setting in the Pareto-optimal models worked
very well in the refinement phase. It is important to note
that when we freeze a parameter setting, although we do not
continue to evaluate candidates which do not share the setting
for the fixed architectural parameter, we do not discard Pareto
optimal candidates which have already been found. Models on
the Pareto frontier which do not share the chosen setting for
a frozen parameter might continue to be on the frontier even
after the other candidates, and their variants, have been trained
for further iterations. Algorithm 1 summarizes the workflow
of our neural architecture search strategy. Steps 4 and 5 may
be repeated until all architectural parameters are frozen.
Algorithm 1: Neural Architecture Search
1 Select the seed architecture
2 Initial search over network/optimizer parameters
3 Select Pareto-optimal architectures
4 Refined search over network parameters
5 Select Pareto-optimal architectures and freeze parameters
6 Fine-tune Pareto-optimal models
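The freezing step of the refinement phase can be sketched as follows: count how often each parameter setting occurs among the Pareto-optimal models, then restrict the search space to the most common one. The parameter names, the example frontier and the search space are hypothetical; the decision of whether the observed share is high enough to freeze is left to the caller.

```python
from collections import Counter


def most_common_setting(models):
    """Return (parameter, value, share) for the most frequent setting.

    `models` is a list of dicts mapping parameter name to value.
    """
    counts = Counter((p, v) for m in models for p, v in m.items())
    (param, value), n = counts.most_common(1)[0]
    return param, value, n / len(models)


def freeze(space, param, value):
    """Copy of the search space with one parameter fixed to a single value."""
    frozen = dict(space)
    frozen[param] = [value]
    return frozen


# Hypothetical Pareto-optimal models over two filter-size parameters.
pareto_models = [
    {"conv1_k": (3, 3), "conv2_k": (1, 1)},
    {"conv1_k": (3, 3), "conv2_k": (3, 3)},
    {"conv1_k": (3, 3), "conv2_k": (5, 5)},
    {"conv1_k": (5, 5), "conv2_k": (1, 1)},
]
space = {"conv1_k": [(3, 3), (5, 5)], "conv2_k": [(1, 1), (3, 3), (5, 5)]}

param, value, share = most_common_setting(pareto_models)
print(param, value, share)  # conv1_k (3, 3) 0.75
refined = freeze(space, param, value)
print(refined["conv1_k"])   # [(3, 3)]
```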
IV. EXPERIMENTATION
In this paper, we use keyword spotting (KWS) as an
example to study neural architecture search. A KWS system
takes an input audio signal and predicts its most likely text
label. This system can be trained to work through a typical
supervised learning approach together with labelled speech
samples. Instead of classifying audio signals directly, a feature
extraction step is applied to generate a spectrogram from
the audio. Computing the Mel-frequency Cepstral Coefficients
(MFCCs) [7] is a typical preprocessing step for ASR applica-
tions. Figure 2 shows an end-to-end KWS workflow in which
2D MFCCs are used to train a neural network classifier. For
all models we generate MFCC features in the same way. 40
frequency bands are applied to 16kHz audio samples with
128ms frame length and 32ms stride. We use the Google
speech command data set [21] which has samples of 1 second
in length, so the MFCC tensor has 40x32 features.
The speech commands dataset contains 65,000 audio sam-
ples of 30 keywords. We selected 10 keywords (yes, no, up,
down, left, right, on, off, stop, go) for our experiment and
used the default training, validation and test split, 80:10:10
respectively.
We implemented our approach using the Microsoft NNI
(Neural Network Intelligence) toolkit [6], for constructing
automated machine learning (AutoML) experiments. NNI is
designed for building automated searches for the best neural
architecture and/or hyper-parameter sets.
For the two objectives we chose the TOP-1 accuracy across
100 inferences as the estimator of network accuracy, and the
sum of operation counts (Equation 1) as the estimator of
network cost. Rather than a reinforcement learning or RNN
based search, we chose to use the Tree-structured Parzen
Estimator [2] (TPE), because each trial involves training a
neural network to convergence and is therefore costly. This
estimator is already implemented in the NNI toolkit.
Seed Architecture
A 6-layer CNN model (Figure 3) previously achieved
94.23% test accuracy with the Google speech commands
dataset [21], which makes it a good starting point for neural
architecture search. We refer to the Caffe-trained model of [8]
as the original model (i.e. with the parameters as found
in [8] that yields 94.23% TOP-1 accuracy), and its neural
architecture as the seed architecture.
Starting with the seed architecture, we configured NNI to
perform a search for the number of feature maps M , kernel
height kh and kernel width kw for each convolutional layer.
Search in Experiments
Microsoft’s NNI toolkit offers many options to tune network
hyperparameters as well as solver/optimizer parameters, in-
cluding training iterations, batch size, learning rate and decay
strategy. However, the search space expands exponentially if
we try to search for values for all parameters independently. In
order to finish our experiment within a reasonable time budget
we used a two step approach. Firstly, we explored solver
parameters used to train the original model. The original model
is trained in 40,000 iterations with the ADAM optimizer [16]
with a batch size of 100. The base learning rate is 5 × 10⁻⁴
and it drops 70% every 10,000 iterations.
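The learning rate schedule of the original model can be written as a step decay. This sketch assumes that "drops 70%" means the rate is multiplied by 0.3 at each 10,000-iteration boundary; the function name is our own.

```python
def step_lr(iteration, base_lr=5e-4, drop=0.70, step=10_000):
    """Learning rate at a given iteration under step decay.

    Assumes each drop multiplies the current rate by (1 - drop).
    """
    return base_lr * (1.0 - drop) ** (iteration // step)


for it in (0, 10_000, 20_000, 30_000):
    print(it, step_lr(it))
```

Under this reading, the rate used for the final 10,000 of the 40,000 iterations is 0.3³ = 0.027 times the base rate.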
Performing a hyperparameter search with NNI, we found
that using a learning rate of 1 × 10⁻³ and batch size of 25
with 8,000 training iterations had high correlation with top
accuracy scores. The ADAM optimizer was still preferred. We
fixed this set of new solver parameters before beginning our
search.
Our search space bounds for the three parameter types were
configured as follows. For the per-layer bound on M , we used
the setting in the seed architecture M = 100 as an upper
bound, with a lower bound of 1. For the settings of kh and
kw, we used the seed architecture settings as upper bounds for
the first convolutional layer (kh = 4, kw = 10), and for the
remaining layers we increased the bounds slightly from the
original setting of (kh = 3, kw = 3) to (kh = 5, kw = 5). All
kh and kw lower bounds were also set to 1.
This choice of bounds means we are searching for networks
where each layer has at most as many filters, M , as in the
original network, and the first layer filter size is at most as
large as in the original network. The filter sizes of subsequent
layers may be larger than in the original network, up to a
bound of 66% growth in each spatial dimension.
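To give a sense of scale, the bounds above can be written down as a search space and its size counted. This is a sketch under stated assumptions: the parameter names are hypothetical, every integer within each bound is assumed to be admissible, and the dictionary layout follows the `_type`/`_value` convention of NNI search space files. It is not the configuration file used in the experiments.

```python
import json
from math import prod

NUM_LAYERS = 6  # convolutional units in the seed architecture


def build_space():
    """Search space with per-layer choices for M, kh and kw."""
    space = {}
    for layer in range(1, NUM_LAYERS + 1):
        # Upper bounds: seed settings for the first layer, 5x5 elsewhere.
        kh_max, kw_max = (4, 10) if layer == 1 else (5, 5)
        bounds = {"M": 100, "kh": kh_max, "kw": kw_max}
        for name, upper in bounds.items():
            space[f"conv{layer}_{name}"] = {
                "_type": "choice",
                "_value": list(range(1, upper + 1)),  # lower bound is 1
            }
    return space


def space_size(space):
    """Number of distinct configurations in the space."""
    return prod(len(entry["_value"]) for entry in space.values())


space = build_space()
print(len(space))                       # 18
print(space_size(space))                # 390625000000000000000
print(json.dumps(space["conv1_kh"]))    # {"_type": "choice", "_value": [1, 2, 3, 4]}
```

Even with every parameter bounded, roughly 3.9 × 10²⁰ configurations remain under these assumptions, which is why exhaustive evaluation is not an option.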
Refinement in Experiments
We ran the search stage of the experiment with our initial
configuration until 300 candidate networks were produced.
Looking at the Pareto frontier in this experiment, we noticed
that the vast majority of high-quality candidates used a filter
size of (kh = 3, kw = 3) for the first convolutional layer.
There was no other significant common substructure in the
candidates. Fixing this filter size for the first layer, we repeated
the search step with the refined seed network until a further
500 candidates were produced. We computed the Pareto
frontier for this refined set of candidates, and fine-tuned the 12
Pareto-optimal models until each had hit the limit of 40,000
training iterations (i.e. the number of iterations used to train
the seed model). Figure 4 shows the results of the experiment.
TABLE I: Interesting Model Configurations from Figure 4
TOP-1    MFPOPS   δ TOP-1   δ FPOPS    Note
0.9423   581.12    0.0%       1×       seed model
0.8960    17.22   -4.63%    -33.75×    best δ FPOPS
0.9410    87.61   -0.09%     -6.63×    fastest δ TOP-1 ≈ 0
0.9425   167.68    0.02%     -3.47×    fastest δ TOP-1 > 0
0.9511   223.44    0.88%     -2.60×    best δ TOP-1
Since the non-square kernel (kh = 4, kw = 10) of the first
convolutional layer yielded high accuracy in CNN, DSCNN
and CRNN scenarios [26], it is believed to be a good design
for KWS. However, our search found that a (kh = 3, kw = 3)
kernel in the first layer is a better choice.
After investigating the data, we found that in [26] the MFCC
features were generated from 40ms speech frames, whereas
we generated MFCC features from 128ms speech frames.
Consequently the MFCC feature map in our experiment covers
more temporal information (W dimension of MFCC) than the
one in previous work. This setting enables good performance
of smaller square kernels and it explains why the kh and
kw values we observe are preferred by the search. It also
demonstrates the power of neural architecture search to adapt
the model to changes in the conditioning of the dataset without
intervention on the part of the end-user.
Observations
The most immediate observation from the experimental data is
that, once the TOP-1 accuracy exceeds 90%, the vast majority
of candidates which achieve any given accuracy target are
much more expensive than they need to be. For many of the
gradations in TOP-1 accuracy, the difference between the least
cost and greatest cost configuration is close to, or exceeds, an
order of magnitude.
Using traditional approaches to Neural Architecture Search,
which are oblivious to inference performance, we have no way
to ensure that the search process will choose a candidate with
reasonable cost, and it is clear that the likelihood of making
a very expensive choice is high.
Table I summarizes the most interesting models we found
in our experiment. The first row shows the seed model for
reference. The second row shows the configuration
which had the largest reduction in operation count versus
the seed architecture: the search finds a configuration which
uses 33.75× fewer operations for inference than the seed
architecture, at the cost of a 4.63 percentage point reduction
in TOP-1 accuracy.
The least-cost configuration found which had approximately
the same accuracy as the seed architecture exhibited a reduc-
tion in operation count of 6.63×, while the least-cost architec-
ture that was strictly more accurate has a reduction of 3.47×.
Finally, the most accurate configuration found improves TOP-
1 accuracy by 0.88% points, to 95.11% TOP-1, while reducing
operation count by 2.6× over the seed architecture.
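The reduction factors quoted above follow directly from the MFPOPS column of Table I, as this short check shows. Only values printed in the table are used; the dictionary keys are our own labels.

```python
SEED_MFPOPS = 581.12  # seed model, Table I

# MFPOPS of the highlighted configurations, copied from Table I.
MFPOPS = {
    "largest reduction": 17.22,
    "least cost, accuracy about equal to seed": 87.61,
    "least cost, accuracy above seed": 167.68,
    "most accurate": 223.44,
}


def reduction_factor(mfpops, seed=SEED_MFPOPS):
    """How many times fewer operations a model uses than the seed model."""
    return seed / mfpops


for note, value in MFPOPS.items():
    print(f"{note}: {reduction_factor(value):.2f}x")
# largest reduction: 33.75x
# least cost, accuracy about equal to seed: 6.63x
# least cost, accuracy above seed: 3.47x
# most accurate: 2.60x
```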
[Figure 4: scatter plot, "Accuracy vs Inference Cost". x-axis: TOP-1 (0.84 to 0.96); y-axis: FP ops (0 to 3x10^8). Series: Candidate, Pareto Optimal, Finetuned.]
Fig. 4: Experimental Results from Model Search.
The figure shows the operation count of the candidate models versus their TOP-1 test accuracy. Models within at most -1%
point deviation from the test accuracy of the seed model occupy the shaded gray area in the graph. Pareto-optimal models are
indicated with crosses ×. All models were trained for 8,000 iterations during search. The final set of Pareto-optimal models
were trained for a total of 40,000 iterations (equivalent to the seed model). These are the “Finetuned” data points.
Real-World Benchmarking
Using the full set of Pareto-optimal CNN models from Fig-
ure 4, we performed two real-world benchmarking experi-
ments, one on an embedded SoC, the Samsung Exynos 5 Octa
(Exynos 5422), and one on a high-end GPU (the NVIDIA
GTX 1080Ti). We used the ODroid XU3 system, which has
2GB of RAM. Our experiments used OpenBLAS 0.3.5. For
the GPU experiments, we used the CUDA toolkit version 10.0
and cuDNN version 7.5.
All models in the Pareto frontier were benchmarked using
Caffe. 100 inferences were performed in all cases. The runtime
of the seed model was also benchmarked. Labels on bars are
the TOP-1 score of the model for which inference time is
being benchmarked. All models (including seed model) were
trained for a total of 40,000 iterations.
Figure 5 shows the single-inference latency of the Pareto-
optimal models using Caffe [15] on the Exynos 5422 system,
and Figure 6 shows the same data on the GTX 1080Ti, using
the Caffe cuDNN backend.
The mapping from FPops to inference latency is not one-
to-one, since Caffe uses GEMM to implement the arithmetic,
and spatial locality and other implementation concerns affect
the execution time. Nevertheless, we see that the general trend
observed in Figure 4 is still observed in practice.
Discovered Models
Table II shows the network architectures on the Pareto
frontier in Figure 4. The configuration of convolutional layers
is written {kh×kw, M}. Final TOP-1 scores after fine-tuning
are shown, along with operation count. Additionally, the first
row shows the parameters of the seed architecture.
[Figure 5: bar chart, "Accuracy vs Inference Time", finetuned models after 40K iterations. y-axis: Time (ms), 20 to 180. Series: Discovered Models, Seed Model. Bar labels: 0.919, 0.927, 0.934, 0.934, 0.935, 0.936, 0.937, 0.938, 0.938, 0.941, 0.943, 0.951, 0.942.]
Fig. 5: Exynos 5422 Experiment
Single-Inference time (lower is better)
Using our approach, we are able to find models which are
faster (lower latency) than the original seed model, as well
as models with better TOP-1 accuracy. The best speedup
observed in practice on the Exynos 5422 is 1.85× versus the
seed model. Bar labels show TOP-1 accuracy.
One of the clearest trends in the data is that the choice
of 100 filters per convolutional unit in the expert-designed
seed architecture is excessive. None of the architectures found
during search have any unit with more than 50 filters.
Another trend is that the choice of a 3 × 3 filter for all but
the first unit in the network is suboptimal. In fact, only one of
the final set of Pareto optimal models uses this arrangement
(model 6). Looking at the most accurate model found (model
1), we see that a smaller filter size of 1 × 1 is found for the
third unit, while larger 5 × 5 filters are selected for the last
three units. While the choice of 100 filters per unit is clearly
excessive, it appears that the choice of a 3 × 3 filter size is
sometimes too large, and sometimes too small.
TABLE II: Pareto Optimal CNN architectures from Figure 4
Model   conv1      conv2     conv3     conv4     conv5     conv6     TOP-1   MFPops
seed    4x10, 100  3x3, 100  3x3, 100  3x3, 100  3x3, 100  3x3, 100  94.2%   581.1
kws1    3x3, 40    3x3, 30   1x1, 30   5x5, 50   5x5, 50   5x5, 50   95.1%   223.4
kws2    5x5, 40    3x3, 50   1x1, 30   5x5, 40   3x3, 50   5x5, 50   94.3%   167.7
kws3    5x5, 50    1x1, 30   5x5, 40   3x3, 20   5x5, 30   3x3, 50   94.1%    87.6
kws4    5x5, 50    3x3, 40   5x5, 20   1x1, 20   5x5, 30   3x3, 50   93.8%    87.2
kws5    5x5, 20    1x1, 40   5x5, 30   3x3, 20   5x5, 30   3x3, 30   93.8%    76.5
kws6    5x5, 20    3x3, 40   3x3, 40   3x3, 20   3x3, 40   3x3, 40   93.6%    65.2
kws7    3x3, 50    1x1, 30   3x3, 20   5x5, 20   3x3, 50   3x3, 40   93.6%    56.8
kws8    5x5, 50    1x1, 50   3x3, 20   3x3, 40   3x3, 30   3x3, 20   93.7%    46.3
kws9    5x5, 50    1x1, 20   1x1, 50   3x3, 20   5x5, 20   3x3, 40   93.4%    37.7
kws10   3x3, 40    1x1, 20   1x1, 20   3x3, 20   5x5, 20   3x3, 30   93.4%    26.3
kws11   5x5, 30    1x1, 20   1x1, 20   1x1, 20   3x3, 20   5x5, 20   91.9%    20.2
kws12   5x5, 50    1x1, 40   1x1, 50   1x1, 20   3x3, 20   3x3, 20   92.7%    17.2
[Figure 6: bar chart, "Accuracy vs Inference Time", finetuned models after 40K iterations. y-axis: Time (ms), 2 to 12. Series: Discovered Models, Seed Model. Bar labels: 0.919, 0.927, 0.934, 0.934, 0.935, 0.936, 0.937, 0.938, 0.938, 0.941, 0.943, 0.951, 0.942.]
Fig. 6: GTX 1080Ti Experiment
Single-Inference time (lower is better)
On the GTX 1080Ti, the spread of inference times is greatly
reduced versus the Exynos 5422, due to the high degree of
arithmetic throughput available on the GPU. The best speedup
observed here is 1.59× versus the seed model. Bar labels show
TOP-1 accuracy.
Additionally, the choice of a 4 × 10 filter in the first layer,
which evenly divides the dimensions of the input spectrogram,
seems to be totally unnecessary in our experiments: not a
single Pareto-optimal arrangement of the network uses this
filter shape. This expert-designed substructure may be useful
with certain MFCC generation settings, but it is not always
beneficial. The NAS approach helps us avoid such sub-optimal
architectures.
The smallest model found, model 12, uses a 5 × 5 filter
for the first convolutional unit, 1 × 1 filters for the next three
convolutional units, and 3 × 3 filters for the last two units.
Despite the enormous reduction in operation count versus the
seed model, this arrangement is only 1.5 percentage points less
accurate with the same training budget.
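To make the operation-count comparison concrete, the following is a minimal sketch of a simple cost model: multiply-accumulates per output position, K_h × K_w × C_in × C_out, summed over the convolutional units. The function name, the single input channel, the constant spatial dimensions, and the seed layout (six units of 100 filters, a 4 × 10 first filter and 3 × 3 filters elsewhere) are assumptions made for illustration; this is not necessarily the cost model used in our search.

```python
# Hypothetical cost model: multiply-accumulates (MACs) per output position.
# Spatial dimensions and strides are ignored (assumed equal across models).

def macs_per_position(units, in_channels=1):
    """Sum K_h * K_w * C_in * C_out over a chain of convolutional units."""
    total = 0
    c_in = in_channels
    for (kh, kw), c_out in units:
        total += kh * kw * c_in * c_out
        c_in = c_out  # output channels feed the next unit
    return total

# Model 12 (kws12): filter shape and filter count per unit, from the table.
kws12 = [((5, 5), 50), ((1, 1), 40), ((1, 1), 50),
         ((1, 1), 20), ((3, 3), 20), ((3, 3), 20)]

# Assumed seed layout: 4x10 first filter, 3x3 elsewhere, 100 filters per unit.
seed = [((4, 10), 100)] + [((3, 3), 100)] * 5

seed_cost = macs_per_position(seed)
kws12_cost = macs_per_position(kws12)
ratio = seed_cost / kws12_cost
print(seed_cost, kws12_cost, round(ratio, 2))  # 454000 13450 33.75
```

Under these assumptions the 3 × 3 units with 100 filters dominate the seed cost, which is why replacing them with narrower units and 1 × 1 filters shrinks the operation count so sharply.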
V. CONCLUSION AND FUTURE WORK
It is clear that the design of deep neural networks has
become a task for which pen and paper are no longer suited.
We have presented a computer-aided approach to the design
of deep networks, which extends typical neural architecture
search with a new objective modelling inference performance.
From an initial seed architecture designed by hand, we are
able to automatically discover variants which require as much
as 33.75× fewer operations, or are as much as 0.88 percentage
points more accurate, as well as the set of Pareto-optimal
models spanning these two extremes. As our evaluation shows,
the benefits translate into practical performance gains, from
resource-constrained embedded devices (1.85× speedup) right up
to high-performance GPU systems (1.59× speedup).
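As an illustration of what Pareto-optimal means for the two objectives discussed here, the sketch below filters a list of candidates down to those not dominated in both accuracy and latency. The candidate names and values are hypothetical and chosen only for illustration; they are not measurements from our experiments, and the helper names are not part of any library.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
    better = a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
    return no_worse and better

def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]

# Hypothetical candidates (illustrative values only).
candidates = [
    Candidate("a", 0.95, 10.0),
    Candidate("b", 0.94, 6.0),
    Candidate("c", 0.93, 7.0),  # dominated by b
    Candidate("d", 0.92, 3.0),
    Candidate("e", 0.90, 3.5),  # dominated by d
]

front = pareto_front(candidates)
print([c.name for c in front])  # ['a', 'b', 'd']
```

The quadratic scan is adequate for the small candidate sets a search of this kind returns; each remaining model represents a distinct accuracy and latency trade-off.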
Future Work
We have not evaluated the use of pruning or quantization
in conjunction with our approach. Pruning and quantization
apply to a fixed network architecture, and so are orthogonal
approaches to reducing model size and operation count.
As we have demonstrated, using a more appropriate neural
architecture can reduce operation count by up to 33.75×.
However, the resulting model can potentially still benefit
from quantization and pruning, and the benefit is cumulative,
since pruning and quantization would be applied to the models
discovered by the search. The evaluation of pruning and
quantization in conjunction with model search is a promising
avenue for future work.
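To illustrate why these techniques are orthogonal to the search, the sketch below applies magnitude pruning and symmetric 8-bit quantization to a fixed list of weights, as one might to a layer of an already discovered model. The weight values, the 50% sparsity target, and the function names are assumptions made for illustration; we have not evaluated this pipeline.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric uniform quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

# Hypothetical weights for one layer of a discovered model.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.0, -1.27]

pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
dequantized = [v * scale for v in q]  # approximate reconstruction

print(pruned)
print(q, scale)
```

Neither step changes the layer shapes chosen by the search, so any savings would apply on top of the reduction in operation count already obtained.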
Acknowledgments
This work was partly supported by Science Foundation Ire-
land with grants 12/IA/1381, 13/RC/2094 (the Irish Software
Research Centre www.lero.ie) and 13/RC/2106 (the Adapt
Research Centre www.adaptcentre.ie). The project has re-
ceived funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No
732204 (Bonseyes). This work is supported by the Swiss State
Secretariat for Education, Research and Innovation (SERI)
under contract number 16.0159.
The opinions expressed and arguments employed herein do not
necessarily reflect the official views of these funding bodies.
REFERENCES
[1] Andrew Anderson and David Gregg. Optimal DNN primitive selection
with partitioned boolean quadratic programming. In Jens Knoop,
Markus Schordan, Teresa Johnson, and Michael F. P. O’Boyle, editors,
Proceedings of the 2018 International Symposium on Code Generation
and Optimization, CGO 2018, Vösendorf / Vienna, Austria, February
24-28, 2018, pages 340–351. ACM, 2018.
[2] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl.
Algorithms for hyper-parameter optimization. In John Shawe-Taylor,
Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kil-
ian Q. Weinberger, editors, Advances in Neural Information Processing
Systems 24: 25th Annual Conference on Neural Information Processing
Systems 2011. Proceedings of a meeting held 12-14 December 2011,
Granada, Spain., pages 2546–2554, 2011.
[3] Michaela Blott, Thomas B. Preußer, Nicholas J. Fraser, Giulio Gam-
bardella, Kenneth O’Brien, Yaman Umuroglu, Miriam Leeser, and
Kees A. Vissers. Finn-R: An end-to-end deep-learning framework for
fast exploration of quantized neural networks. TRETS, 11(3):16:1–16:23,
2018.
[4] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston.
SMASH: one-shot model architecture search through hypernetworks.
CoRR, abs/1708.05344, 2017.
[5] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural
architecture search on target task and hardware. CoRR, abs/1812.00332,
2018.
[6] (Microsoft NNI contributors). An open source AutoML toolkit
for neural architecture search and hyper-parameter tuning.
(https://github.com/Microsoft/nni), 2019. [Online; accessed 18
March 2019].
[7] S. Davis and P. Mermelstein. Experiments in syllable-based recognition
of continuous speech. IEEE Trans. Acoust., Speech, Signal Processing,
28:357 – 366, Aug. 1980.
[8] Miguel de Prado, Maurizio Denna, Luca Benini, and Nuria Pazos.
QUENN: quantization engine for low-power neural networks. In
David R. Kaeli and Miquel Pericàs, editors, Proceedings of the 15th
ACM International Conference on Computing Frontiers, CF 2018,
Ischia, Italy, May 08-10, 2018, pages 36–44. ACM, 2018.
[9] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural Archi-
tecture Search: A Survey. arXiv e-prints, page arXiv:1808.05377, Aug
2018.
[10] David Ha, Andrew M. Dai, and Quoc V. Le. Hypernetworks. CoRR,
abs/1609.09106, 2016.
[11] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A.
Horowitz, and William J. Dally. EIE: efficient inference engine on com-
pressed deep neural network. In 43rd ACM/IEEE Annual International
Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea,
June 18-22, 2016, pages 243–254. IEEE Computer Society, 2016.
[12] Yanzhang He, Tara N. Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel
Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu,
Ruoming Pang, Qiao Liang, Deepti Bhatia, Yuan Shangguan, Bo Li,
Golan Pundak, Khe Chai Sim, Tom Bagby, Shuo-Yiin Chang, Kanishka
Rao, and Alexander Gruenstein. Streaming end-to-end speech recogni-
tion for mobile devices. CoRR, abs/1811.06621, 2018.
[13] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury.
Deep neural networks for acoustic modeling in speech recognition: The
shared views of four research groups. IEEE Signal Process. Mag.,
29(6):82–97, 2012.
[14] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song
Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and <1mb model size. CoRR,
abs/1602.07360, 2016.
[15] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014.
[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. In International Conference on Learning Representations
(ICLR), 2015.
[17] Alex Krizhevsky. One weird trick for parallelizing convolutional neural
networks. CoRR, abs/1404.5997, 2014.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[19] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E.
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
Andrew Rabinovich. Going deeper with convolutions. In IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2015,
Boston, MA, USA, June 7-12, 2015, pages 1–9. IEEE Computer Society,
2015.
[20] Aravind Vasudevan, Andrew Anderson, and David Gregg. Parallel multi
channel convolution using general matrix multiplication. In 28th IEEE
International Conference on Application-specific Systems, Architectures
and Processors, ASAP 2017, Seattle, WA, USA, July 10-12, 2017, pages
19–24. IEEE Computer Society, 2017.
[21] Pete Warden. Speech commands: A dataset for limited-vocabulary
speech recognition. CoRR, abs/1804.03209, 2018.
[22] Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch.
Weighted quantization-regularization in dnns for weight memory mini-
mization toward HW implementation. IEEE Trans. on CAD of Integrated
Circuits and Systems, 37(11):2929–2939, 2018.
[23] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Wider or
deeper: Revisiting the resnet model for visual recognition. CoRR,
abs/1611.10080, 2016.
[24] Yifan Yang, Qijing Huang, Bichen Wu, Tianjun Zhang, Liang Ma, Giulio
Gambardella, Michaela Blott, Luciano Lavagno, Kees A. Vissers, John
Wawrzynek, and Kurt Keutzer. Synetgy: Algorithm-hardware co-design
for convnet accelerators on embedded fpgas. In Kia Bazargan and
Stephen Neuendorffer, editors, Proceedings of the 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, FPGA
2019, Seaside, CA, USA, February 24-26, 2019, pages 23–32. ACM,
2019.
[25] Jiecao Yu, Andrew Lukefahr, David J. Palframan, Ganesh S. Dasika,
Reetuparna Das, and Scott A. Mahlke. Scalpel: Customizing DNN
pruning to the underlying hardware parallelism. In Proceedings of the
44th Annual International Symposium on Computer Architecture, ISCA
2017, Toronto, ON, Canada, June 24-28, 2017, pages 548–560. ACM,
2017.
[26] Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. Hello
edge: Keyword spotting on microcontrollers. CoRR, abs/1711.07128,
2017.
[27] Barret Zoph and Quoc V. Le. Neural architecture search with reinforce-
ment learning. CoRR, abs/1611.01578, 2016.
