SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained
  Microcontrollers by Fedorov, Igor et al.
arXiv:1905.12107v1 [cs.LG] 28 May 2019
SpArSe: Sparse Architecture Search for CNNs
on Resource-Constrained Microcontrollers
Igor Fedorov ∗1, Ryan P. Adams2, Matthew Mattina1, and Paul N.
Whatmough1
1ARM ML Research
2Princeton University
Abstract
The vast majority of processors in the world are actually microcon-
troller units (MCUs), which find widespread use performing simple con-
trol tasks in applications ranging from automobiles to medical devices and
office equipment. The Internet of Things (IoT) promises to inject machine
learning into many of these every-day objects via tiny, cheap MCUs. How-
ever, these resource-impoverished hardware platforms severely limit the
complexity of machine learning models that can be deployed. For example,
although convolutional neural networks (CNNs) achieve state-of-the-art
results on many visual recognition tasks, CNN inference on MCUs is
challenging due to severe memory limitations. To circumvent the
memory challenge associated with CNNs, various alternatives have been
proposed that do fit within the memory budget of an MCU, albeit at the
cost of prediction accuracy. This paper challenges the idea that CNNs are
not suitable for deployment on MCUs. We demonstrate that it is possi-
ble to automatically design CNNs which generalize well, while also being
small enough to fit onto memory-limited MCUs. Our Sparse Architecture
Search method combines neural architecture search with pruning in a sin-
gle, unified approach, which learns superior models on four popular IoT
datasets. The CNNs we find are more accurate and up to 4.35× smaller
than previous approaches, while meeting the strict MCU working memory
constraint.
∗Corresponding author: igor.fedorov@arm.com
1 Introduction
The microcontroller unit (MCU) is a truly ubiquitous computer. MCUs are
self-contained single-chip processors which are small (∼1 cm²), cheap (∼$1),
and power efficient (∼1 mW). Applications are extremely broad, but often
include seemingly banal tasks such as simple control and sequencing operations
for everyday devices like washing machines, microwave ovens, and telephones.
The key advantage of MCUs over application specific integrated circuits (ASICs)
is that they are programmed with software and can be readily updated to fix
bugs, change functionality, or add new features. The short time to market and
flexibility of software has led to the staggering popularity of MCUs. In the
developed world, a typical home is likely to have around four general-purpose
microprocessors. In contrast, the number of MCUs is around three dozen [18].
A typical mid-range car may have about 30 MCUs. The best public market
estimates suggest that around 50 billion MCU chips will ship in 2019 [2], which
far eclipses other computer chips like graphics processing units (GPUs), whose
shipments totalled roughly 100 million units in 2018 [4].
MCUs can be highly resource constrained; Table 1 compares MCUs with big-
ger processors. The broad proliferation of MCUs relative to desktop GPUs and
CPUs stems from the fact that they are orders of magnitude cheaper (∼ 600×)
and less power hungry (∼ 250,000×). In recent years, MCUs have been used to
inject intelligence and connectivity into everything from industrial monitoring
sensors to consumer devices, a trend commonly referred to as the Internet of
Things (IoT) [19, 30, 48]. Deploying machine learning (ML) models on MCUs
is a critical part of many IoT applications, enabling local autonomous intelli-
gence rather than relying on expensive and insecure communication with the
cloud [17]. In the context of supervised visual tasks, state-of-the-art (SOTA)
ML models typically take the form of convolutional neural networks (CNNs) [40].
While tools for deploying CNNs on MCUs have started to appear [15, 14, 9],
the CNNs themselves remain far too large for the memory-limited MCUs com-
monly used in IoT devices. In the remainder of this work, we use MCU to refer
specifically to IoT-sized MCUs, like the Micro:Bit. In contrast to this work, the
majority of preceding research on compute/memory efficient CNN inference has
targeted CPUs and GPUs [33, 20, 63, 64, 50, 58, 53].
To illustrate the challenge of deploying CNNs on MCUs, consider the seem-
ingly simple task of deploying the well-known LeNet CNN on an Arduino Uno [1]
to perform MNIST character recognition [43]. Assuming the weights can be
quantized to 8-bit integers, 420 KB of memory is required to store the model
parameters, which exceeds the Uno’s 32 KB of read-only (flash) memory. An
additional 177 KB of random access memory (RAM) is then required to store
the intermediate feature maps produced by LeNet, which far exceeds the Uno’s
2 KB RAM. The dispiriting implication is that it is not possible to perform
LeNet inference on the Uno. This has led many to conclude that CNNs should
be abandoned on constrained MCUs [41, 32]. Nevertheless, the sheer popularity
of MCUs coupled with the dearth of techniques for leveraging CNNs on MCUs
motivates our work, where we take a step towards bridging this gap.
Table 1: Processors for ML inference: estimated characteristics to indicate the
relative capabilities.

Processor                        Use case  Compute         Memory  Power   Cost
Nvidia 1080Ti GPU [3]            Desktop   10 TFLOPs/Sec   11 GB   250 W   $700
Intel i9-9900K CPU [6, 5]        Desktop   500 GFLOPs/Sec  256 GB  95 W    $499
Google Pixel 1 (Arm CPU) [10]    Mobile    50 GOPs/Sec     4 GB    ∼5 W    –
Raspberry Pi (Arm CPU) [11]      Hobbyist  50 GOPs/Sec     1 GB    1.5 W   –
Micro:Bit (Arm MCU) [8]          IoT       16 MOPs/Sec     16 KB   ∼1 mW   $1.75
Arduino Uno (Microchip MCU) [1]  IoT       4 MOPs/Sec      2 KB    ∼1 mW   $1.14

Deployment of CNNs on MCUs is challenging along multiple dimensions,
including power consumption and latency, but as the example above illustrates,
it is the hard memory constraints that most directly prohibit the use of these
networks. MCUs typically include two types of memory. The first is static
RAM, which is relatively fast, but volatile and small in capacity. RAM is used
to store intermediate data. The second is flash memory, which is non-volatile
and larger than RAM; it is typically used to store the program binary and any
constant data. Flash memory has very limited write endurance, and is therefore
treated as read-only memory (ROM). The two MCU memory types introduce
the following constraints on CNN model architecture:
C1 : The maximum size of intermediate feature maps cannot exceed the RAM
capacity.
C2 : The model parameters must not exceed the ROM (flash memory) capac-
ity.
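As a rough illustration of C1 and C2, the check below applies both constraints to the LeNet-on-Uno example from the introduction. It is only a sketch: the function name `fits_on_mcu` is hypothetical, kilobytes are assumed to be 1024 bytes, and, as in the text, the memory needed by the runtime and the program binary is ignored.

```python
KB = 1024  # assumption: binary kilobytes

def fits_on_mcu(model_bytes, peak_feature_map_bytes, flash_bytes, ram_bytes):
    """Return (C1 satisfied, C2 satisfied) for a candidate model."""
    c1 = peak_feature_map_bytes <= ram_bytes  # C1: feature maps fit in RAM
    c2 = model_bytes <= flash_bytes           # C2: parameters fit in flash (ROM)
    return c1, c2

# Arduino Uno capacities quoted in the introduction: 32 KB flash, 2 KB RAM.
UNO_FLASH, UNO_RAM = 32 * KB, 2 * KB

# LeNet with 8-bit weights: 420 KB of parameters, 177 KB of feature maps.
lenet_on_uno = fits_on_mcu(420 * KB, 177 * KB, UNO_FLASH, UNO_RAM)
print(lenet_on_uno)  # (False, False): both constraints are violated
```

A model passes only if both entries are true, which is the condition the search in this paper is designed to meet.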
To the best of our knowledge, there are currently no CNN architectures or
training procedures that produce CNNs satisfying these MCU memory con-
straints [41, 32]. This is true even ignoring the memory required for the runtime
(in RAM) and the program itself (in ROM). The severe memory constraints for
inference on MCUs have pushed research away from CNNs and toward simpler
classifiers based on decision trees and nearest neighbors [41, 32]. We
demonstrate for the first time that it is possible to design CNNs that are at
least as accurate as the models of Kumar et al. [41] and Gupta et al. [32] and
at the same time satisfy C1-C2, even for devices with just 2 KB of RAM. We
achieve this result by designing CNNs that are heavily specialized for
deployment on MCUs using a method we call Sparse Architecture Search (SpArSe).
The key insight from SpArSe is that combining neural architecture search (NAS) and network
pruning allows us to balance generalization performance against tight memory
constraints C1-C2. Critically, we enable SpArSe to search over pruning strate-
gies in conjunction with conventional hyperparameters around morphology and
training. Pruning enables SpArSe to quickly evaluate many sub-networks of
a given network, thereby expanding the scope of the overall search. While
previous NAS approaches have automated the discovery of performant models
with reduced parameterizations, we are the first to simultaneously consider per-
formance, parameter memory constraints, and inference-time working memory
constraints.
We use SpArSe to uncover SOTA models on four datasets, in terms of accu-
racy and model size, outperforming both pruning of popular architectures and
MCU-specific models [41, 32]. The multi-objective approach of SpArSe leads to
new insights in the design of memory-constrained architectures. Fig. 1a shows
an example of a discovered architecture which has high accuracy, small model
size, and fits within 2KB RAM. By contrast, we find that optimizing networks
solely to minimize the number of parameters (as is typically done in the NAS
literature, e.g., [23]) is not sufficient to identify networks that minimize RAM
usage. Fig. 1b illustrates one such example.

(a) Acc = 73.84%, MS = 1.31 KB, WM = 1.28 KB
  input
  MaxPool 1x1x3
  Conv2D 5x5x[11/50]: ModelSize [286/1300], WorkingMemory [1310/2324]
  MaxPool 2x2
  FC [732/9800]x[0/41]: ModelSize [0/401841], WorkingMemory [732/411641]
  Concatenate
  FC [253/9841]x2: ModelSize [508/19684], WorkingMemory [761/29525]

(b) Acc = 73.58%, MS = 0.61 KB, WM = 14.3 KB
  input
  Conv2D 3x3x9: ModelSize [94/243], WorkingMemory [3166/3315]
  SeparableConv2D 4x4x86: ModelSize [125/918], WorkingMemory [8172/8874]
  MaxPool 2x2
  DownsampledConv2D 1x1x17, 5x5x39: ModelSize [217/18037], WorkingMemory [14644/19448]
  MaxPool 2x2
  FC 624x391: ModelSize [77/243984], WorkingMemory [285/244608]
  FC 391x2: ModelSize [3/782], WorkingMemory [6/1173]

Figure 1: Model architectures found with best test accuracy on CIFAR10-binary,
while optimizing for (a) 2KB for both ModelSize (MS) and WorkingMemory
(WM), and (b) minimum MS. Each node in the graph is annotated with MS
and WM, and the values in square brackets show the quantities after and before
pruning, respectively. Optimizing for WM leads to a model that yields more
than 11.2x WM reduction. Note that pruning has a considerable impact on the
CNN.
2 Related work
CNNs designed for resource constrained inference have been widely published in
recent years [53, 35, 65], motivated by the goal of enabling inference on mobile
phone platforms. Advances include depth-wise separable layers [54], deployment-
centric pruning [64, 50], and quantization techniques [61]. More recently, NAS
has been leveraged to achieve even more efficient networks on mobile phone
platforms [20, 56].
Although mobile phones are more constrained than general-purpose CPUs
and GPUs, they still have many orders of magnitude more memory capacity
and compute performance than MCUs (Table 1). In contrast, little attention
has been paid to running CNNs on MCUs, which represent the most numerous
compute platform in the world. Kumar et al. [41] propose Bonsai, a pruned shal-
low decision tree with non-axis aligned decision boundaries. Gupta et al. [32]
propose a compressed k-nearest neighbors (kNN) approach (ProtoNN), where
model size is reduced by projecting data into a low-dimensional space, main-
taining a subset of prototypes to classify against, and pruning parameters. We
build upon Kumar et al. [41] and Gupta et al. [32] by targeting the same MCUs,
but using NAS to find CNNs which are at least as small and more accurate.
Algorithms for identifying performant CNN architectures have received sig-
nificant attention recently [66, 23, 20, 45, 31, 24, 44]. The approaches closest to
SpArSe are Stamoulis et al. [56] and Elsken et al. [23]. In Stamoulis et al. [56], the
authors optimize the kernel size and number of feature maps of the MBConv
layers in a MobileNetV2 backbone [53] by expressing each of the layer choices
as a pruned version of a superkernel. In some ways, Stamoulis et al. [56] is
less a NAS algorithm and more of a structured pruning approach, given that
the only allowed architectures are reductions of MobileNetV2. SpArSe does not
constrain architectures to be pruned versions of a baseline, which can be too
restrictive an assumption for ultra-small CNNs. SpArSe is not based on an
existing backbone, giving it greater flexibility to extend to different problems.
Like Elsken et al. [23], SpArSe uses a form of weight sharing called network mor-
phism [62] to search over architectures without training each one from scratch.
SpArSe extends the concept of morphisms to expedite training and pruning
CNNs. Because Elsken et al. [23] seek compact architectures by using the num-
ber of network edges as one of the objectives in the search, potential gains from
weight sparsity are ignored, which can be significant (Section 4.3 [27, 28]). More-
over, since SpArSe optimizes both the architecture and weight sparsity, Elsken
et al. [23] can be seen as a special case of SpArSe.
3 SpArSe framework: CNN design as
multi-objective optimization
Our approach to designing a small but performant CNN is to specify a multi-
objective optimization problem that balances the competing criteria. We denote
a point in the design space as Ω = {α, ϑ, ω, θ}, in which: α = {V,E} is a directed
acyclic graph describing the network connectivity, where V and E denote the
set of graph vertices and edges; ω denotes the network weights; ϑ represents
the operations performed at each edge, i.e. convolution, pooling, etc.; and θ
are hyperparameters governing the training process. The vertices vi, vj ∈ V
represent network neurons, which are connected to each other if (vi, vj) ∈ E
through an operation ϑij parameterized by ω. The competing objectives in the
present work of targeting constrained MCUs are:
\[ f_1(\Omega) = 1 - \mathrm{ValidationAccuracy}(\Omega) \tag{1} \]
\[ f_2(\Omega) = \mathrm{ModelSize}(\omega) \tag{2} \]
\[ f_3(\Omega) = \max_{l \in \{1,\dots,L\}} \mathrm{WorkingMemory}_l(\Omega) \tag{3} \]
where ValidationAccuracy(Ω) is the accuracy of the trained model on validation
data, ModelSize(ω), or MS, is the number of bits needed to store the model
parameters ω, and WorkingMemory_l(Ω) is the working memory in bits needed to
compute the output of layer l, with the maximum taken over the L layers to
account for in-place operations. We refer to (3) as the working memory
(WM) for Ω.
There is no single Ω which minimizes all of (1)-(3) simultaneously. For
instance, (1) prefers large networks with many non-zero weights whereas (2)
favors networks with no weights. Likewise, (3) prefers configurations with small
intermediate representations, whereas (2) has no preference as to the size of the
feature maps. Therefore, in the context of CNN design, it is more appropriate
to seek the set of Pareto optimal configurations, where Ω⋆ is Pareto optimal if
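\[ f_k(\Omega^\star) \le f_k(\Omega) \;\; \forall k, \Omega \quad \text{and} \quad \exists j : f_j(\Omega^\star) < f_j(\Omega) \;\; \forall \Omega \ne \Omega^\star . \]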
The concept of Pareto optimality is appealing for multi-objective optimization,
as it allows the ready identification of optimal designs subject to arbitrary con-
straints in a subset of the objectives.
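To make the role of the Pareto set concrete, the sketch below filters a handful of candidates down to the non-dominated ones and then applies a constraint on one objective. It uses the standard non-dominance test (no worse in every objective, strictly better in at least one), which is an assumption about how the definition above is meant to be applied; the candidate tuples (f1, f2, f3) are made-up values, not results from this paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (1 - accuracy, model size, working memory) triples.
candidates = [
    (0.02, 5000, 1800),
    (0.05, 1200, 1500),
    (0.05, 3000, 1900),  # dominated by the candidate above
    (0.30, 300, 900),
]
front = pareto_front(candidates)

# Pick the most accurate Pareto point whose model size is at most 2000.
feasible = [p for p in front if p[1] <= 2000]
best = min(feasible, key=lambda p: p[0])
```

Once the front is known, changing the constraint (for example, a different flash budget) only requires re-filtering `front`, not re-running the search.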
3.1 Search space
Our search space is designed to encompass CNNs of varying depth, width, and
connectivity. Each graph consists of optional input downsampling followed by
a variable number of blocks, where each block contains a variable number of
convolutional layers, each parametrized by its own kernel size, number of out-
put channels, convolution type, and padding. We consider regular, depthwise
separable, and downsampled convolutions, where we define a downsampled con-
volution to be a 1 × 1 convolution that downsamples the input in depth, fol-
lowed by a regular convolution. Each convolution is followed by optional batch-
normalization, ReLU, and spatial downsampling through pooling of a variable
window size. Each set of two consecutive convolutions has an optional residual
connection. Inspired by the decision tree approach in Kumar et al. [41], we let
the output layer use features at multiple scales by optionally routing the output
of each block to the output layer through a fully connected (FC) layer (see Fig.
1a). All of the FC layer outputs are merged before going through an FC layer
that generates the output. The search space also includes parameters governing
CNN training and pruning. The Appendix contains a complete description of
the search space.
3.2 Quantifying memory requirements
The ValidationAccuracy(Ω) metric is readily available for trained models
via a held-out validation set or by cross-validation. However, the memory con-
straints of interest in this work demand more careful specification. For simplicity,
we estimate the model size as
\[ \mathrm{ModelSize}(\omega) \approx \|\omega\|_0 . \tag{4} \]
For working memory, we consider two different models:
\[ \mathrm{WorkingMemory}^1_l(\Omega) \approx \|x_l\|_0 + \|\omega_l\|_0 \tag{5} \]
\[ \mathrm{WorkingMemory}^2_l(\Omega) \approx \|x_l\|_0 + \|y_l\|_0 \tag{6} \]
where x_l, y_l, and ω_l are the input, output, and weights for layer l, respectively.
The assumption in (5) is that the inputs to layer l and the weights need to reside
in RAM to compute the output, which is consistent with deployment tools like
[15] which allow layer outputs to be written to an SD card. The model in (6)
is also a standard RAM usage model, adopted in [16], for example. For merge
nodes that sum two vector inputs $x_l^1$ and $x_l^2$, we set
$x_l = [(x_l^1)^T \; (x_l^2)^T]^T$ in (5)-(6). The reliance of (4)-(6) on the
ℓ0 norm is motivated by our use of pruning to minimize the number of non-zeros
in both ω and $\{x_l\}_{l=1}^{L}$, which is also the
compression mechanism used in related work [41, 32]. Note that (4)-(6) are
reductive to varying degrees. However, since SpArSe is a black-box optimizer,
the measures in (4)-(6) can be readily updated as MCU deployment toolchains
mature.
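The estimates (4)-(6) reduce to counting non-zero entries, which the following NumPy sketch spells out for a toy two-layer dense network. The helper names (`l0`, `model_size`, `working_memory`) and the toy weights are illustrative assumptions; the counts are in numbers of values, which correspond to bytes under the one-byte quantization used in Section 4.2.

```python
import numpy as np

def l0(a):
    """Number of non-zero entries, the l0 'norm' used in (4)-(6)."""
    return int(np.count_nonzero(a))

def model_size(weights):
    """Eq. (4): non-zero weights summed over all layers."""
    return sum(l0(w) for w in weights)

def working_memory(inputs, weights, outputs, model=1):
    """Eq. (3): maximum over layers of the per-layer cost (5) or (6)."""
    per_layer = [
        l0(x) + (l0(w) if model == 1 else l0(y))
        for x, w, y in zip(inputs, weights, outputs)
    ]
    return max(per_layer)

# Toy network: 4 inputs -> 3 hidden units (ReLU) -> 2 outputs.
x0 = np.array([1.0, 0.0, 2.0, 3.0])
W0 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, -1.0],
               [1.0, 0.0, 0.0]])
y0 = np.maximum(x0 @ W0, 0.0)
W1 = np.array([[1.0, 1.0],
               [0.0, 0.0],
               [0.0, 1.0]])
y1 = np.maximum(y0 @ W1, 0.0)

ms = model_size([W0, W1])
wm1 = working_memory([x0, y0], [W0, W1], [y0, y1], model=1)
wm2 = working_memory([x0, y0], [W0, W1], [y0, y1], model=2)
```

Because only non-zeros are counted, any storage overhead of a sparse format (such as indices) is left out, in line with the remark above that these measures are reductive.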
3.3 Neural network pruning
Pruning [42, 49, 21, 47] is essential to MCU deployment using SpArSe, as it
heavily reduces the model size (4) and working memory (5)/(6) without signifi-
cantly impacting classification accuracy. Pruning is a procedure for zeroing out
network parameters ω and can be seen as a way to generate a new set of
parameters ω̄ that have lower ‖ω̄‖0. We consider both unstructured and structured,
or channel [34], pruning, where the difference is that the latter prunes away
entire groups of weights corresponding to output feature maps for convolution
layers and input neurons for FC layers. Both forms of pruning reduce ‖ω‖0 and,
consequently, (4)-(5). Structured pruning is critical for reducing (5)-(6) because
it provides a mechanism for reducing the size of layer inputs. We use Sparse
Variational Dropout (SpVD) [49] and Bayesian Compression (BC) [47] to realize
unstructured and structured pruning, respectively. Both approaches assume a
sparsity promoting prior on the weights and approximate the weight posterior
by a distribution parameterized by φ. See the Appendix for a description of
SpVD and BC. Notably, φ contains all of the information about the network
weight values as well as which weights to prune.
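The difference between the two forms of pruning can be seen in a small example. The sketch below uses simple magnitude thresholds as a stand-in; it is not SpVD or BC, which learn what to prune through a variational posterior. The function names and the threshold value are assumptions made for illustration.

```python
import numpy as np

def unstructured_prune(w, threshold):
    """Zero individual weights whose magnitude falls below the threshold."""
    return np.where(np.abs(w) >= threshold, w, 0.0)

def channel_prune(w, threshold):
    """Remove whole output channels (rows) whose l2 norm falls below the threshold.

    Returns the smaller weight array and the indices of the kept channels."""
    norms = np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)
    keep = np.flatnonzero(norms >= threshold)
    return w[keep], keep

# Toy layer with 3 output channels and 2 inputs per channel.
w = np.array([[0.90, 0.01],
              [0.02, 0.03],
              [0.50, -0.70]])

sparse_w = unstructured_prune(w, 0.1)     # same shape, fewer non-zeros
small_w, kept = channel_prune(w, 0.1)     # fewer output channels
```

Unstructured pruning leaves the layer with three outputs, so the next layer's input is unchanged; channel pruning leaves two, which shrinks the next layer's input and therefore the terms ‖x_l‖0 in (5)-(6).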
3.4 Multi-objective Bayesian optimization
SpArSe consists of three stages, where each stage m samples Tm configura-
tions. At iteration n, a new configuration Ωn is generated by the multi-objective
Bayesian optimizer (MOBO) with probability ρm and uniformly at random with
probability 1− ρm. We adopt the combination of model-based and entirely ran-
dom sampling from [26] to increase search space coverage. The optimizer con-
siders candidates which are morphs of previous configurations and returns both
the new and reference configurations (Section 3.5). The parameters of the new
architecture are then inherited from the reference before being retrained and
pruned.
SpArSe uses a MOBO based on the idea of random scalarizations [51]. The
MOBO approach is appealing as it builds flexible nonparametric models of the
unknown objectives and enables reasoning about uncertainty in the search for
the Pareto frontier. A scalarized objective is given by
\[ g\left(\{\lambda_k\}_{k=1}^{K}, \Omega\right) = \max_{k \in \{1,\dots,K\}} \lambda_k f_k(\Omega) \tag{7} \]
where λk is drawn randomly. Choosing the domain of the prior on λk allows
the user to specify preferences about the region of the Pareto frontier to explore.
For example, IoT practitioners may care about models with less than 1000
parameters. Since fk(Ω) is unknown in practice, it is modeled by a Gaussian
process [52] with a kernel κ (·, ·) that supports the types of variables included
in Ω, i.e., real-valued, discrete, categorical, and hierarchically related variables
[57, 29]. A new Ωn is sampled by minimizing (7) through Thompson sampling.
This MOBO yields better coverage of the Pareto frontier than the deterministic
scalarization methods used in [20, 56].
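The proposal step of this section can be sketched in a few lines. This is a simplification under stated assumptions: `predict` stands in for a draw from the Gaussian process posterior (here it just looks up made-up objective values), the weights λ are drawn uniformly from [0, 1), and the candidate set is a short list rather than a search space. All names are hypothetical.

```python
import random

def scalarize(lambdas, objectives):
    """Eq. (7): the largest weighted objective."""
    return max(lam * f for lam, f in zip(lambdas, objectives))

def propose(candidates, predict, num_objectives, rng):
    """Draw random weights and return the candidate minimizing (7)."""
    lambdas = [rng.random() for _ in range(num_objectives)]  # assumed prior
    return min(candidates, key=lambda c: scalarize(lambdas, predict(c)))

def next_configuration(candidates, predict, num_objectives, rho, rng):
    """Model-based proposal with probability rho, uniform random otherwise."""
    if rng.random() < rho:
        return propose(candidates, predict, num_objectives, rng)
    return rng.choice(candidates)

# Hypothetical, already-normalized objective values for two configurations.
table = {"a": (0.10, 0.20, 0.30), "b": (0.50, 0.60, 0.70)}
rng = random.Random(0)
choice = next_configuration(list(table), table.get, 3, rho=1.0, rng=rng)
```

Restricting the range from which each λ is drawn is how a preference for one region of the Pareto frontier would be expressed in this sketch.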
3.5 Network morphism
Evaluating each configuration Ωn from a random initialization is slow, as ev-
idenced by early NAS works which required thousands of GPU days [66, 67].
Search time can be reduced by constraining each proposal to be a morph of
a reference $\Omega_r \in \{\Omega_j\}_{j=0}^{n-1}$ [23]. Loosely speaking, we say that Ωn is a morph
of Ωr if most of the elements in Ωn are identical to those in Ωr. The advantage
of using morphism to generate Ωn is that most of φn can be inherited from φr,
where φr denotes the weight posterior parameters for configuration Ωr. Initial-
izing φn in this way means that Ωn inherits knowledge about the value and
pruning mask for most of its weights. Compared to running SpVD/BC from
scratch, morphisms enable pruning proposals using 2-8× fewer epochs, depend-
ing on the dataset. Further details on morphism are given in the Appendix,
including allowed morphs.
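The inheritance step can be pictured as copying every parameter block that the morph leaves unchanged. The sketch below is a minimal illustration under assumptions: parameters are plain NumPy arrays keyed by hypothetical layer names, whereas in SpArSe the inherited quantity is the posterior parameter set φ, and the set of allowed morphs is defined in the Appendix.

```python
import numpy as np

def inherit(reference, shapes, rng):
    """Build parameters for a morphed network from a reference network.

    reference: dict mapping layer name -> array from the reference configuration.
    shapes: dict mapping layer name -> shape in the new configuration.
    Layers with the same name and shape are copied; the rest are re-initialized."""
    params, reused = {}, []
    for name, shape in shapes.items():
        old = reference.get(name)
        if old is not None and old.shape == tuple(shape):
            params[name] = old.copy()
            reused.append(name)
        else:
            params[name] = rng.normal(0.0, 0.01, size=shape)
    return params, reused

reference = {"conv1": np.ones((3, 3)), "fc": np.ones((4, 2))}
new_shapes = {"conv1": (3, 3), "fc": (4, 5), "conv2": (2, 2)}  # fc widened, conv2 added
params, reused = inherit(reference, new_shapes, np.random.default_rng(0))
```

Only the changed or new layers start from scratch, which is why a morphed proposal needs fewer epochs of training and pruning than a randomly initialized one.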
Because our search space includes such a diversity of parameters, including
architectural parameters, pruning hyperparameters, etc., we find it helpful to
perform the search in stages, where each successive stage increasingly limits the
set of possible proposals. This coarse-to-fine search enables exploring decisions
at increasing granularity, to wit: Stage 1) A candidate configuration can be
generated by applying modifications to any of $\{\Omega_r\}_{r=1}^{n-1}$; Stage 2) The
allowable morphs are restricted to the pruning parameters; Stage 3) The reference
configurations are restricted to the Pareto optimal points.

[Figure 2 plots: four panels, each showing Accuracy against Number of parameters
(log scale) for SpArSe stages 1-3 and the baselines. (a) MNIST: ProtoNN, Bonsai,
LeNet+SpVD, KNN, RBF-SVM, GBDT. (b) CIFAR10-binary: Bonsai 2 kB, Bonsai 16 kB,
GBDT, RBF-SVM, LeNet+SpVD, ProtoNN, KNN. (c) CUReT: RBF-SVM, KNN, Bonsai,
ProtoNN, GBDT. (d) Chars4k: RBF-SVM, KNN, Bonsai, GBDT.]
Figure 2: SpArSe results from minimization of
(1 − ValidationAccuracy(Ω)), ModelSize(ω).
4 Results
We report results on a variety of datasets: MNIST (55e3, 5e3, 10e3) [43], CI-
FAR10 (45e3, 5e3, 10e3) [39], CUReT (3704, 500, 1408) [60], and Chars4k (3897, 500, 1886)
[25], corresponding to classification problems with 10, 10, 61, and 62 classes, re-
spectively, with the training/validation/test set sizes provided in parentheses.
To match the setup in [41], we also report on binary versions of these datasets,
meaning that the classes are split into two groups and re-labeled. The only
pre-processing we perform is mean subtraction and division by the standard de-
viation. Experiments were run on four NVIDIA RTX 2080 GPUs. We compare
against previous SOTA works: Bonsai [41], ProtoNN [32], Gradient Boosted
Decision Tree Ensemble Pruning [22], kNN, and radial basis function support
vector machine (SVM). We do not compare against previous NAS works be-
cause they have not addressed the memory-constrained classification problem
addressed here.
4.1 Models optimized for number of parameters
First, we address C2 by showing that SpArSe finds CNNs with higher accuracy
and fewer parameters than previously published methods. We use unstructured
pruning and optimize $\{f_k(\Omega)\}_{k=1}^{2}$. Fig. 2 shows the Pareto curves for SpArSe
and confirms that it finds smaller and more accurate models on all datasets.
For each competing method, we also report the SpArSe-obtained configuration
which attains the same or higher test accuracy and minimum number of parame-
ters, which we term the dominating configuration. Results are shown in Table 2.
To confirm that SpArSe learns non-trivial solutions, we compare with applying
SpVD pruning to LeNet in Fig. 2 and Table 2.
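The selection of a dominating configuration described above amounts to a filter followed by a minimum. A minimal sketch, assuming each SpArSe configuration is summarized as an (accuracy, number of parameters) pair; the function name and the numbers are illustrative, not values from Table 2.

```python
def dominating_configuration(configs, competitor_accuracy):
    """Smallest model whose accuracy is at least the competitor's, or None."""
    eligible = [c for c in configs if c[0] >= competitor_accuracy]
    return min(eligible, key=lambda c: c[1]) if eligible else None

# Hypothetical (test accuracy, number of non-zero parameters) pairs.
configs = [(0.950, 300), (0.972, 510), (0.990, 4000)]
chosen = dominating_configuration(configs, competitor_accuracy=0.970)
```

Returning `None` when no configuration reaches the competitor's accuracy keeps the comparison honest for methods that the search does not dominate.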
4.2 Models optimized for total memory footprint
Next, we demonstrate that SpArSe resolves C1-C2 by finding CNNs that con-
sume less device memory than Bonsai [41]. We use structured pruning and
optimize $\{f_k(\Omega)\}_{k=1}^{3}$. We quantize weights and activations to one byte to yield
realistic memory calculations and for fair comparison with Bonsai [13]. Table
3 compares SpArSe to Bonsai in terms of accuracy, MS, and WM under the
model in (5). For all datasets and metrics, SpArSe yields CNNs which outper-
form Bonsai. For MNIST, Bonsai reports performance on a binarized dataset,
whereas we use the original ten-class problem, i.e., we solve a significantly more
complex problem with fewer resources. Table 4 reports results for WM model
(6), showing that SpArSe outperforms Bonsai across all metrics on the MNIST,
CUReT, and Chars4k datasets, whereas Bonsai achieves higher accuracy on
CIFAR10. For validation, we use uTensor [15] to convert CNNs from SpArSe
into baremetal C++, which we compile using mbed-cli [7] and deploy on the
STM32 [12] MCU.
Table 2: Dominating configurations for the parameter minimization experiment
in Section 4.1. For each metric, the SpArSe model is listed first and the
competing method second. SpArSe finds CNNs that are more accurate and have
fewer parameters than competing methods. The amount of time spent obtaining
each dominating configuration is reported in GPU days (GPUD).

Method          Dataset         Acc (SpArSe / competitor)  ‖ω‖0 (SpArSe / competitor)  GPUD
Bonsai          MNIST           97.24 / 97.01              510 / 2.15e4                11
Bonsai          CIFAR10-binary  73.08 / 73.02              487 / 512                   1
Bonsai          CUReT           96.45 / 95.23              8.5e3 / 2.9e4               1
Bonsai          Chars4k         67.82 / 58.59              1.7e3 / 2.6e4               1
Bonsai (16 kB)  MNIST           –                          –                           –
Bonsai (16 kB)  CIFAR10-binary  76.66 / 76.64              1.4e3 / 4.1e3               9
Bonsai (16 kB)  CUReT           –                          –                           –
Bonsai (16 kB)  Chars4k         –                          –                           –
ProtoNN         MNIST           96.84 / 95.88              476 / 1.6e4                 11
ProtoNN         CIFAR10-binary  76.56 / 76.35              1.4e3 / 4.1e3               10
ProtoNN         CUReT           96.45 / 94.44              8.5e3 / 1.6e4               1
ProtoNN         Chars4k         –                          –                           –
GBDT            MNIST           98.78 / 97.90              804 / 1.5e6                 11
GBDT            CIFAR10-binary  77.90 / 77.19              1.6e3 / 4e5                 8
GBDT            CUReT           96.45 / 90.81              8.5e3 / 6.1e5               1
GBDT            Chars4k         67.82 / 43.34              1.7e3 / 2.5e6               1
kNN             MNIST           96.84 / 94.34              476 / 4.71e7                11
kNN             CIFAR10-binary  76.34 / 73.70              1.4e3 / 2e7                 10
kNN             CUReT           96.45 / 89.81              8.5e3 / 2.6e6               2
kNN             Chars4k         67.82 / 39.32              1.7e3 / 1.7e6               1
RBF-SVM         MNIST           97.42 / 97.30              569 / 1e7                   10
RBF-SVM         CIFAR10-binary  81.77 / 81.68              3.2e3 / 1.6e7               3
RBF-SVM         CUReT           97.58 / 97.43              2.2e4 / 2.3e6               2
RBF-SVM         Chars4k         67.82 / 48.04              1.7e3 / 2e6                 1
LeNet + SpVD    MNIST           99.16 / 99.10              1e3 / 1.8e3                 8
LeNet + SpVD    CIFAR10-binary  75.35 / 75.09              1.4e3 / 1.6e5               10
LeNet + SpVD    CUReT           –                          –                           –
LeNet + SpVD    Chars4k         –                          –                           –

Table 3: Comparison of Bonsai with SpArSe for WM model (5). Bonsai results
taken from Kumar et al. [41]. The first row shows the highest accuracy model for
WM ≤ 2KB and the second row shows the highest accuracy model for WM, MS
≤ 2KB. For MNIST, SpArSe is evaluated on the full ten-class dataset whereas
Bonsai reports on a reduced two-class problem. SpArSe finds models with fewer
parameters, less working memory, and higher accuracy in all cases. WM, MS
reported in KB.

        MNIST                     CIFAR10-binary            CUReT-binary              Chars4K-binary
Model   Acc     WM    MS    GPUD  Acc     WM    MS    GPUD  Acc     WM    MS    GPUD  Acc     WM    MS    GPUD
SpArSe  98.64   1.96  2.77  1     73.84   1.28  0.78  5     80.68   1.66  2.34  1     77.78   0.72  0.46  1
SpArSe  96.49   1.33  1.44  1     73.84   1.28  0.78  5     79.97   1.43  1.69  1     77.78   0.72  0.46  1
Bonsai  94.38∗  < 2   1.96  –     73.02   < 2   1.98  –     –       –     –     –     74.28   < 2   2     –

4.3 What SpArSe reveals about pruning
Pruning can be considered a form of NAS, where ω̄ represents a sub-network
of {α, ϑ, ω} given by {{V, Ep}, ϑ, ω}, and Ep ⊆ E contains only the edges for
which ω̄ is non-zero [27]. The question then becomes, should one look for Ep
directly or begin with a large edge-set E and prune it? There is conflicting
evidence whether the same validation accuracy can be achieved by both approaches
[27, 28, 46]. Importantly, previous NAS approaches have focused on searching
for Ep directly by using |E| as one of the optimization objectives [23]. On the
other hand, SpArSe is able to explore both strategies and learn the optimal
interaction between network graph α, operations ϑ, and pruning. Fig. 3a compares
SpArSe to SpArSe without pruning on MNIST. The results show that including
pruning as part of the optimization yields roughly an 80x reduction in number of
parameters, indicating that the formulation of SpArSe is better suited to
designing tiny CNNs compared to [23]. To gain more insight, we show scatter plots
of |E| versus ‖ω̄‖0 for the best-performing configurations on two datasets in
Fig. 3b-3c, revealing two important trends (see the Appendix for results on the
Chars4k and CUReT datasets). First, ‖ω̄‖0 tends to increase with increasing |E|
for |E| greater than some threshold ζ. This suggests that optimizing |E| can be
a proxy for optimizing ‖ω̄‖0 when targeting large networks. At the same time,
‖ω̄‖0 tends to decrease with increasing |E| for |E| < ζ, which has implications
for both NAS and pruning in the context of small CNNs. Fig. 3b-3c suggest
that |E| is not always indicative of weight sparsity, such that minimizing |E|
would actually lead to ignoring graphs with more edges but the same amount
of non-zero weights. Since CNNs with more edges contain more subgraphs, it is
possible that one of these subgraphs actually has better accuracy and the same
number of non-zero weights as the subgraphs of a graph with fewer edges. The
key is that pruning provides a mechanism for uncovering such high performing
subgraphs [27].
Table 4: SpArSe versus Bonsai for WM model (6). See Table 3 for details.

        MNIST                     CIFAR10-binary            CUReT-binary              Chars4K-binary
Model   Acc     WM    MS    GPUD  Acc     WM    MS    GPUD  Acc     WM    MS    GPUD  Acc     WM    MS    GPUD
SpArSe  97.03   1.38  15    1     70.41   0.91  1.98  18    73.22   1.9   0.14  2     76.83   0.39  20.12 1
SpArSe  95.76   0.62  1.76  2     70.41   0.91  1.98  18    73.22   1.9   0.14  2     74.87   1.64  0.16  3
Bonsai  94.38∗  < 2   1.96  –     73.02   < 2   1.98  –     –       –     –     –     74.71   < 2   2     –

[Figure 3 plots: (a) MNIST, Accuracy against Number of parameters (log scale) for
SpArSe and SpArSe w/o pruning. (b) MNIST and (c) CIFAR10-binary, ‖ω̄‖0 against
|E| on log axes.]
Figure 3: Fig. 3a shows the Pareto frontier of SpArSe with and without pruning,
where both experiments sample the same number (325) of configurations. Fig.
3b-3c show scatter plots of |E| versus ‖ω̄‖0 for the best performing configurations
from the experiment in Section 4.1. Fig. 3b: MNIST networks with > 95%
accuracy. Fig. 3c: CIFAR10-binary networks with > 70% accuracy.
5 Conclusion
Although MCUs are the most widely deployed computing platform, they have
been largely ignored by ML researchers. This paper makes the case for targeting
MCUs for deployment of ML, enabling future IoT products and use cases. We
demonstrate that, contrary to previous assertions, it is in fact possible to design
CNNs for MCUs with as little as 2KB RAM. SpArSe optimizes CNNs for the
multiple constraints of MCU hardware platforms, finding models that are both
smaller and more accurate than previous SOTA non-CNN models across a range
of standard datasets.
References
[1] Arduino Uno Hardware Specification, Wikipedia Article. URL
https://en.wikipedia.org/wiki/Arduino_Uno. Accessed: 2019-05-02.
[2] The shape of the MCU market. URL
https://www.embedded.com/electronics-blogs/break-points/4441588/The-shape-of-the-MCU-market.
Accessed: 2019-05-02.
[3] GeForce 10 Series Hardware Specification, Wikipedia Article. URL
https://en.wikipedia.org/wiki/GeForce_10_series. Accessed: 2019-05-02.
[4] Global shipments of discrete graphics processing
units from 2015 to 2018 (in million units). URL
https://www.statista.com/statistics/865846/worldwide-discrete-gpus-shipment/.
Accessed: 2019-05-23.
[5] Numerical Computing Performance of Intel 8-core CPUs. URL
https://www.pugetsystems.com/labs/hpc/Numerical-Computing-Performance-of-3-Intel-8-core-CPUs-i9---9900K-vs-i7-9800X-vs-Xeon-2145W-1339/.
Accessed: 2019-05-02.
[6] List of Intel Core i9 Microprocessors, Wikipedia Article. URL
https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors.
Accessed: 2019-05-02.
[7] Arm mbed-cli. URL https://github.com/ARMmbed/mbed-cli. Accessed:
2019-05-02.
[8] Micro Bit Hardware Specification, Wikipedia Article. URL
https://en.wikipedia.org/wiki/Micro_Bit. Accessed: 2019-05-02.
[9] Microsoft Embedded Learning Library. URL
https://microsoft.github.io/ELL/. Accessed: 2019-05-02.
[10] Pixel (Smartphone) Hardware Specification, Wikipedia Article. URL
https://en.wikipedia.org/wiki/Pixel_(smartphone). Accessed: 2019-05-02.
[11] Raspberry Pi Hardware Specification, Wikipedia Article. URL
https://en.wikipedia.org/wiki/Raspberry_Pi. Accessed: 2019-05-02.
[12] STM32 Hardware Specification, Wikipedia. URL
https://en.wikipedia.org/wiki/STM32. Accessed: 2019-05-02.
[13] TensorFlow Quantization-Aware Training. URL
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize.
Accessed: 2019-05-02.
[14] TensorFlow Lite for Microcontrollers. URL
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro.
Accessed: 2019-05-02.
[15] uTensor. URL http://utensor.ai/. Accessed: 2019-05-02.
[16] Visual Wake Words Challenge, CVPR 2019. URL
https://docs.google.com/document/u/2/d/e/2PACX-1vStp3uPhxJB0YTwL4T__Q5xjclmrj6KRs55xtMJrCyi82GoyHDp2X0KdhoYcyjEzKe4v75WBqPObdkP/pub.
Accessed: 2019-05-02.
[17] Why the Future of Machine Learning is Tiny. URL
https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/.
Accessed: 2019-05-02.
[18] Microcontroller, Wikipedia Article. URL
https://en.wikipedia.org/wiki/Microcontroller. Accessed: 2019-05-02.
[19] Luigi Atzori, Antonio Iera, and Giacomo Morabito. The internet of things:
A survey. Computer Networks, 54(15):2787–2805, 2010. ISSN 1389-1286.
doi: https://doi.org/10.1016/j.comnet.2010.05.010.
[20] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural
architecture search on target task and hardware. In International
Conference on Learning Representations, 2019. URL
https://openreview.net/forum?id=HylVB3AqYm.
[21] Miguel Á. Carreira-Perpiñán and Yerlan Idelbayev. “Learning-
Compression” Algorithms for Neural Net Pruning. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
June 2018.
[22] O Dekel, C Jacobbs, and L Xiao. Pruning decision forests. Personal Com-
munications, 2016.
[23] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-
objective neural architecture search via lamarckian evolution. In Interna-
tional Conference on Learning Representations, 2019.
[24] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architec-
ture search: A survey. Journal of Machine Learning Research, 20(55):1–21,
2019.
[25] Teófilo Emídio de Campos, Bodla Rakesh Babu, and Manik Varma. Char-
acter recognition in natural images. volume 2, pages 273–280, 01 2009.
[26] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: robust and efficient
hyperparameter optimization at scale. In Proceedings of the 35th Interna-
tional Conference on Machine Learning, ICML 2018, Stockholmsmässan,
Stockholm, Sweden, July 10-15, 2018, pages 1436–1445, 2018.
[27] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Find-
ing sparse, trainable neural networks. In International Conference on
Learning Representations, 2019.
[28] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity
in deep neural networks. CoRR, abs/1902.09574, 2019. URL
http://arxiv.org/abs/1902.09574.
[29] Eduardo C Garrido-Merchán and Daniel Hernández-Lobato. Dealing with
categorical and integer-valued variables in bayesian optimization with gaus-
sian processes. arXiv preprint arXiv:1805.03463, 2018.
[30] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu
Palaniswami. Internet of things (iot): A vision, architectural elements, and
future directions. Future generation computer systems, 29(7):1645–1660,
2013.
[31] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen
Wei, and Jian Sun. Single path one-shot neural architecture search with
uniform sampling. arXiv preprint arXiv:1904.00420, 2019.
[32] Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri,
Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa,
Manik Varma, and Prateek Jain. Protonn: Compressed and accurate knn
for resource-scarce devices. In Proceedings of the 34th International Confer-
ence on Machine Learning-Volume 70, pages 1331–1340. JMLR. org, 2017.
[33] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A
Horowitz, and William J Dally. EIE: efficient inference engine on com-
pressed deep neural network. In 2016 ACM/IEEE 43rd Annual Interna-
tional Symposium on Computer Architecture (ISCA), pages 243–254. IEEE,
2016.
[34] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerat-
ing very deep neural networks. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1389–1397, 2017.
[35] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko,
Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig
Adam. Mobilenets: Efficient convolutional neural networks for
mobile vision applications. CoRR, abs/1704.04861, 2017. URL
http://arxiv.org/abs/1704.04861.
[36] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and
Lawrence K. Saul. An introduction to variational methods for graphical
models. Machine Learning, 37(2):183–233, Nov 1999. ISSN 1573-0565. doi:
10.1023/A:1007665907178.
[37] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In
The International Conference on Learning Representations, 2014.
[38] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout
and the local reparameterization trick. In Advances in Neural Information
Processing Systems, pages 2575–2583, 2015.
[39] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features
from tiny images. Technical report, Citeseer, 2009.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi-
fication with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
[41] Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient ma-
chine learning in 2 kb ram for the internet of things. In Proceedings of
the 34th International Conference on Machine Learning-Volume 70, pages
1935–1944. JMLR. org, 2017.
[42] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In
Advances in neural information processing systems, pages 598–605, 1990.
[43] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.
[44] Marius Lindauer. Literature on Neural Architecture
Search at AutoML.org at Freiburg. URL
https://www.automl.org/automl/literature-on-neural-architecture-search/.
Accessed: 2019-05-02.
[45] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable
architecture search. In International Conference on Learning Representa-
tions, 2019.
[46] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell.
Rethinking the value of network pruning. In International Conference on
Learning Representations, 2019.
[47] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression
for deep learning. In Advances in Neural Information Processing Systems,
pages 3288–3298, 2017.
[48] Francois Meunier, Adam Wood, Keith Weiss, Katy Huberty, Simon Flannery,
Joseph Moore, Craig Hettenbach, and Bill Lu. The ‘Internet of Things’
is now. Morgan Stanley Research, pages 1–96, 2014.
[49] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational
dropout sparsifies deep neural networks. In Proceedings of the 34th In-
ternational Conference on Machine Learning-Volume 70, pages 2498–2507,
2017.
[50] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and
Jan Kautz. Pruning convolutional neural networks for resource
efficient transfer learning. CoRR, abs/1611.06440, 2016. URL
http://arxiv.org/abs/1611.06440.
[51] Biswajit Paria, Kirthevasan Kandasamy, and Barnabás Póczos. A flexi-
ble multi-objective bayesian optimization approach using random scalar-
izations. arXiv preprint arXiv:1805.12168, 2018.
[52] Carl Edward Rasmussen. Gaussian processes in machine learning. In Sum-
mer School on Machine Learning, pages 63–71. Springer, 2003.
[53] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 4510–4520, 2018.
[54] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for image clas-
sification. Ph.D. Thesis, 1:3, 2014.
[55] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby,
and Ole Winther. How to train deep variational autoencoders and prob-
abilistic ladder networks. In 33rd International Conference on Machine
Learning (ICML 2016), 2016.
[56] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos,
Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path nas: De-
signing hardware-efficient convnets in less than 4 hours. In arXiv preprint
arXiv:1904.02877, 2019.
[57] Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and
Michael A Osborne. Raiders of the lost architecture: Kernels for
bayesian optimization in conditional parameter spaces. arXiv preprint
arXiv:1409.4011, 2014.
[58] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V
Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv
preprint arXiv:1807.11626, 2018.
[59] Michael E Tipping. Sparse bayesian learning and the relevance vector ma-
chine. Journal of machine learning research (JMLR), 1(Jun):211–244, 2001.
[60] Manik Varma and Andrew Zisserman. A statistical approach to texture
classification from single images. International journal of computer vision,
62(1-2):61–81, 2005.
[61] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: hardware-
aware automated quantization. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2019.
[62] Tao Wei, Changhu Wang, Yong Rui, and Chang Wen Chen. Network
morphism. In International Conference on Machine Learning, pages 564–
572, 2016.
[63] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient
convolutional neural networks using energy-aware pruning. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 5687–5695, 2017.
[64] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark San-
dler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural
network adaptation for mobile applications. In The European Conference
on Computer Vision (ECCV), September 2018.
[65] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An
extremely efficient convolutional neural network for mobile devices. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 6848–6856, 2018.
[66] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement
learning. arXiv preprint arXiv:1611.01578, 2016.
[67] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning
transferable architectures for scalable image recognition. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages
8697–8710, 2018.
Appendix A Pruning algorithm details
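Pruning can be expressed as
\[
\bar{\omega} = \operatorname*{arg\,min}_{\left(\sum_{G \in \mathcal{G}} \mathbb{1}\left[\|\omega_G\|_2 > 0\right]\right) \le s} \mathcal{L}\left(\{\alpha, \vartheta, \omega\}\right) \qquad (8)
\]
where $\mathcal{L}(\cdot)$ denotes the loss function for the appropriate task, e.g.
cross-entropy for classification, $\mathcal{G}$ denotes the set of disjoint groups
covering the indices of each entry in $\omega$, $\omega_G$ denotes a particular group
of weights, $\mathbb{1}[\cdot]$ denotes the indicator function, and $s$ bounds the
number of groups that contain nonzero weights. When $|G| = 1 \;\forall G \in \mathcal{G}$,
(8) is referred to as unstructured pruning. On the other hand, structured pruning
arises when $\mathcal{G}$ is chosen to group related elements of $\omega$, i.e. the
weights corresponding to a given feature map.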
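As a minimal sketch of the constraint in (8), the snippet below counts the groups whose weights have nonzero $\ell_2$ norm, once with singleton groups (unstructured) and once with larger groups (structured). The weight values, the grouping, and the helper name `count_active_groups` are illustrative assumptions, not part of the original method.

```python
import numpy as np

def count_active_groups(weights, groups):
    """Number of groups G with ||w_G||_2 > 0, i.e. the left-hand side of the constraint in (8)."""
    return sum(1 for idx in groups if np.linalg.norm(weights[idx]) > 0)

# Toy flattened weight vector (hypothetical values).
w = np.array([0.0, 0.3, 0.0, 0.0, -1.2, 0.7])

# Unstructured pruning: every group holds exactly one weight index.
unstructured = [[i] for i in range(w.size)]
# Structured pruning: indices grouped together, e.g. by feature map.
structured = [[0, 1], [2, 3], [4, 5]]

n_unstructured = count_active_groups(w, unstructured)
n_structured = count_active_groups(w, structured)
print(n_unstructured, n_structured)
```

A pruned solution is feasible for (8) whenever the returned count does not exceed $s$ for the chosen grouping.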
An alternative to (8) is to cast pruning as Bayesian inference with priors
that promote sparse solutions [59]. One such algorithm for unstructured pruning
is sparse variational dropout (SpVD) [49]. The prior over $\omega$ is assumed
to factor over the elements of $\omega$, with $p(|\omega_{ij}|) \propto |\omega_{ij}|^{-1}$.
Given a dataset $\mathcal{D}$, the goal of Bayesian inference is to then compute the
posterior $p(\omega|\mathcal{D})$. SpVD employs variational inference (VI) [36] to
approximate the posterior by a parametrized distribution $q_\phi(\omega)$, whose
parameters $\phi$ are chosen to minimize $D_{KL}(q_\phi(\omega) \,\|\, p(\omega|\mathcal{D}))$.
The distribution $q_\phi(\omega)$ is assumed to factor over the elements of $\omega$ and
$q_\phi(\omega_{ij}) = \mathcal{N}(\mu_{ij}, \beta_{ij}\mu_{ij}^2)$, where $\phi = \{\mu, \beta\}$.
Techniques for scalable VI are employed to estimate $\phi$ [37, 38]. Upon convergence,
the estimate of $\bar{\omega}_{ij}$ becomes $\bar{\omega}_{ij} = \mu_{ij} \odot \mathbb{1}[\beta_{ij} \le \tau_l]$,
where $\tau_l$ is a layer-specific threshold and $\omega_{ij}$ resides in network layer $l$.
Note that $\phi$ contains all of the information about
both the network weight values as well as which weights can be masked to 0.
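The final masking step can be sketched in a few lines: given the variational parameters of one layer, keep each mean whose variance multiplier is at or below the layer threshold. The array values and the function name `spvd_prune` are hypothetical, and the sketch covers only the thresholding, not the variational training that produces $\mu$ and $\beta$.

```python
import numpy as np

def spvd_prune(mu, beta, tau):
    """Elementwise mask: keep mu where beta <= tau, zero it elsewhere."""
    return mu * (beta <= tau)

# Hypothetical posterior parameters for a single 2x2 layer.
mu = np.array([[0.8, -0.05], [0.01, 1.5]])
beta = np.array([[0.1, 40.0], [25.0, 0.3]])
tau = 3.0  # layer-specific threshold (assumed value)

w_bar = spvd_prune(mu, beta, tau)
nnz = int(np.count_nonzero(w_bar))
print(w_bar, nnz)
```

Because the mask depends only on $\beta$ and $\tau_l$, changing the threshold re-prunes a trained layer without retraining it.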
One of the side-effects of the choice of prior in SpVD is that the VI objective
decomposes into a sum of a data-dependent term and a term which only depends
on the prior, leading to the interpretation of VI as regularized training.
Although there is no constant in front of the prior term, it can be beneficial to
scale it by $\gamma$. Depending on the dataset, Molchanov et al. [49] keep $\gamma$ at 0 for
$N_1$ epochs, which is referred to as the pretraining phase, and then increase $\gamma$ to
$\gamma_{N_2}$ over $N_2$ epochs [55]. We include $\{\tau_l\}_{l=1}^{L}$, $N_1$, $N_2$, and
$\gamma_{N_2}$ in $\Omega$.
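The schedule for $\gamma$ can be illustrated as follows. The text only says that $\gamma$ is held at 0 and then increased; the linear ramp used here is an assumption, as are the function name `gamma_schedule` and the example values.

```python
def gamma_schedule(epoch, n1, n2, gamma_final):
    """Coefficient of the prior term at a given (0-indexed) epoch.

    gamma is 0 for the first n1 epochs (pretraining phase), then rises
    linearly (assumed shape) to gamma_final over the next n2 epochs.
    """
    if epoch < n1:
        return 0.0
    ramp = min(1.0, (epoch - n1 + 1) / n2)
    return gamma_final * ramp

# Example: 5 pretraining epochs, 10 annealing epochs, final coefficient 0.5.
schedule = [gamma_schedule(e, n1=5, n2=10, gamma_final=0.5) for e in range(20)]
print(schedule)
```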
The structured pruning extension of SpVD is called Bayesian Compression
(BC) [47], which assumes a hierarchical prior on $\omega$ that ties weights in the same
group to each other: $\omega | z \sim \prod_{G \in \mathcal{G}} \prod_{(ij) \in G} \mathcal{N}(\omega_{ij}; 0, z_G^2)$.
Inference for this prior
proceeds in much the same way as SpVD and, upon convergence, entire groups
of weights can be pruned away.
Appendix B Search space details
The search space considered in this work is described in Table 5.
Table 5: Search space details. For discrete variables, ranges are listed in the format
[lowerbound:increment:upperbound].

| Name | Range | Description |
|---|---|---|
| downsample-input-in-depth | True/False | If True, max pool the input across the 3rd dimension |
| downsample-input | True/False | If True, max pool the input in spatial dimensions |
| input-downsampling-rate | [2:1:4] | Active only if downsample-input=True. The amount by which to downsample the input. |
| zero-regularization-epochs | [5:1:30] | Number of epochs for which VI inference is performed before the effect of the sparsity promoting prior is introduced. |
| annealing-epochs | [15:1:25] | Only active if pretraining=False. Number of epochs over which the coefficient in front of the regularization term in the VI objective is annealed from 0 to its final value |
| α | [1e-2:1e-2:1] | Final value for the coefficient of the regularization term in the VI objective |
| pretraining | True/False | Only active if pretraining=False. If True, pretrain the CNN before pruning |
| batch-norm | True/False | Only used for random weight pruning experiments. If True, apply batch-normalization to the output of each layer |
| num-conv-blocks | [1:1:2] | Number of convolution blocks in the CNN, where each block consists of a series of convolutional layers. The output of each block is downsampled through max pooling |
| num-fc-layers | [0:1:1] | Number of FC layers in the main branch following the convolution blocks |
| pruning-thresholds-block-k-layer-l | [-6:1e-1:3] | Thresholds for pruning weights in block k layer l |
| total-fc-layer-weights | [1:1:800]e3 | Number of weights in the FC layers comprising the main, left, and right branches |
| weight-fraction-main-branch | [0:1] | Fraction of total-fc-layer-weights that go into the FC layer in the main branch |
| num-conv-layers-block-k | [1:1:3] | Number of convolutional layers in block k |
| layer-type-block-k-layer-l | [Conv2D, DownsampledConv2D, SeparableConv2D] | Layer type for convolutional block k layer l |
| kernel-size-block-k-layer-l | [2:1:5] | Convolutional kernel size of block k layer l |
| num-filters-block-k-layer-l | [1:1:100] | Number of output feature maps for block k layer l |
| downsample-block-k-layer-l | [0:0.5] | Active only if layer-type-block-k-layer-l=DownsampledConv2D. If True, the input feature maps are first passed through a 1×1×(downsample-block-k-layer-l × num-filters-block-k-layer-(l−1)) convolutional layer |
| left-branch | True/False | If True, a branch is added to the feed-forward architecture. The branch takes the output of the first convolution block, sends it to an FC layer, sends the result to a merge operation, whose output is sent to a final FC layer |
| right-branch | True/False | If True, a branch is added to the feed-forward architecture. The branch takes the input to the first convolution block, sends it to an FC layer, sends the result to a merge operation, whose output is sent to a final FC layer |
| weight-fraction-left-branch | [0.01:1] | Active only if left-branch=True. Fraction of total-fc-layer-weights that go into the left branch FC layer |
| weight-fraction-right-branch | [0.01:1] | Active only if right-branch=True. Fraction of total-fc-layer-weights that go into the right branch FC layer |
| merge-type | Sum/Concatenate | Active only if at least one of left-branch or right-branch is True. How the main, left, and right branches are to be combined |
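To make the conditional structure of the search space concrete, the sketch below draws one configuration from a small subset of the parameters in Table 5, adding dependent parameters only when their parent is active. It uses plain uniform sampling for illustration, not the search procedure of the paper; the helper names `grid` and `sample_configuration` are hypothetical.

```python
import random

def grid(lower, step, upper):
    """Expand an integer range written as [lower:increment:upper] into a list."""
    return list(range(lower, upper + 1, step))

def sample_configuration(rng):
    """Draw one configuration from a subset of Table 5 (uniform sampling assumed)."""
    cfg = {
        "num-conv-blocks": rng.choice(grid(1, 1, 2)),
        "num-fc-layers": rng.choice(grid(0, 1, 1)),
        "downsample-input": rng.choice([True, False]),
        "left-branch": rng.choice([True, False]),
    }
    # Conditional parameters exist only when their parent is active.
    if cfg["downsample-input"]:
        cfg["input-downsampling-rate"] = rng.choice(grid(2, 1, 4))
    if cfg["left-branch"]:
        cfg["weight-fraction-left-branch"] = rng.uniform(0.01, 1.0)
    # Per-block and per-layer parameters depend on the sampled depth.
    for k in range(1, cfg["num-conv-blocks"] + 1):
        n_layers = rng.choice(grid(1, 1, 3))
        cfg[f"num-conv-layers-block-{k}"] = n_layers
        for l in range(1, n_layers + 1):
            cfg[f"kernel-size-block-{k}-layer-{l}"] = rng.choice(grid(2, 1, 5))
            cfg[f"num-filters-block-{k}-layer-{l}"] = rng.choice(grid(1, 1, 100))
    return cfg

cfg = sample_configuration(random.Random(0))
print(cfg)
```

Storing only the active parameters keeps each configuration unambiguous: two dictionaries differ exactly when the architectures they describe differ.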
Appendix C Morphism details
In the present work, a configuration $\Omega_n$ is considered a morph of $\Omega_r$ if $\Omega_n$
is generated by applying one or more of the operations listed in Table 6 to $\Omega_r$.
These morphs are used to generate random samples for the Thompson sampling
step in Section 3.4. Each sample $\Omega_n$ is generated by randomly choosing one
or more of the morphs from Table 6 and applying them to a randomly chosen
$\Omega_r \in \{\Omega_r\}_{r=1}^{n-1}$. This procedure ensures that each configuration
proposal is relatively close to a reference configuration. We then exploit the fact
that $\Omega_n$ is closely related to $\Omega_r$ during the pruning process by letting
$\phi_n$ inherit information from $\phi_r$, where $\phi_n$ denotes the parameters of the
approximated weight posterior for configuration $\Omega_n$. The inheritance process
proceeds by first checking for identical nodes between $\Omega_r$ and $\Omega_n$ and then
copying the corresponding elements of $\phi_r$ into $\phi_n$ for those nodes. The nodes
which participate in this step are the nodes which were not influenced during the
morphing process. For the remaining nodes, if corresponding nodes in $\Omega_n$ and
$\Omega_r$ have the same operation type, we copy as many of the corresponding elements
of $\phi_r$ into $\phi_n$ as possible. For example, if the first layer of $\Omega_n$ is a
$3 \times 3 \times 50$ convolution and the first layer of $\Omega_r$ is a $3 \times 3 \times 30$
convolution, we copy the elements of $\phi_r$ corresponding to the first convolution
layer into the elements of $\phi_n$ corresponding to the first 30 feature maps of the
first convolution layer. Upon completion of the inheritance process, most of the
elements of $\phi_n$ are inherited from $\phi_r$, and the remaining elements are learned
from the training data. Unlike Elsken et al. [23], we do not restrict the training
process to just the elements of $\phi_n$ which were not inherited, but instead update
all of the elements of $\phi_n$ during learning.
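The partial copy in the 30-to-50 feature map example can be sketched with array slicing. The tensor layout (kernel height, kernel width, input channels, output feature maps), the single input channel, and the helper name `inherit` are assumptions for illustration; only the posterior means are shown, and the same copy would apply to the other variational parameters.

```python
import numpy as np

def inherit(phi_r, phi_n):
    """Copy the overlapping leading block of phi_r into a copy of phi_n."""
    out = phi_n.copy()
    if phi_r.ndim != phi_n.ndim:
        return out  # incompatible shapes: nothing is inherited
    overlap = tuple(slice(0, min(a, b)) for a, b in zip(phi_r.shape, phi_n.shape))
    out[overlap] = phi_r[overlap]
    return out

rng = np.random.default_rng(0)
mu_r = rng.normal(size=(3, 3, 1, 30))  # reference: 3x3 convolution, 30 feature maps
mu_n = np.zeros((3, 3, 1, 50))         # morph: 3x3 convolution, 50 feature maps

mu_n = inherit(mu_r, mu_n)
print(mu_n.shape)
```

After the copy, the first 30 feature maps of the morph carry the reference values while the remaining 20 keep their initialization; all 50 are then updated during training.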
Appendix D Visualization of discovered CNNs
Fig. 6 shows the architectures which dominated the competing methods in
Table 2.
Appendix E Extended results on interaction of
pruning and architecture
Fig. 4 shows the interaction of pruning with architecture for the Chars4k and
CUReT dataset experiments in Section 4.1. MNIST and CIFAR10-binary are
given in Fig. 3.
Appendix F Evolution of winning CNNs
The evolution of the CNN architectures which ended up dominating the com-
peting methods on the MNIST dataset in Table 2 is shown in Fig. 5.
Appendix G Visualization of winning CNNs
Fig. 6 shows the architectures which dominate the competing methods on
the task of classifying MNIST with the minimum number of parameters (i.e.
Section 4.1).
Table 6: Allowable morphs. Random sampling is always performed under a
uniform distribution.

| Morph | Description |
|---|---|
| num-fc-layers | Change the number of FC layers in the main branch by ±1 |
| num-conv-blocks | Change the number of convolution blocks by ±1. If the number of convolution blocks is increased, set the number of convolution layers in the new block to 1 |
| layer-type | Change the layer type for a randomly chosen convolution layer |
| num-conv-filters | Change the number of output feature maps in a randomly chosen convolution layer |
| kernel-size | Change the kernel size for a randomly chosen convolution filter |
| downsampling-rate | Randomly choose a convolution layer that has type DownsampledConv2D and randomly sample its downsampling rate |
| batch-norm | Switch the state of the batch-norm parameter |
| residual-connections | Switch the state of the residual connections parameter |
| left-branch | Switch the state of the left-branch parameter. If the new state is True, set weight-fraction-left-branch=0.05 |
| right-branch | Switch the state of the right-branch parameter. If the new state is True, set weight-fraction-right-branch=0.05 |
| total-fc-layer-weights | Change the number of total FC layer weights by ±5e3 |
| merge-type | Switch the state of the merge-type parameter |
| threshold | Change the value of a randomly chosen pruning threshold by 0.5 |
| weight-fraction | For each of weight-fraction-main-branch, weight-fraction-left-branch, and weight-fraction-right-branch, perturb each active parameter by 5e-2 |
| α | Change α by ±0.1 |
| num-conv-layers | Change the number of convolution layers in a randomly chosen convolution block by ±1 |
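A proposal step built from such morphs might look like the sketch below, which implements three of the morphs in Table 6 and applies a random non-empty subset of them to a randomly chosen reference configuration. The dictionary keys (including `alpha` for α), the function names, and the omission of range clamping are simplifying assumptions.

```python
import copy
import random

def morph_total_fc_weights(cfg, rng):
    # total-fc-layer-weights: change by +/- 5e3
    cfg["total-fc-layer-weights"] += rng.choice([-5000, 5000])

def morph_alpha(cfg, rng):
    # alpha: change by +/- 0.1
    cfg["alpha"] += rng.choice([-0.1, 0.1])

def morph_left_branch(cfg, rng):
    # left-branch: switch state; a newly enabled branch starts at fraction 0.05
    cfg["left-branch"] = not cfg["left-branch"]
    if cfg["left-branch"]:
        cfg["weight-fraction-left-branch"] = 0.05
    else:
        cfg.pop("weight-fraction-left-branch", None)

MORPHS = [morph_total_fc_weights, morph_alpha, morph_left_branch]

def propose(references, rng):
    """Apply one or more uniformly chosen morphs to a uniformly chosen reference."""
    child = copy.deepcopy(rng.choice(references))
    k = rng.randint(1, len(MORPHS))
    for morph in rng.sample(MORPHS, k):
        morph(child, rng)
    return child

references = [{"total-fc-layer-weights": 100000, "alpha": 0.5, "left-branch": False}]
child = propose(references, random.Random(0))
print(child)
```

Working on a deep copy leaves the reference configuration untouched, so it can seed further proposals.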
Figure 4: Scatter plots of $|E|$ versus $\|\bar{\omega}\|_0$ for the best performing configurations.
(a) CUReT networks with > 80% accuracy. (b) Chars4k networks with > 40% accuracy.
Figure 5: MNIST: Evolution of dominating configuration. The panels show the
dominating configuration for (a) Bonsai, (b) ProtoNN, (c) KNN, (d) GBDT,
(e) LeNet+SpVD, and (f) RBF-SVM, with $|E|$ and $\|\bar{\omega}\|_0$ as the axes.
Lighter colored samples indicate configurations which were sampled later in the
optimization process.
input
Conv2D 3x3x20
 ModelSize [73/180]
 WorkingMemory [857/964]
MaxPool 2x2
MaxPool 2x2
SeparableConv2D 5x5x50
 ModelSize [137/1500]
 WorkingMemory [2796/4380]
Conv2D 1x1x50
 ModelSize [4/50]
 WorkingMemory [200/246]
MaxPool 2x2
Sum
MaxPool 2x2
FC 800x500
 ModelSize [91/400000]
 WorkingMemory [283/400800]
FC 500x10
 ModelSize [111/5000]
 WorkingMemory [161/5500]
(a) Winner against Bonsai on MNIST
input
Conv2D 3x3x20
 ModelSize [74/180]
 WorkingMemory [858/964]
MaxPool 2x2
MaxPool 2x2
SeparableConv2D 5x5x50
 ModelSize [129/1500]
 WorkingMemory [2790/4380]
Conv2D 1x1x50
 ModelSize [5/50]
 WorkingMemory [201/246]
MaxPool 2x2
Sum
MaxPool 2x2
FC 800x500
 ModelSize [73/400000]
 WorkingMemory [249/400800]
FC 500x10
 ModelSize [105/5000]
 WorkingMemory [153/5500]
(b) Winner against ProtoNN on
MNIST
input
Conv2D 5x5x19
 ModelSize [76/475]
 WorkingMemory [860/1259]
Conv2D 5x5x20
 ModelSize [225/9500]
 WorkingMemory [11169/20444]
DownsampledConv2D 1x1x3, 4x4x19
 ModelSize [81/972]
 WorkingMemory [8031/8060]
Conv2D 3x3x20
 ModelSize [117/3420]
 WorkingMemory [5608/8911]
Conv2D 5x5x20
 ModelSize [127/10000]
 WorkingMemory [4627/14500]
DownsampledConv2D 1x1x8, 5x5x20
 ModelSize [100/4160]
 WorkingMemory [2482/4968]
FC 980x10
 ModelSize [160/9800]
 WorkingMemory [1140/10780]
(c) Winner against LeNet+SpVD on
MNIST
input
Conv2D 5x5x19
 ModelSize [62/475]
 WorkingMemory [846/1259]
Conv2D 5x5x20
 ModelSize [187/9500]
 WorkingMemory [11131/20444]
DownsampledConv2D 1x1x3, 4x4x19
 ModelSize [50/972]
 WorkingMemory [8033/8060]
Conv2D 3x3x20
 ModelSize [99/3420]
 WorkingMemory [5590/8911]
Conv2D 5x5x20
 ModelSize [81/10000]
 WorkingMemory [4581/14500]
DownsampledConv2D 1x1x8, 5x5x20
 ModelSize [72/4160]
 WorkingMemory [2465/4968]
FC 980x10
 ModelSize [125/9800]
 WorkingMemory [1105/10780]
(d) Winner against GBDT on MNIST
Figure 6: Visualization of winning CNNs on MNIST classification in Table 2.
Working memory is reported for the model in (5). The dominating configuration
against KNN is the same as that for ProtoNN. The dominating configuration
against RBF-SVM is the same as that for Bonsai.
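The second number in each ModelSize pair matches the dense weight count of the layer, which can be checked with a short calculation for the Bonsai winner in panel (a). This sketch assumes that biases are excluded, that the input has one channel, that the separable convolution follows the 20-filter convolution, and that the first number in each pair is the weight count remaining after pruning; none of these readings is stated explicitly in the figure.

```python
def conv2d_params(kernel, c_in, c_out):
    """Dense weight count of a standard convolution (no biases)."""
    return kernel * kernel * c_in * c_out

def separable_conv2d_params(kernel, c_in, c_out):
    """Depthwise kernel per input channel plus a 1x1 pointwise convolution."""
    return kernel * kernel * c_in + c_in * c_out

def fc_params(n_in, n_out):
    """Dense weight count of a fully connected layer (no biases)."""
    return n_in * n_out

dense = [
    conv2d_params(3, 1, 20),             # Conv2D 3x3x20
    separable_conv2d_params(5, 20, 50),  # SeparableConv2D 5x5x50
    conv2d_params(1, 1, 50),             # Conv2D 1x1x50
    fc_params(800, 500),                 # FC 800x500
    fc_params(500, 10),                  # FC 500x10
]
kept = [73, 137, 4, 91, 111]  # first entries of the ModelSize pairs in panel (a)

total_dense = sum(dense)
total_kept = sum(kept)
print(dense, total_dense, total_kept)
```

Under these assumptions the per-layer dense counts reproduce the figure (180, 1500, 50, 400000, 5000), and the pruned network in panel (a) keeps 416 of 406730 weights.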
