The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these everyday objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.
Introduction
The microcontroller unit (MCU) is a truly ubiquitous computer. MCUs are self-contained single-chip processors which are small (∼1 cm²), cheap (∼$1), and power efficient (∼1 mW). Applications are extremely broad, but often include seemingly banal tasks such as simple control and sequencing operations for everyday devices like washing machines, microwave ovens, and telephones. The key advantage of MCUs over application specific integrated circuits (ASICs) is that they are programmed with software and can be readily updated to fix bugs, change functionality, or add new features. The short time to market and flexibility of software has led to the staggering popularity of MCUs. In the developed world, a typical home is likely to have around four general-purpose microprocessors. In contrast, the number of MCUs is around three dozen [18]. A typical mid-range car may have about 30 MCUs. The best public market estimates suggest that around 50 billion MCU chips will ship in 2019 [2], which far eclipses other computer chips like graphics processing units (GPUs), whose shipments totalled roughly 100 million units in 2018 [4].
MCUs can be highly resource constrained; Table 1 compares MCUs with bigger processors. The broad proliferation of MCUs relative to desktop GPUs and CPUs stems from the fact that they are orders of magnitude cheaper (∼600×) and less power hungry (∼250,000×). In recent years, MCUs have been used to inject intelligence and connectivity into everything from industrial monitoring sensors to consumer devices, a trend commonly referred to as the Internet of Things (IoT) [19, 30, 48]. Deploying machine learning (ML) models on MCUs is a critical part of many IoT applications, enabling local autonomous intelligence rather than relying on expensive and insecure communication with the cloud [17]. In the context of supervised visual tasks, state-of-the-art (SOTA) ML models typically take the form of convolutional neural networks (CNNs) [40]. While tools for deploying CNNs on MCUs have started to appear [15, 14, 9], the CNNs themselves remain far too large for the memory-limited MCUs commonly used in IoT devices. In the remainder of this work, we use MCU to refer specifically to IoT-sized MCUs, like the Micro:Bit. In contrast to this work, the majority of preceding research on compute/memory efficient CNN inference has targeted CPUs and GPUs [33, 20, 63, 64, 50, 58, 53].
To illustrate the challenge of deploying CNNs on MCUs, consider the seemingly simple task of deploying the well-known LeNet CNN on an Arduino Uno [1] to perform MNIST character recognition [43] . Assuming the weights can be quantized to 8-bit integers, 420 KB of memory is required to store the model parameters, which exceeds the Uno's 32 KB of read-only (flash) memory. An additional 177 KB of random access memory (RAM) is then required to store the intermediate feature maps produced by LeNet, which far exceeds the Uno's 2 KB RAM. The dispiriting implication is that it is not possible to perform LeNet inference on the Uno. This has led many to conclude that CNNs should be abandoned on constrained MCUs [41, 32] . Nevertheless, the sheer popularity of MCUs coupled with the dearth of techniques for leveraging CNNs on MCUs motivates our work, where we take a step towards bridging this gap.
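The parameter-storage figure in this example can be approximated with a back-of-the-envelope count. The sketch below assumes a Caffe-style LeNet variant (two 5×5 convolutions with 20 and 50 output channels, followed by fully connected layers with 500 and 10 units) on 28×28 MNIST inputs. The text does not state which LeNet variant is meant, so the layer shapes and helper names are assumptions made for illustration only.

```python
# Back-of-the-envelope parameter count for an ASSUMED Caffe-style LeNet.
# The layer shapes below are an assumption; the text does not list them.

def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution."""
    return in_ch * out_ch * k * k + out_ch

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layers = {
    "conv1": conv_params(1, 20, 5),     # 28x28 -> 24x24, 2x2 pooling -> 12x12
    "conv2": conv_params(20, 50, 5),    # 12x12 -> 8x8, 2x2 pooling -> 4x4
    "fc1": fc_params(50 * 4 * 4, 500),
    "fc2": fc_params(500, 10),
}

total_params = sum(layers.values())
bytes_per_weight = 1                    # 8-bit quantized weights
model_kb = total_params * bytes_per_weight / 1024

UNO_FLASH_KB = 32                       # flash capacity quoted in the text
fits_in_flash = model_kb <= UNO_FLASH_KB

print(f"parameters: {total_params}, model size: {model_kb:.1f} KB")
print("fits in Uno flash:", fits_in_flash)
```

Under these assumed shapes, almost all of the storage is taken by the first fully connected layer, which is one reason why pruning and architecture choices matter so much at this scale.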
Deployment of CNNs on MCUs is challenging along multiple dimensions, including power consumption and latency, but as the example above illustrates, it is the hard memory constraints that most directly prohibit the use of these networks. MCUs typically include two types of memory. The first is static RAM, which is relatively fast, but volatile and small in capacity. RAM is used
C2: The model parameters must not exceed the ROM (flash memory) capacity.
To the best of our knowledge, there are currently no CNN architectures or training procedures that produce CNNs satisfying these MCU memory constraints [41, 32]. This is true even ignoring the memory required for the runtime (in RAM) and the program itself (in ROM). The severe memory constraints for inference on MCUs have pushed research away from CNNs and toward simpler classifiers based on decision trees and nearest neighbors [41, 32]. We demonstrate for the first time that it is possible to design CNNs that are at least as accurate as Kumar et al. [41] and Gupta et al. [32] and at the same time satisfy C1-C2, even for devices with just 2 KB of RAM. We achieve this result by designing CNNs that are heavily specialized for deployment on MCUs using a method we call Sparse Architecture Search (SpArSe). The key insight from SpArSe is that combining neural architecture search (NAS) and network pruning allows us to balance generalization performance against tight memory constraints C1-C2. Critically, we enable SpArSe to search over pruning strategies in conjunction with conventional hyperparameters around morphology and training. Pruning enables SpArSe to quickly evaluate many sub-networks of a given network, thereby expanding the scope of the overall search. While previous NAS approaches have automated the discovery of performant models with reduced parameterizations, we are the first to simultaneously consider performance, parameter memory constraints, and inference-time working memory constraints.
We use SpArSe to uncover SOTA models on four datasets, in terms of accuracy and model size, outperforming both pruning of popular architectures and MCU-specific models [41, 32]. The multi-objective approach of SpArSe leads to new insights in the design of memory-constrained architectures. Fig. 1a shows an example of a discovered architecture which has high accuracy, small model size, and fits within 2KB RAM. By contrast, we find that optimizing networks solely to minimize the number of parameters (as is typically done in the NAS literature, e.g., [23]) is not sufficient to identify networks that minimize RAM usage. Fig. 1b illustrates one such example.
Figure 1: Model architectures found with best test accuracy on CIFAR10-binary, while optimizing for (a) 2KB for both ModelSize (MS) and WorkingMemory (WM), and (b) minimum MS. Each node in the graph is annotated with MS and WM, and the values in square brackets show the quantities before and after pruning, respectively. Optimizing for WM leads to a model that yields more than 11.2× WM reduction. Note that pruning has a considerable impact on the CNN.
Related work
CNNs designed for resource constrained inference have been widely published in recent years [53, 35, 65], motivated by the goal of enabling inference on mobile phone platforms. Advances include depth-wise separable layers [54], deployment-centric pruning [64, 50], and quantization techniques [61]. More recently, NAS has been leveraged to achieve even more efficient networks on mobile phone platforms [20, 56].
Although mobile phones are more constrained than general-purpose CPUs and GPUs, they still have many orders of magnitude more memory capacity and compute performance than MCUs (Table 1). In contrast, little attention has been paid to running CNNs on MCUs, which represent the most numerous compute platform in the world. Kumar et al. [41] propose Bonsai, a pruned shallow decision tree with non-axis aligned decision boundaries. Gupta et al. [32] propose a compressed k-nearest neighbors (kNN) approach (ProtoNN), where model size is reduced by projecting data into a low-dimensional space, maintaining a subset of prototypes to classify against, and pruning parameters. We build upon Kumar et al. [41] and Gupta et al. [32] by targeting the same MCUs, but using NAS to find CNNs which are at least as small and more accurate.
Algorithms for identifying performant CNN architectures have received significant attention recently [66, 23, 20, 45, 31, 24, 44] . The approaches closest to SpArSe are Stamoulis et al. [56] , Elsken et al. [23] . In Stamoulis et al. [56] , the authors optimize the kernel size and number of feature maps of the MBConv layers in a MobileNetV2 backbone [53] by expressing each of the layer choices as a pruned version of a superkernel. In some ways, Stamoulis et al. [56] is less a NAS algorithm and more of a structured pruning approach, given that the only allowed architectures are reductions of MobileNetV2. SpArSe does not constrain architectures to be pruned versions of a baseline, which can be too restrictive of an assumption for ultra small CNNs. SpArSe is not based on an existing backbone, giving it greater flexibility to extend to different problems. Like Elsken et al. [23] , SpArSe uses a form of weight sharing called network morphism [62] to search over architectures without training each one from scratch. SpArSe extends the concept of morphisms to expedite training and pruning CNNs. Because Elsken et al. [23] seek compact architectures by using the number of network edges as one of the objectives in the search, potential gains from weight sparsity are ignored, which can be significant (Section 4.3 [27, 28] ). Moreover, since SpArSe optimizes both the architecture and weight sparsity, Elsken et al. [23] can be seen as a special case of SpArSe.
3 SpArSe framework: CNN design as multi-objective optimization
Our approach to designing a small but performant CNN is to specify a multi-objective optimization problem that balances the competing criteria. We denote a point in the design space as Ω = {α, ϑ, ω, θ}, in which: α = {V, E} is a directed acyclic graph describing the network connectivity, where V and E denote the sets of graph vertices and edges; ω denotes the network weights; ϑ represents the operations performed at each edge, i.e. convolution, pooling, etc.; and θ are hyperparameters governing the training process. The vertices $v_i, v_j \in V$ represent network neurons, which are connected to each other if $(v_i, v_j) \in E$ through an operation $\vartheta_{ij}$ parameterized by ω. The competing objectives in the present work of targeting constrained MCUs are:
$$f_1(\Omega) = 1 - \mathrm{ValidationAccuracy}(\Omega) \tag{1}$$
$$f_2(\Omega) = \mathrm{ModelSize}(\omega) \tag{2}$$
$$f_3(\Omega) = \max_{l \in \{1, \ldots, L\}} \mathrm{WorkingMemory}_l(\Omega) \tag{3}$$
where ValidationAccuracy(Ω) is the accuracy of the trained model on validation data, ModelSize(ω), or MS, is the number of bits needed to store the model parameters ω, and $\mathrm{WorkingMemory}_l(\Omega)$ is the working memory in bits needed to compute the output of layer l, with the maximum taken over the L layers to account for in-place operations. We refer to (3) as the working memory (WM) for Ω. There is no single Ω which minimizes all of (1)-(3) simultaneously. For instance, (1) prefers large networks with many non-zero weights whereas (2) favors networks with no weights. Likewise, (3) prefers configurations with small intermediate representations, whereas (2) has no preference as to the size of the feature maps. Therefore, in the context of CNN design, it is more appropriate to seek the set of Pareto optimal configurations, where $\Omega^\star$ is Pareto optimal if no other configuration is at least as good in every objective and strictly better in at least one, i.e., there is no Ω such that $f_k(\Omega) \le f_k(\Omega^\star)$ for all k and $f_j(\Omega) < f_j(\Omega^\star)$ for some j.
The concept of Pareto optimality is appealing for multi-objective optimization, as it allows the ready identification of optimal designs subject to arbitrary constraints in a subset of the objectives.
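To make the use of Pareto optimality concrete, the following sketch filters a set of candidate configurations down to the non-dominated ones and then picks the most accurate configuration under memory budgets. The candidate values, the 2048-byte budgets, and all function names are hypothetical and serve only to illustrate the idea; they are not results from this work.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere
    and strictly better in at least one objective (all minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the indices of the non-dominated points."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p)
                       for j, q in enumerate(points) if j != i)]

# Hypothetical candidates: (1 - accuracy, model size, working memory), in bytes.
candidates = [
    (0.02, 4000, 3000),
    (0.03, 1500, 1800),
    (0.05, 1400, 1900),
    (0.03, 2000, 2500),   # dominated by the second candidate
    (0.10, 500, 900),
]

front = pareto_front(candidates)

# Pick the most accurate Pareto-optimal candidate within both memory budgets.
budget = 2048
feasible = [i for i in front
            if candidates[i][1] <= budget and candidates[i][2] <= budget]
best = min(feasible, key=lambda i: candidates[i][0])

print("Pareto front indices:", front)
print("best feasible candidate:", candidates[best])
```

Because the front is computed once, different budgets can be applied afterwards without repeating the search.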
Search space
Our search space is designed to encompass CNNs of varying depth, width, and connectivity. Each graph consists of optional input downsampling followed by a variable number of blocks, where each block contains a variable number of convolutional layers, each parametrized by its own kernel size, number of output channels, convolution type, and padding. We consider regular, depthwise separable, and downsampled convolutions, where we define a downsampled convolution to be a 1 × 1 convolution that downsamples the input in depth, followed by a regular convolution. Each convolution is followed by optional batch normalization, ReLU, and spatial downsampling through pooling of a variable window size. Each set of two consecutive convolutions has an optional residual connection. Inspired by the decision tree approach in Kumar et al. [41], we let the output layer use features at multiple scales by optionally routing the output of each block to the output layer through a fully connected (FC) layer (see Fig. 1a). All of the FC layer outputs are merged before going through an FC layer that generates the output. The search space also includes parameters governing CNN training and pruning. The Appendix contains a complete description of the search space.
Quantifying memory requirements
The ValidationAccuracy(Ω) metric is readily available for trained models via a held-out validation set or by cross-validation. However, the memory constraints of interest in this work demand more careful specification. For simplicity, we estimate the model size as
$$\mathrm{ModelSize}(\omega) \approx \|\omega\|_0. \tag{4}$$
For working memory, we consider two different models:
$$\mathrm{WorkingMemory}^{1}_{l}(\Omega) \approx \|x_l\|_0 + \|\omega_l\|_0 \tag{5}$$
$$\mathrm{WorkingMemory}^{2}_{l}(\Omega) \approx \|x_l\|_0 + \|y_l\|_0 \tag{6}$$
where $x_l$, $y_l$, and $\omega_l$ are the input, output, and weights for layer l, respectively. The assumption in (5) is that the inputs to layer l and the weights need to reside in RAM to compute the output, which is consistent with deployment tools like [15] which allow layer outputs to be written to an SD card. The model in (6) is also a standard RAM usage model, adopted in [16], for example. For merge nodes that sum two vector inputs $x_l^1$ and $x_l^2$, we set
The reliance of (4)-(6) on the $\ell_0$ norm is motivated by our use of pruning to minimize the number of non-zeros in both ω and $\{x_l\}_{l=1}^{L}$, which is also the compression mechanism used in related work [41, 32]. Note that (4)-(6) are reductive to varying degrees. However, since SpArSe is a black-box optimizer, the measures in (4)-(6) can be readily updated as MCU deployment toolchains mature.
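As a rough illustration of how these memory measures could be evaluated, the sketch below takes the model size as the total number of non-zero weights, model (5) as input non-zeros plus weight non-zeros, and model (6) as input non-zeros plus output non-zeros, following the verbal description in this section. The per-layer counts, the one-byte-per-entry assumption, and the function names are hypothetical.

```python
# Per-layer counts of non-zero entries in the input (x), output (y), and weights (w).
# These numbers are hypothetical and are not taken from the paper.
layers = [
    {"name": "conv1", "x": 784, "y": 1152, "w": 60},
    {"name": "conv2", "x": 288, "y": 256, "w": 420},
    {"name": "fc", "x": 64, "y": 10, "w": 300},
]

def model_size(layers):
    """Model size estimate: total number of non-zero weights."""
    return sum(layer["w"] for layer in layers)

def working_memory(layers, model):
    """Peak per-layer working memory under one of two usage models.

    model="input+weights": inputs and weights must be resident (model (5)).
    model="input+output":  inputs and outputs must be resident (model (6)).
    """
    if model == "input+weights":
        per_layer = [layer["x"] + layer["w"] for layer in layers]
    elif model == "input+output":
        per_layer = [layer["x"] + layer["y"] for layer in layers]
    else:
        raise ValueError(f"unknown working-memory model: {model}")
    return max(per_layer)

ms = model_size(layers)
wm5 = working_memory(layers, "input+weights")
wm6 = working_memory(layers, "input+output")

RAM_BYTES = 2048  # assumed budget, one byte per stored entry
print(f"MS={ms}, WM(5)={wm5}, WM(6)={wm6}")
print("fits in RAM under (5) and (6):", wm5 <= RAM_BYTES, wm6 <= RAM_BYTES)
```

Taking the maximum over layers reflects that only one layer is computed at a time, so the peak rather than the sum determines the RAM requirement.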
Neural network pruning
Pruning [42, 49, 21, 47] is essential to MCU deployment using SpArSe, as it heavily reduces the model size (4) and working memory (5)/(6) without significantly impacting classification accuracy. Pruning is a procedure for zeroing out network parameters ω and can be seen as a way to generate a new set of parameters $\hat{\omega}$ that have a lower $\ell_0$ norm. We consider both unstructured and structured, or channel [34], pruning, where the difference is that the latter prunes away entire groups of weights corresponding to output feature maps for convolution layers and input neurons for FC layers. Both forms of pruning reduce $\|\omega\|_0$ and, consequently, (4)-(5). Structured pruning is critical for reducing (5)-(6) because it provides a mechanism for reducing the size of layer inputs. We use Sparse Variational Dropout (SpVD) [49] and Bayesian Compression (BC) [47] to realize unstructured and structured pruning, respectively. Both approaches assume a sparsity promoting prior on the weights and approximate the weight posterior by a distribution parameterized by φ. See the Appendix for a description of SpVD and BC. Notably, φ contains all of the information about the network weight values as well as which weights to prune.
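The difference between the two pruning styles can be illustrated with a small sketch. Note that it uses simple magnitude and group-norm thresholds as a stand-in; it is not SpVD or BC, which derive the pruning decision from a learned weight posterior. The weights, thresholds, and function names are hypothetical.

```python
import math

def prune_unstructured(groups, threshold):
    """Zero individual weights whose magnitude falls below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in g] for g in groups]

def prune_structured(groups, threshold):
    """Zero an entire group (e.g. one output channel) if its L2 norm is small."""
    pruned = []
    for g in groups:
        norm = math.sqrt(sum(w * w for w in g))
        pruned.append(list(g) if norm >= threshold else [0.0] * len(g))
    return pruned

def nonzeros(groups):
    """Number of non-zero weights."""
    return sum(1 for g in groups for w in g if w != 0.0)

def active_groups(groups):
    """Number of groups with at least one non-zero weight."""
    return sum(1 for g in groups if any(w != 0.0 for w in g))

# Hypothetical layer: each row holds the weights of one output channel.
weights = [
    [0.90, -0.02, 0.40, 0.01],
    [0.03, 0.12, -0.01, 0.04],
    [-0.70, 0.60, 0.05, -0.30],
]

unstructured = prune_unstructured(weights, threshold=0.1)
structured = prune_structured(weights, threshold=0.2)

print("unstructured:", nonzeros(unstructured), "weights,",
      active_groups(unstructured), "channels")
print("structured:  ", nonzeros(structured), "weights,",
      active_groups(structured), "channels")
```

In this toy case the unstructured variant keeps fewer weights but all three channels stay active, whereas the structured variant removes a whole channel, which is what shrinks the input of the following layer.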
Multi-objective Bayesian optimization
SpArSe consists of three stages, where each stage m samples T m configurations. At iteration n, a new configuration Ω n is generated by the multi-objective Bayesian optimizer (MOBO) with probability ρ m and uniformly at random with probability 1 − ρ m . We adopt the combination of model-based and entirely random sampling from [26] to increase search space coverage. The optimizer considers candidates which are morphs of previous configurations and returns both the new and reference configurations (Section 3.5). The parameters of the new architecture are then inherited from the reference before being retrained and pruned.
SpArSe uses a MOBO based on the idea of random scalarizations [51] . The MOBO approach is appealing as it builds flexible nonparametric models of the unknown objectives and enables reasoning about uncertainty in the search for the Pareto frontier. A scalarized objective is given by
where $\lambda_k$ is drawn randomly. Choosing the domain of the prior on $\lambda_k$ allows the user to specify preferences about the region of the Pareto frontier to explore. For example, IoT practitioners may care about models with less than 1000 parameters. Since $f_k(\Omega)$ is unknown in practice, it is modeled by a Gaussian process [52] with a kernel $\kappa(\cdot, \cdot)$ that supports the types of variables included in Ω, i.e., real-valued, discrete, categorical, and hierarchically related variables [57, 29]. A new $\Omega_n$ is sampled by minimizing (7) through Thompson sampling. This MOBO yields better coverage of the Pareto frontier than the deterministic scalarization methods used in [20, 56].
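The sampling scheme described in this section can be sketched in a few lines. The weighted-maximum scalarization used below is an assumed form chosen for illustration and need not match (7); likewise, `predict` is a hypothetical stand-in for a draw from the Gaussian process posterior, and the candidate names and objective values are invented.

```python
import random

def sample_weights(rng, num_objectives):
    """Draw random positive weights that sum to one (an assumed prior)."""
    raw = [rng.random() + 1e-9 for _ in range(num_objectives)]
    total = sum(raw)
    return [r / total for r in raw]

def scalarize(objectives, weights):
    """Weighted-maximum scalarization of objectives that are all minimized."""
    return max(w * f for w, f in zip(weights, objectives))

def propose(rng, candidates, rho, predict):
    """With probability rho, return the candidate minimizing a randomly
    scalarized surrogate prediction; otherwise sample uniformly at random."""
    if rng.random() < rho:
        sample = predict(candidates[0])
        weights = sample_weights(rng, len(sample))
        return min(candidates, key=lambda c: scalarize(predict(c), weights))
    return rng.choice(candidates)

# Hypothetical surrogate predictions, normalized to [0, 1]:
# (1 - accuracy, model size, working memory).
surrogate = {
    "a": (0.1, 0.1, 0.1),
    "b": (0.5, 0.6, 0.7),
    "c": (0.9, 0.2, 0.3),
}
names = list(surrogate)

rng = random.Random(0)
model_based = propose(rng, names, rho=1.0, predict=surrogate.get)
uniform = propose(rng, names, rho=0.0, predict=surrogate.get)
print("model-based proposal:", model_based)
print("uniform proposal:", uniform)
```

Redrawing the weights at every iteration is what spreads the proposals over different regions of the Pareto frontier, while the uniform branch keeps some exploration that does not depend on the surrogate.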
Network morphism
Evaluating each configuration $\Omega_n$ from a random initialization is slow, as evidenced by early NAS works which required thousands of GPU days [66, 67]. Search time can be reduced by constraining each proposal to be a morph of a reference $\Omega_r \in \{\Omega_j\}_{j=0}^{n-1}$ [23]. Loosely speaking, we say that $\Omega_n$ is a morph of $\Omega_r$ if most of the elements in $\Omega_n$ are identical to those in $\Omega_r$. The advantage of using morphism to generate $\Omega_n$ is that most of $\phi_n$ can be inherited from $\phi_r$, where $\phi_r$ denotes the weight posterior parameters for configuration $\Omega_r$. Initializing $\phi_n$ in this way means that $\Omega_n$ inherits knowledge about the value and pruning mask for most of its weights. Compared to running SpVD/BC from scratch, morphisms enable pruning proposals using 2-8× fewer epochs, depending on the dataset. Further details on morphism are given in the Appendix, including allowed morphs.
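The inheritance step can be pictured with the following sketch, in which posterior parameters are copied from the reference wherever a layer is unchanged and re-initialized otherwise. The dictionary layout, the matching rule (same layer name and same parameter count), and the function name are hypothetical simplifications and not the exact mechanism used by SpArSe.

```python
def inherit_parameters(reference, new_layout, init_value=0.0):
    """Build parameters for a morphed network.

    reference:  dict mapping layer name -> list of posterior parameters
    new_layout: dict mapping layer name -> required number of parameters
    Layers whose name and size are unchanged are copied from the reference;
    all other layers are freshly initialized.
    """
    params, inherited = {}, []
    for name, size in new_layout.items():
        if name in reference and len(reference[name]) == size:
            params[name] = list(reference[name])
            inherited.append(name)
        else:
            params[name] = [init_value] * size
    return params, inherited

# Hypothetical reference configuration and a morph that widens conv2
# and appends a new layer conv3.
reference = {
    "conv1": [0.5, -0.2, 0.0, 0.1],
    "conv2": [0.3, 0.0, 0.0],
    "fc": [1.0, -1.0],
}
morph_layout = {"conv1": 4, "conv2": 5, "conv3": 2, "fc": 2}

params, inherited = inherit_parameters(reference, morph_layout)
print("inherited layers:", inherited)
print("re-initialized layers:", [n for n in morph_layout if n not in inherited])
```

Only the re-initialized layers start without prior knowledge, which is consistent with the observation above that morphs need fewer epochs than training and pruning from scratch.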
Because our search space includes such a diversity of parameters, including architectural parameters, pruning hyperparameters, etc., we find it helpful to perform the search in stages, where each successive stage increasingly limits the set of possible proposals. This coarse-to-fine search enables exploring decisions 
Results
We report results on a variety of datasets: MNIST (55e3, 5e3, 10e3) [43], CIFAR10 (45e3, 5e3, 10e3) [39], CUReT (3704, 500, 1408) [60], and Chars4k (3897, 500, 1886) [25], corresponding to classification problems with 10, 10, 61, and 62 classes, respectively, with the training/validation/test set sizes provided in parentheses. To match the setup in [41], we also report on binary versions of these datasets, meaning that the classes are split into two groups and re-labeled. The only pre-processing we perform is mean subtraction and division by the standard deviation. Experiments were run on four NVIDIA RTX 2080 GPUs. We compare against previous SOTA works: Bonsai [41], ProtoNN [32], Gradient Boosted Decision Tree Ensemble Pruning [22], kNN, and radial basis function support vector machine (SVM). We do not compare against previous NAS works because they have not addressed the memory-constrained classification problem addressed here.
Models optimized for number of parameters
First, we address C2 by showing that SpArSe finds CNNs with higher accuracy and fewer parameters than previously published methods. We use unstructured pruning and optimize $\{f_k(\Omega)\}_{k=1}^{2}$. Fig. 2 shows the Pareto curves for SpArSe and confirms that it finds smaller and more accurate models on all datasets. For each competing method, we also report the SpArSe-obtained configuration which attains the same or higher test accuracy and minimum number of parameters, which we term the dominating configuration. Results are shown in Table 2. To confirm that SpArSe learns non-trivial solutions, we compare with applying SpVD pruning to LeNet in Fig. 2 and Table 2.
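The selection of a dominating configuration amounts to a filter followed by a minimum. A minimal sketch is given below; the configuration records, accuracies, and parameter counts are hypothetical placeholders rather than values from Table 2.

```python
def dominating_configuration(configs, baseline_accuracy):
    """Among configurations at least as accurate as the baseline,
    return the one with the fewest parameters (or None if none qualifies)."""
    eligible = [c for c in configs if c["accuracy"] >= baseline_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["params"])

# Hypothetical search results (test accuracy, number of non-zero parameters).
configs = [
    {"name": "cfg-1", "accuracy": 0.970, "params": 900},
    {"name": "cfg-2", "accuracy": 0.985, "params": 1500},
    {"name": "cfg-3", "accuracy": 0.990, "params": 2800},
    {"name": "cfg-4", "accuracy": 0.986, "params": 1900},
]

# Hypothetical accuracy of a competing method.
choice = dominating_configuration(configs, baseline_accuracy=0.980)
print("dominating configuration:", choice["name"], choice["params"])
```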
Models optimized for total memory footprint
Next, we demonstrate that SpArSe resolves C1-C2 by finding CNNs that consume less device memory than Bonsai [41]. We use structured pruning and optimize $\{f_k(\Omega)\}_{k=1}^{3}$. We quantize weights and activations to one byte to yield realistic memory calculations and for fair comparison with Bonsai [13]. Table 3 compares SpArSe to Bonsai in terms of accuracy, MS, and WM under the model in (5). For all datasets and metrics, SpArSe yields CNNs which outperform Bonsai. For MNIST, Bonsai reports performance on a binarized dataset, whereas we use the original ten-class problem, i.e., we solve a significantly more complex problem with fewer resources.
What SpArSe reveals about pruning
Pruning can be considered a form of NAS, where $\hat{\omega}$ represents a sub-network of {α, ϑ, ω} given by $\{\{V, E_p\}, \vartheta, \omega\}$, and $E_p \subseteq E$ contains only the edges for which $\hat{\omega}$ is non-zero [27]. The question then becomes, should one look for $E_p$ directly or begin with a large edge-set E and prune it? There is conflicting evidence whether the same validation accuracy can be achieved by both approaches [27, 28, 46]. Importantly, previous NAS approaches have focused on searching for $E_p$ directly by using |E| as one of the optimization objectives [23]. On the other hand, SpArSe is able to explore both strategies and learn the optimal interaction between network graph α, operations ϑ, and pruning. Fig. 3a compares SpArSe to SpArSe without pruning on MNIST. The results show that including pruning as part of the optimization yields roughly an 80× reduction in number of parameters, indicating that the formulation of SpArSe is better suited to designing tiny CNNs compared to [23]. To gain more insight, we show scatter plots of |E| versus $\|\omega\|_0$ for the best-performing configurations on two datasets in Fig. 3b-3c, revealing two important trends (see the Appendix for results on the Chars4k and CUReT datasets). First, $\|\omega\|_0$ tends to increase with increasing |E| for |E| greater than some threshold ζ. This suggests that optimizing |E| can be a proxy for optimizing $\|\omega\|_0$ when targeting large networks. At the same time, $\|\omega\|_0$ tends to decrease with increasing |E| for |E| < ζ, which has implications for both NAS and pruning in the context of small CNNs. Fig. 3b-3c suggest that |E| is not always indicative of weight sparsity, such that minimizing |E| would actually lead to ignoring graphs with more edges but the same number of non-zero weights. Since CNNs with more edges contain more subgraphs, it is possible that one of these subgraphs actually has better accuracy and the same number of non-zero weights as the subgraphs of a graph with fewer edges.
The key is that pruning provides a mechanism for uncovering such high performing subgraphs [27] . 
Conclusion
Although MCUs are the most widely deployed computing platform, they have been largely ignored by ML researchers. This paper makes the case for targeting MCUs for deployment of ML, enabling future IoT products and use cases. We demonstrate that, contrary to previous assertions, it is in fact possible to design CNNs for MCUs with as little as 2KB RAM. SpArSe optimizes CNNs for the multiple constraints of MCU hardware platforms, finding models that are both smaller and more accurate than previous SOTA non-CNN models across a range of standard datasets.
References
[1] Arduino Uno Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/Arduino_Uno. Accessed: 2019-05-02.
[2] The shape of the MCU market. URL https://www.embedded.com/electronics-blogs/break-points/444158 Accessed: 2019-05-02.
[3] GeForce 10 Series Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/GeForce_10_series. Accessed: 2019-05-02.
[4] Global shipments of discrete graphics processing units from 2015 to 2018 (in million units). URL https://www.statista.com/statistics/865846/worldwide-discrete-gpus-shipment/. Accessed: 2019-05-23.
[5] Numerical Computing Performance of Intel 8-core CPUs. URL https://www.pugetsystems.com/labs/hpc/Numerical-Computing-Performance-of-3-Intel-8-core Accessed: 2019-05-02.
[6] List of Intel Core i9 Microprocessors, Wikipedia Article. URL https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors. Accessed: 2019-05-02.
[7] Arm mbed-cli. URL https://github.com/ARMmbed/mbed-cli. Accessed: 2019-05-02.
[8] Micro Bit Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/Micro_Bit. Accessed: 2019-05-02.
[9] Microsoft Embedded Learning Library. URL https://microsoft.github.io/ELL/. Accessed: 2019-05-02.
[10] Pixel (Smartphone) Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/Pixel_(smartphone). Accessed: 2019-05-02.
[11] Raspberry Pi Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/Raspberry_Pi. Accessed: 2019-05-02.
[12] STM32 Hardware Specification, Wikipedia Article. URL https://en.wikipedia.org/wiki/STM32. Accessed: 2019-05-02.
[13] TensorFlow Quantization-Aware Training. URL https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize. Accessed: 2019-05-02.
[42] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598-605, 1990.
Appendix A Pruning algorithm details
Pruning can be expressed as
where $\mathcal{L}(\cdot)$ denotes the loss function for the appropriate task, e.g. cross-entropy for classification, $\mathcal{G}$ denotes the set of disjoint groups covering the indices of each entry in ω, $\omega_G$ denotes a particular group of weights, and $\mathbb{1}[\cdot]$ denotes the indicator function. When $|G| = 1 \; \forall G \in \mathcal{G}$, (8) is referred to as unstructured pruning. On the other hand, structured pruning arises when $\mathcal{G}$ is chosen to group related elements of ω, i.e. the weights corresponding to a given feature map. An alternative to (8) is to cast pruning as Bayesian inference with priors that promote sparse solutions [59]. One such algorithm for unstructured pruning is sparse variational dropout (SpVD) [49]. The prior over ω is assumed to factor over the elements of ω, with $p(|\omega_{ij}|) \propto |\omega_{ij}|^{-1}$. Given a dataset D, the goal of Bayesian inference is to then compute the posterior $p(\omega \mid D)$. SpVD employs variational inference (VI) [36] to approximate the posterior by a parametrized distribution $q_\phi(\omega)$, whose parameters φ are chosen to minimize $D_{KL}(q_\phi(\omega) \,\|\, p(\omega \mid D))$. The distribution $q_\phi(\omega)$ is assumed to factor over the elements of ω and $q_\phi(\omega_{ij}) = \mathcal{N}(\mu_{ij}, \beta_{ij}\mu_{ij}^2)$, where φ = {µ, β}. Techniques for scalable VI are employed to estimate φ [37, 38]. Upon convergence, the estimate
where $\tau_l$ is a layer-specific threshold and $\omega_{ij}$ resides in network layer l. Note that φ contains all of the information about both the network weight values as well as which weights can be masked to 0. One of the side-effects of the choice of prior in SpVD is that the VI objective decomposes into a sum of a data-dependent term and a term which only depends on the prior, leading to the interpretation of VI as regularized training. Although there is no constant in front of the prior term, it can be beneficial to scale it by γ. Depending on the dataset, Molchanov et al. [49] keep γ at 0 for $N_1$ epochs, which is referred to as the pretraining phase, and then increase γ to $\gamma_{N_2}$ over $N_2$ epochs [55]. We include $\{\tau_l\}_{l=1}^{L}$, $N_1$, $N_2$, and $\gamma_{N_2}$ in Ω. The structured pruning extension of SpVD is called Bayesian Compression (BC) [47], which assumes a hierarchical prior on ω that ties weights in the same group to each other: $\omega \mid z \sim \prod_{G \in \mathcal{G}} \prod_{(ij) \in G} \mathcal{N}(\omega_{ij}; 0, z_G^2)$. Inference for this prior proceeds in much the same way as SpVD and, upon convergence, entire groups of weights can be pruned away.
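The scaling of the prior term described above can be sketched as a simple schedule: γ stays at zero during pretraining and is then raised to its final value. The text only states that γ is increased over $N_2$ epochs, so the linear ramp, the 0-indexed epochs, and the function name below are assumptions made for illustration.

```python
def gamma_schedule(epoch, n1, n2, gamma_final):
    """Scale applied to the prior term at a given 0-indexed epoch.

    gamma is 0 for the first n1 (pretraining) epochs and then ramps up
    to gamma_final over the next n2 epochs. The linear ramp is an assumption.
    """
    if n2 <= 0:
        raise ValueError("n2 must be positive")
    if epoch < n1:
        return 0.0
    ramp = min(1.0, (epoch - n1 + 1) / n2)
    return gamma_final * ramp

# Hypothetical setting: 2 pretraining epochs, 4 ramp epochs, final scale 1.0.
schedule = [gamma_schedule(e, n1=2, n2=4, gamma_final=1.0) for e in range(8)]
print(schedule)
```

In a training loop, the returned value would multiply the prior-dependent term of the objective before it is added to the data-dependent term.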
Appendix C Morphism details
In the present work, a configuration $\Omega_n$ is considered a morph of $\Omega_r$ if $\Omega_n$ is generated by applying one or more of the operations listed in Table 6 to $\Omega_r$. These morphs are used to generate random samples for the Thompson sampling
Table 6
Parameter | Morph
 | Change the value of a randomly chosen pruning threshold by 0.5
weight-fraction | For each one of weight-fraction-main-branch, weight-fraction-left-branch, weight-fraction-right-branch, perturb each active parameter by 5e-2
α | Change α by ±0.1
num-conv-layers | Change the number of convolution layers in a randomly chosen convolution block by ±1
