This paper presents a novel end-to-end methodology for enabling the deployment of low-error deep networks on microcontrollers. To fit the memory and computational limitations of resource-constrained edge devices, we exploit mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization, and we model the inference graph with integer-only operations. Our approach aims at determining the minimum bit precision of every activation and weight tensor given the memory constraints of a device. This is achieved through a rule-based iterative procedure, which cuts the number of bits of the most memory-demanding layers in order to meet the memory constraints. After a quantization-aware retraining step, the fake-quantized graph is converted into an integer-only inference model by inserting the Integer Channel-Normalization (ICN) layers, which introduce a negligible loss as demonstrated on INT4 MobilenetV1 models. We report the latency-accuracy evaluation of mixed-precision MobilenetV1 family networks on an STM32H7 microcontroller. Our experimental results demonstrate an end-to-end deployment of an integer-only Mobilenet network with Top1 accuracy of 68% on a device with only 2MB of FLASH memory and 512kB of RAM, improving the Top1 accuracy by 8% with respect to previously published 8 bit implementations for microcontrollers.
Introduction
Enabling machine learning on extreme-edge devices is challenging due to their tight memory and computing power constraints. When envisioning smart sensors operating on batteries, the target power envelope must be below tens of mW to guarantee a battery lifetime of years. This requirement impacts the system architecture design: adding computational units (e.g. floating-point units) or memory banks increases the complexity and the power cost, and hence the energy consumption, of a system. Nowadays, microcontroller units (MCUs), such as STMicroelectronics STM32 devices, feature an energy consumption compliant with the requirements of smart autonomous sensors and include energy-efficient computational units for running machine learning workloads. However, the typical size of the embedded memory is limited to a few MB (an STM32H7 MCU features 2MB of FLASH memory) and the computation core (commonly a single ARM Cortex-M CPU) runs at up to a few hundred MHz. To boost the performance of this class of MCUs while leveraging the high flexibility of software programmability, ARM recently released a software library, CMSIS-NN [14], which enables the efficient computation of deep networks on tiny microcontrollers. The optimized routines composing the library implement convolutional operations in fixed-point representations, to exploit instruction-level parallelism. Unfortunately, due to memory constraints, only a small set of relatively complex networks has been ported to the microcontroller domain so far [25]. As for models tailored for hard problems, e.g. image classification among the 1000 classes of the Imagenet dataset, fitting them into MCU memory resources is still an open problem.
To address this problem, a crucial contribution comes from recent work on novel network topologies optimized not only in terms of accuracy but also for computational and memory costs [10, 17, 23]. In addition, a variety of compression techniques can be applied to further shrink a trained model. Among these, the quantization of both activation values and parameters to a low-bitwidth format, i.e. 8 bit or less, is extremely effective because, besides reducing the memory footprint, it enables low-precision integer operations, which can be efficiently mapped onto the limited instruction set of tiny microcontrollers. Figure 1 highlights the typical development flow to deploy a deep network design into a resource-constrained device. A pretrained network f(x) is quantized by means of an initial device-aware fine-tuning process, which can also include a re-training step. The resulting fake-quantized model g(x), emulating quantized values during the forward pass, is turned into an integer-only deployment model g'(x) by means of an additional optimization step. Ideally, loss(g'(x)) = loss(g(x)) = loss(f(x)). The state-of-the-art methodology for training a quantized integer-only model is currently integrated within the Tensorflow framework and shows a low accuracy degradation when targeting 8 bit implementations [11]. This compression level is however not sufficient to bring complex models with high accuracy into memory-constrained microcontrollers. As an example, an 8 bit MobilenetV1 [10] with the highest accuracy requires more than 4 MB of embedded memory, which is prohibitive for the majority of microcontroller devices available. To this end, a more aggressive sub-byte quantization methodology is needed, combined with novel techniques for deriving integer-only inference models.
In this work, we present a methodology for quantizing deep networks based on a mixed-precision scheme. The selection of the bit precision of every individual tensor is automated so as to satisfy the memory limitations of a given device. Moreover, we improve the methodology of [11] for integer-only inference networks to support sub-byte per-channel quantization. Our experimental evaluation is conducted over the MobilenetV1 family networks on Imagenet [10]. We argue that this is a representative problem for tiny microcontrollers, not yet solved [12] and much harder than quantizing over-parameterized networks [2]. This paper makes the following contributions: i) We introduce the Integer Channel-Normalization (ICN) activation layer to achieve an efficient conversion of the fake-quantized graph, also exploiting per-channel quantization and optimized quantization-aware training strategies, into an integer-only deployment graph. ii) We present a mixed-precision quantization methodology driven by the memory constraints of a target architecture, which aims at selecting the bit precision of every weight and activation tensor of an integer-only network. iii) We study the latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the MobilenetV1 family when running on an STM32H7 microcontroller device.
Our methodology demonstrates an integer-only deployment of a MobilenetV1 network on an STM32H7 microcontroller with 68% Top1 accuracy, which is 8% higher than previously reported 8 bit integer-only implementations [11].
Related Work
Quantized Neural Networks. Early works on quantization of deep networks targeted either 16 bit fixed-point implementations [15], which result in an almost lossless approximation of full-precision trained networks, or extreme binarized networks, which, despite their appealingly low computational and memory requirements, showed major accuracy losses when applied to image classification benchmarks [4, 19]. Several studies demonstrated that 8 bit quantization of weights and activations results in a good trade-off between latency, compression and a near-zero accuracy degradation, even when applied to efficient Imagenet classification networks [11, 18, 12]. Among the employed methodologies, TensorRT [18] approximates the parameter tensors by minimizing the KL divergence between quantized and full-precision values. In contrast, [11] quantizes values within a range defined by the tensor min and max values. Concerning activations, the PACT approach [2] demonstrated the highest efficiency by leveraging backpropagation to learn the quantization ranges. Recently, to fit stringent memory requirements, more aggressive sub-byte precision quantization approaches, i.e. less than 8 bit, have come under investigation [3, 12, 6, 13, 16]. The works [12, 6] exploit learning-based approaches for determining the quantization ranges of activations and weights at low-bitwidth precision. State-of-the-art accuracy on the efficient MobilenetV1 model has been reported by [13, 16], by making use of per-channel quantization when moving to 4 bit precision. It is also worth mentioning that non-uniform quantizers have proven to be the best approximators when reducing the bit precision [24, 22, 9]. However, high-precision (floating-point) arithmetic is needed on uncompressed values within the datapath, hence these methods are not suitable for the microcontroller domain.
In this work, we leverage existing techniques and provide insights, concerning both computational and memory aspects, on bringing fake-quantized networks to the integer-only arithmetic domain, which is not taken into consideration by this class of works.
Mixed Low Precision Quantization. Mixed-precision techniques make use of multiple bit precisions throughout a quantized network, motivated by the fact that a lossy and aggressive linear cut is not necessary to reach a given compression rate. The method of [7] targeted per-pixel binarization based on a defined tensor mask. Despite achieving an extreme quantization level, a per-pixel quantization cannot be efficiently handled on a microcontroller, due to the control-based nature of the required dataflow. The HAWQ [5] method relies on a second-order Hessian metric to prioritize the tensors whose bit precision should be reduced, but without choosing the optimal per-tensor quantization level. Along the same direction, HAQ [22] dynamically explores multiple low-bitwidth precisions at training time by means of reinforcement learning. When optimizing for memory constraints, a non-uniform quantization is used. Compared to this, our methodology for bit precision selection is applied statically, before quantization-aware retraining, and is based on a rule-based iterative procedure. Both [5] and [22] report superior accuracy to ours when compressing networks to a memory footprint of 1MB, but they include non-uniform clustering quantization of floating-point parameters and are therefore not fully comparable with our work in terms of microcontroller readiness, as current MCUs are not equipped with the hardware needed for manipulation and computation on these data formats.
Deep networks for resource-constrained devices. To bridge the gap between the complexity of deep networks and the limitations of resource-constrained devices, device-aware optimization strategies have also been presented. The work [1] introduced FINN-R to quantize and deploy a generic model into constrained FPGA architectures. Their quantization approach makes use of integer thresholds [21, 8, 20] for data compression. This method enables a lossless integer representation of a fake-quantized network, but demands a larger memory footprint with respect to our proposed method. In contrast, the integer-only deployment in [11] presented a compact fixed-point 8 bit quantization strategy, which folds batch-normalization and scaling factors into the weights before applying a uniform quantizer. Additionally, per-layer fixed-point parameters are needed for adapting the dynamic range when passing data from one layer to the next. In contrast with this work, our methodology generalizes the deployment process to the case where a more effective quantization strategy is used, i.e. per-channel mixed-precision quantization.
Background on Low-Bitwidth Quantization
The quantization process aims at quantizing both the network parameters and the activation values, i.e. the temporary input and output values of the network layers. While the parameters can be quantized just before the inference (forward) pass [18], the quantization of the activations requires the insertion of fake-quantized activation layers within the network graph. These additional layers are responsible for recording the activation range statistics, optionally via backpropagation [2], and for applying quantization during the forward pass depending on the collected statistics. Because of the injected quantization noise, the original full-precision network f is approximated by the corresponding fake-quantized function g. A quantization-aware retraining of a fake-quantized model is essential to recover accuracy, especially when low-bitwidth precision is employed [11].
In the remainder of the paper we focus only on uniform quantization because its arithmetic is naturally supported by the instruction set of general-purpose programmable MCUs. Hence, without loss of generality, any tensor $t \in \mathbb{R}^N$, either representing weights or activations or only a subset of them, can be quantized across the range [a, b] with a given number of Q bits [11] as:
Equation (1) derives from the mapping:
where $Z_t$ is a bias parameter required to shift the numeric domain of the quantized tensors into the $[0, 2^Q - 1]$ or $[-2^{Q-1}, 2^{Q-1} - 1]$ ranges, representative of the UINT-Q and INT-Q datatypes, respectively. If $Z_t$ is constrained to zero, e.g. when a = −b, b > 0, the quantization range is symmetric.
In the case of weights, the parameters a and b can be computed as the min and max values of a tensor [11], by means of more sophisticated statistical analysis [18], or via backpropagation [2]. A Per-Layer (PL) quantization exploits single values a and b for the whole full-precision tensor, hence Equation (1) is applied layer-wise. A Per-Channel (PC) procedure is more effective because it independently approximates a given tensor along the outer dimension [13]. This corresponds to computing the a and b parameters for every output channel of the tensor.
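As an illustration of the uniform quantizer and of the difference between PL and PC ranges, the following is a minimal Python sketch. It assumes the common affine scheme of [11], with a scale S = (b − a)/(2^Q − 1) and an integer zero-point Z that maps values into the UINT-Q range; the function names (`quant_params`, `quantize`, `dequantize`, `max_error`) and the toy weights are hypothetical and are not taken from the paper.

```python
def quant_params(a, b, q_bits):
    """Scale and integer zero-point for the range [a, b] (assumed affine scheme)."""
    scale = (b - a) / (2 ** q_bits - 1)
    zero_point = round(-a / scale)
    return scale, zero_point


def quantize(values, a, b, q_bits):
    """Map real values to UINT-Q codes in [0, 2^Q - 1]."""
    scale, zero_point = quant_params(a, b, q_bits)
    top = 2 ** q_bits - 1
    codes = []
    for v in values:
        clamped = min(max(v, a), b)                # clamp to [a, b]
        code = round(clamped / scale) + zero_point
        codes.append(min(max(code, 0), top))       # guard against rounding overshoot
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Inverse mapping: real value ~ scale * (code - zero_point)."""
    return [scale * (c - zero_point) for c in codes]


def max_error(values, a, b, q_bits):
    """Largest absolute quantization error over the given values."""
    codes, scale, zero_point = quantize(values, a, b, q_bits)
    restored = dequantize(codes, scale, zero_point)
    return max(abs(v - r) for v, r in zip(values, restored))


# Toy weight tensor with two output channels of very different dynamic range.
weights = [[-1.0, 0.5, 1.0], [-0.1, 0.05, 0.1]]
flat = [w for channel in weights for w in channel]
small = weights[1]

# Per-layer (PL): the small channel shares the range of the whole tensor.
pl_error = max_error(small, min(flat), max(flat), 4)
# Per-channel (PC): the small channel gets its own [a, b] range.
pc_error = max_error(small, min(small), max(small), 4)
print(pl_error, pc_error)
```

In this toy case the 4 bit PC error on the low-range channel is several times smaller than the PL error, which is the effect the PC procedure is meant to exploit.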
Integer-Only Inference
Previous work [11] discussed the training and integer-only deployment of a fake-quantized network with 8 bit per-layer quantization. The weight quantization is applied after folding the batch-norm parameters into the convolutional weights. However, when reducing the bit precision below 8 bit using per-layer quantization, the folding process itself can lead to an accuracy drop because it can drastically affect the range of the parameters to quantize. As a reference, Table 2 shows the collapse of the training process for INT4 MobilenetV1 with the folding of the batch-norm parameters enabled.
With the aim of an integer-only deployment, we extend [11] to a) prevent the folding of batch normalization parameters into convolutional weights and b) support per-channel low-bitwidth weight quantization. We observe that any sub-graph of a fake-quantized network composed of a convolutional layer, a batch-normalization layer and a fake-quantizer activation module can be modeled by the transfer function:
where φ = x · w is the output of a full-precision convolution and µ, σ, γ, β are channel-wise full-precision parameters of a batch normalization layer. It is worth noting that this kind of formulation holds for any feature-wise or layer-wise scaling factor applied to the convolution's output tensor.
When per-layer quantization is applied to both the input/output activations and the weights, the mapping of Equation (2) is substituted into Equation (3), which becomes:
where
is the integer output of a low-bitwidth convolution. We define the arrays B q = round( 
Note that every value in Equation (5) is an integer or a fixed-point value, so that a quantized convolutional layer can be computed with integer-only arithmetic. Since the static parameters $M_0$, $N_0$, $B_q$ vary along the channel dimension, we name this activation function (Equation 5) the Integer Channel-Normalization activation, indicated as ICN. If weight parameters are quantized per-channel (PC), i.e. every output channel weight bank has its own $S_w$ and $Z_w$ values, Equation (5) still holds after deriving the $B_q$, $M_0$ and $N_0$ vector parameters accordingly. Table 1 also reports the higher memory requirement of a quantized convolutional layer when using the thresholding method proposed by [21, 8], which increases exponentially with Q.
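To make the integer-only evaluation concrete, the following Python sketch shows one way a batch-normalization plus requantization step can be collapsed into an integer bias, a fixed-point multiplier and a shift for a single output channel. It is a sketch under stated assumptions, not the paper's exact formulation: it assumes that the real-valued factor $S_i S_w \gamma / (S_o \sigma)$ is positive and is decomposed as $M_0 \cdot 2^{N_0}$ with a 15 bit fractional mantissa, and that the bias is $B_q = \mathrm{round}((\beta\sigma/\gamma - \mu)/(S_i S_w))$. The names `icn_params`, `icn` and `FRAC_BITS`, and the sign convention of the exponent, are hypothetical.

```python
import math

FRAC_BITS = 15  # assumed fixed-point precision of the mantissa M0


def icn_params(s_in, s_w, s_out, gamma, beta, mu, sigma):
    """Derive integer ICN parameters for one output channel (gamma > 0 assumed)."""
    m = s_in * s_w * gamma / (s_out * sigma)             # real-valued multiplier
    b_q = round((beta * sigma / gamma - mu) / (s_in * s_w))
    mantissa, n0 = math.frexp(m)                         # m = mantissa * 2**n0
    m0 = round(mantissa * (1 << FRAC_BITS))              # fixed-point mantissa
    return m0, n0, b_q


def icn(phi, m0, n0, b_q, z_out, q_bits):
    """Integer-only activation: bias, fixed-point scaling, shift, clamp."""
    acc = m0 * (phi + b_q)                               # integer accumulator
    shift = FRAC_BITS - n0
    scaled = acc >> shift if shift >= 0 else acc << -shift
    return min(max(scaled + z_out, 0), 2 ** q_bits - 1)  # clamp to UINT-Q


# Toy parameters chosen so that every intermediate value is exact.
params = icn_params(s_in=0.5, s_w=0.25, s_out=0.125,
                    gamma=1.0, beta=0.5, mu=0.5, sigma=2.0)
m0, n0, b_q = params
print(params, icn(10, m0, n0, b_q, z_out=0, q_bits=4))
```

With these toy values the floating-point path gives (0.125 · 10 − 0.5)/2 + 0.5 = 0.875, i.e. 7 output steps of size 0.125, which matches the integer result. For per-channel quantization the same computation would be repeated with one parameter triple per output channel.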
Memory Requirement

Memory-Driven Mixed Low Precision Methodology for MCU Deployment
To run deep networks on microcontrollers, the memory footprint is a stringent constraint. Given common microcontroller architectures [25] , we distinguish:
• Read-Only (RO) Memory, to store frozen inference parameters, i.e. parameters that will not change during the lifetime of a smart device.
• Read-Write (RW) Memory, to store temporary values, i.e. input and output of any quantized convolutional layer that depends on the current sensor data.
At any step of the inference pass, a pair of temporary activation tensors, i.e. the input and output of a layer, and the whole set of fixed parameters must be present in memory. Considering a network of L stacked quantized convolutional layers and a device with $M_{RO}$ and $M_{RW}$ memory budgets (expressed in bytes), the above requirement translates to:
Algorithm 1 Cut Activation Bits
Require: a fake-quantized network g of L stacked quantized convolutional layers, an $M_{RW}$ memory constraint, a $Q_{a,min}$ minimum quantization level
Ensure: the bit precision Q …
  …
      … are decremented by one step
    end while
  end for
  for i = L − 1 to 1 do    ▷ Backward pass
    while mem($x_i$, Q …

where i indicates the i-th quantized convolutional layer and mem(t, Q) returns the memory footprint of a tensor t with bit precision Q. $M^{i}_{TA}$ is the memory footprint of the additional set of the layer's static parameters (see Table 1), with the datatypes detailed in Section 4.1. Concerning activation values:
Our methodology aims at determining the bit precision $Q_x$. Initially, the bit precision of every tensor is set to Q = 8. Algorithm 1 and Algorithm 2 report the pseudo-code of the procedures to cut the bit precision of, respectively, activations and weights, under the hypothesis that there exists a solution that satisfies (6) and (7). The procedure in Algorithm 1 iterates over the L quantized convolutional layers in a forward and backward fashion: the cuts of the bit precision of output tensors are applied during the forward pass, while the cuts of input tensors are applied during the backward pass. Any cut consists of reducing the bit precision by a single step, i.e. from 8 to 4 or from 4 to 2 bits, and it is applied if the intended tensor (output during forward or input during backward) has a number of bits lower than or equal to that of the other activation tensor of the i-th layer, but a higher memory footprint.
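The forward and backward cutting of activation bits can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 1: the cut rule used here (cut the intended tensor only when its footprint is not smaller than that of the other tensor of the layer) is an assumption standing in for the rule stated above, and the names `mem_bytes`, `cut_activation_bits` and `STEP_DOWN`, as well as the tensor sizes, are hypothetical. Activation tensor j is the input of layer j and tensor j + 1 is its output.

```python
STEP_DOWN = {8: 4, 4: 2}  # a cut moves 8 -> 4 or 4 -> 2 bits


def mem_bytes(num_elements, bits):
    """Footprint in bytes of a tensor packed at `bits` bits per element."""
    return (num_elements * bits + 7) // 8


def cut_activation_bits(sizes, m_rw, q_min=2):
    """Return (bits per activation tensor, whether every layer fits in m_rw)."""
    bits = [8] * len(sizes)
    num_layers = len(sizes) - 1

    def footprint(j):
        return mem_bytes(sizes[j], bits[j])

    def fits(i):
        # Input and output of layer i must be resident at the same time.
        return footprint(i) + footprint(i + 1) <= m_rw

    def try_cut(target, other):
        # Assumed rule: cut only the tensor that is at least as large.
        if bits[target] <= q_min or footprint(target) < footprint(other):
            return False
        bits[target] = STEP_DOWN[bits[target]]
        return True

    for i in range(num_layers):                # forward pass: cut outputs
        while not fits(i) and try_cut(i + 1, i):
            pass
    for i in range(num_layers - 1, 0, -1):     # backward pass: cut inputs
        while not fits(i) and try_cut(i, i + 1):
            pass
    return bits, all(fits(i) for i in range(num_layers))


print(cut_activation_bits([100, 400, 400, 50], m_rw=450))
print(cut_activation_bits([50, 400, 100], m_rw=260))
```

The second call shows why the backward pass is needed: the forward pass cannot resolve the violation at the last layer by cutting its small output, so the larger input tensor is cut while walking back. The input of the first layer is never cut, since the backward pass stops at i = 1.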
Algorithm 2 details the iterative procedure for cutting the bits of the weight parameters. At any iteration, a layer score $r_i$ is computed as the ratio between the memory footprint of the i-th layer and the total memory occupation. Among the layers with the highest scores $r_i$ within a δ margin, the one with the lowest layer index is selected for the cut. This heuristic rule is intended to favor the cut of central layers with respect to the last layers, which are usually more critical with respect to quantization.
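The score-based selection can be sketched in Python as follows. The sketch follows the rule described above, with two assumptions that are not stated in the text: layers already at the minimum precision are excluded from the selection so that the loop terminates, and the footprint of the additional static parameters is ignored. The names `cut_weight_bits`, `mem_bytes` and `STEP_DOWN`, and the layer sizes, are hypothetical.

```python
STEP_DOWN = {8: 4, 4: 2}  # a cut moves 8 -> 4 or 4 -> 2 bits


def mem_bytes(num_elements, bits):
    """Footprint in bytes of a tensor packed at `bits` bits per element."""
    return (num_elements * bits + 7) // 8


def cut_weight_bits(sizes, m_ro, delta=0.05, q_min=2):
    """Return the weight bit precision of every layer under the m_ro budget."""
    bits = [8] * len(sizes)
    while True:
        footprints = [mem_bytes(n, q) for n, q in zip(sizes, bits)]
        total = sum(footprints)
        if total <= m_ro:
            return bits
        # Score r_i: share of the total occupation (cuttable layers only).
        scores = {i: footprints[i] / total
                  for i in range(len(sizes)) if bits[i] > q_min}
        if not scores:
            raise ValueError("no bit assignment fits the memory budget")
        r_max = max(scores.values())
        # Among scores within the delta margin, pick the smallest layer index.
        k = min(i for i, r in scores.items() if r > r_max - delta)
        bits[k] = STEP_DOWN[bits[k]]


print(cut_weight_bits([1000, 4000, 4200], m_ro=6000))
```

In the toy run the first cut goes to layer 1 rather than to the slightly larger last layer, because both scores fall within the δ margin and the lower index wins.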
Experimental Results
Algorithm 2 Cut Weights Bits
Require: a fake-quantized network g of L stacked quantized convolutional layers, an $M_{RO}$ memory constraint, a $Q_{w,min}$ minimum quantization level, a δ margin
Ensure: the bit precision Q …
  …
    Find R = $\max_i r_i$
    Among the layers with $r_i$ > (R − δ), select the k-th with the smallest index i
    $Q_w^k$ is decremented by one step
  end while

We run experiments on the MobilenetV1 family networks [10] on Imagenet using the PyTorch framework. In the following, a MobilenetV1 model is referred to with a label x_y, where x = {128, 160, 192, 224} is the spatial resolution of the input data and y = {0.25, 0.5, 0.75, 1.0} refers to the width channel multiplier. The quantization-aware retraining starts from pre-trained weights. Every training session executes on a compute node equipped with 4 NVIDIA-Tesla P100 GPUs for 8 hours. ADAM is chosen as the optimizer with an initial learning rate of 1e-4, which is decreased in a fixed schedule to 5e-5 and 1e-5 at, respectively, the 5th and 8th epoch. Running statistics and learned parameters of batch-normalization layers are frozen after the first training epoch. The batch size is 128. An asymmetric uniform quantization is applied to the weights: the PACT method is used in the case of PL quantization, while min/max statistics are employed in the case of PC quantization. PPQ [16] is applied for refining pre-trained weights before the quantization-aware retraining. Folding of batch-normalization parameters into weights, when applied layer-wise, starts from the 2nd training epoch. Activations are quantized with the PACT strategy.
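The fixed training schedule described above can be written down as a small Python sketch. Epochs are assumed to be counted from 1 and the switch points are assumed to take effect at the start of the named epoch; the helper names `learning_rate`, `batch_norm_frozen` and `folding_enabled` are hypothetical and do not refer to any PyTorch API.

```python
def learning_rate(epoch):
    """Learning rate for a 1-indexed epoch (indexing is an assumption)."""
    if epoch < 5:
        return 1e-4   # initial learning rate
    if epoch < 8:
        return 5e-5   # from the 5th epoch
    return 1e-5       # from the 8th epoch


def batch_norm_frozen(epoch):
    """Batch-norm statistics and parameters are frozen after the first epoch."""
    return epoch > 1


def folding_enabled(epoch):
    """Layer-wise batch-norm folding starts from the 2nd epoch."""
    return epoch >= 2


for epoch in range(1, 10):
    print(epoch, learning_rate(epoch), batch_norm_frozen(epoch), folding_enabled(epoch))
```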
To prove the effectiveness of the ICN layers, we apply our quantization approach to a MobilenetV1 224_1.0 model and we measure the accuracy achieved by a 4 bit integer-only implementation. Table 2 reports the accuracies for the following strategies: PL+FB stands for per-layer quantization and folding of batch-norm parameters into weights, PL+ICN indicates per-layer quantization with ICN layers and PC+ICN refers to per-channel quantization with ICN layers. First, we note that the proposed ICN layers make it possible to avoid the folding of the batch-norm parameters, which causes the collapse of the training process (PL+FB INT4), therefore enabling the convergence of the training algorithm (PL+ICN INT4 and PC+ICN INT4). Secondly, the insertion of the ICN layer introduces an almost negligible accuracy drop of 0.3% on PL+ICN and 0.05% on PC+ICN with respect to the fake-quantized graph. Moreover, by means of PC quantization, the accuracy of our 4 bit model is higher than that of other reported implementations [13, 16]. In addition, Table 2 also reports the memory footprint of our PC+ICN INT4 model, which is 10% less memory-demanding than the one obtained with the integer-thresholds methodology.
To evaluate our proposed methodology for the deployment of deep networks on microcontrollers, we apply our mixed-precision technique to all the Mobilenet configurations after setting the memory constraints $M_{RO}$ = 2MB and $M_{RW}$ = 512kB, corresponding to the memory characteristics of an STM32H7 device. The trained integer-only models are also benchmarked on the STM32H7 MCU running at 400MHz, to assess the implications for inference deployments. To this aim, we leverage an extended version of the ARM CMSIS-NN [14] library, featuring an output-stationary dataflow, and we measure latency in terms of clock cycles. Figure 2 plots the accuracy-latency tradeoff measured on two configurations. MixQ-PL indicates per-layer quantization with either the folding of batch-norm parameters or ICN for layers with $Q_y < 8$ or $Q_w < 8$. In contrast, MixQ-PC-ICN indicates integer-only models with per-channel quantization and ICN as activation layers. Every curve represents a group of Mobilenet models with the same input resolution. Increasing the width multiplier causes a longer latency because of the increasing amount of MAC operations. When applying our mixed-precision method under these memory constraints, Mobilenet models with width multipliers of 0.25 and 0.5, with the exception of 224_0.5, feature no cuts of bit precision. Hence, under the configuration MixQ-PL, these points correspond to the 8 bit integer-only models described in [11]. Pareto frontiers are mostly populated by MixQ-PC-ICN configurations. The most accurate model, PC+ICN 224_0.75, scores 68% Top1 accuracy by featuring 4 bit weights on the last convolutional pointwise layer and on the linear layers, in addition to $Q_y^1, Q_y^2, Q_y^5 = 4$, as determined by the memory-driven procedure of Section 5. This score is 8% higher than the most accurate INT8 Mobilenet (192_0.5) fitting into the same device.
Note that all the configurations featuring a width multiplier of 1.0 suffer from a dramatic accuracy degradation with respect to full-precision settings (from 2% to 15%) due to the aggressive quantization required to fit into the memory constraints. On the latency side, the fastest inference model (128_0.25 MixQ-PL), which features a homogeneous 8 bit quantization, runs at 10fps, 20× faster than the most precise configuration (224_0.75 PC+ICN), but only achieves 43% Top1 accuracy. We can observe that the MixQ-PC-ICN quantization introduces a latency overhead of approx. 20% with respect to the MixQ-PL setting, due to the additional subtractions of the $Z_w$ biases within the inner loop of the convolution. On the other hand, MixQ-PC-ICN provides up to 4% more accuracy for classification.
To further test our proposed mixed-precision method, we set the memory constraint to $M_{RO}$ = 1MB and compare with other mixed-precision methodologies in Table 4. Our best models feature up to 7% lower accuracy with respect to [22], but we remark the integer-only nature of our solution. On the other hand, our implementation features a 2% higher accuracy than INT8 models with comparable memory footprint and tailored for integer-only deployments.
Conclusion
By mixing quantization methodologies, it is possible to execute complex deep neural networks such as MobilenetV1 on memory-constrained MCU edge devices. To pursue this objective, in this work we introduced a mixed-precision quantization technique tailored for memory-constrained microcontroller devices, leveraging the formulation of a quantized activation layer, i.e. the Integer Channel-Normalization activation, to enable sub-byte integer-only deployments. The experimental results show a MobilenetV1 network running on a microcontroller equipped with 2MB of Flash and 512kB of RAM and featuring a Top1 accuracy of 68%, which is 8% higher than state-of-the-art integer-only 8 bit implementations.
Appendices
A Mixed-precision Quantization
Figure 3 plots the bit precision of every weight and activation tensor of the MixQ-PL and MixQ-PC-ICN MobilenetV1 models of the experimental Section 6. Table 4 reports the Top1 accuracy of the evaluated models.
Table 4: Top1 accuracy of the evaluated MobilenetV1 models (columns: Model, MixQ-PL Top1 Accuracy, MixQ-PC-ICN Top1 Accuracy).
