Fixed-point quantization and binarization are two reduction methods adopted to deploy Convolutional Neural Networks (CNN) on end-nodes powered by low-power micro-controller units (MCUs). While most of the existing works use them as stand-alone optimizations, this work aims at demonstrating there is margin for a joint cooperation that leads to inferential engines with lower latency and higher accuracy. Called CoopNet, the proposed heterogeneous model is conceived, implemented and tested on off-the-shelf MCUs with small on-chip memory and few computational resources. Experimental results conducted on three different CNNs using as test-bench the low-power RISC core of the Cortex-M family by ARM validate the CoopNet proposal by showing substantial improvements w.r.t. designs where quantization and binarization are applied separately.
Introduction
Inference engines built upon end-to-end deep learning methods represent the state-of-the-art in several application domains. Deep Convolutional Neural Networks (CNNs), in particular, have brought about breakthroughs in the fields of computer vision, speech recognition and natural language processing [1]. Many Internet-of-Things (IoT) services rely on CNNs to infer information from the raw data gathered by end-user portable devices and/or embedded sensors. While the majority of IoT frameworks run CNNs in the cloud, namely, on centralized data centers physically very far from the source of data, having CNNs on hand is a means to higher efficiency and more user privacy [2]. Enabling the inferential stage on the mobile edge is challenging, as it requires processing CNNs that are large in size and computationally intensive with limited hardware resources. The picture gets even more complicated when considering applications, like wearables [1] or ambient and infrastructural sensors [3], which must run on tiny cores with a few hundred kBytes of on-chip memory and an active power consumption below the 100 mW mark. As a practical example, this work considers the micro-controller units (MCUs) of the Cortex-M family designed by ARM for the IoT segment (https://os.mbed.com/platforms). In such cases, the only available option is to shrink the cardinality of the CNN model until it fits the underlying hardware architecture [4].
Among the available algorithmic optimizations, post-training quantization via integer arithmetic has become a must-do stage, since most of the MCU cores deployed on the end-nodes do not have floating-point units. The use of arithmetic representations with scaled bit-widths helps to reduce the memory footprint, but above all it ensures a larger memory bandwidth, as multiple data can be packed within the same word. This ensures lower latency and hence smaller energy consumption w.r.t. 32-bit floating point. In [5] the authors demonstrate that 8-bit fixed-point integers guarantee near-to-zero accuracy loss with a 4× memory reduction. Extreme quantization to 1-bit [6, 7, 8] leads to Binary CNNs with the smallest footprint, but also the lightest workload, as (some) integer arithmetic gets replaced with bit-wise Boolean operators. However, binary CNNs come with significant accuracy loss: from 2%, up to 10%, 20%, and even more, depending on the original CNN and the complexity of the training data-set. This represents a key limiting factor.
The aim of this work is to address this drawback by demonstrating that there exist margins to exploit binary CNNs for building highly accurate, yet fast inferential models that can be deployed on the edge. The proposed solution is called CoopNet.
CNNs are a special class of end-to-end trainable models mostly suited for the classification of multi-dimensional spatial inputs, like multi-channel images. They consist of several computational layers chained to form a deep architecture. Existing CNNs mainly differ in their internal topology, namely, how different kinds of hidden layers are sized and connected. It is however possible to recognize a common structure made up of two macroblocks (Fig. 1): Feature Extraction, where relevant features learned during the training stage are extracted layer-wise using kernel convolutions; Classification, where the extracted features get classified.
Within the feature extraction block, the most commonly adopted layers are: convolutional layers (CONV), which perform multidimensional convolutions between the output tensor generated by the previous layer (also called feature map) and local filter tensors; pooling layers (POOL), e.g. max pooling or average pooling, which reduce the dimension of feature maps; normalization layers (NORM), that normalize the distribution (mean and standard deviation) of the activation maps; activation function (ACT), e.g. ReLU or tanh, which introduces non-linearity. The classification block is built upon fully-connected layers (FC), which implement a geometric separation of the extracted features, and softmax, that produces a probability distribution over the available classes.
Fixed-Point Quantization
While CNN training is usually run using a 32-bit floating-point representation, recent studies, e.g. [5, 9], demonstrate that fixed-point integers with lower arithmetic precision are enough for inference. Fixed-point quantization is becoming a consolidated standard when the target hardware consists of low-power cores with small memory footprint and reduced instruction set (8/16-bit integer). A detailed review of all the quantization schemes in literature is out of the scope of this work and interested readers may refer to [5, 9, 10]. This work adopts the q-bit fixed-point quantization proposed in [9]. The convolution run in a CONV layer between the input feature map $x \in \mathbb{R}^{c \times w_{in} \times h_{in}}$ and the local weights $w \in \mathbb{R}^{c \times k_w \times k_h}$ is as follows:
with C as the number of channels. We set q = 8 for both weights and activations, and q = 16 for intermediate results accumulation.
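As a rough illustration of the bit-widths mentioned above, the following Python sketch quantizes real values to signed q-bit fixed-point integers and accumulates their products in integer arithmetic. The Q-format with `frac_bits` fractional bits and the helper names `quantize` and `fixed_point_dot` are assumptions made for illustration; the exact scheme of [9] may differ.

```python
def quantize(values, q=8, frac_bits=7):
    """Map real values to signed q-bit fixed-point integers (assumed Q-format)."""
    scale = 1 << frac_bits
    qmin, qmax = -(1 << (q - 1)), (1 << (q - 1)) - 1
    # Round to the nearest representable step, then saturate to the q-bit range.
    return [max(qmin, min(qmax, round(v * scale))) for v in values]


def fixed_point_dot(wq, xq, frac_bits=7):
    """Integer multiply-accumulate; the sum carries 2*frac_bits fractional bits.

    Python integers do not overflow, so the 16-bit accumulator used on a real
    MCU is not modelled here.
    """
    acc = sum(w * x for w, x in zip(wq, xq))
    return acc, acc / float(1 << (2 * frac_bits))


weights = [0.5, -0.25, 0.125, 0.9]
activations = [0.1, 0.2, -0.3, 0.4]
wq = quantize(weights)
xq = quantize(activations)
acc, approx = fixed_point_dot(wq, xq)
exact = sum(w * x for w, x in zip(weights, activations))
print(wq, xq, acc, approx, exact)
```

With these toy values the integer accumulator stays within the signed 16-bit range, and the rescaled result is close to the floating-point dot product.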
Binarized Neural Networks
Several works proposed CNNs with binary weights and/or activations. BinaryConnect [6] represents the ancestor: weights are binarized using a hard sigmoid function, while activations remain in full precision to avoid an accuracy drop. The Binarized Neural Networks proposed in [7] are the first example of a fully binary CNN: both weights and activations are binarized via the sign function. The CONV layers are simplified through bit-wise XNOR and bit-count. This achieves the highest compression (∼32×), yet with substantial accuracy loss (up to 28.7%). The authors of XNOR-Net [8] addressed this drawback by introducing a new topology where the binary output of each CONV layer is first re-scaled through a full-precision NORM layer. Fig. 1-b gives a pictorial description of the basic block deployed in the XNOR-Net, where the suffix Bin highlights binarized layers.
Given $x \in \mathbb{R}^{c \times w_{in} \times h_{in}}$ as the input feature and $w \in \mathbb{R}^{c \times k_w \times k_h}$ as the weight tensor, their convolution is approximated as follows:
where K and α are scaling factors. While weights (W) are binarized with just the sign function, the activations are first normalized and then binarized. These stages can be fused into a single layer that includes all batch normalization parameters: variance $\sigma^2$, mean $\mu$, scale $\gamma$, shift factor $\beta$, and $\epsilon$ for numerical stability. A feature map x is binarized as follows: where $c = \mu - \frac{\beta}{\gamma}\sqrt{\sigma^2 + \epsilon}$ is constant at inference time. We represent c, K, α with 8-bit integers.
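The fusion of normalization and sign described above can be checked numerically. The sketch below assumes that the binarized activation is sign(x − c), that γ > 0, and that sign(0) = +1; these are assumptions for illustration, and the function names are hypothetical.

```python
import math


def sign(v):
    # Convention assumed here: sign(0) = +1.
    return 1 if v >= 0 else -1


def bn_then_sign(x, mu, var, gamma, beta, eps=1e-5):
    """Reference path: full batch normalization followed by sign."""
    return sign(gamma * (x - mu) / math.sqrt(var + eps) + beta)


def fused_threshold(mu, var, gamma, beta, eps=1e-5):
    """Constant c such that sign(BN(x)) == sign(x - c), valid for gamma > 0."""
    return mu - (beta / gamma) * math.sqrt(var + eps)


# Toy batch-normalization parameters (not taken from any trained model).
mu, var, gamma, beta = 0.3, 4.0, 1.5, -0.6
c = fused_threshold(mu, var, gamma, beta)
samples = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 5.0]
reference = [bn_then_sign(x, mu, var, gamma, beta) for x in samples]
fused = [sign(x - c) for x in samples]
print(c, reference, fused)
```

Since c depends only on trained parameters, it can be precomputed offline, so that a single comparison replaces the normalization at inference time.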
An efficient processing of XNOR-Net requires data-paths capable of performing bit-wise XNOR, bit-count and comparison. These operators can be implemented with specialized units in case of custom hardware [11], or through software routines compiled using the instruction-set available on the target general-purpose core [12].
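To make the role of XNOR and bit-count concrete, here is a minimal Python sketch of a binary dot product over ±1 vectors packed into integer words. The encoding (bit set to 1 for +1, 0 for −1) and the helper names are assumptions for illustration; the scaling factors K and α are omitted, and optimized MCU routines such as those in [12] are organized differently.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer: bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask  # XNOR: 1 where the signs match
    matches = bin(agree).count("1")    # bit-count (popcount)
    return 2 * matches - n             # matches minus mismatches


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
packed = xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a))
naive = sum(x * y for x, y in zip(a, b))
print(packed, naive)
```

The packed result matches the naive multiply-accumulate, while using only one XNOR and one bit-count per word.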
CoopNet

Concept and Architecture
The CoopNet inference concept is intuitive, yet very efficient. As graphically depicted in Fig. 2, it is based on the cooperation of two convolutional models: a binary net BNN, fast and small but less accurate, and an integer net INT8, slower and larger but more accurate. The BNN processes the input data first. Then, if the prediction satisfies a certain criterion of confidence, it is forwarded to the output as the final outcome; otherwise the input is re-processed by the INT8 to produce a more confident output score. The criterion used to control the execution flow is called Confidence Score (CS) and it is defined as follows:
where $P_{BNN}(y_n \mid x)$ is the probability produced by the BNN that a given input x belongs to a given class $n \in \{0, 1, \ldots, N\}$; i and j refer to the indexes of the first and second highest scored classes. Intuitively, a high CS means the BNN was able to classify the given input with enough confidence. On the contrary, a low CS means the topmost scored classes are very close to each other, which reveals a certain level of uncertainty, as the BNN was not able to make a clear distinction among the available classes. In the latter case, the INT8 model is activated for aid. The thresholding policy is controlled through a Confidence Threshold (CT), which might be changed dynamically for run-time adjustments.
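The execution flow can be sketched as follows. The sketch assumes that CS is the difference between the two highest BNN class probabilities, which is consistent with the description above, and that the BNN prediction is accepted when CS ≥ CT. The callables `bnn` and `int8` are hypothetical stand-ins for the two models.

```python
def confidence_score(probs):
    """Gap between the highest and second-highest class probabilities (assumed CS)."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def coopnet_predict(x, bnn, int8, ct):
    """Run the BNN first; fall back to INT8 when the confidence is below ct.

    bnn and int8 are hypothetical callables returning class probabilities.
    Returns (predicted class index, name of the model that produced it).
    """
    probs = bnn(x)
    if confidence_score(probs) >= ct:
        return probs.index(max(probs)), "BNN"
    probs = int8(x)
    return probs.index(max(probs)), "INT8"


# Stub models standing in for real networks.
def stub_bnn(x):
    return [0.70, 0.20, 0.10] if x == "easy" else [0.40, 0.35, 0.25]


def stub_int8(x):
    return [0.10, 0.80, 0.10]


easy = coopnet_predict("easy", stub_bnn, stub_int8, ct=0.2)
hard = coopnet_predict("hard", stub_bnn, stub_int8, ct=0.2)
print(easy, hard)
```

Because CT is a plain argument of the call, it can be changed between invocations, which mirrors the run-time adjustment mentioned in the text.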
For a given task and application, the pre-trained 32-bit floating-point model is used as the basis to generate the INT8 model, obtained with the quantization method introduced in [9] using q = 8 bits. The BNN model is built using the XNOR-Net method presented in [8]. It is worth emphasizing that, following [8], the first and last layers of the BNN model are kept at 8-bit precision.
A key design aspect concerns the setting of the threshold CT as it affects the accuracy-latency trade-off. The parametric analysis reported in the experimental section provides a proper understanding of this important relationship.
Extra-functional Metrics
Latency. Given a generic CoopNet, its latency is modeled through the following equation: [13] which supports binary convolutions [12] . When considering batch inference, Equation 5 can be generalized as:
where BS is the cardinality of the batch and $L_i(CT)$ is the latency of the i-th batch sample.
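A possible reading of the latency model is sketched below. It assumes that the BNN always runs, that the INT8 model adds its latency only when CS < CT, and that the cost of computing CS is negligible. Whether the batch expression aggregates the per-sample latencies as a sum or as an average is not assumed here, so the sketch reports both. The function names and all numbers are illustrative, not measurements.

```python
def sample_latency(cs, ct, l_bnn, l_int):
    """Latency of one sample: the BNN always runs, INT8 only on fallback."""
    return l_bnn + (l_int if cs < ct else 0.0)


def batch_latency(scores, ct, l_bnn, l_int):
    """Total latency over a batch, i.e. the sum of the per-sample latencies."""
    return sum(sample_latency(cs, ct, l_bnn, l_int) for cs in scores)


# Illustrative numbers only: confidence scores and latencies in milliseconds.
scores = [0.9, 0.05, 0.6, 0.15]
total = batch_latency(scores, ct=0.2, l_bnn=2.0, l_int=10.0)
mean = total / len(scores)
print(total, mean)
```

Under these assumptions, a larger CT sends more samples to the INT8 model and therefore increases the batch latency.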
On-chip Memory. The hardware cores targeted by this work are the smallest low-power MCUs of the Cortex-M family by ARM. These MCUs are usually equipped with limited RAM (≤ 1 MByte). The memory footprint of CoopNet is the sum of the RAM taken by the BNN model ($M_{bnn}$) and the INT8 model ($M_{int}$). The two contributions include the RAM taken by the weights buffer, the activations buffer and the im2col buffer, following the model provided by ARM in [13]. The CT parameter takes one Byte and is therefore negligible.
Experimental results
Experimental Setup
The CoopNet has been evaluated on the following three tasks:
CIFAR-10 - Image classification task; it consists of 60k 32 × 32 RGB images classified with 10 labels.
Google Speech Command (GSC) - Keyword spotting from speech; the data set [14] collects 65k one-second long samples classified with 30 classes.
Facial Expression Recognition (FER13) - Emotion recognition from facial expression; the data set [15] is made up of 36k 48 × 48 grayscale facial images classified by 7 labels.
Different lightweight CNNs suited for tiny cores are deployed for the three tasks. An overview is reported in Table 1; for the sake of space, the table reports only the CONV and FC layers together with their size, although the networks also include activation, pooling and regularization (normalization and dropout) layers. Moreover, Table 1 shows the top-1 accuracy (%) and the memory footprint (kB) for the full-precision (FP32), 8-bit fixed-point (INT8) and binary (BNN) models.
Performance Assessment
The conducted experiments aim at assessing the latency-accuracy trade-off. For this purpose, we first provide a parametric analysis that uses the confidence threshold CT as the main knob. The line plot in Fig. 3 shows the delta accuracy achieved by CoopNet, taking the accuracy of the baseline INT8 model as reference. The three tasks show the same trend: CoopNet gets more accurate (positive delta) for larger values of CT. The break-even point $CT_{be}$ (for which the delta is 0) may change depending on the complexity of the data-set and the classification capability of the CNN adopted: $CT_{be} = 0.2$ for FER and GSC, $CT_{be} = 0.4$ for CIFAR. Note that using a confidence threshold $CT > CT_{be}$ yields a substantial accuracy improvement. This suggests that CoopNet does not just improve over standard binarized CNNs, but can also go beyond 8-bit quantization.
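A parametric analysis of this kind can be sketched as a sweep over CT on per-sample records collected from the test-set. The record layout (confidence score, BNN correctness, INT8 correctness), the acceptance rule CS ≥ CT and the function name `sweep_threshold` are assumptions for illustration, and the data below are made up.

```python
def sweep_threshold(records, thresholds):
    """For each CT, return (CT, accuracy delta vs INT8, fraction of INT8 runs).

    Each record is (cs, bnn_correct, int8_correct) for one test sample.
    """
    n = len(records)
    int8_acc = sum(1 for _, _, ok in records if ok) / n
    results = []
    for ct in thresholds:
        correct = 0
        fallbacks = 0
        for cs, bnn_ok, int8_ok in records:
            if cs >= ct:
                correct += bnn_ok      # BNN prediction accepted
            else:
                fallbacks += 1         # INT8 re-processes the input
                correct += int8_ok
        results.append((ct, correct / n - int8_acc, fallbacks / n))
    return results


# Made-up records for illustration only.
records = [
    (0.90, True, True),
    (0.80, True, False),
    (0.30, False, True),
    (0.10, False, True),
    (0.05, True, True),
]
table = sweep_threshold(records, [0.0, 0.2, 0.5])
for row in table:
    print(row)
```

In this toy data the delta moves from negative to zero to positive as CT grows, because the BNN is right on a confident sample that the INT8 model misclassifies; the fraction of INT8 runs grows at the same time, which is the latency side of the trade-off.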
Even more interesting is the gain in terms of latency w.r.t. the baseline INT8 on the test-set.
CoopNet shows an impressive performance boost: 47.90% for FER, 51.58% for CIFAR-10, 80.16% for GSC. Table 2 gives a summary of some key results achieved by CoopNet. More specifically, it shows the evaluated extra-functional metrics (speed-up and RAM footprint) under three accuracy scenarios: (i) CoopNet meets the accuracy of FP32, (ii) CoopNet meets the accuracy of INT8 (the reference, i.e. ∆ = 0), (iii) CoopNet with the highest accuracy. As one can see, CoopNet guarantees substantial gains even under very high accuracy constraints. For instance, for GSC it achieves the same accuracy as the FP32 model with an average speed-up of 69.53%; more interestingly, CoopNet can even surpass the FP32 model (+1.47%). We observed that the joint action of BNN and INT8 helps to recognize inputs for which the FP32 model fails.
Related works
Fixed-point quantization [5], as well as binarization [7, 6, 8] or ternarization [16], represent valuable solutions to deploy CNNs on ultra-low-power commercial MCUs. While 8-bit are almost sufficient to guarantee the More specifically, they aim at finding the optimal balance between accuracy, resource utilization and performance by assigning different bit-widths to different layers [10]. Unfortunately, commercial low-power MCUs do not have programmable data-paths and memory interfaces to support arbitrary bit-width arithmetic efficiently [12]. Amiri et al. in [17] proposed a system-level mixed-precision solution which exploits heterogeneous CPU and FPGA accelerators. The overhead, both on-line (due to a tuning procedure) and off-line (during training), and the resources required make this approach less suitable for low-end MCUs. Combining multiple CNN models into an ensemble is a winning solution for many tasks [18]. However, the resources required to host several models and execute them in parallel make this approach hardly scalable on a low-end device. On the contrary, CoopNet enables an efficient and accurate solution for off-the-shelf MCUs by proposing a flexible architecture adaptable to user-defined constraints.
Conclusions
CoopNet is a novel network architecture that integrates a fast and unreliable model with a slower but accurate one to improve the processing efficiency of inference models. The joint cooperation of binary and 8-bit quantized models guarantees higher accuracy and substantial speed-up, also offering a valuable option for adaptive energy-accuracy inference on the edge.
