This work targets the automated minimum-energy optimization of Quantized Neural Networks (QNNs), networks using low-precision weights and activations. These networks are trained from scratch at an arbitrary fixed-point precision. At iso-accuracy, QNNs using fewer bits require deeper and wider network architectures than networks using higher-precision operators, while they require less complex arithmetic and fewer bits per weight. This fundamental trade-off is analyzed and quantified to find the minimum-energy QNN for any benchmark and hence optimize energy efficiency. To this end, the energy consumption of inference is modeled for a generic hardware platform. This allows drawing several conclusions across different benchmarks. First, energy consumption varies by orders of magnitude at iso-accuracy depending on the number of bits used in the QNN. Second, in a typical system, BinaryNets or int4 implementations lead to the minimum-energy solution, outperforming int8 networks by 2-10× at iso-accuracy. All code used for QNN training is available from https://github.com/BertMoons/.
I. INTRODUCTION
Deep learning [1], and more specifically Convolutional Neural Networks (CNNs), have emerged as state-of-the-art classification algorithms, achieving super-human performance in applications in both computer vision (CV) and automatic speech recognition. Although these networks are extremely powerful, they are also very computationally and memory intensive, making them difficult to employ on embedded or battery-constrained systems. Today, training and even neural network inference are therefore run on specialized, very fast and power-hungry Graphics Processing Units (GPUs). Substantial research efforts are spent on either speeding up or minimizing the energy consumption of NNs at run-time on both general-purpose and specialized computer hardware.
Minimizing energy consumption is especially crucial in Neural Networks used in battery-constrained, wearable and always-on applications. In these systems, inference at the edge is crucial in order to reduce the latency and wireless connectivity costs as well as the privacy concerns associated with cloud connectivity. Previous solutions were insufficient for such purposes, as both the hardware platform and the Neural Networks used are not sufficiently energy-efficient for always-on, low-latency processing. A simple always-on face-detection task on such a platform drains its battery in less than 40 minutes [2]. In this work, we propose a framework to analyze and optimize the trade-off between energy consumption and accuracy for Quantized Neural Networks (QNNs), hence allowing co-design of hardware and algorithm towards minimum energy consumption in any application.
II. RELATED WORK
Several approaches have been proposed to reduce the energy footprint of DNNs. Most efforts are either in designing more efficient algorithms, or in designing optimized hardware.
Dedicated hardware platforms are optimized for typical dataflows and exploit network sparsity as well as the inherent error-resilience of most Neural Networks. The highly parallel nature of DNNs is exploited in any hardware DNN implementation [3], [4]. Han et al. (2015) use clustered training and trained pruning to reduce model sizes and propose a hardware accelerator optimized for their compression scheme [5]. Other recent hardware implementations propose solutions exploiting sparsity, either by speeding up [6] or by increasing energy efficiency during sparse operation [3], [4]. Some works [3], [5] expand upon that by also exploiting DNNs' inherent tolerance to noise and variations by using reduced-precision operators. This reduces arithmetic power consumption and compresses the model's memory footprint, at the expense of a potential accuracy loss. All these works are implementations of existing neural networks, rather than cross-field optimizations.
Algorithmic innovations towards new network architectures with a smaller memory and energy footprint have been made as well. Residual Neural Networks or ResNets [7] provide an alternative to VGG-type [8] networks, with considerably smaller network complexity and model sizes at iso-accuracy. Recent works have shown how computational complexity can be reduced by constraining the computational precision during the training phase of a DNN. Most notably, the approach in [9] can constrain either only the network weights, or both weights and activations, to +1 and -1. This is particularly interesting from a hardware perspective, as such a binary network topology allows replacing all costly multiply operations with an energy-efficient XNOR operation. Other works have proposed ternary nets [10], fixed-point analyses [11] and fixed-point fine-tuning [12]. In [12], the maximum achieved accuracy drops significantly at 2 or 4 bits. All these techniques are ad-hoc, leading to sub-optimal results without offering full control over the computational precision. The work presented in [13] shows that a number of benchmarks can be quantized during inference down to 6 bits in the PyTorch framework, but the authors do not retrain or train to an arbitrary number of bits and have no way to compare the energy consumption of the resulting models. General Quantized Networks have been discussed in [14] and [15], where the authors run ad-hoc training tests on specific network topologies, but do not train them targeting minimum energy consumption.
Here, we offer explicit control over network quantization for any number of bits and any network topology through quantized training and link this to an inference energy model. This allows cross-optimizing both the algorithm and the hardware architecture to enable always-on embedded applications. More specifically, our contributions are the following:
• We evaluate the trade-off between energy, accuracy and computational precision for QNN inference by linking network complexity and size to a full-system energy model.
• We conclude that energy consumption at iso-accuracy varies depending on the required accuracy, the computational precision and the available on-chip memory. int4 implementations are often minimum-energy solutions.
III. QNNS: QUANTIZED NEURAL NETWORKS
This section details our formulation of Quantized Neural Networks (QNNs), which use only fixed-point representations for both weights and activations in training and inference. In essence, QNNs are the generalization of binary and ternary nets [9], [10] to multiple bits, as in [14], [15]. Implementations in Keras/TensorFlow and Lasagne/Theano can be found on https://github.com/BertMoons.
In QNNs, all weights and activations are quantized to Q bits in a fixed point representation during training. All QNN models converge at all values of Q. The following quantization function is used to achieve this in the forward pass:
The Q=1 case is regarded as a special case, where q = Sign(w), as in the original BinaryNet paper [14]. To successfully propagate gradients through discrete neurons, a "straight-through estimator" (STE) function is used for back propagation, which leads to fast training [16]. If an estimator g_q of the gradient ∂C/∂q has been obtained, the STE of ∂C/∂w is g_w/g_q = hardtanh(w) = clip(w, −1, 1).
As in [9], all real-valued weights are clipped during training onto the interval [−1, 1]. Otherwise, real-valued weights would grow large without impacting the quantized weights. The weight quantization function q(w) and STE are plotted in Fig. 1a for different Q. Activations are computed using either a quantized ReLU or a quantized hardtanh function. Multiple setups have been evaluated. Best results are achieved with the quantized ReLU function for int2, int4 and int8 and with the symmetrically quantized hardtanh function for the Q=1 case. As in [9], all real-valued activations are clipped during training onto the interval [−1, 1]. Every layer following an activation layer will then have intQ inputs. The quantized ReLU and hardtanh forward functions and STEs are plotted in Fig. 1b and 1c for different Q.
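The forward quantizer and the straight-through estimator described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the uniform grid with step 1/2^(Q−1), the handling of Sign(0) and the function names `quantize` and `ste_grad` are all assumptions made for the example.

```python
def quantize(w: float, q_bits: int) -> float:
    """Forward pass: map a real-valued weight to a Q-bit fixed-point value in [-1, 1]."""
    if q_bits == 1:
        # Q=1 special case from the text: q = Sign(w).
        # Sign(0) is taken as +1 here (assumption).
        return 1.0 if w >= 0 else -1.0
    scale = 2 ** (q_bits - 1)        # assumed grid step of 1 / 2^(Q-1)
    q = round(w * scale) / scale     # snap to the nearest grid point
    return max(-1.0, min(1.0, q))    # clip onto [-1, 1]


def ste_grad(w: float, grad_q: float) -> float:
    """Backward pass: straight-through estimator that treats q(w) as hardtanh(w)."""
    # hardtanh has slope 1 inside [-1, 1] and slope 0 outside,
    # so the incoming gradient passes through unchanged only inside that interval.
    return grad_q if -1.0 <= w <= 1.0 else 0.0
```

The clipping in `ste_grad` is what stops real-valued weights outside [−1, 1] from receiving further updates, which matches the motivation given above for clipping the weights during training.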
In a QNN all the inputs to a layer are quantized to intQ, with the exception of the first layer, which typically has int8 pixels as its inputs. In a general case with M input bits where M > Q, an intQ layer can be performed as a sequence of M/Q shifted and added dot products.
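The shifted-and-added decomposition for the first layer can be illustrated as follows. The sketch assumes unsigned M-bit inputs (such as int8 pixels), signed integer weights and that Q divides M; the helper names `split_into_chunks` and `shifted_dot` are hypothetical.

```python
def split_into_chunks(x: int, m_bits: int, q_bits: int) -> list[int]:
    """Split an unsigned M-bit integer into M/Q unsigned Q-bit chunks, least significant first."""
    assert m_bits % q_bits == 0, "sketch assumes Q divides M"
    mask = (1 << q_bits) - 1
    return [(x >> (i * q_bits)) & mask for i in range(m_bits // q_bits)]


def shifted_dot(inputs: list[int], weights: list[int], m_bits: int, q_bits: int) -> int:
    """Compute dot(inputs, weights) as M/Q dot products on Q-bit inputs, shifted and added."""
    chunks = [split_into_chunks(x, m_bits, q_bits) for x in inputs]
    total = 0
    for i in range(m_bits // q_bits):
        # Each partial dot product only ever sees Q-bit input values.
        partial = sum(c[i] * w for c, w in zip(chunks, weights))
        # Chunk i carries a weight of 2^(i*Q) in the original M-bit input.
        total += partial << (i * q_bits)
    return total
```

For M = 8 and Q = 4, the call `shifted_dot([200, 17, 255], [1, -2, 3], 8, 4)` performs two 4-bit dot products and returns the same value as the direct 8-bit dot product.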
IV. HARDWARE ENERGY MODEL
A generic, parameterized energy model, shown in Fig. 2, is used to assess the impact of QNNs on the energy consumption of a typical inference platform. Global energy per inference is the sum of the energy consumed in communication with an off-chip DRAM and the energy consumption of the processing platform itself. The total energy consumed per network inference is then:
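In symbols, the sum described in the preceding sentence can be written as below. The label E_DRAM for the off-chip term (Section IV-A) is an assumption made here; E_HW is the on-chip term of Section IV-B.

```latex
E_{\text{inference}} = E_{\text{DRAM}} + E_{HW}
```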
The sections below discuss a parameterized energy model, which can be customized to a wide variety of processing platforms by calibrating its parameters.
A. Energy consumption of off-chip Memory-Access
The available memory in an always-on chip is inherently limited due to cost and leakage-energy constraints and is hence typically insufficient to store full models and feature maps. If this is the case, the chip will constantly have to communicate with a larger off-chip memory system. The cost of this interface is two orders of magnitude higher than a single equivalent MAC operation [17]. Using fewer bits for weights and activations can hence potentially be more energy-efficient, if the achieved network compression, for both weights and activations, makes the network fit completely in on-chip memory. Off-chip DRAM access energy is modeled as:
Here, E_D is the energy consumed per intQ DRAM access, as in Fig. 2. s_in, c_in and M/Q are respectively the input image's dimensions, the number of input channels and the first-layer factor defined in Section III. f_r and w_r are the numbers of words that have to be re-fetched/stored from/to DRAM if a feature map or model does not fit in the on-chip memory.
B. Hardware modeling
The hardware platform, shown in Fig. 2, is a typical processing platform for CNNs, based on [3]. It contains a parallel neuron array, with a fixed area for p MAC units, and two levels of memory. A large main buffer enables storing M_W bits of weights and M_A = M_W bits of activations, of which 50% is used for the current layer's inputs and 50% is used for the current layer's outputs. The small local SRAM or register-file buffers contain the currently used weights and activations. We model the relative energy consumption of SRAM fetches and Multiply-Accumulate (MAC) operations according to Horowitz [17]. Here, the energy consumption of a read/write from/to the small local SRAM or register file, E_L, is modeled to be equal to the energy of a single MAC operation E_MAC, while accessing the main SRAM costs E_M = 2 × E_MAC. Other operations, such as bias additions, quantized ReLU and non-linear batch normalization, are modeled to cost E_MAC as well. All these numbers incorporate control, data-transfer and clocking overheads. The total on-chip energy per inference E_HW is then the sum of the compute energy E_C and the cost of weight accesses E_W and activation accesses E_A:
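Only the relations stated explicitly in the paragraph above are written out here; the per-term expressions for E_C, E_W and E_A are not reconstructed.

```latex
E_{HW} = E_C + E_W + E_A, \qquad E_L = E_{MAC}, \qquad E_M = 2 \times E_{MAC}
```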
Here, N_c is the network complexity in number of MAC operations for partial product accumulation, N_s is the model size in number of weights and biases and A_s is the total number of activations throughout the whole network. Thus, E_C is the sum of all energy consumed in partial product generation, biasing, batch normalization and activation. The
activation is simultaneously multiplied with √p weights. The total level of parallelism p is a function of Q, as the same area containing p′ 16-bit MACs can hold p′ × 16/Q intQ MACs. Similarly, an on-chip memory can store a variable number of weights, depending on the value of Q. A 2Mb memory stores more than 2M 1-bit weights, but only 131k 16-bit weights. If either the weight size or the feature map size exceeds the available on-chip memory, communication with a larger off-chip DRAM memory will be necessary, as discussed in Section IV-A.
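The scaling of parallelism and memory capacity with Q can be checked with a short calculation. In this sketch, 1 Mb is taken as 2^20 bits, which reproduces the 131k figure quoted above, and p′ = 64 is borrowed from the experimental setup in Section V-B; the function names are hypothetical.

```python
def macs_in_area(p_ref: int, q_bits: int, ref_bits: int = 16) -> int:
    """Number of intQ MAC units that fit in the area of p_ref 16-bit MAC units."""
    return p_ref * ref_bits // q_bits


def weights_in_memory(mem_bits: int, q_bits: int) -> int:
    """Number of Q-bit weights that fit in a memory of mem_bits bits."""
    return mem_bits // q_bits


MEM_2MB = 2 * 2**20  # 2 Mb expressed in bits (assuming 1 Mb = 2^20 bits)

for q in (1, 2, 4, 8, 16):
    print(q, macs_in_area(64, q), weights_in_memory(MEM_2MB, q))
```

Under these assumptions, the same 2Mb memory holds 2,097,152 binary weights but only 131,072 16-bit weights, and the MAC array grows from 64 units at 16 bits to 1024 units at 1 bit.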
V. EXPERIMENTS

A. QNN topologies
To quantify the energy-accuracy trade-off in QNNs, multiple network topologies are evaluated. This is necessary, as network performance varies not only with the computational precision used, but also with the network depth and width.
Each tested network contains 4 stages, as shown in Fig. 3a and Table I: 3 QNN-blocks, each followed by a max-pooling layer, and 1 fully-connected classification stage. Each QNN-block is defined by 2 parameters: the number of basic building blocks n and the layer width F. Every QNN-sequence is a cascade of a QNN-layer, followed by a batch-normalization layer and a quantized activation function, as shown in Fig. 3a. In this work, F_Block is varied from 32 to 512 and n_Block from 1 to 3.
In order to reliably compare QNNs at iso-accuracy for different n, n_Block, F_Block and Q, the Pareto-optimal floating-point architectures in the energy-accuracy space are derived first. This can be done through an evolutionary architecture optimization [19], but here we apply a brute-force search across the parameter space. Once this Pareto-optimal front is found, the same network topologies are trained again from scratch as QNNs with a varying number of bits.
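Extracting a Pareto-optimal front from a brute-force sweep can be sketched as below. The `Candidate` record and the `pareto_front` helper are hypothetical names chosen for this illustration; the sketch shows only the dominance test and assumes that energy and error rate have already been measured for every topology in the sweep.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    energy: float  # energy per inference, arbitrary units
    error: float   # error rate in percent


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates in (energy, error)."""
    front = []
    for c in candidates:
        # c is dominated if another point is no worse in both metrics
        # and strictly better in at least one of them.
        dominated = any(
            o.energy <= c.energy and o.error <= c.error
            and (o.energy < c.energy or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.energy)
```

The pairwise comparison is quadratic in the number of topologies, which is acceptable for a sweep of the size described here (a few values of each block parameter).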
B. Results and discussion
The Pareto-optimal set of QNNs is analyzed in search of a minimum-energy network. In this analysis, we vary model parameters M_W and M_A and take p = 64 × 16/Q. Based on measurements in [3], we take E_MAC = 3.7 pJ × (16/Q)^1.25.
Model sizes and inference complexity are shown in Fig. 4. Here, computational complexity, model size and the maximum feature map size are compared as a function of error rate and Q for the Pareto-optimal classical QNN set on CIFAR-10. Fig. 4a illustrates how the required computational complexity decreases at iso-accuracy if Q is varied from 1 to 16 bits, as networks with higher resolution require fewer and smaller neurons at the same accuracy. At 12% error, for example, the required complexity of a float16 network is 80 MMAC operations. For a BinaryNet, model complexity at iso-accuracy increases by 10× to 800 binary MMAC operations. On the other hand, the model size in terms of absolute storage requirements increases with the number of bits used. This is illustrated in Fig. 4b. Here, an int4 implementation offers the minimum model size of only 2Mb at 12% error rate. BinaryNets require 50% more model storage, while the float16 net requires at least 4× more. Fig. 4c shows the storage required for feature maps as a function of network accuracy. If this size exceeds the available memory, DRAM access will be necessary. Here, BinaryNets offer a clear advantage over intQ alternatives.
Fig. 5 and Fig. 6 illustrate the energy consumption and the minimum-energy point for classical QNN architectures. Fig. 5 shows the error-rate vs. energy trade-off for different intQ implementations, for chips with a typical 4Mb of on-chip memory. The optimal intQ mode varies with the required precision for all benchmarks. At high error rates, BinaryNets tend to be optimal. For medium and low error rates, mostly int4 nets are optimal. At an error rate of 13% on CIFAR-10 in Fig. 5a, int4 offers a >6× advantage over int8 and a 2× advantage over a BinaryNet. At 11%, BinaryNet is the most energy-efficient operating point and respectively 4× and 12× more energy-efficient than the int8 and float16 implementations. The same holds for networks with 10% error.
However, these networks come at a 3× higher energy cost than the 11% error rate networks, which illustrates the large energy cost of increased accuracy. In an int4 network run on a 4Mb chip, energy increases 3× going from 17% to 13% error, while it increases 20× when going from 13% down to 10%. Hence, orders of magnitude of energy consumption can be saved if the image recognition pipeline can tolerate slightly less accurate QNN architectures. Fig. 6 compares the influence of the total on-chip memory size M_W + M_A. In Fig. 6a, an implementation with limited on-chip memory, BinaryNets are the minimum-energy solution for all accuracy goals, as the cost of DRAM interfacing becomes dominant. In the typical case of 4Mb, either BinaryNets, int2 or int4 networks are optimal depending on the required error rate. In a system with ∞Mb, hence without off-chip DRAM access, int2 and int4 are optimal. In all cases, int4 outperforms int8 by a factor of 2-5×, while the minimum-energy point consumes 2-10× less energy than the int8 implementations.
VI. CONCLUSION
This work presents a methodology to minimize the energy consumption of embedded neural networks by introducing QNNs, as well as a hardware energy model used for network topology selection. To this end, the BinaryNet training setup is generalized from 1-bit to Q-bit for intQ operators. This approach allows finding the minimum-energy topology and deriving several trends. First, energy consumption varies by orders of magnitude at iso-accuracy depending on the number of bits used. Second, the minimum-energy point at iso-accuracy varies between 1 and 4 bits for all tested benchmarks, depending on the available on-chip memory and the required accuracy. In general, int4 networks outperform int8 implementations by 2-6×. This suggests that the native float32/float16/int8 support in both low-power always-on applications and high-performance computing should be expanded with int4 to enable minimum-energy inference.
