Minimum Energy Quantized Neural Networks
Bert Moons+, Koen Goetschalckx+, Nick Van Berckelaer* and Marian Verhelst+
Department of Electrical Engineering*+ - ESAT/MICAS+, KU Leuven, Leuven, Belgium
Abstract—This work targets the automated minimum-energy
optimization of Quantized Neural Networks (QNNs) - networks
using low precision weights and activations. These networks are
trained from scratch at an arbitrary fixed point precision. At
iso-accuracy, QNNs using fewer bits require deeper and wider
network architectures than networks using higher precision
operators, while they require less complex arithmetic and fewer
bits per weight. This fundamental trade-off is analyzed and quantified
to find the minimum energy QNN for any benchmark and hence
optimize energy-efficiency. To this end, the energy consumption of
inference is modeled for a generic hardware platform. This allows
drawing several conclusions across different benchmarks. First,
energy consumption varies orders of magnitude at iso-accuracy
depending on the number of bits used in the QNN. Second, in
a typical system, BinaryNets or int4 implementations lead to
the minimum energy solution, outperforming int8 networks by
2-10× at iso-accuracy. All code used for QNN training is
available from https://github.com/BertMoons/.
Index Terms—Deep Learning, Quantized Neural Network,
Approximate Computing, Minimum Energy
I. INTRODUCTION
Deep learning [1], and more specifically Convolutional
Neural Networks (CNNs), have emerged as state-of-the-art
classification algorithms, achieving super-human performance
in applications in both computer vision (CV) and automatic
speech recognition. Although these networks are extremely
powerful, they are also very computationally and memory
intensive, making them difficult to employ on embedded or
battery-constrained systems. Today, training, or even neural
network inference, is therefore run on specialized, very fast and
power-hungry Graphics Processing Units (GPUs). Substantial
research efforts are spent on either speeding up or minimizing
the energy consumption of NNs at run-time on both
general-purpose and specialized computer hardware.
Minimizing energy consumption is especially crucial in
Neural Networks used in battery constrained, wearable and
always-on applications. In these systems, inference at the
edge is crucial, in order to reduce the latency and wireless
connectivity costs as well as the privacy concerns associated
with cloud-connectivity. Previous solutions were insufficient
for such purposes, as both the hardware platform and the
Neural Networks used are not sufficiently energy-efficient for
always-on, low-latency processing. For example, a simple always-on
face-detection task on the wearable platform characterized in [2]
drains its battery in less than 40 minutes.
In this work, we propose a framework to analyze
and optimize the trade-off between energy-consumption and
accuracy for Quantized Neural Networks (QNNs), hence al-
lowing co-design of hardware and algorithm towards minimum
energy consumption in any application.
II. RELATED WORK
Several approaches have been proposed to reduce the energy
footprint of DNNs. Most efforts are either in designing more
efficient algorithms, or in designing optimized hardware.
Dedicated hardware platforms are optimized for typical
dataflows and exploit network sparsity as well as the inherent
error-resilience of most Neural Networks. The highly parallel
nature of DNNs is exploited in any hardware DNN implemen-
tation [3], [4]. Han, et al. (2015), use clustered training and
trained pruning to reduce model sizes and propose a hardware
accelerator optimized for their compression scheme [5]. Other
recent hardware implementations propose solutions exploiting
sparsity, either by speeding up [6] or by increasing
energy-efficiency during sparse operation [3], [4]. Some works [3], [5]
expand upon that by also exploiting DNNs' inherent tolerance
to noise and variations by using reduced precision operators.
This reduces arithmetic power consumption and compresses
the model's memory footprint, at the expense of a potential
accuracy loss. All these works are implementations of existing
neural networks, rather than cross-field optimizations.
Algorithmic innovations towards new network architec-
tures with a smaller memory and energy footprint have been
made as well. Residual Neural Networks or ResNets [7]
provide an alternative to VGG-type [8] networks, with con-
siderably smaller network complexity and model sizes at
iso-accuracy. Recent works have shown how computational
complexity can be reduced by constraining the used compu-
tational precision during the training phase of a DNN. Most
notably, [9] can either constrain only the network weights, or
both weights and activations to +1 and -1. This is particularly
interesting from a hardware perspective, as such binary net-
work topology allows replacing all costly multiply operations
with an energy-efficient XNOR-operation. Other works have
proposed ternarynets [10], fixed-point analyses [11] and fixed-
point finetuning [12]. In [12], the maximum achieved accuracy
drops significantly at 2- or 4-bits. All these techniques are
ad-hoc, leading to sub-optimal results without offering full
control over the used computational precision. The work presented
in [13] shows that a number of benchmarks can be quantized
during inference down to 6 bits in the PyTorch framework,
but the authors do not retrain or train to an arbitrary number of
bits and have no way to compare the energy consumption
of the resulting models. General Quantized Networks have
been discussed in [14] and [15], where the authors run
ad-hoc training tests on specific network topologies, but do not
train them targeting minimum energy consumption.
Here, we offer explicit control over network quantization
for any number of bits and any network topology through
quantized training and link this to an inference energy model.
This allows cross-optimizing both the used algorithm and the
hardware architecture to enable always-on embedded applica-
tions. More specifically, our contributions are the following:
• We generalize BinaryNet training from 1- to Q-bits
(BinaryNet to intQ), where Q can be any value ∈ N.
arXiv:1711.00215v2 [cs.NE] 23 Nov 2017
Fig. 1. (a) Weight quantization. (b) Quantized ReLU activation function. (c) Quantized hardtanh activation function. Straight-through estimators (STE) are used to estimate gradients. Each panel plots Output against Input over [-1, 1] for the STE, BinaryNet, 2-bit and 4-bit curves; panel (b) shows no BinaryNet curve.
• We evaluate the energy-accuracy-computational preci-
sion trade-off for QNN inference by linking network
complexity and size to a full system energy-model.
• We conclude energy consumption at iso-accuracy varies
depending on the required accuracy, computational pre-
cision and the available on-chip memory. int4 implemen-
tations are often minimum energy solutions.
III. QNNS: QUANTIZED NEURAL NETWORKS
This section details our formulation of Quantized Neural
Networks (QNN), which use only fixed-point representations
for both weights and activations in training and inference.
In essence, QNNs are the generalization of binary- and
ternarynets [9], [10] to multiple bits, as in [14], [15].
Implementations in Keras/TensorFlow and Lasagne/Theano can be
found at https://github.com/BertMoons.
In QNNs, all weights and activations are quantized to Q
bits in a fixed point representation during training. All QNN
models converge at all values of Q. The following quantization
function is used to achieve this in the forward pass:
$$ q = \mathrm{clip}\left( \frac{\mathrm{round}(2^{Q-1} \times w)}{2^{Q-1}},\ -1,\ 1 - 2^{-(Q-1)} \right) \tag{1} $$
The Q=1 case is regarded as a special case, where q =
Sign(w), as in the original BinaryNet paper [14]. To successfully
propagate gradients through discrete neurons, a
"straight-through estimator" (STE) function is used for back
propagation, which leads to fast training [16]. If an estimator $g_q$ of
the gradient $\partial C / \partial q$ has been obtained, the STE of
$\partial C / \partial w$ is
$$ g_w / g_q = \mathrm{hardtanh}(w) = \mathrm{clip}(w, -1, 1) \tag{2} $$
As in [9], all real-valued weights are clipped during training
onto the interval [−1, 1]. Otherwise, real-valued weights would
grow large without impacting the quantized weights. The
weight quantization function q(w) and STE are plotted in
Fig. 1a for different Q. Activations use either a
quantized ReLU or a quantized hardtanh function.
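As an illustration of the weight quantizer and its straight-through estimator, the following minimal sketch applies Eq. (1) to scalar weights. It is not the released training code. The names `quantize` and `ste_backward` are hypothetical, the rounding mode (round half up) and the handling of sign(0) are assumptions, and the backward rule assumes the common reading of the STE, in which the quantizer is treated as hardtanh(w) so that the gradient passes unchanged for |w| ≤ 1 and is zero elsewhere.

```python
import math


def quantize(w: float, Q: int) -> float:
    """Forward-pass quantizer of Eq. (1); Q = 1 falls back to sign(w)."""
    if Q == 1:
        # Assumption: sign(0) is mapped to +1 so the output stays binary.
        return 1.0 if w >= 0 else -1.0
    scale = 2 ** (Q - 1)
    # Assumption: "round" is round-half-up; the text does not specify the mode.
    q = math.floor(w * scale + 0.5) / scale
    # Clip to the representable range [-1, 1 - 2^-(Q-1)].
    return min(max(q, -1.0), 1.0 - 1.0 / scale)


def ste_backward(g_q: float, w: float) -> float:
    """Straight-through estimate of dC/dw given g_q = dC/dq.

    Assumption: the quantizer is replaced by hardtanh(w) in the backward
    pass, so the gradient is g_q inside [-1, 1] and 0 outside.
    """
    return g_q if -1.0 <= w <= 1.0 else 0.0


if __name__ == "__main__":
    for Q in (1, 2, 4):
        print(Q, [quantize(w, Q) for w in (-1.0, -0.2, 0.3, 1.0)])
```

The upper clipping bound 1 − 2^−(Q−1) reflects that a Q-bit two's-complement code has one more negative level than positive levels.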
Multiple setups have been evaluated. Best results are
achieved with the quantized ReLU function for int2, int4 and
int8 and with the symmetrically quantized hardtanh function
for the Q=1 case. As in [9], all real-valued activations are
clipped during training onto the interval [−1, 1]. Every layer
following an activation layer will then have intQ inputs. The
quantized ReLU and hardtanh forward functions and
STEs are plotted in Fig. 1b and 1c for different Q.
In a QNN all the inputs to a layer are quantized to intQ,
with the exception of the first layer, which typically has int8
pixels as its inputs. In a general case with M input bits where
M > Q, an intQ layer can be performed as a sequence of
M/Q shifted and added dot products.
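The shift-and-add decomposition of the first layer can be sketched as follows. This is an illustrative example, not the authors' hardware mapping: `chunked_dot` is a hypothetical helper, the inputs are assumed to be unsigned M-bit integers, the weights are assumed to be integer intQ codes, and Q is assumed to divide M.

```python
def chunked_dot(xs, ws, M, Q):
    """Dot product of M-bit unsigned inputs with integer weights,
    computed as M/Q partial dot products on Q-bit input slices."""
    if M % Q != 0:
        raise ValueError("this sketch assumes that Q divides M")
    mask = (1 << Q) - 1
    total = 0
    for k in range(M // Q):
        shift = Q * k
        # Q-bit slice k of every input, least-significant slice first.
        slices = [(x >> shift) & mask for x in xs]
        # One intQ dot product per slice.
        partial = sum(s * w for s, w in zip(slices, ws))
        # Shift the partial result back to its bit position and accumulate.
        total += partial * (1 << shift)
    return total


if __name__ == "__main__":
    pixels = [200, 17, 255]   # unsigned 8-bit first-layer inputs (M = 8)
    weights = [1, -2, 3]      # example int4 weight codes (Q = 4)
    direct = sum(x * w for x, w in zip(pixels, weights))
    print(direct, chunked_dot(pixels, weights, M=8, Q=4))
```

With M = 8 and Q = 4 the layer is evaluated as two int4 dot products, which is the M/Q factor that reappears in the DRAM model of Section IV.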
IV. HARDWARE ENERGY MODEL
A generic, parameterized energy model, shown in Figure 2,
is used to assess the impact of QNNs on the energy con-
sumption of a typical inference platform. Global energy per
inference is the sum of the energy consumed in communication
with an off-chip DRAM and the energy consumption of the
processing platform itself. The total energy consumed per
network inference is then:
$$ E_{inf} = E_{DRAM} + E_{HW} \tag{3} $$
The sections below discuss a parameterized energy model,
which can be customized to a wide variety of processing
platforms by calibrating its parameters.
A. Energy consumption of off-chip Memory-Access
The available memory in an always-on chip is inherently
limited due to costs and leakage energy constraints and hence
typically insufficient to store full models and feature maps. If
this is the case, the chip will constantly have to communicate
with a larger off-chip memory system. The cost of this
interface is two orders of magnitude higher than a single
equivalent MAC-operation [17]. Using fewer bits for weights
and activations can hence be more energy efficient
if the achieved compression of both weights and
activations makes the network fit completely in on-chip memory.
Off-chip DRAM access energy is modeled as:
$$ E_{DRAM} = E_D \times \left( s_{in}^2 \times c_{in} \times M/Q + 2 \times f_r + w_r \right) \tag{4} $$
where $E_D$ is the energy consumed per intQ DRAM access, as
in Fig. 2. $s_{in}$, $c_{in}$ and $M/Q$ are respectively the input image's
dimensions, the number of input channels and the first-layer
factor defined in Section III. $f_r$ and $w_r$ are the number of
words that have to be re-fetched/stored from/to DRAM if a
feature map or model does not fit in the on-chip memory.
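As a worked illustration of Eq. (4), the sketch below evaluates the DRAM term for a small input. The function name `dram_energy` is hypothetical, the input image is assumed to be square with side s_in, and energies are expressed in arbitrary relative units (E_D = 1) rather than in calibrated values.

```python
def dram_energy(E_D, s_in, c_in, M, Q, f_r=0, w_r=0):
    """Off-chip DRAM energy per inference following Eq. (4).

    E_D  : energy per intQ DRAM access (any consistent unit)
    s_in : side length of the (assumed square) input image
    c_in : number of input channels
    M, Q : input and layer bit widths; M/Q is the first-layer factor
    f_r  : feature-map words that must be stored to and re-fetched from DRAM
    w_r  : weight words that must be re-fetched from DRAM
    """
    input_words = s_in ** 2 * c_in * M / Q
    return E_D * (input_words + 2 * f_r + w_r)


if __name__ == "__main__":
    # 32 x 32 RGB image with 8-bit pixels feeding an int4 network.
    print(dram_energy(E_D=1.0, s_in=32, c_in=3, M=8, Q=4))
    # Same input when 1000 feature-map words and 500 weights spill to DRAM.
    print(dram_energy(E_D=1.0, s_in=32, c_in=3, M=8, Q=4, f_r=1000, w_r=500))
```

The factor 2 on f_r accounts for each spilled feature-map word being written once and read once.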
Fig. 2. Used energy model for a NN platform based on [3] and [17]. (a) High-level overview of the system architecture. (b) Relative energy consumption per equivalent Multiply-Accumulate (MAC) operation ($E_{MAC}$), read/write from the local ($E_L$) and main ($E_M$) SRAM buffers and per read/write from a large DRAM memory ($E_D$) of an intQ word.
B. Hardware modeling
The hardware platform, shown in Fig. 2, is a typical
processing platform for CNNs, based on [3]. It contains a parallel
neuron array, with a fixed area for p MAC-units and two
levels of memory. A large main buffer enables storing $M_W$
bits of weights and $M_A = M_W$ bits of activations, of which
50% is used for the current layer's inputs and 50% is used
for the current layer's outputs. The small local SRAM or
register-file buffers contain the currently used weights and
activations. We model the relative energy consumption of SRAM
fetches and Multiply-Accumulate (MAC) operations according
to Horowitz [17]. Here, the energy consumption of a read/write
from/to the small local SRAM or register file, $E_L$, is modeled
to be equal to the energy of a single MAC operation, $E_{MAC}$,
while accessing the main SRAM costs $E_M = 2 \times E_{MAC}$.
Other operations, such as bias additions, quantized ReLU and
non-linear batch normalization, are modeled to cost $E_{MAC}$
as well. All these numbers incorporate control, data transfer
and clocking overheads. The total on-chip energy per inference
$E_{HW}$ is then the sum of the compute energy $E_C$ and the cost
of weight accesses $E_W$ and activation accesses $E_A$:
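$$
\begin{aligned}
E_C &= E_{MAC} \times (N_c + 3 \times A_s) \\
E_W &= E_M \times N_s + E_L \times N_c / \sqrt{p} \\
E_A &= 2 \times E_M \times A_s + E_L \times N_c / \sqrt{p}
\end{aligned}
\tag{5}
$$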
Here, $N_c$ is the network complexity in number of MAC
operations for partial product accumulation, $N_s$ is the model
size in number of weights and biases and $A_s$ is the total
number of activations throughout the whole network. Thus,
$E_C$ is the sum of all energy consumed in partial product
generation, biasing, batch-normalization and activation. The
latter three are performed on all activations $A_s$, hence the
term $3 \times A_s$ in the $E_C$ equation. Weights are transferred once
from the main to the local buffer and are then reused from
there, leading to the equation for $E_W$. Here, $\sqrt{p}$ is a reduction
in memory energy due to activation-level parallelism, as one
weight is used simultaneously on $\sqrt{p}$ activations. A similar
equation is derived for $E_A$, as activations are fetched/stored
from/to the main buffer. The number of local activation fetches
is reduced by $\sqrt{p}$ due to weight-level parallelism, as one
activation is simultaneously multiplied with $\sqrt{p}$ weights. The
total level of parallelism p is a function of Q, as the same area
containing p′ 16-bit MACs can hold p′ × 16/Q intQ MACs.
Similarly, an on-chip memory can store a variable number
of weights, depending on the value of Q. A 2Mb memory
stores more than 2M 1-bit weights, but only 131k 16-bit weights. If
either the weight size or feature map size exceeds the available
on-chip size, communication with a larger off-chip DRAM
memory will be necessary, as discussed in Section IV-A.
Fig. 3. (a) A QNN building block.
TABLE I
USED QNN TOPOLOGIES. n_A, n_B, n_C, F_A, F_B, F_C AND n ARE TAKEN AS PARAMETERS. ALL USED FILTERS ARE 3 × 3.
Block                Classical QNN
Input                -
Block A - 32 × 32    n_A × F_A × 3 × 3 + MaxPool(2,2)
Block B - 16 × 16    n_B × F_B × 3 × 3 + MaxPool(2,2)
Block C - 8 × 8      n_C × F_C × 3 × 3 + MaxPool(2,2)
Output               Dense - 4 × 4 × F_C
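The on-chip terms of Eq. (5), together with the dependence of parallelism and memory capacity on Q, can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, energies are in relative units with E_MAC = 1 by default, E_L = E_MAC and E_M = 2 × E_MAC as in the text, and 2Mb is taken as 2 × 2^20 bits.

```python
import math


def parallelism(p16, Q):
    """Number of intQ MACs that fit in the area of p16 16-bit MACs."""
    return p16 * 16 // Q


def weight_capacity(mem_bits, Q):
    """Number of Q-bit weights that fit in mem_bits of on-chip memory."""
    return mem_bits // Q


def on_chip_energy(N_c, N_s, A_s, p, E_MAC=1.0):
    """On-chip energy per inference, E_HW = E_C + E_W + E_A, from Eq. (5)."""
    E_L = E_MAC        # local SRAM / register-file access
    E_M = 2 * E_MAC    # main SRAM access
    reuse = math.sqrt(p)
    E_C = E_MAC * (N_c + 3 * A_s)             # MACs, bias, batch-norm, activation
    E_W = E_M * N_s + E_L * N_c / reuse       # weight accesses
    E_A = 2 * E_M * A_s + E_L * N_c / reuse   # activation accesses
    return E_C + E_W + E_A


if __name__ == "__main__":
    two_mb = 2 * 2 ** 20
    print(weight_capacity(two_mb, 1), weight_capacity(two_mb, 16))
    print(parallelism(64, 4))
    print(on_chip_energy(N_c=64, N_s=10, A_s=5, p=64))
```

The capacity helper reproduces the figures quoted in the text: 2,097,152 one-bit weights versus 131,072 16-bit weights in a 2Mb memory.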
V. EXPERIMENTS
A. QNN topologies
To quantify the energy-accuracy trade-off in QNNs, multiple
network topologies are evaluated. This is necessary, as network
performance not only varies with the used computational
accuracy, but also with the network depth and width.
Each tested network contains 4 stages, as shown in Fig. 3a
and Table I: 3 QNN-blocks, each followed by a max-pooling
layer, and 1 fully-connected classification stage.
Each QNN-block is defined by 2 parameters: the
number of basic building blocks n and the layer width F.
Every QNN-sequence is a cascade of a QNN-layer, followed
by a batch-normalization layer and a quantized activation
function, as shown in Fig. 3a. In this work, $F_{Block}$ is varied
from 32 to 512 and $n_{Block}$ from 1 to 3.
In order to reliably compare QNNs at iso-accuracy for
different n, $n_{Block}$, $F_{Block}$ and Q, first the Pareto-optimal
floating-point architectures in the energy-accuracy space are
derived. This can be done through an evolutionary architecture
optimization [19], but here we apply a brute-force search
across the parameter space. Once this Pareto-optimal front is
found, the same network topologies are trained again from
scratch, as QNNs with a varying number of bits.
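Extracting a Pareto-optimal front from a brute-force sweep can be sketched as follows. The helper `pareto_front` and the candidate names and numbers are purely hypothetical placeholders, not results from this work; each candidate is assumed to be a tuple of (name, energy, error rate), with lower values being better on both axes.

```python
def pareto_front(candidates):
    """Return the candidates that are not dominated in (energy, error).

    A candidate is dominated if another one has lower-or-equal energy
    and lower-or-equal error (and is not identical in both).
    """
    front = []
    best_error = float("inf")
    # Sweep in order of increasing energy; ties are broken by lower error.
    for name, energy, error in sorted(candidates, key=lambda c: (c[1], c[2])):
        # Keep a point only if it improves on every cheaper point's error.
        if error < best_error:
            front.append((name, energy, error))
            best_error = error
    return front


if __name__ == "__main__":
    # Made-up (name, relative energy, error rate) tuples for illustration.
    sweep = [
        ("net_a", 1.0, 0.20),
        ("net_b", 2.0, 0.15),
        ("net_c", 2.5, 0.18),
        ("net_d", 4.0, 0.12),
        ("net_e", 5.0, 0.12),
    ]
    for entry in pareto_front(sweep):
        print(entry)
```

Sorting by energy first makes the dominance check a single pass, which is sufficient for a two-objective sweep of this size.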
B. Results and discussion
The pareto-optimal set of QNNs is analyzed in search for
a minimum energy network. In this analysis, we vary model
parameters $M_W$ and $M_A$ and take $p = 64 \times 16/Q$. Based on
measurements in [3], we take $E_{MAC} = 3.7\,\mathrm{pJ} \times (16/Q)^{1.25}$.
Model sizes and inference complexity are shown in Fig. 4.
Here, computational complexity, model size and the maximum
feature map size are compared as a function of error rate and
Q for the pareto optimal classical QNN set on CIFAR-10.
Fig. 4a illustrates how the required computational complexity
decreases at iso-accuracy if Q is varied from 1-to-16-bit, as
networks with higher resolution require fewer and smaller
neurons at the same accuracy. At 12% error, for example,
the required complexity of a float16 network is 80 MMAC
operations, while the complexity of a BinaryNet at iso-accuracy
increases by 10× to 800 binary MMAC operations. On the other hand, the
model size in terms of absolute storage requirements increases
with the used number of bits. This is illustrated in Fig. 4b.
Here, an int4 implementation offers the minimum model size
of only 2Mb, at 12% error rate. BinaryNets require 50% more
model storage, while the float16 net requires at least 4× more.
Fig. 4c shows the storage required for feature maps as a
function of network accuracy. If this size exceeds the available
memory, DRAM access will be necessary. Here, BinaryNets
offer a clear advantage over intQ alternatives.
Fig. 5 and Fig. 6 illustrate the energy consumption and
the minimum energy point for classical QNN architectures.
Fig. 5 shows the error-rate vs energy trade-off for different
intQ implementations, for chips with a typical 4Mb of on-
chip memory. The optimal intQ mode varies with the required
precision for all benchmarks. At high error rates, BinaryNets
tend to be optimal. For medium and low error-rates mostly
int4-nets are optimal. At an error-rate of 13% on CIFAR-
10 in Fig. 5a, int4 offers a > 6× advantage over int8
and a 2× advantage over a BinaryNet. At 11%, BinaryNet
is the most energy-efficient operating point and respectively
4× and 12× more energy-efficient than the int8 and float16
implementations. The same holds for networks with 10% error.
However, these networks come at a 3× higher energy cost
than the 11% error rate networks, which illustrates the large
energy costs of increased accuracy. In an int4 network run
on a 4Mb chip, energy increases 3× going from 17% to 13% error,
while it increases 20× when going from 13% down to 10%.
Hence, orders of magnitude of energy consumption can be
saved, if the image recognition pipeline can tolerate slightly
less accurate QNN architectures. Fig. 6 compares the influence
of the total on-chip memory size $M_W + M_A$. In Fig. 6a, an
implementation with limited on-chip memory, BinaryNets are
the minimum energy solution for all accuracy goals, as the
cost of DRAM interfacing becomes dominant. In the typical
case of 4Mb, either BinaryNets, int2- or int4-networks are
optimal depending on the required error rate. In a system with
∞Mb, hence without off-chip DRAM access, int2 and int4
are optimal. In all cases, int4 outperforms int8 by a factor of
2-5×, while the minimum energy point consumes 2-10×
less energy than the int8 implementations.
VI. CONCLUSION
This work presents a methodology to minimize the energy
consumption of embedded neural networks, by introducing
QNNs, as well as a hardware energy model used for net-
work topology selection. To this end, the BinaryNet training
setup is generalized from 1-bit to Q-bit for intQ operators.
This approach allows finding the minimum energy topology
and deriving several trends. First, energy consumption varies
by orders of magnitude at iso-accuracy depending on the
used number of bits. The optimal minimum energy point
at iso-accuracy varies between 1- and 4-bit for all tested
benchmarks depending on the available on-chip memory and
the required accuracy. In general, int4 networks outperform
int8 implementations by up to 2-6×. This suggests that the
native float32/float16/int8 support in both low-power always-on
applications and high performance computing should be
expanded with int4 to enable minimum energy inference.
Fig. 4. QNN networks on CIFAR-10 [18]. (a) Computational complexity, (b) model size, (c) maximum feature map size and the number of bits Q.
Fig. 5. Error rate as a function of energy consumption for a typical 4Mb chip. (a) CIFAR-10 [18]. (b) MNIST [20]. (c) SVHN [21].
Fig. 6. Minimum energy plots for classical QNNs on CIFAR-10 for different chip-models. (a) 1Mb on-chip MEM. (b) 4Mb on-chip MEM. (c) ∞Mb on-chip MEM.
REFERENCES
[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
Nature, 521(7553):436–444, 2015.
[2] Robert LiKamWa, Zhen Wang, Aaron Carroll, Felix Xiaozhu Lin, and
Lin Zhong. Draining our glass: An energy and heat characterization of
Google Glass. In Asia-Pacific Workshop on Systems, 2014.
[3] Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst.
Envision: A 0.26-to-10 TOPS/W subword-parallel
dynamic-voltage-accuracy-frequency-scalable convolutional neural
network processor in 28nm FDSOI. In International Solid-State
Circuits Conference (ISSCC), 2017.
[4] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss:
An energy-efficient reconfigurable accelerator for deep convolutional
neural networks. IEEE Journal of Solid-State Circuits, 2016.
[5] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A
Horowitz, and William J Dally. EIE: Efficient inference engine on
compressed deep neural network. In ISCA, 2016.
[6] Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo. A novel zero
weight/activation-aware hardware architecture of convolutional neural
network. In Design, Automation and Test in Europe (DATE), 2017.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint:1409.1556,
2014.
[9] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Binarized neural networks. In Advances in Neural
Information Processing Systems (NIPS), 2016.
[10] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained
ternary quantization. arXiv preprint:1612.01064, 2016.
[11] Bert Moons, Bert De Brabandere, Luc Van Gool, and Marian Verhelst.
Energy-efficient convnets through approximate computing. In Winter
Conference on Applications of Computer Vision (WACV), 2016.
[12] Philipp Gysel. Ristretto: Hardware-oriented approximation of convolu-
tional neural networks. arXiv preprint:1605.06402, 2016.
[13] Tested model quantization. github.com/aaron-xichen/pytorch-playground.
Accessed: 2017-05-13.
[14] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and
Yoshua Bengio. Quantized neural networks: Training neural networks
with low precision weights and activations. arXiv preprint:1609.07061,
2016.
[15] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and
Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural
networks with low bitwidth gradients. arXiv preprint:1606.06160, 2016.
[16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating
or propagating gradients through stochastic neurons for conditional
computation. arXiv preprint:1308.3432, 2013.
[17] Mark Horowitz. Energy table for 45nm process. Stanford VLSI wiki.
[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of
features from tiny images. 2009.
[19] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena,
Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution
of image classifiers. arXiv preprint:1703.01041, 2017.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition. Proceedings
of the IEEE, 86(11), 1998.
[21] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu,
and Andrew Y Ng. Reading digits in natural images with unsupervised
feature learning. In NIPS workshop, 2011.
