Embedded Deep Neural Network Processing: Algorithmic and Processor Techniques Bring Deep Learning to IoT and Edge Devices by Verhelst, Marian & Moons, Bert
1943-0582/17©2017IEEE   IEEE SOLID-STATE CIRCUITS MAGAZINE fall 20 17  55
Deep learning has recently become immensely popular for image recognition, as well as
for other recognition and pattern-matching tasks in, e.g., speech processing,
natural language processing, and so 
forth. The online evaluation of deep 
neural networks, however, comes with 
significant computational complexity, making it, until recently, feasible
only on power-hungry server plat-
forms in the cloud. In recent years, 
we see an emerging trend toward em-
bedded processing of deep learning 
networks in edge devices: mobiles, 
wearables, and Internet of Things 
(IoT) nodes. This would enable us 
to analyze data locally in real time, 
which is not only favorable in terms 
of latency but also mitigates privacy 
issues. Yet evaluating the powerful 
but large deep neural networks with 
power budgets in the milliwatt or even 
microwatt range requires a signifi-
cant improvement in processing en-
ergy efficiency. 
To enable such efficient evalua-
tion of deep neural networks, optimi-
zations at both the algorithmic and 
hardware level are required. This 
article surveys such tightly interwo-
ven hardware-software process-
ing techniques for energy efficiency 
and shows how implementation-
driven algorithmic innovations, 
together with customized yet flex-
ible processing architectures, can 
be true game changers. To help
readers fully understand the im-
plementation challenges as well as 
opportunities for deep neural net-
work algorithms, we start by briefly 
summarizing the basic concept of 
deep neural networks.
The Birth of Deep Learning
Deep learning [1] can be traced back 
to neural networks, which have been 
around for many decades and were 
already gaining popularity in the 
early 1960s. A neural network is a 
brain-inspired computing system, 
Digital Object Identifier 10.1109/MSSC.2017.2745818
Date of publication: 16 November 2017
typically trained through supervised 
learning, whereby a machine learns 
a generalized model from many 
training examples, enabling it to 
classify new items. 
The trained classification model 
in such neural networks consists of 
several layers of neurons, wherein 
each neuron of one layer connects 
to each neuron of the next layer, as 
illustrated in Figure 1. The output 
of the network indicates the prob-
ability that a certain object class is 
observed at the network’s input. In 
such a network, every individual neu-
ron creates one output o, which is a 
weighted sum of its inputs i. For the 
nth neuron of layer l, this can be formalized as
$$o_{ln} = \left( \sum_{m} w_{lmn} \cdot i_{lm} \right) + b_{ln} \qquad (1)$$
The weights $w_{lmn}$ and biases $b_{ln}$ are
the flexible parameters of the net-
work that enable it to represent a 
particular desired input/output map-
ping for the targeted classification. 
They are trained with supervised 
training examples in an initial off-
line training phase, after which the 
network can classify new examples 
presented to its inputs, a process typi-
cally referred to as inference.
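As a minimal sketch of equation (1), the following Python function evaluates one fully connected layer. The name `dense_layer` and the list-of-lists layout `weights[m][n]` are illustrative assumptions rather than notation from the article; the optional `activation` argument stands in for the nonlinearity σ shown in Figure 1.

```python
def dense_layer(inputs, weights, biases, activation=None):
    """Evaluate o_n = sum_m(weights[m][n] * inputs[m]) + biases[n] for every neuron n.

    weights[m][n] is the (assumed) weight connecting input m to neuron n.
    """
    outputs = []
    for n, bias in enumerate(biases):
        acc = bias
        for m, value in enumerate(inputs):
            acc += weights[m][n] * value
        # Apply the optional nonlinearity (sigma in Figure 1).
        outputs.append(activation(acc) if activation else acc)
    return outputs


# Two inputs feeding three neurons.
example_weights = [[0.5, -1.0, 0.0],
                   [0.25, 0.5, 1.0]]
example_biases = [0.0, 1.0, -3.0]
example = dense_layer([1.0, 2.0], example_weights, example_biases)
```

Inference with a trained network then amounts to chaining such layer evaluations, feeding each layer's outputs to the next layer as inputs.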
Such neural networks have been 
used for decades in several applica-
tion domains. In a classical pattern-
recognition pipeline [Figure 2(a)], 
features are generated from an input 
image by an application-specific fea-
ture extractor, hand-designed by an 
expert engineer. This preliminary 
feature extraction step was necessary 
because, at that time, one could use 
only small neural networks with a 
limited number of layers that did not 
have the modeling capacity required 
for complex feature extraction from 
raw data. Larger neural networks were 
impossible to train due to noncon-
vergence issues, lack of sufficiently 
large data sets, and insufficient com-
pute power.
Yet, after a long winter for neural 
networks in the 1970s and 1980s, 
they regained momentum in the 
1990s and again in the 2010s. The 
increasing availability of pow-
erful compute servers and graph-
ics processing units (GPUs), the 
abundance of digital data sources, 
and innovations in training mecha-
nisms allowed training deeper and 
deeper networks, with many layers of 
neurons. This meant the start of a new 
era for classification, as it allowed training networks with enough
Figure 1: A traditional fully connected neural network is made up of layers of neurons. Every neuron makes a weighted sum of all its inputs, followed by a nonlinear transformation: o_ln = σ(Σ_m w_lmn × i_lm + b_ln). (Diagram: inputs i_11 to i_15, three layers of neurons with outputs o_11 to o_33, and output classes “Car?”, “House?”, and “Dog?”.)


Figure 2: (a) Traditionally, machine learning classifiers were trained and applied on hand-crafted features (e.g., edges, gradients, corners, HOG). (b) The advent of deep learning allowed the network to learn and extract the optimal feature sets. (c) Such a network trains itself to extract very coarse, low-level features in its first layers, then finer, higher-level features in its intermediate layers, and, finally, targets full objects in the last layers. (Each panel maps an image through feature extraction and a trained classifier to a class label such as “House”.) HOG: histogram of oriented gradients.
modeling capacity to operate directly 
on raw data [Figure 2(b)]. Such “deep
learning networks” thus fulfilled 
the role of both feature extractor 
and classifier. 
A deeper network can automati-
cally learn the best possible features 
during its training phase, instead of 
relying on features hand-crafted by 
humans. When inspecting trained 
networks, one can see that a deep 
neural network trains itself to extract 
very coarse, low-level features in its 
first layers and finer, higher-level 
features in its intermediate layers 
and then targets full objects in the 
last layers [Figure 2(c)]. 
A network’s ability to learn the
optimal features significantly
boosted the classification accuracy 
of such networks, resulting in their 
true breakthrough: deep learning was 
born. Over the last decade, deep learn-
ing has, as such, been able to move to 
deeper and deeper network architec-
tures, enabling tremendous improve-
ments in achievable classification 
accuracy, as illustrated by the results 
from the yearly ImageNet challenge 
(Figure 3) [2].
Deep Neural Network Topologies
Another crucial factor in the break-
through of deep learning technol-
ogy is the advent of new network 
topologies. Classical neural networks—
which rely on so-called fully con-
nected layers, with each neuron of 
one layer connected to each neuron 
of the next layer (Figure 1) —suffer 
from a very large number of training 
parameters. For a network with L
layers of N neurons each, L · (N² + N)
parameters must be trained. Knowing that N can easily reach the
order of a million (e.g., for images 
with a million pixels), this large 
parameter set becomes impractical
and untrainable.
For many tasks (mainly in image 
processing and computer vision), 
convolutional neural networks (CNNs) 
are more efficient. These CNNs, inspired by visual neuroscience,
organize the data in every network layer
as three-dimensional (3-D) tensors. 
The first part of the network consists of a sequence of
convolutional layers and pooling layers,
replacing the traditional fully connected layers. A convolutional layer
transforms a 3-D input tensor I (of size H × H × C) into a 3-D output
tensor O (of size M × M × F).
As illustrated in Figure 4, each 
element of the output tensor O  does 
not need all elements of the input 
tensor I  to be computed. Instead, it 
Figure 3: The classification results of the ImageNet challenge have seen enormous boosts in accuracy since the appearance of deep learning submissions. (Data from [2].) Top-5 classification errors (%): ILSVRC’10, 28.2, and ILSVRC’11, 25.8 (hand-crafted features); ILSVRC’12 AlexNet (eight layers), 16.4; ILSVRC’13 (eight layers), 11.7; ILSVRC’14 VGG (19 layers), 7.7; ILSVRC’14 GoogleNet (22 layers), 6.7; ILSVRC’15 ResNet (152 layers), 3.57; human, 5.1. ImageNet challenge: 1,000 classes, 1.3 M training images, 50 k validation images, and 100 k testing images. ILSVRC: ImageNet Large-Scale Visual Recognition Challenge; AlexNet: a CNN named for Alex Krizhevsky; VGG: a network from the Visual Geometry Group at Oxford University; ResNet: Residual Net.
Figure 4: The topology and pseudocode of one layer of a typical CNN. The pseudocode is for one layer of the network. MAC: multiply-accumulate operation.
Topology: alternating convolutional, ReLU, and max-pooling stages (trained feature extraction), followed by fully connected layers (classification). A layer maps an input of size H × H × C to F output slices of size M × M using K × K kernels.
Per output pixel of a layer (repeated F·M² times per layer):
• Load C·K² weights
• Load C·K² inputs
• Do C·K² MACs
• One output store
for (int f = 0; f < F; f++)                  // output slices
  for (int mx = 0; mx < M; mx++)             // output rows
    for (int my = 0; my < M; my++)           // output columns
      for (int c = 0; c < C; c++)            // input channels
        for (int kx = 0; kx < K; kx++)       // kernel rows
          for (int ky = 0; ky < K; ky++)     // kernel columns
            o[f][mx][my] += w[f][c][kx][ky] * i[c][mx + kx][my + ky];
is connected only locally to a patch 
of the input tensor of size K × K × C
through a trainable 3-D kernel W (of
size K × K × C) and a bias B. A formal
mathematical description to compute the outputs of a convolution
layer l is given as
$$O_{lfxy} = \left( \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} I_{lc(x+i)(y+j)} \cdot W_{lfcij} \right) + B_{lf}$$
The result of the local sum com-
puted in this filter bank is then 
passed through a nonlinearity layer, 
typically a rectified linear unit (ReLU), 
using the nonlinear activation function f(u) = max(0, u). This output
can finally be processed by a max-
pooling layer, which outputs only the 
maximum of a local patch (typically 
2 × 2 or 3 × 3) of output units to the 
next layer. This thereby reduces the 
dimension of the feature representa-
tion and creates invariance to small 
shifts and distortions in the inputs. 
A modern CNN consists of tens [3] 
to hundreds [4] of such alternating 
convolutional and max-pooling lay-
ers, typically followed by one to 
three classification layers, imple-
mented using the traditional fully 
connected neurons (Figure 4).
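The convolution, ReLU, and max-pooling steps described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the convolution uses stride 1 without padding (so M = H − K + 1, a relation the article does not state), and pooling windows do not overlap.

```python
def conv_layer(inp, weights, biases):
    """inp[c][x][y], weights[f][c][i][j], biases[f] -> out[f][x][y].

    Assumes stride 1 and no padding, so each output side is M = H - K + 1.
    """
    C, H = len(inp), len(inp[0])
    F, K = len(weights), len(weights[0][0])
    M = H - K + 1
    # Start every output element at its bias, then accumulate the local sum.
    out = [[[biases[f] for _ in range(M)] for _ in range(M)] for f in range(F)]
    for f in range(F):
        for x in range(M):
            for y in range(M):
                for c in range(C):
                    for i in range(K):
                        for j in range(K):
                            out[f][x][y] += inp[c][x + i][y + j] * weights[f][c][i][j]
    return out


def relu(tensor):
    """Element-wise f(u) = max(0, u)."""
    return [[[max(0, v) for v in row] for row in plane] for plane in tensor]


def max_pool(tensor, size=2):
    """Non-overlapping size x size max-pooling on every output slice."""
    pooled = []
    for plane in tensor:
        n = len(plane) // size
        pooled.append([[max(plane[x * size + dx][y * size + dy]
                            for dx in range(size) for dy in range(size))
                        for y in range(n)] for x in range(n)])
    return pooled


# One 3 x 3 input channel, one 2 x 2 kernel, bias -10.
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernel = [[[[1, 0], [0, 1]]]]
conv_out = conv_layer(image, kernel, [-10])
pooled_out = max_pool(relu(conv_out))
```

The six nested loops mirror the loop nest of Figure 4; only the bias initialization and the two post-processing functions are added.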
It is important to note that the 
same convolution kernel W  and bias 
B are used to compute all (M × M)
outputs of one slice in the output tensor. As such, every layer of the
network needs only F × (K × K × C + 1)
parameters. With K  typically rang-
ing between one and seven and F and 
C on the order of tens or hundreds, 
this method allows the creation of 
very large networks while keeping 
the number of trainable parameters 
under control—all of which gave 
deep learning its significant boost.
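To make the difference in parameter counts concrete, the two formulas from the text can be evaluated directly. The function names and the example layer dimensions below are illustrative assumptions.

```python
def fully_connected_params(num_layers, neurons_per_layer):
    """L * (N^2 + N): weights plus biases of L fully connected layers."""
    return num_layers * (neurons_per_layer ** 2 + neurons_per_layer)


def conv_layer_params(F, K, C):
    """F * (K * K * C + 1): one K x K x C kernel and one bias per output slice."""
    return F * (K * K * C + 1)


# A single fully connected layer on a one-megapixel input ...
fc_count = fully_connected_params(1, 1_000_000)
# ... versus a convolutional layer with 64 output slices, 3 x 3 kernels, 64 input channels.
conv_count = conv_layer_params(F=64, K=3, C=64)
print(f"fully connected: {fc_count:,} parameters")
print(f"convolutional:   {conv_count:,} parameters")
```

Because the kernel is shared across all M × M output positions, the convolutional count is independent of the image size.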
The majority of recent state-
of-the-art deep learning networks 
rely on such CNNs. The optimal net-
work architecture, characterized 
by the number of cascading stages 
and the values of model parameters F, H, C, K, and M, varies for
each specific application. Over the 
last few years, various alterations 
have been proposed to this standard topology, such as introducing
feed-through connections in
ResNets [4], concatenating very small 
convolutions in inception networks 
[5], stacking depthwise and pointwise 
convolutions in Xception networks 
[6], extracting full-image dense mul-
tiscale features using DenseNets 
[7], or adding recurrent connections in recurrent neural networks (RNNs)
or long short-term memories [8].
These, however, lie beyond the scope 
of this tutorial.
Challenges for Embedded  
Deep Inference
Both the training of a deep network 
and its own inferences to perform 
new classifications are now typically 
executed on power-hungry serv-
ers and GPUs [Figure 5(a)]. There is, 
however, a strong demand to move 
the inference step, in particular, out 
of the cloud and into mobiles and 
wearables to reduce latency and mitigate
privacy issues [Figure 5(b)]. Yet current devices lack the
capabilities to enable deep inferences for
real-life applications. 
Recent neural networks for image 
or speech processing easily require 
more than 100 giga-operations (GOP)/s 
to 1 tera-operations (TOP)/s, as well 
as the ability to fetch millions of 
network parameters (kernel weights 
and biases) per network evaluation. 
The energy consumed in these numer-
ous operations and data fetches is 
the main bottleneck for embedded 
Figure 5: Concerns regarding user privacy, recognition latency, and energy wasted on raw data transmission push deep learning inferences from (a) the cloud to (b) the embedded device. In (a), the embedded device (scarce resources) transmits raw data and receives the classification result, while the cloud (infinite resources) performs training and inference. In (b), the cloud performs training and sends the network parameters to the embedded device, which performs inference. Tx/Rx: transmitter/receiver; uP: microprocessor.
inference in energy-scarce milliwatt or 
microwatt devices. Currently, micro-
controllers and embedded GPUs 
are limited to efficiencies of a few 
tens to hundreds of GOP/W, while 
embedded inference will only be 
fully enabled with efficiencies well 
beyond 1 TOP/W. Overcoming this 
bottleneck is possible yet requires 
a tight interplay between algorith-
mic optimization (modifying the 
network topology) and hardware 
optimization (modifying the process-
ing architectures). 
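A short back-of-the-envelope calculation shows why efficiency is the bottleneck. The sketch below reads "GOP/W" as giga-operations per second per watt, which is an assumption about the article's shorthand; the function name and the chosen efficiency points are likewise illustrative.

```python
def required_power_watts(throughput_gops, efficiency_gops_per_watt):
    """Power needed to sustain a throughput (GOP/s) at a given efficiency (GOP/s per W)."""
    return throughput_gops / efficiency_gops_per_watt


workload = 100.0  # GOP/s, the lower end of the range quoted in the text
for label, efficiency in [("100 GOP/W", 100.0), ("1 TOP/W", 1000.0), ("10 TOP/W", 10000.0)]:
    milliwatts = required_power_watts(workload, efficiency) * 1000
    print(f"{label}: {milliwatts:.0f} mW")
```

Under these assumptions, a 100-GOP/s workload needs about a watt at 100 GOP/W and only reaches the tens-of-milliwatts range once efficiency is well beyond 1 TOP/W.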
The following section elaborates on 
the most promising optimizations currently being explored toward
energy-efficient, embedded deep inference. The
focus here is on the energy-efficient 
execution of convolutional layers, 
which form the bulk of the workload 
during inference. However, several tech-
niques can also be applied to fully con-
nected layers.
Algorithmic and Architectural  
Techniques for Energy Efficiency
GPUs and central processing units 
(CPUs) are extremely flexible, general-
purpose machines. While this makes 
them widely deployable and easy to use 
and program, it also limits their effi-
ciency because they cannot exploit 
several computational aspects of 
deep inference networks, resulting 
in both a memory bottleneck and a 
computational bottleneck. More spe-
cifically, deep inference networks 
have three typical characteristics 
that can be exploited, or further enhanced, to improve execution
energy efficiency:
1) Deep learning networks exhibit a 
very particular data flow with a 
large amount of potential paral-
lelism and data reuse. This can, 
moreover, be manipulated dur-
ing network training by playing 
with the F, H, C, K, and M parameters of the network.
2) Deep learning networks prove 
to be quite robust to approxima-
tions or fault introductions. This 
is exploited in various reduced-
precision hardware implementa-
tions. Also, this characteristic can 
be manipulated when training the 
network, allowing it to find the 
best tradeoff between a low com-
plexity and a robust network.
3) Deep learning networks dem-
onstrate large sparsity. Many 
parameters become very small, 
even equal to zero, after network 
training. Also, many data values 
propagated with the network 
during evaluation become zero. 
This can be exploited to reduce 
operations and memory fetches 
in hardware yet can also be stim-
ulated further with innovative 
training techniques.
We will show how, for each of these 
three aspects, hardware can benefit 
from the network’s characteristics 
but also how, during the algorith-
mic training phase of the network, 
it is possible to additionally opti-
mize the particular characteristic to 
reach even greater efficiency gains. 
As such, it is clear that the hardware 
and algorithmic level need to closely 
cooperate not only to exploit but also 
to enhance the network’s character-
istics toward the most efficient hard-
ware-software realization. All of the 
techniques highlighted in this article 
are summarized in Figure 6.
Enhancing and Exploiting  
Network Structure
In many application areas, designers 
have improved the energy efficiency 
of embedded network evaluation by 
moving away from general-purpose 
processors and developing custom-
ized hardware accelerators. Such 
accelerators can exploit the known 
data flows within the algorithm to 
1) enhance the parallel execution of 
the algorithm as well as 2) minimize 
the number of data movements (Figure 7). Descriptions of several
application-specific integrated circuits
targeting the efficient execution of 
convolutional and fully connected 
layers have recently been published.
All solutions exhibit a very large 
degree of parallelization, far beyond 
CPU parallelism. This shows in data paths containing
a few hundred to thousands of
multiply accumulators (MACs), with 
Google’s recent tensor processing 
unit as an extreme example (64,000 
MACs) [9].
Figure 6: An overview of the algorithmic and processor architecture techniques discussed to increase efficiency and enable the inference of deep neural networks in embedded devices. The tightly linked algorithmic and processor architecture techniques address the memory bottleneck and the computational bottleneck in three groups: A) enhancing and exploiting network structure; B) enhancing and exploiting fault tolerance; and C) enhancing and exploiting network sparsity. Techniques listed: spatial data reuse; hierarchical memory exploiting data locality; highly parallel architectures; distributed processing; quantized training; stochastic memories; network pruning; network compression and weight sharing; (dynamic) fixed point; analog and statistical processing; memory and computational gating; compressed computing.
Providing data to all these functional units in parallel would be
nearly impossible if the temporal and
spatial locality of the data were not
exploited. Indeed, many computa-
tions within one network layer share 
common inputs. More specifically, as 
highlighted in the pseudocode shown 
in Figure 4, every weight parameter 
is reused approximately M² times
across multiple convolutions of the 
same slice in the output tensor, and 
every input data point is reused 
across F  different slices of the out-
put tensor. Moreover, the intermedi-
ate accumulation results o  have to be 
accumulated C · K² times. This can, in
a custom accelerator, be exploited in 
several ways to further boost efficien-
cies beyond the highly parallel, yet 
not data-flow-optimized, GPUs.
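The reuse factors quoted above can be tabulated for a given layer. The following sketch assumes the loop nest of Figure 4 with no reuse at all as the naive baseline; the function name, the dictionary keys, and the example dimensions are illustrative.

```python
def conv_layer_workload(F, C, K, M):
    """Count operations and reuse opportunities for one convolutional layer."""
    macs_per_output = C * K * K   # accumulations needed per output pixel
    outputs = F * M * M           # output pixels per layer
    macs = outputs * macs_per_output
    return {
        "outputs": outputs,
        "macs": macs,
        "naive_weight_loads": macs,       # one weight fetch per MAC without reuse
        "naive_input_loads": macs,        # one input fetch per MAC without reuse
        "unique_weights": F * C * K * K,  # kernel values only, biases excluded
        "weight_reuse": M * M,            # each weight serves one whole output slice
        "input_reuse_across_slices": F,   # each input feeds all F output slices
        "accumulations_per_output": macs_per_output,
    }


small_layer = conv_layer_workload(F=2, C=2, K=2, M=3)
```

Dividing `naive_weight_loads` by `unique_weights` recovers the M² weight reuse factor, which is the margin a custom data path can try to capture.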
Data reuse can be exploited by 
reusing the same data across multi-
ple parallel execution units or, equiv-
alently, across multiple time steps on 
the same execution unit. In this topol-
ogy, three extreme cases can be dis-
tinguished, as shown in Figure 8. 
The first multiplies the same input 
data value with several weights of 
a layer’s different output channels. 
This is also called weight parallel or 
input stationary. In this implementa-
tion, every input will ideally be loaded 
into the system only once. This, how-
ever, has negative repercussions on 
the weight memory bandwidth, as 
the weights must be reloaded fre-
quently (every time a new input is 
applied). Moreover, the accumula-
tion of the output o  cannot be per-
formed across different clock cycles, 
requiring intermediate accumulation 
results o  to be pushed into mem-
ory and refetched later, strongly im-
pacting the input/output memory 
bandwidth. A similar scheme fetches 
every weight once and multiplies it 
with many input values. This “weight 
stationarity” or “input parallelism” 
improves the weight memory band-
width, yet at the expense of the in-
put memory bandwidth. Finally, the 
output stationary scheme reloads 
new weights and inputs every single 
clock cycle and yet is able to accumu-
late the intermediate results locally 
within the MAC unit across different 
clock cycles, to the benefit of the out-
put memory bandwidth. 
All these optimizations can be 
seen as a reshuffling of the nested 
loops in the pseudocode of Figure 4. 
Of course, in practice, most realiza-
tions implement a hybrid form of 
the three presented extreme cases. 
Examples include [23] and [24], where 
a two-dimensional (2-D) data path 
multiplies every input with several 
weights, while every weight is also 
multiplied with several inputs, and 
[10], where the input and output 
stationarities are optimized to mini-
mize the chip input/output band-
width. Which parallelization scheme 
is optimal depends strongly on the 
network’s dimensions (the parameters F, H, C, K, and M),
which allows co-optimization of the hardware and
Figure 8: Different architectural topologies allow data reuse to be maximized, reusing either inputs, weights, intermediate results, or a combination of the three. BW: bandwidth.

            Input stationary    Weight stationary   Output stationary   Hybrids
            (weight parallel)   (input parallel)
Input BW    Low                 High                High                Medium
Weight BW   High                Low                 High                Medium
Output BW   High                High                Low                 Medium
Figure 7: Custom deep neural network processors gain efficiency by minimizing data movements and maximizing parallelism. Still, it is crucial not to lose all flexibility in mapping a wide variety of networks. (Diagram: a MAC array fed by a weight memory and an input/output memory, under FSM or processor control.) FSM: finite-state machine.
the network itself. A more elaborate 
overview of the different paralleliza-
tion schemes can be found in [11] 
and [12], along with an assessment of 
their merits.
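The reshuffling of the nested loops can be sketched with a toy fetch counter. This is a deliberately crude model and an assumption of this example, not a description of any cited accelerator: it keeps a single register per operand type and counts a fetch whenever the required address differs from the one held in that register. Real designs with local buffers capture more reuse than this, but the trend across loop orders matches the table of Figure 8.

```python
from itertools import product


def count_fetches(order, F=2, C=2, K=2, M=3):
    """Count operand fetches for the Figure 4 loop nest executed in a given loop order.

    order lists the loop variables from outermost to innermost.
    """
    sizes = {"f": F, "c": C, "kx": K, "ky": K, "mx": M, "my": M}
    held = {"weight": None, "input": None, "output": None}
    fetches = {"weight": 0, "input": 0, "output": 0}
    for values in product(*(range(sizes[name]) for name in order)):
        v = dict(zip(order, values))
        needed = {
            "weight": (v["f"], v["c"], v["kx"], v["ky"]),
            "input": (v["c"], v["mx"] + v["kx"], v["my"] + v["ky"]),
            "output": (v["f"], v["mx"], v["my"]),
        }
        for kind, address in needed.items():
            if address != held[kind]:  # register miss: fetch and replace
                fetches[kind] += 1
                held[kind] = address
    return fetches


output_stationary = count_fetches(("f", "mx", "my", "c", "kx", "ky"))
weight_stationary = count_fetches(("f", "c", "kx", "ky", "mx", "my"))
input_stationary = count_fetches(("c", "mx", "my", "kx", "ky", "f"))
```

Placing the loops that do not affect an operand's address innermost keeps that operand stationary, at the price of more traffic for the other two operands.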
A complementary way to reduce 
the energy burden of continuous 
data fetches is not to minimize the 
number of data fetches but rather to 
reduce the energy cost of every data 
fetch by exploiting temporal data 
locality. Most realistic deep networks 
require so much weight and input/out-
put memory (megabytes to gigabytes) 
that it is impossible to fit them in
on-chip memory, thus requiring fetches
from energy-costly external dynamic 
random-access memory (DRAM). Simi-
lar to traditional processors, this can, 
however, be mitigated by a memory 
hierarchy having one or more levels of 
on-chip static RAM (SRAM) or register 
files. Frequently accessed data can, as 
such, be stored locally to reduce its 
fetching cost (Figure 9). 
An important difference with gen-
eral-purpose solutions, however, is 
that the sizes of the memories in the 
hierarchy can be optimized toward 
the network’s structure, e.g., foresee-
ing a local memory capable of cach-
ing exactly one weight tensor, or one 
of the tensor [11]. Even more impor-
tantly, the networks can be trained 
with the processor’s memory hierar-
chy in mind. As such, networks have, 
e.g., been explicitly trained to com-
pletely fit in on-chip memory. This 
optimization is, of course, highly 
interwoven with the parallelization 
scheme. By jointly optimizing these, 
one can adjust the degree of parallel-
ization to the memory hierarchy and 
minimize the product of the number 
of memory accesses with the cost of 
every memory access [13].
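The quantity being minimized, the number of accesses multiplied by the cost of each access, can be written down directly. The per-word energies below are assumed round numbers chosen only to match the orders of magnitude indicated in Figure 9; the access counts are hypothetical.

```python
# Assumed energy per word in picojoules (orders of magnitude from Figure 9).
ENERGY_PJ_PER_WORD = {
    "dram": 200.0,       # "hundreds of pJ/word"
    "sram": 20.0,        # "tens of pJ/word"
    "local_sram": 2.0,   # "pJ/word"
    "register": 0.5,     # "<pJ/word"
}


def memory_energy_pj(accesses, energy_per_word=ENERGY_PJ_PER_WORD):
    """Sum over memory levels of (number of accesses) x (energy per access)."""
    return sum(count * energy_per_word[level] for level, count in accesses.items())


# 1,000 word fetches served entirely from DRAM ...
flat = memory_energy_pj({"dram": 1000})
# ... versus the same fetches mostly served by on-chip levels.
hierarchical = memory_energy_pj({"dram": 100, "sram": 300, "register": 600})
```

With these assumed numbers, the hierarchy cuts the fetch energy by roughly a factor of eight without changing the number of fetches.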
Distributed and systolic process-
ing can be seen as an extreme type 
of such hierarchical memories. In the 
systolic processing concept, a 2-D 
array of functional units processes 
data locally and passes inputs and 
intermediate results from unit to 
unit instead of to/from global mem-
ory. These functional units are each 
equipped with a very small SRAM (as 
in [14]) or even just registers (as in [9]) 
to store data locally and maximize 
data reuse within the array. Process-
ing happens as a systolic wavefront 
through the array, wherein weight 
coefficients can be kept stationary in 
the functional units, input data are 
shifted in one direction through the 
array, and output data accumulate in 
the orthogonal direction. This allows
a very large number of computations for convolution
or matrix multiplication to be performed in parallel,
keeping all systolic elements busy
without burdening the memory band-
width. Interested readers are pointed 
to [15] and [9] for more details.
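The systolic wavefront can be illustrated with a small scheduling model. The sketch below assumes a weight-stationary array in which processing element (k, n) holds W[k][n] and handles input row b at cycle b + k + n; it models only the timing of the wavefront, not the registers of the arrays in [9] or [14], and all names are hypothetical.

```python
def systolic_matmul(X, W):
    """Compute Y = X . W following a weight-stationary wavefront schedule.

    Inputs shift along n and partial sums accumulate along k, so the
    contribution X[b][k] * W[k][n] is produced at cycle b + k + n.
    Returns (Y, busy), where busy[t] is the number of active elements at cycle t.
    """
    B, K, N = len(X), len(W), len(W[0])
    Y = [[0] * N for _ in range(B)]
    busy = []
    for t in range(B + K + N - 2):       # last active cycle is B + K + N - 3
        active = 0
        for k in range(K):
            for n in range(N):
                b = t - k - n            # input row passing element (k, n) now
                if 0 <= b < B:
                    Y[b][n] += X[b][k] * W[k][n]
                    active += 1
        busy.append(active)
    return Y, busy


result, utilization = systolic_matmul([[1, 2], [3, 4], [5, 6]], [[1, 2], [3, 4]])
```

The `busy` trace shows the array filling up, running fully occupied, and draining; longer input streams spend proportionally more cycles in the fully occupied phase.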
Such systolic operation opens the 
door to in-memory computing, where 
the computation is integrated inside 
the memory array. While this is also 
pursued in traditional memory archi-
tectures, the results look especially 
promising for emerging nonvolatile 
memory arrays. For example, in resistive 
memory technologies, a multiplication 
can be implemented by exploiting 
the memory cell’s conductance as the 
kernel weight, while accumulating cur-
rent from different elements to imple-
ment the convolution’s accumulation 
operation [16]. However, this technol-
ogy currently still suffers from large 
variability, limiting applications to very 
low-resolution operations with very lim-
ited kernel and network sizes.
While all the aforementioned tech-
niques can dramatically boost the 
system’s throughput and energy effi-
ciency, it is important to keep an eye 
on their impact on the design’s pro-
grammability and flexibility. Espe-
cially in the fast-paced area of deep 
learning, it is of the utmost impor-
tance to maintain sufficient flexibility 
toward alternative network dimensions 
and novel network topologies. Most 
accelerators, however, succeed in this 
by enabling the acceleration of matrix 
multiplications (for the fully con-
nected layers) and convolutions (for 
the convolution layers) of any size, 
yet with maximal efficiency for a sub-
set of sizes.
Enhancing and Exploiting  
Fault Tolerance
A second important aspect of deep 
neural networks that can be exploited 
in custom processor designs is their 
fault tolerance. Many studies observe 
the robustness of CNNs and other 
networks to perturbations on their 
weight parameters and intermediate 
computational results [17], [18]. This 
can be exploited at both the hardware
and the algorithmic level
in several ways.
Figure 9: A well-designed memory hierarchy avoids drawing all weights and input data from the costly DRAM interface and stores frequently accessed data locally. pJ: picojoule.

Memory level               Size   Access energy
Off-chip DRAM              GB     Hundreds of pJ/word
On-chip SRAM               MB     Tens of pJ/word
Local SRAM                 kB     pJ/word
Registers (in MAC array)   B      <pJ/word
A straightforward way to ben-
efit from the network’s fault toler-
ance is to perform the computations 
at reduced computational accuracy 
with limited recognition loss. Typical benchmarks can be run in 1- to 9-b
fixed point rather than 32-b floating
point at less than 1% accuracy loss
[18]. This is possible by quantizing 
all weights of a floating-point-trained 
network before execution. Improved 
results can be obtained when intro-
ducing quantization during the train-
ing step itself [19], [38], resulting in 
smaller or lower-precision networks 
for the same application accuracy. As 
an extreme example, networks have 
been specifically trained to oper-
ate with only 1-b representations of 
weights alone [20] as well as with 
both weights and activations [20], 
[21] wherein all multiplications can 
be replaced by efficient XNOR operations [22]. In [20], a binary-weight
version of AlexNet is only 2.9% less
accurate (in top-1 accuracy on ImageNet) than the
full-precision AlexNet [3].
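The replacement of multiplications by XNOR operations can be sketched for a single dot product. The encoding below (+1 as bit 1, −1 as bit 0) and the helper names are assumptions of this example; the identity used is that the dot product of two ±1 vectors equals twice the number of agreeing positions minus the vector length.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer bit mask (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for position, value in enumerate(values):
        if value > 0:
            word |= 1 << position
    return word


def binary_dot(a_bits, b_bits, length):
    """Dot product of two packed +1/-1 vectors using XNOR and a population count."""
    mask = (1 << length) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then popcount
    return 2 * agreements - length


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
xnor_result = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

In hardware, the XNOR and population count operate on whole words at once, which is where the efficiency gain over multi-bit multipliers comes from.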
This observation can lead to major 
energy savings, as current CPU and 
GPU architectures operate using 
32- or 16-b floating-point number formats.
Reducing precision from 32-b
floating point to low precision not 
only reduces computational energy 
but also minimizes the storage and 
data-fetching cost needed for network 
weights and intermediate results. 
Moreover, for very low bit widths,
this even allows replacing the
multiplications of several data
values with a common weight factor
by preloaded lookup tables [10]. As
a result, all custom CNN accelerators 
operate in fixed point. While most 
processors operate at constant 16-, 
12-, or 8-b word lengths, some recent 
implementations support variable 
word-length computations, wherein 
the processor can change the used 
computational precision from opera-
tion to operation [23], [10], [24]. This 
accommodates the observation
that the optimal word length for a 
deep network strongly varies from 
application to application and is 
even shown to differ across various 
layers of a single deep network [18] 
[Figure 10(a)]. 
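A minimal sketch of post-training fixed-point quantization with a per-layer word length is given below. The signed two's-complement format, the round-to-nearest rule, and the example bit widths are assumptions chosen for illustration; the cited processors may use different number formats.

```python
import math


def quantize_fixed_point(value, total_bits, frac_bits):
    """Round to a signed fixed-point grid with saturation.

    total_bits includes the sign bit; frac_bits is the number of fractional bits.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)   # round to nearest integer code
    code = max(lowest, min(highest, code))   # saturate to the representable range
    return code / scale


def quantize_layers(layers, formats):
    """Apply a separate (total_bits, frac_bits) format to every layer's weights."""
    return [[quantize_fixed_point(w, bits, frac) for w in weights]
            for weights, (bits, frac) in zip(layers, formats)]


# Hypothetical two-layer example: 8-b weights in layer 1, 4-b weights in layer 2.
quantized = quantize_layers([[0.3, -0.01], [0.3, 1.7, -0.01]], [(8, 7), (4, 3)])
```

Note how the small weight −0.01 survives at 8 b but is quantized to zero at 4 b, and how 1.7 saturates to the largest 4-b value.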
Energy-efficient variable-resolu-
tion processors have been realized 
using a technique termed dynamic 
voltage-accuracy-frequency scaling 
[25] to jointly reduce the switching 
activity, supply voltage, and par-
allelization scheme when computa-
tional resolution drops [Figure 10(c)]. 
This results in a scaling of the sys-
tem’s energy consumption, which is 
super-linear with the computational 
resolution [Figure 10(b)], thus allow-
ing every network layer to run at its 
own minimal energy point. Reduced 
bit-width implementations all exploit 
the deep network’s tolerance to faults 
in a deterministic way.
Another school of thought targets 
energy savings through tolerating 
nondeterministic statistical errors. 
This can be accomplished by execut-
ing the convolutional kernels in the 
noisy analog domain [26]. Alterna-
tively, in the digital domain, stochas-
tic fault tolerance can be exploited 
by operating the circuits [27] and/
or memory [28] in the energy-effi-
cient near-threshold regions. In this 
region, circuit delays as well as 
memory failures suffer from large 
variation. Yet the networks can tol-
erate such stochastic behavior up to 
a certain limit. Such circuits are com-
bined with circuit monitors that con-
stantly assess and control the circuit’s 
fault rate [28].
Finally, the operational circum-
stances can strongly influence the net-
work’s tolerance to approximations. In 
a given classification application, the 
quality of the inputs might change 
dynamically, or some classes might be 
easier to observe than others. If one 
tries to train one common network 
that performs acceptably under all 
possible circumstances and classes, 
a large, complex, energy-hungry 
network topology would be needed. 
Recent work, however, promotes the 
training of hierarchical or staged 
Figure 10: (a) When quantizing all weight and data values in a floating-point AlexNet uniformly, the network can run at 9-b precision. Lower precision can be achieved without significant classification accuracy loss by running every layer at its own optimal precision. This allows (b) saving power as a function of computational precision and (c) building multipliers whose energy consumption scales drastically with computational precision, through reduced activity factor and critical path length. [Plot labels: AlexNet on ImageNet, quantization (bits) versus layer number, uniform at 100% and nonuniform at 99%; relative power versus computational precision for 1-, 6-, and 16-b operation, with a 33× gain at 1% RMSE; multiplier with inputs x0 to x3 and y0 to y3 and outputs p0 to p7.]
   IEEE SOLID-STATE CIRCUITS MAGAZINE fall 20 17  63
networks [29] that perform classifica-
tions in several optional stages. At 
each stage, only a few layers of the net-
work are executed, after which a clas-
sification layer tries to guess the class 
from the current outputs. Additional 
network layers and classifiers are run 
only if the obtained probabilities are 
not outspoken enough, until a classi-
fication with distinct probabilities is 
obtained. Such dynamic evaluations 
can be performed on any hardware 
platform but, again, benefit signifi-
cantly from implementation-aware 
training techniques or topology-
optimized implementations. Infer-
ence on the ImageNet data set [29] 
required up to 2.6 times fewer opera-
tions than state-of-the-art networks at 
equivalent accuracy.
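The staged evaluation described above can be sketched in a few lines of Python. This is an illustrative early-exit loop only: the stage functions, the softmax confidence test, and the 0.9 threshold are assumptions of this sketch and not the method of [29].

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def staged_inference(x, stages, threshold=0.9):
    """Run stages one by one and stop once a classifier is confident.

    `stages` is a list of (extract, classify) pairs: `extract` maps the
    current features to deeper features, `classify` maps features to logits.
    Returns (predicted_class, number_of_stages_executed).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    features = np.asarray(x, dtype=np.float64)
    for idx, (extract, classify) in enumerate(stages, start=1):
        features = extract(features)
        probs = softmax(classify(features))
        # Exit early when the top probability is distinct enough,
        # or when no further stages are left.
        if probs.max() >= threshold or idx == len(stages):
            return int(np.argmax(probs)), idx

# Toy stages (hypothetical): the second stage simply sharpens the features.
toy_stages = [
    (lambda v: v, lambda f: f),
    (lambda v: v * 100.0, lambda f: f),
]
print(staged_inference([10.0, 0.0, 0.0], toy_stages))  # easy input
print(staged_inference([0.1, 0.0, 0.0], toy_stages))   # harder input
```

Easy inputs leave after the first stage, so the average number of executed layers, and hence the energy per classification, depends on the input statistics.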
Enhancing and Exploiting Sparsity
Deep neural networks exhibit extreme
sparsity, i.e., many of the weight
values, as well as intermediate data
values, are zero. Figure 11(a) shows the
sparsity of an AlexNet as a function of
the fixed-point word length used within
the network. As can be seen, even for
large word lengths, more than 70% of the
activations are zero. At reduced bit
widths, many weight values are also
quantized to zero. This opens up many
opportunities.
On the hardware side, this can be
exploited by preventing any MAC with a
zero-valued input [see Figure 11(b)], by
not even fetching zero-valued data
values from memory, and by strongly
compressing the on-/off-chip data stream
using, e.g., Huffman or other types of
encoding. Several hardware
implementations exploit these CNN
characteristics. The authors of [24] and
[11] skip all unnecessary operations on
sparse data by gating the inputs to
their arithmetic units if the input data
is zero, as a multiply-accumulate with
zero does not change the internal
accumulation result. Both
implementations also compress off-chip
data streams, either through run-length
encoding [14] or through a simplified
Huffman scheme [23]. The architectures
presented in [30] and [31], on the other
hand, speed up sparse network
evaluations by scheduling only nonzero
operations for execution, improving
computational throughput by up to 1.52
and 5.2 times, respectively.
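The two hardware ideas, skipping MACs with a zero operand and run-length encoding the data stream, can be mimicked in software. The Python sketch below is a behavioral illustration only; the (zero run, value) pair format is an assumption of this sketch and not the exact encoding used in [14] or [23].

```python
def sparse_dot(weights, activations):
    """Dot product that skips every MAC with a zero-valued operand.

    Returns the accumulated result and the number of MACs actually executed.
    """
    acc = 0
    executed = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue              # a MAC with zero leaves the accumulator unchanged
        acc += w * a
        executed += 1
    return acc, executed

def rle_encode(values):
    """Encode a sequence as (zero_run_length, nonzero_value) pairs.

    Trailing zeros are stored as (run_length, None).
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if run:
        pairs.append((run, None))
    return pairs

def rle_decode(pairs):
    """Inverse of rle_encode."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if v is not None:
            out.append(v)
    return out

activations = [0, 0, 3, 0, 5, 0, 0]
weights = [1, 2, 0, 4, 2, 6, 7]
print(sparse_dot(weights, activations))
print(rle_encode(activations))
```

In this toy example only one of the seven MACs is executed, and the seven activations shrink to three pairs.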
More powerful opportunities arise,
again, when the hardware and algorithmic
planes are jointly involved. Deep
network training algorithms can be
modified to enhance the network’s
sparsity by iteratively pruning the
smallest weight values (quantizing them
to zero) and retraining the network
[32]. Going one step further,
energy-aware pruning techniques even
take the energy consumption model of the
hardware into account and start pruning
the layers that consume the most energy,
to maximize pruning efficacy [33]. This
easily allows the pruning of 70–90% of
the weights and saves up to 70% of the
energy consumption.
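Iterative magnitude pruning can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the pruning schedule is arbitrary, and `retrain` is a hypothetical callback standing in for the retraining step, which is not implemented here.

```python
import numpy as np

def prune_smallest(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    flat = np.abs(weights).ravel()
    k = int(round(fraction * flat.size))
    keep = np.ones(flat.size, dtype=bool)
    if k > 0:
        # argpartition puts the indices of the k smallest magnitudes first.
        keep[np.argpartition(flat, k - 1)[:k]] = False
    mask = keep.reshape(weights.shape)
    return weights * mask, mask

def iterative_prune(weights, schedule, retrain=None):
    """Prune to increasing sparsity levels, optionally retraining in between.

    `retrain(weights, mask)` is a hypothetical user-supplied function that
    returns updated weights; pruned positions are forced back to zero.
    """
    w = np.array(weights, dtype=np.float64)
    mask = np.ones(w.shape, dtype=bool)
    for fraction in schedule:
        w, mask = prune_smallest(w, fraction)
        if retrain is not None:
            w = retrain(w, mask) * mask
    return w, mask

w0 = np.array([1, -2, 3, -4, 5, -6, 7, -8, 9, -10], dtype=np.float64)
pruned, mask = iterative_prune(w0, schedule=[0.3, 0.5, 0.7])
print(pruned)
```

Because already pruned weights have zero magnitude, each fraction in the schedule is the cumulative share of weights removed so far.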
Interestingly enough, networks can be
compressed beyond simply pruning
low-valued weights. After pruning and
quantizing a network, it turns out that
the resulting weight values are highly
clustered. This allows, e.g., the
clustering of 8-b weights into only 16
(2^4) different weight clusters, each of
which can share a common weight value
expressed by a 4-b label. For every
weight, only the 4-b label is stored,
and the labels are expanded online to
the shared 8-b value using a small
embedded lookup table.
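Weight sharing with a lookup table can be sketched with a small one-dimensional k-means routine. The linear initialization of the cluster centers, the iteration count, and the random stand-in weights are assumptions of this sketch; a production flow would also fine-tune the shared values.

```python
import numpy as np

def cluster_weights(weights, n_clusters=16, n_iter=20):
    """Cluster weights into `n_clusters` shared values (1-D k-means).

    Returns per-weight labels (4-b indices for 16 clusters) and the codebook
    that acts as the lookup table.
    """
    flat = weights.ravel().astype(np.float64)
    # Assumption of this sketch: centers start evenly spread over the range.
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(n_clusters):
            members = flat[labels == c]
            if members.size:
                codebook[c] = members.mean()
    labels = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return labels.reshape(weights.shape).astype(np.uint8), codebook

def expand(labels, codebook):
    """Online expansion: replace every label by its shared weight value."""
    return codebook[labels]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))            # stand-in for a layer's weights
labels, codebook = cluster_weights(w)
restored = expand(labels, codebook)
print(labels.max(), codebook.shape, np.unique(restored).size)
```

Only `labels` (4 b per weight) and the 16-entry `codebook` need to be stored; `expand` mirrors what the embedded lookup table does at run time.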
[Figure 11 graphic omitted. Recoverable labels: (a) AlexNet, mean sparsity (%) versus fixed-point precision (bits), for layer inputs and weights; (b) block diagram with DRAM, weight memory, input/output memory, and a MAC array, annotated "Compress off-chip communication", "Prevent fetching zero-valued data", and "Prevent executing zero-input MACs".]
Figure 11: (a) The sparsity of input and weight values of a typical network as a function of the computational precision at which the network is evaluated. (b) This sparsity allows energy to be saved in the processor’s input/output interface, on-chip memories, and data path.
Recent work has shown that the
combination of pruning, weight sharing,
and Huffman compression compresses
state-of-the-art networks by up to 50
times in memory size (deep compression
[34]). Traditional accelerators can
benefit from such compression but only
in terms of a reduction in memory size
and the amount of memory accessed. To
execute convolutional operations, they
must still decompress the data and, at
best, remain idle during zero-valued
operations. The efficient inference
engine (EIE) [35], however, demonstrates
that it is also possible and highly
beneficial to operate directly on the
compressed data by adapting the data
path and memory interface to the
compressed data format.
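What operating directly on compressed data can look like is sketched below: a matrix-vector product that walks a column-wise sparse format whose entries are codebook labels rather than full-precision weights. The storage layout and the function names are assumptions of this sketch, loosely inspired by the description of [35], and do not reproduce the exact EIE format.

```python
import numpy as np

def compress(dense, codebook):
    """Store only nonzero weights, column by column, as codebook labels."""
    labels, rows, col_ptr = [], [], [0]
    for j in range(dense.shape[1]):
        for i in np.nonzero(dense[:, j])[0]:
            # Label of the nearest shared weight value.
            labels.append(int(np.argmin(np.abs(codebook - dense[i, j]))))
            rows.append(int(i))
        col_ptr.append(len(labels))
    return labels, rows, col_ptr

def compressed_matvec(n_rows, labels, rows, col_ptr, codebook, x):
    """Compute y = W @ x without ever rebuilding the dense matrix W."""
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0:
            continue                      # zero activation: skip the whole column
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[rows[k]] += codebook[labels[k]] * xj
    return y

codebook = np.array([-1.0, -0.5, 0.5, 1.0])   # four shared weight values
W = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, -1.0, -0.5]])
x = np.array([2.0, 0.0, 4.0])
labels, rows, col_ptr = compress(W, codebook)
y = compressed_matvec(W.shape[0], labels, rows, col_ptr, codebook, x)
print(y)
```

Zero weights are never stored or multiplied, and a zero activation skips an entire column, so both weight and activation sparsity reduce the work.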
A network compression technique that
does enable straightforward network
execution in the compressed domain
without any hardware adaptation uses
singular value decomposition (SVD) [36].
By performing SVD on a sparse weight
matrix of a fully connected network
layer, the matrix can be decomposed into
two matrices, the rows and columns of
which are ordered by significance, i.e.,
by decreasing singular value. By simply
removing the least significant sections
of these matrices, one is left with a
strongly compressed representation of
the original network layer. The result
can be executed on any regular neural
network accelerator, as it is identical
to the execution of two (much smaller)
fully connected layers. While this
method is more straightforward from a
hardware point of view, it offers only
limited compression capabilities,
typically up to only five times [36].
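The SVD-based restructuring can be sketched with NumPy. The rank, the matrix sizes, and the random stand-in weights below are assumptions chosen for illustration; [36] applies the idea to acoustic models and follows it with fine-tuning, which is not shown.

```python
import numpy as np

def svd_compress(W, rank):
    """Replace an (m x n) weight matrix by two factors of the given rank.

    Returns A (m x rank) and B (rank x n) such that A @ B approximates W,
    i.e., one fully connected layer becomes two smaller ones.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]    # fold the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# Stand-in weight matrix that is exactly rank 2, so truncation is lossless here.
W = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 32))
A, B = svd_compress(W, rank=2)

x = rng.normal(size=32)
y_full = W @ x                    # original layer: 64 * 32 multiplications
y_two = A @ (B @ x)               # two small layers: 2 * 32 + 64 * 2
print(W.size, A.size + B.size)
```

For an m x n layer truncated to rank r, the parameter count drops from m*n to r*(m+n), which only pays off when r is well below min(m, n).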
Outlook
In this short tutorial, we have
presented a selection of very promising
hardware and algorithmic techniques from
the rapidly expanding field of deep
learning. Each exploits and/or enhances
the unique features of deep networks to
improve the energy efficiency of their
execution. Together, they have enabled
tremendous energy savings compared to
traditional CPU- and GPU-based compute
platforms. As can be seen in Figure 12
[37], this recent wave of innovations
breaks the barrier for embedded deep
inference in mobile devices.
Implementations with efficiencies far
surpassing 1 TOPS/W have recently been
demonstrated, while computational
throughput is boosted to several
hundred GOPS.
Still, challenges remain to effectively
bring deep learning to IoT and edge
devices. First, few (if any) complete
end-to-end solutions have been
demonstrated. Doing so involves
integrating the deep-inference chips
into complete vision-processing
pipelines for real-life applications.
This requires not only an efficient
execution of the inference kernel itself
but also efficient image slicing, data
transfer, and results interpretation.
A second interesting challenge lies in
the learning process. So far, most chips
focus on the inference part, where
pretrained models are efficiently
executed on-chip. In the future,
however, the desire for more privacy and
user customization will stimulate the
development of chips capable of
executing the training phase as well.
This comes with new computational
challenges and the need for careful
algorithm–architecture co-optimization.
It is thus very clear that, more than
ever, the hardware and algorithmic
layers must be optimized jointly,
seizing the various cross-layer
opportunities of deep neural networks.
This is also apparent from the interest
of many traditionally software-oriented
companies (like Google, Amazon, and
Microsoft) in the development of new
proprietary hardware for deep learning.
This field is so vibrant that new ideas
pop up every single week. Of course,
space does not allow us to cover all of
the exciting ideas currently circulating
in embedded deep learning. Yet we hope
that we were able to spark readers’
interest and stimulate further
exploration of this lively field.
References 
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep
learning,” Nature, vol. 521, no. 7553, pp.
436–444, 2015.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, 
S. Satheesh, S. Ma, Z. Huang, A. Karpathy, 
A. Khosla, M. Bernstein, A. C. Berg, and 
L. Fei-Fei, “ImageNet large scale visual 
recognition challenge,” Int. J. Computer
Vision, vol. 115, no. 3, pp. 211–252, 
2015.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton, 
“ImageNet classification with deep convo-
lutional neural networks,” in Proc. Conf. 
Neural Information Processing Systems, 
2012, pp. 1097–1105. 
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep 
residual learning for image recogni-
tion,” arXiv Preprint, arXiv:1512.03385, 
2015.
[Figure 12 graphic omitted. Recoverable labels: energy efficiency (TOPS/W) versus throughput (GOPS); markers for 4 b, 8 b, 16 b, and 4-b sparse operation at minimum energy and at peak performance; 2016 references; CPU and GPU; 1 TOPS/W and 100 GOPS lines; "10-f/s ResNet at 30 mW".]
Figure 12: An overview of the reported performance of the deep neural network processors published at the International Solid-State Circuits Conference in 2016 and 2017. Performances beyond 100 GOPS and 1 TOPS/W will be a game changer for deep inference in embedded devices.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.
Reed, D. Anguelov, and A. Rabinovich,
“Going deeper with convolutions,” in Proc.
IEEE Conf. Computer Vision and Pattern
Recognition, 2015, pp. 1–9.
[6] F. Chollet, “Xception: Deep learning with 
depthwise separable convolutions,” arXiv 
Preprint, arXiv:1610.02357, 2016.
[7] F. Iandola, M. Moskewicz, S. Karayev, 
R. Girshick, T. Darrell, and K. Keutzer, 
“Densenet: Implementing efficient con-
vnet descriptor pyramids,” arXiv Preprint, 
arXiv:1404.1869, 2014. 
[8] F. A. Gers, J. Schmidhuber, and F. Cum-
mins, “Learning to forget: Continual pre-
diction with LSTM,” Neural Comput., vol. 
12, no. 10, pp. 2451–2471, 2000.
[9] N. P. Jouppi et al., “In-datacenter
performance analysis of a tensor processing
unit,” arXiv Preprint, arXiv:1704.04760,
2017.
[10] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, “DNPU: 
An 8.1 TOPS/W reconfigurable CNN-RNN 
processor for general-purpose deep neu-
ral networks,” in Proc. IEEE Int. Solid-State 
Circuits Conf., 2017, pp. 240–241. 
[11] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: 
A spatial architecture for energy-effi-
cient dataflow for convolutional neural 
networks,” in Proc. IEEE Annu. Int. Symp. 
Computer Architecture, 2016, pp. 367–
379. 
[12] M. Peemen et al., “Memory-centric
accelerator design for convolutional neural
networks,” in Proc. IEEE 31st Int. Conf. 
Computer Design, 2013, pp. 13–19. 
[13] L. Cecconi, S. Smets, L. Benini, and M. 
Verhelst, “Optimal tiling strategy for 
memory bandwidth reduction for CNNs:
Advanced concepts for intelligent vision
systems,” Ph.D. dissertation, Univ.
Bologna, 2017.
[14] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze,
“Eyeriss: An energy-efficient reconfigu-
rable accelerator for deep convolutional 
neural networks,” in Proc. IEEE Int. Solid-
State Circuits Conf., 2016, pp. 262–263.
[15] H. T. Kung, “Systolic algorithms for the 
CMU WARP processor,” Research Show-
case @ CMU, 1984. 
[16] A. Shafiee, A. Nag, N. Muralimanohar, R.
Balasubramonian, J. P. Strachan, M. Hu,
R. S. Williams, and V. Srikumar, “ISAAC: A
convolutional neural network accelerator
with in-situ analog arithmetic in
crossbars,” in Proc. 43rd Int. Symp. Computer
Architecture, 2016, pp. 14–26.
[17] P. Gysel, M. Motamedi, and S. Ghiasi,
“Hardware-oriented approximation of 
convolutional neural networks,” in Proc. 
Workshop Contribution to Int. Conf. Learn-
ing Representations, 2016. 
[18] B. Moons, B. De Brabandere, L. Van Gool, 
and M. Verhelst, “Energy-efficient Con-
vNets through approximate computing,” 
in Proc. IEEE Winter Conf. Applications 
Computer Vision, 2016, pp. 1–8.
[19] I. Hubara, M. Courbariaux, D. Soudry, R. 
El-Yaniv, and Y. Bengio, “Quantized neu-
ral networks: Training neural networks 
with low precision weights and
activations,” arXiv Preprint, arXiv:1609.07061,
2016.
[20] M. Rastegari, V. Ordonez, J. Redmon, and 
A. Farhadi, “XNOR-Net: Imagenet classifi-
cation using binary convolutional neural 
networks,” in Proc. European Conf. Com-
puter Vision, 2016, pp. 525–542. 
[21] I. Hubara, M. Courbariaux, D. Soudry, R.
El-Yaniv, and Y. Bengio, “Binarized neural
networks,” in Advances in Neural Information
Processing Systems 29, D. D. Lee, M.
Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, Eds. Curran Assoc., Inc., 2016, pp.
4107–4115.
[22] R. Andri, L. Cavigelli, D. Rossi, and L. Be-
nini, “YodaNN: An ultra-low power convo-
lutional neural network accelerator based 
on binary weights,” in Proc. IEEE VLSI 
Computer Society Annu. Symp., July 2016, 
pp. 236–241.
[23] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W 
precision-scalable processor for real-time 
large-scale ConvNets,” in Proc. IEEE Symp. 
VLSI Circuits, 2016, pp. 1–2.
[24] B. Moons et al., “Envision: A 0.26-to-10
TOPS/W subword-parallel dynamic-volt-
age-accuracy-frequency-scalable convo-
lutional neural network processor in 28 
nm FDSOI,” in Proc. IEEE Int. Solid-State 
Circuits Conf., 2017, pp. 246–257. 
[25] B. Moons, R. Uytterhoeven, W. Dehaene, 
and M. Verhelst, “DVAFS: Trading com-
putational accuracy for energy through 
dynamic-voltage-accuracy-frequency-
scaling,” in Proc. Conf. Design, Automa-
tion and Test in Europe, Lausanne, 2017, 
pp. 488–493.
[26] L. Fick, D. Blaauw, D. Sylvester, S. Skrzyn-
iarz, M. Parikh, and D. Fick, “Analog in-
memory subthreshold deep neural net-
work accelerator,” in Proc. IEEE Custom 
Integrated Circuits Conf., Austin, TX, 2017, 
pp. 1–4.
[27] Y. Lin, S. Zhang, and N. R. Shanbhag,
“Variation-tolerant architectures for con-
volutional neural networks in the near 
threshold voltage regime,” in Proc. IEEE 
Int. Workshop Signal Processing Systems, 
2016, pp. 17–22. 
[28] P. Whatmough, S. Kyu Lee, H. Lee, S. Rama, 
D. Brooks, and G.-Y. Wei, “A 28nm SoC 
with a 1.2GHz 568nJ/pred sparse deep 
neural network engine with >0.1 timing 
error rate tolerance for IoT applications,” 
in Proc. IEEE Int. Solid-State Circuits Conf., 
2017, pp. 242–243. 
[29] G. Huang et al., “Multi-scale dense
convolutional networks for efficient
prediction,” arXiv Preprint, arXiv:1703.09844,
2017.
[30] J. Albericio, P. Judd, T. Hetherington, T. 
Aamodt, N. E. Jerger, and A. Moshovos, 
“Cnvlutin: Ineffectual-neuron-free deep 
neural network computing,” in Proc. ACM/
IEEE 43rd Annu. Int. Symp. Computer 
Architecture, June 2016, pp. 1–13.
[31] D. Kim, J. Ahn, and S. Yoo, “A novel zero 
weight/activation-aware hardware archi-
tecture of convolutional neural network,” 
in Proc. IEEE Design, Automation & Test in 
Europe Conf. & Exhibition, 2017, pp. 1462–
1467. 
[32] S. Han, J. Pool, J. Tran, and W. Dally,
“Learning both weights and connections 
for efficient neural network,” in Proc. Ad-
vances in Neural Information Processing 
Systems, 2015, pp. 1135–1143.
[33] V. Sze, T.-J. Yang, and Y.-H. Chen, “Design-
ing energy-efficient convolutional neural 
networks using energy-aware pruning,” 
in Proc. Conf. Computer Vision and Pat-
tern Recognition, Honolulu, Hawaii, July 
21–26, 2017, pp. 5687–5695. 
[34] S. Han, H. Mao, and W. J. Dally, “Deep 
compression: Compressing deep neural 
networks with pruning, trained quantiza-
tion and Huffman coding,” arXiv Preprint, 
arXiv:1510.00149, 2015.
[35] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, 
M. A. Horowitz, and W. J. Dally, “EIE: Ef-
ficient inference engine on compressed 
deep neural network,” arXiv Preprint, 
arXiv:1602.01528, 2016.
[36] J. Xue, J. Li, and Y. Gong, “Restructuring 
of deep neural network acoustic models 
with singular value decomposition,” in 
Proc. Interspeech Conf., 2013, pp. 2365–
2369. 
[37] M. Verhelst. (2017). Deep learning pro-
cessor survey. [Online]. Available: http://
www.esat.kuleuven.be/~mverhels/DLIC-
survey.html 
[38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and 
Y. Zou, “DoReFa-Net: Training low bitwidth
convolutional neural networks with low
bitwidth gradients,” arXiv Preprint,
arXiv:1606.06160, 2016.
About the Authors
Marian Verhelst (marian.verhelst@
kuleuven.be) has been an assistant 
professor at the Micro-Electronics and 
Sensors Laboratories of the Electrical 
Engineering Department at KU Leu-
ven, Belgium, since 2012. Her research 
focuses on self-adaptive circuits and 
systems, embedded machine learn-
ing, and low-power sensing and pro-
cessing for the Internet of Things. She 
received a Ph.D. degree from KU Leu-
ven (cum ultima laude) in 2008. She 
was a visiting scholar at the Berke-
ley Wireless Research Center of the 
University of California, Berkeley, in 
2005. From 2008 to 2011, she worked 
in the Radio Integration Research 
Lab of Intel Laboratories, Hillsboro, 
Oregon. She is an IEEE Solid-State Cir-
cuits Society Distinguished Lecturer 
and a member of the Young Academy 
of Belgium and has published over 
100 papers in conferences and jour-
nals. She is a member of the Interna-
tional Solid-State Circuits Conference 
(ISSCC) Technical Program Committee 
and the Design, Automation, and Test 
in Europe (DATE) and ISSCC Executive 
Committees. She was associate edi-
tor for IEEE Transactions on Circuits 
and Systems II and currently serves in 
the same capacity for IEEE Journal of 
Solid-State Circuits.
Bert Moons received his B.S. and 
M.S. degrees in electrical engineering 
from KU Leuven, Belgium, in 2011 and 
2013, respectively. In 2013, he joined 
the Micro-Electronics and Sensors Lab-
oratories of KU Leuven as a research 
assistant, funded through an indi-
vidual grant from the Research Foun-
dation of Flanders. In 2016, he was a 
visiting research student at Stanford 
University, California, in the Murmann 
Mixed-Signal Group. Currently, he is 
working toward the Ph.D. degree on 
energy-scalable and run-time adapt-
able digital circuits for embedded 
deep learning applications. 
