Benchmarking Physical Performance of Neural Inference Circuits by Nikonov, Dmitri E. & Young, Ian A.
Benchmarking Physical Performance of 
Neural Inference Circuits 
 
Dmitri E. Nikonov and Ian A. Young 
Components Research, Intel Corp., Hillsboro, Oregon 97007, USA 
dmitri.e.nikonov@intel.com 
Abstract 
Numerous neural network circuits and architectures are presently under active research for 
application to artificial intelligence and machine learning. Their physical performance metrics 
(area, time, energy) are estimated. Various types of neural networks (artificial, cellular, spiking, 
and oscillator) are implemented with multiple CMOS and beyond-CMOS (spintronic, 
ferroelectric, resistive memory) devices. A consistent and transparent methodology is proposed 
and used to benchmark this comprehensive set of options across several application cases. 
Promising architecture/device combinations are identified. 
Keywords: neuromorphic, benchmarking, neural network, beyond CMOS, spintronic, CNN, 
spiking, throughput, power 
 
1. Introduction 
The unprecedented progress of traditional, Boolean computing over the last five decades has 
been propelled by the scaling of the transistor according to Moore’s law [1]. Recently a 
larger share of computing is being consumed by applications related to artificial intelligence (AI) 
and machine learning (ML). For these, Boolean computing is less efficient. This has spurred 
research in neural computing which covers a wide field of research; from neural network 
algorithms which can be programmed on traditional Boolean hardware like CPUs or GPUs to 
neural network circuits implemented in specialized hardware – application specific engines. The 
former approach presently handles the majority of user needs from the data center to the edge. 
The latter approach resulted in development and research thrusts in e.g. digital neural 
accelerators such as [2] (see a review [3]) and neuromorphic (biologically inspired) chips such as 
[4]. The operation of neuromorphic chips can span a range of circuit implementations from 
mostly digital to mostly analog (see reviews [5] and [6]). 
In the last few years, AI/ML achieved prominent successes, especially related to deep neural 
networks (DNN) [7] and convolutional neural networks (CoNN). ML has enabled a 
revolutionary improvement in the accuracy of image, pattern, and facial recognition, including 
the treatment of ‘big data’ online. More demand for neural computing is emerging in robotic 
control, autonomous vehicles, drones, etc. 
One of the main concerns on the minds of developers of AI computing systems is the same as for 
traditional computing: the power consumption in the chips. The history of traditional computing 
shows that the commercial success of computing devices and architectures is predicated largely 
on their physical performance – areal density, speed of operation, and consumed energy as 
benchmarked in [8,9]. These ultimately translate into processing throughput and consumed 
power of the chips, which are of utmost importance to the user. A fair comparison between the 
published neural network implementations is difficult due to the difference in the process 
technology generation, the network architectures, and computing workloads.  
The main purpose of the paper is to establish a methodology for comparing various neural 
network hardware approaches and to understand the trends revealed through  its development. In 
doing that we strive to adhere to the following principles: 
a) general: wide scope of technologies, devices and circuits; 
b) simple: simple analytics are more important than precise simulations; 
c) uniform: consistent inputs and assumptions across multiple types of hardware; 
d) transparent: all models used are described and the code is available [10] to the reader for 
verification. 
Let us differentiate this work from the existing body of literature. We do not aim to give a 
literature review, and refer the reader to the excellent review papers in the neuromorphic 
hardware field [11,12] which do not attempt to quantitatively compare prior works, as we do in 
this paper. Oftentimes benchmarking refers to comparing various algorithms for an application 
mainly based on their accuracy and with little reference to hardware implementation, e.g. [13]. In 
contrast, we compare different types of hardware implementing the same algorithm and focusing 
on its energy consumption and performance.  
The discussion of neural networks in the numerous papers currently being published remains at 
the architecture level. For example, the accuracy of recognition is studied in its dependence on 
the number of network elements, topology, and details of the algorithm. We are cognizant of the 
importance of the accuracy of inferencing. And indeed prior studies discovered that this accuracy 
can be degraded compared to the algorithm-limited “maximum accuracy” due to device non-
idealities [14]. However for this paper we focus on the effect of the different types of devices and 
their neural circuits. For that purpose we make an optimistic assumption of devices not 
degrading accuracy, as exemplified by [15]. This assumption is appropriate for benchmarking 
which targets the idealized, paradigm limiting cases. 
Some research papers report experimentally measured performance and energy in several 
implementations of CoNN [16,17] by running them on a GPU or a particular DNN on a variety 
of neural hardware [18]. In contrast, we provide a theoretical prediction approach to 
benchmarking. A rigorous simulation framework, NeuroSim, is used to benchmark the neural 
network circuit architecture in a cross-connect topology based on a set of memory cells serving 
as synapses [19]. The Eyeriss simulation tool focuses on accelerators for DNN [20]. The 
CrossSim simulator has similar capabilities [21].  
Benchmarking of a variety of devices, including beyond-CMOS ones, has been done in an 
approach similar to ours, but for one type of neural networks only, cellular neural networks 
(CeNN) [22]. Various estimates of time and energy of operation for certain types of digital and 
analog devices have been done previously [23,24,25,26]. Compared to these prior works, we 
cover a wider scope of devices and network architectures with less detailed models of circuits. 
Another purpose of this paper is to explore the impact of exploratory devices on the performance 
and energy efficiency of operation for neural networks. For example, neural circuits based on the 
spintronic type of beyond CMOS devices have been proposed [27,28]. In this paper we aim to 
expand the list of beyond CMOS devices applied to neural networks. Also we consider both 
digital and analog neural networks in an attempt to understand whether there is an advantage in 
speed and energy of neural computing with the latter. We also consider several types of neural 
network microarchitectures and analyze their relative advantages. Finally we consider several 
cases of neural networks running their application ‘workloads’ and demonstrate that the 
qualitative conclusions made about various neural network hardware remain valid for 
these “real use” cases. 
 
2. Fundamentals and Concepts of Neuromorphic Computing 
Operation in the majority of neural network architectures relies on a neural gate, often called the 
perceptron, Figure 1. The elements at the input, synapses, receive vectors of input signals, $x_i$, and 
multiply them by vectors of weights, $w_i$. Neurons perform the summation of these products, add a 
bias $b$, and apply a nonlinear threshold (or ‘activation’) function $g$: 

$f(x) = g\left(\sum_{i}^{n} w_i x_i + b\right)$   (1) 

Figure 1. Scheme of a neural gate, perceptron. 
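The neural gate of Eq. (1) can be illustrated with a minimal Python sketch. The function names `neural_gate` and `step`, the example weights, and the choice of a hard threshold as the activation are illustrative assumptions and are not taken from the paper.

```python
import math


def step(s):
    """Hard-threshold activation (an illustrative choice for g)."""
    return 1.0 if s >= 0.0 else 0.0


def neural_gate(x, w, b, g=math.tanh):
    """Eq. (1): f(x) = g(sum_i w_i * x_i + b)."""
    if len(x) != len(w):
        raise ValueError("inputs and weights must have the same length")
    # Synapses multiply inputs by weights; the neuron sums and adds the bias.
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    # The neuron applies the nonlinear activation function.
    return g(s)


# Hypothetical inputs, weights, and bias, chosen only to exercise the gate.
out = neural_gate([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], b=-0.6, g=step)
```

With these example numbers the weighted sum is slightly above zero, so the hard threshold returns 1.0.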
Despite its apparent simplicity, the neural gate in some form underlies most of the neuromorphic 
hardware and algorithms. Deep neural networks (DNN) consist of cascaded multiple layers of 
neural gates. Convolutional neural networks (CoNN) are an example of algorithms applying 
DNN to image processing. For the benchmarking analysis of this paper we only consider the 
DNN workloads of inference (i.e., determining a distance of input vectors from memorized ones 
in a multi-dimensional space). Inference is crucial for recognition, i.e., classifying objects in the 
input data.  
Learning (or, training) is the process of modifying parameters of the neural network for better 
recognition. It consists of performing numerous inferences on the input data and then adjusting 
the weights in the neural network. Methods of learning can be e.g.: a) supervised learning: 
optimization of weights through e.g. backpropagation algorithm [7], or b) unsupervised learning: 
change of weights according to synapse activity caused by input patterns, e.g. using the spike-
timing-dependent plasticity (STDP) algorithm [27].  
We decided to limit the scope of the present paper to inferencing. We realize the importance of 
learning and that it requires much more computing effort. However inference and learning 
present different market segments and have different usage models: learning is mostly practiced 
by providers of data center services and inference mostly client users. As such it is possible to 
consider inference and learning separately. Benchmarking of learning (such as in [19]) will be 
explored in a future publication. 
Table 1. LIST OF NOTATION USED IN THE PAPER. 
Quantity | Symbol | Units | Value 
Process generation (‘node’) size | $F$ | nm | 15 
Minimum interconnect length | $l_{ic}$ | nm | $20F$ 
Bits in a digital synapse | $n_b$ |  | 8 
Levels in an analog synapse | $n_l$ |  | 64 
Area, delay, energy of circuits | $a, \tau, E$ | nm², ps, fJ |  
Area, delay, energy of a device | $a_{dev}, \tau_{dev}, E_{dev}$ | nm², ps, fJ |  
Area, delay, energy of a minimal interconnect | $a_{ic}, \tau_{ic}, E_{ic}$ | nm², ps, fJ |  
Area, delay, energy of an inverter | $a_{inv}, \tau_{inv}, E_{inv}$ | nm², ps, fJ |  
Area, delay, energy of a 2 input NAND | $a_{nan}, \tau_{nan}, E_{nan}$ | nm², ps, fJ |  
Area, delay, energy of a register bit | $a_{reg}, \tau_{reg}, E_{reg}$ | nm², ps, fJ |  
Area, delay, energy of a state element | $a_{se}, \tau_{se}, E_{se}$ | nm², ps, fJ |  
Area, delay, energy of a 1-bit full adder | $a_1, \tau_1, E_1$ | nm², ps, fJ |  
Area, delay, energy of an n-bit full ripple-carry adder | $a_{add}, \tau_{add}, E_{add}$ | nm², ps, fJ |  
Area, delay, energy of a synapse | $a_{syn}, \tau_{syn}, E_{syn}$ | nm², ps, fJ |  
Area, delay, energy of a neuron | $a_{neu}, \tau_{neu}, E_{neu}$ | nm², ps, fJ |  
Area, delay, energy of a compute workload | $a_{CW}, \tau_{CW}, E_{CW}$ | nm², ps, fJ |  
Area, delay, energy of the whole chip | $a_{ch}, \tau_{ch}, E_{ch}$ | nm², ps, fJ |  
Width of a digital transistor | $w_{dt}$ | nm | $4F$ 
Width of an analog transistor | $w_{at}$ | nm | $16F$ 
Capacitance of the transistor per unit width | $c_{tran}$ | F/m |  
On- and off-current in the transistor per unit width | $i_{on}, i_{off}$ | A/m |  
Saturation voltage of a transistor | $V_{sat}$ | V | 0.3 
Linear transconductance of a transistor | $g_{mdt}$ | S |  
On-state resistance of a transistor | $R_{ondt}$ | Ω |  
Supply voltage | $V_{cc}$ | V | 0.8 
Capacitance of an interconnect per length | $c_{ic}$ | nF/m | 0.5 
Capacitance of a minimum interconnect | $C_{ic}$ | F | $c_{ic} l_{ic}$ 
Resistance of an interconnect per length | $r_{ic}$ | GΩ/m | 2.2 
Resistance of a minimum interconnect | $R_{ic}$ | Ω | 667 
Load capacitance for an interconnect | $C_{load}$ | F |  
Effective resistance of a synapse | $R_{eff}$ | Ω |  
Voltage for a sense amplifier | $V_{sa}$ | V | 0.4 
Transistor width for a sense amplifier | $w_p, w_n, w_{iso}, w_{en}$ | nm | $\{4, 4, 6.5, 5\}F$ 
Sense difference and cell read voltages for a voltage sense amplifier | $V_{vsa}, V_{rvsa}$ | V | {0.1, 0.5} 
Analog read circuit row voltage | $V_{row}$ | V | 0.65 
Analog read pulse | $\tau_{repu}$ | ns | 1 
Width of input, pull-up, output transistors in OTA | $w_{in}, w_{up}, w_{out}$ | nm | $\{10, 5, 10\}F$ 
Current from a neuron | $I_{neu}$ | A |  
Factor of extra number of synapses in CeNN | $M_{syncnn}$ |  | 4 
Factor of extra settling time in CeNN | $M_{stepcnn}$ |  | 5 
Maximum weight value for CNN | $w_{max}$ |  | 0.23 
Average sum of the weights in CeNN cell | $w_{sum}$ |  | 1.26 
Factor of spike duration | $N_{spi}$ |  | 3 
Factor of spacing between spikes | $N_{spa}$ |  | 3 
Number of spikes for a neuron to fire | $N_{fire}$ |  | 10 
Spiking activity ratio | $r_a$ |  |  
Number of oscillator periods till synchronization | $N_{synch}$ |  | 30 
Area overhead factor for a synapse | $M_{syn}$ |  | 2 
Area overhead factor for a neuron | $M_{neu}$ |  | 2 
Area overhead factor for a core | $M_{cor}$ |  | 2 
Area overhead factor for the whole chip | $M_{ch}$ |  | 2 
Cores per chip; value for a nominal chip | $c_{ch}$ |  | 64 
Neurons per core; value for a nominal chip | $n_{cor}$ |  | 256 
Number of input neurons per core | $n_{in}$ |  |  
Number of output neurons per core | $n_{out}$ |  |  
Synapses per neuron; value for a nominal chip | $s_{neu}$ |  | 256 
Synapses per core | $s_{cor}$ |  |  
Number of feature maps in a stage | $f_{st}$ |  |  
 
3. Types of Neuromorphic Devices 
Synapses and neurons can be implemented by a variety of devices (Table 2): 
• Digital CMOS and analog CMOS or tunnel FET (TFET) devices.  
• Ferroelectric FET (FEFET) devices. 
• Spintronic devices [27] of five types: in-plane and perpendicular spin transfer torque 
(STT) switches with perpendicular magnetic anisotropy, spin orbit torque (SOT) 
switches, domain wall (DW) motion device, and magnetoelectric (ME) switched device. 
• Resistive memory elements: oxide RRAM, floating gate resistors (flash), phase-change 
memory, general spintronic and higher-resistance spin-orbit torque resistors 
(implemented as magnetic tunnel junctions, MTJ), ferroelectric resistors. 
 
 
 
 
Table 2. PARAMETERS FOR DEVICES COMPRISING SYNAPSES AND NEURONS. 
   
 
To determine the area, delay, and energy of devices and circuits, we rely on our benchmarking 
methodology [8,9] developed for digital circuits. The benchmarks are calculated consistently for 
multiple devices scaled to the process node size [8], $F = 15$ nm. We first determine the values for 
an ‘intrinsic device’ (i.e. a transistor, or a nanomagnet, Table 2) and then for simple circuits [9]. 
For digital technologies we assume that synapses comprise 8-bit registers. For analog 
technologies we will assume that synapses are accurately set to 64 levels. Understanding that 
these two are not equivalent in terms of precision, we choose to keep the typical value of 8 for 
digital precision. We also assume that the accuracy in analog networks is not limited by the 
number of levels. We take the best case by ignoring non-idealities of the synapse device 
characteristics. Realistic cases in which device characteristics affect accuracy are considered e.g. 
in [29]. 
 
 
Figure 2. Schemes of neurons implemented with CMOS and beyond-CMOS devices. 
 
Digital CMOS.  
The first kind of digital NN is based on SRAM synapses that only provide a weight, while the 
multiplication and summation (MAC) operations are performed consecutively in the neuron [29]. 
The circuit considered here follows that in [22]: a synapse consists of n bits of an SRAM register 
and a state element; a neuron consists of two n-bit registers, an n-bit adder, n NAND gates, n 
inverters, and three n-state elements. Therefore the areas of the synapse and the neuron are the sums 
of the areas of the above constituent circuits. The delay and energy are mostly expended in the 
neuron, but some of the contributions are proportional to the number of synapses. Therefore such 
contributions are included in the equations for synapses below. 
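$a_{syn} = n_b a_{reg}$   (2) 
$\tau_{syn} = 3\tau_{reg} + \tau_{se} + 4\tau_{nan} + \tau_{inv} + n_b \tau_1$   (3) 
$E_{syn} = n_b \left(3E_{reg} + E_{se} + 4E_{nan} + E_{inv} + E_1\right)$   (4) 
$a_{neu} = n_b \left(2a_{reg} + a_{inv} + a_{nan} + a_{se} + a_1\right)$   (5) 
$\tau_{neu} = 2\tau_{reg} + \tau_{se} + 3\tau_{nan} + \tau_{inv} + n_b \tau_1$   (6) 
$E_{neu} = n_b \left(2E_{reg} + E_{se} + 3E_{nan} + E_{inv} + E_1\right)$   (7) 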
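As an illustration, the digital CMOS estimates can be sketched in Python, reading Eqs. (2)-(7) as sums over the constituent circuits with the coefficients 3, 4 (synapse) and 2, 3 (neuron) attached to the register and NAND terms. The `Metrics` container, the function names, and the unit placeholder values are assumptions made for this sketch; they are not device data from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    a: float    # area, nm^2
    tau: float  # delay, ps
    E: float    # energy, fJ


def sram_synapse(n_b, reg, se, nan, inv, add1):
    """Eqs. (2)-(4): contributions attributed to one SRAM synapse."""
    a = n_b * reg.a
    tau = 3 * reg.tau + se.tau + 4 * nan.tau + inv.tau + n_b * add1.tau
    E = n_b * (3 * reg.E + se.E + 4 * nan.E + inv.E + add1.E)
    return Metrics(a, tau, E)


def sram_neuron(n_b, reg, se, nan, inv, add1):
    """Eqs. (5)-(7): digital CMOS neuron."""
    a = n_b * (2 * reg.a + inv.a + nan.a + se.a + add1.a)
    tau = 2 * reg.tau + se.tau + 3 * nan.tau + inv.tau + n_b * add1.tau
    E = n_b * (2 * reg.E + se.E + 3 * nan.E + inv.E + add1.E)
    return Metrics(a, tau, E)


# Unit placeholders (NOT real circuit values) make the weighting visible.
unit = Metrics(a=1.0, tau=1.0, E=1.0)
syn = sram_synapse(8, unit, unit, unit, unit, unit)
neu = sram_neuron(8, unit, unit, unit, unit, unit)
```

With unit inputs and $n_b = 8$, the results simply count how many constituent circuits enter each benchmark; real estimates would substitute the circuit values obtained from the methodology of [8,9].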
The performance estimate and parameters (Table 1) for a sense amplifier follow [19]. It is used 
as a part of a reading circuit for SRAM memories. The quantities per bit below are added to the 
corresponding neuron estimates. The area of the sense amplifier is 
$a_{sa} = a_{inv1} \left(w_n + w_p + w_{iso} + w_{en}\right) / w_{dt}$   (8) 
The transconductance and load capacitance of the sense amplifier are 
$g_{msa} = g_{mdt} \left(w_p + w_n\right) / w_{dt}$   (9) 
$C_{lsa} = c_{tran} \left(w_p + w_n\right)$   (10) 
and its delay and energy are 
$\tau_{sa} = \log\left(V_{cc}/V_{sa}\right) C_{lsa}/g_{msa} + n_b \tau_1$   (11) 
$E_{sa} = C_{lsa} V_{cc}^2$   (12) 
where the second term in the delay corresponds to the time to enable the sense amp and is 
proportional to the clock time.  
The performance estimate and parameters (Table 1) for a voltage sense amplifier follow [19]. It 
is used as a part of a reading circuit for digital resistive memories. The quantities per bit below 
are added to the corresponding neuron estimates. It comprises 3 n-type and 3 p-type 
minimum width transistors. 
$a_{vsa} = 6 a_{inv1}$   (13) 
The pre-charge resistance, the sense input capacitance, and the bit line capacitance are 
$R_{pch} = R_{ondt}$   (14) 
$C_{si} = 2 c_{tran} w_{dt}$   (15) 
$C_{li} = s_{neu} c_{ic} l_{ic}$   (16) 
and the delay and energy are 
$\tau_{vsa} = 2.3 R_{pch} C_{si} + V_{vsa} \left(C_{si} + C_{li}\right) / \left(V_{rvsa}/R_{on} - V_{rvsa}/R_{off}\right) + 2 n_b \tau_1$   (17) 
$E_{vsa} = C_{si} V_{cc}^2$   (18) 
 
Digital MAC. 
Another kind of digital NN contains a multiplier and an adder in every synapse, so that the MAC 
operation is performed in the synapse [66]. The role of neurons is summation of partial results 
and application of the activation function. 
$a_{syn} = \left(n_b + 1\right) a_{add} + a_{se}$   (19) 
$\tau_{syn} = \tau_{add} + \tau_{se}$   (20) 
$E_{syn} = \left(n_b + 1\right) E_{add} / 2 + E_{se}$   (21) 
$a_{neu} = 2 a_{add} + a_{se} + n_b a_{ram}$   (22) 
$\tau_{neu} = 2 \tau_{add} + \tau_{se} + \tau_{ram}$   (23) 
$E_{neu} = 2 E_{add} + E_{se} + n_b E_{ram}$   (24) 
The factor $n_b$ in energy and delay would correspond to simple ripple carry adders and 
multipliers based on them. More efficient designs of adders and multipliers (e.g. carry-save 
adders) are accounted for by an additional factor of 1/2. 
 
Figure 3. Schemes of synapses implemented with CMOS and beyond-CMOS devices. 
 
Analog CMOS.  
We assume a cell similar to that in [22], where a neuron consists of an opamp, a current source, 
and a threshold function circuit; a synapse consists of 2 operational transconductance amplifiers 
(OTA), see also [30]. Transistors of various widths are used, Table 1. The effective capacitance of 
the cell is dominated by the capacitance of the two OTAs  
$C_f = 4 c_{tran} w_{out}$   (25) 
The subthreshold swing of a transistor is  
$SS = V_{sat} / \log_{10}\left(i_{on}/i_{off}\right)$ 
The bias current is approximated as the geometric average of the on- and off-state currents: 
$I_b = \sqrt{i_{on} i_{off}}\, w_{in}$   (26) 
The transconductance of an OTA is 
$g_{mOTA} = \frac{I_b \ln 10}{SS} \frac{w_{out}}{w_{up}}$   (27) 
The output conductance of two OTAs is determined by 
$G_m = 2 g_{mOTA} / w_{max}$   (28) 
The effective resistance of the cell (with a factor of 2x for the nonlinearity of OTA and 2x to 
ensure output stability) is 
$R_f = 4 / G_m$   (29) 
Then the opamp driving current is  
$I_{opamp} = V_{cc} / R_f$   (30) 
and the OTA current is 
[Equation (31): $I_{OTA}$ expressed through $I_b$, $w_{out}$, $w_{up}$, $w_{sum}$, and $w_{max}$]   (31) 
Thus benchmarks for the synapse and the neuron are 
$a_{syn} = 2 a_{inv4} \left(w_{in} + w_{in} + w_{in}\right) / w_{idt}$   (32) 
$\tau_{syn} = 4.8 R_f C_f$   (33) 
$P_{syn} = V_{cc} I_{OTA}$   (34) 
$E_{syn} = P_{syn} \tau_{syn}$   (35) 
$a_{neu} = 3 a_{inv4} \left(w_{in} + w_{in} + w_{in}\right) / w_{idt}$   (36) 
$\tau_{neu} = \tau_{syn}$   (37) 
$P_{neu} = V_{cc} \left(I_{opamp} + I_{OTA}\right)$   (38) 
$E_{neu} = P_{neu} \tau_{neu}$   (39) 
where the area of standard cells in the circuit is approximated as that of fan-out-4 inverters.  
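The chain of analog CMOS estimates can be traced numerically with a short Python sketch. It assumes the reading of Eqs. (25)-(30) and (33) as $C_f = 4 c_{tran} w_{out}$, $I_b = \sqrt{i_{on} i_{off}}\, w_{in}$, $g_{mOTA} = (I_b \ln 10 / SS)(w_{out}/w_{up})$, $G_m = 2 g_{mOTA}/w_{max}$, $R_f = 4/G_m$, $I_{opamp} = V_{cc}/R_f$, and $\tau_{syn} = 4.8 R_f C_f$. The transistor currents and capacitance passed in at the end are hypothetical round numbers, not values from Table 2.

```python
import math

# Values taken from Table 1.
F = 15e-9                                 # process node size, m
V_SAT = 0.3                               # saturation voltage, V
V_CC = 0.8                                # supply voltage, V
W_IN, W_UP, W_OUT = 10 * F, 5 * F, 10 * F  # OTA transistor widths, m
W_MAX = 0.23                              # maximum weight value


def analog_cell(i_on, i_off, c_tran):
    """Trace Eqs. (25)-(30) and (33) for per-width currents (A/m) and capacitance (F/m)."""
    ss = V_SAT / math.log10(i_on / i_off)           # subthreshold swing, V/decade
    i_b = math.sqrt(i_on * i_off) * W_IN            # bias current, A
    g_mota = i_b * math.log(10) / ss * (W_OUT / W_UP)
    g_m = 2 * g_mota / W_MAX                        # output conductance of two OTAs
    r_f = 4 / g_m                                   # effective cell resistance
    c_f = 4 * c_tran * W_OUT                        # effective cell capacitance
    return {
        "SS": ss, "I_b": i_b, "g_mOTA": g_mota, "G_m": g_m,
        "R_f": r_f, "C_f": c_f,
        "I_opamp": V_CC / r_f,
        "tau_syn": 4.8 * r_f * c_f,
    }


# Hypothetical transistor parameters (assumptions for illustration only).
cell = analog_cell(i_on=1e3, i_off=0.1, c_tran=1e-9)
```

For these placeholder inputs the swing is 75 mV/decade and the effective resistance comes out near 5 kΩ, which gives a synapse delay on the order of tens of picoseconds.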
The performance estimate and parameters (Table 1) for an analog read circuit follow [19]. It is 
used as a part of a reading circuit for analog valued resistive memories. The quantities per analog 
cell below are added to the corresponding neuron estimates. It comprises several circuits equal in 
area to 32 standard inverter cells: 
$a_{adr} = 32 a_{inv1}$   (40) 
The column voltage is 
$V_{col} = V_{row} - V_{rvsa}$   (41) 
and the delay, power, and energy are 
$\tau_{adr} = \tau_{repu} + 2 n_b \tau_1$   (42) 
$P_{adr} = 25 V_{col}^2 / R_{ondt}$   (43) 
$E_{adr} = P_{adr} \tau_{adr}$   (44) 
Analog spintronic and ferroelectric devices.  
Both synapses and neurons consist of just one intrinsic device. Spintronic synapses and neurons 
have been proposed in [27], as well as ones based on magnetic tunnel junctions [31] or 
magnetoelectric switching [32]; see overview [33]. We assume the supply voltage to be 0.1V for 
all spintronic devices. Ferroelectric synapses were explored in [34,35]. 
These analog neurons and synapses have greater size, delay, and energy in proportion to the 
number of analog levels:  
$a_{syn} = n_l a_{dev}$   (45) 
$\tau_{syn} = \tau_{dev}$   (46) 
$E_{syn} = E_{dev}$   (47) 
$a_{neu} = n_l a_{dev}$   (48) 
$\tau_{neu} = n_l \tau_{dev} / 4$   (49) 
$E_{neu} = n_l E_{dev}$   (50) 
Resistive memories.   We will use this term synonymously with ‘memristor’. Resistive elements 
are used here as analog memory with multiple levels of resistance in a single cell. Various types 
of resistive elements, such as oxide memristors [36,37], floating gate transistors (“flash”) 
[38,39], spintronic devices [40,41], have been proposed for neural networks. 
In inference, the weights are not modified, therefore the characteristics of switching resistive 
memories are not relevant, but only their on and off resistances are. We assume characteristic on- 
and off-resistances in various resistive memory cells, Table 2. The parameters contributed from 
the memory cell per se are 
$I_{on} = V_{cc} / R_{on}$   (51) 
$I_{off} = V_{cc} / R_{off}$   (52) 
$a_{syn} = a_{dev}$   (53) 
The intrinsic capacitance of the synapse is of the order of that in a minimum interconnect, and 
the delay of synapses is determined by the upper bound of synapse resistance set at  
$R_{eff} = R_{on} n_l$   (54) 
so 
$\tau_{syn} = 2.3 R_{eff} C_{ic}$   (55) 
$E_{syn} = I_{on} V_{cc} \tau_{syn}$   (56) 
Another contribution comes from interconnects in the core and is described in Section 5. 
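A short Python sketch can make the resistive synapse estimate concrete, reading Eqs. (51)-(56) as $I_{on} = V_{cc}/R_{on}$, $R_{eff} = R_{on} n_l$, $\tau_{syn} = 2.3 R_{eff} C_{ic}$, and $E_{syn} = I_{on} V_{cc} \tau_{syn}$. The constants are taken from Table 1; the on- and off-resistances in the example call are hypothetical, since Table 2 is not reproduced here.

```python
# Values taken from Table 1.
F = 15e-9              # process node size, m
L_IC = 20 * F          # minimum interconnect length, m
C_IC_PER_M = 0.5e-9    # interconnect capacitance per length, F/m
V_CC = 0.8             # supply voltage, V
N_L = 64               # levels in an analog synapse


def resistive_synapse(r_on, r_off):
    """Eqs. (51)-(56) for a resistive memory cell with resistances in ohms."""
    c_ic = C_IC_PER_M * L_IC        # capacitance of a minimum interconnect, F
    i_on = V_CC / r_on              # Eq. (51)
    i_off = V_CC / r_off            # Eq. (52)
    r_eff = r_on * N_L              # Eq. (54): upper bound of synapse resistance
    tau_syn = 2.3 * r_eff * c_ic    # Eq. (55): RC delay, s
    e_syn = i_on * V_CC * tau_syn   # Eq. (56): energy, J
    return {"C_ic": c_ic, "I_on": i_on, "I_off": i_off,
            "R_eff": r_eff, "tau_syn": tau_syn, "E_syn": e_syn}


# Hypothetical cell resistances (assumptions, not Table 2 values).
res = resistive_synapse(r_on=1e5, r_off=1e7)
```

With a 100 kΩ on-resistance the minimum interconnect capacitance of 0.15 fF gives a delay of about 2.2 ns and an energy of about 14 fJ per synapse operation, before the interconnect contributions of Section 5 are added.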
 
 
Figure 4. Schemes of oscillators used as synapses. 
 
4. Types of Neural Networks 
We classify neural networks into 4 types according to the nature of signals used, Figure 5. 
a) Artificial neural networks (ANN), where outputs switch in response to inputs in a mostly 
monotonic fashion.  
b) Cellular neural networks (CeNN) differ from ANN by their rectangular grid geometry and 
high connectivity. Here they are treated in agreement with [22] based on the methodology 
in [9].  
c) Spiking neural networks (SNN) receive trains of spikes at inputs. Synapses route spikes 
towards neurons, and neurons fire output spikes depending on their timing.  
d) Oscillator neural networks (ONN) determine the degree of pattern matching from the 
synchronization of oscillators in the array. 
 
Figure 5. Schemes of the four types of neural networks considered in this paper. 
 
Neural networks can be created with various combinations of neurons and synapses, Table 3. 
The first three letters of the label (“ANN”, “CNN”, “Spi”, “Osc”) designate the type of a neural 
network, the next set of letters (up to three) designates the type of neurons, and the last set of 
letters (up to four) designates the type of synapses. In neural networks, the synapse and neuron 
circuits require tens of transistors. Alternatively, single spintronic devices are capable of 
implementing synapses and neurons [27].  
Table 3. LABELS FOR DEVICES/ARCHITECTURE COMBINATIONS 
Neuron | Synapse | ANN, CNN, SNN + … | ONN 
Digital CMOS | Digital CMOS 6T SRAM | DiCSRAM |  
Digital CMOS | Digital CMOS MAC | DiCCMAC |  
Digital CMOS | Oxide memristor digital | DiCOxme |  
Digital TFET | Digital TFET MAC | DiTTMAC |  
Digital CMOS | FEFET digital | DiCFETb |  
Digital CMOS | Spin-transfer torque digital | DiCSTTb | OscSTT 
Digital CMOS | Spin-orbit digital | DiCSOTb | OscSOT 
Analog CMOS | Analog CMOS | AnCAnC | OscMOSring 
Analog TFET | Analog TFET | AnTAnT | OscTFEring 
Analog CMOS | Ferroelectric FET | AnCFET | OscPiezo 
Analog CMOS | Oxide memristor | AnCOxme | OscOxide 
Analog CMOS | Floating gate | AnCFlGa |  
Analog CMOS | PCM | AnCPCM |  
Ferroelectric FET | Ferroelectric FET | FETFET |  
Domain wall | Domain wall | DoWDoW |  
Spin-orbit torque | Spin-orbit analog | SOTSOTa |  
Magnetoelectric | Magnetoelectric | MEME | OscME 
 
We will adopt the same synapses across the ANN, CeNN, and SNN classes, though neurons will 
be different. 
ANN.  
This is the default case; we directly use the estimates for the synapses and neurons obtained in 
the previous section. 
CeNN.  
We follow the treatment of cellular neural networks in [22]. Application of CeNN to CoNN was 
considered in [42]. Due to both feedback and feedforward connections in a CeNN and due to 
more connections than just nearest neighbors, the number of synapses is doubled. Also it takes a 
longer time for CeNN networks to settle to the steady state due to a larger number of connections 
[22]. This delay depends on the input patterns; we take estimated average values. Therefore 
$a_{syn} = M_{syncnn} a_{syn,ann}$   (57) 
$\tau_{syn} = M_{stepcnn} M_{syncnn} \tau_{syn,ann}$   (58) 
$E_{syn} = M_{stepcnn} M_{syncnn} E_{syn,ann}$   (59) 
$a_{neu} = a_{neu,ann}$   (60) 
$\tau_{neu} = M_{stepcnn} \tau_{neu,ann}$   (61) 
$E_{neu} = M_{stepcnn} E_{neu,ann}$   (62) 
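The CeNN scaling can be written as a small Python helper, reading Eqs. (57)-(62) as multiplications of the ANN estimates by the Table 1 factors $M_{syncnn} = 4$ and $M_{stepcnn} = 5$. The function name and the (area, delay, energy) tuple layout are assumptions of this sketch.

```python
M_SYNCNN = 4   # factor of extra number of synapses in CeNN (Table 1)
M_STEPCNN = 5  # factor of extra settling time in CeNN (Table 1)


def cenn_from_ann(syn_ann, neu_ann):
    """Scale ANN (area, delay, energy) tuples to CeNN values, Eqs. (57)-(62)."""
    a_s, tau_s, e_s = syn_ann
    a_n, tau_n, e_n = neu_ann
    syn = (M_SYNCNN * a_s,
           M_STEPCNN * M_SYNCNN * tau_s,
           M_STEPCNN * M_SYNCNN * e_s)
    neu = (a_n, M_STEPCNN * tau_n, M_STEPCNN * e_n)
    return syn, neu


# Unit placeholders (not device data) expose the net scaling factors.
syn_cenn, neu_cenn = cenn_from_ann((1.0, 1.0, 1.0), (1.0, 1.0, 1.0))
```

With unit inputs the synapse delay and energy grow by a factor of 20 and the neuron delay and energy by a factor of 5, while the neuron area is unchanged.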
Neural network parameters related to Hebbian learning are based on the synaptic weight 
information: the maximum weight value obtained from the training weights, and the average 
summation of the weights per cellular cell [22], Table 1. 
Figure 6. Approximate wave forms in a spiking neural network. a) The spike separation is longer 
than the spike duration; b) Multiple synapse spikes are required for a neuron to fire [65]. 
SNN.  
We introduce factors (Table 1) relating the spike duration to the device delay and relating the 
time spacing between spikes to the spike duration, Figure 6. 
 
With these factors the estimates for SNN become 
$\tau_{syn} = \tau_{syn,ann} N_{spi} N_{spa}$   (63) 
$E_{syn} = E_{syn,ann} N_{spi}$   (64) 
$\tau_{neu} = \tau_{neu,ann} N_{spi} N_{spa} N_{fire}$   (65) 
$E_{neu} = E_{neu,ann} N_{spi} N_{fire}$     (rate coded) (66) 
$E_{neu} = E_{neu,ann} N_{spi}$            (temporal coded) (67) 
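The SNN scaling can be sketched in Python in the same way, reading Eqs. (63)-(67) as products of the ANN estimates with the Table 1 factors $N_{spi} = 3$, $N_{spa} = 3$, and $N_{fire} = 10$. The function name and the `coding` argument are illustrative assumptions.

```python
N_SPI = 3    # factor of spike duration (Table 1)
N_SPA = 3    # factor of spacing between spikes (Table 1)
N_FIRE = 10  # number of spikes for a neuron to fire (Table 1)


def snn_from_ann(tau_syn, e_syn, tau_neu, e_neu, coding="rate"):
    """Scale ANN delay and energy estimates to SNN values, Eqs. (63)-(67)."""
    if coding not in ("rate", "temporal"):
        raise ValueError("coding must be 'rate' or 'temporal'")
    # Rate coding needs N_FIRE spikes per firing; temporal coding needs one.
    fire_factor = N_FIRE if coding == "rate" else 1
    return {
        "tau_syn": tau_syn * N_SPI * N_SPA,
        "E_syn": e_syn * N_SPI,
        "tau_neu": tau_neu * N_SPI * N_SPA * N_FIRE,
        "E_neu": e_neu * N_SPI * fire_factor,
    }


# Unit placeholders (not device data) expose the net scaling factors.
rate = snn_from_ann(1.0, 1.0, 1.0, 1.0, coding="rate")
temporal = snn_from_ann(1.0, 1.0, 1.0, 1.0, coding="temporal")
```

The two coding schemes differ only in the neuron energy, which is 10 times lower for temporal coding under these factors.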
 
Figure 7. Two types of spiking NN: rate coded and temporal coded. 
Note that it takes a different number of spikes arriving at a neuron from synapses to make it fire 
for the cases of rate coding or temporal coding of the signal, Figure 7. We also account for the 
spiking activity, i.e., the probability of a synapse producing a spike in a given spiking interval. 
We incorporate the empirical trend that the spiking activity in an SNN decreases in the later stages 
of a DNN or CoNN as $r_a = 1/n_{stage}$, where $n_{stage}$ is the stage number [43]. 
ONN.  
The area of oscillators is typically larger because they contain multiple instances of simple gates. 
$a_{syn} = 10 a_{syn,ann}$   (68) 
$a_{neu} = 30 a_{neu,ann}$   (69) 
The frequency of transistor-based ring oscillators is determined by the product of the number of 
inverters (chosen here to be 5) and an average delay in an inverter. The frequency of spintronic 
oscillators empirically proves to be several times higher than the inverse switching time of a logic 
device. The average power is proportional to that of a logic device.   
$f_{osc} = 0.1 / \tau_{inv4}$          (for transistor oscillators) (70) 
$P_{osc} = 3 E_{int} / \tau_{int}$   (for transistor oscillators) (71) 
$f_{osc} = 6 / \tau_{neu,ann}$        (for spintronic oscillators) (72) 
$P_{osc} = 6 E_{neu,ann} / \tau_{neu,ann}$     (for spintronic oscillators) (73) 
$f_{osc} = 1 / \tau_{neu,ann}$         (for piezo oscillators) (74) 
$P_{osc} = 3 E_{neu,ann} / \tau_{neu,ann}$     (for piezo oscillators) (75) 
The operation of the ONN synapse is limited by the synchronization time of the oscillators which 
takes several periods of oscillations, Table 1. Thus the ONN benchmarks are 
$\tau_{syn} = N_{synch} / f_{osc}$   (76) 
$E_{syn} = P_{osc} \tau_{syn}$   (77) 
$\tau_{neu} = \tau_{syn}$   (78) 
$E_{neu} = E_{syn}$   (79) 
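For a transistor ring oscillator, the ONN estimates can be combined in a few lines of Python, reading Eqs. (70), (71), (76), and (77) as $f_{osc} = 0.1/\tau_{inv4}$, $P_{osc} = 3E_{int}/\tau_{int}$, $\tau_{syn} = N_{synch}/f_{osc}$, and $E_{syn} = P_{osc}\tau_{syn}$. The input delays and energy in the example are hypothetical values, not results from the paper.

```python
N_SYNCH = 30  # oscillator periods till synchronization (Table 1)


def ring_oscillator_onn(tau_inv4, e_int, tau_int):
    """ONN synapse benchmarks for a 5-inverter ring oscillator (SI units)."""
    f_osc = 0.1 / tau_inv4       # Eq. (70): one period spans 10 inverter delays
    p_osc = 3 * e_int / tau_int  # Eq. (71): average oscillator power, W
    tau_syn = N_SYNCH / f_osc    # Eq. (76): synchronization time, s
    e_syn = p_osc * tau_syn      # Eq. (77): energy until synchronization, J
    return f_osc, p_osc, tau_syn, e_syn


# Hypothetical inverter delay and intrinsic device values (assumptions).
f_osc, p_osc, tau_syn, e_syn = ring_oscillator_onn(
    tau_inv4=2e-12, e_int=1e-17, tau_int=1e-12)
```

The synchronization time equals 300 inverter delays under these factors, which illustrates why ONN synapse delay is long compared with a single logic operation.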
 
5. Treatment of interconnects 
The benchmarks for neural network elements, neural gates, and larger DNNs are built up 
hierarchically, from benchmarks for a synapse and a neuron obtained in the previous section. We 
refer to it as ‘bottoms-up benchmarking’. 
The chip comprises a number of neural cores with multiple neurons in each and multiple 
synapses feeding signals into each core. The total number of synapses per chip is thus 
$s_{ch} = c_{ch} n_{cor} s_{neu}$   (80) 
Empirical factors are introduced to account for layout overhead: spacing between circuits for 
interconnects, routing circuits, intermediate registers, encoders/decoders etc. To obtain the 
corrected area, the estimated area is multiplied by additional layout overhead factors, Table 1. 
For a certain workload only a share $r_a$ of synapses may be active. The area of the chip is then  
$a_{ch} = M_{ch} c_{ch} M_{cor} n_{cor} \left(M_{neu} a_{neu} + s_{neu} M_{syn} a_{syn}\right)$   (81) 
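The chip-level counts for the nominal chip of Table 1 can be checked with a short Python sketch, reading Eq. (80) as $s_{ch} = c_{ch} n_{cor} s_{neu}$ and Eq. (81) as $a_{ch} = M_{ch} c_{ch} M_{cor} n_{cor} (M_{neu} a_{neu} + s_{neu} M_{syn} a_{syn})$. The function names are assumptions, and the unit areas in the example are placeholders rather than device data.

```python
# Nominal chip parameters and overhead factors from Table 1.
C_CH = 64     # cores per chip
N_COR = 256   # neurons per core
S_NEU = 256   # synapses per neuron
M_SYN = M_NEU = M_COR = M_CH = 2  # area overhead factors


def synapses_per_chip(c_ch=C_CH, n_cor=N_COR, s_neu=S_NEU):
    """Eq. (80): total number of synapses on the chip."""
    return c_ch * n_cor * s_neu


def chip_area(a_neu, a_syn, c_ch=C_CH, n_cor=N_COR, s_neu=S_NEU):
    """Eq. (81): chip area with layout overhead at every hierarchy level."""
    per_neuron = M_NEU * a_neu + s_neu * M_SYN * a_syn
    return M_CH * c_ch * M_COR * n_cor * per_neuron


s_ch = synapses_per_chip()
a_ch_unit = chip_area(a_neu=1.0, a_syn=1.0)  # unit areas, placeholders only
```

The nominal chip holds about 4.2 million synapses. With equal unit areas for a synapse and a neuron, synapses account for 512 of the 514 area units per neuron, so the chip area is dominated by synapses unless the neuron circuit is much larger.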
The operation of an interconnect is mainly determined by the interconnect capacitance per unit 
length. The energy to charge an interconnect is 
$E_{ic} = c_{ic} l V^2$   (82) 
where the length of an interconnect from a circuit block to the next block is calculated as 
$l = \sqrt{a_{circ}}$   (83) 
The area of the relevant circuit block is: for synapses, $a_{circ,syn} = a_{syn} s_{cor}$, set by the requirement to 
deliver the synapse output within the area of the core; for neurons, $a_{circ,neu} = a_{ch}$, set by the 
requirement to deliver the output signal of a neuron to any part of the chip.  
 
Figure 8. Energy per bit vs. distance in TrueNorth [4]. 
A geometry calculation with a low-k interlayer dielectric results in $c_{ic} \approx 10^{-10}$ F/m for 20nm 
wire width. Energy and capacitance vs. distance for an actual NN chip, TrueNorth [4], is shown 
in Figure 8. With a voltage of 1V, the energy of a spike is 8pJ for 15mm of interconnect length, 
which implies that $c_{ic} \approx 5 \times 10^{-10}$ F/m. Therefore transmitting a bit over an interconnect 
in neural networks takes 5 times more energy than the ideal case, i.e. just 
charging the interconnect capacitance. This empirical factor of 5 is incorporated into further 
estimates.  
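The arithmetic behind the factor of 5 can be reproduced in a few lines of Python, using the relation $E_{ic} = c_{ic} l V^2$ of Eq. (82) and the TrueNorth numbers quoted in the text. The variable names are chosen for this sketch only.

```python
# Numbers quoted in the text for TrueNorth [4].
E_SPIKE = 8e-12     # energy of a spike, J
LENGTH = 15e-3      # interconnect length, m
VOLTAGE = 1.0       # voltage, V

# Ideal capacitance per length from the geometry calculation.
C_IDEAL = 1e-10     # F/m

# Invert Eq. (82), E = c * l * V^2, to get the effective capacitance per length.
c_effective = E_SPIKE / (LENGTH * VOLTAGE ** 2)

# Ratio of the measured energy cost to the ideal charging energy.
overhead_factor = c_effective / C_IDEAL
```

The effective capacitance evaluates to about 5.3 × 10⁻¹⁰ F/m, consistent with the rounded factor of 5 and with the value of 0.5 nF/m listed in Table 1.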
The delay in a core-wide interconnect is dominated by the RC-delay in wires connecting 
synapses and neurons: 
$\tau_{cic} = \left(0.38 R_{ic} C_{ic} + R_{eff} C_{ic} + R_{ic} C_{load}\right) l / l_{ic}$.  (84) 
The delay of charging a global, chip-wide interconnect is 
$\tau_{gic} = \frac{c_{ic} l V}{I_{neu}} = \frac{E_{ic}}{I_{neu} V}$.  (85) 
   .  (85) 
The delay and energy of a core-wide interconnect are added to those of a synapse. The energy 
and delay of a chip-wide interconnect are added to those of a neuron. 
 
6. Chip-level benchmarks 
The operation of the chip involves signals coming from input neurons, processed in synapses, 
and then firing of output neurons. A synaptic operation (synaptic event) is understood in 
non-spiking networks as an operation of multiplication of an input signal by a weight, i.e., 
“multiply-and-accumulate” (MAC). However in spiking networks, a synaptic event is mostly understood as 
“A spike event is a synaptic operation evoked when one action potential is transmitted through 
one synapse” according to [44]. There may be multiple spikes required to fire a neuron, i.e. 
multiple synaptic operations correspond to one MAC. Therefore the firing rate is conventionally 
defined differently for spiking, SNN, (to keep it consistent with definitions in [4]) and 
non-spiking, ANN, CeNN, and ONN, networks: 
$f_{fire} = 1 / \tau_{syn}$   (non-spiking) (86) 
$f_{fire} = 1 / \left(r_a s_{neu} \tau_{syn}\right)$   (spiking) (87) 
The time step in a chip corresponds to the operation of one stage of a neural network. It consists 
of the time for enough synaptic inputs to arrive at the neuron to make it fire plus the delay in the 
neuron itself. 
$\tau_{step} = 1 / f_{fire} + \tau_{neu}$   (88) 
The total energy per synaptic event, contributed by a synapse and a share of the neuron energy, is 
$E_{syntot} = E_{syn} + E_{neu} / \left(r_a s_{neu}\right)$   (89) 
Publications normally quote the throughput of synaptic operations per second (SOPS) (not to 
be confused with throughput of inferences, see below) 
$T_{syn} = f_{fire} r_a s_{ch}$   (90) 
The dissipated power is  
$P_{ch} = T_{syn} E_{syntot}$   (91) 
Then the energy per time step is  
$E_{ch} = P_{ch} \tau_{ch}$   (92) 
To compare benchmarks between actual chips and bottoms-up estimates for various neural 
networks, we calculate the performance of the latter for a nominal chip with parameters, Table 1.  
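The chip-level benchmarks can be assembled in a short Python sketch, reading Eqs. (86)-(91) as $f_{fire} = 1/\tau_{syn}$ (non-spiking) or $1/(r_a s_{neu} \tau_{syn})$ (spiking), $\tau_{step} = 1/f_{fire} + \tau_{neu}$, $E_{syntot} = E_{syn} + E_{neu}/(r_a s_{neu})$, $T_{syn} = f_{fire} r_a s_{ch}$, and $P_{ch} = T_{syn} E_{syntot}$. The function name and the synapse and neuron values in the example are hypothetical; only the nominal chip size comes from Table 1.

```python
S_NEU = 256            # synapses per neuron, nominal chip (Table 1)
S_CH = 64 * 256 * 256  # synapses per chip for the nominal chip, Eq. (80)


def chip_benchmarks(tau_syn, tau_neu, e_syn, e_neu,
                    r_a=1.0, s_neu=S_NEU, s_ch=S_CH, spiking=False):
    """Firing rate, time step, energy per event, SOPS, and power (SI units)."""
    if spiking:
        f_fire = 1.0 / (r_a * s_neu * tau_syn)   # Eq. (87)
    else:
        f_fire = 1.0 / tau_syn                   # Eq. (86)
    tau_step = 1.0 / f_fire + tau_neu            # Eq. (88)
    e_syntot = e_syn + e_neu / (r_a * s_neu)     # Eq. (89)
    t_syn = f_fire * r_a * s_ch                  # Eq. (90): synaptic ops per second
    p_ch = t_syn * e_syntot                      # Eq. (91): dissipated power, W
    return {"f_fire": f_fire, "tau_step": tau_step,
            "E_syntot": e_syntot, "T_syn": t_syn, "P_ch": p_ch}


# Hypothetical synapse and neuron values (assumptions, not paper results).
ann = chip_benchmarks(tau_syn=1e-9, tau_neu=1e-9, e_syn=1e-15, e_neu=256e-15)
snn = chip_benchmarks(tau_syn=1e-9, tau_neu=1e-9, e_syn=1e-15, e_neu=256e-15,
                      spiking=True)
```

For these placeholder inputs the non-spiking chip reaches about 4.2 × 10¹⁵ synaptic operations per second at about 8.4 W, while the spiking definition of the firing rate lowers both by the factor $r_a s_{neu} = 256$.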
7. Neuromorphic computing workloads vs. hardware 
We considered examples of neuromorphic workloads, including  
1) CoNN such as LeNet [45,46], shown in Figure 9; 
2) AlexNet [47];  
3) a single stage convolution of a 35x35 pixel image with 24 filters of 5x5 pixels;   
4) a single stage associative memory of pixel patterns [22];  
5) a DNN for recognition of hand-written digits from the MNIST hand-written-digit image 
database [48] implemented as a multi-layer perceptron (MLP) with 784×256×128×10 
fully-connected neurons in layers; 
6) a DNN for speech recognition from [18] – a 4 layer MLP with 390x256x256x29 neurons.  
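For the two multi-layer perceptron workloads, the number of synapses follows directly from the layer sizes. The Python sketch below counts one synapse per connection between adjacent fully connected layers and ignores bias terms; the function name and the bias convention are assumptions of this sketch.

```python
def fully_connected_synapses(layers):
    """Count connections between adjacent fully connected layers (no biases)."""
    return sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))


# Layer sizes quoted in the workload list.
mnist_mlp = fully_connected_synapses([784, 256, 128, 10])
speech_mlp = fully_connected_synapses([390, 256, 256, 29])
```

The MNIST perceptron requires 234,752 synapses and the speech recognition perceptron 172,800, both well below the 4,194,304 synapses of the nominal chip of Table 1.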
While all of these networks belong to the class of non-recurrent DNNs, they are examples of 
ubiquitous applications required by users. However, these workloads may not be favorable to 
SNNs; for example, they do not utilize the temporal information carried by spikes. The reader should 
be warned that conclusions may change if workloads more favorable to SNNs are considered. 
Now we determine the benchmarks for the part of the chip necessary to perform a specific 
computing workload (we will use the subscript CW). In our case the 
computing workload is an inference. Its hardware implementation is determined by the logic 
structure of a neural network, which can be thought of as an algorithm. Each feature map in a 
stage is produced by a convolution with one of the kernels; this process is mapped to a neural 
core. The number of neurons and synapses in each core is determined by the connectivity of the 
neural network. Each core has a number of input neurons $n_{\mathrm{in}}$ and a number of output 
neurons $n_{\mathrm{out}}$. The input neurons, which are often overlooked, are shown in array schemes, e.g. in [4]. Each 
output neuron collects inputs from the number of active synapses per neuron $s_{\mathrm{neu}}$. For example, 
the LeNet NN shown in Figure 9 can be implemented by an application-specific design 
comprising a set of neural cores, shown in Figure 10. A general-purpose neural chip is 
composed of cores of a fixed size, with some of the input and output neurons remaining unused. 
 
Figure 9. Map of layers in the CoNN, LeNet [45]. 
The total number of synapses in the deep neural network can be very large. It is much larger than 
the number of trainable weights (which can be e.g. filters for the feature maps in CoNN). In this 
benchmark we adopt an approach of multiple copies of weights written into the memory of 
synapses and placing them close to neurons. This requires a larger number of memory cells and 
the corresponding chip area dedicated to them. However it can be affordable in the case of dense 
analog memory. During training this requires more time and energy for updating the weights. 
The alternative, using only the memory necessary to hold the trainable parameters, has the major 
downside of the need to route connections to numerous neurons and of the time and energy needed to 
fetch the weight values. 
  
Figure 10. Block diagram of an implementation of the CoNN, LeNet. “AND” symbols designate 
input neurons, triangles designate output neurons; their numbers of instantiations are indicated. 
Numbers in orange squares designate the number of synapses per neuron in the respective core; 
‘full’ means a fully connected core. Numbers in blue squares next to buses connecting cores 
designate the fanout of output neurons. 
Focusing on the structure of a core, we envision that different topologies can be chosen for 
interconnecting input and output neurons by synapses. The most straightforward one is the cross-
connect (Figure 11). It is best for fully connected layers (such as those in the bottom part of 
Figure 10), and allows for a general pattern of connections. But it will leave many unused 
synapses in the case of a sparsely connected NN. Since the processing of information happens in 
or on the periphery of the memory array containing the weights, such a scheme can also be 
classified as “compute-in-memory” or “inference-in-memory”. 
  
Figure 11. Cross-connect topology for the neural network. Input (‘In’) and output (‘Out’) 
neurons are shown in yellow. Active synapses are shown in orange, and unused synapses in 
white. 
The convolution topology (Figure 12) is specifically designed for connections in convolution 
layers (such as those in the top part of Figure 10). This scheme utilizes the property of CoNN – 
sparse connectivity between neurons. It also closely resembles Cellular NN (CeNN) 
connectivity. In this topology the output neurons are placed close to connected input ones. They 
mimic the positions of pixels in the image and the resulting feature map. The convolution 
topology is efficient since it contains only active and no unused synapses. But it is not general – 
additional synapses need to be designed for a less sparse connectivity. 
   
Figure 12. Convolution topology for the neural network. Input (‘In’) and output (‘Out’) neurons 
are shown in yellow. Active synapses are shown in orange. 
The means by which long interconnects can be routed is shown in Figure 13. For the cross-
connect topology, the routing is trivial, since both input and output neurons are at the edge of the 
core. Even for the convolution topology, the input and output wires can enter in a regular array 
of wires and still be routed to neurons. The pitch of interconnect wires is assumed to be $p = 8F$. 
Then the interconnect-wire-limited area of a core is 
$$a_{\mathrm{wire}} = n_{\mathrm{in}} n_{\mathrm{out}} p^2 \tag{93}$$
 
  
Figure 13. Inter-core interconnects and neurons for the convolution topology of a neural 
network. Pitch p is marked. 
The speed of a neural gate is determined by its fan-in, i.e. the number of synaptic operations 
connected to one output neuron that can be performed in parallel. We will assume the fan-in of 
digital CMOS to be 2, analog CMOS to be 16, spintronic devices to be 32. The fan-in for spiking 
NN is assumed unlimited regardless of the device. We will apply this limitation to bottoms-up 
benchmarks, and consider it satisfied for tops-down benchmarks. 
In the case when the devices forming neurons have a limited fan-in $f_i$, they need to be cascaded, 
as shown in Figure 14. The number of levels of cascading is (rounded to the next higher integer) 
$$l_{\mathrm{cas}} = \mathrm{ceil}\left(\log_{f_i} s_{\mathrm{neu}}\right) \tag{94}$$
Then the number of neurons to form a fan-in in a neural gate is 
$$n_{\mathrm{cas}} = \left(f_i^{\,l_{\mathrm{cas}}} - 1\right)/\left(f_i - 1\right) \tag{95}$$
so that the number of neurons per core is 
$$n_{\mathrm{cor}} = n_{\mathrm{cas}} n_{\mathrm{out}} + n_{\mathrm{in}} \tag{96}$$
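The cascading relations of Eqs. (94)-(96) can be illustrated as follows. This is a sketch with hypothetical function names; the logarithm is evaluated with integer arithmetic to avoid floating-point rounding at exact powers of the fan-in. The example uses 25 synapses per neuron (one 5x5 kernel) and the fan-in values assumed in the text.

```python
def cascade_levels(s_neu, f_i):
    """Eq. (94): smallest integer l such that f_i**l >= s_neu."""
    if f_i < 2:
        raise ValueError("fan-in must be at least 2 for cascading")
    levels, reach = 0, 1
    while reach < s_neu:
        reach *= f_i
        levels += 1
    return levels


def neurons_per_gate(s_neu, f_i):
    """Eq. (95): neurons in the cascade tree, a geometric series in f_i."""
    levels = cascade_levels(s_neu, f_i)
    return (f_i ** levels - 1) // (f_i - 1)


def neurons_per_core(n_in, n_out, s_neu, f_i):
    """Eq. (96): cascaded output neurons plus input neurons."""
    return neurons_per_gate(s_neu, f_i) * n_out + n_in


for name, f_i in [("digital CMOS", 2), ("analog CMOS", 16), ("spintronic", 32)]:
    print(name, cascade_levels(25, f_i), neurons_per_gate(25, f_i))
```

For 25 synapses per neuron this gives 5 levels and 31 neurons at a fan-in of 2, 2 levels and 17 neurons at a fan-in of 16, and a single neuron at a fan-in of 32.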
 
  
Figure 14. Synapses connecting to the output neuron via cascaded neurons in case of a limited 
fan-in. 
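In the cross-connect case, the area of a core (performing a stage of a NN) is (provided that $n_{\mathrm{in}}$ is 
larger than $s_{\mathrm{neu}}$) 
$$a_{\mathrm{cor}} = M_{\mathrm{cor}}\left(M_{\mathrm{neu}} a_{\mathrm{neu}} n_{\mathrm{cor}} + M_{\mathrm{syn}} a_{\mathrm{syn}} n_{\mathrm{out}} n_{\mathrm{in}}\right) \tag{97}$$
In the convolution case 
$$a_{\mathrm{cor}} = M_{\mathrm{cor}}\left(M_{\mathrm{neu}} a_{\mathrm{neu}} n_{\mathrm{cor}} + M_{\mathrm{syn}} a_{\mathrm{syn}} n_{\mathrm{out}} s_{\mathrm{neu}}\right) \tag{98}$$
We then take the larger of this estimate and the interconnect wire limit $a_{\mathrm{wire}}$ to constitute the 
area of a core $a_{\mathrm{st}}$ for the given stage of CoNN. The time and energy for a stage of the 
computing workload are 
$$\tau_{\mathrm{st}} = l_{\mathrm{cas}} \tau_{\mathrm{syn}} + \tau_{\mathrm{neu}} \tag{99}$$
$$E_{\mathrm{st}} = r_a s_{\mathrm{neu}} n_{\mathrm{out}} E_{\mathrm{syn}} + n_{\mathrm{out}} E_{\mathrm{neu}} \tag{100}$$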
If the fan-in of the devices is small, which makes its cascading impractical, the synaptic 
operations can be performed sequentially. We will be using this method for neural accelerators. 
In this case the fan-in factors above are set to 1, and instead the time estimate changes to  
$$\tau_{\mathrm{st}} = s_{\mathrm{neu}} \tau_{\mathrm{syn}} + \tau_{\mathrm{neu}} \tag{101}$$
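The per-stage estimates of Eqs. (93) and (97)-(101) might be coded as below. This is a sketch under stated assumptions: the function names are hypothetical, the overhead factors M_cor, M_neu, M_syn default to 1.0 only as placeholders (their values are set elsewhere in the paper), and the grouping of terms follows one reading of the equations in which M_cor multiplies the sum of the neuron and synapse areas.

```python
def wire_limited_area(n_in, n_out, pitch):
    """Eq. (93): area set by the input and output wire arrays."""
    return n_in * n_out * pitch ** 2


def core_area(n_cor, n_in, n_out, s_neu, a_neu, a_syn, topology,
              m_cor=1.0, m_neu=1.0, m_syn=1.0):
    """Eq. (97) for 'cross-connect', Eq. (98) for 'convolution'."""
    if topology == "cross-connect":
        synapses = n_out * n_in      # full array, including unused synapses
    elif topology == "convolution":
        synapses = n_out * s_neu     # active synapses only
    else:
        raise ValueError("unknown topology: " + topology)
    return m_cor * (m_neu * a_neu * n_cor + m_syn * a_syn * synapses)


def stage_area(device_limited, wire_limited):
    """The larger of the device-limited and wire-limited estimates."""
    return max(device_limited, wire_limited)


def stage_delay_cascaded(l_cas, tau_syn, tau_neu):
    """Eq. (99): synaptic operations performed in parallel through a cascade."""
    return l_cas * tau_syn + tau_neu


def stage_delay_sequential(s_neu, tau_syn, tau_neu):
    """Eq. (101): synaptic operations performed one after another."""
    return s_neu * tau_syn + tau_neu


def stage_energy(r_a, s_neu, n_out, e_syn, e_neu):
    """Eq. (100): energy of active synapses plus output neurons."""
    return r_a * s_neu * n_out * e_syn + n_out * e_neu
```

With 25 synapses per neuron, for instance, the sequential delay contains 25 synaptic time steps, whereas the cascaded delay at a fan-in of 16 contains only 2.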
In a multi-stage network, like in Figure 9, each stage is using one core. Therefore the above 
benchmarks need to be multiplied by the number of feature maps in each stage and then summed 
over all stages to obtain benchmarks for a computing workload.  
$$a_{CW} = \sum_{st} a_{st} f_{st} \tag{102}$$
$$\tau_{CW} = \sum_{st} \tau_{st} \tag{103}$$
$$E_{CW} = \sum_{st} E_{st} f_{st} \tag{104}$$
In case where the implementation of CoNN is constrained by area, all the cores in Figure 9 can 
be replaced by a single core, and all the stage operations can be performed in a sequential (‘time-
multiplexed’) manner. Here we neglect the energy and delay of storing the intermediate results. 
Then in this case the estimates need to change to 
$$a_{CW} = \max_{st} a_{st} \tag{105}$$
$$\tau_{CW} = \sum_{st} \tau_{st} f_{st} \tag{106}$$
This is the case, for example, for all networks using a digital multiplier in a synapse (labeled 
“MAC”). This treatment is also used for all tops-down estimates of implemented chips. 
Then the power in a computing workload and the throughput of inferences (not to be confused 
with the synaptic throughput), in units of inferences per second (IPS) per unit area, are 
$$P_{CW} = E_{CW}/\tau_{CW} \tag{107}$$
$$T_I = 1/\left(a_{CW} \tau_{CW}\right) \tag{108}$$
   
8. Prototype neuromorphic chips 
We will compare the above benchmarks with those for prototype chips fabricated and measured 
by several groups of researchers. In this section we consider mostly spiking neuromorphic chips 
[5,6,49]. To them we apply ‘tops-down benchmarking’, i.e. calculate the neuron and synapse 
values from the total number of synapses, the total area, and known synaptic throughput. A 
number of such chips have been previously benchmarked [50]. One should keep in mind the 
difference between the ‘bottoms-up’ and ‘tops-down’ benchmarks. The former are for idealized 
circuits and do not include multiple auxiliary circuits necessary for the operation of a chip. The 
latter are complete real-life chips and contain all the circuit overheads which are hard to quantify. 
However we will sometimes put these two types of benchmarks side-by-side for sanity checks 
and extract some insights, given the above caveats. We assume that 5% of the chip area is 
occupied by neurons and the rest by synapses. The area per neuron and per synapse is thus  
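$$a_{\mathrm{neu}} = 0.05\, a_{\mathrm{ch}}/\left(c_{\mathrm{ch}} n_{\mathrm{cor}}\right) \tag{109}$$
$$a_{\mathrm{syn}} = 0.95\, a_{\mathrm{ch}}/\left(c_{\mathrm{ch}} n_{\mathrm{cor}} s_{\mathrm{neu}}\right) \tag{110}$$
The synaptic time step $\tau_{\mathrm{syn}}$ is calculated from the firing rate $f_{\mathrm{fire}}$ as per the previous section. 
Publications mostly quote the energy per synaptic event. We approximate the energy per neuron 
as 
$$E_{\mathrm{neu}} = E_{\mathrm{syn}} r_a s_{\mathrm{neu}} \tag{111}$$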
The synaptic throughput, power, and energy per time step are calculated as in Section 5. The 
input parameters from cited publications are collected in Table 4. In some cases the input 
parameters are not available, so they are calculated from the consistency of quoted synaptic 
throughput with that calculated from equations in Section 6. Then we use these inputs to obtain 
the benchmarks for a synapse and a neuron, and calculate benchmarks for various computing 
workloads described in Section 7.  
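As an illustration of the tops-down procedure, Eqs. (109)-(111) can be applied to one row of the chip parameters. The sketch below uses a hypothetical function name and the TrueNorth entries listed in Table 4 (430 mm2, 4096 cores, 256 neurons per core, 256 synapses per neuron, 26 pJ per synaptic event, activity 0.5); the 5% neuron area share is the assumption stated in the text.

```python
def tops_down(a_ch, c_ch, n_cor, s_neu, e_syn, r_a, neuron_area_share=0.05):
    """Per-neuron and per-synapse values derived from whole-chip data."""
    a_neu = neuron_area_share * a_ch / (c_ch * n_cor)                  # Eq. (109)
    a_syn = (1.0 - neuron_area_share) * a_ch / (c_ch * n_cor * s_neu)  # Eq. (110)
    e_neu = e_syn * r_a * s_neu                                        # Eq. (111)
    return a_neu, a_syn, e_neu


# Areas in mm^2, energies in pJ.
a_neu, a_syn, e_neu = tops_down(a_ch=430.0, c_ch=4096, n_cor=256,
                                s_neu=256, e_syn=26.0, r_a=0.5)
print(f"area per neuron:   {a_neu * 1e6:.2f} um^2")
print(f"area per synapse:  {a_syn * 1e6:.3f} um^2")
print(f"energy per neuron: {e_neu:.0f} pJ")
```

By construction the neuron and synapse areas add back up to the full chip area, so all overheads of a real chip are folded into these effective values.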
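Table 4. PARAMETERS FOR NEUROMORPHIC CHIPS 
| Chip name | Main affiliation | Year | # cores | Neurons per core | Synapses per neuron | Area, mm² | Power, mW | Syn. throughput, MSOPS | Energy per syn. event, pJ | Syn. fire rate, s⁻¹ | Activity | Process, nm | Voltage, V | References |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Notation | | | $c_{\mathrm{ch}}$ | $n_{\mathrm{cor}}$ | $s_{\mathrm{neu}}$ | $a_{\mathrm{ch}}$ | $P_{\mathrm{ch}}$ | $T_{\mathrm{syn}}$ | $E_{\mathrm{spi}}$ | $f_{\mathrm{syn}}$ | $r_a$ | | | |
| HICANN | Heidelberg | 2010 | 1 | 512 | 224 | 50 | 1150* | 11,500 | 100 | 100k | 1 | 180 | 1.8 | [51] |
| HICANN-X | Heidelberg | 2018 | 1 | 512 | 256 | 32 | 2100* | 2600 | 800 | 20k | 1 | 65 | 1.2 | [52] |
| SyNAPSE | HRL | 2013 | 1 | 576 | 128 | 42 | 130 | 15 | 8700 | 203* | 1 | 90 | 1.4 | [53] |
| SpiNNaker | Manchester | 2013 | 16 | 1024 | 1024 | 102 | 1000 | 64 | 16k* | 10 | 0.4* | 130 | 1.2 | [54][55] |
| SpiNNaker2 | Manchester | 2017 | 64 | 2048 | 1024 | ? | 110 | 250 | 440 | 10 | 0.2* | 28 | 1.0 | [56] |
| TrueNorth | IBM | 2014 | 4096 | 256 | 256 | 430 | 72 | 3000 | 26 | 20 | 0.5 | 28 | 0.78 | [4][57] |
| Neurogrid | Stanford | 2014 | 1 | 65536 | 1024 | 168 | 59* | 62.5 | 941 | 10 | 0.09* | 180 | 1.8 | [58] |
| IFAT | UCSD | 2014 | 32 | 2048 | 1024 | 16 | 1.57 | 73 | 22 | 10 | 0.11* | 90 | 1.2 | [59] |
| ROLLS | ETH | 2015 | 1 | 256 | 512 | 51.4 | 4 | 4 | 1000* | 30 | 1 | 180 | 1.8 | [60][61] |
| DYNAP-SEL | ETH | 2016 | 4 | 256 | 64 | 43.8 | ? | ? | 50 | 30 | ? | 28 | 1.0 | [62] |
| Loihi | Intel | 2018 | 128 | 1024 | 128 | 60 | 450 | 30,000 | 15* | 1800* | 1 | 14 | 0.75 | [63][64] |
| SBNN | Intel | 2018 | 64 | 64 | 256 | 1.72 | 209 | 25,200 | 8.3 | 50k | 0.5* | 10 | 0.53 | [65] |

* derived value 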
We compared our performance estimates with those experimentally measured in [18] for the speech 
recognition workload on the Loihi and Myriad 2 (Movidius) chips (Table 5). We note that the theoretical 
estimates are much more optimistic than the experimental values. Possible reasons for the discrepancy 
include the circuit overhead required in an actual chip, such as stand-by power, the need to fetch data, 
a slower clock frequency, etc. 
Table 5. COMPARISON OF BENCHMARKS WITH MEASURED PERFORMANCE 
| | Loihi [18] | Loihi, this work | Movidius [18] | Movidius, this work |
|---|---|---|---|---|
| Speed, inference/s | 89.8 | 55k | 300 | 167k |
| Energy, µJ/inference | 770 | 6 | 1500 | 5.5 |
 
9. Digital Neural Accelerators 
Another type of chip being fabricated is commonly called a neural accelerator 
[3]. Neural accelerators are based on traditional digital circuits and in this sense differ from other 
neuromorphic hardware. Unlike CPU and GPU chips, which implement neural network 
algorithms in software, neural accelerators have dedicated hardware engines to implement neural 
networks. They are highly optimized for vector-matrix multiplication, which is the core 
operation in neural algorithms. In this sense they present a good comparison for digital CMOS 
neural networks. The input data for them are collected in Table 6. 
Our treatment of them is similar to that in Section 8, but adjusted for the non-spiking type. In this 
case the clock frequency determines the time step. Performance of these chips is often quoted in 
MAC/s. A MAC counts as two floating point operations (FLOP), a multiplication and an addition, 
although the two require different delay and energy. A significant share of 
the area of these chips is occupied by the cache, control circuits, etc. For this estimate we assume 
that 10% of the chip area is occupied by neurons and synapses. We use the inputs collected in 
Table 6 to obtain the benchmarks for a synapse and a neuron, and then calculate benchmarks for 
the various computing workloads described in Section 7. 
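Table 6. PARAMETERS FOR DIGITAL NEURAL ACCELERATORS 
| Chip name | Main affiliation | Year | # cores | Neurons per core | Synapses per neuron | Memory, bytes | Area, mm² | Power, W | Performance, GMAC/s | Synapse energy, pJ | Clock frequency, MHz | Process, nm | References |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Notation | | | $c_{\mathrm{ch}}$ | $n_{\mathrm{cor}}$ | $s_{\mathrm{neu}}$ | $m_{\mathrm{ch}}$ | $a_{\mathrm{ch}}$ | $P_{\mathrm{ch}}$ | $T_{\mathrm{syn}}$ | $E_{\mathrm{syn}}$ | $f_{\mathrm{cl}}$ | | |
| Diannao | CAofS | 2014 | 1 | 16 | 16 | 2k | 3.02 | 0.485 | 452 | 1.1* | 980 | 65 | [66] |
| Dadiannao | CAofS | 2014 | 16 | 16 | 16 | 32M | 67.73 | 15.97 | 5585 | 2.9* | 606 | 28 | [67] |
| Pudiannao | CAofS | 2015 | 1 | 16 | 16 | 32k | 3.51 | 0.596 | 1056 | 0.56* | 1000 | 65 | [68] |
| Shidiannao | CAofS | 2015 | 1 | 16 | 16 | 36k | 4.86 | 0.32 | 194 | 1.7* | 1000 | 65 | [69] |
| Eyeriss | MIT | 2016 | 1 | 1 | 168 | 192k | 12.25 | 0.278 | 33.6 | 8.3* | 200 | 65 | [70] |
| EIE | Stanford | 2016 | 1 | 64 | 8 | 10.3M | 40.8 | 0.579 | 51.2 | 11.3* | 800 | 45 | [71] |
| Origami | ETH | 2016 | 1 | 4 | 49 | 43k | 3.09 | 0.654 | 98 | 6.7* | 500 | 65 | [72][73] |
| Envision | Leuven | 2017 | 1 | 16 | 16 | 128k | 1.87 | 0.044 | 51 | 0.86* | 200 | 28 | [74] |
| TPU | Google | 2017 | 1 | 256 | 256 | 28M | 300 | 40 | 11400 | 3.5* | 700 | 28 | [2] |
| Tesla | Nvidia | 2017 | 80 | 32 | 32 | 6M | 815 | 300 | 14900 | 20* | 1300 | 12 | [75] |
| DPU | Wave | 2018 | 16384 | 1 | 1 | 24M | 400 | 200 | 3900 | 51* | 6700 | 16 | [75] |
| Q4MobilEye | Intel | 2018 | 1 | 32 | 32 | 1M | ? | 3 | 1078 | 2.8* | 1000 | 28 | [75] |
| Parker | Nvidia | 2016 | 1 | 256 | 256 | 4M | ? | 5 | 375 | 13.3* | 3000 | 16 | [75] |
| S32V234 | NXP | 2017 | 1 | 64 | 64 | 4M | ? | 5 | 512 | 9.8* | 1000 | 28 | [75] |
| Myriad 2 | Intel | 2017 | 12 | 4 | 16 | 2M | 27 | 1.5 | 58 | 26* | 800 | 28 | [76] |

* derived value; ** ‘CAofS’ designates the Chinese Academy of Sciences. 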
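The starred synapse energies in Table 6 appear to be consistent with dividing the chip power by the MAC throughput. This is an observation from the tabulated numbers rather than a formula stated in the text, and the helper names below are hypothetical. The clock period, which sets the time step for these non-spiking chips, is included for completeness.

```python
def synapse_energy_pj(power_w, throughput_gmacs):
    """Energy per MAC in pJ from chip power (W) and throughput (GMAC/s)."""
    return power_w / (throughput_gmacs * 1e9) * 1e12


def clock_period_ns(f_cl_mhz):
    """Time step in ns set by the clock frequency in MHz."""
    return 1e3 / f_cl_mhz


# (power in W, performance in GMAC/s, clock in MHz) as listed in Table 6.
rows = {
    "Diannao": (0.485, 452, 980),
    "Eyeriss": (0.278, 33.6, 200),
    "TPU":     (40, 11400, 700),
}
for chip, (power, throughput, clock) in rows.items():
    print(chip, round(synapse_energy_pj(power, throughput), 1), "pJ,",
          round(clock_period_ns(clock), 2), "ns")
```

For these three rows the computed energies round to the tabulated 1.1, 8.3, and 3.5 pJ.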
 
10. Results for physical performance 
The most informative view of the benchmarks is provided by the comparison of operation delay 
and energy. Such energy-delay plots are provided both for synapses (Figure 15) and for neurons 
(Figure 16). In many subsequent benchmarks the following technology options are found to lie 
in close proximity to each other: the group of DiCFETb, DiCOxme, DiCSTTb, DiCSOTb 
and the group of AnCFET, AnCOxme, AnCFlGa, AnCPCM. In other words, these are NNs with 
digital CMOS neurons (the first group) or analog CMOS neurons (the second group) 
that have very similar designs within each group. The difference within each group is the type 
of resistive memory, comprising a set of binary bits (the first group) or analog resistive 
elements (the second group). The analysis shows that the kind of resistive memory produces 
noticeable but minor differences. In the following plots, for clarity, we suppress the 
labels, leaving just one from each group. 
One observes that, among the four NN types, the neurons have similar ranges of energy. 
However, on average, the ordering by delay from fastest to slowest is as follows: ANN, 
ONN, CeNN, and SNN. Within each type of NN, networks with both magnetoelectric neurons 
and synapses (MEME) show the lowest energy. This is in line with benchmarks for Boolean 
computing [9]. The networks with both ferroelectric neurons and synapses (FEFET) show the 
fastest speed. This is a result of the combination of the relatively fast switching of a transistor and 
the assumption that a single ferroelectric transistor is capable of performing the neuron 
function. The NNs based on a multiplier and adder in each synapse (MAC) prove to be the 
slowest and the most energy-consuming due to the large number of switching transistors in each 
such element. In general, NNs with analog neurons are faster and more energy efficient than 
similar NNs with digital neurons. This is due to the fact that the neural function is performed in 
parallel in analog NNs rather than synapse-sequentially as in digital NNs. 
 
  
Figure 15. Energy vs. delay for a Synapse in ANN (magenta dots), CeNN (green dots), SNN 
(gold dots), ONN (blue dots). Labels for architectures according to Table 3. 
 
The separation of the four NN types is not as clear for synapses. The overall ranges of energy are 
similar between the types. However, on average, the ordering by delay from fastest to slowest is 
as follows: ANN, SNN, CeNN, and ONN. The difference from the corresponding ordering for neurons is 
due to the fact that synapses are similar between ANN and SNN, and the difference arises in the 
operation at the core level. Other trends for synapses are similar to those for neurons. 
  
Figure 16. Energy vs. delay for a Neuron in ANN (magenta dots), CeNN (green dots), SNN 
(gold dots), ONN (blue dots). Labels for architectures according to Table 3. 
The relation between energies and delays for a whole computing workload (LeNet in this 
case, Figure 17) closely resembles that for neurons. On average, ANNs are faster than ONNs 
by about half an order of magnitude, faster than CeNNs by an order of magnitude, and faster 
than SNNs by two orders of magnitude. Magnetoelectric devices are more energy efficient than 
analog neurons by about an order of magnitude, and more energy efficient than digital 
neurons by about another order of magnitude. The redeeming quality of SNNs is built-in learning 
via spike-timing-dependent plasticity (STDP) [4]. 
  
Figure 17. Energy vs. delay for one inference in a circuit implementing the LeNet convolutional 
neural network: ANN (magenta dots), CeNN (green dots), SNN (gold dots), ONN (blue dots). 
Labels for architectures according to Table 3. Bottoms-up benchmarks. 
Now we include tops-down benchmarks for experimentally demonstrated integrated 
neuromorphic chips and neural accelerators, Figure 18. We notice that neural accelerators 
agree within an order of magnitude with the bottoms-up benchmarks of the similar 
technology (MAC) implementation. Neuromorphic spiking chips in general prove to be slower 
than digital accelerators. The difference depends on whether they are designed to run at biologically 
feasible firing rates (a few tens of hertz, e.g. ROLLS) or at an accelerated rate (tens of kilohertz, 
e.g. HICANN). These are still slower than the clock rates of hundreds of megahertz used in 
neural accelerators. Neuromorphic chips have a lower operating power than neural 
accelerators. However, their speed (determined by the firing rate) is much lower than that of 
neural accelerators (determined by the clock rate). As a result, the energy per inference 
(proportional to the product of the operation delay and power) proves to be higher in 
neuromorphic chips. 
  
 
Figure 18. Energy vs. delay for one inference for CMOS circuit implementations of the LeNet 
convolutional neural network: with various digital accelerators (yellow dots) and neuromorphic 
spiking chips (red dots). Tops-down benchmarks. Data from Tables 4 and 6 are used. 
Energy and delay for various workloads (on the same hardware) are shown in Figure 19. The 
numerical values are determined by the size of the overall network, mostly the number of MACs 
in it; see Supplementary Materials. The relation between energy and delay among the various 
hardware options looks similar from workload to workload. 
 
Figure 19. Energy vs. delay for one inference in various workloads implemented with digital 
neurons and SRAM synapses. 
 
11. Throughput and Dissipated Power 
Circuit performance can be represented as computing throughput plotted vs. dissipated power 
(Figure 20). One notices that spintronic networks have a higher throughput per unit area due to 
the small size of their neurons and synapses. However, this higher throughput 
results in very high dissipated power. If we apply a cap on the allowed power dissipation (Figure 
21), the throughput of some architectures with leading throughput values (e.g. ANNFETFET) is scaled down 
proportionally. In this case only low-energy spintronic options maintain high throughput (e.g. 
ANNanCFET). 
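The power cap can be expressed as a proportional rescaling of throughput and power density. The sketch below assumes the cap of 100 W/cm2 quoted in the caption of Figure 21; the function name and the input values in the usage lines are hypothetical.

```python
def apply_power_cap(throughput, power_density, cap=100.0):
    """Scale throughput down in proportion when power density exceeds the cap.

    power_density and cap are in W/cm^2; throughput is per unit area.
    Architectures already below the cap are left unchanged.
    """
    scale = min(1.0, cap / power_density)
    return throughput * scale, power_density * scale


# An option dissipating 400 W/cm^2 keeps a quarter of its throughput.
print(apply_power_cap(1e6, 400.0))
# An option dissipating 50 W/cm^2 is unaffected.
print(apply_power_cap(1e6, 50.0))
```

Because the scaling is proportional, the energy per inference of each option is unchanged; only options that are already energy efficient keep their throughput under the cap.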
  
Figure 20. Dissipated power density vs. inference operation throughput per unit area in a circuit 
implementing the LeNet convolutional neural network. Labels for architectures according to 
Table 3. 
  
Figure 21. Dissipated power density vs. inference operation throughput per unit area in a circuit 
implementing the LeNet convolutional neural network. Power is capped at 100 W/cm², and 
throughput is scaled proportionally. Labels for architectures according to Table 3. 
12. Conclusions 
In summary, the developed methodology described in this paper enables quantifying the effect of 
devices and NN types on the performance, power, and area of NNs. ANN and ONN show higher 
speed of operation at comparable energy vs. CeNN and SNN. This translates into a larger 
inference throughput especially under the limitation of power dissipation. The trend is confirmed 
by a comparison of actual fabricated functional neuromorphic chips (which are SNNs) and neural 
accelerator chips (which are ANN-based). Within each NN type, the ones based on multipliers 
and adders (MAC) prove to be less efficient while ones based on analog neurons and synapses 
prove to be more efficient in both speed and energy of operation. Among those, ferroelectric 
devices show higher speed and spintronic devices (especially based on magnetoelectric 
switching) show lower energy of operation. 
It is important to note that the conclusions relate to inference and do not cover learning. Spiking 
neural networks are especially amenable to unsupervised learning, and this advantage is not 
captured in the present benchmarks. 
 
13. Acknowledgements 
The authors gratefully acknowledge discussions and critique by Narayan Srinivasa, Mike 
Mayberry, Sasikanth Manipatruni, Greg Chen, Ram Krishnamurthy, Chenyun Pan, Azad 
Naeemi, Dan Hammerstrom, Mike Davies, Eugenio Culurciello, Dmitri Strukov, Kaushik Roy, 
and Wolfgang Porod. 
  
14. Supplementary Materials 
Remaining benchmarking plots are collected here in order to keep the main text concise. 
  
Figure 22. Delay vs. area for synapses. 

Figure 23. Delay vs. area for synapses. 

Figure 24. Delay vs. area for neurons. 

Figure 25. Delay vs. area for neurons. 

Figure 26. Energy vs. delay for synapses. 

Figure 27. Energy vs. delay for neurons. 

Figure 28. Delay vs. area for LeNet CoNN. 

Figure 29. Delay vs. area for LeNet CoNN. 

Figure 30. Dissipated power density vs. inference operation throughput per unit area in a circuit 
implementing the LeNet convolutional neural network, includes benchmarks for prototype 
neuromorphic chips and neural accelerators. 

Figure 31. Power vs. synaptic throughput for LeNet. 

Figure 32. Delay vs. area for the speech recognition. 

Figure 33. Energy vs. delay for the speech recognition. 

Figure 34. Power density vs. inference throughput for the speech recognition. 

Figure 35. Energy vs. MAC in digital neurons and SRAM synapses for various workloads. 

Figure 36. Energy vs. delay in Loihi for various workloads. 

Figure 37. Energy vs. MAC in Loihi for various workloads. 

Table 7. Performance benchmarks for the combinations of devices and network types. 
 
15. References 
[1] G. E. Moore, “Cramming more components onto integrated circuits”, Proceedings of IEEE 86, 82–85 
(1998). 
[2] N. P. Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit”, Proceedings 
of the 44th Annual International Symposium on Computer Architecture (ISCA '17), pp. 1-12, 
Toronto, ON, Canada, June 24-28, 2017. 
[3] V. Sze, Y. Chen, T. Yang and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial 
and Survey," in Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, Dec. 2017. 
 
                                                 
[4] Merolla, P.A., J.V. Arthur, R. Alvarez-Icaza, A S. Cassidy, J. Sawada, F. Akopyan, B.L. Jackson, N. 
Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S.K. Esser, R. Appuswamy, B. Taba, A. Amir, M.D. 
Flickner, W.P. Risk, R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit with a 
scalable communication network and interface,” Science, 345(6197): 668–673, 2014. 
[5] S. Furber, “Large-scale neuromorphic computing systems”, J. Neural Eng. 13 (2016) 051001. 
[6] G. Indiveri and S. Liu, "Memory and Information Processing in Neuromorphic Systems," in 
Proceedings of the IEEE, vol. 103, no. 8, pp. 1379-1397, Aug. 2015. 
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature 521, 436 (2015). 
[8] D. E. Nikonov and I. A. Young, “Overview of Beyond-CMOS Devices and a Uniform Methodology 
for Their Benchmarking”, Proc. IEEE 101, 2498 - 2533 (2013). 
[9] D. E. Nikonov and I. A. Young, “Benchmarking of Beyond-CMOS Exploratory Devices for Logic 
Integrated Circuits”, IEEE J. Explor. Comput. Devices and Circuits 1, 3-11 (2015). 
[10]  D. E. Nikonov and I. A. Young, “Benchmarking of devices in the Nanoelectronics Research 
Initiative”. [Online]. Available: https://nanohub.org/tools/nribench/browser/trunk/src (2019). 
[11] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey of two decades of progress”, 
Neurocomputing, v. 74, no. 1–3, pp. 239-255 (2010). 
[12] C. D. Schuman, T.  E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, 
“A Survey of Neuromorphic Computing and Neural Networks in Hardware”, available online, 
arXiv:1705.06963v1. 
[13] Q. Liu, G. Pineda-García, E. Stromatias, T. Serrano-Gotarredona, S. B. Furber, “Benchmarking 
Spike-Based Visual Recognition: A Dataset and Evaluation”, Frontiers in Neuroscience, v. 10, p. 469 
(2016). 
[14] S. Yu, "Neuro-inspired computing with emerging nonvolatile memorys," in Proceedings of the 
IEEE, vol. 106, no. 2, pp. 260-285, Feb. 2018. 
[15] S. Agarwal et al., "Achieving ideal accuracies in analog neuromorphic computing using periodic 
carry," 2017 Symposium on VLSI Technology, Kyoto, 2017, pp. T174-T175. 
[16] A. Canziani, E. Culurciello and A. Paszke, "Evaluation of neural network architectures for embedded 
systems," 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, 2017, 
pp. 1-4. 
[17] A. Canziani, A. Paszke, and E. Culurciello, “An Analysis of Deep Neural Network Models for 
Practical Applications”, available online https://arxiv.org/abs/1605.07678 (2016). 
[18] P. Blouw, X. Choo, E. Hunsberger, and C. Eliasmith, “Benchmarking Keyword Spotting Efficiency 
on Neuromorphic Hardware”, available online https://arxiv.org/abs/1812.01739 (2018). 
[19] P. Chen, X. Peng and S. Yu, "NeuroSim+: An integrated device-to-algorithm framework for 
benchmarking synaptic devices and array architectures," 2017 IEEE International Electron Devices 
Meeting (IEDM), San Francisco, CA, 2017, pp. 6.1.1-6.1.4. 
 
[20] T.-J. Yang, Y.-H. Chen, J. Emer, V. Sze, "A Method to Estimate the Energy Consumption of Deep 
Neural Networks," Asilomar Conference on Signals, Systems and Computers, Invited Paper, October 
2017. 
[21] M. J. Marinella, S. Agarwal, A. Hsia, I. Richter, R. Jacobs-Gedrim, J. Niroula, S. J. Plimpton, E. 
Ipek, and C. D. James, "Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a 
ReRAM Analog Neural Training Accelerator," in IEEE Journal on Emerging and Selected Topics in 
Circuits and Systems, vol. 8, no. 1, pp. 86-101, March 2018. 
[22] C. Pan, A. Naeemi, “Non-Boolean Computing Benchmarking for Beyond-CMOS Devices Based on 
Cellular Neural Network”, IEEE J. Explor. Comput. Devices and Circuits (2016). 
[23] M. S. Zaveri and D. Hammerstrom, “Performance/price estimates for cortex-scale hardware: A 
design space exploration”, Neural Networks 24 (2011) 291–304. 
[24] J. Hasler and B. Marr, “Finding a roadmap to achieve large neuromorphic hardware systems”, 
Frontiers in Neuroscience, 7, 118 (2013). 
[25] A. Sengupta, A. Ankit and K. Roy, "Performance analysis and benchmarking of all-spin spiking 
neural networks (Special session paper)," 2017 International Joint Conference on Neural Networks 
(IJCNN), Anchorage, AK, 2017, pp. 4557-4563. 
[26] Z. Du et al., "Neuromorphic accelerators: A comparison between neuroscience and machine-learning 
approaches," 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 
Waikiki, HI, 2015, pp. 494-507. 
[27] G. Srinivasan, A. Sengupta, and K. Roy, “Magnetic Tunnel Junction Based Long-Term Short-Term 
Stochastic Synapse for a Spiking Neural Network with On-Chip STDP Learning”, Scientific Reports 6, 
29545 (2016). 
[28] A. Sengupta, A. Banerjee, and K. Roy, “Hybrid Spintronic-CMOS Spiking Neural Network with On-
Chip Learning: Devices, Circuits, and Systems”, Phys. Rev. Appl. 6, 064003 (2016). 
[29] P. Chen, X. Peng and S. Yu, "NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-
Inspired Architectures in Online Learning," in IEEE Transactions on Computer-Aided Design of 
Integrated Circuits and Systems, vol. 37, no. 12, pp. 3067-3080, Dec. 2018. 
[30] I. Palit, B. Sedighi, Q. Lou, M. Niemier, J. Nahas, X. S. Hu,  “Analytical Models for Calculating 
Power and Performance of a CNN System”, unpublished. 
[31] X. Wang, Y. Chen, H. Xi, H. Li, and D. Dimitrov, “Spintronic memristor through spin-torque-
induced magnetization motion,” IEEE Electron Device Lett., vol. 30, no. 3, pp. 294–297, Mar. 2009. 
[32] A. W. Stephan, J. Hu, S. J. Koester, “Benchmarking Inverse Rashba-Edelstein Magnetoelectric 
Devices for Neuromorphic Computing”, available online https://arxiv.org/abs/1811.08624 (2018). 
[33] J. Grollier, D. Querlioz and M. D. Stiles, "Spintronic Nanodevices for Bioinspired Computing," in 
Proceedings of the IEEE, vol. 104, no. 10, pp. 2024-2039, Oct. 2016. 
[34] M. Jerry et al., "Ferroelectric FET analog synapse for acceleration of deep neural network 
training," 2017 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, 2017, pp. 
6.2.1-6.2.4. 
 
[35]  E. W. Kinder, C. Alessandri, P. Pandey, G. Karbasian, S. Salahuddin and A. Seabaugh, "Partial 
switching of ferroelectrics for synaptic weight storage," 2017 75th Annual Device Research Conference 
(DRC), South Bend, IN, 2017, pp. 1-2. 
[36] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose and R. W. Linderman, "Memristor Crossbar-Based 
Neuromorphic Computing System: A Case Study," in IEEE Transactions on Neural Networks and 
Learning Systems, vol. 25, no. 10, pp. 1864-1878, Oct. 2014. 
[37] C. Liu, B. Yan, C. Yang, L. Song, Z. Li, B. Liu, Y. Chen, H. Li, Q. Wu, H. Jiang, "A spiking 
neuromorphic design with resistive crossbar," 2015 52nd ACM/EDAC/IEEE Design Automation 
Conference (DAC), San Francisco, CA, 2015, pp. 1-6. 
[38] F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev and D. B. Strukov, "High-
Performance Mixed-Signal Neurocomputing With Nanoscale Floating-Gate Memory Cell Arrays," in 
IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4782-4790, Oct. 2018. 
[39] M. Bavandpour, M. R. Mahmoodi and D. B. Strukov, "Energy-Efficient Time-Domain Vector-by-
Matrix Multiplier for Neurocomputing and Beyond," in IEEE Transactions on Circuits and Systems II: 
Express Briefs (2019). 
[40] Vincent, A.F., Larroque, J., Zhao, W.S., Romdhane, N.B., Bichler, O., Gamrat, C., Klein, J.O., 
Galdin-Retailleau, S. and Querlioz, D., “Spin-transfer torque magnetic memory as a stochastic memristive 
synapse”. In 2014 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1074-1077 
(2014). 
[41] Ramasubramanian, S.G., Venkatesan, R., Sharad, M., Roy, K. and Raghunathan, A., “SPINDLE: 
SPINtronic deep learning engine for large-scale neuromorphic computing”, In Proceedings of the 2014 
international symposium on Low power electronics and design, pp. 15-20 (2014). 
[42] Q. Lou, C. Pan, J. McGuinness, A. Horvath, A. Naeemi, M. Niemier, and X. S. Hu, “A Mixed Signal 
Architecture for Convolutional Neural Networks”, ACM Journal on Emerging Technologies in 
Computing Systems (JETC), v. 15, no. 2, art. 19, April 2019.   
[43] C. Lee, S. Shakib Sarwar, and K. Roy, “Enabling Spike-based Backpropagation in State-of-the-art 
Deep Neural Network Architectures”, available online https://arxiv.org/abs/1903.06379 (2019). 
[44] Sharp, T., Galluppi, F., Rast, A., and Furber, S., “Power-efficient simulation of detailed cortical 
microcircuits on SpiNNaker”, J. Neurosci. Methods 210, 110–118 (2012). 
[45] Y. LeCun, et al., “Handwritten digit recognition: Applications of neural network chips and automatic 
learning,” IEEE Commun. Mag., vol. 27, no. 11, pp. 41–46, Nov. 1989. 
[46] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document 
recognition," Proceedings of the IEEE, 86, 2278-2324 (1998). 
[47] A. Krizhevsky, I. Sutskever, and G. Hinton. “Imagenet classification with deep convolutional neural 
networks”. In Advances in Neural Information Processing Systems 25, pp. 1097-1105 (2012). 
[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document 
recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998. 
 
[49] Thakur, C.S.T., Molin, J., Cauwenberghs, G., Indiveri, G., Kumar, K., Qiao, N., Schemmel, J., 
Wang, R.M., Chicca, E., Olson Hasler, J. and Seo, J.S., “Large-scale neuromorphic spiking array 
processors: A quest to mimic the brain”, Frontiers in neuroscience, 12, p.891 (2018).  
[50] B. Chatterjee, P. Panda, S. Maity, A. Biswas, K. Roy and S. Sen, "Exploiting Inherent Error 
Resiliency of Deep Neural Networks to Achieve Extreme Energy Efficiency Through Mixed-Signal 
Neurons," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2019). 
[51] Schemmel, J., D. Bruderle, A. Grubl, M. Hock, K. Meier, and S. Millner, “A waferscale 
neuromorphic hardware system for large-scale neural modeling,” Proc. 2010 IEEE Int. Symp. Circuits 
and Systems (ISCAS), 1947–1950, 2010. 
[52] S. A. Aamir, Y. Stradmann, P. Müller, C. Pehle, A. Hartel, A. Grübl, J. Schemmel, K. Meier, “An 
Accelerated LIF Neuronal Network Array for a Large Scale Mixed-Signal Neuromorphic Architecture”, 
available online https://arxiv.org/abs/1804.01906 (2018). 
[53] J. M. Cruz-Albrecht, T. Derosier and N. Srinivasa, “A scalable neural chip with synaptic electronics 
using CMOS integrated memristors”, Nanotechnology 24, 384011 (2013). 
[54] E. Painkras et al., "SpiNNaker: A 1-W 18-Core System-on-Chip for Massively-Parallel Neural 
Network Simulation," in IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1943-1953, Aug. 2013. 
[55] E. Stromatias, F. Galluppi, C. Patterson and S. Furber, "Power analysis of large-scale, real-time 
neural networks on SpiNNaker," The 2013 International Joint Conference on Neural Networks (IJCNN), 
Dallas, TX, 2013, pp. 1-8. 
[56] J. Partzsch, S. Hoppner, M. Eberlein, R. Schuffny, C. Mayr, D. R. Lester, and S. Furber, “A fixed 
point exponential function accelerator for a neuromorphic many-core system,” in 2017 IEEE International 
Symposium on Circuits and Systems (ISCAS), May 2017, pp. 1–4.  
[57] A. Cassidy et al., “Real-time Scalable Cortical Computing at 46 Giga-Synaptic OPS/Watt with 
∼100× Speedup in Time-to-Solution and ∼100,000× Reduction in Energy-to-Solution”, Proc. of 
International Conference for High Performance Computing, Networking, Storage and Analysis, SC14 
(2014). 
[58] Benjamin, B., P. Gao, E. McQuinn, S. Choudhary, A. Chandrasekaran, J. Bussat, R. Alvarez-Icaza, J. 
Arthur, P. Merolla, and K. Boahen, “Neurogrid: A mixed analog-digital multichip system for large-scale 
neural simulations,” Proc. IEEE, 102(5):699–716, 2014. 
[59] Park, J., S. Ha, T. Yu, E. Neftci, and G. Cauwenberghs, “65k-neuron 73-Mevents/s 22-pJ/event 
asynchronous micro-pipelined integrate-and-fire array transceiver,” Proc. 2014 IEEE Biomedical Circuits 
and Systems Conf. (BioCAS), 2014. 
[60] N. Qiao, H. Mostafa, F. Corradi, M. Osswald, F. Stefanini, D. Sumislawska, and G. Indiveri, “A 
reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K 
synapses”, Frontiers in Neuroscience, v. 9, 141 (2015). 
[61] G. Indiveri, F. Corradi and N. Qiao, "Neuromorphic architectures for spiking deep neural networks," 
2015 IEEE International Electron Devices Meeting (IEDM), Washington, DC, 2015, pp. 4.2.1-4.2.4.  
 
[62] N. Qiao and G. Indiveri, "Scaling mixed-signal neuromorphic processors to 28 nm FD-SOI 
technologies," 2016 IEEE Biomedical Circuits and Systems Conference (BioCAS), Shanghai, 2016, pp. 
552-555. 
[63] Davies, M., Srinivasa, N., Lin, T., Chinya, G., Cao, Y., Choday, S., Dimou, G., Joshi, P., Imam, N., 
Jain, S., Liao, Y., Lin, C., Lines, A., Liu, R., Mathaikutty, D., McCoy, S., Paul, A., Tse, J., 
Venkataramanan, G., Weng, Y., Wild, A., Yang, Y., and Wang, H., “Loihi: A Neuromorphic Manycore 
Processor with On-Chip Learning”, IEEE Micro, vol. 38, no. 1, pp. 82-99, January/February 2018. 
[64] A. Lines, P. Joshi, R. Liu, S. McCoy, J. Tse, Y.-H. Weng, and M. Davies, “Loihi Asynchronous 
Neuromorphic Research Chip”, Proceedings of 24th IEEE International Symposium on Asynchronous 
Circuits and Systems, Vienna, May 13-16, 2018. 
[65] G. K. Chen et al., “A 4096-neuron 1M-synapse 3.8pJ/SOP Spiking Neural Network with On-chip 
STDP Learning and Sparse Weights in 10nm FinFET CMOS”, Proc. VLSI Symposium, C24-1, 2018. 
[66] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “A small-footprint high-throughput 
accelerator for ubiquitous machine learning”, ASPLOS ’14. 
[67] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, O. Temam, “A 
Machine-Learning Supercomputer”, MICRO ’14. 
[68] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, Jia Wang, L. Li, T. Chen, Z. Xu, N. Sun, O. Temam, “A 
Machine Learning Accelerator”, ASPLOS ’15. 
[69] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shifting 
vision processing closer to the sensor”, ISCA ’15. 
[70] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable 
accelerator for deep convolutional neural networks,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. 
Tech. Papers, Feb. 2016, pp. 262–263. 
[71] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," 2016 
ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 
243-254. 
[72] R. Andri, L. Cavigelli, D. Rossi and L. Benini, "YodaNN: An Ultra-Low Power Convolutional 
Neural Network Accelerator Based on Binary Weights," 2016 IEEE Computer Society Annual 
Symposium on VLSI (ISVLSI), Pittsburgh, PA, 2016, pp. 236-241.  
[73] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional Network Accelerator”, IEEE J. 
Trans. Circuits and Systems, v. 27, p 2461 (2016). GLVLSI 2015. 
[74] B. Moons, R. Uytterhoeven, W. Dehaene, M. Verhelst, “Envision: A 0.26-to-10TOPS/W Subword-
Parallel Dynamic-Voltage Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 
28nm FDSOI”, IEEE International Solid-State Circuits Conference, 2017, pp 246-247 (2017). 
[75] L. Gwennap, M. Demler, and L. Case, “A Guide to Processors for Deep Learning”, Linley Group. 
[76] D. Moloney, B. Barry, R. Richmond, F. Connor, C. Brick, and D. Donohoe, “Myriad 2: Eye of the 
computational vision storm,” in IEEE Hot Chips Symposium (HCS), Aug. 2014, pp. 1–18. 
