Energy Efficient and Error Resilient Neuromorphic Computing in VLSI by Kim, Yongtae
ENERGY EFFICIENT AND ERROR RESILIENT NEUROMORPHIC
COMPUTING IN VLSI
A Dissertation
by
YONGTAE KIM
Submitted to the Office of Graduate and Professional Studies of
Texas A&M University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Chair of Committee, Peng Li
Committee Members, Gwan Choi
Jose Silva-Martinez
Rabi Mahapatra
Head of Department, Chanan Singh
December 2013
Major Subject: Electrical Engineering
Copyright 2013 Yongtae Kim
ABSTRACT
Realization of the conventional Von Neumann architecture faces increasing chal-
lenges due to growing process variations, device reliability and power consumption.
As an appealing architectural solution, brain-inspired neuromorphic computing has
drawn a great deal of research interest due to its potentially improved scalability and
power efficiency, and its better suitability for processing complex tasks. Moreover, the
inherent error resilience of neuromorphic computing allows remarkable power and energy
savings by exploiting approximate computing. This dissertation focuses on a scalable
and energy efficient neurocomputing architecture which leverages emerging memris-
tor nanodevices and a novel approximate arithmetic for cognitive computing.
First, a brain-inspired digital neuromorphic processor (DNP) architecture with
a memristive synaptic crossbar is presented for large scale spiking neural networks.
We leverage memristor nanodevices to build an N×N crossbar array to store not
only multibit synaptic weight values but also the network configuration data with
significantly reduced area cost. Additionally, the crossbar array is accessible both
column- and row-wise to significantly expedite the synaptic weight update process for
on-chip learning. The proposed digital pulse width modulator (PWM) readily creates
a binary pulse with various durations to read and write the multilevel memristors
at low cost. Our design integrates N digital leaky integrate-and-fire (LIF) silicon
neurons to mimic their biological counterparts and the respective on-chip learning
circuits for implementing spike timing dependent plasticity (STDP) learning rules.
The proposed column based analog-to-digital conversion (ADC) scheme accumulates
the pre-synaptic weights of a neuron efficiently and reduces silicon area by using only
one shared arithmetic unit for processing LIF operations of all N neurons. With 256
silicon neurons, the learning circuits and 64K synapses, the power dissipation and
area of our design are evaluated as 6.45 mW and 1.86 mm², respectively, in a 90 nm
CMOS technology.
Furthermore, arithmetic computations contribute significantly to the overall pro-
cessing time and power of the proposed architecture. In particular, addition and
comparison operations represent 88.5% and 42.9% of processing time and power for
digital LIF computation, respectively. Hence, by exploiting the built-in resilience of
the presented neuromorphic architecture, we propose novel approximate adder and
comparator designs to significantly reduce energy consumption with a very low er-
ror rate. The significantly improved error rate and critical path delay stem from a
novel carry prediction technique that leverages the information from less significant
input bits in a parallel manner. An error magnitude reduction scheme is proposed
to further reduce the amount of error, once detected, at low cost in the proposed adder
design. Implemented in a commercial 90 nm CMOS process, the proposed adder is
shown to be up to 2.4× faster and 43% more energy efficient than traditional
adders while having an error rate of only 0.18%. Additionally, the proposed
comparator achieves an error rate of less than 0.1% and an energy reduction of up to
4.9× compared to conventional ones. The proposed arithmetic has been adopted
in a VLSI-based neuromorphic character recognition chip using unsupervised
learning. The approximation errors of the proposed arithmetic units have been shown to
have negligible impact on the training process. Moreover, an energy saving of up
to 66.5% over traditional arithmetic units is achieved for the neuromorphic chip with
scaled supply levels.
DEDICATION
To my wife
ACKNOWLEDGEMENTS
First and foremost, I am very grateful to have had the opportunity to work with
a great research advisor Dr. Peng Li and would like to thank him with my deep
respect for his valuable advice and consistent support during my doctoral studies at
Texas A&M University. Dr. Li has actively encouraged me to move forward with new
innovative research ideas and willingly shared his profound knowledge, deep insight
and creative inspiration so I could learn the way of research from him. Also, I would
like to thank my committee members Dr. Gwan Choi, Dr. Jose Silva-Martinez
and Dr. Rabi Mahapatra for their constructive discussions and suggestions on my
research, making this dissertation possible.
My appreciation goes to all the members in our research group for their knowl-
edge, discussion and friendship. Particular thanks go to Yong Zhang and Qian
Wang for the simulation and layout support. Many friends in the department and
the alumni of Korea University have made my stay of four years in College Station
a pleasurable and unforgettable experience. I also want to thank all my other
friends at A&M for their consistent help and considerable assistance.
From deep down in my heart, I would like to thank my parents and other family
members for their devotion, support and encouragement. In particular, I would like
to give special thanks to my wife for her unconditional love, trust, patience and
sacrifice, leading me to successfully complete my Ph.D. studies.
Finally, the funding support from the Semiconductor Research Corporation is
acknowledged.
NOMENCLATURE
ADC Analog-to-Digital Conversion/Converter
ANN Artificial Neural Network
CMOS Complementary Metal Oxide Semiconductor
CLA/CLC Carry Lookahead Adder/Comparator
DNP Digital Neuromorphic Processor
EDAP Energy-Delay-Area Product
EDC Error Detection and Correction
EDEP Energy-Delay-Error Product
EDP Energy-Delay Product
FSM Finite State Machine
LIF Leaky Integrate-and-Fire
LSB Least Significant Bit
LUT Look Up Table
MSB Most Significant Bit
PWM Pulse Width Modulation/Modulator
RCA/RCC Ripple Carry Adder/Comparator
SNN Spiking Neural Network
STDP Spike Timing Dependent Plasticity
VCO Voltage Controlled Oscillator
VLSI Very Large Scale Integration
WTA Winner-Take-All
TABLE OF CONTENTS
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
NOMENCLATURE
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 Digital Neuromorphic Processor for Cognitive Computing
1.2 Approximate Arithmetic for Energy Efficient Neurocomputing
1.3 Outline of the Dissertation
2. BACKGROUND AND RELATED WORKS
2.1 Brain-Inspired Neuromorphic Computing
2.1.1 Biological Motivation
2.1.2 Artificial Neural Networks
2.1.3 Spiking Neural Networks
2.1.4 Silicon Neuron Circuits
2.1.5 Neuromorphic VLSI Systems
2.2 Approximate Computing
2.3 Emerging Memory Technologies
2.4 Objective of the Dissertation
3. RECONFIGURABLE DIGITAL NEUROMORPHIC PROCESSOR WITH MEMRISTIVE CROSSBAR ARRAY
3.1 Digital Neuromorphic Processor Architecture
3.1.1 Overall Processor Architecture
3.1.2 Flow Control of the Neuromorphic Processor
3.2 Memristive Synaptic Crossbar Array
3.2.1 Memristor Readout Schemes
3.2.2 Memristive Synaptic Cell Partition
3.2.3 Memristive Crossbar Array and Cells
3.2.4 Digital Pulse Width Modulation for Memristive Synaptic Cell
3.3 Building Block Implementations
3.3.1 Memristor Readout
3.3.2 Neuron and LIF Arithmetic Units
3.3.3 Learning Unit
3.4 Implementation of the Neuromorphic Processor and Simulation Results
3.4.1 Column ADC Performance
3.4.2 Overall Processor Performance
3.4.3 Application of the Neuromorphic Processor for Character Recognition System
3.5 Summary
4. ENERGY EFFICIENT APPROXIMATE ARITHMETIC
4.1 Proposed Approximate Adder
4.1.1 Approximate Adder Architecture
4.1.2 Error Rate Analysis
4.1.3 Error Magnitude Reduction Scheme
4.2 Proposed Approximate Comparator
4.2.1 Approximate Comparator Architecture
4.2.2 Error Rate Analysis
4.3 Simulation Results
4.3.1 Error Rate of the Proposed Approximate Adder
4.3.2 Performance of the Proposed Approximate Adder
4.3.3 Comparison with Seven Other Approximate Adders
4.3.4 Comparison on Error-Free Operations
4.3.5 Error Rate of the Proposed Approximate Comparator
4.3.6 Performance of the Proposed Approximate Comparator
4.4 Summary
5. APPLICATION OF APPROXIMATE ARITHMETIC TO NEUROMORPHIC COMPUTING
5.1 Evaluation Environment
5.2 Impacts of Approximation Errors on Neuromorphic Applications
5.2.1 Approximate Adder Error Effects
5.2.2 Approximate Comparator Error Effects
5.3 Energy Efficiency of LIF Neurons with Approximate Adders and Comparators with Supply Voltage Scaling
5.4 Energy Efficiency during the Training Process with Supply Voltage Scaling
5.5 Summary
6. CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work
REFERENCES
LIST OF FIGURES
FIGURE
1.1 Application of approximate arithmetic in neuromorphic computing.
2.1 Biological neuron anatomy [57].
2.2 Artificial neuron.
2.3 Activation functions: (a) step, (b) piecewise linear, and (c) sigmoid with different parameter a.
2.4 Feedforward artificial neural network architecture.
2.5 Spiking neural network.
2.6 Spiking neuron behavior.
2.7 Spike timing dependent plasticity.
2.8 Analog silicon neuron: (a) schematic and (b) timing diagram [29].
2.9 Digital silicon neuron [6].
2.10 Block diagram of digital neurosynaptic core [46].
2.11 Block diagram of (a) neuromorphic chip and (b) silicon neuron [58].
2.12 Memristive device structure (left) and variable resistance model (right) [63].
2.13 Energy efficient and error resilient neuromorphic computing in VLSI.
3.1 Block diagram of the proposed digital neuromorphic processor architecture.
3.2 Flow diagram of the proposed neuromorphic processor.
3.3 Memristor sensing schemes by (a) load resistor and (b) summing amplifier.
3.4 Memristor level partitions by equal conductance.
3.5 Proposed synaptic crossbar array and CMOS/memristor hybrid synaptic cell.
3.6 Digital pulse width modulator.
3.7 Proposed memristor readout block consisting of column ADC and low-resolution ADC array.
3.8 Block diagram of VCO based ADC and proposed delay cell.
3.9 Neuron elements with the LIF arithmetic unit.
3.10 Flowchart of the processing of the neuron unit.
3.11 Learning elements with global timer and shared LUTs.
3.12 Flowchart of the processing of the learning unit.
3.13 Layout of the neuromorphic processor with 256 neurons and 65,536 synapses.
3.14 Column ADC performance: (a) input-to-output characteristics and (b) power and area as functions of counter type and resolution.
3.15 Neuromorphic processor performance: (a) power and (b) area breakdown.
3.16 Network for character recognition and training for alphabets.
3.17 Neuron index mapping and synaptic connections of the crossbar array.
3.18 Learning results for network: (a) receptive fields after training and (b) spike rasters for output neurons.
4.1 Block diagram of the proposed approximate adder.
4.2 Proposed carry prediction using parallel carry-skip (k=6, v=3).
4.3 Block diagram of the error magnitude reduction and an example of its operation (k=8, v=2).
4.4 Block diagram of the proposed approximate comparator.
4.5 Example of the comparator configuration (n=16, k=4, v=2).
4.6 Error rates of the proposed adder under different n, k and v.
4.7 Energy comparison under supply scaling.
4.8 Error rates of the proposed comparator with various n, k and v.
5.1 Digital LIF neuron: (a) block diagram and (b) delay and power breakdowns.
5.2 (a) Input character patterns and (b) receptive fields with 16-bit accurate adders.
5.3 Receptive fields with 16-bit (a) proposed approximate adder, (b) LUA, (c) LOA (8-8), (d) ETAI (8-8), (e) ETAII, (f) VLCSA-1, (g) ACA and (h) DAA (8-8).
5.4 Receptive fields with 16-bit (a) LOA (13-3), (b) ETAI (15-1) and (c) DAA (11-5).
5.5 Receptive fields with 16-bit (a) accurate adder with proposed comparator and (b) proposed adder with proposed comparator.
5.6 Normalized energies of one digital LIF neuron with various adders with supply voltage scaling.
5.7 Normalized energy consumptions by all the digital LIF neurons of the network while training with various adders and comparators under supply voltage scaling.
6.1 Neuron and synapse integration densities as a function of technology.
6.2 Trends of gate length and power supply [70].
6.3 Scaling trend of neuron and synapse integration.
LIST OF TABLES
TABLE
3.1 Normalized write times to change one level of memristor conductance (R_ON = 10 kΩ, R_OFF = 500 kΩ, V_WRITE = 1.2 V).
3.2 Neuromorphic processor implementation summary.
4.1 Proposed adder with different n, k and v.
4.2 Comparison with other 16-bit adders.
4.3 Approximate adders with error detection and correction.
4.4 Proposed comparator with different n, k and v.
4.5 Comparison with other 16-bit comparators.
5.1 Error rates and average error magnitudes of various adders during training process.
1. INTRODUCTION
The human brain mediates and produces our thoughts, actions, memory, feelings
and other complex functions, and it accomplishes all of these with great energy and
space efficiency. In contrast, to achieve the same, man-made conventional
Von Neumann machines may require tremendous power, energy and space
resources for computation, communication and memory, if it is possible at all [46]. To
date, implementing the Von Neumann architecture faces grand challenges due to
growing process variations, device reliability and power consumption. As an
appealing architectural alternative, brain-inspired neuromorphic computing has emerged
as a promising way to overcome these underlying constraints. It may be well
suited for processing complex tasks, such as character or image recognition,
classification and language learning, and may offer greater power efficiency and scalability
[48, 58, 46, 59, 4, 43].
Furthermore, brain-inspired architectures may offer inherent error resilience and
fault tolerance, which is very appealing for large-scale integration in scaled VLSI
technologies. This inherent error resilience in neuromorphic computing allows remarkable
power and energy savings by adopting approximate computing, which has drawn
a significant amount of research interest as a remedy for the growing energy
efficiency challenges [21, 60, 8]. The key observation of approximate computing is
that many applications, such as digital signal processing (DSP) and neuromorphic
systems, have inherent error resilience, and hence 100% precision in computation is
not required. This provides opportunities for energy saving by relaxing computation
accuracy while achieving an acceptable processing quality. Particularly, the core of
many DSP and neuromorphic applications lies in processing specific kernel functions,
which occupy a significant portion of silicon area and computation time [49, 26]. For
instance, MPEG motion estimation heavily performs the L1-norm arithmetic for sum
of absolute difference (SAD) calculation [66] and spiking neural networks use the
leaky integrate-and-fire (LIF) operation to mimic neuron behavior [58]. Obviously,
adders are the primary component for building these arithmetic kernel functions.
In addition, comparators are indispensable to determine firing activities in the LIF
operation of neuromorphic applications.
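To make the roles of the adder and the comparator concrete, the following is a minimal sketch of one discrete-time LIF update. It is not the circuit proposed in this dissertation: the threshold, the constant leak, the reset value and the clamping at zero are illustrative assumptions, as are the names LIFNeuron and step.

```python
from dataclasses import dataclass


@dataclass
class LIFNeuron:
    """Illustrative integer LIF neuron; all parameter values are assumptions."""
    threshold: int = 100  # assumed firing threshold
    leak: int = 1         # assumed constant leak subtracted every time step
    v_reset: int = 0      # assumed potential after a spike
    v: int = 0            # membrane potential state

    def step(self, synaptic_input: int) -> bool:
        # Adder stage: integrate the input and apply the leak (clamped at 0).
        self.v = max(self.v + synaptic_input - self.leak, 0)
        # Comparator stage: decide whether the neuron fires.
        if self.v >= self.threshold:
            self.v = self.v_reset
            return True
        return False


neuron = LIFNeuron()
# A constant input of 30 per step: the potential climbs 29, 58, 87, 116,
# so the neuron fires on the 4th step, resets, and fires again on the 8th.
spikes = [neuron.step(30) for _ in range(8)]
print(spikes)
```

Every time step costs one or two additions and one comparison, which is why the accuracy and energy of these two units matter for the neuron as a whole.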
To this end, this dissertation proposes two hardware design techniques for scalable
and energy efficient neurocomputing applications: 1) a reconfigurable digital
neuromorphic processor (DNP) with a memristive synaptic crossbar and 2) energy efficient
approximate arithmetic for neuromorphic VLSI systems.
1.1 Digital Neuromorphic Processor for Cognitive Computing
The first contribution of this dissertation is a reconfigurable neuromorphic
processor with a memristive synaptic crossbar for cognitive computing. We propose a
reconfigurable digital neuromorphic architecture comprising a memristive crossbar,
an array of digital LIF spiking neurons and on-line learning circuits that support
spike timing dependent plasticity (STDP) learning mechanisms. We leverage the
memristor nanodevice to implement on-chip synaptic weight storage at low cost
since the memristor provides non-volatility, excellent scalability and a high density of
10 Gb/cm² or more [23, 72]. To implement a multilevel memristor synaptic memory,
we systematically analyze the memristor device in terms of programming time and
level partitioning. We also investigate memristor readout schemes to more efficiently
perform digital LIF operations with the crossbar structure and present a low-cost
digital pulse width modulation (PWM) scheme for writing the memristor. While the
previous analog-to-digital converter (ADC) design in [36] is a bottleneck in overall
power dissipation, we address this limitation by optimizing the VCO based column
ADC through the introduction of an asynchronous counter that measures the VCO
frequency in digital form. In the proposed DNP, the N×N memristive synaptic
array, which stores both multibit synapse values and network configuration data,
can be accessed both column- and row-wise to speed up the synaptic weight update
process. The proposed column ADC effectively accumulates pre-synaptic weights
and allows a single adder and comparator to be shared among all N neurons to
perform LIF operations without degrading throughput. This leads to a considerable
silicon area reduction and its digital implementation style is scalable for large scale
integration.
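The column-wise accumulation and the sharing of one arithmetic unit can be pictured with a small behavioral sketch. It treats the column ADC as an ideal integer sum and the crossbar as a plain matrix, and it assumes that rows index pre-synaptic neurons and columns index post-synaptic neurons. The function names, the leak and the threshold are hypothetical and are not taken from the design itself.

```python
def column_sums(weights, pre_spikes):
    """Sum, per column, the weights of the rows whose pre-synaptic neuron fired.

    weights[i][j] is the (assumed) weight from pre-synaptic neuron i to
    post-synaptic neuron j; this models the column readout as an ideal sum.
    """
    n = len(weights)
    return [sum(weights[i][j] for i in range(n) if pre_spikes[i])
            for j in range(n)]


def shared_lif_pass(potentials, inputs, threshold, leak):
    """Update all neurons one at a time, as a single shared adder/comparator would."""
    fired = []
    for j, total in enumerate(inputs):
        v = max(potentials[j] + total - leak, 0)  # shared adder
        if v >= threshold:                        # shared comparator
            fired.append(j)
            v = 0
        potentials[j] = v
    return fired


weights = [[0, 5, 2],
           [3, 0, 7],
           [4, 1, 0]]
pre_spikes = [True, False, True]   # neurons 0 and 2 fired in this time step
potentials = [0, 0, 0]

sums = column_sums(weights, pre_spikes)
fired = shared_lif_pass(potentials, sums, threshold=5, leak=1)
print(sums, fired, potentials)
```

Because each column delivers an already accumulated value, the shared unit performs a fixed amount of work per neuron regardless of how many pre-synaptic neurons fired.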
When implemented in a commercial 90 nm CMOS technology, a 256 neuron design
with a 256×256 synaptic array based on the proposed neuromorphic architecture has
an estimated area of 1.86 mm² and power consumption of 6.45 mW under the regular
supply voltage of 1.2 V. The proposed neuromorphic architecture is
rather flexible and can be configured to various network topologies to support a
range of cognitive learning applications. To demonstrate its potential application,
we configure our DNP to realize a two-layer spiking network with over two hundred
silicon neurons for character recognition with unsupervised learning.
1.2 Approximate Arithmetic for Energy Efficient Neurocomputing
Our second contribution is to apply an approximate computing scheme to the
neuromorphic hardware design for considerable energy saving (see Figure 1.1). To
achieve this, we propose a novel approximate adder with a parallel carry-skip scheme.
While reducing the worst-case carry propagation delay, this carry-skip scheme allows
for highly accurate carry prediction, making it possible to either speed up addition
operations or reduce energy dissipation by lowering the supply voltage.
Figure 1.1: Application of approximate arithmetic in neuromorphic computing.
(Block labels: Input Spikes, Input Interface, Synapse, Learning - Plasticity,
Neurons - Leaky Integrate-and-Fire, Approximate Arithmetic (Adder, CMP),
Output Interface, Output Spikes.)
The significantly improved error rate and critical path delay stem from the employed carry
prediction technique that leverages the information from less significant input bits
in a parallel manner. An error magnitude reduction scheme is proposed to further
reduce the amount of error, once detected, at low cost. Our adder design is rather
flexible in the sense that low-overhead error correction logic can be readily included to
achieve error-free operations at the cost of one additional clock cycle. Additionally,
we extend our approximate arithmetic scheme to comparator design and present
a complete error rate analysis for the proposed arithmetic units. We extensively
compare our approximate designs with a large number of existing accurate and
approximate adders and comparators and show the large improvements in area, power,
energy, timing and error rate brought by our design technique. To evaluate the
performance of our approximate arithmetic units for neurocomputing applications,
we present an efficient evaluation methodology to analyze large VLSI-based spiking
neural networks with over a thousand silicon neurons for character recognition. We
extensively study the error tolerance of the network, the overall energy consumption
of digital LIF neurons during the learning process and its dependency on the
underlying arithmetic units.
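The general idea of predicting a carry from only a few less significant bits can be illustrated with a generic block-speculative adder. The sketch below is not the parallel carry-skip design proposed in this dissertation. It is a simpler stand-in in which each k-bit block takes its carry-in from the block immediately below it and ignores everything older. The function names, the block size and the Monte Carlo error estimate are assumptions made for illustration, and n is assumed to be a multiple of k.

```python
import random


def speculative_add(a, b, n=16, k=4):
    """Approximate n-bit addition with carries predicted from one lower block."""
    mask = (1 << k) - 1
    result = 0
    for lo in range(0, n, k):
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        if lo == 0:
            carry_in = 0
        else:
            # Predicted carry: uses only the previous block and assumes that
            # block itself received no carry, which shortens the carry chain.
            prev_a = (a >> (lo - k)) & mask
            prev_b = (b >> (lo - k)) & mask
            carry_in = (prev_a + prev_b) >> k
        result |= ((a_blk + b_blk + carry_in) & mask) << lo
    return result  # n-bit result; the final carry-out is dropped


def error_rate(n=16, k=4, trials=100_000, seed=0):
    """Estimate the fraction of uniformly random inputs that add incorrectly."""
    rng = random.Random(seed)
    full = (1 << n) - 1
    errors = 0
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        if speculative_add(a, b, n, k) != ((a + b) & full):
            errors += 1
    return errors / trials


print(hex(speculative_add(0x00FF, 0x0001)))  # a carry spanning two blocks is lost
print(error_rate(k=4), error_rate(k=8))
```

A misprediction occurs only when a carry has to travel through an entire block, so larger blocks trade a longer critical path for fewer errors. This is the kind of trade-off that an error rate analysis of an approximate adder quantifies.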
Implemented in a commercial 90 nm CMOS process, the proposed adder is shown to be
up to 2.4× faster and 43% more energy efficient than traditional adders while
having an error rate of only 0.18%. In addition, the proposed comparator achieves
an error rate of less than 0.1% and an energy reduction of up to 4.9× compared
to conventional ones. To evaluate the performance of the proposed arithmetic units
in neuromorphic applications, we develop a behavioral evaluation approach for
a VLSI-based neuromorphic character recognition chip using unsupervised learning.
The approximation errors of the proposed adder and comparator have been shown to
have negligible impact on the training process, while other approximate adders lead
to an unacceptable level of performance degradation. Furthermore, the proposed adder
and comparator enable overall energy reductions of up to 66.5% over traditional
arithmetic units for the digital neuron circuits during the training process with a
scaled supply.
1.3 Outline of the Dissertation
The remainder of this dissertation is organized as follows. Section 2 describes
the background of brain-inspired neuromorphic and approximate computing, and
the related work in the literature. The memristor nanodevice leveraged for the
synaptic array is also introduced in Section 2. The reconfigurable digital neuromorphic
processor with memristive synaptic crossbar and its application for the character
recognition system with unsupervised learning are presented in Section 3. After
proposing energy efficient approximate arithmetic units in Section 4, the impacts
of the approximation errors on the neurocomputing application and the energy
efficiency analysis of the proposed approximate units for the neuromorphic hardware
design are presented in Section 5. Finally, we conclude this dissertation and discuss
future work in Section 6.
2. BACKGROUND AND RELATED WORKS
This section gives an overview of the neuromorphic and approximate computing
paradigms. It begins with the biological motivation of neuromorphic computing, then
reviews artificial and spiking neural networks and their learning algorithms.
It also covers existing designs of silicon neurons and neuromorphic VLSI
systems that mimic the biological brain on silicon, and discusses their key design issues
and limitations. Next, it introduces the notion of approximate computing and briefly
reviews several design approaches to the approximate adder, which is the primary
component in approximate computing, along with their advantages and disadvantages.
An overview of the memristor nanodevice, which is employed
in our digital neuromorphic processor as on-chip storage to maintain a huge number of
synaptic weight values, is then given. Finally, it clarifies the objective of this dissertation.
2.1 Brain-Inspired Neuromorphic Computing
Today's Von Neumann computers are able not only to deal with very complicated
numerical and algorithmic computations and procedural control tasks, such as
sorting, but also to store huge amounts of data. Thus, they have been widely used for
solving complicated problems that may be hard for humans to handle.
On the other hand, traditional machines may struggle with many other
kinds of tasks that human beings process without difficulty, such as character
or image recognition, text reading and language learning. Importantly, humans
adapt to new situations and accumulate information and knowledge through an amazing
ability of the brain: learning. In other words, when faced with a new situation, they
make a proper decision and behave appropriately based on the knowledge acquired
through learning or training. Incredibly, the human brain
processes these tasks much more energy efficiently than conventional computers.
In general, biological neurons are 10⁶× slower than silicon logic gates [20]. Silicon
chips operate with a clock period in the nanosecond (10⁻⁹ s) range while
neural events happen in the millisecond (10⁻³ s) range. The slower operating speed
of biological neurons may contribute to the brain's exceptionally
good energy efficiency. Specifically, the brain consumes approximately 10⁻¹⁶ J per
operation per second, whereas the traditional computer requires an energy level of
about 10⁻⁶ J per operation per second [20].
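Taking the two pairs of figures quoted above at face value, the gaps in energy and in speed can be written out as simple ratios. The symbols E and t are introduced here only for illustration.

```latex
\frac{E_{\text{computer}}}{E_{\text{brain}}}
  \approx \frac{10^{-6}\,\text{J}}{10^{-16}\,\text{J}} = 10^{10},
\qquad
\frac{t_{\text{neuron}}}{t_{\text{gate}}}
  \approx \frac{10^{-3}\,\text{s}}{10^{-9}\,\text{s}} = 10^{6}
```

In other words, by these estimates the brain is about six orders of magnitude slower per event but about ten orders of magnitude more frugal per operation.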
Historically, neuroscientists have devoted intense effort toward
investigating the human brain. As part of these efforts, a landmark work in
modeling the dynamics of a biological neuron was conducted by Hodgkin and Huxley
[24]. After that, a variety of computational neuron models, such as
FitzHugh-Nagumo [17, 52], Hindmarsh-Rose [22], and Morris-Lecar [51], have been proposed.
Scientists have also studied the interactions among neurons through synapses.
Brain and neuron modeling has been greatly facilitated by the rapid advance of digital
computers. However, simulating a large number of these computational neurons is
still challenging to date because it requires tremendous computing power and
simulation time. Meanwhile, neuromorphic engineers have been trying to reproduce
neuron behaviors by morphing their anatomy and physiology into silicon chips for
simulating human or other mammalian brains in real time [48, 58, 46, 59, 4, 43]. Such
hardware implementations provide very fast simulation of neural networks with less
power consumption.
2.1.1 Biological Motivation
Artificial or spiking neural networks are the core of brain-inspired neuromorphic
computing systems. Their development has been motivated in part by the
insights obtained from biological nervous systems (e.g., the human brain), which are
extremely intricate interconnections of neurons. As an example, the adult human
brain is estimated to contain a densely interconnected network of approximately 10¹¹
neurons and more than 10¹⁴ synapses [13]. Neurons are the primary elements of a
nervous system and are specialized types of biological cells that are electrically
excitable. Neurons process and transmit information in the form of cellular signals,
which are either electrical or chemical for long and short distances, respectively. A
considerable number of neurons connect to each other to form a neural network via
synapses, which are specialized connections among the neurons.
Figure 2.1: Biological neuron anatomy [57].
Figure 2.1 illustrates a typical biological neuron and synapse structure. A neuron
mainly consists of three functionally distinct parts, which are the cell body (often
called the soma), axon and dendrites. The cell body is the heart of the neuron
and includes the nucleus where most protein synthesis occurs. The dendrites of the
neuron are highly branched extensions and receive nerve signals from other neurons.
A neuron may have numerous dendrites and their overall shape is referred to as a
dendritic tree. On the other hand, the neuron has only one axon that is typically
thinner and much longer than the dendrites and transmits the signals to other cells
via synapses. In short, the dendrites and axon act as the signal receiver and
transmitter, respectively. Information in the nervous system is encoded in the form of
an electrical impulse called the action potential or spike, and this pulse is
transmitted from a pre-synaptic neuron to a post-synaptic one. The action poten-
tials are created by the axon hillock that is a specialized part of the cell body and
connects to the axon. A neuron processes information by integrating the incoming
nerve signals that come from its pre-synaptic neurons and the action potential is
generated when the membrane potential of the neuron reaches a certain threshold.
Briefly, the neuron transmits the information using the action potentials or spikes.
A synapse is basically a junction between two neurons, which are referred to as
the pre-synaptic and post-synaptic neurons, respectively. In fact, neurons do not
physically touch each other and are separated by a small space called the synaptic
cleft. When an action potential arrives at the axon terminal, the pre-synaptic neuron
releases chemical neurotransmitter molecules into the synaptic cleft and they diffuse
across the synaptic junction, leading to interneuron communication at the synapse.
These chemical molecules bind to the receptor which is placed on the opposite side of
the cleft (i.e. post-synaptic neuron) and cause the membrane potential of the post-
synaptic neuron to change. The type of the receptor and neurotransmitter employed
at the synapse determines whether the post-synaptic neuron would be either excited
or inhibited when a pre-synaptic spike is generated. The resulting effects of excitation
and inhibition are to potentiate and depress the post-synaptic neuron’s membrane
potential. In addition, the strength of a synapse is defined by the amplitude change of
the membrane potential as a result of a pre-synaptic action potential. Learning and
memory result from changes in synaptic strength through the mechanism
of synaptic plasticity, which either decreases or increases the strength. In this
way, the synapses store information.
2.1.2 Artificial Neural Networks
An artificial neural network (ANN) is a computational model inspired by
biological nervous systems, in particular the brain, and is widely adopted in applications of
intelligent information processing, such as machine learning and pattern recognition
[20, 30]. An ANN is simply an intricate web of connected artificial neurons (called
the processing elements) that processes information in a way to mimic biological
neural networks. The signals of the network are passed among the artificial neurons
over the connection links called synapses. Each synapse has an associated weight or
strength of its own, which typically multiplies the signal transmitted. The weight is
an adaptive numerical parameter that can be manipulated by a learning algorithm.
Additionally, each neuron accumulates the input signals that are weighted by the
respective synapses of the neuron, and applies an activation function that may be
either linear or non-linear to its net input (i.e. sum of the weighted input signals)
to determine its output signal. Furthermore, ANNs are similar to their biological
counterparts in the sense that they perform functions dispersively, collectively and
in parallel by the processing elements.
Artificial neurons are the basic functional units to build an ANN and are a great
simplification of biological neurons. The first computation model of artificial neu-
rons was created by McCulloch and Pitts in [44]. The McCulloch-Pitts model is
based on a simplified binary neuron whose state is either active or not-active, and
implements a threshold function in discrete time. The state is determined by accu-
mulating weighted incoming signals of activated pre-synaptic neurons at each neural
computation step. Namely, it is set to active if the sum of the weighted signals
exceeds a given threshold, otherwise it is not. Subsequent neuron models extend
the McCulloch-Pitts model by introducing real-valued inputs and outputs and
various threshold (activation) functions [20].
Figure 2.2: Artificial neuron. [Diagram: the input signals are multiplied by the synaptic weights w_1k, w_2k, ..., w_mk, summed (Σ) to give v_k, and passed through the activation function to produce the output signal y_k.]
Figure 2.2 depicts a typical computation
model of the artificial neuron. The model consists of three basic elements: 1) a set
of synapses which are represented by synaptic weights; 2) an adder for summing
the input signals that are multiplied by the respective synaptic weights; and 3) an
activation function to bound the amplitude of the output signal. The behavior of
the neuron k is mathematically described by the following equations:
\[ v_k = \sum_{j=0}^{m} w_{jk} x_j \tag{2.1} \]
\[ y_k = \varphi(v_k) \tag{2.2} \]
where m is the number of pre-synaptic neurons, $x_j$ is the input signal coming from
neuron j, $w_{jk}$ is the synaptic weight between neurons j and k, $v_k$ is the linear
summation due to the input signals, ϕ(·) is the activation function and $y_k$ is the
output signal of neuron k. The activation function ϕ(·) defines
the neuron output in terms of the summation input v. There are many
possible activation functions, but we introduce three basic types: 1) step;
2) piecewise linear; and 3) sigmoid. They are plotted in Figure 2.3(a), (b) and (c), respectively.
First, the step function makes a binary decision and produces only two values.
The step function in Figure 2.3(a) can be described by
\[ \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases} \tag{2.3} \]
where the threshold value is zero. The output value is “1” if the input v is greater
than or equal to the given threshold; otherwise, this function generates a value of “0”
as the output.
Figure 2.3: Activation functions: (a) step, (b) piecewise linear, and (c) sigmoid with
different parameter a. [Plots: each panel shows output versus input for inputs from −2 to 2; panel (c) is annotated with the direction of increasing a.]
Second, the piecewise linear function is composed of a number of linear segments
over an equal number of intervals. The piecewise linear function described in Figure
2.3(b) is expressed by
\[ \varphi(v) = \begin{cases} 1 & \text{if } v \ge +\frac{1}{2} \\ v & \text{if } +\frac{1}{2} > v > -\frac{1}{2} \\ 0 & \text{if } v \le -\frac{1}{2} \end{cases} \tag{2.4} \]
where the amplification factor for the linear region is unity. It has two saturation
output levels corresponding to upper and lower bounds (e.g. 1 and 0 in (2.4)) and
provides a linear response between them.
Third, the sigmoid function is a smooth version of the piecewise linear function
in (2.4), and produces an S shaped graph. Moreover, it is most commonly adopted
in the construction of ANNs and is mathematically defined by
\[ \varphi(v) = \frac{1}{1 + e^{-av}} \tag{2.5} \]
where a is the slope parameter. Adjusting the parameter a allows the sigmoid func-
tion to generate different slopes as shown in Figure 2.3(c).
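As a minimal sketch of the neuron model in (2.1) and (2.2) together with the three activation functions, the following Python code may be helpful. The input and weight values are hypothetical and chosen only for illustration. For the piecewise linear function, the sketch assumes that the linear segment is v + 1/2, so that the unity-gain response meets the saturation levels 0 and 1 continuously; this offset is our assumption and is not written in (2.4).

```python
import math


def step(v):
    """Step activation of Eq. (2.3): 1 if v >= 0, otherwise 0."""
    return 1.0 if v >= 0 else 0.0


def piecewise_linear(v):
    """Piecewise linear activation with unity gain and saturation at 0 and 1.

    Assumption: the linear segment is v + 0.5 so that it joins the two
    saturation levels continuously at v = -0.5 and v = +0.5.
    """
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5


def sigmoid(v, a=1.0):
    """Sigmoid activation of Eq. (2.5) with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))


def neuron_output(x, w, phi):
    """Eqs. (2.1)-(2.2): weighted sum v_k followed by the activation phi."""
    v = sum(wj * xj for wj, xj in zip(w, x))
    return phi(v)


# Hypothetical input signals x_j and synaptic weights w_jk.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]

for phi in (step, piecewise_linear, sigmoid):
    print(phi.__name__, neuron_output(x, w, phi))
```

A larger slope parameter a makes the sigmoid approach the step function, which is consistent with the family of curves in Figure 2.3(c).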
Figure 2.4: Feedforward artificial neural network architecture. [Diagram: an input layer, a hidden layer and an output layer.]
The artificial neurons are combined to form a neural network in many different
ways. Most ANN architectures exhibit a layered structure. Figure 2.4 illustrates
a typical feedforward ANN architecture. It has three layers of input, hidden
and output neurons, where a set of artificial neurons constitutes a layer. Typically,
a standard L-layer ANN consists of an input layer, L − 2 hidden layers and an output
layer, which are connected successively. It is worth noting that the hidden
layer can be omitted in practice. The network can be connected either fully or partially.
The communication proceeds layer by layer from the input to the output
layer through the hidden ones. The neuron states of the output layer indicate the
computation result of the network. The neurons in the input layer receive external
input signals in the form of an activation pattern and project them onto the next
layer (e.g. hidden layer in Figure 2.4). The hidden layer plays the role of mediating
between the external input and the network output. The addition of more hidden
layers allows the network to perform high-order computations, which is particularly
valuable when the size of the input layer is large [20]. The neurons in the output layer
produce an overall response of the network under the external inputs. In contrast
to the feedforward architecture, the recurrent network includes at least one feedback
loop, which affects the learning capability of the network and its performance. This
network behaves like sequential logic, and thus previous experience may have
an impact on the activation of the neurons. The recurrent network is a dynamic
system while the feedforward is a static one. When a new input pattern is given, the
neuron outputs are computed, then the inputs to each neuron are modified due to
the feedback loops, which makes the network enter a new state. Hence, this network
topology can be used as associative memory.
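The layer-by-layer communication of the feedforward architecture can be sketched as follows. This is an illustrative example only: the 2-2-1 network, its weights and its bias terms are hypothetical, and the sigmoid is used as the activation function of every neuron by assumption.

```python
import math


def sigmoid(v, a=1.0):
    """Sigmoid activation with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))


def layer_forward(x, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[k] holds the synaptic weights of neuron k in this layer and
    biases[k] is a hypothetical bias term added to its weighted sum.
    """
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def feedforward(x, layers):
    """Propagate the input pattern from the input layer to the output layer."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
y = feedforward([1.0, 1.0], layers)
print(y)
```

Because no neuron output is fed back, the result depends only on the current input pattern, which is what makes the feedforward network a static system.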
Learning allows us to either increase knowledge or enhance understanding, and
is achieved by studying, experiencing or receiving instructions. For ANNs, learning
refers to a process of adjusting synaptic weights so that the network is able to perform a
specific task efficiently. Many learning algorithms have been presented to appropriately
adjust the synaptic weight values of the neural network, and they are classified
into two main learning paradigms: 1) supervised and 2) unsupervised [30].
Supervised learning is a synaptic weight change process that incorporates the
global information. In short, this training process is performed with a teacher that
has knowledge or correct answers. The underlying principle of this learning algorithm
is the error correction rule that leverages the error signal (i.e. difference between the
actual output and the correct output) to adjust the synaptic weight values to
reduce the error gradually. In supervised learning, every training input is given to
the network with its desired output (i.e. correct answer). The synaptic weights of
the network are modified to produce the outputs as close as possible to the known
correct answers. Therefore, the neural network tries to emulate the teacher grad-
ually through the learning process. The backpropagation and perceptron learning
algorithms are well known supervised learning methods.
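To make the error correction rule concrete, the following sketch trains a single step-activation neuron with the perceptron learning rule. The training set (the logical AND function), the learning rate and the number of epochs are hypothetical choices made for illustration.

```python
def predict(w, b, x):
    """Step-activation neuron: 1 if the weighted sum plus bias is >= 0."""
    v = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if v >= 0 else 0


def train_perceptron(samples, lr=1, epochs=10):
    """Perceptron learning: move the weights along the error signal.

    samples is a list of (input vector, desired output) pairs, i.e. the
    teacher supplies the correct answer for every training input.
    """
    n_inputs = len(samples[0][0])
    w = [0] * n_inputs
    b = 0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(w, b, x)   # desired minus actual output
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b


# Hypothetical training set: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_SAMPLES)
print(w, b, [predict(w, b, x) for x, _ in AND_SAMPLES])
```

The weights change only when the actual output differs from the desired one, so the updates stop once the network reproduces the teacher's answers on the training set.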
In contrast, unsupervised learning does not have a teacher and utilizes only lo-
cal information during the learning process. Briefly, it does not require a correct
answer associated with each training input for the learning. This learning leverages
the properties or correlations of the training inputs, and tries to organize patterns
into categories from these correlations. The basic principle of unsupervised learn-
ing is that output units (i.e. neurons) compete among themselves for activation.
Therefore, it allows only one output neuron to be activated at any given time. This
phenomenon is referred to as winner-take-all (WTA), which is a common feature of
unsupervised learning. Generally, the input patterns form a vector and the network
maps the vectors into the synaptic weights. An example of the unsupervised learning
algorithms is competitive learning [30].
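A minimal sketch of one competitive learning step with a winner-take-all output layer is given below. It assumes one common form of the rule, in which only the winning neuron moves its weight vector towards the input pattern; the weights, the input vector and the learning rate are hypothetical.

```python
def winner(weights, x):
    """Index of the output neuron with the largest weighted input sum."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def competitive_step(weights, x, lr=0.5):
    """Winner-take-all update: only the winning neuron adapts its weights.

    Assumed rule: w_winner <- w_winner + lr * (x - w_winner).
    No desired output (teacher) is used.
    """
    k = winner(weights, x)
    weights[k] = [wi + lr * (xi - wi) for wi, xi in zip(weights[k], x)]
    return k


# Two hypothetical output neurons and one input pattern.
weights = [[1.0, 0.0], [0.0, 1.0]]
k = competitive_step(weights, [0.9, 0.1])
print(k, weights)
```

Repeating this step over many input patterns gradually moves each weight vector towards one group of similar inputs, which is how the patterns are organized into categories.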
2.1.3 Spiking Neural Networks
While the conventional ANNs described in Section 2.1.2 are a powerful computa-
tional tool to solve complex problems, they still suffer from fundamental limitations
in emulating a real biological neural system due to the lack of temporal informa-
tion of spikes. ANNs have become more powerful and biologically realistic. Spiking
neural networks (SNNs), referred to as the third generation of ANNs, have been
developed by considering the communication among neurons with precise timing in-
formation of the spikes. They more realistically resemble the biological brain than
the conventional ANNs [18].
Figure 2.5: Spiking neural network. [Diagram: an input spike train over time t enters the spiking neural network, which produces an output spike train.]
SNNs exploit both the presence and timing of individual spikes as the means
of communication among the spiking neurons while the conventional ANNs process
neural information with real-valued numbers. As in Figure 2.5, an SNN receives
the input spike train from the external environment, processes it, and produces the
output of the network in the form of another spike train. In an SNN, it is assumed
that the amplitude of spikes does not encode any information. Instead, information
is encoded in the timing of the spikes that forms a spike train. Therefore, input
vectors for an SNN have to be preprocessed to extract the input features that may
contain real-valued timing information [18]. Similarly, the output spike train has to
be decoded to interpret the result of the network. There are various coding schemes
for inputs and outputs for SNNs to interpret a spike train as real-valued numbers,
by using either the frequency of the spikes or the timing between the spikes.
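The two families of coding schemes can be illustrated with the following sketch. It assumes a simple rate code in which a value normalized to the range [0, 1] is mapped to a regular spike train; the function names, the maximum rate and the observation window are hypothetical.

```python
def rate_encode(value, max_rate, window):
    """Encode a value in [0, 1] as evenly spaced spike times (in seconds).

    Assumption: the spike frequency is value * max_rate over the window.
    """
    n = round(value * max_rate * window)
    return [(k + 0.5) * window / n for k in range(n)]


def rate_decode(spike_times, window):
    """Frequency-based decoding: number of spikes per unit time."""
    return len(spike_times) / window


def inter_spike_intervals(spike_times):
    """Timing-based decoding: time differences between successive spikes."""
    return [b - a for a, b in zip(spike_times, spike_times[1:])]


times = rate_encode(0.5, max_rate=100.0, window=0.1)
print(times)
print(rate_decode(times, window=0.1))
print(inter_spike_intervals(times))
```

The rate decoder discards the exact spike times, whereas the interval-based decoder keeps them, which reflects the distinction between frequency and timing codes made above.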
Figure 2.6: Spiking neuron behavior. [Diagram: three pre-synaptic neurons send their output spike trains to a post-synaptic neuron; the plot of the post-synaptic membrane potential, with axes g_total in siemens (S) and time (s), is shown against the firing threshold together with the post-synaptic output spike train.]
Spiking neurons are similar to the conventional artificial ones, but spiking neurons
utilize spikes as input and output while the traditional ones have real-valued counterparts.
When the spikes from the pre-synaptic neurons arrive at a post-synaptic
neuron, the membrane potential of the post-synaptic neuron changes. The membrane
potential represents the internal state of the spiking neuron that is induced in
the model to respond to pre-synaptic spikes. The membrane potential is affected by
the synaptic characteristics such as the strength of the synaptic connections. The
post-synaptic neuron fires when its potential reaches a specific threshold. The behavior
of spiking neurons is illustrated in Figure 2.6. The post-synaptic neuron connects
with three pre-synaptic neurons that transmit their output spikes as inputs to the
post-synaptic neuron. The pre-synaptic neurons produce spike trains that consist
of a sequence of three, two and one spikes, respectively, and these spikes occur
at different times. Hence, the post-synaptic neuron receives six input spikes
in total. The membrane potential of the post-synaptic neuron increases whenever
a spike is received, as shown in Figure 2.6. It is important to note that the potential
can either increase or decrease according to the type of neurons. In other words,
inhibitory pre-synaptic neurons depress the membrane potential of the post-synaptic
neuron whereas excitatory ones potentiate. All the pre-synaptic neurons in Figure
2.6 are assumed to be excitatory. The post-synaptic neuron temporally integrates
the incoming spike trains to compute the internal state of the neuron (i.e. membrane
potential) over time. The post-synaptic neuron generates a spike when the potential
exceeds the threshold (e.g. two output spikes in Figure 2.6). The output spike train
of the post-synaptic neuron can be either transmitted to other spiking neurons in the
SNN or read out to the external environment. Similar neuron behavior can be modeled
in many different ways by exploiting the existing spiking neuron models, such as
the Hodgkin-Huxley and Leaky Integrate-and-Fire models [54].
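The temporal integration described above can be sketched with a discrete-time leaky integrate-and-fire neuron. The example is only illustrative: the spike trains, synaptic weights, leak factor and threshold are hypothetical and are not taken from Figure 2.6, although the three excitatory inputs carry three, two and one spikes as in the description.

```python
def simulate_lif(spike_trains, weights, threshold=1.0, leak=0.5, v_reset=0.0):
    """Discrete-time leaky integrate-and-fire neuron.

    spike_trains[i][t] is 1 if pre-synaptic neuron i fires at step t.
    Positive weights model excitatory inputs, negative ones inhibitory.
    Assumption: the potential decays by the factor `leak` at every step.
    """
    n_steps = len(spike_trains[0])
    v = v_reset
    out_spikes, potentials = [], []
    for t in range(n_steps):
        # Leak, then integrate all spikes arriving at this time step.
        v = leak * v + sum(w * s[t] for w, s in zip(weights, spike_trains))
        if v >= threshold:          # fire when the threshold is reached
            out_spikes.append(1)
            v = v_reset
        else:
            out_spikes.append(0)
        potentials.append(v)
    return out_spikes, potentials


# Three excitatory pre-synaptic neurons with 3, 2 and 1 spikes.
trains = [
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]
out, pot = simulate_lif(trains, weights=[0.5, 0.5, 0.5])
print(out)   # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

With these values the six input spikes produce two output spikes, at the steps where coincident or closely spaced inputs push the potential to the threshold.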
Similar to the traditional ANNs, SNNs also learn through synaptic plasticity
which refers to an adaptation process that updates the strength of the synaptic con-
nections among the neurons over time, in response to their increased or decreased
activities. Neuroscientific research revealed that the change in the synaptic strength
depends on the timings of pre- and post-synaptic spikes [42, 33]. This dependency
was experimentally characterized in detail by Bi and Poo [3] and named spike tim-
ing dependent plasticity (STDP) [62]. STDP is most commonly utilized in model-
ing of circuit-level plasticity and learning [16]. Furthermore, the STDP rules have
been recognized in a wide variety of tasks including associative memory and pattern
recognition. STDP is basically a temporally asymmetric form of synaptic plasticity induced by temporal
correlations between the spike firing events of pre- and post-synaptic neurons.
Figure 2.7: Spike timing dependent plasticity. [Diagram: a pre-synaptic and a post-synaptic neuron connected by a synapse of weight w, with firing times t_pre and t_post; the plot shows the weight change Δw versus t_post − t_pre.]
Namely, the strength change of synaptic connection is a function of the spike time
difference between pre- and post-synaptic firing events and the difference determines
the synaptic weight change as illustrated in Figure 2.7. As one example, to achieve
STDP-based learning, the time difference ∆t = t_post − t_pre between the firing times of
the pre- and post-synaptic neurons needs to be calculated. Then, the synaptic
weight update is done by adding the weight change ∆w obtained from the STDP
curve to the synaptic connection strength w between the pre- and post-synaptic
neurons. In this case, the STDP learning is mathematically described by the following
equations
\[ \Delta w = W(t_{post} - t_{pre}) \tag{2.6} \]
\[ w = w + \Delta w \tag{2.7} \]
where tpre and tpost are the firing times of the pre- and post-synaptic neurons, re-
spectively, W (·) is the STDP learning function and w is the synaptic weight between
the neurons. Since the STDP function W (·) affects the learning performance of the
SNN, it should be carefully designed according to the targeted applications.
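As an illustration of (2.6) and (2.7), the sketch below assumes an exponential STDP window, which is one commonly used choice for W(·) and is not necessarily the function adopted in this work. The amplitudes and time constants are hypothetical parameters.

```python
import math


def stdp_window(dt, a_plus=0.1, a_minus=0.05, tau_plus=20.0, tau_minus=20.0):
    """Assumed exponential STDP learning function W(dt), dt = t_post - t_pre.

    dt > 0 (pre fires before post) potentiates the synapse,
    dt < 0 (post fires before pre) depresses it.
    """
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def stdp_update(w, t_pre, t_post):
    """Eqs. (2.6)-(2.7): w <- w + W(t_post - t_pre)."""
    return w + stdp_window(t_post - t_pre)


w = 0.5
print(stdp_update(w, t_pre=0.0, t_post=10.0))   # potentiation
print(stdp_update(w, t_pre=10.0, t_post=0.0))   # depression
```

The asymmetry between the two branches of the window, and how quickly each decays with the time difference, are the design parameters that would be tuned to the targeted application.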
The rest of this dissertation focuses only on SNNs, which are our primary targeted
computational model.
2.1.4 Silicon Neuron Circuits
Silicon neurons are the fundamental building blocks in neuromorphic hardware
design [1]. They emulate the electrophysiological behavior of real neurons on a
silicon chip rather than on a general purpose computer as software. Many design
approaches to implement silicon neurons with analog and digital circuits have been
presented [29]. Moreover, several different devices, ranging from conventional CMOS
devices to recently developed nano-electro devices such as memristors, are exploited
to realize silicon neurons. Among the neuron models presented so far, the LIF
neuron model is widely adopted to implement silicon neurons due to its simplicity in
hardware implementation. In contrast, although more detailed ordinary differential equation
(ODE)-based models such as the Hodgkin-Huxley model can reproduce the behavior
of biological neurons more closely, they are less hardware friendly and less appropriate
for large scale integration.
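Figure 2.8: Analog silicon neuron: (a) schematic and (b) timing diagram [29]. [Panel (a): circuit with input current Iin, membrane capacitor Cmem, leak bias Vlk, threshold Vthr, membrane voltage Vmem, currents INA, IK, Iamp, Ilp, Ikup and Ikdn, capacitor CK and transistors M2 to M7. Panel (b): membrane voltage waveform annotated with integration, up-swing, down-swing, spike width and refractory period.]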
Traditionally, silicon neurons have been implemented with analog circuits, which
utilize the I-V characteristics of the transistors to mimic the biological neurons [48,
65, 69, 11]. Figure 2.8 depicts an analog LIF silicon neuron [29]. The capacitor Cmem
is used to keep the membrane potential and the leakage current of the membrane is
controlled by the gate voltage Vlk of M1 as in Figure 2.8(a). The switches M2 and M3
play the role of charging and discharging the membrane capacitor Cmem, respectively.
The circuit utilizes an analog comparator to compare the threshold Vthr with the
membrane potential Vmem. The emulated membrane voltage over time is illustrated
in Figure 2.8(b). If there is no input Iin applied, the membrane voltage Vmem is drawn
to its resting potential, which is 0 V in this configuration, by the leakage current.
If an excitatory input by a positive Iin is injected to the neuron circuit, then the
membrane capacitor Cmem is charged while the inhibitory input by a negative Iin
discharges Cmem. When the excitatory current is larger than the leakage current,
the membrane potential Vmem increases. The potential Vmem is compared with the
threshold voltage Vthr by the comparator which produces a spike when Vmem exceeds
Vthr. The spike turns the switch M2 on and, as a result, Vmem is reset to the resting
potential by Ik. The analog circuit based silicon neurons may have a simple structure
with a few transistors and consume low power. Unfortunately, they are intrinsically
sensitive to process, voltage and temperature (PVT) variations, and thus careful design
is essential to make the circuits robust. Analog circuits are also difficult to scale
with technology, to reconfigure, and to interface with software systems. Importantly, the
use of area-consuming capacitors to maintain a considerable number of synaptic
weights and membrane potentials hinders large-scale integration of spiking neurons
[28, 59].
Figure 2.9: Digital silicon neuron [6]. [Diagram: an 18-bit adder and 18-bit accumulator fed by the 6-bit kernel weight wij, a forgetting block controlled by Sel_forgetting, a multiplexer controlled by Sel_lim with an XOR comparator, Sign, Enable and Reset signals, Pulse + and Pulse − outputs, and an AER interface with Rqst and Ack signals.]
A digital implementation of silicon neurons is illustrated in Figure 2.9 [6]. The
neuron circuit mainly consists of a digital adder, a digital comparing circuit and
some control blocks. The accumulator (i.e. register) stores an integrated digital
value represented by a 2’s complement signed number and its output corresponds to
a membrane potential. When the event input occurs, Enable signal is activated and
the accumulator is updated with the corresponding kernel weight wij (i.e. synaptic
weight value) through the adder. If the accumulator output reaches a programmed
threshold, the neuron generates an event pulse (i.e. spike). The threshold value is determined
by a 3-bit parameter Sel_lim that selects one of the 18 accumulator output
bits via the multiplexer. While selecting a lower bit leads to a small threshold value,
a higher one results in a large threshold. For instance, selecting the 5th and 8th
bits sets the threshold value to 32 ($2^5$) and 256 ($2^8$), respectively. The selected bit is
compared with the MSB of the accumulator continuously and the comparator (i.e.
XOR gate) creates a spike when the MSB and the selected bit are different from each
other. Lower thresholds make the neuron fire more frequently, and thus allow fast
processing. The forgetting block mimics the leaking behavior of neurons. The periodic
forgetting pulse Sel_forgetting is applied to the forgetting block. If the Sel_forgetting
signal is asserted, the block outputs a fixed leaky value (e.g. −1) so the accumulator
output decreases, causing the membrane potential to decrease periodically. The dig-
ital neuron circuits are advantageous in terms of robustness against PVT variations,
better reconfigurability and ease of implementation. However, they may consume large
dynamic power due to digital switching activities.
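A behavioral sketch of the accumulator and the bit-select threshold described above is given below. It is not the circuit of [6]: the class and method names are hypothetical, the threshold bit index is passed directly instead of being decoded from Sel_lim, and the reset after firing as well as the sign returned for the pulse are assumptions.

```python
WIDTH = 18                       # accumulator width in bits
MASK = (1 << WIDTH) - 1


class DigitalNeuron:
    """Behavioral model of an accumulator neuron with a bit-select threshold."""

    def __init__(self, threshold_bit):
        self.threshold_bit = threshold_bit   # e.g. 5 gives a threshold of 2**5
        self.acc = 0                         # raw two's complement register

    def value(self):
        """Accumulator contents interpreted as a signed number."""
        if self.acc >> (WIDTH - 1):
            return self.acc - (1 << WIDTH)
        return self.acc

    def _compare(self):
        """XOR of the MSB and the selected bit decides whether to fire."""
        msb = (self.acc >> (WIDTH - 1)) & 1
        selected = (self.acc >> self.threshold_bit) & 1
        if msb ^ selected:
            self.acc = 0                     # assumed reset after firing
            return -1 if msb else 1          # assumed sign of the pulse
        return 0

    def event(self, weight):
        """Input event: add the kernel weight to the accumulator."""
        self.acc = (self.acc + weight) & MASK
        return self._compare()

    def forget(self, leak=-1):
        """Forgetting pulse: add a fixed leaky value to the accumulator."""
        self.acc = (self.acc + leak) & MASK
        return self._compare()


n = DigitalNeuron(threshold_bit=5)
print([n.event(8) for _ in range(4)])        # [0, 0, 0, 1]
```

Because only one accumulator bit is inspected, the sketch fires when the selected bit differs from the sign bit rather than performing a full magnitude comparison, which is what keeps the comparing circuit down to a single XOR gate.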
2.1.5 Neuromorphic VLSI Systems
At the system level, conventional computers have become very powerful,
but they still require a huge amount of capacity, power and computation time to
mimic the tasks that humans perform. In contrast, VLSI-based neuromorphic systems
can provide effective ways of emulating the functions of a biological brain in silicon while
significantly saving computational power and time. Neuromorphic chips are usually
composed of synapse circuits and silicon neurons that store synaptic weight values
and emulate neuron dynamics, respectively. In addition, they can also include ded-
icated on-chip learning circuits according to the application. Recently, two digital
reconfigurable neuromorphic chips integrating a large number of spiking neurons
and their building blocks have been demonstrated in [58, 46, 27, 2]. These two de-
signs support up to 256 programmable digital silicon neurons and 1024×256 binary
synapses by means of an SRAM crossbar array.
Figure 2.10: Block diagram of digital neurosynaptic core [46]. [Diagram: input spikes are decoded onto axons A1 to AK with axon types G1 to GK, which drive the crossbar synapses connected to neurons N1 to NM; a sync signal is applied to the neurons, and a select-and-encode block sends out the output spikes.]
Figure 2.10 shows the block diagram of the digital neurosynaptic core in [46]. It
consists of 256 digital LIF neurons with an output encoder, 1024 individually addressable
axons, which can be either excitatory or inhibitory, with an input decoder
and 1024×256 programmable binary synapses implemented with an SRAM crossbar
array. This design does not include an on-chip learning mechanism and necessitates
loading of synaptic weights into the crossbar after off-chip learning. It performs
neural information processing in an event driven manner to greatly reduce active power
dissipation. Specifically, it adopts an asynchronous design technique where all
communication among the blocks requires a request-acknowledge handshake without
a global clock signal. The detailed operation in each time step t is divided into two
phases. In the first phase, a set of input spike-events A is sent to the neurosynaptic
core at a time, and these events are sequentially decoded to the appropriate
axon block. The corresponding axon activates the SRAM’s row, and reads out all
of its connections and type G. If there are synaptic connections W, which are represented
as “1”, the inputs are sent to the corresponding neuron circuits. Then,
they update the membrane potentials V appropriately. After a sequence of neuron
updates, the axon block deactivates the SRAM and waits for new inputs. In
the second phase, a synchronization event that occurs every millisecond is
sent to all the digital neurons. Each neuron checks whether its membrane potential
reaches a certain threshold. If so, it produces a spike and resets the potential to zero.
These spikes of the neurons are encoded and sequentially sent off the chip through
the encoder. After that, the leak parameter is applied to the neurons. Throughout
the two phases of neural processing, the neurosynaptic core implements the neuron
dynamics described by the following mathematical expression.
\[ V_i(t+1) = V_i(t) + L_i + \sum_{j=1}^{K} \left[ A_j(t) \times W_{ji} \times S_i^{G_j} \right] \tag{2.8} \]
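where $V_i$ is the membrane potential of neuron i, $L_i$ is its leak parameter, K is the number of axons, $A_j$ is the activity bit of axon j,
$W_{ji}$ is the synaptic value between axon j and neuron i, $G_j$ is the type of axon j and $S_i^{G_j}$ is the synaptic strength that neuron i assigns to axon type $G_j$.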
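The two-phase operation and the dynamics in (2.8) can be sketched in software as follows. This is a behavioral illustration rather than the implementation in [46]: the function name, the tiny core with three axons and two neurons, and all parameter values are hypothetical. The sketch follows the order described above, i.e. synaptic integration, threshold check with reset to zero, and then the leak.

```python
def core_time_step(V, A, W, G, S, L, threshold):
    """One time step of a neurosynaptic core following Eq. (2.8).

    V[i]      membrane potential of neuron i
    A[j]      activity bit of axon j
    W[j][i]   binary synapse between axon j and neuron i
    G[j]      type of axon j (index into S[i])
    S[i][g]   strength that neuron i assigns to axon type g
    L[i]      leak parameter of neuron i
    """
    K = len(A)
    V_next, spikes = [], []
    for i in range(len(V)):
        # Phase 1: integrate the events of all active, connected axons.
        v = V[i] + sum(A[j] * W[j][i] * S[i][G[j]] for j in range(K))
        # Phase 2: threshold check on the synchronization event.
        if v >= threshold[i]:
            spikes.append(1)
            v = 0
        else:
            spikes.append(0)
        v += L[i]                 # leak applied after the spike check
        V_next.append(v)
    return V_next, spikes


# Hypothetical core: 3 axons (type 0 excitatory, type 1 inhibitory), 2 neurons.
W = [[1, 1], [1, 0], [0, 1]]
G = [0, 1, 1]
S = [[2, -1], [3, -2]]
L = [-1, -1]
thr = [4, 4]

V, spikes = core_time_step([0, 0], [1, 0, 1], W, G, S, L, thr)
print(V, spikes)                  # [1, 0] [0, 0]
```

For a neuron that does not fire, one call reproduces (2.8) exactly; for a firing neuron, the potential after the step equals the leak parameter because the leak is applied after the reset.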
Figure 2.11: Block diagram of (a) neuromorphic chip and (b) silicon neuron [58]. [Panel (a): a 256×256 synapse array with BL/WL drivers and sense amplifiers, a priority encoder, 256 neurons N1 to N256, a global finite state machine, a clock, and input and output pads for spikes, neuron/synapse configuration and neuron/synapse state. Panel (b): an integrate-and-fire block with a 16-bit adder, a 4:1 mux selecting among s+, s−, sext and λ, and a threshold θ, and a learning block with pre-synaptic and post-synaptic counters, a 2:1 mux and an LFSR.]
The digital neuromorphic chip incorporating on-chip learning capability in [58]
is illustrated in Figure 2.11. This chip integrates 256 digital spiking neurons with
256×256 binary synapses to implement a fully recurrent network. It operates in a
synchronous manner with a global hardware clock signal and each biological time step
consumes many hardware clock cycles. In each time step, the digital spiking neurons
calculate their membrane potential according to the implemented dynamics and
produce spikes when the potentials reach the given firing threshold. The spikes created
by the firing neurons lead to synaptic integrations in all target neurons (i.e.
post-synaptic neurons) and synapse weight updates according to a certain learning
rule. While the input spikes to the system represent external stimuli such as input
patterns, the generated spikes (i.e. output spikes) indicate the output activities (e.g.
recognition of a given pattern). The N×N crossbar architecture is suitable to represent
a neural network of N neurons and all $N^2$ possible synaptic connections among
them. Correspondingly, the on-chip storage used to keep the binary synapse values
is implemented by a 256×256 array of transposable SRAM cells. While the conventional
memory array is accessible only in the row direction, the proposed transposable
SRAM array is accessible in both the row and column directions for pre-synaptic and
post-synaptic weight updates, respectively, leading to a significant speedup of the
update process. Moreover, an entire row and column of the crossbar can be accessed
simultaneously. Note that each row and column corresponds to a neuron’s axon and
dendrite, respectively, in the SRAM array. Based on the adopted single-bit memory
cells, the binary weights of the synapses are probabilistically set to “1” or “0” according
to the implemented learning rule. Each neuron in Figure 2.11 implements
both locally reconfigurable LIF functions and learning rules, and the following LIF
neuron dynamics is computed in each time step
\[ V[t] = V[t-1] + s^{+} n^{+}[t] - s^{-} n^{-}[t] - \lambda \tag{2.9} \]
where V is the membrane potential, n+ and n− are the numbers of excitatory and
inhibitory inputs received through “on” synapses, respectively, s+ and s− are the
input strength parameters and λ is the leak parameter. The membrane potential and
parameters are expressed with 8-bit digital values. Additionally, if the membrane
potential V exceeds a given threshold θ, which is a reconfigurable variable, a spike is
generated and V is reset to the resting potential. On-chip learning is implemented
in each neuron cell, which thus shares the learning circuit across axonal rows and
dendritic columns of the synapses. The pre- and post-synaptic counters work with a
linear feedback shift register (LFSR) to perform probabilistic synapse weight update.
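The neuron dynamics in (2.9) can be sketched as follows. The function names and the numerical values are hypothetical, the 8-bit number format is not modeled, and the probabilistic synapse update is left out because its details are not given here.

```python
def count_inputs(spikes, is_excitatory, synapse_column):
    """Count excitatory and inhibitory inputs received through "on" synapses.

    spikes[j]          1 if pre-synaptic neuron j fired in this time step
    is_excitatory[j]   True if neuron j is excitatory, False if inhibitory
    synapse_column[j]  binary synapse from neuron j to this neuron
    """
    n_plus = n_minus = 0
    for fired, excitatory, on in zip(spikes, is_excitatory, synapse_column):
        if fired and on:
            if excitatory:
                n_plus += 1
            else:
                n_minus += 1
    return n_plus, n_minus


def lif_step(v, n_plus, n_minus, s_plus, s_minus, leak, theta, v_rest=0):
    """Eq. (2.9) followed by the threshold check; returns (V, spike)."""
    v = v + s_plus * n_plus - s_minus * n_minus - leak
    if v > theta:                 # spike and reset to the resting potential
        return v_rest, 1
    return v, 0


n_plus, n_minus = count_inputs(
    spikes=[1, 1, 0, 1],
    is_excitatory=[True, True, True, False],
    synapse_column=[1, 0, 1, 1],
)
print(n_plus, n_minus)            # 1 1
print(lif_step(10, n_plus, n_minus, s_plus=4, s_minus=2, leak=1, theta=12))
```

Only spikes that arrive through synapses storing “1” contribute to the counts, so the binary crossbar contents gate which pre-synaptic neurons can influence a given neuron.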
The two neuromorphic designs both employ an SRAM array to store synaptic
weight values, which occupies a significant portion of the entire chip area. Furthermore, the
learning performance may be degraded due to the adopted binary synapses that are
updated by a probabilistic scheme [58]. The lack of an on-chip learning mechanism
may limit the design of applications [46].
2.2 Approximate Computing
Aggressive CMOS technology scaling allows modern VLSI systems to integrate
many high-performance functional modules, such as multi-media and communica-
tion processors. Meanwhile, today’s circuit designers are facing grand challenges
in managing overall chip power and energy consumption. To address these growing
energy efficiency challenges, the design paradigm of approximate computing
has emerged as one promising solution and has drawn significant research interest
[7, 67, 21, 60, 8, 9]. Approximate computing can provide great computational and
energy efficiencies by relaxing processing precision while maintaining an acceptable
overall processing quality for many applications that involve signal processing of
multimedia data (e.g. audio, video and image), machine learning and speech recog-
nition. Fortunately, these classes of computation do not require perfect accuracy and
approximate results with controlled accuracy may be sufficient. For example, while
approximation errors in image processing may change the numerical values of the
overall output, the users may not be very sensitive to a certain amount of error and
may still recognize the image. Similarly, there is a certain level of error resilience in
tasks performed by the human brain. The human brain is often able to fill in the missing
information and filter out the noisy or redundant information from the received
inputs by its natural error compensation mechanisms. In short, the human brain
has a certain degree of built-in error or fault resilience. Thus, it makes good sense
in system design to reduce hardware realization costs such as power, energy
and area by applying approximate computing at levels ranging from the algorithm to the circuit
[21, 60, 8, 9].
Addition is a fundamental operation in many processing applications
and the adder is therefore an essential component for approximate computing.
Furthermore, other arithmetic units, such as comparators and multipliers, can be
implemented based on the adder. Since approximate adder design is one primary
contribution of this dissertation, we briefly review the existing approximate adder
design techniques. Lu proposes an approximate adder [39] that leverages a limited
number of previous (less significant) input bits for carry speculation to increase the
overall speed by cutting down the long carry propagation chain. The critical draw-
back of this approach is the use of a considerable number of carry generators, which
gives rise to large area and high power dissipation. The so-called ETAI [76] and LOA
[40] approximate adders are split into an accurate part for higher order bits and an
inaccurate part, which utilizes a modified XOR (ETAI) and OR function (LOA) to
approximately compute the remaining lower output bits. Therefore, the approximate
errors are concentrated on the lower bits. A few transistors are eliminated from the
traditional mirror adder to reduce power and area at the expense of accuracy degra-
dation in [19]. These two approaches are limited by high error rates. The ETAIII
[75] improves the error rate of ETAI by introducing a dynamic dividing strategy
of the accurate and inaccurate parts by input patterns but still yields a high error
rate. Segment-based approximate adders named ETAII and VLCSA-1 are presented
in [74] and [14], respectively. The carry for each k-bit segment
is predicted from the lower k-bit inputs to reduce the delay of carry propagation.
Similarly, the ACA [32] adopts a number of 2k-bit sub-adders and leverages only the k
most significant bit (MSB) outputs of the sub-adders to achieve approximate additions.
Unfortunately, these adders have high error rates for the carry generations,
particularly for 2’s complement signed additions of small numbers. In addition, the
use of carry selection in VLCSA-1 and middle sub-adders in ACA results in power
and area overheads. The lack of error magnitude reduction in ETAII
degrades the quality of addition. In [47], the approximation errors for less significant
bits are reduced by conditional bounding logic with dithering, which causes area and
power overheads.
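To make the split into an accurate and an inaccurate part concrete, the following is a minimal Python sketch of an LOA-style adder. It is not the implementation from [40]: the function names, the default split of 4 lower bits and the carry-in rule (AND of the most significant bits of the two lower parts) are assumptions made for illustration. The helper `error_rate` exhaustively counts how often the approximate sum differs from the exact one.

```python
def loa_add(a: int, b: int, lower: int = 4) -> int:
    """Approximate unsigned addition in the style of a lower-part OR adder.

    `lower` (>= 1) is the assumed number of bits in the inaccurate part.
    """
    low_mask = (1 << lower) - 1
    # Inaccurate part: a bitwise OR replaces true addition of the low bits.
    low = (a | b) & low_mask
    # Assumed carry-in for the accurate part: AND of the top low-part bits.
    carry_in = (a >> (lower - 1)) & (b >> (lower - 1)) & 1
    # Accurate part: exact addition of the high bits plus the carry-in.
    high = (a >> lower) + (b >> lower) + carry_in
    return (high << lower) | low


def error_rate(width: int = 8, lower: int = 4) -> float:
    """Fraction of all unsigned input pairs whose approximate sum is wrong."""
    n = 1 << width
    wrong = sum(loa_add(a, b, lower) != a + b
                for a in range(n) for b in range(n))
    return wrong / (n * n)
```

For instance, `loa_add(3, 1)` returns 3 rather than 4 because the colliding low bits are ORed instead of added, which illustrates why the errors of such adders are concentrated in the lower bits.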
2.3 Emerging Memory Technologies
To date, various new memory technologies have emerged to replace the tradi-
tional memories such as SRAM and DRAM. Among these new memory technologies
researched so far, spin-transfer torque random access memory (STT-RAM), phase-change
memory (PCRAM) and memristor based resistive random-access memory
(ReRAM) are considered the most promising candidates for the future. STT-RAMs
exploit a magnetic tunnel junction to store information and the difference in magnetic
directions is used to represent a bit of information [12]. PCRAMs leverage chalco-
genide materials for memory storage which can be switched between a crystalline
phase (SET state) and an amorphous phase (RESET state) by heat [12]. ReRAMs
are typically implemented with a memristor, also known as a memory resistor, whose
existence was theoretically predicted by Chua in 1971 as the fourth fundamental
passive circuit element [10]. More recently, TiO2 thin-film based memristors have
been demonstrated at the nanoscale [63]. The memristive nanodevice has gained
increasing research interest and has become a promising solution for low-cost on-chip
storage thanks to its non-volatile nature, excellent scalability and high density of
10 Gb/cm^2 or more [23, 72]. A number of multibit hybrid CMOS/memristor memory
architectures targeting high integration density and low power dissipation have
been proposed to substitute the conventional SRAM and flash memories that are
confronted with the fundamental technology scaling limits [45, 41]. In addition, sev-
eral recent studies have suggested leveraging memristive nanodevices for building
synaptic arrays [31, 61, 55, 25].
We briefly review the memristor device model which is used for implementing
the on-chip synaptic weight storage in our neuromorphic processor design. A TiO2
thin-film based memristor is a two-terminal electrical device consisting of a titanium oxide
film sandwiched between two metal contacts. Conceptually, there are two layers in the
film, a doped one and an undoped one, as shown in Figure 2.12. The undoped layer is
a highly resistive pure TiO2 region (TiO2 layer) while the doped one is filled with
oxygen vacancies that make it highly conductive (TiO2−x layer) [23].
[Figure 2.12: Memristive device structure (left) and variable resistance model (right) [63]. The device of total length D has a doped region of length w and an undoped region; the model comprises the resistances RON·w/D and ROFF·(1−w/D).]
The memristor device model can be mathematically expressed by
$$ R(w) = \frac{w}{D} R_{ON} + \left(1 - \frac{w}{D}\right) R_{OFF}, \quad 0 \le w \le D \tag{2.10} $$
In (2.10), RON and ROFF are the fully doped (lowest) and fully undoped (highest)
resistances of the memristor, respectively, and w is the length of the doped region
and thus physically bounded by the range between 0 and the total device length D.
Moreover, w represents the internal state of the memristor. The memristor state
is controlled by the incoming flux across the device for flux-controlled memristive
devices where the voltage source is utilized as the input. The memristance is deter-
mined by the state w that can be mathematically described by
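$$ \frac{w}{D} = \begin{cases} 1 & \text{if } \varphi \ge \Phi_{UP} \\ \dfrac{\beta}{\beta-1}\left[1 - \sqrt{\left(\dfrac{R(w_0)}{R_{OFF}}\right)^2 - \dfrac{\varphi}{\Phi_D}}\,\right] & \text{if } \Phi_{LOW} < \varphi < \Phi_{UP} \\ 0 & \text{if } \varphi \le \Phi_{LOW} \end{cases} \tag{2.11} $$
$$ \Phi_{UP} = \frac{\Phi_D}{R_{OFF}^2}\left(R(w_0)^2 - R_{ON}^2\right) \tag{2.12} $$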
$$ \Phi_{LOW} = -\frac{\Phi_D}{R_{OFF}^2}\left(R_{OFF}^2 - R(w_0)^2\right) \tag{2.13} $$
where the ratio between the resistances of the on and off states is denoted by β (i.e.
ROFF = βRON), w0 is the initial state of the memristor, D is the total length of the
memristor and ϕ is the injected flux. In addition, ΦUP and ΦLOW are the upper and
lower limits of the effective flux injection, respectively, and
$$ \Phi_D = \frac{(\beta D)^2}{2 \mu_v (\beta - 1)} \tag{2.14} $$
where µv is the average ion mobility. Also, the memristor internal state w varies
dynamically with the external input. A recent experimental study has shown that
the conductance of the memristive device can be incrementally adjusted by altering
the pulse width of the constant input voltage [31]. In other words, a longer positive
pulse duration leads to a larger increase of memductance that is given by the inverse
of memristance. Suppose that the state of the memristor is moved from the initial
state w0 to a feasible state w by a square-wave voltage pulse with an amplitude of
VA. Then, the required pulse width TW can be derived to be [23]
$$ T_W = \frac{\Phi_D}{V_A R_{OFF}^2}\left[R(w_0)^2 - R(w)^2\right] \tag{2.15} $$
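The step from (2.11) to (2.15) can be spelled out as follows. This is a short derivation sketch rather than the derivation given in [23]; it assumes a constant-amplitude pulse, so that the injected flux is ϕ = VA·TW, and a target state that lies within the effective flux range of (2.11).

```latex
\begin{align*}
\frac{w}{D} &= \frac{\beta}{\beta-1}\left(1-\frac{R(w)}{R_{OFF}}\right)
  && \text{from (2.10) with } R_{OFF}=\beta R_{ON} \\
\frac{R(w)}{R_{OFF}} &= \sqrt{\left(\frac{R(w_0)}{R_{OFF}}\right)^2-\frac{\varphi}{\Phi_D}}
  && \text{by comparison with the middle case of (2.11)} \\
\varphi &= \frac{\Phi_D}{R_{OFF}^2}\left[R(w_0)^2-R(w)^2\right]
  && \text{squaring and solving for } \varphi \\
T_W &= \frac{\varphi}{V_A}
     = \frac{\Phi_D}{V_A R_{OFF}^2}\left[R(w_0)^2-R(w)^2\right]
  && \text{using } \varphi = V_A T_W
\end{align*}
```

Setting R(w0) = ROFF and R(w) = RON in the last line gives the pulse width for a full swing of the state from w = 0 to w = D.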
Furthermore, the pulse duration required to move the memristor state from w = 0
to w = D is the same as what is needed to change the state from w = D to w = 0,
and the pulse width TW for these moves is given by
$$ T_W = \left| \frac{\Phi_D}{V_A R_{OFF}^2}\left(R_{OFF}^2 - R_{ON}^2\right) \right| \tag{2.16} $$
Note that the input pulses for these two moves have opposite polarities.
Equation (2.15) also indicates that the programming time (i.e. pulse width)
needed to write a specific value to the memristor is a function of the current
and target resistances of the memristor.
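The device model of (2.10) to (2.15) can be exercised numerically. The sketch below is illustrative only: RON and ROFF are the values used later in this dissertation, whereas the device length `D`, the mobility `MU_V` and all function names are hypothetical choices. The signed return value of `pulse_width` is also an interpretation: a negative result is taken to mean that a pulse of the opposite polarity is needed.

```python
import math

R_ON = 10e3       # ohms, value used later in the dissertation
R_OFF = 500e3     # ohms, value used later in the dissertation
BETA = R_OFF / R_ON
D = 10e-9         # m, hypothetical device length
MU_V = 1e-14      # m^2/(V*s), hypothetical average ion mobility
PHI_D = (BETA * D) ** 2 / (2 * MU_V * (BETA - 1))    # (2.14)


def resistance(x: float) -> float:
    """(2.10) with the normalized state x = w/D in [0, 1]."""
    return x * R_ON + (1 - x) * R_OFF


def state_after_flux(x0: float, phi: float) -> float:
    """(2.11)-(2.13): normalized state after injecting flux phi from state x0."""
    r0 = resistance(x0)
    phi_up = PHI_D / R_OFF ** 2 * (r0 ** 2 - R_ON ** 2)       # (2.12)
    phi_low = -PHI_D / R_OFF ** 2 * (R_OFF ** 2 - r0 ** 2)    # (2.13)
    if phi >= phi_up:
        return 1.0
    if phi <= phi_low:
        return 0.0
    root = math.sqrt((r0 / R_OFF) ** 2 - phi / PHI_D)
    return BETA / (BETA - 1) * (1 - root)


def pulse_width(x0: float, x: float, v_a: float) -> float:
    """(2.15): pulse width that moves the state from x0 to x at amplitude v_a."""
    return PHI_D / (v_a * R_OFF ** 2) * (resistance(x0) ** 2 - resistance(x) ** 2)
```

As a consistency check, injecting the flux `1.2 * pulse_width(0.2, 0.6, 1.2)` into `state_after_flux(0.2, ...)` should return a state close to 0.6, since (2.15) is the inverse of the middle case of (2.11).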
2.4 Objective of the Dissertation
The objective of the dissertation is to realize energy efficient and error resilient
neuromorphic computing in VLSI. To achieve this goal, we first introduce a general
digital neuromorphic VLSI architecture. The block diagram of this digital
neuromorphic architecture for an N spiking neuron network is depicted in Figure 2.13.
[Figure 2.13: Energy efficient and error resilient neuromorphic computing in VLSI. The diagram shows an N×N synaptic crossbar (synapse array with N axons, N dendrites and synaptic weights), a learning array of N learning circuits, a neuron array of N neuron circuits, and a read/write interface with a controller. Annotations: memristor nanodevice (non-volatile nature, high integration density of 10 Gb/cm^2, excellent scalability with multibit synapses); approximate computing (fast and low-power operation, extremely low error rate and good energy efficiency, approximate adder and comparator).]
The design consists of three arrays of synapse, learning, and neuron circuits as well
as control and interface circuits for them. The N×N crossbar synapse array
represents a fully recurrent network topology and can store all possible N^2 synaptic
weights among the N neurons. The resolution of the synaptic weights is determined
according to the targeted applications [56], and can be either binary [58, 46] or multi-
bit [50]. To mimic the behavior of a biological neuron, each neuron circuit emulates
the neuron dynamics (e.g. according to the LIF or Hodgkin-Huxley models) and
generates spikes when it fires. The learning circuits cooperate with the respective
neurons and update the synaptic weights according to a learning rule such as STDP.
To realize energy efficient and error resilient neuromorphic VLSI systems (see
Figure 2.13), we adopt an emerging nanodevice, the memristor, to implement a
low-cost N×N synaptic crossbar array. While the conventional SRAM-based synaptic
array suffers from significant area overhead, in particular for multibit synapses,
memristor-based nonvolatile arrays provide excellent scalability and a high integration
density of 10 Gb/cm^2 or more. We systematically analyze the memristor
device characteristics in terms of access time and level partitioning to realize multibit
synapses and present a low-cost digital PWM scheme for programming the memris-
tor. Furthermore, we investigate memristor readout schemes for the crossbar array
and propose a column based analog-to-digital conversion technique to more efficiently
carry out the digital LIF neuron dynamics. The details of the design of the memristive
crossbar are presented in Section 3.
As the second key contribution, presented in Sections 4 and 5, we apply
approximate computing to the digital neuromorphic VLSI system to considerably
reduce energy dissipation, since approximate computing allows for fast computation
with low power consumption, leading to good energy efficiency with an acceptable
computation quality. To do this, a novel approximate arithmetic scheme, referred
to as parallel carry-skip, is proposed for adders and comparators in Section 4. By
cutting down the worst-case carry propagation chain, we reduce the critical path
delay and leverage the information from less significant input bits to speculate the
carry in a parallel manner, which allows for highly accurate carry prediction. Thus,
it makes it possible to either speed up addition and comparison operations or reduce
energy dissipation by lowering the supply voltage level. We adopt the proposed ap-
proximate arithmetic units for the digital LIF computations in the neuron circuit
and show the impacts of the approximation errors on the VLSI-based neurocomput-
ing applications. Moreover, we systematically analyze the energy efficiency of the
neuromorphic hardware adopting the approximate components with supply voltage
scaling in Section 5.
Finally, Section 6 summarizes our key contributions and discusses future
work.
3. RECONFIGURABLE DIGITAL NEUROMORPHIC PROCESSOR WITH
MEMRISTIVE CROSSBAR ARRAY
3.1 Digital Neuromorphic Processor Architecture
3.1.1 Overall Processor Architecture
Figure 3.1 depicts the overall block diagram of the proposed neuromorphic pro-
cessor architecture for an N spiking neuron network. It consists of a synapse unit
(SU), a learning unit (LU) with a global timer, a neuron unit (NU), a LIF arithmetic
unit (LAU) and a system controller. The SU employs the proposed N×N memristive
crossbar array since the crossbar structure effectively implements biological synapse
connections [58, 46]. It can represent a fully recurrent network topology and store
all N^2 possible synaptic weights among N neurons. A biological neuron has multiple
dendrites and a single axon, which receive input spikes from the pre-synaptic neurons
and transmit output spikes to the post-synaptic neurons, respectively. One axon
can connect with the dendrites of multiple post-synaptic neurons. In the crossbar
array, a row and a column correspond to an axon and a dendrite, respectively, of a
biological neuron. The connection between the (j)th row (axon) and (i)th column
(dendrite) is represented by the synaptic weight wji between the (j)th and (i)th
neurons. The memristor devices employed in the array keep not only a multibit
synapse value but also the network connectivity information. The proposed crossbar
is fully reconfigurable in the sense that the connectivity can be programmed with respect
to the topology of any N-neuron network. The detailed memristor cell utilization
of the crossbar array will be explained in Section 3.2. To enable parallel
synaptic weight updates, we design the array to be accessible in both row and
column directions, which greatly improves the update performance.
[Figure 3.1: Block diagram of the proposed digital neuromorphic processor architecture. Blocks shown: synapse unit (N×N memristor crossbar array with N axons, N dendrites and synaptic weights wji, row (axon) driver, column (dendrite) driver, read/write pulse generator with PWM #1 to #N, and memristor readout with a column (dendrite) ADC and a low-resolution ADC array); learning unit (learning FSM, global timer, LUTs for synaptic weight update, learning elements LE #1 to #N); neuron unit (neuron FSM, neuron elements NE #1 to #N with spike buffers); LIF arithmetic unit; system controller; spike I/O.]
Among the various neuron models, we adopt the LIF model for the silicon neurons to mimic the
biological counterparts. LIF models have been shown to be effective for a number of
learning applications and are suitable for digital implementation due to their moderate
hardware overhead, which includes a few arithmetic components such as an adder and
a comparator [29].
The NU, which consists of a finite state machine (FSM) and N neuron elements
(NEs), emulates the LIF neuron dynamics while interfacing with the column (dendrite)
ADC and the LAU. Our DNP has two different memristor readout circuits,
a low-resolution ADC array and a column ADC, which will be described
in Section 3.3. Each NE keeps the membrane potential and has spike buffers to store
both the external spike that is fed by the off-chip environment and the output spike
that is generated when the potential is greater than the given threshold voltage. The
NE can be made either excitatory or inhibitory, which potentiates or depresses the
membrane potential of the post-synaptic neurons, respectively.
The LU is responsible for performing on-chip learning. It contains N learning
elements (LEs) that cooperate with the respective NEs as well as an FSM to control
the overall synaptic weight update process. Each LE has a register to maintain
the corresponding neuron’s spike timing, which is used to calculate the spike time
difference between a pre-synaptic and a post-synaptic neuron. The LU updates the
synapse values in the crossbar based on the time differences to realize the STDP
learning mechanism. The STDP rule is programmable through the use of look-up
tables (LUTs), where the synaptic weight change is stored as a function of the timing
difference. These LUTs are shared by all N LEs, which significantly reduces the silicon
area, and our design allows the weights of all neurons to be updated by STDP
in parallel. The communication between the proposed neuromorphic
processor and the external environment is performed through input and output
spikes. External stimuli are applied in the form of input spikes, while results
are output in the form of the spikes generated by the output neurons. In a
character recognition system, for example, input letters are encoded into sequences
of input spikes that are applied to the input neurons and recognition (classification)
results are identified through spiking activities of output neurons.
3.1.2 Flow Control of the Neuromorphic Processor
The system controller manages the overall operations of the processor through
clock-based synchronous control, so that the proposed DNP operates in a synchronous
manner as shown in Figure 3.2. Each step corresponds to a biological time unit and
consumes many hardware clock cycles. It includes three processing stages: 1) spike
input/output (I/O), 2) neuron and 3) learning. These stages are executed in a
pipelined manner in that the spike I/O and learning stages can work simultaneously
because there are no data or control hazards between them. During the spike I/O
stage, the input spike buffers in NEs store the spikes from the external environment.
Meanwhile, the output spikes can be read off the chip to observe the output activities.
[Figure 3.2: Flow diagram of the proposed neuromorphic processor. Rows for the biological time steps t−1, t and t+1, each containing the spike I/O, neuron and learning stages, are plotted against hardware time.]
After all the input/output spikes have been received/transmitted, the neuron stage starts,
where the following LIF neuron dynamics are implemented for each neuron.
$$ V_i[t] = V_i[t-1] + K_{SYN} \sum_{j=1}^{M} w_{ji} S_j[t-1] + K_{EXT} E_i[t-1] - V_{LEAK} \tag{3.1} $$
where Vi is the membrane potential of the (i)th neuron, M is the number of pre-synaptic
neurons, KSYN is the synaptic weight parameter, wji is the synaptic weight between
the (j)th and (i)th neurons, Sj is the activity bit that indicates whether the (j)th
neuron fired, KEXT is the external input spike parameter, Ei is the activity bit for the
input spike of the (i)th neuron and VLEAK is the leaky potential. At each hardware
time step in the neuron stage, through the column driver, an NE activates the cor-
responding column word line to access all its pre-synaptic neurons. The read/write
(R/W) pulse generator, which contains N digital pulse width modulators (PWMs),
produces parallel pulses for reading all pre-synaptic weight values from the mem-
ristor cells in the column and these values are sent to the column ADC. The ADC
accumulates these pre-synaptic weights and converts the sum into a digital quantity.
Note that synaptic weights are stored as an analog quantity of memristors’ resistance
or conductance (i.e. memristance or memductance). This reading process will be de-
tailed in Section 3.2. Finally, the NE updates its membrane potential by adding up
the accumulated pre-synaptic weights, the external spike weight and the leaky po-
tential through LAU. If the membrane potential exceeds the given threshold voltage,
the NE generates a spike event and its potential is reset to the resting potential. The
spiking activity bit of the (i)th neuron Si is set according to
$$ S_i[t] = \begin{cases} 1 & \text{if } V_i[t] > V_{TH} \\ 0 & \text{otherwise} \end{cases} \tag{3.2} $$
where VTH is the threshold voltage. These spikes are read off the DNP to monitor the
output results during the spike I/O stage. The aforementioned process is repeated
N times to achieve the LIF operations of all N neurons during the neuron stage.
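The neuron stage described by (3.1) and (3.2) can be summarized in software. The following Python sketch is a behavioral model only, not the hardware implementation: the function name, the argument layout and the default resting potential of 0 are assumptions, and unconnected synapses are simply modeled as zero weights so that the sum runs over all N rows.

```python
def lif_step(v, w, s_prev, e_prev, k_syn, k_ext, v_leak, v_th, v_rest=0):
    """One biological time step of (3.1) and (3.2) for all N neurons.

    v       -- membrane potentials V_i[t-1]
    w       -- w[j][i] is the weight from pre-synaptic j to post-synaptic i
    s_prev  -- spike bits S_j[t-1]
    e_prev  -- external input spike bits E_i[t-1]
    """
    n = len(v)
    v_next, s_next = [], []
    for i in range(n):                        # one column (dendrite) per iteration
        # Accumulated pre-synaptic weights, the quantity the column ADC delivers.
        syn = sum(w[j][i] * s_prev[j] for j in range(n))
        vi = v[i] + k_syn * syn + k_ext * e_prev[i] - v_leak    # (3.1)
        fired = 1 if vi > v_th else 0                           # (3.2)
        v_next.append(v_rest if fired else vi)  # reset to resting potential on a spike
        s_next.append(fired)
    return v_next, s_next
```

The loop over `i` mirrors the N column accesses of the neuron stage, with a single adder and comparator shared across all neurons.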
After the neuron stage, the processing continues onto the third stage, learning,
where the synaptic weights are updated according to the STDP learning rule. In
this rule, the time difference between a pre-synaptic and a post-synaptic spike event
is measured and utilized to determine the synaptic weight change. To do this, each
LE has a time register to keep track of the neuron’s spike event time that is stamped
by the global timer. For each fired neuron, the LU conducts both a pre-synaptic and
a post-synaptic weight update in succession. If a neuron fires,
1) all its pre- (post-) synaptic neurons’ time registers are compared with the global
timer and the corresponding LEs determine the amounts of the synaptic weight
change using the pre-computed LUT;
2) the column (row) driver activates the memristor crossbar array’s column (row)
word line that is associated with the dendrites (axons) of the fired neuron;
3) the R/W pulse generator provides a read pulse word to the corresponding
column (row) to sense each memristor’s current internal state (i.e. current
synaptic weight) in the column (row) through the low-resolution ADC array;
4) the LEs calculate the pulse durations to write the desired synaptic weight
values, which are evaluated in step 1), into the respective memristor cells using
the memristor’s states obtained in step 3);
5) all the pre- (post-) synaptic weights are updated by means of the R/W pulse
generator with the durations determined in step 4).
The generator produces parallel write pulses that have different widths according to
not only the amount of the synaptic weight change but also the memristor’s current
internal state (due to the non-linear device characteristics for write time) as will be
described in Section 3.2. This update process is performed only when the corresponding
neuron fired at the neuron stage. In other words, if the membrane potential of the
(i)th neuron, for instance, does not exceed the threshold voltage at the neuron stage,
then both pre- and post-synaptic weight update processes for the (i)th neuron are
skipped. It is worth noting that the entire learning stage is omitted when no neuron
fired at the neuron stage. In addition, the proposed architecture is able to process
spiking I/O tasks and the learning stage simultaneously since LIF operations are
processed in the previous stage, resulting in no conflict of spiking event data.
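The learning stage can be sketched behaviorally as follows. This Python model is an illustration under several assumptions and does not reproduce the hardware flow: the memristor read, pulse-duration calculation and write of steps 3) to 5) are abstracted into a direct integer weight change, the LUT contents are hypothetical, and the conventional STDP direction is assumed (a synapse whose pre-synaptic neuron fired earlier is potentiated, one whose post-synaptic neuron fired earlier is depressed). Time registers are stamped after all weight updates, which is also an assumption.

```python
def stdp_update(weights, connected, last_spike, fired, now,
                lut_pot, lut_dep, w_max=7):
    """Apply LUT-based STDP for every neuron that fired at time `now`.

    weights[j][i]   -- synapse from pre-synaptic j (row) to post-synaptic i (column)
    connected[j][i] -- True where a synapse exists
    last_spike[k]   -- time register of neuron k (None if it has never fired)
    lut_pot/lut_dep -- hypothetical LUTs indexed by the spike time difference
    """
    n = len(weights)
    for i in range(n):
        if not fired[i]:
            continue                      # no update for neurons that did not fire
        for j in range(n):
            if j == i or last_spike[j] is None:
                continue
            dt = now - last_spike[j]      # compared against the global timer
            # Column access (dendrite of i): pre-synaptic weight update.
            if connected[j][i] and dt < len(lut_pot):
                weights[j][i] = min(w_max, weights[j][i] + lut_pot[dt])
            # Row access (axon of i): post-synaptic weight update.
            if connected[i][j] and dt < len(lut_dep):
                weights[i][j] = max(0, weights[i][j] - lut_dep[dt])
    for i in range(n):
        if fired[i]:
            last_spike[i] = now           # stamp the time registers
    return weights
```

If `fired` contains no set bit, the function leaves the weights untouched, which corresponds to skipping the learning stage when no neuron fired.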
3.2 Memristive Synaptic Crossbar Array
In this section, we first briefly introduce the memristor model and two readout
schemes that are suitable for the LIF operations of silicon neurons and the synaptic
weight update process. Then, the proposed memristive synaptic crossbar array
and CMOS/memristor hybrid cell are presented. Additionally, we propose a new digital
PWM scheme for both reading and STDP update of memristive synapses. While
an analog PWM scheme has been conceptualized for implementing the STDP learning
rule [61], the presented digital design is more amenable to large scale integration
in digital system architectures.
3.2.1 Memristor Readout Schemes
Two different ways to read the memristor internal state have been presented
based on the sensing device for the memristor: 1) load resistor and 2) summing
amplifier (i.e. current-to-voltage converter), as depicted in Figure 3.3.
[Figure 3.3: Memristor sensing schemes by (a) load resistor and (b) summing amplifier. In both panels, memristors M1 to MN are each driven by VREAD and share a sensing node that produces VOUT, through the load resistor RL in (a) and through an amplifier with feedback resistor RF in (b).]
The load resistor based sensing scheme is commonly adopted for both binary and multilevel
memristor memories due to the ease of implementation [23, 45, 41]. This scheme
leverages a load resistor RL, which is connected in series with the memristor, and
forms a voltage divider. Hence, the output voltage VOUT of Figure 3.3(a) under a
given read voltage VREAD is calculated by
$$ V_{OUT} = \frac{\sum_{i=1}^{N} \frac{1}{R_i}}{\frac{1}{R_L} + \sum_{i=1}^{N} \frac{1}{R_i}} V_{READ} = \frac{\sum_{i=1}^{N} G_i}{G_L + \sum_{i=1}^{N} G_i} V_{READ} \tag{3.3} $$
where N is the number of memristors attached to the sensing node, Ri and Gi are
the resistance and conductance of the memristor Mi, respectively, and RL and GL are
those of the load resistor. For this particular neuromorphic application,
the result shows that this scheme is unable to integrate the internal states of
multiple memristors associated with an STDP update, since the output voltage VOUT
is not a linear summation of either the memristor resistances or conductances under
a fixed read voltage VREAD. What is possible with this scheme, though, is the accumulation
of all pre-synaptic weights for a neuron in N iterations with the use of an
additional adder. In other words, the memristor states are read out and accumulated
by the adder one at a time, resulting in a total of N^2 iterations to complete all N
neurons’ LIF tasks. Alternatively, concurrently integrating the pre-synaptic weights of
all N neurons to expedite the tasks requires N adders and N iterations
[58]. On the other hand, the summing amplifier based sensing scheme provides
a linear summation of conductance of memristors such that it is able to integrate
all pre-synaptic weights for each neuron. This scheme forms a virtual ground at the
negative input terminal of the amplifier and each current from the memristor flows
out into the ground. Thus, the output voltage VOUT of Figure 3.3(b) with the input
voltage VREAD is expressed by
$$ V_{OUT} = R_F \sum_{i=1}^{N} \frac{V_{READ}}{R_i} = R_F \sum_{i=1}^{N} G_i V_{READ} \tag{3.4} $$
where RF is the feedback resistor of the amplifier. This scheme allows the adder
and comparator (LAU) to be shared for the LIF task for N neurons, leading to
great area and power savings. It is worth noting that both schemes can be equally
employed to sense the state of a single memristor while the amplifier based scheme
is more suitable to accumulate the internal states of multiple memristors. Hence,
we leverage the summing amplifier based sensing scheme to effectively accumulate
all pre-synaptic weights of each neuron for LIF operations at the expense of an
amplifier, whereas the resistor based sensing is exploited to detect each memristor’s
current state for the synaptic weight update process.
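The difference between the two sensing schemes can be checked numerically from (3.3) and (3.4). The Python sketch below is illustrative: the function names and the component values in the usage note are hypothetical and are not taken from the design.

```python
def v_out_load_resistor(conductances, g_load, v_read):
    """(3.3): voltage divider formed by parallel memristors and a load resistor."""
    g_sum = sum(conductances)
    return g_sum / (g_load + g_sum) * v_read


def v_out_summing_amplifier(conductances, r_f, v_read):
    """(3.4): output of the summing amplifier, linear in the summed conductance."""
    return r_f * sum(g * v_read for g in conductances)
```

With, say, two identical memristors of 20 µS, a 10 kΩ feedback or load resistor and a 1.2 V read voltage, the amplifier output for both memristors together is twice the output for one of them, whereas the load resistor output grows by less than a factor of two. This is the nonlinearity that prevents the load resistor scheme from accumulating several weights in a single read.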
3.2.2 Memristive Synaptic Cell Partition
Since the summing amplifier based sensing provides the linear summation of conductance
of memristors, we partition the conductance range of each memristor in the crossbar array
into nine equally spaced levels to represent a multilevel synaptic weight as shown
in Figure 3.4. Importantly, the programming time to write a value to the memristor
differs drastically according to the memristor internal state (i.e. current
conductance value), even when the same change is written [41]. Table 3.1 shows the write
times to change the conductance value by one level under different internal states
according to our memristor model [71]. They are normalized against the time to
change the level between 7 and 8.
[Figure 3.4: Memristor level partitions by equal conductance. Levels 0 to 8 are equally spaced in conductance from GOFF (resistance ROFF) to GON (resistance RON); level 0 indicates network connectivity and levels 1 to 8 hold the synapse values 0 to 7.]
Table 3.1: Normalized write times to change one level of memristor conductance
(RON = 10 kΩ, ROFF = 500 kΩ, VWRITE = 1.2 V).
Level change   0 ⇔ 1   1 ⇔ 2   2 ⇔ 3   3 ⇔ 4   4 ⇔ 5   5 ⇔ 6   6 ⇔ 7   7 ⇔ 8
Time            8205     117      25      10       5       3       2       1
Note that the times are symmetric with respect to
the direction (e.g. the write times for changing from level 1 to level 0 and vice versa are
the same). Also, we have modeled the memristors with the parameters RON = 10 kΩ
and ROFF = 500 kΩ, and the read voltage VREAD and write voltage VWRITE are both
set to 1.2 V [71]. The programming time to change the level between 0 and
1 is over 8000× longer than that between 7 and 8 for our memristor. The excessive
programming time required to change the state between levels 0 and 1 may notably
slow down the overall on-chip learning speed during the training phase. Therefore,
we utilize the lowest level to indicate the network connectivity of the crossbar, which
allows the network to be fully reconfigurable, and the eight higher levels to represent
the actual synapse value (i.e. 3-bit synapse) for a significant training speedup. As
an example, if the memristor in the (i)th column and the (j)th row of the crossbar
is at level 0, no connection exists between the (j)th and the (i)th neurons.
On the other hand, a memristor level of 4 indicates that the (j)th and (i)th
neurons are connected with the synaptic weight of 3.
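The level partition and its effect on write time can be sketched in a few lines of Python. The code below is an illustration under stated assumptions rather than the model of [71]: it assumes that the nine levels are equally spaced in conductance between GOFF and GON and that the write time for a level change is proportional to the difference of squared resistances, as suggested by (2.15). All function names are hypothetical.

```python
R_ON, R_OFF = 10e3, 500e3           # ohms, as listed for Table 3.1
G_ON, G_OFF = 1 / R_ON, 1 / R_OFF
LEVELS = 9


def level_conductance(level: int) -> float:
    """Conductance of a level, assuming equal spacing from G_OFF to G_ON."""
    return G_OFF + level * (G_ON - G_OFF) / (LEVELS - 1)


def decode_level(level: int):
    """Return (connected, synaptic_weight) for a stored memristor level."""
    if level == 0:
        return False, None          # level 0 marks "no connection"
    return True, level - 1          # levels 1..8 hold the 3-bit weight 0..7


def normalized_write_times():
    """One-level write times, taken proportional to |R_a^2 - R_b^2| as in (2.15)."""
    r = [1 / level_conductance(k) for k in range(LEVELS)]
    t = [abs(r[k] ** 2 - r[k + 1] ** 2) for k in range(LEVELS - 1)]
    return [x / t[-1] for x in t]   # normalized against the 7 <-> 8 change
```

The list returned by `normalized_write_times()` can be compared against Table 3.1; under these assumptions the first entry, for the change between levels 0 and 1, comes out several thousand times larger than the last one, which is the motivation for reserving level 0 for connectivity.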
3.2.3 Memristive Crossbar Array and Cells
Figure 3.5 shows the proposed synaptic crossbar array and the CMOS/memristor
hybrid synaptic cell. The two switches S1 and S2 in the cell are introduced to allow
each memristor to be accessed both column-wise and row-wise. When the
row (column) driver activates a word line, S1 (S2) of all cells that lie in the same
row (column) is turned on and the cells are ready to be accessed. Parallel voltage pulses are
generated by the R/W pulse generator and applied to read or write all cells in the
row (column) as shown in Figure 3.5.
[Figure 3.5: Proposed synaptic crossbar array and CMOS/memristor hybrid synaptic cell. The cell contains the word-line switches S1 and S2, and the switches S3 and S4 that connect the memristor to the R/W pulse generator (read and write pulses) and to the readout circuit (ADC).]
To read the value from a cell, a fixed positive voltage pulse is applied to the
memristor cell while S3 and S4 connect to the pulse generator and the ADC (i.e.
memristor readout circuit) lines, respectively. The current generated by the cell due
to the applied positive voltage pulse flows out into the ADC line and is converted to
a digital value, reflecting the memductance of the cell. Unfortunately, the applied
positive pulse disturbs each memductance [23]. Therefore, a flipped (i.e. negative)
voltage pulse following the positive one is injected to each memristor to restore its
memductance, resulting in zero net flux injection for the memristor. This is effectively
done by connecting S3 and S4 to the ADC and the generator lines, respectively. In the
write operation, the cells in either one row or column are accessed and incrementally
updated in parallel. A write voltage pulse is injected to each memristor cell and
its memductance is altered depending on the pulse duration. The write operation
latency varies with respect to the value to be written into the cell. It is possible to
either increase or decrease the memductance. For the latter, S3 and S4 are connected
to the ADC and the generator lines, respectively, to effectively apply a negative
voltage pulse to the memristor cell.
3.2.4 Digital Pulse Width Modulation for Memristive Synaptic Cell
The proposed DNP uses a parallel write pulse word, consisting of N binary pulses
whose durations are different from each other to update either all the pre-synaptic
or post-synaptic weights of a given neuron during the training process. A delay
line based digital PWM requires both a large number of delay cells to realize
many different pulse widths and a large multiplexer to select one output from these
cells, leading to considerable area and power overheads [64]. Also, the delay cells
can be sensitive to process, voltage and temperature (PVT) variations. This affects the pulse durations and may incur
failures in writing the desired values to the memristors. Thus, we leverage a low-cost
counter based digital PWM to readily generate a binary pulse with various durations
as illustrated in Figure 3.6.
[Figure 3.6: Digital pulse width modulator. A counter clocked by CKPWM and reset by nRSTPWM produces CNTPWM, a comparator (CMP) compares CNTPWM with NPWM, and a multiplexer (MUX) drives the output PPWM, which stays high for NPWM cycles.]
The counter records the number of cycles of the clock
signal CKPWM and its output CNTPWM is compared with the desired number of cycles
NPWM by the digital comparator. The multiplexer outputs “1” until CNTPWM reaches
NPWM and “0” thereafter. The pulse duration is given by
$$ t_{PWM} = N_{PWM} \cdot t_{CKPWM} \tag{3.5} $$
where NPWM and tCKPWM are the desired number of cycles and the period, respectively,
of the PWM clock CKPWM. Note that CKPWM does not have to be identical to the
DNP operating clock. NPWM is provided by the LU, where the amount of synaptic weight
change is calculated from the time difference between pre- and post-synaptic firing
events in accordance with the STDP rule, during the learning stage. CKPWM and
NPWM can be straightforwardly configured according to the range of level change and
device characteristics of the memristor. The R/W pulse generator includes N digital
PWMs to create a read or write pulse word to simultaneously access the memristive
synaptic cells in either one column or row.
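The counter based PWM described above can be illustrated with a minimal behavioral sketch. It assumes that the counter is cleared before each pulse and increments once per CKPWM cycle; the function names are hypothetical and only serve this illustration. The 20 ns period used in the usage note corresponds to the 50 MHz R/W pulse generator clock reported in Section 3.4.

```python
def pwm_waveform(n_pwm, total_cycles):
    """Sample the PWM output once per CKPWM cycle after reset (behavioral sketch)."""
    if n_pwm < 0 or total_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    samples = []
    cnt_pwm = 0  # CNTPWM, assumed to be cleared by nRSTPWM before the pulse starts
    for _ in range(total_cycles):
        # Comparator plus multiplexer: output 1 until CNTPWM reaches NPWM, then 0.
        samples.append(1 if cnt_pwm < n_pwm else 0)
        cnt_pwm += 1
    return samples


def pulse_duration_ns(n_pwm, t_ck_ns):
    """Equation (3.5): tPWM = NPWM * tCKPWM, expressed in nanoseconds."""
    return n_pwm * t_ck_ns


def pulse_word(n_pwm_list, total_cycles):
    """One waveform per PWM, i.e. a parallel pulse word for one column or row."""
    return [pwm_waveform(n, total_cycles) for n in n_pwm_list]
```

For example, `pwm_waveform(3, 6)` yields three cycles of "1" followed by three cycles of "0", and `pulse_duration_ns(117, 20)` gives the duration of a 117-cycle pulse with a 20 ns clock period.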
3.3 Building Block Implementations
3.3.1 Memristor Readout
Figure 3.7 illustrates the proposed memristor readout block that includes a col-
umn ADC and a low-resolution ADC array. The column ADC works only for the
neuron stage to conduct LIF operations while the low-resolution ADC array is ac-
tivated during the learning stage to sense each memristor’s internal state. In [58],
each neuron circuit has its own adder and comparator to integrate all pre-synaptic
weights and determine firing activity. It requires N iterations to complete all neurons’
LIF tasks. The proposed column ADC, which contains a summing amplifier, a
sample-and-hold circuit and a high-resolution ADC, provides significant power and
area reductions without degrading overall throughput since it allows a single adder
and comparator (LAU) to be shared by all N neurons.
Figure 3.7: Proposed memristor readout block consisting of column ADC and low-resolution ADC array.
During a LIF operation, the R/W pulse generator injects a pulse word into the
corresponding column to read all pre-synaptic weights from the memristor cells. Each
current from the cell flows into the virtual ground and is summed at the negative
terminal of the amplifier. The total current is converted to a voltage quantity
through the feedback amplifier. The sample-and-hold circuit keeps the voltage and
the high-resolution ADC transforms it into a digital value that corresponds to the
term $\sum_{j=1}^{M} w_{ji} S_j[t-1]$ in (3.1). In this way, N iterations are enough to complete the
LIF operations for all N neurons with only one adder and comparator (LAU).
During the learning stage, before performing a synaptic weight update with a
desired value (i.e. writing the value into the memristor), we should know the corre-
sponding memristor’s current internal state to determine the pulse duration to write
the desired synaptic weight. This is because the required pulse width varies
with the current state, even when the same amount of change is written
into the memristor, due to the non-linear device characteristics as demonstrated in
Table 3.1. To do this, we use an array of N low-resolution flash ADCs to read
all pre- (post-) synaptic weights of each column (row) in parallel. Also, we leverage
the load resistor based sensing scheme, as in Figure 3.3(a), to eliminate the need
for amplifiers and thereby reduce area and power dissipation. Each flash ADC has
8 comparators to detect one of 9 levels in our memristor cell and a digital logic to
encode the output of the comparators. The reference voltage generator shared by N
flash ADCs is a resistor string and creates 8 reference voltages for the comparators
of each ADC. Importantly, this string does not have equally spaced resistor values
since our memristor cell is equally sliced not by resistance but by conductance (see
Figure 3.4).
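The reason why the reference string is not equally spaced can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the load resistor based sensing acts as a voltage divider, V = VREAD · RL·G / (1 + RL·G), and that each reference voltage sits midway between the sense voltages of two adjacent conductance levels. The divider model, the midpoint rule and all component values are assumptions, not values taken from the design.

```python
def level_voltages(g_min, g_max, levels, r_load, v_read):
    """Sense voltage of each conductance level, assuming a resistive divider."""
    step = (g_max - g_min) / (levels - 1)  # levels are equally spaced in conductance
    volts = []
    for n in range(levels):
        g = g_min + n * step
        volts.append(v_read * r_load * g / (1.0 + r_load * g))
    return volts


def reference_voltages(volts):
    """Assumed rule: one reference midway between each pair of adjacent levels."""
    return [(lo + hi) / 2.0 for lo, hi in zip(volts, volts[1:])]


def flash_encode(v_sense, refs):
    """Thermometer-to-binary encoding: count the comparators that trip."""
    return sum(1 for r in refs if v_sense > r)


# Hypothetical values: 9 levels from 1 uS to 9 uS, 100 kOhm load, 1 V read pulse.
VOLTS = level_voltages(1e-6, 9e-6, 9, 100e3, 1.0)
REFS = reference_voltages(VOLTS)
SPACING = [b - a for a, b in zip(REFS, REFS[1:])]
```

Under these assumptions the 8 reference voltages become progressively closer together at higher conductance, although the conductance levels themselves are uniformly spaced.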
The desired resolution of the column ADC is given by
$\text{resolution} = \lceil \log_2 N + \log_2 L \rceil$ (3.6)
where N and L are the numbers of neurons and conductance levels of the memristor
cell in the array, respectively. The flash ADC architecture is not suitable
for a high-resolution ADC because it requires $2^K - 1$ comparators to implement K-bit
analog-to-digital conversion, leading to considerable power and area consumption as
K increases. The successive approximation register (SAR) and pipeline ADCs occupy
large silicon area stemming from area-consuming passive components. Moreover, although
SAR and delta-sigma (∆Σ) ADCs can achieve high resolution, they may
be limited by a relatively slow conversion rate in the kHz range.
Therefore, to realize a low-cost high-resolution ADC with a moderate conversion
speed, we adopt a multiphase voltage controlled oscillator (VCO) based ADC where
an analog input alters the VCO frequency and the frequency is measured in digital
value by counters [73]. The VCO based ADC is readily implemented with a few
digital components such as counters, resulting in a small area overhead. Figure 3.8
shows the block diagram of the VCO based ADC, which consists of a ring VCO,
counters, and a tree adder, and the proposed delay cell. We employ a 12-stage ring
VCO with the proposed pseudo differential delay cell that is based on an inverter
structure with an NMOS current source. The back-to-back inverter in the cell ensures
the differential outputs. The VCO operating frequency is adjusted by controlling the
current source (i.e. a higher current leads to a higher frequency). The clock phases
of each delay cell of the VCO can be exploited to enhance the ADC resolution [34].
Figure 3.8: Block diagram of VCO based ADC and proposed delay cell.
Generally, using more phases yields a higher resolution at the cost of higher power
and area overheads. In our DNP, six phases (see Figure 3.8) are leveraged to achieve
a 12-bit ADC resolution for 256 neurons and 9 levels of the memristor cells at a
1 MHz conversion rate as will be described in Section 3.4. These clock phases are
connected to the respective digital counters whose outputs are summed by the tree
adder to obtain the final digital output. The frequency of our VCO is up to 1016 MHz
and thus a 10-bit counter is used to measure the frequency of each clock phase at a
1 MHz sampling frequency. Also, the VCO usually operates at a few hundred MHz
and the counters that operate at such a high frequency contribute a large portion of
the overall ADC power dissipation [36]. Hence, we employ asynchronous counters to
significantly reduce the ADC power.
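The two sizing rules used above can be checked with a short sketch. The first function evaluates (3.6) with integer arithmetic; the second assumes that each counter simply counts the VCO cycles of one phase within a sampling period and is cleared afterwards, which is a simplification made for this illustration. The function names are hypothetical.

```python
def column_adc_bits(n_neurons, n_levels):
    """Equation (3.6): ceil(log2 N + log2 L), evaluated exactly with integers."""
    if n_neurons < 1 or n_levels < 1:
        raise ValueError("N and L must be positive")
    # For an integer x >= 1, ceil(log2 x) equals (x - 1).bit_length().
    return (n_neurons * n_levels - 1).bit_length()


def counter_bits(f_vco_max_hz, f_sample_hz):
    """Bits needed to hold the largest per-phase count in one sampling period."""
    max_count = f_vco_max_hz // f_sample_hz  # assumed: counter cleared every sample
    return max_count.bit_length()
```

With N = 256 and L = 9 the first function returns 12 bits, with N = 32 it returns 9 bits, and a 1016 MHz maximum VCO frequency sampled at 1 MHz gives a 10-bit counter, in agreement with the figures quoted in the text.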
3.3.2 Neuron and LIF Arithmetic Units
The NU cooperates with LAU to emulate the LIF neuron behavior of (3.1) dur-
ing the neuron stage. The block diagram of NEs with LAU and the flowchart of
NU processing are described in Figure 3.9 and Figure 3.10, respectively. For each
neuron, the NU makes the R/W pulse generator create a read pulse word for the
corresponding column (dendrite) of the crossbar and sums up all the pre-synaptic
weights of the neuron VINTPRE through the column ADC. Importantly, the word
contains the read pulses for only the neurons that fired at the previous time step t − 1 to save
power. Consider a network of 10 neurons as an example. If the (3)rd, (4)th and (5)th
neurons fired and the other 7 neurons did not fire at t − 1, only the (3)rd, (4)th and
(5)th PWMs in the generator produce the read pulse and the other PWMs output
“0”. Similarly, this column reading and pre-synaptic weight accumulating process is
omitted to reduce power dissipation when there is no firing activity in the previous
neuron stage (i.e. $\forall j,\ 0 \le j \le N-1 : S_j[t-1] = 0$ in (3.1)). In this case, VINTPRE is forced to
zero since the term $\sum_{j=1}^{M} w_{ji} S_j[t-1] = 0$ in (3.1) and only the leaky potential VLEAK
and the external input spike weight KEXT are considered to obtain the membrane
potential.
Figure 3.9: Neuron elements with the LIF arithmetic unit.
Figure 3.10: Flowchart of the processing of the neuron unit.
More importantly, the column ADC output VINTPRE should be adjusted since our
synaptic cell includes the connectivity information (e.g. memristor level of 4 in the
(j)th row and (i)th column of the crossbar indicates the synapse value of 3 between
the (j)th and (i)th neurons). Therefore, the number of fired pre-synaptic neurons
NFIREPREi (for the (i)th neuron) is subtracted from VINTPRE; this number is evaluated by the
corresponding LE during the synaptic weight update process (i.e. the learning stage)
at the previous time step t − 1. The subtractor output VINTPRE − NFIREPREi is added
to the corresponding membrane potential VMEMB, and the external input spike weight
KEXT and leaky potential VLEAK are added as well. The adder’s output VMEMBNEW is
compared to the threshold voltage VTH by the digital comparator and the result is
sent to the respective NE through the demultiplexer and captured by its output spike
register. Meanwhile, VMEMBNEW is sent to the NE and stored in the membrane register
when it does not exceed VTH. Otherwise, the register is reset to the resting potential
VREST via the multiplexer.
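The LIF arithmetic described in this subsection can be summarized by the following behavioral sketch. It is a simplification under stated assumptions: memristor level 0 is taken to mean "no connection" and level n ≥ 1 to encode synapse value n − 1, VLEAK is treated as a signed quantity that is added, NFIREPREi is computed on the fly rather than by the LE, and any additional scaling shown in Figure 3.9 is ignored. The function and argument names are hypothetical.

```python
def column_read(levels_col, fired_prev):
    """VINTPRE: sum of memristor levels on rows whose neuron fired at t-1."""
    return sum(lv for lv, s in zip(levels_col, fired_prev) if s)


def count_fired_pre(levels_col, fired_prev):
    """NFIREPRE: fired neurons that are actually connected (level > 0, assumed)."""
    return sum(1 for lv, s in zip(levels_col, fired_prev) if s and lv > 0)


def lif_step(v_memb, levels_col, fired_prev, ext_spike, k_ext, v_leak, v_th, v_rest):
    """One LIF update for a single neuron; returns (new membrane value, spike)."""
    if any(fired_prev):
        v_int_pre = column_read(levels_col, fired_prev)
        n_fire_pre = count_fired_pre(levels_col, fired_prev)
    else:
        # No firing at t-1: the column read is skipped and VINTPRE is forced to 0.
        v_int_pre, n_fire_pre = 0, 0
    v_new = v_memb + (v_int_pre - n_fire_pre)
    v_new += (k_ext if ext_spike else 0) + v_leak
    if v_new > v_th:
        return v_rest, 1   # fire and reset the membrane register to VREST
    return v_new, 0        # store VMEMBNEW in the membrane register
```

For the 10-neuron example in the text, with the (3)rd, (4)th and (5)th neurons fired and hypothetical levels 4, 1 and 8 in the corresponding rows, the column read returns 13 and the adjusted synaptic input is 13 − 3 = 10, i.e. the sum of the synapse values 3, 0 and 7.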
3.3.3 Learning Unit
The LU is designed to conduct the STDP on-chip learning by calculating the
amounts of pre- and post-synaptic weight changes, determining the pulse durations
to write the desired weights and updating the memristive crossbar array with the
values through the R/W pulse generator. Figure 3.11 exhibits the block diagram of
the learning elements with the global timer and the two programmable LUTs. The
flowchart of LU is shown in Figure 3.12 as well. All LEs contain their own time
register to keep the neuron’s spike event time and share the register-based LUTs for
STDP learning curve and write pulse width for the memristor. The STDP LUT holds
several pairs of the spike time difference ∆t and synaptic weight change ∆W while
the pulse width LUT stores the required number of cycles of PWM clock CKPWM to write desired
values for the memristor. The design of the pulse width LUT involves important area
considerations. For K-bit synapses, a brute-force implementation would require a
large number of $2^K(2^K-1)/2$ LUT entries for all possible pairs of the current and target
memristor levels. Instead, to reduce area and complexity of the selection logic, we
design the LUT in such a way that only the desired cycles of CKPWM for altering the
memristor level from the lowest to each target (i.e. from level 1 to levels 2, 3, 4, 5,
6, 7 and 8) are stored in $2^K - 1$ entries. With this area-efficient design, the actual
pulse duration for a given update is determined by finding the difference between
the entry values for the current and target levels. For instance, the number of cycles
for increasing the memristor level from 3 to 5 is obtained by subtracting the number
of cycles needed for changing the memristor level from 1 to 3 from that for changing
from level 1 to 5. As a result, we attain a $2^{K-1}\times$ area reduction of the pulse width
LUT (e.g. 4× reduction in the proposed DNP). In addition, the LU supports a time
scaling feature to provide additional programmability in scaling the stored STDP
rule. It is implemented with a shift operation of the time differences.
Figure 3.11: Learning elements with global timer and shared LUTs.
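The entry-count argument and the difference-based lookup can be sketched as follows. The cycle counts for levels 2 to 4 (117, 142 and 152) are the entries shown in the pulse width LUT of Figure 3.11; the counts for levels 5 to 8 are made-up placeholders, and the function names are hypothetical.

```python
def lut_entries_bruteforce(k_bits):
    """Entries needed to store every (current, target) level pair: 2^K (2^K - 1) / 2."""
    return (2 ** k_bits) * (2 ** k_bits - 1) // 2


def lut_entries_reduced(k_bits):
    """Entries needed when only 'lowest level to target' is stored: 2^K - 1."""
    return 2 ** k_bits - 1


# Cycles of CKPWM to move from level 1 to each level.
# Levels 2-4 follow Figure 3.11; levels 5-8 are hypothetical placeholders.
CYCLES_FROM_LEVEL1 = {1: 0, 2: 117, 3: 142, 4: 152, 5: 158, 6: 162, 7: 165, 8: 167}


def write_cycles(current_level, target_level):
    """Signed cycle count; a negative value means the pulse polarity is inverted."""
    return CYCLES_FROM_LEVEL1[target_level] - CYCLES_FROM_LEVEL1[current_level]
```

For K = 3 the brute-force table needs 28 entries and the reduced one 7, which is the 4× reduction quoted above. With these entries `write_cycles(2, 4)` evaluates to 152 − 117 = 35 cycles.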
Figure 3.12: Flowchart of the processing of the learning unit.
The processing of the learning stage for the entire network is done as follows.
The synaptic weight updates are processed by iterating over all fired neurons. To
check each neuron’s firing activity, the LU checks the output spike buffer, which is
filled at the neuron stage, in the corresponding NEs. For each fired neuron, the LU
runs the following two back-to-back parallel processes, one for pre-synaptic weight
updates and the other for post-synaptic weight updates. Note that in the absence of neuron
firing, the learning stage is skipped. The respective LE for every fired neuron up-
dates its time register with the global timer. Simultaneously, all LEs calculate the
scaled time differences ∆t1,∆t2, · · · ,∆tN between the global timer and their time
register values. This step basically determines the firing time differences between
this fired neuron and all other neurons in the network. The synaptic weight changes
∆W1,∆W2, · · · ,∆WN for ∆t1,∆t2, · · · ,∆tN are selected from the STDP LUT in par-
allel. Meanwhile, all pre- (post-) synaptic weights W1,W2, · · · ,WN (before update)
are obtained from the corresponding column (row) through the R/W pulse genera-
tor and low-resolution ADC array and fed into the respective LEs, also in parallel.
The pre- (post-) synaptic weights to be written into the respective memristor cells
WNEW1,WNEW2, · · · ,WNEWN are computed by the adder (i.e. ∀i : WNEWi = Wi + ∆Wi).
Now, Wi and WNEWi correspond to the current and target levels of the (i)th mem-
ristor, respectively. Then, each LE concurrently looks up the cycle counts NWi and
NWNEWi from entries associated with Wi and WNEWi from the pulse width LUT. The
desired numbers of cycles NPWM1, NPWM2, · · · , NPWMN to update the pre- (post-) synap-
tic weight values by WNEW1,WNEW2, · · · ,WNEWN, respectively, are determined by sub-
tracting each NWi from NWNEWi (i.e. ∀i : NPWMi = NWNEWi − NWi). Finally, the pulse
generator produces the parallel write pulse word with NPWM1, NPWM2, · · · , NPWMN as
in Figure 3.6. When creating the word, NPWMi is set to “0” for Wi = 0 since no
pre- (post-) synaptic connection exists. Additionally, for negative NPWMi values, the
generator inverts the polarity of the pulses, which is done by manipulating
the switches of the proposed cell as shown in Figure 3.5. Also, all LEs record the
numbers of fired pre-synaptic neurons NFIREPRE1, NFIREPRE2, · · · , NFIREPREN during the
post-synaptic weight update process for the following LIF operations as detailed in
Section 3.3.2.
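The per-element steps of the learning stage can be sketched as a single function. This is a simplified illustration rather than the implemented datapath: the STDP LUT is modeled as a dictionary from scaled time difference to weight change, time scaling is modeled as a right shift, the new level is clamped to the valid range (an assumption not stated in the text), and all LUT contents passed in by the caller are hypothetical.

```python
def le_update(w, t_now, t_reg, stdp_lut, cycles_from_lowest,
              time_shift=0, w_min=1, w_max=8):
    """Return (WNEW, NPWM) for one learning element (behavioral sketch)."""
    if w == 0:
        return 0, 0                       # no connection: NPWM is set to 0
    dt = (t_now - t_reg) >> time_shift    # scaled time difference
    dw = stdp_lut.get(dt, 0)              # assumed: no change outside the LUT range
    w_new = min(max(w + dw, w_min), w_max)  # assumed clamping to valid levels
    # NPWM = N_WNEW - N_W; a negative result selects the inverted pulse polarity.
    n_pwm = cycles_from_lowest[w_new] - cycles_from_lowest[w]
    return w_new, n_pwm
```

As a usage example with a four-level pulse width table `{1: 0, 2: 117, 3: 142, 4: 152}` and a hypothetical STDP table `{0: 4, 1: 2, 2: 1}`, a synapse at level 2 whose partner fired one time unit earlier moves to level 4 and requires 152 − 117 = 35 cycles.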
3.4 Implementation of the Neuromorphic Processor and Simulation Results
Except for the memristor nanodevices, the proposed neuromorphic processor has
been implemented with a commercial 90 nm CMOS technology under the regular
supply voltage of 1.2 V. All the digital circuits, which exclude the analog components
of the column ADC (e.g. ring VCO and summing amplifier) and the low-resolution
ADC array, are synthesized with standard cells. The layout that measures 1.45 mm
× 1.28 mm is shown in Figure 3.13. It is worth noting that the memristor crossbar
array is defined as an empty macro based on the estimated area. The main clock
frequency is 1 MHz while the R/W pulse generator operates at 50 MHz. Table 3.2
summarizes the key features of the implementation. All the results are obtained from
the pre-layout simulations.
Figure 3.13: Layout of the neuromorphic processor with 256 neurons and 65,536 synapses.
3.4.1 Column ADC Performance
To show the performance of the VCO based ADC as proposed in Figure 3.8,
we sweep the input voltage from 0.45 V to 1.15 V with a 0.05 V step under a
sampling frequency of 1 MHz. The simulated input-to-output characteristic is shown
in Figure 3.14(a). The digital output spans the range from 1,440 to 5,862
(i.e. 12.1-bit resolution) and exhibits highly linear behavior with respect to the
input ($R^2 = 0.9964$). Thus, it satisfies the resolution required in (3.6) to serve as
a column ADC for the 256 silicon neurons and 256×256 memristive crossbar array
with 3-bit synapses in the proposed architecture. As in Figure 3.14(b), we compare
the power and area of the ADC with the asynchronous and synchronous counters
under various ADC resolutions. Our 12-stage differential ring VCO in Figure 3.8 can
attach up to 24 counters and hence can be leveraged to realize ADCs with a resolution
between 9- and 14-bits. Note that the tree adder size also varies according to the
number of counters. The power consumption is measured at the input voltage
of 0.8 V, which is the median input of Figure 3.14. The ADC with asynchronous
counters is more power efficient than that with synchronous ones while the area
remains almost the same as the resolution increases. At 12-bit resolution, the
ADC adopting the asynchronous counter dissipates 24.8% less power with merely
1.6% area overhead compared with the synchronous counter based ADC. Hence, the
asynchronous counter is appealing for high-resolution VCO based ADC designs.
Table 3.2: Neuromorphic processor implementation summary.
Item                                  Specification
Technology                            90 nm CMOS
Synapse Storage                       Memristor
Supply Voltage                        1.2 V
Main/PWM Operating Frequency          1 / 50 MHz
# of Neurons                          256
# of Synapses                         65,536
Synapse Resolution                    3-bits (8-levels)
Neuron Model Parameter Resolution     5-bits
Membrane Potential Resolution         16-bits
Neuron Model                          Digital LIF neuron
Learning Rule                         On-chip STDP learning
Synaptic Connection Scheme            Fully reconfigurable crossbar
Power Dissipation                     6.45 mW
Area                                  1.86 mm²
Figure 3.14: Column ADC performance: (a) input-to-output characteristics and (b) power and area as functions of counter type and resolution.
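The 12.1-bit figure quoted in this subsection can be reproduced with a one-line calculation, assuming that the resolution is taken as the base-2 logarithm of the number of distinct output codes in the simulated range. This reading of the figure is an assumption made for illustration, and the function name is hypothetical.

```python
import math


def span_bits(code_min, code_max):
    """Base-2 logarithm of the number of distinct output codes in [min, max]."""
    return math.log2(code_max - code_min + 1)


SIMULATED_BITS = span_bits(1440, 5862)  # output range reported for Figure 3.14(a)
```

Under this assumption the simulated range corresponds to slightly more than the 12 bits required by (3.6) for 256 neurons and 9 conductance levels.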
3.4.2 Overall Processor Performance
Figure 3.15 demonstrates the overall performance of the proposed neuromorphic
processor. The power consumption as a function of the network size, which is evaluated
based on a 90 nm CMOS technology and the memristor parameters in [71], is
depicted in Figure 3.15(a). The required column ADC resolution is a function of
the network size as given in (3.6) (e.g. a 9-bit column ADC for a 32-neuron design).
As the number of integrated neurons N doubles, the chip power increases
by less than a factor of two. The asynchronous counter based column ADC consumes
over 21% of the overall power dissipation but its power does not increase much as
the resolution increases (see Figure 3.14(b)). The area breakdown analysis for the
neuromorphic processor for a 256 neuron network is illustrated in Figure 3.15(b) as
well. The SU, which includes the column ADC, low-resolution ADC array, memristive
crossbar, and pulse generator, occupies about 40% of the chip area. Despite
the relatively small area of the memristive crossbar array (8.6%), realizing the
parallel access scheme for the multibit memristive crossbar requires integration of
several peripherals such as the array of low-resolution ADCs and multiple PWMs.
In addition, the concurrent synaptic weight update scheme makes the LU occupy a
large area portion (42.2%). Nevertheless, in return, this parallel scheme expedites
the synaptic weight update process significantly. Furthermore, the power consumed
by the SU reaches 5.16 mW, which is 80% of the entire processor power. Therefore,
a further optimized low-overhead access scheme would be appealing and will be explored
in future work.
Figure 3.15: Neuromorphic processor performance: (a) power and (b) area breakdown.
(a) Power (mW) versus # of neurons (32, 64, 128, 256).
(b) Area breakdown: Neuron & LIF Arith. Units 12.2%, Learning Unit 42.2%, Controller, etc. 5.5%, Memristor Crossbar 8.6%, Low-Resolution ADC Array 19.5%, Pulse Gen. 10.2%, Column ADC 1.8%.
3.4.3 Application of the Neuromorphic Processor for Character Recognition System
Finally, we conduct a behavior-level digital simulation of the chip to demonstrate
the functionality of the neuromorphic processor designed in this section. The behavioral
simulation is necessary as gate or transistor level simulation of long training
processes requires huge CPU times, making it practically infeasible. To realistically
capture the functionality of the designed processor and its dependencies on key
hardware design choices, the key network and design features including the digital
LIF neuron dynamics, the STDP learning rules, and the bit-widths used to represent the neuron
model and synapses in Table 3.2 are fully captured in the behavioral simulation. We
specifically consider the case where the proposed DNP is configured to be a two-layer
learning network for character recognition as illustrated in Figure 3.16 [15]. The network
has an input-and-output layer structure with 232 excitatory and 7 inhibitory
neurons and is designed to recognize the letters “A”−“Z” by unsupervised learning.
The input layer has 196 excitatory neurons, which form a 2 dimensional array.
Each excitatory input neuron receives a binary input representing a pixel value in
the 14×14 pixel input pattern and projects its output to all excitatory output neurons
through plastic synapses. In the input layer, the excitatory neurons project
signals to 6 inhibitory neurons which provide negative feedback to modulate the firing
frequencies of all excitatory neurons. The output layer consists of 36 excitatory
neurons, each of which receives input from all the input excitatory neurons.
Structurally similar to the input layer, one inhibitory neuron is also employed in the
output layer to provide strong negative feedback. The inclusion of this inhibitory
neuron and these connections implements the winner-take-all (WTA) mechanism,
where any firing output neuron activates the inhibitory neuron and thereby prevents
other output neurons from firing through negative feedback.
Figure 3.16: Network for character recognition and training for alphabets.
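The network structure described above can be written down as a connectivity list for the crossbar, as in the following sketch. The index ranges follow the neuron index mapping given with Figure 3.17. Two points are assumptions made for illustration: the input-layer inhibitory feedback is connected to the input excitatory neurons only, and connections are represented as plain (axon, dendrite) pairs without weights. The helper names are hypothetical.

```python
INPUT_EXC = range(1, 197)     # 196 input excitatory neurons (14 x 14 pixels)
OUTPUT_EXC = range(197, 233)  # 36 output excitatory neurons
INPUT_INH = range(233, 239)   # 6 input-layer inhibitory neurons
OUTPUT_INH = range(239, 240)  # 1 output-layer inhibitory neuron


def build_connectivity():
    """Return the set of (axon, dendrite) index pairs, using 1-based indices."""
    conn = set()

    def connect(pre, post):
        for axon in pre:
            for dendrite in post:
                conn.add((axon, dendrite))

    connect(INPUT_EXC, OUTPUT_EXC)   # plastic feed-forward synapses
    connect(INPUT_EXC, INPUT_INH)    # excitation of the input-layer inhibition
    connect(INPUT_INH, INPUT_EXC)    # assumed target of the negative feedback
    connect(OUTPUT_EXC, OUTPUT_INH)  # any firing output neuron drives the inhibitor
    connect(OUTPUT_INH, OUTPUT_EXC)  # WTA: the inhibitor suppresses the output layer
    return conn


CONNECTIONS = build_connectivity()
```

Under these assumptions the configured network uses 239 of the 256 neurons and 196 × 36 = 7,056 plastic synapses.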
We clock the chip at the frequency of 1 MHz under a fixed supply level of 1.2 V.
To train the network, we first convert the training letters, which are composed of a
14×14 pixel map each, to 196 (= 14^2) parallel input spike trains to inject into the
network as in Figure 3.16. The corresponding input neuron is either silent or active to
encode a binary pixel. Then, for each letter from “A” to “Z”, the corresponding
input spike trains are applied to the respective input neurons for 5000 biological time
steps. As described here, the network connectivity can be configured by properly
programming the memristive crossbar. Note that our reconfigurable processor has
256 neurons. For the configured character recognition chip, Figure 3.17 illustrates
the index mapping for the neurons and the synaptic connections of the 256×256
memristive crossbar array (i.e. dot plot for memristor levels > 1). The weights of
all plastic synapses are random values before the training. With any input pattern,
the net input received by each excitatory output neuron can be thought of as the inner
product of a random weight vector and a signal vector representing the activities of
the excitatory input neurons. The weight vector of each output neuron corresponds
to its receptive field, which describes the input pattern whose presence leads to
excitation of the corresponding output neuron. The network reshapes the receptive fields
of some excitatory output neurons to memorize each letter during the training
such that they receive a strong excitatory signal and emit spikes in the presence of the
corresponding input pattern.
Figure 3.17: Neuron index mapping and synaptic connections of the crossbar array.
Neuron Index Mapping
Neuron Type          Neuron Index
Input Excitatory     1 ~ 196
Output Excitatory    197 ~ 232
Input Inhibitory     233 ~ 238
Output Inhibitory    239
(Dot plot axes: Axon Index versus Dendrite Index, each from 0 to 256.)
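The inner-product view of the net input can be made concrete with a few lines of code. The sketch flattens a binary pixel map into the activity vector of the input neurons and evaluates the inner product with one output neuron's weight vector. The random initial weights, their range and the seed are hypothetical, and the spiking dynamics are not modeled.

```python
import random


def flatten(pixel_map):
    """Turn a 2-D binary pixel map into a flat list of input activities."""
    return [pixel for row in pixel_map for pixel in row]


def net_input(weights, activity):
    """Inner product of a weight vector (receptive field) and the input activity."""
    return sum(w * s for w, s in zip(weights, activity))


def random_weights(n_inputs, w_max, seed):
    """Hypothetical random initial weights in the range 0..w_max."""
    rng = random.Random(seed)
    return [rng.randint(0, w_max) for _ in range(n_inputs)]
```

For a 14×14 pattern the flattened vector has 196 elements, one per input excitatory neuron. An output neuron whose receptive field overlaps the active pixels receives a larger net input and is therefore more likely to fire.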
Figure 3.18: Learning results for network: (a) receptive fields after training and (b) spike rasters for output neurons.
To demonstrate the function of the proposed processor, we show the simulation
results of the learning network for the letter training in Figure 3.18. The receptive
fields of the network after the training are shown in Figure 3.18(a). As can be
seen, the receptive fields are well shaped by the training in the sense that every letter
from “A” to “Z” appears at least once in the fields. This implies that during the
recognition phase the presence of a letter is expected to excite at least one output
neuron whose receptive field closely resembles the presented letter, signifying the
correct recognition of the letter. The spike rasters for the 36 output excitatory neurons,
which correspond to the neuron indices from 197 to 232, respectively, during
the training process are plotted in Figure 3.18(b). Due to the WTA network configuration,
each input pattern tends to cause one or a few output neurons to fire
while all other output neurons are inhibited through the negative feedback. For instance,
during the biological time steps from 1 to 5000, the letter “A” is presented
in training. The (197)th neuron’s receptive field is trained to resemble “A” and this
neuron is the only output neuron that actively fires in this period. In short, the
(197)th neuron is the winner when the letter “A” is presented. For the training
of some letters such as “B”, there may be a small number of winners in the WTA
mechanism and, as a result, more than one output neuron is trained, as seen from the two
“B” shaped letters in the receptive fields. The results shown in Figure 3.18(b) verify
the correct functioning of the designed WTA scheme. In Figure 3.18(b), we mark
the spike rasters of a few output neurons which should be active in the presence
of the corresponding training letter. For ease of visualization, only a subset of
these neurons are marked in the figure.
3.5 Summary
In this section, we have proposed a scalable digital neuromorphic processor archi-
tecture for large scale integration of spiking neurons. A novel multilevel memristive
synaptic crossbar design is presented to allow for high-density synaptic storage and
flexible access. Moreover, the lowest conductance level of each memristor is used
to arbitrarily configure the network connectivity with very low overhead, which also
significantly improves the synaptic weight update performance by reducing the write
time of the memristive crossbar. The proposed VCO based column ADC design re-
duces the silicon overhead required for LIF operations and is amenable to integration
due to its digital implementation style. Implemented in a 90 nm CMOS process, our
design with 256 digital neurons with learning circuits and 64K synapses is evaluated
to occupy an area of 1.86 mm² and dissipate a power of 6.45 mW under a supply
voltage of 1.2 V. Furthermore, we have validated the functionality of the proposed
architecture through the behavioral digital simulation for the case of a character
recognition system with unsupervised learning.
4. ENERGY EFFICIENT APPROXIMATE ARITHMETIC
4.1 Proposed Approximate Adder
Our main focuses and key contributions in the design of approximate arithmetic
units include 1) a significant reduction of the error rate by the carry-skip scheme
enabling carry speculation in a parallel manner and its application to the adder and
comparator design, 2) complete error rate analysis of the proposed arithmetic units
and 3) a very low-cost error magnitude reduction scheme that requires no additional clock
cycle for the proposed adder. In this section, the proposed approximate adder
design is presented while the proposed approximate comparator is discussed in the
next section.
4.1.1 Approximate Adder Architecture
Denote the two inputs of the adder by A and B, and their (i)th least significant bits
(LSBs) by $a_i$ and $b_i$, respectively. In addition, the propagate ($p_i$), generate ($g_i$), kill
($k_i$), and carry ($c_i$) signals of the (i)th bit position are defined by
$$g_i = a_i b_i, \quad k_i = \bar{a}_i \bar{b}_i, \quad p_i = a_i \oplus b_i$$
$$c_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } k_i = 1 \\ c_{i-1} & \text{if } p_i = 1 \end{cases} \qquad (4.1)$$
where $c_{i-1}$ is the carry of the (i-1)th bit position. Briefly, the adder determines the
carry $c_i$ independently of $c_{i-1}$ when $g_i = 1$ or $k_i = 1$; otherwise, it propagates $c_{i-1}$ to $c_i$.
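The definitions in (4.1) can be exercised with a bit-level sketch that computes the carry of every bit position and then forms the sum bits as the propagate signal XOR the incoming carry. The function names are hypothetical and the inputs are assumed to be non-negative integers smaller than 2^n.

```python
def carry_chain(a, b, n, c_in=0):
    """Carries c_0 .. c_{n-1} of an n-bit addition, following (4.1)."""
    carries = []
    c = c_in
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g_i = a_i & b_i               # generate
        k_i = (1 - a_i) & (1 - b_i)   # kill
        p_i = a_i ^ b_i               # propagate
        if g_i:
            c = 1
        elif k_i:
            c = 0
        else:
            assert p_i == 1           # propagate: c_i = c_{i-1}, so c is unchanged
        carries.append(c)
    return carries


def add_with_carries(a, b, n):
    """Exact (n+1)-bit sum built from the propagate bits and the carry chain."""
    carries = carry_chain(a, b, n)
    total = 0
    c_prev = 0
    for i in range(n):
        p_i = ((a >> i) ^ (b >> i)) & 1
        total |= (p_i ^ c_prev) << i  # sum bit s_i = p_i XOR c_{i-1}
        c_prev = carries[i]
    return total | (c_prev << n)      # the final carry becomes bit n
```

For A = 0111 and B = 0001 the carry generated at bit 0 propagates through bits 1 and 2 and is killed at bit 3.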
Figure 4.1 shows the block diagram of the proposed approximate n-bit adder,
which is divided into several k-bit sized blocks. Each block contains a k-bit sub-adder
and a k-bit sub-carry generator, which create a partial summation and a
partial carry-out signal, respectively. The n-bit adder has $m = \lceil n/k \rceil$ blocks. Also,
as in Figure 4.1, the k-bit inputs of the (i)th block are represented by $A^i_{k-1:0}$ and
$B^i_{k-1:0}$, and the partial summation result is indicated by $S^i_{apx,k-1:0}$. Note that the sub-adders
could be implemented by any traditional accurate adders such as ripple-carry
adder (RCA) and carry-lookahead adder (CLA). At the beginning of an addition
operation, all the sub-carry generators simultaneously create the partial carry-out
signals ($\cdots, C^{i+1}_{out}, C^{i}_{out}, C^{i-1}_{out}, \cdots$) using only their k-bit inputs. Then, the sub-adders’
carry-in signals ($\cdots, \hat{C}^{i+1}_{in}, \hat{C}^{i}_{in}, \hat{C}^{i-1}_{in}, \cdots$) are also concurrently speculated
from the v (≥2) preceding k-bit sub-carry generators with a multiplexer. Finally,
the sub-adders work with the speculated carries and produce the partial summations
($\cdots, S^{i+1}_{apx,k-1:0}, S^{i}_{apx,k-1:0}, S^{i-1}_{apx,k-1:0}, \cdots$). Therefore, the critical path delay of the
proposed approximate adder $t_{apx}$ is given by
$$t_{apx} = t_{sa} + \lceil \log_2 v \rceil \, t_{mux} + t_{scg} \qquad (4.2)$$
where $t_{sa}$, $t_{mux}$ and $t_{scg}$ are the delays of the sub-adder, a two-input multiplexer,
and the sub-carry generator, respectively. The delay is based on a multilevel tree
structure of two-input multiplexers. Note that the multiplexer delay is negligible if
k is large.
Figure 4.1: Block diagram of the proposed approximate adder.
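As a small worked example of (4.2), the following sketch evaluates the critical path delay for a given v. The delay values in the usage note are hypothetical and only illustrate how the multiplexer tree depth enters the expression; the function names are likewise hypothetical.

```python
def mux_levels(v):
    """ceil(log2 v): depth of a tree of two-input multiplexers with v inputs."""
    if v < 1:
        raise ValueError("v must be positive")
    return (v - 1).bit_length()


def critical_path_delay(t_sa, t_mux, t_scg, v):
    """Equation (4.2): t_apx = t_sa + ceil(log2 v) * t_mux + t_scg."""
    return t_sa + mux_levels(v) * t_mux + t_scg
```

With hypothetical delays of 300 ps for the sub-adder, 40 ps per multiplexer and 200 ps for the sub-carry generator, v = 3 needs two multiplexer levels and gives 300 + 2·40 + 200 = 580 ps.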
The proposed carry prediction works as follows. When all the propagate signals
of the (i-1)th block are true, the carry-outs of a larger number of preceding blocks are
required for more accurate carry prediction for the (i)th sub-adder. Thus, we utilize
carry-skip to speculate the carry as depicted in Figure 4.2, which is an example of the
adder with k=6 and v=3. This carry-skip scheme is particularly advantageous
over the alternative approach of cascading several sub-carry generators [74], which
could appreciably increase the critical path delay when k is large.

Figure 4.2: Proposed carry prediction using parallel carry-skip (k=6, v=3).

In order to obtain the (i)th carry-in $\hat{C}^{i}_{in}$, the multiplexer selects $C^{u}_{out}$, where $i-v \leq u < i$, of the nearest
preceding block u in which any propagate signal is false, as in Figure 4.2. If all the propagate signals of the
v preceding blocks are true, it chooses $C^{i-v}_{out}$. Hence, $\hat{C}^{i}_{in}$ is expressed by

$$\begin{aligned}
\hat{C}^{i}_{in} ={}& \overline{P^{i-1}_{k-1:0}}\, C^{i-1}_{out} + P^{i-1}_{k-1:0}\, \overline{P^{i-2}_{k-1:0}}\, C^{i-2}_{out} + \cdots \\
&+ \Big( \prod_{j=i-v+2}^{i-1} P^{j}_{k-1:0} \Big) \overline{P^{i-v+1}_{k-1:0}}\, C^{i-v+1}_{out} + \Big( \prod_{j=i-v+1}^{i-1} P^{j}_{k-1:0} \Big) C^{i-v}_{out}, \\
&\text{where } P^{i}_{k-1:0} = \prod_{j=0}^{k-1} p^{i}_{j}
\end{aligned} \tag{4.3}$$

In (4.3), $C^{i}_{out}$ is the carry-out signal of the (i)th block and $p^{i}_{j}$ is the propagate signal
of the (j)th bit position of the (i)th block. Additionally, the carry-out of the (i)th
block is given by

$$C^{i}_{out} = g^{i}_{k-1} + g^{i}_{k-2}\, p^{i}_{k-1} + \cdots + g^{i}_{0} \prod_{j=1}^{k-1} p^{i}_{j} \triangleq G^{i}_{k-1:0} \tag{4.4}$$

where $g^{i}_{j}$ is the generate signal at the (j)th bit position of the (i)th block.
By adopting the carry-skip scheme, the proposed adder is able to enhance the
carry prediction accuracy at the cost of multiplexer delay. Generally, a larger number
of preceding sub-carry generators can be used to further improve the accuracy of carry
prediction, at the low cost of one multiplexer delay per included generator.
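The carry speculation in (4.3) and (4.4) can be illustrated with a small behavioral model. The following is a minimal sketch rather than the dissertation's Verilog implementation: the function name `approx_add`, the restriction that n is a multiple of k, and the choice to keep the carry-out of the most significant sub-adder are assumptions made for illustration only.

```python
def approx_add(a: int, b: int, n: int = 16, k: int = 4, v: int = 2) -> int:
    """Behavioral sketch of the carry-skip approximate adder.

    Assumes n is a multiple of k. Returns an (n+1)-bit value: the k-bit
    partial sums of all sub-adders plus the carry-out of the top sub-adder.
    """
    assert n % k == 0 and v >= 2
    m, mask = n // k, (1 << k) - 1
    blocks = [((a >> (i * k)) & mask, (b >> (i * k)) & mask) for i in range(m)]
    # Sub-carry generators: group propagate P^i and carry-out C^i_out,
    # both computed with the block carry-in assumed to be 0, as in (4.4).
    P = [(x ^ y) == mask for x, y in blocks]
    C = [(x + y) >> k for x, y in blocks]

    result = 0
    for i, (x, y) in enumerate(blocks):
        c_hat = 0                      # block 0 has no carry-in
        lowest = max(i - v, 0)         # farthest generator visible to block i
        for u in range(i - 1, lowest - 1, -1):
            # (4.3): take the carry-out of the nearest block that is not
            # all-propagate; otherwise fall back to the farthest generator.
            if not P[u] or u == lowest:
                c_hat = C[u]
                break
        s = x + y + c_hat
        if i < m - 1:
            s &= mask                  # inner sub-adders drop their carry-out
        result |= s << (i * k)
    return result


if __name__ == "__main__":
    # A long propagate chain (blocks 1 and 2) fed by a carry generated in block 0
    # defeats the v=2 speculation for block 3.
    a, b = 0x0FF8, 0x0008
    print(hex(approx_add(a, b)), hex(a + b))
```

With n=16, k=4 and v=2, the inputs in the usage example produce a wrong result because the carry generated in block 0 has to travel through two all-propagate blocks, which is exactly the failure case analyzed in the next subsection.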
4.1.2 Error Rate Analysis
The carry prediction error of the proposed adder occurs when a carry propagation
chain has a length of at least kv. In other words, if all the propagate signals of
the v consecutive blocks preceding a sub-adder are true and a carry is generated in the
block below them, then the carry prediction is incorrect. Assuming that the adder inputs A
and B are bitwise independent, the propagate and generate signals are bitwise
independent as well. We denote the event that the carry-in prediction of the (i)th
sub-adder is mistaken due to a carry propagation path of a length between kv and
k(v+1)−1 by $E^{i}_{cin}$:

$$E^{i}_{cin} = P^{i-1}_{k-1:0}\, P^{i-2}_{k-1:0} \cdots P^{i-v}_{k-1:0}\, G^{i-v-1}_{k-1:0} \tag{4.5}$$

where $P^{i}_{k-1:0}$ and $G^{i}_{k-1:0}$ are defined in (4.3) and (4.4), respectively, and the probability
of the event is given by

$$\begin{aligned}
\mathbb{P}(E^{i}_{cin}) &= \mathbb{P}\big(P^{i-1}_{k-1:0}\, P^{i-2}_{k-1:0} \cdots P^{i-v}_{k-1:0}\, G^{i-v-1}_{k-1:0}\big) \\
&= \mathbb{P}(P^{i-1}_{k-1:0})\, \mathbb{P}(P^{i-2}_{k-1:0}) \cdots \mathbb{P}(P^{i-v}_{k-1:0})\, \mathbb{P}(G^{i-v-1}_{k-1:0})
\end{aligned} \tag{4.6}$$
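In (4.6), $\mathbb{P}(P^{i}_{k-1:0})$ [$= \mathbb{P}(P^{i-1}_{k-1:0}) = \cdots$] and $\mathbb{P}(G^{i}_{k-1:0})$ [$= \mathbb{P}(G^{i-1}_{k-1:0}) = \cdots$] are
given by

$$\begin{aligned}
\mathbb{P}(P^{i}_{k-1:0}) &= \mathbb{P}\Big(\prod_{j=0}^{k-1} p^{i}_{j}\Big) = \prod_{j=0}^{k-1} \mathbb{P}(p^{i}_{j}) = \frac{1}{2^{k}} \\
\mathbb{P}(G^{i}_{k-1:0}) &= \mathbb{P}\Big(g^{i}_{k-1} + g^{i}_{k-2}\, p^{i}_{k-1} + \cdots + g^{i}_{0} \prod_{j=1}^{k-1} p^{i}_{j}\Big) \\
&= \mathbb{P}(g^{i}_{k-1}) + \mathbb{P}(g^{i}_{k-2}\, p^{i}_{k-1}) + \cdots + \mathbb{P}\Big(g^{i}_{0} \prod_{j=1}^{k-1} p^{i}_{j}\Big) \\
&= \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{2} + \cdots + \frac{1}{4} \cdot \frac{1}{2^{k-1}} = \frac{1}{2}\Big(1 - \frac{1}{2^{k}}\Big)
\end{aligned} \tag{4.7}$$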
where $g^{i}_{k-1}$, $g^{i}_{k-2}\, p^{i}_{k-1}$, $\cdots$, $g^{i}_{0} \prod_{j=1}^{k-1} p^{i}_{j}$ are mutually exclusive. The proposed adder
produces an error if the error event $E^{i}_{cin}$ occurs for any of the sub-adders except
for the v+1 least significant ones. Note that the (0)th, (1)st, $\cdots$, (v)th sub-adders
always have the correct carry-in signals in our design. Thus, the overall error rate of
the proposed adder under random inputs is expressed by

$$\mathbb{P}_{err}(n, k, v) = \mathbb{P}\big(E^{\lceil n/k \rceil - 1}_{cin} + E^{\lceil n/k \rceil - 2}_{cin} + \cdots + E^{v+2}_{cin} + E^{v+1}_{cin}\big) \tag{4.8}$$
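By the inclusion-exclusion principle [5], it is given by

$$\begin{aligned}
\mathbb{P}_{err}(n, k, v) ={}& \sum_{v < i < m} \mathbb{P}(E^{i}_{cin}) - \sum_{v < i_1 < i_2 < m} \mathbb{P}(E^{i_2}_{cin} E^{i_1}_{cin}) + \sum_{v < i_1 < i_2 < i_3 < m} \mathbb{P}(E^{i_3}_{cin} E^{i_2}_{cin} E^{i_1}_{cin}) \\
&- \cdots + (-1)^{m-v}\, \mathbb{P}(E^{m-1}_{cin} E^{m-2}_{cin} \cdots E^{v+1}_{cin}), \quad \text{where } m = \lceil n/k \rceil
\end{aligned} \tag{4.9}$$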
Once $E^{i}_{cin}$ occurs, $E^{i-1}_{cin}, E^{i-2}_{cin}, \cdots, E^{i-v}_{cin}$ cannot occur. This is because, in
this case, the carry propagate chain lengths for the (i-1)th, $\cdots$, (i-v)th sub-adders
become less than kv due to $P^{i-2}_{k-1:0} = \cdots = P^{i-v}_{k-1:0} = G^{i-v-1}_{k-1:0} = 1$ and thus the carry
speculations for these sub-adders are always correct. In short, $\mathbb{P}(E^{i_r}_{cin} \cdots E^{i_1}_{cin}) = 0$ if
$\exists q : i_q - i_{q-1} \leq v$ where $v < i_1 < \cdots < i_r < \lceil n/k \rceil$. Then, we can rewrite (4.9) to yield

$$\mathbb{P}_{err}(n, k, v) = \sum_{r=1}^{m-v-1} (-1)^{r+1} \Bigg[ \sum_{\substack{v < i_1 < \cdots < i_r < m, \\ \forall q:\, i_q - i_{q-1} > v}} \mathbb{P}(E^{i_r}_{cin} \cdots E^{i_1}_{cin}) \Bigg], \quad \text{where } m = \lceil n/k \rceil \tag{4.10}$$
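$E^{i_r}_{cin}, E^{i_{r-1}}_{cin}, \cdots, E^{i_1}_{cin}$ are independent if $\forall q : i_q - i_{q-1} > v$. Therefore, by putting (4.6),
(4.7) and (4.10) together, the overall error rate for the adder under random inputs is

$$\begin{aligned}
\mathbb{P}_{err}(n, k, v) &= \sum_{r=1}^{m-v-1} (-1)^{r+1} \Bigg[ \sum_{\substack{v < i_1 < \cdots < i_r < m, \\ \forall q:\, i_q - i_{q-1} > v}} \mathbb{P}(E^{i_r}_{cin}) \cdots \mathbb{P}(E^{i_1}_{cin}) \Bigg] \\
&= \sum_{r=1}^{m-v-1} (-1)^{r+1} \Bigg[ \sum_{\substack{v < i_1 < \cdots < i_r < m, \\ \forall q:\, i_q - i_{q-1} > v}} \bigg( \frac{1}{2^{kv+1}} \Big(1 - \frac{1}{2^{k}}\Big) \bigg)^{r} \Bigg], \quad \text{where } m = \lceil n/k \rceil
\end{aligned} \tag{4.11}$$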
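As a numerical check of (4.11), the sketch below evaluates the error rate for a given (n, k, v). It is an illustrative helper, not part of the original design flow. The function name `adder_error_rate` is hypothetical, and the inner sum is replaced by a standard counting identity (an assumption introduced here): the number of ways to pick r indices out of N = m−v−1 candidates with all gaps larger than v is C(N − (r−1)v, r).

```python
from math import ceil, comb


def adder_error_rate(n: int, k: int, v: int) -> float:
    """Evaluate (4.11) for an n-bit adder with k-bit blocks and v generators."""
    m = ceil(n / k)
    # Probability of a single error event, from (4.6) and (4.7).
    q = (1.0 - 2.0 ** -k) / 2 ** (k * v + 1)
    candidates = m - v - 1            # sub-adders v+1 .. m-1 can mispredict
    total = 0.0
    for r in range(1, candidates + 1):
        # Number of index sets i_1 < ... < i_r whose gaps all exceed v.
        slots = candidates - (r - 1) * v
        if slots < r:
            break                     # no valid index set of this size or larger
        total += (-1) ** (r + 1) * comb(slots, r) * q ** r
    return total


if __name__ == "__main__":
    for params in [(16, 2, 2), (16, 4, 2), (64, 4, 2), (128, 4, 2)]:
        print(params, f"{100 * adder_error_rate(*params):.2f}%")
```

For (16, 4, 2) only one sub-adder can mispredict, so the result reduces to the single-event probability $\frac{1}{2^{9}}(1 - \frac{1}{2^{4}})$, or about 0.18%.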
4.1.3 Error Magnitude Reduction Scheme
In addition to error rate, another important metric to evaluate approximate
adders is error significance, which should be minimized and is defined by the ratio
of the error magnitude to the correct summation result as follows [53]

$$\text{error significance} = \left| \frac{S_{apx} - S_{cor}}{S_{cor}} \right| \tag{4.12}$$

where $S_{apx}$ and $S_{cor}$ are the approximate and correct outputs for given inputs.
Figure 4.3 depicts the block diagram of the proposed error magnitude reduction
and one example of its operation in the case of k=8 and v=2. In the example, since
all the propagate signals of the (i)th and (i-1)th blocks are true, the carry-in for
the (i+1)th sub-adder is speculated to be "0" although the correct one is "1" due to
$C^{i-2}_{out} = 1$. Then, the error significance is $\frac{1}{2^{7}}$. Note that it could reach $\frac{1}{2}$ for the
worst case inputs of $A^{i+1}_{7:0} = 00000001$ and $B^{i+1}_{7:0} = 00000000$. To reduce the amount
of error, the proposed adder forces all the output bits of the (i)th and the (i-1)th
sub-adders to "1" when $P^{i}_{k-1:0} = P^{i-1}_{k-1:0} = 1$. The reduction can be implemented by
ORing each partial summation (i.e. $S^{i}_{apx,k-1:0}$ and $S^{i-1}_{apx,k-1:0}$) and the product of the
propagate signals (i.e. $P^{i}_{k-1:0} P^{i-1}_{k-1:0}$). It allows the error significance to be reduced by
a factor of $2^{2k}$. As a result of the reduction, the adder finally produces the error reduced output
of $S^{i}_{emr,k-1:0} = S^{i-1}_{emr,k-1:0} = 11111111$ and the error significance decreases from $\frac{1}{2^{7}}$ to
$\frac{1}{2^{23}}$. It is worth mentioning that the error magnitude reduction always produces
exactly correct results when $C^{i-2}_{out} = 0$. Consequently, the worst case error magnitude is
reduced from $2^{n-k}$ to $2^{n-k(v+1)}$ through the use of error magnitude reduction.

Figure 4.3: Block diagram of the error magnitude reduction and an example of its
operation (k=8, v=2).
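The factor of $2^{2k}$ can be made explicit with a short worked derivation. This is a sketch under two stated assumptions that are not spelled out in the text: block j occupies bit positions jk to (j+1)k−1, and the (i)th and (i-1)th sub-adders themselves receive correct carry-ins, so that their uncorrected outputs are all zeros (all-propagate inputs plus an incoming carry of 1).

```latex
% v = 2: blocks i and i-1 are all-propagate, the true carry into block i-1 is 1,
% and the speculated carry into block i+1 is 0.
\begin{align*}
S_{cor} - S_{apx} &= 2^{(i+1)k}
  && \text{(missing carry into block } i+1\text{)} \\
S_{emr} &= S_{apx} + \left(2^{2k} - 1\right) 2^{(i-1)k}
  && \text{(blocks } i,\ i-1 \text{ forced from all 0s to all 1s)} \\
S_{cor} - S_{emr} &= 2^{(i+1)k} - \left(2^{2k} - 1\right) 2^{(i-1)k} = 2^{(i-1)k}
\end{align*}
% The error magnitude shrinks by 2^{(i+1)k} / 2^{(i-1)k} = 2^{2k}; for k = 8 this is
% 2^{16}, consistent with the drop from 1/2^7 to 1/2^{23}. Repeating the argument with
% v forced blocks and i+1 = n/k - 1 gives the worst case 2^{n-k} -> 2^{n-k(v+1)}.
```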
4.2 Proposed Approximate Comparator
4.2.1 Approximate Comparator Architecture
Basically, a comparator determines the larger of two inputs A and B, and can
be implemented by using a subtraction. After subtracting the two inputs (A − B), a
comparison is readily done by checking the sign bit (i.e. MSB) of the result. In
short, A < B when the MSB = 1, otherwise A ≥ B. Note that subtraction is
achieved by addition of the 2's complement (i.e. $A - B = A + \overline{B} + 1$) and requires an
additional inverting operation for one of the two inputs. However, the use of traditional
adders such as an RCA may incur timing and energy overheads since the MSB of the
addition would be needed to produce the final comparator output. Again, due to the
fact that the targeted neuromorphic computing applications have built-in resilience
to arithmetic errors as shown in the later part of the section, we exploit the same
idea of the parallel carry-skip scheme to improve the timing and energy efficiency of
the comparator.
Figure 4.4: Block diagram of the proposed approximate comparator.

Figure 4.4 illustrates the block diagram of the proposed approximate comparator.
The n-bit comparator consists of a 1-bit full adder and v (≥2) k-bit sub-carry
generators that are identical to the ones in the proposed adder. In Figure 4.4,
the k-bit inputs of the (i)th sub-carry generator $A^{i}_{k-1:0}$ and $B^{i}_{k-1:0}$ correspond to
$A_{n-k(v-i-1)-2\,:\,n-k(v-i)-1}$ and $B_{n-k(v-i-1)-2\,:\,n-k(v-i)-1}$, respectively. It is worth noting
that the proposed approximate comparator exploits only the kv+1 MSBs of the n-bit
inputs, resulting in area and power reductions. Importantly, the input B is inverted
to achieve the subtraction operation. Since implementing 2's complement necessitates
an additional incrementor, we employ 1's complement to further reduce area and
energy at the cost of a slightly higher, but still very low, error rate. The
full adder generates the sign bit $S_{apx,n-1}$ (MSB output) of the subtraction between
the two inputs by leveraging the speculated carry-in signal $\hat{C}_{in,n-1}$ and the MSBs of
the two inputs. The speculated carry-in signal is obtained in the same parallel way
by the v sub-carry generators according to (4.3) and (4.4). When the two inputs
have different signs (i.e. $A_{n-1} \oplus B_{n-1} = 1$), the comparison result is readily
obtained from the input MSB without the full adder. Therefore, the output multiplexer
selects the MSB of the input $A_{n-1}$ if the signs of the two inputs differ;
otherwise, it chooses the full adder output $S_{apx,n-1}$. The comparator
output $O_{apx}$ is thus expressed by

$$O_{apx} = (A_{n-1} \oplus B_{n-1})\, A_{n-1} + \overline{(A_{n-1} \oplus B_{n-1})}\, S_{apx,n-1}, \quad \text{where } S_{apx,n-1} = A_{n-1} + \overline{B}_{n-1} + \hat{C}_{in,n-1} \tag{4.13}$$
Then, the critical path delay of the proposed approximate comparator $t_{apx,cmp}$ with
the multilevel tree based multiplexer is derived to be

$$t_{apx,cmp} = t_{fa} + (\lceil \log_2 v \rceil + 1)\, t_{mux} + t_{scg} \tag{4.14}$$

where $t_{fa}$, $t_{mux}$ and $t_{scg}$ are the delays of the full adder, the two-input multiplexer,
and the sub-carry generator, respectively. In the proposed comparator, the 1-bit full
adder delay is negligible and the multiplexer delay is also negligible if k is large.
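The comparator datapath described above can be sketched as a behavioral model. This is an illustration under assumptions, not the synthesized design: the function name `approx_less_than` is hypothetical, inputs are n-bit two's complement numbers passed as unsigned bit patterns, and the output convention (1 means A < B) follows the sign-bit rule stated earlier in this section.

```python
def approx_less_than(a: int, b: int, n: int = 16, k: int = 4, v: int = 2) -> int:
    """Return 1 if the approximate comparator decides A < B, else 0."""
    assert n >= k * v + 1 and v >= 2
    mask = (1 << k) - 1
    nb = ~b & ((1 << n) - 1)              # 1's complement of B (no +1 incrementor)
    a_msb, b_msb = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if a_msb ^ b_msb:
        return a_msb                      # different signs: A < B iff A is negative

    # Speculate the carry into the MSB from the kv bits just below it, as in (4.3).
    c_hat = 0
    for i in range(v - 1, -1, -1):        # generator v-1 is the one nearest the MSB
        lo = n - k * (v - i) - 1          # lowest bit position read by generator i
        x, y = (a >> lo) & mask, (nb >> lo) & mask
        if (x ^ y) != mask or i == 0:     # not all-propagate, or farthest generator
            c_hat = (x + y) >> k          # carry-out with carry-in assumed 0, (4.4)
            break
    # Sign bit of A + ~B using the speculated carry, as in (4.13).
    return a_msb ^ (b_msb ^ 1) ^ c_hat


if __name__ == "__main__":
    print(approx_less_than(0x0003, 0x0100))   # 3 < 256
    print(approx_less_than(0x8000, 0x0001))   # most negative value < 1
```

Only the kv+1 most significant bits of each input are read, so the same function body serves any n for fixed k and v, which mirrors the reuse argument made for the hardware.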
4.2.2 Error Rate Analysis
The proposed comparator fails when the signs of the inputs are the same
and the carry prediction for the full adder is incorrect, since the full adder output
is selected only in that case. Figure 4.5
illustrates an example of the proposed comparator configuration with n=16, k=4,
v=2 to explain the carry prediction error for the full adder. The input B is
inverted, and the configuration includes an unused sub-carry generator block that has the 7 LSBs as
its inputs and whose carry-in signal $C_{in,0}$ is always "1" due to the 2's complement (i.e.
$-B = \overline{B} + 1$). Note that this block is never used for the comparisons in our design.
The carry speculation is incorrect when all the propagate signals in the v (in this
case two) sub-carry generators used for carry prediction are true and $C^{un}_{out} = 1$. The
latter condition is true if a carry is generated by the unused block or $C_{in,0}$ (= 1) is
propagated through the unused block. It is important to note that we should consider
$C_{in,0}$ propagation since the proposed comparator adopts 1's complement, instead of
2's complement, for the subtraction. We assume that the comparator inputs A and
B are bitwise independent. Then, the probability of the carry prediction error for
the full adder is given by

Figure 4.5: Example of the comparator configuration (n=16, k=4, v=2).

$$\begin{aligned}
\mathbb{P}(E_{cin,n-1}) &= \mathbb{P}\big(P^{v-1}_{k-1:0}\, P^{v-2}_{k-1:0} \cdots P^{0}_{k-1:0}\, (G_{n-kv-2:0} + P_{n-kv-2:0})\big) \\
&= \mathbb{P}(P^{v-1}_{k-1:0})\, \mathbb{P}(P^{v-2}_{k-1:0}) \cdots \mathbb{P}(P^{0}_{k-1:0}) \times \underbrace{\big(\mathbb{P}(G_{n-kv-2:0}) + \mathbb{P}(P_{n-kv-2:0})\big)}_{\mathbb{P}(C^{un}_{out} = 1)}
\end{aligned} \tag{4.15}$$
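where $G_{n-kv-2:0}$ and $P_{n-kv-2:0}$ are mutually exclusive. They are given by

$$\begin{aligned}
\mathbb{P}(G_{n-kv-2:0}) &= \mathbb{P}\Big(g_{n-kv-2} + g_{n-kv-3}\, p_{n-kv-2} + \cdots + g_{0} \prod_{i=1}^{n-kv-2} p_{i}\Big) \\
&= \mathbb{P}(g_{n-kv-2}) + \cdots + \mathbb{P}\Big(g_{0} \prod_{i=1}^{n-kv-2} p_{i}\Big) \\
&= \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{2} + \cdots + \frac{1}{4} \cdot \frac{1}{2^{n-kv-2}} = \frac{1}{2}\Big(1 - \frac{1}{2^{n-kv-1}}\Big) \\
\mathbb{P}(P_{n-kv-2:0}) &= \mathbb{P}\Big(\prod_{i=0}^{n-kv-2} p_{i}\Big) = \prod_{i=0}^{n-kv-2} \mathbb{P}(p_{i}) = \frac{1}{2^{n-kv-1}}
\end{aligned} \tag{4.16}$$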
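where $g_{n-kv-2}$, $g_{n-kv-3}\, p_{n-kv-2}$, $\cdots$, $g_{0} \prod_{i=1}^{n-kv-2} p_{i}$ are mutually exclusive. Thus, by
putting (4.7), (4.15) and (4.16) together, the overall comparator error rate by the
carry-skip scheme under random inputs is

$$\begin{aligned}
\mathbb{P}_{err,cmp}(n, k, v) &= \mathbb{P}\big(\overline{A_{n-1} \oplus B_{n-1}}\big)\, \mathbb{P}(E_{cin,n-1}) \\
&= \frac{1}{2} \cdot \frac{1}{2^{kv}} \cdot \bigg( \frac{1}{2}\Big(1 - \frac{1}{2^{n-kv-1}}\Big) + \frac{1}{2^{n-kv-1}} \bigg) \\
&= \frac{1}{2^{kv+2}} \Big(1 + \frac{1}{2^{n-kv-1}}\Big)
\end{aligned} \tag{4.17}$$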
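The closed form in (4.17) is simple enough to evaluate directly. The sketch below (the function name `comparator_error_rate` is hypothetical) builds the result from the individual factors of (4.15) and (4.16) so that each term can be inspected; it also shows why the error rate barely depends on n.

```python
def comparator_error_rate(n: int, k: int, v: int) -> float:
    """Evaluate (4.17) from the factors in (4.15) and (4.16)."""
    low_bits = n - k * v - 1                 # width of the unused LSB block
    p_sign = 0.5                             # MSB-related factor of 1/2 in (4.17)
    p_all_propagate = 2.0 ** -(k * v)        # all kv speculated bits propagate
    p_generate = 0.5 * (1.0 - 2.0 ** -low_bits)   # unused block generates a carry
    p_propagate = 2.0 ** -low_bits                # unused block propagates C_in,0
    return p_sign * p_all_propagate * (p_generate + p_propagate)


if __name__ == "__main__":
    for n in (16, 32, 64, 128):
        print(n, f"{100 * comparator_error_rate(n, 4, 2):.4f}%")
```

For k=4 and v=2 the printed values differ only in the term $\frac{1}{2^{n-kv-1}}$, which vanishes quickly as n grows.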
4.3 Simulation Results
The proposed approximate arithmetic units were designed in Verilog HDL and
synthesized with a commercial 90 nm CMOS technology and standard cell library.
Also, the gate-level netlists were translated into transistor-level to perform HSPICE
simulations. Each sub-adder in the proposed adder was implemented using an RCA
structure.
4.3.1 Error Rate of the Proposed Approximate Adder
First, we examine the error rates of the proposed adder with various values of n, k
and v. Figure 4.6 exhibits the error rates of the proposed adder under random inputs.
The error rate worsens as the input bit-width n increases for given k and v, whereas
it improves as k increases for fixed n and v. The error rate is significantly reduced
when v increases under the same n and k. Specifically, by leveraging one more
sub-carry generator for carry speculation (i.e. v=3), the error rate of the 128-bit adder with k=4
and v=2 is decreased from 5.19% to 0.32%, representing a 16.23× improvement.
In the case of n=16, k=4 and v=2, compared to the previously presented
approximate adders [74, 14, 32], the proposed adder is able to considerably reduce
the error rate from 5.86% to 0.18% for random input patterns. Hence, the proposed
carry prediction technique is very appealing for both wide and narrow bit-width
approximate additions requiring very low error rates.

Figure 4.6: Error rates of the proposed adder under different n, k and v.
(Error rate on a logarithmic axis from 10^-11 to 10^0 versus block width k from 2 to 8, with curves for n = 16, 32, 64, 128 and v = 2, 3, 4.)
4.3.2 Performance of the Proposed Approximate Adder
Table 4.1 reports the implementation results of area, delay, power, and error rate
under the nominal supply of 1.2 V. Note that the error magnitude reduction circuit
is included in all the proposed design implementations. The delay increases as k
increases with a fixed n and v while the error rate and the power decrease. With a
lower k for a given n and v, more carry prediction blocks are needed and the carry
prediction with a smaller number of LSBs causes more errors. Meanwhile, under a
given k and v, while the delay remains almost the same as n increases, the error rate
deteriorates slowly. The proposed 64-bit adder with k=4 and v=2 has an error rate
that is more than 2× lower than the 5.86% error rate of other 16-bit approximate adders
with the same k [74, 14, 32]. Furthermore, when the adder exploits two more
sub-carry generators for the carry prediction (i.e. v=4), it achieves a 295× reduction of
the error rate at the expense of merely 6.6%, 10.7% and 2.2% extra area, delay and
power, respectively.

Table 4.1: Proposed adder with different n, k and v.

Parameters     Area     Delay   Power   Energy   Error Rate
(n, k, v)      (µm²)    (ps)    (mW)    (pJ)     (%)
(16, 2, 2)     525      202     0.902   0.182    11.55
(16, 3, 2)     533      297     0.708   0.210    2.05
(16, 4, 2)     466      359     0.600   0.215    0.18
(16, 5, 2)     509      431     0.591   0.255    0.05
(16, 2, 3)     550      221     0.917   0.203    2.34
(16, 3, 3)     543      324     0.744   0.241    0.17
(16, 2, 4)     557      232     0.897   0.208    0.44
(16, 3, 4)     547      332     0.863   0.287    0.01
(32, 4, 2)     1147     345     1.682   0.581    0.91
(64, 4, 2)     2389     344     3.039   1.046    2.36
(128, 4, 2)    4873     339     5.827   1.976    5.19
(32, 4, 4)     1185     363     1.884   0.684    0.002
(64, 4, 4)     2846     381     3.105   1.183    0.008
(128, 4, 4)    5250     370     6.009   2.222    0.019
4.3.3 Comparison with Seven Other Approximate Adders
We also implemented two traditional 16-bit (i.e. n=16) accurate adders (RCA
and CLA) and seven previously presented approximate adders, namely Lu's Adder
(LUA) [39], LOA [40], ETAI [76], ETAII [74], VLCSA-1 [14], ACA [32] and the Dither
Approximate Adder (DAA) [47], in the same commercial 90 nm CMOS technology,
so as to compare them with the proposed adder in various aspects. The VLCSA-1 and
ACA have their own error detection and correction (EDC) mechanisms. Invoking
these modules, however, requires additional clock cycles, leading to timing
overhead and potential architectural design complications in facilitating a
given processing application. To examine the approximate nature of the different
adder designs fairly, we exclude the EDC modules and their timing, power and
area overheads from this comparison. The same RCA structure is used for the
sub-adders in ETAII, VLCSA-1 and ACA, and k=4 is adopted in
these adders as well as in LUA. The proposed adder employs the same k with v=2 and
the RCA structure for the sub-adders. Moreover, we split LOA, ETAI and DAA so that
the accurate and inaccurate parts are both 8 bits wide, and the RCA structure
is used for the accurate parts. For the dithering in DAA, we utilize the MSB of one
input of the inaccurate part (i.e. A7) as the dithering bit in order to alleviate the
overhead due to the external dither control presented in [47]. Also, we denote
these three adders by LOA (N-M), ETAI (N-M) and DAA (N-M), where N and M
indicate the bit widths of the accurate and inaccurate parts, respectively.

Table 4.2: Comparison with other 16-bit adders.

Design        Area       Delay     Power     Energy    Error Rate   Avg. Error    EDP       EDAP          EDEP
              (µm²)      (ps)      (mW)      (pJ)      (%)          Magnitude     (pJ·ps)   (pJ·ps·µm²)   (pJ·ps·%)
RCA           334        856       0.343     0.294     N/A          N/A           251       83936         N/A
              (0.72×)¹   (2.38×)   (0.57×)   (1.37×)                              (3.26×)   (2.33×)
CLA           514        407       0.922     0.375     N/A          N/A           155       81613         N/A
              (1.10×)    (1.13×)   (1.54×)   (1.74×)                              (2.01×)   (2.27×)
LUA           609        234       0.908     0.212     16.68        1363.7        50        30210         827
              (1.31×)    (0.65×)   (1.51×)   (0.99×)   (92.67×)     (181.83×)     (0.65×)   (0.84×)       (59.07×)
LOA (8-8)     200        450       0.420     0.189     43.75        111.3         85        16976         3719
              (0.43×)    (1.25×)   (0.70×)   (0.88×)   (243.06×)    (14.84×)      (1.10×)   (0.47×)       (265.64×)
ETAI (8-8)    234        435       0.470     0.204     90.00        178.3         89        20711         7980
              (0.50×)    (1.21×)   (0.78×)   (0.95×)   (500.00×)    (23.77×)      (1.16×)   (0.58×)       (570.00×)
ETAII         374        254       0.564     0.143     5.86         127.5         36        13583         213
              (0.80×)    (0.71×)   (0.94×)   (0.67×)   (32.56×)     (17.00×)      (0.47×)   (0.38×)       (15.21×)
VLCSA-1²      673        277       1.337     0.370     5.86         127.5         103       69159         602
              (1.44×)    (0.77×)   (2.23×)   (1.72×)   (32.56×)     (17.00×)      (1.34×)   (1.92×)       (43.00×)
ACA²          472        374       0.666     0.249     5.86         127.5         93        44010         546
              (1.01×)    (1.04×)   (1.11×)   (1.16×)   (32.56×)     (17.00×)      (1.21×)   (1.22×)       (39.00×)
DAA (8-8)     370        435       0.566     0.246     25.00        74.7          107       39509         2671
              (0.79×)    (1.21×)   (0.94×)   (1.14×)   (138.89×)    (9.96×)       (1.39×)   (1.10×)       (190.79×)
Proposed³     466        359       0.600     0.215     0.18         7.5/14.1⁴     77        35959         14

¹ (*) normalization against the proposed adder
² without the error detection and correction
³ with the error magnitude reduction
⁴ without the error magnitude reduction
The energy-delay product (EDP) is widely adopted to show the energy
efficiency of circuits and systems. In addition to EDP, we take into account the
energy-delay-area product (EDAP) [37, 38] to compare both the energy and the
silicon area cost of the adders. Importantly, the energy-delay-error product (EDEP)
is a particularly interesting metric in approximate arithmetic designs because it
considers not only energy and delay but also error rate [49]. Thus, the adders are
compared in terms of EDP, EDAP and EDEP, as well as the fundamental metrics,
such as area, delay, power, energy, error rate and average error magnitude. Table 4.2
summarizes the performance comparison under the regular supply of 1.2 V. The RCA
exhibits the lowest power with the longest delay due to the bit-by-bit carry propagate
chain and the CLA consumes the largest energy. While the RCA is more energy and
area efficient than the CLA, its over 2× longer delay degrades its EDP and EDAP. The LUA
is the fastest but occupies the second largest area due to the considerable number
of carry generators. Additionally, the carry prediction approach in LUA allows
errors to occur in more significant bits, which leads to the highest average
error magnitude among the approximate adders. The error rate of ETAI (8-8) reaches
90%, which may limit its practical use, due to the lack of carry prediction for the accurate
part (i.e. the carry is fixed to zero), and this worsens EDEP remarkably. On the other
hand, thanks to the simple carry speculation scheme of LOA (8-8), which is achieved
by ANDing the two MSBs of each operand of the inaccurate part, the error rate is
improved to 43.75%, which is still fairly large. The use of the simple OR operation
for the inaccurate part allows it to be the most area efficient adder. The
dithering control for the inaccurate part in DAA (8-8) further improves the error rate
as well as the average error magnitude at the cost of area and power overheads. The high
error rates in ETAI (8-8), LOA (8-8) and DAA (8-8) exacerbate the EDEP metric.
Fortunately, these adders have fairly low average error magnitudes despite the high
error rates since approximation errors are concentrated in the less significant bits (i.e.
the inaccurate parts). Thanks to the dithering, the DAA (8-8) demonstrates the best
performance in terms of the error rate, average error magnitude and EDEP among
these three adders. Among the adders having the same error rate of 5.86%, the
ETAII is the most efficient in terms of all the metrics. As a result of the use of carry
selection in VLCSA-1, it dissipates the highest power, which is up to 3.9× more than
the others and incurs EDP, EDAP and EDEP degradations. The proposed adder is
2.4× faster and 3.3× more EDP efficient than the RCA. The carry-skip scheme allows it to
have the lowest error rate of 0.18% and the lowest EDEP of 14 among the approximate adders.
Furthermore, the proposed error magnitude reduction approach improves the average
error magnitude by 1.88× from 14.1 to 7.5, which is also the lowest value among the
approximate adders. Our design is comparable to ACA with respect to area, delay,
power and energy while having a much lower error rate, average error magnitude and
EDEP thanks to carry-skip.
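The three composite metrics are plain products of the basic figures, which the following sketch makes explicit. The helper name `figures_of_merit` is hypothetical, and the inputs are the rounded area, delay, energy and error rate values reported for the proposed adder and ETAII in Table 4.2, so the products can differ slightly from the tabulated entries, which were presumably computed from unrounded data.

```python
def figures_of_merit(area_um2: float, delay_ps: float,
                     energy_pj: float, error_rate_pct: float) -> dict:
    """Compute EDP, EDAP and EDEP from the basic metrics."""
    edp = energy_pj * delay_ps                  # pJ·ps
    return {
        "EDP": edp,
        "EDAP": edp * area_um2,                 # pJ·ps·µm²
        "EDEP": edp * error_rate_pct,           # pJ·ps·%
    }


if __name__ == "__main__":
    proposed = figures_of_merit(466, 359, 0.215, 0.18)
    etaii = figures_of_merit(374, 254, 0.143, 5.86)
    for name, fom in (("Proposed", proposed), ("ETAII", etaii)):
        print(name, {key: round(value, 1) for key, value in fom.items()})
```

The comparison shows how EDEP separates two designs with similar EDP: ETAII has the lower EDP, but its error rate of 5.86% makes its EDEP roughly 15× that of the proposed adder.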
Figure 4.7: Energy comparison under supply scaling.
(Energy in pJ versus supply voltage from 0.4 V to 1.2 V for RCA, CLA, LUA, LOA (8-8), ETAI (8-8), ETAII, VLCSA-1, ACA, DAA (8-8) and the proposed adder.)
Figure 4.7 plots the energy comparison under scaled voltages. The energy
efficiency of VLCSA-1 is hindered by high power dissipation in spite of its relatively
fast speed. It consumes almost the same amount of energy as the CLA, which consumes the
most energy, whereas the ETAII consumes the least among the adders. The good
tradeoff between delay and power obtained by the carry-skip approach allows our
adder to be more energy efficient than the two accurate adders, VLCSA-1, ACA
and DAA (8-8). In particular, our design attains energy savings of 27% and 43%
compared to RCA and CLA, respectively. Besides, the proposed adder, LUA, LOA
(8-8) and ETAI (8-8) show similar energy consumption under the scaled voltages
while our design enjoys the lowest error rate.
4.3.4 Comparison on Error-Free Operations
The main objective of this work is to develop an energy efficient approximate
adder with a low error rate for neuromorphic applications. For completeness, we
also compare the error-free operations of various designs. We consider the EDC
schemes for VLCSA-1, ACA and the proposed adder. In each of these designs, the
error detection is achieved by checking the propagate and generate signals in the
approximate addition phase. Upon detecting an error, the error correction circuit
reconstructs accurate results by leveraging propagate and generate signals [68, 14]
or by adding "1" to the sub-adder output [32], either of which requires an additional
clock cycle. The VLCSA-1 and ACA exploit the prefix adder and incrementor,
respectively, for error correction. For the proposed adder, the prefix adder based
error correction circuit is implemented to produce error-free results [14]. Note that
the error magnitude reduction of our adder is not necessary here and is thus removed
in this implementation. The error correction circuit is activated whenever
errors are detected in the addition phase [32], and the effective energy $E_{eff}$ can be
expressed by

$$E_{eff} = P_{apx}\, t_{apx} + \mathbb{P}_{err} \cdot P_{ec}\, t_{ec} \tag{4.18}$$

where $P_{apx}$ and $t_{apx}$ are the power and delay of the approximate adder, $P_{ec}$ and $t_{ec}$
are those of the error correction circuit, and $\mathbb{P}_{err}$ is the error rate of
the approximate adder. The implementations are summarized in Table 4.3. The
error correction circuits have shorter delays than the respective approximate adders.
The critical path delays of VLCSA-1 and ACA are slightly longer than the delays
in Table 4.2 due to the additional error detection logic. Conversely, the proposed
adder's critical path delay becomes shorter in spite of the error detection circuit since
the error magnitude reduction block is eliminated. Our design occupies the lowest
area and is the most efficient adder in terms of power and effective energy.
Table 4.3: Approximate adders with error detection and correction.

Design      Area       Delay¹    Power     Energy²
            (µm²)      (ps)      (mW)      (pJ)
VLCSA-1     899        296       2.094     0.473
            (1.72×)³   (0.93×)   (2.38×)   (2.43×)
ACA         580        389       1.269     0.287
            (1.11×)    (1.22×)   (1.44×)   (1.47×)
Proposed    524        318       0.881     0.195

¹ critical path delay
² effective energy
³ (*) normalization against the proposed adder
4.3.5 Error Rate of the Proposed Approximate Comparator
Figure 4.8 shows the error rates of the proposed approximate comparator with
various n, k and v under random inputs. Unlike the proposed adder, the comparator
error rate does not deteriorate and remains almost the same as n increases
under fixed k and v. This is because, for given k and v, $\big(1 + \frac{1}{2^{n-kv-1}}\big) \approx 1$ when
n is fairly large and the overall error rate is, therefore, dominated by $\frac{1}{2^{kv+2}}$ according
to (4.17). Also, the error rates are even better than those of the proposed adders with
the same n, k and v (e.g. 1.56% vs 11.55% for the 16-bit comparator and adder
with k=2 and v=2). The main reason is that the proposed comparator requires
only one correct carry prediction, for the 1-bit full adder at the MSB, to attain the
correct comparison result, whereas the proposed adder requires the carry-in signals
for all the sub-adders to be correct to achieve the correct summation. Consequently,
it is equally attractive for very low error rate approximate comparisons of both wide
and narrow bit-width inputs.

Figure 4.8: Error rates of the proposed comparator with various n, k and v.
(Error rate on a logarithmic axis from 10^-11 to 10^0 versus block width k from 2 to 8, with curves for n = 16, 32, 64, 128 and v = 2, 3, 4.)
4.3.6 Performance of the Proposed Approximate Comparator
Table 4.4 lists the comparator implementations with various n, k and v under
the supply voltage of 1.2 V. The proposed comparators exhibit better performance
than the proposed adders under the same n, k and v in all aspects
since they consider only kv+1 bits of the n-bit inputs for comparisons and require a smaller
number of sub-carry generators than the adders, leading to smaller area, delay
and power. All the 16-bit comparators exhibit very low error rates (< 1.6%) with
good area and energy efficiencies and hence can be applied to low-cost comparisons
in error-tolerant applications with a desired accuracy. Interestingly, the proposed
comparators are identical designs regardless of the input bit-width n as long as
k and v are the same, in that they consist of v k-bit sub-carry generators in spite of
different values of n. The only difference is the error rate, and the error
rate differences among the comparators in Table 4.4 are negligibly small, as mentioned
in Section 4.3.5. Hence, the same design can be reused without
any change for comparisons of various bit-widths with the same k and v while enjoying
almost the same extremely low error rates.

Table 4.4: Proposed comparator with different n, k and v.

Parameters           Area     Delay   Power   Energy   Error Rate
(n, k, v)            (µm²)    (ps)    (mW)    (pJ)     (%)
(16, 2, 2)           61       152     0.089   0.014    1.563
(16, 4, 2)           121      215     0.111   0.024    0.098
(16, 2, 3)           101      194     0.132   0.026    0.391
(16, 4, 3)           193      233     0.155   0.036    0.007
(16, 2, 4)           140      216     0.163   0.035    0.098
(32/64/128, 4, 2)    121      215     0.111   0.024    0.098¹
(32/64/128, 4, 4)    288      257     0.198   0.051    3.81e-4¹

¹ error rate differences among 32-/64-/128-bit comparators are < 10^-6 %
To the best of the authors' knowledge, no approximate comparator has been
presented to date. We therefore compare the proposed 16-bit comparator with k=4, v=2
to two accurate comparators, which are ripple carry and carry lookahead based
comparators (RCC and CLC, respectively). They are also implemented in the
same 90 nm CMOS process and the results are summarized in Table 4.5. The accurate
comparators have less area, delay, power and energy than their corresponding adders
in Table 4.2 because they require only the carry for the MSB and no logic to produce
summation outputs. Also, they show much better EDP and EDAP performance
than the adders. The proposed comparator demonstrates the best performance in
all aspects except that it consumes more power than RCC. It is up to 18× more
efficient than the other designs in terms of EDP and EDAP with an extremely low
error rate (< 0.1%), which makes it well suited for error-tolerant applications.
Table 4.5: Comparison with other 16-bit comparators.

Design      Area       Delay     Power     Energy    Error Rate   EDP        EDAP          EDEP
            (µm²)      (ps)      (mW)      (pJ)      (%)          (pJ·ps)    (pJ·ps·µm²)   (pJ·ps·%)
RCC         146        834       0.090     0.074     N/A          62         9086          N/A
            (1.21×)¹   (3.88×)   (0.81×)   (3.08×)                (12.40×)   (14.65×)
CLC         423        323       0.258     0.083     N/A          27         11342         N/A
            (3.41×)    (1.69×)   (2.90×)   (4.88×)                (5.40×)    (18.29×)
Proposed    121        215       0.111     0.024     0.098        5          620           0.50

¹ (*) normalization against the proposed comparator
4.4 Summary
In this section, novel approximate adder and comparator designs that considerably
reduce energy consumption with a very moderate error rate have been presented for
energy efficient neuromorphic VLSI systems. The proposed carry prediction with the
carry-skip scheme significantly improves the overall error rate and the critical path
delay. Additionally, the error magnitude reduction technique for the adder reduces
the amount of error further at low cost. Implemented in a commercial 90 nm
CMOS process, the proposed adder is 2.4× faster and 43% more energy efficient than
traditional adders with an error rate of merely 0.18%. Furthermore, the proposed
approximate comparator exhibits an extremely low error rate of 0.098% and achieves
an energy reduction of up to 4.9× over the conventional ones.
5. APPLICATION OF APPROXIMATE ARITHMETIC TO NEUROMORPHIC
COMPUTING
5.1 Evaluation Environment
To evaluate our approximate arithmetic units designed in Section 4, we consider
the general digital neuromorphic VLSI system described in Figure 2.13. Additionally,
we adopt the digital LIF model for the silicon neurons since, among various neuron
models, the LIF model is suitable for digital implementation with a few arithmetic
components. This model is widely used in digital neuromorphic chips [58, 36] and its
dynamics can be described by

$$\begin{aligned}
V^{t+1}_{i} &= V^{t}_{i} + K_{syn} \sum_{j=1}^{M} w_{ji}\, S^{t}_{j} + K_{ext}\, E^{t}_{i} - V_{leak} \\
S^{t+1}_{i} &= \begin{cases} 1 & \text{if } V^{t+1}_{i} > V_{th} \\ 0 & \text{otherwise} \end{cases}
\end{aligned} \tag{5.1}$$

where $V^{t}_{i}$ is the membrane potential of neuron i at time t, $S^{t}_{i}$ is the spike bit that
indicates whether neuron i fired at time t and is set to "1" when the membrane
potential exceeds the given threshold voltage $V_{th}$, $w_{ji}$ is the synaptic weight between
neurons j and i, $E^{t}_{i}$ is the spike bit for the external input for neuron i, M is the
number of pre-synaptic neurons, $V_{leak}$ is the leaky potential, and $K_{syn}$ and $K_{ext}$
are the weight parameters for synapses and external input spikes, respectively. In
(5.1), additions and comparisons are the key operations to calculate the
membrane potentials and determine the firing activities, respectively.
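One update step of (5.1) can be written in a few lines. The following is a minimal integer sketch: the function name `lif_step` and the example parameter values are hypothetical, exact Python arithmetic stands in for the hardware adder and comparator, and no membrane reset is modeled because (5.1) does not specify one.

```python
from typing import Sequence, Tuple


def lif_step(v_mem: int, weights: Sequence[int], spikes_in: Sequence[int],
             ext_spike: int, k_syn: int, k_ext: int,
             v_leak: int, v_th: int) -> Tuple[int, int]:
    """Apply (5.1) once and return (next membrane potential, output spike bit)."""
    # Accumulate the weights w_ji of the pre-synaptic neurons that fired (S_j = 1).
    synaptic = sum(w for w, s in zip(weights, spikes_in) if s)
    v_next = v_mem + k_syn * synaptic + k_ext * ext_spike - v_leak
    spike = int(v_next > v_th)          # comparison against the threshold V_th
    return v_next, spike


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the dissertation.
    print(lif_step(10, [1, 2, 3], [1, 0, 1], 1, k_syn=2, k_ext=3, v_leak=1, v_th=20))
    print(lif_step(11, [1, 2, 3], [1, 0, 1], 1, k_syn=2, k_ext=3, v_leak=1, v_th=20))
```

Replacing the `+` accumulation and the `>` test with approximate adder and comparator models is the natural place to inject arithmetic errors when studying their effect on network behavior.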
A digital implementation of the LIF neuron is shown in Figure 5.1(a). It contains
a multiplier, a comparator and an adder. Through the multiplexer, it accumulates one of the
pre-synaptic weights, external input and leaky potential at a time, which correspond to the
terms $K_{syn} w_{ji} S^{t}_{j}$, $K_{ext} E^{t}_{i}$ and $-V_{leak}$ in (5.1), respectively. If the
adder output exceeds the given threshold voltage $V_{th}$, the digital comparator
produces a spike. The adder and comparator dominate the computation time because
the multiplier is relatively small due to the narrower bit-widths in multiplications. To
demonstrate the share of the adder and comparator in the digital LIF silicon neuron,
we show the delay and power breakdowns of the LIF neuron when implemented
with the ripple carry based adder and comparator in Figure 5.1(b). Each synaptic
weight, model parameter and membrane potential is represented using 3, 3 and 16
bits, respectively. The adder and comparator contribute 88.5% of the processing
time and 42.9% of the power in the entire LIF computation. Therefore, it is
crucial to reduce the delay and power of the adder and comparator to improve the overall energy
efficiency of neuromorphic computing.

Figure 5.1: Digital LIF neuron: (a) block diagram and (b) delay and power breakdowns.
(Delay: adder 44.8%, comparator 43.7%, multiplier, etc. 11.5%. Power: multiplier, etc. 57.1%, adder 34.0%, comparator 8.9%.)
Evaluating the performance of the proposed arithmetic units by simulating the
long training process of the neuromorphic system at the transistor level is compu-
tationally intractable. Instead, we develop a hardware-aware spiking neural network
simulator for the neuromorphic VLSI system to evaluate the performance of our new
arithmetic designs. The key network features and hardware design parameters,
including the digital LIF neuron dynamics, the STDP learning rule and the bit-widths
used to represent various neuron model parameters, are fully captured in the simulator.
The proposed approximate adder and comparator are carefully characterized and
their circuit profiles are extracted from HSPICE simulations [35]. To evaluate the
approximate nature of our designs, we disable the error correction logic of the adder
and inject the characterized input-specific adder and comparator errors into each
addition and comparison operation in the behavioral simulator, providing a precise
evaluation of the impacts of the approximation errors on neuromorphic computation.
We use the neuromorphic character recognition application illustrated in Section 3
to systematically examine the impacts of the approximate adder and comparator
errors of several designs. We specifically consider the case where the
neuromorphic hardware is configured as the same two-layer network for character
recognition as illustrated in Figure 3.16. Here, we extend the network to have over a
thousand silicon neurons such that the input layer contains 1024 excitatory neurons
receiving binary inputs representing pixel values in a 32 × 32 pixel input pattern,
while the output layer has 36 excitatory neurons receiving inputs from all excitatory
input neurons through plastic synapses. The behavior of each layer is modulated
by the inhibitory neurons as described in Section 3. To train the network, 26 input
patterns of the letters “A” to “Z”, which are 32 × 32 pixel patterns, are applied one
by one to the input layer. In this network, we use 3, 3 and 16 bits to represent
each synaptic weight, model parameter and membrane potential, respectively,
and employ a 16-bit adder and comparator for the LIF computation.
5.2 Impacts of Approximation Errors on Neuromorphic Applications
5.2.1 Approximate Adder Error Effects
First, we fix the supply level at 1.2 V and clock the chip at the nominal clock
rate of 100 MHz so that the errors produced are due only to the approximate nature
of the adders, since there is no timing failure. Also, the accurate 16-bit comparator
RCC is used in the digital LIF neurons to compare the threshold voltage with the
membrane potentials to generate the neurons’ firing activities. Figure 5.2 shows the
input character patterns “A” to “Z” used for the training and the receptive fields of
all excitatory output neurons after the training with the accurate adders (i.e. RCA and
CLA). The receptive fields in Figure 5.2(b) are trained well to respond to the
inputs from “A” to “Z”; that is, every letter appears at least once in the
receptive fields. The results in Figure 5.2(b) serve as a golden reference for the
approximate adders.

Figure 5.2: (a) Input character patterns and (b) receptive fields with 16-bit accurate
adders.
We also test the proposed approximate adder and other approximate adders with
the network. The receptive fields after the training with various approximate adders
are shown in Figure 5.3. The corresponding error rates and average error magnitudes
during the learning process are listed in Table 5.1. The proposed 16-bit adder
with k=4 and v=2 has an error rate of merely 0.18% with an average error magnitude
of 0.03 for the LIF computations during the training. Note that the EDC is not used
here, whereas the error magnitude reduction is. Thanks to the error resilience of
the neuromorphic system, Figure 5.3(a) shows that the receptive fields are trained
successfully to recognize all the letters, and the approximation errors have negligible
effect on the training process of the character recognition system.

Table 5.1: Error rates and average error magnitudes of various adders during training
process.

Design         Error Rate (%)    Avg. Error Magnitude
LUA                 14.24               200.47
LOA (8-8)           61.05                11.82
ETAI (8-8)          14.32                 8.81
ETAII               14.24               544.11
VLCSA-1             14.24               544.11
ACA                 14.24               544.11
DAA (8-8)           13.04                 7.68
LOA (13-3)          60.95                 1.59
ETAI (15-1)         17.50                 0.17
DAA (11-5)          23.25                 0.59
Proposed             0.18                 0.03

Figure 5.3: Receptive fields with 16-bit (a) proposed approximate adder, (b) LUA,
(c) LOA (8-8), (d) ETAI (8-8), (e) ETAII, (f) VLCSA-1, (g) ACA and (h) DAA
(8-8).

For LOA, ETAI and DAA, 8-bit accurate and 8-bit inaccurate parts are used (i.e.
LOA (8-8), ETAI (8-8) and DAA (8-8)). The MSB of one input of the inaccurate part
(i.e. A7) is leveraged as the dithering bit in DAA. The parameters n=16 and k=4 are
adopted in LUA, ETAII, VLCSA-1 and ACA, and only the approximate adder part (i.e.
without the EDC) is utilized for the latter two adders. All these approximate adders
have an error rate of more than 13% during the learning process. In particular, the
error rate of LOA (8-8) reaches 61.05% throughout the training. As seen in Figure
5.3(b) to (h), the approximate adders produce a set of receptive fields with random
synaptic weights. These high error rates give rise to failures in training the network
since the approximation errors cause the neurons to either fire randomly or cease to
fire. In particular, 2’s complement signed additions of small numbers frequently occur
during leaky operations (i.e. $-V_{leak}$ in (5.1)) in the training. In this case, the LUA,
ETAII, VLCSA-1 and ACA produce many wrong carry predictions, incurring an
error rate of more than 14% and a high average error magnitude of over 200 during the
learning process, and hence an unacceptable performance degradation. This result
suggests that carry speculation with only the 4 less significant input bits in these
16-bit adders may be insufficient for this application.
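The failure mode described above for leaky operations can be illustrated with a generic windowed carry-speculation model. This is a simplified sketch under stated assumptions, not the actual LUA, ETAII, VLCSA-1 or ACA circuits: the carry into each bit is speculated from only the k bits directly below it (with an assumed carry-in of zero into that window), and the names `speculative_add` and `leak_error_rate` are hypothetical.

```python
N_BITS = 16
MASK = (1 << N_BITS) - 1


def exact_add(a, b):
    """Exact 16-bit two's complement addition (wraps modulo 2**16)."""
    return (a + b) & MASK


def speculative_add(a, b, k=4):
    """Generic windowed carry speculation (assumed model, not a specific design).

    The carry into bit i is computed from bits i-k .. i-1 only, assuming
    that no carry enters that window from below.
    """
    result = 0
    for i in range(N_BITS):
        lo = max(0, i - k)
        width = i - lo
        win_mask = (1 << width) - 1
        wa = (a >> lo) & win_mask
        wb = (b >> lo) & win_mask
        carry = (wa + wb) >> width  # carry out of the k-bit window
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


def leak_error_rate(v_leak, potentials, k=4):
    """Fraction of V + (-v_leak) additions that the speculative adder gets wrong."""
    neg_leak = (-v_leak) & MASK  # two's complement encoding of -v_leak
    wrong = sum(
        speculative_add(v, neg_leak, k) != exact_add(v, neg_leak) for v in potentials
    )
    return wrong / len(potentials)


if __name__ == "__main__":
    # 5 + (-1): -1 is all ones, so the true carry ripples through all 16 bits.
    print(hex(exact_add(5, 0xFFFF)), hex(speculative_add(5, 0xFFFF, k=4)))
    print(leak_error_rate(1, range(1, 1000)))
```

Because the two's complement encoding of a small negative number is a long run of ones, subtracting a small leak from a small positive potential creates a carry chain that spans nearly the whole word, which a short speculation window cannot cover.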
To shed more light on this, we increase the accuracy of LOA, ETAI and DAA by
expanding the accurate part of the adder at the cost of increased delay and energy
dissipation. When the LOA, ETAI and DAA have 13-, 15- and 11-bit accurate parts,
respectively, the network starts to perform better. Figure 5.4 illustrates the receptive
fields with these adders, and the respective error rates and average error magnitudes
are also listed in Table 5.1. Although the LOA (13-3) still has a relatively high error
rate of 60.95%, the corresponding receptive fields in Figure 5.4(a) are trained such
that all letters except for “D”, “E”, “H” and “K” can be identified. In addition,
a vague “I”-shaped receptive field is trained, which may be confused with “Y”. Due
to the expansion of the accurate part of the adder, the errors now concentrate more
on the LSBs and the average error magnitude is reduced from 11.82 to 1.59. Similarly,
the inclusion of a 15-bit accurate part in ETAI (15-1), which has an error rate of
17.50% with an average error magnitude of 0.17 during the training, allows the
network to be trained for all letters except for “B”, “H” and “I”, as illustrated in
Figure 5.4(b). The DAA requires only three more accurate bits to train the network
better and achieves a relatively low average error magnitude of 0.59 thanks to the
dithering scheme. However, the high error rate of DAA (11-5) hinders the
network from learning all the letters. Several letters do not appear in the receptive
fields, and the other letters are less distinct than in the receptive fields obtained
with the other adders, as seen in Figure 5.4(c). Overall, the proposed design
outperforms all the other approximate adders in this application.

Figure 5.4: Receptive fields with 16-bit (a) LOA (13-3), (b) ETAI (15-1) and (c)
DAA (11-5).
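The effect of widening the accurate part can be reproduced qualitatively with a behavioral model of a lower-part OR adder. The sketch below assumes the commonly described LOA structure (bitwise OR in the inaccurate lower part, exact addition in the upper part with a carry-in formed by ANDing the MSBs of the two lower parts). It uses uniformly random 16-bit inputs rather than the training workload, and it averages the error magnitude over all trials, so its numbers are not expected to match Table 5.1. The names `loa_add` and `error_stats` are hypothetical.

```python
import random

N_BITS = 16
MASK = (1 << N_BITS) - 1


def loa_add(a, b, low_bits):
    """Behavioral lower-part OR adder; low_bits (>= 1) is the inaccurate width."""
    low_mask = (1 << low_bits) - 1
    low = (a | b) & low_mask  # inaccurate part: OR instead of add
    # Carry into the accurate part: AND of the MSBs of the two lower parts.
    carry_in = (a >> (low_bits - 1)) & (b >> (low_bits - 1)) & 1
    high = ((a >> low_bits) + (b >> low_bits) + carry_in) << low_bits
    return (high | low) & MASK


def error_stats(low_bits, trials=20000, seed=0):
    """Return (error rate, average error magnitude) under random inputs."""
    rng = random.Random(seed)
    errors = 0
    total_mag = 0
    for _ in range(trials):
        a = rng.randrange(1 << N_BITS)
        b = rng.randrange(1 << N_BITS)
        exact = (a + b) & MASK
        approx = loa_add(a, b, low_bits)
        if approx != exact:
            errors += 1
            diff = (approx - exact) & MASK
            total_mag += min(diff, (1 << N_BITS) - diff)  # wrap-aware distance
    return errors / trials, total_mag / trials


if __name__ == "__main__":
    for low in (8, 3):  # LOA (8-8) and LOA (13-3)
        rate, mag = error_stats(low)
        print(f"LOA ({N_BITS - low}-{low}): rate={rate:.3f}, avg magnitude={mag:.2f}")
```

Shrinking the inaccurate part from 8 to 3 bits confines the errors to the LSBs, so the average magnitude drops sharply even though errors remain frequent, which is consistent with the trend reported for LOA (13-3).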
5.2.2 Approximate Comparator Error Effects
To see the impacts of the errors of the proposed approximate comparator on
neuromorphic computing, we replace the accurate comparator with the proposed
approximate one with k=4 and v=2 in the neuron circuits. The receptive fields with
the proposed comparator are illustrated in Figure 5.5. Note that the errors are
induced only by the approximations, without timing failures of the circuits. Figure
5.5(a) shows the receptive fields with the accurate adder and the proposed comparator.
The proposed comparator allows the network to train well to recognize all the letters
by virtue of its very low error rate of 0.45% and the error resilience of
neuromorphic computing. Additionally, the proposed approximate adder together
with the proposed comparator also produces good receptive fields that show all
the letters, as in Figure 5.5(b). The error rates of the adder and comparator are
0.19% and 0.48%, respectively, during training. These very low error rates affect the
learning process negligibly, and our two designs can therefore be adopted together
for the neuromorphic application without performance degradation.

Figure 5.5: Receptive fields with 16-bit (a) accurate adder with proposed comparator
and (b) proposed adder with proposed comparator.
5.3 Energy Efficiency of LIF Neurons with Approximate Adders and Comparators
with Supply Voltage Scaling
To show the energy efficiency of these adders in the neuromorphic hardware,
we scale down the supply voltage and obtain the energy dissipation of one
LIF operation, which involves a multiplication of a synaptic weight $w_{ji}$ with the
weight parameter $K_{syn}$ and an addition of the membrane potential $V_i^t$ with the
multiplier output $K_{syn} w_{ji}$, a key processing step in (5.1). A comparison between the
membrane potential $V_i^t$ and the threshold voltage $V_{th}$ is also included in the LIF
operation, and the RCC is employed in the neuron circuit. The clock frequency is fixed
at the maximum value such that the neuron with RCA and RCC can operate without any
error at the nominal supply voltage of 1.2 V. For each adder design, we scale down the
supply voltage in 0.05 V steps as long as no critical timing failure is created
in the neuron circuit. Figure 5.6 plots the energy of one LIF operation
for the neurons with the different adders under scaled power supply levels. The
energies are normalized against the neuron with RCA and RCC. Neurons with LUA,
ETAII or VLCSA-1 can operate at a supply voltage of 1.0 V. The ETAII is the most
energy efficient adder design, but it has a much larger error rate than the proposed
adder and leads to poor learning performance (see Figure 5.3(e)). The
high power consumption of the carry selection in VLCSA-1 is an obstacle to
attaining energy efficiency. All the adders except for RCA are able to work at the
scaled supply voltage of 1.05 V. The LUA, ETAI and the proposed adder show similar
energy efficiency, and the proposed adder consumes about 8% and 6% less energy
than ACA and DAA (8-8), respectively. Our design achieves energy savings of
up to 36.6% and 27.9% over RCA and CLA, respectively, in the scaled supply. It can
be seen that our adder has the most competitive energy and error tradeoff among
all these designs. We also compare a neuron leveraging the proposed adder and the
approximate comparator to the others. This neuron allows the supply voltage to
decrease to 0.8 V, which is 0.25 V lower than for a neuron including the proposed adder
only. The proposed adder and comparator enable the neuron to be 1.97×, 2.73×
and 3.11× more energy efficient than the neurons adopting the proposed adder, CLA and
RCA with the accurate comparator RCC, respectively, in the scaled supply without
performance degradation (see Figure 5.5(b)). Our comparator also provides a great
energy saving with a very low error rate for neuromorphic computing. Since a few
hundred silicon neurons are integrated in the form of an array [58, 36], the total
energy saving resulting from our designs is remarkable for the neuromorphic chip.

Figure 5.6: Normalized energies of one digital LIF neuron with various adders with
supply voltage scaling. [Plot of normalized energy (0 to 1.2) versus supply voltage
(0.8 V to 1.2 V) for RCA, CLA, LUA, LOA (8-8), ETAI (8-8), ETAII, VLCSA-1, ACA,
DAA (8-8), Proposed Adder, and Proposed Adder & Proposed Comparator.]
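To give a feel for how much of these savings can come from the supply voltage alone, the sketch below applies the textbook first-order model in which dynamic switching energy scales with the square of the supply voltage. This model is an assumption introduced here for illustration: it ignores leakage and short-circuit energy and says nothing about the circuit-level differences between adders, so it is not the HSPICE-based method used in this work. The function name `normalized_dynamic_energy` is hypothetical.

```python
def normalized_dynamic_energy(vdd, vdd_nominal=1.2):
    """First-order dynamic energy relative to the nominal supply (E ~ C * Vdd^2)."""
    return (vdd / vdd_nominal) ** 2


if __name__ == "__main__":
    # Supply levels mentioned in this section.
    for vdd in (1.2, 1.05, 1.0, 0.8):
        print(f"Vdd = {vdd:.2f} V -> normalized energy = {normalized_dynamic_energy(vdd):.3f}")
```

Under this model, moving from 1.2 V to 1.05 V accounts for roughly a 23% reduction, and moving to 0.8 V for roughly a 56% reduction. The savings reported above exceed these values, so the remainder would have to come from the lower switching energy of the approximate circuits themselves.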
5.4 Energy Efficiency during the Training Process with Supply Voltage Scaling
Finally, we examine the overall energy consumption by all the digital LIF neurons
in the network for the training process. The LIF neuron dynamics in (5.1) is divided
into three different types of additions to obtain the membrane potential and one
comparison to determine the firing activity:
1) an addition of the membrane potential $V_i^t$ with the multiplier output of each
synaptic weight $w_{ji}$ and the weight parameter $K_{syn}$;
2) an addition of the membrane potential $V_i^t$ with the external spike weight
parameter $K_{ext}$;
3) an addition of the membrane potential $V_i^t$ with the leaky potential $-V_{leak}$;
4) a comparison of the membrane potential $V_i^t$ with the threshold voltage $V_{th}$.
We extract the energy profiles for these three additions and one comparison from
HSPICE simulations and inject the characterized energies into the behavioral
simulator. In the synaptic weight integrations, which correspond to the term
$K_{syn} \sum_{j=1}^{M} w_{ji} S_j^t$ in (5.1), the adders are activated only after the
pre-synaptic neurons have fired (i.e. $S_j^t = 1$) for the type 1) addition. Similarly,
they work only when the external spiking events are applied ($K_{ext} E_i^t$ in (5.1)) for
the type 2) addition. Therefore, we take into account the neurons’ firing and the
external spike activities in the training to obtain realistic energy consumptions of
the neuron circuits. The simulator accumulates the energy dissipations of all the
neuron circuits in the network by not only discriminating these addition types and
the comparison but also considering the neurons’ firing and the external spike
activities during the learning process, achieving an accurate analysis of the energy
dissipations of all the neurons. We consider the error-free operation and the proposed
addition scheme since the other approximation approaches show unacceptable learning
performances. For the error-free operation, we enable the EDC of ACA, VLCSA-1 and
the proposed adder to achieve the same receptive fields as the accurate adder after
the training (i.e. Figure 5.2(b)). Also, the RCC is utilized for accurate comparison
and the EDC is activated only when errors are detected. The clock frequency of the
chip is set to the same value as in Section 5.3 since the digital neuron circuits
contain the overall critical timing path of the neuromorphic chip [36]. The supply
voltage is again scaled down in 0.05 V steps without any critical timing failure in
the neurons. Figure 5.7 depicts the overall energy consumed by the digital LIF neurons
during the learning process. The energies are normalized in the same way as in Figure
5.6. While the chip with all the designs except for RCA can operate at a supply
voltage of 1.05 V, the proposed adder with EDC is 1.75× and 1.30× more energy
efficient than VLCSA-1 and ACA, respectively. The proposed adder with EDC requires the
smallest amount of energy for the neuron circuits among the error-free adders, which
encompass RCA, CLA and the EDC-enabled VLCSA-1, ACA and proposed adder, thanks to
the low-overhead EDC circuit. The proposed approximate adder without EDC makes the
neurons dissipate 29.3% and 37.4% less energy than CLA and RCA, respectively, under
the scaled voltage. Moreover, when enabled with the proposed comparator and adder,
digital LIF neurons are able to work at the scaled supply of 0.8 V while achieving an
energy saving of 48.8% to 77.8% over all other error-free adders (e.g. 66.5% over RCA)
with negligible performance degradation. The proposed arithmetic units demonstrate
significant energy savings for the neuromorphic hardware. Additionally, the proposed
adder with EDC also exhibits better energy efficiency than the other error-free adders
and can equally be employed for energy efficient, accuracy-critical applications.

Figure 5.7: Normalized energy consumptions by all the digital LIF neurons of the
network while training with various adders and comparators under supply voltage
scaling. [Plot of normalized energy (0 to 1.6) versus supply voltage (0.8 V to 1.2 V)
for RCA, CLA, VLCSA w/ EDC, ACA w/ EDC, Proposed Adder w/ EDC, Proposed Adder w/o
EDC, and Proposed Adder w/o EDC & Proposed Comparator.]
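The activity-dependent energy accounting described in this section can be sketched as follows. This is a minimal illustration and not the actual simulator: the class `EnergyProfile`, the functions `step_energy` and `training_energy`, and all energy values are hypothetical placeholders, and the sketch assumes that the leak addition and the threshold comparison are performed once per neuron per time step.

```python
from dataclasses import dataclass


@dataclass
class EnergyProfile:
    """Per-operation energies (arbitrary units; placeholder values only)."""
    syn_add: float   # type 1: add K_syn * w_ji to the membrane potential
    ext_add: float   # type 2: add K_ext to the membrane potential
    leak_add: float  # type 3: add -V_leak to the membrane potential
    compare: float   # type 4: compare the membrane potential with V_th


def step_energy(profile, pre_spikes, ext_spike):
    """Energy of one neuron in one time step, given its input activity."""
    # Type 1 additions happen only for pre-synaptic neurons that fired.
    energy = sum(pre_spikes) * profile.syn_add
    # Type 2 addition happens only when an external spike is applied.
    if ext_spike:
        energy += profile.ext_add
    # Assumption: leak and comparison are performed in every time step.
    energy += profile.leak_add + profile.compare
    return energy


def training_energy(profile, activity):
    """Accumulate energy over (pre_spikes, ext_spike) records of a training run."""
    return sum(step_energy(profile, pre, ext) for pre, ext in activity)


if __name__ == "__main__":
    profile = EnergyProfile(syn_add=2.0, ext_add=3.0, leak_add=1.0, compare=0.5)
    activity = [([1, 0, 1], 1), ([0, 0, 0], 0)]
    print(training_energy(profile, activity))
```

Swapping in a different `EnergyProfile` per adder or comparator design and supply voltage would then give normalized totals of the kind plotted in Figure 5.7.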
5.5 Summary
This section has demonstrated the performance of the proposed approximate
adder and comparator as part of an unsupervised learning based VLSI neuromorphic
character recognition chip by developing a hardware-aware simulation approach. The
results have shown that the approximation errors of the proposed adder affect the
training performance negligibly, while the other approximate adders severely degrade
the learning performance. The digital LIF neuron adopting the proposed arithmetic
units is over 3× more energy efficient than the one using traditional accurate
arithmetic units. Moreover, the proposed adder and comparator allow for an energy
saving of up to 66.5% over traditional counterparts for the digital LIF neurons during
the learning process with scaled supply voltage levels.
6. CONCLUSION AND FUTURE WORK
6.1 Conclusion
This dissertation has developed techniques for designing a neuromorphic
processor and approximate arithmetic for low-cost, reconfigurable and energy efficient
neuromorphic computing in VLSI. By addressing several key issues in implementing
a brain-inspired hardware architecture, we have significantly improved both the
flexibility and the performance of neuromorphic VLSI systems. Furthermore, the
proposed approximate arithmetic and the inherent error resiliency of neurocomputing
allow for excellent energy efficiency with negligible performance degradation in neural
computation. We conclude this research by summarizing the major contributions.
For the digital neuromorphic processor, we have proposed a scalable digital
architecture that incorporates synapse, neuron and learning arrays for large scale
spiking neural networks. The memristor nanodevice is leveraged to build a high-density
synapse crossbar array that consists of novel multilevel memory cells to store both
a multibit synapse value and network configuration information. Through the
systematic analysis of the memristor, we have considerably enhanced the synaptic
weight update performance by reducing the programming time for the memristive
array, which can be accessed both column- and row-wise with the low-cost digital
PWM scheme. Additionally, the proposed column based ADC scheme allows the
digital neuron to efficiently perform the LIF neuron dynamics and reduces the area
and power overheads required for the LIF operations. When implemented in a
commercial 90 nm CMOS technology, our design with 256 digital spiking neurons with
learning circuits and 65,536 synapses is evaluated to occupy an area of 1.86 mm²
and dissipate a power of 6.45 mW under a supply voltage of 1.2 V. Furthermore,
the validation of the chip functionality by behavioral digital simulation
has shown that the proposed architecture successfully realizes character recognition
with unsupervised learning.
In approximate arithmetic, a novel approximate arithmetic scheme that significantly
reduces energy consumption with an extremely low error rate has been proposed
for error resilient neuromorphic computing. The proposed carry speculation with
a parallel carry-skip has been applied to both adder and comparator designs to
considerably improve the overall error rate in computation and the critical path delay.
Moreover, the error magnitude reduction technique for the adder further reduces the
amount of error created by the approximate nature at low cost. The complete
error rate analysis has shown that the proposed arithmetic units have extremely
low error rates under random input patterns. The proposed approximate units have
been implemented in the same 90 nm CMOS process. While the proposed adder
is 2.4× faster, with an error rate of 0.18%, and 43% more energy efficient than
traditional ones, the proposed comparator has an error rate of less than 0.1% and
achieves an energy saving of up to 4.9× over the conventional counterparts. To
demonstrate the impacts of the approximation errors on neuromorphic computing,
we have conducted hardware-aware simulation of an unsupervised learning based
VLSI neuromorphic character recognition chip that includes over a thousand silicon
neurons. The results have shown that the proposed approximate arithmetic units affect
the training performance negligibly and outperform the other approximate adders.
Furthermore, they allow for an energy reduction of up to 66.5% over traditional
ones for the digital LIF computations during the learning process with scaled supply
voltage levels.
Accordingly, the proposed architectural and circuit level design approaches are
applicable to a wide range of energy efficient and error resilient neuromorphic
computing systems, such as image and speech recognition, in VLSI.
6.2 Future Work
So far, we have demonstrated a neuromorphic processor configured as a network of
256 spiking neurons on a single die that is able to successfully perform character
recognition. Clearly, more complex networks are needed for other, more sophisticated
applications. Ultimately, one may think about integrating huge numbers of neurons
and synapses to create an artificial brain that mimics the functions of the human
brain such as reasoning, knowledge, planning, learning and memory. When an
artificial human brain is implemented in silicon, tasks requiring complex reasoning and
information processing as conducted by humans may be readily solved with
extremely short processing times. Also, artificial brains may allow people to better
understand how the human brain works so as to advance cognitive science.
CMOS technology scaling enables increasing numbers of neurons and synapses
to be integrated in a given silicon area. Figure 6.1 shows the number of neurons
and synapses that can be integrated in a fixed area (e.g. 1 cm²) at different
technology nodes. They are evaluated based on our neuromorphic processor
designed in Section 3. For the estimations, we take into account the area of only the
CMOS switches in our CMOS/memristor hybrid cell since they dominate the overall
area of the memristor synaptic crossbar array. Additionally, the entire chip area is
assumed to be a linear function of the number of integrated silicon neurons $N$ (i.e.
chip area ∝ $N$) since the memristive crossbar array, whose area is proportional to $N^2$,
occupies a very small portion (<10%) of the overall area. We also consider that the
chip area scales with a factor of $L^2$, where $L$ is the technology feature size. Then,
we can not only estimate the area cost per silicon neuron in the scaled CMOS
technology but also obtain the numbers of integrated silicon neurons and synapses in a
given silicon area. Note that the number of integrated synapses is $N^2$ in our crossbar
structure. The number of integrated neurons approximately doubles with each new
generation of CMOS technology. At the 22 nm node, over 0.2 million ($2\times10^5$) neurons
and 50 billion ($5\times10^{10}$) synapses, a complexity similar to the nervous system of ants,
can be integrated in a 1 cm² silicon area. Technology scaling is expected to continue
in the coming decades. According to the International Technology Roadmap for
Semiconductors (ITRS), the CMOS feature size (i.e. gate length) will reach 5.9 nm
in 2026 and the supply voltage levels will decrease continuously, as shown in Figure
6.2 [70]. From Figure 6.2, we are able to predict the numbers of integrated neurons
and synapses in a future chip. Figure 6.3 predicts the trend of the number
of integrated neurons and synapses per 1 cm² of silicon area. Our neuromorphic
processor and the same estimation method used in Figure 6.1 are also applied to the
prediction. The predicted scaling trend shows that the number of silicon neurons
in a 1 cm² area will increase by 25% to 35% every year from 2013 to 2026. In 2026, a
1 cm² silicon die is estimated to include over 3 million ($3\times10^6$) neurons and
10 trillion ($10^{13}$) synapses, which may be enough to mimic the cockroach’s nervous
system, which contains a million neurons. Approximately 5 and 17 cm² silicon dies
would be needed to emulate the brains of frogs (16 million neurons) and rats
(56 million neurons), respectively.

Figure 6.1: Neuron and synapse integration densities as a function of technology.
[Plot of the number of neurons and the number of synapses per cm² versus technology
node (350, 250, 180, 130, 90, 65, 45, 32 and 22 nm), logarithmic vertical axes.]

Figure 6.2: Trends of gate length and power supply [70]. [Plot of gate length (L) and
VDD versus year, 2014 to 2026.]

Figure 6.3: Scaling trend of neuron and synapse integration. [Plot of the number of
neurons and the number of synapses per cm² versus year, 2014 to 2026, logarithmic
vertical axes.]
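The density estimates quoted in this section can be approximately reproduced with a few lines of arithmetic. The sketch below assumes that the baseline is the full 90 nm chip reported in Section 6.1 (256 neurons in 1.86 mm²), that chip area is proportional to the number of neurons, and that area scales with the square of the feature size, as stated above. The function names are hypothetical, and the exact baseline used for Figures 6.1 and 6.3 may differ.

```python
BASE_NODE_NM = 90.0    # technology node of the implemented chip
BASE_NEURONS = 256     # neurons on the implemented chip
BASE_AREA_MM2 = 1.86   # reported chip area in mm^2


def neurons_per_cm2(node_nm):
    """Estimated neurons per cm^2, with area per neuron scaling as L^2."""
    base_density = BASE_NEURONS / BASE_AREA_MM2 * 100.0  # per cm^2 at 90 nm
    return base_density * (BASE_NODE_NM / node_nm) ** 2


def synapses_per_cm2(node_nm):
    """Crossbar structure: N neurons on the die imply N^2 synapses."""
    return neurons_per_cm2(node_nm) ** 2


if __name__ == "__main__":
    for node in (90.0, 22.0, 5.9):
        n = neurons_per_cm2(node)
        s = synapses_per_cm2(node)
        print(f"{node:5.1f} nm: {n:.3g} neurons, {s:.3g} synapses per cm^2")
```

With these assumptions the sketch gives about 2.3×10^5 neurons and 5.3×10^10 synapses at 22 nm, and about 3.2×10^6 neurons and 1.0×10^13 synapses at a 5.9 nm gate length, in line with the values read from Figures 6.1 and 6.3.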
To this end, it is worthwhile to further optimize the existing neuromorphic hardware
design to better deal with such design complexity. As an example, computing the
membrane potential of a neuron by accumulating a billion 3-bit pre-synaptic weights
would require a 24-bit resolution for the proposed column ADC according to (3.6).
It is very difficult to achieve a 24-bit resolution with the VCO-based ADC, and other
ADC architectures such as the ∆Σ ADC may be considered. In a different direction,
alternative schemes to access the synaptic array may be investigated. In addition to
hardware design, and very importantly, appropriate learning algorithms and applications
have to be developed to fully utilize the computing power of future neuromorphic
chips integrating tremendous numbers of silicon neurons and synapses.
REFERENCES
[1] J. V. Arthur and K. Boahen. Silicon-Neuron Design: A Dynamical Systems
Approach. IEEE Trans. Circuits Syst. I, Reg. Papers, 58(5):1034–1043, 2011.
[2] J. V. Arthur, P. A. Merolla, F. Akopyan, R. Alvarez, A. Cassidy, S. Chandra,
S. K. Esser, N. Imam, W. Risk, D. B. D. Rubin, R. Manohar, and D. S. Modha.
Building Block of a Programmable Neuromorphic Substrate: A Digital Neu-
rosynaptic Core. In Proc. of Int. Joint Conf. Neural Netw. (IJCNN), pages 1–8,
2012.
[3] G.-Q. Bi and M.-M. Poo. Synaptic Modifications in Cultured Hippocampal
Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic
Cell Type. The Journal of Neuroscience, 18(24):10464–10472, 1998.
[4] S. Brink, S. Nease, P. Hasler, S. Ramakrishnan, R. Wunderlich, A. Basu, and
B. Degnan. A Learning-Enabled Neuron Array IC Based Upon Transistor Chan-
nel Models of Biological Phenomena. IEEE Trans. Biomed. Circuits Syst.,
7(1):71–81, 2013.
[5] R. A. Brualdi. Introductory Combinatorics. Prentice-Hall, Upper Saddle River,
New Jersey, 2009.
[6] L. Camuñas-Mesa, A. Acosta-Jimenez, C. Zamarreño-Ramos, T. Serrano-
Gotarredona, and B. Linares-Barranco. A 32×32 Pixel Convolution Processor
Chip for Address Event Vision Sensors With 155 ns Event Latency and 20 Meps
Throughput. IEEE Trans. Circuits Syst. I, Reg. Papers, 58(4):777–790, 2011.
[7] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan. Analysis and
Characterization of Inherent Application Resilience for Approximate Computing.
In Proc. of IEEE/ACM Design Automation Conf. (DAC), pages 1–9, 2013.
[8] V. K. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S. T. Chakradhar.
Scalable Effort Hardware Design: Exploiting Algorithmic Resilience for Energy
Efficiency. In Proc. of IEEE/ACM Design Automation Conf. (DAC), pages
555–560, 2010.
[9] H. Cho, L. Leem, and S. Mitra. ERSA: Error Resilient System Architecture for
Probabilistic Applications. IEEE Trans. Comput.-Aided Design Integr. Circuits
Syst., 31(4):546–558, 2012.
[10] L. O. Chua. Memristor-The Missing Circuit Element. IEEE Trans. Circuit
Theory, 18(5):507–519, 1971.
[11] J. Cosp, J. Madrenas, and D. Fernandez. Design and Basic Blocks of a Neuro-
morphic VLSI Analogue Vision System. Neurocomputing, 69(16-18):1962–1970,
2006.
[12] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. NVSim: A Circuit-Level Perfor-
mance, Energy, and Area Model for Emerging Nonvolatile Memory. IEEE Trans.
Comput.-Aided Design Integr. Circuits Syst., 31(7):994–1007, 2012.
[13] D. A. Drachman. Do We Have Brain to Spare? Neurology, 64(12):2056–2062,
2005.
[14] K. Du, P. Varman, and K. Mohanram. High Performance Reliable Variable
Latency Carry Select Addition. In Proc. of Design, Automation, Test in Europe
(DATE), pages 1257–1262, 2012.
[15] S. K. Esser, A. Ndirango, and D. S. Modha. Binding Sparse Spatiotemporal
Patterns in Spiking Computation. In Proc. of Int. Joint Conf. Neural Netw.
(IJCNN), pages 1–9, 2010.
[16] D. E. Feldman. The Spike-Timing Dependence of Plasticity. Neuron, 75(4):556–
571, 2012.
[17] R. FitzHugh. Impulses and Physiological States in Theoretical Models of Nerve
Membrane. Biophys. J., 1(6):445–466, 1961.
[18] S. Ghosh-Dastidar and H. Adeli. Third Generation Neural Networks: Spiking
Neural Networks. In Advances in Computational Intelligence, pages 167–178,
2009.
[19] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy. Low-Power Digital Sig-
nal Processing Using Approximate Adders. IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., 32(1):124–137, 2013.
[20] S. O. Haykin. Neural Networks and Learning Machines. Prentice-Hall, Upper
Saddle River, New Jersey, 2008.
[21] R. Hegde and N. R. Shanbhag. Soft Digital Signal Processing. IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., 9(6):813–823, 2001.
[22] J. L. Hindmarsh and R. M. Rose. A Model of Neuronal Bursting using
Three Coupled First Order Differential Equations. Proc. R. Soc. Lond. B.,
221(1222):87–102, 1984.
[23] Y. Ho, G. M. Huang, and P. Li. Dynamical Properties and Design Analysis for
Nonvolatile Memristor Memories. IEEE Trans. Circuits Syst. I, Reg. Papers,
58(4):724–736, 2011.
[24] A. L. Hodgkin and A. F. Huxley. A Quantitative Description of Membrane
Current and Its Application to Conduction and Excitation in Nerve. J. Physiol.,
117(4):500–544, 1952.
[25] M. Hu, H. Li, Q. Wu, and G. S. Rose. Hardware Realization of BSB Recall
Function using Memristor Crossbar Arrays. In Proc. of IEEE/ACM Design
Automation Conf. (DAC), pages 498–503, 2012.
[26] J. Huang, J. Lach, and G. Robins. A Methodology for Energy-Quality Tradeoff
Using Imprecise Hardware. In Proc. of IEEE/ACM Design Automation Conf.
(DAC), pages 504–509, 2012.
[27] N. Imam, F. Akopyan, J. Arthur, P. Merolla, R. Manohar, and D. S. Modha. A
Digital Neurosynaptic Core Using Event-Driven QDI Circuits. In Proc. of IEEE
Int. Symp. Async. Circuits and Syst. (ASYNC), pages 25–32, 2012.
[28] G. Indiveri, E. Chicca, and R. Douglas. A VLSI Array of Low-Power Spiking
Neurons and Bistable Synapses with Spike-Timing Dependent Plasticity. IEEE
Trans. Neural Netw., 17(1):211–221, 2006.
[29] G. Indiveri, B. Linares-Barranco, T. J. Hamilton, A. van Schaik, R. Etienne-
Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Hafliger, S. Renaud, J. Schem-
mel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi,
T. Serrano-Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen. Neuromor-
phic Silicon Neuron Circuits. Frontiers in Neuroscience, 5(73):1–23, 2011.
[30] A. K. Jain, J. Mao, and K. M. Mohiuddin. Artificial Neural Networks: a Tuto-
rial. Computer, 29(3):31–44, 1996.
[31] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu.
Nanoscale Memristor Device as Synapse in Neuromorphic Systems. Nano Let-
ters, 10(4):1297–1301, 2010.
[32] A. B. Kahng and S. Kang. Accuracy-Configurable Adder for Approximate Arith-
metic Designs. In Proc. of IEEE/ACM Design Automation Conf. (DAC), pages
820–825, 2012.
[33] R. Kempter, W. Gerstner, and J. L. Van Hemmen. Hebbian Learning and
Spiking Neurons. Physical Review E, 59(4):4498–4514, 1999.
[34] J. Kim, T.-K. Jang, Y.-G. Yoon, and S. Cho. Analysis and Design of Voltage-
Controlled Oscillator Based Analog-to-Digital Converter. IEEE Trans. Circuits
Syst. I, Reg. Papers, 57(1):18–30, 2010.
[35] S. H. Kim, S. Mukhopadhyay, and M. Wolf. Modeling and Analysis of Image De-
pendence and Its Implications for Energy Savings in Error Tolerant Image Pro-
cessing. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., 30(8):1163–
1172, 2011.
[36] Y. Kim, Y. Zhang, and P. Li. A Digital Neuromorphic VLSI Architecture with
Memristor Crossbar Synaptic Array for Machine Learning. In Proc. of IEEE
Int. System-on-Chip Conf. (SOCC), pages 328–333, 2012.
[37] S. Li, J.-H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P.
Jouppi. McPAT: An Integrated Power, Area, and Timing Modeling Framework
for Multicore and Manycore Architectures. In Proc. of IEEE/ACM Int. Symp.
Microarchitecture (MICRO), pages 469–480, 2009.
[38] A. Lingamneni, C. Enz, K. Palem, and C. Piguet. Parsimonious Circuits for
Error-Tolerant Applications through Probabilistic Logic Minimization. In Lec-
ture Notes in Computer Science, pages 204–213, 2011.
[39] S.-L. Lu. Speeding Up Processing with Approximation Circuits. Computer,
37(3):67–73, 2004.
[40] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas. Bio-Inspired Impre-
cise Computational Blocks for Efficient VLSI Implementation of Soft-Computing
Applications. IEEE Trans. Circuits Syst. I, Reg. Papers, 57(4):850–862, 2010.
[41] H. Manem, J. Rajendran, and G. S. Rose. Design Considerations for Multilevel
CMOS/Nano Memristive Memory. ACM J. Emerg. Technol. Comput. Syst.,
8(1):6:1–6:22, 2012.
[42] H. Markram, J. Lübke, M. Frotscher, and B. Sakmann. Regulation of
Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs. Science,
275(5297):213–215, 1997.
[43] T. M. Massoud and T. K. Horiuchi. A Neuromorphic VLSI Head Direction Cell
System. IEEE Trans. Circuits Syst. I, Reg. Papers, 58(1):150–163, 2011.
[44] W. McCulloch and W. Pitts. A Logical Calculus of the Ideas Immanent in
Nervous Activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[45] C. E. Merkel, N. Nagpal, S. Mandalapu, and D. Kudithipudi. Reconfigurable
N-level Memristor Memory Design. In Proc. of Int. Joint Conf. Neural Netw.
(IJCNN), pages 3042–3048, 2011.
[46] P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha. A
Digital Neurosynaptic Core using Embedded Crossbar Memory with 45pJ per
Spike in 45nm. In Proc. of IEEE Custom Integrated Circuits Conf. (CICC),
pages 1–4, 2011.
[47] J. Miao, K. He, A. Gerstlauer, and M. Orshansky. Modeling and Synthesis
of Quality-Energy Optimal Approximate Adders. In Proc. of IEEE/ACM Int.
Conf. Comput.-Aided Design (ICCAD), pages 728–735, 2012.
[48] S. Mitra, S. Fusi, and G. Indiveri. Real-Time Classification of Complex Pat-
terns Using Spike-Based Learning in Neuromorphic VLSI. IEEE Trans. Biomed.
Circuits Syst., 3(1):32–42, 2009.
[49] D. Mohapatra, V. K. Chippa, A. Raghunathan, and K. Roy. Design of Voltage-
Scalable Meta-Functions for Approximate Computing. In Proc. of Design, Au-
tomation, Test in Europe (DATE), pages 1–6, 2011.
[50] S. Moradi and G. Indiveri. A VLSI Network of Spiking Neurons with an Asyn-
chronous Static Random Access Memory. In Proc. of IEEE Biomed. Circuits
Syst. Conf. (BioCAS), pages 277–280, 2011.
[51] C. Morris and H. Lecar. Voltage Oscillations in the Barnacle Giant Muscle
Fiber. Biophys. J., 35(1):193–213, 1981.
[52] J. Nagumo, S. Arimoto, and S. Yoshizawa. An Active Pulse Transmission Line
Simulating Nerve Axon. Proc. IRE, 50(10):2061–2070, 1962.
[53] Z. Pan and M. A. Breuer. Basing Acceptable Error-Tolerant Performance on
Significance-Based Error-Rate (SBER). In Proc. of IEEE VLSI Test Symp.
(VTS), pages 59–66, 2008.
[54] H. Paugam-Moisy and S. Bohte. Computing with Spiking Neuron Networks. In
Handbook of Natural Computing, pages 335–376, 2012.
[55] Y. V. Pershin and M. Di Ventra. Experimental Demonstration of Associative
Memory with Memristive Neural Networks. Neural Netw., 23(7):881–886, 2010.
[56] T. Pfeil, T. C. Potjans, S. Schrader, W. Potjans, J. Schemmel, M. Diesmann,
and K. Meier. Is a 4-bit Synaptic Weight Resolution Enough? - Constraints
on Enabling Spike-Timing Dependent Plasticity in Neuromorphic Hardware.
Frontiers in Neuroscience, 6(90):1–19, 2012.
[57] J. B. Reece, L. A. Urry, M. L. Cain, S. A. Wasserman, P. V. Minorsky, and R. B.
Jackson. Campbell Biology. Benjamin Cummings, San Francisco, California,
2010.
[58] J.-S. Seo, B. Brezzo, Y. Liu, B. D. Parker, S. K. Esser, R. K. Montoye, B. Ra-
jendran, J. A. Tierno, L. Chang, D. S. Modha, and D. J. Friedman. A 45nm
CMOS Neuromorphic Chip with a Scalable Architecture for Learning in Net-
works of Spiking Neurons. In Proc. of IEEE Custom Integrated Circuits Conf.
(CICC), pages 1–4, 2011.
[59] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R. Paz-
Vicente, F. Gomez-Rodriguez, L. Camunas-Mesa, R. Berner, M. Rivas-Perez,
T. Delbruck, S.-C. Liu, R. Douglas, P. Hafliger, G. Jimenez-Moreno, A. C. Ball-
cels, T. Serrano-Gotarredona, A. J. Acosta-Jimenez, and B. Linares-Barranco.
CAVIAR: A 45k Neuron, 5M Synapse, 12G Connects/s AER Hardware Sensory-
Processing-Learning-Actuating System for High-Speed Visual Object Recogni-
tion and Tracking. IEEE Trans. Neural Netw., 20(9):1417–1438, 2009.
[60] B. Shim, S. R. Sridhara, and N. R. Shanbhag. Reliable Low-Power Digital Signal
Processing via Reduced Precision Redundancy. IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., 12(5):497–510, 2004.
[61] G. S. Snider. Spike-Timing-Dependent Learning in Memristive Nanodevices. In
Proc. of IEEE/ACM Int. Symp. Nanoscale Arch. (NANOARCH), pages 85–92,
2008.
[62] S. Song, K. D. Miller, and L. F. Abbott. Competitive Hebbian Learning Through
Spike-Timing-Dependent Synaptic Plasticity. Nature Neuroscience, 3(9):919–
926, 2000.
[63] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams. The Missing
Memristor Found. Nature, 453:80–83, 2008.
[64] A. Syed, E. Ahmed, D. Maksimovic, and E. Alarcon. Digital Pulse Width
Modulator Architectures. In Proc. of IEEE Power Electron. Specialists Conf.
(PESC), pages 4689–4695, 2004.
[65] A. van Schaik. Building Blocks for Electronic Spiking Neural Networks. Neural
Netw., 14(6-7):617–628, 2001.
[66] J. Vanne, E. Aho, T. D. Hamalainen, and K. Kuusilinna. A High-Performance
Sum of Absolute Difference Implementation for Motion Estimation. IEEE
Trans. Circuits Syst. Video Technol., 16(7):876–883, 2006.
[67] R. Venkatesan, A. Agarwal, K. Roy, and A. Raghunathan. MACACO: Modeling
and Analysis of Circuits for Approximate Computing. In Proc. of IEEE/ACM
Int. Conf. Comput.-Aided Design (ICCAD), pages 667–673, 2011.
[68] A. K. Verma, P. Brisk, and P. Ienne. Variable Latency Speculative Addition: A
New Paradigm for Arithmetic Circuit Design. In Proc. of Design, Automation,
Test in Europe (DATE), pages 1250–1255, 2008.
[69] J. H. B. Wijekoon and P. Dudek. Compact Silicon Neuron Circuit with Spiking
and Bursting Behaviour. Neural Netw., 21(2-3):524–534, 2008.
[70] L. Wilson. International Technology Roadmap for Semiconductors. SEMAT-
ECH, Albany, New York, 2011.
[71] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie. Design Implications of Memristor-
based RRAM Cross-Point Structures. In Proc. of Design, Automation, Test in Europe
(DATE), pages 1–6, 2011.
[72] J. J. Yang and R. S. Williams. Memristive Devices in Computing System:
Promises and Challenges. ACM J. Emerg. Technol. Comput. Syst., 9(2):11:1–
11:20, 2013.
[73] Y.-G. Yoon, J. Kim, T.-K. Jang, and S. Cho. A Time-Based Bandpass ADC
Using Time-Interleaved Voltage-Controlled Oscillators. IEEE Trans. Circuits
Syst. I, Reg. Papers, 55(11):3571–3581, 2008.
[74] N. Zhu, W. L. Goh, and K. S. Yeo. An Enhanced Low-Power High-Speed Adder
for Error-Tolerant Application. In Proc. of Int. Symp. Integrated Circuits (ISIC),
pages 69–72, 2009.
[75] N. Zhu, W. L. Goh, and K. S. Yeo. Ultra Low-Power High-Speed Flexible
Probabilistic Adder for Error-Tolerant Applications. In Proc. of Int. SoC Design
Conf. (ISOCC), pages 393–396, 2011.
[76] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong. Design of Low-Power
High-Speed Truncation-Error-Tolerant Adder and Its Application in Digital Sig-
nal Processing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 18(8):1225–
1229, 2010.
