In-memory Implementation of On-chip Trainable and Scalable ANN for AI/ML
  Applications by Kumar, Abhash et al.
Abhash Kumar, Jawar Singh, Sai Manohar Beeraka, and Bharat Gupta
Abstract: Processors based on the traditional von Neumann architecture are inefficient in energy and throughput because they keep processing and memory in separate units, a limitation known as the memory wall. The memory wall is further exacerbated when massive parallelism and frequent data movement between the processing and memory units are required for real-time implementation of artificial neural networks (ANNs), which enable many intelligent applications. One of the most promising approaches to the memory wall problem is to carry out computations inside the memory core itself, which enhances memory bandwidth and energy efficiency for extensive computations. This paper presents an in-memory computing architecture for ANNs that enables artificial intelligence (AI) and machine learning (ML) applications. The proposed architecture uses a deep in-memory architecture based on a standard six-transistor (6T) static random access memory (SRAM) core to implement a multi-layered perceptron. Our novel on-chip training and inference in-memory architecture reduces energy cost and enhances throughput by simultaneously accessing multiple rows of the SRAM array per precharge cycle and by eliminating frequent data access. The proposed architecture realizes backpropagation, the keystone of network training, using newly proposed building blocks for weight updation, analog multiplication, error calculation, and signed analog-to-digital conversion, together with the necessary signal control units. The proposed architecture was trained and tested on the IRIS dataset and is ≈ 46× more energy efficient per MAC (multiply-and-accumulate) operation than earlier classifiers.
Index Terms: In-memory computing, SRAM, artificial neural network, artificial intelligence, machine learning, classification.
I. INTRODUCTION
Artificial intelligence (AI) and machine learning (ML) algorithms are a ubiquitous and integral part of contemporary computing devices, and they are significantly changing the way we live and interact with the world around us. Most of these computing systems are based on the von Neumann architecture, which involves separate processing and memory units between which data must be shuttled back and forth frequently[1]. Therefore, a significant amount of energy and time is consumed by data movement rather than by actual computing, and this problem is further exacerbated by massive parallelism and data-centric tasks such as decision making, object recognition, and speech and video processing. This calls for a radical departure from the orthodox von Neumann approach to unorthodox non-von Neumann computing architectures that carry out computations within the memory core itself, referred to as in-memory computing. Recently, hardware implementation of AI/ML algorithms based on in-memory computing has attracted huge attention because of its unmatched computing performance and energy efficiency. In-memory computation overcomes the problem of frequent data movement between processing and memory units in traditional von Neumann processors by carrying out computations within the memory core using its peripheral circuitry.
A. Kumar, S. M. Beeraka, and B. Gupta are with the Department of Electronics and Communication Engineering, National Institute of Technology, Patna, Bihar, INDIA. abhash2205@gmail.com
J. Singh is with the Electrical Engineering Department, Indian Institute of Technology Patna, Bihar, INDIA. dr.jawar@gmail.com
The artificial neural network (ANN) is one of the most widely used tools for AI and ML applications because it resembles and mimics the human brain. It is used in a wide variety of AI/ML applications because of its self-learning ability to produce outputs that are not limited to the inputs provided to it. Most of these AI/ML algorithms are software based and pose energy and delay optimization challenges for real-time hardware implementation. The major limitations of hardware-based solutions are the large number of interconnections, massive parallelism, and time-consuming calculations that require huge amounts of data for training the networks and related algorithms. In-memory computation is therefore one of the most preferred and efficient ways to realize such complex networks in hardware. Researchers have come up with different in-memory implementations of popular machine learning networks and algorithms such as Convolutional Neural Networks (CNNs)[2], [3], [4], Deep Neural Networks (DNNs)[5], [6], and machine learning classifiers[7], [8]. A large number of machine learning network architectures use a modified form of the artificial neural network, for example in the recurrent neural network (RNN), in the fully connected layer of a CNN, or in DNNs, which are the foundation of deep learning. These networks have brought a remarkable improvement in learning and predicting the complex patterns of large data sets, a task that would otherwise be very difficult even for Graphics Processing Units (GPUs), which still require on-chip or off-chip memory access for inference and training despite their parallel computation.
Most AI/ML algorithms require processing of large data sets, and the energy cost involved is mostly governed by frequent memory access[9], especially for dense networks such as DNNs[10], [11]. In DianNao[11] and Eyeriss[10], data reuse has been an efficient and highly effective way to save energy; however, it still results in frequent on-chip memory access that accounts for 35% to 45% of total energy dissipation. Other techniques, such as reducing the precision of parameters to as little as 1-b during inference[12], [13], [14], can address the energy and latency issues to some extent, but they trade off accuracy. Further, low-power digital implementations have been employed to reduce energy cost, such as the power gating technique for speech recognition[5], RAZOR[15] for Internet of Things (IoT) applications[6], and architectural designs such as a dynamic-voltage-accuracy-frequency-scalable CNN processor[3]. They exploit the scalability and programmability of digital implementations, but they miss the opportunities that lie in the analog domain for implementing AI/ML algorithms. These opportunities arise from the unique data flow of an ANN during feedforward and backpropagation, which can be exploited to design energy-efficient and high-throughput in-memory ANN architectures in the analog domain.
arXiv:2005.09526v1 [eess.SP] 19 May 2020
To exploit the opportunities available in the analog domain for reducing the energy and delay cost of in-memory AI/ML algorithms, an in-memory inference processor based on the deep in-memory architecture (DIMA)[16], [17] has been presented earlier. DIMA stores binary data in a column-major format, as opposed to the row-major format employed in traditional Static Random Access Memory (SRAM) array organization. It reduces energy cost and increases throughput by simultaneously accessing multiple rows of standard six-transistor (6T) SRAM cells per precharge cycle through the application of pulse-width-modulated signals to the word lines (WLs). DIMA has previously been used for AI/ML algorithms such as CNNs[2], and its versatility has been established by mapping ML algorithms for template matching[18] and architectures of sparse distributed memory[19], [20]. A multi-functional in-memory inference processor[16] based on DIMA achieves a 53× reduction in energy delay product (EDP) and supports four algorithms: support vector machine, k-nearest neighbour, template matching, and matched filter. However, the biggest disadvantage of DIMA is that it cannot be used for supervised learning algorithms that require training of the network, because its underlying hardware does not support on-chip training. Recent work on DNNs[21] deals with weight update, but it requires additional buffers for storing the output before performing convolution at each stage; the frequent storage and access of data in these buffers increases both energy cost and latency.
In this paper, a DIMA-based memory core is employed with proposed peripheral circuitry to realize an in-memory ANN architecture. The memory read cost of the proposed architecture is reduced via the DIMA functional read (DIMA-FR) process, which accesses multiple rows of the SRAM bit cell array (BCA) per precharge cycle. Most hardware-based AI/ML architectures do not support on-chip training, and even those that do are not able to avoid frequent access to the memory core during each training iteration, which contributes a large proportion of the total energy and delay. To alleviate this memory access bottleneck, the proposed architecture uses sampling capacitors that store the weights temporarily and avoid frequent memory access during each iteration. The proposed architecture provides on-chip hardware support for feedforward (FF), backpropagation (BP), error calculation, and weight updation. Further, it supports a multilayered and scalable architecture by cascading the proposed single-layer neural network. Each of these layers communicates with its preceding and next layer in the same way a neural network does. Moreover, a large variety of AI/ML algorithms can be implemented using the proposed architecture by just changing the activation function block and the number of memory banks according to the requirements. Overall, the proposed architecture mimics the way a neural network works, and it is configurable to realize a variety of neural-network-based AI/ML algorithms. Simulation results demonstrate that the proposed architecture achieves ≈ 1.04× reduction in energy delay product (EDP) and ≈ 46× reduction in energy per multiply-and-accumulate (MAC) operation. In contrast to existing in-memory computing architectures, the key contributions of the proposed architecture are the following:
1) Hardware support for on-chip training: Hardware support for the feedforward and backpropagation processes is designed in the periphery of the SRAM bitcell array and mimics the way a neural network works. Further, the weight update mechanism has been developed without intermediate buffers, avoiding the latency and energy cost of reading and writing such buffers. Moreover, appropriate control signals have been designed for parallelizing the weight update process of the whole network.
2) On-chip error calculation: An on-chip mechanism for calculating the sum of squares of error has been developed; the gradients of this error with respect to the weights are backpropagated for network training.
3) Signed flash ADC design: A signed flash ADC is designed and presented to store the updated signed weights back inside the SRAM memory core; it follows the 1's complement conversion method for negative weights.
II. BACKGROUND
This Section discusses the pre-requisites for the develop-
ment of artificial neural network[22] and hardware implemen-
tation of deep in-memory architecture (DIMA)[16], [17].
A. Neural Network
The neural network is a complex network of neurons in which each connection is characterized by a synaptic weight. There are three basic elements of a neuron:
(1) a set of synapses or connecting links, each characterized by a weight,
(2) an adder for summing the products of the input signals and the corresponding synaptic weights, and
(3) an activation function, typically sigmoid, tanh, or ReLU, that limits the output of a neuron.
Fig. 1 shows the generalized multilayered perceptron neural network with the signals at each layer during feedforward (in black) and backpropagation (in red). Both processes are shown in the same network for a better understanding of the signal flow. In Fig. 1, $N_K$ (for $K \in W$, where $W$ denotes the set of whole numbers) is the total number of neurons in the $K$th layer ($K = 0$ corresponds to the input layer). The symbol $w^{[K]}_{ij}$ denotes the weight of the synaptic connection to the $j$th neuron (of the $K$th layer) from the $i$th neuron (of its previous layer). In this paper, the subscripts $i$, $j$, $k$, and $l$ (for $i, j, k, l \in W$) index the properties/signals associated with a particular neuron, and the superscript $[K]$ indicates the layer with which they are associated. A paired subscript, such as '$ij$', shows that the property is associated with the connection from the $i$th neuron of any layer to the $j$th neuron of its next layer. The training of the network takes place in two phases using the backpropagation algorithm[22].
[Figure 1: multilayered perceptron with an input layer ($x_i = a^{[0]}_i$), a 1st hidden layer ($a^{[1]}_j$), a 2nd hidden layer ($a^{[2]}_k$), and an output layer ($a^{[3]}_l = y_l$ with errors $e_l$), connected by the weights $w^{[1]}_{ij}$, $w^{[2]}_{jk}$, and $w^{[3]}_{kl}$.]
Fig. 1. Network showing a multilayered perceptron with two hidden layers. The signals in black belong to the feedforward process, and the signals in red show the backpropagation from the output layer to the kth neuron of the 2nd hidden layer.
Forward Phase: In the forward phase, the weights of the network are initialized and the input signals are propagated layer by layer through the entire network until they reach the output layer. From Fig. 1, the output of the $K$th layer is obtained by the following matrix multiplication followed by the application of the activation function:
$$
\begin{bmatrix} a^{[K]}_1 \\ a^{[K]}_2 \\ \vdots \\ a^{[K]}_{N_K} \end{bmatrix}
= \varphi\left(
\begin{bmatrix}
w^{[K]}_{11} & w^{[K]}_{21} & \cdots & w^{[K]}_{N_{K-1}1} \\
w^{[K]}_{12} & w^{[K]}_{22} & \cdots & w^{[K]}_{N_{K-1}2} \\
\vdots & \vdots & \ddots & \vdots \\
w^{[K]}_{1N_K} & w^{[K]}_{2N_K} & \cdots & w^{[K]}_{N_{K-1}N_K}
\end{bmatrix}
\begin{bmatrix} a^{[K-1]}_1 \\ a^{[K-1]}_2 \\ \vdots \\ a^{[K-1]}_{N_{K-1}} \end{bmatrix}
\right) \tag{1}
$$
where $\varphi(\cdot)$ is the activation function at the output of the neurons in the $K$th layer.
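As an illustration of Eq. 1, the following minimal Python sketch computes the output of one layer. The names `sigmoid` and `layer_forward` and the nested-list layout `w[i][j]` (weight from neuron i of the previous layer to neuron j of this layer) are assumptions made for this example and are not part of the proposed hardware; the sigmoid is only one of the activation functions mentioned above.

```python
import math


def sigmoid(h):
    """One possible activation function (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-h))


def layer_forward(w, a_prev, phi=sigmoid):
    """Evaluate Eq. 1 for a single layer.

    w[i][j] is the weight from neuron i of the previous layer to
    neuron j of this layer, so row j of the matrix in Eq. 1 is the
    column w[0][j], w[1][j], ... of this nested list.
    """
    n_prev = len(a_prev)
    n_out = len(w[0])
    # Activation potential of each neuron j: sum over all inputs i.
    h = [sum(a_prev[i] * w[i][j] for i in range(n_prev)) for j in range(n_out)]
    return [phi(v) for v in h]


# Three inputs feeding two neurons; identity activation to expose the sums.
potentials = layer_forward([[1, 2], [3, 4], [5, 6]], [1, 1, 1], phi=lambda v: v)
```

With the identity activation the call returns the two activation potentials directly, which makes the index convention of Eq. 1 easy to check by hand.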
Backward Phase: Once the final output is available from the forward phase, the error is calculated by comparing the final output with the target (or intended) output. The error at any output neuron is calculated as $e_l = t_l - y_l$, where $y_l$ is the current output and $t_l$ is the target (or intended) output at the $l$th output neuron. Many cost functions are available for calculating the total error of the network; the most popular is the sum of squares of error, given as:
$$
E = \frac{1}{2} \sum_{l=1}^{N_3} e_l^2 \tag{2}
$$
In Eq. 2, the squared error is summed over all the output neurons in Fig. 1. The resulting error is propagated through the network in the backward direction, and successive adjustments are made to the synaptic weights using the gradient descent algorithm, which states that each weight should be moved in the direction of the negative gradient of the error landscape to minimize the error of the network. Mathematically, the weight update of the synaptic connections to the $K$th layer is given as:
$$
w^{[K]}_{jk} \leftarrow w^{[K]}_{jk} - \eta \frac{\partial E}{\partial w^{[K]}_{jk}} = w^{[K]}_{jk} + \Delta w^{[K]}_{jk} \tag{3}
$$
where $E$ is the error calculated in Eq. 2, $\eta$ is the learning rate, and $\Delta w^{[K]}_{jk} \left(= -\eta\, \partial E / \partial w^{[K]}_{jk}\right)$ is the change in weight required to reduce the error of the network. Now, depending upon where the neuron is situated in the network, two cases arise. In the first case, the neuron is at the output layer of the ANN. The desired response at each output node is supplied during training, which makes this case straightforward to handle from the error calculation and weight updation point of view. In the second case, the neuron is situated at a hidden layer. Hidden layer neurons are not directly accessible, but they do share the responsibility for the total error of the network. At the same time, their desired response is not known beforehand, which makes the estimation of their contribution to the total error more difficult. Both cases are discussed in detail as follows:
• CASE I: 'K' denotes an output layer. The desired response at an output neuron is known in this case, as discussed earlier, so Eq. 2 (and Fig. 1) can be used directly to calculate the weight change as:
$$
\Delta w^{[3]}_{kl} = \eta\,(t_l - y_l)\,\varphi'^{[3]}\!\left(h^{[3]}_l\right) a^{[2]}_k = \eta\,\delta^{[3]}_l\, a^{[2]}_k \tag{4}
$$
where $h^{[3]}_l = \sum_{k=1}^{N_2} a^{[2]}_k w^{[3]}_{kl}$ is the activation potential at the $l$th neuron of the output layer, $\delta^{[3]}_l = (t_l - y_l)\,\varphi'^{[3]}\!\left(h^{[3]}_l\right)$ is the local gradient at the $l$th neuron of the output layer of Fig. 1, and $\varphi'^{[3]}(\cdot)$ is the derivative of the activation function.
• CASE II: 'K' denotes a hidden layer. There is no specified desired response in this case, hence the error signal has to be determined recursively, working backward in terms of the error signals of all the neurons to which that hidden neuron is directly connected. Using Eq. 2 and applying the chain rule, the simplified result[22] can be obtained as:
$$
\Delta w^{[2]}_{jk} = \eta \left( \sum_{l=1}^{N_3} \delta^{[3]}_l w^{[3]}_{kl} \right) \varphi'^{[2]}\!\left(h^{[2]}_k\right) a^{[1]}_j = \eta\,\delta^{[2]}_k\, a^{[1]}_j \tag{5}
$$
where $\delta^{[2]}_k = \left( \sum_{l=1}^{N_3} \delta^{[3]}_l w^{[3]}_{kl} \right) \varphi'^{[2]}\!\left(h^{[2]}_k\right)$ is the local gradient at the $k$th neuron of the 2nd layer of Fig. 1. Similarly, the weight update for the 1st layer is given as:
$$
\Delta w^{[1]}_{jk} = \eta\,\delta^{[1]}_k\, a^{[0]}_j \tag{6}
$$
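The chain-rule step behind the output-layer weight change can be written out explicitly. The following is a sketch that uses only the definitions given above ($e_l = t_l - y_l$, $y_l = \varphi^{[3]}(h^{[3]}_l)$, and $h^{[3]}_l = \sum_k a^{[2]}_k w^{[3]}_{kl}$); the intermediate factorization is the standard one and is not quoted from the source.

```latex
\begin{align*}
\frac{\partial E}{\partial w^{[3]}_{kl}}
  &= \frac{\partial E}{\partial e_l}\,
     \frac{\partial e_l}{\partial y_l}\,
     \frac{\partial y_l}{\partial h^{[3]}_l}\,
     \frac{\partial h^{[3]}_l}{\partial w^{[3]}_{kl}} \\
  &= e_l \cdot (-1) \cdot \varphi'^{[3]}\!\left(h^{[3]}_l\right) \cdot a^{[2]}_k , \\
\Delta w^{[3]}_{kl}
  &= -\eta\,\frac{\partial E}{\partial w^{[3]}_{kl}}
   = \eta\,(t_l - y_l)\,\varphi'^{[3]}\!\left(h^{[3]}_l\right) a^{[2]}_k .
\end{align*}
```

The factor $-1$ from $\partial e_l / \partial y_l$ cancels the minus sign of the gradient descent rule, which is why the weight change carries a positive sign in front of $\eta$.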
This is how error is propagated backward in the network. This
back propagation of error is shown in red color in Fig. 1. The
above derivations are described for a multilayered perceptron
neural network with only two hidden layers, however, the same
concept can be extended to any number of hidden layers.
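The weight changes for the output and hidden layers can be tried out in software. The sketch below is a plain-Python model, assuming sigmoid activations in every layer and a nested-list weight layout; the function names (`forward`, `sse`, `weight_changes`, `train_step`) are hypothetical and it does not model the analog circuits of the proposed architecture.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def dsigmoid(h):
    s = sigmoid(h)
    return s * (1.0 - s)


def forward(weights, x):
    """Return activations a[0..L] and activation potentials h[1..L].

    weights[K-1][i][j] is the weight from neuron i of layer K-1
    to neuron j of layer K.
    """
    a, h = [list(x)], [None]
    for w in weights:
        prev = a[-1]
        hk = [sum(prev[i] * w[i][j] for i in range(len(prev)))
              for j in range(len(w[0]))]
        h.append(hk)
        a.append([sigmoid(v) for v in hk])
    return a, h


def sse(y, t):
    """Sum of squares of error, E = 1/2 * sum(e_l ** 2)."""
    return 0.5 * sum((tl - yl) ** 2 for tl, yl in zip(t, y))


def weight_changes(weights, x, t, eta):
    """Weight change eta * delta_j * a_i for every layer."""
    a, h = forward(weights, x)
    # Local gradient at the output layer: (t - y) * phi'(h).
    delta = [(tl - yl) * dsigmoid(hl) for tl, yl, hl in zip(t, a[-1], h[-1])]
    dws = [None] * len(weights)
    for K in range(len(weights), 0, -1):
        w = weights[K - 1]
        dws[K - 1] = [[eta * delta[j] * a[K - 1][i] for j in range(len(delta))]
                      for i in range(len(a[K - 1]))]
        if K > 1:
            # Hidden-layer local gradient: weighted sum of the next
            # layer's gradients times phi'(h) of this layer.
            delta = [sum(delta[j] * w[i][j] for j in range(len(delta)))
                     * dsigmoid(h[K - 1][i]) for i in range(len(a[K - 1]))]
    return dws


def train_step(weights, x, t, eta=0.1):
    """Apply w <- w + delta_w once and return the new weights."""
    dws = weight_changes(weights, x, t, eta)
    return [[[w[i][j] + dw[i][j] for j in range(len(w[0]))]
             for i in range(len(w))] for w, dw in zip(weights, dws)]
```

Calling `train_step` repeatedly on one sample moves the weights along the negative gradient of `sse`, and the same loop over layers extends to any number of hidden layers.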
B. Basics of In-Memory Architecture
The main and most mature element of in-memory computing is the standard six-transistor (6T) static random access memory (SRAM) cell/array, in which most of the computations are performed in parallel and in the mixed-signal domain, resulting in a significant improvement in throughput and energy efficiency. Recently, a 6T-SRAM-based deep in-memory architecture (DIMA)[16] has demonstrated very good energy efficiency for many intelligent and data-intensive applications. Fig. 2(a) shows the conventional DIMA architecture. DIMA uses the standard SRAM bit cell array (BCA) with read/write circuitry at the bottom and stores the $B_W$ bits of each weight in a column-major format, as opposed to the row-major format used in a conventional SRAM array. DIMA consists of four processes that are executed sequentially:
(1) Multi-row Functional READ (FR): performs digital-to-analog conversion of the weights $w$ (indices $i$, $j$, $k$, $l$ and $[K]$ are omitted for simplicity) stored in standard 6T SRAM cells by fetching the $B_W$ bits of a weight, which is achieved by simultaneously reading $B_W$ rows per precharge cycle,
(2) Bit Line Processing (BLP): calculates scalar distances (SDs) by carrying out mathematical operations (such as addition, subtraction, multiplication, etc.) between the weights stored in the SRAM cells and the applied input signal,
(3) Cross BLP (CBLP): sums the scalar distances (SDs) via charge sharing across the required columns of the bit cell array to calculate vector distances (VDs), such as the dot product if the SD in the BLP stage is multiplication, or the Manhattan distance if the SD in the BLP stage is the absolute difference, and
(4) Analog-to-Digital Converter (ADC) and Residual Digital Logic (RDL): realizes the thresholding/decision function and carries out analog-to-digital conversion of the results of the previous analog computations.
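The four stages above can be mimicked by a purely numerical model. The following Python sketch is an idealization written for illustration: the function names, the LSB-first bit order, the plain sum used for the CBLP stage (real charge sharing also scales the result), and the simple threshold decision are all assumptions, not details taken from ref. [16].

```python
def functional_read(bits):
    """FR: turn one column of stored bits (LSB first) into a single
    value, standing in for the analog equivalent of the weight."""
    return sum(b << i for i, b in enumerate(bits))


def blp(weight, x, op):
    """BLP: scalar distance between one weight and one input."""
    if op == "multiply":
        return weight * x
    if op == "abs_diff":
        return abs(weight - x)
    raise ValueError("unsupported scalar distance: " + op)


def cblp(scalar_distances):
    """CBLP: aggregate the scalar distances across columns
    (idealized here as a plain sum)."""
    return sum(scalar_distances)


def adc_rdl(vector_distance, threshold):
    """ADC and RDL: digitize and apply a threshold decision."""
    return 1 if vector_distance >= threshold else 0


def dima_pipeline(columns, inputs, op, threshold):
    """Run FR -> BLP -> CBLP -> ADC/RDL in sequence."""
    weights = [functional_read(col) for col in columns]
    sds = [blp(w, x, op) for w, x in zip(weights, inputs)]
    vd = cblp(sds)
    return vd, adc_rdl(vd, threshold)


# Two 4-bit columns holding the weights 5 and 3.
cols = [[1, 0, 1, 0], [1, 1, 0, 0]]
dot, decision = dima_pipeline(cols, [2, 4], "multiply", threshold=20)
manhattan, _ = dima_pipeline(cols, [2, 4], "abs_diff", threshold=5)
```

Switching `op` from `"multiply"` to `"abs_diff"` changes the vector distance from a dot product to a Manhattan distance without touching the other stages, which is the configurability the stage list describes.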
Fig. 2(b) shows the multi-row functional read process of DIMA. The multi-row functional read stage yields a voltage discharge $\Delta V_{BL}$ proportional to the weighted sum $\bar{w} = \sum_{i=0}^{B_W-1} 2^i \bar{b}_i$ of the column-major stored digital data $\{b_0, b_1, \ldots, b_{B_W-1}\}$ by applying binary-weighted pulse-width-modulated signals $T_i \propto 2^i$ ($i \in [0, B_W-1]$) to $B_W$ rows of the SRAM array simultaneously. The voltage drop on the bitline (BL) as a function of the weight is given by:
$$
\Delta V_{BL} = \frac{V_{PRE}}{R_{BL} C_{BL}}\, T_0 \sum_{i=0}^{B_W-1} 2^i \bar{b}_i = \Delta V_{lsb}\, \bar{w} \tag{7}
$$
where $\Delta V_{lsb} = \frac{V_{PRE}}{R_{BL} C_{BL}} T_0$, $V_{PRE}$ is the precharge voltage on the BL and its complement (BLB) lines, $R_{BL}$ and $C_{BL}$ are the discharge path resistance and capacitance, respectively, and $\bar{w}$ is the decimal equivalent of the 1's complement of the weight stored in the SRAM cell array in column-major format. Similarly, the voltage drop on the complement of BL (i.e., BLB) can be expressed by simply replacing $\bar{w}$ with $w$ in Eq. 7:
$$
\Delta V_{BLB} = \frac{V_{PRE}}{R_{BL} C_{BL}}\, T_0 \sum_{i=0}^{B_W-1} 2^i b_i = \Delta V_{lsb}\, w \tag{8}
$$
It is important to note that Eqs. 7 and 8 are valid only for $T_i \ll R_{BL} C_{BL}$. A sub-ranged read technique[16] can be employed to improve the linearity of the FR process when $B_W > 4$: the $B_W/2$ most significant bits (MSBs) and the $B_W/2$ least significant bits (LSBs) of each weight are stored in adjacent columns of the BCA. For this, the MSB and LSB BL capacitances $C_M$ and $C_L$ are required to be in the ratio $2^{B_W/2} : 1$. The MSB and LSB columns are first read separately and then merged, with the MSB column weighted more heavily than the LSB column through the ratioed capacitors $C_M$ and $C_L$.
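A small numerical model helps to see how the bitline discharges relate to the stored bits. The Python sketch below evaluates the linear discharge model of Eqs. 7 and 8 for one column; the component values are hypothetical, and the charge-sharing formula in `subranged_merge` is an assumption about how two capacitors in the stated ratio would combine, not a circuit description from the source.

```python
def delta_v_lsb(v_pre, r_bl, c_bl, t0):
    """Voltage step per LSB: V_PRE * T_0 / (R_BL * C_BL)."""
    return v_pre * t0 / (r_bl * c_bl)


def functional_read(bits, v_pre, r_bl, c_bl, t0):
    """Idealized multi-row functional read of one column.

    bits holds the stored bits b_0 .. b_{BW-1}, LSB first.
    Returns (dV_BL, dV_BLB) following Eqs. 7 and 8. The linear model
    only holds while every pulse width T_i = 2**i * t0 is much
    shorter than r_bl * c_bl; that condition is not checked here.
    """
    lsb = delta_v_lsb(v_pre, r_bl, c_bl, t0)
    w = sum((2 ** i) * b for i, b in enumerate(bits))            # stored weight
    w_bar = sum((2 ** i) * (1 - b) for i, b in enumerate(bits))  # 1's complement
    return lsb * w_bar, lsb * w


def subranged_merge(v_msb, v_lsb, bw):
    """Assumed charge-sharing merge with C_M : C_L = 2**(bw/2) : 1."""
    ratio = 2 ** (bw // 2)
    return (ratio * v_msb + v_lsb) / (ratio + 1)


# Hypothetical component values, chosen only so that T_i << R_BL * C_BL.
V_PRE, R_BL, C_BL, T0 = 1.0, 1e5, 2e-13, 1e-10
dv_bl, dv_blb = functional_read([1, 0, 1, 0], V_PRE, R_BL, C_BL, T0)  # w = 5
```

Because $w + \bar{w} = 2^{B_W} - 1$, the two discharges in this model always add up to $(2^{B_W} - 1)\,\Delta V_{lsb}$, so either line carries the full information about the stored weight.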
However, the inference processor presented in ref. [16] supports four algorithms (Support Vector Machine, Template Matching, k-Nearest Neighbour, and Matched Filter) that have merely been mapped onto it, without any discussion of how to train the network on the chip. Further, the peripheral circuits around the memory core in the standard DIMA architecture lack hardware support for the backpropagation algorithm and the weight update mechanism, which are crucial for training any neural network. Also, frequent access to the SRAM bitcell array during the calculation of the scalar distance (SD) at each iteration in the conventional DIMA architecture[16] leads to a significant amount of energy dissipation and latency. To realize a hardware implementation of an ANN while addressing the above issues of DIMA and of the other architectures discussed in the introduction, an in-memory ANN architecture is proposed in the next Section.
[Figure 2: three-panel block diagram, panels (a), (b), and (c). Labelled blocks include the SRAM array, FR row decoder, precharge and memory write circuitry, M:1 MUX, sense amplifiers (SA), MUX & buffer, K-b bus, BLP and CBLP blocks, input buffer (X), ADC & RDL, decision (y), the 6T SRAM cell (M1 to M6), the general purpose memory core, the memory core for storing ANN weights, and the proposed circuitry for ANN.]
Fig. 2. (a) Conventional deep in-memory architecture (DIMA)[16]. (b) Memory access via the functional read (FR) process of DIMA for four bits per weight ($B_W = 4$). (c) Generalized block diagram of the proposed architecture, which employs the FR of DIMA for accessing weights stored in the last $B_W$ rows of the SRAM bitcell array (BCA).
TABLE I
ACRONYMS AND NOTATIONS
Notation          Description
ADC               Analog to Digital Converter
ANN               Artificial Neural Network
BCA               Bit Cell Array
BL                Bit Line
BLB               Bit Line Bar
$B_W$             No. of bits per weight
DIMA              Deep In-Memory Architecture
FR                Functional Read
$M = N_{K-1}$     No. of inputs to the Kth layer
$N = N_K$         No. of outputs of the Kth layer
$N_{bank,K}$      No. of banks for the Kth layer
$N_{col,K}$       No. of columns per memory bank for the Kth layer
$N_{row}$         No. of rows in the bit cell array
SA                Sense Amplifier
SWC               Signed Weight Calculation
SM                Signed Multiplier
$M_T$ lines       M-transmission lines
$N_T$ lines       N-transmission lines
WL                Word Line
WU                Weight Updation
III. THE PROPOSED IN-MEMORY ANN ARCHITECTURE
In this Section, the proposed in-memory artificial neural network architecture is presented. First, the detailed architecture and working of a single layer of a neural network are discussed, and then the design is extended to the multilayered perceptron, which consists of interconnections of many such layers. Fig. 2(c) shows a generalized block diagram of a single DIMA-based memory bank, the basic building block of the proposed in-memory ANN architecture. Each memory bank corresponds to a single output neuron. As discussed in the previous Section, the peripheral BCA circuitry of the traditional DIMA architecture is incapable of carrying out on-chip training of the network; therefore, we have incorporated additional peripheral circuitry that facilitates on-chip training and supports both the feedforward and backpropagation processes for realizing the in-memory ANN. The proposed in-memory ANN architecture uses the last $B_W$ rows of the BCA to store the weights (in column-major format) of the synaptic connections of an ANN; these weights are accessed via the multi-row FR process, which yields the analog equivalent of the stored weight as shown in Fig. 2(b). The rest of the memory space, i.e., $[N_{col,K} \times (N_{row} - B_W)]$ bits of the BCA, can be used as a general-purpose digital storage medium.
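The storage layout described here can be pictured with a small software model. In the Python sketch below, a bank is a list of rows; the helper names and the choice of which of the last $B_W$ rows holds the least significant bit are assumptions made for illustration, and only unsigned bit patterns are handled.

```python
def make_bank(n_row, n_col):
    """Bit cell array modelled as n_row rows of n_col bits, all zero."""
    return [[0] * n_col for _ in range(n_row)]


def store_weights(bank, weights, bw):
    """Write one bw-bit pattern per column into the last bw rows.

    Column-major format: all bits of a weight sit in the same column.
    Row (n_row - bw + i) is assumed to hold bit i of every weight.
    """
    n_row = len(bank)
    for col, w in enumerate(weights):
        for i in range(bw):
            bank[n_row - bw + i][col] = (w >> i) & 1


def read_weight(bank, col, bw):
    """Recombine the bw bits of one column, as a functional read would."""
    n_row = len(bank)
    return sum(bank[n_row - bw + i][col] << i for i in range(bw))


def general_purpose_bits(n_row, n_col, bw):
    """Bits left for ordinary storage: N_col * (N_row - B_W)."""
    return n_col * (n_row - bw)


bank = make_bank(n_row=8, n_col=3)
store_weights(bank, [5, 12, 0], bw=4)
```

In this 8 × 3 example the first four rows stay untouched and remain available as ordinary digital storage, while each column of the last four rows holds one 4-bit weight.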
Fig. 3 shows the proposed in-memory ANN architecture for an arbitrary layer of an ANN. We use $M$ and $N$ as general terms for the number of inputs and outputs, respectively, of the $K$th layer. As Fig. 3 shows, the BCA is divided into multiple banks to parallelize the computations involved in each layer of the ANN. Each bank stores the weights of the connections from the outputs of the previous layer to the input of one neuron in the present layer. Each bank consists of $M$ columns for storing the weights of the connections from the $M$ inputs to one output of this layer. Further, the number of such banks, $N_{bank,K}$, depends on the number of neurons at the output of this layer, which is assumed to be $N$ in our case. The FR process performs the digital-to-analog conversion of the weights by discharging the BLB and BL lines by amounts proportional to the decimal equivalent of the weight ($w$) and of its 1's complement ($\bar{w}$), which can be derived from Eqs. 8 and 7, respectively. Since the weights can be either positive or negative, a signed weight calculation (SWC) unit is designed to generate these weights as a proportional negative voltage (for negative weights) or positive voltage (for positive weights). The signed weights thus generated are sent to the weight updation (WU) unit, where they are updated in each iteration, and then to the signed multiplier (SM) unit. These peripheral circuits are essential for on-chip training and make the proposed approach energy efficient and real time.
[Figure 3: (a) multi-bank layout with Bank 1 to Bank N, each with M columns storing $B_W$-bit weights, M SWC & WU units, M SM units, an activation potential block, and an activation function and its derivative block, together with the backpropagation block, ADC, precharge and memory write circuitry, and the $M_T$ and $N_T$ lines to the previous and next layers. (b) The kth memory bank with its SWC and WU units, marking ① the memory core for general purpose digital storage and ② the memory core for storing weights associated with the ANN.]
Fig. 3. (a) Block diagram showing the proposed in-memory multi-bank implementation of an ANN. The transmission lines (T. lines) carrying forward-propagating signals are shown in blue, and those carrying backpropagating signals are shown in red. (b) A single DIMA-based memory bank connected to M SWC & WU units.
The input vector $X = [a_1, a_2, \ldots, a_M]$ of this layer (the indices $i$, $j$, $k$, $l$ and $[K]$ are omitted to avoid confusion, and $[a_1, a_2, \ldots, a_M]$ is taken as the input to any general layer, i.e., the $K$th layer) is fetched via the $M$-transmission ($M_T$) lines, each of which stores an analog voltage proportional to one input of the layer, as shown in Fig. 3(a). These inputs are sent to the SM units, where they are multiplied with the signed weights calculated by the SWC units during the feedforward process to generate the products of inputs and signed weights. To generate the activation potential, these individual products must be summed at each output neuron, i.e., $\sum_{j=1}^{M} a^{[K-1]}_j w^{[K]}_{jk}$, for which current-based summation of the weighted inputs is employed inside the activation potential block (discussed in the upcoming sub-sections). The output of the activation potential block is sampled and stored on a capacitor to avoid regenerating it during backpropagation, where it is used for the weight update. This activation potential is fed to the activation function block, which produces the final output of the layer. The outputs of the layer are sent to the next layer through the $N_T$ lines.
During backpropagation, the signals traverse the various components of the network and are governed by Eqs. 4, 5, and 6. The weighted sum of the local gradients of any layer (the weights applied to the local gradients are those of the synaptic connections between that layer and its previous layer) is propagated backward to the previous layer, where it is used, together with that layer's input and the learning rate, for the weight update. Also, from Eqs. 4, 5, and 6, the local gradient at any neuron in the present layer is generated by multiplying the weighted sum of the local gradients coming from the next layer with the derivative of the activation function of the present-layer neuron. In our proposed architecture, the weighted sum of the local gradients of any layer, i.e., the $K$th layer, is generated via current-based summation inside the backpropagation block and is propagated to the previous layer through the outgoing $M_T$ lines, as shown in Fig. 3(a). For the weight update, the local gradient of this layer is multiplied with the input to this layer and sent to the weight updation (WU) unit. The learning factor $\eta$ can be controlled through the gain at the output of the multiplier that feeds the WU unit. During the feedforward and backpropagation processes, different signals need to be multiplied together, and this selection is controlled by the control signal $S[1:0]$. Once the training of the network finishes, the WU unit holds all the trained weights. These trained weights need to be stored in the BCA memory core because they are analog and susceptible to noise: if the weights drift due to noise or charge leakage, the entire network has to be trained again to avoid accuracy loss. To address this weight drifting issue, a signed flash ADC is designed and presented that converts the signed weights to the digital domain using 1's complement conversion for negative weights. This format is chosen because the BL line has a voltage discharge proportional to the decimal equivalent of the 1's complement of the stored weight, so the FR process can recover the absolute value of a negative weight, which can then be converted to a proportional negative voltage. The output of the signed flash ADC is stored in the SRAM BCA. The upcoming Subsections present a detailed description of these blocks, which support the in-memory ANN essentials for on-chip training.
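The reason 1's complement storage suits the functional read can be seen in a short digital model. The Python sketch below is illustrative only: the function names are hypothetical, and treating the most significant bit as the sign indicator that chooses between the BL and BLB readings is an assumption about how the sign could be detected, not a description of the signed flash ADC or SWC circuits.

```python
def encode_ones_complement(value, bw):
    """Encode a signed integer as bw bits (LSB first) in 1's complement."""
    limit = 2 ** (bw - 1) - 1
    if not -limit <= value <= limit:
        raise ValueError("value does not fit in the given word length")
    mag = abs(value)
    # A negative value is stored as the bitwise complement of its magnitude.
    word = mag if value >= 0 else (~mag) & (2 ** bw - 1)
    return [(word >> i) & 1 for i in range(bw)]


def read_signed(bits):
    """Recover the signed weight from the two bitline readings.

    w     models the BLB discharge (proportional to the stored word).
    w_bar models the BL discharge (proportional to its 1's complement),
    which equals the magnitude when the stored weight is negative.
    """
    w = sum(b << i for i, b in enumerate(bits))
    w_bar = sum((1 - b) << i for i, b in enumerate(bits))
    negative = bits[-1] == 1  # assumed sign indicator (MSB)
    return -w_bar if negative else w


stored = encode_ones_complement(-5, 4)
recovered = read_signed(stored)
```

For a negative weight the complemented reading directly gives the magnitude, so no subtraction or extra arithmetic is needed to obtain the absolute value in this model.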
A. Signed Weight Calculation (SWC) Unit
Fig. 3(b) shows a single-bank DIMA with SWC and weight
updation (WU) units for generating the signed weights and updating
them, respectively. The weights of synaptic connections can
be either negative or positive, so the hardware
must be equipped with a suitable scheme to
realize a signed dot product. A signed dot product calculator
was proposed earlier in ref. [2]. It has two rails,
a CBLP+ rail (for positive products) and a CBLP− rail (for negative
products), shared among multiple columns, each corresponding
to the output of a BLP block. The BLP block was used
there to select the magnitude of the weight and then calculate
the product of the input vector with the weights stored in SRAM.
These products were then shared with the CBLP+ rail for positive
weights and the CBLP− rail for negative weights. The weights
were accessed via the DIMA FR process to get a proportional
analog equivalent of each weight. The selection of the BL or BLB
line for getting the absolute value of the weight, and the selection of the
CBLP+ or CBLP− rail depending on the sign of the product, were
done using control circuitry inside the BLP block itself. The
rails were then passed through the ADC and fed to a
subtractor block, which calculates the difference between them
to get the vector distance. However, this approach is inefficient for
updating the weights during each iteration (on-chip training),
because we perform the computations in the analog domain to
ease weight updation by storing intermediate
weights in capacitors (discussed in the next Sub-section).
If the computations were done in the digital domain, additional
registers/buffers would be required to store the intermediate
weights during training, consuming a large silicon area for
dense networks involving a large number of weights.
Also, the methodology employed in ref. [2] for the signed dot
product cannot be used for our purpose, since it uses the
changes in voltage of the BLB and BL lines, which are proportional
to the decimal equivalent of the weight and of its 1's complement,
respectively, and which cannot be added/subtracted directly in the
analog domain.
Therefore, a signed weight calculation (SWC) unit is designed,
as shown in Fig. 4. It works on the principle that one
of the lines, BL (for a negative weight) or BLB (for a
positive weight), will have a voltage discharge proportional to
the absolute magnitude of the weight (from Eqs. 7 and
8), given our earlier assumption that signed weights stored in the
BCA follow the 1's complement scheme for negative weights.
This magnitude has to be converted to a proportional negative
or positive voltage. For this, first the sign SW and Vmux are
generated with the help of the circuitry shown in Fig. 4(a); the
corresponding expressions are:
$$ S_W = \begin{cases} 0, & \text{for } V_{BLB} \ge V_{BL}, \text{ i.e., when } w \ge 0 \\ 1, & \text{for } V_{BLB} < V_{BL}, \text{ i.e., when } w < 0 \end{cases} \tag{9} $$
The output of MUX, Vmux is selected from among the BL and
BLB since one of them will have absolute value of weight in
terms of voltage change on the line (from Eqs. 7 and 8). The
selection is done using the sign of weights as the select line
as:
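$$ |w| = \begin{cases} \sum_{k=0}^{B_W-1} 2^k b_k \propto \Delta V_{BLB}, & \text{if } S_W = 0 \\ \sum_{k=0}^{B_W-1} 2^k b_k \propto \Delta V_{BL}, & \text{if } S_W = 1 \end{cases} \tag{10} $$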
Thus, the output voltage of MUX, Vmux will contain the
absolute value of weight as given by:
Vmux = VPRE −∆Vlsb |w| (11)
Thus the change in MUX voltage will be given by:
∆Vmux = VPRE − Vmux = ∆Vlsb |w| ∝ |w| (12)
For calculation of the signed weights, the circuit in Fig. 4(b) is
employed. It uses an OPAMP to set the voltage polarity
based on the value of SW, as can be seen from Fig. 4(c)-(d). Thus,
the output voltage at this stage has a positive or negative
sign corresponding to the positive or negative weight stored in
the SRAM array, and its magnitude is proportional to
the magnitude of the corresponding weight, i.e., |w|.
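The behaviour described by Eqs. 9-12 can be summarized in a short behavioural model. The sketch below is illustrative only: the function name `swc_output`, the default `v_lsb` value, and the ideal linear discharge of BL and BLB are assumptions made for the example and are not taken from the proposed circuit.

```python
def swc_output(bits, v_pre=1.0, v_lsb=0.01):
    """Behavioural sketch of the SWC unit.

    bits  : stored word, MSB first, 1's complement for negative weights
    v_pre : precharge voltage of BL and BLB (V)
    v_lsb : assumed discharge per LSB (V), hypothetical value
    """
    n = len(bits)
    stored = sum(b << (n - 1 - i) for i, b in enumerate(bits))
    complement = (~stored) & ((1 << n) - 1)

    # Ideal FR read: BLB discharges in proportion to the stored word,
    # BL in proportion to its bitwise complement.
    v_blb = v_pre - v_lsb * stored
    v_bl = v_pre - v_lsb * complement

    s_w = 0 if v_blb >= v_bl else 1          # comparator, Eq. 9
    v_mux = v_blb if s_w == 0 else v_bl      # 2:1 MUX picks the line holding |w|
    delta_v_mux = v_pre - v_mux              # Eq. 12

    # OPAMP stage: unity-gain follower for S_W = 0, inverter for S_W = 1.
    signed_v = delta_v_mux if s_w == 0 else -delta_v_mux
    return s_w, delta_v_mux, signed_v
```

With these assumed values, the words 0010 (+2) and 1101 (−2) give the same magnitude at the MUX output and opposite polarity at the OPAMP output.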
B. Backpropagation and Weight Updation Units
Backpropagation and weight updation are the most important
parts of any neural network: the network learns by adjusting
its weights, for accurate prediction and decision making, until
the error is reduced to a minimum possible value. The updated
weights have to be accessed frequently during the feedforward
and backpropagation processes, and storing them in and reading
them from the SRAM BCA during each iteration of
network training incurs a high energy and delay cost.
For improved throughput, energy
efficiency, and on-chip training, we have designed dedicated
circuitry that eliminates this requirement and makes the
proposed in-memory ANN architecture suitable for various
AI/ML applications, as shown in Fig. 4(e). In the proposed
backpropagation and weight updation units, the generated
signed weight $w_{jk}^{[K]}$ is first sampled using the switched connection
$\phi_S$ and stored in a sampling capacitor $C_S$ (shown in Fig. 4(e))
instead of being stored in the SRAM BCA. $C_S$ is used
to store all the intermediate weights corresponding to a single
synaptic connection. M × N such weight updation units
of Fig. 4(e) are required, corresponding to the weights of
all the M × N connections involved in the Kth layer of the
proposed architecture of the network, as shown in Fig. 3(a).
Fig. 4. Proposed hardware design for signed weight calculator (SWC) and weight updater (WU) units. (a) Selecting the bit line (BL or BLB) containing the magnitude
of the weight. (b) Sending the change in MUX voltage, i.e., ∆Vmux, to one of the inputs of the OPAMP to perform either (c) unity gain follower operation for a positive
weight, i.e., SW = 0, or (d) inversion of the polarity for a negative weight, i.e., SW = 1. (e) The proposed weight updation (WU) unit with timing
diagram.
Generalizing from Eqs. 4-6, the change in weight $\Delta w_{jk}^{[K]}$
required for the weight update is generated via the SM unit (discussed
in the next Sub-section) and is sampled on another sampling
capacitor $C_B$ via the switched connection $\phi_B$, shown in Fig. 4(e).
Unity gain buffers are used to access the voltages stored
on the sampling capacitors. The charge leakage problem of the
capacitors is self-addressed, since the
capacitor charge is updated at each iteration. Next, the outputs of
these unity gain buffers are applied at the two ends of
a series combination of two resistors of equal value R.
Thus, the voltage generated at their common node, $V_U$, is
proportional to the sum of the voltages corresponding to the weight
and the change in weight, i.e., $V_U = \left(w_{jk}^{[K]} + \Delta w_{jk}^{[K]}\right)/2$. The
output is then latched/stored on another sampling capacitor
$C_L$ (via the switched connection $\phi_L$), which is used to update the
intermediate-stage weight stored in the sampling capacitor $C_S$.
However, the voltage $V_U$ has to be amplified by a factor of
2 to make the updated weight $w_{jk}^{[K]\prime}$ exactly equal to the
sum of the original weight and the required change in weight. This
amplification can be done using any suitable scheme, for example
an OPAMP-based non-inverting amplifier
of gain 2, as shown in Fig. 4(e). The updated weight $w_{jk}^{[K]\prime}$ is
then sent to the next unit for carrying out multiplication.
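The averaging-and-doubling step of the WU unit can be checked numerically. The following is a minimal sketch assuming ideal buffers, ideal equal resistors, and an ideal gain-of-2 amplifier; the function name `weight_update_unit` and its default parameter values are hypothetical.

```python
def weight_update_unit(v_w, v_dw, r=1e3, gain=2.0):
    """Ideal model of the WU unit of Fig. 4(e).

    v_w  : buffered voltage on C_S (current weight)
    v_dw : buffered voltage on C_B (change in weight)
    r    : value of each of the two equal series resistors (ohm)
    gain : gain of the non-inverting amplifier after the common node
    """
    # Current through the series pair, flowing from the v_w end to the v_dw end.
    i_series = (v_w - v_dw) / (2.0 * r)
    # Common-node voltage V_U: one resistor drop below v_w,
    # which equals the average (v_w + v_dw) / 2 for equal resistors.
    v_u = v_w - i_series * r
    # Gain of 2 restores the full sum w + delta_w.
    return gain * v_u
```

Under these idealizations the resistor value cancels out, so the updated weight depends only on the two sampled voltages and on how accurately the gain of 2 is realized.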
Weight update of the hidden layers is not a straightforward
task, as discussed in the last section. Eqs. 5 and 6 can be
generalized to state that the weighted sum of the local gradients,
i.e., $\sum_k \delta_k^{[K]} w_{jk}^{[K]\prime}$, is used by the jth neuron of the previous
layer for its weight update. The hardware implementation
of the backpropagation block is shown in Fig. 5(a). The
product $\delta_k^{[K]} w_{jk}^{[K]\prime}$ obtained from the signed multiplier (SM)
unit corresponding to the jth column of the kth bank is added
over all k, i.e., over all banks, to generate the sum $S_j$ for
j = 1, 2, ..., M, as shown in Fig. 5(a). The final output at
each backpropagating line is proportional to the weighted
sum of the local gradients, which is fed to the previous layer
for its weight update. Once the local gradients of all the layers
have been calculated via the backpropagation process, the
weight update process can be parallelized by just changing the
control signal to S[1 : 0] = '10', as given in Table II and discussed
in detail in the next Sub-section.
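The summation performed on each backpropagating line can be written out as a small numerical sketch. It assumes ideal multipliers and ideal current summation; the function name `backpropagated_sums` and the list-of-lists layout of the weights are choices made for this example only.

```python
def backpropagated_sums(delta, w_updated):
    """Weighted sums S_j sent to the previous layer.

    delta     : local gradients of the Kth layer, one per bank k (length N)
    w_updated : w_updated[j][k] is the updated weight w'_jk between
                neuron j of the previous layer and neuron k of this layer
    Returns a list with one sum S_j per backpropagating line (length M).
    """
    n_banks = len(delta)
    return [
        sum(delta[k] * row[k] for k in range(n_banks))  # current summation over banks
        for row in w_updated
    ]
```

For example, with `delta = [1.0, -2.0]` and `w_updated = [[0.5, 0.25], [1.0, 0.0]]`, the two lines carry 0.5 − 0.5 = 0.0 and 1.0, respectively.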
C. Four Quadrant Analog Multiplier
As discussed, different signals need to be multiplied during
the feedforward and backpropagation processes. An accurate multiplier
is essential for better training
and overall performance of the proposed in-memory ANN
architecture. Fig. 5(b) shows the architecture of the proposed
signed multiplier (SM) unit, which multiplies
different signals depending upon the 2-b control signal
S[1 : 0] applied to it. Depending upon the control signal, the
proposed multiplier selects different pairs of signals for
multiplication, as shown in Table II. Fig. 5(c) shows the hardware
implementation of the four-quadrant analog multiplier,
which is the core of the proposed SM unit. In the proposed
four-quadrant multiplier, the NMOS/PMOS transistors work in the triode
region to generate an output proportional to the product of the inputs
Vin1 and Vin2.

Fig. 5. (a) Proposed hardware for carrying out backpropagation. (b) Realization of SM (Signed Multiplier) unit. (c) Four quadrant analog multiplier used
inside SM unit.

Fig. 6. Realization of pair of current sources used in Fig. 5(c). The OTA output annotated in the figure is Imult = Gm (Vp − Vn) = GmR (I2 − I1).

TABLE II
CONTROL SIGNALS FOR SIGNED MULTIPLIER (SM) UNIT AND
CORRESPONDING OPERATIONS/DATA FLOW.
S[1]  S[0]  Product                                   Data Flow
0     0     Forbidden                                 Nowhere
0     1     $a_j^{[K-1]} w_{jk}^{[K]\prime}$          To activation function block
1     0     $\eta \delta_k^{[K]} a_j^{[K]}$           To weight update unit
1     1     $\eta \delta_k^{[K]} w_{jk}^{[K]\prime}$  To backpropagation block

For a very small drain-to-source voltage, the drain current of an
NMOS in the linear/triode region is given by [23]:
$$ i_{Dn} = \mu_n C_{ox} \left(\frac{W}{L}\right) (V_{GS} - V_{Tn}) V_{DS} \tag{13} $$
where $\mu_n$ is the electron mobility, $C_{ox}$ is the gate capacitance
per unit area, $\left(\frac{W}{L}\right)$ is the aspect ratio, $V_{GS}$ and $V_{DS}$ are the
gate-to-source and drain-to-source voltages, respectively, and $V_{Tn}$
is the threshold voltage. Similarly, the drain current of the PMOS
is given by:
$$ i_{Dp} = \mu_p C_{ox} \left(\frac{W}{L}\right) (V_{SG} - |V_{Tp}|) V_{SD} \tag{14} $$
where $\mu_p$ is the hole mobility, $V_{Tp}$ is the threshold voltage of
the PMOS, and the rest of the quantities have the usual meaning
as in Eq. 13. Eqs. 13 and 14 are valid under two
sets of conditions: (i) $V_{GS} > V_{Tn}$ and $V_{DS} \ll V_{GS} - V_{Tn}$
for NMOS; and (ii) $V_{SG} > |V_{Tp}|$ and $V_{SD} \ll V_{SG} - |V_{Tp}|$
for PMOS. To satisfy these conditions, a pre-processing
block is designed, as shown in Fig. 5(c). Inside the pre-processing
block, the voltage Vin2 is scaled down in amplitude
by a reduction factor A that is large enough
to satisfy the necessary conditions on VDS (for NMOS) and
VSD (for PMOS). The multiplier output varies with the value of the
reduction factor A; this variation is discussed in the next Section
along with the simulation results. Next, if VGS = Vin1 + VTn (for NMOS) and
VGS = Vin1 − |VTp| (for PMOS) are applied to the gates of the
MOSFETs (as shown in Fig. 5(c)), then, from Eqs. 13 and 14,
the output current will be proportional to the product of the input
voltages. These additions are performed inside the pre-processing
block, so the outputs of the pre-processing block
satisfy both necessary conditions ((i) and (ii)) for the proposed
multiplier to work properly. The outputs of the pre-processing
block are fed to the multiplier, as shown in Fig. 5(c). The
analog multiplier generates a current that is proportional to the
product of the input voltages Vin1 and Vin2. Two current sources
of equal gain Ki are employed to generate the output current
so that it can drive the next stage. There are
many ways to realize the pair of current sources used in
Fig. 5(c). One option is to use a pair of OPAMPs and
an OTA (Operational Transconductance Amplifier), as shown
in Fig. 6; this realization is used for the simulations presented in the next
Section. The generated output current will be proportional to
the product of the input voltages within some offset error ∈m,
i.e., Imult ∝ (Vin1 · Vin2 ± ∈m). The error arises
from the small deviation from linear behaviour
of the MOSFETs in the triode region, since the I-V
characteristics of a MOSFET are not perfectly linear
there. The effect of this non-linearity can be
reduced by decreasing the drain-to-source voltage, which is
achieved by increasing the value of the reduction factor A used
in the pre-processing block in Fig. 5(c). Thus, the effect of the
error ∈m can be reduced by choosing a suitable value of the
reduction factor A. The variations of ∈m versus A are shown
in the next Section.
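The qualitative claim that the error shrinks as A grows can be illustrated with the standard square-law triode model, in which the drain current carries an extra −V_DS²/2 term that Eq. 13 neglects. The sketch below is an assumption-laden illustration, not the paper's simulation setup: it models a single NMOS branch with positive inputs only, and the names `triode_current` and `multiplier_relative_error` and the value of `k` are hypothetical.

```python
def triode_current(k, v_ov, v_ds, include_quadratic_term=False):
    """Triode-region drain current for overdrive v_ov = V_GS - V_Tn.

    k lumps mu_n * C_ox * (W/L). With include_quadratic_term=False this is
    Eq. 13; with True it is the textbook square-law triode expression.
    """
    i_d = k * v_ov * v_ds
    if include_quadratic_term:
        i_d -= 0.5 * k * v_ds ** 2
    return i_d


def multiplier_relative_error(v_in1, v_in2, a, k=1e-4):
    """Relative deviation from the ideal product for reduction factor a."""
    v_ds = v_in2 / a                      # pre-processing: Vin2 scaled down by A
    ideal = triode_current(k, v_in1, v_ds)
    actual = triode_current(k, v_in1, v_ds, include_quadratic_term=True)
    return abs(ideal - actual) / abs(ideal)
```

In this simplified model the relative error equals Vin2 / (2 A Vin1), so doubling A halves the error; the actual dependence of ∈m on A for the full circuit is the one reported in the simulation results.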
D. Activation Potential and Activation Function
Fig. 7 shows the hardware implementation of the activation
potential and activation function for generating the output
at any layer of the perceptron. During feedforward, the product
$I_{mult}$ ($\propto a_j^{[K-1]} w_{jk}^{[K]\prime}$) is generated via the SM unit and sent to
the activation function block, which carries out current-based
summation of the input signals, as shown in Fig. 7(b). The total
current ITOT is proportional to the weighted sum of the
inputs, i.e., $I_{TOT} \propto \sum_j a_j^{[K-1]} w_{jk}^{[K]\prime}$. However, the output
of the activation function is needed as a voltage, since the multiplier
of the next layer requires its input signals in the form of voltages
rather than currents. Therefore, a unity gain buffer is employed
to convert the current through a resistor R to a voltage, as
shown in Fig. 7(b). The value of R can be adjusted so that
the resulting voltage is exactly equal to the weighted sum of the inputs. This
activation potential is stored in a capacitor C and accessed
via the unity gain buffer. This avoids regenerating it
during backpropagation, where it is used for the weight
update. Activation functions perform a transformation on the
input received in order to keep values within a manageable
range. There is a wide variety of activation functions available,
such as linear, sigmoid, tanh, ReLU, etc. Out of these,
ReLU is the most popular and frequently used activation
function for hidden layers. For the proposed in-memory ANN
architecture, we have designed and employed the ReLU
activation function, whose output can be expressed as:
$$ a_k^{[K]} = \varphi\left(h_k^{[K]}\right) = \max\left(0, h_k^{[K]}\right) \tag{15} $$
where $h_k^{[K]}$ is the activation potential of the present layer and
$a_k^{[K]}$ is the output of the present layer generated after applying the
activation function $\varphi(\cdot)$ on $h_k^{[K]}$. Its derivative is also very
simple:
$$ \varphi'\left(h_k^{[K]}\right) = \begin{cases} 0, & \text{for } h_k^{[K]} \le 0 \\ 1, & \text{for } h_k^{[K]} > 0 \end{cases} \tag{16} $$
Fig. 7(c) shows the hardware implementation of the ReLU
function and its derivative. The MUX A1 performs the maximum
operation as per Eq. 15 by selecting the maximum
voltage based on the comparator output. Another MUX, A2, is used
for generating the local gradient of this layer, which is
the product of the derivative of the activation function and the
weighted sum of the local gradients of the next layer, i.e.,
$\delta_k^{[K]} = \left(\sum_l \delta_l^{[K+1]} w_{kl}^{[K+1]}\right) \varphi'\left(h_k^{[K]}\right)$. Using Eq. 16, the
local gradient of this layer can be expressed as:
$$ \delta_k^{[K]} = \begin{cases} 0, & \text{for } h_k^{[K]} \le 0 \\ \sum_l \delta_l^{[K+1]} w_{kl}^{[K+1]}, & \text{for } h_k^{[K]} > 0 \end{cases} \tag{17} $$
The MUX A2 performs the operation of Eq. 17 based
on the output of the comparator. However, ReLU is used only
for the hidden layers. For the output layer, a linear function (for
linear regression), sigmoid or tanh (for binary classification), or
softmax (for multi-class classification) is generally used. The realization
of a linear activation function at the output is
simple: the output is just made equal to the input.
Also, since its derivative is equal to unity, no
multiplier block is required for multiplying the error with the derivative
of the activation function during backpropagation. Hardware
realization of the other, non-linear activation functions in the analog
domain is quite complex. Instead, activation functions such
as sigmoid [24], tanh [25], softmax [26], and other non-linear
activation functions can be easily realized in the digital domain
using different approximation methods and/or look-up table
(LUT) based methods. So, the incoming analog voltage corresponding
to the activation potential is first converted to the digital
domain using an ADC, and then any of the above-mentioned
activation functions can be implemented digitally as
required. In this paper, the softmax [26] activation function
has been employed, and its output is converted back to the
analog domain using a DAC.
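The roles of the comparator and the two MUXes in Fig. 7(c) can be mirrored in a few lines. This is a behavioural sketch under ideal-component assumptions; the function names `activation_potential` and `relu_stage` and the argument names are hypothetical.

```python
def activation_potential(currents, r):
    """Current summation followed by conversion to a voltage through R."""
    i_tot = sum(currents)        # I_TOT = I_1 + I_2 + ... + I_M
    return i_tot * r             # voltage stored on capacitor C


def relu_stage(h, backprop_sum):
    """Comparator plus MUX A1 (Eq. 15) and MUX A2 (Eq. 17).

    h            : activation potential of the neuron
    backprop_sum : weighted sum of local gradients from the next layer
    Returns (activation output, local gradient).
    """
    positive = h > 0                             # comparator output
    a_out = h if positive else 0.0               # MUX A1: max(0, h)
    delta = backprop_sum if positive else 0.0    # MUX A2: gate the gradient
    return a_out, delta
```

Because the ReLU derivative is either 0 or 1, the local gradient needs only a selection, not a multiplication, which is why a MUX suffices for A2.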
E. Multilayered Implementation of Neural Network
Fig. 7. (a) Output stage at any layer of perceptron. (b) Summation of current to generate activation potential. (c) Hardware realization of ReLU activation
function.

We have realized and discussed all the individual blocks
needed for a single layer of the in-memory ANN architecture.
However, for contemporary AI/ML applications, the single-layer
neural network needs to be extended to a multilayered neural
network with on-chip training capability. Fig. 8(a) shows
the multilayered perceptron model designed by cascading
single-layered neural networks. One of the major advantages of
the proposed neural network architecture is its scalability, as
shown in Fig. 3(a): any number of such architectures can
be cascaded to design a multilayered perceptron with on-chip
training capability. At the output layer, there are no
further connections to a next layer; instead, there is a
mechanism for calculating the error of the network. The output
of the sum-of-squares-of-error block is fed to the control signal
block, which controls the feedforward and backpropagation
during the network training. The control signal block stops
the network training if the error VE has stopped decreasing.
The error at any neuron is given as the difference between the
target/intended output and the current output. For this, an
OPAMP subtractor circuit is used, as shown in Fig. 8(b), to
calculate the difference $e_l$ ($= t_l - y_l$) at the lth neuron of the
output layer. This difference is then sent to another two
OPAMPs for calculating $V_{Gn,l}$ ($= V_{Tn} - e_l$)
and $V_{Gp,l}$ ($= -|V_{Tp}| - e_l$), which are used for calculating the
sum of squares of error in the next stage. Here, $V_{Gn,l}$ denotes the
gate voltage of the NMOS at the lth output, and $V_{Gp,l}$
denotes the gate voltage of the PMOS at the lth output node.
Consider the MOSFET current in the saturation region [23]:
$$ I_D = \begin{cases} \frac{1}{2} k_n (V_{GS} - V_{Tn})^2, & \text{for NMOS} \\ \frac{1}{2} k_p (V_{SG} - |V_{Tp}|)^2, & \text{for PMOS} \end{cases} \tag{18} $$
where the parameters have their usual meaning as in
Eqs. 13 and 14. Now, if $V_{Gn,l}$ ($= V_{Tn} - e_l$), generated via
Fig. 8(b), is applied to the gate of the NMOS, then, for $e_l < 0$,
the drain current of the NMOS would be:
$$ I_{Dn,l} = \frac{1}{2} k_n (V_{Gn,l} - V_{Tn})^2 = \frac{1}{2} k_n e_l^2 \tag{19} $$

Fig. 8. (a) Hardware realization of the multilayered perceptron of Fig. 1 by cascading the multi-bank single-layered architecture of Fig. 3(a). (b) Generating the error el
and the voltages VGn,l and VGp,l via OPAMPs, which are used in calculating the sum of squares of error. (c) Generating the sum of squares of error via MOSFETs at the
output layer.
For $e_l > 0$, no current flows through the NMOS, which
would wrongly indicate that the error of the network has become
zero although it has not. So a mechanism
consisting of a combination of an NMOS and a PMOS having
equal transconductance parameter k (= kn = kp) is designed,
as shown in Fig. 8(c), which also works for $e_l > 0$.
From Fig. 8(c), the gate-to-source voltage of the PMOS is
$V_{Gp,l}$ ($= -|V_{Tp}| - e_l$), which is negative and less than
$-|V_{Tp}|$ for $e_l > 0$; hence the PMOS is in the ON state. At any
instant, either the NMOS or the PMOS is in the ON state at the
lth output, so the square of the error is always generated.
Another important observation is that, for the square of the
error to be generated correctly, i.e., for Eq. 18 to be satisfied,
the MOSFET must always be in saturation, which happens
when $V_{DS} \ge V_{GS} - V_{Tn}$ (for NMOS) and $V_{SD} \ge V_{SG} - |V_{Tp}|$
(for PMOS). For that, the same potential is applied to the
drain and source, as shown in Fig. 8(c). This ensures that
the condition for the MOSFET to be in saturation is always
satisfied. Next is the generation of the sum of squares of errors,
for which an OPAMP is employed to sum up the currents
flowing through each of the MOSFETs as:
$$ I = \sum_l \left(I_{Dn,l} + I_{Dp,l}\right) = \frac{1}{2} k \sum_l e_l^2 \tag{20} $$
From the above discussion, we can conclude for Eq. 20 that
one of IDn,l and IDp,l is always zero, since the
hardware implementation in Fig. 8(c) is designed in such a
way that only one of the MOSFETs (either NMOS or PMOS)
is ON at each output. Next, the final current flowing
through the OPAMP is converted to the voltage VE using resistor R1
as shown:
$$ V_E = -I R_1 = -\frac{1}{2} k R_1 \sum_l e_l^2 \tag{21} $$
The negative sign is used since the voltage at the output
of OPAMP will be negative when the current will flow in
the conventional direction of NMOS as shown in Fig. 8(c).
Negative polarity of voltage can be ignored since we are only
interested in the magnitude of the error and our aim will be
to make the magnitude of error as small as possible.
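Eqs. 18-21 can be combined into a compact numerical check of the error block. The sketch below assumes ideal square-law devices with sources referenced to 0 V, and the threshold voltages, `k`, and `r1` defaults are placeholder values chosen for illustration; the function name `error_voltage` is hypothetical.

```python
def error_voltage(targets, outputs, k=1e-4, r1=1e3, v_tn=0.4, v_tp=-0.4):
    """Sum-of-squares error voltage V_E of Eq. 21 (ideal square-law model)."""
    i_total = 0.0
    for t, y in zip(targets, outputs):
        e = t - y                       # OPAMP subtractor
        v_gn = v_tn - e                 # NMOS gate voltage V_Gn,l
        v_gp = -abs(v_tp) - e           # PMOS gate voltage V_Gp,l

        # NMOS conducts only when its overdrive is positive (e < 0).
        ov_n = v_gn - v_tn
        i_n = 0.5 * k * ov_n ** 2 if ov_n > 0 else 0.0

        # PMOS conducts only when V_SG exceeds |V_Tp| (e > 0).
        ov_p = -v_gp - abs(v_tp)
        i_p = 0.5 * k * ov_p ** 2 if ov_p > 0 else 0.0

        i_total += i_n + i_p            # current summation, Eq. 20
    return -i_total * r1                # Eq. 21
```

For targets `[1.0, 0.0]` and outputs `[0.5, 0.5]` the errors are +0.5 and −0.5, so one PMOS and one NMOS conduct and each contributes the same current, as Eq. 20 requires.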
F. Signed Flash ADC
The trained weights have to be stored back in the SRAM BCA
to avoid training the network again and again. Since the
proposed architecture deals with negative/positive weights in
the form of analog voltages, the general architecture of the ADC
needs to be modified to meet the requirements of the proposed
in-memory ANN architecture. Specifically, for negative
updated weights we need an ADC that uses the 1's complement
rule for analog-to-digital conversion. Therefore, a 4-bit signed
flash ADC is designed and presented in Fig. 9(a). The flash
ADC is chosen for its speed and optimal silicon chip area,
since a large number of updated
weights (for a very dense network) may need to be converted to the
digital domain and
stored back in the SRAM BCA. In a standard flash ADC, only a
positive reference voltage, i.e., VREF, is used, whereas
the proposed implementation uses both positive and negative
reference voltages, VREF and −VREF. Further,
both positive and negative analog voltages
(corresponding to positive and negative updated
weights) are compared using resistor ladder networks.
The outputs of the comparators are fed to priority encoders
that generate the final output, which is the digital equivalent of
the magnitude of the input voltage Vin. For negative Vin, the
desired output is the 1's complement of the digital equivalent
of the magnitude of the voltage; for this, NOT gates are used at
the output of the priority encoder corresponding to the negative
reference voltage −VREF to generate the 1's complement.

Fig. 9. (a) Signed Flash ADC design which converts negative weight using 1's complement notation. (b) Timing diagram of the whole network.
For signed conversion using 1's complement, a positive
number has its MSB equal to 0 and a negative number has its MSB
equal to 1. To incorporate this, the MSB is taken from the output
of the first comparator (U8) employed for comparing negative
voltages, as shown in Fig. 9(a). This ensures that the MSB b3
is 0 for positive weights and 1 for negative weights below
−Vres/2, where Vres is the resolution of the ADC in volts
per step. Further, to reduce the number of output data lines, only one of
the priority encoders is allowed to work at a time, by applying
b3 as the enable signal to the EN input of each of the priority
encoders. Thus, for b3 = 0 the priority encoder corresponding
to the positive reference voltage is enabled, and for b3 = 1 the
priority encoder corresponding to the negative reference voltage
is enabled. Hence, the designed ADC works for both positive
and negative weights. The reference voltage VREF decides the
maximum swing of the ADC as well as the range of weights.
The transfer function and the value of VREF of the proposed
signed flash ADC are discussed in detail in the next Section.
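The conversion rule described above can be expressed as a behavioural model. The sketch below makes several assumptions that are not stated in the text: comparator thresholds are placed at odd multiples of Vres/2, and Vres is taken as VREF/8. The function names `signed_flash_adc` and `decode_ones_complement` are hypothetical.

```python
def signed_flash_adc(v_in, v_ref=0.496, n_mag_bits=3):
    """4-bit signed flash ADC sketch with 1's complement output.

    Assumed: 7 comparators per polarity with thresholds at
    +/-(i - 0.5) * v_res, i = 1..7, and v_res = v_ref / 8.
    """
    levels = 1 << n_mag_bits
    v_res = v_ref / levels
    thresholds = [(i - 0.5) * v_res for i in range(1, levels)]

    # Thermometer codes reduced to a count, as a priority encoder would do.
    pos_count = sum(v_in > t for t in thresholds)    # positive ladder
    neg_count = sum(v_in < -t for t in thresholds)   # negative ladder

    b3 = 1 if neg_count >= 1 else 0                  # first negative comparator
    if b3 == 0:
        low_bits = pos_count                         # positive encoder enabled
    else:
        low_bits = (~neg_count) & (levels - 1)       # NOT gates: 1's complement
    return (b3 << n_mag_bits) | low_bits


def decode_ones_complement(code, n_bits=4):
    """Interpret an n-bit 1's complement code as a signed integer."""
    mask = (1 << n_bits) - 1
    if code >> (n_bits - 1):
        return -((~code) & mask)
    return code
```

Under these assumptions an input of two resolution steps maps to 0010 and its negative maps to 1101, which is the bitwise complement expected for 1's complement storage in the BCA.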
G. Timing Diagram
Fig. 9(b) shows the timing diagram of the whole network.
'P' is the total number of epochs, 'L' is the total number of
training samples presented to the network per epoch, and 'ξ' is the
total number of layers. The notation [[0]] → [[1]] denotes
the forward propagation from the input to the first hidden layer,
and [[0]] ← [[1]] denotes backpropagation from the first hidden
layer to the input layer. The same notation with different
indices is used to indicate propagation between other layers.
The working of the network starts with the FR process of the
SRAM BCA, which fetches the proportional analog equivalents of the
weights to the SWC units. Each SWC unit calculates the
signed weight and sends its output to the WU unit. Once
the FR process and the SWC units have finished their work,
the whole training and testing procedure is controlled by the
2-b control signal S[1 : 0]. During feedforward, the control
signal switches to '01' and the activation potential is calculated at
each layer and latched to a sampling capacitor, via the switched
connection $\phi_{OUT}^{[K]}$, at the output of each layer. Next, the error
of the network is calculated using the sum-of-squares-of-error
cost function, which is fed to the control block. If the error
is still decreasing, the control block continues
training the network; otherwise the training is terminated. If
the error has not stopped decreasing, the next stage is the
backpropagation phase (S[1:0] = '11'), where the sum of local
gradients is calculated inside the backpropagation block and
is transmitted to the previous layer via the transmission lines. Once
the backpropagation process is completed, the control signal
switches to '10' for weight updation. When training completes,
the updated weights are stored back inside the SRAM BCA
via the signed flash ADC. Each of these major processes is shown
in the timing diagram in Fig. 9(b), where each
process is further divided into smaller sub-processes for a clearer
insight into the working of the network.

Fig. 10. (a) Simulation result of the magnitude of weight calculated via ∆Vmux inside the SWC unit and its deviation (in LSB) from the ideal (expected) output for
4-b resolution. (b) 4-b weight dependent energy dissipation during the FR read process. (c) Simulation result of updated weights for η = 1. (d) Transfer function
of the proposed multiplier of Fig. 5(c).
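The sequencing of the control signal described in this Sub-section can be sketched as a simple loop. The example assumes that the stop decision is taken once per training iteration from a list of already computed error magnitudes; this granularity, the function name `control_sequence`, and the constant names are assumptions made for illustration.

```python
FEEDFORWARD = "01"     # S[1]=0, S[0]=1
BACKPROP = "11"        # S[1]=1, S[0]=1
WEIGHT_UPDATE = "10"   # S[1]=1, S[0]=0


def control_sequence(error_magnitudes):
    """Return the S[1:0] codes issued for successive error magnitudes |V_E|."""
    codes = []
    previous = float("inf")
    for err in error_magnitudes:
        codes.append(FEEDFORWARD)        # activations and error are computed
        if err >= previous:              # error has stopped decreasing
            break                        # terminate training
        codes.extend([BACKPROP, WEIGHT_UPDATE])
        previous = err
    return codes
```

After the loop ends, the weights held in the WU units would be written back through the signed flash ADC, which is outside the scope of this sketch.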
IV. SIMULATION RESULTS AND DISCUSSIONS
In this Section, the simulation results and working of the
proposed architecture are presented. Simulations of all the
peripheral computing blocks were performed with a SPICE
simulator using High-Performance 45nm PTM (Predictive
Technology Model) models [27]. The design parameters chosen
for simulation are summarized in Table III.
A. Accuracy, Energy and Delay Analysis
TABLE III
DESIGN PARAMETERS FOR SIMULATION
Parameter              Value
Technology             45nm PTM HP
VDD                    1 V
VPRE (of BL & BLB)     1 V
W/L                    2
SRAM bit cell          6T
T0 (FR Read)           0.3 ns
BW                     4
VREF                   0.496 V

Fig. 11. (a) Energy dissipation in MOSFETs used inside the multiplier of Fig. 5(c). (b) Worst case analysis of error, ∈m, associated with the multiplier output with
reduction factor A for different polarities of inputs Vin1 and Vin2, each having equal magnitude of 1 V. (c) Transfer function and power dissipation of the
proposed hardware for realizing the ReLU activation function. (d) Simulation result of the transfer function and energy dissipation of the hardware (see Fig. 8(c)) used
for realizing the square of error. The simulation results are for a single neuron at the output layer, i.e., for N3 = 1 in Fig. 8(c).

Accurate generation/updation of signed weights is essential
for the proper working of a perceptron network. Fig. 10(a)
shows the magnitude of a weight having 4-b
resolution generated from the SWC unit shown in Fig. 4(a)-(d).
It can be observed that the output is almost linear, as expected
(see Eq. 12), and the deviation of the simulated output is
within the expected range, less than 0.67 LSB. A significant
amount of energy dissipation is involved during the functional
read (FR) process of the SRAM BCA. Fig. 10(b) shows the
energy dissipation per column during the FR process of the
SRAM BCA. From Fig. 10(b), the energy expenditure is least
when the bits in the column-major format are either all zero
or all one, since in this case either the BL or the BLB line discharges
completely and the other remains at the precharge voltage VPRE.
However, as the magnitude of the weight increases, both the BL
and BLB lines discharge by some amount; as a result,
the number of discharge paths, the discharge time, or both
increase. Hence, energy dissipation during the FR
process increases with an increase in weight magnitude. The
total energy dissipation during the FR process of the SRAM
BCA can be modeled as:
$$E_{\mathrm{Total}}(w) = \sum_{k} A_k(w)\, e^{-2\alpha k \cdot f(w)} \qquad (22)$$
where α = T0/τ0, T0 is the pulse width applied at the WL of the
LSB row of DIMA, τ0 = CBL·RBL is the time constant of the
circuit, Ak(w) is a coefficient that depends on the decimal
equivalent of the weight stored in the SRAM BCA, and f(w) is a
linear function of the decimal equivalent of the weight (w).
The maximum delay during the FR process is set by the longest
pulse-width-modulated WL signal, which is applied to the MSB
row; it is 2.4 ns in our case for 4-b weights stored in the
SRAM BCA. Fig. 10(c) shows the variation of the trained weight
with the signed weight (calculated in the SWC unit) for different
values of ∆w, for R = 1 kΩ and learning rate η = 1. The
simulated variation is linear, as expected from Eq. 3.
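The 2.4 ns figure is consistent with binary-weighted word-line pulses, in which the row holding bit position i is pulsed for 2^i · T0. The sketch below assumes this weighting (it is not restated in this section) and takes T0 = 0.3 ns and a 4-b weight from Table III; the function and variable names are illustrative only.

```python
T0_NS = 0.3      # LSB-row WL pulse width from Table III, in ns
WEIGHT_BITS = 4  # weight resolution (BW in Table III)


def wl_pulse_widths_ns(t0_ns, bits):
    """Return the assumed binary-weighted WL pulse widths, LSB row first."""
    return [t0_ns * (2 ** i) for i in range(bits)]


widths_ns = wl_pulse_widths_ns(T0_NS, WEIGHT_BITS)
# Under this assumption the MSB row has the longest pulse and therefore
# sets the worst-case functional-read delay.
max_fr_delay_ns = max(widths_ns)
print(widths_ns, max_fr_delay_ns)
```

Under this assumption the longest pulse is 8 · T0, which matches the 2.4 ns quoted for the MSB row.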
Fig. 10(d) shows the transfer function of the proposed
multiplier (see Fig. 5(c)). The designed multiplier employs a
current-controlled current source (CCCS) using a pair of OPAMP
and an OTA, as shown in Fig. 6 (discussed in the previous
section). The desired output is achieved to within a small
error, as can be seen from the simulation results for a current
gain factor Ki = 250. The values R = 25 Ω and Gm = 10 (for the
CCCS used in Fig. 6) were used to make the current gain
Ki = GmR equal to 250 for the simulations shown in Figs. 10(d)
and 11(a)-11(b). The error ∈m associated with the proposed
multiplier can be reduced by increasing the reduction factor A
used in the pre-processing block of Fig. 5(c). For error
analysis, the multiplier output current was converted to a
voltage by passing it through a 1 kΩ resistance and measuring
the voltage across
it. This value was chosen because the same 1 kΩ resistance has
been used throughout the simulations wherever the multiplier
output current is converted into a voltage. From the simulation
results, it was observed that as the input voltage increases,
the error ∈m associated with the multiplier output also
increases, reaching its maximum when both input voltages reach
1 V. Hence, for error analysis, both inputs were kept at 1 V to
analyse the worst-case impact of the error ∈m on the multiplier
output. The output error ∈m was measured with respect to a
reference voltage of 1 V, which is the expected result for
inputs of 1 V each. Fig. 11(b) shows the worst-case analysis of
the error ∈m on the multiplier output (for Ki = 250 and
Vin1 = Vin2 = 1 V) versus the reduction factor A. From
Fig. 11(b), it can be inferred that the proposed multiplier can
be made more accurate by using an even smaller drain-to-source
voltage, i.e., an even larger reduction factor A. However, a
higher value of A requires a current amplifier with an even
higher gain Ki to maintain an output of 1 V for inputs of 1 V
each. Hence, there is a tradeoff between the reduction factor A
and the current gain factor Ki of the multiplier. Fig. 11(a)
shows the power dissipation of the designed multiplier (see
Fig. 5(c)) as a function of the applied input signals. The
almost quadratic variation with input voltage resembles that of
a resistor, because the MOSFETs in Fig. 5(c) are designed to
operate in the linear region and behave as voltage-controlled
resistances.
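As a small numerical check of the figures quoted above, the sketch below computes the current gain Ki = Gm·R and the deviation of the converted output voltage from the 1 V reference. The unit of Gm (taken here as A/V), the signed-deviation form of the error, and the sample output current are assumptions made for illustration; they are not simulation results from this work.

```python
GM = 10.0          # OTA transconductance quoted in the text (unit assumed: A/V)
R_CCCS_OHM = 25.0  # resistance used inside the CCCS
R_LOAD_OHM = 1e3   # resistance that converts the output current to a voltage
V_REF = 1.0        # expected output for Vin1 = Vin2 = 1 V


def current_gain(gm, r_ohm):
    """Current gain of the CCCS, Ki = Gm * R."""
    return gm * r_ohm


def output_error_v(i_out_a, r_load_ohm=R_LOAD_OHM, v_ref=V_REF):
    """Signed deviation of the converted output voltage from the reference.

    The exact error definition is an assumption; the text only states that
    the error is measured with respect to a 1 V reference.
    """
    return i_out_a * r_load_ohm - v_ref


ki = current_gain(GM, R_CCCS_OHM)
example_error = output_error_v(0.98e-3)  # hypothetical 0.98 mA output current
print(ki, example_error)
```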
Fig. 11(c) shows the transfer function and power dissipation
of the proposed hardware for realizing the ReLU activation
function. The output of the ReLU activation function is zero for
inputs less than or equal to zero and equals the input for inputs
greater than zero, as given by Eq. 15. The transfer function of
the proposed ReLU block in Fig. 11(c) agrees with Eq. 15. The
power dissipation rises sharply when the activation potential
makes a transition, which is expected because the maximum
switching power is dissipated within the comparator. Further,
the average delay incurred in generating the output of the ReLU
block was ≈ 300 ps. Another important parameter during training
is the total error of the network, which has to be tracked
during the weight update process; training should be stopped as
soon as the error becomes zero. Generally, this is very
difficult to achieve; hence, a fixed number of epochs is chosen
for training the network, during which the error is continuously
monitored until it reaches its minimum, after which training
must be stopped.
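The stopping rule described above can be sketched in software as follows. This is an illustrative sketch rather than the on-chip control logic: `run_epoch` is a hypothetical callback that performs one training epoch and returns the total error, and the error trace in the usage lines is made up for demonstration.

```python
def train_with_epoch_budget(run_epoch, max_epochs):
    """Train for at most max_epochs while tracking the minimum total error.

    Stops early if the error reaches exactly zero; otherwise reports the
    epoch at which the smallest error was observed within the budget.
    """
    best_epoch, best_error = None, float("inf")
    for epoch in range(max_epochs):
        error = run_epoch(epoch)  # hypothetical: one pass over the training set
        if error < best_error:
            best_epoch, best_error = epoch, error
        if error == 0.0:
            break
    return best_epoch, best_error


# Made-up error trace standing in for a real training run.
trace = [0.9, 0.5, 0.2, 0.25, 0.3]
best_epoch, best_error = train_with_epoch_budget(lambda e: trace[e], len(trace))
print(best_epoch, best_error)
```

In this sketch the weights from the best epoch would be the ones kept; how the minimum is detected on-chip is not specified here.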
Fig. 11(d) shows the simulation results of the hardware
designed for generating the square of the error at a single
neuron of the output layer (for R1 = 1 kΩ), which is seen to
follow the square relation of Eq. 2. The total error is the sum
of the squared errors at each of the outputs, which generates a
voltage VE at the output of the OPAMP. The error voltage can be
negative or positive due to the opposite directions of the
currents flowing through the PMOS/NMOS devices. However, as
stated earlier, we are interested only in making the error zero
or making its magnitude as small as possible. The energy
dissipation during this process is mainly governed by the
resistor R1 used in Fig. 8(c). The simulation results in
| Parameter | Value |
|---|---|
| Ncol,1 = No. of input layer neurons | 4 |
| No. of hidden layers | 1 |
| Nbank,1 = No. of hidden layer neurons | 5 |
| Ncol,2 = No. of hidden layer neurons | 5 |
| Nbank,2 = No. of output layer neurons | 3 |
| Loss function | Sum of squares of error |
| Optimizer | Gradient Descent |
| Learning rate (η) | 0.1 |
| Activation function | ReLU (for hidden layer), Softmax (for output layer) |
| Dataset | Iris [28] |
| TRAINING | |
| Epochs | 500 |
| Accuracy | ≈ 99% |
| Energy | ≈ 0.84 µJ/epoch (≈ 7.002 nJ/iteration) |
| Delay | ≈ 82.032 µs/epoch (≈ 0.683 µs/iteration) |
| TESTING | |
| Accuracy | ≈ 96.67% |
| Energy/Decision (pJ) | 1.855 |
| Delay/Decision (ns) | 680.6 |
| Decision/s | 1.47 M |
| EDP/Decision (J·s) | 1.26 × 10^-18 |
TABLE IV
MODEL DESCRIPTION, PARAMETERS, AND ITS TRAINING AND TESTING
RESULTS ON IRIS DATASET.
Fig. 11(d) show the energy dissipated for R1 = 1 kΩ. The
time delay incurred during this process was calculated to be
≈ 340 ns. Fig. 12(a) shows the output of the 4-b flash ADC,
which converts the updated weights, held as proportional analog
voltages on the sampling capacitor CS inside the WU units, back
to the digital domain so that they can be stored back in the
SRAM BCA. The reference voltage VREF was chosen to be 0.496 V,
which corresponds to a maximum binary output of 0111 during
signed weight conversion (as shown in Fig. 10(a)). The maximum
INL (Integral Non-Linearity) of the ADC was observed to be
around 0.3 VLSB, which is lower than half an LSB.
B. Training and Testing on Iris Dataset
The proposed in-memory ANN was trained and tested on the
Iris dataset [28], which consists of 150 records with 4 input
features: petal length, petal width, sepal length, and sepal
width. Each set of features corresponds to one of three species
of Iris flower: Iris Setosa, Iris Versicolour, or Iris
Virginica. Each of these three species has 50 records in the
dataset. It is one of the most widely used datasets for training
and testing classification networks. Since this is a 3-class
classification task, the softmax function was used at the output
layer and was implemented digitally as described in Ref. [26].
Table IV gives the model description and the accuracy, energy,
and delay analysis of the sequentially trained model on the Iris
dataset for 500 epochs. Table V compares the results of the
proposed architecture with earlier architectures. Fig. 12(b)
shows the accuracy and loss versus the number of epochs. From
Fig. 12(b), the training accuracy reaches almost 99%. Further,
the test accuracy was observed to be ≈ 96.67%, and it can be
increased further by: (a) incorporating a momentum term in the
weight update, which reduces the possibility of the training
getting stuck in local minima of the error landscape, and
(b) using another cost
Fig. 12. (a) Output of the proposed signed flash ADC block. (b) Training loss and accuracy.
| Metric | Ref. [8] | Ref. [16] | Ref. [29] | Ref. [30] | Ref. [31] | This work |
|---|---|---|---|---|---|---|
| Gate length (nm) | 130 | 65 | 65 | 65 | 65 | 45 |
| Technology | CMOS | CMOS | CMOS | CMOS | CMOS | PTM HP |
| Algorithm | Adaboost | SVM | CNN | DNN | SVM | ANN |
| Dataset | MNIST | MIT-CBCL | MNIST | MNIST | MIT-CBCL | Iris |
| Bitcell type | 6T | 6T | 10T | 6T | 6T | 6T |
| On-chip learning | No | No | No | No | Yes | Yes |
| Decision/s | 7.9 M | 9.2 M | - | - | 32 M | 1.47 M |
| BW | 1 | 8 | 1 | 1 | 8 | 4 |
| E_MAC (pJ) [1] | 0.003 | 0.8 | 0.071 | 0.018 | 0.92 | 0.02 |
| Efficiency (TOPS/W) | 350 | 1.25 | 14 | 55.8 | 1.07 | 50 |
| EDP (J·s) (× 10^-18) | 75.9 | 43.47 | - | - | 1.31 | 1.26 |
[1] Energy of a single multiply-and-accumulate (MAC) operation.
TABLE V
COMPARISON WITH OTHER RELATED WORKS.
function that works better for classification tasks, i.e., our
proposed architecture works well for regression tasks with the
squared-error cost function, but for classification problems
other cost functions such as cross entropy work better.
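The derived entries of Table IV can be cross-checked from the per-decision and per-epoch values. The sketch below is only an arithmetic check on the tabulated numbers; the variable names are illustrative, and the reading of the epoch/iteration ratio as the number of training samples per epoch is an inference, since the train/test split is not stated in this section.

```python
# Values copied from Table IV.
ENERGY_PER_DECISION_J = 1.855e-12  # 1.855 pJ
DELAY_PER_DECISION_S = 680.6e-9    # 680.6 ns
ENERGY_PER_EPOCH_J = 0.84e-6       # 0.84 uJ
ENERGY_PER_ITER_J = 7.002e-9       # 7.002 nJ
DELAY_PER_EPOCH_S = 82.032e-6      # 82.032 us
DELAY_PER_ITER_S = 0.683e-6        # 0.683 us

# Throughput is the reciprocal of the per-decision delay.
decisions_per_s = 1.0 / DELAY_PER_DECISION_S
# Energy-delay product per decision.
edp_js = ENERGY_PER_DECISION_J * DELAY_PER_DECISION_S
# Iterations per epoch implied by the energy and delay rows, respectively.
iters_from_energy = ENERGY_PER_EPOCH_J / ENERGY_PER_ITER_J
iters_from_delay = DELAY_PER_EPOCH_S / DELAY_PER_ITER_S

print(f"{decisions_per_s:.3e} decisions/s, EDP = {edp_js:.3e} J*s")
print(f"iterations/epoch: {iters_from_energy:.1f} (energy), {iters_from_delay:.1f} (delay)")
```

Both ratios come out close to 120 iterations per epoch, and the throughput and EDP agree with the tabulated 1.47 M decisions/s and 1.26 × 10^-18 J·s to the precision shown.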
V. CONCLUSION
An in-memory, on-chip trainable and scalable artificial neural
network was designed and developed for a wide variety of
data-intensive AI/ML algorithms. The main focus of the presented
architecture was to exploit in-memory analog computations for
realizing an ANN with on-chip training. The proposed
architecture is scalable and re-configurable, and can map a
large variety of AI/ML algorithms by enabling different
activation functions and cost functions. Its main strength is
that every step from training to inference can be performed
on-chip without external computing resources. Further, it does
not require temporary registers/buffers for storing intermediate
results, which makes the proposed approach energy and
computationally efficient.
A neural network, being a complex interconnection of a large
number of neurons, requires a huge amount of computation. Some
recently proposed architectures use binary weights to reduce
complexity, but this may lead to accuracy issues since weights
can take any real value (not necessarily only 0 and 1). Instead,
in this paper, all the operations of the neural network are
performed in the analog domain, which avoids such accuracy
issues. Some accuracy issues may still arise during
analog-to-digital conversion, but these can be resolved by using
more bits per weight. The whole network was trained and tested
on the Iris dataset, where the classification accuracy was
estimated to be ≈ 96.67%. Further, energy and delay analysis
shows that the proposed work is ≈ 46× more energy efficient per
MAC (multiply-and-accumulate) operation than previous work that
employs DIMA.
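The ≈ 46× figure appears to correspond to the ratio of the per-MAC energies listed in Table V for Ref. [31] and for this work. The sketch below reproduces that ratio and the efficiency entry; taking Ref. [31] as the baseline and counting one MAC as one operation are assumptions made here, not statements from the text.

```python
# Per-MAC energies in pJ, copied from Table V.
E_MAC_PJ = {
    "Ref. [8]": 0.003,
    "Ref. [16]": 0.8,
    "Ref. [29]": 0.071,
    "Ref. [30]": 0.018,
    "Ref. [31]": 0.92,
    "This work": 0.02,
}

# Energy ratio against the assumed baseline, Ref. [31].
ratio_vs_ref31 = E_MAC_PJ["Ref. [31]"] / E_MAC_PJ["This work"]

# With one operation per MAC (assumption), 1 / (energy in pJ) gives TOPS/W,
# because 1 op/pJ equals 1e12 op/J.
tops_per_w = 1.0 / E_MAC_PJ["This work"]

print(f"ratio = {ratio_vs_ref31:.1f}x, efficiency = {tops_per_w:.1f} TOPS/W")
```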
REFERENCES
[1] P. G. Emma, “Understanding some simple processor-performance lim-
its,” IBM Journal of Research and Development, vol. 41, no. 3, pp.
215–232, 1997.
[2] M. Kang, S. Lim, S. Gonugondla, and N. R. Shanbhag, “An in-memory
vlsi architecture for convolutional neural networks,” IEEE Journal on
Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp.
494–505, 2018.
[3] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envi-
sion: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-
frequency-scalable convolutional neural network processor in 28nm
fdsoi,” in 2017 IEEE International Solid-State Circuits Conference
(ISSCC), 2017, pp. 246–247.
[4] B. Moons and M. Verhelst, “A 0.3–2.6 tops/w precision-scalable pro-
cessor for real-time large-scale convnets,” in 2016 IEEE Symposium on
VLSI Circuits (VLSI-Circuits), 2016, pp. 1–2.
[5] M. Price, J. Glass, and A. P. Chandrakasan, “14.4 a scalable speech rec-
ognizer with deep-neural-network acoustic models and voice-activated
power gating,” in 2017 IEEE International Solid-State Circuits Confer-
ence (ISSCC), 2017, pp. 244–245.
[6] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G. Wei,
“14.3 a 28nm soc with a 1.2ghz 568nj/prediction sparse deep-neural-
network engine with >0.1 timing error rate tolerance for iot appli-
cations,” in 2017 IEEE International Solid-State Circuits Conference
(ISSCC), 2017, pp. 242–243.
[7] M. Kang, S. K. Gonugondla, and N. R. Shanbhag, “A 19.4 nj/decision
364k decisions/s in-memory random forest classifier in 6t sram array,” in
ESSCIRC 2017 - 43rd IEEE European Solid State Circuits Conference,
2017, pp. 263–266.
[8] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a
machine-learning classifier in a standard 6t sram array,” IEEE Journal
of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, 2017.
[9] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[10] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks,”
IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[11] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and
O. Temam, “Diannao: A small-footprint high-throughput accelerator
for ubiquitous machine-learning,” SIGARCH Comput. Archit. News,
vol. 42, no. 1, pp. 269–284, Feb. 2014. [Online]. Available:
https://doi.org/10.1145/2654822.2541967
[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio,
“Binarized neural networks: Training deep neural networks with weights
and activations constrained to +1 or -1,” 2016.
[13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:
Imagenet classification using binary convolutional neural networks,”
2016.
[14] R. Liu, X. Peng, X. Sun, W. Khwa, X. Si, J. Chen, J. Li, M. Chang, and
S. Yu, “Parallelizing sram arrays with customized bit-cell for binary
neural networks,” in 2018 55th ACM/ESDA/IEEE Design Automation
Conference (DAC), 2018, pp. 1–6.
[15] D. Ernst, Nam Sung Kim, S. Das, S. Pant, R. Rao, Toan Pham,
C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Ra-
zor: a low-power pipeline based on circuit-level timing speculation,”
in Proceedings. 36th Annual IEEE/ACM International Symposium on
Microarchitecture, 2003. MICRO-36., 2003, pp. 7–18.
[16] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, “A multi-
functional in-memory inference processor using a standard 6t sram
array,” IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 642–
655, 2018.
[17] N. Shanbhag, M. Kang, and M.-S. Keel, “Compute memory,” US Patent
US9697877B2. [Online]. Available: https://patents.google.com/patent/
US9697877
[18] M. Kang, M. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An
energy-efficient vlsi architecture for pattern recognition via deep embed-
ding of computation in sram,” in 2014 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 8326–
8330.
[19] M. Kang, E. P. Kim, M. Keel, and N. R. Shanbhag, “Energy-efficient
and high throughput sparse distributed memory architecture,” in 2015
IEEE International Symposium on Circuits and Systems (ISCAS), 2015,
pp. 2505–2508.
[20] M. Kang and N. R. Shanbhag, “In-memory computing architectures for
sparse distributed memory,” IEEE Transactions on Biomedical Circuits
and Systems, vol. 10, no. 4, pp. 855–863, 2016.
[21] H. Jiang, X. Peng, S. Huang, and S. Yu, “Cimat: A compute-in-memory
architecture for on-chip training based on transpose sram arrays,” IEEE
Transactions on Computers, pp. 1–1, 2020.
[22] S. Haykin, Neural Networks and Learning Machines, 3/e. PHI
Learning, 2010. [Online]. Available: https://books.google.co.in/books?
id=ivK0DwAAQBAJ
[23] A. Sedra and K. Smith, Microelectronic Circuits, ser. Oxford
series in electrical and computer engineering. Oxford University
Press, 2004. [Online]. Available: https://books.google.co.in/books?id=
9UujQgAACAAJ
[24] I. Tsmots, O. Skorokhoda, and V. Rabyk, “Hardware implementation of
sigmoid activation functions using fpga,” in 2019 IEEE 15th Interna-
tional Conference on the Experience of Designing and Application of
CAD Systems (CADSM), 2019, pp. 34–38.
[25] A. H. Namin, K. Leboeuf, R. Muscedere, H. Wu, and M. Ahmadi,
“Efficient hardware implementation of the hyperbolic tangent sigmoid
function,” in 2009 IEEE International Symposium on Circuits and
Systems, 2009, pp. 2117–2120.
[26] I. Kouretas and V. Paliouras, “Simplified hardware implementation of
the softmax activation function,” in 2019 8th International Conference
on Modern Circuits and Systems Technologies (MOCAST), 2019, pp.
1–4.
[27] “Predictive technology model,” http://ptm.asu.edu/.
[28] “Iris dataset,” https://archive.ics.uci.edu/ml/datasets/iris.
[29] A. Biswas and A. P. Chandrakasan, “Conv-ram: An energy-efficient
sram with embedded convolution computation for low-power cnn-based
machine learning applications,” in 2018 IEEE International Solid - State
Circuits Conference - (ISSCC), 2018, pp. 488–490.
[30] W. Khwa, J. Chen, J. Li, X. Si, E. Yang, X. Sun, R. Liu, P. Chen, Q. Li,
S. Yu, and M. Chang, “A 65nm 4kb algorithm-dependent computing-
in-memory sram unit-macro with 2.3ns and 55.8tops/w fully parallel
product-sum operation for binary dnn edge processors,” in 2018 IEEE
International Solid - State Circuits Conference - (ISSCC), 2018, pp.
496–498.
[31] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “A variation-
tolerant in-memory machine learning classifier via on-chip training,”
IEEE Journal of Solid-State Circuits, vol. 53, no. 11, pp. 3163–3173,
2018.
