Semi-Trained Memristive Crossbar Computing Engine with In-Situ Learning
  Accelerator by Zyarah, Abdullah M. & Kudithipudi, Dhireesha
Abdullah M. Zyarah, Neuromorphic AI Lab, Rochester Institute of Technology
Dhireesha Kudithipudi, Neuromorphic AI Lab, Rochester Institute of Technology
On-device intelligence is gaining significant attention recently as it offers local data processing and
low power consumption. In this research, an on-device training circuitry for threshold-current memristors
integrated in a crossbar structure is proposed. Furthermore, alternate approaches of mapping the synaptic
weights into fully-trained and semi-trained crossbars are investigated. In a semi-trained crossbar a confined
subset of memristors are tuned and the remaining subset of memristors are not programmed. This translates
to optimal resource utilization and power consumption, compared to a fully programmed crossbar. The
semi-trained crossbar architecture is applicable to a broad class of neural networks. System level verification
is performed with an extreme learning machine for binomial and multinomial classification. The total power
for a single 4x4 layer network, when implemented in IBM 65nm node, is estimated to be ≈ 42.16µW and the
area is estimated to be 26.48µm x 22.35µm.
CCS Concepts: • Computing methodologies → Neural networks; • Hardware → Emerging technologies;
General Terms: Design, Analysis, Experimentation
Additional Key Words and Phrases: On-device learning, Semi-trained neural network, Memristive-crossbar,
Extreme learning machine
ACM Reference Format:
Abdullah M. Zyarah and Dhireesha Kudithipudi, 2018. Semi-Trained Memristive Crossbar Computing
Engine with In-Situ Learning Accelerator. ACM J. Emerg. Technol. Comput. Syst. x, x, Article x (January
2018), 17 pages.
DOI: 10.1145/3233987
1. INTRODUCTION
On-device intelligence has recently gained significant attention because it offers local data
processing and low power consumption, making it suitable for energy-constrained platforms (e.g.,
IoT). Porting neural networks onto embedded platforms to enable on-device intelligence
requires high computational power and bandwidth. Conventional architectures, such
as the von Neumann architecture, suffer from reduced throughput and high power draw when
realizing neural networks. This can be attributed to the physical separation between
processing and memory units, which leads to a memory bottleneck [Indiveri and Liu
2015]. Additionally, pure CMOS implementations of neural networks impose area and
power constraints that hinder deployment onto embedded platforms [Kim et al.
2012].
This material is based on research sponsored by Air Force Research Laboratory under agreement number
FA8750-16-1-0108. The U.S. Government is authorized to reproduce and distribute reprints for Governmental
purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of Air Force Research Laboratory or the U.S. Government.
Authors' addresses: A. M. Zyarah and D. Kudithipudi, Neuromorphic AI Lab, Rochester Institute of Technology,
Rochester, NY; emails: {amz6011, dxkeec}@rit.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by
others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to
post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions
from permissions@acm.org.
© 2018 ACM. 1550-4832/2018/01-ARTx $15.00
DOI: 10.1145/3233987
ACM Journal on Emerging Technologies in Computing Systems, Vol. x, No. x, Article x, Pub. date: January 2018.
arXiv:1808.07329v1 [cs.ET] 22 Aug 2018
In 2008, a successful physical implementation of a synapse-like device called the memristor
was reported by Strukov et al. [Strukov et al. 2008]. Theoretically, the memristor was
introduced by L. Chua in 1971 as a fourth fundamental electrical device that correlates
the flux and charge in a non-linear relationship [Chua 1971]. The memristor acts like a
non-volatile memory element [Borghetti et al. 2010], consumes low energy [Prezioso
et al. 2015; Merkel 2017], has a small footprint compared to transistors, and can be
integrated in high density crossbar structures [Jo et al. 2010]. A key advantage of the
crossbar structure is that it enables performing the most computationally intensive
operations (multiply-accumulate) in neural networks concurrently while consuming
a small amount of power compared to conventional implementations [Snider 2008; Taha
et al. 2014]. These properties make the memristor a natural choice for realizing neu-
ral networks in an efficient manner such that it meets embedded device constraints.
Typically, memristive devices are used to model the bipolar synaptic weights in neural
networks. Because a memristor exhibits properties similar to those of a
resistor, memristors can represent only a positive range of weights. Thus, either a hybrid
CMOS-memristor circuit [Kim et al. 2012; Soudry et al. 2015] or two memristors [Alibart et al.
2013; Hu et al. 2014] are used to model the bipolar synaptic weights. Although a synaptic
weight modeled with one memristor is easier to train, it demands additional
circuitry to generate the bipolar weights. On the other hand, using two memristors to
model the synaptic weights reduces the power consumption, but increases the hardware
complexity and complicates the training process.
Several research groups have studied the realization of synaptic weights in memris-
tive devices while enabling the on-device learning. To realize the synaptic weight with
one memristor, Sah et al. proposed a memristor-based synaptic circuit which employs an
H-Bridge and doublet generator to perform positive and negative input-weight multipli-
cation. However, it was not studied in the context of a multi-level network or a crossbar
architecture [Sah et al. 2012]. In 2015, Soudry et al. presented a memristor crossbar
that supports on-chip online gradient descent. In this architecture, two transistors and
a memristor were used to implement a synapse. This makes the total number of transis-
tors in the crossbar scale linearly with the number of memristors [Soudry et al. 2015].
Adopting two memristors to model the synaptic weights was studied by Alibart et al., who
proposed a memristor-based single-layer perceptron to classify synthetic patterns of the
letters 'X' and 'T'. The proposed design is trained using ex-situ and in-situ methods [Alibart
et al. 2013]. Hasan et al. presented an on-chip training circuit to account for device
faults and variability in memristor-based deep neural networks [Hasan et al. 2017].
This network was trained with auto-encoders and backpropagation and simulated in
MATLAB for a classification application. When it comes to the extreme learning machine
(ELM), which is the neural network algorithm used to verify our proposed architecture,
few research groups have studied memristor-based ELM. In 2014, Merkel et al.
proposed a memristor-based ELM implementation, but it was not studied within the context
of a crossbar structure [Merkel and Kudithipudi 2014]. Later, in 2015, an OxRAM-based
ELM architecture was proposed by Suri et al., in which nano-scale device variability
is exploited to design ELM in an efficient manner [Suri et al. 2015]. Unfortunately,
this work does not provide details about the hardware implementation and the training
process. It is also important to mention that most memristor-based neural network
architectures proposed in the literature use threshold-voltage memristors. To the best of our
knowledge, no design has explored on-device learning for threshold-current memristors
integrated in a crossbar architecture.
This paper proposes on-device training circuitry for threshold-current memristors
integrated into a crossbar structure. Moreover, the paper presents a different approach
for realizing the synaptic weights in a memristive crossbar such that bipolar weights
are obtained. The proposed approach is based on a semi-trained crossbar structure
(a combination of trained and untrained memristors), where the trained memristors
model the synaptic weights and the fixed ones are used in association with the trained
memristors to generate bipolar synaptic weights. The proposed design is simulated in
Cadence Spectre and verified for classification in MATLAB using binomial
(Diabetes and Australian Credit) and multinomial (Iris and MNIST) datasets [Lichman
2013; LeCun 1998]. For a single 4x4 layer network (crossbar and its associated control
and training circuitry) implemented in the IBM 65nm technology node, the total power is
estimated to be ≈ 42.16µW, while the area is 26.48µm x 22.35µm.
The rest of the paper is organized as follows: Section 2 presents an overview of
ELM. Sections 3 and 4 discuss the design methodology and the hardware analysis. The
experimental setup is described in Section 5. Section 6 demonstrates the experimental
results and Section 7 concludes the paper.
2. OVERVIEW OF ELM
Extreme learning machine (ELM) is a multi-layer feed-forward neural network used
in real-time regression and classification applications [Huang et al. 2004]. Its roots trace
back to the random vector functional link (RVFL) networks proposed in 1994 [Pao
et al. 1994]. Primarily, ELM is composed of three successive fully connected layers:
input, hidden, and output. The input layer is used to present the input data to the
network, whereas the hidden and output layers conduct the feature extraction and
data classification, respectively. When the input data is presented to the network, it
gets relayed to the hidden layer, where all the relevant and important features are
stochastically extracted [Auerbach et al. 2014]. This is done via projecting the input
data to high-dimensional space carried out by a large number of hidden neurons [Huang
2014]. The features extracted by the hidden layer are further relayed to the output
layer where the class label associated with the input is identified. A key feature of ELM
is that the training is confined only to the output layer synaptic weights, whereas the
hidden layer weights are randomly initialized and left unchanged [Huang et al. 2006].
This feature speeds up the training in ELM and makes the algorithm attractive for
hardware implementations as there is no need for back-propagation.
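As an illustration of this structure, the following minimal Python sketch builds a hidden layer with fixed random weights and a linear read-out. The function names, the tanh activation, the uniform initialization range, and the identity output activation are assumptions made for this example; they are not taken from the proposed hardware.

```python
import math
import random


def make_hidden_layer(n_inputs, n_hidden, seed=0):
    """Draw random hidden weights and biases once; they are never trained."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    return weights, biases


def hidden_features(x, weights, biases):
    """Project input x into the hidden feature space (tanh is an assumed choice)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def elm_output(h, beta):
    """Linear read-out: beta[i][j] links hidden neuron j to output neuron i."""
    return [sum(b_ij * h_j for b_ij, h_j in zip(row, h)) for row in beta]
```

Only `beta` would be adjusted during training in such a sketch; the hidden weights and biases stay as drawn.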
Fig. 1: High-level representation of ELM with three layers: an input layer, feature
extraction layer (hidden layer), and classification layer (output layer).
Figure 1 illustrates the high-level architecture of an ELM. At runtime, each example
in the input dataset is presented to the network as a pair. Each pair contains an input
feature vector $X^p$ and its associated class label $t^p$, where $X^p \in \mathbb{R}^n$, $\forall p = 1, 2, \ldots, L$,
and $L$ is the dataset size. Using Equation (1), the network feed-forward output can be
computed, where $t_i^*$ represents the predicted output of the $i$th output unit, $\forall i = 1, 2, \ldots, k$.
$k$ and $\eta$ denote the total number of neurons in the output and hidden layers, $b$ is the
bias, while $f$ and $z$ are the activation functions of the hidden and output neurons,
respectively.
$$ t_i^* = z_i\left(\sum_{j=0}^{\eta-1} \beta_j f_j(X, b)\right) \tag{1} $$
$$ \beta = H^{-1} T \tag{2} $$
Adopting the normal equation, Equation (2) (the inverse of the hidden layer output, $H^{-1}$,
multiplied by the desired output class labels, $T$), to find the output layer weight matrix
($\beta$) is a common method in ELM [Huang et al. 2004; Kasun et al. 2013] as it offers faster
convergence compared to its iterative numerical counterpart. However, realizing the matrix
inverse in hardware is cumbersome [Perina et al. 2017]. Rather than using the normal
equation, the iterative delta rule algorithm [Jacobs 1988] is chosen. In the delta rule, a
weight $\beta_{i,j}$ connecting the $j$th neuron in the hidden layer to the $i$th neuron in the output layer is
updated according to Equation (3), where $\alpha$ is the learning rate, and $h_{i,j}^p$ and $(t_i^* - t_i^p)$ refer
to the input and the output error of the $i$th neuron, respectively.
$$ \Delta\beta_{i,j} = \alpha \times h_{i,j}^p \times (t_i^* - t_i^p) \tag{3} $$
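The iterative update can be sketched in a few lines of Python. This is a software illustration only: the function name, the linear output neuron, and the learning rate value are assumptions, and the step is applied against the error term of Equation (3) so that the output error shrinks, which is a sign convention chosen for this example.

```python
def delta_rule_step(beta, h, target, alpha=0.1):
    """One delta-rule update of the output weights for a single example.

    beta[i][j] links hidden neuron j to output neuron i, h is the
    hidden-layer output, and target is the desired output vector.
    Returns the output errors measured before the update.
    """
    predicted = [sum(b * hj for b, hj in zip(row, h)) for row in beta]
    errors = [p - t for p, t in zip(predicted, target)]  # (t*_i - t_i)
    for i, err in enumerate(errors):
        for j, hj in enumerate(h):
            # Step against the error so that |t*_i - t_i| shrinks (assumed sign).
            beta[i][j] -= alpha * hj * err
    return errors


# Usage: repeated presentation of one example drives the error toward zero.
beta = [[0.0, 0.0]]
for _ in range(50):
    last_errors = delta_rule_step(beta, h=[1.0, 0.5], target=[1.0])
```

No matrix inverse is needed; each update uses only a multiplication and a subtraction per weight.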
3. DESIGN METHODOLOGY
3.1. Memristive Crossbar Network
In order to perform the matrix-vector multiplication in ELM, a memristive crossbar is
used as it enables high-speed computations while maintaining low power consumption
and area overhead. Unfortunately, the memristive crossbar structure offers only positive
range of synaptic weights. Therefore, two memristors are used to model the synaptic
weights in a bipolar manner. Typically, this is achieved either by using two crossbars
or one crossbar with dual input (in this work, dual input refers to a signal and its
negation) [Alibart et al. 2013; Hu et al. 2014; Hasan 2016; Chakma et al. 2018]. Here,
both approaches of realizing the synaptic weights in the memristive crossbar and their
constraints will be discussed. Furthermore, in each section, the proposed optimization
approach, called semi-trained crossbar, will be investigated.
3.1.1. Two Crossbars topology. In this topology, two crossbars are employed to generate
positive and negative weight ranges [Alibart et al. 2013; Hu et al. 2014]. Figure 2-(a)
illustrates the use of two crossbars in emulating the synaptic weights, where each
weight value is given by Equation (4). $R_f$ is the feedback resistance of the Op-Amp
based subtracter, and $M_{i,j}^+$ and $M_{i,j}^-$ denote the memristor resistance at the crosspoint
$(i, j)$ for the left (pink) and right (green) crossbars, respectively. By applying an input
voltage ($X^p = [x_0^p, x_1^p, \ldots, x_{n-1}^p]$) at the word-lines (crossbar rows), an output current
($T = [t_0^*, t_1^*, \ldots, t_{k-1}^*]$) will be generated as given by Equation (5), where $\beta$ is the synaptic
weight matrix which can be calculated using Equation (4).
$$ \beta_{i,j} = \frac{R_f}{M_{i,j}^+} - \frac{R_f}{M_{i,j}^-} \tag{4} $$
$$ T = X^p \times \beta \tag{5} $$
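Equations (4) and (5) can be checked numerically with a short Python sketch. The function names and the example resistances are illustrative assumptions; the crossbar is treated as ideal (no wire resistance, no Op-Amp saturation), with `beta[row][column]` following the crosspoint indexing.

```python
def weight_matrix(m_plus, m_minus, r_f):
    """Equation (4): beta[i][j] = Rf / M+[i][j] - Rf / M-[i][j]."""
    return [[r_f / mp - r_f / mm for mp, mm in zip(row_p, row_m)]
            for row_p, row_m in zip(m_plus, m_minus)]


def crossbar_output(x, beta):
    """Equation (5): T = X x beta, one output per crossbar column."""
    n_cols = len(beta[0])
    return [sum(x[i] * beta[i][j] for i in range(len(x)))
            for j in range(n_cols)]


# Usage with illustrative values (resistances in ohms, inputs in volts).
R_F = 500e3
M_PLUS = [[100e3, 250e3], [250e3, 100e3]]
M_MINUS = [[250e3, 100e3], [250e3, 250e3]]
beta = weight_matrix(M_PLUS, M_MINUS, R_F)   # [[3.0, -3.0], [0.0, 3.0]]
outputs = crossbar_output([0.1, 0.2], beta)
```

Equal resistances in the two crossbars give a zero weight, and swapping them flips the sign of the weight.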
It turns out that mapping the synaptic weights to two crossbar arrays complicates
the learning process, as two crossbars need to be trained rather than one. Moreover, a
consistent change in the memristors on the positive and negative crossbars must be
sustained to ensure network convergence. Therefore, this research suggests a different
approach (called semi-trained) to realize the synaptic weights. This approach utilizes
only one crossbar associated with one fixed reference line. The reference line can be
created by using either memristors or resistors. Figure 2-(b) illustrates the structure
of the proposed approach. On the left side, a memristor crossbar is implemented to
emulate the synaptic weights. On the right side, one column (denoted by M−) of fixed
memristors is used such that positive and negative weight ranges are obtained. By
adopting this approach, the number of Op-Amps used at the bit-lines is reduced by
almost half, which significantly reduces the hardware resources.
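To see how a fixed reference still yields both weight polarities, the sketch below evaluates Equation (4) with the M− term held at a single reference resistance while M+ is tuned between its two resistance limits. The function names are hypothetical, the resistance limits and feedback resistance are example values, and choosing a reference that centers the range on zero is an assumption of this example rather than a choice stated in the text.

```python
def weight_range(r_f, m_ref, lrs, hrs):
    """Reachable weights when only M+ is tuned between LRS and HRS."""
    return r_f / hrs - r_f / m_ref, r_f / lrs - r_f / m_ref


def symmetric_reference(lrs, hrs):
    """Reference resistance whose conductance is midway between 1/HRS and 1/LRS."""
    return 2.0 / (1.0 / lrs + 1.0 / hrs)


# Usage with example values (ohms): LRS = 100k, HRS = 250k, Rf = 500k.
m_ref = symmetric_reference(100e3, 250e3)            # about 142.9 kOhm
lowest, highest = weight_range(500e3, m_ref, 100e3, 250e3)
```

With these example values the reachable weights span roughly −1.5 to +1.5, so a single tuned memristor per synapse covers both signs.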
Fig. 2: (a) Memristor-based single layer neural network with k number of neurons and
n× k number of synapses modeled as two crossbars. (b) The proposed memristor-based
crossbar to model similar single layer network as in (a). The network uses a crossbar
to model the synaptic weights, and has an additional untrained column to generate
bipolar weights.
3.1.2. One Crossbar topology. One crossbar is used to model the synaptic weights in
this topology. However, in this crossbar array, the input needs to be negated to achieve
bipolar input-weight matrix-vector multiplication. Figure 3 depicts the schematic of one
crossbar structure, where the input vector Xp and its negation (∼ Xp) are introduced
to the crossbar and are multiplied by the synaptic weight matrix β, as given by Equa-
tion (5). When it comes to training such a structure, again all the memristors in the
crossbar need to be adjusted. By adopting the semi-trained approach, M− memristors
are set to a fixed value, whereas M+ memristors are trained to achieve the desired
network convergence.
3.1.3. Two Crossbars vs. One Crossbar. Although both crossbar topologies
are capable of performing the intended function (bipolar matrix-vector multiplication),
each approach imposes different constraints in hardware. The downside
of using two crossbars to emulate the synaptic weights is that the input-weight
multiplication is split into two parts. One of them is accomplished via the M+ crossbar,
whereas the second is done in the M− crossbar. Due to this separation, additional
constraints are imposed on the network input and its weight range. Figure 4-(a) illustrates
a circuit of one neuron with n inputs; the output of the neuron is given by Equation (6)
and Equation (7).
Fig. 3: (a) Memristive crossbar to model a single layer neural network with k neurons
and n synapses for each neuron. Each synapse ($\beta_{i,j}$) is modeled as two trained memristors
($M_{i,j}^+$ and $M_{i,j}^-$). (b) Proposed memristive crossbar to model a single layer neural
network with k neurons and n synapses for each neuron. Two memristors are used to
model each synapse ($\beta_{i,j}$), but one of them is trained ($M_{i,j}^+$) and the other one is fixed
($M_{i,j}^-$).
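$$ t_i^* = \left(x_0 \frac{-R_f}{M_0^+} + \cdots + x_{n-1} \frac{-R_f}{M_{n-1}^+}\right) + V_x \frac{-R_f}{R_x} \tag{6} $$
$$ V_x = x_0 \frac{-R_x}{M_0^-} + \cdots + x_{n-1} \frac{-R_x}{M_{n-1}^-} \tag{7} $$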
Because $V_x$ is computed by the first Op-Amp, its maximum value is
always limited by the first Op-Amp biasing voltages ($V_{dd}$ and $V_{ss}$), i.e., Equation (8) must
be satisfied. Consequently, the input and weight ranges will be limited, as will the
crossbar size.
$$ \sum_{i=0}^{n-1} x_i \frac{-R_x}{M_i^-} \le (V_{dd} - V_{ss}) \tag{8} $$
Fig. 4: (a) A neuron circuit in a two crossbar topology. (b) A neuron circuit in one crossbar
topology.
In cases where only one crossbar is used to perform the input-weight matrix-vector
multiplication, an additional inverter is needed to negate the input signal. However,
in this network the input-weight multiplication is not segregated. Thus, using
only one Op-Amp at the output suffices, as shown in Figure 4-(b). The output here
is given by Equation (9). Since $t_i^*$ is associated with one Op-Amp, the constraint
in Equation (8) does not apply here. Instead, Equation (10) must be satisfied,
which implies that every single input feature multiplied by its corresponding weight is
evaluated separately to be ≤ (Vdd − Vss). This alleviates the constraints
imposed when using two Op-Amps. Relaxing the input and weight range constraints gives
more flexibility when it comes to hardware implementation. Moreover, larger crossbars
can be realized.
$$ t_i^* = \left(x_0 \frac{-R_f}{M_0^+} + x_0 \frac{R_f}{M_0^-} + \cdots + x_{n-1} \frac{R_f}{M_{n-1}^-}\right) \tag{9} $$
$$ x_i \frac{-R_f}{M_i^-} \le (V_{dd} - V_{ss}) \tag{10} $$
Figure 5 demonstrates the input-weight multiplication performed using the neuron
circuits from Figure 4, where all the input features ($X_n$, where n = 2) are set
to 0.3sin(wt), Rf = Rx = 500kΩ, and M+ = M− = 250kΩ. Vout2 and Vout1 denote the
output of neuron-(a) and neuron-(b), which can be computed based on Equation (6) and
Equation (9), respectively. Although the output in both cases should be the same (0 V),
neuron-(a) gives an incorrect output because it violates the constraint in Equation (8) (notice
that Vx is clipped). This indicates that the neuron circuit in Figure 4-(b) can handle
a wider input-weight range than the former.
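The clipping effect can be reproduced with an idealized Python model of the two neuron circuits at the peak of the input waveform. The Op-Amps are modeled as ideal summers followed by a hard clip, and the ±1 V rails are a hypothetical choice made only so that the intermediate node saturates; the supply values are not given in this passage.

```python
def clip(v, v_low, v_high):
    """Hard saturation at the Op-Amp rails."""
    return max(v_low, min(v_high, v))


def neuron_two_crossbars(x, m_plus, m_minus, r_f, r_x, v_low, v_high):
    """Equations (6)-(7): the M- branch is summed by a first Op-Amp that can clip."""
    v_x = clip(sum(xi * (-r_x / mm) for xi, mm in zip(x, m_minus)), v_low, v_high)
    total = sum(xi * (-r_f / mp) for xi, mp in zip(x, m_plus)) + v_x * (-r_f / r_x)
    return clip(total, v_low, v_high)


def neuron_one_crossbar(x, m_plus, m_minus, r_f, v_low, v_high):
    """Equation (9): both branches meet at one Op-Amp and cancel before clipping."""
    total = sum(xi * (-r_f / mp) + xi * (r_f / mm)
                for xi, mp, mm in zip(x, m_plus, m_minus))
    return clip(total, v_low, v_high)


# Usage: two inputs at the 0.3 V peak, Rf = Rx = 500k, M+ = M- = 250k.
x_peak = [0.3, 0.3]
m = [250e3, 250e3]
v_out2 = neuron_two_crossbars(x_peak, m, m, 500e3, 500e3, -1.0, 1.0)
v_out1 = neuron_one_crossbar(x_peak, m, m, 500e3, -1.0, 1.0)
```

Under these assumed rails the intermediate voltage of the two-crossbar neuron saturates at −1 V instead of −1.2 V, which leaves a residual of about −0.2 V at its output, whereas the one-crossbar neuron returns 0 V.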
Fig. 5: Input-weight multiplication performed using the neuron circuits given in
Figure 4, where all inputs (Xn) are assigned to 0.3sin(wt), Rf = Rx = 500kΩ, and
M+ = M− = 250kΩ. Vout1 and Vout2 denote the output of one and two crossbar neurons,
respectively.
3.2. Delta Rule Algorithm
To simplify on-chip learning, the delta rule algorithm, described by
Equation (3), is used. However, realizing the delta rule equation in hardware still
requires non-trivial resources (a subtractor and a multiplier). Therefore, this work adopts
the simplified delta rule equation used in [Hu et al. 2014] and shown in Equation (11).
$$ \Delta\beta_{i,j} = \alpha \times S(h_{i,j}^p) \times S(t_i^* - t_i^p) \tag{11} $$
$$ S(i) := \begin{cases} 1, & i > 0 \\ -1, & \text{otherwise} \end{cases} $$
Here, the weight changes according to the sign of the gradient with
a fixed learning rate. Although such a procedure slows down convergence,
the resources used for the learning circuitry are significantly reduced. By applying
the delta rule to the semi-trained crossbar structure, the iterative change in weight
value that ensures network convergence can be computed. Recall that each synaptic weight
is emulated by fixed and tuned memristors. By using Equation (4) and Equation (11),
the net change in the memristor to achieve the desired weight value can be calculated,
as in Equation (12).
$$ M_{new} = \frac{M_{old} \times R_f}{R_f - \alpha \times S(x_{i,j}^p) \times S(t_i^* - t_i^p)} \tag{12} $$
$$ M = \frac{1}{M^+} - \frac{1}{M^-} $$
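The sign-based rule of Equation (11) reduces each update to two comparisons, as the following Python sketch shows. The function names and the learning rate value are illustrative assumptions; only the sign function and the product of signs come from Equation (11).

```python
def sign(v):
    """S(.) from Equation (11): +1 for strictly positive values, -1 otherwise."""
    return 1 if v > 0 else -1


def simplified_delta(h, predicted, target, alpha):
    """Equation (11): a fixed-magnitude weight change of +alpha or -alpha."""
    return alpha * sign(h) * sign(predicted - target)


# Usage: the four sign combinations of input and error.
ALPHA = 0.05
changes = [simplified_delta(h, p, 0.2, ALPHA)
           for h, p in [(0.4, 0.9), (-0.4, 0.9), (0.4, 0.1), (-0.4, 0.1)]]
```

Because S(0) evaluates to −1, the rule as written always produces a change of magnitude α, even when the error is exactly zero.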
4. SYSTEM DESIGN AND ANALYSIS
The system level architecture of ELM consists of two main layers: hidden and output.
In this section, the main focus is on the output layer, as it has a similar structure
to the hidden layer except that the output layer is integrated with the training circuit
because its synaptic weights must be adjusted. Figure 6 shows the architecture of the
output layer, which essentially has three parts: memristive crossbar, neuron circuit,
and training circuit. The memristive crossbar represents a single layer of ELM in
which each column corresponds to one neuron connected to Rn synapses modeled by
memristors (Rn is the number of crossbar rows). The crossbar is responsible
for evaluating the input-weight matrix multiplication, whereas the neuron and training
circuits are responsible for performing a non-linear transformation on the crossbar bit-line
outputs and adjusting the weights of the crossbar, respectively.
Primarily, the proposed network runs in two phases: inference and learning. During
the inference phase, the input vector ($X^p = [x_0^p, x_1^p, \ldots, x_{n-1}^p]$) is fed to the network
where it gets multiplied by the synaptic weight matrix ($\beta$) to generate the output vector
($T^*$). The output of the network is evaluated by comparing it to the input class label, and
the difference is reported either as logic '1', denoting that $t_i^* > t_i$, or logic '0' otherwise.
The output of the error computing unit is stored in a shift register to be used in the
learning phase. In the learning phase, the memristor resistances are adjusted according
to the sign of the gradient and the learning rate. In this work, the tuning of the memristors
is done column-by-column (training each column takes two clock cycles) through a
modified Ziksa training circuit [Zyarah et al. 2017], which is modeled by +Tr and −Tr.
Ziksa is used to form an H-Bridge across the memristors that require tuning, and by
allowing the current to flow through the device in both directions, a bipolar weight change
can be achieved (more details are provided in Section 4.1). It is important to mention here
that each Ziksa unit is controlled by local row and column units, which in turn are
controlled by a global controller. The global controller determines when to enable the
inference and learning phases and is responsible for synchronization. In the following
subsections, each unit in the system architecture is discussed in detail.
Fig. 6: The high-level architecture of the ELM output layer including the memristive
crossbar, and the training circuit which is modeled by +Tr and −Tr and controlled by
column and row controllers.
4.1. Ziksa: Training Circuitry
Recall that the simplified delta rule algorithm suggests a fixed adjustment in the memristive
weight. In the adopted memristor model, the changes in the memristive state variable
depend on the current flowing in the device, as given by Equation (13) [Kvatinsky
et al. 2013]
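$$ \frac{\Delta w}{\Delta t} = \begin{cases} k_{off} \cdot \left(\frac{i(t)}{i_{off}} - 1\right)^{\alpha_{off}} \cdot f_{off}(w), & 0 < i_{off} < i \\ 0, & i_{on} < i < i_{off} \\ k_{on} \cdot \left(\frac{i(t)}{i_{on}} - 1\right)^{\alpha_{on}} \cdot f_{on}(w), & i < i_{on} < 0 \end{cases} \tag{13} $$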
where w is the memristor state variable. koff , kon, αoff , and αon are constants, ioff and
ion are the memristor current thresholds, and fon and foff describe the device window
function. As the memristor exhibits properties similar to that of a resistor, the current
flowing in the device will be limited by its resistance. Thus, to satisfy the learning rule
constraints, in this work, we modified our previous design of Ziksa to accommodate this
issue.
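The piecewise behavior of Equation (13) can be sketched in Python as follows. The 3.2 µA threshold, the 4 µA tuning current, and the 3 µA inference current are the values listed in Table I; the symmetric negative threshold, the values of k and α, and the unity window functions are placeholder assumptions chosen only to make the example run.

```python
def state_derivative(i, i_on, i_off, k_on, k_off, alpha_on, alpha_off,
                     f_on=lambda w: 1.0, f_off=lambda w: 1.0, w=0.0):
    """dw/dt from Equation (13) for a device current i (amperes)."""
    if i > i_off > 0:
        return k_off * (i / i_off - 1.0) ** alpha_off * f_off(w)
    if i < i_on < 0:
        return k_on * (i / i_on - 1.0) ** alpha_on * f_on(w)
    return 0.0  # between the two thresholds the state does not move


# Placeholder parameters (assumed): symmetric thresholds, unit gains.
PARAMS = dict(i_on=-3.2e-6, i_off=3.2e-6, k_on=-1.0, k_off=1.0,
              alpha_on=1, alpha_off=1)

rate_inference = state_derivative(3.0e-6, **PARAMS)   # below threshold
rate_tune_up = state_derivative(4.0e-6, **PARAMS)     # above i_off
rate_tune_down = state_derivative(-4.0e-6, **PARAMS)  # below i_on
```

In this sketch an inference current below the threshold leaves the state untouched, while a tuning current of either polarity moves it in opposite directions.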
Fig. 7: Ziksa unit for adjusting a threshold-current memristor. The unit is mainly
composed of four transistors that sandwich the memristor to allow the training current
to flow bi-directionally through the memristor. This allows its value to be incremented
and decremented.
Figure 7 illustrates the modified version of Ziksa in which two current mirrors are
used to limit the amount of current flowing in the memristor which consequently
ensures consistent adjustment of memristor resistance. The circuit works as follows:
during the first clock cycle of learning, which involves increasing the weight values (this
means decreasing the memristor resistance as βi,j ∝ (1/Mi,j)) in a selected column,
current will be provided to the memristor via T5. To ensure a fixed change in the
memristor value, this current will be limited to ≈ I− by using a current mirror created
by T1−2 on the other terminal of the memristor. During the second cycle of training,
the weight will be decremented by allowing the current, ≈ I+, to flow in the opposite
direction.
Practically, fixing the current through the memristor is a difficult condition to meet
with the current technology limitations. However, there is still a possibility to limit
the variation in the current through memristor while changing its state via a cascode
current mirror. The variation in the current through memristor in regular and cascode
current mirrors when using Iref = 4 µA is depicted in Figure 8-(a). The corner analysis
evaluation while considering the fabrication process, ambient temperature, and supply
voltage variations is shown in Figure 8-(b).
4.2. Column and Row Local Control Units
Ziksa transistors are driven by local control units associated with each column and
row. Recall that Ziksa has four transistors that form an H-Bridge sandwiching the
memristors in the crossbar. Half of the H-Bridge transistors reside in +Tr and the
other half in −Tr. Each −Tr is controlled by its associated column controller, shown in
Figure 9-(b). The column local control unit consists of a combinational logic circuit that
drives the T5 and T6 transistors of −Tr. When the learning phase begins, the column
controller receives two signals: ColEn and Polar. The former signal determines whether
a column is selected for training, whereas the latter refers to the system training cycle
which can be either positive or negative. During the positive cycle of training, i.e. Polar
= ’0’, all the weights that need to be incremented are adjusted, while the weights
required to be decremented are tuned in the negative cycle of training. When a column
is enabled by ColEn, T5 and T6 transistors of −Tr will be controlled in an alternating
way. During the low cycles of the Polar signal, PT is set to low to enable transistor T5,
whereas T6 is set to off via NT . This allows the current to flow towards the end-terminal
Fig. 8: (a) The variation in the current supplied to the memristor while sweeping its
value between its low and high resistance states. (b) The variation in the current
through the memristor during corner analysis.
of the crossbar row to increase the weight and vice versa for the high period of the Polar
signal. In case of +Tr, its transistors are controlled via the formed current mirrors
with transistors T1 and T4. The output of +Tr is connected to a tri-state gate controlled
by the row controller, illustrated in Figure 9-(a). It turns out that the row controller
is more complex compared to the column controller because the gradient sign of delta
rule is evaluated here via the input signal (input) and computed network error (Error).
Based on the gradient sign, the memristor resistance will be either incremented or
decremented. However, this is carried out when the training process is enabled via
TrEn. According to the Polar signal state, the memristor resistances are modulated
during the appropriate training cycle. It is important to mention here that the ColEn,
Polar, and TrEn signals are provided to the local controllers via the global controller.
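The column controller behavior described above amounts to a small truth table, sketched here in Python. The function name and the boolean return convention are hypothetical, and keeping both transistors off when the column is not selected is an assumption; the text only specifies the behavior of an enabled column.

```python
def column_controller(col_en, polar):
    """Return (t5_on, t6_on) for the -Tr transistors of one column."""
    if not col_en:
        return False, False   # assumed: an unselected column drives nothing
    if polar == 0:
        return True, False    # positive cycle: T5 conducts, T6 is off
    return False, True        # negative cycle: the roles swap


# Usage: enumerate every (ColEn, Polar) combination.
table = {(c, p): column_controller(c, p) for c in (0, 1) for p in (0, 1)}
```

At most one of the two transistors conducts at any time, so the training current has a single direction in each cycle.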
Fig. 9: (a) Row control unit to control Ziksa +Tr output. (b) Column control unit to drive
Ziksa H-Bridge transistors residing in −Tr.
4.3. Global Controller
All the layers in the proposed system architecture are controlled by a global controller
which takes care of data flow and unit synchronization. The global controller runs in
three main states. In the first state (Read), the global control unit enables the pass
transistors to allow the input signals to propagate through the crossbar to perform
input-weight matrix multiplication. Once this is done, the output of the network will be
captured and the next state (Train C1) starts. In this state, the first round of training,
positive cycle, is performed. Thus, the signals that control the local controllers must be
generated. For the column controller, if a column is selected for training, its associated
ColEn is set to '1', whereas the Polar signal is set low to indicate the positive cycle of
training. For the row controller, TrEn is set high, which, together with the signs of
the input and the computed error, determines the synaptic weights that need to
be incremented. When the global controller enters the third state (Train C2), the
negative cycle of training commences, and the same process as in the previous cycle
is repeated, except that the Polar signal is set high. The global control unit keeps
alternating between the second and third states until all the columns of the crossbar are
trained. Figure 10 is an algorithmic state machine (ASM) chart showing the transitions
between the states and the outputs of each one.
Fig. 10: Algorithm state machine flow chart of the global controller.
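The state sequence described in this section can be sketched as a small Python generator. It is a behavioural illustration, not the authors' controller: the signal name `PassEn` is hypothetical (the text only says that the pass transistors are enabled in the Read state), and the sketch assumes that one column is trained per Train C1/Train C2 pair and that the pass transistors are disabled during training.

```python
from enum import Enum

class State(Enum):
    READ = "Read"
    TRAIN_C1 = "Train C1"
    TRAIN_C2 = "Train C2"

def global_controller(num_columns):
    """Yield (state, signals) for one Read state followed by the
    positive and negative training cycles of every column."""
    idle = [0] * num_columns
    # Read: inputs propagate through the crossbar, training is disabled.
    yield State.READ, {"PassEn": 1, "TrEn": 0, "Polar": 0, "ColEn": idle}
    for col in range(num_columns):
        col_en = [int(c == col) for c in range(num_columns)]
        # Train C1: positive cycle, Polar is low.
        yield State.TRAIN_C1, {"PassEn": 0, "TrEn": 1, "Polar": 0, "ColEn": col_en}
        # Train C2: negative cycle, Polar is high.
        yield State.TRAIN_C2, {"PassEn": 0, "TrEn": 1, "Polar": 1, "ColEn": col_en}

trace = list(global_controller(num_columns=2))
for state, signals in trace:
    print(state.value, signals)
```

For a crossbar with two columns, the generator produces one Read step followed by two Train C1/Train C2 pairs, one per column.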
5. EXPERIMENTAL SETUP
The Verilog-A memristor model proposed by [Kvatinsky et al. 2013] is employed in this
work. The memristor value is set based on the device parameters described in [Fan et al.
2014; Kawahara et al. 2008] such that it meets the network and technology constraints
shown in Table I. The high resistance state (HRS) of the memristor is selected to be 250
kΩ, so that the voltage developed across the memristor does not drive the current mirror
transistor out of saturation. On the other hand, to keep the power dissipation through
the crossbar network as low as possible, the low resistance state (LRS) is set to
100 kΩ.
Table I: The parameters used for the ELM network simulation.
Parameter                                   Value
Memristor tuning current                    4 µA
Max. memristor current during inference     3 µA
Input voltage range                         |V| < 0.5 V
Current threshold                           3.2 µA
Memristor low resistance state (LRS)        100 kΩ
Memristor high resistance state (HRS)       250 kΩ
Recall that the proposed system runs in two phases: inference and learning. In the
inference phase, no change in memristor values should occur. Since a threshold-current
memristor is adopted to model the synaptic weights, the current through these devices
should never exceed the device threshold, to avoid undesired changes in the memristors.
In the learning phase, by contrast, the current must cross the device threshold
current to adjust the device resistance. In the experimental setup used in this work,
the current through the memristor varies during the inference phase, and this
variation is estimated to be ≈0.6 µA. Thus, by using a current of 4 µA during the
learning phase and limiting the input current during the inference phase to 3 µA,
no overlap between the two phases can occur.
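One way to read this argument is as a simple ordering check on the values in Table I: the device threshold must lie between the inference and tuning currents, and the gap between those two currents must exceed the estimated variation. The Python sketch below encodes that reading; the function name `phases_separated` and the exact form of the check are assumptions made for illustration.

```python
# Values from Table I and the text, in microamperes.
I_TUNE = 4.0        # memristor tuning current (learning phase)
I_INFER_MAX = 3.0   # maximum memristor current during inference
I_THRESHOLD = 3.2   # device current threshold
I_VARIATION = 0.6   # estimated variation of the inference current

def phases_separated(i_infer_max, i_threshold, i_tune, variation):
    """True if the threshold lies strictly between the two operating
    currents and their gap exceeds the estimated current variation."""
    ordered = i_infer_max < i_threshold < i_tune
    gap = i_tune - i_infer_max
    return ordered and gap > variation

ok = phases_separated(I_INFER_MAX, I_THRESHOLD, I_TUNE, I_VARIATION)
print("phases separated:", ok)
```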
The following constraints must be fulfilled for the crossbar pass transistors¹:
— Limit the input voltage such that $V_{DS} \ll |V_{GS} - V_{Th}|$. This ensures that the
transistor operates in the triode region and an undistorted signal reaches the
memristors.
— Assume that $K(V_{GS} - 2V_{Th}) \gg G_{mem}(w)$, such that the conductance of the
transistor is much higher than the memristor conductance. Thus, the voltage at the
memristor is ≈ the input voltage [Soudry et al. 2015].
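The second constraint can be illustrated with a series voltage divider. The Python sketch below assumes that the pass transistor behaves as a linear conductance in the triode region and that the far terminal of the memristor is held at a fixed reference (virtual ground). The transistor on-conductance of 1 mS is a hypothetical value chosen only for illustration; the resistance states and the input bound come from Table I.

```python
def memristor_voltage(v_in, g_transistor, r_mem):
    """Voltage across the memristor for a pass transistor (modelled as a
    conductance) in series with a memristor of resistance r_mem."""
    g_mem = 1.0 / r_mem
    return v_in * g_transistor / (g_transistor + g_mem)

V_IN = 0.5            # volts, upper bound of the input range in Table I
G_TRANSISTOR = 1e-3   # siemens, hypothetical on-conductance

for r_mem in (100e3, 250e3):  # LRS and HRS from Table I
    v_mem = memristor_voltage(V_IN, G_TRANSISTOR, r_mem)
    print(f"R_mem = {r_mem / 1e3:.0f} kOhm -> V_mem = {v_mem:.4f} V")
```

With the transistor conductance two orders of magnitude above the largest memristor conductance, the memristor sees more than 99% of the input voltage in both resistance states, which is the sense in which the voltage at the memristor approximates the input.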
Table II: Summary of classification accuracy for binomial and multinomial datasets.
Dataset             ELM, η = 1000            VLSI ELM, η = 120        This work,
                    [Huang et al. 2012](a)   [Yao and Basu 2017](b)   η = variable
Diabetes            77.95%                   77.09%                   72.73% (η=65)
Australian Credit   87.89%                   87.89%                   82.16% (η=40)
Iris                96.04%                   -                        84.66% (η=20)
MNIST               -                        -                        93.53% (η=180)(c)
(a) Software implementation of ELM.
(b) Non-memristive mixed-signal implementation of ELM.
(c) All MNIST images are preprocessed with the HOG feature descriptor, described in [Zyarah and Kudithipudi
2017].
6. EXPERIMENTAL RESULTS AND NETWORK VERIFICATION
6.1. Network Verification
To verify the operation of the proposed design, each unit in the network is
simulated independently and within a network in the Cadence Spectre environment. Then,
the same network is emulated in MATLAB and simulated for classification applications
under the same circuit constraints but with different configurations. The benchmarks
employed in this work are selected from the UCI library and chosen to be binomial (Diabetes
and Australian Credit) and multinomial (Iris). In addition, the standard multi-class
handwritten digits dataset, MNIST, is used. Figure 11 depicts the weight distribution of the
output layer when using the MNIST dataset and the accuracy achieved for each dataset
during the training and inference phases. Furthermore, the variation in accuracy that
may occur due to the process variation of memristors² and random weight initialization
is shown (variation over 5 iterations, each iteration averaged over 10 runs). Table II
compares the achieved accuracy with previous ELM implementations
for the same datasets. Although the proposed work offers lower
classification accuracy, it has a simpler network, as the number of hidden neurons
is much lower. The performance degradation in the network can primarily
be attributed to the input voltage and weight range constraints imposed on the
network. These constraints are due to the memristors' limited resistance range and the
neuron biasing voltage.
¹ These constraints can be overcome by using transmission gates rather than pass transistors.
² A 10% variation in the memristor resistance range (LRS and HRS) has been considered during the simulation.
[Figure 11, panels (a) and (b); plot annotation: Mean = 0.04, Std. = 0.756]
Fig. 11: (a) Training and testing accuracy of emulated semi-trained ELM model for
binomial and multinomial classifications while considering memristor device variations.
(b) Output layer weight distribution of the emulated ELM network trained on MNIST
dataset.
6.2. Network Scalability
To estimate the resources needed to map a large neural network onto the proposed
design, a full-custom design of a small-scale (4x4) single-layer network is implemented in
Cadence using the IBM 65nm technology node. Figure 12 shows the total transistor count
as a single-layer neural network is scaled exponentially from 2x2 up to 128x128; the
transistor count tends to grow in linear proportion to the network size.
Fig. 12: Transistor count for a single layer neural network with various crossbar sizes.
6.3. Power Consumption
The total power consumption of the proposed design is estimated for a 4x4 single-layer
neural network. The power consumption is evaluated in the Cadence Spectre environment
while running the system at 100 MHz. In the worst-case scenario (all crossbar
memristors set to the low resistance state), the power consumption is estimated to be
40 µW for each 4x4 crossbar and 0.63 µW for the digital circuit (the Op-Amp power
is not considered). Note that the power consumed by a memristor whose resistance is
kept unchanged is assumed to be similar to that of a resistor [Marani et al. 2015].
Table III compares the power consumption of different crossbar architectures and
different approaches to realizing the synaptic weights. The power consumption of each
architecture is obtained by covering all the input combinations and averaging the results.
The semi-trained two-crossbar architecture offers the lowest power consumption, as
it is compact and uses almost half the number of memristors compared to the other designs.
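The resistor assumption allows a back-of-the-envelope check of the worst-case crossbar figure. The Python sketch below treats every memristor as an ideal 100 kΩ resistor and assumes that the full 0.5 V input bound from Table I appears across each device. That bias condition is an assumption made here, since the text does not state the operating point used in the Spectre simulation.

```python
def worst_case_crossbar_power(rows, cols, v_mem, r_lrs):
    """Resistive power in watts with every memristor at LRS and the same
    voltage v_mem across each device (P = V^2 / R per memristor)."""
    return rows * cols * v_mem ** 2 / r_lrs

p_crossbar = worst_case_crossbar_power(4, 4, v_mem=0.5, r_lrs=100e3)
print(f"{p_crossbar * 1e6:.2f} uW")
```

Under these assumptions the estimate comes out at 40 µW, in line with the worst-case crossbar figure quoted above.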
Table III: The power consumption distribution of the proposed crossbar architectures.
Architecture                  Digital logic circuit   Crossbar    Total
Fully-trained two-crossbar    1.45 µW                 39.53 µW    40.98 µW
Semi-trained two-crossbar     1.22 µW                 20.85 µW    22.07 µW
Fully-trained one-crossbar    1.52 µW                 41.01 µW    42.53 µW
Semi-trained one-crossbar     1.15 µW                 41.01 µW    42.16 µW
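The totals in Table III and the relative savings of the semi-trained variants can be recomputed with a few lines of Python. The numbers are copied from the table; the helper names `total_power` and `saving` are introduced here only for illustration.

```python
# (digital logic, crossbar, reported total) in microwatts, from Table III.
POWER_UW = {
    "Fully-trained two-crossbar": (1.45, 39.53, 40.98),
    "Semi-trained two-crossbar": (1.22, 20.85, 22.07),
    "Fully-trained one-crossbar": (1.52, 41.01, 42.53),
    "Semi-trained one-crossbar": (1.15, 41.01, 42.16),
}

def total_power(name):
    """Sum of the digital and crossbar contributions for one architecture."""
    digital, crossbar, _ = POWER_UW[name]
    return digital + crossbar

def saving(reference, candidate):
    """Fractional reduction in total power of candidate versus reference."""
    ref = total_power(reference)
    return (ref - total_power(candidate)) / ref

# Each reported total should equal the sum of its two components.
for name, (_, _, reported) in POWER_UW.items():
    assert abs(total_power(name) - reported) < 0.005

two_crossbar_saving = saving("Fully-trained two-crossbar", "Semi-trained two-crossbar")
one_crossbar_saving = saving("Fully-trained one-crossbar", "Semi-trained one-crossbar")
print(f"two-crossbar saving: {two_crossbar_saving:.1%}")
print(f"one-crossbar saving: {one_crossbar_saving:.1%}")
```

The semi-trained two-crossbar design reduces total power by roughly 46% relative to its fully-trained counterpart, whereas the one-crossbar variants differ by less than 1%, because their crossbar power is identical and only the digital logic changes.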
7. CONCLUSIONS
This paper investigates a new approach for realizing positive and negative synaptic
weights in a crossbar structure using threshold-current memristors. The proposed
approach relies on one memristive crossbar to model the weights and uses additional
fixed (untrained) columns to generate bipolar weights. Moreover, the paper presents an
updated version of the on-device training circuit, Ziksa, which can be used to modulate
threshold-current memristors in a crossbar structure.
The proposed network is tested for classification applications with binomial and
multinomial datasets while considering memristor device variations. It is found that the
process variations have limited effect on the network performance for large datasets. In
these cases, the network has better generalization. In scenarios where power efficiency
is a constraint, the semi-trained network with the two-crossbar topology is preferred. Future
work will investigate the network performance while considering crossbar resistance,
noise effect, and other process variations.
ACKNOWLEDGMENTS
The authors would like to thank the members of the Neuromorphic AI research Lab at RIT for their support
and critical feedback. The authors would also like to thank the reviewers for their time and extensive feedback
to enhance the quality of the paper.
REFERENCES
Fabien Alibart, Elham Zamanidoost, and Dmitri B Strukov. 2013. Pattern classification by memristive
crossbar circuits using ex situ and in situ training. Nature communications 4 (2013).
Joshua E Auerbach, Chrisantha Fernando, and Dario Floreano. 2014. Online extreme evolutionary learning
machines. In Artificial Life 14: Proceedings of the Fourteenth International Conference on the Synthesis
and Simulation of Living Systems. The MIT Press, 465–472.
Julien Borghetti, Gregory S Snider, Philip J Kuekes, J Joshua Yang, Duncan R Stewart, and R Stanley
Williams. 2010. 'Memristive' switches enable 'stateful' logic operations via material implication. Nature 464,
7290 (2010), 873–876.
Gangotree Chakma, Md Musabbir Adnan, Austin R Wyer, Ryan Weiss, Catherine D Schuman, and Garrett S
Rose. 2018. Memristive Mixed-Signal Neuromorphic Systems: Energy-Efficient Learning at the Circuit-
Level. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8, 1 (2018), 125–136.
Leon Chua. 1971. Memristor-the missing circuit element. IEEE Transactions on circuit theory 18, 5 (1971),
507–519.
Deliang Fan, Mrigank Sharad, and Kaushik Roy. 2014. Design and synthesis of ultralow energy spin-
memristor threshold logic. IEEE Transactions on Nanotechnology 13, 3 (2014), 574–583.
Md Raqibul Hasan. 2016. Memristor Based Low Power High Throughput Circuits and Systems Design. Ph.D.
Dissertation. University of Dayton.
Raqibul Hasan, Tarek M Taha, and Chris Yakopcic. 2017. On-chip training of memristor crossbar based
multi-layer neural networks. Microelectronics Journal 66 (2017), 31–40.
Miao Hu, Hai Li, Yiran Chen, Qing Wu, Garrett S Rose, and Richard W Linderman. 2014. Memristor crossbar-
based neuromorphic computing system: A case study. IEEE transactions on neural networks and learning
systems 25, 10 (2014), 1864–1878.
Guang-Bin Huang. 2014. An insight into extreme learning machines: random neurons, random features and
kernels. Cognitive Computation 6, 3 (2014), 376–390.
Guang-Bin Huang, Hongming Zhou, Xiaojian Ding, and Rui Zhang. 2012. Extreme learning machine for
regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B
(Cybernetics) 42, 2 (2012), 513–529.
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. 2004. Extreme learning machine: a new learning
scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International
Joint Conference on, Vol. 2. IEEE, 985–990.
Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. 2006. Extreme learning machine: theory and
applications. Neurocomputing 70, 1 (2006), 489–501.
Giacomo Indiveri and Shih-Chii Liu. 2015. Memory and information processing in neuromorphic systems.
Proc. IEEE 103, 8 (2015), 1379–1397.
Robert A Jacobs. 1988. Increased rates of convergence through learning rate adaptation. Neural networks 1,
4 (1988), 295–307.
Sung Hyun Jo, Ting Chang, Idongesit Ebong, Bhavitavya B Bhadviya, Pinaki Mazumder, and Wei Lu. 2010.
Nanoscale memristor device as synapse in neuromorphic systems. Nano letters 10, 4 (2010), 1297–1301.
Liyanaarachchi Lekamalage Chamara Kasun, Hongming Zhou, Guang-Bin Huang, and Chi Man Vong. 2013.
Representational learning with ELMs for big data. (2013).
Takayuki Kawahara, Riichiro Takemura, Katsuya Miura, Jun Hayakawa, Shoji Ikeda, Young Min Lee,
Ryutaro Sasaki, Yasushi Goto, Kenchi Ito, Toshiyasu Meguro, and others. 2008. 2 Mb SPRAM (spin-
transfer torque RAM) with bit-by-bit bi-directional current write and parallelizing-direction current read.
IEEE Journal of Solid-State Circuits 43, 1 (2008), 109–120.
Hyongsuk Kim, Maheshwar Pd Sah, Changju Yang, Tamás Roska, and Leon O Chua. 2012. Neural synaptic
weighting with a pulse-based memristor circuit. IEEE Transactions on Circuits and Systems I: Regular
Papers 59, 1 (2012), 148–158.
Shahar Kvatinsky, Eby G Friedman, Avinoam Kolodny, and Uri C Weiser. 2013. TEAM: Threshold adaptive
memristor model. IEEE Transactions on Circuits and Systems I: Regular Papers 60, 1 (2013), 211–221.
Yann LeCun. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998).
M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml
Roberto Marani, Gennaro Gelao, and Anna Gina Perri. 2015. A review on memristor applications. arXiv
preprint arXiv:1506.06899 (2015).
Cory Merkel. 2017. Current-mode Memristor Crossbars for Neuromemristive Systems. arXiv preprint
arXiv:1707.05316 (2017).
Cory Merkel and Dhireesha Kudithipudi. 2014. Neuromemristive extreme learning machines for pattern
classification. In VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on. IEEE, 77–82.
Yoh-Han Pao, Gwang-Hoon Park, and Dejan J Sobajic. 1994. Learning and generalization characteristics of
the random vector functional-link net. Neurocomputing 6, 2 (1994), 163–180.
André Bannwart Perina, Paulo Matias, Eduardo Marques, Vanderlei Bonato, João Miguel Gago Pontes
De Brito, and others. 2017. Exploiting Kant and Kimura's Matrix Inversion Algorithm on FPGA. In
Digital System Design (DSD), 2017 Euromicro Conference on. IEEE, 516–519.
Mirko Prezioso, Farnood Merrikh-Bayat, BD Hoskins, GC Adam, Konstantin K Likharev, and Dmitri B
Strukov. 2015. Training and operation of an integrated neuromorphic network based on metal-oxide
memristors. Nature 521, 7550 (2015), 61–64.
Maheshwar Pd Sah, Changju Yang, Hyongsuk Kim, and Leon O Chua. 2012. Memristor circuit for
artificial synaptic weighting of pulse inputs. In Circuits and Systems (ISCAS), 2012 IEEE International
Symposium on. IEEE, 1604–1607.
Greg S Snider. 2008. Spike-timing-dependent learning in memristive nanodevices. In Nanoscale Architectures,
2008. NANOARCH 2008. IEEE International Symposium on. IEEE, 85–92.
Daniel Soudry, Dotan Di Castro, Asaf Gal, Avinoam Kolodny, and Shahar Kvatinsky. 2015. Memristor-based
multilayer neural networks with online gradient descent training. IEEE transactions on neural networks
and learning systems 26, 10 (2015), 2408–2421.
Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. 2008. The missing memristor
found. Nature 453, 7191 (2008), 80–83.
Manan Suri, Vivek Parmar, Gilbert Sassine, and Fabien Alibart. 2015. OXRAM based ELM architecture for
multi-class classification applications. In Neural Networks (IJCNN), 2015 International Joint Conference
on. IEEE, 1–8.
Tarek M Taha, Raqibul Hasan, and Chris Yakopcic. 2014. Memristor crossbar based multicore neuromorphic
processors. In System-on-Chip Conference (SOCC), 2014 27th IEEE International. IEEE, 383–389.
Enyi Yao and Arindam Basu. 2017. VLSI extreme learning machine: A design space exploration. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 25, 1 (2017), 60–74.
Abdullah M Zyarah and Dhireesha Kudithipudi. 2017. Extreme learning machine as a generalizable
classification engine. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 3371–3376.
Abdullah M Zyarah, Nicholas Soures, Lydia Hays, Robin B Jacobs-Gedrim, Sapan Agarwal, Matthew
Marinella, and Dhireesha Kudithipudi. 2017. Ziksa: On-chip learning accelerator with memristor
crossbars for multilevel neural networks. In Circuits and Systems (ISCAS), 2017 IEEE International
Symposium on. IEEE, 1–4.
