In-situ Stochastic Training of MTJ Crossbar based Neural Networks by Mondal, Ankit & Srivastava, Ankur
Ankit Mondal
University of Maryland
College Park, Maryland 20742
amondal2@terpmail.umd.edu
Ankur Srivastava
University of Maryland
College Park, Maryland 20742
ankurs@umd.edu
ABSTRACT
Owing to high device density, scalability and non-volatility, Magnetic
Tunnel Junction-based crossbars have garnered significant interest
for implementing the weights of an artificial neural network. The
existence of only two stable states in MTJs implies a high overhead of
obtaining optimal binary weights in software. We illustrate that the
inherent parallelism in the crossbar structure makes it highly
appropriate for in-situ training, wherein the network is taught directly on
the hardware. It leads to significantly smaller training overhead as
the training time is independent of the size of the network, while also
circumventing the effects of alternate current paths in the crossbar
and accounting for manufacturing variations in the device. We show
how the stochastic switching characteristics of MTJs can be leveraged
to perform probabilistic weight updates using the gradient descent
algorithm. We describe how the update operations can be performed
on crossbars both with and without access transistors and perform
simulations on them to demonstrate the effectiveness of our
techniques. The results reveal that stochastically trained MTJ-crossbar
NNs achieve a classification accuracy nearly the same as that of
real-valued-weight networks trained in software and exhibit immunity to
device variations.
CCS CONCEPTS
• Computing methodologies → Neural networks; Supervised learning;
• Hardware → Spintronics and magnetic technologies; Emerging
architectures; Non-volatile memory;
1 INTRODUCTION
Deep Neural Networks (DNNs) have become a popular choice for
tasks such as image classification, face recognition, and Natural
Language Processing. This has however been at the cost of massive
computations on von Neumann architectures exhibiting high energy
and area requirements [11]. The emergence of novel devices and
special-purpose architectures has called for a shift from conventional
digital hardware for implementing neural algorithms [19].
Aempts have been made towards dedicated hardware designs
and realization of the synaptic weights (and neurons) of a Neural
Network (NN) by using CMOS transistors in an analog fashion [14];
but these have met with challenges of scalability and volatility. Paral-
lel research work has focused on using post-CMOS devices such as
memristors, which are non-volatile devices with a variable resistance
[17]. However, the fabrication of multilevel memristors with stable
states is still a challenge [8].
Another choice is the Magnetic Tunnel Junction (MTJ), an emerging
binary device (since it has 2 stable states) which has shown its
potential as a storage element and is a promising candidate for
replacing CMOS in memory chips [15]. Its non-volatility and scalability
make it a particularly attractive choice for logic-in-memory type
architectures for neural networks. MTJs and memristors can be
connected in a crossbar configuration which allows greater scalability
and higher performance due to its inherent parallelism [1, 7, 17].
Several studies have investigated how crossbar arrays with
memristors [5, 22], MTJs [1, 8] and domain-wall ferromagnets [2, 22]
can implement Spiking Neural Networks (SNN) trained using
Spike-Timing Dependent Plasticity (STDP). Hasan et al. [20] and
Soudry et al. [6] have implemented multi-layer NNs on memristive
crossbars trained on-chip using the backpropagation algorithm and
demonstrated them on supervised learning tasks.
Continuous weight networks can be simplified into discrete weight
networks without significant degradation in classification accuracy
while achieving substantial power benefits [18]. The use of discrete
weight networks, such as BinaryConnect [16] and in [9], also stems
from the need to address the high storage and computational
demands of a large number of full-precision weights. The existence
of only 2 stable states in MTJs makes them a good candidate for the
realization of binary weight networks. One way of training such NNs
is to perform weight updates stochastically, which is justifiable from
evidence that learning in human brains also has some stochasticity
associated with it [19]. That such a method can lead to convergence with
high probability in a finite time has been shown in [23].
Obtaining optimal weights for a binary network in software can be
impractical because its discrete nature requires integer programming.
Also, when physically realizing an NN on hardware, the underlying
device variations can have a substantial impact on the model accuracy,
and need to be accounted for in the training process. Merely
characterizing the variations in the hardware platform is not sufficient for
overcoming this issue.
In this paper, we explore the use of MTJ crossbars for the hardware
implementation of the synaptic weight matrices of a neural network.
We propose the in-situ training of such an MTJ crossbar NN, which
allows us to exploit its inherent parallelism for significantly faster
training and also accounts for device variations. We advocate a
probabilistic way of updating the MTJ synaptic weights through the
gradient descent algorithm by exploiting the stochasticity in their
switching. We experiment with two crossbar structures: with and
without access transistors. The latter poses the additional challenge
of sneak-path currents during programming, which makes training
in-situ the only choice to achieve satisfactory performance. Finally,
we support our proposed techniques with data by modeling device
and circuit properties and running simulations.
2 BACKGROUND
In this section we describe the basics of neural networks and the
parallelism offered by the crossbar architecture, and introduce the
characteristics of Magnetic Tunnel Junctions.
2.1 Neural Networks
The computation performed by any layer of an NN during the inference
(forward propagation) phase basically comprises a matrix-vector
multiplication. Say x ∈ R^M is the input to a layer and W ∈ R^(N×M)
represents the synaptic weight matrix; then the output y ∈ R^N is
y = f(Wx)    (1)
where f() is an activation function. Training of the NN can be done
by backpropagation using the gradient descent optimization method.
The weight update of the synapse connecting the ith input to the jth
output is given as
$$ \Delta W_{ji} = -\eta \frac{\partial E}{\partial W_{ji}} = -\eta x_i \delta_j \qquad (2) $$
where E is the cost function of the presented input sample x, η is the
learning rate and δ_j is the error calculated at the jth output using y
and the desired output. It is worth noting that such a weight update
is local in nature, in that it depends only on the information available
at the synapse: the input to it and the error at its output. The weight
update of the entire matrix can thus be written as
∆W = −η δ x^T    (3)
[arXiv:1806.09057v1 [cs.NE] 24 Jun 2018]
The major computational cost of this algorithm comes from the
O(MN) complexity of eqns. (1) and (3), whose implementation on
general-purpose hardware requires time and memory of the same
order, which makes them impractical for large-scale applications.
Fortunately, the nature of computation in eqn. (1) and the locality
of the weight update enable the design of highly parallel hardware that
reduces the overall time complexity to O(1).
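As an illustration of eqns. (1) and (3), the following minimal sketch computes a layer output and the outer-product weight update in plain Python. The function names, the example values and the choice of tanh for f are assumptions made for illustration only; the nested loops make the O(MN) cost on sequential hardware explicit.

```python
import math

def forward(W, x):
    """Return y = f(Wx) for an N x M weight matrix W (list of rows), eqn. (1).

    f is taken to be tanh here (an illustrative choice).
    """
    return [math.tanh(sum(w_ji * x_i for w_ji, x_i in zip(row, x))) for row in W]

def weight_update(x, delta, eta):
    """Return Delta W = -eta * delta * x^T as an N x M list of rows, eqn. (3)."""
    return [[-eta * d_j * x_i for x_i in x] for d_j in delta]

# Hypothetical 2 x 2 example.
W = [[0.5, -0.5], [1.0, 0.0]]
x = [1.0, -1.0]
y = forward(W, x)
dW = weight_update(x, delta=[0.2, -0.4], eta=0.5)
```

Each entry dW[j][i] depends only on x[i] and delta[j], which is the locality property that the crossbar exploits.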
2.2 The Crossbar Architecture
The physical realization of a synaptic weight matrix is possible using
the grid-like crossbar structure where each junction has a resistance
corresponding to one synapse. Fig. 1(a) shows a simplified crossbar
with each row corresponding to an input and each column to an
output neuron. Let V_i ∈ [−V, V] be the voltage applied at the ith
input terminal and G_ji be the conductance of the synapse connecting
it to the jth output. By Ohm's Law, the current through that synapse
is G_ji V_i, and by Kirchhoff's law the total current at the output is
$$ I_j = \sum_i G_{ji} V_i \qquad (4) $$
which bears similarity to the dot products in (1). This can then be fed
to suitable analog circuits for implementing the activation function.
Since the outputs are obtained almost instantaneously after the
inputs are applied, the matrix-vector multiplication of eqn. (1) is
performed in parallel with constant time complexity. As for the
update phase, the crossbar resistances can be modified by suitably
modeling the required change as the product of 2 physical quantities
derivable from the inputs and the errors. In this way, the O(MN)
operations can be done in parallel using the MN synapses.
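Eqn. (4) can be checked numerically with a short sketch. The conductance and voltage values below are hypothetical and chosen only to make the arithmetic easy to follow; the function name is likewise an assumption.

```python
def column_currents(G, V):
    """Output currents I_j = sum_i G[j][i] * V[i], eqn. (4).

    G is an N x M list of conductances in siemens (row j holds the synapses
    feeding output j) and V is a list of M input voltages in volts.
    """
    return [sum(g_ji * v_i for g_ji, v_i in zip(G_j, V)) for G_j in G]

# Hypothetical values: two inputs, two outputs.
G = [[1e-4, 2e-4],
     [2e-4, 2e-4]]
V = [0.1, -0.1]
I = column_currents(G, V)
```

In the physical crossbar all N sums are formed simultaneously by current summation at the output terminals, whereas this sequential sketch visits every synapse in turn.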
2.3 Magnetic Tunnel Junction
The Magnetic Tunnel Junction (MTJ) is a 2-terminal spintronic device
consisting primarily of 2 ferromagnetic layers separated by a thin
tunnel barrier (typically MgO). The magnetic orientation of one of
the magnetic layers is fixed, whereas that of the other is free, as
shown in fig. 1(b). MTJs possess 2 stable states where the relative
magnetic orientations of the free and fixed layers are Parallel (P) and
Anti-Parallel (AP) respectively, with the P state exhibiting a lower
resistance than the AP state (R_P < R_AP).
It is possible to switch the state of the MTJ by passing spin-polarized
current of appropriate polarity which flips the magnetization of the
free layer through the mechanism of spin-transfer torque [26]. The
time required to switch is heavily dependent on the magnitude of
the switching current. Moreover, this switching process is a
stochastic one, in the sense that a pulse of given amplitude and duration
has only a certain probability to successfully change the state. This
stochasticity is due to thermal fluctuations in the initial magnetization
angle and is an intrinsic property of the STT switching [26].
[Figure 1, panel (a): Structure of an M × N crossbar, with input voltages
V1 ... VM, synapse conductances G1,1 ... GN,M and output currents I1 ... IN.
Panel (b): The 2 stable states of an MTJ (Parallel and Anti-Parallel),
showing the fixed layer, tunnel barrier and free layer. Panel (c): a 2 × 2
crossbar.]
Figure 1: (a) A crossbar (b) Magnetic Tunnel Junction (c) A 2 × 2 crossbar
[Figure 2, panel (a): Switching probability P vs pulse width t (ns), for
different values of I (65 µA, 80 µA, 95 µA and 110 µA). Panel (b):
Switching probability P vs switching current I (µA), for different values
of t (2.25 ns, 2.50 ns, 2.75 ns and 3.00 ns).]
Figure 2: MTJ AP → P switching probability as a function of t and I
Depending on the magnitude I of the current and the critical
current I_c0 [8], the switching probability in the high-speed precessional
regime (I > I_c0) is expressed as
$$ P(a, t) = \exp\left(-4 f(a)\, \Delta \exp(-2t/T)\right), \quad \text{with } f(a) = \left(\frac{2a}{a-1}\right)^{-2/(a+1)} \qquad (5) $$
where a = I/I_c0, t is the pulse width, ∆ is the thermal stability and T
is the mean switching time (which is dependent on a) [10].
The spin transfer efficiency (θ) of an MTJ is different for the
2 switching directions, with θ_P→AP having a smaller value than
θ_AP→P [25]. This makes I_c0^(P→AP) > I_c0^(AP→P), which means that the
same magnitude and duration of current will correspond to different
switching probabilities for the 2 switching directions. Fig. 2 shows the
dependence of the switching probability on pulse width and
switching current for the AP → P transition. Observe the similarity in the
nature of variation with I and t. The P → AP transition exhibits the
same kind of behavior, albeit with different values of I and t.
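The switching probability model of eqn. (5) can be evaluated with the sketch below. It assumes the exponent of f(a) is −2/(a + 1), as in the standard precessional-regime model. The thermal stability and mean switching time passed in the example are hypothetical placeholders, not device parameters from this work, and T is treated as a given constant although it depends on a.

```python
import math

def current_factor(a):
    """f(a) = (2a / (a - 1)) ** (-2 / (a + 1)), defined for a = I / I_c0 > 1."""
    if a <= 1.0:
        raise ValueError("eqn. (5) applies only in the regime I > I_c0 (a > 1)")
    return (2.0 * a / (a - 1.0)) ** (-2.0 / (a + 1.0))

def switching_probability(a, t, thermal_stability, mean_switch_time):
    """P(a, t) = exp(-4 f(a) Delta exp(-2 t / T)), eqn. (5).

    t and mean_switch_time (T) must share the same time unit.
    """
    decay = math.exp(-2.0 * t / mean_switch_time)
    return math.exp(-4.0 * current_factor(a) * thermal_stability * decay)

# Hypothetical parameters: a = 2, Delta = 60, T = 1 ns.
p_short = switching_probability(2.0, 1.0, 60.0, 1.0)  # 1 ns pulse
p_long = switching_probability(2.0, 3.0, 60.0, 1.0)   # 3 ns pulse
```

For a fixed a, a longer pulse gives a higher switching probability, which is the trend plotted in Fig. 2(a).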
3 MTJ CROSSBAR BASED NEURAL NETWORKS
The stochastic switching nature of MTJs has necessitated the usage
of high write currents or write durations in memory applications to
ensure low write errors. Alternatively, one can also use them to
implement the synaptic weights in a crossbar where each cross-point
would be an MTJ in one of its 2 states. They are capable of being
programmed at high speeds and exhibit endurance of the order
of 10^15 write cycles. However, the inherently binary nature of MTJs
implies that such synapses can represent only 2 weight values and
hence can implement only binary networks. Although it is possible
to have some continuous behavior with the inclusion of a domain
wall in the free layer [2], the maturity of such technology is not on
par with that of the binary version [22].
Training Binary Networks: Obtaining optimal binary weights
for an NN is an NP-hard problem with an exponential time complexity,
and hence a practical solution must involve some form of training of
the binary network. This prompts the use of a probabilistic learning technique
since the required weight update is continuous whereas any possible
change in the conductance of the MTJ could only be discrete, in
fact binary. As stated in [19], stochastic update of binary weights
is computationally equivalent to deterministic update of multi-level
weights at the system level.
In [1], Vincent et al. exploit the stochastic switching behavior
of MTJs to propose their use as a "stochastic memristive synapse" in
an SNN taught using a simplified STDP rule. However, there is no
theoretical guarantee of the convergence of STDP for general inputs
[21]. We propose using a probabilistic learning approach by training
using the gradient descent method (which requires weight updates
of the form in eqn. (2)) as demonstrated in section 4.2.
3.1 The Motivation for In-situ Training
There are 2 ways (primarily) in which MTJs in the crossbar can be
connected to their respective input and output terminals -
(1) With selector devices (1T1R) - Here each MTJ synapse is
connected in series with an MOS transistor (as in fig. 1(c)), resulting
in O(M × N) transistors in the crossbar.
Input    Error    ∆W        W and G      Switch
x > 0    δ > 0    ∆W < 0    Decreases    P → AP
x > 0    δ < 0    ∆W > 0    Increases    AP → P
x < 0    δ > 0    ∆W > 0    Increases    AP → P
x < 0    δ < 0    ∆W < 0    Decreases    P → AP
Table 1: Write phase. Signs of x, δ, and ∆W, required change in weight
W and conductance G, and the desired direction of switching of the MTJ synapse
(2) Without selector devices (1R) - Synapses are directly connected
to the crossbar terminals; there are no transistors within the
crossbar, such as the one in fig. 1(a). While a 1R structure provides
greater scalability, it does so at the cost of reduced control of and
access to individual synapses.
Stochastic learning can be done (simulated) offline and the final
weights obtained can be programmed on to the crossbar
deterministically. But, since MTJs have an inherently stochastic switching
behaviour, deterministically programming them on a crossbar would
require currents having high magnitude and duration to guarantee
successful write operations. The possibility of selecting synapses to
be written in the 1T1R architecture ensures that this method has no
side-effects stemming from alternate current paths (because there would
be none). But, despite circumventing this issue, this architecture
can suffer from performance degradation due to the intrinsic device
variations, which only worsen with scaling. On the other hand, in
a 1R architecture, such high programming currents, when they sneak
through alternate paths, are bound to cause unwanted changes in
neighboring synapses, owing to which the weights may never
converge. This necessitates in-situ training of the crossbar in a probabilistic
way for both 1T1R and 1R configurations, as only training on the
hardware can account for both alternate paths and device variability.
3.2 Network Binarization
Simply using ±1 as the binary weight values, represented by the
P and AP states of an MTJ, is naive and estimating a good scaling
factor b is essential for overall network performance. An appropriate
way to determine a suitable b is to minimize the L2 loss between the
real-valued weights W and quantized ones, as was done in [18]. This
provides a solution b = ‖W‖_1/n (the mean of absolute values of W).
Thus an MTJ in the P (AP) state would signify a weight of +b (−b).
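The scaling factor can be computed directly from a real-valued weight matrix, as in the minimal sketch below. Assigning +b to non-negative weights and −b to negative ones is an assumption here (it is the sign rule that accompanies this L2-optimal b in the binary-weight literature); the function name and the example matrix are hypothetical.

```python
def binarize(W):
    """Return (b, Wb): b = ||W||_1 / n and Wb with entries +b or -b.

    W is a list of rows of real-valued weights; n is the number of weights.
    Assumption: a weight w maps to +b (P state) if w >= 0, else -b (AP state).
    """
    n = sum(len(row) for row in W)
    b = sum(abs(w) for row in W for w in row) / n
    Wb = [[b if w >= 0 else -b for w in row] for row in W]
    return b, Wb

# Hypothetical 2 x 2 weight matrix.
b, Wb = binarize([[0.5, -1.5], [2.0, -1.0]])
```

Here the absolute values sum to 5.0 over 4 weights, so b is 1.25 and every synapse stores either +1.25 or −1.25.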
4 IN-SITU TRAINING OF MTJ CROSSBARS
We rst provide a high-level understanding of how an MTJ synaptic
crossbar implementing an NN should work. For the sake of sim-
plicity, all operations are described for a single-layer NN and can
be easily scaled to multiple layers (more details subsequently). We
then illustrate how the gradient descent method can be used for the
stochastic weight update of MTJs, and nally describe the in-situ
training procedure for the 2 crossbar architectures.
4.1 Overview of Operations
The training process is carried out as follows.
Read Phase: Upon receiving a training input x ∈ R^M, voltages
V_i^r ∈ [−V, V] ∀ i proportional to x_i are applied to the input
terminals, whereas the output terminals are maintained at ground potential.
Current I_ji = G_ji V_i^r flows through the (j, i) synapse and the total
currents I_j at the output terminals are suitably converted to the output y.
Write Phase: Using y and the desired output, calculate the error δ.
Table 1 lists the 4 possible cases of weight update depending on x and
δ. The gradient descent algorithm requires a weight update of the
form of eqn. (2). An appropriate way to realize this, as suggested in
[12], is to set switching probabilities proportional to (the magnitude
of) ∆W calculated in (2). Our way of achieving this is explained next.
The read and write phases are carried out for each input sample
and repeated for several iterations until convergence is achieved.
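The write phase can be mimicked in software by the sketch below, which flips each binary synapse towards the direction given in Table 1 with a probability equal to |∆W| from eqn. (2). Taking the proportionality constant as 1 and clipping the probability at 1 are assumptions, as are the function name and the encoding of the P and AP states as +1 and −1.

```python
import random

P_STATE, AP_STATE = +1, -1  # P encodes weight +b, AP encodes weight -b

def stochastic_update(S, x, delta, eta, rng):
    """Probabilistically update an N x M matrix S of MTJ states in place.

    For synapse (j, i), Delta W = -eta * x[i] * delta[j]. A negative Delta W
    calls for P -> AP and a positive one for AP -> P (Table 1). The switch
    is attempted with probability min(1, |Delta W|), an assumed scaling.
    """
    for j, d_j in enumerate(delta):
        for i, x_i in enumerate(x):
            dw = -eta * x_i * d_j
            if dw == 0.0:
                continue  # no update required for this synapse
            target = P_STATE if dw > 0 else AP_STATE
            if S[j][i] != target and rng.random() < min(1.0, abs(dw)):
                S[j][i] = target
    return S

# One output, two inputs; |Delta W| = 1, so both switches are certain.
states = stochastic_update([[P_STATE, AP_STATE]], x=[1.0, -1.0],
                           delta=[1.0], eta=1.0, rng=random.Random(0))
```

With smaller |∆W| the same call flips a synapse only occasionally, so the expected change in weight tracks the gradient even though each device is binary.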
4.2 Stochastic Learning of an MTJ Synapse
We will now describe how the stochasticity of MTJ switching can be
used to perform weight updates with the gradient descent method. Just
as the weight update in eqn. (2) is a function of 2 variables (the input
and the error), the probabilistic switching of MTJs can be controlled
by 2 physical quantities: the magnitude and the duration of the
programming current. We choose the magnitude of the write current
to be dependent on the input x_i and the duration on the error δ_j.
However, as is evident from eqn. (5) and fig. 2, the switching
probability P is a highly non-linear function of the parameters a and t
(recall a = I/I_c0), whereas the desired probability, being proportional
to ∆W_ji, is a linear function of x_i and δ_j. Further, the switching
probability does not immediately rise with the pulse width and the
write current as they increase from 0, indicating some kind of soft
threshold. Note that the direction of switching can be decided by the
polarity of the write current.
We therefore model switching probabilities by a linear mapping of
x and δ to write current I_wr and duration t_wr respectively as follows.
Usually |x| ≤ 1, and henceforth assume for simplicity that |δ| ≤ 1
(can be ensured by normalizing and adjusting with η). The pulse
width t_wr is set at a minimum of t0 and increases linearly with |δ|
(since t_wr needs to increase irrespective of the sign of δ) as
t_wr = t0 + t1 |δ|    (6)
Similarly, the write current (I_wr) would be a minimum of I0 and
increase linearly with |x| as
I_wr = I0 + I1 |x|    (7)
We now wish to find coefficients t0, t1, I0 and I1 that yield MTJ
switching probabilities (P) close to the desired probabilities of weight
update. A certain probability of switching can be obtained for
different combinations of I and t, as is evident from fig. 2. We first fix the
range of pulse widths by choosing suitable t0 and t1 (refer to Table 3).
We want a nearly 0 switching probability for t_wr = t0 irrespective of
the value of I_wr because ∆W = 0 for δ = 0 regardless of x. We thus
choose the maximum I_wr (which is I0 + I1) to be that value of I for
which the plot of P against t_wr starts rising at t0. That is,
$$ P(I_0 + I_1,\, t_{wr}) \text{ is } \begin{cases} < P_0 & \text{for } t_{wr} < t_0, \\ \geq P_0 & \text{for } t_{wr} \geq t_0 \end{cases} \qquad (8) $$
where P0 is a small value. So now, even if |x| is as high as 1, P = P0
at t_wr = t0. In our experiments, we chose P0 to be about 0.05.
A symmetric argument holds when x = 0. For t_wr = t0 + t1, we
want P ≈ 0 if I_wr = I0 (because ∆W = 0 for x = 0). But P should
start increasing as soon as I_wr increases, that is
$$ P(I_{wr},\, t_0 + t_1) \text{ is } \begin{cases} < P_0 & \text{for } I_{wr} < I_0 \\ \geq P_0 & \text{for } I_{wr} \geq I_0 \end{cases} \qquad (9) $$
Fig. 3 shows how well the linear model approximates the required
AP → P switching probabilities (similar curve fitting for P → AP as
well). Table 2 shows the write currents and durations for boundary
values of |x| and |δ|, and Table 3 lists the values of the coefficients
in eqns. (6) and (7). One could use non-linear models for mapping
|δ| and |x| to t_wr and I_wr, respectively, in order to better fit the
desired switching probabilities; however, that would complicate the
analog circuit responsible for the conversion. Owing to this, and the
closeness with which the linear model can replicate the stochastic
switching characteristics, we stick to the linear version.

[Figure 3 plot: switching probability P against pulse width t_wr (ns), with
the |δ| = 0 and |δ| = 1 boundaries and the level P0 marked.]
Figure 3: P vs t_wr of the linear model and desired probabilities
(obtained with η = 0.7) for the AP → P transition. The region between the
dashed vertical lines is of interest. The dark green, cyan and red straight
lines plot desired probabilities for |x| = 0, 0.5, and 1 respectively. The
brown, yellow and blue plots correspond to the actual switching
probabilities (obtained from the linear model) for the mapped currents
I = 60 µA, 75 µA, and 90 µA.

Weight Update    MTJ Switching
|δ| = 0          t_wr = t0
|δ| = 1          t_wr = t0 + t1
|x| = 0          I_wr = I0
|x| = 1          I_wr = I0 + I1
Table 2: Boundary values of the parameters in the weight update eqn. (2)
and their counterparts in the probabilistic switching of an MTJ.

Direction    AP → P    P → AP
t0           1.5 ns    1.5 ns
t1           1 ns      1 ns
I0           60 µA     140 µA
I1           30 µA     60 µA
Table 3: The coefficients that fit the model for both AP → P and P → AP
switching
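The linear mapping of eqns. (6) and (7) with the coefficients of Table 3 can be written out as a short sketch. The function name and the dictionary layout are assumptions for illustration; currents are in µA and pulse widths in ns.

```python
# Coefficients from Table 3 (t in ns, I in microamperes).
COEFFS = {
    "AP->P": {"t0": 1.5, "t1": 1.0, "I0": 60.0, "I1": 30.0},
    "P->AP": {"t0": 1.5, "t1": 1.0, "I0": 140.0, "I1": 60.0},
}

def write_pulse(x, delta, direction):
    """Map an input x and error delta to (I_wr in uA, t_wr in ns).

    Implements t_wr = t0 + t1 * |delta| (eqn. (6)) and
    I_wr = I0 + I1 * |x| (eqn. (7)), assuming |x| <= 1 and |delta| <= 1.
    """
    if abs(x) > 1.0 or abs(delta) > 1.0:
        raise ValueError("the linear model assumes |x| <= 1 and |delta| <= 1")
    c = COEFFS[direction]
    t_wr = c["t0"] + c["t1"] * abs(delta)
    i_wr = c["I0"] + c["I1"] * abs(x)
    return i_wr, t_wr

# |x| = 0.5 maps to 75 uA for AP -> P, one of the currents plotted in Fig. 3.
pulse = write_pulse(0.5, 1.0, "AP->P")
```

The boundary cases of Table 2 follow directly: |δ| = 0 gives the minimum pulse width t0, and |x| = 1 gives the maximum current I0 + I1.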
Next, we describe the 1T1R and 1R crossbar architectures imple-
menting the NN. We show how these can be trained in-situ using the
stochastic learning technique described above.
4.3 The 1T1R Architecture
This is the conventional architecture for memory applications where
each cell has a selection transistor. One major advantage of being
able to selectively turn off certain cells is that it prevents
undesired sneak currents, which at a minimum lead to unnecessary power
consumption. Fig. 4(a) shows a 1T1R crossbar where
each MTJ synapse is connected in series with an NMOS transistor.
Input and output terminals are interfaced with the necessary Control
Logic (CL). All the transistors in a single column will have a common
gate voltage since the corresponding synapses are connected to the
same neuron output, and hence will always have the same error δ
and write pulse width t_wr.
Fig. 4(b) plots the signals during both the read and write phases.
During the read phase (0 ≤ t ≤ T_rd), all transistors are turned on:
c_j = V_DD ∀ j = 1...N so that all columns (neuron outputs) are
read simultaneously. Inputs x_i are provided to their respective input
CLs which convert them to read voltages V_i^r. Output currents I_j are
processed by the output CLs.
Updating the crossbar: Decide the write currents that should
be provided to each input row and the pulse widths for each output
column as described in sec. 4.2. Recall that the former depend on x
and the latter on δ. The direction of the currents would depend on
the sign of the desired weight update. Apply suitable write voltages
at the input terminals while grounding the output terminals.
For the (j, i) synapse, the write pulse width depends on only |δ_j|,
and the write current magnitude depends on |x_i|. But the direction
of switching depends on the signs of δ_j and x_i (see Table 1) and has
to be decided by the polarity of the current. For example, two MTJ synapses
belonging to the same row but different columns may have opposite
signs of δ. Thus, despite having the same input x_i, they are required
to switch in opposite directions and hence need write voltages of
opposite sign. This requires us to split the write phase into two parts
as explained next.
Since the transistor gate control signals are connected to the output
CLs, we can select or deselect a certain column based on information
at its respective CL, which is the error δ. We therefore program the
crossbar sequentially in 2 stages, with the columns updated in a given
stage depending on the signs of δ. Each phase has a duration of T_wr
(which need not be more than t0 + t1, see eqn. (6)). The voltage signals
in each phase are plotted in fig. 4(b) and detailed below -
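[Figure 4, panel (a): An M × N crossbar with synapses S1,1 ... SN,M, input
CLs fed by x1 ... xM, and output CLs providing the gate signals c1 ... cN.
Panel (b): Write voltages and control signals, showing the input terminal
voltage V_i^I and the transistor gate voltage c_j against time t over the
read phase (0 to T_rd), write phase 1 (T_rd to T_rd + T_wr) and write
phase 2 (T_rd + T_wr to T_rd + 2T_wr). Read phase: V_i^r on the inputs and
V_DD on the gates. Write phase 1: input voltage V_P(x_i) if x_i > 0 and
V_AP(x_i) if x_i < 0; gate voltage V_DD for a duration t_wr,j if δ_j > 0
and 0 if δ_j < 0. Write phase 2: input voltage V_P(x_i) if x_i < 0 and
V_AP(x_i) if x_i > 0; gate voltage V_DD for a duration t_wr,j if δ_j < 0
and 0 if δ_j > 0.]
Figure 4: The 1T1R crossbar. (a) Schematic (b) Read & write phase signals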
(1) Phase 1: T_rd ≤ t ≤ T_rd + T_wr. Update the weights of the columns
which had δ > 0. Then, the transistor control signals would be
$$ c_j = \begin{cases} V_{DD}, & \text{for } \delta_j > 0 \text{ and } 0 \leq t - T_{rd} \leq t_{wr,j} \\ 0, & \text{for } \delta_j < 0 \text{ or } t_{wr,j} \leq t - T_{rd} \leq T_{wr} \end{cases} \qquad (10) $$
And the write voltages applied at the input terminals would be
V_wr,i = V_P(x_i) u(x_i) + V_AP(x_i) u(−x_i)    (11)
where u is the unit step function.
(2) Phase 2: T_rd + T_wr ≤ t ≤ T_rd + 2T_wr. Update the weights of
those columns which had δ < 0. Here, the signals are opposite to
those in phase 1 as shown in fig. 4(b).
Here V_P (V_AP) is the voltage applied to switch from P→AP (AP→P)
and can be obtained using (7) and R_P (R_AP). V_P and V_AP still depend
on |x_i|, but for brevity explicit mention will be omitted henceforth.
Let MTJs in the crossbar be arranged in a way that positive (negative)
current from the ith input terminal to the jth output terminal can switch
S_j,i from P → AP (AP → P); hence V_P > 0 (V_AP < 0). Parameters
in Table 3 give V_P ∈ [0.68, 0.98] volts and V_AP ∈ [−0.81, −0.62] volts.
Thus we can see that the read and update operations are completed
in T_rd + 2T_wr time, which is O(1). Due to limitations on the scalability
of the 1T1R architecture, it is worth exploring the feasibility of
transistor-less crossbars to achieve even higher density of integration.
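The 2-phase selection logic of eqns. (10) and (11) can be sketched as follows. The function returns, for each phase, the columns whose transistors are enabled and a symbolic label for the write voltage on each row. The function name, the returned structure and the choice to skip rows with x = 0 (where the sign is undefined) are assumptions for illustration; actual voltage magnitudes are not computed.

```python
def two_phase_schedule(x, delta):
    """Return the row voltages and selected columns for the 2 write phases.

    Phase 1 enables columns with delta > 0 and applies V_P to rows with
    x > 0 and V_AP to rows with x < 0 (eqn. (11)). Phase 2 enables columns
    with delta < 0 and applies the opposite voltages.
    """
    phases = []
    for phase, col_sign in ((1, +1), (2, -1)):
        columns = [j for j, d_j in enumerate(delta) if d_j * col_sign > 0]
        row_voltage = {}
        for i, x_i in enumerate(x):
            if x_i == 0:
                continue  # simplification: no polarity is assigned when x = 0
            row_voltage[i] = "V_P" if x_i * col_sign > 0 else "V_AP"
        phases.append({"phase": phase, "columns": columns,
                       "row_voltage": row_voltage})
    return phases

# Hypothetical sample: x1 > 0, x2 < 0, delta1 > 0, delta2 < 0.
schedule = two_phase_schedule(x=[0.5, -0.2], delta=[0.3, -0.7])
```

Within a phase, each enabled column j would additionally be held on only for its own pulse width t_wr,j, which this sketch omits.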
4.4 The 1R Architecture
Eliminating the need to have an access transistor for every synapse
in the crossbar will allow for compact designs having an integration
density of about 4F^2/device. But the inability to select the synapses to
be updated during programming results in leakage currents through
alternate paths that not only waste energy but can also lead to
undesirable changes in synaptic conductance. We first see the effect of
such currents with the previously proposed write strategy and then
suggest a modified strategy (and circuit) for the 1R architecture.
4.4.1 Two-phase update: Let us analyze the impact of sneak
paths on the 1R crossbar with the 2-phase update strategy used
previously. We first demonstrate the presence of sneak paths with a small
example. Fig. 5(a) shows a 2 × 2 crossbar with transistors only at the
output terminals (to choose columns to be written in any particular
phase). Assume without loss of generality that a certain input x with
x1 > 0, x2 < 0 produced errors δ1 > 0, δ2 < 0 at the outputs. The
equivalent circuit during write phase 1 is drawn in fig. 5(b). It depicts
the currents through the synapses, with the ones through S21 and
S22 being undesired. These may falsely switch S21 from P → AP and
S22 from AP → P if they are in P and AP states respectively.
We now state a worst-case scenario for a crossbar with M inputs.
If M is large, analysis using Kirchhoff's Current Law shows that
the potential difference across an MTJ synapse could go as high as
(V_P − V_AP). The current through such an MTJ, if in the P state, is
I = (V_P − V_AP)/R_P and is high enough (recall V_AP < 0) to switch it
from P → AP. In the other extreme case, a potential difference of
(V_AP − V_P) leading to current I = (V_AP − V_P)/R_AP through an MTJ
in the AP state will switch it from AP → P.
It is also necessary to mention an average (expected) case. Here
these currents reduce to I = (V_P − V_AP)/2R_P and I = (V_AP − V_P)/2R_AP,
respectively, which are half of those found previously, but still have
some probability of switching MTJs (because these currents are
roughly the same as V_P/R_P and V_AP/R_AP). Thus, chances of
unwanted flips of MTJs are quite significant, which calls for some
modification in the circuit and/or in the programming method.
           Input    Error    e_i            V_i^I          c_j            Switch
Phase 1    x > 0    δ > 0    u(x_i)V_DD     u(x_i)V_P      u(δ_j)V_DD     P → AP
Phase 2    x < 0    δ > 0    u(−x_i)V_DD    u(−x_i)V_AP    u(δ_j)V_DD     AP → P
Phase 3    x > 0    δ < 0    u(x_i)V_DD     u(x_i)V_AP     u(−δ_j)V_DD    AP → P
Phase 4    x < 0    δ < 0    u(−x_i)V_DD    u(−x_i)V_P     u(−δ_j)V_DD    P → AP
Table 4: 4-phase weight update for the 1R configuration in fig. 5(c):
Condition on input and error for a synapse to be updated, along with the
control signals (e, c) and write voltages (V^I), for each phase
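[Figure 5, panel (a): a 2 × 2 crossbar with inputs x1 > 0 and x2 < 0,
column select signals c1 and c2, and output terminals V^O_1 and V^O_2.
Panel (b): its equivalent circuit with synapses S11, S21, S12 and S22
between V_P, V_AP and ground. Panel (c): An M × N crossbar with 1R
structure, with input CLs fed by x1 ... xM, row select signals e1 ... eM,
input terminals V^I_1 ... V^I_M, synapses S1,1 ... SN,M, and output CLs
with select signals and output terminals V^O_1 ... V^O_N. Panel (d):
equivalent circuit with R12, R21 and R22 between V_P and ground.]
Figure 5: (a) and (b) Alternate current paths in the 1R structure with
2-phase write strategy - (a) A 2 × 2 crossbar. (b) Its equivalent circuit in
write phase 1 with c1 = V_DD, c2 = 0, V^O_1 = 0, V^I_1 = V_P, V^I_2 = V_AP
(MTJ synapses shown as resistors). (c) Schematic of the proposed 1R
Architecture for MTJ crossbar. (d) The equivalent circuit in phase 1 with
4-phase writing.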
4.4.2 Four-phase Update: The large sneak currents in the
2-phase writing strategy, potentially resulting in false switching, are due
to the high potential difference V_P − V_AP between input terminals
having different signs of inputs. One simple way to mitigate this issue
is to further split the 2 phases of weight update so that, in a given
phase, only rows having the same sign of input are updated at a time.
This is equivalent to first clustering the columns according to the
sign of δ, and then further clustering the rows according to the sign
of x. This proposed 4-phase writing scheme would require additional
transistors to choose the rows to be updated in a given phase as
shown in fig. 5(c). It is summarized in Table 4 where each phase will
have the same duration T_wr; thus the total time for updating the
crossbar is doubled to 4T_wr. Note that this is still O(1) time.
Let us now see how bad the issue of sneak-path leakage is with this
strategy. Fig. 5(d) shows the equivalent circuit for the 2 × 2 crossbar
with the same set of assumptions (only synapses providing alternate
current paths are shown). For an M × N crossbar, in the
worst-case scenario, sneak currents could be V_P/R_P and V_AP/R_AP, and can
still result in false switching. This matches intuition, as the potential
difference between an input terminal and an output terminal is at most
V_P or V_AP. However, in the average case, the sneak current values are
found to be only V_P/3R_P and V_AP/3R_AP. These currents are small,
and do not have the potential to cause undesired switching, as is
evident from the parameters listed in Table 3 and the range of values of
V_P and V_AP. Hence, the 4-phase writing scheme significantly reduces
the incidence of undesired switching at a small cost of increase in
the duration of the write phase. As we shall see, this trade-off is not
only worthwhile but also necessary for satisfactory performance of the
training process.
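The grouping of rows and columns in Table 4 can be sketched as below. Each phase selects rows by the sign of x and columns by the sign of δ, so that every synapse with non-zero x and δ is addressed in exactly one phase. The function name, the symbolic voltage labels and the returned structure are assumptions for illustration.

```python
# (phase, sign of x, sign of delta, row write voltage, switch direction),
# following Table 4.
PHASES = [
    (1, +1, +1, "V_P", "P->AP"),
    (2, -1, +1, "V_AP", "AP->P"),
    (3, +1, -1, "V_AP", "AP->P"),
    (4, -1, -1, "V_P", "P->AP"),
]

def four_phase_schedule(x, delta):
    """Return, per phase, the selected rows (e_i on) and columns (c_j on)."""
    schedule = []
    for phase, sign_x, sign_d, voltage, switch in PHASES:
        rows = [i for i, x_i in enumerate(x) if x_i * sign_x > 0]
        cols = [j for j, d_j in enumerate(delta) if d_j * sign_d > 0]
        schedule.append({"phase": phase, "rows": rows, "columns": cols,
                         "voltage": voltage, "switch": switch})
    return schedule

# Hypothetical sample with 3 inputs and 2 outputs.
plan = four_phase_schedule(x=[0.5, -0.2, 0.9], delta=[0.3, -0.7])
```

Because all rows driven in a phase carry the same voltage polarity, no pair of input terminals is ever separated by V_P − V_AP, which is the source of the large sneak currents in the 2-phase scheme.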
4.5 Multi-Layer NNs
Multi-layer NNs can be implemented on cascaded crossbars (each
representing one layer), with the output of one fed as the input to the
next. The backpropagation algorithm can be implemented on such a
structure in a straightforward manner. Consider a 2-layer NN with weight
matrices $W_1$ and $W_2$. For an input $x$, the final output $y_2$ is
given as
$$y_2 = f(a_2) = f(W_2 y_1), \quad \text{where } y_1 = f(a_1) = f(W_1 x) \tag{12}$$
If $\delta_2$ is the error of the second (output) layer, then that of the
first (hidden) layer is $\delta_1 = (W_2^T \delta_2) \times f'(a_1)$, where
$f'$ is the derivative of the activation function $f$, and $\times$
represents a component-wise product. This operation can be done on the
crossbar of the output layer itself by reversing the roles of its input
and output terminals: $\delta_2$ is now fed as the input and out comes
$W_2^T \delta_2$, which, when multiplied by $f'(a_1)$, gives $\delta_1$,
the error to be used for updating the weights of the hidden layer.
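A minimal software sketch of this reversed read may help. It assumes tanh
activations (as used in the experiments) and takes $\delta_2$ as given.
The helper names and the ±1 example weights are hypothetical, and the
matrix-vector product stands in for what the crossbar computes in analog
form.

```python
import math


def matvec(W, v):
    """Matrix-vector product; models one crossbar read."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Swap rows and columns; models reversing input/output terminals."""
    return [list(col) for col in zip(*W)]


def tanh_prime(a):
    """Derivative of tanh at pre-activation a."""
    return 1.0 - math.tanh(a) ** 2


def forward(W1, W2, x):
    """Forward pass of eq. (12) for a 2-layer network."""
    a1 = matvec(W1, x)
    y1 = [math.tanh(a) for a in a1]
    a2 = matvec(W2, y1)
    y2 = [math.tanh(a) for a in a2]
    return a1, y1, a2, y2


def hidden_error(W2, delta2, a1):
    """delta1 = (W2^T delta2) x f'(a1), component-wise."""
    back = matvec(transpose(W2), delta2)  # output-layer crossbar read in reverse
    return [b * tanh_prime(a) for b, a in zip(back, a1)]


# Hypothetical binary weights: 3 inputs, 2 hidden neurons, 1 output.
W1 = [[1, -1, 1], [-1, -1, 1]]
W2 = [[1, -1]]
a1, y1, a2, y2 = forward(W1, W2, x=[0.2, -0.1, 0.4])
print(hidden_error(W2, delta2=[0.5], a1=a1))
```

Note that `hidden_error` reuses the same weight matrix `W2` as the forward
pass, which is why no separate storage of $W_2^T$ is needed on the
hardware.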
For the MTJ crossbar NN we described, the total duration of the read
phase during forward propagation would be $nT_{rd}$ for an n-layer NN.
Backpropagation of errors to hidden layers would require an extra
$T_{rd}$-long read phase for each such layer, during which the error
at (the output of) a layer is fed as an input to its crossbar to obtain
the error at its preceding layer. Lastly, all the layers can be updated
simultaneously (in $2T_{wr}$ or $4T_{wr}$ time, depending on the
architecture).
Further, a large layer in an NN could be split into multiple crossbars,
some of which share inputs or outputs. All these crossbars can still be
read and written in parallel, thanks to the locality of the weight
update operations.
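As a rough tally of the phases described above, assume that the read and
write phases run strictly one after another with no other overhead, and
that an n-layer network has n − 1 hidden layers. The symbols $T_{iter}$
and $k$ are introduced here only for illustration. The time for one
training iteration would then be:

```latex
T_{iter}
  = \underbrace{n\,T_{rd}}_{\text{forward propagation}}
  + \underbrace{(n-1)\,T_{rd}}_{\text{backpropagation}}
  + \underbrace{k\,T_{wr}}_{\text{weight update}}
  = (2n-1)\,T_{rd} + k\,T_{wr},
  \qquad k \in \{2, 4\}
```

For a 2-layer network this gives $3T_{rd} + 2T_{wr}$ with the 2-phase
update and $3T_{rd} + 4T_{wr}$ with the 4-phase update. Neither expression
depends on the number of neurons in a layer.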
5 EXPERIMENTAL SETUP AND RESULTS
To see how successfully the MTJ crossbar NNs can be trained in-situ,
we performed system-level simulations by modeling the functionality
of the crossbar architecture in MATLAB and training it on three
datasets with supervised learning. To capture the MTJ device parameters,
we used an HSPICE model [13] and included thermal fields in its
LLG equations to obtain the stochastic switching characteristics
[3]. Certain device parameters used in and obtained from this model
were then incorporated into the simulations of the crossbar.
The performance of the neural network was evaluated in the following
scenarios (code-named for further reference). All training
processes used the Mean Square Error cost function, and neurons had
the tanh activation function.
(1) RV:We rst train and evaluate a neural network with real-valued
weights in MATLAB. Binary quantization step (b) is obtained from
this trained network as shown in sec. 3.2.
(2) DP: Suitable binary weights are obtained by doing probabilistic
learning in soware on a binary network. en a 1T1R crossbar
and a 1R crossbar are deterministically programmed to these
weights. We see the eect of device variations on the former, and
of alternate current paths and resulting false switchings on the
laer.
(3) ST: An MTJ synaptic crossbar is modeled and stochastically
trained in-situ using the linear model of stochastic weight update
described in sec. 4.2 for the
(a) 1T1R architecture, with the 2-phase write strategy (sec. 4.3).
(b) 1R architecture, with both the 2-phase (to see the eects of
sneak currents) and the 4-phase update strategies (sec. 4.4).
(4) DV: Device variations of dierent extent are introduced in the
stochastic training of both the 1T1R and 1R crossbars. It reects
in the variations in the resistance of the P and AP states, which
usually doesn’t exceed 10% as per experiments [4].
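One way the DV scenario could be modeled in software is to draw a separate
P-state and AP-state resistance for every device. The sketch below is an
assumption-laden illustration, not the authors' setup: it treats the
stated variation as the relative standard deviation of a Gaussian, and the
nominal resistances (1.0 and 2.0, in arbitrary units) are hypothetical
placeholders rather than the values of Table 3.

```python
import random


def sample_resistances(n_devices, r_p, r_ap, variation, seed=0):
    """Draw (R_P, R_AP) pairs for n_devices MTJs.

    Assumption: `variation` is the relative standard deviation of a
    Gaussian spread around the nominal values r_p and r_ap.
    """
    rng = random.Random(seed)  # seeded for reproducible runs
    devices = []
    for _ in range(n_devices):
        rp = r_p * (1.0 + rng.gauss(0.0, variation))
        rap = r_ap * (1.0 + rng.gauss(0.0, variation))
        devices.append((rp, rap))
    return devices


# Hypothetical nominal values in arbitrary units, 10% variation.
crossbar = sample_resistances(n_devices=6, r_p=1.0, r_ap=2.0, variation=0.10)
for rp, rap in crossbar:
    print(f"R_P = {rp:.3f}, R_AP = {rap:.3f}")
```

Setting `variation=0.0` reproduces the ideal-device case, which makes it
easy to compare runs with and without variations under the same seed.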
We use the following datasets for evaluation.
SONAR, Rocks vs Mines [27]: Three different NN architectures
are considered: one with 1 layer (1L), and two with 2 layers having
15 and 25 hidden neurons respectively, named 2L15 and 2L25.
They were trained, and then tested on 104 samples of the test dataset.
MNIST Digit Recognition [24]: Three 2-layer networks with 50,
100 and 150 hidden units respectively (2L50, 2L100 and 2L150) and a
3-layer network with 50+25 hidden units (3L) were evaluated on the
10000 images of the test dataset.
Wisconsin Breast Cancer (Diagnostic) (WBCD) [27]: A single-layer
network (1L) and two 2-layer networks (2L10 and 2L20) were
considered, and the test dataset had 200 samples.
Table 5 summarizes the classification error rates obtained with these
networks under the different training scenarios mentioned above. The
effect of device variations of different extents on the in-situ
stochastic training is highlighted for some of the networks in Table 6,
with Fig. 6 plotting the mean square error as the training progresses
for the 1R crossbar. Additionally, Fig. 7 compares the error for the two
write strategies. The error does not converge with the 2-phase writing
scheme because of more frequent undesired weight changes, but it does
converge with 4 phases.
Dataset    SONAR                MNIST                       WBCD
Network    1L     2L15   2L25   2L50   2L100  2L150  3L     1L     2L10   2L20
RV         16.4   12.8   11.9   9.87   7.34   6.44   7.25   8.35   7.40   7.10
DP 1T1R    19.2   15.2   14.3   13.50  10.89  9.55   10.45  9.85   8.30   8.55
DP 1R      46.8   41.4   42.7   39.42  36.10  37.92  40.48  24.95  27.60  23.65
ST 1T1R    18.4   14.2   13.6   12.69  10.18  8.96   9.71   9.20   7.70   8.05
ST 1R      18.3   14.5   14.0   12.72  10.20  9.03   9.66   9.40   7.85   7.95
Table 5: Classification error rates for the 3 datasets (on the test samples)
with various NN and crossbar architectures under different training scenarios.
Here, the ST-1R crossbar used the 4-phase update. Ideal devices are assumed for
all except DP-1T1R, where 10% variation was considered. SONAR and WBCD figures
are averages of 10 runs. MNIST and WBCD figures are in %.
Dataset    SONAR                       MNIST                       WBCD
Network    1L            2L15          2L100         3L            2L20
Variation  1T1R   1R     1T1R   1R     1T1R   1R     1T1R   1R     1T1R   1R
2%         18.5   18.4   14.4   14.7   10.27  10.22  9.67   9.73   8.10   8.05
5%         18.7   18.7   14.7   14.8   10.28  10.29  9.78   9.80   8.25   8.30
10%        19.0   19.1   15.1   15.1   10.33  10.43  9.86   9.91   8.30   8.40
20%        19.3   19.5   16.0   15.9   10.42  10.72  10.15  10.28  8.60   8.75
Table 6: Misclassification rates with stochastic training (ST) of 1T1R and
1R architectures under different levels of device variations (DV).
Figure 6: Training error with different extents of device variations on the
1R crossbar architecture for 2 datasets. Panels: (a) On SONAR for the 2L15
network; (b) On MNIST for the 3L network. Each panel plots error against
iterations for variation levels None, 2%, 5%, 10% and 20%.
It is evident from these results that
• When an MTJ synaptic crossbar without access transistors is
stochastically trained in-situ (ST-1R), its classification accuracy is
only slightly lower (about 3% at worst) than when the same network is
trained in software with real-valued weights (RV, which can be
considered the best achievable). However, it brings about a significant
improvement (up to 30%) in accuracy over a deterministically
programmed crossbar (DP-1R), since the latter suffers from undesired
weight changes arising from alternate current paths.
• In-situ training also benefits the crossbar with transistors (ST-1T1R
against DP-1T1R) in the presence of device variations by slightly
improving accuracy (by about 0.5% to 1%).
• It is possible to compensate for the loss in accuracy due to the use
of a binary network by increasing the size of the network (adding more
hidden layers and/or neurons).
• Further, the trained crossbar is robust even in the face of
device variations, owing primarily to the fault-tolerant nature of NNs
and their learning algorithms. As can be seen in Table 6, the increase in
misclassification rates remains within 2% even with 20% variation.
The accuracy degradation of 2% to 3% that we observe (on going
from RV to ST) is comparable to the 3.73% reported by [8] and the
0.8% to 3.5% in [1]. However, it must be emphasized that any such
comparison is fair only if it is made on the same dataset and
network architecture. The benefit of using in-situ training can also
be seen when we compare our work with that of [7] (which performs
offline learning). On the MNIST 2L100 network, we obtained an error
rate of 10.20%, whereas [7] had a much higher value of 30% on the
same network, although it must be mentioned that the latter was at
a disadvantage due to linear activation units.
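The gaps quoted in this section can be re-derived from Table 5 with a few
subtractions. The sketch below does so for the MNIST and WBCD columns
only, since the table caption states that those are in percent. The
dictionary layout and the helper name `gaps` are choices made here for
illustration, and the differences are in percentage points.

```python
# Error rates (%) copied from Table 5 (MNIST and WBCD columns).
TABLE5 = {
    ("MNIST", "2L50"):  {"RV": 9.87, "DP-1T1R": 13.50, "DP-1R": 39.42,
                         "ST-1T1R": 12.69, "ST-1R": 12.72},
    ("MNIST", "2L100"): {"RV": 7.34, "DP-1T1R": 10.89, "DP-1R": 36.10,
                         "ST-1T1R": 10.18, "ST-1R": 10.20},
    ("MNIST", "2L150"): {"RV": 6.44, "DP-1T1R": 9.55, "DP-1R": 37.92,
                         "ST-1T1R": 8.96, "ST-1R": 9.03},
    ("MNIST", "3L"):    {"RV": 7.25, "DP-1T1R": 10.45, "DP-1R": 40.48,
                         "ST-1T1R": 9.71, "ST-1R": 9.66},
    ("WBCD", "1L"):     {"RV": 8.35, "DP-1T1R": 9.85, "DP-1R": 24.95,
                         "ST-1T1R": 9.20, "ST-1R": 9.40},
    ("WBCD", "2L10"):   {"RV": 7.40, "DP-1T1R": 8.30, "DP-1R": 27.60,
                         "ST-1T1R": 7.70, "ST-1R": 7.85},
    ("WBCD", "2L20"):   {"RV": 7.10, "DP-1T1R": 8.55, "DP-1R": 23.65,
                         "ST-1T1R": 8.05, "ST-1R": 7.95},
}


def gaps(worse, better):
    """Error-rate difference (worse - better) per network, rounded."""
    return {net: round(row[worse] - row[better], 2)
            for net, row in TABLE5.items()}


st_vs_rv = gaps("ST-1R", "RV")            # cost of binary in-situ training
dp_vs_st_1r = gaps("DP-1R", "ST-1R")      # benefit over programmed 1R
dp_vs_st_1t1r = gaps("DP-1T1R", "ST-1T1R")  # benefit with transistors

print("ST-1R minus RV, worst case:", max(st_vs_rv.values()))
print("DP-1R minus ST-1R, best case:", max(dp_vs_st_1r.values()))
print("DP-1T1R minus ST-1T1R, range:",
      min(dp_vs_st_1t1r.values()), "to", max(dp_vs_st_1t1r.values()))
```

The largest ST-1R penalty and the largest DP-1R gap both occur on MNIST,
which is consistent with the "about 3% at worst" and "up to 30%" figures
stated above.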
Figure 7: Comparison of error during training of the 1R crossbar with
2-phase and 4-phase update schemes for 2 datasets. No variations assumed.
Panels: (a) On SONAR for the 2L15 network; (b) On MNIST for the 2L100
network. Each panel plots error against iterations for the 2-phase and
4-phase schemes.
6 CONCLUSION
In this work, we show how MTJ crossbars representing the weights of
an ANN can be trained in-situ by exploiting the stochastic switching
properties of MTJs and performing weight updates in a way akin to
gradient descent. We demonstrate how the learning algorithm can be
implemented on crossbars with and without transistors. Results show
that these stochastically trained binary networks can achieve
classification accuracy almost as good as that of networks trained in
software and implemented on processors. This paves the way for highly
scalable neural systems capable of handling complex applications in
the future.
REFERENCES
[1] A. F. Vincent et al. 2015. Spin-transfer torque magnetic memory as a stochastic
memristive synapse for neuromorphic systems. IEEE transactions on biomedical
circuits and systems 9, 2 (2015), 166–174.
[2] A. Sengupta et al. 2016. Hybrid Spintronic-CMOS Spiking Neural Network with
On-Chip Learning: Devices, Circuits, and Systems. Physical Review Applied 6
(2016).
[3] A. Sengupta et al. 2016. Probabilistic deep spiking neural systems enabled by
magnetic tunnel junction. IEEE Transactions on Electron Devices 63 (2016), 2963–70.
[4] D. C. Worledge et al. 2010. Switching distributions and write reliability of per-
pendicular spin torque MRAM. In Electron Devices Meeting (IEDM), 2010 IEEE
International.
[5] D. erlioz et al. 2013. Immunity to device variations in a spiking neural network
with memristive nanodevices. IEEE Transactions on Nanotechnology 12, 3 (2013).
[6] D. Soudry et al. 2015. Memristor-based multilayer neural networks with online
gradient descent training. IEEE transactions on neural networks and learning systems
26, 10 (2015), 2408–2421.
[7] D. Zhang et al. 2016. All spin artificial neural networks based on compound
spintronic synapse and neuron. IEEE transactions on biomedical circuits and systems
10, 4 (2016), 828–836.
[8] D. Zhang et al. 2016. Stochastic spintronic device based synapses and spiking
neurons for neuromorphic computation. In Nanoscale Architectures (NANOARCH),
2016 IEEE/ACM International Symposium on. IEEE, 173–178.
[9] F. Li et al. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016).
[10] H. Tomita et al. 2011. High-speed spin-transfer switching in GMR nano-pillars with
perpendicular anisotropy. IEEE Transactions on Magnetics 47, 6 (2011), 1599–1602.
[11] J. Dean et al. 2012. Large scale distributed deep networks. In Advances in Neural
Information Processing Systems (NIPS). 1223–1231.
[12] J. H. Lee et al. 2007. Defect-tolerant nanoelectronic pattern classifiers. International
Journal of Circuit Theory and Applications 35, 3 (2007), 239–264.
[13] J. Kim et al. 2015. A technology-agnostic MTJ SPICE model with user-defined
dimensions for STT-MRAM scalability studies. In Custom Integrated Circuits Conference
(CICC), 2015 IEEE. IEEE, 1–4.
[14] J. Misra et al. 2010. Artificial neural networks in hardware: A survey of two decades
of progress. 74, 1 (2010), 239–255.
[15] K. L. Wang et al. 2013. Low-power non-volatile spintronic memory: STT-RAM and
beyond. Journal of Physics D: Applied Physics 46, 7 (2013), 074003.
[16] M. Courbariaux et al. 2015. Binaryconnect: Training deep neural networks with
binary weights during propagations. In Advances in NIPS. 3123–3131.
[17] M. Prezioso et al. 2015. Training and operation of an integrated neuromorphic
network based on metal-oxide memristors. Nature 521, 7550 (2015), 61–64.
[18] M. Rastegari et al. 2016. Xnor-net: Imagenet classification using binary
convolutional neural networks. In European Conference on Computer Vision. Springer.
[19] M. Suri et al. 2013. Bio-inspired stochastic computing using binary CBRAM
synapses. IEEE Transactions on Electron Devices 60, 7 (2013), 2402–2409.
[20] R. Hasan et al. 2014. Enabling back propagation training of memristor crossbar
neuromorphic processors. In Neural Networks (IJCNN), 2014 International Joint
Conference on. IEEE, 21–28.
[21] R. Legenstein et al. 2005. What can a neuron learn with spike-timing-dependent
plasticity? Neural computation 17, 11 (2005), 2337–2382.
[22] S. Saïghi et al. 2015. Plasticity in memristive devices for spiking neural networks.
9 (2015).
[23] W. Senn et al. 2005. Convergence of stochastic learning in perceptrons with binary
synapses. Physical Review E 71, 6 (2005), 061907.
[24] Y. LeCun et al. 1998. Gradient-based learning applied to document recognition.
Proc. IEEE 86, 11 (1998), 2278–2324.
[25] Y. Zhang et al. 2012. Asymmetry of MTJ switching and its implication to STT-RAM
designs. In Proceedings of the Conference on Design, Automation and Test in Europe.
[26] Z. Li et al. 2003. Magnetization dynamics with a spin-transfer torque. Physical
Review B 68, 2 (2003), 024404.
[27] M. Lichman. 2013. UCI Machine Learning Repository. (2013).
http://archive.ics.uci.edu/ml
