All-Spin Bayesian Neural Networks by Yang, Kezhou et al.
All-Spin Bayesian Neural Networks
Kezhou Yang, Akul Malhotra, Sen Lu, and Abhronil Sengupta, Member, IEEE
Abstract—Probabilistic machine learning enabled by the
Bayesian formulation has recently gained significant attention
in the domain of automated reasoning and decision-making.
While impressive strides have been made recently to scale up the
performance of deep Bayesian neural networks, they have been
primarily standalone software efforts without any regard to the
underlying hardware implementation. In this paper, we propose
an “All-Spin” Bayesian Neural Network where the underlying
spintronic hardware provides a better match to the Bayesian
computing models. To the best of our knowledge, this is the first
exploration of a Bayesian neural hardware accelerator enabled
by emerging post-CMOS technologies. We develop an experimen-
tally calibrated device-circuit-algorithm co-simulation framework
and demonstrate 24× reduction in energy consumption against
an iso-network CMOS baseline implementation.
Index Terms—Neuromorphic Computing, Bayesian Neural
Networks, Magnetic Tunnel Junction.
I. INTRODUCTION
Probabilistic inference is at the core of decision-making in
the brain. While the past few years have witnessed unprece-
dented success of deep learning in a plethora of pattern recog-
nition tasks (complemented by advancements in dedicated
hardware designs for these workloads), these problem spaces
are usually characterized by the availability of large amounts
of data and networks that do not explicitly represent any
uncertainty in the network structure or parameters. However,
as we strive to deploy Artificial Intelligence platforms in
autonomous systems like self-driving cars, decision-making
based on uncertainty is crucial. Standard supervised backprop-
agation based learning techniques are unable to deal with such
issues since they do not overtly represent uncertainty in the
modelling process. To circumvent these problems, Bayesian
deep learning has recently been gaining attention where deep
neural networks are trained in a probabilistic framework fol-
lowing the classic rules of probability, i.e. Bayes’ theorem. In
the Bayesian formulation, the network is visualized as a set of
plausible models (assuming prior probability distributions on
its parameters, for instance, synaptic weights). Given observed
data, the network learns the posterior probability distributions that
best explain that data. The key distinction between
standard deep networks and Bayesian deep networks is the
fact that network parameters in the latter case are modelled
as probability distributions. It is worth noting here that the
probability distributions are usually modelled by Gaussian
processes characterized by a mean and variance [1]. Utilizing
probability distributions to model network parameters allows
Manuscript received November, 2019.
All authors contributed equally to this work. The authors are with the School
of Electrical Engineering and Computer Science, Department of Materials Sci-
ence and Engineering, The Pennsylvania State University, University Park, PA
16802, USA. A. Malhotra is also affiliated with Birla Institute of Technology
and Science, Pilani, Rajasthan 333031, India. E-mail: sengupta@psu.edu.
us to characterize the network outputs by an uncertainty
measure (variance of the distribution), instead of just point
estimates in a standard network. These uncertainty measures
can therefore be used by autonomous agents for decision
making and self-assessment in the presence of continuous
streaming data.
This paper explores a hardware-software co-design ap-
proach to accelerate Bayesian deep learning platforms through
the usage of spintronic technologies. Recent research has
demonstrated the possibility of mimicking the primitives of
standard deep learning frameworks (synapses and neurons)
by single magnetic device structures that can be operated at
very low terminal voltages [2]–[4]. Further, being non-volatile
in nature, spintronic devices can be arranged in crossbar
architectures to realize “In-Memory” dot-product computing
kernels – thereby alleviating the memory access and memory
leakage bottlenecks prevalent in CMOS-based implementa-
tions [5], [6]. As mentioned earlier, the key distinction between
Bayesian and standard deep learning is the requirement of
sampling from probability distributions and inference based
on sampled values. Interestingly, scaled nanomagnetic devices
operating at room temperature are characterized by thermal
noise resulting in probabilistic switching. We propose to
leverage the inherent device stochasticity of spintronic devices
to generate samples from Gaussian probability distributions
by drawing insights from the statistical Central Limit Theorem.
Further, our paper also elaborates on a cohesive design of a
spintronic Bayesian processor that leverages benefits of spin-
based Gaussian random number generators and spintronic “In-
Memory” crossbar architectures to realize high-performance,
energy efficient hardware platforms. We believe the drastic re-
ductions in circuit complexity (single devices emulating synap-
tic scaling operations, crossbar architectures implementing
“In-Memory” dot-product computing kernels and leveraging
device stochasticity to sample from probability distributions)
and low operating voltages of spintronic devices make them a
promising path toward the realization of Probabilistic Machine
Learning enabled by the Bayesian formulation.
II. PRELIMINARIES: BAYESIAN NEURAL NETWORKS
Before going into the technical details of the work, we
would like to first discuss the preliminaries of Bayesian Neural
Networks and the main computationally expensive operations
pertaining to their hardware implementation. As shown in Fig.
1, a particular layer of a neural network consists of a set
of neurons receiving inputs (sensory information or previous
layer of neurons) through synaptic weights, W. Bayesian
Neural Networks consider the weights of the network, W, to
be latent variables characterized by a probability distribution,
instead of point estimates. More specifically, each weight in
arXiv:1911.05828v4 [cs.ET] 17 Jan 2020
[Figure 1 graphic: inputs pass through weights to produce outputs; learning determines the mean and variance of the weight probability distributions.]
Fig. 1. In a Bayesian framework, each synaptic weight is represented by a
Gaussian probability distribution. The core computing kernel for a particular
layer during inference is a dot-product between the inputs and a synaptic
weight matrix sample drawn from the individual probability distributions.
Learning involves the determination of the mean and variances of the
probability distributions using Bayes’ formulation.
such a framework is a random number drawn from a posterior
probability distribution (characterized by a mean and variance)
that is conditioned on a prior probability distribution and the
observed datapoints, D (incoming patterns to the network).
Hence, during inference, each incoming data pattern will
propagate through the synaptic weights, each of which is
characterized by a probability distribution. Hence, as shown in
Fig. 1, the final output of the neurons of a particular layer will
also be described by a probability distribution characterized by
a mean and variance (the uncertainty measure).
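The propagation of weight uncertainty to the layer output can be illustrated with a small Monte Carlo sketch. It assumes a single linear layer with independent Gaussian weights and no activation function; the function name `sample_layer_outputs` and all numerical values are hypothetical and chosen only for illustration. Under these assumptions the output mean and variance also have closed forms, which the sketch uses as a cross-check.

```python
import numpy as np

def sample_layer_outputs(x, mu, sigma, num_samples, rng):
    """Run repeated forward passes through one linear layer whose weights
    are independent Gaussians with means `mu` and standard deviations `sigma`."""
    # One independent standard-normal draw per weight and per forward pass.
    eps = rng.standard_normal((num_samples,) + mu.shape)
    weights = mu + sigma * eps           # shape (num_samples, n_out, n_in)
    return weights @ x                   # shape (num_samples, n_out)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])                          # illustrative input
mu = np.array([[0.2, -0.1, 0.4], [1.0, 0.3, -0.5]])     # illustrative means
sigma = np.array([[0.05, 0.10, 0.02], [0.20, 0.05, 0.10]])  # illustrative std devs

outputs = sample_layer_outputs(x, mu, sigma, num_samples=20000, rng=rng)
mc_mean = outputs.mean(axis=0)   # point estimate of each neuron output
mc_var = outputs.var(axis=0)     # uncertainty measure of each neuron output

# Closed-form values for a linear layer with independent Gaussian weights.
exact_mean = mu @ x
exact_var = (sigma ** 2) @ (x ** 2)
```

For a real network with non-linear neurons the closed forms no longer hold, which is why the output statistics are estimated from repeated sampled forward passes.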
Bayesian Neural Networks correspond to the family of deep
learning networks where the weights are ‘learnt’ using Bayes’
rule. The learning process here involves the estimation of
the mean and variance of the weight posterior distribution.
Following Bayes’ rule, the posterior probability can be written
as,
$$P(W|D) = \frac{P(D|W)\,P(W)}{P(D)} \qquad (1)$$
where, P (W) denotes the prior probability (probability of the
latent variables before any data input to the network). P (D|W)
is the likelihood, corresponding to the feedforward pass of
the network. In order to make the above posterior probability
density estimation tractable, two popular approaches are –
Variational Inference methods [7] or Markov Chain Monte
Carlo methods [8]. In this paper, we focus on
Variational Inference methods because they scale to large
problems [9]. Variational Inference methods usually
approximate the posterior distribution by a Gaussian distribution,
q(W, θ), characterized by parameters, θ = (µ, σ), where
µ and σ represent the mean and standard deviation vectors
for the probability distributions representing P (W|D) [10].
To summarize, the main hardware design space concerns in
Bayesian Neural Networks can be categorized as follows:
• Gaussian Random Number Generation: Central to the
entire framework, both in the learning as well as the inference
process, is the random number generation corresponding to
the synaptic weights. Given current large model sizes char-
acterized by over a million synapses, coupled with the fact
that random draws need to be performed multiple times for
each synaptic weight, random number generator circuits would
contribute significantly to the total area and power consump-
tion of the hardware. Further, the random numbers need to
be sampled from a Gaussian distribution, thereby increasing
the complexity of the circuit. We will discuss the hardware
costs for CMOS implementations of such Gaussian random
number generators in the following sections along with their
limitations, followed by our proposal of nanomagnetic random
number generators that can serve as the basic building blocks
of such Bayesian Neural Networks.
• Dot-Product Operation Between Inputs and Sampled
Synaptic Weights: A common aspect of any standard deep
learning framework is the fact that forward propagation of
information through the network involves a significant amount
of memory-intensive operations. The dot-product operation
between the synaptic weights and inputs for inference involves
the compute energy along with memory access and memory
leakage components. For large-scale problems and correspond-
ingly large-scale models, CMOS memory access and memory
leakage can account for almost 50% of the total energy consumption
profile [11].
The situation is further worsened in a Bayesian deep
network since each synaptic weight is characterized by two
parameters (mean and variance of the probability distribution),
thereby requiring double memory storage. However, the dot-
product operation does not occur directly between the inputs
and these parameters. In fact, for each inference operation the
synaptic weights (typically assumed constant during inference
for non-probabilistic networks and implemented by memory
elements in hardware) are repeatedly updated depending on
sampled values from the Gaussian probability distribution.
Hence, direct utilization of crossbar-based “In-Memory” com-
puting platforms enabled by non-volatile memory technologies
(discussed in detail later) for alleviating the memory access
and memory fetch bottlenecks is not possible and therefore
requires a significant rethinking.
In the following sections, we sequentially expand on each
of these points and propose a spin-based neural processor
that merges deterministic and stochastic devices as a potential
pathway to enable Bayesian deep learning that can be orders of
magnitude more efficient in contrast to state-of-the-art CMOS
implementations.
III. SPINTRONIC DEVICE DESIGN
A. Magnetic Tunnel Junction - True Random Number Gener-
ator Design
The basic device structure under consideration is the Mag-
netic Tunnel Junction (MTJ), which consists of two nanomag-
nets sandwiching a spacer layer (typically an oxide such as
MgO). The magnetization of one of the layers is magnetostati-
cally “pinned” in a particular direction while the magnetization
of the other layer can be manipulated by a spin current or
an external magnetic field. The two layers are denoted as
the “Pinned” layer (PL) and “Free” layer (FL). Depending
on the relative orientation of the two magnets, the device
exhibits a high-resistance anti-parallel (AP) state (when the
magnetizations of the two layers have opposite direction) and
a low-resistance parallel (P) state (when the magnetizations of
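[Figure 2 graphic: MTJ stack (PL, tunneling oxide (MgO), FL) on a heavy-metal (HM) layer with terminals T1, T2 and T3; charge current IQ and spin current IS; timing of the “Reset” and “Read” currents; initial, intermediate and final magnetization states.]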
Fig. 2. The TRNG device structure is shown. Reset current (IQ) flowing
through the heavy-metal (HM) results in in-plane spin current (IS ) injection
for the MTJ “free layer” (FL). After switching to the in-plane meta-stable
position, the magnet relaxes to either of the two stable states with 50%
probability.
the two layers have the same direction). These two states are
stabilized by an energy barrier determined by the anisotropy
and volume of the magnet.
Let us now consider the switching of the magnet from one
state to another by the application of an external current.
The switching process is inherently stochastic at non-zero
temperatures due to thermal noise [12]. In the presence of
an external current, the probability of switching from one
state to the other is modulated depending on the magnitude
and duration of the current. A true random number generator
(TRNG) can be designed using such a device by biasing the
magnet at the “write” current corresponding to a switching
probability of 50%. Note that CMOS-based TRNGs suffer
from high energy consumption and circuit design complexity
[13]. MTJ-based TRNGs have been proposed and experimentally
demonstrated [14]. Such TRNGs are
characterized by low area footprint and compatibility with
CMOS technology.
In this paper, we consider a spin-orbit coupling enabled
device structure (Fig. 2). It consists of the MTJ stack lying
on top of a heavy-metal (HM) underlayer. The device “read”
is performed through the MTJ stack between terminals T1
and T3. However, the device “write” is performed by passing
current through the heavy-metal underlayer between terminals
T2 and T3. Input current flowing through the heavy-metal
results in spin-injection at the interface of the magnet and
heavy-metal due to spin-Hall effect (SHE) [15] and thereby
causes switching of the MTJ “free layer” [16]. The device has
the following advantages:
• The decoupled “write” and “read” current paths are
advantageous from the perspective of peripheral circuit design:
they avoid “read”-“write” conflicts, and the associated circuits
can be optimized independently.
• Such devices offer 1-2 orders of magnitude higher energy
efficiency than standard spin-transfer torque
MRAMs. This is due to the fact that in such spin-orbit
coupling based systems, every incoming electron in the “write”
current path repeatedly scatters at the interface of the magnet
and heavy metal and transfers multiple units of spin angular
momentum to the ferromagnet lying on top.
Usage of SHE-based switching enables us to use an alterna-
tive TRNG design [17], [18] that has the potential to produce
high quality random numbers in presence of process, voltage
and temperature (PVT) variations. In the earlier scenario of
a standard MTJ, device-to-device variations can result in
deviations of the bias current required for 50% switching
probability, thereby degrading the quality of the random
number generation process. Our scheme is depicted in Fig.
2, where a magnet with Perpendicular Magnetic Anisotropy
(PMA) lies on top of the heavy-metal. The device operation
is divided into three stages. During an initial “Reset” stage,
a current flowing through the heavy metal results in in-plane
spin injection in the magnet and orients it along the hard-
axis for a sufficient magnitude of the “reset” current. The
magnet is then allowed to relax to either of the two stable
states in presence of thermal noise – the switching probability
being 50% since the hard-axis is a meta-stable orientation
point for the magnet. In this case, device-to-device variations
only change the critical current required for biasing
the magnet close to the meta-stable orientation and do not
skew the probability distribution in a particular direction (as in
the standard MTJ case). Hence, by maintaining a worst-case
critical value of the heavy-metal “reset” current, quality of the
random number generation process can be preserved even in
the presence of PVT variations. Further, the “reset” current
does not flow through the tunneling oxide layer (unlike the
standard MTJ case) and therefore reliability of the oxide layer
is not a concern in this scenario [17], [18]. Note that our device
operation is validated by recent experiments of holding the
magnet to its meta-stable hard-axis orientation for performing
Bennett clocking in the context of nanomagnetic logic [19].
SHE-based energy-efficient switching also results in reduction
of the energy consumption involved in the random number
generation process.
The probabilistic switching characteristics of the MTJ can
be analyzed by the Landau-Lifshitz-Gilbert (LLG) equation with
an additional term to account for the spin-orbit torque generated
by the spin-Hall effect at the ferromagnet-heavy metal interface
[20],
$$\frac{d\hat{m}}{dt} = -\gamma(\hat{m} \times H_{eff}) + \alpha\left(\hat{m} \times \frac{d\hat{m}}{dt}\right) + \frac{1}{qN_s}(\hat{m} \times I_s \times \hat{m}) \qquad (2)$$
where, $\hat{m}$ is the unit vector of FL magnetization,
$\gamma = \frac{2\mu_B\mu_0}{\hbar}$ is the gyromagnetic ratio for electron, $\alpha$ is Gilbert’s
damping ratio, $H_{eff}$ is the effective magnetic field including
the shape anisotropy field for elliptic disks, $N_s = \frac{M_sV}{\mu_B}$
is the number of spins in free layer of volume $V$ ($M_s$ is
saturation magnetization and $\mu_B$ is Bohr magneton), and
$I_s = \theta_{SH}(A_{MTJ}/A_{HM})I_q$ is the input spin current ($A_{MTJ}$
and $A_{HM}$ are the MTJ and HM cross-sectional area, $\theta_{SH}$ is
the spin-Hall angle and $I_q$ is the charge current flowing through
the HM underlayer). Thermal noise is included by an additional
thermal field [12],
$$H_{thermal} = \sqrt{\frac{\alpha}{1+\alpha^2}\,\frac{2K_BT_K}{\gamma\mu_0M_sV\delta t}}\;G_{0,1},$$
where $G_{0,1}$ is a Gaussian distribution with zero mean and
unit standard deviation, $K_B$ is Boltzmann constant, $T_K$ is the
temperature and $\delta t$ is the simulation time-step.
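The auxiliary quantities defined for Eq. (2) can be evaluated directly from their formulas. The following is a minimal sketch, not the simulation framework used in this work: the function names are hypothetical, and the free layer length, thickness, damping ratio and time-step are assumed placeholder values, since only the free layer width (40 nm), heavy-metal thickness (2 nm), saturation magnetization, spin-Hall angle and temperature are given in Table I.

```python
import math

# Physical constants (SI units).
MU_B = 9.2740100783e-24   # Bohr magneton (J/T)
MU_0 = 4e-7 * math.pi     # vacuum permeability (T*m/A)
HBAR = 1.054571817e-34    # reduced Planck constant (J*s)
K_B = 1.380649e-23        # Boltzmann constant (J/K)

def gyromagnetic_ratio():
    """gamma = 2 * mu_B * mu_0 / hbar, as defined in the text."""
    return 2.0 * MU_B * MU_0 / HBAR

def num_spins(ms, volume):
    """N_s = M_s * V / mu_B."""
    return ms * volume / MU_B

def spin_current(theta_sh, area_mtj, area_hm, i_q):
    """I_s = theta_SH * (A_MTJ / A_HM) * I_q."""
    return theta_sh * (area_mtj / area_hm) * i_q

def thermal_field_std(alpha, temp_k, ms, volume, dt):
    """Standard deviation of the thermal field that multiplies G_{0,1}."""
    gamma = gyromagnetic_ratio()
    return math.sqrt(alpha / (1.0 + alpha ** 2)
                     * 2.0 * K_B * temp_k / (gamma * MU_0 * ms * volume * dt))

# Values taken from Table I.
ms = 1000e3          # saturation magnetization (A/m)
theta_sh = 0.3       # spin-Hall angle
temp_k = 300.0       # temperature (K)
width = 40e-9        # free layer width (m)
t_hm = 2e-9          # heavy-metal thickness (m)

# ASSUMED values (not given in the text), for illustration only.
length = 100e-9      # free layer length (m)
t_fl = 1.2e-9        # free layer thickness (m)
alpha = 0.01         # Gilbert damping ratio
dt = 1e-12           # simulation time-step (s)

area_mtj = math.pi / 4.0 * length * width   # elliptic disk area
area_hm = width * t_hm                      # HM cross-section under the magnet
volume = area_mtj * t_fl

n_s = num_spins(ms, volume)
i_s = spin_current(theta_sh, area_mtj, area_hm, 140e-6)  # 140 uA reset current
h_std = thermal_field_std(alpha, temp_k, ms, volume, dt)
```

In a time-stepped macrospin simulation, a fresh Gaussian sample scaled by `h_std` would be added to each component of the effective field at every step; that integration loop is omitted here.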
Considering a worst-case “reset” current of 140 µA for a
duration of 1 ns, the energy consumption involved in using a
20 k_BT barrier magnet (calibrated to experimental measurements
reported in [21]) as a TRNG is ∼57 fJ/bit ($I^2Rt$
[Figure 3 graphic: (a) DW-MTJ structure with pinned layers (PL), tunneling oxide, free layer (FL) and heavy metal (HM), terminals T1, T2 and T3, and labels J, ∆x and ∆G; (b) Neuron MTJ interfaced with a Reference MTJ, with VDD, Iin and Iout.]
Fig. 3. (a) DW-MTJ: Magnitude of current flowing through the HM, J ,
causes a proportionate displacement, ∆x, in the DW position, which causes
a change, ∆G, in the device conductance between terminals T1 and T3. (b)
The same device can be used as a neuron by interfacing with a Reference
MTJ. The current provided by the output transistor, Iout, is a saturated linear
function of the input current, Iin.
TABLE I. MTJ Device Simulation Parameters
Parameter                        Value
Free layer width                 40 nm
Heavy-metal thickness, t_HM      2 nm
Saturation magnetization, M_S    1000 kA/m [21]
Spin-Hall angle, θ_SH            0.3 [21]
Energy barrier, E_B              20 k_BT
Temperature, T_K                 300 K
energy consumption) [17], which is almost 2× lower than
standard MTJ-based TRNG.
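As a quick arithmetic check on the $I^2Rt$ figure, the quoted energy, current and duration can be inverted to find the resistance they jointly imply. The resistance is not stated in the text, so the value computed below is only a back-calculation under the assumption that the 57 fJ figure is purely resistive dissipation in the “reset” path; the function names are hypothetical.

```python
def joule_energy(current_a, resistance_ohm, duration_s):
    """Resistive energy dissipation E = I^2 * R * t."""
    return current_a ** 2 * resistance_ohm * duration_s

def implied_resistance(energy_j, current_a, duration_s):
    """Resistance that would give the stated energy: R = E / (I^2 * t)."""
    return energy_j / (current_a ** 2 * duration_s)

# Figures quoted in the text: 57 fJ/bit, 140 uA, 1 ns.
r_implied = implied_resistance(57e-15, 140e-6, 1e-9)   # roughly 2.9 kOhm
energy_check = joule_energy(140e-6, r_implied, 1e-9)   # recovers 57 fJ
```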
B. Domain Wall Motion Based Magnetic Devices - Multi-
Level Non-Volatile Memory Design
The mono-domain magnet discussed above is characterized
by only two stable states. For a magnet with elongated shape,
multiple domains can be stabilized in the FL, thereby leading
to the realization of multiple stable resistive states. Such a
domain-wall (DW) MTJ consists of a domain wall separating
the two oppositely magnetized regions and the domain wall
position is programmed to modulate the MTJ resistance (due
to variation in the relative proportion of P and AP domains in
the device) [5].
We also consider SHE-based domain wall motion dynamics
in magnet-heavy metal bilayers. In magnetic heterostructures
with high perpendicular magnetocrystalline anisotropy,
spin-orbit coupling and broken inversion symmetry stabilize
chiral domain walls through the Dzyaloshinskii-Moriya interaction
(DMI) [22], [23]. Such an interfacial DMI at the magnet-
heavy metal interface results in the formation of a Néel domain
wall. When an in-plane charge current is injected through
the heavy metal, the accumulated spins at the magnet-heavy
metal interface result in Néel domain-wall motion. The device
structure is shown in Fig. 3(a), where a current of magnitude,
J , flowing through the HM layer results in a conductance
change, ∆G, between terminals T1 and T3. As shown in
Fig. 4(a), for a given programming time duration, the current
flowing through the HM underlayer causes a DW displacement
proportional to its magnitude. Note that the device char-
acteristics are obtained by performing micromagnetic LLG
simulations by dividing the magnet into multiple grids. The
domain wall position determines the magnitude of the MTJ
[Figure 4 graphic: (a) domain wall displacement (nm) versus current through HM (µA) for programming durations of 1 ns, 2 ns, 3 ns and 4 ns; (b) device conductance (×10⁻⁷ mho) versus domain wall position (nm).]
Fig. 4. Device characteristics are shown for a 20nm wide and 0.6nm
thick magnet calibrated to experimental measurements [22]. The device
characteristics illustrate that the programming current magnitude is directly
proportional to the amount of conductance change [5].
conductance. The MTJ conductance varies linearly with the
domain wall position since it determines the relative proportion
of the area of the Parallel and Anti-Parallel domains of the
MTJ (Fig. 4(b)). Since such devices can be programmed
to multi-level resistive states and are characterized by low
switching current requirements and linear device behavior
(device conductance change varies in proportion to the magnitude
of the programming current), they are an ideal fit for implementing
crossbar-based “In-Memory” computing platforms (discussed
in next section). We will refer to this device as a DW-MTJ for
the remainder of this text. Experimentally, a multi-level DW
motion-based resistive device was recently shown to exhibit
15-20 intermediate resistive states [24].
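The linear dependence of conductance on domain wall position can be captured by a simple behavioral model in which the Parallel and Anti-Parallel regions conduct in parallel. The sketch below is an idealization, not the micromagnetic model used in this work: the function names, the 80 nm track length, the two end-point conductances and the choice of 17 levels are all assumed placeholder values.

```python
import numpy as np

def dw_mtj_conductance(x_nm, half_length_nm=40.0, g_ap=2.0e-7, g_p=4.5e-7):
    """Conductance (mho) for a domain wall at position x_nm, measured from
    the center of the track. The Parallel fraction grows linearly with x."""
    x = np.clip(np.asarray(x_nm, dtype=float), -half_length_nm, half_length_nm)
    frac_parallel = (x + half_length_nm) / (2.0 * half_length_nm)
    return g_ap + frac_parallel * (g_p - g_ap)

def quantize_position(x_nm, half_length_nm=40.0, n_levels=17):
    """Snap each requested position to the nearest of n_levels equally
    spaced stable positions (a stand-in for a finite number of states)."""
    levels = np.linspace(-half_length_nm, half_length_nm, n_levels)
    x = np.atleast_1d(np.asarray(x_nm, dtype=float))
    nearest = np.abs(levels[:, None] - x[None, :]).argmin(axis=0)
    return levels[nearest]

positions = np.array([-40.0, -3.0, 11.0, 40.0])
snapped = quantize_position(positions)      # nearest programmable positions
g = dw_mtj_conductance(snapped)             # resulting device conductances
```

A finite number of levels bounds the precision with which a weight parameter can be stored in a single device, which is relevant when mapping trained parameters onto the crossbar.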
It is worth noting here that the device structure in Fig.
3(a) can be used as a neuron by interfacing with a Reference
MTJ (Fig. 3(b)) [5]. The resistive divider can drive a CMOS
transistor where the output drive current would be a linear
function of the input current flowing through the heavy metal
layer of the device, thereby mimicking a Saturated Linear
function, provided that the transistor
operates in the saturation regime [5]. The simulation parameters
provided in Table II are used for the DW-MTJ in the rest of this text
unless otherwise stated. The parameters were
obtained from magnetometric measurements of CoFe-Pt nanostrips
[22].
TABLE II. DW-MTJ Device Simulation Parameters
Parameter                                   Value
Ferromagnet Thickness                       0.6 nm
Grid Size                                   4 × 1 × 0.6 nm^3
Heavy Metal Thickness                       3 nm
Domain Wall Width                           7.6 nm
Saturation Magnetization, M_s               700 kA/m [22]
Spin-Hall Angle, θ_SH                       0.07 [22]
Gilbert Damping Factor, α                   0.3 [22]
Exchange Correlation Constant, A            1 × 10^-11 J/m [22]
Perpendicular Magnetic Anisotropy, K_u2     4.8 × 10^5 J/m^3 [22]
Effective DMI constant, D                   -1.2 × 10^-3 J/m^2 [22]
IV. ALL-SPIN BAYESIAN NEURAL NETWORKS
A. Spin-Based Gaussian Random Number Generator
Gaussian random number generation is a hardware-expensive
task. CMOS-based designs for Gaussian random
number generators usually require a large number of registers,
linear feedback circuits, etc. For instance, a recent
CMOS-based Gaussian RNG implementation reports
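[Figure 5 graphic: 2 × 2 array of spin devices (PL/FL) with word-lines WL[1] and WL[2], bit-lines BL[1] and BL[2], GND, a Reset line, and peripherals with an accumulator; timing diagram with Reset, Relax and Read phases; histograms of the generated random variable for N = 2, N = 3 and N = 8.]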
Fig. 5. (a) Outline of a 2× 2 array utilizing spin-based devices interfaced with an accumulator to implement a Gaussian RNG. The probability distributions
of random numbers generated from such an array are shown in the extreme right by using a sum of N random variables (rows of the array). We use 8-bit
representation and 100,000 samples to plot the distribution.
1780 registers and 528.69mW power consumption for a 64-
parallel Gaussian random number generator task [9].
Let us now discuss our proposal of a spin-based Gaussian random
number generator. In the previous section, we discussed
the design of a spintronic TRNG. An array of TRNGs can
be used for sampling from a uniform probability distribution.
Note that each spin device can be considered to produce
a sample from a Bernoulli distribution with probability 0.5.
However, reading a particular row of the array provides a sam-
ple from a discrete uniform distribution. In order to generate a
Gaussian probability distribution from a uniform one, we draw
inspiration from the statistical Central Limit Theorem, discussed
in Box 1. The key result of the Central Limit Theorem that we
utilize is that the sum of a large number of independent and
identically distributed (i.i.d) random variables is approximately
Normal.
Box 1: Central Limit Theorem
Let $\{X_1, X_2, ..., X_n\}$ be a random sample of n i.i.d random
variables drawn from a distribution (which may not be Normal)
of mean $\mu$ and variance $\sigma^2$. Then, the probability density
function of the sample average, $S_n = \frac{X_1 + X_2 + ... + X_n}{n}$,
approaches a Normal distribution with mean $\mu$ and variance
$\frac{\sigma^2}{n}$ as n increases.
Our proposed design is illustrated in Fig. 5 which depicts a
possible array implementation [17] of our spin-based TRNGs.
Each spin device is interfaced with an access transistor. Rows
sharing a Reset-Line can be driven simultaneously. Hence,
random numbers can be generated in the entire array in
parallel. The timing diagram is shown in Fig. 5. Each row can
be read by asserting a particular word-line (WL) and sensing
the bit-line voltage (BL). For an m× n array, each row-read
produces an n-bit number generated from a uniform probabil-
ity distribution. By interfacing the array with an accumulator,
that averages all the generated random numbers, we are able to
produce random numbers drawn from a Normal distribution.
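The array-plus-accumulator scheme described above can be emulated in software as a sanity check. This is a behavioral sketch only: each device is modeled as an ideal, unbiased and independent Bernoulli(0.5) bit, and the function names are hypothetical. The normalization step uses the known mean and variance of a discrete uniform distribution over $2^n$ integer levels.

```python
import numpy as np

def spin_gaussian_rng(num_samples, n_rows, n_bits, rng):
    """Emulate an n_rows x n_bits TRNG array read row by row and averaged."""
    # Each device relaxes to 0 or 1 with probability 0.5 (ideal TRNG bit).
    bits = rng.integers(0, 2, size=(num_samples, n_rows, n_bits))
    # Reading one row gives an n_bits-wide discrete uniform integer.
    place_values = 2 ** np.arange(n_bits)
    row_values = bits @ place_values            # shape (num_samples, n_rows)
    # The accumulator averages the n_rows uniform numbers (Box 1).
    return row_values.mean(axis=1)

def normalize(samples, n_rows, n_bits):
    """Shift and scale the averaged output to zero mean and unit variance."""
    levels = 2 ** n_bits
    mean = (levels - 1) / 2.0                   # mean of uniform on 0..levels-1
    var_single = (levels ** 2 - 1) / 12.0       # variance of that uniform
    return (samples - mean) / np.sqrt(var_single / n_rows)

rng = np.random.default_rng(1)
raw = spin_gaussian_rng(num_samples=100_000, n_rows=8, n_bits=8, rng=rng)
z = normalize(raw, n_rows=8, n_bits=8)          # approximately N(0, 1)
```

Because the output is an average of bounded values, its tails are truncated relative to a true Gaussian; increasing the number of rows reduces this mismatch at the cost of more devices per sample.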
Note that the hardware overhead for this process would
be high for applications that require precise sampling from
Gaussian distributions, since exact convergence occurs only
in the limit of infinitely many samples. However, for the machine learning workloads
considered herein, the performance of such platforms is usually
resilient to approximations in the underlying computations. For
instance, Fig. 5 shows that even with 8-bit representation and 3
random variables drawn from uniform probability distribution,
[Figure 6 graphic: layer inputs pass through a Digital-to-Analog Converter to drive the row voltages V1+, V1-, ..., Vm+, Vm- of the spintronic synapse crossbar array; the column currents feed spin neurons and an Analog-to-Digital Converter to produce the layer outputs.]
Fig. 6. “In-Memory” computing primitive where an array of spin synapses
implement the dot-product kernel.
we are able to achieve an approximate Gaussian distribution.
While Gaussian probability distributions are primarily used in
such algorithms, non-Gaussian weight distributions can be also
designed by using the Gaussian function as a basis. Note that,
while the discussion in Box 1 is equally valid for a CMOS-based
TRNG, such a design would consume an order of magnitude more area
and power than our proposed spin-based TRNG (as explained
in Section III).
B. Dot-Product Operation Between Inputs and Sampled
Synaptic Weights
Let us first discuss the operation of DW-MTJ enabled
spintronic crossbar arrays as an energy-efficient mechanism
to realize the dot-product computing kernel. Assuming each
synapse to be represented by a DW-MTJ, as shown in Fig.
6, they can be arranged in a crossbar structure. Each row of
the array is driven by an analog voltage (output of Digital-to-
Analog converters – DACs) that corresponds to the magnitude
of the input. The current flowing through each synapse is
scaled by the conductance of the device and, due to Kirchhoff’s
current law, all these currents get summed up along the column,
thereby realizing the dot-product kernel. Note that negative
synaptic weights can be also mapped by using two horizontal
lines per input (driven by ‘positive’ and ‘negative’ supply
voltages). If a particular synaptic weight is ‘positive’
(‘negative’), then the corresponding conductance in the ‘positive’
(‘negative’) line is set in accordance with the weight. The
resultant currents get summed up along the column and pass as
[Figure 7 graphic: the input X feeds two “In-Memory” dot-product evaluators, each consisting of a spintronic synapse crossbar array with a Digital-to-Analog Converter (rows I1+, I1-, ..., Im+, Im-) and an Analog-to-Digital Converter; the input to one evaluator is multiplied by the output of the Gaussian RNG (spin device array with WL, BL and Reset lines, peripherals and accumulator); the two results are added to give the feedforward results.]
Fig. 7. All-Spin Bayesian Neural Network Implementation. The two crossbar
arrays behave as “In-Memory” computing kernels whereas the RNG unit
provides sampling operation from Gaussian random number generators.
the input “write” current through the spin-neuron. Consecutive
“write” and “read” cycles of the spin-neurons will implement
multiple iterations of the Bayesian network. The analog output
current provided by the spin-neuron is then converted to a
digital format using the Analog-to-Digital Converters (ADCs).
The digital outputs can be latched to provide inputs for the fan-
out crossbar arrays. The energy-efficiency of the system stems
mainly from two factors:
• The input write resistance of the spintronic neurons
is low (being magneto-metallic devices) and they inherently
require very low currents for switching. This enables the cross-
bar arrays of spintronic synapses to be operated at low terminal
voltages (typically 100mV ). Further, spintronic neurons are
inherently current-driven and thereby do not require costly
current to voltage converters, in contrast to CMOS and other
emerging technology (Resistive Random Access Memory,
Phase Change Memory, among others) based implementations
[25].
• Since spin devices are inherently non-volatile,
the costly Multiply-Accumulate
operations can be performed in the memory array itself, which addresses
the von-Neumann bottleneck.
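The crossbar dot-product with paired ‘positive’ and ‘negative’ input lines can be sketched numerically as follows. This is an idealized model under stated assumptions: wire resistance and neuron input resistance are neglected, an unused device is given exactly zero conductance (a real device would sit at its minimum conductance), and the function names, maximum conductance and example values are hypothetical.

```python
import numpy as np

def weights_to_conductances(w, g_max):
    """Map a signed weight matrix (n_inputs x n_columns) onto two
    non-negative conductance matrices, one per supply polarity."""
    w_max = np.abs(w).max()
    g_pos = np.where(w > 0, w, 0.0) / w_max * g_max    # devices on '+' lines
    g_neg = np.where(w < 0, -w, 0.0) / w_max * g_max   # devices on '-' lines
    return g_pos, g_neg, w_max

def crossbar_column_currents(v_in, g_pos, g_neg):
    """Column currents from Kirchhoff's current law: every device
    contributes (row voltage) x (device conductance) to its column."""
    return v_in @ g_pos + (-v_in) @ g_neg

x = np.array([0.2, 1.0, 0.5])                  # layer inputs (illustrative)
w = np.array([[0.5, -1.0],
              [-0.25, 0.75],
              [1.0, 0.1]])                     # signed weights (illustrative)

v_in = 0.1 * x                                 # scale inputs to <= 100 mV
g_pos, g_neg, w_max = weights_to_conductances(w, g_max=4.5e-7)
i_col = crossbar_column_currents(v_in, g_pos, g_neg)

# Undo the voltage and conductance scaling to recover the dot product.
recovered = i_col * w_max / (4.5e-7 * 0.1)
```

The recovered values equal `x @ w`, so the analog column currents carry the same information as the digital dot product up to a known scale factor.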
However, in the context of Bayesian deep networks, even
for the inference stage, the synaptic weights are not constant
but are updated depending on sampled values from a Gaussian
distribution. Assuming we are able to generate samples from
a Normal distribution by using the device-circuit primitives
proposed earlier, the computations in a Bayesian network can
be partitioned in an appropriate fashion such that the benefits
of spin-based “In-Memory” computing can be still utilized.
This is explained in Box 2.
Realizing that a Normal distribution with a particular mean
and variance is equivalent to a scaled and shifted version of
a Normal distribution with zero mean and unit variance, we
partition the inference equation as shown in (4). The constant
parameters µjk and σjk (highlighted in red) represent the mean
and standard deviation of the probability distribution of the
corresponding synaptic weight and can be therefore implemented by
DW-MTJ based memory devices from a hardware implementation
perspective. The resultant system (Fig. 7) consists of two
crossbar arrays for storing the mean and standard deviation parameters
respectively. While the inputs of a particular layer are directly
applied to the crossbar array storing the mean values, they
are scaled by the random numbers generated from the RNG
unit (outputs normalized to provide random numbers with
zero mean and unit variance) described previously for the
crossbar array storing the standard deviation values. Typical CMOS
neuromorphic architectures are characterized by much higher
movement of weight data than input data to compute the in-
ference operation [26]. Our proposal of computation partition,
explained in Box 2, enables us to leverage the “In-Memory”
computing primitives for storing the probability distribution
parameters while parallely computing energy-efficient dot-
products in-situ between inputs and stochastic weights. It
is worth noting here that the crossbar column outputs are
computed and read sequentially in order to ensure that the
random numbers sampled for the synaptic weights of each
column are independent.
Box 2: Computations Involved in Inference Operation
Once all the posterior distributions are learnt (µ and σ
parameters of the weight distributions), the network output
corresponding to input, x, should be obtained by averaging
the outputs obtained by sampling from the posterior distri-
bution of the weights, W [9]. The output of the network, y,
is therefore given by,
\[
y = \mathbb{E}_{P(W|D)}[f(x,W)] \approx \mathbb{E}_{q(W,\theta)}[f(x,W)] \approx \frac{1}{S}\sum_{i=1}^{S} f(x,W_i) \tag{3}
\]
where, f(x,W) is the network mapping for input x and
weights, W. Using the Variational Inference method, we
approximate the weight distribution by Gaussian func-
tions. The approximation is performed over S independent
Monte-Carlo samples drawn from the Gaussian distribution,
q(W, θ).
Considering just a single layer and neglecting the neural
transfer function, f(x,Wi) for the j-th neuron can be de-
composed into,
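\[
\begin{aligned}
f(x,W_{ij}) &= \sum_{k} x_k \cdot N(\mu_{jk}, \sigma_{jk}) \\
&= \sum_{k} x_k \cdot \left(\mu_{jk} + \sigma_{jk} \cdot N(0,1)\right) \\
&= \sum_{k} x_k \cdot \mu_{jk} + \sum_{k} x_k \cdot N(0,1) \cdot \sigma_{jk}
\end{aligned} \tag{4}
\]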
where k indexes the elements of the input x and N(µjk, σjk)
represents a particular sample drawn from a Normal probability
distribution with mean µjk and standard deviation σjk.
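The partition in (4) can be checked numerically with a minimal sketch, given below under stated assumptions. It compares the direct route (draw each weight, then take the dot product) against the partitioned route (one dot product with the stored means, plus one dot product between the stored spreads and the inputs scaled by N(0,1) samples). The function names, input values and parameter values are hypothetical, and the neural transfer function is neglected as in the box.

```python
import random


def sampled_weight_output(x, mu, sigma, eps):
    """Direct route: form each weight mu + sigma * eps, then take the dot product."""
    return sum(xk * (m + s * e) for xk, m, s, e in zip(x, mu, sigma, eps))


def partitioned_output(x, mu, sigma, eps):
    """Partitioned route: two dot products against constant stored arrays."""
    # Inputs applied directly to the array holding the means.
    mean_term = sum(xk * m for xk, m in zip(x, mu))
    # Inputs first scaled by N(0,1) samples, then applied to the array holding sigma.
    spread_term = sum((xk * e) * s for xk, e, s in zip(x, eps, sigma))
    return mean_term + spread_term


rng = random.Random(0)
x = [0.2, 0.7, 0.1, 0.9]                 # hypothetical inputs
mu = [0.5, -0.3, 0.8, 0.1]               # hypothetical learnt means for neuron j
sigma = [0.05, 0.10, 0.02, 0.20]         # hypothetical learnt spreads for neuron j
eps = [rng.gauss(0.0, 1.0) for _ in x]   # one N(0,1) draw per synapse

direct = sampled_weight_output(x, mu, sigma, eps)
split = partitioned_output(x, mu, sigma, eps)
```

For the same random draws, `direct` and `split` agree up to floating-point rounding, which is why the stored parameters can stay constant while only the inputs are randomized.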
V. RESULTS AND DISCUSSION
A hybrid device-circuit-algorithm co-simulation framework
was developed to evaluate the performance of the proposed
All-Spin Bayesian hardware. The magnetization switching
characteristics of the mono-domain and multi-domain MTJs
were simulated in MuMax3, a GPU-accelerated micromagnetic
simulation framework [27]. A Non-Equilibrium Green’s Function
(NEGF) based transport simulation framework [28] was used
for modelling the MTJ resistance variation with oxide thickness
and applied voltage. The device characteristics obtained
from the MuMax3 and SPICE simulation tools were used in an
algorithm-level simulator, PyTorch, to evaluate the functionality
of the circuit. The performance of this design was tested for
a standard digit recognition problem on the MNIST dataset
[29]. A two-layer fully connected neural network was used,
with each hidden layer having 200 neurons. The probability
distributions were learnt using the ‘Bayes by Backprop’
algorithm [30]¹, which learns the optimal Gaussian distribution
by minimizing the KL divergence² from the true probability
distribution. The prior distribution on the weights used for
training was a scaled mixture of two Gaussian functions. The
network was trained offline to obtain the values of the mean
and standard deviation of the probability distributions of the
weights. Subsequently they were mapped to the conductances
of the DW-MTJ devices. The baseline idealized software
network achieved an accuracy of 98.63% on the
training set and 97.51% on the testing set (averaged over 10
sampled networks).
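The Monte-Carlo averaging of (3) can be sketched as follows for a single linear layer. This is a minimal illustration under assumptions: the function names and the parameter values are hypothetical, and whether the reported accuracies average network outputs or per-network accuracies is not stated in the text.

```python
import random


def sample_weights(mu, sigma, rng):
    """Draw one weight matrix W_i with W[j][k] ~ N(mu[j][k], sigma[j][k])."""
    return [[rng.gauss(m, s) for m, s in zip(mu_row, sigma_row)]
            for mu_row, sigma_row in zip(mu, sigma)]


def layer_output(x, weights):
    """f(x, W) for a single linear layer (transfer function neglected)."""
    return [sum(xk * w for xk, w in zip(x, row)) for row in weights]


def monte_carlo_output(x, mu, sigma, num_samples, rng):
    """Average f(x, W_i) over num_samples independent weight samples."""
    total = [0.0] * len(mu)
    for _ in range(num_samples):
        out = layer_output(x, sample_weights(mu, sigma, rng))
        total = [t + o for t, o in zip(total, out)]
    return [t / num_samples for t in total]


rng = random.Random(1)
x = [1.0, 0.5]                          # hypothetical input
mu = [[0.4, -0.2], [0.1, 0.3]]          # hypothetical means, one row per neuron
sigma = [[0.05, 0.05], [0.10, 0.10]]    # hypothetical standard deviations
y = monte_carlo_output(x, mu, sigma, num_samples=10, rng=rng)
```

With more samples the estimate approaches the output of the mean network for this linear case; a small sample count such as 10 trades estimation noise for fewer crossbar reads.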
The device parameters used in this work have been tabulated
in the previous section. A magnet with a barrier height of
20 k_BT was used in the Gaussian RNG unit. We considered 4-bit
representation in the DW-MTJ weights and 3-bit discretization in the
neuron output. Note that, as explained in the previous section,
our neuronal devices mimic a saturating linear functionality
and our network was trained with such a transfer function
itself. Considering a minimum sensing and programming
displacement of 20 nm for the DW location, we consider our
cross-point and neuronal devices to be 320 nm and 160 nm
in length, respectively. From our micromagnetic simulations, we
observe that the critical current required to switch the neuronal
device from one edge to the other is 4 µA for a time duration of
10 ns. The crossbar supply voltage was assumed to be 100 mV
for evaluating the crossbar power consumption. The crossbar
resistance ranges (which can be varied by the oxide thickness)
were designed to provide the critical current requirement for
the spin neurons. We considered 300% TMR in the DW-MTJ
conductances of the crossbar array. Considering such device-
level behavioral characteristics, non-idealities and constraints,
the test accuracy of the network was 96.98% (averaged over
10 samples). Further, non-ideal DW programming can also
impact the system accuracy. We performed 5 independent
Monte-Carlo runs of the network with 10% variation in
each of the programmed crossbar device conductances. The
average accuracy degradation was observed to be insignificant,
with a resulting accuracy of 96.74%.
¹ The related code can be found at
https://github.com/nitarshan/bayes-by-backprop.
² The Kullback-Leibler (KL) divergence is a measure of the difference
between two probability distributions. In this case, the KL divergence is
between the true posterior, P (W|D) and the approximated posterior q(W, θ).
It can be shown that minimization of this difference function can be achieved
by using the gradient descent method and iteratively updating the variational
parameters, µ and σ [30]. This is referred to as the ‘Bayes by Backprop’
algorithm.
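The device-level constraints described above (4-bit weights, a saturating linear neuron with 3-bit output, and 10% variation in programmed conductances) can be mimicked in software with a sketch like the one below. Several points are assumptions made for illustration only: the weight range of [-1, 1], the neuron output range of [0, 1], the Gaussian form of the variation with 10% relative standard deviation, and all function names. The mapping from signed weights to device conductances is not modelled.

```python
import random


def quantize(value, lo, hi, bits):
    """Snap value to the nearest of 2**bits uniformly spaced levels in [lo, hi]."""
    levels = 2 ** bits - 1
    clipped = min(max(value, lo), hi)
    step = (hi - lo) / levels
    return lo + round((clipped - lo) / step) * step


def saturating_linear(value, lo=0.0, hi=1.0):
    """Saturating linear transfer function: linear inside [lo, hi], clipped outside."""
    return min(max(value, lo), hi)


def perturb(weight, rel_sigma, rng):
    """Apply Gaussian programming variation with relative std rel_sigma (assumed form)."""
    return weight * (1.0 + rng.gauss(0.0, rel_sigma))


rng = random.Random(0)
weights = [0.93, -0.41, 0.07]                               # hypothetical trained weights
w_q = [quantize(w, -1.0, 1.0, bits=4) for w in weights]     # 4-bit weight levels
w_prog = [perturb(w, 0.10, rng) for w in w_q]               # non-ideal programming
x = [0.5, 0.25, 1.0]                                        # hypothetical inputs
pre = sum(xk * w for xk, w in zip(x, w_prog))
out = quantize(saturating_linear(pre), 0.0, 1.0, bits=3)    # 3-bit neuron output
```

Repeating the `perturb` step with different seeds and re-evaluating the test set would give the kind of independent Monte-Carlo runs described in the text.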
In order to estimate the system-level energy consumption,
we considered the core RNG and crossbar energy consumption
along with peripheral circuitry like ADC and DAC³. We
evaluate the energy consumption for a single image inference
and a particular network sample. The crossbar read latency was
assumed to be 10ns (for each column read). During each 10ns
column read, the power consumption for the DAC and the cor-
responding crossbar column was considered. Subsequently the
neuron device state was read and converted to a digital value
using an ADC. The neuron is reset before every operation. For
the RNG, DAC and ADC units, we considered 8-bit precision
and 3 variables were used for the accumulation process in the
Normal distribution sampling. We would like to mention here
that we assumed 8-bit precision for the energy calculations
in order to achieve a fair comparison with numbers reported
in Ref. [9] for an iso-network CMOS architecture. However,
from a functional viewpoint, a lower bit-precision of ∼4 bits was
observed to be sufficient. The total energy consumption of our
proposed “All-Spin” network was evaluated to be 790.2 nJ per
classification, which is 24× more energy efficient than the
baseline CMOS implementation [9]. The energy consumption
of the RNG unit, including peripherals for adding the random
numbers generated per row, was estimated to be 446.8 nJ.
Energy consumption of the crossbar array, including DAC,
ADC and multiplier peripherals, was 343.3 nJ. The
system-level energy efficiency stems from both the RNG design
and utilization of the “In-Memory” computing units.
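The reported energy figures can be cross-checked with a few lines of arithmetic. The two components add up to 790.1 nJ, which agrees with the reported 790.2 nJ total to within rounding, and correspond to roughly 57% (RNG) and 43% (crossbar) of the total. The baseline value computed below is only what the reported 24× ratio implies; it is a derived figure, not a number quoted from [9].

```python
# Values reported in the text, in nJ per classification.
E_TOTAL_NJ = 790.2
E_RNG_NJ = 446.8
E_CROSSBAR_NJ = 343.3
REPORTED_IMPROVEMENT = 24.0

component_sum = E_RNG_NJ + E_CROSSBAR_NJ
rng_share = E_RNG_NJ / E_TOTAL_NJ
crossbar_share = E_CROSSBAR_NJ / E_TOTAL_NJ

# Baseline energy implied by the reported ratio (derived, not taken from [9]).
implied_baseline_uj = REPORTED_IMPROVEMENT * E_TOTAL_NJ / 1000.0

print(f"sum of components: {component_sum:.1f} nJ")
print(f"RNG share: {rng_share:.1%}, crossbar share: {crossbar_share:.1%}")
print(f"implied CMOS baseline: {implied_baseline_uj:.2f} uJ")
```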
Note that resistive crossbars are usually characterized by
limited fan-in, much smaller than the neuron fan-in of typical
deep networks, because of non-idealities, parasitics and sneak
paths. Hence, mapping a practically sized network requires
mapping synapses of a neuron across multiple crossbars [31],
[32]. Such architectural level innovations can be easily inte-
grated with our current proposal.
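The idea of spreading one neuron's synapses across several crossbars can be sketched as a tiled dot product: each crossbar handles a chunk of the inputs and the partial sums are added afterwards. This is a generic illustration, not the scheme of [31] or [32]; the function name and the fan-in limit of 128 rows are assumptions.

```python
def tiled_dot_product(x, weights, max_fan_in):
    """Split one neuron's dot product into chunks of at most max_fan_in
    inputs (one chunk per crossbar) and add the partial sums.

    Returns the total and the number of crossbars used.
    """
    partial_sums = []
    for start in range(0, len(x), max_fan_in):
        stop = start + max_fan_in
        partial = sum(xk * w for xk, w in zip(x[start:stop], weights[start:stop]))
        partial_sums.append(partial)
    return sum(partial_sums), len(partial_sums)


# A 784-input neuron (MNIST-sized input) on hypothetical 128-row crossbars.
total, num_crossbars = tiled_dot_product([1.0] * 784, [1.0] * 784, max_fan_in=128)
```

In hardware the partial sums would need to be combined by additional peripheral circuitry, which adds to the energy and latency budget.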
VI. SUMMARY
In summary, we proposed the vision of an “All-Spin”
Bayesian neural processor that has the potential of enabling
orders of magnitude hardware efficiency (area, power, energy
consumption) in contrast to state-of-the-art CMOS implemen-
tations. Computing frameworks, so far, have mainly segregated
deterministic and stochastic computations. Standard determin-
istic deep learning frameworks enabled by spintronic devices
and other post-CMOS technologies have been explored. In
such scenarios, device-level non-idealities are usually treated
as a disadvantage. More recently, stochasticity inherent in such
devices (for instance, probabilistic switching in the presence of
thermal noise) has been exploited for computing to implement
stochastic versions of their deterministic counterparts
[33], [34]. Due to additional information encoding capacity in
the switching probability, such devices can be scaled down to
single-bit instead of multi-bit representations. Device
stochasticity has also been used in other unconventional computing
platforms like Ising computing, combinatorial optimization
problems, among others [35]. Note that magnetic devices have
previously been proposed for Bayesian inference engines [36], [37];
these are mainly used for implementing Bayes’ rule for simple
prediction tasks in directed acyclic graphs and do not have
relevance or overlap with Bayesian deep networks. Bayesian deep
learning is a unique computing framework that necessitates the
merger of both deterministic (dot-product evaluations of sampled
weights and inputs) and stochastic computations (sampling weights
from probability distributions), thereby requiring a significant
rethinking of the design space across the stack from devices to
circuits and algorithms.
³ The energy consumption figures for the peripheral circuitry were taken from
typical numbers considered in the literature [31], [32] and can be found at
https://github.com/Aayush-Ankit/puma-simulator/blob/training/include/constants.py.
REFERENCES
[1] Y. Gal, “Uncertainty in deep learning,” Ph.D. dissertation,
University of Cambridge, 2016.
[2] A. Sengupta and K. Roy, “Encoding neural and synaptic functionalities
in electron spin: A pathway to efficient neuromorphic computing,”
Applied Physics Reviews, vol. 4, no. 4, p. 041105, 2017.
[3] A. Sengupta, G. Srinivasan, D. Roy, and K. Roy, “Stochastic inference
and learning enabled by magnetic tunnel junctions,” in 2018 IEEE
International Electron Devices Meeting (IEDM). IEEE, 2018, pp. 1–4.
[4] M. Romera, P. Talatchian, S. Tsunegi, F. A. Araujo, V. Cros, P. Bor-
tolotti, J. Trastoy, K. Yakushiji, A. Fukushima, H. Kubota, S. Yuasa,
M. Ernoult, D. Vodenicarevic, T. Hirtzlin, N. Locatelli, D. Q. Querlioz,
and J. Grollier, “Vowel recognition with four coupled spin-torque nano-
oscillators,” Nature, vol. 563, no. 7730, p. 230, 2018.
[5] A. Sengupta, Y. Shim, and K. Roy, “Proposal for an all-spin artificial
neural network: Emulating neural and synaptic functionalities through
domain wall motion in ferromagnets,” IEEE transactions on biomedical
circuits and systems, vol. 10, no. 6, pp. 1152–1160, 2016.
[6] A. Sengupta, A. Ankit, and K. Roy, “Performance analysis and bench-
marking of all-spin spiking neural networks (special session paper),”
in 2017 International Joint Conference on Neural Networks (IJCNN).
IEEE, 2017, pp. 4557–4563.
[7] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and
P. Abbeel, “VIME: Variational information maximizing exploration,” in
Advances in Neural Information Processing Systems, 2016, pp. 1109–
1117.
[8] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, “An introduction
to MCMC for machine learning,” Machine learning, vol. 50, no. 1-2,
pp. 5–43, 2003.
[9] R. Cai, A. Ren, N. Liu, C. Ding, L. Wang, X. Qian, M. Pedram,
and Y. Wang, “VIBNN: Hardware acceleration of bayesian neural
networks,” in Proceedings of the Twenty-Third International Conference
on Architectural Support for Programming Languages and Operating
Systems. ACM, 2018, pp. 476–488.
[10] Z. Ghahramani and M. J. Beal, “Propagation algorithms for variational
Bayesian learning,” in Advances in neural information processing sys-
tems, 2001, pp. 507–513.
[11] A. Ankit, A. Sengupta, P. Panda, and K. Roy, “RESPARC: A recon-
figurable and energy-efficient architecture with memristive crossbars for
deep spiking neural networks,” in Proceedings of the 54th Annual Design
Automation Conference 2017. ACM, 2017, p. 27.
[12] W. Scholz, T. Schrefl, and J. Fidler, “Micromagnetic simulation of
thermally activated switching in fine particles,” Journal of Magnetism
and Magnetic Materials, vol. 233, no. 3, pp. 296–304, 2001.
[13] K. Yang, D. Fick, M. B. Henry, Y. Lee, D. Blaauw, and D. Sylvester,
“16.3 A 23Mb/s 23pJ/b fully synthesized true-random-number generator
in 28nm and 65nm CMOS,” in 2014 IEEE International Solid-State
Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014,
pp. 280–281.
[14] D. Vodenicarevic, N. Locatelli, A. Mizrahi, J. S. Friedman, A. F. Vincent,
M. Romera, A. Fukushima, K. Yakushiji, H. Kubota, S. Yuasa, S. Ti-
wari, J. Grollier, and D. Querlioz, “Low-energy truly random number
generation with superparamagnetic tunnel junctions for unconventional
computing,” Physical Review Applied, vol. 8, no. 5, p. 054045, 2017.
[15] J. Hirsch, “Spin Hall effect,” Physical Review Letters, vol. 83, no. 9, p.
1834, 1999.
[16] L. Liu, C.-F. Pai, Y. Li, H. Tseng, D. Ralph, and R. Buhrman, “Spin-
torque switching with the giant spin Hall effect of tantalum,” Science,
vol. 336, no. 6081, pp. 555–558, 2012.
[17] Y. Kim, X. Fong, and K. Roy, “Spin-orbit-torque-based spin-dice: A
true random-number generator,” IEEE Magnetics Letters, vol. 6, pp. 1–
4, 2015.
[18] A. Sengupta, Z. Al Azim, X. Fong, and K. Roy, “Spin-orbit torque
induced spike-timing dependent plasticity,” Applied Physics Letters, vol.
106, no. 9, p. 093704, 2015.
[19] D. Bhowmik, L. You, and S. Salahuddin, “Spin Hall effect clocking of
nanomagnetic logic without a magnetic field,” Nature nanotechnology,
vol. 9, no. 1, p. 59, 2014.
[20] J. C. Slonczewski, “Conductance and exchange coupling of two ferro-
magnets separated by a tunneling barrier,” Physical Review B, vol. 39,
no. 10, p. 6995, 1989.
[21] C.-F. Pai, L. Liu, Y. Li, H. Tseng, D. Ralph, and R. Buhrman, “Spin
transfer torque devices utilizing the giant spin Hall effect of tungsten,”
Applied Physics Letters, vol. 101, no. 12, p. 122404, 2012.
[22] S. Emori, E. Martinez, K.-J. Lee, H.-W. Lee, U. Bauer, S.-M. Ahn,
P. Agrawal, D. C. Bono, and G. S. Beach, “Spin Hall torque magne-
tometry of Dzyaloshinskii domain walls,” Physical Review B, vol. 90,
no. 18, p. 184427, 2014.
[23] S. Emori, U. Bauer, S.-M. Ahn, E. Martinez, and G. S. Beach, “Current-
driven dynamics of chiral ferromagnetic domain walls,” Nature materi-
als, vol. 12, no. 7, pp. 611–616, 2013.
[24] S. Lequeux, J. Sampaio, V. Cros, K. Yakushiji, A. Fukushima, R. Mat-
sumoto, H. Kubota, S. Yuasa, and J. Grollier, “A magnetic synapse:
Multilevel spin-torque memristor with perpendicular anisotropy,” Scien-
tific Reports, vol. 6, 2016.
[25] P. Wijesinghe, A. Ankit, A. Sengupta, and K. Roy, “An all-memristor
deep spiking neural computing system: A step toward realizing the
low-power stochastic brain,” IEEE Transactions on Emerging Topics in
Computational Intelligence, vol. 2, no. 5, pp. 345–358, 2018.
[26] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen,
Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning super-
computer,” in Proceedings of the 47th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE Computer Society, 2014, pp.
609–622.
[27] A. Vansteenkiste, J. Leliaert, M. Dvornik, M. Helsen, F. Garcia-Sanchez,
and B. Van Waeyenberge, “The design and verification of mumax3,” AIP
advances, vol. 4, no. 10, p. 107133, 2014.
[28] X. Fong, S. K. Gupta, N. N. Mojumder, S. H. Choday, C. Augustine,
and K. Roy, “KNACK: A hybrid spin-charge mixed-mode simulator for
evaluating different genres of spin-transfer torque MRAM bit-cells,” in
Simulation of Semiconductor Processes and Devices (SISPAD), 2011
International Conference on. IEEE, 2011, pp. 51–54.
[29] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.
[30] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight
uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015.
[31] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S.
Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy et al.,
“PUMA: A programmable ultra-efficient memristor-based accelerator
for machine learning inference,” in Proceedings of the Twenty-Fourth
International Conference on Architectural Support for Programming
Languages and Operating Systems. ACM, 2019, pp. 715–731.
[32] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Stra-
chan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26,
2016.
[33] A. Sengupta, M. Parsa, B. Han, and K. Roy, “Probabilistic deep spiking
neural systems enabled by magnetic tunnel junction,” IEEE Transactions
on Electron Devices, vol. 63, no. 7, pp. 2963–2970, 2016.
[34] G. Srinivasan, A. Sengupta, and K. Roy, “Magnetic tunnel junction based
long-term short-term stochastic synapse for a spiking neural network
with on-chip STDP learning,” Scientific reports, vol. 6, p. 29545, 2016.
[35] K. Roy, A. Sengupta, and Y. Shim, “Perspective: Stochastic magnetic
devices for cognitive computing,” Journal of Applied Physics, vol. 123,
no. 21, p. 210901, 2018.
[36] R. Faria, K. Y. Camsari, and S. Datta, “Implementing bayesian networks
with embedded stochastic MRAM,” AIP Advances, vol. 8, no. 4, p.
045101, 2018.
[37] Y. Shim, S. Chen, A. Sengupta, and K. Roy, “Stochastic spin-orbit torque
devices as elements for bayesian inference,” Scientific reports, vol. 7,
no. 1, p. 14101, 2017.
