SNRA: A Spintronic Neuromorphic Reconfigurable Array for In-Circuit
  Training and Evaluation of Deep Belief Networks by Zand, Ramtin & DeMara, Ronald F.
Ramtin Zand
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362
ramtinmz@knights.ucf.edu
Ronald F. DeMara
Department of Electrical and Computer Engineering
University of Central Florida
Orlando, FL 32816-2362
ronald.demara@ucf.edu
Abstract—In this paper, a spintronic neuromorphic reconfig-
urable Array (SNRA) is developed to fuse together power-efficient
probabilistic and in-field programmable deterministic computing
during both training and evaluation phases of restricted Boltz-
mann machines (RBMs). First, probabilistic spin logic devices are
used to develop an RBM realization which is adapted to construct
deep belief networks (DBNs) having one to three hidden layers
of size 10 to 800 neurons each. Second, we design a hardware
implementation for the contrastive divergence (CD) algorithm
using a four-state finite state machine capable of unsupervised
training in N+3 clocks, where N denotes the number of hidden-layer
neurons in each RBM. The functionality of our proposed CD hardware
implementation is validated using ModelSim simulations. We
synthesize the developed Verilog HDL implementation of our
proposed test/train control circuitry for various DBN topologies
where the maximal RBM dimensions yield resource utilization
ranging from 51 to 2,421 lookup tables (LUTs). Next, we leverage
spin Hall effect (SHE)-magnetic tunnel junction (MTJ) based
non-volatile LUT circuits as an alternative to static random
access memory (SRAM)-based LUTs storing the deterministic
logic configuration to form a reconfigurable fabric. Finally, we
compare the performance of our proposed SNRA with SRAM-
based configurable fabrics focusing on the area and power
consumption induced by the LUTs used to implement both
CD and evaluation modes. The results obtained indicate more
than 80% reduction in combined dynamic and static power
dissipation, while achieving at least 50% reduction in device
count.
I. INTRODUCTION
Within the post-Moore era ahead, several design factors
and fabrication constraints increasingly emphasize the require-
ments for in-circuit adaptation to as-built variations. These
include device scaling trends towards further reductions in
feature sizes [1], the narrow operational tolerances associated
with the deployment of hybrid Complementary Metal Oxide
Semiconductor (CMOS) and post-CMOS devices [2], [3], and
the noise sensitivity limits of analog-assisted neuromorphic
computing paradigms [4]. While many recent works have
advanced new architectural approaches for the evaluation
phase of neuromorphic computation utilizing emerging hard-
ware devices, there have been comparatively fewer works to
investigate the hardware-based realization of their training and
adaptation phases that will also be required to cope with these
conditions. Thus, this paper develops one of the first viable
approaches to address post-fabrication adaptation and retrain-
ing in-situ of resistive weighted-arrays in hardware, which are
ubiquitous in post-Moore neuromorphic approaches. Namely, a
tractable reconfiguration-based approach is developed that
leverages in-field configurability to mitigate the impact of
process variation. Reconfigurable fabrics are characterized
by their fabric flexibility, which allows realization of logic
elements at medium and fine granularities, as well as in-field
adaptability, which can be leveraged to realize variation toler-
ance and fault resiliency as widely-demonstrated for CMOS-
based approaches such as [5], [6]. Utilizing reconfigurable
computing by applying hardware and time redundancy to
the digital circuits offers promising and robust techniques
for addressing the above-mentioned reliability challenges. For
instance, it is shown in [6] that a successful refurbishment for
a circuit with 1,252 look-up tables (LUTs) can be achieved
with only 10% spare resources to accommodate both soft and
hard faults.
Within the post-Moore era, reconfigurable fabrics can also
be expected to continue their transition towards embracing the
benefits of increased heterogeneity along several cooperating
dimensions to facilitate neuromorphic computation [7]. Since
the inception of the first field-programmable devices, various
granularities of general-purpose reconfigurable logic blocks
and dedicated function-specific computational units have been
added to their structures. These have resulted in increased
computational functionality compared to homogeneous architectures.
In recent years, emerging technologies have been proposed
for use in reconfigurable fabrics to advance new
transformative opportunities for exploiting technology-specific
advantages. Technology heterogeneity recognizes the cooper-
ating advantages of CMOS devices for their rapid switching
capabilities, while simultaneously embracing emerging devices
for their non-volatility, near-zero standby power, high integra-
tion density, and radiation-hardness. For instance, spintronic-
based LUTs are proposed in [8], [9], [10] as the primary
building blocks in reconfigurable fabrics realizing significant
arXiv:1901.02415v1 [cs.ET] 8 Jan 2019
Fig. 1. (a) An RBM structure, (b) a 3×3 RBM implemented by a 4×4
crossbar architecture, (c) a DBN structure including multiple hidden layers.
area and energy consumption savings. In this paper, we
extend the transition toward heterogeneity along various logic
paradigms by proposing a heterogeneous technology fabric
realizing both probabilistic and deterministic computational
models. The cooperating advantages of each are leveraged to
address the deficiencies of the others during the neuromorphic
training and evaluation phases, respectively.
In this paper, we propose a spintronic neuromorphic recon-
figurable Array (SNRA) that uses probabilistic spin logic de-
vices to realize deep belief network (DBN) architectures while
leveraging deterministic computing paradigms to achieve in-circuit
training and evaluation. Most of the previous DBN
research has focused on software implementations, which
provide flexibility but require significant execution time and
energy due to large matrix multiplications that are relatively
inefficient when implemented on standard von Neumann architectures.
Previous hardware-based implementations of RBMs
have sought to overcome software limitations by using FPGAs
[11], [12], stochastic CMOS [13], and hybrid memristor-CMOS
designs [14]. Recently, Zand et al. [15] utilized a
spintronic device that leverages intrinsic thermal noise within
low energy barrier nanomagnets to provide a natural building
block for RBMs. While most of the aforementioned designs
only focus on the test operation, the work presented herein
concentrates on leveraging technology heterogeneity to implement
training and evaluation circuitry for DBNs with various
network topologies on our proposed SNRA fabric.
II. RESTRICTED BOLTZMANN MACHINES
Restricted Boltzmann machines (RBMs) are a class of
recurrent stochastic neural networks, in which each state of
the network, k, has an energy determined by the connection
weights between nodes and the node bias as described by (1),
where $s_i^k$ is the state of node $i$ in $k$, $b_i$ is the bias, or intrinsic
excitability, of node $i$, and $w_{ij}$ is the connection weight between
nodes $i$ and $j$ [16].
$$E(k) = -\sum_{i} s_i^k b_i - \sum_{i<j} s_i^k s_j^k w_{ij} \tag{1}$$
Each node in an RBM has a probability to be in state one
according to (2), where σ is the sigmoid function. RBMs,
when given sufficient time, reach a Boltzmann distribution
where the probability of the system being in state v is found
by (3), where u could be any possible state of the system.
Thus, the system is most likely to be found in states that have
the lowest associated energy.
$$P(s_i = 1) = \sigma\Big(b_i + \sum_{j} w_{ij} s_j\Big) \tag{2}$$
$$P(v) = \frac{e^{-E(v)}}{\sum_{u} e^{-E(u)}} \tag{3}$$
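As a minimal sketch of (1) and (3), the following Python fragment enumerates every state of a tiny network and normalizes the Boltzmann factors. The three-node bias and weight values are hypothetical and chosen only for illustration; exhaustive enumeration is practical only for very small networks.

```python
import itertools
import math

def energy(state, bias, weights):
    """Energy of one network state, following (1). weights[i][j] is used for i < j only."""
    n = len(state)
    e = -sum(state[i] * bias[i] for i in range(n))
    e -= sum(state[i] * state[j] * weights[i][j]
             for i in range(n) for j in range(i + 1, n))
    return e

def boltzmann_distribution(bias, weights):
    """Probability of every binary state, following (3)."""
    states = list(itertools.product([0, 1], repeat=len(bias)))
    unnormalized = [math.exp(-energy(s, bias, weights)) for s in states]
    z = sum(unnormalized)  # denominator of (3)
    return {s: u / z for s, u in zip(states, unnormalized)}

# Hypothetical three-node example (values are assumptions, not from the paper).
bias = [0.5, -0.2, 0.1]
weights = [[0.0, 1.0, -0.5],
           [0.0, 0.0, 0.8],
           [0.0, 0.0, 0.0]]

dist = boltzmann_distribution(bias, weights)
most_likely = max(dist, key=dist.get)
lowest_energy = min(dist, key=lambda s: energy(s, bias, weights))
```

In this example the most probable state coincides with the lowest-energy state, as the text states.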
Restricted Boltzmann machines (RBMs) are constrained to
two fully-connected non-recurrent layers called the visible
layer and the hidden layer. RBMs can be readily implemented
by a crossbar architecture, as shown in Fig. 1. The most
well-known approach for training RBMs is contrastive divergence
(CD), which is an approximate gradient descent procedure
using Gibbs sampling [17]. CD operates in four steps as
described below:
1. Feed-forward: the training input vector, v, is applied to
the visible layer, and the hidden layer, h, is sampled.
2. Feed-back: The sampled hidden layer output is fed-back
and the generated input is sampled, v′.
3. Reconstruct: v′ is applied to the visible layer and the
reconstructed hidden layer is sampled to obtain h′.
4. Update: The weights are updated according to (4), where
η is the learning rate and W is the weight matrix.
$$\Delta W = \eta\,(v h^{T} - v' h'^{T}) \tag{4}$$
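The four CD steps can be sketched in software as follows. This is an illustrative model only, not the hardware described in this paper: the function names, the zero-initialized weights, the zero biases, and the learning rate of 0.1 are assumptions, and the biases are left unchanged because (4) only specifies the weight update.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, b_h, rng):
    """Sample h: node j turns on with probability sigmoid(b_j + sum_i v_i * W[i][j])."""
    return [int(rng.random() < sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v)))))
            for j in range(len(b_h))]

def sample_visible(h, W, b_v, rng):
    """Sample v: node i turns on with probability sigmoid(b_i + sum_j h_j * W[i][j])."""
    return [int(rng.random() < sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h)))))
            for i in range(len(b_v))]

def cd_step(v, W, b_v, b_h, eta, rng):
    """One contrastive divergence iteration; returns the new weights and the samples."""
    h = sample_hidden(v, W, b_h, rng)            # 1. feed-forward
    v_rec = sample_visible(h, W, b_v, rng)       # 2. feed-back
    h_rec = sample_hidden(v_rec, W, b_h, rng)    # 3. reconstruct
    delta = [[eta * (v[i] * h[j] - v_rec[i] * h_rec[j])
              for j in range(len(h))] for i in range(len(v))]  # 4. update, following (4)
    W_new = [[W[i][j] + delta[i][j] for j in range(len(h))] for i in range(len(v))]
    return W_new, delta, (h, v_rec, h_rec)

# Hypothetical 4x2 RBM with zero initial weights and biases.
rng = random.Random(0)
v = [1, 0, 1, 0]
W = [[0.0, 0.0] for _ in v]
W_new, delta, (h, v_rec, h_rec) = cd_step(v, W, [0.0] * 4, [0.0] * 2, eta=0.1, rng=rng)
```

Because all samples are binary, every entry of the weight change is +η, 0, or −η, which is what allows a hardware realization with simple Boolean gates.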
RBMs can be readily stacked to realize a DBN, which
can be trained similarly to RBMs. Training a DBN involves
performing CD on the visible layer and the first hidden layer
for as many steps as desired, then fixing those weights and
moving up a hierarchy as follows. The first hidden layer is now
viewed as a visible layer, while the second hidden layer acts
as a hidden layer with respect to the CD procedure identified
above. Next, another set of CD steps is performed, and then
the process is repeated for each additional layer of the DBN.
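The layer-by-layer procedure above can be sketched as a short greedy training loop. This is a simplified software illustration under several assumptions that are not taken from the paper: biases are omitted, each RBM uses a single CD step per sample, and the data set, layer sizes, learning rate, and epoch count are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(inputs, weight_vectors, rng):
    """Sample one binary node per weight vector (biases omitted for brevity)."""
    return [int(rng.random() < sigmoid(sum(x * w for x, w in zip(inputs, vec))))
            for vec in weight_vectors]

def train_rbm(data, n_hidden, eta, epochs, rng):
    """Train one RBM with single-step CD; returns weights indexed as W[i][j]."""
    n_visible = len(data[0])
    W = [[0.0] * n_hidden for _ in range(n_visible)]
    for _ in range(epochs):
        for v in data:
            cols = list(zip(*W))                 # one weight column per hidden node
            h = sample(v, cols, rng)             # feed-forward
            v_rec = sample(h, W, rng)            # feed-back (one weight row per visible node)
            h_rec = sample(v_rec, cols, rng)     # reconstruct
            for i in range(n_visible):           # update, following (4)
                for j in range(n_hidden):
                    W[i][j] += eta * (v[i] * h[j] - v_rec[i] * h_rec[j])
    return W

def train_dbn(data, hidden_sizes, eta=0.1, epochs=5, seed=0):
    """Greedy layer-wise training: train an RBM, fix it, and feed its samples upward."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W = train_rbm(layer_input, n_hidden, eta, epochs, rng)
        layers.append(W)                          # these weights stay fixed from here on
        cols = list(zip(*W))
        layer_input = [sample(v, cols, rng) for v in layer_input]
    return layers

# Hypothetical data set: four 4-bit training vectors, hidden layers of 3 and 2 nodes.
data = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers = train_dbn(data, hidden_sizes=[3, 2])
```

Only one RBM is trained at a time in this loop, which mirrors the observation made later in the paper that training resources can be shared between layers.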
III. PROPOSED RBM STRUCTURE
A feasible hardware implementation of a 4×2 RBM structure
is shown in Fig. 2(a), in which three-terminal spin Hall
effect (SHE)-driven domain wall motion (DWM) devices [19]
are used as weights and biases, while probabilistic spin logic
devices (p-bits) are utilized to produce a probabilistic output
voltage that has a sigmoid relation with the input currents
of the devices, as shown in Fig. 2(b) and Fig. 2(c), respectively.
The p-bit device consists of a SHE-driven magnetic
tunnel junction (MTJ) with a circular near-zero energy barrier
nanomagnet, which provides a natural sigmoidal activation
function required for DBNs as studied in [18], [20], [21], [22].
Transmission gates (TGs) are used within the bit cell of the
weighted connections to adjust the weights by changing the
Fig. 2. (a) A 4×2 RBM hardware implementation, (b) SHE-DWM based
weighted connections, and (c) p-bit based probabilistic neuron [18].
TABLE I
REQUIRED SIGNALING TO CONTROL THE RBM OPERATION PHASES.
Operation Phase                              WWL   RWL   BL       SL
Feed-Forward / Test, Feed-Back, Reconstruct  GND   VDD   Hi-Z     Hi-Z
Update: Increase Weight                      VDD   GND   Vtrain   GND
Update: Decrease Weight                      VDD   GND   GND      Vtrain
domain wall (DW) position in SHE-DWM devices, as well
as controlling the RBM operation phases. TGs can provide an
energy-efficient and symmetric switching behavior [23], which
is specifically desired during the training operation.
Table I lists the required signaling to control the RBM’s
training and test operations. During the feed-forward, feed-
back, and reconstruct operations, write word line (WWL)
is connected to ground (GND) and the bit line (BL) and
source line (SL) are both in high impedance (Hi-Z) state
disconnecting the write path. The read word line (RWL) is
connected to VDD, which turns ON the read TGs in the
weighted connection bit cell shown in Fig. 2(b). The voltage
applied by the input neuron generates a current through TG1
and TG2, which is then injected to the output neuron and
modulates the output probability of the p-bit device. The
Fig. 3. FSM designed to control the train and test operations in a DBN.
amplitude of the current depends on the resistance of the
weighted connection which is defined by the position of the
DW in the SHE-DWM device.
During the update phase, the RWL is connected to GND,
which turns off TG1 and TG2 and disconnects the read path.
Meanwhile, the WWL is set to VDD, which activates the write
path. Resistance of the weighted connections can be adjusted
by the BL and SL signals, as listed in Table I. The amplitude
of the training voltage (Vtrain) connected to BL and SL should
be designed in a manner such that it can provide the desired
learning rate, η, to the training circuit. For instance, a high
amplitude Vtrain results in a significant change in the DW
position in each training iteration, which effectively reduces
the number of different resistive states that can be realized
by the SHE-DWM device. On the other hand, a higher SHE-
DWM resistance leads to a smaller current injected to the p-
bit device. Thus, the input signal connected to the weighted
connection with higher resistance will have lower impact on
the output probability of the p-bit device, representing a lower
weight for the corresponding connection between the input
and output neurons.
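The signaling of Table I can be summarized as a small lookup function. This is only a behavioral sketch: the function name, the phase strings, and the `delta_sign` argument are hypothetical, and the case with no weight change (BL and SL both at GND) is an assumption inferred from the worked example in Section IV rather than a row of Table I.

```python
def control_signals(phase, delta_sign=0):
    """Return (WWL, RWL, BL, SL) for one weighted connection, following Table I.

    phase: "test", "feed_forward", "feed_back", "reconstruct", or "update".
    delta_sign: sign of the desired weight change, used only in the update phase.
    """
    if phase in ("test", "feed_forward", "feed_back", "reconstruct"):
        # Read path enabled, write path disconnected.
        return ("GND", "VDD", "Hi-Z", "Hi-Z")
    if phase == "update":
        if delta_sign > 0:
            return ("VDD", "GND", "Vtrain", "GND")   # increase weight
        if delta_sign < 0:
            return ("VDD", "GND", "GND", "Vtrain")   # decrease weight
        # Assumption: no training voltage on either line leaves the weight unchanged.
        return ("VDD", "GND", "GND", "GND")
    raise ValueError(f"unknown phase: {phase}")

read_signals = control_signals("feed_forward")
increase_signals = control_signals("update", delta_sign=+1)
decrease_signals = control_signals("update", delta_sign=-1)
```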
IV. PROPOSED HARDWARE IMPLEMENTATION OF
CONTRASTIVE DIVERGENCE ALGORITHM
To implement the contrastive divergence (CD) algorithm
required for training the weights in an RBM structure, we have
designed a four-state finite state machine (FSM) as shown in
Fig. 3. The proposed FSM is in the feed-forward state during
the test operation. When the training begins, the input of the
visible layer and the corresponding output of the hidden layer
will be stored in the v and h registers, respectively. The sizes
of the v and h registers depend on the number of neurons
in the visible and hidden layers. For instance, in the sample
4×2 RBM shown in Fig. 2 the sizes of the v and h registers
are 4 bits and 2 bits, respectively. In the feed-back state, the
sampled hidden layer is fed back to the RBM array and the
corresponding output of the visible layer is stored in the v_bar
register. Next, the stored values in v_bar are applied to the
RBM to reconstruct the hidden layer, and the obtained output
of the hidden layer is stored in the h_bar register. Finally,
in the update state, the data stored in the v, h, v_bar, and h_bar
registers are used to provide the required BL and SL signals
to adjust the weights according to (4).
Fig. 4. The hardware realization for the update state in the FSM developed to train a 4×2 RBM, (a) first clock cycle, and (b) second clock cycle.
Figure 4 depicts the schematic of the hardware designed for
the update state of the FSM developed for a 4×2 RBM. In
each clock cycle, the designed circuit adjusts the weights in a
single column of the RBM shown in Fig. 2. Thus, the number
of clock cycles required to complete the update state depends
on the number of neurons in the hidden layer of the RBM.
A counter register is used in the design to ensure that all of
the columns in the RBM are updated. The counter value starts
from zero and will be incremented in each clock cycle until
it reaches the hn value, which is the total number of nodes
in the hidden layer. Once the counter reaches hn, the update
state is completed and the FSM goes to the feed-forward state.
The logical AND gates are used to implement the vh^T and
v′h′^T expressions required to find ∆W for the weights in each
column. The outputs of the Boolean gates implementing vh^T and
v′h′^T are stored in the BL_reg and SL_reg registers, respectively,
which provide the required signaling for adjusting the weights
according to Table I.
Herein, to better understand the functionality of the hard-
ware developed for the update state, we have used an example
with the v, h, v′, and h′ matrices having the hypothetical
values mentioned below:
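$$v = \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad h = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad v' = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad h' = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$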
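Hence, ∆W can be calculated using (4) as shown below:
$$\Delta W = \eta\,(v h^{T} - v' h'^{T}) = \eta \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 1 & -1 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} \delta w_{00} & \delta w_{01} \\ \delta w_{10} & \delta w_{11} \\ \delta w_{20} & \delta w_{21} \\ \delta w_{30} & \delta w_{31} \end{bmatrix}$$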
According to the obtained ∆W, w21 should be decreased
while w00 and w20 should be increased, and the remaining
weight values remain unchanged. The hardware realization
Fig. 5. The output signals generated by the proposed FSM. The clock fre-
quency is 500MHz, which can be modified based on the design requirements.
of the mentioned example is shown in Fig. 4, in which
the values stored in the registers are v=4’b0101, h=2’b01,
v_bar=4’b0100, and h_bar=2’b10. It is worth noting that the
v0 element in the v matrix is stored in the least significant
bit of the v register, while v3 is stored in the most significant
bit. The other matrices are stored in their corresponding registers
in a similar manner. In this example, the RBM has two output
neurons; therefore, hn is equal to two and the update operation
can be completed in two clock cycles. In the first cycle shown
in Fig. 4(a), the counter is equal to zero and the first bits of
Fig. 6. (a) The schematic of the hardware designed to control the testing and training operations of a 4×2 RBM implemented on a Xilinx Kintex-7 FPGA
family, (b) the structure of a 6-input SHE-MTJ based fracturable LUT used as the building block of the proposed SNRA architecture.
the h and h_bar registers are selected by the multiplexers to be
used as the inputs of the AND gates. Therefore, the following BL
and SL signals are generated:
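$$BL = \begin{bmatrix} BL_0 \\ BL_1 \\ BL_2 \\ BL_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad SL = \begin{bmatrix} SL_0 \\ SL_1 \\ SL_2 \\ SL_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$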
As listed in Table I, the above BL and SL signals will increase
w00 and w20 weights shown in Fig. 2, if the WWL0 and
WWL1 signals are “1” and “0”, respectively. Similarly, in the
second clock cycle, the counter is equal to one and the second
bits of the h and h_bar registers are used to produce the following BL
and SL signals:
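$$BL = \begin{bmatrix} BL_0 \\ BL_1 \\ BL_2 \\ BL_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad SL = \begin{bmatrix} SL_0 \\ SL_1 \\ SL_2 \\ SL_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$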
This results in a decrease in the w21 weight, while the
other weights remain unchanged. Thus, the proposed hardware
provides the desired functionality required for the update state
according to (4).
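The column-by-column generation of the BL and SL signals can be reproduced with a few bitwise operations. The sketch below is a behavioral model only; the function name and the integer register encoding (element 0 in the least significant bit, as described above) are illustrative assumptions and not the authors' Verilog.

```python
def update_state_signals(v, h, v_bar, h_bar, n_visible, n_hidden):
    """Return one (counter, BL, SL) tuple per clock cycle of the update state.

    Registers are integers with element 0 in the least significant bit.
    In each cycle, the counter selects one bit of h and h_bar, which is
    ANDed with every bit of v and v_bar, respectively.
    """
    mask = (1 << n_visible) - 1
    cycles = []
    for counter in range(n_hidden):
        h_bit = (h >> counter) & 1
        h_bar_bit = (h_bar >> counter) & 1
        bl = v & (mask if h_bit else 0)          # column `counter` of v h^T
        sl = v_bar & (mask if h_bar_bit else 0)  # column `counter` of v' h'^T
        cycles.append((counter, bl, sl))
    return cycles

# Register values from the worked example for the 4x2 RBM.
cycles = update_state_signals(v=0b0101, h=0b01, v_bar=0b0100, h_bar=0b10,
                              n_visible=4, n_hidden=2)
for counter, bl, sl in cycles:
    # Bits are printed as BL3..BL0 and SL3..SL0.
    print(f"cycle {counter}: BL={bl:04b} SL={sl:04b}")
```

For the example values, the first cycle sets BL0 and BL2 (raising w00 and w20) and the second cycle sets SL2 (lowering w21), in agreement with the ∆W computed from (4).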
Herein, we have used the Verilog hardware description
language (HDL) to implement our proposed four-state FSM.
The ModelSim simulator is used to simulate the developed
register-transfer level (RTL) Verilog code. Figure 5 shows the
obtained waveforms required for training a 4×2 RBM array
with the hypothetical register values mentioned above. The
results show that the desired BL, SL, RWL, and WWL control
signals are generated in five clock cycles, which verifies the
functionality of our proposed FSM.
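The cycle count can be checked with a simple behavioral trace of the FSM: three single-cycle states followed by one update cycle per hidden-layer column. The sketch below is a hypothetical model, not the authors' RTL; in particular, the one-hot WWL pattern per update cycle is an assumption extrapolated from the first-cycle WWL0/WWL1 values given in the worked example.

```python
def fsm_trace(n_hidden):
    """Return one (state, RWL, WWL) tuple per clock cycle of a training pass.

    RWL is 1 while the read path is active; WWL holds one bit per RBM column.
    """
    trace = []
    for state in ("feed_forward", "feed_back", "reconstruct"):
        trace.append((state, 1, [0] * n_hidden))      # read path on, write path off
    for counter in range(n_hidden):
        # Assumption: only the column selected by the counter is written.
        wwl = [1 if column == counter else 0 for column in range(n_hidden)]
        trace.append(("update", 0, wwl))
    return trace

trace_4x2 = fsm_trace(n_hidden=2)
cycles_4x2 = len(trace_4x2)              # 2 + 3 cycles for the 4x2 example
cycles_784x500 = len(fsm_trace(500))     # 500 + 3 cycles
```

For the 4×2 RBM the trace has five entries, matching the five clock cycles observed in the simulation.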
To obtain the hardware resources required for our proposed
DBN control circuitry, we have synthesized and implemented
it using Xilinx ISE Design Suite 14.7. The schematic of
the hardware developed to control the testing and training
operations for a 4×2 RBM is shown in Fig. 6(a), in which 32
six-input fracturable look-up table (LUT) and Flip Flop (FF)
pairs are used to implement both sequential and combinational
logic. It is worth noting that out of the 32 LUT-FF pairs
only three of them are utilized for the test operation, thus
roughly 90% of the circuit can be power-gated during the test
operation. However, in conventional homogeneous-technology
FPGAs, volatile static random access memory (SRAM) cells
are employed in LUTs to store the logic function configuration
data. Therefore, power-gating the SRAM-based LUTs erases the
configuration data, and the FPGA must be re-programmed.
In addition to volatility, SRAM cells
also suffer from high static power and low logic density
[24]. Hence, emerging memory technologies have
been attracting considerable attention in recent years as
alternatives to SRAM cells.
V. THE PROPOSED SNRA ARCHITECTURE
Herein, we propose a heterogeneous-technology spintronic
neuromorphic reconfigurable array (SNRA), which can com-
bine both deterministic and probabilistic logic paradigms.
The SNRA fabric is organized into islands of probabilistic
modules surrounded by Boolean configurable logic blocks
(CLBs). Both the probabilistic and deterministic elements are
field programmable using a configuration bit-stream based on
conventional FPGA programming paradigms.
Herein, the probabilistic modules consist of RBMs, which
can be connected hierarchically within the field-programmable
fabric to form various topologies of DBNs. Each RBM leverages
SHE-MTJs with unstable nanomagnets (∆ ≪ 40kT) to
generate the probabilistic sigmoidal activation function of the
neurons. With respect to the deterministic logic, the CLBs are
comprised of LUTs which realize the training and evaluation
circuitry. Non-volatile high energy barrier (∆ ≥ 40kT ) SHE-
MTJ devices are used as an alternative for SRAM cells within
LUT circuits. The routing networks include routing tracks,
as well as switch and connection blocks similar to those of
conventional FPGAs. The feasibility of integrating MTJ
and CMOS technologies in an FPGA chip was verified
in 2015 by researchers at Tohoku University [9], who
fabricated a nonvolatile FPGA with 3,000 6-input MTJ-based
LUTs under 90nm CMOS and 75nm MTJ technologies. The
measurement of fabricated devices under representative appli-
cations exhibited significant improvements in terms of power
consumption and area. Despite the mentioned improvements,
the conventional spin transfer torque (STT)-based MTJ de-
vices suffer from high switching energy and reliability issues.
Thus, we propose using SHE-MTJ based LUT circuits with
reduced switching energy and increased reliability of tunneling
oxide barrier [25]. Readers are referred to [26] for additional
information regarding the STT-MTJ and SHE-MTJ devices.
Figure 6(b) shows the structure of a six-input SHE-MTJ
based fracturable LUT [27], which can implement a six-input
Boolean function or two five-input Boolean functions with
common inputs. In general, a LUT is a memory with 2^m cells in
which the truth table of an m-input Boolean function is stored.
The logic function configuration data is stored in SHE-MTJs
in the form of different resistive levels determined by the
magnetization configurations of the ferromagnetic layers in the MTJs,
i.e., the parallel configuration results in a lower resistance standing
for logic “0” and vice versa. The LUT inputs can be considered
as the address according to which the corresponding output of the
Boolean function is returned through the select tree. The
LUT circuit shown in Fig. 6(b) includes two pre-charge sense
amplifiers (PCSAs) that are used to read the logic state of the
SHE-MTJs. The PCSA compares the stored resistive value
of the SHE-MTJ cells in the LUT circuit with a reference
MTJ cell whose resistance is designed to lie between the low and
high resistances of the LUT’s SHE-MTJ cells. Therefore, if
the resistive value of a SHE-MTJ cell in the LUT circuit is
greater than the resistance of the reference cell, the output of
the PCSA will be “1” and vice versa. The readers are referred
to [27] for additional information regarding the functionality
of a SHE-MTJ based LUT circuit.
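The view of a LUT as an addressable truth-table memory can be illustrated with a small software model. The class below is purely behavioral and ignores the SHE-MTJ cells, the select tree, and the sense amplifiers; the class name and the choice of storing the two five-input functions in the lower and upper 32 cells are assumptions made for illustration.

```python
def truth_table(fn, m):
    """Evaluate fn on all 2**m input combinations; cell k holds fn of the bits of k."""
    return [int(fn(*[(k >> b) & 1 for b in range(m)])) for k in range(2 ** m)]

class FracturableLUT6:
    """Behavioral model of a six-input LUT: 2**6 = 64 one-bit configuration cells."""

    def __init__(self, cells):
        if len(cells) != 64:
            raise ValueError("a six-input LUT stores 64 cells")
        self.cells = list(cells)

    @staticmethod
    def _address(inputs):
        # The LUT inputs form the address; inputs[0] is the least significant bit.
        return sum(bit << k for k, bit in enumerate(inputs))

    def read6(self, inputs):
        """One six-input Boolean function."""
        return self.cells[self._address(inputs)]

    def read5x2(self, inputs):
        """Two five-input functions sharing the same five inputs (assumed layout)."""
        address = self._address(inputs)
        return self.cells[address], self.cells[32 + address]

# Six-input AND stored as a single function.
and6 = FracturableLUT6(truth_table(lambda *x: all(x), 6))

# Two five-input functions with common inputs: parity and OR.
parity5 = truth_table(lambda *x: sum(x) % 2, 5)
or5 = truth_table(lambda *x: any(x), 5)
dual = FracturableLUT6(parity5 + or5)
```

For instance, `dual.read5x2((1, 1, 0, 0, 0))` returns the parity and the OR of the same five inputs in a single lookup.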
VI. RESULTS AND DISCUSSIONS
Herein, we have modified a MATLAB implementation of
DBN developed in [28] and utilized MNIST data set [29] to
calculate the error rate and evaluate the performance of our
DBN architecture. The simplest model of the belief network
that can be used for MNIST digit recognition includes a single
RBM with 784 nodes in the visible layer to handle 28×28 pixels
of the input images, and 10 nodes in the hidden layer representing
the output classes. Herein, we have examined the error rate for
Fig. 7. Error rate vs. training samples for various DBN topologies [15].
five different network topologies using 1,000 test samples as
shown in Fig. 7. As expected, increasing the number of
hidden layers, nodes, and training images improves the
performance of the DBN; however, these improvements are
realized at the cost of higher area and power dissipation.
To compare the resource utilization between the five net-
work topologies investigated in this paper, we have used Xilinx
ISE Design Suite 14.7 to implement their control circuitry
based on the FSM design proposed in Section IV. The obtained
logic resource utilization for each of the mentioned DBN
topologies is listed in Table II. Since the training operation in
different layers of the DBN does not happen simultaneously,
the resources can be shared for training each RBM. Therefore,
the amount of logic resources utilized to implement the FSM
of a DBN relies on the size of the largest RBM in the
network. For instance, as listed in Table II, the resource
utilization for training a 784×500×10 DBN is equal to that of
a 784×500×500×10 DBN, since the size of the largest RBM
in both networks is 784×500.
To provide a fair power consumption comparison between
the investigated DBN topologies, we have simulated an
SRAM-based six-input fracturable LUT-FF pair in SPICE
circuit simulator using 45nm CMOS library with 1V nominal
voltage. The obtained static and dynamic power dissipation
are listed in Table III. Herein, we have only focused on the
power dissipated by the LUT-FF pairs, and used the below
relation to measure the power consumption for each topology:
$$P_{total} = \sum_{i} \left( A_i P_{read} + I_i P_{standby} \right) \tag{5}$$
where Ai and Ii are the number of active and idle LUT-FF
pairs in RBM i of the DBN, respectively. The obtained power
dissipation values for various DBN topologies are listed in the
last column of Table II. The provided trade-offs between the
error rate and power consumption can be leveraged to design
a desired DBN based on the application requirements.
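As a worked instance of (5), the sketch below evaluates the sum for SRAM-based LUT-FF pairs using the read and static power values of Table III. The single-RBM case uses the 51 fully-used LUT-FF pairs reported for the 784×10 topology; the active count assumed for the second RBM of the two-layer case is hypothetical and is included only to show how idle pairs enter the sum.

```python
# SRAM-based LUT power values from Table III, in microwatts.
P_READ_UW = 6.28
P_STANDBY_UW = 1.6   # static power is used as the standby power of an idle pair

def total_power_uw(rbms):
    """Evaluate (5). rbms holds one (active, idle) LUT-FF pair count per RBM."""
    return sum(active * P_READ_UW + idle * P_STANDBY_UW for active, idle in rbms)

# 784x10 topology: 51 active pairs, no idle pairs.
single_rbm_uw = total_power_uw([(51, 0)])     # 51 * 6.28 uW, about 0.32 mW

# Two-RBM case sharing 1,771 pairs. The active count of the second RBM
# is a hypothetical value, not a figure reported in the paper.
HYPOTHETICAL_ACTIVE_SECOND_RBM = 60
two_rbm_uw = total_power_uw([
    (1771, 0),
    (HYPOTHETICAL_ACTIVE_SECOND_RBM, 1771 - HYPOTHETICAL_ACTIVE_SECOND_RBM),
])
```

The single-RBM result of about 0.32 mW agrees with the first row of Table II.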
To investigate the effect of technology heterogeneity on
the performance of the proposed DBN control circuitry, we
TABLE II
FSM LOGIC RESOURCE UTILIZATION AND POWER DISSIPATION FOR
VARIOUS DBN TOPOLOGIES.
Topology          Slice Registers   Slice LUTs   Fully-used LUT-FFs   Power Consumption
784×10            3185              123          51                   0.32 mW
784×500×10        4655              3545         1771                 14.2 mW
784×800×10        5533              2449         2421                 19.3 mW
784×500×500×10    4655              3545         1771                 25.3 mW
784×800×800×10    5617              2449         2421                 34.5 mW
TABLE III
PERFORMANCE COMPARISON BETWEEN SIX-INPUT FRACTURABLE
SRAM-BASED LUT AND SHE-MTJ BASED LUT.
Features                 SRAM-LUT     SHE-MTJ LUT
Device Count   MOS       1163         565
               MTJ       -            66
Power (µW)     Read      6.28         1.1
               Write     28           188
               Static    1.6          0.21
Delay          Read      < 10 ps      < 30 ps
               Write     < 0.1 ns     < 2 ns
Energy         Read      ∼ 62.8 aJ    ∼ 33 aJ
               Write     ∼ 2.8 fJ     ∼ 376 fJ
Fig. 8. Power dissipation of developed FSM for various DBN topologies.
have simulated a SHE-MTJ based six-input fracturable LUT
in SPICE using 45nm CMOS and 60nm MTJ technologies.
The modeling approach proposed in [27], [30] is leveraged to
model the behavior of SHE-MTJ devices. In particular, first, a
Verilog-A model of the device is developed and used in SPICE
to obtain the write current, as well as the power dissipation of
the read/write operations. Next, the write current is used in a
descriptive MATLAB model of a SHE-MTJ device to extract
the corresponding write delay. The simulation results obtained
for a SHE-MTJ based six-input fracturable LUT circuit are
listed in Table III.
Three types of power consumption profiles can be identified
in FPGA LUTs. First, during the configuration phase, the LUTs
must be initialized and thus written. This incurs an initial write
energy consumption, which recurs only infrequently thereafter.
Second, once configured, the LUTs comprising active logic
paths consume read power; these typically occupy only certain
sub-areas of the high gate-equivalent capacity of an FPGA chip.
Third, the remainder of the LUTs, which can be a large number,
may be inactive and consume standby power. SRAM-based
FPGAs are challenged by the difficulty of power-gating LUTs,
which must retain the stored configuration, whereas a SHE-MTJ
based LUT can be readily power-gated and incurs near-zero
standby energy due to its non-volatility. In
addition, replacing SRAM cells with SHE-MTJ devices
results in a considerable reduction in the transistor count
of the LUT circuit since each SRAM cell includes 6 MOS
transistors in its structure, while SHE-MTJ devices can be
fabricated on top of the MOS circuitry incurring very low area
overhead. In particular, the SHE-MTJ based LUT circuit achieves
at least 51% reduction in MOS transistor count compared to
the conventional SRAM-based LUT, as listed in Table III.
Transistors with minimum feature size are utilized in the SHE-
MTJ based LUT circuit to control the SHE-MTJ write and read
operations. Thus, the device count results can provide a fair
comparison between SHE-MTJ based LUTs and conventional
SRAM-based LUTs in terms of area consumption, since all of
the MOS transistors used in both designs have the minimum
feature size possible by the 45nm CMOS technology.
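The derived entries of Table III can be cross-checked with simple arithmetic: the MOS transistor reduction follows from the two device counts, and each energy entry is the product of the corresponding power and delay. The sketch below assumes the delay bounds listed in Table III (for example, "< 10 ps") are used as the delay values, which is why the resulting energies are approximate.

```python
# Values taken from Table III.
MOS_SRAM, MOS_SHE = 1163, 565
POWER_UW = {("SRAM", "read"): 6.28, ("SRAM", "write"): 28.0,
            ("SHE-MTJ", "read"): 1.1, ("SHE-MTJ", "write"): 188.0}
DELAY_S = {("SRAM", "read"): 10e-12, ("SRAM", "write"): 0.1e-9,
           ("SHE-MTJ", "read"): 30e-12, ("SHE-MTJ", "write"): 2e-9}

# Fractional reduction in MOS transistor count, roughly 0.514.
mos_reduction = (MOS_SRAM - MOS_SHE) / MOS_SRAM

def energy_joules(tech, operation):
    """Energy per operation = power (W) x delay (s), using the delay upper bound."""
    return POWER_UW[(tech, operation)] * 1e-6 * DELAY_S[(tech, operation)]

energies = {key: energy_joules(*key) for key in POWER_UW}
read_sram_aj = energies[("SRAM", "read")] / 1e-18       # attojoules
read_she_aj = energies[("SHE-MTJ", "read")] / 1e-18     # attojoules
write_sram_fj = energies[("SRAM", "write")] / 1e-15     # femtojoules
write_she_fj = energies[("SHE-MTJ", "write")] / 1e-15   # femtojoules
```

The results reproduce the 51% device-count reduction and the read and write energies listed in Table III.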
Figure 8 provides a comparison between the conventional
SRAM-based FPGA and the proposed SNRA with a focus
on the power dissipation induced by LUT-FF pairs utilized
to implement the developed DBN control circuitry. The com-
bined improvements in the read and standby modes of the
proposed SNRA resulted in realizing at least 80% reduction
in power consumption compared to the conventional CMOS-
based reconfigurable fabrics for various DBN topologies. The
results obtained for the read operation are comparable to those
of the STT-MTJ based FPGA proposed by Suzuki et al.
[9]. However, the utilization of SHE-MTJ based LUTs within
the SNRA architecture instead of STT-MTJs can result in at
least 20% reduction in configuration energy, as demonstrated
by the authors in [27].
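One way to see why the reduction stays above 80% regardless of topology is to apply (5) to both technologies with the per-LUT values of Table III. The sketch below is a back-of-the-envelope check, not the simulation behind Fig. 8; it assumes that the static power entry of Table III is the standby power of an idle SHE-MTJ LUT, which is pessimistic if idle LUTs are fully power-gated.

```python
# Per-LUT read and standby power from Table III, in microwatts.
SRAM = {"read": 6.28, "standby": 1.6}
SHE_MTJ = {"read": 1.1, "standby": 0.21}   # assumption: static power used as standby

def power_uw(tech, active, idle):
    """Power of `active` LUTs being read plus `idle` LUTs in standby, as in (5)."""
    return active * tech["read"] + idle * tech["standby"]

def reduction(active, idle):
    """Fractional power reduction of SHE-MTJ LUTs relative to SRAM LUTs."""
    return 1.0 - power_uw(SHE_MTJ, active, idle) / power_uw(SRAM, active, idle)

all_active = reduction(active=1, idle=0)   # read-only mix, about 0.82
all_idle = reduction(active=0, idle=1)     # standby-only mix, about 0.87
mixed = reduction(active=51, idle=1720)    # hypothetical mix of active and idle LUTs
```

Any mix of active and idle LUTs yields a reduction between the read-only and standby-only values, so under these assumptions the saving is bounded below by roughly 82%.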
VII. CONCLUSION
The concept of SNRA offers an intriguing architectural
approach to realize beyond von-Neumann paradigms which
embrace both probabilistic and Boolean computation. As de-
veloped herein, the inclusion of in-field programmability offers
several practical benefits beyond simulation towards a feasible
post-Moore fabric. Most importantly, it can accommodate
process variations that would otherwise invalidate baseline
training values that do not match the as-manufactured
component.
To coordinate training, a four-state FSM is shown to be
sufficient to implement the contrastive divergence (CD) algo-
rithm, as well as the control circuitry for the test operation of
DBNs with various topologies. The proposed FSM is capable
of unsupervised training of an RBM in N + 3 clocks, where
N denotes the number of nodes in the hidden layer of the RBM.
Interpolating the synthesis results from the Xilinx toolchain
indicates that a conventional FPGA footprint can accommodate
training circuitry for significantly deeper belief networks. This
is facilitated by the flexible allocation and routing of layers
and their downstream destinations, which is a central tenet of
CD training. For instance, it was shown that the FSM for both
784×500×10 and 784×500×500×10 DBN topologies can be
implemented with 1,771 LUTs, since the size of the largest
RBM in both networks is 784×500.
Beyond the flexible architectural approach, within the
SNRA fabric, the device parameters are tuned to realize either
stochastic switching or deterministic behavior. In particular,
near-zero energy barrier SHE-MTJ devices are used to provide
a natural probabilistic sigmoidal function required for imple-
mentation of the neuron’s activation function within an RBM
structure. Meanwhile, non-volatile SHE-MTJ devices with
high energy barrier (∆ ≥ 40kT ) can be used to implement
LUTs. Use of SHE-MTJ based LUTs achieves more than
80% and 50% reduction in terms of power dissipation and
area, respectively, compared to conventional SRAM-based
reconfigurable fabrics. These improvements are achieved at the
cost of higher energy consumption during the reconfiguration
operation, which occurs rarely and can be tolerated due to
the significant area and power reductions realized during the
normal operation of the SNRA.
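As a rough illustration of the probabilistic sigmoidal behavior mentioned above, the following sketch uses the generic behavioral p-bit expression from the p-bit literature, output = sgn(rand(-1, 1) + tanh(input)), with a dimensionless input. It is not the SHE-MTJ circuit model, and the function names p_bit and estimate_p_one are hypothetical. For this expression the probability of a +1 output is (1 + tanh(x))/2, which equals the logistic sigmoid evaluated at 2x.

```python
import math
import random


def p_bit(input_value, rng):
    """Behavioral p-bit: returns +1 or -1 with P(+1) = (1 + tanh(input)) / 2."""
    noise = rng.uniform(-1.0, 1.0)  # stands in for the thermal fluctuation
    return 1 if noise + math.tanh(input_value) > 0 else -1


def estimate_p_one(input_value, samples=20000, seed=0):
    """Monte Carlo estimate of the probability that the p-bit outputs +1."""
    rng = random.Random(seed)
    ups = sum(1 for _ in range(samples) if p_bit(input_value, rng) == 1)
    return ups / samples


def sigmoid(x):
    """Logistic sigmoid, used here only as the analytical reference."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(estimate_p_one(x), 3), round(sigmoid(2 * x), 3))
```

The two printed columns should agree to within sampling noise, showing how a stochastic binary output averages to a sigmoidal activation.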
ACKNOWLEDGMENT
This work was supported in part by the Center for Proba-
bilistic Spin Logic for Low-Energy Boolean and Non-Boolean
Computing (CAPSL), one of the Nanoelectronic Computing
Research (nCORE) Centers as task 2759.006, a Semiconductor
Research Corporation (SRC) program sponsored by the NSF
through CCF 1739635.
REFERENCES
[1] D. E. Nikonov and I. A. Young, “Benchmarking of beyond-cmos
exploratory devices for logic integrated circuits,” IEEE Journal on
Exploratory Solid-State Computational Devices and Circuits, vol. 1, pp.
3–11, Dec 2015.
[2] S. Ghosh and K. Roy, “Parameter variation tolerance and error resiliency:
New design paradigm for the nanoscale era,” Proceedings of the IEEE,
vol. 98, no. 10, pp. 1718–1751, Oct 2010.
[3] S. Ghosh, A. Iyengar, S. Motaman, R. Govindaraj, J. W. Jang, J. Chung,
J. Park, X. Li, R. Joshi, and D. Somasekhar, “Overview of circuits,
systems, and applications of spintronics,” IEEE Journal on Emerging
and Selected Topics in Circuits and Systems, vol. 6, no. 3, 2016.
[4] B. Liu, M. Hu, H. Li, Z.-H. Mao, Y. Chen, T. Huang, and
W. Zhang, “Digital-assisted noise-eliminating training for memristor
crossbar-based analog neuromorphic computing engine,” in 2013 50th
ACM/EDAC/IEEE Design Automation Conference (DAC), 2013.
[5] R. S. Oreifej, R. Al-Haddad, R. Zand, R. A. Ashraf, and R. F. DeMara,
“Survivability modeling and resource planning for self-repairing recon-
figurable device fabrics,” IEEE Transactions on Cybernetics, vol. 48,
no. 2, pp. 780–792, Feb 2018.
[6] R. A. Ashraf and R. F. DeMara, “Scalable fpga refurbishment using
netlist-driven evolutionary algorithms,” IEEE Transactions on Comput-
ers, vol. 62, no. 8, pp. 1526–1541, Aug 2013.
[7] R. F. DeMara, A. Roohi, R. Zand, and S. D. Pyle, “Heterogeneous
technology configurable fabrics for field-programmable co-design of
cmos and spin-based devices,” in 2017 IEEE International Conference
on Rebooting Computing (ICRC), Nov 2017, pp. 1–4.
[8] R. Zand and R. F. DeMara, “Radiation-hardened mram-based lut for
non-volatile fpga soft error mitigation with multi-node upset tolerance,”
Journal of Physics D: Applied Physics, vol. 50, no. 50, p. 505002, 2017.
[9] D. Suzuki et al., “Fabrication of a 3000-6-input-luts embedded and
block-level power-gated nonvolatile fpga chip using p-mtj-based logic-
in-memory structure,” in 2015 Symposium on VLSI Technology (VLSI
Technology), June 2015, pp. C172–C173.
[10] J. Yang, X. Wang, Q. Zhou, Z. Wang, H. Li, Y. Chen, and W. Zhao,
“Exploiting spin-orbit torque devices as reconfigurable logic for circuit
obfuscation,” IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, vol. PP, no. 99, pp. 1–1, 2018.
[11] S. K. Kim, P. L. McMahon, and K. Olukotun, “A large-scale architecture
for restricted boltzmann machines,” in Field-Programmable Custom
Computing Machines (FCCM), 2010 18th IEEE Annual International
Symposium on. IEEE, 2010, pp. 201–208.
[12] D. Le Ly and P. Chow, “High-performance reconfigurable hardware
architecture for restricted boltzmann machines,” IEEE Transactions on
Neural Networks, vol. 21, no. 11, pp. 1780–1792, 2010.
[13] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross,
“Vlsi implementation of deep neural network using integral stochastic
computing,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 25, no. 10, pp. 2688–2699, 2017.
[14] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A
hardware accelerator for combinatorial optimization and deep learning,”
in High Performance Computer Architecture (HPCA), 2016 IEEE Inter-
national Symposium on. IEEE, 2016, pp. 1–13.
[15] R. Zand, K. Y. Camsari, S. D. Pyle, I. Ahmed, C. H. Kim, and R. F.
DeMara, “Low-energy deep belief networks using intrinsic sigmoidal
spintronic-based probabilistic neurons,” in Proceedings of the 2018 on
Great Lakes Symposium on VLSI, ser. GLSVLSI ’18, 2018, pp. 15–20.
[16] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm
for boltzmann machines,” Cognitive science, vol. 9, no. 1, 1985.
[17] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence
learning.” in Aistats, vol. 10, 2005, pp. 33–40.
[18] K. Y. Camsari, R. Faria, B. M. Sutton, and S. Datta, “Stochastic p-bits
for invertible logic,” Phys. Rev. X, vol. 7, p. 031014, Jul 2017.
[19] A. Sengupta, A. Banerjee, and K. Roy, “Hybrid spintronic-cmos spiking
neural network with on-chip learning: Devices, circuits, and systems,”
Phys. Rev. Applied, vol. 6, p. 064003, Dec 2016.
[20] R. Faria, K. Y. Camsari, and S. Datta, “Low-barrier nanomagnets as
p-bits for spin logic,” IEEE Magnetics Letters, vol. 8, pp. 1–5, 2017.
[21] B. Sutton, K. Y. Camsari, B. Behin-Aein, and S. Datta, “Intrinsic
optimization using stochastic nanomagnets,” Scientific Reports, vol. 7,
2017.
[22] B. Behin-Aein, V. Diep, and S. Datta, “A building block for hardware
belief networks,” Scientific reports, vol. 6, 2016.
[23] R. Zand, A. Roohi, and R. F. DeMara, “Energy-efficient and process-
variation-resilient write circuit schemes for spin hall effect mram
device,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 25, no. 9, pp. 2394–2401, Sept 2017.
[24] I. Kuon and J. Rose, “Measuring the gap between fpgas and asics,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 26, no. 2, pp. 203–215, Feb 2007.
[25] S. Manipatruni, D. E. Nikonov, and I. A. Young, “Energy-delay perfor-
mance of giant spin hall effect switching for dense magnetic memory,”
Applied Physics Express, vol. 7, no. 10, p. 103001, 2014.
[26] X. Fong, Y. Kim, K. Yogendra, D. Fan, A. Sengupta, A. Raghunathan,
and K. Roy, “Spin-transfer torque devices for logic and memory:
Prospects and perspectives,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 35, no. 1, 2016.
[27] R. Zand, A. Roohi, D. Fan, and R. F. DeMara, “Energy-efficient
nonvolatile reconfigurable logic using spin hall effect-based lookup
tables,” IEEE Transactions on Nanotechnology, vol. 16, no. 1, pp. 32–
43, Jan 2017.
[28] M. Tanaka and M. Okutomi, “A novel inference of a restricted boltzmann
machine,” in 2014 22nd International Conference on Pattern Recogni-
tion, Aug 2014, pp. 1526–1531.
[29] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, Nov 1998.
[30] A. Roohi, R. Zand, D. Fan, and R. F. DeMara, “Voltage-based concate-
natable full adder using spin hall effect switching,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 36,
no. 12, pp. 2134–2138, Dec 2017.
