Purdue University

Purdue e-Pubs
Open Access Dissertations

Theses and Dissertations

8-2018

Stochastic Algorithms for Optimization: Devices, Circuits, and
Architecture
Yong Shim
Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations

Recommended Citation
Shim, Yong, "Stochastic Algorithms for Optimization: Devices, Circuits, and Architecture" (2018). Open
Access Dissertations. 2069.
https://docs.lib.purdue.edu/open_access_dissertations/2069

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.
Please contact epubs@purdue.edu for additional information.

STOCHASTIC ALGORITHMS FOR OPTIMIZATION:
DEVICES, CIRCUITS, AND ARCHITECTURE

A Dissertation
Submitted to the Faculty
of
Purdue University
by
Yong Shim

In Partial Fulﬁllment of the
Requirements for the Degree
of
Doctor of Philosophy

August 2018
Purdue University
West Lafayette, Indiana

ii

THE PURDUE UNIVERSITY GRADUATE SCHOOL
STATEMENT OF DISSERTATION APPROVAL

Dr. Kaushik Roy, Chair
School of Electrical and Computer Engineering
Dr. Anand Raghunathan
School of Electrical and Computer Engineering
Dr. Vijay Raghunathan
School of Electrical and Computer Engineering
Dr. Byunghoo Jung
School of Electrical and Computer Engineering
Dr. Sang Phill Park
Qualcomm

Approved by:
Dr. Venkataramanan Balakrishnan
Head of the School Graduate Program

iii

To my wife, family and friends.

iv

ACKNOWLEDGMENTS
First of all, I would like to sincerely thank my advisor, Prof. Kaushik Roy for his
guidance, encouragement, and support during my studies and research. His warmhearted and enthusiastic mentoring has been leading my research at Purdue and I am
truly fortunate to have him as my main professor.
I would also like to thank my PhD committee members, Prof. Anand Raghunathan, Prof. Byunghoo Jung, Prof. Vijay Raghunathan and Dr. Sang Phill Park
for kindly serving on the advisory committee and for their excellent advice and feedbacks.
I would sincerely like to thank Dr. Jaydeep Kulkarni for mentoring me during my
2015 internship at Intel Corporation.
I would also like to thank all the former members of Nanoelectronics Research
Lab. I especially thank Dr. Deliang Fan, Dr. Kon-woo Kwon, Dr. Yusung Kim, Dr.
Woo-Shul Cho, Dr. Karthik Yogendra, Dr. Yeongkyo Seo, Dr. Ankit sharma, Dr.
Zubair A. Azim, Dr. Dongsoo Lee, Dr. Xuanyao Fong, and Dr. Sri Harsha Choday
for mentoring me and helping me with almost all aspects of my research.
I also thank Abhronil Sengupta, Minsuk Koo, Akhilesh Jaiswal, Saima Sharmin,
Priyadarshini Panda and Chankyu Lee for being wonderful collaborators on multiple
research projects.
I like to thank all the current NRL members, especially, Aayush ankit, Ahmed
Kamal Reza, Bing Han, Chamika L. Liyanagedera, Parami Wijesinghe, Syed S, Sarwar, Gopalakrishnan Srinivasan, Jason Allred, Maryam Parsa, Mei-Chen Chen for
their support and friendship.
Last but not the least, I would like to express my gratitude to my parents and
family for their endless love, support, and encouragement.

v

TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

2 ISING COMPUTATION BASED COMBINATORIAL OPTIMIZATION USING SPIN-HALL EFFECT (SHE) INDUCED STOCHASTIC MAGNETIZATION REVERSAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

2.2

From device to Ising model . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3

Problem Mapping and Results . . . . . . . . . . . . . . . . . . . . . . . 19

2.4

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.5

Addendum: From Ising Hamiltonian to Hardware implementation . . . 25

3 BIASED RANDOM-WALK USING STOCHASTIC SWITCHING OF NANOMAGNETS:
APPLICATION TO SAT SOLVER . . . . . . . . . . . . . . . . . . . . . . . 32
3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2

Algorithm and Hardware implementation . . . . . . . . . . . . . . . . . 35

3.3

Nano-magnet as the Main Computational Element . . . . . . . . . . . . 39

3.4

MTJ based SAT solver . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5

Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.6

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4 BAYESIAN INFERENCE ENABLED BY STOCHASTIC SPIN-ORBIT
TORQUE DEVICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.2

Device fabrication and spin-orbit torque (SOT) driven stochastic switching54

vi
Page
4.3

Bayesian Network based on Stochastic MTJ . . . . . . . . . . . . . . . 60

4.4

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.5

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5 COUPLED SPIN-TORQUE-OSCILLATOR BASED CONVOLUTION COMPUTATION:
APPLICATION TO IMAGE PROCESSING . . . . . . . . . . . . . . . . . . 72
5.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2

Coupled oscillator based approximate computation primitive . . . . . . 74

5.3

System level implementation . . . . . . . . . . . . . . . . . . . . . . . . 77

5.4

Circuit level implementation . . . . . . . . . . . . . . . . . . . . . . . . 80

5.5

Measurement results and its application to edge detection . . . . . . . . 82

5.6

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

vii

LIST OF TABLES
Table

Page

2.1

Device simulation parameters of the SHE-MTJ for Ising spin model . . . . 13

3.1

Device simulation parameters of the SHE-MTJ for 3-SAT solver . . . . . . 43

3.2

Performance comparison between Conventional RW and the proposed BRW 49

4.1

Estimated probability of the inference for 1,000 output samples with Gaussian noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

viii

LIST OF FIGURES
Figure
1.1

Page

Technology node scaling is expected to meet its fundamental limit around
2020 (IRDS 2017) [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

Performance gap due to the limitation on device scaling and ineﬃcient
computing models (IRDS 2017) [2] . . . . . . . . . . . . . . . . . . . . . .

2

1.3

Categories of emerging computing architectures (2016 IRDS) [3] . . . . . .

3

2.1

(a) Conventional Ising spin model and the deﬁnition of Hamiltonian (H)
(b) Changes in Hamiltonian over the spin states. [10] . . . . . . . . . . . .

9

1.2

2.2

3-terminal SHE-MTJ device with MTJ on the top of the Heavy-Metal
(HM) layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3

(a) Switching probability (PSW ) versus input charge current through the
HM layer (Iq ). (b) Magnetic moment change (normalized) from 1 to -1
during 100 write operation with Iq = 90 uA for 3 ns. . . . . . . . . . . . . . 15

2.4

(a) Diﬀerent switching behaviors depending on the magnitude of input
charge current and energy barrier of the MTJ device. (b) Timing diagram
for two write operations and corresponding magnetic moment change. . . . 15

2.5

(a) Implementation of Majority vote function based on multiple current
sources with corresponding switches (b) Switching probability (PSW ) changes
based on the inputs from neighbors . . . . . . . . . . . . . . . . . . . . . . 17

2.6

The proposed device-circuit conﬁguration for single Ising spin model. . . . 18

2.7

(a) Ising spin model based on the proposed hardware implementation for
a single spin state. (b) The process of spin states update. . . . . . . . . . . 20

2.8

Application to Maximum-cut problem (a) System energy proﬁle (based on
the deﬁnition of Hamiltonian (H) in Fig. 2.1) over the number of iterations
(spin state update). (b) Visualized spin states at initial, intermediate, and
ﬁnal stages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.9

Application to Graph Coloring problem. . . . . . . . . . . . . . . . . . . . 23

2.10 (a) The sample GC problem with 4 vertices and 3 colors (Left) and the
number of spin states to represent the given problem (Right). (b) Example
solution from the Ising solver (Left) and its interpretation (Right) . . . . . 26

ix
Figure

Page

2.11 (a) Interpretation of the 1st term of Hamiltonian for a Graph Coloring. (b)
Energy penalty example from the 1st term of Hamiltonian with possible
cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.12 (a) Interpretation of the 2nd term of Hamiltonian for a Graph Coloring.
(b) Energy penalty example from the second term of Hamiltonian with
possible cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.13 (a) Interconnection (Weight) matrix example for GC problem with 4 vertices and 3 colors. (b) Implementation of a single Ising spin model with
SHE-STO and CMOS peripherals . . . . . . . . . . . . . . . . . . . . . . . 30
3.1

(a) Hardware implementation of 3-SAT solver with conventional RW algorithm, (b) Details of Main unit. . . . . . . . . . . . . . . . . . . . . . . . 36

3.2

(a) Proposed 3-SAT solver with BRW, (b) Details of the hardware implementation of the proposed SAT solver. . . . . . . . . . . . . . . . . . . . . 38

3.3

(a) 3-terminal spin-Hall-Eﬀect MTJ (SHE-MTJ) device, (b) Device-Circuit
conﬁguration for the core hardware primitive in our design. . . . . . . . . 40

3.4

(a) Switching probability (PSW ) versus input charge current through the
HM layer (Iq ) with and without thermal noise, (b) Transient simulation
results showing probabilistic switching of the device with the same input
stimulus that will incur 50 % chance of magnetization ﬂipping (mx ) in the
presence of thermal noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5

(a) Overall hardware conﬁguration of the proposed 3-SAT solver based on
BRW algorithm, (b) Timing diagram of the main command signals (RST,
WR, RD). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.6

Performance variation depends on (a) diﬀerent values of PSW,ST EP 1 for
linear PSW step, (b) diﬀerent values of PSW,IN IT and PSW,ST EP 2 for nonlinear PSW step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.7

Performance comparison in terms of normalized solution search time between three diﬀerent approaches of the 3-SAT solver. The sample problems
are from SATLIB [53]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.8

(a) State transition of the variables during the problem solving, (b) Average number of variable ﬂips during the process of problem solving. . . . . . 49

3.9

Performance comparison between the proposed hardware SAT solver using
BRW algorithm, software WalkSAT solver [54] and probSAT [55] solver for
problems with a size of (a) 350 variables / 1,491 clauses and (b) 450 variables / 1,917 variables. The sample problems are from SAT competitions
2011 [56]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

x
Figure

Page

4.1

(a) Nanoscale Hall-bar structure consisting of Ta (10 nm)/CoFeB (1.3
nm)/MgO (1.5 nm)/Ta (5 nm) (from bottom to top) material stack. (b)
Variation of the magnet switching probability with variation in magnitude (amplitude) of the input current pulse. (c) Variation of the magnet
switching probability characteristics with variation in the duration of the
input current pulse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2

A three terminal device structure consists of Magnetic Tunnel Junction
(MTJ) on top of the Heavy Metal (HM) underlayer.(Left top) The stochastic switching of the device in the presence of thermal noise can be exploited
to various applications such as neuromorphic computing (Bottom), Ising
spin model (Right top), and Bayesian Network (Right middle). . . . . . . 57

4.3

Two independent simulations of stochastic Landau-Lifshitz-Gilbert (LLG)
equation with thermal noise for a magnet with perpendicular magnetic
anisotropy (along z direction) are shown. . . . . . . . . . . . . . . . . . . . 59

4.4

A simple Bayesian Network with four variable and corresponding Conditional Probability Table (CPT) . . . . . . . . . . . . . . . . . . . . . . . . 61

4.5

The proposed device-circuit level solution for Poisson spike pulse generation with MTJ on HM along with CMOS peripherals . . . . . . . . . . . . 63

4.6

Interconnection between two variables based on the proposed Stochastic
switching elements using CMOS interface circuit . . . . . . . . . . . . . . . 63

4.7

Practical implementation of the variable for BN and its interconnection
with neighboring variable . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.8

Timing diagram of the proposed device-circuit conﬁguration in Fig. 4.7 . . 65

4.9

Hardware implementation of Bayesian Network with stochastic MTJ and
CMOS peripherals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.10 (a) Inference operation based on default BN with the aid of multiplication
and division operations (b) Division operation between two Poisson pulses
by matching the rate of spiking between two nodes [79] . . . . . . . . . . . 68
4.11 Practical implementation of division operation between two Poisson pulses 68
4.12 (a) Timing waveform from each variable to estimate the probability during
100 sample points. (b) Inference results based on the proposed BN . . . . 70
5.1

(a) Oscillator as a basic computing element and two possible methods of
input mapping through FSK and PSK. (b) Approximate distance computation (L1, L2 norm) based on the oscillator and Distance Unit . . . . . . 74

xi
Figure

Page

5.2

(a) Approximate distance computation (L22 norm) between two multidimension vectors based on coupled oscillator network and the associate
Distance Unit (b) Basic idea of approximate convolution computation from
the L22 distance metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3

Coupled STO cluster based approximate convolution computation: Application to image processing, Edge detection . . . . . . . . . . . . . . . . 78

5.4

The detailed implementation of the proposed system for approximate distance computation in Fig. 5.3 based on the coupled oscillator device module (GMR-STO) and CMOS detector (L22 unit). . . . . . . . . . . . . . . 79

5.5

(a) Simpliﬁed STO device module in Fig. 5.4 (b) Frequency versus input
dc bias current to the STO device under the frequency locking through a
ﬁeld line injection coupling (c) Expected waveforms along the signal ﬂow . 80

5.6

(a) The detailed block diagram of the CMOS detector, L22 unit (b) Expected waveforms along the signal ﬂow . . . . . . . . . . . . . . . . . . . . 81

5.7

(a) simulation environment setup to check the performance of the proposed
approximate distance computing primitive based on randomly generated
input vectors (b) Input test vector table used for the measurement (c) 2D
graph to show the input (randomly generated vectors in L2 norm format)
and the output (Digital code from CMOS detector) relationship . . . . . . 83

5.8

Approximate distance measurement results of the proposed coupled-STO
based computing primitive based on the simulation environment in Fig. 5.7. 84

5.9

Simulation study (Edge detection through Gabor ﬁltering) to show the feasibility of our system as an approximate convolution computing primitive
(a) Downscaled “Lenna” image (b) 2 × 2 Gabor ﬁlter (c) Edge detection
results from ideal convolution (Left) and approximate convolution (Right)
(d) Pixel-wise diﬀerence between two edge maps in (c) . . . . . . . . . . . 85

xii

ABSTRACT
Shim, Yong PhD, Purdue University, August 2018. Stochastic Algorithms for Optimization: Devices, Circuits, and Architecture. Major Professor: Kaushik Roy.
With increasing demands for eﬃcient computing models to solve multiple types of
optimization problems, enormous eﬀorts have been devoted to ﬁnd alternative solutions across the device, circuit and architecture level design space rather than solely
relying on traditional computing methods. The computational cost associated with
solving optimization problems increases exponentially with the number of variables
involved. Moreover, computation based on the traditional von-Neumann architecture
follows sequential fetch, decode and execute operations, thereby involving signiﬁcant
energy overhead. To address such diﬃculties, eﬃcient optimization solvers based
on stochastic algorithms were proposed. The stochastic algorithms show fast search
time through parallel solution space exploration by exploiting stochastic switching
elements. The goal of this research is to propose eﬃcient computing models for optimization problems by adopting a biased random number generator (RNG). Here we
use stochastic switching of nanomagnet under thermal noise. The switching probability of the nanomagnet is manipulated by the magnitude of input stimulus through
the device. This core element is used to build combinatorial optimization problem
solvers for diﬀerent types of problems such as an Ising spin model for Graph Coloring
and a Bayesian inference engine for probabilistic inference.
Apart from the optimization solvers, this research also focuses on the implementation of spin transfer torque based coupled oscillators on core computing primitives
for image processing applications and the associated CMOS supporting circuit design.
We have shown that the proposed coupled oscillator system can perform eﬃcient convolution computation and could be used for an edge detection.

1

1. INTRODUCTION
For over 50 years, CMOS (Complementary Metal Oxide Semiconductor) has been
the workhorse of information processing technology, especially in the area of Boolean
computation. The scaling of CMOS has enabled the integration of a larger number
of transistors into the same die area as well as faster device switching speeds, thereby
contributing to increased performance with more functionality. However, CMOS
scaling is expected to approach its fundamental limit after 2020 as the horizontal
feature size of the device reaches the sub-10-nm regime [1], as shown in Fig. 1.

Fig. 1.1. Technology node scaling is expected to meet its fundamental
limit around 2020 (IRDS 2017) [1]

2
In addition to the device scaling limitation, there remain other areas such as
optimization, recognition, and classiﬁcation where Boolean computation based on
von-Neumann architecture turns out to be ineﬃcient. The operations required for
such applications are computationally intensive in most cases, and these operations
are executed by following a sequential instruction set that leads to excessive computing resource usage. These limitations on device scaling and ineﬃcient computing
models result in a gap between the performance required to handle tremendous data
volumes (i.e., in the areas of big data, Internet-of-Things (IoT), deep learning, and
artiﬁcial intelligence) and the performance achievable through CMOS device scaling.

Fig. 1.2. Performance gap due to the limitation on device scaling and
ineﬃcient computing models (IRDS 2017) [2]

To bridge the performance gap, illustrated in Fig. 2, the research community is
now actively exploring new possibilities via novel devices and architectures to achieve
performance beyond what is possible with CMOS scaling [3]. Moreover, these trends
are focusing on an interdisciplinary perspective rather than relying on changes to
the device or the architecture only. From the device perspective, multiple devices
are introduced as beyond-CMOS devices in [4]: 1) Spin FET, 2) Negative Gate-

3
Capacitance FET, 3) Micro/Nano-Electro-Mechanical (M/NEM) switch, 4) All-Spin
Logic Devices, and 5) Memristors. These devices are opening up new possibilities for
non-conventional computing models such as neural-networks (NNs) and probabilistic
learning, as shown in Fig. 1.3 from [3].

Fig. 1.3. Categories of emerging computing architectures (2016 IRDS) [3]

In keeping with this trend, this research proposes interdisciplinary approaches for
1) multiple types of optimization problem solvers and 2) coupled oscillator network
based approximate computing primitives, by exploiting spintronic devices [5,6] to implement eﬃcient non-Boolean computing models. Among the variations of spintronic
devices, Magnetic-Tunnel-Junction (MTJ) is of particular interest due to its simple device structure, inherent non-volatility, and stochastic magnetization switching
behavior with thermal noise.
Based on the MTJ device, we ﬁrst present a stochastic algorithm for optimization
problems and its hardware implementations. Here, a magnetic stack with stochastic

4
switching characteristics constitutes a core hardware primitive for multiple types of
optimization problem solvers with the associated CMOS peripheral circuits. Then,
we present a coupled oscillator system as an approximate convolution computing
unit where the MTJ based Spin Torque Oscillator (STO) is exploited as an oscillator.
The collective behavior of the oscillators under a weak coupling is used to compute a
distance between two input vectors for image processing applications.
The rest of the dissertation is organized as follows
In chapter 2, we introduce the implementation of the Ising spin model based on
hybrid MTJ and CMOS circuit design. The Ising spin model is an eﬃcient computing
model for solving optimization problems based on its natural tendency of converging
towards a low energy state. The non-volatility and stochastic switching of the MTJ
device on a Heavy-Metal (HM) underlayer are exploited to implement two major
functionalities of the Ising spin model, ‘Annealing’ and ‘Majority Vote’. The majority
vote is used to update the state of spin in the system through interactions with
neighboring spins. Annealing introduces noise to the system, thereby preventing the
system from becoming stuck at local energy-minima during solution search. These two
functions are readily implemented by leveraging the inherent device characteristics
of the MTJ under consideration. The proposed core computing element is used to
build an Ising spin system that can be used to map the combinatorial optimization
problems. We demonstrate its feasibility by solving realistic NP-complete problems.
In chapter 3, we propose a Boolean Satisﬁability (SAT) problem solver based on
stochastic algorithm. The goal of SAT problem solving is to assign proper logic states
to Boolean variables such that the given Boolean expression in a Conjunctive Normal
Form (CNF) is evaluated to be true. The key to solving SAT problems lies in decreasing the number of unsatisﬁed clauses in a given CNF formula by ﬂipping a variable
connected to the unsatisﬁed clause. There are multiple top-down algorithm-level approaches to selecting a variable that might decrease the number of unsatisﬁed clauses.
We propose a bottom-up approach by utilizing our proposed device-circuit solution
to choose a variable to be ﬂipped for the problem-solving. From the simulation study,

5
the proposed hardware SAT solver exhibits better performance in terms of solution
search time in comparison to software SAT solvers.
In chapter 4, we discuss a probabilistic optimization engine based on Bayesian
Network (BN) and the inference operation through the network, which mimics the
spiking behavior of neurons in the human brain during reasoning. Bayesian Network
is a graphical model to represent conditional independences of variables, where nodes
in a BN represent random variables and links represent direct dependencies among
the variables. A random variable is used to encode information at the rate of spikes,
and the links between the variables are used to transfer the data. Our work attempts
to experimentally validate prior theoretical proposals for utilizing spin-orbit torque
switching nanomagnets as biased random number generators (RNG), where the biased
RNG is used as a variable in our system to generate Poisson spike train (rate coding).
The pulse train from the variables and the ﬂow of information through the network are
used to perform an inference operation, which exhibits reasonable accuracy compared
to the ideal result from complex ﬂoating point calculation.
In chapter 5, we propose an approximate convolution computation unit based on
a coupled oscillator network and CMOS supporting circuits. The hybrid system consists of a Giant Magnetoresistance (GMR) STO device and a CMOS detector chip.
These are used to form a distance computing core. The measurement result from experiments indicate that the proposed computing primitive can perform approximate
distance computation whose output exhibits a quadratic relationship to the input
variation. (L22 norm). The corresponding L22 norm is used to compute an approximate convolution for edge detection with a Gabor ﬁlter kernel to demonstrate the
feasibility of the coupled oscillator network as a core computing primitive for image
processing applications.
Finally, chapter 6 summarizes the thesis and discusses the future work.

6

2. ISING COMPUTATION BASED COMBINATORIAL
OPTIMIZATION USING SPIN-HALL EFFECT (SHE)
INDUCED STOCHASTIC MAGNETIZATION REVERSAL
Ising spin model is considered as an eﬃcient computing method to solve combinatorial optimization problems based on its natural tendency of convergence towards
low energy state. The underlying basic functions facilitating the Ising model can be
categorized into two parts, “Annealing and Majority vote”. In this chapter, we propose an Ising cell based on Spin Hall Eﬀect (SHE) induced magnetization switching
in a Magnetic Tunnel Junction (MTJ). The stochasticity of our proposed Ising cell
based on SHE induced MTJ switching, can implement the natural annealing process
by preventing the system from being stuck at local minimas. Further, by controlling
the current through the Heavy-Metal (HM) underlying the MTJ, we can mimic the
majority vote function which determines the next state of the individual spins. By
solving coupled Landau-Lifshitz-Gilbert (LLG) equations, we demonstrate that our
Ising cell can be replicated to map certain combinatorial problems. We present results
for two representative problems - Maximum-cut and Graph coloring - to illustrate the
feasibility of the proposed device-circuit conﬁguration in solving combinatorial optimization problems. Our proposed solution using a Heavy Metal (HM) based MTJ
device can be exploited to implement compact, fast, and energy eﬃcient Ising spin
model.

2.1

Introduction
Eﬃcient computing models for combinatorial optimization problems have attracted

considerable research interest. However, solving optimization problems in an eﬃcient
way based on conventional computing model is very challenging. Typically, to ﬁnd

7
an optimum solution for such problems, a performance index has to be computed
and compared for every possible input combinations. [7] However, the computational
cost associated with a combinatorial optimization problem, increases exponentially
with the number of variables. Moreover, if we consider the process of problem solving based on conventional computing (by following a sequential fetch, decode, and
execute cycles), ﬁnding an optimum (even near optimum) solution seems infeasible
keeping in view the energy and power requirements. Ising model has been researched
extensively owing to its simple architecture and inherent ability to solve combinatorial optimization problems [8, 9]. A CMOS based implementation of the Ising model
can be found in [10]. CMOS implementation, however, requires complex circuits for
implementing some of the primitives required for the Ising model like random number
generators, majority logic etc. As opposed to CMOS implementations, use of spin
devices for Ising model opens up new avenues for non-volatile hardware realization
of the Ising model. Here we propose a non-volatile Ising cell based on controlled
stochastic switching dynamics of the magnetization direction in a nanomagnet with
an underlying Heavy Metal (HM) layer. The device used in the present manuscript,
has been demonstrated experimentally in [11]. Recently, an Ising model based on a
similar SHE based device and low Energy Barrier (EB) magnet (<5KT ), was proposed in [12]. However, the use of magnets with low EB severely limits the design
space, since it has to be ensured that the read current is small enough to avoid possible pinning of the free-layer during the read operation. The constrain on the read
current and hence the read voltage also makes it challenging to drive multiple devices from a given read port of a particular nano-magnet. Further, use of low EB
magnets compromises the intrinsic non-volatility of spin devices. Another proposal,
similar to [12], but based on high EB magnets initially biased towards the magnetic
hard axis can be found in [13]. The inclusion of dipole coupled magnets in the proposed device presented in [12, 13], increases the device complexity. Besides, details
about circuit requirements for interconnections and weighted summation of currents
is missing. The simplicity of the device used in this chapter in addition to the pre-

8
sented CMOS circuits, allows us to construct an Ising cell, which can be seamlessly
interconnected to solve combinatorial optimization problems. A numerical simulation framework based on the stochastic Landau-Lifshitz-Gilbert (LLG) equation is
developed to analyze the switching characteristic of the proposed device. Further, by
solving coupled stochastic LLG equations and SPICE simulations, we show solutions
for some representative combinatorial problems obtained by using our proposed Ising
cell. Before we describe the proposed device, we would a brief introduction of the
Ising spin model. The Ising model considers the behavior of magnetic spins and the
coupling between them. Fig. 2.1(a) illustrates a simple view of the Ising model and
the deﬁnition of associated Hamiltonian (H) - the total energy of the system. The
model consists of individual spin state (si ), interconnection coeﬃcient between two
spins (Jij ), and external magnetic ﬁeld (hi ). Each spin can have one of the two states,
up and down, and there are four interconnections with neighbors in this model. The
spin at the center is named as sC and the four neighbors are sU , sD , sL , and sR . The
interconnection weights between sC and its neighbors are denoted as JCU , JCD , JCL ,
and JCR , respectively. These weights model the coupling between spins and are used
to determine next state of the spins. For example, if JCU has positive sign (i.e. +1),
it implies sU tries to align sC parallel to itself. Likewise, a neighboring spin with a
negative weight tries to align the given spin anti-parallel to itself. Since there are
four neighbors, the next state of sC is decided based on a majority vote - if majority
of the neighbors of a given spin state want to keep the given spin in +1 direction,
then the next state of sC will be +1, else it would be -1. The Hamiltonian (H) also
changes as the states of the spins are updated. Hence, once the problem is mapped
to the system properly (by programming the weights for each interconnection), the
system tries to reach the energy minimum state by switching the states of the spins
through the aforementioned majority coupling. When the system reaches the global
minimum energy state, the solution is obtained by examining the ﬁnal states of the
spins.

9

Fig. 2.1. (a) Conventional Ising spin model and the deﬁnition of
Hamiltonian (H) (b) Changes in Hamiltonian over the spin states. [10]

Fig. 2.1(b) shows the total energy (H) of the system as a function of diﬀerent
spin states. As shown in the ﬁgure, the energy proﬁle has a global minimum, and
also multiple local minimum states. This implies that the system could easily get
stuck at the local minimum state during the process of problem solving (i.e the
system evolves to diﬀerent states through coupling). To avoid the system being
stuck into a local minima, annealing process, such as “Simulated Annealing (SA)”
[14, 15] and “Quantum Annealing (QA)” [16–18] has been proposed. During a SA
process in conventional Ising model, the system starts from a known initial states
at a non-zero temperature. The system then evolves towards the minimum state of
the Hamiltonian by lowering its temperature gradually. In contrast, in a QA, the
temperature can be replaced by a quantum mechanical eﬀect, such as probabilistic
quantum tunneling [17]. Whether it is QA or SA, annealing always includes some
kind of randomization of the next state logic, to get the system out of the local
minima. Despite the fact that SA and QA can ﬁnd the lowest energy state of the
Ising spin model eﬃciently, the implementation of such a system needs control of
the state of each spin and coupling between them. Also, the state of individual
spins needs to be monitored for total energy of the system which is challenging from
hardware perspective. This is why hardware implementation of the Ising model did

10
not receive much attention, even though theoretical background has been widely
explored by the research community. Recently, hardware implementations of the Ising
spin model have been proposed using Superconducting material [19] and CMOS only
implementation [10]. In CMOS circuits, the annealing process can be implemented
by generating a random address that is used to choose a speciﬁc spin to be ﬂipped
[10]. However, such implementations require complex hardware for random number
generation, address decoding, and write operation for the speciﬁc spin state. These
series of operations have to be executed multiple times to get the system state out
of the local minima. Furthermore, randomizing spin states based on random number
generation can potentially move the system to totally diﬀerent state so that the system
might not reach the global minimum state eventually.
Another diﬃculty in implementation of the Ising model is due to the complexity
of the majority voting circuit. As explained, the next state of each spin is determined
by the interactions with all neighboring states. Multiple solutions might be possible
to implement majority function based on digital, analog, and even with mixed-mode
design. However, all of these implementations are expensive in terms of silicon area
and power consumption due to its multi-input nature and complexity of operation.
The motivation of our research arises from the fact that a spintronic device like a
Heavy-Metal (HM) based Magnetic-Tunnel-Junction (MTJ) can potentially provide
the aforementioned two important characteristics required for the Ising model viz. 1)
stochasticity (random spin ﬂip required for annealing) 2) majority voting. In order
to mimic the aforementioned functionality, a nanomagnet is required that is able to
switch its states with a certain probability to facilitate the annealing process and also
change its state based on a majority vote which is crucial for the system to evolve
towards minimum energy state. In the next section, we describe the mapping of the
magnetization dynamics of a nanomagnet to the Ising operations such as a natural
randomizer and a majority vote logic.

11
2.2

From device to Ising model
Let us ﬁrst describe the basic device structure used for our Ising model and its

principle of operation. Fig. 2.2 shows a three terminal device structure, consisting of
the conventional MTJ stack formed by a Tunneling Barrier (TB) sandwiched between
two nanomagnets. Since the magnetization of the upper ferromagnetic layer is ﬁxed,
it is termed as the Pinned Layer (PL). On the other hand, the magnetization of the
bottom layer, denoted as the Free Layer (FL), can be manipulated by an incoming
spin current. Depending on the direction of the FL, the MTJ structure can have two
stable states. If the magnetizations of the two ferromagnetic layers are in the same
direction, it is in the parallel conﬁguration (P), otherwise it is in the anti-parallel
conﬁguration (AP). These two states have diﬀerent resistances across the vertical
direction of the device. Typically, AP state has a higher resistance and the ratio
between P and AP states is deﬁned as the Tunnel Magneto-Resistance (TMR) ratio.

Fig. 2.2. 3-terminal SHE-MTJ device with MTJ on the top of the
Heavy-Metal (HM) layer.

The FL is in contact with a heavy metal (HM) like Pt or Ta. On passing a charge
current through the HM, from terminal T1 to T2, the direction of magnetization in the
FL becomes parallel to that of the PL. When the direction of charge current ﬂowing
through the HM is reversed (from T2 to T1), the FL switches its direction such that
it is now anti-parallel to the PL. The switching of FL due to a charge current ﬂowing

12
through the HM can be attributed to the large spin-orbit-coupling (SOC) of the HM.
SOC is a relativistic eﬀect that theoretically originates from the full relativistic waveequation. In the current scenario, due to SOC based eﬀects like the Spin Hall Eﬀect
(SHE), the electrons ﬂowing through the HM are deﬂected such that up and down
spins split. As shown in Fig. 2.2, the up-spins are deﬂected in the +z direction and
down-spins in the -z direction. This spin splitting results in a resultant spin current
ﬂowing in the +z or -z direction, based on the direction of the charge current. The
spin current (Is ), thus generated due to the charge current ﬂowing in the HM, exerts a
torque on the magnetization direction of the FL, making it parallel or anti-parallel to
the PL. Thus, a charge current ﬂowing between the terminals T1 and T2 can switch
the state of the MTJ from RAP to RP or vice-versa. In order to sense the resistance
of the MTJ, a voltage can be applied between terminals T3 and T1/T2. By sensing
the current ﬂowing through the MTJ stack (or detecting the voltage level across the
device) one can conclude if the current state of the MTJ is RAP or RP . Based on the
above description, the three terminal structure of the device shown in Fig. 2.2 oﬀers
the following desirable characteristics 1) the write and read path are decoupled and
can be independently optimized 2) the eﬃciency of spin current generated from the
charge current ﬂowing through the HM, can be greater than 100% [20], resulting in
low write-energy 3) by controlling the amount of current ﬂowing through the HM, one
can not only alter the switching probability but also implement a majority function.
Later in this chapter, we would describe how these desirable characteristics of the HM
based MTJ device can be used eﬃciently to mimic the various operations required
for the Ising spin model.
We would now describe the equations used for modeling the device shown in Fig.
2.2. Under an external excitation, for example a magnetic ﬁeld or a spin current, the
magnetization dynamics of the FL can be obtained by the well known Landau-LifshitzGilbert (LLG) equation under the single domain approximation with an additional
term for the spin transfer torque proposed by Slonczewski [21],
dm
b
dm
b
1
c)
b×
)+
(m
b ×m
b × Is M
= −γ(m
b × Hef f ) + α(m
dt
dt
qNs

13
where m̂ is the unit vector corresponding to the direction of magnetization in the FL,
γ = 2µB µ0 /h̄ is the gyromagnetic ratio for electron, α is Gilbert damping ratio, Hef f
is the eﬀective magnetic ﬁeld including the shape anisotropy ﬁeld [22] and uniaxial
interface anisotropy ﬁeld [23], Ns = Ms V /µB is the number of spins in the FL volume
V (Ms is saturation magnetization and µB is Bohr magneton), Is in equation (1) is
c is magnetization
the spin current ﬂowing through the FL in the +z or -z direction, M
direction of PL.
Based on recent experimental studies [11, 24–27], the spin current generated due
MT J
to a charge current ﬂowing through the HM can be estimated as Is = θSH WtHM
IQ ,

where IQ is the charge current ﬂowing through the HM, θSH is the spin-hall angle [25],
WM T J and tHM are device dimension parameters, shown in Fig. 2.2. The details of
the device parameters we used for benchmarking are summarized in Table I.
Table 2.1.
Device simulation parameters of the SHE-MTJ for Ising spin model
Parameters

Value

Free layer area

(π/4) × 45 × 112.5 nm2

Free layer thickness

1.5 nm

Heavy-metal thickness, tHM

2.3 nm

Saturation magnetization Ms

1257.3 emu/cm3 [28]

Spin-Hall Angle, θSH

0.3 [20]

Gilbert damping factor α

0.1

Energy barrier EB

60 KT

MgO Thickness, tM gO

1.4 nm

MTJ resistance, RP (RAP )

8.56 (18.31) KΩ

Resistivity of HM, ρHM

200 µΩ - cm [20]

Pulse width tP W

3 ns

Temperature TK

300 K

Supply Voltage, VDD

1V

14
Finally, to model the eﬀect of thermal noise at non-zero temperature, the thermal
q
2KB T
α
noise is accounted as a random thermal ﬁeld [29], Hthermal =
G ,
1+α2 γµ0 Ms V δt 0,1
where G0,1 is a Gaussian distribution (zero mean, unit standard deviation), KB is the
Boltzmann constant, T is the temperature and δt is the time step. Under the eﬀect
of thermal noise, the switching behavior of the MTJ during the write operation due
to the charge current ﬂowing through the HM layer, becomes stochastic in nature.
Also, the probability of switching changes according to the magnitude of the input
charge current. Fig. 2.3(a) illustrates a graph showing switching probability (PSW )
versus input charge current (Iq ) through the HM layer. The charge current pulse
was applied for a ﬁxed time (3 ns) and an additional 6 ns wait time was included
for the magnetization to relax. The amount of charge current varies from 40 uA
to 160 uA with 5 uA step, and 104 simulations were executed for each simulation
step to get a switching probability. As shown in the ﬁgure, when an input current
of 90 uA is applied for 3 ns, the PSW becomes roughly 50 %. Fig. 2.3(b) shows
the normalized magnetic moment change with 100 write operations (assuming that
the initial magnetization is 1 and is being ﬂipped to -1 direction) when the input
charge current is kept at 90 uA for 3 ns followed by 6 ns of relaxation time. It can
be observed, 51 out of 100 cases ﬂipped which is close to the ratio one would expect
from Fig. 2.3(a).
The stochastic switching of the nano-magnet has been attracting signiﬁcant interest in recent years and is being used for intriguing applications such as neuromorphic
computing [30, 31]. In this paper we use this stochastic ﬂipping nature of the nanomagnet as a natural randomizer - one of the key elements for the ‘natural annealing’
process. The baseline idea of a general annealing process in Ising model lies in perturbing the spin states randomly to get the system out of the local minimum energy
state. Thus, while switching the state of the MTJ, we can tune the input charge
current ﬂowing through the HM such that the MTJ switches with the desired probability. The write current can be controlled with ease by adopting simple CMOS
peripherals which will be explained later. It is worth noting here that the proposed

15

Fig. 2.3. (a) Switching probability (PSW ) versus input charge current
through the HM layer (Iq ). (b) Magnetic moment change (normalized)
from 1 to -1 during 100 write operation with Iq = 90 uA for 3 ns.

natural annealing can also provide time-varying switching probability (by adjusting
the input current value) which mimics the natural property of annealing through the
temperature control. This helps to ﬁnd a global minimum quickly when the system
reaches the end of iterations.

Fig. 2.4. (a) Diﬀerent switching behaviors depending on the magnitude of input charge current and energy barrier of the MTJ device. (b)
Timing diagram for two write operations and corresponding magnetic
moment change.

16
Additionally, we can exploit another beneﬁt from the nanomagnet due to the
strong dependence of the switching process on the input charge current ﬂowing
through the HM. As explained, charge current ﬂowing through the HM layer induces
spin current in transverse direction at the FL-HM interface, which ﬂips magnetization of the FL. The dependence of the ﬂipping process on the input current can be
explained by the energy proﬁle of the FL. Assume that the angle between FL magnetization and the PL magnetization be represented by θ. The FL energy as a function
of θ is shown in Fig. 2.4(a), where the two stable states (θ=0◦ and θ=180◦ ) are
separated by the energy barrier EB . Here we assume that the MTJ changes its state
from P (θ=0◦ ) to AP (θ=180◦ ) state. Once the input charge current is presented to
the HM layer for a duration tW RIT E , spin current is induced based on the spin-hall
eﬀect (SHE) and makes the spin at point 1 to move uphill along the energy proﬁle. If
the charge current is not enough to move the spin across the energy barrier, the spin
stops at the metastable state (point 2) and falls down again to the point 1 during
tRELAX . On the other hand, once enough charge current is applied, ultimately the
spin will overcome the energy barrier and move to the point 3. Fig. 2.4(b) shows these
situations using magnetic moment change and two current pulses with diﬀerent magnitudes. Note that, in presence of thermal noise, a hard switching threshold current
does not exists. A particular amount of current can only ﬂip a magnet with certain
probability unless the applied current is large enough to deterministically switch the
magnet.
The decoupled write operation in HM based MTJ can be used as an eﬃcient
way to construct the majority vote function required for the Ising model. Let us
consider a given connection between a particular spin state and one of its neighbors.
We would represent the connection as (+1,+1), where the ﬁrst number denotes the
spin state (+1 represents up spin and -1 down spin), while the second number in the
bracket represents the associated weight of the given spin and its neighboring spin. As
mentioned in the previous section, the next state of each spin is determined through
the interaction with all connected neighboring spins. For example, if a neighbor

17
has a state (+1,+1) or (-1,-1), then it would want the next state of the spin to be
+1 (obtained by product of spin state and associated weight). In case of (+1, 1) or (-1, +1), the neighbors would tend to make the next state of the given spin
under consideration as -1. In the presence of multiple neighbors, a majority vote
is taken to determine the next state of the spin under consideration. The hardware
implementation of such a multiplication (product of spin state and associated weight)
and majority vote functionality requires complex circuit.
Our proposed HM based MTJ circuit that can eﬃciently implement the majority
vote functionality is shown in 2.5(a).

Fig. 2.5. (a) Implementation of Majority vote function based on multiple current sources with corresponding switches (b) Switching probability (PSW ) changes based on the inputs from neighbors

The HM layer receives the input current proportional to the number of voters
from the neighboring spin states using multiple current sources and switches. The
corresponding PSW is depicted on the top of Fig. 2.5(b). Here we assume the possible
current range for write operation is limited within 60 uA to 120 uA. This range is
equally divided among its neighbors. For instance, if there are four neighbors, each
neighbor would contribute 15 uA of write current by voting to one of the two potential
states. Note that, the leftmost current source is used to provide a default magnitude
of current (60 uA) during the write cycle. Based on this, if there are 0 voters, the

18
current during the write operation becomes 60 uA which leads to 2 % PSW . If there
are 4 voters, then the current becomes 120 uA and corresponding PSW becomes 96 %.
This directly mimics the general rule of majority vote - low PSW with less voters, high
PSW with more voters. Note that, due to its probabilistic nature there are chances
of spin ﬂip into unwanted state. This is not an issue, it rather mimics the natural
annealing process as discussed earlier.
The overall device-circuit conﬁguration for single spin model is shown in Fig. 2.6.
For read operation, reference resistor (RREF ) and the switch transistor are used in
series on the top of the nanomagnet. The resistance of RREF is in between two stable
resistance states of the MTJ device. Hence, if the MTJ resistance is smaller than
RREF (in parallel magnetization conﬁguration), the output of the inverter becomes
high, and vice versa.

Fig. 2.6. The proposed device-circuit conﬁguration for single Ising spin model.

The output of inverter i.e. the next state of the current spin is then stored in
following D-latch. This additional storage element de-couples the evaluation of the
next state from interacting with other neighboring cells. Consequently, every Ising
cell can be updated concurrently at each operation step regardless of size of the
system, thereby improving the scalability of the proposed architecture. The ‘next
state’ signal is used to choose the direction of the current ﬂowing during the write

19
operation and also be presented to neighboring cells for the majority vote operation.
For the majority vote and write operation, there are current sources with multiple
branches (each with stacked transistors, one for biasing and the other for switch
operation) and XNOR gates to combine information from the neighbors (product of
spin state and interconnection weights). Note, here we assumed that there are four
neighbors. The number of neighbors can be extended by adding more branches on
current source and controlling the amount of current from each branch. Thus, 1) the
XNOR gate (which implements the multiplication function), 2) the transistor switches
and the dependence of Psw on input current (which mimics the majority vote logic), 3)
the ‘next state’ logic which controls the direction of current ﬂow and 4) the inherent
stochasticity of switching (representing the annealing process) constitute our basic
Ising cell, that can be replicated to implement combinatorial optimization problems.

2.3

Problem Mapping and Results
In this section, we would ﬁrst elaborate how the basic Ising cell described in

the previous section, can be used to map combinatorial problems. For the Ising spin
model shown in Fig. 2.7(a), each spin state consists of an MTJ device with underlying
HM layer and associated interface circuits. The interconnection weights between
neighboring spins are a function of the speciﬁc problem to be solved. These weights
are programmed into the system before the Ising model tries to converge towards the
minimum energy state. Based on the initial spin states and associated weights, the
system evolves by updating the state of each spin through the coupling (annealing and
majority vote) described earlier in the manuscript. The time evolution of the Ising
system in our simulation framework was achieved by following the steps shown in
Fig. 2.7(b). After selecting a speciﬁc spin to be updated, the neighboring states and
the weights associated with each interconnection are examined. Then, the number
of voters for the potential next state (either +1 or -1) and corresponding charge
current for write operation is obtained through SPICE simulations. The resultant

20
current level is then fed to the LLG solver to analyze the magnetization switching
behavior of the HM based MTJ. After applying the current pulse for a duration of
3 ns to the HM layer followed by 6 ns of relaxation time, the ﬁnal direction of the
FL magnetization is examined which determines the next state of the spin under
consideration. This process is repeated until the state of the all the spins in the
system is updated. Then, the system computes the Hamiltonian of the next state to
check whether the system has found a solution.

Fig. 2.7. (a) Ising spin model based on the proposed hardware implementation for a single spin state. (b) The process of spin states
update.

The aforementioned generic methodology was the applied to two practical combinatorial optimization problems - Maximum-cut and Graph coloring. First, the
proposed Ising spin model is used to solve Maximum-cut problem. The problem can
be deﬁned as ﬁnding two mutually exclusive subsets of spins by connecting edges
(which connect spins from two separate subsets) so as to maximize the summation

21
of weights along the edges (i.e. boundary between two subsets) [32]. We have used
∼3,900 spins for this application and the interconnection weights are programmed
so that the spin states show the digit numbers from 0 to 4 without noise once the
system reaches the minimum possible energy state. Fig. 2.8 shows the results of the
maximum-cut problem and also transition of the system energy over iterations.

Fig. 2.8. Application to Maximum-cut problem (a) System energy
proﬁle (based on the deﬁnition of Hamiltonian (H) in Fig. 2.1) over
the number of iterations (spin state update). (b) Visualized spin
states at initial, intermediate, and ﬁnal stages.

As shown in Fig. 2.8(a), the system energy shows a rapid decrease up to ∼100
iterations followed by a slow decrease in energy for additional ∼300 iterations. The
initial rapid energy drop happens due to the changes in the spin states from the initial
random states to a set of two separate regions with some associated noise. This implies
that the system reaches local minima. Then the system seeks for the solution with
global minima based on the natural annealing. Note that, this global minima search
process could be expedited through the time-varying switching probability function
obtained by controlling the current ﬂowing through the HM underlayer. Here the
amount of current presented to each Ising cell was adjusted at 400th iteration so
that the system have less random ﬂipping, which leads to faster convergence. This
directly mimics the conventional SA process wherein lowering the system temperature
is represented by the decrease in the switching activity of the spin states. Fig. 2.8(b)

22
shows visualized spin states at the initial, intermediate and ﬁnal steps. Here the black
dot denotes spin state -1 and white dot represent spin state +1. Initially, spin states
start from random distribution of -1 and +1 (black and white dots). As the states
are updated through the coupling with neighboring states, target digit numbers are
visible with some noise (100th iteration). After 400 iterations, the system almost
ﬁnds the solution, but, still there exists nontrivial amount of noise mainly due to the
“natural annealing”. The clear output image shows that the optimum solution of the
current maximum-cut problem has been obtained.
Next, our proposed Ising spin model with HM based MTJ was applied to the
Graph coloring problem. Graph coloring is a famous NP-complete problem [33] and
is deﬁned as “Is it possible to color n-vertices with k-colors such that no two adjacent
vertices have the same color?” To implement speciﬁc hardware to solve the graph
coloring problem using our Ising spin model, a pre-processing step is needed to prepare
a weight matrix (check the addendum of chapter 2). This can be generated according
to the penalty Hamiltonian in [34]. This Hamiltonian basically deﬁnes rules to be
obeyed and applies an energy penalty whenever these rules are violated. For example,
one term of the Hamiltonian for this particular problem provides energy penalty once
the edge connects two vertices with the same color. Consequently, optimum solution
can be obtained when there is no energy penalty, hence the system evolves towards
minimum energy state. Interested readers are directed to Ref. [34] for more details
on the Hamiltonian for the Ising model. Once the weight matrix is prepared, it shows
interconnect map and also weights for each interconnections. We prepared 3 simple
Graph coloring problems and implemented corresponding hardware based on weight
matrix from penalty Hamiltonian. Fig. 2.9 shows the details of the problem from our
proposed Ising solver.
It is worth noting here that since the spin can only have one of the two states, a
total of n × k spins are needed to represent all possible states for the graph coloring
problem (with n-vertices and k-colors). In this case, each spin state is denoted as
vi,j , where i represents current vertex and j represents current color. For example,

23

Fig. 2.9. Application to Graph Coloring problem.

v1,1 can be interpreted as a spin representing vertex 1 and color 1. The simulations
were conducted 1,000 times for each problem to get an average number of iterations
to reach minimum energy state. The transition of the system energy is monitored by
checking the states of all the spins after each epoch to check if the system has found
a solution.
Lastly, let us brieﬂy discuss the energy consumption of the HM based MTJ device
used in the Ising spin model. The operation of a single Ising cell can be divided into
three parts - read operation, write operation and relaxation time. The time duration
for write, relax, and read cycle was taken to be 3 ns, 6 ns, and 1 ns, respectively.
The energy consumption during the write cycle is mainly due to the input charge
current ﬂowing through the HM layer. Assuming an average input current of ∼90
uA, the energy consumption during the write operation is evaluated to be ∼0.27 pJ
with VDD of 1V . Our simulations indicate that the dynamic energy consumption

24
per cell due to CMOS peripherals is small as compared to the energy dissipated in
the MTJ during the write operation. Likewise, the device-circuit simulation of the
read circuit including the associated CMOS transistors yielded an average energy
consumption of ∼0.04 pJ, when considering the average read current to be ∼38 uA
and VDD to be 1V. In contrast, the energy consumption involved in the relax mode
and CMOS switches resulted in insigniﬁcant contribution (∼0.01 pJ) to the total
energy consumption. Overall, the proposed single spin model based on our HM based
MTJ along with peripheral circuits consumes ∼0.32 pJ of energy per single spin
update operation. Even though the write operation consumes most of the energy
required (∼83 %), we believe with improvement in material parameters, write energy
can be signiﬁcantly reduced.

2.4

Conclusion
In summary, we have proposed SHE-MTJ based Ising cell to solve combinatorial

optimization problems. We demonstrate the mapping of annealing and majority vote
functions to the behavior of the spins in the nanomagnet. Although, the stochastic
switching nature of the MTJ due to the thermal noise is usually regarded as a problem
in typical memory applications [35], such random switching can be exploited to build
Ising spin model having the property of “natural annealing”. Also, the decoupled
write and read path through the HM underlayer enables the majority vote function with minimum number of devices. Using coupled magnetization dynamics and
SPICE simulations, we present solutions for two combinatorial optimization problems
- Maximum-cut and Graph coloring. We believe that the behavior of our HM based
MTJ device, mimicking the key elements of the Ising spin model, can potentially pave
the way for scalable, low-power, and simple Ising solver capable of handling complex
combinatorial optimization problems.

25
2.5

Addendum: From Ising Hamiltonian to Hardware implementation
This additional sub-chapter explains a way to map a combinatorial optimization

problem into the Ising spin model by interpreting a Hamiltonian (system energy)
expression in [34]. The Hamiltonian basically deﬁnes rules to be obeyed to solve
the problem, and applies an energy penalty whenever these rules are violated. In
addition, the rules from the Hamiltonian are used to generate an interconnection
matrix (weight matrix) which assigns coeﬃcients to the interconnections between a
pair of spin states. Then, the Ising network updates states of the spins based on
interactions with neighboring spin states where the interactions are aﬀected by the
interconnection strength (interconnection coeﬃcient) and the current states of the
neighboring spins.
Note that, the Hamiltonian has a diﬀerent form according to the type of problems.
However, the basic ﬂow of interpreting a Hamiltonian formula and getting a weight
matrix is quite similar regardless of the problem type. Hence, we choose one of the
NP-complete problems, Graph Coloring (GC), as an example to explain the basic
steps that have to be followed for a problem mapping. First, the Hamiltonian of the
GC problem (with ‘V’ vertices and ‘n’ colors) is deﬁned as:
H=A

X
V

1−

n
X
i=1

!2
xv,i

+A

n
X X

xu,i xv,i

(2.1)

(uv)∈E i=1

where ‘xv,i ’ represents a binary variable which becomes ‘1’ if vertex ‘v’ is colored with
color ‘i’, and becomes ‘0’ otherwise. By following this expression, each vertex could
be represented with ‘n’ spin states. For example, when the ﬁrst vertex ‘v1 ’ is colored
with color ‘i1 ’ in a 3-color GC problem (possible colors are ‘i1 ’, ‘i2 ’, and ‘i3 ’), this is
denoted as ‘x1,1 ’. Likewise, the ﬁrst vertex with color ‘i2 ’ and ‘i3 ’ are represented as
‘x1,2 ’ and ‘x1,3 ’, respectively. Therefore, the total number of spin states required to
represent a problem with ‘V ’ vertices and ‘n’ colors becomes ‘V × n’.
Fig. 2.10(a) shows an example problem with 4 vertices and 3 colors. To represent
this problem using the Ising model, the total number of spin states needed is 12 (4

26

Fig. 2.10. (a) The sample GC problem with 4 vertices and 3 colors
(Left) and the number of spin states to represent the given problem
(Right). (b) Example solution from the Ising solver (Left) and its
interpretation (Right)

× 3) as shown in Fig. 2.10(a). The example solution from the Ising model and its
interpretation are also depicted in Fig. 2.10(b). Note that, the white and black color
pixels represent spin state ‘1’ and ‘0’, respectively. In a ﬁrst row of the example
solution in Fig. 2.10(b), there are three spin states for the vertex ‘v1 ’ to represent
possible colors. Among them, the ﬁrst color (‘i1 ’) is only mapped with ‘1’ which
means the system assigns the color ‘i1 ’ to this vertex. If the ‘i1 ’ is assumed to be
‘Red’, then we can assign ‘Red’ color to the ﬁrst vertex. Similarly, we can complete
the graph coloring based on the given solution as illustrated in Fig. 2.10(b). Here,
the third and fourth vertices (‘v3 ’ and ‘v4 ’) are having the same color. Hence, this
example solution is not a correct answer for the GC. To prevent such a situation, the
formula (2.1) adds an energy penalty to the system so that the system avoids this
circumstance and goes towards a lower energy state.

27
Now, we would give a brief description of the Hamiltonian formula (2.1). The
given formula comprises of two terms. The ﬁrst term makes each vertex colored
with exactly a single color. If an intermediate solution during the problem solving
violates this rule, the ﬁrst term provides an energy penalty which leads to positive
Hamiltonian.

Fig. 2.11. (a) Interpretation of the 1st term of Hamiltonian for a
Graph Coloring. (b) Energy penalty example from the 1st term of
Hamiltonian with possible cases

The example situations regarding the ﬁrst term and the associated energy penalty
are shown in Fig. 2.11. The ‘Case 1’ of Fig. 2.11(a) shows a preferable situation for
the GC problem where the vertex ‘v1 ’ is colored with a single color ‘i1 ’. However,
the other cases violate the ﬁrst term as ‘v1 ’ has no color (Case 2) or colored with
multiple colors at the same time (Case 3, 4). Consequently, the Hamiltonian applies
an energy penalty to the system where the amount of energy added depends on how

28
many violations are happening in a given situation. To explain the relation of a
solution and an energy penalty, the ﬁrst term of eqn. (2.1) has been expanded in
terms of ‘v1 ’ in a 3-color GC problem.

X
1

1−

3
X

!2
x1,i

= (1 − (x1,1 + x1,2 + x1,3 ))2

(2.2)

i=1

Based on eqn. (2.2), once a single vertex is mapped with two colors at the same
time (x1,1 = x1,2 = ‘1’), the system applies energy penalty ‘1’. However, this energy
becomes ‘4’ when all three colors are mapped together (x1,1 = x1,2 = x1,3 = ‘1’).

Fig. 2.12. (a) Interpretation of the 2nd term of Hamiltonian for a
Graph Coloring. (b) Energy penalty example from the second term
of Hamiltonian with possible cases

The second term of the Hamiltonian adds a penalty once the connected two vertices (for instance, ‘v1’ and ‘v2’) have the same color as shown in Fig. 2.12(a). The
amount of energy penalty can be calculated by mapping current spin states to the
second term of the Hamiltonian. Here, the formula (2.3) shows a simpliﬁed expression
of the second term when considering two connected vertices only.

29

3
X X

A

xu,i xv,i = A

(uv)∈E i=1

X

(xu,1 xv,1 + xu,2 xv,2 + xu,3 xv,3 )

(uv)∈E

= A (x1,1 x2,1 + x1,2 x2,2 + x1,3 x2,3 )

when u = 1, v = 2

(2.3)

The corresponding energy penalty is depicted in Fig. 2.12(b) for multiple cases
by following eqn. (2.3).
Note that, both Hamiltonian terms contribute to an additional energy of the
system. Therefore, the minimum energy state of the Ising network for a given GC
problem is acquired when the Hamiltonian becomes zero. Then, the ﬁnal solution
from the system could be attained by examining the spin states at the lowest energy
level (‘0’ in this particular case) as we discussed in section 2.1.
The interconnection matrix that is used to build a hardware Ising spin model also
comes from the penalty Hamiltonian formula. By expanding and reorganizing the
original formula (2.1), the ﬁrst term of the Hamiltonian becomes the following.

H1st = A

X

1−

=

v=1

1−2

n
X

!2
xv,i

=

xv,i xv,i +

i=1

k
X

1−2

v=1

i=1

V

k
X

n
X

j
n X
X

n
X

xv,i +

i=1

n
X
i=1

xv,i

n
X

!
xv,j

j=1

!
xv,i xv,j

[using xv,i = 0 or 1 → x2v,i = 0 or 1]

i=1 j=1

=

k
X
v=1

1−

n
X
i=1

xv,i xv,i +

n
X

!
xv,i xv,j

(2.4)

i=
6 j

Among three terms in eqn. (2.4), the ﬁrst term (‘1’) adds a constant energy to the
system without impacting the interconnections between the spin states. The second
and third terms represent interconnections between possible combinations of two spin
states. The coeﬃcient (interconnection weight) for a self-loop (second term) is ‘-1’,
and the weight between spin states for the same vertex and diﬀerent colors (third
term) is ‘+1’. These coeﬃcients are used to ﬁll-up an interconnection matrix.

30

Fig. 2.13. (a) Interconnection (Weight) matrix example for GC problem with 4 vertices and 3 colors. (b) Implementation of a single Ising
spin model with SHE-STO and CMOS peripherals

Fig. 2.13(a) shows the interconnection matrix for the sample GC problem with
4 vertices and 3 colors. Here, the size of the matrix becomes 12 by 12 to represent
a fully connected network among the spin states. The elements of the matrix could
be ﬁlled by interpreting the Hamiltonian as explained. The blue box represents the
energy penalty from the ﬁrst term of the Hamiltonian given by eqn. (2.4). The impact
from the second term inside the yellow box can be calculated similarly. Note that,
each row directly shows a connection between a vertex and the other vertices. For
instance, the vertex ‘x1,1 ’ must be connected with 5 other neighboring vertices which
are ‘x1,1 ’, ‘x1,2 ’, ‘x1,3 ’, ‘x2,1 ’, and ‘x3,1 ’. The interconnection strength (‘Jij ’) becomes
the number at the cross point between two speciﬁc vertices (‘vi ’ and ‘vj ’).
Finally, the hardware implementation of the Ising spin model now becomes quite
straightforward as the interconnection matrix contains all the required information.
Fig. 2.13(b) illustrates an implementation of the ﬁrst vertex ‘x1,1 ’ based on the
information from the ﬁrst row of the interconnection matrix. The interactions with
the neighboring 5-vertices are handled with ‘Majority Vote’ unit, which eventually
controls the switching probability of the nano-magnet as explained in chapter 2.2.

31
The implementation of the other vertices could be done in a similar manner. The
rest of the process of problem-solving has been explained in section 2.3.

32

3. BIASED RANDOM-WALK USING STOCHASTIC
SWITCHING OF NANO-MAGNETS:
APPLICATION TO SAT SOLVER
Random Walk (RW) based local search algorithms are highly popular for solving
combinatorial optimization problems such as the Satisﬁability (SAT) problem. The
RW algorithm tries to solve the SAT problem by ﬂipping a randomly chosen variable
to minimize the number of unsatisﬁed clauses for a given problem. In this chapter, we
propose a Biased Random-Walk (BRW) based on stochastic magnetization switching
dynamics of nano-magnets in the presence of thermal noise. The controllable stochastic switching behavior of nano-magnets is used to ﬂip the current state of the variable
of interest based on the assigned probability. Here, the ﬂipping probabilities are assigned to the variables responsible for unsatisﬁed clauses (instead of deterministic ﬂipping in traditional algorithms). The stochasticity of individual units of the proposed
hardware SAT solver based on a Magnetic Tunnel Junction (MTJ) lying on top of a
Heavy-Metal (HM) layer enables parallel search of the solution space, which results
in rapid convergence in comparison to the baseline RW algorithm. A device-circuitalgorithm co-simulation framework (benchmarked to experimental measurements of
a magnetic stack) is used to assess the eﬃciency of the proposal. In comparison to
the baseline RW algorithm, stochastic magnetization switching driven BRW achieves
∼94 % reduction in search time while consuming ∼30 pJ energy per iteration.

3.1

Introduction
The recent advent of nano-scale devices has inspired the research community to

re-consider and re-think the implementation of various unconventional computing

33
applications, which have proved to be ineﬃcient based on conventional von Neumann
architecture in terms of power, speed, and silicon area. One such application that we
attempt to address in this paper is the hardware implementation for various types
of combinatorial optimization problem solver, such as Boolean Satisﬁability (SAT)
problem. The SAT problem is a famous NP-complete problem and its goal is to assign
proper logic states to Boolean variables such that the given Boolean expression in a
Conjunctive Normal Form (CNF), which is a conjunction (logic AND) of a number
of clauses, is evaluated to be true. The clause is a disjunction (logic OR) of multiple
literals, where the literal would be a Boolean variable or its negation.
We propose a hardware SAT solver based on the stochastic switching dynamics
of a Magnetic-Tunnel-Junction (MTJ) device with an underlying Heavy-Hetal (HM)
layer. The stochasticity of the device has been harnessed to select a variable to be
ﬂipped based on the number of related unsatisﬁed clauses that is essential to solve
the problem. Compared to the deterministic ﬂipping of the variable of interest in a
conventional SAT solving approach, here we assign a switching probability to each of
the variable (scaling with the number of unsatisﬁed clauses) such that the variables
related to any of the unsatisﬁed clauses have a chance to be ﬂipped probabilistically.
This causes the system to ﬂip more number of variables during the search process in
a greedy way, thereby leads to fast convergence toward the optimal solution.
Before we describe the proposed hardware SAT solver based on the device physics
of the nanomagnet under consideration, we provide a brief introduction of the SAT
solver in general. Due to its versatile application range from testing of electrical systems to bioinformatics [36], enormous eﬀort has been devoted to ﬁnd eﬃcient SAT
solvers. From the algorithm point of view, SAT solvers can be divided into two major
categories: complete and incomplete [37]. The complete algorithm can either ﬁnd
a solution or determine whether the given CNF formula is unsatisﬁable. Most of
the modern software SAT solvers are based on the complete algorithm and demonstrate extremely good performance due to development of eﬃcient heuristics such as
branching and learning. The most well-known algorithm in this area is the Conﬂict-

34
Driven-Clause-Learning (CDCL) algorithm [38, 39]. However, even with success in
the software based implementations of SAT solvers based on advanced algorithms,
its hardware implementation still remains ineﬃcient. It is mainly because most of
the complete algorithms require frequent memory access and complex controls such
as backtracking and decision making to determine the next set of variables to be
ﬂipped [40].
In contrast, incomplete algorithms do not guarantee ﬁnding a solution or equivalently do not ensure that the problem is satisﬁable. Nevertheless, it still outperforms
the complete algorithm over certain types of the problems such as random k-SAT
(single clause has exactly k-literals) problems [40, 41]. In addition, its simple control
without learning capability makes the incomplete algorithm hardware friendly. Most
of the incomplete algorithms are based on Stochastic Local Search (SLS) that tries to
satisfy the given CNF by ﬂipping a variable based on some eﬃcient heuristics. One of
the heuristics used in Greedy SAT (GSAT) algorithm [42] is based on a ‘score value’
that guides the solver to ﬂip the variable that potentially decreases the maximum
number of unsatisﬁed clauses. One of the drawbacks of this greedy search algorithm
is that the system easily gets stuck at the local minima during problem-solving. To
avoid this issue, a probabilistic random movement (noise) has been introduced in the
incomplete algorithms during the search process. In a Random-Walk (RW) step, ﬁrst
one of the unsatisﬁed clauses is selected at random. Then, a variable connected to
that clause is ﬂipped. On the other hand, Biased Random-Walk (BRW) uses various
heuristic techniques to choose the variable rather than a random selection. Apart
from the RW algorithm, many additional heuristics are also used to choose a variable
eﬃciently. Depending on the heuristics used, the incomplete algorithms would be
categorized into multiple groups such as WalkSAT [43] and Dynamic Local Search
(DLS) [44].
The aforementioned incomplete algorithm based on SLS seems suitable for hardware implementation as the controls required are much simpler than CDCL based
complete algorithm. However, there are still some functions that are expensive from

35
the viewpoint of hardware implementation such as random variable selection and
frequent memory access. This diﬃculty becomes worse as the incomplete algorithms
incorporate sophisticated heuristics on choosing a variable to be ﬂipped. For instance,
DLS algorithms [44] adopts a weight value for each of clause that changes along the
search process for a variable selection.
The motivation of our research arises from the fact that nano-scale devices can
potentially provide crucial characteristics required for hardware SAT solvers based on
its 1) inherent non-volatility and 2) stochastic switching property. The non-volatility
of the Magnetic-Tunnel-Junction (MTJ) and its compatibility with CMOS interfaces
lead to compact hardware solution by enabling eﬃcient “in-memory computing”. In
addition, stochastic switching of the device under thermal noise can be exploited as
a random noise source that introduces a randomness to the system. We propose a
compact hardware SAT solver based on the device-circuit co-design that performs
aforementioned properties. The performance of the proposed hardware SAT solver
was evaluated on practical 3-SAT problems with various sizes through a device-circuitalgorithm co-simulation framework.

3.2

Algorithm and Hardware implementation
In this section, we ﬁrst introduce the baseline unbiased RW algorithm along with

its possible implementation. Note that here the hardware implementation of baseline
RW algorithm is introduced to show the basic components required and corresponding
peripherals along with its connectivity to build a hardware SAT solver. Thereafter,
we will discuss our proposed BRW algorithm and the necessary modiﬁcations from
its basic hardware implementation.

Unbiased RW
The algorithm of the unbiased RW starts from a random assignment of states to
all variables. When the assigned values are fed into the clauses, the system examines

36
how many clauses have been satisﬁed based on the current variable values. If there
are unsatisﬁed clauses, the system randomly ﬂips a variable by choosing one of the
unsatisﬁed clauses and one of the variables connected to that clause. These set of
operations are repeated until the system ﬁnds a solution or until the pre-deﬁned
maximum number of iterations are reached.
To build a hardware based on the abovementioned unbiased RW algorithm, one
can divide the system into two major parts, Main and Control, as shown in Fig.
3.1(a).

Fig. 3.1. (a) Hardware implementation of 3-SAT solver with conventional RW algorithm, (b) Details of Main unit.

The Main block consists of three essential components, Variables, Clauses, and
Accuracy Detector. The variable is implemented by a memory cell since the variable
has to store the assigned value and be able to ﬂip its state in the process of problem
solving. For the clauses and accuracy detector, simple combinational logic can be
used if the outputs from the variables have logic ‘0’ and ‘1’ states. Here the accuracy
detector counts the number of satisﬁed clauses and compares to the total number of
clauses to determine whether the given CNF formula has been solved. The detailed
view of the main block is illustrated in Fig. 3.1(b). Note that, we have assumed a
particular case of the SAT problem where every clause has exactly three literals (3SAT problem). In this case, each of the clauses can be implemented through a single

37
3-input OR gate that leads to a compact hardware design. The hardware SAT solver
for the problem with a diﬀerent number of literals could be implemented without
diﬃculty by adopting a multi-input OR gate and ﬁnding optimal parameters that
will be explained in the later section.
The control portion selects a variable to be ﬂipped from one of the unsatisﬁed
clauses. Note that, in the process of choosing a clause and corresponding variable, a
RW algorithm uses ‘random selection’ that is expensive in terms of area and power
based on conventional CMOS logic implementation. Even though there are multiple
ways of generating a random number using conventional CMOS technologies [45, 46],
yet it requires considerable hardware resources.

Proposed Biased RW
From the viewpoint of hardware friendly design, small changes in the control part
can potentially simplify the system eﬃciently. The main idea behind our proposal
lies in the method of choosing the next variable to be ﬂipped.
Rather than choosing a variable to be ﬂipped randomly, a switching probability
(PSW ) is assigned to each variable in our proposed BRW such that every variable
can have a certain probability of ﬂipping its current state. The switching probability
is kept proportional to the number of unsatisﬁed clauses the variable of interest is
related to. A similar approach, using a probabilistic distribution function to select
the next variable to ﬂip based on the break value (the number of clauses unsatisﬁed
by ﬂipping the speciﬁc variable), has been proposed and used for an incomplete software SAT solver, probSAT [47]. Note that, the probSAT uses normalized probability
distribution function (either exponential or polynomial function) to detect a variable
that will incur maximum decrease in unsatisﬁed clauses and shows huge performance
gain in comparison to the WalkSAT algorithm [47]. In contrast, our proposed BRW
assigns separate switching probability to each of the variables based on the unsatisﬁed clause information. This directly mimics the gist of the greedy search where the

38
variable that can potentially reduce the maximum number of unsatisﬁed clauses (connected to more unsatisﬁed clauses) has a higher chance of state ﬂipping. In addition,
several variables can be ﬂipped at every iteration in this way, thereby expediting the
search speed.
The block diagram for the hardware implementation of the proposed BRW is
shown in Fig 3.2.

Fig. 3.2. (a) Proposed 3-SAT solver with BRW, (b) Details of the
hardware implementation of the proposed SAT solver.

As can be observed from Fig. 3.2(a), the control part can be now split up into
multiple identical blocks – unsatisﬁed clause counter (UnSAT Counter) and PSW mapping block (PSW mapping). When the variables are assigned certain logical values
at a particular iteration, the outputs of the clauses depict whether they are satisﬁed
or not. Subsequently the UnSAT counter counts the number of unsatisﬁed clauses
related to each variable. Based on the output from the UnSAT counter, the PSW
mapping block assigns a pre-deﬁned probability of switching to the designated variable. Once PSW mapping is over, the variable states are updated depending on the
corresponding magnitude of the switching probability. Fig. 3.2(b) shows the detailed
implementation of the control block. Since the outputs from the clauses have logic
value ‘0’ or ‘1’, the role of the UnSAT counter is to count the number of zeros among
its input connections. Therefore, we named the sub-block of UnSAT counter as Zero

39
Counter (ZCn ). An internal block of PSW mapping unit (DECn ) receives the digital
code from ZC and decodes it into the corresponding switching probability value for
the variable.
It is worth mentioning here that the additional blocks required to implement the
control part of the proposed SAT solver can be easily combined with the variable
to construct a single module. Moreover, considering the practical implementation
of the zero counter and decoder – zero counter is a single-bit multi-input adder and
decoder is simple combinational logic. Therefore, the entire system is compact from
hardware implementation viewpoint. The only remaining point worth considering is
the hardware implementation of the variable (and its probabilistic update) which will
be introduced in the succeeding section.

3.3

Nano-magnet as the Main Computational Element
In this section, we ﬁrst describe the required characteristics of each variable used in

the proposed BRW algorithm from hardware implementation perspective. Thereafter,
implementation of the proposed variable based on stochastic switching property of
nano-magnets will be introduced along with the modeling framework for the device.

Stochastic MTJ as a SAT variable
The variable in a SAT solver stores the logic state assigned to it and should
be able to ﬂip its state once the system determines that the number of unsatisﬁed
clauses could be decreased when ﬂipping the state of the variable of interest. A
memory element such as SRAM or DRAM is conventionally used to implement the
variable [48,49]. To access the memory cell at a speciﬁc location, basic read/write operation based on the address is essential which might incur hardware, interconnection,
and communication overhead. However, these conventional memory elements do not
ﬁt well with the proposed BRW algorithm even without considering the diﬃculties
mentioned above. It is because the unit cell of such memories can only be ﬂipped

40
deterministically. Hence, alternative hardware elements need to be explored for the
eﬃcient implementation of the BRW algorithm. From this perspective, the stochastic
switching characteristics of a nano-scale magnetic device can be exploited.
As discussed in the previous section, recent research about the Magnetic-TunnelJunction (MTJ) on top of a heavy metal (HM) layer such as Pt or Ta has revealed
that the switching of the resistance state of the MTJ can be naturally stochastic in the
presence of thermal noise [50]. In addition, the frequency of switching events varies in
accordance to the amount of input current through the HM underlayer. The dynamic
modeling of the stochastic MTJ device and its switching behavior to the external
stimulus are well deﬁned in [51]. The basic device structure used to represent each
variable of the proposed SAT solver is illustrated in Fig. 3.3(a).

Fig. 3.3. (a) 3-terminal spin-Hall-Eﬀect MTJ (SHE-MTJ) device, (b)
Device-Circuit conﬁguration for the core hardware primitive in our
design.

The device consists of a conventional three-layered MTJ stack on top of the HM.
The MTJ has two ferromagnetic layers, Pinned Layer (PL) and Free Layer (FL),
which are separated by a Tunneling Barrier (TB) between them. The device can
have two stable resistance states (RP and RAP ) depending on the relative direction of
the magnetization of two ferromagnetic layers. The direction of FL magnetization can
be changed by injecting spin current in the FL of the device. This spin current can
be induced by passing a charge current through the HM layer by exploiting spin-Hall-

41
eﬀect (SHE) induced spin injection at the FL-HM interface [11,24–27]. If “suﬃcient”
charge current ﬂows through the HM layer from port T1 to T2, the magnetization of
the FL becomes parallel to that of the PL and vice versa.
This nano-magnetic stack can oﬀer the following desirable characteristics such as:
1) It can serve as a memory element due to inherent non-volatility of the MTJ device
and two possible stable resistive states, 2) It can provide decoupled read/write current
paths (shown in Fig. 3.3(a)) that oﬀers huge design ﬂexibility for the peripheral
circuits, 3) The eﬃciency of spin current generation from input charge current ﬂow
through the HM layer can be larger than 100 % [20] that enables low energy operation.
Note that, even though a charge current ﬂowing through the HM layer can ﬂip
the magnetization of the FL, it does not imply that the switching event is always
deterministic. In fact, the magnetization switching under thermal noise at non-zero
temperature exhibits stochastic behavior. This implies that the magnet state can be
ﬂipped with a certain probability and this probability is controlled by the amount
of input charge current. This interesting stochastic property can be utilized in our
system with simple CMOS peripherals as shown in Fig. 3.3(b). For the write operation, a variable current source along with a series switch has been used to provide
a current through the HM. When the input write (WR) command pulse enables the
switch, charge current will ﬂow through the HM underlayer and ﬂip the magnetization with a certain probability. Based on the proposed BRW algorithm, the amount
of input charge current is controlled by the number of unsatisﬁed clauses the variable
of interest is connected to. To read the magnetization state, a switch and reference
resistor (RREF ) are used in series with the MTJ device. When read (RD) command
turns on the switch, a current ﬂows through the series of RREF , RM T J (either RP or
RAP ), and portion of the HM. Here, RREF has a resistance value in between RP and
RAP so that the voltage at VM T J node during the read (RD) cycle can have diﬀerent values (VL /VH ) depending on the current resistance of the MTJ. When VM T J is
ampliﬁed through the inverter and becomes a rail-to-rail swing, the output voltage is
transmitted to the rest of the system for further processing.

42
Modeling of magnetization dynamics
The magnetization dynamics of the FL under an external excitation like an input
spin current can be modeled using Landau-Lifshitz-Gilbert (LLG) equation with an
additional spin-transfer-torque (STT) term by Slonczewski [21] using single domain
approximation,
dm
b
dm
b
1
b×
)+
(m
b × Is × m
b)
= −γ(m
b × Hef f ) + α(m
dt
dt
qNs
where m̂ is the unit vector of FL magnetization, γ = 2µB µ0 /h̄ is the gyromagnetic
ratio for electron, α is Gilbert’s damping ratio, Hef f is the eﬀective magnetic ﬁeld
including the shape anisotropy ﬁeld [22], Ns = Ms V /µB is the number of spins in the
FL volume V (Ms is saturation magnetization and µB is Bohr magneton), Is is the
input spin current. The spin current generated from the charge current ﬂow through
MT J
the HM layer is modeled as Is = θSH WtHM
IQ , where IQ is the charge current ﬂowing

through the HM, θSH is the spin-Hall angle [25], WM T J and tHM are device dimension
parameters, shown in Fig. 3.3(a). The noise model from the thermal noise is modeled
q
2KB T
α
as a random thermal ﬁeld [29] (included in Hef f ), Hthermal =
G ,
1+α2 γµ0 Ms V δt 0,1
where G0,1 is a Gaussian distribution with zero mean and unit standard deviation, KB
is the Boltzmann constant, T is the temperature and δt is the simulation time step [52].
Based on above-mentioned analytic formulas, we have constructed a behavioral model
of the MTJ using Verilog-A, and the model was used to capture the magnetization
dynamics based on the parameters (benchmarked to experimental data) shown in
Table 3.1. The remaining CMOS blocks are implemented using 90nm CMOS process
technology and HSPICE simulations are performed to validate the functionality of
the interface circuits along with the device models we generated using Verilog-A.

Switching probability versus input current
To evaluate the switching behavior of the MTJ based on the amount of input
charge current and also to check the impact of thermal noise, we applied a current

43

Table 3.1.
Device simulation parameters of the SHE-MTJ for 3-SAT solver
Parameters

Value

Free layer area

(π/4) × 100 × 40 nm2

Free layer thickness

1.2 nm

Heavy-metal thickness, tHM

2 nm

Saturation magnetization Ms

1000 emu/cm3 [25]

spin-Hall Angle, θSH

0.3 [25]

Gilbert damping factor α

0.0122 [25]

Energy barrier EB

20 KB T

MgO Thickness, tM gO

1.34 nm

MTJ resistance, RP (RAP )

8.56 (18.31) KΩ

Resistivity of HM, ρHM

200 µΩ - cm [25]

Pulse width tP W

1 ns

Temperature TK

300 K

Supply Voltage, VDD

1.2 V

pulse with varying height through the HM layer with and without thermal noise.
After applying an 1 ns current pulse with diﬀerent amplitudes, the ﬁnal state of
FL magnetization was sensed after a 5 ns relaxation time. Fig. 3.4(a) depicts the
switching probability (PSW ) for diﬀerent values of the input charge current (Iq ) ﬂowing
through the HM layer.
The amplitude of the input current is varied from 0 uA to 100 uA with 2.5 uA
step, and 1,000 independent stochastic LLG simulations have been performed for
each simulation step to determine the switching probability. As shown in 3.4(a),
the PSW curve with thermal noise shows a smooth transition from 0 % to 100 %
unlike sharp transition in the case of absence of thermal noise. This directly shows
that the thermal noise actually makes the device stochastic. Fig. 3.4(b) shows two

44

Fig. 3.4. (a) Switching probability (PSW ) versus input charge current
through the HM layer (Iq ) with and without thermal noise, (b) Transient simulation results showing probabilistic switching of the device
with the same input stimulus that will incur 50 % chance of magnetization ﬂipping (mx ) in the presence of thermal noise.

independent transient simulations with thermal noise. Here the magnetization of the
FL (mx ) is -1 and is being ﬂipped probabilistically to +1 by applying a charge current
pulse that leads to 50 % switching probability.

3.4

MTJ based SAT solver
In this section, we describe the device-circuit conﬁguration of the MTJ based SAT

solver that can be used to implement the variable and associated probabilistic update
operation of the proposed BRW algorithm.

Overall hardware conﬁguration
Fig. 3.5(a) shows a portion of the SAT solver along with practical implementation
of the variable based on SHE-MTJ with CMOS peripherals. For explanation, the
decoder has been depicted in the left-most portion of the ﬁgure.

45

Fig. 3.5. (a) Overall hardware conﬁguration of the proposed 3-SAT
solver based on BRW algorithm, (b) Timing diagram of the main
command signals (RST, WR, RD).

The operation of the proposed SAT solver is controlled by three main commands:
Reset (RST), Write (WR), and Read (RD). The basic operation during the read and
write cycles are the same as explained in the previous section. The output inverter
drives a latch and forms a Clocked-Latch (C-LAT). This C-LAT unit ampliﬁes the
small voltage variation from VM T J node and generates a rail-to-rail voltage (VLAT )
based on the RD command. Since the VLAT node indicates the current state of the
variable, the remaining SAT solver performs computation based on this value. In
addition, the level of VLAT determines the direction of charge current ﬂow through
the bottom HM layer by controlling the switches at both sides of the MTJ device. To
provide a proper magnitude of current, multiple current sources have been used with
corresponding switches. These switches are controlled by the decoder unit and enable
pulses (WR and RST). The decoder unit determines a proper amplitude of current
pulse by receiving the input code from the zero counter and selects the corresponding
number of current sources. On the other hand, the enable pulse controls the width of
the current pulse in sync with the WR and RST commands. After the series of WR
and RD operations, RST signal is used to reset the device for the next operation.

46
The timing diagram used to simulate the entire SAT solver is shown in Fig. 3.5(b).
Each of the three commands has a 1 ns duration. Delay of 1 ns was inserted between
each pair of two commands (RST and WR, RD and RST). Especially, a 5 ns of
relaxation time has been added after the write command to relax and stabilize the
magnetization direction before the read operation. In this scenario, our proposed SAT
solver spends 10 ns of time for a single iteration, i.e., to update particular variables
in the system based on the previous computation result.

Determination of optimum PSW
Let us now describe an important parameter that aﬀects the performance of our
proposal. Determination of the variables to be ﬂipped in the proposed BRW algorithm
is achieved by assigning a switching probability to each of the variables based on the
number of related unsatisﬁed clauses. To ﬁnd an optimum PSW , we deﬁne probability
step (PSW,ST EP ) that represents the increment in switching probability assigned to
a particular variable as the number of related unsatisﬁed clauses increases. Two
diﬀerent approaches are used to assign PSW to the variable based on PSW,ST EP which
are: 1) “linear PSW,ST EP ” and 2) “non-linear PSW,ST EP ”.

Fig. 3.6. Performance variation depends on (a) diﬀerent values of
PSW,ST EP 1 for linear PSW step, (b) diﬀerent values of PSW,IN IT and
PSW,ST EP 2 for non-linear PSW step.

47
The linear PSW mapping literally divides the total range of probability into ﬁxed
number of steps. For instance, if there are ﬁve steps, each unsatisﬁed clause can
contribute 20 % of switching probability (PSW,ST EP 1 ) to the particular variable. In
contrast, non-linear PSW,ST EP uses two parameters to assign a probability, namely,
PSW,IN IT and PSW,ST EP 2 . Once the variable is related to a single unsatisﬁed clause,
the corresponding PSW is set to PSW,IN IT . If the number of unsatisﬁed clauses become
more than two, the PSW is increased by PSW,ST EP 2 from PSW,IN IT as the number of
clauses are increased.
To ﬁnd an optimum probability step, we checked the average search time over
1,000 simulations based on the proposed system using multiple problems with a size
of 1) 50 variables / 218 clauses and 2) 75 variables / 325 clauses [53]. The results of
the problem with 50 variables are shown in Fig. 3.6(a) and (b). As can be observed
from Fig. 3.6(a), linear probability step of 10 % leads to best performance on the
particular problem set. In the case of non-linear PSW step, once we choose 5 ∼ 10
% of PSW,IN IT , the search time in Fig. 3.6(b) shows a little variation around the
minimum search time as the PSW,ST EP 2 becomes larger than 50 %. Similarly, the
optimum probability step for the larger problem size (75 variables) shows a similar
tendency. This implies that the size of the problem has less impact on the optimum
probability step and also the performance is less susceptible to the variation in PSW .
Therefore, these values were chosen to determine the performance of the proposed
system.

3.5

Performance Results

Comparison to the Unbiased RW
Based on the aforementioned settings, performance comparison was done for three
diﬀerent approaches: unbiased RW, BRW with linear PSW step, and BRW with nonlinear PSW step. We measured the average search time for 10 diﬀerent cases from two
diﬀerent sized problems (50 variables / 218 clauses and 75 variables / 325 clauses)

48
over 1,000 simulations. Fig. 3.7(a), (b) shows the performance comparison in terms
of normalized searching time.

Fig. 3.7. Performance comparison in terms of normalized solution
search time between three diﬀerent approaches of the 3-SAT solver.
The sample problems are from SATLIB [53].

As shown in the ﬁgure, BRW with non-linear step achieves roughly ∼94 % (50
variables) and ∼98 % (75 variables) of speedup as compared to the baseline RW
algorithm. Fig. 3.8(a) shows intermediate states of some of the variables (x1 to x10
out of 50 variables) during the process of problem solving (uf50-07). The accuracy
based on the states of the variables at a particular instant is shown in the top row.
As can be seen from the graph, multiple variables are being ﬂipped at every
iteration, which can expedite the search speed. Note that, just increasing the number
of variables to be ﬂipped might lead the system to an unwanted state (local minima).
However, in our proposed system, stochastic switching behavior occurs in a logical
way such that the variable related to a large number of unsatisﬁed clauses will have a
higher chance of switching its current state. In addition, even when the system falls
into the local minima, the variables connected to the unsatisﬁed clauses always have
a certain probability of switching that causes the system to escape the unwanted local
minima state. To determine the average number of switching events per iteration, we

49

Fig. 3.8. (a) State transition of the variables during the problem
solving, (b) Average number of variable ﬂips during the process of
problem solving.

monitored the number of variable switching per iteration over the 30 simulations (with
uf50-07 problem set), which is illustrated in Fig. 3.8(b). Initially, the system ﬂips
more than 10 variables per iteration. The number gradually decreases and reaches
around 3∼4 ﬂips per iteration. The average number of ﬂips up to 100 iterations is
∼ 6.25. Considering the time required to complete a single iteration based on our
proposed hardware to be 10 ns, the proposed system can ﬂip variables more than a
couple of hundred million times per second (fps).
Table 3.2.
Performance comparison between Conventional RW and the proposed BRW
Problem

Avg. ﬂips

Avg. ﬂips Speedup

fps

Conv.RW

BRW

uf50-03

2,786

127

∼95 %

386 M

uf50-07

13,538

454

∼96 %

484 M

uf50-04

46,223

817

∼98 %

371 M

uf50-06

96,323

8,933

∼91 %

304 M

50
Table 2 shows the performance comparison between unbiased RW and the proposed BRW algorithm with non-linear PSW step for the selected problem set. For
all types of problems, our proposed SAT solver shows more than 90 % search time
improvement. The number of ﬂips per second (shown in the right-most column) is
estimated by multiplying the time required for a single iteration and the average
number of variables being ﬂipped per iteration.

Comparison to the software SAT solvers

Fig. 3.9. Performance comparison between the proposed hardware
SAT solver using BRW algorithm, software WalkSAT solver [54] and
probSAT [55] solver for problems with a size of (a) 350 variables /
1,491 clauses and (b) 450 variables / 1,917 variables. The sample
problems are from SAT competitions 2011 [56].

To validate the scalability of the system, we check the performance of the proposed hardware SAT solvers based on the problems from SAT competitions 2011 [56].
Two types of the problems are used with a size of 1) 350 variables / 1,491 clauses
and 2) 450 variables / 1,917 variables. To compare the performance gain over the
existing software SAT solvers, we also check the performance of the WalkSAT [54]

51
and probSAT [55] for the same problem set. To get an average performance, each
problem has been solved using the proposed hardware solver and software solvers for
100 times. Note that, the optimum PSW steps assigned to the variables were chosen
based on the results from the previous section (non-linear PSW step with 5 ∼ 10 % of
PSW,IN IT and ≥ 50 % of PSW,ST EP 2 ). The maximum iteration has been set to 100,000
for all solvers. Fig. 3.9 shows results in terms of 1) the number of successful runs over
the 100 iterations, 2) the number of average iterations for ﬁnding a solution, and 3)
Runtime in milliseconds. For the Runtime measurement of the proposed SAT solver,
we used the estimated time by multiplying the number of average iterations and the
time required per each iteration (10 ns). For the software solvers, we used CPU
runtime reported after the simulation. The results of the software solvers reported
here were executed on an Intel(R) Xeon(R) CPU E5-2600 (40 CPUs) at 2.6GHz with
64GB RAM, running Red Hat Linux (Version 4.4.5-18). As shown in the ﬁrst and the
second column data, the average number of iterations used to ﬁnd an answer and also
the number of successful tries are comparable among the solvers. However, practical
runtime from the hardware SAT solver (estimated) outperforms the software SAT
solver by a factor of 1) 23x ∼ 102x in comparison to the WalkSAT and, 2) 21x ∼ 33x
as compare to the probSAT depending on the diﬃculty of the problem.

Energy consumption
Let us conclude this section by brieﬂy mentioning the energy consumption of
the proposed SAT solver. Based on the proposed system for the problem with 50
variables and 218 clauses, the average current consumption is ∼2.5 mA based on
90 nm CMOS technology which leads to ∼30 pJ/iteration of energy consumption.
The energy break down results show that the MTJ with peripherals consume 67 %
(∼20.1 pJ/iteration) and CMOS logic consumes 33 % (∼9.9 pJ/iteration) of the
total energy consumption. The energy consumed by each variable is estimated to be
0.5 pJ/iteration (average current ∼40 uA for 10 ns at 1.2 V VDD supply) without

52
considering simple peripheral logic units such as accuracy detector. Even though the
MTJ still consumes two-thirds of the total energy consumption, we believe that with
an advancement in device technology, this number could be decreased further which
would result in more energy-eﬃcient and compact SAT solvers.

3.6

Conclusion
In this chapter, we proposed a compact hardware design for the 3-SAT solver

based on BRW algorithm by exploiting a controllable stochastically switching nanomagnet in the presence of thermal noise. Compared to the conventional unbiased RW
that ﬂips a randomly chosen variable from an unsatisﬁed clause, the BRW algorithm
assigns a particular switching probability to each magnet (representing a particular
variable) according to the number of related unsatisﬁed clauses. This enables parallel search of the solution space, thereby leading to fast convergence times. Also, the
device-circuit conﬁguration based on MTJ and CMOS peripherals can provide a compact and energy-eﬃcient core hardware computing primitive for the proposed SAT
solver. The performance of the system was evaluated using a device-circuit-algorithm
co-simulation framework in 90 nm CMOS technology. Based on the problem-set with
50 variable and 218 clauses, our proposed SAT solver shows ∼94 % improvement in
search time as compared to unbiased RW while consuming ∼0.5 pJ per variable per
iteration.

53

4. BAYESIAN INFERENCE ENABLED BY STOCHASTIC
SPIN-ORBIT TORQUE DEVICES
Probabilistic inference from real-time input data is becoming increasingly popular and
may be one of the potential pathways at enabling cognitive intelligence. As a matter
of fact, preliminary research has revealed that stochastic functionalities also underlie
the spiking behavior of neurons in cortical microcircuits of the human brain. In tune
with such observations, neuromorphic and other unconventional computing platforms
have recently started adopting the usage of computational units that generate outputs
probabilistically, depending on the magnitude of the input stimulus. In this chapter,
we experimentally demonstrate a spintronic device that oﬀers a direct mapping to
the functionality of such a controllable stochastic switching element. We show that
the probabilistic switching of Ta/CoFeB/MgO heterostructures in presence of spinorbit torque and thermal noise can be harnessed to enable probabilistic inference in a
plethora of unconventional computing scenarios. This work can potentially pave the
way for hardware that directly mimics the computational units of Bayesian inference.

4.1

Introduction
Spin-orbit torque generated by an underlying heavy metal has recently emerged

as an energy-eﬃcient mechanism for magnetization reversal [57–59] and domain wall
motion [60–63]. Nanomagnet switching due to input charge current ﬂowing through
the heavy-metal (HM) underlayer is mainly attributed to spin-Hall eﬀect (SHE) [64],
wherein, a transverse spin current is injected in the nanomagnet lying on top. While
magnets with in-plane anisotropy can be switched directly by spin-orbit torque (SOT),
perpendicular magnets require an external magnetic ﬁeld for deterministic switching.
Recent proposals have also explored deterministic switching in perpendicular magnets

54
without the assistance of any external ﬁeld [65–67]. While such spin-orbit torque
induced magnetization switching has been extensively studied in the deterministic
regime, it is intrinsically probabilistic due to the inherent time-varying thermal noise
involved in the magnetization dynamics.
This work ﬁrst attempts to experimentally validate prior theoretical proposals
for utilizing spin-orbit torque switching nanomagnets as biased random number generators where the bias can be tuned using the magnitude of the input stimulus by
operating the magnets in the stochastic regime [12, 31, 68, 69].
Based on the experiment, we extend the concept to a three-terminal device structure that can be easily interfaced with CMOS peripherals for enabling diﬀerent genres
of unconventional computing scenarios. The device-circuit conﬁguration based on the
stochastic spin-device and CMOS interface circuits forms a core hardware primitive
and used for versatile applications ranging from neuromorphic [31,68,69] to combinatorial optimization [12]. As a second contribution, we propose a Bayesian inference
engine by exploiting the controllable stochastic switching of the nanomagnets. Each
nanomagnet along with peripheral CMOS circuits form a key element, a variable, of
the Bayesian Network (BN) and generates Poisson spike pulse train. The probabilistic information is transferred through the interconnected variables by following pulse
based arithmetic [70]. The eﬃciency of such spintronic-enabled probabilistic Bayesian
networks stems from the direct mapping of the key stochastic computing element to
the underlying stochastic device physics of the spin devices.

4.2

Device fabrication and spin-orbit torque (SOT) driven stochastic
switching
The probabilistic switching was characterized in a 1.2µm wide Ta (10 nm)/CoFeB(1.3

nm)/MgO (1.5nm) Hall-cross structure in presence of a 100 Oe in-plane external magnetic ﬁeld. The in-plane magnetic ﬁeld is required to achieve switching of the PMA

55
(perpendicular magnetic anisotropy) free layer in presence of in-plane polarized spins
generated by current ﬂowing through the heavy metal underlayer.

Fig. 4.1. (a) Nanoscale Hall-bar structure consisting of Ta (10
nm)/CoFeB (1.3 nm)/MgO (1.5 nm)/Ta (5 nm) (from bottom to
top) material stack. (b) Variation of the magnet switching probability with variation in magnitude (amplitude) of the input current pulse.
(c) Variation of the magnet switching probability characteristics with
variation in the duration of the input current pulse.

Fig. 4.1(a) depicts the Hall-bar structure. Input current ﬂows between the terminals I+ and I− while the magnetization state is determined by the anomalous Hall
eﬀect resistance detected between terminals V + and V −. Initially the magnet is reset
by passing a suﬃcient magnitude of current between I+ and I− terminals in the negative x direction. Subsequently, a current is passed in the positive x direction and the
ﬁnal state of the magnet is determined. The magnitude of the current is varied and
over 50 measurements are taken per current magnitude to determine the probability
of switching of the magnetic stack. The experiment was repeated over multiple de-

56
vices and consistent probability switching characteristics were obtained. Fig. 4.1(b)
represents the variation of switching probability of the magnet with variation in the
magnitude of the input current pulse (with the pulse width being ﬁxed at 10 ms) while
(c) depicts that the switching probability can be tuned by varying the pulse width
as well (with pulse amplitude being ﬁxed at 0.6mA). Prior proposals have exploited
such non-linear switching probability characteristics for neuromorphic [31, 68, 69] as
well as Ising computing [12]. Additionally pulse width duration serves to control the
rate of change of switching probability with variation in the input current magnitude
which, in turn, impacts the performance of such computing systems in presence of
noise and process variations in the spin-devices and CMOS peripherals [31].

Extension to three-terminal device and possible applications
The experiment in the previous section provides proof-of-concept for a large number of theoretical proposals that have exploited probabilistic SOT-driven magnetization dynamics in a three-terminal device structure shown in Fig. 4.2 (Left top). The
three terminal device structure, shown in Fig. 4.2, can be designed by fabricating a
Magnetic Tunnel Junction (MTJ) stack on top of the ferromagnet-heavy metal layers.
Note that, the conventional MTJ device consists of two ferromagnetic layers, which
are separated by a Tunneling Barrier (TB) between them. The top layer is called the
Pinned Layer (PL) and the bottom layer is called the Free Layer (FL). The magnetization of the FL can be manipulated by an input spin current injected from the
heavy metal underlayer while the magnetization of the PL is ﬁxed in a particular
direction. When the magnetization of the two ferromagentic layers is located in the
same direction, the resistance across the tunneling junction has a smaller resistance
(RP ) compared to the opposite case (RAP ). While the write current ﬂows between terminals T1 and T2 and probabilistically switches the magnet (probability determined
by current magnitude), the read path between terminals T3 and T2 determines the
ﬁnal state of the magnet after the switching process.

57

Fig. 4.2. A three terminal device structure consists of Magnetic Tunnel Junction (MTJ) on top of the Heavy Metal (HM) underlayer.(Left
top) The stochastic switching of the device in the presence of thermal
noise can be exploited to various applications such as neuromorphic
computing (Bottom), Ising spin model (Right top), and Bayesian Network (Right middle).

58
Note that the only diﬀerence between the fabricated samples and the device structure (shown in Fig. 4.2) is the read-out operation. While the magnet switching dynamics is similar to the fabricated samples, through the injection of spin current, the
read-out mechanism is through a MTJ structure lying on top of the heavy metal.
The tunnel junction exhibits a much larger resistance variation (typically 2-3 times)
corresponding to the two stable magnetization states that, in turn, leads to compatibility with peripheral CMOS technology. The subsequent text and applications
discussed in this article are based on the device measurements shown in Fig. 4.1.
Consequently, all conclusions presented in this article are based on experimentally
measured stochastic switching characteristics of the magnet. However, for performing the CMOS circuit-level simulations we consider that the device resistance that can
be sensed is similar to values obtained from standard MTJ stacks compatible with
CMOS technology. Such devices can be scaled down to dimensions exhibiting barrier
height of the order of ∼ 10 − 20kB T . Further scaling can potentially result in device
state update during “read operation as the device becomes increasingly sensitive to
the input bias current.
The controllable stochastic switching element can be potentially used in various
applications as shown in Fig. 4.2. For instance, neuromorphic applications inspired by
brain functionalities consist of a set of pre-neurons transmitting information to a set
of post-neurons through synapses. Such a computing framework can be mapped to a
crossbar array of stochastic switching elements (serving as synapses) driving stochastic post-neuronal devices. The device can be interfaced with peripheral transistors
to implement stochastic Spike-Timing Dependent Potentiation (P) and Depression
(D) learning rules which dictate the variation of synaptic switching probability with
spike timing ΔT . Similarly, stochastic neural functionality can be implemented by
interfacing the neuronal device with a Reference MTJ. For details, please refer to
Refs. [31, 68, 69]. Stochastic switching property of the device can also provide a natural annealing property in Ising computing systems for solving optimization tasks by
assisting the system to move out of a local minima [12]. Here we explore Bayesian

59

Fig. 4.3. Two independent simulations of stochastic Landau-LifshitzGilbert (LLG) equation with thermal noise for a magnet with perpendicular magnetic anisotropy (along z direction) are shown.

inference networks and utilize the probabilistic switching characteristics in response
to pulse current magnitude of the sample, depicted in Fig. 4.1(b), as the core enabling
element for probabilistic inference.

Micromagnetic simulation for device modeling
In this section, we provide a simulation framework that can be utilized to model
the stochastic device physics of the spin-devices. The probabilistic switching characteristics of the ferromagnet can be analyzed by Landau-Lifshitz-Gilbert (LLG) equation with additional term to account for SOT generated by the heavy-metal (HM)
underlayer [21],
dm
b
dm
b
1
b×
)+
(m
b × Is × m
b)
= −γ(m
b × Hef f ) + α(m
dt
dt
qNs

60
b is the unit vector of Free Layer (FL) magnetization, γ =
where, m

2µB µ0
h̄

is the

gyromagnetic ratio for electron, α is Gilbert’s damping ratio, Hef f is the eﬀective
magnetic ﬁeld including the shape anisotropy ﬁeld, Ns =

Ms V
µB

is the number of spins

in free layer of volume V (Ms is saturation magnetization and µB is Bohr magneton),
and Is is the spin current generated by the HM underlayer. Thermal noise is included
q
2kB TK
α
by an additional thermal ﬁeld [29], Hthermal = 1+α
2 γµ M V δ G0,1 , where G0,1 is a
t
0 s
Gaussian distribution with zero mean and unit standard deviation, kB is Boltzmann
constant, TK is the temperature and δt is the simulation time-step. Additional eﬀects
like considering ﬁeld-like torque and Dzyalohinskii-Moriya Interaction (DMI) can be
also included in the modelling framework [62, 71].
Fig. 4.3 depicts two independent stochastic LLG simulations. The magnet was
taken to be circular in shape with diameter 40nm and thickness 1.3nm with a barrier
height of 31.44kB T . The magnet damping factor was taken to be 0.0122 [25]. A spinHall angle of 0.12 [57] was assumed for the Tantalum heavy metal layer of thickness
10nm. The magnet was subjected to an in-plane magnetic ﬁeld of strength 100Oe and
a current pulse (I) of magnitude 900µA and duration 2ns. As shown in the ﬁgure,
the magnet stochastically relaxes to either of the two stable magnetization directions.
Note that this section serves to outline the simulation framework that can be used
to model the stochastic magnetization dynamics. We directly use the experimental
probability switching characteristics obtained from Fig. 4.1 for the Bayesian Network
implementation discussed next.

4.3

Bayesian Network based on Stochastic MTJ
Bayesian Network (BN) is a graphical model to represent conditional independen-

cies between each variable, where the nodes in a BN represent random variables and
links represent direct dependencies among the variables [72–74]. A simple Bayesian
Network with four variables is shown in Fig. 4.4 [75]. It illustrates dependencies
among the variables – whether it is cloudy (‘C’), whether it is rainy (‘R’), whether

61
the sprinkler is on (‘S’), and whether the grass is wet (‘W’). Here we assume that
each variable is binary (True or False, could be ‘1’ or ‘0’). As shown in Fig. 4.4,
the dependencies between variables are quantiﬁed using conditional probabilities (in
Conditional Probability Table (CPT)) associated with each transition to a particular
node from its parent nodes in the network. Due to its simple conditional independence
statement expressed in graph, BN helps to reduce the number of variables required
to compute a probabilistic inference.

Fig. 4.4. A simple Bayesian Network with four variable and corresponding Conditional Probability Table (CPT)

Based on the basic BN, the inference operation tries to estimate the probability of
the hidden causes, on the given observed situation [75]. As an example, let us assume
that we observe wet grass and attempt to estimate the cause. There are possibly two
hidden causes, either the Sprinkler is on or it is raining. Here we can use Bayes’ rule
(shown below) to calculate posterior probability of each cause,
P (A|B) =

P (B|A) P (A)
P (B)

where A are the hidden causes and B is the observed evidence.

(4.1)

62
Implementation of a Bayesian inference network on conventional general-purpose
computers is ineﬃcient in terms of area and energy consumption since a large number
of complex ﬂoating point calculations need to be performed to compute the probability of occurrence of a particular variable since multiple causal variables are involved
in the network. Hence, various approaches for BN hardware implementation have
been proposed based on synthesizable hardware such as Field Programmable Gate
Arrays [76,77], fully digital system with stochastic digital circuit [78,79], analog based
probabilistic hardware for inference [80, 81], and mixed-signal approach (Muller CElements) [82]. However, multiple transistors are required to implement the functionality of a single stochastic element, thereby leading to an ineﬃcient design. In
this work, we demonstrate the manner in which the stochastic switching behavior
of ferromagnet-heavy metal structures can directly mimic the core behavior of such
controllable random number generators.

Poisson spike generation based on stochastic switching of nanomagnet
Instead of using conventional ﬂoating point calculations to estimate the probability
of a certain inference process, pulse based computation [70] can be exploited. Here
the probability of inference operation can be estimated by counting the number of
pulses from each variable over a large enough time window. The main idea behind
this approach is that the variable of the BN is represented by a Poisson pulse train
generator that converts the probability information into the frequency of the output
pulses. Based on the controllable stochastic switching of the nano-sized magnet with
thermal noise, the Poisson spikes can be generated with the aid of simple CMOS
peripherals as shown in Fig. 4.5.
Here the device is interfaced with a reference resistor to generate a Poisson spike/pulse
train where the number of spikes in a large enough time window encode information
about the frequency and magnitude of the incoming spike train. The operation can
be explained by considering a write/ read/ reset cycle. Let us consider the case for

63

Fig. 4.5. The proposed device-circuit level solution for Poisson spike
pulse generation with MTJ on HM along with CMOS peripherals

Fig. 4.6. Interconnection between two variables based on the proposed
Stochastic switching elements using CMOS interface circuit

a single stochastic element in Fig. 4.6 that receives input from another stochastic
element.
Note that each element represents a particular node in the BN. During the write
phase, the causal stochastic element which is transmitting spikes from the previous
stage inverter generates a write current IW R through the heavy metal. The magnitude
of the write current is determined by the corresponding current source which is tuned
according to the CPT. The current source is activated with a frequency equivalent
to the frequency of the incoming spike train. After the write cycle, the read cycle is

64
used to determine whether the device has switched and is reset to the initial state
in case of a switching event. Hence, the frequency of the output train is directly
proportional to the switching probability (determined by the magnitude of the current
source which is tuned depending on the corresponding entry in the CPT) and the
pulse train frequency from the causal element. The pulse train frequency from the
causal stochastic element encodes the probability of occurrence of the corresponding
event while the current source being driven by that element encodes the conditional
probability of occurrence of the receiving element.

Information transfer through the network
The detailed implementation of the core components (variable and edge) proposed
for the BN is shown in Fig. 4.7. The MTJ device along with two current sources and
few other circuit elements constitute each variable of the BN. Note, here we depict two
current sources for each of Write (WR) and Reset (RST) operations. Two switches
at either sides of the nanomagnet control the direction of current ﬂow during each
operation.

Fig. 4.7. Practical implementation of the variable for BN and its
interconnection with neighboring variable

The inverter at the output node for level conversion is combined with a latch and
forms a Clocked Latch (C-LAT). This Clocked Latch operates in sync with the Read

65
(RD) signal and can amplify small voltage changes at the Vx node. In addition, this
unit stores the value from the result of stochastic switching. This provides two important beneﬁts: 1) It prevents an interaction from the changes in voltage during the
Read operation to the rest of the network. Without this intermediate storing stage,
unwanted analog voltage ﬂuctuation during the Read operation can be ampliﬁed and
be transmitted to the next stage. 2) Also, it makes the system operate in a pipelined
manner, and hence contributes to make the system synchronous. The variables in BN
except for the root variable can only process information on receiving the data from
the parent node, which makes the system asynchronous with low throughput. By
adopting clocking with proper storing element, the system become synchronous and
can have high throughput. The stored value in C-LAT represents result of current
stochastic switching. When the output level of C-LAT (VM T J ) is combined with the
Write command, the output pulse from each node is generated. Note that, depending
on the level of VM T J , either of Vo+ or Vo− can generate a pulse at a time. In case the
level of VM T J is logic ‘1’, then only Vo+ node can generate a pulse when the Write
pulse is presented to the AND gates. On receiving this pulse, corresponding current
pulse at the next stage is generated.

Fig. 4.8. Timing diagram of the proposed device-circuit conﬁguration in Fig. 4.7

66
Abovementioned operations are illustrated with a timing diagram in Fig. 4.8.
Based on global clock signal (CLK), three main pulses are generated for Write, Reset, and Read operations (WR, RST, and RD). These three signals are broadcasted
through global signaling and used by all the variables and interfaces. Let us assume
that the initial magnetization direction is parallel conﬁguration (lower resistance than
RREF ), hence Vx node is at a high voltage level (Vh ). Since the switching probability of the initial node is set to be 0.5, the magnet ﬂips its state with 50% chance.
The Vx node reﬂects such a switching frequency and can have two possible states (Vh
and Vl ). This small voltage diﬀerence is ampliﬁed and stored in the Clocked Latch
(VLAT /VLAT B ) on receiving the Read command pulse. The stored value - stochastic
switching result - is now converted into a pulse when this value is combined with
the global Write command. Depending on the switching results, either Vo+ or Vo−
transmits a pulse to the next variable, and thereby can generate a current pulse with
desired amplitude as we speciﬁed through the CPT.

Implementation of BN and inference operation

Fig. 4.9. Hardware implementation of Bayesian Network with stochastic MTJ and CMOS peripherals

67
Based on the implementation of the variable and edge in the previous section, we
discuss the implementation of the entire network (in Fig. 4.4). Fig. 4.9 shows a complete view of the BN implementation using the proposed device-circuit conﬁguration.
Each variable is made up of a single stochastic device with the required peripherals.
The next state of each variable is determined by controllable stochastic switching
which contributes to the generation of a Poisson spike train. The switching frequency
information is transferred to the next variable as a pulse through the aforementioned
interface circuitries. Note that, four AND gates are adopted at the beginning of the
last variable to generate a multiplication output between two Poisson spike trains.
Based on the complete network in Fig. 4.9, we can infer the probability of each variable, i.e. P(Sprinkler), P(Rain), and P(Wet), which is diﬃcult to get directly from
the CPT. This could be accomplished just by counting the number of output pulses
corresponding to each variable for a long enough time duration. For instance, if 28
pulses are counted at the Vs+ output of ‘Sprinkler’ node over 100 write cycles, this
means that the probability of ‘Sprinkler is on’ is estimated to be 28%.
Moreover, estimation of more complex inference is also possible by introducing
additional arithmetic building blocks such as division and multiplication between two
Poisson pulses (Fig. 4.10(a)). Let us consider the same question that was mentioned
in the earlier section, i.e. the probability of ‘Sprinkler is on’ in a given situation of
‘Wet grass is true’. By the deﬁnition of conditional probability, this could be written
as P (S = 1|W = 1) = P (S = 1 ∩ W = 1)/P (W = 1). Here the intersection
function between the two probabilities can be interpreted as multiplication between
two Poisson spike trains which could be implemented by an AND gate. The only
remaining function to be implemented is the division operation. Recently, the method
of performing division operation between two Poisson spike trains by matching the
rate of spiking of two internal signals has been proposed in Ref. [79]. Fig. 4.10(b)
depicts the concept of the division operation between two Poisson spikes and practical
implementation of the divider is shown in Fig. 4.11. Let us ﬁrst describe the division
operation using Fig. 4.10(b). There are two input pulse trains with diﬀerent spiking

68

Fig. 4.10. (a) Inference operation based on default BN with the aid of
multiplication and division operations (b) Division operation between
two Poisson pulses by matching the rate of spiking between two nodes
[79]

Fig. 4.11. Practical implementation of division operation between two
Poisson pulses

rate to the division unit (with current inputs I1 and I2 to the proposed device-circuit
conﬁguration, named as MTJ1, MTJ2). Here we denote spiking rate of upper and

69
lower pulse trains as ‘S1’ and ‘S2’ respectively. Likewise, the rate of spiking from the
output node is denoted as ‘SO’. Note, the current input pulse to the output MTJ
device (MTJ3) is generated from additional CMOS logic gates (named as ‘Counter &
Feedback Logic’). By multiplying the two pulses, ‘S2’ and ‘SO’, using AND gate, the
resultant spiking rate becomes ‘S2*SO’. Here the main idea is that once the ‘Counter
& Feedback Logic’ controls the spiking rate of MTJ3 device to match the spiking
rate of ‘S1’ and ‘S2*SO’ to be the same, then the rate of output spiking becomes
SO = S1/S2. For this functionality, the Counter logic counts the number of pulses
from each of ‘S1’ and ‘S2’ over a predetermined time duration. If the comparison
results show ‘S1’ is more than ‘S2’, then the Feedback logic raises the spiking rate
of output ‘SO’ so that ‘S2*SO’ rate also increases. By following this simple feedback
rule, the division operation can be achieved. Fig. 4.11 shows the implementation of
the divider unit based on this approach. The device-circuit conﬁguration to generate
an output Poisson spike train is identical to the one we used to implement the variable
in BN. The only diﬀerence is the rate of spiking (i.e. amplitude of input current pulse),
is controlled by the ‘Counter & Feedback Logic’ inside the dotted box. The ‘Counter
& Feedback Logic’ consists of two digital up-counters to count the number of spikes
from two inputs, digital comparator, subtractor, and timing controller (TCON).

4.4

Results
The functionality of the BN was veriﬁed by performing a simulation of the entire

network based on the probability switching characteristics depicted in Fig. 4.1(b).
Fig. 4.12(a) represents the timing waveforms and average number of spikes produced
for various events which approximates the actual analytical solution to a reasonable
degree of precision. Fig. 4.12(b) depicts that the probability distribution function
(PDF) of the two conditional events (S|W ) and (R|W ) approach the actual analytical
solution as the number of samples increases.

70

Fig. 4.12. (a) Timing waveform from each variable to estimate the
probability during 100 sample points. (b) Inference results based on
the proposed BN

It is worth noting here that Bayesian inference and in general, the class of unconventional computing platforms that can be enabled by such stochastic magnetic elements, are inherently resilient to variations and noise in the computational units. For
instance, we perform an analysis of the network performance in presence of random
Gaussian transient noise added to the input charge current provided by the CMOS
current source through the heavy metal layer. Note that ∼ 60µA (0.48 − 0.54mA)
of current range is being exploited from the switching probability characteristics of
the magnetic stack (Fig. 4.1(b)) for proper functioning of the Bayesian network. We
utilize a 6-bit DAC to provide the input current through the heavy metal, thereby
providing ∼ 1µA of current resolution (LSB). Table 4.1 depicts the mean and standard deviation of the output P (R|W ) of the network for 1, 000 samples with varying
amplitudes of the Gaussian noise (from 1LSB ∼ 1µA to 3LSB ∼ 3µA). As expected,
the mean value remains close to the case without noise. Although the variance of the
PDF distribution increases by a small amount, it can be reduced by increasing the
number of sample points used for inference.

71
Noise amplitude

Mean of P (R|W )

STD (σ) of P (R|W )

0 (No noise)

0.7048

0.0222

1 LSB

0.7046

0.0249

1 LSB

0.7046

0.0339

1 LSB

0.7031

0.0545

Table 4.1.
Estimated probability of the inference for 1,000 output samples with
Gaussian noise.

4.5

Conclusion
Prior proposals have considered experimental demonstration of spin-orbit torque

driven magnetic heterostructure switching for unconventional computing platforms
like associative memory operations [83]. However, the proposals typically exploit
analog and deterministic Hall-bar switching (the resistance of the Hall-bar varies in
an analog fashion depending on the magnitude of the input current) as the core computing element. In contrast, this work exploits binary and stochastic Hall-bar switching. Such probabilistic computation is expected to replace deterministic computing
platforms (based on such post-CMOS technologies) since at highly scaled dimensions
such devices are not expected to exhibit multi-bit analog resolution. Therefore, exploring probabilistic computing platforms based on stochastic device switching that
encode information in time (through probabilistic update of binary elements) rather
than space (through deterministic analog computing elements) will become important. This work can potentially stimulate eﬀorts at developing stochastic computing
platforms that embrace the underlying stochasticity of highly-scaled nanomagnets.
In conclusion, in this chapter we provided proof-of-concept experiments demonstrating probabilistic spin-orbit torque induced magnetization reversal. Such stochastic devices can provide a direct mapping to the computing elements of Bayesian inference, Deep Belief Networks and probabilistic neuromorphic applications.

72

5. COUPLED SPIN-TORQUE-OSCILLATOR BASED
CONVOLUTION COMPUTATION:
APPLICATION TO IMAGE PROCESSING
A coupled oscillator network uses phase or frequency of the oscillators output signal as a computing resource based on coupling among the oscillators. The collective
behavior through weak coupling, when the output phase (or frequency) is similar to
the neighbors, makes the network attractive as a Degree-of-Match (DoM) computation unit. Based on these features, a coupled oscillator based computing primitive
can be used as an alternative approach for associative computing rather than relying on Boolean computation with von-Neumann architecture. In this chapter, we
demonstrate a core computing hardware for convolution computation and Euclideandistance-squared computing unit, based on a weakly coupled Spin-Torque-Oscillator
(STO) cluster and the associated CMOS supporting circuits. To our knowledge, this
is the ﬁrst demonstrated coupled STO system as a computing unit. The proposed
system is exploited in edge detection through Gabor ﬁltering to show the potential
usage of the coupled oscillator based system in an image processing application.

5.1

Introduction
Even with the unprecedented success driven by device scaling in Boolean com-

puting, problem optimization, recognition, and classiﬁcation are still not suited for
digital computation based on von-Neumann architecture. The operations required for
such applications are computationally intensive in most cases, for instance series of
multiplication and accumulation (MAC) operations. In addition, the operations are

73
executed by following a sequential instruction set which leads to excessive computing
resource usage.
This has motivated the research community to explore alternative computing models in non-Boolean regime and emerging post-CMOS devices for better performance
and energy eﬃciency. One of the eﬃcient models proposed for associative computing
is a system based on a weakly coupled oscillator array where the array is used to
compute Degree-of-Match (DoM) between two multi-dimension vectors [84–90]. The
DoM represents a similarity between two input vectors in terms of distance where the
distance could have diﬀerent metric forms, such as the Manhattan distance metric
(L1 norm) or the Euclidean distance metric (L2 norm). Here the input information is
encoded as either frequency or phase of the oscillator output signal. The outputs are
coupled through the interconnections among the oscillators wherein the connections
are used to communicate with neighboring elements.
The emerging nano-sized oscillator devices also make the coupled oscillator system
attractive and implementable as they can possibly provide scalable dimensions, faster
computation time, and less energy consumption compared to its digital counterpart.
Recently, simulation level studies have been conducted on the coupled oscillator network using a mathematical model of the coupled oscillator/models calibrated with
fabricated devices for associative computation based on emerging devices such as
Spin-Torque Oscillator (STO) [87, 90], Resonant Body Oscillator (RBO) [91], and
vanadium dioxide (VO2 ) [92–94]. However, there has not yet been a demonstration
of the entire coupled oscillator network through the physical implementation of nanooscillators and CMOS supporting circuits for the associative computations.
In this chapter, we demonstrate an entire system comprising coupled oscillators
using Giant Magnetoresistance (GMR) STO devices [95–97] and a CMOS detector
chip, as the core computing primitive of a convolution computation. The proposed
hybrid system based on the coupled STOs and the CMOS circuits has been used
to measure the DoM between two input vectors whose output follows the Euclidean
distance squared metric (L22 norm). Corresponding distance in L22 norm is used to

74
compute an approximate convolution for an image processing application to envision
the feasibility of the practical coupled oscillator network as a core computing element
of non-Boolean computation.

5.2

Coupled oscillator based approximate computation primitive
The main idea of utilizing the oscillator as a core computing element of non-

Boolean computation lies in mapping an input information into a property of output
signal from the oscillator. Once the information is mapped into a frequency of the
output signal, this is referred to as Frequency Shift Keying (FSK). In case the input
is encoded as a phase of the output signal, it becomes Phase Shift Keying (PSK).
Fig. 5.1(a) shows the main diﬀerence between FSK and PSK with the associated
waveforms. Here, the input is encoded as the phase of the output signal based on the
device property under consideration and its coupling methods, the GMR-STO device
and injection locking, respectively.

Fig. 5.1. (a) Oscillator as a basic computing element and two possible
methods of input mapping through FSK and PSK. (b) Approximate
distance computation (L1, L2 norm) based on the oscillator and Distance Unit

75
The phase (or frequency) variation of the oscillatory signal could be used to measure the distance between two diﬀerent input vectors. When the diﬀerence in inputs
‘a’ and ‘b’, (a-b), is mapped into the phase (or frequency) of the oscillator output
signal, the distance metric is attained through the Distance Unit in Fig. 5.1(b). Note
that, the Distance Unit is typically implemented using CMOS circuits to facilitate
further processing in digital domain after getting a distance. The distance metric
could be denoted as Manhattan distance (L1 norm) when the shape of the output
curve shows linear dependence to the input variation. Once the output curve follows
square of the input vector diﬀerence, it becomes Euclidean distance squared metric
(L22 norm).
The same concept can be used to compute the distance between two multidimensional vectors with the aid of a summing element. As depicted in Fig. 5.2(a),
each oscillator gets an element-wise diﬀerence between two vectors (‘A’ and ‘B’) as
an input. Correspondingly, the output of the oscillator exhibits a diﬀerent phase
(or frequency). These signals are merged through the summing element (denoted as
P
in Fig. 5.2(a)) before being presented to the Distance Unit. Note that, to sum
up incoming signals from the oscillators, diﬀerent techniques have been used such
as resistive coupling [98] and capacitive coupling [87]. The subsequent output signal
from the summing unit would exhibit diﬀerent amplitude as the incoming signals
have diﬀerent phases (or frequencies) that lead to amplitude attenuation during the
summation. The signal with varying amplitude is then presented to the Distance
Unit, and the ﬁnal output shows diﬀerent level of voltage or current (depends on the
Distance Unit) in accordance with the amplitude variation of the incoming signal.
Once the level of the output signal shows quadratic change over the input variations,
it becomes approximate L22 distance metric which is adopted as a DoM in many
literature. [98, 99]
Recently, the idea of performing an approximate convolution by exploiting the
coupled oscillator array-based L22 distance metric has been proposed [98, 99]. The

76

Fig. 5.2. (a) Approximate distance computation (L22 norm) between
two multi-dimension vectors based on coupled oscillator network and
the associate Distance Unit (b) Basic idea of approximate convolution
computation from the L22 distance metric

underlying principle follows a simple algebraic expression as shown below, where ‘A’
and ‘B’ represent multi-dimensional input vectors.
L22 (A, B) = (A − B)2

whereA = (a1 , a2 , ..., an ), B = (b1 , b2 , ..., bn )

(5.1)

Convolution(A, B) = A · B = −0.5 × [L22 (A, B) − L22 (A, 0) − L22 (0, B)]

(5.2)

By following (5.2), approximate convolution is calculated using three L22 units
with three diﬀerent cases of input vectors, i.e. (A-B), (A-0), and (0-B) as shown in
Fig. 5.2(b).
In order to calculate such a distance metric or a convolution using Boolean computation, a series of arithmetic operations such as multiplication and accumulation
(MAC) operations have to be performed several times through the general-purpose
CPU. Also, the operations are performed by following the sequential instruction cy-

77
cles of fetch, decode, and execute which results in ineﬃcient usage of computing
resources. Here the use of coupled oscillator network for an approximate computation makes such a process rather simple, since the computationally intensive MAC
operations with multiple inputs are performed at once in an analog domain. This
can possibly provide an eﬃcient solution with a reasonable accuracy in the domain
of associative computing where many of the applications are error-resilient.

5.3

System level implementation
As mentioned in the previous section, we propose a hardware system targeting an

approximate convolution computation based on the coupled oscillator (GMR-STO)
array and the CMOS supporting circuits. Fig. 5.3 shows the proposed system in
brief and its application to image processing, especially for edge detection through
Gabor ﬁltering. Gabor ﬁltering is a widely used feature extraction method in image
processing [100, 101]. The oscillatory changes in images in a certain direction can be
detected by convolving an image with diﬀerent ﬁlter kernels that show diﬀerent edge
features.
We would now give a brief description about how the coupled oscillator based
system in Fig. 5.3 can perform a convolution computation. First of all, the pixel
intensity of Gabor ﬁlter kernel and an input image fragment (same size with ﬁlter
kernel) is converted into digital bits (bit-quantization). Then the pixel-wise diﬀerence
is calculated and is used as an input to each STO after being transformed into dc bias
current. The STO devices are used to encode input data into the phase of the output
signal. The outputs are combined into a single signal with a varying amplitude where
the variation comes from the phase diﬀerence among the STO output signals. These
operations occur inside the STO module in Fig. 5.3.
The output oscillating signal from the STO module is now presented to the CMOS
detector chip. The L22 unit in the CMOS module converts the information on the
amplitude of the incoming signal into digital code. The output code from the L22 unit

78

Fig. 5.3. Coupled STO cluster based approximate convolution computation: Application to image processing, Edge detection

will vary in a quadratic manner with respect to the input variation which represents
a distance between the input image fragment and the Gabor ﬁlter kernel. To complete the convolution computation, this process has to be performed three times with
diﬀerent inputs (or single operation with three identical structures) as we discussed.
Finally, the convolution output is attained through simple addition and subtraction
between three outputs that becomes a single pixel intensity of the output edge map.
The detailed implementation of the system is shown in Fig. 5.4. The entire
system is divided into two sub-groups, 1) STO module with external components
and 2) CMOS detector (L22 unit) along with bias current sources. On receiving the
bias currents (IDC ) from the current sources, the STOs generate oscillating signals.
Note that, the four STOs are coupled to the ﬁeld-line ac signal for the frequency
locking where the ﬁeld line traverses the STO devices (Fig. 5.4 bottom-left). The
parasitic capacitance between the signal line and the STO is used to inject a portion
of the ac signal to each device (injection coupling). Once the frequency of the injected
signal is similar to a natural frequency of the STO, then the STO shows a frequency
locking to the injected signal. The outputs from the oscillators are combined through
the series of Power Combiners (PCs), and then undergoes ampliﬁcation stage, LowNoise-Ampliﬁer (LNA), before being transmitted to the CMOS detector.

79

Fig. 5.4. The detailed implementation of the proposed system for
approximate distance computation in Fig. 5.3 based on the coupled
oscillator device module (GMR-STO) and CMOS detector (L22 unit).

Upon receiving the incoming signal from the STO module, the CMOS detector
(L22 unit) generates an output digital code where the digital code would increase or
decrease as the amplitude of the input signal increases or decreases, respectively. The
ﬁrst few stages are used to amplify the incoming signal with enough gain to interact
with succeeding CMOS circuits such as integrator and Analog-to-Digital Converter
(ADC). Since the incoming signal is expected to have a small amplitude (<1 mV ), we
ﬁrst adopted LNA again. Then, a diﬀerential ampliﬁer and inverter based ampliﬁers
are used in series. After converting the small signal into CMOS compatible magnitudes, the integrator has been used whose output voltage level changes in accordance
with the incoming signal amplitude. Finally, the level of the integrator output voltage is converted into a digital code through ADC. The proposed CMOS detector has
been implemented and fabricated using 90nm CMOS process, and the die photo of
the CMOS detector is shown at the bottom side of Fig. 5.4.

80
5.4

Circuit level implementation

Fig. 5.5. (a) Simpliﬁed STO device module in Fig. 5.4 (b) Frequency
versus input dc bias current to the STO device under the frequency
locking through a ﬁeld line injection coupling (c) Expected waveforms
along the signal ﬂow

In this section, we would explain the detailed operation of the STO module and the
CMOS detector by showing expected waveforms along the signal ﬂow. The simpliﬁed
STO module is shown in Fig. 5.5(a). The STOs are biased with a separate input dc
current, and the amount of the current is determined by the element-wise diﬀerence
between two input vectors. Due to the injection locking through the ﬁeld line, the
STOs are in the frequency locked state. Hence, the output signals from the STOs are
having the same frequency with proper input dc bias currents (I1 , I2 ) which fall into
the frequency locking range (Fig. 5.5(b)). However, this does not imply the phases of
the output signals are identical, but diﬀerent according to the input bias current as
reported in ref. [102–104]. Since our scheme encodes the input information into the

81
phase of the oscillating signal, there is a phase diﬀerence between the inputs to the
Power Combiner (PC) (Fig. 5.5(c) left). The amplitude variation also exists due to
diﬀerent bias current through the resistive device. However, the resistance variation
during the oscillation is quite small in case of the GMR-STO device. Therefore,
there is typically negligible amplitude variation. The oscillating signals are combined
through the Power Combiner (PC), and become a single signal where the amplitude
is varying due to the phase diﬀerence. The output signal is ampliﬁed at the LNA
stage and then transmitted to the CMOS detector side through a wired connection.
(Fig. 5.5(c) right).

Fig. 5.6. (a) The detailed block diagram of the CMOS detector, L22
unit (b) Expected waveforms along the signal ﬂow

Once the signal is presented to the CMOS module, additional ampliﬁcation stages
are used initially inside the CMOS detector (Fig. 5.6(a)). The ﬁrst two stages are
an LNA and a diﬀerential ampliﬁer as we described in the previous section. Then,
the inverter based ampliﬁers are used to amplify only the portion of interest, the
tip of the incoming signal, based on a threshold level of the inverter. Consequently,

82
the signal in front of the integrator behaves like the fourth waveform of the Fig.
5.6(b) where the signal normally stays at ‘VDD’ level, and only goes below when the
‘INV1 amp.’ signal exceeds the threshold level of the second inverter (VT H2 ). The
depth of dip in voltage from the ‘VDD’ level changes along with the amplitude of the
incoming signal ‘Vamp. ’ and controls the amount of current ﬂow at the integrator by
adjusting the degree of turn-on of the switch PMOS transistor. The current from the
integrator charges the capacitor, and makes a voltage across the capacitor rise during
an integration time. Finally, the level of the integrator output voltage is converted
into a digital code at ADC stage for further convolution processing.

5.5

Measurement results and its application to edge detection
In the previous sections, we have shown the basic principle of getting L22 norm

from the proposed coupled STO based computing primitive. In addition, the implementation of the proposed system based on the STO device and the CMOS detector
modules has been introduced. Note that, the L22 norm is eventually used to compute
an approximate convolution for an image processing.
To experimentally demonstrate the feasibility of the proposed system as a L22
computing unit, the test environment is set up as illustrated in Fig. 5.7. As shown in
Fig. 5.7(a), we have prepared four dc bias current sources (I0 +I1 ∼ I0 +I4 ) where I0
is the baseline current (∼7 mA). The additional bias currents (I1 ∼ I4 ) are randomly
generated within the locking range (Fig. 5.7(b)). To display these multi-dimension
input vectors in a single axis, the Euclidean distance metric (L2 norm) has been used
as a x-axis component. For instance, if the diﬀerences between two input vectors are
[1,1,1,1] and [0,0,0,2], then Euclidean distance of both inputs become 2 as depicted
in Fig. 5.7(c).
Based on the experimental setup, the input bias current combinations are applied
to the STOs and the corresponding output voltage variations from the proposed
system in 5bit digital code (bit-resolution of the designed ADC) have been monitored

83

Fig. 5.7. (a) simulation environment setup to check the performance
of the proposed approximate distance computing primitive based on
randomly generated input vectors (b) Input test vector table used for
the measurement (c) 2D graph to show the input (randomly generated
vectors in L2 norm format) and the output (Digital code from CMOS
detector) relationship

for ∼400 randomly generated input vectors. The measured data is shown as a box plot
in Fig.5.8. As shown in the ﬁgure, the majority of the data points follow a second
order non-linear curve, thereby reaﬃrming the fact that an approximate distance
computation, an L22 norm in this particular case, can indeed be performed using the
proposed coupled-oscillator based system.
To check the feasibility of using the proposed L22 unit as a convolution computing
primitive for an image processing, we have performed an edge detection using Gabor
ﬁltering for the “Lenna” image (Fig. 5.9(a)). The image was downsized to 256 ×

84

Fig. 5.8. Approximate distance measurement results of the proposed
coupled-STO based computing primitive based on the simulation environment in Fig. 5.7.

256 with grayscale pixels and was convolved with Gabor ﬁlter kernel of size 2 × 2.
The Gabor ﬁlter kernels have been generated based on the model in [99]. Among the
set of ﬁlters, one ﬁlter that shows the best edge detection result has been chosen for
further simulation study (Fig. 5.9(b)). Here the size of the ﬁlter kernel was selected
to match the number of STOs in our system. Note that, 5-bit quantization has been
applied for a pixel intensity of the image and the Gabor ﬁlter kernel.
Based on these settings, edge detection was performed through the following process. By preparing an image fragment and the ﬁlter kernel with 2 × 2 pixels, a
pixel-wise diﬀerence (in digital bit) was calculated and the corresponding amount
of bias current was prepared. Then, the Euclidean distance of 4-dimensional input
vector (combination of four bias currents) has been used to get a digital code from
our L22 unit based on the characteristic curve shown in Fig. 5.8. This process was
repeated two times more to get the ﬁnal convolution output as we discussed in Fig.

85

Fig. 5.9. Simulation study (Edge detection through Gabor ﬁltering)
to show the feasibility of our system as an approximate convolution
computing primitive (a) Downscaled “Lenna” image (b) 2 × 2 Gabor
ﬁlter (c) Edge detection results from ideal convolution (Left) and approximate convolution (Right) (d) Pixel-wise diﬀerence between two
edge maps in (c)

5.2(b). The ﬁnal convolution output represents a single pixel intensity of the output
edge map. Therefore, the entire process has to be repeated by sliding the image
fragment window across the picture until the whole output edge map is obtained.
The edge detection results from ideal convolution (mathematical function) and
approximate convolution based on the proposed system are shown in Fig. 5.9(c).
The pixel-wise diﬀerence between two edge-maps is shown in Fig. 5.9(d). The visual
quality of the approximate convolutional unit is very similar to the ideal one within
±5% error.

86
Now, let us consider the energy consumption of the proposed system. For a single
distance computation, the proposed system requires ∼10 ns of integration time. In
addition, ∼5 ns of signal ﬂight time through a wired connection and ∼3 ns of settling
time (to stabilize the output oscillating signal after changing the input data) were
added to the basic operation time which results in ∼18 ns of computation time per
single operation. During this time, the external components on the STO device
module (PC and LNA) and CMOS supporting circuits (ampliﬁers and integrator)
are under operation. The energy consumption of the entire system during a single
operation based on the two module approach is estimated to be ∼5.59 nJ where
the external components consume the most of the energy consumption (∼5.4 nJ).
The four STOs consume ∼0.035 nJ based on the fact that the average bias current
is ∼7 mA and the resistance across the device is ∼ 10 Ω which makes the power
consumption of each oscillator to be ∼490 uW by following P = I 2 R. The remaining
portion of the energy is consumed by the CMOS supporting circuits (∼0.23 nJ). Note
that, here the energy consumption of the ADC was not considered, as the type of
the ADC would be diﬀerent depending on the target design, and the corresponding
operation time and power consumption also changes severely.
Once the system is integrated into a single chip, the total computation time is
reduced as the chip-to-chip communication time (signal ﬂight time) is eliminated.
Also the energy consumption of the external components can be ignored due to the
direct connection between the STO and the CMOS circuit. Now, the total time
required and the energy consumption for a single distance computation become ∼13
ns and ∼0.14 nJ, respectively. This energy number could be reduced more with
an advanced device technology. For instance, Tunneling Magnetoresistance (TMR)
STO device on the top of a heavy-metal thin layer by exploiting Spin-Orbit-Coupling
(SOC) can provide a larger resistance variation compared to the current GMR-STO
device. The corresponding larger output signal removes the necessity of using multiple
ampliﬁcation stages in the CMOS detector.

87
5.6

Conclusion
In this chapter, we have experimentally demonstrated a core convolution primitive

based on a weakly coupled oscillator array. Starting from the basic idea of getting L22
norm with coupled oscillators, we have shown that frequency locking of the oscillators
and the reference signal through ﬁeld line coupling plays a crucial role in the proposed
system.
The performance of the system as a L22 unit was examined by presenting randomly
generated test input vectors, bias current to the STOs, and corresponding output
digital codes from the CMOS detector have been monitored. The characteristic curve
from the experiment that shows the approximate L22 norm is used to check the
feasibility of the proposed system as a convolution computation primitive for an
image processing application, especially edge detection. The approximate convolution
output based on our system shows reasonable accuracy as compared to the ideal
convolution results, thereby aﬃrming that the proposed coupled STO based system
can perform an approximate convolution task.

88

6. SUMMARY
Optimization problems have attracted much attention from the research community
due to their relevance to the decision-making process and many other applications.
However, computations to solve optimization problems based on traditional computing models turn out to be ineﬃcient due to an exponentially increase in complexity
with the number of variables involved. This inspires researchers to look into alternative computing models over the entire stack of device, circuit, and algorithm design
spaces.
In this research, we proposed multiple optimization problem solvers based on a
stochastic algorithm. The key to the stochastic algorithm for our case, comes from
the stochastic switching behavior of nano-sized magnets in the presence of thermal
noise. The magnetization switching of a magnetic stack becomes non-deterministic
when a subthreshold input current is applied across the device under thermal noise.
Here, the switching probability of the device could be manipulated by controlling the
magnitude of input stimulus. This controllable stochastic switching makes the device
suitable as a 1bit Random Number Generator (RNG) that forms the core computing
primitive of the proposed hardware solutions in many applications.
Based on the 1bit RNG using a ferromagnet-heavy metal heterostructure along
with necessary CMOS peripherals, we ﬁrst introduced the implementation of the Ising
spin model in chapter 2. The Ising spin model describes the behavior of magnetic
spins and the coupling between them. Its simple architecture and inherent ability
to evolve towards the minimum energy state make this computing model attractive
and implementable. The stochastic switching of the nanomagnet allows eﬃcient implementation of some of the primitives required for Ising model, such as “Annealing”
and “Majority Vote”. The feasibility of the system has been demonstrated with

89
two representative combinatorial optimization problems, Maximum-cut and Graph
Coloring.
In chapter 3, we proposed another optimization problem solver for Boolean Satisﬁability (SAT) problem. The stochastic switching of the proposed 1-bit RNG has
been used to select a variable to be ﬂipped that would decrease the number of unsatisﬁed clauses. For this functionality, a switching probability is assigned to each
variable responsible for unsatisﬁed clauses such that every variable can have a certain probability of ﬂipping its current state. The probability increases or decreases
as the number of unsatisﬁed clauses connected increases or decreases, respectively.
This enables parallel search of the solution space by ﬂipping multiple variables at the
same time, which results in rapid convergence toward the global solution. Based on
a device-circuit-algorithm co-framework, our proposed Biased Random Walk (BRW)
approach demonstrates a better solution search time in comparison to the software
SAT solver with similar algorithmic approaches such as WalkSAT and ProbSAT.
In chapter 4, we presented an implementation of the Bayesian Network (BN) model
based on the proposed hardware primitive. Starting from a proof-of-concept experiment demonstrating probabilistic spin-orbit torque induced magnetization switching,
we have shown that such stochastic switching can be used to emulate a computing
element of Bayesian Inference. Here, the stochastic 1-bit RNG forms a Poisson spike
pulse generator with CMOS peripherals and is used as a variable of the network. The
switching event from each variable encodes an input of information into the frequency
of output pulse train. Then, the information is transferred through the connections
between the variables in the given network. Based on a simple BN with four variables, we have shown that probabilistic inference could be achieved with reasonable
accuracy in comparison to the ideal ﬂoating-point computation.
Finally, we proposed a complete hardware system for an approximate distance
computation. Apart from the previous three research topics, this research focuses on
utilization of a weakly coupled spin torque oscillator (STO) network as a core hardware primitive. The oscillators are controlled to remain in a frequency locked stage

90
through a ﬁeld-line injection coupling, which is used to map input information into
a phase of the output oscillating signals. The proposed system comprises two separate sub-modules: 1) STO module with external components and 2) CMOS detector
as a L22 distance computation unit. The distance calculated from our system was
exploited to compute approximate convolution for an image processing application.

REFERENCES

91

REFERENCES

[1] (2017)
Executive
summary,
https://irds.ieee.org/roadmap-2017

irds.

[Online].

Available:

[2] (2017) Beyond cmos, irds. [Online]. Available: https://irds.ieee.org/roadmap2017
[3] (2016) Beyond cmos, irds. [Online]. Available: http://irds.ieee.org/reports
[4] (2015) Itrs executive report, itrs. https://www.semiconductors.org/main/
2015 international technology roadmap for semiconductors itrs.
[5] M. N. Baibich, J. M. Broto, A. Fert, F. N. V. Dau, F. Petroﬀ, P. Etienne, G. Creuzet, A. Friederich, and J. Chazelas, “Giant magnetoresistance
of (001)fe/(001)cr magnetic superlattices,” Phys. Rev. Lett., vol. 61, p. 2472,
1988.
[6] G. Binasch, P. Grnberg, F. Saurenbach, and W. Zinn, “Enhanced magnetoresistance in layered magnetic structures with antiferromagnetic interlayer exchange,” Phys. Rev. B, vol. 39, p. 4828, 1989.
[7] M. Yamaoka, C. Yoshimura, M. Hayashi, T. Okuyama, H. Aoki, and H. Mizuno,
“New computing paradigm for analyzing increasingly complex social infrastructure systems,” Hitachi Review, vol. 64, no. 8, p. 525, 2015.
[8] F. Barahona, “On the computational complexity of ising spin glass models,” J.
Phys. Math. Gen., vol. 15, 1982.
[9] B. A. Cipra, “An introduction to the ising model,” Amer. Math. Monthly,
vol. 94, pp. 937–959, 1987.
[10] M. Yamaoka, C. Yoshimura, M. Hayashi, T. Okuyama, H. Aoki, and H. Mizuno,
“A 20k-spin ising chip to solve combinatorial optimization problems with cmos
annealing,” IEEE Journal of Solid-State Circuits, vol. 51, no. 2, pp. 303–309,
2016.
[11] L. Liu, C.-F. Pai, Y. Li, H. W. Tseng, D. C. Ralph, and R. A. Buhrman, “Spintorque switching with the giant spin hall eﬀect of tantalum,” Science, vol. 336,
pp. 555–558, 2012.
[12] B. Sutton, K. Y. Camsari, B. Behin-Aein, and S. Datta, “Intrinsic optimization
using stochastic nanomagnets,” Sci. Rep., vol. 7, 2016.
[13] B. Behin-Aein, V. Diep, and S. Datta, “A building block for hardware belief
networks,” Sci. Rep., vol. 6, p. 29893, 2016.

92
[14] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated
annealing,” Science, vol. 220, pp. 671–680, 1983.
[15] D. A. Huse and D. S. Fisher, “Residual energies after slow cooling of disordered
systems,” Phys. Rev. Lett., vol. 57, p. 2203, 1986.
[16] A. B. Finnila, M. A. Gomez, C. Sebenik, C. Stenson, and J. D. Doll, “Quantum
annealing: A new method for minimizing multidimensional functions,” Chem.
Phys. Lett., vol. 219, pp. 343–348, 1994.
[17] T. Kadowaki and H. Nishimori, “Quantum annealing in the transverse ising
model,” Phys. Rev. E, vol. 58, pp. 5355–5363, 1998.
[18] G. E. Santoro, R. Martok, E. Tosatti, and R. Car, “Theory of quantum annealing of an ising spin glass,” Science, vol. 295, pp. 2427–2430, 2002.
[19] M. W. Johnson, M. H. S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson,
R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, E. M. Chapple, C. Enderud,
J. P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov,
C. Rich, M. C. Thom, E. Tolkacheva, C. J. S. Truncik, S. Uchaikin, J. Wang,
and B. Wilson, “Quantum annealing with manufactured spins,” Nature, vol.
473, pp. 194–198, 2011.
[20] S. Manipatruni, D. E. Nikonov, and I. A. Young, “Energy-delay performance of
giant spin hall eﬀect switching for dense magnetic memory,” Appl. Phys. Expr.,
vol. 7, no. 10, p. 103001, 2014.
[21] J. C. Slonczewski, “Conductance and exchange coupling of two ferromagnets
separated by a tunneling barrier,” Phys. Rev. B, vol. 39, p. 6995, 1989.
[22] M. Beleggia, M. D. Graef, Y. T. Millev, D. A. Goode, and G. Rowlands, “Demagnetization factors for elliptic cylinders,” J. Phys. D, vol. 38, p. 3333, 2005.
[23] A. Jaiswal, X. Fong, and K. Roy, “Comprehensive scaling analysis of current
induced switching in magnetic memories based on in-plane and perpendicular
anisotropies,” JETCAS, vol. 6, pp. 120–133, 2016.
[24] J. E. Hirsch, “Spin hall eﬀect,” Phys. Rev. Lett., vol. 83, p. 1834, 1999.
[25] C.-F. Pai, L. Liu, Y. Li, H. W. Tseng, D. C. Ralph, and R. A. Buhrman, “Spin
transfer torque devices utilizing the giant spin hall eﬀect of tungsten,” Appl.
Phys. Lett., vol. 101, p. 122404, 2012.
[26] I. M. Miron, K. Garello, G. Gaudin, P.-J. zermatten, M. V. Costache, S. Auffret, S. Bandiera, B. Rodmacq, A. Schuhl, and P. Gambardella, “Perpendicular
switching of a single ferromagnetic layer induced by in-plane current injection,”
Nature, vol. 476, pp. 189–193, 2011.
[27] L. Liu, O. J. Lee, T. J. Gudmundsen, D. C. Ralph, and R. A. Buhrman,
“Current-induced switching of perpendicularly magnetized magnetic layers using spin torque from the spin hall eﬀect,” Phys. Rev. Lett., vol. 109, p. 096602,
2012.

93
[28] S. Ikeda, K. Miura, H. Yamamoto, K. Mizunuma, H. Gan, M. Endo, S. Kanai,
J. Hayakawa, F. Matsukura, and H. Ohno, “A perpendicular-anisotropy cofeb–
mgo magnetic tunnel junction,” Nature materials, vol. 9, no. 9, pp. 721–724,
2010.
[29] W. Scholz, T. Schreﬂ, and J. Fidler, “Micromagnetic simulation of thermally
activated switching in ﬁne particles,” J. Magn. Magn. Mater., vol. 233, p. 296,
2001.
[30] A. F. Vincent, J. Larroque, N. Locatelli, N. B. Romdhane, O. Bichler, C. Gamrat, W. S. Zhao, J.-O. Klein, S. Galdin-Retailleau, and D. Querlioz, “Spintransfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems,” IEEE Transactions on Biomedical Circuits and Systems,
vol. 9, no. 2, pp. 166–174, 2015.
[31] A. Sengupta, M. Parsa, B. Han, and K. Roy, “Probabilistic deep spiking neural
systems enabled by magnetic tunnel junction,” IEEE Transactions on Electron
Devices, vol. 63, no. 7, pp. 2693–2970, 2016.
[32] M. X. Goemans and D. P. Williamson, “Improved approximation algorithms
for maximum cut and satisﬁability problems using semideﬁnite programming,”
Journal of the ACM, vol. 42, p. 1115, 1995.
[33] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the
Theory of NP-Completeness. W. H. Freeman & Co., New York, 1979.
[34] A. Lucas, “Ising formulations of many np problems,” Front. Phys., vol. 2, 2014.
[35] A. Nigam, C. W. Smullen, V. Mohan, E. Chen, S. Gurumurthi, and M. R. Stan,
Delivering on the promise of universal memory for spin-transfer torqur RAM
(STT-RAM). ISLPED, 2011.
[36] J. Marques-Silva, “Practical applications of boolean satisﬁability,” Int. Workshop on Discrete Event Systems, pp. 74–80, 2008.
[37] J. Gu, W. P. andJ. Franco, and B. W. Wah, “Algorithms for the satisﬁability (sat) problem: A survey,” DIMACS Series in Discrete Mathematics and
Theoretical Computer Science, vol. 35, pp. 19–151, 1997.
[38] K. A. Marques Silva, João P.and Sakallah, Grasp—A New Search Algorithm for
Satisﬁability. Springer US, 2003, pp. 73–89.
[39] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik, “Chaﬀ: engineering an eﬃcient sat solver,” in Proceedings of the 38th Design Automation
Conference (IEEE Cat. No.01CH37232), 2001, pp. 530–535.
[40] A. A. Sohanghpurwala, M. W. Hassan, and P. Athanas, “Hardware accelerated
sat solvers - a survey,” Journal of Parallel and Distributed Computing, 2017.
[41] H. Kautz, A. Sabharwal, and B. Selman, Incomplete algorithms.
2009.

IOS Press,

[42] B. Selman, H. Levesque, and D. Mitchell, “A new method for solving hard
satisﬁability problems,” Proc. Nat’l Conf. Am. Assoc. Artiﬁcial Intelligence,
pp. 440–446, 1992.

94
[43] B. Selman, H. A. Kautz, and B. Cohen, “Noise strategies for improving local
search,” Proc. Nat’l Conf. Am. Assoc. Artiﬁcial Intelligence, pp. 337–343, 1994.
[44] F. Hutter, D. A. D. Tompkins, and H. H. Hoos, Scaling and Probabilistic
Smoothing: Eﬃcient Dynamic Local Search for SAT. Springer Berlin Heidelberg, 2002, pp. 233–248.
[45] J. Holleman, S. Bridges, B. P. Otis, and C. Diorio, “A 3 w cmos true random number generator with adaptive ﬂoating-gate oﬀset cancellation,” IEEE
Journal of Solid-State Circuits, vol. 43, no. 5, pp. 1324–1336, 2008.
[46] D. J. Kinniment and E. G. Chester, “Design of an on-chip random number
generator using metastability,” Proc. European Solid-State Circuits Conference,
pp. 595–598, 2002.
[47] A. Balint and U. Schöning, “Choosing probability distributions for stochastic
local search and the role of make versus break,” in Theory and Applications of
Satisﬁability Testing – SAT 2012, A. Cimatti and R. Sebastiani, Eds. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2012, pp. 16–29.
[48] P. H. W. Leong, C. W. Sham, W. C. Wong, H. Y. Wong, W. S. Yuen, and
M. P. Leong, “A bitstream reconﬁgurable fpga implementation of the wsat
algorithm,” IEEE Transactions on VLSI Systems, vol. 9, no. 1, pp. 197–201,
2002.
[49] K. Kanazawa and T. Maruyama, “An fpga solver for wsat algorithms,” in International Conference on Field Programmable Logic and Applications, 2005.,
Aug 2005, pp. 83–88.
[50] T. Devolder, C. Chappert, J. A. Katine, M. J. Carey, and K. Ito, “Distribution
of the magnetization reversal duration in subnanosecond spin-transfer switching,” Phys. Rev. B, vol. 75, p. 064402, 2007.
[51] Y. Wang, H. Cai, L. A. d. B. Naviner, Y. Zhang, X. Zhao, E. Deng, J. O. Klein,
and W. Zhao, “Compact model of dielectric breakdown in spin-transfer torque
magnetic tunnel junction,” IEEE Transactions on Electron Devices, vol. 63,
no. 4, pp. 1762–1767, 2016.
[52] J. L. Garcı́a-Palacios and F. J. Lázaro, “Langevin-dynamics study
of the dynamical properties of small magnetic particles,” Phys.
Rev. B, vol. 58, pp. 14 937–14 958, Dec 1998. [Online]. Available:
https://link.aps.org/doi/10.1103/PhysRevB.58.14937
[53] Satlib. [Online]. Available: http://www.cs.ubc.ca/∼hoos/SATLIB/benchm.html
[54] [Online]. Available: https://www.cs.rochester.edu/u/kautz/walksat/
[55] [Online]. Available: http://satcompetition.org/edacc/sc14/experiment/24/solverconﬁgurations/1559
[56] [Online]. Available: http://www.satcompetition.org/
[57] L. Liu, C.-F. Pai, Y. Li, H. Tseng, D. Ralph, and R. Buhrman, “Spin-torque
switching with the giant spin Hall eﬀect of tantalum,” Science, vol. 336, no.
6081, pp. 555–558, 2012.

95
[58] I. M. Miron, K. Garello, G. Gaudin, P.-J. Zermatten, M. V. Costache, S. Auffret, S. Bandiera, B. Rodmacq, A. Schuhl, and P. Gambardella, “Perpendicular
switching of a single ferromagnetic layer induced by in-plane current injection,”
Nature, vol. 476, p. 189, 2011.
[59] L. Liu, O. Lee, T. Gudmundsen, D. Ralph, and R. Buhrman, “Current-induced
switching of perpendicularly magnetized magnetic layers using spin torque from
the spin Hall eﬀect,” Physical review letters, vol. 109, no. 9, p. 096602, 2012.
[60] K.-S. Ryu, L. Thomas, S.-H. Yang, and S. Parkin, “Chiral spin torque at magnetic domain walls,” Nature nanotechnology, vol. 8, no. 7, pp. 527–533, 2013.
[61] K.-S. Ryu, S.-H. Yang, L. Thomas, and S. S. Parkin, “Chiral spin torque arising
from proximity-induced magnetization,” Nature communications, vol. 5, 2014.
[62] S. Emori, U. Bauer, S.-M. Ahn, E. Martinez, and G. S. Beach, “Current-driven
dynamics of chiral ferromagnetic domain walls,” Nature materials, vol. 12, no. 7,
pp. 611–616, 2013.
[63] S. Emori, E. Martinez, K.-J. Lee, H.-W. Lee, U. Bauer, S.-M. Ahn, P. Agrawal,
D. C. Bono, and G. S. Beach, “Spin Hall torque magnetometry of Dzyaloshinskii
domain walls,” Physical Review B, vol. 90, no. 18, p. 184427, 2014.
[64] J. Hirsch, “Spin hall eﬀect,” Physical Review Letters, vol. 83, no. 9, p. 1834,
1999.
[65] G. Yu, P. Upadhyaya, Y. Fan, J. G. Alzate, W. Jiang, K. L. Wong, S. Takei,
S. A. Bender, L.-T. Chang, Y. Jiang et al., “Switching of perpendicular magnetization by spin-orbit torques in the absence of external magnetic ﬁelds,” Nature
nanotechnology, vol. 9, no. 7, pp. 548–554, 2014.
[66] L. You, O. Lee, D. Bhowmik, D. Labanowski, J. Hong, J. Bokor,
and S. Salahuddin, “Switching of perpendicularly polarized nanomagnets
with spin orbit torque without an external magnetic ﬁeld by engineering a tilted anisotropy,” Proceedings of the National Academy of
Sciences, vol. 112, no. 33, pp. 10 310–10 315, 2015. [Online]. Available:
http://www.pnas.org/content/112/33/10310
[67] J. Torrejon, F. Garcia-Sanchez, T. Taniguchi, J. Sinha, S. Mitani,
J.-V. Kim, and M. Hayashi, “Current-driven asymmetric magnetization switching in perpendicularly magnetized cofeb/mgo heterostructures,”
Phys. Rev. B, vol. 91, p. 214434, Jun 2015. [Online]. Available:
https://link.aps.org/doi/10.1103/PhysRevB.91.214434
[68] A. Sengupta, P. Panda, P. Wijesinghe, Y. Kim, and K. Roy, “Magnetic tunnel
junction mimics stochastic cortical spiking neurons,” Scientiﬁc reports, vol. 6,
2016.
[69] G. Srinivasan, A. Sengupta, and K. Roy, “Magnetic Tunnel Junction Based
Long-Term Short-Term Stochastic Synapse for a Spiking Neural Network with
On-Chip STDP Learning,” Scientiﬁc Reports, vol. 6, 2016.
[70] A. F. Murray, “Pulse arithmetic in vlsi neural networks,” IEEE MICRO, vol. 9,
no. 6, pp. 64–74, 1989.

96
[71] J. Kim, J. Sinha, M. Hayashi, M. Yamanouchi, S. Fukami, T. Suzuki, S. Mitani,
and H. Ohno, “Layer thickness dependence of the current-induced eﬀective ﬁeld
vector in ta—cofeb—mgo,” Nature Materials, vol. 12, pp. 240–245, 2013.
[72] F. V. Jensen, An introduction to Bayesian networks. UCL press London, 1996,
vol. 210.
[73] D. Heckerman, D. Geiger, and D. M. Chickering, “Learning Bayesian networks:
The combination of knowledge and statistical data,” Machine learning, vol. 20,
no. 3, pp. 197–243, 1995.
[74] T. D. Nielsen and F. V. Jensen, Bayesian networks and decision graphs.
Springer Science & Business Media, 2009.
[75] K. P. Murphy, An introduction to graphical models. Unpublished, 2001.
[76] M. Lin, I. Lebedev, and J. Wawrzynek, “High-throughput bayesian computing machine with reconﬁgurable hardware,” in Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays.
ACM, 2010, pp. 73–82.
[77] S. Zermani, C. Dezan, H. Chenini, J.-P. Diguet, and R. Euler, “FPGA implementation of bayesian network inference for an embedded diagnosis,” in Prognostics and Health Management (PHM), 2015 IEEE Conference on. IEEE,
2015, pp. 1–10.
[78] V. K. Mansinghka, E. M. Jonas, and J. B. Tenenbaum, “Stochastic digital circuits for probabilistic inference,” Massachussets Institute of Technology, Technical Report MITCSAIL-TR, vol. 2069, 2008.
[79] C. S. Thakur, S. Afshar, R. M. Wang, T. J. Hamilton, J. Tapson, and
A. Van Schaik, “Bayesian estimation and inference using stochastic electronics,”
Frontiers in neuroscience, vol. 10, 2016.
[80] P. Mroszczyk and P. Dudek, “The accuracy and scalability of continuous-time
bayesian inference in analogue CMOS circuits,” in Circuits and Systems (ISCAS), 2014 IEEE International Symposium on. IEEE, 2014, pp. 1576–1579.
[81] Z. Weijia, G. W. Ling, and Y. K. Seng, “PCMOS-based Hardware Implementation of Bayesian Network,” in Electron Devices and Solid-State Circuits, 2007.
EDSSC 2007. IEEE Conference on. IEEE, 2007, pp. 337–340.
[82] J. S. Friedman, L. E. Calvet, P. Bessière, J. Droulez, and D. Querlioz, “Bayesian
inference with Muller C-elements,” IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 63, no. 6, pp. 895–904, 2016.
[83] W. A. Borders, H. Akima, S. Fukami, S. Moriya, S. Kurihara, Y. Horio, S. Sato,
and H. Ohno, “Analogue spin–orbit torque device for artiﬁcial-neural-networkbased associative memory operation,” Applied Physics Express, vol. 10, no. 1,
p. 013007, 2016.
[84] D. E. Nikonov, G. Csaba, W. Porod, T. Shibata, D. Voils, D. Hammerstrom,
I. A. Young, and G. I. Bourianoﬀ, “Coupled-oscillator associative memory array
operation for pattern recognition,” IEEE Journal on Exploratory Solid-State
Computational Devices and Circuits, vol. 1, pp. 85–93, Dec 2015.

97
[85] S. P. Levitan, Y. Fang, D. H. Dash, T. Shibata, D. E. Nikonov, and G. I.
Bourianoﬀ, “Non-boolean associative architectures based on nano-oscillators,”
in 2012 13th International Workshop on Cellular Nanoscale Networks and their
Applications, Aug 2012, pp. 1–6.
[86] V. Narayanan, S. Datta, G. Cauwenberghs, D. Chiarulli, S. Levitan, and
P. Wong, “Video analytics using beyond cmos devices,” in 2014 Design, Automation Test in Europe Conference Exhibition (DATE), March 2014, pp. 1–5.
[87] D. Fan, S. Maji, K. Yogendra, M. Sharad, and K. Roy, “Injection-locked
spin hall-induced coupled-oscillators for energy eﬃcient associative computing,”
IEEE Transactions on Nanotechnology, vol. 14, no. 6, pp. 1083–1093, Nov 2015.
[88] J. A. Carpenter, Y. Fang, C. N. Gnegy, D. M. Chiarulli, and S. P. Levitan, “An
image processing pipeline using coupled oscillators,” in 2014 14th International
Workshop on Cellular Nanoscale Networks and their Applications (CNNA),
July 2014, pp. 1–2.
[89] T. Shibata, R. Zhang, S. P. Levitan, D. E. Nikonov, and G. I. Bourianoﬀ,
“Cmos supporting circuitries for nano-oscillator-based associative memories,”
in 2012 13th International Workshop on Cellular Nanoscale Networks and their
Applications, Aug 2012, pp. 1–5.
[90] G. Csaba, M. Pufall, D. E. Nikonov, G. I. Bourianoﬀ, A. Horvath, T. Roska, and
W. Porod, “Spin torque oscillator models for applications in associative memories,” in 2012 13th International Workshop on Cellular Nanoscale Networks
and their Applications, Aug 2012, pp. 1–2.
[91] P. Maﬀezzoni, B. Bahr, Z. Zhang, and L. Daniel, “Analysis and design of
boolean associative memories made of resonant oscillator arrays,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 11, pp. 1964–
1973, Nov 2016.
[92] N. Shukla, A. Parihar, M. Cotter, M. Barth, X. Li, N. Chandramoorthy, H. Paik,
D. G. Schlom, V. Narayanan, A. Raychowdhury, and S. Datta, “Pairwise coupled hybrid vanadium dioxide-mosfet (hvfet) oscillators for non-boolean associative computing,” in 2014 IEEE International Electron Devices Meeting, Dec
2014, pp. 28.7.1–28.7.4.
[93] S. Datta, N. Shukla, M. Cotter, A. Parihar, and A. Raychowdhury,
“Neuro inspired computing with coupled relaxation oscillators,” in 2014 51st
ACM/EDAC/IEEE Design Automation Conference (DAC), June 2014, pp. 1–6.
[94] A. Parihar, N. Shukla, M. Jerry, S. Datta, and A. Raychowdhury, “Vertex coloring of graphs via phase dynamics of coupled oscillatory networks,” Scientiﬁc
Reports, vol. 7, no. 1, 2017.
[95] S. Kaka, M. R. Pufall, W. H. Rippard, T. J. Silva, S. E. Russek, and J. A. Katine, “Mutual phase-locking of microwave spin torque nano-oscillators,” Nature,
vol. 437, no. 389, 2005.
[96] D. Houssameddine, U. Ebels, B. Delat, B. Rodmacq, I. Firastrau, F. Ponthenier, M. Brunet, C. Thirion, J.-P. Michel, L. Prejbeanu-Buda, M.-C. Cyrille,
O. Redon, and B. Dieny, “Spin-torque oscillator using a perpendicular polarizer
and a planar free layer,” Nature Materials, vol. 6, no. 447, 2007.

98
[97] P. M. Braganca, B. A. Gurney, B. A. Wilson, J. A. Katine, S. Maat, and
J. R. Childress, “Nanoscale magnetic ﬁeld detection using a spin torque
oscillator,” Nanotechnology, vol. 21, no. 23, p. 235202, 2010. [Online].
Available: http://stacks.iop.org/0957-4484/21/i=23/a=235202
[98] D. M. Chiarulli, B. Jennings, Y. Fang, A. Seel, and S. P. Levitan, “A computational primitive for convolution based on coupled oscillator arrays,” in 2015
IEEE Computer Society Annual Symposium on VLSI, July 2015, pp. 125–130.
[99] D. E. Nikonov, I. A. Young, and G. I. Bourianoﬀ. (2014) Convolutional
networks for image processing by coupled oscillator arrays. [Online]. Available:
https://arxiv.org/abs/1409.4469
[100] R.
Mehrotra,
K.
Namuduri,
and
N.
Ranganathan,
“Gabor ﬁlter-based edge detection,”
Pattern Recognition,
vol. 25,
no.
12,
pp.
1479
–
1494,
1992.
[Online].
Available:
http://www.sciencedirect.com/science/article/pii/003132039290121X
[101] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of
image data,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 18, no. 8, pp. 837–842, Aug 1996.
[102] W. H. Rippard, M. R. Pufall, S. Kaka, T. J. Silva, S. E. Russek,
and J. A. Katine, “Injection locking and phase control of spin transfer
nano-oscillators,” Phys. Rev. Lett., vol. 95, p. 067203, Aug 2005. [Online].
Available: https://link.aps.org/doi/10.1103/PhysRevLett.95.067203
[103] Y. Zhou, J. Persson, and J. kerman, “Intrinsic phase shift between a spin torque
oscillator and an alternating current,” Journal of Applied Physics, vol. 101,
no. 9, p. 09A510, 2007. [Online]. Available: https://doi.org/10.1063/1.2710740
[104] Y. Zhou, J. Persson, S. Bonetti, and J. kerman, “Tunable intrinsic phase of a
spin torque oscillator,” Applied Physics Letters, vol. 92, no. 9, p. 092505, 2008.
[Online]. Available: https://doi.org/10.1063/1.2891058

VITA

99

VITA
Yong Shim is currently pursuing a Ph.D. degree as Williams Fellow and Graduate
Research Assistant under the guidance of Prof. Kaushik Roy in the School of Electrical and Computer Engineering at Purdue University, West Lafayette, IN, US. His
research interests include emerging device base neuromorphic computing and approximate computing, as well as Complementary-Metal-Oxide-Semiconductor (CMOS)
interface circuits for the spin-devices such as Spin-Torque Oscillator (STO). He received the B.S. and M.S. from Korea University in 2004 and 2006, respectively, all
in electrical engineering. In 2006, he joined Samsung Electronics, Hwasung, Korea,
where he was involved in designing circuits for memory interface.

