Magnetoresistive random access memory (MRAM) technologies with thermally unstable nanomagnets are leveraged to develop an intrinsic stochastic neuron as a building block for restricted Boltzmann machines (RBMs) to form deep belief networks (DBNs). The embedded MRAM-based neuron is modeled using precise physics equations. The simulation results exhibit the desired sigmoidal relation between the input voltages and the probability of the output state. A probabilistic inference network simulator (PIN-Sim) is developed to realize a circuit-level model of an RBM utilizing resistive crossbar arrays along with differential amplifiers to implement the positive and negative weight values. PIN-Sim is composed of five main blocks to train a DBN, evaluate its accuracy, and measure its power consumption. The MNIST dataset is leveraged to investigate the energy and accuracy tradeoffs of seven distinct network topologies in SPICE using the 14nm HP-FinFET technology library with a nominal voltage of 0.8V, in which an MRAM-based neuron is used as the activation function. The software- and hardware-level simulations indicate that a 784 × 200 × 10 topology can achieve less than 5% error rates with ∼400 pJ energy consumption. The error rates can be reduced to 2.5% by using a 784 × 500 × 500 × 500 × 10 DBN at the cost of ∼10× higher energy consumption and significant area overhead. Finally, the effects of specific hardware-level parameters on power dissipation and accuracy tradeoffs are identified via the developed PIN-Sim framework.
INTRODUCTION
In recent years, innovation within the disciplines of machine intelligence and learning (ML) utilizing artificial neural networks (ANNs), which aim to model biological brain behavior, has grown significantly due to the vast datasets now available to train such networks. Notable achievements within these fields include solving complicated classification problems by utilizing the information-processing strength of ANNs [Basheer and Hajmeer 2000], performing pattern recognition tasks [Bishop et al. 1995], and even handing a world champion Go player a historic defeat [Churchland and Sejnowski 2016].
The techniques most commonly used to train ANNs today typically require supervised learning, where the error rate is measured by comparing the output of the network with a known desired output. Then, using a subsequent training technique such as backpropagation, the corresponding weights within the network are adjusted [Hecht-Nielsen 1992]. However, unsupervised learning has attracted considerable attention in recent years due to its compatibility with the nature of intelligent biological systems, which learn through observation, not by supervision [LeCun et al. 2015].
In unsupervised learning approaches, decision processes based on probabilistic inference are built upon statistical correlations that organize the inputs into categories [Buesing et al. 2011]. Deep belief networks (DBNs) are a prominent class of such probabilistic networks [Sarikaya et al. 2014]. DBNs are constructed from multiple restricted Boltzmann machines (RBMs), which can be hierarchically connected to form a network [Hinton et al. 2006].
Research focused on the software implementation of DBNs shows that conventional von Neumann architectures are poorly matched to the processing flow in terms of the constituent operations at a fine granularity. Although software implementations on conventional architectures provide flexibility, they require significant execution time and energy due to the memory-processor bandwidth bottleneck, which is intensified by the large matrix multiplications required [Merolla et al. 2014]. Therefore, hardware-based RBM design research seeks to surmount these limitations.
Previous work on RBM hardware implementations uses conventional VLSI design techniques [Yuan and Parhi 2017], FPGA approaches [Kim et al. 2010; Ly and Chow 2010], and stochastic CMOS methods [Ardakani et al. 2017]. Moreover, emerging technologies such as resistive RAM (RRAM) [Bojnordi and Ipek 2016; Sheri et al. 2015] and phase change memory (PCM) [Eryilmaz et al. 2016] have been utilized as weighted connections within the DBN architecture to interconnect its various building blocks. These hybrid memristor/CMOS designs attempt to realize an intrinsic implementation of the weighted connections. Recently, a current-driven low energy-barrier spintronic device has been proposed for use in RBMs as the activation function [Zand et al. 2018], while similar devices have been previously proposed for spiking [Sengupta et al. 2016b,c] and hard-axis clocked [Behin-Aein et al. 2016] neural systems. However, the current-mode operation of these devices imposes significant power consumption on the activation functions, while requiring weighted connections with MΩ-range resistances. The design proposed herein takes a new approach from the device level upward to overcome the aforementioned challenges by utilizing a voltage-driven spintronic device with embedded magnetoresistive random access memory (MRAM) constructed from low energy-barrier nanomagnets, which leverages intrinsic thermal noise to provide a natural and power-efficient building block for RBMs. Moreover, we propose a simulation framework for probabilistic learning networks, called PIN-Sim, which is utilized herein to realize a feasible circuit-level implementation of DBN architectures using a SPICE model of our proposed embedded MRAM-based neuron. Specifically, the main contributions of this paper are as follows:
1. A transportable Probabilistic Inference Network Simulator (PIN-Sim) to realize a circuit-level implementation of DBNs utilizing voltage-controlled embedded MRAM-based neurons as the probabilistic sigmoidal activation functions.
The PIN-Sim framework can be utilized for design space exploration to achieve an optimized network implementation based on the application requirements.
2. Detailed results and analyses about the effects of various circuit-level and device-level tunable parameters on the accuracy and power consumption of the DBNs implemented by PIN-Sim framework.
3. Discussions regarding the effects of noise and of variations in the resistance of the weighted connections on the accuracy of our proposed probabilistic spin logic-based DBN circuits.
The remainder of the paper is organized as follows. Section 2 describes the fundamentals of RBMs and the contrastive divergence (CD) unsupervised learning algorithm. The structure and modeling methodology of the proposed neuron with embedded MRAM are elaborated in Section 3. Section 4 provides details about the circuit-level implementation of DBNs using our proposed PIN-Sim framework. The software- and hardware-level simulation results are provided in Section 5, as well as a comprehensive comparison between our proposed DBN realization and previous hardware implementations. Finally, Section 6 concludes the paper by relating its contributions, as well as the improvements achieved by the proposed MRAM-based neuron and PIN-Sim framework.

Fig. 1. (a) An RBM structure depicting neurons organized into hidden and visible layers, (b) a 3 × 3 RBM implemented within a 4 × 4 crossbar architecture using a weighted array to generate the resistances needed to appropriately control the activation function, (c) a DBN structure constructed from multiple hidden layers, which act to increase recognition accuracy.
RESTRICTED BOLTZMANN MACHINES
Restricted Boltzmann machines (RBMs) are a class of recurrent stochastic neural networks, in which each state of the network, $k$, has an energy determined by the connection weights between nodes and the node biases as described by Equation 1, where $s_i^k$ is the state of node $i$ in $k$, $b_i$ is the bias, or intrinsic excitability, of node $i$, and $w_{ij}$ is the connection weight between nodes $i$ and $j$ [Ackley et al. 1985].
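In its standard form [Ackley et al. 1985], this energy is

$$E_k = -\sum_i b_i\, s_i^k - \sum_{i<j} w_{ij}\, s_i^k s_j^k. \quad (1)$$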
Each node in an RBM has a probability of being in state one according to Equation 2, where $\sigma$ is the sigmoid function.
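$$P(s_i = 1) = \sigma\Big(b_i + \sum_j s_j w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}. \quad (2)$$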
RBMs, when given sufficient time, reach a Boltzmann distribution where the probability of the system being in state s is found by Equation 3, where u could be any possible state of the system. Thus, the system is most likely to be found in states that have the lowest associated energy.
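$$P(s) = \frac{e^{-E_s}}{\sum_u e^{-E_u}} \quad (3)$$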
RBMs are constrained to two fully-connected non-recurrent layers called the visible layer and the hidden layer.
As shown in Figure 1 , RBMs can be readily implemented by a crossbar architecture. The most well-known approach for training RBMs is contrastive divergence (CD), which is an approximate gradient descent procedure using Gibbs sampling [Carreira-Perpinan and Hinton 2005] . CD operates in four steps as described below:
1. Feed-Forward 1: The training input vector, v, is applied to the visible layer, and the hidden layer, h, is sampled.
2. Feed-back: The sampled hidden layer output is fed-back and the generated input (v ′ ) is sampled.
3. Feed-Forward 2: v ′ is applied to the visible layer and the reconstructed hidden layer is sampled to obtain h ′ .
4. Update: The weights are updated according to Equation 4, where η is the learning rate and W is the weight matrix.
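$$\Delta W = \eta\,\big(v\,h^{T} - v'\,h'^{T}\big) \quad (4)$$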
RBMs can be readily stacked to realize a DBN, which can be trained similarly to RBMs. The training process is conducted by executing CD starting with the visible layer and the first of the hidden layers within the network. CD is then repeated as many times as required, adjusting the weights in a hierarchical flow as described in Algorithm 1, with an illustrative sketch shown below.
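For concreteness, the following is a minimal NumPy sketch of one CD-1 weight update for a single RBM (Steps 1-4 above); the function names and batch-based signature are ours, and the ideal sigmoid stands in for the MRAM-based activation introduced in Section 3.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Draw binary states: each node is one with probability p (Equation 2)
    return (np.random.rand(*p.shape) < p).astype(float)

def cd1_step(v, W, b_v, b_h, eta=0.1):
    """One contrastive-divergence (CD-1) update for a single RBM.

    v: batch of visible vectors (batch x n_visible)
    W: weights (n_visible x n_hidden); b_v, b_h: visible/hidden biases
    """
    # Step 1 (Feed-Forward 1): sample the hidden layer from the training input
    h_prob = sigmoid(v @ W + b_h)
    h = sample(h_prob)
    # Step 2 (Feed-back): reconstruct the visible layer v'
    v_prime = sample(sigmoid(h @ W.T + b_v))
    # Step 3 (Feed-Forward 2): recompute hidden probabilities h' from v'
    h_prime = sigmoid(v_prime @ W + b_h)
    # Step 4 (Update): Delta W = eta * (v h^T - v' h'^T), averaged over the batch
    W += eta * (v.T @ h_prob - v_prime.T @ h_prime) / len(v)
    b_v += eta * (v - v_prime).mean(axis=0)
    b_h += eta * (h_prob - h_prime).mean(axis=0)
    return W, b_v, b_h
```

Training a DBN then proceeds greedily: this update is run on the first RBM, whose sampled hidden activations serve as the visible data for the next RBM, and so on up the stack.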
EMBEDDED MRAM BASED NEURON AS A BUILDING BLOCK FOR RBMS
The basic building block of Boltzmann machines is a stochastic binary neuron that produces a binary output with a given probability. This probability is modulated by the weighted input the neuron receives from the other neurons [Hinton et al. 1984], as shown in Figure 2(a). Here, we show that a recently proposed building block that leverages the highly scaled embedded magnetoresistive random access memory (MRAM) technology, which is conventionally used as a memory device, can enable an approximate hardware representation of the binary stochastic neuron in the RBM structure, as shown in Figure 2(b).
The functional component of an MRAM architecture is a magnetic tunnel junction (MTJ), a multilayer two-terminal device that exhibits a resistance change depending on the orientation of its magnetic layers. One of these magnetic layers is designed to have a fixed magnetic orientation (fixed layer), while the magnetization of the other layer can be switched by a magnetic field or by a spin-polarized current (free layer). In the latter case, a current that flows through the fixed layer can exert a "spin-transfer torque" to switch the magnetization of the free layer, allowing an electrical writing mechanism [Bhatti et al. 2017]. In conventional memory devices, the free layer is designed to have a large energy barrier with respect to the thermal energy (kT) so that the free layer can function as a non-volatile memory. In recent years, the use of superparamagnetic MTJs that are not thermally stable has been experimentally and theoretically investigated in search of functional spintronic devices [Camsari et al. 2017a; Choi et al. 2014; Debashis et al. 2016; Liyanagedera et al. 2017; Locatelli et al. 2014; Mizrahi et al. 2018; Sutton et al. 2017; Zand et al. 2018].
In this paper, we use a recently proposed design that makes minimal modifications to the 1 Transistor / 1 MTJ architecture of the commercially available embedded MRAM technology [Camsari et al. 2017b]. The first modification is to replace the stable free layer with a low-barrier nanomagnet ($E_B \ll 40kT$), which can be achieved either by reducing the total number of spins in the nanomagnet (by reducing $M_s \mathrm{Vol.}$, where $M_s$ is the saturation magnetization and $\mathrm{Vol.}$ is the volume [Bapna and Majetich 2017]) or by using circular disk magnets that have no preferential easy axis [Debashis et al. 2016]. The resistance of an MTJ with such a low-barrier nanomagnet randomly fluctuates between the high ($R_{AP}$) and low ($R_P$) resistance states, creating a fluctuating output voltage at the drain of the NMOS transistor; a CMOS inverter can amplify these fluctuations to produce a rail-to-rail stochastic output at this input value. Changing the input voltage modulates the transistor resistance and can suppress these fluctuating outputs, either by making the transistor resistance too small and shorting the output to ground, or by making the transistor resistance too high and pulling the output node to $V_{DD}$. The basic device operation can be understood by considering the MTJ conductance [Camsari et al. 2017b]:
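$$G_{MTJ} = G_0 \left[ 1 + m_z\, \frac{TMR}{2 + TMR} \right] \quad (5)$$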
where $m_z$ is the instantaneous free-layer magnetization that fluctuates stochastically in the presence of thermal noise, $G_0$ is the average MTJ conductance, $(G_P + G_{AP})/2$, and $TMR$ is the tunneling magnetoresistance ratio, defined as $TMR = (G_P - G_{AP})/G_{AP}$. The voltage division between the transistor and the MTJ (Figure 2(b)) produces a drain voltage that can be expressed as:
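$$\frac{V_{DRAIN}}{V_{DD}} = \frac{G_{MTJ}}{G_{MTJ} + G_T} = \frac{1 + m_z\, TMR/(2+TMR)}{1 + \alpha + m_z\, TMR/(2+TMR)} \quad (6)$$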
where we introduce a parameter, $\alpha$, defined as the ratio of the transistor conductance ($G_T$) to the average MTJ conductance ($G_0$), i.e., $\alpha = G_T/G_0$. As the input voltage $V_{IN}$ changes the transistor conductance $G_T$, the drain output behaves as a noisy inverter. It can be seen from Equation 6 that the noise amplitude at the drain is maximized when $\alpha \approx 1$; therefore, the MTJ resistance is matched to the NMOS resistance ($\alpha = 1$) at $V_{IN}/V_{DD} = 0.5$ to obtain an output with large fluctuations at the symmetry point. Even though the drain voltage shows fluctuations on the order of hundreds of mV for typical TMR values, an additional inverter is used to amplify the noise and produce rail-to-rail voltages over a range of input voltages.
The full circuit behavior of the embedded MRAM-based neuron is modeled by solving the magnetization dynamics of the low-barrier nanomagnet using the stochastic Landau-Lifshitz-Gilbert (LLG) equation self-consistently with the transport equations in a SPICE framework [Camsari et al. 2015]. The NMOS transistor is modeled by the predictive technology models (PTM), and for simplicity a bias-independent MTJ model is used, following Equation 5. The magnetization input for the MTJ conductance is provided instantaneously from the stochastic LLG equation, which reads:
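$$\left(1+\alpha^2\right)\frac{d\hat{m}}{dt} = -|\gamma|\,\hat{m}\times\vec{H} \;-\; \alpha|\gamma|\,\hat{m}\times\big(\hat{m}\times\vec{H}\big) \;+\; \frac{1}{qN}\,\hat{m}\times\big(\vec{I}_S\times\hat{m}\big) \;+\; \frac{\alpha}{qN}\,\big(\hat{m}\times\vec{I}_S\big)$$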
where $\alpha$ is the damping coefficient of the nanomagnet, $\gamma$ is the electron gyromagnetic ratio, $q$ is the electron charge, and $\vec{I}_S$ is the spin current incident on the free layer. The spin current is polarized along the direction of the fixed-layer polarization ($\hat{z}$), and its amplitude is proportional to the charge current $I_c$ flowing through the MTJ.
$N$ is the total number of spins in the free layer (CoFeB), $N = M_s \mathrm{Vol.}/\mu_B$, where $M_s$ is the saturation magnetization of CoFeB and $\mu_B$ is the Bohr magneton. For the free layer, we use a monodomain circular disk magnet whose effective field $\vec{H}$ is given as $-4\pi M_s m_x \hat{x} + \vec{H}_n$, with $\hat{x}$ being the out-of-plane direction of the magnet. $\vec{H}_n$ is the isotropic thermal noise field, uncorrelated in the three directions:
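$$\left\langle H_n^2 \right\rangle = \frac{2\alpha\, kT}{|\gamma|\, M_s\, \mathrm{Vol.}}$$

per unit bandwidth in each direction (the standard Brown thermal-field strength).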
In this paper, we use a circular disk magnet with an energy barrier ≪ kT in the absence of any shape anisotropy. Such magnets have been fabricated and characterized in [Cowburn et al. 1999; Ostwal et al. 2018].
Moreover, elliptical magnets showing GHz telegraphic oscillations have also been experimentally observed [Pufall et al. 2004]. The demonstrated parameters listed in Table 1 [Camsari et al. 2017b] are used to generate all of the results provided within this paper. We also note that, for the chosen parameters with a circular free layer with in-plane anisotropy, the results are not significantly influenced by the current flowing at the midpoint ($V_{IN} = V_{DD}/2$), and that any pinning at higher input voltages benefits the switching operation of the device.

Fig. 3. An n × m RBM hardware implementation. Two resistive arrays are leveraged along with differential amplifiers to implement both positive and negative weights. The embedded MRAM-based neurons are used to evaluate the activation functions. The fluctuating output voltages of the neurons are integrated through an RC circuit to generate the output of the proposed RBM structure.
RBM Hardware Implementation
Figure 3 exhibits a feasible hardware implementation of an n × m RBM, in which neurons based on the concise embedded MRAM-based design described in the previous section are used to generate the required probabilistic sigmoidal activation function. Resistive crossbar arrays are utilized to realize the matrix multiplication elaborated in Equation 2. In this work, the weights are trained off-chip, and the resistive weighted connections are programmed accordingly. Any resistive devices, such as memristors [Strukov et al. 2008] or spin-orbit torque (SOT)-driven domain wall motion (DWM) devices [Sengupta et al. 2016a], can be utilized for the weighted connections without loss of generality.
PROPOSED DBN STRUCTURE
To implement the positive and negative weights in the W matrix, two resistive weighted arrays with the same dimensions are required [Hu et al. 2012], as shown in Figure 3. The outputs of the positive and negative weighted connections are linked to differential amplifiers, which are implemented by op-amps as shown in Figure 3. The output voltage of the op-amp is applied to the MRAM-based neuron as an input signal.
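Under ideal op-amp assumptions, this voltage is a sketch of the standard crossbar readout [Hu et al. 2012]: the input delivered to neuron $i$ is the weighted difference of the two branch currents scaled by the amplifier feedback resistance $R_f$ (one of the differential-amplifier resistor values in Table 2),

$$V_i \approx R_f \sum_j V_j\,\big(g_{ij}^{+} - g_{ij}^{-}\big),$$

where $g_{ij}^{+}$ and $g_{ij}^{-}$ are the conductances of the corresponding positive and negative weighted connections and $V_j$ are the outputs of the previous layer.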
The neuron with embedded MRAM generates an output voltage signal that fluctuates between VDD and GND with a probability modulated by the applied input voltage. Finally, a resistor-capacitor (RC) integrator circuit is utilized to convert the probabilistic output of the neuron to an analog voltage level, which can later be converted to a digital output through analog-to-digital conversion. In order to verify the functionality and assess the performance of our proposed RBM implementation, we have simulated a 2 × 2 RBM via SPICE circuit simulation using the 14nm HP-FinFET technology library with an MRAM-based neuron used as the activation function. The results obtained validate the functionality of our proposed design, as elaborated in Figure 4.
Probabilistic Inference Network Simulator (PIN-Sim)
In order to automate and scale up the design space exploration of DBNs at the circuit level, we have developed a hierarchical simulation framework called PIN-Sim, which can be utilized to implement arbitrary probabilistic learning networks. The block diagram of the PIN-Sim framework used to implement DBNs in our work is shown in Figure 5; it is composed of five primary blocks. The PIN-Sim methodology is described in Algorithm 2. First, we have modified a MATLAB implementation of DBNs developed in [Tanaka and Okutomi 2014] to train the network and obtain the trained weight (W) and bias (B) matrices according to Algorithm 1. The extracted W and B matrices are then applied to a MATLAB module called mapWEIGHT, the functionality of which is described in Algorithm 3. The mapWEIGHT module first converts each of the W and B matrices, which contain both positive and negative elements, into two separate matrices with only positive elements, as described below:
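$$w_{ij}^{+} = \begin{cases} w_{ij}, & w_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad w_{ij}^{-} = \begin{cases} |w_{ij}|, & w_{ij} < 0 \\ 0, & \text{otherwise} \end{cases}$$

and analogously for $B^+$ and $B^-$, so that $W = W^+ - W^-$ and $B = B^+ - B^-$.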
Next, the mapWEIGHT module maps the elements of the $W^+$, $W^-$, $B^+$, and $B^-$ matrices to their corresponding conductance values using the following equations:
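$$GW^{\pm}(i,j) = g_{min} + \frac{W^{\pm}(i,j) - w_{min}}{w_{max} - w_{min}}\,\big(g_{max} - g_{min}\big) \quad (10)$$

$$GB^{\pm}(i,j) = g_{min} + \frac{B^{\pm}(i,j) - b_{min}}{b_{max} - b_{min}}\,\big(g_{max} - g_{min}\big) \quad (11)$$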
where $\forall g(i,j) \in G : g_{min} \le g(i,j) \le g_{max}$, in which $g_{min} = 1/r_{max}$ and $g_{max} = 1/r_{min}$ are the minimum and maximum conductances of all weighted connections in the crossbar weighted array. Moreover, $b_{max}$, $b_{min}$, $w_{max}$, and $w_{min}$ are the maximum and minimum values over all of the bias and weight matrices, respectively. Finally, Equation 12 is utilized to convert and quantize all of the obtained conductance values to their corresponding resistance values, which are written out as text files (posWeight, negWeight, posBias, and negBias in Algorithm 3) and can then be utilized to implement the required resistive crossbar array:
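$$R(i,j) = r_{min} + \operatorname{round}\!\left(\frac{1/g(i,j) - r_{min}}{(r_{max} - r_{min})/Q}\right)\cdot\frac{r_{max} - r_{min}}{Q}, \qquad g \in \{GW^{+}, GW^{-}, GB^{+}, GB^{-}\} \quad (12)$$

i.e., each resistance is rounded to the nearest of the $Q$ equal intervals between $r_{min}$ and $r_{max}$.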
where $Q$ is the quantization factor, and $GW^{+}$, $GW^{-}$, $GB^{+}$, and $GB^{-}$ are the positive weight, negative weight, positive bias, and negative bias conductance matrices, respectively.
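A minimal Python sketch of this mapping (the role played by the mapWEIGHT module in Algorithm 3) is given below; the function name, defaults, and NumPy-based structure are ours rather than the PIN-Sim source, assuming the linear map and resistance-space quantization described above.

```python
import numpy as np

def map_weight(W, r_min=1e3, delta_rw=4.0, Q=8):
    """Map a trained weight (or bias) matrix to quantized crossbar resistances.

    Split into positive/negative parts, map linearly to conductances
    (Equations 10-11), then quantize to Q equal resistance intervals
    between r_min and r_max (Equation 12).
    """
    r_max = (1.0 + delta_rw) * r_min              # delta_rw = 4.0 is a 400% range
    g_min, g_max = 1.0 / r_max, 1.0 / r_min

    W_pos = np.where(W > 0, W, 0.0)               # positive-weight matrix W+
    W_neg = np.where(W < 0, -W, 0.0)              # negative-weight matrix W-

    # Extremes taken over the split matrices (an assumption of this sketch)
    w_min, w_max = 0.0, np.abs(W).max()

    def to_resistance(Wp):
        g = g_min + (Wp - w_min) / (w_max - w_min) * (g_max - g_min)
        step = (r_max - r_min) / Q                # Q equal resistance intervals
        return r_min + np.round((1.0 / g - r_min) / step) * step

    return to_resistance(W_pos), to_resistance(W_neg)

# Example: a random 784 x 200 layer quantized with Q = 8 (0.5 kOhm steps)
RW_pos, RW_neg = map_weight(np.random.randn(784, 200))
```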
Once the positive and negative weight and bias resistance matrices are obtained, they are converted to text files and applied to a Python module called mapRBM.py, shown in Figure 5, which automatically produces the required crossbar weighted-array circuits in SPICE based on the defined network topology. Finally, a testDBN.py module is developed using Python scripts, which utilizes the generated DBN circuit and the model of the probabilistic neuron to perform a SPICE circuit simulation and calculate the error rate using the test inputs and test labels, which are provided to the testDBN module in the form of text files.
SIMULATION RESULTS AND DISCUSSION
Herein, we have leveraged a hierarchical simulation method to examine the performance of our DBN implementation.
In the software-level simulation, the behavioral results of the developed embedded MRAM-based neuron model are used to implement a DBN in MATLAB for the MNIST pattern recognition application [Lecun et al. 1998]. In the hardware-level simulation, the proposed framework is used to develop a circuit-level DBN implementation using the p-bit SPICE model and the 14nm CMOS technology library in a SPICE circuit simulator with a 0.8V nominal voltage.
MATLAB simulation
Herein, we have modified the sigmoid activation function in a MATLAB implementation of the DBN [Tanaka and Okutomi 2014] using the device-level simulation results of the proposed embedded MRAM-based neuron. To assess the performance of the implemented DBN, we have used the MNIST dataset [Lecun et al. 1998], including 60,000 training and 10,000 test sample images of hand-written digits, each of which has 28 × 28 pixels. We have used the error rate (ERR) and root-mean-square error (RMSE) metrics to evaluate the performance of the DBN, as expressed by the following equations [Tanaka and Okutomi 2014]:
$$ERR = \frac{N_F}{N} \qquad\qquad RMSE = \sqrt{\frac{1}{MN}\sum_{k=1}^{N}\,\big\| F(x_k) - y_k \big\|^2}$$
where M is the number of output classes, N is the number of input samples, $N_F$ is the number of false inferences, F is the inference of the trained DBN, $x_k$ is the k-th input, and $y_k$ represents its corresponding target output.
As shown in Figure 6, the most elementary DBN model requires 784 nodes in the visible layer for the 28 × 28 pixels of the input images and 10 nodes in the hidden layer, which represent the 0-9 output digits. Figure 7 shows the relation between the error rate and the number of training samples for seven distinct DBN topologies, obtained using 1,000 test samples. The MATLAB simulation results exhibit that the 28.2% error rate of a 784 × 10 DBN trained with 500 training inputs can be decreased to a 2.5% error rate using the 784 × 500 × 500 × 500 × 10 and 784 × 500 × 500 × 10 DBN topologies, which are trained with 10,000 training samples. Thus, the recognition accuracy can be improved by increasing the number of hidden layers in the network, the number of nodes in each layer, and the number of training samples. However, these improvements can lead to higher power consumption and area overheads, as investigated in the hardware-level simulations elaborated below.
PIN-Sim simulation
In this section, we utilize our proposed PIN-Sim framework to provide a circuit-level model of the DBN architecture. Next, we provide the energy and power consumption profiles of the seven DBN topologies investigated in the previous section to analyze their energy and accuracy trade-offs. Finally, we focus on the effects of several important hardware-level parameters; these are vital during design space exploration, as they influence the accuracy of DBN architectures through the tradeoffs necessary to obtain an efficient hardware-level implementation for pattern recognition applications.
Power and Energy Consumption Analysis. Figure 8(a) depicts the power consumption of various DBN topologies while evaluating a single input image. As shown, a significant amount of power is consumed in the weighted connections, while less than 10% of the total power is consumed in the neurons of the embedded MRAM-based p-bit approach. For instance, the total power consumption of a 784 × 200 × 10 DBN is approximately 86 mW, of which only 5.6 mW is dissipated in the activation functions. This is achieved by using the proposed power-efficient embedded MRAM-based neurons to implement the activation functions, as opposed to more elaborate floating-point circuits and pseudo-random number generators. Moreover, it is shown that the total power consumption depends primarily upon the aggregate number of neurons in a network, not the number of layers. For instance, the power consumption of a 784 × 500 × 10 DBN is greater than that of a 784 × 200 × 200 × 10 network, although the latter has a higher number of hidden layers. However, the test operation delay is linearly proportional to the number of hidden layers, as determined by the signal propagation and computation progression. In particular, the RC integrator circuit shown in Figure 3 is sampled every 2 ns, leading to an operating clock frequency of 500 MHz and a delay of 2 ns for each RBM.
Thus, the 784 × 200 × 200 × 10 DBN mentioned above requires three clock cycles to complete the evaluation operation, while a 784 × 500 × 10 DBN can produce its output in two clock cycles. Figure 8(b) shows the energy consumption of the various DBN topologies, which simultaneously captures the impact of the number of nodes and of hidden layers on power consumption and delay, respectively.

Table 2 (excerpt): resistances of the resistors in the differential amplifiers; resistance and capacitance of the RC integrator circuits: 100 kΩ, 20 fF.
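As a rough cross-check (our arithmetic, not a reported figure): the 784 × 200 × 10 topology comprises two stacked RBMs, so its evaluation takes two 2 ns cycles, giving

$$E \approx P \times n_{RBM} \times T_{clk} = 86\,\mathrm{mW} \times 2 \times 2\,\mathrm{ns} \approx 344\,\mathrm{pJ},$$

consistent with the ∼400 pJ reported for this topology.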
PIN-Sim tunable parameters and their effect on DBN performance. Table 2 lists the tunable parameters in the PIN-Sim framework, which can be adjusted based on the application requirements. The last column of the table shows the default values utilized herein for the MNIST digit recognition application. Figure 9 shows the output voltages of the neurons in the last hidden layer of a 784 × 200 × 10 DBN utilized for MNIST pattern recognition tasks, each of which represents an output class. The probabilistic outputs of the p-bit devices are shown in Figure 9(a), while the corresponding integrated analog output voltages are shown in Figure 9(b).

Next, we focus on the effect of some of the tunable parameters on the accuracy and power consumption of the DBN architectures implemented by the proposed PIN-Sim framework. First, the effect of $\Delta R_W$ is investigated, which defines the possible resistance range of the weights and biases as follows: $r_{max} = (1 + \frac{\Delta R_W}{100}) \times r_{min}$. The $r_{max}$ and $r_{min}$ parameters are utilized in the mapWEIGHT module of the PIN-Sim tool to map the trained weights and biases to their corresponding resistance values according to Equations 10 and 11, respectively. Figure 10(a) shows the effect of $\Delta R_W$ on the recognition accuracy and power consumption of our default 784 × 200 × 10 DBN implementation. As can be seen in the figure, the error rate is reduced from 53% to 24% by increasing $\Delta R_W$ from 100% to 400%; however, no significant change in the error rate is observed for $\Delta R_W$ values larger than 400%. These results are particularly relevant for magnetic tunnel junction (MTJ)-based weighted connections [Roy et al. 2018; Sengupta et al. 2016a], in which the difference between the maximum and minimum resistance is defined by the tunneling magneto-resistance (TMR) effect. The results obtained show that a TMR of 400% could be adequate to achieve the desired error rate; however, it is worth noting that this is quite application specific and can vary for different datasets. These results are noteworthy since the realization of higher TMR values would impose more complex fabrication processes [Parkin et al. 2004]: TMR values of 700% have been demonstrated experimentally, and values of 250% [Wang et al. 2018] via currently scalable means. Moreover, as shown in Figure 10(a), increasing $\Delta R_W$ reduces the power dissipated in the weighted array, while the power dissipated in the activation functions remains almost unchanged. The higher resistance range for the weighted connections increases the overall resistance of the weighted array; therefore, since the input voltages remain unchanged, the current flowing through the synapses is decreased, which consequently reduces the power dissipated in the weighted array.

In practice, providing an accurate and continuous range of weight resistances at the nanoscale is not attainable due to fabrication complexities and process variation. Therefore, a realistic circuit-level model of the resistive crossbar architecture should leverage quantized weights. Thus, to support design space exploration in the PIN-Sim framework, we have assigned a quantization factor (Q) parameter, which can be tuned by the user based on the application requirements. Figure 10(b) shows the error rate for various quantization factors in a 784 × 200 × 10 DBN with $\Delta R_W$ = 400% that is trained with 3,000 training samples. As shown, the error rate for the hardware implementation with Q = 4, which means the weights are discretized into four equal intervals between $R_{min}$ and $R_{max}$, is increased to 21.2% from the 19% error rate achieved by the DBN with unquantized weights.
As expected, this increase in the error rate is mainly caused by the information loss that occurs during discretization. Moreover, implementations with larger Q values result in error rates closer to that of the DBN with unquantized weights, which is also expected, since the discretization intervals become so small that the weight values approach their unquantized values. However, an interesting phenomenon can be observed in the hardware implementation with Q = 8, where an error rate of 17.8% is realized, which is lower than the error rate of the unquantized DBN. We have performed multiple tests to ensure that this is a repeatable behavior for DBNs with Q = 8, and in all of the cases the error rate obtained was lower than that of the DBN with unquantized weights. These results can be particularly interesting for hardware implementation since, for instance, in our examined case there is a 0.5 kΩ gap between adjacent weight resistances, considering $R_{min}$ = 1 kΩ and $\Delta R_W$ = 400%, which can provide some robustness against process variations without incurring a significant increase in the error rate.

In particular, we have investigated the impact of variations in the input voltages of the neurons, which can be induced by different noise sources, as well as variations in the resistance of the weighted connections, on the recognition accuracy of the network. According to the results shown in Figure 11(a), a 784 × 200 × 10 DBN trained with 3,000 images loses approximately 1% accuracy in the presence of variations in the weighted connections.
Discussion
Some of the previous hardware implementations of DBNs are listed in Table 3. The designs proposed in [Kim et al. 2010; Ly and Chow 2010] leverage FPGAs to achieve speedups of 25-145× compared to software implementations; however, these approaches suffer from constrained clock frequencies and routing congestion, as well as major resource deficiencies due to the significant embedded memory utilization for both weighted connections and activation functions. In [Yuan and Parhi 2017], the authors proposed optimization methods to reduce the memory requirements for weights and biases; however, implementing each activation function still requires dedicated piecewise-linear approximator, random number generator (RNG), and comparator circuits, which lead to greater area and energy consumption per neuron than the embedded MRAM-based approach herein. In [Ardakani et al. 2017], the low-complexity characteristics of stochastic CMOS-based arithmetic units are leveraged to implement an RBM with reduced area and power consumption.
However, the large number of linear feedback shift registers (LFSRs) required to generate the long input and weight bit-streams results in increased latencies that considerably limit the energy savings.
On the other hand, emerging technologies such as resistive RAM (RRAM) and phase change memory (PCM) have recently been utilized within crossbar arrays to implement the matrix multiplication within RBMs [Bojnordi and Ipek 2016; Eryilmaz et al. 2016; Sheri et al. 2015]. In particular, [Bojnordi and Ipek 2016] achieved 100× and 10× improvements in operation speed and energy consumption, respectively, compared to single-threaded cores by using RRAM devices as weighted connections. In all of the above-mentioned designs, CMOS-based circuits such as multipliers and RNGs are utilized to realize the probabilistic behavior of the activation functions. In [Zand et al. 2018], the authors utilized low energy-barrier spin-orbit torque (SOT) MTJs to implement the probabilistic sigmoidal activation function, which realizes significant area and energy reductions. However, the current-mode behavior of the SOT-MTJ devices imposes significant power consumption on the activation functions, while requiring weighted connections with MΩ-range resistances, which can incur significant area overhead and fabrication complexity [Sengupta et al. 2016a; Yuasa et al. 2004]. The work presented herein utilizes a voltage-driven embedded MRAM-based neuron with low energy-barrier unstable nanomagnets, which leverages intrinsic thermal noise to generate the sigmoidal probabilistic activation functions required for RBMs within a power-efficient package. As listed in Table 3, the proposed RBM implementation using the embedded MRAM-based neuron can achieve approximately three orders of magnitude energy reduction compared to previous energy-efficient CMOS-based implementations, while realizing at least a 90× device count reduction.
However, as described in previous sections, the embedded MRAM-based neuron requires an RC circuit to integrate its output voltage. The SPICE circuit simulation results exhibit an approximate average energy consumption of 10-20 fJ for the RC circuit, as listed in Table 3. Moreover, the area required to implement the RC circuit with a 100 kΩ resistor and 20 fF capacitor is approximately three times larger than that of the MRAM-based neuron [Scott 1998; Stengel and Spaldin 2006]. Thus, the proposed MRAM-based activation function can achieve approximately 20× and 300× area reductions compared to the CMOS-based stochastic neurons proposed in [Ardakani et al. 2017] and [Bojnordi and Ipek 2016], respectively. The area of the MRAM-based neuron, which is utilized as the baseline for the area comparisons, is approximately equal to 32λ × 32λ, as obtained from the layout design, in which λ is a technology-dependent parameter.
Herein, we have used the 14nm FinFET technology, which leads to an approximate area consumption of 0.05 µm² per neuron. MRAM devices can be fabricated on top of the transistors, thus incurring near-zero area overhead.
CONCLUSIONS
Herein, it was shown that embedded MRAM-based neurons with thermally unstable superparamagnetic MTJs can realize a probabilistic output that can be modulated by an input voltage. The magnetization dynamics of the MRAM-based stochastic neuron are modeled by solving the LLG equation for a low energy-barrier nanomagnet. The device-level simulations exhibited the desired sigmoidal relation between the input voltages and the output probability of the neuron. Once the functionality of the proposed stochastic neuron was verified, we developed an embedded MRAM-based RBM leveraging two resistive crossbar arrays with differential amplifiers to implement the matrix multiplication operation for both positive and negative weights. SPICE circuit simulations of a 2 × 2 weighted array validated the functionality of the proposed embedded MRAM-based RBM.
To provide a circuit-level implementation of DBNs, we have developed the PIN-Sim framework, a transportable framework for rapid, automated, and accurate design space exploration of hybrid CMOS and post-CMOS neuromorphic circuits. PIN-Sim is composed of five main modules to train the network, map the trained weights to their corresponding resistances, create the SPICE model of the RBMs, and measure the accuracy and energy consumption. The MNIST dataset was utilized to investigate the accuracy and energy tradeoffs of seven distinct DBN topologies implemented by the developed PIN-Sim framework. The simulation results showed that at least two hidden layers are required to achieve suitable error rates. In particular, a 784 × 200 × 10 DBN can realize a 5% error rate while consuming less than 500 pJ of energy. The error rates could be decreased to 2.5% by using a 784 × 500 × 500 × 500 × 10 DBN topology trained with 10,000 input training samples, at the cost of ∼10× higher energy consumption and significantly larger area overhead.
Moreover, PIN-Sim can be used to optimize network topologies based on different application requirements for energy versus accuracy tradeoffs.
Next, we focused on the effect on network performance of various hardware-level parameters that can be adjusted in the PIN-Sim tool. One parameter that is especially important for MTJ- and RRAM-based crossbar architectures is the difference between the largest and smallest possible resistance values of a weighted connection ($\Delta R_W$). It was shown that a $\Delta R_W$ of at least 400% is required to realize suitable error rates; however, increasing $\Delta R_W$ beyond 400% does not lead to a significant further reduction in error rate.
Therefore, the fabrication complexities associated with increasing $\Delta R_W$ in MTJ-based weighted connections can be avoided.
Moreover, to realize a realistic hardware implementation, we have studied the effect of weight quantization on the accuracy of our network. It was shown that a quantization factor of eight, which provides eight different resistive levels in each weighted connection, can lead to even lower error rates compared to a network with unquantized weights. This also demonstrates the robustness of our proposed circuit-level DBN implementation to minor variations in the resistance of the weighted connections, which are inevitable during the fabrication process. Finally, the comparison results exhibited that the embedded MRAM-based neuron can contribute several orders of magnitude energy reduction and reduce the area requirement by 20-fold with respect to recent energy-optimized designs. Although this is a simulation-based result, a hardware realization may endure significant process variation and the impact of sneak currents in large crossbar arrays.
While on-chip training can help to mitigate these effects somewhat, alternate approaches using binarized weights have been explored in other works with varying results [Courbariaux et al. 2015]. To address these further, the development of the PIN-Sim framework provides several possibilities for future work, including: (1) leveraging optimization techniques to reduce the performance gap between the ideal implementation of a DBN using simulation tools such as MATLAB and the realistic circuit-level implementation of a DBN using the PIN-Sim framework, (2) training DBNs with binary weights, which can be implemented by MTJs or RRAMs, and (3) implementing convolutional DBNs using PIN-Sim for more complex pattern recognition applications.
