Stochastic Spiking Neural Networks based on nanoelectronic spin devices are a possible pathway to achieving "brain-like" compact and energy-efficient cognitive intelligence. These computational models attempt to exploit the intrinsic device stochasticity of nanoelectronic synaptic or neural components to perform learning or inference. However, there has been limited analysis of the scaling effect of stochastic spin devices and its impact on the operation of such stochastic networks at the system level. This work explores the design space and analyzes the performance of nanomagnet-based stochastic neuromorphic computing architectures for magnets with different barrier heights. We illustrate how the underlying network architecture must be modified to account for the random telegraphic switching behavior displayed by magnets with low barrier heights as they are scaled into the superparamagnetic regime. We perform a device-to-system-level analysis of a deep neural network architecture for a digit recognition problem on the MNIST dataset.
I. INTRODUCTION
Emulating the computational primitives of neural-network-based machine learning approaches with the inherent device physics of nanoelectronic components has proven useful in reducing the area and energy requirements of the underlying hardware fabrics. To that effect, several post-CMOS technologies such as phase change memories [1], Ag-Si devices [2] and spintronic devices [3], among others, have been shown to exhibit neural and synaptic functionalities at the intrinsic device level. In this work, we focus on spintronic technologies in particular, owing to the low current and energy requirements of such devices in comparison to traditional memristive technologies.
While traditional neuromorphic computing models have been based on deterministic neural and synaptic primitives, recent efforts have been directed towards adapting such computing schemes to stochastic models. This has been driven primarily by two factors. (1) Deterministic neural or synaptic models are characterized by multi-bit resolution. However, as the dimensions of nanoelectronic neurons or synapses are scaled down, they might lose this multi-bit resolution capacity. In conjunction, such devices are expected to exhibit increased stochasticity during the switching process. For instance, spintronic devices exhibit stochasticity due to thermal noise at non-zero temperatures. Consequently, computational models that leverage the underlying device stochasticity have recently been explored. Encoding information over time through probabilistic synaptic or neural updates also enables state compression of neural and synaptic units, thereby allowing them to be implemented by single-bit technologies. (2) The human brain, the main inspiration behind such neuromorphic computing models, is characterized by stochastic neural and synaptic units. As a matter of fact, neuroscience studies have indicated that cortical neurons generate spikes probabilistically over time [4]. Consequently, stochastic neural computing models can potentially enable "brain-like" cognitive computing. In this work, we focus on stochastic neural inference in deep neural networks for typical pattern recognition tasks [5]. However, the analysis can be easily extended to stochastic synaptic units [6], or even to other unconventional computing platforms that require stochastic switching elements, such as Ising computing [7, 8] and Bayesian inference, among others.
* cliyanag@purdue.edu
As mentioned previously, spintronic devices display stochastic switching due to thermal noise. For a write current of a given duration flowing through the device, a magnet exhibits a particular probability of switching during the corresponding write cycle. Consecutive write and read cycles can be used to generate an output pulse stream whose average value depends on the magnitude of the input stimulus. While stochastic neural networks based on spintronic devices have been explored previously [5, 6, 9], there has been limited analysis of the scaling effects of these devices. It is generally expected that as the magnet dimensions scale down, the device exhibits increased stochasticity. Further, the current or voltage ranges required to operate such devices in the probabilistic regime reduce. However, as the scaling approaches the superparamagnetic regime, the magnets undergo random telegraphic switching with low data retention time, making the device practically volatile. Utilizing such a device as a biased random number generator requires re-thinking the peripherals and the underlying network architecture, since parallel read and write operations of the nano-magnets are now required. Moreover, the adoption of such low-energy superparamagnets as neural components comes at the expense of reduced error resiliency. This is mainly because the gradient, or rate of change, of the switching characteristics of such magnets in response to the input current magnitude is extremely high. This article addresses the different schemes of operation of stochastic Spiking Neural Networks (SNNs) for magnets from the non-telegraphic to the telegraphic regime and analyzes the associated energy-accuracy tradeoffs at the system level.
II. MAGNETIC TUNNEL JUNCTION AS A STOCHASTIC SWITCHING ELEMENT
A magnetic tunnel junction (MTJ) is a magnetoresistive device that consists of a tunneling oxide sandwiched between two magnetic contacts. One of the contacts is magnetically hardened and is called the pinned layer, while the direction of magnetization of the other contact, called the free layer, can be switched. In a spin-Hall effect based MTJ (SHE-MTJ), the direction of the free layer is switched by passing a charge current through an underlying heavy metal (HM), as shown in Fig. 1. The passage of the charge current ($I_{charge}$) through the HM layer induces a spin current ($I_{spin}$) flowing perpendicular to the planes of the magnetic layers of the MTJ. This spin current can switch the direction of magnetization of the free layer, making it parallel (P) or anti-parallel (AP) to that of the pinned layer, through the well-known spin-orbit torque mechanism [10, 11]. Due to the magneto-resistance effect, the SHE-MTJ exhibits a lower resistance ($R_P$) in the P state and a higher resistance ($R_{AP}$) in the AP state. Thus, the SHE-MTJ shown in Fig. 2 exhibits decoupled read and write current paths. The write operation is achieved by a charge current flowing through the HM layer, while the read operation is accomplished by sensing the resistance of the MTJ in a direction transverse to the plane of the magnetic layers.
Note that the switching process of the nanoscale free layer is influenced by thermal noise at nonzero temperatures. Thermal noise results in stochastic switching behavior, wherein, for a given current flowing through the HM layer, the MTJ switches with a certain probability. Moreover, the probability of switching can be controlled by the magnitude of the current flowing through the HM. The dynamics of the magnetization vector in the presence of the HM layer current are given by the stochastic Landau-Lifshitz-Gilbert-Slonczewski (LLGS) equation, which can be written as [12],
Here, $\alpha$ is the Gilbert damping constant, $\gamma$ is the gyromagnetic ratio, $\mathbf{m}$ is the unit vector in the direction of the magnetization, $t$ is the simulation time and $H_{EFF}$ is the effective magnetic field, including the demagnetization field and the interface anisotropy field. A detailed description of the various fields included in $H_{EFF}$ can be found in [12]. $STT$ in equation (1) is the term representing the torque due to the spin-Hall effect (modeled as a spin-transfer torque term) and can be written as follows [13],
where $\mathbf{m}_p$ is the magnetization of the pinned layer, $e$ is the charge of an electron, $\mu_0$ is the permeability of vacuum, $\hbar$ is the reduced Planck constant, $t_{FL}$ is the thickness of the free layer and $M_S$ is the saturation magnetization. $J_q$ is the charge current density flowing through the heavy metal. The spin polarization efficiency (defined as the ratio of the generated spin current to the charge current flowing through the HM layer) can be written as [14],
where $w$ is the width of the free layer, $t$ is the thickness of the heavy metal, $\theta_{she}$ is the spin-Hall angle and $\lambda_{sf}$ is the spin-flip length. The random switching process due to thermal noise can be included in the LLGS equation through a stochastic field $H_{thermal}$ in $H_{EFF}$ [15],
where $k_B$ is the Boltzmann constant, $T$ is the temperature, $Vol$ is the volume of the free-layer magnet and $dt$ is the simulation time step.
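For reference, a commonly used Gilbert form of the LLGS equation that is consistent with the symbols defined above is sketched here. It is a hedged reminder rather than the exact expression of [12]: the sign convention, the grouping of terms and the bold vector notation are assumptions, the spin-orbit torque enters only as the additive term $STT$, and the thermal field is taken to be part of $H_{EFF}$.

```latex
\frac{\partial \mathbf{m}}{\partial t}
  = -\gamma \left( \mathbf{m} \times \mathbf{H}_{EFF} \right)
  + \alpha \left( \mathbf{m} \times \frac{\partial \mathbf{m}}{\partial t} \right)
  + STT
```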
A. Stochasticity in Non-Telegraphic Regime
The parallel and anti-parallel states of the MTJ are stabilized by an energy barrier, $E_B$, defined as the product of the magnetic anisotropy and the volume (Fig. 3). The retention time of the magnetic state of a nanomagnet is given by [16],
where $\tau_0$ is a characteristic time constant in the range 1ps−100ps [16]. The retention time, or lifetime, of the magnet varies exponentially with the barrier height. The non-volatility of the magnet enables such devices to be used in synchronous clocked systems where the device is operated in successive write and read phases. During the write cycle, a current pulse of fixed duration is passed through the HM layer, which can switch the MTJ over the barrier from one stable state to the other. The switching probability of the magnet varies with the magnitude of the current pulse flowing through the underlying HM layer. During the read phase, a small current is passed through the voltage divider formed by the MTJ and $R_{ref}$ (refer to Fig. 2; $R_{ref}$ can be implemented by another MTJ whose state is not disturbed by the small read current), and the MTJ state is read at the output of the inverter. The read current should be sufficiently small that it does not disturb the state of the MTJ during the read phase. Since the voltage difference at the voltage divider output between the parallel and anti-parallel states is generally small, multiple inverter stages are required to obtain a full swing at the output. Fig. 4(a) illustrates the variation of the MTJ switching probability with the amplitude of the current pulse passed through the HM layer for different $E_B$. The device parameters used for the simulations are listed in Table I. Note that the barrier height of the magnet was varied by scaling the area of the magnet appropriately. The probabilistic switching characteristic of the MTJ follows a sigmoidal relationship with the write current when the SHE-layer current $I$ is described by two parameters, $I_{bias}$ and $I_o$. $I_{bias}$ is the dc current required to bias the switching probability of the MTJ to 0.5, and $I_o$ is the scaling factor used to map the swing of the switching probability around the bias current onto the sigmoid curve. Fig. 4(b) depicts the variation of the switching probability of the MTJ with $I - I_{bias}$, normalized by the factor $I_o$.
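The sigmoidal characteristic described above, combined with consecutive write and read cycles, can be sketched as a Bernoulli spike generator. This is a minimal illustration, assuming the standard logistic form P = 1 / (1 + exp(−(I − I_bias) / I_o)); the function names and the numerical values of `I_BIAS` and `I_O` are hypothetical and are not taken from Table I.

```python
import math
import random

def switching_probability(i_write, i_bias, i_o):
    """Logistic model of the MTJ switching probability (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(i_write - i_bias) / i_o))

def spike_train(i_write, i_bias, i_o, n_cycles, rng):
    """One Bernoulli sample per write/read cycle (1 = MTJ switched)."""
    p = switching_probability(i_write, i_bias, i_o)
    return [1 if rng.random() < p else 0 for _ in range(n_cycles)]

# Hypothetical values in microamperes, chosen only for illustration.
I_BIAS, I_O = 50.0, 5.0

rng = random.Random(0)
train = spike_train(I_BIAS + I_O, I_BIAS, I_O, 10_000, rng)
mean_rate = sum(train) / len(train)  # approaches sigmoid(1) for long trains
```

Averaging the pulse stream over many cycles recovers the sigmoid value at the applied current, which is how the magnitude of the input stimulus is encoded in the output.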
As shown in Fig. 4(a), when $E_B$, and hence the device dimensions, are scaled down, the current range required for stochastic switching decreases, thereby reducing the write current requirements of the device. Fig. 4(c) shows that both components, $I_{bias}$ and $I_o$, reduce as the barrier height is reduced. A reduction in $I_o$ implies that the current range that can be utilized for stochastic MTJ switching shrinks, thereby increasing the rate of change of the switching probability with input current. Consequently, the computing system becomes more sensitive to variations in the MTJ input current and exhibits less error resiliency as $I_o$ is reduced. These considerations are highlighted in the next section. Note that, if $E_B$ is not sufficiently large, the state of the magnet can switch during the read operation because the retention time $T_{RETENTION}$ is very small. The retention failure probability, $P_{F,RETENTION}$, of an MTJ within a given read access time is given by,
where $P_{F,RETENTION}$ is the retention failure probability of the MTJ during a read time of $t_{read}$ in nanoseconds, and $\Delta$ is the $E_B$ of the MTJ in units of $k_BT$. In order to find the $t_{read}$ necessary for a correct read operation, SPICE simulations (with a Verilog-A model for the MTJ [17]) were performed in the IBM 45nm technology node. Simulation results show that the required read time is around 0.2ns for the nominal corner and 1ns for the worst-case corner (with 2σ variations in the threshold voltage of the CMOS transistors). Hence, for the retention failure probability calculations the required read time is taken to be 1ns, to ensure that a correct read can be achieved even at the worst-case corner. As illustrated in Fig. 4(d), the retention failure probability increases exponentially as the MTJ is scaled down. In order to keep the retention failure probability below 1%, the $E_B$ of the magnet should be kept greater than $4.6k_BT$. When the MTJs are scaled further, they enter the superparamagnetic regime, where the magnets are no longer thermally stable during the read cycle. Hence, parallel read-write operations are required for magnets in the superparamagnetic regime ($E_B < 5k_BT$) to realize stochastic switching elements.
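The $4.6k_BT$ threshold quoted above can be checked numerically with a short sketch. It assumes the usual exponential (Néel-Arrhenius) retention time τ = τ_0 exp(Δ) and an exponential failure law P_F = 1 − exp(−t_read / τ). The attempt time `TAU_0_NS` = 1ns is an assumption made here because it reproduces the quoted threshold for a 1ns read; it is not stated in the text, and all function names are illustrative.

```python
import math

TAU_0_NS = 1.0  # assumed attempt time in ns (not stated in the text)

def retention_time_ns(delta):
    """Retention time for a barrier of `delta` (in units of k_B*T)."""
    return TAU_0_NS * math.exp(delta)

def retention_failure_probability(delta, t_read_ns):
    """Probability that the state flips during a read of t_read_ns."""
    return 1.0 - math.exp(-t_read_ns / retention_time_ns(delta))

def min_barrier_for(p_fail, t_read_ns):
    """Smallest barrier (in k_B*T) keeping the failure probability <= p_fail."""
    return math.log(-t_read_ns / (TAU_0_NS * math.log(1.0 - p_fail)))

p_46 = retention_failure_probability(4.6, 1.0)   # roughly 1%
delta_min = min_barrier_for(0.01, 1.0)           # roughly 4.6 k_B*T
```

Under these assumptions, a 1ns read of a $4.6k_BT$ magnet fails with a probability of about 1%, in line with the design limit stated above.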
B. Stochasticity in the Telegraphic Regime
For low-barrier-height nano-magnets ($E_B \sim 1k_BT$), even with zero charge current flowing through the HM layer, the MTJ exhibits random telegraphic switching between the two equilibrium states (Fig. 5(a)) due to thermal noise. The random switching characteristics of such scaled devices in the superparamagnetic regime can still be manipulated by passing a charge current through the HM layer. For instance, Fig. 5(a)-(c) show the in-plane magnetization of the MTJ in the presence of 0, 1.5 and −1.5µA write currents flowing through the HM layer of a $1k_BT$ magnet. The dwell time of the MTJ in either of the two stable states can be modulated by the magnitude and direction of the input write current.
The volatility of these devices entails a rethinking of the manner in which such nano-magnets can be operated with peripherals to realize a stochastic computing element. Due to device volatility and low retention time, such devices cannot be operated with separate write and read phases. Consequently, the write and read terminals of the MTJ are activated simultaneously and the device state is read while an input bias current flows through the underlying HM layer of the MTJ. For high energy-barrier MTJs, the effect of the read current on the switching characteristics is not a design issue, since the read and write cycles are decoupled in time. However, for MTJs in the telegraphic switching regime, the read current can bias the switching characteristics, since the read and write operations occur in parallel. Further, since the devices are highly scaled, the write current (for stochastic switching) and the read current fall in the same order of magnitude (unlike high-barrier-height magnets, where the write current for stochastic switching is higher). Hence, the resistive divider of the read circuit (Fig. 2) needs to be highly optimized so that the read current is kept at a minimal value. SPICE simulations reveal that the read current can be reduced to 100nA while having minimal effect on the MTJ switching characteristics. Fig. 6(a) depicts the average output of the inverter stage over a duration of 2µs with and without the read current.
[Fig. 5 caption fragment: … 5µA is flowing through the HM layer, the MTJ is more likely to be in the anti-parallel state. (c) When 1.5µA is flowing through the HM layer, the MTJ is more likely to be in the parallel state.]
The case "with read current" is simulated by considering the additional spin-orbit torque induced by the 100nA read current flowing through the HM layer, while the case "without read current" ignores the effect of the additional read current. As can be observed from Fig. 6, the read current has minimal impact on the MTJ switching probability. Further, device dimension variations (or, equivalently, $E_B$ variations) and read circuit variations (±1σ and ±2σ variations in the threshold voltages of the CMOS transistors) were shown to have minimal effect on the stochastic switching behavior of the nano-magnets (Figs. 6(b)-(c)). Fig. 6(d) shows a typical plot of the voltage output of the inverter stage as a function of time with no input current flowing through the underlying HM of the MTJ. Note that the switching characteristics of superparamagnetic MTJs are highly sensitive to any change in the magnitude of the write current. As depicted in Fig. 6(a), the switching probability of the MTJ shifts from 0.5 to 0.85 for a 1µA change in the write current. Hence, the impact of variations in the input current provided to a network of such scaled MTJs can be significant, and is analyzed in more detail in the next section. We conclude this section by noting that parallel read-write operation is not suited to magnetization switching in the non-telegraphic regime ($10-20k_BT$ barrier height magnets), since telegraphic switching would occur on timescales of ∼µs−ms, thereby resulting in increased delay for the computing process.
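The parallel read-write scheme can be illustrated with a toy two-state model. The sketch below is not an LLGS simulation: it treats the telegraphic MTJ as a discrete-time Markov chain whose stationary probability of the "high" state is set by the write current, and it time-averages the state the way the read inverter output is averaged. The logistic mapping, the `flip_rate` parameter and the choice of which magnetic state counts as "high" are assumptions made for illustration. The helper `implied_i_o` back-computes the sigmoid scale implied by the quoted shift from 0.5 to 0.85 for a 1µA change.

```python
import math
import random

def implied_i_o(p, delta_i):
    """I_o such that a logistic sigmoid of delta_i / I_o equals p."""
    return delta_i / math.log(p / (1.0 - p))

def telegraph_average(p_high, n_steps, flip_rate, rng):
    """Time-averaged state of a two-state Markov chain whose stationary
    probability of the high state is p_high."""
    state, total = 0, 0
    for _ in range(n_steps):
        if state == 0:
            if rng.random() < flip_rate * p_high:
                state = 1
        elif rng.random() < flip_rate * (1.0 - p_high):
            state = 0
        total += state
    return total / n_steps

i_o_ua = implied_i_o(0.85, 1.0)  # about 0.58 uA under the logistic assumption

rng = random.Random(1)
avg_zero_bias = telegraph_average(0.5, 200_000, 0.1, rng)   # near 0.5
avg_biased = telegraph_average(0.85, 200_000, 0.1, rng)     # near 0.85
```

The small implied $I_o$ is another way of stating the sensitivity noted above: a sub-microampere error in the write current moves the time-averaged output by a large fraction of its full swing.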
C. Stochastic Neuromorphic Computing
A neural network is essentially a collection of layers of neurons interfaced through a network of weighted synapses. Each input to a neuron is first scaled by the weight of the corresponding synapse, and the scaled inputs are then accumulated and processed by the neuron. Neurons with sigmoid-like transfer functions have been shown to be appealing for implementing deep spiking neural networks [5], making SHE-MTJ structures well suited to realizing energy-efficient neuromorphic hardware. In the stochastic neural network considered in this work, the MTJ neurons generate output spikes probabilistically, depending on the instantaneous magnitude of the resultant weighted synaptic input [5]. This computing framework can be directly translated to the resistive crossbar architecture illustrated in Fig. 7, where the synaptic weights are mapped onto the resistive elements between the horizontal and vertical metal lines. Note that resistive crossbar arrays based on memristive devices such as phase change materials [1], Ag-Si devices [2] and spintronic devices [18] have been proposed and experimentally demonstrated [19]. Two horizontal lines are used for each input connected to the crossbar array, to implement positive and negative weights. An input spike provided to the network activates the corresponding access transistors, supplying a voltage to the horizontal lines $V_+$ (positive voltage) and $V_-$ (negative voltage), which is translated into a current through the vertical columns (weighted by the conductances of the resistive elements). The currents accumulated in the vertical columns are then supplied as the write currents to the stochastic neurons of the corresponding layer.
If the weight connecting an input $m$ to a neuron $n$ is negative, then the resistive element connecting the positive horizontal line and the vertical column ($G_{m,n+}$) is programmed to a high-resistance 'off' state and the element connecting the vertical column and the negative horizontal line is programmed to a conductance given by $G_{m,n-} = |w_{m,n}| G_o$, and vice versa for a positive weight. Here, $w_{m,n}$ is the synaptic weight between input $m$ and neuron $n$, and $G_o$ is the mapped conductance for unity weight. The conductances of the resistive elements are thus selected by scaling the synaptic weights by the factor $G_o$, as mentioned previously. Assuming that the magneto-metallic spin devices have low input resistance in comparison to the cross-point resistances of the crossbar array, the neurons receive a weighted summation of the spike inputs in a particular layer and produce output spikes probabilistically over time, which drive the fan-out neurons of the next layer. For magnetic neurons operating in the non-telegraphic regime, the read circuit can be interfaced with a latch that stores the inverter output during the read cycle and drives the next stage of neurons during the following write cycle (hence synchronous operation). For magnetic neurons operating in the superparamagnetic regime, the inverter output can directly drive the neurons in the next stage (hence asynchronous operation). Note that the high-barrier-height magnets are also driven by a current source that biases them at a switching probability of 0.5, unlike MTJs in the superparamagnetic regime, which require no bias current. Due to the small input current and the zero bias current of magnetic neurons operating in the superparamagnetic regime, asynchronous architectures offer significant power savings in the neurons and the resistive crossbar array. However, as shown later, the asynchronous implementation incurs significant power loss in the read circuit, owing to the continuous switching activity of the inverters.
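The differential weight mapping described above can be sketched in a few lines. The example assumes ideal 'off' elements (zero conductance), binary input spikes, neurons with negligible input resistance, and row voltages of +δV and −δV; the function names and the example weights are hypothetical. The values $G_o$ = 5µS and δV = 0.1V are the ones quoted in Section III for the $1k_BT$ magnets, and $I_o$ is derived from δV = $I_o$/$G_o$.

```python
def map_weights(weights, g_o, g_off=0.0):
    """Map weights[m][n] onto (G_plus, G_minus) conductance matrices."""
    g_plus = [[w * g_o if w > 0 else g_off for w in row] for row in weights]
    g_minus = [[-w * g_o if w < 0 else g_off for w in row] for row in weights]
    return g_plus, g_minus

def column_currents(spikes, g_plus, g_minus, delta_v):
    """Current delivered to each neuron column for binary input spikes."""
    n_cols = len(g_plus[0])
    currents = [0.0] * n_cols
    for m, spike in enumerate(spikes):
        if not spike:
            continue  # access transistors off: this row contributes nothing
        for n in range(n_cols):
            currents[n] += delta_v * g_plus[m][n] - delta_v * g_minus[m][n]
    return currents

G_O = 5e-6           # 5 uS, value quoted in Section III
DELTA_V = 0.1        # 0.1 V, value quoted for the 1 k_B*T magnets
I_O = DELTA_V * G_O  # follows from delta_V = I_o / G_o (0.5 uA here)

weights = [[0.5, -1.0], [-0.25, 2.0], [1.0, 1.0]]  # hypothetical 3x2 layer
spikes = [1, 0, 1]
g_plus, g_minus = map_weights(weights, G_O)
normalized = [i / I_O
              for i in column_currents(spikes, g_plus, g_minus, DELTA_V)]
```

With this choice of δV, the column current expressed in units of $I_o$ equals the weighted sum of the active inputs, so it can be fed directly into the sigmoidal switching characteristic of the neuron.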
III. DESIGN CONSIDERATIONS: SYNCHRONOUS AND ASYNCHRONOUS NEUROMORPHIC SYSTEMS

A. Device to System Simulation Framework
In order to analyze the design considerations for synchronous and asynchronous stochastic SNNs, a hybrid device-circuit-system co-simulation framework was developed for this work. Stochastic LLGS simulations of MTJs with different barrier heights were used to evaluate the probabilistic switching behavior of magnets operating from the non-telegraphic to the telegraphic regime. In this work, we use magnets of barrier height $10k_BT$ and $20k_BT$ for the non-telegraphic regime and magnets of barrier height $1k_BT$ and $2k_BT$ for the telegraphic regime. The device parameters were based on experimental measurements performed in Ref. [17], and are summarized in Table I. SPICE-level simulations based on a Verilog-A model of the MTJ were used to evaluate the performance of the stochastic MTJ along with the associated peripherals.
In order to perform a system-level analysis, the performance of the network was assessed for a large-scale deep learning network architecture (28x28-6c5-2s-12c5-2s-10o) on a standard digit recognition problem based on the MNIST dataset [20]. The network consists of alternating convolutional and subsampling layers. The input MNIST images, of dimensions 28x28, were applied to a convolutional layer consisting of 6 convolutional kernels of size 5x5. The subsampling kernel was of size 2x2, and was followed by another convolutional layer comprising 12 output maps, which, in turn, was followed by another subsampling layer. The final layer consisted of 10 neurons, each of which represented one of the ten digit classes. Once training was completed, the learnt weights were mapped to the synaptic conductances using a value of $G_o = 5µS$, which lies in the typical range of memristive synaptic devices. The same resistive crossbar array was used for all the different barrier-height neuronal devices. The supply voltage $\delta V$ was adjusted in each case to satisfy the relationship $\delta V = I_o / G_o$, as explained previously. The supply voltage $\delta V$ was calculated to be 0.1V, 0.11V, 1.05V and 2V for nano-magnets of barrier height $1k_BT$, $2k_BT$, $10k_BT$ and $20k_BT$, respectively. The sigmoid characteristic curves for the magnets operating in the telegraphic regime were obtained by averaging the output voltage of the read inverter circuit over a period of 2µs (for $1k_BT$) and 5µs (for $2k_BT$).
B. Performance and Energy Estimation
The classification accuracy saturates at 97.5% and 97.2% for the $1k_BT$ and $2k_BT$ asynchronous designs, respectively. Both synchronous networks surpass an accuracy of 95% in just under 20ns, whereas the two asynchronous networks require 80ns (for $1k_BT$) and 250ns (for $2k_BT$) to reach the same accuracy.
In the asynchronous implementation, the high-frequency telegraphic switching of the nano-magnets is translated into voltage spikes at a lower frequency due to the gate capacitance charging delays of the CMOS devices, which explains the slower response of the asynchronous networks compared to the synchronous designs. Also, as the $E_B$ of the nano-magnets is increased (within the superparamagnetic regime), their retention time increases, decreasing the spiking frequency at the output of the inverters. Hence, as the results show, for asynchronous designs the time required for a network to reach a target accuracy increases with the $E_B$ of the nano-magnets used in the design. For the synchronous networks, the duration of one time-step was selected to be 4ns, which comprises a write time of 0.5ns, a rest period of 2ns, a read time of 1ns and a reset period of 0.5ns. The duration of the time-step for the asynchronous networks was determined by measuring the average duration of a voltage pulse at the output of the inverter read circuit at zero write current, and was calculated to be 8.2ns and 27.5ns for the $1k_BT$ and $2k_BT$ networks, respectively.
Fig. 9 shows the energy consumption of the networks (synchronous and asynchronous) corresponding to a target classification accuracy of 96%. Neuron energy (Fig. 9(a)) refers to the energy dissipated in the MTJ neuron due to the write/reset currents flowing through the HM layer. The neuron energy consumption is lowest for the $1k_BT$ asynchronous design, at 1.15pJ per image classification, and increases with the size of the magnets, up to 37.8pJ per image classification for the $20k_BT$ synchronous design. This trend can be explained by the increasing write current requirements of the nano-magnets as their sizes are increased. Since the currents flowing through the HM layer are first routed through the resistive crossbar network (synapses), the energy consumption in the synapses (Fig. 9(c)) shows a similar trend, increasing with the size of the magnets. In addition, the bias current required in the synchronous designs to bias the switching probability of the MTJs to 0.5 adds to the power dissipation in the HM layer and the synapses. The energy consumption in the synapses per image classification is 0.27nJ and 0.74nJ for the $1k_BT$ and $2k_BT$ asynchronous designs, and 1.3nJ and 6.5nJ for the $10k_BT$ and $20k_BT$ synchronous designs. The read energy consumption, illustrated in Fig. 9(b), is the sum of the energy dissipated in the MTJ due to the read current and the energy dissipated in the CMOS interface circuitry. As the results indicate, the read energy consumption per image classification is larger for the asynchronous implementations (3.3nJ for $1k_BT$ and 8.95nJ for $2k_BT$) than for the synchronous implementations (2.1nJ for $10k_BT$ and 2.75nJ for $20k_BT$). The majority of the read power dissipation in the asynchronous networks occurs in the CMOS inverters, which are required to operate continuously due to the parallel read/write nature of the neurons. In synchronous networks, however, the CMOS inverters are only required to operate during the read cycle, and can be deactivated at other times using access transistors to save power. For both designs, the energy dissipated in the neurons is more than an order of magnitude smaller than that dissipated in the synapses and the read circuit, owing to the low resistance of the HM layer. As depicted in Fig. 9(d), the $10k_BT$ synchronous network shows the minimum energy requirement per image classification (3.4nJ), closely followed by the $1k_BT$ asynchronous network (3.6nJ). The $20k_BT$ synchronous network consumes 9.28nJ per image classification, and the $2k_BT$ asynchronous network consumes the most, at 9.7nJ.
For the synchronous networks, the energy consumption associated with the clocking circuitry is negligible, especially since a classification accuracy of 96% can be achieved in under 10 clock cycles, and hence it is not considered in this analysis.
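The per-image energy figures reported in this section can be cross-checked by summing the neuron, synapse and read components. The sketch below uses only the numbers quoted in the text; it covers the $1k_BT$ and $20k_BT$ designs because those are the two for which all three components are quoted, and the dictionary keys are illustrative labels.

```python
# Per-image energy components quoted in the text, in nanojoules.
# Neuron energies are quoted only for the 1 k_B*T and 20 k_B*T designs.
components_nj = {
    "1kBT async": {"neuron": 1.15e-3, "synapse": 0.27, "read": 3.3},
    "20kBT sync": {"neuron": 37.8e-3, "synapse": 6.5, "read": 2.75},
}
quoted_total_nj = {"1kBT async": 3.6, "20kBT sync": 9.28}

# Sum of the three components for each design.
totals_nj = {name: sum(parts.values())
             for name, parts in components_nj.items()}

# Fraction of the total energy spent in the read circuit.
read_share = {name: parts["read"] / totals_nj[name]
              for name, parts in components_nj.items()}
```

The sums agree with the quoted totals to within rounding, and the read circuit accounts for over 90% of the energy in the $1k_BT$ asynchronous design compared with roughly 30% in the $20k_BT$ synchronous design.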
C. Effect of Variations
Most of the computations in the proposed network occur in the resistive crossbar array. Hence, any variations in the resistive elements of the crossbar array can result in a significant degradation of the classification accuracy. To measure the effect of such variations, separate experiments were performed allowing variations with a standard deviation of up to 20% in the resistive elements. According to the results (refer to Fig. 10), for variations in the synapses with a standard deviation of 20%, the accuracy loss is only 2.8% for the synchronous designs and 5.32% for the asynchronous designs. The higher accuracy degradation observed in the asynchronous designs can be explained by the increased sensitivity of the MTJ switching probability to the write current in the superparamagnetic regime.
Due to the low operating currents of the nano-magnets used in the asynchronous design, the operating voltage of the crossbar architecture, given by $\delta V = I_o / G_o$, can be very small for low-barrier-height magnets. Hence, any variation in the supply voltage can potentially result in a large deviation in the write current magnitude, influencing the classification accuracy of the network. Fig. 11 depicts the classification accuracy of the two designs in the presence of supply voltage variation. As shown in Fig. 11(a), due to the larger supply voltages used in the synchronous designs, the $10k_BT$ and $20k_BT$ synchronous implementations are resilient to supply voltage variations of up to 25mV. The asynchronous implementations, on the other hand, exhibit an accuracy degradation of 6.1% under a 25mV variation in the supply voltage.
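A rough way to see why the low-voltage designs are more exposed is to compare the 25mV variation with each nominal supply voltage. The sketch assumes that the synaptic current delivered by the crossbar scales linearly with δV, so that the fractional error in the write current equals the fractional error in the supply; it ignores the separate bias current source of the synchronous designs. The supply values are those quoted in Section III, and the dictionary keys are illustrative labels.

```python
# Nominal crossbar supply voltages quoted in Section III, in volts.
SUPPLY_V = {"1kBT": 0.1, "2kBT": 0.11, "10kBT": 1.05, "20kBT": 2.0}
VARIATION_V = 0.025  # 25 mV supply variation considered in the text

# Fractional change in the synaptic write current, assuming I scales with dV.
relative_shift = {name: VARIATION_V / v for name, v in SUPPLY_V.items()}
```

Under this assumption the same 25mV disturbance changes the synaptic current by about 25% for the $1k_BT$ design but by only a few percent for the synchronous designs.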
As explained in Section II, the CMOS inverter read circuit for the asynchronous implementation must be designed carefully so that the average magnetization of the nano-magnet is properly reflected in the average output of the inverter. Any variation in the CMOS circuitry can offset the average output of the inverters, adversely affecting the classification accuracy of the network. As depicted in Fig. 12, the classification accuracy of the $1k_BT$ asynchronous network decreases by 3% and that of the $2k_BT$ asynchronous network decreases by 0.7% at the worst-case corner with 2σ variations in the CMOS read circuit. The synchronous networks are resilient to such CMOS variations, since the read time is selected to be adequate for a correct read even at the worst-case corner.
IV. SUMMARY
In conclusion, we outlined the design considerations for MTJ-based stochastic SNNs with varying barrier heights. We showed that the reduced energy consumption of low-barrier-height magnets is achieved at the expense of reduced error and variation tolerance and a constrained design space for the CMOS peripherals. We further showed that, in contrast to the popular belief that superparamagnetic MTJs would be more energy-efficient than high-barrier-height magnets, the parallel and always-ON "read" and "write" operations in superparamagnets cause the peripheral "read" circuit to dominate the network energy consumption profile. While scaling of the peripheral CMOS technology will reduce the peripheral energy consumption, reduced error tolerance might still be a concern for spin-based neuromorphic hardware design. The analysis performed in this work can be easily extended to other applications that require probabilistic inference, for example Bayesian networks and Ising computing.
