Probabilistic machine learning enabled by the Bayesian formulation has recently gained significant attention in the domain of automated reasoning and decision-making. While impressive strides have recently been made to scale up the performance of deep Bayesian neural networks, they have been primarily standalone software efforts without any regard to the underlying hardware implementation. In this article, we propose an "all-spin" Bayesian neural network where the underlying spintronic hardware provides a better match to the Bayesian computing models. To the best of our knowledge, this is the first exploration of a Bayesian neural hardware accelerator enabled by emerging post-CMOS technologies. We develop an experimentally calibrated device-circuit-algorithm cosimulation framework and demonstrate a 24× reduction in energy consumption against an iso-network CMOS baseline implementation.
I. INTRODUCTION
PROBABILISTIC inference is at the core of decision-making in the brain. While the past few years have witnessed unprecedented success of deep learning in a plethora of pattern recognition tasks (complemented by advancements in dedicated hardware designs for these workloads), these problem spaces are usually characterized by the availability of large amounts of data and networks that do not explicitly represent any uncertainty in the network structure or parameters. However, as we strive to deploy artificial intelligence platforms in autonomous systems such as self-driving cars, decision-making based on uncertainty is crucial. Standard supervised backpropagation-based learning techniques are unable to deal with such issues since they do not overtly represent uncertainty in the modeling process. To circumvent these problems, Bayesian deep learning has recently been gaining attention, where deep neural networks are trained in a probabilistic framework following the classic rules of probability, i.e., Bayes' theorem. In the Bayesian formulation, the network is visualized as a set of plausible models (assuming prior probability distributions on its parameters, for instance, synaptic weights). Given the observed data, the posterior probability distributions that best explain the observed data are learned. The key distinction between standard deep networks and Bayesian deep networks is the fact that network parameters in the latter case are modeled as probability distributions. It is worth noting here that the probability distributions are usually modeled by Gaussian distributions characterized by a mean and variance [1].
Manuscript received November 12, 2019; revised January 2, 2020; accepted January 14, 2020. Date of publication February 11, 2020; date of current version February 26, 2020. The review of this article was arranged by Editor W. Tsai. (Corresponding author: Abhronil Sengupta.)
Utilizing probability distributions to model network parameters allows us to characterize the network outputs by an uncertainty measure (variance of the distribution), instead of just point estimates in a standard network. These uncertainty measures can, therefore, be used by autonomous agents for decision-making and self-assessment in the presence of continuous streaming data. This article explores a hardware-software codesign approach to accelerate Bayesian deep-learning platforms through the usage of spintronic technologies. Recent research has demonstrated the possibility of mimicking the primitives of standard deep-learning frameworks (synapses and neurons) by single magnetic device structures that can be operated at very low terminal voltages [2]-[4]. Furthermore, being nonvolatile in nature, spintronic devices can be arranged in crossbar architectures to realize "in-memory" dot-product computing kernels, thereby alleviating the memory access and memory leakage bottlenecks prevalent in CMOS-based implementations [5], [6]. As mentioned earlier, the key distinction between Bayesian and standard deep learning is the requirement of sampling from probability distributions and inference based on sampled values. Interestingly, scaled nanomagnetic devices operating at room temperature are characterized by thermal noise, resulting in probabilistic switching. We propose to leverage the inherent stochasticity of spintronic devices to generate samples from Gaussian probability distributions by drawing insights from the central limit theorem. Furthermore, this article also elaborates on a cohesive design of a spintronic Bayesian processor that leverages the benefits of spin-based Gaussian random number generators (RNGs) and spintronic "in-memory" crossbar architectures to realize high-performance, energy-efficient hardware platforms.
We believe the drastic reductions in circuit complexity (single devices emulating synaptic scaling operations, crossbar architectures implementing "in-memory" dot-product computing kernels and leveraging device stochasticity to sample from probability distributions) and low operating voltages of spintronic devices make them a promising path toward the realization of probabilistic machine learning enabled by the Bayesian formulation.
Fig. 1. In a Bayesian framework, each synaptic weight is represented by a Gaussian probability distribution. The core computing kernel for a particular layer during inference is a dot-product between the inputs and a synaptic weight matrix sample drawn from the individual probability distributions. Learning involves the determination of the mean and variances of the probability distributions using Bayes' formulation.
II. PRELIMINARIES: BAYESIAN NEURAL NETWORKS
Before going into the technical details of the work, we first discuss the preliminaries of Bayesian neural networks and the main computationally expensive operations pertaining to their hardware implementation. As shown in Fig. 1, a particular layer of a neural network consists of a set of neurons receiving inputs (sensory information or previous layer of neurons) through synaptic weights, W. Bayesian neural networks consider the weights of the network, W, to be latent variables characterized by a probability distribution, instead of point estimates. More specifically, each weight in such a framework is a random number drawn from a posterior probability distribution (characterized by a mean and variance) that is conditioned on a prior probability distribution and the observed datapoints, D (incoming patterns to the network). Hence, during inference, each incoming data pattern will propagate through the synaptic weights, each of which is characterized by a probability distribution. Consequently, as shown in Fig. 1, the final output of the neurons of a particular layer will also be described by a probability distribution characterized by a mean and variance (the uncertainty measure).
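The propagation of weight uncertainty to the neuron outputs can be illustrated with a minimal NumPy sketch. It assumes a single linear layer with independent Gaussian weights; the array values, the helper names `sample_weights` and `predict`, and the number of samples are hypothetical choices made for illustration and are not taken from the article.

```python
import numpy as np

def sample_weights(mu, sigma, rng):
    """Draw one weight matrix, each weight from its own Gaussian."""
    eps = rng.standard_normal(mu.shape)  # one N(0, 1) draw per weight
    return mu + sigma * eps              # shift and scale to N(mu, sigma^2)

def predict(x, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of the output mean and variance of a linear layer."""
    outputs = np.stack(
        [x @ sample_weights(mu, sigma, rng) for _ in range(n_samples)]
    )
    return outputs.mean(axis=0), outputs.var(axis=0)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])  # hypothetical weight means
sigma = np.full_like(mu, 0.05)                          # hypothetical std devs
x = np.array([1.0, 2.0, 3.0])                           # hypothetical input pattern

mean_y, var_y = predict(x, mu, sigma, n_samples=2000, rng=rng)
```

For this linear case the estimates can be checked against the closed form: the output mean is `x @ mu` and the output variance is `(x ** 2) @ (sigma ** 2)`.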
Bayesian neural networks correspond to the family of deep-learning networks where the weights are "learned" using Bayes' rule. The learning process here involves the estimation of the mean and variance of the weight posterior distribution. Following Bayes' rule, the posterior probability can be written as

$$P(W \mid D) = \frac{P(D \mid W)\,P(W)}{P(D)} \tag{1}$$

where P(W) denotes the prior probability (probability of the latent variables before any data input to the network), P(D|W) is the likelihood, corresponding to the feedforward pass of the network, and P(D) is the evidence, which acts as a normalizing constant. In order to make the above-mentioned posterior probability density estimation tractable, two popular approaches are variational inference methods [7] and Markov chain Monte Carlo methods [8]. In this article, we focus on variational inference methods due to their scalability to large-scale problems [9]. Variational inference methods usually approximate the posterior distribution by a Gaussian distribution, q(W, θ), characterized by parameters θ = (μ, σ), where μ and σ represent the mean and standard deviation vectors for the probability distributions representing P(W|D) [10]. To summarize, the main hardware design space concerns in Bayesian neural networks can be categorized as follows.
A. Gaussian Random Number Generation
Central to the entire framework, both in the learning and in the inference process, is the random number generation corresponding to the synaptic weights. Given the current large model sizes characterized by over a million synapses, coupled with the fact that random draws need to be performed multiple times for each synaptic weight, RNG circuits would contribute significantly to the total area and power consumption of the hardware. Furthermore, the random numbers need to be sampled from a Gaussian distribution, thereby increasing the complexity of the circuit. We will discuss the hardware costs for CMOS implementations of such Gaussian RNGs in the following sections along with their limitations, followed by our proposal of nanomagnetic RNGs that can serve as the basic building blocks of such Bayesian neural networks.
B. Dot-Product Operation Between Inputs and Sampled Synaptic Weights
A common aspect of any standard deep-learning framework is the fact that forward propagation of information through the network involves a significant amount of memory-intensive operations. The dot-product operation between the synaptic weights and the inputs for inference involves the compute energy along with memory access and memory leakage components. For large-scale problems and correspondingly large-scale models, CMOS memory access and memory leakage can account for approximately 50% of the total energy consumption profile [11].
The situation is further worsened in a Bayesian deep network since each synaptic weight is characterized by two parameters (mean and variance of the probability distribution), thereby requiring double the memory storage. However, the dot-product operation does not occur directly between the inputs and these parameters. In fact, for each inference operation, the synaptic weights (typically assumed constant during inference for nonprobabilistic networks and implemented by memory elements in hardware) are repeatedly updated depending on sampled values from the Gaussian probability distribution. Hence, the direct utilization of crossbar-based "in-memory" computing platforms enabled by nonvolatile memory technologies (discussed in detail later) for alleviating the memory access and memory fetch bottlenecks is not possible and therefore requires a significant rethinking.
In the following sections, we sequentially expand on each of these points and propose a spin-based neural processor that merges deterministic and stochastic devices as a potential pathway to enable Bayesian deep learning that can be orders of magnitude more efficient than state-of-the-art CMOS implementations.
III. SPINTRONIC DEVICE DESIGN

A. Magnetic Tunnel Junction: True Random Number Generator Design
The basic device structure under consideration is the magnetic tunnel junction (MTJ), which consists of two nanomagnets sandwiching a spacer layer (typically an oxide such as MgO). The magnetization of one of the layers is magnetostatically "pinned" in a particular direction, while the magnetization of the other layer can be manipulated by a spin current or an external magnetic field. The two layers are denoted as the "pinned" layer (PL) and "free" layer (FL). Depending on the relative orientation of the two magnets, the device exhibits a high-resistance antiparallel (AP) state (when the magnetizations of the two layers have the opposite direction) and a low-resistance parallel (P) state (when the magnetizations of the two layers have the same direction). These two states are stabilized by an energy barrier determined by the anisotropy and volume of the magnet.
Let us now consider the switching of the magnet from one state to another by the application of an external current. The switching process is inherently stochastic at nonzero temperatures due to thermal noise [12]. In the presence of an external current, the probability of switching from one state to the other is modulated depending on the magnitude and duration of the current. A true RNG (TRNG) can be designed using such a device by biasing the magnet at the "write" current corresponding to a switching probability of 50%. Note that CMOS-based TRNGs suffer from high energy consumption and circuit design complexity [13]. MTJ-based TRNGs have been proposed and experimentally demonstrated [14]. They are characterized by a low area footprint and compatibility with CMOS technology.
In this article, we consider a spin-orbit coupling-enabled device structure (see Fig. 2). It consists of the MTJ stack lying on top of a heavy-metal (HM) underlayer. The device "read" is performed through the MTJ stack between terminals T1 and T3. However, the device "write" is performed by passing current through the HM underlayer between terminals T2 and T3. Input current flowing through the HM results in spin injection at the interface of the magnet and HM due to the spin-Hall effect (SHE) [15] and thereby causes switching of the MTJ FL [16]. The device has the following advantages.
1) The decoupled "write" and "read" current paths are advantageous from the perspective of peripheral circuit design: "read"-"write" conflicts are avoided since the associated circuits can be optimized independently.
2) Such devices are one to two orders of magnitude more energy efficient than standard spin-transfer torque MRAMs. This is due to the fact that in such spin-orbit coupling-based systems, every incoming electron in the "write" current path repeatedly scatters at the interface of the magnet and HM and transfers multiple units of spin angular momentum to the ferromagnet lying on top.
Usage of SHE-based switching enables us to use an alternative TRNG design [17], [18] that has the potential to produce high-quality random numbers in the presence of process, voltage, and temperature (PVT) variations. In the earlier scenario of a standard MTJ, device-to-device variations can result in deviations of the bias current required for 50% switching probability, thereby degrading the quality of the random number generation process. Our scheme is shown in Fig. 2, where a magnet with perpendicular magnetic anisotropy (PMA) lies on top of the HM. The device operation is divided into three stages. During an initial "Reset" stage, a current flowing through the HM results in in-plane spin injection in the magnet and orients it along the hard axis for a sufficient magnitude of the "reset" current. The magnet is then allowed to relax to either of the two stable states in the presence of thermal noise, with a switching probability of 50% since the hard axis is a metastable orientation point for the magnet. In this case, device-to-device variations only change the critical current required for biasing the magnet close to the metastable orientation and do not skew the probability distribution in a particular direction (as in the standard MTJ case). Hence, by maintaining a worst case critical value of the HM "reset" current, the quality of the random number generation process can be preserved even in the presence of PVT variations. Furthermore, the "reset" current does not flow through the tunneling oxide layer (unlike the standard MTJ case), and therefore, the reliability of the oxide layer is not a concern in this scenario [17], [18]. Note that our device operation is validated by recent experiments of holding the magnet to its metastable hard-axis orientation for performing Bennett clocking in the context of nanomagnetic logic [19]. SHE-based energy-efficient switching also reduces the energy consumption involved in the random number generation process.
The probabilistic switching characteristics of the MTJ can be analyzed by the Landau-Lifshitz-Gilbert (LLG) equation with an additional term to account for the spin-orbit torque generated by the SHE at the ferromagnet-HM interface [20]
where m is the unit vector of the FL magnetization, $\gamma = 2\mu_B\mu_0/\hbar$ is the gyromagnetic ratio for the electron, α is Gilbert's damping ratio, $H_{eff}$ is the effective magnetic field, including the shape anisotropy field for elliptic disks, $\theta_{SH}$ is the spin-Hall angle, and $I_q$ is the charge current flowing through the HM underlayer. Thermal noise is included by an additional thermal field [12]

$$H_{thermal} = \sqrt{\frac{\alpha}{1+\alpha^{2}}\,\frac{2K_B T_K}{\gamma\,\mu_0 M_s V\,\delta_t}}\;G_{0,1}$$

where $G_{0,1}$ is a Gaussian distribution with zero mean and unit standard deviation, $K_B$ is Boltzmann's constant, $T_K$ is the temperature, $M_s$ is the saturation magnetization, V is the FL volume, and $\delta_t$ is the simulation time step.
The device parameters are mentioned in Table I. Considering a worst case "reset" current of 140 μA for a duration of 1 ns, the energy consumption involved in using a 20 $K_B T$ barrier magnet (calibrated to experimental measurements reported in [21]) as a TRNG is ∼57 fJ/bit ($I^2Rt$ energy consumption) [17], which is almost 2× lower than that of the standard MTJ-based TRNG.
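The $I^2Rt$ estimate can be reproduced with a few lines of arithmetic. The sketch below takes the 140 μA current, the 1 ns duration, and the 57 fJ/bit figure from the text; the HM resistance is not quoted in this section, so the value computed here is only the resistance implied by those three numbers (an assumption, roughly 2.9 kΩ), not a reported device parameter.

```python
def joule_energy(current_a, resistance_ohm, duration_s):
    """Resistive (I^2 * R * t) energy dissipated by a current pulse."""
    return current_a ** 2 * resistance_ohm * duration_s

I_RESET_A = 140e-6       # worst case "reset" current quoted in the text
T_RESET_S = 1e-9         # "reset" pulse duration quoted in the text
E_PER_BIT_J = 57e-15     # energy per random bit quoted in the text

# Resistance implied by the three quoted numbers (back-calculated, not reported).
implied_resistance_ohm = E_PER_BIT_J / (I_RESET_A ** 2 * T_RESET_S)

# Energy to fill one 8-bit random word, assuming one reset pulse per bit.
energy_8bit_j = 8 * joule_energy(I_RESET_A, implied_resistance_ohm, T_RESET_S)
```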
B. Domain-Wall Motion-Based Magnetic Devices: Multilevel Nonvolatile Memory Design
The mono-domain magnet discussed earlier is characterized by only two stable states. For a magnet with elongated shape, multiple domains can be stabilized in the FL, thereby leading to the realization of multiple stable resistive states. Such a domain-wall (DW) MTJ consists of a DW separating the two oppositely magnetized regions and the DW position is programmed to modulate the MTJ resistance (due to the variation in the relative proportion of P and AP domains in the device) [5] .
We also consider SHE-based DW motion dynamics in magnet-HM bilayers. In magnetic heterostructures with high perpendicular magnetocrystalline anisotropy, spin-orbit coupling and broken inversion symmetry stabilize chiral DWs through the Dzyaloshinskii-Moriya interaction (DMI) [22], [23]. Such an interfacial DMI at the magnet-HM interface results in the formation of a Néel DW. When an in-plane charge current is injected through the HM, the accumulated spins at the magnet-HM interface result in Néel DW motion. The device structure is shown in Fig. 3(a), where a current of magnitude J flowing through the HM layer results in a conductance change ΔG between terminals T1 and T3. As shown in Fig. 4(a), for a given programming time duration, the current flowing through the HM underlayer causes a DW displacement proportional to its magnitude. Note that the device characteristics are obtained by performing micromagnetic LLG simulations in which the magnet is divided into multiple grids. The DW position determines the magnitude of the MTJ conductance. The MTJ conductance varies linearly with the DW position since it determines the relative proportion of the area of the parallel and antiparallel domains of the MTJ [see Fig. 4(b)]. Since such a device can be programmed to multilevel resistive states and is characterized by low switching current requirements and linear device behavior (device conductance change varies in proportion to the magnitude of programming current), it is an ideal fit for implementing crossbar-based "in-memory" computing platforms (discussed in Section IV). We will refer to this device as a DW-MTJ for the remainder of this text. Experimentally, a multilevel DW motion-based resistive device was recently shown to exhibit 15-20 intermediate resistive states [24].
The device characteristics illustrate that the programming current magnitude is directly proportional to the amount of conductance change [5].
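The linear dependence of conductance on DW position follows from treating the parallel and antiparallel regions of the MTJ as two conductances in parallel, each weighted by its share of the device area. The sketch below is a behavioral model under that assumption only; the function names, the choice of which side of the wall is parallel, the displacement-per-current constant `nm_per_ua`, and all numerical values are hypothetical and are not the micromagnetic model used in the article.

```python
def dw_mtj_conductance(x_nm, length_nm, g_parallel, g_antiparallel):
    """Conductance of a DW-MTJ whose wall sits at x_nm along a strip of length_nm.

    The region behind the wall is assumed parallel (P), the rest antiparallel (AP);
    the two regions conduct in parallel, so the total is an area-weighted sum.
    """
    frac_p = x_nm / length_nm
    return frac_p * g_parallel + (1.0 - frac_p) * g_antiparallel

def program_dw(x_nm, current_ua, nm_per_ua, length_nm):
    """Move the wall by a displacement proportional to the programming current.

    nm_per_ua is a hypothetical proportionality constant for a fixed pulse duration.
    The wall position is clamped to the physical extent of the strip.
    """
    return min(max(x_nm + nm_per_ua * current_ua, 0.0), length_nm)

# Hypothetical example: conductances in arbitrary units, positions in nm.
positions = [0.0, 25.0, 50.0, 75.0, 100.0]
conductances = [dw_mtj_conductance(x, 100.0, 4.0, 1.0) for x in positions]
```

Equal steps in wall position give equal steps in conductance, which is the property that makes the device convenient as a multilevel synapse.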
It is worth noting here that the device structure in Fig. 3(a) can be used as a neuron by interfacing it with a reference MTJ [see Fig. 3(b)] [5]. The resistive divider can drive a CMOS transistor whose output drive current is a linear function of the input current flowing through the HM layer of the device, thereby mimicking a saturating linear transfer function, provided that the transistor operates in the saturation regime [5]. The simulation parameters, provided in Table II, are used for the DW-MTJ in the rest of this text unless otherwise stated. The parameters were obtained from magnetometric measurements of CoFe-Pt nanostrips [22].
IV. ALL-SPIN BAYESIAN NEURAL NETWORKS
A. Spin-Based Gaussian Random Number Generator
Gaussian random number generation is a hardware-expensive process. CMOS-based designs for Gaussian RNGs usually require a large number of registers, linear feedback circuits, and so on. For instance, a recent CMOS-based Gaussian RNG implementation reports 1780 registers and 528.69-mW power consumption for a 64-parallel Gaussian RNG task [9].
Fig. 5. Outline of a 2 × 2 array utilizing spin-based devices interfaced with an accumulator to implement a Gaussian RNG. The probability distributions of random numbers generated from such an array are shown on the extreme right by using a sum of N random variables (rows of the array). We use 8-bit representation and 100 000 samples to plot the distribution.
TABLE II DW-MTJ DEVICE SIMULATION PARAMETERS
Let us now discuss our proposal of a spin-based Gaussian RNG. In Section III, we discussed the design of a spintronic TRNG. An array of TRNGs can be used for sampling from a uniform probability distribution. Note that each spin device can be considered to produce a sample from a Bernoulli distribution with a probability of 0.5. Reading a particular row of the array, therefore, provides a sample from a discrete uniform distribution. In order to generate a Gaussian probability distribution from a uniform one, we draw inspiration from the central limit theorem, as discussed in Box 1. The key result of the central limit theorem that we utilize is that the sum of a large number of independent and identically distributed (i.i.d.) random variables is approximately normal.
Box 1: Central Limit Theorem
Let $\{X_1, X_2, \ldots, X_n\}$ be a random sample of n i.i.d. random variables drawn from a distribution (which may not be normal) of mean μ and variance $\sigma^2$. Then, the probability density function of the sample average $S_n = (X_1 + X_2 + \cdots + X_n)/n$ approaches a normal distribution with mean μ and variance $\sigma^2/n$ as n increases.
Our proposed design is shown in Fig. 5, which depicts a possible array implementation [17] of our spin-based TRNGs. Each spin device is interfaced with an access transistor. Rows sharing a reset line can be driven simultaneously. Hence, random numbers can be generated in the entire array in parallel. The timing diagram is shown in Fig. 5. Each row can be read by asserting a particular wordline (WL) and sensing the bitline (BL) voltage. For an m × n array, each row read produces an n-bit number generated from a uniform probability distribution. By interfacing the array with an accumulator that averages all the generated random numbers, we are able to produce random numbers drawn from a normal distribution. Note that the hardware overhead for this process would be high for applications that require precise sampling from Gaussian distributions since exact convergence occurs only in the limit of infinitely many samples. However, for the machine-learning workloads considered herein, the performance of such platforms is usually resilient to approximations in the underlying computations. For instance, Fig. 5 shows that even with an 8-bit representation and three random variables drawn from the uniform probability distribution, we are able to achieve an approximate Gaussian distribution. While Gaussian probability distributions are primarily used in such algorithms, non-Gaussian weight distributions can also be designed by using the Gaussian function as a basis. Note that while the Box 1 discussions are equally valid for a CMOS-based TRNG, it would consume an order of magnitude more area and power than our proposed spin-based TRNG, as explained in Section III.
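The accumulation scheme can be mimicked in software with a short NumPy sketch. It models each device as a fair coin flip, reads each row as an unsigned n-bit integer, sums the rows, and normalizes the result to zero mean and unit variance using the known moments of the discrete uniform distribution. The function name `gaussian_samples` and the use of a pseudorandom generator in place of physical TRNG devices are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_samples(n_samples, n_rows=3, n_bits=8, rng=None):
    """Approximate N(0, 1) samples from sums of n_rows uniform n_bits-wide words."""
    rng = np.random.default_rng() if rng is None else rng

    # Each device relaxes to 0 or 1 with probability 0.5 (Bernoulli(0.5)).
    bits = rng.integers(0, 2, size=(n_samples, n_rows, n_bits))

    # Reading a row interprets its n_bits devices as one unsigned integer.
    place_values = 2 ** np.arange(n_bits - 1, -1, -1)
    row_values = bits @ place_values            # shape (n_samples, n_rows)

    # Accumulate the rows, then normalize with the discrete uniform moments.
    total = row_values.sum(axis=1)
    n_states = 2 ** n_bits
    row_mean = (n_states - 1) / 2.0
    row_var = (n_states ** 2 - 1) / 12.0
    return (total - n_rows * row_mean) / np.sqrt(n_rows * row_var)

z = gaussian_samples(100_000, rng=np.random.default_rng(1))
```

With three 8-bit rows the samples are bounded (the normalized magnitude cannot reach 3), so the tails of the Gaussian are truncated; this is one concrete form of the approximation that the workload is expected to tolerate.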
B. Dot-Product Operation Between Inputs and Sampled Synaptic Weights
Fig. 7. All-spin Bayesian neural network implementation. The two crossbar arrays behave as "in-memory" computing kernels, whereas the RNG unit provides the sampling operation from the Gaussian RNGs.
Let us first discuss the operation of DW-MTJ enabled spintronic crossbar arrays as an energy-efficient mechanism to realize the dot-product computing kernel. Assuming each synapse to be represented by a DW-MTJ, as shown in Fig. 6, the synapses can be arranged in a crossbar structure. Each row of the array is driven by an analog voltage [output of digital-to-analog converters (DACs)] that corresponds to the magnitude of the input. The current flowing through each synapse is scaled by the conductance of the device and, due to Kirchhoff's current law, all these currents get summed up along the column, thereby realizing the dot-product kernel. Note that negative synaptic weights can also be mapped by using two horizontal lines per input (driven by "positive" and "negative" supply voltages). In case a particular synaptic weight is "positive" ("negative"), the corresponding conductance in the "positive" ("negative") line is set in accordance with the weight. The resultant currents get summed up along the column and pass as the input "write" current through the spin neuron. Consecutive "write" and "read" cycles of the spin neurons will implement multiple iterations of the Bayesian network. The analog output current provided by the spin neuron is then converted to a digital format using analog-to-digital converters (ADCs). The digital outputs can be latched to provide the inputs for the fan-out crossbar arrays. The energy efficiency of the system stems mainly from two factors as follows.
1) The input write resistance of the spintronic neurons is low (being magnetometallic devices) and they inherently require very low currents for switching. This enables the crossbar arrays of spintronic synapses to be operated at low terminal voltages (typically 100 mV). Furthermore, spintronic neurons are inherently current-driven and thereby do not require costly current-to-voltage converters, in contrast to CMOS and other emerging technology-based (resistive random access memory and phase-change memory, among others) implementations [25].
2) Since spin devices are inherently nonvolatile technologies, the ability to perform the costly multiply-accumulate operations in the memory array itself enables us to address the von Neumann bottleneck.
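The crossbar dot product with signed weights described above can be sketched numerically as follows. The mapping assumed here places the weight magnitude on the line matching its sign and leaves the device on the opposite line at its minimum conductance, so that the minimum-conductance offsets cancel in the column current; this particular mapping, the function names, and all numerical values are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def map_to_conductances(w, g_min, g_max):
    """Map signed weights onto "positive" and "negative" conductance arrays."""
    scale = (g_max - g_min) / np.abs(w).max()
    g_pos = g_min + scale * np.clip(w, 0.0, None)    # carries positive weights
    g_neg = g_min + scale * np.clip(-w, 0.0, None)   # carries negative weights
    return g_pos, g_neg, scale

def column_currents(v_in, g_pos, g_neg):
    """Sum the device currents along each column (Kirchhoff's current law).

    The "positive" lines are driven at +v_in and the "negative" lines at -v_in.
    """
    return v_in @ g_pos + (-v_in) @ g_neg

# Hypothetical 3-input, 2-neuron layer.
w = np.array([[0.5, -1.0], [-0.25, 0.75], [1.0, 0.0]])
v_in = 0.1 * np.array([0.2, 1.0, 0.6])   # row voltages, at most 100 mV
g_pos, g_neg, scale = map_to_conductances(w, g_min=1e-6, g_max=4e-6)
i_col = column_currents(v_in, g_pos, g_neg)
```

Dividing `i_col` by `scale` recovers the ideal dot product `v_in @ w`, which shows that the column current is proportional to the weighted sum of the inputs.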
However, in the context of Bayesian deep networks, even for the inference stage, the synaptic weights are not constant but are updated depending on sampled values from a Gaussian distribution. Assuming that we are able to generate samples from a normal distribution by using the device-circuit primitives proposed earlier, the computations in a Bayesian network can be partitioned in an appropriate fashion such that the benefits of spin-based "in-memory" computing can be still utilized. This is explained in Box 2.
Box 2: Computations Involved in Inference Operation
Once all the posterior distributions are learned (μ and σ parameters of the weight distributions), the network output corresponding to input x should be obtained by averaging the outputs obtained by sampling from the posterior distribution of the weights, W [9]. The output of the network, y, is therefore given by

$$y = \mathbb{E}_{P(W \mid D)}\left[f(x, W)\right] \approx \frac{1}{S}\sum_{i=1}^{S} f(x, W_i), \qquad W_i \sim q(W, \theta) \tag{3}$$

where f(x, W) is the network mapping for input x and weights W. Using the variational inference method, we approximate the weight distribution by Gaussian functions. The approximation is performed over S independent Monte Carlo samples, $W_i$, drawn from the Gaussian distribution q(W, θ).
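Considering just a single layer and neglecting the neural transfer function, $f(x, W_i)$ for the jth neuron can be decomposed into

$$f_j(x, W_i) = \sum_{k} x_k\,N(\mu_{jk}, \sigma_{jk}) = \sum_{k} x_k\,\mu_{jk} + \sum_{k} \big(x_k\,N(0, 1)\big)\,\sigma_{jk} \tag{4}$$

where k runs over the dimensions of the input x and $N(\mu_{jk}, \sigma_{jk})$ represents a particular sample drawn from a normal probability distribution with mean $\mu_{jk}$ and standard deviation $\sigma_{jk}$, each N(0, 1) term being an independent draw.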
Realizing that a normal distribution with a particular mean and variance is equivalent to a scaled and shifted version of a normal distribution with zero mean and unit variance, we partition the inference equation, as shown in (4). The constant parameters $\mu_{jk}$ and $\sigma_{jk}$ represent the mean and standard deviation of the probability distribution of the corresponding synaptic weight and can, therefore, be implemented by DW-MTJ-based memory devices from a hardware implementation perspective. The resultant system (see Fig. 7) consists of two crossbar arrays for storing the mean and standard deviation parameters. While the inputs of a particular layer are directly applied to the crossbar array storing the mean values, they are scaled by the random numbers generated from the RNG unit described previously (outputs normalized to provide random numbers with zero mean and unit variance) before being applied to the crossbar array storing the standard deviation values. Typical CMOS neuromorphic architectures are characterized by much higher movement of weight data than input data to compute the inference operation [26]. Our proposal of computation partition, explained in Box 2, enables us to leverage the "in-memory" computing primitives for storing the probability distribution parameters while computing energy-efficient dot products in situ, in parallel, between inputs and stochastic weights. It is worth noting here that the crossbar column outputs are computed and read sequentially in order to ensure that the random numbers sampled for the synaptic weights of each column are independent.
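The equivalence between sampling the weights directly and the partitioned computation can be checked with a small NumPy sketch. It compares a dot product with explicitly sampled weights against the sum of a "mean" crossbar output and a "standard deviation" crossbar output whose inputs are scaled by the random numbers, column by column. The layer size, parameter values, and variable names are hypothetical; only the partitioning itself is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3

mu = rng.normal(0.0, 0.5, size=(n_in, n_out))       # stored in the mean crossbar
sigma = rng.uniform(0.01, 0.1, size=(n_in, n_out))  # stored in the second crossbar
x = rng.uniform(0.0, 1.0, size=n_in)                # layer input
eps = rng.standard_normal((n_in, n_out))            # one N(0, 1) draw per synapse

# Direct approach: materialize the sampled weights, then take the dot product.
w_sampled = mu + sigma * eps
y_direct = x @ w_sampled

# Partitioned approach: the stored parameters never change.
y_mean = x @ mu                                     # inputs applied directly
y_sigma = np.array(
    [(x * eps[:, j]) @ sigma[:, j] for j in range(n_out)]  # one column at a time,
)                                                   # each with fresh random scaling
y_partitioned = y_mean + y_sigma
```

The loop over columns mirrors the sequential column reads: each column sees the inputs scaled by its own set of random numbers, which keeps the sampled weights of different columns independent.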
V. RESULTS AND DISCUSSION
A hybrid device-circuit-algorithm cosimulation framework was developed to evaluate the performance of the proposed all-spin Bayesian hardware. The magnetization switching characteristics of the monodomain and multidomain MTJs were simulated in MuMax3, a GPU-accelerated micromagnetic simulation framework [27]. The nonequilibrium Green's function (NEGF)-based transport simulation framework [28] was used for modeling the MTJ resistance variation with oxide thickness and applied voltage. The device characteristics obtained from the MuMax3 and SPICE simulation tools were used in an algorithm-level simulator, PyTorch, to evaluate the functionality of the circuit. The performance of this design was tested for a standard digit recognition problem on the MNIST data set [29]. A two-layer fully connected neural network was used, with each hidden layer having 200 neurons. The probability distributions were learned using the "Bayes by Backprop" algorithm [30], which learns the optimal Gaussian distribution by minimizing the Kullback-Leibler (KL) divergence from the true probability distribution. The prior distribution on the weights used for training was a scaled mixture of two Gaussian functions. The network was trained offline to obtain the values of the mean and standard deviation of the probability distributions of the weights. Subsequently, they were mapped to the conductances of the DW-MTJ devices. The baseline idealized software network achieved an accuracy of 98.63% on the training set and 97.51% on the testing set (averaged over ten sampled networks).
The device parameters used in this article have been tabulated in Section IV; a magnet with a barrier height of 20 $K_B T$ was used in the Gaussian RNG unit. We considered 4-bit representation in the DW-MTJ weights and 3-bit discretization in the neuron output. Note that, as explained in Section IV, our neuronal devices mimic a saturating linear functionality and our network was trained with such a transfer function itself. Considering a minimum sensing and programming displacement of 20 nm for the DW location, we consider our cross-point and neuronal devices to be 320 and 160 nm in length, respectively. From our micromagnetic simulations, we observe that the critical current required to switch the neuronal device from one edge to the other is 4 μA for a time duration of 10 ns. The crossbar supply voltage was assumed to be 100 mV for evaluating the crossbar power consumption. The crossbar resistance ranges (which can be varied by the oxide thickness) were designed to provide the critical current requirement for the spin neurons. We considered 300% TMR in the DW-MTJ conductances of the crossbar array. Considering such device-level behavioral characteristics, nonidealities, and constraints, the test accuracy of the network was 96.98% (averaged over ten samples). Furthermore, nonideal DW programming can also impact the system accuracy. We performed five independent Monte Carlo runs of the network with a 10% variation in each of the programmed crossbar device conductances. The average accuracy degradation was observed to be insignificant (96.74% average accuracy).
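The relation between device length, minimum DW displacement, and bit precision quoted above can be verified with simple arithmetic, and the quantized saturating linear neuron can be sketched alongside it. The lengths, the 20 nm step, the 4 μA critical current, and the 3-bit output are taken from the text; the function names, the assumption that the neuron output is unsigned and uniformly quantized, and the test inputs are illustrative assumptions.

```python
import math
import numpy as np

MIN_STEP_NM = 20  # minimum sensing and programming displacement from the text

def resolution_bits(length_nm, step_nm=MIN_STEP_NM):
    """Bits resolvable when the wall moves in step_nm increments along length_nm."""
    return math.log2(length_nm / step_nm)

def saturating_linear(i_in_ua, i_sat_ua=4.0, n_bits=3):
    """Saturating linear neuron with a uniformly quantized output in [0, 1].

    i_sat_ua is the current that moves the wall from one edge to the other;
    the unsigned output range and uniform quantization are assumptions.
    """
    max_code = 2 ** n_bits - 1
    y = np.clip(np.asarray(i_in_ua) / i_sat_ua, 0.0, 1.0)
    return np.round(y * max_code) / max_code

synapse_bits = resolution_bits(320)  # cross-point device
neuron_bits = resolution_bits(160)   # neuronal device
outputs = saturating_linear([-1.0, 1.0, 8.0])
```

The 320 nm and 160 nm lengths therefore correspond to the 4-bit weights and 3-bit neuron outputs stated in the text.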
To estimate the system-level energy consumption, we considered the core RNG and crossbar energy consumption along with that of the peripheral circuitry, such as the ADC and DAC. 3 We evaluate the energy consumption for a single-image inference and a particular network sample. The crossbar read latency was assumed to be 10 ns (for each column read). During each 10-ns column read, the power consumption of the DAC and the corresponding crossbar column was considered. Subsequently, the neuron device state was read and converted to a digital value using an ADC. The neuron is reset before every operation. For the RNG, DAC, and ADC units, we considered 8-bit precision, and three variables were used for the accumulation process in the normal distribution sampling. We assumed 8-bit precision for the energy calculations in order to achieve a fair comparison with the numbers reported in [9] for an iso-network CMOS architecture. However, from a functional viewpoint, a lower bit precision (∼4 bits) was observed to be sufficient. The total energy consumption of our proposed "all-spin" network was evaluated to be 790.2 nJ per classification, which is 24× more energy efficient than the baseline CMOS implementation [9]. The energy consumption of the RNG unit, including the peripherals for adding the random numbers generated per row, was estimated to be 446.8 nJ. The energy consumption of the crossbar array, including the DAC, ADC, and multiplier peripherals, was 343.3 nJ. The system-level energy efficiency stems from both the RNG design and the utilization of the "in-memory" computing units.
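The following sketch illustrates two points from the paragraph above. First, it shows one plausible reading of the accumulation-based normal sampling: summing three uniform 8-bit random numbers and normalizing the sum (a central-limit approximation). This interpretation, and the function names, are assumptions for illustration rather than a description of the actual RNG circuit. Second, it checks the reported energy breakdown; the baseline figure printed at the end is merely implied by the 24× ratio and is not a number quoted from [9].

```python
import math
import random


def approx_standard_normal(rng, n_terms=3, bits=8):
    """Approximate N(0, 1) by accumulating n_terms uniform integers.

    Each term is uniform on {0, ..., 2**bits - 1}, with mean
    (2**bits - 1) / 2 and variance (2**(2 * bits) - 1) / 12.
    """
    levels = 2 ** bits
    total = sum(rng.randrange(levels) for _ in range(n_terms))
    mean = n_terms * (levels - 1) / 2.0
    std = math.sqrt(n_terms * (levels ** 2 - 1) / 12.0)
    return (total - mean) / std


def sample_weight(mu, sigma, rng):
    """Draw one weight from the (approximately) Gaussian distribution."""
    return mu + sigma * approx_standard_normal(rng)


# Energy figures reported in the text (nJ per classification)
RNG_NJ = 446.8
CROSSBAR_NJ = 343.3
TOTAL_NJ = 790.2
IMPROVEMENT = 24

component_sum = RNG_NJ + CROSSBAR_NJ          # agrees with TOTAL_NJ to rounding
rng_share = RNG_NJ / TOTAL_NJ                 # fraction spent in the RNG unit
implied_baseline_uj = TOTAL_NJ * IMPROVEMENT / 1000.0

print(f"sum of components: {component_sum:.1f} nJ")
print(f"RNG share: {rng_share:.1%}")
print(f"implied CMOS baseline: {implied_baseline_uj:.2f} uJ")
```

With only three accumulated terms, the samples are bounded to roughly three standard deviations, so the tails of the Gaussian are truncated in this approximation.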
Note that resistive crossbars are usually characterized by a limited fan-in (much smaller than the neuron fan-in in typical deep networks) due to nonidealities, parasitics, and sneak paths. Hence, mapping a practically sized network requires distributing the synapses of a neuron across multiple crossbars [31], [32]. Such architectural-level innovations can be easily integrated with our current proposal.
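A minimal sketch of this partitioning is given below: the dot product of one neuron is split into tiles no larger than the crossbar fan-in, and the partial sums are then combined. The fan-in limit of 64 and the function name are hypothetical, and the sketch does not model how the architectures in [31], [32] actually combine partial sums in hardware.

```python
def tiled_dot(weights, inputs, max_fan_in):
    """Evaluate a dot product as a sum of per-crossbar partial sums.

    Returns the total and the number of crossbar tiles used.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    partial_sums = []
    for start in range(0, len(weights), max_fan_in):
        w_tile = weights[start:start + max_fan_in]
        x_tile = inputs[start:start + max_fan_in]
        # One crossbar column evaluates this partial dot product
        partial_sums.append(sum(w * x for w, x in zip(w_tile, x_tile)))
    return sum(partial_sums), len(partial_sums)


# A 784-input neuron (one MNIST pixel per input), assumed fan-in limit of 64
weights = [(i % 7) - 3 for i in range(784)]
inputs = [i % 2 for i in range(784)]
total, n_tiles = tiled_dot(weights, inputs, max_fan_in=64)
print(total, n_tiles)
```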
VI. SUMMARY
In summary, we proposed the vision of an "all-spin" Bayesian neural processor that has the potential to enable orders-of-magnitude improvements in hardware efficiency (area, power, and energy consumption) over state-of-the-art CMOS implementations. Computing frameworks, so far, have mainly segregated deterministic and stochastic computations. Standard deterministic deep-learning frameworks enabled by spintronic devices and other post-CMOS technologies have been explored. In such scenarios, device-level nonidealities are usually treated as a disadvantage. More recently, the stochasticity inherent in such devices (for instance, probabilistic switching in the presence of thermal noise) has been exploited to implement stochastic versions of their deterministic counterparts [33], [34]. Owing to the additional information encoding capacity in the switching probability, such devices can be scaled down to single-bit instead of multibit representations. Device stochasticity has also been used in other unconventional computing platforms, such as Ising computing and combinatorial optimization problems, among others [35]. Note that magnetic devices have previously been proposed for Bayesian inference engines [36], [37]; these engines mainly implement Bayes' rule for simple prediction tasks in directed acyclic graphs and do not overlap with Bayesian deep networks. Bayesian deep learning is a unique computing framework that necessitates the merger of both deterministic computations (dot-product evaluations of sampled weights and inputs) and stochastic computations (sampling weights from probability distributions), thereby requiring a significant rethinking of the design space across the stack from devices to circuits and algorithms.
