Uncertainty plays a key role in real-time machine learning. In a significant shift from standard deep networks, which do not consider any uncertainty formulation during training or inference, Bayesian deep networks are currently being investigated, where the network is envisaged as an ensemble of plausible models learnt through Bayes' formulation in response to uncertainties in sensory data. Bayesian deep networks consider each synaptic weight as a sample drawn from a probability distribution with learnt mean and variance. This paper elaborates on a hardware design that exploits the cycle-to-cycle variability of oxide-based Resistive Random Access Memories (RRAMs) as a means to realize such a probabilistic sampling function, instead of viewing it as a disadvantage.
I. INTRODUCTION
While Bayesian deep learning has shown promise as a pathway for enabling Probabilistic Machine Learning, the algorithms have been primarily developed without any insights regarding the underlying hardware implementation. Bayesian techniques are more computationally expensive than their non-Bayesian counterparts, thereby limiting their training and deployment in resource-constrained environments like wearables and mobile edge devices. In addition to the standard von-Neumann bottleneck [2] prevalent in current deep learning networks (where memory access and memory leakage can account for a significant portion of the total energy consumption profile), Bayesian deep learning involves repeated sampling of network weights from learnt probability distributions (in most cases Gaussian distributions, which are much more hardware expensive to sample than uniform probability distributions) and inference based on the sampled weights. For instance, implementing just a single synapse would involve a costly CMOS Gaussian random number generator circuit. Repeated parameter sampling and evaluation for just a single inference operation further worsens the von-Neumann bottleneck. With deep networks involving more than a million parameters, coupled with the necessity of performing mathematical operations on sampled probabilistic data, the projected hardware costs for a typical CMOS implementation would be prohibitive. Thus there is a pressing need to rethink hardware designs for such probabilistic machine learning frameworks from the ground up, such that the core hardware units are better matched to the models of computation.
Our paper elaborates on a cohesive design of an RRAM-based Bayesian processor that leverages the benefits of cycle-to-cycle variability for Gaussian random number sampling and "In-Memory" crossbar architectures to realize energy-efficient hardware primitives that have the potential of enabling probabilistic Artificial Intelligence. While there have been some recent efforts at incorporating RRAM stochastic switching processes in neuromorphic algorithms, the applications are either extremely simple machine learning tasks [3] or do not exploit the stochasticity for computing [4]. The concept of utilizing probability distributions obtained from RRAM cycle-to-cycle variations for deep learning applications has not been explored before.
II. UTILIZING CYCLE-TO-CYCLE RRAM VARIABILITY FOR PROBABILITY DISTRIBUTION SAMPLING
Metal oxide RRAMs are emerging as an alternative to traditional CMOS-based memories in non-Boolean computing due to their high density, CMOS compatibility and low power consumption. A metal oxide RRAM is a two-terminal device whose resistance can be changed by applying a voltage across the terminals (see Fig. 1). The ability to program the device into multiple resistance states can be exploited to use the device as a multi-bit memory. The device consists of an oxide layer sandwiched between two electrode layers. Some materials used for the oxide layer are HfOx, TiOx, AlOx and NiOx, and materials used for the electrodes are Ti and Pt.
The metal oxide RRAM has two primary resistance states, the high resistance state (HRS) and the low resistance state (LRS). When the resistance of the RRAM switches from HRS to LRS, the event is called a SET process, and when the resistance switches from LRS to HRS, the event is called a RESET process. The SET and RESET processes of metal oxide RRAMs can be simplified by studying them as the growth and rupture of a single conduction filament. The distance between the tip of the filament and the opposite electrode, or the gap length (g), is the primary variable governing the device I-V characteristics. During the SET process, the oxygen ions drift to the anode interface, creating a conductive oxygen vacancy path which leads to the growth of the conduction filament, reducing the gap length. During the RESET process, the reverse electric field drives the oxygen ions back to recombine with the vacancies, breaking down the conduction filament and increasing the gap length. The current flowing through the device is generalized using the following equation [1],
$$I = I_0 \exp\left(-\frac{g}{g_0}\right) \sinh\left(\frac{V}{V_0}\right) \qquad (1)$$
where $I_0$, $g_0$ and $V_0$ are fitting parameters, g is the gap length and V is the voltage across the device. The exponential dependence on the gap length arises due to the various tunneling mechanisms present in the device [1]. We used an experimentally calibrated, publicly accessible RRAM model [5] and our simulation parameters correspond to a bilayer
The parameters are tabulated below. Please refer to Ref. [1] for a description of the fitting parameters. The RRAM RESET process is a gradual one and exhibits cycle-to-cycle resistance variation as depicted in Fig. 1(c). This is due to the random oxygen-vacancy filament dissolution process. While prior works have reported similar variation studies, they have typically considered this as a non-ideality. In this article, we instead utilize this variation as a probability distribution sampling function for Bayesian network implementation. It is worth noting here that the distribution is not strictly Gaussian (lognormal is a better fit [4], [6]). However, we did not observe this minor deviation from a Gaussian distribution to affect the network accuracy. The network can also be trained assuming log-normal probability priors to account for the device constraints. In this work, we have assumed Gaussian probability priors for training.
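The link between the gap length and a lognormal-like resistance spread can be illustrated with a minimal sketch. It assumes the compact-model form I = I0 exp(−g/g0) sinh(V/V0) and, purely for illustration, a Gaussian cycle-to-cycle variation of the gap length. The parameter values, the function names and the Gaussian-gap assumption are hypothetical and are not the calibrated values of the model in Ref. [5].

```python
import math
import random
import statistics

# Illustrative fitting parameters (assumed; NOT the calibrated values of Ref. [5]).
I0 = 1e-3      # current prefactor in amperes
G0 = 0.25e-9   # gap-length scale in metres
V0 = 0.25      # voltage scale in volts


def rram_current(voltage, gap):
    """Compact-model current: I = I0 * exp(-g / g0) * sinh(V / V0)."""
    return I0 * math.exp(-gap / G0) * math.sinh(voltage / V0)


def sample_read_resistances(n, v_read=0.1, gap_mean=1.5e-9,
                            gap_std=0.05e-9, seed=0):
    """Draw n read resistances, one per hypothetical RESET cycle.

    Each cycle is assumed to leave a Gaussian-distributed gap length;
    this is an illustrative assumption, not a measured distribution.
    """
    rng = random.Random(seed)
    resistances = []
    for _ in range(n):
        gap = rng.gauss(gap_mean, gap_std)
        resistances.append(v_read / rram_current(v_read, gap))
    return resistances


def log_resistance_spread(resistances):
    """Standard deviation of ln(R); equals gap_std / g0 for a Gaussian gap."""
    return statistics.pstdev(math.log(r) for r in resistances)


if __name__ == "__main__":
    samples = sample_read_resistances(5000)
    print("mean R (ohm):", statistics.mean(samples))
    print("std of ln(R):", log_resistance_spread(samples))
```

Because the current depends exponentially on the gap, a Gaussian spread in g maps to a Gaussian spread in ln(R), which is one simple way a lognormal resistance distribution could arise under these assumptions.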
III. SYSTEM DESIGN AND RESULTS
In a Bayesian Neural Network, the synaptic weights, W, are typically characterized by Gaussian probability distributions. Once all the posterior distributions are learnt (µ and σ parameters of the weight distributions) [7], the network output corresponding to input, x, should be obtained by averaging the outputs obtained by sampling weights from the posterior distribution of the weights, W [8]. The output of the network, y, is therefore given by,
$$y = \mathbb{E}_{P(W|D)}\left[P(D|W)\right] \approx \mathbb{E}_{q(W,\theta)}\left[f(x,W)\right] \approx \frac{1}{S}\sum_{s=1}^{S} f(x, W^{s}) \qquad (2)$$
where P(D|W) is the likelihood, corresponding to the feedforward pass of the network, and f(x,W) is the network mapping for input x and weights W. The approximation is performed over S independent Monte-Carlo samples, $W^{s}$, drawn from the Gaussian distribution, q(W, θ), characterized by parameters θ = (µ, σ), where µ and σ represent the mean and standard deviation vectors for the probability distributions representing P(W|D) [11]. We consider the Variational Inference method [9], [10] for training the network in this work.
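The Monte-Carlo averaging described above can be sketched in a few lines. This is a minimal illustration for a single softmax layer, not the network or the simulation framework used in this work; the function names, the layer shape and the default of 100 samples are assumptions introduced here.

```python
import math
import random


def sample_weights(mu, sigma, rng):
    """Draw one weight matrix W ~ N(mu, sigma), element by element."""
    return [[rng.gauss(m, s) for m, s in zip(mu_row, sigma_row)]
            for mu_row, sigma_row in zip(mu, sigma)]


def forward(x, weights):
    """Hypothetical network mapping f(x, W): one linear layer plus softmax."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    peak = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_predict(x, mu, sigma, num_samples=100, seed=0):
    """Average f(x, W_s) over num_samples weight samples drawn from q(W, theta)."""
    rng = random.Random(seed)
    accumulated = [0.0] * len(mu)
    for _ in range(num_samples):
        output = forward(x, sample_weights(mu, sigma, rng))
        accumulated = [a + o for a, o in zip(accumulated, output)]
    return [a / num_samples for a in accumulated]


if __name__ == "__main__":
    mu = [[0.5, -0.2], [0.1, 0.3]]
    sigma = [[0.1, 0.1], [0.2, 0.05]]
    print(mc_predict([1.0, 2.0], mu, sigma))
```

Every inference therefore costs S weight draws per synapse, which is the sampling overhead that motivates a dedicated hardware sampling primitive.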
Hence, the core computation in a Bayesian network is a dot-product between an input vector, x, and samples drawn from a Gaussian probability distribution, N(µ, σ). It can be shown by simple algebraic manipulations that the dot-product for a single neuron is equivalent to the sum of the dot-products x.µ' and [xN(a, b)].σ', where the probability distribution of the RRAM cycle-to-cycle resistance variation is characterized by mean a and standard deviation b (σ' = σ/b and µ' = µ − aσ/b, so that µ' + σ'N(a, b) has the same distribution as N(µ, σ)). While the first dot-product can be easily performed in a standard RRAM crossbar array, the implementation of the second term is shown in Fig. 2. We utilize a current-to-voltage converter to scale each input to the crossbar array by a factor sampled from a Normal distribution. The amplification factor of this stage is designed such that the maximum read voltage of the crossbar array is limited to 0.3 V (to ensure linear RRAM I-V characteristics). The crossbar array stores the parameters σ'. Each column is read sequentially by the array peripherals in order to ensure that independent random samples are used for each array element. The current flowing through each cross-point device is scaled by the conductance of the device and, due to Kirchhoff's current law, all these currents are summed along the column, thereby realizing the dot-product kernel. The analog current outputs drive interfaced Analog-to-Digital converters (ADCs) to provide output to the fan-out neurons.

[Fig. 2 block labels: Digital-to-Analog converter; RRAM Synapse Crossbar Array storing variances; Analog-to-Digital Converter.]
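The decomposition of the neuron dot-product into a mean term and a device-sampled term can be checked numerically with the following minimal sketch. It assumes that b denotes the standard deviation of the device distribution and that the stored parameters are remapped as σ' = σ/b and µ' = µ − aσ/b, which is the form for which the two sides agree term by term. The function names and numeric values are hypothetical.

```python
import random


def remap_parameters(mu, sigma, a, b):
    """Map learnt (mu, sigma) to stored (mu', sigma') for a device with mean a, std b."""
    sigma_prime = [s / b for s in sigma]
    mu_prime = [m - a * sp for m, sp in zip(mu, sigma_prime)]
    return mu_prime, sigma_prime


def neuron_direct(x, mu, sigma, eps):
    """Reference: x . W with W = mu + sigma * eps and eps ~ N(0, 1)."""
    return sum(xi * (m + s * e) for xi, m, s, e in zip(x, mu, sigma, eps))


def neuron_split(x, mu_prime, sigma_prime, device_samples):
    """Hardware-style form: x . mu' plus [x * r] . sigma' with r ~ N(a, b)."""
    mean_term = sum(xi * m for xi, m in zip(x, mu_prime))
    sampled_term = sum((xi * r) * sp
                       for xi, r, sp in zip(x, device_samples, sigma_prime))
    return mean_term + sampled_term


if __name__ == "__main__":
    rng = random.Random(1)
    a, b = 5.0, 0.4                       # assumed device mean and std (arbitrary units)
    x = [0.2, 0.7, 1.0]
    mu = [0.1, -0.3, 0.05]
    sigma = [0.02, 0.05, 0.01]
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    device = [a + b * e for e in eps]     # same randomness expressed as device samples
    mu_p, sigma_p = remap_parameters(mu, sigma, a, b)
    print(neuron_direct(x, mu, sigma, eps), neuron_split(x, mu_p, sigma_p, device))
```

The first term needs only a deterministic crossbar read, while the second term applies the random factor to the inputs rather than to the stored weights.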
A hybrid device-circuit-algorithm co-simulation framework was developed and the design was tested on a standard digit recognition problem using the MNIST dataset [12]. The neural network used in this work consisted of 2 hidden layers, each with 200 neurons. The probability distributions were learnt using the 'Bayes by Backprop' algorithm [13]. We considered 4-bit representation in the RRAM devices in the cross-point array and 3-bit discretization in the Analog-to-Digital converter output (interfaced with the crossbar array). We considered RRAM resistance programming ranges to be 16. Note that cycle-to-cycle read variations will also be present in the crossbar RRAM resistances. However, such variation was observed to produce a negligible accuracy drop. This is consistent with prior work, since it is well known that neural networks are resilient to minor fluctuations in their synaptic weights. Our proposal exploits the read resistance variation and amplifies its effect to scale the dot-product calculation accordingly. Using our hybrid hardware-algorithm co-simulation framework, the test accuracy of the network was 96.09% (averaged over 10 samples). The baseline idealized software network achieved an accuracy of 97.51% on the testing set (averaged over 10 sampled networks).
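The effect of a 4-bit device representation and a 3-bit ADC output can be sketched with a simple uniform quantizer. The text does not specify the quantization scheme used in the co-simulation framework, so the uniform spacing, the clipping behaviour, the value ranges and the function name below are all assumptions made for illustration.

```python
def quantize_uniform(value, lo, hi, bits):
    """Snap value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    steps = 2 ** bits - 1                    # number of intervals between levels
    clipped = min(max(value, lo), hi)        # saturate out-of-range inputs
    step = (hi - lo) / steps
    return lo + round((clipped - lo) / step) * step


if __name__ == "__main__":
    # Hypothetical normalized conductance stored with 4 bits (16 levels).
    stored = quantize_uniform(0.4321, 0.0, 1.0, bits=4)
    # Hypothetical normalized column current digitized with 3 bits (8 levels).
    digitized = quantize_uniform(0.4321, 0.0, 1.0, bits=3)
    print(stored, digitized)
```

Under these assumptions the worst-case rounding error is half a step, that is 1/30 of the range for 4 bits and 1/14 of the range for 3 bits.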
Since RRAM devices are inherently memory elements, the ability to perform the costly dot-product operations (one of the most computationally expensive operations in neural network hardware) in the memory array itself enables us to address the von-Neumann bottleneck. Prior work has revealed that RRAM-based "In-Memory" architectures can potentially achieve 1-2 orders of magnitude improvement in energy consumption in comparison to a baseline CMOS implementation [14], [15]. Memory access and memory leakage constitute a significant portion of the total system energy consumption in CMOS implementations, and the proportion increases significantly as the network size increases. Additionally, the memory access latency increases with increasing model size, thereby increasing the memory leakage energy as well. While these benefits remain valid in this scenario, the Gaussian random number generation adds to the hardware complexity. CMOS-based implementations of Gaussian random number generators rely on a significant number of linear feedback circuits, which are extremely hardware expensive.
For instance, a recent work on a CMOS-based 64-parallel Gaussian RNG reports 1780 registers and 528.69 mW power consumption [16]. In stark contrast, our implementation simply uses a single RRAM device and a Sample and Hold circuit to implement the probability distribution sampling. The total power consumption for our RRAM-based Gaussian random number generation is 32.54 mW for a similar 64-parallel generation task, which is 16.25× lower than the baseline CMOS implementation. It is worth noting here that device-level non-idealities in other technologies can be harnessed as well for such probabilistic AI hardware. This work is based on Ref. [17], which explored the design for spintronic technologies. Here we explore the design for RRAM devices, which requires significant rethinking due to the different intrinsic device physics and operating voltage/current characteristics and conditions.
IV. SUMMARY
In conclusion, we have provided a vision for an RRAM-based "In-Memory" computing primitive for Bayesian neural hardware that utilizes simple device-circuit primitives and leverages RRAM stochasticity as a computing resource, instead of viewing it as a disadvantage. Such hardware-software co-design of Bayesian neural network models can potentially lead to real-time decision making in autonomous agents in the presence of uncertainties.
