This work discusses the implementation of Markov Chain Monte Carlo (MCMC) sampling from an arbitrary Gaussian mixture model (GMM) within SRAM. We show a novel architecture of SRAM by embedding it with random number generators (RNGs), digital-to-analog converters (DACs), and analog-to-digital converters (ADCs) so that SRAM arrays can be used for high performance Metropolis-Hastings (MH) algorithmbased MCMC sampling. Most of the expensive computations are performed within the SRAM and can be parallelized for high speed sampling. Our iterative compute flow minimizes data movement during sampling. We characterize power-performance trade-off of our design by simulating on 45 nm CMOS technology. For a two-dimensional, two mixture GMM, the implementation consumes ∼ 91µW power per sampling iteration and produces 500 samples in 2000 clock cycles on an average at 1 GHz clock frequency. Our study highlights interesting insights on how lowlevel hardware non-idealities can affect high-level sampling characteristics, and recommends ways to optimally operate SRAM within area/power constraints for high performance sampling.
I. INTRODUCTION
Markov chain Monte Carlo (MCMC) is an extensively used statistical sampling technique for generating samples from high-dimensional probability density functions even when these functions can not be defined analytically [1] , [2] . Especially, in recent years, as various machine learning (ML) platforms are proliferating for realtime decision-making, MCMC is being combined with Bayesian ML models to perform efficient inference [3] , [4] . Unlike classical inference, Bayesian inference can capture uncertainties in the outcomes for riskaware decision making [5] as shown in Fig. 1 . Moreover, there is a growing interest to operate ML-based prediction models at the edge itself [6] , [7] . A low power/area MCMC platform is, therefore, becoming imperative along with low power ML implementation.
Prior works have discussed low power MCMC implementations using FPGAs, and have achieved an accuracy similar to their software counterparts, while being more energy-efficient. An FPGA-based hardware accelerator [8] was designed for variational inference of Bayesian neural networks (BNNs). Conversely, in this work, we present MC 2 RAM -a customized implementation of MCMC within SRAM where we co-locate and co-optimize functional units, control flow, and data flow to address critical bottlenecks in high-speed MCMCbased sampling. Our compute flow exploits the Markov chain property where successive chain outputs lie in the proximity minimizing the necessary computing load in each iteration. While MCMC in prior works [8] , [9] is limited to Gaussian functions, MC 2 RAM expands this to Gaussian mixture models (GMMs). A GMM can model any density arbitrarily closely with enough mixture components, making our implementation vastly more applicable. We develop interesting insights about the interaction between low-level hardware non-idealities and high-level sampling characteristics in MC 2 RAM. Our detailed design and operating power space exploration can lead to efficient design methodologies for SRAM-based sampling.
The paper is organized as follows. Section II discusses the background on MCMC and provides an overview of MC 2 RAM. Section III discusses density function computation and sampling in MC 2 RAM. Section IV discusses the results and implications of various sources of non-idealities in MC 2 RAM on sampling. Section V concludes this paper.
II. MCMC FOR BI AND PROPOSED MC 2 RAM
In ML models and methods that employ Bayesian inference (BI), the expectation of the predicted outcome (or other quantity of interest) is obtained by solving M (I, w) × P (w|D)dw, where M (I, w) is the model (say, a neural network) with the input I and parameters/weights w, and P (w|D) is the posterior density of weights given training data D. In such computations, an analytical integration is often intractable since the density function of the random variable (RV) and/or the function to be integrated, e.g., P (w|D) and M (I, w) in BI, are too complicated. A Monte Carlo approach, therefore, becomes necessary to numerically compute these quantities. Monte Carlo approach reduces an integral over a function of RV, x, as 
Here, G(x) is the function to be integrated, F(x) is the density function of x, and x F(x) is an independent and identically distributed (i.i.d.) sample drawn from F(x). The law of large numbers guarantees an asymptotic convergence of the summation to the exact integral as the number of samples (T ) increases. In BI, F(x) is the posterior density P (w|D), which can be numerically extracted using the Bayesian formula P (w|D) ∝ P (D|w) × P (w), but cannot always be defined analytically. Therefore, MCMC overcomes this critical problem by defining an ergodic Markov chain. Among the MCMC methods for sampling we choose Metropolis-Hastings (MH) sampling [11] that provides a middle ground in terms of acceptance/rejection complexity of the samples and the average time needed to generate a candidate sample. The algorithmic steps for MH-based MCMC are demonstrated in Fig. 2 . A candidate sample at step t, x cand t is determined from the previously accepted sample x t−1 and a randomly generated sample, R, as x cand t = R + x t−1 ; here, R follows the statistics of the proposal distribution P (e.g., Uniform or Gaussian) typically centered at zero. The statistics of R controls the search radius and sample search behavior in MCMC. The candidate sample is accepted when the ratio of the density of x cand t to that of x t−1 , i.e., F(x cand t )/F(x t−1 ) is greater than a random threshold, U , which is generated uniformly between zero and one. Fig. 3 shows the overall architecture of MC 2 RAM. An SRAM array stores the GMM parameters for the density function of the RV, i.e, mean, variance, and mixture weights (µ, σ, & p). Since the dimension of the sampling density can be high, µ, σ, & p vectors are stored appropriately in multiple SRAM banks as shown in the figure. SRAM arrays are also integrated with random number generators (RNG). Using RNGs and a previous sample of the chain (x t−1 ), a candidate sample (x cand t ) is generated within the SRAM. For x cand t , SRAM arrays partially compute the density of the candidate sample, i.e., F(x cand t ), in parallel by following single instruction multiple data (SIMD) style execution. Central processing layer receives the partial terms for density computation from SRAM arrays and applies Metropolis-Hastings-based sample acceptance criteria to accept/reject x cand t . Using the accepted x t , the chain iterates to find the next sample x t+1 .
The key complexities of MCMC illustrated via Fig. 2 F(x) can be complex, and (ii) high throughput sampling since many x cand t may end up being rejected. The details of our framework addressing these complexities is discussed subsequently.
III. IN-SRAM DENSITY COMPUTATION AND SAMPLING
The density of a GMM, G(x), for a candidate sample x cand t is given by
Here, M is the total number of mixture components; N(x cand t ; µ j , σ j ) is j th Gaussian mixture function whose density depends on µ j and σ j ; and p j is the mixture weight. In approximating a density function F(x) using GMM, mixture Gaussians with only diagonal co-variance can be used (this is called as mean-field approximation [12]). The density of N(x cand t ; µ j , σ j ) depends on its exponent E j as
Here, x cand , µ j and 1/σ j are each N -dimensional and expanded using the subscript 'i' as in the above equation. The overall GMM density in log-domain can be computed from the exponents E j itself using the identity ln(e a + e b ) = a + ln(1 + e b−a ) and a look up table (LUT) for ln(1 + e x ). E j can be further simplified by exploiting the sampling property where x cand t is searched in the proximity of the previous MCMC sample
Here, R is a generated random number from the proposal distribution P within SRAM, which is used to search the next MCMC sample. R/σ 2 j is an N -dimensional vector obtained by dividing each element of R, R i , with the corresponding σ 2 ij . We apply SRAM to compute scalar products R · R/σ 2 j and (x t−1 − µ j ) · R/σ 2 j within SRAM array to evaluate (4). For the scalar product of two vectors V and W , i.e., V.W , W is stored in the 8-T SRAM cell-array and V is copied to the DAC-operand buffer shown in Fig. 5 . An 8-T SRAM cell is shown in Fig. 4(a) which has an additional scalar product port as shown in red in the figure. SRAM columns store W in n-columns with n-bit precision. Digital-to-analog converter (DAC) converts the input V into corresponding analog-mode current vector I v and applies to the product word line W L p of the cells. The basic approach for the scalar product is to use memory cells as current-mode AND gate. If an SRAM cell 'j' stores bit '1', it allows the row DAC current I v to flow to its bit-line BL p . The currents from each active cell in the column add up and follow V.W as shown in Fig. 4(c) . The column multiplexer selects only one column at a time and the selected column-current will be read by an analog-to-digital converter (ADC) as shown in Fig. 4(d) and (e) that converts column current to digital bits representing V.W i , where W i is the i th precision binary vector of W . The column current is passed to an OP-AMP with a resistive feedback to convert the column current to corresponding analog voltage. The resistance value, R, is designed to match the operating range of ADC. OP-AMP in Fig. 4(d) serves two purposes. It stabilizes the potential of the tail-end of the column, and it also biases the column tail potential to zero. The hold cell in the Fig. 4(d) samples the output potential of OP-AMP and retains it after the OP-AMP is disconnected and bias-current of row DACs is turned off to save biasing power. The current of all n-columns are converted using ADC and combined with digital scaling to compute V.W . We use two-step flash ADCs [13] , [14] that optimally balances area/power constraints without incurring excessive delay.
For the proposed implementation in 45nm CMOS, Fig. 6 (a) shows HSPICE simulation for the scalar product using 32row SRAM cell array matching the ideal. The current-mode processing in the proposed design gives significant advantages. The SRAM cells either act as current buffers or block the input current so that the variability in SRAM cell transistors has minimal impact to the accuracy of scalar product that posed challenge in [15] . Also, the V T H variability of cell transistors does not affect the scalar product when DAC reference currents are sufficiently higher than SRAM leakage. Fig. 6(b-c) illustrate the variability analyses on the operation of SRAM. In Fig. 6(b) , the column current follows a Log-Normal distribution when considering process variability due to SRAM transistors. In Fig. 6(c) , with σ(V T H ) = 30 mV (black curve), the variation in column current is 1.43 (normalized against mean) that corresponds to the DAC reference current of 5 nA. Upon increasing the DAC current, the variability of column current reduces when normalized against the mean value. DAC in Fig. 4(b) displays two critical non-idealities that affect the scalar product accuracy: (i) Channel length modulation (CLM) in the mirroring transistors that affects the scalar product accuracy posing dependence to WL p potential and (ii) non-ideal mirroring ratio due to process variability. We address CLM-induced precision degradation by reducing the turn-ON voltage of select switches in DAC to limit source-to-drain voltage of mirroring transistors, which improves the accuracy as shown in Fig. 6(d) . To minimize process variability-induced non-ideal mirroring ratio in DAC, a set of calibrating transistors with small width W c relative to mirroring transistors are added to DAC in Fig. 4(b) . DAC mirror current is read against a reference to add W c until the current meets the desired level.
In MC 2 RAM, storage of density function (F(x)) parameters and sample generation R is collocated within the same array by integrating RNG cells with SRAM cells. Since in a highdimensional weight space many R end up being rejected, colocating the operations with the same SRAM array minimizes overheads and data movement. Fig. 4(f) shows the RNG cell based on cross-coupled inverters [16] . The differential ends Q and Q B are pre-charged to V DD when CLK = 0. When CLK = 1, the thermal noise resolves the meta-stability to generate a random bit. The random bits, stored in DAC operand buffer, can further be scaled with σ 2 within DAC as shown in Fig. 4(g) . The scaled R/σ 2 is used for density computation based on (4).
IV. RESULTS AND DISCUSSIONS ON SAMPLING
We analyze simulated distribution with respect to the ground truth using scatter plots and KL divergence [17] . The KL divergence between two discrete probability density functions F(x) and G(x) is measured as
We considered a sample GMM, GMM T , with µ = [1, −1; −1, 1], σ = [1, 0; 0, 1], and p = [0.5, 0.5]. We also considered 500 samples to be sufficient to eliminate any statistical errors in KL divergence. We discard the first 50 samples as burn-in samples. For GMM T , Fig. 7(a) shows the sampling trajectory and distribution when the precision of DAC/ADC is 8 bits with 1 volts power supply. The contour lines in the figure represent ground truth GMM and the scattered dots in blue represent simulated distribution of samples from MC 2 RAM. The trajectory of samples as shown in red in the figure corresponds to 75 random walks. Fig. 7(b) illustrates sampling when mean distance parameter d in µ = [d, −d; −d, d] between GMM components is set to 5. Fig. 7(c) shows DAC/ADC imprecision tolerance limit in MC 2 RAM which allows DAC and ADC to be low power/area. Sampling deviates from ground truth for ADC below 5-bit precision. Whereas, reduction in DAC precision has no significant impact in KL divergence. This also justifies our choice of using a low precision two-step flash ADC in MC 2 RAM which has lower overhead for low to moderate precision design. Also, for fixed sampling iterations KL divergence is high for large separation between GMM components. As we go for higher dimensions, the deviation of samples from ground truth increases as shown in 7(d). For a two-dimensional, two mixture GMM, the implementation consumes ∼ 91µW power per sampling iteration and produces 500 samples in 2000 clock cycles on an average at 1 GHz clock frequency. The pie chart in Fig. 6 (e) shows SRAM cells and DAC together contributing to 18% of power consumption whereas the remaining 82% is due to the ADC. The 10 comparators in the 2-step subranging flash ADC constitute 60% of the power consumed by the ADC, however, the delay associated with flash ADC is 2 clock cycles, which improves sampling throughput in MC 2 RAM.
V. CONCLUSION
We have presented a novel framework MC 2 RAM that is a key to accelerate Markov chain Monte Carlo (MCMC) sampling for Bayesian Inference (BI). We exploit MC 2 RAM to store parameters of posterior density of weights in BI and random number generation for high throughput Metropolis-Hastings (MH) based sample acceptance/rejection. The framework samples at low precision and power with tolerance to process variation thus removing latency/energy/safety bottlenecks associated with traditional von Neumann architecture that has spatially distant memory and processing elements.
