ABSTRACT We present an architecture and a compilation toolchain for stochastic machines dedicated to Bayesian inferences. These machines are not Von Neumann and code information with stochastic bitstreams instead of using floating point representations. They only rely on stochastic arithmetic and on Gibbs sampling to perform approximate inferences. They use banks of binary random generators which capture the prior knowledge on which the inference is built. The output of the machine is devised to continuously sample the joint probability distribution of interest. While the method is explained on a simple example, we show that our machine computes a good approximation of the solution to a problem intractable in exact inference.
I. INTRODUCTION
This project is inspired by E.T. Jaynes. His book, "Probability as Logic" [1] , proposes probability as an extension of logic to reason with incomplete and uncertain information. Since the beginning of our work in the 90s [2] , [3] , it has always been our goal to build a machine able to perform this type of reasoning on specialized hardware. We started by designing a method to automate probabilistic reasoning called Bayesian Programming [4] and an associated programming langage ProBT 1 to specify and perform Bayesian inferences on standard computers. The formalism and the inference engine of ProBT have been used to develop many academic [5] , [6] , and industrial applications. 2 Recently the European project BAMBI 3 gave us the opportunity to pursue our project. We started this project by designing and implementing machines performing exact inference with stochastic arithmetics on an FPGA. While these non conventional machines could already be used to synchronize a pair of Linear Feedback Shift Registers [7] or to control a robot [8] , they could not handle any problem of reasonable complexity. Here we describe a second generation of machines based on sampling which allow to approximate the solution of otherwise intractable problems.
The paper is organized as follows. Related works with similar goals and approaches are described next. Section III presents stochastic bitstreams and explains why we made the choice to sample binary probabilistic variables. In particular, it introduces the two only building blocks required to build the stochastic machine: Conditional Distribution Elements (CDEs) and Extended C-Elements (ECEs). CDEs are used to generate programmable and addressable stochastic bitstreams. ECEs extend Muller C elements and are used for inference. In the following section we recall the Gibbs algorithm in the case of binary variables and show how it may be implemented by combining our previously defined building blocks. Then, in Section V we describe our compilation toolchain. It starts from a Bayesian program on discrete variables, transforms it into an equivalent program on binary variables and finally compiles it into a VHDL circuit specification. Finally we use a simulator to demonstrate the performance of the obtained circuits performing approximate inference on an intractable problem.
II. RELATED WORK
The design of unconventional or non Von Neumann machines is essentially driven by the idea of reducing the power consumption of computing devices. It is also related to the need to process larger amounts of data while being closer as ever to the limits imposed by physical laws on current processors and the end of Moore's law. The need for 1 A free version of ProBT is available at http://www.probayes.com/fr/ Bayesian-Programming-Book/downloads/ more data processing with less energy comes from the common availability of extremely large data bases and from applications related to sensor fusion and interpretation.
Massively parallel and multi-core architectures are the current response to these challenges but several projects tend to investigate alternate paths. Projects like SpiNNaker 4 and TrueNorth 5 are still standing on the side of the multi-core approach while being bio-inspired. The global architecture they propose changes to comply with the artificial neural network inspiration, but the core computations are still made with Von Neuman processor architectures.
The Probabilistic-CMOS (PCMOS) project [9] modifies the structure of processors to build resilient and energy efficient machines. The key idea is to lower the nominal voltage at which standard CMOS circuits operate and to compensate errors with a robust design. An algebra formalizes how errors are propagated along a chain of logical operations, leading to circuits computing the correct result with a specified probability. Contrary to our approach, in which we design new theory and components to compute inference, PCMOS manages errors on classic CMOS arithmetic operators to perform approximate computation with compromises between circuit size, voltage, and precision.
Closer from our own work are the approaches taken by Vigoda, Masinkha and Jonas at MIT. In all cases, probabilistic programming is seen as a way to program/specify the future machines and they all avoid the Von Neuman architecture as well as the use of FPUs. Vigoda [10] was the first to initiate, in 2003, the design of non conventional machines dedicated to Bayesian inference. The track chosen was to represent probability distributions on binary variables with analog electrical values. He focused on a well known algorithm for exact inference: the message passing algorithm, and devised the necessary analog components to perform the message passing to obtain the circuit performing the desired inference. While the approach has many practical applications it will not handle complex inference problems.
To address this scalability issue, Mansinghka [11] and Jonas [12] are promoting a sampling approach based on stochastic calculus: Gibbs samplers or MCMC methods (Metropolis-Hastings) to perform approximate probabilistic inferences and return samples of the posterior distribution. These methods are already established in probabilistic languages such as Church [13] or Venture [14] , but also exist inside inference engines supporting approximate inference such as ProBT [15] . To obtain a hardware implementation they proposed a set of transition circuits to sample quantized or continuous variables. In [16] the circuits are based on building blocks performing operations (for example normalisation) by evaluating mathematical expressions with dedicated hardware or by using precomputed tables. Our goal was to propose simpler building blocks, and the architecture we propose only uses two types of probabilistic gates: the CDE (similar to the CPT gate of Jonas) and the ECE gate which has no equivalent in the previously proposed architectures.
Our project is also strongly related to the study of nano scale devices to perform Bayesian inference, which motivates our architecture choices. These devices are nanoscale samplers for Poisson distributions which could eventually replace the current source of entropy (PRNG for pseudo random number generator) of the presented architecture, as in the work of the Computing Fabrics Laboratory of the University of Massachusetts Amherst [17] which is presenting an unconventional hardware architecture and nanoscale technology implementation of Bayesian inference.
III. STOCHASTIC BITSTREAMS AND OPERATORS

A. DEFINITION
A stream of bits may be considered as encoding the probability p of a binary variable: p ¼ n 1 =ðn 1 þ n 0 Þ where n 1 is the number of 1 and n 0 the number of 0 in the stream. It may also be considered as coding the odds o ¼ p=ð1 À pÞ of this binary variable. The odds ratio is then given by o ¼ n 1 =n 0 .
Conversely, given a probability p (or an odds ratio o) we may use software or hardware stochastic processes to generate a bitstream BðpÞ (respectively BðoÞ) that approximately encodes the probability p (respectively the odds ratio o).
Operators on bitstreams may then be used to perform probabilistic calculus on binary distributions. A central point of this paper is to show that using only two simple operators called CDE and ECE we can solve complex probabilistic problems.
As these operators compute with an approximate representation of probabilities, we will use the root of the mean square error (RMSE) to measure the precision of an operator: if p 0 is the observed result given by an operator and p the desired result, we compute the RMSE by considering several experiments leading to several p 0 and averaging the square error on p and 1 À p
B. CDE: CONDITIONAL DISTRIBUTION ELEMENT
Gupta & Kumaresan [18] , [19] , proposed a circuit able to generate a stochastic bitstream BðpÞ encoding a probability p stored in a fixed point format p ¼ P i¼m i¼1 x i 2 Ài in a buffer X (see red part of Figure 1 ).
We propose to complement this circuit by an addressable memory (see blue part of Figure 1 ). Given a value in Y the content of the memory at this address (a given probability p Y ) is written in X. The generated stochastic bitstream Bðp Y Þ is then conditioned by the value Y.
C. ECE: EXTENDED C-ELEMENT
As in Gaines [19] , [20] we use operators to combine stochastic signals and perform arithmetic operations. For example, assuming Bðp 1 Þ and Bðp 2 Þ are two independent stochastic bitstreams we may use a simple AND gate to obtain Bðp 1 :p 2 Þ. As we will see in the next section, what we really need is the product of two odds: a way to generate Bðo 1 :o 2 Þ from two independent bitstreams Bðo 1 Þ and Bðo 2 Þ. For that purpose we propose to use an extension of Muller C-elements using more memory resources.
The Muller C-element truth table is given in Table 1 . It uses a one bit memory to provide the previous value Z prev as output when the inputs are either ð0; 1Þ or ð1; 0Þ. When two independent and not autocorrelated bitstreams Bðo 1 Þ and Bðo 2 Þ are given as inputs to a C-element, the output Bðo 1 :o 2 Þ is an unbiased estimator of the product of the odds o 1 and o 2 . This result can be obtained by applying the detailed balance principle to the Markov chain Z n ! Z nþ1 .
However, there is some autocorrelation in the output stream (a bit at time t is not independent of a bit at time t À 1) due to the effect of memory. This autocorrelation induces troubles when C-elements are cascaded [21] . To reduce the autocorrelation of the output, we extend the memory capacity using a buffer C of size D bits. This toroidal buffer works as follows. When the inputs have the same value (X=Y=v) the content of the buffer is shifted to the left, its rightmost bit is set to v and we provide v as output. When the inputs have different values, we provide the rightmost value of C as output and rotate the buffer content to the right. If D=1 we have a regular Muller C element. The truth table of the ECE is presented in Table 2 .
Several methods using some memory to mitigate the bitstream autocorrelation issue have been compared in [22] where the authors conclude that Edge Memories provide the best performances and speed on their decoding application. The toroidal buffer of the ECE serves the same purpose, but without the need of generating a random address, thus reducing the circuit area and power needs. Figure 2 presents the experimental RMSE for a cascade of two products of odds computed either with ECEs or with Edge Memories. As the buffer size is increased, the error decrease (until the precision threshold determined by the bitstream size N is reached) is faster for ECEs than for Edge Memories. Any joint probability distribution on X b c binary variables can be written as
IV. GIBBS ALGORITHM AND PRODUCT OF OODS
If we are interested in the probability distribution of one of these binary variables B i knowing the X b c À 1 values b j of all the others, we get
To compute the odds of B i , we get
The first term vanishes
As some variables may be independent, it may occur that some terms in the remaining products are independent of B i in which case they also get simplified. Equation (4) shows how to compute the odds of any binary variable knowing the value of all the others as a product of probability ratios. Each term has a likelihood, homogeneous to odds o, which can be sampled by bitstreams generated with probability p ¼ o=ð1 þ oÞ by CDEs described in Section III-B, and their product can be computed by ECEs described in Section III-C.
C. STOCHASTIC GIBBS CIRCUIT 1) PRINCIPLE
Let us take an example with three binary variables X ¼ fR; S; Gg to present the architecture of the stochastic Gibbs circuit. This example is inspired from the classical "sprinkler" Bayes net. R; S; G are binary variables corresponding to the predicates "it has been Raining", "the Sprinkler was turned on" and "the Grass is wet". The joint distribution may be decomposed as PðR; S; GÞ ¼ PðRÞPðSjRÞPðGjR; SÞ. We are interested in the probability that it rained knowing that the grass is either wet or dry: PðRjG ¼ gÞ. We have K ¼ fGg, I ¼ fRg and F ¼ fSg.
According to the Gibbs algorithm we need to sample R and S, which means, according to Equation (4), to compute two odds with the following formulas:
These formulas include 13 ratios which are preliminary knowledge specified in the model. We use CDEs to store them and to produce the associated bitstreams. We need one CDE for each of the three terms appearing in the decomposition. One for PðRÞ to produce when needed a bitstream Bðo R Þ with o R ¼ PðR¼1Þ PðR¼0Þ . One for PðSjRÞ with a four slot memory to produce respectively Bðo PðS¼1jR¼0Þ . Finally, one more for PðGjR; SÞ with a eight slot memory.
We then need two ECEs to compute the product of the three ratios required to sample oðRjG ¼ gÞ or the two required for oðSjG ¼ gÞ. This leads to the circuit sketched in Figure 3 dedicated to computing the inference PðRjG ¼ gÞ. The input has to be set to the desired value g and the output is a binary bitstream BðoÞ encoding OðRjG ¼ gÞ.
2) CONTROL AND SUBTLETIES
The Gibbs algorithm requires to scan in sequence all the binary variables in I [ F. This is obtained using a counter c. c is used by the multiplexer at the top left of Figure 3 to select which variable to update. It is also used as an input of each CDE to select the appropriate ratio to produce as they depend on the sampled variable being either R or S. For instance, if we sample S (c ¼ 0), we do not need the ratio
PðR¼1Þ
PðR¼0Þ and the output of the CDE R should be the bitstream Bðo ¼ 1Þ with equal number of 0 and 1. This bitstream is the neutral element for an ECE. PðS¼1jR¼0Þ . This is due to the fact that CDEs are used not only to store odds but more generally to store probability ratios.
As ECEs involve memory buffers, a burning sequence is required for the memory content to be reliable. This is implemented using a counter l which validates the output after a burning sequence of length k.
V. EXPERIMENTATION AND RESULTS
A. THE COMPILATION TOOLCHAIN
A compilation toolchain was developed to evaluate the proposed architecture. The input is any Bayesian program using discrete variables written in ProBT (see Figure 4 for an example). The output is a VHDL program specifying the circuit dedicated to the corresponding inference.
B. PREPROCESSING
Any probabilistic program using discrete variables is first transformed into an equivalent specification using only binary variables. For example, a distribution PðVÞ on a discrete variable with 2 n values will be transformed into one distribution and nÀ 1 conditional distributions on binary variables B V i : i 2 1; . . . ; n PðVÞ ¼ PðB
While the "Binary" representation looks less compact, it requires at most the same number of parameters as for representing PðVÞ. Each probability distribution, as for instance "PðB V 2 jB V 1 Þ", is automatically computed using the inference engine of ProBT.
C. COMPILER
The compiler starts from the previously generated "binary" Bayesian program and generates a VHDL file describing the final circuit. It specifies a circuit implementation of Equation 4 the architecture of which is similar to the one presented in Figure 3 , with two extra simplifications. One allowing to reduce the size of the memory of the CDE: it is not necessary to store all the terms which cancel out in Equation (3). Another one which reduces the number of ECEs by only considering informative binary signals: BðoÞ; o 6 ¼ 1.
D. SIMULATOR
The generated VHDL code could easily be simulated at the gate or the transistor levels with the standard EDA tools. However, this type of simulation is time consuming. We have developed our own simulation engine allowing very fast simulation of our virtual stochastic machines.
E. EXPERIMENTS
We designed a problem with circular dependences between its variables (as is happens in some image processing applications) and chose the parametric forms so that we could compute an analytic solution to be used as reference. We define a joint probability distribution on the n variables B i 2 0; 1 and the n variables V j 2 f0; . . . ; 2 m À 1g as
with
(see the Appendix for the definition of g) where u ¼ ðk 1 þ 0:5Þ2 Àm and v ¼ ðk 2 þ 0:5Þ2 Àm . We consider the following inference based on the previous joint distribution:
In this particular example 2 mÃn floating point additions would be necessary to compute the exact inference, which is intractable for large values of m and n. However it is possible to get an approximation of the exact solution with the following close form expression:
Àm Betaððk þ 0:5Þ2 Àm ; a; bÞ;
(7) where Betaðu; a; bÞ is the probability density function of a beta distribution with two known integer parameters a and b (see the Appendix).
F. RESULTS
We used a tractable instance of the previously defined inference (with m ¼ n ¼ 5) to evaluate the quality of our analytic approximation. ProBT was used to compute PðV 2 jb 1 ¼ 1; . . . ; b 5 ¼ 1Þ with exact inference by marginalizing over all unknown variables. The absolute difference with the analytic solution is always lower than 10 À17 , which validates the quality of the analytic approximation.
Increasing m will increase further the precision (see the Appendix). We now report results for a problem of size m ¼ 8; n ¼ 20. The exact inference would require 2 160 operations. We use the analytic approximation to evaluate the results since the exact inference solution is out of reach. The precision depends on:
The size D of the buffers used in ECEs to prevent the output bitstream autocorrelation. The burn-in duration BI allowing the ECEs to stabilize the output bitstream encoding the product Bðo 1 o 2 Þ. The number N of samples chosen to approximate the output distribution. Figure 2 , the precision was evaluated with a two stage product of odds. Here we use ECEs with a buffer size D¼16, which provides a good compromise between size and precision for this example cascading three ECEs. (Indeed, we compute the product of eight odd ratios, by arranging the ECEs as a binary tree of depth three.) Table 3 shows the impact of the burn-in BI and of the number N of samples on the Kullback-Leibler divergence between the analytic approximation and the results obtained with our stochastic architecture. We observe that the precision increases with N, but slowly. This is unsurprising since as mentionned before the precision of Bernoulli sequences go as ffiffiffiffi N 2 p . Increasing the length of the burning sequence after which the samples computed by the tree of ECEs are considered valid has a cost in terms of computation time, but it allows to significantly increase the precision of the results until they are as good as for floating point computations. This suggests a way to use the same circuits with different tradeoffs between latency (and power needs) and the desired precision.
VI. CONCLUSION
We have presented a compilation toolchain to design automatically stochastic machines that solve approximate Bayesian inference problems using the Gibbs sampling algorithm. This compilation toolchain is generic: it can translate any inference problem expressed as a Bayesian program into a circuit comprised only of two types of stochastic components: the CDE and the ECE. We implemented a quantitative test to evaluate our architecture on intractable problems. The results are promising: the output distribution of our stochastic machine approximates the analytic solution as well as a floating point implementation of Gibbs sampling.
In the near future we will extend and test our approach on Bayesian filters. In a second step, we will design a generic and programmable machine, which will accept any new Bayesian program without the need for a new circuit synthesis, opening the way to actually programmable Bayesian computing devices. RAPHA € EL LAURENT received the master's degree in computer science engineering from ENSI-MAG and the master's degree in artificial intelligence from Joseph Fourier University, Grenoble, France. During the PhD degree at Grenoble-Alpes University, he acquired valuable experience in probabilistic information processing systems as he implemented Bayesian models and algorithms to study cognitive aspects of speech perception, production, and learning. He is now working in the ProbaYes R&D Group, which develops efficient solutions in the fields of machine learning, probabilistic models, optimization, and data science for a wide range of industrial applications. His current research interests within the BAMBI project (Bottom-up Approaches to Machines dedicated to Bayesian Inference) include the design and hardware implementation of unconventional computing architectures along with the development of software tools allowing to automatically map probabilistic programs into efficient circuit implementations based on new computation paradigms. JACQUES DROULEZ received the mathematical and engineer training from Ecole Polytechnique, Paris, the complete medical training from MD: Lariboisi ere-St Louis Hospital, Paris, the master's degree in biochemistry, and a Habilitation to supervise research in cognitive sciences. He has got a fellowship from the Centre National dEtudes Spatiales (1978) (1979) (1980) (1981) (1982) . He is now emeritus director of research with the Centre National de la Recherche Scientifique, and head of the research teamÂ« active perception and probabilistic approachÂ» at ISIR laboratory of UPMC-Sorbonne university. His main research themes are the perception of 3D motion and objects, the theoretical study of models for multisensory interactions, and the adaptive motor control. He has about 90 publications in international journals including one in Proceedings of the National Academy of Sciences on sensory-motor integration model and one in Nature on object perception during self-motion. He is involved in several European and national research programs and in multidisciplinary scientific networks. 
