The brain interprets ambiguous sensory information faster and more reliably than modern computers, using neurons that are slower and less reliable than logic gates. But Bayesian inference, which underpins many computational models of perception and cognition, appears computationally challenging even given modern transistor speeds and energy budgets. The computational principles and structures needed to narrow this gap are unknown. Here we show how to build fast Bayesian computing machines using intentionally stochastic, digital parts, narrowing this efficiency gap by multiple orders of magnitude. We find that by connecting stochastic digital components according to simple mathematical rules, one can build massively parallel, low-precision circuits that solve Bayesian inference problems and are compatible with the Poisson firing statistics of cortical neurons. We evaluate circuits for depth and motion perception, perceptual learning and causal reasoning, each performing inference over 10,000+ latent variables in real time, a 1,000x speed advantage over commodity microprocessors. These results suggest a new role for randomness in the engineering and reverse-engineering of intelligent computation.
Our ability to see, think and act depends on our mind's ability to process uncertain information and identify probable explanations for inherently ambiguous data. Many computational models of the perception of motion 1 , motor learning 2 , higher-level cognition 3, 4 and cognitive development 5 are based on Bayesian inference in rich, flexible probabilistic models of the world.
and abstracted via simple mathematical rules, yielding larger computational units whose behavior can be analyzed in terms of their constituents. We describe primitives and design rules for both stateless and synchronously clocked circuits. But unlike digital gates and circuits, our gates and circuits are intentionally stochastic: each output is a sample from a probability distribution conditioned on the inputs, and (except in degenerate cases) simulating a circuit twice will produce different results. The numerical probability distributions themselves are implicit, though they can be estimated via the circuits' long-run time-averaged behavior. Also unlike digital gates and circuits, Bayesian reasoning arises naturally via the dynamics of our synchronously clocked circuits, simply by fixing the values of the circuit elements representing the data.
We have built prototype circuits that solve problems of depth and motion perception and perceptual learning, plus a compiler that can automatically generate circuits for solving causal reasoning problems given a description of the underlying causal model. Each of these systems illustrates the use of stochastic digital circuits to accelerate Bayesian inference in an important class of probabilistic models, including Markov Random Fields, nonparametric Bayesian mixture models, and Bayesian networks. Our prototypes show that this combination of simple choices at the hardware level (a discrete, digital representation for information, coupled with intentionally stochastic rather than ideally deterministic elements) has far-reaching architectural consequences. For example, software implementations of approximate Bayesian reasoning typically rely on high-precision arithmetic and serial computation. We show that our synchronous stochastic circuits can be implemented at very low bit precision, incurring only a negligible decrease in accuracy. This low precision enables us to make fast, small, power-efficient circuits at the core of our designs. We also show that these reductions in computing unit size are sufficient to let us exploit the massive parallelism that has always been inherent in complex probabilistic models at a granularity that has been previously impossible to exploit. The resulting high computation density drives the performance gains we see from stochastic digital circuits, narrowing the efficiency gap with neural computation by multiple orders of magnitude.
Our approach is fundamentally different from existing approaches for reliable computation with unreliable components [13] [14] [15] , which view randomness as either a source of error whose impact needs to be mitigated or as a mechanism for approximating arithmetic calculations. Our combinational circuits are intentionally stochastic, and we depend on them to produce exact samples from the probability distributions they represent. Our approach is also different from and complementary to classic analog 16 and modern mixed-signal 17 neuromorphic computing approaches: stochastic digital primitives and architectures could potentially be implemented using neuromorphic techniques, providing a means of applying these designs to problems of Bayesian inference.
In theory, stochastic digital circuits could be used to solve any computable Bayesian inference problem with a computable likelihood 18 by implementing a Markov chain for inference in a Turing-complete probabilistic programming language 19, 20 . Stochastic circuits can thus implement inference and learning techniques for diverse intelligent computing architectures, including both probabilistic models defined over structured, symbolic representations 5 and sparse, distributed, connectionist representations 21 . In contrast, hardware accelerators for belief propagation algorithms [22] [23] [24] can only answer queries about marginal probabilities or most probable configurations, only apply to finite graphical models with discrete or binary nodes, and cannot be used to learn model parameters from data. For example, the formulation of perceptual learning we present here is based on inference in a nonparametric Bayesian model to which belief propagation does not apply. Additionally, because stochastic digital circuits produce samples rather than probabilities, their results capture the complex dependencies between variables in multi-modal probability distributions, and can also be used to solve otherwise intractable problems in decision theory by estimating expected utilities.
Stochastic Digital Gates and Stateless Stochastic Circuits
(Figure 1 about here.)
Digital logic circuits are based on a gate abstraction defined by Boolean functions: deterministic mappings from input bit values to output bit values 25 . For elementary gates, such as the AND gate, these are given by truth tables; see Figure 1A . Their power and flexibility come in part from the composition laws that they support, shown in Figure 1B . The output from one gate can be connected to the input of another, yielding a circuit that computes the composition of the Boolean functions represented by each gate. The compound circuit can also be treated as a new primitive, abstracting away its internal structure. These simple laws have proved surprisingly powerful: they enable complex circuits to be built up out of reusable pieces.
Stochastic digital gates (see Figure 1C ) are similar to Boolean gates, but consume a source of random bits to generate samples from conditional probability distributions. Stochastic gates are specified by conditional probability tables; these give the probability that a given output will result from a given input. Digital logic corresponds to the degenerate case where all the probabilities are 0 or 1; see Figure 1D for the conditional probability table for an AND gate. Many stochastic gates with m input bits and n output bits are possible. Figure 1E shows one central example, the THETA gate, which generates draws from a biased coin whose bias is specified on the input. Supplementary material outlining serial and parallel implementations is available at 26 . Crucially, stochastic gates support generalizations of the composition laws from digital logic, shown in Figure 1F . The output of one stochastic gate can be fed as the input to another, yielding samples from the joint probability distribution over the random variables simulated by each gate. The compound circuit can also be treated as a new primitive that generates samples from the marginal distribution of the final output given the first input. As with digital gates, an enormous variety of circuits can be constructed using these simple rules.
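As a rough software illustration of the gate abstraction described above, the following sketch simulates a THETA-style gate and the two composition laws. It is only a sketch under stated assumptions: the function names, the 4-bit weight width and the particular biases are hypothetical choices made for illustration, and a software pseudorandom generator stands in for the gate's source of random bits.

```python
import random


def theta_gate(weight, n_bits, rng):
    """THETA-style gate: output 1 with probability weight / 2**n_bits.

    `weight` is an unsigned n_bits-wide integer presented on the input;
    the gate consumes n_bits fresh random bits per sample.
    """
    return 1 if rng.getrandbits(n_bits) < weight else 0


def and_gate(a, b):
    """Degenerate stochastic gate: every table entry is 0 or 1."""
    return a & b


def chained_gates(rng, n_bits=4):
    """Feed one stochastic gate into another; returns a joint sample (a, b)."""
    a = theta_gate(8, n_bits, rng)               # P(a=1) = 8/16
    b = theta_gate(12 if a else 2, n_bits, rng)  # P(b=1|a=1) = 12/16, P(b=1|a=0) = 2/16
    return a, b


def estimate_marginal_b(n_samples, seed=0):
    """Treat the chain as one primitive and estimate P(b=1) by averaging samples."""
    rng = random.Random(seed)
    return sum(chained_gates(rng)[1] for _ in range(n_samples)) / n_samples
```

With these illustrative biases the exact marginal is P(b=1) = 0.5 * 0.75 + 0.5 * 0.125 = 0.4375, so `estimate_marginal_b` should approach that value as the number of samples grows. The probability table itself never appears in the output; it is only recoverable from the long-run average.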
Fast Bayesian Inference via Massively Parallel Stochastic Transition Circuits
Most digital systems are based on deterministic finite state machines; the template for these machines is shown in Figure 2A . A stateless digital circuit encodes the transition function that calculates the next state from the previous state, and the clocking machinery (not shown) iterates the transition function repeatedly. This abstraction has proved enormously fruitful; the first microprocessors had roughly 2^20 distinct states. In Figure 2B , we show the stochastic analogue of this synchronous state machine: a stochastic transition circuit.
Instead of the combinational logic circuit implementing a deterministic transition function, it contains a combinational stochastic circuit implementing a stochastic transition operator that samples the next state from a probability distribution that depends on the current state. It thus corresponds to a Markov chain in hardware. To be a valid transition circuit, this transition operator must have a unique stationary distribution P(S|X) to which it ergodically converges. A number of recipes for suitable transition operators can be constructed, such as Metropolis sampling 27 and Gibbs sampling 28 ; most of the results we present rely on variations on Gibbs sampling. More details on efficient implementations of stochastic transition circuits for Gibbs sampling and Metropolis-Hastings can be found elsewhere 26 . Note that if the input X represents observed data and the state S represents a hypothesis, then the transition circuit implements Bayesian inference.
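To make the notion of a transition circuit with stationary distribution P(S|X) concrete, here is a minimal software sketch of a Metropolis-style transition operator iterated by a clock loop. The prior and likelihood tables are hypothetical numbers chosen for illustration, and the uniform proposal is one simple choice among many; it is not claimed to match the circuits described in reference 26.

```python
import random

PRIOR = [0.5, 0.3, 0.2]                            # hypothetical P(S) over three hypotheses
LIKELIHOOD = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # hypothetical P(X | S), one row per S


def transition(state, x, rng):
    """One clock tick: propose a uniformly random state, accept with prob min(1, w'/w)."""
    proposal = rng.randrange(len(PRIOR))
    w_current = PRIOR[state] * LIKELIHOOD[state][x]
    w_proposal = PRIOR[proposal] * LIKELIHOOD[proposal][x]
    if rng.random() * w_current < w_proposal:
        return proposal
    return state


def time_average(x, n_ticks, seed=0):
    """Iterate the transition operator and report the fraction of ticks spent in each state."""
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(PRIOR)
    for _ in range(n_ticks):
        state = transition(state, x, rng)
        counts[state] += 1
    return [c / n_ticks for c in counts]


def exact_posterior(x):
    """Reference answer P(S | X=x) by direct normalization, for comparison only."""
    joint = [p * lik[x] for p, lik in zip(PRIOR, LIKELIHOOD)]
    z = sum(joint)
    return [j / z for j in joint]
```

Comparing `time_average(1, 50000)` with `exact_posterior(1)` shows the long-run state occupancy approaching the posterior, even though the transition operator only ever compares two unnormalized weights.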
We can scale up to challenging problems by exploiting the composition laws that stochastic transition circuits support. Consider a probability distribution defined over three variables P(A, B, C) = P(A)P(B|A)P(C|A). We can construct a transition circuit that samples from the overall state (A, B, C) by composing transition circuits for updating A|BC, B|A and C|A; this assembly is shown in Figure 2C . As long as the underlying probability model does not have any zero-probability states, ergodic convergence of each constituent transition circuit then implies ergodic convergence of the whole assembly 29 . The only requirement for scheduling transitions is that each circuit must be left fixed while circuits for variables that interact with it are transitioning. This scheduling requirement (that a transition circuit's value be held fixed while others that read from its internal state or serve as inputs to its next transition are updating) is analogous to the so-called "dynamic discipline" that defines valid clock schedules for traditional sequential logic 30 . Deterministic and stochastic schedules, implementing cycle or mixture hybrid kernels 29 , are both possible. This simple rule also implies a tremendous amount of exploitable parallelism in stochastic transition circuits: if two variables are independently caused given the current setting of all others, they can be updated at the same time.
Assemblies of stochastic transition circuits implement Bayesian reasoning in a straightforward way: by fixing, or "clamping", some of the variables in the assembly. If no variables are fixed, the circuit explores the full joint distribution, as shown in Figures 2E and 2F. If a variable is fixed, the circuit explores the conditional distribution on the remaining variables, as shown in Figures 2G and 2H. Simply by changing which transition circuits are updated, the circuit can be used to answer different probabilistic queries; these can be varied online based on the needs of the application.
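The three-variable assembly and the clamping rule can be sketched in software as a Gibbs sampler over P(A, B, C) = P(A)P(B|A)P(C|A). The conditional probability values below are hypothetical, and the sequential loop is only a stand-in for the clocked hardware schedule; B and C are conditionally independent given A, so in a circuit their updates could fire in the same phase.

```python
import random

P_A = 0.3                        # hypothetical P(A=1)
P_B_GIVEN_A = {0: 0.2, 1: 0.9}   # hypothetical P(B=1 | A)
P_C_GIVEN_A = {0: 0.1, 1: 0.7}   # hypothetical P(C=1 | A)


def bernoulli(p, rng):
    return 1 if rng.random() < p else 0


def lik(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def update_a(state, rng):
    # P(A | B, C) is proportional to P(A) P(B|A) P(C|A)
    w1 = P_A * lik(P_B_GIVEN_A[1], state["B"]) * lik(P_C_GIVEN_A[1], state["C"])
    w0 = (1 - P_A) * lik(P_B_GIVEN_A[0], state["B"]) * lik(P_C_GIVEN_A[0], state["C"])
    state["A"] = bernoulli(w1 / (w0 + w1), rng)


def update_b(state, rng):
    state["B"] = bernoulli(P_B_GIVEN_A[state["A"]], rng)


def update_c(state, rng):
    state["C"] = bernoulli(P_C_GIVEN_A[state["A"]], rng)


UPDATES = {"A": update_a, "B": update_b, "C": update_c}


def estimate_p_a(clamped, n_sweeps, seed=0):
    """Estimate P(A=1 | clamped) by skipping the updates of clamped variables."""
    rng = random.Random(seed)
    state = {"A": 0, "B": 0, "C": 0}
    state.update(clamped)
    hits = 0
    for _ in range(n_sweeps):
        for name, update in UPDATES.items():
            if name not in clamped:   # clamped variables are simply never updated
                update(state, rng)
        hits += state["A"]
    return hits / n_sweeps
```

With nothing clamped, `estimate_p_a({}, 40000)` should approach the prior value 0.3. Clamping `{"B": 1, "C": 1}` changes the query without changing any update rule, and the estimate should approach 0.189 / 0.203, roughly 0.931.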
(Figure 2 about here.)
The accuracy of ultra-low-precision stochastic transition circuits.
The central operation in many Markov chain techniques for inference is called DISCRETE-SAMPLE, which generates draws from a discrete-output probability distribution whose weights are specified on its input. For example, in Gibbs sampling, this distribution is the conditional probability of one variable given the current value of all other variables that directly depend on it. One implementation of this operation is shown in Figure 3A ; each stochastic transition circuit from Figure Discrete distributions on 1000 outcomes were used, spanning the full range of possible entropies, from almost 10 bits (for a uniform distribution on 1000 outcomes) to 0 bits (for a deterministic distribution), with error nearly undetectable until fewer than 8 bits are used. Figure 3C shows example distributions on 10 outcomes, and Figure 3D shows the resulting impact on computing element size. Extensive quantitative assessments of the impact of low bit precision have also been performed, providing additional evidence that only very low precision is required 26 .
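One way to get a feel for the precision question is to quantize the weights of a discrete distribution to a few bits, sample with integer comparisons only, and measure how far the quantized distribution drifts from the exact one. The sketch below does this under loudly stated assumptions: the linear quantization relative to the largest weight, the cumulative-sum sampler and the total variation metric are illustrative choices, not necessarily the number format, circuit or error measure behind Figure 3.

```python
import random


def quantize(probs, n_bits):
    """Round each probability to an unsigned n_bits integer weight (largest maps to 2**n_bits - 1)."""
    scale = (2 ** n_bits - 1) / max(probs)
    return [int(round(p * scale)) for p in probs]


def discrete_sample(int_weights, rng):
    """Draw an index with probability proportional to its integer weight."""
    total = sum(int_weights)
    r = rng.randrange(total)      # uniform integer in [0, total)
    acc = 0
    for index, weight in enumerate(int_weights):
        acc += weight
        if r < acc:
            return index


def total_variation(probs, int_weights):
    """Total variation distance between the exact and the quantized distribution."""
    total = sum(int_weights)
    return 0.5 * sum(abs(p - w / total) for p, w in zip(probs, int_weights))
```

For the example distribution `[0.5, 0.25, 0.125, 0.125]`, comparing `total_variation(probs, quantize(probs, 2))` with the 8-bit equivalent shows the error shrinking as bits are added. Outcomes whose weight rounds to zero can never be sampled, which is one reason very low precision matters most for low-probability outcomes.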
(Figure 3 about here.)
Efficiency gains on depth and motion perception and perceptual learning problems
Our main results are based on an implementation where each stochastic gate is simulated using digital logic, consuming entropy from an internal pseudorandom number generator 31 . This allows us to measure the performance and fault-tolerance improvements that flow from stochastic architectures, independent of physical implementation. We find that stochastic circuits make it practical to perform stochastic inference over several probabilistic models with 10,000+ latent variables in real time and at low power on a single chip. These designs achieve a 1,000x speed advantage over commodity microprocessors, despite using gates that are 10x slower. In 26 , we also show architectures that exhibit minimal degradation of accuracy in the presence of fault rates as high as one bit error for every 100 state transitions, in contrast to conventional architectures where failure rates are measured in bit errors (failures) per billion hours of operation 32 .
Our first application is to depth and motion perception, via Bayesian inference in lattice Markov Random Field models 28 . The core problem is matching pixels from two images of the same scene, taken at distinct but nearby points in space or in time. The matching is ambiguous on the basis of the images alone, as multiple pixels might share the same value 33 ; prior knowledge about the structure of the scene must be applied, which is often cast in terms of Bayesian inference 34 . Figure 4A illustrates the template probabilistic model most commonly used. The X variables contain the unknown displacement vectors. Each Y variable contains a vector of pixel similarity measurements, one per possible pair of matched pixels based on X. The pairwise potentials between the X variables encode scene structure assumptions; in typical problems, unknown values are assumed to vary smoothly across the scene, with a small number of discontinuities at the boundaries of objects. Figure 4B shows the conditional independence structure in this problem: X variables at every other lattice site are conditionally independent of one another, allowing the entire Markov chain over the X variables to be updated in a two-phase clock, independent of lattice size. Figure 4C shows the dataflow for the software-reprogrammable probabilistic video processor we developed to solve this family of problems; this processor takes a problem specification based on pairwise potentials and Y values, and produces a stream of posterior samples. When comparing the hardware to hand-optimized C versions on a commodity workstation, we see a 500x performance improvement.
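The two-phase update schedule described above can be sketched as a checkerboard Gibbs sweep over a small lattice. This is a minimal sketch under assumptions: the Potts-style smoothness penalty, the per-site cost table `unary` (standing in for the Y similarity measurements) and all function names are hypothetical, and the Python loop only simulates updates that a circuit would perform simultaneously.

```python
import math
import random


def sample_site(x, unary, i, j, n_labels, smooth, rng):
    """Gibbs-sample one lattice site given its (up to four) neighbours."""
    h, w = len(x), len(x[0])
    nbrs = [x[a][b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
            if 0 <= a < h and 0 <= b < w]
    # Unnormalized weight: data cost plus a penalty per disagreeing neighbour.
    weights = [math.exp(-unary[i][j][d] - smooth * sum(d != n for n in nbrs))
               for d in range(n_labels)]
    return rng.choices(range(n_labels), weights=weights)[0]


def two_phase_sweep(x, unary, n_labels, smooth, rng):
    """Update all 'even' sites, then all 'odd' sites, in checkerboard order."""
    h, w = len(x), len(x[0])
    for phase in (0, 1):
        # Sites of one colour only have neighbours of the other colour, so all
        # new values are computed from the old lattice before any is written.
        updates = {(i, j): sample_site(x, unary, i, j, n_labels, smooth, rng)
                   for i in range(h) for j in range(w) if (i + j) % 2 == phase}
        for (i, j), value in updates.items():
            x[i][j] = value
    return x
```

Because every site of one colour is computed from the previous lattice state before any value is written back, the number of clock phases per sweep stays at two regardless of lattice size; only the amount of parallel hardware grows.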
(Figure 4 about here.)
We have also built stochastic architectures for solving perceptual learning problems, based on fully Bayesian inference in Dirichlet process mixture models 35, 36 . Dirichlet process mixtures allow the number of clusters in a perceptual dataset to be automatically discovered during inference, without assuming an a priori limit on the models' complexity, and form the basis of many models of human categorization 37, 38 . We tested our prototype on the problem of discovering and classifying handwritten digits from binary input images. Our circuit for solving this problem operates on an online data stream, and efficiently tracks the number of perceptual clusters in this input; see 26 for architectural and implementation details and additional characterizations of performance. As with our depth and motion perception architecture, we observe large speedups, here ∼2,000x, as compared to a highly optimized software implementation. Of the ∼2,000x difference in speed, roughly 256x is directly due to parallelism: all of the pixels are independent dimensions, and can therefore be updated simultaneously.
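The role of pixel independence can be illustrated with a small software sketch of cluster assignment in a Dirichlet process mixture over binary images. Everything specific here is an assumption made for illustration: the Beta-Bernoulli pixel model, the single-pass sequential assignment (rather than a full Gibbs sampler), the concentration parameter `alpha` and the data layout are hypothetical, and the actual circuit is described in reference 26.

```python
import math
import random


def log_predictive(image, counts_on, n_members, beta=1.0):
    """Log probability of a binary image under one cluster's Beta-Bernoulli pixels.

    Each pixel contributes an independent term, so in hardware all terms
    could be evaluated at the same time.
    """
    total = 0.0
    for pixel, on in zip(image, counts_on):
        p_on = (on + beta) / (n_members + 2 * beta)
        total += math.log(p_on if pixel else 1.0 - p_on)
    return total


def assign(image, clusters, alpha, rng):
    """Sample a cluster for `image` (possibly a new one) and update the counts."""
    log_w = [math.log(c["n"]) + log_predictive(image, c["on"], c["n"])
             for c in clusters]
    # Chinese-restaurant-process term for opening a new, empty cluster.
    log_w.append(math.log(alpha) + log_predictive(image, [0] * len(image), 0))
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]   # subtract max for stability
    k = rng.choices(range(len(weights)), weights=weights)[0]
    if k == len(clusters):
        clusters.append({"n": 0, "on": [0] * len(image)})
    clusters[k]["n"] += 1
    clusters[k]["on"] = [on + px for on, px in zip(clusters[k]["on"], image)]
    return k
```

Feeding a stream that alternates between two clearly distinct binary patterns should lead this sketch to open two clusters and keep reusing them, without the number of clusters being fixed in advance.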
(Figure 5 about here.)
Automatically generated causal reasoning circuits and spiking implementations
Digital logic gates and their associated design rules are so simple that circuits for many problems can be generated automatically. Digital logic also provides a common target for device engineers, and has been implemented using many different physical mechanisms: classically with vacuum tubes, then with MOSFETs in silicon, and even on spintronic devices 39 . Here we provide two illustrations of the analogous simplicity and generality of stochastic digital circuits, both relevant for the reverse-engineering of intelligent computation in the brain.
We have built a compiler that can automatically generate circuits for solving arbitrary causal reasoning problems in Bayesian network models. Bayesian network formulations of causal reasoning have played central roles in machine intelligence 22 and computational models of cognition in both humans and rodents 4 . Figure A shows a Bayesian network for diagnosing the behavior of an intensive care unit monitoring system. Bayesian inference within this network can be used to infer probable states of the ICU given ambiguous patterns of evidence; that is, to reason from observed effects back to their probable causes. Figure B shows a factor graph representation of this model 40 ; this more general data structure is used as the input to our compiler. This spiking implementation helps to narrow the gap with recent theories in computational neuroscience. For example, there have been recent proposals that neural spikes correspond to samples 41 , and that some spontaneous spiking activity corresponds to sampling from the brain's unclamped prior distribution 42 . Combining these local elements using our composition and abstraction laws into massively parallel, low-precision, intentionally stochastic circuits may help to bridge the gap between probabilistic theories of neural computation and the computational demands of complex probabilistic models and approximate inference 43 .
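The idea of generating one transition circuit per variable from a factor graph can be sketched in software as a function that turns a factor list into per-variable Gibbs update routines. This is a toy analogue under assumptions, not the compiler described above: the factor representation, the function names and the three-variable rain/sprinkler/wet network with its probabilities are all hypothetical.

```python
import random


def compile_gibbs(domains, factors):
    """Build one Gibbs update function per variable from a factor graph.

    domains: {variable name: number of values}
    factors: list of (tuple of variable names, function(state) -> positive weight)
    """
    touching = {v: [fn for names, fn in factors if v in names] for v in domains}

    def make_update(v):
        def update(state, rng):
            weights = []
            for value in range(domains[v]):
                trial = dict(state, **{v: value})
                w = 1.0
                for fn in touching[v]:     # only factors adjacent to v matter
                    w *= fn(trial)
                weights.append(w)
            state[v] = rng.choices(range(domains[v]), weights=weights)[0]
        return update

    return {v: make_update(v) for v in domains}


# Hypothetical three-node network: Rain -> Wet <- Sprinkler.
P_WET = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}
FACTORS = [
    (("R",), lambda s: 0.2 if s["R"] else 0.8),
    (("S",), lambda s: 0.1 if s["S"] else 0.9),
    (("R", "S", "W"),
     lambda s: P_WET[(s["R"], s["S"])] if s["W"] else 1.0 - P_WET[(s["R"], s["S"])]),
]


def estimate_rain_given_wet(n_sweeps, seed=0):
    """Clamp W=1 and time-average R to estimate P(R=1 | W=1)."""
    rng = random.Random(seed)
    updates = compile_gibbs({"R": 2, "S": 2, "W": 2}, FACTORS)
    state = {"R": 0, "S": 0, "W": 1}
    hits = 0
    for _ in range(n_sweeps):
        updates["R"](state, rng)
        updates["S"](state, rng)
        hits += state["R"]
    return hits / n_sweeps
```

For this toy network the exact answer is P(R=1 | W=1) = 0.1818 / 0.253, roughly 0.719, and `estimate_rain_given_wet` should approach it. The generated update for each variable reads only the factors that mention it, mirroring the locality that makes automatic generation feasible.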
(Figure 6 about here.)
Discussion
To further narrow the efficiency gap with the brain, and scale to more challenging Bayesian inference problems, we need to improve the convergence rate of our architectures. One approach would be to initialize the state in a transition circuit via a separate, feed-forward, combinational circuit that approximates the equilibrium distribution of the Markov chain. Machine perception software that uses machine learning to construct fast, compact initializers is already in use 9 . Analyzing the number of transitions needed to close the gap between a good initialization and the target distribution may be harder 44 . However, some feedforward Monte Carlo inference strategies for Bayesian networks provably yield precise estimates of probabilities in polynomial time if the underlying probability model is sufficiently stochastic 45 ; it remains to be seen if similar conditions apply to stateful stochastic transition circuits.
It may also be fruitful to search for novel electronic devices (or previously unusable dynamical regimes of existing devices) that are as well matched to the needs of intentionally stochastic circuits as transistors are to logical inverters, potentially even via a spiking implementation. Physical phenomena that proved too unreliable for implementing Boolean logic gates may be viable building blocks for machines that perform Bayesian inference.
Computer engineering has thus far focused on deterministic mechanisms of remarkable scale and complexity: billions of parts that are expected to make trillions of state transitions with perfect repeatability 46 . But we are now engineering computing systems to exhibit more intelligence than they once did, and to identify probable explanations for noisy, ambiguous data, drawn from large spaces of possibilities, rather than calculate the definite consequences of perfectly known assumptions with high precision. The apparent intractability of probabilistic inference has complicated these efforts, and challenged the viability of Bayesian reasoning as a foundation for engineering intelligent computation and for reverse-engineering the mind and brain.
At the same time, maintaining the illusion of rock-solid determinism has become increasingly costly. Engineers now attempt to build digital logic circuits in the deep sub-micron regime 47 and even inside cells 48 ; in both these settings, the underlying physics has stochasticity that is difficult to suppress. Energy budgets have grown increasingly restricted, from the scale of the datacenter 49 to the mobile device 50 , yet we spend substantial energy to operate transistors in deterministic regimes. And efforts to understand the dynamics of biological computation, from biological neural networks to gene expression networks 51 , have all encountered stochastic behavior that is hard to explain in deterministic, digital terms. Our intentionally stochastic digital circuit elements and stochastic computing architectures suggest a new direction for reconciling these trends, and enable the design of a new class of fast, Bayesian digital computing machines.
