Abstract-Stochastic computing (SC) is a digital computation approach that operates on random bit streams to perform complex tasks with much smaller hardware footprints compared with conventional binary radix approaches. SC works based on the assumption that input bit streams are independent random sequences of 1s and 0s. Previous SC efforts have avoided implementing functions that have feedback, because doing so has the potential for creating highly correlated inputs. We propose a number of solutions to overcome the challenges of implementing feedback in stochastic logic. We use a family of dynamical system functions that are similar to the well-known logistic map x → µx(1 − x) as case studies. We show that complex behaviors, such as period doubling and chaos, do indeed occur in digital logic with only a few gates operating on a few 0s and 1s. Our energy consumption is between 21% and 31% of the conventional binary approach. In order to verify our design methodology, we have measured the mean switching rate between the basins of attraction of two coexisting fixed points and the peak width of the steady-state distribution of the output using a logistic-map-like function as an example. Theoretical results match well with our numerical experiments.
I. INTRODUCTION

S
TOCHASTIC computing (SC) is an alternative digital approach to real arithmetic computations. A real value x ∈ [0, 1] is encoded as a Bernoulli sequence of length L, where the probability of a bit in the stream to be at logic 1 equals x. The expected value of the stream is N/L, N being the number of 1s in the stream of 0s and 1s. The idea of performing arithmetic using stochastic bit streams has been discussed in [1] and [2] and was motivated by the simplicity of stochastic computational elements. In the past few years, there has been renewed interest in SC [3] , showing significant benefits, including remarkable hardware cost reductions and high degrees of fault tolerance [4] , [7] , [13] , [15] . It has been used in applications, such as low-density parity-check (LDPC) encoding [8] , [9] , filter design [10] , [11] , and image processing [7] , [12] . Dynamical systems, such as the logistic map x → μx (1 − x) , are very simple in form but can exhibit quite complex behaviors ( Fig. 1 ), whose analysis is still a challenging problem. Reliable characterization of the statistical properties of such systems is needed and is usually done using time-domain simulations. Injecting is often necessary in these simulations to avoid long transients in temporal averages. The goal of this paper is to analyze the complexity in dynamical systems using inherently stochastic computational tools that employ stochastic logic. Fig. 2 shows how multiplication is performed in SC using only a single AND gate. Assume that X and Y are independent random Bernoulli bit streams representing probabilities x and y. Since each bit X i ∈ {0, 1} in stream X is independent of its corresponding bit in stream Y, the probability of the output of the AND gate being one at that clock cycle is P(X i = 1) × P(Y i = 1), which is x × y (multiplication operation).
SC provides several desired features compared with conventional binary: 1) it has a significantly smaller footprint area that can help with massive parallelism (e.g., compare the AND gate in Fig. 2 with an 8-bit conventional multiplier in terms of the number of gates) [6] , [7] ; 2) it can handle noise much better than conventional logic [4] ; and 3) it provides designers with high flexibility in choosing the resolution at which computations are performed, with almost no change in the hardware. As we increase the bitstream length, we increase the resolution and decrease random variance in the output stream. Hence, we can use the same simple hardware to perform operations with different resolutions, trading OFF delay/power with accuracy. SC has three major disadvantages compared with conventional binary arithmetic, though the following holds.
1) Feedback: The requirement that input bitstreams be independent random sources with little correlation creates challenges when designing circuits with feedback. Previous works on SC have largely avoided direct feedback in circuits. Li et al. [12] and Demaeyer and Gaspard [13] use state machines whose state transitions are determined by the bits from input random bitstreams. However, the output of the state machine is not fed back to the system. As a result, even though their designs are sequential, they are not classified as systems with true stochastic bitstream feedback.
We address the issue of direct feedback in Section III-A. 2) High Number of Random Sources: SC requires generation of multiple independent random streams, generation of which could be costly and offset some of the area savings of the stochastic logic. We address this issue by meticulously sharing random sources (see Section III-B). 3) High Latency: For high-resolution computations, SC requires long bitstreams, which translates to high latencies. We explore using a lookup table (LUT) to jump-start computations and significantly reduce latency (see Section III-C). We will use iterative equations (maps) in the dynamical systems domain as our case studies to investigate the implementation of stochastic circuits with direct feedback. Section II covers background material on dynamical systems and logistic-map-like functions specifically. Section III covers our implementation of the dynamical systems. We show the results of our analysis of the implementations in Section IV.
II. DYNAMICAL SYSTEMS: LOGISTIC MAP
Since the main goal of this paper is to use stochastic logic to implement equations with feedback in the dynamical system domain, we first review the logistic map and then discuss different ways to implement computations.
A. Logistic Map
A well-studied example of dynamical systems is the logistic map x n+1 = μx n (1 − x n ), where x i ∈ [0, 1] is a real number representing the value at iteration i , and 0 ≤ μ ≤ 4 is a real constant that determines the behavior of the system (x i is a real number in this context, not to be confused with a bit in a bitstream). It is a simple map that shows very complex behavior, as shown in Fig. 1 . The x-axis shows the value of μ, and the y-axis shows the "attractors" of the system, i.e., consecutive values of x i after a large number of iterations. The graph can be drawn using this procedure: for a fixed μ, set x 0 to a random initial value between 0 and 1. Then, iterate the map for a large number of iterations, e.g., i = 200 and then plot subsequent x i values. If the system converges to a single value (as is the case for 1 ≤ μ ≤ 3), we will see one dot above μ. For μ values between 3 and 3.449, the system converges to a period-2 cycle and oscillates between two values that depend on μ, e.g., when μ = 3.2, x 201 = 0.799, x 202 = 0.513, x 203 = 0.799, x 204 = 0.513, and so on. It can be seen in the graph that there are two dots above μ values in this range. As we increase μ, we enter regions in which the period doubles to 4, 8, and 16, and finally, around μ = 3.56995, we get to a chaotic region, where no value ever repeats in the sequence. Other interesting behaviors that are beyond the scope of this paper can be observed after the initial chaotic region.
The logistic map is a prototypical example in dynamical systems and shows how a simple iterated mapping can exhibit very complex behavior. In fact, experts still use it to study the statistical properties of dynamical systems, such as the densities of states and behavior under random perturbations [13] . The reason for its popularity is that the logistic map is a representative for a larger class of maps that exhibit period doubling and chaotic behavior. For example, the function shown in Fig. 3(e) , which unlike the logistic map is not symmetric, has a period-4 cycle similar to that of the logistic map with μ between 3.44949 and 3.54409 (approximately). Traditional methods of analyzing such dynamical systems require large number of experiments with random seeds to find the statistical distributions of points of convergence. As a result, parallel simulation of these systems in hardware would be beneficial. The examples of logistic-like functions are shown in Fig. 3(b) and (e).
B. Core Logistic-Map-Like Stochastic Logic
In this section, we ignore input correlations and feedback and focus only on the "core" stochastic logic that implements functions, assuming that the input bitstreams are ideal, uncorrelated inputs. Section III will look at the big picture, including interfacing to the core logic. In this section, we will show the examples of functions similar to the logistic map and discuss how they can be implemented using digital stochastic circuits. As mentioned in Section I, digital 1-bit values in a stochastic bitstream can be interpreted as real numbers when observed over a long period of time. As a result, we can go back and forth between the real values of x and its digital, single-bit slices in time when analyzing circuits. Fig. 3 shows two example circuits with their logic functions f and g and their real-valued equivalent functions. The second-and fourth-iterate maps, whose fixed points are 2-and 4-period cycles, are also plotted for f and g, respectively. The real-valued equivalent functions are derived by replacing an AND(x , y) gate with x × y, an OR(x , y) gate with x + y − x y, and so on. The notation i x in Fig. 3(a) and (d) denotes the i th independent bit stream, assuming that all bitstreams are representing the same probability value in the same clock cycle. It is interesting that a 0/1 digital implementation of a logistic-like function with only 6-8 digital 0/1 inputs can mimic complex behaviors, such as period-2 cycles and chaotic behaviors. Fig. 3(b) shows the intersection of f (x) with y = x. The intersection occurs at x = 0.7511, which means if an iteration of the map gets to x i = 0.7511, then x i+1 = x i+2 = · · · = 0.7511. In other words, the system converges to a single unstable fixed point. Under any small perturbation, the system will end up oscillating between x 1 and x 2 shown in Fig. 3(c) , which are the period-2 stable attractors of the system. The significance of the f 2 (x) plot becomes obvious when we follow a similar experiment. Consider the point x 1 . If the map generates λ 2 ) ; in other words, if the system deviates from one of the attractors due to noise, the system will bring the values back to the attractor. Note that in the analysis above, we ignored the question of bitstream independence and assumed that all values j x are generated using independent random sources that would enable us to treat them as real numbers. We will address the bitstream independence in Section III.
C. Synthesis Method
We used a greedy simulated annealing optimization method to synthesize stochastic logic that implements a dynamical sys- . For a given circuit depth, the annealing engine uses a balanced binary tree, in which the root of the tree represents the output of the circuit as a real-valued number 0 ≤ f (x) ≤ 1, and the leaves represent multiple copies of the input 0 ≤ x ≤ 1, which is also realvalued. Fig. 4 shows an example circuit of depth three. The internal nodes of the tree could be wires, logic gates, such as INV, AND, XNOR, and iAND (an AND gate with one inverted input), and so on. The annealing process randomly changes internal node types to get new circuits. The shaded regions in Fig. 4 are parts of the circuit that are ignored, because either a wire or an inverter gate is used at an internal node, both of which only operate on the top input connected to the node. The ignored parts of the circuit are not eliminated immediately from the data structure, because the inverter or the wire might be replaced in a later move by a two-input gate, hence reactivating the ignored parts.
To calculate the value of the output f (x) based on a given input value x, we can calculate the output of each internal node using an equation representing its behavior. For example, an AND gate multiplies the two real-valued inputs, an OR gate calculates x + y − x y, and so on. We use the annealing engine in two modes.
1) Curve-Fitting Mode: In this mode, a target function f * (x) is given. The cost function for the annealing engine is the mean square error of f * (x) compared with f (x), numerically calculated from 100 uniformly spaced sample points in the range 0 . . . 1.
2) Function Exploration Mode:
In this mode, no specific target function is given. Instead, the cost function uses a number of heuristics to find interesting functions that show dynamical system behavior, i.e., have a period-2 cycle or more. Examples of such properties include the following.
a) The function should be n-shaped and f (0) = f (1) = 0 (otherwise, it does not belong to the class of logistic maps). If a function is not n-shaped, a large penalty is added to the annealing cost function. b) The number of times f (x) crosses y = x has to be two (if it crosses only once at x = 0, the system has one trivial root at 0 and does not have periodic cycles). Both functions shown in Fig. 3 (b) and (e) cross the y = x function at an x location where x > 0. c) The two fixed points of f 2 have to be stable. This condition translates to λ = | f 2 (x)| < 1 at a fixed point. Fig. 3 (c) shows the two λ values. The λ value is calculated numerically by first finding the three nontrivial roots of the system x = f 2 (x), and then calculating the derivative at the roots. 1 d) We also consider the number of used inputs as a cost component. The reason for trying to minimize the number of inputs is that generating multiple independent copies of x is costly. The circuit shown in Fig. 4 uses five (out of a potential of eight) inputs. The circuits shown in Fig. 3 (a) and (d) were found using the function exploration mode. The outer loop of the annealing optimization changes the circuit depth to change the complexity of the resulting function. 2 We should note that our simulated-annealing-based method for curve fitting or exploring functions by no means guarantees finding optimal circuits. However, we have found the methodology powerful enough 1 The calculation showing why this condition results in stable fixed points is given in [13, Ch. 10, p. 350] . 2 The source code for the annealing engine is publicly available for download at: http://umn.edu/˜kia/Download/CurveFittingStoch. The circuit starts with an initial value of zero in the W -bit counter (integrator) on the right, and the W -bit X n value in the randomizer units labeled "Binary to Stochastic." The randomizers output individual bits that are 1 with probability x n = X n /2 W . If the output of the stochastic logic is 1, the integrator increments its value. After 2 W microcycles, the integrator has the W -bit binary approximation of the X n+1 value, which can be fed back to the randomizers.
for our purposes. Our future work includes extending the annealing optimization process to explore constant input values, and including MUXes as internal node candidates. The advantage of an MUX is that since its inputs are routed to the output in a mutually exclusive fashion, the two inputs could be perfectly correlated, hence requiring fewer randomization sources.
III. HANDLING FEEDBACK IN CIRCUITS
A key assumption for a combinational stochastic logic circuit to work is that the inputs are independent stochastic streams with no correlations. As shown in Fig. 2 , a two-input AND gate can implement multiplication. Using this circuit, x 2 can be computed only if the two input streams are generated using independent random sources, each with a probability of x. If one feeds the exact same sequence of bits to both inputs (i.e., Fig. 2) , the output will be the same as the input sequence, which represents x and not x 2 . This will be especially challenging in dynamical systems where function outputs are fed back to their inputs, unless we devise a method to eliminate or drastically reduce the correlation between input streams.
A. Breaking Dependence in Feedback: The Basic Architecture
Our strategy for minimizing correlations between bitstreams is to use multiple independent pseudorandom generators to generate fresh input bitstreams that show little correlations. Fig. 5 shows the basic architecture that we propose (our final architecture is discussed in Section III-C and Fig. 8 ).
To compute x n+1 from x n , we first represent x n in binary, then use binary to stochastic converters to generate independent input bitstreams. SC is performed on these bitstreams, and the result is again converted into binary for a new iteration. There are two clocks governing this system: one is a very fast clock controlling the stochastic logic, which we call a "microclock or stochastic bit clock," and the other is a slower clock which we call the real-domain clock. One x i value is generated per real-domain clock cycle. In Fig. 5 , W is a design parameter and represents the resolution of the system (W -bit binary). Small letters such as x n denote the real value of x in [0, 1] at iteration n, realized as a bitstream of {0, 1} bits with length 2 W . Each of these 2 W bits is generated in one microcycle. The circuit uses the digital logic shown in the blue cloud to calculate the stochastic value of x n+1 = f (x n ). At the output of the stochastic logic is a W -bit binary counter (denoted by ) which resets to zero at the beginning of each real-domain clock (the 2 W microcycle). At the end of the 2 W cycle, it stores the W -bit binary equivalent of x n+1 as X n+1 = x n+1 ×2 W . At the beginning of the next iteration, the binary value of the counter is copied to the W -bit wide registers in the binary to stochastic randomizer units that are going to generate the bitstream x n+1 to calculate x n+2 .
Each randomizer can be implemented as a W -bit register, a W -bit uniform pseudo-random number generator (RNG) [e.g., linear feedback shift register (LFSR)], and a W -bit comparator as shown in Fig. 6 . At each microcycle, a random number is generated and its value is compared with the register value X. If its value is less than or equal to the value in the register, a "1" bit is output as x. Otherwise, a "0" is an output. It is easy to show that the resulting output bitstream will represent the probability X/2 W , where X is the binary value stored in the register. The resolution at which all values are computed can be controlled by the parameter W , which also dictates how many microcycles are needed to compute one iteration of the map. For example, to get the equivalent of 10-bit binary accuracy, we set W = 10 and wait for 2 W = 1024 microcycles to compute x n+1 from x n . We will address how to drastically reduce this latency in Section III-C. The analysis of the architecture is presented in Section IV-A.
The correlation between input bitstreams highly depends on the design of pseudo-RNGs and the initial seeds used in different RNG instances. In the most common and simple RNG, namely, the LFSR, in general, a longer distance between initial seeds results in less correlated bitstreams. Fig. 7 shows an RNG that performs better compared with the traditional LFSR. It will be discussed in more detail in Section III-B.
B. Sharing Random Generators
RNGs take a nontrivial percentage of the overall area of stochastic logic. Hence, if we can reduce the number of RNGs without significantly hurting the accuracy of the computations, i.e., without adding a significant correlation between the bitstreams, we can reduce the area-delay product of the design. Assuming that consecutive values generated by the RNG have little correlation, we can share random sources and use delayed versions of their outputs to feed to more than one input in the system. This is not a valid assumption when using simple RNGs such as LFSRs with their relatively high autocorrelation. Using higher quality RNGs would result in less correlation between different copies of X, and would make sharing the random source a viable option. We use an RNG referred to as LFSR4, in which a single-state change is equivalent to four state changes of an LFSR. Specifically, let T represent the state transition matrix of an LFSR. Then, the state transition matrix of LFSR4 is given by 
where + denotes modulo-2 addition. The resulting structure is shown in Fig. 7 , which uses four XOR gates compared with a single XOR gate in an LFSR. We have experimentally verified that sharing one 10-bit LFSR4 for three to four inputs does not result in a noticeable computation accuracy degradation for 10-bit resolution computations. By using a wider LFSR4, one can share more inputs (e.g., our experiments show that a 12-bit LFSR4 can be shared between up to six inputs).
C. Using a Lookup Table to Fast Forward Computations
The architecture of Fig. 5 accumulates white noise over the course of 2 w microclock cycles to add a Gaussian noise to the X n+1 value. If the system under simulation has a high signal-to-noise ratio, only a fraction of the 2 w bits are needed to model Gaussian noise through accumulation of the white noise, and the bulk of the latency is used in calculating f (X i ). Note that one of the major drawbacks of SC is its exponential latency of 2 W cycles.
As shown in Fig. 8 , our solution is to divide the W bits into S lower bits and L upper bits (they could overlap for better accuracy, e.g., S + L = W + 1), where S and L are the design parameters (e.g., L = 6 and S = 4). The computations on the L-bit segment are done deterministically through an LUT, and the S-bit segment is handled using stochastic logic. The L-bit segment is used to map the noise-free value of X n to f (X n ) = X n+1 with a resolution of L, hence cutting the number of microclock cycles needed to generate the next iterate value by 2 L . The integrator is going to accumulate the lower S bits of X n+1 from the output of the stochastic logic requiring only 2 S microcycles. For W = 10, L = 6, and S = 4, we only need 16 microclock cycles instead of 1024 with a cost of 64 × 6 entry LUT. Section IV-C presents the experimental results on area/delay/latency tradeoffs.
IV. ANALYSIS AND EXPERIMENTAL RESULTS
We present the analysis of the architecture of Fig. 5 in this section. The results of the architecture of Fig. 8 are similar and are not presented in this paper. Markov-chain analysis of the architecture using the function in Fig. 3(a) is presented in Section IV-A. Theoretical models and comparisons to validate stochastic computations are discussed in Section IV-B. Hardware comparisons to conventional binary implementation are covered in Section IV-C.
A. Markov-Chain Analysis
To analyze the steady-state distributions of the stochastic system, we used Monte-Carlo simulations to model the behavior of the system as a Markov chain. The steady-state distributions are important, because one can compare two implementations and see if they show similar behavior, such as having the same fixed points, and having similar distribution profiles. The transition probability matrix M of the system is built using the procedure shown in Algorithm 1. In short, the procedure runs the system using Monte-Carlo simulations, each time with a different random seed to see what f (x i ) values the system is likely to get after initializing it with x i . We store the probabilities of going from state j to state k in the j , k entry of the transition probability matrix M. If we multiply the matrix M to itself, M j,k would represent the probability of transitioning from state j to state k after two iterations. Similarly, M p shows the transition probabilities between different states after p iterations of the function. Assuming that an initial state probability vector S is given with S j stating Algorithm 1 Procedure for Generating the Transition Probability Matrix of the System Modeled as a Markov Chain the probability that x 0 is set to j/2 w , we can calculate the steady-state density vector of the system by multiplying the vector S to the matrix M p for large values of p. Fig. 9 shows the results of our analysis using the function in Fig. 3(a) . The top row shows M for W = 10, 6, 4. The graphs are similar to the shape of the function shown in Fig. 3 . As expected, binomial-like distributions are superimposed on the function graph. Steady-state densities were calculated by multiplying a uniformly distributed vector of initial states (barring states 0.0 and 1.0) to a large power of M (we used M 2000 ). The middle row of Fig. 9 shows steady-state densities, which shows that all resolutions do converge to the period-2 cycle, although the W = 4 case does so with less accuracy, due to its limited resolution. More in-depth analysis of the steady-state behavior of the system is presented in Section IV-B. Finally, the bottom row of Fig. 9 shows sample time-domain simulations of the three systems. The two graphs in each plot are generated by grouping odd cycles x 1 , x 3 , x 5 . . . into the blue graph and even cycle values into the red graph. As W decreases, more noise is observed in the system, which is seen by switching from one stable attractor to the other in the case of W = 4 (odd and even iterations switch roles). Despite noise-induced switching between attractors, the system is still quite robust and is able to switch back to period-2 cycles. The stability of the system in low-resolution modes justifies the use of a variable resolution scheme.
B. Theoretical Predictions
We now wish to compare our experimental results with theoretical predictions. We believe that such predictions could be of great value for SC, as they help determine the bit length needed for a desired accuracy and how often a stochastic switch in system behavior occurs. Our analysis shows that the average time between two stochastic switches of the system is doubly exponential in the bit length W , implying that adding just a few bits of resolution to the system could significantly reduce such effects. To simplify the exposition, Fig. 9 . Transition probability graphs (top row), steady-state density (middle row), and time-domain simulation (bottom row) of the circuit of Fig. 5 for resolutions W = 10 (left column), W = 6 (middle column), and W = 4 (right column). The state density for W = 10 is drawn using 1024 points on the x-axis. The y-axis is the probability of being in that state. The red sliver region is 16-point wide, which is of equivalent resolution as one point in the W = 6 case (2 1 0/16 = 2 6 ). The integral of the red sliver in W = 10 is 0.216, which is close to the peak value of 0.19 in W = 6. Similarly, the sum of the four points near the peak in W = 6 is roughly equal to the peak in W = 4. In the time simulation of the W = 4 case (bottom-right), the system is more susceptible to noise compared with higher resolutions, and the noise causes odd and even cycles to switch roles, but otherwise, the system is stable and quickly goes back to the 2-cycle even with a limited resolution of 16 points on the x-axis.
we perform this comparison in a single prototypical example. In particular, we analyze the stable period-2 orbit of the stochastic mapping shown in Fig. 3(b) . This mapping has a polynomial approximation within the target 10-bit resolution (the justification for using this polynomial can be found in Section IV-C1)
We use the large deviation theory first developed by Wentzell and Freidlin [16] for noisy continuous dynamical systems, and subsequently by Kifer [17] , [18] and Hamm and Graham [19] for noisy discrete dynamical systems, to characterize the steady-state distributions of the period-2 orbit and how long it takes on average for noise to cause a phase slip in the periodic orbit. 3 This last quantity, often called the mean switching time, gives the average number of iterations for a stochastic trajectory to jump between the basins of attraction of two stable attractors and can sometimes be characterized by the height of a saddle-point between two minima of a corresponding potential. The aforementioned theory gives asymptotic results for these quantities in the weak noise limit, or, in our case, as the resolution W → ∞. As they are most readily applicable, we use the results of [19] to derive our predictions. We also note that the results of [20] could be used to obtain our predictions. To begin, let us collect some facts of the mapping f a and its second iterate h = f a • f a . The mapping f a has a fixed point x 0 ≈ 0.7542, and period-2 orbit consisting of the points the points x ± as the resolution W increases. Roughly, the resolution of the system, 2 −W , which is the width between two consecutive representable numbers with W bits, should give the magnitude of the noise. Note that, if one assumes perfect randomization of the input streams after each iteration, the probability distribution of noise would be state-dependent with Gaussian deviations. While such perfect randomization is not possible in our computational setting, the application of the large deviation theory in our context only requires general assumptions on the asymptotic noise distributions (see [19, Assumption A]).
That is, the theory requires that the distribution of noise ρ is dependent on the previous iterate x and satisfies
for large W , where = K (x)2 −W for some constant K (x) dependent on the previous iterate x. The following computational experiments verify such asymptotic behavior of the noise, and also determine the rate constant K for each of the two attractor points. We used the MATLAB realization of the SC system to calculate two iterations for 10 000 copies of each of the points x ± . We performed this experiment for bit lengths W ranging from 7 to 15, measuring the standard deviation σ ± = √ ± of the resulting distributions of points for each W . We found good agreement with our heuristic prediction and found the noise constants to be K + := K (x + ) ≈ 0.1224 and K − := K (x − ) ≈ 1.1377 (see Fig. 10 for a depiction of these results).
2) Steady-State Distributions:
We now analyze the steadystate distributions of the second iterates of the stochastic system. In other words, we wish to give a theoretical prediction for the width of the "peaks" shown in the middle row of Fig. 9 , and to verify that the "narrower" right peak and the "wider" left peak in these graphs are indeed to be expected in the dynamical system implemented by the circuit of Fig. 3(a) . For a given noise level , Theorem 1 of [19] gives that the steady-state distributions about the fixed points x ± of h, which we denote as v ± (y), satisfy lim →0 log v ± (y) = − ± (y) Fig. 11 . Verification of steady-state distribution asymptotics equation (2) . Plots of standard deviations σ ± (W ) of the numerical steady-state distributions, plotted against 2 −W/2 for W = 7 . . . 15 (blue line with markers). The markers show how wide or narrow the peaks are in the steady-state graphs, similar to the middle row of Fig. 9 . Linear fits (orange curves) for σ + and σ − have slopes 1.0301 and 0.3405, respectively, which agree well with our predictions, especially for larger values of W . or in other words, v ± (y) ∝ e − ± (y)/ as → 0. The results of [19] also give that the potential ± has the leading-order quadratic approximation
near x ± . We also remark that, other than the asymptotic noise condition (1), the results we apply require certain conditions on the types and quantities of attractors in the dynamical system in question (see [19, Assumption B] ). Since our example system only possesses a single attractor, the stable period-2 orbit, these conditions are trivially satisfied. Using this approximation, and taking into account our quantification of the noise above, the distributions have the asymptotic form
When plotted against 2 −W/2 , the standard deviations of the distributions are thus asymptotically linear with slope
where we have used the values of K ± found numerically above. We found that these asymptotics agreed quite well with the steady-state distributions obtained via Monte-Carlo simulations in Section IV-A. After splitting up the peaks for x + and x − , we measured the standard deviations of these numerical distributions for W ranging from 7 to 15. A linear fit of the data for high values of W gives slopes, which are within 0.8% of the predicted slopes (see Fig. 11 for more details).
3) Mean Switching Time: Now, let us analyze the switching time of the stochastic system for various resolutions. Here, we define a switch to be a phase slip of the period-2 orbit in the SC system. In other words, a switch occurs when the second iterates of the stochastic system either move from the basin of attraction of x + to that of x − or vice versa. Theorem 2 in [19] gives that the mean switching time from ± for W such that 2 W = 100, 300, 700 (blue line with markers). Linear fit (orange curves) of log(τ − ) and log(τ + ) data gives slopes 0.01385 and 0.01377, respectively, which agree well with our predictions.
x ± to x ∓ , which we denote as τ ± , satisfies
where ∂ B ± denotes the boundary of B ± , the fundamental basins of attraction for x ± under h defined in the beginning of this section. In other words, τ ± ∝ e ± / . Once again taking into account the noise quantification from above, the mean switching times have the asymptotic form
To measure switching times for a range of W values, we iterated the stochastic system for a sufficiently large number of iterations so as to obtain over 3 × 10 4 switches of both type (+ → − and − → +) for each run. When plotted against 2 W , the experimental values of log(τ W + ) and log(τ W − ) are asymptotically linear with slopes 0.01385 and 0.01377, respectively (see Fig. 12 ), which compares well with the approximations + /K + ≈ 0.02594 and − /K − ≈ 0.02019, obtained by using the quadratic approximation to the potential ± . Experimental escape times are smaller than those predicted by this approximation, since the potential softens near the basin boundaries.
C. Hardware Comparisons
In this section, we compare our stochastic implementation to the conventional binary implementation of different functions. Section IV-C1 considers functions that were found using the function exploration mode of our annealing engine, namely, fx6, fx8, and fx12. We found polynomial approximations to these functions to be used in the conventional binary implementation. We expect the stochastic implementation of these functions to outperform their corresponding conventional implementations.
Section IV-C2 considers the logistic map with different μ values and uses the curve-fitting mode of our annealing engine to find the stochastic equivalent circuits for those logistic-map functions. We expect the conventional implementation to outperform the stochastic implementation.
1) Function Exploration Mode:
In this section, we use fx6 and fx8 to refer to the functions of Fig. 3(b) and (e), respectively. The number after fx is the number of inputs to the digital equivalent circuit. We also tried a chaotic function fx12(x) = XOR(AND6(x), OR6(x)), where AND6 and OR6 are AND and OR gates with six copies of x used as an input. To compare our stochastic implementations with the equivalent conventional binary implementations, we first used curve fitting to find polynomials in the real domain that approximate our dynamical systems' functions. Given that our stochastic simulations were limited to a maximum of 10-bit binary resolution, it would be unfair to the conventional implementation to use exact polynomials describing functions fx6, fx8, and fx12 (the exact polynomial for these functions would be of degrees 6, 8, and 12, respectively). To save hardware resources in the conventional implementation, we used the smallest polynomial degree that could approximate our stochastic functions with a resolution of 10 bits. The resulting polynomials are We used fixed-point arithmetic in the conventional implementation. Even though the final resolution needs to be only 10 bits, the circuit needs a resolution of 16 bits for the intermediate calculations to correctly generate steadystate densities. Six out of those 16 bits are needed for the integer part to accommodate signed coefficients in the range of [−14,17] . We experimented with decreasing the number of bits in the conventional representation, but the steady-state graphs quickly degrade, hence confirming that 16 bits are needed. Fig. 13 shows the conventional architecture. The coefficients are labeled C i , the multipliers are labeled M j , the accumulator as "Add," and the LUT hardware to generate Gaussian noise as "Noise." The input value fed to the circuit is X n , and its output is X n+1 , which is fed back to the input register for the next iteration. The first row of multipliers calculates the C 1 × x 4 term, the second row calculates the C 2 × x 3 term, and so on. We avoided sharing multipliers to derive higher powers of x from lower powers (e.g., using only one multiplier to generate x 4 from x 3 ) because doing so would have resulted in underflow, requiring more bits. Better accuracy results are obtained when we first multiply the coefficient C i to x, and then multiply more x to get the right term in the polynomials listed above. Since we are comparing energy (latency×power) across different designs (last two columns of Table I ), using a more serial architecture that reuses multipliers would not result in significant changes in the total energy for the computation.
We implemented all designs in Verilog and compiled them on Artix 7 XC7K70TFBV676-1 field-programmable gate arrays (FPGAs) using the Xilinx Vivado default design flow. The chip is large enough to implement all our circuits. The architecture of Fig. 13 is labeled "conventional," the architecture of Fig. 5 is called "Flat stoch," the same architecture with shared LFSR4 randomizers is called "Shared LFSR," and the architecture of Fig. 8 is called Lx Sy, where x and y are the number of bits used as L and S in Fig. 8 . All stochastic implementations use W = 10. In the Lx Sy architecture, we allow one bit of overlap between the S and L parts for better accuracy (i.e., the LUT initializing the L leftmost bits, and the stochastic process accumulating the lower S bits, with a potential of bleeding into the lower bit of the L segment).
FPGA resource usage is shown under columns "LUT" and flip-flops (FFs) of Table I. Note that in an FPGA, the area of the circuit is a combination of LUTs and FFs. 5 The noise generation circuit in the conventional design takes a nontrivial amount of area, 6 because it has to generate Gaussian noise. The stochastic method generates the Gaussian noise almost for free by adding up uniform noise over 2 W (or 2 S ) microcycles. The size of the LUT containing the cumulative distribution function of the normal distribution for the conventional noise generator was set to achieve similar noise characteristics to those generated by the flat stochastic design (3σ of [−0.05, 0.05] with 10 bits of resolution). Among the hybrid architectures, L7S4 shows similar noise behavior as the flat design: even though it is no longer a smooth normal distribution function, its variance is close to the σ value of the flat design. The variance of L6S5 and L5S6 is higher. Given that the stochastic implementation is very stable even with low resolutions [ Fig. 9 (middle row)] , the fact that the resulting noise of Lx Sy is coarser than the Gaussian noise generated by the flat design does not seem to have an adverse effect on the quality of the steady-state distribution generated by the circuit.
We can see from the "LUT" column that the flat stochastic is much smaller than the conventional implementation, and as we use LFSR sharing, the area reduces further. In the Lx Sy architectures, as we increase L, the size of the LUT increases, but the width of the LFSRs decreases.
Clock frequency is reported in the column labeled " f ." The number of (micro) clock cycles needed to compute X n+1 from X n is labeled "Latency (cycles)." The conventional design has a pipeline depth of four, requiring four cycles to calculate. The flat design needs 1024 microcycles to complete the computation, and the hybrid architectures of Lx Sy require exponentially fewer cycles as S decreases. The column labeled "Latency (μs)" shows the actual time it takes to finish the computation (cycles/ f ). The column labeled "Power" is the dynamic power of the design, and "Energy" is the total energy required to perform the computation. The "Latency ratio" and the "Energy ratio" columns compare all stochastic implementations to the conventional design implementing the same function, i.e., all stochastic fx6 values are divided by the conventional fx6, all stochastic fx8 values by the conventional fx8, and similarly for fx12.
Comparing designs with similar noise characteristics, namely, "Conventional," "Flat," "Shared LFSR," and "L7S4," we can see the effect of each of our techniques on energy. The flat architecture as the baseline stochastic implementation has between 13× and 33× worse energy consumption compared with the conventional method ("Energy ratio" of "Flat" for the fx6, fx8, and fx12 functions in Table I . Switching to shared LFSR4's maintains noise characteristics but cuts down on power by a factor of 2.3×-2.5× (flat energy ratio divided by shared energy ratio). Finally, using the hybrid method in addition to sharing LFSRs results in between 86.8× and 174× improvement over the baseline flat stochastic method, and between 4× and 12.5× improvement compared with the conventional method (12.5 = 1/0.08). We did not perform experiments comparing the leakage power of the different designs, but given that the stochastic implementation uses much less logic compared with the conventional implementation, and given the rising share of leakage in more advanced technology nodes [21] , one can expect that stochastic implementation would yield better leakage power as well.
We did not perform experiments for higher noise levels for the conventional or the flat designs: their results would have gotten only worse. However, just as a point of reference, we included the results for the L6S5 and L5S6 architectures that simulate higher noise levels. Even though the numbers of microcycle increase, these architectures still show better results compared with the conventional architecture.
We next evaluated the impact of increasing the computation resolution W on the hardware performance metrics of the different implementations. Table II reports the results of our experiments for the function fx6 at three different values of the computation resolution. It can be seen that the relative performance of the different circuits does not change significantly when we use a higher resolution.
2) Curve-Fitting Mode: In this section, we report the results of our experiments with different logistic-map functions. We chose three values of μ, in which the logistic map exhibits period-2 cycles: μ = 3.175, 3.3, and 3.4. These three functions are easy to implement in the conventional method with relatively little cost. We used the curve-fitting mode of our annealing algorithm to synthesize the stochastic versions of these functions. Table III lists the target functions with the μ parameter (uxxxx_conv) and the curve-fitted stochastic implementations approximating it (uxxxx_stch). We list the function equation, along with the two fixed points and the mean-squared approximation error. It can be seen that our annealing algorithm was able to find close approximations to these functions. Table IV compares the hardware performance metrics of the different implementations. LFSR4's were used with a maximum input sharing of 6, which seemed to be the limit for these functions. Note that the L7S4 implementation has similar noise properties compared with the conventional, and it is the one that we have shown in bold. Other implementations of Lx Sy are included as before to show the tradeoff between noise profile and area/latency. It can be seen that the stochastic implementation outperforms the conventional implementation in terms of energy consumption (70%, 88%, and 75% for u = 3.175, 3.3, and 3.4, respectively). As expected, since the logistic function is less costly to implement in the conventional method compared with the functions used in Section IV-C1, we see less performance improvements using the stochastic implementations in Table IV compared with Table I .
V. CONCLUSION
We showed how stochastic logic can be used in simulating a class of complex dynamical systems with feedback using rerandomization of the output. The results show that the simulation is stable even for low resolutions, resulting in better opportunities for optimization (e.g., noise generation and latency reduction). We also used a table lookup technique to exponentially speedup stochastic computations, resulting in better energy consumption compared with the conventional design. This is in contrast to most previous work on SC that show improved power compared with conventional binary implementations, and not energy. We verified the behavior of the stochastic implementation by comparing a number of its features, such as the characteristics of its steady-state distribution and mean switching time between the two attractors. The theoretical results matched very well with our experimental results.
