Abstract 1 -Stochastic computing (SC) is an approximate computing technique that processes data in the form of long pseudorandom bit-streams which can be interpreted as probabilities. Its key advantages are low-complexity hardware and high-error tolerance. SC has recently been finding application in several important areas, including image processing, artificial neural networks, and low-density parity check decoding. Despite a long history, SC still lacks a comprehensive design methodology, so existing designs tend to be either ad hoc or based on specialized design methods. In this paper, we demonstrate a fundamental relation between stochastic circuits and spectral transforms. Based on this, we propose a general, transform-based approach to the analysis and synthesis of SC circuits. We implemented this approach in a program spectral transform use in stochastic circuit synthesis (STRAUSS), which also includes a method of optimizing stochastic number-generation circuitry. Finally, we show that the area cost of the circuits generated by STRAUSS is significantly smaller than that of previous work.
I. INTRODUCTION

S
TOCHASTIC computing (SC) is an unconventional computing technique introduced by Gaines [6] , [7] and Poppelbaum et al. [21] that processes long digital bit-streams that have well-defined numerical values, but randomly generated bit-patterns. It employs simple logic circuits to perform complex arithmetic operations For instance, multiplication can be implemented by a single AND gate. A bit-stream X containing N 1 1s and N 0 0s denotes the number p = N 1 /(N 1 + N 0 ). Since p lies in the real-number interval [0, 1] , it is usually interpreted as the probability of a 1 appearing at a randomly chosen location in X or, equivalently, as the probability that X outputs a 1 at a randomly chosen time.
With suitable interpretation (scaling) of the bit-streams' numerical values, SC can essentially approximate any arithmetic operation [6] . Fig. 1 October 16, 2015 . This work was supported by the U.S. National Science Foundation under Grant CCF-1318091. This paper was recommended by Associate Editor J. Cortadella.
The authors are with Advanced Computer Architecture Laboratory, University of Michigan, Ann Arbor, MI 48109-2122 USA (e-mail: alaghi@eecs.umich.edu; jhayes@eecs.umich.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2015.2432138 1 Parts of this paper are based on "A spectral transform approach to stochastic circuits," which was presented at the International Conference on Computer Design, Oct. 2012 [3] . 
involving addition, multiplication, and negation of N-bit numbers. (Why this is so will be explained later.) The accuracy and precision of an SC computation like this depends on the length and randomness of the input bit-streams. Fig. 1 shows N = 12, but in practice N is often much longer. Note that, the circuit C is a standard logic circuit defined by Boolean algebra, but its stochastic behavior, as given in (1) , is basically analog arithmetic over rational or real numbers. Also note that, the circuit C has three auxiliary inputs (the r i inputs in Fig. 1 ) that provide three-independent random numbers of constant probability value 0.5, another typical feature of stochastic circuits. As Fig. 1 suggests SC is suitable for applications that require large numbers of relatively low-precision operations. Moreover, SC's probabilistic aspect makes it tolerant of errors of the soft and bit-flip type. Recent technology trends, such as the need for massively parallel processing and the increasing sensitivity of ICs to soft errors, have renewed interest in SC as an attractive alternative to conventional binary computing in a few important applications. For example, Alaghi et al. [2] showed that SC can outperform binary computing in some image-processing tasks. A recently discovered use of SC is in low-density parity check (LDPC) decoding [8] ; LDPC codes are part of the IEEE WiFi standard [11] . Naderi et al. [18] have implemented a stochastic LDPC decoder that has performance comparable to that of conventional designs, but uses less chip area. Another attractive feature of stochastic circuits, reflecting their small size, is their low power requirement. Alaghi and Hayes [1] further discuss the benefits and applications of SC in the contemporary design space.
Although SC has been known for decades, most SC designs have been ad hoc in nature and tailored to very specific applications. Some are based on assembling circuits from a fixed and limited library of SC components, e.g., multipliers and adders. A first step toward systematic design of stochastic circuits was taken by Qian and Riedel [22] , who developed a method of implementing arithmetic functions based on Bernstein polynomials. Similar methods of synthesis have been proposed targeting combinational [26] and sequential circuits [16] , [27] .
Sequential circuits can, in general, implement a larger class of stochastic functions, and may be more efficient in implementing nonlinear functions such as tanh and exp [15] . However, their performance and accuracy are degraded by the auto-correlations that may exist in the input bit-streams, as well as the "warm-up" time needed for the circuits to arrive at their steady-state distributions, two phenomena that do not affect combinational circuits. Furthermore, there are many examples of combinational stochastic circuits that are more efficient than their sequential counterparts. Combinational multipliers, which will be discussed later in this section, do not have a low-cost sequential implementation. Another example is the combinational edge-detection circuit proposed in [2] , which is significantly more efficient than the sequential implementation of the same function [14] .
Despite the fact that large stochastic systems such as LDPC decoders [18] and even general-purpose machines [20] have been successfully implemented, a comprehensive SC design methodology has yet to appear. This paper identifies and explores a fundamental relation between SC circuits and spectral transforms like the Fourier transform [10] , [13] . Such transforms have many applications in engineering. For instance, consider the time-domain impulse response of an analog filter. While it contains all the information about the filter's behavior, it is not easy to extract the response of the filter to a given input signal. The Fourier transform of the impulse response reveals the "hidden" frequency-domain behavior of the system, from which its response to a given input signal can readily be found.
The spectral transforms of interest in this paper map Boolean functions (BFs) from the logic domain to the domain of real numbers. We show that they lead to a unique multilinear polynomial representation of a given BF, which defines its SC behavior. (A multilinear polynomial is a polynomial in which terms may contain products of variables, but no variable appears with a power of two or higher.) Based on this, we derive methods of analyzing and synthesizing combinational stochastic circuits. We present a synthesis algorithm called spectral transform use in stochastic circuit synthesis (STRAUSS), which applies to general polynomial types and covers a wider range of (positive and negative) stochastic number (SN) formats than previous work. Furthermore, it can synthesize circuits with significantly lower area cost.
The main contributions of this paper are as follows.
1) The use of spectral transforms to link the structure and behavior of stochastic circuits. 2) A general spectral algorithm for designing stochastic circuits to implement arbitrary arithmetic functions. 3) A method of optimizing stochastic designs by sharing SN generators. 4) A cost comparison of circuits designed with the proposed spectral method and previous nonspectral approaches.
II. BASICS OF STOCHASTIC COMPUTING
This section reviews the basics of SC, and introduces the terminology and notation used throughout this paper.
In SC, the numerical value associated with a 1-bit logic signal x is usually taken to be the probability of seeing a 1 on x. We denote this value by X and refer to it as an SN that corresponds to signal x. (SNs that are carried by multiple logic signals are not considered in this paper.) This is called the unipolar (UP) format and represents real numbers over the unit interval [0, 1], which is also the probability domain. Table I shows three different but closely related formats for SNs. The bipolar (BP) encoding is often preferred because it represents signed numbers over the [−1, +1] interval in a natural way. We introduce a new SN format called inverted bipolar (IBP) which is the inverse of BP. While, the IBP and BP formats are essentially equivalent, IBP is more convenient to use with spectral transforms, where the Boolean values 0 and 1 are replaced by +1 and −1, respectively. This small notational change greatly simplifies the analysis and synthesis of circuits in the spectral domain. To illustrate the various SN formats, consider the bit-stream 0110101101 of length 10 containing six 1s and four 0s. It represents the UP number 0.6, BP number 0.2, and IBP number −0.2. For most of this paper, we will use the IBP format, unless stated otherwise.
Multiplication of UP numbers is implemented by a single AND gate; see Fig. 2(a) . The probability of seeing a 1 at the output z is equal to the probability of seeing 1 at input x multiplied by the probability of seeing a 1 at input y, assuming the two probabilities are independent. Hence, Z = XY. Fig. 2(b) -(c) shows stochastic multipliers for the BP and IBP formats. Note that, the stochastic behavior of a circuit changes with the SN format used. For example, the XOR gate of Fig. 2(c) , i.e., the IBP multiplier, realizes the following UP operation:
The circuit of Fig. 1 is designed to operate on BP numbers; its XOR gate performs both multiplication and negation. Once, we know the stochastic behavior of a circuit in one format, we can use the equations given in Table I to derive its behavior in the other formats. In the next section, we will show how spectral transforms can be employed to derive the stochastic behavior of arbitrary logic circuits for any SN format. Stochastic addition is usually implemented by the multiplexer of Fig. 2(d) , which is used with all three SN formats in Table I . Note that, besides the main (data) inputs x and y, an auxiliary signal r is applied to the multiplexer's select input. This consists of a uniform random bit-stream of constant value R = 0.5 (in UP format), which serves as a scaling factor; r is typically obtained from a (pseudo) random number generator, such as a linear feedback shift register (LFSR) [12] . Since adding two numbers from the interval [0, 1] (or [−1, +1]) can produce a sum in the interval [0, 2] (or [−2, 2]), scaling by 0.5 brings the sum back into the original interval. In the BP circuit of Fig. 1 , the multiplexer computes 0.5(X 1 + X 2 ), which the XOR gate then multiplies by another factor of 0.5 and negates. The BP constant 0.5 (which is 0.75 in UP format) required by the XOR is generated by the OR gate.
To combine stochastic and conventional logic, interface circuits are needed that convert between (weighted) binary and SN formats. Converting an SN to binary form requires counting the number of 1s in the bit-stream, so a counter suffices to implement it. Converting from binary to SN form requires a more complex circuit. Fig. 3 shows a typical stochastic number generator (SNG) that converts a binary number of value X to an SN of the same value in UP format. At each clock cycle, it compares X with a binary number R that is uniformly distributed in the [0, 1] interval (or equivalently, on m wires, each of which carries an SN of UP value 1/2). The SNG outputs a bit-stream representing X, one bit per clock cycle. As reported in [26] , SNGs can consume as much as 80% of an SC circuit's total area, so reducing the number of SNGs is a major design challenge in SC.
The accuracy and precision of SC depends on the length of the bit-streams used and the degree of dependence or correlation among the input bit-streams. For instance, the SNG of Fig. 3 , if clocked for 2 m cycles, produces a 2 m -bit SN that has m bits of precision. If the bit-streams applied to the circuit of Fig. 2(a) are correlated, say each has the same bit-pattern of value X, then the output will also have value X instead of X 2 . Hence, an AND gate will not perform multiplication accurately if its inputs are correlated. In this paper, we make the usual assumption that the SNs applied to the primary inputs of our stochastic circuits are uncorrelated. Further discussion of accuracy and correlation issues can be found in [1] and [4] .
III. SPECTRAL TRANSFORMS
We now introduce the spectral transforms used to analyze and synthesize stochastic circuits. An n-variable BF f (x 1 , . . . , x n ) maps B n = {0, 1} n to B = {0, 1}. Here, B n is seen as a 2 n -dimensional vector space, where each dimension corresponds to a row of f 's truth table (TT), or equivalently, to an n-variable minterm. 
This is the familiar sum-of-minterms expansion of f, where the c i 's are 0-1 coefficients that define f. 3 , or in the column-vector form that we use later
The last vector is essentially g's TT. To save space, we also write such vectors in the transposed form [1 0 1 1] T .
As Table I (3) . By replacing 0s and 1s with +1s and −1s, respectively, we see that f 1 produces the TT vector
, and (−1, −1), in IBP format. These four discrete "TT points" can be embedded in a continuous real-number function as follows:
Observe that each term of the foregoing expression assumes the correct value 1 or −1 at each TT point. On expanding this expression, we get
The polynomial (4) interpolates the TT values in the real numbers. It is linear with respect to variables X 1 and X 2 , and is referred to as multilinear. Most importantly as we will see, it represents the stochastic behavior of the BF f 1 .
. . , X n , defined in an SC format, such as UP, BP, or IBP, are the input arguments of f, the output is another SN that is some function of X 1 , X 2 , . . . , X n . We denote this function byF(X 1 , X 2 , . . . , X n ) and refer to it as the SC behavior of f. We will see thatF has a unique multilinear form similar to that in (4), and can be determined by means of spectral transforms.
The spectral transforms of interest execute a change of basis from the minterm space of a BF f to a real-valued space.
We employ the Fourier transform F for BFs, which is also known as the discrete Walsh transform in Hadamard ordering [13] , [19] . Spectral transforms of this type have been considered previously for a wide variety of logic design and testing tasks [10] , [13] . For BFs with large values of n, it may be impractical to deal with 2 n -dimensional spectra, although concise representations of spectra for functions with hundreds of variables are known [5] . This size issue is of much less concern in SC, however, because the values of n tend to be small, e.g., n = 2 in Fig. 1 .
To compute the Fourier transform of a BF given as a TT vector f or equivalent, we first replace 0 and 1 by +1 and −1, respectively. The Fourier transform F = F ( f ) is specified by
where F is the vector form of F denoting its spectral coefficients or spectrum and H n is the Walsh matrix (with natural or Hadamard ordering) of dimension 2 n defined recursively by
Equation (5) is evaluated using the rules of linear algebra over real numbers. In the case of f 1 from Example 1, we get The result is the spectrum of f, and the corresponding basis is
T defined by the rows or columns of H 2 . (Note that H n is symmetric.) These basis vectors resemble digital waveforms, and are analogous to harmonics in the sine-cosine Fourier transform used for time-frequency conversions. More specifically, they correspond to the four linear BFs: 1, x 2 , x 1 , and x 1 ⊕ x 2 , where ⊕ denotes XOR. Recall that in the spectral domain, XOR is replaced by multiplication. Analogous to the sum-of-minterms expansion (2) for f 1 in the Boolean domain, we write the transformed function as
where the S i 's are the basis vectors 1, X 2 , X 1 , and X 1 X 2 , in the spectral domain, and the C i 's constitute the spectrum of f 1 . Hence, (6) becomes
This is a multilinear polynomial that interpolates (matches) the original BF f at its four Boolean input coordinates (TT points),
. This transformation is illustrated in Fig. 4 . Notice that the last expression above is exactly the same as (4), and the process of arriving at the two expressions is similar. So, we see intuitively that the Fourier transform of a BF defines its unique SC behavior. This leads to the following theorem. 
Calculating its Fourier transform yields
According to Theorem 1, the IBP behavior of XOR iŝ
Similarly, we know that a two-input AND gate acts as a multiplier in UP format. In this case, f 3 = [1 1 1 −1] T and
Thus
Since, we want the AND behavior in UP format, we must map (7) from IBP to UP. Let p 1 , p 2 , and p 3 denote the UP values of X 1 , X 2 , and F 3 , respectively. Table I implies
leading to the desired multiplication p 3 = p 1 p 2 . Finally, we note that the Fourier transform is invertible and preserves all information about f. We can therefore retrieve the original TT form by applying the inverse Fourier transform f = F −1 ( F) to the spectrum. Since H n is related to its own inverse by a 2 n factor, this may be calculated as follows:
This is the key link from the desired SC behavior F to a logic function f that, with appropriate modifications, implements F. As mentioned, the elements of F correspond to polynomial terms in the spectral domain. For n = 2, these terms are 1, X 2 , X 1 , and X 1 , X 2 , respectively. In the general case, we assign an n-bit binary number to each element of F, starting from all 0s. So the elements of F are assigned to "000 . . . 0," "000 . . . 1," . . . , "111 . . . 0," and "111 . . . 1," respectively. Now to find the polynomial term corresponding to each element, we get the assigned binary number and replace each 1 by the corresponding X i for that position, and replace each 0 by 1. Hence, the terms corresponding to the first, second, and last elements of F are 1, X n , and X 1 X 2 . . . X n , respectively.
IV. SPECTRAL TRANSFORM-BASED SYNTHESIS
The spectral transforms discussed so far have several useful applications in the SC context. Besides analyzing BFs and extracting their SC behavior, they can be used to systematically design combinational SC circuits. But before discussing our synthesis method (STRAUSS), a few preliminary concepts are needed.
A. Stochastically Implementable Functions
Theorem 1 implies that a combinational circuit can implement a stochastic functionF given in multilinear polynomial form, i.e., one in which terms may contain products of variables of degree at most one. The idea is then to apply the inverse Fourier transform F −1 toF and obtain the corresponding BF f in vector TT form, as in (8) . However, applying F −1 to an arbitrary target functionF does not necessarily yield a vector f whose elements are the TT values +1 and −1. In fact, we can have the following three possible outcomes.
1) All the elements of f are +1 or −1, in which case f is the TT of a BF and is directly implementable by a logic circuit. 2) All elements of f lie in the interval [−1, +1], but some have values {c i } other than +1 or −1. In that case, the function is still implementable, but requires auxiliary inputs and circuitry to generate the c i 's, as discussed in Section V. 3) Some elements of f are larger than +1 or less than −1, in which case the function is not implementable. It is still possible to implement a related functionF that is an approximate or scaled version ofF. We say polynomialF is SC-implementable if we can synthesize a stochastic circuit whose behavior is defined byF. SC-implementable functions fall into the first two categories above, and are distinguished by having inverse Fourier transforms. The simple product function X 1 X 2 is SC-implementable, whereas the unscaled sum X 1 + X 2 is not. We will refer to the latter as SC-unimplementable.
To illustrate the general case, consider the generic two-variable polynomial
Taking its inverse Fourier transform, we get
In order forF to be SC-implementable, the elements of f should be in the interval [−1, +1]. In other words, the following constraints should be satisfied:
Constraints of this kind can be obtained for polynomials of n variables by applying the inverse Fourier transform to them. Concepts similar to SC-implementable polynomials are discussed by Qian and Riedel [22] , [24] . They implement stochastic functions in the form of Bernstein polynomials, and define constraints on the coefficients of Bernstein polynomials in order to distinguish implementable functions. As we will show later, that approach can also be interpreted in terms of spectral transforms.
B. Target Function Conversion
In order to synthesize a stochastic circuit for an arbitrary target functionF, a few conversion steps are required. For instance, the inverse Fourier transform can only produce suitable BFs when applied to multilinear functions [19] . Supposê F is an ordinary polynomial of degree n F(X) = a 0 + a 1 X + a 2 X 2 + · · · + a n X n implying that it has nonlinear terms. We convert it to a multilinear polynomialP(X 1 , . . . , X n ) in which the nonlinear terms ofF, such as a n X n , are replaced by multilinear terms like a n X 1 X 2 . . . X n . The new variables X 1 , X 2 , . . . , X n are assumed to be independent copies of the original variable X. There are many possible ways to select a multilinear polynomialP that corresponds toF. A natural choice is one that is symmetric with respect to all its variables thuŝ
However, asymmetric polynomials are also possible. For instancê
is one of the possible asymmetric multilinear polynomials that correspond toF. Existing synthesis methods [3] , [22] , [26] assume symmetry, but as we will show for the first time in Section VI, asymmetric multilinear polynomials may lead to better implementations. Finally, to synthesize a functionF from [0, 1] n to [0, 1] that is SC-unimplementable, we convert it to an SC-implementable polynomial by approximation. This step is a straightforward polynomial fitting problem and can be easily solved by tools such as MATLAB [17] .
C. Synthesis Examples
The main steps of STRAUSS are listed in Fig. 6 . We first illustrate them with examples, and then discuss their details.
Example 3: Consider the problem of reverse engineering the IBP multiplier, so the given function isF 2 (X 1 , X 2 ) = X 1 X 2 . Since this already has the desired multilinear polynomial form, we can skip Step 1 of STRAUSS and proceed to Step 2, where we use the inverse Fourier transform to obtain f 2 's TT vector
The result f 2 only has +1 and −1 as elements, and is clearly the TT of a two-input XOR gate. Now consider the same problem for the UP multiplier
Note that, this function will be different fromF 2 because of the format change. MappingF 3 to the IBP format yieldŝ
Applying the inverse Fourier transform toP 3 produces the TT [1 1 1 −1] T , which defines an AND gate.
Example 4: Suppose, we attempt to synthesize the arithmetic sum functionF 4 (X 1 , X 2 ) = X 1 +X 2 . This is also in multilinear form, so we proceed with the inverse Fourier transform
Two elements of f 4 lie outside the interval [−1, +1], which means that it is SC-unimplementable. The standard solution is the scaled addition which substitutes s(X 1 + X 2 ) for X 1 + X 2 , where s is a scale factor that makes the function SC-implementable; in this case s = 1/2. The new target function isF 5 (X 1 , X 2 ) = 1/2(X 1 + X 2 ), which yields
However, the result f 5 has elements other than +1 and −1, namely 0, so a constant number generation step (Step 3) is required. This is discussed in the next section.
V. CONSTANT-NUMBER GENERATION
The third step of STRAUSS, as seen in Fig. 6 , is constant number generation, a challenging problem in itself. We first discuss the intuition behind this step. Then, we formally define the problem and present several solutions.
Continuing Example 4 from the previous section, we derived the TT f 5 = 1 0 0 −1 T for the stochastic add operation.
We can interpret this TT as follows. If both inputs are 0, then a constant SN of value +1 (in IBP format) is sent out; if both inputs are 1, then a constant SN of value −1 (in IBP format) is sent out; if one input is 1 and the other is 0, a constant SN of value 0 is sent out. Producing an SN with values other than +1 or −1 requires additional inputs, i.e., random number sources. For the running example, we can replace the two-input TT f 5 , with the following three-input TT:
where an auxiliary input r has been added to the function. (This happens to be the majority function.) The r input must be fed with the IBP SN 0, i.e., a pure random bit-stream. Fig. 5(a) shows a straightforward AND-OR implementation of f 5 . It is possible to optimize the synthesized circuit further by reordering the elements of f 5 . For example, the following TT has the same stochastic behavior of f 5 but has a simpler implementation:
This is the BF
, which has the AND-OR implementation of Fig. 5(b) . It is obvious that this new circuit is a 2-to-1 multiplexer with r as its select input. This is precisely the standard scaled adder in the SC literature [6] . Finding the best ordering of +1s and −1s of a TT is a difficult optimization task. We now formally define the constant SN generation problem, and discuss our solution method for it.
Single SN Generation Problem: Given a constant number c ∈ [ − 1, +1] and a desired precision m, we want to find an m-input BF f (r 1 , . . . , r m ) with k = (1 − c) · 2 m−1 (or k = (1 − c) · 2 m−1 ) minterms that has minimum cost.
When fed with pure random inputs, an m-input BF f (r 1 , . . . , r m ) with k = (1 − c) · 2 m−1 minterms, will output a 1 with a probability of k/2 m . Thus, the IBP value generated at the output of f is 1 − 2 · (k/2 m ) = c. Note that, the precision m only refers to the target constant c, and has no implications on the run-time for SN generation.
It should be noted that the number of auxiliary inputs introduced determines the precision of the constant numbers that can be generated. . Any other number c should be rounded to the closest number from the above set. So for an arbitrary real-valued c, one has to choose a value for m, taking into account that increasing m provides better precision and accuracy but also increases the cost of the circuit.
As the problem suggests, there may be many BFs that generate the same constant number, and our goal is to select one with minimum cost. We use literal count as our cost criterion, because it is easy to use and reflects both transistor and gate count fairly accurately in standard CMOS logic [9] . For example, both the following BFs f 6 (r 1 , . . . , r m ) = r 1 and f 7 (r 1 , . . . , r m ) = r 1 ⊕ · · · ⊕ r m have 2 m−1 minterms and thus generate the constant c = 0. But f 6 has a literal count of 1, while f 7 has a literal count of 2(m − 1) using a chain or tree of XOR gates. Note that, the cost of an XOR gate is twice the cost of an elementary gate.
Qian and Riedel [23] give a method of synthesizing a minimal two-level circuit that generates a given stochastic constant. Qian et al. [25] discussed several other methods that synthesize multilevel circuits. The method of [25] does not directly address the single SN generation problem defined in this paper, so the optimality of that method will not be discussed here. We now present a recursive algorithm called stochastic constant generation (SCG) to obtain a minimal multilevel constant generation circuit. This algorithm takes two parameters, the number of inputs m and the number of minterms k, and returns a TT of length 2 m with k minterms. If k ≤ 2 m−1 , i.e., k is at most half the TTs length, then SCG fills the first half of the TT with 0s, and makes a recursive call with parameters m − 1 and k. This recursion can be interpreted as follows. After m − 1 recursion steps, SCG returns a BF that is implemented by a chain of at most m − 1 AND or OR gates. The steps of the SCG procedure are given in Fig. 7 . This algorithm can also be used to solve the multiple constant SN generation problem, as discussed below.
Example 5: Consider finding a seven-input function f 8 (r 1 , . . . , r 7 ) with 77 minterms. We call SCG (7, 77) , which returns a TT with 64 ones in the second half, and 13 ones in the first half. The first half is now obtained by calling SCG (6, 13) . This recursion step corresponds to f 8 (r 1 , . . . , r 7 ) = r 1 ∨ f 8 (r 2 , . . . , r 7 ) Fig. 7 . Overview of the stochastic constant generation (SCG) procedure. in which f 8 is the result of SCG (6, 13) . Continuing down the recursion path we see SCG(5, 13), SCG(4, 13), SCG (3, 5) , SCG(2, 1), SCG(1, 1), and SCG(0, 1) which is a terminal case. The resulting function is hence
which has minimal literal count. The corresponding optimal circuit implementation of f 8 is shown in Fig. 8 .
The circuits generated by the SCG procedure are optimal in terms of literal count. The proof is straightforward; each input of the generated BF appears at most once in the final expression. So if m inputs are required to generate a constant, the literal count of the generated circuit will be m, which is the minimum possible literal count for an m-input circuit. The constant number generators synthesized by the method of [23] have near-optimal two-level implementations, but in most cases their literal count is greater than m.
Next, we generalize the problem to multiple constants. Note that, previous work, including the methods discussed in [23] and [25] , does not address multiple-constant generation. The multiple constant generation problem, which is step 3 of STRAUSS (see Fig. 6 ), is a difficult one, because there are many possible opportunities for sharing gates between the circuits for the c i 's. Since exhaustively searching among them would be very inefficient, we propose a heuristic algorithm based on the SCG procedure of Fig. 7 . To generate all the constants, we call SCG for each c i , and as SCG progresses, we keep a record of the constant SNs and circuits it has generated so far. The generated circuits are stored in a table and are reused if a previously generated SN is encountered. This approach is demonstrated in the following example which illustrates a complete synthesis computation using STRAUSS.
Multiple SN Generation
Example 6: Consider the target functionF 9 (X) = 0.4375 − 0.25X − 0.5625X 2 . We want to use STRAUSS to synthesize a stochastic circuit that implementsF 9 . In Step 1 of Fig. 6 , F 9 is reformulated as a multilinear polynomialP 9 because it contains a nonlinear term X 2 . This term is eliminated by introducing two new inputs X 1 and X 2 to replace X thuŝ
This is only one of the many possible multilinear polynomials that are equivalent to the target functionF 9 .
For this example, we chose a symmetric multilinear polynomial, but as we will show in the next section, STRAUSS examines many possible polynomials (including asymmetric ones) and chooses one that leads to a lower cost. At Step 2 of STRAUSS, we have 
Next, SCG is called with parameters m = 4 and k = 7 to generate c 2 . Calling SCG(4, 7) entails recursive calls SCG(3, 7) and SCG (2, 3) , at which point the recursion stops because the results of the previous calls (during the circuit generation for c 1 ) are reused. The returned BF is
which shares a gate with f c 1 .
Step 3 is now done, and we have a stochastic implementation ofF 9 , namely
In the last step, f 9 is optimized via conventional CAD tools. Fig. 9(a) shows the final gate-level implementation of f 9 . Notice the shared OR gate (r 3 ∨ r 4 ), which is used in both f c1 and f c2 . The inputs x 1 and x 2 must be fed with independent bit-streams that carry the same SN X. These can be generated by using two-independent SNGs, or just by shifting one bitstream in time and thus generating an independent copy of it [12] . The auxiliary inputs, on the other hand, must be supplied with pure random bit-streams. As shown in Fig. 9 , we connect the auxiliary inputs to a 4-bit LFSR, which generates four independent random bit-streams [12] . Fig. 9(b) shows another SC implementation ofF 9 using the algorithm proposed in [3] . As can be seen, the circuit generated by STRAUSS yields a smaller area based on literal count. In the next section, we discuss further optimizations used in STRAUSS.
VI. FURTHER OPTIMIZATIONS
Step 1 of STRAUSS involves converting the target function to a multilinear polynomial, which can be symmetric or asymmetric. The synthesis method of [3] only uses symmetric polynomials, but it is possible to further optimize the synthesized circuit by considering both symmetric and asymmetric polynomials.
We illustrate this with an example. Consider the target functionF
ConvertingF 10 to symmetric multilinear form, we get
which yields the following TT after applying the inverse Fourier transform:
Since this has elements other than 1 and −1, namely, 1/3 and −1/3, multiple-constant SN generation circuitry is required. However, if we choose the following asymmetric multilinear polynomial in the first step of STRAUSS, then:
The inverse Fourier transform produces Fig. 10 . Summary of the asymmetric polynomial selection (APS) procedure.
which is simply the BF f 10 (
and requires no constant generation circuitry. This means that significant cost savings may be possible if asymmetric polynomials are considered. A given SC-implementable polynomial can be mapped to many different asymmetric multilinear polynomials with the same SC behavior, some of which may be SC-unimplementable. To distinguish between them, we need to apply the inverse Fourier transform (Step 2 of STRAUSS) and check if all the elements of the TT are in the interval [−1, +1]. However, applying the transform to all possible polynomials is very time-consuming, making this approach infeasible. So STRAUSS uses a different approach which is discussed below. An overview of this asymmetric polynomial selection (APS) algorithm is given in Fig. 10 .
Given a target polynomial, we start with a symmetric multilinear polynomial-there is only one such polynomial-and use the inverse Fourier transform to obtain a symmetric TT (STT). Because of its symmetry, the STT includes repeated elements. We then modify the repeated elements to obtain an asymmetric TT of better cost. If we keep the new elements in the [−1, +1] interval, the newly obtained TT remains SC-implementable. And as long as we keep the average of the new elements the same as that of the old elements, the new TT will have the same SC behavior as the STT.
As an example, consider a generic three-input STT will have the same SC behavior as f 11 . Since all the inputs of the circuit have the same value X, the probabilities of getting any of the c 2 elements in f 11 will be the same. So by changing the elements to c 2 , c 2 , and c 2 in f 11 and keeping the average the same as before (by assigning c 2 + c 2 + c 2 = 3c 2 ), the SC behavior of the TT remains unchanged. Similarly, the three c 3 's can also be replaced with new elements. The choice of the new elements directly affects the cost of the new TT. Elements such as +1 and −1 are desirable because they can be generated at no cost, while other elements require constant generation circuitry. For example, consider the following STT:
T which has three −1/3s and three 1/3s. As shown earlier, we can replace these elements with new elements and obtain
which has the same SC behavior. Notice that, the three −1/3s are replaced with two −1s and one +1, and the three 1/3s are replaced with two +1s and one −1. Another choice of TT with the same behavior is
but it clearly has a higher cost because it requires several constant generation circuits, while f 10 requires none. It can be shown that given a set of symmetric elements, we can always find a new set of elements with the same average, all of which, except for at most one, are +1s or −1s.
Assume we have a set of k symmetric elements of value c. If c > 0, then, we can replace the set with one +1 element and k −1 new elements c of value (kc − 1)/(k − 1). The new set has the same average as the old set
Similarly, if c < 0, we can replace the set with one −1 and k − 1 new elements c of value (kc + 1)/(k − 1). By repeating this process on the new elements of the set, we obtain a set that has at least k − 1 elements of value +1 or −1.
Another factor affecting cost is the order of the new elements in the TT. For example, the two +1s and one −1 replacing +1/3 can appear in three possible orders: 
The number of different ways to order the TTs grows exponentially with the number of inputs, so searching among all of them is not possible for very large circuits. For such cases, STRAUSS has a greedy search heuristic that usually finds a good ordering. The heuristic starts from the STT and selects a group of repeated elements (say c 2 in f 11 ) and finds new elements (c 2 , c 2 , and c 2 ) to replace them. Then it tries all the possible orderings of this group, and selects one with the minimum cost. The algorithm proceeds to the next group of repeated elements (c 3 in f 11 ). In doing so, the algorithm only examines 3 + 3 = 6 orderings out of the 3 × 3 = 9 possible orderings. Thus, the search becomes feasible for larger circuits. Like most heuristics, this method does not always find an optimal solution, but our experiments show that a near-optimum polynomial is usually found. We also observed that in most cases, there are multiple optimum and many near-optimum polynomials, which favors the heuristic search.
It is worth noting that employing asymmetric polynomials reduces the precision needed in the constant generation circuit, and hence, can lead to cost savings. A good example is the functionF 10 discussed earlier in this section. A symmetric TT for this function is
which requires constant generation circuitry for 1/3 and −1/3. When applying the SCG procedure (Fig. 7) to f 10 , one has to choose a precision m, and round the numbers 1/3 and −1/3 to the precision closest to m. However, an asymmetric implementation ofF 10 , namely
T needs no constant generation at all. It is as if the constants 1/3 and −1/3 are implemented with unlimited precision and at no cost. We end this section by a complete example involving a multivariate target function.
Example 7: Consider the target functionF 11 
Since the maximum degree of the polynomial is two, we will need two-independent copies of each input. The corresponding symmetric multilinear polynomial ofF 11 iŝ
By applying the inverse Fourier transform onP 11 ( Step 2 of STRAUSS), we obtain the symmetric TT vector Steps 3 and 4 of STRAUSS yield the following BF:
which requires four auxiliary inputs. A gate-level implementation of f 11 is shown in Fig. 11(a) . It is possible to further optimize this circuit by applying the APS procedure toF 11 . In the symmetric TT shown above, there are eight symmetric elements with value 7/8. APS identifies this symmetric group and replaces it with a group of seven elements of value 1 and one element of value 0. After trying different possible orderings, APS returns the following asymmetric TT (which is one of the eight possible optimal TTs): 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1  1 1 . . . 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1   1 1 1 . . . 1 1 1 1 1 1 1 1 1 1 1 1 1] T .
This yields the BF
which requires only one auxiliary input. An implementation of f 11 is shown in Fig. 11(b) . The asymmetric polynomial P 11 that corresponds to f 11 has more terms than the symmetric polynomialP 11 but, surprisingly, it yields a lower-cost implementation. Note that, the APS procedure performs transformations on the TT only and does not deal with different (and in some cases complicated) polynomial terms. The first 16 terms of the asymmetric polynomialP 11 are shown below for illustration purposes onlŷ
VII. RELATED WORK AND EXPERIMENTAL RESULTS
Very little previous research has addressed the systematic design of SC circuits. Qian and Riedel [22] have developed a nonspectral method of implementing a single-variable SC target functionF(X) defined in UP format. They convertF(X) to a Bernstein polynomial of the form
n−i (9) in which the C i 's are constant coefficients. This polynomial is then mapped to a specific style of logic circuit termed reconfigurable SC architecture (ReSC) [26] . It consists of an n-input adder that implements the terms n i
Bernstein terms) and a multiplexer that selects the C i coefficients. Fig. 12(a) shows the ReSC implementation of (9) . There are n independent inputs (x i 's) to the adder representing the variable X, and n + 1 inputs to the multiplexer (c i 's) that represent the coefficients C i in (9) . The probability of a number k at the output of the adder is equal to
i.e., the kth Bernstein term. Thus, the probability of having a 1 at f is equal to the probability getting a 0 at the adder and a 1 at c 0 , plus the probability of getting a 1 at the adder and a 1 at c 1 , plus . . . , plus the probability of getting n at the adder and a 1 at c n . These probabilities can be expressed as
which is the same as (9) . Note that, the x i inputs are independent SNs representing the number X, similar to the x 1 and x 2 inputs of Fig. 9 . This structure requires n SNGs to generate the x i 's and n + 1 SNGs to generate the c i 's, which entails a significant area cost. Similar to STRAUSS, this design approach can be extended to multivariate functions [24] , and to the BP format. We now show that the method of [26] can be reinterpreted in terms of a spectral transform, a "Bernstein transform" B with a basis different to that of the Fourier transform F .
Consider the two-variable BF f (x 1 , x 2 ). We define a new spectral basis for it is as follows: 
This corresponds to the same multilinear expression generated by the Fourier transform. To see this, expand (10) thus
which is the multilinear polynomial produced by the Fourier transform, as the following equation demonstrates:
Despite their similarities, the circuits synthesized by STRAUSS are quite different from those designed via the ReSC architecture of [26] . Most importantly, ReSC does not benefit from asymmetric polynomials or constant input sharing. To illustrate this, we implemented the functionF 9 of Example 6 using ReSC. Fig. 12(b) shows the result (adjusted for the IBP format). Besides its adder and multiplexer, this design contains three SNGs, each consisting of a 4-bit comparator and a 4-bit random number generator (or LFSR), as in Fig. 3 . Note, however, that this circuit can be optimized using standard combinational techniques; the middle SNG, for instance produces a 0, and so can be removed from the circuit.
To attempt a fair comparison, we used the Berkeley SIS synthesis tool [28] to optimize this design and those of Fig. 9 and to map them to a generic library of gates. The library includes all the elementary gates and their relative area cost in terms of unit cells in a 0.35µm technology. Table II compares the three implementations, along with several other representative circuits, with area cost reported in terms of unit cells. The run-time of the circuits depend on the desired accuracy of the user. In general, longer run-time leads to better accuracy. The circuits compared in Table II achieve the same level of accuracy for a given run-time. These results show that STRAUSS synthesizes circuits that are significantly smaller than those designed by the techniques of [3] and [26] . The area reported in Table II does not include the conversion circuits that may be required for the primary inputs and outputs.
VIII. CONCLUSION
SC has re-emerged recently as an important technology for certain applications needing low-cost massive parallelism, or related features like very small circuit size or low power. Although SC has been recognized for many years, its underlying theory is not well-developed. We have shown here that well-defined transforms linking the Boolean and the spectral domains exist, which provide fundamental theoretical insights into SC behavior. We have also successfully applied spectral transforms to the design of circuits in a way that naturally accommodates the most useful SN formats. Furthermore, we have presented a novel and general synthesis technique STRAUSS for combinational circuit synthesis. Comparing this paper to the major existing SC design method [26] , we found that the results generated by STRAUSS can lead to significant cost savings.
APPENDIX
A. Proof of Theorem 1
We provide a proof by induction on n, the number of variables of f. Suppose n = 1, so f is a single-variable BF. Using the Boole-Shannon expansion theorem, we can write f (x) = c 0 x⊕c 1 
which describes the SC behavior of f in the IBP format. Now the TT of f is f = C 0 C 1 , so its Fourier transform is
This corresponds to the polynomial which is the same as (11), so the theorem holds for single-variable functions. Now as the induction hypothesis, assume the theorem holds for all functions of up to n−1 variables. We want to show that it also holds for the n-variable function f (x 1 , . . . , x n ) . Again using Boole-Shannon expansion, we can write f (x 1 , . . . , x n ) = f 0 · x n ⊕ f 1 · x n , where f 0 = f (x 1 , . . . , x n−1 , 0) and f 1 = f (x 1 , . . . , x n−1 , 1) are functions of n − 1 or fewer variables. Fig. 13 illustrates how f is decomposed in this way. Thus, f's SC behavior is, in terms of the behavior of a 2-to-1 multiplexer (Fig. 13) 
where F 0 and F 1 are the Fourier transforms of f 0 and f 1 , respectively. The resulting polynomial is
which is the same as (13) . HenceF = F, and from the principle of induction we conclude that the theorem holds.
