Abstract. Given inputs $x_1, \ldots, x_n$, which are independent identically distributed random variables over a domain $D$, and an associative operation $\circ$, the probabilistic prefix computation problem is to compute the product $x_1 \circ x_2 \circ \cdots \circ x_n$ and its $n - 1$ prefixes. Instances of this problem are finite state transductions on random inputs, the addition or subtraction of two random $n$-bit binary numbers, and the multiplication or division of a random $n$-bit binary number by a constant. The best previously known constant fan-in circuits for these arithmetic operations have logarithmic depth and linear size, and produce no errors. Furthermore, matching lower bounds for depth and size (up to constant factors between the upper and lower bounds) had previously been obtained for the case of constant fan-in circuits with no errors.
INTRODUCTION

Previously-Known Circuits for Prefix Computation and Arithmetic
Given a function which can be computed sequentially with finite memory, we wish to design a circuit for its parallel computation.
This practical problem is reduced, in [1], to the prefix computation problem: given $n$ inputs $x_1, \ldots, x_n$ taken from a domain $D$, compute all the products $x_1 \circ x_2 \circ \cdots \circ x_i$ for $1 \le i \le n$, where $\circ$ is an associative operation. Ladner and Fischer [1] also give circuits for prefix computation with linear size and logarithmic depth (where the depth is the length of the longest path in the circuit). For certain parameters, their circuits are identical to well-known circuits used for addition. Recently, Fich [2] has improved on the size of these circuits by constant factors. No previous work on prefix computation has considered a random distribution of inputs.
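The prefix computation problem stated above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the logarithmic-depth routine uses simple recursive doubling, which is not the Ladner and Fischer construction: it reaches $\lceil \log n \rceil$ rounds but performs $O(n \log n)$ compositions rather than a linear number.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def prefix_sequential(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [x1, x1 o x2, ..., x1 o ... o xn] with n - 1 sequential compositions."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def prefix_doubling(xs: List[T], op: Callable[[T, T], T]) -> Tuple[List[T], int]:
    """Compute the same prefixes in ceil(log2 n) rounds.

    Within one round all compositions are independent of each other, so the
    number of rounds plays the role of circuit depth. Returns (prefixes, rounds).
    """
    ys = list(xs)
    rounds = 0
    step = 1
    while step < len(ys):
        # Position i combines the value `step` places to its left with its own
        # value; the left operand comes first, so op need not be commutative.
        ys = [ys[i] if i < step else op(ys[i - step], ys[i]) for i in range(len(ys))]
        step *= 2
        rounds += 1
    return ys, rounds
```

String concatenation is a convenient associative but non-commutative operation for checking that operand order is preserved.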
It has long been known [3, 4] that the expected length of the longest carry during the addition of uniformly distributed random $n$-bit binary numbers is $O(\log n)$. Some early addition circuits of Gilchrist, Pomerene and Wong [5] and Hendrickson [6] employed carry-completion testing.
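The logarithmic growth of the longest carry can be observed empirically. The sketch below is only an illustration under an assumed definition: a carry chain starts at a position where both bits are 1 and extends through the following positions where exactly one bit is 1. The function names and the sampling setup are hypothetical.

```python
import random


def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Length of the longest carry chain when adding the n-bit numbers a and b."""
    longest = current = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:
            current = 1                              # generate: a new carry starts
        elif ai ^ bi:
            current = current + 1 if current else 0  # propagate a live carry
        else:
            current = 0                              # kill: any carry stops here
        longest = max(longest, current)
    return longest


def mean_longest_carry(n: int, trials: int, seed: int = 0) -> float:
    """Average longest carry chain over uniformly random pairs of n-bit numbers."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials
```

Comparing `mean_longest_carry(n, trials)` for a few values of `n` against $\log n$ gives a quick sanity check of the claim.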
The best known constant fan-in circuit for addition [10] employs a complicated variant of the carry-look-ahead method, has linear size and $\Omega(\log n)$ depth with no improvement for random inputs. In fact, for any of the above arithmetic operations over random input, the best known constant fan-in circuits have $\Omega(\log n)$ depth. Recently, Chandra, Fortune and Lipton [11] gave an addition circuit with near linear size and constant depth, but with $\Omega(n)$ nodes of unbounded fan-in, so it is not practical for VLSI applications.
Our Circuits for Probabilistic Prefix Computation and Arithmetic
The goal of this paper is to develop some fundamental techniques for the design of circuits which take random input.
In Section 2 of this paper, we formulate a probabilistic version of the prefix computation problem with random input. This probabilistic prefix computation problem has important practical applications (see Section 3) to arithmetic operations on uniformly distributed random numbers, such as (i) addition or subtraction of two random n-bit binary numbers; (ii) multiplication or division of a random n-bit binary number by a constant.
In Section 2, we describe circuits for probabilistic prefix computation. Our circuits have constant fan-in, linear size, and generally far less than logarithmic depth. These probabilistic circuits are defined using a parameter, the dependence length, which is related to the number of compositions of random input symbols required to compute an output within given likelihood.
The depth of our circuits is the logarithm of the dependence length. The dependence length is $O(\log n)$ for the above arithmetic operations over random inputs. (NOTE: Throughout this paper, logarithms with no base indicated will be taken base 2.) Hence, our circuits have $O(\log \log n)$ depth and, furthermore, have error probability which can be set to $n^{-\alpha}$ for any constant $\alpha > 0$. In applications with truly random inputs, this error probability might be set lower than the circuit component reliability.
In applications where we are not assured of random inputs, these probabilistic circuits may not be appropriate, since they do allow errors on certain fixed inputs. Instead, they can be used as the fundamental building blocks for variable delay circuits which make no errors. We modify our probabilistic circuits to detect all errors by introducing a single node of unbounded fan-in and a secondary error-free prefix circuit of logarithmic depth which is evaluated only on detection of an error in the primary circuit's computation. The delay of the resulting error-free circuit is the number of parallel steps required for its evaluation.
For the above arithmetic operations (i) and (ii) over random inputs, our error-free circuits have at most $O(\log \log n)$ delay with probability at least $1 - n^{-\alpha}$, although they have $O(\log n)$ delay in the worst case. Our variable delay circuits for arithmetic may be of practical use in VLSI applications, since a single node of unbounded fan-in can easily be implemented in current technology by a single dedicated layer in the VLSI chip.
Lower Bounds for Arithmetic
Winograd [12] showed that $\Omega(\log_f n)$ is a lower bound on the depth to compute addition of $n$-bit binary numbers by a circuit of fan-in $f$. In Section 4, we prove that for the above arithmetic operations on random $n$-bit binary numbers, any circuit of fan-in $f$, with error probability at most $n^{-\alpha}$, must have $\Omega(\log_f \log n)$ expected delay. Thus, our constant fan-in circuits for these random arithmetic operations are asymptotically optimal. There were no previously known circuit depth lower bounds for any random-input problems.
... for independent random variables $x_1, \ldots, x_\ell$ with density function $d$. Intuitively, the $\varepsilon$-dependence length gives the minimal number of variables which must be composed together before the resulting product is prefix invariant with probability at least $1 - \varepsilon$. Thus, if we are attempting to compute $x_1 \circ \cdots \circ x_i$, we need only compute $x_{i-\ell(\varepsilon)+1} \circ \cdots \circ x_i$ and, with probability $1 - \varepsilon$, we need not perform the further $i - \ell(\varepsilon)$ compositions. For the arithmetic operations over random input considered in Section 3, we show the $\varepsilon$-dependence length is $\ell(\varepsilon) = O(\log n)$ for $\varepsilon = n^{-\alpha}$. We will derive an upper bound on the $\varepsilon$-dependence length as a function of
$$p = \sum_{a \in D,\ a \text{ is prefix invariant}} d(a).$$
Since the xi are independent,
$> \varepsilon$ by definition of the minimality of $\ell(\varepsilon)$.
Hence, we have the following proposition.
Circuit Definitions
A circuit $\mathcal{C}$ is a labelled directed acyclic graph. The in-degree (out-degree) of a node is the number of entering (departing, respectively) edges. The input nodes are those of in-degree 0.
The output nodes are a distinguished set of nodes. Let $n$, $n'$ be the number of input and output nodes, respectively. The fan-in of $\mathcal{C}$ is the maximum in-degree of any node. The size of $\mathcal{C}$ is the number of edges. (NOTE: Ladner and Fischer [1] define the size to be the number of non-input nodes, but this is not appropriate for circuits of unbounded fan-in.) The depth of $\mathcal{C}$ is the length of the longest path.
A Boolean circuit has basis $B = \{\lor, \land, \neg\}$.
A Boolean circuit for addition of two $(n/2)$-bit numbers will have $n$ input nodes, each of which is labelled by a distinct bit of the input numbers, and will have $n' = 1 + n/2$ output nodes, each of which is labelled by a distinct bit of the output.
A product circuit has basis $\{\circ\}$, where $(D, \circ)$ is a semigroup. Ladner and Fischer [1] show the following proposition.
PROPOSITION 2. For $0 \le k \le \log n$, there is a product circuit $P_k(n)$ for prefix computation with fan-in 2, size $4(1 + 2^{-k})\,n - o(n)$ and depth $k + \lceil \log n \rceil$, which makes no output errors.
So as to make our paper self-contained, we give the Ladner and Fischer circuit in Figures 1a and 1b. Again, note that we consider the size to be the number of edges. Fich [2] gives a constant factor improvement in the size of such circuits.
Our Product Circuits for Probabilistic Prefix Computation
Let $(D, \circ, d)$ be a probabilistic semigroup as defined in Section 2.1. Let $\ell(\varepsilon)$ be the $\varepsilon$-dependence length for $(D, \circ, d)$. A key lemma of this paper follows.
LEMMA 1. For any $n \ge 1$, $0 \le k \le \log n$ and $0 < \varepsilon < 1/n$, there is a product circuit for probabilistic prefix computation with fan-in 2, size $(6 + 2^{2-k})\,n - 2\ell(\varepsilon) - o(n)$, depth $k + \lceil \log \ell(\varepsilon) \rceil + 1$ and error probability at most $n\varepsilon/\ell(\varepsilon)$.
PROOF. Fix $\ell = \ell(\varepsilon)$ and let $r = \lfloor (n - 1)/\ell \rfloor$. $P_{k,\ell}(n)$ is illustrated in Figure 2. $P_{k,\ell}(n)$ has input nodes $x_1, \ldots, x_n$ and output nodes $y_1, \ldots, y_n$. $P_{k,\ell}(n)$ has $r$ subcircuits $C_0, \ldots, C_{r-1}$, each a copy of $P_k(\ell)$, and an additional subcircuit $C_r$, which is a copy of $P_k(n - r\ell)$. For $j = 0, \ldots, r$ and $i = 1, \ldots, \ell$, where $j\ell + i \le n$, the $i$th input to the subcircuit $C_j$ is $x_{j\ell+i}$. For $i = 1, \ldots, \ell$, the $i$th output of $P_{k,\ell}(n)$ is the $i$th output of $C_0$, which is always $y_i = x_1 \circ \cdots \circ x_i$, since $C_0$ is a copy of the errorless circuit $P_k(\ell)$. For $j = 1, \ldots, r$ and $i = 1, \ldots, \ell$ where $j\ell + i \le n$, the $(j\ell + i)$th output of $P_{k,\ell}(n)$ is the composition of the $\ell$th output node of $C_{j-1}$ and the $i$th output node of $C_j$. This gives
$$y_{j\ell+i} = x_{(j-1)\ell+1} \circ \cdots \circ x_{j\ell} \circ x_{j\ell+1} \circ \cdots \circ x_{j\ell+i},$$
which equals $x_1 \circ \cdots \circ x_{j\ell+i}$ whenever $x_{(j-1)\ell+1} \circ \cdots \circ x_{j\ell}$ is prefix invariant. By the definition of the $\varepsilon$-dependence length, $\mathrm{prob}(x_{(j-1)\ell+1} \circ \cdots \circ x_{j\ell} \text{ is prefix invariant}) \ge 1 - \varepsilon$. Hence, the probability that there is an error is at most $r\varepsilon < n\varepsilon/\ell$. The size and depth bounds can now be computed from known bounds on $P_k$. We have
$$\mathrm{size}(P_{k,\ell}(n)) \le 2(n - \ell) + r \cdot \mathrm{size}(P_k(\ell)) + \mathrm{size}(P_k(n - r\ell)).$$
We allow these error-free circuits to indicate that the outputs have been computed by setting a distinguished Boolean termination switch. If this switch evaluates to one, then the rest of the circuit need not be evaluated. Otherwise, the rest of the circuit must be evaluated before the outputs can be known. Let the delay be the number of parallel steps required until the termination switch is set. Thus, the delay is a random variable depending on the distribution of inputs. (Note that the delay can be considerably less than the depth, which is independent of the inputs.) In any case, the output never gives any errors.
We derive our errorless circuits $Q_{k,\ell}(n)$ from our product circuit $P_{k,\ell}(n)$ defined above. The outputs of $P_{k,\ell}(n)$ are correct if each $x_{(j-1)\ell+1} \circ \cdots \circ x_{j\ell}$ is a prefix invariant for $j = 1, \ldots, r$. For a given $j$, this can be verified by taking the disjunction of the equality tests $x_{(j-1)\ell+1} \circ \cdots \circ x_{j\ell} = a$ for each $a \in D$ which is a prefix invariant.
The termination switch for our circuit is a Boolean conjunction of these $r$ disjunctions for all $j = 1, \ldots, r$. If the termination switch evaluates to 1, then the outputs of $Q_{k,\ell}(n)$ are identical to those of $P_{k,\ell}(n)$; otherwise we use the outputs computed by the errorless prefix circuit $P_k(n)$ and then set the termination switch to true. The correct output for $i = 1, \ldots, 2\ell$ is given by the $i$th output of $P_{k,\ell}(n)$, and for $i = 2\ell + 1, \ldots, n$ is given by using a ternary selection operation as the $i$th output node of $Q_{k,\ell}(n)$.
The arguments of this selection operation are first the output switch, then the $i$th output node of $P_{k,\ell}(n)$, and last the $i$th output node of $P_k(n)$.
Thus, we have the following result.
THEOREM 1. For any $n \ge 1$, $0 \le k \le \log n$ and $0 < \varepsilon < 1/n$, there is an error-free circuit for probabilistic prefix computation which has constant fan-in (except at the termination switch), size $[13 + 2^{3-k} + (3|D| + 1)/\ell(\varepsilon)]\,n - 8\ell(\varepsilon) - o(n)$, depth $k + \lceil \log n \rceil + 2$, but delay less than $k + \lceil \log \ell(\varepsilon) \rceil + 5$ with probability at least $1 - n\varepsilon/\ell(\varepsilon)$.
PROOF. We use our error-free circuit $Q_{k,\ell}(n)$ where $\ell = \ell(\varepsilon)$. The depth of the entire circuit $Q_{k,\ell}(n)$ is the maximum of the depths of $P_k(n)$ and $P_{k,\ell}(n)$, plus 1, which is at most $k + \lceil \log n \rceil + 2$. By the proof of Lemma 1, the probability that the entire circuit need be evaluated is at most $n\varepsilon/\ell$, and otherwise the evaluation time for $Q_{k,\ell}(n)$ is the depth of $P_{k,\ell}(n)$ plus 4. The size of $Q_{k,\ell}(n)$ is at most $(3|D| + 1)\,r + 3(n - 2\ell)$ plus the sum of the sizes of $P_{k,\ell}(n)$ and $P_k(n)$. ∎
This size bound can further be reduced by observing that the $P_k(\ell)$ subcircuits required by $P_{k,\ell}(n)$ can be found in $P_k(n)$.
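The block construction of Lemma 1 and the termination switch of Theorem 1 can be simulated sequentially. The following Python sketch is only an illustration, not the circuit itself: all function names are hypothetical, each subcircuit is replaced by a plain loop, and the example semigroup (carry symbols `K`, `P`, `G` for kill, propagate and generate) is an assumption chosen because `K` and `G` absorb everything to their left.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def exact_prefix(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Errorless prefix computation (stands in for an errorless prefix circuit)."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def probabilistic_prefix(xs: List[T], op: Callable[[T, T], T], ell: int) -> List[T]:
    """Block scheme: output j*ell+i is (last output of block j-1) o (i-th output of block j)."""
    blocks = [exact_prefix(xs[s:s + ell], op) for s in range(0, len(xs), ell)]
    ys = list(blocks[0])
    for j in range(1, len(blocks)):
        left = blocks[j - 1][-1]  # product of the ell inputs of the previous block
        ys.extend(op(left, y) for y in blocks[j])
    return ys


def error_free_prefix(xs: List[T], op: Callable[[T, T], T], ell: int,
                      is_invariant: Callable[[T], bool]) -> Tuple[List[T], bool]:
    """Return (outputs, switch); switch is True when the fast outputs were accepted."""
    r = (len(xs) - 1) // ell
    # Termination switch: every full block product must be prefix invariant.
    switch = all(
        is_invariant(exact_prefix(xs[(j - 1) * ell:j * ell], op)[-1])
        for j in range(1, r + 1)
    )
    if switch:
        return probabilistic_prefix(xs, op, ell), True
    return exact_prefix(xs, op), False  # fall back to the errorless computation


def carry_op(a: str, b: str) -> str:
    """Example semigroup: the right operand wins unless it is the propagate symbol."""
    return a if b == "P" else b


def carry_is_invariant(a: str) -> bool:
    return a in ("K", "G")
```

On the input `GPPPPP` with block length 2, the second block is all `P`, so the fast outputs are wrong from position 5 onward and the switch forces the fallback.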
APPLICATIONS OF PROBABILISTIC PREFIX COMPUTATION CIRCUITS
Finite State Transducers with Random Input
A (Mealy) deterministic finite state transducer is a six-tuple $M = (Q, \Sigma, \Delta, \delta, \lambda, q_0)$ where $Q$ is a finite set of states, $q_0$ is the initial state, and $\Sigma$, $\Delta$ are the finite input and output alphabets, respectively; $\delta : Q \times \Sigma \to Q$ is the transition function and $\lambda : Q \times \Sigma \to \Delta$ is the output function. Ladner and Fischer [1] show that computing the output of a deterministic finite state transducer can be expressed as a parallel prefix computation. We now generalize this to random inputs. We consider each input symbol $a \in \Sigma$ to be a mapping $a : Q \to Q$ such that $a(q) = \delta(q, a)$ for all $q \in Q$. Thus, the domain $D = \Sigma$ is the set of functions $Q \to Q$. Let $a_1 \circ a_2(q) = a_2(a_1(q))$ for each $a_1, a_2 \in D$ and $q \in Q$. Thus, $(D, \circ, d)$ is a probabilistic semigroup as defined in Section 2.1. Let $\ell(\varepsilon)$ be the system's $\varepsilon$-dependence length.
Given input $x_1, \ldots, x_n \in \Sigma$, the prefix computation gives for $i = 1, \ldots, n$ a function $x_1 \circ \cdots \circ x_i$. Applying Theorem 1, we get the following theorem, using the circuit $Q_{k,\ell}(n)$.
THEOREM 2. The output of $M$ can be computed by an error-free circuit with constant fan-in except at one node, size $O(n)$, and delay $O(\log \ell(\varepsilon))$ with probability at least $1 - n\varepsilon/\ell(\varepsilon)$.
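The reduction from a transducer to a semigroup of mappings can be made concrete. In the Python sketch below, a mapping $Q \to Q$ is a tuple indexed by state number; the three-symbol machine in the usage note (`r` resets, `s` sets, `t` toggles) is a hypothetical example, and treating constant mappings as the prefix invariants is an assumption based on the property that such a mapping absorbs anything composed to its left.

```python
from typing import Dict, List, Tuple

Mapping = Tuple[int, ...]  # Mapping[q] is the state reached from state q


def compose(a1: Mapping, a2: Mapping) -> Mapping:
    """(a1 o a2)(q) = a2(a1(q)): apply a1 first, then a2."""
    return tuple(a2[q] for q in a1)


def is_prefix_invariant(a: Mapping) -> bool:
    """A constant mapping forgets the state it started from."""
    return len(set(a)) == 1


def states_via_prefixes(delta: Dict[str, Mapping], word: str, q0: int) -> List[int]:
    """State after each input symbol, read off the prefix products x1 o ... o xi."""
    states: List[int] = []
    prefix = None
    for symbol in word:
        prefix = delta[symbol] if prefix is None else compose(prefix, delta[symbol])
        states.append(prefix[q0])
    return states


def states_sequential(delta: Dict[str, Mapping], word: str, q0: int) -> List[int]:
    """Reference: run the machine one symbol at a time."""
    states: List[int] = []
    q = q0
    for symbol in word:
        q = delta[symbol][q]
        states.append(q)
    return states
```

With `delta = {"r": (0, 0), "s": (1, 1), "t": (1, 0)}`, the symbols `r` and `s` are prefix invariant and `t` is not; both routines give the same state sequence, and the outputs of $M$ follow by applying $\lambda$ to each state and input symbol.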
Circuits for Binary Addition and Subtraction of Random Numbers
We design circuits for addition and subtraction of two uniformly distributed random $n$-bit binary numbers $x^{(1)} = x_n^{(1)} x_{n-1}^{(1)} \cdots x_1^{(1)}$ and $x^{(2)} = x_n^{(2)} x_{n-1}^{(2)} \cdots x_1^{(2)}$. The input is $x_1, \ldots, x_n$, where the $i$th input $x_i$ is the concatenation of $x_i^{(1)}$ and $x_i^{(2)}$ and thus gives the $i$th least significant bits of $x^{(1)}$ and $x^{(2)}$. The input alphabet $\Sigma = \{00, 01, 10, 11\}$ then consists of pairs of bits, where each possible pair has equal probability $1/4$. The state set $Q = \{q_0, q_1\}$ consists of an initial state $q_0$ with carry 0, and a state $q_1$ with carry 1. The output functions are given explicitly by Figures 3a and 3b. In the case of addition (see Figure 3a), 00 is considered the mapping $q_0 \to q_0$, $q_1 \to q_0$ and 11 is the mapping $q_0 \to q_1$, $q_1 \to q_1$. Note that both 00 and 11 are prefix invariants, but the other transitions 01 and 10 are not prefix invariants. Since 00 and 11 have total probability $1/2$, and $\{01, 10\}^*$ is exactly the set of words over $\Sigma^*$ which are not prefix invariant, we have
$$\mathrm{prob}(x_1 \circ \cdots \circ x_\ell \text{ is prefix invariant}) = \mathrm{prob}(x_i = 00 \text{ or } 11 \text{ for some } 1 \le i \le \ell) = 1 - 1/2^\ell.$$
This implies from the definition of $\varepsilon$-dependence length that the $\varepsilon$-dependence length for addition is $\ell(\varepsilon) \le \lceil -\log(\varepsilon) \rceil$ for $0 < \varepsilon < 1$.
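The probability $1 - 1/2^\ell$ and the resulting bound on $\ell(\varepsilon)$ for addition can be checked by brute force for small $\ell$. The Python sketch below is a check under stated assumptions: states are numbered 0 and 1, the symbols 01 and 10 act as the identity mapping on the carry, and the function names are hypothetical.

```python
import itertools
import math
from fractions import Fraction

# Carry transitions for addition: ADD[symbol][q] is the next carry state.
ADD = {"00": (0, 0), "01": (0, 1), "10": (0, 1), "11": (1, 1)}


def compose(a1, a2):
    """(a1 o a2)(q) = a2(a1(q))."""
    return tuple(a2[q] for q in a1)


def prob_prefix_invariant(ell: int) -> Fraction:
    """Exact probability that x1 o ... o x_ell is a constant mapping."""
    hits = 0
    for word in itertools.product(ADD, repeat=ell):
        product = ADD[word[0]]
        for symbol in word[1:]:
            product = compose(product, ADD[symbol])
        hits += len(set(product)) == 1
    return Fraction(hits, 4 ** ell)


def dependence_length_addition(eps: Fraction) -> int:
    """Smallest ell with 1 - 2^-ell >= 1 - eps, using the closed form from the text."""
    ell = 1
    while Fraction(1, 2 ** ell) > eps:
        ell += 1
    return ell
```

For example, `dependence_length_addition(Fraction(1, 1000))` agrees with `math.ceil(-math.log2(1 / 1000))`.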
In the case of subtraction (see Figure 3b), both 00 and 11 are the mapping $q_0 \to q_0$, $q_1 \to q_1$, 01 is the prefix invariant mapping $q_0 \to q_1$, $q_1 \to q_1$, and 10 is the prefix invariant mapping $q_0 \to q_0$, $q_1 \to q_0$. Each input $x_i$ has probability $1/2$ of being prefix invariant, so $\mathrm{prob}(x_1 \circ \cdots \circ x_\ell \text{ is prefix invariant}) = \mathrm{prob}(x_i = 01 \text{ or } 10 \text{ for some } 1 \le i \le \ell) = 1 - 1/2^\ell$, and thus, the $\varepsilon$-dependence length for subtraction is $\ell(\varepsilon) \le \lceil -\log(\varepsilon) \rceil + 1$ for $0 < \varepsilon < 1$.
Figure 3a. A finite state transducer for addition of random binary numbers. Each transition has equal probability 1/4 on random input.
Figure 3b. A finite state transducer for subtraction of random binary numbers. Each transition has equal probability 1/4 on random input.
Applying Lemma 2 and Theorem 2, we get the following corollary for any $\alpha > 1$ and integer $n \ge 0$.
COROLLARY 1. There are Boolean circuits for addition and subtraction of random $n$-bit binary numbers with (a) constant fan-in, linear size, depth $O(\log((\alpha + 1) \log n))$ and error probability at most $n^{-\alpha}$; (b) also, errorless Boolean circuits with constant fan-in except for a single node, linear size, depth $O(\log n)$, but delay at most $O(\log((\alpha + 1) \log n))$ with probability at least $1 - n^{-\alpha}$.
Circuits for Constant Multiplication and Division of Random Numbers
We now consider the operation of multiplying a uniformly distributed random binary number $x$ by an integer constant $m \ge 2$. It is convenient to assume $x$ consists of $bn$ random bits, where $b = \lceil \log m \rceil$. We will partition $x$ into $n$ consecutive blocks. So $x = x_n, \ldots, x_1$, where each $x_i$ is assumed to be independently randomly chosen from $\Sigma = \{0, 1, \ldots, 2^b - 1\}$. The state set is $Q = \{q_0, \ldots, q_{m-1}\}$, where $q_0$ is the initial state. For each $c = 0, \ldots, m - 1$, state $q_c$ is associated with a carry of $c$. Given input symbol $x_i \in \Sigma$ in state $q_c$, the output $h$ is the residue mod $2^b$ of $x_i m + c$, and $x_i$ defines a transition to state $q_{c'}$, with carry $c' = (x_i m + c - h)\,2^{-b}$. Note that any input symbol $x_i = 0$ is prefix invariant, since then the transition from any state is to state $q_0$. Hence, the probability of a prefix invariant input symbol is at least $1/2^b$. This implies $\mathrm{prob}(x_1 \circ \cdots \circ x_\ell \text{ is prefix invariant}) \ge 1 - (1 - 2^{-b})^\ell$. So for the case of multiplication by $m$, the $\varepsilon$-dependence length is $\ell(\varepsilon) < \lceil \log(\varepsilon)/\log(1 - 2^{-b}) \rceil + 1$ for $0 < \varepsilon < 1$. A similar argument gives the same upper bounds on the $\varepsilon$-dependence length for division of a random number by $m$.
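The carry transducer for multiplication by a constant can be run directly. The Python sketch below follows the transition rule described above, with output $h$ equal to the residue of $x_i m + c$ mod $2^b$ and next carry $c' = (x_i m + c - h)/2^b$; the helper names and the block list ordering (least significant block first) are assumptions made for illustration.

```python
from typing import List, Tuple


def block_width(m: int) -> int:
    """b = ceil(log2 m) for an integer m >= 2."""
    return (m - 1).bit_length()


def to_blocks(x: int, n: int, b: int) -> List[int]:
    """Split x into n blocks of b bits, least significant block first."""
    return [(x >> (b * i)) & ((1 << b) - 1) for i in range(n)]


def multiply_by_constant(blocks: List[int], m: int, b: int) -> Tuple[List[int], int]:
    """Run the transducer from carry 0; return the output blocks and the final carry."""
    carry = 0
    out: List[int] = []
    for x in blocks:
        t = x * m + carry
        h = t % (1 << b)       # output block
        carry = (t - h) >> b   # carry c' of the next state
        out.append(h)
    return out, carry


def from_blocks(blocks: List[int], carry: int, b: int) -> int:
    """Reassemble the product from the output blocks and the final carry."""
    value = sum(h << (b * i) for i, h in enumerate(blocks))
    return value + (carry << (b * len(blocks)))
```

Because the carry never exceeds $m - 1 < 2^b$, a zero block always outputs the incoming carry and leaves the machine in state $q_0$, which is the prefix invariance used in the text.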
(NOTE: In the special case where $m$ is a power of two, the $\varepsilon$-dependence length for both these operations is $\ell(\varepsilon) = 1$ for all $\varepsilon$.) Lemma 2 and Theorem 2 imply the following corollary for any $\alpha > 0$, integers $m \ge 2$, $n \ge 0$, and $\beta = -(\alpha + 1)/\log(1 - 2^{-b})$.
COROLLARY 2. There are Boolean circuits for multiplication and division of a random $\lceil \log m \rceil n$-bit binary number by integer $m$, with (a) constant fan-in, linear size, depth $O(\log(\beta \log n))$ and error probability at most $n^{-\alpha}$, and (b) also, errorless circuits with constant fan-in except at a single node, size $O(n)$, depth $O(\log n)$, but delay at most $O(\log(\beta \log n))$ with probability at least $1 - n^{-\alpha}$.
NOTE: Of course, a circuit for multiplication by a power of two is trivial, since the required bit shifts can be done by simply defining outputs at the appropriate input nodes.
LOWER BOUNDS ON CIRCUIT DEPTH FOR COMPUTATIONS WITH RANDOM INPUTS
In this section, we devise a general technique for proving lower bounds on the depth of any circuit computing a function $F(x_1, \ldots, x_n)$ with given error, where the inputs $x_1, \ldots, x_n$ are independent identically distributed random variables. Then, we apply this lower bound technique to some arithmetic problems of interest. Let a circuit be $x_i$-oblivious if its output does not depend on input $x_i$.
LEMMA 3. Any $x_i$-oblivious circuit $\mathcal{C}$ computes $F(x_1, \ldots, x_n)$ with error probability at least $\sigma_i/|D|$, where $\sigma_i = \mathrm{prob}(\exists\, a \in D \text{ such that } F(x_1, \ldots, x_n) \ne F(x_1, \ldots, x_{i-1}, a, x_{i+1}, \ldots, x_n))$.
... $(a_1, \ldots, a_{i-1}, a', a_{i+1}, \ldots, a_n)$ such that $\bar{a}, \bar{a}' \in H$ and $F(\bar{a}) \ne F(\bar{a}')$.
Since $\bar{a}, \bar{a}' \in D^n - E$, $\mathcal{C}$ outputs $F(\bar{a})$ and $F(\bar{a}')$ on inputs $\bar{a}$, $\bar{a}'$, respectively. But $\mathcal{C}$ is $x_i$-oblivious, so $\mathcal{C}$ must compute the same output for both $\bar{a}$ and $\bar{a}'$, a contradiction. ∎
For each $\varepsilon$, $0 < \varepsilon < 1$, we define the input dependence of the system to be $I(\varepsilon, n) = |\{\, i \mid \sigma_i/|D| \ge \varepsilon,\ 1 \le i \le n \,\}|$.
THEOREM 3. Any circuit for $F(x_1, \ldots, x_n)$ with $n$ inputs, error less than $\varepsilon$ and fan-in $f$ has depth at least $\log_f(I(\varepsilon, n))$.
PROOF. By contradiction. Suppose there exists a circuit $\mathcal{C}$ of fan-in $f$ which computes $F$ with error less than $\varepsilon$. If the depth of $\mathcal{C}$ is less than $\log_f(I(\varepsilon, n))$, then $\mathcal{C}$ must be oblivious to more than $n - I(\varepsilon, n)$ inputs. But then, $\mathcal{C}$ must be $x_i$-oblivious for some $i$ where $\sigma_i/|D| \ge \varepsilon$. So by Lemma 3, $\mathcal{C}$ has error at least $\varepsilon$, a contradiction. ∎
For our applications to arithmetic on uniformly distributed random binary numbers, we show the corresponding $\sigma_i$ are geometrically decreasing functions of $i$. Hence, $I(n^{-\alpha}, n) \ge \Omega(\log n)$, giving $\log_f \log n - o(n)$ depth lower bounds with fan-in $f$.
For example, in the case of addition or subtraction of two random $n$-bit binary numbers, $\sigma_i \ge 1/2^{n+2-i}$ for $i = 1, \ldots, n$. We prove this for only the case of addition (the arguments for the case of subtraction are similar). Let the random input be $x_1, \ldots, x_n$, as described in Section 3.2.
Let $A_i$ be the predicate holding just when $x_j = 01$ or $10$ for each $j = i + 1, \ldots, n$. Since $\mathrm{prob}(x_j = 01 \text{ or } 10) = 1/2$, $\mathrm{prob}(A_i) = 1/2^{n-i}$. $A_i$ implies that $x_{i+1} \circ \cdots \circ x_n$ is not prefix invariant. Hence, for each $a \in \{00, 01, 10, 11\}$, $\mathrm{prob}(x_1 \circ \cdots \circ x_{i-1} \circ a \circ x_{i+1} \circ \cdots \circ x_n \ne x_1 \circ \cdots \circ x_n \mid A_i) \ge 1/4$. Thus, $\sigma_i \ge \mathrm{prob}(A_i) \cdot 1/4 = 1/2^{n+2-i}$.
Since $|D| = 4$, this implies for addition and subtraction $\sigma_i/|D| \ge n^{-\alpha}$ for all the indices $i$ with $n + 4 - \alpha \log n \le i \le n$. Hence, $I(n^{-\alpha}, n) \ge (\alpha \log n) - 4$, so we have, as a consequence of Theorem 3, the following corollary.
COROLLARY 3. Any Boolean circuit of fan-in $f$ and error at most $n^{-\alpha}$ for addition or subtraction of random $n$-bit binary numbers must have depth at least $\log_f(\alpha \log n - 4)$.
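The counting step behind Corollary 3 can be reproduced numerically. The Python sketch below counts the indices guaranteed by the bound $\sigma_i \ge 1/2^{n+2-i}$ with $|D| = 4$, and evaluates the depth bound of the corollary; the function names are hypothetical, and the depth bound is only defined when $\alpha \log n > 4$.

```python
import math


def input_dependence_lower_bound(n: int, alpha: float) -> int:
    """Number of indices i in 1..n with 2^-(n+4-i) >= n^-alpha.

    This uses sigma_i / |D| >= 2^-(n+2-i) / 4, so it is a lower bound on I(n^-alpha, n).
    """
    threshold = alpha * math.log2(n)
    return sum(1 for i in range(1, n + 1) if n + 4 - i <= threshold)


def depth_lower_bound(n: int, alpha: float, fan_in: int) -> float:
    """log_f(alpha * log2(n) - 4), the bound stated in Corollary 3."""
    return math.log(alpha * math.log2(n) - 4, fan_in)
```

For $n = 2^{16}$ and $\alpha = 1$, the count is 13, which is at least $\alpha \log n - 4 = 12$, and the fan-in 2 depth bound is $\log 12 \approx 3.58$.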
Thus, we conclude that our random addition and subtraction circuits of fan-in 2 and error $n^{-\alpha}$, given by Corollary 1, have asymptotically optimal depth.
Next, we show for $m$ not divisible by two, $\sigma_i \ge 1/(2m)^{n+1-i}$ for $i = 1, \ldots, n$ for multiplication or division of a random $\lceil \log m \rceil n$-bit binary number by $m$. Let the input be partitioned into blocks $x_1, \ldots, x_n$ of length $b = \lceil \log m \rceil$ as described in Section 3.3, so each $x_i \in \{0, 1, \ldots, 2^b - 1\}$.
Let $B_i$ be the predicate holding just when $2^b - x_j = s$ is the multiplicative inverse of $m$ modulo $2^b$ for each $j = i + 1, \ldots, n$. Since $\mathrm{prob}(x_j = 2^b - s) = 2^{-b}$, $\mathrm{prob}(B_i) = 2^{-b(n-i)}$.
$B_i$ implies $x_j m \equiv -1 \bmod 2^b$ for each $j = i + 1, \ldots, n$, so $x_{i+1} \circ \cdots \circ x_n$ is not prefix invariant. Hence, for each $a \in \{0, 1, \ldots, 2^b - 1\}$, $\mathrm{prob}(x_1 \circ \cdots \circ x_{i-1} \circ a \circ x_{i+1} \circ \cdots \circ x_n \ne x_1 \circ \cdots \circ x_n \mid B_i) \ge 2^{-b}$.
Thus, $\sigma_i \ge \mathrm{prob}(B_i) \cdot 2^{-b} = 2^{-b(n-i)}\, 2^{-b} > (2m)^{-(n-i+1)}$, since $2^{-b} > 1/(2m)$.
Since in this case $|D| = 2^b$, this implies for multiplication and division by $m$, $\sigma_i/|D| > n^{-\alpha}$ for all indices $i$ where $n + 2 - (\alpha \log n)/\log(2m) \le i \le n$. Hence, $I(n^{-\alpha}, n) > (\alpha \log n)/\log(2m) - 2$, so by Theorem 3, we have the following corollary.
COROLLARY 4. Any Boolean circuit of fan-in $f$ and error at most $n^{-\alpha}$ for multiplication or division of random $\lceil \log m \rceil n$-bit binary numbers by $m$ must have depth at least $\log_f((\alpha \log n)/\log(2m) - 2)$, if $m$ is not divisible by two.
When $m$ is constant and not divisible by 2, this implies the asymptotic optimality of our constant multiplication and division circuits given by Corollary 2.
CONCLUSION
Ladner and Fischer [1] observed that many arithmetic operations of practical interest can be sequentially computed with finite memory, so their prefix circuits can compute these arithmetic operations in parallel. It is our observation that for random inputs, these arithmetic operations have prefix invariant properties which allow us to design errorless circuits which have much less expected delay time.
Similarly, a practical design for an efficient circuit for any other problem can, in principle, take into account the distribution of inputs expected in applications. Generally, timing parameters are derived by empirical experimentation on a circuit.
These parameters may be set so that on the vast majority of inputs, the circuit performs correctly, but in a few cases of low likelihood, there may be an error. Such errors might be detected and further corrective computation done, incurring some additional delay with low probability.
Perhaps the most interesting aspect of the work in this paper is our analytic derivation of timing parameters such as expected delay time. We hope these probabilistic analysis techniques introduced in our paper may be illustrative of how a theoretical result might contribute to the practical aspects of circuit design and parallel computation.
In particular, our analysis avoids the repeated experimental execution of circuits on random inputs which might otherwise be used in practice to determine timing parameters of the circuits.
