Suppose that a random n-bit number V is multiplied by an odd constant M ≥ 3, by adding shifted versions of the number V corresponding to the 1s in the binary representation of the constant M. Suppose further that the additions are performed by carry-save adders until the number of summands is reduced to two, at which time the final addition is performed by a carry-propagate adder. We show that in this situation the distribution of the length of the longest carry-propagation chain in the final addition is the same (up to terms tending to 0 as n → ∞) as when two independent n-bit numbers are added, and in particular the mean and variance are the same (again up to terms tending to 0). This result applies to all possible orders of performing the carry-save additions.
INTRODUCTION
This research was supported by NSF grant CCF-0646682. Authors' addresses: A. Izsak, Department of Computer Science, University of British Columbia, 2366 Main Hall, Vancouver, BC V6T 1Z4, Canada; N. Pippenger, Department of Mathematics, Harvey Mudd College, 1250 Dartmouth Avenue, Claremont, CA 91711. Contact's email: njp@math.hmc.edu.
Let $X$ and $Y$ be random $n$-bit integers that are independent and uniformly distributed in $[0, 2^n - 1]$. If they are added in the usual way, starting at their rightmost end and proceeding to the left, there may be various "carry-propagation chains". A carry-propagation chain is a sequence of $k \ge 1$ consecutive positions in the binary representations of $X$ and $Y$ in which the rightmost position generates a carry (because both $X$ and $Y$ contain 1s in this position), and the remaining $k - 1$ positions to the left propagate this carry (because exactly one of $X$ and $Y$ contains a 1 in each of these positions). (Here and in what follows, we use the terms left and right as usual for the conventional binary representation: the rightmost is the least significant position, and the leftmost is the most significant position.) Let the random variable $C_n$ denote the length of the longest carry-propagation chain. (Note that the longest carry-propagation chain is not necessarily the longest sequence of consecutive carries: the addition of the binary numbers 0101 and 1111 gives rise to two carry-propagation chains, each of length two, not to one of length four.) The length of the longest carry-propagation chain is of interest because it governs the execution of certain parallel implementations of addition (see Claus [1973] and Knuth [1978]).
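The definition of a carry-propagation chain can be made concrete with a short Python sketch. The function name `longest_carry_chain` and its signature are illustrative choices, not part of the original presentation; the sketch simply scans the positions from right to left and tracks the chain that ends at the current position.

```python
def longest_carry_chain(x: int, y: int, n: int) -> int:
    """Length of the longest carry-propagation chain when adding n-bit x and y."""
    longest = 0
    current = 0  # length of the chain ending at the previous position (0 = no chain)
    for pos in range(n):  # position 0 is the rightmost (least significant) bit
        xb = (x >> pos) & 1
        yb = (y >> pos) & 1
        if xb and yb:
            current = 1        # generate: a new chain starts here
        elif xb != yb and current:
            current += 1       # propagate: the incoming carry moves one position left
        else:
            current = 0        # no carry reaches or leaves this position
        longest = max(longest, current)
    return longest


print(longest_carry_chain(0b0101, 0b1111, 4))  # 2: two chains of length two
```

The printed value reproduces the example in the text: the four consecutive carries split into two chains because position 2 generates a fresh carry rather than propagating the old one.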
The distribution of $C_n$ has been investigated since the early days of electronic computing. The investigation was begun in the famous "preliminary discussion" of Burks, Goldstine and von Neumann in 1946 (reprinted in von Neumann's collected works; see Burks et al. [1963]), where it was shown that $\mathrm{Ex}(C_n) \le \log_2 n + 1$. The next step was taken by Claus [1973], who showed that $\mathrm{Ex}(C_n) \ge \log_2 n - 2$. Knuth [1978] showed that
(where the constant in the O-term is independent of k as well as n), and that this implies
where $\gamma = 0.5772\ldots$ is Euler's constant, $e = 2.718\ldots$ is the base of natural logarithms, and $\varepsilon(\nu)$ is a periodic function of $\nu$ with period 1 and average 0 (i.e., $\int_0^1 \varepsilon(\nu)\,d\nu = 0$) satisfying $|\varepsilon(\nu)| \le 1.573\cdots\times 10^{-6}$ for all $\nu \in [0, 1)$. Pippenger [2002] gave an elementary derivation of (1.1), and showed that it also implies
$$\mathrm{Var}(C_n) = \frac{\pi^2}{6}(\log_2 e)^2 + \frac{1}{12} + \cdots \tag{1.3}$$
where $\pi = 3.14159\cdots$ is the circular ratio, $\omega = 1.2374\cdots\times 10^{-12}$ is a constant, and $\delta(\nu)$ is a periodic function of $\nu$ with period 1 and average 0 satisfying $|\delta(\nu)| \le 5.3573\cdots\times 10^{-6}$ for all $\nu \in [0, 1)$.
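The bounds of Burks et al. and Claus, and the leading terms $\log_2 n + \gamma\log_2 e - 3/2$ of the expected value, can be checked numerically for small $n$. The following Python sketch is not taken from the analyses cited above: it computes $\Pr[C_n \ge k]$ exactly by a simple dynamic program over the bit positions (each position independently generates with probability 1/4, propagates with probability 1/2, and absorbs with probability 1/4). The function names and the cutoff `k_max` are assumptions made for illustration.

```python
import math


def prob_chain_at_least(n: int, k: int) -> float:
    """Exact Pr[C_n >= k] for two independent, uniformly distributed n-bit summands."""
    # state[t] = probability that no chain of length k has occurred so far and the
    # chain ending at the current position has length t (t = 0: no carry in flight).
    state = [1.0] + [0.0] * (k - 1)
    hit = 0.0  # probability that a chain of length k has already occurred
    for _ in range(n):
        new = [0.0] * k
        alive = sum(state)
        # Both bits 0 (prob. 1/4) absorbs any carry; a propagating position
        # (prob. 1/2) with no incoming carry also leaves no carry in flight.
        new[0] = 0.25 * alive + 0.5 * state[0]
        # reached[i] = probability that the chain ending here now has length i + 1:
        # a generate starts a chain of length 1, a propagate extends length t to t + 1.
        reached = [0.25 * alive] + [0.5 * state[t] for t in range(1, k)]
        for i, p in enumerate(reached):
            if i + 1 == k:
                hit += p
            else:
                new[i + 1] = p
        state = new
    return hit


def expected_longest_chain(n: int, k_max: int = 64) -> float:
    """Ex(C_n) as the sum of Pr[C_n >= k]; terms beyond k_max are negligible here."""
    return sum(prob_chain_at_least(n, k) for k in range(1, min(n, k_max) + 1))


GAMMA = 0.5772156649015329  # Euler's constant
for n in (16, 64, 256):
    exact = expected_longest_chain(n)
    leading = math.log2(n) + GAMMA * math.log2(math.e) - 1.5
    print(n, round(exact, 4), round(leading, 4))
```

The dropped terms for $k > 64$ are bounded by $n/2^{65}$ in total, which is far below floating-point resolution for the values of $n$ used here.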
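In Section 2, we shall present a new analysis of the addition problem that yields results similar to these, though with weaker error bounds. Specifically, we shall show that
$$\Pr[C_n \ge k] = 1 - e^{-n/2^{k+1}} + O\!\left(\frac{\log n}{n^{1/3}}\right). \tag{1.4}$$
This implies
$$\mathrm{Ex}(C_n) = \log_2 n + \gamma \log_2 e - \frac{3}{2} - \varepsilon(\log_2 n) + O\!\left(\frac{(\log n)^2}{n^{1/3}}\right) \tag{1.5}$$
and
$$\mathrm{Var}(C_n) = \frac{\pi^2}{6}(\log_2 e)^2 + \frac{1}{12} + \cdots \tag{1.6}$$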
in the same way that (1.1) implies (1.2) and (1.3). The weaker error bounds are a result of our choice to present our new argument in its simplest form; these bounds could be improved by elaboration of the argument (but, as Knuth [1978] points out, so could those of (1.1)-(1.3)). Our motivation, however, for presenting this new analysis is that it can be extended to obtain the results claimed in the abstract, which we shall now describe in more detail.
We shall investigate the length of the longest carry-propagation chain that occurs when a random $n$-bit integer $V$, uniformly distributed in $[0, 2^n - 1]$, is multiplied by a fixed constant $M$. The simplest case of our problem is $M = 3$. In this case, the product $Z = M \cdot V$ is obtained by adding $V$ to the number $2V$ that is obtained by shifting $V$ one position to the left. The two random numbers being added in this case are not independent, but Izsak [2007] has shown that the length of the longest carry-propagation chain nevertheless satisfies the estimate (1.1). More generally, we may consider the case $M = 2^d + 1$ (where $d \ge 1$), for which the product $Z = M \cdot V$ is obtained by adding $V$ to the number $2^d V$ that is obtained by shifting $V$ to the left $d$ positions. Izsak [2007] has shown that again the estimate (1.1) applies (where now the constant in the $O$-term may depend on $d$, but not on $k$ or $n$).
We shall consider a further generalization in which $M$ has two or more 1s in its binary representation. Suppose that the binary representation of $M$ is $M = \sum_{0 \le j \le d} m_j 2^j$ (with $m_j \in \{0, 1\}$) and that $c$ (where $2 \le c \le d + 1$) of the digits $m_0, m_1, \ldots, m_d$ are 1s (so that the remaining $d + 1 - c$ are 0s). We may assume without loss of generality that $m_d = 1$ (since otherwise we could reduce the value of $d$) and that $m_0 = 1$ (since the carries that occur when multiplying by $2M$ will just be shifted versions of those that occur when multiplying by $M$). Let $0 = s_1 < s_2 < \cdots < s_c = d$ be the positions of the 1-bits, so $M = \sum_{1 \le i \le c} 2^{s_i}$. For $1 \le i \le c$, let $W_i = 2^{s_i} V$ be obtained by shifting $V$ to the left $s_i$ positions. The product $Z = M \cdot V$ will be obtained by adding these $c$ numbers:
$$Z = W_1 + W_2 + \cdots + W_c.$$
When $c = 3$, we can form the sum $Z = W_1 + W_2 + W_3$ in two stages as follows. The first stage will perform a "carry-save addition", which takes the three numbers $W_1$, $W_2$ and $W_3$ as inputs and produces as outputs two numbers $X$ and $Y$ having the same sum:
$$X + Y = W_1 + W_2 + W_3.$$
There are, of course, many pairs of numbers $X$ and $Y$ that satisfy this condition. The details of carry-save addition, including the specification of the numbers $X$ and $Y$ that will be produced, will be given later. For now, we merely observe that in carry-save addition, all carries propagate one position to the left, and in a parallel implementation, all carries propagate simultaneously, so that a carry-save addition contributes a fixed delay to the parallel execution time. Thus, our analysis will not deal with carries in this stage. The second stage will perform a conventional "carry-propagate addition" to obtain the final product $Z$ as the sum of $X$ and $Y$. This addition is analogous to those considered in previous paragraphs, and it is the carry-propagation chains in this stage that will be the focus of our analysis. We will obtain the estimate (1.4).
When $c \ge 4$, we can use $c - 2$ carry-save additions to reduce the $c$ numbers $W_1, W_2, \ldots, W_c$ to two numbers $X$ and $Y$ in the first stage, then add these two numbers with a carry-propagate addition in the second stage to obtain $Z$ as before. In this case, however, there is an additional complication: there is more than one way to use $c - 2$ carry-save additions to reduce $c$ numbers to two numbers. At one extreme, one can sum $W_1$, $W_2$ and $W_3$ with the first carry-save addition, then proceed similarly with the resulting $(c - 3) + 2 = c - 1$ numbers, and so forth. The numbers $X$ and $Y$ are thus obtained after $c - 2$ carry-save additions, each (except for the first) of which depends for at least one of its inputs on its predecessor, so that these carry-save additions contribute $c - 2$ fixed delays to the parallel execution time. At the other extreme, one can use $\lfloor c/3 \rfloor$ carry-save additions in parallel to combine $3\lfloor c/3 \rfloor$ numbers, producing $2\lfloor c/3 \rfloor$ numbers having the same sum, then proceed similarly with the resulting $(c - 3\lfloor c/3 \rfloor) + 2\lfloor c/3 \rfloor = c - \lfloor c/3 \rfloor$ numbers, and so forth. As Wallace [1964] has observed, these $c - 2$ carry-save additions contribute only $\log_{3/2} c + O(1)$ fixed delays to the parallel execution time. Our result, which is that the estimate (1.4) again holds for the carry-propagate addition in the second stage, applies equally to all of the ways of performing the carry-save addition in the first stage.
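The two extreme schedules can be compared by counting their delays. The following Python sketch uses illustrative function names that do not appear in the paper and counts only the number of sequential carry-save levels, not the adders themselves.

```python
import math


def sequential_levels(c: int) -> int:
    """Delays when each carry-save addition waits for its predecessor."""
    return max(c - 2, 0)


def wallace_levels(c: int) -> int:
    """Delays when floor(c/3) carry-save additions run in parallel at each level."""
    levels = 0
    while c > 2:
        c -= c // 3  # each adder turns three summands into two
        levels += 1
    return levels


for c in (3, 4, 9, 27, 64):
    print(c, sequential_levels(c), wallace_levels(c), round(math.log(c, 1.5), 2))
```

The last column prints $\log_{3/2} c$ for comparison with the parallel level count; the two differ by a bounded amount, in line with the $O(1)$ term in the text.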
All of our results reinforce one point: the randomness in one uniformly distributed number V is sufficient to produce the distribution (1.4); the full power of the independence of X and Y in forming their sum is not needed. In Section 3, we shall give a specification at the bit level of the algorithms that were specified above at the level of operations on numbers, and describe the features, common to all these algorithms, that will be used in the subsequent analysis. In Section 4, we shall give the proof of (1.4), based on these common features.
A NEW ANALYSIS OF ADDITION
In this section, we shall prove (1.4) for the addition of two independent random numbers. The analyses of Knuth [1978] and Pippenger [2002] of (1.1) proceed by deriving a recurrence for the probability that the addition of two random $n$-bit numbers yields a carry-propagation chain of length at least $k$, then solving this recurrence for the asymptotic behavior of this probability. Our new analysis is based on the observation that the main term $1 - e^{-n/2^{k+1}}$ in (1.1) and (1.4) is the probability that a Poisson-distributed random variable with mean $n/2^{k+1}$ has value at least one. There are approximately $n$ (actually $n - k + 1$) places at which a carry-propagation chain of length $k$ can occur, and the probability that such a chain occurs at a given place is $1/2^{k+1}$. If all these possible occurrences were independent, we could derive the desired result from the Poisson approximation to the binomial distribution. They are not independent, but the effects of their dependence can be analyzed far enough to yield the estimate (1.4). (This analysis is an application of the "Poisson paradigm" described by Alon and Spencer [2000].)
A set of $k$ consecutive bit positions will be called a $k$-block. There are $n - k + 1$ distinct $k$-blocks. A $k$-block will be said to be active if its rightmost position generates a carry and each of the remaining $k - 1$ positions propagates a carry. The event "$C_n \ge k$" is clearly equivalent to the event "there is at least one active $k$-block", which we shall denote $E_{n,k}$. To estimate $\Pr[E_{n,k}]$, we shall use the following principles.
-(A-1) The probability that a given $k$-block is active is $1/2^{k+1}$.
-(A-2) If a set of $k$-blocks includes two that overlap, then they cannot all be active. If no two overlap, then the events of their being active are independent.
We shall show that (1.4) follows from these two principles. Let
$$k_1 = 2 \log_2 n. \tag{2.1}$$
For $k > k_1$, we have $\Pr[E_{n,k}] \le (n - k + 1)/2^{k+1} = O(1/n)$ by (A-1) and Markov's inequality. We also have $1 - e^{-n/2^{k+1}} = O(1/n)$ by the power series $e^x = 1 + O(x)$, valid for $x \to 0$. Thus, we have (1.4) for $k > k_1$. For $k \le k_1$, we shall estimate $\Pr[E_{n,k}]$ by inclusion-exclusion, using (A-1) and (A-2). We have
$$\Pr[E_{n,k}] = \sum_{j \ge 1} (-1)^{j+1} \binom{n - j(k-1)}{j} \frac{1}{2^{(k+1)j}}, \tag{2.2}$$
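since there are just $\binom{n - j(k-1)}{j}$ ways to choose $j$ non-overlapping $k$-blocks in the $n$ bit positions. Let
$$k_0 = \left\lfloor \log_2 \frac{3n}{2\log n - 6\log\log n} \right\rfloor,$$
so that
$$\frac{1}{3}\log n - \log\log n \le \frac{n}{2^{k_0+1}} \le \frac{2}{3}\log n - 2\log\log n, \tag{2.4}$$
$$e^{-n/2^{k_0+1}} = O\!\left(\frac{\log n}{n^{1/3}}\right) \tag{2.5}$$
and
$$e^{n/2^{k_0+1}} = O\!\left(\frac{n^{2/3}}{(\log n)^2}\right). \tag{2.6}$$
We shall begin by assuming $k \ge k_0$ (as well as $k \le k_1$). Let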
We shall break the sum in (2.2) at $j_0$:
(2.8)
We bound the magnitude of the second sum in (2.8) by using $\binom{n - j(k-1)}{j} \le \binom{n}{j} \le (en/j)^j$, which yields
using (2.4) and (2.7). For the first sum in (2.8), we use (2.1) and (2.7) to estimate the binomial coefficient by $\binom{n - j(k-1)}{j} = (n^j/j!)\,(1 + O(jk/n))^j = (n^j/j!)\,(1 + O((\log n)^3/n))$:
The presence of the O-term in the summand prevents us from exploiting cancelation after moving the sum inside the O-term, so to obtain an error bound for the resulting sum we consider the magnitudes of the summands:
We bound the magnitude of this sum just as we did that of the second sum in (2.8), to obtain
Substituting (2.9) and (2.10) in (2.8), we obtain (1.4) for $k_0 \le k \le k_1$. Finally, we consider $k < k_0$. We use the fact that $\Pr[E_{n,k}]$ is a nonincreasing function of $k$, so that
using (2.5). This yields (1.4) for the remaining values of k.
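The inclusion-exclusion sum that follows from (A-1) and (A-2) is exact, so it can be compared both with brute-force enumeration for small $n$ and with the main term $1 - e^{-n/2^{k+1}}$ for larger $n$. The Python sketch below is an illustration under these assumptions; the function names are hypothetical, and the sum is written with the binomial coefficient $\binom{n-j(k-1)}{j}$ counting the ways to place $j$ non-overlapping $k$-blocks.

```python
from fractions import Fraction
from math import comb, exp


def prob_inclusion_exclusion(n: int, k: int) -> Fraction:
    """Pr[E_{n,k}] as an alternating sum over sets of j non-overlapping k-blocks."""
    total = Fraction(0)
    j = 1
    while j * k <= n:  # beyond this, j non-overlapping k-blocks no longer fit
        ways = comb(n - j * (k - 1), j)
        total += (-1) ** (j + 1) * Fraction(ways, 2 ** ((k + 1) * j))
        j += 1
    return total


def prob_brute_force(n: int, k: int) -> Fraction:
    """Pr[E_{n,k}] by enumerating all pairs of n-bit summands (small n only)."""
    hits = 0
    for x in range(2 ** n):
        for y in range(2 ** n):
            generate, propagate = x & y, x ^ y
            if any(
                (generate >> l) & 1
                and all((propagate >> (l + i)) & 1 for i in range(1, k))
                for l in range(n - k + 1)
            ):
                hits += 1
    return Fraction(hits, 4 ** n)


n = 6
print(all(prob_inclusion_exclusion(n, k) == prob_brute_force(n, k)
          for k in range(1, n + 1)))

# Compare the exact value with the main term of the estimate for a larger n.
n = 1024
for k in range(7, 13):
    print(k, float(prob_inclusion_exclusion(n, k)), 1 - exp(-n / 2 ** (k + 1)))
```

Exact rational arithmetic is used for the alternating sum so that cancellation between large terms does not affect the comparison.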
THE ALGORITHM FOR MULTIPLICATION
In this section, we shall describe in more detail the algorithm presented in the Introduction. It will be most convenient to describe this algorithm in the language of hardware, implemented as circuits built from gates interconnected by wires, but this is of course equivalent to a description in the language of software for a parallel computer, such as that used by Claus [1973] and Knuth [1978]. We assume that $M$ is given its unique conventional binary representation $M = \sum_{0 \le j \le d} m_j 2^j$, in which all digits $m_0 = 1, m_1, \ldots, m_d = 1$ are either 0 or 1. As before, let $0 = s_1 < s_2 < \cdots < s_c = d$ denote the positions of the 1s. Our first step will be to specify the encodings of the numbers $W_1, W_2, \ldots, W_c$ as sequences of bits. The input $V = \sum_{0 \le l \le n-1} v_l 2^l$ will be received using $n$ bits $v_0, v_1, \ldots, v_{n-1}$ as usual. Since $V$ is an $n$-bit number (in the range $[0, 2^n - 1]$) and $M$ is a $(d + 1)$-bit number (in the range $[0, 2^{d+1} - 1]$), their product $Z = M \cdot V$ is an $(n + d + 1)$-bit number (in the range $[0, (2^n - 1)(2^{d+1} - 1)] \subseteq [0, 2^{n+d+1} - 1]$). Thus it will suffice to represent all numbers produced during the execution of the algorithms (the output $Z$ and all intermediate results) using $n + d + 1$ bits, and to perform all additions (both carry-save and carry-propagate) modulo $2^{n+d+1}$. Thus we shall represent each $W_i$ (for $1 \le i \le c$) by the $n + d + 1$ bits in its conventional binary representation:
(Here and in what follows, we number the positions from right to left: position 0 is the rightmost position and position $n + d$ is the leftmost.)
In the first stage of the algorithm, we reduce the $c$ summands $W_1, W_2, \ldots, W_c$ to two summands $X$ and $Y$ by means of carry-save adders. Each carry-save adder consists of $n + d + 1$ "full adders", one for each position in the numbers being added. A full adder is a pair of gates that takes three input signals (say $f$, $g$ and $h$) and produces two output signals. The sum output is the parity (i.e., the sum $f \oplus g \oplus h$ modulo 2) of the three inputs. The carry output is the majority $(f \wedge g) \vee (f \wedge h) \vee (g \wedge h)$ of the three inputs. The parity and majority are symmetric functions of the three inputs, so when specifying what signals should be fed into a full adder, we do not need to specify which signal goes into which input. The $n + d + 1$ full adders in a carry-save adder reduce three summands (say $F = \sum_{0 \le l \le n+d} f_l 2^l$, $G = \sum_{0 \le l \le n+d} g_l 2^l$ and $H = \sum_{0 \le l \le n+d} h_l 2^l$) to two summands (say $A = \sum_{0 \le l \le n+d} a_l 2^l$ and $B = \sum_{0 \le l \le n+d} b_l 2^l$) as follows. The signals $f_l$, $g_l$ and $h_l$ are fed into the inputs of the full adder in position $l$ (for $0 \le l \le n + d$). The sum outputs of the full adders become the bits of the summand $A$: $a_l = \mathrm{parity}(f_l, g_l, h_l)$ for $0 \le l \le n + d$. Finally, the carry outputs of the full adders become, after being shifted left one position, the bits of the summand $B$: $b_{l+1} = \mathrm{majority}(f_l, g_l, h_l)$ for $0 \le l \le n + d - 1$ (the carry output from the full adder in the leftmost position is ignored) and $b_0 = 0$ (a 0 bit is shifted into the rightmost position of $B$).
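The bit-level description of a carry-save adder translates directly into word-level bitwise operations. The following Python sketch is illustrative only: the function name `carry_save_add` and the `width` parameter (standing for $n + d + 1$) are assumptions, and Python integers stand in for the bit vectors.

```python
def carry_save_add(f: int, g: int, h: int, width: int):
    """Reduce three width-bit summands to two with the same sum modulo 2**width."""
    mask = (1 << width) - 1
    a = (f ^ g ^ h) & mask                  # sum outputs: parity at each position
    majority = (f & g) | (f & h) | (g & h)  # carry outputs: majority at each position
    b = (majority << 1) & mask              # shift left; the leftmost carry is dropped
    return a, b


print(carry_save_add(0b0101, 0b0011, 0b0110, 5))  # (0, 14), and 5 + 3 + 6 = 14
```

No carry travels more than one position, which is why the whole operation costs a fixed delay regardless of the width.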
After the $c$ summands $W_1, W_2, \ldots, W_c$ have been reduced to two summands $X$ and $Y$ by $c - 2$ carry-save adders in the first stage, the summands $X$ and $Y$ are added by a carry-propagate adder in the second stage. Like a carry-save adder, a carry-propagate adder can be built from $n + d + 1$ full adders, one for each position in the numbers being added. Two of the inputs of the full adder in position $l$ (for $0 \le l \le n + d$) are provided by the appropriate bits $x_l$ and $y_l$ of the numbers $X = \sum_{0 \le l \le n+d} x_l 2^l$ and $Y = \sum_{0 \le l \le n+d} y_l 2^l$. But in this case the third input of the full adder in position $l$ is fed from the carry output of the full adder in position $l - 1$ for $1 \le l \le n + d$, and is fed the constant 0 for $l = 0$ (the carry output from the full adder in position $n + d$ is ignored). The $n + d + 1$ bits of the final product $Z$ are then produced at the sum outputs of the full adders.
This description of a carry-propagate adder gives an adequate picture of the production of the outputs, but it is not convenient for the analysis of the longest carry-propagation chain, for which we must distinguish between the generation of carries and their propagation, rather than merely their production. To make the generation and propagation of carries more explicit, we may replace the full adders in the second stage by "half adders". A half adder is obtained from a full adder by substituting the constant 0 for one of its three inputs. The resulting device consists of a pair of gates, one of which computes the sum output as the parity (i.e., the "exclusive-OR") of the two remaining inputs, and the other of which computes the carry output as the conjunction (i.e., the "AND") of the inputs. If we replace each full adder in the second stage with a half adder, then the carry output of each half adder will indicate whether a carry is generated at that position (i.e., whether both $x_l$ and $y_l$ are 1s for that value of $l$), and the sum output will indicate whether a carry would be propagated by that position (i.e., whether exactly one of $x_l$ and $y_l$ is a 1). Since the carries are no longer propagated after this replacement, the final output is no longer computed, but the half adders provide exactly the information we need for our analysis of the length of the longest carry-propagation chain.
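The two stages can be put together in a short Python sketch. It is a simplified model under stated assumptions: the carry-save additions are performed one after another (only one of the admissible orders), the function names `reduce_to_two` and `longest_chain_in_final_addition` are hypothetical, and $M$ is assumed to be odd and at least 3.

```python
def reduce_to_two(v: int, m: int, n: int):
    """Reduce the shifted copies of v selected by the 1-bits of m to two summands.

    Returns (x, y, width) with (x + y) mod 2**width == m * v, where width = n + d + 1.
    """
    d = m.bit_length() - 1
    width = n + d + 1
    mask = (1 << width) - 1
    summands = [(v << s) & mask for s in range(d + 1) if (m >> s) & 1]
    while len(summands) > 2:  # first stage: one carry-save addition per iteration
        f, g, h = summands[:3]
        parity = f ^ g ^ h
        majority = (f & g) | (f & h) | (g & h)
        summands = [parity, (majority << 1) & mask] + summands[3:]
    return summands[0], summands[1], width


def longest_chain_in_final_addition(v: int, m: int, n: int) -> int:
    """Longest carry-propagation chain in the second-stage addition of x and y."""
    x, y, width = reduce_to_two(v, m, n)
    generate = x & y   # carry outputs of the half adders
    propagate = x ^ y  # sum outputs of the half adders
    longest = current = 0
    for pos in range(width):
        if (generate >> pos) & 1:
            current = 1
        elif (propagate >> pos) & 1 and current:
            current += 1
        else:
            current = 0
        longest = max(longest, current)
    return longest


print(longest_chain_in_final_addition(0b0111, 3, 4))  # 2, from adding 000111 and 001110
```

For $M = 3$ there is no carry-save addition at all, and the final addition is simply $V + 2V$, as in the simplest case described in the Introduction.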
THE ANALYSIS OF MULTIPLICATION
We begin by deriving the principles, analogous to (A-1) and (A-2), that will allow us to analyze multiplication. A $k$-block is a sequence of $k$ contiguous bit positions among the $n + d + 1$ positions of numbers modulo $2^{n+d+1}$. Thus, there are just $n + d - k + 2$ distinct $k$-blocks, with the rightmost position of the rightmost $k$-block being position 0, and the rightmost position of the leftmost $k$-block being position $n + d - k + 1$. We shall say that a $k$-block is active if, in the final addition in the second stage, its rightmost position generates a carry and its remaining $k - 1$ positions propagate a carry. Whether or not a $k$-block is active depends on the input bits not only in its $k$ positions, but also in up to $d$ positions to its right. These $d$ or fewer positions will be called the extension of the $k$-block, and the $k$-block together with its extension will be called an extended $k$-block. (The $d$ rightmost $k$-blocks will have fewer than $d$ positions in their extensions, since there are fewer than $d$ positions to their right.)
The inputs to the final addition are computed by circuits composed of three-input parity and majority gates and zero-input constant gates. Furthermore, constant gates occur only in the circuits computing the rightmost $d$ and leftmost $d + 1$ positions (positions 0 through $d - 1$ and positions $n$ through $n + d$). A $k$-block will be called marginal if it or its extension overlaps the rightmost $d$ or leftmost $d + 1$ positions. Thus there are $3d + 1$ marginal $k$-blocks. A $k$-block will be called central if it is not marginal.
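The count of marginal $k$-blocks can be confirmed by direct enumeration. The Python sketch below is an illustration with a hypothetical function name; it indexes each $k$-block by its rightmost position $l$ and assumes $n$ is large enough that no block touches both margins.

```python
def classify_blocks(n: int, d: int, k: int):
    """Split the k-blocks (indexed by rightmost position l) into marginal and central."""
    marginal, central = [], []
    for l in range(n + d - k + 2):
        lowest = max(l - d, 0)            # rightmost position of the extension
        highest = l + k - 1               # leftmost position of the block itself
        touches_right = lowest <= d - 1   # overlaps positions 0 .. d-1
        touches_left = highest >= n       # overlaps positions n .. n+d
        (marginal if touches_right or touches_left else central).append(l)
    return marginal, central


marginal, central = classify_blocks(n=20, d=3, k=4)
print(len(marginal), central[0], central[-1])  # 10 6 16
```

With these parameters there are $3d + 1 = 10$ marginal blocks, and the central blocks are those whose rightmost position $l$ satisfies $2d \le l \le n - k$.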
-(M-1) The probability that a central $k$-block is active is $1/2^{k+1}$.
Suppose the rightmost position of the $k$-block is position $l$ ($2d \le l \le n - k$). For the rightmost position to generate a carry, the values of both $x_l$ and $y_l$ must be 1. The value of $y_l$ depends on the inputs $v_{l-1}, \ldots, v_{l-d}$, and it is computed from them by a circuit composed of three-input parity and majority gates. These gates compute self-dual Boolean functions: if the arguments of a self-dual function are complemented, then the value of the function is also complemented. The class of self-dual functions is closed under composition, so $y_l$ is a self-dual function of the inputs $v_{l-1}, \ldots, v_{l-d}$. If the arguments of a self-dual function are independent unbiased bits, then the value of the function is also an unbiased bit. Thus, the probability that $y_l = 1$ is $1/2$. The value of $x_l$ depends on the input $v_l$ as well as the $d$ inputs to its right, and we have
$$x_l = v_l \oplus \varphi(v_{l-1}, \ldots, v_{l-d}),$$
where $\varphi$ is some $d$-argument Boolean function. Since $v_l$ is an unbiased bit independent of $v_{l-1}, \ldots, v_{l-d}$, $x_l$ is an unbiased bit independent of $y_l$. Thus, the probability that position $l$ generates a carry is $(1/2)(1/2) = 1/4$.
For each of the remaining $k - 1$ positions of the $k$-block to propagate a carry, we must have $x_{l+j} \oplus y_{l+j} = 1$ for $1 \le j \le k - 1$. As between $x_{l+j}$ and $y_{l+j}$, only $x_{l+j}$ depends on $v_{l+j}$ and, as above, we have
$$x_{l+j} = v_{l+j} \oplus \varphi(v_{l+j-1}, \ldots, v_{l+j-d}).$$
Thus, each $x_{l+j}$ is an unbiased bit independent of the bits to its right, so the probability that all of the remaining $k - 1$ positions propagate a carry is $1/2^{k-1}$, and the probability that a central $k$-block is active is $(1/4)(1/2^{k-1}) = 1/2^{k+1}$.
-(M-2) The probability that a marginal $k$-block is active is at most $2^{2d}/2^k$.
The analysis of (M-1) applies to the k − 2d or more positions of the k-block that do not overlap the rightmost 2d or leftmost d + 1 positions.
We shall say that two k-blocks are strongly nonoverlapping if they, together with their extensions, are nonoverlapping, and that they are weakly overlapping if they are nonoverlapping, but one overlaps the extension of the other.
-(M-3) If two k-blocks are overlapping, they cannot both be active.
This holds because at each position, generating a carry and propagating a carry are exclusive events.
-(M-4) If a $k$-block $B$ lies to the right of, and is strongly nonoverlapping with, a $k$-block $A$, then the event that $B$ is active is independent of the event that $A$ is active.
This holds because the activities of strongly non-overlapping k-blocks depend on disjoint sets of inputs.
-(M-5) If a $k$-block $B$ overlaps the extension of a $k$-block $A$, but does not overlap $A$ itself, then the probability that $B$ is active, given that $A$ is active, is at most $2^d/2^{k+1}$.
The analysis of (M-1) applies to the k − d or more rightmost positions of B that do not overlap A or its extension.
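Principle (M-1) can be checked exhaustively for small parameters by enumerating all $2^n$ values of $V$. The following Python sketch is an illustration rather than part of the proof: it assumes the carry-save additions are performed sequentially, takes $X$ to be the final sum output and $Y$ the final shifted carry output, and uses hypothetical function names.

```python
def final_summands(v: int, m: int, n: int):
    """The two inputs (x, y) of the final addition for the product m * v."""
    d = m.bit_length() - 1
    mask = (1 << (n + d + 1)) - 1
    summands = [(v << s) & mask for s in range(d + 1) if (m >> s) & 1]
    while len(summands) > 2:
        f, g, h = summands[:3]
        carry = (((f & g) | (f & h) | (g & h)) << 1) & mask
        summands = [f ^ g ^ h, carry] + summands[3:]
    return summands[0], summands[1]


def block_is_active(x: int, y: int, l: int, k: int) -> bool:
    """True if position l generates and positions l+1 .. l+k-1 all propagate."""
    if not ((x >> l) & (y >> l) & 1):
        return False
    return all(((x ^ y) >> (l + j)) & 1 for j in range(1, k))


def count_active(m: int, n: int, k: int, l: int) -> int:
    """Number of n-bit inputs v for which the k-block starting at l is active."""
    return sum(block_is_active(*final_summands(v, m, n), l, k) for v in range(2 ** n))


# M = 11 (binary 1011, so d = 3), n = 10, k = 3: central blocks have 6 <= l <= 7.
for l in (6, 7):
    print(l, count_active(11, 10, 3, l), 2 ** 10 // 2 ** 4)
```

Each line prints the observed count next to the count $2^n/2^{k+1}$ predicted by (M-1) for a central block.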
Finally, we must consider $k < k_0$. Again as in the analysis of addition, the fact that $\Pr[E_{n,k}]$ is a nonincreasing function of $k$, together with the bound (1.4) for $k = k_0$, yields (1.4) for the remaining values of $k$.
CONCLUSION
In this article, we have shown that the distribution of the length of the longest carry-propagation chain can be analyzed using what Alon and Spencer [2000] have called the "Poisson paradigm". We have also shown that this method of analysis can be used to show that, for a particular algorithm for multiplication of a random integer by a fixed constant, the length of the longest carry-propagation chain in the final addition has, to within terms tending to zero as n → ∞, the same distribution as in the addition of two independent random integers. This algorithm is characterized by shifting over zeros in the multiplier, and by the use of a carry-save adder to incorporate the contributions for all but the last two non-zero digits of the multiplier. We should point out that our analysis does not appear to be applicable to either of two natural variants of this algorithm: one in which zeros are not shifted over, but cause a contribution of zero to be added using a carry-save adder (for in this case we cannot appeal to self-duality in the computation of the final summands), and one in which a carry-propagate adder is used for all additions (in which case it does not matter whether or not zeros are shifted over, for in this case the outputs of each adder depend on an unbounded number of input bits to their right). It remains an open question whether the result of this paper applies to either or both of these variants.
An apparently even more challenging problem is to determine whether or not the result of this paper applies to the algorithm considered here when the multiplier is not a fixed integer, but is rather a random integer with the same distribution as, but independent of, the multiplicand. This question has been studied empirically for the variants described above by Gilchrist et al. [1955] (for the use of a carry-propagate adder for each addition), and by Estrin et al. [1956] (for the use of a carry-save adder). In each case, the answer is apparently affirmative.
