Binary Adder Circuits of Asymptotically Minimum Depth, Linear Size, and
  Fan-Out Two by Held, Stephan & Spirkl, Sophie Theresa
arXiv:1503.08659v3 [cs.AR] 17 Jan 2017
Binary Adder Circuits of Asymptotically
Minimum Depth, Linear Size, and Fan-Out
Two
Stephan Held
Research Institute for Discrete Mathematics, University of Bonn
Sophie Theresa Spirkl
Research Institute for Discrete Mathematics, University of Bonn
November 14, 2018
We consider the problem of constructing fast and small binary adder cir-
cuits. Among widely-used adders, the Kogge-Stone adder is often considered
the fastest, because it computes the carry bits for two n-bit numbers (where
n is a power of two) with a depth of 2 log2 n logic gates, size 4n log2 n, and all
fan-outs bounded by two. Fan-outs of more than two are avoided, because they
lead to the insertion of repeaters for repowering the signal and additional depth
in the physical implementation.
However, the depth bound of the Kogge-Stone adder is off by a factor of
two from the lower bound of log2 n. This bound is achieved asymptotically
in two separate constructions by Brent and Krapchenko. Brent’s construction
gives neither a bound on the fan-out nor the size, while Krapchenko’s adder
has linear size, but can have up to linear fan-out. With a fan-out bound of two,
neither construction achieves a depth of less than 2 log2 n.
In a further approach, Brent and Kung proposed an adder with linear size
and fan-out two, but twice the depth of the Kogge-Stone adder.
These results are 33-43 years old and no substantial theoretical improvement
has been made since then. In this paper we integrate the individual advantages
of all previous adder circuits into a new family of full adders, the first
to improve on the depth bound of 2 log2 n while maintaining a fan-out bound
of two. Our adders achieve an asymptotically optimum logic gate depth of
log2 n+ o(log2 n) and linear size O(n).
1 Introduction
Given two binary addends A = (an . . . a1) and B = (bn . . . b1), where index n denotes the
most significant bit, their sum S = A+B has n+1 bits. We are looking for a logic circuit,
also called an adder, that computes S. Here, a logic circuit is a non-empty connected acyclic
directed graph consisting of nodes that are either gates with incoming and outgoing edges,
inputs with at least one outgoing edge and no incoming edges, or outputs with exactly one
incoming edge and no outgoing edges. Gates represent one or two bit Boolean functions,
specifically And, Or, Xor, Not or their negations. A small example is shown on the right
side of Figure 1a. The fan-in is the maximum number of incoming edges at a vertex, and
it is bounded by two for all gates.
The main characteristics in adder design are the depth, the size, and the fan-out of a
circuit. The depth is defined as the maximum length of a directed path in the logic circuit
and is used as a measure for its speed. The lower the depth, the faster the adder. The
size is the total number of gates in the circuit, and is used as a measure for the space
and power consumption of the adder, both of which we aim to minimize. The fan-out is
the maximum number of outgoing edges at a vertex. High fan-outs increase the delay and
require additional repeater gates (implementing the identity function) in physical design.
Thus, when comparing the depth of adder circuits, their fan-out should be considered as
well; we will focus on the usual fan-out bound of two. Circuits with higher fan-outs can be
transformed into fan-out two circuits by replacing each interconnect with high fan-out by a
balanced binary repeater tree, i.e. the underlying graph is a tree and all gates are repeater
gates. However, this increases the size linearly and the depth logarithmically in the fan-out.
Hoover, Klawe, and Pippenger [1984] gave a smarter way to bound the fan-out of a given
circuit, but it would also triple the size and depth in our case of gates with two inputs.
Using logic circuit depth as a measure for speed is a common practice in logic synthesis
that simplifies many aspects of physical hardware. In CMOS technology, Nand/Nor gates
are faster than And/Or gates and efficient implementations exist for integrated multi-
input And-Or-Inversion gates and Or-And-Inversion gates. We assume that a technology
mapping step [Chatterjee et al. 2006, Keutzer 1988] translates the adder circuit after logic
synthesis using logic gates that are best for the given technology. Despite its simplicity,
the depth-based model is at the core of programs such as BonnLogic [Werber et al. 2007]
for refining carry bit circuits, which is an integral part of the current IBM microprocessor
design flow. Recently, we reduced the running time for computing such carry bit circuits
significantly from $O(n^3)$ to $O(n \log n)$ [Held and Spirkl 2014]. As an example of such
a mapping, for all newly proposed adder circuits in this paper we will demonstrate how to
efficiently transform them into equivalent circuits using only Nand/Nor and Not gates.
Like most existing adders, we use the notion of generate and propagate signals, see
[Sklansky 1960, Brent 1970, Knowles 1999]. For each position 1 ≤ i ≤ n, we compute a
generate signal yi and a propagate signal xi, which are defined as follows:
xi = ai ⊕ bi,
yi = ai ∧ bi, (1)
where ∧ and ⊕ denote the binary And and Xor functions. The carry bit at position $i+1$
can be computed recursively as $c_{i+1} = y_i \vee (x_i \wedge c_i)$, since there is a carry bit at position $i+1$
if the $i$-th bit of both inputs is 1 or, assuming this is not the case, if at least one (hence
exactly one) of these bits is 1 and there was a carry bit at position $i$.
The first carry bit c1 can be used to represent the carry-in, but we usually assume c1 = 0.
The last carry bit cn+1 is also called the carry-out. From the carry bits, we can compute
the output S via
si = ci ⊕ xi for 1 ≤ i ≤ n and sn+1 = cn+1. (2)
With this preparation of constant depth, linear size, and fan-out two at the inputs ai, bi
and fan-out one at the carry bits ci+1 (i = 1, . . . , n), the binary addition is reduced to the
problem of computing all carry bits ci+1 from xi, yi (i = 1, . . . , n).
Convention: From now on, we will omit the preparatory steps (1) and (2) and consider
a circuit an adder circuit if it computes all ci+1 from xi, yi (i = 1, . . . , n).
Expanding the recursive formula for ci+1 as in equation (3) results in a logic circuit that
is a path of alternating And and Or gates. It corresponds to the long addition method
and has linear depth 2(n − 1).
ci+1 = yi ∨ (xi ∧ (yi−1 ∨ (xi−1 ∧ · · · ∧ (y2 ∨ (x2 ∧ y1)). . . . ))) (3)
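The preparation steps (1) and (2) and the recursion for the carry bits can be sketched in a few lines of Python. This is only an illustration of the definitions, not one of the circuits discussed here; the function names and the least-significant-bit-first list layout are assumptions made for the sketch.

```python
def carries_ripple(a_bits, b_bits, carry_in=0):
    """a_bits[0] holds a_1 (least significant bit); likewise for b_bits.

    Returns the propagate bits x, the generate bits y and the carry bits
    [c_1, ..., c_{n+1}] obtained from c_{i+1} = y_i OR (x_i AND c_i).
    """
    x = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate, equation (1)
    y = [a & b for a, b in zip(a_bits, b_bits)]   # generate, equation (1)
    c = [carry_in]                                # c_1
    for xi, yi in zip(x, y):
        c.append(yi | (xi & c[-1]))
    return x, y, c


def add_bits(a_bits, b_bits):
    """Sum bits [s_1, ..., s_{n+1}] as in equation (2), with c_1 = 0."""
    x, _, c = carries_ripple(a_bits, b_bits)
    return [ci ^ xi for ci, xi in zip(c, x)] + [c[-1]]


def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))
```

The loop in `carries_ripple` is the sequential chain of equation (3): every carry waits for the previous one, which is why its depth grows linearly in n.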
1.1 Prefix Graph Adders
For two pairs zi = (xi, yi) and zj = (xj, yj), we define the associative prefix operator ◦ as
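\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix} \circ \begin{pmatrix} x_j \\ y_j \end{pmatrix}
= \begin{pmatrix} x_i \wedge x_j \\ y_i \vee (x_i \wedge y_j) \end{pmatrix}. \tag{4}
\]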
A circuit computing (4) can be implemented as a logic circuit consisting of three gates and
with depth two as shown in Figure 1a. For i = 1, . . . , n, the result of the prefix computation
zi ◦ · · · ◦ z1 of the expression zn ◦ · · · ◦ z1 contains the carry bit ci+1:
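\[
\begin{pmatrix} x_i \wedge x_{i-1} \wedge \dots \wedge x_1 \\ c_{i+1} \end{pmatrix}
= \begin{pmatrix} x_i \\ y_i \end{pmatrix} \circ \begin{pmatrix} x_{i-1} \\ y_{i-1} \end{pmatrix} \circ \dots \circ \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}. \tag{5}
\]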
A circuit of ◦-gates computing all prefixes $z_i \circ \dots \circ z_1$ ($i = 1, \dots, n$) for an associative
operator ◦ is called a prefix graph. A prefix graph yields an adder by expanding each ◦-gate
as in Figure 1a, and extracting the carry bits as in (5).
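As a small sanity check of (4) and (5), the following Python sketch implements the prefix operator on pairs, verifies associativity by exhaustive enumeration, and reads off the carry bits from the sequential prefix computation. The function names are chosen for this illustration only.

```python
from itertools import product


def prefix_op(zi, zj):
    """The operator of equation (4) on pairs z = (x, y); zi is the more significant pair."""
    xi, yi = zi
    xj, yj = zj
    return (xi & xj, yi | (xi & yj))


def is_associative():
    """Check (a o b) o c == a o (b o c) for all Boolean pairs a, b, c."""
    pairs = list(product((0, 1), repeat=2))
    return all(
        prefix_op(prefix_op(a, b), c) == prefix_op(a, prefix_op(b, c))
        for a in pairs for b in pairs for c in pairs
    )


def prefix_carries(z):
    """z[0] holds z_1. Returns [c_2, ..., c_{n+1}] as the y-components in (5)."""
    acc = z[0]
    carries = [acc[1]]
    for zi in z[1:]:
        acc = prefix_op(zi, acc)      # z_i o (z_{i-1} o ... o z_1)
        carries.append(acc[1])
    return carries
```

Because the operator is associative, the prefixes may be bracketed in any order; the prefix graphs discussed next differ only in how they exploit this freedom.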
Most previous constructions for adders are based on prefix graphs of small depth, size
and/or fan-out. Sklansky [1960] developed a prefix graph of minimum depth $\log_2 n$ and size
$\tfrac{1}{2} n \log_2 n$, but high fan-out $\tfrac{1}{2} n + 1$. The first prefix graph with logarithmic depth ($2 \log n - 1$)
and linear size ($3n - \log n - 2$) was developed by Ofman [1962], exhibiting a non-constant
fan-out of $\tfrac{1}{2} \log n$.

[Figure 1: Prefix gate and graph. (a) Prefix gate and underlying logic circuit, computing $z_i \circ z_j$ with outputs $x_i \wedge x_j$ and $y_i \vee (x_i \wedge y_j)$. (b) Kogge-Stone prefix graph for inputs $z_8, \dots, z_1$.]

Kogge and Stone [1973] introduced the recursive doubling algorithm
which leads to a prefix graph with depth log2 n and fan-out two (see Figure 1b). Since we
will use variants of it in our construction, we describe it in detail. For $1 \le s \le t \le n$, let
$Z_{s,t} := z_t \circ \dots \circ z_s$, and for $x \in \mathbb{R}$, let $(x)^+ := \max\{x, 0\}$. The graph has $\log_2 n$ levels, and on
level $i$ it computes for every input $j$ ($1 \le j \le n$) the prefix $Z_{1+(j-2^i)^+,\,j} = z_j \circ \dots \circ z_{1+(j-2^i)^+}$
according to the recursive formula
\[
Z_{1+(j-2^i)^+,\,j} = Z_{1+(j-2^{i-1})^+,\,j} \circ Z_{1+(j-2^i)^+,\,(j-2^{i-1})^+}, \tag{6}
\]
from the prefixes of sequences of $2^{i-1}$ consecutive inputs computed in the previous level.
The fan-out is bounded by two, since every intermediate result is used exactly twice: once as
the “upper half” and once as the “lower half” of an expression of the form $z_j \circ \dots \circ z_{1+(j-2^i)^+}$.
Note that for level $i$ ($1 \le i \le \log_2 n$), a repeater gate (which computes the identity function)
is used instead of a ◦-gate if $j \le 2^{i-1}$, i.e. in the case that the right input in (6) is empty.
Repeaters are shown as blue squares in Figure 1b. The Kogge-Stone prefix graph minimizes
both depth and fan-out. On the other hand, since there is a linear number of gates at each
level, the total size in terms of prefix gates is $n \log_2 n - \tfrac{n}{2}$.
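The recursive doubling scheme can be simulated level by level. The sketch below is a plain Python model (not a gate-level netlist): it applies formula (6) on each level, falls back to a repeater whenever the lower half is empty, and records how often each intermediate result is consumed. The names `kogge_stone` and `prefix_op` are assumptions of this sketch, and n is assumed to be a power of two.

```python
def prefix_op(zi, zj):
    """Prefix operator on pairs (x, y), as in equation (4)."""
    xi, yi = zi
    xj, yj = zj
    return (xi & xj, yi | (xi & yj))


def kogge_stone(z, op=prefix_op):
    """z[j] holds input j+1. Returns (prefixes, number of levels, max fan-out)."""
    n = len(z)
    level, span, levels, max_fanout = list(z), 1, 0, 0
    while span < n:                       # span is 2^(i-1) on level i
        uses = [0] * n                    # consumers of each result of the previous level
        nxt = []
        for j in range(n):
            uses[j] += 1                  # used as "upper half" (or by a repeater)
            if j >= span:
                uses[j - span] += 1       # used as "lower half"
                nxt.append(op(level[j], level[j - span]))
            else:
                nxt.append(level[j])      # lower half is empty: repeater
        max_fanout = max(max_fanout, max(uses))
        level, span, levels = nxt, 2 * span, levels + 1
    return level, levels, max_fanout
```

For eight inputs the simulation runs through three levels and never observes a result with more than two consumers, in line with the depth and fan-out properties stated above.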
Ladner and Fischer [1980] constructed a prefix graph of depth log2 n but high fan-out.
Brent and Kung found a linear-size prefix graph with fan-out two, but twice the depth
of the other constructions. Finally, Han and Carlson [1987] described a hybrid between a
Kogge-Stone adder and a Brent-Kung adder which achieves a trade-off between depth and
size. Lower bounds for the trade-off between the depth and size of a prefix graph can be
found in [Fich 1983, Sergeev 2013].
The above prefix graphs can be used for prefix computations with respect to any associa-
tive operator ◦. In fact, we will later use a prefix graph in which the operator ◦ represents
an And gate. When turning one of the above prefix graph adders into a logic circuit for
addition such that each prefix gate is implemented as in Figure 1a, the depth of the logic
circuit is twice the depth of the prefix graph and the number of logic gates is three times
the number of prefix gates. The fan-out of the underlying logic circuit can increase by one
compared to the prefix graph, because the left propagate signal xi is used twice within a
prefix gate. In Section 3.1, we will see that in the case of the Brent-Kung adder a fan-out
of two can be achieved by using reduced prefix gates.
Any adder constructed from a prefix graph has a logic gate depth of at least
$\log_\varphi n - 1 > 1.44 \log_2 n - 1$, where $\varphi = \frac{1+\sqrt{5}}{2}$ is the golden section [Held and Spirkl 2014], see
also [Rautenbach et al. 2008]. In [Held and Spirkl 2014] an adder of size $O(n \log_2 \log_2 n)$
asymptotically attaining this depth bound is described, however with a high fan-out of
$\sqrt{n} + 1$.
1.2 Non-Prefix Graph Adders
Since none of the 2n inputs xi, yi (1 ≤ i ≤ n) except for x1 are redundant for cn+1,
the depth of any adder circuit using 2-input gates is at least log2 n + 1, which would be
attained by a balanced binary tree with inputs/leaves xi, yi (1 ≤ i ≤ n). With adders that
are not based on prefix graphs, this bound is asymptotically tight. Krapchenko showed
that any formula (a circuit with tree topology) for computing cn+1 has depth at least
log2 n+ 0.15 log2 log2 log2 n+O(1) [Krapchenko 2007].
Brent [1970] gives an approximation scheme for a single carry bit circuit attaining an
asymptotic depth of (1 + ε) log2 n + o(log2 n) for any given ε > 0. The best known depth
for a single carry bit circuit is log2 n+log2 log2 n+O(1), due to Grinchuk [2008]. However,
[Grinchuk 2008] and [Brent 1970] did not address how to overlay circuits for the different
carry bits to bound the size and fan-out of an adder based on their circuits. One problem
in sharing intermediate results is that this can create high fan-outs.
Krapchenko [1967] (see [Wegener 1987, pp. 42-46]) presented an adder with asymptot-
ically optimum depth log2 n + o(log2 n) and linear size. It was refined for small n by
[Gashkov et al. 2007]. However, the fan-out is almost linear.
1.3 Our Contribution
In this paper, we present the first family of adders of asymptotically optimum depth, linear
size, and fan-out bound two:
Theorem 1.1 (Main Theorem). Given two n-bit numbers A,B, there is a logic circuit
computing the sum A + B, using gates with fan-in and fan-out two and that has depth
log2 n+ o(log n) and size O(n).
The rest of the paper is organized as follows. In Section 2, we develop a family of adders of
asymptotically minimum depth and fan-out two, but super-linear size
$O\left( n \left\lceil \sqrt{\log_2 n} \right\rceil^2 2^{\sqrt{\log_2 n}} \right)$.
In Section 3, using reductions similar to [Krapchenko 1967], this adder is transformed into
an adder of linear size with the asymptotically same depth, proving Theorem 1.1. While
all of the above adders use only And/Or gates and repeaters, we show in Section 4 that
Theorem 1.1 holds also if only Nand/Nor and Not gates are available.
2 Asymptotically Optimum Depth and Fan-Out Two
For 1 ≤ s ≤ t ≤ n, let Xs,t and Ys,t denote the propagate and generate signal for the
sequence of indices between s and t, i.e.
\[
X_{s,t} = \bigwedge_{i=s}^{t} x_i, \qquad
Y_{s,t} = y_t \vee (x_t \wedge (y_{t-1} \vee (x_{t-1} \wedge \dots \wedge (y_{s+1} \vee (x_{s+1} \wedge y_s)) \dots ))). \tag{7}
\]
The adders based on prefix graphs as in Section 1.1 impose a common topological struc-
ture on the computation of intermediate results Xs,t and Ys,t. In the adder described by
Brent [1970], on the other hand, intermediate results Xs,t and Ys,t are computed separately
within larger blocks.
Let $n = 2^{rk}$ for $r \in \mathbb{N}$ and $k \in \mathbb{N}$ to be chosen later. A central idea for obtaining a faster
adder is to use multi-fan-in (also called high-radix) subcircuits within a Kogge-Stone prefix
graph. While all the prefix gates in Figure 1b have fan-in two, we want to use prefix gates
with fan-in $2^r$, so that the number of levels reduces from $\log_2 n$ to $\log_{2^r} n = \frac{1}{r} \log_2 n$. Each
prefix gate with fan-in $2^r$ represents a logic circuit with fan-in and fan-out bounded by two.
Since the output of each prefix gate will be used in $2^r$ prefix gates at the next level, our
approach also requires duplicating the intermediate result at the output of a prefix gate
$2^{r-1}$ times. To accomplish this, we consider the computation of generate and propagate
sequences separately.
Our adder consists of two global Kogge-Stone type prefix graphs. The first graph
uses 2-input And-gates and computes the propagate signals used in the second prefix graph.
The second graph uses $2^r$-input subcircuits that are arranged in the same way as the Kogge-Stone
graph, and it computes the generate (carry) signals. Both graphs are modified to duplicate
intermediate generate signals $2^{r-1}$ times and intermediate propagate signals $2^r$ times so
that the overall construction obeys the fan-out bound of two.
2.1 Multi-Input Generate Gates
We now introduce multi-input generate gates, which are the main building block for computing
the generate signals. Given $2^r$ propagate and generate pairs $(\tilde{x}_{2^r}, \tilde{y}_{2^r}), \dots, (\tilde{x}_1, \tilde{y}_1)$,
a multi-input generate gate computes the generate signal
\[
\tilde{Y}_{1,2^r} = \tilde{y}_{2^r} \vee (\tilde{x}_{2^r} \wedge (\tilde{y}_{2^r-1} \vee (\tilde{x}_{2^r-1} \wedge \dots \wedge (\tilde{y}_2 \vee (\tilde{x}_2 \wedge \tilde{y}_1)) \dots ))).
\]
The input pairs $(\tilde{x}_i, \tilde{y}_i)$ ($i \in \{1, \dots, 2^r\}$) are not necessarily the input pairs of the adder;
they can be intermediate results.
Each multi-input generate gate has $2^{r-1}$ outputs, each of which provides the result $\tilde{Y}_{1,2^r}$,
because later we want to reuse this signal $2^r$ times and bound the fan-out of each output
by two. In contrast to two-input prefix gates computing (4), multi-input generate gates
do not compute the propagate signals $\tilde{X}_{1,2^r} = \bigwedge_{i=1}^{2^r} \tilde{x}_i$ for the given input pairs. All
required propagate signals will be computed by the separate And-prefix graph, described
in Section 2.2.
[Figure 2: A $2^r$-input, $2^{r-1}$-output generate gate for $r = 3$, with inputs $\tilde{x}_8, \dots, \tilde{x}_1$ and $\tilde{y}_8, \dots, \tilde{y}_1$.]
Figure 2 shows an example of a multi-input generate gate with 8 inputs. A $2^r$-input
prefix gate computes $\tilde{Y}_{1,2^r}$ as in the disjunctive normal form
\[
\tilde{Y}_{1,2^r} = \bigvee_{j=1}^{2^r} \left( \tilde{y}_j \wedge \left( \bigwedge_{i=j+1}^{2^r} \tilde{x}_i \right) \right),
\]
first computing all the minterms $m_j := \tilde{y}_j \wedge \left( \bigwedge_{i=j+1}^{2^r} \tilde{x}_i \right)$ ($j = 1, \dots, 2^r$), and then the
disjunction $\bigvee_{j=1}^{2^r} m_j$. The terms $\bigwedge_{i=j+1}^{2^r} \tilde{x}_i$ are computed as a Kogge-Stone And-suffix
graph, which arises from a Kogge-Stone prefix graph by reversing the ordering of the
inputs. A single stage of (red) And-gates and one repeater concludes the computation
of the minterms. Each input $\tilde{y}_i$ is used exactly once within this circuit. The repeater is
dispensable but simplifies the size formula and will become useful in Section 4.
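That the disjunctive normal form agrees with the nested definition of the generate signal can be checked exhaustively for small gates. The following Python sketch does so; it models only the Boolean function, not the gate arrangement of Figure 2, and the function names are chosen for this illustration.

```python
def nested_generate(xs, ys):
    """Evaluate y_m OR (x_m AND (... (y_2 OR (x_2 AND y_1)) ...)); xs[0] holds x_1."""
    acc = ys[0]
    for x, y in zip(xs[1:], ys[1:]):
        acc = y | (x & acc)
    return acc


def dnf_generate(xs, ys):
    """Evaluate the same signal as the disjunction of the minterms m_j."""
    m = len(xs)
    # suffix[j] is the conjunction of xs[j], ..., xs[m-1]; suffix[m] is the empty conjunction.
    suffix = [1] * (m + 1)
    for j in range(m - 1, -1, -1):
        suffix[j] = xs[j] & suffix[j + 1]
    # Minterm of input j+1: its generate bit and all propagate bits above it.
    minterms = [ys[j] & suffix[j + 1] for j in range(m)]
    return int(any(minterms))
```

Note that the most significant minterm consists of the generate bit alone, since its conjunction of propagate bits is empty.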
Finally, instead of computing the disjunction $\bigvee_{j=1}^{2^r} m_j$ by a balanced binary Or tree and
duplicating the results $2^{r-1}$ times through a balanced repeater tree, the duplication is
accomplished by $r$ rows of $2^{r-1}$ Or-gates as shown in Figure 2. Formally, let $M_{i,j} = \bigvee_{i'=i}^{j} m_{i'}$
be the disjunction of minterms $i, i+1, \dots, j$. Then, on level $l \in \{1, \dots, r\}$, we compute each
signal of the form $M_{i2^l+1,\,(i+1)2^l}$, $i = 0, \dots, 2^{r-l}-1$, from the previous level, and we compute
$2^{l-1}$ copies of it. By using $M_{i2^l+1,\,(i+1)2^l} = M_{2i2^{l-1}+1,\,(2i+1)2^{l-1}} \vee M_{(2i+1)2^{l-1}+1,\,(2i+2)2^{l-1}}$, and
since each preceding signal is available $2^{l-2}$ times ($l \ge 2$), we can ensure that each of them
has fan-out two. On the last level, we will have computed $2^{r-1}$ copies of $M_{1,2^r} = \tilde{Y}_{1,2^r}$. Each
level uses $2^{r-1}$ Or-gates. Note that a similar construction for reducing fan-out has been
used by Lupanov when extending his well-known bounded-size representation of general
Boolean functions to circuits with bounded fan-out [Lupanov 1962].
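The Or rows can be modelled directly. In the following Python sketch, each signal is kept as a list of copies; the rule that copy c on a level reads copy c // 2 of both children is one possible wiring consistent with the description above, and is an assumption of the sketch rather than a detail taken from Figure 2.

```python
def or_duplication(minterms):
    """Combine 2^r minterms in r rows of Or-gates.

    Returns (copies of the final disjunction, maximum fan-out observed,
    number of Or-gates used).
    """
    signals = [[m] for m in minterms]          # level 0: one copy of each minterm
    max_fanout, gates, level = 0, 0, 1
    while len(signals) > 1:
        copies = 2 ** (level - 1)              # copies produced per signal on this level
        uses = [[0] * len(s) for s in signals]
        nxt = []
        for i in range(len(signals) // 2):
            left, right = signals[2 * i], signals[2 * i + 1]
            out = []
            for c in range(copies):
                src = c // 2                   # two output copies share one source copy
                out.append(left[src] | right[src])
                uses[2 * i][src] += 1
                uses[2 * i + 1][src] += 1
                gates += 1
            nxt.append(out)
        max_fanout = max(max_fanout, max(max(u) for u in uses))
        signals, level = nxt, level + 1
    return signals[0], max_fanout, gates
```

For eight minterms (r = 3) the model uses three rows of four Or-gates and ends with four copies of the disjunction, each intermediate copy feeding at most two gates.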
Lemma 2.1. The multi-input generate gate has $2^r$ generate/propagate pairs as input and
$2^{r-1}$ outputs. Each propagate input has fan-out two and each generate input has fan-out
one. The gate consists of $r 2^r + (r+1) 2^{r-1}$ logic gates which have fan-out at most two. The
depth for the propagate inputs $\tilde{x}_i$ is $2r+1$ and the depth for the generate inputs $\tilde{y}_i$ is $r+1$
($i \in \{1, \dots, 2^r\}$).
Proof. All the terms $\bigwedge_{i=j+1}^{2^r} \tilde{x}_i$ are computed as a Kogge-Stone And-suffix graph (blue and
yellow gates in Figure 2) of size
\[
2^r \lceil \log_2 2^r \rceil - \frac{2^r}{2} = (r-1) 2^r + 2^{r-1}.
\]
Then, there is a level of $2^r$ (red) And gates and one repeater, concluding the computation
of the minterms. Finally, there are $r 2^{r-1}$ (green) Or-gates to compute the disjunction
$\bigvee_{j=1}^{2^r} m_j$ $2^{r-1}$ times, for a total of
\[
r 2^r + (r+1) 2^{r-1}
\]
gates. By construction, no gate and no propagate input has fan-out larger than two, and all
generate inputs have fan-out one. The depth is $r$ for the And-suffix graph, one for the red
gates, and $r$ for the disjunctions, yielding the desired depths of $2r+1$ for the propagate
inputs and $r+1$ for the generate inputs.
2.2 Augmented Kogge-Stone And-Prefix Graph
The second important component of our construction is the augmented Kogge-Stone And-prefix
graph. It is used to compute $X_{s,t} = \bigwedge_{i=s}^{t} x_i$ for all $1 \le t \le n$ and $s = 1 + (t - 2^{rl})^+$
with $0 \le l < k$, providing each output $2^r$ times through $2^r$ individual gates. It is constructed
as follows. First, we take a Kogge-Stone [1973] prefix graph, where the prefix operator is an
And-gate, i.e. ◦ = ∧. It consists of $\log_2 n$ levels, and on level $i$ it computes for every input
$j$ ($1 \le j \le n$) the prefix $X_{1+(j-2^i)^+,\,j}$ from the prefixes of sequences of $2^{i-1}$ consecutive
inputs computed in the previous level.
Each of the results $X_{s,t}$ from level $rl$ will later be used in $2^r$ multi-input generate gates
for all $0 \le l < k$, $s = 1 + (t - 2^{rl})^+$ and $1 \le t \le n$. In order to achieve a fan-out bound
of two, starting at the inputs, we insert one row of $n$ repeaters after every $r$ levels of
And-gates. This allows us to use the repeaters as the inputs for the next level, and to extract
the signals $X_{s,t}$ once at the And-gates before the repeaters. The construction is shown in
Figure 3 with the extracted outputs $X_{s,t}$ shown as red arrows. The last block of $r$ rows of
gates (hatched gates in Figure 3) of the Kogge-Stone prefix graph can be omitted in our
construction to reduce the size.
Each output signal $X_{s,t}$ will be input to a multi-input generate gate, where it is immediately
duplicated. Thus, each output $X_{s,t}$ of the augmented Kogge-Stone And-prefix graph
has to be provided through an individual gate. To this end, at each of the $nk$ outputs, we
add $2^{r+1} - 1$ repeater gates as the vertices of a balanced binary tree to create $2^r$ copies
of the signal with a single repeater serving each leaf. For simplicity these repeaters are
hidden in Figure 3.

[Figure 3: Augmented Kogge-Stone And-prefix graph for $r = k = 2$, with inputs $x_{16}, \dots, x_1$.]
Lemma 2.2. The total size of the augmented Kogge-Stone And-prefix graph is
$nr(k-1) + nk2^{r+1}$.
Proof. Each binary repeater tree at one of the $nk$ outputs consists of $2^{r+1} - 1$ repeaters,
summing up to $nk(2^{r+1} - 1)$ repeaters in these repeater trees. The remaining circuit consists
of $r(k-1) + k$ rows ($r(k-1)$ rows of And-gates and $k$ rows of repeaters) of $n$ gates each,
summing up to $n(r(k-1) + k)$ gates. Altogether, the circuit contains $nr(k-1) + nk2^{r+1}$
gates.
Lemma 2.3. The signal $X_{s,t}$ for $1 \le t \le n$ and $s = 1 + (t - 2^{rl})^+$ for $0 \le l < k$ is available
$2^r$ times with internal fan-out one at a depth of $(l+1)(r+1)$.
Proof. The functional correctness is clear by construction. For the depth bound, let
$1 \le t \le n$ and $0 \le l < k$. Then, for $s = 1 + (t - 2^{rl})^+$, the signal $X_{s,t}$ is available at the bottom
of the $l$-th block at a depth of $l(r+1)$. Subsequently, we create $2^r$ copies of the signal in
a repeater tree of depth $r+1$. Together, this gives the desired depth $(l+1)(r+1)$.
2.3 Multi-Input Generate Adder
We now describe the multi-input generate adder for $n = 2^{rk}$. It consists of an augmented
Kogge-Stone And-prefix graph from the previous section and a circuit consisting of
multi-input generate gates similar to a radix-$2^r$ Kogge-Stone adder.
The construction uses $k$ rows with $n$ multi-input generate gates or repeater trees (see
Figure 4). The $t$-th multi-input generate gate in level $l \in \{1, \dots, k\}$ computes $Y_{1+(t-2^{rl})^+,\,t}$
according to the formula
\[
Y_{1+(t-2^{rl})^+,\,t} = \bigvee_{j=1}^{2^r} \left( Y_{1+(t-j2^{r(l-1)})^+,\,(t-(j-1)2^{r(l-1)})^+} \wedge \left( \bigwedge_{k=j+1}^{2^r} X_{1+(t-k2^{r(l-1)})^+,\,(t-(k-1)2^{r(l-1)})^+} \right) \right). \tag{8}
\]

[Figure 4: Multi-input multi-output generate gate adder for $r = k = 2$.]
If $(t - 2^{rl})^+ < (t - 2^{r(l-1)})^+$ (yellow circuits in Figure 4), this computation is carried
out using a multi-input generate gate from Section 2.1. As its inputs, it uses generate
signals from the previous level, $l-1$, and propagate signals obtained from the augmented
Kogge-Stone And-prefix graph.
Except for the last level, each intermediate generate signal will be used $2^r$ times as in
(8) in the next level. As the fan-out of each generate input inside a multi-input generate
gate is one, we need to provide $2^{r-1}$ copies through individual gates to serve $2^r$ multi-input
generate gates with fan-out two.
If $(t - 2^{rl})^+ = (t - 2^{r(l-1)})^+$ (blue squares in Figure 4), $Y_{1+(t-2^{rl})^+,\,t}$ is already computed
in the previous level, and in this level it is sufficient to duplicate the signal $2^{r-1}$ times using
a balanced binary repeater tree.
The augmented Kogge-Stone And-prefix graph provides each signal $2^r$ times with individual
repeaters. Thus, it can be distributed to $2^r$ multi-input generate gates, where the
fan-out of each propagate input is two.
For the first level of multi-input generate gates, we duplicate each generate signal $y_i$
at an input $i \in \{1, \dots, n\}$ using a balanced binary repeater tree of depth $r-1$ and size
$2 + 2^2 + \dots + 2^{r-1} = 2^r - 2$. Again, we can distribute each copy to two multi-input generate
gates, maintaining fan-out two.
In the last level of multi-input generate gates, we do not need to duplicate the signals
anymore. Instead of the $r$ rows of $2^{r-1}$ Or-gates each, we can compute the single outputs
using a balanced binary tree of $2^r - 1$ Or-gates and depth $r$.
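The level structure of the multi-input generate adder can be checked with a block-level simulation. The Python sketch below ignores all duplication and fan-out bookkeeping and models only which signals are combined on each level: every position t merges $2^r$ adjacent blocks of the previous level in disjunctive normal form. In the sketch, blocks are listed from the least significant end, following the input order of the gates in Section 2.1, and the propagate signals are computed by brute force as a stand-in for the And-prefix graph; both choices, as well as all names, are assumptions of this illustration.

```python
def multi_input_generate_adder(x, y, r, k):
    """x[t-1], y[t-1] are the propagate/generate bits of position t, n = 2**(r*k).

    Returns [c_2, ..., c_{n+1}] for carry-in 0.
    """
    n = len(x)
    assert n == 2 ** (r * k)

    def X(lo, hi):
        # Propagate signal of positions lo+1, ..., hi (true for an empty range).
        return int(all(x[lo:hi]))

    # Y[t] is the generate signal of the span ending at position t on the current
    # level; Y[0] stands for the empty span below position 1.
    Y = [0] + list(y)
    for level in range(1, k + 1):
        w = 2 ** (r * (level - 1))             # width of the blocks merged on this level
        new_Y = [0] * (n + 1)
        for t in range(1, n + 1):
            # Block boundaries from the least significant block upwards.
            bounds = [max(t - j * w, 0) for j in range(2 ** r, -1, -1)]
            result = 0
            for j in range(2 ** r):
                lo, hi = bounds[j], bounds[j + 1]
                minterm = Y[hi] if hi > lo else 0
                for i in range(j + 1, 2 ** r):  # all more significant blocks must propagate
                    minterm &= X(bounds[i], bounds[i + 1])
                result |= minterm
            new_Y[t] = result
        Y = new_Y
    return Y[1:]
```

After level l every entry covers a span of up to $2^{rl}$ positions, so k levels suffice to reach all carry bits for $n = 2^{rk}$.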
Lemma 2.4. The multi-input generate adder for $n = 2^{rk}$ bits obeys a fan-out bound of
two, contains fewer than
\[
3nk(r+2)2^{r-1} + n2^r + nrk
\]
gates, and has depth
\[
kr + 2r + k + 1.
\]
Proof. Inside each multi-input generate gate, the fan-out of propagate inputs is two and
the fan-out of generate inputs is one. Thus, it suffices to observe that in each non-output
level there are $2^r$ copies of each propagate signal and $2^{r-1}$ copies of each generate signal,
and that the fan-out of two holds within the augmented Kogge-Stone graph and within
each multi-input generate gate.
By Lemma 2.2, the size of the augmented Kogge-Stone And-prefix graph is
$nr(k-1) + nk2^{r+1}$. The size of the $n$ balanced binary trees duplicating the input generate signals is
$n(2^r - 2)$.
The remainder of the graph consists of $k$ rows of $n$ $2^r$-input multi-input generate gates
or repeater trees. The size of a repeater tree (blue boxes in Figure 4) is at most
$2^{r-1} - 1 \le r2^r + (r+1)2^{r-1}$ ($r \ge 1$), which is the size of a multi-input generate gate. Thus, the size
of all these multi-input generate gates is at most $nk\left(r2^r + (r+1)2^{r-1}\right)$. Summing up, the
total size is at most
\begin{align*}
& nr(k-1) + nk2^{r+1} + n(2^r - 2) + nk\left(r2^r + (r+1)2^{r-1}\right) \\
&= nk2^{r+1} + nkr2^r + n2^r + nk(r+1)2^{r-1} + nkr - n(r+2) \\
&= nk\left(4 + 2r + (r+1)\right)2^{r-1} + n2^r + nkr - n(r+2) \\
&< 3nk(r+2)2^{r-1} + n2^r + nkr.
\end{align*}
For a simpler depth analysis, we assume that the input generate signals $y_i$ arrive delayed
at a depth of $r+2$. The generate input signals traverse a binary tree of depth $r-1$ and
the propagate input signals traverse a binary tree of depth $r+1$ before reaching the first
multi-input generate gate, i.e. generate signals $y_i$ become available at depth $2r+1$ and
propagate signals at depth $r+1$. Thus, the first row of multi-input generate gates has
depth
\[
3r + 2 = \max\{2r + 1 + 1 + r,\; r + 1 + r + 1 + r\},
\]
where the first term in the maximum is caused by the delayed generate signals $y_i$ and the
second term by the propagate signals $x_i$ ($1 \le i \le n$).
For the next level, the propagate signals are available at time $2r+2$, and the generate
signals at time $3r+2$, and the propagate signals again arrive $r$ time units before the
corresponding generate signals, so at the next level, both signals arrive $r+1$ time units later
than they did before. Inductively, for each level $2 \le l \le k$, the generate and
propagate signals arrive at a depth of $(l-1)(r+1)$ more than they did at the first
level. Consequently, the total depth of the adder is $(k-1)(r+1) + 3r + 2 = kr + 2r + k + 1$.
If $\sqrt{\log n} \in \mathbb{N}$, we can choose $r = k = \sqrt{\log n}$ and obtain the following result.
Corollary 2.5. If $\sqrt{\log n} \in \mathbb{N}$, there is a multi-input generate adder for $n$ bits with fan-out
two, size at most
\[
3n\left(\log n + 2\sqrt{\log n}\right)2^{\sqrt{\log n}-1} + n2^{\sqrt{\log n}} + n \log n,
\]
and depth
\[
\log n + 3\sqrt{\log n} + 1.
\]
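In general, $\sqrt{\log n} \notin \mathbb{N}$, and we get the following result.
Theorem 2.6. Let $n \in \mathbb{N}$. For input pairs $(x_i, y_i)$ ($i \in \{1, \dots, n\}$), there is a circuit
computing all carry bits with maximum fan-out 2 and depth at most
\[
\log_2 n + 5\left\lceil \sqrt{\log_2 n} \right\rceil + 2.
\]
The size is at most
\[
4n \left\lceil \sqrt{\log_2 n} \right\rceil^2 2^{\left\lceil \sqrt{\log_2 n} \right\rceil} \tag{9}
\]
if $n \ge 16$, and at most
\[
8n \left\lceil \sqrt{\log_2 n} \right\rceil^2 2^{\left\lceil \sqrt{\log_2 n} \right\rceil} \tag{10}
\]
if $n \le 15$.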
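Proof. We choose $r = k = \left\lceil \sqrt{\log_2 n} \right\rceil$ and apply Lemma 2.4, obtaining
\[
3nk(r+2)2^{r-1} + n2^r + nrk = n\left(3(r^2 + 2r)2^{r-1} + 2^r + r^2\right). \tag{11}
\]
Now, if $n \ge 16$, we have $r = k \ge 2$. Thus, we can use $2r \le r^2$ and $2^r + r^2 \le r^2 2^r$ to bound
the right hand side by
\[
n\left(3\left(r^2 + r^2\right)2^{r-1} + r^2 2^r\right) = 4nr^2 2^r,
\]
implying (9).
Otherwise, $n \le 15$, $r = k \le 2$, $r^2 \le 2r$, $r^2 \le 2^r$, and the right hand side of (11) is
bounded by
\[
n\left(3(2r + 2r)2^{r-1} + 2^r + 2^r\right) = n(6r + 2)2^r \le 8nr2^r \le 8nr^2 2^r,
\]
implying (10).
The resulting depth is
\begin{align*}
kr + 2r + k + 1 &= \left\lceil \sqrt{\log_2 n} \right\rceil^2 + 3\left\lceil \sqrt{\log_2 n} \right\rceil + 1 \\
&\le \left(\left\lfloor \sqrt{\log_2 n} \right\rfloor + 1\right)^2 + 3\left\lceil \sqrt{\log_2 n} \right\rceil + 1 \\
&\le \log_2 n + 5\left\lceil \sqrt{\log_2 n} \right\rceil + 2.
\end{align*}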
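The bounds of Theorem 2.6 can be compared numerically with the expressions of Lemma 2.4. The following Python sketch evaluates both sides for a given n; it checks only the arithmetic of the proof and does not build any circuit. The function names are hypothetical, and the rounded-up square root is computed with integers to avoid floating-point edge cases.

```python
import math


def ceil_sqrt_log2(n):
    """Smallest integer r >= 1 with 2**(r*r) >= n, i.e. the rounded-up sqrt(log2 n) for n >= 2."""
    r = 1
    while 2 ** (r * r) < n:
        r += 1
    return r


def lemma_2_4(n, r, k):
    """Depth and size expressions of Lemma 2.4."""
    depth = k * r + 2 * r + k + 1
    size = 3 * n * k * (r + 2) * 2 ** (r - 1) + n * 2 ** r + n * r * k
    return depth, size


def theorem_2_6_limits(n):
    """Depth and size limits stated in Theorem 2.6."""
    r = ceil_sqrt_log2(n)
    depth_limit = math.log2(n) + 5 * r + 2
    size_limit = (4 if n >= 16 else 8) * n * r * r * 2 ** r
    return depth_limit, size_limit


def within_limits(n):
    r = ceil_sqrt_log2(n)
    depth, size = lemma_2_4(n, r, r)
    depth_limit, size_limit = theorem_2_6_limits(n)
    return depth <= depth_limit and size <= size_limit
```

For instance, n = 16 gives r = k = 2, a depth expression of 11 and a size expression of 896, against limits of 16 and 1024.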
If $\sqrt{\log_2 n} \notin \mathbb{N}$, the adder in Theorem 2.6 is larger than necessary since it has
$n' = 2^{\left\lceil \sqrt{\log_2 n} \right\rceil^2} > n$ inputs. If for example $n = 32$, we choose $r = k = 3$ and $n' = 512$. Thus, if
$\left\lceil \sqrt{\log_2 n} \right\rceil^2 \ge \log_2 n + \left\lceil \sqrt{\log_2 n} \right\rceil$, choosing $r = \left\lceil \sqrt{\log_2 n} \right\rceil - 1$ instead still yields an adder with
at least $n$ inputs and outputs and reduces the size and depth significantly. For $n = 32$, we
would still obtain a 64-input adder using this method.
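The example n = 32 can be worked through with a few lines of Python. The sketch below lists, for a given n, the default choice of r and k and the smaller choice of r whenever the resulting adder with $2^{rk}$ inputs still covers n; the depth is evaluated with the expression from Lemma 2.4. The function names are assumptions of this illustration.

```python
def adder_depth(r, k):
    """Depth expression kr + 2r + k + 1 from Lemma 2.4."""
    return k * r + 2 * r + k + 1


def parameter_choices(n):
    """Return tuples (r, k, number of inputs 2**(r*k), depth) that cover n inputs."""
    k = 1
    while 2 ** (k * k) < n:              # k is the rounded-up square root of log2 n
        k += 1
    choices = [(k, k, 2 ** (k * k), adder_depth(k, k))]
    if k > 1 and 2 ** ((k - 1) * k) >= n:
        choices.append((k - 1, k, 2 ** ((k - 1) * k), adder_depth(k - 1, k)))
    return choices
```

For n = 32 this yields the 512-input adder with r = k = 3 and the 64-input adder with r = 2, k = 3; the depth expression drops from 19 to 14.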
The analysis can be refined further by noticing that the columns n′ down to n+1 in the
augmented Kogge-Stone And-prefix graph and the multi-input gate graph can be omitted,
since they are not used for the computations of the first n output bits. This reduces the
size of the construction. If n′ > n, we can omit the left half of the construction and notice
that the right half of lowest row of multi-input generate gates only has 2r−1 inputs, so we
can actually use 2r−1-input generate gates and reduce the depth by 1. This process can
be iterated until n′ = n, which decreases the rounding error incurred in Theorem 2.6; the
depth is decreased by $\left\lceil \sqrt{\log_2 n} \right\rceil^2 - \log_2 n$.
In this section, we have achieved a depth bound of $\log_2 n + O(\sqrt{\log n}) = \log_2 n + o(\log_2 n)$,
which is asymptotically optimal, since the lower bound is $\log_2 n$.
3 Linearizing the Size of the Adder
To achieve a linear size while keeping the adder asymptotically fastest possible, we adopt
a technique similar to the construction by Brent and Kung [1982], which was first used as
a size-reduction tool by Krapchenko [1967] (see [Wegener 1987, pp. 42-46]).
3.1 Brent-Kung Step
Brent and Kung [1982] construct a prefix graph recursively as shown in Figure 5a. If $n$ is
at least two, it computes the $n/2$ intermediate results $z_n \circ z_{n-1}, \dots, z_2 \circ z_1$ (see Section 1.1
for the definition of $z_i$). A prefix graph for these $n/2$ inputs is used to compute the prefixes
$Z_{1,2i}$ at all even positions $2i$ ($i \in \{1, \dots, n/2\}$). For odd positions, the prefix needs to be corrected
by one more prefix gate as $Z_{1,2i+1} = z_{2i+1} \circ Z_{1,2i}$ ($i \in \{1, \dots, n/2 - 1\}$). We call this method
of input halving and output correction a Brent-Kung step. Note that the propagate signals
are not needed after the correction step. Thus, we can use reduced prefix gates (Figure 6)
in the output correction step. In these prefix gates, the left propagate signal $x_i$ is used only
once. Thus, the underlying logic circuit inherits the fan-out of two from the prefix graph.
The Brent-Kung step reduces the instance size by a factor of two, but it increases the
depth of the construction by four and the size by $\tfrac{5}{2}n$ in terms of logic gates.
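The recursion behind the Brent-Kung step can be written down compactly. The following Python sketch applies input halving, a recursive call on the n/2 paired inputs, and output correction at the odd positions. It computes full prefix pairs throughout and therefore does not model the reduced prefix gates; the function names are chosen for this illustration, and n is assumed to be a power of two.

```python
def prefix_op(zi, zj):
    """Prefix operator on pairs (x, y), as in equation (4)."""
    xi, yi = zi
    xj, yj = zj
    return (xi & xj, yi | (xi & yj))


def brent_kung_prefixes(z, op=prefix_op):
    """z[i-1] holds z_i. Returns [Z_{1,1}, Z_{1,2}, ..., Z_{1,n}]."""
    n = len(z)
    if n == 1:
        return [z[0]]
    # Input halving: z_2 o z_1, z_4 o z_3, ...
    pairs = [op(z[2 * i + 1], z[2 * i]) for i in range(n // 2)]
    inner = brent_kung_prefixes(pairs, op)        # inner[i-1] = Z_{1,2i}
    out = [z[0]]                                  # Z_{1,1}
    for i in range(1, n // 2 + 1):
        out.append(inner[i - 1])                  # even position 2i
        if 2 * i + 1 <= n:
            # Output correction: Z_{1,2i+1} = z_{2i+1} o Z_{1,2i}
            out.append(op(z[2 * i], inner[i - 1]))
    return out
```

Replacing the recursive call by any other routine that returns the prefixes of the paired inputs gives a single Brent-Kung step around an arbitrary inner adder, as in Figure 5a.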
Applying these Brent-Kung steps recursively, Brent and Kung obtain a prefix graph that
has prefix gate depth 2 log2 n− 1 and logic gate depth 4 log2 n− 2. The prefix gate depth
is not optimal anymore, but the adder has a comparatively small size of 12 (5n− log2 n− 8)
gates, and its fan-out is bounded by two at all inputs and gates. It is shown in Figure 5b.
Brent-Kung steps were actually known before the paper by Brent and Kung [1982]; for example,
they were already used in [Krapchenko 1967]. But the Brent-Kung adder is based solely
on these steps.

[Figure 5: Brent-Kung step and prefix graph. (a) Brent-Kung (reduction) step for inputs $z_8, \dots, z_1$ around any adder for 4 inputs. (b) Brent-Kung prefix graph for inputs $z_8, \dots, z_1$.]
3.2 Krapchenko’s Adder
Krapchenko’s adder is a non-prefix adder computing all carry bits with asymptotically
optimal depth and linear size. Its fan-out, on the other hand, is almost linear as well,
which makes it less useful in practice. Krapchenko’s techniques can be used to derive the
following reduction, based on Brent-Kung steps.
Lemma 3.1 ([Krapchenko 1967], see [Wegener 1987, pp. 42-46]). Let $\tau \le \log_2 n - 1$. Then,
given a family of adders computing $k$ carry bits with depth $d(k)$, maximum fan-out $f(k)$
and size $s(k)$, there is a family of adders computing $n$ carry bits with depth $d(n/2^\tau) + 4\tau$,
maximum fan-out $\max\{2, f(n/2^\tau)\}$ and size $s(n/2^\tau) + 5n$.
With size $s(n/2^\tau) + 5.5n$, we can achieve the same depth and a maximum fan-out of at
most $\max\{2, f(n/2^\tau)\}$.
Proof. We apply τ Brent-Kung steps and construct the remaining adder for n/2τ from the
given adder family. Figure 5a shows the situation for τ = 1. The simple application of τ
Brent-Kung steps would achieve the claimed depth and fan-out result, except with at most
2n additional 2-input prefix gates (because we will never add more prefix gates than are
present in the Brent-Kung prefix graph) and thus with 6n additional logic gates.
To see that 5n logic gates are enough, we show that we can omit the propagate signal
computation for the parity-correcting part of the Brent-Kung step. Such a reduced output
prefix gate is shown in Figure 6. With this construction, note that for i even, we have
computed (x, y) = z_i ◦ · · · ◦ z_1. For z_{i+1} = (y_{i+1}, x_{i+1}), the carry bit arising from position
i + 1 is c_{i+2} = x_{i+1} ∨ (y_{i+1} ∧ y), which uses two gates. It follows that a Brent-Kung step
uses only the propagate signals at the inputs. For the next Brent-Kung step, the inputs
are the n/2 pairs z_n ◦ z_{n−1}; . . . ; z_2 ◦ z_1, therefore we need three logic gates per prefix gate
for the reduction step.
Figure 6: Reduced output correction prefix gate of a refined Brent-Kung step. From the
inputs y_i, x_i and y_j, the gate computes y_i ∨ (x_i ∧ y_j).
Note that in Figure 5b, the propagate signal at a gate is used if and only if there is a
vertical line from this gate to another prefix gate (and not to an output or repeater). These
lines exist only in the “upper half” of the adder, i. e. the parts with depth ≤ log2 n. Since
parity correction occurs exclusively in the lower half with depth > log2 n, the propagate
signals from parity correction steps are never used.
As in the Brent-Kung prefix graph, n/2 repeaters can be used to distribute the fan-out and
reduce the maximum fan-out of the parity-correcting gates to two (see also Figure 5b).
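The gate count in this proof can be illustrated with a small Python sketch of the refined Brent-Kung step, applied recursively down to a single input. It is an illustration under stated assumptions, not the authors' construction: the names `refined_carries` and `check` are hypothetical, the propagate bit is assumed to be the XOR of the two addend bits, n is a power of two, and repeaters and fan-out are not modelled. Reduction gates are charged three logic gates, reduced correction gates two.

```python
from itertools import product


def refined_carries(g, p):
    """Carry out of every position, least significant first.

    g, p: generate and propagate bits; len(g) is a power of two.
    Returns (carries, logic_gate_count).
    """
    n = len(g)
    if n == 1:
        return [g[0]], 0
    # Reduction step: full prefix gates, three logic gates per pair.
    G = [g[i + 1] | (p[i + 1] & g[i]) for i in range(0, n, 2)]
    P = [p[i + 1] & p[i] for i in range(0, n, 2)]
    inner, gates = refined_carries(G, P)
    gates += 3 * (n // 2)
    carries = [g[0]]
    for k in range(n // 2):
        carries.append(inner[k])
        if 2 * k + 2 < n:
            # Reduced correction gate: only the generate signal, two logic gates.
            carries.append(g[2 * k + 2] | (p[2 * k + 2] & inner[k]))
            gates += 2
    return carries, gates


def check(n):
    """Compare with integer addition for all pairs of n-bit numbers."""
    gates = 0
    for a, b in product(range(2 ** n), repeat=2):
        g = [(a >> i) & (b >> i) & 1 for i in range(n)]
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
        carries, gates = refined_carries(g, p)
        expected = [((a % 2 ** (i + 1)) + (b % 2 ** (i + 1))) >> (i + 1)
                    for i in range(n)]
        assert carries == expected
    return gates


print(check(4), "logic gates used, bound 5n =", 5 * 4)
```

The inner call never has to return propagate signals, which is the property the proof relies on.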
The fact that the refined Brent-Kung step does not require the inner adder to provide
the propagate signals, which a prefix graph adder would provide, allows us to use the
multi-input generate adder with the size and depth bounds stated in Theorem 2.6, and
which omits the last r rows of And gates (hatched gates in Figure 3) in the augmented
Kogge-Stone And-prefix graph.
Lemma 3.1 can be used to achieve different trade-offs. In particular, constructions for
all carry bits of size up to n^{1+o(1)} can be turned into linear-size circuits with the same
asymptotic depth or depth guarantee, since we could choose τ = o(1) log2 n. This works
for prefix graphs and logic circuits; for example with τ = log2 log2 n, the Kogge-Stone
prefix graph will have size 3n, depth log2 n + 2 log2 log2 n and fan-out bounded by two in
terms of prefix gates [Han and Carlson 1987].
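The Kogge-Stone example can be checked numerically with the following sketch, counted in prefix gates. The function names are hypothetical. The sketch assumes that a Kogge-Stone prefix graph on k = 2^m inputs has depth m and k·m − k + 1 prefix gates, and that one Brent-Kung step on w inputs adds two levels and w − 1 prefix gates. Fan-out is not modelled.

```python
def kogge_stone(k):
    """(depth, prefix gate count) of a Kogge-Stone prefix graph on k = 2^m inputs."""
    m = k.bit_length() - 1
    return m, k * m - k + 1


def reduced_kogge_stone(n, tau):
    """tau Brent-Kung steps around a Kogge-Stone graph for n / 2^tau inputs."""
    depth, gates = kogge_stone(n >> tau)
    for step in range(tau):
        width = n >> step
        depth += 2          # one reduction level and one correction level
        gates += width - 1  # width/2 reduction gates, width/2 - 1 correction gates
    return depth, gates


# tau = log2 log2 n for n = 2^16, 2^32, 2^64
for m, tau in ((16, 4), (32, 5), (64, 6)):
    n = 1 << m
    depth, gates = reduced_kogge_stone(n, tau)
    print(m, depth, m + 2 * tau, gates <= 3 * n)
```

In these instances the depth stays within log2 n + 2 log2 log2 n and the number of prefix gates within 3n.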
While the technique in Lemma 3.1 is essentially a 2-input prefix gate construction, the
main result of [Krapchenko 1967] cannot be constructed using only prefix gates.
3.3 Adders with Asymptotically Minimum Depth, Linear Size, and Fan-Out Two
By combining Theorem 2.6 and Lemma 3.1, we get an adder of asymptotically minimum
depth, linear size and with fan-out at most two.
Theorem 3.2. There is an adder for n inputs of size bounded by 13.5n with depth
\[
\log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 2
\]
and maximum fan-out two. If n ≥ 4096, the size can be bounded by 9.5n.
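Proof. We apply Lemma 3.1 with
$\tau = \left\lceil\sqrt{\log_2 n} + 2\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil$
and use an adder for n/2^τ inputs according to Theorem 2.6 as an inner adder. From the proof
of Lemma 3.1, we have seen that the output correction of the Brent-Kung step does not require
propagate signals from the inner adder. So the fan-out is indeed two. Using (10), this results
in an adder of size
\[
\begin{aligned}
&\left\lceil 8\,\frac{n}{2^{\tau}}\; 2^{\left\lceil\sqrt{\log_2 \frac{n}{2^{\tau}}}\right\rceil} \left\lceil\sqrt{\log_2 \tfrac{n}{2^{\tau}}}\right\rceil^{2} \right\rceil + 5.5n \\
&\quad\le \left\lceil 8\,\frac{n}{2^{\left\lceil\sqrt{\log_2 n}\right\rceil + 2\log_2\left\lceil\sqrt{\log_2 n}\right\rceil}}\; 2^{\left\lceil\sqrt{\log_2 n}\right\rceil} \left\lceil\sqrt{\log_2 n}\right\rceil^{2} \right\rceil + 5.5n \\
&\quad\le 8n + 5.5n = 13.5n.
\end{aligned}
\]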
If n ≥ 4096, we have n/2^τ ≥ 16, which allows us to apply the alternative bound (9) to achieve
a size bound of 9.5n.
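The depth is
\[
\begin{aligned}
\log_2 \frac{n}{2^{\tau}} + 5\left\lceil\sqrt{\log_2 \tfrac{n}{2^{\tau}}}\right\rceil + 2 + 4\tau
&= \log_2 n + 5\left\lceil\sqrt{\log_2 \tfrac{n}{2^{\tau}}}\right\rceil + 2 + 3\tau \\
&\le \log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 2,
\end{aligned}
\]
where we are using
$\tau \le \left\lceil\sqrt{\log_2 n}\right\rceil + 2\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil$
for the inequality.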
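The parameter choice in this proof can be evaluated for concrete n with a short Python sketch. The function names are hypothetical. The inner size term and the inner depth log2 k + 5⌈√log2 k⌉ + 2 are copied from the expressions used in the proof above, and n = 2^m is assumed.

```python
import math


def ceil_sqrt(m):
    """Smallest integer c with c * c >= m, for an integer m >= 0."""
    c = math.isqrt(m)
    return c if c * c == m else c + 1


def theorem_3_2(m):
    """Evaluate the quantities from the proof for n = 2^m."""
    n = 1 << m
    c = ceil_sqrt(m)                              # ceil(sqrt(log2 n))
    tau = math.ceil(math.sqrt(m) + 2 * math.log2(c))
    assert tau <= m - 1, "Lemma 3.1 needs tau <= log2(n) - 1"
    k_log = m - tau                               # log2(n / 2^tau)
    ck = ceil_sqrt(k_log)
    inner_size = 8 * (n >> tau) * 2 ** ck * ck ** 2   # size term taken from the proof
    size = inner_size + 11 * n // 2               # plus 5.5n from Lemma 3.1
    depth = k_log + 5 * ck + 2 + 4 * tau
    depth_bound = m + 8 * c + 6 * (c - 1).bit_length() + 2  # (c-1).bit_length() = ceil(log2 c)
    return {"n": n, "tau": tau, "size": size, "depth": depth,
            "depth_bound": depth_bound}


for m in (12, 16, 20, 24):
    r = theorem_3_2(m)
    print(m, r["tau"], r["size"] / r["n"], r["depth"], r["depth_bound"])
```

For these values the size stays below 13.5n and the depth below the bound of Theorem 3.2. The sketch evaluates the formulas only and does not build the circuit.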
From Theorem 3.2, we can easily conclude our main result stated in Section 1.3:
Theorem 1.1 (Main Theorem). Given two n-bit numbers A,B, there is a logic circuit
computing the sum A + B, using gates with fan-in and fan-out two and that has depth
log2 n+ o(log n) and size O(n).
4 Technology Mapping
In this section we show that our construction from Theorem 3.2 can be transformed into
an adder using only Nand/Nor, and Not gates, which are faster in current CMOS tech-
nologies. This increases the depth by one and the size by a small constant factor.
Theorem 4.1. There is an adder for n inputs using only Nand, Nor, and Not gates.
Its size is bounded by (18 + 1/3)n, the depth is at most
\[
\log_2 n + 8\left\lceil\sqrt{\log_2 n}\right\rceil + 6\left\lceil\log_2\left\lceil\sqrt{\log_2 n}\right\rceil\right\rceil + 3,
\]
and the maximum fan-out is two. If n ≥ 4096, the size is bounded by (15 + 5/6)n.
In the next two sections, we show how to transform the two main components of our
construction, the Brent-Kung steps and the multi-input multi-output generate gate adder,
into circuits using only the desired gates.
4.1 Mapping Brent-Kung Steps
Brent-Kung steps can be implemented using Nand/Not prefix gates as shown in Figure 7
in the reduction step. Similarly, the reduced output correction gate in Figure 6 can be
realized by two Nand gates and one Not gate, i.e. by eliminating the two rightmost gates
in Figure 7. The modified prefix gates do not increase the depth of the Brent-Kung step,
and increase the size by a constant factor less than 5/3.
Figure 7: A Nand/Not prefix gate used in the reduction step. From the inputs y_i, x_i, y_j
and x_j, it computes x_i ∧ x_j and y_i ∨ (x_i ∧ y_j).
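One possible Nand/Not realization of the two gate types is sketched below in Python. The exact wiring of Figure 7 is not recoverable from the text, so the functions `prefix_gate_nand` and `reduced_gate_nand` are assumptions. They are chosen to match the stated gate counts: three Nand and two Not gates for the full prefix gate, and two Nand gates and one Not gate for the reduced correction gate.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def inv(a):
    return 1 - a


def prefix_gate_nand(y_i, x_i, y_j, x_j):
    """Assumed 5-gate realization; returns (generate, propagate)."""
    generate = nand(inv(y_i), nand(x_i, y_j))   # two Nand gates, one Not gate
    propagate = inv(nand(x_i, x_j))             # one Nand gate, one Not gate
    return generate, propagate


def reduced_gate_nand(y_i, x_i, y_j):
    """Assumed 3-gate realization of the reduced output correction gate."""
    return nand(inv(y_i), nand(x_i, y_j))


def matches_and_or_gates():
    """Exhaustively compare with y_i OR (x_i AND y_j) and x_i AND x_j."""
    for y_i, x_i, y_j, x_j in product((0, 1), repeat=4):
        want = (y_i | (x_i & y_j), x_i & x_j)
        if prefix_gate_nand(y_i, x_i, y_j, x_j) != want:
            return False
        if reduced_gate_nand(y_i, x_i, y_j) != want[0]:
            return False
    return True


print(matches_and_or_gates())
```

In this realization every output is reached after at most two gates, the same as the And-Or path of the original prefix gate, and the full gate uses five gates instead of three.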
4.2 Mapping Multi-Input Multi-Output Generate Adders
We want to transform a multi-input multi-output generate gate adder using DeMorgan’s
laws. For easier understanding, we first insert repeaters so that the gates can be arranged
in rows, such that all input signals for gates in odd rows are computed in even rows and vice
versa. This bipartite structure is already given in the augmented Kogge-Stone And-prefix
graph (see Figure 3).
Let us now consider a multi-input generate gate as shown in Figure 2. By inserting 2r/2
repeater gates in the last row of the And-suffix graph, we achieve a uniform depth of this
first stage. The red row of And gates and the final 2r−1-output Or already have a uniform
depth. The additional repeaters increase the size by less than a factor of 5/3. Except for
the first row of generate gates, the depth of the generate signals equals the depth of the
propagate signals when they are merged in the red row of And gates. In the first row of
generate gates, the propagate signals arrive there at depth 2r+1, while the generate signals
arrive at depth r−1 (see the proof of Lemma 2.4). Thus, if r is odd, we add one additional
repeater at every generate input signal so that it arrives at an odd depth at the red level of
And gates. Note that we can do this without increasing the overall depth, as we already
assumed that the generate signals are delayed by r+1 in the proof of Lemma 2.4. At most
n repeaters are inserted this way.
Some generate gates of the multi-input generate gate adder are just buffer trees, i.e. blue
boxes in Figure 4. They have depth r−1, which is odd if and only if the depth r+1 of the
corresponding paths of generate signals through multi-input generate gates is odd. Thus,
they preserve the bipartite structure.
Now we can use the bipartite structure to transform the multi-input multi-output gener-
ate adder into a circuit consisting of Nand, Nor, and Not gates. In our construction we
will maintain the following characteristics. Inputs to an odd row, i.e. outputs of an even
row, will be the original function values, while inputs to an even row, i.e. outputs of an odd
row, will be the negated original function values. We achieve this by transforming gates
as follows: Repeaters are always transformed into Not gates. In odd rows, we translate
And gates into Nand gates and Or gates into Nor gates. In even rows, we translate And
gates into Nor gates and Or gates into Nand gates. If the number of rows is odd, we add
one row of Not gates to correct the otherwise negated outputs of the adder.
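The row-wise translation can be sketched in Python on a small layered circuit. This is an illustration only: the circuit encoding, the names `evaluate`, `map_rows` and `equivalent`, and the example circuits are hypothetical, and every gate is assumed to read only from the row directly before it.

```python
from itertools import product

OPS = {
    "AND": lambda *v: all(v),
    "OR": lambda *v: any(v),
    "REP": lambda v: v,
    "NAND": lambda *v: not all(v),
    "NOR": lambda *v: not any(v),
    "NOT": lambda v: not v,
}
ODD_ROW = {"AND": "NAND", "OR": "NOR", "REP": "NOT"}
EVEN_ROW = {"AND": "NOR", "OR": "NAND", "REP": "NOT"}


def evaluate(rows, inputs):
    """Each gate is (kind, indices into the outputs of the previous row)."""
    values = list(inputs)
    for row in rows:
        values = [OPS[kind](*(values[i] for i in sources)) for kind, sources in row]
    return values


def map_rows(rows):
    """Translate And/Or/repeater rows into Nand/Nor/Not rows by row parity."""
    mapped = []
    for number, row in enumerate(rows, start=1):
        table = ODD_ROW if number % 2 == 1 else EVEN_ROW
        mapped.append([(table[kind], sources) for kind, sources in row])
    if len(rows) % 2 == 1:
        # Odd number of rows: the outputs are negated, so add one row of Not gates.
        mapped.append([("NOT", (i,)) for i in range(len(rows[-1]))])
    return mapped


def equivalent(rows, width):
    mapped = map_rows(rows)
    return all(evaluate(rows, bits) == evaluate(mapped, bits)
               for bits in product((False, True), repeat=width))


# Hypothetical three-row circuit on four inputs.
THREE_ROWS = [
    [("AND", (0, 1)), ("OR", (2, 3)), ("REP", (0,))],
    [("OR", (0, 1)), ("REP", (2,))],
    [("AND", (0, 1))],
]
# y_i OR (x_i AND y_j) on inputs (y_i, x_i, y_j), with a repeater on y_i.
TWO_ROWS = [
    [("REP", (0,)), ("AND", (1, 2))],
    [("OR", (0, 1))],
]

print(equivalent(THREE_ROWS, 4), equivalent(TWO_ROWS, 3))
```

Applied to `TWO_ROWS`, the translation yields one Not gate and two Nand gates.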
Together with the n repeaters that we insert behind each generate input signal if r is odd,
this makes 2n gates that can be accounted for by the size of the augmented Kogge-Stone
And-prefix graph (see Figure 3), which is at least 3n if r ≥ 1. Thus, the overall size of the
generate adder increases by at most a factor of 5/3.
Together with the mapping of the Brent-Kung step in Section 4.1, this proves Theo-
rem 4.1.
Conclusion
We introduced the first full adder with an asymptotically optimum depth, linear size and
a maximum fan-out of two. Asymptotically, this is twice as fast and significantly smaller
than the Kogge-Stone adder, which is often considered the fastest adder circuit, as well as
most other prefix graph adders.
For small n, Theorem 3.2 will not immediately improve upon existing adders. When
focusing on speed for small n, one would rather omit the size reduction from Section 3.
Without the size reduction, our results in Lemma 2.4 match the depth of the Kogge-Stone
adder for 512 inputs and improve on it for 2048 inputs, where r = 3, k = 4 yields an adder
with depth 21 for our construction, but the adder of Kogge-Stone will have depth 22.
Today’s microprocessors contain adders for a few hundred bits. However, adders for
2048-bit numbers already exist today on cryptographic chips. Thus we expect that adders based
on our ideas will find their way into hardware.
References
[Brent 1970] R.P. Brent. On the Addition of Binary Numbers. IEEE Transactions on Com-
puters 19.8 (1970): 758–759.
[Brent and Kung 1982] R.P. Brent and H.-T. Kung. A regular layout for parallel adders.
IEEE Transactions on Computers 100.3 (1982): 260–264.
[Chatterjee et al. 2006] S. Chatterjee, A. Mishchenko, R. Brayton, X. Wang, and T. Kam.
Reducing structural bias in technology mapping. IEEE Transactions on Computer
Aided Design of Integrated Circuits and Systems 25.12 (2006): 2894–2903.
[Fich 1983] F.E. Fich. New bounds for parallel prefix circuits. Proceedings of the 15th
Annual ACM Symposium on Theory of Computing (STOC). ACM, 1983.
[Gashkov et al. 2007] S.B. Gashkov, M.I. Grinchuk, and I.S. Sergeev. On the construction
of schemes for adders of small depth. Diskretnyi Analiz i Issledovanie Operatsii, Ser.
1, 14.1 (2007): 27–44 (in Russian). English translation in Journal of Applied and
Industrial Mathematics 2.2, (2008): 167-178.
[Grinchuk 2008] M.I. Grinchuk. Sharpening an upper bound on the adder and compara-
tor depths. Diskretnyi Analiz i Issledovanie Operatsii, Ser. 1, 15.2 (2008): 12-22 (in
Russian). English translation in Journal of Applied and Industrial Mathematics 3.1,
(2009): 61–67.
[Han and Carlson 1987] T. Han and D.A. Carlson. Fast Area Efficient VLSI Adders. 8th
IEEE Symposium on Computer Arithmetic (1987): 49–56.
[Held and Spirkl 2014] S. Held and S. T. Spirkl. Fast Prefix Adders for Non-Uniform Input
Arrival Times. Algorithmica (2015).
[Hoover et al. 1984] H.J. Hoover, M.M. Klawe, and N.J. Pippenger. Bounding Fan-out in
Logical Networks. Journal of the ACM (JACM) 31.1 (1984): 13–18.
[Keutzer 1988] K. Keutzer. DAGON: technology binding and local optimization by DAG
matching. Papers on Twenty-five years of electronic design automation, ACM (1988):
617–624.
[Knowles 1999] S. Knowles. A family of adders. Proceedings of 14th IEEE Symposium on
Computer Arithmetic (1999): 277 – 281.
[Kogge and Stone 1973] P.M. Kogge and H.S. Stone. A parallel algorithm for the efficient
solution of a general class of recurrence equations. IEEE Transactions on Computers
C-22.8 (1973): 786–793.
[Krapchenko 1967] V.M. Krapchenko. Asymptotic estimation of addition time of a parallel
adder. Problemy Kibernetiki 19 (1967): 107–122 (in Russian). English translation in
System Theory Res. 19 (1970): 105–122.
[Krapchenko 2007] V.M. Krapchenko. On Possibility of Refining Bounds for the Delay of
a Parallel Adder. Diskretnyi Analiz i Issledovanie Operatsii, Ser. 1, 14.1 (2007): 87–
93. English translation in Journal of Applied and Industrial Mathematics 2.2 (2008):
211–214.
[Ladner and Fischer 1980] R.E. Ladner and M.J. Fischer. Parallel prefix computation. Jour-
nal of the ACM (JACM) 27.4 (1980): 831–838.
[Lupanov 1962] O.B. Lupanov. A class of schemes of functional elements. Problemy Kiber-
netiki 7 (1962): 61–114. English translation in Problems of Cybernetics 7 (1963): 68–
136.
[Ofman 1962] Y.P. Ofman. The algorithmic complexity of discrete functions. Doklady
Akademii Nauk SSSR 145.1 (1962): 48–51. English translation in Soviet Physics Dok-
lady 7 (1963): 589–591.
[Rautenbach et al. 2008] D. Rautenbach, C. Szegedy and J. Werber. On the cost of optimal
alphabetic code trees with unequal letter costs. European Journal of Combinatorics 29.2
(2008): 386-394.
[Sergeev 2013] I. Sergeev. On the complexity of parallel prefix circuits. Electronic Collo-
quium on Computational Complexity (ECCC). Vol. 20. 2013.
[Sklansky 1960] J. Sklansky. Conditional-sum addition logic. Electronic Computers, IRE
Transactions on 2 (1960): 226-231.
[Wegener 1987] I. Wegener. The complexity of Boolean functions. Wiley-Teubner (1987).
[Werber et al. 2007] J. Werber, D. Rautenbach and C. Szegedy. Timing optimization by
restructuring long combinatorial paths. Proceedings of the 2007 IEEE/ACM interna-
tional conference on Computer-aided design (2007): 536–543.
