Fast Prefix Adders for Non-Uniform Input Arrival Times by Held, Stephan & Spirkl, Sophie
arXiv:1411.2917v1 [cs.AR] 11 Nov 2014
Fast Prefix Adders
for Non-Uniform Input Arrival Times
Stephan Held and Sophie Theresa Spirkl
{held,spirkl}@or.uni-bonn.de
Research Institute for Discrete Mathematics, University of Bonn
Abstract
We consider the problem of constructing fast and small parallel prefix adders for non-uniform
input arrival times. This problem arises whenever the adder is embedded into
a more complex circuit, e. g. a multiplier.
Most previous results are based on representing binary carry-propagate adders as
so-called parallel prefix graphs, in which pairs of generate and propagate signals are
combined using complex gates known as prefix gates. Adders constructed in this model
usually minimize the delay in terms of these prefix gates. However, the delay in terms
of logic gates can be worse by a factor of two.
In contrast, we aim to minimize the delay of the underlying logic circuit directly. We
prove a lower bound on the delay of a carry bit computation achievable by any prefix
carry bit circuit and develop an algorithm that computes a prefix carry bit circuit with
optimum delay up to a small additive constant. Furthermore, we use this algorithm to
construct a small parallel prefix adder.
Compared to existing algorithms, we simultaneously improve the delay and size guarantees,
as well as the running time for constructing prefix carry bit and adder circuits.
1 Introduction
The addition of binary numbers is one of the most fundamental computational tasks performed
by computer chips. Given two binary addends A = (an . . . a1) and B = (bn . . . b1), where index n
denotes the most significant bit, their sum S = A+B has n+ 1 bits. For each position 1 ≤ i ≤ n,
we compute a generate signal gi and a propagate signal pi, which are defined as follows:
gi = ai ∧ bi,
pi = ai ⊕ bi, (1)
where ∧ and ⊕ denote the binary And and Xor functions. The carry bit at position i + 1 can
be computed recursively as ci+1 = gi ∨ (pi ∧ ci) [4, 12]. From the carry bits, we can compute the
output S via si = ci ⊕ pi for 1 ≤ i ≤ n and sn+1 = cn+1.
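The carry recursion above can be illustrated with a minimal Python sketch. It is not part of the original construction; the function names are ours, and we assume there is no carry-in, i.e. c1 = 0.

```python
def add_via_carries(a_bits, b_bits):
    """Add two n-bit numbers given as bit lists [x_1, ..., x_n], least significant bit first."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]  # generate signals g_i = a_i AND b_i
    p = [a ^ b for a, b in zip(a_bits, b_bits)]  # propagate signals p_i = a_i XOR b_i
    c = [0] * (n + 1)  # c[i] stores c_{i+1}; c[0] = c_1 = 0 is our assumption (no carry-in)
    for i in range(n):
        # Carry recursion from the text, with list indices shifted by one.
        c[i + 1] = g[i] | (p[i] & c[i])
    # Sum bits s_1, ..., s_n followed by s_{n+1} = c_{n+1}.
    return [c[i] ^ p[i] for i in range(n)] + [c[n]]


def to_bits(x, n):
    """Hypothetical helper: the n lowest bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Hypothetical helper: inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))


print(from_bits(add_via_carries(to_bits(11, 4), to_bits(6, 4))))  # 11 + 6 = 17
```

This computes the carries serially, which corresponds to the ripple-carry structure; the prefix formulation below allows other evaluation orders.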
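Figure 1: Prefix gate and underlying logic circuit. Left: a prefix gate with inputs z, z′ and output z ◦ z′. Right: the logic circuit with inputs g, p, g′, p′, gates labeled A, B and C, and outputs p ∧ p′ and g ∨ (p ∧ g′).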
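For two pairs (gi, pi) and (gj , pj) of generate and propagate signals, we define a binary prefix
operator as
\[
\begin{pmatrix} g_i \\ p_i \end{pmatrix} \circ \begin{pmatrix} g_j \\ p_j \end{pmatrix}
= \begin{pmatrix} g_i \vee (p_i \wedge g_j) \\ p_i \wedge p_j \end{pmatrix}. \tag{2}
\]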
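This operator is associative, and it can be used to compute the carry bit ci+1 using the identity
\[
\begin{pmatrix} c_{i+1} \\ p_i \wedge p_{i-1} \wedge \dots \wedge p_1 \end{pmatrix}
= \begin{pmatrix} g_i \\ p_i \end{pmatrix} \circ \begin{pmatrix} g_{i-1} \\ p_{i-1} \end{pmatrix} \circ \dots \circ \begin{pmatrix} g_1 \\ p_1 \end{pmatrix}.
\]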
The prefix operator allows us to simplify notation by combining generate and propagate signals
into a single term zi = (gi, pi) and computing ci+1 as the first component of zi ◦ · · · ◦ z1. Figure 1
shows a prefix gate computing z ◦ z′ for the prefix operator in (2) on the left and its underlying
logic circuit on the right.
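As a small sanity check of the prefix operator (2), the following Python sketch implements it on pairs (g, p) and folds it over z_n, ..., z_1. The names prefix_op and carry_out are ours and purely illustrative.

```python
from functools import reduce


def prefix_op(z, z_prime):
    """Prefix operator (2): (g, p) o (g', p') = (g OR (p AND g'), p AND p')."""
    g, p = z
    g2, p2 = z_prime
    return (g | (p & g2), p & p2)


def carry_out(zs):
    """zs = [z_1, ..., z_n]; returns the first component of z_n o ... o z_1."""
    return reduce(prefix_op, reversed(zs))[0]


# z_1 generates a carry, z_2 and z_3 propagate it.
print(carry_out([(1, 0), (0, 1), (0, 1)]))
```

Because the operator is associative, any bracketing of z_n ◦ · · · ◦ z_1 gives the same result; only the order of the operands must be preserved, since ◦ is not commutative.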
Formally, a logic circuit is a non-empty connected acyclic directed graph consisting of nodes that
are either inputs with at least one outgoing edge and no incoming edges, outputs with exactly one
incoming edge and no outgoing edges, or gates with one or two incoming edges representing one of
the 2-bit logical functions And (∧), Or (∨), Xor (⊕), Not and their negations.
The number of gates is the size of the circuit. The (maximum) fan-out of the circuit is the
maximum fan-out (out-degree) of its nodes. The depth of the circuit is the maximum number of
gates on a directed path.
A logic circuit with inputs g1, p1, . . . , gn, pn is called a prefix carry bit circuit if it computes cn+1
and p1 ∧ · · · ∧ pn, it is built from prefix operator gadgets in Figure 1, and the subcircuit computing
p1 ∧ · · · ∧ pn is a tree. Similarly, a prefix adder is a logic circuit built using the gadgets in Figure 1
that computes ci+1 and p1 ∧ · · · ∧ pi for all i = 1, . . . , n at its 2n outputs.
A graph that arises from a prefix carry bit circuit by contracting each gadget into a prefix gate as
in Figure 1, and by contracting all input pairs (gi, pi) into zi and the output pair (cn+1, pn∧· · ·∧p1),
is called a prefix tree. Likewise, a parallel prefix graph arises from a prefix adder by contracting
each gadget, all input pairs (gi, pi) = zi and output pairs (ci+1, p1 ∧ · · · ∧ pi) for all i = 1, . . . , n.
For inputs z1, . . . , zn, a prefix tree computes the last carry bit of an addition zn ◦ · · · ◦ z1, while a
parallel prefix graph computes zi ◦ · · · ◦ z1 for all 1 ≤ i ≤ n, i. e. all carry bits of an addition.
When aiming for a bounded fan-out, we allow the use of repeater gates with fan-in one (a single
incoming edge) and fan-out at least two in all types of circuits and graphs.
An example of the transition between parallel prefix graphs and prefix adders is given in Figure 2.
On the left the serial parallel prefix graph with depth 3 is replaced by an And-Or-path with logic
circuit depth 6 known as the ripple-carry adder. For the Kogge-Stone parallel prefix graph [5] on
the right, the depth increases from two to four and the maximum fan-out increases from two to
three.
Figure 2: Prefix graphs as logic circuits. (a) Serial prefix graph. (b) And-Or-path. (c) Kogge-Stone prefix graph. (d) Kogge-Stone logic circuit. The prefix graphs have inputs z4, . . . , z1; the logic circuits have inputs g4, p4, . . . , g1, p1.
Additions are typically not performed as isolated tasks, but the input signals result from preceding
computational stages and become available at different fixed arrival times ti ∈ N0 (i ∈ {1, . . . , n}),
e. g. when used within a multiplier. Here we make the simplifying assumption that gi and pi have
the same arrival time at the inputs, which is essentially fulfilled if they are generated as in (1).
We define the delay of a directed path in a logic circuit starting at an input as its depth plus its
input arrival time. The delay of a vertex is the maximum delay of a path ending in the vertex and
the delay of the circuit is the maximum delay of its outputs. Depth and delay coincide if all input
arrival times are zero. Paths and outputs attaining the delay of the circuit are called critical. The
delay of all vertices can be computed in linear time by a longest path computation in an acyclic
network.
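The longest path computation mentioned above can be sketched as follows. This is an illustrative sketch only: the dictionary-based circuit encoding and the node names are our assumptions, and every gate is assumed to add one unit of delay.

```python
from graphlib import TopologicalSorter


def circuit_delays(preds, arrival):
    """preds maps each gate to its predecessor nodes; arrival maps inputs to arrival times."""
    delay = dict(arrival)
    # static_order() lists every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        if node in preds:  # a gate: one unit later than its latest predecessor
            delay[node] = 1 + max(delay[u] for u in preds[node])
    return delay


# Logic circuit of a single prefix gate as in Figure 1 (gate roles are our reading):
# A = p AND p', B = p AND g', C = g OR B.
preds = {"A": ["p", "p'"], "B": ["p", "g'"], "C": ["g", "B"]}
arrival = {"g": 0, "p": 0, "g'": 0, "p'": 0}
delays = circuit_delays(preds, arrival)
print(max(delays["A"], delays["C"]))  # delay of the circuit over its two outputs
```

Each node and edge is visited a constant number of times, which matches the linear running time stated above.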
Figure 3: Different arrival time profiles and their optimum prefix trees on inputs z5, . . . , z1. (a) Arrival times 0, 0, 0, 0, 0; delay 4. (b) Arrival times 4, 3, 2, 1, 0; delay 6. (c) Arrival times 0, 1, 2, 3, 4; delay 7.
In Figure 3, we show an example with five inputs and its optimum solutions for different arrival
time patterns. None of the three trees is optimal for either of the other two arrival time sequences.
We aim for prefix carry bit circuits and adders with close to minimum delay and small size.
A minimum-depth prefix graph for uniform input arrival times is given by [5, 6] and has depth
⌈log2 n⌉ in terms of prefix gates, but a non-minimal depth of 2⌈log2 n⌉ as a logic circuit. For non-uniform
arrival times, these circuits might be worse than the lower bound by a factor of three, for
example for the arrival time pattern t1 = log2 n and t2 = · · · = tn = 0. In Figure 2c, if z1 has
arrival time two and all other arrival times are zero, the delay of Figure 2d is 6.
Parallel prefix graphs minimizing the overall prefix graph delay for special input arrival time
patterns that occur mostly in certain multipliers were presented in [14, 7].
An algorithm for constructing optimum-delay parallel prefix graphs for arbitrary non-uniform
input arrival times is given in [2]; however, this approach may require O(n^2) gates for a full n-bit
adder. In [11] parallel prefix adders are enumerated with heuristic pruning to achieve good
performance-area tradeoffs in practice. All these approaches minimize the delay of the prefix graph
rather than the underlying logic circuit. As the prefix operator contains two subsequent gates, the
resulting delay of the underlying circuit may be worse by a factor of two.
As is common practice in logic synthesis [5, 6, 13, 11], we use a simple technology-independent
circuit and delay model in this work. In hardware, the delay of a gate certainly depends on its
physical structure. In CMOS technology, Nand/Nor gates are faster than And/Or gates and
efficient implementations exist for integrated multi-input And-Or-Inversion gates and Or-And-
Inversion gates. We assume that circuits are re-mapped into logically equivalent circuits based
on technology specific delays in a technology mapping step [3, 1] after logic synthesis. Despite its
simplicity, the simple circuit and delay model is successfully used in practice for re-optimizing carry
bit functions even late in the design flow [13].
1.1 Our contribution
We will use the delay properties of the prefix operator (2) aiming to minimize the delay of logic
circuits for additions instead of the corresponding prefix graphs. This idea was used by [8], who
proposed a cubic-time dynamic programming algorithm to compute a fast carry bit circuit.
With a deeper structural analysis of near-optimum prefix trees in Section 2, we can construct a
carry bit circuit with a better delay bound, size, and running time as shown in the rows with type
“Carry” of Table 1.
In Section 3, we apply the carry bit algorithm to substantially improve the delay bound given in
[9] for a full n-bit adder with input arrival times t1, . . . , tn ∈ N0. The result is listed in the rows
with type “Adder” in Table 1.
Finally, in Section 4, we prove a lower bound on the delay of any prefix carry bit circuit, which
shows that our carry bit algorithm is delay-optimal up to an additive constant of 5.
        Type   Delay                          Size             Fan-out   Running Time
[8]     Carry  1.441W + 3                     4n − 3           1, 2, 3   O(n^3) / O(n^3 log n)
Here    Carry  1.441W + 2.674                 3n − 3           1, 2      O(n log n) / O(n log^2 n)
[9]     Adder  2W + 6 log2 log2 n + O(1)      6n log2 log2 n   √n + 1    O(n^2) / O(n^2 log n)
Here    Adder  1.441W + 5 log2 log2 n + 4.5   6n log2 log2 n   √n + 1    O(n log n) / O(n log^2 n)
Table 1: Improvements over [8, 9], where $W = \log_2\big(\sum_{i=1}^{n} 2^{t_i}\big)$ is a lower bound for the delay.
Running times assume constant/linear time for binary addition.
2 Algorithm for Single Carry Bit Circuits
We start with a method that, given a parallel prefix graph, allows us to compute the delay of the
underlying logic circuit up to an additive error of one.
Proposition 2.1. Given a parallel prefix graph or prefix tree, we propagate the arrival times (which
might all be zero) through the prefix gates so that the delay t of a gate with left input (higher indices) l
and right input (lower indices) r with delay tl and tr, respectively, is defined as t = max{tr+2, tl+1}.
Let d be the maximum delay computed with this procedure, maximized over all gates, inputs and
outputs, then the delay D of the logic circuit corresponding to the given prefix graph or prefix tree
satisfies d ≤ D ≤ d+ 1.
Proof. This is a consequence of a longest path computation in acyclic networks. We construct a
logic circuit from the prefix graph. For every input pair (g, p) corresponding to an input with arrival
time t, we set the arrival time of g and p to be t and t−1, respectively. We prove by induction that
for every signal pair (g, p) corresponding to a signal z computed in the prefix graph, g has delay at
least one more than p, and if z is not an input, the delay of g is the maximum of two plus the delay
of the generate signal of its right predecessor and one plus the delay of the generate signal of its
left predecessor. This is clear for inputs. Now consider a signal z ◦ z′ as in Figure 1. Let $t_g$, $t_p$, $t_{g'}$,
and $t_{p'}$ denote the delays of g, p, g′ and p′, respectively. By induction hypothesis, $t_{p'} + 1 \le t_{g'}$ and
$t_p + 1 \le t_g$. Therefore, g ∨ (p ∧ g′) has delay
\[
\max\{t_p + 2,\ t_{g'} + 2,\ t_g + 1\} = \max\{t_{g'} + 2,\ t_g + 1\}.
\]
Furthermore, $\max\{t_p + 2,\ t_{g'} + 2,\ t_g + 1\} \ge 1 + \max\{t_p + 1,\ t_{p'} + 1\}$, which proves that p′ ∧ p is
indeed available at least one time unit earlier.
Inductive application of the argument above yields that the generate signal of every output
arrives at time ≤ d under the assumption that all propagate signals of the inputs arrive one time
unit earlier than their actual arrival time. Shifting all computed delays up by one time unit yields
that the delay of the logic circuit for the actual arrival times is at most d+ 1.
To show that d ≤ D, consider only the generate signals, i. e. using the notation of Figure 1,
consider a logic circuit G in which all gates of type A and inputs pi computing propagate signals
are removed, and gates of type B are replaced by repeaters. It follows that for every output c, the
subcircuit of G which consists of all ancestors of c is a tree. This is certainly true in the parallel
prefix graph, and after the removal of propagate signals, every prefix gate internally corresponds
to a tree as well. Removing gates and inputs from a circuit does not increase its delay, because a
critical path in G is also a path in the original circuit.
Computing the delay of a signal in a tree is easy: when combining the generate signals of two
inputs, one of them has to pass through two gates (a repeater instead of B, and C), and the other has
to pass through only one (namely C). This shows that the given method for computing d indeed
yields a lower bound.
For uniform arrival times, d = D, but for arbitrary arrival times, D = d + 1 is possible, for
example by choosing the arrival times of z and z′ in Figure 1 as 1 and 0, respectively.
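The relation d ≤ D ≤ d + 1 can be checked on small prefix trees with the following sketch. The encoding is our assumption: a tree is either an input index or a pair (left, right), where left covers the higher indices; t maps input indices to arrival times, and g and p of an input arrive at the same time.

```python
def prefix_delay(tree, t):
    """Delay d of Proposition 2.1: t = max(t_r + 2, t_l + 1) at every prefix gate."""
    if isinstance(tree, int):
        return t[tree]
    left, right = tree
    return max(prefix_delay(right, t) + 2, prefix_delay(left, t) + 1)


def logic_delay(tree, t):
    """Return (delay of g, delay of p) of the signal pair in the expanded logic circuit."""
    if isinstance(tree, int):
        return t[tree], t[tree]
    (tg, tp), (tg2, tp2) = logic_delay(tree[0], t), logic_delay(tree[1], t)
    and_b = max(tp, tg2) + 1  # gate computing p AND g'
    return max(tg, and_b) + 1, max(tp, tp2) + 1  # g OR (p AND g'), p AND p'


# Single prefix gate with arrival time 1 for z (left) and 0 for z' (right).
t = {2: 1, 1: 0}
d = prefix_delay((2, 1), t)
D = max(logic_delay((2, 1), t))
print(d, D)
```

For this instance the sketch yields d = 2 and D = 3, which is the case D = d + 1 mentioned above.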
The prefix graph and its underlying logic circuit can vary greatly in depth and delay. For example,
a prefix graph of optimal depth ⌈log2 n⌉ as in Figure 2c contains a balanced binary tree computing
its last output, therefore its logic circuit depth is 2 ⌈log2 n⌉. However, the depth only doubles for
the lower (right) input of a prefix gate by Proposition 2.1, which we exploit in the following.
For a single carry bit computation with arrival times, Rautenbach et al. [8] give a dynamic
programming algorithm with cubic running time. The algorithm restructures an And-Or-path
similar to a prefix tree. Here the right-to-left ordering of the leaves of this tree is fixed as z1, . . . , zn,
because ◦ is not commutative. The algorithm recursively splits the sequence of inputs into two
parts at an index l attaining the minimum in the recursive delay function
\[
D(t_1, \dots, t_n) = \min_{l = 1, \dots, n-1} \max\{\, D(t_1, \dots, t_l) + 2,\ D(t_{l+1}, \dots, t_n) + 1 \,\}. \tag{3}
\]
This solution can be computed for every subsequence ti, ti+1, . . . , tj of indices via dynamic
programming by choosing the D-optimum position l at which to split the sequence, which yields the
following result.
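A direct evaluation of recurrence (3) might look as follows. This is a sketch rather than the implementation from [8]; in particular, the base case D(t_i) = t_i for a single input is our assumption, since (3) only covers n ≥ 2.

```python
from functools import lru_cache


def optimal_split_delay(times):
    """Evaluate recurrence (3) by dynamic programming over all index ranges."""
    t = tuple(times)  # t[0] corresponds to t_1

    @lru_cache(maxsize=None)
    def D(i, j):
        """Value of (3) for the consecutive inputs t[i], ..., t[j]."""
        if i == j:
            return t[i]  # assumed base case: a single input has delay t_i
        # The part with lower indices costs 2 extra, the part with higher indices 1.
        return min(max(D(i, l) + 2, D(l + 1, j) + 1) for l in range(i, j))

    return D(0, len(t) - 1)


print(optimal_split_delay([0, 0, 0, 0, 0]))
print(optimal_split_delay([4, 3, 2, 1, 0]))
```

There are O(n^2) index ranges and up to n − 1 split positions per range, which gives the cubic number of operations mentioned above.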
Theorem 2.2 ([8]). For n input pairs (gi, pi) for 1 ≤ i ≤ n with arrival times t1, . . . , tn ≥ 0, there
is a logic circuit computing the carry bit cn+1 with
\[
\mathrm{delay}(c_{n+1}) \le 1.441 \log_2\Big( \sum_{i=1}^{n} 2^{t_i} \Big) + 3. \tag{4}
\]
This circuit can be constructed in O(n^3) time. It has size at most 4n − 3, and its maximum fan-out
is bounded by two at all gates and bounded by three at all inputs.
Using our definition of a prefix tree, the size of the carry bit circuit can be reduced by n.
Lemma 2.3. Any prefix tree computing a single carry bit has an underlying logic circuit size of at
most 3n− 3 and an underlying maximum fan-out of two.
Proof. This is clear as any prefix tree for n inputs has exactly n− 1 prefix gates.
To analyze the structure of fast prefix carry bit circuits we begin with a well-known definition:
let Fn be the n-th Fibonacci number, where F0 = 0, F1 = 1 and Fn = Fn−1 + Fn−2 for n ≥ 2. The exact
formula for computing the n-th Fibonacci number is $F_n = \frac{1}{\sqrt{5}}(\varphi^n - \psi^n)$, where $\varphi = \frac{1+\sqrt{5}}{2}$ is the
golden section and $\psi = \frac{1-\sqrt{5}}{2}$.
We first prove a similar delay bound to [8], but instead of bounding the recursive function D, we
explicitly construct our solution and obtain useful structural information about it.
Lemma 2.4. Let t1, . . . , tn ∈ N0 be a sequence of input arrival times for inputs z1, . . . , zn, and let
Fk be the first Fibonacci number that is at least as large as $\sum_{i=1}^{n} (F_{t_i+3} - 1)$. Then there is a prefix
tree computing zn ◦ · · · ◦ z1 with logic gate delay at most k.
Proof. Throughout the proof, we implicitly assume that every propagate signal actually arrives
at least one time unit earlier than the corresponding generate signal, i. e. gi has arrival time ti
and pi has arrival time at most ti − 1, thus prefix gates have depth two for the input with smaller
indices and depth one for the input with larger indices. Under this assumption, we prove that the
delay is at most k − 1. Adding one to all arrival times and delays yields a circuit with delay k
under the assumption that gi is available at time ti+1 and pi at time ti, which is true for the given
input arrival times. For signal pairs (gi, pi) with arrival times satisfying this assumption, we say
that they have skewed arrival times.
The proof has two main parts. In the first part, we construct a binary tree T with Fk leaves in
such a way that if we consider its internal nodes as prefix gates and its leaves as inputs with arrival
time 0, then its overall delay is k − 1. During the second step, we replace sections of consecutive
leaves and the corresponding subtrees of T with our original inputs so that the arrival time of the
input does not exceed the depth of the subtree.
Figure 4: Fibonacci tree T for k = 8. The leaves are numbered 1 to 21; the vertical axis shows the depth, from 0 to 7.
Let T be a tree constructed by starting at the root r and recursively constructing a binary tree
with Fk−1 leaves on the left and one with Fk−2 leaves on the right as in Figure 4. We refer to T as
a Fibonacci tree for k.
Replacing all non-leaf nodes with prefix gates and leaves with new inputs (with arrival time 0
and unrelated to the original inputs) as well as adding an output at the root yields a prefix tree for
Fk inputs with logic gate depth k− 1. This can be seen inductively; it is certainly true for k = 2, 3
and thus for k > 3, the left tree has depth k− 2, the right tree has depth k− 3, and the last prefix
gate has delay max{k− 2+1, k− 3+2} = k− 1. The minimum depth of a prefix tree with l leaves
is at most k − 1 if and only if l ≤ Fk.
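The recursive construction of the Fibonacci tree and its delay can be reproduced with a short sketch. The tuple encoding (a leaf is None, an inner node is a pair (left, right)) is our choice; the delay rule is the one from Proposition 2.1 with all leaves at arrival time 0.

```python
def fib(n):
    """n-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fibonacci_tree(k):
    """Fibonacci tree for k >= 1: a tree for k - 1 on the left and one for k - 2 on the right."""
    if k <= 2:
        return None  # F_1 = F_2 = 1: a single leaf
    return (fibonacci_tree(k - 1), fibonacci_tree(k - 2))


def leaves(tree):
    return 1 if tree is None else leaves(tree[0]) + leaves(tree[1])


def delay(tree):
    """Prefix gate rule: left input costs 1, right input costs 2; leaves arrive at time 0."""
    if tree is None:
        return 0
    return max(delay(tree[0]) + 1, delay(tree[1]) + 2)


T = fibonacci_tree(8)
print(leaves(T), delay(T))
```

For k = 8 this gives 21 leaves and delay 7, in line with Figure 4.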
Now we show how to replace parts of the tree by inputs with skewed arrival times t1, . . . , tn
without increasing the delay. We start by subdividing the leaves of the tree: from right to left, the
first $F_{t_1+3} - 1$ leaves are assigned to the first input, the next $F_{t_2+3} - 1$ leaves are assigned to the
second input, and in general input i gets leaves $1 + \sum_{j=1}^{i-1} (F_{t_j+3} - 1)$ up to $\sum_{j=1}^{i} (F_{t_j+3} - 1)$. Our choice of
k ensures that every input i gets $F_{t_i+3} - 1$ successive leaves assigned to it; leftover leaves can be
deleted without increasing the delay. The ordering of the inputs is preserved within the tree.
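The leaf assignment is a prefix sum over the leaf counts. A minimal sketch, with leaves numbered from right to left starting at 1 and a function name of our choosing:

```python
from itertools import accumulate


def fib(n):
    """n-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def leaf_intervals(times):
    """For times = [t_1, ..., t_n], return the (first, last) leaf owned by each input."""
    counts = [fib(t + 3) - 1 for t in times]  # input i owns F_{t_i+3} - 1 leaves
    ends = list(accumulate(counts))  # last leaf of each input
    return [(end - count + 1, end) for count, end in zip(counts, ends)]


print(leaf_intervals([0, 1, 0]))
```

For the arrival times 0, 1, 0 the inputs own 1, 2 and 1 leaves, so the intervals are (1, 1), (2, 3) and (4, 4).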
We define a subtree of size l to be a tree obtained by taking a vertex v and all its successors with
l leaves in total. By construction, every subtree of size l must be a Fibonacci tree for some j with
$F_j = l$. Furthermore, for every $F_j$ with j ≤ k − 1, we can find subtrees of T of size $F_j$. A vertex v
in T is the root of a subtree of size $F_j \ne 1$ if and only if v has depth j − 1. For $F_j = 1$, we know
that j ∈ {0, 1} and v has depth 0.
Our goal is to show that every input i with arrival time ti owns all the leaves of a subtree of
size $F_{t_i+1}$. In order to see this, we remove all edges connecting a vertex with depth at most ti to a
vertex with depth more than ti from the tree. This separates the tree into a connected component
containing the root and several subtrees of size at most $F_{t_i+1}$. For example, if ti = 4, then Figure 4
would contain the component containing the root as well as subtrees indicated by the coloring of
size 3, 5, 5, 3, 5 in that order. In general, since every gate has depth 1 or 2, each root of such a
tree has depth ti or ti − 1, therefore the subtrees can only have size $F_{t_i+1}$ or $F_{t_i}$. Our next goal is
to prove that this ordered subtree sequence has a special structure. Since only the roots of “big”
subtrees of size $F_{t_i+1}$ can be replaced by input i without increasing the delay, we show that there
are few small subtrees of size $F_{t_i}$.
Figure 5: A tight example. (a) T for k = 5. (b) Prefix tree. (c) Logic circuit.
Due to the fact that the depth difference between a node and its left child is always one, the
leftmost root in the subtree sequence of a Fibonacci tree for some k ≥ ti has depth ti and its parent
has depth ti + 1. Therefore, the subtree rooted here has size $F_{t_i+1}$. We will now show that in a
Fibonacci tree, the ordered subtree sequence of the trees of size $F_{t_i+1}$ and size $F_{t_i}$ never contains
two consecutive subtrees of size $F_{t_i}$. For k = ti + 1, this is clear. For k = ti + 2, there are only two
subtrees, and the left one has size $F_{t_i+1}$. For k > ti + 2, the subtree sequence of a Fibonacci tree
for k corresponds to the concatenation of the subtree sequences corresponding to a tree for k − 1
and a tree for k − 2. As those satisfy the claim by induction hypothesis and each sequence starts
with a tree of size $F_{t_i+1}$, the Fibonacci tree for k has the stated property as well.
We know that input i owns $F_{t_i+3} - 1$ consecutive leaves. In the subtree sequence, at most the
first $F_{t_i+1} - 1$ leaves belonging to input i are part of subtrees of which i does not own the first
(rightmost) leaf. Of the remaining leaves, the first $F_{t_i}$ might cover a subtree of that size. This
accounts for $F_{t_i+1} + F_{t_i} - 1 = F_{t_i+2} - 1$ leaves. The next $F_{t_i+1}$ leaves are owned by i as well, so
at that point, at the latest, there must be a subtree of size $F_{t_i+1}$ of which i owns all leaves. For
ti ≠ 1, we can replace the root of this subtree with one input with arrival time ti. By construction,
this does not increase the delay.
Here we used that $F_{t_i+1} > F_{t_i}$ to give a lower bound on the depth of the owned subtree. The
only exception to this is the case ti = 1, which can be treated analogously: every input with
ti = 1 owns two leaves, and by similar arguments as for the subtree sequence, one of them must be
at depth 1 in the Fibonacci tree.
After removing all leaves that have not been replaced by any original input, we obtain a prefix
tree computing zn ◦ · · · ◦ z1 with delay k− 1. All of these arguments used the assumption of skewed
arrival times also for the inputs, which can be achieved in such a way that the actual delay of the
circuit increases to at most k.
The upper bound of k is tight for the final logic circuit, as is evident from the example 0, 1, 0, where
$\sum_{i=1}^{3} (F_{t_i+3} - 1) = 1 + 2 + 1 = 4$, so $k = F_k = 5$ (see Figure 5).
For this arrival time profile, the algorithm will (implicitly) construct the Fibonacci tree T and
assign leaves to the inputs as in Figure 5a, where the colored vertices represent the positions at
which the inputs will actually be inserted into the tree. These do not have to be leaves in general.
After deleting redundant inputs, we obtain a prefix tree (Figure 5b) and a corresponding logic
circuit (Figure 5c). Note that p2 has arrival time 1 and the red path contains four gates, hence the
logic circuit has delay 5.
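The value of k from Lemma 2.4 is easy to compute. The following sketch (the function names are ours) takes the smallest index k with $F_k \ge \sum_i (F_{t_i+3} - 1)$ and reproduces k = 5 for the tight example 0, 1, 0.

```python
def fib(n):
    """n-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def delay_bound_k(times):
    """Smallest k >= 1 such that F_k is at least the total number of required leaves."""
    needed = sum(fib(t + 3) - 1 for t in times)
    k = 1
    while fib(k) < needed:
        k += 1
    return k


print(delay_bound_k([0, 1, 0]))  # 1 + 2 + 1 = 4 leaves are needed, and F_5 = 5
```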
From the proof of Lemma 2.4, it is easy to see how to avoid the enumeration of all potential
splitting positions l = 1, . . . , n − 1 in (3). Since there are $F_{k-1}$ leaves in the left subtree of T and
$F_{k-2}$ in the right subtree, let
\[
j = \min\Big\{ 1 \le j' \le n : \sum_{i=1}^{j'} (F_{t_i+3} - 1) \ge F_{k-2} \Big\}
\]
and $f = F_{k-2} - \sum_{i=1}^{j-1} (F_{t_i+3} - 1)$; then f counts how many leaves belonging to input j are part
of the right subtree, and j is the only input that might have leaves in both subtrees. Since in our
decomposition the leftmost $F_{t_j+1}$ leaves of the right subtree belong to a Fibonacci tree of size $F_{t_j+1}$,
j should be on the right side of the decomposition if and only if $f \ge F_{t_j+1}$. Otherwise, there are
at least $F_{t_j+2}$ leaves on the left side, hence in our sequence of subtrees j might own all leaves of a
subtree of size $F_{t_j}$, but the remaining leaves must belong to and cover a subtree of size $F_{t_j+1}$, hence
j should be on the left side. Note that it is never optimal to assign all leaves to the same side, thus
this partition can always be assumed as proper without increasing the delay. After updating the
number of leaves belonging to j on the side it is assigned to, this yields a recursive procedure that
terminates when there is only one index left for a subtree.
Lemma 2.5. Given input arrival times t1, . . . , tn ∈ N0, let Fk be the first Fibonacci number that
is at least as large as $\sum_{i=1}^{n} (F_{t_i+3} - 1)$. A prefix tree for these input arrival times with delay at
most k can be found with running time $O(n \log n + k + \max_i t_i)$ under the assumption that we can
perform additions and multiplications by a constant on numbers of arbitrary size in constant time
(an assumption we will show how to avoid later). If ti ∈ O(n) for all i, then the running time is
O(n log n).
Proof. We have already argued that the algorithm achieves the stated delay bound. We show that
this partitioning strategy will ensure that every input i is substituted for a subtree of size at least
$F_{t_i+1}$.
If there is only one index i remaining, it was either the rightmost (lowest) or leftmost (highest)
index in the previous step. If it was the rightmost index, then the subtree previously contained
$F_{t_i+2}$ of its leaves as well as at least one more leaf, hence k ≥ ti + 3 and the right subtree has size
at least $F_{t_i+1}$, so replacing this subtree by input i leads to the claimed delay by the argument used
in Lemma 2.4. If it was the leftmost index, a similar argument applies.
For the running time estimate, we compute the indices assigned to every leaf and the delay
bound k in time $O(n + k + \max_i t_i)$. There are n − 1 recursive partitioning steps, during each of
which we find the input j defined above, i. e. the only input that may own leaves in both subtrees.
This can be done in
logarithmic time using binary search in the sorted array of the indices of the first leaf belonging to
every input.
Figure 6a shows how the algorithm works for the sequence of input arrival times 3, 2, 3, 1, 0. The
number of leaves we need is 7 + 4 + 7 + 2 + 1 = 21, therefore k = 8 suffices. We number the leaves
from right to left as 1, . . . , 21. After the first split, 3 light blue leaves are in the left subtree, hence
the corresponding input is assigned to the left subtree. Note that for the orange leaves, we end up
assigning them to a subtree that does not contain any orange leaves in the beginning in order to
ensure a proper partition. We obtain the result shown in Figure 6b.
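The first split of this example can be traced with a short sketch. It covers only the top-level step of the recursive procedure (finding j and f and deciding the side of j), it assumes n ≥ 2, and the function name and return format are ours.

```python
from bisect import bisect_left
from itertools import accumulate


def fib(n):
    """n-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def first_split(times):
    """Return (k, j, f, side) for the top-level split; times = [t_1, ..., t_n], n >= 2."""
    counts = [fib(t + 3) - 1 for t in times]
    prefix = list(accumulate(counts))  # prefix[i - 1] = leaves owned by inputs 1..i
    k = 1
    while fib(k) < prefix[-1]:
        k += 1
    right_size = fib(k - 2)  # number of leaves in the right subtree
    j = bisect_left(prefix, right_size) + 1  # smallest j whose prefix sum reaches F_{k-2}
    f = right_size - (prefix[j - 2] if j > 1 else 0)  # leaves of j inside the right subtree
    side = "right" if f >= fib(times[j - 1] + 1) else "left"
    return k, j, f, side


print(first_split([3, 2, 3, 1, 0]))
```

For the arrival times 3, 2, 3, 1, 0 the right subtree has F_6 = 8 leaves, so input 2 is the split input with one leaf on the right and three on the left, and it is assigned to the left subtree. The binary search on the prefix sums is what keeps each partitioning step logarithmic.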
Lemma 2.6. We can construct a prefix tree with the delay bound of Lemma 2.4 for any instance
(t1, . . . , tn) by constructing a prefix tree for an instance $(t'_1, \dots, t'_n)$ with $\max_i t'_i \le 2n - 1$.
This follows from the fact that the longest path from any input to the output contains at most
n− 1 prefix gates. The maximum delay difference can be assumed as 2n− 2, since any input with
earlier arrival time will never be critical.
Figure 6: Example of the algorithm. (a) Algorithm for input arrival times 3, 2, 3, 1, 0. (b) Prefix tree with leaf labels 0, 1, 3, 2, 3.
Theorem 2.7. For n inputs with arrival times t1, . . . , tn ∈ N0, the algorithm finds a prefix carry
bit circuit for cn+1 with
\[
\mathrm{delay}(c_{n+1}) \le k \le \Big\lfloor \log_\varphi \Big( \sum_{i=1}^{n} \varphi^{t_i} \Big) \Big\rfloor + 4.
\]
The constructed logic circuit has size at most 3n − 3 and maximum fan-out two at all logic gates
and inputs. Furthermore, the delay is at most
\[
k \le \log_\varphi \Big( \sum_{i=1}^{n} 2^{t_i} \Big) + 2.673 \le 1.441 \log_2 \Big( \sum_{i=1}^{n} 2^{t_i} \Big) + 2.673.
\]
Proof. The size and fan-out bounds follow from Lemma 2.3. The delay of the constructed circuit
is k. By choice of k, we know that $\sum_{i=1}^{n} (F_{t_i+3} - 1) \ge F_{k-1} + 1$. With $\varphi = \frac{1+\sqrt{5}}{2}$, $\psi = \frac{1-\sqrt{5}}{2}$ and the
exact formula $F_n = \frac{1}{\sqrt{5}} (\varphi^n - \psi^n)$, it follows that $|\sqrt{5} F_n - \varphi^n| \le 1$ and, for n ≥ 1, $|\sqrt{5} F_n - \varphi^n| \le |\psi|$.
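Now k − 1 = 0 can only be true if there is only one input. In this case, the stated delay bound
is trivially true. Otherwise, we obtain the estimate:
\begin{align*}
k - 1 = \log_\varphi\big(\varphi^{k-1}\big)
&\le \log_\varphi\big(\sqrt{5}\,(F_{k-1} + 1)\big) \\
&\le \log_\varphi\Big(\sqrt{5} \sum_{i=1}^{n} (F_{t_i+3} - 1)\Big) \\
&\le \log_\varphi\Big(\sqrt{5} \sum_{i=1}^{n} \Big(\tfrac{1}{\sqrt{5}}\big(\varphi^{t_i+3} + 1\big) - 1\Big)\Big) \\
&\le \log_\varphi\Big(\sum_{i=1}^{n} \varphi^{t_i+3}\Big)
= \log_\varphi\Big(\sum_{i=1}^{n} \varphi^{t_i}\Big) + 3,
\end{align*}
which proves the first claim.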
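For a single input, the second delay bound is trivially true. Furthermore, for ti ≥ 0, $F_{t_i+3} - 1 \le 2^{t_i}$.
We obtain the estimate:
\begin{align*}
k - 1 = \log_\varphi\big(\varphi^{k-1}\big)
&\le \log_\varphi\big(\sqrt{5}\,(F_{k-1} + 1)\big) \\
&\le \log_\varphi\Big(\sqrt{5} \sum_{i=1}^{n} (F_{t_i+3} - 1)\Big) \\
&\le \log_\varphi\Big(\sqrt{5} \sum_{i=1}^{n} 2^{t_i}\Big) \\
&= \log_\varphi\Big(\sum_{i=1}^{n} 2^{t_i}\Big) + \log_\varphi \sqrt{5}
\le 1.441 \log_2\Big(\sum_{i=1}^{n} 2^{t_i}\Big) + 1.673.
\end{align*}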
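The two bounds of Theorem 2.7 can be compared numerically with the value of k. The sketch below is only a spot check on a few instances, not a proof; the helper names are ours, and a small tolerance is added before taking the floor to guard against floating-point rounding.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def fib(n):
    """n-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def delay_bound_k(times):
    """Smallest k >= 1 with F_k >= sum of (F_{t_i+3} - 1)."""
    needed = sum(fib(t + 3) - 1 for t in times)
    k = 1
    while fib(k) < needed:
        k += 1
    return k


def theorem_bounds(times):
    """Return (floor(log_phi(sum phi^t_i)) + 4, 1.441 * log2(sum 2^t_i) + 2.673)."""
    log_phi_sum = math.log(sum(PHI ** t for t in times)) / math.log(PHI)
    first = math.floor(log_phi_sum + 1e-9) + 4  # tolerance is our addition
    second = 1.441 * math.log2(sum(2 ** t for t in times)) + 2.673
    return first, second


for times in ([0, 1, 0], [3, 2, 3, 1, 0], [0, 0]):
    print(times, delay_bound_k(times), theorem_bounds(times))
```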
Our proof allows an improvement over the delay bound of [8] due to a refined analysis. A running
time of O(n log n) follows from Lemma 2.5 assuming that we can add numbers of linear size and
multiply them by a constant in constant time. Under the more practical assumption that these
operations take linear time with respect to the number of digits, the algorithm has super-quadratic
running time, which can be avoided as follows:
Theorem 2.8. For any fixed γ > 1, a prefix carry bit circuit as in the setting of Theorem 2.7 with
\[
\mathrm{delay}(c_{n+1}) \le \log_\varphi \Big( \sum_{i=1}^{n} \varphi^{t_i} \Big) + 4 + 2.1 \cdot n^{1-\gamma}
\]
can be found in $O(n \gamma \log^2 n)$ time assuming linear-time addition and multiplication with constants.
It satisfies
\[
\mathrm{delay}(c_{n+1}) \le 1.441 \log_2 \Big( \sum_{i=1}^{n} 2^{t_i} \Big) + 2.673 + 2.1 \cdot n^{1-1.4\gamma}.
\]
Proof. By Lemma 2.6 and Theorem 2.7, we can solve instances with $\max_i t_i - \min_i t_i \le \gamma \lceil \log_\varphi n \rceil$
in $O(n \gamma \log^2 n)$ time with linear-time addition.
Given an instance t1, . . . , tn, we set $t'_i = \max\{ t_i,\ \max_{j \in \{1,\dots,n\}} t_j - \gamma \lceil \log_\varphi n \rceil \}$ and compute a
circuit for the modified instance in $O(n \gamma \log^2 n)$ time. When reverting to the original arrival times, the
delay of this solution does not increase, because none of the arrival times do. Therefore,
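\begin{align*}
\mathrm{delay}(c_{n+1}) - 4
&\le \log_\varphi\Big(\sum_{i=1}^{n} \varphi^{t'_i}\Big) \\
&\le \log_\varphi\Big(n \cdot \varphi^{\max_i t_i - \gamma \lceil \log_\varphi n \rceil} + \sum_{i=1}^{n} \varphi^{t_i}\Big) \\
&\le \log_\varphi\Big(\varphi^{\max_i t_i + (1-\gamma) \log_\varphi n} + \sum_{i=1}^{n} \varphi^{t_i}\Big) \\
&\le \log_\varphi\Big(\sum_{i=1}^{n} \varphi^{t_i}\Big) + \log_\varphi\Big(1 + \varphi^{(1-\gamma) \log_\varphi n}\Big) \\
&\le \log_\varphi\Big(\sum_{i=1}^{n} \varphi^{t_i}\Big) + \log_\varphi e \cdot n^{1-\gamma},
\end{align*}
and $\log_\varphi e < 2.1$. For the delay bound based on the binary logarithm, we have
\begin{align*}
\mathrm{delay}(c_{n+1}) - 2.673
&\le \log_\varphi 2 \cdot \log_2\Big(\sum_{i=1}^{n} 2^{t'_i}\Big) \\
&\le \log_\varphi 2 \cdot \log_2\Big(n \cdot 2^{\max_i t_i - \gamma \lceil \log_\varphi n \rceil} + \sum_{i=1}^{n} 2^{t_i}\Big) \\
&\le \log_\varphi 2 \cdot \log_2\Big(2^{\max_i t_i + (1 - \gamma \log_\varphi 2) \log_2 n} + \sum_{i=1}^{n} 2^{t_i}\Big) \\
&\le \log_\varphi 2 \cdot \Big(\log_2\Big(\sum_{i=1}^{n} 2^{t_i}\Big) + \log_2\Big(1 + 2^{(1 - \gamma \log_\varphi 2) \log_2 n}\Big)\Big) \\
&\le \log_\varphi 2 \cdot \log_2\Big(\sum_{i=1}^{n} 2^{t_i}\Big) + (\log_\varphi e) \cdot n^{1 - \gamma \log_\varphi 2},
\end{align*}
and $1.4 < \log_\varphi 2 < 1.441$.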
For γ > 1, the additional error decreases with growing n. Since the algorithm is only useful if
n ≥ 2, choosing a sufficiently large constant γ yields the delay bound $1.441 \log_2\big(\sum_{i=1}^{n} 2^{t_i}\big) + 2.674$
with running time $O(n \log^2 n)$.
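The rounding step in the proof of Theorem 2.8 can be illustrated as follows. This sketch only performs the clamping of the arrival times and measures how much $\log_\varphi \sum_i \varphi^{t_i}$ grows; the function names and the example instance are our assumptions.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def clamp_arrival_times(times, gamma):
    """t'_i = max(t_i, max_j t_j - gamma * ceil(log_phi n)), as in the proof of Theorem 2.8."""
    n = len(times)
    lowest = max(times) - gamma * math.ceil(math.log(n) / math.log(PHI))
    return [max(t, lowest) for t in times]


def log_phi_sum(times):
    """log_phi of the sum of phi^t_i."""
    return math.log(sum(PHI ** t for t in times)) / math.log(PHI)


times = [0, 50, 3, 50]  # hypothetical instance with two very early inputs
gamma = 2
clamped = clamp_arrival_times(times, gamma)
increase = log_phi_sum(clamped) - log_phi_sum(times)
print(clamped, increase, 2.1 * len(times) ** (1 - gamma))
```

Here the two early inputs are raised to 44 = 50 − 2 · ⌈log_φ 4⌉, and the resulting increase stays below the error term 2.1 · n^{1−γ} from the theorem.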
3 Algorithm for Prefix Adder Circuits
The naïve parallel prefix graph construction, in which all carry bits are computed separately by
a carry bit circuit, might contain a quadratic number of gates. Therefore, Rautenbach et al. also
developed a parallel prefix graph construction computing all carry bits [9].
Theorem 3.1 ([9]). Given n ∈ N and arrival times t1, . . . , tn ∈ N0, there is a parallel prefix graph
for n inputs of size O(n log log n) with logic delay $2 \log_2\big(\sum_{i=1}^{n} 2^{t_i}\big) + 6 \log_2 \log_2 n + O(1)$.
The primary objective in [9] is to minimize the delay of the prefix graph instead of the underlying
logic circuit. We will improve the performance guarantee for a construction similar to the one in [9] by
using a carry bit circuit as in Section 2 as a subroutine.
Given inputs t1, . . . , tn, we partition the set {1, . . . , n} into $l = \lceil \sqrt{n} \rceil$ subsets V1, . . . , Vl, each
containing l or l − 1 consecutive indices. Let $Z_i = \circ_{j \in V_i} z_j$, where Zi is computed by a circuit
constructed by the carry bit algorithm. This is shown in green, labeled “Best”, in Figure 7. The
parallel prefix graph construction is applied recursively to compute prefixes for all groups without
their highest index as well as for the l− 1 inputs Z1, . . . , Zl−1 (which corresponds to the red boxes
labeled “Recursion” in Figure 7), i. e. we build l+1 parallel prefix graphs, each with at most l− 1
inputs. As a final step, we combine all prefixes from group i with the (i−1)-th prefix of the Zi and
add one more prefix gate combining Zl with the (l − 1)-th prefix of the Zi. This yields a parallel
prefix graph.
The following two lemmas analyze the size of the resulting parallel prefix graph and the running
time of its construction.
Lemma 3.2. The parallel prefix graph in [9] and the modified construction above have the same
size; for $n \ge 3$, it is bounded by $2n \log_2\log_2 n$ in terms of prefix gates and $6n \log_2\log_2 n$ in terms
of logic gates.
Figure 7: Prefix graph construction (inputs z25, ..., z1; five groups, each with a “Best” and a “Recursion” block, plus one further “Recursion” block)
Proof. Consider Figure 8 and proceed by induction on the number of inputs. On a level with n
inputs, the number of green gates and the number of yellow gates are both at most n. The total
number of inputs of recursion blocks can be bounded by n as well: if there are l groups, then n− l
original inputs are inputs of recursion blocks; one further recursion block has l − 1 inputs. For
small n, the correctness follows from Figure 8, e. g. for n = 3, the size bound is 5 and 3 gates are
required.
Let $V_1, \dots, V_l$ be the groups, $l \ge 3$, then by induction hypothesis, the prefix gate size is bounded
by
$$2n + 2(l-1)\log_2\log_2(l-1) + \sum_{i=1}^{l} 2(|V_i|-1)\log_2\log_2(|V_i|-1)$$
$$\le 2n + \sum_{i=1}^{l} 2|V_i| \log_2\log_2(l-1)$$
$$\le 2n + 2n \log_2\log_2\sqrt{n} = 2n \log_2\log_2 n.$$
For logic gates, the size increases by a factor of three.
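The size recurrence behind this proof can be evaluated directly. The following sketch counts prefix gates under stated assumptions: each $Z_i$ is computed by a tree with $|V_i| - 1$ prefix gates, groups are balanced with at most $\lceil\sqrt{n}\rceil$ elements, and the function names are hypothetical.

```python
import math


def group_sizes(n: int) -> list:
    """Sizes of l = ceil(sqrt(n)) balanced groups of consecutive indices."""
    l = math.isqrt(n - 1) + 1
    base, extra = divmod(n, l)
    return [base + (1 if i < extra else 0) for i in range(l)]


def prefix_gate_count(n: int) -> int:
    """Number of prefix gates used by the recursive construction on n inputs."""
    if n <= 1:
        return 0
    if n == 2:
        return 1
    sizes = group_sizes(n)
    best = sum(s - 1 for s in sizes)                        # trees computing the Z_i
    rec_groups = sum(prefix_gate_count(s - 1) for s in sizes)
    rec_z = prefix_gate_count(len(sizes) - 1)               # prefixes of Z_1..Z_{l-1}
    combine = sum(s - 1 for s in sizes[1:]) + 1             # final combination gates
    return best + rec_groups + rec_z + combine


for n in (3, 4, 9, 25):
    bound = 2 * n * math.log2(math.log2(n))
    print(n, prefix_gate_count(n), round(bound, 2))
```

For the sampled values the count stays well below $2n\log_2\log_2 n$; multiplying by three gives the corresponding logic gate counts.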
Lemma 3.3. The parallel prefix graph above can be computed in $O(n \log^2 n)$ time.
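Proof. As in Theorem 2.8, we round all arrival times up to at least $\max_i t_i - \gamma \lceil\log_2 n\rceil$ for a fixed
$\gamma > 1$. For this rounded arrival time profile $t'_1, \dots, t'_n$, we have already shown that
$$1.441 \log_2\left(\sum_{i=1}^{n} 2^{t'_i}\right) \le 1.441 \log_2\left(\sum_{i=1}^{n} 2^{t_i}\right) + 2.1 \cdot n^{1-1.4\gamma}.$$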
Therefore, we can use the rounded arrival time profile to achieve a delay guarantee of
$$1.441 \log_2\left(\sum_{i=1}^{n} 2^{t_i}\right) + 5 \log_2\log_2 n + 4.5$$
for $\gamma = 3$. This means that all numbers in the computations have size at most $O(\log n)$, thus it
remains to bound the number of operations by $O(n \log n)$.
For each level $l = 1, \dots, \log_2\log_2 n$ of the recursion, we have a partition of the $n$ inputs into
groups, where the maximum group size is bounded by $n^{1/2^l}$. Therefore, the prefix trees for the
$Z_i$ can be computed in
$$O\left(n \log\left(n^{1/2^l}\right)\right) = O\left(\left(\tfrac{1}{2}\right)^l n \log n\right)$$
time. All $Z_i$ require time $O(n \log n)$,
because this is a geometric series. All remaining gates are prefix gates; they have fixed positions,
thus each of them requires only constant time to compute, and there are $O(n \log\log n)$ such gates
in total.
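The geometric series argument can be made concrete with a small calculation. This is a sketch with a hypothetical helper `zi_cost_per_level`; it ignores the constants hidden in the $O$-notation and simply sums $n \log_2(n^{1/2^l}) = n \log_2 n / 2^l$ over the recursion levels whose groups still contain at least two inputs.

```python
import math


def zi_cost_per_level(n: int) -> list:
    """Operation estimate n * log2(n) / 2^l for each recursion level l = 1, 2, ..."""
    log_n = math.log2(n)
    costs = []
    level = 1
    while log_n / 2 ** level >= 1:            # group size n^(1/2^l) is at least 2
        costs.append(n * log_n / 2 ** level)
        level += 1
    return costs


n = 2 ** 16
costs = zi_cost_per_level(n)
print(costs, sum(costs), n * math.log2(n))
```

The per-level estimates halve from one level to the next, so their sum stays below $n \log_2 n$ regardless of how many levels there are.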
The new parallel prefix graph construction is summarized in the following theorem.
Theorem 3.4. Given $n \in \mathbb{N}$ and arrival times $t_1, \dots, t_n \in \mathbb{N}_0$, our algorithm finds a parallel prefix
graph with logic gate delay at most
$$\log_\varphi\left(\sum_{i=1}^{n} \varphi^{t_i}\right) + 5 \log_2\log_2 n + 4.5 \le 1.441 \log_2\left(\sum_{i=1}^{n} 2^{t_i}\right) + 5 \log_2\log_2 n + 4.5.$$
It can be implemented with running time $O(n \log^2 n)$ and the computed circuit has size at most
$6n \log_2\log_2 n$ in terms of logic gates.
This theorem implies that for $n$ sufficiently large, we have a 1.441-approximation algorithm in
terms of the delay for a prefix adder. The algorithm of [9] has a running time of $\Omega(n^2)$, which
the use of our carry bit algorithm improves to a near-linear running time, even with linear-time
addition.
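To get a feeling for the two sides of the inequality in Theorem 3.4, both can be evaluated for a concrete arrival time profile. The sketch below uses a hypothetical function name and an arbitrary example profile, and it assumes $n \ge 2$ so that $\log_2\log_2 n$ is defined.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def theorem_3_4_bounds(arrival_times):
    """Return (phi-based bound, base-2 bound) from Theorem 3.4 for n >= 2 inputs."""
    n = len(arrival_times)
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    additive = 5 * math.log2(math.log2(n)) + 4.5
    phi_sum = sum(PHI ** t for t in arrival_times)
    two_sum = sum(2 ** t for t in arrival_times)
    phi_bound = math.log(phi_sum) / math.log(PHI) + additive
    two_bound = 1.441 * math.log2(two_sum) + additive
    return phi_bound, two_bound


# Arbitrary illustrative profile, not taken from the paper.
print(theorem_3_4_bounds([0, 3, 1, 5, 2, 2, 0, 4]))
```

The first value is never larger than the second because $\varphi^{t} \le 2^{t}$ for $t \ge 0$ and $\log_\varphi x = \log_\varphi 2 \cdot \log_2 x$ with $\log_\varphi 2 < 1.441$.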
To prove the delay bound, we assume that all arrival times are skewed by one time unit. Under
this assumption, let $w = \sum_{i=1}^{n} \varphi^{t_i}$, and let $\mathrm{delay}(w, n)$ denote the maximum delay for a circuit
constructed as above with $n \ge 3$ inputs and an arrival time profile leading to the same $w$. Then
$\mathrm{delay}(w, n) + 1$ is an upper bound on the delay of the constructed circuit, and we have:
Lemma 3.5. For $n$ input pairs with skewed arrival times $t_1, \dots, t_n$, let $w = \sum_{i=1}^{n} \varphi^{t_i}$. Then we
have
$$\mathrm{delay}(w, n) \le \log_\varphi w + 5 \log_2\log_2 n + 3.$$
Proof. We may assume that the given arrival time profile achieves the maximum delay, i. e. for
$t_1, \dots, t_n$, the construction actually has a delay of $\mathrm{delay}(w, n)$.
By Theorem 2.7 and using the assumption that the propagate signals arrive earlier than the
generate signals, we can compute $Z_i$ with delay $\log_\varphi\left(\sum_{j \in V_i} \varphi^{t_j}\right) + 3$. Therefore, their prefix graph
has delay at most
$$\mathrm{delay}\left(\sum_{i=1}^{l} \varphi^{\log_\varphi\left(\sum_{j \in V_i} \varphi^{t_j}\right) + 3},\ \lceil\sqrt{n}\rceil - 1\right) = \mathrm{delay}\left(\varphi^3 \cdot \sum_{j=1}^{n} \varphi^{t_j},\ \lceil\sqrt{n}\rceil - 1\right).$$
For each of the groups $V_i$ containing $\lceil\sqrt{n}\rceil$ or $\lceil\sqrt{n}\rceil - 1$ inputs, the prefix graph of all but its last
input (highest index) has delay at most
$$\mathrm{delay}(w, \lceil\sqrt{n}\rceil - 1) \le \mathrm{delay}(\varphi^3 w, \lceil\sqrt{n}\rceil - 1),$$
as $\mathrm{delay}(w, n)$ is monotonically increasing in $w$ and $n$. Therefore, the combination of a prefix of
one of the $V_i$ and the corresponding prefix of the $Z_i$ has logic gate delay at most
$\mathrm{delay}(w, n) \le \mathrm{delay}(\varphi^3 w, \lceil\sqrt{n}\rceil - 1) + 2$.
We prove the absolute delay estimate by induction on $n$. For $n \le 3$, $\mathrm{delay}(w, n) - \log_\varphi w \le 3$ for
all input sequences with this parameter $w$ as $\log_\varphi w \ge \max_i t_i$. Therefore, for $n \ge 4$, $\mathrm{delay}(w, n)$ is
bounded by
$$\mathrm{delay}(\varphi^3 w, \lceil\sqrt{n}\rceil - 1) + 2 \le \log_\varphi(\varphi^3 w) + 5 \log_2\log_2(\sqrt{n}) + 5 \le \log_\varphi w + 5 \log_2(0.5 \log_2 n) + 8$$
$$= \log_\varphi w + 5 \log_2\log_2 n - 5 + 8 = \log_\varphi w + 5 \log_2\log_2 n + 3.$$
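The induction can be mirrored by unrolling the recurrence numerically. The sketch below tracks only the additive excess over $\log_\varphi w$, using a hypothetical function `additive_excess`: each recursion step contributes 3 (from the factor $\varphi^3$) plus 2 (from the combining gate), and instances with at most 3 inputs contribute 3.

```python
import math


def additive_excess(n: int) -> int:
    """Bound on delay(w, n) - log_phi(w) obtained by unrolling the recurrence."""
    if n <= 3:
        return 3
    sub = math.isqrt(n - 1)               # equals ceil(sqrt(n)) - 1 for n >= 1
    return additive_excess(sub) + 5       # +3 for phi^3, +2 for the final combination


for n in (4, 16, 17, 289, 290):
    closed_form = 5 * math.log2(math.log2(n)) + 3
    print(n, additive_excess(n), round(closed_form, 3))
```

The unrolled values never exceed the closed form $5\log_2\log_2 n + 3$, with equality at $n = 4$.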
Figure 8: Parallel prefix graph for uniform arrival times (inputs z25, ..., z1)
Without assuming skewed arrival times, we achieve a delay bound of $\log_\varphi w + 5 \log_2\log_2 n + 4$.
For n = 25, an example is shown in Figure 8. Gates are colored by the part of the recursion they
represent; in this special case, some gates can be used to compute the Zi as well as the group
prefixes, hence they are both red and green.
The construction in [9] and our variant of it both have a very high fan-out; for $n$ inputs, the
fan-out is at least $\lceil\sqrt{n}\rceil$. In a physical implementation such a high fan-out induces a significant
delay and requires the insertion of duplicate gates into the interconnect to repeat the signals. The
high fan-outs occur precisely at the $Z_i$-prefixes, therefore they accumulate on a critical path. For $n$
inputs, the fan-out can be redistributed to duplicate gates with fan-out 2 using depth $\frac{1}{2}\lceil\log_2 n\rceil + 1$;
this will lead to an overall increase in delay of $\lceil\log_2 n\rceil + O(\log_2\log_2 n)$ for a given path [9].
Therefore, we obtain a 2.441-approximation algorithm if the fan-out is bounded by 2, improving
the 3-approximation achieved by [9] in this scenario.
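The depth figure for a single level can be sanity-checked with integer arithmetic. This is a sketch under an assumption not spelled out in the text: the duplicate gates are arranged as a balanced binary tree, so serving $f$ sinks with fan-out 2 takes depth $\lceil\log_2 f\rceil$. All function names are hypothetical.

```python
import math


def ceil_log2(m: int) -> int:
    """ceil(log2(m)) for m >= 1, computed exactly on integers."""
    return (m - 1).bit_length()


def ceil_sqrt(n: int) -> int:
    """ceil(sqrt(n)) for n >= 1, computed exactly on integers."""
    return math.isqrt(n - 1) + 1


def buffer_tree_depth(fanout: int) -> int:
    """Depth of a balanced fan-out-2 tree of duplicate gates serving `fanout` sinks."""
    return ceil_log2(fanout)


for n in (16, 25, 64, 1000):
    depth = buffer_tree_depth(ceil_sqrt(n))
    print(n, depth, ceil_log2(n) / 2 + 1)
```

Under this assumption, the tree depth for fan-out $\lceil\sqrt{n}\rceil$ stays within $\frac{1}{2}\lceil\log_2 n\rceil + 1$ for the printed values.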
4 A Lower Bound for Prefix Adders
Lemma 2.1 shows that a lower bound for the delay of a prefix tree for a single carry bit is given
by an optimal binary tree with depth one for the left child and depth two for the right child in
which the leaves represent inputs and their right-to-left order corresponds to the ordering of the
inputs. For zero arrival times, this is achieved by a Fibonacci tree. Rautenbach et al. [10] observed
that this is a special case of a more general concept: alphabetic code trees with unequal letter costs.
These can be used to obtain general lower bounds, which we improve and state explicitly by using
the specific properties of our application.
Lemma 4.1. Given $n$ inputs with integral arrival times $t_1, \dots, t_n \in \mathbb{N}_0$, a prefix tree computing
their carry bit $c_{n+1}$ has logic gate delay at least
$$\mathrm{delay}(c_{n+1}) \ge \log_\varphi\left(\sum_{i=1}^{n} \varphi^{t_i}\right) - 1.$$
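Proof. In Section 2 we saw that $F_{t_i+1}$ inputs with arrival time zero can be combined with depth
$t_i$. Therefore, an optimal prefix tree for inputs with arrival times $t_1, \dots, t_n$ of delay $k$ can be
restructured into a prefix tree with $\sum_{i=1}^{n} F_{t_i+1}$ inputs with depth $k$ by replacing input $i$ by a
Fibonacci tree for $t_i + 1$. If there is only one input, the lemma is trivially true, thus we may assume
$\sum_{i=1}^{n} F_{t_i+1} \ge 2$. But a tree of depth $k$ has at most $F_{k+1}$ leaves, hence $k \ge 2$ and
$$k + 1 = \log_\varphi\left(\varphi^{k+1}\right) \ge \log_\varphi\left(\sqrt{5}\left(F_{k+1} + \frac{\psi^3}{\sqrt{5}}\right)\right) \ge \log_\varphi\left(\sqrt{5}\left(\sum_{i=1}^{n} F_{t_i+1} + \frac{\psi^3}{\sqrt{5}}\right)\right)$$
$$\ge \log_\varphi\left(\sum_{i=1}^{n}\left(\varphi^{t_i+1} - \psi^2 + \psi^3\right)\right) = \log_\varphi\left(\sum_{i=1}^{n}\left(\varphi^{t_i+1} - \varphi\psi^2\right)\right)$$
$$\ge \log_\varphi\left(\varphi(1 - \psi^2) \cdot \sum_{i=1}^{n} \varphi^{t_i}\right) = \log_\varphi\left(\sum_{i=1}^{n} \varphi^{t_i}\right),$$
and $k \ge \log_\varphi\left(\sum_{i=1}^{n} \varphi^{t_i}\right) - 1$ as claimed.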
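The identities used in this chain and the resulting bound can be checked numerically for zero arrival times. The sketch below is illustrative only: it assumes the Fibonacci numbering $F_1 = F_2 = 1$, takes from the proof that a tree of depth $k$ has at most $F_{k+1}$ leaves, and uses hypothetical function names.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PSI = (1 - math.sqrt(5)) / 2


def fib(k: int) -> int:
    """Fibonacci number F_k with F_1 = F_2 = 1 (and F_0 = 0)."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def min_depth_zero_arrival(n: int) -> int:
    """Smallest k with F_{k+1} >= n, i.e. the least depth that can hold n leaves."""
    k = 0
    while fib(k + 1) < n:
        k += 1
    return k


def lower_bound(arrival_times) -> float:
    """Lemma 4.1 bound: log_phi(sum of phi^t_i) - 1."""
    total = sum(PHI ** t for t in arrival_times)
    return math.log(total) / math.log(PHI) - 1


# Identities from the proof: phi * (1 - psi^2) = 1 and -psi^2 + psi^3 = -phi * psi^2.
print(PHI * (1 - PSI ** 2), -PSI ** 2 + PSI ** 3, -PHI * PSI ** 2)
for n in (2, 8, 100):
    print(n, min_depth_zero_arrival(n), round(lower_bound([0] * n), 3))
```

For uniform zero arrival times, the least feasible depth always sits above the bound of Lemma 4.1.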
This lemma shows that the single carry bit circuits in Section 2 have optimum delay up to an
additive margin of 5.
References
[1] S. Chatterjee, A. Mishchenko, R. Brayton, X. Wang, and T. Kam. Reducing structural bias
in technology mapping. IEEE Transactions on Computer Aided Design of Integrated Circuits
and Systems 25.12 (2006): 2894–2903.
[2] Youngmoon Choi. Parallel prefix adder design. Dissertation, University of Texas at Austin,
2004.
[3] Kurt Keutzer. DAGON: technology binding and local optimization by DAG matching. Papers
on Twenty-five years of electronic design automation, ACM, 617–624, 1988.
[4] Simon Knowles. A family of adders. Proceedings of the 15th IEEE Symposium on Computer
Arithmetic (ARITH-15) (2001): 277–81.
[5] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general
class of recurrence equations. IEEE Transactions on Computers 100.8 (1973): 786–793.
[6] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM
(JACM) 27.4 (1980): 831–838.
[7] Vojin G. Oklobdzija. Design and analysis of fast carry-propagate adder under non-equal input
signal arrival profile. Conference Record of the Twenty-Eighth Asilomar Conference on Signals,
Systems and Computers, Vol. 2. IEEE, 1994.
[8] Dieter Rautenbach, Christian Szegedy and Jürgen Werber. Delay optimization of linear depth
boolean circuits with prescribed input arrival times. Journal of Discrete Algorithms 4.4 (2006):
526–537.
[9] Dieter Rautenbach, Christian Szegedy and Jürgen Werber. The delay of circuits whose inputs
have specified arrival times. Discrete Applied Mathematics 155.10 (2007): 1233–1243.
[10] Dieter Rautenbach, Christian Szegedy and Jürgen Werber. On the cost of optimal alphabetic
code trees with unequal letter costs. European Journal of Combinatorics 29.2 (2008): 386–394.
[11] Subhendu Roy, Mihir Choudhury, Ruchir Puri, and David Z. Pan. Towards optimal
performance-area trade-off in adders by synthesis of parallel prefix structures. Proceedings
of the 50th Annual Design Automation Conference (2013).
[12] Arnold Weinberger and James L. Smith. A logic for high-speed addition. Nat. Bur. Stand.
Circ. 591 (1958): 3–12.
[13] Jürgen Werber, Dieter Rautenbach and Christian Szegedy. Timing optimization by restructuring
long combinatorial paths. Proceedings of the 2007 IEEE/ACM international conference on
Computer-aided design. IEEE Press, 2007.
[14] Reto Zimmermann. Binary adder architectures for cell-based VLSI and their synthesis.
Dissertation, ETH Zurich, Hartung-Gorre, 1998.
