# Fast Prefix Adders for Non-uniform Input Arrival Times 

Stephan Held ${ }^{1}$ • Sophie Spirkl ${ }^{2}$

Received: 2 March 2015 / Accepted: 1 September 2015 / Published online: 22 September 2015
© Springer Science+Business Media New York 2015


#### Abstract

We consider the problem of constructing fast and small parallel prefix adders for non-uniform input arrival times. In modern computer chips, adders with up to hundreds of inputs occur frequently, and they are often embedded into more complex circuits, e.g. multipliers, leading to instance-specific non-uniform input arrival times. Most previous results are based on representing binary carry-propagate adders as parallel prefix graphs, in which pairs of generate and propagate signals are combined using complex gates called prefix gates. Examples of commonly-used adders are constructed based on the Kogge-Stone or Ladner-Fischer prefix graphs. Adders constructed in this model usually minimize the delay in terms of these prefix gates. However, the delay in terms of logic gates can be worse by a factor of two. In contrast, we aim to minimize the delay of the underlying logic circuit directly. We prove a lower bound on the delay of a carry bit computation achievable by any prefix carry bit circuit and develop an algorithm that computes a prefix carry bit circuit with optimum delay up to a small additive constant. Our algorithm improves the running time of a previous dynamic program for constructing a prefix carry bit from $\mathcal{O}\left(n^{3}\right)$ to $\mathcal{O}\left(n \log ^{2} n\right)$ while simultaneously improving the delay and size guarantee, where $n$ is the number of bits in the summands. Furthermore, we use this algorithm as a subroutine to compute a full adder in near-linear time, reducing the delay approximation factor of 2 from previous approaches to 1.441 for our algorithm.


Keywords Circuit • Delay • Parallel prefix problem • Addition • Prefix adder • Non-uniform input arrival times

Mathematics Subject Classification 68Q25 - 65Y04

## 1 Introduction

The addition of binary numbers is one of the most fundamental computational tasks performed by computer chips. Given two binary addends $A=\left(a_{n} \ldots a_{1}\right)$ and $B=\left(b_{n} \ldots b_{1}\right)$, where index $n$ denotes the most significant bit, their sum $S=A+B$ has $n+1$ bits. For each position $1 \leq i \leq n$, we compute a generate signal $g_{i}$ and a propagate signal $p_{i}$, which are defined as follows:

$$
\begin{align*}
g_{i} & =a_{i} \wedge b_{i} \\
p_{i} & =a_{i} \oplus b_{i} \tag{1}
\end{align*}
$$

where $\wedge$ and $\oplus$ denote the binary AND and XOR functions. The carry bit at position $i+1$ can be computed recursively as $c_{i+1}=g_{i} \vee\left(p_{i} \wedge c_{i}\right)$ [4,13]. From the carry bits, we can compute the output $S$ via $s_{i}=c_{i} \oplus p_{i}$ for $1 \leq i \leq n$ and $s_{n+1}=c_{n+1}$.
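The recursion can be checked on small operands. The following Python sketch is only an illustration: the function name `add_via_carries` is hypothetical, and the convention $c_{1}=0$ (no carry-in) is an assumption that the text leaves implicit.

```python
def add_via_carries(a: int, b: int, n: int) -> int:
    """Add two n-bit numbers via generate/propagate signals and the carry recursion."""
    # g[i] and p[i] correspond to g_{i+1} and p_{i+1} in the text (0-based lists).
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # c[i] corresponds to c_{i+1}; c[0] = 0 is the assumed missing carry-in.
    c = [0] * (n + 1)
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])

    # s_i = c_i XOR p_i for the low n bits, and s_{n+1} = c_{n+1}.
    s = [c[i] ^ p[i] for i in range(n)] + [c[n]]
    return sum(bit << i for i, bit in enumerate(s))
```

For every pair of 4-bit operands the result agrees with ordinary integer addition, including the extra output bit $s_{n+1}$.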

For two pairs $\left(g_{i}, p_{i}\right)$ and $\left(g_{j}, p_{j}\right)$ of generate and propagate signals, we define a binary prefix operator as

$$
\begin{equation*}
\binom{g_{i}}{p_{i}} \circ\binom{g_{j}}{p_{j}}=\binom{g_{i} \vee\left(p_{i} \wedge g_{j}\right)}{p_{i} \wedge p_{j}} \tag{2}
\end{equation*}
$$

This operator is associative, and it can be used to compute the carry bit $c_{i+1}$ using the identity

$$
\binom{c_{i+1}}{p_{i} \wedge p_{i-1} \wedge \cdots \wedge p_{1}}=\binom{g_{i}}{p_{i}} \circ\binom{g_{i-1}}{p_{i-1}} \circ \cdots \circ\binom{g_{1}}{p_{1}} .
$$

The prefix operator allows us to simplify notation by combining generate and propagate signals into a single term $z_{i}=\left(g_{i}, p_{i}\right)$ and computing $c_{i+1}$ as the first component of $z_{i} \circ \cdots \circ z_{1}$. Figure 1 shows a prefix gate computing $z \circ z^{\prime}$ for the prefix operator in (2) on the left and its underlying logic circuit on the right.
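As a small sanity check of the prefix operator in (2), the sketch below verifies associativity exhaustively and computes the last carry bit as the first component of $z_{n} \circ \cdots \circ z_{1}$. The names `prefix_op`, `signals` and `last_carry` are illustrative and do not come from the paper.

```python
from functools import reduce

def prefix_op(z, z_prime):
    """Prefix operator (2): (g, p) o (g', p') = (g OR (p AND g'), p AND p')."""
    g, p = z
    g2, p2 = z_prime
    return (g | (p & g2), p & p2)

def signals(a: int, b: int, n: int):
    """Return [z_1, ..., z_n] with z_i = (g_i, p_i) as defined in (1)."""
    return [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]

def last_carry(zs):
    """First component of z_n o ... o z_1, i.e. the carry bit c_{n+1}."""
    # reversed(zs) yields z_n, ..., z_1, so the higher index is always the left operand.
    return reduce(prefix_op, reversed(zs))[0]
```

Since the operator is associative but not commutative, any bracketing of $z_{n} \circ \cdots \circ z_{1}$ gives the same result as long as the order of the operands is preserved; this freedom is what prefix trees exploit.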

Formally, a logic circuit is a non-empty connected acyclic directed graph consisting of nodes that are either inputs with at least one outgoing edge and no incoming edges, outputs with exactly one incoming edge and no outgoing edges, or gates with one or two incoming edges representing one of the 2-bit logical functions AND $(\wedge)$, OR ( $\vee$ ), XOR $(\oplus)$, NOT and their negations.

The number of gates is the size of the circuit. The (maximum) fan-out of the circuit is the maximum fan-out (out-degree) of its nodes. The depth of the circuit is the maximum number of gates on a directed path.

A logic circuit with inputs $g_{1}, p_{1}, \ldots, g_{n}, p_{n}$ is called a prefix carry bit circuit if it computes $c_{n+1}$ and $p_{1} \wedge \cdots \wedge p_{n}$, it is built from prefix operator gadgets in Fig. 1, and the subcircuit computing $p_{1} \wedge \cdots \wedge p_{n}$ is a tree. Similarly, a prefix adder is a logic circuit built using the gadgets in Fig. 1 that computes $c_{i+1}$ and $p_{1} \wedge \cdots \wedge p_{i}$ for all $i=1, \ldots, n$ at its $2 n$ outputs.

Fig. 1 Prefix gate and underlying logic circuit

A graph that arises from a prefix carry bit circuit by contracting each gadget into a prefix gate as in Fig. 1, and by contracting all input pairs $\left(g_{i}, p_{i}\right)$ into $z_{i}$ and the output pair $\left(c_{n+1}, p_{n} \wedge \cdots \wedge p_{1}\right)$, is called a prefix tree. Likewise, a parallel prefix graph arises from a prefix adder by contracting each gadget, all input pairs $\left(g_{i}, p_{i}\right)=z_{i}$ and output pairs $\left(c_{i+1}, p_{1} \wedge \cdots \wedge p_{i}\right)$ for all $i=1, \ldots, n$. For inputs $z_{1}, \ldots, z_{n}$, a prefix tree computes the last carry bit of an addition $z_{n} \circ \cdots \circ z_{1}$, while a parallel prefix graph computes $z_{i} \circ \cdots \circ z_{1}$ for all $1 \leq i \leq n$, i.e. all carry bits of an addition.

In a parallel prefix graph every vertex computes a signal of the form $z_{i, j}=z_{j} \circ z_{j-1} \circ \cdots \circ z_{i}$ for some $1 \leq i \leq j \leq n$. In accordance with the generate and propagate signals $g_{k}, p_{k}$ at the inputs $1 \leq k \leq n$, we call the first component of $z_{i, j}$ the generate signal (sequence) and the second component the propagate signal (sequence) computed at the vertex.

When aiming for a bounded fan-out, we allow the use of repeater gates with fan-in one (a single incoming edge) and fan-out at least two in all types of circuits and graphs.

An example of the transition between parallel prefix graphs and prefix adders is given in Fig. 2. On the left the serial parallel prefix graph with depth 3 is replaced by an AND-OR-path with logic circuit depth 6 known as the ripple-carry adder. For the Kogge-Stone parallel prefix graph [5] on the right, the depth increases from two to four and the maximum fan-out increases from two to three.
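To make concrete what a parallel prefix graph computes, the following sketch evaluates the Kogge-Stone scheme [5] level by level at the granularity of prefix gates. It is a plain software illustration under assumed names (`prefix_op`, `signals`, `kogge_stone`), not a circuit description, and it ignores fan-out and the underlying logic gates.

```python
def prefix_op(z, z_prime):
    """Prefix operator (2); the left operand carries the higher indices."""
    g, p = z
    g2, p2 = z_prime
    return (g | (p & g2), p & p2)

def signals(a: int, b: int, n: int):
    """Return [z_1, ..., z_n] with z_i = (g_i, p_i)."""
    return [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]

def kogge_stone(zs):
    """Return (prefixes, levels); prefixes[i] equals z_{i+1} o ... o z_1."""
    cur = list(zs)
    levels = 0
    dist = 1
    while dist < len(cur):
        # Every position i >= dist combines its own partial result with the
        # partial result dist positions to the right (lower indices).
        cur = [prefix_op(cur[i], cur[i - dist]) if i >= dist else cur[i]
               for i in range(len(cur))]
        dist *= 2
        levels += 1
    return cur, levels
```

For $n=4$ the loop runs for two levels, in line with the prefix graph depth of two mentioned for the Kogge-Stone graph in Fig. 2.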

Additions are typically not performed as isolated tasks, but the input signals result from preceding computational stages and become available at different fixed arrival times $t_{i} \in \mathbb{N}_{0}(i \in\{1, \ldots, n\})$, e.g. when used within a multiplier. Here we make the simplifying assumption that $g_{i}$ and $p_{i}$ have the same arrival time at the inputs, which is essentially fulfilled if they are generated as in (1). We define the delay of a directed path in a logic circuit starting at an input as its depth plus its input arrival time. The delay of a vertex is the maximum delay of a path ending in the vertex and the delay of the circuit is the maximum delay of its outputs. Depth and delay coincide if all input arrival times are zero. Paths and outputs attaining the delay of the circuit are called critical. The delay of all vertices can be computed in linear time by a longest path computation in an acyclic network.


Fig. 2 Prefix graphs as logic circuits. a Serial prefix graph. b AND-OR-path. c Kogge-Stone prefix graph. d Kogge-Stone logic circuit

In Fig. 3, we show an example with five inputs and its optimum solutions for different arrival time patterns. Each tree is optimal for neither of the other two arrival time sequences.

We aim for prefix carry bit circuits and adders with close to minimum delay and small size.

Examples of minimum-depth prefix graphs for uniform input arrival times are the Kogge-Stone graph [5] or the Ladner-Fischer graph [6]. Both have depth $\left\lceil\log _{2} n\right\rceil$ in terms of prefix gates, but a non-minimal depth of $2\left\lceil\log _{2} n\right\rceil$ as a logic circuit. For non-uniform arrival times, these circuits may be worse than the lower bound by a factor of three, for example for the arrival time pattern $t_{1}=\log _{2} n$ and $t_{2}=\cdots=t_{n}=0$. In Fig. 2c, if $z_{1}$ has arrival time two and all other arrival times are zero, the delay of the logic circuit in Fig. 2d is 6.

Parallel prefix graphs minimizing the overall prefix graph delay for special input arrival time patterns that occur mostly in certain multipliers were presented by Oklobdzija [7] and Zimmermann [15].

Fig. 3 Different arrival time profiles and their optimum prefix trees. a Arrival times 0, 0, 0, 0, 0; delay 4. b Arrival times 4, 3, 2, 1, 0; delay 6. c Arrival times 0, 1, 2, 3, 4; delay 7

An algorithm for constructing optimum-delay parallel prefix graphs for arbitrary non-uniform input arrival times is given by Choi [1]; however, this approach may require $\mathcal{O}\left(n^{2}\right)$ gates for a full $n$-bit adder. Roy et al. [11] enumerate parallel prefix adders with heuristic pruning to achieve good performance-area tradeoffs in practice. In [12] they proposed a variant with polynomial running time. All these approaches minimize the delay of the prefix graph rather than that of the underlying logic circuit. As the prefix operator contains two subsequent gates, the resulting delay of the underlying circuit may be worse by a factor of two. Variants of the prefix operator have led to improved constructions for small sizes. Examples are the so-called Ling adders or Jackson adders, which were recently compared with other adders by Keeter et al. [3].

As is common practice in logic synthesis [5,6,11,14], we use a simple technology-independent circuit and delay model in this work. In hardware, the delay of a gate certainly depends on its physical structure. Despite its simplicity, the same model is successfully used in practice for re-optimizing carry bit functions even late in the design flow by Werber et al. [14]. Our lower bounds are based on the definition of delay motivated by the properties of logic circuits; for different computational models, even faster prefix addition (such as by Cole and Vishkin [2]) is possible. In CMOS technology, NAND/NOR gates are faster than AND/OR gates and efficient implementations exist for integrated multi-input AND-OR-Inversion gates and OR-AND-Inversion gates. Therefore, we will also describe techniques for an efficient technology mapping.

### 1.1 Our Contribution

We will use the delay properties of the prefix operator (2) aiming to minimize the delay of logic circuits for additions instead of the corresponding prefix graphs. This idea was used by [8], who proposed a cubic-time dynamic programming algorithm to compute a fast carry bit circuit.

With a deeper structural analysis of near-optimum prefix trees in Sect. 2, we can construct a carry bit circuit with a better delay bound, size, and running time as shown in the rows with "C" (for carry) as their type (column "T") of Table 1.

Table 1 Improvements over [8,9], where $W=\log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)$ is a lower bound for the delay

|  | T | Delay | Size | Fan-out | Runtime |
| :--- | :--- | :--- | :--- | :--- | :--- |
| $[8]$ | C | $1.441 W+3$ | $4 n-3$ | $1,2,3$ | $\mathcal{O}\left(n^{3}\right)$ |
| Here | C | $1.441 W+2.674$ | $3 n-3$ | 1,2 | $\mathcal{O}(n \log n)$ |
| $[9]$ | A | $2 W+6 \log _{2} \log _{2} n+\mathcal{O}(1)$ | $6 n \log _{2} \log _{2} n$ | $\sqrt{n}+1$ | $\mathcal{O}\left(n^{2}\right)$ |
| Here | A | $1.441 W+5 \log _{2} \log _{2} n+4.5$ | $6 n \log _{2} \log _{2} n$ | $\sqrt{n}+1$ | $\mathcal{O}(n \log n)$ |

Running times assume constant time for binary addition

In Sect. 3, we apply the carry bit algorithm to substantially improve the delay bound given by Rautenbach et al. [9] for a full $n$-bit adder with input arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$. The result is listed in the rows with type "A" (for adder) in Table 1. Here, all running times are listed assuming constant-time addition. In Theorem 3, we show that assuming linear-time addition costs only an additional factor of $\mathcal{O}(\log n)$ for all running times. For a comparison with traditional fast adders for uniform input arrival times, recall that adders based on the Kogge-Stone [5] or Ladner-Fischer [6] prefix graphs have delays of $2 W$ for uniform input arrival times and $3 W$ for arbitrary input arrival times.

Then in Sect. 4, we prove a lower bound on the delay of any prefix carry bit circuit, which shows that our carry bit algorithm is delay-optimal up to an additive constant of 5.

Finally in Sect. 5, we show how to map our new carry bit circuits and full adders to NAND, NOR, and NOT gates economically without increasing the asymptotic delay bound.

## 2 Algorithm for Single Carry Bit Circuits

We start with a method that given a parallel prefix graph allows us to compute the delay of the underlying logic circuit up to an additive error of one, without inspecting the underlying logic circuit.

Proposition 1 Given a parallel prefix graph or prefix tree, we propagate the arrival times (which might all be zero) through the prefix gates so that the delay $t$ of a gate with left input (higher indices) $l$ and right input (lower indices) $r$ with delay $t_{l}$ and $t_{r}$, respectively, is defined as $t=\max \left\{t_{r}+2, t_{l}+1\right\}$. Let $d$ be the maximum delay computed with this procedure, maximized over all gates, inputs and outputs, then the delay $D$ of the logic circuit corresponding to the given prefix graph or prefix tree satisfies $d \leq D \leq d+1$.

Proof This is a consequence of a longest path computation in acyclic networks. We construct a logic circuit from the prefix graph. For every input pair ( $g, p$ ) corresponding to an input with arrival time $t$, we set the arrival time of $g$ and $p$ to be $t$ and $t-1$, respectively.

We prove by induction that for every signal pair $(g, p)$ corresponding to a signal $z$ at a vertex in the prefix graph, $g$ has delay at least one more than $p$; in this case we say that the pair $(g, p)$ has skewed arrival times. Moreover, if $z$ is not an input, the delay of $g$ is the maximum of two plus the delay of the generate signal of its right predecessor and one plus the delay of the generate signal of its left predecessor. The skew is clear for inputs. Now consider a signal $z \circ z^{\prime}$ as in Fig. 1. Let $t_{g}, t_{p}, t_{g^{\prime}}$, and $t_{p^{\prime}}$ denote the delay of $g, p, g^{\prime}$ and $p^{\prime}$, respectively. By induction hypothesis, $t_{p^{\prime}}+1 \leq t_{g^{\prime}}$ and $t_{p}+1 \leq t_{g}$. Therefore, $g \vee\left(p \wedge g^{\prime}\right)$ has delay $\max \left\{t_{p}+2, t_{g^{\prime}}+2, t_{g}+1\right\}=\max \left\{t_{g^{\prime}}+2, t_{g}+1\right\}$. Furthermore, $\max \left\{t_{p}+2, t_{g^{\prime}}+2, t_{g}+1\right\} \geq 1+\max \left\{t_{p}+1, t_{p^{\prime}}+1\right\}$, which proves that $p^{\prime} \wedge p$ is indeed available at least one time unit earlier.

Inductive application of the argument above yields that the generate signal of every output arrives at time $\leq d$ under the assumption that all propagate signals of the inputs arrive one time unit earlier than their actual arrival time. Shifting all computed delays up by one time unit yields that the delay of the logic circuit for the actual arrival times is at most $d+1$.

To show that $d \leq D$, consider only the generate signals, i.e. using the notation of Fig. 1, consider a logic circuit $G$ in which all gates of type $A$ and inputs $p_{i}$ computing propagate signals are removed, and gates of type $B$ are replaced by repeaters. It follows that for every output $c$, the subcircuit of $G$ which consists of all ancestors of $c$ is a tree. This is certainly true in the parallel prefix graph, and after the removal of propagate signals, every prefix gate internally corresponds to a tree as well. Removing gates and inputs from a circuit does not increase its delay, because a critical path in $G$ is also a path in the original circuit.

Computing the delay of a signal in a tree is easy: when combining the generate signals of two inputs, one of them has to pass through two gates (a repeater instead of $B$, and $C$) and the other has to pass through only one (namely $C$). This shows that the given method for computing $d$ indeed yields a lower bound.

For uniform arrival times, $d=D$, but for arbitrary arrival times, $D=d+1$ is possible, for example by choosing the arrival times of $z$ and $z^{\prime}$ in Fig. 1 as 1 and 0, respectively.
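The relation $d \leq D \leq d+1$ can be observed on small prefix trees. In the sketch below, a prefix tree is encoded (as an assumption made for this illustration) by nested tuples `(left, right)` whose leaves are integer arrival times, with the left child holding the higher indices. The function names are hypothetical.

```python
def prefix_gate_delay(tree):
    """Delay d of Proposition 1: t = max(t_r + 2, t_l + 1) at every prefix gate."""
    if isinstance(tree, int):
        return tree
    left, right = tree
    return max(prefix_gate_delay(right) + 2, prefix_gate_delay(left) + 1)

def signal_delays(tree):
    """Delays (t_g, t_p) of the generate and propagate signal in the logic circuit."""
    if isinstance(tree, int):
        return tree, tree  # g_i and p_i arrive at the same time
    (tg, tp), (tg2, tp2) = signal_delays(tree[0]), signal_delays(tree[1])
    # g OR (p AND g'): p and g' pass through two gates, g through one.
    # p AND p': both inputs pass through a single gate.
    return max(tp + 2, tg2 + 2, tg + 1), max(tp, tp2) + 1

def logic_delay(tree):
    """Delay D of the underlying logic circuit (maximum over both outputs)."""
    return max(signal_delays(tree))
```

With this encoding, `(1, 0)` is the single prefix gate from the example above (arrival times 1 for $z$ and 0 for $z^{\prime}$) and gives $d=2$ and $D=3$, whereas the balanced tree `((0, 0), (0, 0))` with uniform arrival times gives $d=D=4$.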

The prefix graph and its underlying logic circuit can vary greatly in depth and delay. For example, a prefix graph of optimal depth $\left\lceil\log _{2} n\right\rceil$ as in Fig. 2c contains a balanced binary tree computing its last output, therefore its logic circuit depth is $2\left\lceil\log _{2} n\right\rceil$. However, the depth only doubles for the lower (right) input of a prefix gate by Proposition 1, which we exploit in the following.

For a single carry bit computation with arrival times, Rautenbach et al. [8] give a dynamic programming algorithm with cubic running time. The algorithm restructures an AND-OR-path similar to a prefix tree. Here the right-to-left ordering of the leaves of this tree is fixed as $z_{1}, \ldots, z_{n}$, because $\circ$ is not commutative. The algorithm recursively splits the sequence of inputs into two parts at an index $l$ attaining the minimum in the recursive delay function

$$
\begin{equation*}
\mathcal{D}\left(t_{1}, \ldots, t_{n}\right)=\min _{l=1, \ldots, n-1} \max \left\{\mathcal{D}\left(t_{1}, \ldots, t_{l}\right)+2, \mathcal{D}\left(t_{l+1}, \ldots, t_{n}\right)+1\right\} \tag{3}
\end{equation*}
$$

This solution can be computed for every subsequence $t_{i}, t_{i+1}, \ldots, t_{j}$ of indices via dynamic programming by choosing the $\mathcal{D}$-optimum position $l$ at which to split the sequence, which yields the following result.
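A direct memoized evaluation of recursion (3) may look as follows. This is a sketch of the recurrence only, not the implementation of [8]; the base case $\mathcal{D}(t_{i})=t_{i}$ for a single input is an assumption, since the text does not state it, and the function name is hypothetical.

```python
from functools import lru_cache

def optimal_delay(times):
    """Evaluate recursion (3) for the arrival times t_1, ..., t_n."""
    t = tuple(times)

    @lru_cache(maxsize=None)
    def delay(i, j):
        # delay(i, j) stands for D(t_i, ..., t_j) with 0-based, inclusive indices.
        if i == j:
            return t[i]  # assumed base case: a single input needs no gate
        # The lower-index part (right operand) costs two gates,
        # the higher-index part (left operand) costs one gate.
        return min(max(delay(i, l) + 2, delay(l + 1, j) + 1) for l in range(i, j))

    return delay(0, len(t) - 1)
```

There are $\mathcal{O}\left(n^{2}\right)$ subsequences and up to $n-1$ split positions for each, which gives the cubic running time. For five inputs with arrival time 0 the sketch returns 4, the delay stated for Fig. 3a.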

Theorem 1 (Rautenbach et al. [8]) For $n$ input pairs $\left(g_{i}, p_{i}\right)$ for $1 \leq i \leq n$ with arrival times $t_{1}, \ldots, t_{n} \geq 0$, there is a logic circuit computing the carry bit $c_{n+1}$ with

$$
\begin{equation*}
\operatorname{delay}\left(c_{n+1}\right) \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+3 \tag{4}
\end{equation*}
$$

This circuit can be constructed in $\mathcal{O}\left(n^{3}\right)$ time. It has size at most $4 n-3$, and its maximum fan-out is bounded by two at all gates and bounded by three at all inputs.

Using our definition of a prefix tree, the size of the carry bit circuit can be reduced by $n$.

Lemma 1 Any prefix tree computing a single carry bit has an underlying logic circuit size of at most $3 n-3$ and an underlying maximum fan-out of two.

Proof This is clear as any prefix tree for $n$ inputs has exactly $n-1$ prefix gates, each of which consists of three logic gates.

To analyze the structure of fast prefix carry bit circuits we begin with a well-known definition: let $F_{n}$ be the $n$-th Fibonacci number, where $F_{0}=0, F_{1}=1$ and $F_{n}=F_{n-1}+F_{n-2}$. The exact formula for computing the $n$-th Fibonacci number is $F_{n}=\frac{1}{\sqrt{5}}\left(\varphi^{n}-\psi^{n}\right)$, where $\varphi=\frac{1+\sqrt{5}}{2}$ is the golden section and $\psi=\frac{1-\sqrt{5}}{2}$.

We first prove a similar delay bound to [8], but instead of bounding the recursive function $\mathcal{D}$, we explicitly construct our solution and obtain useful structural information about it.

Lemma 2 Let $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$ be a sequence of input arrival times for inputs $z_{1}, \ldots, z_{n}$, and let $F_{k}$ be the first Fibonacci number that is at least as large as $\sum_{i=1}^{n}\left(F_{t_{i}+3}-1\right)$. Then there is a prefix tree computing $z_{n} \circ \cdots \circ z_{1}$ with logic gate delay at most $k$.

Proof Throughout the proof, we assume that every input signal pair has skewed arrival times, i.e. $g_{i}$ has arrival time $t_{i}$ and $p_{i}$ has arrival time at most $t_{i}-1$ for all $1 \leq i \leq n$. By the proof of Proposition 1, all generate and propagate signal pairs in the prefix tree will have skewed arrival times under this assumption. Thus, all prefix gates have depth two for the input with smaller indices and depth one for the input with larger indices, and we will prove that the delay of the circuit is at most $k-1$. Without the skew assumption, this yields a circuit delay of $k$ and concludes the proof, where we pay a delay of 1 to establish the skew assumption at the inputs.

The proof has two main parts. In the first part, we construct a binary tree $T$ with $F_{k}$ leaves in such a way that if we consider its internal nodes as prefix gates and its leaves as inputs with arrival time 0, then its overall delay is $k-1$. In the second part, we replace sections of consecutive leaves and the corresponding subtrees of $T$ with our original inputs so that the arrival time of the input does not exceed the depth of the subtree.

Let $T$ be a tree constructed by starting at the root $r$ and recursively constructing a binary tree with $F_{k-1}$ leaves on the left and one with $F_{k-2}$ leaves on the right as in Fig. 4. We refer to $T$ as a Fibonacci tree for $k$.


Fig. 4 Fibonacci tree $T$ for $k=8$

Replacing all non-leaf nodes with prefix gates and leaves with new inputs (with arrival time 0 and unrelated to the original inputs) as well as adding an output at the root yields a prefix tree for $F_{k}$ inputs with logic gate depth $k-1$. This can be seen inductively; it is certainly true for $k=2,3$ and thus for $k>3$, the left tree has depth $k-2$, the right tree has depth $k-3$, and the last prefix gate has delay $\max \{k-2+1, k-3+2\}=k-1$. The minimum depth of a prefix tree with $l$ leaves is at most $k-1$ if and only if $l \leq F_{k}$.
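The construction of the Fibonacci tree and its depth can be reproduced with a few lines of Python. The encoding is an assumption made for this sketch: a leaf is the integer 0 (its arrival time) and an inner node is a tuple `(left, right)`; the function names are hypothetical.

```python
def fibonacci_tree(k):
    """Fibonacci tree for k: F_{k-1} leaves on the left, F_{k-2} leaves on the right."""
    if k <= 2:
        return 0  # F_1 = F_2 = 1, so the tree is a single leaf
    return (fibonacci_tree(k - 1), fibonacci_tree(k - 2))

def count_leaves(tree):
    """Number of leaves of a nested-tuple tree."""
    if isinstance(tree, int):
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])

def gate_depth(tree):
    """Logic gate depth with skewed inputs: one gate from the left, two from the right."""
    if isinstance(tree, int):
        return 0
    return max(gate_depth(tree[0]) + 1, gate_depth(tree[1]) + 2)
```

For $k=8$, the case shown in Fig. 4, this yields a tree with $F_{8}=21$ leaves and depth 7.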

Now we show how to replace parts of the tree by inputs with skewed arrival times $t_{1}, \ldots, t_{n}$ without increasing the delay. We start by subdividing the leaves of the tree: from right to left, the first $F_{t_{1}+3}-1$ leaves are assigned to the first input, the next $F_{t_{2}+3}-1$ leaves are assigned to the second input, and input $i$ gets leaves $1+\sum_{j=1}^{i-1}\left(F_{t_{j}+3}-1\right)$ up to $\sum_{j=1}^{i}\left(F_{t_{j}+3}-1\right)$. Our choice of $k$ ensures that every input $i$ gets $F_{t_{i}+3}-1$ successive leaves assigned to it; leftover leaves can be deleted without increasing the delay. The ordering of the inputs is preserved within the tree.

We define a subtree of size $l$ to be a tree obtained by taking a vertex $v$ and all its successors with $l$ leaves in total. By construction, every subtree of size $l$ must be a Fibonacci tree for some $j$ with $F_{j}=l$. Furthermore, for every $F_{j}$ with $j \leq k-1$, we can find subtrees of $T$ of size $F_{j}$. A vertex $v$ in $T$ is the root of a subtree of size $F_{j} \neq 1$ if and only if $v$ has depth $j-1$. For $F_{j}=1$, we know that $j \in\{1,2\}$ and $v$ has depth 0.

Our goal is to show that every input $i$ with arrival time $t_{i}$ owns all the leaves of a subtree of size $F_{t_{i}+1}$. In order to see this, we remove all edges connecting a vertex with depth at most $t_{i}$ to a vertex with depth more than $t_{i}$ from the tree. This separates the tree into a connected component containing the root and several subtrees of size at most $F_{t_{i}+1}$. For example, if $t_{i}=4$, then Fig. 4 would contain the component containing the root as well as subtrees indicated by the coloring and patterns of size $3,5,5,3,5$ in that order. In general, since every gate has depth 1 or 2, each root of such a tree has depth $t_{i}$ or $t_{i}-1$, therefore the subtrees can only have size $F_{t_{i}+1}$ or $F_{t_{i}}$. Our next goal is to prove that this ordered subtree sequence has a special structure. Since only the roots of "big" subtrees of size $F_{t_{i}+1}$ can be replaced by input $i$ without increasing the delay, we show that there are few small subtrees of size $F_{t_{i}}$.

Due to the fact that the depth difference between a node and its left child is always one, the leftmost root in the subtree sequence of a Fibonacci tree for some $k \geq t_{i}$ has depth $t_{i}$ and its parent has depth $t_{i}+1$. Therefore, the subtree rooted here has size $F_{t_{i}+1}$. We will now show that in a Fibonacci tree, the ordered subtree sequence of the trees of size $F_{t_{i}+1}$ and size $F_{t_{i}}$ never contains two consecutive subtrees of size $F_{t_{i}}$. For $k=t_{i}+1$, this is clear. For $k=t_{i}+2$, there are only two subtrees, and the left one has size $F_{t_{i}+1}$. For $k>t_{i}+2$, the subtree sequence of a Fibonacci tree for $k$ corresponds to the concatenation of the subtree sequences corresponding to a tree for $k-1$ and a tree for $k-2$. As those satisfy the claim by induction hypothesis and each sequence starts with a tree of size $F_{t_{i}+1}$, the Fibonacci tree for $k$ has the stated property as well.

We know that input $i$ owns $F_{t_{i}+3}-1$ consecutive leaves. In the subtree sequence, at most the first $F_{t_{i}+1}-1$ leaves belonging to input $i$ are part of subtrees of which $i$ does not own the first (rightmost) leaf. Of the remaining leaves, the first $F_{t_{i}}$ might cover a subtree of that size. This accounts for $F_{t_{i}+1}+F_{t_{i}}-1=F_{t_{i}+2}-1$ leaves. The next $F_{t_{i}+1}$ leaves are owned by $i$ as well, so at that point, at the latest, there must be a subtree of size $F_{t_{i}+1}$ of which $i$ owns all leaves. For $t_{i} \neq 1$, we can replace the root of this subtree with one input with arrival time $t_{i}$. By construction, this does not increase the delay.

Here we used that $F_{t_{i}+1}>F_{t_{i}}$ to give a lower bound of the depth of the owned subtree. The only exception from this is the case $t_{i}=1$, which can be treated analogously: every input with $t_{i}=1$ owns two leaves, and by similar arguments as for the subtree sequence, one of them must be at depth 1 in the Fibonacci tree.

After removing all leaves that have not been replaced by any original input, we obtain a prefix tree computing $z_{n} \circ \cdots \circ z_{1}$ with delay $k-1$. All of these arguments used the assumption of skewed arrival times also for the inputs, which can be achieved in such a way that the actual delay of the circuit increases to at most $k$.

The upper bound of $k$ is tight for the final logic circuit as evident from the example $0,1,0$, where $\sum_{i=1}^{3}\left(F_{t_{i}+3}-1\right)=1+2+1=4$, so $k=F_{k}=5$ (see Fig. 5).


Fig. 5 A tight example. a $T$ for $k=5$. b Prefix tree. $\mathbf{c}$ Logic circuit

For this arrival time profile, the algorithm will (implicitly) construct the Fibonacci tree $T$ and assign leaves to the inputs as in Fig. 5a, where the colored and patterned vertices represent the positions at which the inputs will actually be inserted into the tree. These do not have to be leaves in general. After deleting redundant inputs, we obtain a prefix tree (Fig. 5b) and a corresponding logic circuit (Fig. 5c). Note that $p_{2}$ has arrival time 1 and the red path contains four gates, hence the logic circuit has delay 5.
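The value of $k$ in Lemma 2 is easy to compute. The sketch below follows the definition literally: it sums $F_{t_{i}+3}-1$ over all inputs and searches for the first Fibonacci number that is at least as large. The function names are hypothetical, and the iterative Fibonacci routine is just one convenient choice.

```python
def fib(m):
    """m-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a

def delay_bound_k(times):
    """Return (k, leaves): leaves = sum of F_{t_i+3} - 1, and F_k is the
    first Fibonacci number that is at least as large as leaves."""
    leaves = sum(fib(t + 3) - 1 for t in times)
    k = 0
    while fib(k) < leaves:
        k += 1
    return k, leaves
```

For the arrival times $0,1,0$ this gives 4 leaves and $k=5$, as in the tight example above.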

From the proof of Lemma 2, it is easy to see how to avoid the enumeration of all potential splitting positions $l=1, \ldots, n-1$ in (3). Since there are $F_{k-1}$ leaves in the left subtree of $T$ and $F_{k-2}$ in the right subtree, let

$$
j=\min \left\{1 \leq j \leq n: \sum_{i=1}^{j}\left(F_{t_{i}+3}-1\right) \geq F_{k-2}\right\}
$$

and $f=F_{k-2}-\sum_{i=1}^{j-1}\left(F_{t_{i}+3}-1\right)$, then $f$ counts how many leaves belonging to input $j$ are part of the right subtree, and $j$ is the only input that might have leaves in both subtrees. Since in our decomposition the leftmost $F_{t_{j}+1}$ leaves of the right subtree belong to a Fibonacci tree of size $F_{t_{j}+1}$, $j$ should be on the right side of the decomposition if and only if $f \geq F_{t_{j}+1}$. Otherwise, there are at least $F_{t_{j}+2}$ leaves on the left side, hence in our sequence of subtrees $j$ might own all leaves of a subtree of size $F_{t_{j}}$, but the remaining leaves must belong to and cover a subtree of size $F_{t_{j}+1}$, hence $j$ should be on the left side. Note that it is never optimal to assign all leaves to the same side, thus this partition can always be assumed as proper without increasing the delay. After updating the number of leaves belonging to $j$ on the side it is assigned to, this yields a recursive procedure that terminates when there is only one index left for a subtree.
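The top-level split can be sketched as follows. The code covers only the first decomposition step (finding $j$, computing $f$ and choosing a side); it does not implement the full recursion, which additionally has to update the leaf counts and keep each partition proper. The function names and the use of `bisect` on prefix sums are choices made for this illustration.

```python
from bisect import bisect_left
from itertools import accumulate

def fib(m):
    """m-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a

def first_split(times, k):
    """Return (j, f, side) for the split at the root of T; j is 1-based."""
    owned = [fib(t + 3) - 1 for t in times]   # leaves owned by inputs 1, ..., n
    prefix = list(accumulate(owned))          # prefix[i]: leaves of inputs 1..i+1
    right_size = fib(k - 2)                   # leaves in the right subtree of T
    idx = bisect_left(prefix, right_size)     # first prefix sum >= F_{k-2}
    f = right_size - (prefix[idx - 1] if idx > 0 else 0)
    side = "right" if f >= fib(times[idx] + 1) else "left"
    return idx + 1, f, side
```

For the arrival times $3,2,3,1,0$ and $k=8$, the right subtree has $F_{6}=8$ leaves, input 1 owns 7 of them, so $j=2$ and $f=1<F_{3}=2$, and input 2 is placed on the left side.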

Lemma 3 Given input arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, let $F_{k}$ be the first Fibonacci number that is at least as large as $\sum_{i=1}^{n}\left(F_{t_{i}+3}-1\right)$. A prefix tree for these input arrival times with delay at most $k$ can be found with running time $\mathcal{O}\left(n \log n+k+\max t_{i}\right)$ under the assumption that we can perform additions and multiplications by a constant on numbers of arbitrary size in constant time (an assumption we will show how to avoid later). If $t_{i} \in \mathcal{O}(n)$ for all $i$, then the running time is $\mathcal{O}(n \log n)$.

Proof We have already argued that the algorithm achieves the stated delay bound. We show that this partitioning strategy will ensure that every input $i$ is substituted for a subtree of size at least $F_{t_{i}+1}$.

If there is only one index $i$ remaining, it was either the rightmost (lowest) or leftmost (highest) index in the previous step. If it was the rightmost index, then the subtree previously contained $F_{t_{i}+2}$ of its leaves as well as at least one more leaf, hence $k \geq t_{i}+3$ and the right subtree has size at least $F_{t_{i}+1}$, so replacing this subtree by input $i$ leads to the claimed delay by the argument used in Lemma 2. If it was the leftmost index, a similar argument applies.

For the running time estimate, we compute the indices assigned to every leaf and the delay bound $k$ in time $\mathcal{O}\left(n+k+\max t_{i}\right)$. There are $n-1$ recursive partitioning steps, during each of which we find the input $j$ as the input index to own leaves in the left subtree. This can be done in logarithmic time using binary search in the sorted array of the indices of the first leaf belonging to every input.

Fig. 6 Example of the algorithm. a Output for input arrival times 3, 2, 3, 1, 0. b Prefix tree

Figure 6a shows how the algorithm works for the sequence of input arrival times $3,2,3,1,0$. The number of leaves we need is $7+4+7+2+1=21$, therefore $k=8$ suffices. We number the leaves from right to left as $1, \ldots, 21$. After the first split, 3 light blue leaves of input 2 are in the left subtree, hence the corresponding input is assigned to the left subtree. Note that for the orange leaves of input 4, we end up assigning them to a right subtree that does not contain any orange leaves in the beginning, but only green leaves of input 3, in order to ensure a proper partition. We obtain the result shown in Fig. 6b.

Lemma 4 We can construct a prefix tree with a delay bound as in Lemma 2 for any instance $\left(t_{1}, \ldots, t_{n}\right)$ by instead constructing a prefix tree for an instance $\left(t_{1}^{\prime}, \ldots, t_{n}^{\prime}\right)$ with $\max t_{i}^{\prime} \leq 2 n-1$.

This follows from the fact that the longest path from any input to the output contains at most $n-1$ prefix gates. The maximum delay difference can be assumed as $2 n-2$, since any input with earlier arrival time will never be critical.

Theorem 2 For $n$ inputs with arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, the algorithm finds a prefix carry bit circuit for $c_{n+1}$ with

$$
\operatorname{delay}\left(c_{n+1}\right) \leq k \leq\left\lfloor\log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)\right\rfloor+4
$$

The constructed logic circuit has size at most $3 n-3$ and maximum fan-out two at all logic gates and inputs. Furthermore, the delay is at most

$$
k \leq \log _{\varphi}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+2.673 \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+2.673 .
$$

Proof The size and fan-out bounds follow from Lemma 1. The delay of the constructed circuit is $k$. By choice of $k$, we know that $\sum_{i=1}^{n}\left(F_{t_{i}+3}-1\right) \geq F_{k-1}+1$. With $\varphi=\frac{1+\sqrt{5}}{2}$, $\psi=\frac{1-\sqrt{5}}{2}$ and the exact formula $F_{n}=\frac{1}{\sqrt{5}} \cdot\left(\varphi^{n}-\psi^{n}\right)$, it follows that $\left|\sqrt{5} F_{n}-\varphi^{n}\right| \leq 1$ and for $n \geq 1,\left|\sqrt{5} F_{n}-\varphi^{n}\right| \leq|\psi|$.

Now $k-1=0$ can only be true if there is only one input. In this case, the stated delay bound is trivially true. Otherwise, we obtain the estimate:

$$
\begin{aligned}
k-1 & =\log _{\varphi}\left(\varphi^{k-1}\right) \leq \log _{\varphi}\left(\sqrt{5}\left(F_{k-1}+1\right)\right) \\
& \leq \log _{\varphi}\left(\sqrt{5}\left(\sum_{i=1}^{n}\left(F_{t_{i}+3}-1\right)\right)\right) \\
& \leq \log _{\varphi}\left(\sqrt{5}\left(\sum_{i=1}^{n}\left(\frac{1}{\sqrt{5}}\left(\varphi^{t_{i}+3}+1\right)-1\right)\right)\right) \\
& \leq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}+3}\right)=\log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)+3
\end{aligned}
$$

which proves the first claim.

For a single input, the second delay bound is trivially true. Furthermore, for $t_{i} \geq 0$, $F_{t_{i}+3}-1 \leq 2^{t_{i}}$. We obtain the estimate:

$$
\begin{aligned}
k-1 & =\log _{\varphi}\left(\varphi^{k-1}\right) \leq \log _{\varphi}\left(\sqrt{5}\left(F_{k-1}+1\right)\right) \\
& \leq \log _{\varphi}\left(\sqrt{5}\left(\sum_{i=1}^{n}\left(F_{t_{i}+3}-1\right)\right)\right) \\
& \leq \log _{\varphi}\left(\sqrt{5}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)\right) \\
& =\log _{\varphi}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+\log _{\varphi} \sqrt{5} \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+1.673 .
\end{aligned}
$$
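Both bounds can be sanity-checked numerically on small instances. The sketch below assumes, as in the example of Fig. 6, that $k$ is the smallest index with $F_{k} \geq \sum_{i}\left(F_{t_{i}+3}-1\right)$; the helper names are illustrative, and the logarithms are evaluated in floating point.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def fib(m):
    """Fibonacci numbers with F_0 = 0 and F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(m):
        a, b = b, a + b
    return a


def smallest_k(arrival_times):
    # Assumption: k is the smallest index with F_k >= sum_i (F_{t_i+3} - 1).
    total = sum(fib(t + 3) - 1 for t in arrival_times)
    k = 1
    while fib(k) < total:
        k += 1
    return k


def phi_bound(arrival_times):
    # First bound of Theorem 2: floor(log_phi(sum_i phi^{t_i})) + 4.
    w = sum(PHI ** t for t in arrival_times)
    return math.floor(math.log(w, PHI)) + 4


def binary_bound(arrival_times):
    # Second bound of Theorem 2: log_phi(sum_i 2^{t_i}) + 2.673.
    s = sum(2 ** t for t in arrival_times)
    return math.log(s, PHI) + 2.673


times = [3, 2, 3, 1, 0]
print(smallest_k(times), phi_bound(times), round(binary_bound(times), 2))
```

For the arrival times $3,2,3,1,0$ this gives $k=8$, which lies below both bounds.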

Our proof allows an improvement over the delay bound of [8] due to a refined analysis. A running time of $\mathcal{O}(n \log n)$ follows from Lemma 3 assuming that we can add numbers of linear size and multiply them by a constant in constant time. Under the more practical assumption that these operations take linear time with respect to the number of digits, the algorithm has super-quadratic running time, which can be avoided as follows:

Theorem 3 For any fixed $\gamma>1$, a prefix carry bit circuit as in the setting of Theorem 2 with

$$
\operatorname{delay}\left(c_{n+1}\right) \leq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)+4+2.1 \cdot n^{1-\gamma}
$$

can be found in $\mathcal{O}\left(n \gamma \log ^{2} n\right)$ time assuming linear-time addition and multiplication with constants. It satisfies

$$
\text { delay }\left(c_{n+1}\right) \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+2.673+2.1 \cdot n^{1-1.4 \gamma}
$$

Proof By Lemma 4 and Theorem 2, we can solve instances with $\max t_{i}-\min t_{i} \leq \gamma\left\lceil\log _{\varphi} n\right\rceil$ in $\mathcal{O}\left(n \gamma \log ^{2} n\right)$ time with linear-time addition.

Given an instance with arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{\geq 0}$, we define a new instance $t_{i}^{\prime}=\max \left\{t_{i}, \max _{j \in\{1, \ldots, n\}} t_{j}-\gamma\left\lceil\log _{\varphi} n\right\rceil\right\}$ and compute a circuit for the modified instance in $\mathcal{O}\left(n \gamma \log ^{2} n\right)$ time. When reverting to the original arrival times, the delay of this solution does not increase, because none of the arrival times do. Therefore,

$$
\begin{aligned}
\operatorname{delay}\left(c_{n+1}\right)-4 & \leq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}^{\prime}}\right) \\
& \leq \log _{\varphi}\left(n \cdot \varphi^{\max t_{i}-\gamma\left\lceil\log _{\varphi} n\right\rceil}+\sum_{i=1}^{n} \varphi^{t_{i}}\right) \\
& \leq \log _{\varphi}\left(\varphi^{\max t_{i}+(1-\gamma) \log _{\varphi} n}+\sum_{i=1}^{n} \varphi^{t_{i}}\right) \\
& \leq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)+\log _{\varphi}\left(1+\varphi^{(1-\gamma) \log _{\varphi} n}\right) \\
& \leq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)+\log _{\varphi} e \cdot n^{1-\gamma},
\end{aligned}
$$

and $\log _{\varphi} e<2.1$. For the dual logarithm-based delay bound, we have

$$
\begin{aligned}
\operatorname{delay}\left(c_{n+1}\right)-2.673 & \leq \log _{\varphi} 2 \cdot \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}^{\prime}}\right) \\
& \leq \log _{\varphi} 2 \cdot \log _{2}\left(n \cdot 2^{\max t_{i}-\gamma\left\lceil\log _{\varphi} n\right\rceil}+\sum_{i=1}^{n} 2^{t_{i}}\right) \\
& \leq \log _{\varphi} 2 \cdot \log _{2}\left(2^{\max t_{i}+\left(1-\gamma \log _{\varphi} 2\right) \log _{2} n}+\sum_{i=1}^{n} 2^{t_{i}}\right) \\
& \leq \log _{\varphi} 2\left(\log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+\log _{2}\left(1+2^{\left(1-\gamma \log _{\varphi} 2\right) \log _{2} n}\right)\right) \\
& \leq \log _{\varphi} 2 \cdot \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+\left(\log _{\varphi} e\right) n^{1-\gamma \log _{\varphi} 2},
\end{aligned}
$$

and $1.4<\log _{\varphi} 2<1.441$.

For $\gamma>1$, the additional error decreases with growing $n$. Since the algorithm is only useful if $n \geq 2$, choosing a sufficiently large constant $\gamma$ yields the delay bound $1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+2.674$ with running time $\mathcal{O}\left(n \log ^{2} n\right)$.
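The rounding step from the proof of Theorem 3 is easy to state in code. The following sketch raises every arrival time to at least $\max t_{j}-\gamma\left\lceil\log _{\varphi} n\right\rceil$ and evaluates the additive error term $\log _{\varphi} e \cdot n^{1-\gamma}$. The function names and the sample instance are illustrative and not taken from the paper; the logarithm is computed in floating point, which is adequate for a sketch but not for exact arithmetic.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def clamp_arrival_times(arrival_times, gamma):
    """Return t'_i = max(t_i, max_j t_j - gamma * ceil(log_phi n))."""
    n = len(arrival_times)
    spread = gamma * math.ceil(math.log(n, PHI))
    floor_time = max(arrival_times) - spread
    return [max(t, floor_time) for t in arrival_times]


def extra_delay_bound(n, gamma):
    # Additive error log_phi(e) * n^(1 - gamma), which is below 2.1 * n^(1 - gamma).
    return math.log(math.e, PHI) * n ** (1 - gamma)


times = [40, 0, 3, 38]  # hypothetical instance with a large spread
print(clamp_arrival_times(times, 2))  # [40, 34, 34, 38]
print(extra_delay_bound(len(times), 2))
```

After clamping, the spread of the arrival times is at most $\gamma\left\lceil\log _{\varphi} n\right\rceil$, so all numbers handled by the algorithm stay small.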

## 3 Algorithm for Prefix Adder Circuits

The naïve parallel prefix graph construction, in which all carry bits are computed separately by a carry bit circuit, might contain a quadratic number of gates. Therefore, Rautenbach et al. also developed a parallel prefix graph construction computing all carry bits [9].

Theorem 4 (Rautenbach et al. [9]) Given arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, there is a parallel prefix graph for $n$ inputs of size $\mathcal{O}(n \log \log n)$ with logic delay $2 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+6 \log _{2} \log _{2} n+\mathcal{O}(1)$.

The primary objective in [9] is to minimize the delay of the prefix graph instead of the underlying logic circuit. We will improve the performance guarantee for a similar construction as in [9] by using a carry bit circuit as in Sect. 2 as a subroutine.

Given $n$ inputs, we partition the set $\{1, \ldots, n\}$ into $l=\lceil\sqrt{n}\rceil$ subsets $V_{1}, \ldots, V_{l}$, each containing $l$ or $l-1$ consecutive indices. Let $Z_{i}=\circ_{j \in V_{i}} z_{j}$, where $Z_{i}$ is computed by a circuit constructed by the carry bit algorithm. This is shown in green, labeled "Best", in Fig. 7. The parallel prefix graph construction is applied recursively to compute prefixes for all groups without their highest index as well as for the $l-1$ inputs $Z_{1}, \ldots, Z_{l-1}$ (which corresponds to the red boxes labeled "Recursion" in Fig. 7), i.e. we build $l+1$ parallel prefix graphs, each with at most $l-1$ inputs. As a final step, we combine all prefixes from group $i$ with the $(i-1)$-th prefix of the $Z_{i}$ and add one more prefix gate combining $Z_{l}$ with the $(l-1)$-th prefix of the $Z_{i}$. This yields a parallel prefix graph.
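The partition into groups of consecutive indices can be sketched as follows. The text does not say which groups receive the larger size, so this sketch simply distributes the indices as evenly as possible; that choice, and the function name, are assumptions. The group sizes are then $l$ or $l-1$ whenever $n \geq l(l-1)$; for smaller $n$ the sketch falls back to smaller groups.

```python
import math


def partition_inputs(n):
    """Split 1..n into l = ceil(sqrt(n)) groups of consecutive indices."""
    l = math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for n >= 1, in integer arithmetic
    base, extra = divmod(n, l)
    groups, start = [], 1
    for g in range(l):
        # Assumption: the first `extra` groups get one additional index.
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


print([len(g) for g in partition_inputs(25)])  # [5, 5, 5, 5, 5]
print([len(g) for g in partition_inputs(7)])   # [3, 2, 2]
```

For $n=25$ this yields the five groups of five inputs used in Fig. 8.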

The following two lemmas analyze the size of the resulting parallel prefix graph and the running time of its construction.

Lemma 5 The parallel prefix graph in [9] and the modified construction above have the same size; for $n \geq 3$, it is bounded by $2 n \log _{2} \log _{2} n$ in terms of prefix gates and $6 n \log _{2} \log _{2} n$ in terms of logic gates.

Proof Consider Fig. 8 and proceed by induction on the number of inputs. On a level with $n$ inputs, the number of prefix gates in all the green carry bit circuits (labeled "Best") and the number of yellow prefix gates in the bottom row are both at most $n$. The total number of inputs of recursion blocks can be bounded by $n$ as well: if there are $l$ groups, then $n-l$ original inputs are inputs of recursion blocks; one further recursion block has $l-1$ inputs. For small $n$, the correctness follows from Fig. 8, e.g. for $n=3$, the size bound is 5 and 3 gates are required.

Fig. 7 Prefix graph construction

Fig. 8 Parallel prefix graph for uniform arrival times

Let $V_{1}, \ldots, V_{l}$ be the groups, $l \geq 3$, then by induction hypothesis, the prefix gate size is bounded by

$$
\begin{aligned}
2 n & +2(l-1) \log _{2} \log _{2}(l-1)+\sum_{i=1}^{l} 2\left(\left|V_{i}\right|-1\right) \log _{2} \log _{2}\left(\left|V_{i}\right|-1\right) \\
& \leq 2 n+\sum_{i=1}^{l} 2\left|V_{i}\right| \log _{2} \log _{2}(l-1) \\
& \leq 2 n+2 n \log _{2} \log _{2} \sqrt{n}=2 n \log _{2} \log _{2} n .
\end{aligned}
$$
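The final equality in this estimate can be checked directly, since taking the square root halves the inner logarithm:

$$
\begin{aligned}
\log _{2} \log _{2} \sqrt{n} & =\log _{2}\left(\tfrac{1}{2} \log _{2} n\right)=\log _{2} \log _{2} n-1, \\
2 n+2 n \log _{2} \log _{2} \sqrt{n} & =2 n+2 n\left(\log _{2} \log _{2} n-1\right)=2 n \log _{2} \log _{2} n .
\end{aligned}
$$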

For logic gates, the size increases by a factor of three.

Lemma 6 The parallel prefix graph above can be computed in $\mathcal{O}\left(n \log ^{2} n\right)$ time.

Proof As in Theorem 3, we round all arrival times up to at least $\max t_{i}-\gamma\left\lceil\log _{2} n\right\rceil$ for a fixed $\gamma>1$. For this arrival time profile, we have already shown that

$$
1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}^{\prime}}\right) \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+2.1 \cdot n^{1-1.4 \gamma}
$$

Therefore, we can use the rounded arrival time profile to achieve a delay guarantee of

$$
1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+5 \log _{2} \log _{2} n+4.5
$$

for $\gamma=3$. This means that all numbers in the computations have size at most $\mathcal{O}(\log n)$, thus it remains to bound the number of operations by $\mathcal{O}(n \log n)$.

For each level $l=1, \ldots, \log _{2} \log _{2} n$ of the recursion, we have a partition of the $n$ inputs into groups, where the maximum group size is bounded by $n^{1 / 2^{l}}$. Therefore, the prefix trees for the $Z_{i}$ can be computed in $\mathcal{O}\left(n \log \left(n^{1 / 2^{l}}\right)\right)=\mathcal{O}\left(\left(\frac{1}{2}\right)^{l} n \log n\right)$ time. All $Z_{i}$ require time $\mathcal{O}(n \log n)$, because this is a geometric series. All remaining gates are prefix gates; they have fixed positions, thus each of them requires only constant time to compute, and there are $\mathcal{O}(n \log \log n)$ such gates in total.

The new parallel prefix graph construction is summarized in the following theorem.

Theorem 5 Given $n \in \mathbb{N}$ and arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, our algorithm finds a parallel prefix graph with logic gate delay at most

$$
\log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)+5 \log _{2} \log _{2} n+4.5 \leq 1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+5 \log _{2} \log _{2} n+4.5
$$

It can be implemented with running time $\mathcal{O}\left(n \log ^{2} n\right)$ and the computed circuit has size at most $6 n \log _{2} \log _{2} n$ in terms of logic gates.

This theorem implies that if the number $n$ of inputs is sufficiently large, we have a 1.441-approximation algorithm in terms of the delay for a prefix adder. The algorithm of [9] has a running time of $\Omega\left(n^{2}\right)$, which the use of our carry bit algorithm improves to a near-linear running time, even with linear-time addition.

To prove the delay bound, we assume that all arrival times are skewed by one time unit. Under this assumption, let $w=\sum_{i=1}^{n} \varphi^{t_{i}}$, and let $\operatorname{delay}(w, n)$ denote the maximum delay for a circuit constructed as above with $n \geq 3$ inputs and an arrival time profile leading to the same $w$. Then $\operatorname{delay}(w, n)+1$ is an upper bound on the delay of the constructed circuit, and we have:

Lemma 7 For $n$ input pairs with skewed arrival times $t_{1}, \ldots, t_{n}$, let $w=\sum_{i=1}^{n} \varphi^{t_{i}}$. Then we have

$$
\text { delay }(w, n) \leq \log _{\varphi} w+5 \log _{2} \log _{2} n+3
$$

Proof We may assume that the given arrival time profile achieves the maximum delay, i.e. for $t_{1}, \ldots, t_{n}$, the construction has a delay of $\operatorname{delay}(w, n)$.

By Theorem 2 and using the assumption that the propagate signals arrive earlier than the generate signals, we can compute $Z_{i}$ with a delay of at most $\log _{\varphi}\left(\sum_{j \in V_{i}} \varphi^{t_{j}}\right)+3$. Therefore, their prefix graph has delay at most

$$
\text { delay }\left(\sum_{i=1}^{l} \varphi^{\log _{\varphi}\left(\sum_{j \in V_{i}} \varphi^{t_{j}}\right)+3},\lceil\sqrt{n}\rceil-1\right)=\operatorname{delay}\left(\varphi^{3} \cdot \sum_{j=1}^{n} \varphi^{t_{j}},\lceil\sqrt{n}\rceil-1\right) .
$$

For each of the groups $V_{i}$ containing $\lceil\sqrt{n}\rceil$ or $\lceil\sqrt{n}\rceil-1$ inputs, the prefix graph of all but its last input (highest index) has delay at most

$$
\operatorname{delay}(w,\lceil\sqrt{n}\rceil-1) \leq \operatorname{delay}\left(\varphi^{3} w,\lceil\sqrt{n}\rceil-1\right)
$$

as $\operatorname{delay}(w, n)$ is monotonically increasing in $w$ and $n$. Therefore, the combination of a prefix of one of the $V_{i}$ and the corresponding prefix of the $Z_{i}$ has logic gate delay at most $\operatorname{delay}(w, n) \leq \operatorname{delay}\left(\varphi^{3} w,\lceil\sqrt{n}\rceil-1\right)+2$.

We prove the absolute delay estimate by induction on $n$. For $n \leq 3$, $\operatorname{delay}(w, n)-\log _{\varphi} w \leq 3$ for all input sequences with this parameter $w$ as $\log _{\varphi} w \geq \max _{i} t_{i}$. Therefore, for $n \geq 4$, $\operatorname{delay}(w, n)$ is bounded by

$$
\begin{aligned}
\operatorname{delay}\left(\varphi^{3} w,\lceil\sqrt{n}\rceil-1\right)+2 & \leq \log _{\varphi}\left(\varphi^{3} w\right)+5 \log _{2} \log _{2}(\sqrt{n})+5 \\
& \leq \log _{\varphi} w+5 \log _{2}\left(0.5 \log _{2} n\right)+8 \\
& =\log _{\varphi} w+5 \log _{2} \log _{2} n-5+8 \\
& =\log _{\varphi} w+5 \log _{2} \log _{2} n+3 .
\end{aligned}
$$

Without assuming skewed arrival times, this construction has a delay bound of $\log _{\varphi} w+5 \log _{2} \log _{2} n+4$. For $n=25$, an example is shown in Fig. 8. Gates are colored by the part of the recursion they represent; in this special case, some gates can be used to compute the $Z_{i}$ as well as the group prefixes, hence they are colored both red (right half) and green (left half).

The construction in [9] and our variant of it both have a very high fan-out; for $n$ inputs, the fan-out is at least $\lceil\sqrt{n}\rceil$. In a physical implementation such a high fan-out induces a significant delay and requires the insertion of duplicate gates into the interconnect to repeat the signals. The high fan-outs occur precisely at the $Z_{i}$ prefixes, therefore they accumulate on a critical path. For $n$ inputs, the fan-out can be redistributed to duplicate gates with fan-out 2 using depth $\frac{1}{2}\left\lceil\log _{2} n\right\rceil+1$; this will lead to an overall increase in delay of $\left\lceil\log _{2} n\right\rceil+\mathcal{O}\left(\log _{2} \log _{2} n\right)$ for a given path [9]. Therefore, we obtain a 2.441-approximation algorithm if the fan-out is bounded by 2, improving the 3-approximation achieved by [9] in this scenario. A 3-approximation is also obtained by the Kogge-Stone adder [5] for arbitrary input arrival times. However, it comes with a larger size of $\mathcal{O}\left(n \log _{2} n\right)$.

## 4 A Lower Bound for Prefix Adders

Proposition 1 shows that a lower bound for the delay of a prefix tree for a single carry bit is given by an optimal binary tree with depth one for the left child and depth two for the right child in which the leaves represent inputs and their right-to-left order corresponds to the ordering of the inputs. For zero arrival times, this is achieved by a Fibonacci tree. Rautenbach et al. [10] observed that this is a special case of a more general concept: alphabetic code trees with unequal letter costs. These can be used to obtain general lower bounds, which we improve and state explicitly by using the specific properties of our application.

Lemma 8 Given $n$ inputs with integral arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, a prefix tree computing their carry bit $c_{n+1}$ has logic gate delay at least

$$
\operatorname{delay}\left(c_{n+1}\right) \geq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)-1
$$

Proof In Sect. 2 we saw that $F_{t_{i}+1}$ inputs with arrival time zero can be combined with depth $t_{i}$. Therefore, an optimal prefix tree for inputs with arrival times $t_{1}, \ldots, t_{n}$ of delay $k$ can be restructured into a prefix tree with $\sum_{i=1}^{n} F_{t_{i}+1}$ inputs with depth $k$ by replacing input $i$ by a Fibonacci tree for $t_{i}+1$. If there is only one input, the lemma is trivially true, thus we may assume $\sum_{i=1}^{n} F_{t_{i}+1} \geq 2$. But a tree of depth $k$ has at most $F_{k+1}$ leaves, hence $k \geq 2$ and

$$
\begin{aligned}
k+1=\log _{\varphi}\left(\varphi^{k+1}\right) & \geq \log _{\varphi}\left(\sqrt{5}\left(F_{k+1}+\frac{\psi^{3}}{\sqrt{5}}\right)\right) \\
& \geq \log _{\varphi}\left(\sqrt{5}\left(\sum_{i=1}^{n} F_{t_{i}+1}+\frac{\psi^{3}}{\sqrt{5}}\right)\right) \\
& \geq \log _{\varphi}\left(\sum_{i=1}^{n}\left(\varphi^{t_{i}+1}-\psi^{2}+\psi^{3}\right)\right) \\
& =\log _{\varphi}\left(\sum_{i=1}^{n}\left(\varphi^{t_{i}+1}-\varphi \psi^{2}\right)\right) \\
& \geq \log _{\varphi}\left(\varphi\left(1-\psi^{2}\right) \cdot \sum_{i=1}^{n} \varphi^{t_{i}}\right)=\log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)
\end{aligned}
$$

and $k \geq \log _{\varphi}\left(\sum_{i=1}^{n} \varphi^{t_{i}}\right)-1$ as claimed.

This lemma shows that the single carry bit circuits in Sect. 2 have optimum delay up to an additive margin of 5 .
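The gap between the lower bound of Lemma 8 and the first upper bound of Theorem 2 can be evaluated numerically. The sketch below does this for the arrival times $3,2,3,1,0$ from Fig. 6; the function names are illustrative and the logarithms are computed in floating point.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def lower_bound(arrival_times):
    # Lemma 8: delay >= log_phi(sum_i phi^{t_i}) - 1.
    return math.log(sum(PHI ** t for t in arrival_times), PHI) - 1


def upper_bound(arrival_times):
    # Theorem 2: delay <= floor(log_phi(sum_i phi^{t_i})) + 4.
    return math.floor(math.log(sum(PHI ** t for t in arrival_times), PHI)) + 4


times = [3, 2, 3, 1, 0]
lo, hi = lower_bound(times), upper_bound(times)
print(math.ceil(lo), hi)  # 5 9
```

Since delays are integral, any prefix tree for this instance has delay at least 5, while the constructed circuit is guaranteed a delay of at most 9; the difference between the two bounds never exceeds 5.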

## 5 Technology Mapping

In CMOS technology, AND and OR gates are significantly slower than their negated counterparts. Therefore, in the following we will transform the constructed logic circuits to use only NAND and NOR gates without increasing their delay too much.

To this end, we introduce two new prefix gates, $\triangle$ and $\nabla$, shown in Fig. 9a, b. Much like the non-inverted prefix gate in Fig. 1, they take as input two pairs of generate and propagate signals, and compute as output their combined generate and propagate signal. Unlike $\circ$ gates, these gates assume that exactly one of $g$ and $p$ is inverted, and the same is true for the output generate and propagate signal.

The new prefix gates also preserve the property that the delay of the gate is one for the left input and two for the right input assuming skewed arrival times as before. Therefore, replacing non-inverted prefix gates with $\triangle$ and $\nabla$ almost preserves the delay, as the following lemma shows.


Fig. 9 Inverting prefix gates. a A prefix gate using only NAND and NOT. b A prefix gate using only NOR and NOT

Lemma 9 A non-inverted prefix gate tree computing a single carry bit can be transformed into a prefix tree of $\triangle$ and $\nabla$ gates without increasing the delay by more than 1. The transformed circuit has size $5 n-4$.

Proof We begin at the root of the tree, where we replace the prefix gate by the inverting prefix gate ($\triangle$ or $\nabla$) whose output carry bit (generate signal) is non-inverted. For the predecessors we proceed as follows. The right child of an inverted prefix gate requires the same inversion for $g$ and $p$ as the output pair, so the right child of a $\triangle$ gate should be a $\triangle$ gate, and the right child of a $\nabla$ gate should be a $\nabla$ gate. The left child requires the reverse inversion of $g$ and $p$, so the left child of a $\nabla$ should be a $\triangle$ and vice versa. Making these replacements as needed is possible since the original prefix circuit is a tree in terms of prefix gates. Since the delays of the non-inverted and inverted prefix gates are the same, so is the delay of the resulting tree. However, since each pair of input signals $\left(g_{i}, p_{i}\right)$ will be used with exactly one of the components inverted, each input has to use an additional inverter, thus increasing the overall delay by one. The size is the sum of $4(n-1)$ logic gates in the $n-1$ $\nabla$ and $\triangle$ gates, and $n$ inverters at the inputs.
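The top-down replacement rule from this proof can be written as a short recursion. In the sketch below a prefix tree is a nested pair `(left, right)` with input indices at the leaves, and the labels `"NAND"` and `"NOR"` are placeholder names for the two inverting prefix gates of Fig. 9. Which gate belongs at the root is left as a parameter, since it depends on the gate definitions in the figure; all function names are illustrative.

```python
def assign_gates(tree, gate="NAND"):
    """Label every internal node of a binary prefix tree with a gate type.

    A leaf is any non-tuple value (for example an input index); an internal
    node is a pair (left, right). The result uses triples (gate, left, right).
    """
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    other = "NOR" if gate == "NAND" else "NAND"
    # The right child keeps the same inversion; the left child needs the reverse.
    return (gate, assign_gates(left, other), assign_gates(right, gate))


def count_prefix_gates(labeled):
    if not isinstance(labeled, tuple):
        return 0
    _, left, right = labeled
    return 1 + count_prefix_gates(left) + count_prefix_gates(right)


def circuit_size(labeled, n):
    # Four logic gates per inverting prefix gate plus one inverter per input.
    return 4 * count_prefix_gates(labeled) + n


tree = ((5, 4), (3, (2, 1)))  # hypothetical prefix tree on n = 5 inputs
labeled = assign_gates(tree)
print(labeled)
print(circuit_size(labeled, 5))  # 21
```

For $n=5$ the tree has $4$ prefix gates, so the size is $4 \cdot 4+5=21=5 n-4$, in line with Lemma 9.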

Similarly, we can map the non-inverted full adder from Fig. 7 to NAND, NOR, and NOT gates by applying Lemma 9 to the prefix carry bit circuits labeled "Best".

Lemma 10 Given $n \in \mathbb{N}$ and arrival times $t_{1}, \ldots, t_{n} \in \mathbb{N}_{0}$, we can compute a full adder (computing all carry bits) using only NAND, NOR and NOT gates with delay at most

$$
1.441 \log _{2}\left(\sum_{i=1}^{n} 2^{t_{i}}\right)+6\left\lceil\log _{2} \log _{2} n\right\rceil+6.5
$$

and size $\mathcal{O}\left(n \log _{2} \log _{2} n\right)$.

Proof We require an additional delay of one to ensure that in the beginning all signals are available inverted and non-inverted as needed, and an additional delay of one at the outputs as well. The recursion has $\left\lceil\log _{2} \log _{2} n\right\rceil$ levels. For each level, we need to correct inversions at most once, after the first level of "Best" and "Recursion" computations, thus also supplying each input signal of the next level of the recursion in inverted and non-inverted form. Depending on how the outputs are inverted after the recursion is applied to the carry bits of the groups as computed by "Best", the prefix gates in Fig. 7 are replaced by $\triangle$ or $\nabla$ gates. After correcting the inversion at the outputs, the resulting circuit computes all carry bits with an additional delay of $\log _{2} \log _{2} n+2$ compared to the non-inverted construction. The additional number of gates is at most linear at every level, and thus $\mathcal{O}\left(n \log _{2} \log _{2} n\right)$ overall.

This shows how the given constructions that were formulated in terms of non-inverted prefix gates can be transformed into circuits using only inverting gates, which are more useful for practical applications. The transformations do not increase the asymptotic delays. For the carry bit circuit the delay increases just by one. Thereby, the size bounds grow by a small constant factor.

## Compliance with Ethical Standards

Conflict of interest The authors declare that they have no conflict of interest.

## References

1. Choi, Y.: Parallel Prefix Adder Design. Dissertation, University of Texas at Austin (2004)
2. Cole, R., Vishkin, U.: Faster optimal parallel prefix sums and list ranking. Inf. Comput. 81(3), 334-352 (1989)
3. Keeter, M., Harris, D.M., Macrae, A., Glick, R., Ong, M., Schauer, J.: Implementation of 32-bit Ling and Jackson adders. In: Proceedings of the Forty Fifth Asilomar Conference on Signals, Systems and Computers, pp. 170-175 (2011)
4. Knowles, S.: A family of adders. In: Proceedings of the 15th IEEE Symposium on Computer Arithmetic (ARITH-15), pp. 277-281 (2001)
5. Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. 100(8), 786-793 (1973)
6. Ladner, R.E., Fischer, M.J.: Parallel prefix computation. J. ACM 27(4), 831-838 (1980)
7. Oklobdzija, V.G.: Design and analysis of fast carry-propagate adder under non-equal input signal arrival profile. In: Proceedings of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 1398-1401 (1994)
8. Rautenbach, D., Szegedy, C., Werber, J.: Delay optimization of linear depth Boolean circuits with prescribed input arrival times. J. Discrete Algorithms 4(4), 526-537 (2006)
9. Rautenbach, D., Szegedy, C., Werber, J.: The delay of circuits whose inputs have specified arrival times. Discrete Appl. Math. 155(10), 1233-1243 (2007)
10. Rautenbach, D., Szegedy, C., Werber, J.: On the cost of optimal alphabetic code trees with unequal letter costs. Eur. J. Comb. 29(2), 386-394 (2008)
11. Roy, S., Choudhury, M., Puri, R., Pan, D.Z.: Towards optimal performance-area trade-off in adders by synthesis of parallel prefix structures. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 33(10), 1517-1530 (2014)
12. Roy, S., Choudhury, M., Puri, R., Pan, D.Z.: Polynomial time algorithm for area and power efficient adder synthesis in high-performance designs. In: Proceedings of the 20th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 249-254 (2015)
13. Weinberger, A., Smith, J.L.: A logic for high-speed addition. Natl. Bur. Stand. Circul. 591, 3-12 (1958)
14. Werber, J., Rautenbach, D., Szegedy, C.: Timing optimization by restructuring long combinatorial paths. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 536-543 (2007)
15. Zimmermann, R.: Binary Adder Architectures for Cell-Based VLSI and Their Synthesis. Dissertation, Swiss Federal Institute of Technology (ETH) in Zurich (1998)

[^0]:    Sophie Spirkl
    sspirkl@princeton.edu
    Stephan Held
    held@or.uni-bonn.de
    1 Research Institute for Discrete Mathematics, University of Bonn, Lennéstr. 2, 53113 Bonn, Germany

    2 Program for Applied and Computational Mathematics, Princeton University, Fine Hall, Washington Road, Princeton, NJ 08544, USA

