Abstract-Binary addition is the most fundamental and frequently used operation. A well-designed adder should be fast and [...] of inputs may not be uniform for a specific application. For example, Fig. 1 shows the output delay profile of partial [...] carry-select adders. Compared with Kogge-Stone and Brent-Kung adders, the results of the proposed approach have the smallest output delay.
I. INTRODUCTION
The datapath module is essential for high-quality ASIC design and may dominate the performance of the whole system. Arithmetic components, such as adders, multipliers and shifters, are considered the basic cells from which a datapath is constructed. The design of arithmetic components should deliver high performance and satisfy the application requirements. Binary addition is the most fundamental and frequently used operation in computing systems. To speed up binary addition, many different architectures have been proposed over the years.
The ripple-carry adder has the minimal area, but is quite slow. The carry-skip adder [1] can speed up binary addition with a small hardware overhead. The carry-select adder [2] accelerates binary addition further, but suffers from a large hardware penalty. The carry-lookahead adder [3] [4] comes with prefix computation. It has $O(\log n)$ time and $O(n \log n)$ area. The Brent-Kung parallel prefix adder and the Kogge-Stone parallel prefix adder are two classical regular prefix computation structures, which reach the lower bound of area and the lower bound of time [5], respectively.
[...] than the Kogge-Stone adder. However, this algorithm cannot guarantee finding the delay-optimal prefix structure.
In this paper, we propose an algorithmic approach to generate an irregular parallel-prefix adder, which has the minimal delay for a given profile of input signals. The approach can cover different topologies of ripple-carry adder, carry-skip adder and carry-select adder. The time complexity of the proposed algorithm is $O(n^3)$. To minimize the area cost, a heuristic backward reduction procedure is added. The experimental results show that the proposed approach outperforms both Brent-Kung and Kogge-Stone parallel prefix schemes for particular applications.
The rest of the paper is organized as follows. The problem and timing model are defined in Section II. The algorithmic approach is presented in Section III. In Section IV, we show how this approach covers ripple-carry, carry-skip and carry-select adders. The backward reduction procedure is introduced in Section V. Section VI analyzes the experimental results, followed by conclusions and future work in Section VII.
II. PRELIMINARIES

A. Prefix Addition [6]
We define the binary addition problem as follows: given an n-bit augend $A$, an n-bit addend $B$, and a 1-bit carry-in $c_0$, generate the n-bit sum $S$ and the 1-bit carry-out $c_n$. Suppose [...] To simplify the representation of $G$ and $P$, an operator is [...]
To make representations easier, the simplest timing model is used in Sections III to V, which is similar to the timing model in [5]. A gp generator, like a sum generator, takes 1 unit of delay from all inputs to outputs. A $(G, P)$ adder takes 2 units of delay from all inputs to outputs. A more accurate timing model is used in the experimental results (Section VI). Since pre-processing and post-processing have constant delay, prefix computation becomes the core of prefix adders.
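The three stages named above (gp generation, prefix computation, sum generation) can be sketched as follows. This is a minimal illustration that assumes the standard textbook definitions $g_i = a_i b_i$, $p_i = a_i \oplus b_i$ and the usual prefix operator $(G, P) \circ (G', P') = (G + P G', P P')$, since the defining equations are not legible in this copy. The names `combine` and `prefix_add` are hypothetical, and the prefix stage is computed serially for clarity rather than with any particular parallel structure.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) pair, each 0 or 1


def combine(hi: GP, lo: GP) -> GP:
    """Assumed prefix operator: merge a higher (G, P) with the adjacent lower one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a: int, b: int, c0: int, n: int) -> Tuple[int, int]:
    """Add two n-bit numbers with carry-in c0; return (sum, carry_out)."""
    # Pre-processing: one gp generator per bit position.
    gp: List[GP] = [
        (((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)
    ]

    # Prefix computation (serial here): prefix[i] covers bits i down to 0.
    prefix: List[GP] = []
    for i in range(n):
        prefix.append(gp[i] if i == 0 else combine(gp[i], prefix[i - 1]))

    # Post-processing: carry into each bit, then one sum generator per bit.
    carries = [c0] + [g | (p & c0) for g, p in prefix]
    s = 0
    for i in range(n):
        s |= (gp[i][1] ^ carries[i]) << i
    return s, carries[n]


# Example: 4-bit addition 11 + 6 with carry-in 1.
example_sum, example_carry = prefix_add(11, 6, 1, 4)
```

Any prefix structure that produces the same `prefix` list can replace the serial loop; only the delay and the number of $(G, P)$ adders change.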
The forward process constructs $(G, P)$s level by level. Initially all $g_i p_i$ are in level 1. All $(G, P)$s with length 2 are constructed first, which are denoted as $(G, P)_{[k+1,k]}$. These $(G, P)$s form level 2, and $(G, P)_{[k+1,k]}$ can only be generated from the pair $\{g_{k+1}p_{k+1}, g_k p_k\}$. The succeeding levels are constructed step by step. When constructing level $l$, all $(G, P)$s in the previous levels are known. All possible combinations of $(G, P)$ pairs are tried without overlap, and the pair with minimal delay is selected to construct the current $(G, P)$.
Fig. 4 illustrates the above process. Consider $(G, P)_{[3,1]}$ as an example. There are two ways to construct $(G, P)_{[3,1]}$: from the pair $\{(G, P)_{[3,2]}, g_1 p_1\}$ or from the pair $\{g_3 p_3, (G, P)_{[2,1]}\}$. These pairs have delays of 8 and 7 units respectively, so $\{g_3 p_3, (G, P)_{[2,1]}\}$ is selected to construct $(G, P)_{[3,1]}$. Pairs are selected in this way until the last $(G, P)$ is finally constructed.
Once we have all $(G, P)_{[i,0]}$, which are needed for the post-processing stages, a backward process is added to minimize the area. A simple method to reduce the number of $(G, P)$ adders is to remove the unnecessary $(G, P)$s from the prefix structure. We do so by checking each $(G, P)$ backward. If one $(G, P)$ does not output to any other component, the component and its input connections are removed. In this example, $(G, P)_{[4,1]}$, $(G, P)_{[...]}$ and $(G, P)_{[...]}$ are removed first. As a result, $(G, P)_{[...]}$ and $(G, P)_{[2,1]}$ are now unnecessary. The pseudo-code for prefix adder reduction follows.
Algorithm 2 Prefix Adder Reduction
reduce_prefix_adder(data_width)
  n = data_width;
  [...] with "in_use";
  [...]
  if (G, P)[j+i-1, j] is labelled with "in_use" then
    label the two parents of (G, P)[j+i-1, j] with "in_use";
  [...]
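Because parts of the pseudo-code are missing in this copy, the labelling idea can be illustrated with a short sketch. It is an assumption-laden reading, not the authors' exact procedure: the outputs are labelled first, the label is then propagated backward to the two parents of every labelled $(G, P)$, and anything left unlabelled is removed. The `parents` dictionary, the tuple encoding of $(G, P)_{[i,j]}$ as `(i, j)`, and the 4-bit example structure are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Node = Tuple[int, int]  # (G, P)[i, j] is written as the tuple (i, j)
Parents = Dict[Node, Optional[Tuple[Node, Node]]]  # None marks a gp generator


def reduce_prefix_adder(parents: Parents, outputs: Iterable[Node]) -> Set[Node]:
    """Return the nodes that remain after removing every (G, P) without fanout."""
    in_use: Set[Node] = set(outputs)
    # Visit longer intervals first, so each node has already received its
    # label (or not) from all of its consumers before its parents are examined.
    for node in sorted(parents, key=lambda ij: ij[0] - ij[1], reverse=True):
        pair = parents[node]
        if node in in_use and pair is not None:
            in_use.update(pair)
    return in_use


# Hypothetical structure: bits 0..3, every interval constructed, and only the
# prefixes ending at bit 0 needed by the post-processing stage.
parents: Parents = {
    (0, 0): None, (1, 1): None, (2, 2): None, (3, 3): None,
    (1, 0): ((1, 1), (0, 0)),
    (2, 1): ((2, 2), (1, 1)),
    (3, 2): ((3, 3), (2, 2)),
    (2, 0): ((2, 2), (1, 0)),
    (3, 1): ((3, 3), (2, 1)),
    (3, 0): ((3, 2), (1, 0)),
}
kept = reduce_prefix_adder(parents, outputs=[(0, 0), (1, 0), (2, 0), (3, 0)])
removed = set(parents) - kept
```

In this example `(3, 1)` has no fanout and is dropped, which in turn leaves `(2, 1)` without any consumer, mirroring the cascading removal described in the text.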
Obviously, the time complexity and space complexity of this algorithm are $O(n^3)$ and $O(n^2)$ respectively. From the observation of $(G, P)$, we have two conclusions.
Lemma 1: $(G, P)$ overlap is not necessary to keep the minimal delay. Overlap can be eliminated by using a $(G, P)$ with shorter length, since the delay of a $(G, P)$ decreases as its length is reduced. This property reduces the time complexity of the algorithm from $O(n^4)$ to $O(n^3)$.
Theorem 2: When $G$ and $P$ are not allowed to separate, Algorithm 1 has minimal delay. This property is based on the following two observations: 1) to achieve minimal delay at one $(G, P)$, all $(G, P)$s on its critical path should have minimal delay; 2) it is not harmful to make all $(G, P)$s off the critical path arrive as early as possible. As a direct inference, a minimal delay can be obtained from minimal delays in the previous levels. Since level 1 and level 2 have minimal delays by nature, Theorem 2 holds by induction on the level.
In our experiments, separating $G$ and $P$ did not lead to better timing results, so we did not include $G$ and $P$ separation in the dynamic programming algorithm.
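The forward dynamic programming procedure can be sketched as below. This is a reconstruction from the prose under stated assumptions, not the paper's Algorithm 1: it uses the simple timing model (1 unit per gp generator, 2 units per $(G, P)$ adder), tries only non-overlapping pairs as Lemma 1 allows, and keeps $G$ and $P$ together. The function name `forward_prefix_dp` and the table layout are hypothetical.

```python
from typing import Dict, List, Tuple

GP_GEN_DELAY = 1    # gp generator delay in the simple timing model
GP_ADDER_DELAY = 2  # (G, P) adder delay in the simple timing model

Node = Tuple[int, int]  # (G, P)[i, j] with i >= j, bits numbered from 1


def forward_prefix_dp(arrival: List[int]) -> Tuple[Dict[Node, int], Dict[Node, int]]:
    """Return (delay, split) tables for every interval [i, j].

    arrival[k - 1] is the arrival time of input bit k. split[(i, j)] = k means
    (G, P)[i, j] is built from the pair {(G, P)[i, k + 1], (G, P)[k, j]}.
    """
    n = len(arrival)
    delay: Dict[Node, int] = {}
    split: Dict[Node, int] = {}
    # Level 1: one gp generator per bit.
    for k in range(1, n + 1):
        delay[(k, k)] = arrival[k - 1] + GP_GEN_DELAY
    # Higher levels: intervals of growing length, O(n^2) intervals in total.
    for length in range(2, n + 1):
        for j in range(1, n - length + 2):
            i = j + length - 1
            best_k, best = j, None
            # O(n) non-overlapping split points per interval.
            for k in range(j, i):
                t = max(delay[(i, k + 1)], delay[(k, j)]) + GP_ADDER_DELAY
                if best is None or t < best:
                    best_k, best = k, t
            delay[(i, j)] = best
            split[(i, j)] = best_k
    return delay, split


# Uniform arrival times on 8 bits, and a 4-bit profile with a late top bit.
uniform_delay, _ = forward_prefix_dp([0] * 8)
late_delay, late_split = forward_prefix_dp([0, 0, 0, 5])
```

The three nested loops give the $O(n^3)$ time bound, and the two tables indexed by interval give the $O(n^2)$ space bound. With a late most significant bit, the selected split isolates that bit so that it passes through a single $(G, P)$ adder.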
IV. TRANSFORMATION TO GENERIC ADDERS
The proposed algorithm can cover the topologies of ripple-carry, carry-skip and carry-select adders. Based on these topologically identical structures, the generated structures can be partially converted to different adders. According to the structure comparisons, the resulting structure is optimal not only among parallel-prefix adders, but also among combinations of different adders.
B. Transformation to carry-skip adder
Fig. 6 shows a carry-skip structure and the corresponding result of the proposed approach. The two dashed boxes denote two corresponding blocks. They have the same inputs and their outputs are connected to the same logic. Note that the carry-skip logic in the carry-skip adder is exactly the same as the $G$ part of a $(G, P)$ adder.
C. Transformation to carry-select adder
Fig. 7 shows a carry-select structure and the corresponding result of the proposed approach. In the carry-select adder, all possible results are pre-calculated and wait for the carry-in. When the carry-in is ready, it feeds all columns simultaneously, without propagation between columns. The same topology exists in the resulting structure. The two dashed boxes denote two corresponding carry-select blocks. The principal difference is the moment at which the carry-in $c_0$ is processed. In the carry-select adder, $s_i^0$ and $s_i^1$ are generated before the carry-in is processed, so the delay from $c_0$ to $s_i$ is only the delay of a MUX. In the resulting structure, the carry-in is processed earlier, and the delay from $c_0$ to $s_i$ increases to the delay of a $(G, P)$ adder plus the delay of a sum generator.
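The functional equivalence of the two carry-select blocks can be checked exhaustively on a small example. The sketch below assumes the standard definitions $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$ and models only logic values, not delays. The function names and the 4-bit block width are hypothetical choices for illustration.

```python
from itertools import product
from typing import List, Sequence


def ripple_sums(a: Sequence[int], b: Sequence[int], cin: int) -> List[int]:
    """Sum bits of one block (least significant bit first) for a fixed carry-in."""
    sums, c = [], cin
    for ai, bi in zip(a, b):
        sums.append(ai ^ bi ^ c)
        c = (ai & bi) | ((ai ^ bi) & c)
    return sums


def carry_select_sums(a: Sequence[int], b: Sequence[int], c0: int) -> List[int]:
    """Carry-select block: both candidates are precomputed, c0 drives a MUX."""
    s0 = ripple_sums(a, b, 0)  # candidate sums for c0 = 0
    s1 = ripple_sums(a, b, 1)  # candidate sums for c0 = 1
    return [hi if c0 else lo for lo, hi in zip(s0, s1)]


def carry_in_first_sums(a: Sequence[int], b: Sequence[int], c0: int) -> List[int]:
    """Prefix-style block: c0 is merged with (G, P) first, then the sum generator."""
    sums = []
    g_acc, p_acc = 0, 1  # (G, P) of the empty prefix below the current bit
    for ai, bi in zip(a, b):
        g, p = ai & bi, ai ^ bi
        carry = g_acc | (p_acc & c0)  # carry into this bit
        sums.append(p ^ carry)
        g_acc, p_acc = g | (p & g_acc), p & p_acc
    return sums


M = 4  # block width used for the exhaustive check
all_equal = all(
    carry_select_sums(a, b, c0) == carry_in_first_sums(a, b, c0)
    for a in product((0, 1), repeat=M)
    for b in product((0, 1), repeat=M)
    for c0 in (0, 1)
)
```

Both functions produce identical sum bits for every input, so the difference between the two structures lies only in when $c_0$ enters the logic, which is what determines the delay from $c_0$ to $s_i$.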
To reduce the delay from $c_0$ to $s_i$, a carry-select block in the resulting structure can be transformed to the structure shown in Fig. 8(b). The original function of the carry-select block is
[...]
Note that $G_{[i-1,1]}$ and $P_{[i-1,1]}$ are mutually exclusive. So the term $(G_{[i-1,1]} + P_{[i-1,1]})$ can be replaced by $(G_{[i-1,1]} \oplus P_{[i-1,1]})$. Then function (11) changes to
[...]
which is the function of Fig. 8(b). By doing this transformation, the resulting structure will have not only the same topology as a carry-select adder but also the same delay property.
After removing all $(G, P)$ adders without fanout, the prefix structure shrinks to the minimal size with optimal delays at all internal $(G, P)$s. However, keeping optimal delays at all internal $(G, P)$s is not necessary for global delay optimality. Only the $(G, P)$s on the critical path have to keep their optimal delays. Hence there is potential to reduce the size further by relaxing some internal $(G, P)$s that are not on the critical path.
To relax a $(G, P)$, its timing requirement should be calculated first. A relaxed timing cannot exceed this limit, otherwise the global delay optimality will be broken. Required times are calculated backward from outputs to inputs. Required times of outputs are given. One $(G, P)$'s required time will limit the required times of its parents. Fig. 12(a) shows an [...] not be satisfied. However, $(G, P)_{[4,2]}$ can be generated from the pair $\{(G, P)_{[4,3]}, g_2 p_2\}$ or the pair $\{g_4 p_4, (G, P)_{[3,2]}\}$. Either way the required time can be satisfied. Among these possible combinations, we choose the one whose right parent has the maximal length. So the pair $\{g_4 p_4, (G, P)_{[3,2]}\}$ is selected to generate $(G, P)_{[4,2]}$. This strategy makes the structure approach the ripple structure, which has the minimal size.
When a netlist changes, the required times of its parents will change. Fortunately, the prefix structure is a tree structure, so the interconnection and the required times can be decided and calculated backward, level by level. Fig. 12(b) shows the result of backward reduction from the initial structure in Fig. 12(a). The number of $(G, P)$ adders is reduced by 1. The backward reduction has the same time complexity as the forward dynamic programming procedure. Algorithm 3 gives the pseudo-code of the backward reduction procedure.
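The backward calculation of required times can be sketched as follows. This covers only the required-time propagation step, not the full backward reduction of Algorithm 3, and it assumes the simple timing model in which a $(G, P)$ adder costs 2 units. The `parents` encoding, the function name `required_times`, and the ripple-style example with its output required times are hypothetical.

```python
from typing import Dict, Optional, Tuple

GP_ADDER_DELAY = 2  # (G, P) adder delay in the simple timing model

Node = Tuple[int, int]  # (G, P)[i, j] is written as the tuple (i, j)
Parents = Dict[Node, Optional[Tuple[Node, Node]]]  # None marks a gp generator


def required_times(parents: Parents, output_required: Dict[Node, int]) -> Dict[Node, int]:
    """Propagate required times from the outputs back toward the inputs."""
    required = dict(output_required)
    # Longer intervals first: every consumer of a node is visited before the
    # node itself, so its required time is final when it is propagated.
    for node in sorted(parents, key=lambda ij: ij[0] - ij[1], reverse=True):
        pair = parents[node]
        if pair is None or node not in required:
            continue
        limit = required[node] - GP_ADDER_DELAY
        for parent in pair:
            # A parent must satisfy the tightest limit among its consumers.
            required[parent] = min(required.get(parent, limit), limit)
    return required


# Hypothetical 4-bit ripple-style structure with given output required times.
parents: Parents = {
    (1, 1): None, (2, 2): None, (3, 3): None, (4, 4): None,
    (2, 1): ((2, 2), (1, 1)),
    (3, 1): ((3, 3), (2, 1)),
    (4, 1): ((4, 4), (3, 1)),
}
required = required_times(parents, {(4, 1): 9, (3, 1): 8, (2, 1): 4, (1, 1): 3})
```

A $(G, P)$ whose arrival time is below its required time has slack, and such nodes are the candidates for relaxation toward a smaller, more ripple-like structure.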
VI. EXPERIMENTAL RESULTS
Different input delay profiles and different data widths are used. The shapes of the input delay profiles include monotonically increasing, monotonically decreasing, and convex curves. As [...] respectively. Case 1, case 2, and case 3 use monotonically increasing, monotonically decreasing, and convex input delay profiles.
Since the arrival times of the inputs are not uniform, the delay between the last input and the last output is used to compare these results. Table 1 shows the comparison of the delay and the number of $(G, P)$ adders. The number of $(G, P)$ adders relates to the area of a prefix structure. Two different reduction procedures are used in the algorithmic approach, the backward [...] Table 1 indicates that, for all three kinds of input delay profiles, the prefix adder generated by the proposed approach is faster than both the Kogge-Stone and the Brent-Kung adders.
In particular, for monotonically decreasing profiles, the results of the proposed approach improve speed by at least 40% in comparison with Brent-Kung adders. However, a drawback is the increased area. The penalty of keeping the minimal delay is a small area increase for a convex profile and a large area increase for a monotonically decreasing profile.
To check the soundness of the proposed algorithm, we tested the timing performance of the generated adders under a 0.13 µm technology, considering accurate gate delay and, further, wire delay. The gate delay is calculated by the look-up table method, while the wire delay is computed using the Elmore delay model. The delay profiles of the two inputs are given by a sum array and a carry array, which are the outputs of a partial product reduction tree. From the results, the error caused by load capacitance is not large and is almost constant across bit widths. The error introduced by wire delay is small at first, but increases quickly as the data width grows. The proposed algorithm is practical if the data width is not huge.
However, as the data width increases, the algorithm should be modified to take the influence of interconnect into account. Buffer insertion and gate sizing should also be considered in delay estimation.
VII. CONCLUSIONS
In this paper, we proposed an algorithmic approach to generate an irregular parallel-prefix adder for a specific application. The prefix structure generated by the proposed algorithm has minimal delay at all outputs, and can cover all topologies of generic adders. The time complexity and space complexity of the algorithm are $O(n^3)$ and $O(n^2)$ respectively.
In the experiments, the generated adders have the smallest output delay compared with Brent-Kung and Kogge-Stone prefix adders. This approach can be applied to generate the optimal final adder for a specific multiplier.
