Abstract-In this paper, we present a systematic method of implementing a VLSI parallel adder. First, we define a family of adders, based on a modular design. Our design uses three types of component cells, which we implement in static CMOS. We then formulate the adder design as a dynamic programming problem, optimizing with respect to area and time. The result is an area-time optimal adder in our design family. We illustrate our approach by implementing a 66-bit adder for use in a floating point processor. In addition, we indicate how to use our method for implementations in technologies and design styles other than static CMOS.
We optimize our adder design with respect to latency by solving a dynamic programming problem. This is described in Section V. In Section VI, we extend the dynamic programming formulation to optimize both latency and area. Based on the results from Section VI, we implemented a 66-bit adder for use in the SPUR (Symbolic Processing Using RISCs) floating point processor [10]-[12]. The design and its area-time tradeoff curve are presented in Section VII. In Section VIII, we show that our method is general enough to cover a wide range of technologies and circuit design styles. In the last section, we summarize our approach to adder design and indicate that our method is an efficient implementation scheme for wide data width addition.
II. MATHEMATICAL DESCRIPTION
In this section, we briefly review the mathematical formulation and terminology used in the following sections. It is well known that binary addition can be transformed into a parallel computation by introducing an associative operator $\circ$ [9]. If we define the carry generate term, $g_{IN_i}$, and the carry propagate term, $p_{IN_i}$, for each bit position $i$, then the carry $c_i$ for each bit obeys the following recurrence relation:
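A minimal Python sketch of the carry computation may help fix ideas. It assumes the usual generate/propagate definitions ($g_i = a_i b_i$, $p_i = a_i \oplus b_i$) and the concatenation form $(g_l, p_l) \circ (g_r, p_r) = (g_l + p_l g_r,\ p_l p_r)$ quoted in the timing discussion. The names `concat` and `carries` are illustrative and do not come from the paper.

```python
from functools import reduce

def concat(left, right):
    """Assumed form of the operator: (g_l, p_l) o (g_r, p_r) = (g_l + p_l*g_r, p_l*p_r)."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_l | (p_l & g_r), p_l & p_r)

def carries(a, b, n):
    """Carry out of each of the n low-order bit positions, with carry-in 0.

    Index 0 of the result is the carry out of the least significant bit.
    """
    gp = []
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        gp.append((a_i & b_i, a_i ^ b_i))  # (generate, propagate); XOR form is an assumption
    result = []
    for i in range(n):
        # Combine bit i down to bit 0, most significant operand on the left.
        g, _ = reduce(concat, reversed(gp[:i + 1]))
        result.append(g)  # the g component of the prefix is the carry
    return result
```

Because the operator is associative, the terms can be grouped in any order, which is what allows the carries to be computed by a tree of cells rather than a ripple chain. The loop above recomputes each prefix from scratch only for clarity.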
III. IMPLEMENTATION
To implement the functions defined in Section II, the three circuit blocks shown in Fig. 1 are required. The first circuit (labeled precondition circuit) gates in adder inputs $a_i$ and $b_i$ to generate the initial $p_{IN_i}$ and $g_{IN_i}$ for each bit position $i$. The computed p and g terms are then fed into the fast carry generator, which performs the operations defined in (2.1) and (2.2). It is the fast carry generator that allows accelerated carry computations, and this is the focus of our study. The third block is a sum circuit, consisting of a row of XOR gates, which combines the carry propagate bits ($p_{IN_i}$) from the first block with the carry bits ($c_i$) from the second block, according to (2.3).
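The three-block organization can be sketched functionally in Python as follows. This is only an illustration under stated assumptions: the propagate bit is taken as $a_i \oplus b_i$, the carry-in is 0, and a serial scan stands in for the parallel fast carry generator. All function names are hypothetical.

```python
def precondition(a, b, n):
    """Block 1: per-bit (generate, propagate) pairs from the adder inputs."""
    pairs = []
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        pairs.append((a_i & b_i, a_i ^ b_i))
    return pairs

def fast_carry(gp):
    """Block 2: carry out of every bit position.

    A serial scan is used here purely as a functional stand-in; the
    paper's contribution is the parallel arrangement of this block.
    """
    out = []
    g_acc = 0  # carry-in of 0
    for g, p in gp:
        g_acc = g | (p & g_acc)
        out.append(g_acc)
    return out

def sum_row(gp, carry):
    """Block 3: a row of XOR gates, s_i = p_i XOR (carry into bit i)."""
    total, c_in = 0, 0
    for i, (_, p) in enumerate(gp):
        total |= (p ^ c_in) << i
        c_in = carry[i]
    return total

def add(a, b, n):
    """Return (n-bit sum, carry out) for n-bit operands a and b."""
    gp = precondition(a, b, n)
    carry = fast_carry(gp)
    return sum_row(gp, carry), carry[-1]
```

For instance, `add(a, b, 5)` should agree with `((a + b) & 31, (a + b) >> 5)` for all 5-bit operands.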
We use three basic types of tiling cells to implement the parallel carry computation: black cells, white cells, and driver cells. The terms "black" and "white" cells come from Brent and Kung [6], and the cells are shown in Fig. 2. Note that some of the inputs to our black and white cells "pass through" the cells. Specifically, the $(g_r, p_r)$ inputs of the black cell are available as outputs. This convention greatly simplifies our wiring diagrams.
The black cell performs the associative concatenation defined in (2.2). In a static CMOS implementation, the black cell comes in two varieties, $b_a$ and $b_b$. The $b_a$ cell, shown in Fig. 3(a), gates in positive-true signals and produces complemented outputs. The $b_b$ cell, shown in Fig. 3 [...] p subcell produces a $p_{out}$ signal and the g subcell produces a
Definition: The fan-out f is the number of subcells that a signal drives.
For example, the fan-out for $p_l$ (or $\bar{p}_l$) inside a black cell is 2, as it drives both the p and g subcells. However, the fan-out for $g_l$, $g_r$, $p_r$ and their complements is 1, as they each drive only one subcell.
If we employ static CMOS implementations, which feature inverting logic, we sometimes need inverters, called "white" cells, to ensure proper signal polarity. These cells, denoted $w_a$, are depicted in Fig. 4. Also shown in Fig. 4 is a modified white cell, $w_b$, which provides a "turning corner" for signals. In the case of long wire interconnects or large fan-outs, we use a specially ratioed inverting driver, either in a single stage or in cascaded stages. In summary, the black cell is the "computation" cell. White cells are used for electrical requirements and driver cells are for performance improvements.
[...] design, we can estimate the cell resistances and capacitances, and compute the associated signal delay. Furthermore, we explicitly model drivers, which are an integral part of our circuit layout.
In implementing the black and white cells, we use minimum-length transistors for the pull-down network of each subcell. PMOS pull-up transistors are ratioed so that the maximum (over all possible input conditions) pull-up and pull-down channel resistances are equal (examples of transistor ratios are shown in Fig. 3). We define this resistance as $R_t$. For simplicity, we also assume that the capacitance $C_i$ and resistance $R_i$ of the horizontal interconnect between neighboring cells are the same as the R and C values of the vertical interconnect. This condition depends on the specific implementation and is discussed further in Section VI.
In a static CMOS design, a PMOS pull-up transistor paired with an NMOS pull-down transistor constitutes a basic inverting unit. The input signal drives the gates of both the pull-up and pull-down transistors. Let $C_g$ be the total gate capacitance of such an inverting unit. We can then approximate the generation time of its output signal, $t_{out}$, as a function of its interconnect and channel resistances ($R_i$ and $R_t$, respectively), its load capacitance, its fan-out $f$, and its input ready time, $t_{in}$:

$$t_{out} = t_{in} + (R_t + R_i f)(C_i + C_g) f. \tag{4.1}$$

In the case when metal is used for interconnects and $R_t \gg R_i f$, then $t_{out}$ becomes
Let $\tau$ be a normalized time constant, defined as
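The delay model can be sketched in Python, reading (4.1) as $t_{out} = t_{in} + (R_t + R_i f)(C_i + C_g) f$. The normalized form below assumes $\tau = R_t (C_i + C_g)$, which is what (4.1) reduces to when the interconnect resistance term is dropped; that definition of $\tau$ and both function names are assumptions made for illustration.

```python
def t_out_full(t_in, r_t, r_i, c_i, c_g, f):
    """Output ready time per (4.1), in whatever consistent units the inputs use."""
    return t_in + (r_t + r_i * f) * (c_i + c_g) * f

def t_out_normalized(t_in, f, tau):
    """Simplified model when r_t dominates r_i * f.

    Assumes tau = r_t * (c_i + c_g), so each unit of fan-out costs one tau.
    """
    return t_in + f * tau
```

With `r_i = 0` the two functions agree exactly when `tau = r_t * (c_i + c_g)`. For example, `r_t = 10`, `c_i = 1`, `c_g = 2`, and `f = 3` add a delay of 90 in both, in arbitrary units.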
Notice that in (4.1), the fan-out $f$ of a subcell is variable, depending on the type (p or g) of the subcell, and on the type and number of its succeeding cells. This can be illustrated using the example of a 5-bit adder shown in Fig. 5. In this circuit, each cell is identified by a pair of height and bit coordinates. Black cell (3,5) computes $(g_l, p_l) \circ (g_r, p_r) = (g_l + p_l g_r,\ p_l p_r)$. The "left" operand of Cell (3,5), namely $(g_l, p_l)$, is supplied by Cell (2,5), which precedes Cell (3,5) in a vertical connection. In other words, $p_{out}$ and $g_{out}$ of Cell (2,5) become black cell (3,5)'s $p_l$ and $g_l$, respectively. The fan-out for $p_{out}$ of Cell (2,5) is 2, as it drives both the p and g subcells of Cell (3,5), and the fan-out is 1 for $g_{out}$ of Cell (2,5), as it drives only the g term of Cell (3,5).

Given the above analysis, we can evaluate the generation time for each circuit signal as the sum of its input ready time and delay factor. Define $t_{gout}$ as the time when signal $g_{out}$ is ready, $t_{gl}$ as $t_{gout}$ of the cell producing $g_l$, and $t_{gr}$ as $t_{gout}$ of the cell producing $g_r$. Similar definitions apply to $t_{pl}$ and $t_{pr}$. We can then formulate the input ready time for the g subcircuit of a black cell, $t_{gin}$, as $t_{gin} = \max\{t_{gl}, t_{pl}, t_{gr}\}$, and $t_{pin}$ for the p subcircuit of a black cell as $t_{pin} = \max\{t_{pl}, t_{pr}\}$. If we define $f_g$ (resp., $f_p$) to be the fan-out of the subcell under analysis, then
A similar formulation, (4.5), can be written for signal $p_{out}$. Equations (4.4) and (4.5) parameterize the signal delay with respect to fan-out, which is determined by the interconnection of modular cells.
V. FAST CARRY GENERATOR
Having evaluated the timing behavior of the basic cells, we now face the problem of how to place and interconnect them into the fastest circuit. Consider the design space of a family of circuits, R(n), shown in Fig. 7. In this figure, black boxes represent the concatenation operation, and are either $b_a$ or $b_b$ cells as appropriate.
The construction of R(n) composes blocks of data sizes smaller than n, that is, m and n - m as shown. R(m) and R(n - m) are, in turn, composed of blocks of even smaller sizes. In other words, we have a recursive construction of R(n). The basis of the construction is R(1) and R(2), which are shown in Fig. 8. The R(n) circuit has a large [...] tances. Since the critical paths converge at the leftmost top cell, we focus our analysis on this cell. We can thus compare the adders in R(n) by comparing the times at which their leftmost (p, g) outputs are produced. This "principle of optimality" validates a dynamic programming approach to choosing the best place m at which to decompose an n-bit adder into subcircuits R(n - m) and R(m). Our design problem, then, [...] (5.3)

Lemma 1: The optimal circuit for adding bits j + r through j is identical to the optimal circuit for adding bits r + 1 through 1. Thus, $t_{pin}(j + r, j) = t_{pin}(r + 1, 1)$ and $t_{gin}(j + r, j) = t_{gin}(r + 1, 1)$.

Proof sketch: Equations (5.1), (5.2), and (5.3) are not sensitive to the absolute magnitude of i and j. Only the difference [...]

Note that in (5.1), the optimal m minimizing $t_{gin}(i, j)$ may not be the same as the m minimizing $t_{pin}(i, j)$. We would then have to keep two lists of optimal values: one for p and one for g. However, we prove, in the following lemma, that both p and g signals synchronize upon the p signal. The optimal splitting m value for the p signal is the same as that for the g signal. As a result, we have a one-dimensional dynamic programming problem, optimizing with respect to the generation of the p signal. [...] hence the lemma.

We can use dynamic programming to calculate the optimal fast carry generator configurations for n up to any desired data width. The results are presented in Table I where, for ease of discussion, the data width is limited to 32.

WEI AND THOMPSON: AREA-TIME OPTIMAL ADDER DESIGN
To construct an optimal n-bit fast carry generator from this table, the left block should be "left" bits wide and the right block should be "right" bits wide. The column labeled "driver stages" indicates the number of stages in the driver connected to the output of the most significant bit of the right block. The "height" entries indicate the number of rows of modular cells in the optimal n-bit block. The generation time of the most significant bit for the n-bit structure is shown in the column labeled $t_{pin}$ ($g_{in}$).
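The one-dimensional dynamic program can be sketched in Python under a deliberately simplified cost model. This is not the recurrence (5.1)-(5.3). It assumes that an n-bit block is split into a left block of width n - m and a right block of width m, that the right block's most significant output pays an extra load delay for driving the n - m cells of the left block, and that the final black cell adds one cell delay. The function name, the `load` parameter, and the default load model are all hypothetical.

```python
def optimal_splits(n_max, load=lambda fanout: float(fanout - 1), cell_delay=1.0):
    """Toy width-indexed dynamic program for choosing split points.

    Returns (t, split): t[w] is the ready time of the most significant
    output of the best w-bit block, and split[w] is the width m of its
    right block (0 for the 1-bit base case). By the shift-invariance
    lemma, a table indexed by width alone suffices.
    """
    t = [0.0] * (n_max + 1)
    split = [0] * (n_max + 1)
    for w in range(2, n_max + 1):
        best = None
        for m in range(1, w):
            left_ready = t[w - m]
            right_ready = t[m] + load(w - m)  # assumed fan-out penalty
            candidate = max(left_ready, right_ready) + cell_delay
            if best is None or candidate < best:
                best, split[w] = candidate, m
        t[w] = best
    return t, split
```

Under this toy model, setting the load penalty to zero makes the optimal delay grow logarithmically with the width, whereas the default unbuffered linear penalty makes it grow linearly. That contrast is one way to see why the paper's model treats drivers explicitly.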
The time unit $\tau$ is a normalized RC time constant. Fig. 10 illustrates the optimal 32-bit fast carry generator.
VI. AREA AND LATENCY TRADEOFF
Using the formulation in Section V, we can find the fastest fast carry generator for any data width between 1 and n. We also know that the ripple circuit has the smallest area among all blocks. Between these two extremes of speed and area, we can generate a "hybrid" structure called the H circuit. The construction of H structures is identical to that of R circuits shown in Fig. 7. The difference lies in the component blocks. The component blocks for the H circuits can be either R or ripple blocks. The ripple blocks are generated as the base case of minimum area (height = 2) and are shown in Fig. 1[...].
We observe from the above algorithms that we use BlackCellTime, DelayofSelectedDriver, and RippleCellTime (used for the basis) to evaluate the circuit delay. These timing parameters come from our analysis in Section IV.
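A rough Python sketch of a height-constrained version of the optimization is given below. It is a toy model rather than the paper's algorithm: it assumes a ripple block of width n has height 2 and delay proportional to n - 1, that composing two sub-blocks adds one row and one black-cell delay, and that height serves as a proxy for area. The parameter names echo the timing parameters named in the text, but their signatures here are hypothetical.

```python
import math

def h_table(n_max, h_max, black_cell_time=1.0, ripple_cell_time=1.0,
            delay_of_selected_driver=lambda fanout: 0.0):
    """t[n][h]: best delay of an n-bit block using at most h rows (toy model)."""
    t = [[math.inf] * (h_max + 1) for _ in range(n_max + 1)]
    for h in range(1, h_max + 1):
        t[1][h] = 0.0  # a single bit needs no combining
    for n in range(2, n_max + 1):
        for h in range(2, h_max + 1):
            # Base case: a ripple block of minimum area (height 2).
            best = ripple_cell_time * (n - 1)
            # Otherwise split into left (n - m) and right (m) blocks,
            # each at most h - 1 rows, plus one row to combine them.
            for m in range(1, n):
                right = t[m][h - 1] + delay_of_selected_driver(n - m)
                best = min(best, max(t[n - m][h - 1], right) + black_cell_time)
            t[n][h] = best
    return t
```

With the default parameters, an 8-bit block has delay 7 at height 2 (pure ripple), 4 at height 3, and 3 at height 4, which illustrates the kind of area-latency tradeoff curve the H circuits trace out.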
VII. LAYOUTS
Since our design algorithms are optimization driven, "what if" questions can be answered easily. For example, so far we have assumed that there is no limit to the number of driver stages we can use. In practice, we want to use as few driver stages as possible in order to minimize layout effort. We can make the number of stages an input to the algorithm and evaluate the performance of the corresponding design. In addition, the driver ratio between successive stages is usually an integer, two or three, instead of the "optimal" fractional ratio discussed in Section IV. If this is the case, we can use (4.1) instead of (4.2), where f becomes the driver ratio, to compute the delay through the driver. Using three-stage drivers with a ratio of two, we have calculated the optimal H(n) circuits for n from 1 to 66. These circuits are indexed by both data width (66) and height. For ease of discussion, we tabulate only the H(66) results. See Table II and Fig. 12.
Note that the fastest 66-bit fast carry generator has height 11. Decreasing the height by 4 saves 36% in area, but increases adder delay from $24.375\tau$ to $38\tau$, a 56% increase. Also note that the possibility of using small ripple blocks makes our H(n) class a superset of the R(n) class.
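As a quick check of the quoted delay penalty, reading the two delays as $24.375\tau$ and $38\tau$, the relative increase works out as follows:

$$\frac{38\tau - 24.375\tau}{24.375\tau} = \frac{13.625}{24.375} \approx 0.559,$$

which rounds to the 56% stated in the text.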
The 66-bit H circuit of height 10, H(66,10) [...]
Our discussion has centered on a static CMOS implementation. We can extend the proposed algorithm to NMOS and bipolar technologies. In addition, the algorithm is applicable to dynamic circuit designs such as domino logic.
It is rather straightforward to adapt the algorithm for an NMOS design. The timing analysis for static NMOS is similar to that of CMOS. The difference is the magnitude of the parameters in the timing models. For example, in a static NMOS circuit, the input signals drive the gates of the NMOS pull-down network. This is in contrast with static CMOS, whose inputs drive both pull-up and pull-down networks. As a result, the $C_g$ term of (4.1) is different for NMOS. Nevertheless, since the optimization is based on a normalized time constant $\tau$, changes in $\tau$ will not affect the solution determining the interconnection of modular cells.
Signal delay of bipolar TTL parts is insensitive to fan-out and interconnect capacitances, thus eliminating the need for driver circuits. This insensitivity is in contrast with static CMOS/NMOS circuits, whose delay is linearly proportional to fan-out and interconnect capacitances, as in (4.1). CMOS/NMOS and TTL circuits also differ in the granularity of the logic functions they implement. In static CMOS/NMOS, a single network can perform complex logic functions. For example, the G subcell of a black cell uses a single pull-down and a corresponding pull-up network to implement an AND-OR function (Fig. 3). If the G subcell has a unit fan-out, then based on (4.1) there is a single time-constant ($\tau$) delay through the subcell for all the inputs: $p_l$, $g_l$, and $g_r$. On the other hand, we use two TTL gates to implement the AND-OR function of the G subcell: one for AND, one for OR. Consequently, $p_l$ and $g_r$ have a two-gate delay through the G subcell, and $g_l$ has a one-gate delay.
Another difference between CMOS/NMOS and TTL circuits is that the latter have a "hard" fan-out limit whose violation results in a failed circuit.
Using the proposed algorithm with an objective function minimizing the gate delay through the critical path and timing behavior unique to TTL parts, we have calculated optimal TTL fast carry generator configurations for n from 1 to 32.
A fan-out limit of 10 is used. See Table III.
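The TTL timing behavior described above can be sketched in Python. The sketch assumes a uniform gate delay, with $p_l$ and $g_r$ passing through an AND gate and then an OR gate, and $g_l$ passing through the OR gate only. The function names are hypothetical; the default limit of 10 is the fan-out limit quoted in the text.

```python
def ttl_g_subcell_ready(t_gl, t_pl, t_gr, gate_delay=1.0):
    """Ready time of g_out = g_l + p_l*g_r built from one AND and one OR gate."""
    t_and = max(t_pl, t_gr) + gate_delay    # p_l AND g_r
    return max(t_gl, t_and) + gate_delay    # OR with g_l

def check_fanout(fanout, limit=10):
    """TTL has a hard fan-out limit; exceeding it is treated as a design error."""
    if fanout > limit:
        raise ValueError(f"fan-out {fanout} exceeds TTL limit {limit}")
    return fanout
```

With all inputs ready at time 0, the output is ready after two gate delays. If only $g_l$ arrives late, at time 5, the output follows one gate delay later, at time 6.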
The design algorithm is also applicable to dynamic circuit implementations such as NORA (No-Race) and domino logic, as long as their timing behavior satisfies the linearity of the recursion step described by (5.1). There may be differences in the magnitude of delay constants and load functions. For example, consider the domino circuit, which features an NMOS pull-down network, an NMOS evaluation device, and a PMOS precharging device. The output of this domino stage is followed by a static inverter before it drives another domino stage. Because of this required inverter, there is an extra delay of $1\tau$ between inputs of successive domino stages. Consequently, the delay constants $1\tau$ and $2\tau$ of (5.1) are replaced by $2\tau$ and $3\tau$, respectively. As for the load function, consider the following. The ratio of the required inverter can be modified to drive a large load. The modified inverter thus becomes a built-in first-stage driver. Any additional driver comes in the form of even stages to preserve the noninverting property of domino logic circuits. These design considerations can be incorporated into the load function, either in the form of $f(i, m, j)\tau$ in (5.1) or DelayofSelectedDriver() of the design routine in Section VI.
IX. DISCUSSIONS AND CONCLUSIONS
We have formulated the adder design as a dynamic programming problem with latency and/or area as the performance variables to be optimized. We illustrated our design algorithm with an example of CMOS VLSI implementations. The algorithm incorporates factors crucial to VLSI design: modularity, routing, speed, area, and driver design. It takes as inputs timing parameters associated with component modules, and generates an area-time optimal adder as the output. Not limited to static CMOS implementation, our formulation has been shown to be general enough to cover a wide range of implementation technologies: MOS and TTL, as well as design styles: dynamic and static circuits.
A natural extension to our formulation is to include increased cell "fan-in" and to use a more elaborate timing model. The first issue arises from the fact that the black cells we have used feature a cell "fan-in" of two, as they implement the dyadic associative operator $\circ$. An increase in their "fan-in" may result in a reduced number of logic levels and a corresponding improvement in circuit latency. Nevertheless, the advantage of increased "fan-in" has to be addressed in the context of a specific implementation technology and circuit style.
The second issue relates to the timing model we have used to introduce our dynamic programming formulation. There we have assumed negligible parasitic capacitances and used a simplistic $\tau$ model. We can use a more sophisticated timing model as long as the adder latency is monotonically nonincreasing with respect to the data width, thus satisfying the "principle of optimality" of a dynamic programming formulation.
Our structured approach lends itself to design automation. The solution to our algorithm, specifying an arrangement of modular tiles (Fig. 10), can feed an automatic VLSI layout tool such as MQUILT [16]. MQUILT, along with user-supplied modular tiles, generates the adder layout in MAGIC [17] format, and this is how we obtained our layout discussed in Section VII. We have proposed a realistic and practical approach to the design of optimal adders. The implementation result shows that our approach is extremely competitive with alternative implementation schemes such as variable-block carry-skip adders [18], [19]. This is especially true for large data width additions.
