This paper describes a fully static Complementary Metal-Oxide Semiconductor (CMOS) implementation of a Ling type adder. The implementation described herein saves up to one gate delay and always reduces the number of serial transistors in the worst-case (critical) path over the conventional carry look-ahead (CLA) approach with a negligible increase in hardware.
Introduction
For high-speed addition, Ling type adders [1, 2] have been demonstrated to have advantages over conventional CLA adders in emitter-coupled logic (ECL) [3]. Ling's approach results in a drastic load reduction in the input stage circuitry, thereby allowing direct generation of the group-generate from input operands. Because this approach takes advantage of the dot-or capability of ECL, it is not as suitable for CMOS adders. A straightforward application of Ling's scheme to CMOS adders can lead to an increase in hardware or delay time, or both. Also, in fully static CMOS (as opposed to ECL), designers have to consider both the N-channel and the P-channel transistor networks. Optimizing the N-channel transistor network alone is not sufficient.
In this paper, we present an implementation of a fully static CMOS adder using a modified Ling scheme. Our implementation saves up to one gate delay and always reduces the number of serial transistors in the critical path over the conventional CLA approach with a negligible increase in hardware. In CMOS, the speed of a gate is primarily limited by the number of serial transistors connecting the output node to the power or ground node; reducing the number of serial transistors in the critical path therefore speeds up the adder.
In section 2, we show, by way of a design, how Ling's approach can be modified for CMOS adders. In section 3, we compare the present adder with other adders reported in the literature. Section 4 contains a summary. In this paper, AND is denoted by juxtaposition, OR by $\lor$, EXCLUSIVE-OR by $\oplus$, negation by an overbar, and a product of $p_i$ terms by a correspondingly subscripted $p$. Index $i \in (0, 32)$ is used for bits (carry-in is treated as $g_0$ and actual sums range from 1 to 32), index $j \in (0, 10)$ for groups, and index $k \in (0, 2)$ for blocks, exclusively and respectively.
The Adder
The adder is divided into four blocks of 9, 9, 9, and 6 bits, respectively. Since carry-in is treated as $g_0$, block 0 actually has only eight bits. Each block is subdivided into three 3-bit groups, except for the last 6-bit block, which has only two 3-bit groups. Within each group and each block, the local sum logic uses the conditional-sum algorithm. Figure 1 shows the structure and numbering convention of the adder and Figure 2 depicts the global carry propagation process. In [2] and [3], the adders have 4-bit groups; owing to the limited fan-in capability of fully static CMOS circuits, 3-bit groups are used in the present adder (Figure 1). In the definitions, $a_i$ and $b_i$ are the $i$th bits of the input operands; $s_i$ and $S_i$ are the $i$th bits of the local and the final sum, respectively.
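The block and group numbering described above can be sketched in a few lines of Python. This is only an illustration of the partition (bit position 0 holds the carry-in, positions 1 to 32 hold the operand bits); the function name `partition` and the data layout are assumptions made for the example, not part of the design.

```python
# Sketch of the adder's partition: 33 bit positions (0..32), where
# position 0 carries the carry-in (treated as g_0).
# `partition` and its return layout are hypothetical, for illustration only.
BLOCK_SIZES = [9, 9, 9, 6]
GROUP_SIZE = 3


def partition(block_sizes=BLOCK_SIZES, group_size=GROUP_SIZE):
    """Return [(block index, [(group index, [bit positions]), ...]), ...]."""
    blocks = []
    start = 0
    group_index = 0
    for k, size in enumerate(block_sizes):
        groups = []
        for lo in range(start, start + size, group_size):
            groups.append((group_index, list(range(lo, lo + group_size))))
            group_index += 1
        blocks.append((k, groups))
        start += size
    return blocks


if __name__ == "__main__":
    for k, groups in partition():
        print("block", k, groups)
```

With this numbering, group 8 covers bits 24 to 26 and sits in block 2, which is the group used in the discussion that follows.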
Global-Carry
For ease of discussion, we illustrate our approach on one group in a block (Figure 1). Using the identity $g_i = p_i g_i$ and extracting $p_{26}$, we rewrite equation (1) as equation (2). Equation (3) contains four terms and a total of ten literals, with the largest term having three literals. This can be implemented in CMOS in one complex gate [5]. The conventional group-generate equation (1), when expanded, contains seven terms and a total of 24 literals, with the largest term containing four literals. In CMOS, the number of literals in a term of a logic equation corresponds to the number of N-channel serial transistors; equation (2) is therefore preferable to equation (1) for direct generation of the group-generate from input operands.
Generating the group-generate directly from the input operands saves one gate delay in the critical path because the $g_i$ and $p_i$ terms need not be implemented.
Readers experienced in CMOS circuit design may have realized that the implementation of equation (3) in fully static CMOS has four P-channel transistors in series, severely limiting its usefulness (Figure 3). A more careful examination of equation (2) suggests an alternative, as the relationship between $p_i$ and $g_i$ can again be put to use ($\bar{p}_i \bar{g}_i = \bar{p}_i$). The P-channel transistor network implements the dual of equation (2), which is equation (5). Using the definition of $P_j^*$, we rewrite equation (5) as equation (6).
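The term and literal counts quoted above can be checked with a short script. The sketch below assumes the standard forms for group 8 (bits 24 to 26): the conventional group-generate $g_{26} \lor p_{26} g_{25} \lor p_{26} p_{25} g_{24}$ and the Ling-style pseudo-generate $g_{26} \lor g_{25} \lor p_{25} g_{24}$, with $g_i = a_i b_i$ and $p_i = a_i \lor b_i$. These expressions and all helper names are assumptions chosen to be consistent with the stated counts, not a transcription of equations (1) to (3).

```python
from itertools import product

# A sum-of-products is a list of terms; each term is a frozenset of literals.


def lit(name):
    return [frozenset([name])]


def OR(*sops):
    out = []
    for sop in sops:
        for term in sop:
            if term not in out:
                out.append(term)
    return out


def AND(*sops):
    out = [frozenset()]
    for sop in sops:
        out = [t | u for t in out for u in sop]
    return out


def stats(sop):
    """Return (number of terms, total literals, literals in the largest term)."""
    return len(sop), sum(len(t) for t in sop), max(len(t) for t in sop)


BITS = (24, 25, 26)
g_ = {i: AND(lit(f"a{i}"), lit(f"b{i}")) for i in BITS}
p_ = {i: OR(lit(f"a{i}"), lit(f"b{i}")) for i in BITS}

# Assumed conventional group-generate, expanded to input literals.
G_conv = OR(g_[26], AND(p_[26], g_[25]), AND(p_[26], p_[25], g_[24]))
# Assumed pseudo-generate after extracting p26.
G_star = OR(g_[26], g_[25], AND(p_[25], g_[24]))


def evaluate(sop, env):
    return int(any(all(env[x] for x in term) for term in sop))


def check_factoring():
    """Exhaustively confirm G_conv == p26 AND G_star over all 64 inputs."""
    names = [f"{x}{i}" for i in BITS for x in "ab"]
    for values in product((0, 1), repeat=len(names)):
        env = dict(zip(names, values))
        p26 = env["a26"] | env["b26"]
        if evaluate(G_conv, env) != (p26 & evaluate(G_star, env)):
            return False
    return True


if __name__ == "__main__":
    print("conventional:", stats(G_conv))   # (7, 24, 4)
    print("pseudo-generate:", stats(G_star))  # (4, 10, 3)
    print("factoring holds:", check_factoring())
```

Under these assumptions the expanded conventional form has seven terms and 24 literals, while the pseudo-generate has four terms and ten literals, matching the counts in the text.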
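The effect of the identity $\bar{p}_i \bar{g}_i = \bar{p}_i$ on the pull-up network can be illustrated with a small model in which AND corresponds to series transistors and OR to parallel branches. The two pull-up expressions below are assumed forms for group 8, derived here with De Morgan's law from the pseudo-generate $g_{26} \lor g_{25} \lor p_{25} g_{24}$; they may differ in detail from equations (5) and (6), and all names are hypothetical.

```python
from itertools import product

# Expression trees over complemented inputs (a PMOS device conducts on a low input):
# ("lit", name), ("and", [children]) for series, ("or", [children]) for parallel.


def L(name):
    return ("lit", name)


def AND(*children):
    return ("and", list(children))


def OR(*children):
    return ("or", list(children))


def series_depth(e):
    """Longest chain of series transistors through the network."""
    if e[0] == "lit":
        return 1
    depths = [series_depth(c) for c in e[1]]
    return sum(depths) if e[0] == "and" else max(depths)


def conducts(e, env):
    if e[0] == "lit":
        return bool(env[e[1]])
    values = [conducts(c, env) for c in e[1]]
    return all(values) if e[0] == "and" else any(values)


na24, nb24 = L("na24"), L("nb24")
na25, nb25 = L("na25"), L("nb25")
na26, nb26 = L("na26"), L("nb26")

# Direct dual of the four-term sum-of-products (assumed form).
DUAL_DIRECT = AND(OR(na26, nb26), OR(na25, nb25),
                  OR(na25, na24, nb24), OR(nb25, na24, nb24))
# After applying not(p)not(g) = not(p): not(g26) (not(p25) + not(g25) not(g24)).
DUAL_FACTORED = AND(OR(na26, nb26),
                    OR(AND(na25, nb25), AND(OR(na25, nb25), OR(na24, nb24))))


def matches_complement(e):
    """Check that the pull-up conducts exactly when the pseudo-generate is 0."""
    for a24, b24, a25, b25, a26, b26 in product((0, 1), repeat=6):
        g_star = (a26 & b26) | (a25 & b25) | ((a25 | b25) & a24 & b24)
        env = {"na24": 1 - a24, "nb24": 1 - b24, "na25": 1 - a25,
               "nb25": 1 - b25, "na26": 1 - a26, "nb26": 1 - b26}
        if conducts(e, env) != (g_star == 0):
            return False
    return True


if __name__ == "__main__":
    print(series_depth(DUAL_DIRECT), matches_complement(DUAL_DIRECT))
    print(series_depth(DUAL_FACTORED), matches_complement(DUAL_FACTORED))
```

In this model the direct dual needs four devices in series, while the factored form needs three, which is in line with the four-transistor stack noted above and the three-transistor limit stated later for the complex gates.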
Hence, the use of $P_j^*$ affords a more efficient implementation of the block-generate; equation (6) is easier to implement than equation (5) because $p_{26}$ in equation (6) propagates further down into the final-carry equation:
$C_2 = p_{26}(Gb_2^* \lor Gb_1^* P_6^* P_7^* P_8^* \lor Gb_0^* P_3^* P_4^* P_5^* P_6^* P_7^* P_8^*) = p_{26}(Gb_2^* \lor Gb_1^* Pb_2^* \lor Gb_0^* Pb_1^* Pb_2^*)$ (7)
where $C_2^* = Gb_2^* \lor Gb_1^* Pb_2^* \lor Gb_0^* Pb_1^* Pb_2^*$. (8)
The definitions of $Gb_k^*$ and $Pb_k^*$ in equations (7), (9), (10), and (11) follow the conventional definitions. Ling's scheme calls for either the implementation of $C_k$ from $Pb_k^*$ and $Gb_k^*$ (equations (7), (9), (10), and (11)) or the modification of the local sum logic to account for the fact that $C_k^*$ is propagated [a]. Neither option is attractive. The former fails to reduce the number of serial transistors in the critical path and the latter adds complexity to the local sum logic, increasing hardware and delay time.
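The relation between a true block carry and its pseudo-carry, $C_k = p\,C_k^*$ with $p$ taken at the top bit of the block, can be checked numerically. The sketch below assumes the usual Ling-style definitions for a 3-bit group with lowest bit `lo`: $G_j^* = g_{lo+2} \lor g_{lo+1} \lor p_{lo+1} g_{lo}$ and a pseudo-propagate shifted by one bit, $P_j^* = p_{lo+1} p_{lo} p_{lo-1}$. These definitions, and every function name, are assumptions made for the example.

```python
import random


def operand_bits(a, b, cin):
    """Bit position 0 holds the carry-in (a_0 = b_0 = cin, so g_0 = cin)."""
    A = [cin] + [(a >> i) & 1 for i in range(32)]
    B = [cin] + [(b >> i) & 1 for i in range(32)]
    return A, B


def true_carries(A, B):
    """Reference ripple carries: c_i = g_i OR (p_i AND c_{i-1})."""
    c, out = 0, []
    for ai, bi in zip(A, B):
        c = (ai & bi) | ((ai | bi) & c)
        out.append(c)
    return out


def group_terms(A, B, j):
    lo = 3 * j

    def g(i):
        return A[i] & B[i]

    def p(i):
        # p(-1) only appears in P_0^*, which is never used below.
        return (A[i] | B[i]) if i >= 0 else 0

    G_star = g(lo + 2) | g(lo + 1) | (p(lo + 1) & g(lo))
    P_star = p(lo + 1) & p(lo) & p(lo - 1)
    return G_star, P_star


def block_terms(A, B, k):
    """Block k (0..2) is made of groups 3k, 3k+1, 3k+2."""
    (G0, P0), (G1, P1), (G2, P2) = (group_terms(A, B, 3 * k + m) for m in range(3))
    Gb = G2 | (P2 & G1) | (P2 & P1 & G0)
    Pb = P2 & P1 & P0
    return Gb, Pb


def pseudo_carries(A, B):
    (Gb0, _), (Gb1, Pb1), (Gb2, Pb2) = (block_terms(A, B, k) for k in range(3))
    C0 = Gb0
    C1 = Gb1 | (Pb1 & Gb0)
    C2 = Gb2 | (Pb2 & Gb1) | (Pb2 & Pb1 & Gb0)
    return [C0, C1, C2]


def check(trials=2000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, cin = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(1)
        A, B = operand_bits(a, b, cin)
        c = true_carries(A, B)
        C_star = pseudo_carries(A, B)
        for k, top in enumerate((8, 17, 26)):
            if c[top] != ((A[top] | B[top]) & C_star[k]):
                return False
    return True


if __name__ == "__main__":
    print("C_k == p_top * C_k^* for k = 0, 1, 2:", check())
```

The check compares against a plain ripple-carry reference, so it only confirms the logic identity; it says nothing about gate delays or transistor counts.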
In the present implementation, only the $C_k^*$, $Pb_k^*$, $Gb_k^*$, $P_j^*$, and $G_j^*$ terms are implemented in the carry look-ahead circuitry, without modifying the local sum logic. The $p$ terms in equations (7), (9), (10), and (11) are implemented in the local group-carry equations, both of which are non-critical paths. In the following section, we show that this is indeed possible and in fact desirable because of the reuse of the $P_j^*$ and $G_j^*$ terms. The ability to reuse the $P_j^*$ and $G_j^*$ terms is one of the salient features of the present adder.
Local Group-Carry
We can prove for the general case that the $p$ terms in the final-carry equations (7), (9), (10), and (11) can be implemented in the local group-carry equations for all blocks in the adder. For ease of discussion, however, we show that this is possible for block 2. Equations for other blocks in the adder can be derived in a similar fashion.
The final-sum equation for bit 24 in group 8 is (see Figure 1)
$S_{24} = s_{24} \oplus [p_{23}(G_7^* \lor P_7^* G_6^* \lor P_7^* P_6^* C_1^*)]$
Hence, only $C_1^*$ in equation (10) needs to be propagated globally to block 2; $p_{17}$ can be accounted for locally in equations (14) and (15). Both equations (14) and (15) can be implemented in two complex gate delays since $P_j^*$ and $G_j^*$ are available in one complex gate delay. In the present implementation, all complex gates have at most 3 serial transistors and are roughly of the same complexity as that shown in Figure 4. The only exception is the complex gate in the non-critical path used to generate $C_{out}$, which has 4 serial transistors (Figure 6). One complex gate delay is roughly equal to two and a half 2-input NAND gate delays. The $G_6^* \lor P_6^* C_1^*$ term in equation (15) is actually available from group 7 within the same block and can be reused at a cost of one (complex) gate delay, increasing the number of (complex) gate delays of pb7 from two to three. Figure 5 shows an implementation of group 8.
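A quick numerical check of the final-sum expression for bit 24 is sketched below. It assumes that bit position 0 holds the carry-in, that $s_i = a_i \oplus b_i$, that $G_j^*$ and $P_j^*$ are the Ling-style group terms with a pseudo-propagate shifted by one bit ($P_6^* = p_{19} p_{18} p_{17}$, so $p_{17}$ is absorbed locally), and that the pseudo-carry into block 2 equals $g_{17} \lor c_{16}$. All of these, and the function names, are assumptions for illustration.

```python
import random


def positions(a, b, cin):
    """Per-position g, p and s lists; position 0 holds the carry-in."""
    A = [cin] + [(a >> i) & 1 for i in range(32)]
    B = [cin] + [(b >> i) & 1 for i in range(32)]
    g = [x & y for x, y in zip(A, B)]
    p = [x | y for x, y in zip(A, B)]
    s = [x ^ y for x, y in zip(A, B)]
    return g, p, s


def ripple(g, p):
    """Reference carries c_i = g_i OR (p_i AND c_{i-1})."""
    c, out = 0, []
    for gi, pi in zip(g, p):
        c = gi | (pi & c)
        out.append(c)
    return out


def group_star(g, p, j):
    """Assumed Ling-style group terms for group j (bits 3j .. 3j+2), j >= 1."""
    lo = 3 * j
    G_star = g[lo + 2] | g[lo + 1] | (p[lo + 1] & g[lo])
    P_star = p[lo + 1] & p[lo] & p[lo - 1]
    return G_star, P_star


def final_sum_24(a, b, cin):
    g, p, s = positions(a, b, cin)
    c = ripple(g, p)
    C1_star = g[17] | c[16]  # pseudo-carry out of block 1 (reference value)
    G7, P7 = group_star(g, p, 7)
    G6, P6 = group_star(g, p, 6)
    carry_into_24 = p[23] & (G7 | (P7 & G6) | (P7 & P6 & C1_star))
    return s[24] ^ carry_into_24


def check_final_sum_24(trials=5000, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, cin = rng.getrandbits(32), rng.getrandbits(32), rng.getrandbits(1)
        expected = ((a + b + cin) >> 23) & 1  # sum bit 24 has weight 2**23
        if final_sum_24(a, b, cin) != expected:
            return False
    return True


if __name__ == "__main__":
    print("S_24 matches integer addition:", check_final_sum_24())
```

Only the pseudo-carry crosses the block boundary in this model; the remaining $p$ factors are all formed from bits inside block 2 or from $p_{17}$, which is the point made in the paragraph above.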
In Figure 5, it is interesting to note that we have used $s_{25}$ ($= a_{25} \oplus b_{25}$) as the local propagate (globally, we used $p_i$ ($= a_i \lor b_i$)), allowing a more efficient implementation of ($g_{25} \lor p_{25} g_{24}$) and ($g_{25} \lor p_{25} p_{24}$) in equation (16). Because the present approach does not modify the local sum logic, there is no increase in hardware in Figure 5 when compared with an implementation that uses the conventional conditional-sum (CSA) algorithm.¹
In terms of the number of (complex) gate delays, $S_{32}$ is no worse than $S_{26}$. The equation for $S_{32}$ is similar to equation (16):
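That the exclusive-OR signal can stand in for the OR-type propagate inside a local carry term follows from $p_i = g_i \lor s_i$. The short exhaustive check below illustrates this for the two terms mentioned above; the function name is hypothetical.

```python
from itertools import product


def check_local_propagate():
    """Confirm g25 + p25*x == g25 + s25*x for x = g24 and x = p24."""
    for a25, b25, a24, b24 in product((0, 1), repeat=4):
        g25, g24 = a25 & b25, a24 & b24
        p25, p24 = a25 | b25, a24 | b24  # OR-type propagate (used globally)
        s25 = a25 ^ b25                  # XOR signal (used as local propagate)
        if (g25 | (p25 & g24)) != (g25 | (s25 & g24)):
            return False
        if (g25 | (p25 & p24)) != (g25 | (s25 & p24)):
            return False
    return True


if __name__ == "__main__":
    print("XOR propagate is interchangeable here:", check_local_propagate())
```

Since $s_i$ is needed for the sum anyway, using it as the local propagate avoids generating a separate OR term for the local logic.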
= m(G: v P;)
From the previous discussion, $P_j^*$ and $G_j^*$ are available in one complex gate delay, $Pb_k^*$ and $Gb_k^*$ in two, and $C_k^*$ in three. The final sum selection multiplexor is counted as one gate delay.
Hence, the present adder has a total of four complex gate delays (three complex gates and a multiplexor).
¹ Since we used CSA for the local sum logic, it is fair to compare our adder with CSA, knowing that CSA consumes more hardware than CLA [4]. Also, by using $s_i$ ($= a_i \oplus b_i$) as the local propagate, the present implementation actually saves hardware. Number of serial transistors is a better measure.
Comparison with Other Adders
Table 1 compares the present adder with the conventional CLA, conditional-sum, carry-select, and the Multiple-Output Domino Logic (MODL) adders [7] in terms of complex gate delays and number of serial transistors in the critical path. The present adder has fewer complex gate delays than the other adders. In the comparison, we have assumed that the CLA adder is implemented in a complex-gate-oriented medium (i.e., MOS LSI or VLSI) and that the carry-select adder uses 4-bit groups and conventional carry look-ahead to propagate the global carry. To be fair, we have further assumed that the conditional-sum adder is organized like the present adder but without the modified Ling approach, and that the MODL adder uses conditional-sum logic locally.
Because the CLA, conditional-sum, and carry-select adders do not generate the group-generate directly, they have one more gate delay than the present adder. CLA requires another gate delay to generate the local sum, increasing its total number of gate delays from 5 to 6. Although the MODL adder generates the group-generate directly, it does so by using a small 2-bit group [7], which requires more levels in the global carry generation process than the present adder.
Comparison of complex gate delays in CMOS adders can be misleading because they depend on both fan-in and fan-out. A better measure is the number of serial transistors which a signal must traverse in the critical path. This means that for fully static CMOS circuits, we evaluate both P-channel and N-channel transistors for critical paths. For dynamic CMOS circuits [8] , which include DOMINO circuits, we only evaluate the N-channel transistors. Hence, this comparison scheme is slightly biased against fully static CMOS circuits.
Admittedly, comparing CMOS adders in terms of serial transistors in the critical path is crude but it does allow us a quick way to evaluate the potential performance of an algorithm for further study. A fair comparison scheme should consider area, power consumption, speed, and design turnaround time; such an elaborate scheme is far more time consuming than can be afforded during the algorithm selection phase of a study.
In counting the number of transistors, the discharge N-channel transistors in DOMINO logic are not included. Inverters are counted as one transistor and XOR gates as two. The serial transistor count for NAND, NOR, and complex gates is the number of transistors in the longest N-channel or P-channel chain for static and dynamic CMOS circuits. When there are pass gates [8] involved, the situation is a little more complicated. The source-drain (input-output) path of a pass gate is counted as 0.5 transistors and the gate-drain (control-output) path as one transistor. By the same token, 2-1 multiplexors are counted as two transistors along the selection-output path, but as 0.5 along the input-output path. A similar comparison scheme has been suggested by Oklobdzija and Barnes [9], but the accounting details were not given.
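The accounting rules above can be written down as a small tally function. This is a sketch of the counting scheme only; the stage names, the tuple layout, and the example path are hypothetical and do not reproduce any entry of Table 1.

```python
# Fixed weights taken from the counting rules in the text.
FIXED_WEIGHTS = {
    "inverter": 1.0,
    "xor": 2.0,
    "pass_data": 0.5,     # source-drain (input-output) path of a pass gate
    "pass_control": 1.0,  # gate-drain (control-output) path of a pass gate
    "mux_select": 2.0,    # 2-1 multiplexor, selection-output path
    "mux_data": 0.5,      # 2-1 multiplexor, input-output path
}


def serial_transistors(path):
    """Sum the serial-transistor count along a path.

    Each stage is a tuple. Fixed-weight stages are ("kind",). NAND, NOR and
    complex gates are ("nand" | "nor" | "complex", n_chain, p_chain), counted
    by their longest chain; for a dynamic gate pass p_chain = 0 so that only
    the N-channel chain is evaluated.
    """
    total = 0.0
    for stage in path:
        kind = stage[0]
        if kind in ("nand", "nor", "complex"):
            n_chain, p_chain = stage[1], stage[2]
            total += max(n_chain, p_chain)
        else:
            total += FIXED_WEIGHTS[kind]
    return total


if __name__ == "__main__":
    # Hypothetical path: three static complex gates followed by a mux select.
    example = [("complex", 3, 3), ("complex", 3, 3), ("complex", 3, 2), ("mux_select",)]
    print(serial_transistors(example))  # 11.0
```

Keeping the weights in one table makes it easy to see how sensitive a comparison is to the half-transistor convention for pass gates and multiplexor data inputs.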
See Table 1. The inverting output buffers in Figure 5 are not needed in the adder and are therefore not counted as part of the critical path.
The path from $c_{in}$ to $C_{out}$ has the same number of serial transistors as the path from $c_{in}$ to $S_{32}$. $Gb_k^*$ and $Pb_k^*$ are available in eight transistors; hence $C_k^*$ is available in twelve transistors and $C_{out}$ in fourteen from equation (7) (Figure 6). The path from $c_{in}$ to $C_{out}$, however, is not the critical path in terms of actual delay because there is much less capacitive loading on this path than on the path from $c_{in}$ to $S_{32}$.
We have laid out a 9-bit block of the adder in an advanced bipolar/CMOS process with 1.0 μm drawn channel length. The block was laid out in a standard-cell fashion and occupied roughly 700 × 450 μm². No automatic compaction was performed on the layout. The delay of the whole adder, operating at 5 V and driving a 300 fF load at room temperature, is estimated using SPICE to be around 3.0 ns.
Summary
In summary, we have presented a fully static CMOS implementation of a Ling type adder. The implementation has fewer complex gate delays and fewer serial transistors in the critical path than other conventional adders (i.e., conditional-sum adder, CLA, carry-select adder, and multiple-output domino logic adder). Compared with a conventional conditional-sum adder, the increase in hardware in the present implementation is negligible.
Two key ideas presented in this paper that allow Ling's scheme to be used in CMOS are:
(1) the identity $\bar{p}_i \bar{g}_i = \bar{p}_i$ can be used on the P-channel transistor network (equation (4)), and
(2) the factored $p$ term in Ling's equation can be propagated locally (equations (14) and (15)).
Together, they allow a reduction of serial transistors in the critical path without any hardware increase. A minor observation is that by using $s_i$ ($= a_i \oplus b_i$) as the local propagate, the present implementation saves hardware in the local sum logic over the conventional conditional-sum adder (Figure 5).
