Abstract. Integer addition is one of the most important operations in digital computer systems because the performance of processors is significantly influenced by the speed of their adders. This paper proposes a self-timed carry-lookahead adder in which the logic complexity is a linear function of n, the number of inputs, and the average computation time is proportional to the logarithm of the logarithm of n. To the best of our knowledge, our adder has the best area-time efficiency, which is $\Theta(n \log\log n)$. An economical implementation of this adder in CMOS technology is also presented. SPICE simulation results show that, based on random inputs, our 32-bit self-timed carry-lookahead adder is 2.39 and 1.42 times faster than its synchronous counterpart and the self-timed ripple-carry adder, respectively; and, based on statistical data gathered from a 32-bit ARM simulator, it is 1.99 and 1.83 times faster than its synchronous counterpart and the self-timed ripple-carry adder, respectively.
INTEGER addition is one of the most important operations in digital computer systems. In addition to explicit arithmetic (such as addition, subtraction, multiplication, and division) performed in a program, additions are performed to increment program counters and calculate effective addresses [1]. Statistics presented in [1], [2] show that, in a prototypical RISC machine (DLX), 72 percent of the instructions perform additions (or subtractions) in the datapath. The figure reported for ARM processors reaches 80 percent [3]. Thus, the performance of processors is significantly influenced by the speed of their adders.
Circuits may be classified as synchronous or asynchronous. Synchronous circuits have a clock to synchronize the operations of subsystems, while asynchronous circuits do not. Subsystems in asynchronous circuits usually need start and completion mechanisms to synchronize with one another. One advantage of using asynchronous circuits is that these circuits operate at average rates, while synchronous circuits are required to operate at the worst rates. A good example of this is that n-bit ripple-carry adders (which are synchronous), shown in Fig. 1, have worst case computation time $\Theta(n)$, whereas n-bit carry-completion sensing adders [4], [5], [6] (which are asynchronous), shown in Fig. 3, have average computation time $\Theta(\log n)$ [7]. This paper proposes a self-timed carry-lookahead adder in which the logic complexity is a linear function of n, $\Theta(n)$, and the average computation time is proportional to the logarithm of the logarithm of n, $\Theta(\log\log n)$. To the best of our knowledge, our adder has the best area-time efficiency, which is $\Theta(n \log\log n)$.
This paper is organized as follows: Section 2 introduces designs of some earlier synchronous and asynchronous adders. Section 3 presents a novel delay-insensitive carry-lookahead adder with speed-up circuitry. The complexity analysis is presented in Section 4. Section 5 shows the economical CMOS implementation of the proposed adder. Section 6 presents the results of the SPICE simulation based on both random and statistical data. Section 7 concludes the paper.
BACKGROUND
Adders may be implemented with synchronous or asynchronous circuits. This section introduces some previous work on adder design, including both synchronous and asynchronous adders.
Let $A_{n-1} A_{n-2} \cdots A_0$ and $B_{n-1} B_{n-2} \cdots B_0$ be two n-bit binary numbers with a sum of $S_{n-1} S_{n-2} \cdots S_0$ and with a sequence of carry bits, $C_n C_{n-1} \cdots C_0$. (The least significant bits are $A_0$ and $B_0$.)
Synchronous Adders

Ripple-Carry Adders
The ripple-carry scheme computes the $S_i$'s as follows:

$$C_{i+1} = A_i B_i + (A_i + B_i) C_i, \qquad S_i = A_i \oplus B_i \oplus C_i, \qquad (1)$$

where $i = 0, 1, \ldots, n-1$ and $\oplus$ is the exclusive-OR operation. An n-bit ripple-carry adder (RCA) is constructed by cascading n one-stage full adders, as shown in Fig. 1. Both the logic complexity and the worst case computation time are $\Theta(n)$.
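The ripple-carry recurrence can be sketched in a few lines of Python. This is a minimal bit-level model for illustration only: the function names and the little-endian list representation are assumptions, not part of the paper, and the model says nothing about gate delays.

```python
def ripple_carry_add(a_bits, b_bits, c0=0):
    """Add two equal-length bit lists (index 0 = least significant bit).

    Returns (sum_bits, carries), where carries[i] models C_i and
    carries[n] is the carry out of the most significant stage.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same number of bits")
    carries = [c0]
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        c = carries[-1]
        sum_bits.append(a ^ b ^ c)               # S_i = A_i xor B_i xor C_i
        carries.append((a & b) | ((a | b) & c))  # C_{i+1} = A_i B_i + (A_i + B_i) C_i
    return sum_bits, carries


def to_bits(value, n):
    """Little-endian bit list of an n-bit unsigned integer (demo helper)."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Each loop iteration needs the carry produced by the previous one, which is why the worst case delay grows linearly with n.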
Carry-Lookahead Adders
To compute a sum, an RCA requires, in the worst case, n stage-propagation delays. For high speed processors, this scheme is undesirable. One way to improve adder performance is to compute the carries in parallel, which is the motivation for carry-lookahead adders (CLAs). The carry-lookahead scheme computes the $S_i$'s as follows:

$$p_i = A_i \oplus B_i \quad \text{(carry-propagate)} \qquad (2)$$
$$g_i = A_i B_i \quad \text{(carry-generate)} \qquad (3)$$
$$S_i = A_i \oplus B_i \oplus C_i \qquad (4)$$
$$C_{i+1} = g_i + p_i C_i, \qquad (5)$$

where $i = 0, 1, \ldots, n-1$. Equation (5) may be further expanded into

$$C_{i+1} = g_i + p_i g_{i-1} + p_i p_{i-1} g_{i-2} + \cdots + p_i p_{i-1} \cdots p_0 C_0. \qquad (6)$$
For large $i$, it is impractical to build a two-stage full carry-lookahead adder because of the practical limitations on fan-in and fan-out, irregular structure, and many long wires [8], [1]. However, the carry-lookahead scheme may be built in the form of a tree-like circuit, which has a simple, regular structure [9], [10], [1], by reformulating (6) into

$$P_{i,k} = P_{i,j} P_{j-1,k} \quad \text{(block-carry-propagate)} \qquad (7)$$
$$G_{i,k} = G_{i,j} + P_{i,j} G_{j-1,k} \quad \text{(block-carry-generate)} \qquad (8)$$
$$C_j = G_{j-1,k} + P_{j-1,k} C_k, \qquad (9)$$

where $i \ge j > k$, $G_{i,i} = g_i$, and $P_{i,i} = p_i$. The tree-like circuit of the CLA with $n = 8$ is shown in Fig. 2c. It consists of two kinds of modules: A and B. A-modules, shown in Fig. 2a, compute carry-propagate (see (2)), carry-generate (see (3)), and sum bits (see (4)). The inputs $A_i$ and $B_i$ to the ith A-module represent the ith bits of the binary numbers to be added. The outputs $g_i$ and $p_i$ are the ith carry-generate and carry-propagate, respectively.

B-modules, shown in Fig. 2b, compute block-carry-propagate (see (7)), block-carry-generate (see (8)), and carry bits (see (9)).

In a synchronous CPU with a CLA, the addition operation is synchronized by clock pulses. The clock period must be large enough to allow all the possible input configurations to be computed. Thus, the CLA must work under the worst possible condition. The worst propagation delay of an n-bit CLA is two unit delays (see footnote 2) of the A-module plus $2\log_2 n - 1$ unit delays of the B-module.
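The block-carry-generate and block-carry-propagate recurrences can be illustrated with a small Python sketch. It merges (G, P) pairs one bit at a time for readability, whereas the tree circuit performs the same merges in parallel; the function names are hypothetical and the code models logic values only, not timing.

```python
def combine(left, right):
    """Merge two adjacent blocks, each given as a (G, P) pair.

    `left` is the more significant block (bits i..j) and `right` the less
    significant one (bits j-1..k):
    G = G_left + P_left * G_right, P = P_left * P_right.
    """
    g_left, p_left = left
    g_right, p_right = right
    return (g_left | (p_left & g_right), p_left & p_right)


def lookahead_carries(a_bits, b_bits, c0=0):
    """Return [C_0, ..., C_n] for little-endian bit lists a_bits and b_bits."""
    carries = [c0]
    block = None  # (G, P) of all bits processed so far
    for a, b in zip(a_bits, b_bits):
        bit = (a & b, a ^ b)  # (g_i, p_i) of the current bit
        block = bit if block is None else combine(bit, block)
        g, p = block
        carries.append(g | (p & c0))  # C_j = G_{j-1,0} + P_{j-1,0} C_0
    return carries
```

Because `combine` is associative, the merges can be regrouped into a balanced tree, which is what gives the tree-like CLA its logarithmic worst case depth.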
Other Tree-Like Adders
Conditional-Sum Adders [11], Type-2 Adders [10], and Brent and Kung Adders (BKA) [12] were proposed to further improve the worst case delay. The worst propagation delay of these adders is about $\log_2 n$ stage delays. However, the logic complexity of these adders goes up to $\Theta(n \log n)$.

Once all the stages have computed their carries, the addition is completed. An n-input AND gate may be used to signal the completion (i.e., $\mathit{finish} = \prod_{i=0}^{n-1} \mathit{Ack}_i$). The enable signal is used to start the computation and to ensure that no false completion signal will be generated. When $\mathit{enable} = 0$, all $C_i^0$ and $C_i^1$ ($i = 1, \ldots, n$) signals are set to zero. The completion signal, finish, must be zero, too. Thus, no false completion can be asserted. After all the input data have arrived at the input ports of the CCSA, the enable signal is turned on to start the addition operation. Upon the completion of the addition, the finish signal is turned on.

For more detail on the CCSA, see [4], [5], [13]. The logic complexity is $\Theta(n)$ and the average computation time for randomly distributed inputs is $\Theta(\log n)$ [7]. Note that the worst case computation time is $\Theta(n)$.
Delay-Insensitive Circuits
Delay-insensitive (DI) circuits [14] are a subclass of asynchronous circuits. The defining property of DI circuits is that their correctness is insensitive to delays in both gate elements and connection wires. Thus, DI circuits are the most robust circuits in terms of the operating variations such as temperature, voltage, and processing. The class of pure DI circuits is quite limited [15] . However, extending pure DI circuits with isochronic forks is sufficient to construct any circuit of interest. (Such circuits are sometimes called quasi-DI.) For this paper, we assume isochronic forks.
The CCSA, shown in Fig. 3 , is not a DI circuit. It must meet the bundling constraint [16] , [13] : The start signal cannot be asserted unless all the input data bits have arrived and the sum bits must arrive at the environment before the environment receives the finish signal.
One way to meet the above constraint is to put a delay in the control path. Fig. 4a shows an example of how an adder with a bundling constraint is constructed. It works as follows: First, the req signal loads the data of registers A and B to the input ports of the adder and then the start signal is turned on. The delay in the path from req to start guarantees that all the data bits of registers A and B have arrived at the input ports of the adder before the adder starts its computation. Second, once the sum bits have arrived at register C, the write signal causes register C to store them. The delay in the path from finish to write guarantees that all the sum bits reach register C before the write signal does. Finally, an ack signal is generated to indicate the completion of an addition operation. The req, start, finish, write, and ack signals must be reset to zero before the next addition starts.
It is impossible to meet bundling constraints if there is no way to control delays in control and data paths. Besides, DI circuits are the most robust circuits and particularly easy to compose and substitute. Thus, the design of DI circuits is very interesting.
2. One unit delay of A- or B-modules is equal to the delay of two AND or OR gates if they are implemented in two-level logic.

where $i = 0, 1, \ldots, n-1$.

An n-bit DIRCA is shown in Fig. 5. The logic complexity of DIRCA is a linear function of n, $\Theta(n)$, and the average computation time is proportional to the logarithm of n, $\Theta(\log n)$. Fig. 4b shows an example of how a DI adder is constructed. It works as follows: The req signal loads the data of registers A and B, taking them to the input ports of the adder. The DI adder computes sum bits whenever any data bits arrive. Any sum bits produced by the DI adder are sent to and stored in register C. Once register C receives all the sum bits, it sends out the ack signal to indicate the completion of an addition operation. The req, ack, and the input and output signals of the adder must be reset to zero before the next addition starts. The registers used by the DI adder are the dual-rail asynchronous registers in [17].
Martin [18] proposed an efficient CMOS design of the DIRCA. The transistor count per DIRCA cell is 42. Compared to the synchronous RCA cell, which needs 40 transistors, the asynchronous DIRCA is only marginally larger than the synchronous one, in spite of the use of dual-rail signals.

In the next section, we will present a delay-insensitive carry-lookahead adder in which the logic complexity is $\Theta(n)$ and the time complexity is $\Theta(\log\log n)$.
DELAY-INSENSITIVE CARRY-LOOKAHEAD ADDERS
Delay-Insensitive Carry-Lookahead Adders (DICLA) may be implemented by using dual-rail signaling in input bits, sum bits, and carry bits, and by using one-hot code in the internal signals. A DICLA may be built with two basic modules: C and D, connected in a tree-like structure (Fig. 6).

The equations of the C-module are defined as follows:

where $i = 0, \ldots, n-1$. Note that, in a DICLA, the input data bits and the output data (sum) bits may be propagated at any time in any order. The completion signal, finish, may be generated when needed as follows:

For more completion detection circuits for dual-rail self-timed systems, see [19].
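One common way to derive a completion signal from dual-rail sum bits is sketched below. The encoding used here, (1, 0) for logic 0, (0, 1) for logic 1, and (0, 0) for "not yet valid", is a widely used dual-rail convention and is an assumption for illustration; it is not necessarily the exact circuit used for the DICLA, and all names are hypothetical.

```python
EMPTY = (0, 0)  # spacer value: the bit is not (yet) valid


def encode_dual_rail(bit):
    """Encode a Boolean as (rail0, rail1): 0 -> (1, 0), 1 -> (0, 1)."""
    return (0, 1) if bit else (1, 0)


def finish(sum_rails):
    """Return 1 when every sum bit has one of its rails asserted, else 0."""
    return int(all(r0 | r1 for r0, r1 in sum_rails))


def reset_done(sum_rails):
    """Return 1 when every sum bit has returned to the spacer value."""
    return int(all(pair == EMPTY for pair in sum_rails))
```

In this model the completion signal is an AND over per-bit OR gates, so it rises only after the slowest sum bit has become valid.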
The C-module is shown in Fig. 6a. The dual-rail signals on the left-hand side of Fig. 6a are grouped as $A_i = (A_i^0, A_i^1), \ldots$

The resulting C-module is shown on the right-hand side of Fig. 6a.

The equations for the D-module are defined as follows:

The performance of the DICLA may be further improved by some speed-up circuitry. If $A_i = B_i$ (i.e., carry-kill or carry-generate), then the output carry, $C_{i+1}$, is independent of the input carry, $C_i$. The tree-like circuit, shown in Fig. 6, does not take full advantage of this feature to speed up the carry computation.
Speed-Up Circuitry
The idea of speeding up carry computation of tree-like adders is to send the carry-generates and carry-kills to all stages that can use this information directly. For example, the DICLA, shown in Fig. 6, takes one unit delay of the C-module plus three unit delays of the D-module to compute $C_4$. However, $C_4^1$ may sometimes be determined by $g_3$, $G_{3,2}$, or $G_{3,0}$. If we know $A_3^1 = B_3^1 = 1$ (i.e., $g_3 = 1$), we may directly set $C_4^1 = g_3$. Thus, it takes one unit delay of the C-module plus only one unit delay of the D-module to compute $C_4^1$.
The D'-module (D-module with speed-up circuitry), shown in Fig. 7a, is designed to speed up the carry computation. It is defined in the following way:

where $2 \le j \le n$. These two equations are taken from the scheme of full carry-lookahead adders. It is impractical to implement the speed-up circuits, shown in (29) and (30), when n is large because of the practical limitations on fan-in and fan-out, irregular structure, and many long wires. In addition, the logic complexity of the speed-up circuits increases more than linearly (i.e., $\Theta(n \log n)$).

Fortunately, all the above problems can be solved by using the properties of a tree-like structure. That is, instead of fully employing the carry-lookahead generation for each carry, our speed-up mechanism focuses on the root nodes of subtrees. By trading time for area, the logic complexity of the adder with the speed-up circuitry is brought down to a linear function of n, the number of inputs, and the average computation time is $\Theta(\log\log n)$. Both proofs will be given in Section 4.

The above two equations can be reformulated in a more efficient and economical way in logic as:

where $j = 2, 4, 6, \ldots, n$ and $l$ is the level of the D'-module, receiving the speed-up signals, in the tree. The tree leaves are at level 0 and the immediate parent nodes of nodes at level $i$ are at level $i + 1$ (Fig. 7b). Note that only the modules at level $\ge 2$ need speed-up circuits.

In this speed-up mechanism, each root node above level 1 receives speed-up signals from the left edge nodes of its right subtree. For example, the D'-module at level 3 in Fig. 7b has the following speed-up signals:
The DICLA with the speed-up circuitry (SPDICLA) for $n = 8$ is shown in Fig. 7b. Fig. 7c is used to demonstrate the power of the speed-up circuitry. Imagine a carry propagation chain (i.e., a sequence of carry-propagates) with length $x$ which is spanned by two subtrees, shown as the solid triangles in Fig. 7c. The r (right) subtree has a carry propagation chain with length $x'$ and the l (left) subtree has a carry propagation chain with length $x - x'$. Assume that $x'$ and $x - x'$ are powers of 2 and $x' \ge x - x'$. There must exist two subtrees with the same size (say $y/2$) which contain the r and l subtrees. It takes $2\log_2 x' + 1$ module delays to compute the carries in the r subtree. Thanks to the speed-up circuitry, the carries can be directly propagated from the right subtree to the left subtree. The l subtree can start carry computation at $\log_2 x' + 1$ (the computation time from the leaves to the root node of the r subtree) plus 1 (the delay of the D'-module which is the root of the two dotted subtrees) delays. Thus, it takes only

delays to compute all these carries. Without the speed-up circuitry, it would take $\log_2 y + 1 + \log_2 (x - x') + 1$ delays. For any n-bit addition whose maximal carry chain length is a small constant, if the carry chain is located in the middle of the tree, then it requires $\Theta(\log n)$ stage delays in DICLA, but only $\Theta(1)$ stage delays in SPDICLA.
Optimization
Adding speed-up circuitry not only enhances the performance, but also makes some logic redundant and, hence, removable. Namely, if the carry-kills and carry-generates are exploited in the speed-up circuitry, these signals need not propagate through the tree. For example, since $g_3$ and $k_3$ are used by the D'-module, which computes $C_4$, it is unnecessary to route them to the D-module, which computes $C_3$. In general, the left carry-kill and left carry-generate signals of D- and D'-modules can be eliminated. The equations for the simplified D-module (D) are redefined as follows:

$$P_{i,k} = P_{i,j} P_{j-1,k} \qquad (33)$$

Fig. 8.

Note that the SPDICLA is not delay-insensitive due to the speed-up circuitry, which creates multiple paths to compute the K and G signals of D'-modules. This is not a problem in the computation phase. However, in the reset phase, suppose that there is a path with a long delay and the signal has not yet been asserted along this path. A reset completion may then be falsely flagged. Removing the redundant logic, in fact, makes the circuit (i.e., DICLASP) delay-insensitive again.
PERFORMANCE ANALYSIS
In this section, we analyze the logic complexity and the average delay of DICLASP adders. We assume three things: First, the delay through each module (e.g., C-, D-, and D'-modules) is $d$; second, the number of bit positions of input arguments, n, is a power of 2, i.e., $n = 2^k$; third, the distribution of the input configurations is uniform.
Average Computation Time
The time required to perform an addition (computation time) in an adder is the time required for propagating the carries (carry propagation time) in stages plus one more delay to compute the sum bits. The computation time of an adder is sensitive to the numbers to be added. The upper and lower bound proofs of average computation time are an extension of proofs for CCSAs by Greenstreet [20].

Theorem 1. For any input configuration, the carry propagation time is proportional to the logarithm of the length of the longest carry chain.

Proof. Consider a carry chain with length $x$ in an input configuration where $2^{l-1} < x \le 2^l$.

Upper bound: If the carry chain is contained in one subtree with $2^l$ leaves in DICLASP (see Fig. 9a), then the carries in the carry chain can be computed in $(2l + 1)d$. To propagate the last carry of the carry chain to the next stage, it may need two more delays. So, the carry propagation time is at most $(2l + 3)d$. Otherwise, the carry chain must span more than one subtree with $2^l$ leaves (see Fig. 9b). There must exist two subtrees with $2^l$ leaves in DICLASP containing the carry chain. In this case, the propagation time is at most $(2l + 3)d$. Since $2^{l-1} < x$, $l < \log_2 x + 1$. Thus, $2l + 3 < 2(\log_2 x + 1) + 3 = 2\log_2 x + 5$. To summarize, in both cases, the carry propagation time is at most $(2\log_2 x + 5)d$.

Lower bound: There must exist one subtree with $2^{l-2}$ leaves in DICLASP such that the leaves of the subtree are contained in the carry chain. Thus, the carry propagation time is at least $(2(l - 2) + 1)d = (2l - 3)d$. Since $x \le 2^l$, $\log_2 x \le l$. Thus, $2l - 3 \ge 2\log_2 x - 3$, i.e., the carry propagation time is at least $(2\log_2 x - 3)d$. □

Now, we shall show that the average computation time of DICLASP is $\Theta(\log\log n)$.
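As a quick numerical illustration of Theorem 1, the sketch below evaluates the two bounds for a given longest carry chain length x. The function name is hypothetical, d is the per-module delay, and the lower bound is returned as stated, so it is negative (and therefore vacuous) for very short chains.

```python
import math


def carry_time_bounds(x, d=1.0):
    """Return (lower, upper) bounds on carry propagation time for chain length x >= 1."""
    if x < 1:
        raise ValueError("chain length must be at least 1")
    lower = (2 * math.log2(x) - 3) * d  # (2 log2 x - 3) d
    upper = (2 * math.log2(x) + 5) * d  # (2 log2 x + 5) d
    return lower, upper
```

For example, a chain of length 8 gives bounds of 3d and 11d; the gap between the bounds is a constant 8d regardless of x.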
Theorem 2 (Lower Bound). A lower bound of the average carry propagation time of the DICLASP is $\Omega(\log\log n)$.

Proof. Partition the n stages into nonoverlapping segments, each of length $(\log_2 n)/2$. There are $2n/\log_2 n$ such segments. The probability that a carry propagates across one of these segments is the probability that all $(\log_2 n)/2$ stages in the segment propagate the carry. This is $0.5^{(\log_2 n)/2} = 1/\sqrt{n}$. The probability that a carry is not propagated across a segment is $1 - 1/\sqrt{n}$. The probability that there is no segment over which a carry is propagated is $(1 - 1/\sqrt{n})^{2n/\log_2 n}$, which is less than $1/e$ whenever the exponent is at least $\sqrt{n}$, because $(1 - 1/m)^m < 1/e$. This is satisfied when $2n \ge \sqrt{n}\log_2 n$, i.e., when $n \ge 1$.

So, for $n \ge 1$, the probability that no segment has a carry propagated across it is less than $1/e$. This means that the probability that a carry is propagated across at least one segment exceeds $1 - 1/e$. Thus, the average carry propagation time exceeds

$$\left(1 - \frac{1}{e}\right)\left(2\log_2\frac{\log_2 n}{2} - 3\right)d = \left(1 - \frac{1}{e}\right)\left(2\log_2\log_2 n - 5\right)d,$$

i.e., the average carry propagation time (APT) is $\Omega(\log\log n)$. □

An upper bound is obtained by considering n overlapping chains of length $2\log_2 n$. We show that such chains are rare.
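The probability argument in the lower-bound proof can be checked numerically. The sketch below follows the proof's idealized model, in which segment lengths are allowed to be non-integer and segments are treated as independent; the function names are hypothetical and n must be greater than 1.

```python
import math


def prob_no_segment_propagates(n):
    """Probability that no segment of length (log2 n)/2 is fully carry-propagating."""
    seg_len = math.log2(n) / 2
    num_segments = 2 * n / math.log2(n)
    p_segment = 0.5 ** seg_len  # equals 1 / sqrt(n)
    return (1 - p_segment) ** num_segments


def lower_bound_delay(n, d=1.0):
    """Lower bound (1 - 1/e)(2 log2 log2 n - 5) d on the average propagation time."""
    return (1 - 1 / math.e) * (2 * math.log2(math.log2(n)) - 5) * d
```

For the sizes tried in the test (n = 4 up to 2^32) the probability stays below 1/e, as the proof requires. The delay bound only becomes positive once $\log_2\log_2 n$ exceeds 2.5, so it is informative only asymptotically.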
Theorem 3 (Upper Bound). An upper bound of the average carry propagation time of the DICLASP is $O(\log\log n)$.

Proof. Let $P_i$ be the probability that the longest carry chain has length $i$ and let $T_i$ be the maximal carry propagation time of chains of length $i$. Then,

$$\mathrm{APT} = \sum_{i=0}^{n} P_i T_i \le P_0 \cdot 3d + \sum_{i=1}^{2\log_2 n} P_i (2\log_2 i + 5)d + \sum_{i=2\log_2 n + 1}^{n} P_i (2\log_2 i + 5)d.$$

The probability that the maximal carry chain has length 0 is $1/2^n$. Thus, $P_0 \cdot 3d = \frac{3}{2^n} d$. The second term can be derived as follows: Since $\sum_{i=1}^{2\log_2 n} P_i < 1$ and $i \le 2\log_2 n$,

$$\sum_{i=1}^{2\log_2 n} P_i (2\log_2 i + 5)d \le \sum_{i=1}^{2\log_2 n} P_i (2\log_2 (2\log_2 n) + 5)d < (2\log_2 (2\log_2 n) + 5)d = (2\log_2\log_2 n + 7)d.$$

The third term can be derived by considering n overlapping segments of length $2\log_2 n$. Each carry chain of length longer than $2\log_2 n$ contains at least one segment completely. The probability that a carry propagates across such a segment is $0.5^{2\log_2 n} = 1/n^2$. The number of segments, even allowing overlapping, is less than n. Also, the length of any carry chain is at most n. Thus, the contribution to the expected carry propagation time of carry chains that are longer than $2\log_2 n$ is less than $n \cdot \frac{1}{n^2} \cdot (2\log_2 n + 5)d = \frac{2\log_2 n + 5}{n} d$.

Since $\mathrm{APT} < \left[\frac{3}{2^n} + (2\log_2\log_2 n + 7) + \frac{2\log_2 n + 5}{n}\right] d$, the average carry propagation time is $O(\log\log n)$. □

Theorem 4. The average carry propagation time of the DICLASP is $\Theta(\log\log n)$.

Proof. By Theorems 2 and 3. □
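To get a feel for the size of the upper bound, the three terms can be evaluated directly. This is a sketch with a hypothetical function name; it simply adds the three contributions derived in the proof, with d as the per-module delay and n greater than 1.

```python
import math


def upper_bound_delay(n, d=1.0):
    """Evaluate [3/2^n + (2 log2 log2 n + 7) + (2 log2 n + 5)/n] * d."""
    log_n = math.log2(n)
    term_zero_chain = math.ldexp(3.0, -n)       # 3 / 2^n, underflows to 0.0 for large n
    term_short_chains = 2 * math.log2(log_n) + 7
    term_long_chains = (2 * log_n + 5) / n
    return (term_zero_chain + term_short_chains + term_long_chains) * d
```

For n = 32 the middle term dominates: it contributes about 11.64d out of a total of about 12.11d, which shows that chains longer than $2\log_2 n$ add very little to the average.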
Logic Complexity
Theorem 5. The logic complexity of DICLASP is a linear function of n.

Proof. It is easy to show that the n-input DICLASP requires $n$ C-modules, $n/2$ D-modules, $n/2$ D'-modules, and the speed-up circuitry. We show that the size of the speed-up circuitry is also a linear function of n by counting the total number of added inputs. Consider the D'-modules at level $i$, where $i \le k$ ($k = \log_2 n$). There are $2^{k-i}$ D'-modules and each D'-module at level $i$ has $2(i - 1)$ inputs (i.e., $i - 1$ inputs for carry generate and $i - 1$ for carry kill) for the speed-up circuitry. We also need the speed-up circuit, which contains $2k$ inputs, to compute the last carry. The total number of the added inputs is $\sum_{i=2}^{k} 2^{k-i} \cdot 2(i - 1) + 2k = 2n - 2$, which is linear in n. □
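The input count in the proof can be tallied mechanically. The sketch below assumes, as stated in the text, $2^{k-i}$ D'-modules at level i with $2(i-1)$ speed-up inputs each for levels 2 through k, plus $2k$ inputs for the last carry; the function name is hypothetical.

```python
def speedup_inputs(k):
    """Total number of added speed-up inputs for an adder with n = 2**k bits."""
    per_level = sum(2 ** (k - i) * 2 * (i - 1) for i in range(2, k + 1))
    last_carry = 2 * k
    return per_level + last_carry
```

Under these assumptions the total works out to $2n - 2$ for every k tried in the test, which is consistent with the linear bound claimed by the theorem.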
For VLSI implementation, the fan-in and fan-out of the C- and D-modules used in DICLASP pose no problem. The maximal fan-in and fan-out of D'-modules is $\log_2 n$. This is not a severe problem either, except for a very large n. The merged cell method [10], which merges two or more primary modules (e.g., C-modules) or two or more interior modules (e.g., D- and D'-modules) to reduce the depth of the tree, may be applied to further speed up the addition operation.
Comparisons of Adders
In this section, we compare the logic complexity and computation time of both synchronous and asynchronous adders. They are the Ripple Carry Adder (RCA), Conditional-Sum Adder (CSA1) [11], Carry-Select Adder (CSA2) [21], [1], Carry-Skip Adder (CSA3) [22], [1], Carry-Completion Sensing Adder (CCSA) [4], Delay-Insensitive Ripple Carry Adder (DIRCA) [18], Conditional-Sum Completion-Sensing Adder (CSCSA) [23], Brent and Kung Carry-Lookahead Adder (BKA) [12], Type-2 Adder [10], and our DICLA, SPDICLA, and DICLASP.
The logic (area) complexity and computation time of the above adders are listed in Table 1 . The worst case computation time is assumed for the synchronous adders and average computation time is assumed for the asynchronous adders. Area-time efficiency is computed by multiplying logic complexity and time complexity.
The asynchronous adders listed in Table 1 are classified as bundling constraint (BC), delay-insensitive (DI), and selftimed (ST) adders. In bundling constraint (BC) and selftimed (ST) adders, delay assumptions have to be made to guarantee correct operations. In a DI adder, input signals may arrive at any time in any order since its correctness is insensitive to delay.
It is worth mentioning that the average computation time of CSCSA is also $O(\log\log n)$. However, the logic complexity of CSCSA is $O(n \log n)$ and the design requires circuits to detect the true sum at each level and to route the completed sum from any arbitrary level to the output latch. Our DICLASP is a delay-insensitive circuit and has the best area-time efficiency among these adders.
Delay Analysis by Program Simulation
From the lower bound (Theorem 2) and the upper bound (Theorem 3) theorems, we can easily derive that the constant factor of the average computation time of the DICLASP is between 2 and 1.264. It would be very interesting to know the exact constant factor of the time complexity of the DICLASP and to compare it with other adders. No exact mathematical model for the DICLASP has been found yet. Thus, simulation was used to analyze the average computation delay. For simulations of adders, see [2] , [24] .
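The two constants quoted above can be read off from the leading $\log_2\log_2 n$ terms of the bounds in Theorems 2 and 3. The short derivation below is a sketch of that step, using only the expressions from those proofs:

```latex
\begin{align*}
\text{Lower bound: } & \left(1 - \tfrac{1}{e}\right)\left(2\log_2\log_2 n - 5\right)d
  && \Rightarrow \text{ coefficient } 2\left(1 - \tfrac{1}{e}\right) \approx 1.264, \\
\text{Upper bound: } & \left[\tfrac{3}{2^n} + 2\log_2\log_2 n + 7 + \tfrac{2\log_2 n + 5}{n}\right]d
  && \Rightarrow \text{ coefficient } 2.
\end{align*}
```

The remaining terms are constants or vanish as n grows, so they do not affect the constant factor.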
The DIRCA, the DICLA, and the DICLASP adders have been simulated with C++ programs. The results are shown in Table 2 . The results of 8-bit adders are produced by exhaustive enumeration of possible configurations. All other results are produced by simulating 100,000 pairs of random numbers.
The delays shown in Table 2 are module (or cell) delays. A module delay is equal to two AND or OR gate delays if two-level logic is used to implement those modules.

The propagation delays of the DIRCA, the CLA, the DICLA, and the DICLASP shown in Table 2 are plotted in Fig. 10. Two curves, $\log_2\log_2 n$ and $2\log_2\log_2 n$, are also plotted in Fig. 10.

The experimental results show: First, our DICLASP has the best performance even when n is small. Second, the average delay of the DICLASP grows as the logarithm of the logarithm of n. The constant factor, which is derived by averaging the delay of the DICLASP divided by $\log_2\log_2 n$, is approximately 1.7. Third, the average computation time of the DICLA and the DIRCA and the worst computation time of the CLA are all proportional to the logarithm of n, but the constant factor of the DIRCA is smaller than those of the DICLA and the CLA. The performance of the DIRCA is slightly better than that of the DICLA. This may be due to the overhead of computing the primary carry in the DICLA.
CMOS IMPLEMENTATION
To implement economic DICLASPs in CMOS technology, Martin's method to design economic delay-insensitive datapath circuits may be applied [18] . The behaviors of gates can be represented through a set of production rules and, then, they can be directly implemented in CMOS. Here, we just show the CMOS circuits of DIRCA and DICLASP. For more detail, see [18] , [24] . The CMOS implementation of DIRCA cell is shown in Fig. 11 . It contains only 40 transistors. Note that, in Martin's paper [18] , the transistor count per DIRCA cell is 42. The improvement is due to factoring out the subexpression as much as possible. That is, some of the transistors may be shared.
CMOS Implementation of DICLASP
The CMOS implementation of the C-module (see (14)-(18)) is shown in Fig. 12a. It contains 36 transistors. Note that $k_i$ and $g_i$ are implemented by static CMOS, while $p_i$, ($n \ge 4$), the speed-up circuitry needs the following transistors in total:

Table 3 lists the transistor counts of RCA, DIRCA, CLA, and DICLASP. It is worth mentioning that the speed-up circuitry in fact uses a very small percentage of the CMOS area. Compared to Martin's n-bit ripple-carry adder, which needs $42n$ transistors, our n-bit DICLASP, which needs $66n - 4$ transistors, is indeed practical for high speed processors.
PERFORMANCE EVALUATION FROM SPICE
The SPICE simulation consists of two parts: random number inputs and statistical data gathered from a 32-bit ARM simulator. Note that our SPICE simulations are based on circuits specified only in a topological sense. Chip layouts have not been developed so that factors such as 
SPICE Simulation of Random Number Inputs
Ten thousand pairs of randomly generated numbers are simulated in SPICE with MOSIS 2 micron, level 2 CMOS parameters for a 32-bit DIRCA and a 32-bit DICLASP.
The propagation chain distribution of the 10,000 random samples is plotted in Fig. 13. Propagation chains with length 2 to length 9 account for 98.6 percent of the sample space. Table 4 shows the results from SPICE simulations. Note that the best (worst) case delay is the smallest (largest) delay among the simulated cases.

It is highly impractical to exhaustively simulate 32-bit self-timed adders, which would require $2^{64}$ cases. In our SPICE experiment, we simulate only 10,000 cases for DIRCA and DICLASP. Comparing the 10,000 simulated cases to the sample space, it is obvious that only a very tiny percentage ($5.42 \times 10^{-14}$ percent) of cases are simulated. Table 5 shows the confidence limits [25]: sample mean, standard deviation, confidence interval with 95 percent confidence level, and confidence interval with 99 percent confidence level for DIRCA and DICLASP. A general distribution is assumed in computing confidence limits. For more detail, see [24], [25].
Statistical outcomes show that we are quite confident (99 percent) that, for random inputs, the true mean (i.e., average case performance) of a 32-bit DICLASP lies between 4.36 ns and 4.40 ns and the true mean of a 32-bit DIRCA between 6.16 ns and 6.26 ns.
The results from the random number inputs show that: First, DICLASP (DIRCA) is about 2.39 (6.34) times faster than its synchronous counterpart. Second, DICLASP is about 1.42 times faster than DIRCA. Third, the results clearly demonstrate the advantage of the average case performance of asynchronous circuits over the worst case performance of synchronous circuits. The results also contradict the report from Kinniment [26]. The report concludes that "asynchronous adders only give a performance improvement over more conventional hardware in very limited conditions," a conclusion that our results do not support. The major problem of Kinniment's report is that it compares the CCSA, an asynchronous version of a ripple-carry adder, with a tree-like conditional-sum adder. This comparison is unfair and misleading.
SPICE Simulation of Real Data
Based on Garside's work [3] , we collected statistical data by running a 32-bit ARM emulator. The average computation times of DIRCA and DICLASP based on dynamic traces were then calculated. Table 6 presents the statistical data obtained by simulating three benchmark programs, Dhrystone f1, Dhrystone f2, and Espresso dc2 on a 32-bit ARM emulator. All additions and subtractions performed in a 32-bit ARM ALU are analyzed and collected. The traces are categorized into four sets: Add/Subtract, Compare, Load/Store, and Branch.
A maximal carry propagation chain of inputs A and B is obtained by first XORing them, then finding the largest contiguous strings of 1s. The distribution of longest carry propagation chains is plotted in Fig. 14 . The accumulated percentage is also shown for the case f1 + f2 + dc2.
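The chain-length measurement described above can be sketched as follows. The function name and the default width of 32 bits are assumptions for illustration; the trace-collection tooling used for the experiments is not shown here.

```python
def longest_propagate_chain(a, b, width=32):
    """Length of the longest run of 1s in (a XOR b), restricted to `width` bits."""
    x = (a ^ b) & ((1 << width) - 1)
    longest = run = 0
    while x:
        if x & 1:
            run += 1                  # bit position propagates a carry
            longest = max(longest, run)
        else:
            run = 0                   # a kill or generate position ends the run
        x >>= 1
    return longest
```

A position where the operand bits differ is a carry-propagate position, so the longest run of 1s in the XOR is the longest chain over which a carry could ripple.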
The results show that: First, the distributions of Dhrystone f1 and f2 are similar and the distributions of Dhrystone and Espresso dc2 are different. This implies that different applications may have different distributions of propagation chains. Second, the distribution of f1 + f2 + dc2 is close to Espresso dc2 because it contributes 92 percent of the instructions executed. Third, from the dynamic trace of f1 + f2 + dc2, only 1.43 percent of instructions have the worst case behavior and 62.83 percent of instructions have maximal carry propagation chain less than or equal to 6. This implies that the speed-up of average case performance vs. worst case performance is significant.
TABLE 4 Results of SPICE Simulation of Various Adders

It is impractical to simulate all of the roughly 1.9 million cases in SPICE to calculate the average case performance of asynchronous adders. We adopt the following formula to compute the average computation time of asynchronous adders:

$$\mathrm{ACT} = \sum_{i} D_i \cdot P_i,$$

where ACT is the average computation time, $D_i$ is the average delay for the propagation chain with length $i$, and $P_i$ is the percentage of instructions with longest propagation chain of length $i$ in a dynamic trace. The average delay for the propagation chain with length $i$ is computed by averaging the delays of cases where the propagation chain is in different positions. For more detail, see [24].
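The weighted average described above can be sketched as a small helper. The function name and the numbers in the usage note are purely hypothetical placeholders, not measured values; the shares are normalized so that they may be given either as fractions or as percentages.

```python
def average_computation_time(delay_by_length, share_by_length):
    """Weighted mean of per-chain-length delays.

    delay_by_length[i]: average delay for longest propagation chain of length i.
    share_by_length[i]: share of instructions whose longest chain has length i.
    """
    total_share = sum(share_by_length.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive value")
    weighted = sum(delay_by_length[i] * share for i, share in share_by_length.items())
    return weighted / total_share
```

For instance, with made-up delays of 2.0, 3.0, and 4.0 ns for chain lengths 0, 1, and 2 and shares of 50, 25, and 25 percent, the helper returns 2.75 ns.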
The performance improvement of the DI adders vs. their synchronous counterparts based on real data is shown in Table 7 .
The results show that: First, the average computation times of the DIRCA and the DICLASP based on the dynamic traces of Dhrystone f1, Dhrystone f2, Espresso dc2, and f1 + f2 + dc2 are 9.90 ns and 5.36 ns, 10.03 ns and 5.40 ns, 9.59 ns and 5.24 ns, and 9.61 ns and 5.25 ns, respectively. Second, the average computation times of the DIRCA and the DICLASP based on the dynamic traces of Dhrystone f1 and f2 are longer than those based on Espresso dc2. This implies that the computation time depends on the nature of the application. Third, based on the dynamic traces f1, f2, dc2, and f1 + f2 + dc2, DICLASP is 1.85, 1.86, 1.83, and 1.83 times faster than DIRCA, respectively. Fourth, the 32-bit DICLASP and DIRCA are, on average, 1.99 and 4.10 times faster than their synchronous counterparts, respectively. The average addition time for real data is greater than for uniformly distributed random data. Nevertheless, our simulations show that our adder is substantially faster than adders operating in synchronous mode, which behave as though each computation entails a worst-case (n-length) carry chain.
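The DICLASP-over-DIRCA speed-up factors follow directly from the quoted average computation times, as the short check below illustrates. The dictionary and variable names are hypothetical; the numbers are the ones stated in the text.

```python
# (DIRCA ns, DICLASP ns) per dynamic trace, as quoted in the text.
times_ns = {
    "f1": (9.90, 5.36),
    "f2": (10.03, 5.40),
    "dc2": (9.59, 5.24),
    "f1+f2+dc2": (9.61, 5.25),
}

# Speed-up of DICLASP over DIRCA, rounded to two decimals.
speedups = {
    trace: round(dirca / diclasp, 2)
    for trace, (dirca, diclasp) in times_ns.items()
}
```

The resulting ratios match the 1.85, 1.86, 1.83, and 1.83 figures reported for the four traces.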
CONCLUSIONS
We have proposed a novel delay-insensitive carry-lookahead tree adder (i.e., DICLASP) in which the logic complexity is a linear function of n and the average computation time is proportional to the logarithm of the logarithm of n. To the best of our knowledge, our adder has better area-time efficiency than any other adder [8], [2], [12], [18].
The adder presented here is as robust as any other with respect to tolerance of delay variations, and no other adder has a lower order of computation time or a lower order of hardware complexity.
The SPICE simulation results show that: First, based on random inputs, our 32-bit DICLASP is 2.39 and 1.42 times faster than its synchronous counterpart and DIRCA, respectively. Second, based on statistical data gathered from a 32-bit ARM simulator, our 32-bit DICLASP is 1.99 and 1.83 times faster than its synchronous counterpart and DIRCA, respectively.
We also present an economical CMOS implementation of our delay-insensitive carry-lookahead tree adders. The proposed adders are suitable for VLSI implementation because of their regular structure. We believe this work can be applied in the design of high-speed asynchronous processors. With an interface between asynchronous and synchronous modules [27], DICLASP may be used to improve the performance of synchronous processors.

Prior to coming to Columbia, Professor Unger was a member of the technical staff at Bell Telephone Laboratories for just under five years. He supervised a software development group there for almost two years, after having been engaged in research on various problems in computer science. Professor Unger has been a summer and/or sabbatical leave employee of GE, IBM, RCA Laboratories, and Bell Laboratories, and a consultant for a number of companies. He was a visiting professor at the Danish Technical University in 1974-1975. He has published more than 40 technical papers and reports on topics including logic circuits, parallel processing, pattern recognition, and computer software, as well as the books Asynchronous Sequential Switching Circuits, The Essence of Logic Circuits, and Controlling Technology: Ethics and the Responsible Engineer. He holds one patent, is an IEEE fellow and an AAAS fellow, and was a Guggenheim fellow in 1967. Professor Unger has also been active in the field of technology and society, is a past member of the IEEE Board of Directors, and is a past chair of the IEEE Ethics Committee.
Michael Theobald is a PhD student in computer science at Columbia University. He received the Diplom degree in computer science from Johann Wolfgang Goethe-Universität, Frankfurt/Main, Germany, in 1994. His research interests include synchronous and asynchronous circuits, computer-aided digital design, logic synthesis, formal verification, efficient algorithms and data structures, and combinatorial optimization. He received the Honorable Mention Award at the 1997 International Conference on VLSI Design and was a Best Paper finalist at the 1998 IEEE Async Symposium. He is a student member of the IEEE.
