Abstract-An efficient parallel adder under left-to-right input arrival is proposed. Making full use of the delay of the input arrival, it produces the sum within a small constant delay after the arrival of the final bits. Its amount of hardware is proportional to the operand length. It can be applied to the quotient conversion in an array divider.
INTRODUCTION
ADDITION is one of the most fundamental and important operations in digital systems. The design of efficient adders has been of great practical and theoretical interest for many years. Parallel adders, i.e., combinational circuits performing addition, are usually designed under the assumption that all input bits are given at the same time. In several practical applications, however, this assumption does not hold: the input bits arrive at different times.
In this paper, we consider parallel addition under left-to-right input arrival. Namely, the augend and the addend are given from the most significant bits to the least significant bits successively. We encounter addition (subtraction) under this condition in, for example, the conversion of the quotient in an unfolded implementation of shift-and-subtract type division such as SRT division, i.e., in an array divider [6], in the upper part of the final addition in tree-type multipliers [8], and in left-to-right multiplication [3].
When we use an ordinary adder, the delay, i.e., the computation time after the arrival of the final input bits, is at least $O(\log n)$, where $n$ is the operand length. Although a carry select adder (CSLA) [1] with variable blocks is shown to be efficient under this input arrival condition, its delay is still $O(\log n)$ [8]. An unfolded implementation of the adder based on the on-the-fly conversion algorithm (OTFA) [3], [5] produces the sum immediately after the arrival of the least significant bits. Its delay is a small constant independent of $n$. However, it requires a large amount of hardware, i.e., $O(n^2)$. The amount can be reduced to $O(n \log n)$ when the processing time of each step in it is shorter than the interval between the input arrivals [7].
In this paper, we propose a high-speed reduced-size adder under left-to-right input arrival. It is a two-level CSLA, and consists of two blocks. The lower block is a small OTFA and the upper block is again a CSLA consisting of several subblocks. We determine the number of bits received by the OTFA and by each subblock in the upper block so that the adder makes full use of the delay of the input arrival. Its delay is a small constant independent of $n$, slightly larger than that of an OTFA. Its amount of hardware is $O(n)$. It is not only efficient in practical use, but also interesting in theory: the adder is asymptotically optimal in both delay and size simultaneously.
The next section is an introductory section, where we define the problem and explain the OTFA. A new efficient adder is proposed in Section 3, and its implementation is considered in Section 4.
PRELIMINARIES

Addition Under Left-to-Right Input Arrival
We consider addition of two $n$-bit unsigned binary integers under left-to-right input arrival. We let the augend, the addend, and the sum be $X = [x_{n-1} \ldots x_0]$ ($x_i \in \{0, 1\}$), $Y = [y_{n-1} \ldots y_0]$ ($y_i \in \{0, 1\}$), and $S = [s_n \ldots s_0]$ ($s_i \in \{0, 1\}$), respectively. The augend $X$ and the addend $Y$ are given from the most significant bits to the least significant bits successively. Namely, the most significant bit pair $\langle x_{n-1}, y_{n-1} \rangle$ arrives first and the remaining bit pairs arrive successively.
We are concerned with an implementation of an adder as a combinational circuit composed of logic gates with restricted fan-in. We measure its complexity by its delay and size. The size is the number of logic gates in it. The delay is the computation time after the arrival of the final input bits, and is defined as follows. Let $TA_i$ denote the arrival time of the $i$th bit pair $\langle x_i, y_i \rangle$. Let $TC_i$ denote the computation time of a partial circuit which receives bit pair $\langle x_i, y_i \rangle$. The delay of an adder is the maximum value of $(TA_i + TC_i - TA_0)$ over all $i$. We assume that $TC_i$ is proportional to the number of logic gates on the longest path from inputs $x_i$ and $y_i$ to the outputs $s_j$ ($0 \le j \le n$).
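The delay definition above can be evaluated directly once arrival and computation times are known. The following is a minimal sketch, not part of the original design; the function name `adder_delay` and the numeric values are hypothetical and chosen only for illustration.

```python
def adder_delay(arrival, compute):
    """Return max over i of (TA_i + TC_i - TA_0).

    arrival[i] is TA_i, the arrival time of bit pair <x_i, y_i>.
    compute[i] is TC_i, the computation time of the partial circuit fed by it.
    Index 0 is the least significant pair, which arrives last.
    """
    if not arrival or len(arrival) != len(compute):
        raise ValueError("arrival and compute must be non-empty and equally long")
    ta0 = arrival[0]
    return max(ta + tc - ta0 for ta, tc in zip(arrival, compute))


# Hypothetical 4-bit example: pairs arrive one time unit apart, MSB first.
arrival = [3, 2, 1, 0]   # TA_0 = 3 (last to arrive), TA_3 = 0 (first)
compute = [2, 3, 4, 5]   # made-up TC_i values
delay = adder_delay(arrival, compute)
```

In this made-up case every term TA_i + TC_i - TA_0 equals 2, so the early-arriving bits hide their longer computation entirely behind the input arrival.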
On-the-Fly Adder
Under left-to-right input arrival condition, an unfolded implementation of an adder based on the on-the-fly conversion algorithm (OTFA) [3] , [5] produces the sum immediately after the arrival of the least significant bits.
When inputs are given from left to right, the sum $S$ is computed by the recurrence $S[t] := 2S[t-1] + (x_{n-t} + y_{n-t})$, where $S[0] = 0$ and $S = S[n]$. When $x_{n-t} + y_{n-t} = 2$, a carry is generated. It propagates up to the least significant bit position of $S[t-1]$ that has the value 0.
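The recurrence and its carry behaviour can be checked with a short software model. This is a minimal sketch under the assumption that operands are given as lists of bits, most significant first; the function names are hypothetical, and the code models only the arithmetic, not the gate-level OTFA.

```python
def add_left_to_right(x_bits, y_bits):
    """Evaluate S[t] = 2*S[t-1] + (x_{n-t} + y_{n-t}) with S[0] = 0."""
    s = 0
    for x, y in zip(x_bits, y_bits):
        s = 2 * s + (x + y)
    return s


def add_left_to_right_bits(x_bits, y_bits):
    """Same recurrence, keeping S[t] as a list of binary digits (MSB first)."""
    digits = [0]  # S[0] = 0; the leading digit will hold the carry-out s_n
    for x, y in zip(x_bits, y_bits):
        d = x + y
        if d == 2:
            # A carry is generated: it propagates through the trailing 1s of
            # S[t-1] and stops at the least significant position holding 0.
            i = len(digits) - 1
            while digits[i] == 1:
                digits[i] = 0
                i -= 1
            digits[i] = 1
            d = 0
        digits.append(d)  # new least significant digit of S[t]
    return digits


total = add_left_to_right([1, 0, 1, 1], [0, 1, 1, 1])        # 11 + 7
bits = add_left_to_right_bits([1, 0, 1, 1], [0, 1, 1, 1])
```

The loop in the digit version always finds a 0, because $S[t-1]$ is at most $2^t - 2$ and therefore never consists of ones only.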
Fig. 1. Addition based on the on-the-fly conversion algorithm [4].
Fig. 1 shows an OTFA [3], [5]. An A-module adds $x_{n-t}$ and $y_{n-t}$, and produces the sum $A_{n-t} = x_{n-t} \oplus y_{n-t}$. A D-module controls the propagation, i.e., decides whether to increment the sum $A_i$ at the corresponding bit position. Its output is u, s, or g, meaning "undecided," "decided: stop (do not increment)," and "decided: generate (increment)," respectively.
Thus, the delay of an OTFA is a small constant independent of $n$, i.e., $O(1)$. Since an n-bit OTFA has
AN EFFICIENT ADDER UNDER LEFT-TO-RIGHT INPUT ARRIVAL
We propose an efficient adder under left-to-right input arrival. Its delay is constant independent of $n$ and its size is linearly proportional to $n$. We call it the CDLS (constant delay, linear size) adder. The CDLS adder is a two-level carry select adder (CSLA), and consists of two blocks, $B_{\mathrm{I}}$ and $B_{\mathrm{II}}$. The upper block $B_{\mathrm{I}}$ is again a CSLA and the lower block $B_{\mathrm{II}}$ is a small on-the-fly adder (OTFA). More precisely, $B_{\mathrm{I}}$ is a combination of two CSLAs sharing ripple carry adders. Fig. 2 illustrates a block diagram of the CDLS adder, where RCA denotes a ripple carry adder and SEL denotes a selector. First, $B_{\mathrm{I}}$ receives almost all input bits and computes two conditional subsums: one expecting no carry from $B_{\mathrm{II}}$ and the other expecting a carry. While the subsums are computed in $B_{\mathrm{I}}$, the remaining few bits are received and added in $B_{\mathrm{II}}$ by means of an OTFA. When all the bits are given and the on-the-fly addition in $B_{\mathrm{II}}$ is completed, all we have to do is to select the correct subsum in $B_{\mathrm{I}}$.
$B_{\mathrm{I}}$ consists of $d$ subblocks, $B_d, \ldots, B_1$, and a final selector. Each subblock includes two ripple carry adders and two selectors.
We determine the number of bits received by $B_{\mathrm{I}}$ (subblocks $B_d, \ldots, B_1$) and $B_{\mathrm{II}}$ after we describe the scheme of the CDLS adder. The scheme is as follows:
[Scheme]
Step 1: Each subblock $B_j$ ($j = d, \ldots, 1$) in $B_{\mathrm{I}}$ receives input bit pairs $\langle x_{m(j)}, y_{m(j)} \rangle, \ldots, \langle x_{l(j)}, y_{l(j)} \rangle$, and produces two conditional subsums, one expecting no carry and one expecting a carry from subblock $B_{j-1}$, by two parallel ripple carry additions.
Step 2: Selections are performed in all subblocks to produce the two conditional subsums for the whole of $B_{\mathrm{I}}$. Namely, the appropriate subsums of subblocks are selected successively from the least significant subblock $B_1$ to the most significant subblock $B_d$.
Step 3: $B_{\mathrm{II}}$ receives input bit pairs $\langle x_{l(1)-1}, y_{l(1)-1} \rangle, \ldots, \langle x_0, y_0 \rangle$, and produces the subsum for it and a carry to $B_{\mathrm{I}}$ by means of an OTFA.
Step 4: The correct subsum in $B_{\mathrm{I}}$ is selected by the final selector according to the carry from $B_{\mathrm{II}}$.
In order to minimize the delay, we let the selections in Step 2 be completed before the on-the-fly addition is completed in $B_{\mathrm{II}}$. Thus, the delay of the CDLS adder is the sum of the delay of the on-the-fly addition in Step 3 and the time for the final selection in Step 4.
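The four steps of the scheme can be illustrated with an untimed functional model. This is a sketch under stated assumptions, not the circuit itself: the function name `cdls_add` and its parameters are hypothetical, the block sizes are passed in rather than derived from timing, and the lower block is modeled by ordinary integer addition instead of an OTFA.

```python
def cdls_add(x, y, n, sub_sizes, b_low):
    """Two-level carry select addition of two n-bit unsigned integers.

    sub_sizes = [b_1, ..., b_d]: bit widths of the upper subblocks, least
    significant subblock first. b_low: bit width of the lower block.
    """
    if sum(sub_sizes) + b_low != n:
        raise ValueError("block sizes must add up to n")

    # Step 1: each subblock computes two conditional subsums,
    # one for carry-in 0 and one for carry-in 1.
    cond = []
    pos = b_low
    for b in sub_sizes:
        mask = (1 << b) - 1
        xs, ys = (x >> pos) & mask, (y >> pos) & mask
        cond.append((pos, b, xs + ys, xs + ys + 1))
        pos += b

    # Step 2: chain the selections from the least significant subblock
    # upward, once for each possible carry from the lower block.
    upper = []
    for carry_in in (0, 1):
        total, carry = 0, carry_in
        for pos, b, s0, s1 in cond:
            s = s1 if carry else s0
            total |= (s & ((1 << b) - 1)) << pos
            carry = s >> b
        total |= carry << n  # s_n, the carry out of the top subblock
        upper.append(total)

    # Step 3: the lower block adds the last-arriving bits
    # (plain integer addition stands in for the OTFA here).
    low_mask = (1 << b_low) - 1
    low = (x & low_mask) + (y & low_mask)
    carry_low = low >> b_low

    # Step 4: the final selector picks the upper subsum matching the carry.
    return upper[carry_low] | (low & low_mask)


result = cdls_add(0b101101, 0b011011, 6, [1, 2], 3)  # 45 + 27
```

Only Step 4 depends on the last-arriving bits, which is why the model separates it from the work that can be done while inputs are still arriving.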
The size of $B_{\mathrm{I}}$ is proportional to the number of its input bits, and the size of $B_{\mathrm{II}}$ is proportional to the square of the number of its input bits. In order to minimize the size, we should minimize the number of the input bits in $B_{\mathrm{II}}$. Thus, the selections in all subblocks in $B_{\mathrm{I}}$ should be completed just before the on-the-fly addition in $B_{\mathrm{II}}$ is completed.
Now, we determine the parameters $d$, $m(j)$, and $l(j)$ for $n$-bit addition. We assume, for convenience, that the input bit pairs arrive successively with a constant interval $\Delta_{in}$. This assumption holds in the conversion of the quotient in an array divider, and in left-to-right multiplication. In the upper part of the final addition in tree-type multipliers, the interval between the input arrivals can be considered to be nearly constant, as shown in Fig. 10 in [8].
We denote the computation time for carry generation at the least significant bit position of each subblock and that for 1-bit carry propagation in Step 1 as $\Delta_{cg}$ and $\Delta_{cp}$, respectively. We also denote the computation time for selection as $\Delta_{sel}$. $\Delta_{cg}$, $\Delta_{cp}$, and $\Delta_{sel}$ are constant values.
Table 1 (1)
In order to minimize the delay, we fix $b_1$ to 1 so that subblock $B_1$ produces conditional subsums as soon as possible. As mentioned before, the addition in subblock $B_j$ ($j \ge 2$) should be performed during the input arrival for the remaining subblocks $B_{j-1}, \ldots, B_1$ and the selections in the subblocks. Thus, $b_j$ is an integer bounded by the following inequality:
In order to minimize the delay and the size, the selections in all subblocks should be completed just before the on-the-fly addition in $B_{\mathrm{II}}$ is completed. Thus, $b_{\mathrm{II}}$ should be the minimum integer satisfying
where $\Delta_{OTFA}$ is the delay of the OTFA.
The size of the CDLS adder is the sum of the number of logic gates in the ripple carry adders and selectors required in $B_{\mathrm{I}}$ and that in the OTFA required in $B_{\mathrm{II}}$.
From (1) and (4)
Table 2 shows a comparison among the CDLS adder, the OTFA, the CSLA, and the CSLA with carry look-ahead mechanism (CSLA+CLA), regarding the delay and the size. The CSLA can be constructed in the same way as $B_{\mathrm{I}}$ in the CDLS adder. CSLA+CLA employs the carry look-ahead mechanism for the selection. As shown in the table, the delay of the CDLS adder is constant, like that of the OTFA, in contrast to that of the CSLA, which is proportional to $\log n$, and that of the CSLA+CLA, which is proportional to $\log \log n$. The size of the CDLS adder is proportional to $n$, like those of the CSLA and the CSLA+CLA, in contrast to that of the OTFA, which is proportional to $n^2$.
The number of gates required in $B_{\mathrm{II}}$ is reduced to $O((\log \log n)^2)$ from $O(\log^2 n)$. Although $O(\log^2 n)$ gates are saved, the size of the whole adder is still $O(n)$. The delay is unchanged.
IMPLEMENTATION
Logic Design
We show a logic design of the CDLS adder. The hardware reduction scheme for carry select adders [9] can be effectively applied to the logic design of $B_{\mathrm{I}}$. We need not compute the conditional subsums or carries expecting a carry from the lower subblock in Step 1 or Step 2. Thus, each subblock needs only one RCA instead of two.
where
0 is the subsum bit computed in Step 1. Note that b 1 = 1.
For $l(1) \le i < n$, each final sum bit $s_i$ in subblock $B_j$ is computed as:
where $C_{\mathrm{II}}$ denotes the carry from $B_{\mathrm{II}}$ produced in Step 3. P j−1 [2] is a condition for propagating $C_{\mathrm{II}}$ to subblock $B_j$. Namely,
$s_n$ is the carry from subblock $B_d$. Namely,
Fig. 4 shows a logic design of $B_j$. 2-input "AND," 2-input "OR," and 2-input "XOR" gates are used. FA means a full adder, which consists of two "AND," one "OR," and two "XOR" gates. It produces the condition for carry propagation, i.e., $p_i$, as well as the sum and the carry. HA means a half adder, which consists of an "AND" and an "XOR" gate. Subblock $B_j$ ($2 \le j \le d$) requires $(6b_j - 2)$ "AND," $b_j$ "OR," and $(4b_j - 1)$ "XOR" gates. Subblock $B_1$ requires only one half adder and an "XOR" gate.
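The per-subblock gate counts stated above can be tallied mechanically. The sketch below is an illustration only: the function names are hypothetical, the block sizes [1, 2, 4] are made up rather than taken from Table 3, and the totals cover the subblocks alone, excluding the final selector and the lower block.

```python
def subblock_gates(j, b_j):
    """Gate counts of subblock B_j with b_j bits, as stated in the text."""
    if j == 1:
        # One half adder (1 AND + 1 XOR) plus one extra XOR gate.
        return {"AND": 1, "OR": 0, "XOR": 2}
    return {"AND": 6 * b_j - 2, "OR": b_j, "XOR": 4 * b_j - 1}


def upper_block_gates(sub_sizes):
    """Sum the gate counts over subblocks B_1, ..., B_d (sizes given in that order)."""
    total = {"AND": 0, "OR": 0, "XOR": 0}
    for j, b in enumerate(sub_sizes, start=1):
        for kind, count in subblock_gates(j, b).items():
            total[kind] += count
    return total


# Hypothetical block sizes b_1 = 1, b_2 = 2, b_3 = 4.
gates = upper_block_gates([1, 2, 4])
```

Because each count is linear in $b_j$, the total over all subblocks grows linearly with the number of bits handled by the upper block.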
In the logic design of $B_{\mathrm{II}}$, i.e., an OTFA (recall Section 2.2), we use the "group carry propagate" signal $\gamma_i[t]$ and the "group carry generate" signal
$(x_{n-t} \oplus y_{n-t})$ and $(x_{n-t} \cdot y_{n-t})$ are produced by an A-module, which consists of an "AND" gate and an "XOR" gate. A D-module computes (7), and consists of two "AND" gates and an "OR" gate. A G-
Table 3 shows the values of $b_j$ derived from (2), where $\tau$ denotes $\Delta_{in}/\Delta_{cp}$. Table 4 shows $b_{\mathrm{II}}$ for each $d$, derived from (3). …, we can determine the range of $n$ for each $d$, as shown in Table 5. Therefore, when $n$ and $\tau$ are given, we can
We can further reduce the amount of hardware of $B_{\mathrm{I}}$. From (5) and (6),
Although this reduction causes the additional delay of an "AND" gate and an "XOR" gate, the amount of hardware is significantly reduced. We call this reduced-size CDLS adder an R-CDLS adder.
Evaluation
The delay of the CDLS adder is the sum of the delay of the OTFA in $B_{\mathrm{II}}$ and the delay of the final selector in $B_{\mathrm{I}}$. The delay of the OTFA in $B_{\mathrm{II}}$ is the computation time for an A-module and a G-module. It is $2\Delta_{AND} + \Delta_{OR} + \Delta_{XOR}$, where $\Delta_{AND}$, $\Delta_{OR}$, and $\Delta_{XOR}$ are the delays of an "AND," an "OR," and an "XOR" gate, respectively. The delay of the final selector in $B_{\mathrm{I}}$ is $\Delta_{AND} + \Delta_{XOR}$. Thus, the delay of the CDLS adder is $3\Delta_{AND} + \Delta_{OR} + 2\Delta_{XOR}$. Note that this is independent of $n$.
Now, we evaluate the size of a 64-bit CDLS adder, assuming $\tau$ ($= \Delta_{in}/\Delta_{cp}$) is fixed to 1, as an example. As we can see from Tables 3, 4,
Table 6 shows a gate-level comparison among the CDLS adder, the R-CDLS adder, the OTFA, the CSLA, and the CSLA+CLA regarding the delay and the size when $n = 64$ and $\tau = 1$. Only 2-input "AND," 2-input "OR," and 2-input "XOR" gates are used.
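The delay expression stated above is a sum of two fixed gate paths, which a few lines of code make explicit. This is a minimal sketch; the function name `cdls_delay` is hypothetical, and the unit gate delays in the example are an assumption for illustration, not measured values.

```python
def cdls_delay(d_and, d_or, d_xor):
    """Delay of the CDLS adder from the individual gate delays."""
    otfa = 2 * d_and + d_or + d_xor   # A-module followed by G-module
    final_selector = d_and + d_xor    # final selection in the upper block
    return otfa + final_selector      # 3*d_and + d_or + 2*d_xor


# Assumed unit delays for every gate type.
unit_delay = cdls_delay(1, 1, 1)
```

The operand length $n$ does not appear anywhere in the computation, matching the statement that the delay is independent of $n$.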
CONCLUDING REMARKS
We have proposed a high-speed reduced-size adder under left-to-right input arrival. Making full use of the delay of the input arrival, it produces the sum within a small constant delay after the arrival of the final input bits. Its amount of hardware is proportional to the operand length.
Although we have mainly considered the case in which the interval between the input arrivals is constant, we can apply the scheme to other cases by adjusting the block size $b_j$ so that the addition in $B_j$ is completed just before the selections in $B_1, \ldots, B_{j-1}$ are completed.
The proposed adder is not only efficient for various practical applications, but also interesting in theory. It is asymptotically optimal in both delay and size simultaneously.
