Addition of integers in binary representation is one of the basic operations of any digital processor. For adding two integers of $N$ bits each, the serial adder takes as many clock ticks. For achieving higher speeds, parallel circuits are discussed in the literature, and these circuits usually operate in two levels. At the lower level, integers represented by blocks of a smaller number of bits are added, and in a cascade of stages at the next level, the carries produced in previous addition operations are added to the augends. These circuits perform addition of $N$-bit integers in about $O(\log_2 N)$ clock ticks and $O(N \log_2 N)$ space. In this paper, we describe a fast method and an improvement of it. The first attempt resembles the merge sort algorithm, from which some important properties of the carries produced in each stage are analysed and assimilated, resulting in a parallel adder that runs in about $O(\log_2 N)$ clock ticks and $O(N \log_2 N)$ space. Then, the crucial insights are brought to fruition in an improved design, which takes 2 clock ticks to perform the addition operation, requiring only $O(N^2)$ space. The number of bits $N$ is usually chosen to be a positive integer power of 2. The speedup is achieved by special purpose circuits for increment operations by $2^i$, for $0 \le i \le N-1$, each operation taking only a single clock tick to complete. The usefulness of this adder for the multiplication operation is discussed. The standard multiplication method utilizes quantizer and 3-bit to 2-bit consolidation circuits to produce an integer that represents in binary the number of 1s in a column corresponding to a place (weighted coefficient) of a nonnegative integer power of 2. The last two consolidated integers are added by an adder in the end.
Introduction
Addition of integers represented in binary is a basic operation on most, if not all, modern digital processors. The sequential or serial circuit for adding two $N$-bit integers takes $N$ clock ticks. For parallelization of the addition operation, the main issue is to find an efficient method to deal with the carries produced by the addition of blocks of a smaller number of bits. For the various methods discussed in the literature, viz., ripple carry (carry propagate) adder, carry look-ahead adder, carry skip adder, Manchester chain adder, carry select adder, prefix adders, multioperand adder, carry save adder, pipelined parallel adder, etc., see [3]-[7]. These circuits can perform addition of $N$-bit integers in about $O(\log_2 N)$ clock ticks and $O(N \log_2 N)$ space (see [1, 2]).
In the next section, we present a $k$-stage cascade circuit, where $N = 2^k$, performing the addition operation in only $k$ clock ticks, requiring $k \cdot 2^{k-1} - 1$ space for the special purpose circuits for carry addition. The motivation of this work is to present a unified and simplified circuit that can achieve the same task as the circuits discussed in the literature. Moreover, some important insights are gained in the design of this circuit, in a first attempt, which are exploited for realizing an improved circuit that adds in constant time, i.e., in 2 time delays, while requiring only $O(N^2)$ space. Further improvements, including the application of the adder for fast multiplication of two integers represented in binary, are discussed towards the end of the article. The standard multiplication method utilizes quantizer and 3-bit to 2-bit consolidation circuits to produce an integer that represents in binary the number of 1s in a column corresponding to a place (weighted coefficient) of a nonnegative integer power of 2. The last two consolidated integers are added by an adder in the end.
Parallel Binary Adder
The steps involved in a parallel adder, resembling the merge sort algorithm, are described in the following algorithm:
First Attempt Parallel Adder Circuit
1. Let the number of bits in the integers be $N = 2^k$, for some positive integer $k$.
2. Let $a_{N-1} \cdots a_0$ and $b_{N-1} \cdots b_0$ be the input integers in the binary form, with the convention that the most significant bit is the leftmost (and the least significant bit the rightmost).
3. Initially, compute $2^{k-1}$ sums of two bits each, $s_{1,2i+1}\, s_{1,2i}$, and the corresponding carries $c_{1,i}$, such that the binary sequence $s_{1,2i+1}\, s_{1,2i}$ consists of the two less significant bits obtained by adding $a_{2i+1}\, a_{2i}$ and $b_{2i+1}\, b_{2i}$, with a carry bit $c_{1,i}$; this operation is performed separately by $2^{k-1}$ programmable logic arrays or associative memory units, which compute in parallel for each index $i$, where $0 \le i \le 2^{k-1} - 1$.
4. For $l = 1, 2, \ldots, k-1$, in steps of 1, in ascending order, once the $2^{k-l}$ sums of $2^l$ bits each, $s_{l,2^l(i+1)-1} \cdots s_{l,2^l i}$, together with the carries $c_{l,i}$, for $0 \le i \le 2^{k-l} - 1$, are available, the following increment operation is performed for $0 \le i \le 2^{k-l-1} - 1$: the integer represented by the binary sequence $c_{l,2i+1}\, s_{l,2^l(2i+2)-1} \cdots s_{l,2^l(2i+1)}$ is incremented by $c_{l,2i}$, to get the carry $c_{l+1,i}$ and the left half string of the sum, $s_{l+1,2^{l+1}(i+1)-1} \cdots s_{l+1,2^{l+1}i+2^l}$, and the right half string of the sum, $s_{l+1,2^{l+1}i+2^l-1} \cdots s_{l+1,2^{l+1}i}$, is taken to be the bit string $s_{l,2^l(2i+1)-1} \cdots s_{l,2^l(2i)}$; the left and right half bit strings are concatenated to get the binary string $s_{l+1,2^{l+1}(i+1)-1} \cdots s_{l+1,2^{l+1}i}$, representing the sum, with the corresponding carry $c_{l+1,i}$ as just obtained. This increment operation can be performed in a single clock tick by a special purpose circuit, which identifies the least index $j$, where $0 \le j \le 2^l$, such that all the less significant bits up to (but not including) index $j$ are 1 and the bit with index $j$ is 0, by means of $(2^l + 1)$ AND-gates, implemented by negated NOR-gates, and instantly complements the bits from index $j$ down to the least significant bit. If $c_{l,2i+1}$ is 0, then there is one such index $j$, and if $c_{l,2i+1}$ is 1, then it must have been produced in the previous, i.e., $l$-th, cascade stage, and therefore the integer represented by the binary sequence $s_{l,2^l(2i+2)-1} \cdots s_{l,2^l(2i+1)}$
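can be at most $2^{2^l} - 2$, as shown below, and hence there is such an index $j$ as just discussed, and the increment operation cannot produce a further carry.
For a cascade stage $m$ and a block index $i$, let $S_{m,i}$, $A_{m,i}$ and $B_{m,i}$ denote the integers represented by the binary sequences $s_{m,2^m(i+1)-1} \cdots s_{m,2^m i}$, $a_{2^m(i+1)-1} \cdots a_{2^m i}$ and $b_{2^m(i+1)-1} \cdots b_{2^m i}$, respectively. The assertion is that, for $1 \le m \le k$ and $0 \le i \le 2^{k-m} - 1$,
$$c_{m,i}\, 2^{2^m} + S_{m,i} = A_{m,i} + B_{m,i}. \tag{1}$$
Proof: The claim is true for $m = 1$, by the construction in Step 3. Now, it is assumed to be true for all cascade stages $m$ up to and including $l$, where $1 \le m \le l \le k-1$. Entering the for-loop in Step 4 at index $l$, it is required to show that the assertion in (1) holds true for $m = l+1$. By the inductive hypothesis, the following is assumed to hold true, for $0 \le i \le 2^{k-l-1} - 1$:
$$c_{l,2i}\, 2^{2^l} + S_{l,2i} = A_{l,2i} + B_{l,2i} \tag{2}$$
and
$$c_{l,2i+1}\, 2^{2^l} + S_{l,2i+1} = A_{l,2i+1} + B_{l,2i+1}. \tag{3}$$
Multiplying both sides of (3) by $2^{2^l}$, the following is obtained:
$$c_{l,2i+1}\, 2^{2^{l+1}} + S_{l,2i+1}\, 2^{2^l} = \left(A_{l,2i+1} + B_{l,2i+1}\right) 2^{2^l}, \tag{4}$$
and adding the corresponding sides of (4) and (2), the following is obtained:
$$c_{l,2i+1}\, 2^{2^{l+1}} + \left(S_{l,2i+1} + c_{l,2i}\right) 2^{2^l} + S_{l,2i} = A_{l+1,i} + B_{l+1,i}, \tag{5}$$
where the last term is the result of addition of the integers represented by the pair of binary sequences $a_{2^{l+1}(i+1)-1} \cdots a_{2^{l+1}i}$ and $b_{2^{l+1}(i+1)-1} \cdots b_{2^{l+1}i}$. Either of the summands on the right hand side of (3) is at most $2^{2^l} - 1$, and therefore their sum is at most $2^{2^l+1} - 2$, while the maximum integer that can be represented by the left hand side of (3) is $2^{2^l+1} - 1$, which means that the single bit $c_{l,2i}$ can be added to the left hand side of (3) without an overflow, for $0 \le i \le 2^{k-l-1} - 1$. Thus, writing $S^{\mathrm{left}}_{l+1,i}$ and $S^{\mathrm{right}}_{l+1,i}$ for the integers represented by the left and right half strings of the sum at stage $l+1$, the result of the carry increment in Step 4 gives, for $0 \le i \le 2^{k-l-1} - 1$,
$$c_{l+1,i}\, 2^{2^l} + S^{\mathrm{left}}_{l+1,i} = c_{l,2i+1}\, 2^{2^l} + S_{l,2i+1} + c_{l,2i} \tag{6}$$
and
$$S^{\mathrm{right}}_{l+1,i} = S_{l,2i}. \tag{7}$$
Now multiplying both sides of (6) by $2^{2^l}$ and adding the corresponding sides of (7) to the result, and using (5) together with $S_{l+1,i} = S^{\mathrm{left}}_{l+1,i}\, 2^{2^l} + S^{\mathrm{right}}_{l+1,i}$, the following is obtained, for $0 \le i \le 2^{k-l-1} - 1$:
$$c_{l+1,i}\, 2^{2^{l+1}} + S_{l+1,i} = A_{l+1,i} + B_{l+1,i}, \tag{8}$$
which proves the claim for $m = l+1$.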
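The cascade of Steps 3 and 4 can be sketched in software as follows. This is only a behavioural model under stated assumptions: the function names `increment_by_one` and `cascade_add` are hypothetical, bit strings are stored least significant bit first, and the loop in `increment_by_one` stands in for what the special purpose circuit is intended to do in a single clock tick.

```python
def increment_by_one(bits):
    """Add 1 to a little-endian bit list by complementing the trailing run
    of 1s together with the first 0 above it (the increment of Step 4)."""
    out = list(bits)
    for j, bit in enumerate(out):
        out[j] = bit ^ 1
        if bit == 0:          # first 0 reached: it became 1, carry absorbed
            return out
    raise OverflowError("no 0 bit available to absorb the carry")


def cascade_add(a, b, k):
    """Add two N-bit integers, N = 2**k, by the merge-style cascade."""
    n = 1 << k
    # Step 3: two-bit blocks, each stored as (sum bits, carry bit).
    blocks = []
    for i in range(n // 2):
        t = ((a >> (2 * i)) & 3) + ((b >> (2 * i)) & 3)
        blocks.append(([t & 1, (t >> 1) & 1], t >> 2))
    # Step 4: merge adjacent blocks, one cascade stage per iteration.
    for _ in range(1, k):
        merged = []
        for i in range(len(blocks) // 2):
            low_sum, low_carry = blocks[2 * i]
            high_sum, high_carry = blocks[2 * i + 1]
            high = high_sum + [high_carry]      # carry bit sits on top
            if low_carry:
                # By the claim proved above, this never overflows.
                high = increment_by_one(high)
            merged.append((low_sum + high[:-1], high[-1]))
        blocks = merged
    sum_bits, carry = blocks[0]
    return sum(bit << j for j, bit in enumerate(sum_bits)) + (carry << n)
```

For instance, `cascade_add(a, b, 6)` models a 64-bit addition in six stages; the `OverflowError` branch is never reached for valid inputs, which is the software counterpart of the statement that the increment cannot produce a further carry.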
Circuit Complexity: We estimate the number of special purpose AND-gates required for performing the carry addition operation in Step 4. For $1 \le l \le k-1$, there are $2^{k-l}$ sum sequences in the input at level $l$, and, of these, only the $2^{k-l-1}$ that constitute the more significant subsequence at level $(l+1)$ are required to be incremented. Each sequence that undergoes the increment operation needs $(2^l + 1)$ AND-gates. Thus the total number of special purpose AND-gates of this implementation is found as follows:
$$\sum_{l=1}^{k-1} 2^{k-l-1}\left(2^l + 1\right) = (k-1)\, 2^{k-1} + \left(2^{k-1} - 1\right) = k \cdot 2^{k-1} - 1.$$
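As a quick numerical check of this count, the per-level terms can be summed directly and compared with the closed form $k \cdot 2^{k-1} - 1$ quoted in the Introduction. The function names below are hypothetical and serve only this illustration.

```python
def and_gates_by_level(k):
    """Sum, over levels l = 1..k-1, of 2**(k-l-1) incremented sequences
    times (2**l + 1) AND-gates per sequence."""
    return sum(2 ** (k - l - 1) * (2 ** l + 1) for l in range(1, k))


def and_gates_closed_form(k):
    """Closed form k * 2**(k-1) - 1 for the same count."""
    return k * 2 ** (k - 1) - 1


if __name__ == "__main__":
    for k in range(1, 9):
        print(k, 2 ** k, and_gates_by_level(k), and_gates_closed_form(k))
```

For $k = 7$, i.e., 128-bit integers, both expressions give 447.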
The Usefulness of Special Purpose Circuits for Addition or Subtraction by $2^i$
A processor can be furnished with a special purpose circuit for incrementing an integer represented by an $N$-bit sequence by $2^i$, for $0 \le i \le N-1$. This operation is useful in the following contexts: taking the 2's complement, subtraction, increment of the instruction pointer, array index computations, memory address calculation, and as a special instruction, dedicated for this purpose, similar to the shift operation. The special purpose circuit is expected to take only one clock tick to perform the specified increment operation. Further, for adding an integer with very sparsely occupied 1-bits, the addition operation can be implemented by a sequence of such instructions. The subtraction operation by $2^i$, for $0 \le i \le N-1$, can be realized complementarily.
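A minimal software sketch of the increment and decrement by $2^i$ is given below, assuming the run-complement behaviour described for the increment circuit: the bits from position $i$ up to the first bit that stops the run are complemented. The names `flip_run`, `add_power_of_two` and `sub_power_of_two` are hypothetical, and the `while` loop is only a sequential stand-in for the detection that the hardware is expected to perform in one clock tick.

```python
def flip_run(x, i, n, run_bit):
    """Complement bits i..j of the n-bit word x, where j is the least index
    >= i whose bit differs from run_bit. Returns (new word, carry/borrow out)."""
    j = i
    while j < n and ((x >> j) & 1) == run_bit:
        j += 1
    top = min(j, n - 1)                   # if j == n the run leaves the word
    mask = ((1 << (top - i + 1)) - 1) << i
    return x ^ mask, int(j == n)


def add_power_of_two(x, i, n):
    """(x + 2**i) mod 2**n with carry-out: the run of 1s starting at bit i
    is cleared and the 0 above it is set."""
    return flip_run(x, i, n, run_bit=1)


def sub_power_of_two(x, i, n):
    """(x - 2**i) mod 2**n with borrow-out: the complementary operation,
    acting on the run of 0s starting at bit i."""
    return flip_run(x, i, n, run_bit=0)
```

An integer with few 1-bits can then be added by calling `add_power_of_two` once per 1-bit, as suggested in the text.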
Improved Parallel Adder Circuit
3. In the second step, the carries $c_i$, for $0 \le i \le N-1$, are added in parallel, without conflict, requiring $O(N^2)$ special purpose AND-gates.
Proof: Let $c_{i_1}, \ldots, c_{i_r}$, where $i_1 < i_2 < \cdots < i_r$, be the carries that are equal to 1. If $r = 1$, then $c_{i_1}$ is the only carry to be added, and this case is easily handled by the algorithm. Let $2 \le r \le N$. The main point in the proof is that the addition of a carry $c_{i_l}$ does not affect the addition of the carry $c_{i_{l+1}}$, for $1 \le l \le r-1$, as observed in the following. The bit $s_{i_{l+1}}$ must be 0, because $c_{i_{l+1}} = 1$ and $c_{i_{l+1}}\, s_{i_{l+1}}$, being the result of adding only two bits, $a_{i_{l+1}}$ and $b_{i_{l+1}}$, cannot be the bit string 11, for $1 \le l \le r-1$. Thus, there exists an index $j_l$, such that $i_l + 1 \le j_l \le i_{l+1}$ and $\text{SC AND}(i_l, j_l) = 1$, for $1 \le l \le r-1$. Now, since there are no carries equal to 1 with indexes between $i_l + 1$ and $i_{l+1} - 1$, inclusive of both, when $i_l + 2 \le i_{l+1}$, the complementation of the string $s_{j_l}\, s_{j_l - 1} \cdots s_{i_l + 1}$ is equivalent to adding 1 to the integer represented by it, without affecting the carry addition of $c_{i_{l+1}}$, for $1 \le l \le r-1$. The last carry $c_{i_r}$ is added as if it were the lone carry to be added.
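The non-interference argument can be illustrated with a small behavioural model. It assumes that the first step produces, for every position, the sum bit $s_i$ and the carry bit $c_i$ of $a_i + b_i$ (so that $s_i$ and $c_i$ are never both 1), and that each carry $c_i = 1$ complements the run $s_j \cdots s_{i+1}$ ending at the first 0 bit. The function name `improved_add` is hypothetical, and the search for $j$ is a sequential stand-in for the AND-gate detection.

```python
def improved_add(a, b, n):
    """Two-step addition of two n-bit integers; returns the (n+1)-bit sum."""
    s = a ^ b          # first step: per-position sum bits
    c = a & b          # first step: per-position carry bits
    flip = 0
    for i in range(n):                     # conceptually all i in parallel
        if (c >> i) & 1:
            # Least j >= i+1 with s_j = 0; j == n denotes the carry-out bit.
            j = i + 1
            while j < n and (s >> j) & 1:
                j += 1
            mask = ((1 << (j - i)) - 1) << (i + 1)   # bits i+1 .. j
            # The runs of distinct carries never overlap ("without conflict").
            assert flip & mask == 0
            flip |= mask
    # Every run is read from the original s, so one complement pass suffices.
    return s ^ flip
```

Because all runs are determined from the sum bits of the first step and are pairwise disjoint, the complementations can be applied simultaneously, which is what allows the second step to be a single time delay in the circuit.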
It may be observed that addition of two $(2N)$-bit integers takes only 3 time delays by means of two $N$-bit adders as just described. The two less significant and the two more significant $N$-bit halves are added separately, and if a carry is produced by the addition of the two less significant $N$-bit halves, then it is added to the sum of the two more significant $N$-bit halves, in just one time delay. The last step may require $N$ additional special purpose AND-gates, for the addition operation by 1, when the initialization at the leaf node is two bits at a time, by means of associative memory units. Thus, the total number of special purpose AND-gates could be about $2 \times \frac{N(N+1)}{4} + N = \frac{N(N+3)}{2}$, for addition of two $(2N)$-bit integers, in three time delays. The application to multiplication of two $N$-bit integers is discussed in the next section.
In the first attempt algorithm described in the previous section, we started at leaf node with sums of two bits of a's and b's each, at a time. If we assume a similar initialization to compute the sum s and carry c bits, we could reduce the space required by a factor of 2 for the special purpose AND-gates, in the algorithm just described in this section. Another possibility for reduction of the number of special purpose AND-gates, for the sake of economy, is to consider a two-level cascaded implementation. In the first cascade stage, about √ N blocks are taken for addition in parallel, each block consisting of again about √ N sum and carry bits. In this circuit design, the number of special purpose AND-gates in the first cascade stage would be about
. In the second cascade stage, there are about √ N carry bits to be added, which would require about
special purpose AND-gates, because the least significant $\sqrt{N}$ bits do not affect the carry addition to the more significant $(N - \sqrt{N} + 1)$ bits. Thus, the total number of special purpose AND-gates could be about $(N+1)\sqrt{N}$. Combined with the previous observation, i.e., starting with two bits of a's and b's to get the s and c bits in the initialization step, it is possible to realize a $(2N)$-bit integer adder performing the addition operation in three clock ticks, requiring about $(2N+1)\sqrt{N} - N/2$ special purpose AND-gates. The estimates are as follows:
1. there are $N$ carries to be added after the initialization step;
2. in the first cascade stage, there are $2\sqrt{N}$ sum bits and $\sqrt{N}$ carries in each block.
Thus, the total number of special purpose AND-gates in this construction would be about $2N\sqrt{N} + \sqrt{N} - N/2$, i.e., about 1000 for $N = 64$, as compared to $N(N+3)/2 = 2144$ required by the circuit without the space reduction of the two-stage cascaded implementation, both circuits taking only three clock ticks to add two $(2N)$-bit (i.e., two 128-bit) integers. On a 64-bit processor, a 128-bit integer adder is needed for the multiplication operation. The first attempt circuit of the previous section would need $(128 \times 7)/2 - 1 = 447$ special purpose AND-gates, performing the addition of two 128-bit integers in about 7 clock ticks, while a two-stage cascade circuit would need about 1000 special purpose AND-gates, to repeat, performing the addition of two 128-bit integers in 3 clock ticks. In Slide 83 of [8], it is stated that the Pentium processor performs the 32-bit integer addition in 11 gate delays.
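The three gate-count estimates quoted for 128-bit addition can be reproduced with a few lines of arithmetic. The sketch below takes the two-stage estimate to be $(2N+1)\sqrt{N} - N/2$ and assumes that $N$ is a perfect square; the function names are hypothetical.

```python
from math import isqrt


def first_attempt_gates(bits):
    """k * 2**(k-1) - 1 AND-gates for a word of bits = 2**k (k clock ticks)."""
    k = bits.bit_length() - 1
    return bits * k // 2 - 1


def single_stage_gates(n):
    """N(N+3)/2 AND-gates for a (2N)-bit adder without the two-stage cascade."""
    return n * (n + 3) // 2


def two_stage_gates(n):
    """(2N+1)*sqrt(N) - N/2 AND-gates for a (2N)-bit two-stage cascade adder."""
    root = isqrt(n)
    assert root * root == n, "N is assumed to be a perfect square"
    return (2 * n + 1) * root - n // 2


if __name__ == "__main__":
    n = 64  # two 128-bit operands
    print(first_attempt_gates(2 * n), single_stage_gates(n), two_stage_gates(n))
```

With $N = 64$ this prints 447, 2144 and 1000, matching the figures in the text: the two-stage cascade needs roughly twice the gates of the first attempt circuit but fewer than half of those of the single-stage design, at three clock ticks instead of seven.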
Multiplication of Two Integers in Binary Representation
The time delay of multiplication of two $N$-bit integers is determined mostly by the time delay of addition of $(2N)$-bit integers, requiring at least one $(2N)$-bit adder and consolidation circuits that reduce a larger number of integers to a smaller number of integers for addition, such that the sum of the integers, before and after consolidation, is the same. For each index $i$, a Cauchy sum of products is formed, which corresponds to the coefficient of $2^i$, for $0 \le i \le 2N-1$. Since there are at most $N$ products of two bits in each sum, they are added in $\log_2 N$ stages, to get $2N$ coefficients represented by at most $\lfloor \log_2 N \rfloor + 1$ bits each. Then, the bit-planes of the coefficients are rearranged, similar to rearranging the order of summation of a doubly indexed sum, into $\lfloor \log_2 N \rfloor + 1$ integers of at most $2N$ bits, with $(N+1)$ quantization levels, which can be classified by $(N+1)$ comparators (Chapter 7 of [9]).
The quantization intervals are recognized by two adjacent voltage levels. The voltages of the bits in a column corresponding to the same place of a nonnegative integer power of 2 are connected in series, to get the sum of voltages, which encodes the number of 1s in the column. If the bits are sensitive to current measurements, then they are added in parallel, to form the sum of currents. The common junction point is connected to the ground by an additional resistor. Thus, in any case, the sum of the voltages is measured at a particular junction point. The sum falls (after accounting for small errors and fluctuations) somewhere in the middle of exactly one quantization interval, which is recognized by the conjunction of the conditions that (i) the upper limit voltage is larger, and (ii) the lower limit voltage is smaller than the sum of the voltages in a column.
The conjunction of the two conditions is fed to a switching circuit (Chapter 8 of [9]), which switches on an associative memory entry containing the bit pattern that encodes the number of 1s in the column. Thus, the sum of $\nu \ge 3$ integers can be reduced to a sum of $\lfloor \log_2 \nu \rfloor + 1$ integers, in a constant number of (which may be two) clock ticks. However, when the number of integers to be added falls to a small number (such as below 6), the consolidation method described in Slide 45 of [8] may be faster than the quantizer circuit. The quantizer based consolidation method achieves higher speed when the number of integers to be consolidated is larger than a prescribed number, and as such may be qualified to be called optimal, owing to its constant time operational performance. The final two integers after the consolidation stages are added to get the product of the two integers given as input in the beginning.
The consolidation operation is illustrated for the 64-bit multiplication. Initially, there are 64 integers to be added, which are aligned properly, adjusting for the respective binary places. Two cases are described for comparison: one with only the 3-bit to 2-bit consolidation circuits described in Slide 45 of [8], and the other with quantizers for about two stages, followed by the 3-bit to 2-bit consolidation circuits described in Slide 45 of [8] in the remaining stages, until both reduce the sum of the initially given 64 integers to a sum of 2 integers, where the latter could be 128 bits long, unlike the inputs, which are at most 64 bits long.
The quantizer is assumed to take two clock ticks to produce the required integers, as follows: in the first clock tick, the lower and upper bounds of the interval of quantization are detected, consequently initiating the corresponding switching circuit, and in the second clock tick, the initiated switching circuit activates an associative memory unit, which places the contents in the appropriate places, taking care also of the binary places, positioning the resulting integers as in a staircase, for the next stage. The circuit initialization phase is sensitive to the leading or trailing edge of a switching (initiating) pulse, giving the pipeline or cascade effect, which is partly folded into (overlapped with) the duration of the switching pulse. The edges are not always sharp or crisp, and edge sensitivity is exploited for gaining speedup in cascading (during both feed-forward and feedback stages of) compound circuits. The settling time of the overall circuit is measured explicitly, by trying out its response to various pulses that arise in typical (empirical) situations.
(A) With only 3-bit to 2-bit consolidation. The numbers of integers to be consolidated in a sequence of stages, taking one clock tick per stage, are as follows: (1) 64 to 43 (with only 63 to 42 consolidation and one integer left out), (2) 43 to 29 (with only 42 to 28 consolidation and one integer left out), (3) 29 to 20 (with only 27 to 18 consolidation and two integers left out), (4) 20 to 14 (with only 18 to 12 consolidation and two integers left out), (5) 14 to 10 (with only 12 to 8 consolidation and two integers left out), (6) 10 to 7 (with only 9 to 6 consolidation and one integer left out), (7) 7 to 5 (with only 6 to 4 consolidation and one integer left out), (8) 5 to 4 (with only 3 to 2 consolidation and two integers left out), (9) 4 to 3 (with only 3 to 2 consolidation and one integer left out) and (10) 3 to 2 consolidation, taking 10 clock ticks to complete the task. The overall consolidation factor for consolidating 64 integers into 2 integers is 32, and with a consolidation factor of $3/2$ per stage, the lower bound for the number of stages is $\lceil \log_{3/2} 32 \rceil = \lceil 8.547\ldots \rceil = 9$. The overrun of the number of stages is caused by the nondivisibility of the number of integers to be consolidated by the integer 3 in some stages.
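The stage counts for 3-bit to 2-bit consolidation follow from a simple recurrence: out of $n$ integers, $\lfloor n/3 \rfloor$ triples are each reduced to two integers, and the remaining $n \bmod 3$ integers are carried over unchanged. The sketch below, with hypothetical function names, lists the sequence and the lower bound on the number of stages.

```python
from math import ceil, log


def three_to_two_schedule(count):
    """Numbers of integers remaining after each 3-to-2 consolidation stage."""
    stages = [count]
    while count > 2:
        count = 2 * (count // 3) + count % 3   # triples -> pairs, rest kept
        stages.append(count)
    return stages


def stage_lower_bound(count):
    """ceil(log_{3/2}(count / 2)): stages needed at an ideal factor of 3/2."""
    return ceil(log(count / 2, 1.5))


if __name__ == "__main__":
    schedule = three_to_two_schedule(64)
    print(schedule)                      # 64, 43, 29, 20, 14, 10, 7, 5, 4, 3, 2
    print(len(schedule) - 1, stage_lower_bound(64))
```

For 64 integers the recurrence takes 10 stages against a lower bound of 9, the extra stage being due to the integers left out whenever the count is not divisible by 3.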
It may be observed that, with required quantizers to add up 14 bits to produce 4-bit integers in binary representation, steps (5) through (8) can be replaced with a single quantizer step, which may take two clock ticks to perform this particular subtask, saving two clock ticks. As another opportunity, again with required quantizers to add up 7 bits to produce 3-bit integers in binary representation, for instance, steps (7) through (9) can be replaced with a single quantizer step, which may take two clock ticks to perform this particular subtask, but saving just one clock tick.
(B) With quantizers and 3-bit to 2-bit consolidation. The numbers of integers to be consolidated in a sequence of stages, taking one or two clock ticks per stage, depending on the particular stage, are as follows (each serial number marks the clock tick at which the stage ends): (2) 64 to 7 (with 63-bit to 6-bit consolidation based on quantizers, taking two clock ticks, and one integer left out), (4) 7 to 3 (with 7-bit to 3-bit consolidation based on quantizers, taking two clock ticks), and (5) 3 to 2 consolidation (with only 3-bit to 2-bit consolidation, taking one clock tick), taking 5 clock ticks to complete the task.
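The arithmetic effect of the schedule in case (B) can be modelled in software. In this sketch the analog quantizer is replaced by an explicit count of the 1s in each column, which is an assumption about its behaviour rather than a model of the circuit; the 3-bit to 2-bit stage is taken to be the usual carry-save step. All function names are hypothetical.

```python
def column_count_consolidate(values, width):
    """Replace `values` by floor(log2(len(values))) + 1 integers with the
    same sum: the count of 1s in each column is written out in binary and
    its bits are placed staircase-fashion in the output integers."""
    planes = len(values).bit_length()
    out = [0] * planes
    for col in range(width):
        ones = sum((v >> col) & 1 for v in values)   # quantizer stand-in
        for p in range(planes):
            if (ones >> p) & 1:
                out[p] |= 1 << (col + p)
    return out


def three_to_two(x, y, z):
    """3-bit to 2-bit consolidation of every column (carry-save step)."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def multiply_case_b(a, b):
    """64-bit by 64-bit product using the stage sequence 64 -> 7 -> 3 -> 2."""
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(64)]
    width = 128 + 8                                       # generous margin
    rows = column_count_consolidate(rows[:63], width) + rows[63:]  # 64 -> 7
    rows = column_count_consolidate(rows, width)                   # 7 -> 3
    x, y = three_to_two(*rows)                                     # 3 -> 2
    return x + y                                 # final 128-bit addition
```

Each stage preserves the sum of the integers it is given, so `multiply_case_b(a, b)` agrees with `a * b` for 64-bit operands; the final `x + y` is where the 128-bit adder of the previous section would be used.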
For the overall time needed, in case (B), the 5 clock ticks for consolidation of 64 to 2 integers of at most 128 bits each are added to about 3 clock ticks for the addition of the two 128-bit integers, to get the final result of multiplication of the two input 64-bit integers in about 8 clock ticks. In case (A), the about 9 clock ticks obtained from the theoretical lower bound for consolidation of 64 to 2 integers of at most 128 bits each are added to about 15 clock ticks for the addition of the two 128-bit integers, to get the final result in about 24 clock ticks. Thus, the speedup factor is at least $24/8 = 3$.
In the following discussion, the circuit complexity for the two cases discussed above is estimated. The initial 64 integers of 64 bits each are arranged in a parallelogram staircase, in the standard presentation. They can be rearranged to form a nabla ($\nabla$) or Delta ($\Delta$) shape, starting with a 127-bit integer in the first row, followed by a 125-bit integer in the second row and so on, until a 1-bit integer in the last (64-th) row. In the first stage, since 64 itself is not divisible by 3, there are 63 rows to be consolidated, with 121 3-bit to 2-bit consolidation circuits required in the second row, followed by 115 3-bit to 2-bit consolidation circuits required in the fifth row, and so on, until one 3-bit to 2-bit consolidation circuit in the 62-nd row, skipping two rows in between, with 6 fewer circuits at each successive step. These consolidation circuits must operate in parallel in the first stage at least. This number can also be arrived at by observing that 21 rows of 3-bit to 2-bit consolidation circuits are required to consolidate 63 rows to 42 rows in the first step. Thus, there are $\sum_{i=0}^{20} (6i + 1) = 1 + 7 + \cdots + 121 = 21 \times 61 = 1281$ 3-bit to 2-bit circuits (associative memory units) required, in case (A), each circuit containing 8 entries of 2-bit associative memory.
Now, in case (B), in addition to the 128 3-bit to 2-bit consolidation circuits in the final consolidation stage, the number of 63-bit to 6-bit quantizers needed is about 128, with possible reuse in the second stage, and if no reuse is possible, another 128 7-bit to 3-bit quantizers are needed in the second consolidation stage. For comparison, 128 63-bit to 6-bit quantizers hold $64 \times 128 = 8192$ associative memory entries of 6 bits each, while $1281 - 128 = 1153$ 3-bit to 2-bit consolidation circuits hold $1153 \times 8 = 9224$ associative memory entries of 2 bits each. If reuse of the quantizers in the second stage is possible, the associative memory space requirement in case (B) is less than 3 times that in case (A), with a speedup factor of at least 3. It may be observed that the well-known Amdahl's law for the speedup bound is applicable to the same programs or circuits, when executed in parallel by replication of resources. An interesting situation is when different tasks together require some resources in total, which can be allocated to them to execute in parallel, without requiring any additional resources. Quantizers are better known from analog-to-digital converters (ADCs). However, the inputs to the quantizers in this section take only finitely many discrete values, and the required precision for the lower and upper bounds of the interval of quantization for the sum offers considerable tolerance for small errors and fluctuations in the current or voltage measurements taken at the input.
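The counts used in this comparison can be tallied as below. The comparison in bits is one possible reading of "associative memory space", namely the number of entries times the entry width, with the quantizers reused in the second stage; this reading, and all names in the sketch, are assumptions made for illustration.

```python
def case_a_circuits():
    """3-bit to 2-bit circuits in the first stage of case (A)."""
    return sum(6 * i + 1 for i in range(21))


def memory_bits_case_a():
    """Each 3-bit to 2-bit circuit holds 8 entries of 2 bits."""
    return case_a_circuits() * 8 * 2


def memory_bits_case_b():
    """128 quantizers with 64 entries of 6 bits each (reused in the second
    stage), plus 128 final-stage 3-bit to 2-bit circuits."""
    return 128 * 64 * 6 + 128 * 8 * 2


def clock_ticks():
    """(case A, case B) totals: consolidation plus final 128-bit addition."""
    return 9 + 15, 5 + 3


if __name__ == "__main__":
    ticks_a, ticks_b = clock_ticks()
    print(case_a_circuits(), (case_a_circuits() - 128) * 8, 64 * 128)
    print(memory_bits_case_b() / memory_bits_case_a(), ticks_a / ticks_b)
```

Under these assumptions the memory ratio of case (B) to case (A) comes out at about 2.5, below the factor of 3 stated in the text, for a speedup factor of 3.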
