Abstract-An implementation of a fast and flexible residue decoder for residue number system (RNS)-based architectures is proposed. The decoder is based on the Chinese Remainder Theorem (CRT). It decodes a set of residues to its equivalent representation in the weighted binary number system. The decoder is flexible since the decoded data can be selected to be either an unsigned magnitude or a 2's complement binary number. Two different architectures are analyzed: the first is based on carry-save adders (CSA's), while the other is based on modulo adders (MA's). The implementation of both architectures is modular and is based on simple cells, which leads to efficient VLSI realization. The proposed decoder is fast; it has a time complexity of $O(\log N)$, where $N$ is the number of moduli.
I. INTRODUCTION
RECENTLY, RNS has received increased attention due to its ability to support high-speed concurrent arithmetic [1]-[3]. Applications such as the fast Fourier transform, digital filtering, and image processing exploit the efficiency of RNS arithmetic in addition and multiplication; they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of digital signal processors [1], [4].
Since special purpose processors are associated with general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important, and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the high-speed RNS operations, the residue-to-binary conversion can be a bottleneck. The Chinese Remainder Theorem (CRT) [5], [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [7]-[15]. The residue decoders in [7] and [8] are based on three moduli of the form $(2^n - 1, 2^n, 2^n + 1)$, where $n$ is the number of bits. Because of the restriction on the number and choice of moduli, they are limited in application. In [9], a scheme of $O(\log NP)$ (where $P$ is the number of bits and $N$ is the number of moduli) is used to support only unsigned magnitude binary numbers. In [10], the residue decoder is based on the base extension technique; it uses only modular look-up tables in its implementation.
Manuscript received October 15, 1990; revised December 17, 1991.
Since look-up tables are used, the moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. Although look-up tables are used in this scheme, its time complexity is $O(N^2)$. The implementation in [11] requires that one of the moduli be a power of two; therefore, it may be limited in application. In [12], the proposed residue decoders are based on biased addition and take advantage of the fast addition speed of the CSA [16], but the conversion output is not in 2's complement form. In [13] and [14], the scheme used has a time complexity of $O((\log N)^2)$. In [15], the mixed-radix conversion algorithm is used with a time complexity of $O(N)$.
In this paper, an $O(\log N)$ residue decoder capable of decoding a set of residues to its equivalent representation in the unsigned magnitude or 2's complement binary number system is introduced. Two different architectures, one using CSA's based on [17] and one using modulo adders (MA's) [18], are implemented. In the following section, the RNS theory is reviewed. Section III discusses how this fast and flexible residue decoder can be implemented. Section IV evaluates the speed performance of this residue decoder.
II. RESIDUE NUMBER SYSTEM
In RNS, an integer, $X$, can be represented by an $N$-tuple of residue digits, $X = (r_1, r_2, \ldots, r_N)$, where $r_i = |X|_{m_i}$, with respect to a set of $N$ moduli $\{m_1, m_2, \ldots, m_N\}$. In order to have a unique residue representation, the moduli must be pairwise relatively prime; that is,
$$\mathrm{GCD}(m_i, m_j) = 1 \quad \text{for } i \neq j.$$
Then it can be shown that there is a unique representation for each number in the range $0 \le X < \prod_{i=1}^{N} m_i = M$, where $N$ is the number of moduli. An arithmetic operation on two integers $A$ and $B$ is equivalent to the same operation on their residue representations, that is,
$$|A \circ B|_{m_i} = \big|\, |A|_{m_i} \circ |B|_{m_i} \big|_{m_i}, \quad i = 1, 2, \ldots, N,$$
where "$\circ$" can be addition, subtraction, or multiplication. Therefore, it is desirable to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits, in which the operations can be executed in parallel and there is no carry chain between residue digits.
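The carry-free property described above can be illustrated with a minimal Python sketch. The moduli $(3, 5, 7)$ and the helper names `to_residues` and `rns_op` are assumptions chosen for illustration; they do not come from the decoder design itself.

```python
from math import gcd, prod

MODULI = (3, 5, 7)  # hypothetical pairwise relatively prime moduli, M = 105


def check_pairwise_coprime(moduli):
    """Return True when every pair of moduli has GCD 1."""
    return all(gcd(a, b) == 1
               for i, a in enumerate(moduli)
               for b in moduli[i + 1:])


def to_residues(x, moduli=MODULI):
    """Encode integer x as the tuple (|x|_m1, ..., |x|_mN)."""
    return tuple(x % m for m in moduli)


def rns_op(a_res, b_res, op, moduli=MODULI):
    """Apply op digit by digit; no carry passes between residue digits."""
    return tuple(op(a, b) % m for a, b, m in zip(a_res, b_res, moduli))


M = prod(MODULI)
a, b = 17, 4
sum_res = rns_op(to_residues(a), to_residues(b), lambda x, y: x + y)
prod_res = rns_op(to_residues(a), to_residues(b), lambda x, y: x * y)
```

Each residue digit is processed independently, so in hardware the $N$ digit operations could run concurrently. The results are valid as long as the true sum or product stays below $M$.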
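For applications in digital signal processing, it is helpful to define a dynamic range for the RNS with positive and negative integers. The dynamic range is defined as $[-(M-1)/2, (M-1)/2]$ for $M$ odd, and as $[-M/2, M/2 - 1]$ for $M$ even, or more specifically, for $M$ odd,
$$X = \begin{cases} Z, & \text{if } Z \le \frac{M-1}{2} \\ Z - M, & \text{otherwise} \end{cases}$$
and for $M$ even,
$$X = \begin{cases} Z, & \text{if } Z < \frac{M}{2} \\ Z - M, & \text{otherwise} \end{cases}$$
where $Z$ is an integer within the legitimate range, $0 \le Z < M$. Any integer, $X$, within the dynamic range can be represented by $N$ residue digits. The conversion from RNS to the weighted binary number system is done by using the CRT, which states that
$$|X|_M = \left| \sum_{j=1}^{N} \hat{m}_j \left| \frac{r_j}{\hat{m}_j} \right|_{m_j} \right|_M$$
where $\hat{m}_j = M/m_j$, $|r_j/\hat{m}_j|_{m_j}$ denotes $r_j$ multiplied by the multiplicative inverse of $\hat{m}_j$ modulo $m_j$, and $\mathrm{GCD}(m_j, m_k) = 1$ for $j \neq k$.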
Although the CRT provides a direct, fast, and simple conversion formula, the lack of a large and fast modulo $M$ adder has held back this approach.
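As a software reference point, the CRT conversion can be sketched as follows. This is a straightforward sequential version under assumed names (`crt_decode`, `m_hat`); it says nothing about the hardware structure and only shows where the single modulo $M$ reduction occurs.

```python
from math import prod


def crt_decode(residues, moduli):
    """Return Z in [0, M) such that Z mod m_i == r_i for every i."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m                    # M divided by the current modulus
        inv = pow(m_hat, -1, m)           # multiplicative inverse of m_hat modulo m
        total += m_hat * ((r * inv) % m)  # one partial sum, always less than M
    return total % M                      # the modulo-M reduction of the CRT sum
```

For example, with moduli $(3, 5, 7)$ the residues $(2, 2, 3)$ decode to 17. The unreduced `total` can be as large as roughly $N \cdot M$, which is why a fast way of bringing it back into $[0, M)$ matters.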
III. THE RESIDUE DECODER
The residue decoder based on the CRT can be implemented by a modulo $M$ adder tree. The modulo $M$ adders at each level are used to correct the partial sum so that it stays within the legitimate range. Since the modulo $M$ adder is very slow, this implementation may add overhead to the overall speed performance of an RNS processor. In addition, the CRT only converts residues to their binary representation in the legitimate range, not in the dynamic range. Therefore, conversion to the 2's complement binary number system requires a final correction.
In order to implement a high-speed residue decoder that can perform conversion to both unsigned magnitude and 2's complement binary number system, the following solutions are proposed.
1) The number of modulo $M$ adders or binary adders should be reduced to a minimum.
2) CSA's can be used wherever multi-operand addition is required, due to their high addition speed.
3) MA's can be used for multi-operand addition, due to their constant speed in adding $n$-bit numbers in modulo $M$.
4) Correction can be performed only at the last stage, and it supports conversion to both the unsigned magnitude and the 2's complement binary number system.
For ease of design, the residue decoder is partitioned into four stages as shown in Fig. 1. The inputs to the residue decoder are the residues and a control line, $C$, which determines whether the output is in the unsigned magnitude or the 2's complement number system.
Partial Sum Generator
The inputs to this stage are the $N$ residues. The main function of this stage is to compute the partial sums, $t_i$'s, where
$$t_i = \hat{m}_i \left| \frac{r_i}{\hat{m}_i} \right|_{m_i}, \qquad \hat{m}_i = \frac{M}{m_i}.$$
Since $m_i$ is usually small, the value of $t_i$ can be obtained by accessing a look-up table with a small address space. Hence, $r_i$ will serve as a ROM address input, and $t_i$ will be obtained from a ROM output as shown in Fig. 2.
In most cases, it is better to reduce the number of partial sums, $t_i$'s, in order to reduce the complexity of the later stages and hence increase the speed of the residue decoder as a whole. Since a modulus $m_j$ can be represented by a $\lceil \log_2 m_j \rceil$-bit binary number, the $j$th residue, [...] The previous implementation means that we decrease the $N$ partial sums to a new number of partial sums, $A$.
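The table look-up performed by this stage can be sketched in Python as below. The function names are hypothetical. In particular, `merge_roms` is only one assumed way of reducing the number of partial sums, namely addressing a single table with a pair of residues; the exact grouping rule is not reproduced here.

```python
from math import prod


def build_partial_sum_roms(moduli):
    """One table per modulus: roms[i][r] is the partial sum t_i for residue r."""
    M = prod(moduli)
    roms = []
    for m in moduli:
        m_hat = M // m
        inv = pow(m_hat, -1, m)  # inverse of m_hat modulo m
        roms.append([m_hat * ((r * inv) % m) for r in range(m)])
    return roms


def merge_roms(rom_a, rom_b, M):
    """Assumed merge: one table addressed by (r_a, r_b) holding |t_a + t_b|_M."""
    return [[(ta + tb) % M for tb in rom_b] for ta in rom_a]


moduli = (3, 5, 7)
roms = build_partial_sum_roms(moduli)
merged = merge_roms(roms[0], roms[1], prod(moduli))
```

With the merged table, three residues yield two partial sums instead of three, at the cost of a table with $m_a \cdot m_b$ entries.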
Partial Sum Adder
By far, the modulo $M$ summation of the partial sums, $t_i$'s, poses the biggest challenge to the implementation of the residue decoder, due to the slow computational speed of the modulo $M$ adder. This stage can be implemented using two different approaches.
Implementation using CSA:
A multilevel CSA tree consists of $N - 2$ CSA's and a carry propagate adder (CPA) [16]. These are used to reduce the $A$ partial sums, $t_i$'s, to a sum, $S$. Let $l$ be the number of levels in a CSA tree, and $\theta(l)$ be the maximum number of operands that can be processed with an $l$-level CSA tree. We can compute $\theta$ by the recursive formula provided by Avizienis:
$$\theta(l) = \left\lfloor \tfrac{3}{2}\,\theta(l-1) \right\rfloor, \qquad \theta(1) = 3.$$
Hence the output $S$ is an $m$-bit number that is passed to the next stage. A design example for this stage is shown in Fig. 3(a). The complexity of the scheme is given by Theorem 1.
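Theorem 1: The addition of $N$ numbers using CSA's can be performed in $O(\log N)$ steps.
Proof: The number of levels required for addition in a CSA tree is determined by
$$\theta(l) = \left\lfloor \tfrac{3}{2}\,\theta(l-1) \right\rfloor, \qquad \theta(1) = 3.$$
To determine the number of levels required to add $N$ numbers, let us consider the following two cases.
Case 1: $\theta(l-1) \bmod 2 = 0$. The floor is then exact, so $\theta(l) = \tfrac{3}{2}\,\theta(l-1)$. Since $\theta(1) = 3$, we can substitute to get successive values for $\theta(l)$ as follows:
$$\theta(l) = 3\left(\tfrac{3}{2}\right)^{l-1} = 2\left(\tfrac{3}{2}\right)^{l}.$$
$\theta(l)$ represents the number of operands that can be added using a CSA tree that has $l$ levels. Suppose that the number of operands is $N$; then
$$N = 2\left(\tfrac{3}{2}\right)^{l}.$$
Taking the logarithm of both sides we have
$$\log N = \log 2 + l \log \tfrac{3}{2}.$$
Then
$$l = \frac{\log N - \log 2}{\log (3/2)}.$$
We can find constants $C_1$, $C_2$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:
$$C_1 \log N \le l \le C_2 \log N.$$
Case 2: $\theta(l-1) \bmod 2 = 1$. Then $\theta(l) = \tfrac{1}{2}\big(3\,\theta(l-1) - 1\big)$. Suppose that the number of operands is $N$. Using the same analytical method used for the case of even $\theta(l-1)$, we can find constants $C_1$, $C_2$, and $N_0 \ge 0$ such that for all $N \ge N_0$ the following is true:
$$C_1 \log N \le l \le C_2 \log N.$$
From the previous analysis in both cases 1 and 2, $N$ numbers can be added using CSA's in $O(\log N)$ steps.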
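The level count in Theorem 1 can be checked numerically with a small behavioral model. The sketch below treats integers as bit vectors and uses Python's built-in addition in place of the CPA; the function names and the greedy grouping of three operands per CSA are assumptions of this sketch, not a description of the circuit in Fig. 3(a).

```python
def csa(a, b, c):
    """3:2 carry-save adder: returns (sum_word, carry_word) with
    a + b + c == sum_word + carry_word."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_tree_sum(operands):
    """Reduce the operands level by level with CSAs, then do one final add.
    Returns (total, levels)."""
    words = list(operands)
    levels = 0
    while len(words) > 2:
        nxt = []
        full = len(words) - len(words) % 3
        for i in range(0, full, 3):
            nxt.extend(csa(*words[i:i + 3]))
        nxt.extend(words[full:])  # leftover words pass through unchanged
        words = nxt
        levels += 1
    return sum(words), levels     # sum() plays the role of the CPA


def theta(l):
    """Maximum operand count for an l-level tree: theta(1) = 3 and
    theta(l) = floor(3 * theta(l - 1) / 2)."""
    t = 3
    for _ in range(l - 1):
        t = (3 * t) // 2
    return t
```

For five operands the model needs three CSA levels, which agrees with $\theta(2) = 4 < 5 \le \theta(3) = 6$.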
Implementation using Modulo Adder:
The MA proposed in [18] is used to implement the partial sum adder. The idea of representing a number as a carry and a sum, borrowed from the CSA, can be used in modulo addition to obtain a scheme whose speed is constant and does not depend on the number of bits. Basically, a CSA does not complete the addition process at each stage but postpones it to the final stage. In the intermediate stages, numbers are represented as a sum and a carry to avoid the complete addition process. The MA is used to add two numbers $A$ and $B$ in modulo $m$. Fig. 3(b) shows that $A$ is represented as a pair of numbers $(A_s, A_c)$, $B$ is represented as $(B_s, B_c)$, and the output $C$ is represented as $(C_s, C_c)$. Each number is represented as a group of sum bits and a group of carry bits. There is no unique representation for $A_s$ and $A_c$. The condition that needs to be satisfied is
$$A = A_s + A_c.$$
We need to add four numbers $(A_s, A_c, B_s, B_c)$, which needs two steps of CSA. After the addition process we need to detect whether $-M$ or $2(-M)$ is required to adjust the result.
The adjusting process takes at most three steps. The proposed algorithm for modulo $m$ addition of two numbers can be described as follows.
(A, B, C, D)
begin
end.
An implementation of the algorithm is shown in Fig. 3(c).
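The idea of modulo addition in sum and carry form can be illustrated with the following simplified Python sketch. It is not the algorithm listing of this paper: it is an assumed variant in which every carry bit of weight $2^n$ that leaves the $n$-bit word is dropped and compensated by adding $2^n - M$, which together amounts to subtracting $M$. The sketch assumes $2^{n-1} < M < 2^n$, and it loops until no compensation is pending rather than relying on a fixed bound on the number of correction steps.

```python
def csa_n(x, y, z, n):
    """n-bit carry-save step. Returns (sum, carry, overflow), where overflow
    is the dropped carry bit of weight 2**n (0 or 1)."""
    mask = (1 << n) - 1
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s & mask, c & mask, c >> n


def modulo_add_cs(a, b, M, n):
    """Add a = (a_s, a_c) and b = (b_s, b_c), all n-bit words, modulo M.
    Returns (c_s, c_c) with c_s + c_c congruent to the operand sum mod M."""
    corr = (1 << n) - M                    # dropping 2**n then adding corr subtracts M
    s, c, k1 = csa_n(a[0], a[1], b[0], n)  # step 1
    s, c, k2 = csa_n(s, c, b[1], n)        # step 2
    pending = k1 + k2                      # number of dropped 2**n carries (0, 1 or 2)
    while pending:                         # correction steps
        s, c, pending = csa_n(s, c, pending * corr, n)
    return s, c
```

The returned pair is only congruent to the true sum modulo $M$; the value $c_s + c_c$ may still exceed $M$, so a final reduction is needed when a unique result is required. Each loop iteration lowers the represented value by at least $M$, so the loop always terminates.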
Theorem 2: The modulo adder scheme for adding two $n$-bit numbers in modulo $m$ has an asymptotic time complexity of $\Theta(1)$.
Proof: To prove that the number of steps is constant (five), we need to prove that the last carry is equal to zero in five or fewer steps. Induction on the number of bits $n$ is used to prove the theorem.
Basis step: $n = 0$ means that we do not add any numbers, and in this case the required number of steps is zero. Induction hypothesis: assume for a fixed arbitrary $n \ge 0$ that the maximum number of steps is five.
Induction step: for numbers with $n + 1$ bits, let $\gamma$ be the carry out of bit $n$. Then we have the following cases.
a) $\gamma = 0$: the carry propagation stopped at bit $n$, and it ends after five steps at most according to the induction hypothesis.
b) $\gamma = 1$: the correction is $2^{n+1} - m$ in step 3. In this case the correction is done in three steps (steps 3-5).
An example is shown in Fig. 3(d). There is no unique representation for $A$ and $B$; one valid representation is shown in Fig. 3(d), together with the detailed modulo addition operation for this example. In step 1 we get $\mathrm{temp}_2[13] = 1$, and in step 2 we get $\mathrm{temp}_4[13] = 1$, which means that at step 3 we have to add $2(2^n - M)$. At step 3 bit 13 of the resulting temp word is again 1, which means that at step 4 we have to add $2^n - M$. At step 4 bit 13 of the resulting temp word is 0, which means that the addition process stops at step 4. The result of step 4 is the final result.
The proposed modulo adder has the following advantages.
1) It does not have any limitation on the size of the modulus.
2) It is quite modular; it is a 2-D array of one type of cell (full adder).
3) It is easy to pipeline.
[...]
2) Step 1 is repeated on the outputs of step 1, taken in pairs.
3) Step 2 is repeated $\lceil \log N \rceil - 2$ times to obtain one final output represented as a sum and carry.
[...]
iii) $a > M$ and $b < M$: like case ii). (15)
From the previous four cases,
$$|a + b|_M = \big|\, |a|_M + |b|_M \big|_M. \qquad (16)$$
Since addition is associative, then
$$\Big| \sum_{i=1}^{n} y_i \Big|_M = \Big| (y_1 + y_2) + \sum_{i=3}^{n} y_i \Big|_M.$$
Using (16) we have
$$\Big| \sum_{i=1}^{n} y_i \Big|_M = \Big|\, |y_1 + y_2|_M + \Big| \sum_{i=3}^{n} y_i \Big|_M \Big|_M.$$
We can further expand this expression using the same method to get the addition process on the right-hand side in terms of only two operands added in modulo $M$.
Theorem 3 means that adding $n$ numbers in modulo $M$ can be performed using a binary tree consisting of units that are capable of adding only two numbers in modulo $M$. MA's are used as the building blocks to perform the addition process. Since an MA requires that its inputs be represented in the form of sum and carry, this form should be enforced at all levels. The form is enforced automatically for levels $\ge 2$, because the outputs of the previous levels are in the correct form. For the first level we have the following:
$$T_{i_s} = y_i, \quad T_{i_c} = 0, \quad \forall\; 1 \le i \le n.$$
For the last stage the output is in the form of sum and carry which is exactly the same form we have using the CSA's. Fig. 3(e) shows the binary tree required to add n numbers in modulo M .
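The tree organization can be sketched in Python as follows. The function `ma` is only a stand-in for a two-operand modulo adder: it returns a fully reduced sum word and a zero carry word, whereas a hardware MA would keep its output in sum and carry form. The names and the pass-through handling of an odd operand are assumptions of this sketch.

```python
from math import ceil, log2


def ma(a, b, M):
    """Stand-in two-operand modulo adder on (sum, carry) pairs."""
    return ((a[0] + a[1] + b[0] + b[1]) % M, 0)


def modulo_tree_sum(values, M):
    """Add a non-empty list of values modulo M with a binary tree of
    two-operand adders. Returns ((sum_word, carry_word), levels)."""
    level = [(y, 0) for y in values]  # first level: sum word y_i, carry word 0
    levels = 0
    while len(level) > 1:
        nxt = [ma(level[i], level[i + 1], M)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd operand passes to the next level
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels
```

For $n$ operands the tree has $\lceil \log_2 n \rceil$ levels, so with a constant-time two-operand adder the total delay grows logarithmically in $n$.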
Range Determinator
This stage consists of three levels, namely a ROM, a magnitude comparator (MC), and a bit corrector (BC). The major function of this stage is to determine the range of $S$ so that the appropriate value can be subtracted from $S$ to obtain the correct result.
Since the input to this stage, $S$, is a large binary number, it is partitioned into groups of adjacent bits. For example, if $S$ is a 24-bit number, we can partition $S$ into three 8-bit groups. Since each group is fed into a ROM module as an address input, the number of bits in each group should be small, so that small ROM's that are fast and occupy little silicon area can be used to implement this level. However, the number of groups, $g$, should be kept as small as possible, since the complexity of the MC cells is a function of the number of ROM modules, $g$. Hence, there is a trade-off between $g$ and the number of bits in each group. The following discussion is divided into two parts: sign magnitude and 2's complement.
3.3.1 Sign Magnitude: As shown in Fig. 4, the input to the $i$th ROM module is $G_i$, and the outputs are $B_i$'s and $C_i$'s. The function of this ROM module is depicted as follows: [...]
The ROM modules compare the input pattern $S$ to the first set of values in Table II and produce $g(2A - 3)$ outputs that are fed to the MC level.
The MC level consists of $(A - 1)$ MC$^1$ cells. The complexity of an MC cell is a function of the number of ROM modules. If we have $g$ ROM modules, the Boolean equation for the $j$th MC cell is as follows: [...]
Since $S$ may be larger than several of the values compared, the outputs of several MC cells may be set to 1; therefore, the BC level is used to ensure that only one of the outputs of the MC$^1$ cells is set to one and also to indicate the appropriate range. In order to do so, $A$ identical BC$^1$ cells are needed, and their common Boolean equation is as follows: [...]
The Boolean equations are: [...]
The MC level of this part consists of $A$ MC$^2$ cells, and each MC$^2$ cell has the following function: for $M$ odd, [...] and for $M$ even, [...]
[...] magnitude number system conversion will be performed; otherwise, only one of the BC level output lines will be equal to one, and thus residue to 2's complement number system conversion will be performed. The Boolean equation for a [...]
$E_i = 1$ if $G_i >$ [...], for $i = 1, 2, \ldots, A - 1$.
Final Corrector
This stage consists of $A$ tristate multiplexers and a carry lookahead adder (CLA). The BC$^1$ input lines are used to enable one of the tristate multiplexers, while the BC$^2$ input lines are used as selectors of the multiplexers. If BC$^1_i$ is set, then $(i - 1)M \le S < iM$. The lower bound, $(i - 1)M$, is subtracted from $S$ if conversion to the unsigned magnitude number system is desired, or if $S$ is less than $((2i - 1)M + 1)/2$ for $M$ odd or $((2i - 1)M)/2$ for $M$ even; otherwise, the upper bound, $iM$, is subtracted from $S$. The implementation of this stage is shown in Fig. 6. The CLA is used to add the 2's complement of the value to be subtracted to $S$ and output the desired result.
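The selection rule of the final corrector can be expressed behaviorally as in the Python sketch below. The function name and the `signed` flag (standing in for the control line $C$) are assumptions; the integer division used to find the range index $i$ is a software shortcut for what the range determinator does with ROM's and comparators.

```python
def final_correct(S, M, signed):
    """Map the unreduced sum S to the decoder output.
    signed=False: unsigned result in [0, M).
    signed=True: result in the dynamic range, negative values allowed."""
    i = S // M + 1                      # range index: (i - 1) * M <= S < i * M
    lower, upper = (i - 1) * M, i * M
    if not signed:
        return S - lower
    if M % 2:                           # M odd: threshold ((2i - 1)M + 1) / 2
        threshold = ((2 * i - 1) * M + 1) // 2
    else:                               # M even: threshold ((2i - 1)M) / 2
        threshold = ((2 * i - 1) * M) // 2
    return S - lower if S < threshold else S - upper
```

For $M = 105$, an input $S = 227$ gives 17 in both modes, while $S = 100$ gives 100 in unsigned mode and $-5$ in signed mode.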
IV. PERFORMANCE EVALUATION
1) The partial sum generator is implemented using small ROM's. If the number of residues is $N$ and each [...]
3) The range determinator consists of three different levels (Fig. 4). The first level consists of $g$ ROM's. The second level is the MC cells, which are combinational circuits that can be represented with a two-level switching function. Finally, the last level is a two-stage combinational circuit. The three levels have a constant time delay that does not depend on $N$. The previous analysis shows that the range determinator has a time delay of $\Theta(1)$.
4) The final corrector consists of two stages. In the first stage we have $A$ tristate multiplexers, which have a constant delay equivalent to two serial NAND gates. The second stage is a CLA that has a constant delay (for fewer than 64 bits, the delay is equivalent to that of 12 serial NAND gates, as shown in [16]). For larger numbers of bits we can still obtain a constant-delay CLA. Thus the final corrector has a delay of $O(1)$.
From cases 1)-4) we see that all stages except the partial sum adder have a constant time delay, which does not depend on the number of residues N. Only the second stage requires O(log N ) steps.
V. CONCLUSION
The residue decoder introduced in this paper has a total delay of $\lceil \log N \rceil$. In addition, it has several advantages, as listed below.
1) The design is quite modular and consists of simple cells such as small ROM's and MC cells. This makes the implementation of the whole residue decoder in a single chip possible.
2) It does not have any limitation on the moduli used.
3) It is flexible since it can convert residues to either the unsigned magnitude or the 2's complement number system, selected by a single control line, $C$. This means that it can be applied to a wider range of applications.
4) It is fast compared with most previously proposed schemes, since it has a time complexity of $O(\log N)$.
