This paper presents a general architecture for designing efficient reverse converters based on the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, using a parallel implementation of the mixed-radix conversion (MRC) algorithm. This moduli set is free from moduli of the form $2^{k}+1$, which can result in an efficient arithmetic unit for the residue number system (RNS). The values of $\alpha$ and $\beta$ can be selected to provide the required dynamic range (DR) and to adjust the desired balance between the moduli bit-widths. The simple multiplicative inverses of the proposed moduli set, together with novel techniques for simplifying the conversion equations, lead to a low-complexity and high-performance general reverse converter architecture that can support different DRs. Moreover, given the current importance of 5n-bit DR moduli sets, we also introduce the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$, which is the special case of the general set with $\alpha = 2n$ and $\beta = n$. The converter for this special set is derived from the presented general architecture and is faster than the fastest state-of-the-art reverse converter, which was designed for the 5n-bit DR moduli set $\{2^{n}, 2^{2n}-1, 2^{2n}+1\}$. Furthermore, theoretical and FPGA implementation results show that the proposed reverse converter for the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ considerably improves the conversion delay with lower hardware requirements compared to other works with similar DR.
key words: residue arithmetic, reverse converter, residue number system (RNS), VLSI architecture
Introduction
One of the most effective ways to achieve arithmetic-level parallelism in VLSI design is to use the residue number system (RNS) [1]. Because RNS performs addition, subtraction and multiplication without carry propagation between residue digits, it is a high-performance alternative number system that can reduce power dissipation and considerably speed up digital computing systems [2], [3]. The most important applications of RNS have been reported in the digital signal processing (DSP) area, including FIR filters, convolutions, and DFT and FFT computations [4]-[7]. Furthermore, the advantages of using redundant RNS to provide easy error detection and correction are well documented [8], [9]. However, the difficulty of implementing non-modular RNS operations, as well as the overhead incurred by forward and reverse converters, has prevented the use of RNS in general-purpose processors. Recent advances in performing difficult RNS operations such as sign detection [10], magnitude comparison [11] and scaling [12] are increasing the applicability of RNS in general-purpose computing systems. The most important issue in designing efficient RNS systems is the appropriate selection of the moduli set, since the performance of the residue arithmetic channels as well as the complexity of the forward and reverse converters depend mainly on the form and the number of moduli [13]. The moduli set $\{2^{n}, 2^{n}-1, 2^{n}+1\}$ has attracted a large amount of research for many decades, primarily because of its simple and balanced moduli. However, its dynamic range (DR) is not sufficient for current high-performance DSP applications. To overcome this problem, i.e., to obtain a large DR together with the advantages of the popular set $\{2^{n}, 2^{n}-1, 2^{n}+1\}$, Hariri et al. [14] proposed the 5n-bit DR moduli set $\{2^{n}, 2^{2n}-1, 2^{2n}+1\}$ with a high-speed and low-cost reverse converter.
Moreover, the moduli set $\{2^{\alpha}, 2^{\beta}-1, 2^{\beta}+1\}$, where $\alpha < \beta$, was introduced by Molahosseini et al. [15] to provide large dynamic range RNS systems. The reverse converter of [15] relies on a simple and efficient architecture; however, with the constraint $\alpha < \beta$, the DR is concentrated on the low-performance modulus $2^{\beta}+1$, which increases the total delay of the RNS arithmetic unit. Chavez and Sousa [16] suggested the moduli set $\{2^{\alpha}, 2^{\beta}-1, 2^{\beta}+1\}$, where $\alpha \ge \beta$. They tried to reduce the inefficiency of the modulus $2^{\beta}+1$ by concentrating the DR on the efficient modulus $2^{\alpha}$, at the expense of a lower-performance reverse converter than that of [15]. The moduli sets $\{2^{n-1}-1, 2^{n}-1, 2^{n}\}$ [17] and $\{2^{n}-1, 2^{n}, 2^{n+1}-1\}$ [18], which are free from the modulus $2^{n}+1$, have also been introduced to provide a fast RNS arithmetic unit, but with more complex reverse converters than those for the set $\{2^{n}, 2^{n}-1, 2^{n}+1\}$. Recently, the moduli sets $\{2^{2n}, 2^{n}-1, 2^{n\pm1}-1\}$ [19], which are enhanced versions of these classical three-moduli sets, were introduced to provide 4n-bit DR with reduced-complexity reverse converters. The demand for more parallelism than three moduli can offer persuaded researchers to investigate larger numbers of moduli. Hence, several four- and five-moduli sets with different DRs have been proposed for RNS [20]-[28]. Researchers have aimed to introduce large-DR moduli sets that lead to efficient internal RNS arithmetic circuits as well as high-performance reverse converters. However, examination of the published papers in this area shows that this aim has not been completely reached. In other words, when fast arithmetic units were achieved, an inefficient reverse converter resulted, and vice versa.
Although some recent works have reported better tradeoffs between the performance of the RNS arithmetic unit and that of the reverse converter, there is still a need for moduli sets that provide high efficiency in both the arithmetic unit and the reverse converter.
Copyright © 2011 The Institute of Electronics, Information and Communication Engineers
In this paper, we propose the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, as a basis for large dynamic range RNS systems with adjustable DR, fast RNS arithmetic units and low-complexity reverse converters. Next, we present a general reverse converter architecture based on this moduli set to achieve high-performance converters. The presented design is obtained using a parallel, adder-based implementation of the mixed-radix conversion (MRC) algorithm, resulting in a VLSI-efficient architecture. Thus, the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$ can be regarded as both conversion-friendly and arithmetic-friendly, owing to its potential to provide efficiency for all parts of an RNS. Finally, we present the reverse converter for the 5n-bit DR special moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$, which is obtained from the general architecture. This converter has a lower conversion delay than the converter for $\{2^{n}, 2^{2n}-1, 2^{2n}+1\}$ [14], which is the fastest known reverse converter in the 5n-bit DR class. Moreover, the proposed converter for the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ outperforms the best state-of-the-art reverse converters designed for 5n-bit DR moduli sets - [28].
The remaining sections of the paper are arranged as follows. In Sect. 2, we present the proposed general reverse conversion algorithm with its hardware architecture. The derivation of the reverse converter for the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ from the general architecture, the evaluation of its performance and the comparison with other works are described in Sect. 3. Finally, Sect. 4 concludes the paper.
The General Reverse Converter Architecture
First, we apply a three-modulus version of MRC to the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, to obtain the conversion algorithm. Next, to reduce the hardware complexity, some mathematical properties are used to simplify the conversion equations. To begin, however, we give a brief introduction to RNS and MRC, followed by a lemma that gives the efficient multiplicative inverses of the proposed set.
RNS and MRC
The basis of each RNS is a moduli set $\{P_1, P_2, \ldots, P_n\}$ of pairwise relatively prime numbers. The DR is defined as $M = P_1 P_2 \cdots P_n$, so that a regular weighted number $X < M$ can be represented as $(x_1, x_2, \ldots, x_n)$, where
\[ x_i = X \bmod P_i = |X|_{P_i}, \qquad i = 1, 2, \ldots, n. \]
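The residue representation described above can be illustrated with a short Python sketch. It is only a software model under stated assumptions: the helper names `moduli_set`, `to_residues`, `rns_add` and `rns_mul` are hypothetical, and the parameters $\alpha = 4$, $\beta = 2$ are chosen only because they give the small set {16, 31, 3}.

```python
from math import prod

def moduli_set(alpha: int, beta: int) -> tuple:
    """Build {2^alpha, 2^(2*beta+1) - 1, 2^beta - 1} for beta < alpha <= 2*beta."""
    if not (beta < alpha <= 2 * beta):
        raise ValueError("expected beta < alpha <= 2*beta")
    return (1 << alpha, (1 << (2 * beta + 1)) - 1, (1 << beta) - 1)

def to_residues(x: int, moduli) -> tuple:
    """Forward conversion: x_i = X mod P_i for every modulus P_i."""
    return tuple(x % p for p in moduli)

def rns_add(a, b, moduli) -> tuple:
    """Channel-wise addition; no carry crosses from one residue digit to another."""
    return tuple((ai + bi) % p for ai, bi, p in zip(a, b, moduli))

def rns_mul(a, b, moduli) -> tuple:
    """Channel-wise multiplication, valid while the true product stays below M."""
    return tuple((ai * bi) % p for ai, bi, p in zip(a, b, moduli))

moduli = moduli_set(4, 2)   # illustrative choice: (16, 31, 3)
M = prod(moduli)            # dynamic range of this small set
```

Each residue channel works independently of the others, which is the source of the parallelism; the result is meaningful as long as the true sum or product is smaller than $M$.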
The following theorem confirms that the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$ can be used for RNS.
Theorem 1:
The moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, consists of pairwise relatively prime numbers.
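Proof. Consider the Euclidean algorithm, i.e., GCD(a, b) = GCD(b, a mod b), where GCD denotes the greatest common divisor of a and b. We have
\[ \mathrm{GCD}(2^{2\beta+1}-1,\ 2^{\alpha}) = 1, \qquad \mathrm{GCD}(2^{\beta}-1,\ 2^{\alpha}) = 1, \]
\[ \mathrm{GCD}(2^{2\beta+1}-1,\ 2^{\beta}-1) = \mathrm{GCD}(2^{\beta}-1,\ 1) = 1, \]
where the first two hold because $2^{2\beta+1}-1$ and $2^{\beta}-1$ are odd, and the third follows from $2^{2\beta+1}-1 = (2^{\beta+1}+2)(2^{\beta}-1)+1$.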
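Since all the greatest common divisors of these moduli are equal to one, these numbers are pairwise relatively prime.
By MRC [3], [19], the reverse conversion (i.e., translating the residue-represented number into its equivalent weighted number) can be done using
\[ X = Z_n \prod_{i=1}^{n-1} P_i + \cdots + Z_3 P_2 P_1 + Z_2 P_1 + Z_1, \]
where the mixed-radix digits can be computed as follows:
\[ Z_1 = x_1, \]
\[ Z_2 = \left| (x_2 - Z_1) \left| P_1^{-1} \right|_{P_2} \right|_{P_2}, \]
\[ Z_3 = \left| \left( (x_3 - Z_1) \left| P_1^{-1} \right|_{P_3} - Z_2 \right) \left| P_2^{-1} \right|_{P_3} \right|_{P_3}, \]
\[ \vdots \]
\[ Z_n = \left| \left( \cdots \left( (x_n - Z_1) \left| P_1^{-1} \right|_{P_n} - Z_2 \right) \left| P_2^{-1} \right|_{P_n} - \cdots - Z_{n-1} \right) \left| P_{n-1}^{-1} \right|_{P_n} \right|_{P_n}, \]
and $\left| P_i^{-1} \right|_{P_j}$ denotes the multiplicative inverse of $P_i$ modulo $P_j$.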
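The MRC recurrence can be sketched in software as follows. This is a generic illustration for any pairwise relatively prime moduli, not the hardware algorithm of this paper. The function names are hypothetical, and the inverses are obtained with Python's built-in `pow(P_i, -1, P_j)` (available from Python 3.8) rather than from closed forms.

```python
from math import gcd

def pairwise_coprime(moduli) -> bool:
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(p, q) == 1
               for i, p in enumerate(moduli) for q in moduli[i + 1:])

def mixed_radix_digits(residues, moduli) -> list:
    """Compute the mixed-radix digits Z_1..Z_n of the residue vector."""
    digits = []
    for j, (x_j, p_j) in enumerate(zip(residues, moduli)):
        t = x_j
        for i in range(j):
            # Subtract the already known digit Z_i, then divide by P_i modulo P_j.
            t = ((t - digits[i]) * pow(moduli[i], -1, p_j)) % p_j
        digits.append(t % p_j)
    return digits

def mrc_to_integer(residues, moduli) -> int:
    """Weighted value X = Z_1 + Z_2*P_1 + Z_3*P_1*P_2 + ..."""
    x, weight = 0, 1
    for z, p in zip(mixed_radix_digits(residues, moduli), moduli):
        x += z * weight
        weight *= p
    return x
```

For the set (16, 31, 3) and the residues (12, 25, 2), `mixed_radix_digits` returns `[12, 26, 0]` and `mrc_to_integer` returns 428. Note that each digit depends on all earlier digits, which is the sequential dependency the simplified conversion equations aim to reduce.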
Multiplicative Inverses
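Multiplicative inverses in the form of powers of two reduce the complexity of the reverse converter, since the required multiplications can be replaced with shift operations. The following lemma introduces the simple multiplicative inverses of the proposed set with their proofs.
Lemma 1: The multiplicative inverses for the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, are
\[ \left| (2^{\alpha})^{-1} \right|_{2^{2\beta+1}-1} = 2^{2\beta-\alpha+1}, \]
\[ \left| (2^{\alpha})^{-1} \right|_{2^{\beta}-1} = 2^{2\beta-\alpha}, \]
\[ \left| (2^{2\beta+1}-1)^{-1} \right|_{2^{\beta}-1} = 1. \]
Proof. Since $2^{2\beta+1} \equiv 1 \pmod{2^{2\beta+1}-1}$ and $2^{\beta} \equiv 1 \pmod{2^{\beta}-1}$, we have $2^{\alpha} \cdot 2^{2\beta-\alpha+1} = 2^{2\beta+1} \equiv 1 \pmod{2^{2\beta+1}-1}$, $2^{\alpha} \cdot 2^{2\beta-\alpha} = 2^{2\beta} \equiv 1 \pmod{2^{\beta}-1}$, and $2^{2\beta+1}-1 = 2 \cdot 2^{2\beta} - 1 \equiv 1 \pmod{2^{\beta}-1}$.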
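The power-of-two form of the inverses can be checked numerically. The sketch below assumes the closed forms $2^{2\beta-\alpha+1}$, $2^{2\beta-\alpha}$ and $1$, which follow from the congruences $2^{2\beta+1} \equiv 1$ modulo $2^{2\beta+1}-1$ and $2^{\beta} \equiv 1$ modulo $2^{\beta}-1$. The function names are hypothetical, and the search range of $\beta$ is arbitrary.

```python
def inverses(alpha: int, beta: int) -> tuple:
    """Inverses of P1 mod P2, P1 mod P3 and P2 mod P3, via the built-in pow."""
    p1 = 1 << alpha
    p2 = (1 << (2 * beta + 1)) - 1
    p3 = (1 << beta) - 1
    return (pow(p1, -1, p2), pow(p1, -1, p3), pow(p2, -1, p3))

def inverses_are_powers_of_two(max_beta: int = 10) -> bool:
    """Compare the computed inverses with the assumed closed forms."""
    for beta in range(2, max_beta + 1):
        for alpha in range(beta + 1, 2 * beta + 1):      # beta < alpha <= 2*beta
            expected = (1 << (2 * beta - alpha + 1),      # 2^(2*beta - alpha + 1)
                        1 << (2 * beta - alpha),          # 2^(2*beta - alpha)
                        1)
            if inverses(alpha, beta) != expected:
                return False
    return True
```

Because every inverse is a power of two, multiplying by it modulo $2^{k}-1$ reduces to a circular shift, which is why no multiplier is needed in the converter.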
Conversion Algorithm
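Consider the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$ with its corresponding RNS representation $(x_1, x_2, x_3)$. These residues can be written at bit level as
\[ x_1 = (x_{1,\alpha-1} \cdots x_{1,1}\, x_{1,0})_2 \quad (\alpha \text{ bits}), \tag{12} \]
\[ x_2 = (x_{2,2\beta} \cdots x_{2,1}\, x_{2,0})_2 \quad (2\beta+1 \text{ bits}), \tag{13} \]
\[ x_3 = (x_{3,\beta-1} \cdots x_{3,1}\, x_{3,0})_2 \quad (\beta \text{ bits}). \tag{14} \]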
The following theorem and lemmas present the proposed conversion algorithm.
Theorem 2:
For the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$, the weighted number $X$ can be obtained from its residues $(x_1, x_2, x_3)$ by
\[ X = x_1 + 2^{\alpha} \left( Z_2 + (2^{2\beta+1}-1)\, Z_3 \right), \tag{15} \]
where
\[ Z_2 = \left| 2^{2\beta-\alpha+1} (x_2 - x_1) \right|_{2^{2\beta+1}-1}, \tag{16} \]
\[ Z_3 = \left| 2^{2\beta-\alpha} (x_3 - x_1) - Z_2 \right|_{2^{\beta}-1}. \tag{17} \]
Proof. Substituting the moduli $P_1 = 2^{\alpha}$, $P_2 = 2^{2\beta+1}-1$ and $P_3 = 2^{\beta}-1$, together with the multiplicative inverses from Lemma 1, into the MRC formulas (4)-(7) yields the above equations.
The following properties can be used to simplify (16) and (17), reducing the hardware complexity.
Property 1: The residue of a negative residue number $-v$ modulo $2^{k}-1$ is the one's complement of $v$, where $0 \le v < 2^{k}-1$ [14].
Property 2: The multiplication of a residue number $v$ by $2^{P}$ modulo $2^{k}-1$ is carried out by a P-bit circular left shift, where P is a natural number [14].
Lemma 2: $Z_2$ is computed as follows:
Proof. The above equations can be obtained by applying properties 1 and 2 to (16). Thus,
Lemma 3. Z 3 is calculated as below:
Proof. First, (17) can be rewritten as
Where, from (21) we have
The binary vectors $L_1$ and $L_2$ are both $(2\beta+1)$-bit numbers, so the maximum value of each could be $2^{2\beta+1}-1$. However, $L_2$ has $2\beta-\alpha+1$ zero bits, and $L_1$ is composed of the bits of $x_2$. At least one bit of $x_2$ is zero, since the maximum value of $x_2$ is $2^{2\beta+1}-2$ (because $x_2$ is a residue modulo $2^{2\beta+1}-1$). Therefore, $L_1$ and $L_2$ are always less than $2^{2\beta+1}-1$. As a result, the most positive value of the modular subtraction in (32) is less than $2^{2\beta+1}-1$, and consequently the reduction modulo $2^{2\beta+1}-1$ can be removed. Moreover, the most negative value of (32) is greater than $-(2^{2\beta+1}-1)$. Hence, only one corrective addition is needed. Then, (32) can be calculated by
Secondly, careful examination of (23) shows that
Because $x_1$ is an $\alpha$-bit number, representing it in $2\beta+1$ bits, where $\beta < \alpha \le 2\beta$, requires $2\beta-\alpha+1$ zero bits in front of $x_1$. Thus, a $(2\beta-\alpha+1)$-bit circular left shift of $x_1$ is the same as a $(2\beta-\alpha+1)$-bit regular left shift. Now, substituting (34) in (33) yields
Next, using (35) in the case $L_1 - L_2 \ge 0$, (31) can be evaluated as
Similarly, for $L_1 - L_2 < 0$, we obtain the following:
Therefore, in the general case, we have
Now, we must simplify (38) using Properties 1 and 2. First, consider the case $L_1 - L_2 \ge 0$. Equation (38) can be rewritten as
By splitting (41), we have
Subsequently, the reduction of $-L_1$ modulo $2^{\beta}-1$ can be performed by considering (19) as follows
Now, by separating (44) into three parts, we achieve
Second, consider the case $L_1 - L_2 < 0$. Equation (38) can be simplified in the same way as before, with only one additional vector to take $-(2^{2\beta+1}-1)$ into account. Thus,
The other binary vectors were obtained previously in (40)-(47). Vectors (47) and (49) both have $\beta-1$ bits with constant values, so we can merge $L_{53}$ and $L_6$ into one vector as
Therefore, the seven operands of (48) are reduced to six, as shown below:
Finally, the following equation can be used instead of (39) and (51) to realize both cases.
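The role of Properties 1 and 2 in this section can be illustrated with a bit-level software model. The sketch below is an assumption-laden illustration rather than the paper's hardware. It computes $Z_2 = |2^{2\beta-\alpha+1}(x_2 - x_1)|$ modulo $2^{2\beta+1}-1$ using only a circular shift, a bitwise inversion and one end-around-carry addition. For brevity it computes $Z_3 = |2^{2\beta-\alpha}(x_3 - x_1) - Z_2|$ modulo $2^{\beta}-1$ with ordinary integer arithmetic instead of the six-operand decomposition. All function names are hypothetical.

```python
def rotl(v: int, s: int, k: int) -> int:
    """Circular left shift of a k-bit word v by s positions (Property 2)."""
    s %= k
    mask = (1 << k) - 1
    return ((v << s) | (v >> (k - s))) & mask

def ones_complement(v: int, k: int) -> int:
    """Bitwise inversion of a k-bit word; congruent to -v modulo 2^k - 1 (Property 1)."""
    return v ^ ((1 << k) - 1)

def add_eac(a: int, b: int, k: int) -> int:
    """Addition modulo 2^k - 1 of two k-bit words using an end-around carry."""
    mask = (1 << k) - 1
    s = a + b
    s = (s & mask) + (s >> k)        # feed the carry-out back into the sum
    return 0 if s == mask else s     # the all-ones word is a second code for zero

def reverse_convert(x1: int, x2: int, x3: int, alpha: int, beta: int) -> int:
    k2 = 2 * beta + 1                # width of the modulo 2^(2*beta+1) - 1 channel
    p3 = (1 << beta) - 1
    shift = 2 * beta - alpha + 1
    l1 = rotl(x2, shift, k2)         # 2^shift * x2 modulo 2^k2 - 1
    l2 = rotl(x1, shift, k2)         # top bits of x1 are zero: same as x1 << shift
    z2 = add_eac(l1, ones_complement(l2, k2), k2)
    # Assumption: Z3 is evaluated directly, not through the six-operand adder.
    z3 = (((x3 - x1) << (2 * beta - alpha)) - z2) % p3
    return x1 + ((z2 + ((1 << k2) - 1) * z3) << alpha)
```

For example, `reverse_convert(12, 25, 2, 4, 2)` gives 428 for the set {16, 31, 3}. Since $x_1$ has only $\alpha$ significant bits, `rotl(x1, shift, k2)` never wraps a one bit around, which is the observation used above to replace the circular shift of $x_1$ by a regular shift.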
Hardware Architecture
The hardware architecture of the proposed general reverse converter is depicted in Fig. 1 and is based on Theorem 2 and Lemmas 2 and 3. To implement the main equation, i.e., (15), we must first realize the mixed-radix coefficients (16) and (17). Lemmas 2 and 3 provide simplified versions of (16) and (17) without a direct dependency between $Z_2$ and $Z_3$. First, the required operands (19), (20), (25)-(30) are prepared by operand preparation unit 1 (OPU 1) with only some NOT gates and wiring. Next, we need a modulo $2^{\beta}-1$ adder to realize (18). Modular adders can be implemented using different methods; this paper uses the carry-propagate adder (CPA) with end-around carry (EAC) [29] to realize addition modulo $2^{k}-1$. The CPA with EAC has the same hardware complexity as a regular CPA and twice its delay. Therefore, the realization of (18) relies on a $\beta$-bit CPA with EAC. Also, a six-operand modulo $2^{2\beta+1}-1$ adder [30] is employed to implement (24). This multi-operand modular adder can be built using a six-input carry-save adder (CSA) tree followed by a $(2\beta+1)$-bit CPA with EAC. The CSA tree consists of four $(2\beta+1)$-bit CSAs with EACs, as shown in Fig. 2. Some of the full adders (FAs) are reduced to XOR/AND or XNOR/OR pairs, since some inputs of the CSAs have a constant value of one or zero. One of the main features of Lemma 3 is that it removes the direct dependency on $Z_2$ that exists in (17); however, the carry of the first-round addition of CPA1 with EAC is needed to obtain (30). In other words, the EAC bit of CPA1 determines the sign of $L_1 - L_2$. Thus, we use a $2 \times 1$ $\beta$-bit multiplexer (MUX) with inputs (47) and (50), whose select line is connected to the carry-out of CPA1. Moreover, (15) can be implemented using simple concatenations followed by a regular binary addition. Clearly, (15) can be rewritten as
Where 
Preparing the operands of (55) relies on simple concatenations and inversions as shown below
Therefore, a $(3\beta+1)$-bit regular CPA with a carry-in of '1' is needed to add (56) and (57), where OPU 2 prepares $Y_1$ and $Y_2$. Note that $2\beta+1$ FAs of CPA3 are reduced to $2\beta+1$ XNOR/OR pairs, since (57) has $2\beta+1$ constant bits with a value of one. Finally, because $x_1$ is an $\alpha$-bit number, (54) can be obtained by a simple concatenation without using any computational hardware.
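The final stage can be modelled in a few lines of Python. The operand layout below is an assumption inferred from the description (a $(3\beta+1)$-bit adder, a carry-in of one, and $2\beta+1$ constant one bits in the second operand): $Y_1$ is taken as $Z_3$ concatenated with $Z_2$, and $Y_2$ as $2\beta+1$ ones followed by the inverted bits of $Z_3$, so that $Y_1 + Y_2 + 1$ equals $Z_2 + (2^{2\beta+1}-1) Z_3$ modulo $2^{3\beta+1}$. The function name `final_stage` is hypothetical.

```python
def final_stage(x1: int, z2: int, z3: int, alpha: int, beta: int) -> int:
    """Model of the last adder and concatenation; operand layout is assumed."""
    k2 = 2 * beta + 1
    width = 3 * beta + 1                        # width of the final adder
    y1 = (z3 << k2) | z2                        # Z3 concatenated with Z2
    not_z3 = z3 ^ ((1 << beta) - 1)             # beta-bit inversion of Z3
    y2 = (((1 << k2) - 1) << beta) | not_z3     # 2*beta+1 ones, then NOT Z3
    y = (y1 + y2 + 1) & ((1 << width) - 1)      # carry-in of one, carry-out dropped
    return (y << alpha) | x1                    # x1 fills the low alpha bits
```

With the mixed-radix digits of the {16, 31, 3} example, `final_stage(12, 26, 0, 4, 2)` returns 428. The last line uses a bitwise OR instead of an addition, mirroring the statement that placing $x_1$ in the low $\alpha$ bits needs no computational hardware.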
Reverse Converter for the Moduli Set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$
In this section, a new moduli set with its specialized reverse converter is presented. The motivation for proposing this new set, as well as the way the reverse converter is derived from the general architecture, is described below.
Introducing New Moduli Set
Considerable attention has been given to 5n-bit DR residue number systems in recent years, and new moduli sets with this DR have been introduced. The first moduli set was $\{2^{n}, 2^{n}-1, 2^{n}+1, 2^{n}-2^{(n+1)/2}+1, 2^{n}+2^{(n+1)/2}+1\}$, and the best reverse converter for this set was presented in [25]. Its main drawbacks are the moduli $2^{n}-2^{(n+1)/2}+1$ and $2^{n}+2^{(n+1)/2}+1$, which decrease the performance of the arithmetic operations. Therefore, the moduli set $\{2^{n}-1, 2^{n}, 2^{n}+1, 2^{2n}+1\}$ was suggested in [26]. For the first time, a four-moduli set was used to provide more than 4n-bit DR. The reverse converter of this work has higher performance, and the set offers faster arithmetic operations, in comparison to [25]. Moreover, the three-moduli set $\{2^{n}, 2^{2n}-1, 2^{2n}+1\}$ [14] has also been proposed, with 5n-bit DR and a faster reverse converter compared to [25] and [26]. Although the moduli sets reported in [14], [25] and [26] can provide a high DR, the presence of the modulus $2^{2n}+1$ causes inefficient arithmetic operations. Hence, the set {2
with balanced moduli was introduced. However, the unfavorable multiplicative inverses of this set lead to a noticeable decrease in reverse converter performance. In the recently reported work [28], the moduli set $\{2^{n}-1, 2^{n}, 2^{n}+1, 2^{2n+1}-1\}$ is proposed to solve the problem of inefficient multiplicative inverses. Compared to [27], this moduli set provides a faster reverse converter with the same RNS arithmetic unit speed and lower hardware complexity. However, the delay of the converter of [28] is longer than that of [14]. Therefore, the lack of a moduli set that provides a better tradeoff between fast arithmetic operations and an efficient reverse converter in the 5n-bit DR class is evident. Hence, we propose the new 5n-bit DR moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$. This set can result in a very efficient RNS arithmetic unit, since it is free from moduli of the form $2^{n}+1$. Furthermore, the critical modulus of this set is $2^{2n+1}-1$, and as investigated in [28], the modulus $2^{2n+1}-1$ can result in a slightly faster modular addition than $2^{n}+1$. So, it can be concluded that the proposed moduli set is slightly faster than the moduli sets of [27] and [28] and considerably faster than the other moduli sets in the 5n-bit DR class, which were introduced in [14], [25] and [26].
Converter Design
The moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ is a special case of the general set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$. Hence, its reverse converter can be derived from the general architecture by substituting $\alpha = 2n$ and $\beta = n$ into the general conversion equations described in Sect. 2. First, from the Lemma 2 formulas (18)-(20), we have
Second, from the Lemma 3 formulas (24)-(30), we have
Third, the final conversion equations can be obtained from the simplified relations of Theorem 2, i.e., (54)-(57), as follows
The hardware implementation and also components details are the same as shown in Figs. 1-2 and Table 1 with α=2n and β=n.
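As an arithmetic-level illustration of the substitution $\alpha = 2n$, $\beta = n$, the sketch below assumes the inverses reduce to $2$, $1$ and $1$ in this special case, so that $Z_2 = |2(x_2 - x_1)|$ modulo $2^{2n+1}-1$ and $Z_3 = |x_3 - x_1 - Z_2|$ modulo $2^{n}-1$. It models only the arithmetic, not the adder-level structure of Figs. 1-2, and the function name `convert_special` is hypothetical.

```python
def convert_special(x1: int, x2: int, x3: int, n: int) -> int:
    """Reverse conversion for {2^(2n), 2^(2n+1) - 1, 2^n - 1} (alpha = 2n, beta = n)."""
    p2 = (1 << (2 * n + 1)) - 1
    p3 = (1 << n) - 1
    z2 = (2 * (x2 - x1)) % p2        # inverse of 2^(2n) modulo p2 is 2
    z3 = (x3 - x1 - z2) % p3         # the two remaining inverses equal 1
    return x1 + ((z2 + p2 * z3) << (2 * n))
```

For $n = 2$ the set is {16, 31, 3}, and `convert_special(12, 25, 2, 2)` returns 428.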
Numerical Example
Consider the moduli set {16, 31, 3}, which is derived from $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ with $n = 2$. The RNS number (12, 25, 2) can be converted into its equivalent weighted number by the following steps: 1) Binary representation of residues (12)-(14): (61):
3) Obtaining Z 3 (62)- (68):
4) Calculating X according to (68)- (71):
Thus, $X = 428$, and verification can be simply done as
\[ 428 \bmod 16 = 12, \qquad 428 \bmod 31 = 25, \qquad 428 \bmod 3 = 2. \]
Performance Evaluation
This section evaluates the performance of the proposed reverse converter for the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ and compares it with the best state-of-the-art reverse converters for moduli sets in the 5n-bit DR class, such as
$2^{2n+1}-1\}$ [28]. Table 2 presents the hardware requirements and conversion delays of these reverse converters in terms of logic gates and FAs. Note that all the assumptions used in [28] are adopted to obtain the formulas of Table 2, such as using k-bit CPAs with EACs to implement the modulo $2^{k}-1$ adders of all the converters. Furthermore, the hardware requirements and conversion delay of the proposed reverse converter for the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ are derived from Table 1 and (58), respectively, with $\alpha = 2n$ and $\beta = n$. The results show that an efficient tradeoff between hardware requirements and conversion delay is obtained. However, to obtain precise estimates of area and delay, the proposed design as well as the other converters were described in VHDL and implemented using FPGA technology. The target technology is a Xilinx Virtex-5 FPGA, and the area is evaluated by the number of occupied slices. Table 3 compares the area and delay of the converters, showing the improvement (%) for different n. As expected, the delay of the proposed design is the lowest among the converters. Compared with the fastest reverse converter, proposed in [14], speed improvements of 16.5% and 20.1% are achieved for n = 16 and n = 24, respectively. The speed and area improvements compared to the other converters are higher than these figures. To ease the comparison, Figs. 3 and 4 show the practical delay and area of the converters based on the results of Table 3. Fig. 3 confirms that the reduction in reverse conversion delay becomes more noticeable as n grows. Moreover, Fig. 4 shows a noticeable hardware saving compared to the other converters. Only the work reported in [14] needs less hardware; however, the inefficiency of its arithmetic operations due to the modulus $2^{2n}+1$ and the lower speed of its reverse converter decrease the total efficiency of an RNS based on that moduli set.
Table 2: Hardware requirements and conversion delays of the different reverse converters. Columns: Converter, Moduli set, Hardware requirements, Conversion delay. Footnote: * m = n-4, 9n-12 and 5n-8 for n = 6k-2, 6k and 6k+2, respectively, and l is the number of levels of the CSA tree with ((n/2)+1) inputs.
Conclusion
We have presented a simple and efficient general reverse converter architecture based on the moduli set $\{2^{\alpha}, 2^{2\beta+1}-1, 2^{\beta}-1\}$, where $\beta < \alpha \le 2\beta$. Owing to the absence of the low-performance modulus $2^{\beta}+1$ together with simple multiplicative inverses, the introduced moduli set is suitable for realizing large DRs, fast modulo arithmetic circuits and efficient forward/reverse converters, providing high-performance RNS systems. The general reverse converter architecture has been built using an FA-based implementation of MRC, in which novel techniques eliminate the dependency between the mixed-radix coefficients to achieve high speed. Moreover, the moduli set $\{2^{2n}, 2^{2n+1}-1, 2^{n}-1\}$ is suggested, with a specialized reverse converter derived from the proposed general architecture that offers high speed and low cost compared to the best state-of-the-art reverse converters. In any case, the modularity and regularity of our design make it suitable for efficient VLSI implementation.
