ABSTRACT
INTRODUCTION
In recent times, the Residue Number System (RNS) has become popular for implementing a variety of specialized high-performance Digital Signal Processing (DSP) systems because of its carry-free nature. Weighted number systems, such as the binary and decimal number systems, have a carry chain [1], which often limits the performance of arithmetic operations [2, 3]. In RNS, a number is represented by several residue digits, so arithmetic operations such as addition, subtraction and multiplication on wide operands can be decomposed into a set of parallel sub-operations. As a result, carry propagation, which is a genuine problem in weighted number systems, is minimized in residue systems. RNS is extremely efficient for many applications, such as digital signal processing [4, 5, 6], communications engineering and computer security (cryptography) [6].
Generally, the number of bits required in a residue number system is greater than in weighted number systems, because RNS produces as many residues as the cardinality of the moduli set, which increases the number of bits required to express a number. A number system is said to have higher bit efficiency if fewer bits are required to represent a particular dynamic range. Many parameters determine the efficiency of RNS, and bit efficiency is one of them. The bit efficiency depends on the choice of the moduli set [7]. Several techniques for moduli set generation have been reported in the literature [7, 8, 9], for example $\{2^n, 2^n + 1, 2^n - 1\}$. For these schemes no algorithm is given to generate a moduli set; the sets are generated heuristically by finding a suitable n. The contributions of this paper are the following: 1. An algorithm to generate a moduli set of any finite cardinality for a given dynamic range. 2. A bit efficiency that is better than that of the other schemes given in the literature. 3. A theoretical analysis and proof showing that the proposed solution gives better results than the existing scheme [9]. 4. Applicability of this scheme in a reconfigurable DSP processor.
BRIEF OVERVIEW OF RNS
RNS uses a set of numbers $(m_0, m_1, m_2, \ldots, m_{n-1})$,
... (1) where ⊗ represents any arithmetic operator, such as addition, subtraction or multiplication. From equation (1) it is clear that, using RNS, integer arithmetic can be broken down into independent parts that can be calculated in parallel without a carry between the components. The operations can therefore be performed faster, even faster than with special hardware such as the Carry Look Ahead Adder [2], Carry Select Adder [2], Carry Save Adder [1, 2], Wallace Tree Multiplier [1, 2] and Array Multiplier [2]. When the size of a modulus increases, it produces large remainders with multiple bits, so when an arithmetic operation is performed on those remainders, carries propagate within that small range. The special hardware mentioned above can be used for these operations, which helps reduce the processing time.
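The channel-wise decomposition described above can be illustrated with a minimal Python sketch. The moduli (7, 8, 9) and the helper names `to_rns` and `rns_op` are illustrative assumptions and do not come from the paper; the results are only meaningful while the true result stays inside the dynamic range 0 to M - 1, where M is the product of the moduli.

```python
import operator
from math import prod

# Assumed example moduli (pairwise co-prime); they are not taken from the paper.
MODULI = (7, 8, 9)
M = prod(MODULI)  # size of the dynamic range


def to_rns(value, moduli=MODULI):
    """Forward conversion: one residue per modulus."""
    return tuple(value % m for m in moduli)


def rns_op(op, xs, ys, moduli=MODULI):
    """Apply op channel by channel; no carry crosses from one channel to another."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))


a, b = 25, 17
ra, rb = to_rns(a), to_rns(b)
rns_sum = rns_op(operator.add, ra, rb)   # residues of a + b
rns_diff = rns_op(operator.sub, ra, rb)  # residues of a - b
rns_prod = rns_op(operator.mul, ra, rb)  # residues of a * b
```

Each channel works only on operands smaller than its own modulus, which is why the channels can run in parallel.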
The arithmetic operations are implemented with the residue number system [3, 10], depending on the choice of the moduli. The Chinese Remainder Theorem (CRT) [11, 12, 13] may rightly be viewed as one of the most important fundamental results in the theory of RNS. The CRT is useful for many operations and, above all, it is very helpful for RNS to binary conversion [8, 13]. The New Chinese Remainder Theorem was introduced mainly for this conversion [14, 15, 16, 17]. The CRT ensures that if the moduli of an RNS are pairwise relatively prime, then each number in the dynamic range has a unique representation in the residue system. n n n M S = + + are four cases of the general moduli sets, and these sets are widely used for residue number systems with a medium dynamic range [9, 18].
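As an illustration of RNS to binary conversion, the following sketch reconstructs an integer from its residues with the classical CRT (not the New CRT variants cited above). The function name `crt_to_integer` is an assumption made for this example, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
from math import gcd, prod


def crt_to_integer(residues, moduli):
    """Reconstruct x in [0, M) from its residues using the classical CRT."""
    # The CRT only guarantees uniqueness for pairwise relatively prime moduli.
    for i in range(len(moduli)):
        for j in range(i + 1, len(moduli)):
            if gcd(moduli[i], moduli[j]) != 1:
                raise ValueError("moduli must be pairwise relatively prime")
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m              # product of all the other moduli
        inv = pow(Mi, -1, m)     # multiplicative inverse of Mi modulo m
        total += r * Mi * inv
    return total % M
```

With the moduli (7, 8, 9), every integer from 0 to 503 maps to a distinct residue triple and is recovered by this function.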
In this paper we propose an algorithm that generates moduli sets for medium to large dynamic ranges. The scheme also attempts to keep the number of bits required to represent the moduli to a minimum.
SCHEME FOR IMPROVING BIT EFFICIENCY
The number of bits required to implement all the blocks of an RNS number depends on the moduli set. Let 
The lower the value of this term, the more optimized the RNS design is in terms of bit width. The choice can be made over the various moduli sets (such as three-moduli or four-moduli sets) and also over the numbers within the set.
In this section we describe an algorithm to generate a moduli set of any cardinality for a given precision.
Module find_moduli(N, n, SM)
// Input: N (number of bits), n (number of moduli in the set)
// Output: SM (efficient moduli set)
Step 1:
Step 2: if x is even then $x = 2n$ else $x = 2n + 1$
Step 3:
condition will be satisfied.
Step 4: if n = 4 then
relatively prime to $2n$, $2n + 1$ and $2n - 1$.
Find the smallest number $k_1 \ge k_1$, where $k_1$ is relatively prime to $2n$, $2n + 1$ and $2n - 1$.
Therefore,
Find the smallest number $k_2 \ge k_2$, where $k_2$ is relatively prime to $2n$, $2n + 1$, $2n - 1$ and $k_1$.
Therefore, 
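The following Python sketch shows one way such a search could be organised. Steps 1 and 3 and the lower bounds on the $k_i$ are not recoverable from the text above, so the rules used here are assumptions: x starts at the integer root of $2^N$, the three core moduli are $2n - 1$, $2n$, $2n + 1$, every additional modulus is the smallest co-prime number above the core, and the last additional modulus is also required to be large enough for the product to reach $2^N$. The names `iroot` and `next_coprime` are hypothetical helpers.

```python
from math import gcd, prod


def iroot(value, k):
    """Largest integer r with r**k <= value (binary search, integers only)."""
    lo, hi = 1, 1 << (value.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= value:
            lo = mid
        else:
            hi = mid - 1
    return lo


def next_coprime(start, chosen):
    """Smallest k >= start that is relatively prime to every modulus in chosen."""
    k = start
    while any(gcd(k, m) != 1 for m in chosen):
        k += 1
    return k


def find_moduli(N, count):
    """Sketch only: the starting value and the bounds are assumptions."""
    if count < 3:
        raise ValueError("this sketch needs at least three moduli")
    target = 1 << N                 # the moduli product must reach 2**N
    x = iroot(target, count)        # assumed form of Step 1
    n = max(2, x // 2)              # Step 2: x = 2n if even, x = 2n + 1 if odd
    if count == 3:
        # Assumed form of Step 3: grow n until the range condition holds.
        while (2 * n - 1) * (2 * n) * (2 * n + 1) < target:
            n += 1
    moduli = [2 * n - 1, 2 * n, 2 * n + 1]
    for i in range(count - 3):
        start = 2 * n + 2
        if i == count - 4:
            # The last extra modulus must close the remaining gap to 2**N.
            start = max(start, -(-target // prod(moduli)))
        moduli.append(next_coprime(start, moduli))
    return moduli
```

The three core moduli are always pairwise relatively prime: consecutive integers share no factor, and $2n - 1$ and $2n + 1$ are odd numbers that differ by 2.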
Theorem
The bit efficiency of the present scheme is better than that of the existing scheme of linear complexity.
Proof
We will first prove the result for the three-moduli set and then extend it to the general case.
Given a three-moduli set $\{h_1, h_2, h_3\}$ and another three-moduli set $\{h_1, h_2, h_3\}$
. Now we consider two types of three-moduli set, one for our proposed scheme (i.e., $\{2n', 2n' + 1, 2n' - 1\}$) and another for [9] (i.e., $\{2^{n_1}, 2^{n_1} + 1, 2^{n_1} - 1\}$).
As $2^n$ can always be represented as an even number, i.e., $2n$, but the reverse does not hold.
They are equal when $2n$ can be represented as $2^n$, and less otherwise.
For the general case, we start with the four-moduli set. Now we consider two types of four-moduli set, one for our proposed scheme (i.e., $\{2n', 2n' + 1, 2n' - 1, k\}$) and another for [9]
We have already shown that (2 (2 1) (2 -1)) n n n
. Now, from the construction of the moduli set [9], k is the smallest number relatively prime to $2n'$, $2n' + 1$, $2n' - 1$ and 2 (2 1) (2 -1) 2
The same argument extends to moduli sets of any size.
An example is now given to illustrate the algorithm. Let N = 32.
Therefore, the dynamic range is 0 to $2^{32} - 1$.
For n = 4, we have 
k is calculated in the following way:
Here, 
$k_1$ and $k_2$ are calculated in the following way:
Now we have to find the smallest number
To find $k_2$,
When n = 6, we have
$k_1$, $k_2$ and $k_3$ are calculated in the following way:
Now we have to find the smallest number
To find $k_2$,
Now we have to find the smallest number $k_3 \ge k_3$, where $k_3$ is relatively prime to $2n$, 2
COMPARISON OF BIT EFFICIENCY
In this section we compute the number of bits required to implement the moduli sets generated by the algorithm discussed in the last section. We also compare the bit efficiency of the moduli set proposed by us with some standard moduli sets, up to six-moduli sets. We also present a comparison with other sets for the three-moduli case.
Bits required to implement the moduli set: $\log n_1 + \log n_2 + \log n_3 + \ldots + \log n_p$
Table 2. Comparison of bit efficiency of the proposed scheme with standard approaches for the three-moduli set
In Table 1 we present the bits required for moduli sets of cardinality three, four, five and six generated by the proposed approach. In [9], an excellent scheme is proposed in which the moduli need to be chosen such that all of them are co-prime numbers. This scheme is implemented for up to six-moduli sets. It may be noted that [9] is the most widely accepted scheme; our proposed scheme gives better results in most cases and, in the remaining cases, gives the same bit efficiency as [9]. In Table 2 we compare the bit efficiency of the proposed scheme with standard approaches for the three-moduli set. From these tables it is observed that the proposed scheme requires the fewest bits among the schemes of order O(n). In other words, among the schemes compared, our algorithm generates the most bit-efficient moduli set.
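The kind of comparison reported in this section can be reproduced for the three-moduli case with a short sketch. It assumes that a residue modulo m is stored in $\lceil \log_2 m \rceil$ bits and that each set is the smallest of its form whose product reaches $2^N$; the function names are illustrative and not from the paper.

```python
def residue_bits(m):
    """Bits needed to store a residue in the range 0 .. m - 1."""
    return max(1, (m - 1).bit_length())


def total_bits(moduli):
    """Total width of one RNS word for the given moduli set."""
    return sum(residue_bits(m) for m in moduli)


def consecutive_set(N):
    """Smallest set {2n - 1, 2n, 2n + 1} whose product reaches 2**N."""
    n = 1
    while (2 * n - 1) * (2 * n) * (2 * n + 1) < (1 << N):
        n += 1
    return [2 * n - 1, 2 * n, 2 * n + 1]


def power_of_two_set(N):
    """Smallest set {2**p - 1, 2**p, 2**p + 1} whose product reaches 2**N."""
    p = 1
    while ((1 << p) - 1) * (1 << p) * ((1 << p) + 1) < (1 << N):
        p += 1
    return [(1 << p) - 1, 1 << p, (1 << p) + 1]
```

Under these assumptions, N = 32 gives {1625, 1626, 1627} with 33 bits for the consecutive form and {2047, 2048, 2049} with 34 bits for the power-of-two form. Because every power-of-two set is also a set of the consecutive form, the consecutive search can never need more bits.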
GENERAL ARCHITECTURE OF RECONFIGURABLE RNS PROCESSOR
The general architecture of a reconfigurable RNS processor is shown in Figure 3. Given a moduli set, the hardware complexity depends on the functionalities of the RNS. Because of space constraints, a simplified structure is shown using only three arithmetic operators. It contains one adder, one subtractor and one multiplier. Binary numbers are passed to the processor as inputs and are first converted to RNS numbers. Here the selection of moduli is very important, because a proper selection optimizes the bit efficiency, the area of the processor & the time needed to process a particular function.
After the conversion of the binary number to its corresponding residue representation, the arithmetic operations can be performed. As any binary number produces a set of residues, one per modulus, m copies of each arithmetic unit (adder, subtractor, multiplier etc.) are required to perform an arithmetic operation on a number converted to RNS, where m is the number of moduli used in the scheme. As the residues can be operated on independently, the arithmetic operations can be performed in parallel on the residue set. Figure 4 shows the control & data flows between the various paths. The Programmable Controller can program the RRNS Processor directly, or Programmable Memory can be used to store the bit stream. The Programmable Controller is governed by the general-purpose CPU.
In general, all modular arithmetic operations, such as binary to RNS conversion or RNS addition and multiplication, are implemented on chip using one of two methods [25]. The first is table look-up, implemented by a PLA. The second is the hybrid method, which combines legacy hardware, such as full adders, with a table look-up that converts the output of the legacy hardware to the correct residue format.
The PLA gives faster hardware than the hybrid method, but the latter takes less chip area than the former. A PLA can be a good choice for a modern reconfigurable RNS processor because of its regular, dense structure & easy interconnection.
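The trade-off between the two methods can be made concrete with a software model of modular addition. This is only a sketch under stated assumptions: Python lists stand in for the PLA, a plain integer addition stands in for the full adders, and all function names are hypothetical.

```python
def build_add_table(m):
    """Pure look-up method: entry [a][b] holds (a + b) mod m."""
    return [[(a + b) % m for b in range(m)] for a in range(m)]


def build_correction_table(m):
    """Hybrid method: maps a plain binary sum in 0 .. 2m - 2 to its residue."""
    return [s % m for s in range(2 * m - 1)]


def add_lookup(table, a, b):
    """One table access, no arithmetic at all."""
    return table[a][b]


def add_hybrid(correction, a, b):
    """Ordinary adder first, then a small table fixes the residue format."""
    return correction[a + b]


m = 7
full_table = build_add_table(m)
correction = build_correction_table(m)
full_entries = m * m            # grows quadratically with the modulus
hybrid_entries = len(correction)  # grows linearly with the modulus
```

In this model the pure look-up answers in a single access but stores m * m entries, while the hybrid version stores only 2m - 1 entries at the cost of an addition before the look-up.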
The area & the speed of the reconfigurable RNS processor depend on the number of moduli in the moduli set, the method used for their generation and the scheduling algorithm used.
DESIGN PROCEDURE
In the reconfigurable architecture, there is no fixed path between the device units; the path can be changed depending on the requirements. MUXes are placed before the inputs of the device units and act as switches that select a specific path according to a particular select condition.
For example, suppose there are x adders, y subtractors & z multipliers in the processor, and the chip accepts k external inputs. In general, for every arithmetic unit with 2 inputs, the MUX in front of each input must have x + y + z + k inputs (x from the outputs of the adders, y from the outputs of the subtractors, z from the outputs of the multipliers and k external inputs). So the MUX must have $\lceil \log_2 (x + y + z + k) \rceil$ select lines. In our example (Figure 3), for simplicity, we have taken x = y = z = 1 and k = 2, so we can use 5-to-1 MUXes with 3 select lines before all the arithmetic devices. As the outputs of all the units are fed to the inputs of all the unit devices, the processor can compute any combination of the arithmetic operations.
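The MUX sizing rule above is easy to check numerically. The helper below is a hypothetical illustration; it uses integer bit lengths instead of a floating-point logarithm to obtain the ceiling of the base-2 logarithm.

```python
def mux_size(x, y, z, k):
    """Return (number of MUX inputs, number of select lines).

    x, y, z are the counts of adders, subtractors and multipliers and
    k is the count of external inputs, as in the text above.
    """
    inputs = x + y + z + k
    # (inputs - 1).bit_length() equals ceil(log2(inputs)) for inputs >= 1.
    select_lines = (inputs - 1).bit_length()
    return inputs, select_lines


example = mux_size(1, 1, 1, 2)  # the configuration used for Figure 3
```

For x = y = z = 1 and k = 2 this returns 5 inputs and 3 select lines, matching the 5-to-1 MUXes described in the text.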
In our work, the aim is to design an RNS processor that can be reconfigured dynamically to compute some pre-determined functions. For this, the unit operations need to be analysed & sequenced in terms of the inputs & arithmetic operations. As the arithmetic operations are expressed in terms of the select conditions on the MUXes, the inputs as well as the select conditions need to be stored in a LUT. The bit sequences are stored in the LUT block-wise, and each block has a particular address. When the address of some function is given, its inputs & select conditions are passed to the input MUXes.
Table 3. Comparison of area given by [28] for three-, four-, five- and six-moduli sets
Table 4. Comparison of area given by standard moduli schemes for the three-moduli set
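A small software model may help to picture how stored select conditions drive the datapath. Everything about the encoding below is an assumption made for illustration: sources 0 and 1 are the two external inputs, sources 2, 3 and 4 are the outputs of the adder, subtractor and multiplier, and each LUT block is a list of steps that give the two MUX select values for every unit used in that step.

```python
ADD, SUB, MUL = 2, 3, 4  # MUX source indices of the three unit outputs

OPS = {
    ADD: lambda a, b: a + b,
    SUB: lambda a, b: a - b,
    MUL: lambda a, b: a * b,
}

# Hypothetical LUT: address -> steps; each step maps a unit to the select
# values of its two input MUXes. The final step must use exactly one unit.
LUT = {
    0: [{ADD: (0, 1), SUB: (0, 1)}, {MUL: (ADD, SUB)}],  # (a + b) * (a - b)
    1: [{MUL: (0, 0)}, {ADD: (MUL, 1)}],                 # a * a + b
}


def run_channel(address, a, b, modulus):
    """Evaluate the function stored at `address` in one residue channel."""
    program = LUT[address]
    sources = [a % modulus, b % modulus, 0, 0, 0]
    for step in program:
        # Every unit in a step reads the sources as they were before the step.
        results = {unit: OPS[unit](sources[sa], sources[sb]) % modulus
                   for unit, (sa, sb) in step.items()}
        for unit, value in results.items():
            sources[unit] = value
    (final_unit,) = program[-1]  # the single unit of the last step
    return sources[final_unit]


def run(address, a, b, moduli):
    """Run the same stored program independently in every residue channel."""
    return tuple(run_channel(address, a, b, m) for m in moduli)
```

With five possible sources per MUX, each select value fits in 3 bits, in line with the 5-to-1 MUXes of the example architecture.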
IMPLEMENTATION
The proposed scheme is implemented on a Virtex-5 board (XC5VLX30). Verilog code corresponding to the Reconfigurable RNS Processor is generated, and it is synthesized & simulated using Xilinx tools.
Most equations containing a combination of arithmetic operations such as addition, subtraction & multiplication can be implemented using the proposed scheme. Some examples are given here to illustrate the scheme. Suppose POWER is a function unit (not shown in the Figure). POWER uses MULTIPLIER internally. When the select inputs of the corresponding MUXes are given, $(x_1, x_2, x_3)$ is multiplied by itself $(y_1, y_2, y_3)$ times using the same procedure shown before.
In Table 1 we compare the area required for moduli sets of cardinality three, four, five and six. The moduli set G M is generated such that the $n_i$'s ($i = 2, 3, \ldots, t$) are chosen so that all these moduli are co-prime numbers. It may be noted that [28] is the most widely accepted scheme, hence the comparison has been done for three-, four-, five- and six-moduli sets. In Table 4 the comparison is made for the three-moduli set with different schemes. The results are illustrated graphically in Figure 2. From this figure it is observed that the area of the Reconfigurable RNS Processor varies with the number of moduli.
CONCLUSION
In this paper we proposed an algorithm to generate a moduli set of any finite cardinality for a given dynamic range and gave a proof of correctness for the proposed algorithm. We have also shown that the bit efficiency of the proposed scheme is better than that of the other schemes given in the literature. In future we will work on how the parameters bit efficiency, hardware complexity and time can be optimized for a reconfigurable RNS processor. Another moduli set, better than our proposed scheme with respect to the three parameters mentioned above, may be proposed, with which optimized values of these parameters can be obtained.
