Abstract: This paper presents the design of a Finite Impulse Response (FIR) filter based on the residue number system (RNS). The RNS was chosen because it offers high speed and low power dissipation. The architecture is based on a single RNS multiplier-accumulator (MAC) unit.
Introduction
The Residue Number System (RNS) has been recognized as one of the efficient alternative number systems that can be used for high-speed hardware implementation of Digital Signal Processing algorithms. In RNS, an integer with a large word length is split into several relatively small residues by a specific moduli set. Addition and multiplication of these residues are performed in parallel. Each residue is handled by a so-called RNS channel, which implements modular arithmetic. As a result, RNS arithmetic does not suffer from inter-channel carry propagation delay. System performance can be increased by selecting channels with small word lengths and therefore short internal carry propagation delay [1, 2]. Because of this feature, many Digital Signal Processing architectures based on RNS have been introduced in the literature [3, 4]. RNS is thus an efficient method for implementing high-speed Finite Impulse Response (FIR) filters, where the dominant operations are addition and multiplication. Implementations of RNS-based FIR filters show that performance can be considerably increased in comparison with the traditional two's complement binary number system [3-5].
The basis of each RNS is a moduli set that consists of pairwise relatively prime numbers [6]. Many moduli sets have been introduced for RNS [7-9]. Among these, the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ is the best known. This moduli set leads to simple forward and reverse converters, but the performance of the RNS arithmetic unit is limited by the timing of the modulo $2^n + 1$ channel. The modulo $2^n + 1$ operations are complex and are the bottleneck of the RNS arithmetic unit. Hence, the moduli set $\{2^n, 2^n - 1, 2^{n+1} - 1\}$ [9, 10] is used in this paper as an alternative to the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$.
A technique for implementing a finite impulse response (FIR) digital filter in a residue number system (RNS) is presented in this paper. From the viewpoint of FIR architecture, a single RNS multiplier-accumulator (MAC)-based architecture is applied to the FIR filter design of each channel.
The paper is organized as follows. After an introduction to RNS and reverse conversion algorithms, the architecture of constant-coefficient FIR filters designed for the three-moduli set $\{2^n, 2^n - 1, 2^{n+1} - 1\}$ is investigated in the third section. Section 4 discusses a method for converting a binary representation into residues, and Section 5 discusses the MRC architecture for RNS-to-binary conversion. Finally, in Section 6, the implementation of a 27th-order lowpass FIR filter is used to demonstrate the design parameters that must be considered in designing an RNS filter.
Background
A residue number system (RNS) is defined in terms of $k$ integers $m_1, m_2, \ldots, m_k$ that are pairwise relatively prime. That is, $\gcd(m_i, m_j) = 1$ for $i \neq j$.
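The definition above can be illustrated with a minimal Python sketch. It assumes the moduli set {64, 63, 127} used later in the paper; the function names (`is_pairwise_coprime`, `to_rns`, `rns_add`, `rns_mul`) are hypothetical helpers introduced only for this illustration and are not part of the proposed hardware.

```python
from math import gcd, prod

def is_pairwise_coprime(moduli):
    """True if every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1
               for i, a in enumerate(moduli)
               for b in moduli[i + 1:])

def to_rns(x, moduli):
    """Represent the integer x by its residues, one per channel."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli):
    """Channel-wise modular addition; no carries cross between channels."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli):
    """Channel-wise modular multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

MODULI = (64, 63, 127)          # {2^n, 2^n - 1, 2^(n+1) - 1} for n = 6
DYNAMIC_RANGE = prod(MODULI)    # results are unique below this value
```

As long as the true result stays below `DYNAMIC_RANGE`, adding or multiplying the residues channel by channel gives the residues of the ordinary sum or product.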
A possible realization of an FIR filter [16, 17] in transposed form is shown in Fig. 1. An $N$-tap FIR filter requires $N$ multiplications and $N - 1$ additions. Multiplication is very costly in terms of hardware and computation time because the arithmetic unit performs fixed-point computation on numbers represented in two's complement form. The arithmetic unit consists of a dedicated hardware multiplier and an adder connected to the accumulator, so that the multiply-accumulate operation can be executed efficiently. Fig. 3 shows the MAC architecture suitable for a set of three moduli.

The main components of an RNS system are a forward converter, parallel arithmetic channels and a reverse converter. The forward converter encodes a binary number into its residue representation with respect to the moduli set. Each arithmetic channel performs modular multiplication and accumulation for one modulus of the set. The reverse converter decodes a residue-represented number into its equivalent binary number. The arithmetic channels operate fully in parallel without any dependency, which results in a considerable speed enhancement. The architecture has two separate memory spaces that can be accessed simultaneously. One memory stores the coefficients and the other stores the input data samples, both in residue form. FIR filtering is achieved in the RNS domain by using triple-modulo FIR filter blocks. The implementation is generic and assumes three moduli $(m_1, m_2, m_3)$ selected to meet the desired filter precision requirements. The FIR filtering is performed as a series of modulo MAC operations for each of the moduli $m_1$, $m_2$ and $m_3$.
The Multiply-Accumulate Unit
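The counterpart of (4) for RNS FIR filters can be expressed as:

$$\langle y(n) \rangle_{m_i} = \left\langle \sum_{k=0}^{N-1} \langle b_k \rangle_{m_i} \, \langle x(n-k) \rangle_{m_i} \right\rangle_{m_i} \qquad (5)$$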
The design of an FIR filter modulo $m_i$ in (5) is therefore a sum-of-products algorithm, so one modulo $m_i$ MAC unit is needed. The coefficient residues and the input sample residues $\langle x(n-k) \rangle_{m_i}$ are the two input sequences. In the MAC unit (Fig. 3), the inputs are multiplied and added to zero, which is initially stored in the memory (register). The sum is then stored in the memory unit. In the next clock cycle, the next inputs are multiplied and added to the previous value stored in the memory. Note that both operations are modular. The output residue $\langle y(n) \rangle_{m_i}$ is derived recursively after $N$ additions and multiplications. Hence the final result $y(n)$ is generated by the residue-to-binary conversion of the RNS result.
The following text describes the modular adder and the modular multiplier that can be used to implement the MAC unit.
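The recursive MAC operation described above can be sketched in Python as follows. This is a behavioural model only, under the assumption that the accumulator starts at zero and that one coefficient/sample pair is processed per clock cycle; the names `mac_channel` and `fir_output_rns` are hypothetical.

```python
def mac_channel(coeffs, window, m):
    """One modulo-m MAC channel: N modular multiplications and additions.

    coeffs -- integer filter coefficients (may be negative)
    window -- input samples x(n), x(n-1), ..., one per coefficient
    """
    acc = 0                                   # register initially holds zero
    for b, x in zip(coeffs, window):
        acc = (acc + (b % m) * (x % m)) % m   # modular multiply, then modular add
    return acc

def fir_output_rns(coeffs, window, moduli):
    """Residues of one output sample; the channels are independent of each other."""
    return tuple(mac_channel(coeffs, window, m) for m in moduli)
```

For example, `fir_output_rns([3, -2, 5], [7, 4, 1], (64, 63, 127))` gives the residues of 3*7 - 2*4 + 5*1 = 18 in each channel. A residue-to-binary conversion is still needed to recover the output sample itself.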
Modulo addition
The modulo $(2^n - 1)$ (and modulo $2^{n+1} - 1$) addition algorithm that avoids the double representation of zero is defined by [18]:
where $C_{out}$ is the carry-out of the addition $x + y$. Fig. 4a depicts the architecture of the corresponding hardware operator, which requires a carry-propagate adder, a NOR gate and a decrementer [19]. Fig. 4b shows the internal logic circuit schematic of a decrementer, based on the conventional $n$-bit ripple-borrow half subtractor. Only $n$ half subtractors are used to construct the decrementer. The modulo $2^n$ adder can be realized directly with an $n$-bit adder whose overflow is ignored.
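As an illustration, the Python sketch below models one common single-zero formulation of modulo $2^n - 1$ addition: add with a carry-in of one, keep the low word if a carry-out occurs, and decrement otherwise. Whether this matches the exact equation of [18] and the circuit of Fig. 4a is an assumption; the function names are hypothetical.

```python
def add_mod_2n_minus_1(x, y, n):
    """Modulo (2^n - 1) addition with a single representation of zero.

    Operands are assumed to lie in [0, 2^n - 2], so the all-ones pattern
    never appears at the inputs or at the output.
    """
    mask = (1 << n) - 1              # equals the modulus 2^n - 1
    assert 0 <= x < mask and 0 <= y < mask
    s = x + y + 1                    # n-bit addition with carry-in 1
    if s >> n:                       # carry-out set: sum reached 2^n - 1 or more
        return s & mask              # keep the low n bits
    return s - 1                     # no carry-out: undo the carry-in (decrement)

def add_mod_2n(x, y, n):
    """Modulo 2^n addition: an n-bit adder whose overflow is ignored."""
    return (x + y) & ((1 << n) - 1)
```

With this formulation a sum equal to $2^n - 1$ is mapped to zero rather than to the all-ones word, which is what removes the second representation of zero.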
Modulo multiplication
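Modulo $(2^n - 1)$ multiplication can be formulated as

$$\langle x_i \times y_i \rangle_{2^n - 1} = \left\langle \langle x_i \times y_i \rangle_{2^n} + (x_i \times y_i) \ \mathrm{div}\ 2^n \right\rangle_{2^n - 1} \qquad (7)$$

where $\langle x_i \times y_i \rangle_{2^n}$ corresponds to the low output word and $(x_i \times y_i) \ \mathrm{div}\ 2^n$ to the high output word of the multiplication $x_i \times y_i$. Therefore, modulo $(2^n - 1)$ multiplication can be accomplished by an $n$-bit unsigned multiplication followed by an $n$-bit modulo $2^n - 1$ addition. Equation (7) can be rewritten as a sum of partial products:

$$\langle x_i \times y_i \rangle_{2^n - 1} = \left\langle \sum_{k=0}^{n-1} pp_{i,k} \right\rangle_{2^n - 1}$$

where $pp_{i,k}$ is the $k$-th partial product modulo $(2^n - 1)$.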
Note that all $n$-bit partial products $pp_{i,k}$ have the same magnitude (as opposed to ordinary multiplication, where the partial products have increasing magnitude), i.e., the number of partial product bits to add is the same for all bit positions. The partial product generation for four-bit inputs is shown in Table 1. In the conventional memory-based technique, a ROM stores the results of the multiplication for all possible values. Here, this approach is extended to obtain a memoryless implementation.
The architecture of the proposed implementation of modulo $(2^n - 1)$ multiplication is shown in Fig. 5a. Assuming a coefficient word length of 4 bits and an input sample word length of 4 bits, Fig. 5a shows the hierarchical decomposition of a $4 \times 4$ Wallace tree logic. For $4 \times 4$ bits, four partial products are generated and added in parallel. The partial sums are added using two carry-save adders (CSA) and a carry-propagate adder with end-around carry (CPA with EAC). The principle of the proposed memoryless implementation of the partial product generator is shown in Fig. 5b. It consists of $n$ 2-to-1 multiplexers, where $n$ is the input sample word length. The partial product is generated by connecting zero and the coefficient value to the MUX data inputs, connecting the input data bits to the select input, and circularly shifting the MUX output $s - 1$ bits to the left, for $1 \leq s \leq n$.
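The multiplexer-and-rotate scheme described above can be modelled in a few lines of Python. This is a sequential behavioural sketch, not the Wallace tree of Fig. 5a: the partial products are accumulated one at a time with end-around carry. The function names are hypothetical, and the final mapping of the all-ones word to zero is an assumption made to keep a single representation of zero.

```python
def rotl(v, s, n):
    """Rotate the n-bit word v left by s positions (circular shift)."""
    s %= n
    mask = (1 << n) - 1
    return ((v << s) | (v >> (n - s))) & mask

def partial_products(coeff, sample, n):
    """One partial product per sample bit: a 2-to-1 MUX selects 0 or the
    coefficient, and the MUX output is rotated left by the bit position."""
    return [rotl(coeff if (sample >> k) & 1 else 0, k, n) for k in range(n)]

def mul_mod_2n_minus_1(coeff, sample, n):
    """Modulo (2^n - 1) product as a sum of rotated partial products."""
    m = (1 << n) - 1
    total = 0
    for pp in partial_products(coeff, sample, n):
        total += pp
        total = (total & m) + (total >> n)   # end-around carry
    return 0 if total == m else total        # all ones also represents zero
```

The rotation works because $2^n \equiv 1 \pmod{2^n - 1}$, so a bit shifted out at the top re-enters at the bottom with the same weight. For `n = 4`, `mul_mod_2n_minus_1(7, 5, 4)` adds the partial products 7 and 13 and returns 5, which equals 35 mod 15.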
Assuming the multiplier Y and multiplicand 
were , , 
Binary to RNS Conversion
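An integer $X$ in the range $[0, M)$ can be written in radix-$2^n$ notation as [20-22]:

$$X = \sum_{i} N_i \, 2^{n i}$$

where each $N_i$ is an $n$-bit word. It can be uniquely represented in RNS by the set $(x_1, x_2, x_3)$ for the moduli set $\{2^n, 2^n - 1, 2^{n+1} - 1\}$. The residue $x_1$ can be obtained as the remainder of the division of $X$ by $2^n$, which can be accomplished by truncating the value $X$, since:

$$x_1 = \langle X \rangle_{2^n} = N_0$$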
For the $2^n - 1$ and $2^{n+1} - 1$ channels, the calculation of the corresponding residues is more complex, since the final result of the conversion depends on the value of all the bits of $X$. Instead of using a division operation to calculate the $2^n - 1$ residue, which is complex and expensive in terms of both area and speed, this calculation can be performed as a sequence of additions, as described below:

$$x_2 = \langle X \rangle_{2^n - 1} = \left\langle \sum_{i} N_i \, 2^{n i} \right\rangle_{2^n - 1} \qquad (12)$$

By taking the equation:

$$\langle 2^{n i} \rangle_{2^n - 1} = 1$$

(12) can be rewritten as:

$$x_2 = \left\langle \sum_{i} N_i \right\rangle_{2^n - 1}$$

Thus the conversion of $X$ to modulo $2^n - 1$ can be performed simply by adding the $N_i$ components of $X$ modulo $2^n - 1$.
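In an identical manner, the $2^{n+1} - 1$ residue can be calculated as:

$$x_3 = \langle X \rangle_{2^{n+1} - 1} = \langle N'_0 + N'_1 + N'_2 \rangle_{2^{n+1} - 1}$$

where $N'_0$, $N'_1$ and $N'_2$ are $(n+1)$-bit numbers.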
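The forward conversion described in this section can be sketched in Python as follows. The sketch assumes a non-negative input and the moduli order $(2^n, 2^n - 1, 2^{n+1} - 1)$; the helper names are hypothetical, and the loop adds as many chunks as the input contains rather than a fixed number.

```python
def residue_2k_minus_1(x, k):
    """x mod (2^k - 1), computed by adding the k-bit chunks of x
    with end-around carry instead of dividing."""
    m = (1 << k) - 1
    acc = 0
    while x:
        acc += x & m                      # next k-bit component of x
        x >>= k
        acc = (acc & m) + (acc >> k)      # fold the carry back in
    return 0 if acc == m else acc         # all ones also represents zero

def binary_to_rns(x, n):
    """Residues of x for the moduli (2^n, 2^n - 1, 2^(n+1) - 1)."""
    x1 = x & ((1 << n) - 1)               # truncation to the low n bits
    x2 = residue_2k_minus_1(x, n)         # sum of n-bit chunks
    x3 = residue_2k_minus_1(x, n + 1)     # sum of (n+1)-bit chunks
    return (x1, x2, x3)
```

For `n = 6` this produces residues for the moduli (64, 63, 127). The only operations used are masking, shifting and addition, which is the point of the chunk-summing approach.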
RNS to Binary Conversion
Given the RNS number $(y_1, y_2, y_3)$ with respect to the moduli set $\{2^n, 2^n - 1, 2^{n+1} - 1\}$, the proposed algorithm computes the binary equivalent of the RNS number using the MRC technique. For the proposed moduli set $k = 3$, so (2) reduces to:

$$X = a_1 + a_2 m_1 + a_3 m_1 m_2$$

The multiplicative inverses for the proposed moduli set are $c_{12} = 1$, $c_{13} = 2$ and $c_{23} = -2$ [9]. The mixed-radix digits are computed using (3):

$$a_1 = y_1, \quad a_2 = \langle (y_2 - a_1)\, c_{12} \rangle_{m_2}, \quad a_3 = \langle ((y_3 - a_1)\, c_{13} - a_2)\, c_{23} \rangle_{m_3}$$
The proposed architecture of the RNS-to-binary converter is depicted in Fig. 6a. It contains two modulo $2^{n+1} - 1$ subtractors, one modulo $2^n - 1$ subtractor and one traditional borrow propagation subtractor (BPS).
The modulo $(2^n - 1)$ subtraction can be expressed as follows:

The borrow-out signal ($B_{out}$), which results from the subtraction of $x$ and $y$, can be used in computing the modulo $2^n - 1$ subtraction. This is due to the following observations:

A modulo $2^n$ subtractor with the borrow-out fed back into the borrow input (to achieve end-around borrow) can then be used to implement the modulo $2^n - 1$ subtractor (20). This type of subtractor is also known as the Borrow-Propagate Subtractor with End-Around Borrow (BPS with EAB). Note that the proposed modulo $(2^n - 1)$ subtraction algorithm, which avoids the double representation of zero, covers the whole dynamic range, while a modulo subtractor based on the CPA with EAC does not [9]. For example, a residue-to-binary converter that uses a modulo $2^n - 1$ subtractor based on the CPA with end-around carry generates wrong results in some cases. If the modulo subtractor is implemented as an ordinary subtractor with end-around borrow (EAB), where the borrow output depends on the borrow input, additional combinational logic is needed to eliminate an unwanted race condition. A modulo subtractor based on the BPS with EAB is shown in Fig. 6b, and one solution for the decrementer is shown in Fig. 4b.
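The end-around-borrow idea can be illustrated with the following Python sketch. It is a behavioural model under the assumption that both operands lie in $[0, 2^n - 2]$; it does not model the feedback loop or the race condition mentioned above, and the function name is hypothetical.

```python
def sub_mod_2n_minus_1(x, y, n):
    """Modulo (2^n - 1) subtraction using end-around borrow.

    If x - y produces a borrow, the borrow is fed back and subtracted
    once more from the n-bit difference.
    """
    mask = (1 << n) - 1                  # equals the modulus 2^n - 1
    assert 0 <= x < mask and 0 <= y < mask
    borrow = 1 if x < y else 0           # borrow-out of the n-bit subtraction
    return (x - y - borrow) & mask       # end-around borrow, then keep n bits
```

When a borrow occurs, the n-bit difference equals $x - y + 2^n$, and subtracting the fed-back borrow turns it into $x - y + (2^n - 1)$, the correct residue. Because equal operands give exactly zero and no borrow, the all-ones word never appears at the output.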
Since $c_{12} = 1$, the mixed-radix digit $a_2$ is available after a single modulo subtraction. For $a_3$, a modulo subtraction is followed by a left circular shift by 1 bit (since $c_{13} = 2$); the third operation is a modulo subtraction with $a_2$ as minuend and the result of the circular shift as subtrahend; and the last operation is a left circular shift of the result by 1 bit (since $c_{23} = -2$).
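The sequence of operations above can be checked with a short Python sketch of mixed-radix conversion for this moduli set. It assumes the moduli order $(m_1, m_2, m_3) = (2^n, 2^n - 1, 2^{n+1} - 1)$ and uses ordinary `%` arithmetic in place of the hardware subtractors and circular shifts; `rns_to_binary` and `to_signed` are hypothetical names, and the signed interpretation in `to_signed` is an assumption added for illustration.

```python
def rns_to_binary(y1, y2, y3, n):
    """Mixed-radix conversion with c12 = 1, c13 = 2 and c23 = -2."""
    m1, m2, m3 = 1 << n, (1 << n) - 1, (1 << (n + 1)) - 1
    a1 = y1
    a2 = (y2 - a1) % m2             # c12 = 1: one modular subtraction
    t = (2 * (y3 - a1)) % m3        # subtraction, then multiply by c13 = 2
    a3 = (2 * (a2 - t)) % m3        # a2 as minuend; the sign of c23 = -2 is absorbed
    return a1 + a2 * m1 + a3 * m1 * m2

def to_signed(x, n):
    """Interpret the upper half of the dynamic range as negative values."""
    M = (1 << n) * ((1 << n) - 1) * ((1 << (n + 1)) - 1)
    return x - M if 2 * x >= M else x
```

For the residues (58, 57, 121) with `n = 6`, the mixed-radix digits are 58, 62 and 126, and `to_signed(rns_to_binary(58, 57, 121, 6), 6)` returns -6.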
Suppose that MRDs
As shown in (18) and (19), three operands must be added to obtain $a_5$.
These three operands are simplified to two $(2n + 1)$-bit words $a_4$ and $a_5$, since $a_2$ 
6 Filter Performance
The FIR filter was designed and its coefficients were computed in MATLAB using the Parks-McClellan algorithm [23]. The filter coefficients are shown in Table 2 in double precision and in 9-bit precision, including the sign bit, in integer notation.
The integer values in the third column of Table 2 are obtained from the floating-point values (second column) in two steps. The first step is the conversion of the floating-point filter coefficients b into the binary string b_binary using two MATLAB functions, Q_1=quantizer('round',Format) and b_binary=num2bin(Q_1,b). The value Format in the quantizer function sets the parameters of the binary numbers, [wordlength,fractionlength], for signed fixed-point mode. For 9-bit precision, wordlength=9 and fractionlength=8.
Table 2
The 27th-order FIR lowpass filter coefficients for moduli set (2
The second step is the conversion of the binary string b_binary into an integer value using two further MATLAB functions: q_1=quantizer('round', Format) and b_int=bin2num(q_1,b_binary). In this case the fractionlength in Format is equal to zero, i.e. Format=[9, 0]. Finally, the integer values of the filter coefficients are transformed into RNS numbers. The MATLAB function mod can be used for the forward conversion of the coefficients. This paper investigates the binary-to-residue converter for the moduli set {64,63,127}. The following example describes the fixed-point-to-RNS conversion of coefficient $b_1$. The double-precision value of filter coefficient $b_1$ is -0.023540422135223, which is converted to the binary number b_binary=111111010, then to the integer b_int=-6, and finally to the RNS number b_RNS=(58, 57, 121).
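The worked example for $b_1$ can be reproduced with the Python sketch below. It is a stand-in for the MATLAB steps, not a reimplementation of `quantizer`, `num2bin` or `bin2num`: it assumes round-to-nearest with saturation, and the handling of exact ties may differ from MATLAB. The function names are hypothetical.

```python
def quantize_coefficient(b, wordlength=9, fractionlength=8):
    """Scale, round and saturate a coefficient to a signed fixed-point integer.

    Returns the integer value and its two's complement bit string.
    """
    b_int = round(b * (1 << fractionlength))        # round to nearest
    lo = -(1 << (wordlength - 1))
    hi = (1 << (wordlength - 1)) - 1
    b_int = max(lo, min(hi, b_int))                 # saturate to the word length
    b_binary = format(b_int & ((1 << wordlength) - 1), f"0{wordlength}b")
    return b_int, b_binary

def coefficient_to_rns(b_int, moduli=(64, 63, 127)):
    """Forward conversion of a signed integer coefficient; Python's %
    returns a non-negative residue for a positive modulus."""
    return tuple(b_int % m for m in moduli)
```

Calling `quantize_coefficient(-0.023540422135223)` gives the integer -6 and the bit string `111111010`, and `coefficient_to_rns(-6)` gives (58, 57, 121), in agreement with the values quoted in the text.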
The simulation, carried out in MATLAB, shows the effect of this design approach on the filter. Fig. 7 shows a plot of the ideal filter (dotted line) and the actual output. The RNS-based FIR filter is shown to have satisfactory attenuation performance. Assume that the data sequence is quantized to 10 bits (including sign) and that the filter must be implemented without rounding error. An absolute upper bound on the filter response $|y(n)|$ is given by (26): 
The moduli set {63,64,127} provides a dynamic range of 18.96 bits, which is adequate for most practical situations, since the bound of 18.86 bits given by (26) is extremely pessimistic. To obtain the impulse response of the whole filter, the mixed-radix conversion technique can be used to convert a number represented in the residue system into the conventional number system, as shown in Fig. 6. The impulse response of this digital filter is shown in Fig. 9.
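The dynamic range figure can be checked numerically with the sketch below. The overflow check uses the usual worst-case bound, the largest input magnitude times the sum of the absolute coefficient values; that this is the form of (26) is an assumption, as is the requirement that the dynamic range cover both signs. The function names are hypothetical.

```python
from math import log2, prod

def dynamic_range_bits(moduli):
    """Bits covered by the dynamic range M, the product of the moduli."""
    return log2(prod(moduli))

def fits_without_overflow(int_coeffs, input_bits, moduli):
    """Check a worst-case output bound against the signed dynamic range.

    int_coeffs -- integer (quantized) filter coefficients
    input_bits -- input word length including the sign bit
    """
    x_max = 1 << (input_bits - 1)                    # largest input magnitude
    bound = x_max * sum(abs(c) for c in int_coeffs)  # worst-case |y(n)|
    return 2 * bound < prod(moduli)                  # room for both signs
```

For the moduli (63, 64, 127), `dynamic_range_bits` returns a value just under 19 bits. `fits_without_overflow` can then be called with the integer coefficients of Table 2 and `input_bits=10` to test whether the filter output always stays inside the dynamic range.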
The quantization error in the impulse response, resulting from quantizing the coefficients to 9 bits, is shown in Fig. 9 below. It can be seen that 9 bits are sufficient to keep the error below $2 \cdot 10^{-2}$.
Conclusion
The Residue Number System has been recognized as one of the efficient alternative number systems that can be used for high-speed hardware implementation of Digital Signal Processing algorithms. However, forward and reverse converters are needed as interfaces between RNS and conventional binary digital systems. The overhead of these converters can offset the speed advantage of RNS, and for this reason a lot of research has been devoted to the design of efficient reverse converters.
This paper presents a study of state-of-the-art digital signal processing architectures designed for the recently introduced large-dynamic-range three-moduli RNS sets.
The application of RNS to constant-coefficient FIR filters has not yet been thoroughly researched in the literature. Therefore, based on our preliminary research, we propose to develop (i) new residue number systems that balance inter-channel slack and therefore maximize the use of the clock cycle; (ii) residue channel architectures that exploit slack balancing for low power; and (iii) a characterized prototype circuit.
