Abstract-This paper proposes a unified architecture for designing Residue Number System (RNS) based processors for moduli sets with an arbitrary number of channels. Recently, new RNS moduli sets have been proposed in order to increase the dynamic range and reduce the width of the channels. The proposed architecture allows designing forward and reverse RNS converters, as well as the arithmetic operators of each modulo channel. The forward and reverse conversions are implemented using channel arithmetic units, resulting in a very compact architecture. Moreover, the arithmetic operations supported at the channel level include addition, subtraction, and multiplication with accumulation capability. The presented results suggest that the proposed RNS architecture leads to compact and scalable implementations, with competitive, or even better, performance when compared with the related state of the art, considering fixed moduli sets. Experimental results suggest gains of 17% in the delay of arithmetic operations, with an area reduction of 23% regarding the state of the art.
I. INTRODUCTION
Residue Number System (RNS) is a non-weighted numbering system which uses remainders to represent numbers [1]. The modular characteristics of RNS offer the potential for high-speed, parallel arithmetic, given its carry-free arithmetic properties. The basic arithmetic operations (add, subtract, and multiply) are performed over operands that are significantly shorter than the equivalent binary representation. Typical applications for RNS are in Digital Signal Processing (DSP), namely filtering, convolutions, correlations, FFT computations, and cryptography [2]-[8].
An RNS moduli set is composed of several channels $m_i$, where the $m_i$ are pairwise relatively prime positive integers. A number $X$ is represented in RNS by its residues $x_i = \langle X \rangle_{m_i}$, where $x_i$ is the remainder of the division of $X$ by $m_i$. The forward conversion from the weighted number system to RNS (binary-to-RNS) and the reverse conversion (RNS-to-binary) introduce an overhead to the system. Despite this overhead, the faster arithmetic operations in each RNS channel can lead to better performance than the weighted representation. The choice of the moduli set is one of the most important aspects in obtaining an optimized and balanced system that exploits the parallelism for the required Dynamic Range (DR). The use of moduli sets with modulo $\{2^n \pm k\}$ channels, with unrestricted $k$ values, is useful in the definition of larger RNS moduli sets, resulting in arithmetic circuits with better performance, given that the operands in each channel require a smaller number of bits.
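As a small illustration of the representation described above, the following Python sketch converts two integers to residues and performs channel-wise addition and multiplication. The moduli set used, $\{2^n - 3, 2^n - 1, 2^n + 1, 2^n + 3\}$ with $n = 4$, is only an illustrative choice, and the function names are hypothetical.

```python
from math import gcd, prod

def to_rns(x, moduli):
    """Return the residues x_i = x mod m_i, one per channel."""
    return [x % m for m in moduli]

def pairwise_coprime(moduli):
    """Check that every pair of moduli is relatively prime."""
    return all(gcd(a, b) == 1
               for i, a in enumerate(moduli)
               for b in moduli[i + 1:])

# Illustrative set {2^n - 3, 2^n - 1, 2^n + 1, 2^n + 3} for n = 4.
moduli = [13, 15, 17, 19]
assert pairwise_coprime(moduli)
M = prod(moduli)  # size of the dynamic range

a, b = 1234, 567
ra, rb = to_rns(a, moduli), to_rns(b, moduli)

# Each channel works independently on short operands; no carries cross channels.
sum_rns = [(x + y) % m for x, y, m in zip(ra, rb, moduli)]
prod_rns = [(x * y) % m for x, y, m in zip(ra, rb, moduli)]
```

The channel results match the residues of the ordinary integer sum and product, which is the property that allows the arithmetic to run in parallel on narrow operands.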
The state of the art is mainly focused on RNS moduli sets based on co-prime numbers of the form $\{2^n - 1, 2^n, 2^n + 1\}$ [9]-[16]. Different moduli sets for RNS processors have been proposed in order to increase the Dynamic Range or to reduce the width of the RNS channels, such as the moduli sets $\{2^n - 3, 2^n - 1, 2^n + 1, 2^n + 3\}$ [12], [17], $\{2^n - 1, 2^{n+\beta}, 2^n + 1\}$ [13], and $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$ or $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} + 1\}$ [18]. Also, a new 8n-bit moduli set has been proposed in [19], using only modulo $\{2^n \pm 1\}$ [20] channel operations. Recently, a new moduli set and converter for DR up to (8n + 1)-bit have been proposed in [21]. This moduli set requires arithmetic units modulo $\{2^n \pm 1\}$ [20], and generic arithmetic units modulo $\{2^n \pm k\}$ [22] for the modulo $\{2^n - 2^{(n+1)/2} + 1\}$ operations.
The design of direct converters can be divided into two main categories, namely those for specific moduli sets and more generic approaches. Currently, two structures for specific moduli sets, with modulo $\{2^n \pm 1\}$ [20] and $\{2^n \pm 3\}$ [23], [24], have been proposed. The generic structure proposed in [25] implements the direct conversion for modulo $\{2^n \pm k\}$ using the periodic properties of the modulo operation, and is implemented using a ROM-based topology.
Herein, a unified structure for scalable Residue Number System computation based on $\{2^n \pm k_i\}$ channels is proposed, allowing the design of RNS with any moduli set of this form. The proposed architecture places no limit on the number of RNS channels and, consequently, allows either increasing the dynamic range or reducing the width of the channels, and thus the delay and area cost, further exploiting the RNS parallelism. The proposed architecture computes the conversion from binary-to-RNS, the conversion from RNS-to-binary, and the arithmetic operations of each channel. The arithmetic units of each RNS computation channel are also used in the conversions, resulting in a very compact RNS architecture.
In order to evaluate the performance of the proposed architecture, theoretical performance and efficiency estimations are performed. Furthermore, experimental results for the channel's arithmetic blocks are presented. The analysis of these results suggests that the proposed architecture achieves better efficiency for the same number of moduli channels, and up to 50% better metrics when considering 16 to 32 channels. The results also suggest that this architecture is the most compact solution for the majority of the case studies.
This paper is organized as follows. Section II describes the proposed RNS architecture, together with the formulation needed to design the channel's arithmetic modulo $\{2^n \pm k_i\}$. Sections III and IV present the hardware complexity and the experimental results of the proposed architecture, respectively. Conclusions from this work are presented in Section V.
II. PROPOSED ARCHITECTURE
The first step of an RNS calculation process is the conversion from binary-to-RNS, followed by the arithmetic operations performed in each channel, and finally the conversion from RNS-to-binary. The approach herein proposed assumes that the number of required conversions is smaller than the number of arithmetic operations performed in the channels. Therefore, the two conversion steps are executed serially, using the hardware resources of the modular channels. The proposed architecture is organized in three main blocks, depicted in Figure 1: i) channel arithmetic blocks; ii) RNS-to-binary converter; and iii) control block. The arithmetic blocks perform the modular addition, subtraction, and multiplication on each channel. They also perform the binary-to-RNS conversion without imposing area overheads. The RNS-to-binary converter module contains the additional circuits required to compute the reverse conversion that cannot be provided by the channel's arithmetic blocks.
A. Binary-to-RNS conversion
The conversion of a $jn$-bit binary number $X$ to RNS for modulo $\{2^n - k\}$ can be computed as follows, where $X_{[msb:lsb]}$ denotes bits $msb$ down to $lsb$ of the integer $X$, with bit 0 being the least significant bit of $X$:
Similarly to modulo $\{2^n - k\}$, and considering $X_i = X_{[(i+1) \cdot n - 1 : i \cdot n]}$, the binary-to-RNS conversion modulo $\{2^n + k\}$ can be computed as:
To compute the above reduction $\langle X \rangle_{2^n+k}$, modular subtractors are required. However, (2) can be modified to use only addition operations, since:
Thus, $\langle X \rangle_{2^n+k}$ can be computed as:
The last term is a constant value that can be pre-calculated.
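The identities behind this conversion can be sketched as follows. This is a reconstruction from the definitions given in the text and is not necessarily the exact form of the numbered equations; it assumes that $X$ is split into $j$ chunks $X_i$ of $n$ bits each, and that $\overline{X_i} = 2^n - 1 - X_i$ denotes the bitwise complement of an $n$-bit chunk.

```latex
\begin{align*}
X &= \sum_{i=0}^{j-1} X_i \, 2^{i n}, \qquad X_i = X_{[(i+1) n - 1 \,:\, i n]} \\
2^n \equiv k \pmod{2^n - k}
  &\;\Rightarrow\;
  \langle X \rangle_{2^n - k}
  = \Big\langle \sum_{i=0}^{j-1} X_i \, \langle k^i \rangle_{2^n - k} \Big\rangle_{2^n - k} \\
2^n \equiv -k \pmod{2^n + k}
  &\;\Rightarrow\;
  \langle X \rangle_{2^n + k}
  = \Big\langle \sum_{i=0}^{j-1} (-1)^i X_i \, \langle k^i \rangle_{2^n + k} \Big\rangle_{2^n + k} \\
-X_i = \overline{X_i} + 1 - 2^n
  &\;\Rightarrow\;
  \langle -X_i \rangle_{2^n + k} = \langle \overline{X_i} + k + 1 \rangle_{2^n + k} \\
\langle X \rangle_{2^n + k}
  &= \Big\langle \sum_{i \text{ even}} X_i \, k^i
   + \sum_{i \text{ odd}} \overline{X_i} \, k^i
   + (k + 1) \sum_{i \text{ odd}} k^i \Big\rangle_{2^n + k}
\end{align*}
```

In the last line, the third sum depends only on $k$ and on the number of chunks, which is consistent with the remark that the last term is a constant that can be pre-calculated.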
The conversion from binary-to-RNS requires only additions and multiplications with accumulation. These operations can be serially performed using the channel's arithmetic blocks.
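A minimal Python sketch of this serial conversion is given below, assuming the chunk-wise identities $2^n \equiv k \pmod{2^n - k}$ and $2^n \equiv -k \pmod{2^n + k}$. The function names are hypothetical, and the `%` operator stands in for the modular adder of the channel; it is not a model of the hardware itself.

```python
def chunks(x, n):
    """Split a non-negative integer into n-bit chunks, least significant first."""
    mask = (1 << n) - 1
    out = []
    while True:
        out.append(x & mask)
        x >>= n
        if x == 0:
            return out

def bin_to_rns_minus(x, n, k):
    """<x> modulo (2^n - k): one multiply-accumulate per chunk, since 2^n = k (mod m)."""
    m = (1 << n) - k
    acc, weight = 0, 1
    for xi in chunks(x, n):
        acc = (acc + xi * weight) % m   # multiplication with accumulation
        weight = (weight * k) % m       # <k^i>; a constant that could be pre-computed
    return acc

def bin_to_rns_plus(x, n, k):
    """<x> modulo (2^n + k) using additions only: odd chunks are complemented."""
    m = (1 << n) + k
    mask = (1 << n) - 1
    acc, weight = 0, 1
    for i, xi in enumerate(chunks(x, n)):
        if i % 2 == 0:
            term = xi                      # (-k)^i = k^i for even i
        else:
            term = (xi ^ mask) + k + 1     # -xi = ~xi + k + 1 (mod 2^n + k)
        acc = (acc + term * weight) % m
        weight = (weight * k) % m
    return acc
```

For example, with $n = 8$ and $k = 3$, `bin_to_rns_minus(x, 8, 3)` returns `x % 253` and `bin_to_rns_plus(x, 8, 3)` returns `x % 259` for any non-negative `x`.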
B. Arithmetic in the channel
This section analyses the modular arithmetic operations required for addition, subtraction, and multiplication, with and without accumulation capability, for modulo $\{2^n \pm k\}$. The channel structure proposed to compute the required arithmetic operations is described herein for modulo $\{2^n - k\}$ and $\{2^n + k\}$.
1) Modulo $\{2^n - k\}$: The addition modulo $\{2^n - k\}$ is easily computed [22]. To derive the subtraction of two residue values, let us start by computing the symmetric (additive inverse) of a residue as:
The subtraction between the residues $a$ and $b$, with accumulation, can be described as:
Identically, the subtraction of (a + b) from the accumulated value is given by:
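One way to realise these subtractions with additions only is sketched below in Python. It relies on the identity $-b = \overline{b} + 1 - 2^n \equiv \overline{b} - k + 1 \pmod{2^n - k}$ for an $n$-bit residue $b$, which is an assumption about the form of the derivation rather than a transcription of it. The function names are hypothetical.

```python
def neg_mod_minus(b, n, k):
    """Additive inverse of a residue b modulo m = 2^n - k, via bitwise complement."""
    m = (1 << n) - k
    mask = (1 << n) - 1
    # For 0 <= b < m the value (~b - k + 1) lies in [1, m], so it is never negative.
    return ((b ^ mask) - k + 1) % m

def sub_acc(acc, a, b, n, k):
    """<acc + a - b> modulo (2^n - k), computed with additions only."""
    m = (1 << n) - k
    return (acc + a + neg_mod_minus(b, n, k)) % m

def sub_sum_acc(acc, a, b, n, k):
    """<acc - (a + b)> modulo (2^n - k), computed with additions only."""
    m = (1 << n) - k
    return (acc + neg_mod_minus(a, n, k) + neg_mod_minus(b, n, k)) % m
```

Replacing each negated operand by its complement plus a constant is what lets a multi-operand modular adder serve for both addition and subtraction.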
The multiplication of the residue $a$ by $b$, with positive accumulation, can be computed as in [22] (note: the width of $k$ is represented by $w_k$):
The value $p_{[2n-1:0]}$ results from the binary multiplication $a \times b$, and can be computed by an $n \times n$-bit binary multiplier.
The same multiplication but with negative accumulation, can be obtained by computing:
2) Modulo $\{2^n + k\}$: Similarly to modulo $\{2^n - k\}$, additions modulo $\{2^n + k\}$ can be easily computed. The subtraction operations, with accumulation, for channels modulo $\{2^n + k\}$, considering (3), can be formulated as:
Using the same approach, the subtraction of (a + b) with accumulation can be computed as:
Similarly to modulo $\{2^n - k\}$, the multiplication of $a$ by $b$, with positive accumulation, modulo $\{2^n + k\}$, can be computed as:
Using (3) to rewrite (12) considering negative accumulation, results in:
The described operations can be implemented in a single hardware structure, using a binary multiplier, two constant multipliers (computing $k \cdot p_{[2n-1:n]}$ and $k \cdot m_1{}_{[n+w_k-1:n]}$) that perform the modular reduction, and one 5:1 modular adder. This 5:1 modular adder may be implemented by using one modular carry-save-adder [26] and one 4:1 adder [22]. The arithmetic structures for channels modulo $\{2^n - k\}$ and $\{2^n + k\}$ are similar. However, for channels modulo $\{2^n + k\}$, an $((n + 1) \times (n + 1))$-bit binary multiplier is used instead of an $(n \times n)$-bit multiplier. Two constant multipliers are also used, but more constants have to be selected at the input of the 5:1 modular adder, as depicted in Figure 2.
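The two-stage reduction performed by the constant multipliers can be sketched in Python for the modulo $\{2^n - k\}$ case. This is a behavioural sketch under the assumption that the reduction uses $2^n \equiv k \pmod{2^n - k}$ twice; the names `m1` and `m2` for the constant multiplier outputs and the function name are illustrative, and the final `%` stands in for the multi-operand modular adder.

```python
def mac_mod_minus(a, b, acc, n, k):
    """<a * b + acc> modulo m = 2^n - k, using two multiplications by the constant k."""
    m = (1 << n) - k
    mask = (1 << n) - 1

    p = a * b                           # n x n-bit binary multiplier, 2n-bit product
    p_lo, p_hi = p & mask, p >> n       # p[n-1:0] and p[2n-1:n]

    m1 = k * p_hi                       # first constant multiplier, about n + w_k bits
    m1_lo, m1_hi = m1 & mask, m1 >> n   # m1[n-1:0] and m1[n+w_k-1:n]

    m2 = k * m1_hi                      # second constant multiplier

    # p = p_hi * 2^n + p_lo = m1 + p_lo = m1_hi * 2^n + m1_lo + p_lo = m2 + m1_lo + p_lo (mod m)
    return (p_lo + m1_lo + m2 + acc) % m
```

After the two constant multiplications, every operand reaching the final adder is narrow: `p_lo` and `m1_lo` have `n` bits, and `m2` is bounded by roughly $2^{2 w_k}$, so the remaining work is a multi-operand modular addition.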
C. RNS-to-binary conversion
Most of the related state of the art reverse converters are based on the Chinese Remainder Theorem (CRT) [27], on the Mixed-Radix Converter (MRC) [27], or on the more recent New CRT [28]. Herein, the MRC algorithm is considered, because it allows the hardware available in the modular channels to be reused to perform the conversion from RNS-to-binary, reducing the system's overall area. Moreover, the CRT algorithm is not considered in this work due to the hardware cost of the adder modulo $M$ required for the computation of the binary value. The New CRT I [28] is not considered since the modulo channels have to satisfy the condition $P_i > 2P_{i-1}$, imposing an unbalanced system. The New CRT II [28] requires intermediate modular operations that are not supported by the modular channels.
The MRC algorithm associates a mixed-radix representation with a residue representation. Considering the moduli set $\{m_1, m_2, \cdots, m_N\}$, where $z_i$ is a residue modulo $m_i$ ($0 \le z_i < m_i$), the number $X$ can be described by:
Applying the reduction modulo $m_1$ to $X$, equation (14) can be re-written as:
In order to calculate $z_2$ we can consider:
Applying the reduction modulo $m_2$ to (16), the following is obtained:
Multiplying by the multiplicative inverse of $m_1$ modulo $m_2$ results in:
Since $z_2$ is always a residue modulo $m_2$, and $\langle X \rangle_{m_2} = x_2$, equation (18) can be re-written as:
Continuing the process, the mixed-radix digits ($z_i$) can be iteratively calculated. The $z_N$ value can thus be computed as:
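For reference, the standard mixed-radix relations that this derivation follows can be written as below. This is a reconstruction of the textbook MRC steps from the surrounding prose, so the exact form and numbering of equations (14) to (20) in the original may differ; $\langle m_j^{-1} \rangle_{m_i}$ denotes the multiplicative inverse of $m_j$ modulo $m_i$.

```latex
\begin{align*}
X &= z_1 + z_2 m_1 + z_3 m_1 m_2 + \cdots + z_N \prod_{i=1}^{N-1} m_i \\
\langle X \rangle_{m_1} &= x_1 = z_1 \\
X - z_1 &= z_2 m_1 + z_3 m_1 m_2 + \cdots + z_N \prod_{i=1}^{N-1} m_i \\
\langle X - z_1 \rangle_{m_2} &= \langle z_2 m_1 \rangle_{m_2} \\
\big\langle \langle m_1^{-1} \rangle_{m_2} (X - z_1) \big\rangle_{m_2} &= \langle z_2 \rangle_{m_2} \\
z_2 &= \big\langle \langle m_1^{-1} \rangle_{m_2} (x_2 - z_1) \big\rangle_{m_2} \\
z_N &= \Big\langle \langle m_{N-1}^{-1} \rangle_{m_N}
      \big( \cdots \big( \langle m_1^{-1} \rangle_{m_N} (x_N - z_1) - z_2 \big) \cdots - z_{N-1} \big)
      \Big\rangle_{m_N}
\end{align*}
```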
This iterative process requires $2N$ cycles for the reverse conversion computations, which are performed in the modulo channels. Furthermore, all the multiplicative inverse values required in the computation can be pre-computed and stored in memory. The final value $X$ is computed by multiplying the mixed-radix digits ($z_i$) by a constant factor ($W_i$), as represented in (14).
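The iterative computation of the mixed-radix digits can be sketched in Python as follows. This is a software illustration of the standard MRC recurrence, not of the cycle-level schedule; the function names are hypothetical, and the built-in `pow(m, -1, mod)` (Python 3.8 or later) stands in for the pre-computed inverses stored in memory.

```python
def mixed_radix_digits(residues, moduli):
    """Compute the mixed-radix digits z_i from the residues x_i."""
    z = []
    for i, m_i in enumerate(moduli):
        t = residues[i]
        # Channel i subtracts each earlier digit and multiplies by the inverse
        # of the corresponding modulus: one subtraction and one multiplication per step.
        for j in range(i):
            inv = pow(moduli[j], -1, m_i)   # pre-computable constant <m_j^-1> mod m_i
            t = ((t - z[j]) * inv) % m_i
        z.append(t)
    return z

def mixed_radix_to_int(z, moduli):
    """Rebuild X = sum of z_i * W_i, with W_i = m_1 * ... * m_(i-1)."""
    x, w = 0, 1
    for z_i, m_i in zip(z, moduli):
        x += z_i * w
        w *= m_i
    return x
```

For the illustrative moduli set `[13, 15, 17, 19]`, the value 8871 has residues `[5, 6, 14, 17]` and mixed-radix digits `[5, 7, 11, 2]`, and `mixed_radix_to_int` recovers 8871 from those digits.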
Considering $z_i$ and $W_i$ as:
The multiplication of digit $z_i$ by its weight $W_i$, with $W_i^l$ representing bits $[(l + 1) \cdot n - 1 : l \cdot n]$ of $W_i$ and $c = N - 1$, can be computed as:
The computation of (22) requires $N$ additional steps to calculate the final value of $X$. The final value $X$ is computed by a binary adder-tree that compresses the $N + 1$ input values, each $2n$ bits long, into one vector of $2n + e$ bits (note: $e = \lceil \log_2(N + 1) \rceil$ represents the extra bits required), with the value then shifted into a register. The binary adder-tree is fed with the results given by the binary multiplier of each channel, output $p_i{}_{[2n-1:0]}$ of the channel's structure depicted in Figure 2. With this approach, the conversion from RNS-to-binary requires a total of $3N$ cycles to compute $X$. The hardware overhead for this solution is reduced when compared with a CRT algorithm, which requires a final modulo $M$ adder, as depicted in Figure 3.
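The chunk-wise weighting step can be sketched in Python as below. It assumes that in each step every channel multiplies its digit $z_i$ by one $n$-bit chunk $W_i^l$ of its weight, and that the products of that step are summed and added at bit position $l \cdot n$. The function names are hypothetical, and the number of steps is derived here from the widest weight rather than fixed at $N$, which is a simplification.

```python
def weights(moduli):
    """W_i = m_1 * ... * m_(i-1), with W_1 = 1."""
    out, w = [], 1
    for m in moduli:
        out.append(w)
        w *= m
    return out

def weighted_sum_by_chunks(z, moduli, n):
    """X = sum of z_i * W_i, built from narrow products z_i * W_i^l."""
    mask = (1 << n) - 1
    ws = weights(moduli)
    steps = -(-max(w.bit_length() for w in ws) // n)   # ceiling division
    x = 0
    for l in range(steps):
        # One product per channel in this step, as produced by the channel multipliers.
        column = sum(z_i * ((w_i >> (l * n)) & mask) for z_i, w_i in zip(z, ws))
        x += column << (l * n)                          # adder-tree output, shifted into place
    return x
```

With the digits `[5, 7, 11, 2]` and the moduli `[13, 15, 17, 19]` ($n = 4$), the function returns 8871, the same value as the direct sum $5 + 7 \cdot 13 + 11 \cdot 195 + 2 \cdot 3315$.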
III. HARDWARE REQUIREMENTS
To obtain a technology-independent assessment of the resulting RNS architecture, an analysis has been carried out using a neutral Full-Adder based model [21]. This model relates the area and propagation delay of the circuit to the area and delay of a Full-Adder (FA), considering the FA as the finest grain component. In this model, the area of a 1-bit FA is represented by $\Delta_{FA}$ and its delay by $\tau_{FA}$. Bitwise logic operations are not considered, since they do not impose a significant delay or area cost in our design. Furthermore, the model considers a Carry-Propagate-Adder (CPA) as the simplest addition structure, with an area of $n\Delta_{FA}$ and a delay of $n\tau_{FA}$. This analysis also considers that a modulo $\{2^n \pm 1\}$ adder is implemented by a CPA with End-Around-Carry (EAC), with an area cost similar to that of a binary CPA and twice the delay of a CPA [29]. The Multi-Operand Modular Adders for modulo $\{2^n \pm 1\}$ (MOMA [30]) and the binary adder-trees are considered to be organized as Wallace trees, requiring approximately $\log_{1.5}(N) \simeq 2\log_2(N)$ stages for $N$ operands [31], with a 1-bit FA delay per stage, and a total area of $n(N - 2)\Delta_{FA}$. For the binary multipliers, a delay of $2n\tau_{FA}$ and an area of $n^2\Delta_{FA}$ are assumed [22]. For the multiplication of an $n$-bit operand by a $w_k$-bit constant, a simpler estimation is considered, with an area of $nw_k\Delta_{FA}$ and a delay of $(n + w_k)\tau_{FA}$. The modulo $\{2^n \pm k\}$ adders proposed in [22] are considered, with a 4:1 adder imposing a delay of $(n + 3)\tau_{FA}$ and an area of $5n\Delta_{FA}$. For the modulo $\{2^n \pm k\}$ Carry-Save-Adder (CSA) structure, a delay of $2\tau_{FA}$ and an area of $2n\Delta_{FA}$ are considered [26].
Given these metrics, the first step in assessing the efficiency of the proposed compact RNS architecture is to analyse the channel's arithmetic structures, described in Section II-B and depicted in Figure 2. The area of the modulo $\{2^n - k\}$ channel arithmetic block is given by the area of one binary multiplier ($n^2\Delta_{FA}$), two constant multipliers, contributing $2nw_k\Delta_{FA}$, and one 5:1 modular adder. The 5:1 modular adder is implemented with one 4:1 adder [22] and one modular CSA [26], with a total area cost of $7n\Delta_{FA}$. As illustrated in Figure 2, the critical path of the resulting channel arithmetic structure is imposed by the binary multiplier, with $2n\tau_{FA}$, the two constant multipliers, contributing $2(n + w_k)\tau_{FA}$, and the 5:1 modular adder, with a delay of $(n + 5)\tau_{FA}$. An identical analysis can be performed for the modulo $\{2^n + k\}$ channel's arithmetic, resulting in the values presented in Table I. The estimated costs for the moduli $\{2^n\}$ and $\{2^n \pm 1\}$ channels are also presented in Table I, since the considered related art employs these moduli. For modulo $2^n$, the area cost is given by the binary multiplier with an $n$-bit output ($\frac{n^2}{2}\Delta_{FA}$), one CSA to add the operands A and B to the accumulated value, contributing $n\Delta_{FA}$, and one CPA as the final adder, with an area cost of $n\Delta_{FA}$. The estimated delay for the modulo $2^n$ channel structure is $(n + 1 + n)\tau_{FA}$, resulting from the contributions of the binary multiplier, the CSA, and the final CPA. Identically, for modulo $\{2^n \pm 1\}$ the estimated delay is $(2n + 1 + 2n)\tau_{FA}$, given by the contributions of the modulo multiplier [31], the CSA, and the final CPA with EAC. The estimated area is $(n^2 + n + n)\Delta_{FA}$, as presented in Table I.
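The channel estimates stated above can be collected into a small Python helper for quick comparisons. The function names are hypothetical, areas are in units of $\Delta_{FA}$ and delays in units of $\tau_{FA}$, and the sketch assumes that the listed contributions simply add along the critical path. Only the modulo $\{2^n - k\}$, $2^n$, and $\{2^n \pm 1\}$ cases are covered, since the modulo $\{2^n + k\}$ values are given in Table I rather than in the text.

```python
def channel_cost_minus(n, w_k):
    """(area, delay) of a modulo 2^n - k channel; w_k is the bit width of k."""
    area = n * n + 2 * n * w_k + 7 * n           # multiplier + two constant multipliers + 5:1 adder
    delay = 2 * n + 2 * (n + w_k) + (n + 5)      # same three contributions, in series
    return area, delay

def channel_cost_pow2(n):
    """(area, delay) of a modulo 2^n channel."""
    area = n * n / 2 + n + n                     # n-bit-output multiplier + CSA + CPA
    delay = n + 1 + n
    return area, delay

def channel_cost_pm1(n):
    """(area, delay) of a modulo 2^n +/- 1 channel."""
    area = n * n + n + n                         # modulo multiplier + CSA + CPA with EAC
    delay = 2 * n + 1 + 2 * n
    return area, delay
```

For instance, `channel_cost_minus(8, 2)` evaluates to `(152, 49)` and `channel_cost_pm1(8)` to `(80, 33)`, which illustrates the trade-off discussed in the text: a generic $\{2^n \pm k\}$ channel costs more than a $\{2^n \pm 1\}$ channel of the same width, so the gains must come from using more, narrower channels.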
To complete the evaluation of the proposed RNS architecture, the reverse conversion block is characterized next. From Figure 3 and the presented equations, the following area cost and delay can be derived:
As discussed in the previous section, the reverse conversion requires 3N iterations, which in this case means 3N clock cycles to compute the binary value X.
IV. COMPARISON WITH RELATED ART
In order to properly evaluate the proposed RNS architecture, it is herein compared with the related state of the art, namely: i) the Traditional moduli set $\{2^n \pm 1, 2^{2n}\}$ with a 4n-bit DR [13]; ii) the Mohan architecture [18] for the moduli set $\{2^n - 1, 2^n, 2^n + 1, 2^{n+1} - 1\}$, which corresponds to a DR with 4n bits; iii) an RNS architecture proposed by Skavantzos [19] for the moduli set {2
n , 2 n + 1} supporting a DR up to (8n − 15) bits; and iv) an RNS-to-binary converter by Pettenghi [21] with a DR up to (8n + 1)-bit supported on the moduli set {2
Table II presents the estimated values for these RNS architectures considering the previously described performance model. The Traditional, Skavantzos, and Mohan reverse converters were implemented by using the arithmetic operators presented in [20]. For Pettenghi, a binary-to-RNS converter and arithmetic units were added in order to obtain a complete system, considering the same units used in the proposed architecture. For the $\{2^n \pm 2^{(n+1)/2} + 1\}$ modulo channels, generic modulo $\{2^n \pm k\}$ arithmetic units are considered. Thus, the main difference in the full RNS lies in the reverse conversion, where a dedicated RNS-to-binary converter is used.
Fig. 4. Total area resources of the full RNS for the considered moduli sets (area ratio #/Traditional versus DR in bits).
In order to better evaluate the cost and the performance gains obtained with the proposed RNS architecture, the obtained values were normalized to the values obtained for the Traditional moduli set, considering the estimations presented in Table II. Given the scalability of the proposed RNS architecture, three systems are considered, varying the number of moduli channels, namely using moduli sets with 8, 16, and 32 channels modulo $\{2^n \pm k\}$.
Table II (area in $\Delta_{FA}$, delay in $\tau_{FA}$)

| Architecture | Block | Area | Delay |
|---|---|---|---|
| Traditional [13] | bin-to-RNS | $6n$ | $2n + 2$ |
| Traditional [13] | channels | $4n^2 + 8n$ | $4n + 1$ |
| Traditional [13] | RNS-to-bin | $6n$ | $4n + 3$ |
| Mohan [18] | bin-to-RNS | $9n$ | $2n + 2$ |
| Mohan [18] | channels | $(7/2) \cdot n^2 + 10n + 2$ | $4n + 5$ |
| Skavantzos [19] | bin-to-RNS | $47n - 104$ | $2n + 4$ |
| Pettenghi [21] | bin-to-RNS | − | $8 \cdot (6n + 7)$ |
| Pettenghi [21] | channels | $27/9 \cdot n^2 + 30n + 4$ | $6n + 7$ |
| Pettenghi [21] | RNS-to-bin | $n^2/2 + 81n/2 - 13$ | $10n + 8 + \log_2(n/2 + 4)$ |

It is harder to compare the delay, since it depends on the number of arithmetic operations that need to be performed. The overhead of the direct and reverse conversions becomes less significant as the number of arithmetic operations increases. To better evaluate this impact, the results herein presented consider 1, 10, and 100 arithmetic operations for each set of direct and reverse conversions. Figure 5 depicts the results, also normalized to the considered Traditional moduli set. As expected, a system using the Traditional moduli set has better delay performance when a single arithmetic operation is considered. The architecture proposed by Mohan is the one suggesting the least relative degradation, being around two times slower than the Traditional. The Skavantzos architecture also suggests a degradation in delay when compared with the Traditional based system, being in the order of 2.6 times slower. The proposed architectures with 16 and 32 modulo channels suggest the worst delay performance when considering one arithmetic operation per conversion, being 8 to 10 times slower than the Traditional. In this case, the delay performance degradation is imposed mostly by the conversions, which are heavier in RNS with more channels. Considering 10 arithmetic operations, the architecture proposed by Skavantzos starts to suggest better delay performance than the Traditional moduli set, while the remaining architectures still suggest worse metrics. However, the delay performance degradation begins to be less significant. When considering 100 arithmetic operations, the conversion cost becomes negligible. The Skavantzos 8 channel RNS is, as expected, the one suggesting better delay performance. This is due to the use of a moduli set composed of modulo channels of the form $\{2^n \pm 1\}$, thus allowing the use of optimized modulo arithmetic units [20].
Furthermore, it uses a dedicated structure for the conversion from RNS-to-binary. In this scenario, the proposed RNS with 16 and 32 channels suggest performance gains of up to 2.5 times in comparison with the Traditional RNS.
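The amortization effect discussed above can be illustrated with a short Python sketch. It assumes a simple model in which the total delay is the forward conversion, plus the channel delay times the number of arithmetic operations, plus the reverse conversion. The numeric values in the test of this model are hypothetical placeholders and are not taken from Table II or from the figures.

```python
def total_delay(t_forward, t_channel, t_reverse, n_ops):
    """Delay of one forward conversion, n_ops channel operations, one reverse conversion."""
    return t_forward + n_ops * t_channel + t_reverse

def delay_ratio(arch, reference, n_ops):
    """Delay of `arch` normalized to `reference`; each is (t_forward, t_channel, t_reverse)."""
    return total_delay(*arch, n_ops) / total_delay(*reference, n_ops)

# Hypothetical delays, in FA-delay units: few wide channels with cheap conversions
# versus many narrow channels with fast arithmetic but costly serial conversions.
reference = (10, 40, 10)
many_channels = (100, 15, 150)

ratios = {ops: delay_ratio(many_channels, reference, ops) for ops in (1, 10, 100)}
```

With these placeholder values the ratio falls from above 4 for a single operation to below 0.5 for 100 operations, which mirrors the qualitative trend described in the text: conversion-heavy architectures only pay off once enough arithmetic is performed per conversion.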
The herein proposed architecture is targeted at a scalable and compact RNS. To better evaluate the considered RNS, an Area-Delay-Product (ADP) efficiency metric is also considered. Figure 6 presents the ADP efficiency metric for 1, 10, and 100 arithmetic operations per conversion. From these results, it can be concluded that the Traditional RNS is always the best solution for 1 operation, as expected from the previous results. For the 10 operations scenario, and considering an 8 channel RNS, the Pettenghi implementation has the best efficiency for DR up to 512 bits; however, the proposed architecture has better ADP values for DR greater than 512 bits, with ADP improvements of up to 50%. When considering 100 operations, the proposed architecture is the best solution in almost all of the analysed scenarios, achieving, on average, a gain of 70% in comparison with the Traditional RNS. Even though it is more efficient in the arithmetic operations, the Skavantzos RNS is always the worst implementation when considering the ADP metric, given its area requirements, in particular due to the reverse conversion, which uses up to 85% of the total system area. The Mohan architecture always suggests worse or similar ADP metrics when compared to the Traditional based system.
In order to validate the theoretical analysis, the channel's arithmetic blocks were described in VHSIC Hardware Description Language (VHDL) and mapped to the UMC 90nm CMOS technology [32]. Both synthesis and mapping were performed using Design Vision Version E-2010.12-SP4 from Synopsys. The obtained experimental results for area and delay, and the ratio with the Traditional RNS, are presented in Table III. The results confirm the theoretical estimations, suggesting that, for an 8 channel RNS, the Skavantzos architecture has the best arithmetic performance, although at a high conversion cost [21].
The best arithmetic performance, both in area and delay, is obtained for the 16 channel RNS using the architecture herein proposed. The results suggest arithmetic gains of more than 5 times in delay and an area reduction of 77% relative to the Traditional RNS. They also suggest a delay improvement of 17% and an area reduction of 23% relative to the arithmetic operations performed by the Skavantzos RNS.
In conclusion, the work herein proposed suggests that an efficient and scalable RNS architecture can be achieved. The results also suggest that, for the same number of channels, this architecture has ADP efficiency metrics similar to those of the existing state of the art, while providing high flexibility to adapt the system to the desired number of modulo channels and DR. The work herein presented is the first step towards the development of a flexible and adaptable comprehensive framework to automatically design RNS processors. This RNS framework can also be used to advance the development of novel moduli sets and optimized reverse converters.
V. CONCLUSIONS
In this paper, a compact and scalable RNS architecture is proposed, allowing the design of RNS based processors for any $\{2^n \pm k_i\}$ moduli set. The proposed RNS architecture is composed of modulo $\{2^n \pm k_i\}$ channels, making use of a unified arithmetic structure. This structure supports modular addition, subtraction, and multiplication operations, with or without accumulation, while also supporting forward and reverse conversions. With the proposed architecture, the delay and area cost of dedicated processors are reduced while further exploiting the parallelism and carry-free characteristics of RNS. The presented performance analysis suggests that better performance metrics can be obtained with the proposed RNS architecture when a sufficient number of arithmetic operations are performed per conversion. When considering the complete RNS, with more modulo channels, area and delay improvements of up to 50% can be achieved. Furthermore, when considering efficiency metrics such as ADP, the proposed architecture can be more efficient even when considering only 10 arithmetic operations per conversion, for DR greater than 512 bits. The presented results suggest that the proposed RNS architecture allows scalable and compact implementations with competitive performance when compared with the related state of the art.
