Abstract: Residue number systems (RNS) are non-weighted number systems that allow addition, subtraction and multiplication to be performed concurrently and independently on each residue. The triple moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ and its extensions have gained unprecedented importance in RNS, mainly because of the simplicity of the arithmetic units for the individual channels and of the converters to and from RNS. However, the elements of this moduli set are not perfectly balanced, and the arithmetic units for the individual residues are not equally complex. Two complementary approaches have been proposed to improve the efficiency of RNS based on this type of moduli set: enhancing the modulo $2^n + 1$ multipliers, which perform the most complex arithmetic operation, and overloading the binary channel in order to obtain a more balanced moduli set. Experimental results show that, when applied together, these techniques can improve the efficiency of the multipliers by up to 32%.
Introduction
In recent years, the residue number system (RNS) has become an important research field in the area of efficient digital computing [1]. A great amount of research has already been done to exploit the RNS properties for performing non-weighted, carry-free arithmetic. This characteristic can be greatly advantageous when repetitive arithmetic operations have to be performed, especially when large operands are involved. A known example is the multiply-accumulate operation used in linear filtering and in number theoretic transforms, typically found in digital signal processing (DSP) [2, 3]. Cryptography is another growing field where efficient multipliers are needed. This includes the asymmetrical encryption algorithm RSA [4, 5], which intensively uses modular multiplications with operands of up to 2048 bits, and the symmetrical international data encryption algorithm (IDEA) [6], which applies the modulo $2^{16} + 1$ multiplication. By splitting the computation on a large operand into several smaller independent channels, the performance can be improved. This approach also reduces the required area, especially in the multiplication units, where the area grows with the square of the operand length [7]. Nevertheless, in RNS the need to perform forward and reverse conversions also has to be taken into account. These conversions represent an unavoidable overhead, because the majority of the standards use the binary representation.
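As an illustration of the channel-wise, carry-free arithmetic described above, the following minimal sketch multiplies two integers independently in each channel of the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ and recovers the result with a textbook Chinese Remainder Theorem reconstruction. The function names and the choice $n = 4$ are assumptions made for this example only; they do not correspond to any hardware unit discussed in this paper.

```python
from math import prod

def to_rns(x, moduli):
    """Forward conversion: one residue per channel."""
    return tuple(x % m for m in moduli)

def rns_mul(a, b, moduli):
    """Channel-wise multiplication; no information crosses channels."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli):
    """Reverse conversion with the textbook CRT (software model only)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = M // m
        # pow(m_hat, -1, m) is the multiplicative inverse of m_hat modulo m
        total += r * m_hat * pow(m_hat, -1, m)
    return total % M

n = 4  # illustrative value
MODULI = (2**n - 1, 2**n, 2**n + 1)  # (15, 16, 17), dynamic range 4080

x, y = 37, 52
product = from_rns(rns_mul(to_rns(x, MODULI), to_rns(y, MODULI), MODULI), MODULI)
```

The result is only meaningful while the true product stays below the dynamic range (here $15 \times 16 \times 17 = 4080$).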
With the increase of the modulus size, combinatorial RNS architectures become more efficient than look-up table architectures [8]. Therefore, moduli close to a power of 2 are preferable, because only modified functional blocks of binary architectures have to be applied for the same technology. This is the case not only for the commonly adopted 3-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, but also for the 4-moduli supersets that introduce a fourth element of the type $\{2^{n+1} + 1\}$ to increase the dynamic range [9]. The mathematical properties of these moduli sets are also important for developing efficient converters between the RNS and the binary number system, using either the Chinese Remainder Theorem (CRT) or the mixed-radix conversion technique [10]. This paper addresses the problem of improving the speed-up achieved with RNS-based systems. Although multiplication is the main focus of this paper, the methods and techniques proposed here benefit RNS arithmetic in general. From a detailed analysis of the RNS characteristics, the two main critical aspects that prevent a higher speed-up are identified as follows.
1. The differences in the complexity of the arithmetic units for the various moduli of the set are not negligible, in particular the longer critical path of the modulo $2^n + 1$ channel.
2. Although the modulus exceeds a power of 2 by only one, the additional bit of the modulo $2^n + 1$ multiplication gives rise to a more complex computational structure, and consequently to a longer processing time, when compared with the other moduli multiplications.
These two aspects stress the imbalance between the elements of not only the traditional moduli set but also the extended moduli sets, and degrade the performance of the RNS. In this paper, two independent but complementary techniques are proposed to tackle this problem: the enhancement of the arithmetic unit for the modulo $2^n + 1$ multiplication [11] and a modification of the moduli sets to obtain more balanced structures. The latter modification corresponds to the overloading of the binary base element (channel), which relieves the non-binary channels for a given dynamic range.
To evaluate the improvements obtained from the proposed modulo $2^n + 1$ multiplication structure and the overloading of the binary channel, the arithmetic units for the modified and more balanced moduli set were implemented. They are described using both behavioural and fully structural parameterisable IEEE VHDL and implemented using Synopsys synthesis tools with a high-density StdCell library from UMC, which is based on a 0.13-μm CMOS process from Virtual Silicon Technology, Inc. [12]. Experimental results show the advantages of applying the proposed $2^n + 1$ multiplier and the balanced moduli set, with gains not only in the multiplication, where the improvement reaches up to 32%, but also in RNS arithmetic in general. Moreover, no additional overhead is imposed for conversion. The proposed conversion units are, on average, 4 to 10% more efficient than the conversion units for the original moduli sets. This paper is organised as follows. In Section 2, enhanced $2^n + 1$ multipliers are proposed and evaluated. The extension of the binary channel is proposed in Section 3 and new converters are presented for the derived moduli sets. Experimental results are presented in Section 4, and Section 5 concludes the paper.
2
Enhanced modulo $2^n + 1$ multipliers
Unlike multiplication modulo $2^n - 1$ or modulo $2^n$, the multiplication modulo $2^n + 1$ has to accommodate the cases where one or both operands are equal to $2^n$. This is the main reason why the modulo $2^n + 1$ multipliers are the slowest, in both the 3- and the 4-moduli sets. We have recently shown that although the modified Booth recoding can be applied for designing modulo $2^n + 1$ multipliers, it usually neither leads to faster multipliers nor significantly reduces the required hardware [13]. This paper proposes to enhance one of the most efficient modulo $2^n + 1$ multiplication architectures, proposed by Zimmermann [14], which does not use Booth recoding. By manipulating the values for the particular case when one or both operands are equal to zero, we seek to speed up the computation without significantly increasing the required circuit area.
Let us consider an integer number $X$ represented with $n + 1$ bits
and the multiplication of numbers X and Y, which can be computed as
By exploiting the following properties
(2) can be rewritten as
In the modulo $2^n + 1$ representation, when $a_n = 1$ the other bits are zero ($A' = 0$); consequently the multiplication can be divided into three distinct operations:
1. For $x_n = 0$ and $y_n = 0$:
2. When one and only one of the nth bits is equal to 1:
3. Finally when the two nth bits of X and Y are equal to 1:
To compute (5), the approach presented in [14] can be applied:
In the above equation, it can be considered that modulo $2^n + 1$ adders intrinsically add an extra unit. Instead of using a multiplication matrix with $n \times n$-bit inputs for adding the partial products ($PP_i$), followed by a carry save adder (CSA) to add the constant 2 in (8) [14], the multiplication matrix can be redesigned to encompass an extra input, where this constant is directly introduced. This is advantageous when multipliers are implemented using Wallace tree (WT) adders, because for most values of $n$ the delay is approximately the same as for $n + 1$ inputs: the number of full adders (FAs) in the critical path of an $n$-bit WT is given by $WT(n + 1) = \lfloor \tfrac{3}{2} WT(n) \rfloor$ [15], as presented in Table 1. However, this is only a rough approximation of the computation delay, since it does not take into account the wire propagation delay, which is becoming increasingly significant in current technologies.
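The Wallace tree argument can be checked numerically with a short sketch. It assumes the usual reading of the recurrence, in which the value is the maximum number of operands that a tree with a given number of full-adder levels can reduce to two rows, starting from two operands at zero levels. This reading and the function names are assumptions of the example, and Table 1 is not reproduced here.

```python
def wallace_max_inputs(levels):
    """Maximum number of operands a Wallace tree with `levels` full-adder
    levels can reduce to two rows (assumed reading of the recurrence)."""
    inputs = 2
    for _ in range(levels):
        inputs = (3 * inputs) // 2  # floor(3/2 * previous)
    return inputs

def wallace_levels(operands):
    """Smallest number of full-adder levels able to reduce `operands` rows."""
    levels = 0
    while wallace_max_inputs(levels) < operands:
        levels += 1
    return levels

capacity = [wallace_max_inputs(j) for j in range(10)]
same_depth = [n for n in range(3, 41) if wallace_levels(n) == wallace_levels(n + 1)]
```

Under this reading, `same_depth` lists the operand counts for which one extra input does not add a full-adder level, which is the situation the redesigned multiplication matrix relies on. Wire delay is ignored, as noted above.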
Computation of $\langle P_3 \rangle_{2^n+1}$
The computation of (7) is rather simple, taking into consideration that when $x_n = y_n = 1$, the partial results $P_1$ and $P_2$ are null and thus the final product is equal to 1. This particular case can be accommodated simply by setting the least significant bit of the multiplication to 1 whenever $x_n = y_n = 1$ (this operation can be efficiently performed using a simple OR gate). As most efficient modulo adders have the result of the least significant bit calculated before the most significant bit [14, 16], the OR gate should not have any impact on the computation time.
Computation of $\langle P_2 \rangle_{2^n+1}$
The most significant improvement proposed for the multiplication unit concerns the introduction of the partial product described in (6) into the final result. Zimmermann [14] adopted a multiplexer to select the result, based on the fact that when the result of (6) is not null, then (5) is zero and vice versa. This multiplexer is located at the output of the parallel multiplier, thus contributing to the critical path of the circuit. Our proposal introduces this partial product directly in the multiplication matrix, by adding extra inputs in the WT. However, each extra input line added to the matrix of adders corresponds to the introduction of an extra modulo $2^n + 1$ CSA, which intrinsically adds an extra unit [16]. Consequently, the constant initially introduced in the multiplication matrix has to be corrected in order to take into consideration both the extra inputs and the constant required by the partial product $P_2$ in (6). The two elements of $P_2$ are introduced in distinct inputs of the adder matrix, which requires the addition of two extra units and results in
where the first constant term 2 in the above equation is intrinsically added.
The resulting architecture is depicted in Fig. 1 , where u represents the added constant.
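The three-case decomposition used in this section can be summarised by a behavioural model. The sketch below is not the proposed hardware, which folds all cases into a single Wallace tree; it only shows that treating the $n$th bits separately gives the correct product, using the fact that $2^n \equiv -1 \pmod{2^n + 1}$. Operands are assumed to lie in $[0, 2^n]$, and the function name is an assumption of the example.

```python
def mul_mod_2n_plus_1(x, y, n):
    """Behavioural model of the three-case modulo 2^n + 1 multiplication.

    Assumes 0 <= x, y <= 2^n, so the n-th bit is set only for the value 2^n.
    """
    m = (1 << n) + 1
    mask = (1 << n) - 1
    xn, yn = x >> n, y >> n      # n-th bits of the operands
    xl, yl = x & mask, y & mask  # lower n bits of the operands

    if xn and yn:
        # Both operands equal 2^n: (-1) * (-1) = 1
        return 1
    if xn or yn:
        # Exactly one operand equals 2^n: the product is minus the other one
        other = yl if xn else xl
        return (m - other) % m
    # Neither n-th bit is set: ordinary n x n bit product, reduced modulo m
    return (xl * yl) % m
```

An exhaustive comparison against `(x * y) % (2**n + 1)` for small `n` is a quick way to check the model.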
To assess the practical efficiency of the newly proposed architecture, modulo $2^n + 1$ multipliers were synthesised based on the proposed and on the original Zimmermann architectures. The results for the 0.13-μm CMOS technology are presented in Section 4 and, in particular, in Fig. 12, which shows the relative efficiency of the multipliers in terms of the product of the circuit area ($A$) by the square of the time ($T$), that is, $AT^2$. Given the quadratic variation of the area and the linear variation of the delay, $AT^2$ is typically used as an efficiency metric [13, 17]. These results show a significant improvement, that is, 20% on average and up to 50% for large ranges.
However, the relative performance of the proposed multipliers is still lower than that of the binary multipliers. Fig. 2 depicts the delay difference between the binary multiplier and the already optimised modulo $2^n + 1$ multiplier. This delay difference can rise up to 70% (for $n = 4$) and is, on average, about 20%.
In the alternative diminished-1 number representation for the residues modulo $2^n + 1$, proposed by Leibowitz [18], the difference between the binary multipliers and the $2^n + 1$ multipliers is even higher, as depicted in Fig. 3. The recently published optimised diminished-1 modulo $2^n + 1$ multiplier [13] was used. In this analysis, the difference is on average about 30%. These results underline the need for the other technique proposed in this paper, which balances the moduli sets and thus compensates for these significant differences in performance.
3
Balanced RNS moduli sets
As stated before, the fact that non-binary channels, mainly of the type $2^n + 1$, exhibit lower performance than the corresponding binary ones has a negative impact on the efficiency of RNS. In this section, the overloading of the binary channel is proposed in order to balance the computation time for the various channels, especially for the $2^n + 1$ multiplication, where the critical path is commonly located. The encoder and the decoder are the only new arithmetic units required by these new moduli sets, because the addition and the multiplication units for the traditional moduli set can be directly used. This section is organised as follows: first (i) the correctness of the proposed binary channel overloading technique is proved and then (ii) the encoder and the decoder units for the proposed and modified 3- and 4-moduli sets are presented.
Binary channel extension
Recently, moduli sets with larger binary channels have been proposed, mainly to increase the dynamic range [19]. The width of the binary channel was doubled in order to increase the dynamic range to approximately the value achieved with 4-moduli supersets. In this paper, the width of the binary channel is increased by a variable factor $2^k$, in order to overcome the relative difference in the complexity of the arithmetic units. The new class of moduli sets is comprehensive, in the sense that it also integrates the traditional 3-moduli sets and 4-moduli supersets. Although $k$ can take any general value, that is, $k > -n$, we consider only values with practical interest: $k \geq 0$. Therefore the following general new class of moduli sets is defined
which includes the traditional 3-moduli and 4-moduli sets, for $k = 0$. It is worth noting that the binary channel is the one that requires the simplest arithmetic units. For example, for fast adders based on binary tree structures the increase of, at most, $n$ bits leads to the introduction of just one further level in the tree, which corresponds to the additional stage usually required by the corresponding modulo $2^n + 1$ adders. However, in our proposal $k$ is a variable and can be adjusted according to the technology and the required dynamic range.
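To make the role of $k$ concrete, the sketch below computes the number of bits covered by the 3-moduli member of the class, $\{2^n - 1, 2^{n+k}, 2^n + 1\}$, taking the dynamic range as the product of the moduli. The helper name is an assumption of the example. The values compared are illustrative: $n = 17$, $k = 5$ is the 56-bit configuration quoted in Section 4, while the $k = 0$ entries show the nearest traditional sets.

```python
def dynamic_range_bits(n, k):
    """Number of bits needed for the largest value in the dynamic range of
    {2^n - 1, 2^(n+k), 2^n + 1}; the range M is the product of the moduli."""
    M = (2**n - 1) * 2**(n + k) * (2**n + 1)
    return (M - 1).bit_length()

balanced = dynamic_range_bits(17, 5)          # {2^17 - 1, 2^22, 2^17 + 1}
traditional_small = dynamic_range_bits(18, 0)  # traditional set, n = 18
traditional_large = dynamic_range_bits(19, 0)  # traditional set, n = 19
```

With $k = 0$, a range of at least 56 bits needs $n = 19$, whereas overloading the binary channel reaches 56 bits with 17-bit non-binary channels.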
Let us prove the validity of the new class of moduli sets, which corresponds to proving that all the elements of the moduli set are pairwise relatively prime.
3.1.1 Validity proof of the moduli sets: To prove that the elements of the proposed moduli sets are pairwise relatively prime, it is only necessary to prove that $2^{n+k}$ is relatively prime to each of the other elements, because it has already been proved that all the other elements are pairwise relatively prime (e.g. see [9, 20]).
The six pairs of numbers in Lemma 1 can be reduced to only two different types, simply by replacing the variables. Therefore, to prove Lemma 1 one only needs to compute the gcd for the following two pairs of numbers: $(2^{n+k}, 2^n - 1)$ and $(2^{n+k}, 2^n + 1)$.
and for $2^n + 1$ ($w < n$; $f \in \mathbb{N}$):
For f odd:
and for f even:
From (12) and (16), it can be concluded that $(2^{n+k}, 2^{n-1} - 1)$ and $(2^{n+k}, 2^{n-1} + 1)$ are pairs of relatively prime numbers for $n, k \in \mathbb{N}$. ∎
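The result of the proof can be spot-checked numerically. The following sketch verifies, for a range of small parameters, that the elements of the 3-moduli set $\{2^n - 1, 2^{n+k}, 2^n + 1\}$ are pairwise relatively prime. It is a sanity check on sampled values, not a substitute for the proof, and the helper names are assumptions of the example.

```python
from itertools import combinations
from math import gcd

def balanced_moduli(n, k):
    """3-moduli set with the binary channel widened by k bits."""
    return (2**n - 1, 2**(n + k), 2**n + 1)

def pairwise_coprime(moduli):
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

checked = all(
    pairwise_coprime(balanced_moduli(n, k))
    for n in range(2, 33)
    for k in range(0, n + 1)
)
```

The check succeeds for the intuitive reason behind the proof: a power of 2 shares no factor with an odd number, and $2^n - 1$ and $2^n + 1$ are odd numbers that differ by 2.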
New conversion units
In order to implement the binary channel overloading technique, memoryless encoder and decoder units are proposed for the new subclass of 3-moduli sets $\{2^n - 1, 2^{n+k}, 2^n + 1\}$. Values of $k$ such that $0 < k \leq n$, where $n \in \mathbb{N}$, are considered, as well as the standard and the diminished-1 number representations.
Binary to RNS converters: An integer
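where $N_3$ is a variable-length value, with $k$ bits, represented as
can be uniquely represented in RNS by the 3-tuple $\{x_1, x_2, x_3\}$ for the moduli set $\{2^n - 1, 2^{n+k}, 2^n + 1\}$.
The dynamic range of this moduli set takes the value $M$, the product of the three moduli, which is given by
$$M = (2^n - 1) \cdot 2^{n+k} \cdot (2^n + 1) = 2^{n+k} (2^{2n} - 1) \qquad (19)$$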
In order to obtain the RNS representation of the integer $X$, three independent converters, one for each channel, are required. The simplest converter is the one for the $m_2$ channel, because for the $2^n - 1$ and $2^n + 1$ channels the calculation of the corresponding residues depends on the value of all the bits of $X$.
Converter for the $2^{n+k}$ channel. The value of $x_2$ is the remainder of the division of $X$ by $2^{n+k}$, which can be obtained by keeping only the $(n + k)$ least significant bits of $X$
Converter for the $2^n - 1$ channel. Instead of using a division operation to calculate the $2^n - 1$ residue, which is a complex and expensive operation, both in terms of area and speed, this calculation can be performed as a sequence of additions
By taking into account that
(21) can be rewritten as
Hence, the conversion to the $2^n - 1$ channel can be performed simply by adding modulo $2^n - 1$ the several segments of $X$. However, in order to perform these four additions one does not need to use four modulo $2^n - 1$ carry propagate adders (CPA), which would be slow and inefficient. As represented in Fig. 4, it is possible to group the values to be added as
with the first and second additions ($\langle N_2 + N_1 + N_0 \rangle_{2^n-1}$ and $\langle N_3 + S_1 + C_1 \rangle_{2^n-1}$, respectively) being performed by 3-2 compressors (CSAs), without a significant increase in the delay or in the area. Nevertheless, the third and last modulo $2^n - 1$ addition ($\langle S_2 + C_2 \rangle_{2^n-1}$) requires a modulo $2^n - 1$ CPA.
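The two converters described so far can be modelled in a few lines. The sketch below assumes that $X$ has at most $3n + k$ bits and is split into three $n$-bit segments $N_0$, $N_1$, $N_2$ and a $k$-bit segment $N_3$ (with $k \leq n$), and it relies on $2^n \equiv 1 \pmod{2^n - 1}$. The additions are performed sequentially with an end-around carry, whereas the proposed hardware uses two CSAs followed by a single CPA. All function names are assumptions of the example.

```python
def split_segments(x, n):
    """Split x into N0, N1, N2 (n bits each) and N3 (the remaining bits)."""
    mask = (1 << n) - 1
    return x & mask, (x >> n) & mask, (x >> (2 * n)) & mask, x >> (3 * n)

def add_mod_2n_minus_1(a, b, n):
    """Modulo 2^n - 1 addition using an end-around carry."""
    mask = (1 << n) - 1
    s = a + b
    s = (s & mask) + (s >> n)    # feed the carry out back into the LSB
    return 0 if s == mask else s  # all ones is the second encoding of zero

def residue_2n_minus_1(x, n, k):
    """x1: sum of the segments of X modulo 2^n - 1 (assumes X < 2^(3n+k))."""
    acc = 0
    for segment in split_segments(x, n):
        acc = add_mod_2n_minus_1(acc, segment, n)
    return acc

def residue_binary(x, n, k):
    """x2: the n + k least significant bits of X."""
    return x & ((1 << (n + k)) - 1)
```

For example, with $n = 5$ and $k = 3$ the two functions return `x % 31` and `x % 256` for any 18-bit `x`.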
Converter for the $2^n + 1$ channel. Similarly, the $2^n + 1$ residue can be calculated as
(25) can be simplified to
The computation of (27) can be further simplified using only additions, given that a modulo $2^n + 1$ subtraction can be expressed as
and therefore (27) can be rewritten as
In the standard representation, the value of $x_3$ represents the $2^n + 1$ residue of the value $X$, instead of the value $X - 1$. It will be seen that the conversion unit is more complex in the former case than in the latter.
As in the modulo $2^n - 1$ conversion, the additions can be grouped and partially added by 3-2 compressors. Nevertheless, the constant value 4 has to be added to compute (29). However, as mentioned before, modulo $2^n + 1$ compressors not only add the input values, but also intrinsically add an extra unit [16]. Therefore, using this characteristic of the modulo $2^n + 1$ adders, the value of $x_3$ can be computed as
The binary to modulo $2^n + 1$ converter is depicted in Fig. 5, where it can be noticed that the last CSA is only used to add one unit to $S_2 + C_2 + 0$, resulting in a simplified 2-2 compressor.
In the diminished-1 representation, the remainder of the division is represented as $x_3 = \langle X - 1 \rangle_{2^n+1}$, that is, the obtained value is always the correct value minus one. The computation of $x_3$ for this representation, using a similar approach to the one used in (29), results in the following expression:
Decomposing the addition, (31) can be rewritten as
which corresponds to the hardware structure presented in Fig. 6 , which is even simpler than the one depicted in Fig. 5 for the standard representation. 
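The modulo $2^n + 1$ conversion can be modelled in the same spirit. The sketch below assumes the same segmentation of $X$ into $N_0, \ldots, N_3$ and uses two standard identities: $2^n \equiv -1 \pmod{2^n + 1}$, so the segments are added with alternating signs, and $-A \equiv \bar{A} + 2 \pmod{2^n + 1}$ for the $n$-bit one's complement $\bar{A}$, which is where a constant of 4 comes from when two segments are subtracted. The exact forms of (28) to (32) are not reproduced here, so this is an equivalent software model under those assumptions rather than the circuit of Fig. 5 or Fig. 6, and the function names are hypothetical.

```python
def residue_2n_plus_1(x, n):
    """x3 in the standard representation (assumes X < 2^(4n)).

    Computes N0 - N1 + N2 - N3 with each subtraction replaced by the
    addition of a one's complement plus 2, hence the constant 4.
    """
    m = (1 << n) + 1
    mask = (1 << n) - 1
    n0 = x & mask
    n1 = (x >> n) & mask
    n2 = (x >> (2 * n)) & mask
    n3 = x >> (3 * n)
    # mask ^ a is the n-bit one's complement of a, that is, 2^n - 1 - a
    return (n0 + (mask ^ n1) + n2 + (mask ^ n3) + 4) % m

def residue_2n_plus_1_diminished(x, n):
    """x3 in the diminished-1 representation: the residue of X - 1."""
    m = (1 << n) + 1
    return (residue_2n_plus_1(x, n) - 1) % m
```

For $n = 4$ and $k = 2$, the first function agrees with `x % 17` and the second with `(x - 1) % 17` over the whole 14-bit input range.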
RNS to binary converters:
The proposed RNS to binary memoryless converter is based on the CRT [16, 20, 21 ]
where $\hat{m}_i = M/m_i$, $\langle 1/\hat{m}_i \rangle_{m_i}$ is the multiplicative inverse of $\hat{m}_i$ modulo $m_i$ and $M$ is given by (19). Equation (33) can be written as [16]
where Z(X ) is a non-negative integer which is a function of X. For the proposed moduli set {m 1 ¼ 2
By replacing these values in (34) we obtain
and dividing $X$ by $2^{n+k}$:
where
Finally, with the simplification of the partial expressions F, G, H, 22
the overlapping bits $x_{3,n}$ and $x_{3,0}$ are merged together and represented by $x_{3,x} = (x_{3,n} \text{ or } x_{3,0})$. This is only possible because $x_{3,n}$ and $x_{3,0}$ are mutually exclusive and cannot be simultaneously equal to 1. The upper bits of $X$, above the $n + k$ bits supplied directly by the binary channel, can be calculated using two 3-2 compressors and one modulo $2^{2n} - 1$ CPA
As can be observed in the above equation, the conversion from RNS to binary using the new extended moduli set can be performed by adding modulo $2^{2n} - 1$ the four partial expressions in (39), directly obtained from the three RNS channels. In order to optimise the converter, only one modulo $2^{2n} - 1$ CPA is used in the final stage of the computation. The other additions are performed by two modulo $2^{2n} - 1$ CSAs, as depicted in Fig. 7, resulting in an RNS decoder with exactly the same structure as the RNS decoder for the traditional moduli set [20].
To decode an RNS value encoded in the diminished-1 representation, the value from the $2^n + 1$ channel ($x_3$) has to be incremented by one, which is equivalent to adding $2^{2n-1} - 2^{n-1}$ modulo $2^{2n} - 1$ in (38). This extra value can be added by an extra simplified CSA, in which one of the inputs is a constant value. As a result, only an extra HA is introduced in the critical path of the decoder, as depicted in Fig. 8.
Note that the complexity of the RNS to binary conversion for the new moduli sets is exactly the same as for the traditional moduli set. The only difference lies in the partial expression $F$, in that this new moduli set has no constant terms [20, 22].
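The structure of the reverse conversion can be illustrated with a behavioural model. The sketch below does not reproduce the partial expressions $F$, $G$ and $H$; it only uses two facts stated in the text: the lower $n + k$ bits of $X$ are the binary-channel residue $x_2$, and the remaining upper bits are the result of a computation modulo $2^{2n} - 1$. The modular inverses are obtained with Python's built-in `pow`, and the function name and argument order are assumptions of the example.

```python
def rns_to_binary(x1, x2, x3, n, k):
    """Behavioural reverse converter for {2^n - 1, 2^(n+k), 2^n + 1}.

    x1, x2, x3 are the residues for 2^n - 1, 2^(n+k) and 2^n + 1
    (standard representation).
    """
    m1 = (1 << n) - 1
    m3 = (1 << n) + 1
    m13 = m1 * m3  # 2^(2n) - 1
    shift = n + k

    # Residue of X modulo 2^(2n) - 1, from the two odd channels (two-moduli CRT)
    x13 = (x1 * m3 * pow(m3, -1, m1) + x3 * m1 * pow(m1, -1, m3)) % m13

    # X = x2 + 2^(n+k) * upper, with 0 <= upper < 2^(2n) - 1
    upper = ((x13 - x2) * pow(1 << shift, -1, m13)) % m13

    # The binary channel provides the lower n + k bits unchanged
    return (upper << shift) | x2
```

For instance, with $n = 4$ and $k = 2$ every value below the dynamic range $64 \times 255 = 16320$ is recovered from its three residues.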
Experimental results
We have synthesised circuits to implement the enhanced modulo $2^n + 1$ multipliers and the backward and forward converters proposed for the new moduli sets, using the Synopsys synthesis tools and the 0.13-μm standard cell technology mentioned in Section 1. The experimental results make it possible to evaluate the real impact of each proposed technique individually, as well as the efficiency of the enhanced multipliers applied to the modified RNS moduli set.
Enhanced modulo $2^n + 1$ multipliers
Fig. 9 depicts the ratio of the processing time between the proposed modulo $2^n + 1$ multiplier and the one proposed by Zimmermann [14]. From this figure, it can be observed that the proposed multiplication architecture significantly improves the processing time. However, there are exceptions, as is the case for 37 bits. In this particular situation, the Zimmermann architecture has a better performance, since the advantages provided by the WT characteristics are not so significant for such a width. The new multiplication architecture is, on average, 9% better than the one proposed by Zimmermann [14]. In fact, for bigger word lengths the improvement obtained by this multiplication architecture is, on average, above 10% (see Table 2).
The approach taken to introduce the partial product P 2 in the new multiplication architecture leads to an increase of the circuit area. This difference becomes less significant as the number of bits increases (see Fig. 10 ). This is only a significant disadvantage, in comparison to the architecture proposed by Zimmermann, when the number of bits is less than 8 (the circuit area increases by more than 20%). When the number of bits becomes greater than 16, the difference in the circuit area decreases to less than 10%. For operands with more than 40 bits, this difference is approximately 2%, as depicted in Fig. 11 , while improving the delay by more than 10%.
To evaluate the efficiency of the proposed multipliers, the product of the circuit area ($A$) by the square of the time ($T$), $AT^2$, was adopted as the figure of merit. Using this metric, and according to the experimental results depicted in Fig. 12, one can conclude that a significant improvement is achieved using the new multipliers, with an efficiency gain of more than 20%.
New moduli set
Fig. 9 Relative delay of the new multiplier regarding the multiplier proposed by Zimmermann
The values presented for the new moduli set $\{2^n - 1, 2^{n+k}, 2^n + 1\}$ are for $0 \leq k \leq n$, whereas the values for the traditional moduli set are for $k = 0$. From Fig. 13, depicting the delay results for the binary to RNS converters, it can be seen that the converter for the new moduli set exhibits a slight performance improvement, about 4% on average. This is because of the shorter length of the modulo ($2^n + 1$) adders used in the converters, because for the same dynamic range the value of $n$ is smaller. However, these conversion units require about 16% more circuit area (Fig. 14) because of the use of an extra CSA in the $2^n + 1$ channel to add the extra constant 1 (see Fig. 5 and (30)).
The proposed conversion units from binary to diminished-1 RNSs are extremely compact, even more than the conversion units for the traditional moduli set. These conversion units are, on average, 10% faster ( Fig. 15) and, unlike for the standard representation, require slightly less circuit area than the traditional moduli set (Fig. 16) .
The RNS to binary converter for the new moduli set uses exactly the same structure as the converter for the original moduli set, with the advantage of having a smaller value of $n$, thus using smaller adders. This advantage leads not only to an average reduction of about 10% in the conversion time (Fig. 17), but also to a decrease of about 15% in the required implementation area (Fig. 18). This improvement is mostly due to the longer binary channel ($2^{n+k}$), which is directly used to obtain the lower $n + k$ bits of the conversion result.
Taking into account the total cost of the two conversion units, the advantage of using the new moduli set becomes clear. Both the forward and the backward conversion units have approximately the same delay, which does not happen for the traditional moduli set, where the RNS to binary conversion is slower than the converter in the opposite direction. Concerning the required circuit area, the proposed new moduli set allows for a reduction of, on average, 2%.
RNS multiplication in the new moduli set:
The proposed moduli set achieves a more balanced processing delay between the arithmetic units, due to the overloading of the binary channel; thus leading to an improvement of the overall computation efficiency. Fig. 19 depicts the delay difference for the multiplication operation, considering the new and the original moduli sets.
The use of the already optimised multiplier for the standard representation in the proposed moduli set allows an improvement in the multiplication delay of up to 18%. For example, for $n = 17$ the proposed modulo $2^n + 1$ multiplier is 16% faster and 23% more efficient than the related art. Combining it with the proposed moduli set, for a dynamic range of 56 bits $\{2^{17} - 1, 2^{22}, 2^{17} + 1\}$, the multiplication becomes a further 12% faster. This results in an overall improvement in the multiplication time of 32%, compared with the case where neither the improved multiplier nor the proposed moduli set is used.
For the diminished-1 representation, the improvement to the multiplication time is even higher. Fig. 20 depicts the difference between the multiplication delay in the original and in the proposed moduli set, using the already improved diminished-1 multiplier proposed by the authors of this paper [13] . For example, for a dynamic range of 56 bits, the proposed moduli set allows for a 15% faster multiplication.
Conclusions
This paper proposes an improvement to RNS that use the modulo $2^n + 1$ multiplication. The first step to improve the multiplication in RNS concerns the optimisation of the multiplication unit itself. However, because even the improved unit is still slower than the binary multipliers, we propose a new class of modified moduli sets that is capable of adapting the length of the binary channel ($\{2^n - 1, 2^{n+k}, 2^n + 1\}$) to the employed technology. The proposed multiplier is an improved architecture based on the modulo $2^n + 1$ multiplier proposed by Zimmermann. Experimental results suggest that multipliers 10% faster can be achieved at the expense of some additional area. Nevertheless, for large dynamic ranges the area increase is not significant, requiring, on average, an additional 2% of silicon area. Furthermore, it can be concluded that for units using more than 9 bits, the proposed architecture is, on average, 20% more efficient than the original architecture.
The proposed extension of the binary channel allows the creation of more balanced and adaptable moduli sets. Conversion units were proposed for the particular case of the new 3-moduli set ($\{2^n - 1, 2^{n+k}, 2^n + 1\}$). Experimental results show more balanced forward and backward conversion times. Moreover, compared with the original conversion units, the proposed ones are faster and require less circuit area.
When both balanced moduli sets are adopted and enhanced multipliers are used, the overall multiplication can be further improved up to 32%, with a delay reduction between 7 and 15%. 
