ABSTRACT Inter-modulo operations are the most time-consuming and costly operations of the residue number system (RNS), and one of the main obstacles to applying RNS in practice to the design of computing devices, namely for signed integer arithmetic. In this paper, we derive simplified and unified mathematical formulations for inter-modulo operations, such as sign detection, magnitude comparison, scaling of signed integers, and signed reverse conversion, grounded on the principles of reverse conversion. These formulations, which cover a whole range of sets with 3 to 5 moduli, are used to design components that, when integrated, yield efficient and complete multifunctional units that reuse blocks to perform several RNS inter-modulo operations. Not only have the proposed individual components been compared with related art, but a configuration of the proposed multifunctional unit has also been implemented in application-specific integrated circuits. Experimental results show that the multifunctional unit is significantly more area- and power-efficient than other state-of-the-art solutions, and that the performance of the individual components compares well with that of dedicated ones. The novel multifunctional units are thus a further step toward the integration of RNS in constrained systems.
I. INTRODUCTION
Residue number systems (RNSs) [1] have been considered in computer arithmetic [2] for more than half a century as a potential tool to turn a wish into reality: Parallel Arithmetic [3]. This challenging unconventional number system can break long carry-propagation chains by bounding them inside smaller modulo channels that work in parallel with each other [4], [5]. This important property is the reason for using this type of number system in a range of applications, from embedded [6] and digital signal processing systems [7] to cryptography [8]. While this parallelism for addition and multiplication is quite profitable for digital systems designers, some operations are difficult to perform efficiently in RNS, such as sign detection [9]-[12], magnitude comparison [13]-[16], reverse conversion [17], [18] and scaling [19], [20]. These operations are usually designated inter-modulo RNS operations, since they require processing the values of the residues in all channels, in contrast to modular addition, subtraction, multiplication and exact division, for which the operations in the channels are performed individually and in parallel. While those inter-modulo operations are essential in a real RNS-based processor, their algorithms and hardware realizations are complex.
Most of the previous publications on RNS focus on designing efficient circuits and systems for modulo addition [21], [22], multiplication [23], [24], and reverse conversion [25]. However, without efficient circuits for the other difficult inter-modulo operations, it is not possible to adopt RNS in practice for applications other than simple linear processing systems. Therefore, there is a significant gap between the previous research directions and the practical interest of RNS. By filling this gap, RNS would not only enhance performance but also serve as an efficient tool to achieve low power consumption [7], [26] and low-cost error detection and correction [27], [28].
Hence, we should first eliminate the significant constraints of RNS, and design optimized and efficient full RNS systems, so that they can be introduced in general computer engineering applications. Moreover, to achieve unrestricted and efficient systems, one must also consider Signed Residue Number Systems. Most of the research work on RNS, namely for comparison, scaling and reverse conversion, has either glossed over or ignored signed arithmetic and focused only on the magnitude of numbers. Even if individual units for each of these signed-number operations were designed, they might not necessarily lead to efficient RNSs, since they may result in a high area cost and power consumption, which devalues the speed gains of fast independent residue modular channels.
The cornerstones of this work are the following observations: i) inter-modulo operations, the hardest in RNS, systematically rely to some extent on mapping the representation of the numbers from RNS into a weighted notation; ii) for this type of operation, moduli sets may be classified, based on their characteristics, into two main classes. With this in mind, we propose a new way of designing RNS-based datapaths by using a unified arithmetic unit, herein called a Multifunctional Unit (MU). The MU is a single unit, based on the reverse conversion principle, that is able to perform difficult RNS operations, namely sign detection, magnitude comparison, reverse conversion and scaling, generalized to handle signed arithmetic. The MU promotes hardware reuse, achieving a significant area saving when a full RNS system is implemented, while assuring low processing delay.
This work is partially inspired by [29], where slight modifications to the reverse converters of a particular class of moduli sets were proposed in order to efficiently process signed numbers. Herein, we also exploit the principle of computing the sign of a number to correct the result of signed reverse conversions, but apply it to other difficult RNS operations. Overall, the main contributions of this paper can be summarized as follows.
- Simplified formulation of inter-modulo operations, such as sign detection, magnitude comparison, support for scaling signed integers and reverse conversion of signed integers, for a whole range of 3-, 4- and 5-moduli sets;
- Efficient architectures for those inter-modulo operations, based on the formulations proposed in this paper;
- MUs supporting all inter-modulo operations in RNS, with minimum circuit area, power consumption and delay overhead.

The rest of this paper is organized as follows. After a brief introduction to the signed residue number system in Section II, the proposed formulations and algorithms are presented in Section III. Hardware architectures for each operation are presented in Section IV, and MUs are proposed in Section V. Evaluation of architectures and units is performed in Section VI, and Section VII concludes this paper.
II. BACKGROUND
A Residue Number System (RNS) relies on sets of $h$ pairwise relatively prime integer moduli. A moduli set $\{m_1, m_2, \dots, m_h\}$ defines a range of $M = \prod_{i=1}^{h} m_i$ different integers, each of which can be represented by a unique tuple $\{x_1, x_2, \dots, x_h\}$. The residue $x_i$ is the least non-negative remainder of the division of $X$ by $m_i$, usually represented by $|X|_{m_i}$.
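As an illustration of these definitions, the following minimal Python sketch computes the dynamic range and the residue tuple of an integer. The function names `dynamic_range` and `to_rns` are hypothetical helpers, and the moduli set (16, 17, 5, 3) is chosen only as an example.

```python
from math import gcd, prod


def dynamic_range(moduli):
    """Return M, the product of the moduli, after checking pairwise coprimality."""
    for i, a in enumerate(moduli):
        for b in moduli[i + 1:]:
            if gcd(a, b) != 1:
                raise ValueError(f"moduli {a} and {b} are not coprime")
    return prod(moduli)


def to_rns(x, moduli):
    """Return the tuple of least non-negative residues |x|_{m_i}."""
    return tuple(x % m for m in moduli)


moduli = (16, 17, 5, 3)          # illustrative moduli set (an assumption)
M = dynamic_range(moduli)        # 16 * 17 * 5 * 3 = 4080
residues = to_rns(2043, moduli)  # (11, 3, 3, 0)
```

Because Python's `%` operator always returns a non-negative result for a positive modulus, `to_rns` yields the least non-negative remainder required by the definition.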
The $M$ different tuples may represent unsigned integers $X$ in the range $[0, M - 1]$, or signed numbers $\tilde{X}$, where [5]:

$$\tilde{X} = \begin{cases} X, & \text{if } \tilde{X} \ge 0 \\ X - M, & \text{if } \tilde{X} < 0 \end{cases} \qquad (1)$$

Although the traditional 3-moduli set has been intensively investigated [30], its limited dynamic range and reduced parallelism led to the proposal and investigation of larger moduli sets, namely sets with four and five moduli. Table 1 organizes the most commonly used RNS moduli sets in the literature into two main categories, according to features that make the reverse conversion more (c-class) [

There are different methods for computing the residue-to-binary conversion, namely the Chinese Remainder Theorem (CRT) [5], Mixed-Radix Conversion (MRC) [5] and what is usually designated New CRTs I and II [45]. Among these methods, the New CRTs are featured in this work, since they allow us to derive the simplest conversion formulations for the different types of moduli sets.
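New Chinese Remainder Theorem I [34]: Considering the RNS moduli set $\{m_1, m_2, \dots, m_h\}$, the weighted representation of a number $X$ can be obtained from its residues $(x_1, x_2, \dots, x_h)$ by applying the New CRT-I:

$$X = x_1 + m_1 Y \qquad (2)$$

$$Y = \left| k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + \dots + k_{h-1} m_2 \cdots m_{h-1} (x_h - x_{h-1}) \right|_{m_2 m_3 \cdots m_h} \qquad (3)$$

where the required multiplicative inverses $k_i$ are calculated as:

$$\left| k_1 m_1 \right|_{m_2 m_3 \cdots m_h} = 1 \qquad (4)$$

$$\left| k_2 m_1 m_2 \right|_{m_3 \cdots m_h} = 1 \qquad (5)$$

$$\vdots$$

$$\left| k_{h-1} m_1 m_2 \cdots m_{h-1} \right|_{m_h} = 1 \qquad (6)$$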
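The conversion can be prototyped in a few lines of Python. This is a minimal sketch, assuming the usual statement of the New CRT-I, in which $X = x_1 + m_1 Y$ and $k_i$ is the inverse of $m_1 \cdots m_i$ modulo $m_{i+1} \cdots m_h$. The name `new_crt1` is hypothetical, and `pow(a, -1, m)` requires Python 3.8 or later.

```python
from math import prod


def new_crt1(residues, moduli):
    """Reverse conversion with the New CRT-I; returns (X, Y) with X = x1 + m1 * Y."""
    h = len(moduli)
    tail = prod(moduli[1:])              # m2 * m3 * ... * mh
    acc = 0
    weight = 1                           # m2 * ... * m_i (empty product when i = 1)
    for i in range(1, h):
        # k_i: inverse of m1*...*m_i modulo m_{i+1}*...*m_h
        k_i = pow(prod(moduli[:i]), -1, prod(moduli[i:]))
        acc += k_i * weight * (residues[i] - residues[i - 1])
        weight *= moduli[i]
    y = acc % tail
    return residues[0] + moduli[0] * y, y
```

For the moduli set (16, 17, 5, 3) and the residues (11, 3, 3, 0), which are the values used in the examples of Section III, the sketch returns X = 2043 and Y = 127.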
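The New CRT-II leads to tree-like architectures with a depth that depends on the number of moduli in the set. For a 4-moduli set $\{m_1, m_2, m_3, m_4\}$, which is one of the main cases considered in this work, the New CRT-II provides the following formulation for the RNS residues $(x_1, x_2, x_3, x_4)$:

$$X = Z + m_1 m_2 \left| k_1 (W - Z) \right|_{m_3 m_4} \qquad (7)$$

$$Z = x_1 + m_1 \left| k_2 (x_2 - x_1) \right|_{m_2} \qquad (8)$$

$$W = x_3 + m_3 \left| k_3 (x_4 - x_3) \right|_{m_4} \qquad (9)$$

where the following relations are observed for the multiplicative inverses $k_i$:

$$\left| k_1 m_1 m_2 \right|_{m_3 m_4} = 1 \qquad (10)$$

$$\left| k_2 m_1 \right|_{m_2} = 1 \qquad (11)$$

$$\left| k_3 m_3 \right|_{m_4} = 1 \qquad (12)$$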
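The two-level structure of the New CRT-II for four moduli can be sketched as follows. This is a minimal Python illustration, assuming the usual statement of the theorem: two independent two-moduli conversions produce $Z$ and $W$, which are then merged. The name `new_crt2` is hypothetical, and `pow(a, -1, m)` requires Python 3.8 or later.

```python
def new_crt2(residues, moduli):
    """Reverse conversion with the New CRT-II for a 4-moduli set; returns (X, Z, W)."""
    x1, x2, x3, x4 = residues
    m1, m2, m3, m4 = moduli
    k1 = pow(m1 * m2, -1, m3 * m4)   # inverse of m1*m2 modulo m3*m4
    k2 = pow(m1, -1, m2)             # inverse of m1 modulo m2
    k3 = pow(m3, -1, m4)             # inverse of m3 modulo m4
    z = x1 + m1 * ((k2 * (x2 - x1)) % m2)            # first-level pair (m1, m2)
    w = x3 + m3 * ((k3 * (x4 - x3)) % m4)            # first-level pair (m3, m4)
    x = z + m1 * m2 * ((k1 * (w - z)) % (m3 * m4))   # second-level merge
    return x, z, w
```

For the moduli set (4, 31, 5, 3) and the residues (0, 26, 0, 2), the sketch gives Z = 88, W = 5 and X = 1700, which agrees with the values quoted in the examples of Section III.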
III. ON THE MATHEMATICAL FORMULATION OF INTER-MODULI RNS OPERATIONS
In this section, we derive simplified mathematical formulations for inter-modulo RNS operations, supported on the residue-to-binary conversion principles. By revisiting the reverse conversion formulation, methods for sign detection, signed comparison, signed-output reverse conversion, and scaling will be derived. For each of these operations, we discuss separately the formulation for the RNS moduli sets in the c-class and a-class (see Table 1). Firstly, methods are derived for the general cases, and then a case study for a particular moduli set is presented. For the case studies, the moduli sets $\{2^{2n}, 2^n - 1, 2^n + 1, 2^{2n} + 1\}$ and $\{2^n, 2^{2n+1} - 1, 2^n - 1, 2^n + 1\}$ [35] are considered for the c- and a-class moduli sets, respectively.
A. REVERSE CONVERSION
The New CRTs in (2) to (12) are adopted to derive the formulation for the considered inter-modulo operations. The New CRT-I is used for the c-class moduli sets, while for the a-class moduli sets the New CRT-II is applied. The reverse conversions are compactly represented using composite forms of RNS bases. A composite form $\bar{B}$ of an RNS basis $B$ is a basis such that the number of elements of $\bar{B}$ is less than or equal to that of $B$, but such that $\prod_{b \in \bar{B}} b = \prod_{b \in B} b$.
1) REVERSE CONVERSION FOR c-CLASS MODULI SETS
An interesting feature of the c-class moduli sets is that the product of all moduli except $2^k$ takes the value of a power of two minus one, i.e. $2^P - 1$, where $k$ and $P$ depend on the considered moduli set. Thus, these moduli sets can be described in composite form as $\{2^k, 2^P - 1\}$. There is a large class of possible c-class RNS moduli sets, the most representative of which can be found in Table 1. By applying the New CRT-I, the computation of the reverse conversion for this category of moduli sets [29] can be performed as

$$X = x_1 + 2^k Y \qquad (13)$$

$$Y = \left| k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + \dots + k_{h-1} m_2 \cdots m_{h-1} (x_h - x_{h-1}) \right|_{2^P - 1} \qquad (14)$$

where $x_1$ is the residue in the binary channel, and $Y$ can be calculated using Carry-Save Adders (CSAs) with End-Around Carry (EAC) [46], followed by a two-operand Carry-Propagate Adder (CPA) with EAC to add the Sum and Carry bit-vectors.
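To make the role of the end-around carry concrete, the sketch below models a two-operand addition modulo $2^P - 1$ in Python: the carry-out of the $P$-bit sum is fed back into the least significant position. The function name `eac_add` is hypothetical, and mapping the all-ones pattern to zero is a modelling assumption (the all-ones word is a second representation of zero in this arithmetic).

```python
def eac_add(a, b, p):
    """Add two p-bit operands modulo 2**p - 1 with an end-around carry."""
    mask = (1 << p) - 1
    s = a + b
    s = (s & mask) + (s >> p)      # feed the carry-out back into the LSB
    return 0 if s == mask else s   # all ones is the alternative encoding of zero
```

A single feedback step is enough for operands of at most $P$ bits, since the folded sum can never exceed $2^P - 1$.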
2) REVERSE CONVERSION FOR a-CLASS MODULI SETS
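The a-class moduli sets can be represented in the general form $\{2^k, 2^Q \pm 1, 2^n - 1, 2^n + 1\}$. By changing the values of $k$ and $Q$, one obtains the a-class moduli sets in Table 1. Based on the New CRT-II conversion algorithm, we have

$$X = Z + 2^k (2^Q \pm 1) \left| k_1 (W - Z) \right|_{2^{2n} - 1} \qquad (15)$$

$$Z = x_1 + 2^k \left| k_2 (x_2 - x_1) \right|_{2^Q \pm 1} \qquad (16)$$

$$W = x_3 + m_3 \left| k_3 (x_4 - x_3) \right|_{m_4} \qquad (17)$$

where $m_3 m_4 = 2^{2n} - 1$ and the $k_i$'s are multiplicative inverses computed with (10)-(12).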
3) COMPOSITE FORM OF THE MODULI-SETS
It should be noticed that both c-class and a-class moduli sets can be described with the composite form $\{M_1, M_2\}$. The value of $X$ can be obtained from the values $X_1$ and $X_2$ as follows:

$$X = X_1 + M_1 X_2 \qquad (18)$$

where $M_1$, $M_2$, $X_1$, $X_2$ can be found in Table 2 for each class of moduli sets, and compile (13) and (15). The notation $X_{i,j}$ will be used henceforth to refer to the $j$th LSB (Least Significant Bit) of $X_i$.
B. SIGN DETECTION
Sign detection is a demanding operation in RNS [13]. In contrast to weighted representation systems, namely the binary system, in which the sign of a number is identified by its Most Significant Bit (MSB), there is no sign digit in RNS. Therefore, it is much more difficult in RNS to detect whether a number is positive or negative. Herein we start from the observation that sign detection relies, in one way or another, on the principles of reverse conversion, and derive formulations that lead to efficient sign detection methods for both classes of RNS moduli sets. Based on (1), one concludes that the sign of an RNS representation can be obtained from (18) by detecting whether $X \ge M/2$.

It is readily noticeable from (18) that when $X_2 \ne (M_2 - 1)/2$ the range of values that $X_2$ can assume is evenly split in two, and that each half corresponds solely to positive or negative $\tilde{X}$ according to (1):

$$X_2 < \frac{M_2 - 1}{2} \Rightarrow \tilde{X} \ge 0 \qquad (19)$$

$$X_2 > \frac{M_2 - 1}{2} \Rightarrow \tilde{X} < 0 \qquad (20)$$

For the composite forms in Table 2, (19) is satisfied by all $X_2$ whose MSB is '0' except for $X_2 = \frac{M_2 - 1}{2}$, and (20) is satisfied by all $X_2$ whose MSB is '1'. These intervals are represented in Figure 1. When $X_2 = \frac{M_2 - 1}{2}$, the sign depends on $X_1$: if $X_1 \ge \frac{M_1}{2}$, $X$ is negative, otherwise it is positive. This comparison can be computed efficiently for the composite sets in Table 2, since $M_1$ has either the form $2^k$ or $2^k (2^Q \pm 1)$. In the former case, to detect whether $X_1 \ge 2^{k-1}$ or $X_1 < 2^{k-1}$ the MSB of $X_1$ suffices:

$$X_1 \ge 2^{k-1} \iff X_{1,k-1} = 1 \qquad (21)$$

In the latter case, since the value being compared is $2^{k-1} (2^Q \pm 1)$, its binary representation has $k - 1$ '0's as its LSBs, which means the $k - 1$ LSBs of $X_1$ can be disregarded. A noteworthy detail is that whereas $X_1$ is represented by $k + Q$ bits when $M_1 = 2^k (2^Q - 1)$, it requires $k + Q + 1$ bits when $M_1 = 2^k (2^Q + 1)$. Thus, the $Q + 1$ or $Q + 2$ MSBs of $X_1$ have to be compared with $2^Q - 1$ or $2^Q + 1$, respectively:

$$X_1 \ge 2^{k-1} (2^Q - 1) \iff \left\lfloor \frac{X_1}{2^{k-1}} \right\rfloor \ge 2^Q - 1 \qquad (22)$$

$$X_1 \ge 2^{k-1} (2^Q + 1) \iff \left\lfloor \frac{X_1}{2^{k-1}} \right\rfloor \ge 2^Q + 1 \qquad (23)$$
Using the rationale above, the formulas described in Table 3 for the sign can be reached for both moduli classes. Therein, $S = 1$ corresponds to $\tilde{X} < 0$, whereas $S = 0$ is obtained for $\tilde{X} \ge 0$.

Example 1: Consider the c-class moduli set $\{m_1, m_2, m_3, m_4\} = \{2^{2n}, 2^{2n} + 1, 2^n + 1, 2^n - 1\}$ for $n = 2$, i.e. $k = 4$ and $P = 8$ for the composite set, and the RNS number $(x_1, x_2, x_3, x_4) = (11, 3, 3, 0)$. The dynamic range is $M = 16 \times 17 \times 5 \times 3 = 4080$. First, using (3), $X_2$ is obtained as 127. Now, since $X_2 = \frac{M_2 - 1}{2}$, we check whether $X_1 = x_1 \ge 8$, and conclude that this RNS number is negative. To verify the result, we can compute the complete weighted representation using (2), which is 2043, and then compare it with half of the dynamic range, which is 2040. Since 2043 is larger than 2040, the number is negative.

Example 2: Consider the moduli set $\{2^n, 2^{2n+1} - 1, 2^n + 1, 2^n - 1\}$, which is a special case of the general a-class moduli sets, with $k = n$ and $Q = 2n + 1$ in Table 2. For $n = 2$, i.e. the moduli set $\{4, 31, 5, 3\}$, and the RNS number $(0, 26, 0, 2)$, one obtains $X_2 = 13 \ne \frac{M_2 - 1}{2} = 7$. Thus, the sign is equal to the MSB of $X_2$, which is 1. Hence, this number is negative. To verify the result, according to (7), $X = 1700$, which is greater than half of the dynamic range, 930. Thus, the result of the sign detection is correct.
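The decision rule of this section can be summarized in a short Python sketch that works on the two composite digits. It assumes $X = X_1 + M_1 X_2$ with $M_1$ even and $M_2$ odd, as for the composite forms considered here, and the function name `sign_bit` is hypothetical. The hardware formulations operate on individual bits, whereas this model uses integer comparisons for clarity.

```python
def sign_bit(x1, x2, m1, m2):
    """Return S = 1 if the signed value is negative, else S = 0.

    Assumes X = x1 + m1 * x2 with 0 <= x1 < m1 and 0 <= x2 < m2,
    m1 even and m2 odd, so that M / 2 = m1 / 2 + m1 * (m2 - 1) / 2.
    """
    mid = (m2 - 1) // 2
    if x2 != mid:
        return int(x2 > mid)     # the upper half of the X2 range is negative
    return int(x1 >= m1 // 2)    # tie on X2: the decision falls to X1
```

With the digits of Example 1, `sign_bit(11, 127, 16, 255)`, the tie on $X_2$ is resolved by $X_1 \ge 8$, and with those of Example 2, `sign_bit(88, 13, 124, 15)`, the decision is taken on $X_2$ alone. Both calls report a negative number.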
C. SIGNED REVERSE CONVERSION
In the previous section, it has been shown that the sign of an integer can be detected in a simple manner from the structure of different reverse converters. This section addresses how to correct the output of a reverse converter when the result is negative. Firstly, (18) is rewritten for $\tilde{X} < 0$:

$$\tilde{X} = X - M = X_1 + M_1 X_2 - M_1 M_2 \qquad (24)$$

When considering the binary representation of $M = M_1 M_2$ for all the considered moduli-set classes in Table 2, since $M$ is the product of values close to powers of two, the two's complement representation of its symmetric, $-M$, is very simple:

In particular, for the considered moduli sets in Table 2, (24) can be simplified as follows:

This allows us to reach the following unified formulation for $\tilde{X}$:
where $\tilde{C}$ takes the values in Table 4.

Example 3: Consider the moduli set and the number introduced in Example 1. The number 2043 is the output of the converter according to the regular conversion formula and converter architecture. However, 2043 represents a negative number in the signed RNS representation. After detecting the sign as described in Example 1, $X_2$ should be increased by one, resulting in $128 = (1000\,0000)_2$. Since $M_1 = 2^k$, the addition $X = X_1 + M_1 X_2$ corresponds to the concatenation of $X_2$ (as the MSBs) and $X_1$ (as the LSBs), leading to $(1000\,0000\,1011)_2$, which is the two's complement representation of $-2037$. The result is correct since $2043 - 4080 = -2037$.

Example 4: Continuing with Example 2, the RNS number $(0, 26, 0, 2)$ based on the moduli set $\{4, 31, 5, 3\}$ is detected as negative. From $X_1 = 88$ and $X_2 = 13$, (31) can be evaluated using $\tilde{C} = 2^{k+2n}$, producing $\tilde{X} = 1888 = 2048 - 160$, which is the two's complement representation of $-160$. This value can be verified to be correct since $X = 1700$ and $\tilde{X} = X - M = 1700 - 1860 = -160$.
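For the c-class composite form $\{2^k, 2^P - 1\}$, the correction described in Example 3 can be sketched in Python as below. The sketch relies on the identity $-M \equiv 2^k \pmod{2^{k+P}}$, which follows from $M = 2^{k+P} - 2^k$, so that adding $2^k$ (incrementing $X_2$) yields the $(k+P)$-bit two's complement word. The function name is hypothetical and the a-class constants of Table 4 are not modelled.

```python
def signed_convert_c_class(x1, x2, k, p):
    """Return (word, value) for the composite form {2**k, 2**p - 1}.

    word  : (k + p)-bit two's complement pattern of the signed number
    value : the signed integer that this pattern encodes
    """
    mid = (1 << (p - 1)) - 1                   # (M2 - 1) / 2
    negative = x2 > mid or (x2 == mid and x1 >= (1 << (k - 1)))
    if negative:
        x2 += 1                                # conditional increment of X2
    word = (x2 << k) | x1                      # X2 gives the MSBs, X1 the LSBs
    value = word - (1 << (k + p)) if negative else word
    return word, value
```

With the digits of Example 3, `signed_convert_c_class(11, 127, 4, 8)` returns the word 2059, i.e. $(1000\,0000\,1011)_2$, and the value $-2037$.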
D. SIGNED COMPARISON
Signed comparison is one of the most complex RNS operations, since both sign detection and magnitude comparison are needed. Here, we show how, based on reverse conversion formulas, signed comparison can be done for different RNS moduli sets. It is worth noting that the sign is efficiently available from the reverse converter, and that the intermediate digits of the numbers are also computed across the reverse conversion structure.
We consider the parallel method to perform signed-magnitude comparison first introduced in [13]. This method can be adapted for the comparison of two RNS numbers, $A$ and $B$, in the considered moduli-set classes. One starts by computing $A_1$, $A_2$ and $B_1$, $B_2$ as in (18), as well as the signs of $A$ and $B$, all in parallel. If the two signs are different, the positive number is the greater one. Otherwise, $A_2$ and $B_2$ should be compared, and the number with the greater digit is the greater number. If the two digits $A_2$ and $B_2$ are equal, $A_1$ and $B_1$ should be compared, and the result of the comparison of these two latter digits determines the comparison of the two numbers, i.e. if $A_1 > B_1$ then $A > B$. Whereas this method achieves a small delay, it requires a large area, since two parallel reverse converters are necessary.
The aforementioned method for the comparison of A and B can be readily converted to a sequential approach, which reduces the requirements to a single reverse converter. One starts by computing the sign and the weighted digits of A, A 1 and A 2 . These values are stored in registers while the reverse converter is reused to compute the sign of B as well as B 1 and B 2 . As soon as the signs of both numbers are available, along with their weighted digits, the greatest number can be determined using a comparison similar to the one presented for the parallel method.
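The comparison rule shared by the parallel and sequential approaches can be sketched in Python as follows. This is a minimal model that assumes the composite digits $(X_1, X_2)$ of both operands are already available, with $M_1$ even and $M_2$ odd. The names `sign_bit` and `compare_signed` are hypothetical, and the three returned flags stand for "greater", "lesser" and "equal".

```python
def sign_bit(x1, x2, m1, m2):
    """Return 1 if X = x1 + m1 * x2 encodes a negative number, else 0."""
    mid = (m2 - 1) // 2
    if x2 != mid:
        return int(x2 > mid)
    return int(x1 >= m1 // 2)


def compare_signed(a, b, m1, m2):
    """Compare two numbers given as digit pairs (X1, X2); return flags (G, L, E)."""
    sa = sign_bit(a[0], a[1], m1, m2)
    sb = sign_bit(b[0], b[1], m1, m2)
    if sa != sb:                         # different signs: the positive one wins
        a_greater = sa == 0
        return int(a_greater), int(not a_greater), 0
    key_a = (a[1], a[0])                 # compare X2 first, then X1
    key_b = (b[1], b[0])
    return int(key_a > key_b), int(key_a < key_b), int(key_a == key_b)
```

When both operands are negative, comparing the digits directly is still valid, because subtracting $M$ from both values preserves their order.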
E. SCALING
Scaling is one of the most important as well as complex operations in RNS. Scaling, i.e. division by one specific value (typically that of one modulus), is used to avoid overflow in residue number system arithmetic, because the results of operations may be larger than the dynamic range. When overflow occurs, we obtain a residue modulo $M$ (the dynamic range) instead of the result in $\mathbb{Z}$. If this is not anticipated, it causes dependent operations to produce wrong results. Therefore, scaling is usually used in practical RNS systems for applications such as dot product calculation, filtering, and the fast Fourier transform, in order to reduce the range of the numbers and avoid overflow [47].
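First, let us derive the unsigned scaling equations for when a number $X$ is scaled by $m_1$:

$$S = \left\lfloor \frac{X}{m_1} \right\rfloor = \frac{X - x_1}{m_1} \qquad (32)$$

The computation of $s_i = |S|_{m_i}$ for $i \ne 1$ is directly performed using

$$s_i = \left| (x_i - x_1) \left| m_1^{-1} \right|_{m_i} \right|_{m_i} \qquad (33)$$

In contrast, the computation of $s_1$ is more complex, and depends on the considered moduli set. We handle the formulation of $s_1$ in Sections a) and b). For negative numbers, namely when $\tilde{X} < 0$, (32) has to be adapted as

$$\tilde{S} = \left\lfloor \frac{\tilde{X}}{m_1} \right\rfloor = \left\lfloor \frac{X - m_1 m_2 \cdots m_n}{m_1} \right\rfloor \qquad (34)$$

Since $m_2 m_3 \cdots m_n$ is an integer, (34) can be written as

$$\tilde{S} = \left\lfloor \frac{X}{m_1} \right\rfloor - m_2 m_3 \cdots m_n \qquad (35)$$

By reducing this value modulo $m_1$, we obtain:

$$\tilde{s}_1 = \left| s_1 - m_2 m_3 \cdots m_n \right|_{m_1} \qquad (36)$$

For the other moduli of the set, $m_2 m_3 \cdots m_n$ is a multiple of $m_i$, so we can derive the following unified equation:

$$\tilde{s}_i = s_i = \left| (x_i - x_1) \left| m_1^{-1} \right|_{m_i} \right|_{m_i}, \quad i \ne 1 \qquad (37)$$

We can thus conclude that the formulas for scaling are the same for signed and unsigned numbers, with the exception of the first residue, from which a constant should be subtracted when the considered value is negative. In the following subsections, the computation of $s_1$ and $\tilde{s}_1$ will be handled in detail for each moduli-set class.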
a) SIGNED SCALING FOR c-CLASS MODULI SETS
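For c-class moduli sets, by considering (2), (32) can be adapted as

$$S = \left\lfloor \frac{x_1 + 2^k Y}{2^k} \right\rfloor = Y \qquad (38)$$

Thus, we obtain

$$s_1 = |Y|_{2^k} \qquad (39)$$

Moreover, by definition of the c-class moduli set:

$$m_2 m_3 \cdots m_n = 2^P - 1 \qquad (40)$$

and the value of $\tilde{s}_1$, when $\tilde{X} < 0$, can be computed as

$$\tilde{s}_1 = \left| s_1 - 2^P + 1 \right|_{2^k} \qquad (41)$$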
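Example 5: Consider the same number and moduli set as in Example 1 and Example 3, i.e. $\{2^{2n}, 2^{2n} + 1, 2^n + 1, 2^n - 1\}$ for $n = 2$ ($k = 4$ and $P = 8$) with $M = 4080$, and the RNS number $X = 2043 = (11, 3, 3, 0)$, with $\tilde{X} = -2037$. $Y$ is calculated as 127 using (3). The unsigned scaling can be done using (33) and (39), resulting in $S = (15, 8, 2, 1)$. Since the number is negative, only the first residue has to be adjusted, $\tilde{s}_1 = |15 - 255|_{16} = 0$, which gives $\tilde{S} = (0, 8, 2, 1)$.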
b) SIGNED SCALING FOR a-CLASS MODULI SETS
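We are left with determining the residue $s_1$ of the scaled value of $X$ for a-class moduli sets with the general form $\{2^k, 2^Q \pm 1, 2^n - 1, 2^n + 1\}$. According to the CRT-II equations (15)-(17), we can write

$$X = Z + 2^k (2^Q \pm 1) X_2 \qquad (42)$$

Substituting (42) in (32) while considering $m_1 = 2^k$ as the scale factor results in

$$S = \left\lfloor \frac{Z}{2^k} \right\rfloor + (2^Q \pm 1) X_2 \qquad (43)$$

where $Z$ and $X_2$ are the same as in (16) and (18). Therefore, we have that

$$s_1 = \left| \left\lfloor \frac{Z}{2^k} \right\rfloor + 2^Q X_2 \pm X_2 \right|_{2^k} \qquad (44)$$

When $\tilde{X} < 0$, by considering the form of a-class moduli sets, $\tilde{s}_1$ can be obtained from $s_1$ through (36) as

$$\tilde{s}_1 = \left| s_1 - (2^Q \pm 1)(2^{2n} - 1) \right|_{2^k}$$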
Example 6: Consider the same number and moduli set as in Example 2 and Example 4, i.e. $\{2^n, 2^{2n+1} - 1, 2^n + 1, 2^n - 1\}$ for $n = 2$ ($k = 2$ and $Q = 5$) with $M = 1860$, and the RNS number $X = 1700 = (0, 26, 0, 2)$, with $\tilde{X} = -160$. One can compute the values of $W$ and $Z$ from (16), (17). With this result, according to (33) and (44), the unsigned scaling can be performed as $S = (1, 22, 0, 2)$. However, it is shown in Example 4 that 1700 represents a negative number, and hence the unsigned scaling does not match the signed version. According to (48), by just adjusting the first scaled residue, the correct signed-scaled number can be achieved as

$$\tilde{s}_1 = |1 - 31 \times 15|_4 = 0$$

Therefore, $\tilde{S} = (0, 22, 0, 2)$. To verify the result, we can perform the reverse conversion of this number using (15)-(17). In this case, we obtain 1820, which corresponds to $-40$. This result is correct since $\tilde{X} = -40 \times 4 + 0 = -160$.
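The scaling rule can be checked numerically with the following Python sketch. It assumes that the quotient digit $Y = \lfloor X / m_1 \rfloor$ and the sign are supplied by the reverse-conversion path, as in the examples above. The function name `scale_by_m1` and its argument layout are hypothetical, and `pow(a, -1, m)` requires Python 3.8 or later.

```python
from math import prod


def scale_by_m1(residues, moduli, y, negative):
    """Return the residues of floor(value / m1), where value is signed if negative.

    y        : floor(X / m1) for the unsigned X, taken from the converter
    negative : True when the RNS number encodes a negative value
    """
    m1, x1 = moduli[0], residues[0]
    s1 = y % m1
    if negative:
        s1 = (s1 - prod(moduli[1:])) % m1        # subtract m2 * ... * mn
    # The remaining channels are identical for signed and unsigned inputs.
    others = tuple(((xi - x1) * pow(m1, -1, mi)) % mi
                   for xi, mi in zip(residues[1:], moduli[1:]))
    return (s1,) + others
```

For instance, `scale_by_m1((0, 26, 0, 2), (4, 31, 5, 3), 425, True)` reproduces the signed result of Example 6, while passing `False` gives the unsigned residues.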
IV. DESIGNING COMPONENTS FOR INTER-MODULO OPERATIONS
This section presents the hardware implementation of the methods and equations proposed in Section III. Herein, it is assumed that the constants multiplying the residues in the equations for reverse conversion, namely in (2), (3) and (7) onwards, can be written as the addition or subtraction of a small number of powers of two. This property allows us to compute these multiplications as a sum of a small number of shifted binary representations of the residues. These sums are implemented in units designated ''Operand Preparation Unit''.
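One reason such shifted operands are inexpensive is that, modulo $2^P - 1$, multiplying by a power of two amounts to a circular left shift of the $P$-bit operand, which is pure wiring in hardware. The Python sketch below illustrates this property for the modulus $2^P - 1$ only. The function name `rotl_mod` is hypothetical, and mapping the all-ones pattern to zero is a modelling assumption.

```python
def rotl_mod(x, j, p):
    """Return (2**j * x) mod (2**p - 1) using a p-bit circular left shift."""
    mask = (1 << p) - 1
    j %= p                                     # 2**p is congruent to 1
    r = ((x << j) | (x >> (p - j))) & mask     # rotate the p-bit word left by j
    return 0 if r == mask else r               # all ones also encodes zero
```

For example, with $P = 8$ the inverse of 16 modulo 255 is 16 itself, so a multiplication by that constant reduces to a rotation by four bit positions.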
A. SIGN DETECTOR CIRCUIT FOR c-CLASS SETS
The reverse converter for c-class moduli sets includes only a CSA tree and an End-Around-Carry (EAC) modulo adder [46] to compute (13) and (14). According to Table 3, we only need a processing unit that computes $\overline{X_{2,P-1}} \wedge X_{2,P-2} \wedge \dots \wedge X_{2,0}$, thus detecting when $X_2 = \frac{M_2 - 1}{2}$, and additional OR and AND gates to compute the sign. Figure 2 shows this circuitry, where the blue dashed rectangle denotes the regular reverse converter circuits while the red ones show the additional gates to detect the sign. One of the advantages of this design is that the converter architecture is completely independent of the sign detection gates.
B. SIGN DETECTOR CIRCUITS FOR a-CLASS SETS
As discussed in Section III.A.2), the reverse conversion for a-class moduli sets is more complex than that for c-class moduli sets. First, $Z$ and $W$ are computed according to (16) and (17). When $m_2 = 2^Q - 1$, an EAC CPA is used to compute $Z$, whereas when $m_2 = 2^Q + 1$, a Complement-End-Around-Carry (CEAC) CPA is used instead. Afterward, $X_2$ is computed using a CSA tree followed by a modulo adder. The ''Operand Preparation Unit 3'' is responsible for producing $2^k 2^Q X_2$ and $\pm 2^k X_2$, which, when added together with $X_1$, output $X$.

A negative RNS number is detected by checking if the MSB of $X_2$ is '1', or, if it is not, by checking the conditions

$$X_2 = \frac{M_2 - 1}{2}$$

and

$$X_1 \ge \frac{M_1}{2}$$

The overall structure is shown in Figure 3. The Sign signal in this circuit is computed sooner than the regular output of the reverse converter, since the final binary adder is not needed to derive the sign.
C. SIGNED REVERSE CONVERSION CIRCUITS
After computing the sign from the available reverse converters with small overhead, we can apply it to design reverse converters with signed output. According to (31) and Table 4, the transformation of an unsigned to a signed converter for the c-class moduli sets is performed by incrementing $X_2$ when the RNS number is negative. This operation is performed in Figure 4 by a conditional incrementor circuit that is controlled by the sign bit.
In contrast, Table 4 shows that rectifying the result for a-class moduli sets is more complex. In this case, one needs to add $M_1 + \tilde{C}$ to the result when $\tilde{X} < 0$. We have changed how the ''Operand Preparation Unit 3'' operates for the reverse converter, from Figure 3 to Figure 5, so that this value is considered when Sign is '1'.
D. SIGNED MAGNITUDE COMPARATOR CIRCUITS
This section describes a design for signed comparison in RNS based on the method presented in Section III.D. Here, we unify the overall structure of the signed-magnitude comparison for c-class and a-class moduli sets, since the general signed comparison method for the different moduli sets is the same: after computing the signs, if the signs are different the positive number is greater; otherwise the weighted digits of the numbers should be compared. In particular, the signals for comparing two RNS numbers $A$ and $B$ can be defined as greater (G), lesser (L) and equal (E), which are '1' when $A > B$, $A < B$ and $A = B$, respectively [13]:
The block diagram of the signed comparison unit is depicted in Figure 6. Two alternative methods are considered for the computation of the signs and of the weighted digits. The parallel method in (a) requires two reverse converters with sign detection in order to produce $A_1$, $A_2$, $\text{sign}_A$ and $B_1$, $B_2$, $\text{sign}_B$ in parallel. For the sequential method in (b), the residues of $A$ and $B$ are input in two consecutive clock cycles. While $B_1$, $B_2$ and $\text{sign}_B$ are being computed, the values of $A_1$, $A_2$ and $\text{sign}_A$ are stored in registers. Finally, the comparison logic (c) in Figure 6 implements (46)-(48) using logic gates.
E. SIGNED SCALER CIRCUITS
The scaling operation can be computed for residues $s_i$, $\forall i \ne 1$, with (33), using arithmetic units that are usually available in the channels of RNS processors. Regarding the computation of $s_1$, for c-class moduli sets we have that $s_1 = |Y|_{2^k} = |X_2|_{2^k}$. Therefore, the value of $s_1$ is readily available as the $k$ LSBs of $X_2$ in Figure 7. Moreover, when considering signed scaling, one needs to add $|-m_2 \cdots m_n|_{m_1} = |-2^P + 1|_{2^k}$ to $s_1$ when $\tilde{X} < 0$ to produce $\tilde{s}_1$. Hence, an adder and a multiplexer were added to Figure 7; the multiplexer selects $s_1$ or $|s_1 - 2^P + 1|_{2^k}$ as the value of $\tilde{s}_1$ depending on the Sign bit.
When considering a-class moduli sets, $s_1$ should be computed with (44). In particular, the values of $\lfloor Z/2^k \rfloor$ and $X_2$ are reduced modulo $2^k$ before being fed into the Operand Preparation Unit in Figure 8, which is responsible for computing $|2^Q X_2|_{2^k}$ and $|\pm X_2|_{2^k}$, and passing these values, along with $\lfloor Z/2^k \rfloor$, in carry-save format to the CSA tree that follows. Moreover, for the scaler to handle signed values one needs to add $|-(2^{2n} - 1)(2^Q \pm 1)|_{2^k}$ when Sign = 1, so as to produce $\tilde{s}_1$.
V. MULTIFUNCTIONAL UNIT
In the previous sections, it was explained how individual difficult RNS operations can be implemented by reusing the hardware available for reverse converters. The slightly increased delay of these operations, when compared with that of optimized specific units for each one, should not have a significant impact on practical applications benefiting from RNS, because they are not used very frequently. Therefore, the proposed approach removes redundant hardware that would otherwise result from the implementation of specific circuits for each operation, and results in an optimized and compact multifunctional RNS unit suitable for embedded systems and low-power applications with tight area and power constraints. The general block diagram of a multifunctional unit for sets of $h$ moduli is depicted in Figure 9. The inputs of this unit are the residues $b_1, \dots, b_h$ and $a_1, \dots, a_h$, allowing for single-operand operations, such as signed scaling, and two-operand operations, such as comparison. Note that multiple configurations are possible. For example, designers that do not need comparison can remove its hardware from the multifunctional unit. Furthermore, some configurations may allow a higher level of parallelism to be exploited. For instance, the second reverse converter required for the parallel comparison method may be used to allow for the reverse conversion of two numbers at the same time. The characteristics of the different configurations give rise to problem-oriented architectures, which may change dynamically as algorithms progress [48].
Finally, a Control Unit is in general needed to orchestrate the processing of both the multifunctional and arithmetic units of RNS, as shown in Figure 10. We assume that each modular arithmetic channel holds several registers, allowing for the residues of multiple values to be stored at the same time. Each channel is also connected to the multifunctional unit, so that inter-modulo operations are efficiently processed. It should be noted that independent operations can be issued simultaneously on both the multifunctional unit and the modular arithmetic channels in order to maximize performance.
VI. PERFORMANCE EVALUATION
Since this is the first time a multifunctional unit is proposed in the literature, the performance of the components proposed in Section IV is first evaluated individually, and afterwards compared with state-of-the-art operators dedicated to specific moduli sets.
We focus on the unsigned reverse converters presented in [35] and [39] for the c-class and a-class moduli sets $\{2^{2n}, 2^n - 1, 2^n + 1, 2^{2n} + 1\}$ and $\{2^n, 2^n - 1, 2^n + 1, 2^{2n+1} - 1\}$, respectively. These converters were used to build signed parallel comparators, signed scalers, sign identifiers and signed converters, as described in Section IV. We have included the whole converter in the designs, even when not strictly necessary (e.g. in Figure 5, the final stages of the converter are not used for sign identification), since the main application of the proposed architectures is within a multifunctional unit, where the converter is shared by all operations.
The proposed and state-of-the-art circuits were described in synthesizable VHDL and their functionality was thoroughly tested. A well-known library of arithmetic units [49], also described in synthesizable VHDL, was used. This library contains a structural specification of components, namely optimized prefix adders, which were employed to describe and implement the operators. Experimental results were obtained for ASICs using a TSMC standard cell library tailored for the 65 nm CMOS LOGIC General Purpose Plus technology [50]. Finally, power results were obtained from the placed-and-routed circuit specifications for a switching activity of 20%. The Cadence Encounter built-in power reporting tool was employed, and the total power was measured, including the dynamic and leakage power.
We have condensed in Figure 11 the main experimental results, from a large set of obtained results, for the proposed systems. The DR in the x-axis represents an approximate value of the dynamic range in bits. The system for signed scaling comprises solely the computation ofs 1 (cf. (36)) since we have assumed the remaining residues to be computed in the Modular Arithmetic Channels in Figure 10 .
As expected from Figure 6 , the area of the parallel comparator is approximately twice that of the unsigned reverse converter. In Figure 11 , it is plotted an increase of at most 137% and 100% for the c-class and a-class moduli sets, respectively (DR=126-bit). For the remaining operations, it is clear that for both the considered c-class and a-class moduli sets the area overhead with respect to the unsigned converter is not significant. In particular, the relative increase is maximum for the signed converter, with values of up to 27% and 18% for the c-class and a-class moduli-sets, respectively.
Since c-class moduli sets have characteristics making reverse conversion very efficient, this is the operation that performs best in terms of delay when one takes a global view across all considered dynamic ranges. Regarding a-class moduli sets, since one is able to obtain positional representations faster than the traditional binary, certain operations achieve a lower delay than the unsigned reverse converter. Sign identification achieves a maximum reduction of 25% when compared with the unsigned converter. This makes evident that the proposed system is much better performant than a naïve approach for sign identification, which would consist of converting the number to a binary representation and afterwards comparing it with M /2. Moreover, this gain is obtained even when considering an extensible architecture, that underpins other hard RNS operations.
Concerning power consumption, one can see that the power dedicated to sign identification is the lowest of all the considered operations, whereas for the comparator the power is 116% and 152% larger than that of an unsigned reverse converter, for c-class and a-class moduli sets, respectively. Finally, for a-class moduli sets, the signed scaler benefits especially from the simple final k-bit adder in Figure 8 when compared with the more complex architectures required for other operations.
A. COMPARISON WITH RELATED ART
In this section, the efficiency of two representative operations of the multifunctional unit is evaluated in relation to dedicated state-of-the-art architectures. Concretely, the proposed signed scaler is compared with the unsigned scaler in [20], and the proposed signed comparator is evaluated against the unsigned comparator proposed in [51]. Since the techniques presented herein are general, we have limited our comparison to architectures that are applicable to both of the previously considered moduli sets, though more specific designs could also be considered [9], [13].
In [20], novel architectures were proposed that support the computation of $\lfloor X/2^n \rfloor$ for a value $X$ in RNS for moduli-sets of the form $\{2^{n+x}, 2^n - 1, 2^n + 1, m_4\}$. First, the value of $|X|_{2^{n+x}(2^{2n}-1)}$ is computed as
where $|X|_{2^n}$ is obtained as the $n$ LSBs of the residue $|X|_{2^{n+x}}$. Finally, one obtains
It should be noted that when $x = n$, $m_4 = 2^{2n} + 1$, and when
− 1 the moduli-sets in [20] match the ones in the previous section.
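As a behavioural reference for what a $2^n$ scaler has to produce, the following Python sketch may help. It is not the architecture of [20]: it assumes the scaled value is the floor $\lfloor X/2^n \rfloor$ of an unsigned $X$, and it relies only on the standard identity $\lfloor X/2^n \rfloor = (X - |X|_{2^n})/2^n$, which allows the scaled residue in each odd channel to be obtained with a modular inverse of $2^n$. The function names and the example values ($n = 3$, $x = n$, $m_4 = 2^{2n} + 1$) are illustrative assumptions.

```python
def scale_reference(x, n):
    """Integer reference model: floor(x / 2^n) for an unsigned x."""
    return x >> n

def low_bits(residue, n):
    """|X|_{2^n}, read as the n least significant bits of |X|_{2^(n+x)}."""
    return residue & ((1 << n) - 1)

def scaled_odd_channel(r_i, m_i, x_low, n):
    """Residue of floor(X / 2^n) modulo an odd modulus m_i.

    Uses (X - |X|_{2^n}) / 2^n, an exact division, so multiplying by the
    inverse of 2^n modulo m_i gives the scaled residue.
    """
    inv = pow(1 << n, -1, m_i)  # exists because m_i is odd
    return ((r_i - x_low) * inv) % m_i

# Example with n = 3 and x = n: moduli {2^6, 2^3 - 1, 2^3 + 1, 2^6 + 1}
n = 3
moduli = [64, 7, 9, 65]
X = 12345
residues = [X % m for m in moduli]

x_low = low_bits(residues[0], n)
Y = scale_reference(X, n)
scaled = [scaled_odd_channel(r, m, x_low, n)
          for r, m in zip(residues[1:], moduli[1:])]
print(scaled == [Y % m for m in moduli[1:]])
```

The residue of the scaled value in the power-of-two channel cannot be obtained this way, since $2^n$ has no inverse modulo $2^{n+x}$; that is the part that requires inter-modulo processing.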
Moreover, we have used the function proposed in [51] to compute the value $F_I(X) = \lfloor X/2^{2n} \rfloor$ for the moduli-set
which corresponds to the most significant digit of the representation $X = 2^{2n} F_I(X) + |X|_{2^{2n}}$. Since a positional representation is available, the strategy described in Section D can be applied to compare two unsigned integers. A similar reasoning can be applied to the moduli-set $\{2^n, 2^n - 1,$
Figure 12 and Figure 13 depict how the performance of the proposed units compares with that of related art, using the Area-Delay (AD) product as a figure of merit, for the two aforementioned moduli-sets. The denomination c-class is used for $\{2^{2n}, 2^n - 1, 2^n + 1, 2^{2n} + 1\}$ and a-class for
− 1}. A significant increase in the AD product can be observed in Figure 13 for the c-class comparator when comparing the proposed system with related art. This is due not only to the overhead associated with exploiting reusable components, but also to the fact that the proposed comparator takes into account the sign of the numbers, which increases the complexity of the circuits, whereas [51] operates on unsigned integers. Nevertheless, a relatively small overhead is obtained for the remaining operations and the other moduli-set. In fact, the AD product was reduced by 4% for the a-class comparator for a DR of 36 bits. In addition, in Figure 12, the AD overhead of the proposed scaler is always low for the a-class moduli set. Moreover, the difference grows only for large DRs, reaching 9.3% for a DR of 126 bits. The complete set of experimental results can be found in the Appendix.
FIGURE 14. Energy ratio of the proposed multifunctional unit and a combination of [20] and [51], for a program spending a fraction δ of its execution time on hard RNS operations when executing on [20] and [51].
B. COMPLETE MULTIFUNCTIONAL UNIT
Experimental results for the area and delay of a configuration of the proposed multifunctional unit (including a scaler and a comparator) and of a multifunctional unit exploiting [20], [51] can be found in the Appendix. Savings of up to 17% can be obtained for the circuit area. This can have a significant financial impact on the production of chips, especially for the mass production of ASICs, and when considering energy consumption. Assuming the power consumed by the original circuit with [20], [51] to be proportional to the circuit area:
The energy $E_o$ consumed by the multifunctional unit when an application is running is hence proportional to the product of $A_o$ and the time $t_o$ taken to execute it. If a fraction δ of the running time of the application is spent executing hard RNS operations on the original circuit, the same application will take $t_p$ to execute with the proposed multifunctional unit:
If we further assume the conservative case where the multifunctional unit consumes energy even when it is not executing instructions, the energy it consumes for the whole duration of the application is
The resulting energy ratio has been plotted in Figure 14 as a function of δ, for the c-class and a-class moduli sets. One can thus observe that the proposed multifunctional unit will result in energy savings for sufficiently low δ for c-class moduli sets, and will always result in energy savings for a-class moduli sets, since both the area and the delay of the proposed units have been reduced when compared with related art. Summing up, the proposed multifunctional unit provides for a more effective use of circuit area, making it more suitable for developing constrained applications supported on RNS.
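The energy argument above can be explored numerically with a minimal Python sketch. It is a model under stated assumptions rather than the exact expressions of this section: power is taken as proportional to area for both circuits, the time spent outside hard RNS operations is unchanged, and the time spent in hard RNS operations scales with the delay ratio of the two units. The names `area_ratio` and `delay_ratio` (proposed over original) and the numeric values used below are hypothetical.

```python
def energy_ratio(delta, area_ratio, delay_ratio):
    """Energy of the proposed unit relative to the original one.

    delta       -- fraction of the original run time spent on hard RNS operations
    area_ratio  -- proposed area / original area (power assumed proportional to area)
    delay_ratio -- proposed delay / original delay for the hard RNS operations
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    # Only the hard-RNS share of the run time is stretched or shortened.
    time_ratio = (1.0 - delta) + delta * delay_ratio
    # Conservative case: the unit draws power for the whole run time.
    return area_ratio * time_ratio

def break_even_delta(area_ratio, delay_ratio):
    """Value of delta at which the energy ratio equals 1, or None if there is none in [0, 1]."""
    if delay_ratio == 1.0:
        return None
    delta = (1.0 / area_ratio - 1.0) / (delay_ratio - 1.0)
    return delta if 0.0 <= delta <= 1.0 else None

# Illustrative values only: a 17% smaller area and a 50% slower hard-RNS path.
for delta in (0.0, 0.2, 0.4, 0.6):
    print(delta, energy_ratio(delta, 0.83, 1.5))
print(break_even_delta(0.83, 1.5))
```

Under this model, a unit that is both smaller and faster (both ratios below 1) saves energy for every δ, while a smaller but slower unit saves energy only below the break-even δ, which matches the qualitative behaviour described for the a-class and c-class moduli sets.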
VII. CONCLUSIONS
This paper presents a novel, practical and efficient approach for designing RNS-based processors based on a multifunctional unit. This unit provides reusable hardware components to perform difficult RNS operations such as sign detection, magnitude comparison, signed reverse conversion and scaling, leading to significant area savings, with a reasonable delay overhead, when compared with dedicated architectures of the related art. Moreover, the applications that benefit the most from RNS consist mostly of additions and multiplications. It was shown that, in this case, one can further obtain a significant reduction in energy consumption. Thus, the proposed multifunctional unit improves the applicability of RNS systems to architectures with tight area and power constraints.
APPENDIX EXTENSIVE EVALUATION IN COMPARISON WITH RELATED ART
See 
