Abstract-Novel very large-scale integration architectures and a design methodology for adder-based residue number system (RNS) processors are presented in this paper. The new architectures compute residues for more than one modulus either serially or in parallel, and their use can increase the resource utilization in a processor. Complexity is reduced by sharing common intermediate results among the various RNS moduli channels and/or operations that compose an RNS processor. The presented architectures are divided into two subtypes, depending on whether the interchannel parallelism is preserved or not. The multifunction architecture paradigm is demonstrated by its application in residue multiplication, binary-to-residue conversion, quadratic RNS (QRNS) mapping, and base extension. The derived architectures are compared to previously reported equivalent ones and are found to be efficient in the area × time product sense. Finally, the proposed design methodology reveals a new tradeoff in residue processor design, leading to more efficient RNS processors.
I. INTRODUCTION
Among alternative number representations, the residue number system (RNS) [1], [2] has attracted a lot of interest since it features highly parallel carry-free addition and multiplication and borrow-free subtraction. Furthermore, RNS derivatives such as quadratic RNS (QRNS) [3] provide efficient complex number multiplication by reducing the number of required real multiplications. Various very large-scale integration (VLSI) architectures have been proposed for RNS operations; they rely on memory table lookup, combinatorial logic, or a combination of both [4].
Recently, a variety of adder-based architectures have been proposed in the literature [5]-[7]. Stouraitis et al. [7] presented full adder-based single-modulus architectures for RNS multiply-and-accumulate operations. DiClaudio et al. introduced the pseudo-RNS representation, which enables building reprogrammable modulus multipliers, allows for systolization, and simplifies the computation of DSP algorithms such as FIR filtering, correlations, and DFT [5]. Stouraitis et al. [7] focus on the optimization of a single-modulus operator rather than a system to function in a multimodulus context, while the pseudo-RNS architectures [5] are more flexible. This paper focuses on adder-based architectures that require less hardware than previously reported designs, can perform more than one function, and increase the utilization of resources in an RNS processor.
In particular, a systematic way of eliminating hardware redundancy among the different moduli channels and functions in adder-based RNS architectures is presented. The systematic redundancy removal leads to area × time product efficient architectures, the performance of which is detailed in Section VII. The resulting novel RNS architectures can perform multiplication, multiplication-addition, and conversions to/from RNS, binary, and RNS derivatives. No restriction is imposed on the moduli selection by the proposed design methodology. Furthermore, a technique is introduced which simplifies the hardware by exploiting the fact that particular input combinations do not occur.
The derived architectures form a new category called multifunction architectures to reflect their capability of performing more than one operation for one or more moduli. We refer to as multimodulus the special case of multifunction architectures which perform the same function for more than one modulus, either serially or concurrently. Two types of multifunction architectures may be distinguished. In particular, fixed multifunction architectures (FMA's) preserve interchannel parallelism, while variable multifunction architectures (VMA's) collapse several channels into a residue-serial unit.
FMA's perform multiple-moduli channel operations simultaneously. These architectures can replace a group of conventional single-modulus processors, achieving hardware reduction. In addition, the FMA concept suggests the incorporation of an arithmetic operation into a binary-to-RNS converter. For example, architectures that provide the RNS representation of a product from the non-RNS-encoded operands of the multiplication are possible.
VMA's produce a residue with respect to a modulus that can be selected from a set of prespecified moduli during operation. VMA's achieve greater hardware savings than FMA's, but only at the cost of decreased parallelism. The modulus of operation can assume any of the values in the set by suitably rearranging the input bits through a multiplexer network. VMA's can be used in serial-by-modulus (SBM) RNS architectures [8] or variations, where not all moduli channels are processed in parallel.
The proposed design methodology reveals a relation between converters and multipliers. A practical consequence of this relation is the design alternative of omitting a dedicated binary-to-residue converter from the system. Since algorithms amenable to RNS implementation should be computationally bound rather than I/O bound, omitting an underutilized hardware component leads to better utilization of silicon resources.
The remainder of the paper is organized as follows: Section II contains some RNS basics and a description of adder-based RNS architectures. Section III presents the proposed design methodology and the new architectures.
Section IV describes the multiplier/converter relation as a consequence of the VMA design methodology. Area × time product complexity models are derived in Section V, while applications to QRNS processing and base extension are discussed in Section VI. Area × time performance comparisons are discussed in Section VII. Finally, some conclusions are offered in Section VIII.
II. RNS AND ADDER-BASED RNS ARCHITECTURES
In RNS, each operand is encoded as a vector of residues with respect to the members of a set of relatively prime integers, called the base. Hence, an integer is transformed to a set of smaller integers that can be processed in parallel [1].
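The encoding described above can be tried out with a minimal sketch. The base (3, 5, 7) and the helper names `to_rns`, `rns_op`, and `from_rns` are assumptions made for illustration and do not come from the paper; decoding uses the Chinese remainder theorem.

```python
from math import prod
from operator import add, mul, sub

BASE = (3, 5, 7)  # hypothetical base of pairwise relatively prime moduli

def to_rns(x, base=BASE):
    """Encode an integer as its vector of residues."""
    return tuple(x % m for m in base)

def rns_op(a, b, op, base=BASE):
    """Apply op channel by channel; no channel uses data from another one."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, base))

def from_rns(r, base=BASE):
    """Decode through the Chinese remainder theorem (Python 3.8+)."""
    M = prod(base)
    return sum(ri * (M // m) * pow(M // m, -1, m) for ri, m in zip(r, base)) % M

product_residues = rns_op(to_rns(17), to_rns(4), mul)
print(product_residues, from_rns(product_residues))
```

Each channel works on operands no wider than its own modulus, which is what allows the channels to run in parallel without exchanging carries.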
Let $X$, $Y$, and $Z$ be integers and let $\{m_1, m_2, \ldots, m_N\}$ be a base. All integers $X \in [0, M)$ have a unique RNS representation as follows:

$$X \xrightarrow{\mathrm{RNS}} (x_1, x_2, \ldots, x_N) \tag{1}$$

$$x_i = \langle X \rangle_{m_i} \tag{2}$$

where $M = \prod_{i=1}^{N} m_i$ for $i = 1, 2, \ldots, N$ and $\langle \cdot \rangle_{m_i}$ denotes the operation modulo $m_i$. The operations of multiplication, addition, and subtraction are performed as described by

$$Z = X \circ Y \xrightarrow{\mathrm{RNS}} (z_1, z_2, \ldots, z_N), \qquad z_i = \langle x_i \circ y_i \rangle_{m_i} \tag{3}$$

where $i = 1, 2, \ldots, N$ and $\circ$ denotes any of the aforementioned operations. RNS is a carry-free number system in the sense that $z_i$ can be computed with no information from the tuples $(x_j, y_j)$, $j \neq i$.

In general, adder-based RNS architectures can be composed of three stages described by the sets $S_j^{(k)}$ of input bits that contribute to the $j$th output bit of a stage or of its cascaded parts, called recursions [7], [9]. The index $k = 0$ refers to the output of the first stage and $k \geq 1$ refers to the output of the $k$th recursion of the second stage. The function of each stage and the specification of the sets $S_j^{(k)}$ are demonstrated through an example. Assume a modulo-5 multiplier, which implements

$$\langle XY \rangle_5 = \left\langle \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} x_i y_j \left\langle 2^{i+j} \right\rangle_5 \right\rangle_5 \tag{4}$$

where $X$ and $Y$ are $n$-bit integers, and $x_i$ and $y_j$ denote the $i$th and $j$th bit of $X$ and $Y$, respectively. The output digits to which the input bit $x_i y_j$ contributes, and hence the sets $S_j^{(0)}$ to which it belongs, can be found by writing $\langle 2^{i+j} \rangle_5$ in binary form. The first stage of the architecture implements the nested summations of (4), while the second and third ones perform the outer modulo operation of the right-hand side of (4). In particular, the second stage iteratively maps its inputs to a number having the same residue when divided by $m$, but with a word length equal to that of the selected modulus, $\lceil \log_2 m \rceil$. Finally, the third stage maps its input to the correct result through a conditional addition. In particular, since the input of the third stage is less than $2m$, $-m$ is added to it if it is not less than $m$. The described technique can be applied to determine $S_j^{(k)}$ for each of the recursions of the second stage [9].

A similar design procedure can be applied to a class of bit-level algorithms, which include comparison, multiply/accumulate operations, and a variety of conversion schemes [10]. The algorithms addressed in this paper, such as multiplication, multiplication-addition, and conversions, rely on sum-of-products operations. In order to completely describe such an algorithm, it suffices to identify the input bits that directly (not through carry propagation) contribute to each output bit of the result of the first stage and each recursion of the second one. Hence, such an algorithm, executed by the first stage and each recursion of the second one, can be described by a flag bit sequence (FBS) $F^{(k)} = \{f_{i,j}^{(k)}\}$, where each $f_{i,j}^{(k)}$ denotes whether the $i$th input bit contributes or not to the $j$th result bit during the $k$th recursion. An $S$-based description is equivalent to an FBS-based one, since

$$S_j^{(k)} = \left\{ b_i^{(k)} : f_{i,j}^{(k)} = 1 \right\} \tag{5}$$

where $b_i^{(k)}$ denotes an input bit of the $k$th recursion of weight $2^i$. In the definition of the first stage of a multiplier, the partial product $x_i y_j$ of (4), which has a weight of $2^{i+j}$, replaces $b_i^{(k)}$ in (5). Having determined the input bits that contribute to a particular digital position, an architecture consisting of 1-bit full adders (FA's) and half adders (HA's) can be derived using existing methodologies [9], [11, pp. 108-109].
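The bit allocation of the modulo-5 multiplier example can be sketched in a few lines. This is only an illustration of the first stage under the stated rule (write the residue of each partial-product weight in binary and route the partial product to every column that holds a 1); the function names `column_sets` and `first_stage` are hypothetical.

```python
from itertools import product

def column_sets(n, m):
    """Map each output column k to the partial products (i, j) feeding it."""
    width = (m - 1).bit_length()          # word length of a residue modulo m
    cols = {k: [] for k in range(width)}
    for i, j in product(range(n), repeat=2):
        w = pow(2, i + j, m)              # residue of the weight 2^(i+j)
        for k in range(width):
            if (w >> k) & 1:
                cols[k].append((i, j))
    return cols

def first_stage(x, y, n, m):
    """Add every partial product x_i AND y_j into the columns it feeds."""
    total = 0
    for k, pairs in column_sets(n, m).items():
        for i, j in pairs:
            total += (((x >> i) & 1) & ((y >> j) & 1)) << k
    return total

print(column_sets(3, 5))
```

The value returned by `first_stage` is generally wider than the modulus, but it has the same residue as the product; the second and third stages described above are what reduce it to the final result.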
III. DESIGN OF MULTIFUNCTION RNS ARCHITECTURES
In this section, the conventional single-modulus architecture design procedure is generalized to embrace multifunction architectures. For simplicity the discussion is restricted to the case of two moduli. However, the design approach can be extended to more than two moduli and/or functions, as demonstrated in Appendix A.
Let the FBS's and describe the th recursion of the modulusand modulusalgorithm, respectively, where and is , where and denote the word length of the input of the th recursion for the moduli and , respectively, and . Also, let denote the FBS which describes the common computations for both residues and and be the FBS's that describe the particular computations for each modulus. All input bits for which contribute to both moduli channels; hence, the sequence defines the computations that produce a shareable intermediate result.
The proposed methodology is based on computing these FBS's as dictated by the following propositions. 
Similarly, for it holds that (10) Both (9) and (10) contain the term . The common term is added to the partial results and to produce the results modulo and respectively. Equations (9) and (10) describe the functionality of an FMA and dictate the organization of it as it will be shown in Section III-A.
Considering the VMA, assume that is a sequence to be produced by the th recursion modulo . Then, as in (9), it can be derived that (11) Let be a sequence to be processed by the th recursion modulo . Then (12) The terms and while not equal, can be computed by a common part since they are characterized by the FBS hence, the structure for the computation of the terms does not depend on the modulus of operation. The first terms of the sums in (11) and (12), which contain and can be computed by a multiplexed part, as it will be discussed in Section III-B.
The input bits can be grouped into sets according to the values of the common and particular flag bits, as given in (13). Hence, the bits of the common set contribute to both residues modulo $m_1$ and modulo $m_2$, while the bits of the two particular sets contribute to the residues modulo $m_1$ only and modulo $m_2$ only, respectively. As in the single-modulus case, the allocation of input bits to digital positions through these sets defines the desired architecture [9], [11].
It is not necessary that the two FBS's describe the same algorithm with regard to different moduli. In fact, any pair of bit-level algorithms, each of which can be described by such sequences, can be mapped onto a two-function RNS adder-based architecture. The extended design procedure presented in this section is used next to produce the two subtypes of multifunction architectures.
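The split into common and particular computations can be illustrated for a binary-to-residue converter. In this sketch an FBS is modeled, as an assumption, by the set of (input bit, output column) pairs whose flag is 1; the names `fbs`, `split`, and `fma_first_stage` are hypothetical, and the moduli (37, 41) with a 10-bit input are borrowed from the converter example of Section III.

```python
def fbs(n, m):
    """Pairs (i, k): input bit i feeds output column k for modulus m."""
    return {(i, k) for i in range(n)
            for k in range((m - 1).bit_length())
            if (pow(2, i, m) >> k) & 1}

def split(n, m1, m2):
    """Common flags, flags particular to m1, flags particular to m2."""
    f1, f2 = fbs(n, m1), fbs(n, m2)
    return f1 & f2, f1 - f2, f2 - f1

def partial_sum(x, flags):
    """Add the selected input bits of x with their column weights."""
    return sum(((x >> i) & 1) << k for i, k in flags)

def fma_first_stage(x, n, m1, m2):
    """The shared intermediate result is computed once and reused twice."""
    common, only1, only2 = split(n, m1, m2)
    shared = partial_sum(x, common)
    return shared + partial_sum(x, only1), shared + partial_sum(x, only2)

common, only1, only2 = split(10, 37, 41)
print(len(common), len(only1), len(only2))
```

The larger the common set is relative to the particular sets, the more adder hardware the two channels can share.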
A. The Fixed Multifunction Architectures
The proposed fixed and variable multifunction architectures comprise three stages and, in the case of multiplication, AND gates that compute the partial products in a preliminary step of the first stage, as seen in Fig. 1. The various units of the first and second stage perform multiple-operand addition; hence, they can be implemented in a variety of ways, including carry-save adder arrays [12], Wallace trees [13], [14], etc. Finally, the third stage of both VMA's and FMA's contains a conditional corrective addition.
In FMA's, the proposed approach is applicable to the first stage only. The first stage contains three parts, namely a common part for adding the bits shared by both moduli, a second part for adding the bits particular to the first modulus and the result of the common part, and a third part for adding the bits particular to the second modulus and the result of the common part. In FMA's, the bits of the common part result should be considered during the input bit allocation. Hence, the number of bits that contribute to a particular digital position in each modulus channel should be increased by one if the common part produces a bit of the particular weight. The proposed approach is not applicable to the decomposition (second) stage because the inputs to the second stage of each modulus channel are not equal. Hence, FMA's comprise a common first stage for all moduli and individual moduli channels for the remaining stages. The first stage of a 10-bit FMA binary-to-residue converter for the moduli (37, 41) is shown in Fig. 2.
B. The Variable Multifunction Architectures
VMA's operate on one modulus from a prespecified moduli set at any given time, as shown in Fig. 1. Each stage of a VMA comprises a common part to perform operations for all moduli and a part which handles the remainder of the input bits for either modulus-$m_1$ or modulus-$m_2$ operation. Unlike FMA's, VMA's extend the elimination of redundant hardware to the second stage by exploiting the input bit sets of each recursion. VMA's comprise a part for adding the common words and a part for the noncommon words. The latter takes the proper input for an operation according to the desired modulus. VMA's require multiplexers to determine the modulus of operation, i.e., to load the appropriate input words to the various stages and to alter the interconnection between the stages of the architecture. Hence, a reconfigurable network should interconnect the stages of a VMA.
Since the bits allocated to the $j$th digital position for all moduli belong to the common set, that set defines the constant links, i.e., edges that are utilized in both operation modes. On the other hand, the bits particular to one modulus should be multiplexed with those particular to the other, since the inputs to the noncommon units depend on the modulus of operation; these bits define the conditional links. The implementation of a conditional link requires a multiplexer. Since the two particular sets are mutually exclusive, their contents are multiplexed to be processed by the $j$th column. The inputs of the multiplexers are bit pairs that belong to the Cartesian product (15) where the operator $|\cdot|$ denotes the number of elements of the set on which it is applied. In case one of the two sets has a larger number of elements than the other, multiplexers should be used for replacing the extra bits with zeros when operating at the modulus that corresponds to the set of fewer elements. Hence, a number of (16) multiplexers are needed to control the $j$th digital position of the $k$th array. Therefore, the overall architecture requires (17) multiplexers. The first stage of a 10-bit binary-to-residue VMA converter for the moduli (37, 41) is shown in Fig. 2.
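A behavioral sketch of the multiplexed first stage follows. It assumes, based only on the zero-padding remark above, that each column needs as many multiplexers as the larger of its two noncommon bit sets; that reading of the multiplexer count is an assumption, and the names `flags`, `vma_first_stage`, and `mux_count` are hypothetical.

```python
def flags(n, m):
    """Pairs (i, k): input bit i feeds output column k for modulus m."""
    return {(i, k) for i in range(n)
            for k in range((m - 1).bit_length())
            if (pow(2, i, m) >> k) & 1}

def vma_first_stage(x, n, m1, m2, select):
    """select = 0 operates modulo m1, select = 1 operates modulo m2."""
    f1, f2 = flags(n, m1), flags(n, m2)
    common = f1 & f2                                # constant links
    muxed = (f2 - f1) if select else (f1 - f2)      # conditional links
    return sum(((x >> i) & 1) << k for i, k in common | muxed)

def mux_count(n, m1, m2):
    """Assumed count: per column, the size of the larger noncommon set."""
    f1, f2 = flags(n, m1), flags(n, m2)
    only1, only2 = f1 - f2, f2 - f1
    columns = {k for _, k in only1 | only2}
    return sum(max(sum(1 for _, c in only1 if c == k),
                   sum(1 for _, c in only2 if c == k)) for k in columns)

print(mux_count(10, 37, 41))
```

Only one modulus is served per evaluation, which is the loss of parallelism that VMA's trade for the smaller adder array.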
IV. THE MULTIPLIER AS A BINARY-TO-RESIDUE CONVERTER
As a consequence of the multimodulus and, in particular, of the VMA design approach the following proposition can be stated, for the first stage of a multiplier and a converter:
Proposition 4: The first stage of every -bit modulus multiplier can function as the first stage of a -bit binary-to-residue converter.
Proof: The proof comes as a direct consequence of the VMA design principle. The first stage of the multiplication operation is described by (18) while the function of a -bit binary-to-residue converter is given by (19) Assume that the partial product sequence is replaced by a bit sequence according to the rule otherwise.
If the change of variables (20) is performed in (18), we obtain (19) . Hence, a multiplier can be used as a converter if the input assignments defined by (20) are followed.
Q.E.D.
The multiplier and the converter of Proposition 4 can share the same second and third stages. Proposition 4 is of great practical importance, since it points out that the same hardware resources can be used for both multiplication and binary-to-residue conversion in a residue processor. The efficiency of this resource sharing depends on the scheduling of the executed algorithm, the input word length, and the modulus of operation.
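The idea behind Proposition 4 can be checked numerically with a small model. The sketch assumes that the partial products of an n-bit multiplier cover the weights 2^0 through 2^(2n-2), so a word of 2n-1 bits can be converted by driving one partial-product input per weight with an input bit and forcing the remaining inputs to zero. The particular choice of which input carries each bit, and all function names, are assumptions of this sketch rather than the rule of the paper.

```python
def multiplier_first_stage(pp, n, m):
    """First stage: add partial-product inputs weighted by <2^(i+j)> mod m."""
    return sum(pp[i][j] * pow(2, i + j, m) for i in range(n) for j in range(n))

def as_multiplier(x, y, n, m):
    """Normal use: the inputs are the AND terms x_i y_j."""
    pp = [[((x >> i) & 1) & ((y >> j) & 1) for j in range(n)] for i in range(n)]
    return multiplier_first_stage(pp, n, m)

def as_converter(b, n, m):
    """Converter use: bit k of b drives one input of weight 2^k, others are 0."""
    pp = [[0] * n for _ in range(n)]
    for k in range(2 * n - 1):
        i = min(k, n - 1)            # hypothetical choice of representative
        pp[i][k - i] = (b >> k) & 1
    return multiplier_first_stage(pp, n, m)

print(as_multiplier(11, 7, 4, 13) % 13, as_converter(100, 4, 13) % 13)
```

In both modes the same adder structure is exercised; only the input assignment changes, which is why the later stages can be shared.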
V. COMPLEXITY OF MULTIFUNCTION ARCHITECTURES

A. VMA Hardware Complexity
The derivation of a VMA is a hardware complexity reduction procedure perceived as the minimization of the number of bits to be added at the th column. The overall hardware complexity is minimized by exploiting the fact that particular input combinations do not occur, such as-in the case of the VMA multiplier-the words where and is the maximum of the moduli of operation or, in the case of the second-stage recursions for all kinds of architectures, the words where is the maximum input value. Let be an input to the th recursion. Due to input combinations which do not occur, there exist couples or triplets of input bits , assigned to the th column, i.e., or the summation of which does not produce a carry, as there are no two bits among them asserted simultaneously for any legitimate input value . Equivalently, for the particular triplets, it holds that (21) while for the couples it is (22) where denotes an input word to the th recursion and is the set of all legitimate input values. When the first stage of a multiplier is considered, an input is a pair of integers and the input bits to the first stage are the bit products such that and . For an unsigned -bit binary-to-residue converter, the inputs are members of (23) An input word to the th recursion of the second stage for both multipliers and converters is in the set (24) where is the maximum output of the th recursion or of the corresponding first stage.
Let and denote the set of input bits required to be added at the th column for the computation of the result modulo modulo and the set of bits assigned to the th digital position for both moduli, respectively. Also let and contain the bits contributing to the th column, which are not included in . According to (21) and (22), the triplets or couples of bits and contained in are organized into the sets and defined as (25)
To produce the correct result, all input bits of and those of or should be added at the th column, and no bit should be added more than once. To meet the particular constraint, the input bits are partitioned into sets of disjoint triplets or couples and a set which collects the remainder of the bits not included in any of the triplets or couples. Specifically, several sets can be defined, which contain disjoint triplets and couples of and (27) while the remainder of the bits, which are not included in the triplets and couples of form the set
From the several possible and sets that can be formed, the ones that minimize the number of bits added in the th column, given as the sum (29) are adopted. Similarly, the bits of are organized into couples and triplets which form the sets and in turn used to define the sets and in accordance to and described by (27) and (28), respectively. The number of multiplexed bits to be added to the th column is (30)
The number of input bits added at the th digital position is the sum of (29) and (30) (31)
The bits added at the th column are the carries from the th column, input bits from the or sets, and bits produced by 2-or 3-input OR gates, each operating on a couple or triple of and . An OR gate suffices to add the bits of the particular couples or triplets and to save hardware in the th column as it replaces a larger full or half adder. The replacement is feasible because, as two or more bits cannot be simultaneously asserted by the definition of and carry generation logic is not required and it is omitted; for the same reason the XOR gates traditionally employed to generate a sum digit [14] , are reduced to OR gates. In addition, hardware is saved in the subsequent more significant columns, since the number of carries generated at the th column is reduced and, therefore, fewer bits need to be processed. The number of FA's and HA's which form the th column are (32) (33) where denotes the number of input carries. Since each FA and HA produces a carry bit, the number of carries to the next more significant digital position, i.e., the st column, is computed as with . The numbers of two-and three-input OR gates at the th column are
The organization of a column is shown in Fig. 3 .
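The replacement of adders by OR gates can be illustrated by searching for couples of bits that are never asserted together. In this sketch the legitimate inputs are assumed to be the integers 0 to 20 and the modulus is 5, so that input bits 0, 3, and 4 land in column 0 (their weights have residues 1, 3, and 1); both the bound and the helper name `exclusive_couples` are hypothetical.

```python
from itertools import combinations

def exclusive_couples(bits, legit):
    """Couples of bit positions never asserted together for any legal input."""
    return [(a, b) for a, b in combinations(bits, 2)
            if all(not ((x >> a) & (x >> b) & 1) for x in legit)]

legit = range(21)                 # assumed maximum input value of 20
column0 = [0, 3, 4]               # input bits whose residue modulo 5 is odd
couples = exclusive_couples(column0, legit)
print(couples)

# For an exclusive couple the OR of the two bits equals their arithmetic sum,
# so no carry can be generated and an OR gate can stand in for an adder.
a, b = couples[0]
assert all((((x >> a) & 1) | ((x >> b) & 1)) == ((x >> a) & 1) + ((x >> b) & 1)
           for x in legit)
```

Bits 3 and 4 would both be set only for inputs of 24 or more, which the assumed bound rules out.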
The total hardware complexity of a recursion is obtained through (36) where is the word length of the largest number produced at the particular recursion, and and denote the complexity of an FA, an HA, and an OR gate of two and three inputs, respectively. To obtain the complete hardware complexity, the complexity of all recursions is added to the multiplexer complexity given by (17)
where each recursion complexity is computed as described by (36). To quantify the area complexity of the architecture, the various building blocks are normalized to the complexity of an FA, assuming that an FA requires 28 transistors, an HA is implemented with 12 transistors, a multiplexer requires 4 transistors, and a 2-input and a 3-input OR gate require 6 and 8 transistors, respectively [15].
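The normalization to FA units follows directly from the quoted transistor counts. A minimal sketch, with a hypothetical helper `area_in_fa` and made-up component counts in the example:

```python
from fractions import Fraction

# Transistor counts quoted in the text [15].
TRANSISTORS = {"FA": 28, "HA": 12, "MUX": 4, "OR2": 6, "OR3": 8}

def area_in_fa(counts):
    """Total area of a component mix, expressed in full-adder equivalents."""
    return sum(Fraction(TRANSISTORS[name] * n, TRANSISTORS["FA"])
               for name, n in counts.items())

# Illustrative mix only: 2 FA's, 7 HA's, and 7 multiplexers.
print(area_in_fa({"FA": 2, "HA": 7, "MUX": 7}))
```

With these counts an HA weighs 3/7 of an FA, a multiplexer 1/7, and the 2-input and 3-input OR gates 3/14 and 2/7, respectively.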
B. VMA Time Complexity
Let , , be an input word to the th recursion. The th recursion computes as
where and compose the related FBS . Assume that the delays after which each input bit is available, form the sequence
The derivation of the delay sequence of the output bits as obtained from is described. The sequences and are related in a similar manner. The delays of the bits processed by the th column of the adder array, are organized into two sequences and for modulo and operation, respectively. The sequence is defined as (40) where (41) The sequence for modulo operation is defined similarly. The delay of the carry and sum output of the th adder in the th column, is computed from (42) where is the delay of a 1-bit FA and and are the delays of the inputs to the th adder, which can be any of the sum bit of the adder above with a delay if any, carry bits from the th column of delay and any of the input bits the delays of which are contained in . The constraint reflects the carry-save organization of the array. It should be stressed that the column delay is minimized by assigning the input bits of longer delay to adders located low in the th column.
The delay of the th column is and the delays of all columns which compose the th recursion form the output delay sequence described as
The delay of the complete VMA is the maximum value in the delay sequence of the final ( th) recursion, . A reasonable throughput rate equals (44) to take into consideration the OR logic for the processing of the sets. In a pipelined VMA, the 1-bit signal required to control the multiplexers between the recursions should be kept synchronized with the data, for example through a line of delay elements the number of which equals the corresponding number of pipeline stages.
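The remark that late-arriving bits should enter adders low in the column can be checked on a simplified model. The model is an assumption of this sketch: a column is a chain of FA's in which the first adder takes three input bits, every later adder takes the sum bit from above plus two further input bits, and each adder adds one unit of delay. Carries from neighboring columns are ignored.

```python
from itertools import permutations

def column_delay(delays, t_fa=1):
    """Delay at the bottom of a chain of full adders fed in the given order."""
    t = max(delays[:3]) + t_fa                 # topmost adder
    for k in range(3, len(delays), 2):         # each later adder
        t = max([t] + list(delays[k:k + 2])) + t_fa
    return t

arrivals = [5, 0, 2, 0, 5, 2, 0]               # hypothetical input delays
best = min(column_delay(list(p)) for p in permutations(arrivals))
print(column_delay(sorted(arrivals)), best)
```

Feeding the inputs in ascending order of arrival reaches the smallest delay found by exhaustive search for this example, whereas the reverse order costs two more adder delays.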
C. FMA Hardware Complexity
The first stage of a two-modulus FMA is composed of three parts, namely the common part and two individual parts, one for each modulus of operation. The sets and are formed as in the VMA case. The numbers of FA's and HA's in the th column of the common part, are derived from the sets 
D. FMA Time Complexity
The delay of the FMA first stage is computed in a manner similar to the VMA case. Three adder arrays compose the first stage of the FMA, specifically two individual modulus arrays and the common array. The delay of each column of the common part is computed as in the VMA case, taking into consideration that the inputs are bits from and ; hence not all cases in (41) occur. As there is no need for multiplexers, (41) is simplified for the FMA case as follows: (50) The sum and carry delays of the th adder in the th column is computed by (42), considering that the delays of the inputs to the column are given from (50). The delay of the th column of each of the individual parts of an FMA is given as the delay of the lowest 1-bit adder in the th column (51) where and are the delays of the inputs to the adder, one of which is the delay of the th bit of the output of the common part.
The delay of each second-stage recursion is computed as in the corresponding VMA case; the only difference is that .
VI. MULTIFUNCTION ARCHITECTURE APPLICATIONS
A. Multifunction Architectures for QRNS Processing
The multifunction architecture paradigm is applicable to any modulus, including prime moduli of the form $4k + 1$; therefore, QRNS processing is supported. In this section the input bit assignment sets for the forward/inverse QRNS mappers and a combined binary-to-residue QRNS converter are derived. Notice that having defined the input bit assignment sets, the corresponding multifunction architectures can be derived as described in Section III. The forward QRNS conversion maps the complex RNS (CRNS) couple of the residues and of the real and imaginary part to the QRNS pair [2] as described by (52) (53) Each of the conversions (52) and (53) is a modular multiply-add operation, expressed as (54) (55) where , , and , , . The conversion (54) can be written as (56) (57) where , . The input bit assignment set for the th column of the first stage of the QRNS evaluator is (58) while for the case of the QRNS evaluator of (55), the corresponding bit set is (59) The inverse QRNS mapping which retrieves the CRNS representation from the QRNS couple [2] is described by the set of equations (60) (61) The inverse QRNS mapping (60) can be written as (62) where and . A similar relation can be derived for (61) by replacing with where The input bit assignment set for the th column in the first stage of the inverse QRNS converter modulo contains the bits and for which it holds that and is given by (63) Finally, the sets and are used to derive the multifunction QRNS converters as described in Section V. The multifunction approach in the forward QRNS conversion is applicable in several ways. In particular, an FMA forward QRNS converter produces in parallel the QRNS values from the CRNS inputs for a particular modulus of operation, while a VMA forward QRNS converter produces serially the values from the input values for one or more moduli of operation.
Similarly, multifunction QRNS inverse converters are distinguished into FMA structures, which produce in parallel the CRNS couple from the QRNS inputs for a particular modulus of operation, and VMA structures, which produce the CRNS couple from the QRNS inputs for one or more moduli of operation. Furthermore, the QRNS mapping can be embedded in an $n$-bit binary-to-residue converter.
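For orientation, the standard QRNS mapping can be written down in a few lines. This sketch follows the textbook definition for a prime modulus of the form 4k + 1 and is not claimed to match the notation of the conversions referenced above; the names `find_j`, `forward`, `inverse`, and `qrns_mul` are hypothetical, and the modulus 13 is an arbitrary choice.

```python
def find_j(p):
    """Smallest j with j*j = -1 (mod p); one exists for primes p = 4k + 1."""
    return next(j for j in range(2, p) if (j * j + 1) % p == 0)

def forward(a, b, p, j):
    """Map the CRNS couple (real a, imaginary b) to the QRNS pair."""
    return (a + j * b) % p, (a - j * b) % p

def inverse(z, zs, p, j):
    """Recover the CRNS couple from the QRNS pair (Python 3.8+)."""
    a = (z + zs) * pow(2, -1, p) % p
    b = (z - zs) * pow(2 * j, -1, p) % p
    return a, b

def qrns_mul(u, v, p):
    """Complex multiplication becomes two independent real multiplications."""
    return u[0] * v[0] % p, u[1] * v[1] % p

p = 13
j = find_j(p)
print(j, inverse(*qrns_mul(forward(3, 4, p, j), forward(2, 5, p, j), p), p, j))
```

Both mappings are multiply-add operations modulo p, which is why they fit the same sum-of-products description as the multipliers and converters.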
B. Multifunction Architectures for Base Extension
The multifunction architecture paradigm is also applicable to base extension procedures. Shenoy and Kumaresan [16] have proposed a base extension algorithm based on the expression of CRT as (64) where are the moduli of the base, are the corresponding residues, is an integer, and . From (64), it follows that (65) and (66) where is a redundant modulus and is the modulus to which the base is extended. By using the VMA approach, the computation of the residues of the quantity (67) modulo and modulo can be combined. Let . Then is expressed as (68) where , and . Notice the application of the distributed arithmetic concept. Quantity is computed for and . The FBS's and are exploited to derive the corresponding two-modulus VMA. Assume the numerical example of [16], where the moduli base is the redundant modulus is and a residue modulo . The VMA for the evaluation of this quantity requires approximately a total of 47 $A_{FA}$ and a delay of 30 $t_{FA}$ and 12 $t_{FA}$ for the operation modulo 17 and 8, respectively, where $A_{FA}$ and $t_{FA}$ denote the area and the delay of an FA. The complete base extension architecture comprises two modulo adders, a lookup table for the evaluation of and hardware for the evaluation of , and has a complexity of 77 $A_{FA}$ and a maximum delay of 38 $t_{FA}$. The original implementation has a complexity equivalent to 125 $A_{FA}$ and a delay of 27 $t_{FA}$. The area × time product of the VMA-based architecture is 2926 $A_{FA} t_{FA}$, which corresponds to 13% savings over the area × time product of 3375 $A_{FA} t_{FA}$ exhibited by the original implementation.
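A behavioral sketch of base extension with a redundant modulus is given below, written from the commonly stated form of the algorithm in [16] rather than from the exact expressions of this section. The base (3, 5, 7), the redundant modulus 4, the target modulus 11, and the function name are assumptions. The same weighted sum is reduced modulo two different moduli, which is the computation a two-modulus VMA would share.

```python
from math import prod

def base_extend(residues, base, x_r, m_r, m_e):
    """Residue of X modulo m_e from its residues in base and modulo m_r."""
    M = prod(base)
    Mi = [M // m for m in base]
    xi = [(x * pow(Mh, -1, m)) % m for x, Mh, m in zip(residues, Mi, base)]
    # The same sum of products, reduced modulo m_r and modulo m_e.
    s_r = sum(Mh * t for Mh, t in zip(Mi, xi)) % m_r
    s_e = sum(Mh * t for Mh, t in zip(Mi, xi)) % m_e
    # Number of multiples of M to remove, found through the redundant channel.
    alpha = (pow(M, -1, m_r) * (s_r - x_r)) % m_r
    return (s_e - alpha * M) % m_e

X = 97
print(base_extend([X % 3, X % 5, X % 7], (3, 5, 7), X % 4, 4, 11), X % 11)
```

The redundant modulus must be relatively prime to the base moduli and at least as large as their number for the correction term to be recovered uniquely. The figures quoted above are consistent with 77 × 38 = 2926 and 125 × 27 = 3375.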
VII. PERFORMANCE OF MULTIFUNCTION ARCHITECTURES
A. VMA Multipliers and Comparisons
The area complexity of the pseudo-RNS multiplier [5] equals the complexity of 2.5 $n$-bit multipliers and 2 $n$-bit adders. The delay equals that of an $n$-bit multiplier and 2 $n$-bit adders, which implement a modulo adder. Assuming that the pseudo-RNS area complexity is and the corresponding delay is the complexity of the VMA multipliers is compared to the first stage of the equivalent pseudo-RNS ones. In certain cases, the VMA multiplier requires less hardware than the pseudo-RNS multiplier; however, the delay is larger. There exist particular cases where the area × time product is in favor of the VMA approach. Fig. 4 depicts the area × time complexity of 4-bit and 5-bit VMA multipliers in comparison to the pseudo-RNS ones. In an area × time sense, Fig. 4 shows that there are several cases where the proposed multipliers are more efficient. The VMA multipliers are less flexible than the pseudo-RNS ones because they operate for one of the prespecified moduli and not for all odd moduli of a particular word length.
The benefit from adopting the modulus replication RNS (MRRNS) is the exploitation of the low complexity of small-modulus residue multipliers [17]. The reduced switching-tree multiplier modulo 7 presented in [17] consumes 528 µm × 622 µm in a 3-µm CMOS technology or approximately 40 FA's in a 0.8-µm CMOS technology, assuming that an FA requires an area of 51 µm × 38 µm [18]. The operating frequency of the particular multiplier, when implemented in a 0.8-µm CMOS technology, is anticipated to increase from 40 to 150 MHz, i.e., the delay corresponds to that of 2 FA's. The VMA area complexity for 3-bit moduli spans the range of 8.05 to 11.49 FA's. However, the VMA's are 3.75 to 5.25 times slower. The 3-bit two-modulus VMA area × time products span a range of 60.4 to 120.7, while the area × time product of the reduced table multiplier modulo 7 is 80, all in units of FA area × FA delay. For example, the (5, 7) VMA demonstrates a performance of 98.3. Also, the VMA's are more flexible as they can operate for two moduli. The multimodulus structures can be utilized in the MRRNS context.
A two-modulus VMA multiplier can be compared to a modulo-$m_1$ and a modulo-$m_2$ submodular index transform multiplier operating in parallel. The delay required by the VMA multiplier to compute a modulo-$m_1$ and a modulo-$m_2$ product is the sum of the individual corresponding delays, as the operation is serial. The delay of the two submodular multipliers equals the maximum of the two individual delays. Assuming that each bit of a lookup table corresponds to the complexity of one gate [5] and an FA consumes eight gates, the area complexity of the submodular index transform multiplier [19] can be converted to an equivalent number of FA's. Similarly, the delay of a lookup table is assumed to equal the delay of [5]. The area savings achieved by the 5-bit VMA span 58% to 69%, while in the 6-bit case, the area savings are 69% to 78%. However, the VMA delay is 2.33 to 4.2 times and 2.45 to 4.2 times the submodular delay for the 5-bit and 6-bit cases, respectively. The area × time performance comparison of 5- and 6-bit VMA multipliers with a system of two submodular multipliers operating in parallel is shown in Fig. 5. It is found that for the 6-bit cases, VMA multipliers are more efficient for several moduli combinations. It should be noted that the multifunction architectures can be applied for any modulus of operation and their applicability is not restricted to prime moduli only.
B. Multifunction Converters and Comparisons
The benefits of the proposed design methodology are demonstrated by evaluating the complexities of FMA and VMA binary-to-residue converters for various input word lengths and for various moduli. The mean area-time savings percentage (MSP) is used as a performance measure for the two-modulus FMA and VMA $n$-bit binary-to-residue converters, and for prime moduli of various word lengths. MSP is computed as follows. Let $S$ denote any of the possible ordered sequences of all the nonordered couples of $k$-bit prime moduli. The choice of a particular $S$ does not affect the final results. Initially, the hardware complexity, $C_i$, of the two-modulus $n$-bit binary-to-residue converter for the $i$th element of $S$ is computed. The corresponding hardware savings percentage is then computed as

$$SP_i = \frac{T_i - C_i}{T_i} \times 100\%$$

where $T_i$ is the area-time complexity of a traditional converter, which consists of two independent $n$-bit single-modulus converters, one for each member of the $i$th element of $S$. $T_i$ is computed by adding the area complexities of the two corresponding single-modulus $n$-bit converters and multiplying the sum by the maximum delay of the two. Subsequently, MSP is computed as the mean value of all $SP_i$ by

$$\mathrm{MSP} = \frac{1}{N} \sum_{i=1}^{N} SP_i$$

where $N$ is the number of $k$-bit prime moduli couples. Fig. 6 shows the area and delay complexity of two-modulus FMA binary-to-residue converters for 5-bit prime moduli couples and various input word lengths. It is shown that the FMA's are both smaller and faster than a combination of two converters [10] .
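The MSP procedure described above can be sketched as follows. This is a minimal illustration: the three cost callbacks (`multifunction_at`, `single_area`, `single_delay`) are hypothetical placeholders for whatever complexity model is used, and the prime enumeration by trial division is merely a convenient assumption for small word lengths.

```python
from itertools import combinations
from statistics import mean


def k_bit_primes(k):
    """Primes p with 2**(k-1) <= p < 2**k, found by trial division."""
    def is_prime(p):
        return p > 1 and all(p % d for d in range(2, int(p ** 0.5) + 1))
    return [p for p in range(2 ** (k - 1), 2 ** k) if is_prime(p)]


def mean_savings_percentage(k, multifunction_at, single_area, single_delay):
    """Mean area-time savings over all unordered couples of k-bit prime moduli.

    multifunction_at(m1, m2): area-time complexity of the two-modulus converter.
    single_area(m), single_delay(m): area and delay of a single-modulus converter.
    All three are assumed, user-supplied cost models.
    """
    savings = []
    for m1, m2 in combinations(k_bit_primes(k), 2):
        # Reference: two independent converters, areas added, slower delay taken.
        reference = (single_area(m1) + single_area(m2)) * max(
            single_delay(m1), single_delay(m2))
        savings.append(100.0 * (reference - multifunction_at(m1, m2)) / reference)
    return mean(savings)
```

Because the mean is taken over all couples, the order in which they are visited is irrelevant, which agrees with the remark that the choice of a particular ordered sequence does not affect the result. A negative return value indicates that, on average, the multifunction converter is more complex than the two independent converters.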
For FMA binary-to-residue converters of small input word length, and in a few cases, the design methodology may lead to an architecture whose parts are mainly composed of HA's. This happens when the partitioning of the computation leads to parts that process only a few input bits but contain relatively long carry propagation chains. In this case, the large number of HA's leads to no hardware savings or even to an increase in hardware complexity, and negative MSP values appear. The VMA converters require less hardware than the combination of two independent converters, as shown in Fig. 7. The maximum VMA delay of computing one residue and the delay of computing both residues are also shown in Fig. 7. The FMA area-time product savings are depicted in Fig. 8. As the moduli and the input word length increase, the savings reach more than 40%. The performance of the VMA converters is shown in Fig. 8, where the savings percentage exceeds 15%.
VIII. CONCLUSION
The main feature of the derived multifunction adder-based architectures is their ability to execute more than one bit-level algorithm. Multimodulus architectures perform arithmetic modulo several different moduli. In addition, the introduced architectures achieve hardware complexity savings when replacing several single-modulus architectures. Two different types of multimodulus architectures have been presented. FMA's incorporate an operation into binary-to-residue conversion and achieve further hardware savings by omitting unnecessary operations. VMA's achieve increased hardware savings, when compared to FMA's, at the cost of reduced parallelism. The proposed technique, which is based on the grouping of input bits into digital positions, has been demonstrated by its application to various RNS operations. Its main concept is the exploitation of the common subsets of the sets of input bits assigned to particular columns in architectures performing different RNS operations or identical operations for different moduli. The proposed design methodology provides the means for exploiting a new tradeoff in residue processor design: the choice of the appropriate multifunction components that achieve minimal area-time complexity for a particular RNS processor. The improved area-time performance and the increased utilization of resources achieved by employing the proposition of Section IV may prove very useful in building high-performance, low-cost, special-purpose DSP's.
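The notion of common subsets of input bits assigned to a column can be illustrated for binary-to-residue conversion. The sketch below is a simplified model and not the paper's partitioning procedure: it assumes that input bit i, of weight 2**i mod m, is assigned to every column j in which that weight has a 1, and it reports the bits that two moduli place in the same column. All function names are hypothetical.

```python
def column_sets(n_bits, modulus):
    """Map column j to the set of input-bit indices i such that
    bit j of (2**i mod modulus) is 1."""
    width = (modulus - 1).bit_length()
    cols = {j: set() for j in range(width)}
    for i in range(n_bits):
        weight = pow(2, i, modulus)
        for j in range(width):
            if (weight >> j) & 1:
                cols[j].add(i)
    return cols


def residue_from_columns(x, n_bits, modulus):
    """Recompute x mod modulus from the column sums, as a consistency check."""
    total = 0
    for j, bits in column_sets(n_bits, modulus).items():
        total += sum((x >> i) & 1 for i in bits) << j
    return total % modulus


def shared_bits(n_bits, m1, m2):
    """Per column, the input bits that both moduli channels assign to it."""
    c1, c2 = column_sets(n_bits, m1), column_sets(n_bits, m2)
    return {j: c1[j] & c2[j] for j in c1.keys() & c2.keys()}
```

Under this model, the sum of the bits in a shared subset is an intermediate result that both channels need, so it is a candidate for being computed once. For example, `shared_bits(8, 29, 31)` lists, for each column, the input bits of an 8-bit word that the modulo 29 and modulo 31 channels have in common.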
APPENDIX A
Assume the design of an -modulus processor following the proposed approach. Shareable partial results can be found among all -tuples, of moduli channels. Let be a set of numbers, such that If represents the modulus channel, each of the possible defines an -tuple of moduli channels, among which a shareable partial result can be found. Proof: As in Proposition 1 of Section III. Q.E.D.
A tradeoff that can be exploited in the design of the multimodulus processor is the selection of the degree of partial result sharing. In some cases, it may be beneficial to limit the partial result sharing, especially in the case of small input word lengths. By decomposing into sums other than (72), it is possible to avoid sharing all possible common results and derive a variety of architectures for a multimodulus processor. while an architecture which utilizes only the part common to all three moduli channels requires 46.44 .
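The degree-of-sharing tradeoff can be explored by enumerating the tuples of moduli channels and the part common to each. The following sketch assumes the same simplified bit-to-column model as a binary-to-residue converter whose input bit i carries weight 2**i mod m; the helper names and the three example moduli are hypothetical.

```python
from itertools import combinations


def weight_bits(n_bits, modulus):
    """Set of (input bit i, column j) pairs with bit j of (2**i mod modulus) equal to 1."""
    return {(i, j)
            for i in range(n_bits)
            for j in range((modulus - 1).bit_length())
            if (pow(2, i, modulus) >> j) & 1}


def common_parts(n_bits, moduli):
    """For every tuple of two or more channels, the (bit, column) pairs they all share."""
    parts = {}
    for size in range(2, len(moduli) + 1):
        for group in combinations(moduli, size):
            parts[group] = set.intersection(
                *(weight_bits(n_bits, m) for m in group))
    return parts
```

Calling `common_parts(8, (23, 29, 31))` yields one entry per couple and one for the triple. A designer may share only the triple's part, or additionally share the larger parts common to individual couples; which choice gives the lower complexity depends on the cost model and is not decided by this sketch.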
