Abstract: Digital signal processing and multimedia applications often profit from the use of a residue number system. Among the most commonly used moduli in such systems are those of the $2^n - 1$ and $2^n + 1$ forms, and among the most commonly used operations are multiplication and sum-of-squares. These operations are currently performed using distinct design units and/or consecutive machine cycles. Novel architectures are proposed for combined units that perform modulo $2^n - 1$ or diminished-1 modulo $2^n + 1$ multiplication or sum-of-squares, depending on the value of a control signal.
Introduction
Non-positional residue number systems (RNS) have been proposed as an attractive alternative to the binary system for applications whose arithmetic operations are limited to addition, subtraction, multiplication and squaring [1]. An RNS is defined by a set of $F$ moduli, say $\{m_1, m_2, \ldots, m_F\}$, that are pairwise relatively prime.
Assuming that $|A|_M$ denotes the modulo $M$ of $A$, that is, the least non-negative remainder of the division of $A$ by $M$, an integer $A$ has a unique representation in the RNS, given by the residue set $\{a_1, a_2, \ldots, a_F\}$, where $a_i = |A|_{m_i}$. An operation $\circ$ on two operands $A$ and $B$ is carried out residue by residue, $z_i = |a_i \circ b_i|_{m_i}$. That is, the computation of $z_i$ depends only on $a_i$, $b_i$ and $m_i$. Therefore each $z_i$ can be computed in parallel in a separate arithmetic unit, often called a channel. Since each channel deals with small residues instead of wide numbers and all channels operate in parallel without any carry propagation from one to another, significant speed-up over the binary system may be achieved. Several digital signal processors (DSPs) that adopt an RNS have been reported [2, 3]. RNS adoption has proved very useful in enhancing the performance of modulation components used in communications [4] and in digital filter design [5], in which it can help reduce the power consumption when the number of taps used in the filter increases [6]. Therefore there is an increasing interest in efficient modulo $m_i$ arithmetic components. Very efficient architectures for adders [7, 8], multipliers [9, 10] and squarers [11] have appeared in the open literature when $m_i$ is of the $2^n - 1$ and $2^n + 1$ forms. It is therefore no surprise that the RNS based on the set $\{2^n, 2^n - 1, 2^n + 1\}$ has received significant attention and is the most commonly used.
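The channel independence described above can be illustrated with a minimal Python sketch. The function names (`to_rns`, `rns_mul`, `rns_sum_of_squares`) and the operand values are illustrative assumptions and do not come from the paper; the sketch only shows that each residue of the result is obtained from the residues of the operands in the same channel.

```python
from math import prod

def to_rns(a, moduli):
    """Return the residue of a in every channel (one residue per modulus)."""
    return [a % m for m in moduli]

def rns_mul(x, y, moduli):
    """Channel-wise multiplication: z_i depends only on x_i, y_i and m_i."""
    return [(xi * yi) % m for xi, yi, m in zip(x, y, moduli)]

def rns_sum_of_squares(x, y, moduli):
    """Channel-wise sum-of-squares, again without any inter-channel carries."""
    return [(xi * xi + yi * yi) % m for xi, yi, m in zip(x, y, moduli)]

n = 4
moduli = [2**n, 2**n - 1, 2**n + 1]   # the set {16, 15, 17}, pairwise coprime
M = prod(moduli)                       # number of distinct representable values
A, B = 37, 52                          # arbitrary example operands (assumed)

product_rns = rns_mul(to_rns(A, moduli), to_rns(B, moduli), moduli)
ss_rns = rns_sum_of_squares(to_rns(A, moduli), to_rns(B, moduli), moduli)
```

Both results agree with the residues of the ordinary integer results, which is what allows the three channels to be built as independent arithmetic units.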
In an RNS that uses a moduli set of the $\{2^n, 2^n - 1, 2^n + 1\}$ form, however, the modulo $2^n + 1$ channel deals with operands that are 1 bit wider than the operands of the other two channels. To overcome this problem, and given that in the case of a zero operand the result can be derived straightforwardly in all the above operations, Leibowitz [12] introduced the diminished-1 representation, in which each number is represented decreased by 1 modulo $2^n + 1$ and all arithmetic operations are inhibited for a zero operand. This representation has the great advantage that $n$ bits are used for representing all operands. Its only disadvantage is that converters between the diminished-1 and the weighted representation are required. Since the use of an RNS is attractive in applications where a series of operations take place before a new conversion is required, the time spent on conversions is a very small portion of the total computation time. Efficient adders [13, 14], multipliers [15] and squarers [16] are available in the open literature for the diminished-1 number system.
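A minimal sketch of the diminished-1 encoding follows. The helper names `to_dim1` and `from_dim1` are hypothetical; the code only illustrates that every nonzero residue modulo $2^n + 1$ fits in $n$ bits once it is decreased by 1, whereas the weighted value $2^n$ needs $n + 1$ bits. Zero operands are assumed to be detected and handled outside these helpers.

```python
def to_dim1(x, n):
    """Diminished-1 code of a nonzero residue x in [1, 2**n]: simply x - 1."""
    if not 1 <= x <= 2**n:
        raise ValueError("zero (or out-of-range) operands are handled separately")
    return x - 1

def from_dim1(x_d1):
    """Convert a diminished-1 code back to the weighted representation."""
    return x_d1 + 1

n = 4
# All nonzero residues modulo 2**n + 1 and their diminished-1 codes.
codes = [to_dim1(x, n) for x in range(1, 2**n + 1)]
widest_weighted = (2**n).bit_length()      # bits needed for the value 2**n
widest_dim1 = max(codes).bit_length()      # bits needed for the largest code
```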
Another operation often found in many DSP and multimedia algorithms is the sum-of-squares ($A^2 + B^2$) computation. This is commonly used in Euclidean branch calculation [17], image compression [18], pattern recognition [19], motion estimation in computer vision systems, Viterbi convolution codes [20], channel equalisation [21], filtering [22] and statistics. For performing the sum-of-squares operation, a multiplier-adder or a squarer-adder pair of units can be used. This approach, however, requires a significant delay for implementing the sum-of-squares function as well as the use of intermediate storage memory. Significant savings may result from having a distinct sum-of-squares unit. The latter solution, though, significantly increases the implementation area.
This paper proposes a solution that stands between these two extremes. For the first time in the open literature, we introduce novel units that can perform either modulo multiplication or modulo sum-of-squares depending on the value of a control signal, for moduli of both the $2^n - 1$ and the $2^n + 1$ forms. We will hereafter denote these units as MMSSU$_{-1}$ and MMSSU$_{+1}$, respectively. In the case of the modulo $2^n + 1$ arithmetic, we consider that the operands follow the diminished-1 representation. As we will show, these units can be built with small overhead on the area and delay of a corresponding modulo multiplier. On the other hand, they are capable of providing significant savings in the delay of a modulo sum-of-squares calculation, compared to the case in which the calculation is performed by two rounds of modulo multiplication and one modulo addition.
Proposed MMSSU$_{-1}$ architectures
Suppose that $A$ and $B$, with $A = a_{n-1} a_{n-2} \cdots a_1 a_0$ and $B = b_{n-1} b_{n-2} \cdots b_1 b_0$, are two $n$-bit numbers in modulo $2^n - 1$ representation. Since $A = \sum_{i=0}^{n-1} a_i 2^i$ and $B = \sum_{j=0}^{n-1} b_j 2^j$, for their modulo $2^n - 1$ multiplication, we obtain
where PP i denotes a partial product and
According to Wang et al. [9] , for a number
Therefore we can express the above partial products directly from the bits of $A$ and $B$ in $n$ bits, if we move the partial product bits with $i + j \ge n$ to the $|n + i + j|_n$ bit position. For example, in the case that $n = 8$ the partial products shown in the left-hand side of Table 1 can be derived. For the modulo $2^n - 1$ reduction of the products into two final addends, a full adder (FA) based tree architecture with end-around-carry (EAC) can be used. Consider that $c_n$ is the carry output at the most significant bit position of some stage $i$ in the reduction scheme. Since $c_n$ has a weight of $2^n$ and $|c_n 2^n|_{2^n-1} = |c_n|_{2^n-1}$, $c_n$ can be simply added at the least significant bit position of the next stage, forming an end-around-carry modulo adder tree architecture. A parallel modulo $2^n - 1$ adder [7] can finally be used for obtaining the modulo product.
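The wrap-around of partial product bits and the end-around carry can be checked with a small behavioural sketch. It is not the FA tree of the paper: the partial products are accumulated serially with a hypothetical `eac_add` helper, which is an assumption made to keep the example short. Each partial product is simply $B$ rotated left by $i$ positions, because a bit of weight $2^{i+j}$ with $i + j \ge n$ re-enters at position $|i + j|_n$.

```python
def rotl(x, k, n):
    """Rotate the n-bit word x left by k positions."""
    k %= n
    mask = (1 << n) - 1
    return ((x << k) | (x >> (n - k))) & mask

def eac_add(x, y, n):
    """n-bit addition whose carry out (weight 2**n) re-enters at the LSB."""
    mask = (1 << n) - 1
    s = x + y
    return (s & mask) + (s >> n)

def mul_mod_2n_minus_1(a, b, n):
    """Modulo 2**n - 1 product from rotated partial products (serial model)."""
    acc = 0
    for i in range(n):
        if (a >> i) & 1:
            acc = eac_add(acc, rotl(b, i, n), n)   # PP_i expressed in n bits
    mask = (1 << n) - 1
    return 0 if acc == mask else acc               # all 1s is a second code for zero
```

For example, `mul_mod_2n_minus_1(a, b, 8)` can be compared against `(a * b) % 255` for any pair of 8-bit residues.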
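For the squares of $A$ and $B$, we have that

$A^2 = \sum_{i=0}^{n-1} a_i 2^{2i} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} a_i a_j 2^{i+j+1}$ and $B^2 = \sum_{i=0}^{n-1} b_i 2^{2i} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} b_i b_j 2^{i+j+1}$

Therefore their sum-of-squares is

$A^2 + B^2 = \sum_{i=0}^{n-1} (a_i + b_i) 2^{2i} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} (a_i a_j + b_i b_j) 2^{i+j+1}$   (2)

The single-bit additions indicated by the term $\sum_{i=0}^{n-1} (a_i + b_i) 2^{2i}$ of (2) can be substituted by the corresponding sum and carry terms, leading to ($\oplus$ is used to denote the exclusive-OR logic function)

$A^2 + B^2 = \sum_{i=0}^{n-1} (a_i \oplus b_i) 2^{2i} + \sum_{i=0}^{n-1} a_i b_i 2^{2i+1} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} (a_i a_j + b_i b_j) 2^{i+j+1}$   (3)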
Therefore the partial products for the modulo $2^n - 1$ sum-of-squares operation can be derived by applying (1) to the terms indicated in (3). For example, in the $n = 8$ case, the 10 partial products indicated in the right-hand side of Table 1 are derived. In the general case, $n + 2$ ($n + 1$) partial products are derived when $n$ is even (odd). Again, an FA-based tree architecture with EAC can be used to reduce the partial products into two final addends. A modulo $2^n - 1$ parallel adder can then provide the sum-of-squares.
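The partial product counts quoted above can be reproduced with a short sketch. It assumes the sum-of-squares bits are the sum bits $a_i \oplus b_i$ at weight $2^{2i}$, the carry bits $a_i b_i$ at weight $2^{2i+1}$ and the cross terms $a_i a_j$, $b_i b_j$ ($i < j$) at weight $2^{i+j+1}$, each wrapped to its position modulo $n$. It further assumes that bits of equal weight can be packed densely, so that the tallest column gives the number of partial product rows. The function names are illustrative.

```python
from collections import Counter

def ss_terms(a, b, n):
    """(column, bit) pairs of the modulo 2**n - 1 sum-of-squares bit matrix."""
    A = [(a >> i) & 1 for i in range(n)]
    B = [(b >> i) & 1 for i in range(n)]
    terms = []
    for i in range(n):
        terms.append(((2 * i) % n, A[i] ^ B[i]))        # sum bit of a_i + b_i
        terms.append(((2 * i + 1) % n, A[i] & B[i]))    # carry bit of a_i + b_i
        for j in range(i + 1, n):
            col = (i + j + 1) % n                       # wrapped weight 2**(i+j+1)
            terms.append((col, A[i] & A[j]))
            terms.append((col, B[i] & B[j]))
    return terms

def rows_needed(n):
    """Height of the tallest column, i.e. the number of partial product rows."""
    heights = Counter(col for col, _ in ss_terms(0, 0, n))
    return max(heights.values())

def sum_of_squares_mod(a, b, n):
    """Evaluate the bit matrix and reduce it modulo 2**n - 1."""
    return sum(bit << col for col, bit in ss_terms(a, b, n)) % (2**n - 1)
```

Under these assumptions `rows_needed(8)` gives the 10 rows of the $n = 8$ example, and the matrix holds $n^2 + n$ bits in total.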
From the above analysis, it is obvious that the modulo $2^n - 1$ multipliers and sum-of-squares units have a very similar structure, composed of a partial product generation subcircuit, a modulo partial product reduction unit and a final parallel adder. In the following two sections, we propose two architectures for unifying these designs into a single module.
Simple MMSSU$_{-1}$ architecture
Let $m$ denote the operation select signal of the MMSSU, that is, when $m$ is 0 the MMSSU performs multiplication on its input operands, whereas in the $m = 1$ case, the sum-of-squares of the input operands is generated. A straightforward solution to obtain an MMSSU$_{-1}$ is to add two all-0s partial products to the original multiplication matrix, so that both multiplication and sum-of-squares matrices have the same dimensions, then use a multiplexer for every bit that differs between them and finally use the reduction tree architecture of the sum-of-squares function followed by a parallel adder to obtain the final result. This solution forms our simple MMSSU$_{-1}$ architecture.
The simple architecture requires in the worst case:
• $n$ XOR gates for forming the partial product bits of the form $a_i \oplus b_i$.
• $2\binom{n}{2}$ AND gates for generating the partial product bits of the form $a_i a_j$ or $b_i b_j$.
• $n^2$ AND gates for the generation of the partial product bits of the form $a_i b_j$.
• $n^2$ multiplexers for selecting the correct partial product bits according to the desired operation. Multiplexers are not required for the bits that are zero in both matrices.
• $n$ AND gates that perform the logic function $m(a_i \oplus b_i)$ and provide the remaining $n$ partial product bits that are not driven to multiplexers.
• An FA-based partial product reduction tree that provides the final two summands. We consider that this tree follows a Dadda [23] architecture. Since we use as basis the sum-of-squares partial product reduction scheme, the Dadda tree is presented with $n + 2$ partial products. It is well known that the depth in FA stages of a Dadda tree is a function, say $D(k)$, of its number of operands $k$ (see Table 3 of Vergos and Efstathiou [16] for all practical values). The number of FAs required is $n^2$.
• A parallel modulo $2^n - 1$ adder.
Obviously, the critical path of the simple MMSSU$_{-1}$ architecture starts at the input of an XOR gate and passes through an AND gate, the adder tree architecture and the parallel adder. Different architectures are often compared using the unit-gate model [24] approximations. Under this model, all 2-input monotonic gates count as one gate equivalent for both area and delay, whereas a 2-input XOR or XNOR gate or a $2 \to 1$ multiplexer counts as 2 gate equivalents for both area and delay. Using this model, we can compute that the area ($A_{\mathrm{Simple},2^n-1}$) and delay ($T_{\mathrm{Simple},2^n-1}$) of the simple MMSSU$_{-1}$ architecture are given by (we assume the parallel-prefix implementation of [7] for the final adder)
Reduced area MMSSU$_{-1}$ architecture
This section addresses the large number of multiplexers required in the simple MMSSU$_{-1}$ architecture. To this end, we define the following variables, for $0 \le i \le n - 1$
where $\lor$ denotes the logic OR function. Using the newly introduced variables, the original multiplication and sum-of-squares matrices can be made very similar using the following rules: In our example case of $n = 8$, Table 1 takes the form of Table 2, where the last two partial products are equal to zero in the multiplication case.
After these transformations, the resulting partial product tables for modulo multiplication and sum-of-squares are essentially the same, with the first column of the multiplication matrix transferred to the least significant position in the sum-of-squares one. We can therefore in both cases use the tree architecture of the sum-of-squares case for our partial product reduction. In the case of multiplication ($m = 0$), however, the most significant bits of the two final addends are computed at the rightmost FAs of the tree architecture. Considering the completely circular nature of the carry computation unit in every modulo $2^n - 1$ adder [7], we conclude that such an adder provides a similarly rotated modulo $2^n - 1$ sum when its operands are rotated. Therefore, the two addends produced by the FA tree are directly driven to a final parallel modulo $2^n - 1$ adder, and $n$ multiplexers controlled by $m$ perform the required right rotation in the case of a multiplication operation. Fig. 1 presents the proposed reduced area MMSSU$_{-1}$ when $n = 4$. The logic gates required for producing the partial product bits are not shown.
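The rotation argument can be checked with a behavioural sketch. The adder is modelled simply as `(x + y) % (2**n - 1)`, which is an assumption that abstracts away the parallel-prefix structure of [7]; the helper names are illustrative. A left rotation of an $n$-bit word multiplies its value by 2 modulo $2^n - 1$, so adding two left-rotated addends and rotating the sum right by one position gives the same result as adding the original addends.

```python
def rotl1(x, n):
    """Rotate the n-bit word x left by one position."""
    mask = (1 << n) - 1
    return ((x << 1) | (x >> (n - 1))) & mask

def rotr1(x, n):
    """Rotate the n-bit word x right by one position."""
    return (x >> 1) | ((x & 1) << (n - 1))

def add_mod_2n_minus_1(x, y, n):
    """Behavioural modulo 2**n - 1 adder producing a result in [0, 2**n - 2]."""
    return (x + y) % ((1 << n) - 1)

def add_rotated_then_unrotate(x, y, n):
    """Model of the multiplication path: rotated addends, add, rotate right."""
    return rotr1(add_mod_2n_minus_1(rotl1(x, n), rotl1(y, n), n), n)
```

This is why the rotation multiplexers can be placed after the final adder in the modulo $2^n - 1$ case.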
In the general case, the reduced area architecture is composed of:
• $n$ XOR gates for forming the terms of the form $a_i \oplus b_i$.
• $n$ AND gates for forming the $e_i$ terms.
• $2n$ $2 \to 1$ multiplexers for producing the $c_i$ and $d_i$ terms.
• $2\binom{n}{2}$ AND gates for generating the partial product bits of the form $a_i c_j$ or $d_i b_j$.
• $n$ AND gates for generating the partial product bits of the form $a_i b_i$.
• An FA-based partial product reduction tree architecture, composed of $n^2$ FAs with a depth of $D(n + 2)$ FAs.
• A parallel modulo $2^n - 1$ adder.
• $n$ $2 \to 1$ multiplexers for performing a right rotation of the sum.
The critical path of the reduced area MMSSU$_{-1}$ architecture starts at an XOR gate followed by an AND gate that implements an $e_i$ term and passes through the tree of FAs, the parallel adder and the rotation multiplexers. Adopting the approximations of the unit-gate model, we can compute that the area ($A_{\mathrm{Reduced},2^n-1}$) and delay ($T_{\mathrm{Reduced},2^n-1}$) of the reduced area MMSSU$_{-1}$ architecture are given by (we assume the parallel-prefix implementation of the work of Kalamboukas et al. [7] for the final adder)
Qualitative comparisons
In this section, using the approximations of the unit-gate model, we compare the proposed MMSSU$_{-1}$ against two baseline architectures capable of performing modulo multiplication and sum-of-squares, hereafter called system A and system B. In system A, only a multiplier and a modulo adder are available. Therefore, modulo multiplication is performed in one cycle, whereas the sum-of-squares operation is performed in three consecutive cycles: two multiplication cycles followed by a summation one. System B offers a multiplier, a squarer and a modulo adder. The sum-of-squares operation requires two consecutive squaring cycles followed by an addition. System B obviously computes the sum-of-squares faster than A, since a dedicated squarer circuit is significantly faster than a multiplier used for performing squaring.
Considering that each multiplier follows the architecture proposed in the work of Wang et al. [9] and all adders the architecture of Kalamboukas et al. [7], we can obtain that their area requirements and delays are given by
According to Piestrak [11], a modulo $2^n - 1$ squarer has half the partial products of the corresponding modulo $2^n - 1$ multiplier. Therefore, its area and delay requirements can be modelled as
The above approximations enable us to model our baseline systems' areas and delays when performing the sum-of-squares operation as
The graphs in Fig. 2 present, for several values of $n$, the area and delay estimations for the baseline systems assumed, as well as for systems that adopt one of the proposed MMSSU$_{-1}$ architectures. The estimations show that the implementation area of the proposed architectures lies between those of the two baseline architectures. The proposed reduced area architecture requires at most 6% more implementation area than the baseline architecture A, when $n \ge 16$. The area of baseline architecture B is significantly larger than that of both proposed architectures. It exceeds the area of the proposed reduced area architecture by more than 40% when $n \ge 12$.
Both proposed architectures offer significantly reduced sum-of-squares execution delays compared to both baseline architectures, for $n \ge 8$. The savings offered reach 54% for the proposed simple architecture when $n = 32$. Obviously, one may add a dedicated sum-of-squares unit in any of the baseline systems assumed, in order to compensate for this delay. This, however, would further increase the implementation area of such a system. Of course, the price that has to be paid in every case of using one of the proposed MMSSU$_{-1}$ architectures is longer multiplication or simple squaring times compared to those offered by the baseline systems. For example, multiplication time is increased by 11% on average when $n \ge 16$. Since the above results are based on a model that neglects the delay and area for routing as well as fan-in and fan-out, we present quantitative results from full static CMOS implementations in Section 4.
Proposed MMSSU$_{+1}$ architectures
The multiplication of $A$ and $B$ in diminished-1 modulo $2^n + 1$ arithmetic can be treated in a way analogous to that of the preceding section. In the work of Efstathiou et al. [15], it has been shown that we can express each partial product of the diminished-1 modulo $2^n + 1$ multiplication of $A$ and $B$ in $n$ bits. The bits with weight $2^{n+i}$, $0 \le i \le n - 1$, are complemented and shifted to bit position $|n + i|_n$. For each such complementation and shifting, a correction factor of $2^{n+i}$ must be added. Let $X_{-1}$ denote the diminished-1 representation of $X$, that is, $X_{-1} = X - 1$ and $X \ne 0$. Then, for the modulo $2^n + 1$ product $P$ and the sum-of-squares $SS$ of $A$ and $B$, with $A, B \ne 0$, we have

$|P_{-1}|_{2^n+1} = |A \times B - 1|_{2^n+1} = |A_{-1} \times B_{-1} + A_{-1} + B_{-1}|_{2^n+1}$   (4)

$|SS_{-1}|_{2^n+1} = |A^2 + B^2 - 1|_{2^n+1} = |A_{-1}^2 + B_{-1}^2 + 2A_{-1} + 2B_{-1} + 1|_{2^n+1}$   (5)
The above analysis indicates that in the diminished-1 modulo $2^n + 1$ multiplication/sum-of-squares operations, more partial products are required compared to the corresponding modulo $2^n - 1$ cases. For the multiplication operation, one is needed for $A_{-1}$ and a second for $B_{-1}$. For the sum-of-squares operation, one is required for $2A_{-1}$ and a second for $2B_{-1}$. A final extra partial product in both cases is also required for the total correction factor, which in the sum-of-squares case should also incorporate the constant term of (5). Since $n + 2$ partial products are used in the modulo $2^n - 1$ sum-of-squares matrix, we conclude that in the modulo $2^n + 1$ case, $n + 5$ partial products would be needed. We show below, however, that the required correction factors can be incorporated in the remaining partial products. This eliminates the need for an extra partial product just for the total correction factor. Therefore, the proposed MMSSU$_{+1}$ architectures are based on the generation and diminished-1 addition of $n + 4$ partial products. In the multiplication case, when $n = 8$, by expanding (4) and applying the complementation and shifting method, we can obtain the first 10 partial products ($PP_0$ up to $PP_9$) listed in the left-hand side of Table 3. The remaining two partial products will represent the required correction. In the following, we compute the total correction required in the general case. Let $CP$ denote the total correction factor for the multiplication operation. $CP$ can be computed by following the procedures presented in the work of Efstathiou et al. [15, 16] as $CP = CP_1 + CP_2$, where $CP_1$ is the correction imposed by the partial product generation and $CP_2$ is the correction imposed by the partial product reduction.
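The extra $A_{-1}$, $B_{-1}$, $2A_{-1}$, $2B_{-1}$ and constant terms mentioned above follow from substituting $A = A_{-1} + 1$ and $B = B_{-1} + 1$ into $AB - 1$ and $A^2 + B^2 - 1$. The sketch below checks both identities exhaustively for small $n$; the function names are illustrative assumptions, and the code is a numerical check rather than a model of the proposed hardware.

```python
def dim1_product(a_d1, b_d1, n):
    """Diminished-1 product from diminished-1 operands (both operands nonzero)."""
    m = 2**n + 1
    return (a_d1 * b_d1 + a_d1 + b_d1) % m

def dim1_sum_of_squares(a_d1, b_d1, n):
    """Diminished-1 sum-of-squares from diminished-1 operands."""
    m = 2**n + 1
    return (a_d1**2 + b_d1**2 + 2 * a_d1 + 2 * b_d1 + 1) % m

def check_identities(n):
    """Compare both expressions with (A*B - 1) and (A^2 + B^2 - 1) mod 2**n + 1."""
    m = 2**n + 1
    for A in range(1, 2**n + 1):
        for B in range(1, 2**n + 1):
            if dim1_product(A - 1, B - 1, n) != (A * B - 1) % m:
                return False
            if dim1_sum_of_squares(A - 1, B - 1, n) != (A * A + B * B - 1) % m:
                return False
    return True
```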
Since the complemented terms that appear at the partial products $PP_i$, with $1 \le i \le n - 1$, require a correction of $2^n (2^i - 1)$, we can easily obtain that

$CP_1 = \sum_{i=1}^{n-1} 2^n (2^i - 1) = 2^n (2^n - n - 1)$
The $n + 4$ partial products are reduced into two final summands by a multi-operand modulo $2^n + 1$ FA-based tree architecture with complemented EAC [15]. Let $c_n$ denote the carry output of the most significant bit position at some stage $i$ of the reduction scheme. $c_n$ has a weight of $2^n$. Since $|c_n 2^n|_{2^n+1} = |-c_n|_{2^n+1} = |\overline{c_n} + 2^n|_{2^n+1}$, $c_n$ can be complemented and added at the least significant bit position of the next stage, provided that a correction of $2^n$ is taken into account. During the reduction of $n + 4$ partial products, $n + 2$ carries with weight $2^n$ are produced, leading to $CP_2 = 2^n (n + 2)$. Therefore,

$|CP|_{2^n+1} = |2^n (2^n - n - 1) + 2^n (n + 2)|_{2^n+1} = |2^n (2^n + 1)|_{2^n+1} = 0$
Since $CP$ is treated in the proposed architectures as an extra partial product, we have to use in our reduction scheme its diminished-1 representation, $CP_{-1} = 2^n$. Since this value cannot be represented in $n$ bits, two partial products are inserted; the first is the all-1s partial product, whereas the second is the $00 \ldots 01$ partial product.
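The correction bookkeeping for the multiplication case can be verified numerically. The sketch assumes that $CP_1$ is the sum of the per-row corrections $2^n(2^i - 1)$ for $1 \le i \le n - 1$ and that $CP_2 = 2^n(n + 2)$, as stated in the text; the function names are illustrative. It also checks the complemented end-around-carry identity, namely that a carry $c$ of weight $2^n$ is congruent to its complement plus $2^n$ modulo $2^n + 1$.

```python
def cp_total(n):
    """Total multiplication correction CP = CP1 + CP2, reduced modulo 2**n + 1."""
    m = 2**n + 1
    cp1 = sum(2**n * (2**i - 1) for i in range(1, n))   # partial product generation
    cp2 = 2**n * (n + 2)                                  # n + 2 complemented carries
    return (cp1 + cp2) % m

def cp_diminished1(n):
    """Diminished-1 form of CP, i.e. CP - 1 modulo 2**n + 1."""
    return (cp_total(n) - 1) % (2**n + 1)

def complemented_eac_holds(n):
    """Check c * 2**n == (1 - c) + 2**n (mod 2**n + 1) for c in {0, 1}."""
    m = 2**n + 1
    return all((c * 2**n) % m == ((1 - c) + 2**n) % m for c in (0, 1))

n = 8
all_ones = 2**n - 1     # first inserted partial product
one = 1                 # second inserted partial product (00...01)
```

The value $2^n$ needs $n + 1$ bits, which is why it is split into the two $n$-bit rows `all_ones` and `one`.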
From (5), making the same substitutions as in (3), we can obtain that jSS À1 j 2 n þ1 ¼ jA
where
The 10 partial products required for the NPP when n ¼ 8 are listed in the right-hand side of Table 3 (PP 0 up to PP 9 ). Again, each partial product is expressed in n bits by complementations and shifts as explained earlier.
2i+1 terms of (6) can be expressed by the last four partial products in the right-hand side of Table 3. (As will be explained later, the 1 in $PP_{11}$ also incorporates the required total correction factor.) Since the non-zero bits of $PP_{n+2}$ ($a_i b_i$ terms) and $PP_{n+4}$ ($a_i b_i$ terms) can substitute the zero bits of the $PP_i$ partial products, with $0 \le i \le n - 1$, we can then obtain a very regular partial product array, with $n + 4$ rows.
In the following, we compute the total correction factor required for the sum-of-squares operation, say $CSS$, and we show that this is equal to 3 independently of $n$. The total correction factor is given by $CSS = CSS_1 + CSS_2 + CSS_3$, where:
• $CSS_1$ is the correction factor imposed by the partial products formation. This is equal to
• $CSS_2$ is the correction factor required for the reduction of the partial products. Since $n + 2$ carries of weight $2^n$ are produced during the reduction of the $n + 4$ partial products into two final addends, we have that $CSS_2 = 2^n (n + 2)$.
• $CSS_3$ is equal to 1 and represents the last term of (6).
According to the above correction factor, we obtain that
Since we are using the diminished-1 representation of $CSS$ in the partial products matrix, we have to use $CSS_{-1} = 2$, which explains the 1 in the second bit from the right of $PP_{11}$ in Table 3. The above analysis indicates that a diminished-1 modulo $2^n + 1$ multiplier has a very similar structure to a sum-of-squares unit. Both are composed of a partial product generation subcircuit, a modulo partial product reduction unit and a final parallel adder. This enables us, in the following, to propose two architectures for the unification of these designs into a single unit.
Simple MMSSU$_{+1}$ architecture
Our simple MMSSU$_{+1}$ architecture unifies the multiplication and sum-of-squares operations by utilising a multiplexer for every partial product bit that differs in the two operation matrices. Suppose that the desired operation is selected by a signal $s$. When $s = 0$ ($s = 1$), multiplication (sum-of-squares) is performed. Therefore, the area requirements of our simple MMSSU$_{+1}$ architecture are, in the worst case:
• $n$ XOR (XNOR) gates for forming the partial product bits of the form $a_i \oplus b_i$ ($\overline{a_i \oplus b_i}$).
• $2\binom{n}{2}$ AND (NAND) gates for generating the partial product bits of the form $a_i a_j$ ($\overline{a_i a_j}$) or $b_i b_j$ ($\overline{b_i b_j}$).
• $n^2$ AND (NAND) gates for the generation of the $a_i b_j$ ($\overline{a_i b_j}$) bits.
• $n^2 + 3n$ multiplexers that select the correct partial product bits according to the desired operation. (Note that only $n$ multiplexers are required for the last two partial products in the two operations.)
• An FA-based partial product reduction tree architecture. This reduces the $n + 4$ partial products into two by a Dadda tree. The architecture uses $n(n + 2)$ FAs.
• A parallel modulo $2^n + 1$ adder. We consider for it the parallel-prefix implementation presented in the work of Vergos et al. [14].
Obviously, the critical path of the simple MMSSU$_{+1}$ architecture starts at the input of an XOR or XNOR gate and passes through a multiplexer, the adder tree and the parallel adder.
Using the unit-gate model approximations, we can compute that the area $A_{\mathrm{Simple},2^n+1}$ and delay $T_{\mathrm{Simple},2^n+1}$ of the simple MMSSU$_{+1}$ architecture are given by
Reduced area MMSSU$_{+1}$ architecture
The reduced area architecture that we propose in this section aims at reducing the number of multiplexers required. We introduce the following variables based on the operation control signal $s$
Utilising the above variables and following the substitution rules given in the modulo $2^n - 1$ case, we can obtain very similar matrices for the multiplication and sum-of-squares operations. In our example case of $n = 8$, the partial product matrices are transformed into the ones indicated in Table 4. The differences between these matrices are confined to the following:
• The leftmost column of the multiplication matrix is transferred to the least significant bit position in the sum-of-squares operation.
• Some of the bits of this column are complemented.
We can therefore design a reduced area MMSSU$_{+1}$ unit, starting off from a pure sum-of-squares unit and taking into account that in the case of a multiplication operation it computes the two final addends left rotated by one bit position and also requires some of the input bits to be complemented. This means that multiplexers need to be inserted at the following points:
• At the least significant bit position, for choosing between normal and complemented inputs.
• Before the final parallel adder, for performing a single-bit right rotation of the operands, when $s = 0$. Unfortunately, the carry computation unit of a diminished-1 modulo $2^n + 1$ adder does not have a completely circular nature and therefore rotated operands do not yield rotated results. In contrast to the modulo $2^n - 1$ case, we therefore have to insert the multiplexers before the final adder and not after it.
• At the carry inputs of the FAs of the most and the least significant bit positions, to select the appropriate carry between the one coming from the next column to the left and a complemented feedback carry from the previous FA stage. Two multiplexers are required per row of full adders of the Dadda tree.
Fig. 3 presents the proposed reduced area MMSSU$_{+1}$ when $n = 4$. The logic gates required for producing the partial product bits are not shown.
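The need to rotate the addends before the final adder can be illustrated with a behavioural counterexample search. The diminished-1 adder is modelled as $|A_{-1} + B_{-1} + 1|_{2^n+1}$, which is an assumption that abstracts away the parallel-prefix structure of [14], and cases whose result would be the zero indication (value $2^n$) are skipped. The function names are illustrative.

```python
def rotl1(x, n):
    """Rotate the n-bit word x left by one position."""
    mask = (1 << n) - 1
    return ((x << 1) | (x >> (n - 1))) & mask

def dim1_add(a_d1, b_d1, n):
    """Behavioural diminished-1 modulo 2**n + 1 addition of nonzero operands."""
    return (a_d1 + b_d1 + 1) % (2**n + 1)

def rotation_counterexamples(n):
    """Operand pairs for which adding rotated operands is not the rotated sum."""
    found = []
    for x in range(2**n):
        for y in range(2**n):
            s = dim1_add(x, y, n)
            r = dim1_add(rotl1(x, n), rotl1(y, n), n)
            # Skip results equal to 2**n, which do not fit in n bits.
            if s < 2**n and r < 2**n and r != rotl1(s, n):
                found.append((x, y))
    return found
```

For $n = 4$ the pair $(1, 0)$ already fails: the plain sum is 2, whose rotation is 4, while the sum of the rotated operands is 3.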
The area requirements of the reduced area architecture can be summarised as follows:
• $n$ XOR (XNOR) gates for forming the partial product bits of the form $a_i \oplus b_i$ ($\overline{a_i \oplus b_i}$).
The charts given in Fig. 4 present, for several values of $n$, comparative estimations of the area and delay of the baseline systems assumed and the proposed architectures. The implementation area of the proposed architectures again lies between those of the two baseline architectures. The proposed reduced area architecture requires less than 5.9% more implementation area than the baseline architecture A, when $n \ge 16$. The area of baseline architecture B, on the other hand, is significantly larger; averaged over the examined cases, it requires more than 15% and 37% extra implementation area compared to the simple and the reduced area architectures, respectively. Considering the execution delay, both proposed architectures offer significant savings during the sum-of-squares execution, for sufficiently large values of $n$. The savings are greater in the case of the simple architecture. Averaged over the examined cases, the savings offered are 53.4% and 31.8% over the baseline architectures A and B, respectively. The multiplexers added on the critical path of the proposed reduced area designs make them a little slower than those that follow the simple architecture. The savings offered are in this case 45.1% and 18.9% over the baseline architectures A and B, respectively.
Quantitative comparisons
For area and delay estimations from real implementations, we developed a software tool capable of providing structural HDL descriptions of the proposed MMSSUs and the array multipliers proposed in the work of Wang et al. [9] and Efstathiou et al. [15]. All partial product reduction schemes make use of a Dadda tree FA-based architecture and all final adders are constructed according to the fast parallel-prefix solutions proposed in the work of Kalamboukas et al. [7] and Vergos et al. [14]. Operand sizes of 4, 8, 16 and 32 bits were considered. After extensive simulation, the designs were synthesised in a 0.25 µm static CMOS technology with the restriction of a maximum fan-out of 8 and optimised following a standard optimisation script from Bhatnagar [25]. The synthesised netlists, along with the design constraints used to achieve them, were then used for placing and routing the designs. All design constraints, such as output load, floorplan initialisation information and pin placement, were held constant for each architecture. Final timing analysis was performed after all RC parasitic information was extracted from the layout and back-annotated to the gate-level netlist. Typical voltage and temperature operating conditions are assumed along with average gate delays.
The charts given in Fig. 5 present the attained results. Considering the implementation area required, our results indicate that the proposed MMSSU$_{-1}$ become more attractive as $n$ increases; in the example case of $n = 32$, the simple and reduced area MMSSU$_{-1}$ require 3.18% and 0.1% more area than the modulo multiplier, respectively. The delay overhead imposed by the proposed MMSSU$_{-1}$ is about 0.35 and 0.5 ns for the simple and the reduced area architectures, respectively. We should keep in mind that this is the worst-case delay, that is, the delay irrespective of the operation performed. In contrast, a system that uses only a multiplier and a modulo adder will require two multiplication cycles and one addition cycle for the sum-of-squares operation. Taking into account that in the same implementation technology a modulo $2^{16} - 1$ adder has an implementation area of 10 586 µm$^2$ and a delay of 1.77 ns [13], we can obtain that the multiplier-adder solution requires, in the $n = 16$ case, an implementation area of 117 475 µm$^2$ and offers an execution delay of 9.59 ns for the sum-of-squares operation. This is more than double the delay of both proposed MMSSU$_{-1}$ architectures. Therefore, in an application that uses the sum-of-squares operation often, the proposed MMSSU$_{-1}$ units are very attractive. Of course, we must always keep in mind that, in an RNS, the maximum clock frequency is determined by the delay of the slowest channel used.
In the modulo $2^n + 1$ case, the reduced area architecture actually shows area savings over the simple one for $n = 8$, 16 or 32. In the smallest examined case, the multiplexers required for handling the carries and the operands before the last-stage adder outweigh the savings on the multiplexers for the partial product bits selection. Again, as $n$ increases, the use of the proposed MMSSUs becomes more attractive, since their area requirements are closer to those of a multiplier. In the $n = 16$ and 32 example cases, the reduced area MMSSU architecture leads to implementations with only 2.2% and 0.2% more implementation area than the corresponding multipliers. The simple and reduced area implementations are, averaged over the examined cases, 0.7 and 1.2 ns slower than the corresponding multiplier. Taking into account that in the same implementation technology a modulo $2^{16} + 1$ adder has an implementation area of 12 489 µm$^2$ and a delay of 1.68 ns [13], we can again obtain that a multiplier-adder solution requires, in the $n = 16$ case, an implementation area of 146 422 µm$^2$ and offers an execution delay of 9.14 ns for the sum-of-squares operation. This makes the proposed MMSSU$_{+1}$ far more attractive in an application that frequently uses the sum-of-squares operation.
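The multiplier-adder figures quoted for $n = 16$ can be cross-checked with a few lines of arithmetic. The sketch assumes that the quoted area is the sum of one multiplier and one adder and that the quoted delay is two multiplication delays plus one addition delay; the multiplier figures it returns are back-computed under that assumption and are not values reported in the paper. The function name is illustrative.

```python
def implied_multiplier(total_area, total_delay, adder_area, adder_delay):
    """Back out multiplier area and delay, assuming SS = 2 multiplications + 1 addition."""
    area = total_area - adder_area
    delay = round((total_delay - adder_delay) / 2, 2)
    return area, delay

# Quoted n = 16 figures: areas in square micrometres, delays in ns.
minus1 = implied_multiplier(117_475, 9.59, 10_586, 1.77)   # modulo 2**16 - 1
plus1 = implied_multiplier(146_422, 9.14, 12_489, 1.68)    # diminished-1 modulo 2**16 + 1

# Reduced area MMSSU(-1) delay if the quoted overhead of about 0.5 ns is added.
mmssu_minus1_delay = round(minus1[1] + 0.5, 2)
```

Under these assumptions the 9.59 ns multiplier-adder delay is indeed more than twice the resulting MMSSU$_{-1}$ delay, in line with the statement in the text.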
Conclusions
The use of an RNS has been greatly appreciated in several DSP and multimedia applications that may make heavy use of both multiplication and sum-of-squares operations. The implementation area for having distinct units for these operations is very large, and solutions based on employing a multiplier-adder pair or a squarer-adder pair for performing the sum-of-squares operation require several machine cycles.
To this end, we have proposed two architectures for units that perform either the $|X \times Y|_R$ or the $|X^2 + Y^2|_R$ operation depending on the value of a control signal. We considered that $R$ can have either the $2^n - 1$ or the $2^n + 1$ form and that the diminished-1 representation is used in the latter case.
References
