Abstract: Squarers modulo M are useful design blocks for digital signal processors that internally use a residue number system and for implementing the exponentiators required in cryptographic algorithms. In these applications, some of the most commonly used moduli are those of the form $2^n + 1$. To avoid using (n+1)-bit circuits, the diminished-1 number system can be effectively used in modulo $2^n + 1$ arithmetic applications. In the paper, for the first time in the open literature, the authors formally derive modulo $2^n + 1$ squarers that adopt the diminished-1 number system. The resulting implementations are built using only full- and half-adders and a final diminished-1 adder and can therefore be pipelined straightforwardly.
Introduction
A nonpositional residue number system (RNS) is defined by a set of L moduli, $\{d_1, d_2, \ldots, d_L\}$, that are pairwise relatively prime. Assuming that $|A|_M$ denotes A modulo M, that is, the least non-negative remainder of the division of A by M, an integer A has a unique representation in the RNS, given by the set $\{a_1, a_2, \ldots, a_L\}$ of residues, where $a_i = |A|_{d_i}$. In an RNS operation, each result residue $z_i$ is computed in parallel in a separate arithmetic unit, often called a channel. Note that each channel deals with small residues instead of wide numbers and, since all channels operate in parallel, significant speedup over the binary system may be achieved.
The adoption of an RNS is therefore a good choice for applications whose arithmetic operations are limited to addition, subtraction, multiplication and squaring [1, 2]. Several digital signal processors (DSPs) targeting applications such as filtering [3-5] or modulation for communication components [6, 7] have already been built adopting an RNS and many more are expected, provided that modulo $d_i$ arithmetic components are available and efficient.
Efficient adders [8-11], residue generators and multioperand adders [12], multipliers [13, 14] and squarers [15] have been presented for moduli of the $2^n - 1$ and $2^n + 1$ forms. Therefore, RNSs based on the set $\{2^n, 2^n - 1, 2^n + 1\}$ have received significant attention and are the most commonly used. For this set of moduli, however, a new problem arises: the $2^n + 1$ channel has to deal with operands one bit wider than the other two.
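The channel-wise operation of an RNS over this moduli set can be illustrated with a small sketch. The function names `to_rns` and `from_rns` and the sample values are hypothetical, chosen only for illustration; the reconstruction uses the textbook Chinese Remainder Theorem and is not taken from any of the cited designs.

```python
from math import prod

def to_rns(a, moduli):
    """Residues a_i = |A|_{d_i}, one per channel."""
    return [a % d for d in moduli]

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (moduli pairwise coprime)."""
    m_total = prod(moduli)
    result = 0
    for r, d in zip(residues, moduli):
        m_i = m_total // d
        # pow(m_i, -1, d) is the modular inverse of m_i modulo d (Python 3.8+)
        result += r * m_i * pow(m_i, -1, d)
    return result % m_total

n = 4
moduli = [2**n, 2**n - 1, 2**n + 1]   # the set {2^n, 2^n - 1, 2^n + 1}
x, y = 37, 91                          # illustrative operands
# Each channel multiplies small residues independently of the others.
z = [(a * b) % d for a, b, d in zip(to_rns(x, moduli), to_rns(y, moduli), moduli)]
assert from_rns(z, moduli) == (x * y) % prod(moduli)
```

Note that the residues of the $2^n + 1$ channel range over $[0, 2^n]$ and therefore need n+1 bits in weighted form, which is the width problem described above.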
To overcome this problem, and given that in the case of a zero operand the result can be derived straightforwardly, Leibowitz [16] introduced the diminished-1 representation.
In the diminished-1 representation each number is represented decreased by 1 modulo $2^n + 1$ and all arithmetic operations are inhibited for a zero operand (easily identified by the most significant bit being 1 in its diminished-1 form). This representation has the great advantage that the numbers are represented by n bits. Its only disadvantage is that converters between the diminished-1 and the weighted representation are required. However, since an RNS is used when a series of additions and multiplications take place, the conversions required are only a very small portion of the total computation time.
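A minimal behavioural sketch of the conversion follows. The helper names are hypothetical, and the zero test assumes that bit n of the (n+1)-bit diminished-1 word serves as the zero indication, as described above.

```python
def to_diminished1(a, n):
    """Map a in [0, 2^n] to its diminished-1 form |a - 1| mod (2^n + 1)."""
    m = (1 << n) + 1
    return (a - 1) % m

def from_diminished1(a_d, n):
    """Inverse mapping: add 1 modulo 2^n + 1."""
    m = (1 << n) + 1
    return (a_d + 1) % m

def is_zero(a_d, n):
    """Zero maps to 2^n, the only diminished-1 word with bit n set."""
    return ((a_d >> n) & 1) == 1

n = 4
for a in range(0, (1 << n) + 1):
    a_d = to_diminished1(a, n)
    assert from_diminished1(a_d, n) == a
    # Every nonzero operand fits in n bits.
    assert is_zero(a_d, n) == (a == 0)
```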
Although very efficient adders [17-19] and multipliers [20-22] have appeared when the diminished-1 number system is used for the modulo $2^n + 1$ channel, an efficient squarer circuit has not yet been presented. Although a modulo $2^n + 1$ multiplier can also be used for squaring, as we will show in this paper, a dedicated squarer may result in significant delay savings.
Modulo $2^n + 1$ squarers can also find applicability in several cryptographic algorithms for the implementation of modular exponentiations of the form $D = |A^E|_{2^n+1}$. These exponentiations are implemented efficiently using square and multiply algorithms. For example, in the International Data Encryption Algorithm (IDEA), the subkeys are generated by shifting the master key and computing additive and multiplicative inverse elements. The computation of the multiplicative inverse modulo $2^{16} + 1$ can be done using the Fermat theorem and is implemented by modular exponentiation. In the implementation described in [23], the square and multiply algorithm was used for implementing the modulo $2^{16} + 1$ exponentiator. Since a squarer was not available, multipliers modulo $2^{16} + 1$ have been used. To alleviate the delay of modular computations on the wide operands used in cryptographic algorithms, an RNS has been adopted in some cases. Examples include the implementations of the RSA cryptoalgorithm [24] or the Montgomery modular multiplication algorithm [25].
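To make the role of the squaring step concrete, the following sketch shows one common (right-to-left binary) variant of square and multiply and uses it to compute a multiplicative inverse modulo $2^{16} + 1$ through Fermat's theorem. It is an illustration only and does not reproduce the implementation of [23]; the function names are hypothetical.

```python
def square_and_multiply(a, e, m):
    """Compute |a^e|_m by scanning the exponent bits from the least significant."""
    result = 1
    base = a % m
    while e > 0:
        if e & 1:
            result = (result * base) % m   # multiply step
        base = (base * base) % m           # squaring step: one modular square per exponent bit
        e >>= 1
    return result

M = 2**16 + 1   # 65537, a prime, so |a^(M-2)|_M is the inverse of a (Fermat)

def multiplicative_inverse(a):
    return square_and_multiply(a, M - 2, M)

a = 12345       # illustrative value
assert (a * multiplicative_inverse(a)) % M == 1
```

Every iteration performs a squaring, while a multiplication is needed only for the nonzero exponent bits, which is why the delay of the squaring operation matters in such exponentiators.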
In this paper, we formally derive novel modulo $2^n + 1$ squarers for operands in the diminished-1 representation. The derived squarers can be implemented by using only a carry-save array (CSA) composed of full (FA) and half (HA) adders and a final modulo $2^n + 1$ diminished-1 parallel adder [17-19]. Their area and delay estimations indicate that they can perform squaring much faster than a diminished-1 multiplier [22] as well as former solutions for operands in weighted representation [15]. They can also be very easily pipelined up to the FA stage.
Novel squarers
In this Section, we introduce a new architecture for modulo $2^n + 1$ squarers of diminished-1 operands. We first explain the derivation of the partial products. We then consider the reduction of the partial products into two summands.
Let A be an (n+1)-bit number, $A \in [0, 2^n + 1)$, and let $A_{-1} = a_{n-1} a_{n-2} \ldots a_1 a_0$ denote its diminished-1 representation. Assume that Q denotes the square of A modulo $2^n + 1$, that is, $Q = |A^2|_{2^n+1}$. We then have
or, equivalently,
The term $|A_{-1}^2|_{2^n+1}$ of (1) can be expressed as
Taking into account that $i + j \le 2n - 2$, (2) can be written as
where
Since for $z \in \{0, 1\}$ it holds that
then (3) can be expressed as
and $\bar{t}$ denotes the complement of bit t. Equation (5) indicates that one way to form the partial products is to complement each bit $a_i a_j$ with $i + j \ge n$ and place it at bit position $|i + j|_n$, provided that a correction equal to $|2^n 2^{|i+j|_n}|_{2^n+1}$ is taken into account. Therefore, (5) can be reformulated as
where $PP_i$ denotes the ith partial product
and $C_i$ is the corresponding correction factor. Note that $PP_0$ does not contain any complemented bits and thus $C_0 = 0$. On the other hand, for $i \ne 0$, the value of $C_i$ depends on the number of the complemented bits $\overline{a_i a_j}$, and is given by
According to the above, and taking into account that $a_i a_i = a_i$, the partial products and correction factors presented in Table 1 are derived for the term $|A_{-1}^2|_{2^n+1}$ of (1). The total correction, $C_P$, required for the formation of the above n partial products is equal to
We can now notice, by observing the columns of the partial products, that in the same column some terms appear twice, once as $a_i a_j$ and once as $a_j a_i$, or once as $\overline{a_i a_j}$ and once as $\overline{a_j a_i}$. Since $a_i a_j = a_j a_i$ ($\overline{a_i a_j} = \overline{a_j a_i}$) and $a_i a_j + a_j a_i = 2 \times a_i a_j$ ($\overline{a_i a_j} + \overline{a_j a_i} = 2 \times \overline{a_i a_j}$), each such pair of product bits that appears in the column with bits of weight $2^{i+j}$ can be replaced by one product bit $a_i a_j$ ($\overline{a_i a_j}$) in the column with bits of weight $2^{i+j+1}$, that is, in the next column to the left. The pairs of the leftmost column can also, as explained earlier according to (5), be complemented and placed at the rightmost column, if a correction factor equal to $2^n$ is taken into account for each such complementation and placement.
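The congruences behind the folding of product bits can be written out compactly. The following is a sketch reconstructed from the definitions above, assuming a nonzero operand (so that $A = A_{-1} + 1$); the index $k$, standing for a generic bit position with $0 \le k < n$, is introduced here only for illustration.

```latex
\begin{align*}
Q &= \left|(A_{-1}+1)^2\right|_{2^n+1} = \left|A_{-1}^2 + 2A_{-1} + 1\right|_{2^n+1}, \\
Q_{-1} &= \left|A_{-1}^2 + 2A_{-1}\right|_{2^n+1}, \\
A_{-1}^2 &= \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} a_i a_j\, 2^{i+j}, \\
\left|z\,2^{n+k}\right|_{2^n+1} &= \left|-z\,2^{k}\right|_{2^n+1}
  = \left|(\bar{z}-1)\,2^{k}\right|_{2^n+1}
  = \left|\bar{z}\,2^{k} + 2^n 2^{k}\right|_{2^n+1}, \qquad z \in \{0,1\}.
\end{align*}
```

The last line uses $2^n \equiv -1 \pmod{2^n+1}$ twice: once to move a bit of weight $2^{n+k}$ down to position $k$ with a negative sign, and once to turn the resulting constant $-2^k$ into the positive correction $2^n 2^k$.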
The number of pairs of equal product bits that appear in the leftmost column is $\lfloor n/2 \rfloor$, where $\lfloor x \rfloor$ denotes the greatest integer that is less than or equal to x. The total correction required by the simplification of the same terms is therefore equal to
The resulting partial products matrix has different forms, depending on whether n is odd or even. In the case that n is odd, each column of the newly formed matrix has exactly $\frac{n+1}{2}$ bits. When n is even, on the other hand, the columns with bits of weight $2^{i+j}$, with $i + j \in \{0, 2, 4, \ldots, n-2\}$, have $2 + \frac{n}{2}$ bits, whereas the remaining columns have $\frac{n}{2} - 1$ bits.

Since we have derived the required partial products for the term $|A_{-1}^2|_{2^n+1}$ of (1), we now turn our focus to the term $|2 \times A_{-1}|_{2^n+1}$. This term leads to a new partial product equal to $A_{-1}$ shifted one position to the left. According to (5), the $a_{n-1}$ bit that overflows at the left end can be complemented and placed at bit position 0, provided that a correction factor, $C_N$, equal to $2^n$, is taken into account. Therefore the (n+1)th partial product of the proposed squarers is given by
The proposed squarers utilise one last partial product, that is, the diminished-1 modulo $2^n + 1$ representation of the sum of all correction factors required. These correction factors are $C_P$, $C_S$ and $C_N$, but we must also take into account any correction factor introduced during the reduction of the partial products into two summands. In the following we explain how the latter correction factor can be computed.
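The partial-product formation described so far can be checked with a small arithmetic-level model. This is a sketch under stated simplifications, not the proposed circuit: it folds every product bit individually, omits the merging of equal pairs (so no $C_S$ term appears), and simply adds everything up instead of using an adder tree. The function name `squarer_model` is hypothetical.

```python
def squarer_model(a_d, n):
    """Return Q_{-1} for a diminished-1 operand a_d with 0 <= a_d < 2^n."""
    m = (1 << n) + 1
    bits = [(a_d >> i) & 1 for i in range(n)]
    total = 0        # sum of (possibly complemented) product bits, weighted
    correction = 0   # sum of the correction factors
    # Term A_{-1}^2: product bits a_i a_j of weight 2^(i+j).
    for i in range(n):
        for j in range(n):
            p = bits[i] & bits[j]
            if i + j < n:
                total += p << (i + j)
            else:
                k = (i + j) % n                 # folded bit position |i+j|_n
                total += (p ^ 1) << k           # complemented bit
                correction += (1 << n) << k     # correction 2^n * 2^k
    # Term 2 * A_{-1}: shift left, fold the overflowing a_{n-1}.
    for i in range(n - 1):
        total += bits[i] << (i + 1)
    total += bits[n - 1] ^ 1                    # complemented a_{n-1} at position 0
    correction += 1 << n                        # correction C_N = 2^n
    return (total + correction) % m

n = 7
m = (1 << n) + 1
for a_d in range(1 << n):
    a = a_d + 1                                 # weighted value of the operand
    assert squarer_model(a_d, n) == (a * a - 1) % m
```

The check compares against $|A^2 - 1|_{2^n+1}$, which is the diminished-1 form of the square, including the case where the square itself is congruent to zero.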
We consider that the reduction of the partial products into two summands is performed by using a full adder (FA) based tree architecture. Tree architectures have been introduced by Wallace [26]. Dadda reduced their area requirements in [27]. Consider that $c_n$ is the carry output at the most significant bit position of some stage i in the reduction scheme. $c_n$ has a weight of $2^n$. Since
$c_n$ can be complemented and added back at the least significant bit position, provided that a correction of $2^n$ is taken into account. In this way, an FA based tree architecture with complemented end-around-carry (EAC) is formed. CSA architectures with complemented EAC along with general principles for calculating the required correction factors have been presented in [12].
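The complemented end-around-carry step can be modelled at word level. The sketch below compresses three n-bit operands at a time into a sum vector and a carry vector, complements the carry leaving the most significant position, reinserts it at position 0, and records a correction of $2^n$ for each such step. It is a sequential carry-save model for illustration, not a Dadda tree; the function name and the operand values are hypothetical.

```python
def csa_eac_reduce(operands, n):
    """Reduce n-bit operands to two summands; return (s1, s2, correction)."""
    mask = (1 << n) - 1
    ops = list(operands)
    correction = 0
    while len(ops) > 2:
        x, y, z = ops.pop(), ops.pop(), ops.pop()
        s = x ^ y ^ z                            # bitwise sum outputs of the FAs
        c = (x & y) | (x & z) | (y & z)          # bitwise carry outputs of the FAs
        c_n = (c >> (n - 1)) & 1                 # carry leaving the top position
        carry_vec = ((c << 1) & mask) | (c_n ^ 1)  # complemented c_n at position 0
        correction += 1 << n                     # one correction of 2^n per step
        ops.extend([s, carry_vec])
    return ops[0], ops[1], correction

n = 7
m = (1 << n) + 1
ops = [0b1011011, 0b0110110, 0b1111111, 0b0000001, 0b1010101, 0b0101010]
s1, s2, corr = csa_eac_reduce(ops, n)
assert (s1 + s2 + corr) % m == sum(ops) % m
assert corr == (len(ops) - 2) * (1 << n)
```

In this model, reducing k operands to two always produces k - 2 carries of weight $2^n$, which agrees with the carry count quoted below for odd n.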
For computing the total correction factor required, we consider the following two cases:
(i) n is odd: Then, during the reduction of the $\frac{n+1}{2} + 2$ partial products, $\frac{n+1}{2}$ carries are produced. The correction required for the reduction scheme is then equal to $C_{R,odd} = 2^n \frac{n+1}{2}$. Let C denote the modulo $2^n + 1$ sum of all correction factors, that is, the modulo $2^n + 1$ sum of $C_P$, $C_S$, $C_N$ and $C_{R,odd}$. Note that, in parallel, C is our last partial product.
We then have
(ii) n is even: As analysed above, in this case not all columns of the partial products matrix have the same number of partial product bits. We apply an FA at each column with bits of weight $2^{i+j}$, with $i + j \in \{0, 2, 4, \ldots, n-2\}$, thereby reducing the number of partial product bits from
Since C is treated in the proposed architecture as an extra partial product, we have to use in our reduction scheme its diminished-1 representation, i.e. $C_{-1}$, which is equal to the all-0s n-bit vector. Note that, although $C_{-1} = 0$, it cannot be ignored during the reduction of the partial products, since in that case fewer carries of weight $2^n$ than those computed would be produced.
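The statement that $C_{-1}$ is the all-0s vector, that is, that the correction factors sum to 1 modulo $2^n + 1$, can be checked numerically for odd n. The sketch below takes $C_N = 2^n$ and $C_{R,odd} = 2^n \frac{n+1}{2}$ from the text, takes $C_S$ as $2^n$ per pair for the $\lfloor n/2 \rfloor$ pairs of the leftmost column, and assumes $C_i = 2^n (2^i - 1)$, which follows if row i contributes i complemented bits at positions 0 to i-1. That last expression is an assumption of this sketch rather than a quoted formula, and the function name is hypothetical.

```python
def total_correction_odd(n):
    """Modulo 2^n + 1 sum of C_P, C_S, C_N and C_R,odd for odd n."""
    assert n % 2 == 1
    m = (1 << n) + 1
    two_n = 1 << n
    c_p = sum(two_n * ((1 << i) - 1) for i in range(1, n))  # assumed C_i, summed over rows
    c_s = two_n * (n // 2)         # one 2^n per merged pair of the leftmost column
    c_n = two_n                    # folding of a_{n-1} in the 2 * A_{-1} term
    c_r = two_n * ((n + 1) // 2)   # carries of the reduction scheme
    return (c_p + c_s + c_n + c_r) % m

for n in (3, 5, 7, 9, 11, 13, 15):
    c = total_correction_odd(n)
    assert c == 1                  # so C_{-1} = C - 1 is the all-0s vector
```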
An implementation of the proposed architecture is therefore composed of AND or NAND gates that form the bits of each partial product, a Dadda tree that reduces the partial products into two summands, and a modulo $2^n + 1$ adder for diminished-1 operands [19] that accepts these two summands and produces the required result.
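The behaviour of the final stage can be sketched as follows. A diminished-1 adder computes $|X_{-1} + Y_{-1} + 1|_{2^n+1}$, which an n-bit adder realises by feeding back the complement of its carry output. The model below is behavioural only and does not reflect the parallel-prefix structure of [19]; the function name and the separate zero flag are hypothetical.

```python
def dim1_add(x_d, y_d, n):
    """Add two nonzero diminished-1 operands (0 <= x_d, y_d < 2^n).

    Returns (sum_d, zero_flag); zero_flag marks a result congruent to zero,
    which has no n-bit diminished-1 encoding.
    """
    mask = (1 << n) - 1
    s = x_d + y_d
    carry = s >> n                        # carry output of the n-bit addition
    sum_d = (s + (carry ^ 1)) & mask      # add the complemented carry back
    zero_flag = (s == mask)               # x_d + y_d + 1 = 2^n, i.e. zero result
    return sum_d, zero_flag

n = 4
m = (1 << n) + 1
for x_d in range(1 << n):
    for y_d in range(1 << n):
        sum_d, zero = dim1_add(x_d, y_d, n)
        expected = (x_d + y_d + 1) % m    # diminished-1 form of the true sum
        assert (expected == (1 << n)) if zero else (sum_d == expected)
```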
Example of proposed squarers
In this Section we present an example of the derived diminished-1 modulo $2^n + 1$ squarers. Consider the design of a modulo $2^7 + 1$ squarer. Let $A_{-1} = a_6 a_5 a_4 a_3 a_2 a_1 a_0$ be the input operand. We start with the partial product matrix for $A_{-1}^2$ shown in Table 2. We then complement each bit $a_i a_j$ with $i + j \ge 7$ and place it at bit position $|i + j|_7$. This results in the following partial product matrix:
Apart from the four above partial products, the complete matrix of the squarer must also include $|2 \times A_{-1}|_{2^n+1}$ and the total correction C, that is, the following two partial products:
These partial products are reduced into two summands by the tree architecture indicated in Fig. 1, which is composed only of FA and HA blocks. Each such block produces a carry at its left and a sum at its right output. The carries at the most significant bit position (leftmost carries) are complemented and added to the bits of the least significant bit position. Note that, in Fig. 1, four such carries are produced, to comply with the analysis preceding (12). The two final summands are driven to a final parallel diminished-1 adder. The proposed squarer architecture is very regular and can be straightforwardly pipelined up to the FA stage, since the final adder, if designed according to [19], can be pipelined up to the complex gate level.
Comparisons
In this Section we examine the area and delay complexities of the proposed squarers. Furthermore, we compare their area and delay requirements against those of the multipliers proposed in [22] and the squarers proposed in [15]. The reasoning behind the first comparison is that a squarer circuit would be included in an RNS implementation only if performing the squaring function by a multiplier would significantly slow down the execution rate. We must also note that the comparison against the squarers of [15] should not be thought of as a direct one, since a designer's choice between the diminished-1 and the weighted representation cannot be based solely on whether the proposed squarers' results are superior to those of [15]. This decision is based on the availability, delay and area of all design components needed, as well as on which representation best suits the application. A designer cannot afford to choose components utilising different representations because of the time lost on conversions among them.
For our comparisons, we adopt the approximations of the unit-gate model [28] , that is, we consider that all 2-input monotonic gates count as one gate equivalent for both area and delay, while a 2-input XOR or XNOR gate counts as 2 gate equivalents for both area and delay.
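As a small illustration of how the unit-gate model is applied, the sketch below counts the gate equivalents of a full adder and a half adder. The gate-level decompositions used (two XORs, two ANDs and one OR for the FA; one XOR and one AND for the HA) are common textbook structures assumed here for illustration, and the names are hypothetical.

```python
# Unit-gate cost (used for both area and delay) of each 2-input gate type.
UNIT_GATE_COST = {"AND": 1, "OR": 1, "NAND": 1, "XOR": 2, "XNOR": 2}

def area(gates):
    """Area: sum of the costs of all gates in the cell."""
    return sum(UNIT_GATE_COST[g] for g in gates)

def delay(paths):
    """Delay: cost of the slowest input-to-output path."""
    return max(sum(UNIT_GATE_COST[g] for g in path) for path in paths)

# Full adder: sum = a ^ b ^ cin, carry = (a & b) | (cin & (a ^ b)).
FA_GATES = ["XOR", "XOR", "AND", "AND", "OR"]
FA_PATHS = [["XOR", "XOR"], ["XOR", "AND", "OR"]]

# Half adder: sum = a ^ b, carry = a & b.
HA_GATES = ["XOR", "AND"]
HA_PATHS = [["XOR"], ["AND"]]

assert area(FA_GATES) == 7 and delay(FA_PATHS) == 4
assert area(HA_GATES) == 3 and delay(HA_PATHS) == 2
```

The FA figures agree with the seven gate equivalents and four time units used for an FA in this Section; the HA figures are derived here under the same model.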
In the proposed squarers, the required partial product bits can be derived in parallel by the use of $\frac{n(n-1)}{2}$ AND or NAND gates and $\frac{n}{2}$ inverters. We consider that these partial products are then reduced to two summands by the use of a Dadda tree. The depth in FA stages of a Dadda tree is a function, say D(k), of its number of operands k and is listed in Table 3 for all practical values.
Each of the n columns of the tree is composed of at most … (… in Fig. 1, $a_1$ should be driven in both inputs of an HA at the column with bits of weight $2^2$. This HA has been removed from Fig. 1 by substituting its sum output with 0 and its carry output with $a_1$. The first substitution also leads to the simplification of an FA of the second row into an HA. These simplifications depend on the actual value of n and therefore are not modelled in the sequel). The area and delay of an FA are seven equivalent gates and four time units, respectively. The area and delay of an n-bit parallel diminished-1 modulo $2^n + 1$ adder that follows the architecture proposed in [19] are $\frac{9}{2} n \log_2 n + \frac{1}{2} n + 6$ equivalent gates and $2 \log_2 n + 3$ time units. Therefore the area (A) and delay (T) requirements of the proposed modulo $2^n + 1$ squarers are:
… $n \log_2 n + \frac{1}{2} n + 6$ equivalent gates; and
The modulo $2^n + 1$ multipliers presented in [22] were compared against those presented in [20, 21] and were found more efficient in both area and delay terms. Their delay using the unit-gate model is
$$\begin{cases} \ldots, & n = 4, 5, 7, 8, 11, 12, \ldots \\ 4D(n+3) + 2\log_2 n + 4, & \text{otherwise} \end{cases} \qquad (14)$$
time units, while their area was estimated to be
Finally, the modulo $2^n + 1$ squarers for operands in the weighted representation proposed in [15] are implemented by $\frac{n(n-1)}{2}$ AND/NAND gates that each form a partial product bit, a CSA array composed of $\frac{(n+1)(n-2)}{2} + 1$ FAs and a modulo generation circuit. For fair comparisons against the proposed squarers, in the following we consider that a Dadda tree architecture is used for the reduction of the partial products instead of a CSA array. The required modulo generation circuits can be designed using two n-bit adders that operate in parallel and (n+1) multiplexers.
Since we have adopted a fast parallel-prefix solution with constant fanout for the last stage adder in the proposed squarers, we also in this case consider that each of the required adders is designed according to the parallel-prefix algorithm presented in [29]. One of these adders receives the constant vector 011...11, which enables simplifications up to the first prefix level. We therefore model the area of this adder as $3n(\log_2 n - 1)$ equivalent gates, whereas the area of the other one is $3n \log_2 n + n + 4$ equivalent gates. A multiplexer is considered to have the same area requirements as an XOR gate, that is, two gate equivalents. Taking into account all the above we can model the area requirements of the modulo $2^n + 1$ squarers presented in [15] as
The delay of the modulo $2^n + 1$ squarers presented in [15] is formed by the delay of an AND/NAND gate, the delay of the adder tree network, which can be modelled as $D(\ldots)$ ($D(\ldots)$) FA stages for even (odd) values of n, and the delay of the modulo generation circuit. The latter is equal to the delay of the adder, which for the chosen implementation is $2 \log_2 n + 3$ time units, plus the delay of the multiplexers, which is considered equal to the delay of an XOR gate, i.e. 2 time units. Summing the above, we get that the delay of the squarers presented in [15] can be modelled for even values of n as
In Table 4 we present the area and delay requirements of the multipliers of [22], the squarers of [15] and the proposed squarers for several values of n, along with the savings (%) offered by the proposed squarers over [22] and over [15]. The proposed squarers offer savings in the delay of the squaring function of up to 66.6% compared with using a multiplier circuit for it. The delay savings are well above 20% in the cases of most practical interest, that is, when $n \le 16$. On average over the examined cases, a dedicated squarer designed according to the proposed architecture offers 28.4% shorter delay. The proposed squarers can also be implemented in significantly less area than that required for a multiplier. On average over the examined cases, the squarers require 56.6% of the area required for a multiplier.
The adder tree used in the proposed squarers is at least one stage shorter than the one required by the squarers of [15]. This, along with the fact that the proposed squarers do not need multiplexers at their output stage, makes them significantly faster than the squarers of [15]. On average over the examined cases, the saving achieved is 29.4% on the execution delay. The area required by the proposed circuits is very close to that of the squarers of [15]; on average over the examined cases, area savings of 2.9% are offered.
Conclusions
Efficient modulo $2^n + 1$ squarers are useful design components in RNS and cryptography applications. In this paper we have derived a new architecture for designing diminished-1 modulo $2^n + 1$ squarers. The proposed squarers offer significant savings in propagation delay over the case that a modulo $2^n + 1$ multiplier is used for performing the squaring function. They are also significantly faster than an earlier proposed solution for operands in weighted representation. The proposed architecture results in implementations with very regular structure, well suited to VLSI implementations and straightforward to pipeline.
References

