Abstract: ($2^n - 1$) is one of the most commonly used moduli in Residue Number Systems. In this letter, we propose a novel Booth encoding architecture. Based on it, we design high-speed and highly efficient modulo ($2^n - 1$) multipliers, which are the fastest among all known modulo ($2^n - 1$) multipliers. The performance and efficiency of the proposed multipliers are evaluated and compared with the fastest earlier modulo ($2^n - 1$) multipliers using a simple gate-count and gate-delay model. The results show that the proposed multipliers are on average approximately 14% faster than the fastest known modulo ($2^n - 1$) multipliers.
Introduction
A Residue Number System (RNS) is defined by a set of pairwise relatively prime moduli $\{m_1, m_2, \ldots, m_k\}$ [1]. An integer $X$ is represented in RNS as $\{x_1, x_2, \ldots, x_k\}$, where $x_i = X \bmod m_i$.
In an RNS, operations on residues are performed in parallel units, each handling small residues. Moduli sets of the forms $\{2^n, 2^n - 1, 2^n + 1\}$ and $\{2^n, 2^n - 1, 2^{n-1} - 1\}$ are the most commonly used in RNS because, in terms of the area × time$^2$ product, they have lower implementation difficulty and computational complexity than other moduli forms, and they also admit efficient converters to and from the binary system [2]. Modulo ($2^n - 1$) arithmetic has therefore been widely used in many applications, and finding new methods to design efficient modulo ($2^n - 1$) multipliers is of considerable interest. In recent years, several architectures have been proposed for fast modulo ($2^n - 1$) multipliers [3] [4] [5]. The most efficient were introduced in [3]; to our knowledge, they are the fastest known modulo ($2^n - 1$) multipliers.
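The residue representation and the channel-parallel arithmetic described above can be illustrated with a minimal Python sketch. The function names (`to_rns`, `rns_mul`, `from_rns`) and the choice n = 4 are ours and purely illustrative; the reconstruction step uses the textbook Chinese Remainder Theorem, not any converter from [2].

```python
from math import prod

def to_rns(x, moduli):
    """Return the residue of x for each modulus (one value per channel)."""
    return [x % m for m in moduli]

def rns_mul(xs, ys, moduli):
    """Multiply channel by channel; no carries cross between channels."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]

def from_rns(residues, moduli):
    """Rebuild the integer modulo prod(moduli) with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse, Python 3.8+
    return total % M

n = 4                                   # illustrative word length
moduli = [2**n, 2**n - 1, 2**n + 1]     # {16, 15, 17}, pairwise relatively prime
a, b = 23, 45
residues_a = to_rns(a, moduli)
product = from_rns(rns_mul(residues_a, to_rns(b, moduli), moduli), moduli)
```

The result is exact only while the true product stays below the dynamic range, here 16 × 15 × 17 = 4080.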
In this letter, we propose a novel Booth encoding architecture that is faster than the encoding used in earlier designs. Static analysis demonstrates that the proposed multipliers offer significant savings in execution delay compared with those in [3] [4] [5] and achieve high area × time$^2$ efficiency.
The proposed modified Booth modulo ($2^n - 1$) multiplier
In this section, we present the proposed architecture for our modified Booth modulo ($2^n - 1$) multipliers. Suppose that A is the multiplicand and B is the multiplier, and we have
with n-bit numbers in modulo ($2^n - 1$) representation. The multiplier B can be rewritten as:
, we obtain the formation of partial products shown in Table I from [3], where
By performing logic addition and logic simplification, the terms of the sixth row with $a[i]$ or $\overline{a[i]}$ (the notation $\overline{x}$ denotes the complement of $x$) in Table I can be formulated as:
and the terms of the sixth row with Table I can be expressed as: 
We consider
is the Booth encoder (BE) logic equation. Fig. 1 shows the implementation adopted for the Booth encoder. The expression for the term pp[k][i] can be implemented with an XNOR gate and two 2:1 inverting multiplexers. Fig. 2 shows the implementation adopted for the Booth selector (BS), where MXI is a 2:1 inverting multiplexer. The control signals derived from the multiplier B, together with the multiplicand A, are used to form the partial products; each Booth code produces one partial product, and every bit of each partial product is produced by a Booth selector block, as shown in Fig. 2. The partial products are then reduced to two operands with CSA arrays or adder trees (Wallace trees), sometimes in several stages. "The carry output at the most significant bit position of each stage has a weight of $2^n$, which, in modulo ($2^n - 1$) arithmetic, is equal to 1. Therefore, these carries are added in an end-around carry way to the least significant bit position of the operands of the next stage." [3] These two resulting operands are added in a parallel modulo ($2^n - 1$) adder to compute the final result.
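The data flow described in this section (Booth recoding of B, one partial product per Booth code, end-around-carry accumulation) can be checked with a behavioural Python sketch. It rests on assumptions that should be stated plainly: it uses the standard radix-4 Booth recoding with the wrap-around $b_{-1} = b_{n-1}$, which is valid modulo ($2^n - 1$) for even n, and it does not model the gate-level BE and BS logic of Figs. 1 and 2. All function names are ours.

```python
def rotl(x, r, n):
    """Rotate the n-bit value x left by r bits, i.e. multiply by 2**r mod (2**n - 1)."""
    r %= n
    mask = (1 << n) - 1
    return ((x << r) | (x >> (n - r))) & mask

def eac_add(x, y, n):
    """Add two n-bit values, feeding the carry out (weight 2**n = 1) back into bit 0."""
    mask = (1 << n) - 1
    s = x + y
    return (s & mask) + (s >> n)

def booth_digits(b, n):
    """Radix-4 Booth digits in {-2, -1, 0, 1, 2}; bit index -1 wraps around to n - 1."""
    bit = lambda i: (b >> (i % n)) & 1
    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1) for k in range(n // 2)]

def booth_mul_mod(a, b, n):
    """Behavioural model of a Booth-encoded modulo (2**n - 1) multiplier, n even."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    mask = (1 << n) - 1
    acc = 0
    for k, d in enumerate(booth_digits(b, n)):
        if d == 0:
            pp = 0
        else:
            # |d| = 1 rotates by 2k bits, |d| = 2 rotates by 2k + 1 bits
            pp = rotl(a, 2 * k + (abs(d) - 1), n)
            if d < 0:
                pp ^= mask  # bitwise complement is exact negation mod 2**n - 1
        acc = eac_add(acc, pp, n)
    return 0 if acc == mask else acc  # the all-ones word is the second encoding of zero
```

For example, `booth_mul_mod(7, 14, 4)` recodes 14 as the single digit −1 (since 14 ≡ −1 mod 15) and returns the complement of 7, which is 8 = 98 mod 15. The sketch accumulates sequentially for clarity, whereas the hardware reduces the n/2 partial products in a CSA array or Wallace tree.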
Analysis and Comparison
In this section, we demonstrate the improved performance of the proposed multipliers against the multiplier designs of [3] [4] [5] .
The proposed design requires an area equal to
where $A_{BE,Pro}$ is the area of a BE, $A_{BS,Pro}$ is the area of a BS, $A_{FA}$ is the area of a full adder (FA), $A_{PAn}$ is the area of a modulo ($2^n - 1$) adder (PA(n)), and $\lceil * \rceil$ is the smallest integer larger than or equal to $*$. The delay of the proposed multiplier equals
when a CSA array is used and equals
when Wallace trees are used, where $T_{BS,Pro}$ is the delay of a BS, $T_{FA}$ is the delay of an FA, $T_{PAn}$ is the delay of a modulo ($2^n - 1$) adder (PA(n)), and $k(x)$ equals 0, 1, 2, 3, 4, 4, 6 and 6 when $x$ is 2, 3, 4, 5, 8, 9, 16 and 17, respectively [3]. Considering the area × time$^2$ product, the ratio of efficiency of the proposed multiplier to the reference multiplier can be defined as:
The area of the reference multiplier [3] can be given as
The delay of the reference multiplier [3] equals
when a CSA array is used and
when Wallace trees are used. The area of the reference multiplier [4] can be given as
where $\lfloor * \rfloor$ is the largest integer less than or equal to $*$. The delay of the reference multiplier [4] equals
when Wallace trees are used. The area of the reference multiplier [5] can be given as
The delay of the reference multiplier [5] equals
when Wallace trees are used. Based on the simple gate-count and gate-delay model used in [3], the following approximations can be obtained: $A_{BE,Pro}$ is 2 equivalent gates, $A_{BS,Pro}$ is 6 equivalent gates, $T_{BS,Pro}$ is 4 time units, $A_{BE}$ is 5 equivalent gates, $T_{BE}$ is 3 time units, $A_{BS}$ is 5 equivalent gates, $T_{BS}$ is 4 time units, $A_{FA}$ is 7 equivalent gates, $T_{FA}$ is 4 time units, $A_{PAn}$ is $3n \log_2 n + 4n$ equivalent gates, and $T_{PAn}$ is $2 \log_2 n + 3$ time units. Based on these approximations, the delay and area of the proposed designs are compared with those of [3] [4] [5] for n = 4, 8, and 16, with either a CSA array or a Wallace tree used for partial product reduction.
Static analysis shows that the proposed multipliers are up to 21.4%, 38.9% and 53.2% faster than those of [3] [4] [5], respectively. On average, they are faster than those of [3] [4] [5] by 13.7%, 26.4% and 45.3% when a CSA array is used, and by 14.2%, 24.2% and 36.6% when Wallace trees are used, respectively. Compared with [3], the proposed multipliers are faster because the proposed Booth encoding architecture shortens the critical path of the Booth encoding: the architecture in [3] has a delay of 7 time units, while the proposed one has a delay of only 4 time units. Compared with [4], the proposed multipliers are faster both because of this shorter critical path and because they have fewer FA stages. Relative to those of [3], the proposed multipliers incur an area overhead of 4.6% on average. They are 19.1% and 26.0% more area efficient than those of [4] [5] on average. The performance improvement over [5] is especially significant.
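The building blocks of the gate-count and gate-delay model quoted above can be written down as a short Python sketch. Two points are assumptions on our part rather than statements from the text: the recurrence in `wallace_levels` is the usual 3:2 counter argument, which happens to reproduce the $k(x)$ values quoted from [3], and `efficiency_ratio` takes efficiency to be the reciprocal of the area × time$^2$ product. The complete area and delay expressions of the multipliers are not reproduced here.

```python
from math import log2

# Unit costs quoted from the model of [3] (equivalent gates, time units).
A_FA, T_FA = 7, 4

def pa_area(n):
    """Area of a parallel modulo (2**n - 1) adder: 3 n log2(n) + 4 n gates."""
    return 3 * n * log2(n) + 4 * n

def pa_delay(n):
    """Delay of a parallel modulo (2**n - 1) adder: 2 log2(n) + 3 time units."""
    return 2 * log2(n) + 3

def wallace_levels(x):
    """Number of full-adder levels that reduce x operands to two (assumed recurrence)."""
    levels = 0
    while x > 2:
        x = 2 * (x // 3) + x % 3  # each group of 3 rows becomes 2; leftovers pass through
        levels += 1
    return levels

def efficiency_ratio(area_pro, delay_pro, area_ref, delay_ref):
    """Ratio of 1 / (area * delay**2) for the proposed design to that of the reference."""
    return (area_ref * delay_ref ** 2) / (area_pro * delay_pro ** 2)

adder_costs = {n: (pa_area(n), pa_delay(n)) for n in (4, 8, 16)}
```

A ratio above 1 from `efficiency_ratio` means the proposed design has the smaller area × time$^2$ product under this assumed definition.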
The proposed multipliers employ Booth encoding to reduce the number of partial products, while the multipliers in [5] do not employ Booth encoding and have one extra modulo ($2^n - 1$) adder using half adders. The proposed multipliers have n/2 partial products, while the multipliers in [5] have n partial products. The ratios of efficiency of the proposed multipliers to the multipliers in [3] 
Conclusion
In this letter, we have presented a new architecture for modified Booth modulo ($2^n - 1$) multipliers. The proposed multipliers compare favorably, with respect to speed and efficiency, with the known modulo ($2^n - 1$) multipliers. They are on average approximately 14% faster than the fastest known modulo ($2^n - 1$) multipliers in [3]. The ratios of efficiency of the proposed multipliers to the multipliers in [3] 
