In this letter, we propose an improved architecture for modulo $(2^n + 3)$ multiplication under the condition n ≥ 6. This architecture yields the fastest of all known modulo $(2^n + 3)$ multipliers. The proposed modulo $(2^n + 3)$ multiplier improves on the state of the art by 3.2% on average in area and by 10.1% on average in delay.
Introduction
Residue number systems (RNS) are non-weighted, parallel number representation and arithmetic systems based on the Chinese remainder theorem (CRT), and they are a good alternative to conventional binary arithmetic. The set $\{2^n, 2^n - 1, 2^n + 1\}$ is the most commonly used moduli set in RNS applications, because its moduli have lower implementation difficulty and computational complexity than other moduli forms in terms of the area × time$^2$ product, and because they allow efficient conversion to and from the binary system [1]. The moduli set $\{2^{2n} - 1, 2^n, 2^{2n} + 1\}$ is studied in [2]. The 4-modulus base set $\{2^n - 1, 2^n + 3, 2^n + 1, 2^n - 3\}$ has been proposed in [3, 4]. Even though several moduli sets that use modulo $(2^n + 3)$ operations have been proposed, the design of the required modulo $(2^n + 3)$ multipliers is still a key challenge. Improved units for modulo $(2^n + 3)$ multipliers have been proposed in [5]. However, they are not very efficient, because they employ the conventional method to compute the residue modulo $(2^n + 3)$. The execution process of the reference modulo $(2^n + 3)$ multiplier [5] is shown in Section 3.
In this letter, we propose an architecture for modulo $(2^n + 3)$ multiplication under the condition n ≥ 6 that removes the drawbacks of the state-of-the-art modulo $(2^n + 3)$ multipliers. This architecture yields the fastest of all known modulo $(2^n + 3)$ multipliers. Static analysis shows that the average efficiency ratio of the proposed architecture to the reference architecture [5] is 1.33. Synthesis results show that the proposed modulo $(2^n + 3)$ multipliers achieve an average efficiency ratio of 1.29 against the reference multipliers [5].
The feature of modulo $(2^n + 3)$
For any integer X, there is
In other words, the residue set of modulus $(2^n + 3)$ is $[0, 2^n + 2]$. For modulo $(2^n + 3)$, there are
where Z[w : v] represents bits of Z originally located in positions from v (less significant) to w (more significant) and the symbol # is used to concatenate bits.
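The notation above can be made concrete with a short Python sketch. The helper names `bits`, `concat` and `modulus` are illustrative assumptions, not names used in this letter, and the two congruences checked are the standard facts $2^n \equiv -3$ and $2^{2n} \equiv 9 \pmod{2^n + 3}$, which are not necessarily the exact form of (2)-(4).

```python
def bits(z: int, w: int, v: int) -> int:
    """Return Z[w:v]: the bits of z from position v (less significant)
    up to position w (more significant), inclusive."""
    return (z >> v) & ((1 << (w - v + 1)) - 1)


def concat(hi: int, lo: int, lo_width: int) -> int:
    """Return hi # lo, where lo is treated as a lo_width-bit field."""
    return (hi << lo_width) | lo


def modulus(n: int) -> int:
    """Return the modulus 2^n + 3."""
    return (1 << n) + 3


# Check the two congruences for a few values of n (n >= 6 as in the text).
for n in range(6, 12):
    m = modulus(n)
    assert (1 << n) % m == m - 3      # 2^n is congruent to -3
    assert (1 << (2 * n)) % m == 9    # 2^(2n) is congruent to 9
```

For example, `bits(0b110101, 4, 2)` selects the three bits `101`, and `concat(0b10, 0b011, 3)` yields `0b10011`.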
The proposed modulo $(2^n + 3)$ multipliers
Let A[n : 0] × B[n : 0] = P[2n + 1 : 0]. The modulo $(2^n + 3)$ multiplication can then be described as:
Using (2)-(4), (5) can be rewritten as:
The first three terms in (6), P[2n − 2 : n] # P[2n − 1], P[2n − 1 : n] and P[n − 1 : 0], can first be computed with one n-bit carry-save adder (CSA) structure. The remaining three terms, which include 0···0 # P[2n − 1] # 0 (with n − 2 leading zero bits) and the constant 9, are reserved for merging with other terms produced later in the computation.
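As a rough behavioural illustration of the reduction this section relies on, the sketch below splits the product into the fields P[n − 1 : 0], P[2n − 1 : n] and P[2n + 1 : 2n] and applies $2^n \equiv -3$ and $2^{2n} \equiv 9 \pmod{2^n + 3}$. The function name `reduce_product` is hypothetical, and the code does not reproduce the term arrangement of (6) or the proposed hardware; it only checks that a field-wise reduction of this kind agrees with direct reduction.

```python
def reduce_product(p: int, n: int) -> int:
    """Reduce a (2n+2)-bit product p modulo 2^n + 3 field by field."""
    m = (1 << n) + 3
    mask = (1 << n) - 1
    low = p & mask           # P[n-1:0],    weight 1
    mid = (p >> n) & mask    # P[2n-1:n],   weight 2^n,    congruent to -3
    top = p >> (2 * n)       # P[2n+1:2n],  weight 2^(2n), congruent to 9
    return (low - 3 * mid + 9 * top) % m


# Exhaustive comparison against direct reduction for n = 6 (modulus 67).
n = 6
m = (1 << n) + 3
for a in range(m):
    for b in range(m):
        assert reduce_product(a * b, n) == (a * b) % m
```

The final `% m` stands in for the adder stages and the correction step of a hardware design; a real implementation would replace the subtraction with complemented terms and constants.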
where L[n − 1 : 0] and H[n − 1 : 0] are the sum output and the carry output of the n-bit CSA structure, respectively. Then (7) can be rewritten as:
The first two terms in (8) can be computed with one n-bit binary adder that takes L[n − 1 : 0] and H[n − 2 : 0] # H[n − 1] as its two addends; R[n : 0] denotes the sum of these two terms. The third term in (8) has a single data bit, in the second bit position, so it is reserved for merging with the fourth and fifth terms in (6). The fourth term, −3, in (8) is easily merged with the last term, 9, in (6).
Then (9) can be rewritten as:
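The carry-save step and the rotated carry word H[n − 2 : 0] # H[n − 1] can be sketched in Python as follows. This is a generic 3:2 compressor model, with hypothetical function names `csa` and `rotate_left_1`. The correction identity it checks, $2H = \mathrm{rot}(H) + (2^n - 1)\,H[n-1]$ and hence $2H \equiv \mathrm{rot}(H) - 4\,H[n-1] \pmod{2^n + 3}$, is derived here for illustration and is not claimed to be the exact form of (8).

```python
from typing import Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """Bitwise 3:2 carry-save compression.
    Returns (sum_bits, carry_bits) with a + b + c == sum_bits + 2 * carry_bits."""
    sum_bits = a ^ b ^ c
    carry_bits = (a & b) | (a & c) | (b & c)
    return sum_bits, carry_bits


def rotate_left_1(h: int, n: int) -> int:
    """Return H[n-2:0] # H[n-1] for an n-bit word h."""
    msb = (h >> (n - 1)) & 1
    return ((h << 1) & ((1 << n) - 1)) | msb


# Check the rotation identity for every 6-bit carry word.
n = 6
m = (1 << n) + 3
for h in range(1 << n):
    msb = (h >> (n - 1)) & 1
    assert 2 * h == rotate_left_1(h, n) + ((1 << n) - 1) * msb
    assert (2 * h) % m == (rotate_left_1(h, n) - 4 * msb) % m
```

Rotating the carry word keeps the adder n bits wide, at the cost of a one-bit correction of weight 4, which is consistent with a single data bit in the second bit position.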
The fourth term, the fifth term in (6) 
Here a simple correction is needed:
The fourth term −3 in (8), the third term −3 in (10) and the third term −3 in (14) together cancel the sixth term 9 in (6). Fig. 1 shows the proposed architecture for modulo $(2^n + 3)$ multiplication. As Fig. 1 shows, the critical path propagates through the (n + 1) × (n + 1) binary multiplier, the n-bit CSA, n-bit binary adder (1), the 5-bit binary adder, n-bit binary adder (2) and n-bit binary adder (3).
To explain the execution process of the proposed modulo $(2^n + 3)$ multiplier, we choose n = 6 as a case study, so that $2^n + 3 = 67$. Let us consider the case of A

Under the same inputs, the execution process of the reference modulo $(2^n + 3)$ multiplier [5] is as follows. As in the proposed modulo $(2^n + 3)$ multiplier, an (n + 1) × (n + 1) binary multiplier first produces P[13 : 0] = 4290 = $(1000011000010)_2$. A CSA tree then compresses the 5 terms merged from the 6 terms in (6), which gives L[7 : 0] = 64 = $(01000000)_2$ and H[7 : 0] = 36 = $(00100100)_2$. The sum of these 2 terms is 136 = $(010001000)_2$, and the final result is $|136|_{67} = 2$.

The proposed modulo $(2^n + 3)$ multipliers are composed of an (n + 1)-bit × (n + 1)-bit binary multiplier, an n-bit CSA, a 1-bit full adder, a 5-bit binary adder, three n-bit binary adders and (n + 3) inverters, while the reference multipliers [5] are composed of 3 CSAs, 3 (n + 1)-bit binary adders, 2 (n + 1)-bit (2 : 1) MUXs and n inverters.
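The numbers quoted in the n = 6 case study can be checked with a few lines of Python. The operand values are not recoverable from the text, so the pair 65 × 66 used below is only an assumption: it is one pair of residues whose product is 4290. The remark that 136 equals L + 2H is an arithmetic observation, not a statement taken from the text.

```python
N = 6
M = (1 << N) + 3                     # modulus 2^n + 3 = 67

P = 4290                             # product quoted in the case study
assert P == 0b1000011000010          # binary form quoted in the text
assert P == 65 * 66                  # assumed operand pair (not given in the text)
EXPECTED = P % M                     # direct reduction of the product

L, H = 0b01000000, 0b00100100        # CSA-tree outputs quoted for the reference design
assert (L, H) == (64, 36)

S = 0b010001000                      # quoted sum of the two CSA outputs
assert S == 136
assert S == L + 2 * H                # observation: the carry word carries weight 2
assert S % M == EXPECTED == 2        # final residue quoted in the text
```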
Since the proposed modulo $(2^n + 3)$ multiplier and the reference modulo $(2^n + 3)$ multiplier [5] both contain an (n + 1) × (n + 1) binary multiplier, the effect of the binary multiplier can be ignored in the static analysis. Based on the model in [6], we have

$$T_{\mathrm{Pro,nonMult}} = 2\log_2(n) + \log_2(n - 2) + \max\{6, \log_2(n - 6)\} + 24 \qquad (16)$$

The proposed modulo $(2^n + 3)$ multipliers were modeled in structural Verilog HDL for a general value of n, and their operation was exhaustively verified. They were implemented in TSMC 90 nm CMOS process technology, and Synopsys Design Compiler version Y-2006.06-SP4 was used to obtain the synthesis results, which are plotted in Fig. 2 and Fig. 3.

Fig. 2. The delay performance of the proposed multipliers and the reference multipliers [5].

Fig. 3. The synthesized area of the proposed multipliers and the reference multipliers [5].

According to the synthesis results, the proposed modulo $(2^n + 3)$ multipliers achieve an average delay saving of 10.1% and an average area saving of 3.2% against the reference multipliers [5]. The synthesis results thus show that the proposed modulo $(2^n + 3)$ multipliers perform very well.
The results in Fig. 2 and Fig. 3 also show that the proposed modulo $(2^n + 3)$ multipliers achieve an average efficiency ratio of 1.29 against the reference multipliers [5].
Conclusion
In this letter, we have proposed an improved architecture for modulo $(2^n + 3)$ multipliers under the condition n ≥ 6. Synthesis results show that the proposed modulo $(2^n + 3)$ multipliers achieve an average delay saving of 10.1% and an average area saving of 3.2% against the reference multipliers [5].
