The modulo $2^n+1$ multiplier is one of the critical components in digital signal processing, residue arithmetic, and data encryption, all of which demand high-speed and low-power operation. In this paper, a new circuit implementation of a high-speed low-power modulo $2^n+1$ multiplier is proposed. It has three major stages: the partial product generation stage, the partial product reduction stage, and the final adder stage. The proposed structure introduces a new MUX-based compressor in the partial product reduction stage to reduce power and increase speed. In the final adder stage, a Sparse-tree-based inverted end-around-carry adder reduces the number of critical-path circuit blocks and avoids the wire interconnection problem. The proposed multiplier is implemented in both 32nm CNTFET (Carbon-Nanotube FET) and bulk CMOS technology for performance comparison. The CNTFET-based design dramatically decreases the PDP (Power Delay Product) of the circuit. Simulation results demonstrate that the power consumption of the CNTFET-based multiplier is on average 5.72 times lower than that of its CMOS counterpart, while its PDP is 94 times lower.
INTRODUCTION
The security of data transmitted over channels is becoming more and more important, which makes data encryption a crucial technology. Among a number of algorithms, the International Data Encryption Algorithm (IDEA) is most frequently used for secure transmission of data [1], and modulo $2^n+1$ multiplication is the crucial operation in the circuit implementation of IDEA. Besides IDEA, residue arithmetic is another significant application of the modulo $2^n+1$ multiplier. It is extremely efficient for image processing, speech processing, and transforms, which are of great importance in today's high-density computing world [2]. As speed and power are becoming the most important design issues, a high-speed low-power modulo $2^n+1$ multiplier is an essential block for meeting the requirements of these applications.
A typical modulo $2^n+1$ multiplier has three major stages: the partial product generation stage, the partial product reduction stage, and the final adder stage. The last two stages determine the speed and power of the whole circuit. The conventional compressor in the partial product reduction stage is built from cascaded full adders and half adders. However, adder-based compressors consume a lot of power and have a large delay. In this paper, a new compressor based on a combination of MUX and XOR-XNOR gates is proposed to reduce PDP. For the final adder stage, the Kogge-Stone adder is often used in conventional designs because it is the fastest parallel-prefix carry look-ahead adder [3]. However, the performance of a parallel-prefix adder is limited by the large number of carry-merge cells and excessive inter-stage wiring tracks [2]. In this paper, a Sparse-tree-based inverted End Around Carry (EAC) adder is used to solve this problem [4]. Compared with the Kogge-Stone adder, the Sparse-tree architecture dramatically reduces the number of critical-path blocks in the last stage and eliminates many wire interconnections, which significantly saves silicon area in the layout. The delay of the last stage is also reduced because of the diminished fan-out on the critical path.
To evaluate how the design performs in an emerging technology, carbon-nanotube (CNT) technology is introduced in this paper. Compared with bulk CMOS technology, CNT has the advantages of lower leakage power, better frequency response, lower PVT variation, and much lower PDP, which make the CNTFET a very promising substitute for the traditional MOSFET [5]. However, CNT also has drawbacks, such as a short lifetime and difficulties in fabrication. In this paper, the performance of the multiplier designed in CNTFET technology is compared with that of its bulk CMOS counterpart. Additionally, the performance and power variations of the two designs under PVT variation are compared.
The rest of the paper is organized as follows. Section II presents the algorithm used to implement the multiplier. Section III describes the proposed circuit implementation of the modulo $2^n+1$ multiplier, including the novel Sparse-tree-based inverted EAC adder and the MUX-based compressor. The simulation results for the CNT design and the comparison with traditional bulk CMOS technology are given in Section IV, followed by conclusions in Section V.
II. ALGORITHM OF THE MODULO $2^n+1$ MULTIPLIER
Among the various existing A·B mod ($2^n+1$) algorithms, the one presented by Vergos and Efstathiou is considered to be the most effective [6]. The proposed circuit implementation based on this algorithm is a viable solution for various applications such as the IDEA cipher.
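As a behavioural reference for the operation discussed in this section, the following minimal Python sketch computes A·B mod ($2^n+1$) in two ways: directly, and by summing bit-level partial products $p_{i,j}$ = $a_i$ AND $b_j$ weighted by $2^{i+j}$. It is not the circuit algorithm of [6]; the function names are hypothetical, and the sketch assumes (n+1)-bit operands in the range 0 to $2^n$.

```python
def mod_mul_reference(a: int, b: int, n: int) -> int:
    """Behavioural reference: A*B mod (2^n + 1)."""
    return (a * b) % ((1 << n) + 1)


def mod_mul_from_partial_products(a: int, b: int, n: int) -> int:
    """Sum the partial products p[i][j] = a_i AND b_j, each weighted 2^(i+j).

    Assumption: operands are (n+1)-bit words a_n ... a_0 and b_n ... b_0.
    """
    modulus = (1 << n) + 1
    total = 0
    for i in range(n + 1):
        a_i = (a >> i) & 1
        for j in range(n + 1):
            b_j = (b >> j) & 1
            total += (a_i & b_j) << (i + j)
    return total % modulus
```

Both functions agree for every operand pair; the hardware algorithm differs only in how the double sum is folded back into n columns before the final addition.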
Assume A and B are two inputs represented as $A = a_n a_{n-1} a_{n-2} \cdots a_1 a_0$ and $B = b_n b_{n-1} b_{n-2} \cdots b_1 b_0$; then A·B mod ($2^n+1$) can be represented as follows [2, 6]:
where $p_{i,j}$ = $a_i$ AND $b_j$. The n×n partial product matrix shown in Fig. 1 (where $p_{i,j}$ = $a_i$ AND $b_j$; $q_i$ = $p_{n,i}$ OR $p_{i,n}$) is derived from the initial partial products based on several observations. The first observation concerns repositioning the partial product terms with weight greater than $2^{n-1}$ to generate the final n×n partial product matrix, based on the following equation [2, 6]:
Equation (2) shows that repositioning each bit of weight $2^i$ to bit position $|i|_n$ needs a correction factor of $2^n 2^{|i|_n}$ to make sure that the partial product matrix is equivalent to the initial partial products before repositioning. For the i-th partial product vector, the correction factor is derived as $2^n(2^i - 1)$. Hence, the correction factor for the entire partial product matrix is given by [2, 6]:
Fig.1 Final n×n Partial Product Matrix
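The repositioning step can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' derivation: it assumes that a bit s of weight $2^i$ with n ≤ i < 2n is moved to position i mod n in inverted form, with a correction of $2^n 2^{i \bmod n}$, and that the i-th partial product vector contains i such bits. Function names are hypothetical.

```python
def reposition_identity_holds(n: int, i: int, s: int) -> bool:
    """Check s*2^i == (NOT s)*2^k + 2^n * 2^k (mod 2^n + 1), with k = i mod n.

    Valid for n <= i < 2n, because 2^n is congruent to -1 modulo 2^n + 1.
    """
    modulus = (1 << n) + 1
    k = i % n
    lhs = (s << i) % modulus
    rhs = (((1 - s) << k) + ((1 << n) << k)) % modulus
    return lhs == rhs


def matrix_correction(n: int) -> int:
    """Sum the assumed per-vector corrections 2^n * (2^i - 1) for i = 0 .. n-1."""
    return sum((1 << n) * ((1 << i) - 1) for i in range(n))
```

Under these assumptions the total correction reduces to n + 2 modulo $2^n+1$; for example, `matrix_correction(4) % 17` evaluates to 6.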
The second observation concerns the rest of the circuit, which performs addition in a way similar to a carry save adder (CSA). Since this CSA works as a modulo $2^n+1$ adder, the carry-out bit of each level of the CSA has to be fed back as the carry-in of the next level [6]. Supposing that the carry-out bit of the n-th column at the i-th stage of the CSA is $c_i$ with a weight of $2^n$, this carry-out is derived [2, 6] as follows:
Thus, according to equation (4), another correction factor is introduced in an (n-1)-stage CSA due to the carry-out bits of the CSA: $COR_2 = |2^n (n-1)|_{2^n+1}$
The final correction factor can be calculated from the sum of $COR_1$ and $COR_2$ [2, 6]:
For an n-bit modulo $2^n+1$ multiplier, the constant "3" is the final correction factor. A "2" is added at the partial product reduction stage, while a "1" is added at the final adder stage because of the inverted carry feedback, which is discussed later in this paper.
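The following derivation sketch shows how the constant 3 can arise. It assumes a per-vector correction of $2^n(2^i-1)$ for $i = 0, \dots, n-1$ and a contribution of $2^n$ from each of the $n-1$ CSA stages; these closed forms are reconstructed from the surrounding text rather than quoted from [6]. The only fact used is $2^n \equiv -1 \pmod{2^n+1}$.

```latex
\begin{align}
COR_1 &= \sum_{i=0}^{n-1} 2^n\,(2^i - 1) = 2^n\,(2^n - n - 1)
       \equiv (-1)(-1 - n - 1) = n + 2 \pmod{2^n+1}, \\
COR_2 &= 2^n\,(n - 1) \equiv -(n - 1) = 1 - n \pmod{2^n+1}, \\
COR   &= COR_1 + COR_2 \equiv (n + 2) + (1 - n) = 3 \pmod{2^n+1}.
\end{align}
```

The dependence on n cancels, which is consistent with the statement that the final correction factor is the constant 3 for any width.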
III. IMPLEMENTATION OF THE PROPOSED MOD $2^n+1$ MULTIPLIER
The proposed implementation of the modulo $2^n+1$ multiplier consists of three stages: the partial product generation stage, the partial product reduction stage, and the final addition stage. The possible configurations for each stage are discussed in this section.
A. Partial Product Generation Stage
This is the simplest stage in the whole architecture. The traditional 2-input NAND gate and inverter must be optimized to meet the power and speed demands of this stage.
B. Partial Product Reduction Stage
In this stage, the n×n partial product matrix is compressed into the final sum vector and carry vector. The traditional full-adder-based design of this stage no longer meets today's high-speed and low-power requirements. Thus, a new MUX-based design is proposed in this paper. Speed, power, and area improve in the proposed design because the multiplexer blocks are pre-selected before the output signals of the previous stages arrive. Table 1 compares the performance and power of the two designs, taking 8-bit designs as an example.

To compress the matrix using the MUX-based design, two possible compressor architectures are compared, using an 8-bit modulo $2^n+1$ multiplier as an example (compressing one single column of the final partial product matrix together with the correction vector). The first compressor architecture is shown in Fig. 2 (a), where only three stages are introduced and only one compressor is used in every stage (best effectiveness). However, when parallelism is taken into consideration, the other architecture, shown in Fig. 2 (b), performs better. This architecture uses seven 3:2 compressors in four stages. In the first stage, three 3:2 compressors work in parallel, while the remaining stages use 2, 1, and 1 compressors, respectively. Although the second architecture appears to use more compressors than the first, it has less delay, fewer transistors, and lower power, as shown in Table 2, because it takes full advantage of parallelism. This architecture is also suitable for a 16-bit compressor.

Fig.2 (a)  Fig.2 (b)

Furthermore, the architecture in Fig. 2 (b) has layout advantages over the first one, because two types of compressors are used in Fig. 2 (a) while only a single type is used in Fig. 2 (b). However, interconnect wire routing issues arise in Fig. 2 (b) because of the parallel architecture, especially when the input width is large. The 3:2 compressor [7] is shown in Fig. 3. In this paper, the architecture of Fig. 2 (b) is chosen to achieve high speed, small silicon area, and low power. The MUX and XOR circuits are designed and optimized for low power, good driving capability, high speed, and small silicon area [8].
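A gate-level view of a MUX-based 3:2 compressor, and of the stage counts quoted for Fig. 2 (b), can be sketched as follows. This is a behavioural illustration, not the transistor-level circuit of Fig. 3: it uses the common formulation in which the XOR of two inputs selects the carry through a multiplexer. The column model assumes a column of nine bits (eight partial products plus one correction bit) that receives as many carries from its neighbouring column as it emits. All function names are hypothetical.

```python
def mux(sel: int, d0: int, d1: int) -> int:
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def compressor_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor built from two XORs and one MUX; returns (sum, carry)."""
    x = a ^ b                # first XOR, also drives the MUX select
    s = x ^ c                # sum bit
    carry = mux(x, a, c)     # a == b: carry is a; a != b: carry is c
    return s, carry


def compressors_per_stage(height: int = 9) -> list[int]:
    """Count the 3:2 compressors used per stage to reduce one column to 2 bits.

    Assumption: every carry sent to the next column is matched by one carry
    arriving from the previous column.
    """
    counts = []
    while height > 2:
        k = height // 3                  # compressors working in parallel
        counts.append(k)
        height = (height - 3 * k) + k + k  # leftovers + sums + incoming carries
    return counts
```

With a nine-bit column the model yields stage counts of 3, 2, 1, and 1, that is, seven compressors in four stages, matching the description of the parallel architecture.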
C. Final Adder Stage
The conventional Kogge-Stone adder is the fastest parallel-prefix carry look-ahead adder [3]. Because of its high speed, the Kogge-Stone adder is still the best choice in many cases. However, the performance of parallel-prefix adders is limited by the large number of carry-merge cells and excessive inter-stage wiring tracks [2]. To solve this problem, a new design based on the Sparse-tree architecture is introduced in this paper [4]. The modulo $2^n+1$ addition must also be performed in the final stage of the proposed multiplier. Thus, the sum vector and the carry vector generated in the previous stage are added using the modulo $2^n+1$ algorithm [9]. In the work of Zimmermann [9], it is shown that:
From (7) we observe that the inverted carry-out bit of the sum of the Sum vector and the Carry vector has to be fed back to achieve the modulo $2^n+1$ function. In the Sparse-tree-based inverted EAC adder, instead of calculating the carry term for each and every bit position, only every K-th carry (K = 4, 8, …) is computed. The value of K is chosen based on the sparseness of the tree, and it is generally chosen as 4 [4]. Taking a 16-bit Sparse-tree inverted EAC adder as an example, the carry-out equations based on the Kogge-Stone adder idea are shown as follows:

The 16-bit Sparse-tree-based inverted EAC adder is shown in Fig. 4. The top part is the Sparse-tree architecture and the bottom part is the detailed circuit of the 4-bit conditional sum generator (CSG), where $P_{sum,i} = a_i \oplus b_i$. The comparison between the original 16-bit Kogge-Stone architecture and the Sparse-tree structure (with CSG) is summarized in Table 3. The logic depth, the maximum fan-out, and the PDP of the new structure are the same as those of the Kogge-Stone architecture. However, the total number of critical-path blocks used in the new structure is much smaller. Therefore, the interconnect wire routing problem no longer exists. The advantages of the Sparse-tree structure over the Kogge-Stone adder become more striking as the input width grows.

Conventional CMOS technology has large leakage power and high sensitivity to process variation when the channel length shrinks below 25nm. To further improve device performance, CNT technology is a good future substitute for CMOS because of its material advantages. Specifically, CNT technology offers lower leakage power, better frequency response, lower PVT variation, and extremely low PDP [5].
A. Delay
In Table 4 , detailed performance and power consumption of an 8-bit modulo 2 n +1 multiplier based on bulk CMOS technology and CNT technology are compared. All the data is measured at 0.8V power supply voltage and 1GHz input edge rate using SPICE simulator. In all the simulations, we used the PTM model for CMOS technology and the Stanford model for CNT technology. As a result, the PDP of CNT-based design is 94 times less than the PDP of the bulk CMOS-based design.
B. PVT Variation
Fig. 6 compares CMOS and CNT technology under temperature variation (0℃ to 100℃), voltage variation (0.72V to 0.88V, that is, ±10% of the supply voltage), and process variation. For process variation, the CMOS data are measured using models at different process corners, while the CNT data are measured by varying the transistor width by ±3%. As the figures clearly show, the CNT-based design is much more robust than its CMOS counterpart.
V. CONCLUSION
In this paper, a new design of the modulo $2^n+1$ multiplier is proposed. The new MUX-based compressor increases speed and reduces power compared with the conventional full-adder-based compressor. The parallel architecture of compressors further speeds up the partial product reduction stage and yields a regular layout. In the final addition stage, the Sparse-tree architecture keeps the PDP advantage of the Kogge-Stone adder and solves its interconnect wiring problems. Finally, a comparison between the CNT-based and CMOS-based designs is presented. CNT has advantages in PVT variation and PDP. CNT therefore appears to be a better choice than CMOS for meeting aggressive high-speed low-power requirements with less PVT variation, and this paper can serve as a reference for future CNTFET-based designs in other applications.
