The parallel Fermat number transform (FNT) architecture is usually implemented with the code conversion (CC) and the butterfly operation (BO) in the diminished-1 number system. However, both the CC and the BO incur considerable area and delay because of modulo $2^n + 1$ carry-propagation addition. In this paper, we propose a novel parallel FNT architecture with the root of unity 2, composed mainly of carry-save code conversion (CSCC) and carry-save butterfly operation (CSBO). The CSCC and the CSBO remove the carry-propagation addition by exploiting the properties of carry-save adders, so the proposed FNT architecture requires less area and delay than the previous one. Synthesis results using a 0.13-µm CMOS standard-cell library demonstrate the advantage of the resulting architecture over the previously reported solution.
Introduction
The Fermat number transform (FNT) is more attractive than the conventional fast Fourier transform (FFT) for cyclic convolution, since it produces exact results and replaces expensive multiplications with bit shifts [1]. The transform has been used in several applications such as video processing, digital filtering and multiplication of large integers [1, 2, 3]. The important operations in the FNT are the butterfly operation (BO) and the code conversion (CC), both of which consist mainly of modulo $2^n + 1$ addition. The fastest modulo $2^n + 1$ adder involving carry-propagation addition in the diminished-1 number system was proposed by Vergos and is used in recent architectures [4, 5, 6]. This carry-propagation modulo $2^n + 1$ addition is the reason the existing FNT architectures require large area and delay.
In this paper, a novel parallel FNT architecture with the root of unity 2 or an integer power of 2 is proposed, composed mainly of carry-save code conversion (CSCC) and carry-save butterfly operation (CSBO). The CSCC and the CSBO avoid the carry-propagation addition by exploiting the properties of carry-save adders, so the proposed FNT architecture requires less area and delay than the previous one. Experimental results show that the new architecture is more efficient than the existing one in both area and delay.
Elementary operations of FNT
The important operations of the novel FNT architecture with the unity root 2 are the CSCC and the CSBO. Their hardware implementations are shown in Fig. 1 (a) and (b). The CSBO involves the modulo $2^n + 1$ 4-2 compressor and multiplication by a power of 2, which are illustrated in Fig. 1 (c) and (d), respectively. In the figure, "$X^*$" denotes the diminished-1 representation of $X$, i.e. $X^* = X - 1$.
Proposed CSCC for efficient CC
The CC converts the normal binary code (NBC) into the diminished-1 representation and is the first stage of the FNT. The delay and area of the CC for an n-bit NBC are close to those of an n-bit carry-propagation adder. To reduce this cost, we propose the CSCC method, which modifies the modulo $2^n + 1$ carry-save adder (MCSA). If the operation following the CC is a modulo $2^n + 1$ addition, the conversion is performed as shown in Fig. 1 (a). In the figure, "FA" denotes a full adder, and the array of full adders forms the carry-save adder (CSA); $A$ and $B$ are two operands no wider than n bits; $C$ is a constant equal to $2^n - 2$; $A$, $B$ and $C$ are the three inputs of the CSA; and $D^*$ and $E^*$ are the sum vector and the carry vector of the CSA, respectively. The most significant bit of $E^*$ is complemented and connected back to its least significant bit. In this way, $A$ and $B$ are converted into their equivalent diminished-1 representations $D^*$ and $E^*$ by replacing one input of the MCSA with the constant $2^n - 2$. Because this input is constant, every full adder in the conversion can be simplified according to whether its constant bit is "0" or "1".
Let $|A^* + B^*|_{2^n+1}$, $|\bar{A}^*|_{2^n+1}$, $|A^* - B^*|_{2^n+1}$ and $|A^* \times 2^i|_{2^n+1}$ denote modulo $2^n + 1$ addition, negation, subtraction and multiplication by a power of 2, respectively, as originally proposed by Leibowitz [8]. The CSCC for a subsequent modulo $2^n + 1$ addition can be described as follows:
If the subsequent operation is a modulo $2^n + 1$ subtraction, the result of the CSCC is composed of the minuend and the complemented subtrahend. The conversion is described as follows:
After the CSCC, we obtain a result consisting of two diminished-1 values. This result already contains the modulo $2^n + 1$ addition or subtraction performed in the first stage of the previous BO.
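The conversion for a subsequent addition can be checked with a short bit-level model. The following Python sketch is an illustration under stated assumptions rather than the circuit of Fig. 1 (a): the function names `mcsa`, `cscc_add` and `pair_value` are hypothetical, the operands are plain n-bit integers, and the constant input is $2^n - 2$ as in the text. It checks that the two output vectors, read as diminished-1 numbers, together represent $A + B$ modulo $2^n + 1$.

```python
def mcsa(x, y, z, n):
    """Behavioural model of a modulo 2^n + 1 carry-save adder (assumed form).

    x, y, z are n-bit vectors. Returns (sum_vec, carry_vec), both n bits.
    The carry leaving the most significant full adder is complemented and
    fed back into the least significant carry position.
    """
    mask = (1 << n) - 1
    sum_vec = x ^ y ^ z                       # sum outputs of the n full adders
    carries = (x & y) | (x & z) | (y & z)     # carry outputs of the n full adders
    msb_carry = (carries >> (n - 1)) & 1      # carry of weight 2^n
    carry_vec = ((carries << 1) & mask) | (msb_carry ^ 1)
    return sum_vec, carry_vec


def cscc_add(a, b, n):
    """Convert binary operands a, b into a carry-save diminished-1 pair."""
    constant = (1 << n) - 2                   # third CSA input, 2^n - 2
    return mcsa(a, b, constant, n)


def pair_value(d_star, e_star, n):
    """Residue represented by two diminished-1 words (X = X* + 1)."""
    return (d_star + 1 + e_star + 1) % ((1 << n) + 1)


if __name__ == "__main__":
    n = 4
    d_star, e_star = cscc_add(3, 5, n)
    print(d_star, e_star, pair_value(d_star, e_star, n))  # 8 15 8
```

The identity holds because the carry of weight $2^n$ is congruent to $-1$ modulo $2^n + 1$, so complementing it and feeding it back adds exactly one to the sum of the three inputs; together with the constant $2^n - 2$ this accounts for the two diminished-1 offsets.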
Modulo $2^n + 1$ 4-2 compressor
After the CSCC, we obtain the results of modulo $2^n + 1$ addition or subtraction in the diminished-1 representation. Each result consists of two diminished-1 values, so the subsequent modulo $2^n + 1$ arithmetic operation involves four operands. A modulo $2^n + 1$ 4-2 compressor in the diminished-1 number system is designed to reduce the four numbers $H_0^*$, $H_1^*$, $I_0^*$, $I_1^*$ to two numbers $J_0^*$ and $J_1^*$ using two MCSAs, as shown in Fig. 1 (c). $J_0^*$ and $J_1^*$ are given by
where
The compressor can be regarded as a novel modulo $2^n + 1$ adder, since the two perform the same operation if each pair of compressor inputs is counted as one operand. For the modulo $2^n + 1$ 4-2 compressor, the delay is nearly constant and the area grows linearly with $n$. For the modulo $2^n + 1$ adder, which involves an n-bit carry-propagation addition and a zero indicator [1], the delay and the area are approximately proportional to $\log n$ and $n \log n$, respectively [4]. When $n$ is larger than 8, the compressor requires less delay and area than the adder.
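The compressor can be modelled as two cascaded modulo $2^n + 1$ carry-save adders. The sketch below is a behavioural illustration, not the netlist of Fig. 1 (c): the names `mcsa`, `compressor_4_2` and `pair_value` are hypothetical, and the inverted end-around carry is the assumed form of the MCSA. It checks that the residue represented by the two outputs equals the sum of the residues represented by the two input pairs.

```python
def mcsa(x, y, z, n):
    """Modulo 2^n + 1 carry-save adder with inverted end-around carry (assumed)."""
    mask = (1 << n) - 1
    sum_vec = x ^ y ^ z
    carries = (x & y) | (x & z) | (y & z)
    msb_carry = (carries >> (n - 1)) & 1
    carry_vec = ((carries << 1) & mask) | (msb_carry ^ 1)
    return sum_vec, carry_vec


def compressor_4_2(h0, h1, i0, i1, n):
    """Reduce four diminished-1 words to two using two cascaded MCSAs."""
    s, c = mcsa(h0, h1, i0, n)      # first MCSA absorbs three of the operands
    return mcsa(s, c, i1, n)        # second MCSA absorbs the fourth


def pair_value(x0, x1, n):
    """Residue represented by a pair of diminished-1 words."""
    return (x0 + 1 + x1 + 1) % ((1 << n) + 1)


if __name__ == "__main__":
    n = 4
    modulus = (1 << n) + 1
    j0, j1 = compressor_4_2(2, 7, 11, 5, n)
    expected = (pair_value(2, 7, n) + pair_value(11, 5, n)) % modulus
    print(pair_value(j0, j1, n), expected)  # 12 12
```

No carry ripples across the word in either MCSA, which is why the delay of this model's hardware counterpart does not grow with $n$.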
Multiplication by the power of 2
Multiplication by an integer power of 2 in the diminished-1 number system is another important operation, which can be accomplished with the method proposed by Leibowitz [8]. The operation is trivial: the low-order $n - i$ bits of the number are shifted left by $i$ bit positions, and the high-order $i$ bits are inverted and circulated into the $i$ least significant bit positions. Let $F$ and $G$ represent the multiplicand and the product, respectively. Multiplication by $2^i$ is implemented by a commutator and some inverters, as shown in Fig. 1 (d). The product is given by
where $F = F^* + 1$.
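The shift-and-invert rule described above can be written directly on integers. The following sketch is illustrative only; the function name `dim1_mul_pow2` is hypothetical and the shift amount is assumed to satisfy $0 \le i < n$.

```python
def dim1_mul_pow2(f_star, i, n):
    """Multiply a diminished-1 word by 2^i modulo 2^n + 1 (0 <= i < n).

    The low-order n - i bits move left by i positions; the high-order i bits
    are inverted and placed in the i least significant positions.
    """
    if not 0 <= i < n:
        raise ValueError("this sketch assumes 0 <= i < n")
    low = f_star & ((1 << (n - i)) - 1)      # low-order n - i bits
    high = f_star >> (n - i)                 # high-order i bits
    inverted_high = high ^ ((1 << i) - 1)    # bitwise complement on i bits
    return (low << i) | inverted_high


if __name__ == "__main__":
    # F = 3 is stored as F* = 2; multiplying by 2^2 modulo 17 gives G = 12,
    # which is stored as G* = 11.
    print(dim1_mul_pow2(2, 2, 4))  # 11
```

The rule works because the bits shifted past position $n - 1$ carry weight $2^n \equiv -1$, and subtracting them from the offset $2^i - 1$ introduced by the diminished-1 representation is the same as complementing them.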
Proposed CSBO for high-speed BO
With the modulo $2^n + 1$ 4-2 compressor and multiplication by a power of 2 defined, the CSBO is designed as shown in Fig. 1 (b). The proposed operation involves two modulo $2^n + 1$ 4-2 compressors, a multiplier and some inverters. It is performed without a carry-propagation chain, which markedly reduces delay and area. $K^*$, $L^*$, $M^*$ and $N^*$ correspond to the two inputs and two outputs of the previous BO in the diminished-1 number system, respectively, and are given by
In contrast to the CSBO, the existing BO is composed of two modulo $2^n + 1$ adders, a multiplier and some inverters [1]. Thus the CSBO requires less area and delay than the BO.
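One way to read the component list of the CSBO (two 4-2 compressors, a multiplier and some inverters) is sketched below. This is an assumption-laden behavioural model, not a transcription of Fig. 1 (b): it assumes that each butterfly operand is a pair of diminished-1 words, that the shift by $2^i$ is applied to both words of the second operand, and that the subtraction path is obtained by bitwise complementing the shifted words. All function names are hypothetical.

```python
def mcsa(x, y, z, n):
    """Modulo 2^n + 1 carry-save adder with inverted end-around carry (assumed)."""
    mask = (1 << n) - 1
    carries = (x & y) | (x & z) | (y & z)
    msb_carry = (carries >> (n - 1)) & 1
    return x ^ y ^ z, ((carries << 1) & mask) | (msb_carry ^ 1)


def compressor_4_2(h0, h1, i0, i1, n):
    """Two cascaded MCSAs reduce four diminished-1 words to two."""
    s, c = mcsa(h0, h1, i0, n)
    return mcsa(s, c, i1, n)


def dim1_mul_pow2(f_star, i, n):
    """Diminished-1 multiplication by 2^i, assuming 0 <= i < n."""
    if not 0 <= i < n:
        raise ValueError("this sketch assumes 0 <= i < n")
    low = f_star & ((1 << (n - i)) - 1)
    high = f_star >> (n - i)
    return (low << i) | (high ^ ((1 << i) - 1))


def csbo(k_pair, l_pair, i, n):
    """Carry-save butterfly: returns pairs representing K + L*2^i and K - L*2^i."""
    mask = (1 << n) - 1
    t0 = dim1_mul_pow2(l_pair[0], i, n)
    t1 = dim1_mul_pow2(l_pair[1], i, n)
    m_pair = compressor_4_2(k_pair[0], k_pair[1], t0, t1, n)
    # Bitwise complement of a diminished-1 word negates the value it represents.
    n_pair = compressor_4_2(k_pair[0], k_pair[1], t0 ^ mask, t1 ^ mask, n)
    return m_pair, n_pair


def pair_value(pair, n):
    """Residue represented by a pair of diminished-1 words."""
    return (pair[0] + 1 + pair[1] + 1) % ((1 << n) + 1)


if __name__ == "__main__":
    n, i = 4, 1
    k_pair, l_pair = (2, 7), (11, 5)
    m_pair, n_pair = csbo(k_pair, l_pair, i, n)
    k, l = pair_value(k_pair, n), pair_value(l_pair, n)
    print(pair_value(m_pair, n), (k + (l << i)) % 17)
    print(pair_value(n_pair, n), (k - (l << i)) % 17)
```

In this reading every step is either a row of full adders or a rewiring with inverters, so no word-length carry chain appears until the pair is finally resolved into a single diminished-1 value.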
Novel parallel architecture for FNT
Since the FNT is algorithmically similar to the FFT, an FFT-type structure can be applied to perform a fast FNT. This section employs the radix-2 decimation-in-time (DIT) algorithm, by far the most widely used FFT algorithm [9]. With the input data sequence stored in bit-reversed order and the CSBO performed in place, the resulting FNT sequence is obtained in natural order. An illustrative example of the novel architecture is shown in Fig. 2 for a transform length of 8 and a modulus of $2^4 + 1$. In Fig. 2, "MA" and "MS" denote modulo $2^n + 1$ addition and modulo $2^n + 1$ subtraction, respectively.
The novel parallel $N$-point FNT architecture with the unity root 2 is composed of one stage of CSCC, $\log_2 N - 1$ stages of CSBO and one stage of modulo $2^n + 1$ carry-propagation addition. The final stage evaluates the final results, each of which is a single diminished-1 value. The existing parallel $N$-point FNT architecture with the unity root 2 consists of one stage of CC and $\log_2 N$ stages of BO. Apart from their final stages, both architectures contain the same number of code conversion stages and butterfly stages. The proposed CSCC and CSBO avoid carry-propagation addition and do not require a zero indicator, so they are more area-delay efficient than the CC and the BO, respectively. The costs of the final stages of both architectures are approximately equal, since every BO consists chiefly of two parallel modulo $2^n + 1$ adders. Thus the proposed parallel FNT architecture is more efficient than the existing one in both area and delay. Furthermore, the novel architecture is well suited to implementing the overlap-save and overlap-add techniques, which reduce a long linear convolution to a series of short cyclic convolutions.
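The data flow of the radix-2 DIT structure can be illustrated with a plain integer reference model. The sketch below is not the carry-save architecture itself: it uses ordinary modular arithmetic, and the function names and the parameter `b` (the exponent in the modulus $2^b + 1$) are hypothetical. It assumes the transform length $N$ is a power of two dividing $2b$, so that $2^{2b/N}$ is a root of unity of order $N$; for $N = 8$ and modulus $2^4 + 1$ the root is 2. Every twiddle multiplication is a left shift, and the first stage uses only a shift of zero, which is why a pure add/subtract stage can take its place.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fnt_dit(x, b):
    """Radix-2 DIT FNT modulo 2^b + 1 with root 2^(2b/N), N = len(x)."""
    n_points = len(x)
    if n_points < 2 or n_points & (n_points - 1) or (2 * b) % n_points:
        raise ValueError("N must be a power of two that divides 2b")
    modulus = (1 << b) + 1
    bits = n_points.bit_length() - 1
    a = [0] * n_points
    for idx, value in enumerate(x):          # store input in bit-reversed order
        a[bit_reverse(idx, bits)] = value % modulus
    size = 2
    while size <= n_points:                  # log2(N) in-place butterfly stages
        half = size // 2
        for start in range(0, n_points, size):
            for j in range(half):
                shift = j * (2 * b) // size  # twiddle factor is 2^shift
                u = a[start + j]
                t = (a[start + j + half] << shift) % modulus
                a[start + j] = (u + t) % modulus
                a[start + j + half] = (u - t) % modulus
        size *= 2
    return a                                 # output in natural order


def fnt_direct(x, b):
    """Direct O(N^2) evaluation of the same transform, for comparison."""
    n_points = len(x)
    modulus = (1 << b) + 1
    root = pow(2, (2 * b) // n_points, modulus)
    return [sum(v * pow(root, j * k, modulus) for j, v in enumerate(x)) % modulus
            for k in range(n_points)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(fnt_dit(data, 4) == fnt_direct(data, 4))  # True
```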
Experiment results
In this section, we compare the proposed parallel FNT architecture against the existing FNT architecture composed of the CC and the BO, both of which involve modulo $2^n + 1$ addition. The fastest modulo $2^n + 1$ adder in the diminished-1 number system was proposed by Vergos and adopted by O'Donnell [6]. This adder, which contains a standard n-bit carry-propagation addition and a zero indicator, produces the longest execution delay and requires large area. The proposed CSCC and CSBO avoid the carry-propagation adder and do not require a zero indicator. Thus our architecture is faster and more efficient than the existing one.
To obtain more accurate results, we describe the proposed and the existing FNT architectures, built on the fastest modulo $2^n + 1$ adder, in Verilog for the moduli $2^8 + 1$, $2^{16} + 1$, $2^{32} + 1$ and $2^{64} + 1$. The validated Verilog code is synthesized with the Synopsys Design Compiler using a 0.13-µm CMOS standard-cell library under the worst-case operating condition. Area and delay are reported in µm$^2$ and ns, respectively. Each design was iteratively optimized for speed until the synthesis tool could not produce a faster design. The results for the fastest derived implementations are listed in Table I. Table I indicates that for $n \geq 8$ the proposed architecture yields both a faster and a smaller implementation than the previous one. Moreover, the advantage of our algorithm grows with the modulus width. When the modulus is $2^{64} + 1$ and the transform length is 128, the novel FNT architecture achieves a 46% reduction in area and a 49% reduction in delay compared with the previous one.
Conclusion
An area-delay efficient FNT architecture comprising the CSCC and the CSBO is designed for the case in which the root of unity is 2 or an integer power of 2. The carry-propagation addition, which is compulsory in the previous architecture, is avoided in the proposed one except in its final stage. Area and delay are therefore markedly reduced when the modulus is no less than $2^8 + 1$, as the synthesis results confirm. The CSCC and the CSBO can also be used to perform the inverse FNT (IFNT) and greatly improve its efficiency.
