In this paper, we propose several high-speed, area-efficient recursive discrete Fourier transform (DFT)/inverse DFT (IDFT) designs that adopt module-sharing and register-splitting schemes. Compared with the second-order recursive DFT structure, the proposed core architecture uses one fewer multiplier and has a shorter critical period; compared with the first-order structure, it saves nearly half of the multiplications. To reduce the number of computation cycles, we develop area-efficient parallel and folded recursive DFT/IDFT architectures based on the new core design. Moreover, owing to their regular and modular structure, the resulting high-speed, area-efficient recursive DFT/IDFT architectures are amenable to application-specific integrated circuit (ASIC) design.
Introduction
The discrete Fourier transform (DFT) has been widely applied in the analysis and implementation of discrete-time signal processing [1] and in communication systems such as dual-tone multi-frequency (DTMF) applications [2, 3]. In many applications, complex time-domain sequences are converted to frequency-domain signals by the DFT computation. Without loss of generality, the input data is assumed to be complex-valued. Existing research suggests four possible categories of structures for DFT computation: 1) recursive-algorithm-based architectures [1-6], 2) butterfly-based architectures [1, 7], 3) ROM-operation-based structures [8], and 4) multiplier-accumulator-based structures. It is well known that DFT architectures based on recursive algorithms are more area-efficient than those realized by other approaches. The existing recursive algorithms for orthogonal transforms in the scope of the DFT/DCT/DST (discrete Fourier/cosine/sine transform) include the following: the Goertzel algorithm [1-6, 9], C-S's algorithm [10], Chebyshev polynomials [11], and Clenshaw's recurrence formula (CRF) [12, 13]. In [10-12], recursive expressions for the computation of the DCT-II that are suitable for VLSI implementation are presented. The recursive DCT-II architecture in [11] is based on Chebyshev polynomials of the third kind, while those in [12] are based on the CRF. Recently, Kidambi [13] presented recursive DCT-IV and DST-IV architectures; this approach could also be used to develop a recursive DFT architecture. Note that in [10-13], recursive algorithms are used to design recursive DCT/DST architectures rather than recursive DFT architectures. In [1, 6], the original second-order recursive DFT architecture derived from the Goertzel algorithm has one redundant multiplier, which can be eliminated by reusing a single multiplier.
This area-reduction strategy is referred to as the module-sharing scheme. The modified recursive DFT architecture therefore occupies less area than the preceding one [1, 6]. Next, we apply the register-splitting scheme [14] to speed up the area-efficient architecture without affecting the system transfer function. The proposed architecture thus has the following features: high speed, one fewer multiplier than the second-order recursive DFT structure, and a saving of nearly half of the multiplications for each DFT output compared with the first-order recursive DFT structure. Using the new area-efficient recursive DFT/IDFT architecture as a core, we can develop parallel- and folded-type architectures that require fewer computation cycles for real-time media applications. The paper is organized as follows. A review of the first- and second-order recursive DFT structures is given in Section 2. In Section 3, we propose three new recursive DFT/IDFT architectures based on the module-sharing and register-splitting schemes: core-, parallel-, and folded-type architectures. In Section 4, comparison results are tabulated in terms of the critical period, the number of real multipliers, the number of real multiplications and real additions for each DFT/IDFT output sequence, and the number of computation cycles for an N-point DFT/IDFT. Finally, concluding remarks are given.
A Review of First- and Second-Order Recursive DFT Structures
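Given the input sequence and the DFT output sequence denoted as $x[n]$ and $X[k]$, respectively, the $N$-point DFT can be defined as
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$
where $W_N = e^{-j2\pi/N}$. The Goertzel algorithm [4] exploits the periodicity of the sequence $W_N^{kn}$, namely
$$W_N^{-kN} = 1. \tag{2}$$
Multiplying the right-hand side of Eq. (1) by $W_N^{-kN}$ gives
$$X[k] = \sum_{r=0}^{N-1} x[r]\, W_N^{-k(N-r)}. \tag{3}$$
In order to simplify the final expression, let us define the sequence
$$y_k[n] = \sum_{r=-\infty}^{\infty} x[r]\, W_N^{-k(n-r)}\, u[n-r], \tag{4}$$
where $u[n]$ is the unit step sequence and $x[n] = 0$ for $n < 0$ and $n \ge N$. From Eqs. (3) and (4), it follows that
$$X[k] = y_k[n]\big|_{n=N}, \tag{5}$$
and we can obtain the first-order transfer function as
$$H_k(z) = \frac{1}{1 - W_N^{-k} z^{-1}}. \tag{6}$$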
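The first-order recursion can be checked numerically. The following is a minimal behavioral sketch in Python, assuming the standard Goertzel difference equation $y_k[n] = W_N^{-k}\, y_k[n-1] + x[n]$ with initial rest conditions and a read-out at $n = N$; the function names `dft_direct` and `dft_first_order` are illustrative and do not come from the paper, and the code models only the arithmetic, not the hardware structure.

```python
import cmath


def dft_direct(x, k):
    """Evaluate one DFT bin directly from the defining sum."""
    N = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))


def dft_first_order(x, k):
    """One DFT bin via the first-order recursion y[n] = W_N^{-k} y[n-1] + x[n]."""
    N = len(x)
    w = cmath.exp(2j * cmath.pi * k / N)  # W_N^{-k}
    y = 0j  # initial rest condition: y[-1] = 0
    for n in range(N):
        y = w * y + x[n]  # one complex multiplication per input sample
    # One further iteration with x[N] = 0 yields y[N], which equals X[k].
    return w * y


if __name__ == "__main__":
    x = [1 + 2j, -0.5j, 3, 2 - 1j]
    for k in range(len(x)):
        print(k, dft_first_order(x, k), dft_direct(x, k))
```

Each loop iteration performs one complex-by-complex multiplication, which is the per-sample cost that the second-order form is meant to reduce.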
Eq. (6) can be mapped into the first-order recursive DFT structure shown in Fig. 1(a), where initial rest conditions are assumed and the vertical dashed line denotes down-sampling by $N$ on each signal path that crosses it. Note that the dashed line in Fig. 1(a) can be implemented with either a multiplexer-type or a register-type down-sampling realization.
Here, we adopt the multiplexer-type down-sampling realization shown in Fig. 1(b) because it requires less area and maps the equation exactly onto the architecture. In Fig. 1 [...] It is possible to retain this simplification while reducing the number of multiplications by a factor of 2. To see how, consider the transfer function of the first-order recursive DFT structure in Fig. 1(a). Multiplying both the numerator and the denominator of $H_k(z)$ by the factor $(1 - W_N^{k} z^{-1})$, we obtain
$$H_k(z) = \frac{1 - W_N^{k} z^{-1}}{1 - 2\cos(2\pi k/N)\, z^{-1} + z^{-2}}. \tag{7}$$
Eq. (7) can be mapped into the second-order recursive DFT structure shown in Fig. 2.
Fig. 2. Block diagram of the second-order recursive DFT structure.
In Fig. 2, only two real multiplications per sample are required to implement the poles of the system. Note that, in the denominator of Eq. (7), the coefficients are real and the factor -1 need not be counted as a multiplication. It is worth emphasizing that the complex twiddle-factor multiplications are again computed implicitly in the iteration of the recursion formula implied by Fig. 2. The second-order recursive DFT structure derived from the Goertzel algorithm thus decreases the number of multiplications, but at the cost of more multipliers and a longer critical period. Hence, neither of the structures in Figs. 1(a) and 2 is efficient.
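The second-order form can be sketched in the same way. The Python fragment below is a behavioral model only, assuming the standard second-order Goertzel recursion $v[n] = x[n] + 2\cos(2\pi k/N)\, v[n-1] - v[n-2]$ followed by a single output step $X[k] = v[N] - W_N^{k}\, v[N-1]$; the name `dft_second_order` is hypothetical.

```python
import cmath
import math


def dft_second_order(x, k):
    """One DFT bin via the second-order recursion with a real feedback coefficient."""
    N = len(x)
    c2 = 2.0 * math.cos(2.0 * math.pi * k / N)  # real coefficient 2*cos(2*pi*k/N)
    v1 = 0j  # v[n-1], initial rest
    v2 = 0j  # v[n-2], initial rest
    for n in range(N):
        # Real coefficient times complex state: two real multiplications per sample.
        v0 = x[n] + c2 * v1 - v2
        v2, v1 = v1, v0
    # Final step at n = N with x[N] = 0.
    vN = c2 * v1 - v2
    wk = cmath.exp(-2j * math.pi * k / N)  # W_N^{k}, applied once per output
    return vN - wk * v1


if __name__ == "__main__":
    x = [1 + 2j, -0.5j, 3, 2 - 1j]
    print([dft_second_order(x, k) for k in range(len(x))])
```

The complex multiplication by $W_N^{k}$ happens once per output rather than once per sample, which is where the saving of nearly half of the multiplications comes from.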
New Recursive DFT/IDFT Architectures
With this in mind, we aim to design an efficient architecture with a shorter critical period (i.e., higher speed), fewer multipliers (i.e., lower area), and fewer multiplications. Substituting
$$W_N^{k} = \cos(2\pi k/N) - j\sin(2\pi k/N)$$
into Eq. (7), Eq. (7) can be recast as
$$H_k(z) = \frac{1 - \cos(2\pi k/N)\, z^{-1} + j\sin(2\pi k/N)\, z^{-1}}{1 - 2\cos(2\pi k/N)\, z^{-1} + z^{-2}}. \tag{8}$$
From Eq. (8), we find that the same multiplicand $\cos(2\pi k/N)$ appears in the first-order terms of both the numerator and the denominator. We therefore let the first-order feedforward and feedback signal paths go through the same multiplier, and the feedback signal path is then passed through a shifter to obtain the doubled result. Based on this observation, the second-order structure can easily be modified into the new area-efficient architecture shown in Fig. 3, where HS 1 is a hardwired shifter that performs a one-bit left shift. This area-reduction method is referred to as the module-sharing scheme.
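The idea of sharing one cosine multiplier between the feedback and feedforward paths can be illustrated with a small behavioral sketch. It assumes floating-point arithmetic, models the one-bit left shift of HS 1 as a multiplication by 2, and uses the hypothetical name `dft_module_sharing`; it is not a cycle-accurate model of Fig. 3.

```python
import math


def dft_module_sharing(x, k):
    """One DFT bin where a single product cos(theta) * v[n-1] serves both paths."""
    N = len(x)
    theta = 2.0 * math.pi * k / N
    c = math.cos(theta)
    s = math.sin(theta)
    v1 = 0j  # v[n-1], initial rest
    v2 = 0j  # v[n-2], initial rest
    for n in range(N):
        p = c * v1  # the only cosine multiplication in this iteration
        v0 = x[n] + 2.0 * p - v2  # feedback path: doubled product (shift in hardware)
        v2, v1 = v1, v0
    # Output step at n = N with x[N] = 0: the same product p feeds both paths.
    p = c * v1
    vN = 2.0 * p - v2  # feedback path uses 2p
    # Feedforward path uses p once, plus the sine term for the imaginary part.
    return (vN - p) + 1j * s * v1


if __name__ == "__main__":
    x = [1 + 2j, -0.5j, 3, 2 - 1j]
    print([dft_module_sharing(x, k) for k in range(len(x))])
```

Because the feedback term is obtained from `p` by doubling, no separate multiplier for $2\cos(2\pi k/N)$ is needed in this model.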
To improve speed, we adopt the register-splitting scheme (a kind of retiming scheme) to reduce the critical period; this scheme has been used successfully in 2-D IIR/FIR digital filters [14]. We define two notations, 0 and 1, which indicate that a delay element in Fig. 3 lies on a top-to-bottom or a bottom-to-top signal path, respectively. A sequence of binary digits can then be used to represent the different register-splitting structures. For example, the proposed core design in Fig. 3 is represented as 00. In this case, there are four combinations of register-splitting structures for Fig. 3, as listed in Table 1. With the minimum critical period in mind, we select the 10 register-splitting structure for our design. Note that the 10 and 11 structures listed in Table 1 result in the same DFT design, as depicted in Fig. 4. The new DFT architecture has higher speed and smaller area than the second-order DFT structure. As for the number of operations of the high-speed area-efficient recursive DFT architecture in Fig. 4, only two real multiplications per sample are required to implement the poles, and two real multiplications by $\sin(2\pi k/N)$ are required for the imaginary part of the DFT output. Thus, the total computation is $2N$ real multiplications and $4N$ real additions for the poles, plus two real multiplications and four real additions for the zero. In a similar manner, the transfer function of the recursive IDFT, Eq. (9), can be obtained.
Via the module-sharing and register-splitting schemes, Eq. (9) can be realized as a new recursive IDFT structure. To reduce the number of computation cycles for the N-point DFT/IDFT, we use the core design shown in Fig. 4 as a processing element (PE) and construct the parallel recursive DFT structure shown in Fig. 5. The comparison results listed in Table 2 show that the parallel recursive structure significantly reduces the number of computation cycles from $N^2$ to $2N$. Importantly, the parallel recursive structure is more area-efficient than structures based on the conventional first- and second-order DFT designs. To save further area, the parallel recursive structure can be converted into the folded recursive DFT architecture in Fig. 6, at the cost of $N$ additional computation cycles.
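As an illustration of how the same recursion can serve the inverse transform, the sketch below computes one IDFT sample by flipping the sign of the sine term in the output step and scaling by $1/N$. This is an assumption based on the standard IDFT definition $x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\, W_N^{-kn}$, not a transcription of Eq. (9); the name `idft_recursive` and the placement of the $1/N$ factor are illustrative.

```python
import math


def idft_recursive(X, n):
    """One IDFT sample x[n] via a second-order recursion over the bins X[k]."""
    N = len(X)
    theta = 2.0 * math.pi * n / N
    c = math.cos(theta)
    s = math.sin(theta)
    v1 = 0j  # v[m-1], initial rest
    v2 = 0j  # v[m-2], initial rest
    for m in range(N):
        p = c * v1  # shared cosine product, as in the forward transform
        v0 = X[m] + 2.0 * p - v2
        v2, v1 = v1, v0
    p = c * v1
    vN = 2.0 * p - v2
    # Sine term enters with the opposite sign compared with the forward DFT.
    return ((vN - p) - 1j * s * v1) / N  # 1/N scaling assumed to be applied here


if __name__ == "__main__":
    X = [5.5 + 1j, -2 + 2.5j, 1.5 + 3j, -1 + 1.5j]
    print([idft_recursive(X, n) for n in range(len(X))])
```

Feeding the output of a forward DFT into this function should reproduce the original samples up to rounding error, which is a convenient sanity check for the sign convention.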
Comparison Results
In this section, we give a comprehensive comparison, listed in Table 2, in terms of the critical period, the number of real multipliers, the total numbers of real multiplications and real additions for each DFT/IDFT output, and the number of computation cycles. Let $T_m$ and $T_a$ denote the operation time required for one real multiplication and one real addition, respectively. Note that a complex multiplication requires $T_m + T_a$, and that the operation time of the multiplexer in Fig. 1(b) is negligible compared with $T_m$ and $T_a$ and is ignored here. Our proposed work 1 (i.e., the core-type design) has the same highest speed and the same lowest number of multipliers as the first-order recursive DFT/IDFT architecture, owing to the register-splitting and module-sharing schemes, respectively. Although the first-order recursive DFT/IDFT structure has these advantages, as listed in the first and second rows of Table 2, it cannot avoid the large number of operations for each DFT/IDFT output. That is, it consumes considerable power. Our proposed work 1 and the second-order DFT/IDFT architecture based on the Goertzel algorithm save nearly half of the multiplications for each DFT/IDFT output compared with the first-order recursive DFT/IDFT structure. Furthermore, based on the proposed work 1, we can construct parallel and folded recursive DFT/IDFT architectures. These two architectures significantly reduce the number of computation cycles for the $N$-point DFT/IDFT from $N^2$ to $2N$ and $3N$, respectively, which makes them better suited to real-time operation. Note that, although these two architectures require additional multipliers, they are still area-efficient compared with architectures based on conventional core designs. Table 2 therefore shows that our proposed architectures offer high speed, area efficiency, and fewer computing operations.
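To give a feel for the cycle counts quoted above, the following sketch tabulates them for a given transform length. It assumes the counts as read from the text ($N^2$ for the core design, $2N$ for the parallel structure, $3N$ for the folded structure); the function name `cycle_counts` is hypothetical, and the remaining columns of Table 2 are not modeled.

```python
def cycle_counts(N):
    """Computation cycles for an N-point DFT/IDFT, using the counts quoted in the text."""
    return {
        "core": N * N,      # one processing element computing the N outputs in turn
        "parallel": 2 * N,  # parallel recursive structure
        "folded": 3 * N,    # folded structure: N more cycles than the parallel one
    }


if __name__ == "__main__":
    for N in (16, 64, 256):
        print(N, cycle_counts(N))
```

For example, with $N = 64$ the counts are 4096, 128, and 192 cycles, respectively.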
Conclusion
We have devised three new recursive DFT/IDFT architectures based on the Goertzel algorithm by combining the module-sharing and register-splitting schemes. The module-sharing scheme reduces the number of multipliers, while the register-splitting scheme results in a high-speed architecture. Because the designs are based on the Goertzel algorithm, they retain its low operation count for DFT/IDFT computation.
