Abstract-In this paper, we propose two low-computation-cycle, high-speed recursive discrete Fourier transform (DFT)/inverse DFT (IDFT) architectures that combine the Chebyshev polynomial with a register-splitting scheme. The proposed core-type recursive architecture halves the number of computation cycles and has a shorter critical period than the conventional second-order DFT/IDFT architecture. To further reduce the number of computation cycles, we develop a folded-type recursive DFT/IDFT architecture, based on the new core-type design, that runs at the same operating frequency. Moreover, the derivation shows that the DFT and the IDFT can be performed with the same structure under different configurations.
I. INTRODUCTION
The discrete Fourier transform (DFT) has been widely applied in the analysis and implementation of communication systems such as OFDM-based wireless local area network (WLAN) [1, 2] and dual tone multi-frequency (DTMF) standards [3]. In many applications, complex sequences in the time domain are analyzed in the frequency domain via DFT computation. Without loss of generality, the input data are assumed to be complex-valued. Existing research offers four categories of structures for DFT/IDFT computation: 1) recursive-algorithm based architecture [3]-[8], 2) butterfly-based architecture [9], [10], 3) ROM operation based structure [11], and 4) multiplier-accumulator based structure. It is well known that DFT architectures based on the recursive algorithm are more area-efficient than those realized by other approaches. The existing recursive algorithms for orthogonal transforms in the scope of DFT/DCT/DST (discrete Fourier/cosine/sine transform) include the Goertzel algorithm [3]-[7], C-S's algorithm [13], Chebyshev polynomials [8], [14], [15], and Clenshaw's recurrence formula (CRF) [16]. In [12]-[16], recursive expressions for the computation of the DCT that are suitable for VLSI implementation are presented. Note that in [12]-[16] the recursive algorithms are used to design recursive DCT architectures rather than recursive DFT architectures. In [7], compared with the conventional second-order recursive DFT/IDFT architecture, Van et al. utilized computation-sharing and register-splitting schemes to remove two multipliers and speed up the operation, respectively. Nevertheless, Van et al. did not reduce the number of computation cycles. In [8], Fan et al. applied the previously proposed method to reduce the computation cycles, but the improvement is limited.
Moreover, [8] presents only a recursive DFT algorithm; the corresponding IDFT algorithm is not given. Therefore, we are motivated to propose a performance-oriented VLSI algorithm and architecture with the following features: a low computation cycle count and high speed, at the expense of a slightly increased area overhead compared with the second-order recursive DFT/IDFT structure. Regarding the new lower-computation-cycle recursive DFT/IDFT architecture as a core, we can develop a folded architecture that needs fewer computation cycles for OFDM-based WLAN applications. The paper is organized as follows. Section II gives a new core-type recursive DFT algorithm and architecture that combine the Chebyshev polynomial with the register-splitting scheme. Section III proposes the corresponding recursive IDFT algorithm and architecture. Section IV constructs a folded architecture with a lower computation cycle count for the DFT/IDFT. Section V tabulates the complexity comparison in terms of the number of computation cycles for each output as well as for the N-point DFT/IDFT, the critical period, and the number of real multipliers. Finally, Section VI concludes the paper.
II. NEW RECURSIVE FORMULA FOR DFT
The DFT of the N-point input x[n] is defined as 
By folding pairs of input data, only half of the summation terms are needed to express y[k]. Eq. (2) can be separated into DCT and DST parts, $y_{DCT}(k)$ and $y_{DST}(k)$, respectively, as
and
In Eq. (3), we can define
Replacing n by N/2-1-n, Eq. (3) can be rewritten as
where
, and
It is known that the Chebyshev polynomials are defined as
The z-transform of Eq. (9) can be denoted as
can be derived in a similar manner as
where
Eqs. (10) and (13) can be easily mapped into the structures shown in Fig. 1(a) and (b), respectively. Compared with the conventional architectures [4, 7], the proposed algorithm and architecture reduce the number of computation cycles by 50%. To raise the speed, we adopt the register-splitting scheme (i.e., one kind of retiming scheme) to shorten the critical path. Herein, we define two notations, 0 and 1, which indicate that a delay element in Fig. 1(a) lies on the top-to-down or the bottom-to-up signal path, respectively. Thus, a binary digit sequence can represent each register-splitting structure. For example, the proposed DCT part of the core-type design in Fig. 1(a) is represented as 00. In this case, there are four combinations, as listed in Table 1, where $T_m$ and $T_a$ denote the operation time required for one real multiplication and one real addition, respectively. With the minimum critical period and the fewest registers in mind, we select the 10 register-splitting structure for the DCT part. It is worth emphasizing that 10 and 11 in Table 1 result in the same DCT part, as depicted in the upper diagram of Fig. 2, where <<1 denotes a hardwired shifter with a one-bit left shift. Similarly, the DST part can be modified as in the lower diagram of Fig. 2. To keep the minimum critical period for the recursive DFT computation, a forward pipeline register is placed at the final sum output. Combining these two new parts yields a recursive DFT architecture with fewer computation cycles and higher speed than the conventional DFT structures.
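The 50% cycle reduction described in this section can be illustrated numerically. The following is a minimal software sketch, not the hardware of Fig. 1 or Fig. 2 and not a transcription of Eqs. (10) and (13): it assumes the common folding $s[n] = x[n] + x[N-1-n]$, $r[n] = x[n] - x[N-1-n]$ and evaluates the resulting cosine and sine sums with a second-order recurrence derived from the Chebyshev identity $\cos((n+1)\theta) = 2\cos\theta\cos(n\theta) - \cos((n-1)\theta)$. The function names and the reversed iteration order are choices made for this example only.

```python
import cmath
import math


def dft_direct(x):
    """Reference N-point DFT straight from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]


def dft_folded_recursive(x):
    """DFT by input folding plus a second-order (Chebyshev-type) recursion.

    Each output bin needs N/2 recursion steps instead of N.
    """
    N = len(x)
    assert N % 2 == 0, "this sketch assumes an even transform length"
    M = N // 2
    s = [x[n] + x[N - 1 - n] for n in range(M)]  # drives the cosine (DCT-like) part
    r = [x[n] - x[N - 1 - n] for n in range(M)]  # drives the sine (DST-like) part
    y = []
    for k in range(N):
        theta = 2 * math.pi * k / N
        alpha = 2 * math.cos(theta)              # single real recursion coefficient
        c1 = c2 = 0                              # state of the cosine recursion
        d1 = d2 = 0                              # state of the sine recursion
        for n in reversed(range(M)):             # N/2 iterations per output
            c1, c2 = s[n] + alpha * c1 - c2, c1
            d1, d2 = r[n] + alpha * d1 - d2, d1
        y_dct = (c1 - c2) * math.cos(theta / 2)  # sum of s[n] * cos((2n+1)*theta/2)
        y_dst = (d1 + d2) * math.sin(theta / 2)  # sum of r[n] * sin((2n+1)*theta/2)
        y.append(cmath.exp(1j * theta / 2) * (y_dct - 1j * y_dst))
    return y


x = [complex(n + 1, 0.5 * n * (-1) ** n) for n in range(8)]
max_err = max(abs(a - b) for a, b in zip(dft_direct(x), dft_folded_recursive(x)))
print(max_err)  # expected to be at rounding-error level
```

The inner loop runs N/2 times per output, which is the software counterpart of the halved computation cycle count; the register-splitting step affects only the hardware critical path and has no equivalent in this sketch.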
To develop the recursive IDFT algorithm, Eq. (14) can be recast as
Similarly, Eq. (15) can be separated into IDCT and IDST parts, $x_{IDCT}(n)$ and $x_{IDST}(n)$, respectively, as
In Eq. (16), we can define
Using the recursive identity stated in (7) 
where 
After applying the register-splitting scheme, Eqs. (21) and (24) can be easily mapped into the modified structures shown in Fig. 3. Again, the proposed algorithm and architecture reduce the number of computation cycles by 50%.
Fig. 3. Block diagram of the proposed low-computation-cycle and high-speed recursive IDFT architecture.
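That one structure can serve both transforms can also be seen from a generic identity, shown here as a hedged sketch. This is the standard conjugation trick, $x = \mathrm{conj}(\mathrm{DFT}(\mathrm{conj}(y)))/N$, and it is not the derivation of Eqs. (21) and (24); the direct-form `dft` below merely stands in for any DFT datapath, and both function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """N-point DFT from the definition; stands in for any DFT datapath."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]


def idft_via_dft(y):
    """IDFT that reuses dft(): conjugate the input, conjugate the output, scale by 1/N."""
    N = len(y)
    return [v.conjugate() / N for v in dft([c.conjugate() for c in y])]


x = [1 + 2j, -0.5j, 3 + 0j, 2 - 1j]
x_back = idft_via_dft(dft(x))
print(max(abs(a - b) for a, b in zip(x, x_back)))  # round-trip error
```

In hardware terms, the conjugations amount to sign changes on the imaginary paths, which is one way a configuration bit could switch a shared datapath between the two transforms.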
IV. FOLDED ARCHITECTURE FOR DFT/IDFT
Many modern applications need a larger frame size to achieve higher frequency resolution [1, 2]. However, such a strict specification makes the implementation requirements harder to meet. To further reduce the number of computation cycles for the N-point DFT/IDFT, we regard the recursive processing kernel of the core-type design in Fig. 2 as a processing element (PE) and regularly construct the folded recursive DFT (RDFT) structure shown in Fig. 4. The whole architecture consists of a data buffer, a control unit, and N/2 RDFT units. The control unit acts as both a sequence controller and a parameter controller, which feeds the proper coefficients to the RDFT units. Each RDFT unit consists of one preprocessing unit and one recursive PE, where the preprocessing unit provides the intermediate data $s_k$ and $r_k$ to the following recursive PE. The final outputs come out of the N/2 recursive PEs in parallel. With proper scheduling, the data streams can be processed continuously every clock cycle without extra computation latency. Consequently, an N-point DFT/IDFT is completed in N computation cycles at the maximum throughput rate.
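The cycle count of the folded structure can be checked with a small scheduling sketch. The round-robin assignment of output bins to PEs is an assumption made for illustration (the control unit's actual schedule is not specified here), as is the function name `folded_schedule`; the sketch only uses the facts that there are N/2 PEs and that each PE needs N/2 cycles per output.

```python
def folded_schedule(n_points):
    """Assign the N output bins to N/2 PEs and count the total cycles.

    Assumes each PE spends N/2 cycles per output and all PEs run in parallel.
    """
    num_pe = n_points // 2
    cycles_per_output = n_points // 2
    assignment = {pe: [] for pe in range(num_pe)}
    for k in range(n_points):
        assignment[k % num_pe].append(k)       # hypothetical round-robin mapping
    rounds = max(len(bins) for bins in assignment.values())
    return rounds * cycles_per_output, assignment


total_cycles, assignment = folded_schedule(64)
print(total_cycles)    # 64
print(assignment[0])   # [0, 32]
```

Each PE handles two output bins back to back, so the N/2 parallel PEs finish all N outputs in 2 x N/2 = N cycles.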
V. COMPARISONS AND DISCUSSIONS
In this section, we give a comprehensive comparison, listed in Table 2, in terms of the number of computation cycles for each DFT/IDFT output as well as for the N-point DFT/IDFT calculation, the critical period, and the number of real multipliers. Note that a complex multiplication requires an operation time of $T_m + T_a$. Our proposed work 1 (i.e., the core-type design), based on the Chebyshev polynomial, saves half of the computation cycles for each DFT/IDFT output compared with the existing works [4, 7] at the expense of a slightly increased area cost. For example, the recursive algorithm in [8] requires 2794 computation cycles to obtain all 64-point DFT outputs, whereas the proposed core-type architecture requires 2048. In other words, our proposed work 1 has the lowest computation cycle count among the existing structures [4, 7, 8]. Owing to the register-splitting scheme, the proposed design is faster than the recursive structures of [4, 8] and has the same operating frequency as our previous work [7]. However, in terms of hardware complexity, the proposed core-type DFT/IDFT architecture requires two more multipliers than the design previously proposed in [7]. Consequently, the proposed method trades two extra multipliers for fewer computation cycles without lowering the operating frequency relative to [7]. Furthermore, based on the proposed work 1, we can construct a folded recursive DFT/IDFT architecture. The folded architecture significantly reduces the number of computation cycles for the N-point DFT/IDFT from $N^2/2$ to N, which better suits real-time operation. Table 2 therefore shows that the proposed architectures have the lowest computation cycle counts together with high speed. Concerning the chip implementation, it is worth noting that the recursive PE dominates the performance of the whole architecture.
Hence, we design an efficient recursive PE that takes into account speed, area, and ease of interconnection. To further shorten the critical path, the complex multiplications use shift-and-add arithmetic. The active chip layout area of the proposed recursive PE, shown in Fig. 5, is 515 µm x 515 µm in the TSMC 0.13 µm CMOS process. The critical delay obtained from Synopsys static timing analysis (STA) is 11.32 ns under the worst-case condition. The folded-type architecture is therefore expected to need only 0.72 µs, i.e., 64 cycles, to complete a 64-point DFT/IDFT operation. This meets the timing specification of the IEEE 802.11a standard [2]. Table 3 summarizes the chip characteristics of the proposed recursive PE for the DFT/IDFT structure.
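The cycle counts and the 0.72 µs figure quoted in this section follow from simple arithmetic, sketched below. Three points are assumptions of this example rather than statements from the comparison: the conventional second-order structure is taken to need N cycles per output (inferred from the stated 50% saving), the clock period is taken to equal the 11.32 ns critical delay, and the 4 µs OFDM symbol period is the commonly cited IEEE 802.11a value, which is not given in the text.

```python
CLOCK_NS = 11.32        # worst-case critical delay of the recursive PE (from STA)
OFDM_SYMBOL_US = 4.0    # assumed 802.11a OFDM symbol period, not stated in the text


def cycles_second_order(n_points):
    """Conventional second-order recursion: N cycles per output (inferred)."""
    return n_points * n_points


def cycles_core(n_points):
    """Proposed core-type design: N/2 cycles per output, one PE."""
    return n_points * n_points // 2


def cycles_folded(n_points):
    """Proposed folded design: N/2 PEs in parallel."""
    return n_points


def latency_us(cycles, clock_ns=CLOCK_NS):
    """Latency in microseconds, assuming the clock period equals the critical delay."""
    return cycles * clock_ns / 1000.0


N = 64
print(cycles_second_order(N), cycles_core(N), cycles_folded(N))  # 4096 2048 64
folded_us = latency_us(cycles_folded(N))
print(round(folded_us, 2))            # 0.72
print(folded_us < OFDM_SYMBOL_US)     # True
```

Under the same assumptions, the core-type design alone would take 2048 cycles for a 64-point transform, which shows why the folded structure is the one aimed at the WLAN timing budget.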
VI. CONCLUSIONS
Two new recursive DFT/IDFT architectures based on a hybrid of the Chebyshev polynomial and the register-splitting scheme are devised in this paper. The analysis shows that the proposed VLSI algorithm achieves the fewest computation cycles and a higher speed than the other designs compared. In addition, the proposed folded recursive architecture, with its regular organization, is well suited to VLSI implementation.
