Abstract-This paper presents a high-throughput, low-complexity 512-point eight-parallel mixed-radix multipath delay feedback (MDF) fast Fourier transform (FFT) processor architecture for orthogonal frequency division multiplexing (OFDM) applications. To decrease the number of twiddle factor (TF) multiplications, a mixed-radix $2^4/2^3$ FFT algorithm is adopted. Moreover, a dual-path shared canonical signed digit (CSD) complex constant multiplier using a multi-layer scheme is proposed for reducing the hardware complexity of the TF multiplication. The proposed FFT processor is implemented using TSMC 90-nm CMOS technology. The synthesis results demonstrate that the proposed FFT processor achieves a 16% reduction in hardware complexity and higher throughput compared to conventional architectures.
I. INTRODUCTION
The fast Fourier transform (FFT) is a fast and efficient method of computing the discrete Fourier transform (DFT). FFT processors are among the most widely used components in digital signal processing (DSP) applications and systems. They have attracted intense research interest since the appearance of orthogonal frequency division multiplexing (OFDM) communication systems, not only in digital broadcasting and mobile telecommunications but also in power line communication (PLC) systems [1] [2] [3]. In recent years, with the increasing demand for multimedia applications using wireless transmission over short distances, the millimeter wave (mmWave) 60 GHz wireless personal area network (WPAN) has been intensively researched. In addition, the IEEE 802.11 Task Group ad (IEEE 802.11ad) has developed a standard for mmWave wireless local area network (WLAN) and WPAN systems [4]. OFDM modulation has been adopted in the physical layer design of high-rate WPANs, where the FFT processor has a high hardware complexity. One OFDM symbol in the IEEE 802.11ad standard consists of 512 subcarriers; therefore, the FFT processor performs a 512-point FFT. Because of this massive computational load, FFT processor architectures incur high hardware complexity and power consumption. Thus, several FFT processor architectures have been proposed to reduce the hardware complexity and to provide higher throughput.
Among the various FFT algorithms, the Cooley-Tukey algorithm [5] is highly popular. It introduced the concept of the FFT, which reduces the computational complexity by making efficient use of the symmetry and periodicity properties of the twiddle factors (TFs). To further reduce the computational complexity, several algorithms have been proposed, including radix-$2^3$ [6], radix-$2^4$ [7, 8], radix-$2^5$ [9], and mixed-radix [10]. What these higher-radix algorithms have in common is that they reduce the number of non-trivial multiplications relative to the radix-2 algorithm. Other studies have addressed parallel FFT architectures, which can achieve higher throughput with lower computation latency [6] [7] [8]. However, critical problems remain in terms of speed, area, and power consumption. Therefore, this paper focuses on improving the throughput and hardware complexity of FFT processor architectures.
In this paper, we propose a high-throughput and low-complexity 512-point eight-parallel multipath delay feedback (MDF) architecture using an area-efficient mixed-radix $2^4/2^3$ FFT algorithm. In addition, we propose the architecture of a dual-path shared multi-layer canonical signed digit (CSD) complex constant multiplier (DPS-MLCCM) to reduce the hardware complexity of the TF multiplication in parallel FFT processors. The proposed FFT processor provides better throughput and lower hardware complexity than previous designs [9] [10] [11]. The rest of this paper is organized as follows. Section II describes the mixed-radix $2^4/2^3$ FFT algorithm. Section III presents the architecture of the proposed mixed-radix FFT processor and the DPS-MLCCM for TF multiplication. Section IV presents the implementation results and performance comparison. Finally, conclusions are provided in Section V.
II. MIXED-RADIX $2^4/2^3$ FFT ALGORITHM
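The DFT of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where k is the frequency index and n is the time index; the TF is defined as

$$W_N^{nk} = e^{-j 2\pi nk / N}. \tag{2}$$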
The basic idea of FFT is based on the fundamental principle of decomposing the computation of the N-point DFT into successively smaller DFTs [5] .
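As a minimal sketch of this decomposition idea, the Python fragment below evaluates the DFT directly from its definition and then rebuilds the same result from two half-length DFTs using one decimation-in-frequency step. The function names `dft` and `dif_split` are illustrative and not taken from the paper, and the sketch assumes an even input length.

```python
import cmath


def dft(x):
    """Direct N-point DFT: X(k) = sum_n x(n) * W_N^(n*k), W_N = exp(-j*2*pi/N)."""
    n_pts = len(x)
    w = cmath.exp(-2j * cmath.pi / n_pts)
    return [sum(x[n] * w ** (n * k) for n in range(n_pts)) for k in range(n_pts)]


def dif_split(x):
    """One decimation-in-frequency step: N-point DFT from two N/2-point DFTs."""
    n_pts = len(x)
    half = n_pts // 2
    w = cmath.exp(-2j * cmath.pi / n_pts)
    # Sum branch feeds the even-indexed outputs.
    upper = [x[n] + x[n + half] for n in range(half)]
    # Difference branch, rotated by W_N^n, feeds the odd-indexed outputs.
    lower = [(x[n] - x[n + half]) * w ** n for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = dft(upper)
    out[1::2] = dft(lower)
    return out
```

Applying `dif_split` recursively to the two half-length transforms gives the radix-2 FFT; the higher-radix algorithms discussed next group several such steps so that fewer non-trivial TF multiplications remain.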
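Radix-$2^3$ Algorithm
To derive the radix-$2^3$ algorithm, the first three steps of the cascade decomposition are considered. The linear index mapping is transformed into the four-dimensional linear index maps [6]:

$$n = \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + n_4, \qquad k = k_1 + 2k_2 + 4k_3 + 8k_4, \tag{3}$$

where $n_1, n_2, n_3, k_1, k_2, k_3 \in \{0, 1\}$ and $n_4, k_4 = 0, 1, \ldots, N/8 - 1$. Applying the four-dimensional linear index map to (1), the DFT takes the form

$$X(k_1 + 2k_2 + 4k_3 + 8k_4) = \sum_{n_4=0}^{N/8-1} \sum_{n_3=0}^{1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + \tfrac{N}{8} n_3 + n_4\right) W_N^{nk}. \tag{4}$$

With the cascade decomposition, the TF can be expressed in the form of (5):

$$W_N^{nk} = (-1)^{n_1 k_1} (-j)^{n_2 k_1} (-1)^{n_2 k_2}\, W_8^{n_3 (k_1 + 2k_2)} (-1)^{n_3 k_3}\, W_N^{n_4 (k_1 + 2k_2 + 4k_3)}\, W_{N/8}^{n_4 k_4}. \tag{5}$$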
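Radix-$2^4$ Algorithm
In [7], the radix-$2^4$ algorithm is derived by considering the first four steps of the decomposition. Applying a five-dimensional linear index map,

$$n = \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5, \qquad k = k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5, \tag{6}$$

where $n_1, \ldots, n_4, k_1, \ldots, k_4 \in \{0, 1\}$ and $n_5, k_5 = 0, 1, \ldots, N/16 - 1$. The common factor algorithm (CFA) takes the form of

$$X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5) = \sum_{n_5=0}^{N/16-1} \sum_{n_4=0}^{1} \sum_{n_3=0}^{1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + \tfrac{N}{8} n_3 + \tfrac{N}{16} n_4 + n_5\right) W_N^{nk}. \tag{7}$$

With the cascade decomposition, the TF can be expressed in the form of (8):

$$W_N^{nk} = (-1)^{n_1 k_1} (-j)^{n_2 k_1} (-1)^{n_2 k_2}\, W_{16}^{(2n_3 + n_4)(k_1 + 2k_2)} (-1)^{n_3 k_3} (-j)^{n_4 k_3} (-1)^{n_4 k_4}\, W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}\, W_{N/16}^{n_5 k_5}. \tag{8}$$

In the mixed-radix $2^4/2^3$ algorithm, the radix-$2^4$ algorithm is adopted in the first four stages and the radix-$2^3$ algorithm is used in the remaining stages. Table 1 presents the sequence of the 512-point FFT TF computation at each stage for the radix-$2^k$ and mixed-radix algorithms. As can be observed, the mixed-radix $2^4/2^3$ algorithm reduces the number of TF multiplications. Hence, the area and power consumption of the complex multipliers are reduced accordingly.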
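The radix-$2^4$ TF decomposition can be sanity-checked numerically. The sketch below assumes the usual five-dimensional index map, n = (N/2)n1 + (N/4)n2 + (N/8)n3 + (N/16)n4 + n5 and k = k1 + 2k2 + 4k3 + 8k4 + 16k5, and a factorization into trivial terms (powers of -1 and -j), one $W_{16}$ term, one $W_N$ term, and the TF of the remaining N/16-point DFT. The grouping and the helper names are assumptions made for illustration and may differ from the exact form printed in [7].

```python
import cmath
from itertools import product


def w(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def radix16_twiddle(N, n, k):
    """Factored W_N^(n*k) for digit tuples n = (n1..n5) and k = (k1..k5)."""
    n1, n2, n3, n4, n5 = n
    k1, k2, k3, k4, k5 = k
    # Trivial rotations: sign changes and multiplications by -j.
    trivial = ((-1) ** (n1 * k1 + n2 * k2 + n3 * k3 + n4 * k4)
               * (-1j) ** (n2 * k1 + n4 * k3))
    return (trivial
            * w(16, (2 * n3 + n4) * (k1 + 2 * k2))        # W_16 term
            * w(N, n5 * (k1 + 2 * k2 + 4 * k3 + 8 * k4))  # W_N term
            * w(N // 16, n5 * k5))                        # next-level DFT


def max_error(N):
    """Largest |W_N^(n*k) - factored form| over every (n, k) pair."""
    bits = list(product((0, 1), repeat=4))
    tail = range(N // 16)
    worst = 0.0
    for nb, n5, kb, k5 in product(bits, tail, bits, tail):
        n1, n2, n3, n4 = nb
        k1, k2, k3, k4 = kb
        n = N // 2 * n1 + N // 4 * n2 + N // 8 * n3 + N // 16 * n4 + n5
        k = k1 + 2 * k2 + 4 * k3 + 8 * k4 + 16 * k5
        err = abs(w(N, n * k) - radix16_twiddle(N, nb + (n5,), kb + (k5,)))
        worst = max(worst, err)
    return worst
```

Calling `max_error(64)` or `max_error(512)` should return a value at floating-point rounding level if the factorization holds. Only the $W_{16}$ and $W_N$ terms need real multipliers, which is consistent with the constant multipliers placed at stages 2 and 4 of the architecture described in Section III.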
III. PROPOSED FFT ARCHITECTURE

Proposed MDF Architecture
Several pipeline architectures for FFT processors have been proposed over the past few decades [6]. In general, the multi-path delay commutator (MDC) scheme can achieve a higher throughput rate, while the single-path delay feedback (SDF) scheme requires less memory and hardware complexity [12]. The proposed MDF architecture provides a high throughput rate with minimal hardware cost by incorporating the features of MDC and SDF [12] [13] [14].
The block diagram of the proposed eight-parallel 512-point mixed-radix $2^4/2^3$ FFT processor comprises the processing modules, each based on its own radix algorithm, and the top control unit. The details of each module are discussed as follows. Module 1 has a radix-$2^4$ structure as depicted in Fig. 3.
The input data is processed in eight parallel data-paths. Module 1 covers stage 1 to stage 4 of the proposed FFT processor and consists of the first-in first-out (FIFO) registers, two types of butterfly unit (BF1 and BF2), the conventional CSD complex multiplier (CCM) using a common sub-expression sharing (CSS) technique for stage 2, and the proposed DPS-MLCCM using the CSS technique for the TF $W_{512}$ in stage 4. The BF1 implements only complex additions and subtractions, while the BF2 includes the TF $W_4$ multiplication, realized with multiplexers and control signals [9]. Module 2 is realized by the radix-$2^3$ FFT algorithm as shown in Fig. 4. Most of the components of module 2 are similar to those of module 1: FIFO registers, butterfly units BF1 and BF2, the CCM using the CSS technique for stage 6, and a dual-path shared CCM (DPS-CCM) using the CSS technique for the TF $W_{32}$ in stage 7.
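The reason the $W_4$ multiplication in BF2 needs only multiplexers can be illustrated with a short sketch. Multiplying by -j swaps the real and imaginary parts and negates one of them, so no multiplier is involved. The function names and the placement of the rotation on the difference branch are assumptions made for illustration; the actual BF2 control follows [9].

```python
def mul_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: a swap plus one negation."""
    return im, -re


def bf2(a, b, rotate):
    """Radix-2 butterfly on (re, im) integer pairs.

    When rotate is true, the difference branch is multiplied by -j,
    which models the multiplexer-selected trivial W_4 rotation.
    """
    total = (a[0] + b[0], a[1] + b[1])
    diff = (a[0] - b[0], a[1] - b[1])
    if rotate:
        diff = mul_minus_j(*diff)
    return total, diff
```

For example, `bf2((1, 2), (3, 4), True)` gives the sum (4, 6) and the rotated difference (-2, 2), since (-2 - 2j)(-j) = -2 + 2j.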
Proposed Dual-path Shared Multi-layer CSD Complex Constant Multiplier
A complex multiplier has a critical effect on the hardware complexity, power consumption, and throughput of FFT processors. Even when a low-complexity FFT algorithm is adopted, the complex multiplier can be realized by various approaches to further reduce hardware complexity. Generally, the CSD representation of a TF reduces hardware complexity more than the binary representation when only a few TF coefficients are needed. For a TF with many coefficients, the Booth multiplier has been widely used in existing research. However, its primary problem is that it requires high hardware complexity.
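To make the CSD idea concrete, the following sketch recodes an integer constant into canonical signed digits and multiplies by it using only shifts, additions, and subtractions. It is a generic illustration of CSD recoding, not the paper's multiplier; the function names are illustrative, and the CSS optimization across coefficients is not modeled.

```python
def to_csd(value):
    """Return CSD digits (LSB first), each in {-1, 0, 1}.

    No two adjacent digits are nonzero, which minimizes the number
    of add/subtract operations in a shift-and-add multiplier.
    """
    digits = []
    while value != 0:
        if value & 1:
            # +1 when value % 4 == 1, -1 when value % 4 == 3.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply integer x by the constant encoded in digits (shift-and-add)."""
    return sum(d * (x << i) for i, d in enumerate(digits) if d)
```

For instance, `to_csd(7)` returns `[-1, 0, 0, 1]`, that is 8 - 1, so multiplying by 7 takes one subtraction instead of the two additions implied by the binary form 111.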
Through the TF mapping and decomposition process, the entire multiplication unit can be implemented using 15 different TFs in total. The numbers of TFs needed for layer 1 and layer 2 of the proposed DPS-MLCCM are eight and seven, respectively. Fig. 6 presents the block diagram of the proposed DPS-MLCCM for the $W_{512}$ TF multiplication.
The input data, together with the mapping TF coefficient and the regional selection data generated by the TF controller block, pass through the two layers of the multiplier to generate the output result. Because of the way the TFs are scheduled in layer 2 of the DPS-MLCCM, the path sharing scheme can be applied in this layer to further reduce hardware complexity.
Layer 1 of the multiplier is responsible for the complex multiplication of the input x with a TF of the form $W_{512}^{8t_1}$. Since the number of TFs is quite small, the multiplication in layer 1 can be implemented using the pre-CCM. When the TF is equal to $W_{512}^{0}$, the pre-CCM performs a bypass operation without an additional CCM. Therefore, this layer consists of a CCM using only 16 coefficients. In addition, the CSS technique is applied to these coefficients to minimize hardware complexity. The detailed hardware architecture of the pre-CCM of the proposed DPS-MLCCM for the $W_{512}$ TF multiplication is shown in Fig. 7. This pre-CCM of layer 1 consists of the CCM using the CSS technique, the two's complement logic, and the multiplexers for proper control. The output of layer 1 is then sent to the input of the DPS-CCM in layer 2 of the DPS-MLCCM, where the remaining computations are performed.
Layer 2 of the DPS-MLCCM is responsible for the multiplication of the input data set Out_L1 {Re, Im}, which is the output of layer 1 of the multiplier, with the TFs $W_{512}^{t_2}$, $0 \le t_2 \le 7$. When the TF is equal to $W_{512}^{0}$, no multiplier is required for the computation, as described for layer 1. Therefore, this layer consists of only seven complex multipliers. Similarly, the TF multiplication in layer 2 can be implemented using 14 CCMs.
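The two-layer idea can be sketched numerically. The fragment below assumes that the mapped TF exponent t lies in the range 0 to 64 (one octant of the unit circle, with the other octants handled by the regional remapping) and that it is split as t = 8*t1 + t2 with t2 between 0 and 7. Both the range and the split are a reading of the text rather than a confirmed detail of the design, and the function names are illustrative.

```python
import cmath


def w512(e):
    """Twiddle factor W_512^e = exp(-j*2*pi*e/512)."""
    return cmath.exp(-2j * cmath.pi * e / 512)


def split_exponent(t):
    """Assumed split t = 8*t1 + t2 with 0 <= t2 <= 7."""
    return divmod(t, 8)


def two_layer_multiply(x, t):
    """Multiply x by W_512^t in two steps, bypassing a layer on exponent 0."""
    t1, t2 = split_exponent(t)
    out = x if t1 == 0 else x * w512(8 * t1)    # layer 1: coarse rotation
    return out if t2 == 0 else out * w512(t2)   # layer 2: fine rotation


def nontrivial_constants(t_max=64):
    """Count distinct non-trivial constants per layer for exponents 0..t_max."""
    splits = [split_exponent(t) for t in range(t_max + 1)]
    layer1 = {8 * t1 for t1, _ in splits if t1}
    layer2 = {t2 for _, t2 in splits if t2}
    return len(layer1), len(layer2)
```

Under these assumptions, `nontrivial_constants()` returns `(8, 7)`, which matches the eight layer-1 and seven layer-2 TFs (15 in total) stated in the text, compared with the 64 non-trivial constants a single-layer constant multiplier would need over the same exponent range.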
On the basis of the TF scheduling in the eight parallel data-paths, the proposed dual-path sharing technique can be applied in layer 2 of the DPS-MLCCM to further reduce the hardware complexity. The layer-2 TFs are scheduled across the eight data-paths at different time slots. According to this scheduling in the proposed 512-point mixed-radix FFT processor, the same TF never occurs at the same time in the two data-paths of each of the following pairs: path 1 and path 6, path 2 and path 7, path 3 and path 8, and path 4 and path 5. However, the CCM used for $t_2 = 4$ must be duplicated to avoid a conflict in the DPS-CCM control. Consequently, a total of 16 CCMs are required in the DPS-CCM architecture.
The detailed DPS-CCM architecture of layer 2 of the proposed DPS-MLCCM for the $W_{512}$ TF multiplication is shown in Fig. 9. This layer consists of the CCM using the sharing technique, the two's complement logic, and the multiplexers for appropriate control of regional remapping.
IV. IMPLEMENTATION RESULTS AND PERFORMANCE COMPARISON
Prior to the hardware implementation of the proposed FFT processor, an appropriate word length was determined and the quantization error performance was evaluated by a fixed-point simulation using MATLAB. From the simulation results, a 12-bit word length was chosen for both the real and imaginary parts because the output signal-to-noise ratio (SNR) saturated at a 12-bit word length.
The chosen word length keeps the quantization noise low while minimizing the hardware complexity. When the word length is set to 12 bits, the proposed FFT processor architecture yields a signal-to-quantization-noise ratio (SQNR) of 41.2 dB without using a data scaling approach.
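The SQNR metric used in such a word-length study can be sketched as follows. This is a simplified illustration that quantizes only a random input signal, so it does not reproduce the 41.2 dB figure, which also reflects the rounding inside every stage of the processor. The function names, the uniform test signal, and the one-sign-bit fixed-point format are assumptions.

```python
import math
import random


def quantize(value, frac_bits):
    """Round to a signed fixed-point grid, saturating at the range ends."""
    scale = 1 << frac_bits
    q = max(-scale, min(scale - 1, round(value * scale)))
    return q / scale


def sqnr_db(reference, quantized):
    """SQNR = 10*log10(signal power / quantization noise power)."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10 * math.log10(signal / noise)


def input_sqnr(word_length, n_samples=512, seed=0):
    """SQNR of a random complex signal quantized to word_length bits."""
    rng = random.Random(seed)
    frac_bits = word_length - 1  # one bit reserved for the sign
    ref = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
           for _ in range(n_samples)]
    qnt = [complex(quantize(z.real, frac_bits), quantize(z.imag, frac_bits))
           for z in ref]
    return sqnr_db(ref, qnt)
```

Sweeping `word_length` and replacing the quantizer with a bit-accurate model of the full FFT datapath would give the kind of curve from which a word length is selected.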
After a proper word length was determined, the proposed FFT architecture was designed at the register transfer level (RTL) using Verilog HDL and functionally verified using a commercial Verilog HDL simulator. The entire design was then synthesized using Synopsys Design Compiler with a TSMC 90-nm CMOS technology optimized for a 1.2-V supply voltage. The proposed processor has a gate count of 243,000 and operates at a clock frequency of 385 MHz. Table 2 presents a comparison of the hardware complexity of different 512-point FFT processor architectures. To compare the hardware complexity, the complex multipliers were synthesized and the area of each multiplier was normalized. Compared with the other architectures, the proposed architecture has the lowest total normalized area of the complex multipliers. In addition, there is no need to allocate memory to store the twiddle factors.
Table 3 shows the performance comparison between the proposed 512-point eight-parallel mixed-radix $2^4/2^3$ MDF FFT/IFFT processor using CCM and several existing 512-point FFT processors [9] [10] [11]. The results show that the proposed FFT processor obtains much better SQNR performance than that of [9]. The design in [9] is also a pipelined eight-parallel MDF architecture for a 512-point FFT processor; however, it requires many more complex multipliers and more memory than the proposed design, so the proposed architecture has a lower hardware complexity. Moreover, by employing eight parallel data-paths, the proposed FFT processor reaches a throughput of 3.08 GS/s at 385 MHz, the highest among the designs in Table 3. Finally, Table 3 shows that the proposed architecture has an advantage in hardware complexity: the gate count of the overall FFT processor is reduced by more than 16% compared to that of [9].
This hardware complexity reduction results from using the proposed multi-layer CSD complex multiplier architectures and the sharing technique.
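The reported throughput follows directly from the degree of parallelism and the clock frequency. As a worked check, assuming each of the P = 8 data-paths accepts one sample per clock cycle (the symbols R, P, and T below are introduced here for illustration):

```latex
\[
R = P \cdot f_{\mathrm{clk}} = 8 \times 385\ \text{MHz} = 3080\ \text{MS/s} = 3.08\ \text{GS/s}
\]
\[
T_{512} = \frac{512 / 8}{f_{\mathrm{clk}}} = \frac{64}{385\ \text{MHz}} \approx 166.2\ \text{ns}
\]
```

The second line is the time needed to stream one 512-point block through the eight paths under the same assumption; it excludes pipeline latency.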
V. CONCLUSIONS
This paper presents a 512-point eight-parallel MDF mixed-radix $2^4/2^3$ FFT processor using a novel DPS-MLCCM. In particular, a dual-path sharing technique and a multi-layer CCM architecture for TF multiplication are proposed to efficiently reduce the hardware complexity of FFT processors based on the parallel MDF architecture. From the fixed-point simulation results, the SQNR performance is 41.2 dB with a 12-bit word-length implementation. The synthesized results give an estimated 243,000 NAND gates and a throughput of 3.08 GS/s for the proposed FFT processor. Among the compared eight-parallel 512-point MDF FFT processors, the proposed design is the most area-efficient and has the highest throughput. Therefore, the proposed FFT processor is a promising solution for OFDM systems that require high throughput and low complexity.
