In this letter, a unified hardware architecture that can be reconfigured to calculate 2, 3, 4, 5, or 7-point DFTs is presented. The architecture is based on the Winograd Fourier transform algorithm (WFTA). Its complexity equals that of a 7-point DFT in terms of adders/subtracters and multipliers, plus only seven multiplexers introduced to enable reconfigurability. The processing element finds potential use in memory-based FFTs where non-power-of-two sizes are required, such as in DMB-T.
Introduction:
The discrete Fourier transform (DFT) is an important algorithm in the field of digital signal processing. It transforms a signal from the time domain into the frequency domain, providing information about the spectrum of the signal. The direct computation of an N-point DFT requires a number of operations proportional to $N^2$. To reduce the number of arithmetic operations, many fast algorithms have been proposed, such as the Cooley-Tukey [1], prime factor (PFA) [2], and Winograd Fourier transform (WFTA) [3] algorithms. Here, we refer to them collectively as fast Fourier transform (FFT) algorithms. These algorithms are based on decomposing an N-point DFT recursively into smaller DFTs, leading to a reduction of the computational complexity [4].
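As a point of reference for the quadratic cost mentioned above, the following minimal sketch evaluates the DFT definition directly and counts complex multiplications. The function name `direct_dft` and the counting convention (one multiplication per input/twiddle pair, trivial ones included) are assumptions made for illustration only.

```python
import cmath


def direct_dft(x):
    """Evaluate the N-point DFT definition directly.

    Returns the spectrum and the number of complex multiplications
    performed (trivial multiplications by 1 are counted as well).
    """
    n = len(x)
    mults = 0
    spectrum = []
    for k in range(n):
        acc = 0j
        for i in range(n):
            # Twiddle factor W_N^(k*i) = exp(-j*2*pi*k*i/N)
            acc += x[i] * cmath.exp(-2j * cmath.pi * k * i / n)
            mults += 1
        spectrum.append(acc)
    return spectrum, mults


spectrum, mults = direct_dft([1, 2, 3, 4, 5, 6, 7])
print(mults)  # 49, i.e. N*N for N = 7
```

The count grows as the square of N, which is what the fast algorithms cited above aim to avoid.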
Most FFT algorithms and architectures have focused on power-of-two size DFTs. Recently, however, the interest in non-power-of-two size DFTs has increased, mainly motivated by the 3780-point DFT in Chinese digital TV (DMB-T) [5, 6], which is based on orthogonal frequency-division multiplexing (OFDM). On the receiving side of OFDM systems, an inverse DFT (IDFT) is usually required, which is easily computed using a DFT processor.
Most FFT architectures are not well optimised for the computation of non-power-of-two-point FFTs, which use small-point DFTs of varying sizes and require more complex data management. Some pipelined architectures for the 3780-point DFT in DMB-T have been proposed [5, 6]. However, the streaming nature of a pipelined architecture means that it can often process data at a much higher rate than the required 7.56 Mb/s. Hence, the amount of computational resources is often excessive. In [7], individual processing elements for 3 and 5-point DFTs were proposed and considered for a pipelined architecture. However, they were not based on the WFTA and have a slightly higher complexity.
Memory-based FFTs are often more suitable for low data rate applications (where the clock frequency offered by the implementation technology is higher than the data rate), as they allow the computational resources to be reused to a higher degree [8]. For a non-power-of-two memory-based FFT, a number of challenges remain. One is how to carry out the more complex data management needed to interconnect the small DFTs. Another is to develop a processing element that is suitable for computing small-point DFTs of different sizes. This letter presents a unified architecture to compute the 2, 3, 4, 5, and 7-point DFTs with a single processing element. This architecture can be used as the computational core of a memory-based architecture for any DFT whose size can be decomposed into the factors 2, 3, 4, 5, and 7.
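To illustrate which DFT sizes such a processing element could serve, the sketch below checks whether a size N can be written as a product of the factors 2, 3, 4, 5, and 7. The function name `decompose` and the greedy order (largest factor first, radix 4 preferred over radix 2) are illustrative assumptions; the letter does not prescribe how a given size is split into stages.

```python
SUPPORTED_SIZES = (7, 5, 4, 3, 2)  # small DFT sizes, tried in this order


def decompose(n):
    """Return a list of supported small-DFT sizes whose product is n.

    Returns None if n contains a prime factor other than 2, 3, 5 or 7.
    The greedy order prefers radix-4 stages over pairs of radix-2 stages.
    """
    if n < 1:
        return None
    factors = []
    for size in SUPPORTED_SIZES:
        while n % size == 0:
            factors.append(size)
            n //= size
    return factors if n == 1 else None


print(decompose(3780))  # [7, 5, 4, 3, 3, 3]
print(decompose(11))    # None
```

For the 3780-point DFT used in DMB-T, this yields 3780 = 7 x 5 x 4 x 3 x 3 x 3, so every stage maps onto one of the five supported sizes.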
The proposed unified architecture is based on the WFTA. This algorithm requires the minimum number of multiplications, at the expense of a larger number of additions [3]. Although the WFTA is very efficient for small prime-size DFTs, for larger sizes the number of additions becomes too high for practical implementations.
Architectures of 2, 3, 4, 5, and 7-Point DFTs:
Figure 1 shows the individual signal flow graphs of the 2, 3, 4, 5, and 7-point DFTs. The signal flow graphs of the 3, 5, and 7-point DFTs are based on the WFTA, whereas the 4-point DFT is based on the Cooley-Tukey FFT algorithm and the 2-point DFT is a direct computation. In the WFTA, the DFT computation can be written as
$$X = O \cdot M \cdot I \cdot x$$
where x and X are the input and output vectors, I is a matrix corresponding to additions between inputs, M is a diagonal matrix with the multiplications, and O is a matrix corresponding to additions after the multiplications. Multiplications are performed by semi-complex multipliers at the second stage. A semi-complex multiplication has a complex input but a purely real or purely imaginary coefficient and, hence, has half the implementation cost of a general complex multiplication. The multiplication coefficients for each size are also shown in Fig. 1. Finally, the numbers at the inputs represent the indices n of the input sequence x(n), whereas those at the outputs are the frequency indices k of the output signal X(k).
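The three-stage structure (input additions, diagonal multiplications, output additions) can be sketched for the 3-point case. The matrices below follow the textbook 3-point Winograd factorisation and are an assumption for illustration; they are not copied from Fig. 1, whose exact graph and coefficient naming may differ. The names `I_MAT`, `M_DIAG`, `O_MAT`, and `wfta3` are hypothetical.

```python
import cmath
import math

U = 2 * math.pi / 3

# Input additions: x0+x1+x2, x1+x2, x2-x1
I_MAT = [[1, 1, 1],
         [0, 1, 1],
         [0, -1, 1]]
# Diagonal of M: each coefficient is purely real or purely imaginary,
# so every product is a semi-complex multiplication.
M_DIAG = [1, math.cos(U) - 1, 1j * math.sin(U)]
# Output additions
O_MAT = [[1, 0, 0],
         [1, 1, 1],
         [1, 1, -1]]


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def wfta3(x):
    """3-point DFT computed as O * M * I * x."""
    pre = matvec(I_MAT, x)                       # additions only
    prod = [c * p for c, p in zip(M_DIAG, pre)]  # diagonal multiplications
    return matvec(O_MAT, prod)                   # additions only


def direct_dft(x):
    """Reference: direct evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
            for k in range(n)]


x = [1 + 2j, -3, 0.5j]
assert all(abs(a - b) < 1e-9 for a, b in zip(wfta3(x), direct_dft(x)))
```

Only two non-trivial multiplications remain in this sketch, one with a real and one with an imaginary coefficient, which is why semi-complex multipliers suffice in the middle stage.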
Proposed Unified Architecture:
The unified architecture is based on mapping the signal flow graphs in Fig. 1 onto a single processing element. As a starting point, we consider a direct mapping of the 7-point DFT, onto which all the other sizes are mapped. As the number of operations for 7 points is the largest and, hence, enough computational resources are available, the main challenge is to reduce the number of multiplexers. To achieve this, it is important to find common parts in the signal flow graphs that can be mapped without multiplexers. On the one hand, multiplexers can be avoided by setting the inputs of the circuit to zero or by setting the coefficient of a multiplier to zero. This removes unnecessary connections of the circuit. On the other hand, by setting the coefficient of a multiplier to one, this multiplier can be bypassed, which also removes the need for a multiplexer. Using these techniques, a solution with only two two-to-one multiplexers and five single-gate multiplexers, controlled by four control signals, has been found. Each single-gate multiplexer can be implemented by a single gate, as shown in Fig. 2. The resulting architecture is shown in Fig. 3. The required multiplier coefficients are shown in Table 1, where the C_XY coefficient values are based on Fig. 1, and the values 0 and 1 are used to break or bypass the operators, respectively, as discussed above. The input and output relations are shown in Table 2, where the dashes denote don't-care conditions and 0 denotes that the input should be zeroed for proper operation. In both cases, zero values can be fed from a zero value stored in memory or generated by a single-gate multiplexer as shown in Fig. 2. Finally, the signals controlling the multiplexers are shown in Table 3, where the dashes again denote don't-care conditions.
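The idea of reusing one fixed datapath by zeroing inputs and changing only multiplier coefficients can be shown on a deliberately small toy case. The sketch below is not the proposed architecture and does not reproduce Tables 1 to 3: it assumes a textbook 3-point Winograd datapath (hypothetical name `pe3`) and shows that a 2-point DFT is obtained on the same adders and multipliers by feeding zero to the unused input and loading a different coefficient pair, one of which is zero.

```python
import math

U = 2 * math.pi / 3

# Coefficient sets per DFT size for the toy datapath (illustrative values,
# not the C_XY entries of Table 1).
COEFFS = {
    3: (math.cos(U) - 1, 1j * math.sin(U)),
    2: (-2, 0),  # a zero coefficient breaks the unused branch
}


def pe3(x, coeffs):
    """Fixed toy datapath: the wiring never changes, only inputs and coeffs."""
    x0, x1, x2 = x
    c1, c2 = coeffs
    t1 = x1 + x2       # input additions
    d = x2 - x1
    m0 = x0 + t1
    m1 = c1 * t1       # multiplier stage
    m2 = c2 * d
    s1 = m0 + m1       # output additions
    return [m0, s1 + m2, s1 - m2]


def dft2_on_pe3(a, b):
    """2-point DFT on the 3-point datapath: third input is zeroed."""
    out = pe3([a, b, 0], COEFFS[2])
    return out[0], out[1]  # the third output is a don't-care


print(dft2_on_pe3(3, 5))  # (8, -2)
```

No multiplexer is needed in this toy case because the zeroed input and the zero coefficient remove the contributions that the 2-point DFT does not use, which mirrors the reasoning given above for the 7-point datapath.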
Conclusions:
In this letter, a reconfigurable unified processing element architecture for computing 2, 3, 4, 5, and 7-point DFTs is proposed. It is suitable as the core computational unit when computing DFTs in memory-based architectures. The processing element can be used for any DFT size that can be decomposed into the included sizes.
