I. Introduction
The pipeline FFT processor is a class of processors for DFT computation that uses fast algorithms. It is characterized by real-time, non-stop processing as the data sequence passes through the processor. It is an $AT^2$ non-optimal approach with $AT^2 = O(N^3)$, since the area lower bound is $O(N)$. The class of pipeline FFT processors probably has the smallest "constant factor" among the approaches that meet the time requirement, owing to its small number, $O(\log N)$, of arithmetic elements (AEs). The difference arises because an AE, especially a multiplier, occupies a much larger area than a register in a digital VLSI implementation.
Various FFT processors have been proposed for hardware implementation. These implementations can be broadly classified into memory-based and pipeline-based architectures. The memory-based architecture, also known as the single processing element (PE) approach, is widely adopted for FFT processor design. This design style usually consists of a main PE and several memory units, so both its hardware cost and its power consumption are lower than those of the other architecture style.
However, this architecture style has long latency and low throughput, and it cannot be parallelized. The pipeline architecture style, on the other hand, avoids these disadvantages at the cost of an acceptable hardware overhead. Pipeline FFT processors generally follow one of two popular design types: the single-path delay feedback (SDF) pipeline architecture or the multiple-path delay commutator (MDC) pipeline architecture. Such implementations are advantageous for low-power design, especially in portable DSP applications. For these reasons, the SDF pipeline FFT is adopted in this work. The proposed architecture uses a distributed arithmetic (DA) based complex multiplier in place of a conventional complex multiplier.
In this paper, a more detailed and complete description of the entire work is provided. The rest of the paper is organized as follows. Section II briefly reviews the Radix-2² fast Fourier transform. Section III presents the Radix-2² FFT architecture. Section IV discusses the existing architecture. Section V describes the distributed arithmetic based complex multiplier used in the proposed architecture. Section VII describes the parallel prefix adder. Results are compared with other architectures in Section VIII, and Section IX provides the conclusion.
II. Radix-2² Decimation-In-Frequency FFT Algorithm
The discrete Fourier transform (DFT) $X(k)$ of an $N$-point discrete-time signal $x(n)$ is defined by:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N \tag{1}$$
Design of Efficient Pipelined Radix-22 Single Path Delay Feedback FFT
where $W_N$ denotes the primitive $N$th root of unity, with its exponent evaluated modulo $N$, $x(n)$ is the input sequence, and $X(k)$ is the DFT. Applying a 3-dimensional linear index map (equation (2)) and the common factor algorithm (CFA) yields a set of four DFTs of length $N/4$ (equation (3)), where $n_1, n_2, n_3$ are the index terms of the input sample index $n$, $k_1, k_2, k_3$ are the index terms of the output sample index $k$, and $H(k_1, k_2, n_3)$ is expressed in equation (4).
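For reference, the index map and decomposition described here take the following standard form in the radix-2² literature. This is a sketch of the usual derivation, with the equation numbers matched to the text on the assumption that (2) is the index map, (3) the set of length-$N/4$ DFTs, and (4) the butterfly term $H$:

$$n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N \tag{2}$$

$$X(k_1 + 2 k_2 + 4 k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2 k_2)} \right] W_{N/4}^{n_3 k_3} \tag{3}$$

$$H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right] + (-j)^{(k_1 + 2 k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right] \tag{4}$$

with $n_1, n_2, k_1, k_2 \in \{0, 1\}$ and $n_3, k_3 \in \{0, \dots, N/4 - 1\}$. The factors $(-1)^{k_1}$ and $(-j)^{(k_1 + 2 k_2)}$ come from $W_N^{(N/2) n_1 k_1}$ and $W_N^{(N/4) n_2 (k_1 + 2 k_2)}$ respectively, which is why the first two butterfly stages need only trivial multiplications.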
Above equation (4) 
III. Radix-2² FFT Architecture
Mapping the Radix-2² DIF FFT algorithm derived above onto the radix-2 SDF architecture yields a new architecture, the R-2² SDF approach. Figure 1 outlines an implementation of the Radix-2² SDF signal flow graph for $N = 16$; note the similarity of the data path to R2SDF and the reduced number of multipliers. The implementation uses two kinds of butterflies: one is identical to that in R2SDF, and the other also contains the logic to implement the trivial twiddle factor multiplication.
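The decomposition that this architecture maps onto hardware can be checked numerically. The following is a minimal algorithmic sketch in Python, not a model of the hardware: the function names `dft` and `radix22_dif_fft` are illustrative, the recursion follows the standard radix-2² DIF form (two trivial butterfly stages, one full twiddle multiplication, then four DFTs of length $N/4$), and the output is written in natural order rather than the bit-reversed order produced by the pipeline.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def radix22_dif_fft(x):
    """Radix-2^2 DIF FFT for len(x) a power of two; output in natural order."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    if N == 2:  # leftover radix-2 butterfly when N is not a power of four
        return [x[0] + x[1], x[0] - x[1]]
    q = N // 4
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            sign = (-1) ** k1               # (-1)^k1, first butterfly stage
            rot = (-1j) ** (k1 + 2 * k2)    # trivial twiddle (-j)^(k1 + 2*k2)
            g = []
            for n3 in range(q):
                h = (x[n3] + sign * x[n3 + 2 * q]) \
                    + rot * (x[n3 + q] + sign * x[n3 + 3 * q])
                # the only non-trivial multiplication: W_N^(n3 * (k1 + 2*k2))
                g.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N))
            sub = radix22_dif_fft(g)        # DFT of length N/4 over n3
            for k3 in range(q):
                X[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return X
```

For a 16-point input, one level of recursion uses a full complex multiplication only after every second butterfly stage, which is the multiplier saving noted for Figure 1.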
Due to the spatial regularity of the Radix-2² algorithm, the processor's synchronization control is very simple. A $\log_2 N$-bit binary counter serves two purposes: it acts as the synchronization controller and as the address counter for reading the twiddle factors in each stage. With the help of the butterfly structures shown in Figure 3, the scheduled operation of the R-2² SDF processor in Figure 2 is as follows. During the first $N/2$ cycles, the 2-to-1 multiplexers in the first butterfly module switch to position "0", and the butterfly is idle. The input data from the left are directed to the shift registers until they are filled. During the next $N/2$ cycles, the multiplexers switch to position "1", and the butterfly computes a 2-point DFT from the incoming data and the data stored in the shift registers.
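The fill-then-compute schedule of a single SDF stage can be illustrated with a small cycle-level model. This is a sketch under simplifying assumptions: `sdf_stage` and its `depth` parameter are hypothetical names, the shift register is modelled as a FIFO initialised to zero, and the multiplexer position is derived from the cycle count as a counter bit would be. In position 1 the sum is sent on and the difference is written back into the FIFO, so it emerges during the following position-0 phase.

```python
from collections import deque

def sdf_stage(samples, depth):
    """One radix-2 SDF butterfly stage with a feedback FIFO of length `depth`."""
    fifo = deque([0] * depth)            # shift registers, initially empty
    out = []
    for t, x in enumerate(samples):
        stored = fifo.popleft()          # oldest value leaves the delay line
        if (t // depth) % 2 == 0:
            # mux position 0: butterfly idle, pass stored data, load new input
            out.append(stored)
            fifo.append(x)
        else:
            # mux position 1: 2-point DFT of stored and incoming sample
            out.append(stored + x)       # sum goes on to the next stage
            fifo.append(stored - x)      # difference is fed back
    return out

# One 4-point frame followed by four zero samples to flush the stage.
result = sdf_stage([1, 2, 3, 4, 0, 0, 0, 0], depth=2)
```

With `depth = N/2` this corresponds to the first butterfly module; later stages would use `N/4`, `N/8`, and so on.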
The butterfly outputs $Z_1(n)$ and $Z_1(n + N/2)$ are computed according to the equation. $Z_1(n)$ is sent on to have the twiddle factors applied, and $Z_1(n + N/2)$ is sent back to the shift registers to be "multiplied" during the following $N/2$ cycles, when the first half of the next frame of the time sequence is loaded. The operation of the second and third butterflies is similar to that of the first, except that the "distance" of the butterfly input sequence is $N/4$ and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, which requires a 2-bit control signal from the synchronizing counter. The data then pass through a full complex multiplier. Further processing repeats this pattern, with the distance of the input data halving at each consecutive butterfly stage. After $N-1$ clock cycles, the complete DFT result streams out to the right in bit-reversed order. The next frame can be transformed without pausing because each stage is pipelined.
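The trivial twiddle factor multiplication can be sketched as follows. Multiplying $b_R + j b_I$ by $-j$ gives $b_I - j b_R$, so no multiplier is needed: the real and imaginary parts are swapped and one of them changes sign, which the add/subtract control absorbs. The function name `bf2` and the `rotate` flag are illustrative assumptions, not names from the design.

```python
def bf2(a, b, rotate):
    """Butterfly with optional trivial twiddle.

    a and b are (real, imag) tuples. Returns (a + w*b, a - w*b),
    where w = -j if rotate is True and w = 1 otherwise.
    """
    ar, ai = a
    br, bi = b
    if rotate:
        # (br + j*bi) * (-j) = bi - j*br: swap parts, negate the new imaginary part
        br, bi = bi, -br
    return (ar + br, ai + bi), (ar - br, ai - bi)
```

For example, with `a = (1, 2)` and `b = (3, 4)`, `bf2(a, b, True)` gives the same values as computing `(1+2j) + (-1j)*(3+4j)` and `(1+2j) - (-1j)*(3+4j)` directly.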
IV. Existing Architecture
Traditional hardware implementations of FFT/IFFT processors usually employ a ROM to look up the required twiddle factors, together with full word-length complex multipliers to perform the FFT computation. However, this increases the hardware cost, so a bit-parallel complex multiplication scheme is used to address this issue.
Since the twiddle factors have a symmetric property, the complex multiplications used in the FFT computation can be one of the following three operation types:
Given the above three equations, any twiddle factor can be obtained by combining these primary twiddle factor elements. In other words, any twiddle factor used in the FFT can be derived using these operation types, which significantly reduces the size of the ROM used to store the twiddle factors.
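The ROM saving from twiddle factor symmetry can be illustrated with a short sketch. It does not reproduce the three operation types referred to above; it assumes the common one-eighth-circle scheme, in which only the angles from 0 to $\pi/4$ are stored and every other twiddle factor is obtained by swapping the cosine and sine entries and by sign changes. The names `build_octant_rom` and `twiddle` are hypothetical, and $N$ is assumed to be a multiple of 8.

```python
import cmath
import math

def build_octant_rom(N):
    """Store (cos, sin) only for angles 2*pi*m/N with 0 <= m <= N/8."""
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]

def twiddle(k, N, rom):
    """Rebuild W_N^k = exp(-2j*pi*k/N) from the one-octant ROM."""
    quadrant, r = divmod(k % N, N // 4)
    if r <= N // 8:
        c, s = rom[r]                # angle lies in the first octant
    else:
        s, c = rom[N // 4 - r]       # cos(pi/2 - t) = sin(t): swap the entries
    w = complex(c, -s)               # W_N^r for an angle in the first quadrant
    return w * (-1j) ** quadrant     # each quadrant is a multiplication by -j

rom = build_octant_rom(64)
```

In this sketch a 64-point transform needs 9 stored entries instead of 64. In hardware, the final multiplication by a power of $-j$ would be a swap and sign change rather than a true multiplication.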
Figure 5: Conventional Complex Multiplier
The Booth multiplication algorithm has been used for the complex multipliers. As a result, the overall latency of the real-time implementation varies as the processing word length changes.
V. Distributed Arithmetic Method for Complex Multiplication
In this section, the complex multiplication operation using distributed arithmetic (DA) is explained. DA can be used to implement a multiplication if either the multiplicand or the multiplier is fixed. It stores the possible combinations of the fixed operand in a ROM; the appropriate combination is then added and shifted according to the bits of the other operand. The method for DA-based complex multiplication can be summarized as follows. In the proposed architecture, a pipelined Radix-2² SDF FFT unit is designed without using multipliers. All the complex multiplications required for this type of FFT are implemented using the DA technique.
The detailed architecture of the complex multiplier is shown in Figure 6. The real and imaginary parts of the incoming words, $B_R$ and $B_I$, are stored in two 8-bit-wide parallel-in serial-out registers. Shifting is carried out from the LSB to the MSB.
The output bits of these two registers are used as the address lines of the ROMs. The ROMs store the precalculated outcomes for both $Z_R$ and $Z_I$. The size of each ROM is 4×8. One input of the 2:1 MUX is fed directly from the ROM output, and the other input is the inverted ROM output. The input and output bit width of the MUX is also 8 bits. The select line of the MUX is the "cin" signal, which remains "0" until the MSB arrives at the output. If "cin" is 1, the MUX selects the inverted ROM output, which is added to the value stored in the partial product register (PPR). The PPR is an 8-bit-wide parallel-in parallel-out register that also performs a 1-bit right shift. Finally, the output is taken from the left shift register.
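The bit-serial DA scheme described above can be sketched in software. This is a behavioural model under stated assumptions, not the circuit of Figure 6: the function name `da_complex_multiply` is hypothetical, the twiddle factor is given as integers `c` and `s` (its quantised real and imaginary parts), each 4-entry ROM is addressed by one bit of $B_R$ and one bit of $B_I$, and the accumulator keeps full precision by shifting the ROM word left instead of shifting the PPR right. The MSB of a two's complement operand carries negative weight, so the ROM word is subtracted on the last cycle, which plays the role of the inverted MUX input selected by "cin".

```python
def da_complex_multiply(b_re, b_im, c, s, width=8):
    """Return (z_re, z_im) = (b_re + j*b_im) * (c + j*s) using bit-serial DA."""
    # ROM contents, address = (b_re bit) + 2 * (b_im bit)
    rom_re = [0, c, -s, c - s]      # contributions to z_re = b_re*c - b_im*s
    rom_im = [0, s, c, s + c]       # contributions to z_im = b_re*s + b_im*c
    mask = (1 << width) - 1
    u_re, u_im = b_re & mask, b_im & mask   # two's complement bit patterns
    z_re = z_im = 0
    for i in range(width):          # LSB first, as in the shift registers
        addr = ((u_re >> i) & 1) | (((u_im >> i) & 1) << 1)
        sign = -1 if i == width - 1 else 1  # sign bit: subtract the ROM word
        z_re += sign * (rom_re[addr] << i)
        z_im += sign * (rom_im[addr] << i)
    return z_re, z_im
```

For example, `da_complex_multiply(-5, 7, 90, 38)` performs eight ROM look-ups and additions and returns the same pair as evaluating `-5*90 - 7*38` and `-5*38 + 7*90` directly. No multiplier is involved, only a ROM, an adder, and shifts.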
VII. Parallel Prefix Adder
The Brent-Kung adder is a parallel prefix adder. Parallel prefix adders are a special class of adders based on the use of generate and propagate signals. The simpler Brent-Kung adder was proposed to address the disadvantages of Kogge-Stone adders: its cost and wiring complexity are greatly reduced. It is considered one of the better tree adders for minimizing wiring tracks, fan-out, and gate count, and it is used as a basis for many other networks. The block diagram of the 8-bit Brent-Kung adder is shown in Fig. 7.
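The generate/propagate prefix computation can be sketched at bit level. The model below is a minimal illustration and not the netlist of Fig. 7: the name `brent_kung_add` is hypothetical, the carry-in is assumed to be zero, and `width` is assumed to be a power of two. It uses the usual Brent-Kung structure of an up-sweep that builds block prefixes followed by a down-sweep that fills in the remaining positions.

```python
def brent_kung_add(a, b, width=8):
    """Add two unsigned `width`-bit integers with a Brent-Kung prefix network.

    Returns (sum modulo 2**width, carry_out). `width` must be a power of two.
    """
    p0 = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate bits
    G = [((a >> i) & (b >> i)) & 1 for i in range(width)]    # generate bits
    P = p0[:]

    def combine(hi, lo):
        # prefix operator: (G, P)[hi] o (G, P)[lo]
        G[hi] |= P[hi] & G[lo]
        P[hi] &= P[lo]

    step = 2
    while step <= width:                 # up-sweep: prefixes over aligned blocks
        for i in range(step - 1, width, step):
            combine(i, i - step // 2)
        step *= 2

    step = width // 2
    while step >= 2:                     # down-sweep: fill remaining positions
        for i in range(step + step // 2 - 1, width, step):
            combine(i, i - step // 2)
        step //= 2

    # G[i] is now the carry out of bit i; the carry into bit 0 is zero
    carries = [0] + G[:-1]
    total = sum((p0[i] ^ carries[i]) << i for i in range(width))
    return total, G[width - 1]
```

For 8 bits, the up-sweep uses 4 + 2 + 1 prefix cells and the down-sweep 1 + 3, that is 11 cells in total, compared with 17 for an 8-bit Kogge-Stone network.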
VIII. Result And Discussion
The Single Path Delay Feedback FFT was modeled in Verilog using a module-based structural approach. Both SDF FFT designs were synthesized using Xilinx version 13.2 on a Spartan-3 device. All operations can be simulated in a single run; the behavioural simulation is performed by executing the test bench file. Table I shows the comparison between the existing and the proposed complex multiplier. There is a large difference between the two because the existing design uses multiplier blocks, whereas the proposed design uses no multiplier and is built mainly from registers, and a register takes less area than a multiplier. The delay is also reduced because of the use of the parallel prefix adder.
IX. Conclusion
A multiplierless pipelined Radix-2² Single Path Delay Feedback FFT has been described in this paper. Area and delay are the main metrics considered in the performance evaluation of the SDF FFT. A distributed arithmetic based complex multiplier is used in the proposed model.
Considering the symmetric property of the twiddle factors in the FFT, we have designed a distributed arithmetic based complex multiplier such that the size of the twiddle factor ROM is significantly reduced. A new approach is proposed in this paper to reduce the area and delay of the SDF FFT. Replacing the conventional complex multiplier with a DA-based complex multiplier that uses a parallel prefix adder offers a significant advantage. The results show that our design has less area as well as less delay than the existing one. The proposed architecture can also be adapted to high-point FFT applications, with smaller twiddle factor ROMs.
