Abstract-A large family of signal processing techniques consists of Fourier-transforming a signal, manipulating the Fourier-transformed data in a simple way, and reversing the transformation. Fourier frequency analysis is widely used in the equalization of audio recordings, X-ray crystallography, artefact removal in neurological signal and image processing, voice activity detection in brainstem speech-evoked potentials, speech processing (where spectrograms are used to identify phonetic sounds), and so on. The Discrete Fourier Transform (DFT) is a principal mathematical method for frequency analysis. Different ways of splitting the DFT give rise to various fast algorithms. In this paper, we present the implementation of two fast DFT algorithms in order to evaluate their performance. One is the popular radix-2 Cooley-Tukey fast Fourier transform (FFT) algorithm [1], and the other is the Grigoryan FFT, based on splitting by the paired transform [2]. We evaluate the performance of these algorithms by implementing them on Xilinx Virtex-II Pro [3] and Virtex-5 [4] FPGAs, using our own FFT processor architectures. Finally, we show that the Grigoryan FFT runs faster than the Cooley-Tukey FFT and is consequently useful at higher sampling rates, which are a challenge in DSP applications.
I. INTRODUCTION
In recent decades, fast orthogonal transforms have been widely used in data compression, pattern recognition and image reconstruction, interpolation, linear filtering, and spectral analysis. The suitability of unitary transforms in each of these applications depends on the properties of their basis functions as well as on the existence of fast algorithms, including parallel ones. Since the introduction of the Fast Fourier Transform (FFT), Fourier analysis has become one of the most frequently used tools in signal/image processing and communication systems. The main problem in calculating the transform is the construction of the decomposition, namely, the transition to short DFTs with minimal computational complexity. The computation of unitary transforms is a complicated and time-consuming process. Since the decomposition of the DFT is not unique, it is natural to ask how to manage the splittings and how to obtain the fastest DFT algorithm. The difference between the lower bound on arithmetical operations and the complexity of fast transform algorithms shows that it is possible to obtain FFT algorithms of various speeds [2].
One approach is to design efficient, manageable split algorithms. Indeed, many algorithms make different assumptions about the transform length. Signal/image processing in engineering research is increasingly dependent on the development and implementation of algorithms for orthogonal or non-orthogonal transforms and convolution operations in modern computer systems. The increasing importance of processing large vectors and of parallel computing in many scientific and engineering applications requires new ideas for designing super-efficient transform algorithms and their implementations [2].
In this paper we present the implementation techniques and results for two different fast DFT algorithms. The two algorithms differ in the way they split the DFT. The algorithms considered are the radix-2 algorithm and the paired transform algorithm [2]. Both are implemented on Xilinx Virtex-II Pro [3] and Virtex-5 [4] FPGAs. Their performance is compared in terms of sampling rate and hardware resource utilization.
Section II presents the paired transform decomposition used in the development of the Grigoryan FFT. Section III presents the implementation techniques for the radix-2 and paired transform algorithms on FPGAs. Section IV presents the results. Finally, Section V concludes the work and puts forward some suggestions for further sampling rate improvements.
II. DECOMPOSITION ALGORITHM OF THE FAST DFT USING PAIRED TRANSFORM
In this algorithm, the decomposition of the DFT is done by using the paired transform [2]. Let $\{x(n)\}$, $n = 0:(N-1)$, be an input signal, $N > 1$. Then the DFT of the input sequence
which is in matrix form 
which shows that the applied transform is decomposed into short transforms $F_{N_i}$, $i = 1:k$. Let $S_F$ be the domain of the transform $F$, that is, the set of sequences $f$ over which $F$ is defined. Let $(D; \sigma)$ be a class of unitary transforms revealed by a partition $\sigma$. For any transform $F \in (D; \sigma)$, the computation is performed by using the paired transform in this particular algorithm. To denote this type of transform, we introduce "paired functions" [2]. , then the complex function 
4) Perform the $L^{r-k}$-point DFTs over $Y_k$, $k = 1:r$.
5) Make the permutation of outputs, if needed.
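The splitting idea can be illustrated in software. The following is a minimal sketch for the case N = 2^r, built on the standard decimation-in-frequency identity rather than on the authors' exact formulation in [2]. The function names (`dft`, `paired_split`, `paired_fft`) are hypothetical, and the short DFTs are computed directly for clarity. It assumes the common sign convention W_N = exp(-2*pi*j/N).

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here for the short transforms."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * p / N) for n in range(N))
            for p in range(N)]

def paired_split(x):
    """Additions-only stage: split x (length 2**r) into r+1 short sequences
    of lengths N/2, N/4, ..., 1, 1 using only sums and differences."""
    parts = []
    s = list(x)
    while len(s) > 1:
        half = len(s) // 2
        # differences feed the short DFT for one group of output frequencies
        parts.append([s[n] - s[n + half] for n in range(half)])
        # sums are passed on to the next splitting level
        s = [s[n] + s[n + half] for n in range(half)]
    parts.append(s)  # single value: the sum of all samples, i.e. X(0)
    return parts

def paired_fft(x):
    """DFT via splitting: additions, then twiddle multiplications, then short DFTs."""
    N = len(x)
    X = [0j] * N
    parts = paired_split(x)
    for k, y in enumerate(parts[:-1]):
        M = len(y)  # M = N / 2**(k+1)
        # multiplication stage: twiddle factors exp(-j*pi*n/M) = W_{2M}^n
        tw = [y[n] * cmath.exp(-1j * cmath.pi * n / M) for n in range(M)]
        Y = dft(tw)  # short M-point DFT
        for m in range(M):
            X[(2 * m + 1) << k] = Y[m]  # outputs land at frequencies (2m+1)*2**k
    X[0] = complex(parts[-1][0])
    return X
```

In this sketch the first block of N/2 differences yields all odd-indexed coefficients, the next block yields the coefficients at indices 2, 6, 10, and so on, so groups of coefficients become available independently of one another. The final index assignment plays the role of the output permutation.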
III. IMPLEMENTATION TECHNIQUES
We have implemented various architectures for radix-2 and paired transform processors on Virtex-II Pro and Virtex-5 FPGAs. Since embedded multipliers [3] and embedded block RAMs [3] are available, we can use them in place of distributed logic, which saves some of the CLBs [3]. The Virtex-5 [4] additionally provides DSP48E slices, which we use to improve the speed of the two FFTs and to compare their speed on Virtex-5 and Virtex-II Pro FPGAs. Since most transforms are applied to complex data, the arithmetic unit always needs two values at a time for each operand (the real part and the imaginary part), so dual-port RAMs are very useful in all these implementation techniques.
In the FFT process, the butterfly operation is the main unit on which the speed of the whole process depends: the faster the butterfly operation, the faster the FFT. The adders and subtractors are implemented using LUTs (distributed arithmetic). The inputs and outputs of all arithmetic units can be registered or non-registered.
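To make the role of the butterfly concrete, the following is a minimal software sketch of a radix-2 decimation-in-time butterfly and an iterative FFT built from it. It is an illustration only, not the hardware design described here. The function names are hypothetical and the sign convention W_N = exp(-2*pi*j/N) is assumed.

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 butterfly: one complex multiplication, one addition, one subtraction."""
    t = w * b
    return a + t, a - t

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    bits = N.bit_length() - 1
    # reorder the input into bit-reversed order
    X = [complex(x[bit_reverse(i, bits)]) for i in range(N)]
    size = 2
    while size <= N:  # one pass per iteration (stage), log2(N) passes in total
        half = size // 2
        for start in range(0, N, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)  # twiddle factor
                p, q = start + j, start + j + half
                X[p], X[q] = butterfly(X[p], X[q], w)
        size *= 2
    return X
```

Each of the log2(N) iterations performs N/2 butterflies, so the total is (N/2) log2(N) butterflies. This is why the latency of a single butterfly dominates the overall transform time.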
The multiplier implementations we considered are:
Embedded multiplier:
a) with non-registered inputs and outputs,
b) with registered inputs or outputs, and
c) with registered inputs and outputs.
Distributed multiplier: Distributed multipliers are implemented using the LUTs in the CLBs. They can also be implemented in the three ways listed above.
Various considerations were made in implementing the butterfly operation, regarding both speed and resource requirements. Based on the number of available embedded multipliers and on design feasibility, we implemented both kinds of multiplier.
The architectures proposed for implementing the radix-2 and paired transform processors are the single memory (pair) architecture, the dual memory (pair) architecture, and the multiple memory (pair) architecture. We applied the two best butterfly techniques for the implementation of the processors on the FPGAs.
The single memory (pair) architecture (shown in Figure 1) is suitable for single-snapshot applications, where samples are acquired and processed thereafter. The processing time is typically greater than the acquisition time. The main disadvantage of this architecture is that the next incoming data cannot be loaded while the transform is in progress; we have to wait until the current data has been processed.
We therefore proposed the dual memory (pair) architecture for faster sampling rate applications (shown in Figure 2). In this architecture there are three main processes in the transformation of the sampled data: loading the sampled data into the memories, processing the loaded data, and reading out the processed data. Since two pairs of dual-port memories are available, one pair can be used for loading the incoming sampled data while the other pair is used for processing the previously loaded data.
For further sampling rate improvements we proposed the multiple memory (pair) architecture (shown in Figure 3). This is the best of the architectures for very high sampling rate applications, but it uses considerably more hardware resources than the others. In this model there is one memory set and one arithmetic unit for each iteration. The advantage of this model over the previous ones is that we do not need to wait until the end of all iterations (i.e., the whole FFT process) before taking the next set of samples and starting the FFT process again. We only need to wait until the end of the first iteration, then load the memory with the next set of samples and start the process again.
After the first iteration, the processed data is transferred to the next set of RAMs, so the previous set of RAMs can be loaded with the next incoming data samples. This leads to an increased sampling rate.
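The effect of the three memory architectures on throughput can be estimated with a simple cycle-count model. The sketch below is a rough illustration under stated assumptions, not a description of the measured designs. It assumes one sample loaded per clock and one butterfly completed per clock, and it ignores read-out time and pipeline latency. The function name and parameters are hypothetical.

```python
def frame_period(n_points, architecture, load_cycles_per_sample=1, butterfly_cycles=1):
    """Estimated clock cycles between the starts of two consecutive frames.

    Assumed model: loading takes n_points cycles, each iteration takes
    n_points/2 butterfly slots, and there are log2(n_points) iterations.
    """
    stages = n_points.bit_length() - 1           # log2(N) for a power of two
    load = n_points * load_cycles_per_sample
    stage = (n_points // 2) * butterfly_cycles
    if architecture == "single":
        # load, then process every iteration before loading again
        return load + stages * stage
    if architecture == "dual":
        # loading one memory pair overlaps with processing the other
        return max(load, stages * stage)
    if architecture == "multiple":
        # the first memory set is free again after the first iteration
        return load + stage
    raise ValueError("unknown architecture: " + architecture)

for arch in ("single", "dual", "multiple"):
    print(arch, frame_period(256, arch))
```

Under these assumptions a 256-point transform needs 1280 cycles per frame with a single memory pair, 1024 with two pairs, and 384 with one memory set per iteration. The ordering agrees with the qualitative comparison of the architectures; the absolute numbers depend entirely on the assumed cycle counts.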
In the implementation of the paired transform based DFT algorithm, there is no complete butterfly operation as in the radix-2 algorithm. According to the mathematical description given in Section II, the arithmetic unit is divided into two parts, an addition part and a multiplication part. This is the main difference between the two algorithms, and it is what allows the DFT to complete earlier than with the radix-2 algorithm. The addition part of the algorithm for the 8-point transform is shown in Figure 4. The architectures are implemented for 8-point, 64-point, 128-point, and 256-point transforms on both Virtex-II Pro and Virtex-5 FPGAs. The radix-2 FFT algorithm is efficient in terms of resource utilization, while the paired transform algorithm is very efficient for higher sampling rate applications.
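As a worked illustration of an addition part for N = 8, the sums and differences can be written out explicitly. This follows from the standard decimation-in-frequency identity and may differ in ordering or notation from Figure 4. The names s(n), y_1, y_2, y_3 and y_0 are illustrative and do not come from the source.

```latex
% Addition part for N = 8 (illustrative notation, not the authors')
\begin{aligned}
s(n)   &= x(n) + x(n+4), & n &= 0,1,2,3,\\
y_1(n) &= x(n) - x(n+4), & n &= 0,1,2,3,\\
y_2(n) &= s(n) - s(n+2), & n &= 0,1,\\
y_3(0) &= [s(0)+s(2)] - [s(1)+s(3)],\\
y_0(0) &= [s(0)+s(2)] + [s(1)+s(3)].
\end{aligned}
```

This stage uses 14 complex additions or subtractions and no multiplications. With W_N = exp(-2*pi*j/N), the sequence y_1(n) W_8^n followed by a 4-point DFT gives X(1), X(3), X(5), X(7). The sequence y_2(n) W_4^n followed by a 2-point DFT gives X(2) and X(6). The remaining two values are outputs directly: y_3(0) = X(4) and y_0(0) = X(0).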
IV. THE IMPLEMENTATION RESULTS
Results obtained on Virtex-II Pro and Virtex-5 FPGAs: The hardware modeling of the algorithms is done using the Xilinx System Generator plug-in tool running in the Simulink environment of MathWorks MATLAB. The functionality of the model is verified using the Simulink simulator as well as ModelSim. The implementation is done using the Xilinx Project Navigator back-end tools. Table 1 shows the implementation results of the two algorithms on Virtex-II Pro FPGAs. Table 2 shows the implementation results of the two algorithms on Virtex-5 FPGAs. From Tables 1 and 2 we can see that the Grigoryan FFT is faster than the Cooley-Tukey FFT in all the implementations listed. Thus the paired transform based algorithm can be used for higher sampling rate applications. In military applications, only some of the DFT coefficients are needed at a time during processing. The paired transform suits this type of application because it generates some of the coefficients earlier, and it is also very fast.
V. CONCLUSIONS AND FURTHER RESEARCH
In this paper we have shown that on both Virtex-II Pro and Virtex-5 FPGAs the paired transform based Grigoryan FFT algorithm is faster than the Cooley-Tukey FFT and can be used at higher sampling rates, at the expense of higher resource utilization.
1. In all implementations on the FPGAs, the data width is 16 bits, so all multipliers are used as 16-bit multipliers, although the embedded multipliers themselves are 18-bit. If an application uses only 8-bit data, one can use the 40 dedicated multipliers as 80 multipliers, since two multiplications can be implemented with a single embedded multiplier as long as the sum of the bit widths of the two products is less than 36 bits.
