Transform (FFT) in the hardware paradigm has been a major challenge for design engineers. Twiddle Factor generation and the subsequent complex multiplication are the decisive steps in the VLSI implementation of FFT. Conventional FFT analyzers call for a dedicated memory bank to store the twiddle factor angles in a predefined order. This storage results in increased resource utilization, which grows with N, the length of the Fourier Transform. This study presents a phase generation scheme that generates the necessary twiddle factor angles with simple hardware logic, depending on the present step and stage of the FFT. This eliminates the need for memory storage elements. Use of CORDIC to carry out complex multiplication further enhances system throughput. The present logic has been synthesized on a Spartan 3E FPGA. The timing diagram results match the theoretical analysis, and the synthesis report supports minimal hardware resource utilization.
I. INTRODUCTION
The Discrete Fourier Transform (DFT) plays a crucial role in Digital Signal Processing and is a powerful computational tool for evaluating the Fourier Transform of finite-length sequences. The DFT finds widespread usage in applications such as Synthetic Aperture Radar, medical image processing, Software Defined Radio [1] and wireless communication protocols using OFDM. However, direct implementation of the DFT is avoided due to its computational complexity. The Fast Fourier Transform (FFT) is an efficient means of DFT computation, reducing the run-time complexity from O(N^2) to O(N log2 N) by exploiting the symmetry and periodicity of the Twiddle Factor [2].
In the recent past, there has been rapid progress in FPGA technologies. The renewed interest is mainly due to increased device size and the emergence of fast hardware floating-point libraries. FPGA co-processors have become an extremely cost-effective means of off-loading computationally intensive algorithms, enhancing overall system performance. Reconfigurable computing architectures are sufficiently flexible that new operations can be implemented on the existing hardware, while being fast enough for real-time execution [3]. Moreover, the price/performance ratio of these systems makes them a broadly competitive alternative to ASICs. Reduced development time and decreased production cost, together with higher throughput, make FPGAs a suitable hardware paradigm for low-power, real-time applications.
The most computationally demanding stage of any FFT engine, especially in hardware, is the butterfly operation. Conventionally, complex multipliers and adders constitute the butterfly unit, which is the major speed impediment of FFT processors [4]. Though VLSI multiplier structures can efficiently perform complex multiplication, they are not very efficient for trigonometric operations. To accomplish this, a common multiplier uses a Look-Up Table (LUT). Although this enhances throughput, the main setback is the requirement of a huge LUT in the form of ROM. The CORDIC algorithm is a hardware-efficient alternative that eliminates the need for dedicated multiplier hardware. Using CORDIC instead of a complex multiplier not only simplifies the hardware but also has the advantage of low switching activity, which is ideal for low-power applications.
Of late, the design of FFT architectures on VLSI platforms has been a subject of thorough research [5]-[8]. Some FFT designs based on CORDIC have also been reported for various applications. Lin et al. [9] proposed a modified CORDIC algorithm using mixed-scale rotation to reduce the total number of iterations, albeit at the cost of increased hardware complexity. Using non-iterative CORDIC micro-rotations, a non-recursive FFT has been proposed in [10]. Though that study reduces the ROM size, it does not eliminate the ROM completely. A memory-less architecture with a reduced memory requirement has been presented in [11], but it calls for a complex implementation.
For the butterfly operation, a CORDIC-based FFT engine requires the storage of only the twiddle factor angles in ROM, instead of the actual twiddle factors. Typically, dedicated memory banks are needed to store the twiddle factor angles for the rotation in such processors. This calls for a separate memory element, thereby increasing the overall resource utilization of the system. In this study, we present an online phase generation logic for the computation of twiddle factor angles. The proposed logic generates the necessary angles online, successively, through a shifter-comparator arrangement. With this approach, the memory requirements of the FFT processor are considerably reduced, as the decisive angles are now generated rather than being stored in a dedicated memory element.
The paper is organized as follows. In Section II, a brief mathematical description of FFT and CORDIC is provided. The online phase generation logic and its hardware design are presented in Section III. Section IV contains the FPGA synthesis results and the resource utilization summary, while Section V concludes the paper.
II. FAST FOURIER TRANSFORM AND CORDIC

A. Fast Fourier Transform-An Introduction
The FFT algorithm has become almost ubiquitous in high-speed signal processing applications. An FFT produces exactly the same result as evaluating the DFT definition directly; the only difference is that an FFT is much faster [2]. The Discrete Fourier Transform (DFT) of an N-point discrete-time complex sequence x(n), indexed by n = 0, 1, ..., N-1, is defined by
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (1)$$
where $W_N = e^{-j2\pi/N}$ is commonly known as the Twiddle Factor.
The excessively large number of computations required to compute the DFT directly when N is large has prompted the development of alternative methods for computing the DFT efficiently. Fig. 1 shows a butterfly of the radix-2 Decimation in Frequency (DIF) FFT algorithm. All such radix algorithms have a similar structure, differing only in the computation of the butterflies.
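As an illustration of the radix-2 DIF structure described above, the following is a minimal software sketch, not the hardware design of this paper. The function names `dft_direct` and `fft_dif_radix2` are hypothetical, and the twiddle factors are computed in floating point with `cmath.exp` purely for reference. Each butterfly forms the sum a + b and the product (a - b)·W_N^k, and the twiddle exponent k is the index within the group scaled by 2^stage.

```python
import cmath


def dft_direct(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    n_pts = len(x)
    w_n = cmath.exp(-2j * cmath.pi / n_pts)  # twiddle factor W_N
    return [sum(x[n] * w_n ** (n * k) for n in range(n_pts))
            for k in range(n_pts)]


def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT (N must be a power of two)."""
    a = list(x)
    n_pts = len(a)
    stages = n_pts.bit_length() - 1
    if n_pts < 2 or (1 << stages) != n_pts:
        raise ValueError("length must be a power of two >= 2")

    half = n_pts // 2   # butterfly span in the current stage
    stride = 1          # twiddle exponent multiplier, 2**stage
    while half >= 1:
        for start in range(0, n_pts, 2 * half):
            for j in range(half):
                k = j * stride  # twiddle index for this butterfly
                w = cmath.exp(-2j * cmath.pi * k / n_pts)
                top, bot = a[start + j], a[start + j + half]
                a[start + j] = top + bot                # a + b
                a[start + j + half] = (top - bot) * w   # (a - b) * W_N^k
        half //= 2
        stride *= 2

    # DIF leaves the results in bit-reversed order; undo that here.
    out = [0j] * n_pts
    for i in range(n_pts):
        rev = int(format(i, "0{}b".format(stages))[::-1], 2)
        out[rev] = a[i]
    return out
```

Comparing `fft_dif_radix2(x)` against `dft_direct(x)` for a short test sequence is a quick way to check the butterfly indexing.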
B. Basics of CORDIC
In the CORDIC (Coordinate Rotation Digital Computer) algorithm, a vector (x, y) can be rotated through an arbitrary angle θ to obtain a new vector (x', y') [12]. The generalized equation governing CORDIC operation is given by
Since planar rotations commute, the order of operations is interchangeable, i.e.
Thus, the basic concept of the CORDIC computation is to decompose the desired rotation angle θ into the weighted sum of a set of predefined elementary rotation angles (α i ) satisfying the following condition
where b is the desired number of bits of precision and i = 0 to b-1.
In other words, the rotation angle θ can be expressed as 
Factorizing cosine terms in Eq. (4) leads to
Fig. 2(a) sketches the overall computation scheme of a traditional FFT computation paradigm. The usual practice is to store the Twiddle Factor values in a separate memory [13]. Our study employs the CORDIC algorithm for computing the butterfly operation and also eliminates the LUT-based concept of storing Twiddle Factors, as in Fig. 2(b). The details of the phase generation methodology and some customizations carried out for CORDIC usage are elaborated subsequently.
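The CORDIC-based twiddle multiplication can be illustrated with a short floating-point sketch. It assumes the textbook rotation-mode CORDIC, with elementary angles arctan(2^-i), direction bits chosen from the sign of the residual angle, and a final division by the accumulated gain. The function name `cordic_rotate`, the ±π/2 pre-rotation used to cover all quadrants, and the use of floating point instead of fixed-point shifts are assumptions made for illustration, not details taken from the presented hardware.

```python
import math


def cordic_rotate(x, y, theta, bits=16):
    """Rotate the vector (x, y) by theta radians, with |theta| <= pi.

    Textbook rotation-mode CORDIC: each micro-rotation uses only a
    scaling by 2**-i (a shift in hardware) and an add/subtract.
    """
    # Pre-rotate by +/- pi/2 so the residual angle falls inside the
    # convergence range of the micro-rotations (about +/- 99.9 degrees).
    if theta > math.pi / 2:
        x, y, theta = -y, x, theta - math.pi / 2
    elif theta < -math.pi / 2:
        x, y, theta = y, -x, theta + math.pi / 2

    gain = 1.0
    for i in range(bits):
        sigma = 1.0 if theta >= 0 else -1.0   # rotation direction
        shift = 2.0 ** -i                      # tan(alpha_i) = 2**-i
        x, y = x - sigma * y * shift, y + sigma * x * shift
        theta -= sigma * math.atan(shift)      # residual angle
        gain *= math.sqrt(1.0 + shift * shift)

    # Compensate the constant CORDIC gain accumulated over all iterations.
    return x / gain, y / gain
```

For a butterfly, the real and imaginary parts of (a - b) would be passed as x and y, and theta would be the twiddle angle, for example -2πk/N for W_N^k.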
A. Phase generation (kθ) for Butterfly Input
Since a CORDIC is expected to rotate a vector by any angle between 0 and 2π, covering all quadrants, the angles are represented not in conventional radians but in a normalized 2's complement format with weights chosen from the MSB side as -π, π/2, π/4, …, π/2^(b-1), b being the number of bits of precision. This form of representing the angle offers the advantage of easy identification of the quadrant pertaining to the amount of rotation, simply by observing the first two bits from the MSB side. Also, for twiddle factor ( The values of kθ are modified as in Table II, and so on, for different stages. From the above tables it is observed that bits b0 to b6 and b15 always remain 0, and b7 to b14 correspond to the outputs of an 8-bit up-counter, with the output sequence varying with the stage of operation, as summarized in Table III.
TABLE I: TWIDDLE FACTOR VALUE
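The normalized angle format can be made concrete with a small sketch for b = 16. The helper names (`encode_angle`, `decode_angle`, `quadrant_bits`, `twiddle_angle_word`) are hypothetical. The sketch assumes the bit weights stated above (b15 = -π, b14 = π/2, down to b0 = π/2^15) and N = 512, so that the basic angle π/256 equals π/2^8 and lands on bit b7.

```python
import math

BITS = 16
LSB = math.pi / 2 ** (BITS - 1)  # weight of bit b0, i.e. pi / 2**15


def encode_angle(theta):
    """Quantise theta in [-pi, pi) to a 16-bit two's complement word."""
    return round(theta / LSB) & 0xFFFF


def decode_angle(word):
    """Convert a 16-bit word back to radians (MSB carries weight -pi)."""
    signed = word - (1 << BITS) if word & 0x8000 else word
    return signed * LSB


def quadrant_bits(word):
    """Two MSBs of the word: 00 -> [0, pi/2), 01 -> [pi/2, pi),
    10 -> [-pi, -pi/2), 11 -> [-pi/2, 0)."""
    return word >> (BITS - 2)


def twiddle_angle_word(k):
    """Word for k * pi/256 (N = 512): the 8-bit k occupies b7..b14,
    while b0..b6 and b15 stay zero."""
    return (k & 0xFF) << 7
```

For example, `twiddle_angle_word(1)` and `encode_angle(math.pi / 256)` yield the same word, which is why an 8-bit counter placed on b7 to b14 suffices to represent kθ.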
B. Hardware Implementation
Though the present design can be extended to any higher transform length (N), the case of N=512 has been considered for presentation in this paper.
Initially, only the 8 bits corresponding to b7 to b14 are generated using an 8-bit up-counter, and they are then concatenated with the other bits to represent the value of kθ in 16-bit two's complement format (Fig. 4). Since there are nine different sets of outputs corresponding to the number of stages, two 9:1 MUXes are used to select the desired output, with the select input of each MUX being controlled by a Stage counter that counts from 0 to 8. As there are (N/2) 256 butterfly operations in each stage, the enable input of the Stage counter is controlled by a Step counter that counts from 0 to 255, shown in Fig. 4, to ensure that the Stage counter is incremented only after 256 outputs of a stage. For stage 0, d0 of MUX1 receives the Step counter output for the full cycle from 0 to 255, which is passed to the output without any shift, as the select signal of the MUX remains at the same value continuously for 256 cycles. During the next phase, when the Stage counter output is 1, the Step counter output fed to d1 of MUX1 is reset after 128 counts (0 to 127). Further, this output is multiplied by 2, by left-shifting the data by one bit at the input of MUX2. So the output of MUX2 is available as 0, 2, 4, …, 254, and this process is repeated twice. The outputs for the other stages are generated similarly. With the 8 bits (d7 to d14) already generated, these bits are concatenated with the other 8 bits (d0 to d6 and d15, all equal to zero) to provide the kθ value as the theta input to the CORDIC for twiddle factor multiplication. The angle generator thus generates the respective twiddle factor angle for each stage and step of the butterfly. This value is fed as input to the CORDIC block, which rotates (a-b) by this angle, i.e., multiplies it by the corresponding twiddle factor.
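The counter and shifter behaviour described above can be modelled in a few lines of software. This is a behavioural sketch under stated assumptions, not the synthesized logic: the names `k_index`, `ktheta_word` and `phase_sequence` are hypothetical, and the generalisation from stages 0 and 1 (reset after 256 >> stage counts, then shift left by stage bits) is inferred from the radix-2 DIF structure rather than spelled out for every stage in the text.

```python
N = 512
STAGES = 9        # log2(512), Stage counter range 0..8
STEPS = N // 2    # 256 butterflies per stage, Step counter range 0..255


def k_index(stage, step):
    """Twiddle index k for a given stage and step.

    The step count wraps after (256 >> stage) counts (comparator/reset),
    and the wrapped value is shifted left by `stage` bits (shifter),
    i.e. multiplied by 2**stage.
    """
    span = STEPS >> stage
    return (step % span) << stage


def ktheta_word(stage, step):
    """16-bit angle word: the 8-bit index sits on b7..b14, with b0..b6
    and b15 tied to zero, so the word represents k * pi/256."""
    return k_index(stage, step) << 7


def phase_sequence():
    """Yield (stage, step, word) in the order the hardware would emit."""
    for stage in range(STAGES):
        for step in range(STEPS):
            yield stage, step, ktheta_word(stage, step)
```

In this model, stage 0 produces the indices 0 to 255 once, stage 1 produces 0, 2, 4, …, 254 twice, and the last stage produces only index 0, with no stored angle table involved.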
IV. RESULTS AND DISCUSSION
The architecture has been mapped onto a Xilinx Spartan 3E FPGA XC2S200 device with speed grade -5. A technology-independent schematic view, which gives a basic logic representation of the circuit, and the overall synthesis result of the block are shown in Fig. 5 and Fig. 7 respectively. The ISim timing diagram for stage zero of a 512-point FFT is depicted in Fig. 6; din counter stands for each bit of precision and counts from 0 to 15 for the 16 bits of precision of each step. After its full count, the step counter is incremented by one for another 15 clocks, and so on. Kq out contains the generated phase value for each step. Following the convention of Table I, it is observed that for stage 0 the phase values for steps 0, 1 and 2 are 0, π/256 and π/128 respectively, which is in accordance with the theoretical results. The device utilization summary, shown in Table IV, suggests minimal hardware resource consumption. It was found that the architecture can operate at a minimum time period of 13.212 ns with a maximum frequency of 76.22 MHz, using a total memory of 201MB.
V. CONCLUSION
In this work, leveraging the structured description of the FFT algorithm, an FPGA-based phase generation framework for FFT has been presented. The phase generation technique eliminates the need for the dedicated memory bank seen in conventional FFT engines. Although the design has been targeted at a Radix-2 Decimation in Frequency FFT processor with N=512, it can be extended to other values of N with minor structural alterations. The adopted methodology is based on shifters and comparators, which significantly reduce the hardware as well as the latency introduced thereby. Such reduced hardware logic, when ported to an FPGA chip, is ideal for low-power, real-time spectral analysis applications.
