Abstract-This paper describes in detail the design of a custom CMOS fast Fourier transform (FFT) processor for computing a 256-point complex FFT. The FFT is well suited to real-time spectrum analysis in instrumentation and measurement applications. The FFT butterfly processor reported here consists of one parallel-parallel multiplier and two adders. It computes one butterfly every 100 ns and can therefore compute a 256-point complex FFT in 102.4 µs, excluding data input and output processes.
Index Terms-Digital signal processors, discrete Fourier transform (DFT), fast Fourier transform (FFT), FFT butterfly, integrated circuit, silicon implementation, spectrum analysis, very large scale integration.
I. INTRODUCTION
THE discrete Fourier transform (DFT) is of considerable importance in instrumentation, measurement, and digital signal-processing (DSP) applications. However, the computational burden of the DFT long prevented it from being widely implemented in real-time applications. A fast implementation of the DFT is the fast Fourier transform (FFT). With the development of high-speed processors, the FFT has found many real-time applications in the field of measurement and instrumentation. With users demanding higher processing speeds for real-time measurement applications, dedicated FFT processors are replacing general-purpose DSPs in some application areas [1]-[3]. A number of dedicated FFT processor implementations have been reported in the literature [4]-[8]. The FFT processor architecture presented in this paper differs from all of these in that a bit-parallel pipelined butterfly processor is used rather than the commonly used bit-serial counterpart. Also, instead of a column of butterfly processors, a single butterfly processor is deployed. In addition, the processor is programmable in the sense that the basic architecture enables it to be used for FFT operations of different sizes, and it is capable of other commonly used functions such as windowing, filtering, and fast convolution. The FFT processor chip reported in this paper is intended as a demonstrator of the basic architecture and is restricted to 256-point transforms by the on-chip RAM size. In the following sections we describe the architectural design, silicon implementation, and logic-level simulation of our FFT processor.
II. FFT BASICS AND IMPLEMENTATION CONSIDERATIONS
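The $N$-point DFT of a sequence $x(n)$ is defined as [9]

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{1}$$

and the inverse DFT (IDFT) is defined as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1 \tag{2}$$

where

$$W_N = e^{-j 2\pi / N}. \tag{3}$$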
The IDFT is easily computed without any major change to the DFT algorithm. The only extra facility required is the conjugation of the twiddle factor $W_N^{nk}$. This is simply accomplished by negating the imaginary part of $W_N^{nk}$. The algorithm used in our FFT processor implementation is a modified version of Cooley and Tukey's Decimation-In-Time (DIT) FFT algorithm with inputs in natural order and outputs in bit-reversed order, i.e., output scrambling. This input and output configuration is required if the processor is to be used for filtering applications. Fig. 1 shows the form of this scrambling.
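The relationship between the forward and inverse transforms can be illustrated with a short Python sketch. It is a direct $O(N^2)$ evaluation of the DFT sum written for clarity only; it is not the algorithm used on the chip, and the function name `dft` and its `inverse` flag are assumptions made for this illustration.

```python
import cmath


def dft(x, inverse=False):
    """Direct O(N^2) evaluation of the DFT sum.

    With inverse=True the twiddle factor is conjugated (its imaginary
    part is negated) and the result is scaled by 1/N.
    """
    n_points = len(x)
    out = []
    for k in range(n_points):
        acc = 0j
        for n in range(n_points):
            w = cmath.exp(-2j * cmath.pi * n * k / n_points)  # W_N^{nk}
            if inverse:
                w = w.conjugate()  # the only change needed for the IDFT kernel
            acc += x[n] * w
        out.append(acc / n_points if inverse else acc)
    return out
```

Applying `dft` and then `dft(..., inverse=True)` to a sequence returns the original samples to within floating-point error.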
As can be seen from Fig. 1, the modified 8-point DIT-FFT algorithm consists of three butterfly stages. To the left, we have eight input data samples. Input data is multiplied with the twiddle factor $W_N^k$. The solid dots represent addition/subtraction. The outputs are in bit-reversed order. Generally, an $N$-point DIT-FFT algorithm consists of $\log_2 N$ stages, each stage containing $N/2$ butterfly operations [10].
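The stage structure can be sketched in Python as follows. This is one common arrangement of the radix-2 DIT flow graph with natural-order input and bit-reversed output, in which every butterfly in a group shares one twiddle factor whose exponent is the bit-reversed group number times the index spacing. It is assumed here for illustration and has not been checked against Fig. 1; the names `bit_reverse` and `fft_dit_natural_in` are hypothetical.

```python
import cmath


def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_dit_natural_in(x):
    """In-place radix-2 DIT FFT: natural-order input, bit-reversed output."""
    n = len(x)
    stages = n.bit_length() - 1
    if n < 2 or n != 1 << stages:
        raise ValueError("length must be a power of two")
    data = list(x)
    for stage in range(stages):              # log2(N) stages
        spacing = n >> (stage + 1)           # N/2, N/4, ..., 1
        for group in range(1 << stage):      # 2^stage groups per stage
            exponent = bit_reverse(group, stage) * spacing
            w = cmath.exp(-2j * cmath.pi * exponent / n)
            base = group * 2 * spacing
            for offset in range(spacing):    # N/2 butterflies per stage in total
                i0 = base + offset
                i1 = i0 + spacing
                t = w * data[i1]             # twiddle applied before add/subtract
                data[i0], data[i1] = data[i0] + t, data[i0] - t
    return data
```

For an 8-point input the loops execute three stages of four butterflies each, and location `q` of the result holds the DFT coefficient whose index is the bit-reversed value of `q`.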
The butterfly used here is the DIT-FFT radix-2 butterfly, with all wordlength reductions performed at its output: results from the butterfly are scaled and quantized back to 24 bits in order to prevent overflow due to the multiply and add/subtract operations. Convergent rounding is used since it is bias free. The DIT-FFT radix-2 butterfly is shown in Fig. 2 [10].
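Convergent rounding (round half to even) can be sketched on two's complement integers as below. This is a minimal illustration assuming that a wordlength reduction drops a number of least significant bits; the function name `convergent_round` is hypothetical and the code does not model the "retain" mechanism of the actual hardware.

```python
def convergent_round(value, drop_bits):
    """Drop `drop_bits` LSBs of an integer, rounding ties to the even result."""
    if drop_bits == 0:
        return value
    half = 1 << (drop_bits - 1)
    quotient = value >> drop_bits               # floor division by 2**drop_bits
    remainder = value & ((1 << drop_bits) - 1)  # non-negative, also for value < 0
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1                           # round up; ties go to the even value
    return quotient
```

Exact halves are rounded up as often as down, so no systematic bias accumulates: dropping one bit from 1, 3, 5, and 7 (that is, 0.5, 1.5, 2.5, and 3.5) gives 0, 2, 2, and 4.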
The DIT-FFT butterfly takes a pair of complex input data values "$A$" and "$B$" and produces a pair of complex outputs "$X$" and "$Y$", where

$$X = A + W_N^k B \tag{4}$$

$$Y = A - W_N^k B. \tag{5}$$
III. ARCHITECTURAL DESIGN OF THE FFT PROCESSOR
The architecture of our FFT processor can best be understood by tracing through its operation. The operation of the FFT processor is partitioned into three main processes: data input, FFT computation, and data output. The operation cycle starts with the data input process, during which sampled data is read in and stored in memory. During the FFT computation process, the FFT or inverse FFT (IFFT) is computed on the stored data. During the output process, the results of the FFT computation process are read out to the outside world.
The FFT processor architecture consists of a single DIT-FFT radix-2 butterfly (which is referred to as the butterfly processing element or butterfly PE), a dual-port FIFO RAM, a coefficient ROM, a controller, and an address generation unit. Data pathways are in the form of 24-bit two's complement signed fractions. Coefficients are stored as 16-bit two's complement signed fractions. The block diagram representation of the FFT processor is depicted in Fig. 3.
A. Butterfly Processing Element
The butterfly is the core calculation of the FFT and computes a two-point FFT. The entire FFT is performed by combining butterflies in patterns determined by the FFT algorithm. The butterfly PE takes two complex data words from memory and computes the FFT on them. Results are written back to the memory locations of the inputs since an in-place algorithm is used. This makes efficient use of the available memory, as the transformed data overwrites the input data. However, the indexing required to determine from which memory locations to fetch the input data for each butterfly is quite complex.
The structure of the butterfly PE employing straightforward implementation of (4) and (5) using standard real arithmetic units requires four multipliers, four adders, and two subtractors. This level of complexity makes it unsuitable for silicon implementation. A novel silicon area/computation-time efficient architecture is depicted in Fig. 4 .
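To make the operation count of the straightforward form concrete, the radix-2 DIT butterfly can be written out in real arithmetic. The Python sketch below is illustrative only: the function name `butterfly` and the representation of complex words as `(real, imaginary)` tuples are assumptions. It uses four real multiplications and six real additions/subtractions.

```python
def butterfly(a, b, w):
    """Radix-2 DIT butterfly on (real, imag) pairs: returns (a + w*b, a - w*b)."""
    ar, ai = a
    br, bi = b
    wr, wi = w
    # Four real partial products of the complex multiplication w * b.
    p_rr = wr * br
    p_ii = wi * bi
    p_ri = wr * bi
    p_ir = wi * br
    # Sum/difference of cross products gives the complex product.
    prod_r = p_rr - p_ii
    prod_i = p_ri + p_ir
    # Sum and difference outputs of the butterfly.
    x = (ar + prod_r, ai + prod_i)
    y = (ar - prod_r, ai - prod_i)
    return x, y
```

For example, `butterfly((1, 2), (3, 4), (0, -1))` multiplies $3 + 4j$ by $-j$ to obtain $4 - 3j$ and returns `((5, -1), (-3, 5))`. A datapath with a single multiplier would form the four partial products in successive cycles rather than in parallel.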
The butterfly PE is capable of computing one butterfly operation every four cycles. It comprises one parallel-parallel multiplier and two adders. At each computation cycle the multiplier generates a partial product of the complex multiplication $W_N^k B$ of the twiddle factor and a butterfly input $B$, i.e., $W_r B_r$, $W_i B_i$, $W_r B_i$, and $W_i B_r$, where the subscripts $r$ and $i$ denote real and imaginary parts. These results are in 40-bit two's complement signed fraction format. Since computation of the twiddle factors is time consuming, these are precalculated and stored in the coefficient ROM. Only half a cycle of $W_N^k$ is stored, i.e., values in the range $[-1, 1)$ with $k$ varying from 0 to 127. These are stored in 16-bit two's complement signed fraction format. The first adder is 40 bits wide and sums the cross products of the complex multiplication to generate the sum/difference of cross products. The output of this adder is rounded to a 26-bit result using convergent rounding. However, the "retain" word (a variable used in convergent rounding) is not incremented at this stage. Instead, the increment is carried forward as 1 LSB and propagated to the second adder. There it is combined with the "negate" signal of the following negator and evaluated, which saves the 26-bit full adder that would otherwise be required to increment the "retain" word. The second adder produces the sum and difference outputs of the butterfly computation. These results are scaled by 1/2, i.e., shifted right by 1 bit, and clipped if they overflow.
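The contents of such a half-cycle coefficient table can be sketched as follows. The sketch assumes a scaling of $2^{15}$ for the 16-bit signed fractions and assumes that the value $+1.0$, which lies outside $[-1, 1)$, is saturated to the largest positive code; how the chip's ROM treats this case is not stated here. The names `to_q15` and `build_twiddle_rom` are hypothetical.

```python
import math


def to_q15(value):
    """Quantize a real value to a 16-bit two's complement fraction (assumed scale 2**15)."""
    scaled = round(value * 32768)
    return max(-32768, min(32767, scaled))  # saturate +1.0 to 32767 (assumption)


def build_twiddle_rom(n_points=256):
    """Return (real, imag) codes of W_N^k = exp(-j*2*pi*k/N) for k = 0 .. N/2 - 1."""
    rom = []
    for k in range(n_points // 2):
        angle = 2 * math.pi * k / n_points
        rom.append((to_q15(math.cos(angle)), to_q15(-math.sin(angle))))
    return rom
```

For a 256-point transform the table has 128 entries; entry 0 is `(32767, 0)` and entry 64, which corresponds to $W_{256}^{64} = -j$, is `(0, -32768)`.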
Implementing the butterfly PE in this way leads to an increase in computational speed at the cost of increased silicon area relative to using a serial-parallel multiplier. However, bearing the length of the transform in mind, this trade-off is cost effective for achieving high throughput and high speed of operation. The butterfly PE takes four cycles to compute a two-point FFT, with a latency of five cycles. Three of these are associated with the fact that three input components are required before an output can be computed, and two are used to pipeline the RAM read and write operations. Thus, the write and access times of the RAM are not on the critical path of the processor. The target speed for the processor is a clock frequency of 40 MHz, which results in a butterfly computation time of 100 ns (four 25-ns cycles). Allowing for the pipeline delay, the total computation time for a 256-point complex FFT is 102.4 µs, excluding data input and output processes. The butterfly PE sequence is shown in Table I.
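The quoted computation time follows directly from the butterfly count, as the short Python check below shows. It is an arithmetic sketch only: the function name `fft_compute_time_us` is hypothetical, and the few cycles of pipeline latency are ignored.

```python
def fft_compute_time_us(n_points=256, cycles_per_butterfly=4, clock_mhz=40.0):
    """Computation time in microseconds for a radix-2 FFT on a single butterfly PE."""
    stages = n_points.bit_length() - 1       # log2(N) for a power-of-two N
    butterflies = stages * (n_points // 2)   # N/2 butterflies per stage
    cycles = butterflies * cycles_per_butterfly
    return cycles / clock_mhz                # cycles divided by MHz gives microseconds
```

With the default arguments there are 8 stages of 128 butterflies, i.e., 1024 butterflies or 4096 clock cycles, which at 40 MHz corresponds to 102.4 µs.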
B. Address Generation Unit
The purpose of the address generation unit (AGU) is to provide the RAM and the coefficient ROM with the correct addresses. It also keeps track of which butterfly is being computed in which stage. In addition, since address generation differs between the input, output, and FFT computation processes, it keeps track of the mode of operation of the processor and generates the required addresses. A block-level description of the AGU is shown in Fig. 5.
The butterfly generator keeps track of which butterfly is being computed in a particular stage. It is basically a nine-bit up counter, since for a 256-point complex FFT there are 128 butterflies per stage and four data words per butterfly (two real and two imaginary). The counter output is used for addressing the RAM during input and output processes and for providing the basic timing for the FFT process.
The stage generator keeps track of the current stage in the FFT computation, and supplies the base index generator with the number of the stage that is currently being computed. For a 256-point FFT, there are eight stages, and hence the stage generator is basically a three-bit counter which is incremented once every 128 butterfly counts.
The IO address generator is responsible for generating addresses for the RAM during the data input and output processes. During the data input process, the output of the butterfly generator, "butterfly," can be used for addressing 512 locations in the RAM. However, during the data output process, data should be bit-reversed while being written to the outside world. No extra hardware is required to implement the bit reversal, as we simply reverse the wiring of the address bits.
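The effect of reversing the address wiring can be modeled in a few lines of Python. The function name `reverse_address_bits` is hypothetical, and the default width of 8 bits is an assumption chosen to match a 256-point transform.

```python
def reverse_address_bits(address, width=8):
    """Return `address` with its `width` bits in reversed order."""
    bits = format(address, "0{}b".format(width))  # fixed-width binary string
    return int(bits[::-1], 2)                     # reversed wiring = reversed bit order
```

For a 3-bit address the sequence 0 to 7 maps to 0, 4, 2, 6, 1, 5, 3, 7. Applying the mapping twice returns the original address, so the same reversal both scrambles and unscrambles the order.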
The base index generator is responsible for generating addresses during the FFT computation mode. FFT mode address generation is quite complex. The butterfly has two complex data inputs, $A$ and $B$. $A$ is referred to as "index0" and $B$ as "index1." "Index1" can be calculated from "index0" by [8]

$$\text{index1} = \text{index0} + S \tag{6}$$

where $S$ is the index spacing, which can be expressed as $S = N / 2^{m+1}$, where $m$ is the stage number (counted from zero) and $N$ is the transform length. Also, "index0" can be expressed as [8]

$$\text{index0} = 2S\,(\text{butterfly DIV } S) + (\text{butterfly MOD } S) \tag{7}$$

where "butterfly" consists of the first seven bits of the butterfly generator and "index0" is the 8-bit wide RAM address. Table II shows the calculated "index0" for a 256-point FFT computation for each stage [8].
As shown in Table II, "index0" can be derived by simply rearranging the bits in the "butterfly" and inserting zeros in the leading diagonal. "Index1" can simply be obtained from "index0" by replacing zeros on the leading diagonal with ones [8].
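The equivalence between the DIV/MOD form of the address calculation and the bit-insertion view can be checked with the Python sketch below. It assumes that stages are numbered from zero and that the index spacing halves at each stage, starting from half the transform length; the function names are hypothetical.

```python
def butterfly_addresses(butterfly, stage, n_points=256):
    """Return (index0, index1) for one butterfly using the DIV/MOD form."""
    spacing = n_points >> (stage + 1)  # assumed spacing: N/2, N/4, ..., 1
    index0 = 2 * spacing * (butterfly // spacing) + (butterfly % spacing)
    index1 = index0 + spacing
    return index0, index1


def insert_zero_bit(butterfly, position):
    """Insert a 0 at bit `position` of the 7-bit butterfly count, giving 8 bits."""
    high = butterfly >> position
    low = butterfly & ((1 << position) - 1)
    return (high << (position + 1)) | low
```

For example, butterfly 5 gives the address pair (5, 133) in the first stage and (10, 11) in the last stage of a 256-point transform. In every stage, "index0" equals the butterfly count with a zero inserted at the bit position of the spacing, and "index1" is the same word with that bit set to one.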
C. Controller
The controller determines the sequence of events according to the signals it receives from the surrounding units, and it indicates which mode the chip is in, i.e., input, output, or FFT computation. This is important since address generation differs in each mode. It also generates the other control signals needed for housekeeping duties such as incrementing and clearing counters.
IV. SIMULATION
The whole architecture has been simulated at the logic level using simulation models generated with Cascade Design Automation's EPOCH silicon compilation tool. These models included extracted and back-annotated capacitive track-load models. The MENTOR Quicksim package was used to carry out the extracted and back-annotated simulations. During simulations, a variety of test signals, including cosine and sine waves at different frequencies, were fed into the in-phase and quadrature channels of the FFT processor. The logic simulation results in all modes of operation were satisfactory and agreed with the expected outputs, clearing the design for fabrication.
V. CONCLUSION
As the FFT processor was designed and optimized for performing high-speed sum-of-products operations, it is easily deployable in a variety of DSP-based, sum-of-products-intensive instrumentation and measurement applications, such as correlation, convolution, and digital filtering. The processor is implemented in a 0.7-µm CMOS technology. The size of the chip (including pads) is 3.7 mm × 4.1 mm, i.e., 15.17 mm². The size without the pads is 2.7 mm × 3.2 mm, i.e., 8.64 mm². The overall FFT chip plot can be seen in Fig. 6.
The chip architecture consists of a bit-parallel radix-2 DIT-FFT butterfly, a dual-port FIFO RAM, an address generation unit, and a controller. Separate memories are used for storing the data and the coefficients. Although the processor we have designed and reported here is quite small and fast, some improvements can be made. Most of the cells used to build the FFT processor have been optimized for speed rather than area and power consumption. These blocks can be redesigned for reduced area and power consumption. The use of more than one butterfly processing element is another possibility to investigate. The FFT processor is capable of computing a 256-point complex FFT in 102.4 µs, excluding data input and output processes. The chip operates at a clock frequency of 40 MHz.
