I. INTRODUCTION

FFT (fast Fourier transform) is the most popular algorithm in digital signal processing. FFT has many high-end applications, such as radar, sonar, spread-spectrum communications, image processing, general filtering, convolution, etc. Many of them require good precision and real-time response. With the advent of VLSI, digital signal processors (DSP's) provide a convenient way to develop these applications. However, high-performance applications are out of the reach of a single processor, and parallel DSP chips and application-specific DSP chips have been introduced for this reason. Application-specific circuits developed in silicon compute a 1024-point FFT in tens of milliseconds [1]. Higher performance is only possible using several of these silicon chips in parallel. However, digital gallium arsenide (GaAs) technology can provide superior performance using monolithic solutions.
Most signal processing algorithms (e.g., fast Fourier transform) require the evaluation of trigonometric functions, which normally create a bottleneck in digital signal processors. These functions are frequently implemented using a multiplication-and-accumulation unit (MAC), which basically consists of multipliers, adders, and registers. The design and implementation of these primitives has already been studied in gallium arsenide [2]. However, mimicking the silicon solutions in GaAs could produce inefficient systems in terms of performance or cost. The CORDIC (COordinate Rotation DIgital Computer) algorithm has been shown to be an efficient way of evaluating the elementary functions [3], such as trigonometric, exponential, and logarithmic functions. Although the CORDIC algorithm was proposed in 1959, current VLSI technology has created a new interest in a number of applications of the algorithm.
Manuscript received February 3, 1997. This work was supported in part by the ESPRIT project CT93-0385, and the Spanish National Science Foundation (DGICYT) project CE94-0013.
The authors are with the Centro de Microelectrónica Aplicada, Universidad de Las Palmas de Gran Canaria, 35017 Las Palmas de Gran Canaria, Spain (e-mail: roberto@cma.ulpgc.es).
Publisher Item Identifier S 1063-8210(98)01319-5.
In this paper, the architecture and the implementation of a complex fast Fourier transform (CFFT) processor using 0.6 µm gallium arsenide technology are presented. This processor computes a 1024-point FFT of 16-bit complex data in less than 8 µs, working at a frequency beyond 700 MHz, with a power consumption of only 12.5 W.
The architecture of the processor is based on the CORDIC algorithm, which evaluates the trigonometric functions using only add and shift schemes. Previous work [4], [5] has shown that the CORDIC algorithm usually yields slow and area-demanding circuits. Hence, an application-specific CORDIC has been used and other improvements (such as a novel mixed radix2/radix4 approach) have been made in order to overcome these drawbacks. To increase the throughput, pipelining and redundant arithmetic representation have been used. For this architecture, functional units have been designed and optimized for an enhancement/depletion self-aligned process, taking into account such issues as process spread, temperature variations, etc. The processor is laid out in full-custom using the merged logic approach [6], optimizing area and power consumption.
The organization of the paper is as follows. In Section II, an overview of the CORDIC processor is presented. A basic description of the FFT processor architecture, which includes a mixed radix2/radix4 approach is presented in Section III. The design and implementation of the different units and the design of the system using the selected technology is given in Section IV. This section also focuses on the implementation challenges put forward by this demanding technology. In Section V, the performance of the processor is highlighted based on the fabrication and test of processor units. A summary of the work realized and a discussion of future processor extensions and implementations conclude this paper.
II. CORDIC OVERVIEW
The CORDIC algorithm was proposed by Volder [7] and later generalized for the three coordinate system by Walther [8] . After this basic work, modifications to the algorithm have been presented [9] .
The CORDIC algorithm provides an efficient means of computing trigonometric functions by rotating a vector through some angle, specified by its coordinates. This rotation is obtained by performing a number of microrotations through elementary rotation angles $\alpha_i$, into which the total rotation angle has been decomposed. In circular coordinates, the rotation of a vector $(x_i, y_i)$ through an angle $\alpha_i$ is given by

$$x_{i+1} = \cos\alpha_i \,(x_i - y_i \tan\alpha_i) \tag{1}$$
$$y_{i+1} = \cos\alpha_i \,(y_i + x_i \tan\alpha_i) \tag{2}$$
$$z_{i+1} = z_i - \alpha_i \tag{3}$$

where $i$ is the microrotation index, and the final angle is $\theta = \sum_{i=0}^{n-1} \alpha_i$. Equation (3) is for angle updating. The traditional solution for these equations has been in radix2, where the selected angles are $\alpha_i = \sigma_i \tan^{-1}(2^{-i})$ with $\sigma_i \in \{-1, +1\}$, and therefore $\tan\alpha_i = \sigma_i 2^{-i}$. Equations (1) and (2) for radix2 become

$$x_{i+1} = x_i - \sigma_i 2^{-i} y_i \tag{4}$$
$$y_{i+1} = y_i + \sigma_i 2^{-i} x_i \tag{5}$$

Here, the result has been divided by the scaling factor. Hence, for computation of FFT it is necessary to multiply the final result of the CORDIC iterations by this scaling factor. In radix2 the scaling factor is constant, as can be shown in the following expression:

$$K = \prod_{i=0}^{n-1} \cos\alpha_i = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}} \tag{6}$$

The scaling factor compensation requires extra hardware in the processor. Several methods have been developed [3] for this purpose. Among them, in the FFT architecture we have used the repetition of several microrotations and the inclusion of specific scaling microrotations. Other solutions based on modifications to the basic algorithms made to eliminate the scaling factor can be found in the literature [10].
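The radix2 iteration can be illustrated with a short floating-point sketch. It uses the textbook shift-and-add form of CORDIC rotation rather than the fixed-point, redundant-arithmetic datapath of the processor; the function names `cordic_scale` and `cordic_rotate`, the default of 16 microrotations, and the sign-selection rule based on the residual angle are assumptions made for illustration only.

```python
import math

def cordic_scale(n):
    """Product of cos(alpha_i) over n radix-2 microrotations (hypothetical helper)."""
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) by theta radians using n radix-2 microrotations.

    Assumes |theta| is inside the convergence range of the algorithm.
    """
    z = theta
    for i in range(n):
        sigma = 1 if z >= 0 else -1           # direction of this microrotation
        shift = 2.0 ** -i                     # a right shift by i bits in hardware
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * math.atan(shift)         # angle update
    k = cordic_scale(n)                       # scaling compensation
    return x * k, y * k
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 6)` returns values close to the cosine and sine of 30 degrees. In the processor the angle update is not evaluated at run time, since the microrotation directions are read from a ROM.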
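Convergence criteria for the CORDIC algorithm require that the maximum value of the variable $z$ be in the range of

$$|z| \le \sum_{i=0}^{n-1} \tan^{-1}(2^{-i}) \tag{7}$$

which for infinite precision ($n \to \infty$) implies that $|z| \le 1.7433$ rad ($\approx 99.88^\circ$).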
In CORDIC, the precision is obtained as a function of the number of iterations or microrotations. If a precision of $n$ bits is needed, a CORDIC in radix2 should have $n$ microrotations. The main problem of GaAs CORDIC processors (in general) is the large area that is required. Increasing the radix from 2 to 4 brings down the number of microrotations from $n$ to only $n/2$. The higher the radix, the smaller the CORDIC processor area.
However, increasing the radix of the processor has several drawbacks. The most important one is that, in order to keep the convergence of the algorithm, $\sigma_i$ would take the following values: $\{-2, -1, 0, 1, 2\}$. Likewise, the scaling factor has the following expression: (8) and, therefore, is a function of the iteration index because $\sigma_i^2$ is not 1 as in radix2. Moreover, the approximation is not valid in radix4, because $\sigma_i$ is no longer ±1.
On the other hand, as can be derived from (8), the scaling factor in radix4 is only affected by the microrotations from 0 to $n/2 - 1$, and if the radix4 microrotations are introduced after the second half of microrotations (from $n/2$ to $n - 1$) the global scaling factor is kept constant [11], [12]. Furthermore, for the small angles involved in the second half of the microrotations, the error incurred with the simplification would be less than the precision obtained by the CORDIC [13]. This approach has been implemented in the FFT processor in order to save 25% of the microrotation stages.
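The claim that late microrotations do not disturb the scaling factor can be checked numerically. The sketch below is a simplified illustration that assumes radix2 weights $2^{-i}$ with unit rotation direction, which is not necessarily the exact form of (8); `stage_scale` and `late_stage_deviations` are hypothetical helper names.

```python
import math

def stage_scale(i, sigma=1):
    """Scale contribution of one microrotation with shift 2^-i and direction sigma."""
    return 1.0 / math.sqrt(1.0 + (sigma * 2.0 ** -i) ** 2)

def late_stage_deviations(n):
    """Deviation from 1 of the per-stage scale factor for stages n/2 .. n-1."""
    return [1.0 - stage_scale(i) for i in range(n // 2, n)]

if __name__ == "__main__":
    n = 16
    for offset, dev in enumerate(late_stage_deviations(n)):
        # Each deviation is compared with one least significant bit, 2^-n.
        print(n // 2 + offset, dev, dev < 2.0 ** -n)
```

Under these assumptions every stage in the second half changes the scale by less than one least significant bit for $n = 16$, while the last stage of the first half (index 7) still exceeds that bound.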
III. FFT PROCESSOR ARCHITECTURE
Given a discrete sequence of complex numbers $x(n)$, $0 \le n \le N - 1$, its Discrete Fourier Transform $X(k)$, $0 \le k \le N - 1$, is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk} \tag{9}$$

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor. In the FFT algorithm, the $N$-point DFT is computed as two $N/2$-point DFT subproblems for each [14]. Thus, the FFT can be evaluated through a recursive algorithm. Using the decimation in time algorithm, the basic operation of an FFT, called a butterfly operation, is

$$A' = A + W_N^k B, \qquad B' = A - W_N^k B \tag{10}$$

where $A$, $B$, $A'$, and $B'$ are complex numbers. The use of the CORDIC algorithm for computation of that iterative process means that the exponential factor is realized by cosine and sine operations. Since the CORDIC processor develops the computation of the rotated vector, a further postprocessing is required in order to evaluate the butterfly (12), where R denotes the real part and I denotes the imaginary part.
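The recursive decimation-in-time structure and the butterfly can be sketched in a few lines. This is a plain software illustration using complex multiplication for the twiddle factor, where the processor would instead use a CORDIC rotation followed by the postprocessing additions; the names `fft_dit` and `dft` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath

def dft(x):
    """Direct evaluation of the DFT definition, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])                       # N/2-point DFT of even samples
    odd = fft_dit(x[1::2])                        # N/2-point DFT of odd samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)     # twiddle factor
        t = w * odd[k]                            # the rotation a CORDIC would perform
        out[k] = even[k] + t                      # butterfly, upper output
        out[k + n // 2] = even[k] - t             # butterfly, lower output
    return out
```

For a short test vector such as `[1, 2, 3, 4, 0, -1, -2, -3]`, `fft_dit` and `dft` agree to within floating-point rounding.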
The CORDIC processor for CFFT computation includes two subprocessors: the processing section (PS), for computation of equations, and a routing section (RS) for data recirculation [see Fig. 1(a) ]. For simplicity in this paper only the PS will be considered because in the RS the clock frequency is half than in PS and is not critical. The block diagram of the processing section is shown in Fig. 1(b) . It consists of a CORDIC processor and a postprocessing unit. Following the decimation in time algorithm, one set of points is processed in the CORDIC which results in a vector and the other one is used in the postprocessing unit, which implements a specified orthogonal transform (in this case an FFT).
The floorplan of the processor is given in Fig. 2. The real and imaginary components of the input vector are taken as the initial values of $x$ and $y$. The CORDIC processor for realization of a 1024-point FFT of 16 bits requires 18 stages: initiation, seven radix2 microrotations, four radix4 microrotations, four repetitions of radix2 microrotation, and two scaling stages. In this mixed radix2-radix4 architecture, $n/2 + n/4 = 3n/4$ iterations are performed in order to produce $n$ bits of precision. The resulting $x$ and $y$ values are added to the sum that will become after repetitions of this process where . The convergence of the CORDIC algorithm is limited, as has been made clear after (7). In order to expand the angle range of the processor from 0 to $2\pi$ (required for FFT computation), an initiation stage is used. This initiation stage rotates clockwise by all the angles whose associated rotation is greater than , as expressed by (13) where are the data to be processed in the first radix2 microrotation.
In the FFT implementation, the angles are known a priori and therefore the evaluation of (3) is not necessary. The angles can simply be stored in a ROM, which maps the required angles into the equivalent sequence of 's. The control unit is used for reading data stored in the ROM. It consists of a 9-bit counter, a shift register and several 2 : 1 multiplexers. As mentioned before, there are several methods for realization of scaling. In our case some iterations are repeated and two scaling stages are included.
The processor is pipelined and uses redundant notation based on carry-save adders (CSA), for the sake of avoiding carry propagation when evaluating (5) . The postprocessing unit is in charge of the evaluation of (12) , that is simply a sum. In order to speed-up the process CSA's are also used. Finally, a vector merging adder (VMA) stage is used to convert redundant notation to binary notation. This is one of the critical units of the processor and is also pipelined in order to match the microrotations delay.
IV. FFT CHIP IMPLEMENTATION
This section presents the logic design and different basic primitives for VLSI implementation of the CORDIC processor for FFT computation. The design strategy has included three major steps:
• algorithm transformation and physical mapping;
• design and implementation of basic computational and storage primitives;
• abutment and interconnections of cells in order to create the processor.
Due to the complexity of the processor in terms of chip area and power dissipation, the approach used for the design of the primitives has been to optimize the area and power consumption of the cells. A number of logic families are available in GaAs. One of the initial objectives has been to identify the logic families most suited for this architecture. This is based on performance figures measured in terms of propagation delays, power dissipation, transition times, chip area, and noise margins. Since factors such as capacitive loads, fan-out, and process spread have a large influence upon performance, careful consideration has been given to these parameters during the design cycle.
A. GaAs VLSI Technology and Logic
Taking into consideration the complexity of the CORDIC architecture, the 0.6 µm H-GaAs III process of Vitesse Semiconductor Corporation, US, has been selected. The chosen 0.6 µm gate technology features different length MESFET's and up to five levels of interconnections and provides both enhancement (E-type) and depletion (D-type) self-aligned metal semiconductor FET's (MESFET's) [15]. In this process, E-MESFET's are formed initially. This is followed by an additive implantation technique to create D-MESFET's. Therefore, the technology generates circuits which are predominantly fast-fast (fast E and D-MESFET's) or slow-slow (slow E and D-MESFET's). Thus, when studying the influence of process spread on circuit performance, only fast-fast and slow-slow corners need to be considered. For both E-MESFET's and D-MESFET's, the more positive the threshold voltage is, the slower the transistor becomes. For VLSI complexity, tolerance of at least ±2σ is necessary.
Direct coupled FET logic (DCFL) is the most simple, dense and low power consumption logic for GaAs VLSI design. However, its high sensitivity to capacitance load implies need for buffering. Therefore, the approach pursued in the logic design of the CORDIC processor has been to use merged logic which combines different logic structures. SDCFL is used to implement the O-A-I structure, as shown in Fig. 3 . Although OR gate performance is very sensitive to fan-in, with this structure it is possible to design exclusive-OR, exclusive-NOR and multiplexer functions using the same transistor sizes as for SDCFL inverters. This is because the E-MESFET's of the OR gate cannot be active at the same time. The delay of the O-A-I gate is smaller than the delay of the NOR-OR structure in DCFL.
B. Implementation of the CORDIC Processor
The basic operation of the radix2 CORDIC microrotation is the addition/subtraction (4) and (5) . In order to speed up the processor, pipelining and redundant arithmetic representation [16] are used. In redundant notation, each vector is represented by sum and carry words. Hence, the basic cell for the implementation of microrotations is a 4-2 adder/subtracter. A register is necessary for pipelining. The radix2 cell is composed of one 4-2 adder and two D-flip flops. Although the connection of several of these cells together will produce FFT processors for different data widths, in this section the design of the CORDIC cells for a 16 bit wide processor for computation of a 1024 point CFFT is presented. This processor incorporates more than 800 4-2 adders/subtracters and more than 1600 registers in the microrotation stages. It is obvious that both primitives must be optimized in area and power consumption.
1) The Radix2 Cell: The radix2 cell implements (4) and (5). The 4-2 adder/subtracter in redundant notation is implemented using two levels of carry-save adders (CSA) [16], as illustrated in Fig. 4. In the first full adder, the operands , , and are added. The second stage adds the sum ( ) and carry ( ) of the previous stage to . The result of the adder/subtracter is obtained in redundant form of representation. The two exclusive-OR gates, for and data, are included for performing subtraction. A "1" should be added to the least significant bit. Fig. 5 represents the logic diagram of the radix2 cell using the merged logic approach. Basically, the adder has been done using DCFL and SDCFL logic families. In SDCFL the exclusive-OR is realized in the O-A-I structure. The O-A-I produces better results in terms of area and speed than the design in DCFL using only NOR gates. The performance of the 4-2 adder/subtracter and its sensitivity to process variations is shown in Table I.
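The behavior of a 4-2 adder/subtracter in sum/carry redundant notation can be modeled at word level as follows. This is a behavioral sketch, not the gate-level cell of Fig. 4 or Fig. 5: the function names, the 16-bit default width, and the choice to inject a "1" into the free least significant bit of each shifted carry word during subtraction are assumptions made for illustration.

```python
def csa(a, b, c, width):
    """One carry-save (3-2) level: returns (sum word, carry word) modulo 2**width."""
    mask = (1 << width) - 1
    s = a ^ b ^ c                                  # bitwise sum, no carry propagation
    cy = ((a & b) | (a & c) | (b & c)) << 1        # majority carries, shifted left
    return s & mask, cy & mask

def add_sub_4_2(s1, c1, s2, c2, subtract=False, width=16):
    """Combine redundant operands (s1, c1) and (s2, c2) into a redundant result."""
    mask = (1 << width) - 1
    if subtract:
        # The exclusive-OR gates invert both words of the second operand.
        s2, c2 = ~s2 & mask, ~c2 & mask
    t_s, t_c = csa(s1, c1, s2, width)              # first CSA level
    if subtract:
        t_c |= 1                                   # "1" into the empty carry LSB
    s, c = csa(t_s, t_c, c2, width)                # second CSA level
    if subtract:
        c |= 1                                     # second "1" completes two's complement
    return s, c
```

The true value is always `(s + c) mod 2**width`. For example, with operands `(1000, 234)` and `(300, 45)` the addition yields a pair summing to 1579 and the subtraction a pair summing to 889.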
The layout of the carry-save adder is shown in Fig. 6. All inputs and outputs have been chosen to lie on a grid of 5 µm. This facilitates the routing of signals between microrotations. In order to keep as many routing paths open as possible, all horizontal connections inside the cell are made in metal 2 and metal 1, with the exception of some short connections. Gate metal, in spite of its high resistance and capacitance, is used for vertical and short horizontal connections. The power is supplied to the metal 2 power wires by metal 3 power planes (not shown in Fig. 6). In this way, the supply (2 V) and ground (GND) metal 2 wires can be kept with minimum dimensions, producing layouts which are area efficient. The only signals routed in metal 3 are the clock and clear, which are vertical (like all metal 3 lines) in order to simplify their distribution. The radix2 cell width is 157 µm and its height is 109 µm.
It is well known [16] that carry save adders introduce an error called carry overflow, caused by the truncation of the carry generated by the full adder in the sign position. Logic design of this full adder is modified in order to avoid this error. The two most significant bits of sum and carry words, namely, and , are substituted by and as expressed in the following equations: if otherwise that can be directly implemented using exclusive-OR's (14) Due to the more complex function realized in the sign position cells, those cells are part of the critical path in the CORDIC processor.
2) The Radix4 Cell: The radix4 stages implement the following equations: . The realization of radix4 cells includes the same adder/subtracter unit, but when a multiplication by two is needed. This multiplication can be done with a shift. Hence, the radix2 stage can be converted to radix4 stage by including a multiplexer at the input, as shown in Fig. 7 . This multiplexer is designed based on the SDCFL O-A-I structure. The output of this multiplexer is the one's complement of the inputs, so rearrangements of a few connections in the radix2 cell are required.
The layout of the radix4 cell is depicted in Fig. 8. This cell is 132 µm in height, with the same width as the radix2 cell in order to be abutted in the pipeline.
As was mentioned before, a carry overflow correction is also needed in the radix4 cell. Due to the increment of logic gates in the signal path, the cell corresponding to the most significant bit is the one to be considered in the critical path delay computation of the CORDIC processor. Table II summarizes the performance of this cell for different threshold variations.
Other stages of the CORDIC processor, such as scaling compensation stages, are based on these two cells with some minor modifications to obtain simple schemes, and, hence no further details will be given in this paper.
3) D-Flip Flop:
In order to increase the speed of the processor, a segmentation (pipelining) technique is used, by adding two sequential elements per bit (one for addition and one for carry) to the output of each 4-2 adder. The choice and design of the flip-flop are important as they determine the clock structure and, therefore, the final performance. The flip-flop can be designed as one phase or two phases and either level triggered or edge triggered.
In order to avoid complex structures for clock distribution, the falling edge triggered flip-flop of Fig. 9 was selected. However, this flip-flop has the following drawbacks: 1) number of transistors; 2) frequency limitation of where is the average delay per gate; 3) asymmetrical characteristics. The flip-flop was designed using the DCFL class of logic to produce a more compact and faster design. This design has been carried out with area and power consumption optimization. All the gates are sized to the minimal dimensions depending on the fan-out they are driving. So, NOR gates labeled 1 and 5 have 10 µm E-MESFET's, gates 3, 4, and 6 have 12 µm, and gate 2 has 15 µm because they drive a fan-out of 1, 2, and 3, respectively. The flip-flop characteristics can be seen in Table III. The delay of the flip-flop corresponds to the path from clock signal to the output of the flip-flop (17) and the setup is (18). The flip-flop can reach frequencies higher than 2 GHz, as shown in Table III, which are suitable for the purpose of the CORDIC processor, and the power consumption is about 1.9 mW.
Table III also shows the sensitivity of the D-flip-flop to process variations. Since the flip-flop is implemented using three-input NOR gates with a fan-out up to three, the average delay per gate is higher than in the DCFL 4-2 adder implementation.
The output of the flip-flop is taken from which is faster than the output. This output is then suitably buffered to drive the wiring capacitance and fan-out.
4) Rounding Error:
Once all cells involved in the CORDIC processor have been designed the microrotation stages are made by abutment of cells horizontally (depending on the number of data bits) and vertically. For a 16 bit processor 32 cells are required in each row. However, there is an error associated to the fixed point representation used in the processor and some extra bits should be added in order to get the required precision and avoid underflow.
The error incurred in the CORDIC processor for a 1024-point FFT of 16 bits due to the representation of the data in fixed point with a limited number of bits is as follows [13]: (19) where is the spectral norm of the operation realized in the iteration of index , and is the maximum error incurred when representing data in fixed point with bits, so . Evaluating (19) for all microrotation stages, we conclude (20) where and, hence, is taken for the FFT processor, which means that 3 underflow bits should be added to the data width in order to avoid the rounding error. So, data processed in the CORDIC are represented as (0, 18) and (0, 18) where 18 is the least significant bit.
5) Placement of the Microrotation Cells:
The routing area of the CORDIC processor is substantial, since the rotation algorithm requires two shifts of at each level of the pipelined array. Hence, the placement of the different cells play an important role in the area of routing and in the global length of interconnection as well. Two methods have been proposed for placement:
• the standard method, where both vectors (0, 18) and (0, 18) are arranged as continuous vectors;
• the method proposed in [17].
In the standard method, using two levels of metal for routing (metal 1 and metal 2, because the wires are long and gate metal routing would introduce high capacitance), the area required for routing is (21) where is the number of bits, is the microrotation index and is the minimal metal 2 pitch. The second method eliminates the shift of one vector, i.e., (0, 18), by positioning the cell over the cell , . The area of the routing channels and the maximum length of connections are reduced. The maximum length is [ ], where is the width of the cell, 157 µm. In this case the buffer of the D-flip-flop for cells operating on vector (0, 18) is different from that for cells operating on vector (0, 18). However, this method is not suitable for the proposed architecture of the CORDIC processor because, although a reduction in channel area for radix2 stages is achieved, this particular placement scheme creates a higher channel area for radix4 stages.
A more efficient approach is to intercalate the and cells as presented in Fig. 10. The length of the interconnections for the sum bits are and for the carry (22) where is the minimal metal 2 pitch (we took 5.8 µm for H-GaAs III technology). Table IV represents the routing area required in this case for each microrotation compared with the standard placement. A reduction of almost 50% is obtained.
6) Buffering: At the output of each flip-flop an inverter-buffer is placed, whose type and size depend on the associated load due to wiring and fan-out. The load of each cell is variable since the length of the connections grows in the order of where is the iteration index and is the cell width. A sign extension also has to be made, and therefore some of the cells have a high fan-out at their output.
Three versions of the inverter-buffer have been implemented for three different types of load. The first version consists of a DCFL inverter-buffer. This is used when the cell has a low fanout and a small load capability, which is the condition under which most of the cells work. For higher charge values, due to wiring, a SDCFL version has been implemented. Finally, when long length wiring (more than 2 mm) is combined with fanout greater than three, SBFL Logic (super buffer FET logic) is used [18] . The main disadvantage of the SBFL is that it produces high current peaks at gate switching, which implies noise in the power supply buses. Therefore, the power supply for SBFL circuits is implemented separately from the rest of the logic.
C. Coefficients Table and Control Unit
In order to evaluate the butterflies of the computation stage ( ) the angles to take into account are where . This means that the angles involved in the operations in stage are a subset of the angles corresponding to the stage 1. As we have mentioned before, including the initiation stage assures the CORDIC convergence and reduces the number of angles stored in the ROM to . For the 1024 point FFT a 256-word ROM is required. Taking into account that 1 bit is needed for coding of for each radix2 stage and 3 bits for each radix4 stage, we conclude that a 256-word 23-bit ROM is required.
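The ROM dimensions quoted above can be reproduced with simple arithmetic. The sketch below is one reading that matches the stated figures, under two assumptions: the four repeated radix2 microrotations each take their own direction bit, and the initiation stage reduces the stored angles to one quarter of the FFT length. The function names are hypothetical.

```python
def rom_word_bits(radix2_stages=7, radix2_repeats=4, radix4_stages=4):
    """Bits per ROM word: 1 bit per radix-2 stage, 3 bits per radix-4 stage."""
    return (radix2_stages + radix2_repeats) * 1 + radix4_stages * 3

def rom_words(n_points):
    """Number of stored angles, assuming a quarter of the FFT length is needed."""
    return n_points // 4

if __name__ == "__main__":
    print(rom_words(1024), "words of", rom_word_bits(), "bits")
```

With the stage counts of the 18-stage pipeline, this gives 11 + 12 = 23 bits per word and 256 words for the 1024-point FFT.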
However, a further reduction in the ROM size can be obtained using a novel approach: including an additional rotation after the initiation stage. In this case, the size of the ROM can be reduced to positions and the angles computed in the CORDIC are in the range . It can be shown [13] that the coefficients for intervals and differ only in the sign. With an appropriate coding of the coefficients for radix4 it is possible to store only the required one for range . The control unit, based on the angle being evaluated, complements these coefficients when needed. In this way, the ROM designed for storing of 's is 23-bits wide and has only 129-words.
The main problem found at the time of implementation of ROM memories is the large leakage currents which are produced in the MESFET transistors. This fact produces a degradation in the noise margins, which is aggravated by the temperature effects. To overcome this problem, we use PCML (pseudo-current mode logic) in the design of the ROM. The memory which stores the coefficients consists of two cores of 65 × 24 bits, extracted via a multiplexing stage [19]. The maximum access time obtained is 0.8 ns, and the average power is 0.4 W. The area occupied by this memory is 0.9 mm² with a transistor density of 5500 transistors/mm². In order to synchronize the coefficients with the data, it is necessary to add a shift register at the output of the ROM. This shift register is made with D flip-flops, similar to those used for microrotation pipelining. SBFL buffers must be added at the output of these flip-flops in order to drive each line.
D. Postprocessing
The postprocessing stage is application dependent. For the complex Fourier transform, after the CORDIC microrotations, (12) is implemented. As data appear in redundant notation the same adders/subtracters of the basic radix2 cell can be used. Thus the postprocessing stage can be realized with 3-2 carry-save adders.
1) Vector Merging Adder:
The conversion from redundant representation to two's complement is performed by a vector merging adder (VMA). The VMA adds the vector sum and carry for each vector element and propagates the carry. It is clear that this operation takes more than one cycle and some kind of pipelining is needed. Following a study of several adders a carry-lookahead adder (CLA) [2] is implemented connected as shown in Fig. 11 . Table V represents the critical path delays of both adders (CLA1 and CLA2) and the power consumption of the VMA.
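The function of the vector merging adder can be modeled with generate and propagate signals, the quantities on which carry-lookahead adders are built. The sketch below is a behavioral illustration only: software evaluates the carry recurrence sequentially, whereas a hardware CLA flattens it into parallel logic per block, and the 4-bit block size, the 16-bit width, and the name `vma_cla` are assumptions not taken from Fig. 11.

```python
def vma_cla(s, c, width=16, block=4):
    """Add sum word s and carry word c; returns the result modulo 2**width."""
    g = [(s >> i) & (c >> i) & 1 for i in range(width)]      # generate bits
    p = [((s >> i) ^ (c >> i)) & 1 for i in range(width)]    # propagate bits
    result = 0
    carry_in = 0
    for base in range(0, width, block):
        top = min(base + block, width)
        carries = [carry_in]
        for i in range(base, top):
            # c[i+1] = g[i] OR (p[i] AND c[i]); flattened per block in hardware
            carries.append(g[i] | (p[i] & carries[-1]))
        for offset, i in enumerate(range(base, top)):
            result |= (p[i] ^ carries[offset]) << i          # sum bit i
        carry_in = carries[-1]                               # carry into next block
    return result
```

For any pair of 16-bit words, `vma_cla(s, c)` equals `(s + c) & 0xFFFF`, which is the conversion from redundant to two's complement notation described above.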
E. Clock Distribution
The clock distribution is the most critical factor in the design of the FFT processor in GaAs MESFET technology. Although there is only one clock signal, its distribution throughout the whole circuit must be carried out avoiding any possible skew. The clock distribution is made vertically in metal 3 (screened by ground planes) with which short connection sections can be made. Each radix2 cell will be moved along vertically by two clock lines, one for the addition bit flip-flop and the other for the carry bit flip-flop. All the clock lines have the same length and drive the same number of transistors.
The clock distribution is made via a fan-out tree of three and five branches, which in the last branch have 81 clock lines. Given that logic families in GaAs exhibit very different rise and fall times, it is necessary to add two inverters per branch, as can be seen in Fig. 12 , to prevent the pulses from swallowing too much. The size ratio of these inverters is 1 : 3 but in the last stage the ratio is 1 : 1. The inverter-buffer that drives the clock input of the flip-flops, is made with SBFL and has a fan-out of 25 and a load capacitance of approximately 800 fF.
For skew calculations it is necessary to take into account the driver delay over the clock tree and the effect of line delays. As mentioned before, the clock tree has to be designed and implemented in such a way that the load conditions of the drivers and line length in each branch of the tree are exactly the same. The skew in that case would come from the differences in the threshold voltage of the transistors placed far away from each other and from the line delays. The line delay depends on the dielectric used. For Vitesse process the mean delay is about 11 ps/mm. The skew incurred between the first microrotation cells and the last is in the order of 70 ps due to the line effect and about 80 ps in variations of threshold voltage. The total skew is in the order of 150 ps.
V. PERFORMANCE EVALUATION
In order to evaluate the performance of the FFT processor, each of its cells has been simulated (including the critical charge conditions, temperature variations and voltage drop). The complexity of the FFT processor, with regard to the number of transistors, means that we must consider the variations in threshold voltage with the manufacturing process (process spread). Cycle time of the processor is given by the following expression:
Cycle time = Flip-Flop delay + Buffers delay + Radix cell delay + Flip-Flop setup

where the radix cell delay term corresponds to the radix4 cell associated to the most significant bit that requires sign correction. In the buffer delay term, the buffer associated to this cell is considered. Maximum clock skew is also taken into account. Table VI represents the performance of the CORDIC processor and its dependency on process spread.
As can be seen, the frequency can vary from 700 to 870 MHz. In all cases, the noise margins are kept at acceptable levels. The processing section occupies an area of about 36 mm², not including the input/output cells. The package used for this die is the LD256 (a core limited design), an LDCC package which supplies up to 196 inputs/outputs, although less than half are needed. The LD256 has a maximum working frequency of 700 MHz.
Following the delay evaluation in the critical path, and taking into account the clock skew and the most critical conditions, it can be concluded that the FFT processor can operate at frequencies higher than 700 MHz. With this frequency, the evaluation of a 16-bit FFT with 1024 points takes approximately 7.5 µs, with an estimated power consumption of 12.5 W.
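The order of magnitude of this figure can be checked with a rough estimate. The sketch assumes that the pipelined processing section completes one butterfly per clock cycle and ignores pipeline latency and data recirculation overhead; both are simplifying assumptions, and `fft_time_us` is a hypothetical name.

```python
def fft_time_us(n_points, f_clk_hz):
    """Estimated FFT time in microseconds at one butterfly per clock cycle."""
    stages = n_points.bit_length() - 1          # log2(N) for a power-of-two N
    butterflies = (n_points // 2) * stages      # (N/2) * log2(N) butterflies
    return butterflies / f_clk_hz * 1e6

if __name__ == "__main__":
    print(round(fft_time_us(1024, 700e6), 2), "us")
```

For 1024 points this gives 5120 butterflies, or about 7.3 µs at 700 MHz, which is in line with the approximate 7.5 µs quoted above once latency is allowed for.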
Two integrated circuits have been sent for manufacturing, one containing the different cells that make up the microrotations and the other containing ROM coefficients. With the first of these we tested the operation of the cells under real conditions; and with the second we tested the operation of the ROM under diverse temperatures, etc. Fig. 13 shows the chip containing the different cells and VMA.
Application-specific processors for FFT computation implemented in silicon are by far slower than the present GaAs realization; e.g., the BDSP9124 from Butterfly DSP, considered the fastest monolithic implementation in CMOS, computes a 1024-point complex FFT in 54 µs. To reach performance levels similar to those achieved, up to six integrated circuits in CMOS technology working in parallel should be used. This is the case of the series of high performance fast Fourier transform processors from Dassault Electronique, called UFFT, which carry out a 1024-point FFT in 12.8 µs using six integrated circuits of the same type. The GaAs implementation compares favorably in terms of speed, power consumption, and area occupied.
VI. CONCLUSIONS
The GaAs implementation of a 1024-point complex fast Fourier transform for 16-bit data has been reported. The architecture is based on an application-specific CORDIC algorithm which overcomes critical drawbacks of the algorithm. With only a few modifications this processor can be adapted to compute any other orthogonal transform (cosine transform, chirp-Z transform, etc.) or for higher precision. Likewise, the regularity of the architecture makes it convenient for automated synthesis [20]. The CFFT processor implemented in GaAs operates at 700 MHz clock frequency, contains more than 120 000 transistors, and dissipates about 12.5 W. With this performance a 1024-point FFT can be solved in less than 8 µs.
This work also demonstrates the maturity of GaAs technology for implementing ultra-high performance signal processors in only one chip. With this technology and using the same package, an FFT processor for single precision floating point data is also possible. Furthermore, currently available GaAs technologies (gate lengths less than 0.4 µm) make feasible the implementation of double precision floating point processors.
Roberto Sarmiento (S'89-M'92) received the Engineer and Doctorate degrees from the School of Electrical and Electronic Engineering, University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain.
In 1993, he was a Visiting Professor at the University of Adelaide, South Australia, Australia. Since 1987, he has been working in gallium arsenide VLSI design. In this topic, he was Activity Leader within PATMOS project (ESPRIT BRA 3237) and currently is involved in two ESPRIT projects and two research programs funded by the Spanish Government. He is also designing full-custom circuits for Vitesse Semiconductor Corporation using GaAs technology. He has published 14 journal papers and more than 40 conference papers in the field. In 1995 and 1996, he was Director of the international course "GaAs VLSI: Circuits and Systems," held in Las Palmas de Gran Canaria. His research interests are: design for performance, highspeed DSP's, GaAs VLSI design, logic synthesis, and data-path generation.
Félix Tobajas was born in Zaragoza, Spain, in 1971. He received the M.E. degree in telecommunication engineering from the University of Las Palmas de Gran Canaria, Spain, in 1996. He is working toward the Ph.D. degree in telecommunication engineering at the University of Las Palmas de Gran Canaria.
His research interests include computer arithmetic, GaAs VLSI design, and application-specific VLSI architectures for digital signal processing. In 1991, he joined the Centre for Applied Microelectronics as a Researcher engaged on hardware accelerators systems and subsystems design. He has participated in PATMOS (ESPRIT BRA 3237) and is involved in two ESPRIT projects and two research programs funded by the Spanish Government. His research interests include GaAs integrated circuits and systems design and datapath design automation.
Valentín de Armas
