A fully asynchronous fixed point FFT processor is introduced for low power space applications. The architecture is based on an algorithm developed by Suter and Stevens specifically for a low power implementation. The novelty of this architecture lies in its high localization of components and pipelining with no need to share a global memory. High throughput is attained using large numbers of small, local components working in parallel. A derivation of the algorithm from the discrete Fourier transform is presented, followed by a discussion of circuit design parameters, specifically those relevant to space applications. A survey of this application specific architecture is included with a detailed look at the design of the complex-valued Booth multiplier to demonstrate the design methodology of this project. Finally, simulation results based on layout extractions are presented and an outline for future work is given.
Introduction
This paper discusses the background, design, and implementation methodology of an asynchronous fixed point FFT processor. The algorithm used to perform the FFT was specifically tailored for low power hardware realizations using asynchronous communication. This project has been motivated by several factors. First, a formal mathematical approach to develop low power high performance numerical applications is being investigated by Suter and Stevens [8]. An implementation of this circuit is being designed and will be fabricated to accurately evaluate and validate the power benefits of this formal approach. The effort to take an academic paper design to a complete integrated circuit is also motivated by the desire to understand interface issues more accurately as well as validate the design methodology that will be presented later. The need for low power signal processing in space applications is particularly vexing, and presented an excellent target for this design.
A brief mathematical derivation of the architecture based on wavelet theory is presented. Background information regarding micro-electronic radiation hardening and how it affects low power designs is discussed. A brief description of the overall architecture is presented, followed by the design methodology using a simple component of the design as an example. The paper concludes with some simulation power and speed results as well as expected results from future test chips.
General Notes on the Suter Stevens Algorithm
This algorithm can be implemented hierarchically, allowing the N_1 and N_2 blocks to be smaller instantiations of the algorithm. For instance, if N = 1024 with N_1 = 16 and N_2 = 64, the N_1 and N_2 blocks can each be decomposed into N_{11} = N_{12} = 4 and N_{21} = N_{22} = 8.
Four point FFTs can be implemented without multiplication, so a hierarchical decomposition which maps the leaf nodes to four-point FFTs is the most efficient realization. Our test chip design implements a simple 16-point FFT with N_1 = N_2 = 4 so that all the components of a larger FFT circuit are included. From this, it is easy to see that FFT point sizes that are an even power of the base FFT point size work best.
Although a synchronous implementation of this algorithm is realizable, the multirate algorithm based on concurrent execution of each decimated sequence maps very well to an asynchronous implementation. The shifts between time domains occur naturally and with minimal energy expense.
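The claim that a four-point FFT needs no multiplier can be illustrated with a minimal sketch. It assumes the usual forward-transform sign convention W = exp(-2*pi*j/N), which this section does not state, and the function name and the (re, im) tuple representation are purely illustrative. Multiplication by -j or +j reduces to swapping the real and imaginary parts and negating one of them.

```python
def fft4_no_multiply(x):
    """4-point DFT of complex integers given as (re, im) tuples.

    Uses only additions, subtractions, and real/imaginary swaps.
    Assumed convention: W = exp(-2*pi*j/4) = -j, so -j*(re, im) = (im, -re).
    """
    (x0r, x0i), (x1r, x1i), (x2r, x2i), (x3r, x3i) = x
    ar, ai = x0r + x2r, x0i + x2i   # a = x0 + x2
    br, bi = x0r - x2r, x0i - x2i   # b = x0 - x2
    cr, ci = x1r + x3r, x1i + x3i   # c = x1 + x3
    dr, di = x1r - x3r, x1i - x3i   # d = x1 - x3
    return [
        (ar + cr, ai + ci),   # X0 = a + c
        (br + di, bi - dr),   # X1 = b - j*d
        (ar - cr, ai - ci),   # X2 = a - c
        (br - di, bi + dr),   # X3 = b + j*d
    ]
```

Only eight complex additions or subtractions are needed, which is why mapping the leaf nodes of the hierarchy to four-point blocks is attractive.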
Designing for Space Applications
In the harsh environment of space, there are many radiation hazards that must be eliminated or minimized for a circuit to perform reliably. Several types of radiation are harmful to CMOS circuits and have differing effects, including neutron radiation, ionizing radiation, and total-dose radiation. Each one of these forms of harmful radiation can cause a single event effect (SEE), which is defined as either a hard or soft error. Hard errors include latchup, burnout, gate rupture, frozen bits, noise in CCDs, and snapback. Soft errors include bit upsets in memories or registers or transient signals in logic circuits. These soft errors are commonly referred to as single event upsets, or SEU [1].
Ionizing radiation, for example, is caused mainly by gamma and x-rays as well as other minor sources and primarily affects the oxide layers of a CMOS circuit [2]. Upon irradiation, electron-hole pairs are generated and evenly distributed throughout the SiO_2 layer. Many of these pairs recombine within 100sec, but some free electrons are swept out by the electric field in the gate insulator. The trapped holes that remain in the insulator cause a negative shift in the MOSFET threshold voltage. Over time, the holes slowly migrate toward the most negative potential within the SiO_2. If this most negative potential is the channel, the holes will tend to migrate toward the insulator-channel interface, decreasing V_t from its pre-rad or initial value. If this most negative potential is the gate, then the holes migrate toward the insulator-gate interface, decreasing V_t from the initial value. After a period of time, the holes are annealed out of the SiO_2, allowing V_t to shift back toward its initial value [3, 7].
The key parameter change caused by ionizing radiation is the threshold voltage shift. For P-FETs, V_t is shifted negatively at all dose levels because the trapped holes in the oxide and the interface states work together. For N-FETs, V_t is shifted negatively at low dose levels and positively at high dose levels [3].
A thin gate oxide allows fewer electron-hole pairs to be formed, mitigating the effects of ionizing radiation. A reentrant form of the N-FET also keeps the area of the field oxide at a minimum, further reducing the amount of oxide that can be "ionized".
The following section discusses how rad hardening affects low power CMOS designs.
Designing Low Power Space Applications
Solar panels and nuclear generators are the only ways a satellite can acquire energy. Therefore both peak and standby power must be kept to a minimum.
A common method of reducing power consumption in integrated circuits is to lower the supply voltage, yielding a quadratic improvement in power at a linear cost in performance. However, scaling the voltage of a CMOS circuit makes it more susceptible to SEU because the noise margin between a logic high and a logic low is reduced. SEU possibilities are further exacerbated by the threshold shifts that occur under radiation. Therefore, voltage scaling must be used judiciously and in general has more restrictions than in earthbound electronics.
The power and complexity required to implement many CMOS functions can be reduced using circuit techniques such as dynamic, pre-charge, and pass-gate logic. Unfortunately these techniques are also to be avoided since single event effects can prey very easily on these structures. Design is largely limited to static logic gates.
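The quadratic-power, linear-performance trade-off mentioned above can be made concrete with a first-order sketch. It assumes the classic switching-power model (power proportional to C * Vdd^2 * f) and a speed that scales linearly with Vdd; leakage, short-circuit current, and threshold effects are ignored, and the function names are illustrative only.

```python
def dynamic_power_ratio(vdd, vdd_ref):
    """Switching power at vdd relative to vdd_ref, for equal
    capacitance and clock rate (assumed model: P ~ C * Vdd**2 * f)."""
    return (vdd / vdd_ref) ** 2


def speed_ratio(vdd, vdd_ref):
    """First-order speed at vdd relative to vdd_ref, taking the
    'linear cost in performance' literally (an approximation)."""
    return vdd / vdd_ref


# Supply voltages taken from the results discussion: 5.0 V, 3.3 V, 2.2 V.
for vdd in (5.0, 3.3, 2.2):
    print(vdd, round(dynamic_power_ratio(vdd, 5.0), 4), round(speed_ratio(vdd, 5.0), 2))
```

Under these assumptions, dropping from 5.0 V to 3.3 V cuts switching power to roughly 44% of its original value while retaining about two thirds of the speed.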
Fortunately standard CMOS processes can be used. A radiation tolerant(1) cell library developed jointly by Mission Research Corporation (MRC) and Air Force Research Laboratory [4] specifically for the HP 0.8 µm process via MOSIS has been used for our test chip. However, the radiation requirements result in devices much larger than would be used otherwise, resulting in additional power consumption. For example, the minimum size inverter width in this cell library is 50 for the N-FET and 90 for the P-FET! With the additional rad tolerant characteristics, the total size of the inverter cell is 42 × 119.
(1) Rad tolerant implies the ability to withstand 100 krad(Si) and maintain functionality, whereas rad hard implies the ability to withstand 1 Mrad(Si).

Architecture becomes the primary means of reducing power in space applications, due to constraints on voltage scaling, circuit structure, and device size. This FFT architecture implemented using asynchronous circuits significantly reduces the power compared to other space worthy designs. The most significant contributions to low power in this architecture are twofold. First, the algorithm has been designed to maximize locality, point-to-point data pipelining, and hierarchy. The only shared structures in the design are the decimator inputs, expander outputs, and pipelined crossbar switch, all discussed in Section 3.1. Second, the frequency is greatly reduced by decimation, allowing devices and drivers to be undersized. This can significantly reduce the capacitance of transmitting data signals. For example, assume a 100 MHz sample rate for a 256-point FFT. This design requires additions at a low rate of one every 320 ns in the leaf FFT cells. The pipelined crossbar transmits one data word across each row and column every 160 ns. This allows ample latency to size devices optimally. The asynchronous implementation technology allows the common frequency changes to be supported at minimal energy dissipation.
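The effect of decimation on local data rates can be checked with a small calculation. This is a sketch under stated assumptions: a decimation by M is taken to stretch the per-branch sample period by M, and the 256-point transform is assumed to factor as N_1 = N_2 = 16, which the text does not state explicitly. The function name is illustrative.

```python
def decimated_period_ns(sample_rate_mhz, decimation_factor):
    """Per-branch sample period in ns after decimating the input stream.

    Assumes each of the `decimation_factor` branches receives every
    decimation_factor-th input sample.
    """
    sample_period_ns = 1000.0 / sample_rate_mhz
    return sample_period_ns * decimation_factor


# 100 MHz input (10 ns period), assumed split of 256 = 16 * 16.
print(decimated_period_ns(100, 16))
```

With these assumptions the result matches the 160 ns crossbar word rate quoted above. The 320 ns figure for the leaf cells corresponds to 32 input sample periods; how the hierarchy arrives at that factor is not spelled out here, so it is not modeled.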
3 Architecture(2)
General Architecture
There are six major components required to implement the FFT of Equation 14. The block diagram of Figure 4 shows how each of the components fits into the computation. First, the data is decimated in time into N_2 sequences of length N_1. Then, the FFT of each sequence is computed (the interior summation), followed by multiplication by the constants W_N^{m_1 n_2}. After the complex multiply, the partially transformed data is interleaved, as required by the FFT, through a pipelined crossbar switch. This pipes a data stream to the N_2 point FFT blocks. Finally the data, which is decimated in frequency, goes to an expander to correctly sequence each fully transformed element in the output sequence as X(m) and regenerate the input frequency.
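The six stages described above can be traced in software. The following is a behavioral sketch only, not the hardware design: it assumes the forward-transform convention W_N = exp(-2*pi*j/N) and the index mapping n = N_2*n_1 + n_2, m = m_1 + N_1*m_2, since Equation 14 is not reproduced in this section. All function names are illustrative, and a direct DFT stands in for the small FFT blocks.

```python
import cmath


def dft(x):
    """Direct DFT, used here for the small blocks and as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]


def decomposed_fft(x, n1, n2):
    """N-point DFT (N = n1 * n2) following the six stages in the text."""
    n = n1 * n2
    assert len(x) == n
    # 1. Decimator: branch j receives x[j], x[j + n2], x[j + 2*n2], ...
    branches = [x[j::n2] for j in range(n2)]
    # 2. N1-point FFT of each decimated sequence (the interior summation).
    inner = [dft(b) for b in branches]
    # 3. Complex multiply by the constants W_N^(m1 * j).
    scaled = [[inner[j][m1] * cmath.exp(-2j * cmath.pi * m1 * j / n)
               for m1 in range(n1)] for j in range(n2)]
    # 4. Crossbar: interleave so each N2-point block sees one m1 index.
    columns = [[scaled[j][m1] for j in range(n2)] for m1 in range(n1)]
    # 5. N2-point FFT of each interleaved stream.
    outer = [dft(c) for c in columns]
    # 6. Expander: place the results in natural order, X[m1 + n1*m2].
    out = [0j] * n
    for m1 in range(n1):
        for m2 in range(n2):
            out[m1 + n1 * m2] = outer[m1][m2]
    return out
```

For j = 0 every constant in stage 3 equals 1, so that branch needs no multiplier. Calling `decomposed_fft(x, 4, 4)` on a 16-sample list mirrors the N_1 = N_2 = 4 test chip configuration.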
Speci c Architecture for This Project
For the test chip, the minimum iteration necessary to demonstrate the functionality of the algorithm was desired. Therefore N = 16 was chosen with N_1 = N_2 = 4. This allowed the basic architecture to reuse the N_1 and N_2 blocks, drastically saving on area with only a minimum of control overhead.
Control operates in data-flow pipelines, using data bundling and a four-phase handshaking protocol between each stage. Control of every major component is implemented using one-hot encoded state machines [6]. There are 16 unique burst-mode asynchronous finite state machines (AFSMs) used in the control structures. Some of the AFSMs are very simple, like the ones used to control register locking, which have 3 states with 2 inputs and 2 outputs. Others are more complex, like the multiplier controller, which has 9 states with 8 inputs and 6 outputs.

(2) Patent pending.

Table 1

Our design methodology uses VHDL as the central simulation and specification language. Designs are refined using VHDL, and when we are satisfied with the simulation results we then implement the blocks. Although there are many tools available today for asynchronous design, the necessary tools are not integrated into a tool flow, so the simulation, synthesis, and implementation steps were disjoint.
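The four-phase bundled-data handshake used between pipeline stages can be sketched as a simple event trace. This is a behavioral illustration under assumptions: it follows the conventional return-to-zero ordering (request up, acknowledge up, request down, acknowledge down), and the signal names `req` and `ack` and the function name are hypothetical rather than taken from the design.

```python
def four_phase_transfer(word):
    """Trace one return-to-zero handshake for a bundled data word.

    Returns (latched_word, trace), where trace lists (event, req, ack)
    tuples in the order the wires change.
    """
    req = ack = 0
    trace = []
    bundle = word                    # sender drives the data bundle first
    req = 1                          # then raises request: data is valid
    trace.append(("req+", req, ack))
    latched = bundle                 # receiver latches while request is high
    ack = 1                          # and acknowledges
    trace.append(("ack+", req, ack))
    req = 0                          # sender withdraws the request
    trace.append(("req-", req, ack))
    ack = 0                          # receiver returns to idle
    trace.append(("ack-", req, ack))
    return latched, trace


value, events = four_phase_transfer(0x1234)
print(value, [name for name, _, _ in events])
```

Both wires return to zero after every transfer, so a stage with no incoming data simply sits idle without switching.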
Formal Speci cations
Generic timing diagrams and Petri nets were devised to demonstrate the flow of control and data through the computation and to help understand interface timing and sequencing. Burst mode AFSM specifications were derived and synthesized by the 3D [10] and MEAT [5] tools. The synthesized equations were written in behavioral VHDL, analyzed, and simulated with a test bench modeling the timing diagrams to validate the burst specifications. Once the VHDL was confirmed, the static logic equations were laid out using the MRC cell library. Some of the asynchronous designs were testable with IRSIM v9.02 from the Berkeley tool suite and others would only function using SPICE (Avanti Corporation HSPICE v95.1). Occasionally, it would be found that a design was too big to be practical or its function could not deal with the simulated real hazards of the circuit, and the design would need to be repartitioned or redesigned.
An Example of Iterative Component Improvement
The design of the complex multiply block of Figure 4 will be used as a design example. This is the most time-critical component in the design, since a complex multiply must occur every 160 ns to meet the intended 10 ns sample period. A radix-4 Booth multiplication algorithm was chosen.
The shift-and-add control was implemented with a sequence of nine one-hot cells: eight for shifting, and the ninth to indicate a "done" condition. On each successive pulse, a different combination of three bits is enabled onto the three decoding lines going to the Booth decoder. The final design of the multiplier is shown in Figure 1. Note that two multiplications are performed in parallel using a single control unit. As can be seen from Figure 4, the FFT-4 produces a pipelined stream of real and imaginary data words to the complex product block. This data is multiplied in the product block by a complex constant. The FFT-4 first produces the real data component followed by the imaginary data component. Since both the real and imaginary constants are always available, stored locally in a static register bank, time and area are conserved by decoding the Booth instructions from the FFT-4 input and producing two partial products simultaneously.
Although all the control in the multiply unit can be shared, only the control for the ALUs is illustrated to simplify the figure. Since the Booth instruction will be the same for both multiplies, the AREQ signal is routed to both ALUs. A C-element is required to synchronize the AACK signals from each ALU because the ALU operating time is data dependent and the real and imaginary constant additions will likely complete at different times. Each data path has its own ALU, shift register, and constant storage.
A complex multiply requires four integer multiplications, an integer addition, and an integer subtraction. We use the "multiplier 2" block with two adders and registers to complete a complex multiply. Table 2 gives the step by step procedure of the complex multiply operation corresponding to Figure 2, parts (a) to (d). As shown in Figure 2, each output of the "multiplier 2" block connects to one A and B input of an adder and subtracter. Each A input contains a latch to store the integer product of the first two multiplications as described in step 2 of Table 2. The second multiplication result can be passed directly to the adder and subtracter and used with the latched value to produce the complex valued result. Figure 2a shows the circuit after the arrival of Re{X}. The first two partial products have been computed and latched in Figure 2b. Since the FFT-4 will likely produce its outputs faster than the multiplier can use them, Im{X} will probably arrive early. Despite this, Im{X} will not be used until after it is latched in step 3 of Table 2. The second multiplication pair has completed in Figure 2c and is held statically on the data lines. The final step of Table 2 occurs when all four integer multiplication products are present, and the final complex products can be computed and latched into the crossbar switch, as shown in Figure 2d.
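The shared Booth decoding described above can be modeled in a few lines. This is a sketch under assumptions: 16-bit two's-complement operands are inferred from the eight shift steps and are not stated outright, the function names are illustrative, and the hardware shifts an accumulator whereas this model simply weights each digit by a power of four.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a two's-complement value into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and is decoded from three
    overlapping bits: y[i+1], y[i], y[i-1], with y[-1] = 0.
    """
    assert bits % 2 == 0
    y &= (1 << bits) - 1
    digits = []
    prev = 0                          # implicit bit below the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def dual_booth_multiply(data, const_re, const_im, bits=16):
    """Multiply one data word by two constants with one shared decode."""
    acc_re = acc_im = 0
    for k, d in enumerate(booth_radix4_digits(data, bits)):
        # The same digit selects 0, +/-1x, or +/-2x of each constant.
        acc_re += (d * const_re) << (2 * k)
        acc_im += (d * const_im) << (2 * k)
    return acc_re, acc_im


print(dual_booth_multiply(-45, 123, -7))
```

A 16-bit word yields eight digits, one per shifting cell of the one-hot sequencer, and both accumulators consume the same digit on each step, which is why a single control unit can serve both multiplies.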
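The sequencing of the four integer products can be sketched as follows. This is an illustration under assumptions: Table 2 is not reproduced here, so the mapping of steps to statements is approximate, and the function and variable names are hypothetical.

```python
def complex_multiply_steps(x_re, x_im, c_re, c_im):
    """(x_re + j*x_im) * (c_re + j*c_im) using four integer products,
    one subtraction, and one addition."""
    # Re{X} arrives: both products with the constants are formed and latched.
    latched_rr = x_re * c_re
    latched_ri = x_re * c_im
    # Im{X} is used next: the second pair of products is held on the data lines.
    prod_ir = x_im * c_re
    prod_ii = x_im * c_im
    # All four products present: the subtracter and adder form the result.
    out_re = latched_rr - prod_ii
    out_im = latched_ri + prod_ir
    return out_re, out_im


print(complex_multiply_steps(3, 4, 1, 2))
```

Because each pair of products shares one data operand, the two multiplications in a pair can run in parallel on the dual multiplier.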
Results
We are able to obtain some preliminary power and timing results based on SPICE simulations on extracted layouts. The MRC cell library is designed to be maximally radiation tolerant when Vdd is 5.0 Volts. However, 2.2V is customary for many of today's low power designs due to the benefits of voltage scaling. Our preliminary numbers use a middle-ground Vdd of 3.3 Volts. VHDL simulations at this voltage have been projected to the system timing chart of Figure 3. We have included results with Vdd of 5.0V and 2.2V along with the baseline of 3.3 Volts to examine the MRC operating range as well as to permit closer comparisons to other low-power FFT chips.
Notice how the duty cycles of all the major components overlap. The actual amount of overlap pipelining will vary depending on the data, but this figure gives a good timing estimate. The asterisk by the Mult0 line in Figure 3 indicates that this is not an actual multiplier. The constants for the 0th sequence are all equal to 1+j0, so no multiplication is required. In place of a multiplier, a holding register is used so a full 32-bit complex value is sent to the crossbar switch.
Based on empirical SPICE data, we are able to extrapolate system and component timing to Vdd = 5.0 Volts and Vdd = 2.2 Volts. Table 3 shows the timing comparison between the three Vdd levels. Using the actual SPICE power numbers for each component running at the frequencies in Table 3, we are able to compile projected power consumption for the FFT-16. The component and system power consumption numbers are given in Table 4. Then, using the scalability properties of each segment, energy consumption results for larger point sizes can be determined as in Table 5. Since these numbers are very rough estimates, direct comparisons against current FFT chips are not that conclusive.
These comparisons are still drawn to show that, despite using a power hungry cell library and disregarding many known power reduction techniques, similar power efficiency numbers can be achieved with architecture and asynchronous design techniques. Table 6 gives a rough comparison to the SPIFFEE project at Stanford University and a commercial FFT processor, the Plessey PDSP 16510A. It is fair to point out that the Plessey DSP chip uses a block-float data format instead of fixed point, which accounts for some of the additional energy required. The figure-of-merit we are using, that of energy consumed per unit transform, compares the energy efficiency of the architecture in generating a result and is independent of the frequency. We must also point out that the sample frequency of our asynchronous design is considerably faster than that of any of these processors. The numbers for this project will remain fairly constant for larger point sizes due to the hierarchical nature of our FFT algorithm. However, as the point-size grows, additional hierarchical layers are required which will result in increased power consumption.
The 2.2 Volt Vdd entry of "This Work" in Table 6 probably could not be used in space because of the single event effects discussed in Section 2.2. It is presented here to show how the efficiency FOM scales between the different Vdd levels.
Conclusion
This paper shows a formal mathematical approach that has been directed towards architecting low power FFT circuits. The architecture relies on localization techniques, pipelining, and frequency shifts of decimation and expansion. Asynchronous implementation techniques are a particularly appealing low power implementation methodology for such an architecture due to the ability to shift between various frequencies at no additional cost in power or complexity. The asynchronous pipeline supports both rapid computation and minimal energy dissipation during idle periods.
It is prudent to say that the methodology we are using is effective for this project. Seldom will the best design arise from the first specification. Typically a design will go from specification to implementation before many important design considerations become evident. Our methodology based on VHDL, as both a specification language and simulator, was very helpful in directing our top level design because handshake protocol inconsistencies are detected very easily due to the way VHDL performs value resolution. However, VHDL is a bit unwieldy at times because it does not recognize dynamic values on a bus, as well as a few other minor problems. The hand layout was fairly easy because the MRC library cells fit together in a gate-array format. However, accurate circuit simulations become difficult as the design grows, requiring modular approaches to circuit analysis.
The numbers attained from the extracted SPICE simulations show surprisingly low power for the device sizes required by the rad-tolerant library. We surmise that a significant effect is due to the synergy between asynchronous design and the architecture. Our preliminary results also give significant motivation for future design exploration and test chips, particularly for radiation tolerant applications. Should this project be extended to an implementation that is not radiation tolerant, we could expect an additional reduction in power consumption beyond what is presented here, achieved through smaller transistor widths, pre-charge logic, and other circuit level optimizations.
