There exist Fast Fourier transform (FFT) algorithms, called dimensionless' FFTs, that work independent of dimension. These algorithms can be configured to compute different dimensional DFTs simply by relabeling the input data and by changing the values of the twiddle factors occurring in the butterfly operations. This observation allows us to design an FFT processor, which with minor reconfiguring, can compute one, two, and three dimensional DFTs. In this paper we design a family of FFT processors, parameterized by the number of points, the dimension, the number of processors, and the internal dataflow, and show how to map different dimensionless FFTs onto this hardware design. Different dimensionless FFTs have different dataflows and consequently lead to different performance characteristics. Using a performance model we search for the optimal algorithm for the family of processors WO considered. The resulting algorithm and corresponding hardware design was implemented using FPGA.
Introduction
In many applications, the Fast Fourier Transform (FFT) presents an intensive computational task due to the amount of data to be processed. The amount of data (i.e., problem size) depends on the number of points and the dimension of the transform. To this end, engineers and scientists rely on approaches such as highly-tuned code for uniprocessors, DSP processors, ASIC, IP cores, and reconfigurable architecture, to meet the performance requirements with respect to other design constraints such as physical space. A list of references to these approaches is provided in [l] . Our study, which is part of the SPIRAL project [2] , focuses on using mathematical properties of the FFT to help us design a high-performance hardware.
The novelty of our work is threefold. First we base our processor on the dimensionless FFT [3] which allows a 'This work was partially supported by DARPA We thus propose a universal FFT engine that is parameterized in terms of the number of points and dimension of the transform, and the choice of the algorithm. We consider a distributed architecture comprised of processing units (with local memory) connected via an interconnection network. We derived a class of optimal FFT dataflow diagrams based on memory-access cost function. The derivation followed a design flow that uses a performance model and its simulation to evaluate the choice of algorithm prior to design of the hardware [6] . Implementation of the engine for proof of concept on the WildforceTM [7] reconfigurable (FPGA) board is performed in two steps, hardware description language model simulation of the board and the actual execution of the configured board. We have validated the actual configured board and are in the process of benchmarking its performance. Future implementation may use the ASIC technology for the floating-point (complex numbers) arithmetical cores and the FPGA technology for the parameterized flow control units.
In Section 2 we review the dimensionless FFT and the space of FFT algorithms that we will consider. In Section 3 we describe a family of FFT processors and the mapping of algorithms in Section 2 to this architecture.
In Section 4 we introduce a performance model for the architecture in Section 3 and find the optimal algorithm with respect to this model. In Section 5 we describe the implementation of the design selected in Section 4. De- tails not provided in this paper can be found in [l] . where I8 is an identity mat,rix, the Pi are permutation 
, define an algorithm whose dataflow is shown in Figure 2 .
In the following sections we will see that different dataflows lead to different performance characteristics. The set of possible dataflows are those sequences of permutations that can be configured to compute the FFT.
That is, those dataflows for which there exist an initial permutation and a sequence of twiddle factors, such that the resulting matrix factorization is equal to any compatible multidimensional DFT. To optimize our design, we search over the space of allowable dataflows for the one with the best performance. (1, 1, 1 , 1 , 1 , 1 , 1 , 1 , l , w 4 , l , w 4 , l , w 4 , l , w 4 )  T 3 =diag(l,1,1,1,1,w2,1,1,w2,1,w4,1,w4,1,w6,1,w6 
Architectural Framework
In this section, we describe the architecture of our FFT processor, and illustrate how the algorithms from the previous section can be mapped onto this architecture. The proposed architecture shown in Figure 3 contains 3 main units: the interface, the interconnection network, and the gating permutation P determines the addresses for each butterfly operation. Since all permutations are assumed to be tensor permutations, the permuted addresses can be generated by permuting the bits of a binary counter.
F~~ example, the 16-point pease algorithm is transformed into the four stage factorization I 0010->P1 I 010O->P1 I 100O->P1 I 0001->4 I processor elements (PES.) The interface unit is used to transfer parameters and data to/from the system. The reconfigurable interconnection network provides the communication between the PES. Each P E has 3 main units: a memory (M), a computation unit (CU), and an address generator (AG). This design is similar to the Xputer proposed in [lo] .
When used as an FFT processor, the data is distributed amongst the processor memories, the computation unit is used to generate twiddle factors and perform butterfly operations, and the address generators calculate the addresses of the two inputs to each butterfly operation. In order to map an FFT algorithm, represented as a matrix factorization, onto this architecture, address generators must be configured from the permutations occuring in the factorization, and the computation unit must be configured to compute t,he appropriate twiddle factors. The initial permutation is incorporated into the interface so that the data a.rrives in the appropriate order. To simplify the address generators, we assume that The N elements of the input vector are distributed consecutively in segments of size N / M to the processor memories. A memory location containing data is assigned a global n-bit address (bn-1b,,-2 . . . bo), where the leading m bits are the PE number and the trailing n -m bits are a local offset. To conserve memory, the computation is performed inplace. This requires a modification to the factorization presented in the previous section, so that each stage is of the form P ( 1 @ F2)TP-l. The conjuThe addresses obtained for the 8 butterfly operations (two consecutive data elements are used in a butterfly operation) in each of the four stages are shown in Table 2. Each stage shows the permuted address bits, and all addresses are obtained by counting from 0 to 15. Butterfly operations are assigned to PES using a round robin schedule. Assuming 4 PES, Table 3 shows the address sequences for each P E for the 16-bit Pease algorithm. The twiddle factors needed for a butterfly are determined in a manner similar to address calculation. In Section 2, it was shown that there are different FFT algorithms with different dataflow patterns. Each algorithm can be mapped onto the architectural framework outlined in Section 3. It is not clear a priori which algorithm should be used when selecting the ultimate design for the FFT processor. Using the performance model, a search, over a subset of possible FFT algorithms, was performed in order to select the most efficient design. The algorithm found using this search process substantially reduces the traffic ovei the interconnection network when compared to the Pease algorithm.The use of a performance model, instead of complete simulation, allows us to explore many different possible designs at an early stage of the design process. The use of performance models early in the design process has been promoted in [ll] .
The performance model was implemented using ADEPT [6], a performance modeling tool based on PetriNets and implemented in VHDL. In the performance model, data are represented by tokens containing information that affects performance. The flow of the tokens emulates the dataflow in the system. A token in our system contains a pair of addresses corresponding to a butterfly operation. The sequence of addresses are generated and mapped to PES by a scheduler. The PE, the memory (M) and the interconnection models have a mechanism of passing the tokens that imitates the computation steps including memory read, twiddle factor and butterfly operations, and memory write. The operations and the memory accesses are emulated as delays while the address sequences dictate the flow of data. We use the total simulation time (not the CPU time) reported by the VHDL simulator as the performance cost.
The optimization problem is to find the FFT dataflow with minimal running time. The set of dataflows considered are obtained from the Pease algorithm by multiplying the permutations by tensor permutations of the form P @ 4 . Any such permutations lead to a valid dataflow and any valid dataflow can be obtained in such a way. Since an exhaustive search is prohibative, the search was limited to the case where P is a stride permutation. The search was performed with a model using four processors with the number of data points ranging from 16 to 1024. Figure 4 compares the. performance of the optimal dataflow found with the Pease algorithm.
The addressing for the optimal algorithm for 16 points is given in Table 4 . The addresses were generated by the sequence of bit permut<ttions (bzblbsbo), (b2blbOb3), (b2boblb3), (boblb3b2). Comparing this sequence of addresses to the Pease algorithm shows that the Pease algorithm has 36 non-local mc'mory accesses while the optimal algorithm only has 16. There is a pattern in the optimal dataflow found by the search, which can be pa- rameterized by the number of processors and data points. This dataflow has many interesting properties which simplify its implementation: (1) all data access in the first m stages is local and (2) in the remaining stages, half of the data accessed by a processor is local and the other half is exchanged with one other processor.
Implement at ion
In this section, we describe the implementation of a universal FFT processor on the WildforceTM [7] FPGA board. The processor design is based on the optimal dataflow found in the previous section. The board consists of 5 Xilinx FPGA chips (XC4085XLA), with local memories, connected via a configurable crossbar interconnect. The board communicates with a host using a PCI bus. The architecture in Section 4, is mapped to the board as follows. One processor, CPEO, along with its FIFO is used for interface unit. The remaining processors implement four PES each with a computation unit and address generator. The crossbar is used for the interconnection network. Before the computation is performed the processor is configured using parameters containing size and dimension information. The interface uses this information to perform the initial permutation, Pd, and the remaining PES use the parameters to configure their address generators and the computation units.
After the board is configured, data is downloaded and distributed to the memories of the PES. Computation is performed on single-precision complex data. The computation is divided into 2 phases: "local" and "remote". During the local phase all data are in the local memory modules. During each stage of the the remote phase, half of data is available in local memory while the other half must be obtained over t,he interconnect. Before each remote stage the interconnect must be configured corresponding to the permutation at that stage.
The address generator in t.ach P E generates a sequence of addresses. The controller uses the address to read the data which is sent to the computation unit via a FIFO while the address is put in it FIFO for the writing. The computation unit includes the twiddle factor generator and arithmetic units for performing butterfly operations. The computation unit is implemented using pipelining so that an output is ready every other clock cycle as long as inputs are fed continuously. The implementation requires one pipelined floating-point adder and one pipeline floating-point multiplier. The output from the computation unit is written back to memory at the same address stored in the FIFO.
Special properties of the permutations in the optimal dataflow allow address generation to use the adder shown in Figure 5 In this paper we have presented a family of algorithms, called dimensionless FFTs, for computing multidimensional FFTs. We also introduced a parameterized family of FFT processors and showed how to map algorithms onto this hardware design. Finally, using a performance model we were able to find the optimal FFT algorithm for the architecture we considered. A prototype implementation of the optimal algorithm and resulting hardware design was carried out using FPGA and the resulting board configuration was validated. Currently we are in the process of benchmarking our implementation and comparing it to other FFT processors. In the future we hope to extend this methodology to a larger class of algorithms and processor designs.
