Abstract
Introduction
Field Programmable Gate Arrays (FPGAs) have been commonly used to accelerate fixed point DSP applications. Typically, floating point arithmetic was not used for these designs as it is extremely resource intensive [1] . Due to increasing resources available on modern FPGAs, floating point accelerators [2] [3] are now being realized as Systemson-Chip [4] and full floating point applications are possible on multi-FPGA platforms.
This paper presents a prototype hardware accelerated Fourier Integral Operator (FIO) kernel. The motivation for accelerating this application lies in its use in seismic imaging. Seismic imaging has many valuable applications, particularly for locating oil deposits using ultrasound like images of the earth. This application relies on the solution of complex mathematical equations that are exceedingly time consuming to execute on general purpose computers. A typical data sample set takes upwards of 20 days to process on a personal computer (PC).
This application-specific architecture is developed for a multi-FPGA platform (BEE2) [5] . To the best knowledge of the authors, this application is the first multi-FPGA implementation of a complete floating point application. Typically, floating point applications have been implemented on dedicated processors and GPUs. Furthermore, research has mainly focused on accelerating individual floating point operators [6] [7] [8] on FPGAs.
This system exploits the inherent capabilities of hardware, using parallelism and pipelining to speed up the mathematical calculations. The final hardware implementation achieved a throughput speedup of 12.4x over an optimized software algorithm. Furthermore, synthesizing the design for the BEE3 platform resulted in a throughput speedup of 15.8x. The major contributions of this paper are:
• An application-specific architecture to accelerate an FIO kernel, using scalable CE architectures • A detailed study of floating point FFTs • A prototype implementation on the BEE2 platform The remainder of this paper is structured as follows. Section 2 provides an overview of the FIO algorithm along with a background on available floating point FFT cores. The FIO system design is discussed in Section 3. Section 4 describes the floating point CE designs. Section 5 presents the performance results for individual CEs and the overall system. Finally, Section 6 concludes the paper along with providing suggestions for future work.
Background
The following is a discussion of the implemented FIO algorithm. A summary of available FFT cores is also provided.
Seismic Imaging Application
The purpose of seismic imaging is to obtain an extrapolated wavefield along a planar cross-section of the earth, taken from the surface, to produce an image of underground bodies. Figure 1 illustrates an example cross section.
Figure 1: Cross section of the earth
Data is taken at multiple points along the earth's surface, and the depth of a body determines the time it takes for a reflected wave to return to the same point.
Our prototype employs a depth stepping method as described by Margrave et al [9] . The algorithm is divided into five major Computing Elements (CEs) for the calculation of the single cross section shown in Figure 1 . Figure 2 illustrates the data flow of the algorithm, highlighting these CEs.
Figure 2: FIO Data flow
The function of each CE in the FIO kernel can be summarized as follows:
CE1: Perform FFTs in the spatial domain CE2: Perform FFTs in the temporal domain CE3: Load segments of a pre-calculated phase matrix CE4: Perform a matrix multiplication on the transformed cross section CE5: Perform inverse FFTs back in to the spatial domain
The phase matrix referred to in CE3 is generated by complex square roots and exponential operations, which are extremely resource intensive. Since this matrix is independent of the data, it can be pre-calculated in advance based on the appropriate velocity model and used repetitively for different sets of data. However, due to the size of the matrix, the entire matrix cannot be stored onchip and must be buffered from external memory in sections. The actual numerical processing occurs in CE4 when the extrapolated data vectors are multiplied with the buffered segments of the phase matrix.
Fast Fourier Transform
To date Fast Fourier Transforms (FFTs) have largely been implemented in fixed point. Floating point FFTs have recently become commercially available. Dillon Engineering [10] now offers parameterized FFT cores that can be re-targeted towards specific FPGA and ASIC architectures. 4DSP [11] also offers a single precision floating point FFT implemented using a radix-32 or radix-2 butterfly architecture. These cores are targeted at both Xilinx and Altera devices. Previous research has also investigated double-precision floating point FFTs on FPGAs [12] .
Altera and Xilinx both offer block floating point FFT. Block floating point is a compromise between fixed point performance and floating point dynamic range. It allows floating point calculations to be performed on fixed point hardware by shifting the data to share a common exponent for a block of data The Altera FFT [13] core implements a radix-2/4 decimation-in-frequency (DIF) FFT algorithm for transform lengths of 64 to 4k points. In its LogiCore package, Xilinx's FFT provides four different FFT architectures including fixed point and block floating point modes [14] . Due to resource requirements, Altera and Xilinx have targeted their FFT architectures towards their larger families of FPGAs. Since these commercial FFT cores were not freely available during the development our own FFT, this paper includes a study of the design costs and tradeoffs we investigated to create a scalable floating point FFT core.
Fourier Integral Operator Kernel in Seismic Imaging
Due to floating point resource requirements, the system is implemented as a multi-FPGA solution. The system is developed on the BEE2 platform [5] , which is a general processing platform comprised of five Xilinx FPGAs (Virtex II Pro 70). Four of these FPGAs are configurable for user applications whereas the fifth FPGA acts as the platform controller. The BEE2 platform also provides 20GB of high speed, DDR2 DRAM memory. The five CEs are divided amongst the four user FPGAs. The single control FPGA is not used. The four user FPGAs are laid out in a ring topology with independent high speed parallel channels capable of transferring data at up to 10Gbps. Details on the development of an applicationspecific architecture and its implementation are given in the following section. Figure 3 illustrates the FIO system block diagram partitioned into stages. How these stages are divided over FPGAs will be discussed in Section 5.4. Stage 1 transforms the data in the spatial domain using CE1. Stage 2 includes CE2, CE3, and CE4. This stage takes the transformed data, performs a temporal FFT, and processes it by performing a matrix multiply. Finally, Stage 3 uses CE5 to perform an inverse spatial FFT.
FIO System Design
A set of data constitutes 1024 surface sample points (columns) in the spatial domain (the earth's surface) with 256 data samples (rows) of varying time (depth) at each surface point. Initially, the data set is transformed in the spatial domain, row by row, for all time (depth) measurements via a 1024-point FFT (CE1). The FFT must be performed on all the rows of the entire matrix before the next stage can begin processing. The initial data is loaded from Memory A to the FFT. As each row is completed the resultant is transferred to Stage 2 and stored in Memory B. When the last 1024-point FFT is completed, the first stage is finished and the second stage can initiate while Stage 1 begins processing a new data set. Stage 3, which consists of CE5, mirrors Stage 1 in operation. It performs an inverse FFT along each row in the same fashion once a complete data set has been processed in Stage 2.
When Stage 2 initiates, CE2 performs a 256-point FFT along each column of data. As each column is completed, the resultant is transferred to CE4 for the matrix multiply in order to allow CE2 to continue processing temporal FFTs in parallel. CE4 takes the 256 points and multiplies it with the buffered 256x256 phase matrix loaded from Memory C by CE3. The resultant is then transferred to Stage 3 and stored in Memory D.
Each stage has a single MicroBlaze (soft processor core) that receives data from the previous stage to store in external memory. When the previous stage runs to completion, this MicroBlaze is then used to load the stored data for processing in its respective stage. Since the system is split into three stages, the system can act as a three stage pipeline, processing three sets of data in parallel.
In order to facilitate inter-CE communication, the Systems Integrating Modules with Predefined Physical Links (SIMPPL) model for application specific architectures is used [15] . SIMPPL is an architectural model that facilitates rapid integration of modules for SoC designs by using fixed communication protocols and physical links to facilitate system integration. This allows designers to focus on their CE designs for their applicationspecific architectures. Originally intended for on-chip communication in SoCs, this framework has been extended to multi-chip platforms [16] . The extended protocol abstracts inter-FPGA communication such that the multi-FPGA platform resembles a single large reconfigurable platform.
Hardware Accelerated Floating Point Operators
Implementing floating point arithmetic applications on an FPGA is challenging since basic operations require significant resources. Based on these results, the complexity of computing elements is limited by the resources available on an FPGA. For example, an FPGA with 100,000 Flip-Flops (FFs) could only implement 57 complex floating point multipliers. The following sections describe the design of the floating point computing elements.
Fast Fourier Transform
A common method of performing an FFT is the Cooley-Tukey algorithm [20] . Many available FFT cores employ this algorithm including Xilinx's FFT. This algorithm partitions the transform into smaller operations called butterflies. A radix-2 decimation-in-time (DIT) butterfly is used in the design. Figure 4 illustrates how a single butterfly is implemented for our system.
Each butterfly performs an operation on two points of data, multiplying one by a twiddle factor and subsequently adding and subtracting these values from the initial points of data. A table of twiddle factors and butterflies are used to compute the complete FFT. For example, to achieve the highest throughput for an 8-point FFT, it should be pipelined into three stages of butterflies, as shown in Figure  5 .
The seismic imaging application requires two FFTs of at least 1024 and 256-points. The primary concern when developing these FFTs was the amount of resources they require. One butterfly requires a floating point multiply, subtract and add, which totals approximately 4,000 FFs based on resource usages in Table 1 . As an FFT increases in size, the number of butterflies and resource requirements scale at a log-linear rate. For example, a 256-point FFT requires 8 stages of 128 butterflies. This totals over 4 million FFs, which exceeds the resources on all available commercial FPGAs.
In order to implement a large FFT on a smaller FPGA, the CE is designed to use a fixed bank of butterflies. This is beneficial since the FFT CE can be scaled according to available resources. Figure 6 shows the designed architecture for the FFTs (CE1, CE2, and CE5). Figure 5 , the butterfly bank contains an array of butterflies. While throughput is decreased in comparison to the fully pipelined FFT given in Figure 5 , a fixed butterfly bank makes large FFTs possible.
The input data is initially transferred into the FFT and is loaded into the single bank of butterflies by the loader to perform each stage incrementally. The unloader takes the result and loops the data back to the memory block. The loader then repeats the process for a given number of iterations. For example, a 1024-point FFT normally requires 4608 butterflies. If the butterfly bank contains 8 butterflies, the data would loop through the bank 576 times. The optimal butterfly bank size from our investigation is discussed in Section 5.1.
Complex Matrix Multipliers
The Complex Matrix Multiply (CMM) is used to process the data in CE4. Figure 7 illustrates the architecture that was designed.
Figure 7: Complex matrix multiply block diagram
The CMM contains two shift registers, a stage of multipliers, multiple stages of adders, and an accumulation stage. The number of adder stages depends on how many data points are calculated at a given time. Determining the optimal number of data points used for a CMM is presented in Section 5.2. The purpose of the CE is to take a column of data produced by the 256-point FFT and multiply it by a 256x256 phase matrix.
The CMM multiplies a segment of the 256-point temporal FFT vector by a segment of the phase matrix and accumulates the results. The final accumulation stage sums the results over the number of temporal data points (e.g. 256). For example, if a 256x256 multiply needs to be performed, and the core calculates 8x8 points at a time, the core iterates 32 times and the accumulation stage adds the total from each iteration to output the final result.
Phase Matrix Buffer
As previously described in Section 2.1, the phase matrix is pre-calculated before runtime and loaded into the system. Figure 8 shows the design used to buffer the matrix into the system in CE3.
Loader

BRAM
Unloader
BRAM
The designed CE consists of multiple memory blocks, which are loaded in parallel. A single memory block contains a set of 256 points which is unloaded sequentially into the CMM as it is needed. The optimal number of memory blocks, and how they are loaded is found in Section 5.3.
Results
A timing analysis of each CE was performed in order to determine the optimal design. The following sections give the results for each CE, as well as the final results for the complete FIO kernel.
Fast Fourier Transform
In order to determine the number of banks to be used in the FFT CEs, the tradeoffs between the number of banks and speedup was determined. Table 2 gives the latency and resource requirements for a 1024-point FFT using a varying number of butterflies. Based on the results, speedup is roughly doubled whenever the number of butterflies is doubled from one to four butterflies. However, when moving from four to eight butterflies speedup is only increased 1.04x while area is increased by 1.89x. This is due to different pipeline operation latencies in the design. In order to optimize the FFT design, each operation in the computation should have equal latencies. The longest fixed latency is the floating point multiplier supplied by Xilinx. This operation requires 21 clock cycles. For our load sequence, loading data for four butterflies take 24 clock cycles. Using more or less than four butterflies would increase the discrepancy in latency between the load and the multiplier. Therefore, we chose to use four butterflies in our fixed butterfly bank. Using this design, Table 3 lists the latencies of varying FFT lengths on different FPGAs using a fixed butterfly bank of four. Since the same design is used for varying FFT lengths, the resource requirements are the same for all lengths. In order to increase throughput, the FFT must be further pipelined or a faster multiplier needs to be used. Other options that were explored included using parallel banks of butterflies within a single FFT CE shown in Figure 6 or using multiple copies of FFT CEs in parallel. Table 4 compares the latency and throughput for these two options for a 1024-point FFT. Throughput is calculated based on data sets/100,000cycles. Throughput increases faster for parallel FFT CEs since multiple FFTs on different data sets can be calculated concurrently, as opposed to reducing the number of iterations to calculate a single data set. Therefore, based on the results, it was determined that it was more beneficial to use multiple FFT CEs rather than using multiple fixed butterfly banks within an FFT CE. In comparison to other commercially available floating point FFTs, such as 4DSP [11] , our FFT has longer processing latencies. However, it was determined that the bottleneck in our system is the CMM and optimizing the FFT was not the main focus. Our objective was to optimize the complete floating point application.
Complex Matrix Multiply
For an optimal design, the pipeline operations in the CMM must be approximately the same. Similar to the FFT CE, the bottleneck was the Xilinx floating point multiplier. Therefore, each stage is designed to have a latency of approximately 21 clock cycles. For each matrix multiply iteration, a complex single precision number takes 2 clock cycles to load. Therefore, the design operates on eight data points for a load time of 16 clock cycles. A 16x16 multiply would result in significant area increase and a load stage of 32 clock cycles, which is longer than the multiplier operation. Therefore, the design consists of one stage of four multipliers, three adder stages, and a final accumulation stage. The latency to calculate a 256x256 multiply is 771 clock cycles, and the latency to compute the complete data set (256*256x256) is 172 115 clock cycles.
Phase Matrix Buffer
The phase matrix buffer (PMB) needs to be capable of supplying the CMM with a new set of 256 points every 512 clock cycles. One memory block takes 2048 clock cycles to load. Thus, eight memory blocks are used in the design. Four memory blocks are loaded in parallel while the other four are unloaded into the CMM. Therefore, it takes 2048 clock cycles to load four memory blocks in parallel and 2048 clock cycles to unload four memory blocks sequentially into the CMM. These results in balanced load and unload latencies.
Fourier Integral Operator
Using the described CEs, the latencies of each CE is measured for the BEE2 platform. These results include the calculated latencies to transmit the data between FPGAs over onboard LVC-MOS parallel trace connections [14] . Table 5 gives the latency results and resource requirements for each module in the system, running at a max frequency of 195MHz. Currently, the max frequency is limited by the MicroBlaze. Based on these results, it was determined that inter-FPGA communication is not a bottleneck in the system as the CE latencies are orders of magnitude greater than the data transfer latencies. However, external memory access is a bottleneck in some parts of the design particularly the CMM (CE4) and PMB (CE3). A more sophisticated DRAM scheduler will result in speedups in throughput and latency by increasing max frequency and decreasing load latencies.
In order to speedup throughput, multiple CEs are used in parallel. From the results in Table 5 , the CMM has the longest latency. Therefore, to significantly increase throughput, multiple CMMs are performed in parallel. Based on available resources on the BEE2 platform, 2-CE1s, 1-CE2, 2-CE3s, and 2-CE4s are used. 1 CMM requires 4 banks of memory and approximately 35,000 FFs. Since each FPGA on the BEE2 platform offers four banks of memory, two FPGAs are dedicated to perform the CMMs. The CEs from the block diagram in Figure 3 are divided over FPGAs as follows: Table 6 lists the total latency for each stage of the system based on this FPGA partitioning. The latencies of each stage include the inter-FPGA communication latencies. Throughput is given in data sets/100 seconds, and runtime in microseconds given the running max frequency of 195MHz. The total latency of the system is measured and compared with the equivalent vector optimized Matlab algorithm run in single precision floating point. Initial work on a C-code implementation of the algorithm was also generated. However, the original unoptimized version was approximately four times slower than the optimized Matlab code. As optimizing the C code would only result in up to an additional two times speedup over the optimized Matlab code, it was decided that focusing on the development of the hardware implementation would be more constructive, resulting in significantly more speed up. The software algorithm is run on a PC with a 2.13GHz Intel Core 2 Duo and 4GB of RAM. Table 7 shows the speedup for our prototype application specific architecture. Latency is measured in seconds, and throughput in data sets/100 seconds. 
Conclusions and Future Work
This paper presents the prototype for an applicationspecific architecture of an FIO kernel. As part of the design process, a detailed study on floating point CEs, particularly the FFT CE, was performed. This resulted in configurable CEs that are scalable for different FPGA devices. The system runs on the BEE2 platform achieving a throughput speedup of 12.4x over an optimized software implementation. The prototype was also scaled and resynthesized for the BEE3 platform, which has the newer Virtex 5 FPGAs, for a predicted throughput speedup of 15.8x. While our FFT is slower than the Xilinx block floating point FFT, this is not the bottleneck. External memory access and the use of MicroBlazes to manage memory access is the current performance bottleneck in the prototype implantation.
Future work includes implementing more sophisticated dedicated DRAM scheduling for memory access. This will reduce memory access times and therefore the latency of the FIO CEs. Furthermore, significant resource usage and latency is incurred due to the normalization and denormalization of floating point numbers. We will also investigate internal representations of floating point input data within the FIO kernel that reduce or limit these operations, such as block floating point, to further decrease latency. However, it will be necessary to verify that the required computational precision is maintained before these representations can be employed. 1 The individual CEs have been mapped onto the BEE2 platform and have been verified to run in hardware using the Xilinx XUPV2Pro development board [21] and the Xilinx ML505 development board [22] . However, there have been problems loading the final design onto the actual BEE2 platform. Total latency measurements found in Table 6 include a worst case latency measurement for inter-FPGA communication.
