A reconfigurable computing (RC) platform called SPACE (Seismic data Processing Accelerator with reConfigurable Engine) is proposed to accelerate 3D prestack Kirchhoff time migration (PSTM) based on the optimized 6th-order NMO correction technique. Our simulations demonstrate up to a 22X performance improvement on the proposed RC platform compared with a pure software solution running on a reference P4 2.4 GHz Linux workstation. The RC platform is also compatible with prevailing PC-cluster systems and can achieve a much better price-performance ratio along with much lower power consumption.
Introduction
Prestack Kirchhoff time migration (PSTM) is the most popular migration technique in the seismic data processing industry because of its simplicity, efficiency, feasibility and target-oriented nature (Bevc, 1997). However, practical PSTM tasks for large 3D surveys are still computationally intensive and cannot be run routinely except at institutes that can afford the high cost of running and maintaining supercomputers or large PC-cluster systems.
In this paper, we introduce an FPGA-based (Field Programmable Gate Array) reconfigurable computing engine to speed up the time-consuming 3D PSTM algorithm. Rather than a stand-alone system, the RC platform is designed as a complementary reconfigurable hardware resource attached to an ordinary computer. It has the potential to offer a breakthrough in computational performance and efficiency while retaining most of the flexibility of a software solution. At the core of the RC platform, the FPGA chip consists of numerous island-like reconfigurable computing resources such as DSP slices, RAM blocks, and logic cells (LCs). These hardware components can be interconnected freely using abundant on-chip programmable routing resources and configured at run time into whatever function modules are needed. This property makes RC very attractive for applications that demand unusual arithmetic units not available in conventional CPUs. Fine-grained parallelism is another important property of RC: different function modules can operate on their own data sets concurrently, or they can work together in a pipelined manner to maximize computational throughput. RC's potential to achieve much higher sustained floating-point performance than general-purpose CPUs has been demonstrated in many high-performance computing (HPC) applications such as finite-difference seismic modeling (He et al., 2005), computational electromagnetics (Durbano et al., 2004), and molecular dynamics (Azizi et al., 2004).
In spite of its remarkable computational potential, RC tends to be inefficient for certain types of operations such as branch control and variable-length loops. We therefore propose it here as a coprocessor platform coupled with a general-purpose computer, so that software execution is accelerated by mapping the most computationally intensive instructions onto the FPGA. For the PSTM application in particular, we designed a dedicated computing engine to calculate the down-up travel time based on the optimized 6th-order NMO correction technique. The related kernel code in the PSTM software is adjusted carefully to maximize the share of its execution on the RC platform without losing accuracy. Because over 90 percent of CPU time is consumed by billions of iterations of this short kernel code, our approach achieves over an order of magnitude speedup for this application, allowing a satisfactory image to be produced in a much shorter turnaround time.
Optimized 6th order NMO Correction for PSTM
Figure 1 shows the relationship between the source, receiver and scatter point in the PSTM algorithm. The algorithm assumes that the energy of a sampled point on an input trace is the superposition of the reflections from all the underground scatter points that have the same travel time. So for points on an input trace with known source and receiver coordinates, the energy should be spread to all possible scatter points according to the travel times, which are calculated using the root-mean-square (RMS) velocities along with the ray-bending method. To recover all the underground scatter points, the energy of each input trace must be distributed to all possible scatter points correctly, and the energy from different input traces is then accumulated to form the migrated image.
[...] for 3D cases). To make matters worse, square root is one of the slowest floating-point arithmetic operations (50 to 100 times slower than addition or multiplication) on almost all general-purpose CPUs. The two square roots in Equation (1) therefore impose a severe performance bottleneck on the PSTM algorithm.
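To make the origin of the two square roots concrete, here is a minimal sketch of the textbook straight-ray double-square-root travel time. It is an assumption used for illustration, not the optimized 6th-order expression of Equation (1); the function name and parameter names are hypothetical.

```python
import math


def dsr_traveltime(t0, v_rms, dist_src, dist_rcv):
    """Straight-ray double-square-root travel time in seconds.

    Textbook form, used here only as an illustration (assumption):
    t0        two-way vertical time at the scatter point (s)
    v_rms     RMS velocity at t0 (m/s)
    dist_src  horizontal source-to-scatter-point distance (m)
    dist_rcv  horizontal receiver-to-scatter-point distance (m)
    """
    half_t0_sq = (0.5 * t0) ** 2
    # One square root for the downgoing leg, one for the upgoing leg.
    t_down = math.sqrt(half_t0_sq + (dist_src / v_rms) ** 2)
    t_up = math.sqrt(half_t0_sq + (dist_rcv / v_rms) ** 2)
    return t_down + t_up
```

At zero offset the function returns t0 itself, and `dsr_traveltime(2.0, 2000.0, 1500.0, 1500.0)` gives 2.5 s. Both square roots are evaluated once per input trace, output trace and output sample, which is why they dominate the run time.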
Hardware architecture of the SPACE platform
There are two popular ways to introduce RC resources into computer platforms (Figure 2): as add-ons attached to a peripheral interface such as PCI or VME (www.xilinx.com), or as a coprocessor unit connected directly to the system bus (www.cray.com/products/xd1). Generally speaking, the more tightly the RC is coupled with the CPU, the greater the acceleration it can achieve, owing to lower management overhead and wider communication bandwidth. Here we adopt the former approach for better software compatibility.
Figure 2. Coupling between RC and PC
The proposed SPACE platform consists of three basic components: an FPGA chip, onboard memory modules, and a PCI interface (Figure 3). The FPGA is a current-generation Xilinx Virtex-II Pro series chip with the capacity to hold hundreds of conventional floating-point arithmetic units. In our design, four complex computing engines tailored for 6th-order down-up travel-time calculations are embedded into this FPGA chip, occupying only 60% of its on-chip reconfigurable hardware resources. There are two kinds of onboard memory modules: two static RAM chips acting as a high-speed, large-volume data cache, and four off-the-shelf DDR-SDRAM modules that can easily be expanded to several gigabytes. These memory modules are connected directly to the FPGA chip, which uses part of its programmable logic resources to build the corresponding memory controller circuits. Besides these external memories, the FPGA has abundant fast on-chip RAM blocks and can explicitly manage their use and interconnection. By exploiting the dataflow nature of our application, these internal resources can be used effectively as a data cache that buffers and delivers a constant data flow to the computing engines through programmable internal routing paths. Finally, a PCI bridge chip provides a fast interface between the coprocessor board and the host computer, through which a continuous data transfer rate of over 100 MB/s can be achieved.
The proposed RC platform speeds up the most time-consuming kernel part of the PSTM algorithm and leaves other operations such as process management, I/O management, and network communication to its host computer. Figure 4 shows the structure of one Computing Engine (CE) inside the FPGA, which evaluates the travel-time function (Equation 1) and executes the summation operations. As mentioned above, SPACE integrates abundant memory resources so that all the output traces allotted to the host computer can be stored on the board.
Up to four similar CEs can be integrated into this platform. Furthermore, multiple SPACE boards can be attached to a single computer to increase the computing power dramatically.
The actual implementation of the CE is more complex than that shown in Figure 4. Every arithmetic unit must be carefully designed to maximize its data throughput. The layout of the units inside the FPGA affects the data-flow paths, which in turn affects the final execution speed. Several data buffers are added to sustain the computing speed. A detailed description of the design and the simulation results can be found in He et al. (2004).
Figure 5 shows the program flow of the modified PSTM kernel code. The part of the program shown in bold was migrated onto the SPACE platform. The PSTM program running on the host computer is almost the same as the pure software version, except that it invokes the RC resources as a subroutine, which keeps the RC platform compatible with prevailing PC-cluster systems. Input traces and pre-computed parameters are transmitted to SPACE through its PCI interface. When the computing engine has finished all the migration calculations between one input trace and all local output traces, a signal is sent back to trigger the transmission of the next input trace. After all input traces have been processed, the final output is read back from SPACE for display or further processing.
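The host-side flow described above can be sketched in a few lines. This is a minimal software stand-in, not the authors' code: the class `MockSpaceBoard`, its methods and `run_pstm` are hypothetical names, and the kernel assumes a constant RMS velocity, the textbook straight-ray double-square-root travel time and nearest-sample lookup in place of the optimized 6th-order expression and the real PCI transfers.

```python
import math


class MockSpaceBoard:
    """Software stand-in for the coprocessor board (hypothetical interface)."""

    def __init__(self, out_positions, n_samples, dt, v_rms):
        self.out_positions = out_positions  # (x, y) of each local output trace
        self.n_samples = n_samples          # samples per output trace
        self.dt = dt                        # sample interval (s)
        self.v_rms = v_rms                  # constant velocity (assumption)
        # Output traces stay resident on the board until read back.
        self.image = [[0.0] * n_samples for _ in out_positions]

    def migrate_trace(self, samples, src, rcv):
        """Spread one input trace over all local output traces."""
        for k, pos in enumerate(self.out_positions):
            ds = math.dist(src, pos)  # source to scatter point
            dr = math.dist(rcv, pos)  # receiver to scatter point
            for i in range(self.n_samples):
                half_t0_sq = (0.5 * i * self.dt) ** 2
                t = (math.sqrt(half_t0_sq + (ds / self.v_rms) ** 2)
                     + math.sqrt(half_t0_sq + (dr / self.v_rms) ** 2))
                j = int(round(t / self.dt))  # nearest input sample
                if j < len(samples):
                    self.image[k][i] += samples[j]  # summation step
        return True  # models the completion signal sent to the host

    def read_output(self):
        return self.image


def run_pstm(board, traces):
    """Host loop: send a trace, wait for completion, then send the next."""
    for samples, src, rcv in traces:
        if not board.migrate_trace(samples, src, rcv):
            raise RuntimeError("board did not signal completion")
    return board.read_output()
```

In this sketch the host never touches the output traces until the end, mirroring the text: only input traces cross the interface during the loop, and the accumulated image is read back once.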
Performance Comparison
A real 3D field data volume is used to compare the performance of the proposed SPACE platform with that of a reference Intel P4 2.4 GHz workstation. Figure 6 shows a migrated image of a vertical output section created by SPACE. This image is identical to the output of the pure software version running on the reference workstation. The proposed coprocessor accelerates only the elementary computation (EC) part of the program and leaves all other operations to its host machine, so the new total execution time is the unchanged time of those other operations plus the reduced time of the EC part.
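In generic Amdahl-style notation, this relation can be written as follows. The symbols are assumptions rather than the authors' own: $T_{\mathrm{EC}}$ is the software time spent in the EC part, $T_{\mathrm{other}}$ is the time of everything left on the host, $S_{\mathrm{EC}}$ is the speedup of the accelerated part, and data-transfer overhead is neglected.

```latex
T_{\mathrm{new}} = T_{\mathrm{other}} + \frac{T_{\mathrm{EC}}}{S_{\mathrm{EC}}},
\qquad
S_{\mathrm{overall}}
  = \frac{T_{\mathrm{other}} + T_{\mathrm{EC}}}{T_{\mathrm{new}}}
  = \frac{1}{(1 - f) + f / S_{\mathrm{EC}}},
\qquad
f = \frac{T_{\mathrm{EC}}}{T_{\mathrm{other}} + T_{\mathrm{EC}}}.
```

As $S_{\mathrm{EC}}$ grows, $S_{\mathrm{overall}}$ approaches the ceiling $1/(1-f)$ set by the unaccelerated host work.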
According to Amdahl's Law (Patterson et al., 1996), the overall speedup is limited by the fraction of the execution time that is not accelerated. Table 1 shows the performance comparison for the designated task between the reference Intel workstation and the proposed SPACE platform in different configurations. The first observation is that a single Computing Engine (CE) performs the elementary computation 15.6 times faster than the reference Intel workstation while using less than 20% of the hardware resources. This impressive result comes from the ultra-long pipelined structure of the CE. The second observation is that the speedup of the kernel code increases linearly with the number of CEs inside the FPGA, whereas the overall speedup increases only logarithmically. This is partly caused by the small migration task used here. The hardware resources available inside an FPGA chip restrict the number of CEs that can be integrated in it. On the other hand, further performance improvement can be gained easily by integrating more CEs into higher-density FPGA chips in the future.
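The gap between kernel speedup and overall speedup can be illustrated numerically. In this sketch only the single-CE kernel speedup of 15.6 comes from the text; the kernel fraction `KERNEL_FRACTION` is an assumed value chosen for illustration, so the printed numbers are not the Table 1 results.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's Law: whole-program speedup when only a fraction is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


SINGLE_CE_SPEEDUP = 15.6  # kernel speedup of one CE, as stated in the text
KERNEL_FRACTION = 0.95    # ASSUMED share of run time spent in the kernel

for n_ce in range(1, 5):
    # Kernel speedup scales linearly with the number of CEs.
    s_kernel = SINGLE_CE_SPEEDUP * n_ce
    s_total = overall_speedup(KERNEL_FRACTION, s_kernel)
    print(f"{n_ce} CE: kernel {s_kernel:.1f}x, overall {s_total:.2f}x")
```

Whatever the number of CEs, the overall figure stays below 1 / (1 - KERNEL_FRACTION), which is why a small task with relatively more host-side work shows diminishing returns.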
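Table 1. Performance comparison between the reference workstation and the SPACE platform in different configurations.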

Conclusions
A novel reconfigurable computing platform is proposed to accelerate the 3D PSTM algorithm. The simulation results show an impressive speedup compared with a reference Intel P4 2.4 GHz workstation. The proposed SPACE platform can easily be attached to existing PC workstations or cluster systems, boosting their performance by more than ten times with modest software modification. The main motivation for introducing reconfigurable computing technology to the seismic data processing industry is its immense computational potential combined with acceptable flexibility. Moreover, its run-time reconfigurability allows the same hardware resources to be reconfigured for the different algorithms used in different processing stages. Our future work in this field will concentrate on extending the use of the RC platform to Kirchhoff depth migration and reverse-time migration methods. As mentioned above, finite-difference seismic modeling has been implemented successfully on an RC platform with over 10X performance improvement (He et al., 2005). This result can readily be applied to wave-equation-based reverse-time migration to make it a practical procedure.
