High-order finite difference schemes are always considered appropriate for seismic modeling applications because of their excellent numerical properties in dispersion, dissipation, and anisotropy. However, these algorithms are too computationally intensive to be used routinely, especially for 3D acoustic and elastic modeling cases. In this paper, an inexpensive Field Programmable Gate Array (FPGA) based reconfigurable computing (RC) platform is proposed to accelerate the execution of ordinary secondorder and high-order FD schemes. Our early experiments demonstrated 4 X ~ 19 X performance improvements for different FD schemes comparing with their conventional pure software solutions running on a 3.0GHz referential Linux workstation. This RC platform is also consistent with the prevailing PC-Cluster systems and can achieve much better price-performance ratio along with much lower power consumption.
Introduction
Simulating seismic wave propagation numerically in 2D or 3D space with Finite Difference (FD) algorithms, although extremely computationally-intensive, are capable of dealing with more complex geological models than asymptotic raytracing methods. In the past decade, seismic acoustic and elastic modeling efforts have grown rapidly with performance improvements in commodity computers and parallel computing environments. Pure software methods, from high-level parallelism on PC-Cluster system (House et al, 2000) to low-level memory and disk optimization (Moczo et al, 1999) or even instruction-reordering (Larsen et al, 1998) are exhausted to accelerate the execution of these simulation tasks. However, they are still timeconsuming procedures and cannot be used routinely except in institutes that can afford the high cost of running and maintaining supercomputers or large PC-cluster systems.
Working as complimentary reconfigurable hardware coprocessor attached to an ordinary computer platform, the FPGA-based RC platform offers a great breakthrough in computational performance and efficiency while retaining most of the flexibility of a software solution. Initially proposed in 1990s, it has been widely used in digital signal processing (DSP) field and achieved over an order of speedup for most typical applications. Recently, with the emergence of high capacity FPGAs, people start noticing its promising superiority for high-performance computing (HPC) applications. Highlighting researches on this track include seismic Kirchhoff migration (He et al, 2004) , computational electromagnetics (Durbano et al, 2004) , molecular dynamics (Azizi et al, 2004) , to name a few.
In this work, we described our effort to accelerate the execution of FD numerical algorithms on RC platform for seismic modeling applications. Comparing with pure software solutions running on general-purpose computers, our early evaluations using an entry-level Xilinx ML401 Virtex-4 evaluation platform demonstrates considerable performance improvements, which reveals the broad application perspectives of RC platform in seismic data processing industry.
Reconfigurable Computing: Pros and Cons
Floating-Point (FP) performance, which is often measured in terms of Megaflops or Gigaflops, is a key factor to measure the execution speed of a computer system for high-performance scientific computing. Specifically, seismic modeling applications are data-intensive as well as computationally demanding so that high sustained FP performance is always preferable. However, our numerical algorithms generally exhibit low FP operation to memory access ratio, require considerable memory space for intermediate results, and tend to perform irregular indirect addressing. These intrinsic properties inevitably result in poor caching behavior due to the complex memory hierarchy of modern computers. Consequently, a significant gap always exists between the theoretical peak Megaflops value of a modern CPU and its actual sustained performance. Even with careful hand optimization, most microprocessors could achieve less than 50% of its peak performance and some as low as 5% for realistic applications (Underwood and Hemmert, 2004) .
FPGA-based RC platform has the potential to achieve much higher sustained floating-point performance than generic CPUs for memory bandwidth sensitive applications. An FPGA chip consists of numerous island-like reconfigurable computing resources such as DSP slices, RAM blocks, and Logic Cells (LC). These hardware components can be interconnected willingly using abundant on-chip programmable routing resources and configured into various function modules according to need at runtime. This property makes RC very attractive for applications demanding unordinary arithmetic units that are not available in conventional CPUs. Fine-grain parallelism is another important property of RC: Different function modules can manipulate their own data sets concurrently or they can work together in a pipelined manner to maximize computational throughput. Although a single function module may not provide competitive performance as a modern CPU, (For example, a typical FPGA floating-point multiplication unit can work at a sustained speed of 200Megaflops.) FPGA has the potential to assemble hundreds of similar or different arithmetic units into a huge computing engine and customize the interconnection pattern to match the computational requirements of applications. The aggregate performance of this powerful computing engine could be far more significant than what a generic CPU provides. (Underwood, 2004) External memory bandwidth is always a performance bottleneck for data-intensive applications. Specifically, the constant rate at which input data or parameters are feed into the computing engine determines its sustained working speed. RC platform can always provide wider external memory bandwidth than generic computers. For example, up to eight 64-bit DDR memory channels can be implemented on RC platform comparing with two of them in PC. Moreover, FPGA has the ability to explicitly manage the utilization and interconnection of its RAM blocks. By exploiting the dataflow nature of applications, these internal resources can be utilized effectively as data cache to buffer and delivery a constant data flow to the computing engine through programmable internal routing paths. Consequently, the number of external memory references is reduced due to better caching behavior and the memory bandwidth bottleneck is greatly alleviated.
Figure 1. Coupling RC with generic CPU
In spite of its remarkable computational potential, RC tends to be inefficient for certain types of operations such as branch control and variable-length loops. So a nature choice is to couple it with a generic CPU so that the execution of software algorithms can be accelerated by mapping the most computationally-intensive instructions into FPGA. There are two popular ways to introduce RC resources into computer platforms: as an add-ons attached to peripheral interface such as PCI or VME (www.xilinx.com), or as a coprocessor unit connected directly to system bus (www.cray.com/products/xd1). Generally speaking, the tighter the RC is coupled with CPU in the system, the more significant accelerations it can achieve due to lower administration overhead and wider Communication bandwidth.
High-order FD algorithm on RC platform
We first consider the simplest acoustic wave equation,
where P is pressure, v is acoustic velocity, and Equation (1) is usually discretized on unstaggered grids, where those second-order spatial differential operators can be easily approximated by central FD stencils. Because there are only one wave field variable and one media parameter, taking into consider the 2 nd -order two-step timemarching scheme, averagely we need 4 memory accesses (3 read and 1 write) to update the pressure value at each grid point. The number of FP operations for every grid point is 18 for 3D (2, 2) scheme or 30 for (2, 4) scheme.
Figure 2. Function Blocks of the 2D (2, 4) FD scheme
Our first attempt is to implant the software version of FD schemes directly into FPGA-based RC platform. To make the expression clearly, we depict in Figure 2 the block diagram of our hardware implementation for the (2, 4) FD scheme in 2D space. Extending this design to 3D space is straightforward. This computing engine consists of 20 floating-point arithmetic units and works in a pipelined manner to achieve a sustained high computational throughput at up to 200MHz. Another FP multiplier was employed for damping boundary condition. Media velocities and known pressure values (nine at present time step and one at previous time step) are feed into the computing engine through its 11 input ports. In order to minimize external memory accesses, a specific data cache subsystem was designed using 16 on-chip RAM blocks and was configured in cascaded FIFO structure based on data dependency properties of the FD algorithm. Driving by a global system clock at 200MHz, the computing engine can always accept a new set of inputs and produce an updated pressure at each clock cycle. So this (2, 4) FD computing engine can deliver a sustained FP performance for 4.8 Gigaflops, which is 4 times faster than a 3.0GHz Intel P4 Linux workstation. (The referential C code was compiled using INTEL C++ compiler v8.1with optimization for speed. 20% of the CPU's peak performance was reached.) Detailed introduction to the design and simulation results can be found in (He et al, FCCM2005) .
Because the clock frequency applied to the FD computing engine and external memory are within the same range at hundreds of million Hz, the bandwidth of onboard memories would be saturated rapidly with considerable RC resources inside FPGA being wasted. An in-depth analysis reveals that the FP operation to memory access ratio is about 5 for the (2, 4) FD scheme, which is high enough for modern CPUs but still too low for RC platform. The simplest way to increase this ratio is to adopt high-order FD schemes. By choosing the accuracy order of FD schemes, we can always make the computations as complex as necessary to alter the performance bottleneck back to the computing engine. Moreover, high-order FD schemes allow larger sampling intervals so that the number of spatial grid points is considerably reduced, and consequently, memory bandwidth requirements for the same problem decrease in an indirect way. In other words, we can always find a point, at which the utilization of the RC resources and onboard memory bandwidth are well balanced.
We successfully implemented the 2D (2, 8) and (2, 16) FD schemes on the same prototyping platform. The number of FP arithmetic units inside FD computing engines now increase to 31 for (2, 8) scheme or 56 for (2, 16) scheme, which change the FP operation to memory access ratio to 8 or 14. Still driving by the 200MHz system clock, these computing engines delivered 6.2 or 11.2 Gigaflops sustained FP performance, which are 8 or 19 times faster than the pure software solution on the referential P4 workstation. The most exciting observation of our high-order implementation is that the aggregate FP performance was increasing to keep the nearly constant grid pressure updating rate. Comparatively, the sustained FP performances for high-order FD schemes are reduced significantly due to poor L1 cache behavior. The only cost we pay for high-order schemes are conventional FP addition or multiplication arithmetic units together with onchip memory blocks, which are all abundant inside an upto-date high density FPGA chip. This result encourages us adopting extraordinarily higher-order FD schemes in our design to further enlarge the spatial discretization interval until reaching the extreme of two samples per wavelength, which is bounded by the Nyquist sampling theorem. We can easily extend our design to 4 th -order time integration schemes based on the modified wave equation approach as, Extending our design to 3D space is also straightforward but the hardware implementation will become less efficient than 2D cases. Now, we need much larger FIFO structures to buffer 2D pages instead of 1D grid lines as before. One choice is to sacrifice some onboard memory bandwidth to meet our buffering requirements (He et al, FCCM2005) . The computing engine will not cause us any trouble. For example, in the case of the SEG/EAGE acoustic 3D modeling effort, the (2, 10) FD scheme needs only 53 floating-point operations (Larsen and Grieger, 1998) .
We now extend our analysis to elastic modeling problems. The 3D linear elastic wave equation can be presented as 9 first-order PDEs in Cartesian coordinates (Virieux, 1986) ,
where λ and µ are Lame parameters, ρ is density, (3) is always considered beneficial because all first-order finite difference stencils are naturally centered around relevant unknowns spatially and temporally, which leads to less numerical dispersion and efficient 2 nd -order in-place time-marching scheme. There are totally 12 floating-point values at each discrete grid point: 3 media parameters, 3 particle velocities, and 6 independent stress tensor components. At the beginning of each time-marching step (which is divided into two half steps), all of these 12 FP values are read out from external memory space and after FD computations, 9 updated wave field variables should be written back. As for the computational cost, (2, 2) staggered FD scheme requires more than 90 floating-point operations at each grid node and the number for the staggered (2, 4) scheme is about 140 (Larsen and Grieger, 1998) . This number partly explains why only low-order FD schemes could be adopted for elastic wave simulations.
There are no trouble to design a large computing engine with hundreds of FP arithmetic units, although it may reach the capacity extreme that the largest FPGA could support. An alternative is to divide this large computing engine into two smaller one for equation (3a) and equation (3b, c) respectively. Thanks to FPGA's run-time reconfigurability, the host CPU can reconfigure the RC resources in seconds so that those two smaller computing engine can work alternately, each answers for half time-marching step.
Conclusions
In this paper, we proposed an FPGA-based reconfigurable computing platform to accelerate finite difference numerical simulations for acoustic and elastic wave propagation problems. Our early experiments demonstrated significant performance improvements comparing with the conventional software solution running on PC platform. To the best of our knowledge, this is the first attempt to solve a practical seismic modeling problem using RC platform. The main motivation of introducing reconfigurable computing technology to seismic data processing industry is its immense computational potential along with acceptable flexibility and its run-time reconfigurability, which allows the same hardware resources be reconfigured for different algorithms used in different processing stages. Future work in this field will concentrate on improving our design for 3D cases and extending it to more general forms of wave equations.
