Summary
Co-processors offer attractive acceleration opportunities for waveform-based imaging and inversion applications in challenging exploration and production environments. Unlike seismic forward modeling, seismic imaging and inversion involve large amounts of data, which can pose a significant challenge to scalable acceleration on both GPUs and dataflow engines (DFEs). For adjoint-state problems such as reverse-time migration (RTM), the source and receiver wavefields propagate forward and backward in time, respectively, and at any time-step both wavefields need to be available simultaneously. This requirement poses an architectural challenge: as the algorithm progresses backward over decrementing values of the time-step index n, the forward-propagated wavefield must either be dynamically recalculated via forward propagation from 0 to n (the computational approach), or instead be derived from a stored sequence of pre-calculated forward-propagation time-steps (the storage-driven approach). Traditional CPU-based implementations typically avoid the computational approach because of its O(N^2) cost in the number of time-steps N, and instead favor the storage-driven approach.
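The trade-off between the two approaches can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper: the 1D leapfrog propagator, the grid size, the coefficient `c2`, and all function names are assumptions chosen only to make the step counts visible. Both functions return the forward wavefield in reverse time order, together with the number of propagation steps performed and the number of snapshots retained.

```python
def step(prev, curr, c2=0.25):
    """Advance a toy 1D wavefield by one time-step (leapfrog, fixed ends).

    Hypothetical stand-in for a real propagation kernel.
    """
    nxt = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        lap = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + c2 * lap
    return curr, nxt


def initial_state(nx=16):
    """Unit spike in the middle of the grid, starting at rest (prev == curr)."""
    u0 = [0.0] * nx
    u0[nx // 2] = 1.0
    return list(u0), list(u0)


def reversed_by_storage(nsteps):
    """Storage-driven: one forward sweep, keep every snapshot, read in reverse."""
    prev, curr = initial_state()
    snapshots, work = [], 0
    for _ in range(nsteps):
        prev, curr = step(prev, curr)
        work += 1
        snapshots.append(curr)
    return snapshots[::-1], work, len(snapshots)


def reversed_by_recompute(nsteps):
    """Computational: for each n = nsteps..1, re-propagate from step 0 to n."""
    out, work = [], 0
    for n in range(nsteps, 0, -1):
        prev, curr = initial_state()
        for _ in range(n):
            prev, curr = step(prev, curr)
            work += 1
        out.append(curr)
    return out, work, 1  # only the wavefield currently in use is retained


if __name__ == "__main__":
    N = 20
    stored, work_s, kept_s = reversed_by_storage(N)
    recomputed, work_r, kept_r = reversed_by_recompute(N)
    print("identical wavefields:", stored == recomputed)
    print("storage-driven:", work_s, "steps,", kept_s, "snapshots kept")
    print("computational: ", work_r, "steps,", kept_r, "snapshot kept")
```

In this sketch the storage-driven variant performs N propagation steps but retains N snapshots, whereas the computational variant retains a single wavefield and performs N(N+1)/2 steps, which is the O(N^2) cost noted above.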
In this paper, we evaluate several RTM algorithms and explore their limitations on GPU and DFE architectures as the level of DFE parallelization is increased. The large data volume required by the storage-based approach to anisotropic RTM causes it to become I/O-bound beyond a certain level of computational parallelization. First, the modeling computation for a storage-based (I/O-bound) approach (Figure 1a) is offloaded onto Maxeler's DFE-based accelerator cards. The DFE implementation is optimized to maximize data throughput to and from the host and to minimize latency. We observe a 4x overall speedup, whereas our DFE implementation of forward modeling is capable of a 20-50x speedup (vs. a 2.6 GHz 8-core Intel Xeon). Because the storage-based approach is I/O-bound, this overall speedup cannot be increased by further DFE parallelization, nor by simply optimizing the computation. We develop an alternative, computational (compute-bound) approach (Figure 1b) 
